Version: 0.22.2

Train

Train lets you train specific types of machine learning models. Compared to Analyze, Train is more specific and advanced, and it does not include automated features such as automatic binning and feature trimming.

Format

Each available model has its own utterance.

Catboost Classifier/Regressor

  • Train a CatBoost (classifier | regressor) model (picking as features the columns | all columns except) <feature columns> and label as the column <label column> (with max depth <depth> | number of estimators <estimators> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)
  • Train a CatBoost (classifier | regressor) model reusing pipeline for the model <model> (with max depth <depth> | number of estimators <estimators> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)
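
For example, assuming a dataset with columns named Age, Fare, and Survived (illustrative names, not part of the format), the first form might be used as:

```
Train a CatBoost classifier model picking as features the columns Age, Fare and label as the column Survived with max depth 6
```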

Decision Tree Classifier/Regressor

  • Train a decision tree (classifier | regressor) (picking as features the columns | all columns except) <feature columns> and label as the column <label column> (with criterion <criterion> | max depth <depth> | number of estimators <estimators> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)
  • Train a decision tree (classifier | regressor) reusing pipeline for the model <model> (with criterion <criterion> | max depth <depth> | max features <features> | number of estimators <estimators> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)
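
As an illustration (the column names here are hypothetical), a decision tree classifier could be trained with a specific split criterion:

```
Train a decision tree classifier picking as features the columns Age, Fare and label as the column Survived with criterion gini
```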

Gradient Boosting Regressor

  • Train a gradient boosting regressor (picking as features the columns | all columns except) <feature columns> and label as the column <label column> (with loss function <function> | max depth <depth> | number of estimators <estimators> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)
  • Train a gradient boosting regressor reusing pipeline for the model <model> (with loss function <function> | max depth <depth> | number of estimators <estimators> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)

HDBSCAN

  • Train a HDBSCAN model (picking as features the columns | all columns except) <feature columns> (with scoring metrics <metrics> | minimum cluster size <cluster size>)
  • Train a HDBSCAN model reusing pipeline for the model <model> (with scoring metrics <metrics> | minimum cluster size <cluster size>)
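
For instance, to cluster on two hypothetical numeric columns while controlling the minimum cluster size:

```
Train a HDBSCAN model picking as features the columns Age, Fare with minimum cluster size 10
```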

Hierarchical

  • Train a hierarchical model (picking as features the columns | all columns except) <feature columns> (with number of clusters <num clusters>)
  • Train a hierarchical model reusing pipeline for the model <model> (with number of clusters <num clusters>)

KMeans

  • Train a KMeans model (picking as features the columns | all columns except) <feature columns> (with number of clusters <num clusters>)
  • Train a KMeans model reusing pipeline for the model <model> (with number of clusters <num clusters>)
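
A minimal illustration, using hypothetical column names:

```
Train a KMeans model picking as features the columns Age, Fare with number of clusters 3
```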

KNN Classifier/Regressor

  • Train a KNN (classifier | regressor) (picking as features the columns | all columns except) <feature columns> and label as the column <label column> (with number of neighbors <num neighbors> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)
  • Train a KNN (classifier | regressor) reusing pipeline for the model <model> (with number of neighbors <num neighbors> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)
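
For example (the column names are hypothetical), the number of neighbors can be set directly in the utterance:

```
Train a KNN classifier picking as features the columns Age, Fare and label as the column Survived with number of neighbors 5
```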

LightGBM Classifier/Regressor

  • Train a LightGBM (classifier | regressor) model (picking as features the columns | all columns except) <feature columns> and label as the column <label column> (with max depth <depth> | number of estimators <estimators> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)
  • Train a LightGBM (classifier | regressor) model reusing pipeline for the model <model> (with max depth <depth> | number of estimators <estimators> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)

Linear Regression

  • Train a Linear regression model (picking as features the columns | all columns except) <feature columns> and label as the column <label column> (with scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>) (turning off intercept fitting)
  • Train a Linear regression model reusing pipeline for the model <model> (with scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>) (turning off intercept fitting)
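
For example, to fit a model through the origin (the column names here are illustrative):

```
Train a Linear regression model picking as features the columns Age, Fare and label as the column Pclass turning off intercept fitting
```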

Logistic Regression

  • Train a logistic regression model (picking as features the columns | all columns except) <feature columns> and label as the column <label column> (with penalty <penalty> | scoring metrics <metrics> | solver <solver> | test holdout percentage <percentage> | test split method <method>)
  • Train a logistic regression model reusing pipeline for the model <model> (with penalty <penalty> | scoring metrics <metrics> | solver <solver> | test holdout percentage <percentage> | test split method <method>)
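
An illustrative use of the solver option (column names are hypothetical):

```
Train a logistic regression model picking as features the columns Age, Fare and label as the column Survived with solver liblinear
```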

Naive Bayes Classifier

  • Train a naive Bayes classifier (picking as features the columns | all columns except) <feature columns> and label as the column <label column> (with scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)
  • Train a naive Bayes classifier reusing pipeline for the model <model> (with scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)

Random Forest Classifier/Regressor

  • Train a random forest (classifier | regressor) model (picking as features the columns | all columns except) <feature columns> and label as the column <label column> (with criterion <criterion> | max depth <depth> | max features <features> | number of estimators <estimators> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)
  • Train a random forest (classifier | regressor) model reusing the pipeline for the model <model> (with criterion <criterion> | max depth <depth> | max features <features> | number of estimators <estimators> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)

SVM Classifier/Regressor

  • Train an svm (classifier | regressor) model (picking as features the columns | all columns except) <feature columns> and label as the column <label column> (with penalty <penalty> | kernel <kernel> | loss <loss> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method> | optimization on)
  • Train an svm (classifier | regressor) model reusing the pipeline for the model <model> (with penalty <penalty> | kernel <kernel> | loss <loss> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method> | optimization on)
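
For example, reusing the pipeline of an existing model (Classifier1 here stands in for whatever model name your session has produced) while changing the kernel:

```
Train an svm classifier model reusing the pipeline for the model Classifier1 with kernel rbf
```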

XGBoost Classifier/Regressor

  • Train a XGBoost (classifier | regressor) (picking as features the columns | all columns except) <feature columns> and label as the column <label column> (with max depth <depth> | number of estimators <estimators> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)
  • Train a XGBoost (classifier | regressor) reusing the pipeline for the model <model> (with max depth <depth> | number of estimators <estimators> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)

Parameters

Train uses the following parameters:

  • feature columns (required unless reusing a pipeline). The columns to include or exclude as the features used to train the model.
  • model (required when reusing a pipeline). The model whose pipeline to reuse.
  • label column (required for classification and regression models). The column whose values you want to predict.
  • depth (optional). The maximum depth of the trees used in the model.
  • estimators (optional). For tree-ensemble models such as CatBoost, gradient boosting, LightGBM, random forest, and XGBoost, this is the number of estimators (individual trees) built when training the model.
  • metrics (optional). The scoring metrics to be used in the model. The options include:
    • accuracy.
    • auc.
    • DBCV. HDBSCAN models only.
    • f1.
    • f1_unbiased.
    • f1_unbiased_classwise.
    • f1_unbiased_highest_weight.
    • log_loss.
    • precision.
    • recall.
    • silhouette_score. HDBSCAN models only.
  • percentage (optional). The percentage of the dataset to hold out from the training dataset for testing.
  • method (optional). The method by which to separate the testing data from the training data. The options include:
    • random. The testing data is separated from the training data by randomly sampling the original dataset.
    • stratified. The relative class distributions of the target column are preserved in both the testing and training datasets.
  • criterion (optional). For decision tree models, this is the criterion used to find the best split between features. The options include:
    • all.
    • entropy.
    • gini.
  • features (optional). For decision tree models, this is the maximum number of features to consider for each split.
  • function (optional). For gradient boosting regressor models, this is the loss function to optimize. The options include:
    • huber.
    • lad.
    • ls.
    • quantile.
  • cluster size (optional). For HDBSCAN models, this is the minimum number of points required to form a cluster.
  • num clusters (optional). For hierarchical and KMeans models, this is the number of clusters to use in the model.
  • num neighbors (optional). For KNN classifier and regressor models, this is the number of nearest neighbors to consider.
  • penalty (optional). For SVM classifier, SVM regressor, and logistic regression models, this is the technique used to regularize the model. The options include:
    • l1.
    • l2.
  • kernel (optional). For SVM classifier and regressor models, the kernel is used to expand the input dataset into a higher-dimensional space for the model to work on. The options include:
    • linear
    • poly
    • rbf
    • sigmoid
  • loss (optional). For SVM classifier and regressor models, this is the loss function used when training the model. A squared hinge loss formulation penalizes larger losses more severely than the hinge loss formulation.
  • optimization on (optional). For SVM classifier and regressor models, this enables hyperparameter optimization.
  • solver (optional). For logistic regression models, this is the solver to use when training the model. The options include:
    • lbfgs.
    • liblinear.
    • newton-cg.
    • sag.
    • saga.
  • turning off intercept fitting (optional). For linear regression models, this flag forces the model to be fit with a zero intercept. By default, an intercept is fitted, so the model is not guaranteed to pass through zero.
  • batch (optional). For neural networks, this is the size of each batch used in the network.
  • epochs (optional). For neural networks, this is the number of epochs used in the network.

Output

If the model is successfully trained, the chat history shows a success message which includes details and statistics about the model. Otherwise, the chat history shows a failure message.

The model is saved according to the model type:

  • Classification Model: Classifier#
  • Cluster Model: ClusterModel#
  • Regression Model: Regressor#

where # indicates an incrementing count within a session. For example, the third regression model created in a session is named Regressor3.

note

The count increments even if a model is removed with Forget. For example, if Regressor3 is removed, the next regression model will be named Regressor4, incrementing the count, even though Regressor3 is no longer available.

Enter List all the models to view the current session's models that can be used with Predict.

Examples

Consider a dataset containing information about the passengers aboard the Titanic. It has the following columns:

  • PassengerID.
  • Survived.
  • Pclass.
  • Name.
  • Gender.
  • Age.
  • Ticket.
  • Cabin.

To train a CatBoost classifier on this dataset to predict whether a passenger would survive the disaster, enter Train a CatBoost classifier model picking as features all columns except PassengerID, Ticket, Cabin, Survived and label as the column Survived.

Going Deeper

After you run Train, you can view the residual statistics of your trained model by entering Plot residuals for the model <model name>. This generates a scatter chart of the residuals and returns residual statistics in the chat history.