# Train

`Train` lets you train specific types of machine learning models. Compared to `Analyze`, `Train` is more specific and advanced, and does not include automated features such as automatic binning and feature trimming.

### Format

Each available model has its own utterance.

#### CatBoost Classifier/Regressor

`Train a CatBoost (classifier | regressor) model (picking as features the columns | all columns except) <feature columns> and label as the column <label column> (with max depth <depth> | number of estimators <estimators> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)`

`Train a CatBoost (classifier | regressor) model reusing pipeline for the model <model> (with max depth <depth> | number of estimators <estimators> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)`

#### Decision Tree Classifier/Regressor

`Train a decision tree (classifier | regressor) ((picking as features the columns | all columns except) <feature columns> | reusing pipeline for the model <model>) and label as the column <label column> (with criterion <criterion> | max depth <depth> | number of estimators <estimators> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)`

`Train a decision tree (classifier | regressor) reusing pipeline for the model <model> (with max depth <depth> | max features <features> | number of estimators <estimators> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)`
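The product's training backend is not documented here, but the `criterion`, `max depth`, and holdout options in these utterances map naturally onto a scikit-learn-style decision tree API. The following is an illustrative sketch under that assumption, not a description of the actual implementation:

```python
# Illustrative only: assumes a scikit-learn-style backend for the
# "Train a decision tree classifier ..." utterance.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# "test holdout percentage 20" -> hold out 20% of rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

# "with criterion gini | max depth 3"
model = DecisionTreeClassifier(criterion="gini", max_depth=3)
model.fit(X_train, y_train)

# "scoring metrics accuracy" -> evaluate on the held-out rows
accuracy = model.score(X_test, y_test)
```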

#### Gradient Boosting Regressor

`Train a gradient boosting regressor ((picking as features the columns | all columns except) <feature columns> | reusing pipeline for the model <model>) and label as the column <label column> (with loss function <function> | max depth <depth> | number of estimators <estimators> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)`

`Train a gradient boosting regressor reusing pipeline for the model <model> (with loss function <function> | max depth <depth> | number of estimators <estimators> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)`
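As a rough illustration of the `loss function`, `max depth`, and `number of estimators` options, here is how they would look in scikit-learn's gradient boosting regressor (an assumption; the product's actual backend is not specified):

```python
# Illustrative sketch of "with loss function huber | max depth 2 |
# number of estimators 100" using scikit-learn as a stand-in backend.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(0, 0.1, size=200)  # noisy quadratic

model = GradientBoostingRegressor(loss="huber", max_depth=2, n_estimators=100)
model.fit(X, y)
r2 = model.score(X, y)  # R^2 on the training data
```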

#### HDBSCAN

`Train an HDBSCAN model (picking as features the columns | all columns except) <feature columns> (with scoring metrics <metrics> | minimum cluster size <cluster size>)`

`Train an HDBSCAN model reusing pipeline for the model <model> (with scoring metrics <metrics> | minimum cluster size <cluster size>)`

#### Hierarchical

`Train a hierarchical model ((picking as features the columns | all columns except) <feature columns> | reusing pipeline for the model <model>) (with number of clusters <num clusters>)`

`Train a hierarchical model reusing pipeline for the model <model> (with number of clusters <num clusters>)`

#### KMeans

`Train a KMeans model (picking as features the columns | all columns except) <feature columns> (with number of clusters <num clusters>)`

`Train a KMeans model reusing pipeline for the model <model> (with number of clusters <num clusters>)`

#### KNN Classifier/Regressor

`Train a KNN (classifier | regressor) (picking as features the columns | all columns except) <feature columns> and label as the column <label column> (with number of neighbors <num neighbors> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)`

`Train a KNN (classifier | regressor) reusing pipeline for the model <model> (with number of neighbors <num neighbors> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)`
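Here the key option is `number of neighbors`. A sketch under the assumption of a scikit-learn-style backend, combining it with a stratified holdout as the utterance allows:

```python
# Illustrative: "with number of neighbors 5 | test holdout percentage 25 |
# test split method stratified" under an assumed scikit-learn backend.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```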

#### LightGBM Classifier/Regressor

`Train a LightGBM (classifier model | regressor) (picking as features the columns | all columns except) <feature columns> and label as the column <label column> (with max depth <depth> | number of estimators <estimators> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)`

`Train a LightGBM (classifier model | regressor) reusing pipeline for the model <model> (with max depth <depth> | number of estimators <estimators> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)`

#### Linear Regression

`Train a linear regression model (picking as features the columns | all columns except) <feature columns> and label as the column <label column> (with scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>) (turning off intercept fitting)`

`Train a linear regression model reusing pipeline for the model <model> (with scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>) (turning off intercept fitting)`
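The `turning off intercept fitting` clause plausibly maps to something like scikit-learn's `fit_intercept=False` (an assumption; the backend is not documented). With it turned off, the fitted line is forced through the origin:

```python
# Illustrative: "turning off intercept fitting" as fit_intercept=False.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 2.0 * X.ravel()  # data that passes exactly through the origin

model = LinearRegression(fit_intercept=False)
model.fit(X, y)
slope = model.coef_[0]       # recovered slope, ~2.0
intercept = model.intercept_  # forced to zero
```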

#### Logistic Regression

`Train a logistic regression model (picking as features the columns | all columns except) <feature columns> and label as the column <label column> (with penalty <penalty> | scoring metrics <metrics> | solver <solver> | test holdout percentage <percentage> | test split method <method>)`

`Train a logistic regression model reusing pipeline for the model <model> (with penalty <penalty> | scoring metrics <metrics> | solver <solver> | test holdout percentage <percentage> | test split method <method>)`
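The `penalty` and `solver` options interact: in a typical implementation not every solver supports every penalty (for example, scikit-learn's lbfgs solver supports only l2). A sketch assuming a scikit-learn-style backend:

```python
# Illustrative: "with penalty l1 | solver liblinear" under an assumed
# scikit-learn backend; liblinear is one of the solvers that supports l1.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

model = LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000)
model.fit(X, y)
accuracy = model.score(X, y)
```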

#### Naive Bayes Classifier

`Train a naive Bayes classifier (picking as features the columns | all columns except) <feature columns> and label as the column <label column> (with scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)`

`Train a naive Bayes classifier reusing pipeline for the model <model> (with scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)`

#### Random Forest Classifier/Regressor

`Train a random forest (classifier | regressor model) (picking as features the columns | all columns except) <feature columns> and label as the column <label column> (with criterion <criterion> | max depth <depth> | max features <features> | number of estimators <estimators> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)`

`Train a random forest (classifier | regressor model) reusing the pipeline for the model <model> (with criterion <criterion> | max depth <depth> | max features <features> | number of estimators <estimators> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)`

#### SVM Classifier/Regressor

`Train an svm (classifier | regressor) model (picking as features the columns | all columns except) <feature columns> and label as the column <label column> (with <penalty> | kernel <kernel> | loss <loss> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method> | <optimization on>)`

`Train an svm (classifier | regressor) model reusing the pipeline for the model <model> (with <penalty> | kernel <kernel> | loss <loss> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method> | <optimization on>)`
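The `loss` option for SVM models typically distinguishes hinge from squared hinge loss, where squared hinge penalizes large margin violations more heavily. An illustrative sketch using scikit-learn's linear SVM (an assumption about the backend):

```python
# Illustrative: "Train an svm classifier ... with loss squared_hinge"
# under an assumed scikit-learn-style backend.
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

model = LinearSVC(loss="squared_hinge", C=1.0, max_iter=10000)
model.fit(X, y)
accuracy = model.score(X, y)
```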

#### XGBoost Classifier/Regressor

`Train an XGBoost (classifier | regressor) (picking as features the columns | all columns except) <feature columns> and label as the column <label column> (with max depth <depth> | number of estimators <estimators> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)`

`Train an XGBoost (classifier | regressor) reusing the pipeline for the model <model> (with max depth <depth> | number of estimators <estimators> | scoring metrics <metrics> | test holdout percentage <percentage> | test split method <method>)`

### Parameters

`Train` uses the following parameters:

`feature columns` (required). The columns to include or exclude as the features used to train the model.

`model` (required). The model whose pipeline to reuse.

`label column` (required). The column whose values you want to predict.

`depth` (optional). The maximum depth of the trees used in the model.

`estimators` (optional). For CatBoost models, this is the number of estimators that should be used when training the model. An estimator is a formula that helps to pick the best model.

`metrics` (optional). The scoring metrics to be used in the model. The options include:

- accuracy.
- auc.
- DBCV. HDBSCAN models only.
- f1.
- f1_unbiased.
- f1_unbiased_classwise.
- f1_unbiased_highest_weight.
- log_loss.
- precision.
- recall.
- silhouette_score. HDBSCAN models only.

`percentage` (optional). The percentage of the dataset to hold out from the training dataset for testing.

`method` (optional). The method by which to separate the testing data from the training data. The options include:

- random. The testing data is separated from the training data by randomly sampling the original dataset.
- stratified. The relative class distributions of the target column are preserved in both the testing and training datasets.
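The difference between the two split methods can be sketched with scikit-learn's `train_test_split` (illustrative only; the product's actual split implementation is not documented here):

```python
# A stratified split preserves the class ratio of the label column.
import numpy as np
from sklearn.model_selection import train_test_split

# An imbalanced label column: 90 rows of class 0, 10 rows of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

# "test holdout percentage 20 | test split method stratified"
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The 20-row test set keeps the 90/10 ratio: exactly 2 minority rows
minority_in_test = int((y_te == 1).sum())
```

A plain random split would not guarantee this ratio, which matters when the label classes are imbalanced.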

`criterion` (optional). For decision tree models, this is the criterion used to find the best split between features. The options include:

- all.
- entropy.
- gini.

`features` (optional). For decision tree models, this is the maximum number of features to consider for each split.

`function` (optional). For gradient boosting regressor models, this is the loss function to use. The options include:

- huber.
- lad.
- ls.
- quantile.

`cluster size` (optional). For HDBSCAN models, this is the minimum size of each cluster.

`num clusters` (optional). For hierarchical and KMeans models, this is the number of clusters to use in the model.

`num neighbors` (optional). For KNN classifier and regressor models, this is the number of nearest neighbors to consider.

`penalty` (optional). For SVM classifier, SVM regressor, and logistic regression models, this is the technique used to regularize the model. The options include:

- l1.
- l2.

`kernel` (optional). For SVM classifier and regressor models, the kernel is used to expand the input dataset into a higher-dimensional space for the model to work on. The options include:

- linear.
- poly.
- rbf.
- sigmoid.

`loss` (optional). For SVM classifier and regressor models, the loss function allows the model to place the different classes as far apart as possible. A squared hinge loss formulation penalizes larger losses more severely than the hinge loss formulation.

`optimization on` (optional). For SVM classifier and regressor models, this enables hyperparameter optimization.

`solver` (optional). For logistic regression models, this is the solver to use when training the model. The options include:

- lbfgs.
- liblinear.
- newton-cg.
- sag.
- saga.

`turning off intercept fitting` (optional). For linear regression models, this parameter prevents the model from fitting a non-zero intercept. By default, the intercept is fit and is not guaranteed to be zero.

`batch` (optional). For neural networks, this is the size of each batch used in the network.

`epochs` (optional). For neural networks, this is the number of epochs used in the network.

### Output

If the model is successfully trained, the chat history shows a success message which includes details and statistics about the model. Otherwise, the chat history shows a failure message.

The model is saved according to the model type:

- Classification Model: Classifier#
- Cluster Model: ClusterModel#
- Regression Model: Regressor#

where # indicates an incrementing count within a session. For example, the third regression model created in a session is named Regressor3.

The count increments even if a model is removed with `Forget`. For example, if Regressor3 is removed, the next regression model will be named Regressor4, even though Regressor3 is no longer available.

Enter `List all the models` to view the current session's models that can be used with `Predict`.

### Examples

Consider a dataset containing information about the passengers aboard the Titanic. It has the following columns:

- PassengerID.
- Survived.
- Pclass.
- Name.
- Gender.
- Age.
- Ticket.
- Cabin.
To train a CatBoost classifier on this dataset to predict whether a passenger would survive the disaster, enter `Train a CatBoost classifier model picking as features all columns except PassengerID, Ticket, Cabin, Survived and label as the column Survived`.
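For intuition, this is roughly what that utterance asks for. The sketch below uses pandas with a scikit-learn gradient boosting classifier as a stand-in for CatBoost, on a tiny made-up slice of the dataset; the actual backend, data, and preprocessing are assumptions:

```python
# Illustrative stand-in for the Titanic utterance (not the real backend).
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# A tiny, made-up sample with Gender already numerically encoded
df = pd.DataFrame({
    "PassengerID": [1, 2, 3, 4, 5, 6, 7, 8],
    "Survived":    [0, 1, 1, 0, 0, 1, 0, 1],
    "Pclass":      [3, 1, 3, 1, 3, 2, 3, 1],
    "Gender":      [0, 1, 1, 0, 0, 1, 0, 1],  # 0 = male, 1 = female
    "Age":         [22, 38, 26, 35, 35, 27, 54, 4],
})

# "picking as features all columns except PassengerID, ... Survived"
features = df.drop(columns=["PassengerID", "Survived"])
label = df["Survived"]  # "label as the column Survived"

model = GradientBoostingClassifier(n_estimators=50, max_depth=2)
model.fit(features, label)
accuracy = model.score(features, label)
```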

### Going Deeper

After you run `Train`, you can view the residual statistics of your trained model by entering `Plot residuals for the model <model name>`. This generates a scatter chart of residuals and returns residual statistics in the chat history.
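For reference, a residual is simply the observed label value minus the model's prediction. A minimal sketch of the statistics behind such a plot, assuming an ordinary least squares fit (the product's exact computation is not documented):

```python
# Residuals = observed - predicted. For an ordinary least squares fit
# with an intercept, the residuals sum to zero by construction.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

a, b = np.polyfit(x, y, deg=1)   # fit y ~ a*x + b by least squares
residuals = y - (a * x + b)

mean_residual = residuals.mean()  # ~0 for OLS with an intercept
std_residual = residuals.std()
```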