Train a Model
The Train Model form handles many of the complicated aspects of machine learning, such as feature identification and removal (known as "pruning"), data optimization (such as binning continuous values), and more. You can also use the Train Model form to build specific types of models or to control more precisely how a model is trained.
By default, DataChat automatically explores multiple models and selects the optimal one for you. If the initial model has a low score, DataChat will continue to evaluate additional models across various types. Alternatively, you have the flexibility to choose and configure specific models to suit your particular requirements.
Train a Model
To create a model, click Machine Learning > Train Model in the skill menu, then:
If you're connected to a BigQuery database, you can leverage BigQuery ML within DataChat.
Feature Selection (required)
- Choose your target column.
- Optionally, include or exclude specific feature columns from the model. By default, if no columns are selected, all other columns will be used as feature columns.
- Optionally, toggle Generate charts to Visualize the Data to control whether charts are generated for the trained model. By default, this toggle is on.
- Optionally, toggle Use Default Model Exploration. By default, Catboost and LightGBM models are explored. If neither of these models performs well, Random Forest and logistic/linear regression models are explored.
- Click Submit or continue to Advanced Options.
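The default exploration order described above can be pictured as a two-stage search. The following is a minimal sketch, not DataChat's implementation: the function `explore_models`, the `evaluate` callback, and the `min_score` cutoff are hypothetical, and the text does not state how DataChat decides that a model performs well.

```python
from typing import Callable, Dict, List, Tuple

# Model families named in the text. The two-stage grouping follows the
# description above; everything else here is a hypothetical stand-in.
PRIMARY_MODELS = ["Catboost", "LightGBM"]
FALLBACK_MODELS = ["Random Forest", "Logistic/Linear Regression"]


def explore_models(
    evaluate: Callable[[str], float],
    min_score: float = 0.7,  # assumed cutoff for "performs well"
) -> Tuple[str, float, List[str]]:
    """Return (best_model, best_score, models_tried).

    `evaluate` is a caller-supplied function that trains one model family
    and returns a score where higher is better.
    """
    scores: Dict[str, float] = {name: evaluate(name) for name in PRIMARY_MODELS}

    # Widen the search only when no primary model reaches the cutoff.
    if max(scores.values()) < min_score:
        for name in FALLBACK_MODELS:
            scores[name] = evaluate(name)

    best = max(scores, key=lambda name: scores[name])
    return best, scores[best], list(scores)
```

Passing a lookup of made-up scores as `evaluate` is enough to see the behavior: the fallback models are only evaluated when both primary scores fall below `min_score`.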
Advanced Options (optional)
- Depending on the target column type, different toggle options appear:
  - For continuous target columns, toggle Treat the Target Column as Categorical to change the problem type to classification.
  - For categorical target columns, toggle Treat the Target Column as Continuous to change the problem type to regression. This toggle is only enabled for numerical columns.
- Click Fix Class Imbalances to open a multi-toggle dropdown that includes:
  - Oversampling. Toggle "On" to oversample underrepresented values in the training dataset.
  - Automatic Label Weighting. Toggle "On" to assign relative importances to target column values.
  - Custom Label Weighting. Toggle "On" to provide custom weights to target column values.
- Select the metrics to use when scoring the model. Regression models have the following scoring options:
  - Mean Absolute Error.
  - Mean Absolute Percentage Error.
  - Mean Squared Error.
  - R2.
  - Root Mean Squared Error.

  Classification models have the following scoring options:
  - AUC.
  - Accuracy.
  - F1.
  - Log Loss.
  - Precision.
  - Recall.
- Select the metric to use when selecting the best model.
- Enter the holdout percentage to use for holdout validation.
- Feature Engineering Optimizations includes the following checkboxes:
  - Autobinning (tree-based models). Check this box to automatically bin continuous feature columns.
  - Feature Pruning. Check this box to automatically prune unimportant feature columns.
  - Temporal Slicing. Check this box to automatically slice temporal feature columns.
- Click Submit or continue to Model Selection.
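To make the Fix Class Imbalances options more concrete, here is a minimal sketch of two common techniques: inverse-frequency label weights and random oversampling. The text does not say which methods DataChat uses internally, so the formula and the function names `balanced_label_weights` and `random_oversample` are assumptions for illustration only.

```python
import random
from collections import Counter
from typing import Any, Dict, Hashable, List, Sequence, Tuple


def balanced_label_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Rare labels get weights above 1, common labels get weights below 1.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def random_oversample(
    rows: Sequence[Any], labels: Sequence[Hashable], seed: int = 0
) -> Tuple[List[Any], List[Hashable]]:
    """Duplicate randomly chosen rows until every label is as frequent as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [row for row, y in zip(rows, labels) if y == label]
        extra = rng.choices(pool, k=target - count)  # empty when already balanced
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels
```

Custom Label Weighting corresponds to supplying a dictionary like the one returned by `balanced_label_weights` by hand instead of deriving it from label counts.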
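The scoring options listed in Advanced Options follow standard definitions. The sketch below computes most of them in plain Python so the relationships between them are visible. It is illustrative only: the function names are hypothetical, MAPE is returned as a fraction rather than a percentage, the classification part assumes a binary target, and AUC and Log Loss are omitted because they need predicted probabilities rather than predicted labels.

```python
import math
from typing import Dict, Hashable, Sequence


def regression_scores(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """MAE, MAPE, MSE, RMSE, and R2 for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    # Assumes y_true is not constant, otherwise R2 is undefined.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        # Assumes no true value is 0, otherwise MAPE is undefined.
        "mape": sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "r2": 1.0 - (mse * n) / ss_tot,
    }


def classification_scores(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable], positive: Hashable = 1
) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for a binary target."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Note that error metrics (MAE, MAPE, MSE, RMSE) improve as they decrease, while R2 and the classification metrics shown here improve as they increase, which matters when one of them is chosen as the metric for selecting the best model.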
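Holdout validation sets aside the given percentage of rows so the model can be scored on data it was not trained on. A minimal sketch of such a split follows. The function name `holdout_split` and the random, unstratified sampling are assumptions; the text does not describe how DataChat chooses the holdout rows.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def holdout_split(
    rows: Sequence[T], holdout_percent: float, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Split rows into (train, holdout), reserving holdout_percent of them for scoring."""
    if not 0 < holdout_percent < 100:
        raise ValueError("holdout_percent must be between 0 and 100 (exclusive)")

    # Shuffle the row indices so the holdout set is a random sample.
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_holdout = round(len(rows) * holdout_percent / 100)
    holdout_idx = set(indices[:n_holdout])

    train = [row for i, row in enumerate(rows) if i not in holdout_idx]
    holdout = [row for i, row in enumerate(rows) if i in holdout_idx]
    return train, holdout
```

For example, a holdout percentage of 20 on 50 rows leaves 40 rows for training and 10 for scoring.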
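Binning replaces each continuous value with the index of the interval it falls into. The sketch below uses equal-width intervals, which is one simple strategy among several. The text does not state which strategy Autobinning applies, so both the approach and the function name `equal_width_bins` are assumptions.

```python
from typing import List, Sequence


def equal_width_bins(values: Sequence[float], n_bins: int = 4) -> List[int]:
    """Map each value to a bin index in [0, n_bins - 1] using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has nothing to separate, so use a single bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

With values 0 through 8 and four bins, each bin covers a range of width 2, and the maximum value 8 is placed in the last bin alongside 6 and 7.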
Model Selection (optional)
Model Selection is only available if "Use Default Model Exploration" is toggled "Off".
- Select the classifier or regressor models to use in training.
- Select the hyperparameter types and values for each selected model. Optionally, you can also select the Tune All checkbox to turn on hyperparameter tuning for all hyperparameters with default ranges and steps:
Catboost | Decision Tree | KNN | LightGBM | Linear Regression | Linear SVC | Linear SVR | Logistic Regression | Random Forest | Support Vector | XGBoost | |
---|---|---|---|---|---|---|---|---|---|---|---|
Bagging Fraction | ✔️ | ||||||||||
Bagging Frequency | ✔️ | ||||||||||
Degree | ✔️ | ||||||||||
Depth | ✔️ | ||||||||||
Iterations | ✔️ | ||||||||||
Fit Intercept | ✔️ | ✔️ | |||||||||
Kernel | ✔️ | ||||||||||
Learning Rate | ✔️ | ✔️ | |||||||||
L2 Regularization | ✔️ | ||||||||||
Max Depth | ✔️ | ✔️ | ✔️ | ✔️ | |||||||
Min Data in Leaf | ✔️ | ||||||||||
Min Samples Leaf | ✔️ | ✔️ | |||||||||
Number of Estimators | ✔️ | ✔️ | |||||||||
Number of Iterations | ✔️ | ||||||||||
Number of Leaves | ✔️ | ||||||||||
Number of Neighbors | ✔️ | ||||||||||
Penalty | ✔️ | ✔️ | |||||||||
Random Strength | ✔️ | ||||||||||
Solver | ✔️ |
If you selected "Tuned" for any of the hyperparameter types:
- Enter the maximum time allowed (in minutes) for training the selected models.
- Enter the number of tuning iterations to find the best hyperparameter combinations.
- Click Submit.
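The two tuning limits above, a maximum training time and a number of tuning iterations, can be read as two independent stopping conditions on a hyperparameter search. The sketch below illustrates this with a plain random search. Random search is used here only because it is simple; the text does not say which search strategy DataChat applies, and `random_search`, `search_space`, and `score` are hypothetical names.

```python
import random
import time
from typing import Any, Callable, Dict, Optional, Sequence, Tuple


def random_search(
    search_space: Dict[str, Sequence[Any]],
    score: Callable[[Dict[str, Any]], float],
    n_iterations: int,
    max_minutes: float,
    seed: int = 0,
) -> Tuple[Optional[Dict[str, Any]], float]:
    """Try up to n_iterations random combinations, stopping early at the time limit.

    `score` trains and evaluates one combination and returns a value where
    higher is better. Returns (best_params, best_score).
    """
    rng = random.Random(seed)
    deadline = time.monotonic() + max_minutes * 60
    best_params: Optional[Dict[str, Any]] = None
    best_score = float("-inf")

    for _ in range(n_iterations):
        if time.monotonic() >= deadline:
            break  # time budget exhausted before all iterations ran
        params = {name: rng.choice(list(values)) for name, values in search_space.items()}
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current

    return best_params, best_score
```

A search space for this sketch might look like `{"max_depth": [2, 4, 6, 8], "learning_rate": [0.01, 0.1, 0.3]}`. Whichever limit is reached first ends the search, so a generous iteration count has no effect if the time limit is short.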
BigQuery ML
For information on BigQuery ML permissions, refer to BigQuery Permissions.
If your dataset came from a BigQuery connection, you can optionally toggle the Enable BigQuery ML option under Feature Selection, which is enabled by default for BigQuery datasets. When enabled, DataChat leverages BigQuery ML to train models.
By default, DataChat explores BigQuery Logistic/Linear Regression and BigQuery XGBoost models. Optionally, you can select the BigQuery classifier or regressor models to use in training and specify their hyperparameter types and values under Model Selection:
BigQuery Deep Neural Network | BigQuery Linear Regression | BigQuery Logistic Regression | BigQuery Random Forest | BigQuery Wide and Deep Network | BigQuery XGBoost | |
---|---|---|---|---|---|---|
Activation Function | ✔️ | ✔️ | ||||
Batch Size | ✔️ | ✔️ | ||||
Learning Rate | ✔️ | |||||
L1 Regularization | ✔️ | ✔️ | ||||
L2 Regularization | ✔️ |
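For readers unfamiliar with BigQuery ML, models are trained there with a `CREATE MODEL` SQL statement whose `OPTIONS` clause carries the model type, the label column, and any hyperparameters. The sketch below builds such a statement as a string. It is an illustration of the BigQuery ML syntax only: whether DataChat issues statements of this shape is an assumption, and the function name, dataset, table, and column names are hypothetical.

```python
from typing import Dict, Optional, Union


def build_create_model_sql(
    model_name: str,
    source_table: str,
    target_column: str,
    model_type: str = "LOGISTIC_REG",
    options: Optional[Dict[str, Union[int, float]]] = None,
) -> str:
    """Build a BigQuery ML CREATE MODEL statement for a numeric-option model.

    No escaping is performed, so this sketch is only suitable for trusted,
    hand-written identifiers.
    """
    parts = [
        f"model_type = '{model_type}'",
        f"input_label_cols = ['{target_column}']",
    ]
    # Numeric hyperparameters such as l2_reg are appended as-is.
    for key, value in (options or {}).items():
        parts.append(f"{key} = {value}")

    return (
        f"CREATE OR REPLACE MODEL `{model_name}`\n"
        f"OPTIONS ({', '.join(parts)})\n"
        f"AS SELECT * FROM `{source_table}`"
    )
```

For example, `build_create_model_sql("my_dataset.churn_model", "my_dataset.customers", "churned", options={"l2_reg": 0.1})` produces a logistic regression statement with L2 regularization set to 0.1.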