Create a Model
Before you begin, make sure you've read our Best Practices for machine learning.
Introduction
The Train ML Model form handles many of the complicated aspects of machine learning, such as feature identification and removal (known as "pruning"), data optimization (such as binning continuous values), and more. The form can also build specific types of models and provide more granular control over how the model is trained.
By default, DataChat explores several gradient-boosted models and chooses the best one for you. If the initial model has a low score, DataChat will explore additional models of different types, such as tree and linear models. If your training dataset is small, you can choose to allow DataChat to explore additional model types that might work better with smaller datasets.
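Conceptually, the exploration resembles the sketch below (a simplified, hypothetical illustration, not DataChat's actual implementation; the AUC metric, 0.7 score threshold, and candidate lists are assumptions):

```python
# Hypothetical sketch of score-driven model exploration; the threshold
# and candidate lists are illustrative assumptions, not DataChat internals.
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def explore_models(X, y, threshold=0.7):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    gradient_boosted = [CatBoostClassifier(verbose=0), LGBMClassifier()]
    fallbacks = [RandomForestClassifier(), LogisticRegression(max_iter=1000)]
    best_model, best_score = None, -1.0
    for candidates in (gradient_boosted, fallbacks):
        for model in candidates:
            model.fit(X_tr, y_tr)
            score = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
            if score > best_score:
                best_model, best_score = model, score
        if best_score >= threshold:  # good enough: skip the fallback round
            break
    return best_model, best_score
```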
Build a Model
To create a model, click ML > Train Model in the sidebar, then:
Feature Selection (required)
- Choose your target column.
- Optionally, include or exclude specific feature columns from the model. By default, if no columns are selected, all other columns will be used as feature columns.
- Optionally, toggle Generate charts to Visualize the Data to control whether charts are generated based on the created model. By default, this toggle is on.
- Optionally, toggle Use Default Model Exploration. By default, Catboost and LightGBM models are explored. If neither of these models performs well, Random Forest and logistic/linear regression models are explored.
- Click Submit or continue to Advanced Options.
Advanced Options (optional)
Depending on the target column type, different toggle options appear:
- For continuous target columns:
- Toggle Treat the Target Column as Categorical to change the problem type to classification.
- Optionally, enter the number of bins to use for the target column. Note that this option is only available if Treat the Target Column as Categorical is toggled "Off".
- For categorical target columns:
- Toggle Treat the Target Column as Continuous to change the problem type to regression. Note that this toggle is only enabled for numerical columns.
Click Fix Class Imbalances to open a multi-toggle dropdown that includes the following options (a sketch of these techniques follows the list):
- Oversampling. Toggle "On" to oversample underrepresented values in the training dataset.
- Automatic Label Weighting. Toggle "On" to assign relative importances to target column values.
- Custom Label Weighting. Toggle "On" to provide custom weights to target column values.
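As a rough illustration of what these three options do (a minimal sketch using pandas and scikit-learn; it does not reflect DataChat's internal implementation):

```python
# Minimal sketch of the three imbalance-handling ideas; illustrative only.
import pandas as pd
from sklearn.linear_model import LogisticRegression

def oversample(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Randomly duplicate rows of minority classes until all classes match the largest."""
    largest = df[target].value_counts().max()
    groups = [g.sample(largest, replace=True, random_state=0)
              for _, g in df.groupby(target)]
    return pd.concat(groups).reset_index(drop=True)

# Automatic label weighting: weight classes inversely to their frequency.
auto_weighted = LogisticRegression(class_weight="balanced")

# Custom label weighting: supply explicit relative importances per class.
custom_weighted = LogisticRegression(class_weight={0: 1.0, 1: 5.0})
```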
Select the metrics to use when scoring the model. Regression models have the following scoring options:
- Mean Absolute Error.
- Mean Absolute Percentage Error.
- Mean Squared Error.
- R2.
- Root Mean Squared Error.
Classification models have the following scoring options:
- AUC.
- Accuracy.
- F1.
- Log Loss.
- Precision.
- Recall.
Select the metric to use when selecting the best model.
Enter the holdout percentage to use for holdout validation.
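For intuition, holdout validation and metric-based model selection look roughly like the following sketch (the 20% holdout, candidate models, and R2 selection metric are assumptions for illustration):

```python
# Sketch of holdout validation and selecting a best model by one metric.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=0)

# Hold out 20% of the data for validation (the "holdout percentage").
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.20, random_state=0)

results = []
for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    model.fit(X_train, y_train)
    pred = model.predict(X_hold)
    # Several metrics can be scored, but one is chosen to rank the models.
    results.append((r2_score(y_hold, pred), mean_absolute_error(y_hold, pred), model))

best_r2, best_mae, best_model = max(results, key=lambda r: r[0])  # select by R2
```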
Feature Engineering Optimizations includes the following checkboxes (a sketch of these optimizations follows the list):
- Autobinning (tree-based models). Check this box to automatically bin continuous feature columns.
- Feature Pruning. Check this box to automatically prune unimportant feature columns.
- Temporal Slicing. Check this box to automatically slice temporal feature columns.
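As a rough sketch of what these three optimizations do (illustrative only, using pandas and scikit-learn with made-up column names such as age and order_date; DataChat's actual pipeline may differ):

```python
# Illustrative sketches of autobinning, temporal slicing, and feature pruning.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.uniform(18, 80, 200),
    "income": rng.normal(50_000, 15_000, 200),
    "order_date": pd.to_datetime("2024-01-01")
                  + pd.to_timedelta(rng.integers(0, 365, 200), unit="D"),
    "churned": rng.integers(0, 2, 200),
})

# Autobinning: discretize a continuous feature into quantile bins.
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
df["age_binned"] = binner.fit_transform(df[["age"]]).ravel()

# Temporal slicing: expand a datetime column into component features.
df["order_month"] = df["order_date"].dt.month
df["order_weekday"] = df["order_date"].dt.dayofweek

# Feature pruning: drop features whose learned importance is negligible.
features = ["age_binned", "income", "order_month", "order_weekday"]
model = RandomForestClassifier(random_state=0).fit(df[features], df["churned"])
kept = [f for f, imp in zip(features, model.feature_importances_) if imp >= 0.01]
```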
Click Submit or continue to Model Selection.
Model Selection is only available if "Use Default Model Exploration" is toggled "Off".
Model Selection (optional)
Select the classifier or regressor models to use in training.
Select the hyperparameter types and values for each selected model:
| Hyperparameter | Applies to |
| --- | --- |
| Bagging Fraction | LightGBM |
| Bagging Frequency | LightGBM |
| Degree | Support Vector |
| Depth | Catboost |
| Iterations | Catboost |
| Fit Intercept | Linear Regression, Linear SVR |
| Kernel | Support Vector |
| Learning Rate | Gradient Boosting, XGBoost |
| L2 Leaf Regularization | Catboost |
| Max Depth | Decision Tree, Gradient Boosting, LightGBM, Random Forest, XGBoost |
| Min Data in Leaf | LightGBM |
| Min Samples Leaf | Decision Tree, Random Forest |
| Number of Estimators | Gradient Boosting, Random Forest, XGBoost |
| Number of Iterations | LightGBM |
| Number of Leaves | LightGBM |
| Number of Neighbors | KNN |
| Penalty | Linear SVC, Logistic Regression |
| Random Strength | Catboost |
| Solver | Logistic Regression |
If you selected "Tuned" for any of the hyperparameter types:
Enter the number of tuning iterations to find the best hyperparameter combinations.
Enter the number of cross-validation folds to split your data into.
Optionally, you can also select the Tune All checkbox to turn on hyperparameter tuning for all hyperparameters with default ranges and steps.
Click Submit.
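Tuned hyperparameters of this kind are commonly found with randomized search over cross-validation folds, roughly as in this sketch (the parameter ranges, 20 iterations, and 5 folds are illustrative assumptions, not DataChat's defaults):

```python
# Sketch of tuned hyperparameters via randomized search with cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "max_depth": range(2, 16),            # ranges to sample from
        "n_estimators": range(50, 400, 50),
        "min_samples_leaf": range(1, 10),
    },
    n_iter=20,   # "number of tuning iterations"
    cv=5,        # "number of cross-validation folds"
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)
best = search.best_estimator_
```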
Outputs
When Train ML Model is finished, it provides a number of important results that can be found in the tabbed output.
Impact
DataChat selects a winning model, which is given the name “BestFit1”. Subsequent models created by Train ML Model are named “BestFit2”, “BestFit3”, and so on. This is the model you can save, publish, and use to create predictions.
DataChat also produces an impact chart, which illustrates the average impact each input feature has on the target feature. The features are sorted from most-impactful to least-impactful.
To obtain the impact scores for each feature as displayed on the bar chart, we first calculate the Shapley values for each feature for each sample in the training dataset. From these "local Shapley scores", we take the absolute values and then average them for each feature over all the individual data samples. These values are then displayed in the bar chart as the feature-wise model impact scores. For more information on Shapley values, see this article.
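In code, that aggregation looks roughly like the following sketch, using the open-source shap package with a placeholder model and dataset:

```python
# Sketch: mean-absolute-SHAP impact scores, mirroring the description above.
import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=6, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
local_scores = explainer.shap_values(X)     # one Shapley value per feature per sample
impact = np.abs(local_scores).mean(axis=0)  # mean |SHAP| per feature = impact score
order = np.argsort(impact)[::-1]            # most-impactful to least-impactful
```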
Visualize
Train ML Model also provides several visualizations that can help illustrate the trends identified by the model. These charts can help you understand your data, show trends for impactful features, and help catch errors in your data that could be affecting your models. Use the dropdown next to Visualize to view different generated charts.
Models
The Models section of the output contains the following:
- ModelType. The model that was used in training.
- A set of scores generated for all the trained models.
- ModelReport. If the candidate model used has a model report, you can click the link to explore the candidate model's scores and pipeline report.
Scores
Different model scores are used for different model types, by default:
- Regression models are given an R2 score, which can be interpreted as the percentage of the data that fits the model's trend. Whether the score is good or bad is highly dependent on the model's use case.
- Classification models use a few different scoring methods, but AUC is the primary one. If label weighting was used or detailed model scores were requested, AUC is replaced (or complemented) with an F1 score. F1 is the harmonic mean of the model's "precision" (# of true positives / (# of true positives + # of false positives)) and its "recall" (# of true positives / (# of true positives + # of false negatives)).
Scores will depend on which scoring metrics are selected in the Train ML Model form.
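As a quick worked example of the precision, recall, and F1 formulas (the counts are hypothetical):

```python
# Worked example of precision, recall, and F1 from hypothetical counts.
tp, fp, fn = 80, 20, 10                             # true positives, false positives, false negatives

precision = tp / (tp + fp)                          # 80 / 100 = 0.80
recall = tp / (tp + fn)                             # 80 / 90  ≈ 0.889
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean ≈ 0.842
```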
Residual or Confusion Matrix Plots
Residual Plots
For regression models, a residual plot can help you investigate whether your model is a strong one. A residual value is the difference between the actual value and the predicted value of a given data point. A residual plot places all of your residual values around the horizontal axis, which represents the model’s line of best fit.
If your residual values follow a normal distribution and are centered around the zero residual line, then your dataset is well-suited for linear regression. If not, such as in the example below, your dataset might not be a good fit for linear regression models.
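To build the same kind of plot yourself, a minimal matplotlib sketch with placeholder data might look like this:

```python
# Minimal residual plot: residual = actual - predicted, centered on zero.
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=3, noise=15, random_state=0)
model = LinearRegression().fit(X, y)
predicted = model.predict(X)
residuals = y - predicted

plt.scatter(predicted, residuals, s=12)
plt.axhline(0, color="red")                 # the zero-residual line (the model's fit)
plt.xlabel("Predicted value")
plt.ylabel("Residual (actual - predicted)")
plt.show()
```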
Confusion Matrix Plots
For classification models, a confusion matrix helps investigate the performance of the model. A confusion matrix compares how well the model’s predicted values match the actual values in the dataset.
The actual values are placed on the Y axis, while the predicted values are placed on the X axis. Each cell shows the percentage of samples for which the predicted value was X and the actual value was Y, normalized within each actual value (a sketch for computing such a matrix follows the example). In the example below, our model had:
- 20% of samples with an actual value of 1 predicted as 0
- 80% of samples with an actual value of 1 predicted as 1
- 97% of samples with an actual value of 0 predicted as 0
- 3% of samples with an actual value of 0 predicted as 1
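A minimal sketch of computing such a matrix with scikit-learn, normalized over actual values as in the example above (the labels are placeholders):

```python
# Sketch: confusion matrix normalized over actual (true) values, as percentages.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 1, 0]   # placeholder actual labels
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]   # placeholder predicted labels

matrix = confusion_matrix(y_true, y_pred, normalize="true")
print(matrix * 100)   # row i = actual value i, column j = predicted value j
```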
Pipeline Report
The pipeline report shows detailed information about each step of the model training process (known as a “pipeline”). From this report, you can see important information from each step, including which features were pruned, which features were used, the edges of any automatically created bins, and more.