Create a Model with Analyze
Before you begin, make sure you've read our Best Practices for machine learning.
Introduction
Analyze handles many of the complicated aspects of machine learning, such as feature identification and removal (known as "pruning"), data optimization (such as binning continuous values), and more. More advanced users can also take advantage of the Train skill if they'd like to build specific types of models or need more control over how the model is trained.
By default, Analyze explores several gradient-boosted models and chooses the best one for you. If the initial model has a low score, Analyze explores additional models of different types, such as tree and linear models. If your training dataset is small, you can choose to allow Analyze to explore additional model types that might work better with smaller datasets.
Build a Model
To create a model:
1. Load data into your session.
2. Click ML > Analyze in the sidebar.
3. Choose your target column.
4. Choose any additional options you want to add, such as:
   - Including or excluding specific feature columns from the model.
   - Configuring how the model is trained using tools such as feature pruning, temporal slicing, detailed model scores, model parameter optimization, or additional model exploration.
   - Disabling default optimizations such as autobinning and fixing class imbalances.
   - Balancing the data by weighting it with another dataset.
5. Click Submit.
Chat Box Outputs
When Analyze is finished, it provides a number of important outputs located in the Impact Scores Explained section of the chat box.
Learn How Impact Scores are Calculated
This directs you to the Feature Importance section for information on Shapley values and how the scores are calculated.
Show a Sample of Records Used in Training
This displays a table of sample records from your dataset that Analyze used to train the model.
Display a Detailed Set of Scores
This displays the impact scores as a standalone chart without the tabbed options in the chart's header.
Compute Impact Scores with Bootstrapping
By default, Analyze creates a bar chart showing the impact scores of each feature in the dataset. When you compute those impact scores with bootstrapping enabled, the same chart is generated, but it also contains error bars showing the confidence intervals for each feature.
Computing impact scores with bootstrapping can take a considerable amount of time and will block other work in a session until the computation is complete.
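Conceptually, the bootstrap works by recomputing the impact scores on many resampled copies of the training data and reading confidence intervals off the resulting distribution. Below is a minimal sketch of that idea, assuming a trained `model`, a feature DataFrame `X`, and a resample count of 100 (all hypothetical names and values, not Analyze's internals):

```python
# A minimal sketch of bootstrapped impact scores with 95% confidence
# intervals; `model`, `X`, and the resample count are assumptions.
import numpy as np
import shap

explainer = shap.Explainer(model)
rng = np.random.default_rng(0)

def impact(sample):
    # Mean absolute Shapley value per feature for one resample.
    return np.abs(explainer(sample).values).mean(axis=0)

boot = np.array([
    impact(X.iloc[rng.integers(0, len(X), size=len(X))])
    for _ in range(100)
])

# Percentile confidence intervals per feature, usable as error bars.
lower, upper = np.percentile(boot, [2.5, 97.5], axis=0)
```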
Header Outputs
When Analyze is finished, it provides a number of important outputs that can be found in the chart header.
Impact
Feature Importance
Analyze first explores several gradient-boosted models. If the initial model has a low score, Analyze explores additional models of different types, such as tree and linear models. Eventually, a winning model is selected and is given the name “BestFit1”. Subsequent models created by Analyze are named “BestFit2”, “BestFit3”, and so on. This is the model you can save, publish, and use to create predictions.
When Analyze is finished, it produces an impact chart, which illustrates the average impact each input feature has on the target feature. The features are sorted from most- to least-impactful. In the example below, we can see that the “registeredRiders” feature has the most impact on the “allRiders” feature, while features such as “weatherSituation” and “holiday” had very little or no impact on “allRiders”.
To obtain the impact scores for each feature as displayed on the bar chart, we first calculate the Shapley values for each feature for each sample in the training dataset. From these "local Shapley scores", we take the absolute values and then average them for each feature over all the individual data samples. These values are then displayed in the bar chart as the feature-wise model impact scores. For more information on Shapley values, see this article.
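As a minimal sketch of that calculation, using the open-source `shap` package (the `model` and `X_train` names are hypothetical stand-ins for your trained model and training features):

```python
# Mean absolute Shapley value per feature, as described above;
# `model` and `X_train` are hypothetical stand-ins.
import numpy as np
import shap

explainer = shap.Explainer(model)
local_shap = explainer(X_train).values   # shape: (n_samples, n_features)

# Take the absolute value of each local Shapley score, then average
# over all samples to get one impact score per feature.
impact_scores = np.abs(local_shap).mean(axis=0)
```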
Top Features
This option provides a bar chart that plots the actual Shapley values of each of the top three most impactful features across 10 samples from the training dataset. Reviewing the impact of these features on a per-data point basis is helpful to understand how individual features can contribute, positively or negatively, to a predicted value.
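Continuing the hypothetical `shap` sketch above, the underlying data for this view could be assembled like so:

```python
# Signed Shapley values for the three most impactful features across
# 10 samples; `explainer` and `X_train` are the hypothetical names
# from the previous sketch.
import numpy as np

explanation = explainer(X_train)
impact = np.abs(explanation.values).mean(axis=0)

top3 = np.argsort(impact)[::-1][:3]          # three most impactful features
per_sample = explanation.values[:10, top3]   # signed values for 10 samples
```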
Visualize
Analyze also provides several visualizations that can help illustrate the trends identified by the model. These charts can help you understand your data, show trends for impactful features, and help catch errors in your data that could be affecting your models.
Model Stats
The Model Stats section of Analyze's output contains three sections:
Scores
Different model scores are used for different model types:
- Regression models are given an R2 score, which can be interpreted as the proportion of the variance in the target that the model explains. Whether the score is good or bad is highly dependent on the model's use case.
- Classification models use a few different scoring methods, but AUC is the primary one. If label weighting was used or detailed model scores were requested, AUC is replaced (or complemented) with an F1 score. F1 is the harmonic mean of the model's "precision" (# of true positives / (# of true positives + # of false positives)) and its "recall" (# of true positives / (# of true positives + # of false negatives)).
- Time series models are given symmetric mean absolute percentage error (SMAPE) and mean absolute error (MAE) scores along with a confidence interval. The narrower the confidence interval, the more confident the model is.
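As a rough sketch, these scores map onto standard scikit-learn metrics (the `y_true`, `y_pred`, and `y_prob` arrays are hypothetical); SMAPE has no scikit-learn built-in, so one common formulation is shown:

```python
# Hypothetical arrays: y_true (actuals), y_pred (predictions),
# y_prob (predicted probabilities for the positive class).
import numpy as np
from sklearn.metrics import r2_score, roc_auc_score, f1_score

# Regression: R2 on actuals vs. predictions.
r2 = r2_score(y_true, y_pred)

# Classification: AUC uses probabilities; F1 uses hard labels.
auc = roc_auc_score(y_true, y_prob)
f1 = f1_score(y_true, y_pred)

# Time series: SMAPE, expressed as a percentage.
y_t, y_p = np.asarray(y_true), np.asarray(y_pred)
smape = 100 * np.mean(2 * np.abs(y_p - y_t) / (np.abs(y_t) + np.abs(y_p)))
```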
Residual or Confusion Matrix Plots
Residual Plots
For regression models, a residual plot can help you investigate whether your model is a strong one. A residual value is the difference between the actual value and the predicted value of a given data point. A residual plot places all of your residual values around the horizontal axis, which represents the model’s line of best fit.
If your residual values follow a normal distribution and are centered around the zero residual line, then your dataset is well-suited for linear regression. If not, such as in the example below, your dataset might not be a good fit for linear regression models.
In the example below, the model tends to overestimate predictions for higher Age ranges and underestimate predictions in the lower Age ranges, indicating a positive correlation between the error terms and the predictor.
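A residual plot of this kind can be sketched with matplotlib, assuming hypothetical `y_true` and `y_pred` arrays:

```python
# Residual plot: actual minus predicted, scattered around the
# zero-residual line; y_true and y_pred are hypothetical arrays.
import matplotlib.pyplot as plt

residuals = y_true - y_pred

plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(0, linestyle="--")   # the zero-residual line
plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.title("Residual plot")
plt.show()
```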
Confusion Matrix Plots
For classification models, a confusion matrix helps investigate the performance of the model. A confusion matrix compares how well the model’s predicted values match the actual values in the dataset.
The actual values are placed on the Y axis, while the predicted values are placed on the X axis. Each cell shows the percentage of samples with actual value Y that were given predicted value X. In the example below, our model had:
- 36% of samples with an actual value of 1 predicted as 0
- 64% of samples with an actual value of 1 predicted as 1
- 94% of samples with an actual value of 0 predicted as 0
- 6% of samples with an actual value of 0 predicted as 1
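For reference, a confusion matrix normalized this way (over each actual value) can be computed with scikit-learn; `y_true` and `y_pred` are hypothetical arrays of actual and predicted labels:

```python
from sklearn.metrics import confusion_matrix

# normalize="true" divides each row by the number of samples with that
# actual value, matching the percentages in the example above.
cm = confusion_matrix(y_true, y_pred, normalize="true")
print(cm)  # rows: actual value, columns: predicted value
```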
Other Models
Other Models is a table that lists each of the models that Analyze explored but ultimately did not select. You can view each model's name, type, and scores, and open its pipeline report:
Pipeline Report
The pipeline report shows detailed information about each step of the model training process (known as a “pipeline”), including which features were pruned, which features were used, the edges of any automatically created bins, and more.
Optimize Analyze
By default, Analyze applies many industry-standard model-building best practices and optimizations for you. However, you can turn off, change, or enable many of these options as you work to refine your model. Specifically, you can enable:
- Automatic feature pruning.
- Model optimization.
- Temporal slicing.
- Detailed model scores.
- Additional model exploration.
Implement Feature Pruning
By default, Analyze does not remove low-impact input features from the model training process. With the feature pruning parameter enabled, Analyze removes those low-impact features for you.
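Conceptually, pruning amounts to dropping features whose impact falls below some cutoff. Here is a minimal sketch reusing the hypothetical `impact_scores` and feature DataFrame `X` from the Feature Importance section; the threshold value is an assumption, as Analyze's actual cutoff is internal:

```python
# Drop features whose mean absolute Shapley impact is below a cutoff;
# the threshold here is an assumed illustrative value.
import numpy as np

threshold = 0.01
keep = impact_scores >= threshold
X_pruned = X.loc[:, keep]  # retrain on the higher-impact features only
```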
Model Optimization
When enabled, Analyze considers a wider range of parameters during the model training process to maximize the model’s performance.
Temporal Slicing
By default, Analyze removes date- or time-based values because they’re unparseable. When enabled, Analyze breaks date- or time-based values into more manageable pieces, such as minutes, hours, or days.
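Conceptually, this is similar to decomposing a timestamp column with pandas (the `df` and `"timestamp"` names are hypothetical):

```python
# Break one datetime column into smaller, model-friendly pieces;
# `df` and its "timestamp" column are hypothetical.
import pandas as pd

df["timestamp"] = pd.to_datetime(df["timestamp"])
df["day"] = df["timestamp"].dt.day        # day of month
df["hour"] = df["timestamp"].dt.hour      # hour of day
df["minute"] = df["timestamp"].dt.minute  # minute of hour
```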
Detailed Model Scores
By default, Analyze provides two main scores, depending on the type of model. For classification models, the main scores are Accuracy and AUC. For regression models, the main scores are R2 and Mean Absolute Error.
With the Detailed Model Scores parameter enabled, Analyze also reports the following scores:
- For classification models
- Log loss
- F1 scores
- Precision
- Recall
- For regression models
- Mean Squared Error
- Root Mean Squared Error
- Mean Absolute Percentage Error
Model Exploration
By default, Analyze explores additional models if the initial models are low-scoring. When enabled, Analyze explores additional models regardless of the initial model scores.
Turn Off Default Optimizations
You can turn off some of Analyze’s default optimizations as you work to refine your models.
Autobinning
By default, Analyze bins continuous, numeric columns to improve model performance. The bin boundaries are calculated using an auxiliary model that fits best on the provided labels. You can turn this off to use those values as they are.
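For intuition, quantile binning with pandas looks like the sketch below (`df` and its `"age"` column are hypothetical); note that Analyze derives its bin edges from an auxiliary model rather than from quantiles:

```python
# Bin a continuous column into four quartile-based intervals;
# `df` and the "age" column are hypothetical stand-ins.
import pandas as pd

df["age_binned"] = pd.qcut(df["age"], q=4)
```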
Fix Class Imbalances
By default, Analyze attempts to balance the training dataset by oversampling or overweighting the data if class imbalances are detected. You can turn this off to use the dataset as-is.
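Naive oversampling, the simplest version of this idea, can be sketched with pandas (`df` and its `"label"` column are hypothetical):

```python
# Resample every class up to the size of the largest class;
# `df` and its "label" column are hypothetical.
target_size = df["label"].value_counts().max()
balanced = (
    df.groupby("label", group_keys=False)
      .apply(lambda g: g.sample(n=target_size, replace=True, random_state=0))
)
```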
Use Label Weighting
You can load or create a label weighting dataset to manually account for class imbalances in your data. Note that your dataset must follow this structure:
| labelName | labelWeight |
|---|---|
| label1 | weight1 |
| label2 | weight2 |
| ... | ... |
Then, select your weighting dataset in the “Weight” section.
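A weighting dataset in this shape can be built with pandas, for example (the labels and weights shown are hypothetical):

```python
# Build a two-column label weighting dataset matching the required
# structure; the label names and weights are hypothetical examples.
import pandas as pd

weights = pd.DataFrame({
    "labelName": ["label1", "label2"],
    "labelWeight": [1.0, 5.0],  # e.g., upweight the rarer class
})
```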