Version: 0.26.4

Create a Model with Analyze

tip

Before you begin, make sure you've read our Best Practices for machine learning.

Introduction

Analyze handles many of the complicated aspects of machine learning, such as feature identification and removal (known as "pruning"), data optimization (such as binning continuous values), and more. More advanced users can also take advantage of the Train skill if they'd like to build specific types of models or need more control over how the model is trained.

By default, Analyze explores several gradient-boosted models and chooses the best one for you. If the initial model has a low score, Analyze will explore additional models of different types, such as tree and linear models. If your training dataset is small, you can choose to allow Analyze to explore additional model types that might work better with smaller datasets.

Build a Model

To create a model:

  1. Load data into your session.

  2. Click ML > Analyze in the sidebar.

  3. Choose your target column.

  4. Choose any additional options you want to add, such as:

    • Including or excluding specific feature columns from the model.
    • Configuring how the model is trained using tools such as feature pruning, temporal slicing, including more detailed model scores, model parameter optimization, or enabling additional model exploration.
    • Disabling default optimizations such as autobinning and fixing class imbalances.
    • Balancing the data by weighting it with another dataset.
  5. Click Submit.

    the analyze form

Chat Box Outputs

When Analyze is finished, it provides a number of important outputs located in the Impact Scores Explained section of the chat box.

chat box outputs

Learn How Impact Scores are Calculated

This directs you to the Feature Importance section, which explains Shapley values and how the scores are calculated.

Show a Sample of Records Used in Training

This displays a table of sample records from your dataset that Analyze used for training.

sample of training dataset

Display a Detailed Set of Scores

This displays the impact scores as a standalone chart without the tabbed options in the chart's header.

detailed scores

Compute Impact Scores with Bootstrapping

By default, Analyze creates a bar chart showing the impact scores of each feature in the dataset. When you compute those impact scores with bootstrapping enabled, the same chart is generated, but it also contains error bars showing the confidence intervals for each feature.

bootstrapped scores

note

Computing impact scores with bootstrapping can take a considerable amount of time and will block other work in a session until the computation is complete.
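
For intuition, here's a minimal sketch of how bootstrapped impact scores with confidence intervals can be computed, using the open-source shap and scikit-learn packages. The model, data, and resample count are illustrative assumptions, not Analyze's actual implementation.

```python
# Minimal sketch: bootstrap confidence intervals for feature impact scores.
# The dataset and model here are stand-ins, not Analyze's internals.
import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)
explainer = shap.TreeExplainer(model)

rng = np.random.default_rng(0)
impacts = []
for _ in range(100):  # 100 bootstrap resamples (an assumed count)
    idx = rng.integers(0, len(X), size=len(X))      # sample rows with replacement
    shap_values = explainer.shap_values(X[idx])
    impacts.append(np.abs(shap_values).mean(axis=0))  # mean |SHAP| per feature

impacts = np.array(impacts)
mean_impact = impacts.mean(axis=0)
# 95% confidence interval per feature -> the error bars on the chart
lo, hi = np.percentile(impacts, [2.5, 97.5], axis=0)
```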

Header Outputs

When Analyze is finished, it provides a number of important outputs that can be found in the chart header.

Analyze outputs

Impact

Feature Importance

Analyze first explores several gradient-boosted models. If the initial model has a low score, Analyze explores additional models of different types, such as tree and linear models. Eventually, a winning model is selected and is given the name “BestFit1”. Subsequent models created by Analyze are named “BestFit2”, “BestFit3”, and so on. This is the model you can save, publish, and use to create predictions.

When Analyze is finished, it produces an impact chart, which illustrates the average impact each input feature has on the target feature. The features are sorted from most- to least-impactful. In the example below, we can see that the “registeredRiders” feature has the most impact on the “allRiders” feature, while features such as “weatherSituation” and “holiday” have very little or no impact on “allRiders”.

impact chart

To obtain the impact scores for each feature as displayed on the bar chart, we first calculate the Shapley values for each feature for each sample in the training dataset. From these "local Shapley scores", we take the absolute values and then average them for each feature over all the individual data samples. These values are then displayed in the bar chart as the feature-wise model impact scores. For more information on Shapley values, see this article.
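
The sketch below illustrates that aggregation with the open-source shap package: compute local Shapley values per sample, then average their absolute values per feature. The model and data are placeholders, not Analyze's internals.

```python
# Sketch of the aggregation described above: average the absolute
# per-sample ("local") Shapley values to get one impact score per feature.
import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=8, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

local_shap = shap.TreeExplainer(model).shap_values(X)  # shape: (samples, features)
impact_scores = np.abs(local_shap).mean(axis=0)        # one score per feature
```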

Top Features

This option provides a bar chart that plots the actual Shapley values of each of the top three most impactful features across 10 samples from the training dataset. Reviewing the impact of these features on a per-data point basis is helpful to understand how individual features can contribute, positively or negatively, to a predicted value.

shapley values
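
Continuing the sketch above (it reuses local_shap and impact_scores), the following lines select the three most impactful features and collect their signed local Shapley values for 10 samples, which is essentially what this chart plots:

```python
# Continues the previous sketch; reuses local_shap and impact_scores.
import numpy as np

top3 = np.argsort(impact_scores)[::-1][:3]  # indices of the 3 most impactful features
per_sample = local_shap[:10, top3]          # signed values: 10 samples x 3 features
# A positive value pushes that sample's prediction up; a negative value pushes it down.
```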

Visualize

Analyze also provides several visualizations that can help illustrate the trends identified by the model. These charts can help you understand your data, show trends for impactful features, and help catch errors in your data that could be affecting your models.

Analyze visualize output

Model Stats

The Model Stats section of Analyze's output contains three sections:

Scores

Different model scores are used for different model types:

  • Regression models are given an R2 score, which can be interpreted as the proportion of the variance in the target that the model explains. Whether the score is good or bad is highly dependent on the model's use case.
  • Classification models use a few different scoring methods, but AUC is the primary one. If label weighting was used or detailed model scores were requested, AUC is replaced (or complemented) with an F1 score: the harmonic mean of the model's "precision" (# of true positives / (# of true positives + # of false positives)) and its "recall" (# of true positives / (# of true positives + # of false negatives)). See the sketch below for how these scores relate.
  • Time series models are given symmetric mean absolute percentage error (SMAPE) and mean absolute error (MAE) scores along with a confidence interval. The narrower the confidence interval, the more confident the model is.

scores
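
As a rough illustration of these classification scores, the sketch below computes AUC, precision, recall, and F1 with scikit-learn on placeholder data; it is not Analyze's scoring code.

```python
# Sketch: the classification scores named above, computed with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

pred = clf.predict(X_te)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
precision = precision_score(y_te, pred)  # TP / (TP + FP)
recall = recall_score(y_te, pred)        # TP / (TP + FN)
f1 = f1_score(y_te, pred)                # harmonic mean: 2*p*r / (p + r)
```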

Residual or Confusion Matrix Plots

Residual Plots

For regression models, a residual plot can help you investigate whether your model is a strong one. A residual value is the difference between the actual value and the predicted value of a given data point. A residual plot places all of your residual values around the horizontal axis, which represents the model’s line of best fit.

If your residual values follow a normal distribution and are centered around the zero residual line, then your dataset is well-suited for linear regression. If not, such as in the example below, your dataset might not be a good fit for linear regression models.

In the example below, the model tends to overestimate predictions for higher Age ranges and underestimate predictions in the lower Age ranges, indicating a positive correlation between the error terms and the predictor.

residual plot
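
If you want to reproduce this kind of diagnostic outside of Analyze, a minimal residual plot might look like the sketch below, assuming scikit-learn and matplotlib; the data and model are placeholders.

```python
# Sketch: a residual plot for a regression model. Residual = actual - predicted;
# points should scatter evenly around the zero line for a well-suited model.
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=300, n_features=1, noise=15.0, random_state=0)
model = LinearRegression().fit(X, y)

residuals = y - model.predict(X)
plt.scatter(model.predict(X), residuals, s=10)
plt.axhline(0, color="red")  # the zero-residual line (the line of best fit)
plt.xlabel("Predicted value")
plt.ylabel("Residual (actual - predicted)")
plt.show()
```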

Confusion Matrix Plots

For classification models, a confusion matrix helps investigate the performance of the model. A confusion matrix compares how well the model’s predicted values match the actual values in the dataset.

The actual values are placed on the Y axis, while the predicted values are placed on the X axis. Each cell shows the percentage of samples where the predicted value was X and the actual value was Y, normalized within each actual value (so each row sums to 100%). In the example below, our model had:

  • 36% of samples predicted as 0 had an actual value of 1
  • 64% of samples predicted as 1 had an actual value of 1
  • 94% of samples predicted as 0 had an actual value of 0
  • 6% of samples predicted as 1 had an actual value of 0

confusion matrix
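
A minimal sketch of building a row-normalized confusion matrix like this one, assuming scikit-learn; the data and model are placeholders:

```python
# Sketch: a confusion matrix where each actual class (row) sums to 100%.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# normalize="true" divides each row by the number of samples in that actual class
cm = confusion_matrix(y_te, clf.predict(X_te), normalize="true")
print(cm * 100)  # percentages: rows = actual values, columns = predicted values
```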

Other Models

Other Models is a table that lists each of the models that were explored, but ultimately not selected, by Analyze. You can view each model's name, type, and score, and open its pipeline report:

pipeline report table

Pipeline Report

The pipeline report shows detailed information about each step of the model training process (known as a “pipeline”), including which features were pruned, which features were used, the edges of any automatically created bins, and more.

pipeline report

Optimize Analyze

By default, Analyze applies many industry-standard model-building best practices and optimizations for you. However, you can turn off, change, or enable many of these options as you work to refine your model. Specifically, you can enable:

  • Automatic feature pruning.
  • Model optimization.
  • Temporal slicing.
  • Detailed model scores.
  • Additional model exploration.

optimizations

Implement Feature Pruning

By default, Analyze does not remove low-impact input features from the model training process. With the feature pruning parameter enabled, Analyze removes them for you.

Model Optimization

When enabled, Analyze considers a wider range of parameters during the model training process to maximize the model’s performance.

Temporal Slicing

By default, Analyze removes date- or time-based columns because their raw values can't be parsed into features the model can use. When temporal slicing is enabled, Analyze instead breaks date- or time-based values into more manageable pieces, such as minutes, hours, or days.
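
Conceptually, temporal slicing resembles the pandas sketch below; the column name is hypothetical, and this is not Analyze's exact procedure.

```python
# Sketch: "slicing" a timestamp column into model-friendly pieces with pandas.
import pandas as pd

df = pd.DataFrame(
    {"pickup_time": pd.to_datetime(["2023-01-05 08:30", "2023-06-21 17:45"])}
)
df["hour"] = df["pickup_time"].dt.hour
df["day_of_week"] = df["pickup_time"].dt.dayofweek
df["month"] = df["pickup_time"].dt.month
df = df.drop(columns=["pickup_time"])  # the raw timestamp itself is dropped
```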

Detailed Model Scores

By default, Analyze provides two main scores, depending on the type of model. For classification models, the main scores are Accuracy and AUC. For regression models, the main scores are R2 and Mean Absolute Error.

With the Detailed Model Scores parameter enabled, Analyze also reports the following scores (a sketch of the regression metrics follows the list):

  • For classification models
    • Log loss
    • F1 scores
    • Precision
    • Recall
  • For regression models
    • Mean Squared Error
    • Root Mean Squared Error
    • Mean Absolute Percentage Error

Model Exploration

By default, Analyze explores additional models if the initial models are low-scoring. When enabled, Analyze explores additional models regardless of the initial model scores.

Turn Off Default Optimizations

You can turn off some of Analyze’s default optimizations as you work to refine your models.

default optimizations

Autobinning

By default, Analyze bins continuous, numeric columns to improve model performance. The bin boundaries are calculated using an auxiliary model that best fits the provided labels. You can turn this off to use the original values as they are.
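
One common way to derive bin edges from an auxiliary model is to fit a shallow decision tree on the labels and use its split thresholds as the edges. The sketch below shows that idea with scikit-learn; it illustrates the concept, not Analyze's exact procedure.

```python
# Sketch: bin edges from an auxiliary model. A shallow decision tree is fit
# on the labels, and its split thresholds become the bin boundaries.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=1, n_informative=1,
                           n_redundant=0, n_clusters_per_class=1, random_state=0)
tree = DecisionTreeClassifier(max_leaf_nodes=4, random_state=0).fit(X, y)

# Internal nodes carry a real threshold; leaves are marked with feature = -2
thresholds = tree.tree_.threshold[tree.tree_.feature >= 0]
edges = np.sort(thresholds)
binned = np.digitize(X.ravel(), edges)  # continuous column -> bin indices
```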

Fix Class Imbalances

By default, Analyze attempts to balance the training dataset by oversampling or overweighting the data if class imbalances are detected. You can turn this off to use the dataset as-is.
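
The sketch below shows both balancing ideas on placeholder data: oversampling the minority class and weighting samples inversely to class frequency. It is illustrative only, not Analyze's implementation.

```python
# Sketch: two ways to balance classes. Oversampling duplicates minority-class
# rows; overweighting passes per-sample weights to the learner.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# Oversampling: resample minority rows (with replacement) until classes match
rng = np.random.default_rng(0)
minority = np.where(y == 1)[0]
extra = rng.choice(minority, size=(y == 0).sum() - minority.size, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

# Overweighting: weight each sample inversely to its class frequency
weights = np.where(y == 1, (y == 0).sum() / (y == 1).sum(), 1.0)
GradientBoostingClassifier(random_state=0).fit(X, y, sample_weight=weights)
```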

Use Label Weighting

You can load or create a label weighting dataset to manually account for class imbalances in your data. Note that your dataset must follow this structure:

labelName    labelWeight
label1       weight1
label2       weight2
...          ...

Then, select your weighting dataset in the “Weight” section:

weight field