Analyze lets you choose a target column in a dataset and then visualize the impact the other columns in the dataset have on predicting the value of the target column.
Analyze also trains a set of machine learning models and chooses the best model among the set, which you can then use to carry out further analysis and predictions.
Analyze selects the best dataset that matches the columns provided, meaning you don't need to have selected the dataset that contains the column in order to analyze it.
Analyze automatically optimizes the data in your target dataset by:
- Grouping continuous, numeric values into optimally-sized bins.
- Oversampling or auto-weighting to help ensure your data is balanced.
- Discarding columns that contain over 95% null values during analysis.
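The skill's internals aren't documented here, but the binning and null-column optimizations can be sketched with pandas. This is an illustration of the concepts only; the column names, bin count, and null data are invented for the example, and the 95% threshold is taken from the text above.

```python
import pandas as pd

# Illustrative data: one continuous column, one mostly-null column.
df = pd.DataFrame({
    "age": [22, 35, 58, 41, 19, 64, 30, 47],
    "mostly_null": [None] * 7 + [1.0],  # 87.5% null -- below the 95% cutoff
})

# Autobinning: group a continuous numeric column into equal-population
# (quantile) bins.
df["age_bin"] = pd.qcut(df["age"], q=4)

# Discard columns whose null fraction exceeds 95%.
null_fraction = df.isna().mean()
df = df.drop(columns=null_fraction[null_fraction > 0.95].index)
```

Here `mostly_null` survives because 87.5% is under the threshold; a column that was entirely null would be dropped.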
Analyze can also optimize data in your target dataset by:
- Slicing datetime values into the most relevant components, such as minutes, hours, or days.
- Accounting for heavily-weighted data with label weighting. To use label weighting, you must first create a separate dataset that contains one column with the labels (such as each value in a particular column), and another that contains the weight of each label.
- Discarding columns in your dataset that don't have much impact on the model.
- Splitting your dataset into testing and training datasets.
Each of these optimizations can be enabled or disabled as needed. Refer to the sections below for more information.
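Temporal slicing, described above, is a standard feature-engineering step. A minimal sketch with pandas (the column names and timestamps are illustrative, not part of the skill):

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-01-15 09:30:00",
        "2023-06-02 17:45:00",
    ])
})

# Temporal slicing: replace one opaque datetime column with
# model-friendly numeric components.
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek
df["month"] = df["timestamp"].dt.month
df = df.drop(columns=["timestamp"])
```

The model then sees separate hour, day-of-week, and month signals rather than a single timestamp it can't generalize from.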
Analyze has a single utterance with several options:
Analyze <target column name> (using <feature columns>) (excluding <excluded columns>) (weighting with <dataset>) (with/without <optimizations>) (setting test holdout percentage as <percentage> | setting test split method as <split method>) (treating the target column as <type>) (binning the target column by percentile setting the number of bins to <bins>).
You can use the following parameters to specify what column to analyze and how to analyze it:
target column (required). The column to analyze and receive insights on.
feature columns (optional). A list of columns that are used in the analysis of the target column. All other columns are ignored.
excluded columns (optional). A list of columns that are not used in the analysis of the target column. All other columns are used in the analysis.
dataset (optional). The dataset to use to balance out imbalanced data points.
optimization (optional). Enables or disables various optimizations. The available optimizations include:
detailed model scores. Show more scoring metrics in the final response message after a model is trained. If a percentage is specified to test the trained model on a percentage of the dataset, additional metrics are also shown for that test.
feature pruning. Automatically discard columns in the dataset that aren't impactful to the model. This optimization is disabled by default.
model optimization. Automatically consider a wider range of parameters to maximize the performance of the model. This optimization is disabled by default.
model exploration. Automatically explore additional model types. This optimization is automatically enabled if the initial models have low performance scores.
temporal slicing. Automatically extract more manageable components, such as minutes, hours, or days, from date and time values. This optimization is disabled by default.
autobinning. Automatically group continuous, numeric values into optimally-sized bins. This optimization is enabled by default.
fixing class imbalances. Automatically attempt to balance the dataset by oversampling or auto-weighting the data. This optimization is enabled by default.
percentage. The percentage of the dataset to hold out from the training dataset for testing.
split method. The method by which to separate the testing data from the training data. The options include:
- random. The testing and training datasets are created by randomly sampling the original dataset.
- stratified. Takes into account the class distributions of the target column and preserves those distributions in the testing and training datasets. This option is recommended for classification problems, such as predicting whether an email is spam or if a given passenger would survive the Titanic disaster. If the value you're analyzing is a quantity, such as next quarter's revenue, this option is ignored.
type (optional). The type to treat the target column as.
bins (optional). The number of bins to use to group the values of the target column. Refer to Bin for more information.
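The random and stratified split methods described above correspond to standard train/test splitting. A sketch with scikit-learn, assuming an illustrative 25% holdout and invented column names (the skill's actual mechanism isn't exposed):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "age": range(100),
    "survived": [0] * 80 + [1] * 20,  # imbalanced 80/20 classes
})

# random: sample the holdout without regard to class balance.
train, test = train_test_split(df, test_size=0.25, random_state=0)

# stratified: preserve the 80/20 class distribution in both splits.
train_s, test_s = train_test_split(
    df, test_size=0.25, random_state=0, stratify=df["survived"]
)
```

With stratification, the 25-row test set is guaranteed to contain the same 20% share of positive cases as the full dataset, which matters when a rare class could otherwise be underrepresented in the holdout.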
If the analysis is successful, the skill outputs a number of messages. These include:
- Whether unsuitable columns were detected. If the dataset has special column types, such as Date or JSON, that are currently unsupported by the skill, the platform returns a message informing you that those columns have been dropped.
- Whether the utterance can be optimized by removing columns. If the skill detects that an included column is of little importance to the analysis, such as an ID column with a unique value for every row in the dataset, the platform informs you that the skill execution can be optimized by dropping that column from the included list of columns, along with a link to rerun the utterance with that column excluded.
- The model that was trained. The skill reports the model that was selected based on its performance and provides its performance scores. If a test holdout percentage was specified, the model's score against the test dataset is also included.
- A bar chart showing the impact each column used in the analysis has on the target column. The chart, shown in the display panel, visualizes the relative impact of each column on the model's prediction of the target column. It is accompanied by a list of the most impactful features and a link in the chat box to view a table of all of the impact scores.
- Alternative visualizations. A list of links to charts is returned in the chat box. These links let you plot interesting visualizations of the dataset using the most impactful feature columns.
- Machine learning pipeline report. Analyze returns a link to view the machine learning pipeline report, which describes all of the steps completed while training the selected model. This report gives a detailed description of the parameters selected during each stage of the pipeline and is intended for users who want more information on the underlying operations and settings chosen for the skill execution.
- When analyzing a small dataset, a recommendation to use model optimization and rerun the analysis might appear.
Consider a dataset called "Titanic" that contains information on each passenger, including the following columns:
Age. Their age.
Gender. Their gender.
Name. Their name.
PClass. Their class.
Survived. Whether they survived the disaster.
To analyze the Survived column based on all other columns of the dataset, enter
Analyze Survived
To analyze the Survived column using the passengers' age and gender, enter
Analyze Survived using Age, Gender
To analyze the Survived column with feature-pruning enabled, enter
Analyze Survived using Age, Gender with feature-pruning
To analyze the Survived column with model-optimization disabled, enter
Analyze Survived using Age, Gender without model-optimization
To analyze the Survived column while excluding the Name column, enter
Analyze Survived excluding Name
After you run Analyze once and determine which feature columns have the highest impact, check the chat box for any suggestions for optimizing the utterance, such as dropping columns with a high number of unique values.
Then, from the bar plot, try to determine if some of the columns are closely related and therefore provide redundant information about the target column. For example, some of the columns might be a linear combination of another column. Such columns are highly correlated and some of them can be dropped to improve the model. Rerun the utterance to see the updated impact scores of the remaining columns.
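Correlated columns of the kind described above can also be spotted directly with pandas before rerunning the utterance. A sketch, with invented column names and an illustrative 0.95 cutoff (not a threshold the skill defines):

```python
import pandas as pd

df = pd.DataFrame({
    "fare": [10.0, 20.0, 35.0, 50.0],
    "fare_usd": [11.0, 22.0, 38.5, 55.0],  # fare * 1.1 -- a linear combination
    "age": [22, 35, 58, 41],
})

# Pairwise absolute correlations; values near 1.0 flag redundant columns.
corr = df.corr().abs()
redundant = [
    (a, b)
    for a in corr.columns
    for b in corr.columns
    if a < b and corr.loc[a, b] > 0.95
]
```

Any pair flagged here provides essentially the same information to the model, so one of the two columns is a good candidate to drop before rerunning the analysis.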
Optionally, for regression models (continuous target), you can also view the residual statistics of your analyzed model by entering
Plot residuals for the model <model name>. This generates a scatter chart of residuals against the target value, and residual statistics are returned in the chat history.
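A residual is simply the observed target value minus the model's prediction. A sketch of the kind of statistics such a report might include, using invented values (the skill's actual metrics aren't specified here):

```python
import numpy as np

# Residuals: observed target values minus the model's predictions.
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])
residuals = y_true - y_pred

stats = {
    "mean": residuals.mean(),          # near zero for an unbiased model
    "std": residuals.std(ddof=1),      # spread of the errors
    "max_abs": np.abs(residuals).max(),  # worst single prediction
}
```

A mean far from zero suggests systematic over- or under-prediction, while patterns in the scatter chart (such as residuals growing with the target value) suggest the model is missing structure in the data.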