Introduction
DataChat’s machine learning tools are intended to make machine learning as simple as possible. There are four main parts to DataChat’s machine learning functionality. You can:
- Train models on known data to better understand your data.
- Analyze those models to find interesting information you might not have found otherwise.
- Use standard machine learning tools, such as confusion matrices, residuals, and impact charts, to learn more about the distribution of your data.
- Use those models to predict values on other data.
In this section, we’ll cover how to:
- Create models with the Train skill.
- Test and explore models in the Model Profiler and Model Predictor.
- Apply models to data with the Predict skill.
- Create time series predictions.
Before you begin, there are some best practices and definitions to keep in mind as you prepare your data and build your models.
Best Practices
Note that some of the steps described below are only available, or are most easily performed, using the input field.
Identify the Problem
There are four main types of problems when it comes to machine learning:
- Regression. Predicting a continuous, numerical outcome. For example, a model that predicts the fuel efficiency of a car is solving a regression problem.
- Classification. Predicting one of a set number of categorical outcomes. For example, a model that determines whether a sales lead is high quality is solving a classification problem.
- Clustering. Grouping data together based on similarities. For example, a model that powers recommendations by grouping similar customers together is solving a clustering problem.
- Time series. Predicting future values based on historical trends. For example, a model that forecasts median home prices is solving a time series problem.
Combine Relevant Datasets
Oftentimes, all of the relevant data is spread across multiple datasets. Before you begin, make sure you've combined all of the relevant data into a single dataset. Refer to the Combine section for more information.
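For illustration, here is roughly what this step looks like in pandas, outside of DataChat (the file names and columns are hypothetical):

```python
import pandas as pd

# Hypothetical example: sales facts in one file, customer attributes in another.
sales = pd.read_csv("sales.csv")          # columns: customer_id, date, amount
customers = pd.read_csv("customers.csv")  # columns: customer_id, region, segment

# Combine everything into a single training dataset via a join on the shared key.
training_data = sales.merge(customers, on="customer_id", how="left")
```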
Slice Data and Train Models for Each Unique Feature
Slice datasets by features that are highly heterogeneous and build a model for each slice. For example, predicting sales might benefit from building a model per region rather than one model for the entire country.
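A minimal sketch of the idea using pandas and scikit-learn (the dataset and column names are hypothetical):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("sales.csv")  # hypothetical columns: region, ad_spend, stores, sales

# Train one model per region instead of a single nationwide model.
models = {}
for region, slice_df in df.groupby("region"):
    X = slice_df[["ad_spend", "stores"]]
    y = slice_df["sales"]
    models[region] = LinearRegression().fit(X, y)
```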
Remove Distracting Data
Drop or Keep distracting data that causes the model to focus on trends you don't care about. For example, when you're training a model to predict something per hour of the day, it might make sense to cut out evening hours or non-business hours so that day/night differences don't wash out more subtle trends during the day.
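For example, in pandas this kind of filter might look like the following (the data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=48, freq="h"),
    "orders": range(48),
})

# Keep only business hours (9:00-17:00) so day/night swings
# don't wash out the subtler daytime trends.
business_hours = df[df["timestamp"].dt.hour.between(9, 17)]
```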
Use Lag or Window Columns to Focus Your Model
Create lag or window columns to help your model focus on specific trends within time series data. For example, a two-week cycle can be emphasized with a column containing the two-week lag of the actual value.
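In pandas terms, lag and window columns correspond to shift and rolling operations; a rough sketch with made-up data:

```python
import pandas as pd

# Hypothetical daily time series.
df = pd.DataFrame({"value": range(30)},
                  index=pd.date_range("2024-01-01", periods=30, freq="D"))

# A two-week (14-day) lag emphasizes a biweekly cycle.
df["value_lag_14d"] = df["value"].shift(14)

# A 7-day rolling mean smooths out day-to-day noise.
df["value_roll_7d"] = df["value"].rolling(window=7).mean()
```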
Normalize Units of Measurement
Compute normalized units across columns so that regression models don't interpret large numbers in one feature as more important than small numbers in another. For example, if one column is measured in ounces and another in pounds, convert them to a common unit, either ounces or pounds.
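A trivial example of the unit conversion in pandas (the columns are made up):

```python
import pandas as pd

df = pd.DataFrame({"weight_oz": [8, 16, 32], "shipment_lb": [1.0, 2.5, 4.0]})

# Convert pounds to ounces so both columns share one unit (1 lb = 16 oz).
df["shipment_oz"] = df["shipment_lb"] * 16
```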
Create Relevant Metrics Before Training
Combine columns into business-relevant metrics. For example, a pool may have water flowing in and water flowing out, both fairly stable in absolute terms, but the business-relevant metric is the difference between the two, which indicates whether the pool is filling or emptying.
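Sketching the pool example in pandas (the columns are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"inflow_gpm": [50, 52, 48], "outflow_gpm": [49, 55, 47]})

# The business-relevant signal: is the pool filling (positive) or emptying?
df["net_flow_gpm"] = df["inflow_gpm"] - df["outflow_gpm"]
```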
Limit Your Training Data to a Subset of Features
Limit your initial columns to the ones that make intuitive sense. This reduces the effort for you and avoids overfitting the data by providing too many options. For example, if you have a dataset that represents the Titanic's passenger manifest and you want to predict whether a given passenger would have survived the disaster, you might drop columns that obviously don't affect survival, such as names, ticket numbers, and cabin numbers.
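Outside of DataChat, the equivalent pandas step might look like this (assuming a local copy of the classic Titanic manifest):

```python
import pandas as pd

titanic = pd.read_csv("titanic.csv")  # hypothetical local copy of the manifest

# Drop identifier-like columns that can't plausibly drive survival.
features = titanic.drop(columns=["Name", "Ticket", "Cabin"])
```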
Avoid Highly-Unique Features
Avoid using columns that have a lot of unique values. Models typically can’t find reliable trends in these columns without a lot more data.
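One rough way to spot such columns in pandas (the 0.9 threshold is an arbitrary starting point, not a firm rule):

```python
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical combined dataset

# Drop columns where nearly every row is distinct (e.g. IDs, ticket numbers);
# models rarely find reliable trends in such high-cardinality features.
high_cardinality = [c for c in df.columns if df[c].nunique() / len(df) > 0.9]
df = df.drop(columns=high_cardinality)
```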
Create Bins for Continuous Features
Bin continuous values (such as numbers) into 3-10 buckets rather than using them directly. Simple high, medium, and low bins are often a good place to start. For example, it might make sense to bin an "Age" column into common demographic buckets, such as 18-24, 25-34, 35-44, 45-54, 55-64, and so on.
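For example, binning with pandas' cut function (the bucket edges are illustrative):

```python
import pandas as pd

ages = pd.DataFrame({"Age": [19, 27, 41, 56, 33, 62]})

# Bin ages into standard demographic buckets.
bins = [18, 25, 35, 45, 55, 65]
labels = ["18-24", "25-34", "35-44", "45-54", "55-64"]
ages["AgeGroup"] = pd.cut(ages["Age"], bins=bins, labels=labels, right=False)
```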
Be Careful with Skewed Features
Be careful judging your results if your input data is heavily skewed. DataChat can compensate for this using industry standard techniques, but you should scrutinize the results more carefully.
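One common way to check for and tame skew, sketched in pandas and NumPy (the income values are made up):

```python
import numpy as np
import pandas as pd

incomes = pd.Series([30_000, 35_000, 40_000, 42_000, 1_500_000])

# Skewness far from 0 signals a heavy tail; a log transform is one
# standard way to compress it before training.
print(incomes.skew())            # large positive value
log_incomes = np.log1p(incomes)  # log(1 + x) tames the long tail
```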
Definitions
Attribute Value Frequency (AVF)
The Attribute Value Frequency (AVF) is a simple and fast approach to detecting outliers in categorical data. The AVF method generates scores that are always positive numbers; the lower the score, the more likely the object is an outlier.
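A rough pandas illustration of the idea (the data is made up): a row's AVF score is the average frequency of its attribute values, so rows built from rare values score low.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "red", "blue", "green", "red"],
                   "size":  ["S", "S", "S", "L", "S"]})

# Map each cell to the frequency of its value within its column,
# then average across columns to get the AVF score per row.
freqs = pd.concat([df[c].map(df[c].value_counts()) for c in df.columns], axis=1)
df["avf_score"] = freqs.mean(axis=1)  # the "green"/"L" row scores lowest
```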
Chatterjee Correlation Coefficient
Chatterjee's correlation coefficient is a rank correlation metric. It can be used to measure the degree of dependence between two variables. The coefficient is 0 only if the variables are completely independent and 1 only if one variable is a measurable function of the other. Note that the coefficient is asymmetric, meaning that its value when calculated for variable A in relation to variable B is not guaranteed to be the same as when calculated for variable B in relation to variable A.
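A minimal NumPy sketch of the simplified (no-ties) formula, which also shows the asymmetry:

```python
import numpy as np

def chatterjee_xi(x, y):
    # Simplified formula, assuming no ties in y.
    n = len(x)
    order = np.argsort(x)                          # sort pairs by x
    ranks = np.argsort(np.argsort(y[order])) + 1   # ranks of y in that order
    return 1 - 3 * np.sum(np.abs(np.diff(ranks))) / (n ** 2 - 1)

x = np.linspace(0, 2, 200)
y = np.sin(5 * x)
print(chatterjee_xi(x, y))  # close to 1: y is a function of x
print(chatterjee_xi(y, x))  # noticeably lower: x is not a function of y
```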
Correlation Measure Between Continuous and Discrete Variables
This is a measure of correlation between a continuous variable and a categorical variable, built from two distances. First, the values of the continuous variable are clustered by the unique values of the categorical variable. The intra-class distance is calculated among the values belonging to each unique category and summed across all categories. The inter-class distance is the sum of the distances of each cluster mean from the overall mean. The correlation is then the ratio of the inter-class distance to the sum of the inter-class and intra-class distances. The range is [0, 1].
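One literal reading of this definition, sketched in pandas (the data is made up; this is an illustration, not necessarily DataChat's exact implementation):

```python
import pandas as pd

df = pd.DataFrame({"group": list("aabbbcc"),
                   "value": [1.0, 1.2, 5.0, 5.5, 4.8, 9.0, 9.3]})

grand_mean = df["value"].mean()
inter = intra = 0.0
for _, vals in df.groupby("group")["value"]:
    inter += abs(vals.mean() - grand_mean)     # cluster mean vs. overall mean
    intra += (vals - vals.mean()).abs().sum()  # spread within the cluster

correlation = inter / (inter + intra)          # falls in [0, 1]
```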
Cumulative Variance Ratio
Variance is a measure of the spread in the data after a Principal Component Analysis (PCA) has been performed on it. The cumulative variance ratio shows how much of the total variance is accounted for by the first n components combined.
Durbin-Watson Test
The Durbin-Watson test is used to detect autocorrelation in a regression model's prediction errors.
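For illustration, statsmodels provides this test directly (the residuals here are made up):

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

residuals = np.array([0.5, -0.3, 0.2, -0.4, 0.1, -0.2])

# Values near 2 suggest no autocorrelation; values toward 0 or 4
# suggest positive or negative autocorrelation, respectively.
print(durbin_watson(residuals))
```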
Explained Variance Ratio
The explained variance ratio is the ratio of the variance captured by a given principal component to the total variance.
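Both the explained and cumulative variance ratios are easy to inspect with scikit-learn's PCA; a small sketch on random data:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(100, 5))

pca = PCA().fit(X)
print(pca.explained_variance_ratio_)             # per-component share
print(np.cumsum(pca.explained_variance_ratio_))  # cumulative share
```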
Isolation Forest
An isolation forest is an algorithm for detecting outliers in numerical data. The scores generated by the algorithm range from -1 to 1; the lower the score, the higher the chance that the corresponding object is an outlier.
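A small scikit-learn sketch (the data is made up):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.array([[1.0], [1.1], [0.9], [1.05], [8.0]])  # 8.0 is the oddball

iso = IsolationForest(random_state=0).fit(X)
print(iso.decision_function(X))  # lower scores -> more likely an outlier
print(iso.predict(X))            # -1 flags outliers, 1 flags inliers
```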
Mean Absolute Error
The mean absolute error (MAE) measures the average magnitude of the errors in a set of predictions, without considering whether the forecast over-predicts or under-predicts. It is a scale-dependent metric that shows the average error to expect from a given forecast.
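For example, with scikit-learn (the values are made up):

```python
from sklearn.metrics import mean_absolute_error

actual = [100, 120, 90]
forecast = [110, 115, 95]
print(mean_absolute_error(actual, forecast))  # 6.67: the average miss size
```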
Mean Absolute Percentage Error
Mean absolute percentage error (MAPE) is used for measuring predictions in regression tasks. It is typically expressed as a value from 0 to 1 or, equivalently, as a percentage. Note that MAPE generates unbounded values when the ground truth is at or near zero.
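For example, with scikit-learn (available in recent versions; same made-up values as above):

```python
from sklearn.metrics import mean_absolute_percentage_error

actual = [100, 120, 90]
forecast = [110, 115, 95]
print(mean_absolute_percentage_error(actual, forecast))  # ~0.066, i.e. ~6.6%
```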
Mutual Information Gain
This is the measure of the mutual dependence of two categorical variables. It indicates how much information about one variable can be obtained just by observing the other.
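A small scikit-learn illustration with made-up, perfectly dependent variables:

```python
from sklearn.metrics import mutual_info_score

weather = ["sun", "sun", "rain", "rain", "sun", "rain"]
umbrella = ["no", "no", "yes", "yes", "no", "yes"]

# High mutual information: observing one tells you a lot about the other.
print(mutual_info_score(weather, umbrella))  # ln(2) ~= 0.693 here
```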
Pearson Correlation Coefficient
Pearson's correlation coefficient is a measure of the linear relationship between two variables. It's the covariance of the two variables divided by the product of their standard deviations. Values of -1 and 1 indicate a perfect negative or positive linear relationship, respectively, and a value of 0 indicates no linear correlation. Both variables must be continuous.
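A quick SciPy illustration with made-up data:

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1  # perfect linear relationship

r, p_value = pearsonr(x, y)
print(r)       # 1.0
```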
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a way to reduce the size of a dataset by reducing its variables to the principal components that retain most of the dataset's information.
Symmetric Mean Absolute Percentage Error
Symmetric mean absolute percentage error (SMAPE) is used to measure the predictive accuracy of a model. Unlike MAPE, it is bounded on both sides, ranging from 0% to 200%.
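There is no standard scikit-learn helper for SMAPE, so here is a minimal NumPy sketch of the common 0%-200% formulation (same made-up values as above):

```python
import numpy as np

def smape(actual, forecast):
    """SMAPE using the common 0%-200% formulation."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return 100 * np.mean(2 * np.abs(forecast - actual)
                         / (np.abs(actual) + np.abs(forecast)))

print(smape([100, 120, 90], [110, 115, 95]))  # ~6.4
```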