Version: 0.18.3

Create a Model

DataChat’s machine learning skills are intended to make machine learning as simple as possible. There are three main parts to DataChat’s machine learning functionality:

  1. You can train models on known data to better understand your data.
  2. You can analyze those models to uncover information you might not have found otherwise.
  3. You can use those models to predict values for other data.

Prep Your Data

Before you begin, you should make sure your data is as ready as possible to make it easier for a model to identify meaningful trends. We recommend you use the data cleaning tools in DataChat to make sure your data satisfies our best practices before attempting to train a model.
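
For context, the sketch below (plain pandas, not a DataChat skill; the file name and column names are hypothetical) shows the kind of cleanup those best practices describe: removing duplicates, dropping rows with a missing target, imputing a numeric feature, and normalizing a categorical column.

import pandas as pd

df = pd.read_csv("passengers.csv")                    # hypothetical raw file
df = df.drop_duplicates()                             # remove duplicate rows
df = df.dropna(subset=["Survived"])                   # the target column must not be empty
df["Age"] = df["Age"].fillna(df["Age"].median())      # impute a numeric feature
df["Sex"] = df["Sex"].str.strip().str.lower()         # normalize a categorical column
df.to_csv("passengers_clean.csv", index=False)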

Build Your Model

Analyze handles many of the more complicated aspects of machine learning for you, such as feature identification and removal (known as "pruning"), data optimization (such as binning continuous values), and more. More advanced users can also take advantage of the Train skill if they'd like to build specific types of models or need more control over how the model is trained.

By default, Analyze explores several gradient-boosted models and chooses the best one for you. If the initial model has a low score, Analyze explores additional model types, such as tree and linear models. If your training dataset is small, you can also choose to allow Analyze to explore additional model types.
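
To make that selection strategy concrete, here is a rough scikit-learn sketch of the same idea: try a gradient-boosted model first, then widen the search to tree and linear models if the score is low. It is illustrative only and is not DataChat's implementation; the score threshold is an assumption.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def pick_model(X, y, score_threshold=0.8):
    # Start with a gradient-boosted model.
    candidates = [GradientBoostingClassifier()]
    scored = [(cross_val_score(m, X, y, cv=5).mean(), m) for m in candidates]
    best_score, best_model = max(scored, key=lambda pair: pair[0])
    # If the score is low, explore additional model types.
    if best_score < score_threshold:
        candidates += [DecisionTreeClassifier(), LogisticRegression(max_iter=1000)]
        scored = [(cross_val_score(m, X, y, cv=5).mean(), m) for m in candidates]
        best_score, best_model = max(scored, key=lambda pair: pair[0])
    return best_model.fit(X, y), best_score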

To create a model:

  1. Load data into your session.

  2. Click the ML button and select Analyze.

    the analyze button

  3. Choose your target column.

  4. Choose any additional options you want to add, such as:

    • Including or excluding specific columns from the model.
    • Optimizing the model by using tools such as feature pruning, temporal slicing, including more detailed model scores, or enabling additional model exploration.
    • Disabling default optimizations such as autobinning and fixing class imbalances.
    • Balancing the data by weighting it with another dataset.
  5. Click Submit.

The impact chart is shown in the chart panel and the model's statistics, such as its accuracy, are shown in the chat history.

the analyze form

Test Your Model

note

You must Save and Publish your model before you can use it in the Model Profiler or Predictor.

The Model Profiler and Predictor are useful tools for testing your model's performance. The Model Profiler lets you see how up to three different features affect your model's target feature, while the Predictor lets you see how different inputs might affect the model's predictions.

Using the Model Profiler

To use the Model Profiler to test your model's predictions with or without various inputs:

  1. Save the model you created with Analyze or Train.

  2. Create a new API for the Profiler to use with your model. For example, to create a new API called "TitanicModelTest," enter Create a new API TitanicModelTest.

  3. Publish your model to the newly-created API.

  4. Open the Profiler by going to the menu > My Models or by clicking here in the response after publishing the model to your API.

  5. Select the model you want to work with.

    the model select screen in the Profiler

Inside the Profiler, you can:

  1. See metadata about your model, such as its name, along with model-dependent metrics such as its accuracy.

  2. Use the Profiler to see how up to three different features relate to each other in the model.

  3. Use the Predictor to see how the model's predictions change with given values.

    the main Profiler screen

To profile your model:

  1. Select the feature columns to profile.
  2. Optionally, define the windows or subsets of those feature columns to use in the Profiler.
    • For columns with continuous values, such as ages, you can pick the minimum and the maximum value to use. By default, the original minimum and maximum values are used. Up to 100 values can be selected.
    • For columns with categorical data, such as ticket classes (first, second, third), you can choose to use a subset of those categories. By default, all of the categories are used. Up to 100 values can be selected.
    • For columns with date or time data, you can choose the start and end date or time to use. By default, the window starts at the earliest date or time and ends at the latest date or time.
  3. Optionally, define the default values to use for the columns that weren't selected. By default, the Profiler uses the imputed values provided by your model.
  4. Click Submit.

The Profiler creates one graph for each of the selected feature columns. The values of the feature column are on the x-axis and the predicted values of the target column are on the y-axis. You can then use the slider at the bottom of the graph or click inside the graph to change the value of that feature column. If changing the value of that column affects the other feature columns or the predicted value, the other graphs automatically adjust to reflect that impact.
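
As a rough illustration of what each profile graph represents (not the Profiler's actual code), the sketch below varies one numeric feature over a grid of up to 100 values while holding the other columns at default values, and records the model's prediction at each point. The model object and column names are assumptions, and the features are assumed to be numeric.

import numpy as np
import pandas as pd

def profile_feature(model, df, feature, n_points=100):
    # Sweep the profiled feature across its observed range.
    grid = np.linspace(df[feature].min(), df[feature].max(), n_points)
    # Hold every other (numeric) column at a simple default value.
    defaults = df.median(numeric_only=True)
    rows = pd.DataFrame([defaults] * n_points)
    rows[feature] = grid
    return pd.DataFrame({feature: grid, "prediction": model.predict(rows)})

# Example: curve = profile_feature(fitted_model, training_df, "Age")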

After you save and publish your model, you can share a link to the Profiler by:

  • Clicking the Shareable Link button.
  • Clicking here in the response and sharing the URL of the new tab.

Using the Predictor

The Predictor lets you modify the input values used by your model to see how changes in those values might affect the predicted value.
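
A minimal sketch of this idea follows (as the steps below note, any inputs you leave blank fall back to default values calculated by the model; the fitted_model and default_values names are hypothetical).

import pandas as pd

def predict_with_defaults(fitted_model, default_values, **user_inputs):
    row = dict(default_values)      # start from the model-calculated defaults
    row.update(user_inputs)         # override only the fields the user filled in
    return fitted_model.predict(pd.DataFrame([row]))[0]

# Example: predict_with_defaults(fitted_model, defaults, Age=30, Pclass=1)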

To run a prediction:

  1. Open the Predictor by going to the menu > My Models and then clicking the Predictor tab.

  2. Modify at least one of the input values. Note that you don't need to fill in every value. In fact, you might purposefully leave some empty to see how the model is affected. Any inputs that are left blank use a default value (which is calculated by the model) to create the predicted value.

  3. Click Predict. The model's output appears in the Prediction Results section.

Apply Your Model

After you've built a model that fits your needs (such as accuracy or confidence requirements), you can use Predict or Predict Time Series to run that model on other versions of the dataset you used to train it and create predictions.

For example, you might receive a file in the same format as the one you ran Analyze on above, but without a Z column. You could then use your model to predict the values of the Z column and take action based on those predictions.

To use Predict:

  1. Click the ML button.

  2. Select Predict.

    the predict button

  3. Select the model to use for your prediction.

  4. Click Submit.

    the predict form

To use Predict Time Series:

  1. Click the ML button.

  2. Select Predict Time Series.

    the predict time series button

  3. Select the variables to use in your prediction.

  4. Optionally, select variables to use for partition.

  5. Click Submit.

    the predict time series form

Review Predictions

An easy way to evaluate your model's performance is to create a confusion matrix with the Plot skill. A confusion matrix is a chart that compares the values of a known column (such as Survived) with the values predicted by your model. The values in each section range from 0 to 1. The closer the number is to 1, the better the model is at correctly predicting that value.

note

If your target column has 10 or fewer unique values, you'll be prompted to create a matrix after using Predict.
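
For reference, here is a small sketch of the calculation behind such a matrix (assuming a dataframe with the known Survived column and a predicted column; this is not the Plot skill itself). Each row of counts is normalized so its values fall between 0 and 1.

import pandas as pd

def normalized_confusion_matrix(df, actual="Survived", predicted="Predicted_Survived"):
    counts = pd.crosstab(df[actual], df[predicted])   # raw counts per (actual, predicted) pair
    return counts.div(counts.sum(axis=1), axis=0)     # row-normalize into the 0-1 range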

In some cases, you already know the layout of your data in the database and need to work with only a small number of tables inside DataChat. In these cases, you can create a file that contains the queries that you need and, in conjunction with the Load skill, pull only the data you need into DataChat.

Define Your Queries

First, create a JSON file that contains your SQL queries. In the file, list each dataset name as a key and its SQL query as the value. For example, the SalesQueries file below creates music_data and customer_data datasets in DataChat that contain the results of their SQL queries:

{
"music_data": "SELECT * FROM albums WHERE artist_id > 3",
"customer_data": "SELECT * FROM customers",
}

Then, upload your file to DataChat. Refer to the Upload Files topic for more information.

Use Your Queries

After you've defined and uploaded your queries, use the Load skill to run your queries on your database and create your datasets. For example, to run the queries we defined above on a database called MusicSales, we would run the utterance Load data from the database MusicSales using the queries from SalesQueries.