Version: 0.22.2

Guided Learning

In this chapter, we'll explore how we can use DataChat to train models on our data.

Start a new session

  1. Download a file. Go to Students Adaptability Level in Online Education on Kaggle.com and download "archive.zip", which contains "students_adaptability_level_online_education.csv".

  2. Open a new session in DataChat. From the homepage, click New > Session. The Load Data form appears by default. You can turn off data recommendations in Settings > Sessions.

  3. Load the file "students_adaptability_level_online_education.csv" into our new session. The session opens in Grid mode by default. You can set default mode to either Grid or Notebook in Settings > Sessions.

  4. Let's switch to Notebook mode to focus more directly on our output. Click the Notebook mode button at the top right of the session.

    ML Guided Learning mode switch

  5. Name the session "StudentAdaptability".

  6. Let's shorten the dataset name. Double-click on the name, "students_adaptability_level_online_education", and rename it to "OnlineAdaptability".

    ML Guided Learning dataset name change

  7. Press Space in the chat box at the bottom right corner. By default, the Show sample of current dataset toggle is "On" and a popup appears showing a sample of the current dataset, as well as autocomplete suggestions for available DataChat skills.

    ML Guided Learning dataset name change

If no popup appears, you might want to turn this setting "On".

Get to Know the Data

  1. Click Dataset > Describe in the sidebar. A table appears in the display panel, named "OnlineAdaptability_Describe", that displays information about each of the columns.

    ML Guided Learning Describe dataset

    For datasets with primarily numerical data, we might use Describe to assess the dataset's columns. However, this dataset contains only string values, so we'll rely on the description of the dataset. We can see that the target feature is the "AdaptivityLevel" column, meaning that it is influenced by the other columns in the dataset. Notice that the target column has three unique values. The "Age" column, on the other hand, has six distinct age ranges.

  2. Click on the link for the "AdaptivityLevel" column to run a DataChat sentence with Describe the column.

    ML Guided Learning Describe Column

  3. A donut chart that describes the column appears in the display panel. Hover over the chart to see details about the sections.

    ML Guided Learning Describe Column Image

    We can see that the values are imbalanced. They represent a natural setting where students with "High" adaptability are rare. The majority of students are "Moderate", while over a third struggle to adapt to their education setup. The chat history displays further statistics about the column.

  4. Let's view another way to explore our data. Click ML > Visualize in the sidebar. The Visualize form appears. Select "AdaptivityLevel" as the key performance indicator (KPI) and click Submit.

    ML Guided Learning Visualize form

  5. The same donut chart appears in the display panel. The Visualize skill displays the first chart listed in the chat history, "Chart1A".

    ML Guided Learning Visualize

  6. Click the link for "Chart1B" to display a donut chart with subplots for the "Age" column, and "Chart1C" to see a bubble chart splitting by "Age" and "Gender".

    ML Guided Learning Visualize
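
Outside of DataChat, the same imbalance check can be sketched in a few lines of pandas. This is only an illustration: the miniature DataFrame below is an invented stand-in for the real Kaggle dataset, and the proportions are made up.

```python
import pandas as pd

# Invented miniature stand-in for the Kaggle dataset's target column.
df = pd.DataFrame({
    "AdaptivityLevel": ["Moderate"] * 5 + ["Low"] * 4 + ["High"] * 1
})

# Relative frequency of each class -- the distribution the donut chart shows.
distribution = df["AdaptivityLevel"].value_counts(normalize=True)
print(distribution)
```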

Analyze the Target Column

Now that we've explored our data, we'll use machine learning techniques to reveal patterns. The Analyze skill trains a set of machine learning models and selects the model that best predicts how other columns in our dataset impact our target column.

  1. Click ML > Analyze in the sidebar. The Analyze form opens. Let's first run Analyze on our target column with only default optimizations. Select "AdaptivityLevel" from the dropdown and click Submit.

    ML Guided Learning Analyze 1

    Analyze takes a little time to train various models on the data and then to select the best one. The chat box displays the steps.

  2. A bar chart appears in the display panel that shows how each column in the dataset impacts the dependent column "AdaptivityLevel".

    ML Guided Learning Analyze 1 Result 1

    Hover over each bar in the chart to view more details. We can see that "FinancialCondition" and "ClassDuration" have the most impact on "AdaptivityLevel".

  3. The chat history includes further insights about the analysis. In the Visualize Trends section, click on the links for Chart2A and Chart2B to view other visualizations.

  4. Click on Model Stats to learn more about the model, which is called "BestFit2".

    ML Guided Learning Analyze Result 2

  5. Click on Impact Scores Explained for further context. We'll look into how to use this information in more detail later.

    ML Guided Learning Analyze 1 Result 3

  6. Click on the pipeline report link to view a table of the pipeline report in the display panel. The table breaks down Analyze's stages with details on the parameters used to preprocess the data and to train the model.

    ML Guided Learning Analyze 1 Result 3

  7. Click on the blue plus signs (+) in the "PipelineReport_Preview" table to see more details. We can see that for the Class Imbalance stage, the model uses an oversampling method to correct for the imbalance in the "AdaptivityLevel" column. You can use the Normalized Shannon Entropy score to gauge how imbalanced a column might be.
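
The Normalized Shannon Entropy score mentioned in the pipeline report has a standard definition: the entropy of the class distribution divided by log(k) for k classes, so 1.0 means perfectly balanced and values near 0 mean a single class dominates. A minimal sketch of that standard formula (DataChat's exact computation is not documented here and may differ):

```python
import math
from collections import Counter

def normalized_shannon_entropy(values):
    """Entropy of the class distribution divided by log(k).

    Returns 1.0 for perfectly balanced classes and values near 0.0 when a
    single class dominates. (Standard definition; DataChat's internal
    computation may differ.)
    """
    counts = Counter(values)
    n = len(values)
    k = len(counts)
    if k < 2:
        return 0.0
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(k)

balanced = ["Low", "Moderate", "High"] * 10
skewed = ["Moderate"] * 28 + ["Low", "High"]
print(normalized_shannon_entropy(balanced))  # ~1.0: no imbalance to fix
print(normalized_shannon_entropy(skewed))    # well below 1.0: heavily imbalanced
```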

Test the Model with a Test Split

  1. Let's run the same analysis with a test split and detailed scores. This will show us the performance of each category of adaptability levels. Enter in the chat box:

    Analyze AdaptivityLevel with detailed model scores setting test holdout percentage as 20

    ML Guided Learning Analyze 2

  2. The impact score chart appears in the display panel. The new model is called "BestFit3".

    ML Guided Learning Analyze 2 Result 1

    The Analyze results have changed, notably with "FinancialCondition" dropping into second place with a significant reduction in impact score.

  3. Click on the pipeline report link. We can see from the AnalyzeStage column that Split Dataset shows the split into training (80%) and testing (20%) for the "BestFit3" model.

    ML Guided Learning Analyze 2 Result 2

  4. Let's use the Predict skill on the test holdout set. Enter in the chat box:

    Predict using BestFit3 on current and display prediction probabilities

    ML Guided Learning Predict 1 Result 1

    Predict generates a table called "PredictionsAdaptivitylevel", which includes three appended columns of the probability classes for the three values of the target column "AdaptivityLevel": "High", "Low", and "Moderate".

  5. Predict also tests the model generated by Analyze above, "BestFit3", that includes a test split, and displays testing scores in the chat history.

    ML Guided Learning Predict 1 Result 2

  6. Click the Generate link on the last line to display the confusion matrix, which compares "AdaptivitylevelActual" from the 20% test portion of the split with "AdaptivitylevelPredicted", predicted by the model trained on the 80% portion.

    ML Guided Learning Predict 1 Result 3

    We see that our model, "BestFit3", is highly sensitive to signals that correspond to high adaptability levels, with a 98% probability of correct predictions for "High" levels. The probability of correct predictions decreases for the other levels: 93% for "Moderate" and 81% for "Low".

    We noted the inherent imbalance in the dataset from our initial Visualize chart. To compensate for the imbalance, Analyze automatically uses industry-standard imbalance handling mechanisms, such as oversampling, to focus on high adaptability levels. Statistically, the highest number of false positives (11%) occur when the model predicts a "High" adaptivity level but the real adaptivity level is "Moderate."
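
The holdout workflow above can be sketched with scikit-learn. This is a generic illustration of an 80/20 split, a confusion matrix, and per-class prediction probabilities on synthetic data; it is not DataChat's actual pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the tutorial's data: 3 imbalanced classes.
X, y = make_classification(
    n_samples=1000, n_classes=3, n_informative=6,
    weights=[0.4, 0.5, 0.1], random_state=0,
)

# 80/20 split, mirroring "setting test holdout percentage as 20".
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0,
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Rows are actual classes, columns predicted classes -- the same comparison
# the confusion matrix plots for actual vs. predicted adaptivity levels.
cm = confusion_matrix(y_test, model.predict(X_test))
print(cm)

# Per-row class probabilities, like "display prediction probabilities".
proba = model.predict_proba(X_test)
```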

Test the Model without Imbalance Handling

Let's turn off imbalance handling to see if the model's performance changes.

  1. Enter a space in the chat box to see that the current dataset is now the table generated by the last Predict, "PredictionsAdaptivitylevel". Let's reset the current dataset. Enter in the chat box:

    Use the dataset OnlineAdaptability

    We can also scroll up to the top of the display panel and click on the Use this dataset button in the header of the "OnlineAdaptability" table.

  2. To train another model that doesn't automatically fix class imbalances, enter in the chat box:

    Analyze AdaptivityLevel with detailed model scores without fixing class imbalances setting test holdout percentage as 20

  3. Let's again use the Predict skill on the newly generated test holdout set. Enter in the chat box:

    Predict using BestFit4 on current and display prediction probabilities

  4. Click Generate at the bottom of the chat history message to plot the confusion matrix.

    ML Guided Learning Predict 2 Result 1

    Without automatic imbalance handling, our model begins to represent the inherent statistical distribution in the data. The "Moderate" category has the most values, and our new model's predictions are now focused on those values. From the testing scores in the chat history, we can see an increase in accuracy, but a decrease in recall:

    Model Name | Method Used          | Accuracy | Recall (Sensitivity)
    BestFit3   | With oversampling    | 88%      | 0.91
    BestFit4   | Without oversampling | 91%      | 0.86
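
The accuracy-up, recall-down pattern in the table has a simple explanation: a model biased toward the majority class gets many easy hits but misses the rare classes entirely. A toy illustration with invented predictions:

```python
def accuracy(actual, predicted):
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def macro_recall(actual, predicted):
    """Mean per-class recall: each class counts equally, however rare."""
    recalls = []
    for c in set(actual):
        hits = sum(1 for a, p in zip(actual, predicted) if a == c and p == c)
        total = sum(1 for a in actual if a == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)

# Invented ground truth: 8 "Moderate", 1 "Low", 1 "High".
actual = ["Moderate"] * 8 + ["Low", "High"]

# A model biased toward the majority class (no imbalance handling):
biased = ["Moderate"] * 10
# A model that trades majority-class hits for minority-class hits:
rebalanced = ["Moderate"] * 5 + ["Low"] * 3 + ["Low", "High"]

print(accuracy(actual, biased), macro_recall(actual, biased))          # 0.8 0.333...
print(accuracy(actual, rebalanced), macro_recall(actual, rebalanced))  # 0.7 0.875
```

The biased model wins on accuracy but its macro recall collapses, which is the same trade-off the table above records.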

Tune the Model with Target Weighting

Let's tune our model to focus on a specific set of adaptivity levels: a model that determines which factors are important only for "Low" and "High" adaptivity levels. We'll assign weights to these labels.

  1. To use Analyze with target weighting, we require a separate dataset that specifies the weights we want to apply to each label. Click Dataset > Create in the sidebar to open the Manage your datasets form. Click create a dataset from scratch to open the Create Dataset form:

    ML Guided Learning Analyze 4 Dataset

    In the first row, create two columns called "labelName" and "labelWeight". Fill the rows with the values "Low", "Moderate", and "High" and the weights 100, 10, and 1000, respectively. Double-click sheet1 to rename the dataset to "labelWeights". Click Save Dataset and close the form to return to our session.

    We can also enter in the chat box:

    Create a new dataset labelWeights with columns labelName, labelWeight and values Low, 100, Moderate, 10, High, 1000

  2. We'll again reset the current dataset. Enter into the chat box:

    Use the dataset OnlineAdaptability

    We can also use the up-arrow key in the chat box to find our previous usage.

  3. Now we're ready to Analyze our original dataset again, tuning with our newly created target-weighting table. Enter into the chat box:

    Analyze AdaptivityLevel with detailed model scores weighting with labelWeights setting test holdout percentage as 20

  4. The impact scores chart appears in the display panel.

    ML Guided Learning Analyze 4 Result 1

  5. Click Model Stats to open the panel and under F1 score, click here. The "Classwise_F1_scores_Preview" table appears in the display panel.

    ML Guided Learning Analyze 4 Result 2

    The F1 score, the harmonic mean of precision and recall, is another measure of the model's accuracy.

  6. Click pipeline report to view the "PipelineReport_Preview" table.

    ML Guided Learning Analyze 4 Result 3

    We can see in the "AnalyzeStage" column that the "Class Imbalance" stage uses the weight table we created; its values are available if we click the blue plus sign (+) in the last column. Also note that the "ModelType" has changed from a "Lightgbm Classifier" to a "Random Forest Classifier" alongside our parameter changes.

  7. Let's reset our dataset again and test our model with Predict. Enter into the chat box:

    Use the dataset OnlineAdaptability
    Predict using BestFit5 on current and display prediction probabilities

    Or we can use our up-arrow key to find each DataChat sentence and enter it. Don't forget to update the model to "BestFit5" for the Predict.

  8. Again, click Generate to plot a confusion matrix.

    ML Guided Learning Predict 4 Result 1

    From the classwise F1 scores and the confusion matrix, we can see that our models can be heavily tuned to focus on specific adaptivity values. As the impact chart shows, the important factors have shifted slightly: "EducationLevel" now plays a much greater role in determining adaptivity. Even though "Age" and "EducationLevel" may be correlated, a model focused on the extreme ends of the spectrum suggests that adaptivity level is systematic rather than incidental. "FinancialCondition" also plays a large role in determining whether a student falls in the middle of the distribution or at the extremes.

    Model Name | Method Used             | Recall
    BestFit4   | Without label weighting | 0.8
    BestFit5   | With label weighting    | 1.0
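
A weight table like "labelWeights" typically maps onto class weights at training time. The sketch below shows that generic technique with scikit-learn's Random Forest (the model type the pipeline report named); the feature encoding and the toy training data are invented, and DataChat's internal use of the weights may differ.

```python
from sklearn.ensemble import RandomForestClassifier

# The labelWeights table from step 1, expressed as a class-weight mapping.
label_weights = {"Low": 100, "Moderate": 10, "High": 1000}

# Invented toy training set; the two features might stand for encoded
# "ClassDuration" and "FinancialCondition" (not DataChat's real encoding).
X = [[0, 0], [1, 0], [1, 1], [2, 1], [2, 2], [3, 2]]
y = ["Low", "Low", "Moderate", "Moderate", "High", "High"]

# class_weight scales each class's contribution to the training loss, so
# mistakes on "High" (weight 1000) cost far more than on "Moderate" (10).
model = RandomForestClassifier(class_weight=label_weights, random_state=0)
model.fit(X, y)
print(model.predict([[0, 0], [3, 2]]))
```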

Plot Confidence Distributions

  1. Check that the current dataset is still "PredictionsAdaptivitylevel". If not, enter into the chat box:

    Use the dataset PredictionsAdaptivitylevel

  2. Let's plot histograms for the two "AdaptivityLevel" values we've weighted for "Low" and "High". Enter into the chat box:

    Plot a histogram with the x-axis Probability_Class_High
    Plot a histogram with the x-axis Probability_Class_Low

    high probability histogram

    low probability histogram

    The histogram plots show the confidence levels of the predictions. We can see that the model is fairly confident when predicting for lower adaptivity levels, but the same can't be said for higher levels.

  3. Let's observe the same confidence distribution plots for our earlier oversampled model. Enter into the chat box:

    Use the dataset PredictionsAdaptivitylevel , version 1
    Plot a histogram with the x-axis Probability_Class_High
    Plot a histogram with the x-axis Probability_Class_Moderate
    Plot a histogram with the x-axis Probability_Class_Low

    second high probability histogram

    moderate probability histogram

    second low probability of histogram

  4. Let's observe the same confidence distribution plots for the earlier model trained without oversampling. Enter into the chat box:

    Use the dataset PredictionsAdaptivitylevel , version 2
    Plot a histogram with the x-axis Probability_Class_High
    Plot a histogram with the x-axis Probability_Class_Moderate
    Plot a histogram with the x-axis Probability_Class_Low

    third high probability histogram

    second moderate probability histogram

    third low probability histogram

    We can see that models trained without oversampling are poorly calibrated when predicting high adaptivity levels. The oversampled models do somewhat better, but not as well as the weighted model.
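
The bucketing these histograms perform can be sketched with NumPy. The probability values below are invented stand-ins for the "Probability_Class_High" column:

```python
import numpy as np

# Invented stand-ins for the Probability_Class_High column's values.
probs = np.array([0.05, 0.12, 0.15, 0.22, 0.55, 0.61, 0.91, 0.95, 0.97, 0.99])

# Bucket the confidences into ten equal-width bins, as the histograms do.
counts, edges = np.histogram(probs, bins=10, range=(0.0, 1.0))
print(counts)  # mass near 1.0 = confident "High" calls; middle bins = uncertainty
```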