Skip to main content

Guided Learning

This section explores training models in DataChat using telecommunications contract data.

Load Our Data

To start, Load the provided demo dataset, telcoCustomerChurn, into the session. We're then given a sample of our dataset that looks something like this:

Telco dataset

Describe Our Data

Let's get a better understanding of the data we're working with by clicking Show Descriptive Statistics in the dataset header.

Describe table

Each column now provides some summary statistics about our data, including column types, unique counts, minimum and maximum values, and more.

We see that the target feature is the "Churn" column, meaning that it is influenced by the other columns in the dataset. We can see that the target column has two unique values, "yes" and "no". The column "CustomerID", on the other hand, has 7043 different values.

From here, let's dive a bit deeper into the "Churn" column. Click the More menu > Describe in the churn column. This creates a donut chart breaking down the number of records for each unique value.

donut chart

We can see that 73.46% of contracts do not churn, while 26.54% of contracts do.

Investigate Target Column

Now that we've explored our data, we can use the Data Assistant to do some preliminary investigation of the factors that most impact whether or not a customer churns.

Navigate back to the Data tab, then ask the Data Assistant "Which factors affect Churn".

After completing this process, it provides a feature importance chart that visualizes how each column in the dataset affects the target column, "Churn," in the conversation history:

feature chart

Click Add to Chart tab.

feature chart in chart tab

Hover over each bar in the chart to view more details. We can see that "Contract" followed by "InternetService" have the most impact on "Churn".

From here, we can click Charts in the chart header. This provides a dropdown of different auto-generated charts that visualize the most impactful columns. By default, Visualize opens with Chart1A, in this case a series of donut charts displaying the churn ratios for each contract type across internet services.

chart1A

We can see that across all internet service types, month-to-month contracts have the highest churn, while one-year and two-year contracts tend to have substantially less churn.

Let's click the Charts dropdown and select Chart1B.

chart1B

This bubble chart offers an alternative visualization of the relationship between contracts, internet service types, and churn rates. Notably, the highest churn ratio occurs with month-to-month contracts, particularly for fiber optic service, where the number of customers churning exceeds those who stay.

Train the Target Column

Let's get more granular with our investigation by building a machine learning model using the four most impactful columns ("Contract", "InternetService", "OnlineSecurity", and "tenure") from our initial feature importance chart.

Create a Model

Navigate back to the Data tab, click Machine Learning > Train Model from the skill menu, then:

  1. Select "Churn" for the target column.
  2. Select "Contract", "InternetService", "OnlineSecurity", and "tenure" for the feature columns to include.
  3. Set Generate Charts to Visualize the Data toggle to "On".
  4. Set Use Default Model Exploration toggle to "On".
  5. Click Submit.

train form

DataChat trains multiple models and chooses the one that best predicts the impact of other columns on the target column. Once complete, the impact chart is automatically displayed in the Chart tab:

impact chart

This impact chart differs from our feature importance chart by showing that contract type and internet services have a more comparable influence on churn. Additionally, it reveals an increased impact of tenure.

Models Scores and Pipeline Report

Let's investigate the models used in training. Click on Models to learn more about the model, which is called "BestFit1".

models section

This shows us that two model types were used in training, a LightGBM Classifier with an accuracy score of 74%, and CatBoost Classifier with another accuracy score of 75%.

Let's also investigate the pipeline report provided for the model. Click Pipeline Report to view a table that breaks down the training stages with details on the parameters used to preprocess the data and train the model.

pipeline report section

This report highlights several key details, including feature imputation for columns with missing values, the absence of eligible columns for optimal binning, and adjustments made to address target imbalances.

Predict on the Target Column

Now that we have a model built, we can predict with our BestFit1 model. Navigate back to the Data tab, then:

  1. Click Machine Learning > Predict in the skill menu.
  2. Select "BestFit1" for the model.
  3. Select "Dataset" for the content.
  4. Click Submit.

predict form

DataChat then generates a table called "PredictionsChurn", which includes two appended columns, "ChurnPredicted" and "ChurnActual", indicating whether a customer is predicated to churn, and whether or not they actually did.

predict with model

DataChat also tests the model generated by Train Model above, "BestFit1", and displays testing scores in the conversation history.

conversation history

This displays the multi-classification AUC score as 76% and the accuracy score as 75%.