Version: 0.32.2

Use Rename, Keep, and Extend to Train Models

In this example, we'll explore two datasets, learning how to keep specific rows, drop specific columns, and extend one dataset with another to test our hypothesis that a country's overall happiness is affected more by alcohol consumption than by GDP per capita.

Load Data

To get started, click on the links below to download the 2018 World Happiness Report dataset and the 2018 Alcohol Consumption dataset.

Now we can Load the files into the session. We now see two datasets that look like this:

Alcohol Consumption dataset

2018 Alcohol Consumption

This dataset contains the following columns:

  • Entity. The country.
  • Code. A code given for the country name.
  • Year. The year of the study.
  • TotalAlcoholConsumptionPerCapita. The total alcohol consumption per capita.

World Happiness dataset

2018 Happiness Report

This dataset contains the following columns:

  • OverallRank. The rank of a country's happiness compared to others.
  • CountryOrRegion. The country or region.
  • Score. The overall happiness score.
  • GDPPerCapita. The GDP per capita.
  • SocialSupport. How much support the country provides.
  • HealthyLifeExpectancy. The average life expectancy.
  • FreedomToMakeLifeChoices. How much freedom one has to make life choices.
  • Generosity. How generous their fellow citizens are.
  • PerceptionsOfCorruption. How corrupt the country appears to be.

Rename Datasets

The names of our datasets contain a lot of random characters and are not very clear. Before moving forward, let's rename them to "Alcohol_Consumption" and "Happiness2018". To Rename the datasets:

  1. Double-click the dataset name next to the version.
  2. Enter the new dataset name. By default, renaming saves automatically.

renamed 2018 Happiness Report

Clean Our Data

To be able to analyze how alcohol consumption affects a country's overall happiness, we must first clean and wrangle our data to make sure it's ready to use with machine learning.

2018 World Happiness Data

We can now begin cleaning the Happiness2018 dataset. Currently, our dataset has information that we don't really need in order to analyze the impact alcohol consumption has on overall happiness. Let's use the Drop skill to remove the unneeded columns:

  1. Click Column > Drop to open the Drop form.
  2. Enter "OverallRank" for the column.
  3. Click Submit.

drop form

Now our dataset no longer has the OverallRank column:

dropped column
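For readers who think in code, the Drop step can be sketched in plain Python. This is only an illustration of the idea, not how the skill is implemented: the `drop_columns` helper is hypothetical, and the rows are made-up stand-ins for the Happiness2018 data.

```python
# Toy rows standing in for the Happiness2018 dataset; the values are made up.
happiness = [
    {"OverallRank": 1, "CountryOrRegion": "Country A", "Score": 7.5},
    {"OverallRank": 2, "CountryOrRegion": "Country B", "Score": 7.1},
]


def drop_columns(rows, columns):
    """Return new rows without the named columns (hypothetical helper)."""
    unwanted = set(columns)
    return [
        {name: value for name, value in row.items() if name not in unwanted}
        for row in rows
    ]


# Equivalent in spirit to "Drop the columns OverallRank".
happiness_dropped = drop_columns(happiness, ["OverallRank"])
print(happiness_dropped[0])
```

The helper builds new rows instead of modifying the originals, so the original rows stay available, much like an earlier dataset version.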

Alcohol Consumption Data

We can now begin cleaning our other dataset. Let's make sure we're working with the right one. Click the dataset name, "Alcohol_Consumption", to make it the current dataset.

Like our Happiness2018 dataset, this dataset has information that we don't really need in order to analyze the impact alcohol consumption has on overall happiness. Since we only want data from 2018, let's use the Keep skill to keep only the rows we need:

  1. Click Row > Keep to open the Keep form.
  2. Enter "Year" for the column, "is equal to" for the expression, "the value" for the value type, and "2018" for the value.
  3. Click Submit.

the keep form

Now we see a dataset with reports from only the year 2018.

Keep Rows
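The Keep step is a row filter. A minimal sketch in plain Python, assuming the Year column holds integers (if it were loaded as text, the comparison value would need to be "2018" instead); `keep_rows` is a hypothetical helper and the rows are made up:

```python
# Toy rows standing in for the Alcohol_Consumption dataset; the values are made up.
alcohol = [
    {"Entity": "Country A", "Code": "AAA", "Year": 2017,
     "TotalAlcoholConsumptionPerCapita": 9.0},
    {"Entity": "Country A", "Code": "AAA", "Year": 2018,
     "TotalAlcoholConsumptionPerCapita": 9.2},
    {"Entity": "Country B", "Code": "BBB", "Year": 2018,
     "TotalAlcoholConsumptionPerCapita": 4.1},
]


def keep_rows(rows, column, value):
    """Keep only the rows whose column equals the given value (hypothetical helper)."""
    return [row for row in rows if row[column] == value]


# Equivalent in spirit to "Keep the rows where Year is equal to the value 2018".
alcohol_2018 = keep_rows(alcohol, "Year", 2018)
print(len(alcohol_2018))
```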

From here, we can also Drop columns we don't really need to analyze, including the columns Code and Year. To drop columns:

  1. Click Column > Drop to open the Drop form.
  2. Enter "Code" and "Year" for the columns.
  3. Click Submit.

drop form

Our dataset now looks something like this:

Dropped Columns

From here, we can use the Rename skill again to match the column names for the countries between the datasets. This will make it easier to extend the datasets later:

  1. Double-click on the column name "Entity".
  2. Enter "CountryOrRegion".
  3. Press Enter.

rename column
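Renaming a column changes only its label; the values and the column order stay the same. A small sketch of that idea in plain Python, using a hypothetical `rename_column` helper and made-up rows:

```python
# Toy rows after the Keep and Drop steps; the values are made up.
alcohol_2018 = [
    {"Entity": "Country A", "TotalAlcoholConsumptionPerCapita": 9.2},
    {"Entity": "Country B", "TotalAlcoholConsumptionPerCapita": 4.1},
]


def rename_column(rows, old, new):
    """Return new rows with one column renamed, preserving column order."""
    return [
        {(new if name == old else name): value for name, value in row.items()}
        for row in rows
    ]


# Equivalent in spirit to "Rename the column Entity to CountryOrRegion".
alcohol_renamed = rename_column(alcohol_2018, "Entity", "CountryOrRegion")
print(list(alcohol_renamed[0]))
```

With the country column named the same in both datasets, the two tables share a column to match rows on.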

Extend Our Data

We can now combine our two datasets using the Extend skill. This way, all of the relevant information can be found in one dataset.

Let's extend the cleaned 2018 World Happiness dataset with our cleaned 2018 Alcohol Consumption dataset:

  1. Click the "Happiness2018" dataset to make it the current dataset.
  2. Click Dataset > Extend to open the Extend form.
  3. Enter "Alcohol_Consumption" for the dataset to extend with.
  4. Click Submit.

extend form

The new dataset should look something like this:

Extend Dataset

As you can see, the TotalAlcoholConsumptionPerCapita column has been added to the Happiness2018 data, and the resulting dataset is called Happiness2018_Extend.
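One way to picture the result is as a left join on the shared CountryOrRegion column. That is an assumption made for illustration: this tutorial does not spell out how Extend matches rows or what it does with countries that have no match. The `extend` helper and the rows below are hypothetical.

```python
# Toy rows standing in for the two cleaned datasets; the values are made up.
happiness = [
    {"CountryOrRegion": "Country A", "Score": 7.5},
    {"CountryOrRegion": "Country B", "Score": 7.1},
    {"CountryOrRegion": "Country C", "Score": 6.8},
]
alcohol = [
    {"CountryOrRegion": "Country A", "TotalAlcoholConsumptionPerCapita": 9.2},
    {"CountryOrRegion": "Country B", "TotalAlcoholConsumptionPerCapita": 4.1},
]


def extend(left, right, key):
    """Left-join right onto left by key (assumed behavior, hypothetical helper)."""
    lookup = {row[key]: row for row in right}
    # Columns that the right-hand dataset contributes, excluding the join key.
    extra_columns = [name for name in (right[0] if right else {}) if name != key]
    extended = []
    for row in left:
        match = lookup.get(row[key], {})
        merged = dict(row)
        for name in extra_columns:
            # Rows without a match get None for the added columns.
            merged[name] = match.get(name)
        extended.append(merged)
    return extended


happiness_extended = extend(happiness, alcohol, "CountryOrRegion")
for row in happiness_extended:
    print(row)
```

In this sketch every row of the left dataset survives, and a country missing from the right dataset simply has an empty value in the new column.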

Train an ML Model

Now that all of our relevant data is in one dataset, we can use Train to determine how much impact alcohol consumption has on overall happiness compared to the other variables:

  1. Click ML > Train Model to open the Train ML Model form.
  2. Enter "Score" for the column.
  3. Click Submit.

train form

This generates an impact chart that looks something like this:

impact Chart

As you can see, total alcohol consumption has a minimal impact on overall happiness in comparison to social support, GDP per capita, life expectancy, and freedom. From this, we can conclude that our hypothesis, that a country's overall happiness is affected more by alcohol consumption than by GDP per capita, is wrong.
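As a rough cross-check of a comparison like this, we could compute a Pearson correlation between each input column and Score. This is a simpler, swapped-in technique and not how the impact chart is produced. The `pearson` helper is hypothetical, and the numbers are toy values chosen to make the arithmetic easy, not figures from either dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy columns (made up): one input rises with the score, the other does not.
score = [2.0, 4.0, 6.0, 8.0]
gdp_per_capita = [1.0, 2.0, 3.0, 4.0]
alcohol_per_capita = [3.0, 1.0, 4.0, 2.0]

gdp_r = pearson(gdp_per_capita, score)
alcohol_r = pearson(alcohol_per_capita, score)
print(round(gdp_r, 3), round(alcohol_r, 3))
```

A correlation only captures a linear, one-variable-at-a-time relationship, so it can disagree with a trained model's impact ranking.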

Full Recipe

Here is the entire recipe we used in this topic:

Load data from the file 2018-0ba7ae108b9b2765994baad38fcde5e2.csv
Rename the current dataset to Happiness2018 {"overwrite":true}
Load data from the file Alcohol-consumption-9cc25bb03ab769310171f9a34a0156ea.csv
Rename the current dataset to Alcohol_Consumption {"overwrite":true}
Use the dataset called Happiness2018 version 1
Drop the columns OverallRank
Use the dataset called Alcohol_Consumption version 1
Keep the rows where Year is equal to the value 2018
Drop the columns Code, Year
Rename the column Entity to CountryOrRegion
Extend the dataset Happiness2018 with the dataset Alcohol_Consumption
Train an ML model on Score with the config {...} and generate charts for data visualization