Use Rename, Keep, and Extend to Train Models
In this example, we'll explore two datasets and learn how to keep specific rows, drop specific columns, and extend one dataset with another. Along the way, we'll test our hypothesis that a country's overall happiness is affected more by alcohol consumption than by GDP per capita.
Load Data
To get started, click on the links below to download the 2018 World Happiness Report dataset and the 2018 Alcohol Consumption dataset.
Now we can Load the files into the session. We now see two datasets that look like this:
Alcohol Consumption dataset
This dataset contains the following columns:
- Entity. The country.
- Code. A code given for the country name.
- Year. The year of the study.
- TotalAlcoholConsumptionPerCapita. The total alcohol consumption per capita.
World Happiness dataset
This dataset contains the following columns:
- OverallRank. The rank of a country's happiness compared to others.
- CountryOrRegion. The country or region.
- Score. The overall happiness score.
- GDPPerCapita. The GDP per capita.
- SocialSupport. How much support the country provides.
- HealthyLifeExpectancy. The average life expectancy.
- FreedomToMakeLifeChoices. How much freedom one has to make life choices.
- Generosity. How generous their fellow citizens are.
- PerceptionsOfCorruption. How corrupt the country appears to be.
Rename Datasets
The names of our loaded datasets contain long strings of random characters and are hard to read. Before moving forward, let's rename them to "Alcohol_Consumption" and "Happiness2018". To Rename the datasets:
- Double-click the dataset name next to the version.
- Enter the new dataset name. By default, renaming saves automatically.
Clean Our Data
To be able to analyze how alcohol consumption affects a country's overall happiness, we must first clean and wrangle our data to make sure it's ready to use with machine learning.
2018 World Happiness Data
We can now begin cleaning the Happiness2018 dataset. It currently contains information we don't need in order to analyze the impact alcohol consumption has on overall happiness. Let's use the Drop skill to remove the unneeded column:
- Click Column > Drop to open the Drop form.
- Enter "OverallRank" for the column.
- Click Submit.
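For readers who think in code, the same step can be sketched in pandas. This is only an analogy, not what the Drop skill runs internally, and the tiny table below is a made-up stand-in (placeholder country names and scores) for the real Happiness2018 data:

```python
import pandas as pd

# Toy stand-in for Happiness2018; values are placeholders, not report figures.
happiness = pd.DataFrame({
    "OverallRank": [1, 2, 3],
    "CountryOrRegion": ["A", "B", "C"],
    "Score": [7.0, 6.5, 6.0],
})

# Drop the column we don't need; drop() returns a new DataFrame.
happiness = happiness.drop(columns=["OverallRank"])
print(list(happiness.columns))
```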
Now our dataset no longer has the OverallRank column:
Alcohol Consumption Data
We can now begin cleaning our other dataset. Let's make sure we're working with the right one. Click the dataset name, "Alcohol_Consumption", to make it the current dataset.
Like our Happiness2018 dataset, this dataset contains information we don't need in order to analyze the impact alcohol consumption has on overall happiness. Since we only want data from 2018, let's use the Keep skill to keep only the rows we need:
- Click Row > Keep to open the Keep form.
- Enter "Year" for the column, "is equal to" for the expression, "the value" for the value type, and "2018" for the value.
- Click Submit.
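The Keep step above behaves like a row filter. A rough pandas equivalent, assuming the Year column holds integers and using made-up placeholder rows rather than the real alcohol data, might look like this:

```python
import pandas as pd

# Toy stand-in for Alcohol_Consumption; values are placeholders.
alcohol = pd.DataFrame({
    "Entity": ["A", "A", "B"],
    "Code": ["AAA", "AAA", "BBB"],
    "Year": [2017, 2018, 2018],
    "TotalAlcoholConsumptionPerCapita": [1.0, 1.5, 2.0],
})

# Keep only the rows where Year is equal to the value 2018.
alcohol_2018 = alcohol[alcohol["Year"] == 2018].reset_index(drop=True)
print(alcohol_2018)
```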
Now we see a dataset with reports from only the year 2018.
From here, we can also Drop the columns we don't need to analyze, namely Code and Year. To drop columns:
- Click Column > Drop to open the Drop form.
- Enter "Code" and "Year" for the columns.
- Click Submit.
Our dataset now looks something like this:
From here, we can use the Rename skill again so that the country column has the same name in both datasets. This will make it easier to extend the datasets later:
- Double-click on the column name "Entity".
- Enter "CountryOrRegion".
- Press Enter.
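In code terms, dropping Code and Year and then renaming Entity amounts to two chained column operations. The pandas sketch below is an illustration under the same assumptions as before (toy placeholder rows, not the real dataset):

```python
import pandas as pd

# Toy stand-in for the 2018-only alcohol data; values are placeholders.
alcohol_2018 = pd.DataFrame({
    "Entity": ["A", "B"],
    "Code": ["AAA", "BBB"],
    "Year": [2018, 2018],
    "TotalAlcoholConsumptionPerCapita": [1.5, 2.0],
})

alcohol_clean = (
    alcohol_2018
    .drop(columns=["Code", "Year"])                 # remove unneeded columns
    .rename(columns={"Entity": "CountryOrRegion"})  # match the happiness data
)
print(list(alcohol_clean.columns))
```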
Extend Our Data
From here, we can combine our two datasets using the Extend skill. This way, all of the relevant information can be found in one dataset.
Let's extend the cleaned 2018 World Happiness dataset with our cleaned 2018 Alcohol Consumption dataset:
- Click the "Happiness2018" dataset to make it the current dataset.
- Click Dataset > Extend to open the Extend form.
- Enter "Alcohol_Consumption" for the dataset to extend with.
- Click Submit.
The new dataset should look something like this:
As you can see, TotalAlcoholConsumptionPerCapita has been added to the Happiness2018 data, and the resulting dataset is now called Happiness2018_Extend.
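One way to picture this step is as a join on the shared CountryOrRegion column. The exact matching rules of the Extend skill aren't spelled out here, so the pandas sketch below assumes left-join behavior (every happiness row is kept, and countries without alcohol data get a missing value) and uses toy placeholder rows:

```python
import pandas as pd

# Toy stand-ins for the two cleaned datasets; values are placeholders.
happiness = pd.DataFrame({
    "CountryOrRegion": ["A", "B", "C"],
    "Score": [7.0, 6.5, 6.0],
})
alcohol = pd.DataFrame({
    "CountryOrRegion": ["A", "B"],
    "TotalAlcoholConsumptionPerCapita": [1.5, 2.0],
})

# Assumed behavior: a left join on the shared country column.
extended = happiness.merge(alcohol, on="CountryOrRegion", how="left")
print(extended)
```

If Extend instead keeps only the countries present in both datasets, `how="inner"` would be the closer analogy.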
Train an ML Model
Now that all of our relevant data is in one dataset, we can use Train to determine the impact that alcohol consumption has on overall happiness compared to the other variables:
- Click ML > Train Model to open the Train ML Model form.
- Enter "Score" for the column.
- Click Submit.
This generates an impact chart that looks something like this:
As you can see, total alcohol consumption has a minimal impact on overall happiness in comparison to social support, GDP per capita, life expectancy, and freedom. From this, we can conclude that our hypothesis, that a country's overall happiness is affected more by alcohol consumption than by GDP per capita, is not supported by the data.
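To get a feel for what an impact comparison measures, here is a rough stand-in written with NumPy. It does not reproduce the Train skill, whose model isn't described here. Instead, it fits a plain least-squares regression on standardized columns and compares the absolute coefficients. The data is synthetic and deliberately built so that the GDP-like feature has the larger effect, so the output only demonstrates the mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic features, generated only to exercise the code; not real data.
gdp = rng.normal(size=n)
alcohol = rng.normal(size=n)
# Synthetic target in which gdp is deliberately given the larger effect.
score = 2.0 * gdp + 0.2 * alcohol + rng.normal(scale=0.1, size=n)


def standardize(x):
    """Rescale a column to zero mean and unit standard deviation."""
    return (x - x.mean()) / x.std()


X = np.column_stack([standardize(gdp), standardize(alcohol)])
y = standardize(score)

# Least-squares fit; on standardized data the absolute coefficients
# give a crude, comparable measure of each feature's impact.
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
names = ["GDPPerCapita", "TotalAlcoholConsumptionPerCapita"]
impact = dict(zip(names, np.abs(coefs)))
print(impact)
```

Standardizing first matters because raw coefficients depend on each column's units, which would make the comparison misleading.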
Full Recipe
Here is the entire recipe we used in this topic:
Load data from the file 2018-0ba7ae108b9b2765994baad38fcde5e2.csv
Rename the current dataset to Happiness2018 {"overwrite":true}
Load data from the file Alcohol-consumption-9cc25bb03ab769310171f9a34a0156ea.csv
Rename the current dataset to Alcohol_Consumption {"overwrite":true}
Use the dataset called Happiness2018 version 1
Drop the columns OverallRank
Use the dataset called Alcohol_Consumption version 1
Keep the rows where Year is equal to the value 2018
Drop the columns Code, Year
Rename the column Entity to CountryOrRegion
Extend the dataset Happiness2018 with the dataset Alcohol_Consumption
Train an ML model on Score with the config {...} and generate charts for data visualization