Skip to main content

Handle Different Types of Outliers

Outliers can represent measurement errors, entry errors, poor sampling, and more. They can have a large impact on your analysis and can skew your findings. In DataChat, we can:

  • Find outliers by creating visualizations to view outliers in plots.
  • Handle outliers by replacing or removing numerical outliers or null values.

In this section, we’ll use data about healthy city lifestyles to show how to find and handle numerical outliers and null values.

Load Our Data

To start, upload the Healthy Lifestyle Cities Report 2021 dataset. Note that this downloads as a .zip folder. Extract the contents of the .zip folder and upload the .csv file into DataChat.

sample of Healthy Lifestyle data

Visualize Our Data

With our data loaded, we can create a visualization to see if there are any visible outliers in the data:

  1. Click Plot Chart in the skill menu to open the Chart Builder.

  2. Select Scatter from the chart options. Violin and boxplot charts are also suggested options to view outliers.

  3. Enter “SunshineHoursCity” in the X-Axis field, and “LifeExpectancyyearsCountry" in the Y-Axis field.

  4. Click Submit.

    visualize data

    visualize outlier

From the plot, we can see that Johannesburg is a distinct outlier that doesn’t align with the rest of the cities.

Remove Outliers

Let’s remove this outlier from our dataset. Navigate back to the Data tab, then:

  1. Click Wrangle > Drop Rows in the skill menu.

  2. Select “City” for the column, “contains” for the expression, and “Johannesburg" for the predicate value.

  3. Click Submit.

    drop outlier

This creates a new dataset, “healthy_lifestyle_city_2021 v2”, with the “Johannesburg” row dropped.

Replace Null Values

From here, we can also handle null values in our target columns. There are no null values in the LifeExpectancyyearsCountry column, but there is one null value in the the SunshineHoursCity column where "City" is "Geneva".

view null value

We can replace this null value with an average using the Clean skill:

  1. Click Wrangle > Clean in the skill menu.

  2. Select Single Column.

  3. Enter "SunshineHoursCity" for the column to clean.

  4. Enter "NULL" for the old value.

  5. Select "The Aggregation", "average" and "SunshineHoursCity" for the value to replace with.

  6. Click Submit.

    clean form

    replaced null value

The null value is replaced with the column's average (2,224.95) and a new dataset is created.