Skip to main content
Version: 0.32.2

Handle Different Types of Outliers

Outliers can represent measurement errors, entry errors, poor sampling, and more. They can have a large impact on your analysis and can skew your findings. In DataChat, we can:

  • Find outliers by creating visualizations to view outliers in plots.
  • Handle outliers by replacing or removing numerical outliers or null values.

In this section, we’ll use data about healthy city lifestyles to show how to find and handle numerical outliers and null values.

Load Our Data

To start, download the Healthy Lifestyle Cities Report 2021 dataset. Then, upload the file into the session. We're then given a sample of our dataset that looks something like this:

sample of Healthy Lifestyle data

Visualize Our Data

With our data loaded, we can create a visualization to see if there are any visible outliers in the data:

  1. Click Plot in the sidebar to open the Chart Builder.

  2. Select Scatter from the chart options. Violin and boxplot charts are also suggested options to view outliers.

  3. Enter “SunshineHoursCity” in the X-Axis field, and “LifeExpectancyyearsCountry" in the Y-Axis field.

  4. Click Submit.

    visualize data

    visualize outlier

From the plot, we can see that Johannesburg is a distinct outlier that doesn’t align with the rest of the cities.

Remove Outliers

Let’s remove this outlier from our dataset. Select the original dataset “healthy_lifestyle_city_2021 v1”, then:

  1. Click Row > Drop in the sidebar to open the Drop form.

  2. Select “City” for the column, “contains” for the expression, and “Johannesburg" for the predicate value.

  3. Click Submit.

    drop outlier

This creates a new dataset, “healthy_lifestyle_city_2021 v2”, with the “Johannesburg” row dropped.

Replace Null Values

From here, we can also handle null values in our target columns. There are no null values in the LifeExpectancyyearsCountry column, but there is one null value in the the SunshineHoursCity column where "City" is "Geneva".

view null value

We can replace this null value with an average using the Clean form:

  1. Click Column > Clean in the sidebar.

  2. Select A Numeric Column.

  3. Enter "SunshineHoursCity" for the column to clean, "NULL" for the old value, and "average SunshineHoursCity matching exact values" for the expression.

  4. Click Submit.

    clean form

    replaced null value

The null value is replaced with the column's average (2,224.95) and a new dataset is created.

Full Recipe

Load data from the file <strong>topcities.zip</strong>
Plot an EChart with the specification <strong>{"caption": "", "typeVersion": 2, "values": {"columns": [{"name": "AnnualAvgHoursWorked", "type": "Integer", "unique": 23}, {"name": "City", "type": "String", "unique": 44}, {"name": "CostOfABottleOfWaterCity", "type": "Float", "unique": 39}, {"name": "CostOfAMonthlyGymMembershipCity", "type": "Float", "unique": 44}, {"name": "HappinessLevelsCountry", "type": "Float", "unique": 30}, {"name": "LifeExpectancyyearsCountry", "type": "Float", "unique": 27}, {"name": "NumberOfTakeOutPlacesCity", "type": "Integer", "unique": 44}, {"name": "ObesityLevelsCountry", "type": "String", "unique": 28}, {"name": "OutdoorActivitiesCity", "type": "Integer", "unique": 43}, {"name": "PollutionIndexScoreCity", "type": "Float", "unique": 44}, {"name": "Rank", "type": "Integer", "unique": 44}, {"name": "SunshineHoursCity", "type": "Integer", "unique": 40}], "complete": false, "reference": {"type": "dataset", "params": {"dataset": "healthy_lifestyle_city_2021"}}, "transforms": [], "aggregate": [], "metadata": {}}, "plot": {"series": [{"type": "scatter", "mark": {"x": "SunshineHoursCity", "y": "LifeExpectancyyearsCountry"}, "group": {}}], "presentation": {"title": {"textStyle": {"fontSize": 18}, "text": ["LifeExpectancyyearsCountry vs. SunshineHoursCity"]}, "xaxis": {"nameTextStyle": {"fontSize": 18}, "axisLabel": {"fontSize": 12}, "text": "SunshineHoursCity"}, "yaxis": {"nameTextStyle": {"fontSize": 18}, "axisLabel": {"fontSize": 12}, "text": "LifeExpectancyyearsCountry"}, "annotations": [], "colorOverride": []}}}</strong>
Use the dataset healthy_lifestyle_city_2021, version 1
Drop the rows where City contains <strong>Johannesburg</strong>
Clean the numeric column SunshineHoursCity by replacing <strong><em contenteditable="false">NULL</em></strong> with the average SunshineHoursCity matching exact values