Skip to main content
Version: 0.21.2

Clean

Clean replaces values with other values. You can clean individual columns or, if you’re replacing a value with another, apply across all of the columns in your dataset that have the same type (numeric or string).

How a column can be cleaned depends on the type of the column.

  • If the column contains numeric values, you can:
    • Replace the current value with an aggregated value of a column in the dataset, including the current column.
    • Replace the current value with a value you specify.
    • Mark the time at which values were replaced.
  • If the column contains string values, you can:
    • Convert values to upper- or lowercase.
    • Delete a phrase.
    • Remove leading or trailing spaces.
    • Replace the entire value of a cell or replace a substring contained anywhere in a cell.

Format

Clean has several utterance variations.

For columns of float and integer type:

  • Clean all <numeric type> columns by replacing <old value> with the value <new value> matching <match value> where <column> is <predicate>
  • Clean the numeric column <column name> by replacing <old value> with the <new value> matching <match value> where <column> is <predicate>

Optional extension: noting time. Creates a column that contains either timestamps for when the row was cleaned or null values if the row remains untouched.

For columns of string type:

  • Clean all string columns by replacing <old value> with the <new value> using match semantics <semantic> where <column> is <predicate>

  • Clean the string column <column name>

    • by converting it to <case type> using match semantics <semantic>
    • by deleting the phrase <substring> using match semantics <semantic>
    • by removing all <blank type> blanks using match semantics <semantic>
    • by replacing <old value> with the value <new value> using match semantics <semantic>

Optional extension: case insensitive. Disregards the case of each character when replacing values.

Whether to be case insensitive when looking for a matching value. If this parameter is not specified, case sensitive matching is used.

Parameters

You can use the following parameters to specify how a column should be cleaned:

  • blank type (required). The type of blanks to remove. Choose from "leading" or "trailing".
  • case type (required). The case to apply to the matching string. Choose from "lowercase" or "uppercase".
  • column (optional). The column whose values to use as part of the cleaning criteria.
  • column name (required). The name of the column to be cleaned.
  • matching value (required). The type of numerical value to match. Choose from "digit values" or "exact values".
  • new value (required). The new value to replace the old value. This can be a specific value or an aggregate value.
  • numeric type (required). The type of the numeric column to be cleaner. Choose from "float" or "integer".
  • old value (required). The value to be replaced.
  • predicate (optional). The predicate used as part of the cleaning criteria.
  • semantic (required). The match semantics used when replacing one value with another. Choose "fullstring" to replace the entire contents of a cell's old value with a new value. Choose "substring" to replace any occurrence of the old value with the new value.
  • substring (required). The string to delete from the larger string value.

Output

If a column is successfully cleaned, the updated dataset becomes [Dataset] v2 or the next incremental version value.

If a column can’t be cleaned, an error message appears in the chat box.

Examples

Consider a dataset called "Titanic" that contains information on each passenger, including the following columns:

  • Age. Their age.
  • Gender. Their gender.
  • Name. Their name.
  • PClass. Their class.
  • Survived. Whether they survived the disaster.

To replace all of the null values in the Age column with the average age, enter Clean the numeric column Age by replacing NULL with the value average Age using match semantics fullstring

To replace all of the null values in the PClass column with the value zero, enter Clean the numeric column PClass by replacing NULL with the value 0 using match semantics fullstring

To replace all of the null values in the Cabin column to 0 where Pclass is 3, enter Clean the string column Cabin by replacing NULL with the value 0 using match semantics fullstring where Pclass is equal to the value 3

To fill in any missing values in the Name column with the string “N/A,” enter Clean the string column Name by replacing EMPTY STRING with the value N/A using match semantics fullstring

To convert all of the values in the Name column to uppercase, enter Clean the string column Name by converting it to uppercase using match semantics fullstring

To remove any trailing spaces from the values in the Gender column, enter Clean the string column Gender by removing all trailing blanks using match semantics fullstring

To replace the word “female” with “woman” in all string columns, enter Clean all string type columns by replacing female with the value woman using match semantics fullstring

To replace the word "male" with "boy" in all string columns where age is less than 18, enter Clean all string type columns by replacing male with the value boy using match semantics fullstring where Age is less than the value 18

To replace any misspelling of “female” or “male” as “femail” or “mail” in all string columns, enter Clean all string type columns by replacing mail with the value male with match semantics substring

To remove hyphens from the Name column, enter Clean the string column Name by deleting the phrase - using match semantics fullstring