Skip to main content
Version: 0.35.7

Clean

Clean replaces values with other values. You can clean individual columns or, if you’re replacing a value with another, apply across all of the columns in your dataset that have the same type (numeric or string).

How a column can be cleaned depends on the type of the column.

  • If the column contains numeric values, you can:
    • Replace the current value with an aggregated value of a column in the dataset, including the current column.
    • Replace the current value with a value you specify.
    • Mark the time at which values were replaced.
  • If the column contains string values, you can:
    • Convert values to upper- or lowercase.
    • Delete a phrase.
    • Remove leading or trailing spaces.
    • Replace the entire value of a cell or replace a substring contained anywhere in a cell.

Format

Clean has several formats:

For columns of float and integer type:

  • Clean all (float | integer | numeric) columns by replacing <old value> with the <predicate> matching (digit | exact) values where <column> is <predicate>
  • Clean the numeric column <column name> by replacing <old value> with the <predicate> matching (digit | extact) values where <column> is <predicate>

Optional extension: noting time. Creates a column that contains either timestamps for when the row was cleaned or null values if the row remains untouched.

For columns of string type:

  • Clean all string columns by replacing <old string> with the <predicate> using match semantics <semantic> (where <column> is <predicate>)

Optional extension: noting time. Creates a column that contains either timestamps for when the row was cleaned or null values if the row remains untouched.

  • Clean the string column <column name>
    • by converting it to (uppercase | lowercase) using match semantics <semantic>
    • by deleting the phrase <substring> using match semantics <semantic>
    • by removing all (leading | trailing) blanks using match semantics <semantic>
    • by replacing <string> with the <predicate> using match semantics <semantic>

Optional extension: case insensitive. Whether to be case insensitive when looking for a matching value. If this parameter is not specified, case sensitive matching is used.

Parameters

You can use the following parameters to specify how a column should be cleaned:

  • column (optional). The column whose values to use as part of the cleaning criteria.
  • column name (required). The name of the column to be cleaned.
  • matching value (required). The type of numerical value to match. Choose from "digit values" or "exact values".
  • numeric type (required). The type of the numeric column to be cleaner. Choose from "float" or "integer".
  • old string (required). The string to be replaced.
  • old value (required). The value to be replaced.
  • predicate (optional). The predicate used as part of the cleaning criteria.
  • semantic (required). The match semantics used when replacing one value with another. Choose "fullstring" to replace the entire contents of a cell's old value with a new value. Choose "substring" to replace any occurrence of the old value with the new value.
  • substring (required). The string to delete from the larger string value.

Output

If a column is successfully cleaned, the updated dataset becomes [Dataset] v2 or the next incremental version value.

If a column can’t be cleaned, an error message appears in the log.

Examples

Consider a dataset called "Titanic" that contains information on each passenger, including the following columns:

  • Age. Their age.
  • Gender. Their gender.
  • Name. Their name.
  • PClass. Their class.
  • Survived. Whether they survived the disaster.

To replace all of the null values in the Age column with the average age, enter Clean the numeric column Age by replacing NULL with the value average Age using match semantics fullstring

To replace all of the null values in the PClass column with the value zero, enter Clean the numeric column PClass by replacing NULL with the value 0 using match semantics fullstring

To replace all of the null values in the Cabin column to 0 where Pclass is 3, enter Clean the string column Cabin by replacing NULL with the value 0 using match semantics fullstring where Pclass is equal to the value 3

To fill in any missing values in the Name column with the string “N/A,” enter Clean the string column Name by replacing EMPTY STRING with the value N/A using match semantics fullstring

To convert all of the values in the Name column to uppercase, enter Clean the string column Name by converting it to uppercase using match semantics fullstring

To remove any trailing spaces from the values in the Gender column, enter Clean the string column Gender by removing all trailing blanks using match semantics fullstring

To replace the word “female” with “woman” in all string columns, enter Clean all string type columns by replacing female with the value woman using match semantics fullstring

To replace the word "male" with "boy" in all string columns where age is less than 18, enter Clean all string type columns by replacing male with the value boy using match semantics fullstring where Age is less than the value 18

To replace any misspelling of “female” or “male” as “femail” or “mail” in all string columns, enter Clean all string type columns by replacing mail with the value male with match semantics substring

To remove hyphens from the Name column, enter Clean the string column Name by deleting the phrase - using match semantics fullstring