Skip to main content

Clean

Clean replaces values with other values. You can clean individual columns or, if you’re replacing a value with another, apply across all of the columns in your dataset that have the same type (numeric or string).

How a column can be cleaned depends on the type of the column.

  • If the column contains numeric values, you can:
    • Replace the current value with an aggregated value of a column in the dataset, including the current column.
    • Replace the current value with a value you specify.
    • Mark the time at which values were replaced.
  • If the column contains string values, you can:
    • Convert values to upper- or lowercase.
    • Delete a phrase.
    • Remove leading or trailing spaces.
    • Replace the entire value of a cell or replace a substring contained anywhere in a cell.

Format

Clean has several formats:

For columns of float, integer, and numeric type:

  • Clean all (float | integer | numeric) columns by replacing <old value> with the <predicate> matching (digit | exact) values where <column> is <predicate>
  • Clean the numeric column <column name> by replacing <old value> with the <predicate> matching (digit | extact) values where <column> is <predicate>

Optional extension: noting time. Creates a column that contains either timestamps for when the row was cleaned or null values if the row remains untouched.

For columns of string type:

  • Clean all string columns by replacing <old string> with the <predicate> using match semantics <semantic> (where <column> is <predicate>)

Optional extension: noting time. Creates a column that contains either timestamps for when the row was cleaned or null values if the row remains untouched.

  • Clean the string column <column name>
    • by converting it to (uppercase | lowercase) where <column> is <predicate>
    • by deleting the phrase <substring> where <column> is <predicate>
    • by removing all (leading | trailing) blanks where <column> is <predicate>
    • by replacing <string> with the <predicate> using match semantics <semantic>

Optional extension for replacements: case insensitive. Whether to be case insensitive when looking for a matching value. If this parameter is not specified, case sensitive matching is used.

Parameters

You can use the following parameters to specify how a column should be cleaned:

  • column (optional). The column whose values to use as part of the cleaning criteria.
  • column name (required). The name of the column to be cleaned.
  • matching value (required). The type of numerical value to match. Choose from "digit values" or "exact values".
  • old string (required). The string to be replaced.
  • old value (required). The value to be replaced.
  • predicate (optional). The predicate used as part of the cleaning criteria.
  • semantic (required). The match semantics used when replacing one value with another. Choose "fullstring" to replace the entire contents of a cell's old value with a new value. Choose "substring" to replace any occurrence of the old value with the new value.
  • string (required). The string to replace.
  • substring (required). The string to delete from the larger string value.

Output

If a column is successfully cleaned, the updated dataset becomes [Dataset] v2 or the next incremental version value.

If a column can’t be cleaned, an error message appears in the conversation history.

Examples

Consider a dataset called "Credit_Risk" that contains information on loan applications and status:

  • person_age. Applicant's age.
  • person_income. Annual income.
  • person_home_ownership. Type of home ownership.
  • person_emp_length. Employment length in years.
  • loan_intent. Loan intent.
  • loan_grade. Loans graded "A" have the lowest expected risk of loss (low interest rate) while loans graded "G" have the highest expected rate of loss (high interest rate).
  • loan_amnt. Loan amount in USD.
  • loan_int_rate. Interest rate.
  • loan_status. Loan status of "0" (non default) or "1" (default).
  • loan_percent_income. Value of the loan compared to applicant's income.
  • cb_person_default_on_file. Whether the applicant has historically defaulted.
  • cb_preson_cred_hist_length. Credit history length.

To replace all of the null values in the loan_amnt column with the average loan_amnt, enter Clean the numeric column loan_amnt by replacing NULL with the average loan_amnt matching exact values

To replace all of the null values in the loan_amnt column with the value zero, enter Clean the numeric column loan_amnt by replacing NULL with the value 0 matching exact values

To replace all of the null values in the loan_status column to 0 where loan_amnt is less than $50,000, enter Clean the numeric column loan_status by replacing NULL with the value 0 matching exact values where loan_amnt is less than the value 50000

To fill in any missing values in the person_home_ownership column with the string “N/A,” enter Clean the string column person_home_ownership by replacing EMPTY with the value N/A using match semantics fullstring

To convert all of the values in the loan_grade column to uppercase, enter Clean the string column loan_grade by converting it to uppercase

To remove any trailing spaces from the values in the loan_intent column, enter Clean the string column loan_intent by removing all trailing blanks

To replace the word “RENT” with “LEASE” in all string columns, enter Clean all string columns by replacing RENT with the value LEASE using match semantics fullstring

To replace any misspelling of “MORTGAGE” as “MORGAGE” in all string columns, enter Clean all string columns by replacing MORGAGE with the value MORTGAGE using match semantics fullstring

Feedback