Skip to main content
Version: 0.18.3

Wrangle

After you've explored your data and are comfortable with it, you're ready to begin cleaning and wrangling your data to prepare it for visualization, machine learning, and other uses. To clean your data, you might:

  • Create a new dataset, such as a mapping dataset, to help clean your data.
  • Fill in missing values
  • Replace values
  • Drop rows or columns based on certain criteria
  • Add new columns
  • Group columns with lots of precise values (such as ages or salaries) values into more manageable "bins."

Then, when your data is cleaned, you can wrangle it by:

  • Combining it with other datasets.
  • Computing aggregates, such as averages.
  • Create pivot tables.

Clean Your Data

note

When using a skill on a dataset, the resulting output dataset will be titled [dataset]_[Skill] or [dataset] v[x] depending on the skill used. If the skill creates a new dataset, it will use the convention [dataset]_[Skill]. If the skill alters your existing dataset to a new version, it will use the convention [dataset] v[x].

Column Level

Use the Column button to:

  • Bin the values of a column.
  • Clean one or more columns.
  • Create a new column.
  • Describe a column.
  • Drop one or more columns.
  • Extract date and time parts (such as minutes, hours, or days) from a column.
  • Keep one or more columns.
  • Rename a column.
  • Split a column.

the column button

Bin

To group a column's values into bins:

  1. Select the column whose values to bin.
  2. Select and configure how the values should be binned. Refer to the reference page for Bin for more information on these options.
  3. Enter the values used to bin the column.
  4. Optionally, enter a list of names to use for the new bins.
  5. Click Submit.

the form for the Bin skill

Clean

To clean one or more columns:

  1. Select whether to clean all columns of a specific type or a specific string or numeric column.
  2. Select the columns you'd like to clean.
  3. Enter the new value you want to use.
  4. Click Submit.

the form for the Clean skill

Create

To create a new column, select the method with which to create the column.

form for the Create Column

Choose from among five methods:

In the Expression text field, enter an expression that operates on existing columns. Or, fill the column with an increasing sequence.

form for Create Column - Expression

Describe

To describe a column:

  1. Select the column to describe.
  2. Click Submit.

the describe form

Drop

To drop a column, click the ellipses in the top right corner of the column header, then click Drop.

Otherwise, from the Drop Form:

  1. Select the columns you want to drop.
  2. Click Submit.

the form for the Drop skill

Extract

To extract date parts from the dataset:

  1. Select the column you want to extract date parts from. you want to extract.
  2. Select the date parts you want to extract.
  3. Optionally, give the columns a new name by entering a comma-separated list of names to use for the new columns. You can use the + button to add more extract commands.
  4. Click Submit.

the form for the Extract skill

Keep

To keep a column, you can right-click the name of the column to open the Keep form with that column already selected. You can then add other columns inside the form. Otherwise:

  1. Select the columns you want to keep.
  2. Click Submit.

the form for the Keep skill

Rename

Rename a column in several different ways:

  1. Click the Column > Rename button.

    Column Rename button

  2. Fill out the "Rename Using the Dataset" form to rename one or more columns.

    Column Rename form

Dataset Panel

Double-click on the column name in the dataset panel to make the column name editable and enter the new name.

Column Rename in dataset panel

Or, click on the three-dot menu next to the column name and select "Rename". The cursor relocates to inside the editable column name as above.

Column Rename in dataset panel menu

Chat Box

To compose a DataChat sentence with the Rename skill, enter in the chatbox:

Rename the column <column> to <new name>

Column Rename in chat box

note

Column names are case-insensitive.

Split

To split a column:

  1. Select the column to split.
  2. Select whether to split the values in the column based on position (such as the third character) or by a delimiter (such as a comma).
  3. Enter the position or delimiter to split by.
  4. Optionally, enter a comma-separated list of labels to use for the new columns.
  5. Click Submit.

the form for the Split skill

Row Level

Use the Row button to:

  • Drop one or more rows.
  • Keep one or more rows.

the row button

Drop

To drop rows:

  1. Select a row to drop.
  2. Select an expression for the condition.
  3. Enter the values of the condition.
  4. Click Submit.

the drop rows form

Keep

To keep rows:

  1. Select a row to keep.
  2. Select an expression for the condition.
  3. Enter the values of the condition.
  4. Click Submit.

the keep rows form

Wrangle Your Data

Dataset Level

Use the Dataset button to:

  • Compute calculations and aggregations on a dataset.
  • Extend two datasets.
  • Pivot and summarize the data.
  • Sort your data.

the dataset button

Removing Data

You can use the X in the dataset's tab to forget that dataset.

the X button that forgets datasets

Compute

To compute an aggregation or calculation on the dataset, select the dataset on which to run Compute in the dataset panel and then select Dataset > Compute from the sidebar.

  1. Select the aggregation you want to compute.
  2. Select the column you want to compute.
  3. Give the new column a name. Optionally, click the + button to add additional aggregations to compute, or click the - button to remove additional aggregations.
  4. Optionally, enter the columns to group by.
  5. Optionally, enter the columns or values to sort by.
  6. Click Submit to run the computation.

the compute form

Extend

To extend a dataset with another dataset, select the dataset to extend in the dataset panel and then select Dataset > Extend from the sidebar.

  1. Select the dataset you want to extend onto your current dataset in the dataset panel.
  2. Optionally, specify the columns to match between the datasets.
  3. Optionally, add or remove matching columns.
  4. Click Submit.

extend form

Extending with Shared columns
  1. Ensure that columns you intend to match between the two datasets share the same column names and types with Describe and Change. The initial dataset is extended via a left-join with the second dataset. For each row of the initial dataset, if values match in the specified columns with the second dataset, that row of the initial dataset is extended with the columns from the second dataset. The extended columns are populated with the values from the second dataset's row.

  2. Ensure that columns you don't intend to match are named differently. For example, to refrain from matching on different "id" columns in the "Person" and "Office" datasets, `Rename' the columns to "Person_id" and "Office_id" if you want to retain the full content of each "id" column..

  3. Extend the initial dataset with the second dataset using the form. Or, enter into the chat box:

    Extend the dataset <dataset1> with the dataset <dataset2>

The resulting dataset is named "<dataset1>_Extend" and the extended columns from the second dataset are appended with an "E".

Extending without Shared Columns

If you extend one dataset with another dataset, without a shared column or a designated primary/foreign key relationship, the resulting operationa Cartesian productis computationally expensive and time-consuming. It can result in a very large dataset. DataChat prompts if you wish to continue with this unusual request, and computes a Cartesian product if you click "Yes".

Pivot

To create a pivot table from a dataset:

  1. Select the aggregate (such as an average) and column that you want to pivot on.
  2. From the list of columns at the top, drag and drop the columns whose values you want to use to create the table's rows to the rows section.
  3. From the list of columns at the top, drag and drop the columns whose values you want to use to create the table's columns to the columns section.
  4. Click Submit.

the pivot form

Sort

To sort your data:

  1. Select the columns by which you want to sort the data.
  2. Select whether you want to sort the data in ascending or descending order.
  3. Click Submit.

the sort form

Define Expressions

You can use expressions to help clean and wrangle your data. Expressions are reusable shortcuts you can use to perform calculations and other operations quickly.

For example, if you repeatedly need to calculate the average Sales amount for customers between the ages of 18 and 34, you could define two expressions:

  • An aggregate expression called AvgSales, which is defined as average Sales
  • A predicate expression called 18to34, which is defined as Age is between the values 18 to 34

There are five types of expressions:

  • Aggregate expressions that compute an aggregate, such as the average Sales example above.
  • Predicate expressions that perform an action of the data based on whether the data satisfies one or more conditions, such as the Age is between the values 18 to 34 example above.
  • Extract expressions that extract a time component from a datetime column. See Extract for all of the available options. When registering expressions in bulk, you can extract the:
    • microsecond
    • millisecond
    • second
    • seconds since epoch
    • minute
    • hour
    • day
    • day of the week
    • first day of the week
    • first day of the month
    • month
    • quarter
    • year
    • decade
    • century
    • YYYYMM
    • YYYYQ
  • Math expressions that compute a math expression using one or more columns.
  • Aggregate math expressions that use both aggregations, such as average and median, along with math expressions, such as exponents, ceiling, or absolute values; to compute a value from one or more columns.

Create Expressions

To create a new expression, click the Define button in the sidebar to open the form. Then:

  1. Select the type of expression you want to define.
  2. Enter a new expression name.
  3. Select options from the available dropdown list to build your expression.
  4. Click Submit

Define steps

You can also use the Define skill in the chat box.

Remove Expressions

To remove expressions, enter the following utterance in the chat box: Forget the expressions <list of expressions> where <list of expressions> is a comma-separated list of the expressions you want to remove.

Use Expressions

After you've defined an expression, you can use it many other skills throughout DataChat, such as:

Save and Export Your Data

As you work in DataChat you might want to save the set of utterances you ran into a workflow so you can rerun them later on a different dataset. You might also want to export your results to save them or use them outside of DataChat.

You can also save the datasets you've loaded in your session as a database that can be shared with other users or connected to later. Note that if you overwrite a saved database that is shared with another user, the user can still connect to the new version of the database and has access to whatever datasets are now saved within it.

See the Save and Export topics for more information.