Skip to main content
Version: 0.32.2

Wrangle

Once your data has been shaped, you might wrangle it by combining it with other datasets, computing aggregates, such as averages, creating pivot tables, and more. This section covers how to:

note

When a skill is applied to a dataset

  • If the skill creates a new dataset, it will use the convention [dataset]_[Skill].
  • If the skill alters your existing dataset, it will use the convention [dataset] v[x] to save to a new version.
tip

If your wrangled data is used in multiple recipes or charts, DataChat recommends saving the wrangled data as a snapshot to improve performance.

Grow Datasets

When you work with multiple datasets, you may need to combine or extend them.

Combine Datasets with the Same or Similar Columns

To create a new dataset that combines two datasets with the same columnsof name, number, and typeuse the Concatenate skill. You can use either the Concatenate form or the GEL input field.

Concatenate Form

To use the Concatenate form to combine your datasets, click Dataset > Concatenate in the sidebar to open the form:

  1. Select the datasets to concatenate. Note that you must pick at least two datasets and they must already be loaded into your session.

  2. Choose how to handle duplicate values. By default, Concatenate keeps all duplicates

  3. Click Submit.

    the concatenate form

GEL Input Field

To use the GEL input field to concatenate two datasets, enter:

Concatenate the datasets called <dataset1> and <dataset2>

For example, to create a new dataset of a weekly log that contains only unique rows from weekday and weekend server logs with slight time overlaps, enter into the GEL input field:

Concatenate the datasets called WeekdayLog and WeekendLog and remove all duplicates

The resulting dataset is called "WeekdayLog_Concatenate". Refer to the Concatenate skill documentation for more information.

Combine Datasets with Different Columns

To create a new dataset that combines two datasets with different columns, you can use the Extend skill and optionally indicate how the rows in the resulting dataset should be extended based on the columns in the two datasets:

  • Designate existing columns to match between the datasets.
  • Create shared columns to match between the datasets.
  • Without shared columns, calculate a Cartesian product across both datasets.

Extend with Shared Columns

  1. Ensure that columns you intend to match between the two datasets share the same column names and types see Describe and Change. The initial dataset is extended via a left-join with the second dataset. For each row of the initial dataset, if values match in the specified columns with the second dataset, that row of the initial dataset is extended with the columns from the second dataset. The extended columns are populated with the values from the second dataset's row.

  2. Ensure that columns you don't intend to match are named differently. For example, to refrain from matching on different "id" columns in datasets called "Person" and "Office", Rename the columns to "Person_id" and "Office_id" if you want to retain the full content of each "id" column..

  3. Extend the initial dataset with the second dataset by entering into the GEL input field:

    Extend the dataset <dataset1> with the dataset <dataset2>

The resulting dataset is named "<dataset1>_Extend" and the extended columns from the second dataset are appended with an "E".

Extend without Shared Columns

If you extend one dataset with another dataset, without a shared column or a designated primary/foreign key relationship, the resulting operationa Cartesian productis computationally expensive and time-consuming. It can result in a very large dataset. DataChat prompts if you wish to continue with this unusual request, and computes a Cartesian product if you click "Yes".

Extend Form

You can also use the Extend form. Select the dataset to extend in the dataset panel and then select Dataset > Extend in the sidebar:

  1. Select the dataset you want to extend onto your current dataset.

  2. Optionally, specify the columns to use as a join key between the datasets. A join key is used to match values between the two datasets.

  3. Optionally, add or remove columns from the join key.

  4. Click Submit.

    the extend form

Tighten Datasets

See Intersect, Slice, and Subtract.

Separate Column Data

tip

DataChat recommends using the Extract skill to ensure that your date and time columns are in a useful format.

To extract date parts from the dataset, select Column > Create > Extract in the sidebar. The Extract form appears:

  1. Select the column from which you want to extract date parts.

  2. Select the date parts you want to extract.

  3. Optionally, give the columns a new name by entering a comma-separated list of names to use for the new columns. You can use the + button to add more extract commands.

  4. Click Submit.

    the form for the Extract skill

To use the GEL input field, enter:

  • Extract the <date parts> from <column> and call the output columns <names> for columns of Date/Time type.
  • Extract attribute <attributes> from <json column> for columns that contain JSON values.

To split a column into parts, select Column > Create > Split in the sidebar. The Split form appears:

  1. Select the column to split.

  2. Select whether to split the values in the column based on position (such as the third character) or by a delimiter (such as a comma).

  3. Enter the position or delimiter to split by.

  4. Optionally, enter a comma-separated list of labels to use for the new columns.

  5. Click Submit.

    the form for the Split skill

To use the GEL input field, see Split.

Aggregate Data

After you've cleaned your data, you’ll want to calculate interesting values across the dataset. Some of these values might be calculated for every row. In DataChat, these values would go into a new column. In other cases, you’ll want to calculate something like the average or count across multiple rows.

Use the Compute skill to perform various calculations such as averages and sums. For example, if you had a dataset that contained every farm in the world, their total acerage, their country, and their state, and you wanted to compute the average acerage of every farm in the world across countries and states, you could say Compute the average acres for each country, state. The resulting dataset would have the following columns:

  • Country
  • State
  • AverageAcres

To compute an aggregation or calculation on the dataset, select the dataset on which to run Compute in the dataset panel and then select Dataset > Compute in the sidebar. The Compute form appears:

  1. Select the aggregation you want to compute.

  2. Select the column you want to compute.

  3. Give the new column a name. Optionally, click the + button to add additional aggregations to compute, or click the - button to remove additional aggregations.

  4. Optionally, enter the columns to group by.

  5. Optionally, enter the columns or values to sort by.

  6. Click Submit to run the computation.

    Column Compute form

Create Pivot Tables

Pivot tables quickly summarize large amounts of data. You can use the Pivot skill to create your pivot tables.

To create a pivot table with a form, select Dataset > Pivot in the sidebar. The Pivot form appears:

  1. Select the aggregate (such as an average) for the column that you want to pivot on.

  2. Select the columns to pivot on.

  3. Select the columns to create pivot rows from.

  4. Select the columns to create pivot columns from.

  5. Click Submit.

    the pivot form

To use the GEL input field, see Pivot.

Reshape Tables

Reshape converts your data to either wide or long form. This allows you to specify one or more columns to act as unique identifiers for rows and create new columns from the values of your specified columns. Each row in the resulting dataset represents a combination of the column values.

To reshape a table:

  1. Click Dataset > Reshape to open the Reshape form.

  2. Select either Wide Form or Long Form.

  3. Enter the Row Identifiers.

  4. Enter the Values.

  5. Optionally, enter a name for the variables column and the value column.

  6. Click Submit.

    the reshape form

Define Expressions

You can use expressions to help clean and wrangle your data. Expressions are reusable shortcuts you can use to perform calculations and other operations quickly.

For example, if you repeatedly need to calculate the average Sales amount for customers between the ages of 18 and 34, you could define two expressions:

  • An aggregate expression called AvgSales, which is defined as average Sales
  • A predicate expression called 18to34, which is defined as Age is between the values 18 to 34

There are five types of expressions:

  • Aggregate expressions that compute an aggregate, such as the average Sales example above.
  • Predicate expressions that perform an action of the data based on whether the data satisfies one or more conditions, such as the Age is between the values 18 to 34 example above.
  • Extract expressions that extract a time component from a datetime column. See Extract for all of the available options. When registering expressions in bulk, you can extract the:
    • microsecond
    • millisecond
    • second
    • seconds since epoch
    • minute
    • hour
    • day
    • day of the week
    • first day of the week
    • first day of the month
    • month
    • quarter
    • year
    • decade
    • century
    • YYYYMM
    • YYYYQ
  • Math expressions that compute a math expression using one or more columns.
  • Aggregate math expressions that use both aggregations, such as average and median, along with math expressions, such as exponents, ceiling, or absolute values; to compute a value from one or more columns.

Create Expressions

To create a new expression, click the Define button in the sidebar to open the form. Then:

  1. Select the type of expression you want to define.

  2. Enter a new expression name.

  3. Select options from the available dropdown list to build your expression.

  4. Click Submit

    the define form

You can also use the Define skill in the GEL input field.

Remove Expressions

To remove expressions, enter the following skill pattern in the GEL input field: Forget the expressions <list of expressions> where <list of expressions> is a comma-separated list of the expressions you want to remove.

Use Expressions

After you've defined an expression, you can use it with many other skills throughout DataChat, such as:

Add Notes

While working with your data, you might want to add notes to organize and label different datasets, computations, and other outputs. To add a note from within a session:

  1. Click Note in the sidebar.

  2. Select the color of the note.

  3. Enter your note.

  4. Click Submit.

    record note form