Wrangle
Once your data has been shaped, you might wrangle it by combining it with other datasets, computing aggregates, such as averages, creating pivot tables, and more. This section covers how to:
- Grow Datasets with
Concatenate
andExtend
- Tighten Datasets with
Intersect
,Slice
, andSubtract
- Separate Column Data with
Extract
andSplit
- Aggregate Data with
Compute
- Create Pivot Tables with
Pivot
- Reshape Tables with
Reshape
- Define Expressions with
Define
- Add Notes with
Record
When a skill is applied to a dataset
- If the skill creates a new dataset, it will use the convention
[dataset]_[Skill]
. - If the skill alters your existing dataset, it will use the convention
[dataset] v[x]
to save to a new version.
If your wrangled data is used in multiple recipes or charts, DataChat recommends saving the wrangled data as a snapshot to improve performance.
Grow Datasets
When you work with multiple datasets, you may need to combine or extend them.
Combine Datasets with the Same or Similar Columns
To create a new dataset that combines two datasets with the same columns—of name, number, and type—use the Concatenate
skill. You can use either the Concatenate form or the GEL input field.
Concatenate Form
To use the Concatenate form to combine your datasets, click Dataset > Concatenate in the sidebar to open the form:
Select the datasets to concatenate. Note that you must pick at least two datasets and they must already be loaded into your session.
Choose how to handle duplicate values. By default,
Concatenate
keeps all duplicatesClick Submit.
GEL Input Field
To use the GEL input field to concatenate two datasets, enter:
Concatenate the datasets called <dataset1> and <dataset2>
For example, to create a new dataset of a weekly log that contains only unique rows from weekday and weekend server logs with slight time overlaps, enter into the GEL input field:
Concatenate the datasets called WeekdayLog and WeekendLog and remove all duplicates
The resulting dataset is called "WeekdayLog_Concatenate". Refer to the Concatenate
skill documentation for more information.
Combine Datasets with Different Columns
To create a new dataset that combines two datasets with different columns, you can use the Extend
skill and optionally indicate how the rows in the resulting dataset should be extended based on the columns in the two datasets:
- Designate existing columns to match between the datasets.
- Create shared columns to match between the datasets.
- Without shared columns, calculate a Cartesian product across both datasets.
Extend with Shared Columns
Ensure that columns you intend to match between the two datasets share the same column names and types – see
Describe
andChange
. The initial dataset is extended via a left-join with the second dataset. For each row of the initial dataset, if values match in the specified columns with the second dataset, that row of the initial dataset is extended with the columns from the second dataset. The extended columns are populated with the values from the second dataset's row.Ensure that columns you don't intend to match are named differently. For example, to refrain from matching on different "id" columns in datasets called "Person" and "Office",
Rename
the columns to "Person_id" and "Office_id" if you want to retain the full content of each "id" column..Extend the initial dataset with the second dataset by entering into the GEL input field:
Extend the dataset <dataset1> with the dataset <dataset2>
The resulting dataset is named "<dataset1>_Extend" and the extended columns from the second dataset are appended with an "E".
Extend without Shared Columns
If you extend one dataset with another dataset, without a shared column or a designated primary/foreign key relationship, the resulting operation—a Cartesian product—is computationally expensive and time-consuming. It can result in a very large dataset. DataChat prompts if you wish to continue with this unusual request, and computes a Cartesian product if you click "Yes".
Extend Form
You can also use the Extend form. Select the dataset to extend in the dataset panel and then select Dataset > Extend in the sidebar:
Select the dataset you want to extend onto your current dataset.
Optionally, specify the columns to use as a join key between the datasets. A join key is used to match values between the two datasets.
Optionally, add or remove columns from the join key.
Click Submit.
Tighten Datasets
See Intersect
, Slice
, and Subtract
.
Separate Column Data
DataChat recommends using the Extract skill to ensure that your date and time columns are in a useful format.
To extract date parts from the dataset, select Column > Create > Extract in the sidebar. The Extract form appears:
Select the column from which you want to extract date parts.
Select the date parts you want to extract.
Optionally, give the columns a new name by entering a comma-separated list of names to use for the new columns. You can use the + button to add more extract commands.
Click Submit.
To use the GEL input field, enter:
Extract the <date parts> from <column> and call the output columns <names>
for columns of Date/Time type.Extract attribute <attributes> from <json column>
for columns that contain JSON values.
To split a column into parts, select Column > Create > Split in the sidebar. The Split form appears:
Select the column to split.
Select whether to split the values in the column based on position (such as the third character) or by a delimiter (such as a comma).
Enter the position or delimiter to split by.
Optionally, enter a comma-separated list of labels to use for the new columns.
Click Submit.
To use the GEL input field, see Split
.
Aggregate Data
After you've cleaned your data, you’ll want to calculate interesting values across the dataset. Some of these values might be calculated for every row. In DataChat, these values would go into a new column. In other cases, you’ll want to calculate something like the average or count across multiple rows.
Use the Compute
skill to perform various calculations such as averages and sums. For example, if you had a dataset that contained every farm in the world, their total acerage, their country, and their state, and you wanted to compute the average acerage of every farm in the world across countries and states, you could say Compute the average acres for each country, state
. The resulting dataset would have the following columns:
- Country
- State
- AverageAcres
To compute an aggregation or calculation on the dataset, select the dataset on which to run Compute
in the dataset panel and then select Dataset > Compute in the sidebar. The Compute form appears:
Select the aggregation you want to compute.
Select the column you want to compute.
Give the new column a name. Optionally, click the + button to add additional aggregations to compute, or click the - button to remove additional aggregations.
Optionally, enter the columns to group by.
Optionally, enter the columns or values to sort by.
Click Submit to run the computation.
Create Pivot Tables
Pivot tables quickly summarize large amounts of data. You can use the Pivot
skill to create your pivot tables.
To create a pivot table with a form, select Dataset > Pivot in the sidebar. The Pivot form appears:
Select the aggregate (such as an average) for the column that you want to pivot on.
Select the columns to pivot on.
Select the columns to create pivot rows from.
Select the columns to create pivot columns from.
Click Submit.
To use the GEL input field, see Pivot
.
Reshape Tables
Reshape
converts your data to either wide or long form. This allows you to specify one or more columns to act as unique identifiers for rows and create new columns from the values of your specified columns. Each row in the resulting dataset represents a combination of the column values.
To reshape a table:
Click Dataset > Reshape to open the Reshape form.
Select either Wide Form or Long Form.
Enter the Row Identifiers.
Enter the Values.
Optionally, enter a name for the variables column and the value column.
Click Submit.
Define Expressions
You can use expressions to help clean and wrangle your data. Expressions are reusable shortcuts you can use to perform calculations and other operations quickly.
For example, if you repeatedly need to calculate the average Sales amount for customers between the ages of 18 and 34, you could define two expressions:
- An aggregate expression called AvgSales, which is defined as
average Sales
- A predicate expression called 18to34, which is defined as
Age is between the values 18 to 34
There are five types of expressions:
- Aggregate expressions that compute an aggregate, such as the
average Sales
example above. - Predicate expressions that perform an action of the data based on whether the data satisfies one or more conditions, such as the
Age is between the values 18 to 34
example above. - Extract expressions that extract a time component from a datetime column. See
Extract
for all of the available options. When registering expressions in bulk, you can extract the:- microsecond
- millisecond
- second
- seconds since epoch
- minute
- hour
- day
- day of the week
- first day of the week
- first day of the month
- month
- quarter
- year
- decade
- century
- YYYYMM
- YYYYQ
- Math expressions that compute a math expression using one or more columns.
- Aggregate math expressions that use both aggregations, such as
average
andmedian
, along with math expressions, such as exponents, ceiling, or absolute values; to compute a value from one or more columns.
Create Expressions
To create a new expression, click the Define button in the sidebar to open the form. Then:
Select the type of expression you want to define.
Enter a new expression name.
Select options from the available dropdown list to build your expression.
Click Submit
You can also use the Define
skill in the GEL input field.
Remove Expressions
To remove expressions, enter the following skill pattern in the GEL input field: Forget the expressions <list of expressions>
where <list of expressions>
is a comma-separated list of the expressions you want to remove.
Use Expressions
After you've defined an expression, you can use it with many other skills throughout DataChat, such as:
Add Notes
While working with your data, you might want to add notes to organize and label different datasets, computations, and other outputs. To add a note from within a session:
Click Note in the sidebar.
Select the color of the note.
Enter your note.
Click Submit.