Once your data has been shaped, you might wrangle it by combining it with other datasets, computing aggregates (such as averages), creating pivot tables, and more. This section covers how to:
- Grow Datasets with Concatenate and Extend
- Tighten Datasets with
- Separate Column Data with Extract and Split
- Aggregate Data with Compute
- Create Pivot Tables with Pivot
- Reshape Tables with Reshape
- Define Expressions with Define
- Add Notes with Note
When a skill is applied to a dataset:
- If the skill creates a new dataset, it uses the naming convention [dataset]_[skill] (for example, "WeekdayLog_Concatenate").
- If the skill alters your existing dataset, it uses the convention [dataset] v[x] to save a new version.
If your wrangled data is used in multiple recipes or charts, DataChat recommends saving the wrangled data as a snapshot to improve performance.
When you work with multiple datasets, you may need to combine or extend them.
Combine Datasets with the Same or Similar Columns
To create a new dataset that combines two datasets with the same columns (matching in name, number, and type), use the Concatenate skill. You can use either the Concatenate form or the GEL input field.
To use the Concatenate form to combine your datasets, click Dataset > Concatenate in the sidebar to open the form:
Select the datasets to concatenate. Note that you must pick at least two datasets and they must already be loaded into your session.
Choose how to handle duplicate values. By default, Concatenate keeps all duplicates.
GEL Input Field
To use the GEL input field to concatenate two datasets, enter:
Concatenate the datasets called <dataset1> and <dataset2>
For example, to create a new dataset of a weekly log that contains only unique rows from weekday and weekend server logs with slight time overlaps, enter into the GEL input field:
Concatenate the datasets called WeekdayLog and WeekendLog and remove all duplicates
The resulting dataset is called "WeekdayLog_Concatenate". Refer to the
Concatenate skill documentation for more information.
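The semantics of concatenation with and without duplicate removal can be sketched in plain Python. This is an illustration of the row-wise behavior, not DataChat's actual implementation, and the log rows are made up:

```python
# Sketch of Concatenate semantics: stack rows, optionally drop duplicates.
# The row data here is illustrative, not from a real server log.
weekday_log = [("Mon", "09:00"), ("Fri", "23:59")]
weekend_log = [("Fri", "23:59"), ("Sat", "10:30")]  # slight time overlap

combined = weekday_log + weekend_log            # keep all duplicates (default)

seen, unique_rows = set(), []
for row in combined:                            # "remove all duplicates"
    if row not in seen:
        seen.add(row)
        unique_rows.append(row)

print(len(combined))     # 4 rows when duplicates are kept
print(len(unique_rows))  # 3 rows after removing the duplicate
```

Note that duplicate removal here applies across both inputs, which is why the overlapping row appears only once in the result.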
Combine Datasets with Different Columns
To create a new dataset that combines two datasets with different columns, you can use the
Extend skill and optionally indicate how the rows in the resulting dataset should be extended based on the columns in the two datasets:
- Designate existing columns to match between the datasets.
- Create shared columns to match between the datasets.
- Without shared columns, calculate a Cartesian product across both datasets.
Extend with Shared Columns
Ensure that columns you intend to match between the two datasets share the same column names and types; see the Change skill. The initial dataset is extended via a left join with the second dataset. For each row of the initial dataset, if its values in the specified columns match a row in the second dataset, that row is extended with the columns from the second dataset, populated with the values from the matching row.
Ensure that columns you don't intend to match are named differently. For example, to avoid matching on different "id" columns in datasets called "Person" and "Office", use Rename to rename the columns to "Person_id" and "Office_id" if you want to retain the full content of each "id" column.
Extend the initial dataset with the second dataset by entering into the GEL input field:
Extend the dataset <dataset1> with the dataset <dataset2>
The resulting dataset is named "<dataset1>_Extend", and the extended columns from the second dataset are suffixed with an "E".
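The left-join behavior described above can be sketched in plain Python. The column names and sample rows are hypothetical, and this illustrates only the matching semantics, not DataChat's implementation:

```python
# Left-join semantics of Extend: every row of the initial dataset is kept;
# rows with a match contribute the second dataset's extra columns.
person = [
    {"Person_id": 1, "name": "Ada", "office": "HQ"},
    {"Person_id": 2, "name": "Max", "office": "Lab"},
]
office = [
    {"office": "HQ", "city": "Madison"},
]

extended = []
for row in person:
    match = next((o for o in office if o["office"] == row["office"]), None)
    new_row = dict(row)
    # Extended columns are filled from the matching row, else left empty.
    new_row["city"] = match["city"] if match else None
    extended.append(new_row)

print(extended[0]["city"])  # Madison
print(extended[1]["city"])  # None (no match in the second dataset)
```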
Extend without Shared Columns
If you extend one dataset with another dataset without a shared column or a designated primary/foreign key relationship, the resulting operation, a Cartesian product, is computationally expensive and time-consuming, and it can produce a very large dataset. DataChat asks whether you wish to continue with this unusual request and computes the Cartesian product if you click "Yes".
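To see why this grows so quickly, here is a small sketch of a Cartesian product using Python's standard library, with made-up row counts:

```python
from itertools import product

# Cartesian product: with no join key, every row of one dataset pairs with
# every row of the other, so sizes multiply.
left = [{"a": i} for i in range(100)]
right = [{"b": j} for j in range(100)]

pairs = [{**l, **r} for l, r in product(left, right)]
print(len(pairs))  # 10000 rows from two 100-row datasets
```

Two 100-row datasets already yield 10,000 rows; two million-row datasets would yield a trillion.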
You can also use the Extend form. Select the dataset to extend in the dataset panel and then select Dataset > Extend in the sidebar:
Select the dataset you want to extend onto your current dataset.
Optionally, specify the columns to use as a join key between the datasets. A join key is used to match values between the two datasets.
Optionally, add or remove columns from the join key.
Separate Column Data
DataChat recommends using the Extract skill to ensure that your date and time columns are in a useful format.
To extract date parts from the dataset, select Column > Create > Extract in the sidebar. The Extract form appears:
Select the column from which you want to extract date parts.
Select the date parts you want to extract.
Optionally, give the columns a new name by entering a comma-separated list of names to use for the new columns. You can use the + button to add more extract commands.
To use the GEL input field, enter:
Extract the <date parts> from <column> and call the output columns <names> for columns of Date/Time type.
Extract attribute <attributes> from <json column> for columns that contain JSON values.
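Date-part extraction follows the same shape as Python's datetime accessors. A small stdlib sketch with a hypothetical timestamp column (the values are illustrative, and this is an analogy for the skill's behavior, not its implementation):

```python
from datetime import datetime

# Extracting date parts (year, day of week) from a timestamp column.
timestamps = ["2024-03-15 09:30:00", "2024-12-01 18:45:00"]  # made-up values

parsed = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in timestamps]
years = [d.year for d in parsed]            # extract the year
weekdays = [d.strftime("%A") for d in parsed]  # extract the day of the week

print(years)     # [2024, 2024]
print(weekdays)  # ['Friday', 'Sunday']
```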
To split a column into parts, select Column > Create > Split in the sidebar. The Split form appears:
Select the column to split.
Select whether to split the values in the column based on position (such as the third character) or by a delimiter (such as a comma).
Enter the position or delimiter to split by.
Optionally, enter a comma-separated list of labels to use for the new columns.
To use the GEL input field, see
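Splitting by a delimiter behaves like a per-row string split that fans out into new labeled columns. A minimal sketch with hypothetical name data:

```python
# Splitting a column by a delimiter into two new labeled columns.
full_names = ["Lovelace, Ada", "Hopper, Grace"]  # illustrative values

last_names, first_names = [], []
for value in full_names:
    last, first = value.split(", ", 1)  # delimiter split into two parts
    last_names.append(last)
    first_names.append(first)

print(last_names)   # ['Lovelace', 'Hopper']
print(first_names)  # ['Ada', 'Grace']
```

A position-based split works the same way, except each value is sliced at a fixed character index instead of at a delimiter.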
After you've cleaned your data, you’ll want to calculate interesting values across the dataset. Some of these values might be calculated for every row. In DataChat, these values would go into a new column. In other cases, you’ll want to calculate something like the average or count across multiple rows.
Use the Compute skill to perform various calculations, such as averages and sums. For example, if you had a dataset that contained every farm in the world, their total acreage, their country, and their state, and you wanted to compute the average acreage of farms across countries and states, you could say Compute the average acres for each country, state. The resulting dataset would contain the "country" and "state" columns along with the computed average.
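The grouping semantics of that example can be sketched in plain Python. The farm rows are invented, and this shows only the group-and-aggregate logic, not how DataChat executes it:

```python
from collections import defaultdict

# Average acres for each (country, state) group.
farms = [
    {"country": "US", "state": "WI", "acres": 100},
    {"country": "US", "state": "WI", "acres": 300},
    {"country": "US", "state": "IA", "acres": 250},
]

totals = defaultdict(lambda: [0, 0])  # (country, state) -> [sum, count]
for row in farms:
    key = (row["country"], row["state"])
    totals[key][0] += row["acres"]
    totals[key][1] += 1

averages = {key: s / n for key, (s, n) in totals.items()}
print(averages[("US", "WI")])  # 200.0
print(averages[("US", "IA")])  # 250.0
```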
To compute an aggregation or calculation, select the dataset in the dataset panel and then select Dataset > Compute in the sidebar. The Compute form appears:
Select the aggregation you want to compute.
Select the column you want to compute.
Give the new column a name. Optionally, click the + button to add additional aggregations to compute, or click the - button to remove additional aggregations.
Optionally, enter the columns to group by.
Optionally, enter the columns or values to sort by.
Click Submit to run the computation.
Create Pivot Tables
Pivot tables quickly summarize large amounts of data. You can use the
Pivot skill to create your pivot tables.
To create a pivot table with a form, select Dataset > Pivot in the sidebar. The Pivot form appears:
Select the aggregate (such as an average) for the column that you want to pivot on.
Select the columns to pivot on.
Select the columns to create pivot rows from.
Select the columns to create pivot columns from.
To use the GEL input field, see
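The pivot-cell logic above can be sketched in plain Python: each combination of a pivot row and a pivot column becomes a cell holding an aggregate. The sales rows are illustrative, and this is an analogy for the behavior, not the implementation:

```python
from collections import defaultdict

# Pivot semantics: group values into (row, column) cells, then aggregate.
sales = [
    {"region": "East", "quarter": "Q1", "amount": 10},
    {"region": "East", "quarter": "Q1", "amount": 30},
    {"region": "East", "quarter": "Q2", "amount": 50},
    {"region": "West", "quarter": "Q1", "amount": 40},
]

cells = defaultdict(list)
for row in sales:
    cells[(row["region"], row["quarter"])].append(row["amount"])

# Average aggregate for each pivot cell.
pivot = {key: sum(v) / len(v) for key, v in cells.items()}
print(pivot[("East", "Q1")])  # 20.0
print(pivot[("West", "Q1")])  # 40.0
```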
Reshape converts your data to either wide or long form. This allows you to specify one or more columns to act as unique identifiers for rows and create new columns from the values of your specified columns. Each row in the resulting dataset represents a combination of the column values.
To reshape a table:
Click Dataset > Reshape to open the Reshape form.
Select either Wide Form or Long Form.
Enter the Row Identifiers.
Enter the Values.
Optionally, enter a name for the variables column and the value column.
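The wide-to-long direction of reshaping can be sketched in plain Python: the row identifiers stay fixed while each value column becomes a (variable, value) pair. Column names and data are illustrative:

```python
# Wide-to-long reshape: row identifiers stay fixed, value columns become
# (variable, value) pairs, one output row per combination.
wide = [
    {"city": "Madison", "Q1": 5, "Q2": 7},
    {"city": "Austin", "Q1": 3, "Q2": 9},
]

long_form = []
for row in wide:
    for variable in ("Q1", "Q2"):                  # the "Values" columns
        long_form.append({"city": row["city"],     # row identifier
                          "variable": variable,
                          "value": row[variable]})

print(len(long_form))  # 4 rows: 2 identifiers x 2 value columns
print(long_form[0])    # {'city': 'Madison', 'variable': 'Q1', 'value': 5}
```

Long-to-wide is the inverse: each distinct value in the variables column becomes its own column again.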
You can use expressions to help clean and wrangle your data. Expressions are reusable shortcuts you can use to perform calculations and other operations quickly.
For example, if you repeatedly need to calculate the average Sales amount for customers between the ages of 18 and 34, you could define two expressions:
- An aggregate expression called AvgSales, which is defined as the average Sales
- A predicate expression called 18to34, which is defined as Age is between the values 18 to 34
There are five types of expressions:
- Aggregate expressions that compute an aggregate, such as the average Sales example above.
- Predicate expressions that perform an action on the data based on whether the data satisfies one or more conditions, such as the Age is between the values 18 to 34 example above.
- Extract expressions that extract a time component from a datetime column. See Extract for all of the available options. When registering expressions in bulk, you can extract the:
- seconds since epoch
- day of the week
- first day of the week
- first day of the month
- Math expressions that compute a math expression using one or more columns.
- Aggregate math expressions that use both aggregations, such as median, and math expressions, such as exponents, ceiling, or absolute values, to compute a value from one or more columns.
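The AvgSales and 18to34 examples can be sketched as reusable pieces in plain Python: a predicate that filters rows and an aggregate applied to the rows that pass. The customer data is made up, and this only illustrates how the two expression types compose:

```python
# Reusable expression sketches: a predicate and an aggregate, combined.
# The names mirror the AvgSales / 18to34 example; the rows are invented.
customers = [
    {"Age": 25, "Sales": 100},
    {"Age": 40, "Sales": 500},
    {"Age": 30, "Sales": 300},
]

def is_18_to_34(row):          # predicate expression: a row-level condition
    return 18 <= row["Age"] <= 34

def avg_sales(rows):           # aggregate expression: a value across rows
    return sum(r["Sales"] for r in rows) / len(rows)

result = avg_sales([r for r in customers if is_18_to_34(r)])
print(result)  # 200.0
```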
To create a new expression, click the Define button in the sidebar to open the form. Then:
Select the type of expression you want to define.
Enter a new expression name.
Select options from the available dropdown list to build your expression.
You can also use the Define skill in the GEL input field.
To remove expressions, enter the following skill pattern in the GEL input field:
Forget the expressions <list of expressions> where
<list of expressions> is a comma-separated list of the expressions you want to remove.
After you've defined an expression, you can use it with many other skills throughout DataChat.
While working with your data, you might want to add notes to organize and label different datasets, computations, and other outputs. To add a note from within a session:
Click Note in the sidebar.
Select the color of the note.
Enter your note.