Shape
After you've explored your data, you're ready to prepare it for analysis and visualization. Some ways to shape your data include:
- Drop or Keep Rows or Columns with
Drop
andKeep
- Add Columns to a Dataset with
Create
- Adjust Column Characteristics with
Change
,Convert
,Flip
, andRename
- Bundle Columns into Groups with
Define
- Group Rows into Bins with
Bin
- Adjust Dataset Values with
Clean
,Fill
, andRemove
- Adjust Dataset Appearance with
Bring
andSort
- Create a new dataset with the Dataset Builder or the
Create
skill – to create, for example, a mapping dataset to help clean your data. - Remove datasets with
Forget
.
When a skill is applied to a dataset:
- If the skill creates a new dataset, it will use the convention
[dataset]_[Skill]
. - If the skill alters your existing dataset, it will use the convention
[dataset] v[x]
to save to a new version.
Drop or Keep Rows or Columns
You might want to limit the number of rows or columns in your dataset with:
Drop Columns
There are a couple of ways to drop
columns:
- The column header
- The Column button
Column Header
In the dataset panel, click the menu next to the column name in the dataset panel, then click Drop.
The chosen column is removed from your dataset.
Column Button
You can also select Column > Drop in the sidebar, which is useful for dropping multiple columns. The Drop Columns form appears:
- Select the columns to drop.
- Click Submit.
Drop Rows
There are a couple of ways to drop rows:
- The Row button
- The cell menu
Row Button
To Drop
a row, select Row > Drop in the sidebar. The Drop form appears.
- Select the column that selects the rows to be dropped.
- Select an expression that identifies the rows to be dropped.
- Fill out the values of the expression.
- Optionally, click the + button to add more columns and conditions. When you enter more than one column and condition, a toggle appears – choose whether all or any of the conditions drop the row.
- Click Submit.
To build a DataChat sentence in the chat box, see Drop
.
Cell Menu
To conditionally drop rows that match the value of a specific cell:
- Right-click the cell containing the value you want to drop across all rows.
- To drop all of the rows that match that value, click Drop rows matching {value}.
- To open the Drop form to add more conditions, click Drop rows matching {value} and ... and complete the form.
Keep Columns
To Keep
a column, select Column > Keep in the sidebar. The Keep form appears.
- Select the rows to keep.
- Click Submit.
Keep Rows
There are a couple of ways to keep rows:
- The Row button
- The cell menu
Row Button
To Keep
a row, select Row > Keep in the sidebar. The Keep form appears.
- Select the column that selects the rows to be kept.
- Select an expression that identifies the rows to be kept.
- Optionally, click the + button to add more columns and conditions. Click the - button to remove a column and condition.
- When you enter more than one column and condition, choose whether all or any of the conditions keep the row.
- Click Submit.
To build a DataChat sentence in the chat box, see Keep
.
Cell Menu
To conditionally keep rows that match the value of a specific cell:
- Right-click the cell containing the value you want to keep across all rows.
- To keep all of the rows that match that value, click Keep rows matching {value}.
- To open the Keep form to add more conditions, click Keep rows matching {value} and ... and complete the form.
Add Columns to a Dataset
To create a new column, select the method with which to create the column.
Choose from among five methods:
- Expression
- Set as a Value
- Concatenate
- Text
- Window
To create a column using an expression:
- Enter a name for your column.
- Enter an expression that operates on existing columns. Or, fill the column with an increasing sequence.
To create a column by conditionally setting it to a value:
- Enter the value to which the new column will be set if your expression is true for that row.
- Enter the column whose values to test against your condition.
- Create your expression.
- Optionally, add more conditions and expressions.
- Enter the default if your expression is false for that row.
- Click Submit.
To create a column by concatenating one or more values:
- Enter a name for the column.
- Choose either an existing column or enter text. Each element is concatenated—in order—to generate the column value.
- Add or subtract elements to adjust the column value. At least one element must be specified.
- Optionally, choose a delimiter.
- Click Submit.
Enter the text with which to populate all values of the new column.
- Enter a name for the column.
- Enter the text with which to populate all values of the new column.
To create a column as a window column:
- Enter a name for the column.
- Enter the text with which to calculate a window column. Use autocomplete within the field for prompts to complete the DataChat sentence.
Adjust Column Characteristics
You can change the column characteristics in your dataset:
Change the Column Type
To view your dataset's column types, see the "Representation" column of the Describe
skill's table output.
To change the type of the column with a form, click the Column button in the sidebar and select Change. The Change the Column Type form appears:
- Select the column whose type to change.
- Select the new type for the column.
- Optionally, add another column to change and its new type.
- Click Submit.
To use the chat box, enter: Change the type of the column <column name> to <type>
Rename a Column
Rename a column in several different ways:
Sidebar Form
Click the Column > Rename button.
Fill out the "Rename Using the Dataset" form to rename one or more columns.
Dataset Panel
Double-click on the column name in the dataset panel to make the column name editable and enter the new name.
Or, click on the menu next to the column name and select "Rename". The cursor relocates to inside the editable column name as above.
Chat Box
To compose a DataChat sentence with the Rename
skill, enter in the chat box:
Rename the column <column> to <new name>
Column names are case-insensitive.
Convert an Interval Column
Interval types are time-based:
- days
- hours
- minutes
- months
- seconds
- weeks
- years
To convert a column from one type of interval to another, enter into the chat box: Convert the interval column <column> to <interval type>
. You might first use Extract
to separate date parts from your Date type column.
Bundle Columns into Groups
Some skills in DataChat can use a bundle of columns as a group (as opposed to a single column or an explicit list) as an input. This bundle of columns is called a "column group" in DataChat. For example, the for each
clause in Compute
takes a single column, a list of columns, or a column group. Column groups are essentially nicknames you assign to a list of related columns (such as a group of columns that together make up demographic information). You can then refer to those columns using the name of the column group instead of entering each of their names every time you want to use the group in a skill.
But, in order to use a column group, you must create it first. You can use the Define
skill to do that. For example, you could define and use a column group that contains the demographic information of the passengers aboard the Titanic:
Load the dataset.
Load data from the file called titanic.csv
Define the column group as the Age and Gender columns.
Define a column group called Demographics as the columns called Age, Gender
Use the column group in a skill like Compute.
Compute the count of records for each column in column group Demographics
instead of
Compute the count of records for each Age, Gender
To use Define
with expressions, see Define Expressions.
Group Rows into Bins
To bin values into categories like “High” or “Low” you can use the Bin
skill. Note, this will leave your original column and add a new column with the bins. You will need to decide whether each bin should contain an equal number of records (based on percentile), an equal range of values (based on width), or use custom intervals. The Bin
skill also creates a secondary dataset that contains metadata of the bins, such as the start and end boundaries of each bin, that you can use as you would any other dataset.
To group a column's values into bins:
- Select the column whose values to bin.
- Select and configure how the values should be binned. Refer to the reference page for Bin for more information on these options.
- Enter the values used to bin the column.
- Optionally, enter a list of names to use for the new bins.
- Click Submit.
You can also use Bin
in the chat box. For example, you can say Bin the column called Age based on percentile setting the number of intervals to 3 and call the bins Low, Medium, High
.
Adjust Dataset Values
The Clean
skill can be used in most cases to manipulate values.
To clean one or more columns:
- Select whether to clean all columns of a specific type or a specific string or numeric column.
- Select the columns you'd like to clean.
- Enter the new value you want to use.
- Click Submit.
You can also use Clean
in the chat box. For example, enter: Clean the string column Name by deleting the phrase Mr
.
Adjust Dataset Appearance
Sort
To sort your data:
- Select the columns by which you want to sort the data.
- Select whether you want to sort the data in ascending or descending order.
- Click Submit.
Or, click on the menu next to the column name:
Bring
To bring a column to the front of your dataset, enter into the chat box: Bring to the front the columns <columns>
Add a Caption
To add a caption to a table, switch to Notebook mode and describe your table in the Caption this table field. Click the Save button to save your changes.
Edit Data
To modify your dataset directly, switch to Notebook mode, then click the Edit Mode button:
Toggle Notations
To display the values of numeric columns in standard or scientific notation, switch to Notebook mode, then click the Convert Notation toggle next to the column name. Numbers in scientific notation are shown with two significant digits. Numbers in standard notation are shown with up to 15 significant digits. Click the Save button to save your changes:
Create a New Dataset
You can create a dataset with the Dataset Builder or with the Create skill.
Use the Dataset Builder
The recommended way to create a new dataset is to use the Dataset Builder and it's familiar spreadsheet layout.
It works just like other spreadsheet tools, with a couple caveats:
- Your column names must be in the first row.
- You must remove any empty columns that are between populated columns before saving the dataset.
Create a Dataset from Scratch
To create a new dataset from scratch with the Dataset Builder:
- In the sidebar, click Dataset > Create.
- Click the Create a Dataset from Scratch button.
Create a Dataset from an Existing Dataset
You can use the Dataset Builder to create a dataset from the contents of an existing dataset:
- Load the dataset you want to use as the foundation of the new dataset.
- Open the Dataset Builder
- Click the dataset you loaded in step 1.
- Make and save your changes in the Dataset Builder.
Create Multiple Datasets
You can create multiple datasets in the Dataset Builder by simply adding a "sheet:"
Use Functions
The Dataset Builder supports a wide array functions for a performing a variety of calculations. To use a function:
- Click the cell that should hold the output of the function.
- Add your function using the syntax
=<function name>(<cell or start of cell range>:<end of cell range>)
. For example, to calculate the sum of the cells A2 and B2, enter=SUM(A1:B2)
. - Press Enter to save your function.
Refer to this page for a comprehensive list of all the available functions.
Name Your Dataset
Your dataset's name is the same as its "sheet" in the Builder. You can change the name of your dataset by either double-clicking or right-clicking the sheet's tab and clicking Rename:
Save Your Dataset
Before you save, make sure:
- Your column names are in the first row.
- You've removed any empty columns that are between populated columns.
When you're finished creating your datasets, click Save Dataset to save them in your session.
You can continue to make changes to your dataset after saving as long as the Dataset Builder is open.
Remove Datasets
To remove a dataset from your session, click the X at the upper right corner of a dataset's tab in the dataset panel to Forget
that dataset.
Or, enter in the chat box:
Forget the datasets <dataset names>
To remove a specific version of the dataset, click the version dropdown of a dataset in the dataset panel and click the X next to version to forget.
Or, enter in the chat box:
Forget the datasets <dataset name>, version <dataset version>