Version: 0.18.3

Shape

After you've explored your data and are comfortable with it, you're ready to begin shaping data to prepare it for visualization, machine learning, and other uses. To shape your data, you might:

  • Create a new dataset, such as a mapping dataset, to help clean your data.
  • Fill in missing values.
  • Replace values.
  • Drop rows or columns based on certain criteria.
  • Add new columns.
  • Group columns with many precise values (such as ages or salaries) into more manageable "bins."
Note

When you use a skill on a dataset, the output dataset is titled [dataset]_[Skill] or [dataset] v[x], depending on the skill. A skill that creates a new dataset uses the convention [dataset]_[Skill]. A skill that alters your existing dataset into a new version uses the convention [dataset] v[x].
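
The naming convention can be sketched in a few lines of Python. This is only an illustration of the two title patterns described in the note, not DataChat code: the function name `output_title`, its parameters, and the `sales` dataset are hypothetical, and the version number is supplied by the caller rather than derived from any real versioning rule.

```python
def output_title(dataset: str, skill: str, creates_new_dataset: bool, version: int = 1) -> str:
    """Build an output title following the two patterns in the note (illustrative only)."""
    if creates_new_dataset:
        # Pattern for skills that produce a separate dataset: [dataset]_[Skill]
        return f"{dataset}_{skill}"
    # Pattern for skills that alter the existing dataset: [dataset] v[x]
    return f"{dataset} v{version}"


print(output_title("sales", "Skill", creates_new_dataset=True))
print(output_title("sales", "Skill", creates_new_dataset=False, version=3))
```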

Add Columns

Use the Create skill to add a column to a dataset.

Drop Rows or Columns

To drop a column, click the ellipsis in the top right corner of the column header, then click Drop.

Alternatively, click the Column or Row button in the menu bar, then select Drop.

In the Drop form:

  1. Select the columns you want to drop.
  2. Click Submit.

the form for the Drop skill

You can also use Drop in the chat box. For example, you can enter Drop the rows where the column called <column name> is less than 5 to drop every row whose value in that column is less than 5.
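
To make the effect of that chat example concrete, here is a minimal Python sketch of the same filter. It is not how DataChat implements Drop; the sample rows, the `score` column, and the helper `drop_rows_less_than` are made up for illustration.

```python
# Hypothetical sample data: each row is a dict keyed by column name.
rows = [
    {"id": 1, "score": 3},
    {"id": 2, "score": 5},
    {"id": 3, "score": 8},
]


def drop_rows_less_than(rows, column, threshold):
    """Return the rows that remain after dropping those where column < threshold."""
    return [row for row in rows if not row[column] < threshold]


remaining = drop_rows_less_than(rows, "score", 5)
print([row["id"] for row in remaining])  # [2, 3]
```

Note that the row whose value is exactly 5 stays, because 5 is not less than 5.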

Keep Rows or Columns

To keep a column, right-click the name of the column to open the Keep form with that column already selected. You can then add other columns inside the form. Alternatively, click the Column or Row button in the menu bar, then select Keep. In the Keep form:

  1. Select the columns or rows you want to keep.
  2. Click Submit.

the form for the Keep skill

You can also use Keep in the chat box. For example, you can enter Keep the rows where the column called <column name> is greater than 5 to keep only the rows whose value in that column is greater than 5. Every other row, including rows whose value is exactly 5, is dropped.
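
Keep and Drop are complements: keeping the rows greater than 5 leaves the same rows as dropping those that are 5 or less. The Python sketch below illustrates that relationship only. The sample rows, the `score` column, and both helper functions are hypothetical and are not part of DataChat.

```python
# Hypothetical sample data: each row is a dict keyed by column name.
rows = [
    {"id": 1, "score": 3},
    {"id": 2, "score": 5},
    {"id": 3, "score": 8},
]


def keep_rows_greater_than(rows, column, threshold):
    """Return only the rows where column > threshold."""
    return [row for row in rows if row[column] > threshold]


def drop_rows_at_most(rows, column, threshold):
    """Return the rows left after dropping those where column <= threshold."""
    return [row for row in rows if not row[column] <= threshold]


kept = keep_rows_greater_than(rows, "score", 5)
print([row["id"] for row in kept])  # [3]
print(kept == drop_rows_at_most(rows, "score", 5))  # True
```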

Clean Values

The Clean skill can be used in most cases to manipulate values.

To clean one or more columns:

  1. Select whether to clean all columns of a specific type or a specific string or numeric column.
  2. Select the columns you'd like to clean.
  3. Enter the new value you want to use.
  4. Click Submit.

the form for the Clean skill

You can also use Clean in the chat box. For example, you can enter Clean the string column Name by deleting the phrase Mr.
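
As a rough picture of what deleting a phrase from a string column means, consider the Python sketch below. It is an assumption-laden illustration, not the Clean skill itself: the sample names and the helper `delete_phrase` are invented, and collapsing the leftover whitespace is a choice made here that DataChat may or may not make.

```python
# Hypothetical values for a string column called Name.
names = ["Mr. Owen Braund", "Mrs. Florence Cumings", "Mr. William Allen"]


def delete_phrase(values, phrase):
    """Remove every occurrence of phrase, then collapse leftover whitespace."""
    cleaned = []
    for value in values:
        without_phrase = value.replace(phrase, "")
        # split() + join() trims the ends and squeezes repeated spaces (an assumption).
        cleaned.append(" ".join(without_phrase.split()))
    return cleaned


print(delete_phrase(names, "Mr."))
# ['Owen Braund', 'Mrs. Florence Cumings', 'William Allen']
```

The value that starts with "Mrs." is unchanged because the exact phrase "Mr." does not occur in it.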

Bin Values

To bin values into categories like “High” or “Low”, use the Bin skill. Bin leaves your original column in place and adds a new column that contains the bins. You need to decide whether each bin should contain an equal number of records (based on percentile), cover an equal range of values (based on width), or use custom intervals. The Bin skill also creates a secondary dataset that contains metadata about the bins, such as the start and end boundaries of each bin, which you can use as you would any other dataset.

To group a column's values into bins:

  1. Select the column whose values to bin.
  2. Select and configure how the values should be binned. Refer to the reference page for Bin for more information on these options.
  3. Enter the values used to bin the column.
  4. Optionally, enter a list of names to use for the new bins.
  5. Click Submit.

the form for the Bin skill

You can also use Bin in the chat box. For example, you can enter Bin the column called Age based on percentile setting the number of intervals to 3 and call the bins Low, Medium, High.
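
The difference between percentile-based and width-based binning is easiest to see on skewed data. The following Python sketch contrasts the two ideas under simplifying assumptions. It is not the Bin skill's algorithm: the helper names and sample ages are hypothetical, ties are broken by position, and boundary handling in DataChat may differ.

```python
def bin_by_width(values, n_bins, labels):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything falls in the first bin.
        return [labels[0]] * len(values)
    # The maximum value would index one past the end, so clamp to the last bin.
    return [labels[min(int((v - lo) / width), n_bins - 1)] for v in values]


def bin_by_percentile(values, n_bins, labels):
    """Assign values to n_bins bins holding (nearly) equal numbers of records."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [None] * len(values)
    for rank, i in enumerate(order):
        result[i] = labels[rank * n_bins // len(values)]
    return result


ages = [2, 4, 6, 8, 10, 90]  # hypothetical, deliberately skewed
labels = ["Low", "Medium", "High"]

print(bin_by_width(ages, 3, labels))
# ['Low', 'Low', 'Low', 'Low', 'Low', 'High']
print(bin_by_percentile(ages, 3, labels))
# ['Low', 'Low', 'Medium', 'Medium', 'High', 'High']
```

With equal widths, the single outlier stretches the range so that most records land in one bin. With percentiles, each bin receives two records regardless of how the values are spread.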

Edit Data

To modify your dataset directly, switch to Notebook mode, then click the Edit Mode button:

the edit mode button

Toggle Notations

To display the values of numeric columns in standard or scientific notation, switch to Notebook mode, then click the Convert Notation toggle next to the column name. Numbers in scientific notation are shown with two significant digits. Numbers in standard notation are shown with up to 15 significant digits. Click the Save button to save your changes:
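
For a sense of what the two precisions look like, the Python sketch below formats one number both ways. It approximates the description above using Python's own format specifiers. The function names are hypothetical, and the exact rendering in DataChat (exponent style, digit grouping, rounding) may differ.

```python
def scientific(value: float) -> str:
    """Format with two significant digits: one before and one after the point."""
    return f"{value:.1e}"


def standard(value: float) -> str:
    """Format with up to 15 significant digits, trimming trailing zeros.

    Caveat: Python's "g" format falls back to exponent form for very large
    or very small magnitudes, which a display toggle might not do.
    """
    return f"{value:.15g}"


print(scientific(12345.678))  # 1.2e+04
print(standard(12345.678))    # 12345.678
```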

the toggle button

Add a Caption

To add a caption to a table, switch to Notebook mode, then describe the table in the Caption this table field. Click the Save button to save your changes.

the table caption field

Define Column Groups

Some skills in DataChat can take a bundle of columns as input, as opposed to a single column or an explicit list. This bundle is called a "column group." For example, the for each clause in Compute takes a single column, a list of columns, or a column group. A column group is essentially a nickname you assign to a list of related columns (such as the columns that together make up demographic information). You can then refer to those columns by the name of the column group instead of entering each column name every time you use them in a skill.

Before you can use a column group, you must create it with the Define skill. For example, you could define and use a column group that contains the demographic information of the passengers aboard the Titanic by:

  1. Loading the dataset with Load data from the file called titanic.csv.
  2. Defining the column group as the Age and Gender columns with Define a column group called Demographics as the columns called Age, Gender.
  3. Using that column group in a skill like Compute with Compute the count of records for each column in column group Demographics instead of Compute the count of records for each Age, Gender.
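
Conceptually, a column group behaves like a named list of columns that is expanded wherever it is used. The Python sketch below models that idea under stated assumptions. The `column_groups` dictionary, the helper `count_for_each`, and the passenger rows are all hypothetical (the rows are not taken from titanic.csv), and the sketch assumes "for each" means one count per combination of values.

```python
from collections import Counter

# A column group modeled as a nickname for a list of column names.
column_groups = {"Demographics": ["Age", "Gender"]}

# Made-up rows for illustration only.
passengers = [
    {"Name": "A", "Age": 22, "Gender": "male"},
    {"Name": "B", "Age": 38, "Gender": "female"},
    {"Name": "C", "Age": 22, "Gender": "male"},
]


def count_for_each(rows, columns):
    """Count records for each combination of values in the given columns."""
    return Counter(tuple(row[c] for c in columns) for row in rows)


by_group = count_for_each(passengers, column_groups["Demographics"])
by_list = count_for_each(passengers, ["Age", "Gender"])

print(by_group[(22, "male")])  # 2
print(by_group == by_list)     # True
```

Referring to the group by name gives the same result as listing the columns, which is the convenience a column group provides.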