Skip to main content
Version: 0.18.3

Compute

Description‚Äč

Compute lets you perform calculations and aggregations on one or more columns on your dataset. You can also group values based on certain columns, sort the results, name the resulting columns, and more.

Format‚Äč

Compute the <aggregations> for each <group> where <predicates> sorted by <column name, order> such that <statement> showing the <limit, number of rows> and call the computed columns <labels>

Parameters‚Äč

The parameters used in Compute include:

  • aggregations (required). A comma-separated list of calculations to perform, including:
    • average <numeric column> | <math expression>. Returns the average of all the values in the column or values returned by the expression.
    • concatenation of <column> using the delimiter <delim>.¬†Concatenates the values of the given column into a single string using the specified delimiter between each value.
    • count of <records> | <column>. Returns the number of records (including duplicates) in the dataset or specified column.
    • distinct count of <records> | <column>. Returns the number of records (excluding duplicates) in the dataset or specified column.¬†
    • distinct percentage of¬†
    • expression <expression>. Returns the output of the expression.
    • largest¬†<column> | <math expression>. Returns the largest (maximum) value in the column or in the values returned by the expression. This is a synonym for the maximum aggregation.
    • least <column> | <math expression>. Returns the smallest (minimum) value in the column or in the values returned by the expression. This is a synonym for the minimum aggregation.
    • maximum¬†<column> | <math expression>. Returns the maximum value in the column or in the values returned by the expression. This is a synonym for the largest aggregation.
    • median <column> | <math expression>. Returns the median value in the column or in the values returned by the expression.
    • minimum <column> | <math expression>. Returns the smallest value in the column. This is a synonym for the least aggregation.
    • most <column> | <math expression>. Returns the value that appears the most often in the column or in the values returned by the expression.
    • percentage of <column>. If no grouping columns are specified with for each, this returns the percentage of records that are not null. If grouping columns are specified, this returns the percentage of the target column's total that is made up by each grouping column.
    • percentile of <column> at <percent>. Returns the value of the column at the specified percentile.
    • proportion of <column>. If no grouping columns are specified with for each, this returns the percentage of rows that are not null. If proportion of records is used, this returns the percentage of valid records in the target column divided by the total number of records in that column.
    • smallest <column> | <math expression>. Returns the smallest (minimum) value in the column or in the values returned by the expression. This is a synonym of the minimum and least
    • standard deviation of <numeric column> | <math expression> (numeric columns only)
    • absolute median deviation of <numeric column> | <math expression> (numeric columns only)
    • total <numeric column> | <math expression> (numeric columns only)
    • group (optional). One or more columns or column groups used to group the results of the aggregations.¬†Note that column groups are available only at skill level 3.
  • predicates (optional). Operators used to compare two values, including:
    • is after <aggregate date> | <date or time value>
    • is before <aggregate date> | <date or time value>
    • is (not) between the dates <date1> to <date2>
    • is (not) equal to the aggregate date <date or time value>
    • is on or before <date or time value>
    • is on or after <date or time value>
    • is (not) null
    • is (not) blank
    • is (not) equal to the column <column name>
    • is (not) equal to <value>
    • equals <value>
    • does not equal <value>
    • is (not) <value>
    • is (not) equal to the math expression <expression>
    • equal to the value <value>
    • is (not) greater than <value>
    • (not) greater than <value>
    • is (not) more than <value>
    • is at most <value>
    • is (not) smaller than <value>
    • (not) smaller than <value>
    • is (not) less than <value>
    • is at least <value>
    • less than <value>
    • matches the value <value>
    • starts with <string>
  • limit, number of rows (optional). Limits the resulting dataset to the specified number of rows in the specified part of the dataset. You can choose from the top, bottom, first, and last parts of the dataset. For example, "top 5" limits the resulting dataset to the first five rows.
  • column name, order (optional). Used to sort the resulting dataset based on one or more columns in ascending or descending order.
  • statement (optional). Allows you to run a secondary calculation (also known as a nested query) to further refine the results of your calculations.
  • labels (optional). A comma-separated list of labels used as the names of any newly-created columns. Note that this list must match the number of aggregations. Otherwise, the newly-created columns are given names in the order of the aggregations.

Output‚Äč

If the calculations are successful, the updated dataset becomes [dataset]_Compute.

Examples‚Äč

Consider a dataset called "Titanic" that contains information on each passenger, including the following columns:

  • Age. Their age.
  • Gender. Their gender.
  • Name. Their name.
  • PClass. Their class.
  • Survived. Whether they survived the disaster.

To compute the average age for all passengers, enter Compute average Age

To compute the standard deviation of age for passengers in each class and sort the result by class values in ascending order, enter Compute standard deviation of Age for each PClass sorted by PClass in ascending order

To compute both the percentages of first class passengers who survived and those who did not, enter Compute percentage of records for each Survived where PClass is equal to the value 1

To compute the number of passengers of each gender in each class group, enter Compute count of records for each PClass, Gender

To compute the number and average ages of passengers who did and didn't survive and show the results in two new columns named AvgAge and NumPeople, respectively, enter Compute count of records, average Age for each Survived calling the output columns NumPeople, AvgAge

To compute the number of people in each passenger class that have average age greater than 30, enter Compute count of records for each PClass such that average Age is greater than 30

Computing Proportions vs. Percentages‚Äč

Consider the following dataset:

IDValue
1100
1100
2200
3400

Computing a proportion calculates a percentage of non-null rows as a proportion of the total data. For example, running Compute the proportion of Value for each ID on the example dataset returns the following dataset:

IDProportionOfValue
150%
225%
325%

In this case, of the four rows in the dataset, two of them have an ID of 1 (50 percent of the total) and there is one row for each ID of 2 and 3 (25 percent of the total, respectively).

Computing a percentage calculates the percentage each grouping column makes up of the total of the target column. For example, running Compute the percentage of Value for each ID on the example dataset returns the following dataset:

IDPercentageOfValue
125%
225%
350%

In this case, we're computing the percentage of the total of the Value column made up by the values of each ID. Here, the total of the Value column is 800, and the two rows with an ID of 1 combine to make up 25 percent (200/800) of that total.