Compute
Description
Compute
lets you perform calculations and aggregations on one or more columns on your dataset. You can also group values based on certain columns, sort the results, name the resulting columns, and more.
Format
Compute
uses a single format with several variations:
Compute the <aggregations> for each <group> where <predicates> sorted by <column name, order> such that <statement> showing the <limit, number of rows> and call the computed columns <labels>
Parameters
The parameters used in Compute
include:
-
aggregations
(required). A comma-separated list of calculations to perform, including:average <numeric column> | <math expression>
. Returns the average of all the values in the column or values returned by the expression.concatenation of <column> using the delimiter <delim>
. Concatenates the values of the given column into a single string using the specified delimiter between each value.count of <records> | <column>
. Returns the number of records (including duplicates) in the dataset or specified column.distinct count of <records> | <column>
. Returns the number of records (excluding duplicates) in the dataset or specified column.expression <expression>
. Returns the output of the expression.maximum <column> | <math expression>
. Returns the maximum value in the column or in the values returned by the expression. This is a synonym for thelargest
aggregation.median <column> | <math expression>
. Returns the median value in the column or in the values returned by the expression.minimum <column> | <math expression>
. Returns the smallest value in the column. This is a synonym for theleast
aggregation.percentage of <column>
. If no grouping columns are specified withfor each
, this returns the percentage of records that are not null. If grouping columns are specified, this returns the percentage of the target column's total that is made up by each grouping column.percentile of <column> at <percent>
. Returns the value of the column at the specified percentile.percentage of <column>
. This returns the percentage each grouping column makes up of the total of the target column.distinct percentage of <column>
. This returns the percentage of distinct values in each grouping column of the target column.proportion of <column>
. This returns the proportion (up to 1) of valid records in the target column divided by the total number of records in that column. If no grouping columns are specified withfor each
, this returns the proportion of rows that are not null.distinct proportion of <column>
. This returns the proportion of distinct values in the column.standard deviation of <numeric column> | <math expression>
(numeric columns only)total <numeric column> | <math expression>
(numeric columns only)group
(optional). One or more columns or column groups used to group the results of the aggregations.
-
predicates
(optional). Operators used to compare two values, including:is after <aggregate date> | <date or time value>
is before <aggregate date> | <date or time value>
is (not) between the dates <date1> to <date2>
is (not) equal to the aggregate date minimum | maximum <date or time value>
is on or before <date or time value>
is on or after <date or time value>
is (not) null
is (not) equal to the column <column name>
is (not) equal to <value>
is (not) equal to the aggregate value <value>
is greater than <value>
is less than <value>
is one of <string>
starts with <string>
does not start with <string>
ends with <string>
does not end with <string>
contains <string>
does not contain <string>
-
limit, number of rows
(optional). Limits the resulting dataset to the specified number of rows in the specified part of the dataset. You can choose from the top, bottom, first, and last parts of the dataset. For example, "top 5" limits the resulting dataset to the first five rows. -
column name, order
(optional). Used to sort the resulting dataset based on one or more columns in ascending or descending order. -
statement
(optional). Allows you to run a secondary calculation (also known as a nested query) to further refine the results of your calculations. -
labels
(optional). A comma-separated list of labels used as the names of any newly-created columns. Note that this list must match the number of aggregations. Otherwise, the newly-created columns are given names in the order of the aggregations.
Output
If the calculations are successful, the updated dataset becomes [dataset]_Compute.
Examples
Consider a dataset called "Titanic" that contains information on each passenger, including the following columns:
Age
. Their age.Gender
. Their gender.Name
. Their name.PClass
. Their class.Survived
. Whether they survived the disaster.
To compute the average age for all passengers, enter Compute the average Age
To compute the standard deviation of age for passengers in each class and sort the result by class values in ascending order, enter Compute the standard deviation of Age for each PClass sorted by PClass in ascending order
To compute both the percentages of first class passengers who survived and those who did not, enter Compute the percentage of records for each Survived where PClass is equal to the value 1
To compute the number of passengers of each gender in each class group, enter Compute the count of records for each PClass, Gender
To compute the number and average ages of passengers who did and didn't survive and show the results in two new columns named AvgAge
and NumPeople
, respectively, enter Compute the count of records, average Age for each Survived calling the output columns NumPeople, AvgAge
To compute the number of people in each passenger class that have average age greater than 30, enter Compute the count of records for each PClass such that average Age is greater than 30
Computing Proportions vs. Percentages
Consider the following dataset:
ID | Value |
---|---|
1 | 100 |
1 | 100 |
2 | 200 |
3 | 400 |
Computing a proportion calculates a percentage of non-null rows as a proportion of the total data. For example, running Compute the proportion of Value for each ID
on the example dataset returns the following dataset:
ID | ProportionOfValue |
---|---|
1 | .5 |
2 | .25 |
3 | .25 |
In this case, of the four rows in the dataset, two of them have an ID of 1 (50 percent of the total) and there is one row for each ID of 2 and 3 (25 percent of the total, respectively).
If we'd like to compute a distinct proportion, we can run Compute the distinct proportion of Value for each ID
, which returns the following dataset:
ID | ProportionOfValue |
---|---|
1 | .33 |
2 | .33 |
3 | .33 |
In this case, of the four rows in the dataset, there are three unique ID values: 1, 2, and 3.
Computing a percentage calculates the percentage each grouping column makes up of the total of the target column. For example, running Compute the percentage of Value for each ID
on the example dataset returns the following dataset:
ID | PercentageOfValue |
---|---|
1 | 25% |
2 | 25% |
3 | 50% |
In this case, we're computing the percentage of the total of the Value column made up by the values of each ID. Here, the total of the Value column is 800, and the two rows with an ID of 1 combine to make up 25 percent (200/800) of that total.
If we'd like to compute a distinct percentage, we can run Compute the distinct percentage of Value for each ID
, which returns the following dataset:
ID | ProportionOfValue |
---|---|
1 | 33.33% |
2 | 33.33% |
3 | 33.33% |
In this case, of the four rows in the dataset, there are three unique ID values: 1, 2, and 3.