Extend
The Extend
skill lets you combine the contents of one dataset with another dataset. This can be helpful when you gather data from separate sources but want to analyze all of the data as a single dataset.
Format
The Extend
skill has several utterance variations:
Extend (the current dataset | the dataset <source dataset>) with the dataset <target dataset>
. Combines the source and target datasets using the shortest join path possible through connected datasets. If no join path exists, columns that both datasets have in common are used as join columns instead.Extend (the current dataset | the dataset <source dataset>) with the dataset <target dataset> and <keep all unmatched values>
. Combines the source and target datasets using a column that both datasets have in common while also keeping any rows in either dataset that didn’t match.Extend (the current dataset | the dataset <source dataset>) on <contained/overlapping> intervals <with delta set to> <delta>
. Combines two datasets on an interval column that's present in both datasets. Optionally, a delta can be supplied if you want to widen the coverage range of the intervals. A contained interval is one that is completely contained inside another. An overlapping interval is one that overlaps with another at some point.Extend (the current dataset | the dataset <source dataset>) with the dataset <target dataset> based on the similarity between <column A> and <column B>
. Combines the source and target datasets based on values in the two given columns that are similar. Similarity is defined by needing roughly one or two alterations between words for them to be the same when no method or threshold are specified.Extend (the current dataset | the dataset <source dataset>) with <target dataset> based on the similarity between <column A> and <column B> and use the similarity measure <measure> and set the threshold to <threshold>
. Combines the source and target datasets based on values in the two given columns that are similar based on the user-defined similarity method.Extend (the current dataset | the dataset <source dataset>) with the dataset <target dataset> where <predicates> and <keep all unmatched values>
. Combines the source and target datasets using the specific predicates given.
If you extend one dataset with another dataset, without a shared column or a designated primary/foreign key relationship, the resulting operation—a Cartesian product—is computationally expensive and time-consuming. It can result in a very large dataset. DataChat prompts if you wish to continue with this unusual request, and returns a Cartesian product if you click "Yes".
Parameters
When extending one dataset with another, you can use the following parameters to specify how the two datasets should be combined:
source dataset
(required). The name of the source dataset.target dataset
(required). The name of the target dataset.keep all unmatched values
(optional). If this parameter is used, all rows are kept even if they don’t have a matching value in the common column.interval
(optional).with delta set to
(optional).delta
(optional).column A
(required). A column with string values to compare against the values in column B.column B
(required). A column with string values to compare against the values in column A.measure
(optional). The method used to determine similarity between columns. The measures we support are:- Damerau_levenshtein. This is usually the best choice for everyday use. This is also the default similarity measure method.
- Jaro_winkler
- Jaccard
threshold
(optional). The minimum similarity value two values need to meet to be combined. The smaller this value is, the less similar words will be when they’re matched. We accept values between 0 and 1 (inclusive), where 1 is a perfect match. For example, two words with a similarity value of 0.8 are more similar than two words with a similarity value of 0.2.predicates
(required). Operators used to compare two values. See Compute for more information.
Output
If the datasets are successfully combined, the resulting dataset becomes [dataset]_Extend.
If the datasets cannot be combined, an error message appears in the chat box.
Examples
Consider two datasets, one called “Titanic” and another called “TitanicExtra.” Both datasets contain the following columns:
- Age. Their age.
- Gender. Their gender.
- Name. Their name.
- PClass. Their class.
- Cabin. The passenger’s cabin ID.
- Survived. Whether they survived the disaster.
If you wanted to simply combine these two datasets, enter Extend the dataset Titanic with the dataset TitanicExtra
. This combines the two datasets using all of the columns as join columns.
Since Dataset 2 did not include information about Vicky Ashton's gender, the gender value for Vicky Ashton in Dataset 3 is null.
If you wanted to join these two datasets based on matching names, enter Extend the current dataset with the dataset TitanicExtra based on similarity between Name and Name
.