Load lets you load data from several sources into your current session. Specifically, you can load data from:
When data is being loaded, the system analyzes the data to infer the type (such as string, integer, or float) of each column and also to identify and remove any bad rows, such as rows with too many or too few columns or mismatched types.
Datasets can contain at most 1600 columns.
Notes About Databases
Both of the following requirements must be met before DataChat can connect to a database:
- DataChat should have permission to create and drop namespaces or databases in the database server. This permission is required because DataChat creates temporary tables and views when running queries and it’s safer to create those tables in a temporary namespace that is later deleted when your session ends.
- DataChat should have permission to create and drop views or tables in the temporary DataChat namespace over the tables and views in the main database.
Connecting to a different database deletes all datasets, models, and charts loaded or created in the current database in the current session.
Notes About Files
Load is different from uploading a dataset using New > File. Uploading a file saves the file in DataChat as a dataset to be used later, while loading data from an existing file or URL loads the data directly into your session. After the data is loaded into your session, you can begin to analyze it. The following file types are supported:
- Comma-separated values (.csv)
- Excel (.xlsx)
Files must be in tabular format, with the first row as a header row that delineates column names. If saved from Excel, .csv files must be in UTF-8 format.
Notes About Snapshots
Snapshots are created with the
Save skill and contain the datasets from a different session. You can load data from them the same way you can load data from external databases or local files.
Load has several formats:
Load data from the database <database>
If the database connection includes write access, all database tables are loaded into DataChat. The chat history provides updates for lengthy data transfers. Once all tables are loaded, you can click a link to list the loaded datasets.
If the database is read-only, a list of available datasets is displayed in a table. Click on a link in the Table column to load a specific dataset from the read-only database.
and force transfer
For read-only databases, you can optionally initiate a data transfer upon load instead of upon subsequent use of a skill. A data transfer copies the data from the read-only database to your DataChat session. For very large databases, data transfer can be a time-intensive operation – forcing a transfer allows you to schedule a lengthy transfer and save a snapshot for use in a session.
The following qualifiers can be used separately to specify which datasets to load from the database:
only loading the datasets <datasets>
Loads the listed
<datasets>(separated by commas) from the specified database.
using the queries from <query file>
Load datasets from the database, as returned by the specified queries.
using the schema
Loads all of the datasets in the specified database that match the
without fetching the latest schema
Loads all of the datasets in the specified database without fetching the latest schema.
Load data from the datasets <datasets>. Loads multiple datasets into your current session at once.
Load data from the file <file> loads data into your current session from a file that has been uploaded to DataChat. See Upload Files for more information.
Load also supports the following optional qualifiers:
using the delimiter <delimiter>
From an uploaded file, specify a delimiter on which to split values. You can specify any string value, or can select from the following common values: comma
;, space, or tab.
with columns <list of columns> having types <list of types>, with formats <list of formats>
From an uploaded file, specify the types and formats of specific columns. Note that this is most useful when DataChat didn't infer the column type you expected or preferred.
.xslx files that include sheets, you can use two optional qualifiers to further specify which data to import. Use the qualifiers separately or together:
excluding the sheets <sheet name>
From an uploaded .xlsx file that includes multiple named sheets, exclude specific sheets from your current session.
including the sheets <sheet name>
From an uploaded .xlsx file that includes multiple named sheets, include specific sheets from your current session.
with cell ranges <sheet name!cell range>
From an uploaded .xlsx file that includes named sheets, specify cell ranges per sheet. Options for
cell name:which loads all values from cell-name through the rest of the sheet
cell name:cell namewhich loads all values from cell-name to cell-name
cell name:column namewhich loads all values from cell-name up to and including all values in column-name
Load data from the snapshot <snapshot name>. Loads the datasets from a snapshot into your session. A snapshot is a database that contains the datasets from another session that is created with the
Save skill (either by you or by another user in your organization and shared with you).
Load data from the URL <URL> (using the delimiter <delimiter>). Loads data directly from a URL endpoint and can split values using an optional delimiter.
When loading data into DataChat, you can use the following parameters to specify where to load data from and how to load it:
cell name(required) For an .xlsx file, the name of the cell name to include within a cell range.
cell range(required) For an .xlsx file, the name of the cell range to include.
column name(required) For an .xlsx file, the name of the column to include within a cell range.
database(required). The database to load data from.
dataset name(required). The dataset to load from the database.
datasets(required) A comma-separated list of datasets to load.
delimiter(optional). If specified, this parameter splits the values in the dataset on the specified delimiter. The options include a space, a tab, a comma, a semicolon, or a custom string.
file(required). The name of the file to load.
list of columns(optional). The columns to give specific types and formats when loading.
list of types(optional). The types to assign to each of the columns from the
list of columnsparameter.
list of formats(optional). The formats to assign each of the columns from the
list of columnsparameter.
query file(required). The name of the file containing queries to apply to a database, to load specified datasets.
schema(required). The schema of the datasets to load from the database.
sheet name(required). For an .xlsx file, the name of the sheet to include or exclude.
snapshot name(required). The name of the snapshot to load data from.
string(optional). When loading data from a database, this is the string pattern the tables in the database or schema need to match in order to be loaded.
URL(required). The URL from which to download the dataset.
If the data is loaded successfully, the system returns a success message and shows a sample of the dataset.
If the data is not loaded, the system returns an error message in the chat history.
To load data from the telcoCustomerChurn file, enter
Load data from the file telcoCustomerChurn
To load specific data from a file "SoilMultipleSheets.xlsx", enter
Load data from the file SoilMultipleSheets.xlsx using the delimiter | with cell ranges Crops!B4:L, Meso
To load a sample file from a URL, enter
Load data from the URL https://dc-small-data.s3.us-east-2.amazonaws.com/sample.csv
To load yearly sales data from the read-only database "sales_db" (after the connection has been established), enter
Load data from the database sales_db and force transfer loading datasets starting with Year
Behind the Scenes
Inferring Column Types
Before the data in your file is loaded into DataChat, the platform first infers your data's dialect, or what character (or lack thereof) is used to separate your values, such as a comma, tab, space, or semicolon. Determining a dialect allows the system to then calculate the number of columns in the dataset.
Detecting Bad Rows
To detect bad rows in the dataset, the platform first breaks the dataset into chunks and takes 1,000 rows from each chunk. Then, for each group of 1,000 rows, the platform:
Determines the most likely type for each value in every column. For example, if a value is all numbers, such as "1000," it's more likely to be an integer than a string. If that value has a decimal point, then it's even more likely to be a floating point (float) value than it is to simply be an integer.
If a column is determined to be a Date or Time column, the platform compares the values to several patterns to determine which pattern is most common, such as MMDDYY, DDMMYY, or YYMMDD. Any other rows that don't match the decided-upon format are converted to strings.
Counts the number of rows that have been given each type. The most common type becomes the column's type. For example, if you had a column with 800 integer rows, 40 string rows, and 160 float rows, the column would be considered an integer column. Note that if at least 0.5 percent of a column's values are strings, the column is automatically considered a string column. This is because most other values can be converted to strings easily, but string values often cannot be easily converted to other types, such as integers, because they contain Other rows in the column are then converted to strings.
Discards or converts any rows in the column that don't match the column's type or are completely empty. Whether a given row is kept or discarded depends on the column's type and the row's type and whether the value in the row can be converted to match the column. For example, integer rows in a string type column can be easily converted to a string because they are simply numbers. However, a string such as "name100" couldn't be converted to an integer because it contains characters, so it's discarded. Discarded rows are placed into a separate file that can then be downloaded and reviewed.