# Glossary

### App

An App in DataChat is a collection of Skills. The key app is Ava, which understands DataChat English. This is the main app for most users. DataChat English allows you to carry out a broad range of Skills including data loading, data wrangling, data visualization, machine learning and a lot more in a single language. There are also ways to create new languages for specific settings and a method to wrap that new language in a new app. To build such custom Apps, please contact info@datachat.ai.

### Attribute Value Frequency (AVF)

The Attribute Value Frequency (AVF) is a simple and fast approach to detect outliers in categorical data. The AVF method generates scores that are always positive numbers. The lower the score, the more likely the object is an outlier. More information can be found here.
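
As a rough illustration of the idea, the sketch below scores each row by the average frequency of its attribute values. It assumes the commonly described form of AVF (mean of per-column value counts); the function name `avf_scores` and the sample rows are hypothetical, and DataChat's own implementation may differ.

```python
from collections import Counter

def avf_scores(rows):
    """Return one AVF score per row: the mean frequency of the row's attribute values."""
    n_attrs = len(rows[0])
    # Count how often each value appears in each column.
    counts = [Counter(row[j] for row in rows) for j in range(n_attrs)]
    return [
        sum(counts[j][row[j]] for j in range(n_attrs)) / n_attrs
        for row in rows
    ]

# Hypothetical categorical data: the last row uses rare values in both columns.
rows = [
    ("red", "small"),
    ("red", "small"),
    ("red", "small"),
    ("blue", "large"),
]
scores = avf_scores(rows)
# The row with the lowest score is the most likely outlier.
most_likely_outlier = scores.index(min(scores))
```

Every score is positive because each value occurs at least once, and the rare row receives the lowest score.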

### Chatterjee Correlation Coefficient

Chatterjee's correlation coefficient is a rank correlation metric that measures the degree of dependence between two variables. The coefficient is 0 only if the variables are completely independent and 1 only if one variable is a measurable function of the other. Note that the coefficient is asymmetric: its value when calculated for variable A in relation to variable B is not guaranteed to be the same as when calculated for variable B in relation to variable A.
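
The asymmetry can be seen in a small sketch. It assumes the standard sample formula for data without ties, 1 - 3 * sum(|r[i+1] - r[i]|) / (n^2 - 1), where r holds the ranks of the second variable after sorting by the first. The function name `chatterjee_xi` and the sample data are hypothetical.

```python
def chatterjee_xi(x, y):
    """Sample Chatterjee coefficient of y in relation to x (assumes no tied values)."""
    n = len(x)
    # Order the y values by their paired x value.
    y_by_x = [yv for _, yv in sorted(zip(x, y))]
    # Rank of each y value (1 = smallest) among all y values.
    rank_of = {yv: r for r, yv in enumerate(sorted(y), start=1)}
    ranks = [rank_of[yv] for yv in y_by_x]
    # Sum of absolute jumps between consecutive ranks.
    total_jump = sum(abs(b - a) for a, b in zip(ranks, ranks[1:]))
    return 1 - 3 * total_jump / (n * n - 1)

# y is a function of x, but x is not a function of y.
x = [-5, -4, -3, -2, -1, 0, 1.5, 2.5, 3.5, 4.5, 5.5]
y = [v * v for v in x]

xi_y_given_x = chatterjee_xi(x, y)  # y in relation to x
xi_x_given_y = chatterjee_xi(y, x)  # x in relation to y
```

Here the first value is clearly larger than the second. On small samples the estimate can fall below 0 and stays below 1 even for an exact functional relationship.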

### Correlation Measure Between Continuous and Discrete Variables

This is a measure of the association between a continuous variable and a categorical variable, built from two distances. The values of the continuous variable are grouped by the unique values (classes) of the categorical variable. The intra-class distance is the distance between the values within each class, summed across all classes. The inter-class distance is the sum of the distances of each class mean from the overall mean. The measure of correlation is then defined as the ratio of the inter-class distance to the sum of the inter-class and intra-class distances. The range is [0, 1]. More information can be found here.
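
The ratio described above can be sketched as follows. The text does not say which distance is used, so this sketch assumes squared deviations and weights each class mean by its class size, which yields the classical correlation ratio. The function name `correlation_ratio` and the data are hypothetical.

```python
from collections import defaultdict

def correlation_ratio(categories, values):
    """Inter-class distance divided by (inter-class + intra-class) distance."""
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)

    total_mean = sum(values) / len(values)
    intra = 0.0  # spread of values around their own class mean
    inter = 0.0  # spread of class means around the overall mean
    for members in groups.values():
        class_mean = sum(members) / len(members)
        intra += sum((v - class_mean) ** 2 for v in members)
        # Assumption: weight by class size (not stated in the source).
        inter += len(members) * (class_mean - total_mean) ** 2

    total = inter + intra
    return inter / total if total else 0.0

# The category fully determines the value: ratio of 1.
separated = correlation_ratio(["a", "a", "b", "b"], [1, 1, 5, 5])
# Both classes have the same mean: ratio of 0.
overlapping = correlation_ratio(["a", "a", "b", "b"], [1, 5, 1, 5])
```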

### Cumulative Variance Ratio

Variance is a measure of the spread in the data. After a Principal Component Analysis (PCA) has been performed, each component accounts for part of the total variance. The cumulative variance ratio shows how much of the total variance is captured by the components up to and including a given one.
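
A minimal sketch of the calculation, assuming the per-component variances from a PCA are already available and sorted from largest to smallest. The numbers are made up for illustration.

```python
from itertools import accumulate

# Hypothetical per-component variances from a PCA, largest first.
component_variances = [4.0, 3.0, 2.0, 1.0]
total_variance = sum(component_variances)

# Running total of variance, expressed as a fraction of the total.
cumulative_ratio = [c / total_variance for c in accumulate(component_variances)]

# Smallest number of components that retains at least 90% of the variance.
components_needed = next(
    i + 1 for i, ratio in enumerate(cumulative_ratio) if ratio >= 0.9
)
```

A common use is the last step: picking how many components to keep for a chosen variance threshold.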

### Dataset

A dataset refers to a set of data in a specific state within DataChat.

For example, you may import your file “mydata.csv” as the dataset “mydata”. If you delete some rows from “mydata” using DataChat, it will automatically create a new dataset “mydata_with_some_rows_deleted”. However, it will also keep “mydata” around for you to use again if you realize you need the full “mydata” dataset after all. Conceptually, this is like DataChat keeping a backup version of your data at every analysis step in case you need to go back and iterate.

Most DataChat skills will automatically assume that they should run on the last dataset used. Refer to the Navigate Your Data section for more information.

### Durbin-Watson Test

The Durbin-Watson test is used to detect whether there is any correlation in a regression model's prediction errors. More information can be found here.
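
For reference, the Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. It ranges from 0 to 4, and values near 2 suggest no first-order autocorrelation. The sketch below uses a hypothetical helper name and made-up residuals.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for a sequence of regression residuals."""
    # Squared change between each residual and the one before it.
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

# Residuals that flip sign every step: negative autocorrelation (above 2).
alternating = durbin_watson([1, -1, 1, -1])
# Residuals that stay on one side for a while: positive autocorrelation (below 2).
persistent = durbin_watson([1, 1, -1, -1])
```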

### Explained Variance Ratio

The explained variance ratio is the ratio of the variance of a given component in a PCA to the total variance.

### Impact Scores

To obtain the impact scores for each feature as displayed on the bar chart, we first calculate the Shapley values for each feature for each sample in the training dataset. From these "local Shapley scores", we take the absolute values and then average them for each feature over all the individual data samples. These values are then displayed in the bar chart as the feature-wise model impact scores. For more information on Shapley values, see this article.
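
The aggregation step can be sketched as below. It assumes the local Shapley values have already been computed by some other tool; the feature names and numbers are hypothetical.

```python
# Hypothetical local Shapley values: one dict per training sample.
local_shapley = [
    {"age": 0.5, "income": -2.0},
    {"age": -0.5, "income": 1.0},
    {"age": 1.5, "income": -3.0},
]

def impact_scores(local_values):
    """Mean absolute Shapley value per feature across all samples."""
    n_samples = len(local_values)
    features = local_values[0].keys()
    return {
        feature: sum(abs(row[feature]) for row in local_values) / n_samples
        for feature in features
    }

impact = impact_scores(local_shapley)
```

Taking absolute values first means that positive and negative contributions do not cancel each other out, so a feature that pushes predictions strongly in both directions still ranks as impactful.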

### Isolation Forest

An isolation forest is an algorithm for detecting outliers in numerical data. The scores generated by the algorithm range from -1 to 1. The lower the score, the higher the chance that the corresponding object is an outlier. More information can be found here.

### Mean Absolute Error

The mean absolute error (MAE) measures the average magnitude of the forecast errors in a set of predictions, without considering whether the forecast overpredicts or underpredicts. It is a scale-dependent metric that shows the average error to expect from a given forecast. More information can be found here.
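
A minimal sketch of the calculation, with a hypothetical helper name and made-up values:

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [10, 20, 30]
predicted = [12, 18, 30]  # one overprediction, one underprediction, one exact
mae = mean_absolute_error(actual, predicted)
```

The overprediction and the underprediction each contribute 2 to the total rather than cancelling out, and the result is in the same units as the data.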

### Mean Absolute Percentage Error

Mean absolute percentage error (MAPE) is used to measure the accuracy of predictions in regression tasks. It can be expressed as a float value (typically between 0 and 1) or, equivalently, as a percentage. MAPE generates unbounded values when the ground truth is at or near zero. More information can be found here.
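
The sketch below shows the calculation and the near-zero behavior. The helper name and values are hypothetical, and the function makes no attempt to handle a ground truth of exactly zero, which raises `ZeroDivisionError`.

```python
def mape(actual, predicted):
    """Mean of |actual - predicted| / |actual|, returned as a fraction."""
    return sum(
        abs(a - p) / abs(a) for a, p in zip(actual, predicted)
    ) / len(actual)

# Each prediction is off by 10% of the actual value.
typical = mape([100, 200], [110, 180])

# A ground truth near zero inflates the score far beyond 1 (100%).
near_zero = mape([0.001], [1.0])
```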

### Mutual Information Gain

This is the measure of the mutual dependence of two categorical variables. This value points to the amount of information we can obtain for one variable by just observing the other variable. More information can be found here.
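
As an illustration, mutual information can be estimated from observed counts as the sum over value pairs of p(x, y) * log(p(x, y) / (p(x) * p(y))). The sketch below uses natural logarithms; the function name and data are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) of two categorical sequences of equal length."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (a, b), c in count_xy.items():
        # c/n is p(x, y); c*n / (count_x*count_y) is p(x, y) / (p(x) * p(y)).
        mi += (c / n) * math.log(c * n / (count_x[a] * count_y[b]))
    return mi

color = ["red", "red", "blue", "blue"]
# Observing one variable fully reveals the other.
dependent = mutual_information(color, ["warm", "warm", "cold", "cold"])
# Observing one variable says nothing about the other.
independent = mutual_information(color, ["s", "m", "s", "m"])
```

In the dependent case the result equals log(2), the full information content of a balanced two-valued variable; in the independent case it is 0.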

### Pearson Correlation Coefficient

Pearson's correlation coefficient is a measure of the linear relationship between two variables. It is the covariance of the two variables divided by the product of their standard deviations. A value of -1 or 1 indicates a perfect negative or positive linear relationship, respectively. A value of 0 indicates that there is no linear correlation. Both variables must be continuous. More information can be found here.
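
A minimal sketch of the calculation, with a hypothetical helper name and made-up data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/(n-1) factors in the covariance and standard deviations cancel out.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

increasing = pearson([1, 2, 3, 4], [2, 4, 6, 8])   # perfect positive line
decreasing = pearson([1, 2, 3, 4], [8, 6, 4, 2])   # perfect negative line
# y = x**2 on symmetric x: strongly related, but not linearly.
curved = pearson([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
```

The last case is a reminder that a coefficient of 0 rules out a linear relationship only; the variables can still depend on each other.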

### Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a way to reduce the size of a dataset by reducing its number of variables to a smaller set of principal components that retain most of the dataset's information. More information can be found here.
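
To make the idea concrete, the sketch below runs a PCA on two variables only, where the component variances are the eigenvalues of the 2x2 covariance matrix and can be written in closed form. This is an illustration under that restriction, not how a general-purpose PCA is implemented; the function name and data are hypothetical.

```python
import math

def pca_2d_explained_variance(points):
    """Explained variance ratio of the two principal components of 2D points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (sxx + syy) / 2
    spread = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    eigenvalues = [half_trace + spread, half_trace - spread]
    total_variance = sxx + syy
    return [ev / total_variance for ev in eigenvalues]

# Points that lie exactly on the line y = 2x.
ratios = pca_2d_explained_variance([(0, 0), (1, 2), (2, 4), (3, 6)])
```

Because the points lie on a single line, the first component carries essentially all of the variance, so the two variables could be replaced by one without losing information.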

### Session

A session refers collectively to the list of utterances you’ve used (aka your current workflow) and the datasets you've created within an instance of Ava or another app. When you switch apps or start working in a different part of DataChat, you might be prompted to decide whether you want to close your previous session. Keep in mind that data is cleaned up when you close a session, so make sure you save or export what you were working on before doing so.

### Skill

A skill is a piece of functionality that allows you to clean and visualize your data, perform queries, train machine learning models, collaborate with other users, and more. For example, the Load skill is used to load data into a session and the Plot skill is used to visualize your data. Skills begin with a verb and are often referred to by that word.

### Symmetric Mean Absolute Percentage Error

Symmetric mean absolute percentage error (SMAPE) is used to measure the predictive accuracy of a model. It is bounded, with values between 0% and 200%. More information can be found here.
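
Several definitions of SMAPE are in use. The sketch below assumes the variant that divides each absolute error by the average of the absolute actual and predicted values, which is the one that produces the 0% to 200% range. The helper name and values are hypothetical.

```python
def smape(actual, predicted):
    """SMAPE as a percentage, using (|actual| + |predicted|) / 2 as the denominator."""
    total = 0.0
    for a, p in zip(actual, predicted):
        denominator = (abs(a) + abs(p)) / 2
        # Assumption: when both values are zero, count the error as zero.
        if denominator:
            total += abs(p - a) / denominator
    return 100 * total / len(actual)

perfect = smape([5], [5])       # lower bound
small_miss = smape([100], [110])
worst = smape([100], [0])       # upper bound
```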

### Utterance

When you type a sentence into DataChat, your specific sentence is called an utterance. DataChat then figures out what skill you're trying to use in that utterance. For example, `Load data from the file myfile.csv` is an utterance that uses the Load skill.

### Workflow

A list of utterances is referred to as a workflow. Workflows can be saved and rerun on different data to provide flexibility and automation within the system.