Version: 0.17.6

Best Practices

DataChat makes it easy to build and apply machine learning models. However, there are some best practices to keep in mind as you prepare your data and build your models.

note

Some of these best practices are available only in Notebook mode, or are easiest to apply there.

Slice Data and Train Models for Each Unique Feature

Slice datasets by features that are highly heterogeneous and build a model for each slice. For example, predicting sales might benefit from building a model per region rather than one model for the entire country.
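As a rough illustration of the per-slice idea, the sketch below groups rows by region and fits one simple least-squares line per group. It is plain Python rather than DataChat's own interface, and the column names (`region`, `ad_spend`, `sales`), the data, and the `fit_line` helper are all hypothetical.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical rows: (region, ad_spend, sales)
rows = [
    ("north", 1.0, 12.0), ("north", 2.0, 14.0), ("north", 3.0, 16.0),
    ("south", 1.0, 5.0), ("south", 2.0, 10.0), ("south", 3.0, 15.0),
]

# Slice the dataset by the heterogeneous feature (region).
slices = defaultdict(list)
for region, spend, sales in rows:
    slices[region].append((spend, sales))

# Train one model per slice instead of one model for all rows.
models = {
    region: fit_line([x for x, _ in points], [y for _, y in points])
    for region, points in slices.items()
}
```

Here each region ends up with its own slope and intercept, which a single pooled model would have averaged together.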

Remove Distracting Data

Use Drop or Keep to remove distracting data that causes the model to focus on trends you don't care about. For example, when you're training a model to predict something per hour of the day, it might make sense to cut out evening or other non-business hours so that day/night differences don't wash out subtler trends during the day.
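The hour-of-day example can be sketched as a simple row filter. This is plain Python, not DataChat syntax; the `readings` data and the 9:00 to 17:00 business window are assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical hourly readings.
readings = [
    {"timestamp": datetime(2024, 3, 4, 7, 30), "orders": 3},
    {"timestamp": datetime(2024, 3, 4, 9, 0), "orders": 18},
    {"timestamp": datetime(2024, 3, 4, 13, 15), "orders": 25},
    {"timestamp": datetime(2024, 3, 4, 21, 45), "orders": 2},
]

# Assumed business hours, treated as the half-open range [9, 17).
BUSINESS_START, BUSINESS_END = 9, 17

# Keep only rows inside business hours so day/night swings don't dominate.
business_hours = [
    row for row in readings
    if BUSINESS_START <= row["timestamp"].hour < BUSINESS_END
]
```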

Use Lag or Window Columns to Focus Your Model

Create lag or window columns to help your model focus on specific trends within time series data. For example, you can emphasize a two-week cycle by adding a column that holds the actual value lagged by two weeks.
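A minimal sketch of both column types, assuming one row per day so that a two-week lag is 14 rows. The helper names `add_lag` and `rolling_mean` and the `daily_sales` values are hypothetical, and this is plain Python rather than DataChat's own interface.

```python
def add_lag(values, lag):
    """Position i holds values[i - lag], or None when no earlier value exists."""
    padding = [None] * min(lag, len(values))
    return padding + values[:max(len(values) - lag, 0)]


def rolling_mean(values, window):
    """Trailing mean over the last `window` values, None until the window fills."""
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(None)
        else:
            result.append(sum(values[i + 1 - window:i + 1]) / window)
    return result


# Hypothetical daily series: 16 consecutive days.
daily_sales = list(range(100, 116))

lag_14 = add_lag(daily_sales, 14)      # the value from two weeks earlier
mean_7 = rolling_mean(daily_sales, 7)  # trailing one-week window
```

Rows without enough history are left as `None` rather than filled in, so you can decide explicitly whether to drop them before training.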

Normalize Units of Measurement

Compute normalized units across columns so that regression models don't treat large numbers in one feature as more important than small numbers in another. For example, if one column is measured in ounces and another in pounds, convert both to ounces or both to pounds.
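The ounces-and-pounds example might look like the following in plain Python. The column names and values are hypothetical; the only fixed fact is that one pound is 16 ounces.

```python
OUNCES_PER_POUND = 16

# Hypothetical rows with mixed units.
shipments = [
    {"package_weight_oz": 32.0, "pallet_weight_lb": 40.0},
    {"package_weight_oz": 8.0, "pallet_weight_lb": 55.5},
]

# Express both weight columns in pounds before training.
normalized = [
    {
        "package_weight_lb": row["package_weight_oz"] / OUNCES_PER_POUND,
        "pallet_weight_lb": row["pallet_weight_lb"],
    }
    for row in shipments
]
```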

Create Relevant Metrics Before Training

Combine columns into business-relevant metrics. For example, a pool may have water flowing in and water flowing out, both fairly stable in absolute terms, but the business-relevant metric is the difference between the two, which indicates whether the pool is filling or emptying.
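The pool example reduces to one derived column. In this sketch the column names (`inflow_lpm`, `outflow_lpm`, `net_flow_lpm`) and the readings are hypothetical.

```python
# Hypothetical flow readings in liters per minute.
pool_log = [
    {"inflow_lpm": 120.0, "outflow_lpm": 118.0},
    {"inflow_lpm": 119.0, "outflow_lpm": 121.0},
    {"inflow_lpm": 120.0, "outflow_lpm": 120.0},
]

# Derived metric: positive means the pool is filling, negative means emptying.
for row in pool_log:
    row["net_flow_lpm"] = row["inflow_lpm"] - row["outflow_lpm"]
```

Both raw columns barely move, while the derived column changes sign, which is the signal a model would otherwise have to discover on its own.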

Limit Your Training Data to a Subset of Features

Limit your initial columns to the ones that make intuitive sense. This reduces your effort and lowers the risk of overfitting, which grows when the model has too many columns to choose from.
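One way to picture this step is an explicit allow-list of columns. The column names, the `sales` target, and the sample row below are hypothetical, and the code is plain Python rather than DataChat's interface.

```python
# Hypothetical raw row with more columns than the first model needs.
raw_rows = [
    {"region": "north", "ad_spend": 2.0, "season": "summer",
     "row_id": 1017, "internal_note": "rechecked", "sales": 14.0},
]

SELECTED_FEATURES = ["region", "ad_spend", "season"]  # columns that make intuitive sense
TARGET = "sales"

# Keep only the chosen features plus the target.
training_rows = [
    {name: row[name] for name in SELECTED_FEATURES + [TARGET]}
    for row in raw_rows
]
```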

Avoid Highly Unique Features

Avoid columns with many unique values, such as identifiers or free-text fields. Models typically can't find reliable trends in these columns without a lot more data.
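A quick way to spot such columns is to compare the number of distinct values with the number of rows. The sketch below assumes a cutoff of 0.5, which is an arbitrary illustrative threshold and not a DataChat default; the column names and data are also hypothetical.

```python
def unique_ratio(values):
    """Share of rows that carry a distinct value (1.0 means every row is unique)."""
    return len(set(values)) / len(values)


# Hypothetical columns.
columns = {
    "customer_id": ["c1", "c2", "c3", "c4", "c5", "c6"],
    "region": ["north", "south", "north", "north", "south", "south"],
}

UNIQUE_RATIO_THRESHOLD = 0.5  # assumed cutoff, tune for your data

# Columns above the cutoff are candidates to leave out of training.
high_cardinality = [
    name for name, values in columns.items()
    if unique_ratio(values) > UNIQUE_RATIO_THRESHOLD
]
```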

Create Bins for Continuous Features

Bin continuous values (such as numbers) into 3-10 buckets rather than using them directly. Simple high, medium, and low bins are often a good place to start. For example, it might make sense to bin an "Age" column into common demographic buckets, such as 18-24, 25-34, 35-44, 45-54, 55-64, and so on.
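The "Age" example can be sketched with the standard library's `bisect` module. The final `65+` bucket is an assumption standing in for "and so on", and the `age_bucket` helper is hypothetical.

```python
from bisect import bisect_right

# Lower bound of every bucket after the first one.
BIN_EDGES = [25, 35, 45, 55, 65]
# The last label is an assumed catch-all bucket.
BIN_LABELS = ["18-24", "25-34", "35-44", "45-54", "55-64", "65+"]


def age_bucket(age):
    """Map an age of 18 or more to its demographic bucket label."""
    if age < 18:
        raise ValueError("ages below 18 fall outside the example buckets")
    return BIN_LABELS[bisect_right(BIN_EDGES, age)]


ages = [18, 24, 25, 41, 64, 70]
buckets = [age_bucket(age) for age in ages]
```

With six buckets this stays inside the suggested range of 3-10, and the model sees a handful of categories instead of dozens of distinct ages.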

Be Careful with Skewed Features

Be careful judging your results if your input data is heavily skewed. DataChat can compensate for this using industry-standard techniques, but you should scrutinize the results more carefully.
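To see why skew calls for extra scrutiny, the sketch below reads "skewed" as class imbalance in the column being predicted, which is only one possible interpretation. It does not show how DataChat compensates for skew; the labels are hypothetical.

```python
from collections import Counter

# Hypothetical target column: 95 "no" rows and 5 "yes" rows.
labels = ["no"] * 95 + ["yes"] * 5

counts = Counter(labels)
shares = {label: count / len(labels) for label, count in counts.items()}

# The rarest class is usually the one you care most about predicting.
minority_label = min(shares, key=shares.get)

# A model that always predicts the majority class already scores this accuracy.
majority_baseline_accuracy = max(shares.values())
```

Because always guessing the majority class already reaches this baseline, a high accuracy figure alone says little, so it helps to check how the model performs on the minority class as well.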