Data Preparation 101

Every machine learning workflow starts with data preparation: the process of converting structured or unstructured data into a form that machine learning algorithms can use. Data preparation is essential because it improves the quality of data and makes it more consistent.

In this piece, we’ll go over the steps involved in data preparation and share some tips on how to perform each one effectively.

Preparing Data for Machine Learning

Data preparation for machine learning is a multi-step process, though not every project will cover every stage. Machine learning algorithms generally require data in very specific formats and in large volumes. This makes data preparation a time- and labor-intensive step: in fact, up to 80% of the time spent on a machine learning project may go to data preparation. The steps in preparing data for machine learning typically look like the following:

Data sourcing

The very first step is to understand the type of data required for a project and then source that data. Once the type of data to be gathered has been determined, it must be collected ethically, respecting data ownership rights and obeying any applicable stipulations on its use. Data will rarely arrive exactly as needed or in the desired format, so it will almost always require further preparation.

Data cleansing & filtering

Once data has been sourced and gathered, it must be suitably cleaned and filtered for the particular project. Cleansing means ensuring that the data is easy to understand and use in machine learning models; generally, this involves putting it into a consistent, standardized format. Filtering then involves weeding out the parts you don’t need until you are left with cleansed, highly relevant data that can be used for modeling.
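
As a minimal sketch of what cleansing and filtering can look like in practice, the pandas snippet below standardizes formats, drops duplicates, and filters out irrelevant rows. The file name and columns (sales.csv, date, region, amount) are hypothetical stand-ins:

```python
import pandas as pd

# Load raw data (sales.csv and its columns are hypothetical examples).
df = pd.read_csv("sales.csv")

# Standardize formats: consistent casing and proper datetime types.
df["region"] = df["region"].str.strip().str.lower()
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# Cleanse: drop exact duplicates and rows with unusable values.
df = df.drop_duplicates()
df = df.dropna(subset=["date", "amount"])

# Filter: keep only the rows relevant to this project,
# e.g. a single region and positive sale amounts.
df = df[(df["region"] == "north") & (df["amount"] > 0)]
```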

Data augmentation

Data augmentation is a way of creating additional data, or modifying existing data, based on what is currently available. Training data often needs to be balanced: some classes to be modeled may be underrepresented, or some operating conditions may have no data associated with them. The goal of this stage is therefore to ensure that the volume of data is sufficient and that the classes to be modeled are balanced. Common approaches include generating synthetic data and sourcing more real data. Augmentation is often used to mitigate the problems of small datasets, such as class imbalance and overfitting.
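
As a hedged illustration, the NumPy sketch below rebalances a small, imbalanced image dataset by oversampling the minority class with horizontally flipped copies. The array shapes and class labels are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical dataset: 100 images of class 0, only 10 of class 1.
images = np.concatenate([rng.random((100, 32, 32, 3)),
                         rng.random((10, 32, 32, 3))])
labels = np.array([0] * 100 + [1] * 10)

# Oversample the minority class by generating horizontally
# flipped copies until the classes are balanced.
minority = images[labels == 1]
needed = (labels == 0).sum() - (labels == 1).sum()
picks = rng.integers(0, len(minority), size=needed)
synthetic = minority[picks][:, :, ::-1, :]  # flip along the width axis

images = np.concatenate([images, synthetic])
labels = np.concatenate([labels, np.ones(needed, dtype=int)])
```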

Data annotation

Data annotation, sometimes known as data enrichment, typically involves labeling or otherwise enriching data to provide additional context and to help sort and categorize it. This is usually done first by data type, applying labels such as whether data is in the form of images, video, audio, or text. Within those subsets, labels can then annotate specific data, for example marking every person’s name used within a piece of text. This is an important step, as machine learning models are typically trained for a specific type of task, and annotation makes it easy to sort and find the data a particular model needs.
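
To make this concrete, here is one illustrative shape an annotation record might take for named-entity labeling in text. The schema (text, entities, character offsets) is an assumption made for the example, not a standard; real projects typically follow a tool-specific format:

```python
# A minimal, illustrative annotation record for named-entity labeling.
# The schema below is hypothetical, not a standard format.
annotation = {
    "data_type": "text",
    "text": "Ada Lovelace met Charles Babbage in London.",
    "entities": [
        {"label": "PERSON", "start": 0, "end": 12},     # "Ada Lovelace"
        {"label": "PERSON", "start": 17, "end": 32},    # "Charles Babbage"
        {"label": "LOCATION", "start": 36, "end": 42},  # "London"
    ],
}

# Spans can be checked against the text to catch inconsistent labels.
for ent in annotation["entities"]:
    print(ent["label"], "->", annotation["text"][ent["start"]:ent["end"]])
```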

Translation, normalization, localization

Depending on the needs of the particular model and the data, this step may or may not be needed. Certain data may not be in the correct language or may have localization quirks that require sorting out for consistency. A good example of this would be images of canned, carbonated drinks. Even within English, these are known by several different names, such as “pop”, “fizzy drink”, “soda”, “soft drink”, and more. These are all functionally the same thing, but machine learning models need consistency, so one particular name should be chosen and used to refer to all instances within the data.
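
A minimal sketch of this kind of normalization, assuming an illustrative synonym list, might map every regional variant to one canonical label before training:

```python
# A hedged sketch of label normalization: collapse regional synonyms
# to one canonical name. The mapping itself is illustrative.
CANONICAL = "soda"
SYNONYMS = {"pop", "fizzy drink", "soft drink", "soda pop"}

def normalize_label(label: str) -> str:
    """Collapse known variants to a single canonical name."""
    cleaned = label.strip().lower()
    return CANONICAL if cleaned in SYNONYMS else cleaned

print(normalize_label("Fizzy Drink"))  # -> "soda"
print(normalize_label("Soda"))         # -> "soda"
```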

Model training/re-training

Once sufficient data has been gathered, cleansed, augmented, and generally made useful for machine learning, it can be used to train and re-train models. Machine learning models are trained on datasets with a specific purpose in mind, such as identifying every instance of a place or person’s name within a body of text.
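
As a brief, hedged sketch of this step, the snippet below trains a scikit-learn classifier on a synthetic stand-in for a prepared dataset; in a real project, X and y would come out of the sourcing, cleansing, and annotation steps above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a prepared dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold some data back for validation (the next step).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Train (or re-train) the model on the prepared training split.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
```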

Model validation

Model validation is a crucial step: it determines whether the model in question accurately represents the behavior of the system. This is where the validation dataset becomes very important. It should be diverse and include ground-truth annotations, as it is used to verify that the model can generalize and operate outside the specific conditions covered by the training dataset.
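
A minimal sketch of validation, reusing the same synthetic stand-in dataset from the training sketch so it runs on its own, scores the model on held-out data it never saw during training:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same synthetic stand-in dataset as the training sketch above.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Score the model on held-out data it never saw during training;
# a large train/validation gap is a warning sign for generalization.
print("Training accuracy:  ", model.score(X_train, y_train))
print("Validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
```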

Why Does Special Attention Need to Be Paid During the Data Preparation Process?

The data preparation process takes up to 80% of a machine learning project’s time, and with good reason. Since machine learning models are trained on data, the outcome of any model depends largely on the quality of the data fed into it. Many of the issues that can arise from ill-prepared data are outlined below.

What Challenges of Data Preparation Should You Consider?

There are several challenges involved in data preparation, and it pays to work them out before starting to source data. The first is understanding the goal of the model and working out what data you need and how much. This varies widely: for example, if you want to forecast sales and revenue streams for the next six months, there are many data points you may choose to include as part of the model. These include, but are not limited to, historical sales data, weather data, location data, employee data, supplier data, raw materials data, and general market trends. How much you include and how much weight each data source should carry will be very project-specific, but these are decisions that should be made before data sourcing begins.

The next challenge is to find relevant, high-quality data. This is a particularly difficult challenge, and one that most machine learning projects will face if no pre-prepared data is available. Some of the data might already exist in some format, such as historical sales or weather records, but it is likely spread across several locations and formats.

Once data sources have been decided upon and sourced, it’s also essential to work out how the data will be structured. Since data is likely to arrive in several different formats, settling on a structure ahead of time helps determine how much prep work will be needed to convert each source into a format usable for machine learning.
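
As one hedged sketch of this, the pandas snippet below combines two hypothetical sources, a CSV export and a JSON dump, into a single target schema agreed on ahead of time; all file and column names are assumptions made for the example:

```python
import pandas as pd

# Target schema agreed on before sourcing (hypothetical columns).
TARGET_COLUMNS = ["date", "store_id", "revenue"]

csv_part = pd.read_csv("legacy_sales.csv")   # hypothetical file
json_part = pd.read_json("api_sales.json")   # hypothetical file

# Rename source-specific columns to the shared schema.
csv_part = csv_part.rename(columns={"sale_date": "date", "total": "revenue"})
json_part = json_part.rename(columns={"timestamp": "date", "amount": "revenue"})

# Combine both sources into one consistently structured table.
combined = pd.concat(
    [csv_part[TARGET_COLUMNS], json_part[TARGET_COLUMNS]],
    ignore_index=True)
combined["date"] = pd.to_datetime(combined["date"])
```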

Another challenge worth mentioning is the difficulty of obtaining high-quality annotations and labels that are unambiguous, consistent, and accurately representative of the data.

How Do We Know if Our Data Preparation Was Thorough Enough?

Several signals can be used to determine how thorough data preparation was. In supervised learning, there is what is known as ‘ground truth’ data: data that represents an objective truth against which the output of a machine learning model can be compared. The more closely a model’s output matches the ground truth, the better the model, and this comparison helps catch issues such as overfitting and underfitting. Ground truth data is also necessary in supervised learning to train the models themselves. Finally, the validation set should confirm that the models are good enough; if it is diverse enough, it also confirms that the data preparation itself was good enough.

Overfitting is an issue that happens when a model fits the training data too well: the model picks up not just the underlying patterns but also the noise, incorporating random fluctuations in the training data into itself. The model then struggles to generalize to unseen data.

Underfitting is a related issue, but unlike overfitting, where the model fits the training data too well, underfitting is the problem of failing to model the training data at all, let alone generalize to unseen data. Underfitting is generally easy to detect and suggests either an issue with the model itself or insufficient training data.
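
One common way to spot both problems, sketched below with scikit-learn on synthetic data, is to sweep model complexity and compare training and validation scores: very shallow trees score poorly on both (underfitting), while very deep trees score nearly perfectly on training data but lag on validation data (overfitting):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in dataset for the illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Sweep tree depth: low depth tends to underfit (both scores low),
# unlimited depth tends to overfit (train high, validation lags).
for depth in (1, 3, 10, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train={tree.score(X_train, y_train):.2f}, "
          f"val={tree.score(X_val, y_val):.2f}")
```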

Proper data preparation leads to better AI

Data is the lifeblood of machine learning: you need a constant supply of high-quality data to get the best performance. Though data sourcing, cleansing, annotation, and formatting are time-consuming tasks that are not as interesting as building and experimenting with new models, they are still important foundational steps. Problems with data will bleed into the later stages and cause problems with modeling.

The better your input, the better your output will be. Proper data preparation leads to better AI, as models can generalize better and avoid problems such as underfitting and overfitting. Though the prospect is not as exciting as working with the models themselves, spending the bulk of a project’s time on data preparation pays dividends in producing high-quality data that can be reliably used to train models for AI. If you need help with data preparation for your machine learning project, contact Sigma today. We can help you source, combine, format, and annotate your data so it is ready for modeling.

Want to learn more? Contact us ->

Sigma offers tailor-made solutions for data teams annotating large volumes of training data.