The Machine Learning Workflow

The key components of any machine learning workflow are data collection, model training and testing, and error analysis. To give your ML project the best chance of success, each of these steps deserves careful attention.

The major bottleneck tends to center around data. Data is like fuel for your car: no matter how big the engine or how much power it can produce, if you don’t have any fuel, the car simply isn’t going to perform as it should. The same is true for machine learning: no matter how good the model is, it needs data. Not just any data, but high-quality, well-labeled, relevant data.

Once you have access to data, it must be properly selected and prepared before it can be useful as part of a machine learning workflow. Read on to learn what this workflow looks like and how data is put to use in machine learning processes.

Breaking Down the ML Workflow

For everyone working on machine learning, the process is quite similar. While each company may implement things slightly differently, the machine learning procedure follows a standard flow.

Gather Data

As emphasized above, data is hugely important for machine learning. Quality data can make or break a project regardless of how good your algorithms are, so data gathering is arguably the most important step in the machine learning workflow.

Tools/methods for gathering data include:

  • Web crawling
  • Data scraping
  • Building a dataset
  • Database querying
  • Use of APIs
  • Data resulting from company processes or activities (e.g. capturing the voice of the customer via call centers)

Data gathering is a complicated process; it isn’t as simple as collecting whatever data happens to be available. Data to be fed into a machine learning workflow requires consideration of multiple factors, including errors, omissions, and bias within the data.

Data can be gathered explicitly by collecting a dataset or implicitly as a side effect of the task being performed. Gathered data must be stored somewhere, so part of this process will often also involve designing an efficient database architecture for the storage and retrieval of data to be used as part of the machine learning workflow. This design and structure can differ depending on the data being stored, such as whether it is primarily text strings, numerical data, or imagery. The data to be gathered/used is directly related to the goal to be accomplished by the machine learning workflow, so it is important to carefully consider what data is needed and how it can be best utilized.
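As a minimal illustration of the API route, the sketch below pages through a hypothetical JSON endpoint and stores the raw records for later processing. The URL, parameters, and response shape are all assumptions rather than a real service:

```python
import json

import requests  # third-party HTTP client: pip install requests

# Hypothetical endpoint -- substitute your real data source.
API_URL = "https://api.example.com/v1/reviews"

def fetch_reviews(pages=3, per_page=100):
    """Page through a JSON API and accumulate raw records."""
    records = []
    for page in range(1, pages + 1):
        resp = requests.get(API_URL, params={"page": page, "limit": per_page}, timeout=10)
        resp.raise_for_status()  # fail loudly on HTTP errors
        records.extend(resp.json())  # assumes the endpoint returns a JSON list
    return records

if __name__ == "__main__":
    data = fetch_reviews()
    # Persist the raw data so pre-processing can be re-run without re-fetching.
    with open("raw_reviews.json", "w") as f:
        json.dump(data, f)
```

Storing the raw, unmodified response separately from any processed version is a small design choice that makes the later pre-processing steps reproducible.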

Pre-processing the Data

After the data has been selected, it next requires pre-processing. Pre-processing can be thought of as ‘prepping’ the data so that it may be consumed as part of a machine learning workflow. This pre-processing of data can be broken down into several categories, depending on the data and how it is intended to be used.

  • Data cleaning: This is the process of identifying and dealing with errors, outliers, and missing data points within the dataset.
  • Data transformation: This is the process of converting the data into a format that can be more easily consumed by machine learning algorithms.
  • Data normalization: This is the process of scaling the data so that it is within a specific range, such as between 0 and 1.
  • Data augmentation: This is the process of generating additional data points from existing ones in order to increase the size of the dataset or to fill gaps in it.
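
As a concrete sketch of the first three steps, the snippet below cleans, transforms, and normalizes a toy tabular dataset with pandas and scikit-learn; the column names and the outlier threshold are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy dataset with a missing value and an obvious entry error (age 230).
df = pd.DataFrame({
    "age": [25, 31, np.nan, 42, 230],
    "city": ["London", "Paris", "Paris", "London", "Berlin"],
})

# Data cleaning: treat impossible ages as missing, then fill gaps with the median.
df.loc[df["age"] > 120, "age"] = np.nan
df["age"] = df["age"].fillna(df["age"].median())

# Data transformation: one-hot encode the categorical column.
df = pd.get_dummies(df, columns=["city"])

# Data normalization: scale the numeric feature into the [0, 1] range.
df["age"] = MinMaxScaler().fit_transform(df[["age"]]).ravel()

print(df)
# Data augmentation is omitted here; it is most common for image and text data,
# e.g. rotating or cropping images to create additional training examples.
```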

After the data has been pre-processed, it is then ready to be used in the next step of the machine learning workflow.

Datasets and Training Data

Machine learning depends on data to function. This data is split into datasets, and most machine learning projects use three categories of dataset: training sets, validation sets, and testing sets.

  • Training sets make up the majority of your data; these are the examples your algorithm learns from in order to perform the task at hand.
  • Validation sets are used to validate and fine-tune a model during development.
  • Testing sets measure the performance/accuracy of the model produced from the training set.
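
One common way to produce these three splits is two successive calls to scikit-learn's train_test_split, sketched below; the 70/15/15 ratio and the toy data are illustrative conventions rather than fixed rules:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy data standing in for a real dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# First carve off a held-out test set...
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)

# ...then split the remainder into training and validation sets (~70/15/15 overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15 / 0.85, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # roughly 700 / 150 / 150
```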

Refinement

Models must be validated and evaluated to identify the best model for the task at hand. Validation datasets are kept separate from your training data and are used to estimate a model’s skill and tweak its hyperparameters until you arrive at the best-performing model.
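
A minimal sketch of this tuning loop, assuming a logistic regression classifier whose regularization strength C is the only hyperparameter being tuned:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Try a few candidate values and keep the one that scores best on the
# held-out validation set -- the test set is never touched here.
best_C, best_score = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_C, best_score = C, score

print(f"best C={best_C}, validation accuracy={best_score:.3f}")
```

Libraries such as scikit-learn also offer utilities like GridSearchCV that automate this search over larger hyperparameter spaces.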

Evaluation

Models developed through this process are evaluated against test datasets, which are kept separate from both the training and validation data. The test dataset is intended to give an unbiased estimate of how well a particular model performs on the task at hand.
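
Continuing the sketch above, the test set is used exactly once, after all tuning is finished, so the reported numbers estimate real-world performance:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)

# Train the (already tuned) final model on the training data only.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Score once on the untouched test set for an unbiased estimate.
print(classification_report(y_test, model.predict(X_test)))
```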

Machine Learning Best Practices

There are a few best practices that should be followed during the machine learning workflow to produce high-quality models. Some of these best practices include:

Make sure you have the data

Machine learning requires a lot of data. If your problem can be solved with a few basic heuristics, that approach will usually be faster and more reliable than applying machine learning without sufficient, high-quality data.

Start simple and add complexity gradually

Machine learning is an incremental process, and it pays to start small and add complexity over time. This applies both to building the model and to tracking the right metrics: start with simple metrics that track something easily observable and attributable.

Write a lot of tests

There are many types of testing that can and should be applied when building and running machine learning models, and as many of them as possible should be automated. Tests are essential to help maintain continuous progress. To name a few: unit tests cover individual components, integration tests verify that those components work together within the broader system, and end-to-end tests exercise the full customer-facing flow.
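
As a small example at the unit-test level, a pytest check for a hypothetical min-max normalization helper might look like this (both the helper and its expected behavior are assumptions for illustration):

```python
import numpy as np
import pytest

def min_max_normalize(values):
    """Hypothetical pre-processing helper: scale a 1-D sequence into [0, 1]."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        raise ValueError("cannot normalize a constant sequence")
    return (values - values.min()) / span

def test_output_spans_unit_range():
    out = min_max_normalize([3, 7, 11])
    assert out.min() == 0.0 and out.max() == 1.0

def test_constant_input_is_rejected():
    with pytest.raises(ValueError):
        min_max_normalize([5, 5, 5])
```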

Machine Learning Models

Most machine learning models fall into one of two categories, supervised or unsupervised, though semi-supervised learning offers a middle ground. Unsupervised learning is used to find patterns in input data without reference to predefined outcomes. Supervised learning, meanwhile, learns a function that explicitly maps an input to an output (also called the ground truth) based on example input-output pairs.

Supervised learning is the most common type of machine learning and is used across a wide variety of tasks. Unsupervised learning is generally more challenging, as the correct labels are not known in advance.
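
To make the distinction concrete, a brief scikit-learn sketch: the supervised model is fitted on (input, label) pairs, while the unsupervised model only ever sees the inputs:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=1)

# Supervised: learns a mapping from inputs to known labels (ground truth).
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("supervised training accuracy:", clf.score(X, y))

# Unsupervised: sees only X and must discover structure on its own.
km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
print("cluster sizes:", [(km.labels_ == c).sum() for c in range(2)])
```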

Each machine learning model has its own complexities and best-fit use cases; there is too much detail to cover here, but additional research is well worth it to find the model best suited to your particular use case. Once models have been trained, tuned on the validation set, and tested for accuracy on the test set, error analysis of the trained models often shows that the training set needs to be augmented with additional data. In practice, machine learning is an iterative process that loops through the workflow until satisfactory performance is obtained.

Need Help With Your Machine Learning Workflow?

Machine learning is a complex topic at the forefront of modern software engineering. When correctly leveraged, it can help you work faster and more efficiently, uncovering patterns and relationships in data that weren’t obvious before. However, there are caveats, discussed in this article, that companies must consider when looking to integrate machine learning into their workflows.

If you’re looking into machine learning, but your team isn’t sure how to make it work for them, get in touch today. Our team can help your project by bringing their expertise in designing and supporting each step of your machine learning workflow.

Our team of experts will be able to guide you through the machine learning workflow and advise you as to how we may best be able to support your project.

Want to learn more? Contact us ->

Sigma offers tailor-made solutions for data teams annotating large volumes of training data.