What is ground truth data?
Meteorologists coined the term “ground truth data” for information collected on site by “storm spotters” or others that confirms what tools such as satellite imagery indicate. Similarly, in machine learning, ground truth data is information from real-world observations used to calibrate an artificial intelligence (AI) algorithm or model.
In short, it’s the reality you teach your AI so it can draw the right conclusions and make the right decisions.
You might hear the terms “ground truth data” and “training dataset” used interchangeably, but that’s not quite accurate. An AI development team collects ground truth data and then divides it into pieces: a training dataset to fit the model and a testing dataset to confirm the model works as intended. Part of the ground truth dataset might also be reserved as a validation dataset, used to tune the model’s performance or help teams choose between two models. All of this data, including training data, falls into the ground truth category.
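The split described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function name `split_ground_truth`, the 70/15/15 proportions, and the toy `(reading, label)` records are all assumptions made for the example.

```python
import random

def split_ground_truth(records, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle labeled records and split them into train/validation/test lists.

    The fractions are illustrative defaults, not a recommendation.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty share for testing")
    shuffled = list(records)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test

# Hypothetical ground truth: (sensor_reading, label) pairs.
ground_truth = [(i, "fault" if i % 5 == 0 else "ok") for i in range(100)]
train, validation, test = split_ground_truth(ground_truth)
print(len(train), len(validation), len(test))
```

Every record ends up in exactly one of the three pieces, so the test set stays unseen during training and tuning.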
Why is ground truth data important?
To perform optimally, AI platforms must be trained on ground truth data. Data done right allows a machine to process data as a human would, warning people not to touch a hot surface, giving accountants an NSF (insufficient funds) alert, or giving stargazers the exact minute the next meteor shower will occur.
These are simple examples, but an accurately trained algorithm can also spot production anomalies that could lead to catastrophic equipment damage, or enable an autonomous vehicle to protect pedestrians while operating. AI models trained with ground truth data can significantly impact safety, accuracy, machine uptime, and costs.
It’s also important to recognize that AI models must be trained not only on quality data but also on a large volume of it. Larger training datasets should include more examples, including edge cases. This prepares the algorithm for accurate performance in the real world, where those edge cases are a part of ground truth.
Furthermore, training isn’t a one-and-done task. McKinsey reports that 75% of AI and machine learning models need to be refreshed from time to time with new ground truth data, and 24% require refreshed annotated datasets daily.
How do you get ground truth data?
Given that ground truth data is necessary to properly train an AI or machine learning model, the next question is how to obtain it. Depending on the task, several options are available for sourcing data:
- Accurately labeled or annotated datasets
The most common way to create a ground truth dataset usable by your machine learning algorithm is to label it. Ground truth labels mark target instances that represent the outcomes that developers expect. This is a prime example of the importance of the human side of effective artificial intelligence. With accurate ground truth labels, humans provide the model with accurate, reliable information with the goal of near-zero error rates.
- Synthetic or simulated data
Tech solutions can generate synthetic data to train and validate AI or machine learning platforms. It’s an option when historical data isn’t available, for example, when data teams want to train the algorithm on how to respond in what-if situations.
- Real data
Depending on the use case, AI solution builders may collect ground truth data from real-world circumstances, for example, when developing a machine learning platform designed to predict equipment failure. Sensor data collected during the incident can train the model to detect signs of future issues.
Whether using labeled datasets or other sources of ground truth data, how humans prepare that data is key to a machine learning system’s performance. Humans choose how to acquire ground truth data, how to use it to train and validate a model, and how often to refresh the ground truth data the model uses to refine performance. Also, humans — not computers — can see around corners, assess risk and find ways to mitigate — or eliminate — unfavorable or even dangerous outcomes.
What to keep in mind to control ground truth data quality
Machine learning solution builders often ask some common questions when training their algorithms. Here are the answers:
How much data do we need?
The volume of ground truth data necessary to train an algorithm adequately and validate its performance varies. You need enough data to represent the domain in which the model will operate. For example, you need to train the algorithm on all lighting conditions a machine vision system will operate in or possible anomalies the system could encounter during a manufacturing process. Missing data can lead to irrelevant — or in some cases, even catastrophic — results.
However, don’t be tempted to sacrifice data quality to increase data volume. Inaccurate data will also lead to negative outcomes. Build the volume you need with relevant data, providing domain coverage and balance among the classes to be modeled to thoroughly train your model and prepare it for real-world use.
What parameters related to data do we need to consider?
Every AI project needs data volume specific to its domain, and each project has unique requirements for planning and designing training data and validating a model’s performance. There are three things every AI development team should consider, however:
- Balance: Datasets need to be balanced so that no part of the data the AI platform will use is underrepresented in training. The model needs training on all aspects of the task it will perform, with training data in proportion to the data it will encounter in the real world.
- Bias: As much as people want to remain unbiased, they always bring their experiences to a project, whether related to gender, age, race, politics, religion — or just hopes that the model will work a certain way. Teams need to keep the reality of bias in mind and do all they can to create a model that delivers unbiased results.
- Coverage: Ensure real-world variability, including operating conditions, is covered. Operating conditions can vary widely, so it’s important to make sure that your dataset covers as many different types of conditions as possible.
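A simple balance and coverage check can be sketched as follows. It compares each class's share of the labels with the share expected in the real world and flags classes that fall short or are missing entirely. The function names, the `tolerance` threshold, and the defect classes are hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter

def class_shares(labels):
    """Return each label's share of the dataset as a fraction of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def underrepresented(labels, expected_shares, tolerance=0.5):
    """List classes whose share is below tolerance * their expected share.

    A class that never appears has a share of 0.0, so gaps in coverage
    are flagged as well. The tolerance value is an assumption.
    """
    shares = class_shares(labels)
    return sorted(
        label
        for label, expected in expected_shares.items()
        if shares.get(label, 0.0) < tolerance * expected
    )

# Hypothetical inspection labels and assumed real-world proportions.
labels = ["ok"] * 95 + ["scratch"] * 4 + ["dent"] * 1
expected = {"ok": 0.90, "scratch": 0.05, "dent": 0.05}
print(underrepresented(labels, expected))  # ['dent']
```

A check like this only looks at label counts. It does not detect bias in how the labels were assigned, which still needs human review.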
How can we establish consistent, quality labeled data?
When labeling ground truth data, the quality of labels will directly impact the quality of results. It can be challenging to maintain that quality when dealing with large data volumes. The best strategies include process checks within labeling workflows to ensure consistency and completeness.
Sigma has established a three-factor approach to data annotation accuracy:
1. Preventive measures
The best way to avoid an AI or machine learning platform that doesn’t perform as needed due to labeling or annotation errors is not to make the errors in the first place. We recommend using a team of people with the right skills and training them on proper procedures and guidelines, including pointing out common mistakes to avoid. It’s also wise to support your team with the right tools that automate repetitive and error-prone processes and build a motivating work environment that sets up your team for success.
Also, errors detected during validation are often related to the fact that models can’t distinguish between data classes. Correcting this issue and clearly communicating data attributes is often an iterative process of trial and error to achieve accuracy. Working with data annotation experts can reduce errors and increase speed to market.
2. Reactive measures
Even with a solid plan for preventing errors, interpretation issues, systematic issues, or other errors are likely to occur occasionally. Teams need a strategy to detect errors and provide feedback to refine annotation processes.
3. Metrics measures
Setting benchmarks for quality and performance will allow teams to better understand the source of errors or data quality issues and provide the framework for improvement.
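As one possible illustration of such benchmarks, the sketch below computes two common annotation metrics: accuracy against a reviewed reference ("gold") set and Cohen's kappa, which measures agreement between two annotators corrected for chance. These are standard measures picked for the example, not a description of any specific vendor's QA methodology, and the label lists are made up.

```python
from collections import Counter

def accuracy(labels, gold):
    """Fraction of labels that match a reviewed reference ("gold") set."""
    if not gold or len(labels) != len(gold):
        raise ValueError("label lists must be non-empty and the same length")
    return sum(a == b for a, b in zip(labels, gold)) / len(gold)

def cohens_kappa(a, b):
    """Agreement between two annotators, corrected for chance agreement."""
    if not a or len(a) != len(b):
        raise ValueError("label lists must be non-empty and the same length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both annotators pick the same class
    # if each labeled independently at their own class frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used a single identical class
    return (observed - expected) / (1 - expected)

# Hypothetical labels from a gold set and two annotators.
gold        = ["cat", "cat", "dog", "dog", "cat", "dog", "cat", "dog"]
annotator_a = ["cat", "cat", "dog", "dog", "cat", "dog", "cat", "dog"]
annotator_b = ["cat", "cat", "dog", "cat", "cat", "dog", "dog", "dog"]

print(accuracy(annotator_b, gold))             # 0.75
print(cohens_kappa(annotator_a, annotator_b))  # 0.5
```

Tracking numbers like these per annotator or per class over time can help show where errors come from, for example a class definition that annotators interpret differently.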
Ground truth requires quality data collection and annotation
Reliable ground truth data collection and annotation is the foundation of an accurate AI model. Investing time into ensuring that data annotation is on point with ground truth and data that represents real-world conditions will provide returns in the form of a model that delivers accurate results and value to the operation in which it’s deployed. Sigma guarantees 98% accuracy using our QA tools and methodology and up to 99.99% accuracy upon request. We work with data science teams to collect, label, and optimize large, diverse datasets for machine learning models. Please contact us to learn more.