Five Factors of Training Data Quality
Quality in, quality out — an AI is only as good as the data that trains it. Artificial intelligence algorithms learn by example, so to create smarter algorithms, both the original dataset and the annotated data have to meet high-quality standards.
At Sigma, we’ve identified five key quality factors for training data that we monitor and improve on throughout the data sourcing and annotation processes.
Data Sourcing Quality Factors
Algorithms need a certain amount of data to have enough examples to perform well. The volume of data needed depends on the goals and tasks the algorithm is designed to carry out. In general, the more data, the better, but time and resources are usually limited. We can help you find the optimal volume of data to collect by selecting data examples that provide new information to improve the AI and eliminate biases, and by mathematically identifying the point where collecting more data leads to diminishing returns.
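One way to make "diminishing returns" concrete is to track the marginal gain in validation accuracy as the training set grows. The sketch below is purely illustrative: the sizes, scores, and `min_gain_per_1k` threshold are hypothetical, not a description of Sigma's actual methodology.

```python
# Hypothetical validation accuracy measured at increasing training-set sizes.
sizes = [1_000, 2_000, 4_000, 8_000, 16_000, 32_000]
scores = [0.71, 0.78, 0.83, 0.86, 0.875, 0.882]

def diminishing_returns_point(sizes, scores, min_gain_per_1k=0.001):
    """Return the first size at which the marginal accuracy gain per
    1,000 additional examples drops below the threshold, or None if
    the curve is still improving at the largest size measured."""
    for (s0, a0), (s1, a1) in zip(zip(sizes, scores),
                                  zip(sizes[1:], scores[1:])):
        gain_per_1k = (a1 - a0) / ((s1 - s0) / 1_000)
        if gain_per_1k < min_gain_per_1k:
            return s1
    return None  # still improving: collecting more data is worthwhile

print(diminishing_returns_point(sizes, scores))  # → 32000
```

In practice the curve would be estimated from repeated training runs, but the shape of the decision is the same: stop collecting when each additional batch buys less accuracy than it costs.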
Training data needs to thoroughly cover the full range of topics, domains, and operating conditions the AI application is being designed for. The closer the data is to real-world conditions, the better the application will perform. For images, this might mean lighting conditions or camera type. For recorded speech, it could mean different communication channels or levels and types of background noise, as well as including technical terms or topic-specific language. Especially for speech and natural language applications, we have deep experience and can quickly identify potential gaps in domain coverage before they become a problem.
Similar to domain coverage, the dataset needs to represent the end users of the AI application, or the population modeled by the AI system. Demographic characteristics of that population, such as age, gender, race, language, and dialect, as well as accents and other speech differences, must all be covered in the training data, or biases could be introduced.
Even if there’s good domain and user coverage in the dataset, it still needs to be balanced, meaning all cases need to be covered in representative proportions. While AI isn’t biased in the same way as humans — it’s not affected by fatigue, friendship, emotion, sickness, or other human characteristics — it can be discriminatory or make mistakes if the training data isn’t balanced. At Sigma, we have a strong focus on ethical AI and have broad expertise in avoiding biases in training data from the outset.
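As a concrete illustration, checking balance can start with comparing each group's share of the dataset against its share of the target population. This is a minimal sketch with made-up numbers and a hypothetical 5% tolerance, not a description of Sigma's internal tooling.

```python
from collections import Counter

def balance_report(labels, target_shares, tolerance=0.05):
    """Flag groups whose share of the dataset deviates from the
    target population share by more than `tolerance`."""
    counts = Counter(labels)
    total = sum(counts.values())
    report = {}
    for group, target in target_shares.items():
        actual = counts.get(group, 0) / total
        report[group] = {
            "actual": round(actual, 3),
            "target": target,
            "balanced": abs(actual - target) <= tolerance,
        }
    return report

# Hypothetical speaker age groups for a speech corpus.
speakers = ["adult"] * 70 + ["senior"] * 10 + ["young_adult"] * 20
targets = {"adult": 0.5, "senior": 0.25, "young_adult": 0.25}
print(balance_report(speakers, targets))
```

Here adults are over-represented and seniors under-represented relative to the target shares, so a model trained on this corpus would likely perform worse on seniors' speech; real balance checks would cover every relevant demographic axis, not just one.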
Data Annotation Quality Factors
Accuracy in annotations (whether an annotator labels a piece of data correctly according to the guidelines) is essential in creating high-quality training data. At Sigma, we have decades of experience designing guidelines and processes that ensure accuracy. We guarantee 98% accuracy with our tech-assisted, human-in-the-loop methodology, and up to 99.99% accuracy on request. Here are some of the quality assurance measures we implement:
Preventative measures avoid annotation errors before they happen. This includes defining precise guidelines, selecting and training the team, establishing clear roles and responsibilities within the team, optimizing workflows with an eye toward creating a motivating environment for annotators, and building user-friendly annotation tools that protect against errors.
Reactive measures detect and fix errors and other issues after they’ve occurred. While human annotators are exceptional at recognizing implicit patterns and detecting subtle nuances in data, they’re limited in attention span, and focus decreases over time. To ensure extremely high accuracy, we take additional reactive measures to minimize errors, including quality tests after model validation.
Continuous measures fall between preventative and reactive measures, and allow annotation teams to iterate and improve on quality during running projects. While occasional errors can happen randomly, systematic errors and interpretation issues are the main culprits that cause issues in accuracy and consistency. By continuously monitoring the quality of annotator output, project managers can identify systematic or interpretation errors early in the process and refine guidelines accordingly — or spot when annotator fatigue is influencing accuracy.
Continuous monitoring, feedback, and improvements are central to Sigma’s approach to quality. Project managers monitor group and individual accuracy, maintaining constant feedback loops with annotators and iterating on the guidelines. In addition, we periodically test annotators during the running project to ensure they’re applying the guidelines correctly. This is essential because guidelines are refined on an ongoing basis, and new data brings new cases. If the tests show that certain annotators are making occasional mistakes, especially after new data is introduced, they can be pulled from the project for additional training on interpreting the new cases. We also keep continuous communication with the client to resolve any interpretation issues and guideline edge cases quickly and efficiently.
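A simple version of this kind of monitoring can be sketched by scoring each annotator against gold-standard items embedded in the task stream. Everything here (the data and the `flag_below` threshold) is a hypothetical illustration, not Sigma's production system.

```python
from collections import defaultdict

def annotator_accuracy(judgments, gold, flag_below=0.98):
    """judgments: list of (annotator, item_id, label) tuples;
    gold: item_id -> correct label for embedded gold-standard items.
    Returns per-annotator accuracy on gold items, plus a list of
    annotators whose accuracy falls below the flag threshold."""
    hits = defaultdict(int)
    seen = defaultdict(int)
    for annotator, item_id, label in judgments:
        if item_id in gold:  # score only the embedded test items
            seen[annotator] += 1
            hits[annotator] += label == gold[item_id]
    accuracy = {a: hits[a] / seen[a] for a in seen}
    flagged = [a for a, acc in accuracy.items() if acc < flag_below]
    return accuracy, flagged

judgments = [
    ("ann1", "q1", "cat"), ("ann1", "q2", "dog"),
    ("ann2", "q1", "cat"), ("ann2", "q2", "cat"),
]
gold = {"q1": "cat", "q2": "dog"}
accuracy, flagged = annotator_accuracy(judgments, gold)
print(flagged)  # → ['ann2']
```

Run over a rolling window of recent work, the same calculation can also surface fatigue: an annotator whose accuracy was stable and then drifts downward is a candidate for a break or retraining rather than removal.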
Similar to how children learn, AI algorithms learn through examples. Just as a child would learn colors incorrectly if one teacher says teal is blue, but another teacher says it’s green, an algorithm won’t perform correctly if the data annotations don’t consistently follow a well-defined set of criteria. In addition to the quality measures described above, which support both accuracy and consistency, we put a special focus on team selection and training to establish a high level of consistency in annotation.
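Consistency between annotators is commonly quantified with inter-annotator agreement statistics such as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch for two annotators follows; the color labels are a toy example echoing the teal analogy above, and the formula assumes agreement is not already perfect by chance.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: probability both pick the same class independently.
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

ratings_a = ["blue", "blue", "green", "blue"]
ratings_b = ["blue", "green", "green", "blue"]
print(cohens_kappa(ratings_a, ratings_b))  # → 0.5
```

A kappa near 1 indicates the annotators share the same interpretation of the guidelines; a low or falling kappa is a signal that the guidelines, not just individual annotators, may need refinement.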
Annotation team curation: We curate annotator teams with the specific profiles, expertise, and attention to detail that match each project’s domain and requirements. We never crowdsource; instead, we draw candidates from an extensive database of vetted data collection and annotation professionals. Thanks to our process of continuous candidate selection, we can scale your project quickly, with the right professionals for your specific project, without increasing the cost.