Building a Scalable Data Annotation Strategy

Creating high-quality datasets is essential to successful artificial intelligence (AI) and machine learning (ML) projects. Industry analysts estimate that AI project teams devote about 80 percent of their time to data. However, this statistic is somewhat misleading. That time doesn’t necessarily reflect the importance that AI teams place on data quality. Rather, much of that time is spent on inefficient processes, rework, and training datasets that don’t teach a model to provide desirable outputs, even over multiple iterations.

An AI project team needs an effective data annotation strategy to avoid those issues.

Recap: Why Is Data Annotation So Important?

Data annotation enriches data by labeling “known” objects within the dataset. It makes it possible for AI models to understand images, text, audio, or other types of data, retain relevant information, and learn to use data to make decisions.

However, ML models can’t discern data accuracy or validity on their own. Their performance depends entirely on the data in their training datasets and how that data is annotated. The quality of the data you use to fuel the training process directly impacts how the solution performs.

Why Is a Data Annotation Strategy Essential?

Because data quality is vital to an AI-based project’s success, AI project teams must create and execute an effective data annotation strategy. The strategy must address:

Data consistency: The data operation team must decide on the best annotation methodology to use for the project. Then, all data must be collected and annotated consistently according to those guidelines.
Domain coverage: Training data needs adequate volume and must represent real-world operating conditions and “what if” scenarios. Furthermore, the dataset must include a distribution of data that approximates what the model will encounter in the real world.
Bias mitigation: The data annotation strategy must also include how the team will ensure that training won’t result in bias. The data annotation strategy must include tactics for representing all user demographics and situations.
Security and privacy requirements: If the project requires using sensitive or personally identifiable data, the data annotation strategy must include how to protect data and comply with regulations that protect that information.
Storage: Large training datasets require sizeable data storage. Teams need to determine the best way to store, back up, and access data.
Data annotation resources: AI project teams need enough skilled and trained resources to create datasets and to adjust annotation or augment the dataset with each iteration of the training process.
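
As a rough illustration of the domain coverage point above, the sketch below compares how often each label appears in a training set with the share the team expects to see in production. The function names, the 5-point tolerance, and the lighting-condition figures are all assumptions made for this example, not values from any particular project.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's share of the dataset as a fraction of 1.0."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def coverage_gaps(train_labels, expected_shares, tolerance=0.05):
    """Return labels whose training share differs from the expected
    real-world share by more than `tolerance` (absolute difference).
    Positive values mean over-represented, negative under-represented."""
    actual = label_shares(train_labels)
    gaps = {}
    for label, expected in expected_shares.items():
        diff = actual.get(label, 0.0) - expected
        if abs(diff) > tolerance:
            gaps[label] = round(diff, 3)
    return gaps


# Hypothetical example: lighting conditions in a computer vision dataset.
train = ["daylight"] * 77 + ["dusk"] * 18 + ["night"] * 5
expected = {"daylight": 0.6, "dusk": 0.2, "night": 0.2}

gaps = coverage_gaps(train, expected)
print(gaps)  # daylight is over-represented, night is under-represented
```

A report like this does not say how to fix a gap. It only flags where the team may need to collect, annotate, or synthesize more data.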

According to McKinsey Global Institute, 75% of AI and ML projects demand learning datasets to be refreshed once per month, and 24% of AI and ML models require refreshed data daily. A data annotation strategy should define how many human resources are needed, their skill sets, and how they can effectively support the project.

The strategy must also address:

Data preparation and annotation tools: Teams must determine the best tools for data cleansing, classification, and selection. They also need to choose the best tools for annotating the type of data necessary to train the model, whether image, video, audio, text, or tabular. Teams must also explore whether AI-based tools would help them automate processes and work more productively.
Quality assessments: An effective data annotation strategy must also include quality assessment methodologies to ensure data annotations are accurate and effective at training the model. Quality assessments should be quantitative, using metrics such as precision, recall, F1 score (which balances precision and recall), word error rate for speech recognition, and intersection over union for object detection.
Communication: Your strategy must include how you will ensure open communication with data annotators. A tight feedback loop ensures you will maintain agility and data quality.
Timeline: A well-planned data annotation strategy will include how to complete phases of the project to ensure the data is ready when needed and won’t create a project bottleneck.
Budget: Data acquisition, data preparation and annotation tools, human resources, and material resources for the project all take time and investment. Ensure the strategy will allow the AI project team to complete the project successfully on budget.
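
The quality metrics named in the list above can be computed with a few lines of code. The following is a minimal sketch in plain Python, assuming annotator output is compared against a small gold-standard set. The function names, the (x_min, y_min, x_max, y_max) box layout, and the sample values are illustrative assumptions.

```python
def precision_recall_f1(gold_labels, annotator_labels, positive):
    """Precision, recall, and F1 for one class, treating `gold_labels`
    as ground truth and `annotator_labels` as the labels under review."""
    pairs = list(zip(gold_labels, annotator_labels))
    tp = sum(g == positive and a == positive for g, a in pairs)
    fp = sum(g != positive and a == positive for g, a in pairs)
    fn = sum(g == positive and a != positive for g, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def intersection_over_union(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                            # deletion
                           cur[j - 1] + 1,                         # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # substitution
        prev = cur
    return prev[-1] / len(ref) if ref else 0.0


# Hypothetical spot check against a small gold-standard set.
gold = ["cat", "dog", "cat", "cat", "dog"]
annotated = ["cat", "cat", "cat", "dog", "dog"]
print(precision_recall_f1(gold, annotated, positive="cat"))
print(intersection_over_union((0, 0, 10, 10), (5, 5, 15, 15)))
print(word_error_rate("turn on the lights", "turn the light"))
```

In practice, a team might rely on an established metrics library instead. The point here is only that each metric reduces to a simple comparison between annotations and ground truth.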

AI teams also need to acknowledge that data annotation strategies are project-specific. A data annotation strategy should reflect factors including the problem the AI project will solve, the market the solution is designed for, performance levels the use case requires, and real-world operating conditions.

No two AI solutions are designed for exactly the same purpose. Therefore, no data annotation strategy will translate exactly from one project to another.

How Much Training Data Do You Need to Build Out an Effective Strategy?

A key element of a data annotation strategy is how to determine and produce the right amount of accurately annotated data for the project. However, AI project teams face a common challenge: the need to generate enough data volume while managing the project practically.

Factors that help AI project teams arrive at the perfect balance include:

Performance objectives: The higher the performance demand, the more data will be required. For example, recognizing products that are consistent in size and shape will take fewer images than training a model for a machine vision system designed to perform quality assurance on detailed electronic components with tight tolerances. An effective data annotation strategy uses performance objectives to guide all decisions.
Complexity: The more classes the model must address, the more data will be required to train it. Datasets must train the model on the similarities and differences among classes, input features, and each model parameter. Each function can multiply the amount of data needed to train and test the model.
Variable environmental conditions: AI models that work in controlled environments usually require less data than the same model deployed where conditions change. For example, a computer vision project will require more data if it will operate in different lighting conditions, if cameras capture images from different distances, or if different cameras will take the images. Likewise, a voice project must be able to differentiate between spoken queries and background noise and understand inputs at different volumes.
Intra-data variability: In some use cases, the data the AI model encounters can vary. Manufacturers may provide products in an assortment of colors, objects may be oriented differently, and people use different tones of voice or accents when they speak.

Another factor to consider is how much original data is available. Data is sometimes difficult to collect – or, in R&D use cases, doesn’t yet exist. Data accessibility may also be restricted due to privacy regulations such as the EU’s General Data Protection Regulation (GDPR) or the U.S.’s Health Insurance Portability and Accountability Act (HIPAA). In these cases, creating datasets with adequate volume, all or in part, with synthetic data can be beneficial.

Teams should also plan for an adequate volume of data for testing and validating the model. Validation and testing datasets must include a diverse set of data, including ground truth data, so AI project teams can use it to verify that the model can meet performance standards in real-world conditions.
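
One common way to reserve data for validation and testing is a stratified split, which keeps each class represented in every subset. The sketch below is an illustration only: the 80/10/10 fractions, the `stratified_split` helper, and the sample records are assumptions, and the right proportions depend on the project.

```python
import random
from collections import defaultdict


def stratified_split(items, label_of, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split items into train, validation, and test lists so that each
    label keeps roughly the same share in all three."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item in items:
        by_label[label_of(item)].append(item)

    train, val, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_train = int(len(group) * fractions[0])
        n_val = int(len(group) * fractions[1])
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])  # remainder goes to test
    return train, val, test


# Hypothetical dataset: 100 images, one in five labeled as a defect.
data = [("img_%03d" % i, "defect" if i % 5 == 0 else "ok") for i in range(100)]
train, val, test = stratified_split(data, label_of=lambda pair: pair[1])
print(len(train), len(val), len(test))
```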

With some projects, it’s challenging to anticipate how much training data the model will require. Testing can lead to a better understanding of the project’s needs and how to adapt data annotation processes to produce better outcomes.

What Are the Biggest Challenges for AI Project Teams?

Quality and accuracy at scale

As the volume of data needed to train the model grows, internal AI project teams eventually encounter a dilemma: quantity vs. quality. Even though a dataset may require millions of objects, each datum must be labeled correctly for the AI algorithm to learn and deliver desirable outcomes. AI project teams must establish and test quality management processes to ensure annotation at scale doesn’t sacrifice quality.
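
One possible quality management process at scale is a sampling audit: reviewers check a random sample of annotations, and the team estimates the error rate for the whole batch. The sketch below assumes that approach and uses a Wilson score interval for the estimate. The helper names, sample size, and error count are hypothetical.

```python
import math
import random


def audit_sample(item_ids, sample_size, seed=0):
    """Pick a random sample of annotated items for manual review."""
    rng = random.Random(seed)
    return rng.sample(item_ids, min(sample_size, len(item_ids)))


def error_rate_interval(errors_found, sample_size, z=1.96):
    """Observed error rate plus a Wilson score interval
    (z=1.96 corresponds to roughly 95 percent confidence)."""
    p = errors_found / sample_size
    denom = 1 + z * z / sample_size
    center = (p + z * z / (2 * sample_size)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / sample_size + z * z / (4 * sample_size ** 2)
    )
    return p, max(0.0, center - half), min(1.0, center + half)


# Hypothetical batch of one million annotations; reviewers check 400.
batch_ids = list(range(1_000_000))
to_review = audit_sample(batch_ids, sample_size=400)

# Suppose reviewers find 12 mislabeled items in the sample.
rate, low, high = error_rate_interval(errors_found=12, sample_size=400)
print(f"observed {rate:.1%}, plausible range {low:.1%} to {high:.1%}")
```

If the upper bound exceeds the quality threshold agreed for the project, the batch can be sent back for rework before it reaches training.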

Human Resources

When projects scale, hiring in-house resources for data annotation is an option. However, it takes new employees months to be able to work independently and meet quality standards.

It may be tempting to assign data annotation to other members of the AI project team in an all-hands-on-deck approach. However, data labeling takes specific skills and traits, such as patience, attention to detail, excellent short-term memory, and the ability to work consistently. Moreover, other team members may not have the temperament for this work – and taking them away from other parts of the project could cause delays.

Speed

Most AI projects are on a deadline. Annotating datasets of millions of data points can easily become the bottleneck that leads to delays or that allows a competitor to take a solution to market more quickly. An effective data annotation strategy includes time and resources for project-specific training and adequate time allocated for annotation itself.

Many companies don’t have the staff to complete large-scale data annotation projects in-house according to their preferred timelines.

Agility

AI project teams must also plan for work beyond the first round of annotation. AI-model building is iterative, and datasets must be updated or modified to refine the outcomes the model delivers.

Automating some processes to support humans, such as validating results, can be an effective strategy for increasing agility as requirements change.
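
As an illustration of automated validation, the sketch below runs rule-based checks on a bounding-box annotation before a human reviewer sees it. The record layout, the label set, and the specific rules are assumptions for this example. Real projects would derive the rules from their own annotation guidelines.

```python
# Hypothetical label set taken from a project's annotation guidelines.
ALLOWED_LABELS = {"car", "pedestrian", "cyclist"}


def validate_box_annotation(annotation, image_width, image_height):
    """Return a list of problems found in one bounding-box annotation.
    An empty list means the annotation passed every automated check."""
    problems = []
    if annotation.get("label") not in ALLOWED_LABELS:
        problems.append("unknown label")

    box = annotation.get("box")
    if not box or len(box) != 4:
        problems.append("box must have four coordinates")
        return problems

    x_min, y_min, x_max, y_max = box
    if x_min >= x_max or y_min >= y_max:
        problems.append("box has zero or negative area")
    if x_min < 0 or y_min < 0 or x_max > image_width or y_max > image_height:
        problems.append("box extends outside the image")
    return problems


good = {"label": "car", "box": (10, 20, 110, 80)}
bad = {"label": "truck", "box": (600, 50, 700, 40)}

print(validate_box_annotation(good, image_width=640, image_height=480))
print(validate_box_annotation(bad, image_width=640, image_height=480))
```

Checks like these catch mechanical mistakes cheaply, which leaves human reviewers free to focus on judgment calls.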

How Do You Know It’s Time to Outsource Data Annotation?

When AI project teams can’t overcome the challenges they face with data annotation, struggling to produce adequate data volumes or meet goals for quality, speed, and agility, it’s time to consider outsourcing.

Indications that outsourcing data annotation makes the most sense for your project include:

Lack of state-of-the-art tools: Data annotation providers use advanced tools that support consistent, accurate labeling and maximum throughput. AI project teams often don’t have those capabilities in-house.
Lack of in-house resources with the right skill set: The challenge of staffing data annotation projects increases with scale and the risk of staff turnover.
Inefficiency: Outsourcing may be the best choice if initial attempts to label data have resulted in excessive rework.
Challenges to finding data: In-house teams may not have the time or ability to collect adequate volumes of data that represent all parts of the domain or to create synthetic data.
Breakdowns in collaboration: Data annotation is an iterative process and requires alignment between the annotators and data scientists/AI engineers to adapt datasets for improved model performance.
Tight timeline: Outsourcing can benefit projects by keeping them on track to reach milestones according to schedule.
Low risk tolerance: AI teams can use data annotation providers’ expertise to develop a plan to provide datasets that will train the model for proficiency early in the project.

Options for Outsourcing Data Annotation

When an AI project team determines that outsourcing data annotation is the best course, it must make decisions and preparations to get the most value from working with a provider.

The first question is which provider or providers to use for outsourced data annotation. Options include:

Crowdsourcing: AI project teams may be able to find data annotation resources through platforms such as Amazon Mechanical Turk or Upwork. With this option, AI teams must clearly communicate data annotation guidelines and provide training. Controlling consistency can often be difficult when data annotators don’t work as a cohesive team. Additionally, the labeled datasets that teams receive generally haven’t undergone rigorous quality control. Also, note that if your project must comply with privacy and data protection regulations, crowdsourcing isn’t an option.

Using a data platform and contractor annotators: Another option is working with a company that has built its own platform, enabling AI teams to self-manage data annotation. These platforms may have advanced capabilities, such as tools with ML-assisted annotation features. The platform provider may also source annotators for a project. However, AI teams have little control over the labelers’ expertise and data quality.

Partnering with a data annotation service provider: AI project teams can partner with an experienced service provider that has built a data annotation solution and employs a team of experienced data labelers. These providers offer data annotation services and commit to specific service and quality levels. They can quickly adapt to an AI team’s data annotation strategy, guidelines, and data requirements. This option is most likely to result in the data quality necessary, provided according to the project timeline.

How to Prepare to Outsource Data Annotation

Once an AI team selects an outsourced data annotation services provider, it must follow these steps to build a good working relationship and achieve the level of data quality and service it needs.

Define Project Requirements

The AI project team must decide on the requirements the project’s datasets must meet. Next, the team will work with the provider to define project scope and task ownership and to develop a data annotation strategy specific to the project.

The project team will be responsible for notifying the annotation provider of the data privacy and security regulations data annotators must comply with, as well as the expected performance level required for the use case.

Data Sourcing

Whether the AI project team or the annotation service provider sources data, this process must begin with establishing ethical, regulatory, and data ownership standards for the project.

Data Cleansing and Filtering

Once data is collected, it must be cleaned and filtered so that the ML model can learn from it more easily. The AI project team will put all data into a consistent, standardized format. Then the team will filter out data that isn’t relevant to the project.
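
For text data, cleansing and filtering might look like the minimal sketch below: standardize the format, drop duplicates and empty records, then remove anything that fails a relevance test. The helper names and the two-word relevance rule are assumptions chosen to keep the example short.

```python
import unicodedata


def clean_text(raw):
    """Standardize Unicode form and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    return " ".join(text.split())


def clean_and_filter(records, is_relevant):
    """Clean each record, then drop empty, duplicate, and irrelevant ones.
    Duplicates are detected case-insensitively; the first copy is kept."""
    seen = set()
    cleaned = []
    for raw in records:
        text = clean_text(raw)
        key = text.casefold()
        if not text or key in seen or not is_relevant(text):
            continue
        seen.add(key)
        cleaned.append(text)
    return cleaned


raw_records = [
    "  Order   arrived late ",
    "order arrived late",
    "",
    "Great product!",
    "asdf",
]

# Hypothetical relevance rule: keep records with at least two words.
result = clean_and_filter(raw_records, is_relevant=lambda t: len(t.split()) >= 2)
print(result)  # ['Order arrived late', 'Great product!']
```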

Data Augmentation

After data is collected, cleansed, and filtered, it’s possible that the AI project team recognizes that there isn’t enough data to train the ML model properly. The problem may be that available data isn’t balanced, with some classes underrepresented. Or available data won’t train the model for real-world operations in all environmental conditions.

The AI project team may source more data, if possible. Other options are to generate synthetic data based on existing data or patterns of real-world data, or to work with the data annotation service provider and use its resources to augment datasets.
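
Before sourcing or generating more data, it helps to quantify the imbalance. The sketch below estimates how many additional examples each underrepresented class would need. The rule used here, that every class should reach at least half the size of the largest class, is an arbitrary assumption for illustration, as are the class names and counts.

```python
import math
from collections import Counter


def augmentation_targets(labels, min_share_of_largest=0.5):
    """For each underrepresented class, return how many extra examples
    would bring it up to `min_share_of_largest` times the largest class."""
    counts = Counter(labels)
    target = math.ceil(max(counts.values()) * min_share_of_largest)
    return {label: target - n for label, n in counts.items() if n < target}


# Hypothetical quality-inspection dataset with few defect examples.
labels = ["ok"] * 900 + ["scratch"] * 80 + ["dent"] * 20
needed = augmentation_targets(labels)
print(needed)  # {'scratch': 370, 'dent': 430}
```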

Translation, Normalization, and Localization

For some projects, it’s necessary to adapt data to take regional differences into account. Text data may vary. For example, people in various parts of the U.S. refer to a carbonated beverage as “pop,” “soft drink,” or “soda.” Also, data may have been generated in multiple languages. However, all data must be consistent for the ML algorithm.

Additionally, data annotation guidelines should address these variations, instructing data labelers to use consistent terms. The machine learning model’s performance can suffer if data isn’t labeled the same way for training.
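
A simple way to enforce consistent terms is a lookup table that maps regional variants to one canonical label and flags anything it doesn’t recognize. The mapping and function below are hypothetical and only extend the beverage example above.

```python
# Hypothetical mapping from regional variants to one canonical term.
CANONICAL_TERMS = {
    "pop": "soft drink",
    "soda": "soft drink",
    "soft drink": "soft drink",
}


def normalize_terms(raw_terms, mapping=CANONICAL_TERMS):
    """Map each term to its canonical form. Terms missing from the
    mapping are kept (lowercased, whitespace collapsed) and also
    returned separately so the guidelines can be updated."""
    normalized, unmapped = [], set()
    for raw in raw_terms:
        key = " ".join(raw.casefold().split())
        if key in mapping:
            normalized.append(mapping[key])
        else:
            normalized.append(key)
            unmapped.add(key)
    return normalized, unmapped


terms, unknown = normalize_terms(["Pop", " soda ", "Soft  Drink", "cola"])
print(terms)    # ['soft drink', 'soft drink', 'soft drink', 'cola']
print(unknown)  # {'cola'}
```

Returning the unmapped terms separately gives annotators and data scientists a concrete list to discuss when they revise the guidelines.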

Partnering with a Data Annotation Service Provider Gives You Data Preparation Options

An experienced, full-service data annotation service provider can assist you from strategy to deployment of AI projects. If you rely on crowdsourced or outsourced annotators, your end-to-end strategy and the steps from data sourcing to developing annotation guidelines fall on the internal project team. Furthermore, each step in ground truth and training data development requires strategic and tactical expertise.

Therefore, an important step in preparing to outsource includes thinking through how your team could benefit from additional help.

We love working with people to build out the proper data annotation strategy for their teams

Creating an effective, comprehensive data annotation strategy is challenging – but it’s a challenge that Sigma helps AI project teams overcome every day. With Sigma, you can work with specialists experienced in 2D bounding boxes, landmark and point annotation, semantic segmentation, polygons, voice transcription, named entity recognition, search relevance annotation, and more. Sigma’s services also include multiple checkpoints to ensure quality and accuracy.

Sigma has developed a systematic approach that ensures clients have the high-quality datasets they need to successfully train their ML models. Here’s what you can expect when you work with Sigma for data annotation:

Project Analysis: Sigma’s senior team discusses projects with clients and prepares a proposal that addresses their needs. Sigma assigns a project manager based on their skill sets and experience with the type of data annotation required.
Guidelines and Requirements: Sigma aligns with project guidelines, reviews project requirements, and addresses objectives and deliverables.
Tools and Procedures Setup: The next step is to choose the tools for your project, leveraging Sigma’s platform and AI-assisted tools and testing and quality assurance procedures. The goal is to maximize throughput and quality.
Comprehensive Test: Next, annotators collect data using tools and procedures chosen for the project. Sigma generates reports outlining results and suggestions.
Client Feedback: Sigma shares findings with the AI project team to decide on the best course: whether to move forward as planned or to update guidelines and retest.
Annotation/Collection at Scale: Once the plan is confirmed, Sigma trains appropriate annotators based on their experience and moves into collection at scale.
Quality Assessment: Sigma runs quality assessments and provides a report on results as well as suggestions for improvements to processes or code.

Position Your Project for Success

The attention you give your data annotation strategy will pay off with savings of time, costs, and resources, as well as clean and accurate data to train ML models. Moreover, it ultimately means you solve the problem and meet performance standards for the use case.

A well-planned strategy ensures data consistency and domain coverage, limits bias, and addresses data security concerns. It also ensures data annotation is properly staffed with the right number of human resources equipped with the right skills to accomplish data labeling expertly and according to schedule.

When you’re building your data annotation strategy, ask these pivotal questions:

Do we need more resources to accomplish data annotation effectively and generate the required volume of annotated data?
Would our project benefit from advanced tools with features that enable the highest quality annotation?
How can we mitigate risk from undesirable outcomes in training and reworking data annotation?
Would our project benefit from the advice of an experienced data annotation service provider?

Sigma has the answers you need.

Sigma is committed to helping AI project teams plan, build effective data annotation strategies, and adapt and scale as their project’s needs change. With Sigma, you have a partner throughout the process rather than annotators who only label data.

Our highly skilled and experienced annotators and project managers work as a collaborative team, leveraging our platform, including advanced technologies, to ensure your project stays on track and that datasets meet quality standards. Our end-to-end approach to data annotation is designed for the most efficient processes to successfully execute your plan for building a model that delivers real value.

For more information on building an effective data annotation strategy, download our white paper “Defining the Right Data Strategy.”

