5 key challenges of human data annotation in the gen AI era
The potential of the global data collection and labeling market is immense, with a projected revenue of US$17 billion by 2030, growing at nearly 30% annually. Domain-specific models are driving rapid growth in specialized industry sectors, such as healthcare.
Here’s why human data annotation will be more relevant than ever in the gen AI era:
- Generative AI must be trained on diverse, unstructured, and complex datasets, in multiple formats (e.g., images, videos, audio, natural language). This data needs to be curated and validated to make sure that it is consistent and accurate. Human validation is crucial to this process.
- Domain-specific models increase the need for human annotators with specialized knowledge and subject matter expertise, as opposed to generalists. This expert guidance is necessary not only throughout the data annotation process but also for validation.
- Because the output of gen AI is subjective, human judgment plays a crucial role in the data annotation and validation process.
- Ethical AI starts with data annotation and requires inclusive and diverse data annotation teams to ensure fair and unbiased gen AI models.
New skills to navigate subjectivity
Gen AI projects are anything but conventional. Instead, they require adaptability, diversity, and out-of-the-box thinking. With the subjective nature of AI-generated content, human data annotation becomes increasingly crucial and complex.
Clara Abou Jaoude, senior project manager at Sigma AI, explains: “Just as no two people can write an email exactly the same way, generative AI produces unique content tailored to countless languages, styles, and applications worldwide. It opens up endless possibilities.”
Specialists at Sigma AI have identified specific skills required for data annotation in the gen AI era, including:
- Logical and linguistic reasoning
- Creative thinking
- Summarization
- Prompt writing
- Paraphrasing
- Attention to detail
- Research skills
“AI must be able to synthesize information, verify facts, and recognize reliable and credible sources,” says Valentina Vendola, manager at Sigma AI. “This means that the people training it need to possess these abilities.”
The demand for new skills and domain expertise for gen AI has created new roles for data annotators, which go beyond the traditional roles of linguists and translators.
For example, if a project requires accurate recognition of emotions, a psychologist’s expertise would be valuable, Vendola explains. However, when it comes to recognizing subtle vocal nuances and their underlying meanings, someone with a background in music might also be a strong candidate, adds Abou Jaoude.
Finding domain experts
The shift from generalized to specialized language models in generative AI is reshaping what is needed from human data annotators.
As Jean-Claude Junqua, Executive Senior Advisor at Sigma AI, explains, “Data annotation for generative AI is evolving from a generalist to a specialist approach, demanding domain expertise to produce high-quality training data in specific fields such as biomedical or physics.”
Annotation experts must possess deep domain knowledge, be able to identify subtle nuances within data, and have a strong grasp of industry-specific jargon.
Scaling fast while maintaining quality
Fine-tuning LLMs for specialized tasks or domains enables companies to develop and scale gen AI models faster. As a result, they are demanding faster turnaround times for data annotation and validation processes.
Here’s how companies are accelerating data annotation while maintaining high-quality standards:
- Engage a vetted and trained workforce of annotators, who are ready to tackle complex projects.
- Develop clear annotation guidelines to reduce ambiguities and ensure quality.
- Bring automation tools into the data annotation workflow. Simple AI use cases, such as named entity recognition or extracting relationships between entities, can be easily automated with generative AI, accelerating the data annotation process.
- Have humans in the loop (HITL), providing continuous feedback and verifying the quality of the annotations.
- Implement data augmentation and synthetic data. Training generative AI models requires a substantial amount of data, which might be too expensive and time-consuming to obtain entirely manually. Thanks to data augmentation and synthetic data, companies can achieve the variety and diversity they need to create balanced, representative datasets.
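The automation and human-in-the-loop points above can be sketched in a few lines. This is a minimal illustration, not any specific vendor's pipeline: it uses a hypothetical gazetteer lookup in place of a real NER model (the names `GAZETTEER`, `pre_annotate`, and `human_review` are assumptions for the sketch), but the machine-proposes, human-verifies flow is the same.

```python
import re

# Hypothetical gazetteer of known entities. A production pipeline would use
# an NER model here, but the pre-annotate-then-review flow is identical.
GAZETTEER = {
    "ORG": ["Sigma AI"],
    "LAW": ["GDPR", "HIPAA"],
}

def pre_annotate(text):
    """Machine pass: propose entity spans for a human to confirm or reject."""
    proposals = []
    for label, terms in GAZETTEER.items():
        for term in terms:
            for match in re.finditer(re.escape(term), text):
                proposals.append({
                    "start": match.start(),
                    "end": match.end(),
                    "label": label,
                    "text": match.group(),
                    "status": "pending_review",
                })
    return proposals

def human_review(proposals, decisions):
    """HITL pass: apply a reviewer's verdict ("accepted"/"rejected") to each proposal."""
    return [dict(p, status=decisions.get(i, "pending_review"))
            for i, p in enumerate(proposals)]

spans = pre_annotate("Sigma AI keeps GDPR compliance in scope.")
reviewed = human_review(spans, {0: "accepted", 1: "accepted"})
```

Because the machine only proposes annotations and every span carries a review status, annotators spend their time validating rather than labeling from scratch, which is where the speed-up comes from.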
Building diverse data annotation teams to reduce bias
Diversity is essential to train generative AI models. “Feeding AI with data from a single perspective — one person, one country, one language, one culture, and one social class — introduces bias into the model,” says Vendola.
Building a diverse team of annotators begins with an inclusive recruitment process. “Our focus is on the skills and technical profiles that align with the project’s objectives,” Abou Jaoude says. “We prioritize diversity and avoid any discriminatory filters or biases during recruitment. While language proficiency is essential, the specific skills and expertise required for a project are the only limiting criteria we consider.”
Addressing data security and data privacy concerns
Data security and data privacy are big concerns for companies working on generative AI initiatives, particularly when outsourcing projects to third-party vendors. From a data perspective, the main challenges involve data breaches, unauthorized access, misuse of sensitive information, and compliance with data protection regulations.
Here are some of the strategies that companies should consider when handling sensitive or confidential data in gen AI projects:
- Using secure rooms for data annotation. Secure rooms have strict access controls, and project-specific data is restricted to the relevant teams. No personal items or external devices are allowed inside these rooms, and computers have polarized monitor filters to limit data visibility to the annotator working on a specific project.
- Implementing data protection measures for remote projects. Centralize all communication in an internal tool with tiered access controls so that only authorized personnel can access specific data based on their roles, departments, and project involvement. At Sigma AI, for example, all interactions and data sharing take place exclusively within its own proprietary messaging tool.
- Implementing robust physical security measures across offices and facilities. Require employees to wear badges with biometric identification. Restrict internet access to only those sites essential for annotation projects. Enforce two-factor authentication for accessing computers.
- Complying with industry standards and regulations. Examples include GDPR (General Data Protection Regulation), which sets standards for the protection of personal data in the European Union; CCPA (California Consumer Privacy Act); ISO 27001, an international standard for information security management; SOC 2 (Service Organization Control Type 2), a cybersecurity framework for service organizations; and HIPAA (Health Insurance Portability and Accountability Act), a US law that sets standards for the protection of health information.
- Applying data anonymization. This technique is often used in data annotation projects, especially when dealing with sensitive data. It consists of removing or altering personal identifying information such as names, addresses, or social security numbers from a dataset. By removing private information, companies can stay compliant with data protection regulations and mitigate risks.
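As a rough sketch of the anonymization step above, identifier patterns can be masked before data reaches annotators. This is a minimal, assumption-laden illustration (the `PATTERNS` table and placeholder tokens are invented for the example); real projects typically layer model-based PII detection on top of pattern matching like this.

```python
import re

# Hypothetical masking rules: each regex maps a common identifier format
# to a placeholder token. These patterns are illustrative, not exhaustive.
PATTERNS = {
    "[SSN]": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "[EMAIL]": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "[PHONE]": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def anonymize(text):
    """Replace each matched identifier with its placeholder token."""
    for placeholder, pattern in PATTERNS.items():
        text = pattern.sub(placeholder, text)
    return text

clean = anonymize("Contact jane.doe@example.com, SSN 123-45-6789.")
```

Because the placeholders preserve sentence structure, the anonymized text can still be annotated for tasks such as intent or sentiment without exposing the underlying personal data.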