Data annotation for GenAI: Sigma’s upskilling strategy

GenAI is reshaping the way we work, and by some estimates it will push 40% of the global workforce to build new skills over the next few years. But the key to unlocking GenAI’s true value might lie in the abilities that make us uniquely human, like creativity, emotional intelligence, and critical thinking.

Even for companies with vast experience in AI, such as Sigma AI, GenAI presents a new frontier. Its capacity to create new, original content and ideas, while handling an ever-wider range of cognitive challenges, is uncharted territory that demands new frameworks and approaches. While keeping humans in the loop remains crucial, annotators need additional skills to achieve the highest-quality results from GenAI.


Whereas traditional AI projects require human annotators to classify text into predefined categories, determine the sentiment or emotion within a text, or locate specific objects within an image, GenAI projects require them to excel at content creation tasks that involve a higher level of creativity, logical thinking, and the capacity to grasp linguistic nuance and emotion. This shift raises several questions:

  • What specific skills do human annotators need to succeed in working with GenAI?
  • How can they learn these skills?
  • Is it possible to quantify something as abstract as creativity?
  • How does the selection of candidates for GenAI projects impact data quality?


Let’s take an insider look at Sigma AI’s upskilling strategy for building an efficient, adaptable, and innovative team to work with GenAI.

Building new skills for GenAI data annotation

High-quality data is the fuel of AI.

Traditional AI systems learn from vast amounts of labeled data. Generating this labeled data requires a crucial step: human annotation. A team of annotators with native fluency in the language(s) relevant to the project is trained to understand and follow specific guidelines, then provides the “correct answer” for every piece of data. These answers are the examples a machine uses to understand the data and make predictions.
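As a simple illustration, a labeled record for a sentiment-classification task might look like the sketch below (the field names are hypothetical, not a real project schema):

```python
# Hypothetical labeled examples for a sentiment-classification project.
# Each record pairs a piece of raw data with the "correct answer"
# provided by a human annotator.
labeled_examples = [
    {"text": "The battery lasts all day. Impressive.", "label": "positive"},
    {"text": "The app crashes every time I open it.",  "label": "negative"},
    {"text": "The package arrived on Tuesday.",        "label": "neutral"},
]
```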

GenAI, on the other hand, is typically trained first with unsupervised learning, meaning it must recognize patterns in unlabeled data on its own. From these patterns, GenAI models generate new, original ideas and concepts that resemble the training data. But there’s a catch: after this pre-training phase, GenAI systems still need human input to be fine-tuned and to improve performance in a particular domain. Human oversight is essential at this point to reduce bias, apply logical reasoning, and align models with specific use cases and domains.
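One common form this human input takes, for example in preference-based fine-tuning, is a comparison of model outputs for the same prompt. Here is a minimal sketch of such a record (the structure is an assumption for illustration, not Sigma’s internal format):

```python
# Hypothetical preference record used to fine-tune a pre-trained model.
# An annotator compares two model outputs and marks the better one;
# many such judgments steer the model toward the desired behavior.
preference_record = {
    "prompt": "Summarize the refund policy in two sentences.",
    "response_a": "Refunds are issued within 14 days of purchase. Items must be unused.",
    "response_b": "You can probably get your money back sometimes.",
    "preferred": "response_a",  # the human judgment
    "rationale": "Accurate, complete, and matches the requested length.",
}
```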

As a result, GenAI projects demand a broader skill set that goes beyond language expertise. “Since GenAI tools can generate human-like responses, annotators need to be extremely careful in their answers. We need them to be able to judge whether information is true, false, or inconclusive when there isn’t enough evidence,” says Valentina Vendola, manager at Sigma.
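For instance, a factuality-judgment item of the kind she describes could be structured like this (a hypothetical record for illustration):

```python
# Hypothetical factuality-judgment task: the annotator labels a claim
# as TRUE, FALSE, or INCONCLUSIVE given the available evidence.
judgment_item = {
    "claim": "The product ships with a two-year warranty.",
    "evidence": "Manual, warranty section: '12-month limited warranty.'",
    "label": "FALSE",  # contradicted by the evidence
    # "INCONCLUSIVE" would apply if the evidence said nothing about warranties.
}
```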

She also explained that annotators must have strong writing skills, creativity, and an analytical approach to language. They might write a text from scratch, summarize an existing one, or draw conclusions from data.

Let’s take the example of a summarization tool that needs human feedback to be fine-tuned.

While summarizing text seems simple, it demands a mix of skills: reading comprehension, critical thinking, and the ability to paraphrase, condense, and distill meaning.

To ensure consistent results, every step of the annotation process must be standardized, with detailed parameters and guidelines for annotators to follow.
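In practice, those parameters can be captured in an explicit task specification. The sketch below shows one possible shape for a summarization project (the parameter names are illustrative assumptions):

```python
# Hypothetical task specification for a summarization annotation project.
# Encoding the guidelines as explicit parameters keeps every annotator
# working to the same standard.
summarization_task = {
    "instruction": "Summarize the article in plain, neutral language.",
    "max_summary_sentences": 3,
    "must_preserve": ["key facts", "named entities", "numbers"],
    "must_avoid": ["opinions", "information not present in the source"],
    "quality_checks": ["faithfulness", "coverage", "fluency"],
}
```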

However, to ensure high-quality data for GenAI, companies should prioritize a two-step approach that starts even before staffing projects:

  • Assess GenAI-related skills within their workforce and choose the best candidates for each project.
  • Develop an upskilling program to train annotators with the specific skills required for GenAI annotation.

Assessing GenAI skills: Can we measure creativity?

With over 16 years of experience in data annotation, transcription, and translation for AI training, Sigma AI has built a qualified workforce of 30,000 annotators with specialized backgrounds in 500+ languages and dialects. Such diversity and native understanding of languages are precisely what GenAI needs to be safe and become more human.

To address these emerging challenges, Sigma AI is currently building a comprehensive system for GenAI projects, explains Antonio Hornero, Chief Operations Officer and leader of Sigma’s Annotation Group. “This involves defining the specific skills needed for these projects and developing a series of tests to assess annotators’ proficiency in these essential skills. Our goal is to match the right candidate with the right project,” he adds.

The new tests are specifically designed to assess a range of skills crucial for GenAI projects, including:

  • Reading comprehension 
  • Linguistic proficiency
  • Verbal reasoning
  • Summarization
  • Paraphrasing
  • Proofreading
  • Textual entailment recognition
  • Web search skills
  • Creativity


Designing these tests involves close collaboration between the company’s project managers and natural language processing (NLP) experts. Since many aspects of GenAI hinge on subjective skills, the NLP team is in charge of establishing a way to evaluate and score tests objectively, using existing datasets and linguistic corpora. Refining and validating these tests over time is also a part of the equation.
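To make that concrete, here is a minimal sketch of one way a summarization test could be scored objectively against a reference answer from an existing corpus, using simple word overlap. It illustrates the idea of automated, dataset-based scoring; it is not Sigma’s actual method:

```python
def overlap_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate summary and a reference.

    A crude, ROUGE-1-style proxy for content coverage: words shared
    with the reference, balanced against each text's length.
    """
    cand_words = set(candidate.lower().split())
    ref_words = set(reference.lower().split())
    shared = len(cand_words & ref_words)
    if shared == 0:
        return 0.0
    precision = shared / len(cand_words)
    recall = shared / len(ref_words)
    return 2 * precision * recall / (precision + recall)

# Score a candidate's summary against a corpus reference.
print(overlap_f1("Sales rose 10% in the second quarter",
                 "Second-quarter sales increased by 10%"))  # ~0.33
```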

“The most challenging tests to solve are those that involve creative text generation,” says Valentina. “Here, candidates have to craft a text entirely from scratch. However, creating different text formats requires distinct approaches. For instance, an essay demands a different structure and content than a summary or a fantastical story.”

Suppose we give a candidate the following content creation task: create a short story describing your morning from a cat’s perspective.

How can we assess creative abilities from their response? Here are a few insights from the Sigma AI team (a code sketch follows the list):

  • We can measure the range of words and synonyms a candidate uses, which provides insight into their language fluency.
  • Metrics tracking changes in grammatical forms, like verb tenses and nouns, can assess the ability to adapt language for different purposes.
  • We can, of course, measure and analyze grammar, spelling, and punctuation.
  • Finally, metrics can also assess sentence structure complexity, revealing a candidate’s ability to express ideas effectively.
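Here is a minimal sketch of how the first and last of these signals (word variety and sentence structure) could be computed automatically; real scoring would add grammar, spelling, and part-of-speech analysis on top:

```python
import re
from statistics import mean

def creativity_metrics(text: str) -> dict:
    """Rough, automatable proxies for two of the signals listed above."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        # Type-token ratio: share of distinct words (language-fluency proxy).
        "lexical_diversity": len(set(words)) / len(words),
        # Average sentence length in words (structural-complexity proxy).
        "avg_sentence_length": mean(len(s.split()) for s in sentences),
    }

print(creativity_metrics(
    "I woke before the humans. The sun was mine first, as always."
))
# {'lexical_diversity': 0.9166..., 'avg_sentence_length': 6.0}
```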

Upskilling to innovate: Preparing annotators for the challenges of GenAI

The unique nature of GenAI projects and Sigma’s commitment to exceptional data quality reveal the need for a structured approach to selecting annotators. But the process doesn’t end with selection: it also involves developing upskilling programs to continuously train annotators in GenAI-related abilities.

“If a candidate shows weaknesses in some areas, we’ll design targeted training to bridge those skill gaps. This will allow us to not only select talented individuals but also actively develop their skill set,” said Valentina.

Sigma AI’s program of selection, training, and upskilling for GenAI is long and meticulous, requiring close attention to detail. “Achieving the level of quality we strive for requires significant effort,” she said. “Not all companies are willing to invest the time and resources. In a new and evolving field like this, some may be tempted to cut corners.”

Resumes alone, for instance, can’t fully capture the specific skills needed for annotation work. They might indicate language fluency, but they don’t convey the critical thinking and reasoning abilities required for the job.

Similarly, generic business process outsourcing (BPO) for data annotation can’t match the quality of an experienced, in-house team that is continuously trained in simulated and real scenarios.

In sum, Sigma AI prioritizes high-quality data for GenAI through a comprehensive process:

  • Assess annotators’ soft skills, like critical thinking and creativity, through a series of tests. This helps identify skill gaps.
  • Based on the assessment results, provide upskilling training to bridge any skill gaps.
  • Select the most suitable candidates to staff GenAI projects, prioritizing those who demonstrate the necessary skills. This ensures quality data from the very beginning, and saves time and resources.
  • Implement ongoing assessment and provide opportunities for improvement to maintain a high standard of quality in the annotation process.


The outcome? A more accurate, reliable, and unbiased AI. In other words, a more human AI.

A skilled workforce, the secret to best-in-class GenAI data

We are just scratching the surface of GenAI’s full potential. But the key to unlocking its power lies in high-quality data provided by skilled human annotators.

At Sigma AI, we’ve tackled complex AI challenges for almost two decades, nurturing and training a dedicated team of annotators and NLP experts who can lead us into the GenAI era. Through rigorous selection and continuous upskilling, we ensure they have the critical thinking, reasoning, and creativity that GenAI projects demand. We’ve also created objective testing and assessment for typically subjective factors to achieve the highest quality standards.

Partner with Sigma AI to gain access to a versatile, skilled workforce that adapts to your specific GenAI project needs. Contact us today!
