AI Has an Impact — We Want to Make It a Positive One
Sigma is centered on people: the people we work with, the customers we serve, and the people who use or are affected by the artificial intelligence algorithms our training data helps build. Half of the story is how our values guide the way we work together within our teams and with our clients. The other half is how we source and annotate training data, and the ways in which this influences the resulting AI. Instilling AI with balanced and unbiased human values, and building AI that serves the greater good, is core to our mission.
While ethical AI is hard to define, we believe it begins with human purpose. A “good” AI can protect people from harm, for example when it’s used to detect and prevent illness through medical image interpretation, or improve quality of life by providing accurate translations and thus easier access to important information. But ethical AI doesn’t end with purpose. Biases easily creep into AI systems that aren’t trained with quality data: when a dataset is too small or unbalanced, when guidelines are ill-defined, or when annotations are inconsistent. We strive to strike the optimal balance between tech-supported processes and human judgment at crucial points in the annotation process, producing the highest-quality training data while avoiding bias.
Safe, accurate and unbiased AI systems depend on large volumes of high-quality training data. Data-related challenges are often cited as a top reason why some projects don’t deliver on expectations, or why some companies refrain from using AI.
The need to annotate ever larger amounts of data in ever shorter time frames is pushing many companies to look into automating parts of the data annotation process. The holy grail of data annotation is to perform the entire process automatically, but for many types of data this isn’t possible or sensible. Consequently, the strategy companies often adopt is to automate repetitive processes that involve a large amount of data manipulation: tasks that would be difficult for humans to perform efficiently.
For complex annotation domains, automation has its limits. This can be due to limited data, outliers, the limited perceptual or reasoning abilities of a machine for a given task, or missing context that’s available only to humans. This is where human expertise, the “human-in-the-loop”, can make the biggest difference.
At Sigma, one of our main goals is to apply technologies like machine learning to problems that are challenging for humans, and to seek human input at the points in the data pipeline where a machine lacks the context or nuance to make the right call. We involve human judgment at crucial points in the annotation process while using pre- and post-processing technologies to support human annotators in delivering high-quality results at maximum efficiency.
During pre-processing, simple matching technologies or machine learning can convert raw data into clean datasets with a script. Pre-processing doesn’t replace or reduce data labeling, but it can improve annotation quality and throughput. In one audio annotation case, we reduced annotation time by 20% simply by optimizing the size of the audio files. For image annotation, we reduced annotation time by 25% by using image selection technology.
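As a hedged illustration of this kind of size optimization, the sketch below splits one long recording into fixed-size chunks so that each file an annotator opens is smaller and faster to load. The flat list-of-samples representation and the chunk size are illustrative assumptions, not Sigma's actual tooling:

```python
# Minimal pre-processing sketch. Assumption: a recording is represented
# as a flat list of samples; a real pipeline would segment actual audio
# files (e.g. WAV) rather than Python lists.

def chunk_samples(samples, chunk_size):
    """Split a recording into fixed-size chunks so annotators work
    with smaller, faster-loading files."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [samples[i:i + chunk_size]
            for i in range(0, len(samples), chunk_size)]

# Example: a 10-sample "recording" split into chunks of up to 4 samples.
chunks = chunk_samples(list(range(10)), 4)
print([len(c) for c in chunks])  # → [4, 4, 2]
```

In practice the chunk boundary would be chosen to fall on silence rather than at a fixed sample count, so that no utterance is cut in half.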
Machine learning models can offer an order of magnitude improvement for many steps in the annotation pipeline by alleviating inefficiencies and freeing up human time to focus on more contextual, critical work. In addition to fully automating some annotation steps with machine learning, we can also implement semi-automation to achieve massive time savings in human annotation. Semi-automation with machine learning provides human annotators with a reasonably accurate initial version of the annotated data that they can simply correct, rather than annotate from scratch.
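One way to picture semi-automation is pre-annotation: a model proposes a label for every item, and humans only override the proposals that are wrong. The sketch below is a generic illustration under that assumption; the function name and data shapes are invented, not Sigma's pipeline:

```python
def merge_annotations(model_labels, human_corrections):
    """Start from model-proposed labels and apply human overrides.

    model_labels: dict mapping item id -> label proposed by the model
    human_corrections: dict mapping item id -> label fixed by a human
    """
    merged = dict(model_labels)       # the model pre-annotates everything
    merged.update(human_corrections)  # humans only touch what is wrong
    return merged

proposed = {"img_1": "cat", "img_2": "dog", "img_3": "cat"}
fixes = {"img_2": "fox"}              # annotator corrects one mistake
print(merge_annotations(proposed, fixes))
# → {'img_1': 'cat', 'img_2': 'fox', 'img_3': 'cat'}
```

The time saving comes from the correction rate: if the model is right 90% of the time, annotators handle one item in ten instead of all of them, plus a quick review pass over the rest.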
For customers, strategic use of machine learning in the annotation process equates to higher-quality data and greater throughput at a comparable cost point. Our team of machine learning engineers and technology experts enables Sigma to adapt to customer requirements and provide high-quality annotations at the required speed.
At the heart of our annotation services are our teams of project managers, annotators, linguists and domain experts. Each brings in their unique skill set at crucial points in the annotation process and contributes to a higher quality interpretation of the data.
We never crowdsource annotation. Crowdsourcing often yields fluctuating levels of quality and requires extensive quality assurance (QA) methods. While QA helps ensure accuracy, costs and project lead times rise quickly as QA methods become more involved and identify more issues.
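To make the QA point concrete: one widely used way to assess label quality is to measure agreement between annotators, corrected for chance, with Cohen's kappa. The sketch below is a generic pure-Python illustration with invented labels, not Sigma's QA tooling:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators on the same items, chance-corrected.

    Returns 1.0 for perfect agreement, ~0.0 for chance-level agreement.
    (Sketch only: the degenerate case where expected agreement is 1 is ignored.)
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["pos", "pos", "neg", "neg", "pos", "neg"]  # annotator 1
b = ["pos", "neg", "neg", "neg", "pos", "neg"]  # annotator 2
print(round(cohens_kappa(a, b), 2))  # → 0.67
```

With a fluctuating crowd, keeping this number high requires re-labeling and adjudication rounds, which is exactly the cost and lead-time overhead described above.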
We always curate teams to fit the specific project, selecting candidates from our extensive database of vetted, highly trained annotators. Human annotators with the right profile and experience increase the quality of annotation and can even reduce costs by increasing efficiency. Relying on specialized, trained annotators with subject matter expertise reduces training and annotation time, because they already understand the task and domain at hand, have experience with similar tasks, and require less QA.
Performing data annotation at high quality requires annotators with attention to detail, patience, and, importantly, the discipline to strictly follow the annotation guidelines. Selecting candidates with these skills and the specific profiles a project demands can be a challenge that takes considerable time. This is why Sigma has established a continuous candidate selection process with a proprietary method for assessing candidate skills, giving us an extensive database of vetted data collection and annotation candidates. Thanks to this process, Sigma can scale quickly with the most appropriate professionals for each project. Choosing suitable candidates is the best way to meet the challenge of getting the right data, with the required quality, at the right time, and in a cost-efficient manner.
Biases can creep into training data in a variety of ways. Datasets need to have coverage and balance: coverage means that the dataset includes all necessary task-related conditions, and balance means that these conditions are sufficiently represented in the dataset.
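Under these definitions, both properties can be checked mechanically. The sketch below is a simplified illustration: the condition names are invented, and the ratio-based balance measure is one assumed choice among many. It reports which required conditions are missing entirely and how evenly the present ones are represented:

```python
from collections import Counter

def coverage_and_balance(samples, required_conditions):
    """Check a dataset of condition labels for coverage and balance.

    coverage: every required condition appears at least once.
    balance : ratio of the rarest to the most common condition
              (1.0 = perfectly even; 0.0 if any condition is missing).
    """
    counts = Counter(samples)
    missing = sorted(c for c in required_conditions if counts[c] == 0)
    present = [counts[c] for c in required_conditions if counts[c] > 0]
    balance = 0.0 if missing or not present else min(present) / max(present)
    return missing, balance

required = {"native", "non-native", "accented"}
data = ["native"] * 8 + ["non-native"] * 2   # no "accented" speakers at all
missing, balance = coverage_and_balance(data, required)
print(missing, balance)  # → ['accented'] 0.0
```

A report like this, run before annotation begins, flags exactly where targeted data collection or augmentation is needed.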
Unbalanced, biased datasets can have serious real-world consequences. For example, a voice assistant trained on an unbalanced dataset can fail to correctly recognize queries from people who are non-native speakers of a language, have different accents, or have speech impediments, limiting their access to important information. Our approach to data sourcing involves being keenly aware of the ways biases can be introduced into training data, collecting data from as diverse and representative a pool as possible, and augmenting existing data where necessary.
Inclusion and diversity are also crucial for high-quality, unbiased annotation. In human-in-the-loop annotation, people apply their own judgment to establish ground truths. When annotators don’t represent the population an AI is built to serve in all of its facets, they may miss the context that’s essential to training the AI correctly.
We promote inclusion and celebrate our cultural and ethnic diversity — not just for the purpose of AI quality, but because it’s part of our core values. Our workforce comes from more than 100 countries, speaks over 300 languages and dialects, and 72% of our annotators are women. We’re committed to creating jobs, paying living wages that help reduce poverty, promoting equal opportunities, and empowering women. We provide work-from-home models to provide a source of income to individuals who live in rural and remote areas, low-income communities, and communities with high unemployment rates. We also work with local organizations to recruit annotators with disabilities. We extend heartfelt thanks to our clients for empowering us to provide a better, fairer working environment and creating jobs for those who need them.
We’re excited to continue our search for the best balance between automation technology and highly trained annotators with a wide range of expertise. We strongly believe the best annotation company is the one that can move quickly along the continuum from one extreme (fully automatic annotation) to the other (fully manual annotation by highly trained annotators) to satisfy customer requirements. Data annotation will serve AI engines’ future needs by optimally combining technology and human expertise, and by refining this combination as technology progresses.