Sigma’s Audio Annotation Services

Improve your Speech Recognition Models, Text-to-Speech Engines, NLP and IVR systems and Voice Assistants.

We support over 120 languages and dialects and guarantee 98% accuracy, with higher accuracy levels available if needed.

Our ML-assisted tools can reduce annotation time and cost significantly.

What is Audio Annotation?

It is important to differentiate between transcription and annotation.

Audio or Speech Transcription is the process of converting spoken language into written form. Transcription can be verbatim or non-verbatim. Verbatim transcription includes all filler words, false starts, truncated words or sentences, and pauses; it is a literal conversion of the conversation into written language. Non-verbatim transcription omits these elements, so it provides an easy-to-read version of what has been said in the audio file.
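
The difference between the two styles can be illustrated with a minimal Python sketch that derives a non-verbatim text from a verbatim one. The filler list, the `[pause]` tag, and the trailing hyphen that marks a truncated word are hypothetical conventions chosen for this example only; real projects define them in their transcription guidelines.

```python
# Hypothetical filler list and markup conventions, for illustration only.
FILLERS = {"um", "uh", "er", "hmm"}
PAUSE_TAG = "[pause]"


def to_non_verbatim(verbatim: str) -> str:
    """Drop fillers, pause tags, and truncated words (marked with a trailing '-')."""
    kept = []
    for token in verbatim.split():
        # Compare without surrounding punctuation and case.
        bare = token.strip(",.?!").lower()
        if token == PAUSE_TAG or bare in FILLERS or bare.endswith("-"):
            continue
        kept.append(token)
    return " ".join(kept)


print(to_non_verbatim("Um, I want to [pause] uh, book a flight to Par- Madrid."))
# prints: I want to book a flight to Madrid.
```

A rule-based filter like this only handles the simplest cases. False starts and repeated words usually require the judgment of a human transcriber.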

An annotation is any type of additional information that is added to an already existing text, be it a transcription of an audio file or an original text file.

Normally, Audio or Speech Annotation refers to both the transcription of the audio and the annotation of the resulting text. Annotations add phonological, morphological, syntactic, semantic and discourse information.

Audio or Speech Annotation usually also includes metadata: information that refers to the audio file as a whole, as opposed to individual annotations, which provide information about a portion of the data.
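
One possible way to picture the relationship between transcription, annotations and metadata is the small Python sketch below. The class names, field names and example values are assumptions made for illustration; they do not describe any particular delivery format.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Information about one portion of the audio (hypothetical schema)."""
    start_s: float  # segment start, in seconds
    end_s: float    # segment end, in seconds
    layer: str      # e.g. "speaker", "emotion", "background_noise"
    label: str


@dataclass
class AnnotatedAudio:
    """A transcript plus file-level metadata and segment-level annotations."""
    transcript: str
    metadata: dict = field(default_factory=dict)     # applies to the whole file
    annotations: list = field(default_factory=list)  # apply to portions of it

    def labels_in(self, layer: str) -> list:
        """Return the labels of all annotations on the given layer."""
        return [a.label for a in self.annotations if a.layer == layer]


record = AnnotatedAudio(
    transcript="I want to book a flight.",
    metadata={"language": "en-US", "channel": "telephone"},
    annotations=[
        Annotation(0.0, 2.1, "speaker", "customer"),
        Annotation(0.0, 2.1, "emotion", "neutral"),
    ],
)
print(record.labels_in("emotion"))  # prints: ['neutral']
```

Here the metadata describes the recording as a whole, while each annotation carries its own time span.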

Audio Annotation Services

Audio Annotation

It includes audio transcription, annotation and metadata. The types of annotations and metadata are fully tailored to the client’s needs, from phonological, morphological, syntactic, semantic and discourse information to audio segmentation, speaker identification, turn taking, emotion, background noise, and speech or music. You name it!

Audio and Video Transcription

Be it verbatim or non-verbatim, our team will provide you with best-in-class transcriptions the way you want them, when you need them, in a cost-efficient way.

Sigma offers scalable audio and video transcription services thanks to the optimal combination of our large base of vetted transcribers and our ML-assisted tools.

Speaker Diarization

It consists of partitioning the input audio file into homogeneous audio segments according to their source. Sources include each speaker whose voice is recorded in the audio file, as well as music, silence, and background noise. This makes it possible to automate the analysis of any type of conversation, including call center dialogues and debates.
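
As a rough illustration of what the output of diarization looks like, the Python sketch below merges per-frame source labels into homogeneous segments. It assumes the labels have already been assigned to fixed-length frames, by an annotator or by a model; the function name, the label names and the frame length are hypothetical.

```python
from itertools import groupby


def frames_to_segments(frame_labels, frame_s=1.0):
    """Merge consecutive frames with the same source label into segments.

    Returns a list of (start_seconds, end_seconds, label) tuples.
    """
    segments = []
    position = 0  # index of the first frame of the current run
    for label, run in groupby(frame_labels):
        length = len(list(run))
        segments.append((position * frame_s, (position + length) * frame_s, label))
        position += length
    return segments


frames = ["spk1", "spk1", "silence", "spk2", "spk2", "spk2"]
for segment in frames_to_segments(frames, frame_s=0.5):
    print(segment)
# prints:
# (0.0, 1.0, 'spk1')
# (1.0, 1.5, 'silence')
# (1.5, 3.0, 'spk2')
```

Each resulting segment contains a single source, which is what makes downstream analysis such as per-speaker statistics straightforward.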

Phonetic Transcription

A phonetic transcription is very similar to a regular transcription, but instead of converting the audio into a sequence of words, it describes the way spoken words are pronounced using phonetic symbols.

The most common alphabetic system of phonetic notation is the International Phonetic Alphabet (IPA).

Emotion Annotation

Emotion annotation aims to determine feelings such as anger, happiness, sadness, fear, or surprise. It can be performed on text or audio data. Emotion analysis on audio is generally more accurate than on text, since audio provides additional cues such as speech rate, pitch, pitch jumps, or voice intensity.

Emotion detection helps improve human-machine communication and analyze call center dialogues, among other uses.

Sentiment Annotation

It is the process of determining whether a segment of speech is perceived as positive, negative or neutral. Audio sentiment analysis is more accurate than text sentiment analysis since audio provides additional information such as the emotional state of the speakers.

It helps gauge customer opinion and monitor brand or product reputation, customer experience and needs, social media, and more.


Audio Classification

It consists of listening to the audio recording and classifying it into a series of predetermined categories.

For example, categories may describe the user intent, the background noise, the quality of the recording, the topic, the number or type of speakers, the spoken language or dialect, or semantic information.


Data Relevance

Data relevance provides information about the quality of the data that a system delivers to its users. In particular, it determines to what extent the answer of a search engine or an intelligent assistant provides insight into the user’s question; that is, the level of consistency between the content provided and the user’s area of interest.

Speech Annotation Quality Assessment

It aims to determine the accuracy of the speech annotations, including word error rate (substitutions, insertions and deletions), and label error rate according to the annotation guidelines.

It helps assess the quality of the annotated speech in terms of accuracy and consistent interpretation of the annotation guidelines. It also helps complete the guidelines and resolve their ambiguities.
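
Word error rate is conventionally computed as the minimum number of word substitutions, insertions and deletions needed to turn the reference transcript into the transcript under evaluation, divided by the number of words in the reference. The Python sketch below implements that standard definition with a word-level edit distance; the function name and the example sentences are illustrative, and text normalization (casing, punctuation) is assumed to have been done beforehand.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + insertions + deletions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("book a flight to madrid", "book flight to the madrid"))
# prints: 0.4
```

In this example one word is deleted ("a") and one is inserted ("the"), giving 2 errors over 5 reference words. Note that word error rate can exceed 1.0 when the evaluated transcript contains many insertions.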

Speech Database Quality Assessment

This quality assessment service provides information that helps optimize the effort in data collection and annotation. Quality is a multidimensional parameter that depends on factors such as the volume and quality of the audio, the accuracy of the annotations, the consistency of the data, the domain and customer coverage, and the balance of the data.

This helps focus the data collection and annotation effort where it is most needed.

Intelligent Assistants Assessment

This service measures the performance of wake word detection; assesses whether the wake word and the subsequent voice commands were spoken by the same user or by several users; checks whether the voice interactions are in the expected language; and verifies whether the assistant’s answers are correct given the dialogue status and context, the user data, and the system knowledge base.

Pronunciation Assessment

The pronunciation assessment aims to determine whether the pronunciation of a word or sentence is correct. Correctness can be assessed by comparing the pronunciation with the standard pronunciation or with dialect variants.

The pronunciation assessment can be performed on human or synthetic speech.