The Intricacy of Assessing Data Quality

Every AI professional agrees that the quality of training and testing data is of the utmost importance in machine learning. The better the quality of data, the better the performance of the machine learning algorithms. However, there is no common understanding of what exactly data quality means and, therefore, there is no common standard to measure it.

An important point to keep in mind is that, in machine learning, data quality must be interpreted in relation to the targeted goals.

For example, if the goal were to develop a speech recognizer for noisy environments, we would need a training dataset with ideally the same type and level of noise as in the real working conditions. In this use case, therefore, high-quality data would consist of properly annotated noisy speech data. Not surprisingly, however, a person unfamiliar with machine learning would think the quality of that dataset is low.

The other fundamental point of data quality is its multidimensionality.

The Concept of Data Quality is Multidimensional

While the concept of data quality is most often associated with just the accuracy of the annotations, the quality of a training dataset has to be measured comprehensively to ensure that machine learning algorithms do exactly what they are expected to do.

Here is how we assess data quality at Sigma:

  • Accuracy of the annotations. High accuracy can be achieved by working on four areas:

– Preventive measures to stop annotation errors before they happen.

– Reactive measures to detect and fix errors and issues once they have occurred.

– Metrics to measure the level of quality. Metrics depend on the type of data (e.g., voice, images, text, biometric) and annotation type.

– Error analysis to refine the annotation guidelines, eliminate ambiguity, improve the consistency of the annotations, and, in some cases, reduce the amount of required training data.

Sigma has found that, in many cases, training databases have poor annotation accuracies, ranging from only 87 to 95%. These findings coincide with the conclusions of a recent study by MIT [1].
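One simple reactive measure is to audit a random sample of annotations and estimate the dataset's true accuracy with a confidence interval. The sketch below uses the standard Wilson score interval; the audit numbers are hypothetical:

```python
import math

def wilson_interval(correct: int, audited: int, z: float = 1.96):
    """95% Wilson score interval for the true annotation accuracy,
    estimated from a random audit of `audited` items."""
    p = correct / audited
    denom = 1 + z**2 / audited
    center = (p + z**2 / (2 * audited)) / denom
    margin = (z / denom) * math.sqrt(
        p * (1 - p) / audited + z**2 / (4 * audited**2)
    )
    return center - margin, center + margin

# Hypothetical audit: 455 of 500 sampled labels were correct (91% observed)
low, high = wilson_interval(455, 500)
print(f"observed accuracy 91.0%, 95% CI: {low:.1%} - {high:.1%}")
```

Auditing a few hundred items is usually enough to tell whether a dataset sits near the low end of the 87–95% range reported above.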

  • Volume of data. Intuitively, the more data, the better. However, time and resources are limited, so the optimal volume of data is the one that allows a given machine learning algorithm to reach a performance near the asymptote of its learning curve.
  • High consistency. Just as children would not learn colors correctly if the color examples provided by different people were inconsistent, an AI system will not learn to perform a task correctly if the data annotators have not followed the same annotation criteria.

Tools, processes, and quality assurance methodologies have to be designed, implemented, and synchronized to eliminate or reduce consistency errors as much as possible. Likewise, human annotators with the right training, skills, and mindset are needed to ensure a high level of data consistency.

Analogously, data should always be collected in the exact same way.
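A common way to quantify annotation consistency is inter-annotator agreement, for example Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch with hypothetical labels from two annotators:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators on the same
    items, corrected for the agreement expected by chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two annotators on the same 10 items
a = ["dog", "dog", "cat", "cat", "dog", "bird", "cat", "dog", "bird", "cat"]
b = ["dog", "cat", "cat", "cat", "dog", "bird", "cat", "dog", "dog", "cat"]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # well below 1.0: guidelines need refining
```

A kappa close to 1.0 indicates consistent annotation criteria; lower values are a signal to refine the guidelines, as discussed under error analysis above.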

  • Domain coverage. Data has to represent the particular domain in which the resulting AI-based system is going to operate. The closer the data is to the real working conditions, the better the performance of the system.

Factors such as the type and level of background noise, echoes, reverberation, communication channel, distance to the microphone, and type of microphone need to be considered in speech databases. Factors such as illumination conditions, distance to the camera, and type of camera need to be considered in image and video databases.

  • Users’ coverage. This is an important factor when the AI system has to interact with or model people. In this case, the data has to represent the end users and/or the population modeled by the AI system. Demographic characteristics (e.g., age group, gender, or level of education), race, language, speech rate, accent, and collaborative versus non-collaborative users are examples of user characteristics that need to be well represented in the training data.
  • Balance. Unquestionably, artificial intelligence is not affected by fatigue, friendship, emotion, sickness, or any other human trait. However, an AI-based system can be discriminatory or make mistakes if its training data is biased and does not represent reality in a balanced way.
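Both users' coverage and balance can be checked by comparing the share of each group in the training data against its share in the target population. A minimal sketch, with hypothetical age-group shares:

```python
from collections import Counter

def coverage_report(train_values, target_shares):
    """For each group, compare its share in the training data with the
    share it should have in the target population."""
    counts = Counter(train_values)
    total = sum(counts.values())
    report = {}
    for group, target in target_shares.items():
        actual = counts.get(group, 0) / total
        report[group] = (actual, target, actual - target)
    return report

# Hypothetical target population vs. a skewed training set of 100 speakers
target = {"18-30": 0.30, "31-50": 0.40, "51+": 0.30}
train = ["18-30"] * 55 + ["31-50"] * 35 + ["51+"] * 10
for group, (actual, want, gap) in coverage_report(train, target).items():
    print(f"{group}: {actual:.0%} in data, {want:.0%} in population (gap {gap:+.0%})")
```

Large gaps flag groups that are over- or under-represented and should be rebalanced before training.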

Please contact us or refer to our whitepaper on ML-Assisted Data Annotation for more information on data quality.


[1] “Error-riddled data sets are warping our sense of how good AI really is,” MIT Technology Review, April 1, 2021.

