The Intricacy of Assessing Data Quality
Every AI professional agrees that the quality of training and testing data is of the utmost importance in machine learning. The better the quality of data, the better the performance of the machine learning algorithms. However, there is no common understanding of what exactly data quality means and, therefore, there is no common standard to measure it.
An important point that needs to be taken into account is that, in machine learning, the concept of data quality must be interpreted in relation to the targeted goals.
For example, if the goal were to develop a speech recognizer for noisy environments, we would need a training dataset with ideally the same type and level of noise as in the real working conditions. In this use case, high-quality data would therefore consist of properly annotated noisy speech data, even though a person unfamiliar with machine learning would probably consider the quality of that dataset to be low.
The other fundamental point of data quality is its multidimensionality.
The Concept of Data Quality is Multidimensional
While the concept of data quality is most often associated with just the accuracy of the annotations, the quality of a training dataset has to be measured comprehensively to ensure that machine learning algorithms do exactly what they are expected to do.
Here is how we assess data quality at Sigma:
- Accuracy of the annotations. It can be achieved by working on four areas:
– Preventive measures to keep annotation errors from happening in the first place.
– Reactive measures to detect and fix errors and issues once they have occurred.
– Metrics to measure the level of quality. Metrics depend on the type of data (e.g., voice, images, text, biometric) and annotation type.
– Error analysis to refine the annotation guidelines, eliminate ambiguity, improve the consistency of the annotations and, in some cases, reduce the amount of required training data.
Sigma has found that, in many cases, training databases have poor accuracies, ranging from only 87% to 95%. These findings coincide with the conclusions of a recent study by MIT.
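One lightweight form of error analysis along the lines described above is to tally which label pairs annotators and the adjudicated ("gold") labels disagree on most often; recurring pairs usually point at ambiguous guideline definitions rather than careless annotators. The sketch below uses hypothetical intent labels purely for illustration:

```python
from collections import Counter

def confusion_pairs(gold, predicted):
    """Tally disagreeing (gold, annotator) label pairs, most frequent first.

    Frequently confused pairs are candidates for clarification
    in the annotation guidelines.
    """
    return Counter(
        (g, p) for g, p in zip(gold, predicted) if g != p
    ).most_common()

# Hypothetical adjudicated labels vs. one annotator's labels.
gold      = ["question", "command", "command", "question", "statement", "command"]
annotator = ["question", "question", "command", "command",  "statement", "question"]

print(confusion_pairs(gold, annotator))
# The ("command", "question") pair occurs twice -> the guideline section
# separating commands from questions may need sharpening.
```

In practice the same tally, run per annotator and per label, feeds directly into guideline revisions and targeted retraining.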
- Volume of data. Intuitively, the more data, the better. However, time and resources are limited. So, the optimal volume of data is the one that allows a given machine learning algorithm to reach a performance that is near the asymptote of the performance curve.
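A simple way to operationalize "near the asymptote" is to retrain the same model on growing subsets and stop collecting once doubling the data no longer buys a meaningful score gain. The following sketch assumes a hypothetical learning curve of (size, score) measurements; the threshold value is an arbitrary illustration:

```python
def data_volume_is_enough(learning_curve, min_gain=0.02):
    """Return the first training-set size after which adding more data
    improved the evaluation score by less than `min_gain`, or None.

    `learning_curve` is a list of (num_samples, score) pairs measured
    at increasing dataset sizes.
    """
    for (size_prev, _score), (_size, score_next) in zip(learning_curve, learning_curve[1:]):
        if score_next - _score < min_gain:
            return size_prev
    return None

# Hypothetical scores from retraining on growing subsets of the data.
curve = [(1_000, 0.70), (2_000, 0.80), (4_000, 0.86), (8_000, 0.89), (16_000, 0.90)]
print(data_volume_is_enough(curve))  # 8000: the 8k -> 16k step gained only 0.01
```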
- High consistency. Just as children would not learn colors correctly if the color examples provided by different people were inconsistent, artificial intelligence will not learn to perform a task correctly if the data annotators have not followed the same annotation criteria.
Tools, processes, and quality assurance methodologies have to be designed, implemented and synchronized to eliminate or reduce, as much as possible, consistency errors. Similarly, human annotators with the right training, skills and mindsets are needed to ensure a high data consistency level.
Analogously, data should always be collected in the exact same way.
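Consistency between annotators can be quantified with chance-corrected agreement statistics such as Cohen's kappa, sketched below from its textbook definition (not a Sigma-specific metric); the two annotators' labels are hypothetical:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators.

    1.0 means perfect consistency; 0.0 means agreement no better than chance.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two annotators on the same six items.
ann_1 = ["cat", "dog", "cat", "bird", "cat", "dog"]
ann_2 = ["cat", "dog", "bird", "bird", "cat", "cat"]
print(round(cohen_kappa(ann_1, ann_2), 3))  # 0.478
```

A low kappa on a pilot batch is a signal to revise the guidelines or retrain annotators before scaling up the annotation effort.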
- Domain coverage. Data has to represent the particular domain in which the resulting AI-based system is going to operate. The closer the data is to the real working conditions, the better the performance of the system.
Factors such as the type and level of background noise, echoes, reverberation, communication channel, distance to the microphone, and type of microphone need to be considered in speech databases. Factors such as illumination conditions, distance to the camera, and type of camera need to be considered in image and video databases.
- Users’ coverage. This is an important factor when the AI system has to interact with or model people. In this case, the data has to represent the end users and/or the population modeled by the AI system. Demographic characteristics (e.g., age group, gender, or level of education), race, language, speech rate, accent, and collaborative versus non-collaborative users are examples of user characteristics that need to be well represented in the training data.
- Balance. Unlike humans, artificial intelligence is not affected by fatigue, friendship, emotion, sickness, or any other human characteristic. However, an AI-based system can still be discriminatory or make mistakes if the training data is biased and does not represent reality in a balanced way.
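Balance over a sensitive attribute can be audited by comparing observed group shares against target shares before training. The sketch below uses a hypothetical gender attribute and a 50/50 target; the tolerance is an arbitrary illustration:

```python
from collections import Counter

def check_balance(attribute_values, target_shares, tolerance=0.05):
    """Return the groups whose observed share deviates from the target
    share by more than `tolerance` (in absolute terms)."""
    n = len(attribute_values)
    counts = Counter(attribute_values)
    return {
        group: counts[group] / n
        for group, target in target_shares.items()
        if abs(counts[group] / n - target) > tolerance
    }

# Hypothetical gender attribute of 1,000 speakers vs. a 50/50 target.
speakers = ["female"] * 350 + ["male"] * 650
print(check_balance(speakers, {"female": 0.5, "male": 0.5}))
# {'female': 0.35, 'male': 0.65} -> both groups are off target; rebalance.
```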
“Error-riddled data sets are warping our sense of how good AI really is” – MIT Technology Review, April 1, 2021