Do You Know Where Your Training Data Is?

It’s 10pm.
Do YOU know where
your training data is?

Much like kids, training data often needs adult supervision. The television and radio public service announcements of the 1960s, ’70s and ’80s were directed at parents, urging more responsibility and accountability for kids who were out after dark, when the risk of trouble increased. Similarly, a company always needs to keep tabs on where its training data has been, what it’s involved in now, and where it is going. We want that data kept safe and protected from corruption. And we certainly don’t want it involved in a crime.

A lot of effort goes into network protection and preventing cyber-attacks. However, regulators are becoming just as concerned with how companies use the data that consumers share with them. Consumers, to varying degrees, trust companies not to use their data for activities they haven’t been made aware of and agreed to. Consumers also rely on the laws governing the fair use of data, particularly data that contains personally identifiable information (PII). Regulators police and enforce these laws to ensure that data is used only for the purposes legally agreed to. But why is this becoming an issue now, instead of ten years ago?

Artificial Intelligence systems normally need a very high volume of data to operate effectively. What has changed in the last ten years is the proliferation of sensors.

Your smartphone is an obvious example. But there are now sensors that evaluate all sorts of activities and conditions because data networks can support them. Those sensors can talk to a computer in real time, and they range from cameras monitoring the flow of crowds between innings at a baseball game to microphones that help determine the best person to talk to you online or through a call center. At the same time, these sensors have become inexpensive, so a security system with hundreds or thousands of sensors can still be cost-effective. And with computer memory costing a fraction of what it did only five years ago and processor speeds continuing to increase, more data can be crunched in less time, creating cost-effective solutions that wouldn’t have been possible even seven or eight years ago.

Companies can get into trouble when they fail to keep track of where their data physically is. The harm does not come only from a regulatory agency; it is also a public relations disaster. With social media at the forefront of influencing consumer opinion, the lost trust and bad press can greatly compound the damage when data falls into the wrong hands (through cyber-crime) or is misused (through a violation of the Terms of Service). Whether it is simple mischief impacting only a few people or major harm affecting thousands, future revenue will invariably take a hit when word gets out and the company’s reputation is in tatters.

That’s why data privacy and data security are a cornerstone of Sigma’s approach to data annotation. We’ve established our own policies on the protection of data, and also consult with clients on improving the security and privacy of their data annotation projects. Contact us to learn more!

