Do You Know Where Your Training Data Is?

It’s 10pm.
Do YOU know where
your training data is?

Much like the kids in the television and radio public service announcements of the 1960s, ’70s and ’80s, training data often needs adult supervision. Those PSAs were directed at parents to promote more responsibility and accountability for kids who were out after dark, when the risk of trouble increased. Similarly, a company always needs to keep tabs on where its training data has been, what it’s involved in now, and where it is going. We want that data safe and want to prevent it from being corrupted. And we certainly don’t want it involved in a crime.

A lot of effort goes into network protection and preventing cyber-attacks. However, regulators are becoming just as concerned with how companies use the data that consumers share with them. Consumers, to varying degrees, trust companies not to use their data in ways they haven’t been made aware of and agreed to. But consumers also trust the laws around the fair use of data, particularly data that contains personally identifiable information (PII). Regulators police and enforce these laws to ensure that data is used only for the purposes legally agreed to. But why is this becoming an issue now, instead of ten years ago?

Artificial Intelligence systems normally need a very high volume of data to operate effectively. What has changed in the last ten years is the proliferation of sensors.

Your smartphone is an obvious one, of course. But there are now sensors that evaluate all sorts of activities and conditions because data networks can support them. Those sensors can talk to a computer in real time, and they range from cameras monitoring the flow of crowds between innings at a baseball game to microphones that help determine the best person to talk to you online or through a call center. At the same time, these sensors have become inexpensive, so a security system with hundreds or thousands of sensors can still be cost-effective. And with computer memory costing a fraction of what it did only 5 years ago and processor speeds continuing to increase, more data can be crunched in less time, creating cost-effective solutions that wouldn’t have been possible even 7 or 8 years ago.

Companies can get into trouble when physically tracking the data. The harm does not come only from regulatory agencies; it is also a public relations disaster. With social media at the forefront of influencing consumer opinion, the loss of trust and the bad press that follow when data falls into the wrong hands (through cyber-crime) or is misused (in violation of the Terms of Service) can cause far more damage. Whether it is simple mischief affecting only a few people or major harm affecting thousands, future revenue will likely take a hit when word gets out and the company’s reputation is in tatters.

That’s why data privacy and data security are cornerstones of Sigma’s approach to data annotation. We’ve established our own policies on the protection of data, and we also consult with clients on improving the security and privacy of their data annotation projects. Contact us to learn more!


Sigma offers tailor-made solutions for data teams annotating large volumes of training data.