One of the challenges that artificial intelligence (AI) project teams face is how to create datasets that fully represent a domain. Original data from all parts of a domain sometimes isn’t readily accessible. Additionally, the data required for the model to produce desirable outcomes may not exist at all. Many teams are discovering that the best solution is synthetic data.
Synthetic data is designed to address these shortcomings. Gartner predicts that 60% of data will be synthetically generated for AI and analytics projects by 2024. Furthermore, the Gartner study concluded that synthetic data will overshadow real data use in AI datasets by 2030.
What Are the Challenges of Original Data?
To understand why using synthetic data to train AI models is increasing, you need to consider the challenges that using original data presents:
Collecting or acquiring data, then storing, cleaning, formatting, and labeling it to train an AI model takes time, resources, and investment. Moreover, those costs mount for projects that require frequent retraining.
Adequate volumes of data or data that accurately portrays situations the model would encounter in the real world aren’t always available or would be impractical to collect.
Data scientists and engineers must be careful when working with categories of data protected by regulation, such as consumer and healthcare data. Businesses can’t risk situations that could result in a data breach.
Generating a comprehensive dataset that complies with privacy regulations and represents your population of interest may be challenging with the original data sources available.
What Is Synthetic Data?
Synthetic data is generated by a computer algorithm or simulation rather than collected from the real world. Synthetic data is not “real” data. However, AI project teams can use it to train an AI model, often more quickly, thoroughly, and cost-effectively.
Synthetic datasets are compiled purposefully to reflect the domain accurately via statistical properties and patterns of original data. Additionally, synthetic data generation can control the specificity of class separation to suit the use case and even include random noise to prepare the AI solution for deployment in the real world.
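The idea of matching the statistical properties of original data, while optionally injecting noise, can be sketched in a few lines. This is a deliberately minimal illustration using only a normal distribution; the function name and parameters are hypothetical, and real generators model far richer structure:

```python
import random
import statistics

def synthesize(original, n, noise_scale=0.1, seed=42):
    """Generate n synthetic values matching the mean and standard
    deviation of the original sample, with optional extra noise
    to prepare a model for messy real-world inputs."""
    rng = random.Random(seed)
    mu = statistics.mean(original)
    sigma = statistics.stdev(original)
    # Widen the spread slightly so the model sees some random noise.
    return [rng.gauss(mu, sigma * (1 + noise_scale)) for _ in range(n)]

original = [21.0, 23.5, 22.1, 24.8, 20.9, 23.0]
synthetic = synthesize(original, n=1000)
print(round(statistics.mean(synthetic), 1))  # close to the original mean
```

The synthetic sample preserves the summary statistics of the original without reproducing any individual original value.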
Types of Synthetic Data
Synthetic datasets, whether media (video, image, audio), text, or tabular, can be categorized in three general ways:
Fully Synthetic Datasets
This type includes only data generated by a computer program and does not contain any original data.
Partially Synthetic Datasets
If an AI model requires training with healthcare, consumer, financial, or other types of personally identifiable or protected information, data scientists or engineers can anonymize the data. These datasets contain real-world data with sensitive elements of the data removed or replaced.
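A partially synthetic record keeps real-world structure while stripping or masking sensitive elements. The sketch below shows three common moves on an illustrative record (the field names and salt are made up, not a standard schema): pseudonymizing a direct identifier, generalizing a quasi-identifier, and dropping free text outright.

```python
import hashlib

def anonymize(record, salt="demo-salt"):
    """Replace direct identifiers and coarsen quasi-identifiers.
    Field names here are illustrative, not a standard schema."""
    out = dict(record)
    # Pseudonymize the name with a salted hash (not reversible without the salt).
    out["name"] = hashlib.sha256((salt + record["name"]).encode()).hexdigest()[:12]
    # Generalize the exact age into a 10-year band.
    low = record["age"] // 10 * 10
    out["age"] = f"{low}-{low + 9}"
    # Drop the free-text note entirely; free text is hard to sanitize reliably.
    out.pop("note", None)
    return out

patient = {"name": "Jane Doe", "age": 47, "note": "..."}
anon = anonymize(patient)
print(anon["age"])  # "40-49"
```

The resulting dataset still supports age-band analysis while the sensitive elements are removed or replaced.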
Hybrid Synthetic Datasets
In cases where more data is required than is available, data scientists or engineers may turn to data augmentation. This approach expands the amount of data by generating new data points, often based on existing data. Unlike fully synthetic data, this technique combines original and computer-generated data.
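A simple form of augmentation for numeric tabular data is adding jittered copies of real rows, so the expanded dataset mixes original and computer-generated points. This is a minimal sketch with made-up data; real augmentation pipelines use domain-appropriate transformations:

```python
import random

def augment(rows, factor=3, jitter=0.05, seed=0):
    """Expand a numeric dataset by appending jittered copies of real rows.
    The result mixes original and generated points (a hybrid dataset)."""
    rng = random.Random(seed)
    out = list(rows)  # keep the originals first
    for _ in range(factor - 1):
        for row in rows:
            # Perturb each value by up to +/- jitter (5% by default).
            out.append([v * (1 + rng.uniform(-jitter, jitter)) for v in row])
    return out

real = [[1.0, 2.0], [3.0, 4.0]]
hybrid = augment(real)
print(len(hybrid))  # 6 rows: 2 originals plus 4 generated
```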
Which Industries Use Synthetic Data?
Synthetic data generation tools can produce datasets to train an AI model to solve problems, recognize anomalies and other “what if” scenarios, or produce datasets without using personally identifiable information. Synthetic data overcomes AI model training obstacles in a wide range of industries, including:
Banking and Financial Services
Businesses can use anonymized synthetic data, modeled on original data, to correct biases in credit issuance or other services. Synthetic data can also build robust datasets that help banks and lenders identify and stop fraud.
Retail and E-Commerce
Synthetic data can protect consumer information while still providing retailers with datasets to train models to deliver hyper-personalized messaging and effective, targeted marketing.
Autonomous Vehicles and Robotics
Emerging technologies haven’t produced enough data in the real world to train AI models for innovations in robotics and autonomous vehicles. Synthetic data creation overcomes this challenge by enabling data scientists and engineers to produce datasets to use during research and development.
Healthcare
Anonymized healthcare datasets provide AI project teams with petabytes of valuable data that reveal complex clinical relationships without exposing patient identities, maintaining compliance with HIPAA and other protected health information regulations.
Manufacturing
Training AI models for manufacturing typically requires tens or hundreds of thousands of images. Synthetic data gives manufacturers a way to create datasets more quickly and cost-effectively. Moreover, synthetic data doesn’t require a second labeling step as original data does; it is generated with labels tailor-made for the model and the use case.
Security
Security professionals can use synthetic data to train AI models that test software for vulnerabilities.
How Is Synthetic Data Actually Created?
One question that arises is whether synthetic data can be considered “real” data. It is, of course, data. However, the source is different from that of original data. It isn’t collected from a customer relationship management (CRM) system, an electronic health records (EHR) platform, or a supervisory control and data acquisition (SCADA) system in a manufacturing process. It’s generated by an algorithm or simulation.
In general, there are a few strategies for synthetic data generation:
Tabular synthetic data
This data can be generated by analyzing real transactions, behaviors, or histories, modeling their statistical structure, and producing new rows and columns in a table.
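A minimal per-column approach to tabular generation fits each column independently and samples new rows: categorical columns by observed frequency, numeric columns by a fitted normal. The data and function name below are illustrative; real tools also model correlations between columns, which this sketch deliberately ignores.

```python
import random
from collections import Counter

def fit_and_sample(table, n, seed=1):
    """Sample n synthetic rows: categorical columns by observed frequency,
    numeric columns from a fitted normal. Ignores cross-column correlation."""
    rng = random.Random(seed)
    synth_cols = []
    for col in zip(*table):
        if isinstance(col[0], str):
            # Categorical: sample in proportion to observed counts.
            values, counts = zip(*Counter(col).items())
            synth_cols.append(rng.choices(values, weights=counts, k=n))
        else:
            # Numeric: fit mean and (population) standard deviation, then sample.
            mu = sum(col) / len(col)
            sd = (sum((x - mu) ** 2 for x in col) / len(col)) ** 0.5
            synth_cols.append([rng.gauss(mu, sd) for _ in range(n)])
    return [list(row) for row in zip(*synth_cols)]

real = [["gold", 120.0], ["basic", 35.0], ["basic", 40.0], ["gold", 110.0]]
synthetic_rows = fit_and_sample(real, n=5)
print(len(synthetic_rows))  # 5 synthetic rows
```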
Numbers from a distribution
Data scientists use statistical information from original data to generate synthetic data. This category can also include using generative models.
Agent-based modeling
Engineers create a model to explain a real-world behavior, then use the model to generate synthetic data.
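This model-then-generate pattern can be illustrated with a toy simulation. The scenario below is entirely invented: simulated shoppers browse for a random time and buy with a probability that rises with browsing time, producing fully synthetic (browse_time, bought) records.

```python
import random

def simulate_shoppers(n_customers=500, seed=7):
    """Toy behavioral model: each simulated shopper browses for a random
    time (0-30 minutes), then buys with a probability that rises with
    browsing time. Every emitted record is fully synthetic."""
    rng = random.Random(seed)
    records = []
    for _ in range(n_customers):
        browse = rng.uniform(0, 30)            # minutes spent browsing
        p_buy = min(0.9, 0.1 + browse / 40)    # longer browsing -> likelier purchase
        records.append((browse, rng.random() < p_buy))
    return records

data = simulate_shoppers()
print(len(data))  # 500 synthetic observations
```

Once the model captures the believed real-world relationship, it can emit as many labeled observations as a project needs.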
Deep generative models
Variational autoencoders (VAEs) and generative adversarial networks (GANs) fuel synthetic data generation; in a GAN, a discriminator network learns to distinguish plausible from implausible data, pushing the generator to produce more realistic samples.
However, because different AI datasets require training with other data types, creating synthetic data can involve various processes.
Synthetic audio generation tools include natural language generation (NLG) and can synthesize biometric features, including a range of vocal tones and accents. Audio data generation tools also leverage x-vectors, which are fixed-length representations of variable-length speech segments, and text-to-speech (TTS).
Tools used to generate synthetic text data include transformer natural language processing (NLP) techniques, such as Bidirectional Encoder Representations from Transformers (BERT), Generative Pre-trained Transformer 2 (GPT-2), and their derivatives.
Techniques for image data generation include GANs and conditional GANs, convolutional neural networks (CNNs) and transformers, and generators that are capable of encoding images into a latent space.
What Are the Biggest Benefits of Synthetic Data?
Using synthetic data solves many of the challenges of original data:
Creating a dataset from synthetic data can substantially reduce the total cost of ownership (TCO) of AI projects. Synthetic data is typically created with ground truth labels, eliminating the costs of annotation, and it doesn’t require data collection, storage in data lakes, or formatting. Additionally, if data is required for product testing and original data doesn’t yet exist, synthetic data is much more cost-effective than building full-scale prototypes or devising other ways to generate data.
When data acquisition from real-world sources is difficult – or impossible – synthetic data generation can fill those gaps. AI project teams can generate datasets more quickly, even if large volumes of data are required.
When generated correctly, synthetic data can eliminate the risk that someone working with the data could link it back to a specific person, violating privacy laws or causing a data breach.
Although bias is always a risk, AI project teams can mold datasets to address criteria such as fairness to consumers based on race, ethnicity, or gender – even if fairness is not typically reflected in the real world.
What Are the Biggest Challenges of Creating Synthetic Data?
Even though synthetic data generation solves many of the pain points of creating AI datasets, there are also some hurdles for AI teams to overcome:
Data scientists and engineers must carefully generate and validate synthetic data to ensure it represents real-world conditions and includes enough variability. Creating synthetic data that is both fair and accurate requires a thoughtful and thorough approach during model creation and testing.
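One basic validation step is checking that a synthetic sample's summary statistics land near the real data's. The sketch below, with made-up numbers and an illustrative tolerance, compares mean and standard deviation; production validation would go much further, comparing full distributions, correlations, and downstream model performance.

```python
import statistics

def validate(real, synthetic, tolerance=0.15):
    """Rough fidelity check: the synthetic sample's mean and standard
    deviation should fall within a relative tolerance of the real ones."""
    checks = {}
    for name, fn in (("mean", statistics.mean), ("stdev", statistics.stdev)):
        r, s = fn(real), fn(synthetic)
        checks[name] = abs(r - s) <= tolerance * abs(r)
    return checks

real = [10.0, 12.0, 11.5, 9.8, 10.7, 11.2]
synthetic = [10.1, 12.2, 11.4, 9.9, 10.5, 11.1]
result = validate(real, synthetic)
print(result)
```

A failed check signals that the generator needs retuning before the dataset is used for training.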
If synthetic data is based on a dataset containing sensitive information, it’s vital to ensure that a cyberattack cannot match data with specific people or accounts.
Skills, tools, and expertise
One of the biggest challenges businesses and organizations face is finding data scientists and engineers with the training, skills, and expertise to generate synthetic data to create effective datasets for AI model training.
AI teams that determine synthetic data will bring value to their projects need to weigh their options. They should decide whether trying to build resources and skills in-house or outsourcing to a provider specializing in synthetic data generation is the best course for their organizations.
Synthetic Data Will Rise in the Next Decade
Analysts – and investors – are predicting that the majority of AI datasets will consist of synthetic data within the next few years and adoption will continue to grow. The benefits of using synthetic data have captured the attention of data scientists and business leaders across industry segments, offering them viable paths to faster, more cost-effective, and possibly more effective AI deployments.
As with any IT initiative, generating synthetic data takes skill, expertise, and the right approach. However, this doesn’t have to stand in the way. Sigma has the experience, knowledge, and tools to generate synthetic data. The Sigma team specializes in a range of synthetic data generation techniques, from voice anonymization, text-to-speech, automatic summarization, rephrasing, and answer generation to image augmentation and synthetic image generation. Our ethical, human-centric approach to AI training data relies on humans in the loop to support the creation of fairer synthetic data. See what Sigma has to offer and contact us to discuss your use case.