What is synthetic data? Types, challenges, and benefits

One of the challenges that artificial intelligence (AI) project teams face is creating datasets that fully represent a domain. Original data from all parts of a domain isn’t always readily accessible, and the data a model needs to produce desirable outcomes may not exist at all. Many teams are finding that synthetic data is a practical solution.

Synthetic data is designed to address these shortcomings. Gartner predicts that 60% of the data used for AI and analytics projects will be synthetically generated by 2024. Furthermore, the Gartner study concluded that synthetic data will overshadow real data in AI datasets by 2030.

What are the challenges of original data?

To understand why using synthetic data to train AI models is increasing, you need to consider the challenges that using original data presents:

Costs. Collecting or acquiring data, then storing, cleaning, formatting, and labeling it to train an AI model, takes time, resources, and investment. Moreover, those costs mount for projects that require frequent training.

Scale. Adequate volumes of data, or data that accurately portrays the situations the model will encounter in the real world, aren’t always available or would be impractical to collect.

Compliance. Data scientists and engineers must be careful when working with categories of data protected by regulation, such as consumer and healthcare data. Businesses can’t risk situations that could result in a data breach.

Bias. Generating a comprehensive dataset that complies with privacy regulations and represents your population of interest may be challenging with the original data sources available.

What is synthetic data?

Synthetic data is generated by a computer algorithm or simulation rather than collected from the real world. Artificial data is not “real” data. However, AI project teams can use it to train an AI model, often more quickly, thoroughly, and cost-effectively.

Synthetic datasets are compiled purposefully to reflect the domain accurately via statistical properties and patterns of original data. Additionally, synthetic data generation can control the specificity of class separation to suit the use case and even include random noise to prepare the AI solution for deployment in the real world.
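As a rough illustration of controlling class separation and noise, here is a minimal sketch using only Python's standard library. The function name make_synthetic_classes and its parameters (class_sep, noise_rate) are hypothetical names chosen for this example, and the single Gaussian feature is an assumption made to keep the sketch small.

```python
import random

def make_synthetic_classes(n_per_class, class_sep, noise_rate, seed=0):
    """Generate a two-class dataset with one numeric feature.

    class_sep  -- distance between the two class means (larger = easier task)
    noise_rate -- probability that a label is flipped, to mimic messy real data
    """
    rng = random.Random(seed)
    rows = []
    for label in (0, 1):
        mean = label * class_sep
        for _ in range(n_per_class):
            feature = rng.gauss(mean, 1.0)
            # Flip the label with probability noise_rate.
            noisy_label = 1 - label if rng.random() < noise_rate else label
            rows.append((feature, noisy_label))
    return rows

# Well-separated classes with clean labels.
easy = make_synthetic_classes(500, class_sep=6.0, noise_rate=0.0)
# Overlapping classes with roughly 10% of labels flipped.
hard = make_synthetic_classes(500, class_sep=0.5, noise_rate=0.1)
```

Raising class_sep makes the classes easier to tell apart, while raising noise_rate exposes the model to the kind of mislabeled examples it may meet after deployment.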

Types of synthetic data

Synthetic datasets, whether media (video, image, audio), text, or tabular, can be categorized in three general ways:

Fully synthetic datasets. This type includes only data generated by a computer program and does not contain any original data.

Partially synthetic datasets. If an AI model requires training with healthcare, consumer, financial, or other types of personally identifiable or protected information, data scientists or engineers can anonymize the data. These datasets contain real-world data with sensitive elements of the data removed or replaced.

Hybrid synthetic datasets. In cases where more data than is available is required to build out a dataset, data scientists or engineers may turn to data augmentation. This approach expands the amount of data by generating new data points, often based on existing data. Unlike a fully synthetic dataset, a hybrid dataset contains both original and computer-generated data.
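The partially synthetic and hybrid approaches can be sketched in a few lines of Python. This is a simplified illustration, not a production anonymization method: the records, field names, and both helper functions are hypothetical, and replacing direct identifiers alone does not guarantee that a record cannot be re-identified from its remaining fields.

```python
import random

# Hypothetical customer records; every value is made up for illustration.
records = [
    {"name": "Alice Smith", "account": "ACC-1001", "age": 54, "balance": 1200.0},
    {"name": "Bob Jones", "account": "ACC-1002", "age": 61, "balance": 830.5},
]

def make_partially_synthetic(rows, sensitive_fields, seed=0):
    """Replace sensitive fields with random tokens; keep everything else."""
    rng = random.Random(seed)
    output = []
    for row in rows:
        new_row = dict(row)  # copy so the original record is untouched
        for field in sensitive_fields:
            new_row[field] = f"{field}-{rng.randrange(10**8):08d}"
        output.append(new_row)
    return output

def make_hybrid(rows, numeric_fields, copies_per_row, jitter_fraction, seed=0):
    """Keep the input rows and append jittered copies of each one."""
    rng = random.Random(seed)
    generated = []
    for row in rows:
        for _ in range(copies_per_row):
            new_row = dict(row)
            for field in numeric_fields:
                # Scale each numeric value by a random factor close to 1.
                new_row[field] = row[field] * (1 + rng.gauss(0.0, jitter_fraction))
            generated.append(new_row)
    return list(rows) + generated

masked = make_partially_synthetic(records, sensitive_fields=("name", "account"))
hybrid = make_hybrid(masked, numeric_fields=("age", "balance"),
                     copies_per_row=3, jitter_fraction=0.05)
```

Here masked keeps the real ages and balances but drops the real names and account numbers, while hybrid contains the two masked rows plus six generated variations of them.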

Use cases & specific industries

Synthetic data generation tools can produce datasets to train an AI model to solve problems, recognize anomalies, and handle other “what if” scenarios, all without using personally identifiable information. Synthetic data helps overcome AI model training challenges in a wide range of industries, including:

Banking and financial services

Businesses can use anonymous synthetic data derived from original data to correct biases in credit issuance or other services. They can also use it to build robust datasets that help banks and lenders identify and stop fraud.

Retail and e-commerce

Synthetic data can protect consumer information while still providing retailers with datasets to train models to deliver hyper-personalized messaging and effective, targeted marketing.

Autonomous vehicles and robotics

Emerging technologies haven’t produced enough data in the real world to train AI models for innovations in the areas of robots and autonomous vehicles. Synthetic data creation overcomes this challenge by enabling data scientists and engineers to produce datasets to use during research and development.

Healthcare

Anonymizing healthcare datasets provides AI project teams with petabytes of valuable data that show complex clinical relationships while not revealing patient identities and maintaining compliance with HIPAA or other protected health information regulations.

Manufacturing

Training AI models for manufacturing typically requires tens or hundreds of thousands of images. Synthetic data gives manufacturers a way to create datasets more quickly and cost-effectively. Moreover, it doesn’t require a second step of labeling as original data does. Synthetic data is generated with labels tailor-made for the model and the use case.

Cybersecurity

Security professionals can use synthetic data to train AI models that test software for vulnerabilities.

How is synthetic data created?

One question that arises is whether synthetic data can be considered “real” data. It is, of course, data. However, its source is different from that of original data. It isn’t collected from a customer relationship management (CRM) platform, an electronic health records (EHR) platform, or a supervisory control and data acquisition (SCADA) system in a manufacturing process. It’s generated by an algorithm or a simulation.

In general, there are a few strategies for synthetic data generation:

Tabular synthetic data

This data can be generated by analyzing transactions, behaviors, or histories, creating models, and storing data in rows and columns in a table.

Numbers from a distribution

Data scientists use statistical information from original data to generate synthetic data. This category can also include using generative models.
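A minimal sketch of this idea, using only Python's standard library: estimate the mean and standard deviation of a small original sample, then draw synthetic values from a normal distribution with those parameters. The sample values are invented for illustration, and the choice of a normal distribution is an assumption; real data often calls for a different distribution or a generative model.

```python
import random
import statistics

# Hypothetical original measurements, e.g. transaction amounts.
original_amounts = [12.5, 15.0, 9.75, 20.0, 14.25, 11.0, 18.5, 13.0]

# Summarize the original data with two statistics.
mu = statistics.mean(original_amounts)
sigma = statistics.stdev(original_amounts)

# Draw synthetic values from a normal distribution with the same parameters.
rng = random.Random(42)
synthetic_amounts = [rng.gauss(mu, sigma) for _ in range(1000)]
```

The synthetic values share the original sample's average and spread without copying any individual original value. A normal distribution can also produce values the real domain never would, such as negative amounts, which is one reason validation matters.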

Agent-based modeling

Engineers create a model to explain a real-world behavior, then use the model to generate synthetic data.
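As a toy illustration of agent-based modeling, the sketch below simulates shoppers who each decide, day by day, whether to make a purchase. The Shopper class, the simulate function, and all probabilities and amount ranges are hypothetical choices made for this example rather than values taken from any real system.

```python
import random

class Shopper:
    """Hypothetical agent: each day it may make one purchase."""

    def __init__(self, shopper_id, buy_probability, rng):
        self.shopper_id = shopper_id
        self.buy_probability = buy_probability
        self.rng = rng

    def step(self, day):
        # The agent's behavior rule: buy with a fixed personal probability.
        if self.rng.random() < self.buy_probability:
            amount = round(self.rng.uniform(5.0, 100.0), 2)
            return {"day": day, "shopper_id": self.shopper_id, "amount": amount}
        return None

def simulate(num_shoppers, num_days, seed=0):
    """Run the simulation and return the purchase events as synthetic data."""
    rng = random.Random(seed)
    shoppers = [Shopper(i, rng.uniform(0.05, 0.5), rng) for i in range(num_shoppers)]
    events = []
    for day in range(num_days):
        for shopper in shoppers:
            event = shopper.step(day)
            if event is not None:
                events.append(event)
    return events

purchases = simulate(num_shoppers=50, num_days=30)
```

Each event in purchases is a synthetic transaction record. Changing the behavior rule in step changes the patterns that appear in the generated dataset.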

Deep learning

Variational autoencoders (VAEs) and generative adversarial networks (GANs) fuel synthetic data generation. A GAN also trains a discriminator that learns to tell plausible data from implausible data.

However, because different AI models require training with different data types, creating synthetic data can involve various processes.

For example:

Audio

Synthetic audio generation tools draw on natural language generation (NLG) and biometric features, including a range of vocal tones and accents. They also leverage x-vectors, which are representations of variable-length speech segments, and text-to-speech (TTS).

Text

Tools used to generate synthetic text data include transformer-based natural language processing (NLP) models, such as Bidirectional Encoder Representations from Transformers (BERT), Generative Pre-trained Transformer 2 (GPT-2), and their derivatives.

Image

Techniques for image data generation include GANs and conditional GANs, convolutional neural networks (CNNs) and transformers, and generators that are capable of encoding images into a latent space.

What are the benefits of synthetic data?

Using synthetic data solves many of the challenges of original data:

Costs

Creating a dataset from synthetic data can substantially reduce the total cost of ownership (TCO) of AI projects. Synthetic data is typically created with ground truth labels, eliminating the costs of annotation, and it avoids much of the cost of collecting, storing, and formatting original data. Additionally, when data is required for product testing and original data doesn’t yet exist, generating synthetic data is much more cost-effective than building full-scale prototypes or devising other ways to produce data.

Scale

When data acquisition from real-world sources is difficult or impossible, synthetic data generation can fill those gaps. AI project teams can generate datasets more quickly, even if large volumes of data are required.

Compliance

When generated correctly, synthetic data can greatly reduce the chance that a person working with the data could link it to a specific individual, which lowers the risk of violating privacy laws or causing a data breach.

Bias

Although bias is always a risk, AI project teams can mold datasets to address criteria such as fairness to consumers based on race, ethnicity, or gender — even if fairness is not typically reflected in the real world.

What are the biggest challenges of creating synthetic data?

Even though synthetic data generation solves many of the pain points of creating AI datasets, there are also some hurdles for AI teams to overcome:

Validity

Data scientists and engineers must carefully generate and validate synthetic data to ensure it represents real-world conditions and includes enough variability. Creating synthetic data that is both fair and accurate requires a thoughtful and thorough approach during model creation and testing.
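One simple validity check is to compare the distribution of a synthetic feature with the distribution of the same feature in the original data. The sketch below computes the two-sample Kolmogorov-Smirnov statistic in plain Python. The function name ks_statistic and the three sample datasets are assumptions for illustration, and a single statistic on one feature is only a starting point, not a full validation process.

```python
import bisect
import random

def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical cumulative distribution functions.

    0.0 means the samples have identical distributions; 1.0 means no overlap.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    for x in a + b:
        # Fraction of each sample that is less than or equal to x.
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

rng = random.Random(7)
real = [rng.gauss(0.0, 1.0) for _ in range(500)]       # stand-in for original data
faithful = [rng.gauss(0.0, 1.0) for _ in range(500)]   # synthetic, same distribution
shifted = [rng.gauss(2.0, 1.0) for _ in range(500)]    # synthetic, wrong mean

faithful_gap = ks_statistic(real, faithful)
shifted_gap = ks_statistic(real, shifted)
```

A small gap suggests the synthetic feature follows the original distribution, while a large gap, as with the shifted sample here, signals that the generator needs adjusting.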

Security

If synthetic data is based on a dataset containing sensitive information, it’s vital to ensure that an attacker cannot match the synthetic data with specific people or accounts.
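One basic safeguard is to check whether any synthetic record sits suspiciously close to a real one, since a near-copy could reveal the original. The sketch below flags such rows using Euclidean distance. The function names, the example rows, and the min_distance threshold are all hypothetical, and passing this check is not a privacy guarantee on its own.

```python
import math

def distance_to_closest_record(synthetic_row, real_rows):
    """Euclidean distance from one synthetic row to its nearest real row."""
    return min(math.dist(synthetic_row, real_row) for real_row in real_rows)

def flag_risky_rows(synthetic_rows, real_rows, min_distance):
    """Return synthetic rows that are closer than min_distance to a real row."""
    return [
        row for row in synthetic_rows
        if distance_to_closest_record(row, real_rows) < min_distance
    ]

# Hypothetical (age, income) rows, invented for illustration.
real_rows = [(30.0, 50000.0), (45.0, 82000.0)]
synthetic_rows = [(30.0, 50000.0), (38.0, 61000.0)]

risky = flag_risky_rows(synthetic_rows, real_rows, min_distance=1.0)
```

In this example the first synthetic row is an exact copy of a real record and is flagged. In practice, features would usually be scaled to comparable ranges first so that one large-valued column does not dominate the distance.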

Skills, tools, and expertise

One of the biggest challenges businesses and organizations face is finding data scientists and engineers with the training, skills, and expertise to generate synthetic data to create effective datasets for AI model training.

AI teams that determine synthetic data will bring value to their projects need to weigh their options. They should decide whether trying to build resources and skills in-house or outsourcing to a provider specializing in synthetic data generation is the best course for their organizations.

The future of data: The rise of artificial datasets

Analysts — and investors — are predicting that the majority of AI datasets will consist of synthetic data within the next few years and adoption will continue to grow. The benefits of using synthetic data have captured the attention of data scientists and business leaders across industry segments, offering them viable paths to faster, more cost-effective, and possibly more effective AI deployments.

As with any IT initiative, generating synthetic data takes skill, expertise, and the right approach. However, this doesn’t have to stand in the way. Sigma has the experience, knowledge, and tools to generate synthetic data.

Synthetic data is a powerful tool for scaling, but only when it avoids inheriting or amplifying bias. Protect your models and ensure fairness by reading our guide on Preventing AI bias: How to ensure fairness in data annotation. You must also adopt the operational frameworks necessary to validate this new type of data by understanding Why gen AI quality requires rethinking human annotation standards.

Talk to a Sigma expert about building synthetic datasets that scale safely and ethically.
