One of the challenges that artificial intelligence (AI) project teams face is how to create datasets that fully represent a domain. Original data from all parts of a domain sometimes isn’t readily accessible. Additionally, the data required for the model to produce desirable outcomes may not exist at all. Many teams are discovering that the best solution is synthetic data.
Synthetic data is designed to address these shortcomings. Gartner predicts that 60% of data will be synthetically generated for AI and analytics projects by 2024. Furthermore, the Gartner study concluded that synthetic data will overshadow real data use in AI datasets by 2030.
What Are the Challenges of Original Data?
To understand why using synthetic data to train AI models is increasing, you need to consider the challenges that using original data presents:
Collecting or acquiring data, then storing, cleaning, formatting, and labeling it to train an AI model takes time, resources, and investment. Moreover, those costs mount for projects that require frequent retraining.
Adequate volumes of data or data that accurately portrays situations the model would encounter in the real world aren’t always available or would be impractical to collect.
Data scientists and engineers must be careful when working with categories of data protected by regulation, such as consumer and healthcare data. Businesses can’t risk situations that could result in a data breach.
Generating a comprehensive dataset that complies with privacy regulations and represents your population of interest may be challenging with the original data sources available.
What Is Synthetic Data?
Synthetic data is generated by a computer algorithm or simulation rather than collected from the real world. Synthetic data is not “real” data. However, AI project teams can use it to train an AI model, often more quickly, thoroughly, and cost-effectively.
Synthetic datasets are compiled purposefully to reflect the domain accurately via statistical properties and patterns of original data. Additionally, synthetic data generation can control the specificity of class separation to suit the use case and even include random noise to prepare the AI solution for deployment in the real world.
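The idea of matching the statistical properties of original data, while optionally injecting noise, can be sketched in a few lines. This is a deliberately minimal illustration using only a normal distribution; the function name and parameters are hypothetical, and real generators model far richer structure:

```python
import random
import statistics

def synthesize(original, n, noise_scale=0.1, seed=42):
    """Generate n synthetic values matching the mean and standard
    deviation of the original sample, with optional extra noise
    to prepare a model for messy real-world inputs."""
    rng = random.Random(seed)
    mu = statistics.mean(original)
    sigma = statistics.stdev(original)
    # Widen the spread slightly so the model sees some random noise.
    return [rng.gauss(mu, sigma * (1 + noise_scale)) for _ in range(n)]

original = [21.0, 23.5, 22.1, 24.8, 20.9, 23.0]
synthetic = synthesize(original, n=1000)
print(round(statistics.mean(synthetic), 1))  # close to the original mean
```

The synthetic sample preserves the summary statistics of the original without reproducing any individual original value.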
Types of Synthetic Data
Synthetic datasets, whether media (video, image, audio), text, or tabular, can be categorized in three general ways:
Fully Synthetic Datasets
This type includes only data generated by a computer program and does not contain any original data.
Partially Synthetic Datasets
If an AI model requires training with healthcare, consumer, financial, or other types of personally identifiable or protected information, data scientists or engineers can anonymize the data. These datasets contain real-world data with sensitive elements of the data removed or replaced.
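A partially synthetic record keeps real-world structure while stripping or masking sensitive elements. The sketch below shows three common moves on an illustrative record (the field names and salt are made up, not a standard schema): pseudonymizing a direct identifier, generalizing a quasi-identifier, and dropping free text outright.

```python
import hashlib

def anonymize(record, salt="demo-salt"):
    """Replace direct identifiers and coarsen quasi-identifiers.
    Field names here are illustrative, not a standard schema."""
    out = dict(record)
    # Pseudonymize the name with a salted hash (not reversible without the salt).
    out["name"] = hashlib.sha256((salt + record["name"]).encode()).hexdigest()[:12]
    # Generalize the exact age into a 10-year band.
    low = record["age"] // 10 * 10
    out["age"] = f"{low}-{low + 9}"
    # Drop the free-text note entirely; free text is hard to sanitize reliably.
    out.pop("note", None)
    return out

patient = {"name": "Jane Doe", "age": 47, "note": "..."}
anon = anonymize(patient)
print(anon["age"])  # "40-49"
```

The resulting dataset still supports age-band analysis while the sensitive elements are removed or replaced.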
Hybrid Synthetic Datasets
In cases where more data is required than is available, data scientists or engineers may turn to data augmentation. This approach expands the amount of data by generating new data points, often based on existing data. Unlike fully synthetic data, this technique combines original and computer-generated data.
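A simple form of augmentation for numeric tabular data is adding jittered copies of real rows, so the expanded dataset mixes original and computer-generated points. This is a minimal sketch with made-up data; real augmentation pipelines use domain-appropriate transformations:

```python
import random

def augment(rows, factor=3, jitter=0.05, seed=0):
    """Expand a numeric dataset by appending jittered copies of real rows.
    The result mixes original and generated points (a hybrid dataset)."""
    rng = random.Random(seed)
    out = list(rows)  # keep the originals first
    for _ in range(factor - 1):
        for row in rows:
            # Perturb each value by up to +/- jitter (5% by default).
            out.append([v * (1 + rng.uniform(-jitter, jitter)) for v in row])
    return out

real = [[1.0, 2.0], [3.0, 4.0]]
hybrid = augment(real)
print(len(hybrid))  # 6 rows: 2 originals plus 4 generated
```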
Which Industries Use Synthetic Data?
Synthetic data generation tools can produce datasets to train an AI model to solve problems, recognize anomalies and other “what if” scenarios, or produce datasets without using personally identifiable information. Synthetic data overcomes AI model training obstacles in a wide range of industries, including:
Banking and Financial Services
Businesses can use anonymized synthetic data, modeled on original data, to correct biases in credit issuance or other services. Synthetic data can also build robust datasets that help banks and lenders identify and stop fraud.
Retail and E-Commerce
Synthetic data can protect consumer information while still providing retailers with datasets to train models to deliver hyper-personalized messaging and effective, targeted marketing.
Autonomous Vehicles and Robotics
Emerging technologies haven’t produced enough data in the real world to train AI models for innovations in robotics and autonomous vehicles. Synthetic data creation overcomes this challenge by enabling data scientists and engineers to produce datasets to use during research and development.
Healthcare
Anonymized healthcare datasets provide AI project teams with petabytes of valuable data that reveal complex clinical relationships without exposing patient identities, maintaining compliance with HIPAA and other protected health information regulations.
Manufacturing
Training AI models for manufacturing typically requires tens or hundreds of thousands of images. Synthetic data gives manufacturers a way to create datasets more quickly and cost-effectively. Moreover, synthetic data doesn’t require a second labeling step as original data does; it is generated with labels tailor-made for the model and the use case.
Security
Security professionals can use synthetic data to train AI models that test software for vulnerabilities.
How Is Synthetic Data Actually Created?
One question that arises is whether synthetic data can be considered “real” data. It is, of course, data. However, the source is different from that of original data. It isn’t collected from a customer relationship management (CRM) system, an electronic health records (EHR) platform, or a supervisory control and data acquisition (SCADA) system in a manufacturing process. It’s generated by an algorithm or simulation.
In general, there are a few strategies for synthetic data generation:
Tabular synthetic data
This data can be generated by analyzing real transactions, behaviors, or histories, modeling their statistical structure, and producing new rows and columns in a table.
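A minimal per-column approach to tabular generation fits each column independently and samples new rows: categorical columns by observed frequency, numeric columns by a fitted normal. The data and function name below are illustrative; real tools also model correlations between columns, which this sketch deliberately ignores.

```python
import random
from collections import Counter

def fit_and_sample(table, n, seed=1):
    """Sample n synthetic rows: categorical columns by observed frequency,
    numeric columns from a fitted normal. Ignores cross-column correlation."""
    rng = random.Random(seed)
    synth_cols = []
    for col in zip(*table):
        if isinstance(col[0], str):
            # Categorical: sample in proportion to observed counts.
            values, counts = zip(*Counter(col).items())
            synth_cols.append(rng.choices(values, weights=counts, k=n))
        else:
            # Numeric: fit mean and (population) standard deviation, then sample.
            mu = sum(col) / len(col)
            sd = (sum((x - mu) ** 2 for x in col) / len(col)) ** 0.5
            synth_cols.append([rng.gauss(mu, sd) for _ in range(n)])
    return [list(row) for row in zip(*synth_cols)]

real = [["gold", 120.0], ["basic", 35.0], ["basic", 40.0], ["gold", 110.0]]
synthetic_rows = fit_and_sample(real, n=5)
print(len(synthetic_rows))  # 5 synthetic rows
```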
Numbers from a distribution
Data scientists use statistical information from original data to generate synthetic data. This category can also include using generative models.
Agent-based modeling
Engineers create a model to explain a real-world behavior, then use the model to generate synthetic data.
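This model-then-generate pattern can be illustrated with a toy simulation. The scenario below is entirely invented: simulated shoppers browse for a random time and buy with a probability that rises with browsing time, producing fully synthetic (browse_time, bought) records.

```python
import random

def simulate_shoppers(n_customers=500, seed=7):
    """Toy behavioral model: each simulated shopper browses for a random
    time (0-30 minutes), then buys with a probability that rises with
    browsing time. Every emitted record is fully synthetic."""
    rng = random.Random(seed)
    records = []
    for _ in range(n_customers):
        browse = rng.uniform(0, 30)            # minutes spent browsing
        p_buy = min(0.9, 0.1 + browse / 40)    # longer browsing -> likelier purchase
        records.append((browse, rng.random() < p_buy))
    return records

data = simulate_shoppers()
print(len(data))  # 500 synthetic observations
```

Once the model captures the believed real-world relationship, it can emit as many labeled observations as a project needs.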
Deep generative models
Variational autoencoders (VAEs) and generative adversarial networks (GANs) fuel synthetic data generation; in a GAN, a discriminator network learns to distinguish plausible from implausible data, pushing the generator to produce more realistic samples.
However, because different AI datasets require training with other data types, creating synthetic data can involve various processes.
Synthetic audio generation tools include natural language generation (NLG) and can synthesize biometric features, including a range of vocal tones and accents. Audio data generation tools also leverage x-vectors, which are fixed-length representations of variable-length speech segments, and text-to-speech (TTS).
Tools used to generate synthetic text data include transformer natural language processing (NLP) techniques, such as Bidirectional Encoder Representations from Transformers (BERT), Generative Pre-trained Transformer 2 (GPT-2), and their derivatives.
Techniques for image data generation include GANs and conditional GANs, convolutional neural networks (CNNs) and transformers, and generators that are capable of encoding images into a latent space.
What Are the Biggest Benefits of Synthetic Data?
Using synthetic data solves many of the challenges of original data:
Creating a dataset from synthetic data can substantially reduce the total cost of ownership (TCO) of AI projects. Synthetic data is typically created with ground truth labels, eliminating the costs of annotation, and it doesn’t require data collection, storage in data lakes, or formatting. Additionally, if data is required for product testing and original data doesn’t yet exist, synthetic data is much more cost-effective than building full-scale prototypes or devising other ways to generate data.
When data acquisition from real-world sources is difficult – or impossible – synthetic data generation can fill those gaps. AI project teams can generate datasets more quickly, even if large volumes of data are required.
When generated correctly, synthetic data can eliminate the risk that someone working with the data could link it back to a specific person, violating privacy laws or causing a data breach.
Although bias is always a risk, AI project teams can mold datasets to address criteria such as fairness to consumers based on race, ethnicity, or gender – even if fairness is not typically reflected in the real world.
What Are the Biggest Challenges of Creating Synthetic Data?
Even though synthetic data generation solves many of the pain points of creating AI datasets, there are also some hurdles for AI teams to overcome:
Data scientists and engineers must carefully generate and validate synthetic data to ensure it represents real-world conditions and includes enough variability. Creating synthetic data that is both fair and accurate requires a thoughtful and thorough approach during model creation and testing.
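One basic validation step is checking that a synthetic sample's summary statistics land near the real data's. The sketch below, with made-up numbers and an illustrative tolerance, compares mean and standard deviation; production validation would go much further, comparing full distributions, correlations, and downstream model performance.

```python
import statistics

def validate(real, synthetic, tolerance=0.15):
    """Rough fidelity check: the synthetic sample's mean and standard
    deviation should fall within a relative tolerance of the real ones."""
    checks = {}
    for name, fn in (("mean", statistics.mean), ("stdev", statistics.stdev)):
        r, s = fn(real), fn(synthetic)
        checks[name] = abs(r - s) <= tolerance * abs(r)
    return checks

real = [10.0, 12.0, 11.5, 9.8, 10.7, 11.2]
synthetic = [10.1, 12.2, 11.4, 9.9, 10.5, 11.1]
result = validate(real, synthetic)
print(result)
```

A failed check signals that the generator needs retuning before the dataset is used for training.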
If synthetic data is based on a dataset containing sensitive information, it’s vital to ensure that a cyberattack cannot match data with specific people or accounts.
Skills, tools, and expertise
One of the biggest challenges businesses and organizations face is finding data scientists and engineers with the training, skills, and expertise to generate synthetic data to create effective datasets for AI model training.
AI teams that determine synthetic data will bring value to their projects need to weigh their options. They should decide whether trying to build resources and skills in-house or outsourcing to a provider specializing in synthetic data generation is the best course for their organizations.
Synthetic Data Will Rise in the Next Decade
Analysts – and investors – are predicting that the majority of AI datasets will consist of synthetic data within the next few years and adoption will continue to grow. The benefits of using synthetic data have captured the attention of data scientists and business leaders across industry segments, offering them viable paths to faster, more cost-effective, and possibly more effective AI deployments.
As with any IT initiative, generating synthetic data takes skill, expertise, and the right approach. However, this doesn’t have to stand in the way. Sigma has the experience, knowledge, and tools to generate synthetic data. The Sigma team specializes in a range of synthetic data generation techniques, from voice anonymization, text-to-speech, automatic summarization, rephrasing, and answer generation to image augmentation and synthetic image generation. Our ethical, human-centric approach to AI training data relies on humans in the loop to support the creation of fairer synthetic data. See what Sigma has to offer and contact us to discuss your use case.