Published on March 23rd, 2021 | by Emergent Enterprise
How Synthetic Data Could Save AI
Data. So much data. But is it good and useful data? Has it been tainted by bias? Does it threaten personal security? Some of these questions are being answered by synthetic data: data that is applicable and usable but not “real.” This post by Gary Grossman at VentureBeat is an excellent overview of the rising use of synthetic data. As companies work toward building their own use cases of AI, they don’t necessarily need to acquire or gather actual data of real human beings. Synthetic data can supply all the properties they need to move forward. Synthetic data proves you don’t always need “just the facts.”
Image Credit: Creating a synthetic dataset using Gretel’s Python library
AI is facing several critical challenges. Not only does it need huge amounts of data to deliver accurate results, but it also needs to be able to ensure that data isn’t biased, and it needs to comply with increasingly restrictive data privacy regulations. We have seen several solutions proposed over the last couple of years to address these challenges — including various tools designed to identify and reduce bias, tools that anonymize user data, and programs to ensure that data is only collected with user consent. But each of these solutions is facing challenges of its own.
Now we’re seeing a new industry emerge that promises to be a saving grace: synthetic data. Synthetic data is artificial, computer-generated data that can stand in for data obtained from the real world.
A synthetic dataset must have the same mathematical and statistical properties as the real-world dataset it is replacing but does not explicitly represent real individuals. Think of this as a digital mirror of real-world data that is statistically reflective of that world. This enables training AI systems in a completely virtual realm. And it can be readily customized for a variety of use cases ranging from healthcare to retail, finance, transportation, and agriculture.
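To make the "same statistical properties" idea concrete, here is a minimal sketch in Python. It is not how any particular vendor works. It assumes the "real" data has just two numeric columns (a hypothetical age and monthly spend), fits a simple two-variable Gaussian to them, and then samples brand-new rows from that fitted model. The column meanings, the function names `fit_gaussian` and `sample_synthetic`, and the numbers are all illustrative assumptions.

```python
import math
import random
import statistics


def fit_gaussian(rows):
    """Estimate means, standard deviations, and correlation of two numeric columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return mx, my, sx, sy, cov / (sx * sy)


def sample_synthetic(params, n, rng):
    """Draw n new rows from a bivariate Gaussian with the fitted parameters."""
    mx, my, sx, sy, rho = params
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mix z1 into y so the two columns keep the fitted correlation.
        y = my + sy * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        rows.append((x, y))
    return rows


rng = random.Random(0)

# Hypothetical stand-in for a real dataset: (age, monthly spend) per customer.
real = []
for _ in range(5000):
    age = rng.gauss(40, 10)
    real.append((age, 20 + 1.5 * age + rng.gauss(0, 10)))

real_params = fit_gaussian(real)
synthetic = sample_synthetic(real_params, 5000, rng)
synth_params = fit_gaussian(synthetic)

print("real     :", [round(p, 2) for p in real_params])
print("synthetic:", [round(p, 2) for p in synth_params])
```

The two printed rows should be close to each other, even though no synthetic row is copied from a real one. Production tools typically use far richer generative models than a Gaussian, so treat this only as an illustration of the principle.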
There’s significant movement happening on this front. More than 50 vendors have already developed synthetic data solutions, according to research last June by StartUs Insights. I will outline some of the leading players in a moment. First, though, let’s take a closer look at the problems they’re promising to solve.
The trouble with real data
Over the last few years, there has been increasing concern about how inherent biases in datasets can unwittingly lead to AI algorithms that perpetuate systemic discrimination. In fact, Gartner predicts that through 2022, 85% of AI projects will deliver erroneous outcomes due to bias in data, algorithms, or the teams responsible for managing them.
The proliferation of AI algorithms has also led to growing concerns over data privacy. In turn, this has led to stronger consumer data privacy and protection laws in the EU, with GDPR, as well as in U.S. jurisdictions including California and, most recently, Virginia.
These laws give consumers more control over their personal data. For example, the Virginia law grants consumers the right to access, correct, delete, and obtain a copy of personal data as well as to opt out of the sale of personal data and to deny algorithmic access to personal data for the purposes of targeted advertising or profiling of the consumer.
By restricting access to this information, a certain amount of individual protection is gained but at the cost of the algorithm’s effectiveness. The more data an AI algorithm can train on, the more accurate and effective the results will be. Without access to ample data, the upsides of AI, such as assisting with medical diagnoses and drug research, could also be limited.
One alternative often used to offset privacy concerns is anonymization. Personal data can be anonymized by masking or eliminating identifying characteristics, for example by removing names and credit card numbers from ecommerce transactions or identifying content from healthcare records. But there is growing evidence that even if data from one source has been anonymized, it can be correlated with consumer datasets exposed in security breaches. In fact, by combining data from multiple sources, it is possible to form a surprisingly clear picture of our identities even if there has been a degree of anonymization. In some instances, this can even be done by correlating data from public sources, without a nefarious security hack.
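The correlation risk described above can be sketched in a few lines of Python. This is a toy illustration, not a real attack tool: every record, name, and field below is made up, and the choice of ZIP code, birth year, and sex as the linking fields is an assumption for the example. The point is that an "anonymized" record can still be matched to a named person when the fields left behind are unique enough.

```python
# Hypothetical "anonymized" health records: names removed, other fields kept.
anonymized = [
    {"zip": "22030", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"zip": "94110", "birth_year": 1988, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "94110", "birth_year": 1990, "sex": "M", "diagnosis": "migraine"},
]

# Hypothetical public or leaked list that does include names.
public = [
    {"name": "Alice Example", "zip": "22030", "birth_year": 1975, "sex": "F"},
    {"name": "Bob Example", "zip": "94110", "birth_year": 1988, "sex": "M"},
    {"name": "Carol Example", "zip": "10001", "birth_year": 1990, "sex": "F"},
]

# Fields shared by both datasets that can act as a fingerprint.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def link(anonymized_rows, public_rows):
    """Return (name, diagnosis) pairs for records that match exactly one person."""
    index = {}
    for person in public_rows:
        key = tuple(person[field] for field in QUASI_IDENTIFIERS)
        index.setdefault(key, []).append(person["name"])

    matches = []
    for record in anonymized_rows:
        key = tuple(record[field] for field in QUASI_IDENTIFIERS)
        names = index.get(key, [])
        if len(names) == 1:  # only an unambiguous match re-identifies someone
            matches.append((names[0], record["diagnosis"]))
    return matches


reidentified = link(anonymized, public)
for name, diagnosis in reidentified:
    print(f"{name}: {diagnosis}")
```

In this made-up data, two of the three "anonymized" records are tied back to a name using nothing more than three ordinary fields.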
Synthetic data’s solution
Synthetic data promises to deliver the advantages of AI without the downsides. Not only does it take our real personal data out of the equation, but a general goal for synthetic data is to perform better than real-world data by correcting bias that is often ingrained in the real world.