Synthetic Data – Love it or Hate it?
VOLUME 1 - ISSUE 15 ~ DECEMBER 4, 2024
What are the advantages of synthetic data? In this edition of the “CIO Two Cents” newsletter, I consider the uses of this technology, its benefits, as well as real-world use cases.
— Yvette Kanouff, partner at JC2 Ventures
The JC2 Ventures team (John J. Chambers, Shannon Pina, John T. Chambers, me, and Pankaj Patel)
I find the discussions around the use of synthetic data quite interesting. Of course, it would be wonderful to have abundant, accurate real data to train our models, but that isn’t always available. So, is the use of synthetic data valuable? I think so.
As we know, synthetic data is artificially created. It can be used to train and test models. In many cases, this artificial data provides clean tagging and great accuracy, which enables training and testing without the cost, time, and availability constraints of obtaining real, live data.
There are many benefits to using synthetic data, even if real data is available. Some of these include:
Clean data – synthetic data enables clean tagging and precise knowledge of what the data represents, enabling AI models to be trained with accuracy.
Time and cost savings – synthetic data can be created with mathematical algorithms and generative AI so that it mimics real, live data. The time and cost savings are immense. Data can also be newly generated and customized as needed.
Legal issue minimization – one of the great benefits of synthetic data is that it avoids some of the pitfalls of real data with regard to potential copyright issues, use of protected data, and data privacy concerns, as the data is artificially generated rather than tied to real individuals.
Minimizing hallucinations – with properly labeled and well-generated synthetic data, hallucinations can be minimized when there is not enough real data for adequate training.
Testing of unstructured data – synthetic data can be a good tool for training and testing models on unstructured data.
Minimizing bias – with proper oversight, synthetic data can minimize some of the biases that can occur in real data.
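To make the "clean tagging" and "generated as needed" points concrete, here is a minimal sketch of rule-based synthetic data for a fraud-detection style use case. Everything in it is an illustrative assumption rather than a method described in this newsletter: the `Transaction` fields, the `make_transactions` helper, and the amount and time-of-day patterns are all hypothetical, and real generators (including generative AI approaches) are far more sophisticated.

```python
import random
from dataclasses import dataclass


@dataclass
class Transaction:
    amount: float   # purchase amount in dollars
    hour: int       # hour of day, 0-23
    is_fraud: bool  # label, known exactly because we generated the record


def make_transactions(n, fraud_rate=0.5, seed=0):
    """Generate n labeled synthetic transactions (illustrative rules only)."""
    rng = random.Random(seed)  # seeded so the same dataset can be regenerated
    rows = []
    for _ in range(n):
        # Decide the label first, then generate features that match it.
        is_fraud = rng.random() < fraud_rate
        if is_fraud:
            # Assumed pattern: fraud skews toward larger amounts at night.
            amount = round(rng.lognormvariate(6.0, 0.8), 2)
            hour = rng.choice([0, 1, 2, 3, 4, 23])
        else:
            # Assumed pattern: normal purchases are smaller and in the daytime.
            amount = round(rng.lognormvariate(3.5, 0.9), 2)
            hour = rng.randint(7, 22)
        rows.append(Transaction(amount, hour, is_fraud))
    return rows


data = make_transactions(1000, fraud_rate=0.5, seed=42)
fraud_share = sum(t.is_fraud for t in data) / len(data)
print(f"{len(data)} rows, {fraud_share:.0%} labeled as fraud")
```

Because the label is chosen before the features are generated, every record is tagged correctly by construction, no record corresponds to a real person, and the share of the rare class can be adjusted with `fraud_rate`. The trade-off is the one noted in this article: any quirk of real transactions that the rules do not encode will be missing from the data.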
Obviously, there are some downsides, mainly ‘quirks’ and unknowns in real data that may be missing from synthetic data and only emerge later. However, I believe the benefits often outweigh this concern.
Today, synthetic data is used extensively across various fields, including contact centers for voice and sentiment insights, finance, healthcare, and more. Some examples include Alphabet’s Waymo using synthetic data for self-driving cars, Amazon using it to help train Alexa’s natural language understanding, and American Express and J.P. Morgan Chase using it for fraud detection. It has also proven to be beneficial in deepfake analysis. Overall, I see great value in the use of synthetic data. I’m curious what you are seeing in its pros and cons.
Moving fast? I've got you covered. Here are the key takeaways:
(1) Clean and Accurate Data: Synthetic data allows for clean tagging and precise knowledge, ensuring AI models are trained with high accuracy. This is particularly useful when real data is scarce or difficult to obtain.
(2) Cost and Time Efficiency: The creation of synthetic data can save significant time and resources. It allows for the generation of new, customized data on demand.
(3) Privacy and Legal Advantages: Using synthetic data mitigates legal and privacy concerns associated with real data. It avoids issues related to copyright, protected data, and privacy, as the data is artificially generated and not linked to real individuals.