Massimiliano Marchesiello’s Post

AI & Machine Learning Specialist | Data Scientist

Evaluating synthetic data
https://ift.tt/gAfTWLH

Assessing plausibility and usefulness of data we generated from real data

Synthetic data serves many purposes and has been gathering attention for a while, partly due to the convincing capabilities of LLMs. But what is "good" synthetic data, and how can we know we have managed to generate it?

Photo by Nigel Hoare on Unsplash

What is synthetic data?

Synthetic data is data generated with the intent of looking like real data, at least in some respects (the schema at the very least, statistical distributions, ...). It is usually generated randomly, using a wide range of models: random sampling, noise addition, GANs, diffusion models, variational autoencoders, LLMs, ... It is used for many purposes, for instance:

- training and education (e.g., discovering a new database or teaching a course),
- data augmentation (i.e., creating new samples to train a model),
- sharing data while protecting privacy (especially useful from an open-science point of view),
- conducting research while protecting privacy.

It is particularly used in software testing, and in sensitive domains like healthcare technology: having access to data that behaves like real data without jeopardizing patients' privacy is a dream come true.

Synthetic data quality principles

Individual plausibility

For a sample to be useful, it must in some way look like real data. The ultimate goal is for generated samples to be indistinguishable from real samples: hyper-realistic faces, sentences, medical records, ... Obviously, the more complex the source data, the harder it is to generate "good" synthetic data.

Usefulness

In many cases, especially data augmentation, we need more than one realistic sample: we need a whole dataset.
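As a minimal, illustrative sketch (not from the article), the two simplest generation approaches mentioned above, random sampling from a fitted distribution and noise addition, could look like this. The "patient ages" column and all parameter values are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in "real" data: a hypothetical numeric column of patient ages
# (illustrative only, not real records).
real_ages = rng.normal(loc=45, scale=12, size=1_000).clip(0, 100)

# 1. Random sampling: draw fresh values from a distribution
#    fitted to the real data (here, a normal fit via mean and std).
synthetic_sampled = rng.normal(real_ages.mean(), real_ages.std(), size=1_000)

# 2. Noise addition: perturb each real record to obtain a new one.
#    The noise scale (2.0) is an arbitrary choice for the sketch.
synthetic_noised = real_ages + rng.normal(0.0, 2.0, size=real_ages.shape)

print(round(real_ages.mean(), 1), round(synthetic_sampled.mean(), 1))
```

Both generators preserve the column's rough location and spread, which is exactly the kind of per-sample plausibility discussed above; neither, on its own, guarantees anything about the joint distribution of several columns.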
Generating a whole dataset is not the same as generating a single sample: the problem is well known under the name of mode collapse, and it is especially frequent when training a generative adversarial network (GAN). Essentially, the generator (more generally, the model that generates synthetic data) can learn to produce a single type of sample and miss the rest of the sample space entirely, leading to a synthetic dataset that is not as useful as the original one. For instance, if we train a model to generate animal pictures and it finds a very efficient way to generate cat pictures, it could stop generating anything other than cat pictures (in particular, no dog pictures). Cat pictures would then be the "mode" of the generated distribution.

This type of behaviour is harmful if our initial goal is to augment our data or to create a training dataset. What we need is a dataset that is realistic in itself, which in absolute terms means that any statistic derived from this dataset should be close enough to the same statistic computed on the real data. Statistically speaking, this means that the univariate and multivariate distributions should be the same (or at least "close enough").

Privacy

We...
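One common way to check that a univariate distribution is "close enough", as described above, is the two-sample Kolmogorov-Smirnov statistic (the maximum gap between the empirical CDFs of the real and synthetic samples). The following sketch uses made-up data and an invented "collapsed" generator; the variable names and thresholds are assumptions for illustration, not from the article:

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample KS statistic: max gap between the empirical CDFs of a and b."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

rng = np.random.default_rng(0)
real = rng.normal(50, 10, size=2_000)

# A faithful generator reproduces the real distribution...
good_synth = rng.normal(50, 10, size=2_000)
# ...while a mode-collapsed generator concentrates on a narrow region.
collapsed_synth = rng.normal(50, 1, size=2_000)

print(ks_statistic(real, good_synth))       # small gap: distributions match
print(ks_statistic(real, collapsed_synth))  # large gap: collapse is visible
```

Note that such univariate checks are necessary but not sufficient: a generator can match every marginal distribution while still getting the correlations between columns wrong, which is why multivariate comparisons matter too.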

