‘Synthetic data’ is being used in chemistry, but is it something we should worry about? Hayley Bennett explains
For scientists, faking or making up data has obvious connotations and, thanks to some high-profile cases of scientific misconduct, they’re generally not positive ones. Chemists may, for example, be aware of a 2022 case in which a respected journal retracted two papers by a Japanese chemistry group that were found to contain ‘manipulated or fabricated’ data. Or the case of Bengü Sezen, the Columbia University chemist who, during the 2000s, ‘falsified, fabricated and plagiarised’ data to get her work on chemical bonding published – including fixing her NMR spectra with correcting fluid.
‘Synthetic data’, unlike dishonestly made-up data, is created in a systematic way, usually by a machine, and for a variety of legitimate reasons. Synthetic data is familiar to machine learning experts, and increasingly to computational chemists, but relatively unknown to the wider chemistry community