Can synthetic data reproduce real-world findings in epidemiology? A replication study using tree-based generative AI

Item Type:Preprint
Title:Can synthetic data reproduce real-world findings in epidemiology? A replication study using tree-based generative AI
Creators Name:Kapar, Jan, Günther, Kathrin, Vallis, Lori Ann, Berger, Klaus, Binder, Nadine, Brenner, Hermann, Castell, Stefanie, Fischer, Beate, Harth, Volker, Holleczek, Bernd, Intemann, Timm, Ittermann, Till, Karch, André, Keil, Thomas, Krist, Lilian, Lange, Berit, Leitzmann, Michael F., Nimptsch, Katharina, Obi, Nadia, Pigeot, Iris, Pischon, Tobias, Schikowski, Tamara, Schmidt, Börge, Schmidt, Carsten Oliver, Sedlmair, Anja M., Tanoey, Justine, Wienbergen, Harm, Wienke, Andreas, Wigmann, Claudia and Wright, Marvin N.
Abstract:Generative artificial intelligence for synthetic data generation holds substantial potential to address practical challenges in epidemiology. However, many current methods suffer from limited quality, high computational demands, and complexity for non-experts. Furthermore, common evaluation strategies for synthetic data often fail to directly reflect statistical utility. Against this background, a critical underexplored question is whether synthetic data can reliably reproduce key findings from epidemiological research. We propose the use of adversarial random forests (ARF) as an efficient and convenient method for synthesizing tabular epidemiological data. To evaluate its performance, we replicated statistical analyses from six epidemiological publications and compared original with synthetic results. These publications cover blood pressure, anthropometry, myocardial infarction, accelerometry, loneliness, and diabetes, based on data from the German National Cohort (NAKO Gesundheitsstudie), the Bremen STEMI Registry U45 Study, and the Guelph Family Health Study. Additionally, we assessed the impact of dimensionality and variable complexity on synthesis quality by limiting datasets to variables relevant for individual analyses, including necessary derivations. Across all replicated original studies, results from multiple synthetic data replications consistently aligned with original findings. Even for datasets with relatively low sample size-to-dimensionality ratios, the replication outcomes closely matched the original results across various descriptive and inferential analyses. Reducing dimensionality and pre-deriving variables further enhanced both quality and stability of the results. In summary, ARF reliably generates high-quality synthetic data that replicate epidemiological study findings across diverse scenarios, supporting its practical use for applications such as rapid prototyping and data sharing under appropriate privacy and regulatory considerations.
Keywords:Generative Artificial Intelligence, Generative Machine Learning, Adversarial Random Forests, Synthetic Data Quality, Statistical Utility, Epidemiological Study Replication, Tabular Data
Source:arXiv
Publisher:Cornell University
Article Number:2508.14936v1
Date:19 August 2025
Official Publication:https://doi.org/10.48550/arXiv.2508.14936
