Techniques to produce and evaluate realistic multivariate synthetic data. Heine, J., Fowler, E. E. E., Berglund, A., Schell, M. J., & Eschrich, S. Scientific Reports, 13(1):12266, July 2023. Publisher: Nature Publishing Group.
Data modeling requires a sufficient sample size for reproducibility. A small sample size can inhibit model evaluation. A synthetic data generation technique addressing this small sample size problem is evaluated: from the space of arbitrarily distributed samples, a subgroup (class) has a latent multivariate normal characteristic; synthetic data can be generated from this class with univariate kernel density estimation (KDE); and synthetic samples are statistically like their respective samples. Three samples (n = 667) were investigated with 10 input variables (X). KDE was used to augment the sample size in X. Maps produced univariate normal variables in Y. Principal component analysis in Y produced uncorrelated variables in T, where the probability density functions were approximated as normal and characterized; synthetic data was generated with normally distributed univariate random variables in T. Reversing each step produced synthetic data in Y and X. All samples were approximately multivariate normal in Y, permitting the generation of synthetic data. Probability density function and covariance comparisons showed similarity between samples and synthetic samples. A class of samples has a latent normal characteristic. For such samples, this approach offers a solution to the small sample size problem. Further studies are required to understand this latent class.
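The X, Y, T chain described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the KDE augmentation step in X is omitted, and the per-variable maps between X and Y are replaced by a rank-based normal-score transform with empirical-quantile inversion. Only the overall order of steps (map to normal variables, PCA, sample independent normals, reverse each step) follows the abstract.

```python
import numpy as np
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for cdf / inverse cdf


def to_normal_scores(x):
    """Map one variable to approximately standard-normal scores via its ranks.

    Assumption: a rank-based map stands in for the paper's X -> Y maps.
    """
    n = len(x)
    ranks = np.argsort(np.argsort(x))      # ranks 0..n-1 (ties broken arbitrarily)
    probs = (ranks + 0.5) / n              # strictly inside (0, 1)
    return np.array([_STD_NORMAL.inv_cdf(p) for p in probs])


def from_normal_scores(y_new, x_ref):
    """Reverse the map: normal score -> probability -> empirical quantile of x_ref."""
    probs = np.array([_STD_NORMAL.cdf(v) for v in y_new])
    return np.quantile(x_ref, probs)


def synthesize(X, n_synth, rng):
    """Generate n_synth synthetic rows from a sample X of shape (n, d), d >= 2."""
    d = X.shape[1]

    # X -> Y: make each variable approximately univariate normal.
    Y = np.column_stack([to_normal_scores(X[:, j]) for j in range(d)])

    # Y -> T: PCA via eigendecomposition of the covariance matrix of Y.
    mean_y = Y.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Y, rowvar=False))
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values

    # Generate in T: independent normal variables, one variance per component.
    T_synth = rng.standard_normal((n_synth, d)) * np.sqrt(eigvals)

    # T -> Y: rotate back, restoring the correlation structure of Y.
    Y_synth = T_synth @ eigvecs.T + mean_y

    # Y -> X: reverse the per-variable maps.
    return np.column_stack(
        [from_normal_scores(Y_synth[:, j], X[:, j]) for j in range(d)]
    )
```

Calling `synthesize(X, 2000, np.random.default_rng(0))` on a sample with 667 rows would return 2000 synthetic rows. Because the inversion here uses empirical quantiles, synthetic values never fall outside the observed range of each variable, which is a limitation of this simplification and one reason a KDE step could be preferable.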
@article{heine_techniques_2023,
	title = {Techniques to produce and evaluate realistic multivariate synthetic data},
	volume = {13},
	copyright = {2023 The Author(s)},
	issn = {2045-2322},
	url = {https://www.nature.com/articles/s41598-023-38832-0},
	doi = {10.1038/s41598-023-38832-0},
	abstract = {Data modeling requires a sufficient sample size for reproducibility. A small sample size can inhibit model evaluation. A synthetic data generation technique addressing this small sample size problem is evaluated: from the space of arbitrarily distributed samples, a subgroup (class) has a latent multivariate normal characteristic; synthetic data can be generated from this class with univariate kernel density estimation (KDE); and synthetic samples are statistically like their respective samples. Three samples (n = 667) were investigated with 10 input variables (X). KDE was used to augment the sample size in X. Maps produced univariate normal variables in Y. Principal component analysis in Y produced uncorrelated variables in T, where the probability density functions were approximated as normal and characterized; synthetic data was generated with normally distributed univariate random variables in T. Reversing each step produced synthetic data in Y and X. All samples were approximately multivariate normal in Y, permitting the generation of synthetic data. Probability density function and covariance comparisons showed similarity between samples and synthetic samples. A class of samples has a latent normal characteristic. For such samples, this approach offers a solution to the small sample size problem. Further studies are required to understand this latent class.},
	language = {en},
	number = {1},
	urldate = {2023-10-17},
	journal = {Scientific Reports},
	author = {Heine, John and Fowler, Erin E. E. and Berglund, Anders and Schell, Michael J. and Eschrich, Steven},
	month = jul,
	year = {2023},
	publisher = {Nature Publishing Group},
	keywords = {Applied mathematics, Computational science, Data processing, Predictive medicine, Scientific data, Statistical methods, Statistics},
	pages = {12266},
}
