Techniques to produce and evaluate realistic multivariate synthetic data. Heine, J., Fowler, E. E. E., Berglund, A., Schell, M. J., & Eschrich, S. Scientific Reports, 13(1):12266, July 2023. Publisher: Nature Publishing Group.

Data modeling requires a sufficient sample size for reproducibility. A small sample size can inhibit model evaluation. A synthetic data generation technique addressing this small sample size problem is evaluated: from the space of arbitrarily distributed samples, a subgroup (class) has a latent multivariate normal characteristic; synthetic data can be generated from this class with univariate kernel density estimation (KDE); and synthetic samples are statistically like their respective samples. Three samples (n = 667) were investigated with 10 input variables (X). KDE was used to augment the sample size in X. Maps produced univariate normal variables in Y. Principal component analysis in Y produced uncorrelated variables in T, where the probability density functions were approximated as normal and characterized; synthetic data was generated with normally distributed univariate random variables in T. Reversing each step produced synthetic data in Y and X. All samples were approximately multivariate normal in Y, permitting the generation of synthetic data. Probability density function and covariance comparisons showed similarity between samples and synthetic samples. A class of samples has a latent normal characteristic. For such samples, this approach offers a solution to the small sample size problem. Further studies are required to understand this latent class.
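The X to Y to T pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code. It rests on several assumptions that the abstract does not specify: the univariate maps are built from a KDE-smoothed cumulative distribution function tabulated on a grid (the paper instead uses KDE to augment the sample size in X before constructing its maps), the bandwidth is SciPy's default Scott rule, and the function names `fit_marginal`, `synthesize`, and `rank_corr` are hypothetical. The toy data are invented only to exercise the code.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm, rankdata


def fit_marginal(x, grid_size=512):
    """KDE-smoothed CDF of one variable, tabulated on a grid (assumed design)."""
    kde = gaussian_kde(x)                     # Gaussian kernel, Scott's rule bandwidth
    pad = 3.0 * kde.factor * x.std(ddof=1)    # extend the grid by 3 bandwidths
    grid = np.linspace(x.min() - pad, x.max() + pad, grid_size)
    pdf = kde(grid)
    # Midpoint-style cumulative sum keeps the CDF strictly inside (0, 1).
    cdf = (np.cumsum(pdf) - 0.5 * pdf) / pdf.sum()
    return grid, cdf


def synthesize(X, m, rng):
    """Generate m synthetic rows from an (n, d) sample X."""
    n, d = X.shape
    marginals = [fit_marginal(X[:, j]) for j in range(d)]

    # X -> Y: map each variable to an approximately normal variable.
    Y = np.column_stack([
        norm.ppf(np.interp(X[:, j], grid, cdf))
        for j, (grid, cdf) in enumerate(marginals)
    ])

    # Y -> T: PCA via eigendecomposition of the covariance of Y.
    mean = Y.mean(axis=0)
    _, eigvecs = np.linalg.eigh(np.cov(Y, rowvar=False))
    T = (Y - mean) @ eigvecs                  # uncorrelated variables

    # Characterize each T variable as a zero-mean normal and sample from it.
    T_syn = rng.normal(size=(m, d)) * T.std(axis=0, ddof=1)

    # T -> Y -> X: reverse each step.
    Y_syn = T_syn @ eigvecs.T + mean
    return np.column_stack([
        np.interp(norm.cdf(Y_syn[:, j]), cdf, grid)
        for j, (grid, cdf) in enumerate(marginals)
    ])


def rank_corr(A):
    """Spearman-type rank correlation matrix of the columns of A."""
    return np.corrcoef(rankdata(A, axis=0), rowvar=False)


# Toy data (invented): correlated normals with one skewed marginal.
rng = np.random.default_rng(0)
cov = [[1.0, 0.6, 0.3], [0.6, 1.0, 0.5], [0.3, 0.5, 1.0]]
Z = rng.multivariate_normal([0.0, 0.0, 0.0], cov, size=667)
X = np.column_stack([np.exp(0.4 * Z[:, 0]), Z[:, 1], 5.0 + 2.0 * Z[:, 2]])

X_syn = synthesize(X, 2000, rng)
max_gap = np.abs(rank_corr(X) - rank_corr(X_syn)).max()
print("largest rank-correlation gap:", max_gap)
```

Comparing rank correlations is only a stand-in for the probability density function and covariance comparisons mentioned in the abstract. The sketch also presumes that the sample is approximately multivariate normal in Y, which the abstract reports for the three samples studied but which would need to be checked for any other data.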
@article{heine_techniques_2023,
title = {Techniques to produce and evaluate realistic multivariate synthetic data},
volume = {13},
copyright = {2023 The Author(s)},
issn = {2045-2322},
url = {https://www.nature.com/articles/s41598-023-38832-0},
doi = {10.1038/s41598-023-38832-0},
abstract = {Data modeling requires a sufficient sample size for reproducibility. A small sample size can inhibit model evaluation. A synthetic data generation technique addressing this small sample size problem is evaluated: from the space of arbitrarily distributed samples, a subgroup (class) has a latent multivariate normal characteristic; synthetic data can be generated from this class with univariate kernel density estimation (KDE); and synthetic samples are statistically like their respective samples. Three samples (n = 667) were investigated with 10 input variables (X). KDE was used to augment the sample size in X. Maps produced univariate normal variables in Y. Principal component analysis in Y produced uncorrelated variables in T, where the probability density functions were approximated as normal and characterized; synthetic data was generated with normally distributed univariate random variables in T. Reversing each step produced synthetic data in Y and X. All samples were approximately multivariate normal in Y, permitting the generation of synthetic data. Probability density function and covariance comparisons showed similarity between samples and synthetic samples. A class of samples has a latent normal characteristic. For such samples, this approach offers a solution to the small sample size problem. Further studies are required to understand this latent class.},
language = {en},
number = {1},
urldate = {2023-10-17},
journal = {Scientific Reports},
author = {Heine, John and Fowler, Erin E. E. and Berglund, Anders and Schell, Michael J. and Eschrich, Steven},
month = jul,
year = {2023},
note = {Number: 1
Publisher: Nature Publishing Group},
keywords = {Applied mathematics, Computational science, Data processing, Predictive medicine, Scientific data, Statistical methods, Statistics},
pages = {12266},
}