Data Harmonization Guidelines to Combine Multi-platform Genomic Data from Admixed Populations and Boost Power in Genome-Wide Association Studies. Croock, D., Swart, Y., Schurz, H., Petersen, D. C., Möller, M., & Uren, C. Current Protocols, 4(6):e1055, June, 2024.
doi  abstract   bibtex   
Data harmonization involves combining data from multiple independent sources and processing the data to produce one uniform dataset. Merging separate genotypes or whole-genome sequencing datasets has been proposed as a strategy to increase the statistical power of association tests by increasing the effective sample size. However, data harmonization is not a widely adopted strategy due to the difficulties with merging data (including confounding produced by batch effects and population stratification). Detailed data harmonization protocols are scarce and are often conflicting. Moreover, data harmonization protocols that accommodate samples of admixed ancestry are practically non-existent. Existing data harmonization procedures must be modified to ensure the heterogeneous ancestry of admixed individuals is incorporated into additional downstream analyses without confounding results. Here, we propose a set of guidelines for merging multi-platform genetic data from admixed samples that can be adopted by any investigator with elementary bioinformatics experience. We have applied these guidelines to aggregate 1544 tuberculosis (TB) case-control samples from six separate in-house datasets and conducted a genome-wide association study (GWAS) of TB susceptibility. The GWAS performed on the merged dataset had improved power over analyzing the datasets individually and produced summary statistics free from bias introduced by batch effects and population stratification. © 2024 Wiley Periodicals LLC. Basic Protocol 1: Processing separate datasets comprising array genotype data Alternate Protocol 1: Processing separate datasets comprising array genotype and whole-genome sequencing data Alternate Protocol 2: Performing imputation using a local reference panel Basic Protocol 2: Merging separate datasets Basic Protocol 3: Ancestry inference using ADMIXTURE and RFMix Basic Protocol 4: Batch effect correction using pseudo-case-control comparisons.
@article{croock_data_2024,
	title = {Data {Harmonization} {Guidelines} to {Combine} {Multi}-platform {Genomic} {Data} from {Admixed} {Populations} and {Boost} {Power} in {Genome}-{Wide} {Association} {Studies}},
	volume = {4},
	issn = {2691-1299},
	doi = {10.1002/cpz1.1055},
	abstract = {Data harmonization involves combining data from multiple independent sources and processing the data to produce one uniform dataset. Merging separate genotypes or whole-genome sequencing datasets has been proposed as a strategy to increase the statistical power of association tests by increasing the effective sample size. However, data harmonization is not a widely adopted strategy due to the difficulties with merging data (including confounding produced by batch effects and population stratification). Detailed data harmonization protocols are scarce and are often conflicting. Moreover, data harmonization protocols that accommodate samples of admixed ancestry are practically non-existent. Existing data harmonization procedures must be modified to ensure the heterogeneous ancestry of admixed individuals is incorporated into additional downstream analyses without confounding results. Here, we propose a set of guidelines for merging multi-platform genetic data from admixed samples that can be adopted by any investigator with elementary bioinformatics experience. We have applied these guidelines to aggregate 1544 tuberculosis (TB) case-control samples from six separate in-house datasets and conducted a genome-wide association study (GWAS) of TB susceptibility. The GWAS performed on the merged dataset had improved power over analyzing the datasets individually and produced summary statistics free from bias introduced by batch effects and population stratification. © 2024 Wiley Periodicals LLC. Basic Protocol 1: Processing separate datasets comprising array genotype data Alternate Protocol 1: Processing separate datasets comprising array genotype and whole-genome sequencing data Alternate Protocol 2: Performing imputation using a local reference panel Basic Protocol 2: Merging separate datasets Basic Protocol 3: Ancestry inference using ADMIXTURE and RFMix Basic Protocol 4: Batch effect correction using pseudo-case-control comparisons.},
	language = {eng},
	number = {6},
	journal = {Current Protocols},
	author = {Croock, Dayna and Swart, Yolandi and Schurz, Haiko and Petersen, Desiree C. and Möller, Marlo and Uren, Caitlin},
	month = jun,
	year = {2024},
	pmid = {38837690},
	keywords = {Case-Control Studies, Genetic Predisposition to Disease, Genome-Wide Association Study, Genomics, Guidelines as Topic, Humans, Tuberculosis, ancestry inference, batch effects, data harmonization, genomic data, merging},
	pages = {e1055},
}

Downloads: 0