A Collection Of 2,280 Public Domain (CC0) Curated Genotypes. Shaw, R. J. & Corpas, M. bioRxiv, April, 2017.
Cheap sequencing has driven the proliferation of big human genome data aggregation consortiums, providing extensive reference datasets for genome research. These datasets, however, may come with restrictive terms of use, conditioned by the consent frameworks with which individuals donate their data. Having an aggregated genome dataset with unrestricted use analogous to public domain licensing is therefore unusually rare. Yet public domain data is tremendously useful because it allows freedom to perform research with it. This comes with the price of donors surrendering their privacy and accepting the associated risks derived from publishing personal data. Using the Repositive platform (https://repositive.io/?23andMe), an indexing service for human genome datasets, we aggregated all deposited files in public data sources under a CC0 license from 23andMe, a leading Direct-to-Consumer genetic testing service. After downloading 3,137 genotypes, we filtered out those that were incomplete, corrupt or duplicated, ending up with a dataset of 2,280 curated files, each one corresponding to a unique individual. Although the size of this dataset is modest compared to current major genome data aggregation projects, its full access and licensing terms, which allows free reuse without attribution, make it a useful reference pool for validation purposes and control experiments.
@article{shaw_collection_2017,
	title = {A {Collection} {Of} 2,280 {Public} {Domain} ({CC0}) {Curated} {Genotypes}},
	url = {http://www.biorxiv.org/content/early/2017/04/19/127241},
	doi = {10.1101/127241},
	abstract = {Cheap sequencing has driven the proliferation of big human genome data aggregation consortiums, providing extensive reference datasets for genome research. These datasets, however, may come with restrictive terms of use, conditioned by the consent frameworks with which individuals donate their data. Having an aggregated genome dataset with unrestricted use analogous to public domain licensing is therefore unusually rare. Yet public domain data is tremendously useful because it allows freedom to perform research with it. This comes with the price of donors surrendering their privacy and accepting the associated risks derived from publishing personal data. Using the Repositive platform (https://repositive.io/?23andMe), an indexing service for human genome datasets, we aggregated all deposited files in public data sources under a CC0 license from 23andMe, a leading Direct-to-Consumer genetic testing service. After downloading 3,137 genotypes, we filtered out those that were incomplete, corrupt or duplicated, ending up with a dataset of 2,280 curated files, each one corresponding to a unique individual. Although the size of this dataset is modest compared to current major genome data aggregation projects, its full access and licensing terms, which allows free reuse without attribution, make it a useful reference pool for validation purposes and control experiments.},
	language = {English},
	journal = {bioRxiv},
	author = {Shaw, Richard J and Corpas, Manuel},
	month = apr,
	year = {2017},
	pages = {127241},
}
