Addressing pandemic-wide systematic errors in the SARS-CoV-2 phylogeny. Hunt, M., Hinrichs, A. S, Anderson, D., Karim, L., Dearlove, B. L, Knaggs, J., Constantinides, B., Fowler, P. W, Rodger, G., Street, T. L, Lumley, S. F, Webster, H., Sanderson, T., Ruis, C., De Maio, N., Amenga-Etego, L. N, Amuzu, D. S., Avaro, M., Awandare, G. A, Ayivor-Djanie, R., Bashton, M., Batty, E. M, Bediako, Y., De Belder, D., Benedetti, E., Bergthaler, A., Boers, S. A, Campos, J., Carr, R. A. A., Cuba, F., Dattero, M. E., Dejnirattisai, W., Dilthey, A. T, Duedu, K. O., Endler, L., Engelmann, I., Francisco, N. M, Fuchs, J., Gnimpieba, E. Z, Groc, S., Gyamfi, J., Heemskerk, D., Houwaart, T., Hsiao, N., Huska, M., Hoelzer, M., Iranzadeh, A., Jarva, H., Jeewandara, C., Jolly, B., Joseph, R., Kant, R., Ki, K. K. K., Kurkela, S., Lappalainen, M., Lataretu, M., Liu, C., Malavige, G. N., Mashe, T., Mongkolsapaya, J., Montes, B., Molina-Mora, J. A., Morang'a, C. M, Mvula, B., Nagarajan, N., Nelson, A., Ngoi, J. M., da Paixao, J. P., Panning, M., Poklepovich, T., Quashie, P. K., Ranasinghe, D., Russo, M., San, J. E, Sanderson, N. D, Scaria, V., Screaton, G., Sironen, T., Sisay, A., Smith, D., Smura, T., Supasa, P., Suphavilai, C., Swann, J., Tegally, H., Tegomoh, B., Vapalahti, O., Walker, A., Wilkinson, R. J, Williamson, C., Consortium, I. L. N., de Oliveira, T., Peto, T. E., Crook, D., Corbett-Detig, R., & Iqbal, Z. bioRxiv, 15:2024.04.29.591666, Cold Spring Harbor Laboratory, apr, 2024.
The SARS-CoV-2 genome occupies a unique place in infection biology – it is the most highly sequenced genome on earth (making up over 20% of public sequencing datasets) with fine-scale information on sampling date and geography, and has been subject to unprecedentedly intense analysis. As a result, these phylogenetic data are an incredibly valuable resource for science and public health. However, the vast majority of the data was sequenced by tiling amplicons across the full genome, with amplicon schemes that changed over the pandemic as mutations in the viral genome interacted with primer binding sites. In combination with the disparate set of genome assembly workflows and lack of consistent quality control (QC) processes, the current genomes have many systematic errors that have evolved with the virus and amplicon schemes. These errors have significant impacts on the phylogeny, and therefore over the last few years, many thousands of hours of researchers' time have been spent in "eyeballing" trees, looking for artefacts, and then patching the tree. Given the huge value of this dataset, we therefore set out to reprocess the complete set of public raw sequence data in a rigorous amplicon-aware manner, and build a cleaner phylogeny. Here we provide a global tree of 3,960,704 samples, built from a consistently assembled set of high-quality consensus sequences from all available public data as of March 2023, viewable at https://viridian.taxonium.org. Each genome was constructed using a novel assembly tool called Viridian (https://github.com/iqbal-lab-org/viridian), developed specifically to process amplicon sequence data, eliminating artefactual errors and masking the genome at low-quality positions. We provide simulation and empirical validation of the methodology, and quantify the improvement in the phylogeny. Phase 2 of our project will address the fact that the data in the public archives is heavily geographically biased towards the Global North. We have therefore contributed new raw data to ENA/SRA from many countries including Ghana, Thailand, Laos, Sri Lanka, India, Argentina and Singapore. We will incorporate these, along with all public raw data submitted between March 2023 and the current day, into an updated set of assemblies and an updated phylogeny. We hope the tree, consensus sequences and Viridian will be a valuable resource for researchers.
Competing Interest Statement: Gavin Screaton sits on the GSK Vaccines Scientific Advisory Board, consults for AstraZeneca, and is a founding member of RQ Biotechnology.
@article{Hunt2024,
abstract = {The SARS-CoV-2 genome occupies a unique place in infection biology -- it is the most highly sequenced genome on earth (making up over 20{\%} of public sequencing datasets) with fine-scale information on sampling date and geography, and has been subject to unprecedentedly intense analysis. As a result, these phylogenetic data are an incredibly valuable resource for science and public health. However, the vast majority of the data was sequenced by tiling amplicons across the full genome, with amplicon schemes that changed over the pandemic as mutations in the viral genome interacted with primer binding sites. In combination with the disparate set of genome assembly workflows and lack of consistent quality control (QC) processes, the current genomes have many systematic errors that have evolved with the virus and amplicon schemes. These errors have significant impacts on the phylogeny, and therefore over the last few years, many thousands of hours of researchers' time have been spent in "eyeballing" trees, looking for artefacts, and then patching the tree. Given the huge value of this dataset, we therefore set out to reprocess the complete set of public raw sequence data in a rigorous amplicon-aware manner, and build a cleaner phylogeny. Here we provide a global tree of 3,960,704 samples, built from a consistently assembled set of high-quality consensus sequences from all available public data as of March 2023, viewable at https://viridian.taxonium.org. Each genome was constructed using a novel assembly tool called Viridian (https://github.com/iqbal-lab-org/viridian), developed specifically to process amplicon sequence data, eliminating artefactual errors and masking the genome at low-quality positions. We provide simulation and empirical validation of the methodology, and quantify the improvement in the phylogeny. Phase 2 of our project will address the fact that the data in the public archives is heavily geographically biased towards the Global North. We have therefore contributed new raw data to ENA/SRA from many countries including Ghana, Thailand, Laos, Sri Lanka, India, Argentina and Singapore. We will incorporate these, along with all public raw data submitted between March 2023 and the current day, into an updated set of assemblies and an updated phylogeny. We hope the tree, consensus sequences and Viridian will be a valuable resource for researchers. Competing Interest Statement: Gavin Screaton sits on the GSK Vaccines Scientific Advisory Board, consults for AstraZeneca, and is a founding member of RQ Biotechnology.},
author = {Hunt, Martin and Hinrichs, Angie S and Anderson, Daniel and Karim, Lily and Dearlove, Bethany L and Knaggs, Jeff and Constantinides, Bede and Fowler, Philip W and Rodger, Gillian and Street, Teresa L and Lumley, Sheila F and Webster, Hermione and Sanderson, Theo and Ruis, Christopher and {De Maio}, Nicola and Amenga-Etego, Lucas N and Amuzu, Dominic SY and Avaro, Martin and Awandare, Gordon A and Ayivor-Djanie, Reuben and Bashton, Matthew and Batty, Elizabeth M and Bediako, Yaw and {De Belder}, Denise and Benedetti, Estefania and Bergthaler, Andreas and Boers, Stefan A and Campos, Josefina and Carr, Rosina Afua Ampomah and Cuba, Facundo and Dattero, Maria Elena and Dejnirattisai, Wanwissa and Dilthey, Alexander T and Duedu, Kwabena Obeng and Endler, Lukas and Engelmann, Ilka and Francisco, Ngiambudulu M and Fuchs, Jonas and Gnimpieba, Etienne Z and Groc, Soraya and Gyamfi, Jones and Heemskerk, Dennis and Houwaart, Torsten and Hsiao, Nei-yuan and Huska, Matthew and Hoelzer, Martin and Iranzadeh, Arash and Jarva, Hanna and Jeewandara, Chandima and Jolly, Bani and Joseph, Rageema and Kant, Ravi and Ki, Karrie Ko Kwan and Kurkela, Satu and Lappalainen, Maija and Lataretu, Marie and Liu, Chang and Malavige, Gathsaurie Neelika and Mashe, Tapfumanei and Mongkolsapaya, Juthathip and Montes, Brigitte and Molina-Mora, Jose Arturo and Morang'a, Collins M and Mvula, Bernard and Nagarajan, Niranjan and Nelson, Andrew and Ngoi, Joyce Mwongeli and da Paixao, Joana Paula and Panning, Marcus and Poklepovich, Tomas and Quashie, Peter Kojo and Ranasinghe, Diyanath and Russo, Mara and San, James E and Sanderson, Nicholas D and Scaria, Vinod and Screaton, Gavin and Sironen, Tarja and Sisay, Abay and Smith, Darren and Smura, Teemu and Supasa, Piyada and Suphavilai, Chayaporn and Swann, Jeremy and Tegally, Houriiyah and Tegomoh, Bryan and Vapalahti, Olli and Walker, Andreas and Wilkinson, Robert J and Williamson, Carolyn and {IMSSC2 Laboratory Network Consortium} and de Oliveira, Tulio and Peto, Timothy EA and Crook, Derrick and Corbett-Detig, Russ and Iqbal, Zamin},
doi = {10.1101/2024.04.29.591666},
journal = {bioRxiv},
keywords = {OA,fund{\_}ack,genomics{\_}fund{\_}ack,original},
month = {apr},
pages = {2024.04.29.591666},
publisher = {Cold Spring Harbor Laboratory},
title = {{Addressing pandemic-wide systematic errors in the SARS-CoV-2 phylogeny}},
url = {https://www.biorxiv.org/content/10.1101/2024.04.29.591666v1},
volume = {15},
year = {2024}
}
