DACIDR: deterministic annealed clustering with interpolative dimension reduction using a large collection of 16S rRNA sequences. Ruan, Y., Ekanayake, S., Rho, M., Tang, H., Bae, S., S., Qiu, J., & Fox, G., C. In Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine, pages 329-336, 2012. ACM.
doi  abstract   bibtex   
The recent advance in next generation sequencing (NGS) techniques has enabled the direct analysis of the genetic information within a whole microbial community, bypassing the culturing individual microbial species in the lab. One can profile the marker genes of 16S rRNA encoded in the sample through the amplification of highly variable regions in the genes and sequencing of them by using Roche/454 sequencers to generate half to a few millions of 16S rRNA fragments of about 400 base pairs. The main computational challenge of analyzing such data is to group these sequences into operational taxonomic units (OTUs). Common clustering algorithms (such as hierarchical clustering) require quadratic space and time complexity that makes them not suitable for large datasets with millions of sequences. An alternative is to use greedy heuristic clustering methods (such as CD-HIT and UCLUST); although these enable fast sequence analyzing, the hard-cutoff similarity threshold set for them and the random starting seeds can result in reduced accuracy and overestimation (too many clusters). In this paper, we propose DACIDR: a parallel sequence clustering and visualization pipeline, which can address the overestimation problem along with space and time complexity issues as well as giving robust result. The pipeline starts with a parallel pairwise sequence alignment analysis followed by a deterministic annealing method for both clustering and dimension reduction. No explicit similarity threshold is needed with the process of clustering. Experiments with our system also proved the quadratic time and space complexity issue could be solved with a novel heuristic method called Sample Sequence Partition Tree (SSP-Tree), which allowed us to interpolate millions of sequences with sub-quadratic time and linear space requirement. Furthermore, SSP-Tree can enhance the speed of fine-tuning on the existing result, which made it possible to recursive clustering to achieve accurate local results. Our experiments showed that DACIDR produced a more reliable result than two popular greedy heuristic clustering methods. Copyright © 2012 ACM.
@inproceedings{
 title = {DACIDR: deterministic annealed clustering with interpolative dimension reduction using a large collection of 16S rRNA sequences},
 type = {inproceedings},
 year = {2012},
 pages = {329-336},
 publisher = {ACM},
 id = {64fdf49e-07fb-382d-a070-add925519882},
 created = {2017-12-18T21:44:04.277Z},
 file_attached = {false},
 profile_id = {42d295c0-0737-38d6-8b43-508cab6ea85d},
 last_modified = {2020-05-11T14:43:45.853Z},
 read = {false},
 starred = {false},
 authored = {true},
 confirmed = {true},
 hidden = {false},
 citation_key = {Ruan2012},
 source_type = {CONF},
 folder_uuids = {36d8ccf4-7085-47fa-8ab9-897283d082c5},
 private_publication = {false},
 abstract = {The recent advance in next generation sequencing (NGS) techniques has enabled the direct analysis of the genetic information within a whole microbial community, bypassing the culturing individual microbial species in the lab. One can profile the marker genes of 16S rRNA encoded in the sample through the amplification of highly variable regions in the genes and sequencing of them by using Roche/454 sequencers to generate half to a few millions of 16S rRNA fragments of about 400 base pairs. The main computational challenge of analyzing such data is to group these sequences into operational taxonomic units (OTUs). Common clustering algorithms (such as hierarchical clustering) require quadratic space and time complexity that makes them not suitable for large datasets with millions of sequences. An alternative is to use greedy heuristic clustering methods (such as CD-HIT and UCLUST); although these enable fast sequence analyzing, the hard-cutoff similarity threshold set for them and the random starting seeds can result in reduced accuracy and overestimation (too many clusters).  In this paper, we propose DACIDR: a parallel sequence clustering and visualization pipeline, which can address the overestimation problem along with space and time complexity issues as well as giving robust result. The pipeline starts with a parallel pairwise sequence alignment analysis followed by a deterministic annealing method for both clustering and dimension reduction. No explicit similarity threshold is needed with the process of clustering. Experiments with our system also proved the quadratic time and space complexity issue could be solved with a novel heuristic method called Sample Sequence Partition Tree (SSP-Tree), which allowed us to interpolate millions of sequences with sub-quadratic time and linear space requirement. Furthermore, SSP-Tree can enhance the speed of fine-tuning on the existing result, which made it possible to recursive clustering to achieve accurate local results. Our experiments showed that DACIDR produced a more reliable result than two popular greedy heuristic clustering methods. Copyright © 2012 ACM.},
 bibtype = {inproceedings},
 author = {Ruan, Yang and Ekanayake, Saliya and Rho, Mina and Tang, Haixu and Bae, Seung-Hee S.-H. and Qiu, Judy and Fox, Geoffrey Charles},
 doi = {10.1145/2382936.2382978},
 booktitle = {Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine}
}

Downloads: 0