A space and time-efficient index for the compacted colored de Bruijn graph. Almodaresi, F., Sarkar, H., & Patro, R. bioRxiv, Cold Spring Harbor Laboratory, 2018. (conditionally accepted to ISMB 2018)
A space and time-efficient index for the compacted colored de Bruijn graph [link]Paper  doi  abstract   bibtex   
We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so have the number of proposed representations of this structure. Existing structures typically fall into two categories; those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimum perfect hashing, carefully organizing our data structure, and making use of succinct representations where applicable, our data structure provides practically fast k-mer lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme built on the same underlying representation, which provides the ability to trade off k-mer query speed for a reduction in the de Bruijn graph index size. We believe this representation strikes a desirable balance between speed and space usage, and it will allow for fast search on large reference sequences. Pufferfish is developed in C++11, is open source (GPL v3), and is available at https://github.com/COMBINE-lab/pufferfish. The scripts used to generate the results in this manuscript are available at https://github.com/COMBINE-lab/pufferfish_experiments.
@article {almodaresi:2018:pufferfish,
	author = {Almodaresi, Fatemeh and Sarkar, Hirak and Patro, Rob},
	title = {{A space and time-efficient index for the compacted colored de Bruijn graph}},
	year = {2018},
	doi = {10.1101/191874},
	publisher = {Cold Spring Harbor Laboratory},
	abstract = {We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so have the number of proposed representations of this structure. Existing structures typically fall into two categories; those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimum perfect hashing, carefully organizing our data structure, and making use of succinct representations where applicable, our data structure provides practically fast k-mer lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme built on the same underlying representation, which provides the ability to trade off k-mer query speed for a reduction in the de Bruijn graph index size. We believe this representation strikes a desirable balance between speed and space usage, and it will allow for fast search on large reference sequences. Pufferfish is developed in C++11, is open source (GPL v3), and is available at https://github.com/COMBINE-lab/pufferfish. The scripts used to generate the results in this manuscript are available at https://github.com/COMBINE-lab/pufferfish_experiments.},
	URL = {https://www.biorxiv.org/content/early/2017/09/21/191874},
  note = {(conditionally accepted to ISMB 2018)},
	journal = {bioRxiv}
}

Downloads: 0