# KmerEstimate: A Streaming Algorithm for Estimating k-mer Counts with Optimal Space Usage

Sairam Behera, Sutanu Gayen, Jitender S. Deogun, N. V. Vinodchandran

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

### Abstract

The frequency distribution of k-mers (substrings of length k in a DNA/RNA sequence) is very useful for many bioinformatics applications that use next-generation sequencing (NGS) data. Examples include de Bruijn graph based assembly, read error correction, genome size prediction, and digital normalization. In developing tools for such applications, counting (or estimating) k-mers with low frequency is a pre-processing phase. However, computing the k-mer frequency histogram becomes computationally challenging for large-scale genomic data. We present KmerEstimate, a streaming algorithm that approximates the count of k-mers with a given frequency in a genomic data set. Our algorithm is based on a well-known adaptive-sampling streaming algorithm due to Bar-Yossef et al. for approximating the number of distinct elements in a data stream. We implemented and tested our algorithm on several data sets. The results of our algorithm are better than those of other streaming approaches used so far for this problem (notably ntCard, the state-of-the-art streaming approach) and are within a 0.6% error rate. It uses less memory than ntCard, as the sample size is almost 85% less than that of ntCard. In addition, our algorithm has provable approximation and space usage guarantees. We also show certain space complexity lower bounds. The source code of our algorithm is available at https://github.com/srbehera11/KmerEstimate.
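The adaptive sampling idea mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the hash function (`hash64`), the `capacity` parameter, and all function names are assumptions made for illustration. The sketch keeps exact counts only for k-mers whose hash has at least `level` trailing zero bits, raises `level` whenever the sample outgrows `capacity`, and scales the sampled histogram by `2**level`. An exact counter is included as a baseline for comparison.

```python
import hashlib
from collections import Counter


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def exact_histogram(reads, k):
    """Exact baseline: hist[i] = number of distinct k-mers occurring i times."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    return Counter(counts.values())


def hash64(kmer):
    """Deterministic 64-bit hash (hypothetical choice, for illustration only)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def trailing_zeros(x):
    """Number of trailing zero bits of x (64 if x == 0)."""
    return 64 if x == 0 else (x & -x).bit_length() - 1


def estimate_histogram(reads, k, capacity=1024):
    """Estimate the k-mer frequency histogram with adaptive sampling.

    Returns (estimated histogram, final sampling level).
    """
    level = 0
    sample = {}  # sampled k-mer -> exact count since its first occurrence
    for read in reads:
        for km in kmers(read, k):
            if km in sample:
                sample[km] += 1
            elif trailing_zeros(hash64(km)) >= level:
                sample[km] = 1
                # Sample too large: halve the sampling rate and prune.
                while len(sample) > capacity:
                    level += 1
                    sample = {m: c for m, c in sample.items()
                              if trailing_zeros(hash64(m)) >= level}
    scale = 2 ** level
    hist = Counter(sample.values())
    return {freq: n * scale for freq, n in hist.items()}, level
```

Because `level` only ever increases, a k-mer that is still in the sample has been tracked since its first occurrence, so its stored count is exact. When `capacity` is at least the number of distinct k-mers, `level` stays at 0 and the estimate coincides with `exact_histogram`. Smaller capacities trade accuracy for memory.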

| Field | Value |
| --- | --- |
| Original language | English (US) |
| Title of host publication | ACM-BCB 2018 - Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics |
| Publisher | Association for Computing Machinery, Inc |
| Pages | 438-447 |
| Number of pages | 10 |
| ISBN | 9781450357944 |
| DOI | https://doi.org/10.1145/3233547.3233587 |
| Publication status | Published - Aug 15 2018 |
| Event | 9th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM-BCB 2018 - Washington, United States |
| Duration | Aug 29 2018 → Sep 1 2018 |

### Publication series

Name ACM-BCB 2018 - Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics

### Other

Other: 9th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM-BCB 2018; Washington, United States; 8/29/18 → 9/1/18

### Fingerprint

Sample Size
Sampling
Data storage equipment
Genome Size
Error correction
Bioinformatics
Computational Biology
RNA
DNA
Genes
Datasets
Processing

### Keywords

• Genome assembly
• K-mer counting
• Streaming algorithm

### ASJC Scopus subject areas

• Computer Science Applications
• Software
• Health Informatics
• Biomedical Engineering

### Cite this

Behera, S., Gayen, S., Deogun, J. S., & Vinodchandran, N. V. (2018). KmerEstimate: A Streaming Algorithm for Estimating k-mer Counts with Optimal Space Usage. In ACM-BCB 2018 - Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (pp. 438-447). (ACM-BCB 2018 - Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics). Association for Computing Machinery, Inc. https://doi.org/10.1145/3233547.3233587

title = "KmerEstimate: A Streaming Algorithm for Estimating k-mer Counts with Optimal Space Usage",
abstract = "The frequency distribution of k-mers (substrings of length k in a DNA/RNA sequence) is very useful for many bioinformatics applications that use next-generation sequencing (NGS) data. Examples include de Bruijn graph based assembly, read error correction, genome size prediction, and digital normalization. In developing tools for such applications, counting (or estimating) k-mers with low frequency is a pre-processing phase. However, computing the k-mer frequency histogram becomes computationally challenging for large-scale genomic data. We present KmerEstimate, a streaming algorithm that approximates the count of k-mers with a given frequency in a genomic data set. Our algorithm is based on a well-known adaptive-sampling streaming algorithm due to Bar-Yossef et al. for approximating the number of distinct elements in a data stream. We implemented and tested our algorithm on several data sets. The results of our algorithm are better than those of other streaming approaches used so far for this problem (notably ntCard, the state-of-the-art streaming approach) and are within a 0.6{\%} error rate. It uses less memory than ntCard, as the sample size is almost 85{\%} less than that of ntCard. In addition, our algorithm has provable approximation and space usage guarantees. We also show certain space complexity lower bounds. The source code of our algorithm is available at \url{https://github.com/srbehera11/KmerEstimate}.",
keywords = "Genome assembly, K-mer counting, Streaming algorithm",
author = "Sairam Behera and Sutanu Gayen and Deogun, {Jitender S.} and Vinodchandran, {N. V.}",
year = "2018",
month = "8",
day = "15",
doi = "10.1145/3233547.3233587",
language = "English (US)",
series = "ACM-BCB 2018 - Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics",
publisher = "Association for Computing Machinery, Inc",
pages = "438--447",
booktitle = "ACM-BCB 2018 - Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics",

}

TY - GEN

T1 - KmerEstimate

T2 - A Streaming Algorithm for Estimating k-mer Counts with Optimal Space Usage

AU - Behera, Sairam

AU - Gayen, Sutanu

AU - Deogun, Jitender S.

AU - Vinodchandran, N. V.

PY - 2018/8/15

Y1 - 2018/8/15

AB - The frequency distribution of k-mers (substrings of length k in a DNA/RNA sequence) is very useful for many bioinformatics applications that use next-generation sequencing (NGS) data. Examples include de Bruijn graph based assembly, read error correction, genome size prediction, and digital normalization. In developing tools for such applications, counting (or estimating) k-mers with low frequency is a pre-processing phase. However, computing the k-mer frequency histogram becomes computationally challenging for large-scale genomic data. We present KmerEstimate, a streaming algorithm that approximates the count of k-mers with a given frequency in a genomic data set. Our algorithm is based on a well-known adaptive-sampling streaming algorithm due to Bar-Yossef et al. for approximating the number of distinct elements in a data stream. We implemented and tested our algorithm on several data sets. The results of our algorithm are better than those of other streaming approaches used so far for this problem (notably ntCard, the state-of-the-art streaming approach) and are within a 0.6% error rate. It uses less memory than ntCard, as the sample size is almost 85% less than that of ntCard. In addition, our algorithm has provable approximation and space usage guarantees. We also show certain space complexity lower bounds. The source code of our algorithm is available at https://github.com/srbehera11/KmerEstimate.

KW - Genome assembly

KW - K-mer counting

KW - Streaming algorithm

UR - http://www.scopus.com/inward/record.url?scp=85056109624&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85056109624&partnerID=8YFLogxK

U2 - 10.1145/3233547.3233587

DO - 10.1145/3233547.3233587

M3 - Conference contribution

AN - SCOPUS:85056109624

T3 - ACM-BCB 2018 - Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics

SP - 438

EP - 447

BT - ACM-BCB 2018 - Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics

PB - Association for Computing Machinery, Inc

ER -