Parallel NGS assembly using distributed assembly graphs enriched with biological knowledge

Julia D. Warnke-Sommer, Hesham H. Ali

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

High performance computing has become essential for many biomedical applications as the production of biological data continues to increase. Next Generation Sequencing (NGS) technologies are capable of producing millions to even billions of short DNA fragments called reads. These short reads are assembled into larger sequences called contigs by graph theoretic software tools called assemblers. High performance computing has been applied to reduce the computational burden of several steps of the NGS data assembly process. Several parallel de Bruijn graph assemblers rely on a distributed assembly graph. However, the majority of assemblers that utilize distributed assembly graphs do not take the input properties of the data set into consideration to improve the graph partitioning process. Furthermore, the graph theoretic foundation for the majority of these assemblers is a distributed de Bruijn graph. In this paper, we introduce a distributed overlap graph based model upon which our parallel assembler Focus is built. The contribution of this paper is three-fold. First, we demonstrate that the application of data specific knowledge regarding the inherent linearity of DNA sequences can be used to improve the partitioning processes for distributing the assembly graph. Second, we implement several parallel graph algorithms for assembly with greatly improved speedup. Finally, we demonstrate that for metagenomics datasets, the graph partitioning provides insights into the structure of the microbial community.
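To make the overlap graph idea in the abstract concrete, the following is a minimal, single-process sketch of how reads can be linked by suffix-prefix overlaps. It is not the Focus implementation, which is distributed and is not described in detail here. The function names, the `min_len` threshold, and the toy reads are all assumptions chosen for illustration.

```python
from itertools import permutations


def suffix_prefix_overlap(a, b, min_len):
    """Return the length of the longest suffix of `a` that equals a prefix
    of `b`, provided it is at least `min_len` long; otherwise return 0."""
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def build_overlap_graph(reads, min_len=3):
    """Build a directed overlap graph as a dict of dicts.

    graph[a][b] = k means the last k bases of read `a` match the first k
    bases of read `b`. This all-pairs comparison is quadratic in the number
    of reads and is meant only as an illustration (hypothetical helper).
    """
    graph = {read: {} for read in reads}
    for a, b in permutations(reads, 2):
        k = suffix_prefix_overlap(a, b, min_len)
        if k:
            graph[a][b] = k
    return graph


# Toy reads invented for this example; real NGS reads are far more numerous.
reads = ["ACGTAC", "TACGGA", "GGATTC"]
graph = build_overlap_graph(reads)
```

In this toy input the reads form a single chain, which loosely mirrors the linearity of DNA sequences that the abstract says can guide graph partitioning; how Focus actually exploits that property is not specified in this record.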

Original language: English (US)
Title of host publication: Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 273-282
Number of pages: 10
ISBN (Electronic): 9781538634080
DOIs: https://doi.org/10.1109/IPDPSW.2017.143
State: Published - Jun 30 2017
Event: 31st IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017 - Orlando, United States
Duration: May 29 2017 → Jun 2 2017

Publication series

Name: Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017

Other

Other: 31st IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017
Country: United States
City: Orlando
Period: 5/29/17 → 6/2/17

Fingerprint

DNA sequences
DNA

Keywords

  • algorithms
  • assembly graph
  • high performance computing
  • next generation sequencing

ASJC Scopus subject areas

  • Hardware and Architecture
  • Computer Networks and Communications
  • Information Systems

Cite this

Warnke-Sommer, J. D., & Ali, H. H. (2017). Parallel NGS assembly using distributed assembly graphs enriched with biological knowledge. In Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017 (pp. 273-282). [7965056] (Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/IPDPSW.2017.143
