A tolerance graph approach for domain-specific assembly of next generation sequencing data

Julia Warnke, Hesham H Ali

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Next generation sequencing (NGS) has become a major focus in many recent biological research applications. NGS produces thousands to millions of short DNA fragments in a single run. Individually, these fragments represent only a small fraction of an original biological sample. To obtain any useful information, overlapping fragments must be assembled into long stretches of contiguous sequence. Various assemblers have been developed to address the fragment assembly problem. The majority of current assemblers were developed to fill an important gap; however, they were developed with a purely computational focus, without taking the properties of the input datasets into consideration. NGS dataset characteristics such as fragment coverage and underlying genome complexity vary dramatically between different sequencing applications. Generic, data-independent assemblers are unlikely to produce accurate solutions in all problem domains. In this study, we propose a graph-theoretic approach based on the concept of tolerance graphs to develop a domain-specific assembler. The proposed assembler is designed to extract signals associated with local features in the input dataset and reintegrate this knowledge into the assembly process through customized tolerance graph parameters. We conducted a number of experiments to study the impact of various input parameters on the quality of the assembled genomes. Results from this study show that the proposed assembler produces excellent results and outperforms other known assembly algorithms for some input datasets. This approach also presents the foundation for developing domain-specific assemblers to be applied in an intelligent and customized manner to a wide variety of input instances, resulting in more efficient assembly tactics and improved overall assembly quality.
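
The abstract does not spell out how the tolerance graph is built, so the following is only a minimal sketch of the general tolerance graph concept, not the authors' assembler. It assumes the standard definition: each vertex has an interval and a positive tolerance, and two vertices are adjacent when the length of their interval intersection is at least the smaller of their two tolerances. The `Read` class, the read coordinates, and the tolerance values are hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Read:
    """Hypothetical read placed on a coordinate axis (illustration only)."""
    name: str
    start: int      # inclusive start coordinate
    end: int        # exclusive end coordinate
    tolerance: int  # positive tolerance attached to this read


def overlap_length(a: Read, b: Read) -> int:
    """Length of the intersection of the two read intervals (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def tolerance_graph_edges(reads):
    """Return the edge set of the tolerance graph over the given reads.

    Two reads are joined when their overlap is at least the smaller
    of their two tolerances (standard tolerance graph definition).
    """
    for r in reads:
        if r.tolerance <= 0:
            raise ValueError(f"tolerance of {r.name} must be positive")
    edges = set()
    for a, b in combinations(reads, 2):
        if overlap_length(a, b) >= min(a.tolerance, b.tolerance):
            edges.add(frozenset((a.name, b.name)))
    return edges


if __name__ == "__main__":
    reads = [
        Read("r1", 0, 100, 20),
        Read("r2", 80, 180, 30),
        Read("r3", 170, 270, 30),
    ]
    for edge in sorted(tuple(sorted(e)) for e in tolerance_graph_edges(reads)):
        print(edge)
```

In this toy input, r1 and r2 overlap by 20 positions, which meets the smaller tolerance of 20, so they are joined; r2 and r3 overlap by only 10 positions against a tolerance of 30, so they are not. Lowering the tolerance of r3 to 10 would add that edge, which illustrates how per-read tolerance parameters could, in principle, make the overlap graph stricter or more permissive for different datasets.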

Original language: English (US)
Title of host publication: Proceedings - IEEE 13th International Conference on Data Mining Workshops, ICDMW 2013
Publisher: IEEE Computer Society
Pages: 88-95
Number of pages: 8
DOI: https://doi.org/10.1109/ICDMW.2013.105
State: Published - 2013
Event: 2013 13th IEEE International Conference on Data Mining Workshops, ICDMW 2013 - Dallas, TX
Duration: Dec 7 2013 to Dec 10 2013

Other

Other: 2013 13th IEEE International Conference on Data Mining Workshops, ICDMW 2013
City: Dallas, TX
Period: 12/7/13 to 12/10/13

Keywords

  • Graph theory
  • Knowledge-based genome assembly
  • Next generation sequencing
  • Tolerance graph

ASJC Scopus subject areas

  • Software

Cite this

Warnke, J., & Ali, H. H. (2013). A tolerance graph approach for domain-specific assembly of next generation sequencing data. In Proceedings - IEEE 13th International Conference on Data Mining Workshops, ICDMW 2013 (pp. 88-95). [6753907] IEEE Computer Society. https://doi.org/10.1109/ICDMW.2013.105

@inproceedings{3de3725b88164e8da484e5a72f6fda37,
title = "A tolerance graph approach for domain-specific assembly of next generation sequencing data",
keywords = "Graph theory, Knowledge-based genome assembly, Next generation sequencing, Tolerance graph",
author = "Julia Warnke and Ali, {Hesham H}",
year = "2013",
doi = "10.1109/ICDMW.2013.105",
language = "English (US)",
pages = "88--95",
booktitle = "Proceedings - IEEE 13th International Conference on Data Mining Workshops, ICDMW 2013",
publisher = "IEEE Computer Society",

}

TY - GEN

T1 - A tolerance graph approach for domain-specific assembly of next generation sequencing data

AU - Warnke, Julia

AU - Ali, Hesham H

PY - 2013

Y1 - 2013

KW - Graph theory

KW - Knowledge-based genome assembly

KW - Next generation sequencing

KW - Tolerance graph

UR - http://www.scopus.com/inward/record.url?scp=84898026457&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84898026457&partnerID=8YFLogxK

U2 - 10.1109/ICDMW.2013.105

DO - 10.1109/ICDMW.2013.105

M3 - Conference contribution

AN - SCOPUS:84898026457

SP - 88

EP - 95

BT - Proceedings - IEEE 13th International Conference on Data Mining Workshops, ICDMW 2013

PB - IEEE Computer Society

ER -