Alignment-free genetic sequence comparisons: A review of recent approaches by word analysis

Oliver Bonham-Carter, Joe Steele, Dhundy Bastola

Research output: Contribution to journal › Article

66 Citations (Scopus)

Abstract

Modern sequencing and genome assembly technologies have provided a wealth of data, which will soon require an analysis by comparison for discovery. Sequence alignment, a fundamental task in bioinformatics research, may be used but with some caveats. Seminal techniques and methods from dynamic programming are proving ineffective for this work owing to their inherent computational expense when processing large amounts of sequence data. These methods are prone to giving misleading information because of genetic recombination, genetic shuffling and other inherent biological events. New approaches from information theory, frequency analysis and data compression are available and provide powerful alternatives to dynamic programming. These new methods are often preferred, as their algorithms are simpler and are not affected by synteny-related problems. In this review, we provide a detailed discussion of computational tools, which stem from alignment-free methods based on statistical analysis from word frequencies. We provide several clear examples to demonstrate applications and the interpretations over several different areas of alignment-free analysis such as base-base correlations, feature frequency profiles, compositional vectors, an improved string composition and the D2 statistic metric. Additionally, we provide detailed discussion and an example of analysis by Lempel-Ziv techniques from data compression.
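As a rough illustration of the word-frequency idea the abstract refers to, the following Python sketch counts overlapping words (k-mers) in two sequences and computes a D2-style score as the sum, over all words, of the product of the two counts. The function names `word_counts` and `d2_statistic`, the word length, and the sample sequences are assumptions made for this example; they are not taken from the reviewed article, which may define or normalize the statistic differently.

```python
from collections import Counter


def word_counts(seq: str, k: int) -> Counter:
    """Count every overlapping word (k-mer) of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_statistic(seq_a: str, seq_b: str, k: int) -> int:
    """Sum over all words w of count_a(w) * count_b(w).

    This equals the number of k-word matches between the two
    sequences; no alignment of the sequences is required.
    """
    counts_a = word_counts(seq_a, k)
    counts_b = word_counts(seq_b, k)
    # Counter returns 0 for words absent from seq_b.
    return sum(n * counts_b[word] for word, n in counts_a.items())


if __name__ == "__main__":
    # Hypothetical toy sequences, chosen only for illustration.
    print(d2_statistic("ACGTACGT", "CGTACGTA", k=3))
```

Because the score depends only on word counts, it is unaffected by the order in which shared segments appear, which is the property the abstract contrasts with alignment-based methods. Raw scores grow with sequence length, so in practice they are usually normalized before comparison.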

Original language: English (US)
Pages (from-to): 890-905
Number of pages: 16
Journal: Briefings in Bioinformatics
Volume: 15
Issue number: 6
DOI: 10.1093/bib/bbt052
State: Published - Aug 2 2013

Fingerprint

Data compression
Dynamic programming
Information theory
Bioinformatics
Synteny
Sequence Alignment
Statistical methods
Genes
Computational Biology
Statistics
Genetic Recombination
Genome
Processing
Technology
Chemical analysis
Research

Keywords

  • Alignment-free
  • Information theory
  • Sequence-alignment
  • Word-analysis

ASJC Scopus subject areas

  • Information Systems
  • Molecular Biology

Cite this

Alignment-free genetic sequence comparisons: A review of recent approaches by word analysis. / Bonham-Carter, Oliver; Steele, Joe; Bastola, Dhundy.

In: Briefings in Bioinformatics, Vol. 15, No. 6, 02.08.2013, p. 890-905.


@article{110fed67fa364ab7bd833178f19140ee,
title = "Alignment-free genetic sequence comparisons: A review of recent approaches by word analysis",
abstract = "Modern sequencing and genome assembly technologies have provided a wealth of data, which will soon require an analysis by comparison for discovery. Sequence alignment, a fundamental task in bioinformatics research, may be used but with some caveats. Seminal techniques and methods from dynamic programming are proving ineffective for this work owing to their inherent computational expense when processing large amounts of sequence data. These methods are prone to giving misleading information because of genetic recombination, genetic shuffling and other inherent biological events. New approaches from information theory, frequency analysis and data compression are available and provide powerful alternatives to dynamic programming. These new methods are often preferred, as their algorithms are simpler and are not affected by synteny-related problems. In this review, we provide a detailed discussion of computational tools, which stem from alignment-free methods based on statistical analysis from word frequencies. We provide several clear examples to demonstrate applications and the interpretations over several different areas of alignment-free analysis such as base-base correlations, feature frequency profiles, compositional vectors, an improved string composition and the D2 statistic metric. Additionally, we provide detailed discussion and an example of analysis by Lempel-Ziv techniques from data compression.",
keywords = "Alignment-free, Information theory, Sequence-alignment, Word-analysis",
author = "Oliver Bonham-Carter and Joe Steele and Dhundy Bastola",
year = "2013",
month = "8",
day = "2",
doi = "10.1093/bib/bbt052",
language = "English (US)",
volume = "15",
pages = "890--905",
journal = "Briefings in Bioinformatics",
issn = "1467-5463",
publisher = "Oxford University Press",
number = "6",

}

TY - JOUR

T1 - Alignment-free genetic sequence comparisons

T2 - A review of recent approaches by word analysis

AU - Bonham-Carter, Oliver

AU - Steele, Joe

AU - Bastola, Dhundy

PY - 2013/8/2

Y1 - 2013/8/2

N2 - Modern sequencing and genome assembly technologies have provided a wealth of data, which will soon require an analysis by comparison for discovery. Sequence alignment, a fundamental task in bioinformatics research, may be used but with some caveats. Seminal techniques and methods from dynamic programming are proving ineffective for this work owing to their inherent computational expense when processing large amounts of sequence data. These methods are prone to giving misleading information because of genetic recombination, genetic shuffling and other inherent biological events. New approaches from information theory, frequency analysis and data compression are available and provide powerful alternatives to dynamic programming. These new methods are often preferred, as their algorithms are simpler and are not affected by synteny-related problems. In this review, we provide a detailed discussion of computational tools, which stem from alignment-free methods based on statistical analysis from word frequencies. We provide several clear examples to demonstrate applications and the interpretations over several different areas of alignment-free analysis such as base-base correlations, feature frequency profiles, compositional vectors, an improved string composition and the D2 statistic metric. Additionally, we provide detailed discussion and an example of analysis by Lempel-Ziv techniques from data compression.

AB - Modern sequencing and genome assembly technologies have provided a wealth of data, which will soon require an analysis by comparison for discovery. Sequence alignment, a fundamental task in bioinformatics research, may be used but with some caveats. Seminal techniques and methods from dynamic programming are proving ineffective for this work owing to their inherent computational expense when processing large amounts of sequence data. These methods are prone to giving misleading information because of genetic recombination, genetic shuffling and other inherent biological events. New approaches from information theory, frequency analysis and data compression are available and provide powerful alternatives to dynamic programming. These new methods are often preferred, as their algorithms are simpler and are not affected by synteny-related problems. In this review, we provide a detailed discussion of computational tools, which stem from alignment-free methods based on statistical analysis from word frequencies. We provide several clear examples to demonstrate applications and the interpretations over several different areas of alignment-free analysis such as base-base correlations, feature frequency profiles, compositional vectors, an improved string composition and the D2 statistic metric. Additionally, we provide detailed discussion and an example of analysis by Lempel-Ziv techniques from data compression.

KW - Alignment-free

KW - Information theory

KW - Sequence-alignment

KW - Word-analysis

UR - http://www.scopus.com/inward/record.url?scp=84913590574&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84913590574&partnerID=8YFLogxK

U2 - 10.1093/bib/bbt052

DO - 10.1093/bib/bbt052

M3 - Article

C2 - 23904502

AN - SCOPUS:84913590574

VL - 15

SP - 890

EP - 905

JO - Briefings in Bioinformatics

JF - Briefings in Bioinformatics

SN - 1467-5463

IS - 6

ER -