Performance comparison and an ensemble approach of transcriptome assembly

Sairam Behera, Adam Voshall, Jitender S. Deogun, Etsuko N. Moriyama

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Accurate transcriptome assembly using next-generation sequencing data is crucial in gene expression analysis. However, different assemblers have been observed to generate significantly different outputs from the same RNA-Seq data. Even the same method often assembles different sets of transcripts when different parameter sets are used. In this study, we performed a comparative analysis of transcriptome assemblers, including four de novo and three genome-guided methods, using simulated RNA-Seq data modeling Illumina HiSeq sequencing of the Arabidopsis thaliana and Zea mays strain B73 transcriptomes. No assembler was able to reconstruct all of the reference transcripts correctly. A large fraction (∼30%) of transcripts were not assembled correctly by any assembler. Furthermore, each assembler recovered a different set of reference transcripts, with very few common to all. While the de novo tools correctly assembled about as many transcripts as the genome-guided tools for one dataset, they also produced far more incorrectly assembled transcripts. These results indicate that considerable room remains for improving transcriptome assembly. Therefore, we further investigated a consensus-based ensemble approach. By taking the consensus contig set shared, for example, among three or more de novo assemblers, 10% more transcripts were correctly identified for the Arabidopsis thaliana datasets. The incorrect-to-correct contig ratio ranged from 4.9 (Trinity) to 10.7 (SOAPdenovo) for the de novo assemblers and from 1.3 to 1.7 for the genome-guided methods. Using the consensus de novo method, we reduced the ratio to a level very close to or even lower than those obtained by the genome-guided methods (1.5). The results of this study provide a direction for building a better ensemble approach that can reconstruct all the correct transcripts.
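The consensus step and the incorrect-to-correct ratio described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not state how contigs from different assemblers are matched, so the sketch assumes exact sequence identity (no reverse-complement or partial-overlap handling). The function names, assembler labels, and toy sequences are all hypothetical.

```python
from collections import Counter


def consensus_contigs(assemblies, min_support=3):
    """Return the contigs reported by at least `min_support` assemblers.

    `assemblies` maps an assembler label to an iterable of contig sequences.
    Assumption: two contigs count as "shared" only if their sequences are
    identical strings.
    """
    support = Counter()
    for contigs in assemblies.values():
        # Deduplicate so each assembler votes at most once per sequence.
        support.update(set(contigs))
    return {seq for seq, votes in support.items() if votes >= min_support}


def incorrect_to_correct_ratio(contigs, reference):
    """Ratio of contigs absent from the reference to contigs present in it."""
    contigs = set(contigs)
    reference = set(reference)
    correct = len(contigs & reference)
    incorrect = len(contigs - reference)
    if correct == 0:
        raise ValueError("no correct contigs; ratio is undefined")
    return incorrect / correct


# Toy data (hypothetical): four assemblers and a three-transcript reference.
toy_assemblies = {
    "asm_a": ["ATGC", "GGCC", "TTAA"],
    "asm_b": ["ATGC", "GGCC", "CCCC"],
    "asm_c": ["ATGC", "GGCC", "AAAA"],
    "asm_d": ["ATGC", "TTAA", "GGGG"],
}
toy_reference = {"ATGC", "GGCC", "TTTT"}

toy_consensus = consensus_contigs(toy_assemblies, min_support=3)
toy_single_ratio = incorrect_to_correct_ratio(toy_assemblies["asm_a"], toy_reference)
toy_consensus_ratio = incorrect_to_correct_ratio(toy_consensus, toy_reference)
```

In this toy case, contigs seen by only one or two assemblers are filtered out, so the consensus set has a lower incorrect-to-correct ratio than the single assembler `asm_a`. Real data would likely need a more tolerant matching rule than exact string identity.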

Original language: English (US)
Title of host publication: Proceedings - 2017 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2017
Editors: Illhoi Yoo, Jane Huiru Zheng, Yang Gong, Xiaohua Tony Hu, Chi-Ren Shyu, Yana Bromberg, Jean Gao, Dmitry Korkin
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 2226-2228
Number of pages: 3
ISBN (Electronic): 9781509030491
DOIs: https://doi.org/10.1109/BIBM.2017.8218005
State: Published - Dec 15 2017
Event: 2017 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2017 - Kansas City, United States
Duration: Nov 13 2017 to Nov 16 2017

Publication series

Name: Proceedings - 2017 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2017
Volume: 2017-January

Other

Other: 2017 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2017
Country: United States
City: Kansas City
Period: 11/13/17 to 11/16/17

Fingerprint

Transcriptome
Genes
Genome
RNA
Arabidopsis
Gene expression
Gene Expression Profiling
Data structures
Zea mays
Gene Expression

Keywords

  • Ensemble Method
  • Transcriptome Assembly

ASJC Scopus subject areas

  • Biomedical Engineering
  • Health Informatics

Cite this

Behera, S., Voshall, A., Deogun, J. S., & Moriyama, E. N. (2017). Performance comparison and an ensemble approach of transcriptome assembly. In I. Yoo, J. H. Zheng, Y. Gong, X. T. Hu, C-R. Shyu, Y. Bromberg, J. Gao, ... D. Korkin (Eds.), Proceedings - 2017 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2017 (pp. 2226-2228). (Proceedings - 2017 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2017; Vol. 2017-January). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/BIBM.2017.8218005

@inproceedings{a1e766fda1624b9cb908586512a9ba89,
title = "Performance comparison and an ensemble approach of transcriptome assembly",
abstract = "Accurate transcriptome assembly using next-generation sequencing data is crucial in gene expression analysis. However, different assemblers have been observed to generate significantly different outputs from the same RNA-Seq data. Even the same method often assembles different sets of transcripts when different parameter sets are used. In this study, we performed a comparative analysis of transcriptome assemblers, including four de novo and three genome-guided methods, using simulated RNA-Seq data modeling Illumina HiSeq sequencing of the Arabidopsis thaliana and Zea mays strain B73 transcriptomes. No assembler was able to reconstruct all of the reference transcripts correctly. A large fraction (∼30{\%}) of transcripts were not assembled correctly by any assembler. Furthermore, each assembler recovered a different set of reference transcripts, with very few common to all. While the de novo tools correctly assembled about as many transcripts as the genome-guided tools for one dataset, they also produced far more incorrectly assembled transcripts. These results indicate that considerable room remains for improving transcriptome assembly. Therefore, we further investigated a consensus-based ensemble approach. By taking the consensus contig set shared, for example, among three or more de novo assemblers, 10{\%} more transcripts were correctly identified for the Arabidopsis thaliana datasets. The incorrect-to-correct contig ratio ranged from 4.9 (Trinity) to 10.7 (SOAPdenovo) for the de novo assemblers and from 1.3 to 1.7 for the genome-guided methods. Using the consensus de novo method, we reduced the ratio to a level very close to or even lower than those obtained by the genome-guided methods (1.5). The results of this study provide a direction for building a better ensemble approach that can reconstruct all the correct transcripts.",
keywords = "Ensemble Method, Transcriptome Assembly",
author = "Sairam Behera and Adam Voshall and Deogun, {Jitender S.} and Moriyama, {Etsuko N.}",
year = "2017",
month = "12",
day = "15",
doi = "10.1109/BIBM.2017.8218005",
language = "English (US)",
series = "Proceedings - 2017 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2017",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "2226--2228",
editor = "Illhoi Yoo and Zheng, {Jane Huiru} and Yang Gong and Hu, {Xiaohua Tony} and Chi-Ren Shyu and Yana Bromberg and Jean Gao and Dmitry Korkin",
booktitle = "Proceedings - 2017 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2017",

}

TY - GEN

T1 - Performance comparison and an ensemble approach of transcriptome assembly

AU - Behera, Sairam

AU - Voshall, Adam

AU - Deogun, Jitender S.

AU - Moriyama, Etsuko N.

PY - 2017/12/15

Y1 - 2017/12/15

AB - Accurate transcriptome assembly using next-generation sequencing data is crucial in gene expression analysis. However, different assemblers have been observed to generate significantly different outputs from the same RNA-Seq data. Even the same method often assembles different sets of transcripts when different parameter sets are used. In this study, we performed a comparative analysis of transcriptome assemblers, including four de novo and three genome-guided methods, using simulated RNA-Seq data modeling Illumina HiSeq sequencing of the Arabidopsis thaliana and Zea mays strain B73 transcriptomes. No assembler was able to reconstruct all of the reference transcripts correctly. A large fraction (∼30%) of transcripts were not assembled correctly by any assembler. Furthermore, each assembler recovered a different set of reference transcripts, with very few common to all. While the de novo tools correctly assembled about as many transcripts as the genome-guided tools for one dataset, they also produced far more incorrectly assembled transcripts. These results indicate that considerable room remains for improving transcriptome assembly. Therefore, we further investigated a consensus-based ensemble approach. By taking the consensus contig set shared, for example, among three or more de novo assemblers, 10% more transcripts were correctly identified for the Arabidopsis thaliana datasets. The incorrect-to-correct contig ratio ranged from 4.9 (Trinity) to 10.7 (SOAPdenovo) for the de novo assemblers and from 1.3 to 1.7 for the genome-guided methods. Using the consensus de novo method, we reduced the ratio to a level very close to or even lower than those obtained by the genome-guided methods (1.5). The results of this study provide a direction for building a better ensemble approach that can reconstruct all the correct transcripts.

KW - Ensemble Method

KW - Transcriptome Assembly

UR - http://www.scopus.com/inward/record.url?scp=85045975530&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85045975530&partnerID=8YFLogxK

U2 - 10.1109/BIBM.2017.8218005

DO - 10.1109/BIBM.2017.8218005

M3 - Conference contribution

AN - SCOPUS:85045975530

T3 - Proceedings - 2017 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2017

SP - 2226

EP - 2228

BT - Proceedings - 2017 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2017

A2 - Yoo, Illhoi

A2 - Zheng, Jane Huiru

A2 - Gong, Yang

A2 - Hu, Xiaohua Tony

A2 - Shyu, Chi-Ren

A2 - Bromberg, Yana

A2 - Gao, Jean

A2 - Korkin, Dmitry

PB - Institute of Electrical and Electronics Engineers Inc.

ER -