Computational analysis of gene identification with SAGE

Terry Clark, Sanggyu Lee, L. Ridgway Scott, San Ming Wang

Research output: Contribution to journal › Article

11 Citations (Scopus)

Abstract

SAGE is one of the few techniques capable of uniformly probing gene expression at a genome level irrespective of mRNA abundance and without a priori knowledge of the transcripts present. However, individual SAGE tags can match many sequences in the reference database, complicating gene identification. We perform a baseline evaluation of gene identification with SAGE using UniGene Human as the reference database by analyzing 1) the distributions of tags for various length tag sets formed for UniGene Human and 2) the tag-to-sequence mapping using a SAGE tag set consisting of 37,522 tags derived from human myeloid cells. The extensive multiplicity of the dbEST component of UniGene significantly detracts from gains that might be expected by extending tags within the scope of the SAGE protocol. In order to achieve reasonable sequence specificity for gene identification with the content of the commonly used UniGene sequence collection, tags on the order of hundreds of bases in length are required. One way to produce tags of such lengths is with GLGI, which extends SAGE tags to the 3′ end of cDNA. We show that the longer sequences produced by GLGI relieve significantly the multiple match condition. In the myeloid sample, we also found a correlation between multiple match severity and high copy number. We extrapolate these findings, providing insights into the use of UniGene Human as a reference for gene identification.
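The multiple-match condition described in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: it assumes the conventional SAGE setup in which a tag is the run of bases immediately 3′ of the last NlaIII site (CATG) in a transcript, and it uses three made-up toy sequences rather than UniGene data. The function names (`extract_tag`, `tag_multiplicity`, `fraction_unique`) are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Assumption: NlaIII (recognition site CATG) is the anchoring enzyme,
# as in the conventional SAGE protocol.
ANCHOR = "CATG"


def extract_tag(seq, tag_len):
    """Return the tag_len bases after the 3'-most anchor site, or None.

    Assumes seq is the sense strand written 5' -> 3'.
    """
    seq = seq.upper()
    pos = seq.rfind(ANCHOR)
    if pos == -1:
        return None  # no anchor site, so no tag can be formed
    start = pos + len(ANCHOR)
    tag = seq[start:start + tag_len]
    # Discard tags that run off the end of the sequence.
    return tag if len(tag) == tag_len else None


def tag_multiplicity(sequences, tag_len):
    """Map each tag to the set of sequence ids that produce it."""
    hits = defaultdict(set)
    for seq_id, seq in sequences.items():
        tag = extract_tag(seq, tag_len)
        if tag is not None:
            hits[tag].add(seq_id)
    return hits


def fraction_unique(sequences, tag_len):
    """Fraction of distinct tags that map to exactly one sequence."""
    hits = tag_multiplicity(sequences, tag_len)
    if not hits:
        return 0.0
    unique = sum(1 for ids in hits.values() if len(ids) == 1)
    return unique / len(hits)


# Hypothetical toy sequences (not real transcripts).
toy = {
    "seqA": "GGCATGAAACCCGGGTTTACGT",
    "seqB": "TTCATGAAACCCGGGTTAGGCA",
    "seqC": "ACCATGTTTGGGCCCAAATTTG",
}

for length in (10, 14):
    print(length, fraction_unique(toy, length))
```

In this toy set, seqA and seqB share the same 10-base tag but separate once the tag is extended to 14 bases, which is the effect of tag length on specificity that the abstract evaluates at database scale. Real reference collections such as the dbEST component of UniGene contain far more redundancy, so the gain from a few extra bases there is much smaller than this example suggests.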

Original language: English (US)
Pages (from-to): 513-526
Number of pages: 14
Journal: Journal of Computational Biology
Volume: 9
Issue number: 3
DOI: 10.1089/106652702760138600
State: Published - Jan 1 2002

Keywords

  • GLGI
  • Gene expression
  • Gene identification
  • SAGE
  • Sequence distribution

ASJC Scopus subject areas

  • Modeling and Simulation
  • Molecular Biology
  • Genetics
  • Computational Mathematics
  • Computational Theory and Mathematics

Cite this

Computational analysis of gene identification with SAGE. / Clark, Terry; Lee, Sanggyu; Scott, L. Ridgway; Wang, San Ming.

In: Journal of Computational Biology, Vol. 9, No. 3, 01.01.2002, p. 513-526.

@article{e0eec227ff90435fa58166df655437ae,
title = "Computational analysis of gene identification with SAGE",
abstract = "SAGE is one of the few techniques capable of uniformly probing gene expression at a genome level irrespective of mRNA abundance and without a priori knowledge of the transcripts present. However, individual SAGE tags can match many sequences in the reference database, complicating gene identification. We perform a baseline evaluation of gene identification with SAGE using UniGene Human as the reference database by analyzing 1) the distributions of tags for various length tag sets formed for UniGene Human and 2) the tag-to-sequence mapping using a SAGE tag set consisting of 37,522 tags derived from human myeloid cells. The extensive multiplicity of the dbEST component of UniGene significantly detracts from gains that might be expected by extending tags within the scope of the SAGE protocol. In order to achieve reasonable sequence specificity for gene identification with the content of the commonly used UniGene sequence collection, tags on the order of hundreds of bases in length are required. One way to produce tags of such lengths is with GLGI, which extends SAGE tags to the 3′ end of cDNA. We show that the longer sequences produced by GLGI relieve significantly the multiple match condition. In the myeloid sample, we also found a correlation between multiple match severity and high copy number. We extrapolate these findings, providing insights into the use of UniGene Human as a reference for gene identification.",
keywords = "GLGI, Gene expression, Gene identification, SAGE, Sequence distribution",
author = "Clark, Terry and Lee, Sanggyu and Scott, {L. Ridgway} and Wang, {San Ming}",
year = "2002",
month = "1",
day = "1",
doi = "10.1089/106652702760138600",
language = "English (US)",
volume = "9",
pages = "513--526",
journal = "Journal of Computational Biology",
issn = "1066-5277",
publisher = "Mary Ann Liebert Inc.",
number = "3",
}

TY - JOUR

T1 - Computational analysis of gene identification with SAGE

AU - Clark, Terry

AU - Lee, Sanggyu

AU - Scott, L. Ridgway

AU - Wang, San Ming

PY - 2002/1/1

Y1 - 2002/1/1

N2 - SAGE is one of the few techniques capable of uniformly probing gene expression at a genome level irrespective of mRNA abundance and without a priori knowledge of the transcripts present. However, individual SAGE tags can match many sequences in the reference database, complicating gene identification. We perform a baseline evaluation of gene identification with SAGE using UniGene Human as the reference database by analyzing 1) the distributions of tags for various length tag sets formed for UniGene Human and 2) the tag-to-sequence mapping using a SAGE tag set consisting of 37,522 tags derived from human myeloid cells. The extensive multiplicity of the dbEST component of UniGene significantly detracts from gains that might be expected by extending tags within the scope of the SAGE protocol. In order to achieve reasonable sequence specificity for gene identification with the content of the commonly used UniGene sequence collection, tags on the order of hundreds of bases in length are required. One way to produce tags of such lengths is with GLGI, which extends SAGE tags to the 3′ end of cDNA. We show that the longer sequences produced by GLGI relieve significantly the multiple match condition. In the myeloid sample, we also found a correlation between multiple match severity and high copy number. We extrapolate these findings, providing insights into the use of UniGene Human as a reference for gene identification.

KW - GLGI

KW - Gene expression

KW - Gene identification

KW - SAGE

KW - Sequence distribution

UR - http://www.scopus.com/inward/record.url?scp=0035991755&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0035991755&partnerID=8YFLogxK

U2 - 10.1089/106652702760138600

DO - 10.1089/106652702760138600

M3 - Article

C2 - 12162890

AN - SCOPUS:0035991755

VL - 9

SP - 513

EP - 526

JO - Journal of Computational Biology

JF - Journal of Computational Biology

SN - 1066-5277

IS - 3

ER -