On the impact of data integration and edge enrichment in mining significant signals from biological networks

Sean West, Hesham H Ali

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

The influx of high-throughput biotechnologies has resulted in considerable amounts of available and untapped data, useful for both interpretation and extrapolation. Because the noise-to-signal ratio in most biological databases is non-trivial, single-source analysis techniques may suffer from relatively high false-positive and false-negative rates. In addition, use of a single data source does not allow for the discovery of novel relationships that can only be derived from multiple sources. Recently, gene correlation networks have emerged to assist in the discovery of previously unknown genetic relationships and the identification of significant biological functions. Such networks provide a useful mechanism to model experimental results obtained from expression data and capture a snapshot of the expression as well as the temporal changes in various experiments. In addition, Gene Ontology is often integrated with biological networks within the analysis process as a source of domain knowledge. In this project, we evaluate the use of Gene Ontology, not simply as an assessment tool, but as a basic component in building the correlation networks. We implemented a network integration algorithm that uses both gene expression data (experimental knowledge) and Gene Ontology data (domain knowledge) to build a biologically rich correlation model. Then, we analyzed the resulting networks for changes in topology and biological significance. Our main hypothesis is that the integrated networks would reduce the harmful effects of outliers from imperfect data while maintaining the high concentration of network substructures that are likely to reveal novel, biologically significant relationships. In addition, using the concept of "guilt by association", we analyzed the clusters of the integrated networks and found a significant increase in enrichment scores relative to the original networks. We show, through motif and pathway analysis, that integrated networks tend to cluster with higher biological significance.
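
The abstract does not spell out the integration algorithm, so the following is only a minimal sketch of the general idea of blending experimental and domain knowledge into one edge score. Everything in it is an assumption made for illustration rather than the authors' method: the function names, the use of absolute Pearson correlation, the Jaccard overlap of annotation sets as a stand-in for a Gene Ontology similarity measure, the `alpha` weight, the `threshold`, and the toy genes and placeholder term identifiers.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def go_similarity(terms_a, terms_b):
    """Jaccard overlap of two annotation sets (assumed stand-in for a GO similarity)."""
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


def integrated_network(expression, annotations, alpha=0.5, threshold=0.6):
    """Return {(gene_a, gene_b): score} for gene pairs whose blended score passes the threshold.

    score = alpha * |Pearson correlation| + (1 - alpha) * annotation similarity
    (a hypothetical blending rule, chosen only for this sketch).
    """
    edges = {}
    for a, b in combinations(sorted(expression), 2):
        corr = abs(pearson(expression[a], expression[b]))
        sim = go_similarity(annotations.get(a, set()), annotations.get(b, set()))
        score = alpha * corr + (1 - alpha) * sim
        if score >= threshold:
            edges[(a, b)] = score
    return edges


# Toy data: hypothetical genes, expression profiles, and placeholder term identifiers.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated with geneA
    "geneD": [1.0, 3.0, 2.0, 4.0],
}
annotations = {
    "geneA": {"GO:1", "GO:2"},
    "geneB": {"GO:1", "GO:2"},
    "geneC": {"GO:3"},
    "geneD": {"GO:1"},
}

network = integrated_network(expression, annotations)
for (a, b), score in sorted(network.items()):
    print(a, b, round(score, 2))
```

In this toy run, geneC is strongly (anti-)correlated with the other genes but shares no annotations with them, so its edges fall below the threshold. That is the kind of correlation-only edge a blended score can suppress, while edges supported by both sources are kept.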

Original language: English (US)
Title of host publication: ACM BCB 2014 - 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
Publisher: Association for Computing Machinery, Inc
Pages: 760-767
Number of pages: 8
ISBN (Electronic): 9781450328944
DOI: https://doi.org/10.1145/2649387.2660846
State: Published - Sep 20 2014
Event: 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM BCB 2014 - Newport Beach, United States
Duration: Sep 20 2014 to Sep 23 2014

Publication series

Name: ACM BCB 2014 - 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics

Conference

Conference: 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM BCB 2014
Country: United States
City: Newport Beach
Period: 9/20/14 to 9/23/14

Keywords

  • Co-regulation
  • Correlation networks
  • Data integration
  • Gene expression
  • Gene ontology
  • Hubs and clusters

ASJC Scopus subject areas

  • Health Informatics
  • Computer Science Applications
  • Software
  • Biomedical Engineering

Cite this

West, S., & Ali, H. H. (2014). On the impact of data integration and edge enrichment in mining significant signals from biological networks. In ACM BCB 2014 - 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (pp. 760-767). (ACM BCB 2014 - 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics). Association for Computing Machinery, Inc. https://doi.org/10.1145/2649387.2660846
