Comparing normalization methods and the impact of noise

Thao Vu, Eli Riekeberg, Yumou Qiu, Robert Powers

Research output: Contribution to journal › Article

Abstract

Introduction: Failure to properly account for normal systematic variations in OMICS datasets may result in misleading biological conclusions. Accordingly, normalization is a necessary step in the proper preprocessing of OMICS datasets. In this regard, an optimal normalization method will effectively reduce unwanted biases and increase the accuracy of downstream quantitative analyses. However, it is currently unclear which normalization method is best, since each algorithm addresses systematic noise in a different way.

Objective: To determine an optimal normalization method for the preprocessing of metabolomics datasets.

Methods: Nine MVAPACK normalization algorithms were compared using simulated and experimental NMR spectra modified with added Gaussian noise and random dilution factors. The methods were evaluated on their ability to recover the intensities of the true spectral peaks and on the reproducibility of the true classifying features from an orthogonal projections to latent structures discriminant analysis (OPLS-DA) model.

Results: Most normalization methods (except histogram matching) performed equally well at modest levels of signal variance. Only probabilistic quotient (PQ) and constant sum (CS) maintained the highest level of peak recovery (> 67%) and correlation with true loadings (> 0.6) at maximal noise.

Conclusion: PQ and CS performed the best at recovering peak intensities and reproducing the true classifying features for an OPLS-DA model regardless of spectral noise level. Our findings suggest that performance is largely determined by the level of noise in the dataset, while the effect of dilution factors was negligible. A minimal allowable noise level of 20% was also identified for a valid NMR metabolomics dataset.
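To make the comparison concrete, the following is a minimal Python sketch of the two normalization methods highlighted above, constant sum (CS) and probabilistic quotient (PQ), applied to spectra perturbed by random dilution factors and additive Gaussian noise. It is not the MVAPACK implementation and does not reproduce the study's simulation protocol. The function names, the dilution range of 0.5 to 2.0, the noise model, and the use of a median reference spectrum for PQ are all assumptions made for illustration.

```python
import random
from statistics import median


def constant_sum(spectrum, total=1.0):
    """Scale a spectrum so that its intensities sum to `total`."""
    s = sum(spectrum)
    return [total * x / s for x in spectrum]


def probabilistic_quotient(spectra):
    """PQ normalization against a median reference spectrum (illustrative)."""
    # Step 1: constant-sum pre-normalization of every spectrum.
    scaled = [constant_sum(s) for s in spectra]
    n_vars = len(scaled[0])
    # Step 2: reference spectrum = per-variable median across samples.
    reference = [median(s[j] for s in scaled) for j in range(n_vars)]
    normalized = []
    for s in scaled:
        # Step 3: the most probable dilution is the median of the quotients.
        quotients = [s[j] / reference[j] for j in range(n_vars) if reference[j] > 0]
        factor = median(quotients)
        normalized.append([x / factor for x in s])
    return normalized


def simulate(true_spectrum, n_samples, noise_sd, rng):
    """Copy a spectrum with a random dilution factor and Gaussian noise.

    The dilution range (0.5 to 2.0) is an assumption, not the study's value.
    """
    samples = []
    for _ in range(n_samples):
        dilution = rng.uniform(0.5, 2.0)
        samples.append(
            [dilution * x + rng.gauss(0.0, noise_sd) for x in true_spectrum]
        )
    return samples


if __name__ == "__main__":
    rng = random.Random(0)
    true_spectrum = [5.0, 1.0, 3.0, 8.0, 2.0]  # hypothetical peak intensities
    noisy = simulate(true_spectrum, n_samples=4, noise_sd=0.05, rng=rng)
    for row in probabilistic_quotient(noisy):
        print([round(x, 3) for x in row])
```

With `noise_sd` set to zero, both methods map every diluted copy back onto the same constant-sum profile, so dilution alone is fully corrected in this toy setting. Raising `noise_sd` makes the recovered profiles drift apart, which gives some intuition for the finding that noise level, rather than dilution, drives performance.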

Original language: English (US)
Article number: 108
Journal: Metabolomics
Volume: 14
Issue number: 8
DOI: 10.1007/s11306-018-1400-6
State: Published - Aug 1 2018

Fingerprint

Noise
Dilution
Nuclear magnetic resonance
Metabolomics
Recovery
Datasets

Keywords

  • Metabolomics
  • NMR
  • Noise
  • Normalization
  • Preprocessing chemometrics

ASJC Scopus subject areas

  • Endocrinology, Diabetes and Metabolism
  • Biochemistry
  • Clinical Biochemistry

Cite this

Comparing normalization methods and the impact of noise. / Vu, Thao; Riekeberg, Eli; Qiu, Yumou; Powers, Robert.

In: Metabolomics, Vol. 14, No. 8, 108, 01.08.2018.

@article{42d774c14ab64ffcafe872630e7382e0,
title = "Comparing normalization methods and the impact of noise",
keywords = "Metabolomics, NMR, Noise, Normalization, Preprocessing chemometrics",
author = "Thao Vu and Eli Riekeberg and Yumou Qiu and Robert Powers",
year = "2018",
month = "8",
day = "1",
doi = "10.1007/s11306-018-1400-6",
language = "English (US)",
volume = "14",
journal = "Metabolomics",
issn = "1573-3882",
publisher = "Springer New York",
number = "8",

}

TY - JOUR

T1 - Comparing normalization methods and the impact of noise

AU - Vu, Thao

AU - Riekeberg, Eli

AU - Qiu, Yumou

AU - Powers, Robert

PY - 2018/8/1

Y1 - 2018/8/1

KW - Metabolomics

KW - NMR

KW - Noise

KW - Normalization

KW - Preprocessing chemometrics

UR - http://www.scopus.com/inward/record.url?scp=85051553749&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85051553749&partnerID=8YFLogxK

U2 - 10.1007/s11306-018-1400-6

DO - 10.1007/s11306-018-1400-6

M3 - Article

VL - 14

JO - Metabolomics

JF - Metabolomics

SN - 1573-3882

IS - 8

M1 - 108

ER -