Context-specific infinite mixtures for clustering gene expression profiles across diverse microarray datasets

Liu, X.¹,², Sivaganesan, S.³, Yeung, K.Y.⁴, Guo, J.¹, Bumgarner, R.E.⁴, Medvedovic, M.¹,²*

¹Department of Environmental Health, University of Cincinnati College of Medicine, 3223 Eden Av. ML 56, Cincinnati, OH 45267-0056, ²Division of Biomedical Informatics, Cincinnati Children’s Hospital Research Foundation, 3333 Burnet Avenue, Cincinnati, OH 45229-3039, ³Mathematical Sciences Department, University of Cincinnati, P.O. Box 210025, Cincinnati, OH 45221, ⁴Department of Microbiology, University of Washington, Box 358070, Seattle, WA 98195, USA.

* To whom correspondence should be addressed

Abstract

Motivation: Identifying groups of co-regulated genes by monitoring their expression over various experimental conditions is complicated by the fact that such co-regulation is condition-specific. Ignoring the context-specific nature of co-regulation significantly reduces the ability of clustering procedures to detect co-expressed genes due to the additional “noise” introduced by non-informative measurements.

Results: We have developed a novel Bayesian hierarchical model and corresponding computational algorithms for clustering gene expression profiles across diverse experimental conditions and studies that accounts for context-specificity of gene expression patterns. The model is based on the Bayesian infinite mixtures framework and does not require a priori specification of the number of clusters. The increased precision in cluster identification that is attributable to accounting for context-specificity is demonstrated by examining the specificity and sensitivity of clusters in microarray data. We demonstrate that probabilities of co-expression derived from the posterior distribution of clusterings are valid estimates of statistical significance of created clusters.

Availability: The open-source package gimm is available at http://eh3.uc.edu/gimm.

Contact: Mario.Medvedovic@uc.edu

A preprint PDF can be found here.

Supplementary Materials

The Web Supplement for this article contains additional mathematical details about the context-specific infinite mixture model (CSIMM), additional figures comparing CSIMM results with those of other clustering procedures, a description of the dynamic annealing heuristics for speeding up the convergence of the Gibbs sampler, and examples of run times for the algorithm. You can download the Web Supplement HERE.

The open-source gimm software package can be downloaded HERE.

The R dataset containing normalized gene expression measurements from the cell cycle and sporulation experiments used in the paper can be downloaded HERE. The dataset is also included in the distribution of gimmR, the wrapper R package for performing infinite-mixture-based clustering, which can be downloaded HERE.
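
To make the "probabilities of co-expression derived from the posterior distribution of clusterings" mentioned in the abstract more concrete, here is a minimal Python sketch. It is not the gimm or gimmR implementation and does not reproduce the CSIMM sampler. It only assumes that some sampler has already produced a list of clusterings, each giving one cluster label per gene, and estimates each pairwise probability as the fraction of sampled clusterings in which two genes share a label. The function name `coexpression_probabilities` and the toy input are hypothetical.

```python
from itertools import combinations


def coexpression_probabilities(samples):
    """Estimate pairwise co-clustering probabilities from sampled clusterings.

    samples: list of clusterings; each clustering is a list holding one
             cluster label per gene (assumed output format of a sampler).
    Returns a dict mapping each gene pair (i, j), i < j, to the fraction of
    samples in which genes i and j carry the same cluster label.
    """
    if not samples:
        raise ValueError("need at least one sampled clustering")
    n_genes = len(samples[0])
    counts = {pair: 0 for pair in combinations(range(n_genes), 2)}
    for labels in samples:
        if len(labels) != n_genes:
            raise ValueError("every clustering must label the same genes")
        for i, j in counts:
            # Only label equality within one sample matters, so arbitrary
            # relabeling of clusters between samples has no effect.
            if labels[i] == labels[j]:
                counts[(i, j)] += 1
    return {pair: count / len(samples) for pair, count in counts.items()}


# Hypothetical toy input: four sampled clusterings of four genes.
toy_samples = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 2, 2],
    [0, 1, 1, 1],
]
probs = coexpression_probabilities(toy_samples)
print(probs[(0, 1)])  # genes 0 and 1 share a cluster in 3 of 4 samples
```

Because the estimate depends only on whether two genes share a label within each sample, it does not require the number of clusters to be fixed across samples, which fits a setting where the number of clusters is not specified a priori.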