Statistical Genomics Laboratory: Active Research Projects
Predictive models in cancer with high-density genetic data
This project is part of a larger multicenter undertaking, CaRE (Cancer Risk Evaluation, previously known as the ARCTIC project), and is primarily funded by the National Cancer Institute of Canada (NCIC). We are working on the modeling aspects of evaluating cancer risk in the presence of a subject-level, dense genetic map, using a huge and unique data source comprising a few thousand colorectal cancer cases and controls. Each subject contributes up to 611,500 markers along with detailed demographic, epidemiological, and lifestyle data. One of the challenges is to develop a modeling framework capable of handling massive amounts of noisy genetic data in combination with much more concise and precise epidemiological and clinical variables. Apart from "classical" analysis focused on identifying the most promising genetic markers, we are developing models to locate and model genetic interaction signals (epistasis) and are using data mining techniques (e.g., boosting) in our quest to develop a mixed multigenic-epidemiological model for predicting lifetime risk of colorectal cancer.
A website for this project is under construction.
New methods for high-dimensional data in Genomics
This research activity is sponsored by the National Institute for Complex Data Structures and NSERC. We are working on new statistical techniques for high-dimensional data systems, with particular emphasis on new, high-throughput genomic data. For example, we are working on novel models to analyze thousands of correlated, short time series, such as the ones observed in time course microarray experiments. While classical ARMA time series models can deal with a single time series and, with some extensions, a few, analyzing 6,000 genes observed simultaneously through time is currently a great challenge. A set of projects also relates to developing new methods for predictive modeling using single nucleotide polymorphism (SNP) markers from the colorectal cancer project. We are currently working on new dimension reduction techniques utilizing sparse PCA and Correspondence Analysis models.
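To make the idea of sparse dimension reduction concrete, here is a minimal sketch of a rank-one sparse PCA computed by soft-thresholded power iteration. This is a generic textbook-style variant, not the laboratory's method; the function names, the `frac` tuning parameter, and the simulated data (continuous features standing in for markers, with the first five sharing a latent factor) are all assumptions made for illustration.

```python
import math
import random


def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values within [-lam, lam] become 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def sparse_first_component(rows, frac=0.3, n_iter=100):
    """Rank-one sparse PCA by soft-thresholded power iteration.

    rows: list of samples, each a list of p numeric features.
    frac: threshold as a fraction of the largest entry (hypothetical knob).
    Returns a unit-length loading vector in which weak entries are exactly 0.
    """
    n, p = len(rows), len(rows[0])
    # Center each column so the iteration acts on the sample covariance.
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    X = [[r[j] - means[j] for j in range(p)] for r in rows]

    v = [1.0 / math.sqrt(p)] * p  # dense starting vector
    for _ in range(n_iter):
        # One power step: w = X^T (X v) / n
        scores = [sum(a * b for a, b in zip(xi, v)) for xi in X]
        w = [sum(X[i][j] * scores[i] for i in range(n)) / n for j in range(p)]
        # Sparsify: shrink every entry by a fraction of the largest one.
        lam = frac * max(abs(wj) for wj in w)
        w = [soft_threshold(wj, lam) for wj in w]
        norm = math.sqrt(sum(wj * wj for wj in w))
        if norm == 0.0:  # degenerate input (no variance at all)
            return [0.0] * p
        v = [wj / norm for wj in w]
    return v


def simulate(n=60, p=30, n_signal=5, seed=0):
    """Simulated data: the first n_signal features share one latent factor."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        rows.append([
            2.0 * z + rng.gauss(0.0, 1.0) if j < n_signal
            else rng.gauss(0.0, 1.0)
            for j in range(p)
        ])
    return rows


rows = simulate()
loadings = sparse_first_component(rows)
selected = [j for j, v in enumerate(loadings) if v != 0.0]
print("features with nonzero loading:", selected)
```

On data simulated this way, the nonzero loadings should concentrate on the features that share the latent factor, which is the property that makes sparse components easier to interpret than dense ones when there are far more markers than subjects.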
Genomic Data Fusion models
The data in high-throughput genomics (such as the now classical cDNA or oligo microarray data, SNP arrays, and Mass Spec proteomic data) are characterized by high noise, high dimensionality, and relatively low sample sizes. This poses special challenges for multivariate models, since one operates in high-dimensional spaces with few observations to guide model building. However, some help is available in the enormous amount of information collected on genes and their workings in public genomic knowledgebases or metabases. One of the most popular is the Gene Ontology (GO) project, which maintains information on all known biological processes (and their inter-relationships) and on the annotation of known genes to these processes. This information can be utilized in various models to reduce dimensionality, regularize solutions, or validate results. One example is our clustering model for expression data, which couples the GO-derived information with the experimental data to obtain "better" clusters. We are also working on graphical models for high-dimensional data that would admit external information, such as GO, in the graph-building process. Some of this research has been funded by the CIHR-NET grant.
Computational methods for Statistical Genomics
Statistical Genomics sometimes requires enormous computational resources. This is due not only to the large data sizes, but also to the limited applicability of standard asymptotic theory. Besides being committed to constructing easy-to-use and validated R packages, we also research new ways to dramatically reduce the required computational time. For example, in researching genetic interactions in cancer, we discovered that one needs to use permutations to obtain trustworthy p-value estimates of effect significance, but with thousands of markers, the number of even second-level interactions quickly makes permutation testing infeasible.
We have developed a combined machine-learning/Bayesian model for obtaining precise estimates of only the interesting (i.e., small) p-values, with hundreds-fold time savings. The point is that judicious application of statistics can help reduce the computational burden significantly, which is a somewhat unorthodox usage of statistical principles (in our case, to save statistics from itself).
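The cost argument above can be illustrated with a small sketch. It is not the machine-learning/Bayesian model described here, whose details are not given; it swaps in a much simpler, generic idea: a permutation test that stops early once enough permuted statistics reach the observed one, so that most of the effort goes to pairs whose p-values may be small. The test statistic (absolute covariance between a centered genotype product and case status), the function names, the parameters `max_perm` and `stop_after`, and the simulated data are all assumptions for illustration.

```python
import math
import random


def centered_product(g1, g2):
    """Interaction term: product of mean-centered genotype codes (0/1/2)."""
    m1, m2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return [(a - m1) * (b - m2) for a, b in zip(g1, g2)]


def abs_cov(x, y):
    """Absolute sample covariance; a deliberately simple test statistic."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return abs(sum((a - mx) * (b - my) for a, b in zip(x, y)) / n)


def sequential_perm_pvalue(x, y, max_perm=2000, stop_after=20, seed=0):
    """Permutation p-value with early stopping.

    Case/control labels are shuffled repeatedly. As soon as `stop_after`
    permuted statistics are at least as large as the observed one, the
    p-value is clearly not small and we stop.
    Returns (p_estimate, permutations_used).
    """
    rng = random.Random(seed)
    observed = abs_cov(x, y)
    labels = list(y)
    exceed = 0
    for used in range(1, max_perm + 1):
        rng.shuffle(labels)
        if abs_cov(x, labels) >= observed:
            exceed += 1
            if exceed == stop_after:
                return exceed / used, used
    # Ran the full budget: usual permutation estimate with +1 correction.
    return (exceed + 1) / (max_perm + 1), max_perm


# Simulated data (hypothetical): 200 subjects, two markers coded 0/1/2.
rng = random.Random(1)
n = 200
g1 = [rng.choice([0, 1, 2]) for _ in range(n)]
g2 = [rng.choice([0, 1, 2]) for _ in range(n)]
x = centered_product(g1, g2)

y_null = [rng.randint(0, 1) for _ in range(n)]   # unrelated to the markers
cut = sorted(x)[n // 2]
y_signal = [1 if xi > cut else 0 for xi in x]    # driven by the interaction

p_null, used_null = sequential_perm_pvalue(x, y_null)
p_signal, used_signal = sequential_perm_pvalue(x, y_signal)
print("null pair:  ", p_null, "after", used_null, "permutations")
print("signal pair:", p_signal, "after", used_signal, "permutations")

# Why brute force does not scale: pairs among 611,500 markers.
n_pairs = math.comb(611_500, 2)  # 186,965,819,250 marker pairs
print("marker pairs:", n_pairs)
```

The design choice is that a pair with no signal typically stops after a small number of permutations, while only pairs with potentially small p-values consume the full budget. Even so, with roughly 1.9e11 pairs, any per-pair permutation scheme remains expensive, which is the motivation for more aggressive approaches such as the one described above.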