|
Statistical Genomics Laboratory: Active Research Projects
|
|
Predictive models in cancer with high-density genetic data
|
New methods for high-dimensional data in Genomics
|
|
This project, part of a larger multicenter undertaking, CaRE
(Cancer Risk Evaluation, previously known as the ARCTIC project),
is primarily funded by the National Cancer Institute of Canada
(NCIC). We are working on the modeling aspects of evaluating
cancer risk in the presence of subject-level, dense genetic maps, using a huge and
unique data source comprising a few thousand colorectal cancer cases
and controls. Each subject contributes up to 611,500 markers, along with detailed
demographic, epidemiological and lifestyle data. One of the challenges
is to develop a modeling framework capable of handling massive amounts
of noisy genetic data, in combination with much more concise and
precise epidemiological and clinical variables. Apart from "classical"
analysis focusing on identifying the most promising genetic markers, we
are developing models to locate and model genetic interaction signals
(epistasis) and utilizing data mining techniques (e.g., boosting)
in our quest to develop a mixed multigenic-epidemiological model for
predicting lifetime risk of colorectal cancer. A website for
this project is under construction.
|
This research activity is sponsored by the National Institute for
Complex Data Structures and NSERC. We are working on new
statistical techniques for high-dimensional data systems, with
particular emphasis on new, high-throughput genomic data. For example,
we are working on novel models to analyze thousands of
correlated, short time series, such as the ones observed in time
course microarray experiments. While classical ARMA time series models can
deal with a single time series and, with some extensions, a few,
analyzing 6,000 genes observed simultaneously through time is
currently a great challenge. A set of projects also relates to
developing new methods for predictive modeling using single nucleotide
polymorphism (SNP) markers from the colorectal cancer project. We are
currently working on new dimension reduction techniques utilizing
sparse PCA and Correspondence Analysis models.
|
|
Genomic Data Fusion models
|
Computational methods for Statistical Genomics
|
|
The data in high-throughput genomics (such as the now classical cDNA or
oligo microarray data, SNP arrays, and mass spectrometry proteomic data) are
characterized by high noise, high dimensionality and relatively low
sample sizes. This poses special challenges for multivariate models
since one operates in high-dimensional spaces with few observations to
guide model building. However, some help is available in the enormous
amount of information collected on genes and their workings in public
genomic knowledgebases or metabases. One of the most popular is
the Gene Ontology (GO) project,
which maintains information both on known biological processes
(and their inter-relationships) and on the annotation of known genes to these
processes. This can be utilized in various models to reduce
dimensionality, regularize solutions, or validate the results. One
example is our clustering model for expression data that couples the
GO-derived information with the experimental data to obtain "better"
clusters. We are also working on graphical models for high-dimensional
data that would admit external information, such as GO, in the
graph-building process. Some of this research has been funded by the
CIHR-NET grant.
|
Statistical Genomics sometimes requires enormous computational
resources. This is due not only to the large data sizes, but also
to the limited applicability of standard asymptotic theory. Besides being
committed to constructing easy-to-use and validated R packages, we also
research new ways to dramatically reduce the required computational
times. For example, in researching genetic interactions in cancer, we
discovered that one needs to use permutations to obtain trustworthy
p-value estimates of effect significance, but with thousands of
markers, the number of even second-order interactions quickly makes
permutation testing infeasible. We have developed a combined
machine-learning/Bayesian model for obtaining precise estimates of
only the interesting (i.e., small) p-values, with time savings of
several hundredfold. The point is that judicious application of statistics can help
reduce the computational burden significantly, which is a somewhat
unorthodox usage of statistical principles (in our case, to save
statistics from itself).
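
To give a rough sense of the scale involved, the following is a minimal Python sketch of a plain (brute-force) permutation p-value for a single pairwise (two-marker) interaction, together with a count of how many such pairs exist. It is not the combined machine-learning/Bayesian model described above, whose details are not given here. The function names, the 0/1/2 genotype coding and the test statistic (absolute covariance between the genotype product and case status) are all assumptions chosen for illustration; a real analysis would use a statistic that adjusts for the marginal effects of each marker.

```python
import math
import random


def n_pairwise(n_markers):
    """Number of distinct marker pairs (requires Python 3.8+ for math.comb)."""
    return math.comb(n_markers, 2)


def interaction_stat(g1, g2, y):
    """Illustrative statistic (an assumption, not the lab's method):
    absolute covariance between the genotype product g1*g2 and case status y."""
    n = len(y)
    prod = [a * b for a, b in zip(g1, g2)]
    mean_p = sum(prod) / n
    mean_y = sum(y) / n
    return abs(sum((p - mean_p) * (v - mean_y) for p, v in zip(prod, y)) / n)


def permutation_pvalue(g1, g2, y, n_perm=999, seed=0):
    """Brute-force permutation p-value: shuffle case/control labels and count
    how often the permuted statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = interaction_stat(g1, g2, y)
    y_perm = list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if interaction_stat(g1, g2, y_perm) >= observed:
            exceed += 1
    # The add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Simulated genotypes (0, 1 or 2 minor alleles) and random case status.
    rng = random.Random(42)
    n_subjects = 200
    g1 = [rng.choice([0, 1, 2]) for _ in range(n_subjects)]
    g2 = [rng.choice([0, 1, 2]) for _ in range(n_subjects)]
    y = [rng.choice([0, 1]) for _ in range(n_subjects)]
    print("p-value for one pair:", permutation_pvalue(g1, g2, y))
    # Number of pairs for the marker count quoted for the colorectal cancer project.
    print("pairs among 611,500 markers:", n_pairwise(611500))
```

With n_perm permutations the smallest attainable p-value is 1/(n_perm + 1), so resolving very small p-values for even one pair needs a large number of permutations, and that cost is multiplied by the number of pairs. This is the burden that motivates estimating only the small p-values precisely.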
|