Sparse Compressed Sensing based Classifiers for -omics mass-data

 

This research is carried out in the framework of Matheon supported by Einstein Foundation Berlin.

 

Project heads: Tim Conrad (FU), Gitta Kutyniok (TU), Christof Schütte (FU)
Staff: Nada Cvetcovic, Martin Genzel
External cooperations: Jan Vybiral (Charles University Prague)

Project Background

Tumor diseases rank among the most frequent causes of death in Western countries, coinciding with an incomplete understanding of the underlying pathogenic mechanisms and a lack of individual treatment options. Hence, early diagnosis of the disease and early relapse monitoring are currently the best available options to improve patient survival. This calls for two things: (1) identification of disease-specific sets of biological signals that reliably indicate a disease outbreak (or status) in an individual; we call these sets of disease-specific signals the "fingerprints" of a disease. And (2), development of new classification methods allowing for robust identification of these fingerprints in an individual's biological sample. In this project we will use -omics data sources, such as proteomics or genomics. The advantage of -omics data over classical (e.g. blood) markers is that, for example, a proteomics data set contains a snapshot of almost all proteins that are currently active in an individual, as opposed to the roughly 30 values analyzed in a classical blood test. Thus, it contains orders of magnitude more potential information and describes the medical state of an individual much more precisely. However, to date there is no gold standard for reliably and reproducibly analyzing these huge data sets and finding robust fingerprints that could be used for the ultimate task: (early) diagnosis of cancer.

Problems and (some) hope: -omics data is ultra high-dimensional and very noisy - but only sparsely filled with information

Biological -omics data (e.g. proteomics or genomics data) is typically very large (millions of dimensions), which significantly increases the complexity of algorithms for analyzing the parameter space, or even renders them infeasible. At the same time, this data exhibits a very particular structure, in the sense that it is highly sparse. Thus the information content of this data is much lower than its actual dimension suggests, which is the key prerequisite for any dimension reduction with small loss of information.

However, the sparsity structure of this data is highly complex: not only do the large entries exhibit a particular clustering, with the amplitudes forming Gaussian-like shapes, but the noise affecting the signal is also by no means Gaussian -- a customarily assumed property. In addition, across different sample sets those clusters slightly shift in location from sample set to sample set, and hence do not match standard patterns such as joint sparsity.

This means that although the data is highly sparse, both the sparsity structure and the noise distribution are non-standard. Specifically adapted automatic dimension reduction strategies such as compressed sensing -- which avoid the cumbersome by-hand identification of significant values -- have, however, never been developed for proteomics data, for instance. In our project, such a dimension reduction step will be a crucial ingredient and shall precede the analysis of the parameter space, thereby enabling low-complexity algorithms.

The major challenge in these applications is to extract a set of features, as small as possible, that accurately classifies the learning examples.
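As a toy illustration of this feature-extraction idea (not the project's actual method), the following sketch recovers a handful of informative features from synthetic high-dimensional data via L1-regularized least squares (the Lasso), solved by plain iterative soft-thresholding; all data dimensions and parameters are made up for the demonstration:

```python
import numpy as np

def ista_lasso(X, y, lam, steps=1000):
    """Minimize 0.5*||X w - y||^2 + lam*||w||_1 by iterative
    soft-thresholding (ISTA)."""
    L = np.linalg.norm(X, 2) ** 2            # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        z = w - X.T @ (X @ w - y) / L        # gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return w

# Synthetic stand-in for "many features, few informative ones".
rng = np.random.default_rng(0)
n, d, s = 100, 1000, 5                       # samples, features, true sparsity
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
support = rng.choice(d, size=s, replace=False)
w_true[support] = 2.0 * rng.choice([-1.0, 1.0], size=s)
y = X @ w_true + 0.1 * rng.standard_normal(n)

w_hat = ista_lasso(X, y, lam=3.0)
selected = np.flatnonzero(np.abs(w_hat) > 0.1)
print("true features:    ", sorted(support.tolist()))
print("selected features:", selected.tolist())
```

Despite the 10:1 ratio of features to samples, the L1 penalty drives almost all coefficients to exactly zero, so the surviving features form exactly the kind of small candidate fingerprint the paragraph above describes.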

The goal

In this project we aim to develop a new method that can be used to solve this task: the identification of minimal, yet robust fingerprints from very high-dimensional, noisy -omics data. Our method will be based on ideas from the areas of compressed sensing and machine learning.

Highlights

We have developed a unified framework for high-dimensional estimation of structured signals. Our framework has the advantage of being fairly general, in particular covering a large class of (non-)linear observation models and convex loss functions for signal recovery. Furthermore, we could show that our first algorithmic developments outperform competing approaches with respect to prediction accuracy, especially in the "very sparse model" setting, i.e., when seeking a model with only very few features (e.g., fewer than 10). These results have been evaluated and validated in significant medical applications.
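To give a flavor of recovery under a nonlinear observation model (a simplified illustration, not the framework itself), the sketch below keeps only the signs of linear measurements -- a 1-bit model -- and still identifies the sparse support with a simple marginal correlation estimator; all sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, s = 500, 2000, 4                      # measurements, dimension, sparsity
X = rng.standard_normal((n, d))
w = np.zeros(d)
supp = rng.choice(d, size=s, replace=False)
w[supp] = 1.0 / np.sqrt(s)                  # unit-norm, s-sparse signal
y = np.sign(X @ w)                          # nonlinear (1-bit) observations

# Marginal estimator: for Gaussian X, E[x_j * sign(<x, w>)] is proportional
# to w_j, so correlating each feature with the labels reveals the support.
corr = X.T @ y / n
top = set(np.argsort(-np.abs(corr))[:s].tolist())
print("recovered", len(top & set(supp.tolist())), "of", s, "true features")
```

The point of the demonstration: even though the linear measurements are distorted by a strong nonlinearity (only their signs are kept), the sparse support can be found by a method that never models the nonlinearity explicitly.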

Publications

Network-of-Network based -omics data integration

This research is carried out in the framework of Matheon supported by Einstein Foundation Berlin.

 
Project heads: Tim Conrad (FU), Christof Schütte (FU/ZIB)
Staff: Peter Koltai

Project Background

Pancreatic cancer is the fifth leading cause of cancer death in Germany (see DKFZ Report, 2010). It is estimated that by 2030 it will be the second leading cause of cancer death, incurring costs of about 15.8 billion US dollars worldwide for public health systems.

Cancer is a systems disease

"Cancer is no more a disease of cells than a traffic jam is a disease of cars. A lifetime of study of the internal-combustion engine would not help anyone to understand our traffic problems." (Smithers, 1962). It is accepted that gene mutations are part of the process of cancer, but mutations alone are not enough. Cancer involves an interaction between neoplastic cells and the surrounding tissue on many different levels, e.g. interactions of RNA molecules, proteins, and metabolites. Yet most available models are limited to only one or very few levels of interaction and describe a rather static view.

From single to multi source: data integration on a systems level

Current high-throughput -omics technologies have dramatically eased the production of part lists for a variety of organisms. What is still missing are the dynamic interactions among an organism's molecular parts, and the interactions between different biological levels, such as transcriptomics and proteomics. Capturing these is pivotal to a better understanding of an organism's biology and, in our case, of pancreatic cancer.

Therefore, the aim of this project is two-fold: (1) use data acquired in our earlier projects to create a holistic integration of the aforementioned sources and levels for modeling pancreatic cancer, which we call a Network-of-Networks, or short: NoN (in our context, networks of different -omics levels, such as genomics, transcriptomics, proteomics, and metabolomics). (2) A NoN is a very large and complex object, and its structure differs significantly from that of other biological networks. Thus, new methods for complexity reduction and for analyzing NoNs will be developed in this project.
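As a minimal sketch of what such a NoN data structure might look like (the layer names and molecules below are illustrative placeholders, not project data), one can store intra-layer edges per -omics level together with inter-layer couplings, and flatten everything into a single graph over layer-qualified nodes:

```python
from collections import defaultdict

# Hypothetical NoN: one edge set per -omics layer, plus couplings that
# connect entities across layers (e.g. gene -> transcript -> protein).
layers = {
    "genomics":        {("TP53", "KRAS")},
    "transcriptomics": {("TP53_mRNA", "KRAS_mRNA")},
    "proteomics":      {("p53", "RAS")},
}
couplings = {
    ("genomics", "transcriptomics"): {("TP53", "TP53_mRNA"), ("KRAS", "KRAS_mRNA")},
    ("transcriptomics", "proteomics"): {("TP53_mRNA", "p53"), ("KRAS_mRNA", "RAS")},
}

def flatten(layers, couplings):
    """Collapse the NoN into one undirected adjacency map whose nodes
    are (layer, entity) pairs."""
    adj = defaultdict(set)
    for layer, edges in layers.items():
        for u, v in edges:
            adj[(layer, u)].add((layer, v))
            adj[(layer, v)].add((layer, u))
    for (la, lb), edges in couplings.items():
        for u, v in edges:
            adj[(la, u)].add((lb, v))
            adj[(lb, v)].add((la, u))
    return adj

adj = flatten(layers, couplings)
print(len(adj), "nodes,", sum(len(nbrs) for nbrs in adj.values()) // 2, "edges")
```

Qualifying each node by its layer keeps, say, a gene and its protein product distinct while the coupling edges make the flattened graph amenable to ordinary network analysis.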

The goal

In this project we aim to develop new methods that can be used to solve this task: the holistic integration of multiple -omics data sources into a network-of-networks model of pancreatic cancer, together with techniques for reducing the complexity of, and analyzing, the resulting large networks.

Highlights

We developed methods for the analysis of undirected (e.g. protein-protein interaction), directed (e.g. signal transduction), and time-dependent biological networks-of-networks. With these achievements we are now able to perform all the steps required for the full network-of-networks approach: find modules and other dominant structures (hubs, cycles) in large, not necessarily undirected multi-omics networks and follow their evolution in time. Concerning biological applications, we took first successful steps toward the integration of transcriptomics, proteomics, and methylation data using network-of-networks analysis.
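A toy sketch of the structures mentioned above -- one directed cycle and a hub in a small interaction network -- using a plain depth-first search (an illustration only, not the project's cycle-decomposition approach; node names are placeholders):

```python
from collections import defaultdict

# Toy directed interaction network.
edges = [("A", "B"), ("B", "C"), ("C", "A"),   # a 3-cycle A -> B -> C -> A
         ("C", "D"), ("D", "E"), ("B", "D")]

succ = defaultdict(list)
for u, v in edges:
    succ[u].append(v)

def find_cycle(start):
    """Iterative DFS; returns one directed cycle through `start`, if any."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        for nxt in succ[node]:
            if nxt == start:
                return path                    # path closes back to start
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    return None

cycle = find_cycle("A")
nodes = {u for e in edges for u in e}
degree = {n: len(succ[n]) + sum(n in vs for vs in succ.values()) for n in nodes}
hub = max(degree, key=degree.get)              # node of highest total degree
print("cycle through A:", cycle)
print("hub:", hub, "with degree", degree[hub])
```

Real multi-omics networks are far too large for such brute-force enumeration, which is exactly why the dynamics-based decomposition methods developed in this project are needed; the snippet only shows what the target structures are.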

Publications
  • R. Banisch, N. Djurdjevac, and Ch. Schütte. Reactive flows and unproductive cycles in irreversible Markov chains. The European Physical Journal Special Topics, 224(12):2369–2387, September 2015.
  • N. Djurdjevac, R. Banisch, and Ch. Schütte. Modularity of directed networks: Cycle decomposition approach. Journal of Computational Dynamics, 2(1):1–24, August 2015.
  • P. Koltai and O. Junge. Quantized nonlinear feedback design by a split dynamic programming approach. Proceedings of the 21st International Symposium on Mathematical Theory of Networks and Systems (MTNS 2014), 2014.
  • M. Sarich, N. Djurdjevac, S. Bruckner, T. O. F. Conrad, and Ch. Schütte. Modularity revisited: A novel dynamics-based concept for decomposing complex networks. Journal of Computational Dynamics, 1(1):191–212, 2014.
  • Ch. Schütte and M. Sarich. A critical appraisal of Markov state models. The European Physical Journal Special Topics, 224(12):2445–2462, September 2015.
  • N. Djurdjevac, M. Weber, and Ch. Schütte. Finding dominant structures of nonreversible Markov processes. Multiscale Modeling and Simulation: A SIAM Interdisciplinary Journal (submitted), 2015.
  • M. Sarich and Ch. Schütte. Utilizing hitting times for finding metastable sets in nonreversible Markov chains. Submitted to Journal of Computational Dynamics, 2015.