Sparse Compressed Sensing based Classifiers for -omics mass-data

 

This research is carried out in the framework of Matheon supported by Einstein Foundation Berlin.

 

Project heads: Tim Conrad (FU), Gitta Kutyniok (TU), Christof Schütte (FU)
Staff: Nada Cvetković, Martin Genzel
External cooperations: Jan Vybiral (Charles University Prague)

Project Background

Tumor diseases rank among the most frequent causes of death in Western countries, coinciding with an incomplete understanding of the underlying pathogenic mechanisms and a lack of individual treatment options. Hence, early diagnosis of the disease and early relapse monitoring are currently the best available options to improve patient survival. This calls for two things: (1) identification of disease-specific sets of biological signals that reliably indicate a disease outbreak (or status) in an individual; we call these sets of disease-specific signals "fingerprints" of a disease. And (2), development of new classification methods allowing for robust identification of these fingerprints in an individual's biological sample. In this project we will use -omics data sources, such as proteomics or genomics. The advantage of -omics data over classical (e.g. blood) markers is that, for example, a proteomics data set contains a snapshot of almost all proteins that are currently active in an individual, as opposed to the roughly 30 values analyzed in a classical blood test. Thus, it contains orders of magnitude more potential information and describes the medical state of an individual much more precisely. However, to date there is no gold standard for reliably and reproducibly analyzing these huge data sets and finding robust fingerprints that could be used for the ultimate task: (early) diagnosis of cancer.

Problems and (some) hope: -omics data is ultra high-dimensional and very noisy - but only sparsely filled with information

Biological -omics data (e.g. proteomics or genomics data) is typically very high-dimensional (millions of dimensions), which significantly increases the complexity of algorithms for analyzing the parameter space, or even renders them infeasible. At the same time, this data exhibits a very particular structure, in the sense that it is highly sparse. Thus the information content of this data is much lower than its ambient dimension suggests, which is precisely the prerequisite for dimension reduction with small loss of information.

However, the sparsity structure of this data is highly complex: not only do the large entries exhibit a particular clustering, with the amplitudes forming Gaussian-like shapes, but the noise affecting the signal is by no means Gaussian -- a customarily assumed property. In addition, across different sample sets the locations of those clusters vary slightly, so they do not conform to standard patterns such as joint sparsity either.

This means that, although the data is highly sparse, both the sparsity structure and the noise distribution are non-standard. Nevertheless, specifically adapted automatic dimension reduction strategies such as compressed sensing (which avoid cumbersome by-hand identification of significant values) have never been developed for, e.g., proteomics data. In our project, such a dimension reduction step will be a crucial ingredient and shall precede the analysis of the parameter space, thereby enabling low-complexity algorithms.
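
To make the role of such a dimension reduction step concrete, the following Python sketch (illustrative only, not the project's actual pipeline) recovers a synthetic sparse signal with a few large peaks from far fewer random measurements than its ambient dimension, using l1-minimization via scikit-learn's Lasso. All dimensions and parameters are assumptions chosen for the demonstration.

  import numpy as np
  from sklearn.linear_model import Lasso

  rng = np.random.default_rng(0)
  n, m, k = 2000, 200, 10                 # ambient dimension, measurements, sparsity
  x = np.zeros(n)
  support = rng.choice(n, size=k, replace=False)
  x[support] = rng.normal(loc=5.0, scale=1.0, size=k)   # few large peaks

  A = rng.normal(size=(m, n)) / np.sqrt(m)   # random Gaussian sensing matrix
  y = A @ x + 0.01 * rng.normal(size=m)      # noisy compressed measurements

  # l1-minimization (basis pursuit denoising in Lagrangian form)
  x_hat = Lasso(alpha=0.01, max_iter=10000).fit(A, y).coef_
  recovered = np.argsort(np.abs(x_hat))[-k:]
  print("correctly recovered peaks:", len(set(recovered) & set(support)), "of", k)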

The major challenge in these applications is to extract a set of features, as small as possible, that accurately classifies the learning examples.

The goal

In this project we aim to develop a new method that can be used to solve this task: the identification of minimal, yet robust fingerprints from very high-dimensional, noisy -omics data. Our method will be based on ideas from the areas of compressed sensing and machine learning.

Highlights

We have developed a unified framework for high-dimensional estimation of structured signals. Our framework has the advantage of being fairly general, in particular including a large class of (non-)linear observation models and convex loss functions for signal recovery. Furthermore, we could show that our first algorithmic development outperforms competing approaches with respect to prediction accuracy, especially in the "very sparse model" setting, i.e., when seeking a model with only very few features (e.g., fewer than 10). These results are evaluated and validated in significant medical applications.
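
The estimator developed in the project is not reproduced here; as a rough illustration of the "very sparse model" setting, the following sketch tightens a generic l1-penalized logistic regression until at most 10 features carry nonzero weight. The data, penalty values, and feature budget are hypothetical.

  import numpy as np
  from sklearn.linear_model import LogisticRegression

  rng = np.random.default_rng(1)
  n_samples, n_features = 100, 5000
  X = rng.normal(size=(n_samples, n_features))
  w_true = np.zeros(n_features)
  w_true[:5] = 3.0                          # only 5 informative features
  y = (X @ w_true + 0.5 * rng.normal(size=n_samples) > 0).astype(int)

  # Tighten the l1 penalty (smaller C) until at most 10 features survive.
  for C in [1.0, 0.5, 0.1, 0.05, 0.01]:
      clf = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
      if np.count_nonzero(clf.coef_) <= 10:
          break
  print("selected features:", np.flatnonzero(clf.coef_), "at C =", C)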

Publications

Network-of-Network based -omics data integration

This research is carried out in the framework of Matheon supported by Einstein Foundation Berlin.

 
Project heads: Tim Conrad (FU), Christof Schütte (FU/ZIB)
Staff: Peter Koltai

Project Background

Pancreatic cancer is the fifth leading cause of cancer death in Germany (see DKFZ Report, 2010). It is estimated that in 2030 it will be the second leading cause of cancer death, incurring a cost of about 15.8 billion US dollars worldwide to public health systems.

Cancer is a systems disease

"Cancer is no more a disease of cells than a traffic jam is a disease  of cars. A lifetime of study of the internal-combustion engine would not help  anyone to understand our traffic problems.'" (Smithers1962). It is accepted that gene mutations are part of the process of cancer, but mutations alone are not enough. Cancer involves an interaction between neoplastic cells and surrounding tissue on many different levels, e.g. interaction of RNA molecules, proteins, and metabolites. But most available models are limited to only one or very few levels of interactions and describe a rather static view.

From single to multi source: data integration on a systems level

Current high-throughput -omics technologies have dramatically eased the production of parts lists for a variety of organisms. What is still missing are the dynamic interactions among an organism's molecular parts, and the interactions between different biological levels, such as transcriptomics and proteomics. This is pivotal to a better understanding of an organism's biology and, in our case, to understanding pancreatic cancer.

Therefore, the aim of this project is two-fold: (1) to use data acquired in our earlier projects to create a holistic integration of the aforementioned sources and levels for modeling pancreatic cancer, which we call a Network-of-Networks, or short: NoN (in our context, a network of networks of different -omics levels, such as genomics, transcriptomics, proteomics, and metabolomics). (2) A NoN is a very large and complex object, and its structure differs significantly from other biological networks; thus, new methods for complexity reduction and for analyzing NoNs will be developed in this project. A toy construction is sketched below.
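
As a toy illustration of what a NoN could look like in code (node names and coupling edges are hypothetical, and networkx is only one possible choice of library), consider two small -omics layers joined by inter-layer edges:

  import networkx as nx

  # Two -omics layers as separate graphs (toy data, hypothetical node names).
  transcriptome = nx.Graph([("geneA", "geneB"), ("geneB", "geneC")])
  proteome = nx.Graph([("protA", "protB"), ("protB", "protC")])

  # A NoN: the union of the layers plus inter-layer coupling edges,
  # e.g. linking a gene to the protein it encodes.
  non = nx.union(transcriptome, proteome)
  non.add_edges_from([("geneA", "protA"), ("geneC", "protC")], layer="coupling")
  print(non.number_of_nodes(), "nodes,", non.number_of_edges(), "edges")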

The goal

In this project we aim to develop new methods for the complexity reduction and analysis of Networks-of-Networks, and to apply them to a holistic, multi-level integration of our -omics data sources for modeling pancreatic cancer.

Highlights

We developed methods for the analysis of undirected (e.g. protein-protein interaction), directed (e.g. signal transduction), and time-dependent biological networks-of-networks. With these achievements we are now able to perform all the steps required for the full network-of-networks approach: find modules and other dominant structures (hubs, cycles) in large, not necessarily undirected multi-omics networks and follow their evolution in time. Concerning biological applications, we took first successful steps towards the integration of transcriptomics, proteomics, and methylation data using network-of-networks analysis.
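
The cycle-decomposition and module-finding machinery from the publications below is not reproduced here; the following sketch merely shows, with generic networkx primitives, how hubs, cycles, and strongly connected components of a small directed network can serve as rough proxies for such dominant structures.

  import networkx as nx

  G = nx.DiGraph([(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 4), (2, 5)])

  dc = nx.degree_centrality(G)
  hubs = sorted(dc, key=dc.get, reverse=True)[:2]        # highest-degree nodes
  cycles = list(nx.simple_cycles(G))                     # directed cycles
  modules = [c for c in nx.strongly_connected_components(G) if len(c) > 1]
  print("hubs:", hubs)
  print("cycles:", cycles)
  print("module candidates:", modules)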

Publications
  • R. Banisch, N. Djurdjevac, and Ch. Schütte. Reactive flows and unproductive cycles in irreversible Markov chains. The European Physical Journal Special Topics, 224(12):2369-2387, September 2015.
  • N. Djurdjevac, R. Banisch, and Ch. Schütte. Modularity of directed networks: Cycle decomposition approach. Journal of Computational Dynamics, 2(1):1-24, August 2015.
  • P. Koltai and O. Junge. Quantized nonlinear feedback design by a split dynamic programming approach. Proceedings of the 21st International Symposium on Mathematical Theory of Networks and Systems (MTNS 2014), 2014.
  • M. Sarich, N. Djurdjevac, S. Bruckner, T. O. F. Conrad, and Ch. Schütte. Modularity revisited: A novel dynamics-based concept for decomposing complex networks. Journal of Computational Dynamics, 1(1):191-212, 2014.
  • Ch. Schütte and M. Sarich. A critical appraisal of Markov state models. The European Physical Journal Special Topics, 224(12):2445-2462, September 2015.
  • N. Djurdjevac, M. Weber, and Ch. Schütte. Finding dominant structures of nonreversible Markov processes. Multiscale Modeling and Simulation: A SIAM Interdisciplinary Journal (submitted), 2015.
  • M. Sarich and Ch. Schütte. Utilizing hitting times for finding metastable sets in nonreversible Markov chains. Submitted to Journal of Computational Dynamics, 2015.

Understanding cell trajectories with sparse similarity learning

This research is carried out in the framework of Matheon supported by Einstein Foundation Berlin.

   
Project heads: Tim Conrad (FU/ZIB), Gitta Kutyniok (TU), Christof Schütte (FU/ZIB)
Staff: Nada Cvetković (FU)

Project Background

In living organisms, biological cells transition from one state to another. This happens during normal cell development (e.g. aging) or is triggered by events such as diseases. The time-ordered set of state changes is called a trajectory. Identifying these cell trajectories is a crucial part of bio-medical research aimed at understanding changes on the gene and molecular level. It allows researchers to derive biological insights such as disease mechanisms and can lead to new biomedical discoveries and advances in health care. With the advent of single-cell experiments such as Drop-Seq or inDrop, individual gene expression profiles of thousands of cells can be measured in a single experiment. These large data sets make it possible to determine a cell's state based on its gene activity (cell expression profiles, CEPs), which can be expressed as a large feature vector representing its location in some large state space. The main problem with these experiments is that the actual time information is lost and needs to be recovered. The state-of-the-art solution is to introduce the concept of pseudo-time, in which the cells are ordered by CEP similarity. To find robust and biologically meaningful trajectories based on CEPs, two main tasks have to be performed: (1) a CEP-based metric has to be learned to define pair-wise distances between CEPs; (2) given this metric, similar CEP groups and transition paths between those groups have to be identified and analysed.
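
A minimal baseline for pseudo-time, assuming a toy CEP matrix and a known root cell (both assumptions for illustration, not the project's method): reduce the CEPs with PCA, build a minimum spanning tree on the pairwise distances, and order cells by their graph distance from the root.

  import numpy as np
  from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path
  from scipy.spatial.distance import pdist, squareform
  from sklearn.decomposition import PCA

  rng = np.random.default_rng(2)
  # Toy CEPs: 300 cells x 2000 genes drifting along one latent trajectory.
  t = np.sort(rng.uniform(0, 1, size=300))
  X = np.outer(t, rng.normal(size=2000)) + 0.1 * rng.normal(size=(300, 2000))

  Z = PCA(n_components=10).fit_transform(X)   # de-noise and reduce dimension
  D = squareform(pdist(Z))                    # pairwise CEP distances
  mst = minimum_spanning_tree(D)              # sparse backbone of the data
  # Pseudo-time = graph distance from a chosen root cell along the tree.
  pseudotime = shortest_path(mst, directed=False, indices=0)
  print("pseudo-time order of the first 10 cells:", np.argsort(pseudotime)[:10])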

The goal

The goal of this project is to develop a new and mathematically founded approach to similarity learning for high-dimensional biological data and to apply it to the trajectory identification problem for cell data. The planned applications aim at the identification of cell trajectories directly from experimental data in the areas of aging and cancer.

The key idea is to exploit the fact that biological high-dimensional data can be sparsely represented (with respect to different conditions) using ideas developed in our SPA framework (from ECMath project CH2). With this in hand, recent work on similarity learning for sparse high-dimensional data and progress made in the area of feature selection for multi-modal data (from ECMath project CH7) can be extended.
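
One elementary way to cast similarity learning as a sparse estimation problem (a sketch under strong simplifying assumptions, not the SPA-based approach itself) is to learn nonnegative per-gene weights of a diagonal metric by regressing pair labels onto feature-wise squared differences with an l1 penalty:

  import numpy as np
  from sklearn.linear_model import Lasso

  rng = np.random.default_rng(3)
  n, d = 60, 500
  y = rng.integers(0, 2, size=n)              # two cell states
  X = rng.normal(size=(n, d))
  X[:, :5] += 3.0 * y[:, None]                # few genes separate the states

  # One row per cell pair: feature-wise squared differences; the target is 1
  # for pairs from different states (should be far apart), 0 otherwise.
  i, j = np.triu_indices(n, k=1)
  F = (X[i] - X[j]) ** 2
  target = (y[i] != y[j]).astype(float)

  w = Lasso(alpha=0.05, positive=True).fit(F, target).coef_
  # w defines a sparse diagonal metric d(x, x') = sum_k w_k * (x_k - x'_k)^2.
  print("genes with nonzero metric weight:", np.flatnonzero(w))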

Publications in preparation

N. Cvetković, T. O. F. Conrad, and H. C. Lie: "Convergent discretisation schemes for Transition Path Theory for diffusion processes", 2018.

Data-driven modeling of cellular processes and beyond

This research is carried out in the framework of Matheon supported by Einstein Foundation Berlin.

   
Project heads: Tim Conrad (FU/ZIB), Stefan Klus (FU), Christof Schütte (FU/ZIB)
Staff: Dr. Wei Zhang (ZIB)

Project Background

Cellular processes are governed by diffusion, transport, and interactions of their constituents. For many processes the spatial inhomogeneity of cells is of secondary importance; modelling such processes means finding appropriate kinetic models of the underlying cellular reaction networks (CRNs). The availability of such models is key to many areas of the life sciences, ranging from computational biology to systems medicine, and is essential for understanding the fundamentals of cellular behavior, its malfunction under external stress, and its restoration by regenerative interventions.
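
For concreteness, here is a minimal kinetic model of a toy CRN (reactions and rate constants invented for illustration): the reversible reaction A + B <-> C under mass-action kinetics, integrated with scipy.

  import numpy as np
  from scipy.integrate import solve_ivp

  # Toy CRN: A + B -> C (rate k1) and C -> A + B (rate k2), mass-action kinetics.
  k1, k2 = 2.0, 0.5

  def rhs(t, x):
      a, b, c = x
      flux = k1 * a * b - k2 * c          # net forward reaction flux
      return [-flux, -flux, flux]

  sol = solve_ivp(rhs, (0.0, 10.0), [1.0, 1.0, 0.0])
  print("concentrations at t = 10:", sol.y[:, -1])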

The goal

The project aims at developing DDMI into a technique that can be applied to observational trajectory data of cellular processes. The main goal is to identify CRN models directly from available time-resolved process data, without the need to specify in advance the reactions that have to be included. A similar attempt has been made before: in in-silico biology, for example, evolutionary algorithms were used to construct optimal models via "random mutations" of the underlying CRNs. Unfortunately, these approaches were not very successful, mainly due to the lack of discrimination power between inappropriate models and artefacts resulting from ill-conditioned parameter identification. The approach proposed herein does not face the same problem: in DDMI, parameters are "automatically" identified as the leading expansion coefficients in the ansatz space. That is, the ill-conditioned inverse problem of the parameter estimation step is avoided.
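
The DDMI machinery itself is not reproduced here, but the idea of reading model parameters off as leading expansion coefficients in an ansatz space can be illustrated with a generic sparse-regression identification in the spirit of methods such as SINDy; the toy system and dictionary below are assumptions.

  import numpy as np

  # Trajectory of a toy 1-d system dx/dt = -x + 0.5*x^2, pretended unknown.
  t = np.linspace(0, 2, 400)
  dt = t[1] - t[0]
  x = np.empty_like(t)
  x[0] = 0.8
  for i in range(len(t) - 1):             # simple Euler integration
      x[i + 1] = x[i] + dt * (-x[i] + 0.5 * x[i] ** 2)

  dxdt = np.gradient(x, dt)               # numerical time derivative
  library = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])   # ansatz space
  coeffs, *_ = np.linalg.lstsq(library, dxdt, rcond=None)
  # The leading coefficients single out the active terms of the model.
  print(np.round(coeffs, 3))              # approximately [0, -1, 0.5, 0]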

DDMI is of particular importance for modelling the effect of therapeutic interventions on cellular dynamics (as in the context of regenerative medicine). For such processes almost no reliable models are available, so DDMI based on time-resolved models of the system, including the external influences, could be of enormous help.

In particular, we aim at identifying approximate eigenfunctions of transfer operators associated with high-dimensional systems from simulation or measurement data (see the sketch after the following list) for:

  • Dimensionality reduction.
  • Detection of metastable sets.
  • Separation of time-scales.
  • System identification.
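
A minimal sketch of this idea, assuming a toy one-dimensional system and a monomial dictionary: extended dynamic mode decomposition (EDMD) approximates the transfer/Koopman operator on the span of the dictionary, and its eigenpairs yield approximate eigenfunctions and the associated time scales.

  import numpy as np

  rng = np.random.default_rng(5)
  x = rng.uniform(-1, 1, size=2000)                  # sampled states
  y = 0.9 * x + 0.01 * rng.normal(size=2000)         # one step of toy dynamics

  def dictionary(v):                                  # monomials up to degree 3
      return np.column_stack([np.ones_like(v), v, v ** 2, v ** 3])

  # EDMD: least-squares representation of the operator on the dictionary span.
  K = np.linalg.pinv(dictionary(x)) @ dictionary(y)
  evals, evecs = np.linalg.eig(K)
  # Approximate eigenfunctions on the data: phi_j(x) = dictionary(x) @ evecs[:, j];
  # the eigenvalues encode the associated time scales.
  print(np.round(np.sort(evals.real)[::-1], 3))       # approx [1, 0.9, 0.81, 0.729]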
 
Publications