Sparse Compressed Sensing based Classifiers for -omics mass-data


This research is carried out in the framework of Matheon, supported by the Einstein Foundation Berlin.


Project heads: Tim Conrad (FU), Gitta Kutyniok (TU), Christof Schütte (FU)
Staff: Nada Cvetcovic, Martin Genzel
External cooperation: Jan Vybiral (Charles University Prague)

Project Background

Tumor diseases rank among the most frequent causes of death in Western countries, a situation that coincides with an incomplete understanding of the underlying pathogenic mechanisms and a lack of individual treatment options. Hence, early diagnosis of the disease and early relapse monitoring are currently the best available options to improve patient survival. This calls for two things: (1) identification of disease-specific sets of biological signals that reliably indicate a disease outbreak (or status) in an individual. We call these sets of disease-specific signals \emph{fingerprints} of a disease. And (2), development of new classification methods allowing for robust identification of these fingerprints in an individual's biological sample. In this project we will use -omics data sources, such as proteomics or genomics. The advantage of -omics data over classical (e.g. blood) markers is that, for example, a proteomics data set contains a snapshot of almost all proteins that are currently active in an individual, as opposed to the roughly 30 values analyzed in a classical blood test. Thus, it contains orders of magnitude more potential information and describes the medical state of an individual much more precisely. However, to date there is no gold standard for how to reliably and reproducibly analyze these huge data sets and find robust fingerprints that could be used for the ultimate task: (early) diagnostics of cancer.

Problems and (some) hope: -omics data is ultra-high-dimensional and very noisy, but only sparsely filled with information

Biological -omics data (e.g. proteomics or genomics data) is typically very large (millions of dimensions), which significantly increases the complexity of algorithms analyzing the parameter space, or even renders them infeasible. At the same time, this data exhibits a very particular structure: it is highly sparse. Thus the information content of this data is much lower than its actual dimension suggests, which is precisely the prerequisite for any dimension reduction with small loss of information.

However, the sparsity structure of this data is highly complex: not only do the large entries exhibit a particular clustering, with the amplitudes forming Gaussian-like shapes, but the noise affecting the signal is also by no means Gaussian (a customarily assumed property). In addition, across different sample sets, those clusters slightly differ in their locations, so they do not conform to standard patterns such as joint sparsity.

This means that although the data is highly sparse, both the sparsity structure and the noise distribution are non-standard. However, specifically adapted automatic dimension reduction strategies (avoiding cumbersome by-hand identification of significant values) such as compressed sensing have never been developed for, e.g., proteomics data. In our project, such a dimension reduction step will be a crucial ingredient and shall precede the analysis of the parameter space, thereby enabling low-complexity algorithms.
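To illustrate the basic compressed sensing principle referred to above, the following is a minimal numpy sketch (not the project's actual method): a sparse signal is compressed by a random Gaussian measurement matrix and then recovered by solving the LASSO problem with the iterative soft-thresholding algorithm (ISTA). All dimensions and the regularization parameter are illustrative assumptions, far smaller than real -omics data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy dimensions (assumption; real -omics data is far larger):
n, m, k = 400, 80, 5          # ambient dimension, measurements, sparsity

# Sparse "signal": k nonzero entries at random positions
x_true = np.zeros(n)
support = rng.choice(n, k, replace=False)
x_true[support] = rng.choice([-1.0, 1.0], k)

# Random Gaussian measurement matrix and compressed observations y = A x
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = A @ x_true

# ISTA: proximal gradient descent on the LASSO objective
#   min_x  0.5 * ||A x - y||^2 + lam * ||x||_1
lam = 0.05
L = np.linalg.norm(A, 2) ** 2     # Lipschitz constant of the gradient
x = np.zeros(n)
for _ in range(2000):
    grad = A.T @ (A @ x - y)      # gradient of the quadratic term
    z = x - grad / L
    x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold

# Positions of the k largest recovered amplitudes
recovered = np.argsort(np.abs(x))[-k:]
print(sorted(support.tolist()), sorted(recovered.tolist()))
```

Despite observing only m = 80 linear combinations of a 400-dimensional vector, the sparse support is recovered exactly; this is the dimension-reduction effect that the project aims to transfer to the much more complex sparsity and noise structure of -omics data.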

The major challenge in these applications is to extract a set of features, as small as possible, that accurately classifies the training examples.

The goal

In this project we aim to develop a new method that can be used to solve this task: the identification of minimal, yet robust fingerprints from very high-dimensional, noisy -omics data. Our method will be based on ideas from the areas of compressed sensing and machine learning.


We have developed a unified framework for the high-dimensional estimation of structured signals. Our framework has the advantage of being fairly general; in particular, it covers a large class of (non-)linear observation models and convex loss functions for signal recovery. Furthermore, we were able to show that the results of our first algorithmic development outperform competing approaches with respect to prediction accuracy, especially in the "very sparse model" setting, i.e., when seeking a model with only very few features (e.g., fewer than 10). These results are evaluated and validated in significant medical applications.
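As a rough illustration of the "very sparse model" setting, the following numpy sketch fits an L1-regularized logistic regression by proximal gradient descent on synthetic data in which only 4 out of 500 features carry the class label. This is a generic sparse classifier, not the project's framework; the data, dimensions, and regularization strength are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical synthetic data: 120 samples, 500 features, but the label
# depends on only 4 of them (a minimal "fingerprint"); pure illustration.
n_samples, n_features, k = 120, 500, 4
X = rng.standard_normal((n_samples, n_features))
w_true = np.zeros(n_features)
informative = rng.choice(n_features, k, replace=False)
w_true[informative] = 2.0
y = (X @ w_true + 0.1 * rng.standard_normal(n_samples) > 0).astype(float)

# Proximal gradient descent on the L1-regularized logistic loss:
#   min_w  (1/n) * sum_i log(1 + exp(-(2 y_i - 1) x_i^T w)) + lam * ||w||_1
lam = 0.1
step = 1.0 / (0.25 * np.linalg.norm(X, 2) ** 2 / n_samples)
w = np.zeros(n_features)
for _ in range(3000):
    p = 1.0 / (1.0 + np.exp(-X @ w))        # predicted probabilities
    grad = X.T @ (p - y) / n_samples        # gradient of the logistic loss
    z = w - step * grad
    w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold

selected = np.flatnonzero(w)                # surviving (nonzero) features
acc = np.mean(((1.0 / (1.0 + np.exp(-X @ w))) > 0.5) == (y > 0.5))
print(len(selected), acc)
```

The L1 penalty drives almost all of the 500 coefficients to exactly zero, so the classifier simultaneously predicts the label and selects a small candidate feature set, which is the kind of minimal, robust fingerprint the project is after.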