Understanding cell trajectories with sparse similarity learning

This research is carried out in the framework of Matheon supported by Einstein Foundation Berlin.

Project heads: Tim Conrad (FU/ZIB), Gitta Kutyniok (TU), Christof Schütte (FU/ZIB)
Staff: Nada Cvetcovic (FU)

Project Background

In living organisms, biological cells transition from one state to another. This happens during normal cell development (e.g. ageing) or is triggered by events such as diseases. The time-ordered set of state changes is called a trajectory. Identifying these cell trajectories is a crucial part of biomedical research, because it helps to understand changes at the gene and molecular level. It allows researchers to derive biological insights such as disease mechanisms and can lead to new biomedical discoveries and to advances in healthcare.

With the advent of single-cell experiments such as Drop-Seq or inDrop, individual gene expression profiles of thousands of cells can be measured in a single experiment. These large data sets make it possible to determine a cell's state based on its gene activity (cell expression profiles, CEPs), which can be expressed as a large feature vector representing its location in a high-dimensional state space. The main problem with these experiments is that the actual time information is lost and needs to be recovered. The state-of-the-art solution is to introduce the concept of pseudo-time, in which the cells are ordered by CEP similarity. To find robust and biologically meaningful trajectories based on CEPs, two main tasks have to be performed: (1) A CEP-based metric has to be learned to define pairwise distances between CEPs. (2) Given this metric, groups of similar CEPs and the transition paths between those groups have to be identified and analysed.
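To make the notion of pseudo-time concrete, the following is a minimal sketch and not the method developed in this project. It assumes that plain Euclidean distance stands in for the metric to be learned in task (1), that a root cell marking the start of the trajectory is known, and that cells are ordered by their shortest-path distance from this root in a k-nearest-neighbour graph. The function names (`pairwise_distances`, `knn_graph`, `pseudotime`) and the synthetic data are hypothetical and serve only as illustration.

```python
import heapq
import numpy as np


def pairwise_distances(X):
    """Euclidean distances between all rows of X (stand-in for a learned CEP metric)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives caused by round-off


def knn_graph(D, k):
    """Symmetric k-nearest-neighbour graph as a list of {neighbour: distance} dicts."""
    n = D.shape[0]
    adj = [dict() for _ in range(n)]
    for i in range(n):
        neighbours = [int(j) for j in np.argsort(D[i]) if j != i][:k]
        for j in neighbours:
            adj[i][j] = float(D[i, j])
            adj[j][i] = float(D[i, j])
    return adj


def pseudotime(X, root, k=10):
    """Shortest-path (Dijkstra) distance of every cell from the root cell."""
    adj = knn_graph(pairwise_distances(X), k)
    dist = np.full(X.shape[0], np.inf)
    dist[root] = 0.0
    heap = [(0.0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # outdated heap entry
        for v, w in adj[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


# Synthetic data (assumption): cells lie on a half circle in the first two
# "genes", all genes carry small noise, and the sampling order is shuffled.
rng = np.random.default_rng(0)
n_cells, n_genes = 200, 20
true_time = rng.permutation(np.linspace(0.0, 1.0, n_cells))
X = rng.normal(scale=0.005, size=(n_cells, n_genes))
X[:, 0] += np.cos(np.pi * true_time)
X[:, 1] += np.sin(np.pi * true_time)

root = int(np.argmin(true_time))  # assumed to be known, e.g. from a marker gene
pt = pseudotime(X, root)
order = np.argsort(pt)            # cells sorted along the recovered trajectory
```

In this sketch, `order` lists the cell indices from the start to the end of the recovered trajectory. Replacing `pairwise_distances` with a learned metric is where task (1) would enter, while the graph construction is a simple placeholder for task (2).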

The goal

The goal of this project is to develop a new, mathematically founded approach to similarity learning for high-dimensional biological data and to apply it to the trajectory identification problem for cell data. The planned applications aim at identifying cell trajectories directly from experimental data in the areas of ageing and cancer.

The key idea is to exploit the fact that high-dimensional biological data can be sparsely represented (with respect to different conditions), using ideas developed in our SPA framework (from ECMath project CH2). With this in hand, recent work on similarity learning for sparse high-dimensional data and progress made in the area of feature selection for multi-modal data (from ECMath project CH7) can be extended.