Attacking the Biobank Bottleneck

As of 2012, human genomics is undergoing a fundamental shift. A huge wave of big data is approaching, driven by the decreasing cost of sequencing genomic data, which has been halving every 4 months since 2004. Biobanks, which store and catalogue human biological material, are not prepared to handle this wave of data. The result is a biobank bottleneck: a lack of platform support for the storage, analysis and interconnection of the coming massive amounts of human genomic data.

The main research challenges we will address include:

  • the definition of the regulatory framework and data model for biobank data sharing,
  • the development of a scalable, highly available storage infrastructure with support for strongly consistent data,
  • a cross-cutting security platform that ensures data confidentiality, data integrity, and data access auditing,
  • data-intensive computing workflows for aligning, clustering, aggregating, compressing and anonymizing sequence data,
  • inter-connection of biobanks, while also leveraging the storage and processing capacity of public clouds,
  • validation of our system by evaluating real-world, parallelized analysis pipelines to facilitate the biological interpretation of genomic data,
  • and the integration of these components as a platform as a service (PaaS).

A PaaS for Biobanking

Our PaaS framework will be designed to run primarily on private cloud platforms. The stack will provide biobanks with platform services for the storage and analysis of sequence data, as well as for the interconnection of biobanks for data sharing.

BBMRI-ERIC & BiobankCloud

In its simplest definition, a biobank is a repository of biological material such as blood, tissue, saliva or DNA. Biobanks can be part of academic medical institutions or of pharmaceutical and biotechnology companies, and they deliver biological materials to researchers. Biobanks are key components of the biomedical research cycle. They store and preserve samples from donors who have given their consent to the storage and use of those samples for research. Researchers use the biological material to investigate human diseases, and clinicians use it for diagnosis and treatment.

Legal and Ethical Framework for BiobankCloud

The point of departure for the BiobankCloud is that no data containing personal information related to identifiable persons will be processed. In addition, it is a principle of the BiobankCloud platform not to process any data for which the data subject has not given informed consent when such consent is required. The cloud will handle two types of data: descriptive metadata and omics data. All data used will be aggregated, anonymized or coded in order to prevent any identification. At this point, only data from Charité, Germany will be used.
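As an illustration of what "coded" data can look like, the following minimal sketch replaces a donor identifier with a keyed pseudonym using HMAC-SHA256 from the Python standard library. It is not a description of the BiobankCloud implementation: the field name `donor_id`, the function names and the key handling are assumptions made for this example, and coding identifiers alone does not make genomic data anonymous.

```python
import hashlib
import hmac


def code_identifier(donor_id: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for donor_id (hypothetical helper).

    The same identifier and key always give the same code, so records
    can still be linked, but the code cannot be reversed without the key.
    """
    digest = hmac.new(secret_key, donor_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def code_record(record: dict, secret_key: bytes, id_field: str = "donor_id") -> dict:
    """Return a copy of record with its identifier field replaced by a code."""
    coded = dict(record)  # shallow copy, the input record is left untouched
    coded[id_field] = code_identifier(str(record[id_field]), secret_key)
    return coded


if __name__ == "__main__":
    # Assumption: in practice the key would be held separately from the data.
    key = b"example-key-kept-outside-the-cloud"
    sample = {"donor_id": "D-0001", "tissue": "blood"}
    print(code_record(sample, key))
```

Keeping the key with a trusted party outside the platform is what distinguishes coded data from anonymized data: the key holder can re-identify a donor if required, while the platform only ever sees the codes.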


One of the main tenets of biobanking is the digitization of our genomic information for archival and analysis. The storage requirements for genomic information are huge: each full human genome comprises about three billion base pairs. Beyond storage, the analysis of genomic information requires both massively parallel computing infrastructure and data-intensive computing tools and services to complete in reasonable time. This project will address the problems of secure storage of and secure access to data, efficient analysis of data, and the interconnection of biobanks.
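To give a feel for the scale, the back-of-envelope calculation below estimates the size of one genome under two sets of assumptions. The three billion base pairs come from the text; the 2-bit packing, the 30x sequencing coverage and the two bytes per sequenced base (one for the base, one for its quality score) are illustrative assumptions, not project figures.

```python
BASE_PAIRS = 3_000_000_000  # approximate human genome length, from the text


def packed_genome_bytes(base_pairs: int = BASE_PAIRS, bits_per_base: int = 2) -> int:
    """Bytes for one finished sequence, assuming A/C/G/T packed into 2 bits each."""
    return (base_pairs * bits_per_base + 7) // 8  # round up to whole bytes


def raw_read_bytes(base_pairs: int = BASE_PAIRS, coverage: int = 30,
                   bytes_per_base: int = 2) -> int:
    """Rough size of uncompressed sequencer output.

    Assumes every position is read `coverage` times and each sequenced
    base costs `bytes_per_base` bytes (base call plus quality score).
    """
    return base_pairs * coverage * bytes_per_base


if __name__ == "__main__":
    print(f"packed sequence: {packed_genome_bytes() / 1e6:.0f} MB")
    print(f"raw reads:       {raw_read_bytes() / 1e9:.0f} GB")
```

Under these assumptions a single packed sequence is modest in size, while the raw reads from which it is derived are orders of magnitude larger, and it is the latter that a biobank has to ingest, analyse and archive per donor.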


One of the innovative aspects of the BiobankCloud PaaS is the capability to interconnect several PaaS deployments in a cloud federation. This enables easy-to-use data sharing and allows the use of public clouds (e.g., Amazon S3, Azure Blob Storage) for storing data. This federation, dubbed Overbank, is implemented through a novel cloud-backed storage system called Charon. Furthermore, we want to give authorized bioinformaticians a “Dropbox-like” experience when accessing biobank datasets stored in Charon.

Scalable Scientific Workflow Execution in BiobankCloud

With Next Generation Sequencing (NGS) data produced at ever faster rates and lower costs, biobanks need to exploit data parallelism in their analysis workflows to deliver results in a timely manner. The BiobankCloud PaaS provides a workflow language and a distributed workflow execution environment running on Apache Hadoop. This allows the BiobankCloud PaaS to scale to the data generation rates anticipated for the near future. Moreover, by wrapping existing command line tools and libraries, the BiobankCloud PaaS is not limited to a fixed selection of highly specialized analysis pipelines. Rather, its workflow specification language is general enough to express workflows from any scientific application domain, and workflows can be easily adapted as software changes and as new tools and libraries are adopted in the NGS ecosystem.
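The idea of data parallelism over a wrapped command line tool can be sketched in a few lines of plain Python. This is neither the BiobankCloud workflow language nor Hadoop: it uses a local thread pool, and the "tool" is a stand-in one-line Python command that counts G and C bases on standard input. All names below are hypothetical and chosen for this example only.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Stand-in for an existing command line tool (assumption for this sketch):
# it reads sequences from stdin and prints the number of G and C bases.
GC_COUNT_CMD = [
    sys.executable, "-c",
    "import sys; s = sys.stdin.read().upper(); print(s.count('G') + s.count('C'))",
]


def run_tool(chunk: str) -> int:
    """Run the wrapped tool on one chunk of input and parse its output."""
    result = subprocess.run(GC_COUNT_CMD, input=chunk, capture_output=True,
                            text=True, check=True)
    return int(result.stdout.strip())


def split_reads(reads: list, n_chunks: int) -> list:
    """Distribute reads round-robin into n_chunks independent partitions."""
    return [reads[i::n_chunks] for i in range(n_chunks)]


def parallel_gc_count(reads: list, n_chunks: int = 4) -> int:
    """Split the input, process the partitions concurrently, merge the results."""
    chunks = ["\n".join(part) for part in split_reads(reads, n_chunks) if part]
    # Threads suffice here because the actual work runs in child processes.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        return sum(pool.map(run_tool, chunks))


if __name__ == "__main__":
    print(parallel_gc_count(["ACGT", "GGCC", "ATAT", "CCCC", "G"], n_chunks=3))
```

The pattern is split, apply, merge: because each partition is processed independently, the same structure can be distributed across a cluster, and swapping in a different tool only changes the wrapped command, not the workflow.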

Workflows in BiobankCloud

The BiobankCloud platform is able to execute arbitrarily complex analysis workflows. To this end, it features a dedicated workflow language, Cuneiform, optimized for large-scale NGS data analysis. Besides being programmable, the platform comes equipped with a set of proven analysis pipelines for the most common categories of NGS data analysis.