Attacking the Biobank Bottleneck

As of 2012, a quantum shift is happening in human genomics. A huge wave of big data is approaching, driven by the falling cost of sequencing genomic data, which has been halving every four months since 2004. Biobanks, which are used to store and catalogue human biological material, are not prepared to handle this wave of data - there is a biobank bottleneck: a lack of platform support for the storage, analysis and interconnection of the coming massive amounts of human genomic data.

The main research challenges we will address include:

  • the definition of a regulatory framework and data model for biobank data sharing,
  • the development of a scalable, highly available storage infrastructure with support for strongly consistent data,
  • a cross-cutting security platform that ensures data confidentiality, data integrity, and data access auditing,
  • data-intensive computing workflows for aligning, clustering, aggregating, compressing and anonymizing sequence data,
  • the inter-connection of biobanks, while also leveraging the storage and processing capacity of public clouds,
  • the validation of our system by evaluating real-world, parallelized analysis pipelines that facilitate the biological interpretation of genomic data,
  • and the integration of these components as a Platform-as-a-Service (PaaS).


One of the main tenets of biobanking is the digitization of our genomic information for archival and analysis. The scale of the storage requirements for genomic information is huge - each full human genome comprises roughly three billion base pairs. Beyond storage, the analysis of genomic information will require both massively parallel computing infrastructure and data-intensive computing tools and services to complete in reasonable time. This project will address the problems of secure storage of and secure access to data, efficient analysis of data, and the inter-connection of biobanks.
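To make the scale concrete, a back-of-envelope estimate of the raw storage for a single genome can be sketched as follows. The 2-bits-per-base packing and the 1-byte-per-base text encoding are illustrative assumptions, not figures from the project itself:

```python
# Back-of-envelope storage estimate for one human genome.
# Assumptions (illustrative only): the four bases A, C, G, T can be
# packed into 2 bits each, while plain-text formats use 1 byte per base.
BASE_PAIRS = 3_000_000_000  # ~3 billion base pairs per full genome

packed_bytes = BASE_PAIRS * 2 // 8   # 2 bits per base, bit-packed
ascii_bytes = BASE_PAIRS * 1         # 1 byte per base, e.g. plain text

print(f"packed: {packed_bytes / 1e9:.2f} GB")  # packed: 0.75 GB
print(f"ascii:  {ascii_bytes / 1e9:.2f} GB")   # ascii:  3.00 GB
```

Note that raw sequencer output is far larger than either figure, since each position is typically read many times over and carries per-base quality information.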

In the above diagram, we can see how traditional sequencing and analysis of DNA is performed in pipelined architectures. A next-generation sequencing machine produces a ~50GB file containing millions of ~100-byte reads from a genome, which is passed on to the first stage of the pipeline, where the reads are aligned with the reference human genome. After alignment, several further stages of analysis of the sequenced genome can be performed, but these stages traditionally run on desktop machines and take many days to complete.
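The data flow of the alignment stage can be sketched with a toy exact-match aligner. Production aligners use indexed, approximate matching over the full reference genome; the `str.find` lookup here is purely illustrative, and all names and sequences are invented for the example:

```python
# Toy sketch of the first pipeline stage: mapping short reads onto a
# reference sequence. Real aligners tolerate mismatches and use indexed
# search; exact substring lookup only illustrates the data flow.
def align_reads(reads, reference):
    """Map each read to its first offset in the reference, or None if unmapped."""
    alignments = {}
    for read in reads:
        pos = reference.find(read)
        alignments[read] = pos if pos >= 0 else None
    return alignments

reference = "ACGTACGTGGTTACGT"
reads = ["CGTG", "GGTT", "AAAA"]
print(align_reads(reads, reference))
# {'CGTG': 5, 'GGTT': 8, 'AAAA': None}
```

In the real pipeline each read also carries quality scores, and the output records alignment positions against a ~3-billion-base reference rather than a toy string.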

In this project we will parallelize as much of the analysis pipeline as possible using data-intensive computing frameworks such as MapReduce.
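As a minimal sketch of the MapReduce style of analysis the project targets, the example below counts k-mers (length-k subsequences) across a set of reads. It runs locally in plain Python; in a real deployment the map and reduce phases would be distributed across a cluster of workers, and k-mer counting stands in for whatever per-read analysis a given pipeline stage performs:

```python
from collections import defaultdict

# MapReduce-style k-mer counting over sequence reads, executed locally.
# The map phase emits (k-mer, 1) pairs per read; the shuffle groups pairs
# by key; the reduce phase sums the counts for each k-mer.

def map_phase(read, k=4):
    """Emit (k-mer, 1) pairs from a single read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]

def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts for each k-mer."""
    return {kmer: sum(counts) for kmer, counts in grouped.items()}

reads = ["ACGTACGT", "TACGTTTT"]
pairs = [pair for read in reads for pair in map_phase(read)]
counts = reduce_phase(shuffle(pairs))
print(counts["ACGT"])  # 3
```

Because each read is mapped independently, the map phase parallelizes trivially across the millions of reads in a sequencing run, which is what makes this programming model a good fit for the pipeline stages described above.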