Attacking the Biobank Bottleneck

As of 2012, a fundamental shift is under way in human genomics. A huge wave of big data is approaching, driven by the falling cost of sequencing genomic data, which has been halving every 4 months since 2004. Biobanks, which store and catalogue human biological material, are not prepared to handle this wave of data. The result is a biobank bottleneck: a lack of platform support for the storage, analysis and interconnection of the massive amounts of human genomic data to come.

The main research challenges we will address include:

  • the definition of a regulatory framework and a data model for biobank data sharing,
  • the development of a scalable, highly available storage infrastructure with support for strongly consistent data,
  • a cross-cutting security platform that ensures data confidentiality, data integrity, and data access auditing,
  • data-intensive computing workflows for aligning, clustering, aggregating, compressing and anonymizing sequence data,
  • the interconnection of biobanks, while also leveraging the storage and processing capacity of public clouds,
  • the validation of our system on real-world, parallelized analysis pipelines that facilitate the biological interpretation of genomic data,
  • and the integration of these components into a platform as a service (PaaS).