One of the main tenets of biobanking is the digitization of our genomic information for archival and analysis. The scale of the storage requirements for genomic information is huge: each full human genome comprises roughly three billion base pairs. Beyond storage, the analysis of genomic information will require both massively parallel computing infrastructure and data-intensive computing tools and services if it is to complete in reasonable time. This project will address the problems of secure storage of and access to data, efficient analysis of data, and the inter-connection of biobanks.

In the above diagram, we can see how traditional sequencing and analysis of DNA is performed in a pipelined architecture. A next-generation sequencing machine produces a file of roughly 50 GB containing millions of ~100-byte reads from a genome, and this file is passed on to the first stage of the pipeline, where the reads are aligned against the reference human genome. After alignment, several further stages of analysis of the sequenced genome can be performed, but these stages have traditionally been run on desktop machines and require many days to complete.

In this project we will parallelize as much of the analysis pipeline as possible using data-intensive computing frameworks such as MapReduce.
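As an illustration of the MapReduce pattern applied to this domain, the sketch below counts aligned reads per chromosome: the map step emits a `(chromosome, 1)` pair for each alignment, the shuffle groups pairs by key, and the reduce step sums the counts. This is a minimal single-machine simulation under assumed names and toy input, not the project's actual pipeline; a framework such as Hadoop would run the same map and reduce functions across many machines.

```python
from collections import defaultdict

def map_read(alignment):
    """Map step: emit (chromosome, 1) for each aligned read."""
    chrom, _pos = alignment
    yield chrom, 1

def reduce_counts(chrom, values):
    """Reduce step: sum all counts emitted for one chromosome."""
    return chrom, sum(values)

def run_mapreduce(alignments):
    # Shuffle: group the mappers' output by key before reducing.
    groups = defaultdict(list)
    for aln in alignments:
        for key, value in map_read(aln):
            groups[key].append(value)
    return dict(reduce_counts(k, v) for k, v in groups.items())

# Toy alignments as (chromosome, position) pairs.
alignments = [("chr1", 100), ("chr2", 5000), ("chr1", 250)]
counts = run_mapreduce(alignments)  # -> {'chr1': 2, 'chr2': 1}
```

Because each read is mapped independently and reduction is per key, the workload partitions naturally across nodes, which is what makes read-level analysis a good fit for data-intensive frameworks.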