Scalable Scientific Workflow Execution in BiobankCloud

With Next Generation Sequencing (NGS) data being produced at ever faster rates and lower costs, biobanks need to exploit data parallelism in their analysis workflows to deliver results in a timely manner. The BiobankCloud PaaS provides a workflow language and a distributed workflow execution environment running on Apache Hadoop. This allows the BiobankCloud PaaS to scale to the data generation rates anticipated for the near future. Moreover, because it wraps existing command line tools and libraries, the BiobankCloud PaaS is not limited to a fixed selection of highly specialized analysis pipelines. Rather, its workflow specification language is general enough to express workflows from any scientific application domain, and it can be easily adapted as software changes and new tools and libraries are adopted in the NGS ecosystem.

Workflow Execution on Hadoop

The timely analysis of large-scale data sets makes it necessary to exploit parallelism and distributed computation. The Hi-WAY scientific workflow application master can execute workflows written in different languages on any Hadoop installation, building on the resource management and access control of the Hadoop ecosystem. Through a selection of schedulers, Hi-WAY is able to adapt to heterogeneous and dynamically changing compute resources. Furthermore, because the NGS data is stored in HDFS, Hadoop's distributed file system, computation can be co-located with the data, which minimizes network bandwidth usage.
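The idea of co-locating computation with data can be illustrated with a minimal sketch. It is not Hi-WAY's scheduler: the function name pick_node, the block identifiers, and the node names are hypothetical and chosen only for illustration. The sketch assumes that the scheduler knows which nodes hold a replica of each input block and simply prefers the free node with the most local blocks.

```python
def pick_node(input_blocks, block_locations, free_nodes):
    """Return the free node that stores the most input blocks locally.

    input_blocks:    block ids the task needs to read (hypothetical ids)
    block_locations: dict mapping block id -> set of nodes holding a replica
    free_nodes:      set of nodes that currently have spare capacity
    Returns None when no node is free.
    """
    if not free_nodes:
        return None

    def local_count(node):
        # Number of input blocks that would not have to cross the network.
        return sum(1 for b in input_blocks if node in block_locations.get(b, ()))

    # Sort first so that ties are broken deterministically by node name.
    return max(sorted(free_nodes), key=local_count)


if __name__ == "__main__":
    locations = {"b1": {"n1", "n2"}, "b2": {"n2"}, "b3": {"n3"}}
    print(pick_node(["b1", "b2"], locations, {"n1", "n2", "n3"}))
```

A real scheduler would additionally weigh node load, block sizes, and task runtime estimates; the sketch only shows why knowing block locations lets a task run where most of its input already resides.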

Workflow Specification

The workflow specification language for BiobankCloud allows the easy integration of NGS command line tools and libraries, providing a uniform way of calling them from the host specification language Cuneiform. Data-parallel structures can be expressed as maps, cross products, dot products, or combinations of them over large collections of input data. Its functional style makes the specification language readable and extensible. The workflows in BiobankCloud cover a number of NGS use cases, demonstrating how NGS workflows can be made scalable and how different tools and libraries can be combined to form a complete analysis run.
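The three data-parallel patterns named above can be sketched in a few lines. The example is written in Python, not in Cuneiform syntax, and the task function align as well as the sample and reference names are hypothetical stand-ins for wrapped command line tools and their inputs. Each element of a result list represents one independent task invocation that an execution engine could run in parallel.

```python
from itertools import product


def align(sample, reference="ref"):
    # Stand-in for a wrapped command line tool (hypothetical).
    return f"align({sample},{reference})"


def map_over(task, items):
    # Map: apply the task to every element of one collection.
    return [task(x) for x in items]


def cross_product(task, xs, ys):
    # Cross product: apply the task to every combination of xs and ys.
    return [task(x, y) for x, y in product(xs, ys)]


def dot_product(task, xs, ys):
    # Dot product: pair the i-th element of xs with the i-th element of ys.
    if len(xs) != len(ys):
        raise ValueError("dot product needs collections of equal length")
    return [task(x, y) for x, y in zip(xs, ys)]


if __name__ == "__main__":
    samples = ["s1", "s2"]
    references = ["hg19", "hg38"]
    print(map_over(align, samples))                   # 2 invocations
    print(cross_product(align, samples, references))  # 2 * 2 = 4 invocations
    print(dot_product(align, samples, references))    # 2 pairwise invocations
```

For n samples and m references, the map yields n invocations, the cross product n * m, and the dot product n (with n equal to m). Because the invocations within one pattern do not depend on each other, they can be distributed across the available compute nodes.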