Next-generation sequencing (NGS) is a fundamental practice in bioinformatics. NGS pipelines are complex, multi-step processes involving many different tools and intermediate data formats. With easy access to cloud infrastructure and containerized applications that are portable across clouds, users are increasingly extending pipelines to the cloud.
In this first article in a two-part series, we’ll discuss Nextflow, a leading tool for managing bioinformatics workflows, and show how it can be used with Univa Grid Engine (UGE) and Navops Launch to build a hybrid cloud infrastructure that is both framework-agnostic and cloud-agnostic.
In a subsequent article, we’ll provide more details and explain how Nextflow users can set up transparent cloud bursting for genomic pipelines using Univa Grid Engine and Navops Launch.
On a cluster, bioinformatics pipelines can manifest themselves as hundreds or even thousands of discrete jobs. To make matters worse, many users run different pipelines simultaneously against different datasets, and pipelines change constantly as new tools and more effective analysis techniques are identified.
In the past, genomic pipelines were managed using custom scripts written in Bash or Python. While functional, these custom solutions tend to be “brittle” and hard to maintain. Scripted workflows are often complex because challenges like synchronizing multi-step flows, managing data, and handling run-time exceptions were left to the author of the script. Small changes to data formats, tools, or the environment could result in scripts failing. Given their complexity, often the original author of the script was the only person able to troubleshoot issues and resolve problems efficiently. A better practice is to use a tool purpose-built for distributed, collaborative genomic workflows.
Nextflow is a popular workflow system designed to manage the orchestration and deployment of containerized workloads at scale, across clouds and clusters in a portable and reproducible manner. It offers several advantages for managing bioinformatics applications and data analysis pipelines:
Nextflow is a free and open source software solution for application workflows developed by the Centre for Genomic Regulation (CRG). Seqera Labs was recently incorporated as a spin-off from the CRG to provide enterprise-level support and professional services around the Nextflow platform, as well as to explore new, innovative products to power the next generation of big data analysis applications.
The following Nextflow example illustrates Nextflow’s native support for UGE clusters using a typical “scatter-gather” pattern. In this example, an input FASTA file is divided into chunks of arbitrary sizes, and each chunk is processed with BLAST as a discrete Grid Engine job. Once all of the BLAST jobs complete, all the sequences for the top hits are collected and merged into a results file.
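A minimal sketch of such a scatter-gather workflow is shown here. It is modeled on the public blast-example project but is not a copy of the original listing: it assumes a recent Nextflow release where DSL2 syntax is the default, and the directory layout under `params.query` and `params.db`, as well as the channel variable names, are assumptions chosen for illustration.

```groovy
#!/usr/bin/env nextflow
// Sketch only: assumes a Nextflow release where DSL2 is the default syntax.

// Input parameters; the paths are assumptions modeled on the blast-example layout.
params.query     = "$projectDir/data/sample.fa"
params.db        = "$projectDir/blast-db/pdb/tiny"
params.out       = 'result.txt'
params.chunkSize = 5   // sequences per chunk, i.e. per BLAST task

// Scatter step: one task (and so one Grid Engine job) per FASTA chunk.
process blast {
    input:
    path 'query.fa'
    path db_dir
    val db_name

    output:
    path 'top_hits'

    script:
    """
    blastp -db ${db_dir}/${db_name} -query query.fa -outfmt 6 > blast_result
    head -n 10 blast_result | cut -f 2 > top_hits
    """
}

// Fetch the sequences for the top hits found in each chunk.
process extract {
    input:
    path 'top_hits'
    path db_dir
    val db_name

    output:
    path 'sequences'

    script:
    """
    blastdbcmd -db ${db_dir}/${db_name} -entry_batch top_hits | head -n 10 > sequences
    """
}

workflow {
    db_dir  = file(params.db).parent
    db_name = file(params.db).name

    // Split the query file into chunks of params.chunkSize sequences each.
    chunks = Channel
        .fromPath(params.query)
        .splitFasta(by: params.chunkSize, file: true)

    hits      = blast(chunks, db_dir, db_name)
    sequences = extract(hits, db_dir, db_name)

    // Gather step: merge all per-chunk results into a single file.
    sequences
        .collectFile(name: params.out)
        .view { f -> "matching sequences:\n${f.text}" }
}
```

Nothing in the script mentions Grid Engine. Which executor runs each task is decided entirely by the configuration file, which is what keeps the workflow portable.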
You can find the source data and reference data for this workflow in the nextflow-io/blast-example repository on GitHub.
Installing BLAST on every cluster node would be tedious, so we use a container image published by the Nextflow project on Docker Hub (nextflow/examples) with BLAST pre-installed. To run the BLAST workflow on Univa Grid Engine inside this Docker container, we provide the following nextflow.config file:
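A minimal sketch of what such a nextflow.config might contain is given here. The `sge` executor, the `process.container` setting, and the `docker` scope are standard Nextflow configuration options; the queue name `all.q` is an assumption and should be replaced with a queue that exists on your cluster.

```groovy
// nextflow.config (sketch; the queue name is a placeholder)
process {
    executor  = 'sge'                // submit each task as a Grid Engine job
    queue     = 'all.q'              // hypothetical queue name
    container = 'nextflow/examples'  // image with BLAST pre-installed
}

docker {
    enabled = true                   // run every task inside the container
}
```

With this approach, Docker must be available on every execution host that the queue can dispatch to.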
We modified the chunkSize parameter to 5 in this example (default was 100), and provided a slightly larger sample.fa file (50 samples instead of 5) to ensure that the small dataset would result in multiple BLAST jobs. In our example, each UGE host has Docker CE installed.
When the workflow above (in the file main.nf) is run from the Univa Grid Engine master host, the results are as shown below. Each Nextflow process corresponds to a Univa Grid Engine job.
The Nextflow user may not realize they are running on a cluster, but from a Univa Grid Engine perspective, we see containerized jobs corresponding to each process step executing on the cluster.
For many applications, running in the cloud is compelling. This is especially true for organizations that need only periodic access to large amounts of infrastructure, or that need specialized or expensive cluster nodes with the latest GPU hardware.
For sites that can keep clusters busy on a sustained basis, on-premises infrastructure is usually less expensive than running workloads in the cloud. For this reason, site managers often take a hybrid approach, sizing local clusters to match their baseline needs and “bursting” to cloud resources when it makes sense from a cost-benefit standpoint.
The key to making bursting practical is that it needs to be transparent to applications and workflows. Also, clusters need to be elastic and flex up and down quickly based on changing workload demand. Nobody wants to pay for infrastructure sitting idle, so cloud capacity needs to be provisioned automatically and taken down quickly when no longer needed.
To address bursting requirements, Nextflow provides native executors for AWS Batch and the Google Genomics Pipelines service. A good writeup explaining how to use the AWS Batch service can be found here. With the AWS Batch executor, Nextflow submits each containerized process to AWS Batch directly, so a workflow can burst to the cloud without changes to the pipeline script.
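For comparison, a configuration that targets the AWS Batch executor might look like the sketch here. The `awsbatch` executor, `aws.region`, and `workDir` are standard Nextflow settings; the job queue name, bucket, and region are hypothetical placeholders, and the matching AWS Batch compute environment and job queue are assumed to exist already.

```groovy
// nextflow.config (sketch for AWS Batch; all names are placeholders)
process {
    executor  = 'awsbatch'
    queue     = 'my-batch-queue'     // hypothetical AWS Batch job queue
    container = 'nextflow/examples'
}

aws {
    region = 'us-east-1'             // hypothetical region
}

// AWS Batch tasks exchange intermediate files through S3.
workDir = 's3://my-bucket/work'      // hypothetical bucket
```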
While this is a good solution for many users, there are some potential drawbacks:
As an alternative to managing bursting within Nextflow, users can instead shift the management of cloud bursting to the underlying workload manager by using UGE together with Navops Launch.
From the perspective of Nextflow users, bursting becomes entirely transparent. Nextflow users or administrators simply configure the sge executor in nextflow.config to use a queue configured to burst to their preferred cloud provider as shown:
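A minimal sketch of such a configuration is given here, assuming the administrator has already set up a Grid Engine queue whose hosts are provisioned in the cloud on demand. The queue name `burst.q` is hypothetical; everything else is standard Nextflow configuration.

```groovy
// nextflow.config (sketch; 'burst.q' is a hypothetical queue name)
process {
    executor  = 'sge'
    queue     = 'burst.q'            // queue assumed to be configured for bursting
    container = 'nextflow/examples'
}

docker {
    enabled = true
}
```

From Nextflow's point of view, the only thing that changes is the queue name. Whether a given job lands on a local host or on a cloud instance is decided by the workload manager, not by the pipeline.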
There are multiple benefits to this approach:
You can learn more about commercially supported Nextflow and the integration with Univa Grid Engine at https://www.seqera.io.
Are you running pipelines using Nextflow and other workflow tools on-premises or in the cloud? We’d love to hear from you and learn from your experiences.