In many fields of research, workflow management systems are used to automate the execution of multi-step processes. Applications include bioinformatics pipelines, medical imaging and astronomy. The diversity of workflow tools can be a challenge: there are hundreds of different tools, implemented in different languages and optimized for different disciplines. Popular workflow tools in life sciences include Galaxy, BPipe, GATK, Snakemake, and NextFlow. With so many ways to describe and execute workflows, sharing them between organizations is difficult. Workflow management is an area crying out for some standardization.
In this article I’ll look at two workflow standardization efforts: the Common Workflow Language (CWL), a specification supported by several business and research organizations; and the Workflow Description Language (WDL) led by researchers from the Broad Institute.
The Common Workflow Language (CWL)
The Common Workflow Language is a broadly supported specification. It aims to make it easier for organizations to share data analysis workflows. The specification is maintained on GitHub along with cwltool, the open-source reference implementation. Multiple tools implement the CWL specification.
Toil is one of the tools that implements CWL. To run Toil workflows with Univa Grid Engine, you’ll need to install Toil on each cluster node, which is a little awkward: a Toil worker process on each node is responsible for running and monitoring the job steps that make up a Toil workflow. If you plan to run CWL workflows with Toil, make sure that you install Toil with the “cwl” extra feature ($ pip install 'toil[cwl]'). The toil-cwl-runner and cwltoil executables are then installed on each host. You can follow the procedure in the Toil documentation to run a basic CWL workflow called example.cwl.
Running a simple CWL workflow in Toil with Univa Grid Engine looks something like this:
$ export TOIL_GRIDENGINE_ARGS='-q batch.q'
$ cwltoil --batchSystem=sge --defaultMemory 100000000 --defaultCores 4 \
    example.cwl example-job.yaml
In addition to passing resource requirements on the command line, you can pass Univa Grid Engine command line arguments via the $TOIL_GRIDENGINE_ARGS environment variable.
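As a rough sketch, the same invocation can be scripted from Python, which makes it easier to vary the queue or the resource defaults from one run to the next. The helper names build_toil_invocation and run_workflow and their parameters are hypothetical; only the flags and the environment variable from the command above are taken from the example.

```python
import os
import subprocess


def build_toil_invocation(workflow, job_file, queue="batch.q",
                          default_memory=100000000, default_cores=4):
    """Return (argv, env) for the cwltoil command shown above.

    Hypothetical helper for illustration; it only assembles the flags
    and the TOIL_GRIDENGINE_ARGS variable used in this article.
    """
    argv = [
        "cwltoil",
        "--batchSystem=sge",
        "--defaultMemory", str(default_memory),
        "--defaultCores", str(default_cores),
        workflow,
        job_file,
    ]
    # Copy the current environment and add the Grid Engine arguments.
    env = dict(os.environ)
    env["TOIL_GRIDENGINE_ARGS"] = f"-q {queue}"
    return argv, env


def run_workflow(workflow, job_file, **options):
    """Run the workflow; raises CalledProcessError on a non-zero exit."""
    argv, env = build_toil_invocation(workflow, job_file, **options)
    return subprocess.run(argv, env=env, check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so this works without Toil.
    argv, env = build_toil_invocation("example.cwl", "example-job.yaml")
    print(" ".join(argv))
    print("TOIL_GRIDENGINE_ARGS=" + env["TOIL_GRIDENGINE_ARGS"])
```

Calling run_workflow("example.cwl", "example-job.yaml", queue="batch.q") would launch the workflow, assuming cwltoil is on the PATH of the submitting host.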
You can see details of how Toil interacts with Univa Grid Engine commands (qstat, qsub, qdel, etc.) by inspecting the Toil gridengine.py module on GitHub. Toil periodically polls Univa Grid Engine to monitor workload execution. One of Toil’s drawbacks is that resource requirements, such as memory and CPU cores, are set on the command line and apply to every step in the flow. The CWL 1.0 specification already provides the notion of runtime Requirements and Hints, so newer versions of the tool will likely address this limitation.
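To make the idea of per-step Hints concrete, here is a minimal sketch that builds a CWL v1.0 tool description with a ResourceRequirement hint and serializes it as JSON, which CWL accepts alongside YAML. The echo tool, the input name message, and the function name are illustrative assumptions; whether a given runner honors the hint when submitting to Grid Engine depends on the runner and its version.

```python
import json


def make_tool_with_hints(cores_min=4, ram_min_mib=2048):
    """Build a CWL v1.0 CommandLineTool carrying a resource hint.

    Illustrative only: the tool just echoes a string. In CWL,
    ResourceRequirement.ramMin is expressed in mebibytes.
    """
    return {
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        "baseCommand": "echo",
        # Hints are advisory; a runner may ignore them.
        "hints": [
            {
                "class": "ResourceRequirement",
                "coresMin": cores_min,
                "ramMin": ram_min_mib,
            }
        ],
        "inputs": {
            "message": {"type": "string", "inputBinding": {"position": 1}},
        },
        "outputs": [],
    }


def to_cwl_json(tool):
    """Serialize the tool description as a JSON-formatted CWL document."""
    return json.dumps(tool, indent=2)


if __name__ == "__main__":
    print(to_cwl_json(make_tool_with_hints()))
```

Moving the same entry from hints to requirements would make it mandatory rather than advisory, so a runner that cannot satisfy it should refuse to run the step.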
The Workflow Description Language (WDL)
The Workflow Description Language (pronounced “widdle”) is another open workflow specification developed at the Broad Institute. Though they both provide many similar capabilities, WDL is often described as less generalized than CWL. WDL is frequently used with GATK (Genome Analysis Toolkit), which was also developed by the Broad Institute.
Cromwell, written in Java, is integrated with Univa Grid Engine. This step-by-step example configures Cromwell for use with Univa Grid Engine (or Sun Grid Engine). Cromwell provides a “runtime” directive for individual job steps that specifies resources such as CPU, memory and Docker containers. It also has an active community of Grid Engine users and many Grid Engine examples, so it’s easy to get assistance with Cromwell.
Cromwell added support for CWL 1.0 in version 32. Similarly, Toil offers WDL support. It seems likely that as these open workflow standards gain traction, workflow tools will evolve to support multiple workflow specification languages.
The Bottom Line
Open workflow standards are coming, but it’s still early days. Because the standards are recent, support for CWL or WDL should be just one of many considerations in choosing a workflow tool. In an earlier article, we looked at NextFlow (commercially supported by seqera.io) and its integration with Univa Grid Engine. NextFlow provides rich functionality and allows inline coding of workflow steps using a user’s preferred scripting language. It’s easy to see how users who rely on these features might view standards-based workflow specifications as a step backward. As open workflow language specifications gain traction, leading pipeline tool providers will likely find ways to support the standards.
Your choice of a workflow or pipeline manager will likely come down to your choice of tools, who you are collaborating with, and existing repositories of workflows that you can leverage. There is good news for Univa Grid Engine users: there are plenty of choices since all of these popular tools feature a Grid Engine integration.
Are you using a workflow tool with Univa Grid Engine? I’d be interested in learning what you’re running and what your experience has been.