For HPC applications, containers are a great way to install software and ensure portability across clusters and clouds. Containers can encapsulate complete, pre-tested environments, allowing users to mix and match different applications and versions without conflict. Software providers such as ESI (OpenFOAM), UberCloud, and others are increasingly packaging software in containers for ease of deployment.
Fortunately, native support for Docker in Univa Grid Engine (UGE) makes running containerized applications a breeze. In this article, I’ll explain how to deploy, run, and manage containerized workloads on your UGE cluster and provide some insights into how UGE manages containerized workloads behind the scenes. Readers looking for an introduction to containers may want to check out Fritz Ferstl’s article Containers for HPC Cloud.
If you don’t already have Docker installed on your compute hosts, this is a good place to start. Adding Docker shouldn’t break existing applications but testing things first on a non-production host is always a good idea. Adding Docker is like adding a Java runtime. The Docker Engine provides runtime support for containerized workloads that need it.
As a word of caution, don’t assume you can necessarily install the latest version of Docker on your cluster hosts. Docker APIs change like the weather, so you’ll want to download a stable Docker version supported with your version of UGE.
In this example, I’m running UGE v8.5.4 on CentOS 7 on Amazon Web Services (AWS). I used the AWS marketplace as an easy way to install a Grid Engine cluster. Consulting the UGE release notes, Docker version 17.03 is the latest supported Docker for UGE 8.5.4, so I’ll be using the free Docker Community Edition package (docker-ce-17.03.0.ce-1.el7.centos.x86_64) on my cluster compute hosts.
Once you have the Docker repositories configured (I’ll cover this shortly), you can use the yum list command to show available Docker versions. The second column of the output shows the version string for each Docker release; make a note of the one you need.
To show the available Docker 17.03 packages, I used this command:
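A command along these lines lists the available 17.03 packages (the exact package name and repo layout may differ on your system):

```shell
# List every docker-ce version in the configured repos, newest first,
# and keep only the 17.03 releases
yum list docker-ce --showduplicates | sort -r | grep 17.03
```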
Depending on your OS you might need to use different commands. The Docker CE documentation has detailed instructions for other Linux versions including Debian, Fedora, and Ubuntu.
Since I need to install Docker on multiple hosts, it makes sense to build a script that installs Docker rather than typing the same commands on each host. The following script runs as root and (for me at least) properly installs Docker on my CentOS 7 UGE compute hosts.
The script performs these steps:
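A minimal sketch of such a script, assuming the standard Docker CE yum repository URL and the 17.03.0 package names (verify both against your own yum list output):

```shell
#!/bin/bash
# Install Docker CE 17.03 on a CentOS 7 UGE compute host. Run as root.
set -e

# Install yum utilities and add the Docker CE repository
yum install -y yum-utils
yum-config-manager --add-repo \
    https://download.docker.com/linux/centos/docker-ce.repo

# Install docker-ce and docker-ce-selinux together, disabling the
# "obsoletes" check so this older version installs cleanly
yum install -y --setopt=obsoletes=0 \
    docker-ce-17.03.0.ce-1.el7.centos \
    docker-ce-selinux-17.03.0.ce-1.el7.centos
```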
The install command probably needs some explanation. Knowledgeable readers (the only kind that read this blog) probably expected to see something like “yum install docker-ce-17.03.0.ce-1.el7.centos”. This was my first guess too.
Just to prove nothing is ever easy, I learned that installing older versions of Docker CE can be a little glitchy. A new “obsoletes” restriction was introduced in docker-ce 17.06.0, and for whatever reason, the yum repo applies the restriction to all versions of Docker. To avoid an error message (Package docker-ce-selinux is obsoleted by docker-ce ….) that prevented Docker from installing I needed to manually set obsoletes to false on the yum command line and download docker-ce and docker-ce-selinux together. The issue is explained in detail here.
You’ll need to watch for this detail. It’s always the little things that cause the biggest headaches!
Once Docker is installed you can start Docker and verify that it is working by running a few Docker commands and running the hello-world container from Docker Hub. It’s a good idea to use systemctl to enable Docker so that it will start automatically when the node boots. You’ll probably want to add these commands to your own installation script.
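The start, enable, and verification steps look roughly like this:

```shell
# Start Docker now, and enable it so it starts automatically at boot
systemctl start docker
systemctl enable docker

# Confirm the daemon responds, then run a test container from Docker Hub
docker version
docker run hello-world
```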
One more detail – to allow regular users to run Docker commands, you’ll want to add each of the users on your cluster to the docker group. The command below adds the user bill to the docker group.
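Run as root (or via sudo):

```shell
# Add the user "bill" to the docker group so he can run docker commands;
# bill must log out and back in for the new group to take effect
usermod -aG docker bill
```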
Now we’ve reached the cool part. If you’ve installed Docker correctly, you don’t need to do anything else. Grid Engine should already know about Docker and any Docker images installed on each host.
The command below executed from a Grid Engine node illustrates this:
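Something like the following (the resource names match UGE's defaults; output will vary with your hosts and images):

```shell
# Show the docker flag and locally available images on each host
qhost -F docker,docker_images
```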
For those who don’t administer Grid Engine for a living: qhost shows the compute hosts in the cluster. I have a master host and two compute hosts, and you can see the AWS host names. The -F switch shows the value of specific resources on each host.
UGE v8.4.0 added two default resources to help manage Docker workloads: docker, a Boolean indicating whether the Docker daemon is installed and running on a host, and docker_images, a comma-delimited list of the container images available locally on that host.
Assuming your UGE environment is recognizing that Docker is installed on each host and seeing available images, you’re done! You’ve Dockerized your cluster and can start submitting and managing containerized applications.
Univa Grid Engine makes it easy to run jobs inside or outside of containers. To illustrate how this works, I’ve created a simple script called testjob.sh. The script does a few simple things like determining whether it’s running in a container and reporting its hostname and IP address. I added a sleep command because I wanted the script to run long enough that I could run Docker commands against the running container. In case readers are wondering, checking for the presence of the hidden file .dockerenv is a useful trick to tell whether your script is running in a container.
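A sketch of testjob.sh reconstructed from that description (the exact script, sleep duration, and output wording are assumptions):

```shell
#!/bin/bash
# testjob.sh -- report where we are running, then pause

# The hidden file /.dockerenv exists only inside a Docker container
in_container() {
    [ -f /.dockerenv ]
}

if in_container; then
    echo "Running inside a Docker container"
else
    echo "Running directly on the host"
fi

echo "Hostname: $(hostname)"
echo "IP address: $(hostname -I 2>/dev/null || echo unknown)"

if [ -n "$JOB_ID" ]; then
    # JOB_ID is set by Grid Engine; sleep so there is time to run
    # docker commands against the running container
    sleep 60
fi
```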
I submit this script to Grid Engine as a normal, non-container job:
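The submission itself is a plain qsub:

```shell
# Submit testjob.sh as an ordinary, non-container job
qsub testjob.sh
```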
The script is assigned a job-ID (31) and gets dispatched to one of the compute hosts. The job output is logged in the user’s home directory, and we see the output of the script. As expected, the job runs in the real world (as opposed to in a Docker container) on one of our AWS machine instances.
To run the job in a container, the process is almost identical. I just need to tell UGE that we want to use a Docker container and specify the Docker image to use. To do this, I use the -l switch (lowercase L) on the command line to request two resources: docker and docker_images. This will select hosts with the docker resource set to true and hosts where the list of available images contains our desired Docker image (centos:latest). We use wildcards to match the image name against the longer comma-delimited list of images available on each host. If the image is not available on a host, UGE can pull the image for you automatically, but for performance reasons, it is preferable to run on a host that already has the image stored locally.
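The submission looks roughly like this (syntax per UGE's Docker integration; adjust the image name to taste):

```shell
# Request a Docker-capable host whose local image list contains
# centos:latest; the wildcards match within the comma-delimited list
qsub -l docker,docker_images="*centos:latest*" testjob.sh
```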
From the Grid Engine user’s perspective, everything works the same way. Users can delete or manipulate container jobs like any other job. The containerized job shows up as UGE job 32 and runs in a container on one of our AWS hosts.
If I monitor Docker on the execution host, I see that a Docker container has been started based on the centos image. As a Grid Engine user, this is handled transparently for me, but it’s nice to know what’s going on.
After the job completes, I see from the job’s output file that the job ran in the container shown in the docker command line (4539e0b94529).
In the example above, I knew that one of the compute hosts already had the required Docker image (centos:latest). Often, a needed image won’t be present on any cluster hosts. UGE can automatically download required images, but to do this, we need to use a soft resource request. A soft resource request indicates to UGE that the image is “nice to have”, but not necessary to schedule the job on a host. In the example below, we specify a different Docker image (ubuntu:14.04) that we know is not available on either cluster host, and make its presence a soft request instead of a hard request.
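Using qsub's -soft switch, the request might look like this (a sketch; -soft applies to the resource requests that follow it):

```shell
# Hard-request a Docker host, but soft-request the ubuntu:14.04 image
# so the job can still be scheduled and the image pulled automatically
qsub -l docker -soft -l docker_images="*ubuntu:14.04*" testjob.sh
```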
UGE attempts to find a host with the needed ubuntu image, but when none are available it schedules the job to a host fulfilling the hard resource requirement (docker), and UGE automatically triggers the docker daemon to download the needed image and start the container. Re-running the qhost command shows that our first compute host now has the needed image and the job runs as before.
This is an important feature because it means that users can guarantee that their containerized jobs will run even when a required Docker image is not available on compute hosts.
To accomplish all this, Grid Engine did some clever things behind the scenes. First, because this was not a binary job, Grid Engine had to transfer the script from the submission host to the execution host. From there, the script was copied into a spool directory.
For the container to be able to see the script, the spool directory needs to be bound (a Docker term) to the container. The files in $SGE_ROOT are also bound to the container, and UGE automatically detects any other directories that may be required and binds them under the subdirectory /uge_mnt inside the container. These include the user’s home directory, so that job output can be written where the user expects it, along with any directories passed via the -o or -e switches on the qsub command line.
The docker inspect command gives us visibility into what happens when the job runs. I wanted to see details about the job, so I saved the output from docker inspect to a file while the job’s container was running, as shown:
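On the execution host, something along these lines (the container ID is the one reported in the job output; the file name is arbitrary):

```shell
# Find the running container, then dump its full configuration to a file
docker ps
docker inspect 4539e0b94529 > job_inspect.json
```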
There is too much detail in the docker inspect command to provide the full output, so I’ve abbreviated it to show a few items of interest.
First, note that when the Docker job runs, the entry point is the sge_container_shepherd program, which essentially “shepherds” the job along as it runs inside the container. This is one of the reasons that the Grid Engine binaries need to be available inside the container, bound under /uge_mnt. Other bindings map /var, /opt, and /home/bill (our job ran as bill), directories that need to be accessible from within the container.
The working directory is set to the spool directory for the job on the host and other information of relevance to the Grid Engine job is stored in Docker labels.
From a Linux administrator’s perspective, understanding the process tree on the compute host is also instructive. The output of pstree (or ps auxf) is too verbose to show fully, but a stripped-down version of the process hierarchy is shown below.
Normally, when a Grid Engine user submits a non-containerized job, the process hierarchy on the execution host looks something like this:
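A simplified, illustrative hierarchy (actual process names and arguments will vary):

```
sge_execd
 └── sge_shepherd-31
      └── bash ./testjob.sh      (runs as the submitting user)
           └── sleep 60
```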
The Grid Engine jobs are children of the sge_execd process on the execution host and execution is managed by a sge_shepherd process. The actual workloads run under the user ID of the user that submitted the job.
When the same job is run as a container job, the processes that comprise the job are children of the Docker daemon. In this view, we see that the sge_container_shepherd process running inside the container is the parent of the actual job.
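Again simplified and illustrative (daemon process names differ between Docker versions):

```
dockerd
 └── docker-containerd
      └── <container>
           └── sge_container_shepherd
                └── bash ./testjob.sh
                     └── sleep 60
```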
Often a UGE job will want to manipulate data in a specific directory, for example, an NFS share accessible from all compute hosts. Directories can be bound manually into the container using Docker’s HOST-DIR:CONTAINER-DIR format, passing Docker options via the -xd switch.
On the compute node, a directory called /nfs_share might contain shared data. In this case we can bind the directory /data in the docker container to the shared /nfs_share file system visible to the docker host. The path passed on the UGE command line needs to refer to the path visible within the container.
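A sketch of such a submission, assuming the /nfs_share and /data paths described above:

```shell
# Bind the host's /nfs_share into the container as /data; the job itself
# must use the container-visible path /data
qsub -l docker,docker_images="*centos:latest*" \
     -xd "-v /nfs_share:/data" testjob.sh
```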
The examples so far have dealt with scripts as opposed to binaries. Binaries are easier in some ways because the command invoked is assumed to already reside within the container. It’s a good idea when starting a binary in a Docker container to specify the shell that should be used to start it. Otherwise, the shell may default to csh, which is often not present in Docker containers. An example is shown below:
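For instance, using /bin/hostname as a stand-in binary:

```shell
# Run a binary (-b y) inside the container, explicitly requesting
# /bin/sh as the shell since csh is often absent from container images
qsub -l docker,docker_images="*centos:latest*" -b y -S /bin/sh /bin/hostname
```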
There are many more features to the UGE Docker integration, including support for array jobs, MPI parallel jobs, and access to GPU devices. Also, Grid Engine can be used to launch and manage containers that package long-running services where the entry point is built into the container image. We’ll cover some of these other topics in a follow-on article.
At Univa we’ve been amassing a lot of experience running containers in production on Grid Engine clusters. If you have any comments or questions about this article, I’d love to get your feedback.