GPUs play an important role in HPC, accelerating applications from molecular dynamics to deep learning. In previous articles, we’ve discussed how Univa Grid Engine supports efficient scheduling of GPU resources. We’ve also covered Univa Grid Engine support for Docker and Singularity. In this article, I’ll put these concepts together and explain how NVIDIA Docker simplifies the deployment of GPU-enabled, containerized applications in clustered environments.
Most readers appreciate the role that Docker plays in making applications portable. As a reminder, Docker provides a convenient way to package up applications along with dependencies such as binaries and libraries so that they are portable across any host running the Docker Engine. Major Linux distributions have supported Docker since 2014. Univa Grid Engine extends these capabilities to clustered environments, enabling users to transparently submit, manage, and monitor containerized applications just like any other workload. On our behalf, Grid Engine manages details like placing jobs optimally, prioritizing workloads, handling exceptions, and ensuring that required Docker images are available on hosts with appropriate resources.
One of the challenges of running GPU-aware applications inside containers is that containers are meant to be hardware agnostic. CUDA is NVIDIA’s parallel computing platform and API that makes it easy for developers to build GPU-enabled applications. GPU-enabled applications need access to both kernel-level device drivers and user-level CUDA libraries, and different applications may require different CUDA versions.
One way to solve this problem is to install the NVIDIA driver inside the container and map the physical NVIDIA GPU device on the underlying Docker host (e.g., /dev/nvidia0) to the container as illustrated above. The problem with this approach is that the versions of the driver and libraries inside the container need to precisely match the driver on the host. Otherwise, the application will fail. This means that users are back to worrying about what drivers and libraries are installed on each host computer to ensure compatibility with containerized applications.
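The device-mapping approach described above can be sketched as a small helper that builds the `docker run` command line. This is only an illustration: the image name `my-cuda-app:latest` is hypothetical, and real setups usually need more device files than the single `/dev/nvidia0` mentioned here (for example `/dev/nvidiactl` and `/dev/nvidia-uvm`).

```python
import shlex


def device_mapped_run(image, command, devices=("/dev/nvidia0",)):
    """Build a `docker run` argv that maps host GPU device files into a container.

    `--device=HOST:CONTAINER` is a standard `docker run` option. The image is
    assumed to carry its own NVIDIA driver libraries, which is exactly what
    makes this approach fragile when host and container versions differ.
    """
    argv = ["docker", "run", "--rm"]
    for dev in devices:
        argv.append(f"--device={dev}:{dev}")
    argv.append(image)
    argv.extend(command)
    return argv


# Hypothetical image name, used only for illustration.
argv = device_mapped_run("my-cuda-app:latest", ["nvidia-smi"])
print(shlex.join(argv))
```

Note that nothing in this command checks driver versions; a mismatch only surfaces when the application inside the container fails.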
To solve the problem of containerizing GPU applications, NVIDIA developed NVIDIA Docker, an open-source project that provides driver-agnostic CUDA images. The NVIDIA Docker plug-in enables GPU applications running in containers to share GPU devices on the Docker host without worrying about version mismatches between libraries and device drivers.
The figure below shows how this works in a Univa Grid Engine environment. Different cluster hosts may be running different GPU hardware and even different versions of CUDA runtimes and device drivers. Ideally, it should be possible to run containerized apps that require different versions of CUDA on the same host.
Now that the value of NVIDIA Docker is clear, you may want to take advantage of it on your UGE cluster. Some readers may already have working UGE clusters with GPUs installed. Others may be starting from scratch. The guide below omits some details but provides a roadmap to get NVIDIA Docker working with UGE.
If you plan to run GPU applications, you’ll need hardware with GPUs. If you don’t have GPU-capable hosts, you can rent machine instances in the cloud for a few dollars per hour. To construct the examples below, I used AWS EC2 P3 instances. These cloud instances support up to 8 NVIDIA V100 GPUs per machine. A relatively inexpensive p3.2xlarge instance with a single 16GB GPU is available on-demand for $3.06 per hour.
There are at least three ways to build a UGE cluster in the AWS cloud (that I can think of):
The AWS Marketplace will deploy a single master host based on a Univa supplied AMI. Once the master is installed, you can log in to the master via ssh and use Navops Launch to add (or remove) cluster hosts using a single command via the built-in AWS resource connector. The Univa AWS marketplace documentation provides step-by-step instructions.
After installing the master host, but before adding compute hosts, use the Navops Launch command below from the Univa Grid Engine master to show how the AWS resource adapter is configured.
By default, Navops Launch with the AWS adapter will add m4.large compute hosts when you expand the AWS cluster. You’ll want to change the default instance type to p3.2xlarge (because these instances have NVIDIA Tesla V100 GPUs) as shown.
You can now add one or more compute hosts as shown.
Make sure that you run get-node-requests to check the status of the request. The availability of V100 instances is limited and varies by region, so you will want to see any error messages or exceptions.
After you’ve added GPU-capable compute nodes to your cluster, you can verify that they are up and running:
Three p3.2xlarge on-demand compute hosts on AWS as shown here will cost approximately $10 per hour, or approximately $1,500 per week, so be careful you don’t keep them running too long!
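The rough figures above follow from the $3.06 on-demand hourly rate quoted earlier. A quick sketch of the arithmetic, assuming that rate and ignoring storage, data transfer, and the master host:

```python
HOURLY_RATE_USD = 3.06   # p3.2xlarge on-demand rate quoted in this article
HOSTS = 3
HOURS_PER_WEEK = 24 * 7


def cluster_cost(hosts, hourly_rate, hours):
    """Return the on-demand cost in USD for `hosts` instances over `hours` hours."""
    return round(hosts * hourly_rate * hours, 2)


hourly = cluster_cost(HOSTS, HOURLY_RATE_USD, 1)
weekly = cluster_cost(HOSTS, HOURLY_RATE_USD, HOURS_PER_WEEK)
print(f"${hourly:.2f} per hour, ${weekly:,.2f} per week")
# prints: $9.18 per hour, $1,542.24 per week
```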
Because I used the UGE AMI (running CentOS), my AWS-based compute hosts didn’t have the NVIDIA CUDA drivers. You’ll need to install the appropriate NVIDIA drivers for your OS release.
Detailed instructions can be found here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html
The rpm to install will depend on your OS version. I’ve skipped a few steps below in the interest of brevity, but I downloaded the publicly available driver for CentOS (listed at https://www.nvidia.com/Download/Find.aspx). The p3.2xlarge AWS instances contain a Tesla V100 GPU as shown below, and the RHEL7 driver is used for CentOS 7.
The Tesla drivers are backward compatible. The V100 driver will also support older P-series, K-series, and C, M, and K class GPUs.
Once you’ve retrieved the rpm format driver you can install the driver on each compute host using the following command:
After installing the driver on the p3.2xlarge cluster hosts, you should be able to run nvidia-smi (the NVIDIA System Management Interface) to verify that the driver is working and that you can see the GPU.
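If you want to run this check from a script rather than by eye, one option is to use the CSV query mode of nvidia-smi. The sketch below assumes nvidia-smi is on the PATH; the helper names `parse_gpu_csv` and `list_gpus` are my own and not part of any NVIDIA tooling.

```python
import subprocess

# --query-gpu and --format=csv,noheader are standard nvidia-smi options.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,driver_version,memory.total",
    "--format=csv,noheader",
]


def parse_gpu_csv(text):
    """Parse nvidia-smi CSV output into one dict per GPU."""
    gpus = []
    for line in text.strip().splitlines():
        name, driver, memory = [field.strip() for field in line.split(",")]
        gpus.append({"name": name, "driver_version": driver, "memory_total": memory})
    return gpus


def list_gpus():
    """Run nvidia-smi on this host and return the parsed GPU list."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    try:
        for gpu in list_gpus():
            print(gpu)
    except (FileNotFoundError, subprocess.CalledProcessError) as err:
        # A missing binary or a non-zero exit usually means the driver is not installed or not loaded.
        print(f"nvidia-smi check failed: {err}")
```

An empty list or a failure here is a sign that the driver installation on that host needs another look before moving on to Docker.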
Now that you have a UGE cluster with GPUs, the next step is to install Docker on each compute host. We explained this procedure in an earlier article so I won’t repeat all the details here, but the script below (should) install Docker on the Grid Engine cluster hosts.
You can find details on installing Docker Community Edition on CentOS here.
Install NVIDIA Docker plug-in on cluster hosts
Now that you have working GPU cluster hosts and have installed Docker on each host, the next step is to install NVIDIA Docker on each host. Detailed installation instructions are available at https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0).
The script below worked for my CentOS 7 compute hosts. The first few lines add the nvidia-docker repositories. Next, yum is used to install nvidia-docker2, and we restart the Docker daemon on each host so that it recognizes the nvidia-docker plug-in.
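One way to confirm that the installation took effect is to check that the Docker daemon configuration registers a runtime named `nvidia`. The sketch below assumes the configuration lives at `/etc/docker/daemon.json` (the usual location, and where the nvidia-docker2 package normally writes its runtime entry); the function name `has_nvidia_runtime` is my own.

```python
import json
from pathlib import Path

# Assumed default location of the Docker daemon configuration.
DAEMON_JSON = Path("/etc/docker/daemon.json")


def has_nvidia_runtime(config_text):
    """Return True if the daemon configuration registers a runtime named 'nvidia'."""
    config = json.loads(config_text)
    return "nvidia" in config.get("runtimes", {})


if __name__ == "__main__":
    if DAEMON_JSON.exists():
        print("nvidia runtime registered:", has_nvidia_runtime(DAEMON_JSON.read_text()))
    else:
        print(f"{DAEMON_JSON} not found; is Docker installed on this host?")
```

Remember that the daemon only picks up the new runtime after it has been restarted or reloaded.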
Next, we can verify that nvidia-docker is working by running a GPU-enabled application from inside an nvidia/cuda Docker container. The nvidia/cuda container (available from Docker Hub) includes the CUDA toolkit. Packaged GPU applications are typically based on this container.
The command below again runs nvidia-smi, but this time inside the nvidia/cuda container pulled from Docker Hub.
The --runtime=nvidia switch on the docker run command tells Docker to use the NVIDIA Docker plug-in.
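For reference, the command described above can be assembled as follows. This is a minimal sketch that only builds and prints the command line; the helper name `nvidia_docker_run` is my own, and the image and command match the ones named in the text.

```python
import shlex


def nvidia_docker_run(image="nvidia/cuda", command=("nvidia-smi",)):
    """Build a `docker run` argv that selects the NVIDIA runtime."""
    # --runtime=nvidia selects the runtime registered by nvidia-docker2;
    # --rm removes the container once the command exits.
    return ["docker", "run", "--runtime=nvidia", "--rm", image, *command]


print(shlex.join(nvidia_docker_run()))
# prints: docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
```

To actually execute it on a GPU host, pass the list to `subprocess.run(...)`.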
If you’ve gotten this far, congratulations – you now have NVIDIA Docker installed and working on your UGE cluster.
NVIDIA Docker with Grid Engine
Now that NVIDIA Docker is working on compute hosts, the next step is to submit NVIDIA Docker containers as Univa Grid Engine jobs. The advantage of running containers under Grid Engine is that the scheduler figures out the optimal place to run containers so that multiple applications and users can share GPU resources.
UGE provides specific enhancements for running NVIDIA Docker. You can use the -xd switch in Univa Grid Engine to pass the --runtime=nvidia argument as well as any environment variables that need to be accessible within the Docker container.
A sample command showing how NVIDIA Docker containerized applications can be submitted to a grid engine cluster is shown below:
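As a rough sketch of what such a submission might look like, the helper below assembles a qsub command line. Several things here are assumptions rather than a definitive recipe: the `docker` and `docker_images` resource requests follow the usual UGE Docker convention, the `gpu` resource name is hypothetical and depends on how GPU complexes are defined at your site, and the helper name is my own. Only the -xd switch carrying --runtime=nvidia comes directly from the text above.

```python
import shlex


def uge_nvidia_docker_job(image, command, gpus=1, gpu_resource="gpu"):
    """Build a qsub argv for a GPU container job (a sketch, not a definitive recipe).

    `gpu_resource` is a site-specific complex name and is an assumption here.
    """
    resources = f"docker,docker_images=*{image}*,{gpu_resource}={gpus}"
    return [
        "qsub",
        "-l", resources,             # request a Docker-capable host with the image and GPUs
        "-xd", "--runtime=nvidia",   # pass the runtime selection through to docker run
        "-b", "y",                   # treat the command as a binary rather than a job script
        *command,
    ]


print(shlex.join(uge_nvidia_docker_job("nvidia/cuda", ["nvidia-smi"])))
```

Check the qsub man page for your UGE release before relying on any of these options, since resource names and -xd handling can differ between versions and sites.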
There are a variety of pre-packaged GPU applications available from NGC (NVIDIA GPU Cloud). With Docker and NVIDIA Docker installed on UGE GPU hosts, you can use the approach explained above to run GPU-enabled applications (with some limits, of course) without worrying about compatibility with underlying device drivers.
Have you been using NVIDIA Docker with Grid Engine? We’d love to get your comments and hear about your experiences. You can learn about Univa Grid Engine at http://www.univa.com/products/.