Increasingly, GPUs power a wide variety of HPC and AI workloads, from structural analysis to molecular modeling. In April 2020, Univa’s State of the GPU in HPC survey found that 83% of respondents reported running GPU workloads, with fully 62% of those running in production.
GPU-capable cloud instances can cost roughly 10X more than equivalent on-demand cloud instances without GPUs[i]. Given the prevalence of GPU workloads and the high cost of GPU servers and cloud instances, enterprises are looking for ways to maximize their GPU investments.
As GPUs become more powerful and offer more cores and memory, not every workload needs exclusive access to a GPU. Sharing GPUs can result in better GPU utilization, lower costs, and improved workload throughput.
Scheduling GPU workloads is a complicated topic where concepts build on one another. Rather than cover this in a single article, I decided to break this topic up into a series of short articles. This first article starts with the basics of GPU scheduling. Future articles will cover topics including GPU sharing, handling heterogeneous GPU nodes, topology mask, and other advanced Univa Grid Engine features.
Experienced Grid Engine administrators may also be interested in these additional Univa articles related to GPU scheduling:
As most developers know, GPU sharing does not happen magically. It takes effort on the part of developers and coordination with system and cluster administrators. GPU sharing is normally facilitated using the NVIDIA Multi-Process Service (MPS). MPS is essentially an alternative, binary-compatible implementation of the CUDA runtime that facilitates running multiple CUDA kernels (and thus applications) on the same GPU. With MPS, multiple Linux applications written to the CUDA API can efficiently share resources on the same GPU using co-operative multitasking. Candidate applications for MPS-based GPU resource sharing tend to be applications that have a small number of blocks-per-grid.
MPS consists of a control daemon process (responsible for starting and stopping the MPS server), a server process (providing Linux clients with a shared connection to a GPU), and an MPS client runtime accessible to any CUDA application. For GPU sharing to work, the GPU will need to be in its DEFAULT mode (configurable via the nvidia-smi command).
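As a sketch, bringing up MPS on a host typically involves the steps below (device 0 is illustrative; consult NVIDIA’s MPS documentation for the authoritative procedure on your driver version):

```
# Ensure GPU 0 is in its DEFAULT compute mode (requires root):
$ sudo nvidia-smi -i 0 -c DEFAULT
# Start the MPS control daemon; it launches the MPS server on demand:
$ nvidia-cuda-mps-control -d
# ...CUDA applications started now attach to the shared MPS server...
# Shut MPS down when finished:
$ echo quit | nvidia-cuda-mps-control
```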
With the announcement of Volta-series GPUs (V100), NVIDIA introduced new MPS capabilities to make GPU sharing easier and increase isolation between clients. Specifically, each Volta MPS client owns its own address space and can submit work directly to the GPU without passing through the MPS server.
You can learn more about NVIDIA’s Multi-Process Service by reading NVIDIA’s CUDA Multi-Process Service Overview. The documentation contains helpful considerations about what applications are appropriate candidates for use with MPS and what limitations exist. Before considering GPU sharing in the context of a workload manager such as Univa Grid Engine, developers and administrators should first ensure that MPS workload sharing works outside of the workload management environment. The NVIDIA documentation explains how to write scripts to launch shared GPU workloads under control of a workload manager such as Univa Grid Engine.
With the announcement of NVIDIA’s latest NVIDIA A100 GPUs, NVIDIA rolled out a new Multi-Instance GPU (MIG) feature. This feature provides still better GPU resource sharing by allowing NVIDIA A100 GPUs to be securely partitioned and shared by up to seven CUDA application instances. A key advantage of MIG is that GPU code runs simultaneously on the same GPU – real multitasking. You can learn more about MIG in the MIG User Guide.
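For example, on an A100 host with administrator privileges, MIG mode can be enabled and the supported partition profiles listed with nvidia-smi (an illustrative sketch; see the MIG User Guide for the full partitioning workflow):

```
# Enable MIG mode on GPU 0 (may require a GPU reset to take effect):
$ sudo nvidia-smi -i 0 -mig 1
# List the GPU instance profiles the device supports:
$ sudo nvidia-smi mig -lgip
```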
All this to say, application requirements around GPU-resource sharing are complicated. They will depend on the generation of GPU and how applications are built. The scheduling issues are challenging enough, so in this and subsequent articles, we’ve chosen to set aside the intricate details of GPU sharing from a developer’s perspective. In this series of articles, we focus on GPU sharing from a workload management perspective.
Before we move on to more complicated examples, it is useful to review how consumable resources are handled in Univa Grid Engine.
In a shared cluster, consumable resources are resources that, once used, are unavailable to other users and applications. Consumable resources include things like GPUs, memory, and job slots on a host. Other resource attributes are non-consumable. These include resource attributes like machine architectures, hostnames, and downloaded docker images available on a server.
In Univa Grid Engine, complex resource attributes provide information about resources that users can request when they submit jobs. Resources can exist cluster-wide, can be associated with specific cluster hosts, or can be associated with queues. The command qconf -sc (show complex) provides a list of complex resources pre-defined in Univa Grid Engine. To schedule GPU resources and workloads, we need to define a new complex configuration called gpu. The qconf -mc command (modify complex) enables us to review and edit existing complex resource definitions cluster-wide, and add a new gpu entry as shown:
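The new entry added through qconf -mc might look like the following line (the field values are illustrative; RSMAP is the Univa Grid Engine resource-map type used for ID-based consumable resources):

```
#name  shortcut  type   relop  requestable  consumable  default  urgency
gpu    gpu       RSMAP  <=     YES          HOST        0        0
```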
If the gpu complex resource entry already exists, you can also review and edit it using the qconf -mce command.
The meaning of each field that defines the gpu complex resource entry is provided below. The complex(5) man page describes the file format of the Univa Grid Engine complex configuration.
We have a small Univa Grid Engine cluster deployed in AWS with three hosts. ip-172-31-30-47 is both a master host and an execution host. The other two hosts are execution hosts only.
In our example, suppose that our two execution hosts (ip-172-31-19-231 and ip-172-31-19-95) each have four physical GPUs. For now, we are not concerned with the GPU models or their capabilities. We will get to this in future examples.
Use qconf -me to modify the properties of each execution host containing GPUs and add the value gpu=4(0 1 2 3) to the complex_values line alongside any other attributes as shown.
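The edited host configuration will contain lines like these (other attributes omitted for brevity):

```
hostname        ip-172-31-19-231
...
complex_values  gpu=4(0 1 2 3)
```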
This syntax tells Univa Grid Engine that there are four available GPUs on the host. A simple resource map is provided inside the parentheses. In this example, each entry in the resource map corresponds to a GPU device ID on the host. NVIDIA assigns simple numerical device IDs to each physical GPU, even if the devices are of different models.
|Note: Readers should not assume that the order of GPU devices returned by the nvidia-smi command corresponds to the actual CUDA device IDs. This is because NVIDIA by default orders CUDA device IDs using a FASTEST_FIRST policy rather than ordering devices based on the PCI_BUS_ID. This is explained in the CUDA documentation (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars). While this doesn’t matter when all GPU devices on a host are of the same type, it becomes important in mixed GPU environments. We will explain how to handle this in the third article in this series when we look at mixed GPU types.|
We can make the same change to the other execution host, ip-172-31-19-95, which also has four GPUs.
Once we’ve updated the complex_values on each execution host, we can use the qhost -F gpu command to list hosts with available gpu resources confirming that the update was successful.
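The output will resemble the following (columns trimmed for brevity; exact formatting varies by installation):

```
$ qhost -F gpu
HOSTNAME                ARCH         NCPU  ...
----------------------------------------------------------
global                  -               -  ...
ip-172-31-19-231        lx-amd64        1  ...
    Host Resource(s):      hc:gpu=4.000000
ip-172-31-19-95         lx-amd64        1  ...
    Host Resource(s):      hc:gpu=4.000000
ip-172-31-30-47         lx-amd64        1  ...
```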
The hc:gpu=4 entry associated with each execution host indicates that four host-level consumable GPU resources are available on each GPU cloud instance.
Without a workload manager, users would need to manually decide where each GPU job runs. Even on a small cluster with two GPU hosts and eight GPUs, this quickly becomes tedious, especially when multiple users or groups are trying to access the same resources.
Rather than hard code the GPU device that each GPU application uses, a better practice is to restrict the physical GPU devices visible to the CUDA environment using the CUDA_VISIBLE_DEVICES shell variable. If we have four physical GPUs as above, for example (0 1 2 3), and we want to use the third device (2) we can simply set CUDA_VISIBLE_DEVICES=2 and export the shell variable so that it is visible to child processes running GPU jobs. When a user submits a job to Univa Grid Engine, they indicate the number of GPUs that their job requires using a resource requirement string as shown:
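For example, a session requesting a single GPU might look like this (the host and granted device ID will vary):

```
$ qrsh -l gpu=1
$ hostname
ip-172-31-19-231.ec2.internal
$ echo $SGE_HGR_gpu
0
```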
In the example above, we start an interactive session on a host and indicate that we need access to a single GPU (-l gpu=1).
Univa Grid Engine knows which hosts have GPUs and which jobs are presently consuming GPUs on each host. A host and GPU resource is scheduled, and the granted device ID is placed in the “host granted resources” (HGR) shell variable corresponding to the resource name (in this case, $SGE_HGR_gpu) in the job context on the scheduled host.
In the example above, we can inspect the value of the shell variable through the remote shell session (qrsh) and see that we were granted access to GPU device 0 on execution host ip-172-31-19-231. We now have exclusive use of the GPU, and the scheduler will prevent other jobs from using it until we terminate our session.
To illustrate how to run a GPU application in a script, we create a script called gpu_job.sh as shown.
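A minimal sketch of such a script is shown below; my_cuda_app is a placeholder name for the actual CUDA application.

```shell
#!/bin/sh
# gpu_job.sh -- minimal sketch of a GPU batch job script.
# SGE_HGR_gpu holds the device ID(s) granted by Univa Grid Engine;
# exporting it as CUDA_VISIBLE_DEVICES constrains the CUDA runtime
# to the granted device(s).
export CUDA_VISIBLE_DEVICES="$SGE_HGR_gpu"
echo "Granted GPU device(s): $CUDA_VISIBLE_DEVICES"
# ...launch the CUDA application here, e.g. ./my_cuda_app
```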
When the script starts on the host selected by Univa Grid Engine, the first thing it does is populate the variable CUDA_VISIBLE_DEVICES with the “host granted resource” assigned to the job. This constrains the CUDA application to the GPU selected by Univa Grid Engine.
|Note: In a future article we will learn about advanced Univa Grid Engine functionality where CUDA_VISIBLE_DEVICES can be set automatically by Univa Grid Engine.|
We can submit a job using qsub as shown indicating that our application needs a single GPU:
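For example (the job ID in the response is illustrative):

```
$ qsub -l gpu=1 gpu_job.sh
Your job 12 ("gpu_job.sh") has been submitted
```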
We can use the qstat command to query the status of the job submitted above, showing the resource map for the job. This shows that GPU device 0 was selected on host ip-172-31-19-95.ec2.internal.
The power of using a scheduler becomes evident when many users are submitting different jobs with different resource requirements and priorities.
We can illustrate how Univa Grid Engine provides appropriate access to GPUs across cluster nodes with a simple example. We create a script that submits requests for ten GPU jobs identical to the example above in quick succession.
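Such a script might look like this (a minimal sketch, assuming the gpu_job.sh script from earlier):

```
#!/bin/sh
# Submit ten single-GPU jobs in quick succession.
for i in $(seq 1 10); do
    qsub -l gpu=1 gpu_job.sh
done
```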
Running this command results in ten GPU jobs being submitted:
Assuming no other jobs were previously running on the cluster, running qstat to show the status of Univa Grid Engine jobs will show something like the following[ii]:
Note that there are only eight GPUs on the three-node cluster, so eight of the ten jobs are each assigned a physical GPU on a host, and the remaining jobs pend in the queue once all GPUs are in use. Running qhost -F gpu shows that all of our GPU resources are now in use (hc:gpu=0 means that no GPUs are available).
Now that we have shown the basics of GPU-aware scheduling, we can progress to more complicated examples.
In the next article in this series, we will extend this example and show how we can virtualize GPU resources and run multiple GPU jobs on the same device.
[i] https://aws.amazon.com/ec2/pricing/on-demand/ A p3.8xlarge instance with 32 Intel Skylake vCPUs, 244 GiB of memory, and 4 x NVIDIA Tesla V100 GPUs costs $12.24/hour on demand. A compute-optimized c6g.8xlarge instance with 32 vCPUs, 64 GiB of memory, and no GPUs costs $1.088/hour.
[ii] In this example, we use single-vCPU t2.micro instances to minimize the cost of cloud resources. By default, only one job would be allowed to execute on each single-vCPU host, so we modified the default all.q configuration (qconf -mq all.q) to allow four job slots per host so that all four GPU resources could be consumed.