At Univa, we continue to see rapid growth in GPU adoption in applications ranging from Deep Learning to Inference to traditional HPC. Many factors are driving increased GPU adoption:
GPU-capable instances are now available across all major cloud platforms, and NVIDIA continues to release developer-friendly SDKs and libraries such as CUDA-X HPC, making GPUs more accessible. With improved tools, it's becoming easier to build, debug, and deploy GPU-enabled applications.
Simplifying GPU management at scale
Just as NVIDIA is making it easier to build and package GPU workloads, at Univa we're making GPU applications easier to deploy and manage at scale. Managing distributed GPU workloads across clusters is surprisingly challenging. This is particularly true when many applications and users are competing for resources, as is usually the case in HPC and data science environments. Jobs need to be placed considering factors such as GPU resource requirements, CPU and GPU architectures, memory, cache, server bus topologies, and NVIDIA interconnect and network switch topologies.
Over the past five years, we’ve spent considerable time working with leading AI supercomputing sites devising and fine-tuning better ways to manage GPU workloads so that applications run reliably and efficiently.
Read also: Managing GPU workloads with Univa Grid Engine (Part I and Part II)
Today, Univa Grid Engine is among the most capable workload managers for running containerized GPU workloads at scale. Grid Engine can be easily deployed on-premises or in hybrid cloud environments on your preferred cloud provider, using Navops Launch to automate the provisioning of cloud-based GPU instances.
Univa Grid Engine provides multiple features aimed at simplifying GPU application management, including RSMAPs (resource maps) to simplify access to GPU devices, GPU-CPU affinity controls, and support for NVIDIA Docker 2.0.
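For example, a user can request GPU devices through an RSMAP complex at submit time. In the sketch below, the complex name "gpu", the core-binding request, and the job script are assumptions that depend on how your administrator has configured the cluster:

    # Request two GPU devices via the "gpu" RSMAP complex and bind the job to
    # two cores near the assigned devices (complex name and binding strategy
    # are illustrative and vary by site configuration)
    qsub -l gpu=2 -binding linear:2 ./train_model.sh

    # Inside the job, the granted device IDs are exposed through the
    # SGE_HGR_<complex> environment variable (here SGE_HGR_gpu), which the
    # application can use to set CUDA_VISIBLE_DEVICES
    echo $SGE_HGR_gpu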
New GPU enhancements in Univa Grid Engine
In our latest Univa Grid Engine 8.6.7 release, we're making GPU applications even easier to use and manage, while also adding support for Red Hat 8 and DRMAA2-related enhancements.
While Univa Grid Engine has supported a direct integration with NVIDIA's Data Center GPU Manager (DCGM) since version 8.6.0, a major focus of our latest release has been improved GPU usage reporting for Grid Engine jobs. In Univa Grid Engine 8.6.7, cluster administrators can optionally enable per-job GPU usage reporting, and users can view extended per-job GPU metrics using the built-in Grid Engine qstat command.
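As a rough sketch, enabling the integration and inspecting a job might look like the lines below; the execd parameter name and port are assumptions based on a typical DCGM setup, so check the release notes for the exact settings:

    # Point Grid Engine at the DCGM host engine on an execution host
    # (parameter name and port shown are assumptions for illustration)
    qconf -mconf gpu-node-01        # add execd_params UGE_DCGM_PORT=5555

    # View extended per-job GPU metrics for a running job (4711 is an
    # example job ID); the usage section of the output includes the
    # GPU-specific values
    qstat -j 4711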
With this enhancement, Grid Engine exposes detailed reporting on an additional 35 GPU-specific metrics that can help administrators manage and schedule GPU workloads more efficiently.
In addition to the many GPU-specific metrics previously available to Grid Engine users (affinity masks, GPU type and version, power usage, etc.), Grid Engine now reports many additional metrics, including:
By exposing these additional metrics through Grid Engine, cluster users can make even more efficient use of GPU resources on-premises or on their preferred cloud platform for applications ranging from HPC to Machine Learning and Deep Learning.
For NVIDIA DGX users, Univa provides a simple guide, Using NVIDIA DGX Systems with Univa Grid Engine, explaining how to configure a GPU cluster and submit regular and GPU-enabled workloads inside and outside of containers, taking advantage of CPU-GPU affinity features to get the most out of GPU clusters.
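As a rough illustration, submitting a containerized GPU job on such a cluster might look like the lines below; the container image, the "gpu" RSMAP complex, and the Docker runtime option are assumptions that will vary with your configuration:

    # Run nvidia-smi inside an NVIDIA CUDA container on a GPU node
    # (image name, complex names, and -xd options are illustrative)
    qsub -l docker,docker_images="*nvidia/cuda:10.1-base*",gpu=1 \
         -xd "--runtime=nvidia" \
         -b y nvidia-smi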
You can learn more about Univa Grid Engine and request a free trial here.