Unlike traditional machine learning, where humans must identify the features relevant to a model, deep learning builds feature identification in. This means that models can be trained directly on raw data, including text, images, video, and domain-specific datasets.
For big, gnarly problems with lots of data but poorly understood features, deep learning is where the action is. While deep learning models are more compute-intensive to develop and train, they can substantially outperform other machine learning techniques. Not surprisingly, deep learning is being embraced in industries from oil & gas to cybersecurity to manufacturing.
GPUs are critical for deep learning
Deep learning has become practical because of tremendous advances in computing power and affordability enabled by modern GPUs. Deep learning environments typically consist of multiple servers, each with several NVIDIA GPUs connected via high-speed interconnects. The software stack typically includes Linux, NVIDIA drivers, Docker, NVIDIA Docker, and various management tools. Users often run multiple deep learning frameworks such as TensorFlow, Keras, PyTorch, Caffe, Theano, and MXNet. Whether on-premises or in the cloud, configuring these complex GPU environments can be challenging.
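As a concrete illustration of the container layer of this stack, a framework container can be launched with GPU access in a single command. This is a minimal sketch: the image tag and the data path are placeholders, and the exact NGC image name will depend on which framework and version a site pulls.

```shell
#!/bin/sh
# Sketch: running a containerized deep learning framework on a GPU host.
# Requires Docker 19.03+ with the NVIDIA container toolkit installed.
# The image tag and /data path below are illustrative placeholders.
docker run --gpus all --rm -it \
    -v /data:/workspace/data \
    nvcr.io/nvidia/tensorflow:latest \
    nvidia-smi
```

On older installations using NVIDIA Docker v2, `--gpus all` would be replaced by `--runtime=nvidia`. Multiplied across many frameworks, versions, and hosts, commands like this are exactly what makes hand-managing these environments hard.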
NVIDIA DGX systems simplify deep learning
Fortunately, NVIDIA offers purpose-built hardware platforms that make deep learning applications much easier to deploy and manage. NVIDIA® DGX™ systems are designed specifically for deep learning. The DGX family comprises the NVIDIA DGX Station™ and the NVIDIA DGX-1™ and DGX-2™ rackmount servers. NVIDIA DGX-2 servers provide up to 16 NVIDIA Volta™ V100 GPUs with an NVIDIA NVSwitch™-powered NVLink™ fabric offering up to 2.4 TB/s of bandwidth. DGX-2 servers also come pre-configured with Mellanox® EDR InfiniBand offering 1,600 Gb/s of bi-directional bandwidth between hosts.
Managing deep learning workloads
In enterprise environments, multiple data science teams often share a DGX cluster. Workloads range from ETL flows for generating training data, to training jobs, to ongoing model validation. Many of these workloads are long-running, taking hours or even days to complete. Jobs can involve different software frameworks and multiple GPUs spread across hosts, and can have different resource requirements and business priorities. Without workload management, users in these shared environments are all but guaranteed to “trip over each other,” causing conflict, confusion, and reduced productivity.
As an example, the requirements of a single distributed training job might read as follows:
“A distributed, containerized TensorFlow model needs two parameter servers and ten workers. Each parameter server needs a single CPU and 8GB of memory, and each worker requires a P100 GPU with at least 4GB of GPU memory and 5GB of host memory. Workers must be scheduled to processor cores on each host such that CPU-GPU pairs share memory and a direct bus connection. Workers should be concentrated on as few hosts as possible, and if the workers need to be distributed across hosts, the hosts should reside in the same rack, on the same switch, to minimize network latency.”
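A workload manager lets requirements like these be expressed declaratively at submission time rather than hardcoded. The sketch below shows roughly how such a job might be submitted with Univa Grid Engine's `qsub`, under stated assumptions: the parallel environment name (`tf.dist`), the resource names (`gpu`, `m_mem_free`), and the job script (`run_distributed_tf.sh`) are hypothetical, site-defined names, since GPU and memory complexes are configured per cluster.

```shell
#!/bin/sh
# Hedged sketch of a Univa Grid Engine submission for the job described above.
# "tf.dist", "gpu", "m_mem_free", and run_distributed_tf.sh are assumed,
# site-specific names; real clusters will differ.

# Request 12 slots (2 parameter servers + 10 workers), one GPU and 5GB of
# host memory per slot; core binding keeps each task close to its GPU.
qsub -N tf-train \
     -pe tf.dist 12 \
     -l gpu=1 \
     -l m_mem_free=5G \
     -binding linear:1 \
     run_distributed_tf.sh
```

The scheduler, not the user, then resolves these requests to specific hosts, GPUs, and cores, applying placement policies such as packing workers onto as few hosts as possible.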
Now imagine dozens of jobs with similar constraints submitted by different groups. With many users and workloads, hardcoding hostnames and GPU device names is a recipe for disaster. This is where GPU-aware workload management with Univa Grid Engine comes in.
Optimized management of deep learning workloads for DGX clusters
Univa software manages workloads across NVIDIA GPUs on some of the world’s largest AI supercomputers, including the ABCI supercomputer in Japan. Based on practical experience managing containerized deep learning environments at scale, Univa has captured best practices and made these capabilities easily available to DGX users.
Learn more about NVIDIA DGX best practices here – Univa Grid Engine on DGX
Whether you are deploying a single NVIDIA DGX server or a cluster of multiple servers, Univa Grid Engine brings important capabilities and benefits to the DGX environment.
You can learn more about how Univa Grid Engine supports GPU workloads by reading Managing GPU workloads with Univa Grid Engine. Technical details about using Univa Grid Engine on NVIDIA DGX systems are provided in the guide Using NVIDIA DGX Systems with Univa Grid Engine.