A big motivator for deploying HPC workloads to the cloud is the ability to scale out, improving application throughput and performance. This is especially true in life sciences and CAE, where workloads tend to be cloud-friendly and customers have an insatiable appetite for performance.
In late 2018, working with AWS and our team at Univa, Western Digital demonstrated HPC in the cloud at an extreme scale. In one of the largest commercial deployments to date, Western Digital ran a million+ core Univa Grid Engine cluster on AWS. It collapsed the runtime for a large multiphysics simulation from 20 days to just eight hours, a staggering 60x improvement. While deployments at this scale are still relatively rare, Univa is seeing demand for ever-larger cloud workloads. Clusters with thousands or even tens of thousands of vCPUs are becoming more common. The lessons learned from these large-scale deployments are being baked into Univa Grid Engine, giving rise to important new product features. In this article, we discuss some recent scalability enhancements in versions up to Univa Grid Engine 8.6.17.
Operating at scale brings unique challenges
Deploying and managing clusters at scale poses unique challenges. Large-scale clusters will typically leverage spot or spot fleet instances to operate economically. This means that cluster nodes will continually be reclaimed while workloads are running, requiring that the scheduler continually restart preempted jobs. Similarly, users cannot afford to wait until clusters are at full scale to submit workloads, so the workload manager needs to tolerate clusters that are rapidly adding large numbers of instances as jobs are submitted.
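In Grid Engine terms, jobs submitted with `qsub -r y` are marked rerunnable, so the scheduler can requeue them when a spot instance is reclaimed mid-run. The requeue behavior described above can be sketched with a toy simulation (this is an illustration only, not the product's scheduler; `run_jobs` and its inputs are hypothetical):

```python
import collections

def run_jobs(jobs, preempt_rounds):
    """Simulate a scheduler requeuing jobs preempted off spot instances.

    jobs: list of job ids in submission order.
    preempt_rounds: hypothetical map of job id -> number of times its
    instance is reclaimed before the job finally runs to completion.
    """
    queue = collections.deque(jobs)
    attempts = collections.Counter()
    completed = []
    while queue:
        job = queue.popleft()
        attempts[job] += 1
        if attempts[job] <= preempt_rounds.get(job, 0):
            # Instance reclaimed mid-run: requeue the job instead of
            # failing it, as a rerunnable job would be restarted.
            queue.append(job)
        else:
            completed.append(job)
    return completed, attempts

# "a" is preempted twice before finishing; "b" runs once.
done, tries = run_jobs(["a", "b"], {"a": 2})
```

The key design point is that preemption is treated as a routine event rather than a failure: the job simply goes back in the queue and accumulates another attempt.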
Some scalability issues are best addressed with common-sense best practices. For example, using a scalable object store such as AWS S3 is much more efficient for data persistence than NFS services. Similarly, when running containerized workloads, a good practice is to bake containers directly into cloud machine images to avoid overwhelming a container registry with thousands of requests. Other issues require customizations to the environment supported by Univa Grid Engine, such as using a distributed cache to avoid overwhelming the cloud provider with extreme volumes of cloud API requests and DNS lookups.[i]
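The caching idea behind that last point can be sketched as a simple TTL cache in front of a resolver. A real million-core deployment would use a shared, distributed cache; the minimal in-process version below (class name and parameters are illustrative, not from Univa's implementation) just shows how repeated lookups are absorbed locally instead of hitting DNS or a cloud API each time:

```python
import socket
import time

class TTLCache:
    """Minimal in-process TTL cache for expensive lookups.

    Sketch only: large clusters would typically share a distributed
    cache across nodes rather than cache per-process.
    """

    def __init__(self, ttl_seconds, now=time.monotonic):
        self.ttl = ttl_seconds
        self.now = now          # injectable clock, useful for testing
        self._store = {}        # key -> (expiry_time, value)

    def get_or_fetch(self, key, fetch):
        entry = self._store.get(key)
        if entry is not None and entry[0] > self.now():
            return entry[1]     # cache hit: no upstream request made
        value = fetch(key)      # cache miss: exactly one upstream request
        self._store[key] = (self.now() + self.ttl, value)
        return value

# Example: cache hostname resolutions so thousands of nodes do not
# each hammer DNS (socket.gethostbyname stands in for the resolver).
dns_cache = TTLCache(ttl_seconds=300)
# addr = dns_cache.get_or_fetch("example.com", socket.gethostbyname)
```

Within the TTL window, every lookup after the first is served from memory, so the volume of requests reaching DNS or the cloud provider's API scales with the number of distinct keys, not the number of calls.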
New scalability and throughput enhancements in Univa Grid Engine
In addition to the challenges above, other more subtle bottlenecks routinely surface in large-scale deployments. Some recent Univa Grid Engine enhancements aimed at removing these bottlenecks are described below:
These enhancements complement various other improvements in releases up to Univa Grid Engine 8.6.17 aimed at maximizing performance, reliability, and integrity in large-scale environments with high job volumes.
You can learn more about recent Univa Grid Engine enhancements by reviewing the detailed release notes at https://www.univa.com/resources/releases.php.
[i] Rob Lalonde of Univa explains these changes in the article “Mission is Possible: Tips on Building a Million Core Cluster” – https://blogs.univa.com/2020/01/mission-is-possible-tips-on-building-a-million-core-cluster/