To Infinity and Beyond – Expanded support in Univa Grid Engine for large-scale, high-throughput cloud HPC

November 10, 2020
Bill Bryce, VP Products, Univa

A big motivator for deploying HPC workloads to the cloud is the ability to scale out for higher application throughput and performance. This is especially true in life sciences and CAE, where workloads tend to be cloud-friendly and customers have an insatiable appetite for performance.

In late 2018, working with AWS and our team at Univa, Western Digital demonstrated HPC in the cloud at an extreme scale. In one of the largest commercial deployments to date, Western Digital ran a million+ core Univa Grid Engine cluster on AWS. It collapsed the runtime for a large multiphysics simulation from 20 days (480 hours) to just eight hours, a staggering 60x improvement! While deployments at this scale are still relatively rare, Univa is seeing demand for ever-larger cloud workloads. Clusters with thousands or even tens of thousands of vCPUs are becoming more common. The lessons learned from these large-scale deployments are being baked into Univa Grid Engine, giving rise to important new product features. In this article, we discuss some recent scalability enhancements in versions up to Univa Grid Engine 8.6.17.

Operating at scale brings unique challenges

Deploying and managing clusters at scale poses unique challenges. Large-scale clusters typically rely on spot or spot fleet instances to operate economically. This means that cluster nodes are continually reclaimed while workloads are running, so the scheduler must continually restart pre-empted jobs. Similarly, users cannot afford to wait until clusters reach full scale before submitting workloads, so the workload manager needs to tolerate clusters that are rapidly adding large numbers of instances as jobs are submitted.
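
The requeue behavior described above can be illustrated with a small simulation. This is a minimal sketch and not Univa Grid Engine code: the Job class, the run_with_preemption function, and the reclaim probability are all hypothetical, chosen only to show that pre-empted jobs return to the pending queue until they eventually complete.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record; not a Univa Grid Engine type."""
    job_id: int
    attempts: int = 0
    done: bool = False


def run_with_preemption(jobs, reclaim_probability, seed=0):
    """Run every job to completion, requeueing any job whose node is reclaimed.

    reclaim_probability is an assumed per-attempt chance that the spot
    instance is reclaimed before the job finishes (must be below 1.0).
    """
    rng = random.Random(seed)          # seeded so the run is repeatable
    pending = deque(jobs)
    while pending:
        job = pending.popleft()
        job.attempts += 1
        if rng.random() < reclaim_probability:
            pending.append(job)        # node reclaimed: back of the queue
        else:
            job.done = True            # job ran to completion
    return jobs


jobs = run_with_preemption([Job(i) for i in range(1000)], reclaim_probability=0.2)
total_attempts = sum(j.attempts for j in jobs)
print(f"{len(jobs)} jobs finished after {total_attempts} attempts")
```

The gap between the number of jobs and the number of attempts is the extra scheduling work that pre-emption creates, which is why restart handling matters more as clusters grow.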

Some scalability issues are best addressed with common-sense practices. For example, using a scalable object store such as Amazon S3 is much more efficient for data persistence than NFS services. Similarly, when running containerized workloads, a good practice is to bake containers directly into cloud machine images to avoid overwhelming a container registry with thousands of requests. Other issues require customizations to the environment supported by Univa Grid Engine, for example, using a distributed cache to avoid overwhelming the cloud provider with extreme volumes of cloud API requests and DNS lookups.[i]
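
The caching idea can be shown in miniature. The sketch below is an in-process stand-in for a distributed cache: TTLCache, slow_lookup, and the 60-second lifetime are assumptions made for illustration and are not part of Univa Grid Engine or any cloud SDK. It shows how repeated requests for the same key are served locally instead of reaching the backing service.

```python
import time


class TTLCache:
    """Cache lookup results for ttl_seconds; hypothetical helper for illustration."""

    def __init__(self, lookup, ttl_seconds, clock=time.monotonic):
        self._lookup = lookup            # function that performs the expensive call
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}               # key -> (expiry_time, value)
        self.backend_calls = 0

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]              # fresh entry: no backend call
        self.backend_calls += 1
        value = self._lookup(key)
        self._entries[key] = (now + self._ttl, value)
        return value


def slow_lookup(hostname):
    """Stand-in for a DNS or cloud API request (assumed; returns a fake address)."""
    return "10.0.0." + str(len(hostname))


cache = TTLCache(slow_lookup, ttl_seconds=60)
for _ in range(10_000):
    cache.get("node-0001")               # only the first call reaches slow_lookup
print("backend calls:", cache.backend_calls)
```

Here 10,000 requests for the same host result in a single backend call. A real deployment would share the cache across nodes and choose a lifetime that suits how quickly the underlying records change.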

New scalability and throughput enhancements in Univa Grid Engine

In addition to the challenges above, other more subtle bottlenecks routinely surface in large-scale deployments. Some recent Univa Grid Engine enhancements aimed at removing these bottlenecks are described below:

  • Optimizing name service lookups at scale – Aside from DNS services, clusters also use services such as NIS, LDAP, or Active Directory to resolve user and group names to their corresponding OS-level ids. Resolving supplementary groups (a feature where OS-level users can be assigned to groups beyond their primary group id) is particularly expensive. This is because, with supplementary groups, the same user id can be associated with multiple group entries. To help avoid this performance bottleneck, Univa Grid Engine avoids resolving supplementary group ids for client applications that do not need the information. Administrators can also optionally suppress supplementary group id lookups entirely for better performance, or disable forwarding of supplementary group ids in a range when they know that this information is not needed for their workloads.
  • Disabling unnecessary runtime checks – At large scale, basic validation checks at runtime can be a luxury that administrators cannot afford. For example, when a job is submitted, Univa Grid Engine will validate that queue instances exist across cluster hosts and ensure that users have permission to access them. In situations where queues are known to be correct, Univa Grid Engine now allows this runtime checking to be disabled. Suppressing unnecessary checks further increases scheduling throughput and performance.
  • Faster scheduling of parallel workloads – Scheduling parallel workloads is an expensive operation. This is because Univa Grid Engine will find the optimal resource assignment that best satisfies all request criteria. For example, the scheduler will try to provide as many slots as possible when a slot range is specified, and it will seek to maximize the number of soft (optional) resource requests that are satisfied. Univa Grid Engine will also look for the earliest possible time window to run a job. At cloud scale, it is important that “the perfect not be the enemy of the good.” Often throughput is more important than optimizing every workload placement. New scheduling parameters in Univa Grid Engine allow these settings to be selectively relaxed for dramatically faster scheduling of parallel workloads.
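
The trade-off in the last item, between the best possible placement and a fast, good-enough one, can be seen in a toy model of a slot-range request. This is a sketch under stated assumptions and not the Univa Grid Engine scheduler: assign_slots and its stop_at_min flag are hypothetical names that do not correspond to real scheduler parameters, and the number of hosts examined stands in for scheduling cost.

```python
def assign_slots(free_slots, min_slots, max_slots, stop_at_min=False):
    """Return (granted, hosts_examined) for a slot-range request.

    free_slots is a list of free slot counts, one per host.
    With stop_at_min=False the scan continues until max_slots is reached
    or hosts run out; with stop_at_min=True it stops once min_slots is met.
    """
    target = min_slots if stop_at_min else max_slots
    granted = 0
    examined = 0
    for free in free_slots:
        if granted >= target:
            break
        examined += 1
        granted += min(free, max_slots - granted)   # never exceed the range maximum
    if granted < min_slots:
        return 0, examined      # request cannot be satisfied right now
    return granted, examined


hosts = [4] * 1000              # assumed cluster: 1,000 hosts with 4 free slots each
thorough = assign_slots(hosts, min_slots=16, max_slots=4000)
relaxed = assign_slots(hosts, min_slots=16, max_slots=4000, stop_at_min=True)
print("thorough (granted, examined):", thorough)
print("relaxed  (granted, examined):", relaxed)
```

On this toy input the thorough scan examines all 1,000 hosts to grant 4,000 slots, while the relaxed scan grants 16 slots after examining 4 hosts. The job gets fewer slots, but the scheduler is free to move on to the next pending job far sooner.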

These enhancements join various other improvements in releases up to Univa Grid Engine 8.6.17 aimed at maximizing performance, reliability, and integrity in large-scale environments with high job volumes.

You can learn more about recent Univa Grid Engine enhancements by reviewing the detailed release notes at https://www.univa.com/resources/releases.php.


[i] Rob Lalonde of Univa explains these changes in the article “Mission is Possible: Tips on Building a Million Core Cluster” – https://blogs.univa.com/2020/01/mission-is-possible-tips-on-building-a-million-core-cluster/

Tags: Cloud, HPC, Univa Grid Engine, Western Digital

© 2020 ALTAIR ENGINEERING, INC. ALL RIGHTS RESERVED. WE ARE CURRENTLY LISTED ON NASDAQ AS ALTR. UNIVA IS AN ALTAIR COMPANY. UNIVA® IS A REGISTERED TRADEMARK OF ALTAIR. ALL OTHER LOGOS ARE PROPERTY OF THEIR RESPECTIVE OWNERS.