This month, Univa is rolling out Navops Launch 2.1, the most recent version of our powerful cloud automation and migration platform. This release marks an important milestone in the evolution of Navops Launch. In addition to providing new functionality for Univa Grid Engine users, this release adds support for the Slurm workload manager.
Slurm is among the world’s most widely deployed HPC workload managers. Slurm powers some of the world’s largest supercomputers and is well known for its parallel job support and extensive library of plug-ins. Together, Univa Grid Engine and Slurm manage approximately 54% of cloud workloads, according to a recent InsideHPC survey sponsored by Univa.
Below, I describe some of our motivations behind introducing Slurm support and explain why Navops Launch is an important tool for helping HPC centers operate Slurm clusters in the cloud.
Slurm in the cloud
Slurm users are no strangers to cloud computing, but present methods for deploying Slurm in the cloud pose challenges. While there are a variety of commercial solutions for deploying Slurm in the cloud, most lock users into a single cloud provider. The same InsideHPC survey referenced above found that 64% of HPC cloud users devise their own solutions. Most Slurm users rely on solutions such as AWS ParallelCluster or custom scripts, along with elastic computing functionality built on Slurm power-saving logic. Elastic computing works in conjunction with community-supported plug-ins for Amazon Web Services (AWS) and Google Cloud Platform (GCP) or cloud-specific recipes.
While these solutions may be workable, cluster and cloud administrators using the approaches described above will need to solve a variety of cloud-related technical issues on their own. Some common challenges in our experience are:
Enhanced cloud capabilities for Slurm users
Since announcing Navops Launch in 2018, Univa has been steadily addressing these challenges.
Navops Launch makes it easy to deploy and manage clusters in dedicated and hybrid clouds and provides a comprehensive API, a CLI, and an intuitive web interface.
Among our focus areas for Navops Launch are:
The automation engine in Navops Launch is an important concept. Unlike static scripts that scale up capacity when jobs are pending and scale down when nodes are idle, the automation engine is flexible and extensible. Navops Launch has visibility into Slurm-related workloads and metrics as well as cloud resource-related information.
With this information, Navops Launch can trigger actions in real time, such as shifting workloads, scaling services, moving data to lower-cost storage tiers, or taking advantage of spot, preemptible, or lower-priced instances. Navops Launch comes with a library of extensible automations, and sophisticated users can also build their own.
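To make the contrast with a static scale-up/scale-down script concrete, here is a minimal Python sketch of a rule that weighs workload and cost signals together. It is an illustration only: the `ClusterState` fields, the `decide_action` function, the action names, and the 80% threshold are all hypothetical and are not part of the Navops Launch API.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    """Hypothetical snapshot of workload and cloud metrics."""
    pending_jobs: int
    idle_nodes: int
    month_spend: float   # spend so far this month, in dollars
    month_budget: float  # budget for the month, in dollars


def decide_action(state: ClusterState) -> str:
    """Return an illustrative action name based on workload and cost signals."""
    over_budget = state.month_spend >= state.month_budget
    if state.pending_jobs > 0 and over_budget:
        # Demand exists but the budget is exhausted: let jobs queue.
        return "hold"
    if state.pending_jobs > 0 and state.idle_nodes == 0:
        # Prefer cheaper capacity once 80% of the budget is consumed
        # (the threshold is an arbitrary assumption for this sketch).
        if state.month_spend >= 0.8 * state.month_budget:
            return "scale_up_spot"
        return "scale_up_on_demand"
    if state.pending_jobs == 0 and state.idle_nodes > 0:
        return "scale_down"
    return "no_change"
```

A static script would typically stop at the pending/idle checks; the point of the sketch is that a rule with access to spending data can pick a different action for the same queue depth.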
Provisioning or extending Slurm clusters
Navops Launch 2.1 dramatically simplifies the deployment of Slurm clusters. Administrators can easily deploy Slurm master nodes (running slurmctld) and execution nodes in the cloud with elastic scaling.
Users can tailor built-in connector profiles to site-specific requirements or define their own. Profiles can be used to specify different cloud images and application environments, and even to deploy clusters with different Slurm versions[i].
Default settings can be overridden through the Web UI, enabling users to specify details such as authentication methods, DNS settings, instance types, and network-related settings. These include details such as the subnet IDs that hosts are provisioned to, assigned security groups, and whether public IP addresses are exposed for cluster machine instances.
Navops Launch users can use a Univa-supplied image or provide their own custom machine image for each supported cloud.
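The idea of site-specific overrides layered on top of profile defaults can be sketched in a few lines of Python. This is not the Navops Launch profile format: the key names, the default values, and the `apply_overrides` helper are assumptions made up for illustration.

```python
import copy

# Hypothetical default profile; key names are illustrative only.
DEFAULT_PROFILE = {
    "instance_type": "m5.large",
    "subnet_ids": [],
    "security_groups": [],
    "assign_public_ip": False,
}


def apply_overrides(defaults: dict, overrides: dict) -> dict:
    """Return a new profile with overrides applied on top of the defaults.

    Unknown keys are rejected so that a typo does not silently create
    a setting that nothing reads.
    """
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"unknown profile settings: {sorted(unknown)}")
    merged = copy.deepcopy(defaults)  # leave the defaults untouched
    merged.update(overrides)
    return merged


site_profile = apply_overrides(
    DEFAULT_PROFILE,
    {"subnet_ids": ["subnet-example"], "assign_public_ip": True},
)
```

Rejecting unknown keys is a design choice for the sketch; a real tool might instead warn or pass unrecognized settings through.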
Cloud spend management for Slurm environments
For HPC users, cost management is a challenge. Gartner estimates that 80% of cloud IaaS users will overshoot their budgets through 2020 because they lack the necessary process controls to deal with costs in the cloud. Slurm can’t manage what it can’t measure, and the scheduler has no visibility into details such as budgets and cloud-related spending by project or cost center.
Navops Launch addresses this critical gap by collecting actual spending data from each cloud provider programmatically. This gives administrators visibility into cloud spending by cost center, department, and project and enables them to take automated actions based on cost-related metrics.
A unique capability in Navops Launch is an innovative pricing API. Navops Launch collects pricing information for different cloud instance types from various cloud service providers and makes this information accessible via GraphQL-based queries. Cluster and cloud administrators can leverage the pricing API within automation applets to make appropriate resource selections at runtime based on price and other application requirements. Navops Launch also makes the vast amounts of data that it collects from workload managers and cloud providers easily accessible via GraphQL. This provides administrators with tremendous flexibility in developing their own custom automations.
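As a rough illustration of acting on per-project spending, the Python sketch below totals cost records by project and flags projects that exceed a budget. The record layout, the function names, and the sample figures are hypothetical; they do not describe how Navops Launch stores or exposes spending data.

```python
from collections import defaultdict


def spend_by_project(records: list) -> dict:
    """Sum the 'cost' field of each record, grouped by its 'project' field."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["project"]] += rec["cost"]
    return dict(totals)


def over_budget(totals: dict, budgets: dict) -> list:
    """Return the sorted names of projects whose spend exceeds their budget."""
    return sorted(
        project
        for project, spent in totals.items()
        if project in budgets and spent > budgets[project]
    )


# Made-up sample data for illustration.
records = [
    {"project": "cfd", "cost": 10.0},
    {"project": "genomics", "cost": 4.0},
    {"project": "cfd", "cost": 5.0},
]
totals = spend_by_project(records)
flagged = over_budget(totals, {"cfd": 12.0, "genomics": 20.0})
```

An automation could use a list like `flagged` to decide which projects should be throttled or moved to lower-priced capacity.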
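A price-aware resource selection of the kind described above might look like the following Python sketch. It operates on an in-memory list rather than on a real query result: the offer fields, the instance names, the prices, and the `cheapest_instance` function are placeholders invented for illustration, and the actual GraphQL schema is not shown here.

```python
from typing import Optional


def cheapest_instance(offers: list, min_vcpus: int, min_memory_gb: float) -> Optional[dict]:
    """Return the lowest-priced offer that meets the CPU and memory floor.

    Returns None when no offer satisfies the requirements.
    """
    candidates = [
        offer
        for offer in offers
        if offer["vcpus"] >= min_vcpus and offer["memory_gb"] >= min_memory_gb
    ]
    return min(candidates, key=lambda offer: offer["hourly_price"], default=None)


# Placeholder offers; names and prices are not real provider data.
offers = [
    {"name": "type-a", "vcpus": 4, "memory_gb": 16, "hourly_price": 0.20},
    {"name": "type-b", "vcpus": 8, "memory_gb": 32, "hourly_price": 0.35},
    {"name": "type-c", "vcpus": 8, "memory_gb": 64, "hourly_price": 0.50},
]
choice = cheapest_instance(offers, min_vcpus=8, min_memory_gb=32)
```

In practice, the list of offers would come from whatever pricing data the administrator has queried, and the filter could include further application requirements such as region or accelerator type.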
Informed by real-world experience
The development of Navops Launch is informed by our experience helping clients deploy large-scale clusters across multiple clouds. A significant milestone was a 2019 collaboration between Univa, Amazon Web Services, and Western Digital, where Navops Launch was used to deploy a million-core cluster in just 1 hour and 32 minutes. The deployed cluster completed a multiphysics simulation that normally required 20 days in just 8 hours, a 60x performance improvement.
Navops Launch can scale cloud resources quickly, supporting Slurm or Univa Grid Engine workloads even as clusters scale. Support for image-based deployments and scale sets vastly reduces deployment time and simplifies management. This ensures that valuable cloud resources are brought online quickly and that they are immediately available to the workload manager to maximize throughput and minimize cost.
Univa Grid Engine users can also take advantage of new managed tags functionality in Navops Launch 2.1. This enhancement makes tags assigned to cloud resources in Navops Launch instantly available as requestable resources within Univa Grid Engine, dramatically improving flexibility.
Slurm users can obtain additional information or request a personalized demonstration of Navops Launch 2.1 by contacting us at Univa.com.
[i] In this initial release, Univa provides a default reference connector for Slurm-17.x.x, and additional Slurm versions can be easily added.