The industry has reached a tipping point: our recent customer survey showed a marked increase in both interest in and use of HPC cloud. This openness is fueled in part by new applications (like distributed TensorFlow) that are well suited to the cloud, as well as by remarkable advances in the management software stack.
Hybrid cloud computing has become an essential approach to HPC architectures; for considerations and getting started advice you can read this whitepaper.
Univa responded to this surge in interest with the open source project Tortuga, a general-purpose cluster and cloud management framework that has applicability to a broad set of applications including high performance computing, big data frameworks, Kubernetes and scale-out machine learning / deep learning environments. Tortuga automates the deployment of these clusters in local on-premise, cloud-based and hybrid-cloud configurations through repeatable templates. We also made a number of product and cloud partnership announcements that broaden our capability to support our customers’ cloud migration objectives.
Tortuga and our commercial product, Navops Launch, differ in several ways, and we will discuss the key differentiators in this blog.
First, let’s see how Tortuga and Launch are the same
Navops Launch builds on the open source project Tortuga. Therefore, what is “in” Tortuga is present in Navops Launch. In fact, the development of Tortuga is done in the open; there is no private repo that Univa builds Launch from. Navops Launch is a superset of the features and capabilities of Tortuga. For clarity, both Tortuga and Launch can be used to provision and manage both virtual and bare-metal environments (here is a tutorial for how to build Tortuga and provision in AWS).
In addition, the cloud-specific adapters for AWS, Google Cloud, Microsoft Azure, OpenStack and Oracle Cloud Infrastructure are the same (and are available in the Tortuga GitHub).
The built-in “simple” policy engine in Tortuga allows users to dynamically create, scale and tear down cloud-based infrastructure in response to changing workload demand. However, this is an area of key differentiation between the project and the product (discussed further down).
Made for enterprise integration
Since no two enterprises design and run IT infrastructure the same way (network configuration and management tools are typical examples), enterprise products deployed in these environments need to support so-called “brownfield” infrastructures. This can increase the complexity of the product, but that flexibility pays off in the end (after all, an enterprise is unlikely to bend its processes and security protocols around third-party software). Both Tortuga and Launch share this flexibility and support for “brownfield” environments. However, this is an area where Launch improves on the project.
Navops Launch tightens the definition of the use cases to self-serve HPC cloud provisioning, making the installation simpler and automated with a wizard.
BYOI (Bring Your Own Image)
To achieve application portability between local and cloud-based nodes, the ability to provision custom machine images is important. Rather than use managed or vendor provisioned images, enterprise HPC users will want to use their own customized and approved images with their cloud management provisioning software. Many enterprises have validated (and secured) images that they bring to the cloud and full support for bring-your-own-image (BYOI) is built-in.
Now let’s take a look at the key differentiators and where Navops Launch product packaging begins.
Policy Management tied to Workload Attributes, not queues, flags or “BMC” readings
While Tortuga has a “simple” policy engine, it is a much different implementation than what is available in Navops Launch. Tortuga’s policy engine uses external methods (such as scripting) to execute “IFTTT” actions. Tying policy to workload is possible but requires more work to integrate into Grid Engine (or any 3rd party data source). The policy engine productized with Navops Launch is integrated with Tortuga “core” and Grid Engine. That enables Navops Launch to more effectively scale up or down cloud-based infrastructure dynamically by tying actions performed to workload “objects” and metrics.
The Navops Launch policy engine has deep, time-based visibility into the workload management system’s internal environment and attributes. This gives Navops Launch full access to all Grid Engine “objects”, and scaling or reaping rules that use those attributes can easily be created in a simple pick-and-choose GUI. Grid Engine “objects” include Job, Resource, RSMAP, Queue Instances, Cluster Queue, Hosts and User. For more information on objects, refer to the Grid Engine Administrator documentation or “man pages”.
To see how tying a rule to a Grid Engine object works, let’s look at a simple example. Figure 1 shows a simple configuration of two rules: one that adds nodes when a job with a specific name is submitted, and one that runs every 3 minutes and “reaps” (terminates) nodes when bursting is no longer required.
Figure 1: Two rules in the Rule List of Navops Launch
In figure 2 we can see how the rule “Add Nodes when Jobs of a specific name are submitted” is configured. This rule reads the Job object of Grid Engine (via the REST API) and finds the jobs that are in the “qw” (queued and waiting) state, were submitted by the user “sge” and are named “burst”. The rule runs every minute and requires all 3 conditions to match (“AND”): Job State = “qw”, Job User = “sge” and Job Name = “burst”. When all 3 conditions are met, the action defined in the Actions tab fires. In this case, the action is to add a node in the cloud that specifically matches the attributes of the job “burst” (that is, it provisions the type of node, number of cores and memory the job specifies).
Figure 2: Editing a rule in Navops Launch policy engine UI
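The matching logic of a rule like the one in figure 2 can be sketched in a few lines of Python. This is only an illustration and not the Navops Launch implementation: the Job class, its field names and the evaluate_rule helper are hypothetical, and a real rule would read job data from Grid Engine rather than from a hard-coded list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical, simplified view of a Grid Engine job."""
    name: str
    user: str
    state: str
    cores: int
    memory_gb: int


# The three conditions from the example rule; all must hold ("AND").
RULE_CONDITIONS = {"state": "qw", "user": "sge", "name": "burst"}


def matches(job, conditions):
    """Return True only if every condition equals the job's attribute."""
    return all(getattr(job, field) == value
               for field, value in conditions.items())


def evaluate_rule(jobs, conditions):
    """Build one node request per matching job, sized to that job."""
    return [{"cores": job.cores, "memory_gb": job.memory_gb}
            for job in jobs if matches(job, conditions)]


# Made-up sample data standing in for what the workload manager reports.
jobs = [
    Job("burst", "sge", "qw", 8, 32),     # matches all three conditions
    Job("burst", "sge", "r", 8, 32),      # already running, not "qw"
    Job("nightly", "sge", "qw", 2, 4),    # wrong job name
    Job("burst", "alice", "qw", 4, 16),   # wrong user
]

requests = evaluate_rule(jobs, RULE_CONDITIONS)
print(requests)
```

Only the first job satisfies all three conditions, so a single node request is produced, and it carries the cores and memory that job asked for.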
Navops Launch makes it easy to properly orchestrate workload placed in the cloud based on a mix of system variables that best match the end user’s needs, whether that be the number of cores required, memory, data proximity, wait time in the queue or priority (literally any attribute tied to any of the “objects” listed above). Static provisioning of a fixed number of cloud images with a specific or pre-defined configuration, disconnected from overall workload orchestration, is a thing of the past.
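To make the contrast with static provisioning concrete, here is a minimal Python sketch of choosing the cheapest instance that still satisfies a job's core and memory requirements. It is an assumption-laden illustration, not how Navops Launch selects instances: the catalog, the instance names and the prices are invented, and right_size is a hypothetical helper.

```python
from typing import NamedTuple, Optional


class InstanceType(NamedTuple):
    name: str
    cores: int
    memory_gb: int
    hourly_cost: float


# Invented catalog for illustration; not real cloud instance types or prices.
CATALOG = [
    InstanceType("small", 2, 4, 0.10),
    InstanceType("medium", 8, 32, 0.40),
    InstanceType("big-mem", 16, 256, 2.00),
]


def right_size(cores: int, memory_gb: int,
               catalog=CATALOG) -> Optional[InstanceType]:
    """Return the cheapest instance type that fits the job, or None."""
    candidates = [t for t in catalog
                  if t.cores >= cores and t.memory_gb >= memory_gb]
    return min(candidates, key=lambda t: t.hourly_cost, default=None)


# A single-threaded, low-memory job should not land on the big-memory type.
print(right_size(cores=1, memory_gb=2))
# A memory-hungry job needs the larger instance despite its modest core count.
print(right_size(cores=4, memory_gb=200))
```

With a fixed, pre-defined configuration, both jobs would receive the same instance; driving the choice from the job's own attributes is what avoids paying for capacity the job cannot use.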
The policy engine in Navops Launch allows users to dynamically scale up and reap cloud instances using Grid Engine workload metrics, or in fact any metric. Univa has published several rules in JSON format in our open source GitHub repository, with many more to come.
Tortuga and Launch both have an extensive command line and REST API. Navops Launch adds a unified web UI across installation, configuration, self-service cluster provisioning, dashboards and rule definition. The command line remains consistent with Tortuga. Over time, however, Navops Launch may feature an advanced “Launch” command line, similar to how Kubernetes has implemented its own command line.
We will discuss more about the specific features of the webUI in upcoming blogs.
We discussed key differentiators between project Tortuga and Navops Launch in this blog. Tying cloud workload placement policy decisions to workload attributes is probably the most important. Without understanding the runtime environment requirements of the job, efficiency can quickly diminish. Workload placement and resource utilization decisions become static and difficult to implement, whereas the value of bursting lies in the dynamic and effective provisioning of the “right resource”. Adding a big memory instance to run a single-threaded, low-memory job is like taking a sledgehammer to a thumbtack, and the impact and cost of such a decision compound if the license is not available or the data is not staged.
Not only are hybrid cloud implementations more effective with workload-attribute-driven policy decisions, but enterprises can also manage costs effectively by provisioning precisely the resources they need and reaping those instances as quickly as possible.
Univa has helped many customers reach their organizational HPC cloud objectives including Wharton’s use of AWS or those who use our integrated marketplace offerings at Amazon. Let us know if we can be of assistance to your cloud objectives.