While some HPC users are mature in their use of cloud, others are just getting started or kicking the tires. In this article, we address common questions about cloud computing and its suitability for HPC workloads.
Most cloud providers offer a variety of services ranging from infrastructure-as-a-service (IaaS) to platform offerings like databases or distributed caches. For our purposes, we’ll focus on IaaS offerings since these are usually of primary interest to HPC users.
Virtual machine instances are essentially cloud-based servers. They come in a variety of flavors and price points depending on the type of CPU, the number of virtual CPUs (analogous to cores), and the amount of memory and local storage.
There are typically multiple ways to provision and manage machine instances in the cloud:
In HPC environments, infrastructure can get complicated, so administrators tend to prefer solutions that use a cloud provider’s CLI or API to automate cluster deployment. Cloud providers also have their own template-based solutions such as Amazon’s CfnCluster or Google’s Cloud Deployment Manager.
For customers who want to stay portable across clouds and leverage HPC-specific capabilities, there are third-party solutions, including Univa’s Navops Launch, which automates the deployment of clusters and software.
Some popular IaaS offerings are provided below:
The short answer is: it depends. Cloud providers usually use virtual machine (VM) technology to provide machine instances. While overhead is generally low, VMs can occasionally impact storage and network performance, so this is an area to watch. Some cloud providers offer HPC instances with high-performance InfiniBand interconnects or optional bare-metal environments tailored to HPC workloads.
Cloud providers usually define a vCPU as a single thread on a hyper-threaded core, so it may take more vCPUs in the cloud to get the same throughput as you have on local hosts. Also, cloud providers may expose BIOS-level settings that can help you tune performance, depending on your application.
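As a rough sizing aid, the vCPU-to-core relationship described above can be sketched in a few lines of Python. The function name and the default of two hardware threads per core are assumptions for illustration; check how your provider defines a vCPU for the instance type you choose, and treat the result as a starting point for benchmarking rather than a throughput guarantee.

```python
def vcpus_for_cores(physical_cores: int, threads_per_core: int = 2) -> int:
    """Return the vCPU count that corresponds to a number of physical cores.

    Assumes each vCPU is one hardware thread on a hyper-threaded core.
    threads_per_core=2 is an illustrative default, not a provider guarantee.
    """
    if physical_cores < 0:
        raise ValueError("physical_cores must be non-negative")
    if threads_per_core < 1:
        raise ValueError("threads_per_core must be at least 1")
    return physical_cores * threads_per_core


# A 16-core local host maps to 32 vCPUs under the two-threads-per-core assumption.
print(vcpus_for_cores(16))
```

If a provider lets you disable hyper-threading, passing `threads_per_core=1` models that case.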
The best way to assess performance is to benchmark your own application on your chosen cloud provider, paying special attention to issues like network latency and file system performance.
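To give a sense of what a quick file system check might look like, here is a minimal sketch of a sequential write probe using only the Python standard library. The function name, file size, and block size are illustrative assumptions. It measures a single sequential write and is no substitute for benchmarking your real application or using a purpose-built I/O benchmark.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory: str, size_mb: int = 64, block_kb: int = 1024) -> float:
    """Write size_mb of zeros to a temporary file in `directory` and return MB/s.

    The data is flushed and fsync'ed so the timing includes pushing it to storage.
    """
    block = b"\0" * (block_kb * 1024)
    blocks = (size_mb * 1024) // block_kb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # always clean up the scratch file
    written_mb = blocks * block_kb / 1024
    return written_mb / max(elapsed, 1e-9)


# Probe the default temp directory; point this at the file system you care about.
rate = write_throughput_mb_s(tempfile.gettempdir(), size_mb=16)
print(f"sequential write: {rate:.1f} MB/s")
```

Running the same probe on a local host and on a cloud instance gives a first, coarse comparison before investing in a full application benchmark.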
A big advantage of cloud computing is convenience and flexibility. You can deploy state-of-the-art hardware anytime and pay for resources only when you need them. This is especially useful for fast-evolving AI and deep learning workloads where users often require expensive resources like GPUs. For short-term requirements, cloud computing is almost certainly cheaper than installing and managing a cluster in a local data center. Cloud users avoid hassles like purchasing hardware, networks, racks and storage, waiting weeks for delivery, finding data center space, etc.
For long-term use, most comparisons show that operating infrastructure in the cloud is more expensive than operating local infrastructure, but cloud users often avoid financial and technical risk and spend less time troubleshooting infrastructure. There is no easy answer, but most organizations find cloud computing compelling for at least some workloads.
In situations where resources are needed occasionally, cloud computing is almost certainly your best bet. Some enterprises pursue hybrid-cloud strategies running some applications locally and bursting others to the cloud depending on application requirements.
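The short-term versus long-term trade-off above can be made concrete with a simple break-even calculation. The sketch below is a simplification under stated assumptions: on-premises cost is modeled as a one-time capital outlay plus a flat hourly operating cost, cloud cost as a flat hourly rate, and all dollar figures are hypothetical.

```python
from typing import Optional


def breakeven_hours(onprem_capex: float, onprem_hourly: float, cloud_hourly: float) -> Optional[float]:
    """Return the hours of use at which cumulative cloud spend equals on-premises spend.

    Returns None when the cloud hourly rate does not exceed the on-premises
    hourly operating cost, in which case the cloud never becomes more expensive
    under this simple model.
    """
    if cloud_hourly <= onprem_hourly:
        return None
    return onprem_capex / (cloud_hourly - onprem_hourly)


# Hypothetical figures: $100,000 up front and $2/hour to run locally,
# versus $12/hour for equivalent cloud capacity.
hours = breakeven_hours(100_000, 2.0, 12.0)
print(f"break-even after {hours:.0f} hours of use")
```

Below the break-even point the cloud is cheaper under this model, and above it the local cluster is. A real comparison would also account for staffing, utilization, and hardware refresh cycles.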
The good thing about cloud computing is that pricing is published and transparent. Machine instances are typically priced by the hour (and even down to the minute) and vary depending on the number of vCPUs, memory, storage, and specialized components like GPUs. Costs also vary depending on quality of service and whether the infrastructure is reserved in advance (cheaper) or required on demand (more expensive).
A functioning cloud environment will usually involve costs related to storage, network traffic and ancillary services like load balancers, VPNs, NAT gateways or API gateways depending on the application.
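A back-of-the-envelope estimate that combines instance-hours with the storage, network, and ancillary charges mentioned above might look like the following sketch. Every rate here is a hypothetical placeholder, and the field names are invented for illustration; real prices vary by provider, region, and purchase model.

```python
from dataclasses import dataclass


@dataclass
class ClusterEstimate:
    instances: int            # number of machine instances
    hours: float              # hours each instance runs
    hourly_rate: float        # hypothetical $ per instance-hour
    storage_gb: float = 0.0   # GB kept for the billing period
    storage_rate: float = 0.0 # hypothetical $ per GB for the period
    egress_gb: float = 0.0    # GB transferred out of the cloud
    egress_rate: float = 0.0  # hypothetical $ per GB transferred
    ancillary: float = 0.0    # flat charges: load balancers, VPNs, gateways

    def total(self) -> float:
        compute = self.instances * self.hours * self.hourly_rate
        storage = self.storage_gb * self.storage_rate
        network = self.egress_gb * self.egress_rate
        return compute + storage + network + self.ancillary


# Example: 8 instances for 100 hours, plus storage, egress, and flat fees.
estimate = ClusterEstimate(
    instances=8, hours=100, hourly_rate=1.50,
    storage_gb=500, storage_rate=0.10,
    egress_gb=200, egress_rate=0.09,
    ancillary=30.0,
)
print(f"estimated cost: ${estimate.total():.2f}")
```

Swapping in a lower `hourly_rate` is a simple way to compare reserved pricing against on-demand pricing for the same workload.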
The good news is that most cloud providers offer cost calculators, so if you know the services you need, you can estimate costs. Links to cost calculators for some popular clouds are provided below:
Don’t forget – many of your application-related costs will still exist regardless of whether you’re running locally or in the cloud. Cloud computing may help reduce certain administrative costs, but it probably won’t eliminate them entirely.
Being in the cloud does not automatically make your server more or less secure. It is still up to users to think about security and protect their systems and data.
The good news is that cloud providers are experts at managing infrastructure and generally incorporate best practices related to security. Most providers offer tools that make it easier for system administrators to monitor and secure their systems. Also, it will probably be easier to keep systems at the latest OS releases and patch levels in the cloud, making them less vulnerable to various types of attacks.
If you are storing sensitive data in the cloud, you may want to take advantage of encryption. Most cloud providers offer solutions to encrypt both data in transit and data at rest including virtual private networks that are relatively easy to implement.
The short answer is yes. MPI (Message Passing Interface) applications are network intensive and sensitive to latency, so you’ll want to shop around for cloud providers that offer high-performance network connections between nodes. Nodes should be connected to the same network (referred to as a VPC) or via InfiniBand if the cloud provider offers it.
Some clouds offer HPC-ready offerings optimized for running parallel MPI applications common in fields like bioinformatics, physics, and computational fluid dynamics. There are multiple MPIs (Open MPI, Intel MPI, MPICH, MVAPICH, etc.) so you’ll want to consider the specific MPI that your application requires and whether the cloud provider offers tools to simplify the configuration of MPI-capable clusters.
Most cloud providers offer a variety of storage solutions. Maintaining scalable storage environments on-premises can be challenging, so this is one area that system administrators are often happy to outsource. Block storage, file system storage and object storage are all commonly used by HPC applications.
Storage in the cloud gets complicated because there are many options and pricing varies based on performance, capacity and the degree of redundancy required (mirroring data blocks across cloud regions for example, something that is hard to do in your local environment). Machine instances usually include a small amount of ephemeral or temporary storage on SSDs or magnetic drives, but this data exists only as long as a machine is instantiated.
It’s common for VM instances to be bound to separate block storage where users pay based on the amount of data stored (typically $/GB-hr). Block storage attaches to a single instance, is elastic in terms of capacity, and provides good performance. In clustered environments, data can be shared between nodes using NFS or parallel file systems like Lustre, BeeGFS, or GlusterFS configured on top of block storage. Cloud providers offer varying levels of automation around deploying parallel file systems so if fast parallel I/O is a requirement, this is worth considering.
Some cloud providers offer shared elastic file systems, providing additional convenience. These file systems can be shared by hundreds or even thousands of instances and can store terabytes or even petabytes of data while preserving existing data access semantics and supporting locking and strong consistency. Elastic file systems are similar to parallel file systems in that I/O capacity scales with the size of the file system.
The least expensive form of cloud storage is usually object storage, priced based on capacity and how quickly users anticipate needing access to their data. For large, infrequently accessed datasets, it is common to store data in an object store and extract it to a more expensive storage tier like an elastic file system only when needed.
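The tiering pattern described above, where data sits in object storage and is staged to a faster tier only while jobs run, can be compared against keeping everything on the fast tier. The sketch below uses hypothetical $/GB-hr rates and an assumed 730-hour month, and it ignores data transfer and request fees, which some providers charge.

```python
HOURS_PER_MONTH = 730  # assumed average month length in hours


def tiered_monthly_cost(dataset_gb: float, active_hours: float,
                        object_rate: float, fast_rate: float) -> float:
    """Cost of keeping data in object storage all month and on the fast tier
    only for active_hours. Rates are in $ per GB-hour."""
    object_cost = dataset_gb * HOURS_PER_MONTH * object_rate
    fast_cost = dataset_gb * active_hours * fast_rate
    return object_cost + fast_cost


def always_fast_monthly_cost(dataset_gb: float, fast_rate: float) -> float:
    """Cost of keeping the whole dataset on the fast tier all month."""
    return dataset_gb * HOURS_PER_MONTH * fast_rate


# Hypothetical rates: object storage $0.00003/GB-hr, elastic file system $0.0004/GB-hr.
tiered = tiered_monthly_cost(10_000, active_hours=40, object_rate=0.00003, fast_rate=0.0004)
always = always_fast_monthly_cost(10_000, fast_rate=0.0004)
print(f"tiered: ${tiered:.2f}  always on fast tier: ${always:.2f}")
```

The fewer hours per month the data is actively used, the larger the gap between the two figures becomes.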
For some types of HPC problems, capacity is needed only occasionally. For example, an insurance company might need to run actuarial models monthly or quarterly. A semiconductor design company may need additional simulation capacity to address late-stage issues in their design and stay on schedule. For these types of problems, cloud bursting is attractive.
In cloud bursting scenarios, applications tap capacity in a public cloud when demand for resources is high. The idea behind bursting is to maximize the use of on-premises resources and accommodate workload spikes in the cloud to minimize total cost of ownership (TCO). Cloud bursting usually involves bursting from on-premises environments to the cloud, but it’s equally valid to burst from a cloud-based cluster, dynamically adding more cloud capacity depending on workload.
For cloud bursting to be effective, the provisioning of cloud resources should be automated and reliable so that users pay for cloud resources for the shortest time possible. Cloud bursting solutions are often coupled with workload managers (like Univa Grid Engine). Navops Launch can be used with Univa Grid Engine to provide automated, workload-aware, policy-based bursting into multiple clouds.
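To illustrate the kind of decision an automated bursting policy has to make, here is a minimal sketch that sizes a burst from the pending workload. It is a hypothetical example, not how Navops Launch or Univa Grid Engine works internally; the function and parameter names are invented for illustration.

```python
import math


def instances_to_burst(pending_slots: int, free_local_slots: int,
                       slots_per_instance: int, max_instances: int) -> int:
    """Return how many cloud instances to launch for work that does not fit locally.

    Local capacity is used first; only the overflow is sent to the cloud,
    and max_instances caps the burst to keep spending bounded.
    """
    if slots_per_instance < 1:
        raise ValueError("slots_per_instance must be at least 1")
    overflow = max(0, pending_slots - free_local_slots)
    needed = math.ceil(overflow / slots_per_instance)
    return min(needed, max_instances)


# 100 slots requested, 36 free locally, 16 slots per cloud instance, cap of 10.
print(instances_to_burst(100, 36, 16, 10))
```

A matching scale-down rule that terminates idle cloud instances promptly matters just as much, since it is what keeps the paid time as short as possible.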
While cloud computing is increasingly viable for HPC, there are some potential pitfalls. Aside from some of the more obvious technical challenges, keeping costs under control, and ensuring adequate performance, here are some other things to watch out for:
Cloud computing has matured, and most cloud providers now offer HPC-specific offerings and features. For most enterprises, cloud computing is worth a look for at least some HPC applications.
There is no silver bullet, and careful analysis may be required, balancing considerations like cost, convenience, and the nature of your workloads. For more detail on some of these considerations, read our recent article “Is cloud computing right for your business?”
Univa offers a variety of cloud-ready solutions that help customers deploy and manage a wide range of high-performance applications locally or in hybrid cloud environments, using their choice of cloud provider. To learn more about Univa solutions for cloud computing or to speak with a Univa representative, contact us or visit http://univa.com.