For most enterprises, moving at least some workloads to the cloud will probably be a no-brainer, especially for short-term or project-based requirements. As long as you have a reliable internet connection, the cloud gives you fast, convenient access to state-of-the-art hardware without the hassle of purchasing, installing, and managing on-premises servers. Some businesses may need to compare broadband options before moving to the cloud to ensure their connection is secure and fast enough to reach cloud resources whenever they need them.
Most organizations will probably choose a single cloud provider. Learning each provider’s interface and tools takes an investment, so once you pick a cloud partner, there may be little incentive to look elsewhere. It’s prudent to think about portability, however. You never know when business or technical circumstances might change, forcing you to consider migration or even bringing applications back in-house.
The Hotel California syndrome (where you can never leave) is alive and well in the cloud. Below is a list of five things to consider to keep your HPC environment portable and cloud agnostic.
You’ll probably want to automate the process of building, growing, shrinking, and retiring clusters in the cloud. Most organizations find themselves deploying clusters repeatedly for different applications and projects. Once cloud instances are deployed, the meter is running, so manually configuring or troubleshooting clusters wastes both time and money.
For automating deployments, each cloud provider offers its own solutions. Most offer a command line interface (Amazon’s aws CLI, Microsoft’s az utility for Azure, or Google Cloud Platform’s gcloud command) that is useful for automating the management of cloud infrastructure. Some providers offer more capable frameworks for automating cluster deployments, such as Amazon’s CfnCluster or Google’s Cloud Deployment Manager. While these tools are useful, they usually don’t deal with layered software and applications. Also, by the time users learn any of these cloud-specific tools and use them to automate their processes, they’re essentially locked in. The effort needed to learn a new interface and rewrite scripts to move to a different cloud provider will be high.
Where possible, it is better to look for provisioning solutions that are cloud-provider agnostic so that your customization efforts are portable across clouds. One such tool is Navops Launch, which is based on the open-source Project Tortuga.
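To make the idea of provider-agnostic provisioning concrete, here is a minimal Python sketch. It assumes a hypothetical `ClusterProvisioner` interface and an in-memory stand-in backend; none of these names come from Navops Launch, Project Tortuga, or any cloud SDK. The point is only that the build, grow, shrink, and retire logic is written once against a neutral interface, so switching clouds means swapping a backend rather than rewriting scripts.

```python
from abc import ABC, abstractmethod


class ClusterProvisioner(ABC):
    """Hypothetical provider-neutral interface (not the API of any real tool)."""

    @abstractmethod
    def build(self, name, nodes): ...

    @abstractmethod
    def resize(self, name, nodes): ...

    @abstractmethod
    def retire(self, name): ...

    @abstractmethod
    def size(self, name): ...


class InMemoryProvisioner(ClusterProvisioner):
    """Stand-in backend for illustration; a real one would wrap a single provider's API."""

    def __init__(self):
        self._clusters = {}  # cluster name -> node count

    def build(self, name, nodes):
        if name in self._clusters:
            raise ValueError(f"cluster {name!r} already exists")
        self._clusters[name] = nodes

    def resize(self, name, nodes):
        if name not in self._clusters:
            raise KeyError(name)
        self._clusters[name] = nodes

    def retire(self, name):
        del self._clusters[name]

    def size(self, name):
        return self._clusters[name]


def run_project(provisioner, name, base_nodes, peak_nodes):
    """Cluster lifecycle written once, against the interface only."""
    provisioner.build(name, base_nodes)
    provisioner.resize(name, peak_nodes)   # grow for the busy phase
    peak = provisioner.size(name)
    provisioner.resize(name, base_nodes)   # shrink when demand drops
    provisioner.retire(name)               # stop the meter
    return peak
```

Because `run_project` never mentions a provider, the cloud-specific code is confined to whichever backend class is passed in.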
Most people associate containers with cloud-based micro-services, but containers are useful in HPC as well. For HPC, the main use case for containers is encapsulation. Administrators package up various OS facilities, libraries, and applications in containers, making them readily portable between servers and clouds. There may still be issues around hardware dependencies (GPUs, for example), but containerized workloads are generally much easier to manage and move.
Containerized applications can run on any machine with a Docker, Singularity, or Shifter engine, and containers are increasingly well supported by cluster and workload managers. To learn more about container options for HPC workloads, check out our recent article Containers for HPC Cloud.
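As a rough illustration of that portability, the sketch below builds the command line for running the same image under Docker or Singularity. The image name and solver command are made up for the example, and Shifter is left out for brevity. It relies on `docker run --rm` and on Singularity's `docker://` URI scheme for pulling from a Docker registry; the function only constructs the argument list and does not execute anything.

```python
def container_command(engine, image, command):
    """Return the argv list that runs `command` inside `image` under `engine`."""
    if engine == "docker":
        # --rm removes the container once the job finishes
        return ["docker", "run", "--rm", image, *command]
    if engine == "singularity":
        # docker:// tells Singularity to fetch the image from a Docker registry
        return ["singularity", "exec", f"docker://{image}", *command]
    raise ValueError(f"unsupported engine: {engine}")


if __name__ == "__main__":
    # Hypothetical image and solver command, for illustration only
    image = "registry.example.com/hpc/solver:1.0"
    job = ["solver", "--input", "case01.dat"]
    for engine in ("docker", "singularity"):
        print(" ".join(container_command(engine, image, job)))
```

The job definition (image plus command) stays the same on every site; only the engine available on the host changes.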
Cloud environments present all kinds of cool services, enabling developers to build scaled-out applications in a variety of ways. For example, serverless computing solutions like AWS Lambda or Google Cloud Functions can be used for a variety of service-oriented, parallel workloads with minimal data dependencies. Users running batch jobs can bypass the need to configure clusters entirely: they code to a service like AWS Batch or Azure Batch and let the service look after provisioning instances.
While these solutions are tempting, integrating with cloud-specific services typically requires coding. This means that an integration you develop on one cloud platform will not be easily portable to another. In some cases, the execution models are different. For example, AWS Batch has its own API and executes workloads in Amazon Elastic Container Service (ECS), while Azure Batch has a distinctly different API and relies on Azure virtual machine scale sets.
There is a place for cloud-specific services, but if you’re running many applications, this can get messy fast because not all applications will support all cloud services. If you see value in a serverless computing model for your application, you might be better off packaging your application logic in a Docker container and deploying it as a Kubernetes service. This way your application will be portable across the Kubernetes services offered by most cloud providers (Amazon EKS, Microsoft Azure AKS, and Google GKE).
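As a sketch of what the Kubernetes route looks like, the snippet below generates a minimal `apps/v1` Deployment manifest for a containerized worker. The application name and image are hypothetical. A Deployment only runs the containers; exposing them would take a separate Service object, which is omitted here. The same manifest should be accepted by any conformant Kubernetes cluster, whichever provider hosts it.

```python
import json


def deployment_manifest(name, image, replicas=1):
    """Build a minimal apps/v1 Deployment manifest as a plain dict."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            # The selector must match the labels on the pod template
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical application name and image, for illustration only
    manifest = deployment_manifest(
        "pricing-worker", "registry.example.com/pricing-worker:1.0", replicas=3
    )
    print(json.dumps(manifest, indent=2))
```

Since `kubectl` accepts JSON as well as YAML, the output can be saved to a file and applied with `kubectl apply -f`.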
When moving applications to the cloud, storage is always an issue. While local clusters tend to use file-based storage models (on local disks, shared file systems, or parallel file systems), operating in the cloud opens up new possibilities.
For block storage in the cloud, charges are based on the number of gigabytes stored per month but can vary based on the underlying technology. Depending on the cloud provider, users can also take advantage of managed file services such as Amazon Elastic File System (EFS), or use object stores like Amazon S3 or Azure Blob storage. Object stores are desirable because they are low cost, but data is accessed through an API rather than a file system interface. As if this all weren’t complicated enough, pricing can vary based on things like replication options, speed of retrieval, and whether data is hot, cool, or cold.
Amazon’s Glacier is a cold storage service providing archival costs as low as $0.004 per gigabyte per month. Cold storage might be a good solution, but calculate costs carefully before putting large amounts of data in the freezer. In addition to monthly storage charges, expect separate rate structures for data retrieval and data transfer, based on the amount of data you’re retrieving from cold storage. To illustrate the point, retrieving 10 TB of data from Amazon Glacier out to the internet at $0.09/GB would cost you approximately $900, dramatically more than keeping the data in the freezer for another month. Moving data between different forms of storage within the same cloud is 5 to 10 times cheaper than moving it out of the cloud. There is value in having the cloud provider archive your data, but be aware that the pricing structure is probably cleverly designed to keep your data locked away in their cloud.
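The arithmetic behind that comparison is easy to script before committing data to cold storage. The sketch below uses decimal terabytes (1 TB = 1,000 GB), which is what makes 10 TB at $0.09/GB come to $900. The archival rate in the example is an illustrative assumption, so substitute your provider's current list prices, and note that per-request and retrieval-tier fees are ignored.

```python
GB_PER_TB = 1000  # decimal terabytes


def transfer_out_cost(terabytes, rate_per_gb):
    """One-time cost of moving data out of the cloud."""
    return terabytes * GB_PER_TB * rate_per_gb


def monthly_storage_cost(terabytes, rate_per_gb_month):
    """Recurring cost of leaving the data in cold storage."""
    return terabytes * GB_PER_TB * rate_per_gb_month


def months_of_storage_per_exit(terabytes, egress_rate, storage_rate):
    """How many months of archival storage one full retrieval would pay for."""
    return transfer_out_cost(terabytes, egress_rate) / monthly_storage_cost(
        terabytes, storage_rate
    )


if __name__ == "__main__":
    EGRESS_RATE = 0.09    # $/GB out to the internet, as in the example above
    ARCHIVE_RATE = 0.004  # $/GB-month; illustrative assumption, check current pricing
    print(f"retrieve 10 TB: ${transfer_out_cost(10, EGRESS_RATE):,.2f}")
    print(f"store 10 TB for a month: ${monthly_storage_cost(10, ARCHIVE_RATE):,.2f}")
    months = months_of_storage_per_exit(10, EGRESS_RATE, ARCHIVE_RATE)
    print(f"one retrieval costs about {months:.1f} months of storage")
```

Running the numbers this way makes the exit cost visible up front, instead of discovering it when a migration is already under way.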
A common HPC use case is cloud bursting. Often it makes sense to run applications locally and burst into the cloud during peak periods when local capacity is not available. While cluster administrators usually think of bursting at the infrastructure level, bursting can also be done at the application level, with the bursting mechanism tightly coupled to the application.
Towers Watson’s MoSes, widely used for actuarial modeling, is a good example. With the MoSes Azure cloud service, tapping cloud resources for simulations is done from within the MoSes desktop and is transparent to the user. ANSYS CAE users can choose from among an increasing variety of ANSYS cloud partners offering access to various cloud-based application solvers, either supplementing local capacity or providing fully hosted environments. Examples include Rescale, Nimbix, and UberCloud.
While application-level bursting is attractive for its ease of implementation, it is a sure-fire way to get locked into a cloud. Costs related to infrastructure, software licenses, and services get jumbled together, and pricing becomes opaque. Furthermore, organizations running multiple applications can end up with multiple application-specific bursting solutions, posing a serious management challenge. An advantage of managing bursting policies at the workload management layer is that administrators have full visibility and a single point of control, and they retain the option to burst to their choice of cloud provider.
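A bursting policy managed at the workload management layer can be quite simple. The sketch below is a hypothetical policy function, not the behavior of any particular workload manager: given the backlog of pending job slots and the free local capacity, it decides how many cloud nodes to request, subject to a hard cap. All names and thresholds are assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class BurstPolicy:
    """Hypothetical site-wide policy, set once by the administrator."""
    slots_per_cloud_node: int
    max_cloud_nodes: int           # hard cap, for example driven by budget
    min_overflow_to_burst: int = 1  # ignore backlogs smaller than this


def cloud_nodes_to_request(pending_slots, free_local_slots, policy):
    """Return how many cloud nodes to add for the current backlog."""
    overflow = pending_slots - free_local_slots
    if overflow < policy.min_overflow_to_burst:
        return 0  # local capacity is enough; stay on premises
    needed = math.ceil(overflow / policy.slots_per_cloud_node)
    return min(needed, policy.max_cloud_nodes)


if __name__ == "__main__":
    policy = BurstPolicy(slots_per_cloud_node=16, max_cloud_nodes=10)
    print(cloud_nodes_to_request(pending_slots=100, free_local_slots=20, policy=policy))
```

Because the decision depends only on slot counts, the same policy applies to every application on the cluster, and the resulting node count can be handed to whichever cloud the administrator chooses.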
Some organizations may decide that the convenience of using particular services is worth the risk of becoming locked in. The key is to recognize the risks and make decisions with full knowledge of the costs you’re likely to incur if you need to go in a different direction in the future.