In 2017, AWS rolled out 1,400 new features and 2018 appears to have continued that remarkable momentum. Last week I had the opportunity to watch the many announcements (and even participate in one) made at the AWS re:Invent conference in Las Vegas. I started to summarize the various HPC-related announcements for our internal team, which evolved into this blog post. As I’ve mentioned in prior posts, Univa has seen a clear tipping point in HPC cloud adoption in 2018. This year, we helped several enterprise clients move from prototype to production, deploying large-scale cloud and hybrid cloud HPC environments to the AWS platform.
The barriers to HPC cloud adoption continue to melt away. This year’s conference was marked by a slew of exciting announcements from new HPC instance types to faster interconnects to HPC-optimized filesystems. I wanted to take a few minutes and share my thoughts on some of last week’s announcements and the impact they are likely to have on HPC professionals, including our large community of Grid Engine users.
More powerful HPC instances
For HPC and deep learning applications, AWS has raised the bar by announcing the availability of a new P3 instance type with up to 8 NVIDIA Tesla V100 GPUs. For HPC workloads, AWS has also announced new C5n instances based on 3.0GHz Skylake processors. Both of these instances support up to 100 Gbps networking on the largest instances (the c5n.18xlarge and p3dn.24xlarge instance types). Our experience working with supercomputing centers deploying Tesla V100 GPUs and NVIDIA NVLink interconnects in similarly GPU-dense clustered nodes shows that topology-aware scheduling is critical for deep learning workloads. As customers deploy these new instance types with deep learning frameworks, I expect that capabilities like Univa Grid Engine Resource Maps (RSMAP) and our integration with NVIDIA’s Data Center GPU Manager (DCGM) will help customers simplify the management of deep learning workloads and make more efficient use of GPUs.
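For Grid Engine users, GPU requests of this kind are typically expressed through an RSMAP complex. The sketch below is illustrative only: the complex name (gpu), host name, and job script are assumptions, and every site's configuration will differ.

```shell
# Sketch: GPU scheduling via a Grid Engine RSMAP complex (assumes the
# administrator has defined a complex named "gpu" of type RSMAP).
#
# Admin side: declare the per-host GPU id map on an execution host,
# e.g. via "qconf -me gpu-node01" with a line such as:
#   complex_values   gpu=4(0 1 2 3)
#
# User side: request two GPUs. Grid Engine grants specific device ids
# from the map, so the job can bind only to the GPUs it was assigned.
qsub -l gpu=2 -b y ./train_model.sh
```

Because the scheduler knows which physical device ids each job holds, it can place jobs with topology (and NVLink peer relationships) in mind rather than treating GPUs as an anonymous counter.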
Elastic Fabric Adapter (EFA)
Instances with faster networking are only useful if the interconnects can keep pace, so we were delighted to see the announcement of a latency-optimized Elastic Fabric Adapter to complement the new C5n and P3 instance types. It sounds like AWS plans to roll out this adapter to other instance types in the future. Most of our clients running demanding MPI workloads have tended to run on-premise with InfiniBand or Intel Omni-Path switches and adapters to achieve acceptable performance levels. With similar interconnect performance now available in the AWS cloud, I expect that this interconnect-related barrier to cloud migration has begun to fall away.
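To make this concrete, here is a rough sketch of what launching an EFA-enabled C5n instance might look like with the AWS CLI once the adapter is broadly available. Every identifier below (AMI, subnet, security group, placement group) is a placeholder, not a working value.

```shell
# Sketch: requesting the Elastic Fabric Adapter at launch time by
# setting InterfaceType=efa on the instance's network interface.
# A cluster placement group keeps MPI ranks physically close.
aws ec2 run-instances \
  --instance-type c5n.18xlarge \
  --image-id ami-0123456789abcdef0 \
  --placement "GroupName=my-hpc-placement-group" \
  --network-interfaces \
    "DeviceIndex=0,SubnetId=subnet-0123456789abcdef0,Groups=sg-0123456789abcdef0,InterfaceType=efa" \
  --count 2
```

In practice, a tool like Navops Launch would issue equivalent provisioning calls automatically as part of standing up an MPI-optimized Grid Engine cluster.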
Navops Launch is a good way for clients to get their feet wet with the new instance types and EFA. Customers can automate the deployment of MPI-optimized Univa Grid Engine clusters and pay for resources only when needed. Depending on relative performance and cost, customers can decide what workloads make sense to shift to the cloud and when. It’s useful that AWS also announced capabilities around network-optimized data movement and parallel file systems along with EFA. For many customers, these capabilities often go hand-in-hand.
AWS DataSync
Anyone who has worked on hybrid-cloud HPC deployments knows that data management is a central challenge. AWS DataSync looks to be a well-thought-out solution that addresses many of the issues we’ve experienced first-hand. It provides an easy way to transfer or synchronize data between on-premise NFS environments and AWS S3 or Amazon Elastic File System (EFS). It also provides a performance-optimized network transfer protocol for faster data movement and TLS encryption. In hybrid cloud environments, details like data handling should ideally be transparent to application users. In our typical hybrid cloud deployment model, when there are insufficient resources to run a workload locally, Univa Grid Engine will communicate with Navops Launch to dynamically provision cluster resources, install software and transfer needed data files as necessary (subject to policy controls and cost considerations, of course). AWS DataSync will provide another valuable data handling mechanism that can be triggered via Navops Launch’s automation, both for clusters that persist in the cloud and for transient clusters deployed to run specific workloads for a short time.
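As an illustration of the workflow, the following AWS CLI sketch wires an on-premise NFS export to an S3 bucket and kicks off a transfer. All ARNs, hostnames, and bucket names are placeholders, and it assumes a DataSync agent has already been deployed on-premise and activated.

```shell
# Sketch: one-way DataSync transfer from on-premise NFS to S3.

# 1. Register the NFS source and the S3 destination as locations.
SRC=$(aws datasync create-location-nfs \
        --server-hostname nfs.example.internal \
        --subdirectory /export/projects \
        --on-prem-config AgentArns=arn:aws:datasync:us-east-1:111122223333:agent/agent-0123 \
        --query LocationArn --output text)
DST=$(aws datasync create-location-s3 \
        --s3-bucket-arn arn:aws:s3:::my-hpc-staging-bucket \
        --s3-config BucketAccessRoleArn=arn:aws:iam::111122223333:role/DataSyncS3Role \
        --query LocationArn --output text)

# 2. Create the transfer task and start an execution; data moves over
#    DataSync's accelerated protocol with TLS encryption in flight.
TASK=$(aws datasync create-task \
        --source-location-arn "$SRC" \
        --destination-location-arn "$DST" \
        --query TaskArn --output text)
aws datasync start-task-execution --task-arn "$TASK"
```

Automation such as Navops Launch could trigger steps like these as part of provisioning a cloud cluster, so staging data becomes a policy decision rather than a manual chore.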
Amazon FSx for Lustre
With AWS FSx for Lustre, Amazon is addressing another barrier to HPC cloud migration. While parallel file systems, such as BeeGFS, GlusterFS and Intel Cloud Edition for Lustre (now deprecated) have been available for some time, AWS did not provide a native parallel file system. This meant that customers needed to fend for themselves, wrestling with version issues, deployment methods, and deciding on optimal instance types, network and storage configurations.
Fast parallel file systems are commonly used as scratch storage during large-scale, multi-step simulations. The file system is often needed only while the parallel simulation runs. Recognizing this, AWS has integrated FSx for Lustre with cost-efficient AWS S3 object storage. Users can associate their AWS FSx filesystem with an S3 bucket for seamless access. AWS will automatically copy S3 data to FSx for Lustre as needed, and write results back to S3 or other low-cost data stores. Because FSx for Lustre is deployable programmatically or via scripts, we can auto-deploy and tear down clusters underpinned by AWS FSx as needed, and manage data movement and application software under control of Navops Launch. I expect this will become an important new use case in the coming year.
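As a rough sketch of that deploy-and-tear-down pattern with the AWS CLI (the bucket name, subnet, and capacity values below are placeholders):

```shell
# Sketch: create a transient FSx for Lustre file system linked to an
# S3 bucket. Objects in the bucket appear in the Lustre namespace and
# are lazily loaded on first access; results can be exported back.
aws fsx create-file-system \
  --file-system-type LUSTRE \
  --storage-capacity 3600 \
  --subnet-ids subnet-0123456789abcdef0 \
  --lustre-configuration \
    ImportPath=s3://my-simulation-data,ExportPath=s3://my-simulation-data/results

# Once the file system reports AVAILABLE, mount it on compute nodes
# (the DNS name comes from the create-file-system response):
#   sudo mount -t lustre <fs-dns-name>@tcp:/fsx /scratch
```

When the simulation completes, deleting the file system stops the Lustre charges while the underlying data stays in low-cost S3, which is exactly the scratch-storage lifecycle described above.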
While we tend to think of parallel file systems as useful only for applications like CFD, crash, or seismic analysis, the truth is that many applications can benefit from fast parallel IO. Issues around cost and complexity have caused users to settle for lower performance alternatives. But when parallel file systems can be deployed quickly and cost-effectively, users will go with the higher-performance option. FSx for Lustre is a potential game changer for many HPC users because it enables them to pay for infrastructure only when they need it, avoid the complexity of deploying and managing parallel file systems, and run large-scale simulations faster while taking advantage of low-cost object storage.
As a member of the AWS Partner Network (APN), I was also pleased to see Amazon making investments in their HPC partner program with AWS Navigate. HPC implementations tend to be a team sport, and having visibility into partners with deep expertise in specific areas is good for us and, more importantly, for our customers. With additional resources, educational materials and architectural frameworks, we can work more efficiently and deliver higher-quality solutions.
AWS made other announcements as well, including new EC2 A1 instances running the Arm-based Graviton processor. Univa Grid Engine supports these platform types as well. AWS also made an interesting move, with the announcement of AWS Outposts. Similar to Microsoft’s Azure Stack, AWS Outposts makes native AWS services available on-premise, and it can also be managed as a cloud stack on VMware for customers that want to use familiar tools.
At Univa we’ve been investing heavily in our relationship with AWS, offering our flagship Univa Grid Engine product via the AWS Marketplace and optimizing our Navops Launch platform to deploy clusters and manage data on AWS dynamically. We’ve also worked diligently to improve cloud scalability, successfully deploying a 1,000,000+ core Grid Engine cluster on AWS.
With all these announcements, 2019 promises to be another exciting year for HPC in the cloud.