Workload monitoring and reporting solutions for clusters have been around for many years. Cluster administrators have always needed data about workloads and resource usage to help fine-tune sharing policies, identify bottlenecks, plan capacity, and provide chargeback and showback accounting.
In this article, I explain how reporting and monitoring have evolved in Grid Engine and describe an exciting new approach that we’ve taken in the latest version of Univa Unisight. Before I introduce the new capabilities, it’s helpful to understand how Grid Engine reporting and monitoring have advanced.
A brief history of Grid Engine monitoring and reporting
In the early days of Grid Engine, reporting was primitive. Grid Engine logged information to an accounting file and an optional reporting file that logged internal state events. Users leveraged the built-in accounting system (qacct), or wrote custom scripts to extract needed metrics.
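To give a flavor of those custom scripts, here is a minimal Python sketch that totals wall-clock time per user from accounting records. It assumes the colon-separated layout described in the accounting(5) man page, with the owner in the fourth field and ru_wallclock in the fourteenth. The field positions, the time units, and the sample records below are assumptions for illustration only, so check them against the documentation for your Grid Engine version before relying on them.

```python
from collections import defaultdict

# Assumed names and order of the leading colon-separated fields;
# verify against accounting(5) for your Grid Engine version.
FIELDS = (
    "qname", "hostname", "group", "owner", "job_name", "job_number",
    "account", "priority", "submission_time", "start_time", "end_time",
    "failed", "exit_status", "ru_wallclock",
)


def parse_record(line):
    """Return a dict of the leading fields, or None if the line is too short."""
    parts = line.rstrip("\n").split(":")
    if len(parts) < len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))


def wallclock_by_owner(lines):
    """Sum ru_wallclock per job owner, skipping comments and malformed lines."""
    totals = defaultdict(float)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        record = parse_record(line)
        if record is None:
            continue
        try:
            totals[record["owner"]] += float(record["ru_wallclock"])
        except ValueError:
            # Non-numeric wall-clock value: ignore the record.
            continue
    return dict(totals)


# Made-up records for illustration; real files carry many more fields.
sample = [
    "# comment line",
    "all.q:node01:staff:alice:sim:101:sge:0:1000:1010:1110:0:0:100",
    "all.q:node02:staff:bob:render:102:sge:0:1005:1020:1050:0:0:30",
    "all.q:node01:staff:alice:sim:103:sge:0:1100:1120:1170:0:0:50",
]

totals = wallclock_by_owner(sample)
```

With a real cluster, the same function could be fed an open file handle on the accounting file instead of the sample list. A simple split on colons would misread any record whose job name itself contains a colon, which is one reason ad hoc scripts like this tended to be fragile.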
Back in 2006, Sun Microsystems bundled its commercial ARCo tool into open-source Grid Engine, providing a web interface to reporting data and a configurable database writer that stored Grid Engine data in an RDBMS. At the time, this was a big step forward. Because the data was stored in Oracle, MySQL, or PostgreSQL, it could easily be queried, and the system supported derived values like time-based averages as well as facilities to remove outdated records automatically.
Univa delivers new capabilities with Unisight
The next big advance in reporting occurred in 2011, when Univa bundled Unisight, our commercial tool built on BI technology, in Univa Grid Engine 8.0.0. Rather than relying on a single transactional RDBMS, Unisight was architected like a data warehouse with an ETL engine. Unisight users could quickly generate a wide variety of reports across multiple clusters with millions of jobs. Unisight also included a much-improved user interface and allowed users to configure, store, generate, and export reports and charts.
Analytics evolve in the age of big data
While Unisight continued to evolve with many new features (Docker support, GPU monitoring, license monitoring, etc.), by 2014, industry approaches to large-scale data management were changing dramatically. Univa customers were running multiple large clusters with thousands of cores and tens of millions of jobs requiring vast amounts of storage and analytic capacity. The traditional RDBMS was fast giving way to a new generation of NoSQL datastores better suited to large-scale data storage and analytics. New non-relational data stores such as InfluxDB and Prometheus made it possible for the time-series data common in monitoring and charting applications to be queried directly without expensive and complex ETL operations.
To modernize and provide additional scalability, in 2014, we released an entirely new version of Unisight. Unisight 4.0 removed dependencies on the old ARCo technology, leveraged the Univa Grid Engine REST API for greater efficiency and delivered a modern new Web UI. It also dropped support for PostgreSQL in favor of MongoDB. By using a scalable open-source document store, we were able to simplify management, provide better performance, and more easily accommodate new types of data useful for downstream analysis.
Customers demand open, extensible interfaces
Over the past few years, customer requirements have shifted once more. To get a full view of their operations, customers increasingly need information from multiple sources. These include not only workload managers but also data sources such as Splunk, third-party tools, and various cloud data sources, including AWS CloudWatch and Azure Monitor.
To meet these needs, customers are increasingly standardizing on analysis and visualization tools that support multiple data sources. Examples include open-source tools such as Grafana, Kibana, and Freeboard as well as commercial BI and analytic tools such as Tableau, Microsoft Power BI and MicroStrategy.
Another benefit of using a single analysis tool is that customers can standardize how they manage alerting. In the past, customers often used discrete monitoring tools each with their own alerting facility. For example, Nagios, Univa Unisight, Univa Grid Engine, NVIDIA’s DCGM, AWS CloudWatch and NetApp’s OnCommand Insight are all capable of generating various types of alerts administered through different management interfaces.
Rather than managing all these siloed alerting facilities, users find it easier to aggregate data from multiple sources into a single tool such as Grafana that provides a uniform way of managing alerts across data sources. Besides simplifying administration, this lets customers use the plug-ins available for popular notification channels such as Slack, PagerDuty, OpsGenie, and webhooks.
Providing timely access to open data sources
As customers have standardized on their preferred analysis and visualization tools, easy access to workload and cluster-related data has become the most important criterion for a reporting engine. Customers are sophisticated with their chosen analytic tools but often struggle to gather, manage, and curate large volumes of reporting data from multiple clusters at scale while delivering adequate performance to front-end analysis tools.
Most of the cluster deployments we’ve worked on recently have required that we provide a data bridge to make metrics in Unisight available to a customer’s preferred analysis tools. Driven by customer feedback, we’ve taken a new approach in Unisight that has been well received by customers.
A new Unisight built for the modern hybrid cloud data center
With Unisight 4.4.2, we continue to provide a robust data gathering facility and raw metrics data store, but we’ve shifted to focus on being more open to third-party analytic tools and additional data sources. Rather than delivering a closed end-to-end reporting and monitoring solution, we’ve endeavored instead to create a reporting framework that’s flexible and provides multiple points of integration with other reporting tools and data sources.
Important changes in our latest version of Unisight are:
This new reporting architecture provides customers with the best of both worlds. Clients can quickly implement an end-to-end reporting system with minimal integration headaches if they choose, knowing that the system will be future-proof and integrate easily with additional data sources. For customers already invested in existing monitoring tools, this open architecture makes workload and cluster-related data in Univa Grid Engine easier to access, providing more flexibility and reducing the cost and complexity of integration.
Customers can continue to use the existing Unisight Web UI, and the new open framework is available in Unisight 4.4.2. Univa is also providing Grafana dashboard customization services for clients that need help creating custom data views or integrating with additional data sources.
Are you using Unisight or other reporting solutions with Univa Grid Engine or other workload managers? We’d be interested in your views.