Among statisticians and data scientists, R remains one of the most popular and versatile programming languages. According to the KDnuggets annual software survey, R has consistently ranked among the top data science tools for the past three years. In the most recent 2018 survey, 50% of data scientists reported using R. In this article I’ll discuss R and explain how analysts and data scientists can use Univa Grid Engine with R and other analytic applications.
About the R language
R is open source and runs on multiple platforms. Although it is interpreted, well-written R code can be fast, because many core and vectorized operations execute in compiled code. It is also object-oriented, supports integrated data visualization, and is easily extensible. More than 10,000 R add-on packages are available from the Comprehensive R Archive Network (CRAN) and other sources. Data scientists and analysts use R in almost every field, in applications ranging from genetics to machine learning to natural language processing to economics. Inspired by the earlier S programming language, R was first released in 1995, with the first stable version (1.0.0) following in 2000.
R provides multiple interfaces
Like most interpreted languages, R is typically run from the command line. Programmers who need to create scripts and run R non-interactively can invoke R with the CMD BATCH arguments (R CMD BATCH) and pass it an input file containing R commands. R also provides Rscript, a binary front-end that simplifies scripting. Rscript will feel familiar to Linux users: placing the “shebang” line (#!/usr/bin/Rscript) at the top of a file, and marking the file executable, lets an R script run directly from the command line.
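The non-interactive invocation styles described above can be sketched as follows. This is a minimal illustration, not a recommended layout: the file names hello.R and hello.Rout are hypothetical, and the shebang path assumes Rscript is installed in /usr/bin, which may not hold on every system. The script only calls R if it is actually installed.

```shell
#!/bin/sh
# Write a small R script (hello.R is a hypothetical file name).
cat > hello.R <<'EOF'
#!/usr/bin/Rscript
# Print the mean of a small vector.
x <- c(2, 4, 6)
cat("mean:", mean(x), "\n")
EOF

# Mark the script executable so the shebang line can take effect.
chmod +x hello.R

# Run it three ways, but only if R is installed on this machine.
if command -v Rscript >/dev/null 2>&1; then
    # 1. The Rscript front-end: output goes to the terminal.
    Rscript hello.R
    # 2. Batch mode: output is written to hello.Rout instead.
    R CMD BATCH --no-save hello.R hello.Rout
    # 3. Direct execution through the shebang line.
    ./hello.R || echo "Rscript may not live in /usr/bin on this system"
else
    echo "R not found; hello.R was written but not executed"
fi
```

Rscript suits short scripts whose output should go to the terminal or a pipe, while R CMD BATCH keeps a transcript of the commands and their output in a separate file.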
In 2011, R became much easier to use with the initial release of open-source RStudio. RStudio is an integrated development environment (IDE) for R that includes a console, a syntax highlighting editor, and a plots window for interactive visualization of datasets. For users with small models, RStudio can be run on a desktop or laptop, and in larger environments, RStudio Server provides multiple users with access to RStudio through a web interface. While some users still use the command line, most R users will prefer to work in RStudio. The R command line is exposed as one of the panes within RStudio, so there is little reason for analysts using R to leave the IDE.
R in grid computing environments
Because R scripts can be run from the command line, integrating R with Univa Grid Engine is straightforward. Grid Engine users can submit R jobs using Rscript or R CMD BATCH, and RStudio integrates easily with grid environments, too. The only requirement for running R on a distributed cluster is that the R interpreter be present on each Grid Engine node.
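As a rough sketch of what submitting an R job can look like, the script below writes a small Grid Engine job script and hands it to qsub. The names model.R, run_model.sh, r_model, and r_model.log are hypothetical, the R code is a placeholder, and the example assumes Rscript is on the PATH of every execution node. Submission is skipped when qsub is not available.

```shell
#!/bin/sh
# A placeholder R script standing in for a real model (hypothetical name).
cat > model.R <<'EOF'
# Fit a simple linear model on a built-in dataset and print a summary.
fit <- lm(mpg ~ wt, data = mtcars)
print(summary(fit))
EOF

# The Grid Engine job script. Lines starting with "#$" are read by qsub
# as submission options.
cat > run_model.sh <<'EOF'
#!/bin/sh
# Job name as shown by qstat.
#$ -N r_model
# Run the job in the directory it was submitted from.
#$ -cwd
# Merge standard error into standard output.
#$ -j y
# Write the job's output to this file.
#$ -o r_model.log
Rscript model.R
EOF

# Submit the job only if Grid Engine's qsub command is available.
if command -v qsub >/dev/null 2>&1; then
    qsub run_model.sh
else
    echo "qsub not found; job script written to run_model.sh"
fi
```

Once submitted, the job can be monitored with qstat, and its output appears in r_model.log in the submission directory.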
Parallelizing R calculations across clusters is a common use case. The CRAN website devotes an entire “task view” to the topic of High-Performance and Parallel Computing with R. Although users will typically work in the RStudio interface, parallel R applications can transparently submit Grid Engine jobs to a cluster behind the scenes and aggregate the results before returning them to the analyst. Details about the underlying cluster are hidden from the R developer, who only needs to know how to code to the appropriate parallel R framework.
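For comparison with the package-driven approach described above, parallel work can also be spread across a cluster by hand using a Grid Engine array job. The sketch below is that manual alternative, not what any particular R package does internally. The names task.R, array_job.sh, and r_array are hypothetical, the task count of four is arbitrary, and the simulation is a placeholder.

```shell
#!/bin/sh
# R script run once per task; the task number arrives as an argument.
cat > task.R <<'EOF'
args <- commandArgs(trailingOnly = TRUE)
task <- as.integer(args[1])
# Use the task number as the seed so each task draws different values.
set.seed(task)
result <- mean(rnorm(1000))
# Each task saves its own result file for later aggregation.
saveRDS(result, sprintf("result_%d.rds", task))
EOF

# Array job script: Grid Engine runs it once for each task ID in the
# range and sets SGE_TASK_ID to the current task number.
cat > array_job.sh <<'EOF'
#!/bin/sh
#$ -N r_array
#$ -cwd
#$ -t 1-4
Rscript task.R "$SGE_TASK_ID"
EOF

# Submit only if Grid Engine's qsub command is available.
if command -v qsub >/dev/null 2>&1; then
    qsub array_job.sh
else
    echo "qsub not found; array job script written to array_job.sh"
fi
```

When all tasks finish, a short follow-up R script could read the result_*.rds files with readRDS and combine them. Parallel R frameworks automate this split-and-aggregate pattern so the analyst does not have to manage it.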
Among the many parallel computing packages for R, a few key modules will be of interest to Univa Grid Engine users.
Notebooks are worth a special mention as well, because data scientists and statisticians working in R frequently need to collaborate and share models with others. Apache Zeppelin and Jupyter notebooks are both used as front-ends for R-based applications. RStudio also supports a notebook facility called R Markdown Notebooks. Shiny, a separate package from RStudio, lets models developed in R be exposed through a high-quality interactive web interface for visualizing data.
Running R on a shared cluster environment provides multiple advantages for analysts and data scientists, as well as for the IT staff who support the analytic environment.
According to KDnuggets research, the average data scientist runs seven different analytic frameworks. R, Python, Anaconda, scikit-learn, TensorFlow, Keras, and Apache Spark are all popular choices. Most of these frameworks can benefit from distributed grid computing environments. For IT organizations, providing a shared grid environment that can support multiple analytic tools and frameworks makes good sense.
Are you using Univa Grid Engine to run R, Python or other data science workloads? We’d love to hear about your experiences.