Microsoft R or Open Source R – Which Suits You the Best

Published by Tathagata Mukhopadhay | Asst. Vice President, Client Solutions

R, the open source statistical software, is rapidly gaining popularity in the world of analytics. Over the last few years, companies of all sizes have adopted R as their analytical engine. Because R is open source and analytics is a booming industry, many product companies have made their products able to integrate with R. For example, we can pass data from Tableau to R, run some analysis in R and send the result back to Tableau for visualization.

Different R Products

Microsoft has also come into the picture in a different way. Revolution Analytics, a California-based firm founded in 2007, developed an enterprise version of R called Revolution R Enterprise. This product, launched in 2014, introduced some proprietary components and libraries that can manage Big Data through parallel processing. In January 2015, Microsoft acquired Revolution Analytics and rebranded several of its products, making some free of cost and others licensed. The product set includes Microsoft R Open, Microsoft R Client and Microsoft R Server.

Hence, with four R products on the market (including open source R, or CRAN R), of which three are free and one (Microsoft R Server) is licensed, one might be confused about the differences among them and end up not using the most suitable one.

Because the Microsoft products are comparatively new, not much documentation is available on the net other than on Microsoft websites. Though the products are very well described there, I felt the need to summarize them in a comparative view of the four products.

Comparison between Different R Products

Before starting any comparison, let me mention the main drawback of open source R (we will call this CRAN R from here onwards), which all R users know: R runs in memory. Hence, how long a piece of R code takes to analyze your data depends on your computer hardware, and if the data goes beyond the memory limit, the code will crash. A logistic regression that takes 15 seconds to run on one machine may well take 10 seconds on a high-end computer with the same dataset. It may also fail to run at all on a low-configuration computer.
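As a rough illustration of this memory limit, the sketch below estimates how much RAM a purely numeric table would need before it is loaded. The helper name estimate_numeric_mb and the example dimensions are hypothetical; the only fact relied on is that R stores each double-precision number in 8 bytes.

```r
# Rough size, in megabytes, of a table holding only double-precision numbers.
# estimate_numeric_mb is a hypothetical helper, not part of any R package.
estimate_numeric_mb <- function(n_rows, n_cols) {
  bytes_per_double <- 8
  n_rows * n_cols * bytes_per_double / 1024^2
}

# Example: 10 million rows and 20 numeric columns.
needed_mb <- estimate_numeric_mb(1e7, 20)
cat(sprintf("Approximate in-memory size: %.0f MB\n", needed_mb))

# Cross-check against a small real object. object.size() reports the actual
# footprint, which includes a small header on top of the raw data.
x <- numeric(1e6)
print(object.size(x), units = "MB")
```

The example table comes to roughly 1.5 GB for the raw data alone, and model fitting usually needs additional working copies on top of that.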

The Microsoft R products try to solve this limitation of CRAN R in different ways across their editions. Let's look at the additional features of each Microsoft R product one by one.

Microsoft R Open

This product was previously known as Revolution R Open. Microsoft R Open (we will call this MRO henceforth) is a small improvement on CRAN R in primarily two respects. First, MRO uses the multithreaded Intel Math Kernel Library (MKL) for matrix manipulations such as inverse calculation, matrix multiplication and matrix decomposition. To use this, however, we need to install the MKL library; without it, CRAN R and MRO are the same in execution efficiency. Second, MRO provides a consistent, static set of R packages through a default CRAN repository. Using the checkpoint package, we can reproduce the same code again and again with a specific version of each R package. Other than these two points, MRO is the same as CRAN R.
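A minimal sketch of both points follows. The first part times the kind of dense matrix work that a multithreaded library such as MKL can accelerate; running it under CRAN R and under MRO with MKL installed gives a simple comparison. The matrix size and the snapshot date are arbitrary assumptions, and the checkpoint call is left commented out because it scans the project and installs packages.

```r
# Time a dense matrix multiplication and inversion. These are the kinds of
# operations that a multithreaded math library such as Intel MKL can speed up.
set.seed(42)
n <- 500
a <- matrix(rnorm(n * n), nrow = n)
b <- matrix(rnorm(n * n), nrow = n)

timing <- system.time({
  product <- a %*% b
  inverse <- solve(a)
})
print(timing["elapsed"])

# Pin package versions to a repository snapshot date with the checkpoint
# package. The date below is an arbitrary example, not a recommendation.
# library(checkpoint)
# checkpoint("2016-06-01")
```

The elapsed time depends entirely on the machine and the math library in use, so no expected figure is given here.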

Microsoft R Client

Microsoft R Client (we will call this MRC henceforth) is the first product in the line-up that enables parallel computing. Hence bigger datasets can be processed efficiently here, but only for some statistical functions. Microsoft (or rather Revolution Analytics, the original developer) built proprietary algorithms for those statistical calculations that can be parallelized. For example, calculating a mean or a variance can be parallelized easily, but association rule mining may not be easy to parallelize. Currently there are close to 80 proprietary functions in MRC that make parallel processing of data possible.

MRC is free software for Windows in which we can use the above proprietary functions. The names of these functions start with the prefix 'rx'. For example, glm() is the CRAN R function for fitting a generalized linear model, while rxGlm() does the same thing using parallelization. In MRC, however, parallelization can go only up to two threads.
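To make the glm() and rxGlm() comparison concrete, here is a minimal sketch that fits the same logistic regression both ways. The simulated data and variable names are made up for illustration. The rx functions live in the RevoScaleR package, which is not on CRAN, so that part runs only where the package is installed.

```r
# Simulate a small dataset with a binary outcome (illustrative values only).
set.seed(1)
n <- 1000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(0.5 + 1.2 * sim$x1 - 0.8 * sim$x2))

# CRAN R: single-threaded, in-memory fit.
fit_cran <- glm(y ~ x1 + x2, data = sim, family = binomial())
print(coef(fit_cran))

# Microsoft R: the same model through the proprietary rx function.
# The call is skipped when RevoScaleR is not available.
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  fit_rx <- RevoScaleR::rxGlm(y ~ x1 + x2, data = sim, family = binomial())
  print(fit_rx)
} else {
  message("RevoScaleR not installed; skipping the rxGlm() fit.")
}
```

On a dataset this small the two fits should agree closely and parallelization brings no visible benefit; the difference only matters as the data grows.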

Microsoft R Server

Microsoft R Server (we will call this MRS from now on) uses the same proprietary functions for parallelization, but it can process in more than two threads. It can also process data across multiple data nodes (i.e. computers). MRS is available on various platforms, such as R Server for Linux, R Server for Windows, R Server for Hadoop, R Server for Teradata DB and SQL Server R Services, which enable seamless R execution across different operating systems and databases.

This licensed product comes with support services, and we can also run R code as a standalone web service. It is possible to operationalize the MRS engine for multi-server topologies with clustered web nodes and compute nodes using the DeployR package.

A Diagrammatic Representation


Which Version of R Suits You


Below I summarize the points above and suggest the best R product for different scenarios. Hopefully this will help you decide.

Situation: Your data is small enough to fit in your machine's memory. You will primarily need to run some ad hoc analysis, which includes mostly statistical operations.
The Suggested R Product: Though there is no harm in going with MRO or MRC, make your life simple by just using CRAN R.

Situation: Your data is small enough to fit in your machine's memory. You primarily need to do statistical modeling. Your process needs to be repeated over time and you need to ensure a consistent result every time.
The Suggested R Product: Go for MRO, as it handles version control for the packages you use. It is not necessary to install Intel MKL.

Situation: Your data is small enough to fit in your machine's memory. Your analysis includes some higher order matrix manipulations.
The Suggested R Product: MRO is the best bet, with the Intel MKL library installed. There is no harm in using MRC, but you do not need it.

Situation: Your data is reasonably big, but can fit in your memory. You need to do some operations for which MS proprietary functions are available.
The Suggested R Product: MRC is the product for you if you are working in a Windows environment. Otherwise you need MRS or, if you want a free version, CRAN R or MRO.

Situation: Your data is reasonably big, but can fit in your memory. No MS proprietary function is available for the operations you need to perform.
The Suggested R Product: Use MRO or CRAN R. MRC or MRS will give the same speed and capacity.

Situation: Your data is huge and cannot fit in memory (around 25% of the total RAM used). You need to do some operations for which MS proprietary functions are available.
The Suggested R Product: MRS is the only option here, with a clustered server environment. You can use MRC as a front end platform to connect to MRS.

Situation: Your data is huge and resides in a database (SQL or Teradata or Hadoop).
The Suggested R Product: An MRS in-database run will be a good option.

Situation: You want to develop an analytical engine as a standalone service, where people will upload data and do various analyses.
The Suggested R Product: MRS is the best option here.
