Bringing the powers of SQL into R


(This article was first published on DataScience+, and kindly contributed to R-bloggers)

One of the big flaws of R is that the data you load are stored in memory (RAM) and not on disk. When you work on an analysis with large (big) data, the processing time of simple and more complex functions can become very long, or can even crash your computer. This is where SQL comes in: it is a powerful language designed to work with large databases and to perform simple operations on them (like subsetting, sorting …). It is particularly useful for exploring very large datasets and formatting the data for further analysis. There are many programs for database management using SQL. I decided to start with MySQL since it has an R package and is rather easy to set up (one could also use PostgreSQL …). In this post I will show you step by step how to create a database in MySQL, upload data into it from R, and then run some queries to see the power of SQL. Before I start, note that the data.table package was developed to perform fast operations on big data (have a look here).

Create a database

First you need to download MySQL from this website, or from Synaptic for Ubuntu users. Then open a shell window (type cmd for Windows users, terminal for Linux) and type this:

> mysql -p -u root

This will ask you for the password of the root user; if it worked you will see some text and a mysql> prompt appear. Then, if you don’t want to bother with different users and their rights, you can directly create a database using:

mysql> CREATE DATABASE intro_to_sql;

That’s it: you have created a database named intro_to_sql. At this point it is very important to remember that every time you are in the shell with mysql you need to end each statement with a semicolon (;), otherwise it will not run. You can look at all the databases on your system using:

mysql> show databases;

+--------------------+
| Database           |
+--------------------+
| information_schema |
| intro_to_sql       |
| mysql              |
| performance_schema |
+--------------------+
4 rows in set (0.00 sec)

Then we create a user having all rights on the intro_to_sql database:

mysql> GRANT ALL ON intro_to_sql.* TO 'user1'@'localhost' IDENTIFIED BY '12345';

Once a database and a user have been created, we no longer need the shell interface: everything else can be done from R.

Uploading datasets from R

We could directly create tables in the database from the shell interface but let’s see how to transfer data from R into the database:

library(RMySQL)
#connect to the database
con<-dbConnect(MySQL(),user='user1',password='12345',dbname='intro_to_sql')

#load some data
library(ggplot2)
data(diamonds)

#have a look at them
summary(diamonds)

     carat               cut        color        clarity          depth           table           price             x                y         
 Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
 1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
 Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80   Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
 Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
 3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
 Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
                                    J: 2808   (Other): 2531                                                                                    
       z         
 Min.   : 0.000  
 1st Qu.: 2.910  
 Median : 3.530  
 Mean   : 3.539  
 3rd Qu.: 4.040  
 Max.   :31.800  


#write the table into the database
dbWriteTable(con,"diamonds",diamonds_data)

#remove the dataset from R
rm(diamonds)

We now have one table named diamonds_data in our database.
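
As a quick sanity check (a sketch, assuming the connection con from above is still open), you can list the tables and count the rows:

#check that the table arrived
dbListTables(con)
dbGetQuery(con, "select count(*) from diamonds_data")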

Performing queries from R

Now that our intro_to_sql database has one table we can start playing with some SQL queries from R:

#count the number of diamonds priced above $2000
dbGetQuery(con,"select count(*) from diamonds_data where price>2000")

  count(*)
1    29733

#make a new data frame with diamonds of color ‘D’ and a depth less than 60%
subs<-dbGetQuery(con,"select * from diamonds_data where color='D' AND depth<60")
unique(subs$color)

[1] "D"

#make a new data frame only with the column x,y,z and order them by ascending x
subs<-dbGetQuery(con,"select x,y,z from diamonds_data order by x")
head(subs)

  x    y z
1 0 6.62 0
2 0 0.00 0
3 0 0.00 0
4 0 0.00 0
5 0 0.00 0
6 0 0.00 0

#from this dataset let’s create a new variable which is the mean of x,y,z
subs$Mean <- apply(subs, 1, mean)

#write the results in a new table
dbWriteTable(con,"XYZMean",subs)

#check that it has been created
dbListTables(con)

[1] "XYZMean"       "diamonds_data"

As you can see it is fairly easy to work with RMySQL, and there are many advantages to this tool: (i) all the power of SQL is at your command from within R, so it is easy to include it in your workflow (e.g. using the script window in RStudio …); (ii) there is no need to load big chunks of unprocessed data into R; SQL can process the data efficiently first (I did not cover how to load a table directly into a MySQL database; have a look here).

There are many helpful resources online about this topic; here are a few that I found interesting: A working guide to MySQL. A nice introduction to another SQL platform supported in R (SQLite). A blog post about the issue of big data in R.


Tutorial: Data Science with SQL Server R Services


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

You may have heard that R and the big-data RevoScaleR package have been integrated with SQL Server 2016 as SQL Server R Services. If you've been wanting to try out R with SQL Server but haven't been sure where to start, a new MSDN tutorial will take you through all the steps of creating a predictive model: from obtaining data for analysis, to building a statistical model, to creating a stored procedure to make predictions from the model. To work through the tutorial, you'll need a suitable Windows server on which to install the SQL Server 2016 Community Technology Preview, and make sure you have SQL Server R Services installed. You'll also need a separate Windows machine (say, a desktop or laptop) where you'll install Revolution R Open and Revolution R Enterprise. Most of the computations happen in SQL Server, though, so this "data science client machine" doesn't need to be as powerful.

The tutorial is made up of five lessons, which together should take you about 90 minutes to run through. If you run into problems, each lesson includes troubleshooting tips at the end.

Lesson 1 begins with downloading the New York City taxi data set (which was also used to create these beautiful data visualizations) and loading it into SQL Server. You'll also set up R to include some useful packages such as ggmap and RODBC.

Lesson 2 starts by having you verify the data using SQL queries. Don't miss the "Next Steps" links near the end, where you'll summarize the data using the RevoScaleR package on the data science client machine, and then visualize the data as a map with the ggmap package (as shown below).

[Figure: map of the NYC taxi data plotted with ggmap]

Lesson 3 focuses on using R to augment the data with new features, such as calculating the distance between pickup and dropoff points using a custom R function or using T-SQL. 
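
As a hedged illustration of the custom-function route (this is a generic haversine great-circle distance, not necessarily the tutorial's exact function; the name is a placeholder):

# generic haversine distance in miles; illustrative only
haversine_miles <- function(lat1, lon1, lat2, lon2) {
  rad <- pi / 180
  a <- sin((lat2 - lat1) * rad / 2)^2 +
    cos(lat1 * rad) * cos(lat2 * rad) * sin((lon2 - lon1) * rad / 2)^2
  2 * 3959 * asin(sqrt(a))  # 3959 = Earth's radius in miles
}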

Lesson 4 is where you'll use the rxLogit function to train a logistic regression model to predict the probability of a driver receiving a tip for a ride, evaluate the model using ROC curves, and then deploy the prediction into SQL Server as a T-SQL stored procedure.
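
A hedged sketch of the training step (rxLogit takes a formula and a data source; the feature names and data source below are illustrative placeholders, not necessarily the tutorial's exact columns):

# sketch: column names and featureDataSource are placeholders
logitObj <- rxLogit(tipped ~ passenger_count + trip_distance + trip_time_in_secs,
                    data = featureDataSource)
summary(logitObj)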

Lesson 5 wraps things up by showing how to use the deployed model in a production environment, both by calculating predictions from a stored dataset in batch mode, and by performing transactional predictions one trip at a time.

To save on cutting-and-pasting, you can find all of the code used in the tutorial on Github. Give it a go, and before long you'll have your own R models running live in SQL Server.

MSDN: End-to-End Data Science Walkthrough: Overview (SQL Server R Services)


IBM DataScientistWorkBench = OpenRefine + RStudio + Jupyter Notebooks in the Cloud, Via Your Browser


(This article was first published on OUseful.Info, the blog... » Rstats, and kindly contributed to R-bloggers)

One of the many things on my “to do” list is to put together a blogged script that wires together RStudio, a Jupyter notebook server, Shiny Server, OpenRefine, PostgreSQL and MongoDB containers, and perhaps data extraction services like Apache Tika or Tabula and a few OpenRefine-style reconciliation services, along with a common shared data container, so the whole lot can be launched on Digital Ocean at a single click to provide a data wrangling playspace with all sorts of application goodness to hand.

(Actually, I think I had a script that was more or less there for chunks of that when I was looking at a docker solution for the databases courses, but that fell by the wayside, and I suspect the Jupyter container (IPython notebook server, as was) probably needs a fair bit of updating by now. And I’ve no time or mental energy to look at it right now… :-( )

Anyway, the IBM Data Scientist Workbench now sits alongside things like KMi’s longstanding Crunch Learning Analytics Environment (RStudio + MySQL) and the Australian ResBaz Cloud – Containerised Research Apps Service in my list of “why the heck can’t we get our act together to offer this sort of SaaS thing to learners?” And yes, I know there are cost implications… but, erm, sponsorship, cough… get-started tokens then PAYG, cough…

It currently offers access to personal persistent storage and the ability to launch OpenRefine, RStudio and Jupyter notebooks:

[Screenshot: the Data Scientist Workbench application launcher]

The toolbar also suggests that the ability to “discover” pre-identified data sources and run pre-configured modelling tools is on the cards.

The applications themselves run off a subdomain tied to your account – and of course, they’re all available through the browser…

[Screenshot: RStudio and OpenRefine running in the browser from the Data Scientist Workbench]

So what’s next? I’d quite like to see ‘data import packs’ that would allow me to easily pull in data from particular sources, such as the CDRC, and quickly get started working with the data. (And again: yes, I know, I could start doing that anyway… maybe when I get round to actually doing something with isleofdata.com ?!;-)

See also these recipes for running app containers on Digital Ocean via Tutum: RStudio, Shiny server, OpenRefine and OpenRefine reconciliation services, and these Seven Ways of Running IPython / Jupyter Notebooks.


Integrating Python and R Part III: An Extended Example


(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

By Chris Musselle

This is the third post in a three-part series in which I have explored the options available for including both R and Python in a data analysis pipeline. See post one for some reasons why you may wish to do this, and for details of a general strategy involving flat files. Post two expands on this by showing how R or Python processes can call each other and pass arguments between them.

In this post I will share a longer example that uses these approaches in an analysis we carried out at Mango as a proof of concept for clustering news articles. The pipeline used both R and Python at different stages, with a Python script called from R to fetch the data, and the exploratory analysis conducted in R.

Full implementation details can be found in the repository on github here; for brevity, this article focuses on the core concepts, with the parts most relevant to R and Python integration discussed below.

Document Clustering

We were interested in the problem of document clustering of live published news articles, and specifically, wished to investigate times when multiple news websites were talking about the same content. As a first step towards this, we looked at sourcing live articles via RSS feeds, and used text mining methods to preprocess and cluster the articles based on their content.

Sourcing News Articles From RSS Feeds

There are some great Python tools out there for scraping and sourcing web data, and so for this task we used a combination of feedparser, requests, and BeautifulSoup to process the RSS feeds, fetch web content, and extract the parts we were interested in. The general code structure was as follows:

# fetch_RSS_feed.py
import sys

def get_articles(feed_url, json_filename='articles.json'):
    """Update JSON file with articles from RSS feed"""
    #
    # See github link for full function script
    #

if __name__ == '__main__':

    # Pass Arguments
    args = sys.argv[1:]
    feed_url = args[0]
    filepath = args[1]

    # Get the latest articles and append to the JSON file given
    get_articles(feed_url, filepath)

Here we can see that the get_articles function is defined to perform the bulk of the data sourcing tasks, and that the parameters passed to it are the positional arguments from the command line. Within get_articles, the URL, publication date, title and text content were then extracted for each article in the RSS feed and stored in a JSON file. For each article, the text content was made up of all HTML paragraph tags within the news article.

Sidenote: The if __name__ == "__main__": line may look strange to non-Python programmers, but this is a common way in Python scripts to control which sections of the code are run when the whole script is executed, versus when the script is imported by another Python script. If the script is executed directly (as is the case when it is called from R later), the if statement evaluates to true and all the code is run. If, however, I wanted to reuse get_articles in another Python script at some point in the future, I could now import that function from this script without triggering the code within the if statement.

The above Python script was then executed from within R by defining the utility function shown below. Note that by using stdout=TRUE, any messages printed to stdout with print() in the Python code can be captured and displayed in the R console.

fetch_articles <- function(url, filepath) {

  command <- "python"
  path2script <- '"fetch_RSS_feed.py"'

  args <- c(url, filepath)
  allArgs <- c(path2script, args)

  output <- system2(command, args = allArgs, stdout = TRUE)
  print(output)
}
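
A call from R then looks like this (the BBC feed URL here is just an illustrative example):

fetch_articles("http://feeds.bbci.co.uk/news/rss.xml", "articles.json")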

Loading Data into R

Once the data had been written to a JSON file, the next job was to get it into R to be used with the tm package for text mining. This proved a little trickier than first expected, however, as the tm package is mainly geared around reading in documents from raw text files, or directories containing multiple text files. To convert the JSON file into the expected VCorpus object for tm I used the following:

library(tm)        # VCorpus, DataframeSource
library(jsonlite)  # fromJSON (assumed; the original may have used another JSON package)

load_json_file <- function(filepath) {

  # Load data from JSON
  json_file <- file(filepath, "rb", encoding = "UTF-8")
  json_obj <- fromJSON(json_file)
  close(json_file)

  # Convert to VCorpus
  bbc_texts <- lapply(json_obj, FUN = function(x) x$text)
  df <- as.data.frame(bbc_texts)
  df <- t(df)
  articles <- VCorpus(DataframeSource(df))
  articles
}

Unicode Woes

One potential problem when manipulating text data from a variety of sources and passing it between languages is that you can easily get tripped up by character encoding errors en route. We found that by default Python was able to read in, process and write out the article content from the HTML sources, but R was struggling to decode certain characters that were written out to the resulting JSON file. This is due to the languages using or expecting a different character encoding by default.

To remedy this, you should be explicit in the encoding you are using when writing and reading a file, by specifying it when opening a file connection. This meant using the following in Python when writing out to a JSON file,

# Write updated file.
with open(json_filename, 'w', encoding='utf-8') as json_file:
    json.dump(JSON_articles, json_file, indent=4)

and on the R side opening the file connection was as follows:

# Load data from JSON
json_file <- file(filepath, "rb", encoding = "UTF-8")
json_obj <- fromJSON(json_file)
close(json_file)

Here “UTF-8” Unicode is chosen as it is a good default encoding to use, and is the most popular encoding used in HTML documents worldwide.

For more details on Unicode and ways of handling it in Python 2 and 3 see Ned Batchelder’s PyCon talk here.

Summary of Text Preprocessing and Analysis

The text preprocessing part of the analysis consisted of the following steps, which were all carried out using the tm package in R:

  • Tokenisation – Splitting text into words.
  • Punctuation and whitespace removal.
  • Conversion to lowercase.
  • Stemming – to consolidate different word endings.
  • Stopword removal – to ignore the most common and therefore least informative words.
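
As a minimal sketch (assuming the VCorpus from load_json_file above; the exact arguments in the full code on github may differ), these steps map onto tm calls along these lines:

#sketch of the preprocessing steps with tm
articles <- tm_map(articles, content_transformer(tolower))
articles <- tm_map(articles, removePunctuation)
articles <- tm_map(articles, stripWhitespace)
articles <- tm_map(articles, removeWords, stopwords("english"))
articles <- tm_map(articles, stemDocument)  # stemming via the SnowballC package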

Once cleaned and processed, the Term Frequency-Inverse Document Frequency (TF-IDF) statistic was calculated for the collection of articles. This statistic aims to provide a measure of how important each word is for a particular document, across a collection of documents. It is more sophisticated than just using the word frequencies themselves, as it takes into account that some words naturally occur more frequently than others across all documents.
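
With tm this weighting can be applied when building the document-term matrix; a sketch, assuming the cleaned corpus from above:

#TF-IDF weighted document-term matrix
dtm <- DocumentTermMatrix(articles, control = list(weighting = weightTfIdf))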

Finally a distance matrix was constructed based on the TF-IDF values and hierarchical clustering was performed. The results were then visualised as a dendrogram using the dendextend package in R.
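
A sketch of that final stage (the dendextend colouring used for the figures below is more involved in the full code):

#distance matrix, hierarchical clustering and a basic dendrogram
m <- as.matrix(dtm)
d <- dist(m)
hc <- hclust(d)

library(dendextend)
dend <- as.dendrogram(hc)
plot(dend)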

An example of the clusters formed from 475 articles published over the last 4 days is shown below where the leaf nodes are coloured according to their source, with blue corresponding to BBC News, green to The Guardian, and indigo to The Independent.

[Figure: dendrogram of clusters from 475 articles, leaf nodes coloured by news source]

It is interesting here to see articles from the same news websites occasionally forming groups, suggesting that news websites often post multiple articles with similar content, which is plausible considering how news stories unfold over time.

What’s more interesting is finding clusters where multiple news websites are talking about similar things. Below is one such cluster with the article headlines displayed, which mostly relate to the recent flooding in Cumbria.

[Figure: cluster of article headlines relating to the Cumbria flooding]

Hierarchical clustering is often a useful step in exploratory data analysis, and this work gives some insight into what is possible with news article clustering from live RSS feeds. Future work will look to evaluate different clustering approaches in more detail by examining the quality of the clusters they produce.

Other Approaches

In this series we have focused on describing the simplest approach of using flat files as an intermediate storage medium between the two languages. However, it is worth briefly mentioning several other options that are available, such as:

  • Using a database, such as SQLite, as a medium of storage instead of flat files.
  • Passing the results of a script execution in memory instead of writing to an intermediate file.
  • Running two persistent R and Python processes at once, and passing data between them. Libraries such as rpy2 and rPython provide one such way of doing this.

Each of these methods brings with it some additional pros and cons, and so the question of which is most suitable is often dependent on the application itself. As a first port of call though, using common flat file formats is a good place to start.

Summary

This post gave an extended example of how Mango have been using both Python and R to perform exploratory analysis around clustering news articles. We used the flat-file air gap strategy described in the first post in this series, and then automated the calling of Python from R by spawning a separate subprocess (described in the second post). As can be seen, with a bit of care around character encodings, this provides a straightforward approach to “bridging the language gap”, and allows multiple skillsets to be utilised when performing a piece of analysis.


A use of gsub, reshape2 and sqldf with healthcare data


(This article was first published on DataScience+, and kindly contributed to R-bloggers)

Building off other industry-specific posts, I want to use healthcare data to demonstrate the use of R packages. The data can be downloaded here. To read the .CSV file into R, you might read the post on how to import data in R. Packages in R are stored in libraries and are often pre-installed, but reaching the next level of skill requires knowing when to use new packages and what they contain. With that, let’s get to our example.

gsub

When working with vectors and strings, especially when cleaning up data, gsub makes things much simpler. In my healthcare data, I wanted to convert dollar values to integers (i.e. $21,000 to 21000), and I used gsub as seen below.

First, read the data into R from the CSV file. I am naming the dataset “hosp”.

hosp <- read.csv("Payment_and_value_of_care_-_Hospital.csv")

In the code below I will remove hospitals without estimates:

hospay <- hosp[hosp$Payment.category != "Not Available" & hosp$Payment.category != "Number of Cases Too Small", ]

Now it’s time to remove the dollar signs and commas in the estimate values:

hospay$Payment <- as.numeric(gsub("[$,]", "", hospay$Payment))
hospay$Lower.estimate <- as.numeric(gsub("[$,]", "", hospay$Lower.estimate))
hospay$Higher.estimate <- as.numeric(gsub("[$,]", "", hospay$Higher.estimate))

head(hospay$Payment)
[1] 13469 12863 12308 12222 21376 14740

reshape2

In looking at the data, I wanted to focus on the payment estimate, so I used the melt() function that is part of reshape2. melt allows pivot-table-style capabilities to restructure data without losing values.

library(reshape2)
hosp_melt <- melt(data = hospay, id = c(2, 5, 9, 11), measure = 13, value.name = 'Estimate')

names(hosp_melt)
[1] "Hospital.name"        "State"                "Payment.measure.name" "Payment.category"     "variable"             "Estimate" 

sqldf

With my data melted, I wanted to get the average estimate for heart attack patients by state. This is a classic SQL query, so bringing in sqldf allows for that.

library(sqldf)
names(hosp_melt)[3] <- "paymentmeasurename"
hosp_est <- sqldf("select State, avg(Estimate) as Estimate 
from hosp_melt 
where paymentmeasurename = 'Payment for heart attack patients' 
group by State")

head(hosp_est)
   State  Estimate
1     AK  20987.60
2     AL  21850.32
3     AR  21758.00
4     AZ  22690.62
5     CA  22707.45
6     CO  21795.30 

If you have any question feel free to leave a comment below.


R is the fastest-growing language on StackOverflow


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

StackOverflow is a popular Q&A site, and a go-to resource for developers of all languages to find answers to programming problems they may have: most of the time, the question has already been asked and answered, or you can always post a new question and wait for a reply. It's an excellent resource for R users, featuring answers to nearly 100,000 R questions. In fact, R is the fastest-growing language on StackOverflow in terms of the number of questions asked:

[Figure: subway-style rank chart of programming language popularity on StackOverflow over time]

The chart above was created — in R, of course — by Joshua Kunst, who helpfully provided the R code to make this subway-style rank plot using the ggplot2 package. The "fastest-growing" claim is based on Joshua's regression analysis of the data above: R's trendline has a slope of 4.50. (RedMonk also uses StackOverflow data, combined with GitHub activity, for their bi-annual language popularity rankings. R was ranked #13 in their most recent analysis, in June 2015.) The data comes directly from the StackExchange data dump, loaded into an SQLite database and processed in R using Hadley Wickham's RSQLite package. There's much more interesting analysis of the StackOverflow data in Joshua's blog post, including a cluster analysis of the top 100 tags in StackOverflow. Check it out at the link below.
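
For reference, reading such a dump with RSQLite follows the standard DBI pattern; a hedged sketch (the file and table names here are assumptions, not necessarily the ones Joshua used):

# sketch: file and table names are assumptions
library(RSQLite)
con <- dbConnect(SQLite(), "stackoverflow.sqlite")
posts <- dbGetQuery(con, "SELECT CreationDate, Tags FROM Posts")
dbDisconnect(con)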

Joshua Kunst: What do we ask in StackOverflow?

 


Our R package roundup


(This article was first published on Opiate for the masses, and kindly contributed to R-bloggers)

A year in review

It’s the time of the year again when one eats too much and gets in a reflective mood! 2015 is nearly over, and we bloggers here at opiateforthemass.es thought it would be nice to argue endlessly over which R package was the best/neatest/most fun/most useful/most whatever of this year!

Since we are in a festive mood, we decided we would not fight it out but rather present our top five new R packages, a purely subjective list of packages we (and Chuck Norris) approve of.


But do not despair, dear reader! We have also pulled hard data on R package popularity from CRAN, and will present this first.

Top Popular CRAN packages

Let’s start with some factual data before we go into our personal favourites of 2015. We’ll pull the titles of the new 2015 R packages from cranberries, and parse the CRAN downloads per day using the cranlogs package.

Using downloads per day as a ranking metric could have the problem that earlier package releases have had more time to create a buzz and shift up the average downloads per day, skewing the data in favour of older releases. Or it could have the complication that younger package releases are still on the early “hump” part of the downloads curve (let’s assume they’ll follow a log-normal (exponential decay) distribution, which most of these things do), thus skewing the data in favour of younger releases. I don’t know, and this is an interesting question I think we’ll tackle in a later blog post…

For now, let’s just assume that average downloads per day is a relatively stable metric to gauge package success with. We’ll grab the packages released using rvest:

library(rvest)  # read_html, html_nodes, html_text (also provides the %>% pipe)

berries <- read_html("http://dirk.eddelbuettel.com/cranberries/2015/")
titles <- berries %>% html_nodes("b") %>% html_text
new <- titles[grepl("^New package", titles)] %>%
  gsub("^New package (.*) with initial .*", "\\1", .) %>% unique

and then lapply() over these titles, querying the CRAN logs for each package and computing its average downloads per day:

library(pbapply)   # pblapply
library(cranlogs)  # cran_downloads

logs <- pblapply(new, function(x) {
  down <- cran_downloads(x, from = "2015-01-01")$count
  if (sum(down) > 0) {
    public <- down[which(down > 0)[1]:length(down)]
  } else {
    public <- 0
  }
  return(data.frame(package = x, sum = sum(down), avg = mean(public)))
})

logs <- do.call(rbind, logs)

With some quick dplyr and ggplot magic, these are the top 20 new CRAN packages from 2015, by average number of daily downloads:

[Figure: top 20 new CRAN packages in 2015, by average number of daily downloads]

The full code is available on github, of course.
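
For reference, the “magic” boils down to something like this sketch (the full version on github adds proper labels and theming):

library(dplyr)
library(ggplot2)

top20 <- logs %>% arrange(desc(avg)) %>% head(20)
ggplot(top20, aes(x = reorder(package, avg), y = avg)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = NULL, y = "average downloads per day")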

As we can see, the main bias does not come from our choice of ranking metric, but from the fact that some packages are more “under the hood” and are pulled in by many other packages as dependencies, thus inflating their download statistics.

The top four packages (rversions, xml2, git2r, praise) are all technical packages. Although I have to say I did not know of praise before, and it looks like a very fun package indeed: you can automatically add randomly generated praise to your output! Fun times ahead, I’d say.

Excluding these, the clear winners among “frontline” packages are readxl and readr, both packages by Hadley Wickham dealing with importing data into R. Well deserved, in our opinion: these are packages nearly everybody working with data will need on a daily basis. Although one hopes that contact with Excel sheets is kept to a minimum to preserve one’s sanity, so that readxl is needed less often in daily life!

The next two packages (DiagrammeR and visNetwork) relate to network diagrams, something that seems to be en vogue currently. R is getting some much-needed features in this area, it seems.

plotly is the R interface to the recently open-sourced and popular plot.ly JavaScript library for interactive charts. A well-deserved top-ranking entry! We also see packages that build on and improve the ever-popular shiny package (DT and shinydashboard), leaflet dealing with interactive mapping, and packages around Stan, the Bayesian statistical inference language (rstan, StanHeaders).

But now, this blog’s authors’ personal top five of new R packages for 2015:

readr

(safferli’s pick)

readr is our package pick that also made it into the top downloads metric above. Small wonder, as it’s written by Hadley and aims to make importing data easier and, especially, more consistent. It is thus immediately useful for most, if not all, R users out there, and also received a tremendous “fame kickstart” from Hadley’s reputation within the R community. For extremely large datasets I still like to use data.table’s fread() function, but for anything else the new read_* functions make your life considerably easier. They’re faster than base R, and no longer having to worry about stringsAsFactors alone is a godsend.
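
For example (a sketch; the file name is a placeholder):

library(readr)
df <- read_csv("my_data.csv")  # fast, consistent column types, and strings stay strings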

Since the package is written by Hadley, it is not only great but also comes with fantastic documentation. If you’re not using readr yet, you should head over to the package readme and check it out.

infuser

(Yuki’s pick)

R already has many template engines, but this one is simple yet quite useful if you do data exploration, visualization and statistics in R and deploy your findings in Python, while using the same SQL queries with as similar a syntax as possible.

Code transition from R to Python is quick and easy with infuser, like this:

# R
library(infuser)
template <- "SELECT {{var}} FROM {{table}} WHERE month = {{month}}"
query <- infuse(template,var="apple",table="fruits",month=12)
cat(query)
# SELECT apple FROM fruits WHERE month = 12
# Python
template = "SELECT {var} FROM {table} WHERE month = {month}"
query = template.format(var="apple",table="fruits",month=12)
print(query)
# SELECT apple FROM fruits WHERE month = 12

googlesheets

(Kirill’s pick)

googlesheets by Jennifer Bryan finally allows me to output directly to Google Sheets, instead of writing xlsx files and then pushing them (mostly manually) to Google Drive. At our company we use Google Drive as a data communication and storage tool for the management, so outputting data science results to Google Sheets is important. We even have some small reports stored in Google Sheets. The package allows for easy creating, finding, filling, and reading of Google Sheets with an incredible simplicity of use.
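
A sketch of the basic flow (the sheet title is a placeholder; on first use the package walks you through Google authentication in the browser):

library(googlesheets)
ss <- gs_new("ds_results", input = head(iris))  # create a sheet from a data frame
gs_read(ss)                                     # and read it back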

AnomalyDetection

(Kirill’s second pick. He gets to pick two since he is so indecisive)

AnomalyDetection was developed by Twitter’s data scientists and introduced to the open source community in the first week of the year. It is a very handy, beautiful, well-developed tool for finding anomalies in data. It is very important for a data scientist to be able to find anomalies in the data quickly and reliably, before real damage occurs. The package gives you a good first impression of what is going on in your KPIs (Key Performance Indicators) so you can react quickly. Building alerts with it is a no-brainer if you want to monitor your data and assure data quality.
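
A short sketch along the lines of the package’s own example (the raw_data dataset ships with the package):

library(AnomalyDetection)
data(raw_data)
res <- AnomalyDetectionTs(raw_data, max_anoms = 0.02, direction = 'both', plot = TRUE)
res$plot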

emoGG

(Jess’s pick)

emoGG definitely falls into the “most whatever” category of R package of the year. What this package does is fairly simple: it allows you to display emojis in your ggplot2 plots, either as plotting symbols or as a background. Under the hood, it adds a geom_emoji layer to your ggplot2 plots, in which you specify one or more emoji codes corresponding to the emojis you wish to plot. emoGG can be used to make visualisations more compelling and help plots convey more meaning, no doubt. But before anything else, it’s fun, and a must-have for an avid emoji fan like me.
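
A sketch along the lines of the package’s readme (“1f337” is the tulip emoji code):

# emoGG lives on github: devtools::install_github("dill/emoGG")
library(emoGG)
library(ggplot2)
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_emoji(emoji = "1f337")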

Our R package roundup was originally published by Kirill Pomogajko at Opiate for the masses on December 30, 2015.


7 new R jobs from around the world (2015-12-31)


This is the bi-monthly R-bloggers post (for 2015-12-31) for new R Jobs.

To post your R job on the next post

Just visit this link and post a new R job to the R community (it’s free and quick).

New R jobs

  1. Part-Time
    Content Development Intern ($20/hour) @ Cambridge, Massachusetts, U.S.
    DataCamp – Posted by nickc123
    Cambridge
    Massachusetts, United States
    22 Dec 2015
  2. Full-Time
    Data Scientist @ Billerica, Massachusetts, U.S.
    MilliporeSigma, Inc. – Posted by andreaduda
    Billerica
    Massachusetts, United States
    22 Dec 2015
  3. Freelance
    Data Science Course Mentor – Remote/Flexible
    Springboard – Posted by Parul Gupta
    Anywhere
    21 Dec 2015
  4. Full-Time
    Computational Analyst / Bioinformatician @ Cambridge, MA, U.S.
    Boston Children’s Hospital, Dana-Farber Cancer Institute, Broad Institute – Posted by julirsch
    Cambridge
    Massachusetts, United States
    20 Dec 2015
  5. Freelance
    Consultant/Tutor for R and SQL
    Logistics Capital & Strategy – Posted by Edoody
    Anywhere
    19 Dec 2015
  6. Full-Time
    Data Scientist – Predictive Analyst @ Harrisburg, Pennsylvania, US
    Manada Technology LLC – Posted by manadatechnology
    Harrisburg
    Pennsylvania, United States
    18 Dec 2015
  7. Freelance
    Big Data in Digital Health (5-10 hours per week)
    MedStar Institute for Innovation – Posted by Praxiteles
    Anywhere
    18 Dec 2015

 

Job seekers: please follow the links in the listings above to learn more and apply for your job of interest.

(On R-users.com you may see all the R jobs that are currently available.)


(you may also look at previous R jobs posts).


The R Project: 2015 in Review


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

It’s been a banner year for the R project in 2015, with frequent new releases, ever-growing popularity, a flourishing ecosystem, and accolades from both users and press. Here’s a roundup of the big events for R from 2015. 

R continues to advance under the new leadership of the R Foundation. There were five updates in 2015: R 3.1.3 in March, R 3.2.0 in April, R 3.2.1 in June, R 3.2.2 in August, and R 3.2.3 in December. That’s an impressive release rate, especially for a project that’s been in active development for 18 years!

R’s popularity continued unabated in 2015. R is the most popular language for data scientists according to the 2015 Rexer survey, and the most popular Predictive Analytics / Data Mining / Data Science software in the KDnuggets software poll. While R’s popularity amongst data scientists is no surprise, R ranked highly even amongst general-purpose programming languages. In July, R placed #6 in the IEEE list of top programming languages, rising 3 places from its 2014 ranking. It also continues to rank highly amongst StackOverflow users, where it is the 8th most popular language by activity, and the fastest-growing language by number of questions. R was also a top-ranked language on GitHub in 2015.

The R Consortium, a trade group dedicated to the support and growth of the R community, was founded in June. Already, the group has published best practices for secure use of R, and formed the Infrastructure Steering Committee to fund and oversee community projects. Its first project (a hub for R package developers) was funded in November, and proposals are being accepted for future projects.

2015 was the year that Microsoft put its weight behind R, beginning with the acquisition of Revolution Analytics in April and prominent R announcements at the BUILD Conference in May. Microsoft continues the steady pace of open-source R project releases, with regular updates to Revolution R Open, DeployR Open, and the foreach and checkpoint packages. Revolution R Enterprise saw updates, and new releases of several Microsoft platforms have integrated R, including SQL Server 2016, Cortana Analytics, Power BI, Azure and the Data Science Virtual Machine.

Activity within local R user groups accelerated in 2015, with 18 new groups founded for a total of 174. Microsoft expanded its R user group sponsorship with the Microsoft Data Science User Group Program. Community conferences also boasted record attendance, including at useR! 2015, R/Finance, EARL Boston, and EARL London. Meanwhile, companies including Betterment, Zillow, Buzzfeed, the New York Times and many others shared how they benefit from R.

R also got some great coverage in the media this year, with features in Priceonomics, TechCrunch, Nature, Inside BigData, Mashable, The Economist, opensource.com and many other publications.

That’s a pretty big year … and we expect even more from R in 2016. A big thanks goes out to everyone in the R community, and especially the R Core group, for making R the standout success it is today. Happy New Year!


Happy New Year! Top posts of 2015


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Happy New Year everyone! It's hard to believe that this blog has now been going since 2008: our anniversary was on December 9. Thanks to everyone who has supported this blog over the past 7 years by reading, sharing and commenting on our posts, and an extra special thanks to my co-bloggers Joe Rickert and Andrie de Vries and all the guest bloggers from Microsoft and elsewhere that have contributed this year.

2015 was a busy year for the blog, with an 8% increase in users and a 13% increase in page views compared to 2014. The most popular posts of the year, starting with the most popular, were:

That's all from us from the team here at Revolutions for this week, and indeed for this year! We'll be back in the New Year with more news, tips and tricks about R, but in the meantime we'll let R have the last word thanks to some careful seed selection by Berry Boessenkool:

> set.seed(31612310)
> paste0(sample(letters,5,T))
[1] "h" "a" "p" "p" "y"
> set.seed(12353)
> sample(0:9,4,T)
[1] 2 0 1 6


GO analysis using clusterProfiler


(This article was first published on R on G. Yu, and kindly contributed to R-bloggers)

clusterProfiler supports over-representation tests and gene set
enrichment analysis of Gene Ontology terms. It supports GO annotation from
an OrgDb object, a GMT file, or the user’s own data.

Support for many species

In the GitHub version of clusterProfiler, the enrichGO and gseGO functions
have removed the parameter organism and added a new parameter OrgDb, so
that any species with an available OrgDb object can be analyzed in
clusterProfiler. Bioconductor already provides OrgDb objects for about
20 species, see
http://bioconductor.org/packages/release/BiocViews.html#___OrgDb, and
users can build an OrgDb via AnnotationHub.

library(AnnotationHub)
hub <- AnnotationHub()

## snapshotDate(): 2015-12-29

query(hub, "Cricetulus")

## AnnotationHub with 4 records
## # snapshotDate(): 2015-12-29 
## # $dataprovider: UCSC, Inparanoid8, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
## # $species: Cricetulus griseus
## # $rdataclass: ChainFile, Inparanoid8Db, OrgDb, TwoBitFile
## # additional mcols(): taxonomyid, genome, description, tags,
## #   sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH10393"]]' 
## 
##             title                             
##   AH10393 | hom.Cricetulus_griseus.inp8.sqlite
##   AH13980 | criGri1.2bit                      
##   AH14346 | criGri1ToHg19.over.chain.gz       
##   AH48061 | org.Cricetulus_griseus.eg.sqlite

Cgriseus <- hub[["AH48061"]]

## loading from cache '/Users/guangchuangyu/.AnnotationHub/54367'

Cgriseus

## OrgDb object:
## | DBSCHEMAVERSION: 2.1
## | DBSCHEMA: NOSCHEMA_DB
## | ORGANISM: Cricetulus griseus
## | SPECIES: Cricetulus griseus
## | CENTRALID: GID
## | Taxonomy ID: 10029
## | Db type: OrgDb
## | Supporting package: AnnotationDbi

## 
## Please see: help('select') for usage information

sample_gene <- sample(keys(Cgriseus), 100)
str(sample_gene)

##  chr [1:100] "100762355" "100757285" "100773870" "100766902" ...

library(clusterProfiler)

sample_test <- enrichGO(sample_gene, OrgDb=Cgriseus, pvalueCutoff=1, qvalueCutoff=1)
head(summary(sample_test))

##                    ID                                 Description
## GO:0004983 GO:0004983            neuropeptide Y receptor activity
## GO:0005254 GO:0005254                   chloride channel activity
## GO:0005496 GO:0005496                             steroid binding
## GO:0005253 GO:0005253                      anion channel activity
## GO:0015108 GO:0015108 chloride transmembrane transporter activity
## GO:0019887 GO:0019887           protein kinase regulator activity
##            GeneRatio BgRatio     pvalue  p.adjust    qvalue    geneID
## GO:0004983      1/20  6/3946 0.03004660 0.6187407 0.6138746 100773047
## GO:0005254      1/20  6/3946 0.03004660 0.6187407 0.6138746 100773701
## GO:0005496      1/20  6/3946 0.03004660 0.6187407 0.6138746 100689048
## GO:0005253      1/20  8/3946 0.03987010 0.6187407 0.6138746 100773701
## GO:0015108      1/20  8/3946 0.03987010 0.6187407 0.6138746 100773701
## GO:0019887      1/20 12/3946 0.05923425 0.6187407 0.6138746 100763034
##            Count
## GO:0004983     1
## GO:0005254     1
## GO:0005496     1
## GO:0005253     1
## GO:0015108     1
## GO:0019887     1

Support for many ID types

The input ID type can be any type supported by the OrgDb object.

library(org.Hs.eg.db)
data(geneList)
gene <- names(geneList)[abs(geneList) > 2]
gene.df <- bitr(gene, fromType = "ENTREZID", 
        toType = c("ENSEMBL", "SYMBOL"),
        OrgDb = org.Hs.eg.db)

## 'select()' returned 1:many mapping between keys and columns

## Warning in bitr(gene, fromType = "ENTREZID", toType = c("ENSEMBL",
## "SYMBOL"), : 0.48% of input gene IDs are fail to map...

head(gene.df)

##   ENTREZID         ENSEMBL SYMBOL
## 1     4312 ENSG00000196611   MMP1
## 2     8318 ENSG00000093009  CDC45
## 3    10874 ENSG00000109255    NMU
## 4    55143 ENSG00000134690  CDCA8
## 5    55388 ENSG00000065328  MCM10
## 6      991 ENSG00000117399  CDC20

ego <- enrichGO(gene          = gene,
                universe      = names(geneList),
                OrgDb         = org.Hs.eg.db,
                ont           = "CC",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05)
head(summary(ego))

##                    ID                              Description GeneRatio
## GO:0005819 GO:0005819                                  spindle    24/197
## GO:0005876 GO:0005876                      spindle microtubule    11/197
## GO:0000793 GO:0000793                     condensed chromosome    17/197
## GO:0000779 GO:0000779 condensed chromosome, centromeric region    13/197
## GO:0005875 GO:0005875           microtubule associated complex    14/197
## GO:0015630 GO:0015630                 microtubule cytoskeleton    36/197
##              BgRatio       pvalue     p.adjust       qvalue
## GO:0005819 222/11632 3.810608e-13 1.276554e-10 1.139171e-10
## GO:0005876  45/11632 1.527089e-10 2.557874e-08 2.282596e-08
## GO:0000793 150/11632 5.838332e-10 6.519471e-08 5.817847e-08
## GO:0000779  81/11632 8.684319e-10 7.273117e-08 6.490386e-08
## GO:0005875 109/11632 3.936298e-09 2.637319e-07 2.353492e-07
## GO:0015630 765/11632 1.719925e-08 9.602916e-07 8.569452e-07
##                                                                                                                                                                                                       geneID
## GO:0005819                                                                   55143/991/9493/1062/259266/9787/220134/51203/22974/4751/983/4085/81930/332/3832/7272/9212/9055/3833/146909/10112/6790/891/24137
## GO:0005876                                                                                                                                       220134/51203/983/81930/332/3832/9212/9055/146909/6790/24137
## GO:0000793                                                                                                       1062/10403/7153/23397/55355/220134/4751/79019/55839/54821/4085/332/64151/9212/1111/6790/891
## GO:0000779                                                                                                                             1062/10403/55355/220134/4751/79019/55839/54821/4085/332/9212/6790/891
## GO:0005875                                                                                                                        55143/9493/1062/81930/332/3832/9212/3833/146909/10112/6790/24137/4137/7802
## GO:0015630 8318/55143/991/9493/1062/9133/7153/259266/55165/9787/220134/51203/22974/10460/4751/983/54821/4085/81930/332/3832/7272/64151/9212/1111/9055/3833/146909/10112/51514/6790/891/24137/26289/4137/7802
##            Count
## GO:0005819    24
## GO:0005876    11
## GO:0000793    17
## GO:0000779    13
## GO:0005875    14
## GO:0015630    36

ego2 <- enrichGO(gene         = gene.df$ENSEMBL,
                OrgDb         = org.Hs.eg.db,
                keytype       = 'ENSEMBL',
                ont           = "CC",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05)
head(summary(ego2))

##                    ID                              Description GeneRatio
## GO:0005819 GO:0005819                                  spindle    28/220
## GO:0005875 GO:0005875           microtubule associated complex    19/220
## GO:0005876 GO:0005876                      spindle microtubule    12/220
## GO:0015630 GO:0015630                 microtubule cytoskeleton    43/220
## GO:0005874 GO:0005874                              microtubule    26/220
## GO:0000779 GO:0000779 condensed chromosome, centromeric region    14/220
##               BgRatio       pvalue     p.adjust       qvalue
## GO:0005819  298/19428 6.831911e-18 2.336514e-15 1.812254e-15
## GO:0005875  157/19428 1.710039e-14 2.924167e-12 2.268052e-12
## GO:0005876   54/19428 7.427958e-13 8.467872e-11 6.567879e-11
## GO:0015630 1118/19428 1.228816e-12 1.050638e-10 8.148992e-11
## GO:0005874  421/19428 2.266638e-12 1.550381e-10 1.202511e-10
## GO:0000779  110/19428 2.652610e-11 1.444006e-09 1.120005e-09
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     geneID
## GO:0005819                                                                                                                                                                                                                                                 ENSG00000134690/ENSG00000117399/ENSG00000137807/ENSG00000138778/ENSG00000066279/ENSG00000126787/ENSG00000154839/ENSG00000262634/ENSG00000137804/ENSG00000088325/ENSG00000117650/ENSG00000170312/ENSG00000164109/ENSG00000121621/ENSG00000089685/ENSG00000138160/ENSG00000112742/ENSG00000178999/ENSG00000198901/ENSG00000237649/ENSG00000056678/ENSG00000233450/ENSG00000204197/ENSG00000186185/ENSG00000112984/ENSG00000087586/ENSG00000134057/ENSG00000090889
## GO:0005875                                                                                                                                                                                                                                                                                                                                                                                                 ENSG00000134690/ENSG00000137807/ENSG00000138778/ENSG00000121621/ENSG00000089685/ENSG00000138160/ENSG00000178999/ENSG00000237649/ENSG00000056678/ENSG00000233450/ENSG00000204197/ENSG00000186185/ENSG00000112984/ENSG00000087586/ENSG00000090889/ENSG00000186868/ENSG00000276155/ENSG00000277956/ENSG00000163879
## GO:0005876                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 ENSG00000154839/ENSG00000262634/ENSG00000137804/ENSG00000170312/ENSG00000121621/ENSG00000089685/ENSG00000138160/ENSG00000178999/ENSG00000198901/ENSG00000186185/ENSG00000087586/ENSG00000090889
## GO:0015630 ENSG00000093009/ENSG00000134690/ENSG00000117399/ENSG00000137807/ENSG00000138778/ENSG00000157456/ENSG00000131747/ENSG00000066279/ENSG00000138180/ENSG00000126787/ENSG00000154839/ENSG00000262634/ENSG00000137804/ENSG00000088325/ENSG00000013810/ENSG00000117650/ENSG00000170312/ENSG00000186871/ENSG00000164109/ENSG00000121621/ENSG00000089685/ENSG00000138160/ENSG00000112742/ENSG00000109805/ENSG00000178999/ENSG00000149554/ENSG00000198901/ENSG00000237649/ENSG00000056678/ENSG00000233450/ENSG00000204197/ENSG00000186185/ENSG00000112984/ENSG00000143476/ENSG00000087586/ENSG00000134057/ENSG00000090889/ENSG00000127603/ENSG00000154027/ENSG00000186868/ENSG00000276155/ENSG00000277956/ENSG00000163879
## GO:0005874                                                                                                                                                                                                                                                                                 ENSG00000137807/ENSG00000138778/ENSG00000066279/ENSG00000154839/ENSG00000262634/ENSG00000137804/ENSG00000088325/ENSG00000117650/ENSG00000170312/ENSG00000121621/ENSG00000089685/ENSG00000138160/ENSG00000178999/ENSG00000198901/ENSG00000237649/ENSG00000056678/ENSG00000233450/ENSG00000204197/ENSG00000186185/ENSG00000112984/ENSG00000087586/ENSG00000090889/ENSG00000127603/ENSG00000186868/ENSG00000276155/ENSG00000277956
## GO:0000779                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 ENSG00000138778/ENSG00000080986/ENSG00000123485/ENSG00000154839/ENSG00000262634/ENSG00000117650/ENSG00000100162/ENSG00000166451/ENSG00000186871/ENSG00000164109/ENSG00000089685/ENSG00000178999/ENSG00000087586/ENSG00000134057
##            Count
## GO:0005819    28
## GO:0005875    19
## GO:0005876    12
## GO:0015630    43
## GO:0005874    26
## GO:0000779    14

ego3 <- enrichGO(gene         = gene.df$SYMBOL,
                OrgDb         = org.Hs.eg.db,
                keytype       = 'SYMBOL',
                ont           = "CC",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05)
head(summary(ego3))

##                    ID                              Description GeneRatio
## GO:0005819 GO:0005819                                  spindle    24/196
## GO:0005876 GO:0005876                      spindle microtubule    11/196
## GO:0000793 GO:0000793                     condensed chromosome    17/196
## GO:0000779 GO:0000779 condensed chromosome, centromeric region    13/196
## GO:0015630 GO:0015630                 microtubule cytoskeleton    36/196
## GO:0005875 GO:0005875           microtubule associated complex    14/196
##               BgRatio       pvalue     p.adjust       qvalue
## GO:0005819  278/17761 6.023611e-15 2.042004e-12 1.769039e-12
## GO:0005876   52/17761 9.080301e-12 1.539111e-09 1.333370e-09
## GO:0000793  192/17761 4.363319e-11 4.930551e-09 4.271460e-09
## GO:0000779  103/17761 1.083989e-10 9.186804e-09 7.958759e-09
## GO:0015630 1034/17761 6.842818e-10 3.952252e-08 3.423935e-08
## GO:0005875  144/17761 6.995136e-10 3.952252e-08 3.423935e-08
##                                                                                                                                                                                                                        geneID
## GO:0005819                                                                      CDCA8/CDC20/KIF23/CENPE/ASPM/DLGAP5/SKA1/NUSAP1/TPX2/NEK2/CDK1/MAD2L1/KIF18A/BIRC5/KIF11/TTK/AURKB/PRC1/KIFC1/KIF18B/KIF20A/AURKA/CCNB1/KIF4A
## GO:0005876                                                                                                                                                  SKA1/NUSAP1/CDK1/KIF18A/BIRC5/KIF11/AURKB/PRC1/KIF18B/AURKA/KIF4A
## GO:0000793                                                                                                              CENPE/NDC80/TOP2A/NCAPH/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/BIRC5/NCAPG/AURKB/CHEK1/AURKA/CCNB1
## GO:0000779                                                                                                                                      CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/BIRC5/AURKB/AURKA/CCNB1
## GO:0015630 CDC45/CDCA8/CDC20/KIF23/CENPE/CCNB2/TOP2A/ASPM/CEP55/DLGAP5/SKA1/NUSAP1/TPX2/TACC3/NEK2/CDK1/ERCC6L/MAD2L1/KIF18A/BIRC5/KIF11/TTK/NCAPG/AURKB/CHEK1/PRC1/KIFC1/KIF18B/KIF20A/DTL/AURKA/CCNB1/KIF4A/AK5/MAPT/DNALI1
## GO:0005875                                                                                                                             CDCA8/KIF23/CENPE/KIF18A/BIRC5/KIF11/AURKB/KIFC1/KIF18B/KIF20A/AURKA/KIF4A/MAPT/DNALI1
##            Count
## GO:0005819    24
## GO:0005876    11
## GO:0000793    17
## GO:0000779    13
## GO:0015630    36
## GO:0005875    14

Using SYMBOL directly is not recommended. Users can use the setReadable
function to translate gene IDs to gene symbols.

ego <- setReadable(ego, OrgDb = org.Hs.eg.db)
ego2 <- setReadable(ego2, OrgDb = org.Hs.eg.db)
head(summary(ego), n=3)

##                    ID          Description GeneRatio   BgRatio
## GO:0005819 GO:0005819              spindle    24/197 222/11632
## GO:0005876 GO:0005876  spindle microtubule    11/197  45/11632
## GO:0000793 GO:0000793 condensed chromosome    17/197 150/11632
##                  pvalue     p.adjust       qvalue
## GO:0005819 3.810608e-13 1.276554e-10 1.139171e-10
## GO:0005876 1.527089e-10 2.557874e-08 2.282596e-08
## GO:0000793 5.838332e-10 6.519471e-08 5.817847e-08
##                                                                                                                                                   geneID
## GO:0005819 CDCA8/CDC20/KIF23/CENPE/ASPM/DLGAP5/SKA1/NUSAP1/TPX2/NEK2/CDK1/MAD2L1/KIF18A/BIRC5/KIF11/TTK/AURKB/PRC1/KIFC1/KIF18B/KIF20A/AURKA/CCNB1/KIF4A
## GO:0005876                                                                             SKA1/NUSAP1/CDK1/KIF18A/BIRC5/KIF11/AURKB/PRC1/KIF18B/AURKA/KIF4A
## GO:0000793                                         CENPE/NDC80/TOP2A/NCAPH/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/BIRC5/NCAPG/AURKB/CHEK1/AURKA/CCNB1
##            Count
## GO:0005819    24
## GO:0005876    11
## GO:0000793    17

enrichGO tests the whole GO corpus, so the enriched result may contain
very general terms. Users can use the dropGO function to remove specific GO
terms or GO levels. If users want to restrict the result to a specific GO
level, they can use the gofilter function. We also provide a simplify
method to reduce redundancy among enriched GO terms; see this blog post.
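
For instance, a minimal sketch of these three functions applied to the ego object above (the level and cutoff values are illustrative, not recommendations):

# remove the most general terms (GO level 1)
ego_specific <- dropGO(ego, level = 1)

# keep only terms at GO level 4
ego_level4 <- gofilter(ego, level = 4)

# collapse redundant terms by semantic similarity
ego_slim <- simplify(ego, cutoff = 0.7, by = "p.adjust", select_fun = min)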

Visualization functions

dotplot(ego, showCategory=30)

enrichMap(ego, vertex.label.cex=1.2, layout=igraph::layout.kamada.kawai)

cnetplot(ego, foldChange=geneList)

plotGOgraph(ego)

## 
## groupGOTerms:    GOBPTerm, GOMFTerm, GOCCTerm environments built.
## 
## Building most specific GOs ..... ( 335 GO terms found. )
## 
## Build GO DAG topology .......... ( 335 GO terms and 667 relations. )
## 
## Annotating nodes ............... ( 11632 genes annotated to the GO terms. )

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 29 
## Number of Edges = 50 
## 
## $complete.dag
## [1] "A graph with 29 nodes."

Gene Set Enrichment Analysis

gsecc <- gseGO(geneList=geneList, ont="CC", OrgDb=org.Hs.eg.db, verbose=F)
head(summary(gsecc))

##                    ID                              Description setSize
## GO:0031982 GO:0031982                                  vesicle    2880
## GO:0031988 GO:0031988                 membrane-bounded vesicle    2791
## GO:0005576 GO:0005576                     extracellular region    3296
## GO:0065010 GO:0065010 extracellular membrane-bounded organelle    2220
## GO:0070062 GO:0070062                    extracellular exosome    2220
## GO:0044421 GO:0044421                extracellular region part    2941
##            enrichmentScore       NES      pvalue   p.adjust    qvalues
## GO:0031982      -0.2561837 -1.222689 0.001002004 0.03721229 0.02816364
## GO:0031988      -0.2572169 -1.226003 0.001007049 0.03721229 0.02816364
## GO:0005576      -0.2746489 -1.312485 0.001009082 0.03721229 0.02816364
## GO:0065010      -0.2570342 -1.222048 0.001013171 0.03721229 0.02816364
## GO:0070062      -0.2570342 -1.222048 0.001013171 0.03721229 0.02816364
## GO:0044421      -0.2744658 -1.310299 0.001014199 0.03721229 0.02816364

gseaplot(gsecc, geneSetID="GO:0000779")

GO analysis using user’s own data

clusterProfiler provides the enricher function for hypergeometric tests and
the GSEA function for gene set enrichment analysis, both designed to
accept user-defined annotations. They take two additional parameters,
TERM2GENE and TERM2NAME. As the parameter names indicate, TERM2GENE
is a data.frame whose first column is the term ID and whose second column is the
corresponding mapped gene, while TERM2NAME is a data.frame whose first
column is the term ID and whose second column is the corresponding term name.
TERM2NAME is optional.
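
As a minimal sketch with made-up term and gene IDs (purely illustrative):

term2gene <- data.frame(term = c("T1", "T1", "T1", "T2", "T2"),
                        gene = c("4312", "8318", "10874", "55143", "991"))
term2name <- data.frame(term = c("T1", "T2"),
                        name = c("my first gene set", "my second gene set"))
x <- enricher(gene, TERM2GENE = term2gene, TERM2NAME = term2name)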

An example of using enricher and GSEA to analyze DisGeNet annotation is
presented in the post use clusterProfiler as a universal enrichment
analysis tool.

GMT files

We provide a function, read.gmt, that can parse a GMT file into a
TERM2GENE data.frame ready for both the enricher and GSEA functions.

gmtfile <- system.file("extdata", "c5.cc.v5.0.entrez.gmt", package="clusterProfiler")
c5 <- read.gmt(gmtfile)
egmt <- enricher(gene, TERM2GENE=c5)
head(summary(egmt))

##                                                ID              Description
## SPINDLE                                   SPINDLE                  SPINDLE
## MICROTUBULE_CYTOSKELETON MICROTUBULE_CYTOSKELETON MICROTUBULE_CYTOSKELETON
## CYTOSKELETAL_PART               CYTOSKELETAL_PART        CYTOSKELETAL_PART
## SPINDLE_MICROTUBULE           SPINDLE_MICROTUBULE      SPINDLE_MICROTUBULE
## MICROTUBULE                           MICROTUBULE              MICROTUBULE
## CYTOSKELETON                         CYTOSKELETON             CYTOSKELETON
##                          GeneRatio  BgRatio       pvalue     p.adjust
## SPINDLE                      11/82  39/5270 7.667674e-12 6.594200e-10
## MICROTUBULE_CYTOSKELETON     16/82 152/5270 8.449298e-10 3.633198e-08
## CYTOSKELETAL_PART            15/82 235/5270 2.414879e-06 6.623386e-05
## SPINDLE_MICROTUBULE           5/82  16/5270 3.080645e-06 6.623386e-05
## MICROTUBULE                   6/82  32/5270 7.740446e-06 1.331357e-04
## CYTOSKELETON                 16/82 367/5270 1.308357e-04 1.826293e-03
##                                qvalue
## SPINDLE                  5.327016e-10
## MICROTUBULE_CYTOSKELETON 2.935019e-08
## CYTOSKELETAL_PART        5.350593e-05
## SPINDLE_MICROTUBULE      5.350593e-05
## MICROTUBULE              1.075515e-04
## CYTOSKELETON             1.475340e-03
##                                                                                                  geneID
## SPINDLE                                           991/9493/9787/22974/983/332/3832/7272/9055/6790/24137
## MICROTUBULE_CYTOSKELETON 991/9493/9133/7153/9787/22974/4751/983/332/3832/7272/9055/6790/24137/4137/7802
## CYTOSKELETAL_PART             991/9493/7153/9787/22974/4751/983/332/3832/7272/9055/6790/24137/4137/7802
## SPINDLE_MICROTUBULE                                                             983/332/3832/9055/24137
## MICROTUBULE                                                                983/332/3832/9055/24137/4137
## CYTOSKELETON             991/9493/9133/7153/9787/22974/4751/983/332/3832/7272/9055/6790/24137/4137/7802
##                          Count
## SPINDLE                     11
## MICROTUBULE_CYTOSKELETON    16
## CYTOSKELETAL_PART           15
## SPINDLE_MICROTUBULE          5
## MICROTUBULE                  6
## CYTOSKELETON                16

egmt <- setReadable(egmt, OrgDb=org.Hs.eg.db, keytype="ENTREZID")
head(summary(egmt))

##                                                ID              Description
## SPINDLE                                   SPINDLE                  SPINDLE
## MICROTUBULE_CYTOSKELETON MICROTUBULE_CYTOSKELETON MICROTUBULE_CYTOSKELETON
## CYTOSKELETAL_PART               CYTOSKELETAL_PART        CYTOSKELETAL_PART
## SPINDLE_MICROTUBULE           SPINDLE_MICROTUBULE      SPINDLE_MICROTUBULE
## MICROTUBULE                           MICROTUBULE              MICROTUBULE
## CYTOSKELETON                         CYTOSKELETON             CYTOSKELETON
##                          GeneRatio  BgRatio       pvalue     p.adjust
## SPINDLE                      11/82  39/5270 7.667674e-12 6.594200e-10
## MICROTUBULE_CYTOSKELETON     16/82 152/5270 8.449298e-10 3.633198e-08
## CYTOSKELETAL_PART            15/82 235/5270 2.414879e-06 6.623386e-05
## SPINDLE_MICROTUBULE           5/82  16/5270 3.080645e-06 6.623386e-05
## MICROTUBULE                   6/82  32/5270 7.740446e-06 1.331357e-04
## CYTOSKELETON                 16/82 367/5270 1.308357e-04 1.826293e-03
##                                qvalue
## SPINDLE                  5.327016e-10
## MICROTUBULE_CYTOSKELETON 2.935019e-08
## CYTOSKELETAL_PART        5.350593e-05
## SPINDLE_MICROTUBULE      5.350593e-05
## MICROTUBULE              1.075515e-04
## CYTOSKELETON             1.475340e-03
##                                                                                                              geneID
## SPINDLE                                               CDC20/KIF23/DLGAP5/TPX2/CDK1/BIRC5/KIF11/TTK/PRC1/AURKA/KIF4A
## MICROTUBULE_CYTOSKELETON CDC20/KIF23/CCNB2/TOP2A/DLGAP5/TPX2/NEK2/CDK1/BIRC5/KIF11/TTK/PRC1/AURKA/KIF4A/MAPT/DNALI1
## CYTOSKELETAL_PART              CDC20/KIF23/TOP2A/DLGAP5/TPX2/NEK2/CDK1/BIRC5/KIF11/TTK/PRC1/AURKA/KIF4A/MAPT/DNALI1
## SPINDLE_MICROTUBULE                                                                     CDK1/BIRC5/KIF11/PRC1/KIF4A
## MICROTUBULE                                                                        CDK1/BIRC5/KIF11/PRC1/KIF4A/MAPT
## CYTOSKELETON             CDC20/KIF23/CCNB2/TOP2A/DLGAP5/TPX2/NEK2/CDK1/BIRC5/KIF11/TTK/PRC1/AURKA/KIF4A/MAPT/DNALI1
##                          Count
## SPINDLE                     11
## MICROTUBULE_CYTOSKELETON    16
## CYTOSKELETAL_PART           15
## SPINDLE_MICROTUBULE          5
## MICROTUBULE                  6
## CYTOSKELETON                16

gsegmt <- GSEA(geneList, TERM2GENE=c5, verbose=F)
head(summary(gsegmt))

##                                                                    ID
## EXTRACELLULAR_REGION                             EXTRACELLULAR_REGION
## EXTRACELLULAR_REGION_PART                   EXTRACELLULAR_REGION_PART
## CELL_PROJECTION                                       CELL_PROJECTION
## PROTEINACEOUS_EXTRACELLULAR_MATRIX PROTEINACEOUS_EXTRACELLULAR_MATRIX
## EXTRACELLULAR_MATRIX                             EXTRACELLULAR_MATRIX
## EXTRACELLULAR_MATRIX_PART                   EXTRACELLULAR_MATRIX_PART
##                                                           Description
## EXTRACELLULAR_REGION                             EXTRACELLULAR_REGION
## EXTRACELLULAR_REGION_PART                   EXTRACELLULAR_REGION_PART
## CELL_PROJECTION                                       CELL_PROJECTION
## PROTEINACEOUS_EXTRACELLULAR_MATRIX PROTEINACEOUS_EXTRACELLULAR_MATRIX
## EXTRACELLULAR_MATRIX                             EXTRACELLULAR_MATRIX
## EXTRACELLULAR_MATRIX_PART                   EXTRACELLULAR_MATRIX_PART
##                                    setSize enrichmentScore       NES
## EXTRACELLULAR_REGION                   401      -0.3860230 -1.694496
## EXTRACELLULAR_REGION_PART              310      -0.4101043 -1.761338
## CELL_PROJECTION                         87      -0.4729701 -1.739867
## PROTEINACEOUS_EXTRACELLULAR_MATRIX      93      -0.6355317 -2.365007
## EXTRACELLULAR_MATRIX                    95      -0.6229461 -2.318356
## EXTRACELLULAR_MATRIX_PART               54      -0.5908035 -2.002728
##                                         pvalue   p.adjust    qvalues
## EXTRACELLULAR_REGION               0.001310616 0.03192152 0.02442263
## EXTRACELLULAR_REGION_PART          0.001375516 0.03192152 0.02442263
## CELL_PROJECTION                    0.001503759 0.03192152 0.02442263
## PROTEINACEOUS_EXTRACELLULAR_MATRIX 0.001555210 0.03192152 0.02442263
## EXTRACELLULAR_MATRIX               0.001557632 0.03192152 0.02442263
## EXTRACELLULAR_MATRIX_PART          0.001631321 0.03192152 0.02442263

To leave a comment for the author, please follow the link and comment on their blog: R on G. Yu.


satRdays are coming


(This article was first published on rapporter, and kindly contributed to R-bloggers)

It’s been only around two months since the idea of community-driven R conferences was born, when Steph Locke first talked publicly about this cool concept, but I am pretty sure we will be able to attend at least one or two satRdays in 2016, as the project has received plenty of very positive feedback on GitHub, on Twitter and in in-person conversations.

In short, this is a proposal to the R Consortium, to be submitted in the next few days, for free or cheap full-day conferences organized by R users for R users around the world, acting as a bridge between local R user groups and global conferences.

What we already know:

  • the conferences won’t cost more than a video-game or a book (on R),
  • just like SQLSaturdays, the events will be held on the weekends, so that you don’t have to take a day off from work to attend,
  • it will be fun and useful for all of us, but
  • it won’t happen without your help!

So please register your interest in being involved (a very short survey with 5 tiny questions), whether that’s simply about brainstorming, attending or going as far as helping to organize one in your area.

To leave a comment for the author, please follow the link and comment on their blog: rapporter.


Delays on the Dutch railway system


(This article was first published on Longhow Lam's Blog » R, and kindly contributed to R-bloggers)

I almost never travel by train; the last time was years ago. However, recently I had to take the train from Amsterdam and it was delayed by 5 minutes. No big deal, but it made me curious how often delays occur on the Dutch railway system. I couldn’t quickly find a historical data set with information on delays, so I decided to gather my own data.

The Dutch Railways provide an API (the NS API) that returns actual departure and delay data for a given train station. I have written a small R script that calls this API for each of the roughly 400 train stations in The Netherlands; this script is scheduled to run every 10 minutes. The API returns data in XML format, and the basic entity is “a departing train”. For each departing train we know its departure time, destination, departing station, train type, delay (if any), etc. So what to do with all these departing trains? Throw them all into MongoDB. Why?

  • Not for any particular reason :-).
  • It’s easy to install and setup on my little Ubuntu server.
  • There is a nice R interface to MongoDB.
  • The response structure (see picture below) from the API is not that difficult to flatten to a table, but NoSQL sounds more sexy than MySQL nowadays :-)

[Image: one departing-train entry as stored in MongoDB]
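
Below is a minimal sketch of one polling round using mongolite (one of the R interfaces to MongoDB); the API endpoint, credentials, station codes and XML node name are placeholders for the real NS API values:

library(httr)       # call the API
library(XML)        # parse the XML response
library(mongolite)  # talk to MongoDB

con <- mongo(collection = "departures", db = "trains")

get_departures <- function(station) {
  resp <- GET("http://webservices.ns.nl/ns-api-avt",   # placeholder endpoint
              authenticate("username", "password"),
              query = list(station = station))
  doc  <- xmlParse(content(resp, as = "text"))
  # flatten every departing-train node into a row of a data frame
  deps <- xmlToDataFrame(nodes = getNodeSet(doc, "//VertrekkendeTrein"))
  deps$station <- station
  deps
}

# in reality this loops over all ~400 stations, every 10 minutes
for (st in c("asd", "ut", "gvc")) {
  con$insert(get_departures(st))
}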

I started collecting train departure data on the 4th of January; per day there are around 48,000 train departures in The Netherlands. I can see how many of them are delayed, per day, per station or per hour. Of course, since the collection started only a few days ago, it is hard to use these data to estimate long-term delay rates of the Dutch railway system. But it is a start.

To present this delay information in an interactive way to others I have created an R Shiny app that queries the MongoDB database. The picture below from my Shiny app shows the delay rates per train station on the 4th of January 2016, an icy day especially in the north of the Netherlands.
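
A stripped-down sketch of such an app, assuming the (hypothetical) document schema used in the sketch above plus date and delay fields:

library(shiny)
library(mongolite)

con <- mongo(collection = "departures", db = "trains")

ui <- fluidPage(
  dateInput("day", "Day", value = Sys.Date()),
  plotOutput("delayPlot")
)

server <- function(input, output) {
  output$delayPlot <- renderPlot({
    d <- con$find(sprintf('{"date": "%s"}', input$day))
    # share of delayed departures per station
    barplot(tapply(d$delay > 0, d$station, mean), las = 2,
            ylab = "share of delayed departures")
  })
}

shinyApp(ui, server)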

[Image: Shiny app map of delay rates per train station, 4 January 2016]

Cheers,

Longhow

To leave a comment for the author, please follow the link and comment on their blog: Longhow Lam's Blog » R.


Repel overlapping text labels in ggplot2


(This article was first published on Getting Genetics Done, and kindly contributed to R-bloggers)
A while back I showed you how to make volcano plots in base R for visualizing gene expression results. This is just one of many genome-scale plots where you might want to show all individual results but highlight or call out important results by labeling them, for example, with a gene name.
But if you want to annotate lots of points, the annotations usually get so crowded that they overlap one another and become illegible. There are ways around this – reducing the font size, or adjusting the position or angle of the text, but these usually don’t completely solve the problem, and can even make the visualization worse. Here’s the plot again, reading the results directly from GitHub, and drawing the plot with ggplot2 and geom_text out of the box.

What a mess. It’s difficult to see what any of those downregulated genes are on the left. Enter the ggrepel package, a new extension of ggplot2 that repels text labels away from one another. Just sub in geom_text_repel() in place of geom_text() and the extension is smart enough to try to figure out how to label the points such that the labels don’t interfere with each other. Here it is in action.
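
A minimal sketch of the substitution, assuming a results data frame res with (hypothetical) columns Gene, log2FoldChange and pvalue:

library(ggplot2)
library(ggrepel)

ggplot(res, aes(log2FoldChange, -log10(pvalue))) +
  geom_point() +
  # drop-in replacement for geom_text(): labels repel one another
  geom_text_repel(aes(label = Gene))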

And the result (much better!):
See the ggrepel package vignette for more.

To leave a comment for the author, please follow the link and comment on their blog: Getting Genetics Done.


Revolution R renamed Microsoft R, available free to developers and students


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In the nine months since Microsoft acquired Revolution Analytics, there has been a steady stream of updates to Revolution R Open and Revolution R Enterprise (not to mention the integration of R with SQL Server, Power BI, Azure and Cortana Analytics). Now we have yet more updates to announce, along with fresh new names. Revolution R Open is now Microsoft R Open, with an update coming later this month, and Revolution R Enterprise is now Microsoft R Server, available for purchase now, or for download free of charge for developers and students.

Microsoft r open

Revolution R Enterprise, the big-data capable R distribution for servers, Hadoop clusters, and data warehouses has been updated for its new release, Microsoft R Server 2016. In addition to its new name, Microsoft R Server includes an updated R engine (R 3.2.2), new fuzzy matching algorithms, the ability to write to databases via ODBC, and a streamlined install experience. It's now even easier for companies to purchase, via the Microsoft MSDN (Microsoft Developer Network) and VLSC (Volume License Servicing Center) programs. For developers, Microsoft R Server Developer Edition is now available free of charge via the Visual Studio Dev Essentials program. And the Microsoft DreamSpark program now provides Microsoft R Server free of charge to students, and to academic institutions as part of a discounted software site license.

Microsoft R Server is built on Microsoft R Open, which is the new name for Revolution R Open. As always, Microsoft R Open is free to download, use and share, and is available from MRAN. We're working on the new update to Microsoft R Open featuring R 3.2.3, which will be available on January 19. In the meantime, the updated MRAN has a new color-vision-friendly look, faster R package search available from every page, and a new CRAN Time Machine.

Want to get started with Microsoft R Open or Microsoft R Server? Here are all the links you need:

Read more about the new Microsoft R Open and Microsoft R Server updates in this Microsoft blog post by Joseph Sirosh.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Microsoft R Server available free to students with DreamSpark


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

Over the last 6 years, thousands of students and faculty have downloaded Revolution R Enterprise (RRE) from Revolution Analytics for free, making it possible for them to do statistical modeling on large data sets with the same R language used by savvy statisticians and data scientists in business and industry. In addition to this individual scholar program (ISP), Revolution Analytics launched two initiatives in 2014 to provide academic institutions and non-profit public service companies with site licenses for the nominal annual licensing fee of $999. Both the Academic Institution Program (AIP) and Public Service program (PSP) enabled qualifying institutions to install RRE on servers and Hadoop clusters without restrictions. Now, seven months after Microsoft’s acquisition of Revolution Analytics, all three of these programs are being folded into Microsoft programs that will make it even easier for individual students and institutions to get started with the newest release of RRE, now known as Microsoft R Server.

On December 31, 2015 all three programs — ISP, AIP and PSP — came to an end. ISP participants may continue to use the software they have under the terms of the original license. Institutions currently participating in Revolution Analytics’ AIP and PSP programs will be contacted by Microsoft representatives to transition them to Microsoft programs.

Microsoft R Server is available for academic use under Microsoft’s DreamSpark programs. Students can download Microsoft R Server 2016 for free via DreamSpark for Students. Universities and other qualifying academic institutions will be able to obtain licenses for Microsoft R Server 2016 as part of Microsoft’s DreamSpark for academic institutions program. Academic institutions will have two choices for participating in the DreamSpark program. DreamSpark Standard is campus-wide and includes a subset of tools including Visual Studio Professional, Windows Server, and SQL Server and is available for an annual licensing fee of $99 (or $199 for 3 years). DreamSpark Premium is only for a single STEM-related department or school and contains premium titles including Windows 10 client, Visual Studio Enterprise, Visio, and Project. The annual licensing fee is $499.

Microsoft R Server 2016 runs on Windows and Linux operating systems, in Teradata databases and on a number of Hadoop platforms. The product names to look for on the DreamSpark web pages are:

  1. Microsoft R Server for Hadoop on Red Hat
  2. Microsoft R Server for Red Hat Linux
  3. Microsoft R Server for SUSE Linux
  4. Microsoft R Server for Teradata DB
  5. Revolution R Enterprise for Windows

Providing even more students with access to Microsoft R Server is a pretty big deal. Microsoft R Server extends the reach of R into big-data, distributed processing environments by providing a framework for manipulating large data sets one chunk at a time, so that all of the data being analyzed does not have to fit into memory simultaneously. Moreover, the RevoScaleR package, which ships only with Microsoft R Server, provides a number of inherently parallel, distributed algorithms for statistical analysis and machine learning. These include high-performance implementations of generalized linear models, K-means clustering, the Naïve Bayes classifier, decision trees, random forests and much more.


These algorithms automatically distribute computations across all of the available resources. Users need only specify a compute context that points to that data. When SQL Server 2016 becomes available midyear, students will be able to fit predictive models directly in a SQL database.
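
As a sketch of what that looks like in practice (RevoScaleR ships only with Microsoft R Server; the file name is illustrative):

library(RevoScaleR)

# an .xdf file keeps the data on disk; it is processed one chunk at a time
flights <- RxXdfData("airline.xdf")

# rxLinMod streams over the chunks, so the data never has to fit in memory
fit <- rxLinMod(ArrDelay ~ DayOfWeek, data = flights)
summary(fit)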

Microsoft DreamSpark: Download Microsoft R Server

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


In case you missed it: December 2015 roundup


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from December of particular interest to R users. 

A look back at accomplishments of the R Project and community in 2015.

Segmented regression with the "segmented" package, applied to long-distance running records.

Creating multi-tab reports in R with knitr and jQuery UI.

New version 2.0 update to ggplot2 adds extensibility and many improvements.

A circle diagram of translations of "Merry Christmas".

Upcoming R events and conferences, and sponsorship for R user groups.

How to embed images in R help pages.

An Azure ML Studio fraud detection template relies heavily on R components.

R is the fastest-growing language on Stackoverflow, as shown in a subway-style rank chart from ggplot2.

Buzzfeed is using R for some (serious!) data journalism.

A tutorial on using SQL Server R Services to analyze a billion taxi rides.

Some suggestions on how to cryptographically store secrets in R code.

Some tips and trade-offs to consider when reading large data files with the RevoScaleR package.

A brief summary of improvements in R 3.2.3.

Implementing Wald's sequential analysis test in R.

Using the gtrendsR package to download and chart Google Trends data.

Distributed data structures in R with the ddR package.

Using the leaflet package to create an interactive, photo-annotated map of GPS data from a hike.

Microsoft Azure's Data Science Virtual Machine includes R.

Feature selection when modeling wide data sets with genetic algorithms using the caret package.

Tips on setting up a virtual machine with RStudio in Azure.

Querying recursive CTEs (common table expressions) in a database with the sqldf package.

General interest stories (not related to R) in the past month included: your Macbook charger has more CPU than the original Macintosh, how 5 particles can jam a hopper, and a film based on the NASA photo archive.

As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Scheduling R Markdown Reports via Email


(This article was first published on analytics for fun, and kindly contributed to R-bloggers)
[Image: a Google Analytics report generated with R Markdown]

R Markdown is an amazing tool that allows you to blend bits of R code with ordinary text and produce well-formatted data analysis reports very quickly. You can export the final report in many formats, like HTML, PDF or MS Word, which makes it easy to share with others. And of course, you can modify or update it with fresh data very easily.

I have recently been using R Markdown for pulling data from various data sources, such as the Google Analytics API and a MySQL database, performing several operations on the data (merging, for example) and presenting the outputs with tables, visualizations and insights (text).

But what about automating the whole report generation and emailing the final report as an attached document every month at a specific time? In this post I am going to explain how to do it in Windows. If you search on Google, you will find several threads on Stack Overflow and a few good posts specifically on this. However, it took me some time to get it working and I had to try different options first. That’s why I am writing this quick tutorial, including screenshots, hoping it might get your report automated faster!

1. Create your Rmarkdown report

In RStudio, create a new R Markdown document where you will enter your R code and text. Mine is called “Schedule_Report.Rmd” and here is what it does:

  • retrieve some data from Google Analytics API using the RGoogleAnalytics library
  • turn dates into a more friendly format
  • create a trend line chart of sessions using the ggplot2 package 

A very basic report. Remember that in R Markdown you can decide whether or not to show each chunk of code. I showed just the final outputs, which are the table and the chart.

2. Create an R script that executes and emails your R Markdown report

Create a new R script which will:

  • locate your Rmarkdown document (set the working directory to where your report is located)
  • generate an HTML file (or pdf, MS Word) from your Rmarkdown document
  • send the HTML file via email

To email the report I have used the gmailr library, which allows you to generate and send emails directly from R. To make sure gmailr will work, you might first need to enable the “Less secure apps” option in your Google account: open your personal Google account, go to the Sign-in & Security section, scroll down to the bottom of the page and switch on “Allow less secure apps”.

I also made a few attempts with the mailR package, but without success. I guess this was because of security issues with my Google account (I have Gmail). Anyway, the gmailr package worked perfectly, so I stuck with it! Here is the code contained in my R script, which is named “Script.R”.

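The original post showed the script as a screenshot; here is a minimal sketch of what Script.R does (addresses and paths are placeholders, and the function names are from the pre-1.0 gmailr API, which newer versions prefix with gm_):

library(rmarkdown)
library(magrittr)
library(gmailr)

setwd("C:/Users/me/reports")            # folder holding the .Rmd file
render("Schedule_Report.Rmd",           # knit the report to HTML
       output_file = "Schedule_Report.html")

msg <- mime() %>%
  from("me@gmail.com") %>%
  to("recipient@example.com") %>%
  subject("Monthly analytics report") %>%
  text_body("Hi, the latest report is attached.") %>%
  attach_file("Schedule_Report.html")

send_message(msg)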

3. Schedule a task in Windows

From the main Windows menu, go to Programs>Accessories>System Tools>Task Scheduler (at least this is the path in my Windows edition). The task scheduler will open up:

Click on Action > Create Basic Task. Type a name for your task and add a short description if you like. Now select the trigger, i.e. how often you want the task to be executed (to try it out, I recommend choosing “One time” first). Select the date and time, and in the action field choose “Start a program”.

In the “Start a Program” step, complete the fields as follows:

>Program/Script: the full path to the R executable file (R.exe)

>Add arguments: CMD BATCH followed by the path of the R script you created at step 2. Remember to put the directory paths between quotation marks “”, as in the example below.
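
For example (both paths are illustrative and depend on your R version and where you saved the script):

Program/script: "C:\Program Files\R\R-3.2.3\bin\x64\R.exe"
Add arguments:  CMD BATCH "C:\Users\me\reports\Script.R"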

[Screenshot: the completed “Start a Program” fields in the Task Scheduler]

Click Next and you should now reach the last step and see a confirmation window. Press Finish and voilà: your task is created and should execute at the time you set.

4. Check your mail

At the time you set, you should see the “taskeng” window popping up and disappearing after a few seconds (depending on the workload in your R files). Now open the mail account you sent the report to. Did you get the email with your report attached?

In case you did not receive the email, I recommend you to:

–> Check if the task was executed in Windows. Open the Task Scheduler and you will see the list of tasks. Look for your task name and make sure the status says “Success” and not “Failed”.
–> If the status says “Failed”, double-check that you set up the task correctly as per step 3. An alternative is creating a .bat file separately and entering the path of the .bat file in the task scheduler.
–> As a general troubleshooting method, I also suggest opening your R console (double click on R.exe) and executing the code of your R script from step 2 line by line. This way you can tell whether there is an error inside your R code, i.e. Windows executes the task correctly but no data is generated or sent by R.

Here below are a few issues that might prevent R from executing the code contained in your R script properly:

*To be able to send mails via the gmailr package, make sure you enable the “less secure apps” option in your Google account.

**To be able to create an HTML document from an R Markdown file, make sure you have installed the latest version of pandoc. To do that, you should, in order:
     > install.packages("installr")
     > library(installr)
     > install.pandoc()
     > restart your machine

To recap, the process you have just automated will work as follows:

  1. Windows will start a task at the day/time you specified in the task scheduler
  2. the taskeng process will open and execute, through R, the script you created at step 2
  3. the Rmarkdown report will be converted into an HTML file and sent by email

If you would like to reproduce the whole process using my files, you can find both the R Markdown report and the R script in this GitHub repository. I hope the post was helpful and encourages you to use R for generating business reports.

To leave a comment for the author, please follow the link and comment on their blog: analytics for fun.


Data Manipulation in R: Beyond SQL


(This article was first published on R-Chart, and kindly contributed to R-bloggers)

As a follow-up to an article on using SQL in R, I just had a new article published at Simple Talk that considers ways to manipulate data in R that are cumbersome in SQL, as well as ways to replace SQL statements with functional equivalents.

To leave a comment for the author, please follow the link and comment on their blog: R-Chart.


R trends in 2015 (based on cranlogs)


(This article was first published on R – G-Forge, and kindly contributed to R-bloggers)
What are the current tRends? The image is CC from coco + kelly.

It is always fun to look back and reflect on the past year. Inspired by Christoph Safferling’s post on top packages published in 2015, I decided to have my own go at the top R trends of 2015. Contrary to Safferling’s post I’ll also try to (1) look at packages from previous years that hit the big league, (2) see what top R coders we have in the community, and then (3) round up with my own 2015 R experience.

Everything in this post is based on the CRANberries reports. To harvest the information I’ve borrowed shamelessly from Safferling’s post, with some modifications. He used the number of downloads as a proxy for the package release date, while I decided to use the actual release date; if that wasn’t available I scraped it off the CRAN servers. The script now also retrieves the package author(s) and description (see the code below for details).

library(rvest)
library(dplyr)
# devtools::install_github("hadley/multidplyr")
library(multidplyr)
library(magrittr)
library(lubridate)
 
getCranberriesElmnt <- function(txt, elmnt_name){
  desc <- grep(sprintf("^%s:", elmnt_name), txt)
  if (length(desc) == 1){
    txt <- txt[desc:length(txt)]
    end <- grep("^[A-Za-z/@]{2,}:", txt[-1])
    if (length(end) == 0)
      end <- length(txt)
    else
      end <- end[1]
 
    desc <-
      txt[1:end] %>% 
      gsub(sprintf("^%s: (.+)", elmnt_name),
           "\1", .) %>% 
      paste(collapse = " ") %>% 
      gsub("[ ]{2,}", " ", .) %>% 
      gsub(" , ", ", ", .)
  }else if (length(desc) == 0){
    desc <- paste("No", tolower(elmnt_name))
  }else{
    stop("Could not find ", elmnt_name, " in text: n",
         paste(txt, collapse = "n"))
  }
  return(desc)
}
 
convertCharset <- function(txt){
  if (grepl("Windows", Sys.info()["sysname"]))
    txt <- iconv(txt, from = "UTF-8", to = "cp1252")
  return(txt)
}
 
getAuthor <- function(txt, package){
  author <- getCranberriesElmnt(txt, "Author")
  if (grepl("No author|See AUTHORS file", author)){
    author <- getCranberriesElmnt(txt, "Maintainer")
  }
 
  if (grepl("(No m|M)aintainer|(No a|A)uthor|^See AUTHORS file", author) || 
      is.null(author) ||
      nchar(author)  <= 2){
    cran_txt <- read_html(sprintf("http://cran.r-project.org/web/packages/%s/index.html",
                                  package))
    author <- cran_txt %>% 
      html_nodes("tr") %>% 
      html_text %>% 
      convertCharset %>% 
      gsub("(^[ tn]+|[ tn]+$)", "", .) %>% 
      .[grep("^Author", .)] %>% 
      gsub(".*n", "", .)
 
    # If not found then the package has probably been
    # removed from the repository
    if (length(author) == 1)
      author <- author
    else
      author <- "No author"
  }
 
  # Remove stuff such as:
  # [cre, auth]
  # (worked on the...)
  # <my@email.com>
  # "John Doe"
  author %<>% 
    gsub("^Author: (.+)", 
         "\\1", .) %>% 
    gsub("[ ]*\\[[^]]{3,}\\][ ]*", " ", .) %>% 
    gsub("\\([^)]+\\)", " ", .) %>% 
    gsub("([ ]*<[^>]+>)", " ", .) %>% 
    gsub("[ ]*\\[[^]]{3,}\\][ ]*", " ", .) %>% 
    gsub("[ ]{2,}", " ", .) %>% 
    gsub("(^[ '\"]+|[ '\"]+$)", "", .) %>% 
    gsub(" , ", ", ", .)
  return(author)
}
 
getDate <- function(txt, package){
  date <- 
    grep("^Date/Publication", txt)
  if (length(date) == 1){
    date <- txt[date] %>% 
      gsub("Date/Publication: ([0-9]{4,4}-[0-9]{2,2}-[0-9]{2,2}).*",
           "\1", .)
  }else{
    cran_txt <- read_html(sprintf("http://cran.r-project.org/web/packages/%s/index.html",
                                  package))
    date <- 
      cran_txt %>% 
      html_nodes("tr") %>% 
      html_text %>% 
      convertCharset %>% 
      gsub("(^[ tn]+|[ tn]+$)", "", .) %>% 
      .[grep("^Published", .)] %>% 
      gsub(".*n", "", .)
 
 
    # The main page doesn't contain the original date if 
    # new packages have been submitted, we therefore need
    # to check first entry in the archives
    if(cran_txt %>% 
       html_nodes("tr") %>% 
       html_text %>% 
       gsub("(^[ tn]+|[ tn]+$)", "", .) %>% 
       grepl("^Old.{1,4}sources", .) %>% 
       any){
      archive_txt <- read_html(sprintf("http://cran.r-project.org/src/contrib/Archive/%s/",
                                       package))
      pkg_date <- 
        archive_txt %>% 
        html_nodes("tr") %>% 
        lapply(function(x) {
          nodes <- html_nodes(x, "td")
          if (length(nodes) == 5){
            return(nodes[3] %>% 
                     html_text %>% 
                     as.Date(format = "%d-%b-%Y"))
          }
        }) %>% 
        .[sapply(., length) > 0] %>% 
        .[!sapply(., is.na)] %>% 
        head(1)
 
      if (length(pkg_date) == 1)
        date <- pkg_date[[1]]
    }
  }
  date <- tryCatch({
    as.Date(date)
  }, error = function(e){
    "Date missing"
  })
  return(date)
}
 
getNewPkgStats <- function(published_in){
  # The parallel is only for making cranlogs requests
  # we can therefore have more cores than actual cores
  # as this isn't processor intensive while there is
  # considerable wait for each http-request
  cl <- create_cluster(parallel::detectCores() * 4)
  parallel::clusterEvalQ(cl, {
    library(cranlogs)
  })
  set_default_cluster(cl)
  on.exit(stop_cluster())
 
  berries <- read_html(paste0("http://dirk.eddelbuettel.com/cranberries/", published_in, "/"))
  pkgs <- 
    # Select the divs of the package class
    html_nodes(berries, ".package") %>% 
    # Extract the text
    html_text %>% 
    # Split the lines
    strsplit("[n]+") %>% 
    # Now clean the lines
    lapply(.,
           function(pkg_txt) {
             pkg_txt[sapply(pkg_txt, function(x) { nchar(gsub("^[ \t]+", "", x)) > 0}, 
                            USE.NAMES = FALSE)] %>% 
               gsub("^[ \t]+", "", .) 
           })
 
  # Now we select the new packages
  new_packages <- 
    pkgs %>% 
    # The first line is key as it contains the text "New package"
    sapply(., function(x) x[1], USE.NAMES = FALSE) %>% 
    grep("^New package", .) %>% 
    pkgs[.] %>% 
    # Now we extract the package name and the date that it was published
    # and merge everything into one table
    lapply(function(txt){
      txt <- convertCharset(txt)
      ret <- data.frame(
        name = gsub("^New package ([^ ]+) with initial .*", 
                     "\1", txt[1]),
        stringsAsFactors = FALSE
      )
 
      ret$desc <- getCranberriesElmnt(txt, "Description")
      ret$author <- getAuthor(txt, ret$name)
      ret$date <- getDate(txt, ret$name)
 
      return(ret)
    }) %>% 
    rbind_all %>% 
    # Get the download data in parallel
    partition(name) %>% 
    do({
      down <- cran_downloads(.$name[1], 
                             from = max(as.Date("2015-01-01"), .$date[1]), 
                             to = "2015-12-31")$count 
      cbind(.[1,],
            data.frame(sum = sum(down), 
                       avg = mean(down))
      )
    }) %>% 
    collect %>% 
    ungroup %>% 
    arrange(desc(avg))
 
  return(new_packages)
}
 
pkg_list <- 
  lapply(2010:2015,
         getNewPkgStats)
 
pkgs <- 
  rbind_all(pkg_list) %>% 
  mutate(time = as.numeric(as.Date("2016-01-01") - date),
         year = format(date, "%Y"))

Downloads and time on CRAN

The longer a package has been on CRAN, the more downloads it gets. We can illustrate this using simple linear regression; slightly surprisingly, the relationship behaves mostly linearly:

pkgs %<>% 
  mutate(time_yrs = time/365.25)
fit <- lm(avg ~ time_yrs, data = pkgs)
 
# Test for non-linearity
library(splines)
anova(fit,
      update(fit, .~.-time_yrs+ns(time_yrs, 2)))
Analysis of Variance Table

Model 1: avg ~ time
Model 2: avg ~ ns(time, 2)
  Res.Df       RSS Df Sum of Sq      F Pr(>F)
1   7348 189661922                           
2   7347 189656567  1    5355.1 0.2075 0.6488

The average number of daily downloads increases by about 5 per year spent on CRAN. It can easily be argued that the average number of downloads isn’t that interesting since the data is skewed; we can therefore also look at the upper quantiles using quantile regression:

library(quantreg)
library(htmlTable)
lapply(c(.5, .75, .95, .99),
       function(tau){
         rq_fit <- rq(avg ~ time_yrs, data = pkgs, tau = tau)
         rq_sum <- summary(rq_fit)
         c(Estimate = txtRound(rq_sum$coefficients[2, 1], 1), 
           `95 % CI` = txtRound(rq_sum$coefficients[2, 1] + 
                                        c(1,-1) * rq_sum$coefficients[2, 2], 1) %>% 
             paste(collapse = " to "))
       }) %>% 
  do.call(rbind, .) %>% 
  htmlTable(rnames = c("Median",
                       "Upper quartile",
                       "Top 5%",
                       "Top 1%"))
                Estimate  95 % CI
Median               0.6  0.6 to 0.6
Upper quartile       1.2  1.2 to 1.1
Top 5%               9.7  11.9 to 7.6
Top 1%             182.5  228.2 to 136.9

The above table conveys a slightly more interesting picture. Most packages don’t get that much attention while the top 1% truly reach the masses.

Top downloaded packages

In order to investigate what packages R users have been using during 2015, I’ve looked at all new packages since the turn of the decade. Since each year of CRAN presence increases the download rates, I’ve split the table by package release date. The results are available for browsing below (yes – it is the brand new interactive htmlTable that allows you to collapse cells – note that it may not work if you are reading this on R-bloggers, and the link is lost under certain circumstances).

Name  Author  Total downloads  Average/day  Description
Top 10 packages published in 2015
xml2 Hadley Wickham, Jeroen Ooms, RStudio, R Foundation 348,222 1635 Work with XML files …
rversions Gabor Csardi 386,996 1524 Query the main R SVN…
git2r Stefan Widgren 411,709 1303 Interface to the lib…
praise Gabor Csardi, Sindre Sorhus 96,187 673 Build friendly R pac…
readxl David Hoerl 99,386 379 Import excel files i…
readr Hadley Wickham, Romain Francois, R Core Team, RStudio 90,022 337 Read flat/tabular te…
DiagrammeR Richard Iannone 84,259 236 Create diagrams and …
visNetwork Almende B.V. (vis.js library in htmlwidgets/lib, 41,185 233 Provides an R interf…
plotly Carson Sievert, Chris Parmer, Toby Hocking, Scott Chamberlain, Karthik Ram, Marianne Corvellec, Pedro Despouy 9,745 217 Easily translate ggp…
DT Yihui Xie, Joe Cheng, jQuery contributors, SpryMedia Limited, Brian Reavis, Leon Gersen, Bartek Szopka, RStudio Inc 24,806 120 Data objects in R ca…
Top 10 packages published in 2014
stringi Marek Gagolewski and Bartek Tartanus ; IBM and other contributors ; Unicode, Inc. 1,316,900 3608 stringi allows for v…
magrittr Stefan Milton Bache and Hadley Wickham 1,245,662 3413 Provides a mechanism…
mime Yihui Xie 1,038,591 2845 This package guesses…
R6 Winston Chang 920,147 2521 The R6 package allow…
dplyr Hadley Wickham, Romain Francois 778,311 2132 A fast, consistent t…
manipulate JJ Allaire, RStudio 626,191 1716 Interactive plotting…
htmltools RStudio, Inc. 619,171 1696 Tools for HTML gener…
curl Jeroen Ooms 599,704 1643 The curl() function …
lazyeval Hadley Wickham, RStudio 572,546 1569 A disciplined approa…
rstudioapi RStudio 515,665 1413 This package provide…
Top 10 packages published in 2013
jsonlite Jeroen Ooms, Duncan Temple Lang 906,421 2483 This package is a fo…
BH John W. Emerson, Michael J. Kane, Dirk Eddelbuettel, JJ Allaire, and Romain Francois 691,280 1894 Boost provides free …
highr Yihui Xie and Yixuan Qiu 641,052 1756 This package provide…
assertthat Hadley Wickham 527,961 1446 assertthat is an ext…
httpuv RStudio, Inc. 310,699 851 httpuv provides low-…
NLP Kurt Hornik 270,682 742 Basic classes and me…
TH.data Torsten Hothorn 242,060 663 Contains data sets u…
NMF Renaud Gaujoux, Cathal Seoighe 228,807 627 This package provide…
stringdist Mark van der Loo 123,138 337 Implements the Hammi…
SnowballC Milan Bouchet-Valat 104,411 286 An R interface to th…
Top 10 packages published in 2012
gtable Hadley Wickham 1,091,440 2990 Tools to make it eas…
knitr Yihui Xie 792,876 2172 This package provide…
httr Hadley Wickham 785,568 2152 Provides useful tool…
markdown JJ Allaire, Jeffrey Horner, Vicent Marti, and Natacha Porte 636,888 1745 Markdown is a plain-…
Matrix Douglas Bates and Martin Maechler 470,468 1289 Classes and methods …
shiny RStudio, Inc. 427,995 1173 Shiny makes it incre…
lattice Deepayan Sarkar 414,716 1136 Lattice is a powerfu…
pkgmaker Renaud Gaujoux 225,796 619 This package provide…
rngtools Renaud Gaujoux 225,125 617 This package contain…
base64enc Simon Urbanek 223,120 611 This package provide…
Top 10 packages published in 2011
scales Hadley Wickham 1,305,000 3575 Scales map data to a…
devtools Hadley Wickham 738,724 2024 Collection of packag…
RcppEigen Douglas Bates, Romain Francois and Dirk Eddelbuettel 634,224 1738 R and Eigen integrat…
fpp Rob J Hyndman 583,505 1599 All data sets requir…
nloptr Jelmer Ypma 583,230 1598 nloptr is an R inter…
pbkrtest Ulrich Halekoh Søren Højsgaard 536,409 1470 Test in linear mixed…
roxygen2 Hadley Wickham, Peter Danenberg, Manuel Eugster 478,765 1312 A Doxygen-like in-so…
whisker Edwin de Jonge 413,068 1132 logicless templating…
doParallel Revolution Analytics 299,717 821 Provides a parallel …
abind Tony Plate and Richard Heiberger 255,151 699 Combine multi-dimens…
Top 10 packages published in 2010
reshape2 Hadley Wickham 1,395,099 3822 Reshape lets you fle…
labeling Justin Talbot 1,104,986 3027 Provides a range of …
evaluate Hadley Wickham 862,082 2362 Parsing and evaluati…
formatR Yihui Xie 640,386 1754 This package provide…
minqa Katharine M. Mullen, John C. Nash, Ravi Varadhan 600,527 1645 Derivative-free opti…
gridExtra Baptiste Auguie 581,140 1592 misc. functions
memoise Hadley Wickham 552,383 1513 Cache the results of…
RJSONIO Duncan Temple Lang 414,373 1135 This is a package th…
RcppArmadillo Romain Francois and Dirk Eddelbuettel 410,368 1124 R and Armadillo inte…
xlsx Adrian A. Dragulescu 401,991 1101 Provide R functions …


Just as Safferling et al. noted, there is a dominance of technical packages. This is hardly surprising, since the majority of the work is data munging. Among these technical packages there are quite a few that are used for developing other packages, e.g. roxygen2, pkgmaker, devtools, and more.

R-star authors

Just for fun I decided to look at who has the most downloads. By splitting multi-author packages into one entry per author, and also splitting their downloads, we can find that in 2015 the top R coders were:

top_coders <- list(
  "2015" = 
    pkgs %>% 
    filter(format(date, "%Y") == 2015) %>% 
    partition(author) %>% 
    do({
      authors <- strsplit(.$author, "[ ]*([,;]| and )[ ]*")[[1]]
      authors <- authors[!grepl("^[ ]*(Inc|PhD|Dr|Lab).*[ ]*$", authors)]
      if (length(authors) >= 1){
        # If multiple authors the statistic is split among
        # them but with an added 20% for the extra collaboration
        # effort that a multi-author environment calls for
        .$sum <- round(.$sum/length(authors)*1.2)
        .$avg <- .$avg/length(authors)*1.2
        ret <- .
        ret$author <- authors[1]
        for (m in authors[-1]){
          tmp <- .
          tmp$author <- m
          ret <- rbind(ret, tmp)
        }
        return(ret)
      }else{
        return(.)
      }
    }) %>% 
    collect() %>% 
    group_by(author) %>% 
    summarise(download_ave = round(sum(avg)),
              no_packages = n(),
              packages = paste(name, collapse = ", ")) %>% 
    select(author, download_ave, no_packages, packages) %>% 
    collect() %>% 
    arrange(desc(download_ave)) %>% 
    head(10),
  "all" =
    pkgs %>% 
    partition(author) %>% 
    do({
      if (grepl("Jeroen Ooms", .$author))
        browser()
      authors <- strsplit(.$author, "[ ]*([,;]| and )[ ]*")[[1]]
      authors <- authors[!grepl("^[ ]*(Inc|PhD|Dr|Lab).*[ ]*$", authors)]
      if (length(authors) >= 1){
        # If multiple authors the statistic is split among
        # them but with an added 20% for the extra collaboration
      # effort that a multi-author environment calls for
        .$sum <- round(.$sum/length(authors)*1.2)
        .$avg <- .$avg/length(authors)*1.2
        ret <- .
        ret$author <- authors[1]
        for (m in authors[-1]){
          tmp <- .
          tmp$author <- m
          ret <- rbind(ret, tmp)
        }
        return(ret)
      }else{
        return(.)
      }
    }) %>% 
    collect() %>% 
    group_by(author) %>% 
    summarise(download_ave = round(sum(avg)),
              no_packages = n(),
              packages = paste(name, collapse = ", ")) %>% 
    select(author, download_ave, no_packages, packages) %>% 
    collect() %>% 
    arrange(desc(download_ave)) %>% 
    head(30))
 
interactiveTable(
  do.call(rbind, top_coders) %>% 
    mutate(download_ave = txtInt(download_ave)),
  align = "lrr",
  header = c("Coder", "Total ave. downloads per day", "No. of packages", "Packages"),
  tspanner = c("Top coders 2015",
               "Top coders 2010-2015"),
  n.tspanner = sapply(top_coders, nrow),
  minimized.columns = 4, 
  rnames = FALSE, 
  col.rgroup = c("white", "#F0F0FF"))
Coder | Total ave. downloads per day | No. of packages | Packages

Top coders 2015
Gabor Csardi | 2,312 | 11 | sankey, franc, rvers…
Stefan Widgren | 1,563 | 1 | git2r
RStudio | 781 | 16 | shinydashboard, with…
Hadley Wickham | 695 | 12 | withr, cellranger, c…
Jeroen Ooms | 541 | 10 | rjade, js, sodium, w…
Richard Cotton | 501 | 22 | assertive.base, asse…
R Foundation | 490 | 1 | xml2
David Hoerl | 455 | 1 | readxl
Sindre Sorhus | 409 | 2 | praise, clisymbols
Richard Iannone | 294 | 2 | DiagrammeR, stationa…

Top coders 2010-2015
Hadley Wickham | 32,115 | 55 | swirl, lazyeval, ggp…
Yihui Xie | 9,739 | 18 | DT, Rd2roxygen, high…
RStudio | 9,123 | 25 | shinydashboard, lazy…
Jeroen Ooms | 4,221 | 25 | JJcorr, gdtools, bro…
Justin Talbot | 3,633 | 1 | labeling
Winston Chang | 3,531 | 17 | shinydashboard, font…
Gabor Csardi | 3,437 | 26 | praise, clisymbols, …
Romain Francois | 2,934 | 20 | int64, LSD, RcppExam…
Duncan Temple Lang | 2,854 | 6 | RMendeley, jsonlite,…
Adrian A. Dragulescu | 2,456 | 2 | xlsx, xlsxjars
JJ Allaire | 2,453 | 7 | manipulate, htmlwidg…
Simon Urbanek | 2,369 | 15 | png, fastmatch, jpeg…
Dirk Eddelbuettel | 2,094 | 33 | Rblpapi, RcppSMC, RA…
Stefan Milton Bache | 2,069 | 3 | import, blatr, magri…
Douglas Bates | 1,966 | 5 | PKPDmodels, RcppEige…
Renaud Gaujoux | 1,962 | 6 | NMF, doRNG, pkgmaker…
Jelmer Ypma | 1,933 | 2 | nloptr, SparseGrid
Rob J Hyndman | 1,933 | 3 | hts, fpp, demography
Baptiste Auguie | 1,924 | 2 | gridExtra, dielectri…
Ulrich Halekoh Søren Højsgaard | 1,764 | 1 | pbkrtest
Martin Maechler | 1,682 | 11 | DescTools, stabledis…
Mirai Solutions GmbH | 1,603 | 3 | XLConnect, XLConnect…
Stefan Widgren | 1,563 | 1 | git2r
Edwin de Jonge | 1,513 | 10 | tabplot, tabplotGTK,…
Kurt Hornik | 1,476 | 12 | movMF, ROI, qrmtools…
Deepayan Sarkar | 1,369 | 4 | qtbase, qtpaint, lat…
Tyler Rinker | 1,203 | 9 | cowsay, wakefield, q…
Yixuan Qiu | 1,131 | 12 | gdtools, svglite, hi…
Revolution Analytics | 1,011 | 4 | doParallel, doSMP, r…
Torsten Hothorn | 948 | 7 | MVA, HSAUR3, TH.data…

It is worth mentioning that two of the top coders are companies, RStudio and Revolution Analytics. While I like the fact that R is free and open-source, I doubt that the community would have grown as quickly as it has without these companies. It is also symptomatic of 2015 that companies are taking R seriously; it will be interesting to see what the R Consortium will bring to the community. I think the r-hub project is incredibly interesting and will hopefully make my life as an R-package developer easier.

My own 2015 R experience

My own personal R experience has been dominated by magrittr and dplyr, as seen in the code above. Like most, I find that magrittr makes things a little easier to read, and unless I have some really large dataset the overhead is small. It does have some downsides related to debugging, but these are negligible.
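For anyone who hasn't tried the pipe, a minimal sketch of the readability gain (using the built-in mtcars data purely for illustration):

library(magrittr)
library(dplyr)

# Nested calls read inside-out...
head(arrange(filter(mtcars, cyl == 4), desc(mpg)), 3)

# ...while the pipe reads left to right, one step per line
mtcars %>%
  filter(cyl == 4) %>%
  arrange(desc(mpg)) %>%
  head(3)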

When I originally tried dplyr I came from the plyr environment and was disappointed by the lack of parallelization; the concepts felt a little odd when thinking the plyr way. I had been using sqldf a lot in my data munging and merging, so when I found left_join, inner_join, and the brilliant anti_join I was completely sold. Combined with RStudio, I find the dplyr workflow both more intuitive and more productive than my previous one.
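To show why anti_join in particular won me over, here is a minimal sketch with made-up toy tables, next to the sqldf query it replaces:

library(dplyr)

# Toy tables: all tracked packages, and those already processed
pkgs_all  <- data.frame(name = c("dplyr", "magrittr", "rex"))
processed <- data.frame(name = "dplyr")

# The sqldf way I used before:
# sqldf("SELECT * FROM pkgs_all WHERE name NOT IN (SELECT name FROM processed)")

# The dplyr way: keep rows in pkgs_all that have no match in processed
anti_join(pkgs_all, processed, by = "name")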

When looking at those packages (including more than just the top 10 here) I did find some additional gems that I intend to look into when I have the time:

  • DiagrammeR An interesting new way of producing diagrams. I’ve used it for Gantt charts but it allows for much more.
  • checkmate A neat package for checking function arguments (see the small sketch after this list).
  • covr An excellent package for measuring how much of a package’s code is covered by tests.
  • rex A package for making regular expressions easier.
  • openxlsx I wish I didn’t have to but I still get a lot of things in Excel-format – perhaps this package solves the Excel-import inferno…
  • R6 The successor to reference classes – after working with the Gmisc::Transition-class I appreciate the need for a better system.
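To give a taste of the argument checking that checkmate offers, a minimal sketch (the scale_to_unit function is invented for the example):

library(checkmate)

# A toy function that validates its input before doing any work
scale_to_unit <- function(x) {
  assert_numeric(x, min.len = 2, any.missing = FALSE)
  (x - min(x)) / (max(x) - min(x))
}

scale_to_unit(c(1, 3, 5))  # 0.0 0.5 1.0
# scale_to_unit("a")       # stops immediately with an informative error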


To leave a comment for the author, please follow the link and comment on their blog: R – G-Forge.
