Channel: Search Results for “sql”– R-bloggers

100 “must read” R-bloggers’ posts for 2015


The site R-bloggers.com is now 6 years young. It strives to be an (unofficial) online news and tutorials website for the R community, written by over 600 bloggers who agreed to contribute their R articles to the website. In 2015, the site served almost 17.7 million pageviews to readers worldwide.

In celebration of R-bloggers’ 6th birth-month, here are the top 100 most-read R posts written in 2015. Enjoy:

  1. How to Learn R
  2. How to Make a Histogram with Basic R
  3. How to Make a Histogram with ggplot2
  4. Choosing R or Python for data analysis? An infographic
  5. How to Get the Frequency Table of a Categorical Variable as a Data Frame in R
  6. How to perform a Logistic Regression in R
  7. A new interactive interface for learning R online, for free
  8. How to learn R: A flow chart
  9. Learn Statistics and R online from Harvard
  10. Twitter’s new R package for anomaly detection
  11. R 3.2.0 is released (+ using the installr package to upgrade in Windows OS)
  12. What’s the probability that a significant p-value indicates a true effect?
  13. Fitting a neural network in R; neuralnet package
  14. K-means clustering is not a free lunch
  15. Why you should learn R first for data science
  16. How to format your chart and axis titles in ggplot2
  17. Illustrated Guide to ROC and AUC
  18. The Single Most Important Skill for a Data Scientist
  19. A first look at Spark
  20. Change Point Detection in Time Series with R and Tableau
  21. Interactive visualizations with R – a minireview
  22. The leaflet package for online mapping in R
  23. Programmatically create interactive Powerpoint slides with R
  24. My New Favorite Statistics & Data Analysis Book Using R
  25. Dark themes for writing
  26. How to use SparkR within Rstudio?
  27. Shiny 0.12: Interactive Plots with ggplot2
  28. 15 Questions All R Users Have About Plots
  29. This R Data Import Tutorial Is Everything You Need
  30. R in Business Intelligence
  31. 5 New R Packages for Data Scientists
  32. Basic text string functions in R
  33. How to get your very own RStudio Server and Shiny Server with DigitalOcean
  34. Think Bayes: Bayesian Statistics Made Simple
  35. 2014 highlight: Statistical Learning course by Hastie & Tibshirani
  36. ggplot 2.0.0
  37. Machine Learning in R for beginners
  38. Top 77 R posts for 2014 (+R jobs)
  39. Introducing Radiant: A shiny interface for R
  40. Eight New Ideas From Data Visualization Experts
  41. Microsoft Launches Its First Free Online R Course on edX
  42. Imputing missing data with R; MICE package
  43. “Variable Importance Plot” and Variable Selection
  44. The Data Science Industry: Who Does What (Infographic)
  45. d3heatmap: Interactive heat maps
  46. R + ggplot2 Graph Catalog
  47. Time Series Graphs & Eleven Stunning Ways You Can Use Them
  48. Working with “large” datasets, with dplyr and data.table
  49. Why the Ban on P-Values? And What Now?
  50. Part 3a: Plotting with ggplot2
  51. Importing Data Into R – Part Two
  52. How-to go parallel in R – basics + tips
  53. RStudio v0.99 Preview: Graphviz and DiagrammeR
  54. Downloading Option Chain Data from Google Finance in R: An Update
  55. R: single plot with two different y-axes
  56. Generalised Linear Models in R
  57. Hypothesis Testing: Fishing for Trouble
  58. The advantages of using count() to get N-way frequency tables as data frames in R
  59. Playing with R, Shiny Dashboard and Google Analytics Data
  60. Benchmarking Random Forest Implementations
  61. Fuzzy String Matching – a survival skill to tackle unstructured information
  62. Make your R plots interactive
  63. R #6 in IEEE 2015 Top Programming Languages, Rising 3 Places
  64. How To Analyze Data: Seven Modern Remakes Of The Most Famous Graphs Ever Made
  65. dplyr 0.4.0
  66. Installing and Starting SparkR Locally on Windows OS and RStudio
  67. Making R Files Executable (under Windows)
  68. Evaluating Logistic Regression Models
  69. Awesome-R: A curated list of the best add-ons for R
  70. Introducing Distributed Data-structures in R
  71. SAS vs R? The right answer to the wrong question?
  72. But I Don’t Want to Be a Statistician!
  73. Get data out of excel and into R with readxl
  74. Interactive R Notebooks with Jupyter and SageMathCloud
  75. Learning R: Index of Online R Courses, October 2015
  76. R User Group Recap: Heatmaps and Using the caret Package
  77. R Tutorial on Reading and Importing Excel Files into R
  78. R 3.2.2 is released
  79. Wanted: A Perfect Scatterplot (with Marginals)
  80. KDD Cup 2015: The story of how I built hundreds of predictive models….And got so close, yet so far away from 1st place!
  81. Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance
  82. 10 Top Tips For Becoming A Better Coder!
  83. James Bond movies
  84. Modeling and Solving Linear Programming with R – Free book
  85. Scraping Web Pages With R
  86. Why you should start by learning data visualization and manipulation
  87. R tutorial on the Apply family of functions
  88. The relation between p-values and the probability H0 is true is not weak enough to ban p-values
  89. A Bayesian Model to Calculate Whether My Wife is Pregnant or Not
  90. First year books
  91. Using rvest to Scrape an HTML Table
  92. dplyr Tutorial: verbs + split-apply
  93. RStudio Clone for Python – Rodeo
  94. Time series outlier detection (a simple R function)
  95. Building Wordclouds in R
  96. Should you teach Python or R for data science?
  97. Free online data mining and machine learning courses by Stanford University
  98. Centering and Standardizing: Don’t Confuse Your Rows with Your Columns
  99. Network analysis with igraph
  100. Regression Models, It’s Not Only About Interpretation 

    (oh heck, why not include a few more posts…)

  101. magrittr: The best thing to have ever happened to R?
  102. How to Speak Data Science
  103. R vs Python: a Survival Analysis with Plotly
  104. 15 Easy Solutions To Your Data Frame Problems In R
  105. R for more powerful clustering
  106. Using the R MatchIt package for propensity score analysis
  107. Interactive charts in R
  108. R is the fastest-growing language on StackOverflow
  109. Hash Table Performance in R: Part I
  110. Review of ‘Advanced R’ by Hadley Wickham
  111. Plotting Time Series in R using Yahoo Finance data
  112. R: the Excel Connection
  113. Cohort Analysis with Heatmap
  114. Data Visualization cheatsheet, plus Spanish translations
  115. Back to basics: High quality plots using base R graphics
  116. 6 Machine Learning Visualizations made in Python and R
  117. An R tutorial for Microsoft Excel users
  118. Connecting R to Everything with IFTTT
  119. Data Manipulation with dplyr
  120. Correlation and Linear Regression
  121. Why has R, despite quirks, been so successful?
  122. Introducing shinyjs: perform common JavaScript operations in Shiny apps using plain R code
  123. R: How to Layout and Design an Infographic
  124. New package for image processing in R
  125. In-database R coming to SQL Server 2016
  126. Making waffle charts in R (with the new ‘waffle’ package)
  127. Revolution Analytics joins Microsoft
  128. Six Ways You Can Make Beautiful Graphs (Like Your Favorite Journalists)

 

p.s.: 2015 was also a great year for R-users.com, a job board site for R users. If you are an employer who is looking to hire people from the R community, please visit this link to post a new R job (it’s free, and registration takes less than 10 seconds). If you are a job seeker, please follow the links below to learn more and apply for your job of interest (or visit previous R jobs posts).

 



Filling in the gaps – highly granular estimates of income and population for New Zealand from survey data


(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

Individual-level estimates from survey data

I was motivated by web apps like the British Office of National Statistics’ How well do you know your area? and How well does your job pay? to see if I could turn the New Zealand Income Survey into an individual-oriented estimate of income given age group, qualification, occupation, ethnicity, region and hours worked. My tentative go at this is embedded below, and there’s also a full screen version available.

The job’s a tricky one because the survey data available doesn’t go to anywhere near that level of granularity. It could be done with census data of course, but any such effort to publish would come up against confidentiality problems – there are just too few people in any particular combination of categories to release real data there. So some kind of modelling is required that can smooth over the actual data but still give a plausible and realistic estimate.

I also wanted to emphasise the distribution of income, not just a single measure like mean or median – something I think that we statisticians should do much more than we do, with all sorts of variables. And in particular I wanted to find a good way of dealing with the significant number of people in many categories (particularly but not only “no occupation”) who have zero income; and also the people who have negative income in any given week.

My data source is the New Zealand Income Survey 2011 simulated record file published by Statistics New Zealand. An earlier post by me describes how I accessed this, normalised it and put it into a database. I’ve also written several posts about dealing with the tricky distribution of individual incomes, listed here under the “NZIS2011” heading.

This is a longer post than usual, with a digression into the use of Random Forests ™ to predict continuous variables, an attempt at producing a more polished plot of a regression tree than is usually available, and some reflections on the strengths and weaknesses of several different approaches to estimating distributions.

Data import and shape

I begin by setting up the environment and importing the data I’d placed in the database in that earlier post. There’s a big chunk of R packages needed for all the things I’m doing here. I also re-create some helper functions for transforming skewed continuous variables that include zero and negative values, which I first created in another post back in September 2015.

#------------------setup------------------------
library(showtext)
library(RMySQL)
library(ggplot2)
library(scales)
library(MASS) # for stepAIC.  Needs to be before dplyr to avoid "select" namespace clash
library(dplyr)
library(tidyr)
library(stringr)
library(gridExtra)
library(GGally)

library(rpart)
library(rpart.plot)   # for prp()
library(caret)        # for train()
library(partykit)     # for plot(as.party())
library(randomForest)

# library(doMC)         # for multicore processing with caret, on Linux only

library(h2o)


library(xgboost)
library(Matrix)
library(data.table)


library(survey) # for rake()


font.add.google("Poppins", "myfont")
showtext.auto()
theme_set(theme_light(base_family = "myfont"))

PlayPen <- dbConnect(RMySQL::MySQL(), username = "analyst", dbname = "nzis11")


#------------------transformation functions------------
# helper functions for transformations of skewed data that crosses zero.  See 
# http://ellisp.github.io/blog/2015/09/07/transforming-breaks-in-a-scale/
.mod_transform <- function(y, lambda){
   if(lambda != 0){
      yt <- sign(y) * (((abs(y) + 1) ^ lambda - 1) / lambda)
   } else {
      yt <- sign(y) * (log(abs(y) + 1))
   }
   return(yt)
}


.mod_inverse <- function(yt, lambda){
   if(lambda != 0){
      y <- ((abs(yt) * lambda + 1)  ^ (1 / lambda) - 1) * sign(yt)
   } else {
      y <- (exp(abs(yt)) - 1) * sign(yt)
      
   }
   return(y)
}

# parameter for reshaping - equivalent to sqrt:
lambda <- 0.5
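
As a quick sanity check, the two helpers should undo each other exactly, including for zero and negative incomes; a minimal sketch (the numbers are made up):

# quick round-trip check of the transformation helpers; the values are made up
x_check <- c(-250, 0, 40, 700, 2000)
.mod_inverse(.mod_transform(x_check, lambda), lambda)   # should return x_check unchanged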

Importing the data is a straightforward SQL query, with some reshaping required because survey respondents were allowed to specify either one or two ethnicities. This means I need an indicator column for each individual ethnicity if I’m going to include ethnicity in any meaningful way (for example, an “Asian” column with “Yes” or “No” for each survey respondent). Wickham’s {dplyr} and {tidyr} packages handle this sort of thing easily.

#---------------------------download and transform data--------------------------
# This query will include double counting of people with multiple ethnicities
sql <-
"SELECT sex, agegrp, occupation, qualification, region, hours, income, 
         a.survey_id, ethnicity FROM
   f_mainheader a                                               JOIN
   d_sex b           on a.sex_id = b.sex_id                     JOIN
   d_agegrp c        on a.agegrp_id = c.agegrp_id               JOIN
   d_occupation e    on a.occupation_id = e.occupation_id       JOIN
   d_qualification f on a.qualification_id = f.qualification_id JOIN
   d_region g        on a.region_id = g.region_id               JOIN
   f_ethnicity h     on h.survey_id = a.survey_id               JOIN
   d_ethnicity i     on h.ethnicity_id = i.ethnicity_id
   ORDER BY a.survey_id, ethnicity"

orig <- dbGetQuery(PlayPen, sql) 
dbDisconnect(PlayPen)

# ...so we spread into wider format with one column per ethnicity
nzis <- orig %>%
   mutate(ind = TRUE) %>%
   spread(ethnicity, ind, fill = FALSE) %>%
   select(-survey_id) %>%
   mutate(income = .mod_transform(income, lambda = lambda))

for(col in unique(orig$ethnicity)){
   nzis[ , col] <- factor(ifelse(nzis[ , col], "Yes", "No"))
}

# in fact, we want all characters to be factors
for(i in 1:ncol(nzis)){
   if(class(nzis[ , i]) == "character"){
      nzis[ , i] <- factor(nzis[ , i])
   }
}

names(nzis)[11:14] <- c("MELAA", "Other", "Pacific", "Residual")

After reshaping ethnicity and transforming the income data into something a little less skewed (so measures of prediction accuracy like root mean square error are not going to be dominated by the high values), I split my data into training and test sets, with 80 percent of the sample in the training set.

set.seed(234)
nzis$use <- ifelse(runif(nrow(nzis)) > 0.8, "Test", "Train")
trainData <- nzis %>% filter(use == "Train") %>% select(-use)
trainY <- trainData$income
testData <- nzis %>% filter(use == "Test") %>% select(-use)
testY <- testData$income

Modelling income

The first job is to get a model that can estimate income for any arbitrary combination of the explanatory variables: hours worked, occupation, qualification, age group, ethnicity x 7 and region. I worked through five or six different ways of doing this before eventually settling on Random Forests, which had the right combination of convenience and accuracy.

Regression tree

My first crude baseline is a single regression tree. I didn’t seriously expect this to work particularly well, but treated it as an interim measure before moving to a random forest. I use the train() function from the {caret} package to determine the best value for the complexity parameter (cp) – the minimum improvement in overall R-squared needed before a split is made. The best single tree is shown below.

One nice feature of regression trees – so long as they aren’t too large to see all at once – is usually their easy interpretability. Unfortunately this goes a bit by the wayside because I’m using a transformed version of income, and the tree is returning the mean of that transformed version. When I reverse the transform back into dollars I get a dollar number that is in effect the squared mean of the square root of the original income in a particular category; which happens to generally be close to the median, hence the somewhat obscure footnote in the bottom right corner of the plot above. It’s a reasonable measure of the centre in any particular group, but not one I’d relish explaining to a client.
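
As a toy illustration of that point, with made-up right-skewed data, the back-transformed mean of the transformed values lands below the arithmetic mean and closer to the median:

# toy illustration with fake, right-skewed "incomes": the tree-style figure
# (back-transformed mean of the transformed values) sits below the mean, nearer the median
set.seed(42)
fake_income <- rlnorm(10000, meanlog = 6, sdlog = 1)
c(mean       = mean(fake_income),
  median     = median(fake_income),
  tree_style = .mod_inverse(mean(.mod_transform(fake_income, lambda)), lambda))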

Following the tree through, we see that

  • the overall centre of the data is $507 income per week
  • for people who work less than 23 hours, it goes down to $241; and those who work 23 or more hours receive $994.
  • of those who work few hours, if they are community and personal service workers, labourers, have no occupation, or are in a residual occupation category, their average income is $169; for all other occupations it is $477.
  • of those people who work few hours and are in the low paying occupations (including no occupation), those aged 15 – 19 receive $28 per week and those in other categories $214 per week.
  • and so on.

It takes a bit of effort to look at this plot and work out what is going on (and the abbreviated occupation labels don’t help, sorry), but it’s possible once you’ve got the hang of it. Leftwards branches always receive less income than rightwards branches; the split is always done on only one variable at a time, and the leftwards split label is slightly higher on the page than the rightwards split label.

Trees are a nice tool for this sort of data because they can capture fairly complex interactions in a very flexible way. Where they’re weaker is in dealing with relationships between continuous variables that can be smoothly modelled by simple arithmetic – that’s when more traditional regression methods, or model-tree combinations, prove useful.

The code that fitted and plotted this tree (using the wonderful and not-used-enough prp() function that allows considerable control and polish of rpart trees) is below.

#---------------------modelling with a single tree---------------
# single tree, with factors all grouped together
set.seed(234)

# Determine the best value of cp via cross-validation
# set up parallel processing to make this faster, for this and future use of train()
# registerDoMC(cores = 3) # linux only
rpartTune <- train(income ~., data = trainData,
                     method = "rpart",
                     tuneLength = 10,
                     trControl = trainControl(method = "cv"))

rpartTree <- rpart(income ~ ., data = trainData, 
                   control = rpart.control(cp = rpartTune$bestTune),
                   method = "anova")


node.fun1 <- function(x, labs, digits, varlen){
   paste0("$", round(.mod_inverse(x$frame$yval, lambda = lambda), 0))
}

# exploratory plot only - not for dissemination:
# plot(as.party(rpartTree))

svg("..http://ellisp.github.io/img/0026-polished-tree.svg", 12, 10)
par(fg = "blue", family = "myfont")

prp(rpartTree, varlen = 5, faclen = 7, type = 4, extra = 1, 
    under = TRUE, tweak = 0.9, box.col = "grey95", border.col = "grey92",
    split.font = 1, split.cex = 0.8, eq = ": ", facsep = " ",
    branch.col = "grey85", under.col = "lightblue",
    node.fun = node.fun1)

grid.text("New Zealanders' income in one week in 2011", 0.5, 0.89,
          gp = gpar(fontfamily = "myfont", fontface = "bold"))  

grid.text("Other factors considered: qualification, region, ethnicity.",
          0.8, 0.2, 
          gp = gpar(fontfamily = "myfont", cex = 0.8))

grid.text("$ numbers in blue are 'average' weekly income:nsquared(mean(sign(sqrt(abs(x)))))nwhich is a little less than the median.",
          0.8, 0.1, 
          gp = gpar(fontfamily = "myfont", cex = 0.8, col = "blue"))

dev.off()

(Note – in working on this post I was using at different times several different machines, including some of it on a Linux server which is much easier than Windows for parallel processing. I’ve commented out the Linux-only bits of code so it should all be fully portable.)

The success rates of the various modelling methods in predicting income in the test data I put aside will be shown all in one part of this post, later.

A home-made random spinney (not forest…)

Regression trees have high variance. Basically, they are unstable, and vulnerable to influential small pockets of data changing them quite radically. The solution to this problem is to generate an ensemble of different trees and take the average prediction. The two most commonly used methods are:

  • “bagging” or bootstrap aggregation, which involves resampling from the data and fitting trees to the resamples
  • Random Forests (trademark of Breiman and Cutler), which resample rows from the data and also restrict each split to a different random subset of the variables.

Gradient boosting can also be seen as a variant in this class of solutions but I think takes a sufficiently different approach for me to leave it to further down the post.

Bagging is probably an appropriate method here given the relatively small number of explanatory variables, but to save space in an already grossly over-long post I’ve left it out.
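
For readers who haven’t seen it, a bare-bones bagging sketch with the {rpart} tools already loaded looks something like this (purely illustrative; the number of trees and object names are arbitrary, and these objects aren’t used later):

# illustrative sketch only: bagged regression trees for income
n_bag <- 25
bagged_trees <- lapply(1:n_bag, function(i){
   boot_rows <- sample(1:nrow(trainData), replace = TRUE)   # bootstrap resample of rows
   rpart(income ~ ., data = trainData[boot_rows, ], method = "anova")
})

# the bagged prediction is the average of the individual trees' predictions
bag_preds <- rowMeans(sapply(bagged_trees, predict, newdata = testData))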

Random Forests ™ are a subset of the broader group of ensemble tree techniques known as “random decision forests”, and I set out to explore one variant of random decision forests visually (I’m a very visual person – if I can’t make a picture or movie of something happening I can’t understand it). The animation below shows an ensemble of 50 differing trees, where each tree was fitted to a set of data sampled with replacement from the original data, and each tree was also restricted to just three randomly chosen variables. Note that this differs from a Random Forest, where the restriction differs for each split within a tree, rather than being a restriction for the tree as a whole.

Here’s how I generated my spinney of regression trees. Some of this code depends on a particular folder structure. The basic strategy is to

  • work out which variables have the most crude explanatory power
  • subset the data
  • subset the variables, choosing those with good explanatory power more often than the weaker ones
  • use cross-validation to work out the best tuning for the complexity parameter
  • fit the best tree possible with our subset of data and variables
  • draw an image, with appropriate bits of commentary and labelling added to it, and save it for later
  • repeat the above 50 times, and then knit all the images into an animated GIF using ImageMagick.
#----------home made random decision forest--------------
# resample both rows and columns, as in a random decision forest,
# and draw a picture for each fitted tree.  Knit these
# into an animation.  Note this isn't quite the same as a random forest (tm).

# define the candidate variables
variables <- c("sex", "agegrp", "occupation", "qualification",
               "region", "hours", "Maori")

# estimate the value of the individual variables, one at a time

var_weights <- data_frame(var = variables, r2 = 0)
for(i in 1:length(variables)){
   tmp <- trainData[ , c("income", variables[i])]
   if(variables[i] == "hours"){
      tmp$hours <- sqrt(tmp$hours)
   }
   tmpmod <- lm(income ~ ., data = tmp)
   var_weights[i, "r2"] <- summary(tmpmod)$adj.r.squared
}

svg("..http://ellisp.github.io/img/0026-variables.svg", 8, 6)
print(
   var_weights %>%
   arrange(r2) %>%
   mutate(var = factor(var, levels = var)) %>%
   ggplot(aes(y = var, x = r2)) +
   geom_point() +
   labs(x = "Adjusted R-squared from one-variable regression",
        y = "",
        title = "Effectiveness of one variable at a time in predicting income")
)
dev.off()


n <- nrow(trainData)

home_made_rf <- list()
reps <- 50

commentary <- str_wrap(c(
   "This animation illustrates the use of an ensemble of regression trees to improve estimates of income based on a range of predictor variables.",
   "Each tree is fitted on a resample with replacement from the original data; and only three variables are available to the tree.",
   "The result is that each tree will have a different but still unbiased forecast for a new data point when a prediction is made.  Taken together, the average prediction is still unbiased and has less variance than the prediction of any single tree.",
   "This method is similar but not identical to a Random Forest (tm).  In a Random Forest, the choice of variables is made at each split in a tree rather than for the tree as a whole."
   ), 50)


set.seed(123)
for(i in 1:reps){
   
   these_variables <- sample(var_weights$var, 3, replace = FALSE, prob = var_weights$r2)
   
   this_data <- trainData[
      sample(1:n, n, replace = TRUE),
      c(these_variables, "income")
   ]
   
   
   
   this_rpartTune <- train(this_data[,1:3], this_data[,4],
                      method = "rpart",
                      tuneLength = 10,
                      trControl = trainControl(method = "cv"))
   
   
   
   home_made_rf[[i]] <- rpart(income ~ ., data = this_data, 
                      control = rpart.control(cp = this_rpartTune$bestTune),
                      method = "anova")
 
   png(paste0("_output/0026_random_forest/", 1000 + i, ".png"), 1200, 1000, res = 100)  
      par(fg = "blue", family = "myfont")
      prp(home_made_rf[[i]], varlen = 5, faclen = 7, type = 4, extra = 1, 
          under = TRUE, tweak = 0.9, box.col = "grey95", border.col = "grey92",
          split.font = 1, split.cex = 0.8, eq = ": ", facsep = " ",
          branch.col = "grey85", under.col = "lightblue",
          node.fun = node.fun1, mar = c(3, 1, 5, 1))
      
      grid.text(paste0("Variables available to this tree: ", 
                      paste(these_variables, collapse = ", "))
                , 0.5, 0.90,
                gp = gpar(fontfamily = "myfont", cex = 0.8, col = "darkblue"))
      
      grid.text("One tree in a random spinney - three randomly chosen predictor variables for weekly income,
resampled observations from New Zealand Income Survey 2011", 0.5, 0.95,
                gp = gpar(fontfamily = "myfont", cex = 1))
      
      grid.text(i, 0.05, 0.05, gp = gpar(fontfamily = "myfont", cex = 1))
      
      grid.text("$ numbers in blue are 'average' weekly income:nsquared(mean(sign(sqrt(abs(x)))))nwhich is a little less than the median.",
                0.8, 0.1, 
                gp = gpar(fontfamily = "myfont", cex = 0.8, col = "blue"))
      
      comment_i <- min(floor(i / 12.5) + 1, length(commentary))  # don't run past the last commentary element
      
      grid.text(commentary[comment_i], 
                0.3, 0.1,
                gp = gpar(fontfamily = "myfont", cex = 1.2, col = "orange"))
      
      dev.off()

}   

# knit into an actual animation
old_dir <- setwd("_output/0026_random_forest")
# combine images into an animated GIF
system('"C:\Program Files\ImageMagick-6.9.1-Q16\convert" -loop 0 -delay 400 *.png "rf.gif"') # Windows
# system('convert -loop 0 -delay 400 *.png "rf.gif"') # linux
# move the asset over to where needed for the blog
file.copy("rf.gif", "../../..http://ellisp.github.io/img/0026-rf.gif", overwrite = TRUE)
setwd(old_dir)

Random Forest

The next model to try is a genuine Random Forest ™. As mentioned above, a Random Forest is an ensemble of regression trees, where each tree is fitted to a resample with replacement (variations are possible) of the original data, and each split in the tree is only allowed to choose from a subset of the variables available. To do this I used the {randomForest} R package, but it’s not efficiently written and is really pushing its limits with data of this size on modest hardware like mine. For classification problems the amazing open source H2O (written in Java but binding nicely with R) gives super-efficient and scalable implementations of Random Forests and of deep learning neural networks, but it doesn’t work with a continuous response variable.

Training a Random Forest requires you to specify how many explanatory variables to make available at each split, and the best way to decide this is via cross-validation.

Cross-validation is all about splitting the data into a number of different training and testing sets, to get around the problem of using a single hold-out test set for multiple purposes. It’s better to give each bit of the data a turn as the hold-out test set. In the tuning exercise below, I divide the data into ten so I can try different values of the “mtry” parameter in my randomForest fitting and see the average Root Mean Square Error for the ten fits for each value of mtry. “mtry” defines the number of variables the tree-building algorithm has available to it at each split of the tree. For forests with a continuous response variable like mine, the default value is the number of variables divided by three and I have 10 variables, so I try a range of options from 1 to 6 as the subset of variables for the tree to choose from at each split. It turns out the conventional default value of mtry = 3 is in fact the best:

[Plot: cross-validation RMSE for different values of mtry]

Here’s the code for this home-made cross-validation of randomForest:

#-----------------random forest----------
# Hold ntree constant and try different values of mtry
# values of m to try for mtry for cross-validation tuning
m <- c(1, 2, 3, 4, 5, 6)

folds <- 10

cvData <- trainData %>%
   mutate(group = sample(1:folds, nrow(trainData), replace = TRUE))

results <- matrix(numeric(length(m) * folds), ncol = folds)



# Cross validation, done by hand with single processing - not very efficient or fast:
for(i in 1:length(m)){
   message(i)
   for(j in 1:folds){
      
      cv_train <- cvData %>% filter(group != j) %>% select(-group)
      cv_test <- cvData %>% filter(group == j) %>% select(-group)

      tmp <- randomForest(income ~ ., data = cv_train, ntree = 100, mtry = m[i], 
                          nodesize = 10, importance = FALSE, replace = FALSE)
      tmp_p <- predict(tmp, newdata = cv_test)
      
      results[i, j] <- RMSE(tmp_p, cv_test$income)
      print(paste("mtry", m[i], j, round(results[i, j], 2), sep = " : "))
   }
}

results_df <- as.data.frame(results)
results_df$mtry <- m

svg("..http://ellisp.github.io/img/0026-rf-cv.svg", 6, 4)
print(
   results_df %>% 
   gather(trial, RMSE, -mtry) %>% 
   ggplot() +
   aes(x = mtry, y = RMSE) +
   geom_point() +
   geom_smooth(se = FALSE) +
   ggtitle(paste0(folds, "-fold cross-validation for random forest;\ndiffering values of mtry"))
)
dev.off()

Having determined a value of three for mtry (the number of variables made available at each split), we re-fit the Random Forest with the full training dataset. It’s interesting to see the “importance” of the different variables – which ones contribute most across the trees in the forest. This is the best way of relating a Random Forest to a theoretical question; otherwise their black-box nature makes them harder to interpret than a more traditional regression, with its t tests and confidence intervals for each explanatory variable.

It’s also good to note that after the first 300 or so trees, increasing the size of the forest seems to have little impact.

[Plots: variable importance in the random forest, and RMSE improvement with increasing number of trees]

Here’s the code that fits this forest to the training data and draws those plots:

# refit model with full training data set
rf <- randomForest(income ~ ., 
                    data = trainData, 
                    ntree = 500, 
                    mtry = 3,
                    importance = TRUE,
                    replace = FALSE)


# importances
ir <- as.data.frame(importance(rf))
ir$variable  <- row.names(ir)

p1 <- ir %>%
   arrange(IncNodePurity) %>%
   mutate(variable = factor(variable, levels = variable)) %>%
   ggplot(aes(x = IncNodePurity, y = variable)) + 
   geom_point() +
   labs(x = "Importance of contribution tonestimating income", 
        title = "Variables in the random forest")

# changing RMSE as more trees added
tmp <- data_frame(ntrees = 1:500, RMSE = sqrt(rf$mse))
p2 <- ggplot(tmp, aes(x = ntrees, y = RMSE)) +
   geom_line() +
   labs(x = "Number of trees", y = "Root mean square error",
        title = "Improvement in predictionnwith increasing number of trees")

grid.arrange(p1, p2, ncol = 2)

Extreme gradient boosting

I wanted to check out extreme gradient boosting as an alternative prediction method. Like Random Forests, this method is based on a forest of many regression trees, but in the case of boosting each tree is relatively shallow (not many layers of branch divisions), and the trees are not independent of each other. Instead, successive trees are built specifically to explain the observations poorly explained by previous trees – this is done by giving extra weight to outliers from the prediction to date.

Boosting is prone to over-fitting and if you let it run long enough it will memorize the entire training set (and be useless for new data), so it’s important to use cross-validation to work out how many iterations are worth using, and at what point it stops picking up general patterns and starts fitting the idiosyncrasies of the training sample. The excellent {xgboost} R package by Tianqi Chen, Tong He and Michael Benesty applies gradient boosting algorithms super-efficiently and comes with built-in cross-validation functionality. In this case it becomes clear that 15 or 16 rounds is the maximum amount of boosting before overfitting takes place, so my final boosting model is fit to the full training data set with that number of rounds.

#-------xgboost------------
sparse_matrix <- sparse.model.matrix(income ~ . -1, data = trainData)

# boosting with different levels of rounds.  After 16 rounds it starts to overfit:
xgb.cv(data = sparse_matrix, label = trainY, nrounds = 25, objective = "reg:linear", nfold = 5)

mod_xg <- xgboost(sparse_matrix, label = trainY, nrounds = 16, objective = "reg:linear")

Two stage Random Forests

My final serious candidate for a predictive model is a two stage Random Forest. One of my problems with this data is the big spike at $0 income per week, and this suggests a possible way of modelling it in two steps:

  • first, fit a classification model to predict the probability of an individual, based on their characteristics, having any income at all
  • second, fit a regression model, conditional on them having any income and trained only on those observations with non-zero income, to predict the size of their income (which may be positive or negative).

The individual models could be chosen from many options but I’ve opted for Random Forests in both cases. Because the first stage is a classification problem, I can use the more efficient H2O platform to fit it – much faster.

#---------------------two stage approach-----------
# this is the only method that preserves the bimodal structure of the response
# Initiate an H2O instance that uses 4 processors and up to 2GB of RAM
h2o.init(nthreads = 4, max_mem_size = "2G")

var_names <- names(trainData)[!names(trainData) == "income"]

trainData2 <- trainData %>%
   mutate(income = factor(income != 0)) %>%
   as.h2o()

mod1 <- h2o.randomForest(x = var_names, y = "income",
                         training_frame = trainData2,
                         ntrees = 1000)

trainData3 <- trainData %>% filter(income != 0) 
mod2 <- randomForest(income ~ ., 
                     data = trainData3, 
                     ntree = 250, 
                     mtry = 3, 
                     nodesize = 10, 
                     importance = FALSE, 
                     replace = FALSE)

Traditional regression methods

As a baseline, I also fit three more traditional linear regression models:

  • one with all variables
  • one with all variables and many of the obvious two way interactions
  • a stepwise selection model.

I’m not a big fan of stepwise selection, for all sorts of reasons, but if it is done carefully, and you refrain from interpreting the final model as though it were specified in advance (which virtually everyone gets wrong), it has its place. It’s certainly a worthwhile comparison point, as stepwise selection still prevails in many fields despite the development in recent decades of much better methods of model building.

Here’s the code that fit those ones:

#------------baseline linear models for reference-----------
lin_basic <- lm(income ~ sex + agegrp + occupation + qualification + region +
                   sqrt(hours) + Asian + European + Maori + MELAA + Other + Pacific + Residual, 
                data = trainData)          # first order only
lin_full  <- lm(income ~ (sex + agegrp + occupation + qualification + region +
                   sqrt(hours) + Asian + European + Maori + MELAA + Other + Pacific + Residual) ^ 2, 
                data = trainData)  # second order interactions and polynomials
lin_fullish <- lm(income ~ (sex + Maori) * (agegrp + occupation + qualification + region +
                     sqrt(hours)) + Asian + European + MELAA + 
                     Other + Pacific + Residual,
                  data = trainData) # selected interactions only

lin_step <- stepAIC(lin_fullish, k = log(nrow(trainData))) # bigger penalisation for parameters given large dataset

Results – predictive power

I used root mean square error of the predictions of (transformed) income in the hold-out test set – which had not been touched so far in the model-fitting – to get an assessment of how well the various methods perform. The results are shown in the plot below. Extreme gradient boosting and my two stage Random Forest approaches are neck and neck, followed by the single tree and the random decision forest, with the traditional linear regressions making up the “also rans”.

[Plot: root mean square error on the hold-out test set for each model type]

I was surprised to see that a humble single regression tree out-performed my home made random decision forest, but concluded that this is probably something to do with the relatively small number of explanatory variables to choose from, and the high performance of “hours worked” and “occupation” in predicting income. A forest (or spinney…) that excludes those variables from whole trees at a time will be dragged down by trees with very little predictive power. In contrast, Random Forests choose from a random subset of variables at each split, so excluding hours from the choice in one split doesn’t deny it to future splits in the tree, and the tree as a whole still makes a good contribution.

It’s useful to compare at a glance the individual-level predictions of all these different models on some of the hold-out set, and I do this in the scatterplot matrix below. The predictions from different models are highly correlated with each other (correlation of well over 0.9 in all cases), and less strongly correlated with the actual income. This difference is caused by the fact that the observed income includes individual level random variance, whereas all the models are predicting some kind of centre value for income given the various demographic values. This is something I come back to in the next stage, when I want to predict a full distribution.

[Plot: scatterplot matrix of predictions from each model and actual income]

Here’s the code that produces the predicted values of all the models on the test set and produces those summary plots:

#---------------compare predictions on test set--------------------
# prediction from tree
tree_preds <- predict(rpartTree, newdata = testData)

# prediction from the random decision forest
rdf_preds <- rep(NA, nrow(testData))
for(i in 1:reps){
   tmp <- predict(home_made_rf[[i]], newdata = testData)
   rdf_preds <- cbind(rdf_preds, tmp)
}
rdf_preds <- apply(rdf_preds, 1, mean, na.rm= TRUE)

# prediction from random forest
rf_preds <- as.vector(predict(rf, newdata = testData))

# prediction from linear models
lin_basic_preds <- predict(lin_basic, newdata = testData)
lin_full_preds <- predict(lin_full, newdata = testData)
lin_step_preds <-  predict(lin_step, newdata = testData)

# prediction from extreme gradient boosting
xgboost_pred <- predict(mod_xg, newdata = sparse.model.matrix(income ~ . -1, data = testData))

# prediction from two stage approach
prob_inc <- predict(mod1, newdata = as.h2o(select(testData, -income)), type = "response")[ , "TRUE"]
pred_inc <- predict(mod2, newdata = testData)
pred_comb <- as.vector(prob_inc > 0.5)  * pred_inc
h2o.shutdown(prompt = F) 

rmse <- rbind(
   c("BasicLinear", RMSE(lin_basic_preds, obs = testY)), # 21.31
   c("FullLinear", RMSE(lin_full_preds, obs = testY)),  # 21.30
   c("StepLinear", RMSE(lin_step_preds, obs = testY)),  # 21.21
   c("Tree", RMSE(tree_preds, obs = testY)),         # 20.96
   c("RandDecForest", RMSE(rdf_preds, obs = testY)),       # 21.02 - NB *worse* than the single tree!
   c("randomForest", RMSE(rf_preds, obs = testY)),        # 20.85
   c("XGBoost", RMSE(xgboost_pred, obs = testY)),    # 20.78
   c("TwoStageRF", RMSE(pred_comb, obs = testY))       # 21.11
   )

rmse %>%
   as.data.frame(stringsAsFactors = FALSE) %>%
   mutate(V2 = as.numeric(V2)) %>%
   arrange(V2) %>%
   mutate(V1 = factor(V1, levels = V1)) %>%
   ggplot(aes(x = V2, y = V1)) +
   geom_point() +
   labs(x = "Root Mean Square Error (smaller is better)",
        y = "Model type",
        title = "Predictive performance on hold-out test set of different models of individual income")

#------------comparing results at individual level------------
pred_results <- data.frame(
   BasicLinear = lin_basic_preds,
   FullLinear = lin_full_preds,
   StepLinear = lin_step_preds,
   Tree = tree_preds,
   RandDecForest = rdf_preds,
   randomForest = rf_preds,
   XGBoost = xgboost_pred,
   TwoStageRF = pred_comb,
   Actual = testY
)

pred_res_small <- pred_results[sample(1:nrow(pred_results), 1000),]

ggpairs(pred_res_small)

Building the Shiny app

There are a few small preparatory steps now before I can put the results of my model into an interactive web app, which will be built with Shiny.

I opt for the two stage Random Forest model as the best way of re-creating the income distribution. It will let me create simulated data with a spike at zero dollars of income in a way none of the other models (which focus just on averages) will do; plus it is equal best (with extreme gradient boosting) in overall predictive power.

Adding back in individual level variation

After refitting my final model to the full dataset, my first substantive problem is to recreate the full distribution, with individual level randomness, not just a predicted value at each point. On my transformed scale for income, the residuals from the models are fairly homoskedastic, so I decide that the Shiny app will simulate a population at any point by sampling with replacement from the residuals of the second stage model.

I save the models, the residuals, and the various dimension variables for my Shiny app.

#----------------shiny app-------------
# dimension variables for the user interface:
d_sex <- sort(as.character(unique(nzis$sex)))
d_agegrp <- sort(as.character(unique(nzis$agegrp)))
d_occupation <- sort(as.character(unique(nzis$occupation)))
d_qualification <- sort(as.character(unique(nzis$qualification)))
d_region <- sort(as.character(unique(nzis$region)))

save(d_sex, d_agegrp, d_occupation, d_qualification, d_region,
     file = "_output/0026-shiny/dimensions.rda")

# tidy up data of full dataset, combining various ethnicities into an 'other' category:     
nzis_shiny <- nzis %>% 
   select(-use) %>%
   mutate(Other = factor(ifelse(Other == "Yes" | Residual == "Yes" | MELAA == "Yes",
                         "Yes", "No"))) %>%
   select(-MELAA, -Residual)
   
for(col in c("European", "Asian", "Maori", "Other", "Pacific")){
   nzis_shiny[ , col]   <- ifelse(nzis_shiny[ , col] == "Yes", 1, 0)
   }

# Refit the models to the full dataset
# income a binomial response for first model
nzis_rf <- nzis_shiny %>%  mutate(income = factor(income !=0))
mod1_shiny <- randomForest(income ~ ., data = nzis_rf,
                           ntree = 500, importance = FALSE, mtry = 3, nodesize = 5)
save(mod1_shiny, file = "_output/0026-shiny/mod1.rda")

nzis_nonzero <- subset(nzis_shiny, income != 0) 

mod2_shiny <- randomForest(income ~ ., data = nzis_nonzero, ntree = 500, mtry = 3, 
                           nodesize = 10, importance = FALSE, replace = FALSE)

res <- predict(mod2_shiny) - nzis_nonzero$income   # residuals on the transformed income scale
nzis_skeleton <- nzis_shiny[0, ]
all_income <- nzis$income

save(mod2_shiny, res, nzis_skeleton, all_income, nzis_shiny,
   file = "_output/0026-shiny/models.rda")

Contextual information – how many people are like “that” anyway?

After my first iteration of the web app, I realised that it could be badly misleading by giving a full distribution for a non-existent combination of demographic variables. For example, Maori female managers aged 15-19 with Bachelor or Higher qualification and living in Southland (predicted to have median weekly income of $932 for what it’s worth).

I realised that for meaningful context I needed a model that estimated the number of people in New Zealand with the particular combination of demographics selected. This is something that traditional survey estimation methods don’t provide, because individuals in the sample are weighted to represent a discrete number of exactly similar people in the population; there’s no “smoothing” impact allowing you to widen inferences to similar but not-identical people.

Fortunately this problem is simpler than the income modelling problem above and I use a straightforward generalized linear model with a Poisson response to create the seeds of such a model, with smoothed estimates of the number of people for each combination of demographics. I then can use iterative proportional fitting to force the marginal totals for each explanatory variable to match the population totals that were used to weight the original New Zealand Income Survey. Explaining this probably deserves a post of its own, but no time for that now.

#---------------population--------

nzis_pop <- expand.grid(d_sex, d_agegrp, d_occupation, d_qualification, d_region,
                        c(1, 0), c(1, 0), c(1, 0), c(1, 0), c(1, 0))
names(nzis_pop) <-  c("sex", "agegrp", "occupation", "qualification", "region",
                      "European", "Maori", "Asian", "Pacific", "Other")
nzis_pop$count <- 0
for(col in c("European", "Asian", "Maori", "Other", "Pacific")){
 nzis_pop[ , col]   <- as.numeric(nzis_pop[ , col])
}

nzis_pop <- nzis_shiny %>%
   select(-hours, -income) %>%
   mutate(count = 1) %>%
   rbind(nzis_pop) %>%
   group_by(sex, agegrp, occupation, qualification, region, 
             European, Maori, Asian, Pacific, Other) %>%
   summarise(count = sum(count)) %>%
   ungroup() %>%
   mutate(Ethnicities = European + Maori + Asian + Pacific + Other) %>%
   filter(Ethnicities %in% 1:2) %>%
   select(-Ethnicities)

# this pushes my little 4GB of memory to its limits:
 mod3 <- glm(count ~ (sex + Maori) * (agegrp + occupation + qualification) + region + 
                Maori:region + occupation:qualification + agegrp:occupation +
                agegrp:qualification, 
             data = nzis_pop, family = poisson)
 
 nzis_pop$pop <- predict(mod3, type = "response")

# total population should be (1787 + 1410) * 1000 = 319700.  But we also want
# the marginal totals (eg all men, or all women) to match the sum of weights
# in the NZIS (where wts = 319700 / 28900 = 1174).  So we use the raking method
# for iterative proportional fitting of survey weights

wt <- 1174

sex_pop <- nzis_shiny %>%
   group_by(sex) %>%
   summarise(freq = length(sex) * wt)

agegrp_pop <- nzis_shiny %>%
   group_by(agegrp) %>%
   summarise(freq = length(agegrp) * wt)

occupation_pop <- nzis_shiny %>%
   group_by(occupation) %>%
   summarise(freq = length(occupation) * wt)

qualification_pop <- nzis_shiny %>%
   group_by(qualification) %>%
   summarise(freq = length(qualification) * wt)

region_pop <- nzis_shiny %>%
   group_by(region) %>%
   summarise(freq = length(region) * wt)

European_pop <- nzis_shiny %>%
   group_by(European) %>%
   summarise(freq = length(European) * wt)

Asian_pop <- nzis_shiny %>%
   group_by(Asian) %>%
   summarise(freq = length(Asian) * wt)

Maori_pop <- nzis_shiny %>%
   group_by(Maori) %>%
   summarise(freq = length(Maori) * wt)

Pacific_pop <- nzis_shiny %>%
   group_by(Pacific) %>%
   summarise(freq = length(Pacific) * wt)

Other_pop <- nzis_shiny %>%
   group_by(Other) %>%
   summarise(freq = length(Other) * wt)

nzis_svy <- svydesign(~1, data = nzis_pop, weights = ~pop)

nzis_raked <- rake(nzis_svy,
                   sample = list(~sex, ~agegrp, ~occupation, 
                                 ~qualification, ~region, ~European,
                                 ~Maori, ~Pacific, ~Asian, ~Other),
                   population = list(sex_pop, agegrp_pop, occupation_pop,
                                     qualification_pop, region_pop, European_pop,
                                     Maori_pop, Pacific_pop, Asian_pop, Other_pop),
                   control = list(maxit = 20, verbose = FALSE))

nzis_pop$pop <- weights(nzis_raked)

save(nzis_pop, file = "_output/0026-shiny/nzis_pop.rda")

The final shiny app

To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff - R.


How To Import Data Into R – New Course


(This article was first published on DataCamp Blog, and kindly contributed to R-bloggers)

Importing your data into R to start your analyses: it should be the easiest step. Unfortunately, this is almost never the case. Data is stored in all sorts of formats, ranging from flat files to other statistical software files to databases and web data. A skilled data scientist knows which techniques to use in order to proceed with the analysis of data.

In our latest course, Importing Data Into R, you will learn the basics on how to get up and running in no time! Start the new Importing & Cleaning Data course for free, today.


What you’ll learn

This 4-hour course includes 5 chapters and covers the topics below (a small sketch of the kind of import functions involved follows the list):

  • Chapter 1: Learn how to import data from flat files without hesitation using the readr and data.table packages, in addition to harnessing the power of the fread function.

  • Chapter 2: You will excel at loading .xls and .xlsx files with help from packages such as: readxl, gdata, and XLConnect.

  • Chapter 3: Help out your friends that are still paying for their statistical software and import their datasets from SAS, STATA, and SPSS using the haven and foreign packages.

  • Chapter 4: Pull data in style from popular relational databases using SQL.

  • Chapter 5: Learn the valuable skill of importing data from the web.
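
For a flavour of the functions involved, a minimal sketch (the file names below are made up and are not course material):

# a minimal sketch of the kinds of import functions the chapters cover
# (file names below are made up)
library(readr)
library(data.table)
library(readxl)
library(haven)

flights <- read_csv("flights.csv")                  # readr: flat files
sales   <- fread("sales.csv")                       # data.table: fast flat-file import
budget  <- read_excel("budget.xlsx", sheet = 1)     # readxl: Excel files
survey  <- read_sas("survey.sas7bdat")              # haven: SAS files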

Start your data importing journey here!

To leave a comment for the author, please follow the link and comment on their blog: DataCamp Blog.


A Million Text Files And A Single Laptop


(This article was first published on R – randyzwitch.com, and kindly contributed to R-bloggers)
[Image: a folder containing hundreds of thousands of tiny text files]

Wait…What? Why?

More often than I would like, I receive datasets where the data has only been partially cleaned, such as the picture above: hundreds, thousands… even millions of tiny files. Usually when this happens, the data all have the same format (such as having been generated by sensors or other memory-constrained devices).

The problem with data like this is that 1) it’s inconvenient to think about a dataset as a million individual pieces, 2) the data in aggregate are too large to hold in RAM, but 3) the data are small enough that using Hadoop or even a relational database seems like overkill.

Surprisingly, with judicious use of GNU Parallel, stream processing and a relatively modern computer, you can efficiently process annoying, “medium-sized” data as described above.

Data Generation

For this blog post, I used a combination of R and Python to generate the data: the “Groceries” dataset from the arules package for sampling transactions (with replacement), and the Python Faker (fake-factory) package to generate fake customer profiles and for creating the 1MM+ text files.

The contents of the data itself isn’t important for this blog post, but the data generation code is posted as a GitHub gist should you want to run these commands yourself.

Problem 1: Concatenating (cat * >> out.txt ?!)

The cat utility in Unix-y systems is familiar to most anyone who has ever opened up a Terminal window. Take some or all of the files in a folder, concatenate them together….one big file. But something funny happens once you get enough files…

$ cat * >> out.txt
-bash: /bin/cat: Argument list too long

That’s a fun thought…too many files for the computer to keep track of. As it turns out, many Unix tools will only accept about 10,000 arguments; the use of the asterisk in the `cat` command gets expanded before running, so the above statement passes 1,234,567 arguments to `cat` and you get an error message.

One (naive) solution would be to loop over every file (a completely serial operation):

for f in *; do cat "$f" >> ../transactions_cat/transactions.csv; done

Roughly 10,093 seconds later, you’ll have your concatenated file. Three hours is quite a coffee break…

Solution 1: GNU Parallel & Concatenation

Above, I mentioned that looping over each file gets you past the error condition of too many arguments, but it is a serial operation. If you look at your computer usage during that operation, you’ll likely see that only a fraction of a core of your computer’s CPU is being utilized. We can greatly improve that through the use of GNU Parallel:

ls | parallel -m -j $f "cat {} >> ../transactions_cat/transactions.csv"

The `$f` argument in the code is to highlight that you can choose the level of parallelism; however, you will not get infinitely linear scaling, as shown below (graph code, Julia):

Given that the graph represents a single run at each level of parallelism, it’s a bit difficult to say exactly where the parallelism gets maxed out, but at roughly 10 concurrent jobs, there’s no additional benefit. It’s also interesting to point out what the `-m` argument represents; by specifying `m`, you allow multiple arguments (i.e. multiple text files) to be passed as inputs into parallel. This alone leads to an 8x speedup over the naive loop solution.

Problem 2: Data > RAM

Now that we have a single file, we’ve removed the “one million files” cognitive dissonance, but now we have a second problem: at 19.93GB, the amount of data exceeds the RAM in my laptop (2014 MBP, 16GB of RAM). So in order to do analysis, either a bigger machine is needed or processing has to be done in a streaming or “chunked” manner (such as using the “chunksize” keyword in pandas).

But continuing on with our use of GNU Parallel, suppose we wanted to answer the following types of questions about our transactions data:

  1. How many unique products were sold?
  2. How many transactions were there per day?
  3. How many total items were sold per store, per month?

If it’s not clear from the list above, in all three questions there is an “embarrassingly parallel” portion of the computation. Let’s take a look at how to answer all three of these questions in a time- and RAM-efficient manner:

Q1: Unique Products

Given the format of the data file (transactions in a single column array), this question is the hardest to parallelize, but using a neat trick with the `tr` (transliterate) utility, we can map our data to one product per row as we stream over the file:
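
The embedded code isn’t shown here, but the idea is a one-liner along these lines (transactions.csv standing in for the concatenated file):

# serial sketch: one product per line, de-duplicated, then counted
cat transactions.csv | tr ',' '\n' | sort -u | wc -l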

The trick here is that we swap the comma-delimited transactions with the newline character; the effect of this is taking a single transaction row and returning multiple rows, one for each product. Then we pass that down the line, eventually using `sort -u` to de-dup the list and `wc -l` to count the number of unique lines (i.e. products).

In a serial fashion, it takes quite some time to calculate the number of unique products. Incorporating GNU Parallel, just using the defaults, gives nearly a 4x speedup!

Q2. Transactions By Day

If the file format could be considered undesirable in question 1, for question 2 the format is perfect. Since each row represents a transaction, all we need to do is perform the equivalent of a SQL `Group By` on the date and sum the rows:
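
Again, the embedded code isn’t shown here; a serial sketch, assuming (purely for illustration) that the date is the first comma-separated field, might be:

# serial sketch: count rows per date (the field number is an assumption)
cut -d',' -f1 transactions.csv | sort | uniq -c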

Using GNU Parallel starts to become complicated here, but you do get a 9x speed-up by calculating rows by date in chunks, then “reducing” again by calculating total rows by date (a trick I picked up at this blog post).

Q3. Total Items Per Store, Per Month

For this example, it could be that my command-line fu is weak, but the serial method actually turns out to be the fastest. Of course, at a 14 minute run time, the real-time benefits to parallelization aren’t that great.

It may be possible that one of you out there knows how to do this correctly, but an interesting thing to note is that the serial version already uses 40-50% of the available CPU. So parallelization might yield a 2x speedup, but seven minutes extra per run isn’t worth spending hours trying to find the optimal settings.

But, I’ve got MULTIPLE files…

The three examples above showed that it’s possible to process datasets larger than RAM in a realistic amount of time using GNU Parallel. However, the examples also showed that working with Unix utilities can become complicated rather quickly. Shell scripts can help move beyond the “one-liner” syndrome, when the pipeline gets so long you lose track of the logic, but eventually problems are more easily solved using other tools.

The data that I generated at the beginning of this post represented two concepts: transactions and customers. Once you get to the point where you want to do joins, summarize by multiple columns, estimate models, etc., loading data into a database or an analytics environment like R or Python makes sense. But hopefully this post has shown that a laptop is capable of analyzing WAY more data than most people believe, using many tools written decades ago.

To leave a comment for the author, please follow the link and comment on their blog: R – randyzwitch.com.


New Yorkers, municipal bikes, and the weather


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Like many modern cities, New York offers a public pick-up/drop-off bicycle service (called Citi Bike). Subscribing Citi Bike members can grab a bike from almost 500 stations scattered around the city, hop on and ride to their destination, and drop the bike at a nearby station. (Visitors to the city can also purchase day passes.) The Citi Bike program shares data with the public about the operation of the service: time and location of pick-ups and drop-offs, and basic demographic data (age and gender) of subscriber riders. 

Data Scientist Todd Schneider has followed up on his tour-de-force analysis of Taxi Rides in NYC with a similar analysis of the Citi Bike data. Check out the wonderful animation of bike rides on September 16 below. While the Citi Bike data doesn't include actual trajectories (just the pick-up and drop-off locations), Todd has "interpolated" these points using Google Maps biking directions. Though these may not match actual routes (and give extra weight to roads with bike lanes), it's nonetheless an elegant visualization of bike commuter patterns in the city.

 

Check out in particular the rush hours of 7-9AM and 4-6PM. September 16 was a Wednesday, but as Todd shows in the chart below, biking patterns are very different on the weekends as the focus switches from commuting to pleasure rides.

[Chart: trips by hour]

Todd also matched the biking data with NYC weather data to take a look at its effect on biking patterns. Unsurprisingly, low temperatures and rain both have a dampening effect (pun intended!) on ridership: one inch of rain deters as many riders as a 24-degree (F) drop in temperature. Surprisingly, snow doesn't have such a dramatic effect: an inch of snow depresses ridership like a 1.4 degree drop in temperature. (However, Todd's data doesn't include the recent blizzard in New York, from which many Citi Bike stations are still waiting to be dug out.)

[Chart: daily weekday trips vs. precipitation]

Todd conducted all of the analysis and data visualization with the R language (he shares the R code on Github). He mainly used the RPostgreSQL package for data extraction, the dplyr package for the data manipulation, the ggplot2 package for the graphics, and the minpack.lm package for the nonlinear least squares analysis of the weather impact.

There's plenty more detail to the analysis, including the effects of age and gender on cycling speed. For the complete analysis and lots more interesting charts, follow the link to the blog post below.

Todd W. Schneider: A Tale of Twenty-Two Million Citi Bikes: Analyzing the NYC Bike Share System

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Microsoft R Server Now Available in the Cloud on Azure Marketplace


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Richard Kittler, Microsoft R Server PM, Microsoft Advanced Analytics

In January, Microsoft announced the re-branding of Revolution R Enterprise as Microsoft R Server, and the release of Microsoft R Server 2016 (aka 8.0). Highlights of what’s new in version 8.0 include an updated R engine (R 3.2.2), new fuzzy matching algorithms, the ability to write to databases via ODBC, and a streamlined install experience.

This latest release is now available as virtual machines (VMs) in Azure’s world-wide cloud infrastructure via the Azure Marketplace. The VMs can be found in the Marketplace here, or by searching for “R Server” in the Marketplace or Azure Portal. Note that although the new OpenLogic CentOS Linux VM is branded as “Microsoft R Server”, the Windows VM will remain tagged as “RRE for Windows” until later in the year.


Use of VMs can be a practical means of trying out the software, gaining flex capacity, and reducing the burden of system administration. Through the R Server VMs, users can run computations on data sets up to 1 terabyte on cloud-based Windows and Linux multi-CPU instances from 4 to 32 vCPUs (virtual CPUs), accessing data copied from an Azure data store, including blob storage, Azure Data Lake Store, or SQL Server, or accessed directly through an ODBC connection. Utility pricing for both the Windows and Linux versions starts at $1.50 per 4-cores per hour with no long-term commitments.

The Windows VM can be accessed with Windows Remote Desktop and includes Microsoft’s Visual Studio-based IDE for R developers. Access to the Linux VM is via SSH with R access through the RGui. Alternatively, customers have the option of bringing their own license for a favorite IDE such as RStudio, RStudio Server, or StatET. Both the Windows and Linux offerings include DeployR Open web services.

A 30-day free trial is available with no software charge, though standard Azure cloud usage charges apply. However, if you are new to Azure then you also have the benefit of a 30-day Azure free trial which will cover the cost of the underlying VM during your R Server free trial! Support is available through the monitored Azure Machine Learning Forum or through a paid Azure Support plan. Lastly, if you are new to R Server (fka RRE) then you may wish to take a look at this free intro to RRE course on DataCamp.

Additional Notes:

  • These new VMs supersede those for RRE 7.4.1 under the Revolution brand. Revolution’s Marketplace offers have been retired. 
  • Although for development use only, the latest release of Microsoft R Server is also available as part of the Azure Data Science VM on Windows.
  • To learn more about R Server, see the Microsoft R Server page on MSDN.

Azure Marketplace: Microsoft R Server

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


SQL Server 2016 launch showcases R


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Microsoft officially launched SQL Server 2016 at the Data Driven event in New York City last week, and R featured prominently.


One of the highlights from Joseph Sirosh's keynote was a demonstration of the World Wide Telescope, an online tool that allows budding astronomers (and professionals!) to explore the visible universe using imagery from the Digitized Sky Survey. The Survey contains images from thousands of galaxies that have not yet been classified by astronomers into one of the standard forms: elliptical, spiral, or lenticular. In the demo, the World Wide Telescope automatically classifies the galaxies in mere seconds:

 

The automated classification was based on a random forests model, using the manually-classified galaxies in the database as the training set. The model was trained and run using R (specifically, SQL Server R Services) running within SQL Server 2016. You can see the R code embedded within the T-SQL used in SQL Server in the screenshot below.

[Screenshot: R code embedded in T-SQL]

Notably, R wasn't the only open-source technology featured in this demo. In fact, the demo was running SQL Server on Linux, which is in preview now and will be available in 2017.

If you'd like to explore R and SQL Server in more detail, the launch also saw the release of a number of in-depth videos featuring many of the developers involved in the project. (While I had a small cameo in Scott Guthrie's keynote presentation on some of the team involved in the project, the names below deserve much more credit for the R integration.) Click on the links below for the videos:

Official Microsoft Blog: SQL Server 2016: The database for mission-critical intelligence 

 

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Analyzing Golden State Warriors’ passing network using GraphFrames in Spark


(This article was first published on Opiate for the masses, and kindly contributed to R-bloggers)


Databricks recently announced GraphFrames, an awesome Spark extension for implementing graph processing using DataFrames.
I performed graph analysis and visualized the beautiful ball-movement network of the Golden State Warriors using the rich data provided by NBA.com’s stats.

Pass network of Warriors

Passes received & made

The league’s MVP Stephen Curry received the most passes and the team’s MVP Draymond Green provided the most passes.
We’ve seen most of the offense start with their pick & roll or Curry’s off-ball cuts with Green as a pass provider.


inDegree
id inDegree
CurryStephen 3993
GreenDraymond 3123
ThompsonKlay 2276
LivingstonShaun 1925
IguodalaAndre 1814
BarnesHarrison 1241
BogutAndrew 1062
BarbosaLeandro 946
SpeightsMarreese 826
ClarkIan 692
RushBrandon 685
EzeliFestus 559
McAdooJames Michael 182
VarejaoAnderson 67
LooneyKevon 22
outDegree
id outDegree
GreenDraymond 3841
CurryStephen 3300
IguodalaAndre 1896
LivingstonShaun 1878
BogutAndrew 1660
ThompsonKlay 1460
BarnesHarrison 1300
SpeightsMarreese 795
RushBrandon 772
EzeliFestus 765
BarbosaLeandro 758
ClarkIan 597
McAdooJames Michael 261
VarejaoAnderson 94
LooneyKevon 36

Label Propagation

Label Propagation is an algorithm to find communities in a graph network.
The algorithm nicely classifies players into backcourt and frontcourt without being given any labels!

name label
Thompson, Klay 3
Barbosa, Leandro 3
Curry, Stephen 3
Clark, Ian 3
Livingston, Shaun 3
Rush, Brandon 7
Green, Draymond 7
Speights, Marreese 7
Bogut, Andrew 7
McAdoo, James Michael 7
Iguodala, Andre 7
Varejao, Anderson 7
Ezeli, Festus 7
Looney, Kevon 7
Barnes, Harrison 7

Pagerank

PageRank can detect important nodes (players in this case) in a network.
It’s no surprise that Stephen Curry, Draymond Green and Klay Thompson are the top three.
The algorithm detects that Shaun Livingston and Andre Iguodala play key roles in the Warriors’ passing games.

name pagerank
Curry, Stephen 2.17
Green, Draymond 1.99
Thompson, Klay 1.34
Livingston, Shaun 1.29
Iguodala, Andre 1.21
Barnes, Harrison 0.86
Bogut, Andrew 0.77
Barbosa, Leandro 0.72
Speights, Marreese 0.66
Clark, Ian 0.59
Rush, Brandon 0.57
Ezeli, Festus 0.48
McAdoo, James Michael 0.27
Varejao, Anderson 0.19
Looney, Kevon 0.16

Everything together


Here is a network visualization using the results above:

  • Node size: pagerank
  • Node color: community
  • Link width: passes received & made

Workflow

Calling API

I used the endpoint playerdashptpass and saved data for all the players in the team into local JSON files.
The data is about who passed how many times in the 2015-16 season.

import os

# GSW player IDs
playerids = [201575,201578,2738,202691,101106,2760,2571,203949,203546,
203110,201939,203105,2733,1626172,203084]

# Calling API and store the results as JSON
for playerid in playerids:
    os.system('curl "http://stats.nba.com/stats/playerdashptpass?'
        'DateFrom=&'
        'DateTo=&'
        'GameSegment=&'
        'LastNGames=0&'
        'LeagueID=00&'
        'Location=&'
        'Month=0&'
        'OpponentTeamID=0&'
        'Outcome=&'
        'PerMode=Totals&'
        'Period=0&'
        'PlayerID={playerid}&'
        'Season=2015-16&'
        'SeasonSegment=&'
        'SeasonType=Regular+Season&'
        'TeamID=0&'
        'VsConference=&'
        'VsDivision=" > {playerid}.json'.format(playerid=playerid))

JSON -> pandas DataFrame

Then I combined all the individual JSON files into a single DataFrame for later aggregation.

import json
import pandas as pd

raw = pd.DataFrame()
for playerid in playerids:
    with open("{playerid}.json".format(playerid=playerid)) as json_file:
        parsed = json.load(json_file)['resultSets'][0]
        raw = raw.append(
            pd.DataFrame(parsed['rowSet'], columns=parsed['headers']))

raw = raw.rename(columns={'PLAYER_NAME_LAST_FIRST': 'PLAYER'})

raw['id'] = raw['PLAYER'].str.replace(', ', '')

Prepare vertices and edges

You need a special data format for GraphFrames in Spark: vertices and edges.
Vertices are a list of the nodes and their IDs in a graph.
Edges describe the relationships between the nodes.
You can pass additional features like weights, but I couldn’t find a way to utilize these features well in the later analysis.
The workaround I took below is brute force and not even a proper graph operation, but it works (suggestions/comments are very welcome).

# Make raw vertices
pandas_vertices = raw[['PLAYER', 'id']].drop_duplicates()
pandas_vertices.columns = ['name', 'id']

# Make raw edges
pandas_edges = pd.DataFrame()
for passer in raw['id'].drop_duplicates():
    for receiver in raw[(raw['PASS_TO'].isin(raw['PLAYER'])) &
     (raw['id'] == passer)]['PASS_TO'].drop_duplicates():
        pandas_edges = pandas_edges.append(pd.DataFrame(
        	{'passer': passer, 'receiver': receiver
        	.replace(  ', ', '')}, 
        	index=range(int(raw[(raw['id'] == passer) &
        	 (raw['PASS_TO'] == receiver)]['PASS'].values))))

pandas_edges.columns = ['src', 'dst']

Graph analysis

Bring the local vertices and edges to Spark and let it spark.

# GraphFrame comes from the graphframes Spark package; sqlContext is provided by the PySpark shell
from graphframes import GraphFrame

vertices = sqlContext.createDataFrame(pandas_vertices)
edges = sqlContext.createDataFrame(pandas_edges)

# Analysis part
g = GraphFrame(vertices, edges)
print("vertices")
g.vertices.show()
print("edges")
g.edges.show()
print("inDegrees")
g.inDegrees.sort('inDegree', ascending=False).show()
print("outDegrees")
g.outDegrees.sort('outDegree', ascending=False).show()
print("degrees")
g.degrees.sort('degree', ascending=False).show()
print("labelPropagation")
g.labelPropagation(maxIter=5).show()
print("pageRank")
g.pageRank(resetProbability=0.15, tol=0.01).vertices.sort(
    'pagerank', ascending=False).show()

Visualise the network

When you run gsw_passing_network.py from my GitHub repo, you will have passes.csv, groups.csv and size.csv in your working directory.
I used the networkD3 package in R to make a cool interactive D3 chart.

library(networkD3)

setwd('/Users/yuki/Documents/code_for_blog/gsw_passing_network')
passes <- read.csv("passes.csv")
groups <- read.csv("groups.csv")
size <- read.csv("size.csv")

passes$source <- as.numeric(as.factor(passes$PLAYER))-1
passes$target <- as.numeric(as.factor(passes$PASS_TO))-1
passes$PASS <- passes$PASS/50

groups$nodeid <- groups$name
groups$name <- as.numeric(as.factor(groups$name))-1
groups$group <- as.numeric(as.factor(groups$label))-1
nodes <- merge(groups,size[-1],by="id")
nodes$pagerank <- nodes$pagerank^2*100


forceNetwork(Links = passes,
             Nodes = nodes,
             Source = "source",
             fontFamily = "Arial",
             colourScale = JS("d3.scale.category10()"),
             Target = "target",
             Value = "PASS",
             NodeID = "nodeid",
             Nodesize = "pagerank",
             linkDistance = 350,
             Group = "group", 
             opacity = 0.8,
             fontSize = 16,
             zoom = TRUE,
             opacityNoHover = TRUE)

Code

The full code is available on GitHub.


Analyzing Golden State Warriors’ passing network using GraphFrames in Spark was originally published by Kirill Pomogajko at Opiate for the masses on March 15, 2016.

To leave a comment for the author, please follow the link and comment on their blog: Opiate for the masses.


Webinar and free e-book on data preparation with R


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Just a quick heads up that Nina Zumel, co-founder and principal consultant at Win-Vector LLC will be presenting a webinar at 10AM Pacific Time on Thursday March 17, Data Preparation Techniques with R. Nina is the co-author of Practical Data Science with R and blogs frequently at the Win-Vector blog (and contributes the occasional guest blog here), and has a wealth of experience managing data in R as a consultant. This webinar will be a great opportunity to learn tips and tricks from Nina on the best way to handle real-world data sets in R.

I'll be hosting the webinar and delivering your questions to Nina live during the event. If you can't join live, you'll receive an email after the event with a link to the slides and replay. All registrants will also receive a free copy of the new e-book Preparing Data for Analysis with R by our presenter, Nina Zumel.

Data quality is the single most important item to the success of your data science project. Preparing data for analysis is one of the most important, laborious, and yet neglected aspects of data science. Many of the routine steps can be automated in a principled manner. This webinar will lay out the statistical fundamentals of preparing data. Our speaker, Nina Zumel, principal consultant and co-founder of Win-Vector, LLC, will cover what goes wrong with data and how you can detect the problems and fix them.

To register for the webinar and receive your free e-book, follow the link below.

Microsoft Advanced Analytics and IoT: Data Preparation Techniques with R

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


apply lapply rapply sapply functions in R


(This article was first published on Data Perspective, and kindly contributed to R-bloggers)
As part of Data Science with R, this is third tutorial after basic data types,control structures in r.

One of the issues with the for loop is its memory consumption and its slowness in executing a repetitive task at hand. When dealing with large data and iterating over it, a for loop is often not advised. R provides a few alternatives that can be applied to vectors for looping operations. In this section, we deal with the apply function and its variants:

?apply

Datasets for apply family tutorial

For understanding the apply functions in R we use the data from the 1974 Motor Trend
US magazine, which comprises fuel consumption and 10 aspects of automobile design and
performance for 32 automobiles (1973–74 models).

data("mtcars")
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1


Reynolds (1994) describes a small part of a study of the long-term temperature dynamics
of the beaver Castor canadensis in north-central Wisconsin. Body temperature was measured by
telemetry every 10 minutes for four females, but data from one period of less than a
day for each of two animals is used there.


data(beavers)
head(t(beaver1)[1:4,1:10])
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
day 346.00 346.00 346.00 346.00 346.00 346.00 346.00 346.00 346.00 346.00
time 840.00 850.00 900.00 910.00 920.00 930.00 940.00 950.00 1000.00 1010.00
temp 36.33 36.34 36.35 36.42 36.55 36.69 36.71 36.75 36.81 36.88
activ 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

apply():
apply() is the base function of the family. We will learn how the apply family of functions works by trying out the code. apply() takes 3 arguments:

  • data matrix
  • row/column operation (MARGIN): 1 for row-wise operation, 2 for column-wise operation
  • function to be applied on the data.
 
When 1 is passed as the second parameter, the function max is applied row-wise and gives
us the result. In the example below, the row-wise maximum value is calculated. Since we
have four types of attributes, we get 4 results.

apply(t(beaver1),1,max)
day time temp activ
347.00 2350.00 37.53 1.00


When 2 is passed as the second parameter, the function mean is applied column-wise.
In the example below, the mean function is applied to each column and the mean of each
column is calculated, so we see one result per column.

apply(mtcars,2,mean)
mpg cyl disp hp drat wt qsec vs am gear carb
20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750 0.437500 0.406250 3.687500 2.812500

We can also pass a custom function instead of the default functions. For example, in
the code below let us take each column element modulo 10.
For this we use a custom function which takes each element from each column and
applies the modulus operation.

head(apply(mtcars,2,function(x) x%%10))
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 1.0 6 0 0 3.90 2.620 6.46 0 1 4 4
Mazda RX4 Wag 1.0 6 0 0 3.90 2.875 7.02 0 1 4 4
Datsun 710 2.8 4 8 3 3.85 2.320 8.61 1 1 4 1
Hornet 4 Drive 1.4 6 8 0 3.08 3.215 9.44 1 0 3 1
Hornet Sportabout 8.7 8 0 5 3.15 3.440 7.02 0 0 3 2
Valiant 8.1 6 5 5 2.76 3.460 0.22 1 0 3 1


lapply():
The lapply() function is applied for operations on list objects and returns a list object of the same length as the original set; each element of the returned list is the result of applying FUN to the corresponding element of the input list.

 #create a list with 2 elements 
l = list(a=1:10, b=11:20)
# the mean of the values in each element
lapply(l, mean)
$a
[1] 5.5
$b
[1] 15.5
class(lapply(l, mean))
[1] "list
# the sum of the values in each element
lapply(l, sum)
$a
[1] 55

$b
[1] 155



sapply():
sapply() is a wrapper around lapply(), with the difference being that it returns a vector or matrix instead of a list object.

 
# create a list with 2 elements

l = list(a=1:10, b=11:20)
# mean of values using sapply
sapply(l, mean)
a b
5.5 15.5

tapply():
tapply() is a very powerful function that lets you break a vector into pieces and then apply some function to each of the pieces. In the code below, the mpg values in the mtcars data are first grouped by cylinder type and then the mean() of each group is calculated.

str(mtcars$cyl)
num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
levels(as.factor(mtcars$cyl))
[1] "4" "6" "8"

In the dataset we have 3 types of cylinders and now we want to see the average mpg
for each cylinder type.

tapply(mtcars$mpg,mtcars$cyl,mean)
4 6 8
26.66364 19.74286 15.10000

In the output above we see that the average mpg for a 4-cylinder engine
is 26.664, for a 6-cylinder engine is 19.74 and for an 8-cylinder engine is 15.10.


by():
by() works similarly to the GROUP BY clause in SQL: applied over factors, it lets us apply operations on each individual result set. In the example below, we apply the colMeans() function to all the observations in the iris dataset, grouped by Species.

data(iris)
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

by(iris[,1:4],iris$Species,colMeans)
iris$Species: setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.006 3.428 1.462 0.246
------------------------------------------------------------------------------------
iris$Species: versicolor
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.936 2.770 4.260 1.326
------------------------------------------------------------------------------------
iris$Species: virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width
6.588 2.974 5.552 2.026

To leave a comment for the author, please follow the link and comment on their blog: Data Perspective.


Adobe Analytics Clickstream Data Feed: Loading To Relational Database


(This article was first published on R – randyzwitch.com, and kindly contributed to R-bloggers)

In my previous post about the Adobe Analytics Clickstream Data Feed, I showed how it was possible to take a single day worth of data and build a dataframe in R. However, most likely your analysis will require using multiple days/weeks/months of data, and given the size and complexity of the feed, loading the files into a relational database makes a lot of sense. Although there may be database-specific “fast-load” tools more appropriate for this application, this blog post will show how to handle this process using only R and PostgreSQL.

File Organization

Before getting into the loading of the data into PostgreSQL, I like to sort my files by type into separate directories (remember from the previous post, you’ll receive three files per day). R makes OS-level operations simple enough:
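As a minimal sketch of that pattern (the directory name and file pattern below are placeholders, not the original code; the same few lines get repeated for each of the three file types):

# Check for the directory, create it if needed, then move the matching files into it
if (!dir.exists("servercalls")) dir.create("servercalls")   # hypothetical directory name
f <- list.files(pattern = "hit_data")                       # hypothetical file pattern
file.rename(f, file.path("servercalls", f))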

Were there more file types, I could’ve abstracted this into a function instead of copying the code three times, but the idea is the same: Check to see if the directory exists, if it doesn’t then create it and move the files into the directory.

Connecting and Loading Data to PostgreSQL from R

Once we have our files organized, we can begin the process of loading the files into PostgreSQL using the RPostgreSQL R package. RPostgreSQL is DBI-compliant, so the connection code is the same as for any other type of database engine; the biggest caveat of loading your servercall data into a database is that the first load is almost guaranteed to require loading everything as text (using the colClasses = “character” argument in R). The reason you’ll need to load the data as text is that Adobe Analytics implementations necessarily change over time; text is the only column format that allows for no loss of data (we can fix the schema later within Postgres, either by using ALTER TABLE or by writing a view).
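A rough sketch of what that load step can look like (the connection details, file name and table name are placeholders rather than the original code):

library(RPostgreSQL)

# Connect to a local Postgres database (credentials are assumptions)
con <- dbConnect(PostgreSQL(), dbname = "adobe", user = "postgres")

# Read a daily hit file entirely as text so no data is lost
df <- read.delim("hit_data.tsv", header = FALSE, stringsAsFactors = FALSE,
                 colClasses = "character")

# The first load creates the table; later files can be added with append = TRUE
dbWriteTable(con, "servercalls", df, row.names = FALSE)

# Tell Postgres to gather statistics so queries can be planned efficiently
dbGetQuery(con, "ANALYZE servercalls")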

With this small amount of code, we’ve generated the table definition structure (see here for the underlying Postgres code), loaded the data, and told Postgres to analyze the table to gather statistics for efficient queries. Sweet, two years of data loaded with minimal effort!

Loading Lookup Tables Into PostgreSQL

With the server call data loaded into our database, we now need to load our lookup tables. Lucky for us, these do maintain a constant format, so we don’t need to worry about setting all the fields to text; RPostgreSQL should get the column types correct.

SHORTCUT: The dimension tables that are common to all report suites don’t really change over time, although that isn’t guaranteed.  In the 758 days of files I loaded (code), the only files having more than one value for a given key were: browser, browser_type, operating_system, search_engines, event (report suite specific for every company) and column_headers (report suite specific for every company). So if you’re doing a bulk load of data, it’s generally sufficient to use the newest lookup table and save yourself some time. If you are processing the data every day, you can use an upsert process and generally there will be few if any updates.

Let’s Do Analytics!!!!???!!!

<moan>Why is there always so much ETL work, I want to data science the hell out of some data</moan>

At this point, if you were uploading the same amount of data for the traffic my blog does (not much), you’d be about 1-2 hours into loading data, still having done no analysis. In fact, in order to do analysis, you’d still need to modify the column names and types in your servercalls table, update the lookup tables to have the proper column names, and maybe you’d even want to pre-summarize the tables into views/materialized views for Page View/Visit/Visitor level. Whew, that’s a lot of work just to calculate daily page views.

Yes it is. But taking on a project like this isn’t for page views; just use the Adobe Analytics UI!

In a future blog post or two, I’ll demonstrate how to use this relational database layout to perform analyses not possible within the Adobe Analytics interface, and also show how we can skip this ETL process altogether using a schema-on-read process with Spark.

To leave a comment for the author, please follow the link and comment on their blog: R – randyzwitch.com.


Plotting Choropleths from Shapefiles in R with ggmap – Toronto Neighbourhoods by Population


(This article was first published on everyday analytics, and kindly contributed to R-bloggers)

Introduction

So, I’m not really a geographer. But any good analyst worth their salt will eventually have to do some kind of mapping or spatial visualization. Mapping is not really a forte of mine, though I have played around with it some in the past.
I was working with some shapefile data a while ago and thought about how it’s funny that so much spatial data is dominated by a format that is basically proprietary. I looked around for some good tutorials on using shapefile data in R, and even so it took me a while to figure it out, longer than I would have thought.
So I thought I’d put together a simple example of making nice choropleths using R and ggmap. Let’s do it using some nice shapefile data of my favourite city in the world courtesy of the good folks at Toronto’s Open Data initiative.

Background

We’re going to plot the shapefile data of Toronto’s neighbourhoods boundaries in R and mash it up with demographic data per neighbourhood from Wellbeing Toronto.
We’ll need a few spatial plotting packages in R (ggmap, rgeos, maptools).
Also, the shapefile threw some kind of weird error when I originally tried to load it into R, but it was nothing that loading it into QGIS once and resaving it couldn’t fix. The working version is available on the github page for this post.

Analysis

First let’s just load in the shapefile and plot the raw boundary data using maptools. What do we get?
# Read the neighborhood shapefile data and plot
library(maptools)   # provides readShapePoly()
shpfile <- "NEIGHBORHOODS_WGS84_2.shp"
sh <- readShapePoly(shpfile)
plot(sh)
This just yields the raw polygons themselves. Any good Torontonian would recognize these shapes. There are some maps like these, with words squished into the polygons, hanging in lots of print shops on Queen Street. Also, as someone pointed out to me, most T-dotters think of the grid of downtown streets as running directly North-South and East-West, but it actually sits on an angle.

Okay, that’s a good start. Now we’re going to include the neighbourhood population from the demographic data file by attaching it to the dataframe within the shapefile object. We do this using the merge function. Basically this is like an SQL join. Also, I need to convert the neighbourhood number to an integer first so things work, because R is treating it as a string.

# Add demographic data
# The neighbourhood ID is a string - change it to an integer
sh@data$AREA_S_CD <- as.numeric(sh@data$AREA_S_CD)

# Read in the demographic data and merge on Neighbourhood Id
demo <- read.csv(file="WB-Demographics.csv", header=T)
sh2 <- merge(sh, demo, by.x='AREA_S_CD', by.y='Neighbourhood.Id')
Next we’ll create a nice white to red colour palette using the colorRampPalette function, and then we have to scale the population data so it ranges from 1 to the max palette value and store that in a variable. Here I’ve arbitrarily chosen 128. Finally we call plot and pass that vector of colours into the col parameter:
# Set the palette
p <- colorRampPalette(c("white", "red"))(128)
palette(p)

# Scale the total population to the palette
pop <- sh2@data$Total.Population
cols <- (pop - min(pop))/diff(range(pop))*127+1
plot(sh, col=cols)
And here’s the glorious result!

Cool. You can see that the population is greater for some of the larger neighbourhoods, notably on the east end and The Waterfront Communities (i.e. condoland).

I’m not crazy about this white-red palette so let’s use RColorBrewer’s spectral which is one of my faves:

#RColorBrewer, spectral
library(RColorBrewer)   # provides brewer.pal()
p <- colorRampPalette(brewer.pal(11, 'Spectral'))(128)
palette(rev(p))
plot(sh2, col=cols)

There, that’s better. The dark red neighborhood is Woburn. But we still don’t have a legend so this choropleth isn’t really telling us anything particularly helpful. And it’d be nice to have the polygons overplotted onto map tiles. So let’s use ggmap!


ggmap

In order to use ggmap we have to decompose the shapefile of polygons into something ggmap can understand (a dataframe). We do this using the fortify command. Then we use ggmap’s very handy qmap function which we can just pass a search term to like we would Google Maps, and it fetches the tiles for us automatically and then we overplot the data using standard calls to geom_polygon just like you would in other visualizations using ggplot.

The first polygon call is for the filled shapes and the second is to plot the black borders.

#GGPLOT 
library(ggmap)   # ggmap depends on ggplot2 and provides qmap()
points <- fortify(sh, region = 'AREA_S_CD')

# Plot the neighborhoods
toronto <- qmap("Toronto, Ontario", zoom=10)
toronto +geom_polygon(aes(x=long,y=lat, group=group, alpha=0.25), data=points, fill='white') +
geom_polygon(aes(x=long,y=lat, group=group), data=points, color='black', fill=NA)
Voila!

Now we merge the demographic data just like we did before, and ggplot takes care of the scaling and legends for us. It’s also super easy to use different palettes by using scale_fill_gradient and scale_fill_distiller for ramp palettes and RColorBrewer palettes respectively.

# merge the shapefile data with the social housing data, using the neighborhood ID
points2 <- merge(points, demo, by.x='id', by.y='Neighbourhood.Id', all.x=TRUE)

# Plot
toronto + geom_polygon(aes(x=long,y=lat, group=group, fill=Total.Population), data=points2, color='black') +
scale_fill_gradient(low='white', high='red')

# Spectral plot
toronto + geom_polygon(aes(x=long,y=lat, group=group, fill=Total.Population), data=points2, color='black') +
scale_fill_distiller(palette='Spectral') + scale_alpha(range=c(0.5,0.5))

So there you have it! Hopefully this will be useful for other R users wishing to make nice maps in R using shapefiles, or those who would like to explore using ggmap.

References & Resources

Neighbourhood boundaries at Toronto Open Data:
Demographic data from Well-being Toronto:

To leave a comment for the author, please follow the link and comment on their blog: everyday analytics.


satRdays are go!


(This article was first published on rapporter, and kindly contributed to R-bloggers)

Steph Locke has already written about this last week, so just a quick follow-up on R-bloggers to share the great news with a wider audience: the R Consortium agreed to support satRdays!

A quick reminder on the project:

  • satRdays are SQLSaturday-inspired, community-driven, one-day, free or very cheap and regional conferences — acting as a bridge between local R user groups and international conferences
  • the idea kicked off only a few months ago at the EARL Boston conference
  • Steph quickly created a GitHub repository for planning, and we started to circulate a very short survey to see interest in the proposal, and we received over 500 responses in a week or so (quick analysis)
  • the project idea was submitted to the R Consortium in January, and now we are extremely happy to announce that the R Consortium is supporting us
  • we plan to have three independent events in 2016 (or early 2017): one in Eastern Europe, one in North America and a third satRday on another continent — looking forward to your ideas!

As said previously, satRdays are community driven, so please join the discussion on GitHub, fill in the short questionnaire if you missed it the last time — and we will contact every interested party over the next few weeks to take this funded idea to the next level: see you soon on a satRday!

To leave a comment for the author, please follow the link and comment on their blog: rapporter.


Working with databases in R


(This article was first published on DataScience+, and kindly contributed to R-bloggers)

The dplyr package, which is one of my favorite R packages, works with in-memory data and with data stored in databases. In this extensive and comprehensive post, I will share my experience on using dplyr to work with databases. The basic functions of the dplyr package are covered by Teja in another post at DataScience+.

Using dplyr with databases has a huge advantage when our data is big, where loading it into R is impossible or slows down our analytics. If our data resides in a database, rather than loading the whole dataset into R, we can work with subsets or aggregates of the data that we are interested in. Further, if we have many data files, putting our data in a database, rather than in csv or other formats, provides better security and is easier to manage.

dplyr is a really powerful package for data manipulation, data exploration and feature engineering in R, and if you do not know SQL, it provides the ability to work with databases just within R. Further, dplyr functions are easy to write and read. dplyr treats database tables as data frames and uses lazy evaluation (it delays the actual operation until necessary and loads data into R from the database only when we need it); for someone who knows Spark, the processes and even the functions have similarities.

dplyr supports several databases, such as SQLite, MySQL and PostgreSQL. In this post, we will see how to work with an SQLite database. You can get more information from the dplyr database vignette here.

When people take drugs, if they experience any adverse events, they can report them to the FDA. These data are in the public domain and anyone can download and analyze them. In this post, we will download the demographic information of the patients, the drugs they used and for what indications they used them, the reactions and the outcomes. Then, we will put all the datasets in a database and use dplyr to work with the database.

You can read more about the adverse events data in my previous post.

You can simply run the code below and it will download the adverse events data and create one large dataset for each category by merging the various quarterly files. For demonstration purposes, let’s use the adverse event reports from 2013-2015. The adverse events are released in quarterly data files (a single data file for every category every three months).

Load R Packages

library(dplyr)
library(ggplot2)
library(data.table)

Download adverse events data

 year_start=2013
year_last=2015
for (i in year_start:year_last){
            j=c(1:4)
            for (m in j){
            url1<-paste0("http://www.nber.org/fda/faers/",i,"/demo",i,"q",m,".csv.zip")
                download.file(url1,dest="data.zip") # Demography
                unzip ("data.zip")
            url2<-paste0("http://www.nber.org/fda/faers/",i,"/drug",i,"q",m,".csv.zip")
                download.file(url2,dest="data.zip")   # Drug 
                unzip ("data.zip")
            url3<-paste0("http://www.nber.org/fda/faers/",i,"/reac",i,"q",m,".csv.zip")
                download.file(url3,dest="data.zip") # Reaction
                unzip ("data.zip")
            url4<-paste0("http://www.nber.org/fda/faers/",i,"/outc",i,"q",m,".csv.zip")
                download.file(url4,dest="data.zip") # Outcome
                unzip ("data.zip")
            url5<-paste0("http://www.nber.org/fda/faers/",i,"/indi",i,"q",m,".csv.zip")
                download.file(url5,dest="data.zip") # Indication for use
                unzip ("data.zip")
            }
        }

Concatenate the quarterly data files and create single dataset for each category

Demography

filenames <- list.files(pattern="^demo.*.csv", full.names=TRUE)
demo=lapply(filenames,fread)
demography=do.call(rbind,lapply(1:length(demo), function(i) 
            select(as.data.frame(demo[i]),primaryid,caseid, 
             age,age_cod,event_dt,sex,wt,wt_cod, occr_country)))

str(demography)
'data.frame':	3037542 obs. of  9 variables:
 $ primaryid   : int  30375293 30936912 32481334 35865322 37005182 37108102 37820163 38283002 38346784 40096383 ...
 $ caseid      : int  3037529 3093691 3248133 3586532 3700518 3710810 3782016 3828300 3834678 4009638 ...
 $ age         : chr  "44" "38" "28" "45" ...
 $ age_cod     : chr  "YR" "YR" "YR" "YR" ...
 $ event_dt    : int  199706 199610 1996 20000627 200101 20010810 20120409 NA 20020615 20030619 ...
 $ sex         : chr  "F" "F" "F" "M" ...
 $ wt          : num  56 56 54 NA NA 80 102 NA NA 87.3 ...
 $ wt_cod      : chr  "KG" "KG" "KG" "" ...
 $ occr_country: chr  "US" "US" "US" "AR" ...

We see that our demography data has more than 3 million rows and the variables are age, age code, date the event happened, sex, weight, weight code and country where the event happened.

Drug

filenames <- list.files(pattern="^drug.*.csv", full.names=TRUE)
drug_list=lapply(filenames,fread)
drug=do.call(rbind,lapply(1:length(drug_list), function(i) 
            select(as.data.frame(drug_list[i]),primaryid,drug_seq,drugname,route)))

str(drug)
'data.frame':	9989450 obs. of  4 variables:
 $ primaryid: chr  "" "" "" "" ...
 $ drug_seq : chr  "" "" "20140601" "U" ...
 $ drugname : chr  "" "" "" "" ...
 $ route    : chr  "" "21060" "" "76273" ...

We can see that the drug data has about ten million rows and among the variables are drug name and route.

Diagnoses/Indications

filenames <- list.files(pattern="^indi.*.csv", full.names=TRUE)
indi=lapply(filenames,fread)
indication=do.call(rbind,lapply(1:length(indi), function(i) 
            select(as.data.frame(indi[i]),primaryid,indi_drug_seq,indi_pt)))

str(indication)
'data.frame':	6383312 obs. of  3 variables:
 $ primaryid    : int  8480348 8480354 8480355 8480357 8480358 8480358 8480358 8480359 8480360 8480361 ...
 $ indi_drug_seq: int  1020135312 1020135329 1020135331 1020135333 1020135334 1020135337 1020135338 1020135339 1020135340 1020135341 ...
 $ indi_pt      : chr  "CONTRACEPTION" "SCHIZOPHRENIA" "ANXIETY" "SCHIZOPHRENIA" ...

The indication data has more than six million rows and the variables are primaryid, drug sequence and indication (indication prefered term).

Outcomes

filenames <- list.files(pattern="^outc.*.csv", full.names=TRUE)
outc=lapply(filenames,fread)
outcome=do.call(rbind,lapply(1:length(outc), function(i) 
            select(as.data.frame(outc[i]),primaryid,outc_cod)))

str(outcome)
 'data.frame':	2453953 obs. of  2 variables:
 $ primaryid: int  8480347 8480348 8480350 8480351 8480352 8480353 8480353 8480354 8480355 8480356 ...
 $ outc_cod : chr  "OT" "HO" "HO" "HO" ...

The outcome data has more than two million rows and the variables are primaryid and outcome code (outc_cod).

Reaction (Adverse Event)

filenames <- list.files(pattern="^reac.*.csv", full.names=TRUE)
reac=lapply(filenames,fread)
reaction=do.call(rbind,lapply(1:length(reac), function(i) 
            select(as.data.frame(reac[i]),primaryid,pt)))

str(reaction)
'data.frame':	9288270 obs. of  2 variables:
 $ primaryid: int  8480347 8480348 8480349 8480350 8480350 8480350 8480350 8480350 8480350 8480351 ...
 $ pt       : chr  "ANAEMIA HAEMOLYTIC AUTOIMMUNE" "OPTIC NEUROPATHY" "DYSPNOEA" "DEPRESSED MOOD" ...

So, we see that the adverse events (reaction) data has more than nine million rows and the variables are primaryid and prefered term for adverse event (pt).

Create a database

To create an SQLite database in R, we do not need anything other than specifying the path. We use src_sqlite to connect to an existing SQLite database, and tbl to connect to tables within that database. We can also use src_sqlite to create a new SQLite database at the specified path. If we do not specify a path, it will be created in our working directory.

my_database<- src_sqlite("adverse_events", create = TRUE) # create =TRUE creates a new database

Put data in the database

To upload data to the database, we use the dplyr function copy_to. According to the documentation, wherever possible, the new object will be temporary, limited to the current connection to the source. So, we have to change temporary to false to make it permanent.

 copy_to(my_database,demography,temporary = FALSE) # uploading demography data
copy_to(my_database,drug,temporary = FALSE)       # uploading drug data
copy_to(my_database,indication,temporary = FALSE) # uploading indication data
copy_to(my_database,reaction,temporary = FALSE)   # uploading reaction data
copy_to(my_database,outcome,temporary = FALSE)     #uploading outcome data

Now that I have put all the datasets in the “adverse_events” database, I can query it and do the analytics I want.

Connect to database

my_db <- src_sqlite("adverse_events", create = FALSE)
              # create is false now because I am connecting to an existing database

List the tables in the database

src_tbls(my_db)

    "demography" "drug" "indication" "outcome" "reaction" "sqlite_stat1" 

Querying the database

We use the same dplyr verbs that we use in data manipulation to work with databases. dplyr translates the R code we write to SQL code. We use tbl to connect to tables within the database.

demography = tbl(my_db,"demography" )

class(demography)
[1] "tbl_sqlite" "tbl_sql"    "tbl" 

head(demography,3)


US = filter(demography, occr_country=='US')  # Filtering demography of patients from the US

US$query
 SELECT "primaryid", "caseid", "age", "age_cod", "event_dt", "sex", "wt", "wt_cod", "occr_country"
FROM "demography"
WHERE "occr_country" = 'US'

We can also see how the database plans to execute the query:

explain(US)

SELECT "primaryid", "caseid", "age", "age_cod", "event_dt", "sex", "wt", "wt_cod", "occr_country"
FROM "demography"
WHERE "occr_country" = 'US'
  selectid order from                detail
1        0     0    0 SCAN TABLE demography

Let’s similarly connect to the other tables in the database.

drug = tbl(my_db,"drug" )
indication = tbl(my_db,"indication" )
outcome = tbl(my_db,"outcome" )
reaction = tbl(my_db,"reaction" )

It is very interesting to note that dplyr delays the actual operation until necessary and loads data onto R from the database only when we need it. When we use actions such as collect(), head(), count(), etc, the commands are executed.
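For example, here is a small illustration using the demography table from above: the assignment below only builds up a query, and nothing is pulled from the database until collect() is called.

us_counts <- demography %>%
   filter(occr_country == 'US') %>%
   summarize(Total = n())       # still a lazy tbl backed by SQL at this point

collect(us_counts)              # collect() executes the query and returns a local data frame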

While we can use head() on database tbls, we can’t find the last rows without executing the whole query.

head(indication,3)


tail(indication,3)
Error: tail is not supported by sql sources

dplyr verbs (select, arrange, filter, mutate, summarize, rename) on the tables from the database

We can pipe dplyr operations together with %>% from the magrittr R package. The pipeline %>% takes the output from the left-hand side of the pipe as the first argument to the function on the right hand side.

Find the top ten countries with the highest number of adverse events

demography%>%group_by(Country= occr_country)%>% 
           summarize(Total=n())%>%      
           arrange(desc(Total))%>%       
           filter(Country!='')%>% head(10)


We can also include ggplot in the chain:

demography%>%group_by(Country= occr_country)%>% #grouped by country
           summarize(Total=n())%>%    # found the count for each country
           arrange(desc(Total))%>%    # sorted them in descending order
           filter(Country!='')%>%     # removed reports that does not have country information
           head(10)%>%                  # took the top ten
          ggplot(aes(x=Country,y=Total))+geom_bar(stat='identity',color='skyblue',fill='#b35900')+
          xlab("")+ggtitle('Top ten countries with highest number of adverse event reports')+
          coord_flip()+ylab('Total number of reports')


Find the most common drug

drug%>%group_by(drug_name= drugname)%>% #grouped by drug_name
           summarize(Total=n())%>%    # found the count for each drug name
           arrange(desc(Total))%>%    # sorted them in descending order
           head(1)                   # took the most frequent drug


What are the top 5 most common outcomes?

head(outcome,3)  # to see the variable names


outcome%>%group_by(Outcome_code= outc_cod)%>% #grouped by Outcome_code
           summarize(Total=n())%>%    # found the count for each Outcome_code
           arrange(desc(Total))%>%    # sorted them in descending order
           head(5)                   # took the top five


What are the top ten reactions?

head(reaction,3)  # to see the variable names


reaction%>%group_by(reactions= pt)%>% # grouped by reactions
           summarize(Total=n())%>%    # found the count for each reaction type
           arrange(desc(Total))%>%    # sorted them in descending order
           head(10)                   # took the top ten


Joins

Let’s join demography, outcome and reaction based on primary id:

inner_joined = demography%>%inner_join(outcome, by='primaryid',copy = TRUE)%>%
               inner_join(reaction, by='primaryid',copy = TRUE)
head(inner_joined)


We can also use primary key and secondary key in our joins. Let’s join drug and indication using two keys (primary and secondary keys).

drug_indication= indication%>%rename(drug_seq=indi_drug_seq)%>%
   inner_join(drug, by=c("primaryid","drug_seq"))
head(drug_indication)


In this post, we saw how to use the dplyr package to create a database and upload data to the database. We also saw how to perform various analytics by querying data from the database.
Working with databases in R has a huge advantage when our data is big, where loading it into R is impossible or slows down our analytics. If our data resides in a database, rather than loading the whole dataset into R, we can work with subsets or aggregates of the data that we are interested in. Further, if we have many data files, putting our data in a database, rather than in csv or other formats, provides better security and is easier to manage.

This is enough for this post. See you in my next post. You can read about dplyr two-table verbs here. If you have any questions or feedback, feel free to leave a comment.

    Related Post

    1. Data manipulation with tidyr
    2. Bringing the powers of SQL into R
    3. Efficient aggregation (and more) using data.table
    4. Aggregate – A Powerful Tool for Data Frame in R
    5. Data Manipulation with reshape2

    To leave a comment for the author, please follow the link and comment on their blog: DataScience+.


    SQL Server 2016 with R Install


    (This article was first published on TeachR, and kindly contributed to R-bloggers)

    I installed the “SQL Server 2016 Release Candidate” today to start playing around with what Microsoft is calling “SQL Server R Services (In-Database)”. I think this is really going to change the way we do things at the office and it’s got me really excited.

    You can download the Release Candidate from https://www.microsoft.com/en-us/evalcenter/evaluate-sql-server-2016.

    I chose the CAB install which downloads two files, about 1.9GB. Running the executable then unpacks the setup files you will need for installation. Then, setup should run automatically. Microsoft has some good instructions at https://msdn.microsoft.com/en-us/library/mt696069.aspx.

    The last part of their instructions is a test to make sure R is working.


    exec sp_execute_external_script @language =N'R',
    @script=N'OutputDataSet<-InputDataSet',
    @input_data_1 =N'select 1 as hello'
    with result sets (([hello] int not null));
    go

    The install actually adds a new instance of R running on the server, which for me is just my laptop.  A simple change to the code above helped me to find where the instance is.


    exec sp_execute_external_script @language =N'R',
    @script=N'OutputDataSet <- as.data.frame(R.home())'
    with result sets (([R.home] sysname));
    go

    You can add more packages to this install quite easily. I’ve gone ahead and pinned this R instance to my taskbar. Run it as an admin and the installed packages are automatically available to SQL Server code. After installing the actuar package, just to test, I ran this in SSMS and it worked!


    exec sp_execute_external_script @language =N'R',
    @script=N'
    require(actuar)
    OutputDataSet <- as.data.frame(search())'
    with result sets (([LoadedPackages] sysname));
    go
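    Installing a package into that instance is just ordinary R once you start the bundled R console as administrator; here is a minimal sketch (checking the library path first is my own habit, not part of Microsoft's instructions):

    # Run inside the R console that ships with SQL Server R Services,
    # started as administrator so the instance library is writable
    .libPaths()   # confirm which library the instance will use
    install.packages("actuar", repos = "https://cran.r-project.org")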

    To leave a comment for the author, please follow the link and comment on their blog: TeachR.


    Airbnb uses R to scale data science


    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    Airbnb, the property-rental marketplace that helps you find a place to stay when you're travelling, uses R to scale data science. Airbnb is a famously data-driven company, and has recently gone through a period of rapid growth. To accommodate the influx of data scientists (80% of whom are proficient in R, and 64% use R as their primary data analysis language), Airbnb organizes monthly week-long data bootcamps for new hires and current team members.

    But just as important as the training program is the engineering process Airbnb uses to scale data science with R. Rather than just have data scientists write R functions independently (which not only is a likely duplication of work, but inhibits transparency and slows down productivity), Airbnb has invested in building an internal R package called Rbnb that implements collaborative solutions to common problems, standardizes visual presentations, and avoids reinventing the wheel. (Incidentally, the development and use of internal R packages is a common pattern I've seen at many companies with large data science teams.)

    The Rbnb package used at Airbnb includes more than 60 functions and is still growing under the guidance of several active developers. It's actively used by Airbnb's engineering, data science, analytics and user experience teams, to do things like move aggregated or filtered data from a Hadoop or SQL environment into R, impute missing values, compute year-over-year trends, and perform common data aggregations. It has been used to create more than 500 research reports and to solve problems like automating the detection of host preferences and using guest ratings to predict rebooking rates.

    [Figure: example of Rbnb package code]

    The package is also widely used to visualize data using a standard Airbnb "look". The package includes custom themes, scales, and geoms for ggplot2; CSS templates for htmlwidgets and Shiny; and custom R Markdown templates for different types of reports. You can see several examples in the blog post by Ricardo Bion linked below, including this gorgeous visualization of the 500,000 top Airbnb trips.

    [Figure: visualization of the 500,000 top Airbnb trips]

    Medium (AirbnbEng): Using R packages and education to scale Data Science at Airbnb

    To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


    Plotting App for ggplot2


    (This article was first published on DataScience+, and kindly contributed to R-bloggers)

    Through this post, I would like to share an update to my RTutoR package. The first version of this package included an R Basics Tutorial App, which I published earlier at DataScience+.

    The updated version of the package, which is now on CRAN, includes a plotting app. The app provides an automated interface for generating plots with the ggplot2 package. The current version supports 10 different plot types, along with options to manipulate the aesthetics and controls specific to each plot type. The app also displays the underlying code that generates the plot, which should be useful for people trying to learn ggplot2. In addition, it uses the plotly package to generate interactive plots, which is especially suited to exploratory analysis.

    A video demo on how to use the app is provided at the end of the post. The app is also hosted on Shinyapps.io. However, unlike the package version, the hosted app does not let you use your own dataset; instead, it provides a small set of pre-defined datasets to choose from.

    High level overview of ggplot2

    For people who are completely new to ggplot2, or have just started working with it, I provide below a quick, concise overview of the ggplot2 package. It is not meant to be comprehensive; it just covers some key aspects so that it is easier to understand how the app is structured and to make the most of it. You can also read a published ggplot2 tutorial on DataScience+.

    The template for generating a basic plot using ggplot2 is as follows:

    ggplot(data_set) + plot_type(aes(x_variable, y_variable)) # For univariate analysis, you can specify just one variable

    plot_type specifies the type of plot to construct. There are more than 35 different plot types in ggplot2 (the current version of the app, however, supports only 10 of them).

    ggplot2 provides an extensive set of options to manipulate different aesthetics of a plot. Aesthetics can be manipulated in one of two ways:

    • Manually setting the aesthetic
    • Mapping the aesthetic to a variable

    To set an aesthetic manually, include it outside the aes() call. For example, the code below generates a scatter plot and colors the points red using the color aesthetic:

    ggplot(iris) + geom_point(aes(Sepal.Length,Sepal.Width), color = "red") 

    To map an aesthetic to a variable, include it inside the aes() call. For example, to color the points by Species, we modify the code above as follows:

    ggplot(iris) + geom_point(aes(Sepal.Length,Sepal.Width,color = Species)) 

    Not all aesthetics apply to all plot types. For example, the linetype aesthetic (which controls the line pattern) is not applicable to geom_point. (The app only displays the aesthetics that are applicable to the selected plot type.)

    A plot type may also have controls specific to it. geom_smooth (a smoothing curve), for example, provides a “method” argument to control the smoothing function used to fit the curve. ggplot2 offers an extensive set of options for different plot types (a good reference is ggplot2’s documentation here). The app does not include every available option, but it tries to incorporate a few of the most commonly used ones.
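    For example, a quick sketch of that “method” argument outside the app (my own illustration, not code generated by the app):

    library(ggplot2)
    # The default smoother is loess; method = "lm" fits a straight line instead
    ggplot(mtcars, aes(mpg, hp)) +
      geom_point() +
      geom_smooth(method = "lm", se = TRUE)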

    Multiple layers can be added to a plot. For example, the code below plots a scatter plot and fits a smoothing line as well:

    ggplot(mtcars,aes(mpg,hp)) + geom_point() + geom_smooth() 

    Note: the way the app is coded, we need to specify the aesthetics for each layer separately, even if they are the same for every layer. Hence, if we construct this plot using the app, the underlying code that is displayed would read:

    ggplot(mtcars) + geom_point(aes(mpg,hp)) + geom_smooth(aes(mpg,hp)) 

    Related Post

    1. Performing SQL selects on R data frames
    2. Developing an R Tutorial shiny dashboard app
    3. Strategies to Speedup R Code
    4. Sentiment analysis with machine learning in R
    5. Sentiment Analysis on Donald Trump using R and Tableau

    To leave a comment for the author, please follow the link and comment on their blog: DataScience+.


    An Analysis of Traffic Violation Data with SQL Server and R


    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    By Srini Kumar, Director of Data Science at Microsoft

    Who does not hate being stopped and given a traffic ticket? Invariably, we feel it is unfair that we got one and everyone else did not. I am no different, and living in the SF Bay Area, I have often wondered whether I could get the data about traffic tickets, particularly since there may be some unusual patterns.

    While I could not find any such data for the SF Bay Area, I supposed I could get some vicarious pleasure out of analyzing it for some place for which I could get data. As luck would have it, I did find something: it turns out that Montgomery County in Maryland puts all of its traffic violation information online. Of course, personally identifiable information is removed, but a lot of other information is published. If any official from that county is reading this, thanks for putting it online! Hopefully, other counties follow suit.

    By the standards of the data sizes I have worked with, this is rather small; it is a little over 800K records, though it has about 35 columns, including latitude and longitude. I find it convenient to park the data in a relational database, analyze it using SQL first, and then get slices of it into R.

    In this post, I'll be using Microsoft SQL Server to import and prepare the traffic violations data, and then use R to visualize the data.

    The SQL language

    If you are unfamiliar with SQL, this SQL tutorial has been around for a long time, and it is where I got started with SQL. Another thing I find convenient is an ETL tool. ETL, if you are not already familiar with it, stands for Extract-Transform-Load; it is as pervasive a utility in corporate IT infrastructures as plumbing and electrical wiring are in homes, and often just as invisible, noticed only when unavailable. If you are determined to stay open source, you can use PostgreSQL as the database (I prefer PostgreSQL over MySQL, but choose whatever you become comfortable with) and, say, Talend for the ETL. For this exercise, I had access to the SQL Server 2016 preview and ended up using it. Conveniently, it already ships with an ETL tool, SQL Server Integration Services, and a wizard that makes it easy to do everything in one place.

    Now, why do I spend so much time suggesting all this, when you could very easily just download a file, read it into R and do whatever you want with it? In a nutshell, because of all the goodies you get. For example, you can detect problems with the data as part of the load, and then run SQL against it. You can handle data sizes that do not fit in memory. And with MS SQL it gets even better: an easy wizard helps you load the data and flags problems even before the load, and you can also run R from within the database, at scale.

    Here are the fields in the table. The query is quite simple, and accesses the metadata that is available in every relational database.

    select column_name, data_type
    from INFORMATION_SCHEMA.columns
    where table_name = 'Montgomery_County_MD_Traffic_Violations'
    order by ORDINAL_POSITION asc

    Date Of Stop            date
    Time Of Stop            time
    Agency                  varchar
    SubAgency               varchar
    Description             varchar
    Location                varchar
    latitude                float
    longitude               float
    Accident                varchar
    Belts                   varchar
    Personal Injury         varchar
    Property Damage         varchar
    Fatal                   varchar
    Commercial License      varchar
    HAZMAT                  varchar
    Commercial Vehicle      varchar
    Alcohol                 varchar
    Work Zone               varchar
    State                   varchar
    VehicleType             varchar
    Year                    bigint
    Make                    varchar
    Model                   varchar
    Color                   varchar
    Violation Type          varchar
    Charge                  varchar
    Article                 varchar
    Contributed To Accident varchar
    Race                    varchar
    Gender                  varchar
    Driver City             varchar
    Driver State            varchar
    DL State                varchar
    Arrest Type             varchar
    Geolocation             varchar
    month_of_year           nvarchar
    day_of_week             nvarchar
    hour_of_day             int
    num_day_of_week         int
    num_month_of_year       int

    The last five columns were created by me. It is trivial in SQL Server (or any other relational database) to add new columns and have them computed each time a row is added.

    With RODBC/RJDBC, it is easy to get data from an RDBMS into R and do what we want with it. Here is an interesting plot of violations by race, showing whether they concentrate in certain areas.

    [Figure: map of Montgomery County traffic violations, colored by race]

    The code to create the above chart uses, as you may have guessed, the ggmap library (installing ggmap pulls in ggplot2 as well). To plot this, we first get the geocode for Montgomery County and then pass it to the get_map function to fetch the base map. get_map also takes a zoom level that we can adjust if the default is too far in or out.
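    A minimal sketch of that workflow, assuming a data frame called violations pulled from the database with longitude, latitude and Race columns (the object and column names here are placeholders):

    library(ggmap)   # loads ggplot2 as a dependency

    # Geocode the county, fetch a base map at a chosen zoom level,
    # then overlay each stop, colored by the Race column
    center  <- geocode("Montgomery County, MD")
    basemap <- get_map(location = c(center$lon, center$lat), zoom = 10)

    ggmap(basemap) +
      geom_point(data = violations,
                 aes(x = longitude, y = latitude, color = Race),
                 alpha = 0.3, size = 0.7)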

    Here is another one on violation concentrations related to the time of day (the scale is hours from midnight: 0 is midnight and 12 is noon):

    [Figure: map of violations colored by hour of day, from midnight (0) to noon (12)]

    The original post lists the code for this chart in full; it follows the same ggmap pattern as the race map.
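    A hedged reconstruction under the same assumptions as the sketch above, shading each stop by the hour_of_day column:

    # Reuse the base map from the earlier sketch; color now maps to hours from midnight
    ggmap(basemap) +
      geom_point(data = violations,
                 aes(x = longitude, y = latitude, color = hour_of_day),
                 alpha = 0.3, size = 0.7) +
      scale_color_gradient(low = "midnightblue", high = "gold",
                           name = "Hours from midnight")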

     

    It looks like violations are really sparse in the wee hours of the morning. How about violations by month, by vehicle type?

    [Figure: bar chart of violations by month and vehicle type]

    It looks like mostly cars, with some light-duty trucks as well. But how did we get the numerical month of the year? Again, in SQL it is easy to do:

    ALTER TABLE <table name> ADD num_month_of_year AS datepart(month, [Date Of Stop])

    After that, plotting a bar chart is a routine matter, once RODBC/RJDBC has pulled the data from the database:

    ggplot(data = alldata, aes(num_month_of_year)) + geom_bar(aes(fill = VehicleType))

    When we look at the Make of the cars, we see how messed up that attribute is. For example, I have seen HUYAND, HUYND, HUYNDAI, HYAN, HYANDAI, HYN, HYND and HYNDAI. I am pretty sure I even saw a GORRILLA, and I have no idea what that might mean. That does not make for a very good plot, or for that matter any analysis. Can we do better? Not really, unless we have reference data. We could build a reference set from the car makes and models that we know of, but that is a nontrivial exercise.

    A little SQL goes a long way, particularly if the data is very large and needs to be sampled or aggregated selectively. For example:
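    A hedged example of the kind of selective aggregation I mean, run from R via RODBC (the DSN name and the filter value are assumptions; the table and column names come from the schema listed above):

    library(RODBC)

    ch <- odbcConnect("TrafficViolationsDSN")   # DSN name is an assumption

    # Aggregate inside SQL Server and bring back only the summary;
    # [] quotes column names that contain spaces
    res <- sqlQuery(ch, "
      select num_month_of_year, [Violation Type], count(*) as frequency
      from Montgomery_County_MD_Traffic_Violations
      where [Violation Type] = 'Citation'
      group by num_month_of_year, [Violation Type]
    ")

    odbcClose(ch)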

    The square brackets [] are MS SQL syntax for working with field names that contain spaces and other special characters. To practice SQL itself, you don’t have to set up an RDBMS first: in R there is the sqldf library, which lets you run SQL against data frames if you don't have an actual database. For example:

    bymonth <- sqldf("select num_month_of_year, sum(frequency) as frequency from res group by num_month_of_year")

    Since I have recently been introduced to what I call the "rx" functions from Microsoft R Server (the new name for Revolution Analytics' R), I decided to check them against what I normally use, which is ctree from the party package and, of course, glm. My intent was not to get insights from the data here; it was simply to see whether the rx functions would run better. So I simply tried to relate the number of violations to as many variables as I could use.

    The rx functions (in my limited testing) worked when the CRAN R functions did not. Note that I did not go looking for places where one outperformed the other. This was simply to figure out what was doable with the rx functions.

    Finally, we can call R from within SQL Server too, via the sp_execute_external_script stored procedure.

    It is no surprise that the fewest violations occur during the wee hours of the morning, but one surprise for me was the vehicle brand names. Of course, one could create a reference set of names and use it to clean this data, so as to see which brands have a higher rate of violations, but that is not an elegant solution. Chances are that as we collect more such data, errors in human input will be a serious issue to contend with.

    Hopefully, this article also illustrates how powerful SQL can be for manipulating data, though it barely scratches the surface of SQL usage.

    To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


    satRdays are go!


    (This article was first published on R – It's a Locke, and kindly contributed to R-bloggers)

    I’m very pleased to say that the R Consortium agreed to support the satRday project!

    The idea kicked off in November and I was over the moon with the response from the community; then we garnered support before submitting to the Consortium, and I must have looped the moon a few times as we had more than 500 responses. Now the R Consortium is supporting us and we can turn all that enthusiasm into action.

    Over the next couple of weeks, we’ll be contacting everyone who has registered their interest in organising an event or helping with the tech & infrastructure side of things to get the ball rolling. There’s still time to sign up if you missed the initial call to action, and of course, we’re hoping that we won’t be stopping at three events and that we’ll be continuing these events for a very long time.

    The post satRdays are go! appeared first on It's a Locke.

    To leave a comment for the author, please follow the link and comment on their blog: R – It's a Locke.


    In case you missed it: March 2016 roundup


    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    In case you missed them, here are some articles from March of particular interest to R users.

    Reviews of new CRAN packages RtutoR, lavaan.shiny, dCovTS, glmmsr, GLMMRR, MultivariateRandomForest, genie, kmlShape, deepboost and rEDM.

    You can now create and host Jupyter notebooks based on R, for free, in Azure ML Studio.

    Calculating learning curves for predictive models with doParallel.

    An amusing look at some of R's quirks, by Oliver Keyes.

    A recording of a recent talk I gave on real-time predictive analytics, featuring R.

    A preview of the New York R Conference.

    The R Consortium has funded seven community projects and two working groups for R projects.

    A look at several methods for computing and assessing the performance of classification models, with R.

    An application to help airlines prevent unexpected maintenance delays, based on predictive models created with R.

    Using R to predict the winning basketball team in the March Madness competition.

    How to call an R function from an Excel worksheet after it's been published as a service to Azure ML.

    You can now use magrittr pipes with the out-of-memory XDF data files used by Microsoft R Server.

    Watch the recorded webinar "Data Preparation Techniques with R", and download the free e-book by Nina Zumel.

    An R-based application to automatically classify galaxies in the World Wide Telescope was featured in a keynote at Microsoft's Data Driven event.

    Microsoft R Server is now available in the Azure Marketplace.

    R 3.2.4 was released by the R Core Group on March 10.

    Previews of some talks at the Bay Area R Users Group.

    R Tools for Visual Studio, which lets you edit and debug R code within Visual Studio, is now available.

    A tutorial on creating election maps with R, from ComputerWorld.

    A history of the R project since the release of version 1.0.0.

    Calculating confidence intervals for Random Forest predictions based on a corrected jackknife estimator.

    Microsoft's Data Science Virtual Machine now includes Microsoft R Server.

    Using a pet tracker and R to map the movements of a cat.

    General interest stories (not related to R) in the past month included: typography in movies, solving a Rubik's cube while juggling it, pianograms and a robot rebellion.

    As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.

    To leave a comment for the author, please follow the link and comment on their blog: Revolutions.
