I know “officially” data scientists always work in “big data” environments with data in a remote database, streaming store or key-value system. But in day to day work Excel files and Excel export files get used a lot and cause a disproportionate amount of pain.
I would like to make a plea to my fellow data scientists to stop using Excel-like formats for informal data exchange and to become much stricter in producing and insisting on truly open, machine readable files. Open files are those in an open format (not proprietary like Microsoft Excel), and machine readable in this case means readable by a very simple program (preferring simple escaping strategies to complicated quoting strategies). A lot of commonly preferred formats surprisingly do not meet these conditions: for example Microsoft Excel, XML and quoted CSV all fail the test. A few formats that do meet these conditions: SQL dumps, JSON and what I call “strong TSV.” I will illustrate some of the difficulty in using ad-hoc formats in R and suggest work-arounds.

One thing nobody admits to is: Excel data is hard to read reliably into R. Admitting this means admitting there is something Excel is better at than R (even if it is just reading its own data). CRAN suggests several methods to read Excel data into R, but most of them fail on non-trivial (though not even large) data. For example, CRAN suggests:
- Methods that depend on connecting to a Windows service in a 32 bit environment (like RODBC).
- The xlsx package, which even when you add more memory to the required Java dependency (.jinit(parameters="-Xmx6G")) runs out of memory parsing a 44 megabyte file (see the sketch just after this list for setting the heap before rJava starts).
- The gdata package, which uses Perl for the Excel parsing.
- The XLConnect package, which seems not to install on R 2.15.0.
- The Omegahat package RExcelXML (not part of CRAN).
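As a side note on the xlsx route: the Java heap has to be sized before rJava is initialized, so a sketch (assuming a hypothetical file named file.xlsx and that rJava has not already been loaded in the session) looks like:

# Sketch only: java.parameters must be set before rJava starts,
# otherwise the -Xmx request is ignored. File name is hypothetical.
options(java.parameters = "-Xmx6g")  # must come before library(xlsx)
library(xlsx)
d <- read.xlsx2("file.xlsx", sheetIndex = 1)

Even with the larger heap this route did not get through our 44 megabyte file, so we moved on.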
Of these packages we have had the best luck with gdata (and even this is not simple). However, we have seen the package silently truncate data on read. Also, strings are converted to factors (though this can be fixed by passing the appropriate read.table() settings through the read.xls() call). Digging into the gdata package we see it uses a Perl script to try to export the Excel data into a quoted format, but the package does not get all of the various possibilities of embedded line breaks, embedded quotes, escapes and comments correct. The wrong string-valued field can silently poison the results.
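For reference, here is a minimal sketch of the gdata route, assuming a hypothetical workbook file.xls; read.xls() passes extra arguments through to the underlying read.csv()/read.table() call, which is how the factor conversion can be turned off:

library(gdata)
# Extra arguments are passed through to read.csv()/read.table(),
# so the factor conversion can be suppressed here (file name hypothetical).
d <- read.xls("file.xls", sheet = 1, stringsAsFactors = FALSE)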
Additionally, we have found that R’s read.table() and read.xls() commands do not fully understand all the complexities of Excel export quoting (which allows actual quotes, field separators and line separators inside fields, relying on a non-Unix escape convention of doubling quotes). This is a problem for any data set that includes text extracts (bringing in line breaks, false field separators and problematic punctuation). To add insult to injury, R’s read.table() is in no way guaranteed to be able to read the output of R’s own write.table(), unless you remember to set enough options about quoting, escaping, line breaks and comments both during writing and reading. Finally, Excel’s native save-as-text or save-as-CSV has the amazing bug that any field that is too wide to display (and thus displayed as hash-marks) is in fact exported as hash-marks! So even having a copy of Excel isn’t enough to solve the Excel export problem.
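For example, here is a sketch of one matching set of options for a write.table()/read.table() round trip (these are not the defaults); the key is to use the same quoting convention on both sides and to turn off comments and backslash escapes when reading back:

# Write with doubled-quote escaping (the convention read.table() understands
# when quote='"'), then read back with comments and backslash escapes off.
write.table(d, "out.tab", sep = "\t", qmethod = "double", row.names = FALSE)
d2 <- read.table("out.tab", sep = "\t", header = TRUE, quote = "\"",
                 comment.char = "", allowEscapes = FALSE,
                 stringsAsFactors = FALSE)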
If needless complexity is causing problems, let us try avoiding the complexity. I have modified the Perl script found in the gdata package into a new version that exports to what I call “strong TSV.” In strong TSV fields never contain the field separator (tab) or a line separator (CR or LF), and there is no quoting, escaping or comments (quoting is no longer needed, so we just treat quotes as a standard character like any other). To export to strong TSV we just use a simple Perl regexp to replace all in-field whitespace with a standard space (slightly damaging the field, but saving a lot of down-stream complexity). The file converted in this manner can then be read reliably using R’s read.table() command:
d <- read.table('file.tab', quote='', comment.char='', allowEscapes=FALSE,
                sep='\t', header=TRUE, as.is=TRUE, stringsAsFactors=FALSE)
The modified Perl script (xls2goodTSV.pl) is placed in the Perl section of the gdata library (in my case Library/R/2.15/library/gdata/perl) and run in that environment.
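If your data is already sitting in an R data frame and you want to hand it to somebody else, the same idea can be applied on the R side. The sketch below (writeStrongTSV() is a hypothetical helper name, not part of any package) collapses all in-field whitespace to single spaces and writes an unquoted tab-separated file:

# Hypothetical helper: collapse tabs/CR/LF inside character fields to a
# single space, then write tab-separated with no quoting or escaping.
writeStrongTSV <- function(d, file) {
  cleaned <- as.data.frame(
    lapply(d, function(col) {
      if (is.character(col)) gsub("[[:space:]]+", " ", col) else col
    }),
    stringsAsFactors = FALSE, check.names = FALSE)
  write.table(cleaned, file, sep = "\t", quote = FALSE, row.names = FALSE)
}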
Some of these weaknesses are flaws in the CRAN-suggested R libraries. But some of the problems are due to the complexity of Excel and even of the quoting in exported Excel documents (and the hash-mark bug). All of these steps could be skipped if the data were available in a truly machine readable format like strong TSV, JSON or a SQL dump (all of which get quoting correct). My suggestion is to try strong TSV, as it is very simple and interoperates well with most tools (R, Linux command line tools and even Excel itself). The time you save may be your own.