I know “officially” data scientists always work in “big data” environments with data in a remote database, streaming store or key-value system. But in day to day work Excel files and Excel export files get used a lot and cause a disproportionate amount of pain.
I would like to make a plea to my fellow data scientists to stop using Excel-like formats for informal data exchange and to become much stricter in producing and insisting on truly open, machine readable files. Open files are those in an open format (not proprietary like Microsoft Excel), and machine readable in this case means readable by a very simple program (preferring simple escaping strategies to complicated quoting strategies). A lot of commonly preferred formats surprisingly do not meet these conditions: for example Microsoft Excel, XML and quoted CSV all fail the test. A few formats that do meet these conditions: SQL dumps, JSON and what I call “strong TSV.” I will illustrate some of the difficulty in using ad-hoc formats in R and suggest work-arounds.

One thing nobody admits to is: Excel data is hard to read reliably into R. Admitting this means admitting there is something Excel is better at than R (even if it is just reading its own data). CRAN suggests several methods to read Excel data into R, but most of them fail on non-trivial (though not even large) data. For example, CRAN suggests:
- Methods that depend on connecting to a Windows service in a 32 bit environment (like RODBC).
- The xlsx package, which even when you add more memory to the required Java dependency (.jinit(parameters="-Xmx6G")) runs out of memory parsing a 44 megabyte file (see the sketch just after this list for setting the heap before rJava starts).
- The gdata package, which uses Perl for the Excel parsing.
- The XLConnect package, which seems not to install on R 2.15.0.
- The Omegahat package RExcelXML (not part of CRAN).
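As a side note on the xlsx route: the Java heap has to be sized before rJava is initialized, so a sketch (assuming a hypothetical file named file.xlsx and that rJava has not already been loaded in the session) looks like:

# Sketch only: java.parameters must be set before rJava starts,
# otherwise the -Xmx request is ignored. File name is hypothetical.
options(java.parameters = "-Xmx6g")  # must come before library(xlsx)
library(xlsx)
d <- read.xlsx2("file.xlsx", sheetIndex = 1)

Even with the larger heap this route did not get through our 44 megabyte file, so we moved on.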
Of these packages we have had the best luck with gdata (and even this is not simple). However, we have seen the package silently truncate data on read. Also, strings are converted to factors (though this can be fixed by passing the appropriate read.table() settings through the read.xls() call). Digging into the gdata package we see it uses a Perl script to try to export the Excel data into a quoted format, but the package does not get all of the various possibilities of embedded line breaks, embedded quotes, escapes and comments correct. The wrong string-valued field can silently poison the results.
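For reference, here is a minimal sketch of the gdata route, assuming a hypothetical workbook file.xls; read.xls() passes extra arguments through to the underlying read.csv()/read.table() call, which is how the factor conversion can be turned off:

library(gdata)
# Extra arguments are passed through to read.csv()/read.table(),
# so the factor conversion can be suppressed here (file name hypothetical).
d <- read.xls("file.xls", sheet = 1, stringsAsFactors = FALSE)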
Additionally, we have found that R’s read.table() and read.xls() commands do not fully understand all the complexities of Excel export quoting (which allows actual quotes, field separators and line separators inside fields, relying on a non-Unix escape convention of doubling quotes). This is a problem for any data set that includes text extracts (bringing in line breaks, false field separators and problematic punctuation). To add insult to injury, R’s read.table() is in no way guaranteed to be able to read the output of R’s own write.table(), unless you remember to set enough options about quoting, escaping, line breaks and comments both during writing and reading. Finally, Excel’s native save-as-text or save-as-CSV has the amazing bug that any field that is too wide to display (and thus displayed as hash-marks) is in fact exported as hash-marks! So even having a copy of Excel isn’t enough to solve the Excel export problem.
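For example, here is a sketch of one matching set of options for a write.table()/read.table() round trip (these are not the defaults); the key is to use the same quoting convention on both sides and to turn off comments and backslash escapes when reading back:

# Write with doubled-quote escaping (the convention read.table() understands
# when quote='"'), then read back with comments and backslash escapes off.
write.table(d, "out.tab", sep = "\t", qmethod = "double", row.names = FALSE)
d2 <- read.table("out.tab", sep = "\t", header = TRUE, quote = "\"",
                 comment.char = "", allowEscapes = FALSE,
                 stringsAsFactors = FALSE)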
If needless complexity is causing problems, let us try avoiding the complexity. I have modified the Perl script found in the gdata package into a new version that exports to what I call “strong TSV.” In strong TSV fields never contain the field separator (tab) or a line separator (CR or LF), and there is no quoting, escaping or comments (quoting is no longer needed, so we just treat quotes as a standard character like any other). To export to strong TSV we just use a simple Perl regexp to replace all in-field whitespace with a standard space (slightly damaging the field, but saving a lot of down-stream complexity). The file converted in this manner can then be read reliably using R’s read.table() command:
d <- read.table('file.tab', quote='', comment.char='', allowEscapes=FALSE,
                sep='\t', header=TRUE, as.is=TRUE, stringsAsFactors=FALSE)
The modified Perl script (xls2goodTSV.pl) is placed in the Perl section of the gdata library (in my case Library/R/2.15/library/gdata/perl) and run in that environment.
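If your data is already sitting in an R data frame and you want to hand it to somebody else, the same idea can be applied on the R side. The sketch below (writeStrongTSV() is a hypothetical helper name, not part of any package) collapses all in-field whitespace to single spaces and writes an unquoted tab-separated file:

# Hypothetical helper: collapse tabs/CR/LF inside character fields to a
# single space, then write tab-separated with no quoting or escaping.
writeStrongTSV <- function(d, file) {
  cleaned <- as.data.frame(
    lapply(d, function(col) {
      if (is.character(col)) gsub("[[:space:]]+", " ", col) else col
    }),
    stringsAsFactors = FALSE, check.names = FALSE)
  write.table(cleaned, file, sep = "\t", quote = FALSE, row.names = FALSE)
}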
Some of these weaknesses are flaws in the CRAN-suggested R libraries. But some of the problems are due to the complexity of Excel and even of the quoting in exported Excel documents (and the hash-mark bug). All of these steps could be skipped if the data were available in a truly machine readable format like strong TSV, JSON or a SQL dump (all of which get quoting correct). My suggestion is to try strong TSV, as it is very simple and interoperates well with most tools (R, Linux command line tools and even Excel itself). The time you save may be your own.