I’m primarily a Java / Scala developer, but every so often I dabble in R, whether it’s to do a little data analysis, train statistical models, or plot some graphs. Probably 95% of my R code is in throw-away scripts that are undocumented and aren’t meant to be shared or ever looked at again. But now and then I write a utility function in R that I’d like to reuse, and even share with others. I’ve found that R’s “standard” methods for sharing & reusing code aren’t quite enough, but I think I’ve come up with a scheme that works pretty well.

TL;DR Follow the instructions in this gist.

One method for sharing R code is to create a package. Packages are fantastic ways to share polished tools with a wide audience. In fact, one of the major benefits of R is all the great packages that have been written by others and shared publicly on CRAN. But they are much too heavy-weight for some simple utility functions; I don’t want to go through all the work of creating a package when I write a little util which stores our database configuration.

Update: If you do want to write a package, I highly recommend the devtools package and Hadley Wickham’s book on writing packages (in progress, but lots of good content already available online).

The other way to re-use your utility functions is to stick them in a plain .R file, and then source() that file whenever you want to use it. This is perfect when you are working by yourself, and just on your own laptop. But it doesn’t work that well when you need to share your work with others in scripts that can be run repeatably, because of the way source() searches for the file.

I could reference my file with an absolute path, e.g. source("/Users/imran/myRutils/dbUtils.R"). Obviously, that would break if I send someone else all my .R files, because they’ll put everything in a different directory. I could just use relative paths: if my script lives in "/Users/imran/coolNewProject/analyzeData.R" and I run it from that directory, then I could load my utils with source("../myRutils/dbUtils.R").

This gets us further, but it has some problems as the code gets used more:

All team members have to use the exact same directory structure. Since we’re using relative paths, everybody has to put their utility functions in the same place relative to their scripts. This doesn’t sound too bad, but as the number of scripts grows this can become a pain.
It’s hard to have utilities “stand alone”, e.g. in their own git repo, or even to have multiple git repos of utilities from many different sources, as they all need to be put in exactly the right location to work correctly.
It’s hard for utilities to reference each other, while still allowing there to be some independence between them.

All of these reasons are really just different takes on the same issue: we want looser coupling between our scripts and our utility functions, and more flexibility on where we put our utility functions. In the Java world, the Java runtime has a somewhat similar problem of figuring out where to look for all the compiled code. You could just dump it all in one directory – but this can make it a pain to have shared components used by multiple different programs. Java solves this with a classpath – a list of places to search for code. For lack of a better term, I propose creating a “classpath” for R as well.

If instead of using source(), we use the mylib() function I have defined below, we can get R to search for our utilities in a handful of predefined locations. This way, we can keep our utilities in a completely separate location from the one-off scripts – in fact, the utilities can even be in a separate repo, checked out to a different location. This can make it a lot easier for a team to keep a set of shared utilities, which they all use, maintain, and contribute to, while they are free to put their hacky scripts wherever they like.

Here’s how we define mylib():
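
A minimal sketch of such a function, assuming source.dirs is a character vector of directories that are searched in order. The directory names below are made up for illustration, and the code in the gist may differ in its details:

```r
# Hypothetical search path: the directories to look in, in order.
# The name source.dirs comes from the post; the entries are examples only.
source.dirs <- c(
  file.path(Sys.getenv("HOME"), "myRutils"),
  file.path(Sys.getenv("HOME"), "sharedRutils")
)

# Look for `name` (with or without the .R extension) in each directory of
# `dirs` and source() the first match. Stops with an error if none is found.
mylib <- function(name, dirs = source.dirs) {
  file.name <- if (grepl("\\.[Rr]$", name)) name else paste0(name, ".R")
  candidates <- file.path(dirs, file.name)
  found <- candidates[file.exists(candidates)]
  if (length(found) == 0) {
    stop("mylib: could not find '", file.name, "' in: ",
         paste(dirs, collapse = ", "))
  }
  # source() evaluates in the global environment by default, so the
  # functions defined in the file become available to the caller.
  source(found[1])
  invisible(found[1])
}
```

Because the first match wins, the order of source.dirs decides which copy is used when two directories contain a file with the same name, much like the order of entries on a Java classpath.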

There is just one missing piece: how do we ensure this file gets loaded all the time? We don’t want to have to paste these function definitions at the top of every file, or in the beginning of every R session; we would be no better off than we were in the beginning.

We can solve this problem by using ~/.Rprofile, a file which contains ordinary R code that is automatically loaded every time R starts up. We can either paste the definitions from above into ~/.Rprofile, or we can put those definitions in another file, and just have ~/.Rprofile source that file. (Yes, in this one case, we’ll need to use source() and manually ensure the paths are correct.)
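
As a rough sketch of the second option, ~/.Rprofile could contain something like the following. The location ~/myRutils/mylib.R is a hypothetical example of where the definitions might live, not a required path:

```r
# Sketch of ~/.Rprofile contents.
# Hypothetical location of the file that defines mylib() and source.dirs.
mylib.file <- path.expand("~/myRutils/mylib.R")

# Guard with file.exists() so that R still starts cleanly on a machine
# where the utilities have not been checked out yet.
if (file.exists(mylib.file)) {
  source(mylib.file)
} else {
  message("mylib definitions not found at ", mylib.file)
}
```

The guard is a design choice rather than a requirement: without it, a missing file would produce an error every time R starts.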

After we do this small amount of setup, it becomes easy to use the mylib() function:
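
The example below is an illustration under made-up names, not the original snippet. It creates two temporary directories standing in for two separate utility repos, and includes a compact stand-in for mylib() so that it runs on its own; in real use, ~/.Rprofile would already have provided mylib() and source.dirs. It also shows one utility file pulling in another by name:

```r
# Two throwaway directories standing in for two separate utility repos.
team.repo <- file.path(tempdir(), "teamUtils")
my.repo <- file.path(tempdir(), "myRutils")
dir.create(team.repo, showWarnings = FALSE)
dir.create(my.repo, showWarnings = FALSE)

# stringUtils.R lives in the team repo.
writeLines("shout <- function(x) toupper(x)",
           file.path(team.repo, "stringUtils.R"))
# dbUtils.R lives in my own repo and pulls in stringUtils by name only.
writeLines(c('mylib("stringUtils")',
             'db.name <- function() shout("analytics")'),
           file.path(my.repo, "dbUtils.R"))

# Stand-in for the definitions that ~/.Rprofile would normally provide.
source.dirs <- c(my.repo, team.repo)
mylib <- function(name, dirs = source.dirs) {
  hits <- Filter(file.exists, file.path(dirs, paste0(name, ".R")))
  if (length(hits) == 0) stop("cannot find ", name, ".R in source.dirs")
  source(hits[[1]])
}

# The analysis script itself: no paths, just names.
mylib("dbUtils")
print(db.name())
```

Neither the script nor dbUtils.R mentions a directory, so either repo can be moved or checked out elsewhere by changing source.dirs alone.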

One final note for Mac users: if you want to use environment variables when defining your source.dirs, it’s a little tricky to make sure those environment variables are defined for GUI applications, like RStudio. You need to edit the file ~/.launchd.conf, add a line like setenv <VAR_NAME> <VAR_VALUE>, then execute launchctl < ~/.launchd.conf.
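
One possible way to build source.dirs from an environment variable is sketched below. The variable name MY_R_UTILS and the helper dirs.from.env() are hypothetical, and the variable is assumed to hold a list of directories joined by the platform's path separator (":" on Mac and Linux), in the style of a Java classpath:

```r
# Read a classpath-style environment variable and split it into directories.
# MY_R_UTILS is a made-up name; use whatever variable your team agrees on.
dirs.from.env <- function(var = "MY_R_UTILS") {
  raw <- Sys.getenv(var, unset = "")
  if (!nzchar(raw)) {
    # Variable missing or empty: fall back to an empty search path.
    return(character(0))
  }
  strsplit(raw, .Platform$path.sep, fixed = TRUE)[[1]]
}

source.dirs <- dirs.from.env()
```

If the variable is not visible to R (for example in a GUI session), source.dirs ends up empty, which is an easy condition to check for when debugging.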

Now that we’ve got our utility functions nicely separated, so they are easier to reuse and maintain, the next step would be to set up some unit tests for them. ~~But, that will have to wait for a future blog post :)~~ Hadley Wickham’s book on R packages also has a good section on unit testing.

I hope this was helpful. Honestly, this is a system I came up with because there didn’t seem to be a better way. I’m curious whether bigger teams of R users have developed other ways of dealing with this problem; if so, I’d love to hear about it in the comments.

Update: Hadley Wickham commented on Twitter that I should just bite the bullet and write packages. After that comment, I took a closer look at the devtools package and his upcoming book. They definitely make the process of writing and using your own packages much easier.

It’s still a lot more complicated than just sticking your functions in a file, though. By all means, if you are motivated to write a package, do that instead. But I imagine there are many organizations where that simply won’t happen because of the overhead involved. Hopefully this is a happy compromise in those situations.

Imran Rashid

Simple "Classpath" for Reusable R Scripts
