Data analysis Hadley Wickham style
Table of contents
- magrittr: Pipes in R
- dplyr: Easy data frame manipulation
- tidyr: From wide to long data (replacing reshape2)
- Reading and storing big data
- Other packages from the Hadleyverse
- Sources and more blog posts
magrittr: Pipes in R
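magrittr introduces the pipe operator %>%, which passes the value on its left as the first argument of the function on its right, e.g. x %>% f(y) is the same as f(x, y).
With it, you get rid of the annoying inside-out wrapping of functions.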
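To make the difference concrete, here is a small sketch comparing a nested call with the same computation written as a pipeline. The vector x and the chain of functions are made up for illustration.

```r
library(magrittr)

x <- c(1, 4, 9, 16)

# Nested calls have to be read inside-out
nested <- round(mean(sqrt(x)), 1)

# The same computation, read left to right
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```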
The current development version on GitHub introduces several other operators:
- %T>% is a tee operator. Compare to UNIX’s tee command. It returns the left-hand side after applying the right-hand side.
- %$% exposes the data frame on the left to the expressions on the right (so you can omit the dataset$ in front of 1000 variables).
- %<>% works like %>%, but afterwards does not return the result of the whole chain; instead it overwrites the original symbol.
- %,% could later be the same thing for functionals, i.e. to build functions out of pipe commands.
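A quick sketch of the first three operators, using the built-in mtcars data and made-up vectors. I leave out %,% here, since I can't vouch for it being available in a released magrittr version.

```r
library(magrittr)

# %T>% runs the right-hand side for its side effect (here: printing),
# then passes the left-hand side on unchanged
total <- c(1, 2, 3) %T>% print() %>% sum()

# %$% exposes the columns of the data frame to the expression on the right
r <- mtcars %$% cor(mpg, wt)

# %<>% pipes and then assigns the result back to the original symbol
v <- c(9, 16, 25)
v %<>% sqrt()
v
```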
dplyr: Easy data frame manipulation
dplyr is a faster, more consistent version of plyr, but focuses only on data frames (it can handle data.tables too). plyr included functions like ddply, daply, etc.
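The most important functions here are filter(), slice(), arrange(), select(), rename(), distinct(), mutate(), transmute(), summarise(), sample_n(), sample_frac(), and group_by().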
All functions behave similarly (first argument is data frame, result is data frame), so the magrittr pipe is perfect for chaining these commands.
Further functions not mentioned here: joins, e.g. left_join(), which is a Hadley version of base::merge().
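For a rough idea of how a join compares to base R, here is a minimal sketch. The two data frames people and scores are invented for the example.

```r
library(dplyr)

# Hypothetical toy data
people <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bob", "Cy"),
                     stringsAsFactors = FALSE)
scores <- data.frame(id = c(1, 3, 4), score = c(90, 75, 60))

# Keep every row of `people`, fill in `score` where an id matches
joined <- left_join(people, scores, by = "id")

# Base R counterpart
merged <- merge(people, scores, by = "id", all.x = TRUE)

joined
```

Rows without a match (here id 2) get NA in the joined columns.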
filter() and slice()
Select a subset of the rows with filter(). It works similarly to subset(), but conditions passed as separate arguments are joined by & automatically, so filter(df, a, b) is equivalent to filter(df, a & b).
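As a sketch of the equivalence, using the built-in mtcars data (the conditions are arbitrary):

```r
library(dplyr)

# Two conditions as separate arguments ...
a <- filter(mtcars, cyl == 4, mpg > 30)

# ... give the same result as one condition joined with &
b <- filter(mtcars, cyl == 4 & mpg > 30)

# Base R counterpart
s <- subset(mtcars, cyl == 4 & mpg > 30)

identical(a, b)
```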
slice() selects rows by position.
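For instance, a minimal sketch on mtcars that picks the first three rows:

```r
library(dplyr)

# Rows 1 to 3, by position
first_three <- slice(mtcars, 1:3)
first_three
```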
arrange()
Reorder rows instead of selecting them. Use desc(year) to sort descending.
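A small sketch with a made-up data frame that has a year column:

```r
library(dplyr)

# Hypothetical toy data
df <- data.frame(year = c(2012, 2014, 2013), value = c(10, 30, 20))

ascending  <- arrange(df, year)        # oldest first
descending <- arrange(df, desc(year))  # newest first

descending
```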
select() and rename()
Select columns. Awesome: you can specify ranges and/or exclusions by name, not number. See ?select for details. You can use helpers like starts_with(), ends_with(), and contains().
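Roughly, selecting by range, by exclusion, and with a helper looks like this on the built-in iris data:

```r
library(dplyr)

# A range of columns, by name
by_range <- select(iris, Sepal.Length:Petal.Length)

# Everything except one column
no_species <- select(iris, -Species)

# All columns whose name starts with "Petal"
petals <- select(iris, starts_with("Petal"))

names(petals)
```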
Rename columns by using named arguments in select(). This drops all other columns. If you want to keep them, use rename().
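The difference between the two, sketched on iris (the new name species is arbitrary):

```r
library(dplyr)

# select() with a named argument renames and keeps only that column
only_renamed <- select(iris, species = Species)

# rename() renames and keeps everything else
all_columns <- rename(iris, species = Species)

names(all_columns)
```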
distinct()
Extract unique rows only. Similar to base::unique(), but faster.
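A minimal sketch with a made-up data frame containing one duplicated row:

```r
library(dplyr)

# Hypothetical toy data; the first two rows are identical
df <- data.frame(g = c("a", "a", "b", "b"), x = c(1, 1, 2, 3),
                 stringsAsFactors = FALSE)

unique_rows   <- distinct(df)     # drops the duplicated row
unique_groups <- distinct(df, g)  # unique values of g only

unique_rows
```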
mutate() and transmute()
mutate() is similar to base::transform(). It allows you to add new columns to a data frame. If you want to drop the old variables, use transmute().
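Side by side, with a made-up data frame of widths and heights:

```r
library(dplyr)

# Hypothetical toy data
df <- data.frame(width = c(1, 2), height = c(3, 4))

# mutate() keeps the old columns and adds the new one
with_area <- mutate(df, area = width * height)

# transmute() keeps only the newly computed column
area_only <- transmute(df, area = width * height)

with_area
```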
summarise()
Collapses a data frame into a single row. Use any of R’s aggregation functions: min, max, mean, sum, sd, etc. Additionally, dplyr gives you n() for counting, n_distinct() for counting uniques, and first(), last(), and nth() for picking values by position.
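A sketch of summarise() on an ungrouped, made-up data frame, mixing a base aggregation function with dplyr's own helpers:

```r
library(dplyr)

# Hypothetical toy data
df <- data.frame(g = c("a", "a", "b", "c"), x = c(1, 2, 3, 4),
                 stringsAsFactors = FALSE)

res <- summarise(df,
                 mean_x  = mean(x),        # base aggregation function
                 rows    = n(),            # number of rows
                 groups  = n_distinct(g),  # number of unique values
                 first_x = first(x),
                 last_x  = last(x))
res
```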
sample_n() and sample_frac()
Downsample a data frame to n observations or a specific fraction. You can use replace = TRUE for bootstrap samples and the weight argument for weighted sampling.
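A rough sketch on mtcars. The seed, the sample sizes, and the choice of wt as sampling weight are arbitrary.

```r
library(dplyr)

set.seed(1)  # only to make the sketch reproducible

ten      <- sample_n(mtcars, 10)       # 10 random rows
quarter  <- sample_frac(mtcars, 0.25)  # 25% of the rows

# Bootstrap sample: same size as the data, drawn with replacement
boot     <- sample_n(mtcars, nrow(mtcars), replace = TRUE)

# Heavier cars are more likely to be drawn
weighted <- sample_n(mtcars, 5, weight = wt)

nrow(boot)
```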
group_by()
Grouping makes the above verbs very powerful. group_by() returns the same data.frame, but with group attributes. The other functions (most notably summarise()) now work separately on each subgroup.
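For example, a minimal sketch that summarises mtcars per number of cylinders:

```r
library(dplyr)

by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(cars = n(), mean_mpg = mean(mpg))

# One row per value of cyl
by_cyl
```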
The verbs are affected by grouping as follows:
- grouped select() is the same as ungrouped select(), except that grouping variables are always retained.
- grouped arrange() orders first by grouping variables (in current dplyr versions only when called with .by_group = TRUE).
- mutate() and filter() are most useful in conjunction with window functions (like min(x) == x), and are described in detail in vignette("window-functions").
- sample_n() and sample_frac() sample the specified number/fraction of rows in each group.
- slice() extracts rows within each group.
- summarise() is easy to understand and very useful: it returns one row per group.
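To illustrate two of these, here is a sketch of grouped filter() with a window-style condition and grouped slice(), again on mtcars:

```r
library(dplyr)

by_cyl <- group_by(mtcars, cyl)

# Within each cylinder group, keep the lightest car(s)
lightest <- filter(by_cyl, wt == min(wt))

# The first two rows of each group
first_two <- slice(by_cyl, 1:2)

lightest
```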
tidyr: From wide to long data (replacing reshape2)
A newer, better version of reshape2. Integrates with dplyr.
Use gather() instead of melt() and spread() instead of dcast(). You also have separate() and unite() for splitting/combining columns if you have or want things like “male.control” and “female.treatment”.
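A small sketch of splitting and re-combining such a column. The data frame and the new column names sex and arm are invented for the example.

```r
library(tidyr)

# Hypothetical toy data
df <- data.frame(id = c(1, 2),
                 group = c("male.control", "female.treatment"),
                 stringsAsFactors = FALSE)

# Split "male.control" into two columns at the dot
split_up <- separate(df, group, into = c("sex", "arm"), sep = "\\.")

# And glue them back together
combined <- unite(split_up, "group", sex, arm, sep = ".")

split_up
```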
gather() is the function I mostly use. You provide (or pipe in) the data frame; with key you specify the column name of the new ID variable; with value you specify the column name of the measured variable; afterwards, you supply (unquoted) a comma-separated list of all measured variables, or a list of all ID variables, each prepended with a minus sign.
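The arguments described above, sketched with a made-up wide data frame. The column names variable and measurement are arbitrary choices.

```r
library(tidyr)

# Hypothetical wide data: one row per id, one column per measurement
wide <- data.frame(id = c(1, 2), height = c(180, 165), weight = c(80, 60))

# Name the measured variables ...
long <- gather(wide, key = "variable", value = "measurement", height, weight)

# ... or exclude the ID variables with a minus sign
long2 <- gather(wide, key = "variable", value = "measurement", -id)

# spread() goes back from long to wide
wide_again <- spread(long, key = variable, value = measurement)

long
```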
Reading and storing big data
- The data.table package implements a child class (i.e. it’s compatible) of data.frame that speeds up many operations on it and reduces file size and the amount of implicit copying.
- Use fread() from the data.table package instead of read.csv(), so CSV import can take as little as 2% of the time.
- library(readr) is a Hadley package that provides simplified functions for reading data as well, e.g. read_csv().
- Use the rhdf5 package to apply HDF5 to store big data sets in a compressed but easily sliceable format. This allows you to extract only a rectangular slice from your data set, if the whole thing doesn’t fit into memory or would take too long to load.
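As a sketch of the CSV readers mentioned in the list, the following writes mtcars to a temporary file and reads it back three ways. The file is tiny, so this only shows the calls, not the speed difference. I skip rhdf5 here because it is installed from Bioconductor.

```r
# A temporary CSV file to read back
path <- tempfile(fileext = ".csv")
write.csv(mtcars, path, row.names = FALSE)

base_df <- read.csv(path)           # base R
dt      <- data.table::fread(path)  # returns a data.table
tb      <- readr::read_csv(path)    # returns a tibble

dim(dt)
```

Both fread() and read_csv() return objects that are still data frames, so they work with the dplyr verbs above.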
Other packages from the Hadleyverse
- lubridate is a package for working with dates
- stringr offers simple string manipulation
- testthat and assertthat for nice testing and assertions
- devtools to facilitate code development
- ggvis lets you create graphics ggplot2-style, but interactively playable in RStudio or the web, shiny-style.
I never read into the documentation of these packages, but just use them on a case-by-case basis. It’s helpful to keep in mind they exist, though.
Sources and more blog posts
- Source for the magrittr part
- Source for the dplyr part (but also
- Source for the tidyr part