Introduction: All your base are annoying, slow, and large

Compared to other languages, R has good utilities for managing data. Most prominent among these utilities is R’s data.frame class, which gives you all the basic features you’d want in a data soring class (e.g. row and column names, mixed data types).

That being said, data.frames are not without their problems, of which I highlight two (spoiler alert: data.table addresses both):

  1. The syntax is cumbersome. At its core, a data.frame is just a list named list of columns where all of the columns happen to be the same length.
    This means that common data operations like subsetting can have annoyingly redundant syntax.

  2. Operations are slow. As a list of lists, data.frame doesn’t do anything to make your data more compact or organized. On the contrary, this means that accessing data stored in a data.frame is inefficient because it requires multiple searches through individual lists.

data.table syntax

For demosntration purposes, let’s load R’s built in iris database. By default, it’s a data.frame, so we’ll create a copy that’s a data.table.

library(data.table)
data(iris)
iris.dt <- data.table(iris)
dim(iris.dt)
## [1] 150   5
class(iris.dt)
## [1] "data.table" "data.frame"
names(iris.dt)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"

Implict environment scoping

The first thing to understand about data.table syntax is that everything inside brackets is scoped to the data.table in question. If you’ve ever used R’s with function, it’s exactly the same idea. For instance, to access a column, by default, you do not use quotes because the variable Sepal.Length is in the local data.table environment.

head(iris.dt[,Sepal.Length])
## [1] 5.1 4.9 4.7 4.6 5.0 5.4

In fact, if you try to access the variable with quotes or by index, you’ll get a strange result:

iris.dt[, "Sepal.Length"]
## [1] "Sepal.Length"
iris.dt[, 3]
## [1] 3

If you do want to use strings or column indices instead of environment variables (for instance, if you’re passing column names from a variable), you can disable the with syntax by just adding a with=FALSE inside the brackets.

cols <- c("Sepal.Length", "Sepal.Width")
head(iris.dt[, cols, with=FALSE])
##    Sepal.Length Sepal.Width
## 1:          5.1         3.5
## 2:          4.9         3.0
## 3:          4.7         3.2
## 4:          4.6         3.1
## 5:          5.0         3.6
## 6:          5.4         3.9
head(iris.dt[, c(2,4), with=FALSE])
##    Sepal.Width Petal.Width
## 1:         3.5         0.2
## 2:         3.0         0.2
## 3:         3.2         0.2
## 4:         3.1         0.2
## 5:         3.6         0.2
## 6:         3.9         0.4

You CAN however, access rows by index without any additional syntax.

iris.dt[2:4]    # Note that the comma is optional
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1:          4.9         3.0          1.4         0.2  setosa
## 2:          4.7         3.2          1.3         0.2  setosa
## 3:          4.6         3.1          1.5         0.2  setosa

Notice that for a single column, this operation returns a vector, but for multiple columns, it returns a data.table.

Granted, this is barely an improvement over data.frame, where the same operation is just iris[, "Sepal.Length"]. However, where this starts to shine is in subsetting operations. For instance, in base R, if you wanted just the rows with Sepal.Length < 5, you would have to do something like:

slt5 <- iris[iris$Sepal.Length < 5,]
head(slt5)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa

By contrast, the same operation in data.table is:

slt5 <- iris.dt[Sepal.Length < 5]  # The comma is optional

Operating on columns

Another cool feature is that you can call functions directly on columns within the data.table. For instance, if I wanted to get the mean sepal width for only plants with sepal length < 5, I could do:

iris.dt[Sepal.Length < 5, mean(Sepal.Width)]
## [1] 3.077273
head(iris.dt[, Sepal.Width * Sepal.Length])
## [1] 17.85 14.70 15.04 14.26 18.00 21.06

It gets better. Built into data.table is a by option that lets you perform aggregation functions by subsets based on columns. For example:

iris.dt[ ,mean(Sepal.Width), by=Species]
##       Species    V1
## 1:     setosa 3.428
## 2: versicolor 2.770
## 3:  virginica 2.974
iris.dt[ ,mean(Sepal.Width), by=list(Species, Sepal.Length < 5)]
##       Species Sepal.Length       V1
## 1:     setosa        FALSE 3.620000
## 2:     setosa         TRUE 3.140000
## 3: versicolor        FALSE 2.777551
## 4: versicolor         TRUE 2.400000
## 5:  virginica        FALSE 2.983673
## 6:  virginica         TRUE 2.500000

Note that I used a list in the by statement to select multiple columns. A common mistake is to use the c function instead (i.e.
by=c(column1,column2)), but that won’t work UNLESS you have the with=FALSE flag.

Unique operations

Similar to this is a very useful function in data.table.N, which counts the numbers of each element.

iris.dt[, .N, by=list(Species, Sepal.Length < 5)]
##       Species Sepal.Length  N
## 1:     setosa        FALSE 30
## 2:     setosa         TRUE 20
## 3: versicolor        FALSE 49
## 4: versicolor         TRUE  1
## 5:  virginica        FALSE 49
## 6:  virginica         TRUE  1

Notice that each of these returns a data.table that you can save and manipulate however you want.

Yet another construct is .SD, which refers by default to all of the columns in the data.table not referenced by the subset command or a subset of columns specified by the user. This is very handy if you have a lot of columns and want to calculate some conditional statistic for each. The trick is to recognize that a data.table is still a list of columns and therefore can be worked on by the lapply function:

iris.dt[, lapply(.SD, mean), by=list(Species, Sepal.Length < 5)]
##       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1:     setosa        FALSE    3.620000     1.493333    0.280000
## 2:     setosa         TRUE    3.140000     1.415000    0.195000
## 3: versicolor        FALSE    2.777551     4.279592    1.332653
## 4: versicolor         TRUE    2.400000     3.300000    1.000000
## 5:  virginica        FALSE    2.983673     5.573469    2.032653
## 6:  virginica         TRUE    2.500000     4.500000    1.700000

Notice that the mean sepal lengths were not calculated because it was used as a criterion for subsetting. Also, data.table is not smart enough to selectively apply functions to only numeric columns, so you may get unexpected results if you have numeric functions applied to non-numeric columns.

iris.dt[, lapply(.SD, mean), by=Sepal.Length < 5]
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1:        FALSE    3.053906     4.121875    1.353906 versicolor
## 2:         TRUE    3.077273     1.640909    0.300000     setosa

To specify columns, add the .SDcols argument. Note that .SDcols are given as a character vector, NOT as a list.

iris.dt[, lapply(.SD, mean), by=Species, .SDcols=c("Sepal.Width", "Sepal.Length")]
##       Species Sepal.Width Sepal.Length
## 1:     setosa       3.428        5.006
## 2: versicolor       2.770        5.936
## 3:  virginica       2.974        6.588

Adding columns

Adding a new column to a data.table can be accomplished by using standard, data.frame syntax…

iris.dt$random.index <- sample(nrow(iris.dt))

…but there is also a way that lets you more conveniently leverage the data.table subsetting syntax.

iris.dt[, longer.than.five := FALSE][Sepal.Length > 5, longer.than.five := TRUE]
iris.dt[, .N, by=longer.than.five]
##    longer.than.five   N
## 1:             TRUE 118
## 2:            FALSE  32

A few points about that last bit. First, notice the := internal assignment operator. When you perform data.table assignments in this way, nothing is returned because the data.table is modified in place.

Second, notice that data.table lets you chain commands by simply stacking brackets back-to-back. The commands are executed sequentially – first, a column of all FALSE values was added to iris.dt, and then some of the values in this data.table were changed to TRUE.

Renaming columns

One major difference between data.table and data.frame operations is that the former generally modify the object in place while the latter make lots of copies. This is a core aspect of what makes data.table so fast and effective for working with large data, but it does mean that certain data.frame functions don’t work for data.table. Instead, they are replaced by set functions, such as the setnames function.

setnames is very intuitive. It takes 2-3 arguments: the name of the data.table, [optionally] a vector of old names that you’re changing, and a vector of new names that map to the old ones (or all the columns by default).

names(iris.dt)
## [1] "Sepal.Length"     "Sepal.Width"      "Petal.Length"    
## [4] "Petal.Width"      "Species"          "random.index"    
## [7] "longer.than.five"
setnames(iris.dt, c("Sepal.Length", "Sepal.Width"), c("sl", "sw"))
names(iris.dt)
## [1] "sl"               "sw"               "Petal.Length"    
## [4] "Petal.Width"      "Species"          "random.index"    
## [7] "longer.than.five"
setnames(iris.dt, c("sl", "sw"), c("Sepal.Length", "Sepal.Width"))

A similar function is the setcolorder function for–you guessed it–reordering columns.

Keys: Faster and more convenient searches

I hinted earlier that searching operations in data.frame are very inefficient, and are much faster in data.table. But if you’re working with really big data, you’ll want to squeeze even more speed out, which you can do by telling data.table more about how your data is organized and letting it optimize its memory to access it as fast as possible. The operation for doing this is the setkey operation.

Say I’m going to be doing a lot of calculations in the iris database by species. Before doing everything, I’ll run:

setkey(iris.dt, Species)
print(iris.dt)
##      Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
##   1:          5.1         3.5          1.4         0.2    setosa
##   2:          4.9         3.0          1.4         0.2    setosa
##   3:          4.7         3.2          1.3         0.2    setosa
##   4:          4.6         3.1          1.5         0.2    setosa
##   5:          5.0         3.6          1.4         0.2    setosa
##  ---                                                            
## 146:          6.7         3.0          5.2         2.3 virginica
## 147:          6.3         2.5          5.0         1.9 virginica
## 148:          6.5         3.0          5.2         2.0 virginica
## 149:          6.2         3.4          5.4         2.3 virginica
## 150:          5.9         3.0          5.1         1.8 virginica
##      random.index longer.than.five
##   1:            5             TRUE
##   2:           44            FALSE
##   3:           68            FALSE
##   4:           48            FALSE
##   5:          137            FALSE
##  ---                              
## 146:           75             TRUE
## 147:          132             TRUE
## 148:           73             TRUE
## 149:           10             TRUE
## 150:           51             TRUE

You may have noticed that the data.table is now sorted alphabetically by species. This isn’t just eye candy – it has major performance implications, since it’s much easier to perform operations on contiguous data that to search for each case.

Another effect of setting a key is that I can now subset based on the key without actually saying what I’m searching for. For instance:

head(iris.dt["setosa"])
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species random.index
## 1:          5.1         3.5          1.4         0.2  setosa            5
## 2:          4.9         3.0          1.4         0.2  setosa           44
## 3:          4.7         3.2          1.3         0.2  setosa           68
## 4:          4.6         3.1          1.5         0.2  setosa           48
## 5:          5.0         3.6          1.4         0.2  setosa          137
## 6:          5.4         3.9          1.7         0.4  setosa          145
##    longer.than.five
## 1:             TRUE
## 2:            FALSE
## 3:            FALSE
## 4:            FALSE
## 5:            FALSE
## 6:             TRUE
iris.dt[c("setosa", "virginica"), .N, by=Species]
##      Species  N
## 1:    setosa 50
## 2: virginica 50

You can set and reset keys as many times as you’d like, and you can even set multiple keys, although you need the setkeyv function for that. (The difference is that setkey takes columns as they are–no quotes–while setkeyv interprets column names from strings. This also means you can pass it a preset vector of column names for keying, which you can’t do with setkey).

keys <- c("Species", "longer.than.five")
setkeyv(iris.dt, keys)
iris.dt
##      Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
##   1:          4.9         3.0          1.4         0.2    setosa
##   2:          4.7         3.2          1.3         0.2    setosa
##   3:          4.6         3.1          1.5         0.2    setosa
##   4:          5.0         3.6          1.4         0.2    setosa
##   5:          4.6         3.4          1.4         0.3    setosa
##  ---                                                            
## 146:          6.7         3.0          5.2         2.3 virginica
## 147:          6.3         2.5          5.0         1.9 virginica
## 148:          6.5         3.0          5.2         2.0 virginica
## 149:          6.2         3.4          5.4         2.3 virginica
## 150:          5.9         3.0          5.1         1.8 virginica
##      random.index longer.than.five
##   1:           44            FALSE
##   2:           68            FALSE
##   3:           48            FALSE
##   4:          137            FALSE
##   5:           30            FALSE
##  ---                              
## 146:           75             TRUE
## 147:          132             TRUE
## 148:           73             TRUE
## 149:           10             TRUE
## 150:           51             TRUE

Merging data tables

First, let’s create data to merge in.

iris.supp <- data.table(species_name = c("setosa","versicolor", "virginica"),
                        testcode = c("SET", "VER", "VGA"))

The basic data.frame merge syntax will work.

iris.merge <- merge(iris.dt, iris.supp, by.x = "Species", by.y="species_name")

That said, a much faster merge can be done using custom data.table syntax that harnesses the power of keys.

setkey(iris.dt, Species)
setkey(iris.supp, species_name)
iris.merge <- iris.dt[iris.supp]

Note that this is only a partial join – i.e. iris.dt will only take values from iris.supp that are in iris.dt, and any values in iris.supp that are not in iris.dt will not be added. For full joins, use the merge function with the all=TRUE argument.

Performance – speed and size

Whether or not data.table syntax is better than data.frame is a subjective quesiton. However, the speed of data.table is unquestionable. To do this, let’s simulate a large dataset. (Feel free to change the value of nvals and nvals.write to see how data.table speeds scale, or to accomodate slower PC’s with less memory).

nvals <- 1e7
col1 <- sample(letters[1:5], nvals, replace=TRUE)
col2 <- rnorm(nvals)
system.time(dat.df <- data.frame(col1, col2))
##    user  system elapsed 
##   0.304   0.032   0.336
system.time(dat.dt <- data.table(col1, col2))
##    user  system elapsed 
##   0.092   0.016   0.107

Note the almost twofold increase in speed in creating a data.frame compared to a data.table. Now, let’s compare the calculation of an average.

system.time(with(dat.df, tapply(col2, col1, mean)))
##    user  system elapsed 
##   0.596   0.052   0.648
system.time(dat.dt[, mean(col2), by=col1])
##    user  system elapsed 
##   0.112   0.016   0.127
setkey(dat.dt, col1)
system.time(dat.dt[, mean(col2), by=col1])
##    user  system elapsed 
##   0.108   0.000   0.109

Without keys, data.table is approximately 5 times faster on my machine. With keys, it’s close to 30 times faster. Hard to argue with that.

On a similar note, the data.table function fread is dramatically faster at reading in large files than base R’s read.table (or its derivatives like read.csv).

nvals.write <- 1e6
write.table(dat.dt[1:nvals.write], file="testfile.dat", row.names=FALSE)
system.time(test.df <- read.table("testfile.dat"))
##    user  system elapsed 
##  10.244   0.036  10.371
system.time(test.dt <- fread("testfile.dat", header=TRUE))
##    user  system elapsed 
##   0.260   0.004   0.267

Not only is fread dramatically faster, but, if the process takes more than a second or two, it helpfully prints information on its progress, including the size of the file and the percentage read so far.

That speed isn’t completely free though – fread achieves its speed by quickly determining features of the file and assuming they are constant throughout. I’ve run into situations where weird line endings or similar things can cause fread to fail in situations where read.table worked just fine.
But, fread does have the advantage that you can fine-tune exactly what it reads in to speed it up even more. For example, you can tell it to ignore the first n lines of a file (useful for reading in instrument data that has long headers), to only read in specific rows or columns (allowing you to subset data before even reading it in to save memory and time), or to assume columns have a specific class (rather than making it try to figure it out independently).

Common data.table mistakes

Below are some important examples of counterintuitive behavior by data.table that have tripped me up in the past. This is an evolving list that I’ll update as I get tripped up by other things.

Shallow vs. true copying

Because data.table is optimized for working with very large datasets, it doesn’t treat variables the way you expect. Specifically, when you reassign a data.table to a new variable name, both the old and new variable names point to the same object, meaning that any changes you make to one are applied to both. Case in point:

names(iris.dt)
## [1] "Sepal.Length"     "Sepal.Width"      "Petal.Length"    
## [4] "Petal.Width"      "Species"          "random.index"    
## [7] "longer.than.five"
iris.dt2 <- iris.dt
iris.dt2[, random.index.2 := sample(letters, nrow(iris.dt2), replace=TRUE)]
names(iris.dt2)
## [1] "Sepal.Length"     "Sepal.Width"      "Petal.Length"    
## [4] "Petal.Width"      "Species"          "random.index"    
## [7] "longer.than.five" "random.index.2"
names(iris.dt)
## [1] "Sepal.Length"     "Sepal.Width"      "Petal.Length"    
## [4] "Petal.Width"      "Species"          "random.index"    
## [7] "longer.than.five" "random.index.2"

The way to do a full object copy is simple, through the aptly named copy command. You just have to remember to do it!

iris.dt[, random.index.2 := NULL]
iris.dt2 <- copy(iris.dt)
iris.dt2[, random.index.2 := sample(letters, nrow(iris.dt2), replace=TRUE)]
names(iris.dt2)
## [1] "Sepal.Length"     "Sepal.Width"      "Petal.Length"    
## [4] "Petal.Width"      "Species"          "random.index"    
## [7] "longer.than.five" "random.index.2"
names(iris.dt)
## [1] "Sepal.Length"     "Sepal.Width"      "Petal.Length"    
## [4] "Petal.Width"      "Species"          "random.index"    
## [7] "longer.than.five"