medium_articles/Data Wrangling/introduction_to_dtplyr.Rmd at master · erbat1/medium_articles · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
---
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Introduction to `dtplyr`

Learn how to easily combine `dplyr`'s readability with `data.table`'s performance!

I recently saw a [Tweet](https://twitter.com/hadleywickham/status/1153763948447043584) by Hadley Wickham about the release of `dtplyr`. It is a package that enables working with `dplyr` syntax on `data.table` objects. It automatically translates the `dplyr` syntax to a `data.table` equivalent, which results in a performance boost.

Marvel: Infinity War is the most ambitious crossover event in history.
Hadley Wickham: Hold my beer.

I always liked the ease and readability of `dplyr` and was eager to compare the performance of the new package Let's see how it works in practice!

## Loading libraries

For this article, we need to install `dtplyr` from GitHub by running `devtools::install_github("tidyverse/dtplyr")` and we use `microbenchmark` for performance comparison.

```{r install_packages, echo=FALSE}
# install required packages

# install.packages("dplyr")
# install.packages("data.table")
# install.packages("devtools")
# install.packages("microbenchmark")
# devtools::install_github("tidyverse/dtplyr")
```

```{r}
# import packages
library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
library(microbenchmark)
library(stringi)
```

## Generating the dataset

We generate an artificial dataset. The first thing that came to my mind is an order registry, where we store:

- `id` of the client
- name of the `product`
- `date` of purchase
- `amount` of product purchased
- unit `price` of a certain product

As this is only a toy example, we do not dive deeply into the logic behind the dataset, as we can agree that it vaguely resembles a real-life scenario. For testing the performance of different approaches, we generate 10 million rows of data.

```{r}
# specify size
n <- 1e7

# generate data.table with random data
data_dt <- data.table(id=stri_rand_strings(n, 3, pattern = "[A-Z]"),
                      product=stri_rand_strings(n, 3, pattern = "[A-Z]"),
                      date=sample(seq(as.Date('2019/01/01'), as.Date('2019/04/01'), by="day"), n, replace=TRUE),
                      amount=sample(1:10000,n,replace=TRUE),
                      price=rnorm(n, mean = 100, sd = 20))

data_dt_lazy <- lazy_dt(data_dt)

# convert to data.frame
data_df <- as.data.frame(data_dt)
```

By using `lazy_dt()` we trigger the lazy evaluation - no computation is performed until we explicitly request it by using `as.data.table()`, `as.data.frame()` or `as_tibble()`. For the sake of comparison, we store one `data.frame`, one `data.table` and one "lazy" `data.table`.

We can preview the transformation, as well as the generated `data.table` code by printing the result:

```{r}
data_dt_lazy %>%
    filter(amount < 500) %>%
    arrange(id)
```
Generally, this should be used for debugging. We should indicate what kind of object we want to receive at the end of the pipeline to clearly show that we are done with the transformations.

## Use-case 1: Filtering, Selecting and Sorting

Let's say we want to have a list of transactions that happened before `2019-02-01`, sorted by date, and we do not care about either the amount or price.

```{r}
mbm <- microbenchmark("dplyr" = {
                                    result_df <- data_df %>%
                                                 filter(date < as.Date('2019-02-01')) %>%
                                                 select(c(id, product, date)) %>%
                                                 arrange(date)
                                },
                       "data.table" = {
                                    result_dt <- data_dt[date < as.Date('2019-02-01'), .(id, product, date)][order(date)]
                                },
                       "dtplyr" = {
                                    result_dtplyr <- data_dt_lazy %>%
                                                     filter(date < as.Date('2019-02-01')) %>%
                                                     select(c(id, product, date)) %>%
                                                     arrange(date) %>%
                                                     as.data.table()
                                  })

mbm
```

In this example, we want to filter orders with a number of products over 5000 and calculate the order value, which is `amount * price`.

Most of the expressions using `mutate()` must make a copy (do not modify in-place), which would not be necessary when using `data.table` directly. To counter for that, we can specify `immutable = FALSE` in `lazy_dt()` to opt-out of the mentioned behavior.

```{r}
data_dt_lazy <- lazy_dt(data_dt, immutable=FALSE)

mbm <- microbenchmark("dplyr" = {
                                    result_df <- data_df %>%
                                                 filter(amount >= 5000) %>%
                                                 mutate(order_value = amount * price)
                                },
                       "data.table" = {
                                    result_dt <- data_dt[amount >= 5000][, order_value := amount * price]
                                },
                       "dtplyr" = {
                                    result_dtplyr <- data_dt_lazy %>%
                                                     filter(amount >= 5000) %>%
                                                     mutate(order_value = amount * price) %>%
                                                     as.data.table()
                                  })

mbm
```

## Use-case 3: Aggregation on top

Let's say we want to:

1. Filter all orders on amount <= 4000
2. Calculate the average order value per customer

```{r}
data_dt_lazy <- lazy_dt(data_dt, immutable=FALSE)

mbm <- microbenchmark("dplyr" = {
                                    result_df <- data_df %>%
                                                 filter(amount <= 4000) %>%
                                                 mutate(order_value = amount * price) %>%
                                                 group_by(id) %>%
                                                 summarise(avg_order_value = mean(order_value))
                                },
                      "data.table" = {
                                    result_dt <- data_dt[amount <= 4000][, order_value := amount * price][, .(avg_order_value = mean(order_value)), keyby = id]
                                     },
                      "dtplyr" = {
                                    result_dtplyr <- data_dt_lazy %>%
                                                     filter(amount <= 4000) %>%
                                                     mutate(order_value = amount * price) %>%
                                                     group_by(id) %>%
                                                     summarise(avg_order_value = mean(order_value))  %>%
                                                     as.data.table()
                                  })

mbm
```

## Use-case 4: Joining

In the last example, we consider a case of joining datasets. For that, we create a new `data.frame`/`data.table` called `product` by selecting 75% of the available products and assigning a random letter to them. We can assume that the letter corresponds to a distribution center (variable called `origin`), from which the item is shipped.

```{r}
# create product tables
product_dt <- unique(data_dt[, .(product)])[sample(.N, 0.75*.N, replace=FALSE)][, origin := stri_rand_strings(.N, 1, pattern = "[A-Z]")]

product_dt_lazy <- lazy_dt(product_dt, immutable=FALSE)

product_df <- as.data.frame(product_dt)

```

We want to calculate the average order value per distribution center. In case we do not have data regarding the distribution center, we discard the row.

```{r}
data_dt_lazy <- lazy_dt(data_dt, immutable=FALSE)

mbm <- microbenchmark("dplyr" = {
                                    result_dplyr <- data_df %>%
                                                    left_join(., product_df, by = 'product') %>%
                                                    na.omit() %>%
                                                    mutate(order_value = amount * price) %>%
                                                    group_by(origin) %>%
                                                    summarise(avg_order_value = mean(order_value)) %>%
                                                    arrange(desc(avg_order_value))
                                },
                      "data.table" = {
                                    result_dt <- data_dt[product_dt, on = .(product)][, order_value := amount *
    price][, .(avg_order_value = mean(order_value)), keyby = .(origin)][order(desc(avg_order_value))]
                                     },
                      "dtplyr" = {
                                    result_dtplyr <- data_dt_lazy %>%
                                                     left_join(., product_dt_lazy, by = 'product') %>%
                                                     na.omit() %>%
                                                     mutate(order_value = amount * price) %>%
                                                     group_by(origin) %>%
                                                     summarise(avg_order_value = mean(order_value)) %>%
                                                     arrange(desc(avg_order_value)) %>%
                                                     as.data.table()
                                  })

mbm
```

## Conclusions

`dtplyr` is (and always will be) slightly slower than `data.table`. That is because:

1. Each `dplyr` verb must be converted to a `data.table` equivalent. For large datasets, this should be negligible, as these translation operations take time proportional to the complexity of the input code, rather than the amount of data.
2. Some `data.table` expressions have no direct `dplyr` equivalent.
3. Immutability issue mentioned in use-case 2.

Summing up, I believe that `dtplyr` is a valuable addition to the `tidyverse`, as with only small changes to the `dplyr` code, we can achieve significant performance improvements.

As always, any constructive feedback is welcome. You can reach out to me on [Twitter](https://twitter.com/erykml1) or in the comments. You can find the code used for this article on my [GitHub](https://github.com/erykml/medium_articles/blob/master/introduction_to_dtplyr.Rmd).