While assessing the utility of data.table
(vs. dplyr
), a critical factor is the ability to use it within functions and loops.
For this, I've modified the code snippet used in this post: data.table vs dplyr: can one do something well the other can't or does poorly? so that, instead of hard-coded dataset variables names ("cut" and "price" variables of "diamonds" dataset), it becomes dataset-agnostic - cut-n-paste ready for the use inside any function or a loop (when we don't know column names in advance).
This is the original code:
library(data.table)
dt <- data.table(ggplot2::diamonds)
dt[cut != "Fair", .(mean(price),.N), by = cut]
This is its dataset-agnostic equivalent:
dt <- data.table(diamonds)
nVarGroup <- 2 #"cut"
nVarMeans <- 7 #"price"
strGroupConditions <- levels(dt[[nVarGroup]])[-1] # "Good" "Very Good" "Premium" "Ideal"
strVarGroup <- names(dt)[nVarGroup]
strVarMeans <- names(dt)[nVarMeans]
qAction <- quote(mean(get(strVarMeans))) #! w/o get() it does not work!
qGroup <- quote(get(strVarGroup) %in% strGroupConditions) #! w/o get() it does not work!
dt[eval(qGroup), .(eval(qAction), .N), by = strVarGroup]
Note (Thanks to reply below): if you need to change variable value by reference, you need to use ()
, not get()
, as shown below:
strVarToBeReplaced <- names(dt)[1]
dt[eval(qGroup), (strVarToBeReplaced) := eval(qAction), by = strGroup][]
Now: you can cut-n-paste the following code for all your looping needs:
for(nVarGroup in 2:4) # Grouped by several categorical values...
for(nVarMeans in 5:10) { # ... get means of all numerical parameters
strGroupConditions <- levels(dt[[nVarGroup]])[-1]
strVarGroup <- names(dt)[nVarGroup]
strVarMeans <- names(dt)[nVarMeans]
qAction <- quote(mean(get(strVarMeans)))
qGroup <- quote(get(strVarGroup) %in% strGroupConditions)
p <- dt[eval(qGroup), .(AVE=eval(qAction), COUNT=.N), by = strVarGroup]
print(sprintf("nVaGroup=%s, nVarMeans=%s: ", strVarGroup, strVarMeans))
print(p)
}
My first question:
The code above, while enabling the required functional/looping needs, appears quite convoluted. - It uses different multiple (possibly non-consistent) non-intuitive tricks such combination of ()
, get()
, quote()
/eval()
, [[]]
). Seems too many for a such straightforward need...
Is there another better way of accessing and modifying data.tables values in loops? Perhaps with on=
, lapply
/.SD
/.SDcols
?
Please share your ideas below. This discussion aims to supplement and consolidate related bits from other posts (such as listed here: How can one work fully generically in data.table in R with column names in variables). Eventually, it would be great to create a dedicated vignette for using data.table
within functions
and loops
.
The second question:
Is dplyr easier for this purpose? - For this question however, I've set a separate post: Is dplyr easier than data.table to be used within functions and loops?.