The function set
or the expression :=
inside [.data.table
allows user to update data.tables by reference. How does this behavior differ from reassigning the result of an operation to the original data.frame?
keepcols<-function(DF,cols){
eval.parent(substitute(DF<-DF[,cols,with=FALSE]))
}
keeprows<-function(DF,i){
eval.parent(substitute(DF<-DF[i,]))
}
Because the RHS in the expression <-
is a shallow copy of the initial dataframe in recent versions of R, these functions seem pretty efficient. How is this base R method different from the data.table equivalent? Is the difference related only to speed or also memory use? When is the difference most sizable?
Some (speed) benchmarks. It seems that the speed difference is negligible when the dataset has only two variables, and get bigger with more variables.
library(data.table)
# Long dataset
N=1e7; K=100
DT <- data.table(
id1 = sample(sprintf("id%03d",1:K), N, TRUE),
v1 = sample(5, N, TRUE)
)
system.time(DT[,a_inplace:=mean(v1)])
user system elapsed
0.060 0.013 0.077
system.time(DT[,a_inplace:=NULL])
user system elapsed
0.044 0.010 0.060
system.time(DT <- DT[,c(.SD,a_usual=mean(v1)),.SDcols=names(DT)])
user system elapsed
0.132 0.025 0.161
system.time(DT <- DT[,list(id1,v1)])
user system elapsed
0.124 0.026 0.153
# Wide dataset
N=1e7; K=100
DT <- data.table(
id1 = sample(sprintf("id%03d",1:K), N, TRUE),
id2 = sample(sprintf("id%03d",1:K), N, TRUE),
id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE),
v1 = sample(5, N, TRUE),
v2 = sample(1e6, N, TRUE),
v3 = sample(round(runif(100,max=100),4), N, TRUE)
)
system.time(DT[,a_inplace:=mean(v1)])
user system elapsed
0.057 0.014 0.089
system.time(DT[,a_inplace:=NULL])
user system elapsed
0.038 0.009 0.061
system.time(DT <- DT[,c(.SD,a_usual=mean(v1)),.SDcols=names(DT)])
user system elapsed
2.483 0.146 2.602
system.time(DT <- DT[,list(id1,id2,id3,v1,v2,v3)])
user system elapsed
1.143 0.088 1.220
See Question&Answers more detail:os