r - Performance of rbind.data.frame

Question

Welcome To Ask or Share your Answers For Others

r - Performance of rbind.data.frame

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I have a list of dataframes for which I am certain that they all contain at least one row (in fact, some contain only one row, and others contain a given number of rows), and that they all have the same columns (names and types). In case it matters, I am also certain that there are no NA's anywhere in the rows.

The situation can be simulated like this:

#create one row
onerowdfr<-do.call(data.frame, c(list(), rnorm(100) , lapply(sample(letters[1:2], 100, replace=TRUE), function(x){factor(x, levels=letters[1:2])})))
colnames(onerowdfr)<-c(paste("cnt", 1:100, sep=""), paste("cat", 1:100, sep=""))
#reuse it in a list
someParts<-lapply(rbinom(200, 1, 14/200)*6+1, function(reps){onerowdfr[rep(1, reps),]})

I've set the parameters (of the randomization) so that they approximate my true situation.

Now, I want to unite all these dataframes in one dataframe. I thought using rbind would do the trick, like this:

system.time(
result<-do.call(rbind, someParts)
)

Now, on my system (which is not particularly slow), and with the settings above, this takes is the output of the system.time:

   user  system elapsed 
   5.61    0.00    5.62

Nearly 6 seconds for rbind-ing 254 (in my case) rows of 200 variables? Surely there has to be a way to improve the performance here? In my code, I have to do similar things very often (it is a from of multiple imputation), so I need this to be as fast as possible.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

495 views

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:14:45+0000

Can you build your matrices with numeric variables only and convert to a factor at the end? rbind is a lot faster on numeric matrices.

On my system, using data frames:

> system.time(result<-do.call(rbind, someParts))
   user  system elapsed 
  2.628   0.000   2.636

Building the list with all numeric matrices instead:

onerowdfr2 <- matrix(as.numeric(onerowdfr), nrow=1)
someParts2<-lapply(rbinom(200, 1, 14/200)*6+1, 
                   function(reps){onerowdfr2[rep(1, reps),]})

results in a lot faster rbind.

> system.time(result2<-do.call(rbind, someParts2))
   user  system elapsed 
  0.001   0.000   0.001

EDIT: Here's another possibility; it just combines each column in turn.

> system.time({
+   n <- 1:ncol(someParts[[1]])
+   names(n) <- names(someParts[[1]])
+   result <- as.data.frame(lapply(n, function(i) 
+                           unlist(lapply(someParts, `[[`, i))))
+ })
   user  system elapsed 
  0.810   0.000   0.813

Still not nearly as fast as using matrices though.

EDIT 2:

If you only have numerics and factors, it's not that hard to convert everything to numeric, rbind them, and convert the necessary columns back to factors. This assumes all factors have exactly the same levels. Converting to a factor from an integer is also faster than from a numeric so I force to integer first.

someParts2 <- lapply(someParts, function(x)
                     matrix(unlist(x), ncol=ncol(x)))
result<-as.data.frame(do.call(rbind, someParts2))
a <- someParts[[1]]
f <- which(sapply(a, class)=="factor")
for(i in f) {
  lev <- levels(a[[i]])
  result[[i]] <- factor(as.integer(result[[i]]), levels=seq_along(lev), labels=lev)
}

The timing on my system is:

   user  system elapsed 
   0.090    0.00    0.091

Categories

r - Performance of rbind.data.frame

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags