Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

I want to take the columns of a data.frame/matrix and apply a function to each pair of columns, storing the result in cell [i, j] of an output matrix, where i and j index the columns of the data.frame. Basically, I want to fill a matrix cell by cell in the same way that the cor function does for a data.frame.

This is a related question: Create a matrix from a function and two numeric data frames. However, I use this in randomization tests and repeat the operation many times (building many matrices), so I'm looking for the fastest way to do it. I have sped things up a bit with parallel processing, but I'm still not happy with the speed. The output matrix cannot be assumed to be symmetric, unlike the matrix cor produces (my example reflects this).

I saw on the data.table web page today (http://datatable.r-forge.r-project.org/) the following:

500+ times faster than DF[i,j]<-value

This got me thinking that perhaps data.table or dplyr or other means may speed things up a bit. My brain has been fixed on filling cells but maybe there's a better way involving reshaping, applying the function and reshaping to a matrix or something along those lines. I can achieve this in base R using outer or a for loop as follows.

## Arbitrary function
FUN <- function(x, y) round(sqrt(sum(x)) - sum(y), digits=1)

## outer approach
outer(
  names(mtcars), 
  names(mtcars), 
  Vectorize(function(i,j) FUN(mtcars[,i],mtcars[,j]))
)

## for approach
mat <- matrix(rep(NA, ncol(mtcars)^2), ncol(mtcars))
for (i in 1:ncol(mtcars)) {
    for (j in 1:ncol(mtcars)) {
        mat[i, j] <- FUN(mtcars[, i], mtcars[, j])
    }
}
mat

Here are the microbenchmark timings; the for loop has a slight edge.

Unit: milliseconds
    expr      min       lq   median       uq      max neval
 OUTER() 4.450410 4.691124 4.774394 4.877724 55.77333  1000
   FOR() 4.309527 4.521785 4.588728 4.694156  7.04275  1000
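
The benchmark code itself is not shown above, so the following is only a sketch of how timings like these might be produced. The wrapper names OUTER and FOR are taken from the benchmark output; their bodies are an assumption based on the two approaches shown earlier, and the timing step runs only if the microbenchmark package is installed.

```r
## Arbitrary function from the question
FUN <- function(x, y) round(sqrt(sum(x)) - sum(y), digits = 1)

## Assumed wrapper around the outer approach
OUTER <- function() {
  outer(
    names(mtcars),
    names(mtcars),
    Vectorize(function(i, j) FUN(mtcars[, i], mtcars[, j]))
  )
}

## Assumed wrapper around the for approach
FOR <- function() {
  n <- ncol(mtcars)
  mat <- matrix(NA_real_, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      mat[i, j] <- FUN(mtcars[, i], mtcars[, j])
    }
  }
  mat
}

## Both approaches should fill the matrix with the same values
all.equal(OUTER(), FOR())

## Time them only if microbenchmark is available
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(OUTER(), FOR(), times = 100L))
}
```

Absolute timings depend on the machine and R version, so only the relative ordering is worth comparing.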

What is the fastest approach to this in R (add-on packages welcome)?


1 Answer

Still sticking to a base R solution, I got a 1.6-1.7x speedup in the for-based approach by:

  • replacing [,i] with [[i]] (significant time impact; perhaps FUN just receives C pointers here instead of freshly allocated vectors);
  • byte-code compiling FUN (small time impact);
  • wrapping the for loop in a function and byte-code compiling it (small time impact).

BTW, swapping the indices (i,j) -> (j,i) in the two loops didn't result in significant differences (in theory, since R stores matrices in column-major order, filling column by column should be faster).
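
To illustrate the first point, here is a small sketch showing that both indexing forms return the same vector for a data.frame column, so the swap changes only the cost of extraction, not the result. The repetition count n_rep is an arbitrary choice for illustration, and the timings will vary by machine.

```r
# Both forms return the same plain numeric vector for a data.frame column.
x_bracket <- mtcars[, 1]   # matrix-style indexing via `[.data.frame`
x_double  <- mtcars[[1]]   # list-style extraction of a single column
identical(x_bracket, x_double)

# Rough timing of repeated extraction; n_rep is an arbitrary choice.
n_rep <- 10000
system.time(for (k in seq_len(n_rep)) mtcars[, 1])
system.time(for (k in seq_len(n_rep)) mtcars[[1]])
```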

Code:

library(compiler)
FUN2 <- cmpfun(FUN)
for2 <- cmpfun(function(mtcars, FUN) {
    mat <- matrix(rep(NA, ncol(mtcars)^2), ncol(mtcars))
    for (i in 1:ncol(mtcars)) {
        for (j in 1:ncol(mtcars)) {
            mat[i, j] <- FUN(mtcars[[i]], mtcars[[j]])
        }
    }
    mat
})
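
A possible way to call this and check it against the original [, i] indexing is sketched below. The definitions are restated so the snippet runs on its own; the argument name df and the helper objects res and ref are hypothetical names chosen for illustration.

```r
library(compiler)

FUN  <- function(x, y) round(sqrt(sum(x)) - sum(y), digits = 1)
FUN2 <- cmpfun(FUN)

# Same loop as above, with the data.frame argument renamed to df
for2 <- cmpfun(function(df, FUN) {
  n <- ncol(df)
  mat <- matrix(NA_real_, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      mat[i, j] <- FUN(df[[i]], df[[j]])
    }
  }
  mat
})

res <- for2(mtcars, FUN2)

# Reference result using the uncompiled FUN and [, i] indexing
n <- ncol(mtcars)
ref <- matrix(NA_real_, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    ref[i, j] <- FUN(mtcars[, i], mtcars[, j])
  }
}

all.equal(res, ref)
```

Since R 3.4.0 the JIT byte-code compiler is enabled by default, so on recent versions the gain from the explicit cmpfun calls may be smaller than the figures reported here.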

Benchmarks:

Unit: milliseconds
     expr      min       lq   median       uq      max neval
    outer 7.791739 7.991474 8.245869 8.538163 16.24460   100
      for 8.143679 8.463249 8.588230 9.912008 16.30842   100
 for-mods 4.713837 4.875972 5.006202 5.246584 15.66491   100

In my opinion, it will be difficult to find a much faster approach (but I may be wrong). The overhead of the for loop itself is quite small (ca. 0.25 ms) compared with the time needed to evaluate FUN multiple times (11 x 11 = 121 calls for mtcars).

