I want to take columns of a data.frame/matrix and apply a function to between each cell ([i, j]
) of the dataframe where i and j are the sequences along the columns of the data.frame. Basically I want to fill a matrix of individual cells in the same way that the cor
function works with a data.frame.
This is a related question: Create a matrix from a function and two numeric data frames However, I use this in randomization tests and repeat the operation many times (make many matrices). I'm looking for the fastest way to do this operation. I have sped things up a bit using parallel processing but I'm still not happy with this speed. It can not be assumed that the matrix output is symmetrical either, that is in the way cor
produces a symmetrical matrix (my example will reflect this).
I saw on the data.table web page today (http://datatable.r-forge.r-project.org/) the following:
500+ times faster than
DF[i,j]<-value
This got me thinking that perhaps data.table
or dplyr
or other means may speed things up a bit. My brain has been fixed on filling cells but maybe there's a better way involving reshaping, applying the function and reshaping to a matrix or something along those lines. I can achieve this in base R using outer
or a for
loop as follows.
## Arbitrary function
FUN <- function(x, y) round(sqrt(sum(x)) - sum(y), digits=1)
## outer approach
outer(
names(mtcars),
names(mtcars),
Vectorize(function(i,j) FUN(mtcars[,i],mtcars[,j]))
)
## for approach
mat <- matrix(rep(NA, ncol(mtcars)^2), ncol(mtcars))
for (i in 1:ncol(mtcars)) {
for (j in 1:ncol(mtcars)) {
mat[i, j] <- FUN(mtcars[, i], mtcars[, j])
}
}
mat
Here are the microbenchmark timings with for
getting a slight edge.
Unit: milliseconds
expr min lq median uq max neval
OUTER() 4.450410 4.691124 4.774394 4.877724 55.77333 1000
FOR() 4.309527 4.521785 4.588728 4.694156 7.04275 1000
What is the fastest approach to this in R (add on packages welcomed)?
See Question&Answers more detail:os