Is there a fast way to iterate through combinations like those returned by expand.grid or CJ (data.table)? These get too big to fit in memory when there are enough combinations. There is iproduct in the itertools2 library (a port of Python's itertools), but it is really slow (at least the way I'm using it, shown below). Are there other options?
Here is an example, where the idea is to apply a function to each combination of rows from two data.frames (previous post).
library(data.table) # CJ
library(itertools2) # iproduct iterator
library(doParallel)
## Dimensions (row counts) of the two data frames
dim1 <- 10
dim2 <- 100
df1 <- data.frame(a = 1:dim1, b = 1:dim1)
df2 <- data.frame(x= 1:dim2, y = 1:dim2, z = 1:dim2)
## function to apply to combinations
f <- function(...) sum(...)
## Too big to expand with bigger dimensions (e.g., 1e6, 1e5) -> errors
## test <- expand.grid(seq.int(dim1), seq.int(dim2))
## test <- CJ(indx1 = seq.int(dim1), indx2 = seq.int(dim2))
## Error: cannot allocate vector of size 3.7 Gb
## Create an iterator over the cartesian product of the two dims
it <- iproduct(x=seq.int(dim1), y=seq.int(dim2))
## Setup the parallel backend
cl <- makeCluster(4)
registerDoParallel(cl)
## Run
res <- foreach(i=it, .combine=c, .packages=c("itertools2")) %dopar% {
  f(df1[i$x, ], df2[i$y, ])
}
stopCluster(cl)
## Expand.grid results (different ordering)
expgrid <- expand.grid(x=seq(dim1), y=seq(dim2))
test <- apply(expgrid, 1, function(i) f(df1[i[["x"]],], df2[i[["y"]],]))
all.equal(sort(test), sort(res)) # TRUE
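For comparison, here is a minimal sketch of one possible alternative, not a benchmarked solution: instead of iterating over pairs, walk a single linear index k from 1 to dim1 * dim2 in chunks and decode each k into a row pair arithmetically, so the full grid is never held in memory. The names decode_pairs and chunk_size are hypothetical, introduced only for this illustration, and the chunk size would need tuning for real data.

```r
## Decode linear indices k in 1..(n1 * n2) into index pairs (x, y).
## The first index varies fastest, which matches expand.grid ordering.
## decode_pairs is a hypothetical helper, not part of any package.
decode_pairs <- function(k, n1) {
  list(x = (k - 1) %% n1 + 1,
       y = (k - 1) %/% n1 + 1)
}

## Same toy data and function as in the question
dim1 <- 10
dim2 <- 100
df1 <- data.frame(a = 1:dim1, b = 1:dim1)
df2 <- data.frame(x = 1:dim2, y = 1:dim2, z = 1:dim2)
f <- function(...) sum(...)

chunk_size <- 250                      # assumed value, tune as needed
total <- dim1 * dim2
starts <- seq(1, total, by = chunk_size)

## Only one chunk of index pairs exists at any time
res2 <- unlist(lapply(starts, function(s) {
  k <- s:min(s + chunk_size - 1, total)
  p <- decode_pairs(k, dim1)
  vapply(seq_along(k),
         function(m) f(df1[p$x[m], ], df2[p$y[m], ]),
         numeric(1))
}))
```

Because the ordering matches expand.grid, res2 should line up with test above without sorting. Each chunk is independent, so the lapply over starts could presumably be swapped for foreach with %dopar%, which would hand each worker a whole chunk rather than a single pair; whether that is actually faster than iproduct here is untested.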