I have two sets of points, called path
and centers
. For each point in path
, I would like an efficient method for finding the ID of the closest point in centers
. I would like to do this in R. Below is a simple reproducible example.
set.seed(1)
n <- 10000
x <- 100*cumprod(1 + rnorm(n, 0.0001, 0.002))
y <- 50*cumprod(1 + rnorm(n, 0.0001, 0.002))
path <- data.frame(cbind(x=x, y=y))
centers <- expand.grid(x=seq(0, 500,by=0.5) + rnorm(1001),
y=seq(0, 500, by=0.2) + rnorm(2501))
centers$id <- seq(nrow(centers))
x
and y
are coordinates. I would like to add a column to the path
data.frame that has the id of the closest center for the given x and y co-ordinate. I then want to get all of the unique ids.
My solution at the moment does work, but is very slow when the scale of the problem increases. I would like something much more efficient.
path$closest.id <- sapply(seq(nrow(path)), function(z){
tmp <- ((centers$x - path[z, 'x'])^2) + ((centers$y - path[z, 'y'])^2)
as.numeric(centers[tmp == min(tmp), 'id'])
})
output <- unique(path$closest.id)
Any help on speeding this up would be greatly appreciated.
I think data.table
might help, but ideally what I am looking for is an algorithm that is perhaps a bit smarter in terms of the search, i.e. instead of calculating the distances to each center and then only selecting the minimum one... to get the id...
I am also happy to use Rcpp
/Rcpp11
as well if that would help improve performance.
My minimum acceptable time to perform this kind of calculation out would be 10 seconds, but obviously faster would be better.
See Question&Answers more detail:os