Assume I have data (t,y)
, where I expect a linear dependency y(t)
. Furthermore, there exist attributes to each observation par1, par2, par3
. Is there an algorithm or technique to decide, if (one or both or all of the parameters) are relevant for the fit or not? I tried leaps::regsubsets(y ~ t + par1 + par2 + par3, data = mydata, nbest = 10)
but was not able to get the formula for the best fit.
The final result should look like this if plotted. For data see below.
- Adding
par1
andpar2
gives the best fit - The models are
y_i = a_i * t_i + b_i
with givena_i
andb_i
Reproducible example:
t <- seq(0,10, length.out = 1000) # large sample of x values
# Create 3 linear equations of the form y_i = a*t_i + b
a <- c(1, 0.3, 0.2) # slope
b <- c(-0.5, 0.5, 0.1) # offset
# create t_i, y_ti and y_i (including noise)
d <- list()
y <- list()
y_t <- list()
for (i in 1:3) {
set.seed(33*i)
d[[i]] <- sort(sample(t, 50, replace = F))
set.seed(33*i)
noise <- rnorm(10)
y[[i]] <- a[i]*d[[i]] + b[i] + noise
y_t[[i]] <- a[i]*d[[i]] + b[i]
}
# Final data set
df1 <- data.frame(t=d[[1]], y=y[[1]], par1=rep(1), par2=rep(10), par3=sample(c(100, 200, 300), length(d[[1]]), replace = T))
df2 <- data.frame(t=d[[2]], y=y[[2]], par1=rep(2), par2=rep(20), par3=sample(c(100, 200, 300), length(d[[1]]), replace = T))
df3 <- data.frame(t=d[[3]], y=y[[3]], par1=rep(2), par2=rep(30), par3=sample(c(100, 200, 300), length(d[[1]]), replace = T))
mydata <- rbind(df1, df2, df3)
mydata <- mydata[sample(nrow(mydata)), ]
# That is what the data is looking like:
plot(mydata$t, mydata$y)
# This is the result I am looking for (ideally):
plot(d[[1]], y[[1]], col = "black", xlim = c(0, 10), ylim = c(-2, 10), xlab = "t", ylab = "y",
main = "Fit for three different groups")
points(d[[2]], y[[2]], col = "red")
points(d[[3]], y[[3]], col = "blue")
lines(d[[1]], y_t[[1]],col = "black")
lines(d[[2]], y_t[[2]], col = "red")
lines(d[[3]], y_t[[3]], col = "blue")
Comment and question on @Roland's answer:
I understand that that with the given three parameters there are 2^3=8
groups with 2*3*3=18
factor levels. But I would expect that we only have 8 relevant groups as I always have the choice between "include parameter x or not". To me it does not make sense to only "include level x of parameter y".
I tried the following
g <- 0
t_lin1 <- mydata$t[mydata$g == g]
y_lin1 <- mydata$y[mydata$g == g]
plot(mydata$t, mydata$y)
points(t_lin1, y_lin1, col = "red")
abline(lm(y_lin1 ~ t_lin1), col = "red")
points(pred.1se ~ t, data = mydata, col = as.integer(mydata$g), pch = 16)
and realized that the fit is off. Looking back this is clear because
- I include the wrong factor levels (most likely parameter 3 is not relevant)
- and thus get the wrong data for the fit
So my last question is:
- Where can I find the relevant groups included in the best model and
- what are the corresponding fit parameters from the regression?
Sorry, if this was obvious but to me it is mystery
See Question&Answers more detail:os