Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

I want to train a random forest to make a categorical prediction. I want to include a fixed set of independent variables in the prediction model (e.g. x1, x2, and x3 in Y~.+x1+x2+x3), but exclude them from the set of independent variables (represented by . in the example) that can be used to partition the data and create branches in the trees of the forest. Is there a simple way to do this using caret, grf, or another package in R?

Here's an example: suppose I want to predict which flowers in the iris dataset have a sepal width over 3.2. I want to condition on flower species when deciding whether to create a new branch, while excluding flower species as a possible variable to split on. Imagine that I know that flower species is a good predictor of sepal width, but I want to know what other factors predict sepal width, conditional on species:

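library(caret)  # provides train(), trainControl(), varImp(); method="rf" also needs the randomForest package
data(iris)
d <- iris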

d$sepal_width_over3point2<-as.factor(d$Sepal.Width>3.2)
d$Type1<-as.numeric(d$Species=='versicolor')
d$Type2<-as.numeric(d$Species=='virginica')
d$Type3<-as.numeric(d$Species=='setosa')

d<-subset(d,select=-c(Species,Sepal.Width))


## Set parameters to train models
# Run algorithms using 10-fold cross validation
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"

# Random Forest
set.seed(11)
rf <- train(sepal_width_over3point2~.+Type1+Type2+Type3, data=d, method="rf", metric=metric, trControl=control)
print(rf)

example_varImp_rf<-varImp(rf)

When I look at the variable importance in this model, I'd like the estimates for the other predictors (Sepal.Length, Petal.Length, and Petal.Width) to be conditional on flower Type1, Type2, and Type3, without those three variables being available as variables to branch on. Is there a way to tell the random forest to ignore these three variables as possible splits?

question from:https://stackoverflow.com/questions/65835383/add-conditioning-variables-to-a-random-forest-model-in-r


1 Answer

That would require each node split to carry a separate threshold for each flower species, which is more computationally expensive than what most tree learners do. I don't know of any package that implements this.

One possible workaround is to do some feature engineering. In this case, where the variable you condition on is a smallish categorical, you could standardize each feature relative to its flower species, so that a split would be something like "sepal length is at least 20% higher than the species average" or "sepal length is at least one (species) standard deviation higher than the species average."
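
A minimal sketch of that workaround in base R, assuming the "one (species) standard deviation" variant, i.e. a z-score computed within each species. The helper name zscore_within and the "_z" column suffix are made up for illustration; they are not part of caret or any other package.

```r
# Sketch: express each predictor relative to its own species (z-score per group)
data(iris)
d <- iris

# Outcome from the question: is sepal width above 3.2?
d$sepal_width_over3point2 <- as.factor(d$Sepal.Width > 3.2)

# Predictors to standardize within species
features <- c("Sepal.Length", "Petal.Length", "Petal.Width")

# Hypothetical helper: z-score a numeric vector within the levels of `group`
zscore_within <- function(x, group) {
  ave(x, group, FUN = function(v) (v - mean(v)) / sd(v))
}

# Keep only the outcome plus the species-relative features;
# Species itself is left out, so the forest cannot split on it
d_std <- d[, "sepal_width_over3point2", drop = FALSE]
for (f in features) {
  d_std[[paste0(f, "_z")]] <- zscore_within(d[[f]], d$Species)
}

# Sanity check: per-species means of the standardized features should be ~0
aggregate(d_std[, -1], by = list(Species = d$Species), FUN = mean)
```

The resulting d_std can then be passed to the same train() call as in the question, with the formula sepal_width_over3point2 ~ . since the species indicators are no longer columns in the data. Note that this removes between-species differences in mean and scale from the predictors; it does not make the forest learn a different threshold per species, so it approximates the conditioning rather than implementing it exactly.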

