I am having trouble running a random forest regression on panel data.
The data currently looks like this:
I want to conduct a random forest regression which predicts KwH for each ID over time based on the variables I have. I have split my data into training and test samples using the following code:
from sklearn.model_selection import train_test_split
X = df[['hour', 'day', 'month', 'dayofweek', 'apparentTemperature',
'summary', 'household_size', 'work_from_home', 'num_rooms',
'int_in_renew', 'int_in_gen', 'conc_abt_cc', 'feel_abt_lifestyle',
'smrt_meter_help', 'avg_gender', 'avg_age', 'house_type', 'sum_insul',
'total_lb', 'total_fridges', 'bigg_apps', 'small_apps',
'look_at_meter']]
y = df[['KwH']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
I then want to train my model and evaluate it on the test sample, but I am unsure how to do this. I have tried this code:
from sklearn.ensemble import RandomForestRegressor
rfc = RandomForestRegressor(n_estimators=200)
rfc.fit(X_train, y_train)
However, I get the following error message:
A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
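For reference, the shape difference the message refers to can be reproduced with a tiny example. This is only a sketch: the four-row DataFrame is made up, and only the `KwH` column name comes from the code above.

```python
import pandas as pd

# Hypothetical stand-in for the real data; only the 'KwH' column name is from the question.
df = pd.DataFrame({"KwH": [0.4, 1.2, 0.9, 0.7]})

y_2d = df[["KwH"]]                # double brackets return a DataFrame (column vector)
y_1d = df["KwH"]                  # single brackets return a 1-D Series
y_flat = y_2d.to_numpy().ravel()  # flattening the column vector also yields a 1-D array

print(y_2d.shape, y_1d.shape, y_flat.shape)
```

Either `y_1d` or `y_flat` has the `(n_samples,)` shape that the message asks for.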
I'm not sure whether the error comes from the way my data is arranged or from the way I am fitting the random forest. Any help with this, and with evaluating the model on the test sample afterwards, would be greatly appreciated.
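To make the goal concrete, here is a minimal sketch of the fit-then-evaluate workflow I am aiming for. It runs on synthetic data, not my real data: the generated values, the reduced set of three numeric columns, the `random_state` values, and the choice of MAE and R^2 as metrics are all assumptions for illustration. A text column such as `summary` would probably need to be encoded as numbers before it could be used this way.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic placeholder data (assumption): three numeric features and a KwH target.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "hour": rng.integers(0, 24, n),
    "apparentTemperature": rng.normal(10, 5, n),
    "household_size": rng.integers(1, 6, n),
})
df["KwH"] = 0.1 * df["household_size"] + 0.02 * df["hour"] + rng.normal(0, 0.05, n)

X = df[["hour", "apparentTemperature", "household_size"]]
y = df["KwH"]  # single brackets, so the target is 1-D

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit on the training sample only.
rfc = RandomForestRegressor(n_estimators=200, random_state=0)
rfc.fit(X_train, y_train)

# Predict on the held-out test sample and compare with the true values.
y_pred = rfc.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.3f}  R^2: {r2:.3f}")
```

Here `rfc.predict(X_test)` produces one prediction per test row, and the two metric functions compare those predictions with `y_test`.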
Thanks in advance.
Question from: https://stackoverflow.com/questions/65891664/random-forest-on-panel-data-using-python