"- Change the parameter max_depth in the code above. \n",
"- Change the parameter max_depth in the code above. \n",
"- Compare the behaviour of the random forest regressor with the behaviour of the tree regressor of notebook N2_a_Regression_tree, when max_depth is changed.\n",
"- Compare the behaviour of the random forest regressor with the behaviour of the tree regressor of notebook N2_a_Regression_tree, when max_depth is changed.\n",
"- Study the Extremely Randomized Regressor behaviour for max_depth parameter values (change it in the code above) ranging from 1 to 6. \n",
"- Study the Extremely Randomized Regressor behaviour for max_depth parameter values (change it in the code above) ranging from 1 to 6. \n",
"- Explain the green curve observed for max_depth=1\n",
"- Explain the green curve observed for max_depth=1\n",
"- Propose a method for setting the optimal value of max_depth parameter. Implement it (hint: look at notebook N2_a_regression_tree)\n",
"- Propose a method for setting the optimal value of max_depth parameter. Implement it (hint: look at notebook N2_a_regression_tree)\n",
...
"name": "python",
"name": "python",
"nbconvert_exporter": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"pygments_lexer": "ipython3",
"version": "3.8.3"
"version": "3.8.2"
}
}
},
},
"nbformat": 4,
"nbformat": 4,
...
%% Cell type:markdown id: tags:
This notebook can be run on mybinder: [](https://mybinder.org/v2/git/https%3A%2F%2Fgricad-gitlab.univ-grenoble-alpes.fr%2Fchatelaf%2Fml-sicom3a/master?urlpath=lab/tree/notebooks/8_Trees_Boosting/N3_a_Random_Forest_Regression.ipynb)
%% Cell type:markdown id: tags:
## RANDOM FORESTS regressors
%% Cell type:markdown id: tags:
### Consider first the same example as in notebook `N2_Regression_tree.ipynb`
This is a regression problem. Rather than searching for the optimal structure of a single tree, a random forest is considered.
print("The number of point in the set is {}".format(len(X)))
print("The number of point in the set is {}".format(len(X)))
plt.figure()
plt.scatter(X,y,s=1)
plt.xlabel('X')
plt.ylabel('y');
```
%%%% Output: stream
The number of points in the set is 629
%%%% Output: display_data
[Hidden Image Output]
%% Cell type:markdown id: tags:
In **random forests**, each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set.
Furthermore, when splitting each node during the construction of a tree, the best split is found either from all input features or from a random subset of size max_features.
(Note that in our example, we only deal with a single feature.)
The purpose of these two sources of randomness is to decrease the variance of the forest estimator. Indeed, individual decision trees typically exhibit high variance and tend to overfit. The randomness injected in forests yields decision trees with somewhat decoupled prediction errors. By taking an average of those predictions, some errors can cancel out. Random forests achieve a reduced variance by combining diverse trees, sometimes at the cost of a slight increase in bias. In practice, the variance reduction is often significant, hence yielding an overall better model.
*(extracted from https://scikit-learn.org/stable/modules/ensemble.html#forest)*
%% Cell type:code id: tags:
``` python
# Random forest estimator
# (note: in recent scikit-learn versions the 'mse' criterion is named 'squared_error')
from sklearn.ensemble import RandomForestRegressor

clf = RandomForestRegressor(n_estimators=100,
                            max_depth=5,
                            random_state=None,
                            criterion='mse')
clf = clf.fit(X, y.ravel())

# Evaluate on a regular grid and compare to the noise-free reference sin(x)
Ntest = 300
XX = np.linspace(X.min(), X.max(), Ntest)
y_ref = np.sin(XX)
XX = XX.reshape(len(XX), 1)
yp = clf.predict(XX)

plt.scatter(X, y, s=1)
plt.plot(XX, yp, color='red')
error = np.square(y_ref - yp).sum() / Ntest
print('MSE = {}'.format(error))
```
%%%% Output: stream
MSE = 0.0019143711219644015
%%%% Output: display_data
[Hidden Image Output]
%% Cell type:markdown id: tags:
## Exercise
- Change the parameter max_depth in the code above.
- Compare the behaviour of the random forest regressor with the behaviour of the tree regressor of notebook N2_a_Regression_tree, when max_depth is changed (see the sketch below for a possible starting point).
- Explain your findings.
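A minimal sketch of such a comparison, assuming `X`, `y`, `XX` and `y_ref` are already defined by the cells above, could look like the following cell.
%% Cell type:code id: tags:
``` python
# Sketch (assumes X, y, XX and y_ref from the cells above):
# sweep max_depth for a single decision tree and a random forest,
# and compare their test MSE against the noise-free reference sin(x).
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

for depth in range(1, 7):
    tree = DecisionTreeRegressor(max_depth=depth).fit(X, y.ravel())
    forest = RandomForestRegressor(n_estimators=100, max_depth=depth).fit(X, y.ravel())
    mse_tree = np.square(y_ref - tree.predict(XX)).mean()
    mse_forest = np.square(y_ref - forest.predict(XX)).mean()
    print("max_depth={}: tree MSE={:.4f}, forest MSE={:.4f}".format(depth, mse_tree, mse_forest))
```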
%% Cell type:markdown id: tags:
In **extremely randomized trees**, randomness goes one step further in the way splits are computed.
As in random forests, a random subset of candidate features is used (*again, here we have a single feature, so this does not apply*), but instead of looking for the most discriminative thresholds, **thresholds are drawn at random** for each candidate feature, and the best of these randomly generated thresholds is picked as the splitting rule.
This usually allows the variance of the model to be reduced a bit more, at the expense of a slightly greater increase in bias.
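A minimal sketch, assuming the same `X`, `y`, `XX` and `y_ref` as above, can use scikit-learn's `ExtraTreesRegressor` to fit extremely randomized trees on this data:
%% Cell type:code id: tags:
``` python
# Sketch (assumes X, y, XX and y_ref from the cells above):
# fit an extremely randomized trees regressor and plot its prediction.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesRegressor

reg = ExtraTreesRegressor(n_estimators=100, max_depth=5, random_state=None)
reg = reg.fit(X, y.ravel())
yp = reg.predict(XX)

plt.scatter(X, y, s=1)
plt.plot(XX, yp, color='green')
error = np.square(y_ref - yp).mean()
print('MSE = {}'.format(error))
```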