Commit cff563c3 authored by Florent Chatelain

fix typo

parent b6b5a9ad
...@@ -165,16 +165,18 @@ ...@@ -165,16 +165,18 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"## Questions 4\n", "## Questions 4\n",
"- How many different prediction values are defined by a tree of depth N?\n", "- How many different prediction values are defined by a tree of depth $d$?\n",
"- What is the average number of samples from the training set in each leave? (let N be the size of the training set)" "- What is the average number of samples from the training set in each leave? (let $N$ be the size of the training set)"
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {
"tags": []
},
"source": [ "source": [
"## Setting the depth by using cross validation\n", "## Setting the depth by using cross validation\n",
"https://scikit-learn.org/stable/modules/model_evaluation.html" "Recall the sklearn documention to assess the [performance of a model (`model_evaluation` module)](https://scikit-learn.org/stable/modules/model_evaluation.html)."
] ]
}, },
{ {
......
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
This notebook can be run on mybinder: [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/git/https%3A%2F%2Fgricad-gitlab.univ-grenoble-alpes.fr%2Fchatelaf%2Fml-sicom3a/master?urlpath=lab/tree/notebooks/8_Trees_Boosting/N2_a_Regression_tree.ipynb) This notebook can be run on mybinder: [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/git/https%3A%2F%2Fgricad-gitlab.univ-grenoble-alpes.fr%2Fchatelaf%2Fml-sicom3a/master?urlpath=lab/tree/notebooks/8_Trees_Boosting/N2_a_Regression_tree.ipynb)
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
# REGRESSION TREE # REGRESSION TREE
In this notebook, the methods illustrated are basically the same as those presented in the case of classification (notebook N1_Classif_tree.ipynb), **except** the criterion for defining the best split: in this regression case, the best split search at each node is conducted to minimize a **Mean Square Error criterion**. Note that this applies to numerical data only. In this notebook, the methods illustrated are basically the same as those presented in the case of classification (notebook N1_Classif_tree.ipynb), **except** the criterion for defining the best split: in this regression case, the best split search at each node is conducted to minimize a **Mean Square Error criterion**. Note that this applies to numerical data only.
Let $N$ be the number of samples in a set $S$. Splitting in two subsets will define the **partition** of $S$ into $\{ S_l, S_r \}$. The estimated variance or MSE of $S$ is Let $N$ be the number of samples in a set $S$. Splitting in two subsets will define the **partition** of $S$ into $\{ S_l, S_r \}$. The estimated variance or MSE of $S$ is
$$ MSE(S) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y})^2 $$ $$ MSE(S) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y})^2 $$
where $\hat{y}=\frac{1}{N}\sum_{i=1}^N y_i$. where $\hat{y}=\frac{1}{N}\sum_{i=1}^N y_i$.
Assuming that the samples are elements of $\mathbb{R}^d $, one seeks the best threshold to apply on one of the $d$ components, to minimize Assuming that the samples are elements of $\mathbb{R}^d $, one seeks the best threshold to apply on one of the $d$ components, to minimize
$$MSE(S_r,S_l) = \frac{n_r}{N}MSE(S_r)+ \frac{n_l}{N}MSE(S_l)$$ $$MSE(S_r,S_l) = \frac{n_r}{N}MSE(S_r)+ \frac{n_l}{N}MSE(S_l)$$
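A minimal sketch of this split search for scalar inputs, with $n_l$ and $n_r$ taken to be the sizes of $S_l$ and $S_r$. The helper names `mse` and `best_split_1d` and the toy data are illustrative assumptions; this is a brute-force version of the criterion, not scikit-learn's actual implementation.

```python
import numpy as np


def mse(y):
    """Empirical variance of y around its mean, i.e. MSE(S) as defined above."""
    return float(np.mean((y - y.mean()) ** 2)) if len(y) else 0.0


def best_split_1d(x, y):
    """Exhaustive search for the threshold that minimises the weighted MSE."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(y)
    best_t, best_cost = None, np.inf
    # Candidate thresholds: midpoints between consecutive distinct inputs
    for i in range(1, n):
        if x[i] == x[i - 1]:
            continue
        # n_l = i samples go left, n_r = n - i samples go right
        cost = (i / n) * mse(y[:i]) + ((n - i) / n) * mse(y[i:])
        if cost < best_cost:
            best_t, best_cost = 0.5 * (x[i - 1] + x[i]), cost
    return best_t, best_cost


# Toy data (illustrative values): a noiseless step located at x = 0.5
x = np.linspace(0.0, 1.0, 20)
y = np.where(x < 0.5, 0.0, 1.0)
t, cost = best_split_1d(x, y)
print("threshold:", t, "weighted MSE:", cost)
```

For inputs in $\mathbb{R}^d$, the same search would be repeated over each of the $d$ components and the best (component, threshold) pair kept.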
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
## A first simple example for data in $\mathbb{R}$ ## A first simple example for data in $\mathbb{R}$
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
%matplotlib inline %matplotlib inline
import matplotlib.pyplot as plt import matplotlib.pyplot as plt
import numpy as np import numpy as np
# Make a noisy sine-shaped function # Make a noisy sine-shaped function
# np.random.seed(0) # np.random.seed(0)
noise_std = 0.1 noise_std = 0.1
X = np.arange(0, 2 * np.pi, 0.01)[:, np.newaxis] X = np.arange(0, 2 * np.pi, 0.01)[:, np.newaxis]
nx = np.random.randn(X.shape[0], 1) * noise_std nx = np.random.randn(X.shape[0], 1) * noise_std
y = np.sin(X) + np.random.randn(X.shape[0], 1) * noise_std y = np.sin(X) + np.random.randn(X.shape[0], 1) * noise_std
print("The number of point in the set is {}".format(len(X))) print("The number of point in the set is {}".format(len(X)))
# changing y to observe the behaviour on linear regression # changing y to observe the behaviour on linear regression
# y= .5*X + np.random.randn(X.shape[0],1)*noise_std # y= .5*X + np.random.randn(X.shape[0],1)*noise_std
plt.figure() plt.figure()
plt.scatter(X, y, s=1) plt.scatter(X, y, s=1)
plt.xlabel("X") plt.xlabel("X")
plt.ylabel("y") plt.ylabel("y")
``` ```
%%%% Output: stream %%%% Output: stream
The number of point in the set is 629 The number of point in the set is 629
%%%% Output: execute_result %%%% Output: execute_result
Text(0, 0.5, 'y') Text(0, 0.5, 'y')
%%%% Output: display_data %%%% Output: display_data
[Hidden Image Output] [Hidden Image Output]
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### Compute the decision tree for two different maximal depths. ### Compute the decision tree for two different maximal depths.
Other parameters are set to default, see the [sklearn documentation]( Other parameters are set to default, see the [sklearn documentation](
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor). https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor).
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
from sklearn.tree import DecisionTreeRegressor from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score from sklearn.model_selection import cross_val_score
X = X.reshape( X = X.reshape(
len(X), 1 len(X), 1
) # necessary as DecisionTreeRegressor takes a 2D array as input ) # necessary as DecisionTreeRegressor takes a 2D array as input
regr2 = DecisionTreeRegressor(max_depth=2, criterion="squared_error") regr2 = DecisionTreeRegressor(max_depth=2, criterion="squared_error")
regr2 = regr2.fit(X, y) regr2 = regr2.fit(X, y)
N = 8 N = 8
regrN = DecisionTreeRegressor(max_depth=N, criterion="squared_error") regrN = DecisionTreeRegressor(max_depth=N, criterion="squared_error")
regrN = regrN.fit(X, y) regrN = regrN.fit(X, y)
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### Visualize the predictions obtained for both regression trees ### Visualize the predictions obtained for both regression trees
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# Define a new set of inputs with the same range as X # Define a new set of inputs with the same range as X
X_test = np.arange(0.0, X.max(), 0.01)[ X_test = np.arange(0.0, X.max(), 0.01)[
:, np.newaxis :, np.newaxis
] # alternate method to get a 2D array from a scalar series ] # alternate method to get a 2D array from a scalar series
# use the computed trees to obtain prediction values # use the computed trees to obtain prediction values
y_r2 = regr2.predict(X_test) y_r2 = regr2.predict(X_test)
y_rN = regrN.predict(X_test) y_rN = regrN.predict(X_test)
plt.scatter(X, y, s=1) plt.scatter(X, y, s=1)
plt.plot(X_test, y_r2, color="green", label="2 depths", linewidth=2) plt.plot(X_test, y_r2, color="green", label="2 depths", linewidth=2)
plt.plot(X_test, y_rN, color="red", label="N depths", linewidth=2) plt.plot(X_test, y_rN, color="red", label="N depths", linewidth=2)
plt.xlabel("X") plt.xlabel("X")
plt.ylabel("y") plt.ylabel("y")
plt.legend() plt.legend()
plt.show() plt.show()
``` ```
%%%% Output: display_data %%%% Output: display_data
[Hidden Image Output] [Hidden Image Output]
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
## Questions 4 ## Questions 4
- How many different prediction values are defined by a tree of depth N? - How many different prediction values are defined by a tree of depth $d$?
- What is the average number of samples from the training set in each leave? (let N be the size of the training set) - What is the average number of samples from the training set in each leave? (let $N$ be the size of the training set)
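One way to compare your answers with what scikit-learn actually builds is to count the leaves of fitted trees. The sketch below regenerates data like the example above; the names `X_demo`, `y_demo`, `leaf_counts` and the fixed seed are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)  # fixed seed, assumed here for reproducibility
X_demo = np.arange(0, 2 * np.pi, 0.01)[:, np.newaxis]
y_demo = np.sin(X_demo).ravel() + 0.1 * rng.standard_normal(X_demo.shape[0])

leaf_counts = {}
distinct_preds = {}
for d in (1, 2, 4, 8):
    tree = DecisionTreeRegressor(max_depth=d).fit(X_demo, y_demo)
    # Each leaf stores one constant prediction value
    leaf_counts[d] = tree.get_n_leaves()
    distinct_preds[d] = len(np.unique(tree.predict(X_demo)))
    print(
        f"max_depth={d}: {leaf_counts[d]} leaves, "
        f"{len(X_demo) / leaf_counts[d]:.1f} training samples per leaf on average"
    )
```

The number of leaves can be smaller than the theoretical maximum for a given depth, since a node is no longer split once it cannot be improved or holds too few samples.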
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
## Setting the depth by using cross validation ## Setting the depth by using cross validation
https://scikit-learn.org/stable/modules/model_evaluation.html Recall the sklearn documention to assess the [performance of a model (`model_evaluation` module)](https://scikit-learn.org/stable/modules/model_evaluation.html).
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
from sklearn.model_selection import ShuffleSplit from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=10, train_size=0.5, test_size=0.5, random_state=None) cv = ShuffleSplit(n_splits=10, train_size=0.5, test_size=0.5, random_state=None)
depth = np.arange(2, 13) depth = np.arange(2, 13)
reg_MSE = [] reg_MSE = []
for N in depth: for N in depth:
    regrN = DecisionTreeRegressor(max_depth=N, criterion="squared_error") regrN = DecisionTreeRegressor(max_depth=N, criterion="squared_error")
    mserr = [] mserr = []
    for train_index, test_index in cv.split(X): for train_index, test_index in cv.split(X):
        regrN = regrN.fit(X[train_index], y[train_index]) regrN = regrN.fit(X[train_index], y[train_index])
        y_pred = regrN.predict(X[test_index]) y_pred = regrN.predict(X[test_index])
        y_t = y[test_index].ravel() # to force same dimensions as those of y_pred y_t = y[test_index].ravel() # to force same dimensions as those of y_pred
        mserr.append(np.square(y_t - y_pred).mean()) mserr.append(np.square(y_t - y_pred).mean())
    # print(mserr) # print(mserr)
    reg_MSE.append(np.asarray(mserr).mean()) reg_MSE.append(np.asarray(mserr).mean())
plt.plot(depth, reg_MSE) plt.plot(depth, reg_MSE)
plt.xlabel("max Depth of the reg_tree") plt.xlabel("max Depth of the reg_tree")
plt.ylabel("MSE") plt.ylabel("MSE")
plt.grid() plt.grid()
``` ```
%%%% Output: display_data %%%% Output: display_data
[Hidden Image Output] [Hidden Image Output]
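The manual loop over `cv.split` can also be written with `cross_val_score`, which the notebook imports earlier. The sketch below is one possible variant, not the notebook's own code: the data are regenerated under the assumed names `X_demo` and `y_demo`, and the seeds are fixed for reproducibility. The `"neg_mean_squared_error"` scorer returns the negated MSE, so the sign is flipped back.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)  # fixed seed, assumed here for reproducibility
X_demo = np.arange(0, 2 * np.pi, 0.01)[:, np.newaxis]
y_demo = np.sin(X_demo).ravel() + 0.1 * rng.standard_normal(X_demo.shape[0])

cv_demo = ShuffleSplit(n_splits=10, train_size=0.5, test_size=0.5, random_state=0)
depths = np.arange(2, 13)
cv_mse = []
for d in depths:
    scores = cross_val_score(
        DecisionTreeRegressor(max_depth=d),
        X_demo,
        y_demo,
        cv=cv_demo,
        scoring="neg_mean_squared_error",
    )
    cv_mse.append(-scores.mean())  # flip the sign back to a positive MSE
cv_mse = np.asarray(cv_mse)

for d, err in zip(depths, cv_mse):
    print(f"max_depth={d}: cross-validated MSE = {err:.4f}")
```

Because `cv_demo` uses a fixed `random_state`, every depth is evaluated on the same ten train/test splits, which makes the curve less noisy than drawing new splits for each depth.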
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
## Exercise 5 ## Exercise 5
- Determine the "optimal" depth to use to perform the best (in the MSE sense) tree-based prediction. - Determine the "optimal" depth to use to perform the best (in the MSE sense) tree-based prediction.
- Change the noise power (e.g. set `noise_std` to different values in the range $[0.01, 1]$) and study (plot) the cross-validated "optimal depth" as a function of the noise power. - Change the noise power (e.g. set `noise_std` to different values in the range $[0.01, 1]$) and study (plot) the cross-validated "optimal depth" as a function of the noise power.
......