Commit e76dcb05 authored by Florent Chatelain's avatar Florent Chatelain

Merge branch 'master' of gricad-gitlab.univ-grenoble-alpes.fr:ai-courses/autonomous_systems_ml
parents 50c06220 5bd98a6c
......@@ -74,34 +74,45 @@
`Lab timetables`
|Group | Supervisor | Members | Lab2 | Lab3 | Lab4| Lab5| Lab6 | Lab7 |
|------|:-----------|---------|------|------|-----|-----|------|------|
|**G1** | F.Chatelain |**ASI** students: AJANA->GOUILLY + Sukhera S.| 18oct | 25oct | 29oct | 8nov| 15nov | 17nov |
|**G2** | K.Tidriri |**ASI** students: HAMDAOUI->SORIANI|18oct | 21oct | 25oct | 29oct| 8nov | 15nov |
|**G3** | O. Michel |**MARS**+foreign students| 25oct | 29oct | 8nov | 10nov| 10nov| 15nov |
##### Lab7
- **HW: To do before the lab session:** re-read the lesson ([slides](https://gricad-gitlab.univ-grenoble-alpes.fr/ai-courses/autonomous_systems_ml/-/blob/master/slides/9_Trees_RandomForest_Boosting.pdf)) on Classification and Regression Trees, tree pruning, and Random Forests.
- The Lab7 statement on tree-based methods for supervised classification and regression problems is [here](https://gricad-gitlab.univ-grenoble-alpes.fr/ai-courses/autonomous_systems_ml/-/tree/master/labs/lab7_statement.md)
- **Unless your supervisor gives you other instructions,** upload your Lab7 short report at the end of the session to the [chamilo assignment task](https://chamilo.grenoble-inp.fr/main/work/work_list.php?cidReq=ENSE3WEUMAIA0&id_session=0&gidReq=0&gradebook=0&origin=&id=145522&isStudentView=true) (pdf file from your editor, or scanned pdf file of a handwritten paper; *code, figures or graphics are not required!*)
##### Lab6
- **HW: To do before the lab session:** re-read the lesson ([slides](https://gricad-gitlab.univ-grenoble-alpes.fr/ai-courses/autonomous_systems_ml/-/blob/master/slides/8_clustering.pdf)) on clustering methods, with a focus on K-means and the EM algorithm. Read slides 1 to 37.
- The Lab6 statement on clustering methods is [here](https://gricad-gitlab.univ-grenoble-alpes.fr/ai-courses/autonomous_systems_ml/-/tree/master/labs/lab6_statement.md)
- **Unless your supervisor gives you other instructions,** upload your Lab6 short report at the end of the session to the [chamilo assignment task](https://chamilo.grenoble-inp.fr/main/work/work_list.php?cidReq=ENSE3WEUMAIA0&id_session=0&gidReq=0&gradebook=0&origin=&id=145522&isStudentView=true) (pdf file from your editor, or scanned pdf file of a handwritten paper; *code, figures or graphics are not required!*)
##### Lab5
- **HW: To do before the lab session:** re-read the lesson ([slides](https://gricad-gitlab.univ-grenoble-alpes.fr/ai-courses/autonomous_systems_ml/-/blob/master/slides/6_linear_models_regularization.pdf)) on the lasso regularization and logistic regression parts.
- The Lab5 statement on generalized linear models, lasso regularization and variable selection is [here](https://gricad-gitlab.univ-grenoble-alpes.fr/ai-courses/autonomous_systems_ml/-/tree/master/labs/lab5_statement.md)
- **Unless your supervisor gives you other instructions,** upload your Lab5 short report at the end of the session to the [chamilo assignment task](https://chamilo.grenoble-inp.fr/main/work/work_list.php?cidReq=ENSE3WEUMAIA0&id_session=0&gidReq=0&gradebook=0&origin=&id=145522&isStudentView=true) (pdf file from your editor, or scanned pdf file of a handwritten paper; *code, figures or graphics are not required!*)
##### ~~Lab4~~
- ~~**HW: To do before the lab session:** re-read the lesson ([slides](https://gricad-gitlab.univ-grenoble-alpes.fr/ai-courses/autonomous_systems_ml/-/blob/master/slides/6_linear_models_regularization.pdf)) on linear models up to ridge regularization.~~
- ~~The Lab4 statement on linear models and ridge regression is [here](https://gricad-gitlab.univ-grenoble-alpes.fr/ai-courses/autonomous_systems_ml/-/tree/master/labs/lab4_statement.md).~~
- ~~**Unless your supervisor gives you other instructions,** upload your Lab4 short report at the end of the session to the [chamilo assignment task](https://chamilo.grenoble-inp.fr/main/work/work_list.php?cidReq=ENSE3WEUMAIA0&id_session=0&gidReq=0&gradebook=0&origin=&id=143765&isStudentView=true) (pdf file from your editor, or scanned pdf file of a handwritten paper; *code, figures or graphics are not required!*)~~
##### ~~Lab3~~
- ~~The Lab3 statement on Principal Component Analysis (PCA) is [here](https://gricad-gitlab.univ-grenoble-alpes.fr/ai-courses/autonomous_systems_ml/-/tree/master/labs/lab3_statement.md)~~
- ~~**Unless your supervisor gives you other instructions,** upload your Lab3 short report at the end of the session to the [chamilo assignment task](https://chamilo.grenoble-inp.fr/main/work/work_list.php?cidReq=ENSE3WEUMAIA0&id_session=0&gidReq=0&gradebook=0&origin=&id=139477&isStudentView=true) (pdf file from your editor, or scanned pdf file of a handwritten paper; *code, figures or graphics are not required!*)~~
##### ~~Lab2 (Monday, October 18 for the groups of students supervised by Khaoula Tidriri and Florent Chatelain)~~
......
# Trees, boosting and random forests
The objective of this lab is to introduce trees and tree pruning, and then ensemble methods based on trees, in particular random forests. Both classification and regression problems are studied.
1. First steps with classification trees: the importance of depth, the impurity function, and visualization of trees ['N1_Classif_tree.ipynb'](https://gricad-gitlab.univ-grenoble-alpes.fr/ai-courses/autonomous_systems_ml/-/blob/master/notebooks/9_Trees_Boosting/N1_Classif_tree.ipynb)
2. Some fundamentals on regression trees ['N2_a_Regression_tree.ipynb'](https://gricad-gitlab.univ-grenoble-alpes.fr/ai-courses/autonomous_systems_ml/-/blob/master/notebooks/9_Trees_Boosting/N2_a_Regression_tree.ipynb)
3. Adapting the complexity to the data: cost-complexity pruning on a regression example (see the sketch after this list) ['N2_b_Cost_Complexity_Pruning_Regressor.ipynb'](https://gricad-gitlab.univ-grenoble-alpes.fr/ai-courses/autonomous_systems_ml/-/blob/master/notebooks/9_Trees_Boosting/N2_b_Cost_Complexity_Pruning_Regressor.ipynb)
4. Ensembles of trees: a random forest regression case study ['N3_a_Random_Forest_Regression.ipynb'](https://gricad-gitlab.univ-grenoble-alpes.fr/ai-courses/autonomous_systems_ml/-/blob/master/notebooks/9_Trees_Boosting/N3_a_Random_Forest_Regression.ipynb)
5. Ensembles of trees: a random forest classification case study ['N3_b_Random_Forest_Classif.ipynb'](https://gricad-gitlab.univ-grenoble-alpes.fr/ai-courses/autonomous_systems_ml/-/blob/master/notebooks/9_Trees_Boosting/N3_b_Random_Forest_Classif.ipynb)
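A minimal, self-contained sketch (an editorial addition, not taken from the lab notebooks) of what cost-complexity pruning looks like with scikit-learn; the `load_diabetes` toy dataset is only an illustrative stand-in for the lab data:

``` python
# Sketch: fit a regression tree, then scan its cost-complexity pruning path
# (assumes scikit-learn >= 0.22 for cost_complexity_pruning_path / ccp_alpha)
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# Fully grown tree: low bias but high variance
full_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Effective alphas along the pruning path of the full tree
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Refit one pruned tree per alpha and keep the best test-set R^2 score
scores = [DecisionTreeRegressor(random_state=0, ccp_alpha=a)
          .fit(X_train, y_train).score(X_test, y_test)
          for a in path.ccp_alphas]
best = int(np.argmax(scores))
print("best ccp_alpha=%.4f, test R^2=%.3f" % (path.ccp_alphas[best], scores[best]))
```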
......@@ -121,7 +121,7 @@
"## Implementation of Kmean on a simple case\n",
"In this example, the number of clusters is assumed to be known. \n",
"\n",
"### Exercize :\n",
"### Exercize 1 :\n",
"- Explain/ comment the code below\n",
"- What is the main problem left aside by this code? "
]
......@@ -160,28 +160,30 @@
"muvec=np.zeros((K,1)) \n",
"\n",
"\n",
"change = True\n",
"change = True # Defines the test variable for the loop. \n",
" # Default is true (meaning that a new iteration will be performed\n",
"perm=np.random.permutation(N)[0:2]\n",
" # takes two different random integers between 0 and $N$\n",
"\n",
"for k in range (0,K): \n",
" muvec[k] = X[perm[k],:]\n",
" muvec[k] = X[perm[k],:] #Initialization of the cluster representatives (centers)\n",
"\n",
"for i in range (0,N):\n",
" d=(X[i] - muvec )**2\n",
" idx[i]=np.where(d==d.min())[0]\n",
" d=(X[i] - muvec )**2 #Computation of distances wrt cluster centers\n",
" idx[i]=np.where(d==d.min())[0] #label = index of closest center\n",
" \n",
"while change:\n",
" change=False\n",
" change=False \n",
" #update\n",
" for k in range (0,K):\n",
" muvec[k]= np.mean( X[idx == k] ) \n",
" muvec[k]= np.mean( X[idx == k] ) #compute new centers \n",
" #prediction\n",
" for i in range (0,N):\n",
" d=(X[i] - muvec )**2\n",
" index=np.where(d==d.min())[0]\n",
" if index != idx[i]:\n",
" d=(X[i] - muvec )**2 #Computation of distances wrt cluster centers\n",
" index=np.where(d==d.min())[0]##label = index of closest center\n",
" if index != idx[i]: #check if some indices changed\n",
" change=True\n",
" idx[i]=index\n",
" idx[i]=index #replaces new index set\n",
" \n",
"X0=X[idx==0]\n",
"X1=X[idx==1]\n",
......@@ -197,7 +199,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### sklearn implementation - exercize\n",
"### Exercize 2 : sklearn implementation \n",
"- Compare the results obtained with the simple code above\n",
"- Comment and explain the role of the input parameters used in this implementation\n"
]
......@@ -273,7 +275,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
"version": "3.8.2"
}
},
"nbformat": 4,
......
%% Cell type:markdown id: tags:
This notebook can be run on mybinder: [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/git/https%3A%2F%2Fgricad-gitlab.univ-grenoble-alpes.fr%2Fchatelaf%2Fml-sicom3a/master?urlpath=lab/tree/notebooks/7_Clustering/N1_Kmeans_basic.ipynb)
%% Cell type:markdown id: tags:
# KMEANS basics
The purpose of this lab is to implement a simple 1D K-means clustering algorithm and to compare the results with those obtained using the sklearn implementation.
%% Cell type:code id: tags:
``` python
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import loadmat
%matplotlib inline
```
%% Cell type:markdown id: tags:
## Import data from a Matlab file
%% Cell type:code id: tags:
``` python
Data=loadmat('fictitious_train.mat')  # dictionary of the Matlab variables
print(Data.keys())
X=Data.get('Xtrain')  # training samples
print('dim of X:{}'.format(X.shape))
```
%%%% Output: stream
dict_keys(['__header__', '__version__', '__globals__', 'Xtrain'])
dim of X:(20, 1)
%% Cell type:markdown id: tags:
## Compute the histogram
%% Cell type:code id: tags:
``` python
bins=np.arange(np.min(X)-1,np.max(X)+2,1)
hist_val,bins=np.histogram(X, bins=bins)
print(hist_val)
print(bins)
```
%%%% Output: stream
[0 4 2 5 1 4 3 1 0]
[-1.39 -0.39 0.61 1.61 2.61 3.61 4.61 5.61 6.61 7.61]
%% Cell type:markdown id: tags:
### Or directly visualize the histogram
%% Cell type:code id: tags:
``` python
bins=np.arange(np.min(X)-1,np.max(X)+2,1)
plt.scatter(X,np.zeros_like(X)+.5,c='red',marker='+')
n,bin_edges,patches=plt.hist(x=X,bins=bins, color='blue',histtype='step')
```
%%%% Output: display_data
[Hidden Image Output]
%% Cell type:markdown id: tags:
## Implementation of K-means on a simple case
In this example, the number of clusters is assumed to be known.
### Exercise 1:
- Explain and comment the code below
- What is the main problem left aside by this code?
%% Cell type:code id: tags:
``` python
K=2 #nb of clusters
p=1 # dimension (the code below is given for p=1 only)
```
%% Cell type:code id: tags:
``` python
N=X.size
idx=np.zeros((N,1))
muvec=np.zeros((K,1))

change = True  # loop flag: True means another iteration will be performed
perm=np.random.permutation(N)[0:2]  # picks two distinct random integers between 0 and N-1

for k in range (0,K):
    muvec[k] = X[perm[k],:]  # initialization of the cluster representatives (centers)

for i in range (0,N):
    d=(X[i] - muvec )**2  # squared distances wrt the cluster centers
    idx[i]=np.where(d==d.min())[0]  # label = index of the closest center

while change:
    change=False
    # update step
    for k in range (0,K):
        muvec[k]= np.mean( X[idx == k] )  # compute the new centers
    # prediction step
    for i in range (0,N):
        d=(X[i] - muvec )**2  # squared distances wrt the cluster centers
        index=np.where(d==d.min())[0]  # label = index of the closest center
        if index != idx[i]:  # check whether some labels changed
            change=True
            idx[i]=index  # store the new label

X0=X[idx==0]
X1=X[idx==1]
bins=np.arange(np.min(X)-1,np.max(X)+2,1)
n,bin_edges,patches=plt.hist(x=X,bins=bins, color='blue',histtype='step')
plt.scatter(X0,np.zeros_like(X0)+.5,c='red',marker='+', label='class 0')
plt.scatter(X1,np.zeros_like(X1)+.5,c='green',marker='+',label='class 1')
plt.legend()
h=plt.gcf()
```
%%%% Output: display_data
[Hidden Image Output]
%% Cell type:markdown id: tags:
### Exercise 2: sklearn implementation
- Compare the results obtained with the simple code above
- Comment and explain the role of the input parameters used in this implementation
%% Cell type:code id: tags:
``` python
#https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 2, init = 'k-means++', max_iter = 10, n_init = 10, random_state = 0)
y_kmeans = kmeans.fit_predict(X)  # fit the model and return the cluster labels
Y0=X[y_kmeans==0]
Y1=X[y_kmeans==1]
plt.scatter(Y0,np.zeros_like(Y0)+.7,c='red',marker='o', label='class 0 skl')
plt.scatter(Y1,np.zeros_like(Y1)+.7,c='green',marker='o',label='class 1 skl')
plt.scatter(X0,np.zeros_like(X0)+.5,c='red',marker='+', label='class 0')
plt.scatter(X1,np.zeros_like(X1)+.5,c='green',marker='+',label='class 1')
plt.legend()
```
%%%% Output: execute_result
<matplotlib.legend.Legend at 0x7fafd8d12df0>
%%%% Output: display_data
[Hidden Image Output]
%% Cell type:code id: tags:
``` python
```
......
......@@ -69,7 +69,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercize \n",
"## Exercize 3\n",
"- Comment the choice of Kmeans input parameters used above\n",
"- 'The elbow method' from the above graph : find the optimum number of clusters by observing the within cluster sum of squares (WCSS). Explain the shape of the curve WCSS=f(nb of clusters)\n",
"- What is the asymptotic value of WCSS when the. umber of clusters approaches N (nb of points)? \n",
......@@ -133,7 +133,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercize\n",
"## Exercize 4\n",
"- remind the reasons why the clusters formed by KMeans algorithm are are included in Voronoï cells associated to the centroïds\n",
"- Comment the shape of the obtained clusters represented in the figure above\n",
"- How would you check that enough iterations were performed? "
......@@ -183,7 +183,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercize\n",
"## Exercize 5\n",
"- Propose a measure of the goodness of clustering, associated to this problem (implementation is not required).\n",
"- How could the cost-complexity tradeoff be tackled? "
]
......@@ -212,7 +212,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
"version": "3.8.2"
}
},
"nbformat": 4,
......
%% Cell type:markdown id: tags:
This notebook can be run on mybinder: [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/git/https%3A%2F%2Fgricad-gitlab.univ-grenoble-alpes.fr%2Fchatelaf%2Fml-sicom3a/master?urlpath=lab/tree/notebooks/7_Clusturing/N2_Kmeans_iris_data_example/)
%% Cell type:markdown id: tags:
# Iris data: KMEANS
%% Cell type:code id: tags:
``` python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans
iris = datasets.load_iris()
x=iris.data
```
%% Cell type:code id: tags:
``` python
wcss = []
for i in range(1, 11):
kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300,
n_init = 10, random_state = 0)
kmeans.fit(x)
wcss.append(kmeans.inertia_)
#Plotting the results onto a line graph, allowing us to observe 'The elbow'
plt.plot(range(1, 11), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') #within cluster sum of squares
plt.show()
```
%%%% Output: display_data
[Hidden Image Output]
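%% Cell type:markdown id: tags:
To make the WCSS criterion concrete before the exercise, here is a short sanity-check sketch (an editorial addition, not part of the original lab): `inertia_` is exactly the sum of squared distances of each sample to its assigned cluster center.
%% Cell type:code id: tags:
``` python
# Sketch: verify that KMeans.inertia_ equals the within-cluster sum of squares
km = KMeans(n_clusters = 3, init = 'k-means++', n_init = 10, random_state = 0).fit(x)
d2 = ((x - km.cluster_centers_[km.labels_])**2).sum()  # WCSS computed by hand
print(km.inertia_, d2)  # the two values match up to floating-point error
```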
%% Cell type:markdown id: tags:
## Exercise 3
- Comment the choice of the KMeans input parameters used above
- 'The elbow method': from the above graph, find the optimal number of clusters by observing the within-cluster sum of squares (WCSS). Explain the shape of the curve WCSS=f(nb of clusters)
- What is the asymptotic value of WCSS when the number of clusters approaches N (nb of points)?
- Explain why the curve doesn't decrease significantly with every additional cluster
%% Cell type:code id: tags:
``` python
#Applying kmeans to the dataset / Creating the kmeans classifier
NbClust= 3
kmeans = KMeans(n_clusters = NbClust, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
y_kmeans = kmeans.fit_predict(x)
```
%% Cell type:code id: tags:
``` python
#Visualising the first 3 clusters wrt features x1, x2
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1], s = 25, c = 'red', label = 'C0')
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1], s = 25, c = 'blue', label = 'C1')
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1], s = 25, c = 'green', label = 'C2')
#Plotting the centroids of the clusters
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:,1],
s =25, c = 'yellow', label = 'Centroids')
plt.legend()
```
%%%% Output: execute_result
<matplotlib.legend.Legend at 0x7fcc29ec9310>
%%%% Output: display_data
[Hidden Image Output]
%% Cell type:markdown id: tags:
## Exercise 4
- Recall the reasons why the clusters formed by the KMeans algorithm are included in the Voronoï cells associated with the centroïds
- Comment the shape of the obtained clusters represented in the figure above
- How would you check that enough iterations were performed?
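%% Cell type:markdown id: tags:
As an illustration of the Voronoï property discussed in this exercise, the following sketch (an editorial addition, not part of the original lab) colors the plane of the first two features by nearest centroid, so the colored regions are the Voronoï cells:
%% Cell type:code id: tags:
``` python
# Sketch: KMeans on the first two features only, so the Voronoi cells can be drawn
km2 = KMeans(n_clusters = 3, init = 'k-means++', n_init = 10, random_state = 0).fit(x[:, :2])
xx, yy = np.meshgrid(np.linspace(x[:, 0].min()-.5, x[:, 0].max()+.5, 300),
                     np.linspace(x[:, 1].min()-.5, x[:, 1].max()+.5, 300))
zz = km2.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)  # nearest centroid everywhere
plt.contourf(xx, yy, zz, alpha = .2)  # the colored regions are the Voronoi cells
plt.scatter(x[:, 0], x[:, 1], c = km2.labels_, s = 15)
plt.scatter(km2.cluster_centers_[:, 0], km2.cluster_centers_[:, 1], c = 'yellow', label = 'Centroids')
plt.legend()
```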
%% Cell type:code id: tags:
``` python
#Visualising the clusters x1, x3
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 2], s = 25, c = 'red', label = 'Iris-setosa')
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 2], s = 25, c = 'blue', label = 'Iris-versicolour')
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 2], s = 25, c = 'green', label = 'Iris-virginica')
#Plotting the centroids of the clusters
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:,2], s = 25, c = 'yellow', label = 'Centroids')
plt.legend();
```
%%%% Output: execute_result
<matplotlib.legend.Legend at 0x1a1526d320>
%%%% Output: display_data
[Hidden Image Output]
%% Cell type:markdown id: tags:
## Exercise 5
- Propose a measure of the goodness of clustering associated with this problem (implementation is not required).
- How could the cost-complexity tradeoff be tackled?
%% Cell type:code id: tags:
``` python
```
......
......@@ -45,8 +45,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question \n",
" As IRIS data file contians only 150 4-dimensional samples, assuming that we imposa that no less than 2 samples are contained in a leave, and that the training test is chosen to contain 100 samples, what is the possible maximal depth? "
"### Question 13\n",
" \n",
"* As IRIS data file contians only 150 4-dimensional samples, assuming that we imposa that no less than 2 samples are contained in a leave, and that the training test is chosen to contain 100 samples, what is the possible maximal depth? "
]
},
{
......@@ -144,7 +145,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercize\n",
"## Exercize 14\n",
"- Compute the confusion matrix associated to this classifier. (Hint : see N1_Classif_tree.ipynb)\n",
"- Compute the mean accuracy of this tree classifier. (Hint : see N1_Classif_tree.ipynb)"
]
......@@ -201,7 +202,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercize \n",
"### Exercize 15\n",
"- Change the value of parameter max_depth (ranging from 1 to 5) and record the obtained accuracy. Explain your findings. \n",
"- Propose a method for setting the 'best' value of parameter n_estimator. "
]
......@@ -298,11 +299,18 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercize\n",
"## Exercize 16\n",
"Evaluate the feature importance in the IRIS Data Set , using ExtraTreesClassifier. \n",
"- Compare with the results above. \n",
"- What can be concluded about the features importance? "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
......@@ -321,7 +329,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
"version": "3.8.2"
}
},
"nbformat": 4,
......
%% Cell type:markdown id: tags:
This notebook can be run on mybinder: [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/git/https%3A%2F%2Fgricad-gitlab.univ-grenoble-alpes.fr%2Fchatelaf%2Fml-sicom3a/master?urlpath=lab/tree/notebooks/8_Trees_Boosting/N3_b_Random_Forest_Classif.ipynb)
%% Cell type:markdown id: tags:
## RANDOM FOREST Classifiers
%% Cell type:code id: tags:
``` python
from sklearn import tree
import numpy as np
from IPython.display import Image
import pydotplus
%matplotlib inline
import matplotlib.pyplot as plt
```
%% Cell type:markdown id: tags:
**First, construct a tree-based classifier on the IRIS data set (*), and evaluate the 'best' depth of a classification tree by cross-validation.**
(*) This data set consists of the sepal and petal measurements of 3 different types of irises (Setosa, Versicolour, and Virginica), stored in a 150x4 numpy.ndarray.
The rows are the samples and the columns are: Sepal Length, Sepal Width, Petal Length and Petal Width.
see https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html
%% Cell type:markdown id: tags:
### Question 13
* As the IRIS data file contains only 150 4-dimensional samples, assuming that we impose that no fewer than 2 samples are contained in a leaf, and that the training set is chosen to contain 100 samples, what is the maximal possible depth?
%% Cell type:code id: tags:
``` python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import train_test_split
from sklearn import metrics
iris = load_iris()
depth_array = np.arange(2,7)
estd_accuracy = []
cv = ShuffleSplit(n_splits=20, test_size=0.33)
for nbdepth in depth_array:
clf = tree.DecisionTreeClassifier(max_depth=nbdepth,criterion='gini',\
min_samples_leaf=2)
scores = cross_val_score(clf, iris.data, iris.target, cv=cv)
estd_accuracy.append(scores.mean())
plt.plot(depth_array,estd_accuracy)
plt.xlabel("Max depth of the tree")
plt.ylabel("Classification accuracy")
plt.grid()
```
%%%% Output: display_data
[Hidden Image Output]
%% Cell type:markdown id: tags:
**Visualize the obtained tree for depth = 4** (this value can be changed)
%% Cell type:code id: tags:
``` python
nbdepth=4
clf = tree.DecisionTreeClassifier(max_depth=nbdepth,criterion='gini')
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, \
test_size=.33,random_state=None)
clf = clf.fit(X_train, y_train)
```
%% Cell type:code id: tags:
``` python
from sklearn.tree import plot_tree
plt.figure(figsize=(50,20))
a = plot_tree(clf,
filled=True,
rounded=True,fontsize=35)
```
%%%% Output: display_data
[Hidden Image Output]
%% Cell type:markdown id: tags:
## Exercise 14
- Compute the confusion matrix associated with this classifier. (Hint: see N1_Classif_tree.ipynb)
- Compute the mean accuracy of this tree classifier. (Hint: see N1_Classif_tree.ipynb)
%% Cell type:markdown id: tags:
### Random forest classifier computation
%% Cell type:code id: tags:
``` python
from sklearn.ensemble import RandomForestClassifier
mdepth_array=np.arange(1,7)
for mdepth in mdepth_array :
print('mdepth={}'.format(mdepth))
clf = RandomForestClassifier(n_estimators=40, \
max_depth=mdepth, \
random_state=None, \
min_samples_split=2,
criterion='gini')
scores = cross_val_score(clf, iris.data, iris.target, cv=10)
print("Mean Accuracy and 95 percent confidence interval: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() *2))
```
%%%% Output: stream
mdepth=1
Mean Accuracy and 95 percent confidence interval: 0.91 (+/- 0.21)
mdepth=2
Mean Accuracy and 95 percent confidence interval: 0.95 (+/- 0.09)
mdepth=3
Mean Accuracy and 95 percent confidence interval: 0.96 (+/- 0.09)
mdepth=4
Mean Accuracy and 95 percent confidence interval: 0.97 (+/- 0.07)
mdepth=5
Mean Accuracy and 95 percent confidence interval: 0.95 (+/- 0.10)
mdepth=6
Mean Accuracy and 95 percent confidence interval: 0.96 (+/- 0.09)
%% Cell type:markdown id: tags:
## Exercise 15
- Change the value of the parameter max_depth (ranging from 1 to 5) and record the obtained accuracy. Explain your findings.
- Propose a method for setting the 'best' value of the parameter n_estimators.
%% Cell type:markdown id: tags:
## Study of feature importance
The purpose of this section is to evaluate the importance of a given feature. This may be done by recording, over all the trees in the forest and all the nodes within each tree, the relevance of each feature: the contribution of a feature is increased each time it is used to split a node. This contribution corresponds to the impurity gain weighted by the number of samples in the split node, relative to the training set size.
%% Cell type:code id: tags:
``` python
clf.fit(iris.data, iris.target)
importances = clf.feature_importances_
# standard deviation of the importances across the trees of the forest
# (the loop variable is named t to avoid shadowing the sklearn 'tree' module)
std = np.std([t.feature_importances_ for t in clf.estimators_],
             axis=0)
print(std)
indices = np.argsort(importances)[::-1]  # features sorted by decreasing importance
indices
```
%%%% Output: stream
[0.1228137 0.05077528 0.32452641 0.30561254]
%%%% Output: execute_result
array([2, 3, 0, 1])
%% Cell type:code id: tags:
``` python
# Print the feature ranking
print("Feature ranking:")
for f in range(4):
print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))
# Plot the feature importances of the forest
plt.figure(figsize=(5,6))
plt.title("Feature importances")
plt.bar(range(np.asarray(iris.data).shape[1]), importances[indices],
color="r", yerr=std[indices], align="center")
plt.xticks( range( np.asarray(iris.data).shape[1]) , indices)
plt.xlim([-1, np.asarray(iris.data).shape[1]])
plt.show()
```
%%%% Output: stream
Feature ranking:
1. feature 2 (0.501634)
2. feature 3 (0.388246)
3. feature 0 (0.083257)
4. feature 1 (0.026863)
%%%% Output: display_data
[Hidden Image Output]
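%% Cell type:markdown id: tags:
As a complementary check of the impurity-based scores above, feature importance can also be estimated by permutation. The following sketch is an editorial addition, not part of the original lab; it assumes scikit-learn >= 0.22 for `sklearn.inspection.permutation_importance`:
%% Cell type:code id: tags:
``` python
# Sketch: permutation importance of the forest fitted above (clf)
from sklearn.inspection import permutation_importance

result = permutation_importance(clf, iris.data, iris.target,
                                n_repeats=20, random_state=0)
for f in np.argsort(result.importances_mean)[::-1]:
    print("feature %d: %.3f +/- %.3f"
          % (f, result.importances_mean[f], result.importances_std[f]))
```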
%% Cell type:markdown id: tags:
## Exercise 16
Evaluate the feature importances in the IRIS data set, using ExtraTreesClassifier.
- Compare with the results above.
- What can be concluded about the feature importances?
%% Cell type:code id: tags:
``` python
```
......