Commit bb10113a authored by Florent Chatelain's avatar Florent Chatelain

fix LDA comment

parent 3c0401f2
@@ -275,7 +275,7 @@
"\n",
"This shows that, implicit in the LDA classifier, there is a *dimensionality reduction by linear projection onto a $K-1$ dimensional space*, where $K$ is the total number of target classes. \n",
"\n",
"We can order these vectors $z_{(1)}, z_{(2)},\\ldots,z_{(K)}$ of the most important to improve the variance-based [Fisher class-separation criterion](https://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher's_linear_discriminant), then the second one, etc... If we want to reduce the dimension, as in the PCA procedure, we can keep the only top $r$ vectors $z_{(1)},\\ldots,z_{(r)}$, with $r\\le K-1$, and project the data on this $r$-dimensional space. We can also compute the (cumulative) explained ratio of the variance, in the Fisher criterion sense, as a function of $r$ as in the plot above."
"In the space spanned by the rescaled mean vectors $z_1,\\ldots,z_{K-1}$, we can now derive some vectors $w_{1}, w_{2},\\ldots,w_{K-1}$ of the most important to improve the variance-based [Fisher class-separation criterion](https://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher's_linear_discriminant), then the second one, etc... If we want to reduce the dimension, as in the PCA procedure, we can keep the only top $r$ vectors $w_{1},\\ldots,w_{r}$, with $r\\le K-1$, and project the data on this $r$-dimensional space. We can also compute the (cumulative) explained ratio of the variance, in the Fisher criterion sense, as a function of $r$ as in the plot above."
]
},
{
%% Cell type:markdown id: tags:
This notebook can be run on mybinder: [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/git/https%3A%2F%2Fgricad-gitlab.univ-grenoble-alpes.fr%2Fai-courses%2Fautonomous_systems_ml/HEAD?filepath=notebooks%2F5_principal_component_analysis)
%% Cell type:markdown id: tags:
# Compare two transformation methods on the IRIS data set:
- Principal Component Analysis (PCA) applied to this data identifies the combination of attributes (principal components, or directions in the feature space) that account for the most variance in the data. Here we plot the samples on the first two principal components.
- Linear Discriminant Analysis (LDA) tries to identify attributes that account for the most variance *between classes*. In particular, LDA, in contrast to PCA, is a supervised method that uses the known class labels.
%% Cell type:code id: tags:
``` python
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 16})
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
# Load and standardize the data
iris = load_iris()
X = iris.data ; y = iris.target
# load_iris(return_X_y=True)
sc = StandardScaler()
X = sc.fit_transform(X)
```
%% Cell type:markdown id: tags:
### Plot some pairs of features
%% Cell type:code id: tags:
``` python
f, axes = plt.subplots(2,2,figsize=(10, 10))
# Features 1 and 2
axes[0,0].scatter(X[:,0], X[:,1],c=iris.target,)
axes[0,0].set_xlabel(iris.feature_names[0])
axes[0,0].set_ylabel(iris.feature_names[1])
# Features 1 and 3
axes[0,1].scatter(X[:,0], X[:,2],c=iris.target,)
axes[0,1].set_xlabel(iris.feature_names[0])
axes[0,1].set_ylabel(iris.feature_names[2])
# Features 2 and 4
axes[1,0].scatter(X[:,1], X[:,3],c=iris.target,)
axes[1,0].set_xlabel(iris.feature_names[1])
axes[1,0].set_ylabel(iris.feature_names[3])
# Features 3 and 4
axes[1,1].scatter(X[:,2], X[:,3],c=iris.target)
axes[1,1].set_xlabel(iris.feature_names[2])
axes[1,1].set_ylabel(iris.feature_names[3])
```
%%%% Output: execute_result
Text(0, 0.5, 'petal width (cm)')
%%%% Output: display_data
[Hidden Image Output]
%% Cell type:markdown id: tags:
## PCA
We apply the PCA transformation to the dataset and plot the cumulative explained variance ratio.
%% Cell type:code id: tags:
``` python
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(X)
l = pca.explained_variance_
PC_values = np.arange(pca.n_components_) + 1
PC_labels = ['PC' + str(nb+1) for nb in range(pca.n_components_)]
plt.figure(figsize=(8,6))
plt.bar(PC_values, l.cumsum()/l.sum(), linewidth=2, edgecolor='k')
plt.xticks(ticks=PC_values, labels=PC_labels)
plt.title('(Cumulative) Scree Plot')
plt.xlabel('Principal Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.grid(axis='y')
plt.show()
```
%%%% Output: display_data
[Hidden Image Output]
%% Cell type:code id: tags:
``` python
l.cumsum()/l.sum()
```
%%%% Output: execute_result
array([0.72962445, 0.95813207, 0.99482129, 1.        ])
%% Cell type:markdown id: tags:
We can now project the data onto the first two PCs.
%% Cell type:code id: tags:
``` python
Xp = pca.transform(X) # linear projection along the axes that maximize the dispersion (variance)
f, ax = plt.subplots(figsize=(8, 6))
ax.scatter(Xp[:,0], Xp[:,1],c=y) # plot the first 2 principal components
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
```
%%%% Output: execute_result
Text(0, 0.5, 'PC2')
%%%% Output: display_data
[Hidden Image Output]
%% Cell type:markdown id: tags:
## LDA
We now apply the LDA transformation to find the directions (vectors) that maximize the class separation, and plot their cumulative explained variance ratio with respect to the Fisher linear discriminant separation criterion.
%% Cell type:code id: tags:
``` python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA()
lda.fit(X, y)
l = lda.explained_variance_ratio_
LDA_values = np.arange(l.size) + 1
LDA_labels = ['LDA' + str(nb+1) for nb in range(l.size)]
plt.figure(figsize=(8,6))
plt.bar(LDA_values, l.cumsum()/l.sum(), linewidth=2, edgecolor='k')
plt.xticks(ticks=LDA_values, labels=LDA_labels)
plt.title('Fisher linear analysis (LDA)')
plt.xlabel('LDA components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.grid(axis='y')
plt.show()
```
%%%% Output: display_data
[Hidden Image Output]
%% Cell type:markdown id: tags:
Remember that in LDA we assume that all classes share the same estimated covariance. Thus we can rescale the data so that this covariance becomes the identity. One can then show that classifying a data point after scaling is equivalent to finding the estimated class mean that is closest to the data point in Euclidean distance (see course). But this can be done just as well after projecting onto the affine subspace spanned by the rescaled class means, which are given by:
$$
z_k = {\hat{\Sigma}}^{-1}(\hat{\mu}_k - \overline{\mu}),
$$
where $\hat{\Sigma}$ is the sample covariance matrix common to all the classes (LDA assumption), $\hat{\mu}_k$ is the sample mean vector of the $k$th class and $\overline{\mu}= \frac{1}{K} \sum_{k=1}^K \hat{\mu}_k$ is the mean of the class means. Note that the dimension of the space spanned by all these rescaled class mean vectors $z_1,\ldots,z_K$ is at most $K-1$ since they are linearly dependent $(\sum_{k=1}^K {z_k}=0)$.
This shows that, implicit in the LDA classifier, there is a *dimensionality reduction by linear projection onto a $K-1$ dimensional space*, where $K$ is the total number of target classes.
In the space spanned by the rescaled mean vectors $z_1,\ldots,z_K$, we can now derive vectors $w_{1}, w_{2},\ldots,w_{K-1}$, ordered from the most important for the variance-based [Fisher class-separation criterion](https://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher's_linear_discriminant) to the least important. If we want to reduce the dimension further, as in the PCA procedure, we can keep only the top $r$ vectors $w_{1},\ldots,w_{r}$, with $r\le K-1$, and project the data onto this $r$-dimensional space. We can also compute the cumulative explained variance ratio, in the Fisher criterion sense, as a function of $r$, as in the plot above.
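As a sanity check, here is a minimal sketch (assuming `X`, `y` and `np` from the cells above, and using a pooled within-class covariance estimate as $\hat{\Sigma}$; the variable names `mu`, `mu_bar`, `Sigma_hat` and `z` are only illustrative) that computes the rescaled class means $z_k$ directly and verifies that they sum to zero, so that they span a space of dimension at most $K-1=2$ on the iris data.
%% Cell type:code id: tags:
``` python
# Sketch: rescaled class means z_k = Sigma^{-1} (mu_k - mu_bar)
K = len(np.unique(y))
mu = np.array([X[y == k].mean(axis=0) for k in range(K)])    # class means, shape (K, d)
mu_bar = mu.mean(axis=0)                                      # mean of the class means
# pooled within-class covariance (common to all classes under the LDA assumption)
Sigma_hat = sum((np.sum(y == k) - 1) * np.cov(X[y == k], rowvar=False)
                for k in range(K)) / (len(y) - K)
z = np.linalg.solve(Sigma_hat, (mu - mu_bar).T).T             # rows are z_1, ..., z_K
print("sum of the z_k (should be ~0):", z.sum(axis=0))
print("dimension of span{z_1, ..., z_K}:", np.linalg.matrix_rank(z))  # at most K-1
```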
%% Cell type:code id: tags:
``` python
Xp = lda.transform(X) # linear projection to maximize class separation for the fitted LDA model
f, ax = plt.subplots(figsize=(8, 6))
ax.scatter(Xp[:,0], Xp[:,1],c=y) # plot the data projected onto the first 2 LDA components
ax.set_xlabel('LDA1')
ax.set_ylabel('LDA2')
```
%%%% Output: execute_result
Text(0, 0.5, 'LDA2')
%%%% Output: display_data
[Hidden Image Output]
%% Cell type:markdown id: tags:
### Exercise
- Does PCA require the class labels to transform the dataset (justify)?
- How many principal components seem sufficient to explain most of the variance of this dataset?
- How many principal components seem sufficient to get a fairly accurate separation between the classes?
- What is the 'explained variance ratio' criterion for PCA, i.e. the *variance* of what? Same question for LDA.
- Explain why the cumulative explained variance ratio reaches 1 for LDA with two components.
- Does LDA allow us to get an accurate separation between the classes when using only the first LDA vector?
- Do you think that PCA can be useful to transform this dataset for visualization or dimension reduction?