
%% Cell type:markdown id: tags:

This notebook can be run on mybinder: [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/git/https%3A%2F%2Fgricad-gitlab.univ-grenoble-alpes.fr%2Fai-courses%2Fautonomous_systems_ml/HEAD?filepath=notebooks%2F5_principal_component_analysis)

%% Cell type:markdown id: tags:

# Compare two transformation methods on the Iris data set

- Principal Component Analysis (PCA) applied to this data identifies the combination of attributes (principal components, or directions in the feature space) that account for the most variance in the data. Here we plot the samples on the first two principal components.

- Linear Discriminant Analysis (LDA) tries to identify attributes that account for the most variance *between classes*. In particular, LDA, in contrast to PCA, is a supervised method, using known class labels.

%% Cell type:code id: tags:

``` python
import matplotlib
import scipy as sp
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams.update({'font.size': 16})

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load and standardize the data
iris = load_iris()
X = iris.data; y = iris.target
# load_iris(return_X_y=True)
sc = StandardScaler()
X = sc.fit_transform(X)
```

%% Cell type:markdown id: tags:

### Plot some pairs of features

%% Cell type:code id: tags:

``` python
f, axes = plt.subplots(2, 2, figsize=(10, 10))

# Features 1 and 2
axes[0, 0].scatter(X[:, 0], X[:, 1], c=iris.target)
axes[0, 0].set_xlabel(iris.feature_names[0])
axes[0, 0].set_ylabel(iris.feature_names[1])

# Features 1 and 3
axes[0, 1].scatter(X[:, 0], X[:, 2], c=iris.target)
axes[0, 1].set_xlabel(iris.feature_names[0])
axes[0, 1].set_ylabel(iris.feature_names[2])

# Features 2 and 4
axes[1, 0].scatter(X[:, 1], X[:, 3], c=iris.target)
axes[1, 0].set_xlabel(iris.feature_names[1])
axes[1, 0].set_ylabel(iris.feature_names[3])

# Features 3 and 4
axes[1, 1].scatter(X[:, 2], X[:, 3], c=iris.target)
axes[1, 1].set_xlabel(iris.feature_names[2])
axes[1, 1].set_ylabel(iris.feature_names[3])
```

%%%% Output: execute_result

Text(0, 0.5, 'petal width (cm)')

%%%% Output: display_data

[Hidden Image Output]

%% Cell type:markdown id: tags:

## PCA

We apply the PCA transformation to the dataset and plot the samples on the first two principal components.

%% Cell type:code id: tags:

``` python
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(X)  # fit the PCA model on the standardized data
Xp = pca.transform(X)  # linear projection along the axes that maximize the dispersion (variance)

f, ax = plt.subplots(figsize=(8, 6))
ax.scatter(Xp[:, 0], Xp[:, 1], c=y)  # plot the first 2 components
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
```

%%%% Output: execute_result

Text(0, 0.5, 'PC2')

%%%% Output: display_data

[Hidden Image Output]
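%% Cell type:markdown id: tags:

The cumulative explained variance can be read off `pca.explained_variance_ratio_`. The sketch below is self-contained (it refits PCA on the standardized Iris data), so it runs even if the cells above have not been executed:

%% Cell type:code id: tags:

``` python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)
pca = PCA().fit(X)

# Cumulative share of the total variance captured by the first r components
cum_ratio = np.cumsum(pca.explained_variance_ratio_)
print(cum_ratio)

f, ax = plt.subplots(figsize=(8, 6))
ax.plot(np.arange(1, len(cum_ratio) + 1), cum_ratio, 'o-')
ax.set_xlabel('number of components $r$')
ax.set_ylabel('cumulative explained variance ratio')
```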

%% Cell type:markdown id: tags:

## LDA

We now apply the LDA transformation to find the directions (vectors) that maximize class separation, and plot their cumulative explained variance ratio in the sense of the Fisher linear discriminant criterion.

Remember that in LDA we assume that all classes share the same estimated covariance. We can therefore rescale the data so that this common covariance becomes the identity. One can then show that classifying a data point after rescaling amounts to finding the estimated class mean that is closest to the data point in Euclidean distance (see course). But this can be done just as well after projecting on the affine subspace spanned by the (rescaled) means of all classes, which reads:

$$z_k = \hat{\Sigma}^{-1/2}\left(\hat{\mu}_k - \overline{\mu}\right), \qquad k = 1,\ldots,K,$$

where $\hat{\Sigma}$ is the sample covariance matrix common to all the classes (LDA assumption), $\hat{\mu}_k$ is the sample mean vector of the $k$th class and $\overline{\mu}= \frac{1}{K} \sum_{k=1}^K \hat{\mu}_k$ is the mean of the class means. Note that the dimension of the space spanned by these rescaled class mean vectors $z_1,\ldots,z_K$ is at most $K-1$, since they are linearly dependent $(\sum_{k=1}^K z_k = 0)$.

This shows that, implicit in the LDA classifier, there is a *dimensionality reduction by linear projection onto a $K-1$ dimensional space*, where $K$ is the total number of target classes.

In the space spanned by the rescaled mean vectors $z_1,\ldots,z_K$, we can now derive vectors $w_{1}, w_{2},\ldots,w_{K-1}$, ordered from the most important for the variance-based [Fisher class-separation criterion](https://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher's_linear_discriminant) to the least. If we want to reduce the dimension, as in the PCA procedure, we can keep only the top $r$ vectors $w_{1},\ldots,w_{r}$, with $r\le K-1$, and project the data on this $r$-dimensional space. We can also compute the (cumulative) explained variance ratio, in the Fisher criterion sense, as a function of $r$, as for the PCA above.
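As a sanity check of the dimension argument above, the sketch below (an illustrative computation, not part of the notebook's pipeline) builds the rescaled class means $z_k$ for the Iris data with NumPy and verifies that they sum to zero; a Cholesky-based whitening matrix plays the role of $\hat{\Sigma}^{-1/2}$:

%% Cell type:code id: tags:

``` python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)
y = iris.target
K = len(np.unique(y))

# Per-class sample means and the pooled within-class covariance (LDA assumption)
mus = np.stack([X[y == k].mean(axis=0) for k in range(K)])
Sigma = sum((X[y == k] - mus[k]).T @ (X[y == k] - mus[k]) for k in range(K)) / (len(X) - K)

# Whitening matrix W with W @ Sigma @ W.T = I (plays the role of Sigma^{-1/2})
W = np.linalg.inv(np.linalg.cholesky(Sigma))

mu_bar = mus.mean(axis=0)
Z = (mus - mu_bar) @ W.T  # rescaled, centered class means z_1, ..., z_K

print(Z.sum(axis=0))  # ~0: the z_k are linearly dependent, so their span has dim <= K-1
```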

%% Cell type:code id: tags:

``` python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)  # LDA is supervised: it uses the class labels y
Xp = lda.transform(X)  # linear projection to maximize class separation for the fitted LDA model

f, ax = plt.subplots(figsize=(8, 6))
ax.scatter(Xp[:, 0], Xp[:, 1], c=y)  # plot the 2 LDA components
ax.set_xlabel('LDA1')
ax.set_ylabel('LDA2')
```

%%%% Output: execute_result

Text(0, 0.5, 'LDA2')

%%%% Output: display_data

[Hidden Image Output]
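%% Cell type:markdown id: tags:

The cumulative explained ratio in the Fisher-criterion sense is exposed by scikit-learn as `lda.explained_variance_ratio_`. The sketch below is self-contained (it refits LDA on the standardized Iris data), so it does not depend on the cells above:

%% Cell type:code id: tags:

``` python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)
y = iris.target

lda = LinearDiscriminantAnalysis().fit(X, y)

# Share of the between-class (Fisher) separation carried by each discriminant axis;
# with K = 3 classes there are at most K - 1 = 2 axes, so the cumulative sum reaches 1 at r = 2
cum_ratio = np.cumsum(lda.explained_variance_ratio_)
print(cum_ratio)
```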

%% Cell type:markdown id: tags:

### Exercise

- Does PCA require the class labels to transform the dataset (justify)?

- How many principal components seem sufficient to explain most of the variance of this dataset?

- How many principal components seem sufficient to get a fairly accurate separation between the classes?

- What is the 'explained variance ratio' criterion for PCA, i.e. the *variance* of what? Same question for LDA.

- Explain why the explained variance ratio is 1 for the LDA with two components.

- Does LDA allow us to get an accurate separation between the classes when using only the first LDA vector?

- Do you think that PCA can be useful to transform this dataset for visualization or dimension reduction?