Commit bb10113a by Florent Chatelain

### fix LDA comment

parent 3c0401f2
 ... ...

```diff
@@ -275,7 +275,7 @@
 "\n",
 "This shows that, implicit in the LDA classifier, there is a *dimensionality reduction by linear projection onto a $K-1$ dimensional space*, where $K$ is the total number of target classes. \n",
 "\n",
-"We can order these vectors $z_{(1)}, z_{(2)},\ldots,z_{(K)}$ of the most important to improve the variance-based [Fisher class-separation criterion](https://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher's_linear_discriminant), then the second one, etc... If we want to reduce the dimension, as in the PCA procedure, we can keep the only top $r$ vectors $z_{(1)},\ldots,z_{(r)}$, with $r\le K-1$, and project the data on this $r$-dimensional space. We can also compute the (cumulative) explained ratio of the variance, in the Fisher criterion sense, as a function of $r$ as in the plot above."
+"In the space spanned by the rescaled mean vectors $z_1,\ldots,z_{K-1}$, we can now derive some vectors $w_{1}, w_{2},\ldots,w_{K-1}$ of the most important to improve the variance-based [Fisher class-separation criterion](https://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher's_linear_discriminant), then the second one, etc... If we want to reduce the dimension, as in the PCA procedure, we can keep the only top $r$ vectors $w_{1},\ldots,w_{r}$, with $r\le K-1$, and project the data on this $r$-dimensional space. We can also compute the (cumulative) explained ratio of the variance, in the Fisher criterion sense, as a function of $r$ as in the plot above."
 ] }, {
 ... ...
```
%% Cell type:markdown id: tags:

This notebook can be run on mybinder: [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/git/https%3A%2F%2Fgricad-gitlab.univ-grenoble-alpes.fr%2Fai-courses%2Fautonomous_systems_ml/HEAD?filepath=notebooks%2F5_principal_component_analysis)

%% Cell type:markdown id: tags:

# Compare two transformation methods on the IRIS data set

- Principal Component Analysis (PCA) applied to this data identifies the combinations of attributes (principal components, or directions in the feature space) that account for the most variance in the data. Here we plot the samples on the first 2 principal components.
- Linear Discriminant Analysis (LDA) tries to identify attributes that account for the most variance *between classes*. In contrast to PCA, LDA is a supervised method that uses the known class labels.

%% Cell type:code id: tags:

``` python
import matplotlib
import scipy as sp
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams.update({'font.size': 16})

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load and standardize the data
iris = load_iris()
X = iris.data ; y = iris.target  # load_iris(return_X_y=True)
sc = StandardScaler()
X = sc.fit_transform(X)
```

%% Cell type:markdown id: tags:

### Plot some pairs of features

%% Cell type:code id: tags:

``` python
f, axes = plt.subplots(2, 2, figsize=(10, 10))
# Axes 1 and 2
axes[0,0].scatter(X[:,0], X[:,1], c=iris.target)
axes[0,0].set_xlabel(iris.feature_names[0])
axes[0,0].set_ylabel(iris.feature_names[1])
# Axes 1 and 3
axes[0,1].scatter(X[:,0], X[:,2], c=iris.target)
axes[0,1].set_xlabel(iris.feature_names[0])
axes[0,1].set_ylabel(iris.feature_names[2])
# Axes 2 and 4
axes[1,0].scatter(X[:,1], X[:,3], c=iris.target)
axes[1,0].set_xlabel(iris.feature_names[1])
axes[1,0].set_ylabel(iris.feature_names[3])
# Axes 3 and 4
axes[1,1].scatter(X[:,2], X[:,3], c=iris.target)
axes[1,1].set_xlabel(iris.feature_names[2])
axes[1,1].set_ylabel(iris.feature_names[3])
```

%%%% Output: execute_result

Text(0, 0.5, 'petal width (cm)')

%%%% Output: display_data

[Hidden Image Output]

%% Cell type:markdown id: tags:

## PCA

We apply the PCA transformation to the dataset and plot the cumulative explained variance.

%% Cell type:code id: tags:

``` python
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(X)
l = pca.explained_variance_

PC_values = np.arange(pca.n_components_) + 1
PC_labels = ['PC' + str(nb+1) for nb in range(pca.n_components_)]

plt.figure(figsize=(8,6))
plt.bar(PC_values, l.cumsum()/l.sum(), linewidth=2, edgecolor='k')
plt.xticks(ticks=PC_values, labels=PC_labels)
plt.title('(Cumulative) Scree Plot')
plt.xlabel('Principal Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.grid(axis='y')
plt.show()
```

%%%% Output: display_data

[Hidden Image Output]

%% Cell type:code id: tags:

``` python
l.cumsum()/l.sum(),
```

%%%% Output: execute_result

(array([0.72962445, 0.95813207, 0.99482129, 1.        ]),)

%% Cell type:markdown id: tags:

We can project the data on the first two PCs.

%% Cell type:code id: tags:

``` python
Xp = pca.transform(X)  # linear projection along the axes that maximize the dispersion (variance)
f, ax = plt.subplots(figsize=(8, 6))
ax.scatter(Xp[:,0], Xp[:,1], c=y)  # plot the 2 first components
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
```

%%%% Output: execute_result

Text(0, 0.5, 'PC2')

%%%% Output: display_data

[Hidden Image Output]

%% Cell type:markdown id: tags:

## LDA

We now apply the LDA transformation to find the directions (vectors) that maximize the class separation, and plot their cumulative explained variance ratio with respect to the Fisher linear discriminant separation criterion.
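As a quick standalone sanity check (a sketch, assuming the same standardized iris data as above, not part of the original notebook): with \$K=3\$ classes, scikit-learn's LDA yields at most \$K-1=2\$ discriminant components, whose explained variance ratios sum to 1.

``` python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

lda = LDA().fit(X, y)
# With K = 3 iris classes, at most K - 1 = 2 discriminant components exist,
# and their explained variance ratios sum to 1
print(lda.explained_variance_ratio_.shape)  # (2,)
```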
%% Cell type:code id: tags:

``` python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

lda = LDA()
lda.fit(X, y)
l = lda.explained_variance_ratio_

LDA_values = np.arange(2) + 1
LDA_labels = ['LDA' + str(nb+1) for nb in range(2)]

plt.figure(figsize=(8,6))
plt.bar(LDA_values, l.cumsum()/l.sum(), linewidth=2, edgecolor='k')
plt.xticks(ticks=LDA_values, labels=LDA_labels)
plt.title('Fisher linear analysis (LDA)')
plt.xlabel('LDA components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.grid(axis='y')
plt.show()
```

%%%% Output: display_data

[Hidden Image Output]

%% Cell type:markdown id: tags:

Remember that in LDA we assume that all classes share the same estimated covariance, so we can rescale the data such that this covariance is the identity. One can then show that classifying a data point after scaling is equivalent to finding the estimated class mean that is closest to the data point in Euclidean distance (see course). This can be done just as well after projecting onto the affine subspace spanned by the (rescaled) means of all the classes, which read:
\$\$
z_k = {\hat{\Sigma}}^{-1}(\hat{\mu}_k - \overline{\mu}),
\$\$
where \$\hat{\Sigma}\$ is the sample covariance matrix common to all the classes (LDA assumption), \$\hat{\mu}_k\$ is the sample mean vector of the \$k\$th class, and \$\overline{\mu} = \frac{1}{K} \sum_{k=1}^K \hat{\mu}_k\$ is the mean of the class means. Note that the dimension of the space spanned by all these rescaled class mean vectors \$z_1,\ldots,z_K\$ is at most \$K-1\$ since they are linearly dependent \$(\sum_{k=1}^K z_k = 0)\$.

This shows that, implicit in the LDA classifier, there is a *dimensionality reduction by linear projection onto a \$K-1\$ dimensional space*, where \$K\$ is the total number of target classes.
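The rescaled class means \$z_k\$ can be computed directly with NumPy. This is a minimal sketch (not the notebook's code), assuming the standardized iris data and using the pooled within-class covariance as \$\hat{\Sigma}\$; the final check confirms that the \$z_k\$ sum to zero, hence span at most a \$K-1\$ dimensional space.

``` python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)
K = len(np.unique(y))  # K = 3 classes

# Class means, shape (K, n_features)
mus = np.stack([X[y == k].mean(axis=0) for k in range(K)])

# Pooled within-class covariance: the common covariance estimate of LDA
Sigma = sum((X[y == k] - mus[k]).T @ (X[y == k] - mus[k])
            for k in range(K)) / (len(X) - K)

mu_bar = mus.mean(axis=0)  # mean of the class means
# Rows are z_k = Sigma^{-1} (mu_k - mu_bar)
Z = np.linalg.solve(Sigma, (mus - mu_bar).T).T

print(np.allclose(Z.sum(axis=0), 0.0))  # True: the z_k are linearly dependent
```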
In the space spanned by the rescaled mean vectors \$z_1,\ldots,z_{K-1}\$, we can now derive vectors \$w_{1}, w_{2},\ldots,w_{K-1}\$, ordered from the most important for the variance-based [Fisher class-separation criterion](https://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher's_linear_discriminant) to the least important. If we want to reduce the dimension, as in the PCA procedure, we can keep only the top \$r\$ vectors \$w_{1},\ldots,w_{r}\$, with \$r \le K-1\$, and project the data on this \$r\$-dimensional space. We can also compute the (cumulative) explained variance ratio, in the Fisher criterion sense, as a function of \$r\$, as in the plot above.

%% Cell type:code id: tags:

``` python
Xp = lda.transform(X)  # linear projection that maximizes class separation for the fitted LDA model
f, ax = plt.subplots(figsize=(8, 6))
ax.scatter(Xp[:,0], Xp[:,1], c=y)  # plot the 2 LDA components
ax.set_xlabel('LDA1')
ax.set_ylabel('LDA2')
```

%%%% Output: execute_result

Text(0, 0.5, 'LDA2')

%%%% Output: display_data

[Hidden Image Output]

%% Cell type:markdown id: tags:

### Exercise

- Does PCA require the class labels to transform the dataset (justify)?
- How many principal components seem sufficient to explain most of the variance of this dataset?
- How many principal components seem sufficient to get a fairly accurate separation between the classes?
- What is the 'explained variance ratio' criterion for PCA, i.e. the *variance* of what? Same question for LDA.
- Explain why the explained variance ratio is 1 for the LDA with two components.
- Does LDA allow us to get an accurate separation between the classes when using only the first LDA vector?
- Do you think that PCA can be useful to transform this dataset for visualization or dimension reduction?

 ... ...
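As a complement, keeping only the top \$r\$ discriminant directions discussed above can be sketched as follows (an illustrative sketch, assuming the same standardized iris data; `n_components` is scikit-learn's parameter for the number of retained components):

``` python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

r = 1  # keep only the top discriminant direction, r <= K - 1
Xr = LDA(n_components=r).fit_transform(X, y)
print(Xr.shape)  # (150, 1)
```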