### up pca lab

%% Cell type:markdown id: tags:

This notebook can be run on mybinder: [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/git/https%3A%2F%2Fgricad-gitlab.univ-grenoble-alpes.fr%2Fai-courses%2Fautonomous_systems_ml/HEAD?filepath=notebooks%2F5_principal_component_analysis)

%% Cell type:markdown id: tags:

# Olympic decathlon data

This example is a short introduction to principal component analysis (PCA). The data are performance marks on the ten [decathlon events](https://en.wikipedia.org/wiki/Decathlon) for 33 athletes at the 1988 Olympic Games.

## Exercise

In the following, the questions to answer (exercises) are indicated by a **Q** mark.

%% Cell type:markdown id: tags:

The code cell below defines some useful functions to display summary statistics of the PCA representation.

%% Cell type:code id: tags:

``` python
import matplotlib.pyplot as plt
import numpy as np   # used by the helper functions below
import pandas as pd  # used by pca_summary

plt.rcParams.update({'font.size': 18})


def scree_plot(pca):
    """Bar plot of the explained variance of each principal component."""
    PC_values = np.arange(pca.n_components_) + 1
    PC_labels = ['PC' + str(nb + 1) for nb in range(pca.n_components_)]
    plt.figure(figsize=(8, 6))
    plt.bar(PC_values, pca.explained_variance_, linewidth=2, edgecolor='k')
    plt.xticks(ticks=PC_values, labels=PC_labels)
    plt.title('Scree Plot')
    plt.xlabel('Principal Components')
    plt.ylabel('Explained Variance')
    plt.show()


def pca_summary(pca, X, out=True):
    """Display a table of the standard deviation, proportion of variance,
    and cumulative proportion of variance for each component
    """
    names = ["PC" + str(i) for i in range(1, len(pca.explained_variance_ratio_) + 1)]
    a = list(np.std(pca.transform(X), axis=0, ddof=1))
    b = list(pca.explained_variance_ratio_)
    c = [np.sum(pca.explained_variance_ratio_[:i])
         for i in range(1, len(pca.explained_variance_ratio_) + 1)]
    columns = pd.MultiIndex.from_tuples([("sdev", "Standard deviation"),
                                         ("varprop", "Proportion of Variance"),
                                         ("cumprop", "Cumulative Proportion")])
    summary = pd.DataFrame(list(zip(a, b, c)), index=names, columns=columns)
    if out:
        print("Importance of components:")
        display(summary)
    return summary


def biplot2D(score, coeff, labels=None):
    """Generate a biplot for the first two principal components,
    displaying both the scores (samples) and the variables (loadings)
    """
    xs = score[:, 0]    # projection on PC1
    ys = score[:, 1]    # projection on PC2
    p = coeff.shape[1]  # number of variables
    n = score.shape[0]  # number of samples
    fig, ax = plt.subplots(figsize=(10, 8))
    # plot the scores with the index of each sample
    ax.scatter(xs, ys, marker=".", color='k')
    for i in range(n):
        ax.text(xs[i], ys[i], str(i), color='k')
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    # plot the variable vectors (arrows) in the PC plane (loadings)
    arrow_sc = 1.15
    color = 'tab:red'
    ax2 = ax.twinx()  # instantiate a second y axis
    ax2.set_ylim(-1.2, 1.2)
    ax2.tick_params(axis='y', labelcolor=color)
    ax2 = ax2.twiny()  # instantiate a second x axis
    ax2.set_xlim(-1.2, 1.2)
    ax2.tick_params(axis='x', labelcolor=color)
    for i in range(p):
        ax2.arrow(0, 0, coeff[0, i], coeff[1, i], color=color, alpha=0.5,
                  linestyle='-', linewidth=1.5, head_width=0.02, head_length=0.02)
        if labels is None:
            ax2.text(coeff[0, i] * arrow_sc, coeff[1, i] * arrow_sc,
                     "Var" + str(i + 1), color=color, ha='center', va='center')
        else:
            ax2.text(coeff[0, i] * arrow_sc, coeff[1, i] * arrow_sc,
                     labels[i], color=color, ha='center', va='center')
```

%% Cell type:markdown id: tags:

## Dataset

#### Load the olympic dataset contained in the text file `olympic.csv` (*Warning:* copy this data file into the notebook directory)

%% Cell type:code id: tags:

``` python
import pandas as pd
import numpy as np

# load the dataset
olympic = pd.read_csv('olympic.csv', sep=',', header=0)
olympic.head()  # data overview: variable names and first rows
```

%% Output

         100  long   poid  haut    400    110   disq  perc   jave     1500
    0  11.25  7.43  15.48  2.27  48.90  15.13  49.28   4.7  61.32  268.95
    1  10.87  7.45  14.97  1.97  47.71  14.46  44.36   5.1  61.76  273.02
    2  11.18  7.44  14.20  1.97  48.29  14.81  43.66   5.2  64.16  263.20
    3  10.62  7.38  15.02  2.03  49.06  14.72  44.80   4.9  64.04  285.11
    4  11.02  7.43  12.92  1.97  47.44  14.40  41.20   5.2  57.46  256.64

%% Cell type:markdown id: tags:

#### Display some descriptive statistics for this dataset

%% Cell type:code id: tags:

``` python
olympic.describe()
```

%% Output

                 100       long       poid       haut        400        110
    count  33.000000  33.000000  33.000000  33.000000  33.000000  33.000000
    mean   11.196364   7.133333  13.976364   1.982727  49.276667  15.048788
    std     0.243321   0.304340   1.331991   0.093984   1.069660   0.506765
    min    10.620000   6.220000  10.270000   1.790000  47.440000  14.180000
    25%    11.020000   7.000000  13.150000   1.940000  48.340000  14.720000
    50%    11.180000   7.090000  14.120000   1.970000  49.150000  15.000000
    75%    11.430000   7.370000  14.970000   2.030000  49.980000  15.380000
    max    11.570000   7.720000  16.600000   2.270000  51.280000  16.200000

                disq       perc       jave        1500
    count  33.000000  33.000000  33.000000   33.000000
    mean   42.353939   4.739394  59.438788  276.038485
    std     3.719131   0.334421   5.495998   13.657098
    min    34.360000   4.000000  49.520000  256.640000
    25%    39.080000   4.600000  55.420000  266.420000
    50%    42.320000   4.700000  59.480000  272.060000
    75%    44.800000   4.900000  64.000000  286.040000
    max    50.660000   5.700000  72.600000  303.170000

%% Cell type:markdown id: tags:

**Q:** Based on the summary statistics above, can you find the units of the marks for the *running* events, and for the *jumping* or *throwing* ones?

%% Cell type:markdown id: tags:

## Part I. PCA on raw data

%% Cell type:markdown id: tags:

Make a *PCA* of the decathlon event scores $X \in \mathbb{R}^{n \times p}$: $n = 33$ samples (athletes), $p = 10$ variables/features (decathlon events).

%% Cell type:code id: tags:

``` python
from sklearn.decomposition import PCA

pca = PCA()
olympic_pc = pca.fit_transform(olympic)  # get the principal components
```

%% Cell type:markdown id: tags:

How are the component variances/eigenvalues $\lambda_i^2$, $1 \le i \le p$, distributed?
Let's visualize the **scree plot**.

%% Cell type:code id: tags:

``` python
scree_plot(pca)
```

%% Output

%% Cell type:markdown id: tags:

#### Display a summary of the PCA representation

%% Cell type:code id: tags:

``` python
pca_summary(pca, olympic);
```

%% Output

    Importance of components:

%% Cell type:markdown id: tags:

**Q:** Does it seem possible to reduce the dimension of this dataset using PCA? If so, how many components do you think are sufficient to explain these olympic data (justify)?

%% Cell type:markdown id: tags:

#### Display the biplot

*Recall that the entries of a particular principal component vector are called the __loadings__: they describe the contribution, or weight, of each variable in that principal component. Large loading values (positive or negative) indicate that a particular variable has a strong relationship with that principal component. The sign of a loading also indicates whether a variable and a principal component are positively or negatively correlated.*

The *biplot* is a useful visualisation tool that gives a graphical summary of both the samples (athletes), in terms of *scores*, and the variables/features, in terms of *loadings*.

%% Cell type:code id: tags:

``` python
# Call the function, using only the first 2 PCs
biplot2D(olympic_pc[:, 0:2], pca.components_[0:2, :], olympic.columns)
plt.show()
```

%% Output

%% Cell type:markdown id: tags:

**Q:** From the plot above,

- What are the essential loadings for the first principal component `PC1`? Hence, what is the meaning of `PC1`? Same question for `PC2`.
- How can we interpret the score of athlete 9? Same question for athletes 23 and 16.
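%% Cell type:markdown id: tags:

*Aside (not part of the original lab):* the scores shown in the biplot are nothing more than the centered data projected onto the loading vectors. This can be checked on a random stand-in matrix `X` (hypothetical data, same shape as the decathlon table):

%% Cell type:code id: tags:

``` python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(33, 10))  # random stand-in for the 33 x 10 decathlon matrix

pca = PCA().fit(X)
scores = pca.transform(X)

# Each row of pca.components_ holds the loadings of one principal component;
# the scores are the centered data projected onto these loading vectors.
projected = (X - X.mean(axis=0)) @ pca.components_.T
print(np.allclose(scores, projected))  # True
```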
%% Cell type:markdown id: tags:

We can now compare the athletes' `1500` event marks with their scores on the first component.

%% Cell type:code id: tags:

``` python
# display the average mark for the 1500 event
print('Average 1500 event mark (seconds) == {:.2f}\n'.format(olympic['1500'].mean()))

# display the table of 1500 marks and PC1 scores for all the athletes
pd.DataFrame(list(zip(olympic['1500'], olympic_pc[:, 0])),
             columns=['1500', 'score PC1']).transpose()
```

%% Output

    Average 1500 event mark (seconds) == 276.04

                        0           1           2           3           4  \
    1500       268.950000  273.020000  263.200000  285.110000  256.640000
    score PC1   -6.073747   -2.678254  -12.350157    9.526886  -19.567241

                        5           6           7           8           9  ...  \
    1500       274.070000  291.200000  265.860000  269.620000  292.240000  ...
    score PC1   -2.312197   15.048108  -10.023305   -5.976738   16.717007  ...

                       23          24          25          26          27  \
    1500       277.840000  285.570000  270.070000  261.900000  303.170000
    score PC1    1.098877    8.574457   -6.798120  -15.095428   27.202497

                       28          29          30          31          32
    1500       272.060000  293.850000  294.990000  293.720000  269.980000
    score PC1   -4.895252   17.291469   19.234705   16.965757   -7.217652

    [2 rows x 33 columns]

%% Cell type:markdown id: tags:

We can check in the table above that **faster** runners have a **negative score** on this component, while slower ones have a higher, positive score.

We can also plot the athletes' `PC1` scores against their `1500` event marks.

%% Cell type:code id: tags:

``` python
plt.figure(figsize=(8, 6))
plt.plot(olympic['1500'], olympic_pc[:, 0], 'k.')
for i in range(len(olympic_pc[:, 0])):
    plt.text(olympic['1500'][i], olympic_pc[i, 0], str(i), color='k')
plt.xlabel('1500 time (s)')
plt.ylabel('PC1')
```

%% Output

    Text(0, 0.5, 'PC1')

%% Cell type:markdown id: tags:

Hence the correlation between the `1500` event and the first principal component is almost perfect!

%% Cell type:markdown id: tags:

Moreover, the previous biplot shows that the *second principal component* is correlated with strength, in the form of a long *javelin* throw.
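%% Cell type:markdown id: tags:

*Aside (not part of the original lab):* these near-perfect correlations on raw data are no accident: unscaled PCA is dominated by the variable with the largest variance. A small sketch on random stand-in data (the column scales below are arbitrary, chosen only to mimic one dominant "1500"-like column) reproduces the effect:

%% Cell type:code id: tags:

``` python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 33
X = rng.normal(size=(n, 4))              # four "events" with unit variance
X[:, 0] = 276 + 14 * rng.normal(size=n)  # one dominant column, std ~ 14

scores = PCA().fit_transform(X)

# PC1 is almost perfectly correlated (up to sign) with the dominant column
r = np.corrcoef(X[:, 0], scores[:, 0])[0, 1]
print(round(abs(r), 3))  # close to 1
```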
%% Cell type:code id: tags:

``` python
plt.figure(figsize=(8, 6))
plt.plot(olympic['jave'], olympic_pc[:, 1], 'k.')
for i in range(len(olympic_pc[:, 0])):
    plt.text(olympic['jave'][i], olympic_pc[i, 1], str(i), color='k')
plt.xlabel('Javelin throw (m)')
plt.ylabel('PC2')
```

%% Output

    Text(0, 0.5, 'PC2')

%% Cell type:markdown id: tags:

We can also check in the plot above that stronger throwers have higher values on this second component.

%% Cell type:markdown id: tags:

#### Recap:

We found that two principal components are sufficient to explain most (i.e., at least 95%) of the variance of the athletes' performance, and that they are almost perfectly correlated with the 1500 m and javelin events!

**Q:**
- Why was this result expected just by looking at the descriptive statistics table of the variables displayed at the beginning of the notebook?
- Can we deduce that **only these two events**, namely the *1500 m* run and the *javelin* throw, are **really significant** to measure the performance of decathlon athletes (justify, *hint: see below*)?

%% Cell type:markdown id: tags:

## Part II. Standardizing: scale matters!

In the previous example, we saw that the first two components were based roughly on endurance and strength. However, **we did not scale the variables**, so the *1500* has much more weight than the *400*, for instance! We correct this by standardizing the variables with the `sklearn` preprocessing methods.

%% Cell type:code id: tags:

``` python
from sklearn.preprocessing import StandardScaler

# Center and reduce the variables
scaler = StandardScaler()
Xs = scaler.fit_transform(olympic)

# Make PCA on the standardized variables
pca_s = PCA()
Xs_pc = pca_s.fit_transform(Xs)  # project the standardized data into the PCA space
```

%% Cell type:markdown id: tags:

Show the new biplot for the standardized variables.

%% Cell type:code id: tags:

``` python
# Call the function, using only the first 2 PCs
biplot2D(Xs_pc[:, 0:2], pca_s.components_[0:2, :], olympic.columns)
plt.show()
```

%% Output

%% Cell type:markdown id: tags:

After standardizing, the plot above reinforces our earlier interpretation by grouping the sprint events (*100m*, *110m*, *400m*, *long*) along a common axis aligned with the first component. Likewise, the strength and throwing events (columns *jave*, *disq*, *poid*, i.e. javelin, discus, shot put) lie on a separate axis, rather aligned with the second component (and thus rather decorrelated from the first one).

%% Cell type:markdown id: tags:

### Display the loadings

%% Cell type:code id: tags:

``` python
pd.DataFrame(pca_s.components_[:2, :].T, columns=['PC1', 'PC2'], index=olympic.columns)
```

%% Output

               PC1       PC2
    100   0.415882  0.148808
    long -0.394051 -0.152082
    poid -0.269106  0.483537
    haut -0.212282  0.027898
    400   0.355847  0.352160
    110   0.433482  0.069568
    disq -0.175792  0.503335
    perc -0.384082  0.149582
    jave -0.179944  0.371957
    1500  0.170143  0.420965

%% Cell type:markdown id: tags:

#### Exercise

*Check that PCA is just an eigendecomposition of the sample covariance matrix*:
- compute the sample covariance matrix of the scaled data `Xs`, then its eigenvectors, and check that the dominant eigenvectors (i.e. those associated with the largest eigenvalues) correspond to the loadings of the principal components displayed above
(*hint 1: the standardized data `Xs` are already centered; hint 2: since the sample covariance matrix is Hermitian, we can use the function `np.linalg.eigh()` to get its eigendecomposition*)

**Q:** From the loadings and biplot analysis, answer the following questions:

- *After standardization* of the olympic data, is the `1500` event still the most important to explain the variance?
- In this *standardized* case, what are the overall meanings of the first two principal components? Are the loadings also consistent with these conclusions?
- How can you interpret the fact that the arrows for the jump events point in the opposite direction to the sprint ones?
- How can you interpret the scores of athletes 1, 16 and 32 respectively (weakness/strength)?
- In your opinion, is it better (i.e. more useful) to perform PCA on *standardized* or *non-standardized* data for this example?

*Plot (scree plot) and display the (cumulative) proportion of variance explained by each principal component for the standardized data.*

**Q:**
- Is it still possible to explain most of the variance with a much reduced number of components?
- Why is this rather good news for the sporting interest of decathlon?

%% Cell type:code id: tags:

``` python
```
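%% Cell type:markdown id: tags:

*Aside (not part of the original lab):* a possible sketch of the kind of check the exercise asks for, run on a random standardized stand-in matrix `Z` instead of `Xs` so that the cell is self-contained:

%% Cell type:code id: tags:

``` python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
Z = rng.normal(size=(33, 10))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)  # standardized stand-in for Xs

# Sample covariance matrix (ddof=1, as used internally by sklearn's PCA)
S = np.cov(Z, rowvar=False)

# eigh (for symmetric/Hermitian matrices) returns eigenvalues in ascending
# order; reverse both outputs to match PCA's descending ordering
eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

pca = PCA().fit(Z)

# The eigenvalues match the explained variances, and the eigenvectors match
# the loadings up to an arbitrary sign flip per component
print(np.allclose(eigvals, pca.explained_variance_))
print(np.allclose(np.abs(eigvecs.T), np.abs(pca.components_)))
```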