Commit fec26481 authored by Florent Chatelain's avatar Florent Chatelain

up pca lab

parent 9916e8e5
@@ -1152,7 +1152,7 @@
 "- *After standarization* of the olympic data, is the `1500` event still the more important to explain the variance?\n",
 "- In this *standardized* case what are the overall meanings of the first two principal components? Are the loadings also consistent with these conclusions?\n",
 "- How can you interpret that the arrows for the jump events are in the opposite direction than the sprint ones?\n",
-"- How can we interpret the score of athlete 1, 16, 32 respectively (weakness/strength)? \n",
+"- How can you interpret the score of athlete 1, 16, 32 respectively (weakness/strength)? \n",
 "- In your opinion, is it better (i.e. more useful) to perform PCA on *standardized* or *non-standardized* data for this example?\n",
 "\n",
 "*Plot (screeplot) and display the (cumulative) proportion of explained variance with each principal components for the standardized data.*\n",
@@ -1161,13 +1161,6 @@
 "- Is it still possible to explain most of the variance with a much reduced number of components?\n",
 "- Why this is rather a good news for the sporting interest of decathlon?"
 ]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": []
 }
 ],
 "metadata": {
%% Cell type:markdown id: tags:
This notebook can be run on mybinder: [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/git/https%3A%2F%2Fgricad-gitlab.univ-grenoble-alpes.fr%2Fai-courses%2Fautonomous_systems_ml/HEAD?filepath=notebooks%2F5_principal_component_analysis)
%% Cell type:markdown id: tags:
# Olympic decathlon data
This example is a short introduction to PCA. The data are performance marks on the ten [decathlon events](https://en.wikipedia.org/wiki/Decathlon) for 33 athletes at the 1988 Olympic Games.
## Exercise
In the following, the questions to answer (exercises) are indicated by a **Q** mark.
%% Cell type:markdown id: tags:
The code cell below defines some useful functions to display summary statistics of the PCA representation.
%% Cell type:code id: tags:
``` python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams.update({'font.size': 18})


def scree_plot(pca):
    """Plot the variance explained by each principal component."""
    PC_values = np.arange(pca.n_components_) + 1
    PC_labels = ['PC' + str(nb + 1) for nb in range(pca.n_components_)]
    plt.figure(figsize=(8, 6))
    plt.bar(PC_values, pca.explained_variance_, linewidth=2, edgecolor='k')
    plt.xticks(ticks=PC_values, labels=PC_labels)
    plt.title('Scree Plot')
    plt.xlabel('Principal Components')
    plt.ylabel('Explained Variance')
    plt.show()


def pca_summary(pca, X, out=True):
    """Display a table of the explained standard deviation, the proportion
    of variance, and the cumulative proportion for each component.
    """
    names = ["PC" + str(i) for i in range(1, len(pca.explained_variance_ratio_) + 1)]
    a = list(np.std(pca.transform(X), axis=0, ddof=1))
    b = list(pca.explained_variance_ratio_)
    c = [np.sum(pca.explained_variance_ratio_[:i])
         for i in range(1, len(pca.explained_variance_ratio_) + 1)]
    columns = pd.MultiIndex.from_tuples([("sdev", "Standard deviation"),
                                         ("varprop", "Proportion of Variance"),
                                         ("cumprop", "Cumulative Proportion")])
    summary = pd.DataFrame(list(zip(a, b, c)), index=names, columns=columns)
    if out:
        print("Importance of components:")
        display(summary)
    return summary


def biplot2D(score, coeff, labels=None):
    """Generate a biplot for the first two principal components
    to display both the scores (samples) and the variables (loadings).
    """
    xs = score[:, 0]  # projection on PC1
    ys = score[:, 1]  # projection on PC2
    p = coeff.shape[1]  # number of variables
    n = score.shape[0]  # number of samples
    fig, ax = plt.subplots(figsize=(10, 8))
    # plot the scores with the index of each sample
    ax.scatter(xs, ys, marker=".", color='k')
    for i in range(n):
        ax.text(xs[i], ys[i], str(i), color='k')
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    # plot the variable vectors (arrows) in the PC plane (loadings)
    arrow_sc = 1.15
    color = 'tab:red'
    ax2 = ax.twinx()  # second axes sharing the x-axis, with its own y scale
    ax2.set_ylim(-1.2, 1.2)
    ax2.tick_params(axis='y', labelcolor=color)
    ax2 = ax2.twiny()  # ... and sharing the y-axis, with its own x scale
    ax2.set_xlim(-1.2, 1.2)
    ax2.tick_params(axis='x', labelcolor=color)
    for i in range(p):
        ax2.arrow(0, 0, coeff[0, i], coeff[1, i], color=color, alpha=0.5,
                  linestyle='-', linewidth=1.5, head_width=0.02, head_length=0.02)
        name = "Var" + str(i + 1) if labels is None else labels[i]
        ax2.text(coeff[0, i] * arrow_sc, coeff[1, i] * arrow_sc, name,
                 color=color, ha='center', va='center')
```
%% Cell type:markdown id: tags:
## Dataset
#### Load the olympic dataset contained in the text file `olympic.csv` (*Warning:* copy this data file into the notebook directory)
%% Cell type:code id: tags:
``` python
import pandas as pd
import numpy as np
# load the dataset
olympic = pd.read_csv('olympic.csv', sep=',', header=0)
olympic.head()  # data overview: variable names and first rows
```
%%%% Output: execute_result
100 long poid haut 400 110 disq perc jave 1500
0 11.25 7.43 15.48 2.27 48.90 15.13 49.28 4.7 61.32 268.95
1 10.87 7.45 14.97 1.97 47.71 14.46 44.36 5.1 61.76 273.02
2 11.18 7.44 14.20 1.97 48.29 14.81 43.66 5.2 64.16 263.20
3 10.62 7.38 15.02 2.03 49.06 14.72 44.80 4.9 64.04 285.11
4 11.02 7.43 12.92 1.97 47.44 14.40 41.20 5.2 57.46 256.64
%% Cell type:markdown id: tags:
#### Display some descriptive statistics for this dataset
%% Cell type:code id: tags:
``` python
olympic.describe()
```
%%%% Output: execute_result
100 long poid haut 400 110 \
count 33.000000 33.000000 33.000000 33.000000 33.000000 33.000000
mean 11.196364 7.133333 13.976364 1.982727 49.276667 15.048788
std 0.243321 0.304340 1.331991 0.093984 1.069660 0.506765
min 10.620000 6.220000 10.270000 1.790000 47.440000 14.180000
25% 11.020000 7.000000 13.150000 1.940000 48.340000 14.720000
50% 11.180000 7.090000 14.120000 1.970000 49.150000 15.000000
75% 11.430000 7.370000 14.970000 2.030000 49.980000 15.380000
max 11.570000 7.720000 16.600000 2.270000 51.280000 16.200000
disq perc jave 1500
count 33.000000 33.000000 33.000000 33.000000
mean 42.353939 4.739394 59.438788 276.038485
std 3.719131 0.334421 5.495998 13.657098
min 34.360000 4.000000 49.520000 256.640000
25% 39.080000 4.600000 55.420000 266.420000
50% 42.320000 4.700000 59.480000 272.060000
75% 44.800000 4.900000 64.000000 286.040000
max 50.660000 5.700000 72.600000 303.170000
%% Cell type:markdown id: tags:
**Q:** Based on the summary statistics above, can you find the units for the scores of the *running* events, and for the *jumping* or *throwing* ones?
%% Cell type:markdown id: tags:
## Part I. PCA on raw data
%% Cell type:markdown id: tags:
Perform *PCA* on the decathlon event score data $X \in \mathbb{R}^{n \times p}$: $n=33$ samples (athletes), $p=10$ variables/features (decathlon events).
%% Cell type:code id: tags:
``` python
from sklearn.decomposition import PCA
pca = PCA()
olympic_pc = pca.fit_transform(olympic)  # fit the PCA and get the principal component scores
```
%% Cell type:markdown id: tags:
What does the distribution of the component variances/eigenvalues $\lambda_i^2$, $1 \le i \le p$, look like? Let's visualize the **scree plot**.
%% Cell type:code id: tags:
``` python
scree_plot(pca)
```
%%%% Output: display_data
[Hidden Image Output]
%% Cell type:markdown id: tags:
#### Display a summary of the PCA representation
%% Cell type:code id: tags:
``` python
pca_summary(pca, olympic);
```
%%%% Output: stream
Importance of components:
%%%% Output: display_data
%% Cell type:markdown id: tags:
**Q:** Does it seem possible to reduce the dimension of this dataset using PCA? If so, how many components do you think are sufficient to explain these olympic data (justify)?
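For instance (a minimal sketch, assuming the usual 95% threshold), we can count the number of components needed to reach a given fraction of explained variance:
%% Cell type:code id: tags:
``` python
# number of components needed to explain at least 95% of the variance
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_95 = np.argmax(cumvar >= 0.95) + 1
print('{} component(s) explain {:.1%} of the variance'.format(n_95, cumvar[n_95 - 1]))
```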
%% Cell type:markdown id: tags:
#### Display the biplot
*Recall that the entries of a particular principal component vector are called the __loadings__: they describe the contribution of each variable to that principal component, i.e. its weight. Large loading values (positive or negative) indicate that a particular variable has a strong relationship with that principal component. The sign of a loading also indicates whether a variable and a principal component are positively or negatively correlated.*
The *biplot* is a useful visualisation tool that gives a graphical summary of both the samples (athletes), in terms of *scores*, and the variables/features, in terms of *loadings*.
%% Cell type:code id: tags:
``` python
# Call the function, using only the first 2 PCs
biplot2D(olympic_pc[:,0:2], pca.components_[0:2, :], olympic.columns)
plt.show()
```
%%%% Output: display_data
[Hidden Image Output]
%% Cell type:markdown id: tags:
**Q:** From the plot above,
- What are the essential loadings for the first principal component `PC1`? Hence, what is the meaning of `PC1`? Same question for `PC2`.
- How can we interpret the score of athlete 9? Same question for athletes 23 and 16.
%% Cell type:markdown id: tags:
We can now compare the athletes' `1500` event marks with their scores on the first component.
%% Cell type:code id: tags:
``` python
# display the average mark for 1500 event
print('Average 1500 event mark (seconds) == {:.2f}\n'.format(olympic['1500'].mean()) )
# display the table of 1500 marks and PC1 score for all the athletes
pd.DataFrame(list(zip(olympic['1500'],olympic_pc[:,0])), columns=['1500', 'score PC1']).transpose()
```
%%%% Output: stream
Average 1500 event mark (seconds) == 276.04
%%%% Output: execute_result
0 1 2 3 4 \
1500 268.950000 273.020000 263.200000 285.110000 256.640000
score PC1 -6.073747 -2.678254 -12.350157 9.526886 -19.567241
5 6 7 8 9 ... \
1500 274.070000 291.200000 265.860000 269.620000 292.240000 ...
score PC1 -2.312197 15.048108 -10.023305 -5.976738 16.717007 ...
23 24 25 26 27 \
1500 277.840000 285.570000 270.07000 261.900000 303.170000
score PC1 1.098877 8.574457 -6.79812 -15.095428 27.202497
28 29 30 31 32
1500 272.060000 293.850000 294.990000 293.720000 269.980000
score PC1 -4.895252 17.291469 19.234705 16.965757 -7.217652
[2 rows x 33 columns]
%% Cell type:markdown id: tags:
We can check in the table above that **faster** runners have a **negative score** on this component, and conversely slower ones have a higher positive score.
We can also plot the athletes' `PC1` scores vs their `1500` event marks.
%% Cell type:code id: tags:
``` python
plt.figure(figsize=(8,6))
plt.plot(olympic['1500'],olympic_pc[:,0],'k.')
for i in range(len(olympic_pc[:,0])):
plt.text(olympic['1500'][i], olympic_pc[i,0], str(i), color = 'k')
plt.xlabel('1500 time (s)')
plt.ylabel('PC1')
```
%%%% Output: execute_result
Text(0, 0.5, 'PC1')
%%%% Output: display_data
[Hidden Image Output]
%% Cell type:markdown id: tags:
Hence the correlation is almost perfect between the `1500` event and the first principal component!
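For instance, this can be quantified with the correlation coefficient (a quick check using `np.corrcoef`):
%% Cell type:code id: tags:
``` python
# correlation between the athletes' 1500 marks and their PC1 scores
r = np.corrcoef(olympic['1500'], olympic_pc[:, 0])[0, 1]
print('Correlation(1500, PC1) = {:.3f}'.format(r))
```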
%% Cell type:markdown id: tags:
Moreover, the previous biplot shows that the *second principal component* is correlated with strength, in the form of a long *javelin* throw.
%% Cell type:code id: tags:
``` python
plt.figure(figsize=(8,6))
plt.plot(olympic['jave'],olympic_pc[:,1],'k.')
for i in range(len(olympic_pc[:,0])):
plt.text(olympic['jave'][i], olympic_pc[i,1], str(i), color = 'k')
plt.xlabel('Javelin throw (m)')
plt.ylabel('PC2')
```
%%%% Output: execute_result
Text(0, 0.5, 'PC2')
%%%% Output: display_data
[Hidden Image Output]
%% Cell type:markdown id: tags:
We can also check in the plot above that stronger throwers have higher values on this second component.
%% Cell type:markdown id: tags:
#### Recap:
We found that two principal components are sufficient to explain most of (i.e., at least 95% of) the variance of the athletes' performance, and that they are almost perfectly correlated with the 1500m and javelin events!
**Q:**
- Why was this result expected just by looking at the descriptive statistics table of the variables displayed at the beginning of the notebook?
- Can we deduce that **only these two events**, namely the *1500*-metre run and the *javelin* throw, are **really significant** to measure the performance of decathlon athletes (justify, *hint: see below*)?
%% Cell type:markdown id: tags:
## Part II. Standardizing: scale matters!
In the previous example, we saw that the first two components were based somewhat on speed and strength. However,
**we did not scale the variables**, so the 1500 has much more weight than the 400, for instance!
We correct this by standardizing the variables with the `sklearn` preprocessing methods.
%% Cell type:code id: tags:
``` python
from sklearn.preprocessing import StandardScaler
# Center the variables and scale them to unit variance
scaler = StandardScaler()
Xs = scaler.fit_transform(olympic)
# Make PCA on the standardized variables
pca_s = PCA()  # keep all the principal components
Xs_pc = pca_s.fit_transform(Xs)  # project the standardized data into the PCA space
```
%% Cell type:markdown id: tags:
Show the new biplot for the standardized variables
%% Cell type:code id: tags:
``` python
# Call the function, using only the first 2 PCs
biplot2D(Xs_pc[:,0:2], pca_s.components_[0:2, :], olympic.columns)
plt.show()
```
%%%% Output: display_data
[Hidden Image Output]
%% Cell type:markdown id: tags:
After standardization, the plot above reinforces our earlier interpretation by grouping the sprint events (such as *100m*, *110m*, *400m*, and *long* jump) along a common axis aligned with the first component.
Likewise, the strength and throwing events (in French: *javelot*, *disque*, *poids*) lie on a separate axis more aligned with the second component (and thus rather decorrelated from the previous one).
%% Cell type:markdown id: tags:
### Display the loadings
%% Cell type:code id: tags:
``` python
pd.DataFrame(pca_s.components_[:2, :].T, columns=['PC1', 'PC2'], index=olympic.columns)
```
%%%% Output: execute_result
PC1 PC2
100 0.415882 0.148808
long -0.394051 -0.152082
poid -0.269106 0.483537
haut -0.212282 0.027898
400 0.355847 0.352160
110 0.433482 0.069568
disq -0.175792 0.503335
perc -0.384082 0.149582
jave -0.179944 0.371957
1500 0.170143 0.420965
%% Cell type:markdown id: tags:
#### Exercise
*Check that PCA is just an eigendecomposition of the sample covariance matrix*:
- compute the sample covariance matrix of the scaled data `Xs`, then its eigenvectors, and check that the dominant eigenvectors (i.e. those associated with the largest eigenvalues) correspond to the loadings of the principal components displayed above; see the sketch below<br>
(*hint 1: the standardized data `Xs` are already centered. hint 2: since the sample covariance is a Hermitian (here, real symmetric) matrix, we can use the function `np.linalg.eigh()` to get its eigendecomposition*)
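A minimal sketch of this check (one possible way among others, following the hints above):
%% Cell type:code id: tags:
``` python
# PCA as an eigendecomposition of the sample covariance matrix
n = Xs.shape[0]
S = Xs.T @ Xs / (n - 1)  # sample covariance of the (already centered) data
# eigh() handles symmetric/Hermitian matrices; eigenvalues come in ascending order
eigval, eigvec = np.linalg.eigh(S)
eigval, eigvec = eigval[::-1], eigvec[:, ::-1]  # reorder in descending order
# eigenvalues should match the variances explained by the principal components
print(np.allclose(eigval, pca_s.explained_variance_))
# eigenvectors match the loadings up to a sign flip, hence the absolute values
print(np.allclose(np.abs(eigvec.T), np.abs(pca_s.components_)))
```
%% Cell type:markdown id: tags: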
**Q:** From the loadings and biplot analysis, answer the following questions:
- *After standardization* of the olympic data, is the `1500` event still the most important to explain the variance?
- In this *standardized* case, what are the overall meanings of the first two principal components? Are the loadings also consistent with these conclusions?
- How can you interpret the fact that the arrows for the jump events point in the opposite direction to the sprint ones?
- How can you interpret the scores of athletes 1, 16, and 32 respectively (weakness/strength)?
- In your opinion, is it better (i.e. more useful) to perform PCA on *standardized* or *non-standardized* data for this example?
*Plot (scree plot) and display the (cumulative) proportion of variance explained by each principal component for the standardized data.*
**Q:**
- Is it still possible to explain most of the variance with a much reduced number of components?
- Why is this rather good news for the sporting interest of the decathlon?
%% Cell type:code id: tags:
``` python
```
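%% Cell type:markdown id: tags:
A minimal sketch for the requested plots, reusing the `scree_plot` and `pca_summary` helpers defined at the beginning of the notebook:
%% Cell type:code id: tags:
``` python
# scree plot and (cumulative) proportion of explained variance
# for the PCA fitted on the standardized data
scree_plot(pca_s)
pca_summary(pca_s, Xs);
```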