@@ -6,5 +6,5 @@ See the notebooks in the [5_principal_component_analysis](https://gricad-gitlab.
1. apply and interpret PCA on the olympic decathlon dataset, [`N1_pca_olympic_data.ipynb`](https://gricad-gitlab.univ-grenoble-alpes.fr/ai-courses/autonomous_systems_ml/-/blob/master/notebooks/5_principal_component_analysis/N1_pca_olympic_data.ipynb)
2. compare PCA with supervised linear discriminant analysis (LDA) on the Iris data set to reduce the dimension, [`N2_pca_versus_lda.ipynb`](https://gricad-gitlab.univ-grenoble-alpes.fr/ai-courses/autonomous_systems_ml/-/blob/master/notebooks/5_principal_component_analysis/N2_pca_versus_lda.ipynb)
<!-- The following notebook is _optional_: -->
3. investigate how a dimension reduction method like PCA can effectively improve the performance of an (SVM) classifier, [`N3_svm_face_recognition.ipynb`](https://gricad-gitlab.univ-grenoble-alpes.fr/ai-courses/autonomous_systems_ml/-/blob/master/notebooks/5_principal_component_analysis/N3_svm_face_recognition.ipynb)
"*Check that PCA is just an eigendecomposition of the sample covariance matrix*: \n",
"\n",
"- compute the sample covariance matrix of the scaled data `Xs`, then its eigenvectors, and check that the dominant eigenvectors (i.e. those associated with the largest eigenvalues) correspond to the loadings of the principal components displayed above.\n",
"\n",
"\n",
"**Q:** From the loadings and biplot analysis, answer the following questions:\n",
"- *After standardization* of the olympic data, is the `1500` event still the most important to explain the variance?\n",
"- In this *standardized* case what are the overall meanings of the first two principal components? Are the loadings also consistent with these conclusions?\n",
...
...
@@ -1149,19 +1150,12 @@
"- How can we interpret the scores of athletes 1, 16 and 32, respectively (weakness/strength)?\n",
"- In your opinion, is it better (i.e. more useful) to perform PCA on *standardized* or *non-standardized* data for this example?\n",
"\n",
"*Plot (screeplot) and display the (cumulative) proportion of explained variance with each principal components for the standardized data.*\n",
"\n",
"**Q:**\n",
"- Is it still possible to explain most of the variance with a much reduced number of components?\n",
"- Why is this rather good news for the sporting interest of the decathlon?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
...
...
%% Cell type:markdown id: tags:
This notebook can be run on mybinder: [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/git/https%3A%2F%2Fgricad-gitlab.univ-grenoble-alpes.fr%2Fai-courses%2Fautonomous_systems_ml/HEAD?filepath=notebooks%2F5_principal_component_analysis)
%% Cell type:markdown id: tags:
# Olympic decathlon data
This example is a short introduction to PCA. The data are the performance marks of 33 athletes on the ten [decathlon events](https://en.wikipedia.org/wiki/Decathlon) at the Olympic Games (1988).
## Exercise
In the following, the questions to answer (exercises) are indicated by a **Q** mark.
%% Cell type:markdown id: tags:
The code cell below defines some useful functions to display summary statistics of the PCA representation.
%%%% Output: stream
...
max     11.570000   7.720000  16.600000   2.270000  51.280000  16.200000

             disq       perc       jave        1500
count   33.000000  33.000000  33.000000   33.000000
mean    42.353939   4.739394  59.438788  276.038485
std      3.719131   0.334421   5.495998   13.657098
min     34.360000   4.000000  49.520000  256.640000
25%     39.080000   4.600000  55.420000  266.420000
50%     42.320000   4.700000  59.480000  272.060000
75%     44.800000   4.900000  64.000000  286.040000
max     50.660000   5.700000  72.600000  303.170000
%% Cell type:markdown id: tags:
**Q:** Based on the summary statistics above, can you find the units used for the scores of the *running* events, and for the *jumping* or *throwing* ones?
%% Cell type:markdown id: tags:
## Part I. PCA on raw data
%% Cell type:markdown id: tags:
Perform *PCA* on the decathlon event scores $X \in \mathbb{R}^{n \times p}$: $n=33$ samples (athletes), $p=10$ variables/features (decathlon events).
%% Cell type:code id: tags:
``` python
from sklearn.decomposition import PCA

pca = PCA()
olympic_pc = pca.fit_transform(olympic)  # get the principal components
```
%% Cell type:markdown id: tags:
What is the distribution of the component variances/eigenvalues $\lambda_i^2$, $1 \le i \le p$? Let's visualize it with the **screeplot**.
%% Cell type:code id: tags:
``` python
scree_plot(pca)
```
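The `scree_plot` helper is defined in a hidden cell earlier in the notebook. In case it is unavailable, here is a minimal stand-in sketch of what such a helper might compute (the function body and the random stand-in data are assumptions, not the notebook's actual implementation):

``` python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs in scripts
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def scree_plot(pca):
    """Bar plot of the component variances (eigenvalues), largest first."""
    var = pca.explained_variance_  # the eigenvalues lambda_i^2
    plt.figure()
    plt.bar(np.arange(1, var.size + 1), var)
    plt.xlabel("Principal component")
    plt.ylabel("Explained variance")
    plt.title("Scree plot")
    return var

# Random stand-in for the decathlon scores (33 athletes x 10 events)
X = np.random.default_rng(0).normal(size=(33, 10))
var = scree_plot(PCA().fit(X))
```

Note that `explained_variance_` is already sorted in decreasing order by `sklearn`, which is exactly the shape a scree plot displays.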
%%%% Output: display_data
[Hidden Image Output]
%% Cell type:markdown id: tags:
#### Display a summary of PCA representation
%% Cell type:code id: tags:
``` python
pca_summary(pca,olympic);
```
%%%% Output: stream
Importance of components:
%%%% Output: display_data
%% Cell type:markdown id: tags:
**Q:** Does it seem possible to reduce the dimension of this dataset using PCA? If so, how many components do you think are sufficient to explain these olympic data (justify)?
%% Cell type:markdown id: tags:
#### Display the biplot
*Recall that the entries for a particular principal component vector are called the __loadings__: they describe the contribution of each variable to that principal component, or its weight. Large loading values (positive or negative) indicate that a particular variable has a strong relationship with a particular principal component. The sign of a loading also indicates whether a variable and a principal component are positively or negatively correlated.*
The *biplot* is a useful visualisation tool that gives a graphical summary of both the samples (athletes), in terms of *scores*, and the variables (events), in terms of *loadings*.
We can check on the table above that **faster** runners have **negative score** on this component, and conversely slower ones have higher positive score.
We can also plot the athletes' `PC1 scores` against their `1500` event marks.
We can also check in the plot above that stronger throwers will have higher value on this second component.
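The near-perfect relationship between `PC1` and a dominant-variance variable can also be checked numerically. A sketch with random stand-in data (in the notebook, `olympic_pc[:, 0]` and the `1500` column play these roles; the dominant column here is an assumption mimicking the unscaled *1500* marks):

``` python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: one column with much larger variance than the others,
# mimicking the unscaled `1500` event among the decathlon scores
rng = np.random.default_rng(0)
X = rng.normal(size=(33, 10))
X[:, 0] *= 100.0  # dominant-variance column

pc = PCA().fit_transform(X)

# PC1 is almost perfectly (anti-)correlated with the dominant variable
r = np.corrcoef(pc[:, 0], X[:, 0])[0, 1]
print(round(abs(r), 3))
```

The sign of the correlation is arbitrary (PCA components are defined up to a sign flip), which is why only its magnitude is meaningful.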
%% Cell type:markdown id: tags:
#### Recap:
We found that two principal components are sufficient to explain most of the variance (i.e., at least 95%) of the athletes' performances, and that they are almost perfectly correlated with the *1500* and *javelin* events!
**Q:**
- Why was this result expected just by looking at the descriptive statistics table of the variables displayed at the beginning of the notebook?
- Can we deduce that **only these two events**, namely the *1500*-metre run and *javelin* throw, are **really significant** to measure the performance of decathlon athletes (justify, *hint: see below*)?
%% Cell type:markdown id: tags:
## Part II. Standardizing: scale matters!
In the previous example, we saw that the first two components were somewhat based on speed and strength. However,
**we did not scale the variables**, so the *1500* has much more weight than the *400*, for instance!
We correct this by standardizing the variables with `sklearn` preprocessing methods.
%% Cell type:code id: tags:
``` python
from sklearn.preprocessing import StandardScaler

# Center and scale the variables (zero mean, unit variance)
scaler = StandardScaler()
Xs = scaler.fit_transform(olympic)

# Perform PCA on the standardized variables
pca_s = PCA()
Xs_pc = pca_s.fit_transform(Xs)  # project the data onto the principal axes
```
%% Cell type:markdown id: tags:
Show the new biplot for the standardized variables
After standardization, the biplot above reinforces our earlier interpretation by grouping the sprint events (*100m*,
*110m*, *400m*, *long*) along a common axis aligned with the first component.
Likewise, the strength and throwing events (in French: *javelot*, *disque*, *poids*) lie on a separate axis mostly aligned with the second component (and thus rather decorrelated from the first one).
*Check that PCA is just an eigendecomposition of the sample covariance matrix*:
- compute the sample covariance matrix of the scaled data `Xs`, then its eigenvectors, and check that the dominant eigenvectors (i.e. those associated with the largest eigenvalues) correspond to the loadings of the principal components displayed above.
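The check above can be sketched as follows (random stand-in data replace the actual `olympic`/`Xs` variables from the notebook):

``` python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Random stand-in for the decathlon scores (33 athletes x 10 events)
X = np.random.default_rng(0).normal(size=(33, 10))
Xs = StandardScaler().fit_transform(X)

pca_s = PCA().fit(Xs)

# Sample covariance matrix (n-1 normalization, matching sklearn's convention)
C = np.cov(Xs, rowvar=False)

# Eigendecomposition of the symmetric matrix C, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Eigenvalues match the component variances, and the dominant eigenvector
# matches the first row of `pca_s.components_` (up to a sign flip)
print(np.allclose(eigvals, pca_s.explained_variance_))
print(np.allclose(np.abs(eigvecs[:, 0]), np.abs(pca_s.components_[0])))
```

The absolute values are compared because an eigenvector is only defined up to its sign, so PCA and the eigendecomposition may disagree on orientation.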
**Q:** From the loadings and biplot analysis, answer the following questions:
- *After standardization* of the olympic data, is the `1500` event still the most important to explain the variance?
- In this *standardized* case what are the overall meanings of the first two principal components? Are the loadings also consistent with these conclusions?
- How can you interpret that the arrows for the jump events are in the opposite direction than the sprint ones?
- How can we interpret the scores of athletes 1, 16 and 32, respectively (weakness/strength)?
- In your opinion, is it better (i.e. more useful) to perform PCA on *standardized* or *non-standardized* data for this example?
*Plot (screeplot) and display the (cumulative) proportion of explained variance with each principal components for the standardized data.*
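The cumulative proportion of explained variance can be computed directly from the fitted PCA object; a minimal sketch, with random stand-in data in place of the notebook's standardized `Xs`:

``` python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Random stand-in for the standardized decathlon scores `Xs`
X = np.random.default_rng(1).normal(size=(33, 10))
Xs = StandardScaler().fit_transform(X)
pca_s = PCA().fit(Xs)

# Proportion and cumulative proportion of explained variance per component
prop = pca_s.explained_variance_ratio_
cum = np.cumsum(prop)
for k in range(prop.size):
    print(f"PC{k + 1}: {prop[k]:.3f} (cumulative: {cum[k]:.3f})")
```

With all components kept, the ratios sum to 1, so the last cumulative value always reaches 100% of the variance.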
**Q:**
- Is it still possible to explain most of the variance with a much reduced number of components?
- Why is this rather good news for the sporting interest of the decathlon?