Commit 0b4b3015 authored by Laurence Viry

modification Multidim

parent d47e043c
......@@ -456,7 +456,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Correlation\n",
......@@ -510,7 +512,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Center and reduce the data temperat\n",
......@@ -663,7 +667,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Analyse en composantes principales\n",
......@@ -1456,30 +1462,37 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Management of missing data in ACP"
"## Handling of missing data in ACP\n",
"[PCA with missing data usinf MissMDA R Package](https://www.youtube.com/watch?v=OOM8_FH6_8o) <br\\>\n",
"[Methodology on the treatment of missing data](https://www.youtube.com/watch?v=hQ6tDtgotx0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Autres méthodes d'analyse multidimentionnelles\n",
"## Analyse factorielle des correspondances \n",
"### Données et objectifs\n",
"# Other methods of multidimensional analysis\n",
"[Approach in multidimensional data analysis](http://math.agrocampus-ouest.fr/infoglueDeliverLive/membres/Francois.Husson/Rcorner) <br\\>\n",
"<br\\>\n",
"<img src=\"../../figures/demarcheAD.jpg\",width=\"80%\",height=\"80%\">\n",
"\n",
"## Analyse des correspondances multiples \n",
"### Données et objectifs\n",
"[Approach in multidimensional data analysis](http://math.agrocampus-ouest.fr/infoglueDeliverLive/membres/Francois.Husson/Rcorner)\n",
"## Correspondence Analysis\n",
"### Data et objectifs\n",
"The main point of **correspondence analysis** is studying the **links between pairs of qualitative variables**. This really means looking at the difference between the given data, and what it would be like if the variables were independent. We're therefore going to see how the analysis captures deviation from independence. Our reasoning will mainly be geometrical, creating point clouds for the rows and point clouds for the columns. Projecting these clouds onto planes will give some useful representations. \n",
"\n",
"## Classification \n",
"### Données et objectifs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analyse Factorielle Multiple \n",
"### Données et objectifs"
"## Multiple Correspondence Analysis\n",
"### Data et objectifs\n",
"In the MCA context, we have a point cloud of individuals, and a point clouds of categories. We see how to visualize the point cloud of individuals, and how to interpret it using the categories and how to directly visualize the point cloud of categories. The point cloud of\n",
"individuals, and that of the categories, can be shown simultaneously on the same graph. This is called the simultaneous representation of the point clouds. \n",
"\n",
"## Multiple Factor Analysis\n",
"### Data et objectifs <br\\>\n",
"Method to study more complex data tables, where a group of individuals is\n",
"characterized by variables structured as groups, and possibly coming from different information sources. The interest in the method is due to it being able to analyze a data table as a whole, but also its ability to compare information provided by the various information sources.<br\\>\n",
"<br\\>\n",
"[MOOC AgroCampus Ouest Exploratory Multivariate Data Analysis](https://www.fun-mooc.fr/courses/course-v1:agrocampusouest+40001S04EN+session04/info)<br\\>\n",
"[Courses AgroCampus Ouest F. Husson](http://math.agrocampus-ouest.fr/infoglueDeliverLive/membres/Francois.Husson/teaching)"
]
},
{
......
%% Cell type:markdown id: tags:
 
This course is inspired by the MOOC (Massive Open Online Course) "[Exploratory Multivariate Data Analysis](https://www.fun-mooc.fr/courses/course-v1%3Aagrocampusouest%2B40001S04EN%2Bsession04/about)" (the first session in English was in 2017) from the platform FUN. <br\>
(Multivariate Multidimensional Data Analysis, Applied Mathematics Department of Agrocampus Ouest, Rennes: F. Husson, J. Pagès, M. Houée-Bigot)
 
The 2nd edition of the MOOC will start on the 5th of March 2018; you can register until April 20.
 
French version: [Analyse de données multidimensionnelles](https://www.fun-mooc.fr/courses/course-v1:agrocampusouest+40001S04+session04/about)
 
%% Cell type:markdown id: tags:
 
# Introduction - Multivariate analysis
In many applications we observe **p** variables on **n** individuals (<FONT color="#B40404">p and n can both be large).</FONT><br\>
<br\>
Databases are becoming larger and larger, both in the number of individuals and in the number of variables measured on them. Studying each variable, and each pair of variables, with classical descriptive statistics is indispensable but not sufficient.<br\>
<br\>
The **multidimensional exploratory** methods allow:
* to take into account *the simultaneous variations* of a larger number of variables,
* to synthesize and / or simplify *the underlying structures*.
 
How to perform a multiple factor analysis that handles several groups of continuous and/or categorical variables and/or contingency tables? And how can we improve the graphs obtained by the method?<br\>
<br\>
[Tutorial F. Husson](http://math.agrocampus-ouest.fr/infoglueDeliverLive/membres/Francois.Husson/Rcorner)
 
1. Are there groups of variables? If so, use Multiple Factor Analysis (function <FONT color="#B40404">MFA</FONT> in FactoMineR). <br\>
<br\>
2. What is the type of information?
* Contingency table -> Correspondence analysis (<FONT color="#B40404">AFC</FONT>, or <FONT color="#B40404">AFMTC</FONT> if there are several tables)
* Several contingency tables -> <FONT color="#B40404">AFMTC </FONT>
* "Individuals x variables" table -> Principal Component Analysis (<FONT color="#B40404">PCA</FONT>), Multiple Factor Analysis (<FONT color="#B40404">MFA</FONT>). <br\>
<br\>
3. What are the active elements, that is, the elements that take part in the construction of the axes?<br\>
<br\>
4. What are the supplementary elements? They do not take part in the construction of the axes but are useful for interpretation.<br\>
<br\>
5. What is the nature of the active variables?
* **Quantitative variables**: Principal Component Analysis (<FONT color="#B40404">PCA</FONT>)
* **Qualitative Variables**: Multiple Correspondence Analysis (<FONT color="#B40404">MCA</FONT>)
* **Mixed variables**: Factor Analysis of Mixed Data (<FONT color="#B40404">AFDM</FONT>, FAMD in English)
<br\>
(*Whatever the method, the supplementary variables can be of either type, quantitative or qualitative.*)<br\>
<br\>
 
6. Should we **standardize** (reduce) the quantitative variables?<br\>
<br\>
7. Are there any **missing data**? How should they be handled?<br\>
<br\>
8. The *steps of the analysis*<br\>
* Start the factor analysis.<br\>
<br\>
* Describe the factorial axes by the active initial variables (<FONT color="#B40404">dimdesc</FONT>)<br\>
<br\>
* It may be useful to apply a clustering method to identify groups of individuals (<FONT color="#B40404">HCPC</FONT>)<br\>
<br\>
<img src="../../figures/MultiFactorielAnalysis.jpg",width="80%",height="80%">
 
%% Cell type:markdown id: tags:
 
To learn more, see [this video by F. Husson](https://www.youtube.com/watch?v=UrS00sOpeec) (in French).
 
In this course, we present only how to analyze tables with quantitative variables using **principal components analysis** (PCA) and how to use the method with **FactoMineR** in **R**.
 
# Principal component analysis
 
## Introduction
The aim of PCA is to summarize a table of individuals x variables data, where the variables are quantitative.

PCA makes it possible to study the similarities between individuals with respect to a group of variables, and to bring out profiles of individuals.

It also gives an overview of the linear relationships between variables, based on their correlation coefficients.

The two studies are related: individuals or groups of individuals can be characterized by the variables, and the links between variables can be illustrated by characteristic individuals.
 
%% Cell type:markdown id: tags:
 
## Data - practicalities
### Which kinds of data
We have **p variables** <FONT color="#013ADF">$X^{1},X^{2},\ldots,X^{p}$</FONT> observed
on **n individuals**.
 
Principal component analysis, also known as <FONT color="#B40404">PCA</FONT>, applies to data tables where the **rows** can be considered as individuals and the **columns** as **quantitative** variables. <br\>
 
### Data table
We denote by <FONT color="#013ADF">$x^{j}_{i}$</FONT> the observation of the variable <FONT color="#013ADF">$X^{j}$</FONT> on the <FONT color="#013ADF">ith</FONT> individual.
 
<table style="width:60%">
<tr>
<th>
$$\begin{aligned}
X= \left[\begin{array}{ccc}
x_{1}^{1} & \dots & x_{1}^{p}\\
\vdots & \ddots & \vdots\\
x_{n}^{1} & \dots & x_{n}^{p}
\end{array}\right] & \quad n \quad \mbox{individuals} \\
p \quad \mbox{variables} \end{aligned}$$
</th>
<th>
$$\bar{x^{j}}=\frac{1}{n}\sum_{i=1}^{n} x^{j}_{i}$$
$$\sigma^{j}=\sqrt{\frac{1}{n}\sum_{i=1}^{n} (x^{j}_{i}-\bar{x^{j}})^2}$$
</th>
</tr>
</table>
 
 
The data table can be analyzed through its **rows** (individuals) or through its **columns** (variables).<br\>
 
<FONT color="#013ADF">$X^{j}$</FONT>$\,= \, (X^{j}_{1},\ldots,X^{j}_{n}) \quad \mbox{variable} \quad j, \quad \mbox{in} \quad \mathcal{R}^{n}$
<FONT color="#013ADF">$X_{i}$</FONT>$\, = \, (X^{1}_{i},\ldots,X^{p}_{i}) \quad\mbox{individual} \quad i, \quad \mbox{in} \quad \mathcal{R}^{p}$
<br\>
<br\>
<figure>
<img src="../../figures/TwoClouds.jpg",width="40%",height="20%">
<figcaption> <br><em>Exploratory Multivariate Data Analysis</em> (<a href="https://www.fun-mooc.fr/courses/course-v1:agrocampusouest+40001S04EN+session04/info">MOOC AgroCampus Ouest</a>)
</figcaption>
</figure>
 
%% Cell type:markdown id: tags:
 
### Problems and objectives
#### Studying individuals
* When can we say that two individuals are similar with respect to all the variables or a group of variables?
* If there are many individuals, is it possible to categorize them? <br\>
<br\>
 
<div style="background-color:#F2F5A9;Orange;border-style: solid ;border-color: black;border-width: 2px;padding:1%">
groups of individuals<br\>
partitions between them
</div>
 
%% Cell type:markdown id: tags:
 
#### Studying variables
* The correlation matrix gives a simple indication of the linear relationship
between each pair of variables.
* Look for similarities between all the variables or a group of variables.
* Look for synthetic indicators that summarize groups of
variables.
 
<div style="background-color:#F2F5A9;Orange;border-style: solid ;border-color: black;border-width: 2px;padding:1%">visualization of the correlation matrix <br\> find a small number of synthetic variables to summarize many variables
</div>
 
#### Links between the two points-of-view
* Characterize groups of individuals using variables.
* Use typical individuals to interpret groups of variables.
 
%% Cell type:markdown id: tags:
 
### Some examples
 
Data tables, with individuals in rows and variables in columns, can be found in many different
areas, which means that we can perform PCA on quite a diverse range of data sets.
 
* **Sensory analysis**: score of descriptor k for product i
* **Ecology**: concentration of pollutant k in river i
* **Economy**: value of indicator k for year i
* **Genetics**: expression of gene k for patient i
* **Biology**: measurement k for animal i
* **Marketing**: value of satisfaction index k for brand i
* **Sociology**: time spent on activity k by the individuals of socio-professional category (CSP) i
* $\ldots$
 
%% Cell type:markdown id: tags:
 
### Example: Climate of different European countries
<br\>
To illustrate this course, we will use **temperature data** to analyze the climate of different European countries.
 
#### Description of the data:
 
* 35 individuals (rows): European cities
* 17 variables (columns):
    - 12 average monthly temperatures (over 30 years)
    - 2 geographical variables (latitude and longitude of each city)
    - the annual average temperature and the thermal amplitude
    - 1 qualitative variable: the region of Europe the city belongs to (north, south, east or west)
 
#### Data extract
<table style="width:100%">
<tr>
<th> Town</th>
<th>Janv</th>
<th>Fév </th>
<th>... </th>
<th>Nov </th>
<th>Déc </th>
<th>Moy </th>
<th>Amp </th>
<th>Lat </th>
<th>Lon </th>
<th>Rég</th>
</tr>
<tr>
<td>Amsterdam </td>
<td>2.9 </td>
<td>2.5 </td>
<td>... </td>
<td>7.0 </td>
<td>4.4 </td>
<td>9.9 </td>
<td>14.6 </td>
<td>52.2 </td>
<td>4.5 </td>
<td>Ouest
</td>
</tr>
<tr>
<td>Athènes </td>
<td>9.1 </td>
<td>9.7 </td>
<td>... </td>
<td>14.6</td>
<td>11.0</td>
<td>17.8 </td>
<td>18.3 </td>
<td>37.6</td>
<td>23.5</td>
<td>Sud </td>
</tr>
<tr>
<td> Berlin </td>
<td>-0.2</td>
<td>0.1</td>
<td>...</td>
<td>4.2</td>
<td>1.2</td>
<td>9.1</td>
<td>18.5 </td>
<td>52.3 </td>
<td>13.2</td>
<td>Ouest</td>
</tr>
<tr>
<td>Helsinki</td>
<td>-5.8</td>
<td>-5.0</td>
<td>...</td>
<td>0.1</td>
<td>-2.3</td>
<td>4.8</td>
<td>23.4</td>
<td>60.1</td>
<td>25.0</td>
<td>Nord</td>
</tr>
<tr>
<td>Kiev</td>
<td>-5.9</td>
<td>-5.0</td>
<td>...</td>
<td>1.2</td>
<td>-3.6</td>
<td>7.1</td>
<td>25.3</td>
<td>50.3</td>
<td>30.3</td>
<td>Est</td>
</tr>
<tr>
<td> Copenhague</td>
<td>-0.4 </td>
<td>-0.4</td>
<td>...</td>
<td>4.1</td>
<td>1.3</td>
<td>7.8</td>
<td>17.5</td>
<td>55.4</td>
<td>12.3</td>
<td>Nord</td>
</tr>
<tr>
<td> Budapest </td>
<td>-1.1</td>
<td>0.8</td>
<td>...</td>
<td>5.1 </td>
<td>0.7</td>
<td>10.9</td>
<td>23.1</td>
<td>47.3</td>
<td>19.0</td>
<td>Est</td>
</tr>
<tr>
<td> Bruxelles</td>
<td>3.3</td>
<td>3.3</td>
<td>...</td>
<td>6.7</td>
<td>4.4</td>
<td>10.3</td>
<td>14.4</td>
<td>50.5</td>
<td>4.2</td>
<td>Ouest</td>
</tr>
<tr>
<td> ...</td>
<td></td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
</table>
#### Read data
 
%% Cell type:code id: tags:
 
``` R
load("./data/temperat.RData")
```
 
%% Cell type:markdown id: tags:
 
#### Descriptive statistics
 
%% Cell type:code id: tags:
 
``` R
# Descriptive statistics
dim(temperat)
summary(temperat)
```
 
%%%% Output: display_data
 
1. 35
2. 17
 
%%%% Output: display_data
 
 
%% Cell type:code id: tags:
 
``` R
# Scatterplot matrix of the 12 monthly temperatures
pairs(temperat[,1:12])
```
 
%%%% Output: display_data
 
[Hidden Image Output]
 
%% Cell type:code id: tags:
 
``` R
# Correlation
cor(temperat[,1:12])
```
 
%% Cell type:markdown id: tags:
 
## Method
### The cloud of individuals
 
1 **individual** = 1 **row** of the table
= 1 **point** in a space of dimension **p** (number of variables)<br\>
 
* p=1 => points on a straight line<br\>
* p=2 => points in a plane<br\>
* p=3 => points in a 3D space, more difficult to represent<br\>
* p=4 and more => impossible to represent<br\>
 
<div style="background-color:#F2F5A9;Orange;border-style: solid ;border-color: black;border-width: 2px;padding:1%">
**Concept of resemblance between two individuals**
$$ \|X_{i} - X_{i^{'}}\|^{2}=\sum_{j=1}^{p}(x^{j}_{i} - x^{j}_{i^{'}})^{2}$$
*This distance can also be measured with another metric*.
</div>
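
As a minimal sketch of this squared distance, assuming a small made-up table `X` (three hypothetical individuals `A`, `B`, `C` and two variables, not the course data), the term-by-term computation can be compared with base R's `dist`:

``` R
# Toy table (made-up values): 3 individuals in rows, 2 variables in columns
X <- matrix(c(2, 4,
              3, 1,
              6, 5),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("v1", "v2")))

# Squared Euclidean distance between individuals A and B, term by term
d2_manual <- sum((X["A", ] - X["B", ])^2)

# Same quantity from the distance matrix returned by dist()
d2_dist <- as.matrix(dist(X, method = "euclidean"))["A", "B"]^2

d2_manual
d2_dist
```

Changing the `method` argument of `dist` is one simple way to try another metric.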
 
 
%% Cell type:markdown id: tags:
 
#### Centering – standardizing data
 
* *Center the variables*:<br\>
Centering translates the cloud without distorting it <br\>
$$\tilde{x^{j}_{i}} = x^{j}_{i} -\bar{x^{j}} \quad (\bar{x^{j}} \mbox{ is the mean of}\, X^{j})$$
=> The origin is always placed at the *center of gravity* of the cloud<br\>
<br\>
* *Center and standardize (reduce) the variables* (standardized PCA):<br\>
<br\>
<div style="background-color:#F2F5A9;Orange;border-style: solid ;border-color: black;border-width: 2px;padding:1%">
$\sigma^{j}$ is the standard deviation of $X^{j}$<br\>
- Standardizing is needed when the variables are not expressed in the same units.<br\>
- Not standardizing gives more weight to the variables with the largest variability.<br\>
$$\frac{x^{j}_{i} -\bar{x^{j}}}{\sigma^{j}}\quad j=1\dots p$$
</div>
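
The two transformations can be sketched on a made-up table, assuming (this is an assumption of the sketch) that the standard deviation uses a 1/n denominator. Note that base R's `scale()` divides by n-1 instead, so its output differs by a constant factor.

``` R
# Toy table (made-up values): 4 individuals, 2 variables in different units
X <- matrix(c(10, 200,
              12, 180,
              14, 260,
              16, 240),
            nrow = 4, byrow = TRUE)
n <- nrow(X)

# Centering: subtract the column means
Xc <- sweep(X, 2, colMeans(X), "-")

# Standard deviations with a 1/n denominator (assumption of this sketch)
sigma <- sqrt(colSums(Xc^2) / n)

# Centering and standardizing
Z <- sweep(Xc, 2, sigma, "/")

colMeans(Z)        # each column now has mean 0
colSums(Z^2) / n   # and variance 1
```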
 
%% Cell type:code id: tags:
 
``` R
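# Center and standardize the 16 quantitative columns of temperat.
# scale() returns a matrix, so convert it back to a data frame (row names are kept).
temp_standard <- as.data.frame(scale(temperat[, 1:16], center = TRUE, scale = TRUE))
summary(temp_standard)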
```
 
%% Cell type:markdown id: tags:
 
#### Minimal deformation of the cloud
The dimension of the cloud of individuals is reduced by orthogonal projection onto an
affine subspace $\mathcal{H}$. The subspace $\mathcal{H}$ is chosen so as to minimize the deformation of the cloud caused by the projection.<br\>
* *Fitting the cloud of individuals*: how to find the best approximate image of the cloud.
- Find the axis (O,$u_1$), or factor, that deforms the cloud as little as possible<br\>
<br\>
<table style="width:60%">
<tr>
<th>
<img src="../../figures/Ajustement1D.jpg",width="90%",height="90%">
</th>
<th>
$\sum_i (iH_i)^2$ minimum with $H_i \in \mbox{axis}$ <br\>
<br\>
$\sum_i (OH_i)^2$ maximum (Pythagoras) <br\>
</th>
</tr>
</table>
<br\>
- Find the best plane, the one that maximizes $\sum_i(OH_i)^2$. The best plane contains the best first axis ($u_1$). We look for $u_2$ such that $u_1\perp u_2$ and $u_2$ maximizes $\sum_i(OH_i)^2$. <br\>
<br\>
* *Inertia of the cloud of individuals*
$$ I = \sum_{i=1}^{n}m_{i} \|X_{i}\|^{2} $$
$m_{i}$ : weight associated with individual $i$ <br\>
<br\>
 
* Inertia of the cloud of individuals around $\mathcal{H}$
$$ J_{\mathcal{H}} = \sum_{i=1}^{n}m_{i} \|X_{i} - X_{i}^{*}\|^{2}\quad \mbox{measures the deformation of the cloud}$$
($X_{i}^{*}$ is the orthogonal projection of $X_{i}$ onto $\mathcal{H}$)
<center>
$\Longrightarrow$ We need to minimize $ J_{\mathcal{H}}$
</center>
<br\>
* Inertia of the projected cloud
$$ I_{\mathcal{H}} = \sum_{i=1}^{n}m_{i} \|X_{i}^{*}\|^{2}$$
<br\>
<div style="background-color:#F2F5A9;Orange;border-style: solid ;border-color: black;border-width: 2px;padding:1%">
$$ I = J_{\mathcal{H}} + I_{\mathcal{H}} \quad \mbox{(Pythagoras)}$$
<br\>
<center>
**Minimizing** $J_{\mathcal{H}}$ $\Longleftrightarrow$ **Maximizing** $I_{\mathcal{H}}$
</center>
</div>
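
The decomposition can be checked numerically. The sketch below assumes a made-up centered cloud, uniform weights $m_i = 1/n$ and an arbitrary direction `u` for the axis; none of these values come from the course data.

``` R
# Toy centered cloud (made-up values), uniform weights m_i = 1/n
X <- matrix(c( 2,  1,
              -1, -2,
               1,  2,
              -2, -1),
            nrow = 4, byrow = TRUE)
n <- nrow(X)
m <- rep(1 / n, n)

# An arbitrary unit vector spanning the axis H (any direction would do)
u <- c(1, 1) / sqrt(2)

# Orthogonal projection of each individual onto H
X_star <- (X %*% u) %*% t(u)

I_total <- sum(m * rowSums(X^2))             # total inertia I
I_H     <- sum(m * rowSums(X_star^2))        # inertia of the projected cloud
J_H     <- sum(m * rowSums((X - X_star)^2))  # deformation caused by projecting

c(I = I_total, I_H = I_H, J_H = J_H)
```

Since `I_total` does not depend on `u`, any direction that increases `I_H` decreases `J_H` by the same amount.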
 
%% Cell type:markdown id: tags:
 
#### Determination of $\mathcal{H}_k$
$\mathcal{H}_k$ is an affine **subspace of dimension k** obtained by minimizing the deformation of the cloud by projection.
<div style="background-color:#F2F5A9;Orange;border-style: solid ;border-color: black;border-width: 2px;padding:1%">
$$\mathcal{H}_{k} = \underset{\mathcal{H} : dim(\mathcal{H})=k}{\operatorname{argmin}} J_{\mathcal{H}} = \underset{\mathcal{H} : dim(\mathcal{H})=k}{\operatorname{argmax}} I_{\mathcal{H}} $$
</div>
<br\>
 
The search for $\mathcal{H}_k$ can be done **sequentially** (axis by axis).<br\>
<br\>
$$ \Gamma=X^{t} M X\quad \mbox{variance-covariance matrix}$$
<br\>
Here $X$ is the centered table and $M$ the diagonal matrix of the weights $m_{i}$. $\Gamma$ is symmetric and positive semi-definite, hence diagonalizable.
- $\lambda_{1} \ge \ldots \ge \lambda_{p} \ge 0$: eigenvalues of $\Gamma$
- $u_{1}, \ldots, u_{p}$: eigenvectors of $\Gamma$
 
<br\>
* $\mathcal{H}_{1}$ = $(O,u_{1})$ is generated by the first eigenvector of $\Gamma$<br\>
<br\>
* $\mathcal{H}_{2}$ = $(O,u_{1},u_{2})$<br\>
<br\>
* $\mathcal{H}_{k}$ = $(O,u_{1},\ldots,u_{k})$<br\>
<br\>
* The inertia of the cloud projected onto the axis $u_k$ is the k-th eigenvalue of $\Gamma$, associated with the k-th eigenvector $u_k$.<br\>
<br\>
$$I_{u_{k}} = \lambda_{k}$$
<br\>
* Inertia of the cloud on $\mathcal{H}_k$<br\>
<br\>
$$I_{\mathcal{H}_{k}} = \sum_{j=1}^{k} \lambda_{j}$$
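
These relations can be sketched with base R's `eigen`, assuming a made-up table and uniform weights $M = \mbox{diag}(1/n)$ (both are assumptions of this example, not part of the course data):

``` R
# Toy table (made-up values): 5 individuals, 3 variables
X <- matrix(c(1, 2, 0,
              2, 1, 1,
              3, 4, 1,
              4, 3, 3,
              5, 5, 5),
            nrow = 5, byrow = TRUE)
n <- nrow(X)

# Center the table and use uniform weights M = diag(1/n)
Xc <- scale(X, center = TRUE, scale = FALSE)
M  <- diag(1 / n, n)

# Variance-covariance matrix Gamma = X' M X
Gamma <- t(Xc) %*% M %*% Xc

# eigen() returns the eigenvalues sorted in decreasing order
eig    <- eigen(Gamma, symmetric = TRUE)
lambda <- eig$values
u      <- eig$vectors

# Inertia of the cloud projected on the first axis u_1
F1   <- Xc %*% u[, 1]
I_u1 <- sum(diag(M) * F1^2)

# Inertia on H_2: sum of the squared coordinates on u_1 and u_2
I_H2 <- sum(diag(M) * rowSums((Xc %*% u[, 1:2])^2))

c(I_u1 = I_u1, lambda1 = lambda[1])
c(I_H2 = I_H2, sum_lambda = sum(lambda[1:2]))
```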
 
%% Cell type:markdown id: tags:
 
#### Principal axes - Quality of representation
* (G,$u_{k}$): k-th principal axis $$I_k=\lambda_k$$
* Standardized PCA (sum of the variances): $$I=p$$
* Non-standardized PCA: $$I= \sum_{j=1}^{p} \lambda_{j}$$
* Global quality of representation: **share of inertia explained**<br\>
<br\>
- on the k-th principal axis: $$\frac{\lambda_{k}}{I}$$
- on $\mathcal{H}_{k}$: $$\frac{I_{\mathcal{H}_{k}}}{I} = \frac{\sum_{j=1}^{k} \lambda_{j}}{I}$$
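
As a small worked example, assuming hypothetical eigenvalues of a standardized PCA with p = 4 variables (the values are invented for illustration), the shares of inertia are obtained as follows:

``` R
# Hypothetical eigenvalues of a standardized PCA with p = 4 variables
lambda <- c(2.4, 1.0, 0.4, 0.2)

I_total <- sum(lambda)               # equals p for a standardized PCA
pct     <- 100 * lambda / I_total    # share of inertia on each axis (%)
cum_pct <- cumsum(pct)               # share of inertia on H_k (%)

data.frame(axis = seq_along(lambda), eigenvalue = lambda,
           percent = pct, cumulative = cum_pct)
```

Here the first plane $\mathcal{H}_{2}$ carries 85% of the total inertia.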
 
%% Cell type:markdown id: tags:
 
### Principal Component Analysis and R
 
* The base R function <FONT color="#B40404">princomp</FONT> performs a PCA, but it remains basic.<br\>
<br\>
* The <FONT color="#B40404">FactoMineR</FONT> package is an R package dedicated to *multivariate exploratory data analysis*. It is developed and maintained by François Husson, Julie Josse and Sébastien Lê, of Agrocampus Rennes, and J. Mazet.<br\>
<br\>
For the Principal Component Analysis, we use the <FONT color="#B40404">PCA</FONT> function.
<br\>
 
[Learn more](http://math.agrocampus-ouest.fr/infoglueDeliverLive/membres/Francois.Husson/coinR)
 
%% Cell type:markdown id: tags:
 
### Analysis of temperature data
#### Choice of active and illustrative variables
 
- the <FONT color="#21610B">active variables</FONT> are the variables taken into account in the determination of the factorial axes: monthly temperature variables (12 variables).
- <FONT color="#21610B">Quantitative illustrative variables</FONT> : annual average, thermal amplitude.
 
- <FONT color="#21610B">Categorical illustrative variables</FONT> : region.
 
#### Choice of active and illustrative individuals
 
- the <FONT color="#21610B">active individuals</FONT> are the capitals of the countries (rows 1:23), to avoid giving more weight to the countries for which several cities are recorded.
- <FONT color="#21610B">Illustrative individuals</FONT> are the cities associated with rows 24:35 of the data table.
 
#### Questions
 
* Can we summarize monthly temperatures by a small number of factors?
 
* What are the biggest disparities between countries?
 
#### Using FactoMineR
 
The function <FONT color="#B40404">PCA</FONT> performs principal component analysis with supplementary individuals, supplementary quantitative variables and supplementary categorical variables.
Missing values are replaced by the column mean.
 
%% Cell type:code id: tags:
 
``` R
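# Principal component analysis
# User-specific library path; adapt or remove on another machine
.libPaths("/home/viryl/R/lib")
library(FactoMineR)
# Active: rows 1:23 and the 12 monthly temperatures (columns 1:12);
# supplementary: rows 24:35, quantitative columns 13:16, qualitative column 17
temperat.pca <- PCA(temperat, ind.sup = 24:35, quanti.sup = 13:16, quali.sup = 17)
# temperat.pca is an object of class "PCA" and "list"
# attributes(temperat.pca)
# Choose the axes: eigenvalues and percentages of explained inertia
temperat.pca$eig
barplot(temperat.pca$eig[, 2])
round(temperat.pca$eig[, 2], 2)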
```
 
%% Cell type:markdown id: tags:
 
#### Choose the axes
 
%% Cell type:code id: tags:
 
``` R
barplot(temperat.pca$eig[,2])
```
 
%%%% Output: display_data
 
[Hidden Image Output]
 
%% Cell type:markdown id: tags:
 
#### Visualization of individuals on axes 1 and 2
 
%% Cell type:code id: tags:
 
``` R
# Plots
# Individuals on axes 1 and 2, colored by the qualitative variable (column 17)
plot(temperat.pca, choix="ind", habillage=17,cex=0.8)
```
 
%%%% Output: display_data
 
[Hidden Image Output]
 
%% Cell type:markdown id: tags:
 
#### Visualization of individuals on axes 3 and 4
 
%% Cell type:code id: tags:
 
``` R
# Individuals on axes 3 and 4
plot(temperat.pca, choix="ind", habillage=17,cex=0.8,axes=3:4)
```
 
%%%% Output: display_data
 
[Hidden Image Output]
 
%% Cell type:markdown id: tags:
 
#### Use of variables for interpretation
Good knowledge of the data helps to interpret the projections, but when there are many individuals, the variables are needed to support the interpretation.
 
We consider the coordinates of the individuals on the axes as variables.<br\>
 
$F^{k}_{i}$ (**factor k**) : coordinate of the individual **i** on the **k** axis.<br\>
 
$$ F^{1}=\{ F^{1}_{i}, i=1 \ldots n\} \, , \, F^{2}=\{ F^{2}_{i}, i=1 \ldots n\}\, ,\,\ldots$$
 
* Analysis of the correlations of the active variables with the factors<br\>
<br\>
When the variables are strongly correlated with the factors :<br\>
 
- $cor(X^{k},F^{1}) > 0$ : individuals with high values of $X^{k}$ have high coordinates on axis 1.<br\>
- $cor(X^{k},F^{1}) < 0$ : individuals with high values of $X^{k}$ have low coordinates on axis 1.<br\>
<br\>
* The same holds for axis 2.<br\>
<br\>
We build **the correlation circle**.
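
A minimal sketch of these correlations, assuming a made-up table and using base R's `prcomp` as a stand-in for FactoMineR's `PCA` (so that the example does not depend on the course data):

``` R
# Toy table (made-up values): 6 individuals, 3 quantitative variables
X <- data.frame(v1 = c(1, 2, 3, 4, 5, 6),
                v2 = c(2, 1, 4, 3, 6, 5),
                v3 = c(6, 5, 3, 4, 1, 2))

# Standardized PCA with base R
res <- prcomp(X, center = TRUE, scale. = TRUE)

# Factors: coordinates of the individuals on the principal axes
Fk <- res$x

# Correlation of every variable with the first two factors:
# these are the coordinates of the variables on the correlation circle
circle <- cor(X, Fk[, 1:2])
circle

# Each variable lies inside the circle of radius 1
rowSums(circle^2)
```

The sign of each entry of `circle` tells whether individuals with high values of the variable sit on the positive or the negative side of the axis.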
 
%% Cell type:markdown id: tags:
 
### The cloud of variables
 
%% Cell type:markdown id: tags:
 
#### Studying variables
 
A **variable** is a point on a hypersphere in $\mathcal{R}^{n}$ <br\>
<br\>
<table style="width:60%">
<tr>
<th>
<img src="../../figures/NuageVarSphere.jpg",width="90%",height="90%">
</th>
<th>
$$\cos (\theta_{k,l}) = \frac{\langle X^{k},X^{l}\rangle}{\|X^{k}\| \|X^{l}\|} $$
<br\>
$$ = \frac{\sum_{i=1}^{n} x_{i}^{k} x_{i}^{l}}{\sqrt{\sum_{i=1}^{n} (x_{i}^{k})^{2} \sum_{i=1}^{n} (x_{i}^{l})^{2}}}$$
<br\>
</th>
</tr>
</table>
Since the variables are centered,

$$ \cos (\theta_{k,l}) = r(X^{k},X^{l}) \quad \mbox{correlation coefficient between} \quad X^{k} \; \mbox{and} \quad X^{l}$$
<br\>
<FONT color="#FF0000">Well-represented variables are close to the circle</FONT>.<br\>
<br\>
**Standardized variables** $\Longrightarrow$ the hypersphere has radius 1
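
The identity between cosine and correlation can be checked on two made-up variables (the vectors `xk` and `xl` below are invented for illustration):

``` R
# Two made-up variables observed on 5 individuals
xk <- c(2, 4, 6, 8, 10)
xl <- c(1, 3, 2, 5, 4)

# Center each variable
xk_c <- xk - mean(xk)
xl_c <- xl - mean(xl)

# Cosine of the angle between the two centered vectors in R^n
cos_theta <- sum(xk_c * xl_c) / sqrt(sum(xk_c^2) * sum(xl_c^2))

cos_theta
cor(xk, xl)   # Pearson correlation: same value
```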
 
%% Cell type:markdown id: tags:
 
#### Projection of the cloud of variables
Which axes in $\mathcal{R}^{n}$ best represent the correlation matrix? <br\>
<br\>
* The first axis is the one that **maximizes the sum of the squared correlations between the factor and the set of variables**.

<FONT color="#FF0000"> $$\underset{V^{1} \in \mathcal{R}^{n}}{\operatorname{argmax}} {\sum_{k=1}^{p} r(X^{k},V^{1})^{2}}$$</FONT>
 
 
The factor <FONT color="#FF0000"> $V^{1}$</FONT> is the factor that is the most linked to the set of variables in terms of squared correlations.<br\>
<br\>
* We look for **a second axis orthogonal to the first** (uncorrelated) that maximizes the sum of the squared correlations with the set of variables.<br\>
<br\>
* In a sequential way, we determine the 3rd axis, $\ldots$<br\>
<br\>
<FONT color="#FF0000">The projection of the cloud of variables is the same as the representation of the correlation circle obtained previously</FONT>.
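
As an illustration of this criterion, the sketch below assumes a made-up table and a standardized PCA computed with base R's `eigen`. It compares the first factor with one arbitrary competitor (the first variable itself); this illustrates the maximization property without proving it.

``` R
# Toy table (made-up values): 6 individuals, 3 quantitative variables
X <- data.frame(v1 = c(1, 2, 3, 4, 5, 6),
                v2 = c(2, 1, 4, 3, 6, 5),
                v3 = c(6, 5, 3, 4, 1, 2))

# Standardized PCA: eigen decomposition of the correlation matrix
eig <- eigen(cor(X), symmetric = TRUE)
Z   <- scale(X)                       # centered and scaled variables
V1  <- Z %*% eig$vectors[, 1]         # first factor

# Sum of squared correlations between the variables and the first factor
crit_V1 <- sum(cor(X, V1)^2)

# Same criterion for an arbitrary competitor: the first variable itself
crit_v1 <- sum(cor(X, X$v1)^2)

c(first_factor = crit_V1, lambda1 = eig$values[1], competitor = crit_v1)
```

For a standardized PCA the criterion reached by the first factor is the first eigenvalue of the correlation matrix.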
 
%% Cell type:code id: tags:
 
``` R
# Variables on axes 1 and 2
plot(temperat.pca, choix="var",cex=0.8)
```
 
%%%% Output: display_data
 
[Hidden Image Output]
 
%% Cell type:markdown id: tags:
 
* All variables have positive coordinates on axis 1 (size effect).<br\>
<br\>
* Axis 1 can be summarized as the annual average, which is confirmed by the illustrative variable "Moyenne" (average).<br\>
<br\>
* Latitude is also linked to the first factor.<br\>
<br\>
* The thermal amplitude is linked to the second axis.
 
%% Cell type:markdown id: tags:
 
#### Projection of the variables
<table style="width:70%">
<tr>
<th>
<img src="../../figures/cercleCor12.jpeg",width="100%",height="100%">
</th>
<th>
<p style="text-align:left";> * All variables have positive coordinates on axis 1 (size effect).</p><br\>
<p style="text-align:left";> * Axis 1 can be summarized as the annual average, which is confirmed by the illustrative variable "Moyenne" (average).</p><br\>
<p style="text-align:left";> * The latitude is also linked to the first factor.</p><br\>
<p style="text-align:left";> * The thermal amplitude is linked to the second axis.</p>
</th>
</tr>
</table>
 
%% Cell type:markdown id: tags:
 
 
$$r(A,B) = \cos(\theta_{A,B})$$
<br\>
If **A** is close to the plane: <br\>

* **A** is well projected.
* $r(A,H_{A}) \approx 1$
* its projection $H_{A}$ is close to the correlation circle.
 
<br\>
<img src="../../figures/QualProjVar.jpg",width="100%",height="100%">
<br\>
$\Longrightarrow$ **<FONT color="#FF0000">Only well-projected variables can be interpreted</FONT>**
 
%% Cell type:markdown id: tags:
 
### Interpretation
 
#### Percentage of inertia - Choice of number of axes
 
* Percentage of inertia (information) explained by each axis, given by its eigenvalue
* The axes being orthogonal, we can add the explained inertia of several axes.
 
%% Cell type:code id: tags:
 
``` R
barplot(temperat.pca$eig[,2])
```
 
%%%% Output: display_data
 
[Hidden Image Output]
 
%% Cell type:markdown id: tags:
 
$\Longrightarrow$ **<FONT color="#FF0000">Allows the choice of the number of axes to analyze</FONT>**
 
%% Cell type:markdown id: tags:
 
#### Interpretation - 2 indicators
Two aids to interpretation:
<br\>
* **Quality of representation** of variables and individuals on axis k.
<br\>
<table style="width:60%">
<tr>
<td> **Variable**</td>
<td> **Individual** </td>
</tr>
<tr>
<td>$\cos^{2}(V,V_{k})$ </td>
<td> $\cos^{2}(GI,GH_{i}^{k})$ <br\>
</td>
</tr>
</table>
*$H_{i}^{k}$ is the projection of individual $i$ on axis k*.<br\>
<br\>
$\Longrightarrow$ **Only well-projected elements can be interpreted**
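
For individuals, the quality of representation can be sketched as the squared coordinate on an axis divided by the squared distance to the center of gravity G. The example assumes a made-up table and base R's `prcomp` as a stand-in for FactoMineR's `PCA`:

``` R
# Toy table (made-up values): 5 individuals, 3 quantitative variables
X <- data.frame(v1 = c(1, 2, 3, 4, 5),
                v2 = c(2, 1, 4, 3, 6),
                v3 = c(5, 3, 4, 1, 2))

res <- prcomp(X, center = TRUE, scale. = TRUE)
Fk  <- res$x                      # coordinates of the individuals on the axes

# Squared distance of each individual to the center of gravity G
d2 <- rowSums(Fk^2)

# cos2 of individual i on axis k: squared coordinate / squared distance to G
cos2_ind <- Fk^2 / d2

round(cos2_ind, 3)
rowSums(cos2_ind)   # sums to 1 over all the axes
```

An individual with a small cos2 on the displayed plane is poorly projected, so its position on that plane should not be interpreted.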
 
%% Cell type:markdown id: tags:
 
* **Contribution to the construction of axis k**
<table style="width:60%">
<tr>
<td> $$Ctr_k(j) = \frac{r(X^{j},V_{k})^{2}}{\sum_{l=1}^{p} r(X^{l},V_{k})^{2}}$$ </td>
<td> $$Ctr_k(i) = \frac{(F^{k}_{i})^{2}}{\sum_{l=1}^{n} (F^{k}_{l})^{2}}$$</td>
</tr>
<br/>
<tr>
<br/>
<td> <img src="../../figures/cos2Var.jpg",width="90%",height="90%"> </td>
<td> <img src="../../figures/cos2Ind.jpg",width="90%",height="90%"> </td>
</tr>
</table>
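
The two contribution formulas can be sketched on a made-up table, assuming uniform weights for the individuals and base R's `prcomp` as a stand-in for FactoMineR's `PCA`:

``` R
# Toy table (made-up values): 5 individuals, 3 quantitative variables
X <- data.frame(v1 = c(1, 2, 3, 4, 5),
                v2 = c(2, 1, 4, 3, 6),
                v3 = c(5, 3, 4, 1, 2))

res <- prcomp(X, center = TRUE, scale. = TRUE)
Fk  <- res$x
k   <- 1                                   # axis under study

# Contribution of each variable: squared correlation with the factor,
# divided by the sum of the squared correlations
r_k     <- cor(X, Fk[, k])
ctr_var <- r_k^2 / sum(r_k^2)

# Contribution of each individual: squared coordinate on the axis,
# divided by the sum of the squared coordinates
ctr_ind <- Fk[, k]^2 / sum(Fk[, k]^2)

round(100 * ctr_var, 1)   # in percent
round(100 * ctr_ind, 1)
```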
 
%% Cell type:markdown id: tags:
 
#### Supplementary or illustrative elements
Supplementary elements may be **individuals** and/or **variables**. They are not used to compute the distances between individuals or to build the correlation matrix.<br/>
<br/>
$\Longrightarrow$ **they do not take part in the construction of the axes**; they help to interpret them. <br/>
<br/>
* <FONT color="#013ADF">Supplementary variables</FONT><br/>
- *Quantitative variables*: they are projected on the correlation circle. <br/>
The coordinate of the supplementary variable $X^{j}$ on axis k is the correlation between this variable and the factor $F^{k}$.<br/>
<br/>
- *Qualitative variables*: each category is projected **at the barycenter of the individuals that belong to it**, on **the graph of individuals**.<br/>
<br/>

The information can also be *represented with a color code*: individuals in the same category are drawn in the same color.<br/>
<br/>
* <FONT color="#013ADF">Supplementary individuals</FONT>: they are projected on the graph of individuals.
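
A minimal sketch of these projections, assuming a made-up table in which row 6 and the column `extra` are hypothetical supplementary elements. It uses base R's `prcomp` and `predict` rather than FactoMineR; `prcomp` scales with an n-1 denominator, so the coordinates are only meant to illustrate the principle.

``` R
# Toy table (made-up values): rows 1-5 are active, row 6 is supplementary;
# v1-v3 are active variables, "extra" is a supplementary quantitative variable
X <- data.frame(v1    = c(1, 2, 3, 4, 5, 3),
                v2    = c(2, 1, 4, 3, 6, 2),
                v3    = c(5, 3, 4, 1, 2, 4),
                extra = c(3, 3, 7, 6, 10, 5))
active_rows <- 1:5
active_cols <- c("v1", "v2", "v3")

# PCA built on the active elements only
res <- prcomp(X[active_rows, active_cols], center = TRUE, scale. = TRUE)

# Supplementary individual: centered and scaled with the active means and
# standard deviations, then projected on the existing axes
coord_sup <- predict(res, newdata = X[6, active_cols])

# Supplementary quantitative variable: correlation with the factors,
# computed on the active individuals
cor_sup <- cor(X$extra[active_rows], res$x)

coord_sup
cor_sup
```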
 
%% Cell type:code id: tags:
 
``` R
# Individuals on axes 3 and 4
plot(temperat.pca, choix="ind", habillage=17,cex=0.8,axes=3:4)
```
 
%%%% Output: display_data
 
[Hidden Image Output]
 
%% Cell type:code id: tags:
 
``` R
# Variables on axes 1 and 2
plot(temperat.pca, choix="var",cex=0.8)
```
 
%%%% Output: display_data
 
[Hidden Image Output]
 
%% Cell type:markdown id: tags:
 
#### Automatic description of axes
This kind of interpretation aid is useful when **the number of variables is large**. <br/>
<br/>
* <FONT color="#013ADF">Quantitative variables</FONT> : $(r(V^{j},F^{k}), j=1 \ldots p)$<br/>
<br/>
- Only the variables whose correlation coefficient with the factor is significantly different from 0 are kept.<br/>
<br/>
- For each axis, the variables are sorted from the highest correlation coefficient to the lowest.<br/>
<br/>
* <FONT color="#013ADF">Qualitative variable</FONT> : an analysis of variance is performed for each qualitative variable and each factor (Fisher test, Student test).
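
The quantitative part of this description can be sketched by hand with base R's `cor.test`. The table is made up, and the threshold `alpha = 0.05` is an assumption of the example; this is a sketch of the idea, not FactoMineR's own implementation.

``` R
# Toy table (made-up values): 6 individuals, 3 quantitative variables
X <- data.frame(v1 = c(1, 2, 3, 4, 5, 6),
                v2 = c(2, 1, 4, 3, 6, 5),
                v3 = c(6, 5, 3, 4, 1, 2))

res <- prcomp(X, center = TRUE, scale. = TRUE)
F1  <- res$x[, 1]                          # first factor

# Correlation of each variable with the factor, and its p-value
desc <- t(sapply(X, function(v) {
  test <- cor.test(v, F1)
  c(correlation = unname(test$estimate), p.value = test$p.value)
}))

# Keep the significant correlations, sorted from the highest to the lowest
alpha <- 0.05                              # assumed significance threshold
desc  <- desc[desc[, "p.value"] < alpha, , drop = FALSE]
desc_sorted <- desc[order(desc[, "correlation"], decreasing = TRUE), , drop = FALSE]
desc_sorted
```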
 
%% Cell type:code id: tags:
 
``` R
dimdesc(temperat.pca)
```
 
%%%% Output: display_data
 
$Dim.1
: $quanti
:
| <!--/--> | correlation | p.value |
|---|---|---|
| Septembre | 0.9924932 | 1.187361e-20 |
| Moyenne | 0.9922340 | 1.693870e-20 |
| Octobre | 0.9852237 | 1.409265e-17 |
| Avril | 0.9794972 | 4.282880e-16 |
| Novembre | 0.9344977 | 6.950867e-11 |
| Mars | 0.9294103 | 1.490263e-10 |
| Aout | 0.9293171 | 1.510426e-10 |
| Mai | 0.8939479 | 9.120840e-09 |
| Juillet | 0.8693439 | 7.288714e-08 |
| Juin | 0.8612611 | 1.318936e-07 |
| Fevrier | 0.8587611 | 1.572744e-07 |
| Décembre | 0.8448172 | 3.962517e-07 |
| Janvier | 0.8117914 | 2.574863e-06 |
| Latitude | -0.8695115 | 7.196646e-08 |
$quali
:
| <!--/--> | R2 | p.value |
|---|---|---|
| Région | 0.6889198 | 4.659316e-05 |
$category
:
| <!--/--> | Estimate | p.value |
|---|---|---|
| Sud | 4.045232 | 2.196202e-05 |
| Nord | -2.840457 | 7.568562e-03 |
$Dim.2
: $quanti
:
| <!--/--> | correlation | p.value |
|---|---|---|
| Janvier | 0.5734912 | 4.223923e-03 |
| Décembre | 0.5140212 | 1.210370e-02 |
| Fevrier | 0.5044054 | 1.411137e-02 |
| Longitude | -0.4359624 | 3.756716e-02 |
| Juillet | -0.4663341 | 2.489752e-02 |
| Juin | -0.5006543 | 1.496483e-02 |
| Amplitude | -0.9604872 | 3.865197e-13 |
$quali
:
| <!--/--> | R2 | p.value |
|---|---|---|
| Région | 0.5144575 | 0.002825625 |
$category
:
| <!--/--> | Estimate | p.value |
|---|---|---|
| Nord | 0.7521184 | 0.0370465132 |
| Est | -1.3636282 | 0.0004962863 |
$Dim.3
: $quali
:
| <!--/--> | R2 | p.value |
|---|---|---|
| Région | 0.4182892 | 0.0144468 |
$category
:
| <!--/--> | Estimate | p.value |
|---|---|---|
| Nord | 0.2713414 | 0.01390328 |
| Est | -0.2481828 | 0.01634125 |
 
%% Cell type:markdown id: tags:
 
## Factoshiny: interactive graphs in exploratory multivariate data analysis
The [Factoshiny](http://factominer.free.fr/graphs/factoshiny.html) package provides a graphical interface to the [FactoMineR](http://factominer.free.fr) package and lets you modify the graphs interactively. It is particularly useful for fine-tuning the graphs before sharing them.
 
%% Cell type:markdown id: tags:
 
## Handling of missing data in PCA
[PCA with missing data using the missMDA R package](https://www.youtube.com/watch?v=OOM8_FH6_8o) <br/>
[Methodology on the treatment of missing data](https://www.youtube.com/watch?v=hQ6tDtgotx0)
 
%% Cell type:markdown id: tags:
 
# Other methods of multidimensional analysis
[Approach in multidimensional data analysis](http://math.agrocampus-ouest.fr/infoglueDeliverLive/membres/Francois.Husson/Rcorner) <br/>
<br/>
<img src="../../figures/demarcheAD.jpg" width="80%" height="80%">
 
## Correspondence Analysis
### Data and objectives
The main goal of **correspondence analysis** is to study the **links between two qualitative variables**. This amounts to comparing the observed data with what they would look like if the variables were independent, so the analysis is built to capture the deviation from independence. The reasoning is mainly geometrical: we build one point cloud for the rows and one for the columns, and projecting these clouds onto planes gives useful representations.
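
The deviation from independence can be computed by hand in base R. The sketch below is illustrative only: it uses the built-in `HairEyeColor` table (an assumption, not a dataset from this course) and plain `svd` rather than the `CA` function of FactoMineR.

``` R
# Contingency table crossing two qualitative variables (hair colour x eye colour)
N <- unclass(margin.table(HairEyeColor, c(1, 2)))

P <- N / sum(N)                        # observed relative frequencies
row_mass <- rowSums(P)                 # row margins
col_mass <- colSums(P)                 # column margins
E <- outer(row_mass, col_mass)         # frequencies expected under independence

# Standardized deviations from independence
S <- (P - E) / sqrt(E)

# The total inertia equals the chi-square statistic divided by the sample size
inertia <- sum(S^2)
chi2 <- unname(chisq.test(N)$statistic)
print(c(inertia = inertia, chi2_over_n = chi2 / sum(N)))

# The SVD of S gives the axes and the coordinates of the two point clouds
dec <- svd(S)
row_coord <- sweep(dec$u, 1, sqrt(row_mass), "/") %*% diag(dec$d)
col_coord <- sweep(dec$v, 1, sqrt(col_mass), "/") %*% diag(dec$d)
rownames(row_coord) <- rownames(N)
rownames(col_coord) <- colnames(N)
print(round(100 * dec$d^2 / inertia, 1))   # percentage of inertia per axis
```

If the two variables were independent, `S` would be zero everywhere and there would be nothing to project; the larger the entries of `S`, the stronger the link shown on the first axes.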
 
%% Cell type:markdown id: tags:
## Multiple Correspondence Analysis
### Data and objectives
In MCA, we have a point cloud of individuals and a point cloud of categories. We will see how to visualize the cloud of individuals and interpret it using the categories, and how to visualize the cloud of categories directly. Both clouds can also be shown on the same graph; this is called the simultaneous representation of the point clouds.
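
One way to obtain both clouds is to apply a correspondence analysis to the indicator matrix (one column per category). The following base R sketch assumes three variables of the built-in `mtcars` data treated as qualitative; the helper `indicator` is hypothetical, and FactoMineR's `MCA` function is not used here.

``` R
# Three variables of mtcars treated as qualitative (illustrative choice)
X <- data.frame(lapply(mtcars[, c("cyl", "gear", "am")], factor))

# Hypothetical helper: indicator (dummy) coding of one factor
indicator <- function(f, name) {
  m <- 1 * outer(as.character(f), levels(f), "==")
  colnames(m) <- paste(name, levels(f), sep = "_")
  m
}
Z <- do.call(cbind, unname(Map(indicator, X, names(X))))

# Correspondence analysis of the indicator matrix
P <- Z / sum(Z)
row_mass <- rowSums(P)
col_mass <- colSums(P)
E <- outer(row_mass, col_mass)
dec <- svd((P - E) / sqrt(E))

ind_coord <- sweep(dec$u, 1, sqrt(row_mass), "/") %*% diag(dec$d)  # individuals
cat_coord <- sweep(dec$v, 1, sqrt(col_mass), "/") %*% diag(dec$d)  # categories
rownames(cat_coord) <- colnames(Z)

# Simultaneous representation on the first plane (interactive sessions only)
if (interactive()) {
  plot(ind_coord[, 1:2], pch = 20, col = "grey40", xlab = "Dim 1", ylab = "Dim 2")
  text(cat_coord[, 1:2], labels = rownames(cat_coord), col = "red")
}
```

With this coding, the total inertia `sum(dec$d^2)` equals the number of categories divided by the number of variables, minus 1, which is a quick check that the indicator matrix was built correctly.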
 
## Multiple Factor Analysis
### Data and objectives
Multiple factor analysis is a method for studying more complex data tables, in which a set of individuals is described by variables structured into groups, possibly coming from different information sources. Its appeal is that it can analyze the data table as a whole while also comparing the information provided by the different sources.<br/>
<br/>
[MOOC AgroCampus Ouest Exploratory Multivariate Data Analysis](https://www.fun-mooc.fr/courses/course-v1:agrocampusouest+40001S04EN+session04/info)<br/>
[Courses AgroCampus Ouest F. Husson](http://math.agrocampus-ouest.fr/infoglueDeliverLive/membres/Francois.Husson/teaching)
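
To analyze the table as a whole without letting one group dominate, multiple factor analysis usually balances the groups before a global PCA: each group is divided by the first singular value of its own PCA. The base R sketch below illustrates only this weighting step; the two groups built from `iris` are an assumption made for the example, and the `MFA` function of FactoMineR is not used.

``` R
# Two hypothetical groups of quantitative variables
groups <- list(sepal = c("Sepal.Length", "Sepal.Width"),
               petal = c("Petal.Length", "Petal.Width"))
X <- scale(iris[, 1:4])                  # centered and scaled variables

# Weight each group by 1 / (first singular value of its separate PCA),
# so that the first axis of every group has the same inertia (equal to 1)
weighted <- lapply(groups, function(cols) {
  Xg <- X[, cols, drop = FALSE]
  Xg / svd(Xg)$d[1]
})

# Global PCA on the concatenation of the weighted groups
global <- svd(do.call(cbind, unname(weighted)))
eig <- global$d^2
print(round(eig, 3))
```

The first global eigenvalue lies between 1 and the number of groups: a value close to the number of groups means that the first axis is shared by all the groups, which is one way to compare the information they provide.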
 
%% Cell type:markdown id: tags:
 
# Some references

* Analyse de données avec R, 2ème édition revue et augmentée. <br />
F. Husson, S. Lê & J. Pagès (2016). <br /> Presses Universitaires de Rennes.

* Statistique avec R, 3ème édition revue et augmentée. <br />
P-A. Cornillon, A. Guyader, F. Husson, N. Jégou, J. Josse, M. Kloareg, E. Matzner-Lober, L. Rouvière (2012). <br /> Presses Universitaires de Rennes.

* Exploratory Multivariate Analysis by Example Using R, 2nd edition. <br />
F. Husson, S. Lê & J. Pagès (2017). <br /> Chapman & Hall/CRC Computer Science & Data Analysis.

* MOOC on FUN
 
%% Cell type:markdown id: tags:
 