Commit ab9214e8 authored by Laurence Viry

multidim notebook

parent 42c3522b
......@@ -17,68 +17,119 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Approach to make a multivariate analysis\n",
"# Introduction - Multivariate analysis\n",
" In many applications we observe **p** variables on **n** individuals (<FONT color=\"#B40404\">p and n being able to be high).</FONT><br\\>\n",
" <br\\>\n",
"Databases become more and more voluminous in term of individuals and variables measured on these individuals. The study of each variable and pairs of variables by classical descriptive statistics methods are indispensable but insufficient.<br\\>\n",
"<br\\>\n",
"The **multidimensional exploratory** methods allow:\n",
"* to take into account *the simultaneous variations* of a larger number of variables,\n",
"* to synthesize and / or simplify *the underlying structures*.\n",
"\n",
"How to perform a multiple factor analysis that handles several groups of continuous and/or categorical variables and/or contingency tables? And how can we improve the graphs obtained by the method?<br\\>\n",
"<br\\>\n",
"[Tutorial F. Husson](http://math.agrocampus-ouest.fr/infoglueDeliverLive/membres/Francois.Husson/Rcorner)\n",
"\n",
"1. Are there groups of variables? Use of Multiple factor analysis (function MFA i FactoMiner) <br\\>\n",
"1. Are there groups of variables? Use of Multiple factor analysis (function <FONT color=\"#B40404\">MFA</FONT> i FactoMiner) <br\\>\n",
"<br\\>\n",
"2. What is the type of information?\n",
" * Contingency table -> Factorial correspondence analysis (AFC or AFMTC if several)\n",
" * Several tables of contingencies -> AFMTC\n",
" * Table \"individuals - variables\" -> principal components analysis (PCA), Multiple factor analysis (MFA), nalyse es correspondances multiples. <br\\>\n",
" * Contingency table -> Factorial correspondence analysis (<FONT color=\"#B40404\">AFC</FONT> or <FONT color=\"#B40404\">AFMTC</FONT> if several)\n",
" * Several tables of contingencies -> <FONT color=\"#B40404\">AFMTC </FONT>\n",
" * Table \"individuals - variables\" -> principal components analysis (<FONT color=\"#B40404\">PCA</FONT>), Multiple factor analysis (<FONT color=\"#B40404\">MFA</FONT>). <br\\>\n",
"<br\\>\n",
"3. What are the active elements? what are the elements that will participate in the construction of the axes?<br\\>\n",
"<br\\>\n",
"4. What are the additional elements? they do not participate in the construction of the axes but are useful for interpretation.<br\\>\n",
"<br\\>\n",
"5. What is the nature of the active variables?\n",
" * Quantitative variables: Principal Component Analysis (PCA)\n",
" * Qualitative Variables: Multiple Correspondence Analysis (MCA)\n",
" * Mixed variables: AFDM<br\\>\n",
" <br\\>\n",
"Whatever the method, the additional variables can be of two types.\n",
" * **Quantitative variables**: Principal Component Analysis (<FONT color=\"#B40404\">PCA</FONT>)\n",
" * **Qualitative Variables**: Multiple Correspondence Analysis (<FONT color=\"#B40404\">MCA</FONT>)\n",
" * **Mixed variables**: <FONT color=\"#B40404\">AFDM</FONT>\n",
"<br\\>\n",
"6. Should we reduce the quantitative variables?<br\\>\n",
"(*Whatever the method, the additional variables can be of two types.*)<br\\>\n",
"<br\\>\n",
"7. Are there any missing data? How to treat them?<br\\>\n",
"\n",
"6. Should we **reduce** the quantitative variables?<br\\>\n",
"<br\\>\n",
"7. Are there any **missing data**? How to treat them?<br\\>\n",
"<br\\>\n",
"8. The steps of the analysis<br\\>\n",
"8. The *steps of the analysis*<br\\>\n",
" * Start the factor analysis.<br\\>\n",
"<br\\>\n",
" * Describe the factorial axes by the active initial variables (dimdesc)<br\\>\n",
" * Describe the factorial axes by the active initial variables (<FONT color=\"#B40404\">dimdesc</FONT>)<br\\>\n",
"<br\\>\n",
" * It may be interesting to use a classification method to determine groups of individuals (HCPC)<br\\>\n",
" * It may be interesting to use a classification method to determine groups of individuals (<FONT color=\"#B40404\">HCPC</FONT>)<br\\>\n",
"<br\\>\n",
"<img src=\"../../figures/MultiFactorielAnalysis.jpg\",width=\"80%\",height=\"80%\">"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pour en savoir plus [voir vidéo F. Husson ](https://www.youtube.com/watch?v=UrS00sOpeec) (in french).\n",
"To know more [voir vidéo F. Husson ](https://www.youtube.com/watch?v=UrS00sOpeec) (in french).\n",
"\n",
"# Principal component analysis \n",
"In this course, we present only how to analyze tables with quantitative variables using **principal components analysis** (PCA) and how to use the method with **FactoMineR** in **R**. \n",
"\n",
"# Introduction\n",
"## Introduction\n",
"The aim of the PCA method is to summarize a table of individuals x variables data, the variables being quantitatives.\n",
"\n",
"The PCA allows to study the similarities between individuals from the point of view of a group of variables and gives off profiles of individuals.\n",
"\n",
"It allows a balance of the linear links between variables from the correlation coefficients.\n",
"\n",
"These studies can be related to characterize individuals or groups of individuals by variables and to illustrate the links between variables from characteristic individuals.\n",
"These studies can be related to characterize individuals or groups of individuals by variables and to illustrate the links between variables from characteristic individuals."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data - practicalities\n",
"### Which kinds of data\n",
"PCA applies to data tables where rows are considered as individuals and columns as quantitative variables\n",
"\n",
"On dispose de **p variables** <FONT color=\"#013ADF\">$X^{1},X^{2},\\ldots,X^{p}$</FONT> observées\n",
"sur **n individus**.\n",
"\n",
"## Data - practicalities"
"Principal component analysis, also known as <FONT color=\"#B40404\">PCA</FONT>, applies to data tables where **rows** can be considered like individuals and **columns** like quantitative variables. <br\\>\n",
"\n",
"### Data table\n",
" We note <FONT color=\"#013ADF\">$x^{j}_{i}$</FONT> the observation of the variable <FONT color=\"#013ADF\">$X^{j}$</FONT> on the <FONT color=\"#013ADF\">ith</FONT> individual.\n",
" \n",
" <table style=\"width:60%\">\n",
" <tr>\n",
" <th>\n",
" $$\\begin{aligned}\n",
"X= \\left[\\begin{array}{ccc}\n",
"x_{1}^{1} & \\dots & x_{1}^{p}\\\\\n",
"\\vdots & \\ddots & \\vdots\\\\\n",
"x_{n}^{1} & \\dots & x_{n}^{p}\n",
"\\end{array}\\right] & \\quad n \\quad \\mbox{individus} \\nonumber \\\\\n",
" p \\quad \\mbox{variables} \\nonumber \\end{aligned}$$\n",
" </th>\n",
" <th>\n",
"$$\\bar{x^{j}}=\\sum_{i=1}^{n} x^{j}_{i}$$\n",
"$$\\sigma^{j}=\\sqrt{\\sum_{i=1}^{n} (x^{j}_{i}-\\bar{x^{j}})^2}$$\n",
"</th>\n",
" </tr>\n",
"</table> \n",
"\n",
"\n",
"The data table can be analyzed through its **lines** (individuals) or through its **columns** (variables).\n",
"\n",
"Le tableau des données peut être analysé à travers ses **lignes** (individus) ou à travers ses\n",
"**colonnes**(variables).<br\\>\n",
"\n",
"<FONT color=\"#013ADF\">$X^{j}$</FONT>$\\,= \\, (X^{j}_{1},\\ldots,X^{j}_{n}) \\quad \\mbox{variable} \\quad j, \\quad \\mbox{dans} \\quad \\mathcal{R}^{n}$\n",
"<FONT color=\"#013ADF\">$X_{i}$</FONT>$\\, = \\, (X^{1}_{i},\\ldots,X^{p}_{i}) \\quad\\mbox{individu} \\quad i, \\quad \\mbox{dans} \\quad \\mathcal{R}^{p}$\n",
"<br\\>\n",
"<br\\>\n",
"<figure>\n",
" <img src=\"../../figures/TwoClouds.jpg\",width=\"40%\",height=\"20%\">\n",
" <figcaption> <br><em>(Exploratory Multivariate Data Analysis</em> <a href=\"https://www.fun-mooc.fr/courses/course-v1:agrocampusouest+40001S04EN+session04/info\">MOOC AgroCampus Ouest )</a>\n",
" </figcaption>\n",
" </figure>"
]
},
{
......@@ -87,39 +138,268 @@
"source": [
"# Studying individuals and variables\n",
"## Studying individuals\n",
"* When can we say that two individuals are similar with respect to all the variables or a group of variables?\n",
"* If there are many individuals, is it possible to categorize them? <br\\>\n",
"<br\\>\n",
"⇒ groups of individuals, partitions between them\n",
"\n",
"## Studying variables\n",
"* The correlation matrix provides a simple indication on the linear link\n",
"between variables two by two.\n",
"* Look for similarities between all the variables or a group of variables.\n",
"* Synthetic indicators are sought to summarize groups of\n",
"variables.\n",
"\n",
"## Studying variables"
"⇒ visualization of the correlation matrix <br\\>\n",
"⇒ find a small number of synthetic variables to summarize many\n",
"variables\n",
"\n",
"## Links between the two points-of-view\n",
"* Characterize groups of individuals using variables.\n",
"* Use typical individuals to interpret groups of variables."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# PCA with FactoMineR \n",
"FactoMineR is an R package dedicated to multivariate Exploratory Data Analysis. It is developed and maintained by François Husson, Julie Josse, Sébastien Lê, d'Agrocampus Rennes, and J. Mazet."
"## Some examples\n",
"\n",
"* **Sensory analysis**: note of the descriptor k for the product i\n",
"* **Ecology**: concentration of the pollutant k on the river i\n",
"* **Economy**: value of indicator k for year i\n",
"* **Genetics**: Gene expression k for the patient i\n",
"* **Biology**: k measure for the animal i\n",
"* **Marketing**: satisfaction index value k for brand i\n",
"* **Sociology**: time spent in activity k by the individuals of the CSP i\n",
"* $\\ldots$\n",
"\n",
"## Example: Climate of different European countries\n",
"<br\\>\n",
"To illustrate this course, we will take **temperature data** to analyse climate of different European countries.\n",
"\n",
"### Description of the data:\n",
"* average monthly temperatures (over 30 years).\n",
"* the annual average temperature, the thermal amplitude.\n",
"* longitude, latitude of each city\n",
"* A qualitative variable belonging to a region of Europe: Northern Europe, south, east and west.\n",
"### Data extract\n",
" <table style=\"width:100%\">\n",
" <tr>\n",
" <th> Town</th>\n",
" <th>Janv</th>\n",
" <th>Fév </th>\n",
" <th>... </th>\n",
" <th>Nov </th>\n",
" <th>Déc </th>\n",
" <th>Moy </th>\n",
" <th>Amp </th>\n",
" <th>Lat </th>\n",
" <th>Lon </th>\n",
" <th>Rég</th>\n",
" </tr>\n",
" <tr>\n",
" <td>Amsterdam </td> \n",
" <td>2.9 </td> \n",
" <td>2.5 </td> \n",
" <td>... </td> \n",
" <td>7.0 </td> \n",
" <td>4.4 </td> \n",
" <td>9.9 </td> \n",
" <td>14.6 </td> \n",
" <td>52.2 </td> \n",
" <td>4.5 </td> \n",
" <td>Ouest\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td>Athènes </td>\n",
" <td>9.1 </td>\n",
" <td>9.7 </td>\n",
" <td>... </td>\n",
" <td>14.6</td>\n",
" <td>11.0</td>\n",
" <td>17.8 </td>\n",
" <td>18.3 </td>\n",
" <td>37.6</td>\n",
" <td>23.5</td>\n",
" <td>Sud </td>\n",
" </tr>\n",
" <tr>\n",
" <td> Berlin </td>\n",
" <td>-0.2</td>\n",
" <td>0.1</td>\n",
" <td>...</td>\n",
" <td>4.2</td>\n",
" <td>1.2</td>\n",
" <td>9.1</td>\n",
" <td>18.5 </td>\n",
" <td>52.3 </td>\n",
" <td>13.2</td>\n",
" <td>Ouest</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Helsinki</td>\n",
" <td>-5.8</td>\n",
" <td>-5.0</td>\n",
" <td>...</td>\n",
" <td>0.1</td>\n",
" <td>-2.3</td>\n",
" <td>4.8</td>\n",
" <td>23.4</td>\n",
" <td>60.1</td>\n",
" <td>25.0</td>\n",
" <td>Nord</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Kiev</td>\n",
" <td>-5.9</td>\n",
" <td>-5.0</td>\n",
" <td>...</td>\n",
" <td>1.2</td>\n",
" <td>-3.6</td>\n",
" <td>7.1</td>\n",
" <td>25.3</td>\n",
" <td>50.3</td>\n",
" <td>30.3</td>\n",
" <td>Est</td>\n",
" </tr>\n",
" <tr>\n",
" <td> Copenhague</td>\n",
" <td>-0.4 </td>\n",
" <td>-0.4</td>\n",
" <td>...</td>\n",
" <td>4.1</td>\n",
" <td>1.3</td>\n",
" <td>7.8</td>\n",
" <td>17.5</td>\n",
" <td>55.4</td>\n",
" <td>12.3</td>\n",
" <td>Nord</td>\n",
" </tr>\n",
" <tr>\n",
" <td> Budapest </td>\n",
" <td>-1.1</td>\n",
" <td>0.8</td>\n",
" <td>...</td>\n",
" <td>5.1 </td>\n",
" <td>0.7</td>\n",
" <td>10.9</td>\n",
" <td>23.1</td>\n",
" <td>47.3</td>\n",
" <td>19.0</td>\n",
" <td>Est</td>\n",
" </tr>\n",
" <tr>\n",
" <td> Bruxelles</td>\n",
" <td>3.3</td>\n",
" <td>3.3</td>\n",
" <td>...</td>\n",
" <td>6.7</td>\n",
" <td>4.4</td>\n",
" <td>10.3</td>\n",
" <td>14.4</td>\n",
" <td>50.5</td>\n",
" <td>4.2</td>\n",
" <td>Ouest \n",
" </tr>\n",
" <tr>\n",
" <td> ...</td>\n",
" <td></td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td> \n",
" <td>...</td> \n",
"</table> \n",
"### Read data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"load(\"../../data/temperat.RData\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Factoshiny: interactive graphs in exploratory multivariate data analysis "
"### Descriptive statistics"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Descriptive statistics\n",
"summary(temperat)\n",
"# Scaterplot\n",
"pairs(temperat[,1:12])\n",
"# Correlation\n",
"cor(temperat[,1:12]) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Interpretation aids"
"### Descriptives statistics"
]
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"summary(temp)\n",
"pairs(temps)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# PCA with FactoMineR \n",
"FactoMineR is an R package dedicated to multivariate Exploratory Data Analysis. It is developed and maintained by François Husson, Julie Josse, Sébastien Lê, d'Agrocampus Rennes, and J. Mazet."
]
},
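{
"cell_type": "markdown",
"metadata": {},
"source": [
"The analysis steps listed in the introduction (run the factor analysis, describe the axes, then cluster the individuals) might be sketched as below. This is only an illustration, not the course's reference solution: the column positions (`13:16` for mean, amplitude, latitude and longitude, `17` for the region) are assumptions inferred from the data description, and the object names `res.pca` and `res.hcpc` are hypothetical.\n",
"\n",
"```r\n",
"# Sketch only: column positions are assumptions based on the data description\n",
"library(FactoMineR)\n",
"load(\"../../data/temperat.RData\")      # assumed to provide the data frame `temperat`\n",
"\n",
"res.pca <- PCA(temperat,\n",
"               scale.unit = TRUE,       # standardize the active variables\n",
"               quanti.sup = 13:16,      # assumed: mean, amplitude, latitude, longitude\n",
"               quali.sup  = 17,         # assumed: region\n",
"               graph = FALSE)\n",
"\n",
"res.pca$eig                             # eigenvalues and percentage of variance per axis\n",
"plot(res.pca, choix = \"ind\", habillage = 17)   # individuals, coloured by region\n",
"plot(res.pca, choix = \"var\")          # correlation circle of the variables\n",
"dimdesc(res.pca, axes = 1:2)            # variables that best describe each axis\n",
"res.hcpc <- HCPC(res.pca, nb.clust = -1, graph = FALSE)  # clustering on the components\n",
"```\n",
"\n",
"Declaring the geographic and summary columns as supplementary keeps them out of the construction of the axes while still letting them help with interpretation."
]
},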
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Factoshiny: interactive graphs in exploratory multivariate data analysis \n",
"Le package [Factoshiny](http://factominer.free.fr/graphs/factoshiny.html) permet d'utiliser le package [FactoMineR](http://factominer.free.fr) à l'aide d'une **interface graphique**, et permet aussi de modifier les graphiques de façon **interactive**. Ce package est très utile pour optimiser ces graphiques avant de les diffuser."
]
},
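{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how the interface might be launched on the temperature data, assuming Factoshiny is installed and R runs in an interactive session (the object name `res.shiny` is hypothetical):\n",
"\n",
"```r\n",
"# Sketch only: needs an interactive R session with Factoshiny installed\n",
"library(Factoshiny)\n",
"load(\"../../data/temperat.RData\")      # assumed to provide the data frame `temperat`\n",
"if (interactive()) {\n",
"  res.shiny <- PCAshiny(temperat)       # opens the PCA interface in a web browser\n",
"}\n",
"```\n",
"\n",
"Supplementary variables and graph options can then be chosen in the interface rather than in code."
]
},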
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Interpretation aids\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"# A detailed PCA example"
]
......