### multidim notebook

parent 42c3522b

# PCA with FactoMineR

FactoMineR is an R package dedicated to multivariate exploratory data analysis. It is developed and maintained by François Husson, Julie Josse and Sébastien Lê, of Agrocampus Rennes, and J. Mazet.

## Some examples

* **Sensory analysis**: score of descriptor k for product i
* **Ecology**: concentration of pollutant k in river i
* **Economy**: value of indicator k for year i
* **Genetics**: expression of gene k for patient i
* **Biology**: measurement k for animal i
* **Marketing**: value of satisfaction index k for brand i
* **Sociology**: time spent on activity k by the individuals of socio-professional category (CSP) i
* $\ldots$

## Example: Climate of different European countries

To illustrate this course, we will use **temperature data** to analyse the climate of different European countries.

### Description of the data

* average monthly temperatures (over 30 years)
* the annual average temperature and the thermal amplitude
* longitude and latitude of each city
* a qualitative variable: the region of Europe the city belongs to (north, south, east or west)

### Data extract

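| Town | Janv | Fév | ... | Nov | Déc | Moy | Amp | Lat | Lon | Rég |
|---|---|---|---|---|---|---|---|---|---|---|
| Amsterdam | 2.9 | 2.5 | ... | 7.0 | 4.4 | 9.9 | 14.6 | 52.2 | 4.5 | Ouest |
| Athènes | 9.1 | 9.7 | ... | 14.6 | 11.0 | 17.8 | 18.3 | 37.6 | 23.5 | Sud |
| Berlin | -0.2 | 0.1 | ... | 4.2 | 1.2 | 9.1 | 18.5 | 52.3 | 13.2 | Ouest |
| Helsinki | -5.8 | -5.0 | ... | 0.1 | -2.3 | 4.8 | 23.4 | 60.1 | 25.0 | Nord |
| Kiev | -5.9 | -5.0 | ... | 1.2 | -3.6 | 7.1 | 25.3 | 50.3 | 30.3 | Est |
| Copenhague | -0.4 | -0.4 | ... | 4.1 | 1.3 | 7.8 | 17.5 | 55.4 | 12.3 | Nord |
| Budapest | -1.1 | 0.8 | ... | 5.1 | 0.7 | 10.9 | 23.1 | 47.3 | 19.0 | Est |
| Bruxelles | 3.3 | 3.3 | ... | 6.7 | 4.4 | 10.3 | 14.4 | 50.5 | 4.2 | Ouest |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |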
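
As an illustration, the eight rows of the extract above can be typed into an R data frame and passed to FactoMineR. This is only a sketch under stated assumptions: the column names (`jan`, `feb`, `avg`, ...) are ASCII names chosen here and are not necessarily those of the course data file, only the four monthly columns visible in the extract are used as active variables, and the `PCA` call is skipped when the FactoMineR package is not installed.

```r
# Extract of the temperature table, typed by hand (8 cities, 4 of the 12 months).
# Column names are hypothetical ASCII names chosen for this sketch.
temp <- data.frame(
  jan    = c( 2.9,  9.1, -0.2, -5.8, -5.9, -0.4, -1.1,  3.3),
  feb    = c( 2.5,  9.7,  0.1, -5.0, -5.0, -0.4,  0.8,  3.3),
  nov    = c( 7.0, 14.6,  4.2,  0.1,  1.2,  4.1,  5.1,  6.7),
  dec    = c( 4.4, 11.0,  1.2, -2.3, -3.6,  1.3,  0.7,  4.4),
  avg    = c( 9.9, 17.8,  9.1,  4.8,  7.1,  7.8, 10.9, 10.3),
  amp    = c(14.6, 18.3, 18.5, 23.4, 25.3, 17.5, 23.1, 14.4),
  lat    = c(52.2, 37.6, 52.3, 60.1, 50.3, 55.4, 47.3, 50.5),
  lon    = c( 4.5, 23.5, 13.2, 25.0, 30.3, 12.3, 19.0,  4.2),
  region = factor(c("Ouest", "Sud", "Ouest", "Nord", "Est", "Nord", "Est", "Ouest")),
  row.names = c("Amsterdam", "Athènes", "Berlin", "Helsinki", "Kiev",
                "Copenhague", "Budapest", "Bruxelles")
)

# Active variables: the monthly temperatures shown in the extract.
active <- c("jan", "feb", "nov", "dec")

# Correlation matrix of the active variables (pairwise linear links).
print(round(cor(temp[, active]), 2))

# PCA with FactoMineR, only if the package is available.
if (requireNamespace("FactoMineR", quietly = TRUE)) {
  res <- FactoMineR::PCA(
    temp,
    scale.unit = TRUE,   # standardize the active variables
    quanti.sup = 5:8,    # avg, amp, lat, lon: supplementary quantitative variables
    quali.sup  = 9,      # region: supplementary qualitative variable
    graph      = FALSE
  )
  print(res$eig)         # eigenvalues and percentage of variance per axis
}
```

Declaring `avg`, `amp`, `lat`, `lon` and `region` as supplementary means they do not take part in building the axes; they are only projected afterwards to help interpretation.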
This course is inspired by the MOOC (Massive Open Online Course) "[Exploratory Multivariate Data Analysis](https://www.fun-mooc.fr/courses/course-v1%3Aagrocampusouest%2B40001S04EN%2Bsession04/about)" on the FUN platform (the first session in English was in 2017), given by the Department of Applied Mathematics of Agrocampus Ouest, Rennes (F. Husson, J. Pagès, M. Houée-Bigot).

The 2nd edition of the MOOC starts on 5 March 2018; you can subscribe until 20 April.

French version: [Analyse de données multidimensionnelles](https://www.fun-mooc.fr/courses/course-v1:agrocampusouest+40001S04+session04/about)

# Introduction - Multivariate analysis

In many applications we observe **p** variables on **n** individuals, where both p and n can be large.

Databases are becoming larger and larger, in terms of both the individuals and the variables measured on them. Studying each variable, and each pair of variables, with classical descriptive statistics is indispensable but insufficient.

The **multidimensional exploratory** methods make it possible:
* to take into account *the simultaneous variations* of a larger number of variables,
* to synthesize and/or simplify *the underlying structures*.

How can we perform a multiple factor analysis that handles several groups of continuous and/or categorical variables and/or contingency tables? And how can we improve the graphs obtained by the method?

[Tutorial by F. Husson](http://math.agrocampus-ouest.fr/infoglueDeliverLive/membres/Francois.Husson/Rcorner)

1. Are there groups of variables? If so, use multiple factor analysis (function `MFA` in FactoMineR).

2. What is the type of information?
    * Contingency table -> correspondence analysis (CA; AFC in French)
    * Several contingency tables -> multiple factor analysis for contingency tables (MFACT; AFMTC in French)
    * Table "individuals - variables" -> principal component analysis (PCA), multiple factor analysis (MFA)

3. What are the active elements? These are the elements that take part in the construction of the axes.

4. What are the supplementary (additional) elements? They do not take part in the construction of the axes but are useful for interpretation.

5. What is the nature of the active variables?
    * **Quantitative variables**: Principal Component Analysis (PCA)
    * **Qualitative variables**: Multiple Correspondence Analysis (MCA)
    * **Mixed variables**: factor analysis of mixed data (FAMD; AFDM in French)

    (*Whatever the method, the supplementary variables can be of both types.*)

6. Should we **standardize** (reduce) the quantitative variables?

7. Are there any **missing data**? How should they be treated?

8. The *steps of the analysis*
    * Run the factor analysis.
    * Describe the factorial axes using the initial active variables (`dimdesc`).
    * It may be useful to apply a clustering method to determine groups of individuals (`HCPC`).

To learn more, [see the video by F. Husson](https://www.youtube.com/watch?v=UrS00sOpeec) (in French).

# Principal component analysis

In this course, we only present how to analyze tables of quantitative variables using **principal component analysis** (PCA), and how to run the method with **FactoMineR** in **R**.

## Introduction

The aim of PCA is to summarize an individuals × variables data table in which the variables are quantitative.

PCA makes it possible to study the similarities between individuals from the point of view of a group of variables, and to bring out profiles of individuals.

It also gives an assessment of the linear relationships between variables, based on the correlation coefficients.

The two studies can be linked: individuals or groups of individuals are characterized by variables, and the links between variables are illustrated by characteristic individuals.

## Data - practicalities

### Which kinds of data

Principal component analysis (PCA) applies to data tables where **rows** can be considered as individuals and **columns** as quantitative variables.

We have **p variables** $X^{1},X^{2},\ldots,X^{p}$ observed on **n individuals**.

### Data table

We denote by $x^{j}_{i}$ the observation of the variable $X^{j}$ on the i-th individual.

$$X= \left[\begin{array}{ccc}
x_{1}^{1} & \dots & x_{1}^{p}\\
\vdots & \ddots & \vdots\\
x_{n}^{1} & \dots & x_{n}^{p}
\end{array}\right] \quad (n \mbox{ individuals}, \; p \mbox{ variables})$$

The mean and the standard deviation of the variable $X^{j}$ are

$$\bar{x}^{j}=\frac{1}{n}\sum_{i=1}^{n} x^{j}_{i}$$

$$\sigma^{j}=\sqrt{\frac{1}{n}\sum_{i=1}^{n} (x^{j}_{i}-\bar{x}^{j})^2}$$

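
A small numerical check of these definitions, written in base R on a made-up 4 × 2 table (the values are invented for illustration). It computes the mean and the standard deviation of each variable with the 1/n convention; note that R's built-in `sd()` divides by n − 1, so the 1/n version is computed by hand here.

```r
# Toy data table: n = 4 individuals (rows), p = 2 variables (columns).
# The matrix is filled column by column: X1 = 1, 2, 3, 6 and X2 = 10, 20, 30, 20.
X <- matrix(c(1, 2, 3, 6,
              10, 20, 30, 20),
            nrow = 4, dimnames = list(NULL, c("X1", "X2")))

n <- nrow(X)

# Mean of each variable: (1/n) * sum over the individuals.
x_bar <- colSums(X) / n

# Standard deviation of each variable with the 1/n convention.
centered <- sweep(X, 2, x_bar)            # subtract each column mean
sigma <- sqrt(colSums(centered^2) / n)

# Centered and reduced (standardized) table.
Z <- sweep(centered, 2, sigma, "/")

print(x_bar)
print(sigma)
print(colMeans(Z))               # numerically 0 for every variable
print(sqrt(colSums(Z^2) / n))    # 1 for every variable
```

After this transformation every variable has mean 0 and standard deviation 1, which is what "reducing" the quantitative variables refers to.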
The data table can be analyzed through its **rows** (individuals) or through its **columns** (variables).

$$X^{j} = (x^{j}_{1},\ldots,x^{j}_{n}) \quad \mbox{variable } j, \mbox{ in } \mathbb{R}^{n}$$

$$X_{i} = (x^{1}_{i},\ldots,x^{p}_{i}) \quad \mbox{individual } i, \mbox{ in } \mathbb{R}^{p}$$

# Studying individuals and variables

## Studying individuals

* When can we say that two individuals are similar with respect to all the variables, or to a group of variables?
* If there are many individuals, is it possible to categorize them?

⇒ groups of individuals, partition of the individuals

## Studying variables

* The correlation matrix gives a simple indication of the linear link between each pair of variables.
* Look for similarities between all the variables or within a group of variables.
* Synthetic indicators are sought to summarize groups of variables.

⇒ visualization of the correlation matrix
⇒ find a small number of synthetic variables that summarize many variables

## Links between the two points of view

* Characterize groups of individuals using variables.
* Use typical individuals to interpret groups of variables.

## Some examples

* **Sensory analysis**: score of descriptor k for product i
* **Ecology**: concentration of pollutant k in river i
* **Economy**: value of indicator k for year i
* **Genetics**: expression of gene k for patient i
* **Biology**: measurement k for animal i
* **Marketing**: value of satisfaction index k for brand i
* **Sociology**: time spent on activity k by the individuals of socio-professional category (CSP) i
* $\ldots$

## Example: Climate of different European countries

To illustrate this course, we will use **temperature data** to analyse the climate of different European countries.

### Description of the data

* average monthly temperatures (over 30 years)
* the annual average temperature and the thermal amplitude
* longitude and latitude of each city
* a qualitative variable: the region of Europe the city belongs to (north, south, east or west)

### Data extract
