Commit c5eec54e authored by Laurence Viry

modif manipDon 2/07/18

parent 7c8c449f
Pipeline #9419 passed with stages
in 47 seconds
individu;taille;poids;pointure;sexe
roger;184;80;44;M
theodule;175,5;78;43;M
theodule;175,5;5;43;M
nicolas;158;72;42;M
{
"cells": [
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"setwd(\"/home/viryl/notebooks/ATMO_IntroR\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
......@@ -25,6 +36,7 @@
"\n",
"Proprietary formats from other software can also be read with a suitable package (for example the **foreign** package); the choice depends on the context and on the volume of data.\n",
"\n",
"## Importing data in text format\n",
"### The case of **csv** files \n",
"\n",
"Advantages of **csv** files:\n",
......@@ -47,22 +59,40 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"'/home/viryl/notebooks/ATMO_IntroR/notebooks'"
"'/home/viryl/notebooks/ATMO_IntroR'"
],
"text/latex": [
"'/home/viryl/notebooks/ATMO\\_IntroR'"
],
"text/markdown": [
"'/home/viryl/notebooks/ATMO_IntroR'"
],
"text/plain": [
"[1] \"/home/viryl/notebooks/ATMO_IntroR\""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"'list'"
],
"text/latex": [
"'/home/viryl/notebooks/ATMO\\_IntroR/notebooks'"
"'list'"
],
"text/markdown": [
"'/home/viryl/notebooks/ATMO_IntroR/notebooks'"
"'list'"
],
"text/plain": [
"[1] \"/home/viryl/notebooks/ATMO_IntroR/notebooks\""
"[1] \"list\""
]
},
"metadata": {},
......@@ -104,89 +134,19 @@
],
"source": [
"# Read the file donnees.csv\n",
"#setwd(\"/home/viryl/notebooks/ATMO_IntroR/data\")\n",
"getwd() # working directory\n",
"don <- read.csv(file = \"data/donnees.csv\",header=TRUE,sep=\";\",dec=\",\",row.names=1)\n",
"mode(don)\n",
"class(don)\n",
"summary(don)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" - the **sep** argument: indicates that values are separated by **\";\"** (**\" \"** for a space, **\"\\t\"** for a tab)\n",
" - the **dec** argument: indicates that the decimal separator is **\",\"**\n",
" - the **header** argument: indicates whether the first line contains the variable names (TRUE) or not (FALSE).\n",
" - the **row.names** argument: indicates that column 1 is not a variable but the identifier of the individuals.\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* A special string can mark missing data:\n",
"\n",
"The file **don2.csv** contains missing values coded as **\"\\*\\*\\*\"**, so we add the **na.strings** argument"
]
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" taille poids pointure sexe \n",
" Min. :100.0 Min. :15 Min. :22.00 F :2 \n",
" 1st Qu.:110.0 1st Qu.:30 1st Qu.:30.50 M :3 \n",
" Median :160.0 Median :72 Median :40.00 NA's:1 \n",
" Mean :145.9 Mean :55 Mean :36.17 \n",
" 3rd Qu.:175.5 3rd Qu.:78 3rd Qu.:42.75 \n",
" Max. :184.0 Max. :80 Max. :44.00 \n",
" NA's :1 NA's :1 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"'list'"
],
"text/latex": [
"'list'"
],
"text/markdown": [
"'list'"
],
"text/plain": [
"[1] \"list\""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"'data.frame'"
],
"text/latex": [
"'data.frame'"
],
"text/markdown": [
"'data.frame'"
],
"text/plain": [
"[1] \"data.frame\""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
......@@ -206,9 +166,6 @@
"\t<li>'roger'</li>\n",
"\t<li>'theodule'</li>\n",
"\t<li>'nicolas'</li>\n",
"\t<li>'Alice'</li>\n",
"\t<li>'Marcel'</li>\n",
"\t<li>'Claire'</li>\n",
"</ol>\n",
"</dd>\n",
"</dl>\n"
......@@ -227,9 +184,6 @@
"\\item 'roger'\n",
"\\item 'theodule'\n",
"\\item 'nicolas'\n",
"\\item 'Alice'\n",
"\\item 'Marcel'\n",
"\\item 'Claire'\n",
"\\end{enumerate*}\n",
"\n",
"\\end{description}\n"
......@@ -249,9 +203,6 @@
": 1. 'roger'\n",
"2. 'theodule'\n",
"3. 'nicolas'\n",
"4. 'Alice'\n",
"5. 'Marcel'\n",
"6. 'Claire'\n",
"\n",
"\n",
"\n",
......@@ -266,16 +217,51 @@
"[1] \"data.frame\"\n",
"\n",
"$row.names\n",
"[1] \"roger\" \"theodule\" \"nicolas\" \"Alice\" \"Marcel\" \"Claire\" \n"
"[1] \"roger\" \"theodule\" \"nicolas\" \n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"'data.frame':\t3 obs. of 4 variables:\n",
" $ taille : num 184 176 158\n",
" $ poids : int 80 78 72\n",
" $ pointure: int 44 43 42\n",
" $ sexe : Factor w/ 1 level \"M\": 1 1 1\n"
]
},
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>'taille'</li>\n",
"\t<li>'poids'</li>\n",
"\t<li>'pointure'</li>\n",
"\t<li>'sexe'</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 'taille'\n",
"\\item 'poids'\n",
"\\item 'pointure'\n",
"\\item 'sexe'\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 'taille'\n",
"2. 'poids'\n",
"3. 'pointure'\n",
"4. 'sexe'\n",
"\n",
"\n"
],
"text/plain": [
"[1] NA"
"[1] \"taille\" \"poids\" \"pointure\" \"sexe\" "
]
},
"metadata": {},
......@@ -284,16 +270,98 @@
{
"data": {
"text/html": [
"55"
"3"
],
"text/latex": [
"55"
"3"
],
"text/markdown": [
"55"
"3"
],
"text/plain": [
"[1] 55"
"[1] 3"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"4"
],
"text/latex": [
"4"
],
"text/markdown": [
"4"
],
"text/plain": [
"[1] 4"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<ol>\n",
"\t<li><ol class=list-inline>\n",
"\t<li>'roger'</li>\n",
"\t<li>'theodule'</li>\n",
"\t<li>'nicolas'</li>\n",
"</ol>\n",
"</li>\n",
"\t<li><ol class=list-inline>\n",
"\t<li>'taille'</li>\n",
"\t<li>'poids'</li>\n",
"\t<li>'pointure'</li>\n",
"\t<li>'sexe'</li>\n",
"</ol>\n",
"</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate}\n",
"\\item \\begin{enumerate*}\n",
"\\item 'roger'\n",
"\\item 'theodule'\n",
"\\item 'nicolas'\n",
"\\end{enumerate*}\n",
"\n",
"\\item \\begin{enumerate*}\n",
"\\item 'taille'\n",
"\\item 'poids'\n",
"\\item 'pointure'\n",
"\\item 'sexe'\n",
"\\end{enumerate*}\n",
"\n",
"\\end{enumerate}\n"
],
"text/markdown": [
"1. 1. 'roger'\n",
"2. 'theodule'\n",
"3. 'nicolas'\n",
"\n",
"\n",
"\n",
"2. 1. 'taille'\n",
"2. 'poids'\n",
"3. 'pointure'\n",
"4. 'sexe'\n",
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
"[[1]]\n",
"[1] \"roger\" \"theodule\" \"nicolas\" \n",
"\n",
"[[2]]\n",
"[1] \"taille\" \"poids\" \"pointure\" \"sexe\" \n"
]
},
"metadata": {},
......@@ -301,37 +369,78 @@
}
],
"source": [
"don2 <- read.csv(file = \"data/don2.csv\",header=TRUE,sep=\";\",dec=\",\",row.names=1,na.strings=\"***\")\n",
"summary(don2)\n",
"mode(don2)\n",
"class(don2)\n",
"attributes(don2)\n",
"mean(don2$poids)\n",
"mean(don2$poids,na.rm=TRUE)"
"# Functions commonly used on a data frame\n",
"attributes(don)\n",
"str(don)\n",
"names(don)\n",
"nrow(don)\n",
"ncol(don)\n",
"dimnames(don)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* The path can be a URL:"
" - the **sep** argument: indicates that values are separated by **\";\"** (**\" \"** for a space, **\"\\t\"** for a tab, $\\ldots$)\n",
" - the **dec** argument: indicates that the decimal separator is **\",\"**\n",
" - the **header** argument: indicates whether the first line contains the variable names (TRUE) or not (FALSE).\n",
" - the **row.names** argument: indicates that column 1 is not a variable but the identifier of the individuals.\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* A special string can mark missing data:\n",
"\n",
"The file **don2.csv** contains missing values coded as **\"\\*\\*\\*\"**, so we add the **na.strings** argument"
]
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" SCORE TIME DECIMAL.TIME CLASS \n",
" Min. : 6.0 03:22:50:1 Min. :0.140 BEST :2 \n",
" 1st Qu.: 8.0 07:42:51:1 1st Qu.:0.320 INTERMEDIATE:1 \n",
" Median :13.0 09:30:03:1 Median :0.400 WORST :2 \n",
" Mean :12.4 12:01:03:1 Mean :0.372 \n",
" 3rd Qu.:16.0 12:01:29:1 3rd Qu.:0.500 \n",
" Max. :19.0 Max. :0.500 "
" taille poids pointure sexe \n",
" Min. :100.0 Min. :15 Min. :22.00 F :2 \n",
" 1st Qu.:110.0 1st Qu.:30 1st Qu.:30.50 M :3 \n",
" Median :160.0 Median :72 Median :40.00 NA's:1 \n",
" Mean :145.9 Mean :55 Mean :36.17 \n",
" 3rd Qu.:175.5 3rd Qu.:78 3rd Qu.:42.75 \n",
" Max. :184.0 Max. :80 Max. :44.00 \n",
" NA's :1 NA's :1 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[1] NA"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"55"
],
"text/latex": [
"55"
],
"text/markdown": [
"55"
],
"text/plain": [
"[1] 55"
]
},
"metadata": {},
......@@ -339,29 +448,17 @@
}
],
"source": [
"df <- read.table(\"https://s3.amazonaws.com/assets.datacamp.com/blog_assets/scores_timed.csv\",header=TRUE,row.names = 1,sep = \",\")\n",
"summary(df)"
"don2 <- read.csv(file = \"data/don2.csv\",header=TRUE,sep=\";\",dec=\",\",row.names=1,na.strings=\"***\")\n",
"summary(don2)\n",
"mean(don2$poids)\n",
"mean(don2$poids,na.rm=TRUE)"
]
},
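{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal, self-contained sketch of the arguments described above, assuming a small made-up table rather than the real **don2.csv** (the rows and the names `csv_text` and `ex` are hypothetical). The data is passed through the `text` argument of **read.csv** so that no file is needed:\n",
"\n",
"```r\n",
"# Made-up data in the same layout as donnees.csv; '***' marks a missing value\n",
"csv_text <- 'individu;taille;poids;pointure;sexe\n",
"roger;184;80;44;M\n",
"theodule;175,5;78;43;M\n",
"alice;160;***;38;F'\n",
"\n",
"# sep = ';' splits the fields, dec = ',' reads 175,5 as 175.5,\n",
"# row.names = 1 uses the first column as row identifiers,\n",
"# na.strings = '***' turns the '***' code into NA\n",
"ex <- read.csv(text = csv_text, header = TRUE, sep = ';', dec = ',',\n",
"               row.names = 1, na.strings = '***')\n",
"\n",
"str(ex)\n",
"mean(ex$poids)                # NA, because one value is missing\n",
"mean(ex$poids, na.rm = TRUE)  # mean of the observed values only\n",
"```\n",
"\n",
"Without `na.strings`, the `poids` column would not be read as numeric, since `'***'` is not a number."
]
},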
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Useful functions for a data frame\n",
"\n",
"* **head()** - shows the first 6 rows \n",
"* **tail()** - shows the last 6 rows \n",
"* **dim()** - its dimensions \n",
"* **nrow()** - the number of rows \n",
"* **ncol()** - the number of columns \n",
"* **str()** - the structure of each column\n",
"* **names()** - lists the **names** attribute of a data.frame (or of any other object), i.e. the column names\n",
"* **dimnames()** - lists the row names and the column names of a data.frame."
"* The path can be a URL:"
]
},
{
......@@ -370,11 +467,8 @@
"metadata": {},
"outputs": [],
"source": [
"str(df)\n",
"names(df)\n",
"nrow(df)\n",
"ncol(df)\n",
"dimnames(df)"
"df <- read.table(\"https://s3.amazonaws.com/assets.datacamp.com/blog_assets/scores_timed.csv\",header=TRUE,row.names = 1,sep = \",\")\n",
"summary(df)"
]
},
{
......@@ -390,25 +484,110 @@
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"help(scan)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"execution_count": 69,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"'list'"
],
"text/latex": [
"'list'"
],
"text/markdown": [
"'list'"
],
"text/plain": [
"[1] \"list\""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"5"
],
"text/latex": [
"5"
],
"text/markdown": [
"5"
],
"text/plain": [
"[1] 5"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"List of 5\n",
" $ : chr [1:3] \"roger\" \"theodule\" \"nicolas\"\n",
" $ : num [1:3] 184 176 158\n",
" $ : num [1:3] 80 5 72\n",
" $ : num [1:3] 44 43 42\n",
" $ : chr [1:3] \"M\" \"M\" \"M\"\n"
]
},
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>'roger'</li>\n",
"\t<li>'theodule'</li>\n",
"\t<li>'nicolas'</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 'roger'\n",
"\\item 'theodule'\n",
"\\item 'nicolas'\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 'roger'\n",
"2. 'theodule'\n",
"3. 'nicolas'\n",
"\n",
"\n"
],
"text/plain": [
"[1] \"roger\" \"theodule\" \"nicolas\" "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"'roger'"
],
"text/latex": [
"'roger'"
],
"text/markdown": [
"'roger'"
],
"text/plain": [
"[1] \"roger\""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"mydata <- scan(\"data/don.txt\",skip=1,what = list(\"\", 0, 0,0))\n",
"mydata <- scan(\"data/donnees.csv\",skip=1,sep=\";\",dec=\",\",what = list(\"\", 0, 0,0,\"\"))\n",
"class(mydata)\n",
"str(mydata)\n",
"mydata[[1]] # first variable\n",
"mydata[[1]][1] # first element of the first variable"
]
......@@ -417,9 +596,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example, scan reads 4 variables: the first in character mode and the next three in numeric mode.\n",
"In this example, **scan** reads *5 variables*: the first in character mode, the next three in numeric mode and the fifth in character mode.\n",
"\n",
"**mydata** is a list of 4 vectors.\n",
"**mydata** is a list of 5 vectors.\n",
"\n",
"* scan() can be used to create objects of different kinds (vectors, matrices, data frames, lists, ...).\n",
"*\n",
......@@ -432,14 +611,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### The case of Excel files"
"## Importing Excel files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Proprietary formats\n",
"## Importing files in a proprietary format\n",
"\n",
"R can also read files in other formats (**Excel, SAS, SPSS**, $\\ldots$) and access **databases**.\n",
"\n",
......@@ -451,9 +630,14 @@
" **read.spss**(\"file.sav\") # for SPSS format<br\\>\n",
" **read.mtp**(\"file.mtp\") # for Minitab Portable Worksheet<br\\>\n",
"\n",
"Another option for **SPSS files** is the **Hmisc** package<br\\>\n",
"\n",
"### Relational databases and other formats\n",
"Another option for **SPSS files** is the **Hmisc** package<br\\>"
]
},
{