Commit ffce6dc5 by Florian Gallay

### Final pass on the lectures, group them in one file

parent e6786645
---
title: "Is Batman Somewhere ?"
output: html_document
output: pdf_document
---
{r setup, include=FALSE}
... ...
@@ -20,13 +20,13 @@ head(data)
### Study of the relationship between brain weight and body mass
{r}
{r, out.width='40%', fig.align='center'}
phyto <- data[data$Diet==1,]
ggplot(phyto, aes(x = BOW, y = BRW)) + geom_point() + geom_smooth(method = "lm")
The mathematical model estimated by R is the following : $BRW = \beta_1 + \beta_2 \cdot BOW + \epsilon$ where $\beta_1$ and $\beta_2$ are two unknown parameters (resp. the intercept and the directing coefficient) and $\espilon$ represents the modelling error.
The mathematical model estimated by R is the following : $BRW = \beta_1 + \beta_2 \cdot BOW + \epsilon$ where $\beta_1$ and $\beta_2$ are two unknown parameters (resp. the intercept and the directing coefficient) and $\epsilon$ represents the modelling error.
{r}
reg1 = lm(BRW ~ BOW, data=phyto)
... ...
@@ -46,15 +46,17 @@ anova(reg1)
The additional informations provided here are the explained and residual variance (SSE and SSR) as well as the degree of freedom of the law used for the test. The sum of the residual squares represents the variations that are not explained by the model (the noise). Here, it is equal to 4253838.
{r}
{r, out.width='50%', fig.align='center'}
par(mfrow=c(1,2))
plot(reg1$fitted.values, reg1$residuals, xlab="Predicted", ylab="Residuals", pch = 20)
plot(reg1, 4)
We can see that there are two points that stand out from the rest : one point (at a predicted brain weight of around 3000) seems to be rather far from the estimated model and one point (at a predicted brain weight of around 10000) is really far from the others (it is the only point higher than 3000).
(left) We can see that there are two points that stand out from the rest : one point (at a predicted brain weight of around 3000) seems to be rather far from the estimated model and one point (at a predicted brain weight of around 10000) is really far from the others (it is the only point higher than 3000). The second point could be a bit problematic as it could have a large impact on the estimation.
{r}
plot(reg1, 4)
{r, out.width='50%', fig.align='center'}
The graph above shows us that the seventh observation in the dataset is considered as a "major" outlier. This observation corresponds to the point with a predicted brain weight of 10000.
... ...
@@ -76,7 +78,7 @@ anova(reg2)
We can confirm this by looking at the analysis of variance table: the sum of squares residuals is much smaller than before.
{r}
{r, out.width='80%', fig.align='center'}
par(mfcol=c(2,2))
plot(reg1,2)
plot(reg1,3)
... ...
@@ -89,7 +91,7 @@ In graph 2 (normal Q-Q), we can see that for reg1 the points follow the bisector
In graph 3, we are checking whether the hypothesis on the variance of the errors is reasonable or not. The red line should be approximately horizontal and the spread of the points around it should not have a structure. This is obviously not the case for the plot using reg1 (the line is not horizontal at all). For reg2, the line is not really horizontal on the left-hand side and the points seem to be much more spread vertically at a fitted value of 500 when compared to 1000. The hypothesis on the variance of the errors is not that good here either. The low amount of measurement for larger fitted value could have an impact on this.
### Study of the contribution to the total weight of each part of the brain
{r}
{r, out.width='40%', fig.align='center'}
phytoNum = phyto[, c(4:8)]
mat.cor = cor(phytoNum)
corrplot(mat.cor, type="upper")
... ...
@@ -120,7 +122,7 @@ anova(regm)
The model corresponding to this analysis is the following: $BRW = \beta_0 + \beta_1 \cdot AUD + \beta_2 \cdot MOB + \beta_3 \cdot HIP + \epsilon$ where $\beta_{i \in \{0,..,3\}}$ are unknown not observable parameters.
{r}
{r, out.width='70%', fig.align='center'}
par(mfcol=c(2,2))
plot(regm)
... ...
@@ -138,12 +140,12 @@ reg0 = lm(BRW ~ 1, data = phyto)
step(reg0, scope = BRW ~ AUD + MOB + HIP, direction = "forward")
The step function compares different model by adding variables one by one. It uses the AIC score (Akaike Information Criterion) to compare them. This score compares model looking at how much of the variations is explained by a model as well as the its number of variables (it penalized models containing more variables).
The step function compares different model by adding variables one by one. It uses the AIC score (Akaike Information Criterion) to compare them. This score compares model looking at how much of the variations is explained by a model as well as its number of variables (it penalizes models containing more variables).
In our case, we can see that the model with the best (lowest) score is the model containing all three variables (HIP, AUD and MOB). We can conclude that, despite the previous observations, the variable MOB seem to help explain some of the variations of the data.
### Link between volume of the auditory part and diet
{r}
{r, out.width='50%', fig.align='center'}
dataDiet_F = as.factor(data$Diet)
par(mfcol=c(1,2))
with(data, plot(AUD~Diet))
... ...
File added
No preview for this file type

Lecture on data visualization
Synthesize and represent data using tools

### Introduction

## Remarks on images

# Plot of banana exports in tons
- Colors and one of the axes -> same thing
- Picture + hard to read
- Time on the z axis is weird (usually on x, maybe should have put countries on z instead)
- Difficulties to compare (because of 3d ?)
- Legend of axes

# Monthly global mean temperature
- Identification of months difficult
- Global view, microscopic difficult
- Comparison difficult

# New Covid cases / million people
- Exponential scale (not clear ?)
- Difficult to compare on the x-axis

In general, picture with a lot of information, hard to focus on the information/message.

## Plan
Motivation behind the usefulness of pictures and how to communicate information
Then, criteria for good images (readability, intelligibility and no misunderstanding)

### How to get information from a data set
- Set of studied objects : population, each element : individual. Population too large -> study on a sample
- Study characteristic $X$ on population $P$ of size $n$. Possible values : $\Omega$
- Sample chosen randomly. Data sample = observed values on the population sample
- What information can we draw from the data sample ?
- What predictions could be made ? (Confidence ?)
- Two types of variables : qualitative (not numerical, nominal or ordinal) and quantitative (continuous or discrete)
- Statistical study : answer a question. Several steps :
  - Define the protocol to get the data
  - Data collection, coding, cleaning
  - Data exploration
  - Data pre-processing
  - If from a sample, statistical modelling
  - Forecasting, decision making

Today, focus on exploration
- Once you have a question and data, start exploring it ?
- Nominal qualitative variable -> create tables to get the relative frequency or number of occurrences of each modality (nb of male/female, single/couple, ...)
- 3D diagrams are **evil**
- CANNOT DO HISTOGRAMS WITH QUALITATIVE VARIABLES -> Nominal :
  - Pareto chart / bar plot : Pareto better with a lot of modalities
  - Stacked bar chart : hard to read if the data is not ordered (and the modalities with a low frequency are not removed)
  - Pie chart : difficult to read with a lot of modalities, nice if message is clear
- Ordinal :
  - Same thing but cannot reorder the data
- Quantitative variables :
  - Discrete/continuous depends. More about the number of repetitions of each modality than the mathematical nature of the data. Example : age in months vs in years
  - Too many modalities -> group into classes (continuous)
  - Statistical summary of
    - Position : Mean, mode, median, quartiles
    - Dispersion : Range, interquartile range, variance and standard deviation (variance = mean of squared distance to the mean, std = square root of variance)
    - Shape : ?
  - Representation :
    - Discrete : bar diagram
    - Continuous : histogram (inherits the problem of creating the classes, i.e. choosing their number, ...)
    - Empirical distribution function
    - Boxplots

### How to report results from model analysis
Problem : provide pictures that help the understanding
- Increase quality of paper/research, generates discussion

## Common mistakes
- Bar plots with values close together (?) (see Availability/unavailability during the week)
- Multiple curves on the same plot -> too much information
  - Q : couldn't this help to show correlation ?
  - A : not so good, this adds fuzziness. Better to have two plots and put a vertical line between the two for example.
- Non-relevant graphic objects (see CPU Type image)
  - This could occur when choosing bad classes for continuous variables
- Cheat (play with scales so that the data looks better)

See checklist for good graphics.
- Data : the nature of the data implies its representation
- Graphical objects : provide the readability of the graphics
- Annotations : they put a semantic on the graphics
- Context : A graphic should be a partial necessary information in a specific context
- Comparing two variables :
  - quanti x quali -> example heights : one histogram with the whole population and then split into two : man/woman ?
  - quali x quali : stacked bar chart, e.g. compare age and the continuation of studies
  - quanti x quanti : scatterplot or time series

TODO : Implement one of the graphs of the first exercise and then assemble both parts in a Jupyter notebook
HOMEWORK : Use the checklist to find mistakes in the pictures in the slides
... ...

---
title: "Lecture 3"
output: html_document
output: pdf_document
---
{r setup, include=FALSE}
... ...
@@ -35,7 +35,8 @@ mean(mtcars[mtcars$mpg > 20 & mtcars$cyl == 4,]$mpg)
{r, warning=FALSE}
library(dplyr)
mtcars %>% group_by(cyl, carb) %>% summarise(mpg_mean = mean(mpg), mpg_sd = sd(mpg), num = n()) %>% ungroup() %>% mutate(mpg_gmean = mean(mpg_mean), deviation = mpg_mean - mpg_gmean) -> mtcars_summary
mtcars %>% group_by(cyl, carb) %>% summarise(mpg_mean = mean(mpg), mpg_sd = sd(mpg), num = n()) %>% ungroup() %>% mutate(mpg_gmean = mean(mpg_mean), deviation = mpg_mean - mpg_gmean) -> mtcars_summary
mtcars_summary
... ...
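As a reading aid for the Lecture 3 hunk, here is a minimal sketch of what the grouped summary computes. It reuses the pipeline from the chunk on the built-in `mtcars` data; the names `unweighted_mean`, `weighted_mean` and `overall_mean` are illustrative and do not appear in the commit. The point it tries to show is that `mpg_gmean` is the unweighted mean of the group means, which in general differs from the overall mean of `mpg`.

```r
library(dplyr)

# Same pipeline as the lecture chunk: one row per (cyl, carb) group.
mtcars_summary <- mtcars %>%
  group_by(cyl, carb) %>%
  summarise(mpg_mean = mean(mpg), mpg_sd = sd(mpg), num = n()) %>%
  ungroup() %>%
  mutate(mpg_gmean = mean(mpg_mean), deviation = mpg_mean - mpg_gmean)

# mpg_gmean is the unweighted mean of the group means ...
unweighted_mean <- mean(mtcars_summary$mpg_mean)

# ... whereas weighting each group mean by its group size
# recovers the overall mean of mpg (illustrative names, not from the commit).
weighted_mean <- weighted.mean(mtcars_summary$mpg_mean, mtcars_summary$num)
overall_mean <- mean(mtcars$mpg)

print(c(unweighted = unweighted_mean,
        weighted = weighted_mean,
        overall = overall_mean))
```

Any group that contains a single row gets `NA` for `mpg_sd`, since `sd()` of a single value is `NA`; this may be worth keeping in mind when reading `mtcars_summary`.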
---
title: "Lecture 4"
output: html_document
output: pdf_document
---
{r setup, include=FALSE}
... ...
... ...
@@ -56,7 +56,7 @@ myData=read.csv(file="les-arbres.csv", sep=";",skip=3,header=T)
myData=myData[myData$X10!=0,]
We draw the height of a tree compared totheir circomference
We draw the height of a tree compared to their circomference
{r}
circ=myData$X70
... ...