Commit 37286307 authored by Nathan Rebiscoul's avatar Nathan Rebiscoul
Browse files

Add peer review

parent 3c4e7ea5
This diff is collapsed.
#+TITLE: Around Simpson's Paradox
#+AUTHOR: Nathan REBISCOUL
* Information about dataset
** Datas story
We will work on dataset made with the help of two studies. The first
one is a survey made in the town of Whickham in 1977. They ask at
one-sixth of constituents if they were smoking or not. This study was lead by Tunbridge and his
team. The second study was lead by Vanderpump in 1995. This one was
just looking on if peoples of the previous one was still alive. (The
study was not focusing on tabagism but some questions was related to
this subject)
** What we will use
For simplicity we will focus on women, and we will look at their
ages, if they were smoking and if they are still alive.
** Get datas
For this analysis we will use R with ggplot2 and dplyr.
First we import libraries and we get datas.
#+begin_src R :results output :session *R* :exports both
library(ggplot2)
library(dplyr)
data = read.csv(file = "Subject6_smoking.csv", sep = ",", header = T)
names(data)
head(data)
#+end_src
#+RESULTS:
#+begin_example
[1] "Smoker" "Status" "Age"
Smoker Status Age
1 Yes Alive 21.0
2 Yes Alive 19.3
3 No Dead 57.5
4 No Alive 47.1
5 Yes Alive 81.4
6 No Alive 36.8
#+end_example
* Question 1 : Mortality in the two groups
We will look at the death rate of both group.
First we get the number of people alive and smoker and the number of
people that are dead and smoker.
#+begin_src R :results output :session *R* :exports both
nbAliveSmoker <- sum(data$Status == "Alive" & data$Smoker == "Yes")
nbAliveSmoker
nbDeadSmoker <- sum(data$Status == "Dead" & data$Smoker == "Yes")
nbDeadSmoker
#+end_src
#+RESULTS:
:
: [1] 443
:
: [1] 139
And then we compute the death rate of the population.
#+begin_src R :results output :session *R* :exports both
nbSmoker <- nbAliveSmoker + nbDeadSmoker
deathRateSmoker <- nbDeadSmoker / nbSmoker
deathRateSmoker
#+end_src
#+RESULTS:
:
: [1] 0.2388316
We do the same thing for non smoker people.
#+begin_src R :results output :session *R* :exports both
nbAliveNonSmoker <- sum(data$Status == "Alive" & data$Smoker == "No")
nbAliveNonSmoker
nbDeadNonSmoker <- sum(data$Status == "Dead" & data$Smoker == "No")
nbDeadNonSmoker
nbNonSmoker <- nbAliveNonSmoker + nbDeadNonSmoker
deathRateNonSmoker <- nbDeadNonSmoker / nbNonSmoker
deathRateNonSmoker
#+end_src
#+RESULTS:
:
: [1] 502
:
: [1] 230
:
: [1] 0.3142077
We can see that the death rate of non smoker people is higher than
smoker people. It's a surprising result because smoking is bad for
health so we can think that smoker's has a higher chance of dying.
We will plot result with two barplot.
#+begin_src R :results output graphics :file img/fig1.png :session *R* :exports both
ggplot(data) + facet_wrap(~ Smoker) +
aes(x = Status, fill = factor(Status)) + geom_bar() +
labs(title = "Number of alive and dead people for smoker and non smoker")
#+end_src
#+RESULTS:
[[file:img/fig1.png]]
We can see on this graph that indeed the death rate of non smoker
people is higher than smoker people.
Results are surprising but it does not mean that it's false or that
data are false too. It could also mean that there is things that we
not considered.
* Question 2 : Ages classes splitting
Ages is a factor of death that we can find on this dataset.
We will split group in 4 groups of ages (18-34, 34-54, 55-64, 65+).
We do it for all non smoker
#+begin_src R :results output :session *R* :exports both
nbANS18 <- sum(data$Status == "Alive" & data$Smoker == "No" & data$Age >=18 & data$Age < 34)
nbDNS18 <- sum(data$Status == "Dead" & data$Smoker == "No" & data$Age >=18 & data$Age < 34)
nbANS34 <- sum(data$Status == "Alive" & data$Smoker == "No" & data$Age >=34 & data$Age < 54)
nbDNS34 <- sum(data$Status == "Dead" & data$Smoker == "No" & data$Age >=34 & data$Age < 54)
nbANS54 <- sum(data$Status == "Alive" & data$Smoker == "No" & data$Age >=55 & data$Age <= 64)
nbDNS54 <- sum(data$Status == "Dead" & data$Smoker == "No" & data$Age >=55 & data$Age <= 64)
nbANS65 <- sum(data$Status == "Alive" & data$Smoker == "No" & data$Age >=65)
nbDNS65 <- sum(data$Status == "Dead" & data$Smoker == "No" & data$Age >=65)
#+end_src
#+RESULTS:
And for all smoker
#+begin_src R :results output :session *R* :exports both
nbAS18 <- sum(data$Status == "Alive" & data$Smoker == "Yes" & data$Age >=18 & data$Age < 34)
nbDS18 <- sum(data$Status == "Dead" & data$Smoker == "Yes" & data$Age >=18 & data$Age < 34)
nbAS34 <- sum(data$Status == "Alive" & data$Smoker == "Yes" & data$Age >=34 & data$Age < 54)
nbDS34 <- sum(data$Status == "Dead" & data$Smoker == "Yes" & data$Age >=34 & data$Age < 54)
nbAS54 <- sum(data$Status == "Alive" & data$Smoker == "Yes" & data$Age >=55 & data$Age <= 64)
nbDS54 <- sum(data$Status == "Dead" & data$Smoker == "Yes" & data$Age >=55 & data$Age <= 64)
nbAS65 <- sum(data$Status == "Alive" & data$Smoker == "Yes" & data$Age >=65)
nbDS65 <- sum(data$Status == "Dead" & data$Smoker == "Yes" & data$Age >=65)
#+end_src
#+RESULTS:
We compute the rate for ages between 18 and 34.
#+begin_src R :results output :session *R* :exports both
deathRateNonSmoker18 <- nbDNS18/(nbANS18+nbDNS18)
deathRateSmoker18 <- nbDS18/(nbAS18+nbDS18)
deathRateNonSmoker18
deathRateSmoker18
#+end_src
#+RESULTS:
:
: [1] 0.02739726
:
: [1] 0.02793296
We compute the rate for ages between 34 and 54.
#+begin_src R :results output :session *R* :exports both
deathRateNonSmoker34 <- nbDNS34/(nbANS34+nbDNS34)
deathRateSmoker34 <- nbDS34/(nbAS34+nbDS34)
deathRateNonSmoker34
deathRateSmoker34
#+end_src
#+RESULTS:
:
: [1] 0.09547739
:
: [1] 0.1715481
We compute the rate for ages between 18 and 34.
#+begin_src R :results output :session *R* :exports both
deathRateNonSmoker54 <- nbDNS54/(nbANS54+nbDNS54)
deathRateSmoker54 <- nbDS54/(nbAS54+nbDS54)
deathRateNonSmoker54
deathRateSmoker54
#+end_src
#+RESULTS:
:
: [1] 0.3305785
:
: [1] 0.4434783
We compute the rate for ages higher than 65.
#+begin_src R :results output :session *R* :exports both
deathRateNonSmoker65 <- nbDNS65/(nbANS65+nbDNS65)
deathRateSmoker65 <- nbDS65/(nbAS65+nbDS65)
deathRateNonSmoker65
deathRateSmoker65
#+end_src
#+RESULTS:
:
: [1] 0.8549223
:
: [1] 0.8571429
We can see that more age is higher more the death rate is high (that
seems logical). But again the death rate of smokers is lower than
deathrate of non smokers. It is again a little bit surprising.
Now we will try to plot the barplot like before.
We will modify our dataset by regrouping ages to make easier a future plot.
#+begin_src R :results output :session *R* :exports both
data %>%
transmute(
Smoker=data$Smoker,
Status=data$Status,
Age = case_when(
data$Age >=18 & data$Age < 34 ~ "18-34",
data$Age >=34 & data$Age < 54 ~ "34-54",
data$Age >=54 & data$Age < 65 ~ "54-64",
data$Age >= 65 ~ "65+",
)) -> dataModified
head(dataModified)
#+end_src
#+RESULTS:
:
: Smoker Status Age
: 1 Yes Alive 18-34
: 2 Yes Alive 18-34
: 3 No Dead 54-64
: 4 No Alive 34-54
: 5 Yes Alive 65+
: 6 No Alive 34-54
#+begin_src R :results output graphics :file img/fig2.png :session *R* :exports both
ggplot(dataModified) +
facet_grid(Smoker ~ Age) +
aes(x = Status, fill = factor(Status)) +
geom_bar() + labs(title = "Number of alive and dead people for smoker and non smoker group by ages")
#+end_src
#+RESULTS:
[[file:img/fig2.png]]
We can also see on this graph that often the death rate for non smoker
is often higher. But we can also see that there is a lot of old (65+)
people in the non smoker group. And lot of them died (probably by
their ages). Maybe this is this group that change the statistics. But
we also saw that in other groups the problem ws the same so we cannot
going to quick into conclusion. Maybe because smoker died more quickly
there is more non smoker old people thus statistics are changed. So we
cannot conclude maybe we need another kind of study to show the
dangerousness of smoking.
% Created 2022-01-02 dim. 20:43
% Intended LaTeX compiler: pdflatex
\documentclass[11pt]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{graphicx}
\usepackage{grffile}
\usepackage{longtable}
\usepackage{wrapfig}
\usepackage{rotating}
\usepackage[normalem]{ulem}
\usepackage{amsmath}
\usepackage{textcomp}
\usepackage{amssymb}
\usepackage{capt-of}
\usepackage{hyperref}
\author{Nathan REBISCOUL}
\date{\today}
\title{Around Simpson's Paradox}
\hypersetup{
pdfauthor={Nathan REBISCOUL},
pdftitle={Around Simpson's Paradox},
pdfkeywords={},
pdfsubject={},
pdfcreator={Emacs 26.3 (Org mode 9.1.9)},
pdflang={English}}
\begin{document}
\maketitle
\tableofcontents
\section{Information about dataset}
\label{sec:orgf5ce458}
\subsection{Datas story}
\label{sec:org2f482e8}
We will work on dataset made with the help of two studies. The first
one is a survey made in the town of Whickham in 1977. They ask at
one-sixth of constituents if they were smoking or not. This study was lead by Tunbridge and his
team. The second study was lead by Vanderpump in 1995. This one was
just looking on if peoples of the previous one was still alive. (The
study was not focusing on tabagism but some questions was related to
this subject)
\subsection{What we will use}
\label{sec:org78863b2}
For simplicity we will focus on women, and we will look at their
ages, if they were smoking and if they are still alive.
\subsection{Get datas}
\label{sec:org14778b0}
For this analysis we will use R with ggplot2 and dplyr.
First we import libraries and we get datas.
\begin{verbatim}
library(ggplot2)
library(dplyr)
data = read.csv(file = "Subject6_smoking.csv", sep = ",", header = T)
names(data)
head(data)
\end{verbatim}
\begin{verbatim}
[1] "Smoker" "Status" "Age"
Smoker Status Age
1 Yes Alive 21.0
2 Yes Alive 19.3
3 No Dead 57.5
4 No Alive 47.1
5 Yes Alive 81.4
6 No Alive 36.8
\end{verbatim}
\section{Question 1 : Mortality in the two groups}
\label{sec:org99ef84f}
We will look at the death rate of both group.
First we get the number of people alive and smoker and the number of
people that are dead and smoker.
\begin{verbatim}
nbAliveSmoker <- sum(data$Status == "Alive" & data$Smoker == "Yes")
nbAliveSmoker
nbDeadSmoker <- sum(data$Status == "Dead" & data$Smoker == "Yes")
nbDeadSmoker
\end{verbatim}
\begin{verbatim}
[1] 443
[1] 139
\end{verbatim}
And then we compute the death rate of the population.
\begin{verbatim}
nbSmoker <- nbAliveSmoker + nbDeadSmoker
deathRateSmoker <- nbDeadSmoker / nbSmoker
deathRateSmoker
\end{verbatim}
\begin{verbatim}
[1] 0.2388316
\end{verbatim}
We do the same thing for non smoker people.
\begin{verbatim}
nbAliveNonSmoker <- sum(data$Status == "Alive" & data$Smoker == "No")
nbAliveNonSmoker
nbDeadNonSmoker <- sum(data$Status == "Dead" & data$Smoker == "No")
nbDeadNonSmoker
nbNonSmoker <- nbAliveNonSmoker + nbDeadNonSmoker
deathRateNonSmoker <- nbDeadNonSmoker / nbNonSmoker
deathRateNonSmoker
\end{verbatim}
\begin{verbatim}
[1] 502
[1] 230
[1] 0.3142077
\end{verbatim}
We can see that the death rate of non smoker people is higher than
smoker people. It's a surprising result because smoking is bad for
health so we can think that smoker's has a higher chance of dying.
We will plot result with two barplot.
\begin{verbatim}
ggplot(data) + facet_wrap(~ Smoker) +
aes(x = Status, fill = factor(Status)) + geom_bar() +
labs(title = "Number of alive and dead people for smoker and non smoker")
\end{verbatim}
\begin{center}
\includegraphics[width=.9\linewidth]{img/fig1.png}
\end{center}
We can see on this graph that indeed the death rate of non smoker
people is higher than smoker people.
Results are surprising but it does not mean that it's false or that
data are false too. It could also mean that there is things that we
not considered.
\section{Question 2 : Ages classes splitting}
\label{sec:org52d4b94}
Ages is a factor of death that we can find on this dataset.
We will split group in 4 groups of ages (18-34, 34-54, 55-64, 65+).
We do it for all non smoker
\begin{verbatim}
nbANS18 <- sum(data$Status == "Alive" & data$Smoker == "No" & data$Age >=18 & data$Age < 34)
nbDNS18 <- sum(data$Status == "Dead" & data$Smoker == "No" & data$Age >=18 & data$Age < 34)
nbANS34 <- sum(data$Status == "Alive" & data$Smoker == "No" & data$Age >=34 & data$Age < 54)
nbDNS34 <- sum(data$Status == "Dead" & data$Smoker == "No" & data$Age >=34 & data$Age < 54)
nbANS54 <- sum(data$Status == "Alive" & data$Smoker == "No" & data$Age >=55 & data$Age <= 64)
nbDNS54 <- sum(data$Status == "Dead" & data$Smoker == "No" & data$Age >=55 & data$Age <= 64)
nbANS65 <- sum(data$Status == "Alive" & data$Smoker == "No" & data$Age >=65)
nbDNS65 <- sum(data$Status == "Dead" & data$Smoker == "No" & data$Age >=65)
\end{verbatim}
And for all smoker
\begin{verbatim}
nbAS18 <- sum(data$Status == "Alive" & data$Smoker == "Yes" & data$Age >=18 & data$Age < 34)
nbDS18 <- sum(data$Status == "Dead" & data$Smoker == "Yes" & data$Age >=18 & data$Age < 34)
nbAS34 <- sum(data$Status == "Alive" & data$Smoker == "Yes" & data$Age >=34 & data$Age < 54)
nbDS34 <- sum(data$Status == "Dead" & data$Smoker == "Yes" & data$Age >=34 & data$Age < 54)
nbAS54 <- sum(data$Status == "Alive" & data$Smoker == "Yes" & data$Age >=55 & data$Age <= 64)
nbDS54 <- sum(data$Status == "Dead" & data$Smoker == "Yes" & data$Age >=55 & data$Age <= 64)
nbAS65 <- sum(data$Status == "Alive" & data$Smoker == "Yes" & data$Age >=65)
nbDS65 <- sum(data$Status == "Dead" & data$Smoker == "Yes" & data$Age >=65)
\end{verbatim}
We compute the rate for ages between 18 and 34.
\begin{verbatim}
deathRateNonSmoker18 <- nbDNS18/(nbANS18+nbDNS18)
deathRateSmoker18 <- nbDS18/(nbAS18+nbDS18)
deathRateNonSmoker18
deathRateSmoker18
\end{verbatim}
\begin{verbatim}
[1] 0.02739726
[1] 0.02793296
\end{verbatim}
We compute the rate for ages between 34 and 54.
\begin{verbatim}
deathRateNonSmoker34 <- nbDNS34/(nbANS34+nbDNS34)
deathRateSmoker34 <- nbDS34/(nbAS34+nbDS34)
deathRateNonSmoker34
deathRateSmoker34
\end{verbatim}
\begin{verbatim}
[1] 0.09547739
[1] 0.1715481
\end{verbatim}
We compute the rate for ages between 18 and 34.
\begin{verbatim}
deathRateNonSmoker54 <- nbDNS54/(nbANS54+nbDNS54)
deathRateSmoker54 <- nbDS54/(nbAS54+nbDS54)
deathRateNonSmoker54
deathRateSmoker54
\end{verbatim}
\begin{verbatim}
[1] 0.3305785
[1] 0.4434783
\end{verbatim}
We compute the rate for ages higher than 65.
\begin{verbatim}
deathRateNonSmoker65 <- nbDNS65/(nbANS65+nbDNS65)
deathRateSmoker65 <- nbDS65/(nbAS65+nbDS65)
deathRateNonSmoker65
deathRateSmoker65
\end{verbatim}
\begin{verbatim}
[1] 0.8549223
[1] 0.8571429
\end{verbatim}
We can see that more age is higher more the death rate is high (that
seems logical). But again the death rate of smokers is lower than
deathrate of non smokers. It is again a little bit surprising.
Now we will try to plot the barplot like before.
We will modify our dataset by regrouping ages to make easier a future plot.
\begin{verbatim}
data %>%
transmute(
Smoker=data$Smoker,
Status=data$Status,
Age = case_when(
data$Age >=18 & data$Age < 34 ~ "18-34",
data$Age >=34 & data$Age < 54 ~ "34-54",
data$Age >=54 & data$Age < 65 ~ "54-64",
data$Age >= 65 ~ "65+",
)) -> dataModified
head(dataModified)
\end{verbatim}
\begin{verbatim}
Smoker Status Age
1 Yes Alive 18-34
2 Yes Alive 18-34
3 No Dead 54-64
4 No Alive 34-54
5 Yes Alive 65+
6 No Alive 34-54
\end{verbatim}
\begin{verbatim}
ggplot(dataModified) +
facet_grid(Smoker ~ Age) +
aes(x = Status, fill = factor(Status)) +
geom_bar() + labs(title = "Number of alive and dead people for smoker and non smoker group by ages")
\end{verbatim}
\begin{center}
\includegraphics[width=.9\linewidth]{img/fig2.png}
\end{center}
We can also see on this graph that often the death rate for non smoker
is often higher. But we can also see that there is a lot of old (65+)
people in the non smoker group. And lot of them died (probably by
their ages). Maybe this is this group that change the statistics. But
we also saw that in other groups the problem ws the same so we cannot
going to quick into conclusion. Maybe because smoker died more quickly
there is more non smoker old people thus statistics are changed. So we
cannot conclude maybe we need another kind of study to show the
dangerousness of smoking.
\end{document}
\ No newline at end of file
......@@ -57,9 +57,11 @@ plt.xlabel('feet size')
plt.ylabel('number of mistakes')
plt.savefig('fig1_feetsize.png')
plt.show()
#+end_src
#+RESULTS:
[[file:fig1_feetsize.png]]
*** Interpretation
......@@ -77,7 +79,7 @@ dataset. And it's diminished by the scatter trend line.
The choice to put feet size on x-axis is debatable.
** TODO Find another way of presentation
* Conclusion
Our dataset show a strong correlation between feetsize and number of
......
......@@ -241,16 +241,16 @@ library(tidyverse)
P = 35
N = 20
df = data.frame(val = runif(n = N*P, min = 0, max = 2), group = 1-P)
df %>% group_by(group) %>% summarise(mean_val = mean(val)) -> df_agg
df_agg %>% ggplot() * geom_histogram(aes(x=mean_val), binwidth = 0.005) + theme_bw() + xlim(0.2)
df = data.frame(val = runif(n = N*P, min = 0, max = 2), group = 1:P)
df %>% group_by(group) %>% summarize(val_mean = mean(val), val_sd = sd(val)) -> df_agg
df_agg %>% ggplot() + geom_histogram(aes(x = val_mean), binwidth = 0.01) + theme_bw() + xlim(0,2)
#+end_src