### Add peer review

parent 3c4e7ea5
File moved
This diff is collapsed.
 #+TITLE: Around Simpson's Paradox #+AUTHOR: Nathan REBISCOUL * Information about dataset ** Datas story We will work on dataset made with the help of two studies. The first one is a survey made in the town of Whickham in 1977. They ask at one-sixth of constituents if they were smoking or not. This study was lead by Tunbridge and his team. The second study was lead by Vanderpump in 1995. This one was just looking on if peoples of the previous one was still alive. (The study was not focusing on tabagism but some questions was related to this subject) ** What we will use For simplicity we will focus on women, and we will look at their ages, if they were smoking and if they are still alive. ** Get datas For this analysis we will use R with ggplot2 and dplyr. First we import libraries and we get datas. #+begin_src R :results output :session *R* :exports both library(ggplot2) library(dplyr) data = read.csv(file = "Subject6_smoking.csv", sep = ",", header = T) names(data) head(data) #+end_src #+RESULTS: #+begin_example  "Smoker" "Status" "Age" Smoker Status Age 1 Yes Alive 21.0 2 Yes Alive 19.3 3 No Dead 57.5 4 No Alive 47.1 5 Yes Alive 81.4 6 No Alive 36.8 #+end_example * Question 1 : Mortality in the two groups We will look at the death rate of both group. First we get the number of people alive and smoker and the number of people that are dead and smoker. #+begin_src R :results output :session *R* :exports both nbAliveSmoker <- sum(data$Status == "Alive" & data$Smoker == "Yes") nbAliveSmoker nbDeadSmoker <- sum(data$Status == "Dead" & data$Smoker == "Yes") nbDeadSmoker #+end_src #+RESULTS: : :  443 : :  139 And then we compute the death rate of the population. #+begin_src R :results output :session *R* :exports both nbSmoker <- nbAliveSmoker + nbDeadSmoker deathRateSmoker <- nbDeadSmoker / nbSmoker deathRateSmoker #+end_src #+RESULTS: : :  0.2388316 We do the same thing for non smoker people. #+begin_src R :results output :session *R* :exports both nbAliveNonSmoker <- sum(data$Status == "Alive" & data$Smoker == "No") nbAliveNonSmoker nbDeadNonSmoker <- sum(data$Status == "Dead" & data$Smoker == "No") nbDeadNonSmoker nbNonSmoker <- nbAliveNonSmoker + nbDeadNonSmoker deathRateNonSmoker <- nbDeadNonSmoker / nbNonSmoker deathRateNonSmoker #+end_src #+RESULTS: : :  502 : :  230 : :  0.3142077 We can see that the death rate of non smoker people is higher than smoker people. It's a surprising result because smoking is bad for health so we can think that smoker's has a higher chance of dying. We will plot result with two barplot. #+begin_src R :results output graphics :file img/fig1.png :session *R* :exports both ggplot(data) + facet_wrap(~ Smoker) + aes(x = Status, fill = factor(Status)) + geom_bar() + labs(title = "Number of alive and dead people for smoker and non smoker") #+end_src #+RESULTS: [[file:img/fig1.png]] We can see on this graph that indeed the death rate of non smoker people is higher than smoker people. Results are surprising but it does not mean that it's false or that data are false too. It could also mean that there is things that we not considered. * Question 2 : Ages classes splitting Ages is a factor of death that we can find on this dataset. We will split group in 4 groups of ages (18-34, 34-54, 55-64, 65+). We do it for all non smoker #+begin_src R :results output :session *R* :exports both nbANS18 <- sum(data$Status == "Alive" & data$Smoker == "No" & data$Age >=18 & data$Age < 34) nbDNS18 <- sum(data$Status == "Dead" & data$Smoker == "No" & data$Age >=18 & data$Age < 34) nbANS34 <- sum(data$Status == "Alive" & data$Smoker == "No" & data$Age >=34 & data$Age < 54) nbDNS34 <- sum(data$Status == "Dead" & data$Smoker == "No" & data$Age >=34 & data$Age < 54) nbANS54 <- sum(data$Status == "Alive" & data$Smoker == "No" & data$Age >=55 & data$Age <= 64) nbDNS54 <- sum(data$Status == "Dead" & data$Smoker == "No" & data$Age >=55 & data$Age <= 64) nbANS65 <- sum(data$Status == "Alive" & data$Smoker == "No" & data$Age >=65) nbDNS65 <- sum(data$Status == "Dead" & data$Smoker == "No" & data$Age >=65) #+end_src #+RESULTS: And for all smoker #+begin_src R :results output :session *R* :exports both nbAS18 <- sum(data$Status == "Alive" & data$Smoker == "Yes" & data$Age >=18 & data$Age < 34) nbDS18 <- sum(data$Status == "Dead" & data$Smoker == "Yes" & data$Age >=18 & data$Age < 34) nbAS34 <- sum(data$Status == "Alive" & data$Smoker == "Yes" & data$Age >=34 & data$Age < 54) nbDS34 <- sum(data$Status == "Dead" & data$Smoker == "Yes" & data$Age >=34 & data$Age < 54) nbAS54 <- sum(data$Status == "Alive" & data$Smoker == "Yes" & data$Age >=55 & data$Age <= 64) nbDS54 <- sum(data$Status == "Dead" & data$Smoker == "Yes" & data$Age >=55 & data$Age <= 64) nbAS65 <- sum(data$Status == "Alive" & data$Smoker == "Yes" & data$Age >=65) nbDS65 <- sum(data$Status == "Dead" & data$Smoker == "Yes" & data$Age >=65) #+end_src #+RESULTS: We compute the rate for ages between 18 and 34. #+begin_src R :results output :session *R* :exports both deathRateNonSmoker18 <- nbDNS18/(nbANS18+nbDNS18) deathRateSmoker18 <- nbDS18/(nbAS18+nbDS18) deathRateNonSmoker18 deathRateSmoker18 #+end_src #+RESULTS: : :  0.02739726 : :  0.02793296 We compute the rate for ages between 34 and 54. #+begin_src R :results output :session *R* :exports both deathRateNonSmoker34 <- nbDNS34/(nbANS34+nbDNS34) deathRateSmoker34 <- nbDS34/(nbAS34+nbDS34) deathRateNonSmoker34 deathRateSmoker34 #+end_src #+RESULTS: : :  0.09547739 : :  0.1715481 We compute the rate for ages between 18 and 34. #+begin_src R :results output :session *R* :exports both deathRateNonSmoker54 <- nbDNS54/(nbANS54+nbDNS54) deathRateSmoker54 <- nbDS54/(nbAS54+nbDS54) deathRateNonSmoker54 deathRateSmoker54 #+end_src #+RESULTS: : :  0.3305785 : :  0.4434783 We compute the rate for ages higher than 65. #+begin_src R :results output :session *R* :exports both deathRateNonSmoker65 <- nbDNS65/(nbANS65+nbDNS65) deathRateSmoker65 <- nbDS65/(nbAS65+nbDS65) deathRateNonSmoker65 deathRateSmoker65 #+end_src #+RESULTS: : :  0.8549223 : :  0.8571429 We can see that more age is higher more the death rate is high (that seems logical). But again the death rate of smokers is lower than deathrate of non smokers. It is again a little bit surprising. Now we will try to plot the barplot like before. We will modify our dataset by regrouping ages to make easier a future plot. #+begin_src R :results output :session *R* :exports both data %>% transmute( Smoker=data$Smoker, Status=data$Status, Age = case_when( data$Age >=18 & data$Age < 34 ~ "18-34", data$Age >=34 & data$Age < 54 ~ "34-54", data$Age >=54 & data$Age < 65 ~ "54-64", data$Age >= 65 ~ "65+", )) -> dataModified head(dataModified) #+end_src #+RESULTS: : : Smoker Status Age : 1 Yes Alive 18-34 : 2 Yes Alive 18-34 : 3 No Dead 54-64 : 4 No Alive 34-54 : 5 Yes Alive 65+ : 6 No Alive 34-54 #+begin_src R :results output graphics :file img/fig2.png :session *R* :exports both ggplot(dataModified) + facet_grid(Smoker ~ Age) + aes(x = Status, fill = factor(Status)) + geom_bar() + labs(title = "Number of alive and dead people for smoker and non smoker group by ages") #+end_src #+RESULTS: [[file:img/fig2.png]] We can also see on this graph that often the death rate for non smoker is often higher. But we can also see that there is a lot of old (65+) people in the non smoker group. And lot of them died (probably by their ages). Maybe this is this group that change the statistics. But we also saw that in other groups the problem ws the same so we cannot going to quick into conclusion. Maybe because smoker died more quickly there is more non smoker old people thus statistics are changed. So we cannot conclude maybe we need another kind of study to show the dangerousness of smoking.  % Created 2022-01-02 dim. 20:43 % Intended LaTeX compiler: pdflatex \documentclass[11pt]{article} \usepackage[utf8]{inputenc} \usepackage[T1]{fontenc} \usepackage{graphicx} \usepackage{grffile} \usepackage{longtable} \usepackage{wrapfig} \usepackage{rotating} \usepackage[normalem]{ulem} \usepackage{amsmath} \usepackage{textcomp} \usepackage{amssymb} \usepackage{capt-of} \usepackage{hyperref} \author{Nathan REBISCOUL} \date{\today} \title{Around Simpson's Paradox} \hypersetup{ pdfauthor={Nathan REBISCOUL}, pdftitle={Around Simpson's Paradox}, pdfkeywords={}, pdfsubject={}, pdfcreator={Emacs 26.3 (Org mode 9.1.9)}, pdflang={English}} \begin{document} \maketitle \tableofcontents \section{Information about dataset} \label{sec:orgf5ce458} \subsection{Datas story} \label{sec:org2f482e8} We will work on dataset made with the help of two studies. The first one is a survey made in the town of Whickham in 1977. They ask at one-sixth of constituents if they were smoking or not. This study was lead by Tunbridge and his team. The second study was lead by Vanderpump in 1995. This one was just looking on if peoples of the previous one was still alive. (The study was not focusing on tabagism but some questions was related to this subject) \subsection{What we will use} \label{sec:org78863b2} For simplicity we will focus on women, and we will look at their ages, if they were smoking and if they are still alive. \subsection{Get datas} \label{sec:org14778b0} For this analysis we will use R with ggplot2 and dplyr. First we import libraries and we get datas. \begin{verbatim} library(ggplot2) library(dplyr) data = read.csv(file = "Subject6_smoking.csv", sep = ",", header = T) names(data) head(data) \end{verbatim} \begin{verbatim}  "Smoker" "Status" "Age" Smoker Status Age 1 Yes Alive 21.0 2 Yes Alive 19.3 3 No Dead 57.5 4 No Alive 47.1 5 Yes Alive 81.4 6 No Alive 36.8 \end{verbatim} \section{Question 1 : Mortality in the two groups} \label{sec:org99ef84f} We will look at the death rate of both group. First we get the number of people alive and smoker and the number of people that are dead and smoker. \begin{verbatim} nbAliveSmoker <- sum(data$Status == "Alive" & data$Smoker == "Yes") nbAliveSmoker nbDeadSmoker <- sum(data$Status == "Dead" & data$Smoker == "Yes") nbDeadSmoker \end{verbatim} \begin{verbatim}  443  139 \end{verbatim} And then we compute the death rate of the population. \begin{verbatim} nbSmoker <- nbAliveSmoker + nbDeadSmoker deathRateSmoker <- nbDeadSmoker / nbSmoker deathRateSmoker \end{verbatim} \begin{verbatim}  0.2388316 \end{verbatim} We do the same thing for non smoker people. \begin{verbatim} nbAliveNonSmoker <- sum(data$Status == "Alive" & data$Smoker == "No") nbAliveNonSmoker nbDeadNonSmoker <- sum(data$Status == "Dead" & data$Smoker == "No") nbDeadNonSmoker nbNonSmoker <- nbAliveNonSmoker + nbDeadNonSmoker deathRateNonSmoker <- nbDeadNonSmoker / nbNonSmoker deathRateNonSmoker \end{verbatim} \begin{verbatim}  502  230  0.3142077 \end{verbatim} We can see that the death rate of non smoker people is higher than smoker people. It's a surprising result because smoking is bad for health so we can think that smoker's has a higher chance of dying. We will plot result with two barplot. \begin{verbatim} ggplot(data) + facet_wrap(~ Smoker) + aes(x = Status, fill = factor(Status)) + geom_bar() + labs(title = "Number of alive and dead people for smoker and non smoker") \end{verbatim} \begin{center} \includegraphics[width=.9\linewidth]{img/fig1.png} \end{center} We can see on this graph that indeed the death rate of non smoker people is higher than smoker people. Results are surprising but it does not mean that it's false or that data are false too. It could also mean that there is things that we not considered. \section{Question 2 : Ages classes splitting} \label{sec:org52d4b94} Ages is a factor of death that we can find on this dataset. We will split group in 4 groups of ages (18-34, 34-54, 55-64, 65+). We do it for all non smoker \begin{verbatim} nbANS18 <- sum(data$Status == "Alive" & data$Smoker == "No" & data$Age >=18 & data$Age < 34) nbDNS18 <- sum(data$Status == "Dead" & data$Smoker == "No" & data$Age >=18 & data$Age < 34) nbANS34 <- sum(data$Status == "Alive" & data$Smoker == "No" & data$Age >=34 & data$Age < 54) nbDNS34 <- sum(data$Status == "Dead" & data$Smoker == "No" & data$Age >=34 & data$Age < 54) nbANS54 <- sum(data$Status == "Alive" & data$Smoker == "No" & data$Age >=55 & data$Age <= 64) nbDNS54 <- sum(data$Status == "Dead" & data$Smoker == "No" & data$Age >=55 & data$Age <= 64) nbANS65 <- sum(data$Status == "Alive" & data$Smoker == "No" & data$Age >=65) nbDNS65 <- sum(data$Status == "Dead" & data$Smoker == "No" & data$Age >=65) \end{verbatim} And for all smoker \begin{verbatim} nbAS18 <- sum(data$Status == "Alive" & data$Smoker == "Yes" & data$Age >=18 & data$Age < 34) nbDS18 <- sum(data$Status == "Dead" & data$Smoker == "Yes" & data$Age >=18 & data$Age < 34) nbAS34 <- sum(data$Status == "Alive" & data$Smoker == "Yes" & data$Age >=34 & data$Age < 54) nbDS34 <- sum(data$Status == "Dead" & data$Smoker == "Yes" & data$Age >=34 & data$Age < 54) nbAS54 <- sum(data$Status == "Alive" & data$Smoker == "Yes" & data$Age >=55 & data$Age <= 64) nbDS54 <- sum(data$Status == "Dead" & data$Smoker == "Yes" & data$Age >=55 & data$Age <= 64) nbAS65 <- sum(data$Status == "Alive" & data$Smoker == "Yes" & data$Age >=65) nbDS65 <- sum(data$Status == "Dead" & data$Smoker == "Yes" & data$Age >=65) \end{verbatim} We compute the rate for ages between 18 and 34. \begin{verbatim} deathRateNonSmoker18 <- nbDNS18/(nbANS18+nbDNS18) deathRateSmoker18 <- nbDS18/(nbAS18+nbDS18) deathRateNonSmoker18 deathRateSmoker18 \end{verbatim} \begin{verbatim}  0.02739726  0.02793296 \end{verbatim} We compute the rate for ages between 34 and 54. \begin{verbatim} deathRateNonSmoker34 <- nbDNS34/(nbANS34+nbDNS34) deathRateSmoker34 <- nbDS34/(nbAS34+nbDS34) deathRateNonSmoker34 deathRateSmoker34 \end{verbatim} \begin{verbatim}  0.09547739  0.1715481 \end{verbatim} We compute the rate for ages between 18 and 34. \begin{verbatim} deathRateNonSmoker54 <- nbDNS54/(nbANS54+nbDNS54) deathRateSmoker54 <- nbDS54/(nbAS54+nbDS54) deathRateNonSmoker54 deathRateSmoker54 \end{verbatim} \begin{verbatim}  0.3305785  0.4434783 \end{verbatim} We compute the rate for ages higher than 65. \begin{verbatim} deathRateNonSmoker65 <- nbDNS65/(nbANS65+nbDNS65) deathRateSmoker65 <- nbDS65/(nbAS65+nbDS65) deathRateNonSmoker65 deathRateSmoker65 \end{verbatim} \begin{verbatim}  0.8549223  0.8571429 \end{verbatim} We can see that more age is higher more the death rate is high (that seems logical). But again the death rate of smokers is lower than deathrate of non smokers. It is again a little bit surprising. Now we will try to plot the barplot like before. We will modify our dataset by regrouping ages to make easier a future plot. \begin{verbatim} data %>% transmute( Smoker=data$Smoker, Status=data$Status, Age = case_when( data$Age >=18 & data$Age < 34 ~ "18-34", data$Age >=34 & data$Age < 54 ~ "34-54", data$Age >=54 & data$Age < 65 ~ "54-64", data$Age >= 65 ~ "65+", )) -> dataModified head(dataModified) \end{verbatim} \begin{verbatim} Smoker Status Age 1 Yes Alive 18-34 2 Yes Alive 18-34 3 No Dead 54-64 4 No Alive 34-54 5 Yes Alive 65+ 6 No Alive 34-54 \end{verbatim} \begin{verbatim} ggplot(dataModified) + facet_grid(Smoker ~ Age) + aes(x = Status, fill = factor(Status)) + geom_bar() + labs(title = "Number of alive and dead people for smoker and non smoker group by ages") \end{verbatim} \begin{center} \includegraphics[width=.9\linewidth]{img/fig2.png} \end{center} We can also see on this graph that often the death rate for non smoker is often higher. But we can also see that there is a lot of old (65+) people in the non smoker group. And lot of them died (probably by their ages). Maybe this is this group that change the statistics. But we also saw that in other groups the problem ws the same so we cannot going to quick into conclusion. Maybe because smoker died more quickly there is more non smoker old people thus statistics are changed. So we cannot conclude maybe we need another kind of study to show the dangerousness of smoking. \end{document} \ No newline at end of file

6.63 KB

16.5 KB

 ... ... @@ -57,9 +57,11 @@ plt.xlabel('feet size') plt.ylabel('number of mistakes') plt.savefig('fig1_feetsize.png') plt.show() #+end_src #+RESULTS: [[file:fig1_feetsize.png]] *** Interpretation ... ... @@ -77,7 +79,7 @@ dataset. And it's diminished by the scatter trend line. The choice to put feet size on x-axis is debatable. ** TODO Find another way of presentation * Conclusion Our dataset show a strong correlation between feetsize and number of ... ...
 ... ... @@ -241,16 +241,16 @@ library(tidyverse) P = 35 N = 20 df = data.frame(val = runif(n = N*P, min = 0, max = 2), group = 1-P) df %>% group_by(group) %>% summarise(mean_val = mean(val)) -> df_agg df_agg %>% ggplot() * geom_histogram(aes(x=mean_val), binwidth = 0.005) + theme_bw() + xlim(0.2) df = data.frame(val = runif(n = N*P, min = 0, max = 2), group = 1:P) df %>% group_by(group) %>% summarize(val_mean = mean(val), val_sd = sd(val)) -> df_agg df_agg %>% ggplot() + geom_histogram(aes(x = val_mean), binwidth = 0.01) + theme_bw() + xlim(0,2) #+end_src