Commit 29346049 by Laurence Viry

### testOne_Bio.md

parent f3225e8a
 We focus on one sample tests. # Tests on a Gaussian sample ... ... @@ -7,6 +8,7 @@ Consider a random sample $(X_1, \ldots,X_n)$ of a distribution with mean $\mu$, * The empirical mean is $\bar X = \frac{X_1 + \ldots + X_n}n$ * The empirical variance is $S^2 = \frac{n}{n-1} \left(\frac{X^2_1 +\ldots + X^2_n}n - \bar X^2\right)$. ## Test of the mean A first test is on the mean of the sample. ... ... @@ -14,11 +16,13 @@ A first test is on the mean of the sample. **Example**. For an adult, the logarithm of the D-dimer concentration, denoted by $X$, is modeled by a normal random variable with mean $\mu$ and standard deviation $\sigma$. The variable $X$ is an indicator for the risk of thrombosis: it is considered that for healthy individuals, $\mu$ is −1, whereas for individuals at risk $\mu$ is 0. The influence of olive oil on thrombosis risk must be evaluated. A group of 13 patients, previously considered as being at risk, had an olive oil enriched diet. After the diet, their value of $X$ was measured, and this gave an empirical mean of −0.15. <\div> The doctor would like to decide if the olive oil diet has improved the D-dimer concentration.
{#msg][#md>] The **test on the mean** of the sample compares the hypothesis $H_0: \mu=\mu_0$ with a two-sided hypothesis $H1: \mu\neq \mu_0$ or a one-sided hypothesis $H1: \mu\geq \mu_0$ or $H1: \mu\leq \mu_0$. ... ... @@ -26,38 +30,57 @@ When the **variance $\sigma^2$ is known**, the test statistic is $$T = \sqrt{n} \left(\frac{\bar X - \mu_0}{\sigma}\right)$$ When $H_0$ is true, the statistic $T$ follows a $\mathcal{N}(0,1)$. [#msg}
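The statistic $T$ above is easy to compute by hand. The following is a minimal sketch, not part of the original material: the function name `z_test_known_sigma` and the numerical values in the usage lines are illustrative assumptions. It returns $T$ and the p-value under $\mathcal{N}(0,1)$ for each of the three alternatives; comparing the p-value with the risk $\alpha$ is equivalent to comparing $T$ with the corresponding normal quantile.

```r
# Hypothetical helper (illustration only): test of the mean of a Gaussian
# sample when the standard deviation sigma is known.
z_test_known_sigma <- function(xbar, mu0, sigma, n,
                               alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  # Test statistic T = sqrt(n) * (Xbar - mu0) / sigma
  t_stat <- sqrt(n) * (xbar - mu0) / sigma
  # p-value computed from the N(0,1) distribution, according to H1
  p_value <- switch(alternative,
    two.sided = 2 * pnorm(-abs(t_stat)),
    less      = pnorm(t_stat),
    greater   = pnorm(t_stat, lower.tail = FALSE)
  )
  list(statistic = t_stat, p.value = p_value)
}

# Illustrative values (assumed, not taken from the course data)
res <- z_test_known_sigma(xbar = 10.4, mu0 = 10, sigma = 1.2, n = 36)
res$statistic   # T = 6 * 0.4 / 1.2 = 2
res$p.value     # two-sided p-value
```

With these assumed values, $T = 2$ lies above the two-sided bound at risk 5\%, so $H_0$ would be rejected.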
The decision rule depends on $H_1$. It consists in computing the bounds at which we reject $H_0$. The bounds depend also on the risk of the test (the first kind risk). The decision rule depends on $H_1$. It consists in computing the bounds at which we reject $H_0$. The bounds depend also on the risk of the test (the first kind risk). {#msg][#md>]
The **test on the mean** of the sample compares the hypothesis $H_0: \mu=\mu_0$ with one of the three alternatives: * For $H1: \mu\neq \mu_0$: we reject $H_0$ when $T$ takes too small or too large values. At a risk of $\alpha=5$\%, the two bounds are [#R>>] R #[#R>>] alpha <- 0.05 qnorm(alpha/2,0,1) qnorm(1-alpha/2,0,1) [#md>] We reject $H_0$ when $T<(-1.959964)$ or when $T>1.959964$. #[#md>] 
We reject $H_0$ when $T<(-1.959964)$ or when $T>1.959964$. * For $H1: \mu\geq \mu_0$: we reject $H_0$ when $T$ takes too large values. At a risk of $\alpha=5$\%, the bound is [#R>>] R #[#R>>] alpha <- 0.05 qnorm(alpha,0,1, lower.tail=FALSE) [#md>] #[#md>] 
We reject $H_0$ when $T>1.644854$. * For $H1: \mu\leq \mu_0$: we reject $H_0$ when $T$ takes too small values. At a risk of $\alpha=5$\%, the bound is [#R>>] R #[#R>>] alpha <- 0.05 qnorm(alpha,0,1) [#md>] We reject $H_0$ when $T<(-1.644854)$. [#msg} #[#md>]  We reject $H_0$ when $T<(-1.644854)$. **Back to the example**. We assume that the sample of 13 patients is a Gaussian sample. The standard deviation $\sigma$ is supposed to be known and equal to $0.3$. We want to test ... ... @@ -67,29 +90,38 @@ The test statistic is $$T = \sqrt{13} \left(\frac{\bar X - 0}{0.3}\right)$$ According to the null hypothesis $H_0$, $T$ follows the normal distribution $\mathcal{N}(0,1)$. The hypothesis $H_0$ is rejected when $T$ takes low values. At risk 5%, the bound is [#R>>] R #[#R>>] qnorm(0.05,0,1) [#md>] #[#md>]  The decision rule is **Reject H_0** if $T \, <\, (-1.6449)$. For $\bar X= -0.15$, the test statistic takes the value [#R>>] R #[#R>>] n<-13 Xbar<--0.15 sig<-0.3 mu0<-0 t<-sqrt(n)*(Xbar-mu0)/sig t [#md>] {#msg][#md>] **Decision and interpretation**: At risk 5\%, the hypothesis $H_0$ is rejected. The decision is that there has been a significant improvement. [#class]green[#msg} #[#md>] 
**Decision and interpretation**: At risk 5\%, the hypothesis $H_0$ is rejected. The decision is that there has been a significant improvement. The previous case assumes that the standard deviation $\sigma$ is known. This is usually not the case in practice. The adaptation to the test of the mean, with an unknown variance is the following. {#msg][#md>]
The **test on the mean** of the sample compares the hypothesis $H_0: \mu=\mu_0$ versus a two-sided hypothesis $H1: \mu\neq \mu_0$ or a one-sided hypothesis $H1: \mu\geq \mu_0$ or $H1: \mu\leq \mu_0$. ... ... @@ -101,35 +133,60 @@ When $H_0$ is true, the statistic $T$ follows a Student distribution with $n-1$ The decision rules are the same as before, but the bounds are computed from the Student distribution instead of the normal distribution. {#msg][#md>]
* For $H1: \mu\neq \mu_0$: we reject $H_0$ when $T$ takes too small or too large values. At a risk of $\alpha=5$\%, the two bounds are [#R>>] R #[#R>>] alpha <- 0.05 n<-13 qt(alpha/2,n-1) qt(1-alpha/2,n-1) [#md>] #[#md>]  -2.17881282966723 2.17881282966723
We reject $H_0$ when $T$ is outside the two bounds. * For $H1: \mu\geq \mu_0$: we reject $H_0$ when $T$ takes too large values. At a risk of $\alpha=5$\%, the bound is [#R>>] R #[#R>>] alpha <- 0.05 n<-13 qt(1-alpha,n-1) [#md>] #[#md>] 
We reject $H_0$ when $T$ is larger than the bound. * For $H1: \mu\leq \mu_0$: we reject $H_0$ when $T$ takes too small values. At a risk of $\alpha=5$\%, the bound is [#R>>] R #[#R>>] alpha <- 0.05 n<-13 qt(alpha,n-1) [#md>] #[#md>] 
We reject $H_0$ when $T$ is lower than the bound. [#msg} **Back to the example**. We assume that the standard deviation $\sigma$ is unknown and estimated at $0.3$. We want to test ... ... @@ -139,35 +196,42 @@ The test statistic is $$T = \sqrt{13} \left(\frac{\bar X - 0}{0.3}\right)$$ According to the null hypothesis $H_0$, $T$ follows a Student distribution with $12$ degrees of freedom. The hypothesis $H_0$ is rejected when $T$ takes low values. At risk 5%, the bound is [#R>>] R #[#R>>] qt(0.05,12) [#md>] #[#md>]  The decision rule is **Reject H_0** if $T \, <\, (-1.7823)$. For $\bar X= -0.15$, the test statistic takes the value [#R>>] R #[#R>>] n<-13 Xbar<--0.15 s<-0.3 mu0<-0 t<-sqrt(n)*(Xbar-mu0)/s t [#md>] #[#md>]  {#msg][#md>] Decision and interpretation: At risk 5\%, the hypothesis $H_0$ is rejected. The decision is that there has been a significant improvement. Decision and interpretation: At risk 5\%, the hypothesis $H_0$ is rejected. The decision is that there has been a significant improvement. [#class]green[#msg} The previous example uses the estimates of the mean and the standard deviation, together with the sample size. In practice, all the values of the sample are usually available. In that case, the user can either estimate the mean and the standard deviation and apply the previous instructions, or directly use the function t.test. {#msg][#md>] **R code for the test on a mean**. The mean of a sample can be tested using the function t.test. t.test(X,mu,alternative) [#msg} The function computes the test statistic of Student’s T-test comparing mean(X) to mu, and the corresponding p-value according to the alternative. The null hypothesis H$_0$ is “the mean is equal to mu”. The alternative is one of « two.sided » (default), « less », « greater »; they are understood as: The null hypothesis H$_0$ is “the mean is equal to mu”. The alternative is one of « two.sided » (default), « less », « greater »; they are understood as: * two.sided: the mean is not equal to mu, ... ... 
@@ -176,11 +240,15 @@ The null hypothesis H$_0$ is “the mean is equal to mu”. The alternative i * greater: the mean is greater than mu. **Example** To test if the mean of the age in the LenzI sample is equal to 60, we run the following code [#R>>] R #[#R>>] LenzI <- readRDS("data/LenzI.rds") A<-LenzI$age t.test(A, mu=60) [#md>] #[#md>]  The two hypotheses of this t-test are $H_0: \mu=60$ and $H_1: \mu \neq 60$. The output reads as follows: ... ... @@ -197,25 +265,33 @@ The output reads as follows: * the last value 61.1401 is the estimate of the mean of the sample **Interpretation** of the t.test output: the p-value is 0.1339. Therefore, at a risk of 5\%, we do not reject H$_0$. The mean age is not significantly different from 60 years. {#msg][#md>] **Interpretation** of the t.test output: the p-value is 0.1339. Therefore, at a risk of 5\%, we do not reject H$_0$. The mean age is not significantly different from 60 years. [#class]green[#msg} We can also apply a one-sided test by changing the alternative. To test the alternative $H_1: \mu \geq 60$, run [#R>>] R #[#R>>] A<-LenzI$age t.test(A, mu=60, alternative= "greater") [#md>] #[#md>]  {#msg][#md>] Interpretation of the t.test output: the p-value is 0.06695. Therefore, at a risk of 5\%, we do not reject H$_0$. The mean age is not significantly greater than 60 years. [#class]green[#msg}
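To see what t.test computes, the statistic and the two-sided p-value can be recomputed by hand. This is a minimal sketch on simulated data: the vector `X` below is a hypothetical stand-in for an age variable (it is not the LenzI data), and the seed and parameters are arbitrary assumptions.

```r
# Simulated sample (assumption: stands in for a real age variable)
set.seed(42)
X <- rnorm(30, mean = 61, sd = 4)
mu0 <- 60
n <- length(X)

# Student statistic T = sqrt(n) * (mean - mu0) / S, with S the empirical sd
t_manual <- sqrt(n) * (mean(X) - mu0) / sd(X)
# Two-sided p-value from the Student distribution with n-1 degrees of freedom
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)

# Same test with the built-in function
res <- t.test(X, mu = mu0)
c(manual = t_manual, t.test = unname(res$statistic))
c(manual = p_manual, t.test = res$p.value)
```

Both lines should display two identical values, since t.test applies the same formula.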
Interpretation of the t.test output: the p-value is 0.06695. Therefore, at a risk of 5\%, we do not reject H$_0$. The mean age is not significantly greater than 60 years. **Remark** that even if the empirical mean (61.1401) is greater than 60 years, the difference is not significant, and we cannot conclude, at a risk of 5\%, that the mean is larger than 60. Several reasons could be involved: the variability in the sample is too large (the standard error of the empirical mean is large) or the size of the sample is not large enough (the empirical mean is not estimated with enough precision). ## Test of the standard deviation or the variance One can also test the value of the standard deviation or the variance of a Gaussian sample. {#msg][#md>]
The test on the variance of the sample compares the hypothesis $H_0: \sigma^2=\sigma_0^2$ versus a two-sided hypothesis $H1: \sigma^2\neq \sigma^2_0$ or a one-sided hypothesis $H1: \sigma^2\geq \sigma^2_0$ or $H1: \sigma^2\leq \sigma^2_0$. ... ... @@ -223,15 +299,14 @@ The test statistic is $$T = (n-1) \left(\frac{S^2}{\sigma_0^2}\right)$$ When $H_0$ is true, the statistic $T$ follows a chi-square distribution with $n-1$ degrees of freedom $\chi^2(n-1)$. [#msg} ## Test of the mean for large sample Finally, a test of the mean exists for large sample, and we don't need the assumption that the sample is Gaussian thanks to the Central Limit Theorem. {#msg][#md>]
The test on the mean of a large sample compares the hypothesis $H_0: \mu=\mu_0$ versus a two-sided hypothesis $H1: \mu\neq \mu_0$ or a one-sided hypothesis $H1: \mu\geq \mu_0$ or $H1: \mu\leq \mu_0$. ... ... @@ -239,7 +314,7 @@ The test statistic is $$T = \sqrt{n} \left(\frac{\bar X - \mu_0}{S}\right)$$ When $H_0$ is true, the statistic $T$ follows a normal distribution $\mathcal{N}(0,1)$. [#msg} With R, the test of the mean for a large sample can be applied with the function t.test (as above), since the normal distribution is very close to a Student distribution with a large number of degrees of freedom. ... ... @@ -249,21 +324,23 @@ The previous tests are applied to gaussian samples. When the variable of interes **Example** For a certain disease, there exists a treatment that cures 70% of the cases. A laboratory proposes a new treatment claiming that it is better than the previous one. Out of 100 patients having received the new treatment, 74 of them have been cured. The expert would like to decide whether the new treatment should be authorized. {#msg][#md>]
The test on the proportion of a binary sample compares the hypothesis $H_0: p=p_0$ versus a two-sided hypothesis $H1: p\neq p_0$ or a one-sided hypothesis $H1: p\leq p_0$ or $H1: p\geq p_0$. [#msg} **Back to the example** The hypotheses we want to test are $H_0: p=0.7$ versus $H1: p\geq 0.7$. {#msg][#md>]
The value of a proportion can be tested using the function prop.test. prop.test(x,n,p,alternative) [#msg} The null hypothesis $H_0$ is: “the proportion of x out of n is equal to p”. The alternative is one of « two.sided » (default), « less », « greater »; they are understood as: The alternative is one of « two.sided » (default), « less », « greater »; they are understood as: * two.sided: the proportion x/n is not equal to p, ... ... @@ -272,13 +349,19 @@ The alternative is in « two.sided » (default), « less », « greate * greater: the proportion x/n is greater than p. **Back to the example** The one-sided test is applied running [#R>>] prop.test(x=74, n= 100, p=0.7, alternative="greater") [#md>] {#msg][#md>] **Interpretation** The p-value is 0.2225. At risk 5\%, we do not reject $H_0$. The new treatment is not significantly better than the standard treatment. It should not be authorized. [#class]green[#msg} Note that when the whole binary sample $X$ is available (and not only the count of « successes »), the instruction is prop.test(sum(X), length(X), p, alternative). R #[#R>>] prop.test(x=74, n= 100, p=0.7, alternative="greater") #[#md>] 
**Interpretation** The p-value is 0.2225. At risk 5\%, we do not reject $H_0$. The new treatment is not significantly better than the standard treatment. It should not be authorized. Note that when the whole binary sample $X$ is available (and not only the count of « successes »), the instruction is prop.test(sum(X), length(X), p, alternative). # Goodness-of-fit tests A goodness-of-fit test answers the question: could the sample have been drawn at random from a particular distribution? ... ... @@ -289,24 +372,24 @@ A goodness-of-fit test answers the question: could the sample have been drawn at For a discrete variable, the goodness-of-fit is measured by a distance between the relative frequencies of the variable, and the probabilities of the target distribution. {#msg][#md>]
For a *discrete variable*, the **chi-squared test** compares the null hypothesis $H_0$: “the observed frequencies fit the theoretical probabilities”. The alternative is “the observed frequencies do not fit the theoretical probabilities”. Under $H_0$, the distance follows a chi-squared distribution. The parameter df of that chi-squared distribution is the number of different values minus 1, minus the number of estimated parameters, if there are any. [#msg} Under the alternative, the distance should be large, so that the p-value is computed as the right-tail probability of the chi-squared distribution at the distance. {#msg][#md>]
The goodness of fit of a discrete variable can be tested using the function chisq.test, if no parameter has been estimated. If $X$ is the sample, and $p$ is the distribution, the result is obtained by: chisq.test(table(X), p=p) [#msg} In that command, ... ... @@ -321,9 +404,13 @@ If some frequencies are too small, a warning message may be issued. If one param **Back to the example** The frequency table of the three genotypes AA, Aa, aa is (1600, 4900, 3500). The theoretical probabilities are (0.16, 0.48, 0.36). The chi-squared test is applied running [#R>>] R #[#R>>] chisq.test(c(1600, 4900, 3500),p=c(0.16, 0.48, 0.36)) [#md>] #[#md>]  The outputs are * data the observed data ... ... @@ -334,27 +421,30 @@ The outputs are * p-value the p-value {#msg][#md>] The p-value is 0.08799. At risk 5\%, we do not reject $H_0$. The theoretical probabilities are acceptable. [#class]green[#msg}
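The chi-squared distance of the genotype example can be recomputed step by step. The sketch below assumes that the distance is Pearson's statistic, the sum of (observed − expected)² / expected, which is what chisq.test reports for a given vector of probabilities; the variable names are illustrative.

```r
# Observed counts of the three genotypes and theoretical probabilities
observed <- c(AA = 1600, Aa = 4900, aa = 3500)
p <- c(0.16, 0.48, 0.36)

# Expected counts under H0
expected <- sum(observed) * p
# Pearson chi-squared distance between observed and expected counts
distance <- sum((observed - expected)^2 / expected)
# Degrees of freedom: number of values minus 1 (no parameter estimated)
df <- length(observed) - 1
# Right-tail probability of the chi-squared distribution at the distance
p_value <- pchisq(distance, df, lower.tail = FALSE)

distance
p_value
```

The p-value obtained this way agrees with the one returned by chisq.test(observed, p=p).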
The p-value is 0.08799. At risk 5\%, we do not reject $H_0$. The theoretical probabilities are acceptable. ## Kolmogorov-Smirnov test For a *continuous variable*, the goodness-of-fit is measured by a distance between the empirical cumulative distribution function (ecdf) of the variable, and the cdf of the target distribution. {#msg][#md>]
For a *continuous variable*, the **Kolmogorov-Smirnov** test compares the null hypothesis H0: “the empirical distribution of the data fits the theoretical distribution” and $H_1$: "The empirical distribution does not fit the theoretical distribution". [#msg} {#msg][#md>]
If X is the sample, dist is the distribution, param the parameters of that distribution, the result is obtained by: ks.test(X, dist, param, alternative) [#msg} The answer is “the fit is good”, if the p-value is large (above the risk). The variable X should not have ties (equal values). If some values are equal, a warning message is issued, indicating that the p-value is not quite as precise. This does not affect the validity of the result. The null hypothesis $H_0$ is: “the distribution of the sample is the theoretical cdf”. The alternative is one of « two.sided » (default), « less », « greater »; they are understood as: The alternative is one of « two.sided » (default), « less », « greater »; they are understood as: * two.sided: the ecdf of the sample is different from the theoretical cdf, ... ... @@ -366,19 +456,27 @@ sample are larger than those of the theoretical distribution), **Example** The hypoxy level is given in data set HY. Let us plot the ecdf of Level and the cdf of a normal distribution. {#imgAvecCode] R #{#imgAvecCode] HY <- read.table("data/hypoxy.csv", header=TRUE, dec=",") L<-HY$Level plot(ecdf(L)) curve(pnorm(x,mean(L), sd(L)), col="red", add=TRUE) [#} #[#}  The red curve is the cdf of a normal distribution with parameters $\mu=1.2$ and $\sigma=1$. The ecdf (black curve) is quite far from the theoretical cdf. The Level variable is probably not normally distributed. To test whether the distribution of Level is a normal distribution with parameters $\mu=1.2$ and $\sigma=1$ or not, run [#R>>] R #[#R>>] ks.test(L, "pnorm", c(1.2,1)) [#md>] #[#md>]  The outputs are * data the observed data ... ... @@ -387,51 +485,68 @@ The outputs are * p-value the p-value {#msg][#md>] The p-value is 2.501e-06. At risk 5\%, we reject $H_0$. The sample level does not follow a normal distribution $\mathcal{N}(1.2,1)$. [#class]green[#msg} The p-value is 2.501e-06. At risk 5\%, we reject $H_0$. The sample level does not follow a normal distribution $\mathcal{N}(1.2,1)$. 
The histogram of the variable Level reveals a right-skewed distribution, close to a log-normal distribution. Let us log-transform the data [#R>>] R #[#R>>] LL<-log(L) [#md>] #[#md>]  and plot the ecdf (black curve) of the log-Level and the cdf (red curve) of a normal distribution with parameters $\log(1.2)\approx 0.2$ and 1. {#img] R #{#img] plot(ecdf(LL)) curve(pnorm(x,mean(LL), sd(LL)), col="red", add=TRUE) [#} # [#}  The two curves are quite close. Let us test if the distribution of LL is a normal distribution with parameters $\log(1.2)\approx 0.2$ and 1 [#R>>] R #[#R>>] ks.test(LL, "pnorm", c(0.2,1)) [#md>] {#msg][#md>] The p-value is still very small. At risk 5\%, we reject $H_0$. The sample log-level does not follow a normal distribution $\mathcal{N}(0.2,1)$. [#class]green[#msg} #[#md>]  The p-value is still very small. At risk 5\%, we reject $H_0$. The sample log-level does not follow a normal distribution $\mathcal{N}(0.2,1)$. Remark: the difference on the left of the two curves (ecdf and cdf) is large enough to reject the null hypothesis. ## Normality test Testing whether a variable is normally distributed is different from testing whether a particular normal distribution with given parameters fits the variable. {#msg][#md>] The normality of a *continuous variable* is tested with the **Shapiro-Wilk test**. The null hypothesis $H_0$ is: “the variable is normally distributed”. The alternative is “the variable is not normally distributed”. [#msg} {#msg][#md>] If X is the sample, the result is obtained by: shapiro.test(X) [#msg} **Back to the example** Test of the normality of the log-level of hypoxy: [#R>>] shapiro.test(LL) [#md>] R #[#R>>] shapiro.test(LL) #[#md>]  The outputs are ... ... @@ -441,9 +556,13 @@ shapiro.test(LL) * p-value the p-value {#msg][#md>] The p-value is 1.936e-06. At risk 5\%, we reject $H_0$. The sample log-level is not normally distributed. [#class]green[#msg} The p-value is 1.936e-06. At risk 5\%, we reject $H_0$. 
The sample log-level is not normally distributed. R [#case} `
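The difference between testing normality and testing the fit to one particular normal distribution can be illustrated on simulated data. This is a minimal sketch under stated assumptions: the sample `x`, its parameters and the seed are arbitrary choices, not data from the course.

```r
# Simulated Gaussian sample with mean 5 and standard deviation 2 (assumption)
set.seed(1)
x <- rnorm(200, mean = 5, sd = 2)

# Kolmogorov-Smirnov test against a normal distribution with the wrong
# parameters: the sample is Gaussian, but it is not N(0,1)
ks_wrong <- ks.test(x, "pnorm", 0, 1)

# Kolmogorov-Smirnov test against the normal distribution used to simulate
ks_right <- ks.test(x, "pnorm", 5, 2)

# Shapiro-Wilk test: normality, whatever the mean and standard deviation
sw <- shapiro.test(x)

c(ks_wrong = ks_wrong$p.value, ks_right = ks_right$p.value, shapiro = sw$p.value)
```

The first p-value is essentially zero: a sample can be normally distributed and still be rejected by a Kolmogorov-Smirnov test when the specified mean and standard deviation are wrong. Note that the distribution parameters are passed to ks.test as separate arguments.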