We focus on one-sample tests.
# Tests on a Gaussian sample
Consider a random sample $(X_1, \ldots, X_n)$ of a distribution with mean $\mu$ and standard deviation $\sigma$. Recall that
* The empirical mean is $\bar X = \frac{X_1 + \ldots + X_n}n$
* The empirical variance is $S^2 = \frac{n}{n-1} \left(\frac{X^2_1 +\ldots + X^2_n}n - \bar X^2\right)$.
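In `R`, these quantities are given by `mean` and `var` (note that `var` already uses the $1/(n-1)$ normalization). A minimal check on a toy sample:
[#R>>]
X <- c(2.1, 1.8, 2.5, 2.0, 1.9)     # toy sample, for illustration only
mean(X)                             # empirical mean
var(X)                              # empirical variance S^2 (n-1 denominator)
n <- length(X)
n/(n-1) * (mean(X^2) - mean(X)^2)   # same value, from the formula above
[#md>]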
## Test of the mean
A first test is on the mean of the sample.
**Example**. For an adult, the logarithm of the D-dimer concentration, denoted by $X$, is modeled by a normal random variable with mean $\mu$ and standard deviation $\sigma$. The variable $X$ is an indicator for the risk of thrombosis: it is considered that for healthy individuals, $\mu$ is −1, whereas for individuals at risk $\mu$ is 0.
The influence of olive oil on thrombosis risk must be evaluated.
A group of 13 patients, previously considered to be at risk, followed an olive-oil-enriched diet. After the diet, their value of $X$ was measured, giving an empirical mean of −0.15.
The doctor would like to decide if the olive oil diet has improved the D-dimer concentration.
{#msg][#md>]
The **test on the mean** of the sample compares the hypothesis $H_0: \mu=\mu_0$
with a two-sided hypothesis $H_1: \mu\neq \mu_0$ or a one-sided hypothesis $H_1: \mu\geq \mu_0$ or $H_1: \mu\leq \mu_0$.
When the **variance $\sigma^2$ is known**, the test statistic is
$$T = \sqrt{n} \left(\frac{\bar X - \mu_0}{\sigma}\right)$$
When $H_0$ is true, the statistic $T$ follows a $\mathcal{N}(0,1)$ distribution.
[#msg}
The decision rule depends on $H_1$. It consists in computing the bounds beyond which we reject $H_0$. The bounds also depend on the risk of the test (the type I error risk).
{#msg][#md>]
The **test on the mean** of the sample compares the hypothesis $H_0: \mu=\mu_0$ with one of the three alternatives:
* For $H_1: \mu\neq \mu_0$: we reject $H_0$ when $T$ takes too small or too large values. At a risk of $\alpha=5$\%, the two bounds are
[#R>>]
alpha <- 0.05
qnorm(alpha/2,0,1)
qnorm(1-alpha/2,0,1)
[#md>]
We reject $H_0$ when $T<-1.959964$ or when $T>1.959964$.
* For $H_1: \mu\geq \mu_0$: we reject $H_0$ when $T$ takes too large values. At a risk of $\alpha=5$\%, the bound is
[#R>>]
alpha <- 0.05
qnorm(alpha,0,1, lower.tail=FALSE)
[#md>]
We reject $H_0$ when $T>1.644854$.
* For $H_1: \mu\leq \mu_0$: we reject $H_0$ when $T$ takes too small values. At a risk of $\alpha=5$\%, the bound is
[#R>>]
alpha <- 0.05
qnorm(alpha,0,1)
[#md>]
We reject $H_0$ when $T<(-1.644854)$.
[#msg}
**Back to the example**. We assume that the sample of 13 patients is a Gaussian sample. The standard deviation $\sigma$ is assumed to be known and equal to $0.3$.
We want to test
$$H_0: \mu = 0 \quad \mbox{versus} \quad H_1: \mu = -1 $$
The test statistic is
$$ T = \sqrt{13} \left(\frac{\bar X - 0}{0.3}\right)$$
Under the null hypothesis $H_0$, $T$ follows the normal distribution $\mathcal{N}(0,1)$. The hypothesis $H_0$ is rejected when $T$ takes low values. At risk 5%, the bound is
[#R>>]
qnorm(0.05,0,1)
[#md>]
The decision rule is: **reject $H_0$** if $T < -1.6449$.
For $\bar X= -0.15$, the test statistic takes the value
[#R>>]
n <- 13         # sample size
Xbar <- -0.15   # empirical mean
sig <- 0.3      # known standard deviation
mu0 <- 0        # mean under H0
t <- sqrt(n) * (Xbar - mu0) / sig   # test statistic
t
[#md>]
{#msg][#md>] **Decision and interpretation**: At risk 5\%, the hypothesis $H_0$ is rejected. The decision is that there has been a significant improvement.
[#class]green[#msg}
The previous case assumes that the standard deviation $\sigma$ is known, which is usually not the case in practice. The test of the mean adapts to an unknown variance as follows.
{#msg][#md>]
The **test on the mean** of the sample compares the hypothesis $H_0: \mu=\mu_0$
versus a two-sided hypothesis $H_1: \mu\neq \mu_0$ or a one-sided hypothesis $H_1: \mu\geq \mu_0$ or $H_1: \mu\leq \mu_0$.
When the **variance $\sigma^2$ is unknown**, the test statistic is
$$T = \sqrt{n} \left(\frac{\bar X - \mu_0}{S}\right)$$
When $H_0$ is true, the statistic $T$ follows a Student distribution with $n-1$ degrees of freedom, denoted $T(n-1)$.
[#msg}
The decision rules are as before, but the bounds are computed from the Student distribution instead of the normal distribution.
{#msg][#md>]
* For $H_1: \mu\neq \mu_0$: we reject $H_0$ when $T$ takes too small or too large values. At a risk of $\alpha=5$\%, the two bounds are
[#R>>]
alpha <- 0.05
n<-13
qt(alpha/2,n-1)
qt(1-alpha/2,n-1)
[#md>]
We reject $H_0$ when $T$ is outside the two bounds.
* For $H_1: \mu\geq \mu_0$: we reject $H_0$ when $T$ takes too large values. At a risk of $\alpha=5$\%, the bound is
[#R>>]
alpha <- 0.05
n<-13
qt(alpha,n-1, lower.tail=FALSE)
[#md>]
We reject $H_0$ when $T$ is larger than the bound.
* For $H_1: \mu\leq \mu_0$: we reject $H_0$ when $T$ takes too small values. At a risk of $\alpha=5$\%, the bound is
[#R>>]
alpha <- 0.05
n<-13
qt(alpha,n-1)
[#md>]
We reject $H_0$ when $T$ is lower than the bound.
[#msg}
**Back to the example**. We now assume that the standard deviation $\sigma$ is unknown and estimated by $S = 0.3$.
We want to test
$$H_0: \mu = 0 \quad \mbox{versus} \quad H_1: \mu = -1 $$
The test statistic is
$$ T = \sqrt{13} \left(\frac{\bar X - 0}{0.3}\right)$$
Under the null hypothesis $H_0$, $T$ follows a Student distribution with $12$ degrees of freedom. The hypothesis $H_0$ is rejected when $T$ takes low values. At risk 5%, the bound is
[#R>>]
qt(0.05,12)
[#md>]
The decision rule is: **reject $H_0$** if $T < -1.7823$.
For $\bar X= -0.15$, the test statistic takes the value
[#R>>]
n <- 13         # sample size
Xbar <- -0.15   # empirical mean
s <- 0.3        # estimated standard deviation
mu0 <- 0        # mean under H0
t <- sqrt(n) * (Xbar - mu0) / s   # test statistic
t
[#md>]
{#msg][#md>] **Decision and interpretation**: At risk 5\%, the hypothesis $H_0$ is rejected. The decision is that there has been a significant improvement.
[#class]green[#msg}
The previous example uses the estimate of the mean, the standard deviation, and the sample size. In practice, all the values of the sample are usually available. In that case, the user can estimate the mean and the standard deviation and apply the previous instructions, or directly use the function `t.test`.
{#msg][#md>]
**R code for the test on a mean**.
The mean of a sample can be tested using the function `t.test`.
`t.test(X,mu,alternative)`
[#msg}
The function computes the test statistic of Student’s T-test comparing `mean(X)` to `mu`, and the corresponding p-value according to the `alternative`.
The null hypothesis $H_0$ is: “the mean is equal to `mu`”.
The alternative is one of `two.sided` (default), `less`, `greater`; they are understood as:
* `two.sided`: the mean is not equal to `mu`,
* `less`: the mean is less than `mu`,
* `greater`: the mean is greater than `mu`.
**Example**. To test whether the mean age in the `LenzI` sample is equal to 60, we run the following code
[#R>>]
LenzI <- readRDS("data/LenzI.rds")
A<-LenzI$age
t.test(A, mu=60)
[#md>]
The two hypotheses of this t-test are $H_0: \mu=60$ and $H_1: \mu \neq 60$.
The output reads as follows:
* `t = 1.5019` is the value of the test statistic
* `df = 413` is the number of degrees of freedom of the Student distribution (equal to `n-1`)
* `p-value = 0.1339` is the p-value of the t-test, corresponding to the two-sided test (by default, the alternative is two-sided)
* `alternative…` recalls the choice of the alternative
* `95 percent confidence interval` gives the 95\% confidence interval of the mean, assuming the standard deviation is unknown
* the last value, `61.1401`, is the estimate of the mean of the sample
{#msg][#md>] **Interpretation** of the `t.test` output: the `p-value` is `0.1339`. Therefore, at a risk of 5\%, we do not reject $H_0$. The mean age is not significantly different from 60 years.
[#class]green[#msg}
We can also apply a one-sided test by changing the `alternative`. To test the alternative $H_1: \mu \geq 60$, run
[#R>>]
A<-LenzI$age
t.test(A, mu=60, alternative= "greater")
[#md>]
{#msg][#md>] **Interpretation** of the `t.test` output: the `p-value` is `0.06695`. Therefore, at a risk of 5\%, we do not reject $H_0$. The mean age is not significantly greater than 60 years.
[#class]green[#msg}
**Remark**: even though the empirical mean (`61.1401`) is greater than 60 years, the difference is not significant, and we cannot conclude, at a risk of 5\%, that the mean is larger than 60. Several reasons could be involved: the variability in the sample may be too large (the standard error of the empirical mean is large), or the sample size may not be large enough (the empirical mean is not estimated with enough precision).
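As a quick check, the standard error of the empirical mean mentioned above can be computed directly from the age variable `A` loaded earlier:
[#R>>]
se <- sd(A) / sqrt(length(A))   # standard error of the empirical mean
se
[#md>]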
## Test of the standard deviation or the variance
One can also test the value of the standard deviation or the variance of a Gaussian sample.
{#msg][#md>]
The **test on the variance** of the sample compares the hypothesis $H_0: \sigma^2=\sigma_0^2$
versus a two-sided hypothesis $H_1: \sigma^2\neq \sigma^2_0$ or a one-sided hypothesis $H_1: \sigma^2\geq \sigma^2_0$ or $H_1: \sigma^2\leq \sigma^2_0$.
The test statistic is
$$T = (n-1) \left(\frac{S^2}{\sigma_0^2}\right)$$
When $H_0$ is true, the statistic $T$ follows a chi-square distribution with $n-1$ degrees of freedom $\chi^2(n-1)$.
[#msg}
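Base `R` has no dedicated function for this test, but the statistic and its p-value are easy to compute by hand. A minimal sketch with hypothetical values ($n=13$, $S^2=0.09$, $\sigma_0^2=0.04$, two-sided alternative):
[#R>>]
# hypothetical values, for illustration only
n <- 13            # sample size
S2 <- 0.09         # empirical variance
sigma02 <- 0.04    # variance under H0
T <- (n - 1) * S2 / sigma02                           # test statistic
p_right <- pchisq(T, df = n - 1, lower.tail = FALSE)  # right-tail probability
p_left  <- pchisq(T, df = n - 1)                      # left-tail probability
2 * min(p_left, p_right)                              # two-sided p-value
[#md>]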
## Test of the mean for large samples
Finally, a test of the mean exists for large samples; thanks to the Central Limit Theorem, the assumption that the sample is Gaussian is no longer needed.
{#msg][#md>]
The **test on the mean** of a large sample compares the hypothesis $H_0: \mu=\mu_0$
versus a two-sided hypothesis $H_1: \mu\neq \mu_0$ or a one-sided hypothesis $H_1: \mu\geq \mu_0$ or $H_1: \mu\leq \mu_0$.
The test statistic is
$$T = \sqrt{n} \left(\frac{\bar X - \mu_0}{S}\right)$$
When $H_0$ is true, the statistic $T$ follows a normal distribution $\mathcal{N}(0,1)$.
[#msg}
With `R`, the test of the mean for large samples can be applied with the function `t.test` (as above), since the normal distribution is very close to a Student distribution with a large number of degrees of freedom.
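Equivalently, the statistic and its normal p-value can be computed by hand. A minimal sketch with hypothetical summary values:
[#R>>]
# hypothetical summary of a large sample, for illustration only
n <- 200; Xbar <- 1.08; S <- 0.52; mu0 <- 1
T <- sqrt(n) * (Xbar - mu0) / S          # test statistic
2 * pnorm(abs(T), lower.tail = FALSE)    # two-sided p-value from N(0,1)
[#md>]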
# Test of a proportion
The previous tests apply to Gaussian samples. When the variable of interest is binary (two possible modalities), we are interested in comparing the proportion of the first modality with a theoretical proportion $p_0$.
**Example** For a certain disease, there exists a treatment that cures 70% of the cases. A laboratory proposes a new treatment claiming that it is better than the previous one. Out of 100 patients having received the new treatment, 74 of them have been cured. The expert would like to decide whether the new treatment should be authorized.

{#msg][#md>]
The **test on the proportion** of a binary sample compares the hypothesis $H_0: p=p_0$
versus a two-sided hypothesis $H_1: p\neq p_0$ or a one-sided hypothesis $H_1: p\leq p_0$ or $H_1: p\geq p_0$.
[#msg}
**Back to the example** The hypotheses we want to test are $H_0: p=0.7$ versus $H_1: p\geq 0.7$.

{#msg][#md>]
A value of a proportion can be tested using the function `prop.test`.
`prop.test(x,n,p,alternative)`
[#msg}
The null hypothesis $H_0$ is: “the proportion of `x` out of `n` is equal to `p`”.
The alternative is one of `two.sided` (default), `less`, `greater`; they are understood as:
* `two.sided`: the proportion `x/n` is not equal to `p`,
* `less`: the proportion `x/n` is less than `p`,
* `greater`: the proportion `x/n` is greater than `p`.
**Back to the example** The one-sided test is applied running
[#R>>]
prop.test(x = 74, n = 100, p = 0.7, alternative = "greater")
[#md>]
{#msg][#md>] **Interpretation** The p-value is `0.2225`. At risk 5\%, we do not reject $H_0$. The new treatment is not significantly better than the standard treatment. It should not be authorized.
[#class]green[#msg}
Note that when the whole binary sample $X$ is available (and not only the count of “successes”), the instruction is `prop.test(sum(X), length(X), p, alternative)`.
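For instance, on a simulated binary sample (hypothetical data, for illustration only):
[#R>>]
set.seed(1)                                # for reproducibility
X <- rbinom(100, size = 1, prob = 0.74)    # simulated binary sample
prop.test(sum(X), length(X), p = 0.7, alternative = "greater")
[#md>]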

# Goodness-of-fit tests
A goodness-of-fit test answers the question: could the sample have been drawn at random from a particular distribution?
**Example** Consider a diploid population with allele frequencies 0.4 for A and 0.6 for a. At the Hardy-Weinberg equilibrium, the probabilities of the three genotypes AA, Aa, aa are $(0.16, 0.48, 0.36)$. The observed frequency table of the three genotypes is (1600, 4900, 3500). Is the theoretical model plausible?

## Chi-squared test
For a discrete variable, the goodness-of-fit is measured by a distance between the relative frequencies of the variable, and the probabilities of the target distribution.
{#msg][#md>]
For a *discrete variable*, the **chi-squared test** compares the null hypothesis $H_0$: “the observed frequencies fit the theoretical probabilities”
with the alternative: “the observed frequencies do not fit the theoretical probabilities”.
Under $H_0$, the distance follows a chi-squared distribution. The parameter `df` of that chi-squared distribution is the number of different values minus 1, minus the number of estimated parameters, if there are any.
[#msg}
Under the alternative, the distance should be large, so that the p-value is computed as the right-tail probability of the chi-squared distribution at the distance.
{#msg][#md>]
A goodness-of-fit for a discrete variable can be tested using the function `chisq.test`, provided no parameter has been estimated. If `X` is the sample and `p` is the vector of theoretical probabilities, the result is obtained by:
`chisq.test(table(X), p = p)`
[#msg}
In that command,
* `table(X)` is a table of absolute frequencies (any vector of integers can be tested)
* The probability distribution `p` is a vector of probabilities, with the same length as `table(X)`
The answer is “the fit is good”, if the p-value is large (above the risk).
If some expected frequencies are too small, a warning message may be issued. If one or more parameters have been estimated, the test statistic should be extracted, and the p-value computed as its right-tail probability for a chi-squared distribution with a correspondingly reduced number of degrees of freedom, as sketched below.
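A minimal sketch of that adjustment, on hypothetical data, assuming one parameter has been estimated:
[#R>>]
# hypothetical counts and fitted probabilities (1 estimated parameter)
obs <- c(28, 43, 19, 10)
p <- c(0.3, 0.4, 0.2, 0.1)
res <- chisq.test(obs, p = p)        # reported df = 3 here, one too many
stat <- unname(res$statistic)        # extract the chi-squared distance
pchisq(stat, df = length(obs) - 2, lower.tail = FALSE)   # corrected p-value, df = 4 - 1 - 1 = 2
[#md>]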
**Back to the example** The frequency table of the three genotypes AA, Aa, aa is (1600, 4900, 3500). The theoretical probabilities are (0.16, 0.48, 0.36). The chi-squared test is applied running
[#R>>]
chisq.test(c(1600, 4900, 3500), p = c(0.16, 0.48, 0.36))
[#md>]
The outputs are
* `data` the observed data
* `X-squared` the value of the chi-squared distance
* `df` the number of degrees of freedom of the chi-squared distribution
* `p-value` the p-value
{#msg][#md>] The p-value is `0.08799`. At risk 5\%, we do not reject $H_0$. The theoretical probabilities are acceptable.
[#class]green[#msg}
## Kolmogorov-Smirnov test
For a *continuous variable*, the goodness-of-fit is measured by a distance between the empirical cumulative distribution function (ecdf) of the variable, and the cdf of the target distribution.
{#msg][#md>]
For a *continuous variable*, the **Kolmogorov-Smirnov test** compares the null hypothesis $H_0$: “the empirical distribution of the data fits the theoretical distribution” with the alternative $H_1$: “the empirical distribution does not fit the theoretical distribution”.
[#msg}
{#msg][#md>]
If `X` is the sample, `dist` the name of the theoretical distribution, and `param` its parameters, the result is obtained by:
`ks.test(X, dist, param, alternative)`
[#msg}
The answer is “the fit is good” if the p-value is large (above the risk). The variable `X` should not have ties (equal values). If some values are equal, a warning message is issued, indicating that the p-value is not quite as precise. This does not affect the validity of the result.
The null hypothesis $H_0$ is: “the distribution of the sample is the theoretical cdf”.
The alternative is one of `two.sided` (default), `less`, `greater`; they are understood as:
* `two.sided`: the ecdf of the sample is different from the theoretical cdf,
* `less`: the ecdf of the sample is under the theoretical cdf (the values of the sample are larger than those of the theoretical distribution),
* `greater`: the ecdf of the sample is above the theoretical cdf (the values of the sample are smaller than those of the theoretical distribution).
**Example** The hypoxy level is given in the data set `HY`. Let us plot the ecdf of `Level` together with the cdf of a normal distribution.
{#imgAvecCode]
HY <- read.table("data/hypoxy.csv", header=TRUE, dec=",")
L<-HY$Level
plot(ecdf(L))
curve(pnorm(x,mean(L), sd(L)), col="red", add=TRUE)
[#}
The red curve is that of a normal distribution with parameters $\mu=1.2$ and $\sigma=1$ (approximately the empirical mean and standard deviation of `Level`). The ecdf (black curve) is quite far from the theoretical cdf. The `Level` variable is probably not normally distributed.
To test whether `Level` follows a normal distribution with parameters $\mu=1.2$ and $\sigma=1$, run
[#R>>]
ks.test(L, "pnorm", mean = 1.2, sd = 1)
[#md>]
The outputs are
* `data` the observed data
* `D` the value of the Kolmogorov-Smirnov distance
* `p-value` the p-value
{#msg][#md>] The p-value is `2.501e-06`. At risk 5\%, we reject $H_0$. The sample `Level` does not follow a normal distribution $\mathcal{N}(1.2,1)$.
[#class]green[#msg}
The histogram of the variable `Level` reveals a right-skewed distribution, close to a log-normal distribution. Let us log-transform the data
[#R>>]
LL<-log(L)
[#md>]
and plot the ecdf (black curve) of the log-Level and the cdf (red curve) of a normal distribution with parameters $\log(1.2)\approx 0.2$ and 1.
{#img]
plot(ecdf(LL))
curve(pnorm(x,mean(LL), sd(LL)), col="red", add=TRUE)
[#}
The two curves are quite close. Let us test whether `LL` follows a normal distribution with parameters $\log(1.2)\approx 0.2$ and 1
[#R>>]
ks.test(LL, "pnorm", mean = 0.2, sd = 1)
[#md>]
{#msg][#md>] The p-value is still very small. At risk 5\%, we reject $H_0$. The sample `log-level` does not follow a normal distribution $\mathcal{N}(0.2,1)$.
[#class]green[#msg}
**Remark**: the difference between the two curves (ecdf and cdf) on the left is large enough to reject the null hypothesis.
## Normality test
Testing whether a variable is normally distributed is different from testing whether a particular normal distribution with given parameters fits the variable.
{#msg][#md>]
The normality of a *continuous variable* is tested with the **Shapiro-Wilk test**.
The null hypothesis $H_0$ is: “the variable is normally distributed”. The alternative is: “the variable is not normally distributed”.
[#msg}
{#msg][#md>]
If `X` is the sample, the result is obtained by:
`shapiro.test(X)`
[#msg}
**Back to the example** Test of the normality of the log-level of hypoxy:
[#R>>]
shapiro.test(LL)
[#md>]
The outputs are
* `data` the observed data
* `W` the value of the Shapiro-Wilk statistic
* `p-value` the p-value
{#msg][#md>] The p-value is `1.936e-06`. At risk 5\%, we reject $H_0$. The sample `log-level` is not normally distributed.
[#class]green[#msg}