@@ -7,6 +8,7 @@ Consider a random sample $(X_1, \ldots,X_n)$ of a distribution with mean $\mu$,

* The empirical mean is $\bar X = \frac{X_1 + \ldots + X_n}n$

* The empirical variance is $S^2 = \frac{n}{n-1} \left(\frac{X^2_1 +\ldots + X^2_n}n - \bar X^2\right)$.
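As a quick check of the two formulas above, here is a minimal sketch on a small hypothetical sample (the values of `x` are made up for illustration). It computes the empirical mean and variance by hand and compares them with R's built-in `mean` and `var`.

```R
# Hypothetical sample, for illustration only (not from the document's data)
x <- c(2.1, 3.4, 1.9, 4.2, 2.8)
n <- length(x)

# Empirical mean
xbar <- sum(x) / n

# Empirical variance, written as in the formula above
s2 <- n / (n - 1) * (sum(x^2) / n - xbar^2)

# Both agree with R's built-in estimators
all.equal(xbar, mean(x))
all.equal(s2, var(x))
```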

## Test of the mean

A first test is on the mean of the sample.

...

...

@@ -14,11 +16,13 @@ A first test is on the mean of the sample.

**Example**. For an adult, the logarithm of the D-dimer concentration, denoted by $X$, is modeled by a normal random variable with mean $\mu$ and standard deviation $\sigma$. The variable $X$ is an indicator for the risk of thrombosis: it is considered that for healthy individuals, $\mu$ is −1, whereas for individuals at risk $\mu$ is 0.

The influence of olive oil on thrombosis risk must be evaluated.

A group of 13 patients, previously considered as being at risk, had an olive oil enriched diet. After the diet, their value of $X$ was measured, and this gave an empirical mean of −0.15.

</div>

The doctor would like to decide if the olive oil diet has improved the D-dimer concentration.

The **test on the mean** of the sample compares the hypothesis $H_0: \mu=\mu_0$

with a two-sided hypothesis $H_1: \mu\neq \mu_0$ or a one-sided hypothesis $H_1: \mu\geq \mu_0$ or $H_1: \mu\leq \mu_0$.

...

...

@@ -26,38 +30,57 @@ When the **variance $\sigma^2$ is known**, the test statistic is

$$T = \sqrt{n} \left(\frac{\bar X - \mu_0}{\sigma}\right)$$

When $H_0$ is true, the statistic $T$ follows a standard normal distribution $\mathcal{N}(0,1)$.

[#msg}

</div>

<!--dyndoc [#msg} -->

The decision rule depends on $H_1$. It consists in computing the bounds at which we reject $H_0$. The bounds depend also on the risk of the test (the first kind risk).

The decision rule depends on $H_1$. It consists in computing the bounds at which we reject

$H_0$. The bounds depend also on the risk of the test (the first kind risk).

* For $H_1: \mu\leq \mu_0$: we reject $H_0$ when $T$ takes values that are too small. At a risk of $\alpha=5$\%, the bound is

[#R>>]

```R

#[#R>>]

alpha<-0.05

qnorm(alpha,0,1)

[#md>]

We reject $H_0$ when $T<(-1.644854)$.

[#msg}

#[#md>]

```

<!--dyndoc [#msg] -->

We reject $H_0$ when $T<(-1.644854)$.

<!--dyndoc [#msg} -->
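The bounds for the other alternatives are obtained in the same way from the quantiles of $\mathcal{N}(0,1)$. The following is a minimal sketch, assuming a risk of $\alpha=5$\%; the variable names are illustrative only.

```R
alpha <- 0.05

# H1: mu <= mu0 -> reject H0 when T < lower_bound
lower_bound <- qnorm(alpha)

# H1: mu >= mu0 -> reject H0 when T > upper_bound
upper_bound <- qnorm(1 - alpha)

# H1: mu != mu0 -> reject H0 when |T| > two_sided_bound
two_sided_bound <- qnorm(1 - alpha / 2)

c(lower = lower_bound, upper = upper_bound, two_sided = two_sided_bound)
```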

**Back to the example**. We assume that the sample of 13 patients is a Gaussian sample. The standard deviation $\sigma$ is supposed to be known and equal to $0.3$.

We want to test

...

...

@@ -67,29 +90,38 @@ The test statistic is

$$ T = \sqrt{13} \left(\frac{\bar X - 0}{0.3}\right)$$

According to the null hypothesis $H_0$, $T$ follows the normal distribution $\mathcal{N}(0,1)$. The hypothesis $H_0$ is rejected when $T$ takes low values. At risk 5%, the bound is

[#R>>]

```R

#[#R>>]

qnorm(0.05,0,1)

[#md>]

#[#md>]

```

<!--dyndoc [#msg] -->

The decision rule is **Reject $H_0$** if $T \, <\, (-1.6449)$.

For $\bar X= -0.15$, the test statistic takes the value

[#R>>]

```R

#[#R>>]

n<-13

Xbar<--0.15

sig<-0.3

mu0<-0

t<-sqrt(n)*(Xbar-mu0)/sig

t

[#md>]
#[#md>]

```

{#msg][#md>] **Decision and interpretation**: At risk 5\%, the hypothesis $H_0$ is rejected. The decision is that there has been a significant improvement.

<!--dyndoc {#msg][#md>] --> **Decision and interpretation**: At risk 5\%, the hypothesis $H_0$ is rejected. The decision is that there has been a significant improvement.

<!--dyndoc [#class]green[#msg} -->

The previous case assumes that the standard deviation $\sigma$ is known. This is usually not the case in practice. The test of the mean is adapted to an unknown variance as follows.

**Back to the example**. We assume that the standard deviation $\sigma$ is unknown and estimated as $0.3$.

We want to test

...

...

@@ -139,35 +196,42 @@ The test statistic is

$$ T = \sqrt{13} \left(\frac{\bar X - 0}{0.3}\right)$$

According to the null hypothesis $H_0$, $T$ follows a Student distribution with $12$ degrees of freedom. The hypothesis $H_0$ is rejected when $T$ takes low values. At risk 5%, the bound is

[#R>>]

```R

#[#R>>]

qt(0.05,12)

[#md>]

#[#md>]

```

The decision rule is **Reject $H_0$** if $T \, <\, (-1.7823)$.

For $\bar X= -0.15$, the test statistic takes the value

[#R>>]

```R

#[#R>>]

n<-13

Xbar<--0.15

s<-0.3

mu0<-0

t<-sqrt(n)*(Xbar-mu0)/s

t

[#md>]

#[#md>]

```

{#msg][#md>] Decision and interpretation: At risk 5\%, the hypothesis $H_0$ is rejected. The decision is that there has been a significant improvement.

<!--dyndoc {#msg][#md>] --> Decision and interpretation: At risk 5\%, the hypothesis $H_0$ is rejected. The decision is that there has been a significant improvement.

[#class]green[#msg}
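The same decision can be reached by comparing a p-value with the risk. The sketch below reuses the numbers of the example and assumes, as stated above, that $H_0$ is rejected for low values of $T$, so the p-value is a left-tail probability of the Student distribution with $12$ degrees of freedom. The names `t_obs`, `bound` and `p_value` are illustrative.

```R
n <- 13
Xbar <- -0.15
s <- 0.3      # estimated standard deviation
mu0 <- 0

# Observed value of the test statistic
t_obs <- sqrt(n) * (Xbar - mu0) / s

# Bound of the rejection region at risk 5% (Student, n - 1 degrees of freedom)
bound <- qt(0.05, df = n - 1)

# Left-tail p-value, since H0 is rejected for low values of T
p_value <- pt(t_obs, df = n - 1)

c(t = t_obs, bound = bound, p.value = p_value)
```

Rejecting $H_0$ when `t_obs < bound` is equivalent to rejecting it when `p_value < 0.05`.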

The previous example uses the estimates of the mean and the standard deviation, together with the sample size. In practice, all the values of the sample are usually available. In that case, the user can either compute the mean and the standard deviation and apply the previous instructions, or directly use the function `t.test`.
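To illustrate that both routes give the same result, here is a minimal sketch on a simulated sample (the data, the seed and the value of `mu0` are assumptions made for the illustration). The statistic computed by hand coincides with the one returned by `t.test`.

```R
# Simulated sample, for illustration only
set.seed(1)
X <- rnorm(20, mean = 0.5, sd = 1)
mu0 <- 0

# Manual computation of the test statistic
t_manual <- sqrt(length(X)) * (mean(X) - mu0) / sd(X)

# Same test with t.test
res <- t.test(X, mu = mu0, alternative = "two.sided")

all.equal(t_manual, unname(res$statistic))
res$p.value
```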

{#msg][#md>]

<!--dyndoc {#msg][#md>] -->

**R code for the test on a mean**.

The mean of a sample can be tested using the function `t.test`.

`t.test(X,mu,alternative)`

[#msg}

<!--dyndoc [#msg} -->

The function computes the test statistic of Student’s T-test comparing `mean(X)` to `mu`, and the corresponding p-value according to the `alternative`.

The null hypothesis H$_0$ is “the mean is equal to mu”. The alternative is in «`two.sided`» (default), «`less`», «`greater`»; they are understood as:

The null hypothesis H$_0$ is “the mean is equal to mu”. The alternative is in «`two.sided`» (default), «`less`», «`greater`»; they are understood as:

*`two.sided`: the mean is not equal `mu`,

...

...

@@ -176,11 +240,15 @@ The null hypothesis H$_0$ is “the mean is equal to mu”. The alternative i

*`greater`: the mean is greater than `mu`.

**Example** To test if the mean of the age in the `LenzI` sample is equal to 60, we run the following code

[#R>>]

```R

#[#R>>]

LenzI<-readRDS("data/LenzI.rds")

A<-LenzI$age

t.test(A,mu=60)

[#md>]

#[#md>]

```

The two hypotheses of this t-test are $H_0: \mu=60$ and $H_1: \mu \neq 60$.

The output reads as follows:

...

...

@@ -197,25 +265,33 @@ The output reads as follows:

* the last value `61.1401` is the estimation of the mean of the sample

<!--dyndoc {#msg][#md>] --> **Interpretation** of the `t.test` output: the `p-value` is `0.1339`. Therefore, at a risk of 5\%, we do not reject H$_0$. The mean age is not significantly different from 60 years.

{#msg][#md>] **Interpretation** of the `t.test` output: the `p-value` is `0.1339`. Therefore, at a risk of 5\%, we do not reject H$_0$. The mean age is not significantly different from 60 years.

[#class]green[#msg}

<!-- [#class]green[#msg} -->

We can also apply a one-sided test by changing the `alternative`. To test the alternative $H_1: \mu \geq 60$, run

[#R>>]

```R

#[#R>>]

A<-LenzI$age

t.test(A,mu=60,alternative="greater")

[#md>]

#[#md>]

```

{#msg][#md>] Interpretation of the `t.test` output: the `p-value` is `0.06695`. Therefore, at a risk of 5\%, we do not reject H$_0$. The mean age is not significantly greater than 60 years.

<div style="color:MediumSeaGreen;">

<!--dyndoc {#msg][#md>] --> Interpretation of the `t.test` output: the `p-value` is `0.06695`. Therefore, at a risk of 5\%, we do not reject H$_0$. The mean age is not significantly greater than 60 years.

<!-- [#class]green[#msg}-->

**Remark**: even if the empirical mean (`61.1401`) is greater than 60 years, the difference is not significant, and we cannot conclude, at a risk of 5\%, that the mean is larger than 60. Several reasons could be involved: the variability in the sample is too large (the standard error of the empirical mean is large) or the size of the sample is not large enough (the empirical mean is not estimated with enough precision).
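The role of the sample size in the remark above can be sketched numerically. In the code below, the empirical mean `61.1401` comes from the output above, but the standard deviation `s` and the sample sizes `n` are hypothetical values chosen for the illustration; they are not taken from the `LenzI` data.

```R
# Hypothetical values, for illustration only: the standard deviation and
# the sample sizes below are NOT taken from the LenzI data
xbar <- 61.1401   # empirical mean reported above
mu0 <- 60
s <- 15           # assumed standard deviation
n <- c(50, 200, 800)

# Standard error of the empirical mean, test statistic and
# one-sided (right-tail) p-value for each sample size
se <- s / sqrt(n)
t_obs <- (xbar - mu0) / se
p_value <- pt(t_obs, df = n - 1, lower.tail = FALSE)

data.frame(n, se, t_obs, p_value)
```

For a fixed difference between the empirical mean and 60, the standard error shrinks as $n$ grows, so the same difference can become significant with a larger sample.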

## Test of the standard deviation or the variance

One can also test the value of the standard deviation or the variance of a Gaussian sample.

The test on the mean of a large sample compares the hypothesis $H_0: \mu=\mu_0$

versus a two-sided hypothesis $H_1: \mu\neq \mu_0$ or a one-sided hypothesis $H_1: \mu\geq \mu_0$ or $H_1: \mu\leq \mu_0$.

...

...

@@ -239,7 +314,7 @@ The test statistic is

$$T = \sqrt{n} \left(\frac{\bar X - \mu_0}{S}\right)$$

When $H_0$ is true, the statistic $T$ follows a normal distribution $\mathcal{N}(0,1)$.

[#msg}

<!--dyndoc [#msg} -->

With `R`, the test of the mean for a large sample can be applied with the function `t.test` (as above), since the normal distribution is very close to a Student distribution with a large number of degrees of freedom.
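The closeness of the two distributions can be checked on their quantiles. The following sketch compares the 5\% quantile of the Student distribution, for a few arbitrarily chosen degrees of freedom, with the 5\% quantile of $\mathcal{N}(0,1)$.

```R
alpha <- 0.05
df <- c(5, 12, 30, 100, 1000)

# 5% quantiles of the Student distribution for increasing degrees of freedom
student_q <- qt(alpha, df)

# Corresponding quantile of the standard normal distribution
normal_q <- qnorm(alpha)

# The gap between the two quantiles shrinks as df grows
gap <- abs(student_q - normal_q)
data.frame(df, student_q, gap)
```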

...

...

@@ -249,21 +324,23 @@ The previous tests are applied to gaussian samples. When the variable of interes

**Example** For a certain disease, there exists a treatment that cures 70% of the cases. A laboratory proposes a new treatment claiming that it is better than the previous one. Out of 100 patients having received the new treatment, 74 of them have been cured. The expert would like to decide whether the new treatment should be authorized.

{#msg][#md>] **Interpretation** The p-value is `0.2225`. At risk 5\%, we do not reject $H_0$. The new treatment is not significantly better than the standard treatment. It should not be authorized.

[#class]green[#msg}

Note that when the whole binary sample $X$ is available (and not only the count of « successes »), the instruction is `prop.test(sum(X), length(X), p, alternative)`.

<!--dyndoc {#msg][#md>] --> **Interpretation** The p-value is `0.2225`. At risk 5\%, we do not reject $H_0$. The new treatment is not significantly better than the standard treatment. It should not be authorized.

<!-- [#class]green[#msg} -->

Note that when the whole binary sample $X$ is available (and not only the count of « successes »), the instruction is `prop.test(sum(X), length(X), p, alternative)`.
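As an illustration of this form of the call, the sketch below rebuilds a binary sample from the counts of the example (74 cured patients out of 100) and tests $p=0.7$ against the one-sided alternative that the cure rate is greater. The coding of `X` (1 for cured, 0 otherwise) and the choice of `alternative="greater"` are assumptions made for the illustration, as is the default continuity correction of `prop.test`.

```R
# Binary sample rebuilt from the counts of the example:
# 74 cured patients (coded 1) out of 100, the others coded 0
X <- c(rep(1, 74), rep(0, 26))

# One-sided test of the proportion against the reference value 0.7
res <- prop.test(sum(X), length(X), p = 0.7, alternative = "greater")
res$p.value
```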

# Goodness-of-fit tests

A goodness-of-fit test answers the question: could the sample have been drawn at random from a particular distribution?

...

...

@@ -289,24 +372,24 @@ A goodness-of-fit test answers the question: could the sample have been drawn at

For a discrete variable, the goodness-of-fit is measured by a distance between the relative frequencies of the variable, and the probabilities of the target distribution.

For a *discrete variable*, the **chi-squared test** compares the null hypothesis $H_0$: “the observed frequencies fit the theoretical probabilities”.

The alternative is “the observed frequencies do not fit the theoretical probabilities”.

Under $H_0$, the distance follows a chi-squared distribution. The parameter `df` of that chi-squared distribution is the number of different values minus 1, minus the number of estimated parameters, if there are any.

[#msg}

<!--dyndoc [#msg} -->

Under the alternative, the distance should be large, so that the p-value is computed as the right-tail probability of the chi-squared distribution at the distance.

A goodness-of-fit for a discrete variable can be tested using the function `chisq.test`, if no parameter has been estimated. If $X$ is the sample, and $p$ is the distribution, the result is obtained by:

`chisq.test(table(X), p=p)`

[#msg}

<!--dyndoc [#msg} -->

In that command,

...

...

@@ -321,9 +404,13 @@ If some frequencies are too small, a warning message may be issued. If one param

**Back to the example** The frequency table of the three genotypes AA, Aa, aa is (1600, 4900, 3500). The theoretical probabilities are (0.16, 0.48, 0.36). The chi-squared test is applied running

[#R>>]

```R

#[#R>>]

chisq.test(c(1600,4900,3500),p=c(0.16,0.48,0.36))

[#md>]

#[#md>]

```

The outputs are

*`data` the observed data

...

...

@@ -334,27 +421,30 @@ The outputs are

*`p-value` the p-value

{#msg][#md>] The p-value is `0.08799`. At risk 5\%, we do not reject $H_0$. The theoretical probabilities are acceptable.

[#class]green[#msg}

<div style="color:MediumSeaGreen;">

<!--dyndoc {#msg][#md>] --> The p-value is `0.08799`. At risk 5\%, we do not reject $H_0$. The theoretical probabilities are acceptable.

<!-- [#class]green[#msg} -->
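The p-value of the genotype example can also be recovered by hand, which makes the role of the distance and of the degrees of freedom explicit. This is a minimal sketch using the counts and probabilities given above; the variable names are illustrative.

```R
observed <- c(1600, 4900, 3500)
p <- c(0.16, 0.48, 0.36)

# Expected counts under H0
expected <- sum(observed) * p

# Chi-squared distance between observed and expected counts
distance <- sum((observed - expected)^2 / expected)

# Degrees of freedom: number of values minus 1 (no parameter estimated)
df <- length(observed) - 1

# Right-tail p-value
p_value <- pchisq(distance, df = df, lower.tail = FALSE)

c(distance = distance, df = df, p.value = p_value)
```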

## Kolmogorov-Smirnov test

For a *continuous variable*, the goodness-of-fit is measured by a distance between the empirical cumulative distribution function (ecdf) of the variable, and the cdf of the target distribution.
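The distance between the ecdf and the target cdf can be computed by hand. The sketch below uses a simulated sample (the data, the seed and the parameters are assumptions made for the illustration) and checks that the largest vertical gap matches the statistic reported by `ks.test`.

```R
# Simulated sample, for illustration only
set.seed(42)
x <- sort(rnorm(30, mean = 1.2, sd = 1))
n <- length(x)

# Theoretical cdf evaluated at the ordered observations
F0 <- pnorm(x, mean = 1.2, sd = 1)

# The ecdf jumps from (i - 1)/n to i/n at the i-th ordered value,
# so the largest gap is reached just before or at one of these jumps
D <- max(c((1:n) / n - F0, F0 - (0:(n - 1)) / n))

# Same distance as the statistic reported by ks.test
res <- ks.test(x, "pnorm", 1.2, 1)
all.equal(D, unname(res$statistic))
```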

For a *continuous variable*, the **Kolmogorov-Smirnov** test compares the null hypothesis $H_0$: “the empirical distribution of the data fits the theoretical distribution” and $H_1$: “the empirical distribution does not fit the theoretical distribution”.

If `X` is the sample, `dist` is the distribution, and `param` the parameters of that distribution, the result is obtained by:

`ks.test(X, dist, param, alternative)`

[#msg}

<!--dyndoc [#msg} -->

The answer is “the fit is good”, if the p-value is large (above the risk). The variable X should not have ties (equal values). If some values are equal, a warning message is issued, indicating that the p-value is not quite as precise. This does not affect the validity of the result.

The null hypothesis $H_0$ is: “the distribution of the sample is the theoretical cdf”.

The alternative is in «`two.sided`» (default), «`less`», «`greater`»; they are understood as:

The alternative is in «`two.sided`» (default), «`less`», «`greater`»; they are understood as:

* `two.sided`: the ecdf of the sample is different from the theoretical cdf,

...

...

@@ -366,19 +456,27 @@ sample are larger than those of the theoretical distribution),

**Example** The hypoxy level is given in data set `HY`. Let us plot the ecdf of `Level` and the cdf of a normal distribution.

{#imgAvecCode]

```R

#{#imgAvecCode]

HY <- read.table("data/hypoxy.csv", header=TRUE, dec=",")

The red curve is the one of a normal distribution with parameters $\mu=1.2$ and $\sigma=1$. The ecdf (black curve) is quite far from the theoretical cdf. The `Level` variable is probably not normally distributed.

To test if the ecdf of `Level` is a normal distribution with parameters $\mu=1.2$ and $\sigma=1$ or not, run

[#R>>]

```R

#[#R>>]

# mean and sd are passed to pnorm as two separate arguments

ks.test(L, "pnorm", 1.2, 1)

[#md>]

#[#md>]

```

The outputs are

* `data` the observed data

...

...

@@ -387,51 +485,68 @@ The outputs are

* `p-value` the p-value

{#msg][#md>] The p-value is `2.501e-06`. At risk 5\%, we reject $H_0$. The sample `level` does not follow a normal distribution $\mathcal{N}(1.2,1)$.

[#class]green[#msg}

<!--dyndoc {#msg][#md>] --> The p-value is `2.501e-06`. At risk 5\%, we reject $H_0$. The sample `level` does not follow a normal distribution $\mathcal{N}(1.2,1)$.

<!-- [#class]green[#msg} -->

The histogram of the variable `Level` reveals a right-skewed distribution, close to a log-normal distribution. Let us log-transform the data

[#R>>]

```R

#[#R>>]

LL<-log(L)

[#md>]

#[#md>]

```

and plot the ecdf (black curve) of the log-Level and the cdf (red curve) of a normal distribution with parameters $\log(1.2)\approx 0.2$ and 1.

The two curves are quite close. Let us test if the empirical cdf of `LL` is a normal distribution with parameters $\log(1.2)\approx 0.2$ and 1

[#R>>]

```R

#[#R>>]

# mean and sd are passed to pnorm as two separate arguments

ks.test(LL, "pnorm", 0.2, 1)

[#md>]
#[#md>]

```

{#msg][#md>] The p-value is still very small. At risk 5\%, we reject $H_0$. The sample `log-level` does not follow a normal distribution $\mathcal{N}(0.2,1)$.

<!--dyndoc {#msg][#md>] --> The p-value is still very small. At risk 5\%, we reject $H_0$. The sample `log-level` does not follow a normal distribution $\mathcal{N}(0.2,1)$.

<!-- [#class]green[#msg} -->

Remark: the difference on the left of the two curves (ecdf and cdf) is large enough to reject the null hypothesis.

## Normality test

Testing whether a variable is normally distributed is different from testing whether a particular normal distribution with given parameters fits the variable.