Commit cc247a9c authored by Florent Chatelain's avatar Florent Chatelain

up code+exo

parent 00bb10d2
......@@ -11,15 +11,20 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"# GMM covariances\n",
"\n",
"Demonstration of several covariances types for Gaussian mixture models (GMM).\n",
"\n",
"Demonstration of several covariances types for Gaussian mixture models.\n",
"See the [sklearn guide on GMM](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture) for more information on the estimator.\n",
"\n",
"See https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture for more information on the estimator.\n",
"We apply GMM on the iris dataset. This dataset is four-dimensional, but only the first two\n",
"dimensions are shown here, and thus some points are separated in other\n",
"dimensions.\n",
"We plot the shape of the estimated clusters and the AIC/BIC criteria to estimate the optimal number of clusters.\n",
"Note that we initialize the means of the Gaussians with the means of the true classes from the training set to make\n",
"this comparison valid for several covariances types for GMM.\n",
"\n",
"Although GMM are often used for clustering, we can compare the obtained\n",
"<!--Although GMM are often used for clustering, we can compare the obtained\n",
"clusters with the actual classes from the dataset. We initialize the means\n",
"of the Gaussians with the means of the classes from the training set to make\n",
"this comparison valid.\n",
......@@ -35,7 +40,8 @@
"On the plots, train data is shown as dots, while test data is shown as\n",
"crosses. The iris dataset is four-dimensional. Only the first two\n",
"dimensions are shown here, and thus some points are separated in other\n",
"dimensions.\n",
"dimensions.-->\n",
"\n",
"\n"
]
},
......@@ -43,8 +49,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This is a simplified version of the code that can be found here:\n",
"https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html#sphx-glr-auto-examples-mixture-plot-gmm-covariances-py"
"This is a simplified version of the code of this [sklearn example](https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html#sphx-glr-auto-examples-mixture-plot-gmm-covariances-py)."
]
},
{
......@@ -101,8 +106,13 @@
"outputs": [],
"source": [
"# Try GMMs using full covariance (no constraints imposed on cov)\n",
"\n",
"cv_type = \"full\"\n",
"#cv_type = \"tied\"\n",
"#cv_type = \"diagonal\"\n",
"#cv_type = \"spherical\"\n",
"estimator = GaussianMixture(\n",
" n_components=n_classes, covariance_type=\"full\", max_iter=50, random_state=0\n",
" n_components=n_classes, covariance_type=cv_type, max_iter=50, random_state=0\n",
")"
]
},
......@@ -326,8 +336,7 @@
"bic = []\n",
"aic = []\n",
"n_components_range = range(2, 7)\n",
"#cv_type = \"spherical\"\n",
"cv_type = \"full\"\n",
"\n",
"\n",
"for n_comp in n_components_range:\n",
" # Fit a Gaussian mixture with EM\n",
......@@ -352,11 +361,13 @@
]
},
{
"cell_type": "code",
"execution_count": null,
"cell_type": "markdown",
"metadata": {},
"outputs": [],
"source": []
"source": [
"### (Optional) Exercise 12\n",
"- Comment on the number of clusters estimated with the BIC and AIC criteria respectively.\n",
"- Change the shape of the clusters by setting some constraints on the GMM covariances (`tied` for a covariance common to all clusters, `diagonal`, or `spherical` for a covariance proportional to the identity matrix, see cell 3) and re-run the estimation/visualization/model selection cells. Comment on the cluster shapes and the AIC/BIC criterion obtained."
]
}
],
"metadata": {
......
%% Cell type:markdown id: tags:
This notebook can be run on mybinder: [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/git/https%3A%2F%2Fgricad-gitlab.univ-grenoble-alpes.fr%2Fchatelaf%2Fml-sicom3a/master?urlpath=lab/tree/notebooks/7_EM_iris_data_example/)
%% Cell type:markdown id: tags:
# GMM covariances
Demonstration of several covariance types for Gaussian mixture models (GMM).
Demonstration of several covariance types for Gaussian mixture models.
See the [sklearn guide on GMM](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture) for more information on the estimator.
See https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture for more information on the estimator.
We apply a GMM to the iris dataset. This dataset is four-dimensional, but only the first two
dimensions are shown here, so some points that are separated in other
dimensions may overlap in the plots.
We plot the shape of the estimated clusters and use the AIC/BIC criteria to estimate the optimal number of clusters.
Note that we initialize the means of the Gaussians with the means of the true classes from the training set to make
this comparison valid across the several covariance types.
Although GMM are often used for clustering, we can compare the obtained
<!--Although GMM are often used for clustering, we can compare the obtained
clusters with the actual classes from the dataset. We initialize the means
of the Gaussians with the means of the classes from the training set to make
this comparison valid.
We plot predicted labels on both training and held out test data using a
variety of GMM covariance types on the iris dataset.
We compare GMMs with spherical, diagonal, full, and tied covariance
matrices in increasing order of performance. Although one would
expect full covariance to perform best in general, it is prone to
overfitting on small datasets and does not generalize well to held out
test data.
On the plots, train data is shown as dots, while test data is shown as
crosses. The iris dataset is four-dimensional. Only the first two
dimensions are shown here, and thus some points are separated in other
dimensions.
dimensions.-->
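%% Cell type:markdown id: tags:
For reference, the two model selection criteria used below are, for a fitted model with maximized likelihood $\hat{L}$, $k$ free parameters and $n$ samples,
$$\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln(n) - 2\ln\hat{L}.$$
Both penalize the number of parameters (lower is better), the BIC more strongly as soon as $n > e^2 \approx 7.4$.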
%% Cell type:markdown id: tags:
This is a simplified version of the code that can be found here:
https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html#sphx-glr-auto-examples-mixture-plot-gmm-covariances-py
This is a simplified version of the code from this [sklearn example](https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html#sphx-glr-auto-examples-mixture-plot-gmm-covariances-py).
%% Cell type:code id: tags:
``` python
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import StratifiedKFold
%matplotlib inline
```
%% Cell type:markdown id: tags:
# Iris data - EM clustering
Note that while labels are available for this data set, they will not be used in GMM identification.
%% Cell type:code id: tags:
``` python
iris = datasets.load_iris()
X_train = iris.data
y_train = iris.target
n_classes = len(
np.unique(y_train)
)  # number of different labels, i.e. the number of unique elements in y_train
```
%% Cell type:markdown id: tags:
## GMM model estimation
%% Cell type:code id: tags:
``` python
# Try GMMs using full covariance (no constraints imposed on cov)
cv_type = "full"
#cv_type = "tied"
#cv_type = "diagonal"
#cv_type = "spherical"
estimator = GaussianMixture(
n_components=n_classes, covariance_type="full", max_iter=50, random_state=0
n_components=n_classes, covariance_type=cv_type, max_iter=50, random_state=0
)
```
%% Cell type:markdown id: tags:
# !!
The lines below initialize the Gaussian means with the center of mass of each class, as the labels are known...
Usually, the 3 initial centers would instead be chosen at random. In that latter case, the correct
clusters are still extracted, up to some permutation of the labels.
%% Cell type:code id: tags:
``` python
# Since we have class labels for the training data, we can
# initialize the GMM parameters in a supervised manner.
estimator.means_init = np.array(
[X_train[y_train == i].mean(axis=0) for i in range(n_classes)]
)
estimator.fit(X_train)
```
%%%% Output: execute_result
GaussianMixture(max_iter=50,
means_init=array([[5.006, 3.428, 1.462, 0.246],
[5.936, 2.77 , 4.26 , 1.326],
[6.588, 2.974, 5.552, 2.026]]),
n_components=3, random_state=0)
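%% Cell type:markdown id: tags:
To illustrate the remark above on random initialization, a minimal sketch (assuming the previous cells have been run; `gmm_rand` is a hypothetical name) fits a second GMM with `init_params="random"` and matches its arbitrary cluster indices to the true classes with a confusion matrix and the Hungarian assignment:
%% Cell type:code id: tags:
``` python
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

# fit a GMM with random initialization instead of the class means used above
gmm_rand = GaussianMixture(
    n_components=n_classes, covariance_type="full", max_iter=200,
    init_params="random", random_state=0,
)
y_pred = gmm_rand.fit_predict(X_train)

# cluster indices are arbitrary: find the label permutation that best
# matches the true classes, then report the corresponding accuracy
cm = confusion_matrix(y_train, y_pred)
row_ind, col_ind = linear_sum_assignment(-cm)
print("accuracy up to label permutation:", cm[row_ind, col_ind].sum() / cm.sum())
```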
%% Cell type:code id: tags:
``` python
print(estimator.covariances_)
# print(estimator.covariances_[1][1:3,1:3])
```
%%%% Output: stream
[[[0.121765 0.097232 0.016028 0.010124 ]
[0.097232 0.140817 0.011464 0.009112 ]
[0.016028 0.011464 0.029557 0.005948 ]
[0.010124 0.009112 0.005948 0.010885 ]]
[[0.27555846 0.09657992 0.18562554 0.05486324]
[0.09657992 0.09253766 0.0910186 0.04299954]
[0.18562554 0.0910186 0.20266592 0.06184329]
[0.05486324 0.04299954 0.06184329 0.03239585]]
[[0.38754333 0.09223946 0.30243691 0.06078315]
[0.09223946 0.11041919 0.08379036 0.05570474]
[0.30243691 0.08379036 0.32566732 0.07253285]
[0.06078315 0.05570474 0.07253285 0.08471718]]]
%% Cell type:code id: tags:
``` python
print(estimator.means_)
#estimator.means_[0, 0::2]
```
%%%% Output: stream
[[5.006 3.428 1.462 0.246 ]
[5.91743867 2.77807834 4.20602834 1.29872958]
[6.54663103 2.94958567 5.48422408 1.98765097]]
%% Cell type:markdown id: tags:
## Plotting results
Choose the axis pair to visualize.
%% Cell type:code id: tags:
``` python
# for K clusters, specify K colors (here K=3)
colors = ["navy", "turquoise", "darkorange"]
fig, ax = plt.subplots(subplot_kw={"aspect": "equal"})
axes = ["x2", "x4"]
for n, color in enumerate(colors):
    # define the ellipse parameters of each cluster, using the eigen-axes of its covariance
data = iris.data[iris.target == n]
if axes == ["x1", "x2"]:
covariances = estimator.covariances_[n][0:2, 0:2]
plt.scatter(
data[:, 0], data[:, 1], s=10, color=color, label=iris.target_names[n]
)
Est_means = estimator.means_[n, 0:2]
elif axes == ["x1", "x3"]:
covariances = estimator.covariances_[n][0::2, 0::2]
plt.scatter(
data[:, 0], data[:, 2], s=10, color=color, label=iris.target_names[n]
)
Est_means = estimator.means_[n, 0::2]
elif axes == ["x1", "x4"]:
covariances = estimator.covariances_[n][0::3, 0::3]
plt.scatter(
data[:, 0], data[:, 3], s=10, color=color, label=iris.target_names[n]
)
Est_means = estimator.means_[n, 0::3]
elif axes == ["x2", "x3"]:
covariances = estimator.covariances_[n][1:3, 1:3]
plt.scatter(
data[:, 1], data[:, 2], s=10, color=color, label=iris.target_names[n]
)
Est_means = estimator.means_[n, 1:3]
elif axes == ["x2", "x4"]:
covariances = estimator.covariances_[n][1::2, 1::2]
plt.scatter(
data[:, 1], data[:, 3], s=10, color=color, label=iris.target_names[n]
)
Est_means = estimator.means_[n, 1::2]
elif axes == ["x3", "x4"]:
covariances = estimator.covariances_[n][2:, 2:]
plt.scatter(
data[:, 2], data[:, 3], s=10, color=color, label=iris.target_names[n]
)
Est_means = estimator.means_[n, 2:]
v, w = np.linalg.eigh(covariances)
u = w[0] / np.linalg.norm(w[0])
angle = np.arctan2(u[1], u[0])
angle = 180 * angle / np.pi # convert to degrees
v = 2.0 * np.sqrt(2.0) * np.sqrt(v)
    ell = mpl.patches.Ellipse(Est_means, v[0], v[1], angle=180 + angle, color=color)
    # plot the ellipses
ell.set_clip_box(ax.bbox)
ell.set_alpha(0.5)
ax.add_artist(ell)
ax.set_aspect("auto")
# for visualizing axe1 vs axe2, use "covariances = estimator.covariances_[n][0:2, 0:2]"
# for visualizing axe1 vs axe3, use "covariances = estimator.covariances_[n][0::2, 0::2]"
# for visualizing axe1 vs axe4, use "covariances = estimator.covariances_[n][0::3, 0::3]"
# for visualizing axe2 vs axe3, use "covariances = estimator.covariances_[n][1:3, 1:3]"
# for visualizing axe2 vs axe4, use "covariances = estimator.covariances_[n][1::2, 1::2]"
# for visualizing axe3 vs axe4, use "covariances = estimator.covariances_[n][2:, 2:]"
```
%%%% Output: display_data
[Hidden Image Output]
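%% Cell type:markdown id: tags:
The if/elif chain above only extracts the 2x2 block of the estimated mean and covariance corresponding to the chosen feature pair. As a sketch (with a hypothetical `project_2d` helper, and assuming `covariance_type="full"` so that `covariances_[n]` is a 4x4 matrix), the same selection can be written generically with `np.ix_`:
%% Cell type:code id: tags:
``` python
def project_2d(mean, cov, i, j):
    # hypothetical helper: keep only the features with (0-based) indices i and j
    return mean[[i, j]], cov[np.ix_([i, j], [i, j])]

# example: the x2/x4 pair (indices 1 and 3) of the first component
m2d, c2d = project_2d(estimator.means_[0], estimator.covariances_[0], 1, 3)
print(m2d)
print(c2d)
```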
%% Cell type:code id: tags:
``` python
import itertools
from scipy import linalg
from sklearn import mixture
lowest_bic = np.inf
bic = []
aic = []
n_components_range = range(2, 7)
#cv_type = "spherical"
cv_type = "full"
for n_comp in n_components_range:
# Fit a Gaussian mixture with EM
gmm = GaussianMixture(
n_components=n_comp, covariance_type=cv_type, max_iter=1000, random_state=1
)
gmm.fit(X_train)
# bic.append(gmm.aic(X_train))
bic.append(gmm.bic(X_train))
aic.append(gmm.aic(X_train))
bic = np.array(bic)
aic = np.array(aic)
# Plot the BIC scores
plt.plot(np.linspace(2, 6, 5), bic, "b", label="bic")
plt.plot(np.linspace(2, 6, 5), aic, "r", label="aic")
plt.legend()
plt.grid(True)
print("bic = {}".format(bic))
print("aic = {}".format(aic))
```
%%%% Output: stream
bic = [574.01783272 580.86127847 629.83784882 674.43878158 726.14923081]
aic = [486.70940919 448.39332553 452.21036647 451.65176982 458.20268963]
%%%% Output: display_data
[Hidden Image Output]
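%% Cell type:markdown id: tags:
To read off the selected model sizes from the curves above, a small sketch reusing the `bic` and `aic` arrays computed in the previous cell:
%% Cell type:code id: tags:
``` python
# numbers of components tested in the previous cell
n_values = np.arange(2, 7)
print("BIC selects n_components =", n_values[np.argmin(bic)])
print("AIC selects n_components =", n_values[np.argmin(aic)])
```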
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
### (Optional) Exercise 12
- Comment on the number of clusters estimated with the BIC and AIC criteria respectively.
- Change the shape of the clusters by setting some constraints on the GMM covariances (`tied` for a covariance common to all clusters, `diag` for a diagonal covariance, or `spherical` for a covariance proportional to the identity matrix, see cell 3) and re-run the estimation/visualization/model selection cells. Comment on the cluster shapes and on the AIC/BIC criteria obtained.
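%% Cell type:markdown id: tags:
As a possible starting point for the second bullet (a sketch only; the exercise itself asks to re-run the cells above with the different `cv_type` values), one can also loop directly over the four covariance types and compare their BIC curves:
%% Cell type:code id: tags:
``` python
# sketch: BIC as a function of n_components for each covariance type
for cov_type in ["spherical", "diag", "tied", "full"]:
    bics = []
    for n_comp in range(2, 7):
        gmm = GaussianMixture(
            n_components=n_comp, covariance_type=cov_type,
            max_iter=1000, random_state=1,
        )
        gmm.fit(X_train)
        bics.append(gmm.bic(X_train))
    plt.plot(range(2, 7), bics, label=cov_type)
plt.xlabel("n_components")
plt.ylabel("BIC")
plt.legend()
plt.grid(True)
```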
......