Commit 201d848c authored by Florent Chatelain

parents f243ea3c c3a6f354
# ml-sicom3a
# Python notebooks for Sicom ML course
You will find in this repository the material associated with the course:
- Slides (pdf files) for the lessons,
- **Examples with [Jupyter notebooks](./notebooks/)** to illustrate concepts and methods in Python (.ipynb files)
## Requirements to run notebooks:
**Two** solutions:
1. Python (> 3.3), Jupyter notebook and the scikit-learn package. It is recommended to install them via the Anaconda distribution, which provides these dependencies directly.
**Or**
2. Use the mybinder service to run them remotely and interactively (wait a few seconds for the first connection so that the environment loads):
- [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/git/https%3A%2F%2Fgricad-gitlab.univ-grenoble-alpes.fr%2Fchatelaf%2Fml-sicom3a/54301940e4486a8ece22a910c3efa1b2734ed82d?filepath=notebooks)
link to run the examples, *except the Deep Learning ones*, which are too computationally demanding for the JupyterHub server (use the first solution to run these notebooks with your own resources...)
close all
clear all
%% Load synthetic data sets (see question 1.1)
nom_fich_train= 'synth_train.mat' ;
nom_fich_test= 'synth_test.mat' ;
load(nom_fich_train) ;% import training dataset into variables Xtrain, Ytrain
load(nom_fich_test) ; % import test dataset into variables Xtest, Ytest
% Get the sample size n and the data point dimension p (here p=2)
[n,p]= size( Xtrain )
% binary classification problem
K=2 ;
%% Training set
% Get the indices of the samples for each class separately
ind1= (Ytrain==1) ;
ind2= (Ytrain==2) ;
% display bivariate data points (see lab question 1.1)
figure,
hold on
plot(Xtrain(ind1,1), Xtrain(ind1,2),'+r') ;
plot(Xtrain(ind2,1), Xtrain(ind2,2),'oc' );
%% Train LDA classifier by hand (see Lab question 1.2)
% Compute the parameter estimates (unbiased variants of the MLEs)
ind1= (Ytrain==1) ;
ind2= (Ytrain==2) ;
n1= sum(ind1) ;
n2= sum(ind2) ;
% weights \pi_k for each class
pi1hat= n1/n ;
pi2hat= n2/n ;
% mean vector \mu_k for each class
mu1hat= mean( Xtrain(ind1,:) ) ;
mu1hat= mu1hat(:) ; % get a column vector
mu2hat= mean( Xtrain(ind2,:) ) ;
mu2hat= mu2hat(:) ; % get a column vector
% common pooled covariance matrix
% sample covariances C_k for each class
% Here covkhat= 1 / (nk-1) * sum_{for i in indk} ( X_i - mukhat) * ( X_i - mukhat)'
cov1hat= cov( Xtrain(ind1,:) ) ;
cov2hat= cov( Xtrain(ind2,:) ) ;
% TODO (Lab question 1.2)
% - compute the pooled sample covariance SigmaLDAhat
% - compute the coeffs for the decision boundary equation:
% constant CLDA and linear vector LLDA
% - plot the decision boundary
% - compute misclassification rate
SigmaLDAhat= eye(p) ; %FIXME
CLDA= 0 ; %FIXME
LLDA= ones(2,1) ; %FIXME
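% A possible sketch of the FIXME quantities above (one standard derivation of the
% LDA boundary between classes 1 and 2; check it against your own calculations
% before uncommenting):
% SigmaLDAhat= ( (n1-1)*cov1hat + (n2-1)*cov2hat ) / (n - K) ;  % pooled covariance
% LLDA= SigmaLDAhat \ ( mu1hat - mu2hat ) ;                     % linear coefficients
% CLDA= log(pi1hat/pi2hat) - 0.5 * (mu1hat + mu2hat)' * LLDA ;  % constant term
% scoreTest= CLDA + Xtest * LLDA ;     % positive score => class 1
% YhatLDA= 2 - ( scoreTest > 0 ) ;     % 1 if positive, 2 otherwise
% mcrLDA= mean( YhatLDA ~= Ytest(:) )  % test misclassification rate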
% Plot the decision boundary
boundaryLDA= @(x,y) ( CLDA + LLDA(:)' * [x(:)' ; y(:)'] ) ; % boundary function
h= ezplot( boundaryLDA ,[-3 3 -1 4] ) ; % recent MATLAB versions recommend fimplicit() instead of ezplot(); the warning can safely be ignored here
set(h, 'Color', 'k','LineWidth',2); % draw a black solid line
%% Compare with matlab built-in functions for LDA/QDA classification (Lab questions 1.3 and 1.4)
% fitcdiscr() to fit a discriminant analysis classifier (training step). See the 'DiscrimType' parameter in the doc
% predict() for prediction step
% Note: in older versions of matlab, fitcdiscr() is named
% ClassificationDiscriminant.fit()
% Example of a call for LDA
mdl = fitcdiscr(Xtrain,Ytrain) ;
% The coefficients of the decision boundary equation between classes 1 and 2 are then
% stored in the following structure (see the Const and Linear fields)
mdl.Coeffs(1,2)
% Classify test data
Yhat= predict(mdl,Xtest) ;
% TODO: Complete the lab questions
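% A possible sketch for the remaining questions (the 'quadratic' option below selects
% the built-in QDA variant; adapt to the exact questions in the lab statement):
% mcrLDA= mean( Yhat ~= Ytest )                                    % LDA test misclassification rate
% mdlQDA= fitcdiscr(Xtrain, Ytrain, 'DiscrimType', 'quadratic') ;  % QDA classifier
% YhatQDA= predict(mdlQDA, Xtest) ;
% mcrQDA= mean( YhatQDA ~= Ytest )                                 % QDA test misclassification rate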
close all
clear all
%% Zip Codes
% Load zip data sets
nom_fich_train= 'zip_train.mat' ;
nom_fich_test= 'zip_test.mat' ;
load(nom_fich_train) ; % import training dataset into variables Xtrain, Ytrain
load(nom_fich_test) ; % import test dataset into variables Xtest, Ytest
% Get the sample size n and the data point dimension p (here p=256)
[n,p]= size( Xtrain )
% Multiclass problem: one class per digit
K=10 ;
% Display some examples (Lab question 2.1)
figure,
title('Examples of training samples');
subplot(4,3,1) ;
for k=0:K-1
indk= find(Ytrain==k,1,'first') ;
subplot(4,3,k+1) ;
imshow( ( reshape(Xtrain(indk,:),16,16) + 1 )'*.5 ) ;
end
%% Lab Question 2.4
% Display some data generated with the LDA model fitted to the training set
% Get the common pooled covariance matrix (LDA assumption)
SigmaLDA= zeros(p) ;
for k=0:K-1
indk= (Ytrain==k) ;
nk= sum(indk) ;
SigmaLDA= SigmaLDA + ( cov( Xtrain(indk,:) ) ) * ( nk -1) ;
end
SigmaLDA= diag( diag(SigmaLDA) ) / ( numel(Ytrain) - K ) ; % keep only the diagonal and normalize by the pooled degrees of freedom
if (0) %FIXME
figure,
title('Realizations from the LDA generative model');
subplot(4,3,1) ;
for k=0:K-1
% Get the sample mean for each class separately
muk= zeros(1,p) ; %FIXME
subplot(4,3,k+1) ;
% generate a gaussian random vector with LDA parameters
genk= muk + randn(1,p) * sqrt(SigmaLDA) ;
% display as an image
imshow((reshape(genk,16,16)+1)'*.5 ) ;
end
end
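% A possible sketch for the FIXMEs in the block above (set the if condition to 1 and
% replace the muk line, e.g.):
% muk= mean( Xtrain(Ytrain==k,:), 1 ) ;   % sample mean of class k
% Note: sqrt(SigmaLDA) is a valid matrix square root here only because SigmaLDA was
% made diagonal above; for a full covariance matrix use chol(SigmaLDA) instead.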
%% Lab Questions 3.2, 3.3, 3.4: Binary SVM classification
% Binary SVM classification between the pair of most confused classes
label1= 1 ; % FIXME
label2= 2 ; % FIXME
% Retain only the data associated with these two classes to perform
% binary SVM classification
indl1= (Ytrain== label1) ;
indl2= (Ytrain== label2) ;
Ytrainl12= [Ytrain(indl1); Ytrain(indl2)] ;
Xtrainl12= [Xtrain(indl1,:); Xtrain(indl2,:)] ;
indtestl1= (Ytest== label1) ;
indtestl2= (Ytest== label2) ;
Ytestl12= [Ytest(indtestl1); Ytest(indtestl2)] ;
Xtestl12= [Xtest(indtestl1,:); Xtest(indtestl2,:)] ;
% Perform K-fold cross-validation to optimize the SVM hyperparameters
% Create K-fold partition
nl12= numel(Ytrainl12) ;
c = cvpartition(nl12,'KFold',10); % 10 fold
% Evaluate test error for default values of the SVM parameters
z.box= 1 ; % Box constraint hyperparameter to penalize slack variables
z.sigma= 10 ; % Kernel bandwidth hyperparameter (RBF kernel)
mcr12= kfoldLoss( fitcsvm(Xtrainl12,Ytrainl12,'CVPartition',c,...
'KernelFunction','rbf','BoxConstraint',z.box,...
'KernelScale',z.sigma))
% TODO: test different orders of magnitude for these two parameters
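% A possible sketch of a coarse grid search over the two hyperparameters (the grids
% below are illustrative assumptions, not the values prescribed by the lab):
% boxes= 10.^(-2:2) ;   % candidate box constraints
% sigmas= 10.^(-1:3) ;  % candidate kernel scales
% mcr= zeros(numel(boxes), numel(sigmas)) ;
% for ib= 1:numel(boxes)
%     for is= 1:numel(sigmas)
%         mcr(ib,is)= kfoldLoss( fitcsvm(Xtrainl12, Ytrainl12, 'CVPartition', c, ...
%             'KernelFunction', 'rbf', 'BoxConstraint', boxes(ib), 'KernelScale', sigmas(is)) ) ;
%     end
% end
% [~, imin]= min( mcr(:) ) ;
% [ibest, isbest]= ind2sub( size(mcr), imin ) ;
% bestBox= boxes(ibest), bestSigma= sigmas(isbest)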
close all
clear all
%% Questions 1.1 and 1.4. Displaying fictitious data
nom_fich_in= 'fictitious_train.mat' ;
load( nom_fich_in ) ; % import dataset into variables Xtrain
% Get the sample size N (here data are univariate, i.e. p=1)
N=length(Xtrain) ;
% Display datapoints and their histogram
figure,
hold on
bins= min(Xtrain)-1:1:max(Xtrain)+1 ;
hist(Xtrain,bins) ;
plot(Xtrain,zeros(size(Xtrain)),'xr','MarkerSize',14) ;
%% Question 1.2. Implement Kmeans
K=2 ; % number of clusters
% Initialization
change= true ; % flag for the stopping criterion
perm = randperm( N , K ) ; % randomly pick K distinct data point indexes for the initial centroids
idx= zeros(N,1) ; % cluster indexes/labels
muvec= zeros(K,1) ; % mean values for each cluster
% Choose data points with random indexes perm(k) as starting centroid values
for k=1:K
muvec(k)= Xtrain( perm(k) ,: ) ;
end
% Compute the initial labels for each data point
for i=1:N
% Find the index indm of the centroid closest to Xtrain(i)
[vm, indm] = min( mean( (Xtrain(i) - muvec ).^2 , 2 ) ) ;
idx(i)= indm ;
end
% loop until convergence
while (change)
change= false ;
% Update
for k=1:K
muvec(k)= 0 ; % FIXME
end
% Assignment
for i=1:N
indm=1 ; %FIXME: compute the index of the cluster where Xtrain(i) is assigned
if ( indm ~= idx(i) )
idx(i)= indm ;
change= true ;
end
end
end
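% A possible sketch for the two FIXME steps inside the loop above (standard k-means
% updates; verify against the course notes):
% Update step:     muvec(k)= mean( Xtrain( idx==k ) ) ;          % centroid = cluster mean
% Assignment step: [~, indm]= min( ( Xtrain(i) - muvec ).^2 ) ;  % index of the closest centroid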
%% Display clusters and decision boundary
figure,
hold on
plot(Xtrain,zeros(size(Xtrain)),'xk','MarkerSize',14) ;
plot(muvec(1),0,'pg','MarkerSize',16) ;
plot(muvec(2),0,'pb','MarkerSize',16) ;
plot([ mean(muvec) mean(muvec)] , [0 1], '-r' ) ;
%% Question 1.3. Use Matlab kmeans built-in functions
% use the same initialization as in the previous question
[idxkm, centroidkm]= kmeans( Xtrain, K, 'start', Xtrain(perm) ) ;
%TODO: compare the results with your kmeans implementation, i.e.
% - muvec VS centroidkm
% - idx VS idxkm
%% Question 1.5. Implement EM for Gaussian Mixture Model (GMM) clustering
K=2 ;
MaxIter= 100 ; % stopping criterion
% Initialization
pivec= ones(K,1)* 1/K ; % initial prior weights for each cluster
perm = randperm( N , K ) ; % randomly pick K distinct data point indexes for the initial means
muvec= zeros(K,1) ; % mean values for each cluster
sigvec= ones(K,1) ; % variance values sigma^2 for each cluster
postpr= zeros(N,K) ; % posterior probabilities (responsibilities) for each cluster and each data point
% Choose data points with random indexes perm(k) as starting centroid values
for k=1:K
muvec(k,:)= Xtrain( perm(k) ,: ) ;
sigvec(k,:)= var(Xtrain) ;
end
% Loop until maximum number of iterations is reached
for t=1:MaxIter
% E step: Prediction step
for i=1:N
% Compute the normal pdfs p( X_i | Y_i=k ) for current parameters,
% and for each cluster k=1,2
g1= normpdf(Xtrain(i), muvec(1,:), sqrt( sigvec(1,:) ) ) ;
g2= normpdf(Xtrain(i), muvec(2,:), sqrt( sigvec(2,:) ) ) ;
% Compute the marginal distribution of data point X_i:
% p(X_i)= \sum_k \pi_k * normpdf(X_i, \mu_k, \sigma_k )
px= pivec(1) * g1 + pivec(2) * g2 ;
% Compute the posterior proba p( Y_i=k | X_i ) for each cluster k=1,2
postpr(i,1)= 0.5 ; % (cluster k=1) FIXME
postpr(i,2)= 0.5 ; % (cluster k=2) FIXME
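% (a possible sketch, applying Bayes' rule to the quantities computed above:)
% postpr(i,1)= pivec(1) * g1 / px ;
% postpr(i,2)= pivec(2) * g2 / px ;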
end
% M step: Update step
for k=1:K
pivec(k,:)= mean( postpr(:,k) ) ; % new prior cluster weight
muvec(k,:)= sum( postpr(:,k) .* Xtrain ) ./ sum( postpr(:,k) ) ; % new cluster mean
sigvec(k,:)= 1 ; %FIXME: compute new cluster variance, for each cluster
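% (a possible sketch: responsibility-weighted sample variance around the new mean)
% sigvec(k,:)= sum( postpr(:,k) .* ( Xtrain - muvec(k,:) ).^2 ) ./ sum( postpr(:,k) ) ;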
end
end
%% Question 1.6: Display pdf estimates
figure,
title( 'density estimates' ) ;
hold on
plot(Xtrain,zeros(size(Xtrain)),'xk','MarkerSize',14) ;
xval= linspace(-1,7,1000) ;
g1= normpdf(xval, muvec(1), sqrt( sigvec(1) ) ) ;
g2= normpdf(xval, muvec(2), sqrt( sigvec(2) ) ) ;
g= pivec(1) * g1 + pivec(2) * g2 ;
plot( xval, g1, '-c' ) ;
plot( xval, g2, '-m' ) ;
plot( xval, g, '-b' ) ;
%% Question 1.7: Use Matlab EM/GMM built-in functions
p= 1 ; % Here data points are univariate
strParIni.PComponents= pivec ;
strParIni.mu= muvec ;
strParIni.Sigma= zeros(p,p,K) ;
strParIni.Sigma(1,1,:)= sigvec ;
% Options for better precision
% options=statset( 'MaxIter', 200, 'TolX', 1e-8, 'TolFun',1e-10 ) ;
%
% obj= gmdistribution.fit(Xtrain,K, 'replicates', 1, 'options', options, ...
%'start', strParIni) ;
obj= gmdistribution.fit(Xtrain,K, 'replicates', 1, 'start', strParIni) ; % gmdistribution.fit is the older name of fitgmdist
% Get cluster centroids and weights
gmestPi= obj.PComponents ;
gmestMu= obj.mu' ;
gmestSig= squeeze( obj.Sigma )' ;
%TODO: compare the results with your EM implementation, i.e.
% - prior weights: pivec VS gmestPi
% - cluster means: muvec VS gmestMu
% - cluster variances: sigvec VS gmestSig
%% Question 1.8: Kmeans vs EM
% TODO: plot both
% - kmeans decision boundary (hard thresholding)
% - GMM responsibility curves (soft thresholding)
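% A possible sketch (reusing xval, g1, g2, pivec from question 1.6 and centroidkm from
% question 1.3; adapt the variable names if yours differ):
% figure, hold on
% plot(Xtrain, zeros(size(Xtrain)), 'xk', 'MarkerSize', 14) ;
% resp1= pivec(1)*g1 ./ ( pivec(1)*g1 + pivec(2)*g2 ) ;     % GMM responsibility of cluster 1 (soft)
% plot(xval, resp1, '--b') ;
% plot([mean(centroidkm) mean(centroidkm)], [0 1], '-r') ;  % kmeans boundary (hard threshold)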
close all
clear all
%% Question 2.1 Loading/Displaying CHD data
nom_fich_in= 'chd_train.mat' ;
load(nom_fich_in) ; % import dataset into variables Xtrain (and Ytrain for case/control labels)
K=2 ; % binary clustering
% Get the sample size N (here data are univariate, i.e. p=1)
N= length(Xtrain) ;
% display a histogram of the data
figure,
hold on
histogram(Xtrain,'BinMethod','integers','BinWidth',5)
%% Question 2.2.
options=statset( 'MaxIter', 500, 'TolX', 1e-8, 'TolFun',1e-10 ) ;
% Retain the best partition (in the GMM likelihood sense) from 10 random
% initializations
mdlGMM = fitgmdist(Xtrain,K,'Replicates',10,'Options',options) ;
% Get the partition
idx = cluster(mdlGMM,Xtrain) ;
% Get the cluster parameters
mu0= mdlGMM.mu(1) ;
mu1= mdlGMM.mu(2) ;
sig0=squeeze( mdlGMM.Sigma(1,1,1) ) ;
sig1=squeeze( mdlGMM.Sigma(1,1,2) ) ;
pi0=squeeze( mdlGMM.PComponents(1) ) ;
pi1=squeeze( mdlGMM.PComponents(2) ) ;
xval= linspace(10,70,1000) ;
% plot the GMM pdf
figure,
hold on
ezplot(@(x)pdf(mdlGMM,x), [10 70 0 .1] ) % recent MATLAB versions recommend fplot() instead of ezplot() here; the warning can safely be ignored
% TODO:
% compute and display on the same figure the pdfs g0 and g1 of each cluster
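% (a possible sketch: plot each component pdf, weighted here by its prior; drop
% pi0/pi1 for the plain conditional pdfs)
% g0= pi0 * normpdf(xval, mu0, sqrt(sig0)) ;
% g1= pi1 * normpdf(xval, mu1, sqrt(sig1)) ;
% plot(xval, g0, '--r') ;
% plot(xval, g1, '--b') ;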
%% Question 2.4
% TODO: Use the responses Ytrain to compare
% - the case/control classes with the clusters
% - EM/GMM clustering vs QDA supervised classification
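% A possible sketch (assuming Ytrain holds the case/control labels; crosstab gives
% the confusion table between labels and clusters, up to a permutation of the
% cluster indexes):
% crosstab(Ytrain, idx)                                           % clusters vs case/control classes
% mdlQDA= fitcdiscr(Xtrain, Ytrain, 'DiscrimType', 'quadratic') ; % supervised QDA on the same data
% crosstab(Ytrain, predict(mdlQDA, Xtrain))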
close all
clear all
%% Question 3.1. Displaying bivariate mixture data
nom_fich_in= 'mixture_train.mat' ;
load(nom_fich_in) ; % import dataset into variables Xtrain (and Ytrain for class labels)
K=2 ; % binary clustering
p= 2 ; %bivariate data
% Get the sample size N (here data are bivariate, i.e. p=2)
N= length( Xtrain(:,1 ) )
% Display data points
figure
hold on
plot( Xtrain(:,1) , Xtrain(:,2), 'ok' ) ;
%% TODO Question 3.2 EM clustering
%% TODO Question 3.3 Confusion matrix and misclassification rate for EM clustering
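% A possible sketch for questions 3.2 and 3.3 (note that the display code of
% question 3.4 below uses three clusters, so the number of components may have to
% be set to 3; check the lab statement):
% mdlGMMmix= fitgmdist(Xtrain, 3, 'Replicates', 10) ;
% idx= cluster(mdlGMMmix, Xtrain) ;
% crosstab(Ytrain, idx)  % confusion table, up to a permutation of the cluster labels
% % idxperm below is this clustering with its labels permuted to best match Ytrain;
% % the fitted object (mdlGMMmix here) is what question 3.4 asks to substitute for XXXX.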
%% Question 3.4.
figure,
hold on
plot( Xtrain(idxperm==1,1) , Xtrain(idxperm==1,2), 'or' ) ;
plot( Xtrain(idxperm==2,1) , Xtrain(idxperm==2,2), '+g' ) ;
plot( Xtrain(idxperm==3,1) , Xtrain(idxperm==3,2), 'xb' ) ;
%FIXME: replace XXXX by the GMM object that fits the dataset (see Question 3.2)
ezcontour(@(x,y) pdf( XXXX ,[x y] ), [-5, 4, -2, 5],256) ; % FIXME
close all
clear all
%% Displaying Lena image data
nom_fich_in= 'lena.tif' ;
lennaim= imread(nom_fich_in) ;
% Data point: vector value of color pixel (p=3)
p=3 ;
% Get the sample size n (here data are trivariate, i.e. p=3)
n= numel(lennaim) / p ;
figure,
subplot(3,3,1)
imshow(lennaim) ;
%% Question 4.2 Vector Quantization
options = statset('MaxIter',1000) ;
lennaim= double( reshape(lennaim, n , p) ) ; % vectorize the dataset: one row per pixel
K= 2 ; % binary clustering
% Compute kmeans cluster indexes IDX, cluster centroids C, and
% within-cluster sums of point-to-centroid distances sumd
[IDX,C,sumd] = kmeans(lennaim,K,'replicates',5, 'options',options) ;
% Compute the quantized image (replace each pixel by its cluster centroid)
lennaquantK= zeros(n,p) ; % preallocate the quantized pixel array
for k=1:K
lennaquantK( IDX == k , : )= repmat( C(k,:), sum(IDX == k) , 1) ;
end
% Display quantized image
imshow(uint8(reshape( lennaquantK, 512,512,p ))) ; % assumes a 512x512 image
%% TODO: questions 4.3 and 4.4
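% A possible direction for the remaining questions (assuming they ask to vary K, as
% the 3x3 subplot grid above suggests; adapt to the actual lab statement):
% Ks= [2 4 8 16] ;
% for ik= 1:numel(Ks)
%     [IDX, C, sumd]= kmeans(lennaim, Ks(ik), 'replicates', 5, 'options', options) ;
%     lennaquantK= C(IDX, :) ;                     % quantized pixels: one centroid per pixel
%     subplot(3,3,ik+1) ;
%     imshow(uint8(reshape(lennaquantK, 512, 512, p))) ;
%     title(sprintf('K = %d, distortion = %.3g', Ks(ik), sum(sumd))) ;
% end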
@@ -170,7 +170,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.6.7"
+ "version": "3.6.9"
}
},
"nbformat": 4,
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "# MultiLayer Perceptron\n",
+ "# MultiLayer Perceptron : use a multiple layer perceptron to predict simple logical functions \n",
"\n",
"see https://scikit-learn.org/stable/modules/neural_networks_supervised.html"
]
@@ -184,7 +184,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.7.1"
+ "version": "3.6.9"
}
},
"nbformat": 4,
-numpy==1.15.1
+numpy==1.16
scipy==1.1.0
matplotlib==2.1.1
scikit-learn==0.20.1
pandas==0.23.4
graphviz==2.40.1
keras==2.2.5