Commit 50c06220 authored by Florent Chatelain's avatar Florent Chatelain

fix typo

parent 58583b35
......@@ -428,9 +428,16 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
"version": "3.7.9"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}
......@@ -328,7 +328,14 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
"version": "3.7.9"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
......
......@@ -12,7 +12,8 @@
"metadata": {},
"source": [
"Given the computational load, an efficient alternative is to use the UGA's jupyterhub service https://jupyterhub.u-ga.fr/ . In this case, to install tensorflow 2.X, just type\n",
"!pip install --user --upgrade tensorflow\n",
"\n",
" !pip install --user --upgrade tensorflow\n",
"in a code cell, then restart the notebook (or just restart the kernel)"
]
},
......@@ -941,7 +942,14 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.2"
"version": "3.7.9"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
......
%% Cell type:markdown id: tags:
This notebook can be run on mybinder: [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/git/https%3A%2F%2Fgricad-gitlab.univ-grenoble-alpes.fr%2Fchatelaf%2Fparcours-numerique-ia/master?filepath=notebooks%2F/9_RNN_LSTM/N1_LSTM_example.ipynb)
%% Cell type:markdown id: tags:
Given the computational load, an efficient alternative is to use the UGA's jupyterhub service https://jupyterhub.u-ga.fr/ . In this case, to install tensorflow 2.X, just type
!pip install --user --upgrade tensorflow
!pip install --user --upgrade tensorflow
in a code cell, then restart the notebook (or just restart the kernel)
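After the restart, a quick sanity check (assuming the upgrade succeeded) is to print the installed TensorFlow version:
%% Cell type:code id: tags:
``` python
# Quick check that a TensorFlow 2.x install is active after the kernel restart
import tensorflow as tf
print(tf.__version__)  # should print a 2.x version string
```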
%% Cell type:markdown id: tags:
LSTM example on forecasting a scalar temperature value from a vector of past values.
This Notebook is largely inspired from
https://www.tensorflow.org/tutorials/structured_data/time_series
which runs under TensorFlow2.
%% Cell type:code id: tags:
``` python
#import keras
#if tensorflow2, replace by
import tensorflow.keras as keras
import os
import pandas as pd
import numpy as np
# to ensure reproducibility across different runs
np.random.seed(1)
#from tensorflow import set_random_seed
#set_random_seed(2)
#if tensorflow2, replace the two lines above by
from tensorflow import random
random.set_seed(2)
import matplotlib.pyplot as plt
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
This tutorial uses a weather time series dataset recorded by the Max Planck Institute for Biogeochemistry.
This dataset contains 14 different features such as air temperature, atmospheric pressure, and humidity. These were collected every 10 minutes, beginning in 2003. For efficiency, you will use only the data collected between 2009 and 2016. This section of the dataset was prepared by François Chollet for his book Deep Learning with Python.
%% Cell type:code id: tags:
``` python
#zip_path = keras.utils.get_file(
# origin='https://storage.googleapis.com/tensorflow/tf-keras-datasets/jena_climate_2009_2016.csv.zip',
# fname='jena_climate_2009_2016.csv.zip', extract= True)
#csv_path, _ = os.path.splitext(zip_path)
#csv_path
#df = pd.read_csv(csv_path)
filename='jena_climate_2009_2016.csv'
df = pd.read_csv(filename)
df.head()
```
%%%% Output: execute_result
Date Time p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%) \
0 01.01.2009 00:10:00 996.52 -8.02 265.40 -8.90 93.3
1 01.01.2009 00:20:00 996.57 -8.41 265.01 -9.28 93.4
2 01.01.2009 00:30:00 996.53 -8.51 264.91 -9.31 93.9
3 01.01.2009 00:40:00 996.51 -8.31 265.12 -9.07 94.2
4 01.01.2009 00:50:00 996.51 -8.27 265.15 -9.04 94.1
VPmax (mbar) VPact (mbar) VPdef (mbar) sh (g/kg) H2OC (mmol/mol) \
0 3.33 3.11 0.22 1.94 3.12
1 3.23 3.02 0.21 1.89 3.03
2 3.21 3.01 0.20 1.88 3.02
3 3.26 3.07 0.19 1.92 3.08
4 3.27 3.08 0.19 1.92 3.09
rho (g/m**3) wv (m/s) max. wv (m/s) wd (deg)
0 1307.75 1.03 1.75 152.3
1 1309.80 0.72 1.50 136.1
2 1310.24 0.19 0.63 171.6
3 1309.19 0.34 0.50 198.0
4 1309.00 0.32 0.63 214.3
%% Cell type:markdown id: tags:
An observation is recorded every 10 minutes. This means that, for a single hour, you will have 6 observations. Similarly, a single day will contain 144 (6x24) observations.
Given a specific time, let's say you want to predict the temperature 6 hours in the future. In order to make this prediction, you choose to use a set of N preceding observations, i.e. a window containing the last N points. For example, 5 days of observations correspond to 720 (5x144) observations used to train the model. Many such configurations are possible, making this dataset a good one to experiment with.
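As a small illustrative sketch (the window sizes actually used later in this notebook are set explicitly in the code), the sampling arithmetic reads:
%% Cell type:code id: tags:
``` python
# Illustrative sampling arithmetic for the 10-minute weather series
samples_per_hour = 60 // 10                    # 6 observations per hour
samples_per_day = 24 * samples_per_hour        # 144 observations per day
history_days = 5
history_size = history_days * samples_per_day  # 720 past observations (5 days)
forecast_horizon = 6 * samples_per_hour        # predict 6 hours ahead = 36 steps
print(samples_per_hour, samples_per_day, history_size, forecast_horizon)
```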
%% Cell type:markdown id: tags:
## Forecast a univariate time series
Here, you will train a model using only a single feature (temperature), and use it to make predictions for that value in the future.
Let's first extract only the temperature from the dataset.
%% Cell type:code id: tags:
``` python
# Extract univariate time series
Temp=df['T (degC)']
Temp.index=df['Date Time']
print('the time series contains {} samples'.format(Temp.shape[0]))
print('\n First five rows:')
Temp.head()
```
%%%% Output: stream
the time series contains 420551 samples
First five rows:
%%%% Output: execute_result
Date Time
01.01.2009 00:10:00 -8.02
01.01.2009 00:20:00 -8.41
01.01.2009 00:30:00 -8.51
01.01.2009 00:40:00 -8.31
01.01.2009 00:50:00 -8.27
Name: T (degC), dtype: float64
%% Cell type:code id: tags:
``` python
# plot time series
Temp.plot(subplots=True, figsize=(12,6))
```
%%%% Output: stream
/Applications/anaconda3/envs/tensorflow_env/lib/python3.8/site-packages/pandas/plotting/_matplotlib/core.py:1235: UserWarning: FixedFormatter should only be used together with FixedLocator
ax.set_xticklabels(xticklabels)
%%%% Output: execute_result
array([<AxesSubplot:xlabel='Date Time'>], dtype=object)
%%%% Output: display_data
[Hidden Image Output]
%% Cell type:markdown id: tags:
### Rescaling
LSTMs are sensitive to the scale of the input data, specifically when the sigmoid (default) or tanh activation functions are used. It can be a good practice to rescale the data to the range of 0-to-1 (if sigmoid activation) or -1 to +1 (if tanh), also called normalizing. We can easily normalize the dataset using the MinMaxScaler preprocessing class from the scikit-learn library.
%% Cell type:code id: tags:
``` python
from sklearn.preprocessing import MinMaxScaler
data=np.reshape(Temp.values,(420551,1))
data=data.astype('float32')
scaler = MinMaxScaler(feature_range=(-1, 1))  # compatibility with the choice activation=tanh
data_sc = scaler.fit_transform(data)
print(data[:5,])
scaler.inverse_transform(data_sc)[:5,]
```
%%%% Output: stream
[[-8.02]
[-8.41]
[-8.51]
[-8.31]
[-8.27]]
%%%% Output: execute_result
array([[-8.02 ],
[-8.409999],
[-8.51 ],
[-8.31 ],
[-8.27 ]], dtype=float32)
%% Cell type:markdown id: tags:
### Define training and testing sets
%% Cell type:code id: tags:
``` python
# split into train and test sets
train_size = 300000
test_size = len(data_sc) - train_size
train, test = data_sc[0:train_size,:], data_sc[train_size:len(data_sc),:]
print(len(train), len(test))
```
%%%% Output: stream
300000 120551
%% Cell type:markdown id: tags:
The `create_dataset` function below takes three arguments: the dataset (a NumPy array to convert into input/output samples), `look_back` (the number of previous time steps to use as input variables, defaulted to 1), and `forecast` (the number of extra steps ahead of the look-back window at which the target is taken; with the default forecast=0 the model predicts the next time period).
%% Cell type:markdown id: tags:
### Rearrange data into matrix form
%% Cell type:code id: tags:
``` python
# convert an array of values into a dataset matrix
def create_dataset(dataset, look_back=1, forecast=0):
    dataX, dataY = [], []
    for i in range(len(dataset)-look_back-forecast-1):  # current time index is i
        a = dataset[i:(i+look_back), 0]  # a takes indices i, i+1, ..., i+look_back-1
        dataX.append(a)
        dataY.append(dataset[i + look_back + forecast, 0])  # target at time index i+look_back+forecast
    return np.array(dataX), np.array(dataY)
```
%% Cell type:markdown id: tags:
In the example below, only the 20 past observations are used to predict the temperature value 6 samples ahead (this corresponds to 1 hour).
%% Cell type:code id: tags:
``` python
# reshape into X=t and Y=t+1
look_back=20
forecast=6
trainX, trainY = create_dataset(train, look_back, forecast)
testX, testY = create_dataset(test, look_back, forecast)
trainX.shape
trainY=trainY.reshape((len(trainY),1))
testY=testY.reshape((len(testY),1))
```
%% Cell type:code id: tags:
``` python
trainX.shape
```
%%%% Output: execute_result
(299973, 20)
%% Cell type:raw id: tags:
%% Cell type:markdown id: tags:
### Reshape the input for the LSTM
The LSTM network expects the input data (X) to be provided with a specific array structure of the form [samples, time steps, features]. Currently, our data is in the form [samples, look_back]; we reshape it to [samples, look_back, 1], i.e. look_back time steps with a single feature per step, using numpy.reshape() as follows:
%% Cell type:code id: tags:
``` python
# reshape input to be [samples, time steps, features]
trainX = np.reshape(trainX, (trainX.shape[0], trainX.shape[1], 1))
testX = np.reshape(testX, (testX.shape[0], testX.shape[1], 1))
trainX.shape
```
%%%% Output: execute_result
(299973, 20, 1)
%% Cell type:code id: tags:
``` python
# May be useful to test the code on a shorter data set...
#trainX=trainX[:200,:,:]
#trainY=trainY[:200]
#testX=testX[:200,:,:]
#testY=testY[:200]
#trainY.shape
```
%% Cell type:markdown id: tags:
## Learn an LSTM model
A mini-batch method is used during the optimization phase. The batch size is set here to 256 samples (256 consecutive-in-time input vectors X of look_back (=20) numerical values, and the corresponding outputs Y).
The cell states are initialized at random. Obviously, the parameter shuffle has to be set to shuffle=False, as shuffling the data would destroy the time dependencies that the LSTM seeks to exploit.
Since, by default, cell states are not reinitialized after an epoch, we prefer here to force these states to be reinitialized after each epoch.
(Only the cell states are reinitialized; the internal networks -remember that there are 4 of them in an LSTM- keep their current weights.)
%% Cell type:code id: tags:
``` python
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
#if tensorflow2, replace the three lines above by
#from tensorflow.keras.models import Sequential
#from tensorflow.keras.layers import Dense
#from tensorflow.keras.layers import LSTM
import math
size_of_batch=256
# adapt length of trainX and trainY to size_of_batch
L=math.floor(len(trainX)/256)*256
trainX=trainX[:L,:,:]
trainY=trainY[:L]
nb_epochs=10
neurons=8  # size of both hidden state (h) and cell state (C)
# create and fit the LSTM network
model = Sequential()
model.add(LSTM(neurons, activation='tanh', recurrent_activation='sigmoid', batch_input_shape=(size_of_batch,trainX.shape[1], 1),stateful=True))
model.add(Dense(1))
# use stochastic gradient optimizer
opt = keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='mean_squared_error', optimizer=opt)
model.summary()
```
%%%% Output: stream
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm (LSTM) (256, 8) 384
_________________________________________________________________
dense (Dense) (256, 1) 9
=================================================================
Total params: 393
Trainable params: 393
Non-trainable params: 0
_________________________________________________________________
%% Cell type:code id: tags:
``` python
# Train the network
for i in range(nb_epochs):
    model.fit(trainX, trainY, epochs=1, batch_size=size_of_batch, verbose=1, shuffle=False)
    model.reset_states()
    print('Epoch {} completed'.format(i))
```
%%%% Output: stream
1171/1171 [==============================] - 4s 3ms/step - loss: 0.0072
Epoch 0 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 0.0018
Epoch 1 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 0.0011
Epoch 2 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 0.0011
Epoch 3 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 0.0010
Epoch 4 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 0.0010
Epoch 5 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 9.8257e-04
Epoch 6 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 9.5387e-04
Epoch 7 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 9.3068e-04
Epoch 8 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 9.1339e-04
Epoch 9 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 9.0010e-04
Epoch 10 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 8.8947e-04
Epoch 11 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 8.8080e-04
Epoch 12 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 8.7366e-04
Epoch 13 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 8.6769e-04
Epoch 14 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 8.6256e-04
Epoch 15 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 8.5808e-04
Epoch 16 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 8.5415e-04
Epoch 17 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 8.5073e-04
Epoch 18 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 8.4777e-04
Epoch 19 completed
1171/1171 [==============================] - 4s 4ms/step - loss: 8.4524e-04
Epoch 20 completed
1171/1171 [==============================] - 5s 4ms/step - loss: 8.4309e-04
Epoch 21 completed
1171/1171 [==============================] - 4s 4ms/step - loss: 8.4126e-04
Epoch 22 completed
1171/1171 [==============================] - 5s 4ms/step - loss: 8.3972e-04
Epoch 23 completed
1171/1171 [==============================] - 5s 4ms/step - loss: 8.3841e-04
Epoch 24 completed
1171/1171 [==============================] - 4s 4ms/step - loss: 8.3730e-04
Epoch 25 completed
1171/1171 [==============================] - 4s 4ms/step - loss: 8.3635e-04
Epoch 26 completed
1171/1171 [==============================] - 4s 4ms/step - loss: 8.3553e-04
Epoch 27 completed
1171/1171 [==============================] - 4s 4ms/step - loss: 8.3480e-04
Epoch 28 completed
1171/1171 [==============================] - 5s 4ms/step - loss: 8.3413e-04
Epoch 29 completed
1171/1171 [==============================] - 5s 4ms/step - loss: 8.3352e-04
Epoch 30 completed
1171/1171 [==============================] - 5s 4ms/step - loss: 8.3294e-04
Epoch 31 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 8.3238e-04
Epoch 32 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 8.3182e-04
Epoch 33 completed
1171/1171 [==============================] - 4s 4ms/step - loss: 8.3126e-04
Epoch 34 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 8.3068e-04
Epoch 35 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 8.3007e-04
Epoch 36 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 8.2943e-04
Epoch 37 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 8.2874e-04
Epoch 38 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 8.2799e-04
Epoch 39 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 8.2717e-04
Epoch 40 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 8.2626e-04
Epoch 41 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 8.2525e-04
Epoch 42 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 8.2412e-04
Epoch 43 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 8.2283e-04
Epoch 44 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 8.2138e-04
Epoch 45 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 8.1972e-04
Epoch 46 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 8.1789e-04
Epoch 47 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 8.1601e-04
Epoch 48 completed
1171/1171 [==============================] - 4s 3ms/step - loss: 8.1436e-04
Epoch 49 completed
%% Cell type:code id: tags:
``` python
# make predictions
trainPredict = model.predict(trainX,batch_size=size_of_batch)
# adapt length of testX and testY to size_of_batch
L=math.floor(len(testX)/256)*256
testX=testX[:L,:,:]
testY=testY[:L]
testDate = Temp.index[train_size:train_size+L]
testPredict = model.predict(testX,batch_size=size_of_batch)
```
%% Cell type:code id: tags:
``` python
print(testDate)
```
%%%% Output: stream
Index(['13.09.2014 02:20:00', '13.09.2014 02:30:00', '13.09.2014 02:40:00',
'13.09.2014 02:50:00', '13.09.2014 03:00:00', '13.09.2014 03:10:00',
'13.09.2014 03:20:00', '13.09.2014 03:30:00', '13.09.2014 03:40:00',
'13.09.2014 03:50:00',
...
'30.12.2016 08:00:00', '30.12.2016 08:10:00', '30.12.2016 08:20:00',
'30.12.2016 08:30:00', '30.12.2016 08:40:00', '30.12.2016 08:50:00',
'30.12.2016 09:00:00', '30.12.2016 09:10:00', '30.12.2016 09:20:00',
'30.12.2016 09:30:00'],
dtype='object', name='Date Time', length=120320)
%% Cell type:code id: tags:
``` python
import matplotlib.pyplot as plt
df_test = pd.DataFrame(index=testDate)
df_test['Actual'] = testY
df_test['Predict'] = testPredict
df_test.plot(subplots=False,figsize=(18,6))
```
%%%% Output: stream
/Applications/anaconda3/envs/tensorflow_env/lib/python3.8/site-packages/pandas/plotting/_matplotlib/core.py:1235: UserWarning: FixedFormatter should only be used together with FixedLocator
ax.set_xticklabels(xticklabels)
%%%% Output: execute_result
<AxesSubplot:xlabel='Date Time'>
%%%% Output: display_data
[Hidden Image Output]
%% Cell type:markdown id: tags:
## Back in the original space
To evaluate RMSE performance and visualize the results in the original space, one needs to rescale the data back to their original scale.
%% Cell type:code id: tags:
``` python
from sklearn.metrics import mean_squared_error
# invert predictions
trainPredict = scaler.inverse_transform(trainPredict)
trainPredict.shape
trainY = scaler.inverse_transform(trainY)
testPredict = scaler.inverse_transform(testPredict)
testY = scaler.inverse_transform(testY)
# calculate root mean squared error
trainScore = math.sqrt(mean_squared_error(trainY, trainPredict))
print('Train Score: %.2f RMSE' % (trainScore))
testScore = math.sqrt(mean_squared_error(testY, testPredict))
print('Test Score: %.2f RMSE' % (testScore))
df_test = pd.DataFrame(index=testDate)
df_test['Actual'] = testY
df_test['Predict'] = testPredict
df_test.plot(subplots=False,figsize=(15,6))
```
%%%% Output: stream
Train Score: 0.97 RMSE
Test Score: 0.94 RMSE
%%%% Output: stream
/Applications/anaconda3/envs/tensorflow_env/lib/python3.8/site-packages/pandas/plotting/_matplotlib/core.py:1235: UserWarning: FixedFormatter should only be used together with FixedLocator
ax.set_xticklabels(xticklabels)
%%%% Output: execute_result
<AxesSubplot:xlabel='Date Time'>
%%%% Output: display_data
[Hidden Image Output]
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
......
This diff is collapsed.
This diff is collapsed.
This source diff could not be displayed because it is too large. You can view the blob instead.
......@@ -823,7 +823,7 @@
"source": [
"### Exercice\n",
"- Based on the logistic regression formula to compute the probability of each class (with, or without CHD) and the values of the estimated weights, what would be the increase factor on the odds probability $$\\frac{ \\Pr(\\textrm{\"CHD\"} |X=x)}{\\Pr( \\textrm{\"no CHD\"}|X=x)}$$ when the tobacco consumption increases of 1 (in standardized unit)? \n",
"- Check this by adding an offset to the tobbaco variable in the cell above and comparing the obtained odds"
"- Check this by adding an offset to the tobacco variable in the cell above and comparing the obtained odds"
]
},
{
......@@ -964,7 +964,7 @@
"\n",
"This principle can be extented to generalized linear model such that Logistic Regression by replacing the residual sum of squares criterion by the opposite of the log likelihood. \n",
"\n",
"Within scikit-learn, only OMP is available for linear regression model with [`linear_model.OrthogonalMatchingPursuit`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.OrthogonalMatchingPursuit.html). However this can be directly apply to our binary classification problem"
"Within scikit-learn, only OMP is available for linear regression model with [`linear_model.OrthogonalMatchingPursuit`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.OrthogonalMatchingPursuit.html). However this can be directly applied to our binary classification problem"
]
},
{
......
%% Cell type:markdown id: tags:
This notebook can be run on mybinder: [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/git/https%3A%2F%2Fgricad-gitlab.univ-grenoble-alpes.fr%2Fchatelaf%2Fml-sicom3a/master?urlpath=lab/tree/notebooks/5bis_linear_models_lasso_logistic/)
%% Cell type:markdown id: tags:
# South Africa Heart Diseases Data
A retrospective sample of males in a heart-disease high-risk region
of the Western Cape, South Africa. There are roughly two controls per
case of Coronary Heart Disease (CHD). Many of the CHD-positive men have undergone blood
pressure reduction treatment and other programs to reduce their risk
factors after their CHD event. In some cases the measurements were
made after these treatments. These data are taken from a larger
dataset, described in Rousseauw et al, 1983, South African Medical Journal.
| Variable name | Description |
|:---------|:-------------------------|
| sbp | systolic blood **pressure** |
| tobacco | cumulative **tobacco** (kg) |
| ldl | low density lipoprotein **cholesterol** |
| adiposity | |
| famhist | **family** history of heart disease (Present=1, Absent=0) |
| typea | [type-A **behavior**](https://en.wikipedia.org/wiki/Type_A_and_Type_B_personality_theory) |
| obesity | |
| alcohol | current **alcohol** consumption |
| age | **age** at onset |
| chd | **response**: coronary heart disease (Present=1, Absent=0) |
%% Cell type:markdown id: tags:
#### Load the data set
%% Cell type:code id: tags:
``` python
import pandas as pd
import numpy as np
#load data set
heart = pd.read_csv('SAheart.csv', sep=',', header=0)
heart.head() #data overview: variable names and first values
```
%%%% Output: execute_result
row.names sbp tobacco ldl adiposity famhist typea obesity alcohol \
0 1 160 12.00 5.73 23.11 1 49 25.30 97.20
1 2 144 0.01 4.41 28.61 0 55 28.87 2.06
2 3 118 0.08 3.48 32.28 1 52 29.14 3.81
3 4 170 7.50 6.41 38.03 1 51 31.99 24.26
4 5 134 13.60 3.50 27.78 1 60 25.99 57.34
age chd
0 52 1
1 63 1
2 46 0
3 58 1
4 49 1
%% Cell type:markdown id: tags:
#### Display some summary statistics
%% Cell type:code id: tags:
``` python
heart.describe()
```
%%%% Output: execute_result
row.names sbp tobacco ldl adiposity famhist \
count 462.000000 462.000000 462.000000 462.000000 462.000000 462.000000
mean 231.935065 138.326840 3.635649 4.740325 25.406732 0.415584
std 133.938585 20.496317 4.593024 2.070909 7.780699 0.493357
min 1.000000 101.000000 0.000000 0.980000 6.740000 0.000000
25% 116.250000 124.000000 0.052500 3.282500 19.775000 0.000000
50% 231.500000 134.000000 2.000000 4.340000 26.115000 0.000000
75% 347.750000 148.000000 5.500000 5.790000 31.227500 1.000000
max 463.000000 218.000000 31.200000 15.330000 42.490000 1.000000
typea obesity alcohol age chd
count 462.000000 462.000000 462.000000 462.000000 462.000000
mean 53.103896 26.044113 17.044394 42.816017 0.346320
std 9.817534 4.213680 24.481059 14.608956 0.476313
min 13.000000 14.700000 0.000000 15.000000 0.000000
25% 47.000000 22.985000 0.510000 31.000000 0.000000
50% 53.000000 25.805000 7.510000 45.000000 0.000000
75% 60.000000 28.497500 23.892500 55.000000 1.000000
max 78.000000 46.580000 147.190000 64.000000 1.000000
%% Cell type:markdown id: tags:
#### Format the input/target data set
%% Cell type:code id: tags:
``` python
y = np.asarray( heart.iloc[:,10], 'float' )
X = np.asarray( heart.iloc[:,1:10], 'float' )
names = list(heart.columns[1:10]) # variable names
y = heart.iloc[:,10]
X = heart.iloc[:,1:10]
# Data matrix X is (n,p) where p is the number of variables and n the number of samples
(n,p) = X.shape
ncases = int(np.sum(y)) # number of cases
print('There are {} samples: {} cases and {:d} controls'.format(n,ncases,n-ncases))
```
%%%% Output: stream
There are 462 samples: 160 cases and 302 controls
%% Cell type:markdown id: tags:
## Compute ordinary (without regularization) Logistic Regression
%% Cell type:code id: tags:
``` python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from time import time
# Standardize the data (useful to compare the variable weights)
sc = StandardScaler()
X = heart.iloc[:,1:10]
Xs= sc.fit_transform(X)
print("Computing LR estimates ...")
start = time()
clf = LogisticRegression(penalty='none', tol=1e-6, max_iter=int(1e6))
clf.fit(Xs, y)
print("This took %0.3fs" % (time() - start))
betas = pd.DataFrame.from_records(clf.coef_, columns=names, index=['Weights'])
betas['intercept'] = clf.intercept_
print('Estimated Weights for standardized variables:\n')
betas.head()
```
%%%% Output: stream
Computing LR estimates ...
This took 0.003s
Estimated Weights for standardized variables:
%%%% Output: execute_result
sbp tobacco ldl adiposity famhist typea \
Weights 0.133162 0.364183 0.359791 0.144465 0.456047 0.388312
obesity alcohol age intercept
Weights -0.264795 0.002981 0.659973 -0.878543
%% Cell type:markdown id: tags:
**Exercise**:
- What does a positive, negative or near-zero weight mean for predicting heart disease?
- How do you interpret the weight of obesity for instance?
- How can you explain such surprising findings?
%% Cell type:markdown id: tags:
## Compute $\ell_1$ penalized Logistic Regression and lasso path
Regularization functions such as the $\ell_1$ or $\ell_2$ penalty (or a combined $\ell_1$/$\ell_2$ penalty as in *elastic net*) can also be used in generalized linear models such as Logistic Regression. The residual sum of squares criterion used for linear regression is then replaced by the opposite of the (conditional) log likelihood.
For instance, for Logistic Regression with Lasso-type regularization ($\ell_1$ penalty), this yields for binary classification $y_i \in \{-1,+1\}$ the following optimization problem:
$$
\min_{\beta} \sum_{i=1}^n \log(\exp(- y_i (X_i^T \beta )) + 1) + \lambda \| \beta \|_1.
$$
Within scikit-learn, penalized Logistic Regression is available through the `penalty` parameter of [`linear_model.LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.htm), the parameter `C` being the inverse of the regularization strength, i.e. $C=\frac{1}{\lambda}$.
%% Cell type:code id: tags:
``` python
#compute lasso path
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from time import time
from sklearn.svm import l1_min_c
import matplotlib.pyplot as plt
# Standardize the data!
# Don't forget that scale matters for penalized regression such as Lasso or ridge
# (this is also true for distance-based methods such as K-NN).
# If one variable has a larger magnitude than the others (imagine that you change the
# unit of a variable from kilograms to grams), this variable will be much less shrunken than
# the others.
# Advice: unless you have a good reason to do otherwise, standardize all your variables
# to be sure that they are comparable
sc = StandardScaler()
Xs= sc.fit_transform(X) #center (zero mean) and reduce (unit variance) the variables
# generate useful values for the regularization parameter (log-scale)
cs = l1_min_c(Xs, y, loss='log') * np.logspace(0, 4, 30)
# Use a stochastic gradient descent
print("Computing regularization path ...")
start = time()
clf = LogisticRegression(penalty='l1', solver='saga',
tol=1e-6, max_iter=int(1e6),
warm_start=True)
coefs_ = []
beta_l1norm = []
for c in cs:
    clf.set_params(C=c)
    clf.fit(Xs, y)
    beta_l1norm.append(np.sum(np.abs(clf.coef_.ravel())))
    coefs_.append(clf.coef_.ravel().copy())
print("This took %0.3fs" % (time() - start))
betas = np.array(coefs_)
```
%%%% Output: stream
Computing regularization path ...
This took 0.160s
%% Cell type:code id: tags:
``` python
# Display lasso path Vs l1 norm of the coeff vector
plt.figure(figsize=(12,6))
#plt.plot(np.log10(cs), coefs_, marker='o')
plt.plot(beta_l1norm, betas, marker='o')
ymin, ymax = plt.ylim()
plt.xlabel('l1 norm of beta')
plt.ylabel('Coefficients')
plt.title('Logistic Regression Path')
plt.axis('tight')
plt.legend(names, fontsize=14)
plt.grid('On')
plt.show()
```
%%%% Output: display_data
[Hidden Image Output]
%% Cell type:markdown id: tags:
### Cross validation of the LR model with Lasso penalty
We estimate now the optimal regularization parameter
%% Cell type:code id: tags:
``` python
from sklearn.linear_model import LogisticRegressionCV
print("Computing K-fold CV ...")
# K fold cross validation (K=5)
start = time()
cs = l1_min_c(Xs, y, loss='log') * np.logspace(0, 2, 50)  # the grid of C values (C = 1/lambda, the inverse lasso penalty)
model = LogisticRegressionCV(Cs=cs, cv=5, penalty='l1', solver='saga', tol=1e-6).fit(Xs,y)
print("This took %0.3fs" % (time() - start))
```
%%%% Output: stream
Computing K-fold CV ...
This took 0.333s
%% Cell type:code id: tags:
``` python
# Now model is tuned with the penalty parameter estimated by CV
# (model.C_ is the inverse regularization strength, C = 1/lambda)
C_cv = model.C_[0]
# The coef estimated with CV
beta_l1norm = np.sum(np.abs(model.coef_))
print('CV estimates:')
print('- C = {:.3f}, which yields ||beta||_1 = {:.3f}\n'.format(C_cv, beta_l1norm))
print('CV weights for standardized variables:')
betas_cv = pd.DataFrame.from_records(model.coef_, columns=names, index=['Weights'])
betas_cv['intercept'] = clf.intercept_
betas_cv.head()
```
%%%% Output: stream
CV estimates:
- C = 0.060, which yields ||beta||_1 = 1.342
CV weights for standardized variables:
%%%% Output: execute_result
sbp tobacco ldl adiposity famhist typea obesity \
Weights 0.0 0.232516 0.199362 0.0 0.284652 0.11216 0.0
alcohol age intercept
Weights 0.0 0.513103 -0.878436
%% Cell type:markdown id: tags:
### Exercise
- What are the only significant variables estimated with cross-validation?
- How can we rank them by significance order (*hint: look at the lasso path*)?
- Do these results seem more credible (than those obtained without regularization) for predicting heart diseases?
%% Cell type:markdown id: tags:
### Compute the predicted CHD probability for some patients
For this kind of problem, we are of course more interested in modeling the relationship between the input and response variables, and in its interpretation, than in merely predicting the binary responses (CHD or not). With a generalized linear model such as LR, it is possible to compute a risk probability for each patient and to assess and interpret the influence of each variable.
%% Cell type:code id: tags:
``` python
# Get the predicted risk for the first patient
ipatient = 0 # patient index
x = Xs[ipatient,:].reshape(1, -1)
x = pd.DataFrame.from_records(x, columns=names)
ylabel = 'Case' if y[ipatient] else 'Control'
print("{} Patient with (standardized) features:\n{}\n".format(ylabel, x))
# TODO: increase/decrease tobacco consumption
x_copy = x.copy()
x_copy['tobacco'] += 0 # offset to add in standardized unit
# Proba of heart disease
proba_CHD_0 = model.predict_proba(x)[0,1]
proba_CHD = model.predict_proba(x_copy)[0,1]
print("Proba of coronary heart disease (CHD): {:.3f}".format(proba_CHD))
print("Odds probability of CHD: {:.3f}".format(proba_CHD/(1-proba_CHD)))
```
%%%% Output: stream
Case Patient with (standardized) features:
sbp tobacco ldl adiposity famhist typea obesity \
0 1.058564 1.823073 0.478412 -0.295503 1.185854 -0.41847 -0.176786
alcohol age
0 3.277738 0.629336
Proba of coronary heart disease (CHD): 0.596
Odds probability of CHD: 1.473
%% Cell type:markdown id: tags:
### Exercise
- Based on the logistic regression formula to compute the probability of each class (with, or without CHD) and the values of the estimated weights, what would be the increase factor on the odds probability $$\frac{ \Pr(\textrm{"CHD"} |X=x)}{\Pr( \textrm{"no CHD"}|X=x)}$$ when the tobacco consumption increases by 1 (in standardized unit)?
- Check this by adding an offset to the tobbaco variable in the cell above and comparing the obtained odds
- Check this by adding an offset to the tobacco variable in the cell above and comparing the obtained odds
%% Cell type:markdown id: tags:
## Greedy variable selection procedure
An alternative to Lasso penalization to select the most significant variables and regularize the problem is to fit all the possible combinations of variables and choose the best one (for instance, the one that minimizes the test error). This is called the *best subset criterion*. However, this approach is very computationally demanding: with $p$ variables, we need to fit $2^p$ models. Additionally, if we want to use cross-validation to evaluate and compare their performances, the problem quickly becomes unfeasible...
Greedy algorithms are then useful procedures that select the best variables (forward selection) or remove the worst variables (backward selection) **step by step**. There is generally no longer any guarantee of converging towards the best subset solution, but this may give useful solutions at a much reduced computational cost.
**Exercise:**
- for our heart diseases dataset, using a $K=5$-fold CV, how many fits would be required to find the best subset?
%% Cell type:markdown id: tags:
### Backward selection: Recursive Feature Elimination (RFE)
Given a prediction rule that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select variables (the features) by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features. Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set.
Within scikit-learn, recursive feature elimination is available with [`feature_selection.RFE`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html): for a linear model such as Logistic Regression, the importance of the features is obtained through their coefficients (pay attention to scale your data!). The recursive feature elimination is repeated until the desired number of features to select is reached.
%% Cell type:code id: tags:
``` python
from sklearn.feature_selection import RFE, RFECV
clf = LogisticRegression(penalty='none', tol=1e-6, max_iter=int(1e6))
start = time()
selector = RFE(clf, n_features_to_select=4, step=1)
selector = selector.fit(Xs, y)
print("This took %0.3fs" % (time() - start))
print("The selected variables are: {}".format(np.array(names)[selector.support_]))
sel_ind_sorted = np.argsort(selector.ranking_)
ranking = pd.DataFrame.from_records(selector.ranking_[sel_ind_sorted].reshape(1,p),
columns=[names[i] for i in sel_ind_sorted], index=['Rank'])
ranking.head()
```
%%%% Output: stream
This took 0.017s
The selected variables are: ['tobacco' 'famhist' 'typea' 'age']
%%%% Output: execute_result
tobacco famhist typea age ldl obesity adiposity sbp alcohol
Rank 1 1 1 1 2 3 4 5 6
%% Cell type:markdown id: tags:
**Exercise:**
- Change the desired number of features to 1 to rank the features by significance order (the most significant variables must enter the model first). Is it perfectly consistent with the Lasso path ranking?
- What is the hyperparameter to optimize for the RFE procedure? How can you estimate it?
- Replace the RFE procedure by [`feature_selection.RFECV`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html) to select the most significant variables. Is it in agreement with the Lasso results?
- Do you think that this procedure is still appropriate when the number of variables $p$ is greater than the sample size $n$ of the training set?
%% Cell type:markdown id: tags:
### Forward selection: Orthogonal Matching Pursuit (OMP)
Orthogonal Matching Pursuit (OMP) is based on a greedy algorithm that selects at each step the most significant variable to enter the model. For a linear regression model, this is the variable
most highly correlated with the current residual. It is similar to the simpler matching pursuit (MP) method, but better in that at each iteration, the residual is recomputed using an orthogonal projection on the space of the previously chosen variables.
OMP was introduced in
> G. Mallat, Z. Zhang, [Matching pursuits with time-frequency dictionaries](http://blanche.polytechnique.fr/~mallat/papiers/MallatPursuit93.pdf), IEEE Transactions on Signal Processing, Vol. 41, No. 12. (December 1993), pp. 3397-3415.
Note also that there exists another popular method in the statistical literature called *Forward stepwise selection*. Its selection step differs from the one used in OMP in that it selects the variable that will lead to the minimum residual error *after* orthogonalisation. See for instance the following paper for a comparison between both methods
> Blumensath, Thomas, and Mike E. Davies. ["On the difference between orthogonal matching pursuit and orthogonal least squares."](https://eprints.soton.ac.uk/142469/1/BDOMPvsOLS07.pdf) (2007).
This principle can be extended to generalized linear models such as Logistic Regression by replacing the residual sum of squares criterion by the opposite of the log likelihood.
Within scikit-learn, only OMP is available for linear regression model with [`linear_model.OrthogonalMatchingPursuit`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.OrthogonalMatchingPursuit.html). However this can be directly apply to our binary classification problem
Within scikit-learn, only OMP is available for linear regression model with [`linear_model.OrthogonalMatchingPursuit`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.OrthogonalMatchingPursuit.html). However this can be directly applied to our binary classification problem
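For intuition, here is a minimal NumPy sketch of the OMP selection loop for a linear regression model (illustrative only; the notebook relies on scikit-learn's implementation in the next cell, and the helper `omp_select` is not part of the original code):
%% Cell type:code id: tags:
``` python
# Minimal OMP sketch: greedily pick the column most correlated with the residual,
# then refit by least squares (orthogonal projection) on the selected columns.
import numpy as np

def omp_select(X, y, n_nonzero):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    residual = y.copy()
    selected, coef = [], None
    for _ in range(n_nonzero):
        correlations = np.abs(X.T @ residual)
        correlations[selected] = -np.inf        # do not reselect a variable
        selected.append(int(np.argmax(correlations)))
        coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        residual = y - X[:, selected] @ coef    # recompute residual after the projection
    return selected, coef

# e.g. omp_select(Xs, y, n_nonzero=5) should select features in a similar order to
# sklearn's OrthogonalMatchingPursuit below (up to intercept/normalization details)
```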
%% Cell type:code id: tags:
``` python
# We use Forward selection, aka Orthogonal Matching Pursuit
from sklearn.linear_model import OrthogonalMatchingPursuit, OrthogonalMatchingPursuitCV
# Get the 5 most significant features
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=5,fit_intercept=True, normalize=True)
omp.fit(X, y)
coef = omp.coef_
idx_r, = coef.nonzero()
sel_feat_name= [names[l] for l in idx_r] ;
print("Most significant features: {}".format(np.array(names)[omp.coef_.nonzero()]))
```
%%%% Output: stream
Most significant features: ['tobacco' 'ldl' 'famhist' 'typea' 'age']
%% Cell type:markdown id: tags:
**Exercise:**
- Change the desired number of variables (number of nonzero coefs) from p to 1 to rank the features by significance order (the most significant variables must enter the model first). Is it perfectly consistent with the previous results?
- What is the hyperparameter to optimize for the OMP procedure? How can you estimate it?
- Replace the OMP procedure by [`linear_model.OrthogonalMatchingPursuitCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.OrthogonalMatchingPursuit.html#sklearn.linear_model.OrthogonalMatchingPursuit) to select the most significant variables. Is it in agreement with the previous results?
- *Optional:* Compared to greedy methods that are suboptimal procedures to approximate the best sparse solution (the best subset), Lasso penalty is often presented as a *convex relaxation of the sparsity constraint*. Can you explain why? What are the possible benefits/disadvantages of the greedy and lasso procedures?
%% Cell type:code id: tags:
``` python
```
......