Commit 2dac2728 authored by Jean-Luc Parouty

shuffle with an h, is better

parent 8b712749
%% Cell type:markdown id: tags:
<img width="800px" src="../fidle/img/00-Fidle-header-01.svg"></img>
# <!-- TITLE --> [BHPD1] - Regression with a Dense Network (DNN)
<!-- DESC --> Simple example of a regression with the dataset Boston Housing Prices Dataset (BHPD)
<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->
## Objectives :
- Predict **housing prices** from a set of house features.
- Understand the **principle** and the **architecture** of a regression with a **dense neural network**
The **[Boston Housing Prices Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html)** consists of house prices in various places in Boston.
Alongside the price, the dataset also provides the following information :
- CRIM: This is the per capita crime rate by town
- ZN: This is the proportion of residential land zoned for lots larger than 25,000 sq.ft
- INDUS: This is the proportion of non-retail business acres per town
- CHAS: This is the Charles River dummy variable (this is equal to 1 if tract bounds river; 0 otherwise)
- NOX: This is the nitric oxides concentration (parts per 10 million)
- RM: This is the average number of rooms per dwelling
- AGE: This is the proportion of owner-occupied units built prior to 1940
- DIS: This is the weighted distances to five Boston employment centers
- RAD: This is the index of accessibility to radial highways
- TAX: This is the full-value property-tax rate per 10,000 dollars
- PTRATIO: This is the pupil-teacher ratio by town
- B: This is calculated as 1000(Bk - 0.63)^2, where Bk is the proportion of people of African American descent by town
- LSTAT: This is the percentage lower status of the population
- MEDV: This is the median value of owner-occupied homes in 1000 dollars
## What we're going to do :
- Retrieve data
- Prepare the data
- Build a model
- Train the model
- Evaluate the result
%% Cell type:markdown id: tags:
## Step 1 - Import and init
You can also adjust the verbosity by changing the value of TF_CPP_MIN_LOG_LEVEL :
- 0 = all messages are logged (default)
- 1 = INFO messages are not printed.
- 2 = INFO and WARNING messages are not printed.
- 3 = INFO, WARNING and ERROR messages are not printed.
%% Cell type:code id: tags:
``` python
# import os
# os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os,sys
sys.path.append('..')
import fidle.pwk as pwk
datasets_dir = pwk.init('BHPD1')
```
%% Cell type:markdown id: tags:
Verbosity during training :
- 0 = silent
- 1 = progress bar
- 2 = one line per epoch
%% Cell type:code id: tags:
``` python
fit_verbosity = 1
```
%% Cell type:markdown id: tags:
Override parameters (batch mode) - Just forget this cell
%% Cell type:code id: tags:
``` python
pwk.override('fit_verbosity')
```
%% Cell type:markdown id: tags:
## Step 2 - Retrieve data
### 2.1 - Option 1 : From Keras
Boston housing is a famous historic dataset, so we can get it directly from [Keras datasets](https://www.tensorflow.org/api_docs/python/tf/keras/datasets)
%% Cell type:code id: tags:
``` python
# (x_train, y_train), (x_test, y_test) = keras.datasets.boston_housing.load_data(test_split=0.2, seed=113)
```
%% Cell type:markdown id: tags:
### 2.2 - Option 2 : From a csv file
More fun !
%% Cell type:code id: tags:
``` python
data = pd.read_csv(f'{datasets_dir}/BHPD/origine/BostonHousing.csv', header=0)
display(data.head(5).style.format("{0:.2f}").set_caption("Few lines of the dataset :"))
print('Missing Data : ',data.isna().sum().sum(), ' Shape is : ', data.shape)
```
%% Cell type:markdown id: tags:
## Step 3 - Preparing the data
### 3.1 - Split data
We will use 70% of the data for training and 30% for validation.
The dataset is **shuffled** and split between **training** and **testing**.
x will be the input data and y the expected output
%% Cell type:code id: tags:
``` python
# ---- Shuffle and Split => train, test
#
data = data.sample(frac=1., axis=0)
data_train = data.sample(frac=0.7, axis=0)
data_test = data.drop(data_train.index)
# ---- Split => x,y (medv is price)
#
x_train = data_train.drop('medv', axis=1)
y_train = data_train['medv']
x_test = data_test.drop('medv', axis=1)
y_test = data_test['medv']
print('Original data shape was : ',data.shape)
print('x_train : ',x_train.shape, 'y_train : ',y_train.shape)
print('x_test : ',x_test.shape, 'y_test : ',y_test.shape)
```
%% Cell type:markdown id: tags:
### 3.2 - Data normalization
**Note :**
- All input data must be normalized, train and test.
- To do this we will **subtract the mean** and **divide by the standard deviation**.
- But test data should not be used in any way, even for normalization.
- The mean and the standard deviation will therefore only be calculated with the train data.
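In other words, each feature $x$ becomes $x' = \dfrac{x - \mu_{train}}{\sigma_{train}}$, where $\mu_{train}$ and $\sigma_{train}$ are computed on the training set only.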
%% Cell type:code id: tags:
``` python
display(x_train.describe().style.format("{0:.2f}").set_caption("Before normalization :"))
mean = x_train.mean()
std = x_train.std()
x_train = (x_train - mean) / std
x_test = (x_test - mean) / std
display(x_train.describe().style.format("{0:.2f}").set_caption("After normalization :"))
display(x_train.head(5).style.format("{0:.2f}").set_caption("Few lines of the dataset :"))
x_train, y_train = np.array(x_train), np.array(y_train)
x_test, y_test = np.array(x_test), np.array(y_test)
```
%% Cell type:markdown id: tags:
## Step 4 - Build a model
More information about :
- [Optimizer](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers)
- [Activation](https://www.tensorflow.org/api_docs/python/tf/keras/activations)
- [Loss](https://www.tensorflow.org/api_docs/python/tf/keras/losses)
- [Metrics](https://www.tensorflow.org/api_docs/python/tf/keras/metrics)
%% Cell type:code id: tags:
``` python
def get_model_v1(shape):
    model = keras.models.Sequential()
    model.add(keras.layers.Input(shape, name="InputLayer"))
    model.add(keras.layers.Dense(32, activation='relu', name='Dense_n1'))
    model.add(keras.layers.Dense(64, activation='relu', name='Dense_n2'))
    model.add(keras.layers.Dense(32, activation='relu', name='Dense_n3'))
    model.add(keras.layers.Dense(1, name='Output'))
    model.compile(optimizer = 'adam',
                  loss      = 'mse',
                  metrics   = ['mae', 'mse'] )
    return model
```
%% Cell type:markdown id: tags:
## Step 5 - Train the model
### 5.1 - Get it
%% Cell type:code id: tags:
``` python
model=get_model_v1( (13,) )
model.summary()
# img=keras.utils.plot_model( model, to_file='./run/model.png', show_shapes=True, show_layer_names=True, dpi=96)
# display(img)
```
%% Cell type:markdown id: tags:
### 5.2 - Train it
%% Cell type:code id: tags:
``` python
history = model.fit(x_train,
                    y_train,
                    epochs          = 60,
                    batch_size      = 10,
                    verbose         = fit_verbosity,
                    validation_data = (x_test, y_test))
```
%% Cell type:markdown id: tags:
## Step 6 - Evaluate
### 6.1 - Model evaluation
MAE = Mean Absolute Error (between the labels and predictions)
A MAE equal to 3 represents an average prediction error of $3k.
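Over the $n$ test samples : $\mathrm{MAE} = \dfrac{1}{n}\sum_{i=1}^{n}\left|\,y_i - \hat{y}_i\,\right|$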
%% Cell type:code id: tags:
``` python
score = model.evaluate(x_test, y_test, verbose=0)
print('x_test / loss : {:5.4f}'.format(score[0]))
print('x_test / mae : {:5.4f}'.format(score[1]))
print('x_test / mse : {:5.4f}'.format(score[2]))
```
%% Cell type:markdown id: tags:
### 6.2 - Training history
What was the best result during our training ?
%% Cell type:code id: tags:
``` python
df=pd.DataFrame(data=history.history)
display(df)
```
%% Cell type:code id: tags:
``` python
print("min( val_mae ) : {:.4f}".format( min(history.history["val_mae"]) ) )
```
%% Cell type:code id: tags:
``` python
pwk.plot_history(history, plot={'MSE' :['mse', 'val_mse'],
                                'MAE' :['mae', 'val_mae'],
                                'LOSS':['loss','val_loss']}, save_as='01-history')
```
%% Cell type:markdown id: tags:
## Step 7 - Make a prediction
The data must be normalized with the parameters (mean, std) previously used.
%% Cell type:code id: tags:
``` python
my_data = [ 1.26425925, -0.48522739, 1.0436489 , -0.23112788, 1.37120745,
-2.14308942, 1.13489104, -1.06802005, 1.71189006, 1.57042287,
0.77859951, 0.14769795, 2.7585581 ]
real_price = 10.4
my_data=np.array(my_data).reshape(1,13)
```
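%% Cell type:markdown id: tags:
The `my_data` values above are already normalized. As a minimal sketch (the raw values below are hypothetical), an un-normalized observation would be prepared with the same `mean` and `std` computed in step 3.2 before calling `predict` :
%% Cell type:code id: tags:
``` python
# ---- Hypothetical raw (un-normalized) observation, same column order as x_train
raw_sample = pd.DataFrame([[0.22, 0.0, 7.0, 0, 0.55, 6.1, 85.0, 3.0, 4, 300.0, 17.0, 390.0, 15.0]],
                          columns=mean.index)

# ---- Normalize with the training mean/std, then reshape for predict()
raw_normalized = np.array((raw_sample - mean) / std).reshape(1,13)
```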
%% Cell type:code id: tags:
``` python
predictions = model.predict( my_data )
print("Prediction : {:.2f} K$".format(predictions[0][0]))
print("Reality : {:.2f} K$".format(real_price))
```
%% Cell type:code id: tags:
``` python
pwk.end()
```
%% Cell type:markdown id: tags:
---
<img width="80px" src="../fidle/img/00-Fidle-logo-01.svg"></img>
%% Cell type:markdown id: tags:
<img width="800px" src="../fidle/img/00-Fidle-header-01.svg"></img>
# <!-- TITLE --> [BHP1] - Regression with a Dense Network (DNN)
<!-- DESC --> A Simple regression with a Dense Neural Network (DNN) - BHPD dataset
<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP), Laurent Risser (CNRS/IMT) -->
## Objectives :
- Predict **housing prices** from a set of house features.
- Understand the **principle** and the **architecture** of a regression with a **dense neural network**
The **[Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html)** consists of house prices in various places in Boston.
Alongside the price, the dataset also provides the following information :
- CRIM: This is the per capita crime rate by town
- ZN: This is the proportion of residential land zoned for lots larger than 25,000 sq.ft
- INDUS: This is the proportion of non-retail business acres per town
- CHAS: This is the Charles River dummy variable (this is equal to 1 if tract bounds river; 0 otherwise)
- NOX: This is the nitric oxides concentration (parts per 10 million)
- RM: This is the average number of rooms per dwelling
- AGE: This is the proportion of owner-occupied units built prior to 1940
- DIS: This is the weighted distances to five Boston employment centers
- RAD: This is the index of accessibility to radial highways
- TAX: This is the full-value property-tax rate per 10,000 dollars
- PTRATIO: This is the pupil-teacher ratio by town
- B: This is calculated as 1000(Bk - 0.63)^2, where Bk is the proportion of people of African American descent by town
- LSTAT: This is the percentage lower status of the population
- MEDV: This is the median value of owner-occupied homes in 1000 dollars
## What we're going to do :
- Retrieve data
- Prepare the data
- Build a model
- Train the model
- Evaluate the result
%% Cell type:markdown id: tags:
## Step 1 - Import and init
%% Cell type:code id: tags:
``` python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
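# Note : torch.autograd.Variable is deprecated in recent PyTorch versions ; plain tensors can be used directly in the same way.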
import numpy as np
import matplotlib.pyplot as plt
import sys,os
import pandas as pd
sys.path.append('..')
import fidle.pwk as ooo
from fidle_pwk_additional import convergence_history_MSELoss
```
%% Cell type:markdown id: tags:
## Step 2 - Retrieve data
Boston housing is a famous historic dataset, which can be downloaded here: [Boston housing datasets](https://www.kaggle.com/puxama/bostoncsv)
%% Cell type:code id: tags:
``` python
data = pd.read_csv('./BostonHousing.csv', header=0)
display(data.head(5).style.format("{0:.2f}").set_caption("Few lines of the dataset :"))
print('Missing Data : ',data.isna().sum().sum(), ' Shape is : ', data.shape)
```
%% Output
Missing Data : 0 Shape is : (506, 14)
%% Cell type:markdown id: tags:
## Step 3 - Preparing the data
### 3.1 - Split data
We will use 70% of the data for training and 30% for validation.
The dataset is **shuffled** and split between **training** and **testing**.
x will be the input data and y the expected output
%% Cell type:code id: tags:
``` python
# ---- Shuffle and Split => train, test
#
data_train = data.sample(frac=0.7, axis=0)
data_test = data.drop(data_train.index)
# ---- Split => x,y (medv is price)
#
x_train = data_train.drop('medv', axis=1)
y_train = data_train['medv']
x_test = data_test.drop('medv', axis=1)
y_test = data_test['medv']
print('Original data shape was : ',data.shape)
print('x_train : ',x_train.shape, 'y_train : ',y_train.shape)
print('x_test : ',x_test.shape, 'y_test : ',y_test.shape)
```
%% Output
Original data shape was : (506, 14)
x_train : (354, 13) y_train : (354,)
x_test : (152, 13) y_test : (152,)
%% Cell type:markdown id: tags:
### 3.2 - Data normalization
**Note :**
- All input data must be normalized, train and test.
- To do this we will **subtract the mean** and **divide by the standard deviation**.
- But test data should not be used in any way, even for normalization.
- The mean and the standard deviation will therefore only be calculated with the train data.
%% Cell type:code id: tags:
``` python
display(x_train.describe().style.format("{0:.2f}").set_caption("Before normalization :"))
mean = x_train.mean()
std = x_train.std()
x_train = (x_train - mean) / std
x_test = (x_test - mean) / std
display(x_train.describe().style.format("{0:.2f}").set_caption("After normalization :"))
display(x_train.head(5).style.format("{0:.2f}").set_caption("Few lines of the dataset :"))
x_train, y_train = np.array(x_train), np.array(y_train)
x_test, y_test = np.array(x_test), np.array(y_test)
```
%% Output
%% Cell type:markdown id: tags:
## Step 4 - Build a model
More information about :
- [Optimizer](https://pytorch.org/docs/stable/optim.html)
- [Basic neural-network blocks](https://pytorch.org/docs/stable/nn.html)
- [Loss](https://pytorch.org/docs/stable/nn.html#loss-functions)
%% Cell type:code id: tags:
``` python
class model_v1(nn.Module):
    """
    Basic fully connected neural-network for tabular data
    """
    def __init__(self,num_vars):
        super(model_v1, self).__init__()
        self.num_vars = num_vars
        self.hidden1  = nn.Linear(self.num_vars, 64)
        self.hidden2  = nn.Linear(64, 64)
        self.hidden3  = nn.Linear(64, 1)

    def forward(self, x):
        x = x.view(-1,self.num_vars)   #flatten the observation before using fully-connected layers
        x = self.hidden1(x)
        x = F.relu(x)
        x = self.hidden2(x)
        x = F.relu(x)
        x = self.hidden3(x)
        return x
```
%% Cell type:markdown id: tags:
## Step 5 - Train the model
### 5.1 - Stochastic gradient descent strategy to fit the model
%% Cell type:code id: tags:
``` python
def fit(model,X_train,Y_train,X_test,Y_test, EPOCHS = 5, BATCH_SIZE = 32):

    loss      = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(),lr=1e-3)   #lr is the learning rate

    model.train()

    history = convergence_history_MSELoss()
    history.update(model,X_train,Y_train,X_test,Y_test)

    n = X_train.shape[0]   #number of observations in the training data

    #stochastic gradient descent
    for epoch in range(EPOCHS):

        batch_start    = 0
        epoch_shuffler = np.arange(n)
        np.random.shuffle(epoch_shuffler)   #remark that 'utilsData.DataLoader' could be used instead

        while batch_start+BATCH_SIZE < n:

            #get mini-batch observation
            mini_batch_observations = epoch_shuffler[batch_start:batch_start+BATCH_SIZE]
            var_X_batch = Variable(X_train[mini_batch_observations,:]).float()
            var_Y_batch = Variable(Y_train[mini_batch_observations]).float()

            #gradient descent step
            optimizer.zero_grad()               #set the parameters gradients to 0
            Y_pred_batch = model(var_X_batch)   #predict y with the current NN parameters
            curr_loss = loss(Y_pred_batch.view(-1), var_Y_batch.view(-1))   #compute the current loss
            curr_loss.backward()                #compute the loss gradient w.r.t. all NN parameters
            optimizer.step()                    #update the NN parameters

            #prepare the next mini-batch of the epoch
            batch_start += BATCH_SIZE

        history.update(model,X_train,Y_train,X_test,Y_test)

    return history
```
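%% Cell type:markdown id: tags:
As a side note, the comment inside `fit()` mentions that a `DataLoader` could replace the manual shuffling. A minimal sketch of that alternative (with hypothetical demo tensors, not used in this notebook) :
%% Cell type:code id: tags:
``` python
from torch.utils.data import TensorDataset, DataLoader

# ---- Hypothetical demo data : 100 observations with 13 features
x_demo = torch.randn(100, 13)
y_demo = torch.randn(100)

# ---- DataLoader shuffles and yields mini-batches of 10 observations
demo_loader = DataLoader(TensorDataset(x_demo, y_demo), batch_size=10, shuffle=True)

for xb, yb in demo_loader:
    pass   # the gradient-descent step of fit() would go here
```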
%% Cell type:markdown id: tags:
### 5.2 - Get the model
%% Cell type:code id: tags:
``` python
model=model_v1( x_train[0,:].shape[0] )
print(model)
```
%% Output
model_v1(
(hidden1): Linear(in_features=13, out_features=64, bias=True)
(hidden2): Linear(in_features=64, out_features=64, bias=True)
(hidden3): Linear(in_features=64, out_features=1, bias=True)
)
%% Cell type:markdown id: tags:
### 5.3 - Train the model
%% Cell type:code id: tags:
``` python
torch_x_train=torch.from_numpy(x_train)
torch_y_train=torch.from_numpy(y_train)
torch_x_test=torch.from_numpy(x_test)
torch_y_test=torch.from_numpy(y_test)
batch_size = 10
epochs = 100
history=fit(model,torch_x_train,torch_y_train,torch_x_test,torch_y_test,EPOCHS=epochs,BATCH_SIZE = batch_size)
```
%% Cell type:markdown id: tags:
## Step 6 - Evaluate
### 6.1 - Model evaluation
MAE = Mean Absolute Error (between the labels and predictions)
A MAE equal to 3 represents an average prediction error of $3k.
%% Cell type:code id: tags:
``` python
var_x_test = Variable(torch_x_test).float()
var_y_test = Variable(torch_y_test).float()
y_pred = model(var_x_test)
nn_loss = nn.MSELoss()
nn_MAE_loss = nn.L1Loss()
print('x_test / loss : {:5.4f}'.format(nn_loss(y_pred.view(-1), var_y_test.view(-1)).item()))
print('x_test / mae : {:5.4f}'.format(nn_MAE_loss(y_pred.view(-1), var_y_test.view(-1)).item()))
```
%% Output
x_test / loss : 9.2881
x_test / mae : 2.3158
%% Cell type:markdown id: tags:
### 6.2 - Training history
What was the best result during our training ?
%% Cell type:code id: tags:
``` python
df=pd.DataFrame(data=history.history)
df.describe()
```
%% Output
loss mae val_loss val_mae
count 101.000000 101.000000 101.000000 101.000000
mean 23.476395 2.692269 23.864368 2.989625
std 76.294859 2.956140 74.627726 2.845627
min 3.821492 1.493004 8.697336 2.217072
25% 5.978652 1.768553 8.936612 2.270410
50% 8.439981 2.047496 9.364830 2.329784
75% 13.354569 2.480675 11.053273 2.494222
max 548.658447 21.719831 565.027466 22.106112
%% Cell type:code id: tags:
``` python
print("min( val_mae ) : {:.4f}".format( min(history.history["val_mae"]) ) )
```
%% Output
min( val_mae ) : 2.2171
%% Cell type:code id: tags:
``` python
ooo.plot_history(history, plot={'MAE' :['mae', 'val_mae'],
                                'LOSS':['loss','val_loss']})
```
%% Output
%% Cell type:markdown id: tags:
## Step 7 - Make a prediction
The data must be normalized with the parameters (mean, std) previously used.
%% Cell type:code id: tags:
``` python
my_data = [ 1.26425925, -0.48522739, 1.0436489 , -0.23112788, 1.37120745,
-2.14308942, 1.13489104, -1.06802005, 1.71189006, 1.57042287,
0.77859951, 0.14769795, 2.7585581 ]
real_price = 10.4
my_data=np.array(my_data).reshape(1,13)
```
%% Cell type:code id: tags:
``` python
torch_my_data=torch.from_numpy(my_data)
var_my_data = Variable(torch_my_data).float()
predictions = model( var_my_data )
print("Prediction : {:.2f} K$".format(predictions[0][0]))
print("Reality : {:.2f} K$".format(real_price))
```
%% Output
Prediction : 10.63 K$
Reality : 10.40 K$
%% Cell type:markdown id: tags:
---
<img width="80px" src="../fidle/img/00-Fidle-logo-01.svg"></img>