Micro bug correction with dataset shuffling

01d91351 · Jean-Luc Parouty · c254941c · 01d91351 · 01d91351
Commit 01d91351 authored 3 years ago by Jean-Luc Parouty
--- a/BHPD/01-DNN-Regression.ipynb
+++ b/BHPD/01-DNN-Regression.ipynb
@@ -330,6 +330,7 @@
   "source": [
    "# ---- Suffle and Split => train, test\n",
    "#\n",
+    "data       = data.sample(frac=1., axis=0)\n",
    "data_train = data.sample(frac=0.7, axis=0)\n",
    "data_test  = data.drop(data_train.index)\n",
    "\n",
@@ -1701,7 +1702,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.8.5"
+   "version": "3.8.10"
  }
 },
 "nbformat": 4,

 %% Cell type:markdown id: tags:
 <img width="800px" src="../fidle/img/00-Fidle-header-01.svg"></img>
 # <!-- TITLE --> [BHPD1] - Regression with a Dense Network (DNN)
 <!-- DESC --> Simple example of a regression with the dataset Boston Housing Prices Dataset (BHPD)
 <!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->
 ## Objectives :
 - Predict **housing prices** from a set of house features.
 - Understand the **principle** and the **architecture** of a regression with a **dense neural network**
 The **[Boston Housing Prices Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html)** consists of house prices in various places in Boston.
 Alongside the price, the dataset also provides the following information :
 - CRIM: This is the per capita crime rate by town
 - ZN: This is the proportion of residential land zoned for lots larger than 25,000 sq.ft
 - INDUS: This is the proportion of non-retail business acres per town
 - CHAS: This is the Charles River dummy variable (this is equal to 1 if tract bounds river; 0 otherwise)
 - NOX: This is the nitric oxides concentration (parts per 10 million)
 - RM: This is the average number of rooms per dwelling
 - AGE: This is the proportion of owner-occupied units built prior to 1940
 - DIS: This is the weighted distances to five Boston employment centers
 - RAD: This is the index of accessibility to radial highways
 - TAX: This is the full-value property-tax rate per 10,000 dollars
 - PTRATIO: This is the pupil-teacher ratio by town
 - B: This is calculated as 1000(Bk - 0.63)^2, where Bk is the proportion of people of African American descent by town
 - LSTAT: This is the percentage of the population with lower status
 - MEDV: This is the median value of owner-occupied homes in thousands of dollars
 ## What we're going to do :
 - Retrieve data
 - Prepare the data
 - Build a model
 - Train the model
 - Evaluate the result
 %% Cell type:markdown id: tags:
 ## Step 1 - Import and init
 %% Cell type:code id: tags:
 ``` python
 import tensorflow as tf
 from tensorflow import keras
 import numpy as np
 import matplotlib.pyplot as plt
 import pandas as pd
 import os,sys
 sys.path.append('..')
 import fidle.pwk as pwk
 datasets_dir = pwk.init('BHPD1')
 ```
 %% Output
    <br>**FIDLE 2020 - Practical Work Module**
    Version              : 2.0.1
    Notebook id          : BHPD1
    Run time             : Thursday 14 January 2021, 10:57:04
    TensorFlow version   : 2.2.0
    Keras version        : 2.3.0-tf
    Datasets dir         : /home/pjluc/datasets/fidle
    Run dir              : ./run
    Update keras cache   : False
 %% Cell type:markdown id: tags:
 ## Step 2 - Retrieve data
 ### 2.1 - Option 1  : From Keras
 Boston housing is a famous historic dataset, so we can get it directly from [Keras datasets](https://www.tensorflow.org/api_docs/python/tf/keras/datasets)
 %% Cell type:code id: tags:
 ``` python
 # (x_train, y_train), (x_test, y_test) = keras.datasets.boston_housing.load_data(test_split=0.2, seed=113)
 ```
 %% Cell type:markdown id: tags:
 ### 2.2 - Option 2 : From a csv file
 More fun !
 %% Cell type:code id: tags:
 ``` python
 data = pd.read_csv(f'{datasets_dir}/BHPD/origine/BostonHousing.csv', header=0)
 display(data.head(5).style.format("{0:.2f}").set_caption("Few lines of the dataset :"))
 print('Missing Data : ',data.isna().sum().sum(), '  Shape is : ', data.shape)
 ```
 %% Output
    Missing Data :  0   Shape is :  (506, 14)
 %% Cell type:markdown id: tags:
 ## Step 3 - Preparing the data
 ### 3.1 - Split data
 We will use 70% of the data for training and 30% for validation.
 The dataset is **shuffled** and split between **training** and **testing**.
 x will be input data and y the expected output
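As a hedged illustration of this split, here is a minimal sketch on a small synthetic DataFrame. The column values and the `random_state` seed are assumptions made for the example and are not part of the notebook, which does not fix a seed. `DataFrame.sample(frac=0.7)` draws rows at random without replacement, and dropping those indices from the original table leaves a disjoint test set.

```python
import pandas as pd

# Synthetic stand-in for the housing table (hypothetical values).
data = pd.DataFrame({"rm": range(10), "medv": range(100, 110)})

# Draw 70% of the rows at random; random_state is an assumed seed for reproducibility.
data_train = data.sample(frac=0.7, random_state=42)
# The rows that were not drawn form the test set.
data_test = data.drop(data_train.index)

print(len(data_train), len(data_test))
```

Because the test set is built by dropping the sampled indices, no row can appear in both sets.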
 %% Cell type:code id: tags:
 ``` python
 # ---- Shuffle and Split => train, test
 #
+data       = data.sample(frac=1., axis=0)
 data_train = data.sample(frac=0.7, axis=0)
 data_test  = data.drop(data_train.index)
 # ---- Split => x,y (medv is price)
 #
 x_train = data_train.drop('medv',  axis=1)
 y_train = data_train['medv']
 x_test  = data_test.drop('medv',   axis=1)
 y_test  = data_test['medv']
 print('Original data shape was : ',data.shape)
 print('x_train : ',x_train.shape, 'y_train : ',y_train.shape)
 print('x_test  : ',x_test.shape,  'y_test  : ',y_test.shape)
 ```
 %% Output
    Original data shape was :  (506, 14)
    x_train :  (354, 13) y_train :  (354,)
    x_test  :  (152, 13) y_test  :  (152,)
 %% Cell type:markdown id: tags:
 ### 3.2 - Data normalization
 **Note :**
 - All input data must be normalized, train and test.
 - To do this we will **subtract the mean** and **divide by the standard deviation**.
 - But test data should not be used in any way, even for normalization.
 - The mean and the standard deviation will therefore only be calculated with the train data.
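The rules above can be sketched on synthetic data. This is a minimal illustration with NumPy arrays rather than the notebook's DataFrames; the seed, shapes and distribution parameters are assumptions. Note that `ndarray.std()` uses the population formula (`ddof=0`) while pandas `DataFrame.std()` defaults to `ddof=1`, so the two give slightly different scales.

```python
import numpy as np

rng = np.random.default_rng(0)   # assumed seed, for a repeatable demo
x_train = rng.normal(loc=50.0, scale=10.0, size=(100, 3))   # synthetic features
x_test = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

# Statistics are computed from the training set only.
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)

# The same mean and std are then applied to both sets.
x_train_norm = (x_train - mean) / std
x_test_norm = (x_test - mean) / std

print(x_train_norm.mean(axis=0).round(6))   # close to 0 for every column
print(x_train_norm.std(axis=0).round(6))    # close to 1 for every column
```

The normalized test set will generally not have exactly zero mean and unit variance, which is expected since its statistics were never used.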
 %% Cell type:code id: tags:
 ``` python
 display(x_train.describe().style.format("{0:.2f}").set_caption("Before normalization :"))
 mean = x_train.mean()
 std  = x_train.std()
 x_train = (x_train - mean) / std
 x_test  = (x_test  - mean) / std
 display(x_train.describe().style.format("{0:.2f}").set_caption("After normalization :"))
 display(x_train.head(5).style.format("{0:.2f}").set_caption("Few lines of the dataset :"))
 x_train, y_train = np.array(x_train), np.array(y_train)
 x_test,  y_test  = np.array(x_test),  np.array(y_test)
 ```
 %% Output
 %% Cell type:markdown id: tags:
 ## Step 4 - Build a model
 More information about :
 - [Optimizer](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers)
 - [Activation](https://www.tensorflow.org/api_docs/python/tf/keras/activations)
 - [Loss](https://www.tensorflow.org/api_docs/python/tf/keras/losses)
 - [Metrics](https://www.tensorflow.org/api_docs/python/tf/keras/metrics)
 %% Cell type:code id: tags:
 ``` python
  def get_model_v1(shape):
    model = keras.models.Sequential()
    model.add(keras.layers.Input(shape, name="InputLayer"))
    model.add(keras.layers.Dense(32, activation='relu', name='Dense_n1'))
    model.add(keras.layers.Dense(64, activation='relu', name='Dense_n2'))
    model.add(keras.layers.Dense(32, activation='relu', name='Dense_n3'))
    model.add(keras.layers.Dense(1, name='Output'))
    model.compile(optimizer = 'adam',
                  loss      = 'mse',
                  metrics   = ['mae', 'mse'] )
    return model
 ```
 %% Cell type:markdown id: tags:
 ## Step 5 - Train the model
 ### 5.1 - Get it
 %% Cell type:code id: tags:
 ``` python
 model=get_model_v1( (13,) )
 model.summary()
 # img=keras.utils.plot_model( model, to_file='./run/model.png', show_shapes=True, show_layer_names=True, dpi=96)
 # display(img)
 ```
 %% Output
    Model: "sequential_5"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    Dense_n1 (Dense)             (None, 32)                448
    _________________________________________________________________
    Dense_n2 (Dense)             (None, 64)                2112
    _________________________________________________________________
    Dense_n3 (Dense)             (None, 32)                2080
    _________________________________________________________________
    Output (Dense)               (None, 1)                 33
    =================================================================
    Total params: 4,673
    Trainable params: 4,673
    Non-trainable params: 0
    _________________________________________________________________
 %% Cell type:markdown id: tags:
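The parameter counts reported by `model.summary()` can be checked by hand: a Dense layer holds one weight per (input, unit) pair plus one bias per unit. The helper name `dense_params` below is hypothetical and only serves this illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Input width, followed by the number of units of each Dense layer.
layer_sizes = [13, 32, 64, 32, 1]
counts = [dense_params(i, o) for i, o in zip(layer_sizes, layer_sizes[1:])]

print(counts)       # [448, 2112, 2080, 33]
print(sum(counts))  # 4673
```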
 ### 5.2 - Train it
 %% Cell type:code id: tags:
 ``` python
 history = model.fit(x_train,
                    y_train,
                    epochs          = 60,
                    batch_size      = 10,
                    verbose         = 0,
                    validation_data = (x_test, y_test))
 ```
 %% Cell type:markdown id: tags:
 ## Step 6 - Evaluate
 ### 6.1 - Model evaluation
 MAE = Mean Absolute Error (between the labels and predictions)
 Since prices are expressed in thousands of dollars, an MAE of 3 represents an average prediction error of $3k.
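To make the two metrics concrete, here is a small worked example on hypothetical prices and predictions (both arrays are made up for illustration, in thousands of dollars).

```python
import numpy as np

y_true = np.array([24.0, 21.6, 34.7, 33.4])   # hypothetical prices
y_pred = np.array([25.0, 20.6, 31.7, 36.4])   # hypothetical predictions

mae = np.mean(np.abs(y_true - y_pred))   # average absolute error
mse = np.mean((y_true - y_pred) ** 2)    # average squared error

print(f"mae = {mae:.2f}, mse = {mse:.2f}")   # mae = 2.00, mse = 5.00
```

The absolute errors are 1, 1, 3 and 3, so the MAE is 2 (an average error of $2k), while the MSE of 5 weights the two larger errors more heavily.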
 %% Cell type:code id: tags:
 ``` python
 score = model.evaluate(x_test, y_test, verbose=1)
 print('x_test / loss      : {:5.4f}'.format(score[0]))
 print('x_test / mae       : {:5.4f}'.format(score[1]))
 print('x_test / mse       : {:5.4f}'.format(score[2]))
 ```
 %% Output
    5/5 [==============================] - 0s 2ms/step - loss: 11.9059 - mae: 2.6448 - mse: 11.9059
    x_test / loss      : 11.9059
    x_test / mae       : 2.6448
    x_test / mse       : 11.9059
 %% Cell type:markdown id: tags:
 ### 6.2 - Training history
 What was the best result during our training ?
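One way to answer this is to look for the epoch with the lowest validation MAE. The sketch below uses a hypothetical dictionary shaped like `history.history`; the values are invented for the example.

```python
# Hypothetical history, shaped like the dict found in `history.history`.
history_dict = {"val_mae": [4.1, 3.2, 2.9, 2.5, 2.7, 2.6]}

val_mae = history_dict["val_mae"]
# Index (0-based) of the smallest validation MAE.
best_epoch = min(range(len(val_mae)), key=val_mae.__getitem__)

print(f"best val_mae = {val_mae[best_epoch]:.4f} at epoch {best_epoch + 1}")
```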
 %% Cell type:code id: tags:
 ``` python
 df=pd.DataFrame(data=history.history)
 display(df)
 ```
 %% Output
 %% Cell type:code id: tags:
 ``` python
 print("min( val_mae ) : {:.4f}".format( min(history.history["val_mae"]) ) )
 ```
 %% Output
    min( val_mae ) : 2.4794
 %% Cell type:code id: tags:
 ``` python
 pwk.plot_history(history, plot={'MSE' :['mse', 'val_mse'],
                                'MAE' :['mae', 'val_mae'],
                                'LOSS':['loss','val_loss']}, save_as='01-history')
 ```
 %% Output
 %% Cell type:markdown id: tags:
 ## Step 7 - Make a prediction
 The data must be normalized with the parameters (mean, std) previously used.
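As a minimal sketch of that step, assuming a toy problem with three features: the `mean` and `std` arrays below are hypothetical stand-ins for the statistics computed on the training set, and `raw_sample` is an invented observation.

```python
import numpy as np

# Hypothetical training statistics for a 3-feature toy problem.
mean = np.array([3.6, 11.4, 6.3])
std = np.array([8.6, 23.3, 0.7])

raw_sample = np.array([0.5, 0.0, 7.0])   # raw, un-normalized feature values

# Apply the training statistics, then add the batch dimension expected by predict().
my_data = ((raw_sample - mean) / std).reshape(1, -1)

print(my_data.shape)   # (1, 3)
```

In the notebook the sample has 13 features, hence the `reshape(1,13)`.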
 %% Cell type:code id: tags:
 ``` python
 my_data = [ 1.26425925, -0.48522739,  1.0436489 , -0.23112788,  1.37120745,
       -2.14308942,  1.13489104, -1.06802005,  1.71189006,  1.57042287,
        0.77859951,  0.14769795,  2.7585581 ]
 real_price = 10.4
 my_data=np.array(my_data).reshape(1,13)
 ```
 %% Cell type:code id: tags:
 ``` python
 predictions = model.predict( my_data )
 print("Prediction : {:.2f} K$".format(predictions[0][0]))
 print("Reality    : {:.2f} K$".format(real_price))
 ```
 %% Output
    Prediction : 10.68 K$
    Reality    : 10.40 K$
 %% Cell type:code id: tags:
 ``` python
 pwk.end()
 ```
 %% Output
    End time is : Thursday 14 January 2021, 11:24:04
    Duration is : 00:26:59 485ms
    This notebook ends here
 %% Cell type:markdown id: tags:
 ---
 <img width="80px" src="../fidle/img/00-Fidle-logo-01.svg"></img>

--- a/BHPD/02-DNN-Regression-Premium.ipynb
+++ b/BHPD/02-DNN-Regression-Premium.ipynb
@@ -130,6 +130,7 @@
   "source": [
    "# ---- Split => train, test\n",
    "#\n",
+    "data       = data.sample(frac=1., axis=0)\n",
    "data_train = data.sample(frac=0.7, axis=0)\n",
    "data_test  = data.drop(data_train.index)\n",
    "\n",
@@ -431,7 +432,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.7.9"
+   "version": "3.8.10"
  }
 },
 "nbformat": 4,

 %% Cell type:markdown id: tags:
 <img width="800px" src="../fidle/img/00-Fidle-header-01.svg"></img>
 # <!-- TITLE --> [BHPD2] - Regression with a Dense Network (DNN) - Advanced code
  <!-- DESC -->  A more advanced implementation of the previous example
  <!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->
 ## Objectives :
 - Predict **housing prices** from a set of house features.
 - Understand the principle and the architecture of a regression with a dense neural network, including saving and restoring the trained model.
 The **[Boston Housing Prices Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html)** consists of house prices in various places in Boston.
 Alongside the price, the dataset also provides the following information :
 - CRIM: This is the per capita crime rate by town
 - ZN: This is the proportion of residential land zoned for lots larger than 25,000 sq.ft
 - INDUS: This is the proportion of non-retail business acres per town
 - CHAS: This is the Charles River dummy variable (this is equal to 1 if tract bounds river; 0 otherwise)
 - NOX: This is the nitric oxides concentration (parts per 10 million)
 - RM: This is the average number of rooms per dwelling
 - AGE: This is the proportion of owner-occupied units built prior to 1940
 - DIS: This is the weighted distances to five Boston employment centers
 - RAD: This is the index of accessibility to radial highways
 - TAX: This is the full-value property-tax rate per 10,000 dollars
 - PTRATIO: This is the pupil-teacher ratio by town
 - B: This is calculated as 1000(Bk - 0.63)^2, where Bk is the proportion of people of African American descent by town
 - LSTAT: This is the percentage of the population with lower status
 - MEDV: This is the median value of owner-occupied homes in thousands of dollars
 ## What we're going to do :
 - (Retrieve data)
 - (Prepare the data)
 - (Build a model)
 - Train and save the model
 - Restore saved model
 - Evaluate the model
 - Make some predictions
 %% Cell type:markdown id: tags:
 ## Step 1 - Import and init
 %% Cell type:code id: tags:
 ``` python
 import tensorflow as tf
 from tensorflow import keras
 import numpy as np
 import matplotlib.pyplot as plt
 import pandas as pd
 import os,sys
 from IPython.display import Markdown
 from importlib import reload
 sys.path.append('..')
 import fidle.pwk as pwk
 datasets_dir = pwk.init('BHPD2')
 ```
 %% Cell type:markdown id: tags:
 ## Step 2 - Retrieve data
 ### 2.1 - Option 1  : From Keras
 Boston housing is a famous historic dataset, so we can get it directly from [Keras datasets](https://www.tensorflow.org/api_docs/python/tf/keras/datasets)
 %% Cell type:code id: tags:
 ``` python
 # (x_train, y_train), (x_test, y_test) = keras.datasets.boston_housing.load_data(test_split=0.2, seed=113)
 ```
 %% Cell type:markdown id: tags:
 ### 2.2 - Option 2 : From a csv file
 More fun !
 %% Cell type:code id: tags:
 ``` python
 data = pd.read_csv(f'{datasets_dir}/BHPD/origine/BostonHousing.csv', header=0)
 display(data.head(5).style.format("{0:.2f}"))
 print('Missing Data : ',data.isna().sum().sum(), '  Shape is : ', data.shape)
 ```
 %% Cell type:markdown id: tags:
 ## Step 3 - Preparing the data
 ### 3.1 - Split data
 We will use 70% of the data for training and 30% for validation.
 x will be input data and y the expected output
 %% Cell type:code id: tags:
 ``` python
 # ---- Split => train, test
 #
+data       = data.sample(frac=1., axis=0)
 data_train = data.sample(frac=0.7, axis=0)
 data_test  = data.drop(data_train.index)
 # ---- Split => x,y (medv is price)
 #
 x_train = data_train.drop('medv',  axis=1)
 y_train = data_train['medv']
 x_test  = data_test.drop('medv',   axis=1)
 y_test  = data_test['medv']
 print('Original data shape was : ',data.shape)
 print('x_train : ',x_train.shape, 'y_train : ',y_train.shape)
 print('x_test  : ',x_test.shape,  'y_test  : ',y_test.shape)
 ```
 %% Cell type:markdown id: tags:
 ### 3.2 - Data normalization
 **Note :**
 - All input data must be normalized, train and test.
 - To do this we will subtract the mean and divide by the standard deviation.
 - But test data should not be used in any way, even for normalization.
 - The mean and the standard deviation will therefore only be calculated with the train data.
 %% Cell type:code id: tags:
 ``` python
 display(x_train.describe().style.format("{0:.2f}").set_caption("Before normalization :"))
 mean = x_train.mean()
 std  = x_train.std()
 x_train = (x_train - mean) / std
 x_test  = (x_test  - mean) / std
 display(x_train.describe().style.format("{0:.2f}").set_caption("After normalization :"))
 x_train, y_train = np.array(x_train), np.array(y_train)
 x_test,  y_test  = np.array(x_test),  np.array(y_test)
 ```
 %% Cell type:markdown id: tags:
 ## Step 4 - Build a model
 More information about :
 - [Optimizer](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers)
 - [Activation](https://www.tensorflow.org/api_docs/python/tf/keras/activations)
 - [Loss](https://www.tensorflow.org/api_docs/python/tf/keras/losses)
 - [Metrics](https://www.tensorflow.org/api_docs/python/tf/keras/metrics)
 %% Cell type:code id: tags:
 ``` python
  def get_model_v1(shape):
    model = keras.models.Sequential()
    model.add(keras.layers.Input(shape, name="InputLayer"))
    model.add(keras.layers.Dense(64, activation='relu', name='Dense_n1'))
    model.add(keras.layers.Dense(64, activation='relu', name='Dense_n2'))
    model.add(keras.layers.Dense(1, name='Output'))
    model.compile(optimizer = 'rmsprop',
                  loss      = 'mse',
                  metrics   = ['mae', 'mse'] )
    return model
 ```
 %% Cell type:markdown id: tags:
 ## Step 5 - Train the model
 ### 5.1 - Get it
 %% Cell type:code id: tags:
 ``` python
 model=get_model_v1( (13,) )
 model.summary()
 # img=keras.utils.plot_model( model, to_file='./run/model.png', show_shapes=True, show_layer_names=True, dpi=96)
 # display(img)
 ```
 %% Cell type:markdown id: tags:
 ### 5.2 - Add callback
 %% Cell type:code id: tags:
 ``` python
 os.makedirs('./run/models',   mode=0o750, exist_ok=True)
 save_dir = "./run/models/best_model.h5"
 savemodel_callback = tf.keras.callbacks.ModelCheckpoint(filepath=save_dir, verbose=0, save_best_only=True)
 ```
 %% Cell type:markdown id: tags:
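With `save_best_only=True`, the `ModelCheckpoint` callback defined above overwrites the file only when the monitored quantity improves; by default that quantity is `val_loss`. The sketch below mimics this decision rule in plain Python on hypothetical loss values. It illustrates the idea and is not the Keras implementation; the function name `epochs_that_save` is made up for the example.

```python
def epochs_that_save(val_losses):
    """Return the 1-based epochs at which a 'save best only' rule would write a file."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:   # strictly better than anything seen so far
            best = loss
            saved.append(epoch)
    return saved

print(epochs_that_save([30.2, 18.5, 19.1, 12.4, 12.9, 11.8]))   # [1, 2, 4, 6]
```

The file left on disk at the end of training therefore corresponds to the best epoch, which is not necessarily the last one.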
 ### 5.3 - Train it
 %% Cell type:code id: tags:
 ``` python
 history = model.fit(x_train,
                    y_train,
                    epochs          = 100,
                    batch_size      = 10,
                    verbose         = 1,
                    validation_data = (x_test, y_test),
                    callbacks       = [savemodel_callback])
 ```
 %% Cell type:markdown id: tags:
 ## Step 6 - Evaluate
 ### 6.1 - Model evaluation
 MAE =  Mean Absolute Error (between the labels and predictions)
 A mae equal to 3 represents an average error in prediction of $3k.
 %% Cell type:code id: tags:
 ``` python
 score = model.evaluate(x_test, y_test, verbose=0)
 print('x_test / loss      : {:5.4f}'.format(score[0]))
 print('x_test / mae       : {:5.4f}'.format(score[1]))
 print('x_test / mse       : {:5.4f}'.format(score[2]))
 ```
 %% Cell type:markdown id: tags:
 ### 6.2 - Training history
 What was the best result during our training ?
 %% Cell type:code id: tags:
 ``` python
 print("min( val_mae ) : {:.4f}".format( min(history.history["val_mae"]) ) )
 ```
 %% Cell type:code id: tags:
 ``` python
 pwk.plot_history(history, plot={'MSE' :['mse', 'val_mse'],
                                'MAE' :['mae', 'val_mae'],
                                'LOSS':['loss','val_loss']}, save_as='01-history')
 ```
 %% Cell type:markdown id: tags:
 ## Step 7 - Restore a model :
 %% Cell type:markdown id: tags:
 ### 7.1 - Reload model
 %% Cell type:code id: tags:
 ``` python
 loaded_model = tf.keras.models.load_model('./run/models/best_model.h5')
 loaded_model.summary()
 print("Loaded.")
 ```
 %% Cell type:markdown id: tags:
 ### 7.2 - Evaluate it :
 %% Cell type:code id: tags:
 ``` python
 score = loaded_model.evaluate(x_test, y_test, verbose=0)
 print('x_test / loss      : {:5.4f}'.format(score[0]))
 print('x_test / mae       : {:5.4f}'.format(score[1]))
 print('x_test / mse       : {:5.4f}'.format(score[2]))
 ```
 %% Cell type:markdown id: tags:
 ### 7.3 - Make a prediction
 %% Cell type:code id: tags:
 ``` python
 my_data = [ 1.26425925, -0.48522739,  1.0436489 , -0.23112788,  1.37120745,
       -2.14308942,  1.13489104, -1.06802005,  1.71189006,  1.57042287,
        0.77859951,  0.14769795,  2.7585581 ]
 real_price = 10.4
 my_data=np.array(my_data).reshape(1,13)
 ```
 %% Cell type:code id: tags:
 ``` python
 predictions = loaded_model.predict( my_data )
 print("Prediction : {:.2f} K$   Reality : {:.2f} K$".format(predictions[0][0], real_price))
 ```
 %% Cell type:code id: tags:
 ``` python
 pwk.end()
 ```
 %% Cell type:markdown id: tags:
 ---
 <img width="80px" src="../fidle/img/00-Fidle-logo-01.svg"></img>