shuffle with an h is better

Commit 2dac2728 authored 3 years ago by Jean-Luc Parouty
--- a/BHPD/01-DNN-Regression.ipynb
+++ b/BHPD/01-DNN-Regression.ipynb
@@ -174,7 +174,7 @@
   "metadata": {},
   "outputs": [],
   "source": [
-    "# ---- Suffle and Split => train, test\n",
+    "# ---- Shuffle and Split => train, test\n",
    "#\n",
    "data       = data.sample(frac=1., axis=0)\n",
    "data_train = data.sample(frac=0.7, axis=0)\n",

 %% Cell type:markdown id: tags:
 <img width="800px" src="../fidle/img/00-Fidle-header-01.svg"></img>
 # <!-- TITLE --> [BHPD1] - Regression with a Dense Network (DNN)
 <!-- DESC --> A simple regression example using the Boston Housing Prices Dataset (BHPD)
 <!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->
 ## Objectives :
 - Predict **housing prices** from a set of house features.
 - Understand the **principle** and the **architecture** of a regression with a **dense neural network**
 The **[Boston Housing Prices Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html)** consists of the prices of houses in various places in Boston.
 Alongside the price, the dataset also provides the following information :
 - CRIM: This is the per capita crime rate by town
 - ZN: This is the proportion of residential land zoned for lots larger than 25,000 sq.ft
 - INDUS: This is the proportion of non-retail business acres per town
 - CHAS: This is the Charles River dummy variable (this is equal to 1 if tract bounds river; 0 otherwise)
 - NOX: This is the nitric oxides concentration (parts per 10 million)
 - RM: This is the average number of rooms per dwelling
 - AGE: This is the proportion of owner-occupied units built prior to 1940
 - DIS: This is the weighted distances to five Boston employment centers
 - RAD: This is the index of accessibility to radial highways
 - TAX: This is the full-value property-tax rate per 10,000 dollars
 - PTRATIO: This is the pupil-teacher ratio by town
 - B: This is calculated as 1000(Bk - 0.63)^2, where Bk is the proportion of people of African American descent by town
 - LSTAT: This is the percentage of the population with lower status
 - MEDV: This is the median value of owner-occupied homes in 1000 dollars
 ## What we're going to do :
 - Retrieve data
 - Prepare the data
 - Build a model
 - Train the model
 - Evaluate the result
 %% Cell type:markdown id: tags:
 ## Step 1 - Import and init
 You can also adjust the verbosity by changing the value of TF_CPP_MIN_LOG_LEVEL :
 - 0 = all messages are logged (default)
 - 1 = INFO messages are not printed.
 - 2 = INFO and WARNING messages are not printed.
 - 3 = INFO , WARNING and ERROR messages are not printed.
 %% Cell type:code id: tags:
 ``` python
 # import os
 # os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
 import tensorflow as tf
 from tensorflow import keras
 import numpy as np
 import matplotlib.pyplot as plt
 import pandas as pd
 import os,sys
 sys.path.append('..')
 import fidle.pwk as pwk
 datasets_dir = pwk.init('BHPD1')
 ```
 %% Cell type:markdown id: tags:
 Verbosity during training :
 - 0 = silent
 - 1 = progress bar
 - 2 = one line per epoch
 %% Cell type:code id: tags:
 ``` python
 fit_verbosity = 1
 ```
 %% Cell type:markdown id: tags:
 Override parameters (batch mode) - Just forget this cell
 %% Cell type:code id: tags:
 ``` python
 pwk.override('fit_verbosity')
 ```
 %% Cell type:markdown id: tags:
 ## Step 2 - Retrieve data
 ### 2.1 - Option 1  : From Keras
 Boston housing is a famous historic dataset, so we can get it directly from [Keras datasets](https://www.tensorflow.org/api_docs/python/tf/keras/datasets)
 %% Cell type:code id: tags:
 ``` python
 # (x_train, y_train), (x_test, y_test) = keras.datasets.boston_housing.load_data(test_split=0.2, seed=113)
 ```
 %% Cell type:markdown id: tags:
 ### 2.2 - Option 2 : From a csv file
 More fun !
 %% Cell type:code id: tags:
 ``` python
 data = pd.read_csv(f'{datasets_dir}/BHPD/origine/BostonHousing.csv', header=0)
 display(data.head(5).style.format("{0:.2f}").set_caption("Few lines of the dataset :"))
 print('Missing Data : ',data.isna().sum().sum(), '  Shape is : ', data.shape)
 ```
 %% Cell type:markdown id: tags:
 ## Step 3 - Preparing the data
 ### 3.1 - Split data
 We will use 70% of the data for training and 30% for testing (the test set is also used as validation data during training).
 The dataset is **shuffled** and split between **training** and **testing**.
 x will be the input data and y the expected output.
 %% Cell type:code id: tags:
 ``` python
-# ---- Suffle and Split => train, test
+# ---- Shuffle and Split => train, test
 #
 data       = data.sample(frac=1., axis=0)
 data_train = data.sample(frac=0.7, axis=0)
 data_test  = data.drop(data_train.index)
 # ---- Split => x,y (medv is price)
 #
 x_train = data_train.drop('medv',  axis=1)
 y_train = data_train['medv']
 x_test  = data_test.drop('medv',   axis=1)
 y_test  = data_test['medv']
 print('Original data shape was : ',data.shape)
 print('x_train : ',x_train.shape, 'y_train : ',y_train.shape)
 print('x_test  : ',x_test.shape,  'y_test  : ',y_test.shape)
 ```
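 %% Cell type:markdown id: tags:
 The split above is not seeded, so each run gives a different partition. As a minimal sketch, assuming a fixed `random_state` (not used in the notebook) and a small toy DataFrame standing in for the BHPD data, a reproducible 70/30 split could look like this :
 %% Cell type:code id: tags:
```python
import numpy as np
import pandas as pd

# Toy stand-in for the BHPD data (hypothetical values, not the real dataset)
toy = pd.DataFrame({'rm':   np.arange(10, dtype=float),
                    'medv': np.arange(10, dtype=float) * 2.0})

# Assumption: a fixed seed, so the same rows land in train/test on every run
seed = 42
toy_train = toy.sample(frac=0.7, random_state=seed)
toy_test  = toy.drop(toy_train.index)

# The two parts are disjoint and together cover the whole dataset
print('toy_train : ', toy_train.shape, 'toy_test : ', toy_test.shape)
```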
 %% Cell type:markdown id: tags:
 ### 3.2 - Data normalization
 **Note :**
 - All input data must be normalized, train and test.
 - To do this we will **subtract the mean** and **divide by the standard deviation**.
 - But test data should not be used in any way, even for normalization.
 - The mean and the standard deviation will therefore only be calculated with the train data.
 %% Cell type:code id: tags:
 ``` python
 display(x_train.describe().style.format("{0:.2f}").set_caption("Before normalization :"))
 mean = x_train.mean()
 std  = x_train.std()
 x_train = (x_train - mean) / std
 x_test  = (x_test  - mean) / std
 display(x_train.describe().style.format("{0:.2f}").set_caption("After normalization :"))
 display(x_train.head(5).style.format("{0:.2f}").set_caption("Few lines of the dataset :"))
 x_train, y_train = np.array(x_train), np.array(y_train)
 x_test,  y_test  = np.array(x_test),  np.array(y_test)
 ```
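 %% Cell type:markdown id: tags:
 To illustrate the note above, here is a small sketch on synthetic data (the column names and values are hypothetical). Only the train statistics are computed; the test set is transformed with those same values, so its mean and standard deviation end up close to, but not exactly, 0 and 1.
 %% Cell type:code id: tags:
```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for x_train / x_test (hypothetical data)
rng = np.random.default_rng(0)
toy_x_train = pd.DataFrame(rng.normal(5.0, 2.0, size=(70, 3)), columns=['a', 'b', 'c'])
toy_x_test  = pd.DataFrame(rng.normal(5.0, 2.0, size=(30, 3)), columns=['a', 'b', 'c'])

# Statistics come from the train set only
toy_mean = toy_x_train.mean()
toy_std  = toy_x_train.std()      # pandas default is the sample std (ddof=1)

toy_x_train_n = (toy_x_train - toy_mean) / toy_std
toy_x_test_n  = (toy_x_test  - toy_mean) / toy_std

# Train columns are centered and scaled; test columns only approximately
print(toy_x_train_n.mean().round(6).tolist(), toy_x_train_n.std().round(6).tolist())
print(toy_x_test_n.mean().round(3).tolist(),  toy_x_test_n.std().round(3).tolist())
```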
 %% Cell type:markdown id: tags:
 ## Step 4 - Build a model
 More information about :
 - [Optimizer](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers)
 - [Activation](https://www.tensorflow.org/api_docs/python/tf/keras/activations)
 - [Loss](https://www.tensorflow.org/api_docs/python/tf/keras/losses)
 - [Metrics](https://www.tensorflow.org/api_docs/python/tf/keras/metrics)
 %% Cell type:code id: tags:
 ``` python
  def get_model_v1(shape):
    model = keras.models.Sequential()
    model.add(keras.layers.Input(shape, name="InputLayer"))
    model.add(keras.layers.Dense(32, activation='relu', name='Dense_n1'))
    model.add(keras.layers.Dense(64, activation='relu', name='Dense_n2'))
    model.add(keras.layers.Dense(32, activation='relu', name='Dense_n3'))
    model.add(keras.layers.Dense(1, name='Output'))
    model.compile(optimizer = 'adam',
                  loss      = 'mse',
                  metrics   = ['mae', 'mse'] )
    return model
 ```
 %% Cell type:markdown id: tags:
 ## Step 5 - Train the model
 ### 5.1 - Get it
 %% Cell type:code id: tags:
 ``` python
 model=get_model_v1( (13,) )
 model.summary()
 # img=keras.utils.plot_model( model, to_file='./run/model.png', show_shapes=True, show_layer_names=True, dpi=96)
 # display(img)
 ```
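 %% Cell type:markdown id: tags:
 The parameter count reported by `model.summary()` can be checked by hand. The sketch below assumes each Dense layer keeps its default bias term, so a layer with `n_in` inputs and `n_out` units has `(n_in + 1) * n_out` parameters. The helper name `dense_params` is introduced here for illustration only.
 %% Cell type:code id: tags:
```python
# Layer widths of get_model_v1 : 13 inputs, then Dense 32, 64, 32 and 1 output
layer_sizes = [13, 32, 64, 32, 1]

def dense_params(n_in, n_out):
    # n_in * n_out weights plus n_out biases (assumes the bias is enabled)
    return (n_in + 1) * n_out

per_layer    = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(per_layer)

print('Parameters per layer : ', per_layer)      # [448, 2112, 2080, 33]
print('Total parameters     : ', total_params)   # 4673
```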
 %% Cell type:markdown id: tags:
 ### 5.2 - Train it
 %% Cell type:code id: tags:
 ``` python
 history = model.fit(x_train,
                    y_train,
                    epochs          = 60,
                    batch_size      = 10,
                    verbose         = fit_verbosity,
                    validation_data = (x_test, y_test))
 ```
 %% Cell type:markdown id: tags:
 ## Step 6 - Evaluate
 ### 6.1 - Model evaluation
 MAE = Mean Absolute Error (between the labels and the predictions)
 Since prices are expressed in thousands of dollars, an MAE of 3 corresponds to an average prediction error of $3k.
 %% Cell type:code id: tags:
 ``` python
 score = model.evaluate(x_test, y_test, verbose=0)
 print('x_test / loss      : {:5.4f}'.format(score[0]))
 print('x_test / mae       : {:5.4f}'.format(score[1]))
 print('x_test / mse       : {:5.4f}'.format(score[2]))
 ```
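 %% Cell type:markdown id: tags:
 As a worked example of these metrics, the sketch below computes MAE and MSE by hand with NumPy on four hypothetical prices (in thousands of dollars). The values are made up for illustration and do not come from the dataset.
 %% Cell type:code id: tags:
```python
import numpy as np

# Hypothetical labels and predictions, in k$
toy_y_true = np.array([24.0, 22.0, 35.0, 33.0])
toy_y_pred = np.array([27.0, 21.0, 31.0, 33.0])

# Errors are -3, 1, 4 and 0
toy_mae = np.mean(np.abs(toy_y_true - toy_y_pred))    # (3 + 1 + 4 + 0) / 4
toy_mse = np.mean((toy_y_true - toy_y_pred) ** 2)     # (9 + 1 + 16 + 0) / 4

print('mae : {:5.4f}'.format(toy_mae))   # 2.0000, i.e. $2k off on average
print('mse : {:5.4f}'.format(toy_mse))   # 6.5000
```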
 %% Cell type:markdown id: tags:
 ### 6.2 - Training history
 What was the best result during our training ?
 %% Cell type:code id: tags:
 ``` python
 df=pd.DataFrame(data=history.history)
 display(df)
 ```
 %% Cell type:code id: tags:
 ``` python
 print("min( val_mae ) : {:.4f}".format( min(history.history["val_mae"]) ) )
 ```
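 %% Cell type:markdown id: tags:
 To see at which epoch the best value was reached, `np.argmin` can be applied to the same list. A minimal sketch, using a hypothetical list of values in place of `history.history["val_mae"]` :
 %% Cell type:code id: tags:
```python
import numpy as np

# Hypothetical per-epoch validation MAE, standing in for history.history["val_mae"]
toy_val_mae = [4.1, 3.2, 2.9, 3.0, 3.3]

best_epoch   = int(np.argmin(toy_val_mae))    # 0-based index of the minimum
best_val_mae = toy_val_mae[best_epoch]

print("best epoch (1-based) : {}   val_mae : {:.4f}".format(best_epoch + 1, best_val_mae))
```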
 %% Cell type:code id: tags:
 ``` python
 pwk.plot_history(history, plot={'MSE' :['mse', 'val_mse'],
                                'MAE' :['mae', 'val_mae'],
                                'LOSS':['loss','val_loss']}, save_as='01-history')
 ```
 %% Cell type:markdown id: tags:
 ## Step 7 - Make a prediction
 The data must be normalized with the parameters (mean, std) previously used.
 %% Cell type:code id: tags:
 ``` python
 my_data = [ 1.26425925, -0.48522739,  1.0436489 , -0.23112788,  1.37120745,
       -2.14308942,  1.13489104, -1.06802005,  1.71189006,  1.57042287,
        0.77859951,  0.14769795,  2.7585581 ]
 real_price = 10.4
 my_data=np.array(my_data).reshape(1,13)
 ```
 %% Cell type:code id: tags:
 ``` python
 predictions = model.predict( my_data )
 print("Prediction : {:.2f} K$".format(predictions[0][0]))
 print("Reality    : {:.2f} K$".format(real_price))
 ```
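 %% Cell type:markdown id: tags:
 The values in `my_data` above are already normalized. For a raw sample, the same transformation as in step 3.2 would be applied first. A sketch with three hypothetical features and made-up training statistics (the real notebook has 13 features and uses the `mean` and `std` computed on `x_train`) :
 %% Cell type:code id: tags:
```python
import numpy as np

# Hypothetical training statistics for 3 features (illustration only)
toy_mean = np.array([3.6, 11.4, 6.3])
toy_std  = np.array([8.6, 23.3, 0.7])

# Hypothetical raw (non-normalized) sample
raw_sample = np.array([12.2, 0.0, 5.6])

# Same transformation as for x_train / x_test, then reshape to (1, n_features)
sample_n = ((raw_sample - toy_mean) / toy_std).reshape(1, -1)

print(sample_n.shape)
print(sample_n.round(4))
```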
 %% Cell type:code id: tags:
 ``` python
 pwk.end()
 ```
 %% Cell type:markdown id: tags:
 ---
 <img width="80px" src="../fidle/img/00-Fidle-logo-01.svg"></img>

--- a/BHPD_PyTorch/01-DNN-Regression_PyTorch.ipynb
+++ b/BHPD_PyTorch/01-DNN-Regression_PyTorch.ipynb