02-DNN-Regression Premium.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Deep Neural Network (DNN) - BHPD dataset\n",
    "========================================\n",
    "---\n",
    "Introduction au Deep Learning  (IDLE) - S. Arias, E. Maldonado, JL. Parouty - CNRS/SARI/DEVLOG - 2020  \n",
    "\n",
    "## A very simple example of **regression** (Premium edition):\n",
    "\n",
    "Objective is to predicts **housing prices** from a set of house features. \n",
    "\n",
    "The **[Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html)** consists of price of houses in various places in Boston.  \n",
    "Alongside with price, the dataset also provide information such as Crime, areas of non-retail business in the town,  \n",
    "age of people who own the house and many other attributes...\n",
    "\n",
    "What we're going to do:\n",
    "\n",
    " - (Retrieve data)\n",
    " - (Preparing the data)\n",
    " - (Build a model)\n",
    " - Train and save the model\n",
    " - Restore saved model\n",
    " - Evaluate the model\n",
    " - Make some predictions\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1/ Init python stuff"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "from tensorflow import keras\n",
    "\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import pandas as pd\n",
    "import os\n",
    "\n",
    "from IPython.display import display, Markdown\n",
    "import fidle.pwk as ooo\n",
    "from importlib import reload\n",
    "\n",
    "ooo.init()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2/ Retrieve data\n",
    "\n",
    "**From Keras :**\n",
    "Boston housing is a famous historic dataset, so we can get it directly from [Keras datasets](https://www.tensorflow.org/api_docs/python/tf/keras/datasets)  "
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "(x_train, y_train), (x_test, y_test) = keras.datasets.boston_housing.load_data(test_split=0.2, seed=113)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**From a csv file :**  \n",
    "More fun !"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.read_csv('./data/BostonHousing.csv', header=0)\n",
    "\n",
    "display(data.head(5).style.format(\"{0:.2f}\"))\n",
    "print('Données manquantes : ',data.isna().sum().sum(), '  Shape is : ', data.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3/ Preparing the data\n",
    "### 3.1/ Split data\n",
    "We will use 80% of the data for training and 20% for validation.  \n",
    "x will be input data and y the expected output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Split => train, test\n",
    "#\n",
    "data_train = data.sample(frac=0.7, axis=0)\n",
    "data_test  = data.drop(data_train.index)\n",
    "\n",
    "# ---- Split => x,y (medv is price)\n",
    "#\n",
    "x_train = data_train.drop('medv',  axis=1)\n",
    "y_train = data_train['medv']\n",
    "x_test  = data_test.drop('medv',   axis=1)\n",
    "y_test  = data_test['medv']\n",
    "\n",
    "print('Original data shape was : ',data.shape)\n",
    "print('x_train : ',x_train.shape, 'y_train : ',y_train.shape)\n",
    "print('x_test  : ',x_test.shape,  'y_test  : ',y_test.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2/ Data normalization\n",
    "**Note :** \n",
    " - All input data must be normalized, train and test.  \n",
    " - To do this we will subtract the mean and divide by the standard deviation.  \n",
    " - But test data should not be used in any way, even for normalization.  \n",
    " - The mean and the standard deviation will therefore only be calculated with the train data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "display(x_train.describe().style.format(\"{0:.2f}\").set_caption(\"Before normalization :\"))\n",
    "\n",
    "mean = x_train.mean()\n",
    "std  = x_train.std()\n",
    "x_train = (x_train - mean) / std\n",
    "x_test  = (x_test  - mean) / std\n",
    "\n",
    "display(x_train.describe().style.format(\"{0:.2f}\").set_caption(\"After normalization :\"))\n",
    "\n",
    "x_train, y_train = np.array(x_train), np.array(y_train)\n",
    "x_test,  y_test  = np.array(x_test),  np.array(y_test)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4/ Build a model\n",
    "About informations about : \n",
    " - [Optimizer](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers)\n",
    " - [Activation](https://www.tensorflow.org/api_docs/python/tf/keras/activations)\n",
    " - [Loss](https://www.tensorflow.org/api_docs/python/tf/keras/losses)\n",
    " - [Metrics](https://www.tensorflow.org/api_docs/python/tf/keras/metrics)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "  def get_model_v1(shape):\n",
    "    \n",
    "    model = keras.models.Sequential()\n",
    "    model.add(keras.layers.Dense(64, activation='relu', input_shape=shape))\n",
    "    model.add(keras.layers.Dense(64, activation='relu'))\n",
    "    model.add(keras.layers.Dense(1))\n",
    "    \n",
    "    model.compile(optimizer = 'rmsprop',\n",
    "                  loss      = 'mse',\n",
    "                  metrics   = ['mae', 'mse'] )\n",
    "    return model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5/ Train the model\n",
    "### 5.1/ Get it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model=get_model_v1( (13,) )\n",
    "\n",
    "model.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.2/ Add callback"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.makedirs('./run/models',   mode=0o750, exist_ok=True)\n",
    "save_dir = \"./run/models/best_model.h5\"\n",
    "savemodel_callback = tf.keras.callbacks.ModelCheckpoint(filepath=save_dir, verbose=0, save_best_only=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.3/ Train it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "history = model.fit(x_train,\n",
    "                    y_train,\n",
    "                    epochs          = 100,\n",
    "                    batch_size      = 10,\n",
    "                    verbose         = 0,\n",
    "                    validation_data = (x_test, y_test),\n",
    "                    callbacks       = [savemodel_callback])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6/ Evaluate\n",
    "### 6.1/ Model evaluation\n",
    "MAE =  Mean Absolute Error (between the labels and predictions)  \n",
    "A mae equal to 3 represents an average error in prediction of $3k."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "score = model.evaluate(x_test, y_test, verbose=0)\n",
    "\n",
    "print('x_test / loss      : {:5.4f}'.format(score[0]))\n",
    "print('x_test / mae       : {:5.4f}'.format(score[1]))\n",
    "print('x_test / mse       : {:5.4f}'.format(score[2]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6.2/ Training history\n",
    "What was the best result during our training ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"min( val_mae ) : {:.4f}\".format( min(history.history[\"val_mae\"]) ) )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "reload(ooo)\n",
    "ooo.plot_history(history, plot={'MSE' :['mse', 'val_mse'],\n",
    "                                'MAE' :['mae', 'val_mae'],\n",
    "                                'LOSS':['loss','val_loss']})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7/ Restore a model :"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7.1/ Reload model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loaded_model = tf.keras.models.load_model('./run/models/best_model.h5')\n",
    "loaded_model.summary()\n",
    "print(\"Loaded.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7.2/ Evaluate it :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "score = loaded_model.evaluate(x_test, y_test, verbose=0)\n",
    "\n",
    "print('x_test / loss      : {:5.4f}'.format(score[0]))\n",
    "print('x_test / mae       : {:5.4f}'.format(score[1]))\n",
    "print('x_test / mse       : {:5.4f}'.format(score[2]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7.3/ Make a prediction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictions = loaded_model.predict( x_train[13].reshape(1,13) )\n",
    "print(\"Prédiction : {:.2f} K$   Reality : {:.2f} K$\".format(predictions[0][0], y_train[13]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "-----\n",
    "That's all folks !"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}