{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img width=\"800px\" src=\"../fidle/img/header.svg\"></img>\n",
    "\n",
    "# <!-- TITLE --> [K3BHPD2] - Regression with a Dense Network (DNN) - Advanced code\n",
    "  <!-- DESC -->  A more advanced implementation of the precedent example, using Keras3\n",
    "  <!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->\n",
    "\n",
    "## Objectives :\n",
    " - Predicts **housing prices** from a set of house features. \n",
    " - Understanding the principle and the architecture of a regression with a dense neural network with backup and restore of the trained model. \n",
    "\n",
    "The **[Boston Housing Prices Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html)** consists of price of houses in various places in Boston.  \n",
    "Alongside with price, the dataset also provide these information :\n",
    "\n",
    " - CRIM: This is the per capita crime rate by town\n",
    " - ZN: This is the proportion of residential land zoned for lots larger than 25,000 sq.ft\n",
    " - INDUS: This is the proportion of non-retail business acres per town\n",
    " - CHAS: This is the Charles River dummy variable (this is equal to 1 if tract bounds river; 0 otherwise)\n",
    " - NOX: This is the nitric oxides concentration (parts per 10 million)\n",
    " - RM: This is the average number of rooms per dwelling\n",
    " - AGE: This is the proportion of owner-occupied units built prior to 1940\n",
    " - DIS: This is the weighted distances to five Boston employment centers\n",
    " - RAD: This is the index of accessibility to radial highways\n",
    " - TAX: This is the full-value property-tax rate per 10,000 dollars\n",
    " - PTRATIO: This is the pupil-teacher ratio by town\n",
    " - B: This is calculated as 1000(Bk — 0.63)^2, where Bk is the proportion of people of African American descent by town\n",
    " - LSTAT: This is the percentage lower status of the population\n",
    " - MEDV: This is the median value of owner-occupied homes in 1000 dollars\n",
    "\n",
    "## What we're going to do :\n",
    "\n",
    " - (Retrieve data)\n",
    " - (Preparing the data)\n",
    " - (Build a model)\n",
    " - Train and save the model\n",
    " - Restore saved model\n",
    " - Evaluate the model\n",
    " - Make some predictions\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1 - Import and init\n",
    "\n",
    "You can also adjust the verbosity by changing the value of TF_CPP_MIN_LOG_LEVEL :\n",
    "- 0 = all messages are logged (default)\n",
    "- 1 = INFO messages are not printed.\n",
    "- 2 = INFO and WARNING messages are not printed.\n",
    "- 3 = INFO , WARNING and ERROR messages are not printed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ['KERAS_BACKEND'] = 'torch'\n",
    "\n",
    "import keras\n",
    "\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import pandas as pd\n",
    "import os,sys\n",
    "\n",
    "from IPython.display import Markdown\n",
    "from importlib import reload\n",
    "\n",
    "import fidle\n",
    "\n",
    "# Init Fidle environment\n",
    "run_id, run_dir, datasets_dir = fidle.init('K3BHPD2')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Verbosity during training : \n",
    "- 0 = silent\n",
    "- 1 = progress bar\n",
    "- 2 = one line per epoch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fit_verbosity = 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Override parameters (batch mode) - Just forget this cell"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fidle.override('fit_verbosity')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2 - Retrieve data\n",
    "\n",
    "### 2.1 - Option 1  : From Keras\n",
    "Boston housing is a famous historic dataset, so we can get it directly from [Keras datasets](https://keras.io/api/datasets)  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# (x_train, y_train), (x_test, y_test) = keras.datasets.boston_housing.load_data(test_split=0.2, seed=113)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 - Option 2 : From a csv file\n",
    "More fun !"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.read_csv(f'{datasets_dir}/BHPD/origine/BostonHousing.csv', header=0)\n",
    "\n",
    "display(data.head(5).style.format(\"{0:.2f}\"))\n",
    "print('Missing Data : ',data.isna().sum().sum(), '  Shape is : ', data.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3 - Preparing the data\n",
    "### 3.1 - Split data\n",
    "We will use 80% of the data for training and 20% for validation.  \n",
    "x will be input data and y the expected output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Split => train, test\n",
    "#\n",
    "data       = data.sample(frac=1., axis=0)\n",
    "data_train = data.sample(frac=0.7, axis=0)\n",
    "data_test  = data.drop(data_train.index)\n",
    "\n",
    "# ---- Split => x,y (medv is price)\n",
    "#\n",
    "x_train = data_train.drop('medv',  axis=1)\n",
    "y_train = data_train['medv']\n",
    "x_test  = data_test.drop('medv',   axis=1)\n",
    "y_test  = data_test['medv']\n",
    "\n",
    "print('Original data shape was : ',data.shape)\n",
    "print('x_train : ',x_train.shape, 'y_train : ',y_train.shape)\n",
    "print('x_test  : ',x_test.shape,  'y_test  : ',y_test.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 - Data normalization\n",
    "**Note :** \n",
    " - All input data must be normalized, train and test.  \n",
    " - To do this we will subtract the mean and divide by the standard deviation.  \n",
    " - But test data should not be used in any way, even for normalization.  \n",
    " - The mean and the standard deviation will therefore only be calculated with the train data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "display(x_train.describe().style.format(\"{0:.2f}\").set_caption(\"Before normalization :\"))\n",
    "\n",
    "mean = x_train.mean()\n",
    "std  = x_train.std()\n",
    "x_train = (x_train - mean) / std\n",
    "x_test  = (x_test  - mean) / std\n",
    "\n",
    "display(x_train.describe().style.format(\"{0:.2f}\").set_caption(\"After normalization :\"))\n",
    "\n",
    "x_train, y_train = np.array(x_train), np.array(y_train)\n",
    "x_test,  y_test  = np.array(x_test),  np.array(y_test)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4 - Build a model\n",
    "More informations about : \n",
    " - [Optimizer](https://keras.io/api/optimizers)\n",
    " - [Activation](https://keras.io/api/layers/activations)\n",
    " - [Loss](https://keras.io/api/losses)\n",
    " - [Metrics](https://keras.io/api/metrics)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_model_v1(shape):\n",
    "  \n",
    "  model = keras.models.Sequential()\n",
    "  model.add(keras.layers.Input(shape, name=\"InputLayer\"))\n",
    "  model.add(keras.layers.Dense(64, activation='relu', name='Dense_n1'))\n",
    "  model.add(keras.layers.Dense(64, activation='relu', name='Dense_n2'))\n",
    "  model.add(keras.layers.Dense(1, name='Output'))\n",
    "\n",
    "  model.compile(optimizer = 'rmsprop',\n",
    "                loss      = 'mse',\n",
    "                metrics   = ['mae', 'mse'] )\n",
    "  return model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5 - Train the model\n",
    "### 5.1 - Get it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model=get_model_v1( (13,) )\n",
    "\n",
    "model.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.2 - Add callback"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.makedirs(f'{run_dir}/models',   mode=0o750, exist_ok=True)\n",
    "save_dir = f'{run_dir}/models/best_model.keras'\n",
    "\n",
    "savemodel_callback = keras.callbacks.ModelCheckpoint( filepath=save_dir, monitor='val_mae', mode='max', save_best_only=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.3 - Train it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "history = model.fit(x_train,\n",
    "                    y_train,\n",
    "                    epochs          = 50,\n",
    "                    batch_size      = 10,\n",
    "                    verbose         = fit_verbosity,\n",
    "                    validation_data = (x_test, y_test),\n",
    "                    callbacks       = [savemodel_callback])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 6 - Evaluate\n",
    "### 6.1 - Model evaluation\n",
    "MAE =  Mean Absolute Error (between the labels and predictions)  \n",
    "A mae equal to 3 represents an average error in prediction of $3k."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "score = model.evaluate(x_test, y_test, verbose=0)\n",
    "\n",
    "print('x_test / loss      : {:5.4f}'.format(score[0]))\n",
    "print('x_test / mae       : {:5.4f}'.format(score[1]))\n",
    "print('x_test / mse       : {:5.4f}'.format(score[2]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6.2 - Training history\n",
    "What was the best result during our training ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"min( val_mae ) : {:.4f}\".format( min(history.history[\"val_mae\"]) ) )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fidle.scrawler.history( history, plot={'MSE' :['mse', 'val_mse'],\n",
    "                        'MAE' :['mae', 'val_mae'],\n",
    "                        'LOSS':['loss','val_loss']}, save_as='01-history')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 7 - Restore a model :"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7.1 - Reload model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loaded_model = keras.models.load_model(f'{run_dir}/models/best_model.keras')\n",
    "loaded_model.summary()\n",
    "print(\"Loaded.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7.2 - Evaluate it :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "score = loaded_model.evaluate(x_test, y_test, verbose=0)\n",
    "\n",
    "print('x_test / loss      : {:5.4f}'.format(score[0]))\n",
    "print('x_test / mae       : {:5.4f}'.format(score[1]))\n",
    "print('x_test / mse       : {:5.4f}'.format(score[2]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7.3 - Make a prediction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_data = [ 1.26425925, -0.48522739,  1.0436489 , -0.23112788,  1.37120745,\n",
    "       -2.14308942,  1.13489104, -1.06802005,  1.71189006,  1.57042287,\n",
    "        0.77859951,  0.14769795,  2.7585581 ]\n",
    "real_price = 10.4\n",
    "\n",
    "my_data=np.array(my_data).reshape(1,13)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictions = loaded_model.predict( my_data )\n",
    "print(\"Prediction : {:.2f} K$   Reality : {:.2f} K$\".format(predictions[0][0], real_price))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fidle.end()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "<img width=\"80px\" src=\"../fidle/img/logo-paysage.svg\"></img>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.9.2 ('fidle-env')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  },
  "vscode": {
   "interpreter": {
    "hash": "b3929042cc22c1274d74e3e946c52b845b57cb6d84f2d591ffe0519b38e4896d"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}