01-DNN-Wine-Regression.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img width=\"800px\" src=\"../fidle/img/header.svg\"></img>\n",
    "\n",
    "# <!-- TITLE --> [K3WINE1] - Wine quality prediction with a Dense Network (DNN)\n",
    "  <!-- DESC -->  Another example of regression, with a wine quality prediction, using Keras 3 and PyTorch\n",
    "  <!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->\n",
    "\n",
    "## Objectives :\n",
    " - Predict the **quality of wines**, based on their analysis\n",
    " - Understanding the principle and the architecture of a regression with a dense neural network with backup and restore of the trained model. \n",
    "\n",
    "The **[Wine Quality datasets](https://archive.ics.uci.edu/ml/datasets/wine+Quality)** are made up of analyses of a large number of wines, with an associated quality (between 0 and 10)  \n",
    "This dataset is provide by :  \n",
    "Paulo Cortez, University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez  \n",
    "A. Cerdeira, F. Almeida, T. Matos and J. Reis, Viticulture Commission of the Vinho Verde Region(CVRVV), Porto, Portugal, @2009  \n",
    "This dataset can be retreive at [University of California Irvine (UCI)](https://archive-beta.ics.uci.edu/ml/datasets/wine+quality)\n",
    "\n",
    "\n",
    "Due to privacy and logistic issues, only physicochemical and sensory variables are available  \n",
    "There is no data about grape types, wine brand, wine selling price, etc.\n",
    "\n",
    "- fixed acidity\n",
    "- volatile acidity\n",
    "- citric acid\n",
    "- residual sugar\n",
    "- chlorides\n",
    "- free sulfur dioxide\n",
    "- total sulfur dioxide\n",
    "- density\n",
    "- pH\n",
    "- sulphates\n",
    "- alcohol\n",
    "- quality (score between 0 and 10)\n",
    "\n",
    "## What we're going to do :\n",
    "\n",
    " - (Retrieve data)\n",
    " - (Preparing the data)\n",
    " - (Build a model)\n",
    " - Train and save the model\n",
    " - Restore saved model\n",
    " - Evaluate the model\n",
    " - Make some predictions\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1 - Import and init\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ['KERAS_BACKEND'] = 'torch'\n",
    "\n",
    "import keras\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "import fidle\n",
    "\n",
    "# Init Fidle environment\n",
    "run_id, run_dir, datasets_dir = fidle.init('K3WINE1')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Verbosity during training : \n",
    "- 0 = silent\n",
    "- 1 = progress bar\n",
    "- 2 = one line per epoch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fit_verbosity = 1\n",
    "dataset_name  = 'winequality-red.csv'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Override parameters (batch mode) - Just forget this cell"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fidle.override('fit_verbosity', 'dataset_name')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2 - Retrieve data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.read_csv(f'{datasets_dir}/WineQuality/origine/{dataset_name}', header=0,sep=';')\n",
    "\n",
    "display(data.head(5).style.format(\"{0:.2f}\"))\n",
    "print('Missing Data : ',data.isna().sum().sum(), '  Shape is : ', data.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3 - Preparing the data\n",
    "### 3.1 - Split data\n",
    "We will use 80% of the data for training and 20% for validation.  \n",
    "x will be the data of the analysis and y the quality"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Split => train, test\n",
    "#\n",
    "data       = data.sample(frac=1., axis=0)     # Shuffle\n",
    "data_train = data.sample(frac=0.8, axis=0)    # get 80 %\n",
    "data_test  = data.drop(data_train.index)      # test = all - train\n",
    "\n",
    "# ---- Split => x,y (medv is price)\n",
    "#\n",
    "x_train = data_train.drop('quality',  axis=1)\n",
    "y_train = data_train['quality']\n",
    "x_test  = data_test.drop('quality',   axis=1)\n",
    "y_test  = data_test['quality']\n",
    "\n",
    "print('Original data shape was : ',data.shape)\n",
    "print('x_train : ',x_train.shape, 'y_train : ',y_train.shape)\n",
    "print('x_test  : ',x_test.shape,  'y_test  : ',y_test.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 - Data normalization\n",
    "**Note :** \n",
    " - All input data must be normalized, train and test.  \n",
    " - To do this we will subtract the mean and divide by the standard deviation.  \n",
    " - But test data should not be used in any way, even for normalization.  \n",
    " - The mean and the standard deviation will therefore only be calculated with the train data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "display(x_train.describe().style.format(\"{0:.2f}\").set_caption(\"Before normalization :\"))\n",
    "\n",
    "mean = x_train.mean()\n",
    "std  = x_train.std()\n",
    "x_train = (x_train - mean) / std\n",
    "x_test  = (x_test  - mean) / std\n",
    "\n",
    "display(x_train.describe().style.format(\"{0:.2f}\").set_caption(\"After normalization :\"))\n",
    "\n",
    "# Convert ou DataFrame to numpy array\n",
    "x_train, y_train = np.array(x_train), np.array(y_train)\n",
    "x_test,  y_test  = np.array(x_test),  np.array(y_test)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4 - Build a model\n",
    "More informations about : \n",
    " - [Optimizer](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers)\n",
    " - [Activation](https://www.tensorflow.org/api_docs/python/tf/keras/activations)\n",
    " - [Loss](https://www.tensorflow.org/api_docs/python/tf/keras/losses)\n",
    " - [Metrics](https://www.tensorflow.org/api_docs/python/tf/keras/metrics)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_model_v1(shape):\n",
    "  \n",
    "  model = keras.models.Sequential()\n",
    "  model.add(keras.layers.Input(shape, name=\"InputLayer\"))\n",
    "  model.add(keras.layers.Dense(64, activation='relu', name='Dense_n1'))\n",
    "  model.add(keras.layers.Dense(64, activation='relu', name='Dense_n2'))\n",
    "  model.add(keras.layers.Dense(1, name='Output'))\n",
    "\n",
    "  model.compile(optimizer = 'rmsprop',\n",
    "                loss      = 'mse',\n",
    "                metrics   = ['mae', 'mse'] )\n",
    "  return model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5 - Train the model\n",
    "### 5.1 - Get it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model=get_model_v1( (11,) )\n",
    "\n",
    "model.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.2 - Add callback"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.makedirs('./run/models',   mode=0o750, exist_ok=True)\n",
    "save_dir = \"./run/models/best_model.keras\"\n",
    "\n",
    "savemodel_callback = keras.callbacks.ModelCheckpoint( filepath=save_dir, monitor='val_mae', mode='max', save_best_only=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.3 - Train it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "history = model.fit(x_train,\n",
    "                    y_train,\n",
    "                    epochs          = 100,\n",
    "                    batch_size      = 10,\n",
    "                    verbose         = fit_verbosity,\n",
    "                    validation_data = (x_test, y_test),\n",
    "                    callbacks       = [savemodel_callback])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 6 - Evaluate\n",
    "### 6.1 - Model evaluation\n",
    "MAE =  Mean Absolute Error (between the labels and predictions)  \n",
    "A mae equal to 3 represents an average error in prediction of $3k."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "score = model.evaluate(x_test, y_test, verbose=0)\n",
    "\n",
    "print('x_test / loss      : {:5.4f}'.format(score[0]))\n",
    "print('x_test / mae       : {:5.4f}'.format(score[1]))\n",
    "print('x_test / mse       : {:5.4f}'.format(score[2]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6.2 - Training history\n",
    "What was the best result during our training ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"min( val_mae ) : {:.4f}\".format( min(history.history[\"val_mae\"]) ) )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fidle.scrawler.history( history, plot={'MSE' :['mse', 'val_mse'],\n",
    "                        'MAE' :['mae', 'val_mae'],\n",
    "                        'LOSS':['loss','val_loss']}, save_as='01-history')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 7 - Restore a model :"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7.1 - Reload model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loaded_model = keras.models.load_model('./run/models/best_model.keras')\n",
    "loaded_model.summary()\n",
    "print(\"Loaded.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7.2 - Evaluate it :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "score = loaded_model.evaluate(x_test, y_test, verbose=0)\n",
    "\n",
    "print('x_test / loss      : {:5.4f}'.format(score[0]))\n",
    "print('x_test / mae       : {:5.4f}'.format(score[1]))\n",
    "print('x_test / mse       : {:5.4f}'.format(score[2]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7.3 - Make a prediction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Pick n entries from our test set\n",
    "n = 200\n",
    "ii = np.random.randint(1,len(x_test),n)\n",
    "x_sample = x_test[ii]\n",
    "y_sample = y_test[ii]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Make a predictions\n",
    "y_pred = loaded_model.predict( x_sample, verbose=2 )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Show it\n",
    "print('Wine    Prediction   Real   Delta')\n",
    "for i in range(n):\n",
    "    pred   = y_pred[i][0]\n",
    "    real   = y_sample[i]\n",
    "    delta  = real-pred\n",
    "    print(f'{i:03d}        {pred:.2f}       {real}      {delta:+.2f} ')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Few questions :\n",
    "- Can this model be used for red wines from Bordeaux and/or Beaujolais?\n",
    "- What are the limitations of this model?\n",
    "- What are the limitations of this dataset?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fidle.end()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "<img width=\"80px\" src=\"../fidle/img/logo-paysage.svg\"></img>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.9.2 ('fidle-env')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  },
  "vscode": {
   "interpreter": {
    "hash": "b3929042cc22c1274d74e3e946c52b845b57cb6d84f2d591ffe0519b38e4896d"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}