{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "<img width=\"800px\" src=\"../fidle/img/header.svg\"></img>\n", "\n", "# <!-- TITLE --> [K3WINE1] - Wine quality prediction with a Dense Network (DNN)\n", " <!-- DESC --> Another example of regression, with a wine quality prediction, using Keras 3 and PyTorch\n", " <!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->\n", "\n", "## Objectives :\n", " - Predict the **quality of wines**, based on their analysis\n", " - Understanding the principle and the architecture of a regression with a dense neural network with backup and restore of the trained model. \n", "\n", "The **[Wine Quality datasets](https://archive.ics.uci.edu/ml/datasets/wine+Quality)** are made up of analyses of a large number of wines, with an associated quality (between 0 and 10) \n", "This dataset is provide by : \n", "Paulo Cortez, University of Minho, GuimarĂ£es, Portugal, http://www3.dsi.uminho.pt/pcortez \n", "A. Cerdeira, F. Almeida, T. Matos and J. Reis, Viticulture Commission of the Vinho Verde Region(CVRVV), Porto, Portugal, @2009 \n", "This dataset can be retreive at [University of California Irvine (UCI)](https://archive-beta.ics.uci.edu/ml/datasets/wine+quality)\n", "\n", "\n", "Due to privacy and logistic issues, only physicochemical and sensory variables are available \n", "There is no data about grape types, wine brand, wine selling price, etc.\n", "\n", "- fixed acidity\n", "- volatile acidity\n", "- citric acid\n", "- residual sugar\n", "- chlorides\n", "- free sulfur dioxide\n", "- total sulfur dioxide\n", "- density\n", "- pH\n", "- sulphates\n", "- alcohol\n", "- quality (score between 0 and 10)\n", "\n", "## What we're going to do :\n", "\n", " - (Retrieve data)\n", " - (Preparing the data)\n", " - (Build a model)\n", " - Train and save the model\n", " - Restore saved model\n", " - Evaluate the model\n", " - Make some predictions\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1 - Import and init\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "os.environ['KERAS_BACKEND'] = 'torch'\n", "\n", "import keras\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "import fidle\n", "\n", "# Init Fidle environment\n", "run_id, run_dir, datasets_dir = fidle.init('K3WINE1')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Verbosity during training : \n", "- 0 = silent\n", "- 1 = progress bar\n", "- 2 = one line per epoch" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fit_verbosity = 1\n", "dataset_name = 'winequality-red.csv'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Override parameters (batch mode) - Just forget this cell" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fidle.override('fit_verbosity', 'dataset_name')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2 - Retrieve data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv(f'{datasets_dir}/WineQuality/origine/{dataset_name}', header=0,sep=';')\n", "\n", "display(data.head(5).style.format(\"{0:.2f}\"))\n", "print('Missing Data : ',data.isna().sum().sum(), ' Shape is : ', data.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3 - Preparing the data\n", "### 3.1 - Split data\n", "We will use 80% of the data for training and 20% for validation. 
\n", "x will be the data of the analysis and y the quality" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ---- Split => train, test\n", "#\n", "data = data.sample(frac=1., axis=0) # Shuffle\n", "data_train = data.sample(frac=0.8, axis=0) # get 80 %\n", "data_test = data.drop(data_train.index) # test = all - train\n", "\n", "# ---- Split => x,y (medv is price)\n", "#\n", "x_train = data_train.drop('quality', axis=1)\n", "y_train = data_train['quality']\n", "x_test = data_test.drop('quality', axis=1)\n", "y_test = data_test['quality']\n", "\n", "print('Original data shape was : ',data.shape)\n", "print('x_train : ',x_train.shape, 'y_train : ',y_train.shape)\n", "print('x_test : ',x_test.shape, 'y_test : ',y_test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 - Data normalization\n", "**Note :** \n", " - All input data must be normalized, train and test. \n", " - To do this we will subtract the mean and divide by the standard deviation. \n", " - But test data should not be used in any way, even for normalization. \n", " - The mean and the standard deviation will therefore only be calculated with the train data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "display(x_train.describe().style.format(\"{0:.2f}\").set_caption(\"Before normalization :\"))\n", "\n", "mean = x_train.mean()\n", "std = x_train.std()\n", "x_train = (x_train - mean) / std\n", "x_test = (x_test - mean) / std\n", "\n", "display(x_train.describe().style.format(\"{0:.2f}\").set_caption(\"After normalization :\"))\n", "\n", "# Convert ou DataFrame to numpy array\n", "x_train, y_train = np.array(x_train), np.array(y_train)\n", "x_test, y_test = np.array(x_test), np.array(y_test)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4 - Build a model\n", "More informations about : \n", " - [Optimizer](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers)\n", " - [Activation](https://www.tensorflow.org/api_docs/python/tf/keras/activations)\n", " - [Loss](https://www.tensorflow.org/api_docs/python/tf/keras/losses)\n", " - [Metrics](https://www.tensorflow.org/api_docs/python/tf/keras/metrics)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_model_v1(shape):\n", " \n", " model = keras.models.Sequential()\n", " model.add(keras.layers.Input(shape, name=\"InputLayer\"))\n", " model.add(keras.layers.Dense(64, activation='relu', name='Dense_n1'))\n", " model.add(keras.layers.Dense(64, activation='relu', name='Dense_n2'))\n", " model.add(keras.layers.Dense(1, name='Output'))\n", "\n", " model.compile(optimizer = 'rmsprop',\n", " loss = 'mse',\n", " metrics = ['mae', 'mse'] )\n", " return model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5 - Train the model\n", "### 5.1 - Get it" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model=get_model_v1( (11,) )\n", "\n", "model.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.2 - Add callback" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "os.makedirs('./run/models', mode=0o750, exist_ok=True)\n", "save_dir = \"./run/models/best_model.keras\"\n", "\n", "savemodel_callback = keras.callbacks.ModelCheckpoint( filepath=save_dir, monitor='val_mae', mode='max', save_best_only=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": 
[ "### 5.3 - Train it" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "history = model.fit(x_train,\n", " y_train,\n", " epochs = 100,\n", " batch_size = 10,\n", " verbose = fit_verbosity,\n", " validation_data = (x_test, y_test),\n", " callbacks = [savemodel_callback])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 6 - Evaluate\n", "### 6.1 - Model evaluation\n", "MAE = Mean Absolute Error (between the labels and predictions) \n", "A mae equal to 3 represents an average error in prediction of $3k." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "score = model.evaluate(x_test, y_test, verbose=0)\n", "\n", "print('x_test / loss : {:5.4f}'.format(score[0]))\n", "print('x_test / mae : {:5.4f}'.format(score[1]))\n", "print('x_test / mse : {:5.4f}'.format(score[2]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.2 - Training history\n", "What was the best result during our training ?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"min( val_mae ) : {:.4f}\".format( min(history.history[\"val_mae\"]) ) )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fidle.scrawler.history( history, plot={'MSE' :['mse', 'val_mse'],\n", " 'MAE' :['mae', 'val_mae'],\n", " 'LOSS':['loss','val_loss']}, save_as='01-history')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 7 - Restore a model :" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 7.1 - Reload model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "loaded_model = keras.models.load_model('./run/models/best_model.keras')\n", "loaded_model.summary()\n", "print(\"Loaded.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 7.2 - Evaluate it :" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "score = loaded_model.evaluate(x_test, y_test, verbose=0)\n", "\n", "print('x_test / loss : {:5.4f}'.format(score[0]))\n", "print('x_test / mae : {:5.4f}'.format(score[1]))\n", "print('x_test / mse : {:5.4f}'.format(score[2]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 7.3 - Make a prediction" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ---- Pick n entries from our test set\n", "n = 200\n", "ii = np.random.randint(1,len(x_test),n)\n", "x_sample = x_test[ii]\n", "y_sample = y_test[ii]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ---- Make a predictions\n", "y_pred = loaded_model.predict( x_sample, verbose=2 )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ---- Show it\n", "print('Wine Prediction Real Delta')\n", "for i in range(n):\n", " pred = y_pred[i][0]\n", " real = y_sample[i]\n", " delta = real-pred\n", " print(f'{i:03d} {pred:.2f} {real} {delta:+.2f} ')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Few questions :\n", "- Can this model be used for red wines from Bordeaux and/or Beaujolais?\n", "- What are the limitations of this model?\n", "- What are the limitations of this dataset?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fidle.end()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "<img width=\"80px\" src=\"../fidle/img/logo-paysage.svg\"></img>" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.2 ('fidle-env')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" }, "vscode": { "interpreter": { "hash": "b3929042cc22c1274d74e3e946c52b845b57cb6d84f2d591ffe0519b38e4896d" } } }, "nbformat": 4, "nbformat_minor": 4 }