{ "cells": [ { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "<img width=\"800px\" src=\"../fidle/img/header.svg\"></img>\n", "\n", "\n", "# <!-- TITLE --> [K3BHPD1] - Regression with a Dense Network (DNN)\n", "<!-- DESC --> Simple example of a regression with the Boston Housing Prices Dataset (BHPD)\n", "<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->\n", "\n", "## Objectives :\n", " - Predict **housing prices** from a set of house features. \n", " - Understand the **principle** and the **architecture** of a regression with a **dense neural network** \n", "\n", "\n", "The **[Boston Housing Prices Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html)** consists of the prices of houses in various places in Boston. \n", "Alongside the price, the dataset also provides the following information : \n", "\n", " - CRIM: This is the per capita crime rate by town\n", " - ZN: This is the proportion of residential land zoned for lots larger than 25,000 sq.ft\n", " - INDUS: This is the proportion of non-retail business acres per town\n", " - CHAS: This is the Charles River dummy variable (this is equal to 1 if tract bounds river; 0 otherwise)\n", " - NOX: This is the nitric oxides concentration (parts per 10 million)\n", " - RM: This is the average number of rooms per dwelling\n", " - AGE: This is the proportion of owner-occupied units built prior to 1940\n", " - DIS: This is the weighted distances to five Boston employment centers\n", " - RAD: This is the index of accessibility to radial highways\n", " - TAX: This is the full-value property-tax rate per 10,000 dollars\n", " - PTRATIO: This is the pupil-teacher ratio by town\n", " - B: This is calculated as 1000(Bk - 0.63)^2, where Bk is the proportion of people of African American descent by town\n", " - LSTAT: This is the percentage of the population considered lower status\n", " - MEDV: This is the median value of owner-occupied homes in 1000 dollars\n", " \n", "## What we're going to do :\n", 
"\n", " - Retrieve data\n", " - Preparing the data\n", " - Build a model\n", " - Train the model\n", " - Evaluate the result\n" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Step 1 - Import and init\n", "\n", "You can also adjust the verbosity by changing the value of TF_CPP_MIN_LOG_LEVEL :\n", "- 0 = all messages are logged (default)\n", "- 1 = INFO messages are not printed.\n", "- 2 = INFO and WARNING messages are not printed.\n", "- 3 = INFO , WARNING and ERROR messages are not printed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "os.environ['KERAS_BACKEND'] = 'torch'\n", "\n", "import keras\n", "\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import os,sys\n", "\n", "import fidle\n", "\n", "# Init Fidle environment\n", "run_id, run_dir, datasets_dir = fidle.init('K3BHPD1')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Verbosity during training : \n", "- 0 = silent\n", "- 1 = progress bar\n", "- 2 = one line per epoch" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fit_verbosity = 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Override parameters (batch mode) - Just forget this cell" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fidle.override('fit_verbosity')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2 - Retrieve data\n", "\n", "### 2.1 - Option 1 : From Keras\n", "Boston housing is a famous historic dataset, so we can get it directly from [Keras datasets](https://keras.io/api/datasets) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# (x_train, y_train), (x_test, y_test) = keras.datasets.boston_housing.load_data(test_split=0.2, seed=113)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 - 
Option 2 : From a CSV file\n", "More fun !" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv(f'{datasets_dir}/BHPD/origine/BostonHousing.csv', header=0)\n", "\n", "display(data.head(5).style.format(\"{0:.2f}\").set_caption(\"A few lines of the dataset :\"))\n", "print('Missing Data : ',data.isna().sum().sum(), ' Shape is : ', data.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3 - Preparing the data\n", "### 3.1 - Split data\n", "We will use 70% of the data for training and 30% for testing (the test set is also used as validation data during training). \n", "The dataset is **shuffled** and split between **training** and **testing**. \n", "x will be the input data and y the expected output" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ---- Shuffle and Split => train, test\n", "#\n", "data = data.sample(frac=1., axis=0)\n", "data_train = data.sample(frac=0.7, axis=0)\n", "data_test = data.drop(data_train.index)\n", "\n", "# ---- Split => x,y (medv is price)\n", "#\n", "x_train = data_train.drop('medv', axis=1)\n", "y_train = data_train['medv']\n", "x_test = data_test.drop('medv', axis=1)\n", "y_test = data_test['medv']\n", "\n", "print('Original data shape was : ',data.shape)\n", "print('x_train : ',x_train.shape, 'y_train : ',y_train.shape)\n", "print('x_test : ',x_test.shape, 'y_test : ',y_test.shape)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "### 3.2 - Data normalization\n", "**Note :** \n", " - All input data must be normalized, both train and test. \n", " - To do this we will **subtract the mean** and **divide by the standard deviation**. \n", " - But test data should not be used in any way, even for normalization. \n", " - The mean and the standard deviation will therefore be calculated from the train data only." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "display(x_train.describe().style.format(\"{0:.2f}\").set_caption(\"Before normalization :\"))\n", "\n", "mean = x_train.mean()\n", "std = x_train.std()\n", "x_train = (x_train - mean) / std\n", "x_test = (x_test - mean) / std\n", "\n", "display(x_train.describe().style.format(\"{0:.2f}\").set_caption(\"After normalization :\"))\n", "display(x_train.head(5).style.format(\"{0:.2f}\").set_caption(\"A few lines of the dataset :\"))\n", "\n", "x_train, y_train = np.array(x_train), np.array(y_train)\n", "x_test, y_test = np.array(x_test), np.array(y_test)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4 - Build a model\n", "More information about : \n", " - [Optimizer](https://keras.io/api/optimizers)\n", " - [Activation](https://keras.io/api/layers/activations)\n", " - [Loss](https://keras.io/api/losses)\n", " - [Metrics](https://keras.io/api/metrics)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_model_v1(shape):\n", "\n", "    model = keras.models.Sequential()\n", "    model.add(keras.layers.Input(shape, name=\"InputLayer\"))\n", "    model.add(keras.layers.Dense(32, activation='relu', name='Dense_n1'))\n", "    model.add(keras.layers.Dense(64, activation='relu', name='Dense_n2'))\n", "    model.add(keras.layers.Dense(32, activation='relu', name='Dense_n3'))\n", "    model.add(keras.layers.Dense(1, name='Output'))\n", "\n", "    model.compile(optimizer = 'adam',\n", "                  loss = 'mse',\n", "                  metrics = ['mae', 'mse'] )\n", "    return model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 5 - Train the model\n", "### 5.1 - Get it" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model=get_model_v1( (13,) )\n", "\n", "model.summary()\n", "\n", "# img=keras.utils.plot_model( model, to_file='./run/model.png', show_shapes=True, show_layer_names=True, 
dpi=96)\n", "# display(img)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.2 - Train it" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "history = model.fit(x_train,\n", "                    y_train,\n", "                    epochs = 60,\n", "                    batch_size = 10,\n", "                    verbose = fit_verbosity,\n", "                    validation_data = (x_test, y_test))" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Step 6 - Evaluate\n", "### 6.1 - Model evaluation\n", "MAE = Mean Absolute Error (between the labels and predictions) \n", "An MAE of 3 corresponds to an average prediction error of $3k, since prices are expressed in thousands of dollars." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "score = model.evaluate(x_test, y_test, verbose=0)\n", "\n", "print('x_test / loss : {:5.4f}'.format(score[0]))\n", "print('x_test / mae : {:5.4f}'.format(score[1]))\n", "print('x_test / mse : {:5.4f}'.format(score[2]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.2 - Training history\n", "What was the best result during our training ?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df=pd.DataFrame(data=history.history)\n", "display(df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"min( val_mae ) : {:.4f}\".format( min(history.history[\"val_mae\"]) ) )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fidle.scrawler.history( history, plot={'MSE' :['mse', 'val_mse'],\n", "                                       'MAE' :['mae', 'val_mae'],\n", "                                       'LOSS':['loss','val_loss']}, save_as='01-history')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 7 - Make a prediction\n", "The data must be normalized with the parameters (mean, std) previously used." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_data = [ 1.26425925, -0.48522739, 1.0436489 , -0.23112788, 1.37120745,\n", " -2.14308942, 1.13489104, -1.06802005, 1.71189006, 1.57042287,\n", " 0.77859951, 0.14769795, 2.7585581 ]\n", "real_price = 10.4\n", "\n", "my_data=np.array(my_data).reshape(1,13)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "predictions = model.predict( my_data, verbose=fit_verbosity )\n", "print(\"Prediction : {:.2f} K$\".format(predictions[0][0]))\n", "print(\"Reality : {:.2f} K$\".format(real_price))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fidle.end()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "<img width=\"80px\" src=\"../fidle/img/logo-paysage.svg\"></img>" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.2 ('fidle-env')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" }, "vscode": { "interpreter": { "hash": "b3929042cc22c1274d74e3e946c52b845b57cb6d84f2d591ffe0519b38e4896d" } } }, "nbformat": 4, "nbformat_minor": 4 }