{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "<img width=\"800px\" src=\"../fidle/img/00-Fidle-header-01.svg\"></img>\n", "\n", "# <!-- TITLE --> [LOGR1] - Logistic regression\n", "<!-- DESC --> Simple example of logistic regression with a sklearn solution\n", "<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->\n", "\n", "## Objectives :\n", " - A logistic regression has the objective of providing a probability of belonging to a class. \n", " - Découvrir une implémentation 100% Tensorflow ..et apprendre à aimer Keras\n", "\n", "## What we're going to do :\n", "\n", "X contains characteristics \n", "y contains the probability of membership (1 or 0) \n", "\n", "We'll look for a value of $\\theta$ such that the linear regression $\\theta^{T}X$ can be used to calculate our probability: \n", "\n", "$\\hat{p} = h_\\theta(X) = \\sigma(\\theta^T{X})$ \n", "\n", "Where $\\sigma$ is the logit function, typically a sigmoid (S) function: \n", "\n", "$\n", "\\sigma(t) = \\dfrac{1}{1 + \\exp(-t)}\n", "$ \n", "\n", "The predicted value $\\hat{y}$ will then be calculated as follows:\n", "\n", "$\n", "\\hat{y} =\n", "\\begin{cases}\n", " 0 & \\text{if } \\hat{p} < 0.5 \\\\\n", " 1 & \\text{if } \\hat{p} \\geq 0.5\n", "\\end{cases}\n", "$\n", "\n", "**Calculation of the cost of the regression:** \n", "For a training observation x, the cost can be calculated as follows: \n", "\n", "$\n", "c(\\theta) =\n", "\\begin{cases}\n", " -\\log(\\hat{p}) & \\text{if } y = 1 \\\\\n", " -\\log(1 - \\hat{p}) & \\text{if } y = 0\n", "\\end{cases}\n", "$\n", "\n", "The regression cost function (log loss) over the whole training set can be written as follows: \n", "\n", "$\n", "J(\\theta) = -\\dfrac{1}{m} \\sum_{i=1}^{m}{\\left[ y^{(i)} log\\left(\\hat{p}^{(i)}\\right) + (1 - y^{(i)}) log\\left(1 - \\hat{p}^{(i)}\\right)\\right]}\n", "$\n", "## Step 1 - Import and init\n", "\n", "You can also adjust the verbosity by changing the value of TF_CPP_MIN_LOG_LEVEL :\n", "- 0 = all messages are logged (default)\n", "- 1 = INFO messages are not printed.\n", "- 2 = INFO and WARNING messages are not printed.\n", "- 3 = INFO , WARNING and ERROR messages are not printed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# import os\n", "# os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'\n", "\n", "import numpy as np\n", "from sklearn import metrics\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "import matplotlib\n", "import matplotlib.pyplot as plt\n", "# import math\n", "import random\n", "# import os\n", "import sys\n", "\n", "import fidle\n", "\n", "# Init Fidle environment\n", "run_id, run_dir, datasets_dir = fidle.init('LOGR1')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 - Usefull stuff (hidden)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "source_hidden": true }, "tags": [] }, "outputs": [], "source": [ "def vector_infos(name,V):\n", " '''Displaying some information about a vector'''\n", " with np.printoptions(precision=4, suppress=True):\n", " print(\"{:16} : ndim={} shape={:10} Mean = {} Std = {}\".format( name,V.ndim, str(V.shape), V.mean(axis=0), V.std(axis=0)))\n", "\n", " \n", "def do_i_have_it(hours_of_work, hours_of_sleep):\n", " '''Returns the exam result based on work and sleep hours'''\n", " hours_of_sleep_min = 5\n", " hours_of_work_min = 4\n", " hours_of_game_max = 3\n", " # ---- Have to sleep and work\n", " if hours_of_sleep < hours_of_sleep_min: return 0\n", " if hours_of_work < hours_of_work_min: return 0\n", " # ---- Gameboy is not good for you\n", " hours_of_game = 24 - 10 - hours_of_sleep - hours_of_work + random.gauss(0,0.4)\n", " if hours_of_game > hours_of_game_max: return 0\n", " # ---- Fine, you got it\n", " return 1\n", "\n", "\n", "def make_students_dataset(size, noise):\n", " '''Fabrique un dataset pour <size> étudiants'''\n", " x = []\n", " y = []\n", " for i in range(size):\n", " w = random.gauss(5,1)\n", " s = random.gauss(7,1.5)\n", " r = do_i_have_it(w,s)\n", " x.append([w,s])\n", " y.append(r)\n", " return (np.array(x), np.array(y))\n", "\n", "\n", "def plot_data(x,y, colors=('green','red'), legend=True):\n", " '''Affiche un dataset'''\n", " fig, ax = plt.subplots(1, 1)\n", " fig.set_size_inches(10,8)\n", " ax.plot(x[y==1, 0], x[y==1, 1], 'o', color=colors[0], markersize=4, label=\"y=1 (positive)\")\n", " ax.plot(x[y==0, 0], x[y==0, 1], 'o', color=colors[1], markersize=4, label=\"y=0 (negative)\")\n", " if legend : ax.legend()\n", " plt.tick_params(axis='both', which='both', bottom=False, left=False, labelbottom=False, labelleft=False)\n", " plt.xlabel('Hours of work')\n", " plt.ylabel('Hours of sleep')\n", " plt.show()\n", "\n", "\n", "def plot_results(x_test,y_test, y_pred):\n", " '''Affiche un resultat'''\n", "\n", " precision = metrics.precision_score(y_test, y_pred)\n", " recall = metrics.recall_score(y_test, y_pred)\n", "\n", " print(\"Accuracy = {:5.3f} Recall = {:5.3f}\".format(precision, recall))\n", "\n", " x_pred_positives = x_test[ y_pred == 1 ] # items prédits positifs\n", " x_real_positives = x_test[ y_test == 1 ] # items réellement positifs\n", " x_pred_negatives = x_test[ y_pred == 0 ] # items prédits négatifs\n", " x_real_negatives = x_test[ y_test == 0 ] # items réellement négatifs\n", "\n", " fig, axs = plt.subplots(2, 2)\n", " fig.subplots_adjust(wspace=.1,hspace=0.2)\n", " fig.set_size_inches(14,10)\n", " \n", " axs[0,0].plot(x_pred_positives[:,0], x_pred_positives[:,1], 'o',color='lightgreen', markersize=10, label=\"Prédits positifs\")\n", " axs[0,0].plot(x_real_positives[:,0], x_real_positives[:,1], 'o',color='green', markersize=4, label=\"Réels positifs\")\n", " axs[0,0].legend()\n", " axs[0,0].tick_params(axis='both', which='both', bottom=False, left=False, labelbottom=False, labelleft=False)\n", " axs[0,0].set_xlabel('$x_1$')\n", " axs[0,0].set_ylabel('$x_2$')\n", "\n", "\n", " axs[0,1].plot(x_pred_negatives[:,0], x_pred_negatives[:,1], 'o',color='lightsalmon', markersize=10, label=\"Prédits négatifs\")\n", " axs[0,1].plot(x_real_negatives[:,0], x_real_negatives[:,1], 'o',color='red', markersize=4, label=\"Réels négatifs\")\n", " axs[0,1].legend()\n", " axs[0,1].tick_params(axis='both', which='both', bottom=False, left=False, labelbottom=False, labelleft=False)\n", " axs[0,1].set_xlabel('$x_1$')\n", " axs[0,1].set_ylabel('$x_2$')\n", " \n", " axs[1,0].plot(x_pred_positives[:,0], x_pred_positives[:,1], 'o',color='lightgreen', markersize=10, label=\"Prédits positifs\")\n", " axs[1,0].plot(x_pred_negatives[:,0], x_pred_negatives[:,1], 'o',color='lightsalmon', markersize=10, label=\"Prédits négatifs\")\n", " axs[1,0].plot(x_real_positives[:,0], x_real_positives[:,1], 'o',color='green', markersize=4, label=\"Réels positifs\")\n", " axs[1,0].plot(x_real_negatives[:,0], x_real_negatives[:,1], 'o',color='red', markersize=4, label=\"Réels négatifs\")\n", " axs[1,0].tick_params(axis='both', which='both', bottom=False, left=False, labelbottom=False, labelleft=False)\n", " axs[1,0].set_xlabel('$x_1$')\n", " axs[1,0].set_ylabel('$x_2$')\n", "\n", " axs[1,1].pie([precision,1-precision], explode=[0,0.1], labels=[\"\",\"Errors\"], \n", " autopct='%1.1f%%', shadow=False, startangle=70, colors=[\"lightsteelblue\",\"coral\"])\n", " axs[1,1].axis('equal')\n", "\n", " plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 - Parameters" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_size = 1000 # Number of observations\n", "data_cols = 2 # observation size\n", "data_noise = 0.2\n", "random_seed = 123" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2 - Data preparation\n", "### 2.1 - Get some data\n", "The data here are totally fabricated and represent the **examination results** (passed or failed) based on the students' **working** and **sleeping hours** . \n", "X=(working hours, sleeping hours) y={result} where result=0 (failed) or 1 (passed)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x_data,y_data=make_students_dataset(data_size,data_noise)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 - Show it" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_data(x_data, y_data)\n", "vector_infos('Dataset X',x_data)\n", "vector_infos('Dataset y',y_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3 - Preparation of data\n", "\n", "We're going to:\n", "- split the data to have : :\n", " - a training set\n", " - a test set\n", "- normalize the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ---- Split data\n", "\n", "n = int(data_size * 0.8)\n", "x_train = x_data[:n]\n", "y_train = y_data[:n]\n", "x_test = x_data[n:]\n", "y_test = y_data[n:]\n", "\n", "# ---- Normalization\n", "\n", "mean = np.mean(x_train, axis=0)\n", "std = np.std(x_train, axis=0)\n", "\n", "x_train = (x_train-mean)/std\n", "x_test = (x_test-mean)/std\n", "\n", "# ---- About it\n", "\n", "vector_infos('X_train',x_train)\n", "vector_infos('y_train',y_train)\n", "vector_infos('X_test',x_test)\n", "vector_infos('y_test',y_test)\n", "\n", "y_train_h = y_train.reshape(-1,) # nécessaire pour la visu." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4 - Have a look" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fidle.utils.display_md('**This is what we know :**')\n", "plot_data(x_train, y_train)\n", "fidle.utils.display_md('**This is what we want to classify :**')\n", "plot_data(x_test, y_test, colors=(\"gray\",\"gray\"), legend=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3 - Logistic model #1\n", "### 3.1 - Here is the classifier" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ---- Create an instance\n", "# Use SAGA solver (Stochastic Average Gradient descent solver)\n", "#\n", "logreg = LogisticRegression(C=1e5, verbose=0, solver='saga')\n", "\n", "# ---- Fit the data.\n", "#\n", "logreg.fit(x_train, y_train)\n", "\n", "# ---- Do a prediction\n", "#\n", "y_pred = logreg.predict(x_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### 3.3 - Evaluation\n", "\n", "Accuracy = Ability to avoid false positives = $\\frac{Tp}{Tp+Fp}$ \n", "Recall = Ability to find the right positives = $\\frac{Tp}{Tp+Fn}$ \n", "Avec : \n", "$T_p$ (true positive) Correct positive answer \n", "$F_p$ (false positive) False positive answer \n", "$T_n$ (true negative) Correct negative answer \n", "$F_n$ (false negative) Wrong negative answer " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_results(x_test,y_test, y_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4 - Bending the space to a model #2 ;-)\n", "\n", "We're going to increase the characteristics of our observations, with : ${x_1}^2$, ${x_2}^2$, ${x_1}^3$ et ${x_2}^3$ \n", "\n", "$\n", "X=\n", "\\begin{bmatrix}1 & x_{11} & x_{12} \\\\\n", "\\vdots & \\dots\\\\\n", "1 & x_{m1} & x_{m2} \\end{bmatrix}\n", "\\text{et }\n", "X_{ng}=\\begin{bmatrix}1 & x_{11} & x_{12} & x_{11}^2 & x_{12}^2& x_{11}^3 & x_{12}^3 \\\\\n", "\\vdots & & & \\dots \\\\\n", "1 & x_{m1} & x_{m2} & x_{m1}^2 & x_{m2}^2& x_{m1}^3 & x_{m2}^3 \\end{bmatrix}\n", "$\n", "\n", "Note : `sklearn.preprocessing.PolynomialFeatures` can do that for us, but we'll do it ourselves:\n", "### 4.1 - Extend data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x_train_enhanced = np.c_[x_train,\n", " x_train[:, 0] ** 2,\n", " x_train[:, 1] ** 2,\n", " x_train[:, 0] ** 3,\n", " x_train[:, 1] ** 3]\n", "x_test_enhanced = np.c_[x_test,\n", " x_test[:, 0] ** 2,\n", " x_test[:, 1] ** 2,\n", " x_test[:, 0] ** 3,\n", " x_test[:, 1] ** 3]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.2 - Run the classifier" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ---- Create an instance\n", "# Use SAGA solver (Stochastic Average Gradient descent solver)\n", "#\n", "logreg = LogisticRegression(C=1e5, verbose=0, solver='saga', max_iter=5000, n_jobs=-1)\n", "\n", "# ---- Fit the data.\n", "#\n", "logreg.fit(x_train_enhanced, y_train)\n", "\n", "# ---- Do a prediction\n", "#\n", "y_pred = logreg.predict(x_test_enhanced)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.3 - Evaluation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_results(x_test_enhanced, y_test, y_pred)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fidle.end()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "<img width=\"80px\" src=\"../fidle/img/00-Fidle-logo-01.svg\"></img>" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.2 ('fidle-env')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" }, "vscode": { "interpreter": { "hash": "b3929042cc22c1274d74e3e946c52b845b57cb6d84f2d591ffe0519b38e4896d" } } }, "nbformat": 4, "nbformat_minor": 4 }