{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img width=\"800px\" src=\"../fidle/img/header.svg\"></img>\n",
"# <!-- TITLE --> [LOGR1] - Logistic regression\n",
"<!-- DESC --> Simple example of logistic regression with a sklearn solution\n",
"<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->\n",
" - A logistic regression has the objective of providing a probability of belonging to a class. \n",
" - Découvrir une implémentation 100% Tensorflow ..et apprendre à aimer Keras\n",
"\n",
"## What we're going to do :\n",
"\n",
"X contains characteristics \n",
"y contains the probability of membership (1 or 0) \n",
"\n",
"We'll look for a value of $\\theta$ such that the linear regression $\\theta^{T}X$ can be used to calculate our probability: \n",
"\n",
"$\\hat{p} = h_\\theta(X) = \\sigma(\\theta^T{X})$ \n",
"\n",
"Where $\\sigma$ is the logit function, typically a sigmoid (S) function: \n",
"\n",
"$\n",
"\\sigma(t) = \\dfrac{1}{1 + \\exp(-t)}\n",
"$ \n",
"\n",
"The predicted value $\\hat{y}$ will then be calculated as follows:\n",
"\n",
"$\n",
"\\hat{y} =\n",
"\\begin{cases}\n",
" 0 & \\text{if } \\hat{p} < 0.5 \\\\\n",
" 1 & \\text{if } \\hat{p} \\geq 0.5\n",
"\\end{cases}\n",
"$\n",
"\n",
"**Calculation of the cost of the regression:** \n",
"For a training observation x, the cost can be calculated as follows: \n",
"\n",
"$\n",
"c(\\theta) =\n",
"\\begin{cases}\n",
" -\\log(\\hat{p}) & \\text{if } y = 1 \\\\\n",
" -\\log(1 - \\hat{p}) & \\text{if } y = 0\n",
"\\end{cases}\n",
"$\n",
"\n",
"The regression cost function (log loss) over the whole training set can be written as follows: \n",
"\n",
"$\n",
"J(\\theta) = -\\dfrac{1}{m} \\sum_{i=1}^{m}{\\left[ y^{(i)} log\\left(\\hat{p}^{(i)}\\right) + (1 - y^{(i)}) log\\left(1 - \\hat{p}^{(i)}\\right)\\right]}\n",
"$\n",
"## Step 1 - Import and init\n",
"\n",
"You can also adjust the verbosity by changing the value of TF_CPP_MIN_LOG_LEVEL :\n",
"- 0 = all messages are logged (default)\n",
"- 1 = INFO messages are not printed.\n",
"- 2 = INFO and WARNING messages are not printed.\n",
"- 3 = INFO , WARNING and ERROR messages are not printed."
"# import os\n",
"# os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'\n",
"\n",
"import numpy as np\n",
"from sklearn import metrics\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"# Init Fidle environment\n",
"run_id, run_dir, datasets_dir = fidle.init('LOGR1')"
]
},
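{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, here is a minimal numpy sketch of the two formulas above, the sigmoid $\\sigma(t)$ and the log loss $J(\\theta)$. The `sigmoid` and `log_loss` helpers are purely illustrative and are not used elsewhere in this notebook."
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"# ---- Minimal sketch of the sigmoid and the log loss defined above\n",
"#\n",
"def sigmoid(t):\n",
"    '''Logistic function: sigma(t) = 1 / (1 + exp(-t))'''\n",
"    return 1 / (1 + np.exp(-t))\n",
"\n",
"def log_loss(y, p_hat, eps=1e-12):\n",
"    '''Mean log loss J for labels y and predicted probabilities p_hat'''\n",
"    p_hat = np.clip(p_hat, eps, 1 - eps)   # avoid log(0)\n",
"    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))\n",
"\n",
"t = np.array([-2., 0., 2.])\n",
"print('sigmoid(t) :', sigmoid(t))          # approx. [0.12 0.5 0.88]\n",
"print('log loss   :', log_loss(np.array([0, 1, 1]), sigmoid(t)))"
]
},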
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1 - Usefull stuff (hidden)"
"metadata": {
"jupyter": {
"source_hidden": true
"outputs": [],
"source": [
"def vector_infos(name,V):\n",
" '''Displaying some information about a vector'''\n",
" with np.printoptions(precision=4, suppress=True):\n",
" print(\"{:16} : ndim={} shape={:10} Mean = {} Std = {}\".format( name,V.ndim, str(V.shape), V.mean(axis=0), V.std(axis=0)))\n",
"\n",
" \n",
"def do_i_have_it(hours_of_work, hours_of_sleep):\n",
" '''Returns the exam result based on work and sleep hours'''\n",
" hours_of_sleep_min = 5\n",
" hours_of_work_min = 4\n",
" hours_of_game_max = 3\n",
" # ---- Have to sleep and work\n",
" if hours_of_sleep < hours_of_sleep_min: return 0\n",
" if hours_of_work < hours_of_work_min: return 0\n",
" # ---- Gameboy is not good for you\n",
" hours_of_game = 24 - 10 - hours_of_sleep - hours_of_work + random.gauss(0,0.4)\n",
" if hours_of_game > hours_of_game_max: return 0\n",
" # ---- Fine, you got it\n",
" return 1\n",
"\n",
"\n",
"def make_students_dataset(size, noise):\n",
" '''Fabrique un dataset pour <size> étudiants'''\n",
" x = []\n",
" y = []\n",
" for i in range(size):\n",
" w = random.gauss(5,1)\n",
" s = random.gauss(7,1.5)\n",
" r = do_i_have_it(w,s)\n",
" x.append([w,s])\n",
" y.append(r)\n",
" return (np.array(x), np.array(y))\n",
"\n",
"\n",
"def plot_data(x,y, colors=('green','red'), legend=True):\n",
" '''Affiche un dataset'''\n",
" fig, ax = plt.subplots(1, 1)\n",
" fig.set_size_inches(10,8)\n",
" ax.plot(x[y==1, 0], x[y==1, 1], 'o', color=colors[0], markersize=4, label=\"y=1 (positive)\")\n",
" ax.plot(x[y==0, 0], x[y==0, 1], 'o', color=colors[1], markersize=4, label=\"y=0 (negative)\")\n",
" if legend : ax.legend()\n",
" plt.tick_params(axis='both', which='both', bottom=False, left=False, labelbottom=False, labelleft=False)\n",
" plt.xlabel('Hours of work')\n",
" plt.ylabel('Hours of sleep')\n",
" plt.show()\n",
"\n",
"\n",
"def plot_results(x_test,y_test, y_pred):\n",
" '''Affiche un resultat'''\n",
"\n",
" precision = metrics.precision_score(y_test, y_pred)\n",
" recall = metrics.recall_score(y_test, y_pred)\n",
"\n",
" print(\"Accuracy = {:5.3f} Recall = {:5.3f}\".format(precision, recall))\n",
"\n",
" x_pred_positives = x_test[ y_pred == 1 ] # items prédits positifs\n",
" x_real_positives = x_test[ y_test == 1 ] # items réellement positifs\n",
" x_pred_negatives = x_test[ y_pred == 0 ] # items prédits négatifs\n",
" x_real_negatives = x_test[ y_test == 0 ] # items réellement négatifs\n",
"\n",
" fig, axs = plt.subplots(2, 2)\n",
" fig.subplots_adjust(wspace=.1,hspace=0.2)\n",
" fig.set_size_inches(14,10)\n",
" \n",
" axs[0,0].plot(x_pred_positives[:,0], x_pred_positives[:,1], 'o',color='lightgreen', markersize=10, label=\"Prédits positifs\")\n",
" axs[0,0].plot(x_real_positives[:,0], x_real_positives[:,1], 'o',color='green', markersize=4, label=\"Réels positifs\")\n",
" axs[0,0].legend()\n",
" axs[0,0].tick_params(axis='both', which='both', bottom=False, left=False, labelbottom=False, labelleft=False)\n",
" axs[0,0].set_xlabel('$x_1$')\n",
" axs[0,0].set_ylabel('$x_2$')\n",
"\n",
"\n",
" axs[0,1].plot(x_pred_negatives[:,0], x_pred_negatives[:,1], 'o',color='lightsalmon', markersize=10, label=\"Prédits négatifs\")\n",
" axs[0,1].plot(x_real_negatives[:,0], x_real_negatives[:,1], 'o',color='red', markersize=4, label=\"Réels négatifs\")\n",
" axs[0,1].legend()\n",
" axs[0,1].tick_params(axis='both', which='both', bottom=False, left=False, labelbottom=False, labelleft=False)\n",
" axs[0,1].set_xlabel('$x_1$')\n",
" axs[0,1].set_ylabel('$x_2$')\n",
" \n",
" axs[1,0].plot(x_pred_positives[:,0], x_pred_positives[:,1], 'o',color='lightgreen', markersize=10, label=\"Prédits positifs\")\n",
" axs[1,0].plot(x_pred_negatives[:,0], x_pred_negatives[:,1], 'o',color='lightsalmon', markersize=10, label=\"Prédits négatifs\")\n",
" axs[1,0].plot(x_real_positives[:,0], x_real_positives[:,1], 'o',color='green', markersize=4, label=\"Réels positifs\")\n",
" axs[1,0].plot(x_real_negatives[:,0], x_real_negatives[:,1], 'o',color='red', markersize=4, label=\"Réels négatifs\")\n",
" axs[1,0].tick_params(axis='both', which='both', bottom=False, left=False, labelbottom=False, labelleft=False)\n",
" axs[1,0].set_xlabel('$x_1$')\n",
" axs[1,0].set_ylabel('$x_2$')\n",
"\n",
" axs[1,1].pie([precision,1-precision], explode=[0,0.1], labels=[\"\",\"Errors\"], \n",
" autopct='%1.1f%%', shadow=False, startangle=70, colors=[\"lightsteelblue\",\"coral\"])\n",
" axs[1,1].axis('equal')\n",
"\n",
" plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2 - Parameters"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"data_size = 1000 # Number of observations\n",
"data_cols = 2 # observation size\n",
"data_noise = 0.2\n",
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2 - Data preparation\n",
"### 2.1 - Get some data\n",
"The data here are totally fabricated and represent the **examination results** (passed or failed) based on the students' **working** and **sleeping hours** . \n",
"X=(working hours, sleeping hours) y={result} where result=0 (failed) or 1 (passed)"
]
},
{
"cell_type": "code",
"x_data,y_data=make_students_dataset(data_size,data_noise)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2 - Show it"
]
},
{
"cell_type": "code",
"plot_data(x_data, y_data)\n",
"vector_infos('Dataset X',x_data)\n",
"vector_infos('Dataset y',y_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3 - Preparation of data\n",
"\n",
"We're going to:\n",
"- split the data to have : :\n",
" - a training set\n",
" - a test set\n",
"- normalize the data"
"# ---- Split data\n",
"\n",
"n = int(data_size * 0.8)\n",
"x_train = x_data[:n]\n",
"y_train = y_data[:n]\n",
"x_test = x_data[n:]\n",
"y_test = y_data[n:]\n",
"mean = np.mean(x_train, axis=0)\n",
"std = np.std(x_train, axis=0)\n",
"x_train = (x_train-mean)/std\n",
"x_test = (x_test-mean)/std\n",
"# ---- About it\n",
"\n",
"vector_infos('X_train',x_train)\n",
"vector_infos('X_test',x_test)\n",
"vector_infos('y_test',y_test)\n",
"\n",
"y_train_h = y_train.reshape(-1,) # nécessaire pour la visu."
]
},
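{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: the split above is done by hand to keep things explicit. An equivalent sketch with scikit-learn's `train_test_split` (which also shuffles) could look like this; the `*_2` names are just for illustration and are not used below."
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"# ---- Equivalent split with scikit-learn (sketch, not used below)\n",
"#\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"x_train_2, x_test_2, y_train_2, y_test_2 = train_test_split(x_data, y_data, test_size=0.2, shuffle=True, random_state=42)\n",
"\n",
"print('x_train_2 :', x_train_2.shape, '   x_test_2 :', x_test_2.shape)"
]
},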
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.4 - Have a look"
]
},
{
"cell_type": "code",
"fidle.utils.display_md('**This is what we know :**')\n",
"plot_data(x_train, y_train)\n",
"fidle.utils.display_md('**This is what we want to classify :**')\n",
"plot_data(x_test, y_test, colors=(\"gray\",\"gray\"), legend=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3 - Logistic model #1\n",
"### 3.1 - Here is the classifier"
"# ---- Create an instance\n",
"# Use SAGA solver (Stochastic Average Gradient descent solver)\n",
"#\n",
"logreg = LogisticRegression(C=1e5, verbose=0, solver='saga')\n",
"# ---- Fit the data.\n",
"#\n",
"logreg.fit(x_train, y_train)\n",
"# ---- Do a prediction\n",
"#\n",
"y_pred = logreg.predict(x_test)"
]
},
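{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.2 - Link with the math above\n",
"\n",
"The fitted model exposes the learned parameters: `logreg.coef_` plays the role of $\\theta$ (one weight per feature) and `logreg.intercept_` is the bias. As a small sketch, we can recompute $\\hat{p} = \\sigma(\\theta^T X)$ by hand and check it against `predict_proba`:"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"# ---- Recompute sigma(theta^T x) by hand and compare with predict_proba (sketch)\n",
"#\n",
"theta = logreg.coef_[0]          # learned weights\n",
"b     = logreg.intercept_[0]     # learned bias\n",
"\n",
"p_hand  = 1 / (1 + np.exp(-(x_test @ theta + b)))\n",
"p_model = logreg.predict_proba(x_test)[:, 1]\n",
"\n",
"print('max difference :', np.abs(p_hand - p_model).max())   # should be ~0"
]
},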
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### 3.3 - Evaluation\n",
"\n",
"Accuracy = Ability to avoid false positives = $\\frac{Tp}{Tp+Fp}$ \n",
"Recall = Ability to find the right positives = $\\frac{Tp}{Tp+Fn}$ \n",
"Avec : \n",
"$T_p$ (true positive) Correct positive answer \n",
"$F_p$ (false positive) False positive answer \n",
"$T_n$ (true negative) Correct negative answer \n",
"$F_n$ (false negative) Wrong negative answer "
]
},
{
"cell_type": "code",
"plot_results(x_test,y_test, y_pred)"
]
},
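{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch, precision and recall can also be recomputed directly from the definitions above and compared with `sklearn.metrics`:"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"# ---- Recompute precision and recall from Tp, Fp, Fn (sketch)\n",
"#\n",
"Tp = np.sum((y_pred == 1) & (y_test == 1))   # true positives\n",
"Fp = np.sum((y_pred == 1) & (y_test == 0))   # false positives\n",
"Fn = np.sum((y_pred == 0) & (y_test == 1))   # false negatives\n",
"\n",
"print('precision = {:.3f}   (sklearn: {:.3f})'.format(Tp/(Tp+Fp), metrics.precision_score(y_test, y_pred)))\n",
"print('recall    = {:.3f}   (sklearn: {:.3f})'.format(Tp/(Tp+Fn), metrics.recall_score(y_test, y_pred)))"
]
},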
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4 - Bending the space to a model #2 ;-)\n",
"\n",
"We're going to increase the characteristics of our observations, with : ${x_1}^2$, ${x_2}^2$, ${x_1}^3$ et ${x_2}^3$ \n",
"\n",
"$\n",
"X=\n",
"\\begin{bmatrix}1 & x_{11} & x_{12} \\\\\n",
"\\vdots & \\dots\\\\\n",
"1 & x_{m1} & x_{m2} \\end{bmatrix}\n",
"\\text{et }\n",
"X_{ng}=\\begin{bmatrix}1 & x_{11} & x_{12} & x_{11}^2 & x_{12}^2& x_{11}^3 & x_{12}^3 \\\\\n",
"\\vdots & & & \\dots \\\\\n",
"1 & x_{m1} & x_{m2} & x_{m1}^2 & x_{m2}^2& x_{m1}^3 & x_{m2}^3 \\end{bmatrix}\n",
"$\n",
"\n",
"Note : `sklearn.preprocessing.PolynomialFeatures` can do that for us, but we'll do it ourselves:\n",
"### 4.1 - Extend data"
]
},
{
"cell_type": "code",
"x_train_enhanced = np.c_[x_train,\n",
" x_train[:, 0] ** 2,\n",
" x_train[:, 1] ** 2,\n",
" x_train[:, 0] ** 3,\n",
" x_train[:, 1] ** 3]\n",
"x_test_enhanced = np.c_[x_test,\n",
" x_test[:, 0] ** 2,\n",
" x_test[:, 1] ** 2,\n",
" x_test[:, 0] ** 3,\n",
" x_test[:, 1] ** 3]\n"
]
},
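{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is the promised sketch with `sklearn.preprocessing.PolynomialFeatures`. Note that with `degree=3` it also generates the cross terms ($x_1 x_2$, $x_1^2 x_2$, ...), i.e. a superset of our hand-built features; `x_train_poly` is illustrative only and not used below."
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"# ---- Same expansion with PolynomialFeatures (sketch; adds cross terms too)\n",
"#\n",
"from sklearn.preprocessing import PolynomialFeatures\n",
"\n",
"poly = PolynomialFeatures(degree=3, include_bias=False)\n",
"x_train_poly = poly.fit_transform(x_train)\n",
"\n",
"print('hand-built features :', x_train_enhanced.shape)    # (n, 6)\n",
"print('PolynomialFeatures  :', x_train_poly.shape)        # (n, 9)"
]
},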
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ---- Create an instance\n",
"# Use SAGA solver (Stochastic Average Gradient descent solver)\n",
"#\n",
"logreg = LogisticRegression(C=1e5, verbose=0, solver='saga', max_iter=5000, n_jobs=-1)\n",
"# ---- Fit the data.\n",
"#\n",
"logreg.fit(x_train_enhanced, y_train)\n",
"# ---- Do a prediction\n",
"#\n",
"y_pred = logreg.predict(x_test_enhanced)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"plot_results(x_test_enhanced, y_test, y_pred)"
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"<img width=\"80px\" src=\"../fidle/img/logo-paysage.svg\"></img>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.9.2 ('fidle-env')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.2"
},
"vscode": {
"interpreter": {
"hash": "b3929042cc22c1274d74e3e946c52b845b57cb6d84f2d591ffe0519b38e4896d"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}