{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img width=\"800px\" src=\"../fidle/img/header.svg\"></img>\n",
"# <!-- TITLE --> [LOGR1] - Logistic regression\n",
"<!-- DESC --> Simple example of logistic regression with a sklearn solution\n",
"<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->\n",
" - A logistic regression has the objective of providing a probability of belonging to a class. \n",
" - Découvrir une implémentation 100% Tensorflow ..et apprendre à aimer Keras\n",
"\n",
"## What we're going to do :\n",
"\n",
"X contains characteristics \n",
"y contains the probability of membership (1 or 0) \n",
"\n",
"We'll look for a value of $\\theta$ such that the linear regression $\\theta^{T}X$ can be used to calculate our probability: \n",
"\n",
"$\\hat{p} = h_\\theta(X) = \\sigma(\\theta^T{X})$ \n",
"\n",
"Where $\\sigma$ is the logit function, typically a sigmoid (S) function: \n",
"\n",
"$\n",
"\\sigma(t) = \\dfrac{1}{1 + \\exp(-t)}\n",
"$ \n",
"\n",
"The predicted value $\\hat{y}$ will then be calculated as follows:\n",
"\n",
"$\n",
"\\hat{y} =\n",
"\\begin{cases}\n",
" 0 & \\text{if } \\hat{p} < 0.5 \\\\\n",
" 1 & \\text{if } \\hat{p} \\geq 0.5\n",
"\\end{cases}\n",
"$\n",
"\n",
"**Calculation of the cost of the regression:** \n",
"For a training observation x, the cost can be calculated as follows: \n",
"\n",
"$\n",
"c(\\theta) =\n",
"\\begin{cases}\n",
" -\\log(\\hat{p}) & \\text{if } y = 1 \\\\\n",
" -\\log(1 - \\hat{p}) & \\text{if } y = 0\n",
"\\end{cases}\n",
"$\n",
"\n",
"The regression cost function (log loss) over the whole training set can be written as follows: \n",
"\n",
"$\n",
"J(\\theta) = -\\dfrac{1}{m} \\sum_{i=1}^{m}{\\left[ y^{(i)} log\\left(\\hat{p}^{(i)}\\right) + (1 - y^{(i)}) log\\left(1 - \\hat{p}^{(i)}\\right)\\right]}\n",
"$\n",
"## Step 1 - Import and init\n",
"\n",
"You can also adjust the verbosity by changing the value of TF_CPP_MIN_LOG_LEVEL :\n",
"- 0 = all messages are logged (default)\n",
"- 1 = INFO messages are not printed.\n",
"- 2 = INFO and WARNING messages are not printed.\n",
"- 3 = INFO , WARNING and ERROR messages are not printed."
"# import os\n",
"# os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'\n",
"\n",
"import numpy as np\n",
"from sklearn import metrics\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"# Init Fidle environment\n",
"run_id, run_dir, datasets_dir = fidle.init('LOGR1')"
]
},
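{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, here is a minimal numpy sketch of the two formulas above, the sigmoid $\\sigma(t)$ and the log loss $J(\\theta)$. The `sigmoid` and `log_loss` helpers are purely illustrative and are not used elsewhere in this notebook."
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"# ---- Minimal sketch of the sigmoid and the log loss defined above\n",
"#\n",
"def sigmoid(t):\n",
"    '''Logistic function: sigma(t) = 1 / (1 + exp(-t))'''\n",
"    return 1 / (1 + np.exp(-t))\n",
"\n",
"def log_loss(y, p_hat, eps=1e-12):\n",
"    '''Mean log loss J for labels y and predicted probabilities p_hat'''\n",
"    p_hat = np.clip(p_hat, eps, 1 - eps)   # avoid log(0)\n",
"    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))\n",
"\n",
"t = np.array([-2., 0., 2.])\n",
"print('sigmoid(t) :', sigmoid(t))          # approx. [0.12 0.5 0.88]\n",
"print('log loss   :', log_loss(np.array([0, 1, 1]), sigmoid(t)))"
]
},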
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1 - Usefull stuff (hidden)"
"metadata": {
"jupyter": {
"source_hidden": true
"outputs": [],
"source": [
"def vector_infos(name,V):\n",
" '''Displaying some information about a vector'''\n",
" with np.printoptions(precision=4, suppress=True):\n",
" print(\"{:16} : ndim={} shape={:10} Mean = {} Std = {}\".format( name,V.ndim, str(V.shape), V.mean(axis=0), V.std(axis=0)))\n",
"\n",
" \n",
"def do_i_have_it(hours_of_work, hours_of_sleep):\n",
" '''Returns the exam result based on work and sleep hours'''\n",
" hours_of_sleep_min = 5\n",
" hours_of_work_min = 4\n",
" hours_of_game_max = 3\n",
" # ---- Have to sleep and work\n",
" if hours_of_sleep < hours_of_sleep_min: return 0\n",
" if hours_of_work < hours_of_work_min: return 0\n",
" # ---- Gameboy is not good for you\n",
" hours_of_game = 24 - 10 - hours_of_sleep - hours_of_work + random.gauss(0,0.4)\n",
" if hours_of_game > hours_of_game_max: return 0\n",
" # ---- Fine, you got it\n",
" return 1\n",
"\n",
"\n",
"def make_students_dataset(size, noise):\n",
" '''Fabrique un dataset pour <size> étudiants'''\n",
" x = []\n",
" y = []\n",
" for i in range(size):\n",
" w = random.gauss(5,1)\n",
" s = random.gauss(7,1.5)\n",
" r = do_i_have_it(w,s)\n",
" x.append([w,s])\n",
" y.append(r)\n",
" return (np.array(x), np.array(y))\n",
"\n",
"\n",
"def plot_data(x,y, colors=('green','red'), legend=True):\n",
" '''Affiche un dataset'''\n",
" fig, ax = plt.subplots(1, 1)\n",
" fig.set_size_inches(10,8)\n",
" ax.plot(x[y==1, 0], x[y==1, 1], 'o', color=colors[0], markersize=4, label=\"y=1 (positive)\")\n",
" ax.plot(x[y==0, 0], x[y==0, 1], 'o', color=colors[1], markersize=4, label=\"y=0 (negative)\")\n",
" if legend : ax.legend()\n",
" plt.tick_params(axis='both', which='both', bottom=False, left=False, labelbottom=False, labelleft=False)\n",
" plt.xlabel('Hours of work')\n",
" plt.ylabel('Hours of sleep')\n",
" plt.show()\n",
"\n",
"\n",
"def plot_results(x_test,y_test, y_pred):\n",
" '''Affiche un resultat'''\n",
"\n",
" precision = metrics.precision_score(y_test, y_pred)\n",
" recall = metrics.recall_score(y_test, y_pred)\n",
"\n",
" print(\"Accuracy = {:5.3f} Recall = {:5.3f}\".format(precision, recall))\n",
"\n",
" x_pred_positives = x_test[ y_pred == 1 ] # items prédits positifs\n",
" x_real_positives = x_test[ y_test == 1 ] # items réellement positifs\n",
" x_pred_negatives = x_test[ y_pred == 0 ] # items prédits négatifs\n",
" x_real_negatives = x_test[ y_test == 0 ] # items réellement négatifs\n",
"\n",
" fig, axs = plt.subplots(2, 2)\n",
" fig.subplots_adjust(wspace=.1,hspace=0.2)\n",
" fig.set_size_inches(14,10)\n",
" \n",
" axs[0,0].plot(x_pred_positives[:,0], x_pred_positives[:,1], 'o',color='lightgreen', markersize=10, label=\"Prédits positifs\")\n",
" axs[0,0].plot(x_real_positives[:,0], x_real_positives[:,1], 'o',color='green', markersize=4, label=\"Réels positifs\")\n",
" axs[0,0].legend()\n",
" axs[0,0].tick_params(axis='both', which='both', bottom=False, left=False, labelbottom=False, labelleft=False)\n",
" axs[0,0].set_xlabel('$x_1$')\n",
" axs[0,0].set_ylabel('$x_2$')\n",
"\n",
"\n",
" axs[0,1].plot(x_pred_negatives[:,0], x_pred_negatives[:,1], 'o',color='lightsalmon', markersize=10, label=\"Prédits négatifs\")\n",
" axs[0,1].plot(x_real_negatives[:,0], x_real_negatives[:,1], 'o',color='red', markersize=4, label=\"Réels négatifs\")\n",
" axs[0,1].legend()\n",
" axs[0,1].tick_params(axis='both', which='both', bottom=False, left=False, labelbottom=False, labelleft=False)\n",
" axs[0,1].set_xlabel('$x_1$')\n",
" axs[0,1].set_ylabel('$x_2$')\n",
" \n",
" axs[1,0].plot(x_pred_positives[:,0], x_pred_positives[:,1], 'o',color='lightgreen', markersize=10, label=\"Prédits positifs\")\n",
" axs[1,0].plot(x_pred_negatives[:,0], x_pred_negatives[:,1], 'o',color='lightsalmon', markersize=10, label=\"Prédits négatifs\")\n",
" axs[1,0].plot(x_real_positives[:,0], x_real_positives[:,1], 'o',color='green', markersize=4, label=\"Réels positifs\")\n",
" axs[1,0].plot(x_real_negatives[:,0], x_real_negatives[:,1], 'o',color='red', markersize=4, label=\"Réels négatifs\")\n",
" axs[1,0].tick_params(axis='both', which='both', bottom=False, left=False, labelbottom=False, labelleft=False)\n",
" axs[1,0].set_xlabel('$x_1$')\n",
" axs[1,0].set_ylabel('$x_2$')\n",
"\n",
" axs[1,1].pie([precision,1-precision], explode=[0,0.1], labels=[\"\",\"Errors\"], \n",
" autopct='%1.1f%%', shadow=False, startangle=70, colors=[\"lightsteelblue\",\"coral\"])\n",
" axs[1,1].axis('equal')\n",
"\n",
" plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2 - Parameters"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"data_size = 1000 # Number of observations\n",
"data_cols = 2 # observation size\n",
"data_noise = 0.2\n",
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2 - Data preparation\n",
"### 2.1 - Get some data\n",
"The data here are totally fabricated and represent the **examination results** (passed or failed) based on the students' **working** and **sleeping hours** . \n",
"X=(working hours, sleeping hours) y={result} where result=0 (failed) or 1 (passed)"
]
},
{
"cell_type": "code",
"x_data,y_data=make_students_dataset(data_size,data_noise)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2 - Show it"
]
},
{
"cell_type": "code",
"plot_data(x_data, y_data)\n",
"vector_infos('Dataset X',x_data)\n",
"vector_infos('Dataset y',y_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3 - Preparation of data\n",
"\n",
"We're going to:\n",
"- split the data to have : :\n",
" - a training set\n",
" - a test set\n",
"- normalize the data"
"# ---- Split data\n",
"\n",
"n = int(data_size * 0.8)\n",
"x_train = x_data[:n]\n",
"y_train = y_data[:n]\n",
"x_test = x_data[n:]\n",
"y_test = y_data[n:]\n",
"mean = np.mean(x_train, axis=0)\n",
"std = np.std(x_train, axis=0)\n",
"x_train = (x_train-mean)/std\n",
"x_test = (x_test-mean)/std\n",
"# ---- About it\n",
"\n",
"vector_infos('X_train',x_train)\n",
"vector_infos('X_test',x_test)\n",
"vector_infos('y_test',y_test)\n",
"\n",
"y_train_h = y_train.reshape(-1,) # nécessaire pour la visu."
]
},
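{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: the split above is done by hand to keep things explicit. An equivalent sketch with scikit-learn's `train_test_split` (which also shuffles) could look like this; the `*_2` names are just for illustration and are not used below."
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"# ---- Equivalent split with scikit-learn (sketch, not used below)\n",
"#\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"x_train_2, x_test_2, y_train_2, y_test_2 = train_test_split(x_data, y_data, test_size=0.2, shuffle=True, random_state=42)\n",
"\n",
"print('x_train_2 :', x_train_2.shape, '   x_test_2 :', x_test_2.shape)"
]
},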
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.4 - Have a look"
]
},
{
"cell_type": "code",
"fidle.utils.display_md('**This is what we know :**')\n",
"plot_data(x_train, y_train)\n",
"fidle.utils.display_md('**This is what we want to classify :**')\n",
"plot_data(x_test, y_test, colors=(\"gray\",\"gray\"), legend=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3 - Logistic model #1\n",
"### 3.1 - Here is the classifier"
"# ---- Create an instance\n",
"# Use SAGA solver (Stochastic Average Gradient descent solver)\n",
"#\n",
"logreg = LogisticRegression(C=1e5, verbose=0, solver='saga')\n",
"# ---- Fit the data.\n",
"#\n",
"logreg.fit(x_train, y_train)\n",
"# ---- Do a prediction\n",
"#\n",
"y_pred = logreg.predict(x_test)"
]
},
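{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.2 - Link with the math above\n",
"\n",
"The fitted model exposes the learned parameters: `logreg.coef_` plays the role of $\\theta$ (one weight per feature) and `logreg.intercept_` is the bias. As a small sketch, we can recompute $\\hat{p} = \\sigma(\\theta^T X)$ by hand and check it against `predict_proba`:"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"# ---- Recompute sigma(theta^T x) by hand and compare with predict_proba (sketch)\n",
"#\n",
"theta = logreg.coef_[0]          # learned weights\n",
"b     = logreg.intercept_[0]     # learned bias\n",
"\n",
"p_hand  = 1 / (1 + np.exp(-(x_test @ theta + b)))\n",
"p_model = logreg.predict_proba(x_test)[:, 1]\n",
"\n",
"print('max difference :', np.abs(p_hand - p_model).max())   # should be ~0"
]
},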
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### 3.3 - Evaluation\n",
"\n",
"Accuracy = Ability to avoid false positives = $\\frac{Tp}{Tp+Fp}$ \n",
"Recall = Ability to find the right positives = $\\frac{Tp}{Tp+Fn}$ \n",
"Avec : \n",
"$T_p$ (true positive) Correct positive answer \n",
"$F_p$ (false positive) False positive answer \n",
"$T_n$ (true negative) Correct negative answer \n",
"$F_n$ (false negative) Wrong negative answer "
]
},
{
"cell_type": "code",
"plot_results(x_test,y_test, y_pred)"
]
},
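{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch, precision and recall can also be recomputed directly from the definitions above and compared with `sklearn.metrics`:"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"# ---- Recompute precision and recall from Tp, Fp, Fn (sketch)\n",
"#\n",
"Tp = np.sum((y_pred == 1) & (y_test == 1))   # true positives\n",
"Fp = np.sum((y_pred == 1) & (y_test == 0))   # false positives\n",
"Fn = np.sum((y_pred == 0) & (y_test == 1))   # false negatives\n",
"\n",
"print('precision = {:.3f}   (sklearn: {:.3f})'.format(Tp/(Tp+Fp), metrics.precision_score(y_test, y_pred)))\n",
"print('recall    = {:.3f}   (sklearn: {:.3f})'.format(Tp/(Tp+Fn), metrics.recall_score(y_test, y_pred)))"
]
},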
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4 - Bending the space to a model #2 ;-)\n",
"\n",
"We're going to increase the characteristics of our observations, with : ${x_1}^2$, ${x_2}^2$, ${x_1}^3$ et ${x_2}^3$ \n",
"\n",
"$\n",
"X=\n",
"\\begin{bmatrix}1 & x_{11} & x_{12} \\\\\n",
"\\vdots & \\dots\\\\\n",
"1 & x_{m1} & x_{m2} \\end{bmatrix}\n",
"\\text{et }\n",
"X_{ng}=\\begin{bmatrix}1 & x_{11} & x_{12} & x_{11}^2 & x_{12}^2& x_{11}^3 & x_{12}^3 \\\\\n",
"\\vdots & & & \\dots \\\\\n",
"1 & x_{m1} & x_{m2} & x_{m1}^2 & x_{m2}^2& x_{m1}^3 & x_{m2}^3 \\end{bmatrix}\n",
"$\n",
"\n",
"Note : `sklearn.preprocessing.PolynomialFeatures` can do that for us, but we'll do it ourselves:\n",
"### 4.1 - Extend data"
]
},
{
"cell_type": "code",
"x_train_enhanced = np.c_[x_train,\n",
" x_train[:, 0] ** 2,\n",
" x_train[:, 1] ** 2,\n",
" x_train[:, 0] ** 3,\n",
" x_train[:, 1] ** 3]\n",
"x_test_enhanced = np.c_[x_test,\n",
" x_test[:, 0] ** 2,\n",
" x_test[:, 1] ** 2,\n",
" x_test[:, 0] ** 3,\n",
" x_test[:, 1] ** 3]\n"
]
},
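{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is the promised sketch with `sklearn.preprocessing.PolynomialFeatures`. Note that with `degree=3` it also generates the cross terms ($x_1 x_2$, $x_1^2 x_2$, ...), i.e. a superset of our hand-built features; `x_train_poly` is illustrative only and not used below."
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"# ---- Same expansion with PolynomialFeatures (sketch; adds cross terms too)\n",
"#\n",
"from sklearn.preprocessing import PolynomialFeatures\n",
"\n",
"poly = PolynomialFeatures(degree=3, include_bias=False)\n",
"x_train_poly = poly.fit_transform(x_train)\n",
"\n",
"print('hand-built features :', x_train_enhanced.shape)    # (n, 6)\n",
"print('PolynomialFeatures  :', x_train_poly.shape)        # (n, 9)"
]
},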
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ---- Create an instance\n",
"# Use SAGA solver (Stochastic Average Gradient descent solver)\n",
"#\n",
"logreg = LogisticRegression(C=1e5, verbose=0, solver='saga', max_iter=5000, n_jobs=-1)\n",
"# ---- Fit the data.\n",
"#\n",
"logreg.fit(x_train_enhanced, y_train)\n",
"# ---- Do a prediction\n",
"#\n",
"y_pred = logreg.predict(x_test_enhanced)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"plot_results(x_test_enhanced, y_test, y_pred)"
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"<img width=\"80px\" src=\"../fidle/img/logo-paysage.svg\"></img>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.9.2 ('fidle-env')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.2"
},
"vscode": {
"interpreter": {
"hash": "b3929042cc22c1274d74e3e946c52b845b57cb6d84f2d591ffe0519b38e4896d"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}