{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img width=\"800px\" src=\"../fidle/img/header.svg\"></img>\n",
    "\n",
    "# <!-- TITLE --> [LINR1] - Linear regression with direct resolution\n",
    "<!-- DESC --> Low-level implementation, using numpy, of a direct resolution for a linear regression\n",
    "<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->\n",
    "\n",
    "## Objectives :\n",
    " - Just one, the illustration of a direct resolution :-)\n",
    "\n",
    "## What we're going to do :\n",
    "\n",
    "Equation : $$Y = X.\\theta + N$$  \n",
    "Where N is a noise vector\n",
    "and $\\theta = (a,b)$ a vector as y = a.x + b"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1 - Import and init"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import math\n",
    "import matplotlib\n",
    "import matplotlib.pyplot as plt\n",
    "import sys\n",
    "import fidle\n",
    "\n",
    "# Init Fidle environment\n",
    "run_id, run_dir, datasets_dir = fidle.init('LINR1')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2 - Retrieve a set of points"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Paramètres\n",
    "nb    = 100     # Nombre de points\n",
    "xmin  = 0       # Distribution / x\n",
    "xmax  = 10\n",
    "a     = 4       # Distribution / y\n",
    "b     = 2       # y= a.x + b (+ bruit)\n",
    "noise = 7       # bruit\n",
    "\n",
    "theta = np.array([[a],[b]])\n",
    "\n",
    "# ---- Vecteur X  (1,x) x nb\n",
    "#      la premiere colonne est a 1 afin que X.theta <=> 1.b + x.a\n",
    "\n",
    "Xc1 = np.ones((nb,1))\n",
    "Xc2 = np.random.uniform(xmin,xmax,(nb,1))\n",
    "X = np.c_[ Xc1, Xc2 ]\n",
    "\n",
    "# ---- Noise\n",
    "# N = np.random.uniform(-noise,noise,(nb,1))\n",
    "N = noise * np.random.normal(0,1,(nb,1))\n",
    "\n",
    "# ---- Vecteur Y\n",
    "Y = (X @ theta) + N\n",
    "\n",
    "# print(\"X:\\n\",X,\"\\nY:\\n \",Y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Show it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "width = 12\n",
    "height = 6\n",
    "\n",
    "fig, ax = plt.subplots()\n",
    "fig.set_size_inches(width,height)\n",
    "ax.plot(X[:,1], Y, \".\")\n",
    "ax.tick_params(axis='both', which='both', bottom=False, left=False, labelbottom=False, labelleft=False)\n",
    "ax.set_xlabel('x axis')\n",
    "ax.set_ylabel('y axis')\n",
    "fidle.scrawler.save_fig('01-set_of_points')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3 - Direct calculation of the normal equation\n",
    "\n",
    "\n",
    "We'll try to find an optimal value of $\\theta$, minimizing a cost function.  \n",
    "The cost function, classically used in the case of linear regressions, is the **root mean square error** (racine carré de l'erreur quadratique moyenne):  \n",
    "\n",
    "$$RMSE(X,h_\\theta)=\\sqrt{\\frac1n\\sum_{i=1}^n\\left[h_\\theta(X^{(i)})-Y^{(i)}\\right]^2}$$\n",
    "\n",
    "With the simplified variant : $$MSE(X,h_\\theta)=\\frac1n\\sum_{i=1}^n\\left[h_\\theta(X^{(i)})-Y^{(i)}\\right]^2$$\n",
    "\n",
    "The optimal value of regression is : $$\\hat{ \\theta } =( X^{T} .X)^{-1}.X^{T}.Y$$\n",
    "\n",
    "Démontstration : https://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "theta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y\n",
    "\n",
    "print(\"Theta :\\n\",theta,\"\\n\\ntheta hat :\\n\",theta_hat)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Show it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Xd = np.array([[1,xmin], [1,xmax]])\n",
    "Yd = Xd @ theta_hat\n",
    "\n",
    "fig, ax = plt.subplots()\n",
    "fig.set_size_inches(width,height)\n",
    "ax.plot(X[:,1], Y, \".\")\n",
    "ax.plot(Xd[:,1], Yd, \"-\")\n",
    "ax.tick_params(axis='both', which='both', bottom=False, left=False, labelbottom=False, labelleft=False)\n",
    "ax.set_xlabel('x axis')\n",
    "ax.set_ylabel('y axis')\n",
    "fidle.scrawler.save_fig('02-regression-line')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fidle.end()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "<img width=\"80px\" src=\"../fidle/img/logo-paysage.svg\"></img>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.9.2 ('fidle-env')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  },
  "vscode": {
   "interpreter": {
    "hash": "b3929042cc22c1274d74e3e946c52b845b57cb6d84f2d591ffe0519b38e4896d"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}