{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img width=\"800px\" src=\"../fidle/img/header.svg\"></img>\n",
"# <!-- TITLE --> [K3IMDB3] - Reload and reuse a saved model\n",
"<!-- DESC --> Retrieving a saved model to perform a sentiment analysis (movie review), using Keras 3 and PyTorch\n",
"<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->\n",
" - The objective is to guess whether our personal film reviews are **positive or negative** based on the analysis of the text. \n",
" - For this, we will use our **previously saved model**.\n",
"## What we're going to do :\n",
" - Retrieve our saved model\n",
" - Evaluate the result\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1 - Init python stuff"
]
},
{
"cell_type": "code",
"import os\n",
"os.environ['KERAS_BACKEND'] = 'torch'\n",
"import keras\n",
"import json,re\n",
"import numpy as np\n",
"run_id, run_dir, datasets_dir = fidle.init('K3IMDB3')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2 - Parameters\n",
"The words in the vocabulary are classified from the most frequent to the rarest. \n",
"`vocab_size` is the number of words we will remember in our vocabulary (the other words will be considered as unknown). \n",
"`review_len` is the review length \n",
"`saved_models` where our models were previously saved \n",
"`dictionaries_dir` is where we will go to save our dictionaries. (./data is a good choice)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"vocab_size = 10000\n",
"review_len = 256\n",
"\n",
"saved_models = './run/K3IMDB2'\n",
"dictionaries_dir = './data'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Override parameters (batch mode) - Just forget this cell"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fidle.override('vocab_size', 'review_len', 'saved_models', 'dictionaries_dir')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2 : Preparing the data\n",
"### 2.1 - Our reviews :"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"reviews = [ \"This film is particularly nice, a must see.\",\n",
" \"This film is a great classic that cannot be ignored.\",\n",
" \"I don't remember ever having seen such a movie...\",\n",
" \"This movie is just abominable and doesn't deserve to be seen!\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2 - Retrieve dictionaries\n",
"Note : This dictionary is generated by [02-Embedding-Keras](02-Keras-embedding.ipynb) notebook."
"metadata": {},
"outputs": [],
"source": [
"with open(f'{dictionaries_dir}/word_index.json', 'r') as fp:\n",
" index_word = { i:w for w,i in word_index.items() }\n",
" print('Dictionaries loaded. ', len(word_index), 'entries' )"
]
},
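{
"cell_type": "markdown",
"metadata": {},
"source": [
"Quick optional check (a minimal sketch): `word_index` maps a word to its rank, `index_word` does the reverse. The word `movie` below is just an illustrative example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative lookups (the word 'movie' is only an example)\n",
"print('Index of movie   :', word_index.get('movie', 'not in vocabulary'))\n",
"print('Word at index 20 :', index_word.get(20, '?'))"
]
},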
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3 - Clean, index and padd\n",
"Phases are split into words, punctuation is removed, sentence length is limited and padding is added... \n",
"**Note** : 1 is \"Start\" and 2 is \"unknown\""
"metadata": {},
"outputs": [],
"source": [
"start_char = 1 # Start of a sequence (padding is 0)\n",
"oov_char = 2 # Out-of-vocabulary\n",
"index_from = 3 # First word id\n",
"\n",
"nb_reviews = len(reviews)\n",
"x_data = []\n",
"\n",
"# ---- For all reviews\n",
"for review in reviews:\n",
" index_review=[start_char]\n",
" print(f'{start_char} ', end='')\n",
" # ---- For all words\n",
" for w in review.split(' '):\n",
" # ---- Clean it\n",
" w_clean = re.sub(r\"[^a-zA-Z0-9]\", \"\", w)\n",
" # ---- Not empty ?\n",
" if len(w_clean)>0:\n",
" # ---- Get the index - must be inside dict or is out of vocab (oov)\n",
" w_index = word_index.get(w, oov_char)\n",
" if w_index>vocab_size : w_index=oov_char\n",
" # ---- Add the index if < vocab_size\n",
" index_review.append(w_index)\n",
" x_data.append(index_review)\n",
" print()\n",
"x_data = keras.preprocessing.sequence.pad_sequences(x_data, value = 0, padding = 'post', maxlen = review_len)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.4 - Have a look"
]
},
{
"cell_type": "code",
"source": [
"def translate(x):\n",
" return ' '.join( [index_word.get(i,'?') for i in x] )\n",
"\n",
"for i in range(nb_reviews):\n",
" imax=np.where(x_data[i]==0)[0][0]+5\n",
" print(f'\\nText review {i} :', reviews[i])\n",
" print(f'tokens vector :', list(x_data[i][:imax]), '(...)')\n",
" print('Translation :', translate(x_data[i][:imax]), '(...)')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"metadata": {},
"outputs": [],
"source": [
"model = keras.models.load_model(f'{saved_models}/models/best_model.keras')"
]
},
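{
"cell_type": "markdown",
"metadata": {},
"source": [
"Quick optional check (a minimal sketch): display the architecture of the reloaded model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: summarize the reloaded model (assumes the load above succeeded)\n",
"model.summary()"
]
},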
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4 - Predict"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"y_pred = model.predict(x_data, verbose=0)"
]
},
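{
"cell_type": "markdown",
"metadata": {},
"source": [
"The raw output is one score per review, between 0 (negative) and 1 (positive). A minimal sketch, just to peek at it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# y_pred has shape (nb_reviews, 1) : one sentiment score per review\n",
"print('y_pred shape :', y_pred.shape)\n",
"print('Scores       :', y_pred[:,0])"
]
},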
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### And the winner is :"
]
},
{
"cell_type": "code",
"for i,review in enumerate(reviews):\n",
" rate = y_pred[i][0]\n",
" opinion = 'NEGATIVE :-(' if rate<0.5 else 'POSITIVE :-)' \n",
" print(f'{review:<70} => {rate:.2f} - {opinion}')"
{
"cell_type": "code",
"<img width=\"80px\" src=\"../fidle/img/logo-paysage.svg\"></img>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.9.2 ('fidle-env')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.2"
},
"vscode": {
"interpreter": {
"hash": "b3929042cc22c1274d74e3e946c52b845b57cb6d84f2d591ffe0519b38e4896d"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}