02-Prediction.ipynb 8.57 KiB
Jean-Luc Parouty committed
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img width=\"800px\" src=\"../fidle/img/00-Fidle-header-01.svg\"></img>\n",
    "# <!-- TITLE --> [IMDB2] - Text embedding with IMDB - Reloaded\n",
    "<!-- DESC --> Example of reusing a previously saved model\n",
    "<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->\n",
    "\n",
    "## Objectives :\n",
    " - The objective is to guess whether film reviews are **positive or negative** based on the analysis of the text. \n",
    " - For this, we will use our **previously saved model**.\n",
    "\n",
    "Original dataset can be find **[there](http://ai.stanford.edu/~amaas/data/sentiment/)**  \n",
    "Note that [IMDb.com](https://imdb.com) offers several easy-to-use [datasets](https://www.imdb.com/interfaces/)  \n",
    "For simplicity's sake, we'll use the dataset directly [embedded in Keras](https://www.tensorflow.org/api_docs/python/tf/keras/datasets)\n",
    "\n",
    "## What we're going to do :\n",
    "\n",
    " - Preparing the data\n",
    " - Retrieve our saved model\n",
    " - Evaluate the result\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1 - Init python stuff"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>\n",
       "\n",
       "div.warn {    \n",
       "    background-color: #fcf2f2;\n",
       "    border-color: #dFb5b4;\n",
       "    border-left: 5px solid #dfb5b4;\n",
       "    padding: 0.5em;\n",
       "    font-weight: bold;\n",
       "    font-size: 1.1em;\n",
       "    }\n",
       "\n",
       "\n",
       "\n",
       "div.nota {    \n",
       "    background-color: #DAFFDE;\n",
       "    border-left: 5px solid #92CC99;\n",
       "    padding: 0.5em;\n",
       "    }\n",
       "\n",
       "\n",
       "\n",
       "</style>\n",
       "\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "FIDLE 2020 - Practical Work Module\n",
      "Version              : 0.2.9\n",
      "Run time             : Wednesday 19 February 2020, 22:08:28\n",
      "TensorFlow version   : 2.0.0\n",
      "Keras version        : 2.2.4-tf\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "\n",
    "import tensorflow as tf\n",
    "import tensorflow.keras as keras\n",
    "import tensorflow.keras.datasets.imdb as imdb\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib\n",
    "import seaborn as sns\n",
    "import pandas as pd\n",
    "\n",
    "import os,sys,h5py,json,re\n",
    "\n",
    "from importlib import reload\n",
    "\n",
    "sys.path.append('..')\n",
    "import fidle.pwk as ooo\n",
    "\n",
    "ooo.init()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2 : Preparing the data\n",
    "### 2.1 - Our reviews :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "reviews = [ \"This film is particularly nice, a must see.\",\n",
    "             \"Some films are great classics and cannot be ignored.\",\n",
    "             \"This movie is just abominable and doesn't deserve to be seen!\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 - Retrieve dictionaries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('./data/word_index.json', 'r') as fp:\n",
    "    word_index = json.load(fp)\n",
    "    index_word = {index:word for word,index in word_index.items()} "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 - Clean, index and padd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "max_len    = 256\n",
    "vocab_size = 10000\n",
    "\n",
    "\n",
    "nb_reviews = len(reviews)\n",
    "x_data     = []\n",
    "\n",
    "# ---- For all reviews\n",
    "for review in reviews:\n",
    "    # ---- First index must be <start>\n",
    "    index_review=[1]\n",
    "    # ---- For all words\n",
    "    for w in review.split(' '):\n",
    "        # ---- Clean it\n",
    "        w_clean = re.sub(r\"[^a-zA-Z0-9]\", \"\", w)\n",
    "        # ---- Not empty ?\n",
    "        if len(w_clean)>0:\n",
    "            # ---- Get the index\n",
    "            w_index = word_index.get(w,2)\n",
    "            if w_index>vocab_size : w_index=2\n",
    "            # ---- Add the index if < vocab_size\n",
    "            index_review.append(w_index)\n",
    "    # ---- Add the indexed review\n",
    "    x_data.append(index_review)    \n",
    "\n",
    "# ---- Padding\n",
    "x_data = keras.preprocessing.sequence.pad_sequences(x_data, value   = 0, padding = 'post', maxlen  = max_len)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.4 - Have a look"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Text review      : This film is particularly nice, a must see.\n",
      "x_train[0]       : [1, 2, 22, 9, 572, 2, 6, 215, 2, 0, 0, 0, 0, 0] (...)\n",
      "Translation      : <start> <unknown> film is particularly <unknown> a must <unknown> <pad> <pad> <pad> <pad> <pad> (...)\n",
      "\n",
      "Text review      : Some films are great classics and cannot be ignored.\n",
      "x_train[1]       : [1, 2, 108, 26, 87, 2239, 5, 566, 30, 2, 0, 0, 0, 0, 0] (...)\n",
      "Translation      : <start> <unknown> films are great classics and cannot be <unknown> <pad> <pad> <pad> <pad> <pad> (...)\n",
      "\n",
      "Text review      : This movie is just abominable and doesn't deserve to be seen!\n",
      "x_train[2]       : [1, 2, 20, 9, 43, 2, 5, 152, 1833, 8, 30, 2, 0, 0, 0, 0, 0] (...)\n",
      "Translation      : <start> <unknown> movie is just <unknown> and doesn't deserve to be <unknown> <pad> <pad> <pad> <pad> <pad> (...)\n"
     ]
    }
   ],
   "source": [
    "def translate(x):\n",
    "    return ' '.join( [index_word.get(i,'?') for i in x] )\n",
    "\n",
    "for i in range(nb_reviews):\n",
    "    imax=np.where(x_data[i]==0)[0][0]+5\n",
    "    print(f'\\nText review      :',    reviews[i])\n",
    "    print(  f'x_train[{i:}]       :', list(x_data[i][:imax]), '(...)')\n",
    "    print(  'Translation      :', translate(x_data[i][:imax]), '(...)')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2 - Bring back the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = keras.models.load_model('./run/models/best_model.h5')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4 - Predict"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred   = model.predict(x_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### And the winner is :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "This film is particularly nice, a must see.                            => POSITIVE (0.54)\n",
      "\n",
      "Some films are great classics and cannot be ignored.                   => POSITIVE (0.61)\n",
      "\n",
      "This movie is just abominable and doesn't deserve to be seen!          => NEGATIVE (0.33)\n"
     ]
    }
   ],
   "source": [
    "for i in range(nb_reviews):\n",
    "    print(f'\\n{reviews[i]:<70} =>',('NEGATIVE' if y_pred[i][0]<0.5 else 'POSITIVE'),f'({y_pred[i][0]:.2f})')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img width=\"80px\" src=\"../fidle/img/00-Fidle-logo-01.svg\"></img>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
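
The "clean, index and pad" step of the notebook can also be sketched as a small standalone function, which makes it easy to check on a toy dictionary without loading `word_index.json` or TensorFlow. This is only a minimal sketch under stated assumptions: the function name `encode_review`, the constants `PAD`, `START`, `UNKNOWN` and the contents of `toy_word_index` are invented for illustration (only the reserved indices 0, 1, 2 and the values for "film" and "is" mirror the notebook output). Unlike `pad_sequences`, whose default truncation drops the beginning of a long sequence, this sketch keeps the beginning and cuts the end.

```python
import re

# Reserved indices, as seen in the notebook output: <pad>, <start>, <unknown>
PAD, START, UNKNOWN = 0, 1, 2


def encode_review(review, word_index, vocab_size=10000, max_len=256):
    """Turn one raw review into a fixed-length list of word indices."""
    indices = [START]
    for word in review.split():
        # Lowercase and strip punctuation, keeping apostrophes (e.g. "doesn't")
        cleaned = re.sub(r"[^a-z0-9']", "", word.lower())
        if not cleaned:
            continue
        index = word_index.get(cleaned, UNKNOWN)
        # Words outside the vocabulary are mapped to <unknown>
        indices.append(index if index < vocab_size else UNKNOWN)
    # Simplification: keep the first max_len indices, then pad on the right
    indices = indices[:max_len]
    return indices + [PAD] * (max_len - len(indices))


# Toy dictionary, invented for this example only
toy_word_index = {"this": 14, "film": 22, "is": 9, "nice": 327, "rare": 20000}

encoded = encode_review("This film is nice, really!", toy_word_index, max_len=8)
print(encoded)  # [1, 14, 22, 9, 327, 2, 0, 0]
```

Here "really" is missing from the toy dictionary and becomes `2` (`<unknown>`), while "nice," is found once its comma is stripped, which is the behaviour one would want before calling `model.predict`.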