02-Prediction.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Text Embedding - IMDB dataset\n",
    "=============================\n",
    "---\n",
    "Introduction au Deep Learning  (IDLE) - S. Arias, E. Maldonado, JL. Parouty - CNRS/SARI/DEVLOG - 2020  \n",
    "\n",
    "## Reviews analysis :\n",
    "\n",
    "The objective is to guess whether our new and personals films reviews are **positive or negative** .  \n",
    "For this, we will use our previously saved model.\n",
    "\n",
    "What we're going to do:\n",
    "\n",
    " - Preparing the data\n",
    " - Retrieve our saved model\n",
    " - Evaluate the result\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1 - Init python stuff"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "IDLE 2020 - Practical Work Module\n",
      "  Version            : 0.2.4\n",
      "  Run time           : Monday 27 January 2020, 21:57:44\n",
      "  Matplotlib style   : fidle/talk.mplstyle\n",
      "  TensorFlow version : 2.0.0\n",
      "  Keras version      : 2.2.4-tf\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "\n",
    "import tensorflow as tf\n",
    "import tensorflow.keras as keras\n",
    "import tensorflow.keras.datasets.imdb as imdb\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib\n",
    "import seaborn as sns\n",
    "import pandas as pd\n",
    "\n",
    "import os,h5py,json,re\n",
    "\n",
    "import fidle.pwk as ooo\n",
    "from importlib import reload\n",
    "\n",
    "ooo.init()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2 : Preparing the data\n",
    "### 2.1 - Our reviews :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [],
   "source": [
    "reviews = [ \"This film is particularly nice, a must see.\",\n",
    "             \"Some films are classics and cannot be ignored.\",\n",
    "             \"This movie is just abominable and doesn't deserve to be seen!\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 - Retrieve dictionaries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('./data/word_index.json', 'r') as fp:\n",
    "    word_index = json.load(fp)\n",
    "    index_word = {index:word for word,index in word_index.items()} "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 - Clean, index and padd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "metadata": {},
   "outputs": [],
   "source": [
    "max_len    = 256\n",
    "vocab_size = 10000\n",
    "\n",
    "\n",
    "nb_reviews = len(reviews)\n",
    "x_data     = []\n",
    "\n",
    "# ---- For all reviews\n",
    "for review in reviews:\n",
    "    # ---- First index must be <start>\n",
    "    index_review=[1]\n",
    "    # ---- For all words\n",
    "    for w in review.split(' '):\n",
    "        # ---- Clean it\n",
    "        w_clean = re.sub(r\"[^a-zA-Z0-9]\", \"\", w)\n",
    "        # ---- Not empty ?\n",
    "        if len(w_clean)>0:\n",
    "            # ---- Get the index\n",
    "            w_index = word_index.get(w,2)\n",
    "            if w_index>vocab_size : w_index=2\n",
    "            # ---- Add the index if < vocab_size\n",
    "            index_review.append(w_index)\n",
    "    # ---- Add the indexed review\n",
    "    x_data.append(index_review)    \n",
    "\n",
    "# ---- Padding\n",
    "x_data = keras.preprocessing.sequence.pad_sequences(x_data, value   = 0, padding = 'post', maxlen  = max_len)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.4 - Have a look"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Text review      : This film is particularly nice, a must see.\n",
      "x_train[0]       : [1, 2, 22, 9, 572, 2, 6, 215, 2, 0, 0, 0, 0, 0] (...)\n",
      "Translation      : <start> <unknown> film is particularly <unknown> a must <unknown> <pad> <pad> <pad> <pad> <pad> (...)\n",
      "\n",
      "Text review      : Some films are classics and cannot be ignored.\n",
      "x_train[1]       : [1, 2, 108, 26, 2239, 5, 566, 30, 2, 0, 0, 0, 0, 0] (...)\n",
      "Translation      : <start> <unknown> films are classics and cannot be <unknown> <pad> <pad> <pad> <pad> <pad> (...)\n",
      "\n",
      "Text review      : This movie is just abominable and doesn't deserve to be seen!\n",
      "x_train[2]       : [1, 2, 20, 9, 43, 2, 5, 152, 1833, 8, 30, 2, 0, 0, 0, 0, 0] (...)\n",
      "Translation      : <start> <unknown> movie is just <unknown> and doesn't deserve to be <unknown> <pad> <pad> <pad> <pad> <pad> (...)\n"
     ]
    }
   ],
   "source": [
    "def translate(x):\n",
    "    return ' '.join( [index_word.get(i,'?') for i in x] )\n",
    "\n",
    "for i in range(nb_reviews):\n",
    "    imax=np.where(x_data[i]==0)[0][0]+5\n",
    "    print(f'\\nText review      :',    reviews[i])\n",
    "    print(  f'x_train[{i:}]       :', list(x_data[i][:imax]), '(...)')\n",
    "    print(  'Translation      :', translate(x_data[i][:imax]), '(...)')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2 - Bring back the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = keras.models.load_model('./run/models/best_model.h5')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4 - Predict"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred   = model.predict(x_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### And the winner is :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "This film is particularly nice, a must see.                            => POSITIVE (0.62)\n",
      "\n",
      "Some films are classics and cannot be ignored.                         => POSITIVE (0.57)\n",
      "\n",
      "This movie is just abominable and doesn't deserve to be seen!          => NEGATIVE (0.42)\n"
     ]
    }
   ],
   "source": [
    "for i in range(nb_reviews):\n",
    "    print(f'\\n{reviews[i]:<70} =>',('NEGATIVE' if y_pred[i][0]<0.5 else 'POSITIVE'),f'({y_pred[i][0]:.2f})')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[1, 0, 1, 2]"
      ]
     },
     "execution_count": 101,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a=[1]+[i for i in range(3)]\n",
    "a"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}