05-LSTM-Keras.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img width=\"800px\" src=\"../fidle/img/00-Fidle-header-01.svg\"></img>\n",
    "\n",
    "# <!-- TITLE --> [IMDB5] - Sentiment analysis with a RNN network\n",
    "<!-- DESC --> Still the same problem, but with a network combining embedding and RNN\n",
    "<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->\n",
    "\n",
    "## Objectives :\n",
    " - The objective is to guess whether film reviews are **positive or negative** based on the analysis of the text. \n",
    " - Use of a model combining embedding and LSTM\n",
    "\n",
    "Original dataset can be find **[there](http://ai.stanford.edu/~amaas/data/sentiment/)**  \n",
    "Note that [IMDb.com](https://imdb.com) offers several easy-to-use [datasets](https://www.imdb.com/interfaces/)  \n",
    "For simplicity's sake, we'll use the dataset directly [embedded in Keras](https://www.tensorflow.org/api_docs/python/tf/keras/datasets)\n",
    "\n",
    "## What we're going to do :\n",
    "\n",
    " - Retrieve data\n",
    " - Preparing the data\n",
    " - Build a Embedding/LSTM model\n",
    " - Train the model\n",
    " - Evaluate the result\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1 - Init python stuff"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "import tensorflow as tf\n",
    "import tensorflow.keras as keras\n",
    "import tensorflow.keras.datasets.imdb as imdb\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib\n",
    "\n",
    "import os,sys,h5py,json\n",
    "from importlib import reload\n",
    "\n",
    "sys.path.append('..')\n",
    "import fidle.pwk as pwk\n",
    "\n",
    "run_dir = './run/IMDB5'\n",
    "datasets_dir = pwk.init('IMDB5', run_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2 - Parameters\n",
    "The words in the vocabulary are classified from the most frequent to the rarest.  \n",
    "`vocab_size` is the number of words we will remember in our vocabulary (the other words will be considered as unknown).  \n",
    "`hide_most_frequently` is the number of ignored words, among the most common ones  \n",
    "`review_len` is the review length  \n",
    "`dense_vector_size` is the size of the generated dense vectors  \n",
    "`fit_verbosity` is the verbosity during training : 0 = silent, 1 = progress bar, 2 = one line per epoch\\\n",
    "`scale` is a dataset scale factor - note a scale=1 need a training time > 10'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vocab_size           = 10000\n",
    "hide_most_frequently = 0\n",
    "\n",
    "review_len           = 256\n",
    "dense_vector_size    = 32\n",
    "\n",
    "epochs               = 10\n",
    "batch_size           = 128\n",
    "\n",
    "fit_verbosity        = 1\n",
    "scale                = 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Override parameters (batch mode) - Just forget this cell"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pwk.override('vocab_size', 'hide_most_frequently', 'review_len', 'dense_vector_size')\n",
    "pwk.override('batch_size', 'epochs', 'fit_verbosity', 'scale')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3 - Retrieve data\n",
    "\n",
    "IMDb dataset can bet get directly from Keras - see [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/datasets)  \n",
    "Note : Due to their nature, textual data can be somewhat complex."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.1 - Get dataset\n",
    "For simplicity, we will use a pre-formatted dataset - See [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb/load_data)  \n",
    "However, Keras offers some usefull tools for formatting textual data - See [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text)  \n",
    "\n",
    "**Load dataset :**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "(x_train, y_train), (x_test, y_test) = imdb.load_data( num_words=vocab_size, skip_top=hide_most_frequently, seed= 42,)\n",
    "\n",
    "y_train = np.asarray(y_train).astype('float32')\n",
    "y_test  = np.asarray(y_test ).astype('float32')\n",
    "\n",
    "# ---- Rescale\n",
    "#\n",
    "n1 = int(scale * len(x_train))\n",
    "n2 = int(scale * len(x_test))\n",
    "x_train, y_train = x_train[:n1], y_train[:n1]\n",
    "x_test,  y_test  = x_test[:n2],  y_test[:n2]\n",
    "\n",
    "# ---- About\n",
    "#\n",
    "print(\"Max(x_train,x_test)  : \", pwk.rmax([x_train,x_test]) )\n",
    "print(\"Min(x_train,x_test)  : \", pwk.rmin([x_train,x_test]) )\n",
    "print(\"x_train : {}  y_train : {}\".format(x_train.shape, y_train.shape))\n",
    "print(\"x_test  : {}  y_test  : {}\".format(x_test.shape,  y_test.shape))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**About this dataset :**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"  Max(x_train,x_test)  : \", pwk.rmax([x_train,x_test]) )\n",
    "print(\"  x_train : {}  y_train : {}\".format(x_train.shape, y_train.shape))\n",
    "print(\"  x_test  : {}  y_test  : {}\".format(x_test.shape,  y_test.shape))\n",
    "\n",
    "print('\\nReview example (x_train[12]) :\\n\\n',x_train[12])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 - Have a look for humans (optional)\n",
    "When we loaded the dataset, we asked for using \\<start\\> as 1, \\<unknown word\\> as 2  \n",
    "So, we shifted the dataset by 3 with the parameter index_from=3\n",
    "\n",
    "**Load dictionary :**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Retrieve dictionary {word:index}, and encode it in ascii\n",
    "#\n",
    "word_index = imdb.get_word_index()\n",
    "\n",
    "# ---- Shift the dictionary from +3\n",
    "#\n",
    "word_index = {w:(i+3) for w,i in word_index.items()}\n",
    "\n",
    "# ---- Add <pad>, <start> and unknown tags\n",
    "#\n",
    "word_index.update( {'<pad>':0, '<start>':1, '<unknown>':2} )\n",
    "\n",
    "# ---- Create a reverse dictionary : {index:word}\n",
    "#\n",
    "index_word = {index:word for word,index in word_index.items()} \n",
    "\n",
    "# ---- Add a nice function to transpose :\n",
    "#\n",
    "def dataset2text(review):\n",
    "    return ' '.join([index_word.get(i, '?') for i in review])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Have a look :**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('\\nDictionary size     : ', len(word_index))\n",
    "for k in range(440,455):print(f'{k:2d} : {index_word[k]}' )\n",
    "pwk.subtitle('Review example :')\n",
    "print(x_train[12])\n",
    "pwk.subtitle('After translation :')\n",
    "print(dataset2text(x_train[12]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4 - Preprocess the data (padding)\n",
    "In order to be processed by an NN, all entries must have the **same length.**  \n",
    "We chose a review length of **review_len**  \n",
    "We will therefore complete them with a padding (of \\<pad\\>\\)  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x_train = keras.preprocessing.sequence.pad_sequences(x_train,\n",
    "                                                     value   = 0,\n",
    "                                                     padding = 'post',\n",
    "                                                     maxlen  = review_len)\n",
    "\n",
    "x_test  = keras.preprocessing.sequence.pad_sequences(x_test,\n",
    "                                                     value   = 0 ,\n",
    "                                                     padding = 'post',\n",
    "                                                     maxlen  = review_len)\n",
    "\n",
    "pwk.subtitle('After padding :')\n",
    "print(x_train[12])\n",
    "pwk.subtitle('In real words :')\n",
    "print(dataset2text(x_train[12]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5 - Build the model\n",
    "\n",
    "More documentation about this model functions :\n",
    " - [Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding)\n",
    " - [GlobalAveragePooling1D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GlobalAveragePooling1D)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_model(dense_vector_size=128):\n",
    "    \n",
    "    model = keras.Sequential()\n",
    "    model.add(keras.layers.Embedding(input_dim = vocab_size, output_dim = dense_vector_size))\n",
    "    model.add(keras.layers.GRU(50))\n",
    "    model.add(keras.layers.Dense(1, activation='sigmoid'))\n",
    "\n",
    "    model.compile(optimizer = 'rmsprop',\n",
    "                  loss      = 'binary_crossentropy',\n",
    "                  metrics   = ['accuracy'])\n",
    "    return model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 6 - Train the model\n",
    "### 6.1 - Get it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = get_model(32)\n",
    "\n",
    "model.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6.2 - Add callback"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.makedirs(f'{run_dir}/models',   mode=0o750, exist_ok=True)\n",
    "save_dir = f'{run_dir}/models/best_model.h5'\n",
    "savemodel_callback = tf.keras.callbacks.ModelCheckpoint(filepath=save_dir, verbose=0, save_best_only=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6.3 - Train it\n",
    "CPU : batch_size=128, epochs=10 : Need 9'30 (CPU, laptop)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pwk.chrono_start()\n",
    "\n",
    "history = model.fit(x_train,\n",
    "                    y_train,\n",
    "                    epochs          = epochs,\n",
    "                    batch_size      = batch_size,\n",
    "                    validation_data = (x_test, y_test),\n",
    "                    verbose         = fit_verbosity,\n",
    "                    callbacks       = [savemodel_callback])\n",
    "\n",
    "pwk.chrono_show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6.4 - Training history"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pwk.plot_history(history, save_as='02-history')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 7 - Evaluation\n",
    "Reload and evaluate best model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = keras.models.load_model(f'{run_dir}/models/best_model.h5')\n",
    "\n",
    "# ---- Evaluate\n",
    "score  = model.evaluate(x_test, y_test, verbose=0)\n",
    "\n",
    "print('x_test / loss      : {:5.4f}'.format(score[0]))\n",
    "print('x_test / accuracy  : {:5.4f}'.format(score[1]))\n",
    "\n",
    "values=[score[1], 1-score[1]]\n",
    "pwk.plot_donut(values,[\"Accuracy\",\"Errors\"], title=\"#### Accuracy donut is :\", save_as='03-donut')\n",
    "\n",
    "# ---- Confusion matrix\n",
    "\n",
    "y_sigmoid = model.predict(x_test)\n",
    "\n",
    "y_pred = y_sigmoid.copy()\n",
    "y_pred[ y_sigmoid< 0.5 ] = 0\n",
    "y_pred[ y_sigmoid>=0.5 ] = 1    \n",
    "\n",
    "pwk.display_confusion_matrix(y_test,y_pred,labels=range(2))\n",
    "pwk.plot_confusion_matrix(y_test,y_pred,range(2), figsize=(8, 8),normalize=False, save_as='04-confusion-matrix')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pwk.end()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "<img width=\"80px\" src=\"../fidle/img/00-Fidle-logo-01.svg\"></img>"
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "8e38643e33497db9a306e3f311fa98cb1e65371278ca73ee4ea0c76aa5a4f387"
  },
  "kernelspec": {
   "display_name": "Python 3.9.7 64-bit ('fidle-cpu': conda)",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}