Skip to content
Snippets Groups Projects
05-LSTM-Keras.ipynb 13.3 KiB
Newer Older
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img width=\"800px\" src=\"../fidle/img/00-Fidle-header-01.svg\"></img>\n",
    "# <!-- TITLE --> [IMDB5] - Sentiment analysis with a RNN network\n",
    "<!-- DESC --> Still the same problem, but with a network combining embedding and RNN\n",
    "<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->\n",
    "## Objectives :\n",
    " - The objective is to guess whether film reviews are **positive or negative** based on the analysis of the text. \n",
    " - Use of a model combining embedding and LSTM\n",
    "\n",
    "Original dataset can be find **[there](http://ai.stanford.edu/~amaas/data/sentiment/)**  \n",
    "Note that [IMDb.com](https://imdb.com) offers several easy-to-use [datasets](https://www.imdb.com/interfaces/)  \n",
    "For simplicity's sake, we'll use the dataset directly [embedded in Keras](https://www.tensorflow.org/api_docs/python/tf/keras/datasets)\n",
    "\n",
    "## What we're going to do :\n",
    "\n",
    " - Retrieve data\n",
    " - Preparing the data\n",
    " - Build a Embedding/LSTM model\n",
    " - Train the model\n",
    " - Evaluate the result\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1 - Init python stuff"
   ]
  },
  {
   "cell_type": "code",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "import tensorflow as tf\n",
    "import tensorflow.keras as keras\n",
    "import tensorflow.keras.datasets.imdb as imdb\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib\n",
    "\n",
    "import os,sys,h5py,json\n",
    "from importlib import reload\n",
    "\n",
    "sys.path.append('..')\n",
    "import fidle.pwk as pwk\n",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
    "run_dir = './run/IMDB5'\n",
    "datasets_dir = pwk.init('IMDB5', run_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2 - Parameters\n",
    "The words in the vocabulary are classified from the most frequent to the rarest.  \n",
    "`vocab_size` is the number of words we will remember in our vocabulary (the other words will be considered as unknown).  \n",
    "`hide_most_frequently` is the number of ignored words, among the most common ones  \n",
    "`review_len` is the review length  \n",
    "`dense_vector_size` is the size of the generated dense vectors  \n",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
    "`fit_verbosity` is the verbosity during training : 0 = silent, 1 = progress bar, 2 = one line per epoch\\\n",
    "`scale` is a dataset scale factor - note a scale=1 need a training time > 10'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vocab_size           = 10000\n",
    "hide_most_frequently = 0\n",
    "review_len           = 256\n",
    "dense_vector_size    = 32\n",
    "epochs               = 10\n",
    "batch_size           = 128\n",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
    "fit_verbosity        = 1\n",
    "scale                = 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Override parameters (batch mode) - Just forget this cell"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pwk.override('vocab_size', 'hide_most_frequently', 'review_len', 'dense_vector_size')\n",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
    "pwk.override('batch_size', 'epochs', 'fit_verbosity', 'scale')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3 - Retrieve data\n",
    "IMDb dataset can bet get directly from Keras - see [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/datasets)  \n",
    "Note : Due to their nature, textual data can be somewhat complex."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.1 - Get dataset\n",
    "For simplicity, we will use a pre-formatted dataset - See [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb/load_data)  \n",
    "However, Keras offers some usefull tools for formatting textual data - See [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text)  \n",
    "**Load dataset :**"
   ]
  },
  {
   "cell_type": "code",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "(x_train, y_train), (x_test, y_test) = imdb.load_data( num_words=vocab_size, skip_top=hide_most_frequently, seed= 42,)\n",
    "y_train = np.asarray(y_train).astype('float32')\n",
    "y_test  = np.asarray(y_test ).astype('float32')\n",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
    "# ---- Rescale\n",
    "#\n",
    "n1 = int(scale * len(x_train))\n",
    "n2 = int(scale * len(x_test))\n",
    "x_train, y_train = x_train[:n1], y_train[:n1]\n",
    "x_test,  y_test  = x_test[:n2],  y_test[:n2]\n",
    "\n",
    "# ---- About\n",
    "print(\"Max(x_train,x_test)  : \", pwk.rmax([x_train,x_test]) )\n",
    "print(\"Min(x_train,x_test)  : \", pwk.rmin([x_train,x_test]) )\n",
    "print(\"x_train : {}  y_train : {}\".format(x_train.shape, y_train.shape))\n",
    "print(\"x_test  : {}  y_test  : {}\".format(x_test.shape,  y_test.shape))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**About this dataset :**"
   ]
  },
  {
   "cell_type": "code",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"  Max(x_train,x_test)  : \", pwk.rmax([x_train,x_test]) )\n",
    "print(\"  x_train : {}  y_train : {}\".format(x_train.shape, y_train.shape))\n",
    "print(\"  x_test  : {}  y_test  : {}\".format(x_test.shape,  y_test.shape))\n",
    "\n",
    "print('\\nReview example (x_train[12]) :\\n\\n',x_train[12])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 - Have a look for humans (optional)\n",
    "When we loaded the dataset, we asked for using \\<start\\> as 1, \\<unknown word\\> as 2  \n",
    "So, we shifted the dataset by 3 with the parameter index_from=3\n",
    "\n",
    "**Load dictionary :**"
   ]
  },
  {
   "cell_type": "code",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Retrieve dictionary {word:index}, and encode it in ascii\n",
    "word_index = imdb.get_word_index()\n",
    "\n",
    "# ---- Shift the dictionary from +3\n",
    "word_index = {w:(i+3) for w,i in word_index.items()}\n",
    "\n",
    "# ---- Add <pad>, <start> and unknown tags\n",
    "word_index.update( {'<pad>':0, '<start>':1, '<unknown>':2} )\n",
    "\n",
    "# ---- Create a reverse dictionary : {index:word}\n",
    "index_word = {index:word for word,index in word_index.items()} \n",
    "\n",
    "# ---- Add a nice function to transpose :\n",
    "#\n",
    "def dataset2text(review):\n",
    "    return ' '.join([index_word.get(i, '?') for i in review])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Have a look :**"
   ]
  },
  {
   "cell_type": "code",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('\\nDictionary size     : ', len(word_index))\n",
    "for k in range(440,455):print(f'{k:2d} : {index_word[k]}' )\n",
    "pwk.subtitle('Review example :')\n",
    "print(x_train[12])\n",
    "pwk.subtitle('After translation :')\n",
    "print(dataset2text(x_train[12]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4 - Preprocess the data (padding)\n",
    "In order to be processed by an NN, all entries must have the **same length.**  \n",
    "We chose a review length of **review_len**  \n",
    "We will therefore complete them with a padding (of \\<pad\\>\\)  "
   ]
  },
  {
   "cell_type": "code",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x_train = keras.preprocessing.sequence.pad_sequences(x_train,\n",
    "                                                     value   = 0,\n",
    "                                                     padding = 'post',\n",
    "                                                     maxlen  = review_len)\n",
    "\n",
    "x_test  = keras.preprocessing.sequence.pad_sequences(x_test,\n",
    "                                                     value   = 0 ,\n",
    "                                                     padding = 'post',\n",
    "                                                     maxlen  = review_len)\n",
    "\n",
    "pwk.subtitle('After padding :')\n",
    "print(x_train[12])\n",
    "pwk.subtitle('In real words :')\n",
    "print(dataset2text(x_train[12]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5 - Build the model\n",
    "\n",
    "More documentation about this model functions :\n",
    " - [Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding)\n",
    " - [GlobalAveragePooling1D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GlobalAveragePooling1D)"
   ]
  },
  {
   "cell_type": "code",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_model(dense_vector_size=128):\n",
    "    \n",
    "    model = keras.Sequential()\n",
    "    model.add(keras.layers.Embedding(input_dim = vocab_size, output_dim = dense_vector_size))\n",
    "    model.add(keras.layers.GRU(50))\n",
    "    model.add(keras.layers.Dense(1, activation='sigmoid'))\n",
    "    model.compile(optimizer = 'rmsprop',\n",
    "                  loss      = 'binary_crossentropy',\n",
    "                  metrics   = ['accuracy'])\n",
    "    return model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 6 - Train the model\n",
    "### 6.1 - Get it"
   ]
  },
  {
   "cell_type": "code",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = get_model(32)\n",
    "\n",
    "model.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6.2 - Add callback"
   ]
  },
  {
   "cell_type": "code",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
    "os.makedirs(f'{run_dir}/models',   mode=0o750, exist_ok=True)\n",
    "save_dir = f'{run_dir}/models/best_model.h5'\n",
    "savemodel_callback = tf.keras.callbacks.ModelCheckpoint(filepath=save_dir, verbose=0, save_best_only=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6.3 - Train it\n",
    "CPU : batch_size=128, epochs=10 : Need 9'30 (CPU, laptop)"
   ]
  },
  {
   "cell_type": "code",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pwk.chrono_start()\n",
    "\n",
    "history = model.fit(x_train,\n",
    "                    y_train,\n",
    "                    epochs          = epochs,\n",
    "                    batch_size      = batch_size,\n",
    "                    validation_data = (x_test, y_test),\n",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
    "                    verbose         = fit_verbosity,\n",
    "                    callbacks       = [savemodel_callback])\n",
    "\n",
    "pwk.chrono_show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6.4 - Training history"
   ]
  },
  {
   "cell_type": "code",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pwk.plot_history(history, save_as='02-history')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 7 - Evaluation\n",
    "Reload and evaluate best model"
   ]
  },
  {
   "cell_type": "code",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
    "model = keras.models.load_model(f'{run_dir}/models/best_model.h5')\n",
    "\n",
    "# ---- Evaluate\n",
    "score  = model.evaluate(x_test, y_test, verbose=0)\n",
    "\n",
    "print('x_test / loss      : {:5.4f}'.format(score[0]))\n",
    "print('x_test / accuracy  : {:5.4f}'.format(score[1]))\n",
    "\n",
    "values=[score[1], 1-score[1]]\n",
    "pwk.plot_donut(values,[\"Accuracy\",\"Errors\"], title=\"#### Accuracy donut is :\", save_as='03-donut')\n",
    "\n",
    "# ---- Confusion matrix\n",
    "\n",
    "y_sigmoid = model.predict(x_test)\n",
    "y_pred = y_sigmoid.copy()\n",
    "y_pred[ y_sigmoid< 0.5 ] = 0\n",
    "y_pred[ y_sigmoid>=0.5 ] = 1    \n",
    "\n",
    "pwk.display_confusion_matrix(y_test,y_pred,labels=range(2))\n",
    "pwk.plot_confusion_matrix(y_test,y_pred,range(2), figsize=(8, 8),normalize=False, save_as='04-confusion-matrix')"
   ]
  },
  {
   "cell_type": "code",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pwk.end()"
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "<img width=\"80px\" src=\"../fidle/img/00-Fidle-logo-01.svg\"></img>"
  }
 ],
 "metadata": {
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
  "interpreter": {
   "hash": "8e38643e33497db9a306e3f311fa98cb1e65371278ca73ee4ea0c76aa5a4f387"
  },
  "kernelspec": {
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
   "display_name": "Python 3.9.7 64-bit ('fidle-cpu': conda)",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}