05-LSTM-Keras.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img width=\"800px\" src=\"../fidle/img/00-Fidle-header-01.svg\"></img>\n",
    "\n",
    "# <!-- TITLE --> [IMDB5] - Sentiment analysis with a LSTM network\n",
    "<!-- DESC --> Still the same problem, but with a network combining embedding and LSTM\n",
    "<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->\n",
    "\n",
    "## Objectives :\n",
    " - The objective is to guess whether film reviews are **positive or negative** based on the analysis of the text. \n",
    " - Use of a model combining embedding and LSTM\n",
    "\n",
    "Original dataset can be find **[there](http://ai.stanford.edu/~amaas/data/sentiment/)**  \n",
    "Note that [IMDb.com](https://imdb.com) offers several easy-to-use [datasets](https://www.imdb.com/interfaces/)  \n",
    "For simplicity's sake, we'll use the dataset directly [embedded in Keras](https://www.tensorflow.org/api_docs/python/tf/keras/datasets)\n",
    "\n",
    "## What we're going to do :\n",
    "\n",
    " - Retrieve data\n",
    " - Preparing the data\n",
    " - Build a Embedding/LSTM model\n",
    " - Train the model\n",
    " - Evaluate the result\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1 - Init python stuff"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "import tensorflow as tf\n",
    "import tensorflow.keras as keras\n",
    "import tensorflow.keras.datasets.imdb as imdb\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib\n",
    "\n",
    "import os,sys,h5py,json\n",
    "from importlib import reload\n",
    "\n",
    "sys.path.append('..')\n",
    "import fidle.pwk as pwk\n",
    "\n",
    "datasets_dir = pwk.init('IMDB3')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2 - Retrieve data\n",
    "\n",
    "IMDb dataset can bet get directly from Keras - see [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/datasets)  \n",
    "Note : Due to their nature, textual data can be somewhat complex.\n",
    "\n",
    "### 2.1 - Data structure :  \n",
    "The dataset is composed of 2 parts: \n",
    "\n",
    " - **reviews**, this will be our **x**\n",
    " - **opinions** (positive/negative), this will be our **y**\n",
    "\n",
    "There are also a **dictionary**, because words are indexed in reviews\n",
    "\n",
    "```\n",
    "<dataset> = (<reviews>, <opinions>)\n",
    "\n",
    "with :  <reviews>  = [ <review1>, <review2>, ... ]\n",
    "        <opinions> = [ <rate1>,   <rate2>,   ... ]   where <ratei>   = integer\n",
    "\n",
    "where : <reviewi> = [ <w1>, <w2>, ...]    <wi> are the index (int) of the word in the dictionary\n",
    "        <ratei>   = int                   0 for negative opinion, 1 for positive\n",
    "\n",
    "\n",
    "<dictionary> = [ <word1>:<w1>, <word2>:<w2>, ... ]\n",
    "\n",
    "with :  <wordi>   = word\n",
    "        <wi>      = int\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 - Get dataset\n",
    "For simplicity, we will use a pre-formatted dataset - See [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb/load_data)  \n",
    "However, Keras offers some usefull tools for formatting textual data - See [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text)  \n",
    "\n",
    "**Load dataset :**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vocab_size = 10000\n",
    "\n",
    "# ----- Retrieve x,y\n",
    "\n",
    "# Uncomment this if you want to load dataset directly from keras (small size <20M)\n",
    "#\n",
    "(x_train, y_train), (x_test, y_test) = imdb.load_data( num_words  = vocab_size,\n",
    "                                                       skip_top   = 0,\n",
    "                                                       maxlen     = None,\n",
    "                                                       seed       = 42,\n",
    "                                                       start_char = 1,\n",
    "                                                       oov_char   = 2,\n",
    "                                                       index_from = 3, )\n",
    "\n",
    "# To load a h5 version of the dataset :\n",
    "#\n",
    "# with  h5py.File(f'{datasets_dir}/IMDB/origine/dataset_imdb.h5','r') as f:\n",
    "#        x_train = f['x_train'][:]\n",
    "#        y_train = f['y_train'][:]\n",
    "#        x_test  = f['x_test'][:]\n",
    "#        y_test  = f['y_test'][:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**About this dataset :**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"  Max(x_train,x_test)  : \", pwk.rmax([x_train,x_test]) )\n",
    "print(\"  x_train : {}  y_train : {}\".format(x_train.shape, y_train.shape))\n",
    "print(\"  x_test  : {}  y_test  : {}\".format(x_test.shape,  y_test.shape))\n",
    "\n",
    "print('\\nReview example (x_train[12]) :\\n\\n',x_train[12])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 - Have a look for humans (optional)\n",
    "When we loaded the dataset, we asked for using \\<start\\> as 1, \\<unknown word\\> as 2  \n",
    "So, we shifted the dataset by 3 with the parameter index_from=3\n",
    "\n",
    "**Load dictionary :**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Retrieve dictionary {word:index}, and encode it in ascii\n",
    "#\n",
    "word_index = imdb.get_word_index()\n",
    "\n",
    "# ---- Shift the dictionary from +3\n",
    "#\n",
    "word_index = {w:(i+3) for w,i in word_index.items()}\n",
    "\n",
    "# ---- Add <pad>, <start> and unknown tags\n",
    "#\n",
    "word_index.update( {'<pad>':0, '<start>':1, '<unknown>':2} )\n",
    "\n",
    "# ---- Create a reverse dictionary : {index:word}\n",
    "#\n",
    "index_word = {index:word for word,index in word_index.items()} \n",
    "\n",
    "# ---- Add a nice function to transpose :\n",
    "#\n",
    "def dataset2text(review):\n",
    "    return ' '.join([index_word.get(i, '?') for i in review])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Have a look :**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('\\nDictionary size     : ', len(word_index))\n",
    "for k in range(440,455):print(f'{k:2d} : {index_word[k]}' )\n",
    "pwk.subtitle('Review example :')\n",
    "print(x_train[12])\n",
    "pwk.subtitle('After translation :')\n",
    "print(dataset2text(x_train[12]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.4 - Have a look for NN"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sizes=[len(i) for i in x_train]\n",
    "plt.figure(figsize=(16,6))\n",
    "plt.hist(sizes, bins=400)\n",
    "plt.gca().set(title='Distribution of reviews by size - [{:5.2f}, {:5.2f}]'.format(min(sizes),max(sizes)), \n",
    "              xlabel='Size', ylabel='Density', xlim=[0,1500])\n",
    "pwk.save_fig('01-stats-sizes')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3 - Preprocess the data (padding)\n",
    "In order to be processed by an NN, all entries must have the **same length.**  \n",
    "We chose a review length of **review_len**  \n",
    "We will therefore complete them with a padding (of \\<pad\\>\\)  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "review_len = 256\n",
    "\n",
    "x_train = keras.preprocessing.sequence.pad_sequences(x_train,\n",
    "                                                     value   = 0,\n",
    "                                                     padding = 'post',\n",
    "                                                     maxlen  = review_len)\n",
    "\n",
    "x_test  = keras.preprocessing.sequence.pad_sequences(x_test,\n",
    "                                                     value   = 0 ,\n",
    "                                                     padding = 'post',\n",
    "                                                     maxlen  = review_len)\n",
    "\n",
    "pwk.subtitle('After padding :')\n",
    "print(x_train[12])\n",
    "pwk.subtitle('In real words :')\n",
    "print(dataset2text(x_train[12]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Save dataset and dictionary (For future use but not mandatory)**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Write dataset in a h5 file, could be usefull\n",
    "#\n",
    "output_dir = './data'\n",
    "pwk.mkdir(output_dir)\n",
    "\n",
    "with h5py.File(f'{output_dir}/dataset_imdb.h5', 'w') as f:\n",
    "    f.create_dataset(\"x_train\",    data=x_train)\n",
    "    f.create_dataset(\"y_train\",    data=y_train)\n",
    "    f.create_dataset(\"x_test\",     data=x_test)\n",
    "    f.create_dataset(\"y_test\",     data=y_test)\n",
    "\n",
    "with open(f'{output_dir}/word_index.json', 'w') as fp:\n",
    "    json.dump(word_index, fp)\n",
    "\n",
    "with open(f'{output_dir}/index_word.json', 'w') as fp:\n",
    "    json.dump(index_word, fp)\n",
    "\n",
    "print('Saved.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4 - Build the model\n",
    "Few remarks :\n",
    " - We'll choose a dense vector size for the embedding output with **dense_vector_size**\n",
    " - **GlobalAveragePooling1D** do a pooling on the last dimension : (None, lx, ly) -> (None, ly)  \n",
    "   In other words: we average the set of vectors/words of a sentence\n",
    " - L'embedding de Keras fonctionne de manière supervisée. Il s'agit d'une couche de *vocab_size* neurones vers *n_neurons* permettant de maintenir une table de vecteurs (les poids constituent les vecteurs). Cette couche ne calcule pas de sortie a la façon des couches normales, mais renvois la valeur des vecteurs. n mots => n vecteurs (ensuite empilés par le pooling)  \n",
    "Voir : [Explication plus détaillée (en)](https://stats.stackexchange.com/questions/324992/how-the-embedding-layer-is-trained-in-keras-embedding-layer)  \n",
    "ainsi que : [Sentiment detection with Keras](https://www.liip.ch/en/blog/sentiment-detection-with-keras-word-embeddings-and-lstm-deep-learning-networks)  \n",
    "\n",
    "More documentation about this model functions :\n",
    " - [Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding)\n",
    " - [GlobalAveragePooling1D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GlobalAveragePooling1D)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_model(dense_vector_size=128):\n",
    "    \n",
    "    model = keras.Sequential()\n",
    "    model.add(keras.layers.Embedding(input_dim    = vocab_size, \n",
    "                                     output_dim   = dense_vector_size, \n",
    "                                     input_length = review_len))\n",
    "    model.add(keras.layers.LSTM(128, dropout=0.2, recurrent_dropout=0.2))\n",
    "    model.add(keras.layers.Dense(1,                 activation='sigmoid'))\n",
    "\n",
    "    model.compile(optimizer = 'adam',\n",
    "                  loss      = 'binary_crossentropy',\n",
    "                  metrics   = ['accuracy'])\n",
    "    return model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5 - Train the model\n",
    "### 5.1 - Get it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = get_model(32)\n",
    "\n",
    "model.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.2 - Add callback"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.makedirs('./run/models',   mode=0o750, exist_ok=True)\n",
    "save_dir = \"./run/models/best_model.h5\"\n",
    "savemodel_callback = tf.keras.callbacks.ModelCheckpoint(filepath=save_dir, verbose=0, save_best_only=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.1 - Train it\n",
    "GPU : batch_size=512 :  6' 30s  \n",
    "CPU : batch_size=512 : 12' 57s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "n_epochs   = 10\n",
    "batch_size = 512\n",
    "\n",
    "history = model.fit(x_train,\n",
    "                    y_train,\n",
    "                    epochs          = n_epochs,\n",
    "                    batch_size      = batch_size,\n",
    "                    validation_data = (x_test, y_test),\n",
    "                    verbose         = 1,\n",
    "                    callbacks       = [savemodel_callback])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 6 - Evaluate\n",
    "### 6.1 - Training history"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pwk.plot_history(history, save_as='02-history')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6.2 - Reload and evaluate best model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = keras.models.load_model('./run/models/best_model.h5')\n",
    "\n",
    "# ---- Evaluate\n",
    "score  = model.evaluate(x_test, y_test, verbose=0)\n",
    "\n",
    "print('x_test / loss      : {:5.4f}'.format(score[0]))\n",
    "print('x_test / accuracy  : {:5.4f}'.format(score[1]))\n",
    "\n",
    "values=[score[1], 1-score[1]]\n",
    "pwk.plot_donut(values,[\"Accuracy\",\"Errors\"], title=\"#### Accuracy donut is :\", save_as='03-donut')\n",
    "\n",
    "# ---- Confusion matrix\n",
    "\n",
    "y_sigmoid = model.predict(x_test)\n",
    "\n",
    "y_pred = y_sigmoid.copy()\n",
    "y_pred[ y_sigmoid< 0.5 ] = 0\n",
    "y_pred[ y_sigmoid>=0.5 ] = 1    \n",
    "\n",
    "pwk.display_confusion_matrix(y_test,y_pred,labels=range(2))\n",
    "pwk.plot_confusion_matrix(y_test,y_pred,range(2), figsize=(8, 8),normalize=False, save_as='04-confusion-matrix')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pwk.end()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "<img width=\"80px\" src=\"../fidle/img/00-Fidle-logo-01.svg\"></img>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}