{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img width=\"800px\" src=\"../fidle/img/00-Fidle-header-01.svg\"></img>\n",
    "# <!-- TITLE --> [IMDB5] - Sentiment analysis with a LSTM network\n",
    "<!-- DESC --> Still the same problem, but with a network combining embedding and LSTM\n",
    "<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->\n",
    "## Objectives :\n",
    " - The objective is to guess whether film reviews are **positive or negative** based on the analysis of the text. \n",
    " - Use of a model combining embedding and LSTM\n",
    "\n",
    "Original dataset can be find **[there](http://ai.stanford.edu/~amaas/data/sentiment/)**  \n",
    "Note that [IMDb.com](https://imdb.com) offers several easy-to-use [datasets](https://www.imdb.com/interfaces/)  \n",
    "For simplicity's sake, we'll use the dataset directly [embedded in Keras](https://www.tensorflow.org/api_docs/python/tf/keras/datasets)\n",
    "\n",
    "## What we're going to do :\n",
    "\n",
    " - Retrieve data\n",
    " - Preparing the data\n",
    " - Build a Embedding/LSTM model\n",
    " - Train the model\n",
    " - Evaluate the result\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1 - Init python stuff"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "import tensorflow as tf\n",
    "import tensorflow.keras as keras\n",
    "import tensorflow.keras.datasets.imdb as imdb\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib\n",
    "\n",
    "import os,sys,h5py,json\n",
    "from importlib import reload\n",
    "\n",
    "sys.path.append('..')\n",
    "import fidle.pwk as pwk\n",
    "datasets_dir = pwk.init('IMDB3')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2 - Retrieve data\n",
    "\n",
    "IMDb dataset can bet get directly from Keras - see [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/datasets)  \n",
    "Note : Due to their nature, textual data can be somewhat complex.\n",
    "\n",
    "### 2.1 - Data structure :  \n",
    "The dataset is composed of 2 parts: \n",
    "\n",
    " - **reviews**, this will be our **x**\n",
    " - **opinions** (positive/negative), this will be our **y**\n",
    "\n",
    "There are also a **dictionary**, because words are indexed in reviews\n",
    "\n",
    "```\n",
    "<dataset> = (<reviews>, <opinions>)\n",
    "\n",
    "with :  <reviews>  = [ <review1>, <review2>, ... ]\n",
    "        <opinions> = [ <rate1>,   <rate2>,   ... ]   where <ratei>   = integer\n",
    "\n",
    "where : <reviewi> = [ <w1>, <w2>, ...]    <wi> are the index (int) of the word in the dictionary\n",
    "        <ratei>   = int                   0 for negative opinion, 1 for positive\n",
    "\n",
    "\n",
    "<dictionary> = [ <word1>:<w1>, <word2>:<w2>, ... ]\n",
    "with :  <wordi>   = word\n",
    "        <wi>      = int\n",
    "\n",
    "```"
   ]
  },
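  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make this structure concrete, here is a tiny, hand-made example (toy values, **not** real IMDB data) :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- A toy instance of the structure above (made up for illustration)\n",
    "#\n",
    "toy_reviews    = [ [1, 9, 14, 22], [1, 9, 43, 22, 2] ]   # 2 reviews, words as indexes\n",
    "toy_opinions   = [ 1, 0 ]                                 # 1 = positive, 0 = negative\n",
    "toy_dictionary = { '<start>':1, '<unknown>':2, 'a':9, 'great':14, 'film':22, 'boring':43 }\n",
    "\n",
    "# ---- Decode the first toy review back to words\n",
    "toy_index_word = { i:w for w,i in toy_dictionary.items() }\n",
    "print(' '.join( toy_index_word[i] for i in toy_reviews[0] ))"
   ]
  },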
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 - Get dataset\n",
    "For simplicity, we will use a pre-formatted dataset - See [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb/load_data)  \n",
    "However, Keras offers some usefull tools for formatting textual data - See [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text)  \n",
    "**Load dataset :**"
   ]
  },
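  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of those text tools, on two made-up sentences (purely illustrative, not used in the rest of this notebook) :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Purely illustrative : tokenize two toy sentences\n",
    "#      (indexes are assigned by decreasing word frequency)\n",
    "#\n",
    "tokenizer = keras.preprocessing.text.Tokenizer(num_words=100)\n",
    "tokenizer.fit_on_texts(['the movie was great', 'the movie was awful'])\n",
    "print(tokenizer.texts_to_sequences(['the movie was great']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Load dataset :**"
   ]
  },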
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vocab_size = 10000\n",
    "\n",
    "# ----- Retrieve x,y\n",
    "\n",
    "# Uncomment this if you want to load dataset directly from keras (small size <20M)\n",
    "#\n",
    "(x_train, y_train), (x_test, y_test) = imdb.load_data( num_words  = vocab_size,\n",
    "                                                       skip_top   = 0,\n",
    "                                                       maxlen     = None,\n",
    "                                                       seed       = 42,\n",
    "                                                       start_char = 1,\n",
    "                                                       oov_char   = 2,\n",
    "                                                       index_from = 3, )\n",
    "# To load a h5 version of the dataset :\n",
    "#\n",
    "# with  h5py.File(f'{datasets_dir}/IMDB/origine/dataset_imdb.h5','r') as f:\n",
    "#        x_train = f['x_train'][:]\n",
    "#        y_train = f['y_train'][:]\n",
    "#        x_test  = f['x_test'][:]\n",
    "#        y_test  = f['y_test'][:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**About this dataset :**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"  Max(x_train,x_test)  : \", pwk.rmax([x_train,x_test]) )\n",
    "print(\"  x_train : {}  y_train : {}\".format(x_train.shape, y_train.shape))\n",
    "print(\"  x_test  : {}  y_test  : {}\".format(x_test.shape,  y_test.shape))\n",
    "\n",
    "print('\\nReview example (x_train[12]) :\\n\\n',x_train[12])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 - Have a look for humans (optional)\n",
    "When we loaded the dataset, we asked for using \\<start\\> as 1, \\<unknown word\\> as 2  \n",
    "So, we shifted the dataset by 3 with the parameter index_from=3\n",
    "\n",
    "**Load dictionary :**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Retrieve dictionary {word:index}, and encode it in ascii\n",
    "word_index = imdb.get_word_index()\n",
    "\n",
    "# ---- Shift the dictionary from +3\n",
    "word_index = {w:(i+3) for w,i in word_index.items()}\n",
    "\n",
    "# ---- Add <pad>, <start> and unknown tags\n",
    "word_index.update( {'<pad>':0, '<start>':1, '<unknown>':2} )\n",
    "\n",
    "# ---- Create a reverse dictionary : {index:word}\n",
    "index_word = {index:word for word,index in word_index.items()} \n",
    "\n",
    "# ---- Add a nice function to transpose :\n",
    "#\n",
    "def dataset2text(review):\n",
    "    return ' '.join([index_word.get(i, '?') for i in review])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Have a look :**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('\\nDictionary size     : ', len(word_index))\n",
    "for k in range(440,455):print(f'{k:2d} : {index_word[k]}' )\n",
    "pwk.subtitle('Review example :')\n",
    "print(x_train[12])\n",
    "pwk.subtitle('After translation :')\n",
    "print(dataset2text(x_train[12]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.4 - Have a look for NN"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sizes=[len(i) for i in x_train]\n",
    "plt.figure(figsize=(16,6))\n",
    "plt.hist(sizes, bins=400)\n",
    "plt.gca().set(title='Distribution of reviews by size - [{:5.2f}, {:5.2f}]'.format(min(sizes),max(sizes)), \n",
    "              xlabel='Size', ylabel='Density', xlim=[0,1500])\n",
    "pwk.save_fig('01-stats-sizes')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3 - Preprocess the data (padding)\n",
    "In order to be processed by an NN, all entries must have the **same length.**  \n",
    "We chose a review length of **review_len**  \n",
    "We will therefore complete them with a padding (of \\<pad\\>\\)  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "review_len = 256\n",
    "\n",
    "x_train = keras.preprocessing.sequence.pad_sequences(x_train,\n",
    "                                                     value   = 0,\n",
    "                                                     padding = 'post',\n",
    "                                                     maxlen  = review_len)\n",
    "\n",
    "x_test  = keras.preprocessing.sequence.pad_sequences(x_test,\n",
    "                                                     value   = 0 ,\n",
    "                                                     padding = 'post',\n",
    "                                                     maxlen  = review_len)\n",
    "\n",
    "pwk.subtitle('After padding :')\n",
    "print(x_train[12])\n",
    "pwk.subtitle('In real words :')\n",
    "print(dataset2text(x_train[12]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Save dataset and dictionary (For future use but not mandatory)**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Write dataset in a h5 file, could be usefull\n",
    "#\n",
    "output_dir = './data'\n",
    "pwk.mkdir(output_dir)\n",
    "with h5py.File(f'{output_dir}/dataset_imdb.h5', 'w') as f:\n",
    "    f.create_dataset(\"x_train\",    data=x_train)\n",
    "    f.create_dataset(\"y_train\",    data=y_train)\n",
    "    f.create_dataset(\"x_test\",     data=x_test)\n",
    "    f.create_dataset(\"y_test\",     data=y_test)\n",
    "\n",
    "with open(f'{output_dir}/word_index.json', 'w') as fp:\n",
    "    json.dump(word_index, fp)\n",
    "\n",
    "with open(f'{output_dir}/index_word.json', 'w') as fp:\n",
    "    json.dump(index_word, fp)\n",
    "\n",
    "print('Saved.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4 - Build the model\n",
    "Few remarks :\n",
    " - We'll choose a dense vector size for the embedding output with **dense_vector_size**\n",
    " - **GlobalAveragePooling1D** do a pooling on the last dimension : (None, lx, ly) -> (None, ly)  \n",
    "   In other words: we average the set of vectors/words of a sentence\n",
    " - L'embedding de Keras fonctionne de manière supervisée. Il s'agit d'une couche de *vocab_size* neurones vers *n_neurons* permettant de maintenir une table de vecteurs (les poids constituent les vecteurs). Cette couche ne calcule pas de sortie a la façon des couches normales, mais renvois la valeur des vecteurs. n mots => n vecteurs (ensuite empilés par le pooling)  \n",
    "Voir : [Explication plus détaillée (en)](https://stats.stackexchange.com/questions/324992/how-the-embedding-layer-is-trained-in-keras-embedding-layer)  \n",
    "ainsi que : [Sentiment detection with Keras](https://www.liip.ch/en/blog/sentiment-detection-with-keras-word-embeddings-and-lstm-deep-learning-networks)  \n",
    "\n",
    "More documentation about this model functions :\n",
    " - [Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding)\n",
    " - [GlobalAveragePooling1D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GlobalAveragePooling1D)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_model(dense_vector_size=128):\n",
    "    \n",
    "    model = keras.Sequential()\n",
    "    model.add(keras.layers.Embedding(input_dim    = vocab_size, \n",
    "                                     output_dim   = dense_vector_size, \n",
    "                                     input_length = review_len))\n",
    "    model.add(keras.layers.LSTM(128, dropout=0.2, recurrent_dropout=0.2))\n",
    "    model.add(keras.layers.Dense(1,                 activation='sigmoid'))\n",
    "\n",
    "    model.compile(optimizer = 'adam',\n",
    "                  loss      = 'binary_crossentropy',\n",
    "                  metrics   = ['accuracy'])\n",
    "    return model"
   ]
  },
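  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check (a minimal, purely illustrative sketch), we can push a dummy batch through the two layer types to see the shapes announced above. The `dummy` batch of 4 padded reviews is made up for the demonstration :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Purely illustrative shape check (not needed for training)\n",
    "#\n",
    "emb   = keras.layers.Embedding(input_dim=vocab_size, output_dim=32)\n",
    "lstm  = keras.layers.LSTM(128)\n",
    "dummy = np.zeros((4, review_len), dtype='int32')   # fake batch of 4 padded reviews\n",
    "\n",
    "x = emb(dummy)\n",
    "print('After Embedding :', x.shape)                # (4, 256, 32) : n words -> n vectors\n",
    "print('After LSTM      :', lstm(x).shape)          # (4, 128)     : last hidden state only"
   ]
  },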
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5 - Train the model\n",
    "### 5.1 - Get it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = get_model(32)\n",
    "\n",
    "model.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.2 - Add callback"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.makedirs('./run/models',   mode=0o750, exist_ok=True)\n",
    "save_dir = \"./run/models/best_model.h5\"\n",
    "savemodel_callback = tf.keras.callbacks.ModelCheckpoint(filepath=save_dir, verbose=0, save_best_only=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.1 - Train it\n",
    "GPU : batch_size=512 :  6' 30s  \n",
    "CPU : batch_size=512 : 12' 57s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "n_epochs   = 10\n",
    "batch_size = 512\n",
    "\n",
    "history = model.fit(x_train,\n",
    "                    y_train,\n",
    "                    epochs          = n_epochs,\n",
    "                    batch_size      = batch_size,\n",
    "                    validation_data = (x_test, y_test),\n",
    "                    verbose         = 1,\n",
    "                    callbacks       = [savemodel_callback])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 6 - Evaluate\n",
    "### 6.1 - Training history"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pwk.plot_history(history, save_as='02-history')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6.2 - Reload and evaluate best model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = keras.models.load_model('./run/models/best_model.h5')\n",
    "\n",
    "# ---- Evaluate\n",
    "score  = model.evaluate(x_test, y_test, verbose=0)\n",
    "\n",
    "print('x_test / loss      : {:5.4f}'.format(score[0]))\n",
    "print('x_test / accuracy  : {:5.4f}'.format(score[1]))\n",
    "\n",
    "values=[score[1], 1-score[1]]\n",
    "pwk.plot_donut(values,[\"Accuracy\",\"Errors\"], title=\"#### Accuracy donut is :\", save_as='03-donut')\n",
    "\n",
    "# ---- Confusion matrix\n",
    "\n",
    "y_sigmoid = model.predict(x_test)\n",
    "y_pred = y_sigmoid.copy()\n",
    "y_pred[ y_sigmoid< 0.5 ] = 0\n",
    "y_pred[ y_sigmoid>=0.5 ] = 1    \n",
    "\n",
    "pwk.display_confusion_matrix(y_test,y_pred,labels=range(2))\n",
    "pwk.plot_confusion_matrix(y_test,y_pred,range(2), figsize=(8, 8),normalize=False, save_as='04-confusion-matrix')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pwk.end()"
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "<img width=\"80px\" src=\"../fidle/img/00-Fidle-logo-01.svg\"></img>"
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}