02-Keras-embedding.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img width=\"800px\" src=\"../fidle/img/header.svg\"></img>\n",
    "\n",
    "# <!-- TITLE --> [K3IMDB2] - Sentiment analysis with text embedding\n",
    "<!-- DESC --> A very classical example of word embedding with a dataset from Internet Movie Database (IMDB), using Keras 3 on PyTorch\n",
    "<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->\n",
    "\n",
    "## Objectives :\n",
    " - The objective is to guess whether film reviews are **positive or negative** based on the analysis of the text. \n",
    " - Understand the management of **textual data** and **sentiment analysis**\n",
    "\n",
    "Original dataset can be find **[there](http://ai.stanford.edu/~amaas/data/sentiment/)**  \n",
    "Note that [IMDb.com](https://imdb.com) offers several easy-to-use [datasets](https://www.imdb.com/interfaces/)  \n",
    "For simplicity's sake, we'll use the dataset directly [embedded in Keras](https://keras.io/datasets)\n",
    "\n",
    "## What we're going to do :\n",
    "\n",
    " - Retrieve data\n",
    " - Preparing the data\n",
    " - Build a model\n",
    " - Train the model\n",
    " - Evaluate the result\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1 - Import and init\n",
    "### 1.1 - Python stuff"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ['KERAS_BACKEND'] = 'torch'\n",
    "\n",
    "import keras\n",
    "import keras.datasets.imdb as imdb\n",
    "\n",
    "import h5py,json\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "import fidle\n",
    "\n",
    "# Init Fidle environment\n",
    "run_id, run_dir, datasets_dir = fidle.init('K3IMDB2')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.2 - Parameters\n",
    "The words in the vocabulary are classified from the most frequent to the rarest.  \n",
    "`vocab_size` is the number of words we will remember in our vocabulary (the other words will be considered as unknown).  \n",
    "`hide_most_frequently` is the number of ignored words, among the most common ones  \n",
    "`review_len` is the review length  \n",
    "`dense_vector_size` is the size of the generated dense vectors  \n",
    "`output_dir` is where we will go to save our dictionaries. (./data is a good choice)\\\n",
    "`fit_verbosity` is the verbosity during training : 0 = silent, 1 = progress bar, 2 = one line per epoch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vocab_size           = 5000\n",
    "hide_most_frequently = 0\n",
    "\n",
    "review_len           = 256\n",
    "dense_vector_size    = 32\n",
    "\n",
    "epochs               = 30\n",
    "batch_size           = 512\n",
    "\n",
    "output_dir           = './data'\n",
    "fit_verbosity        = 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Override parameters (batch mode) - Just forget this cell"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fidle.override('vocab_size', 'hide_most_frequently', 'review_len', 'dense_vector_size')\n",
    "fidle.override('batch_size', 'epochs', 'output_dir', 'fit_verbosity')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2 - Retrieve data\n",
    "\n",
    "IMDb dataset can bet get directly from Keras - see [documentation](https://keras.io/api/datasets)  \n",
    "Note : Due to their nature, textual data can be somewhat complex.\n",
    "\n",
    "For more details about the management of this dataset, see notebook [IMDB1](01-One-hot-encoding.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 - Get dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ----- Retrieve x,y\n",
    "#\n",
    "start_char = 1      # Start of a sequence (padding is 0)\n",
    "oov_char   = 2      # Out-of-vocabulary\n",
    "index_from = 3      # First word id\n",
    "\n",
    "(x_train, y_train), (x_test, y_test) = imdb.load_data( num_words  = vocab_size, \n",
    "                                                       skip_top   = hide_most_frequently,\n",
    "                                                       start_char = start_char, \n",
    "                                                       oov_char   = oov_char, \n",
    "                                                       index_from = index_from)\n",
    "\n",
    "# ---- About\n",
    "#\n",
    "print(\"Max(x_train,x_test)  : \", fidle.utils.rmax([x_train,x_test]) )\n",
    "print(\"Min(x_train,x_test)  : \", fidle.utils.rmin([x_train,x_test]) )\n",
    "print(\"Len(x_train)         : \", len(x_train))\n",
    "print(\"Len(x_test)          : \", len(x_test))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 - Load dictionary\n",
    "Not essential, but nice if you want to take a closer look at our reviews ;-)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Retrieve dictionary {word:index}, and encode it in ascii\n",
    "#      Shift the dictionary from +3\n",
    "#      Add <pad>, <start> and <unknown> tags\n",
    "#      Create a reverse dictionary : {index:word}\n",
    "#\n",
    "word_index = imdb.get_word_index()\n",
    "word_index = {w:(i+index_from) for w,i in word_index.items()}\n",
    "word_index.update( {'<pad>':0, '<start>':1, '<unknown>':2, '<undef>':3,} )\n",
    "index_word = {index:word for word,index in word_index.items()} \n",
    "\n",
    "# ---- A nice function to transpose :\n",
    "#\n",
    "def dataset2text(review):\n",
    "    return ' '.join([index_word.get(i, '?') for i in review])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3 - Preprocess the data (padding)\n",
    "In order to be processed by an NN, all entries must have the **same length.**  \n",
    "We chose a review length of **review_len**  \n",
    "We will therefore complete them with a padding (of 0 as \\<pad\\>\\)  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x_train = keras.preprocessing.sequence.pad_sequences(x_train,\n",
    "                                                     value   = 0,\n",
    "                                                     padding = 'post',\n",
    "                                                     maxlen  = review_len)\n",
    "\n",
    "x_test  = keras.preprocessing.sequence.pad_sequences(x_test,\n",
    "                                                     value   = 0 ,\n",
    "                                                     padding = 'post',\n",
    "                                                     maxlen  = review_len)\n",
    "\n",
    "fidle.utils.subtitle('After padding :')\n",
    "print(x_train[12])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Save dataset and dictionary (For future use but not mandatory)**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Write dataset in a h5 file, could be usefull\n",
    "#\n",
    "fidle.utils.mkdir(output_dir)\n",
    "\n",
    "with h5py.File(f'{output_dir}/dataset_imdb.h5', 'w') as f:\n",
    "    f.create_dataset(\"x_train\",    data=x_train)\n",
    "    f.create_dataset(\"y_train\",    data=y_train)\n",
    "    f.create_dataset(\"x_test\",     data=x_test)\n",
    "    f.create_dataset(\"y_test\",     data=y_test)\n",
    "    print('Dataset h5 file saved.')\n",
    "\n",
    "with open(f'{output_dir}/word_index.json', 'w') as fp:\n",
    "    json.dump(word_index, fp)\n",
    "    print('Word to index saved.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4 - Build the model\n",
    "\n",
    "More documentation about this model functions :\n",
    " - [Embedding](https://keras.io/api/layers/core_layers/embedding/)\n",
    " - [GlobalAveragePooling1D](https://keras.io/api/layers/pooling_layers/global_average_pooling1d/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = keras.Sequential(name='Embedding model')\n",
    "\n",
    "model.add(keras.layers.Input( shape=(review_len,) ))\n",
    "model.add(keras.layers.Embedding( input_dim    = vocab_size,\n",
    "                                  output_dim   = dense_vector_size))\n",
    "model.add(keras.layers.GlobalAveragePooling1D())\n",
    "model.add(keras.layers.Dense(dense_vector_size, activation='relu'))\n",
    "model.add(keras.layers.Dense(1,                 activation='sigmoid'))\n",
    "\n",
    "model.compile( optimizer = 'adam',\n",
    "               loss      = 'binary_crossentropy',\n",
    "               metrics   = ['accuracy'])\n",
    "\n",
    "model.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5 - Train the model\n",
    "### 5.1 Add Callbacks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.makedirs(f'{run_dir}/models',   mode=0o750, exist_ok=True)\n",
    "save_dir = f'{run_dir}/models/best_model.keras'\n",
    "\n",
    "savemodel_callback = keras.callbacks.ModelCheckpoint( filepath=save_dir, monitor='val_accuracy', mode='max', save_best_only=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.2 - Train it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "history = model.fit(x_train,\n",
    "                    y_train,\n",
    "                    epochs          = epochs,\n",
    "                    batch_size      = batch_size,\n",
    "                    validation_data = (x_test, y_test),\n",
    "                    verbose         = fit_verbosity,\n",
    "                    callbacks       = [savemodel_callback])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 6 - Evaluate\n",
    "### 6.1 - Training history"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fidle.scrawler.history(history, save_as='02-history')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6.2 - Reload and evaluate best model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = keras.models.load_model(f'{run_dir}/models/best_model.keras')\n",
    "\n",
    "# ---- Evaluate\n",
    "score  = model.evaluate(x_test, y_test, verbose=0)\n",
    "\n",
    "print('x_test / loss      : {:5.4f}'.format(score[0]))\n",
    "print('x_test / accuracy  : {:5.4f}'.format(score[1]))\n",
    "\n",
    "values=[score[1], 1-score[1]]\n",
    "fidle.scrawler.donut(values,[\"Accuracy\",\"Errors\"], title=\"#### Accuracy donut is :\", save_as='03-donut')\n",
    "\n",
    "# ---- Confusion matrix\n",
    "\n",
    "y_sigmoid = model.predict(x_test, verbose=fit_verbosity)\n",
    "\n",
    "y_pred = y_sigmoid.copy()\n",
    "y_pred[ y_sigmoid< 0.5 ] = 0\n",
    "y_pred[ y_sigmoid>=0.5 ] = 1    \n",
    "\n",
    "fidle.scrawler.confusion_matrix_txt(y_test,y_pred,labels=range(2))\n",
    "fidle.scrawler.confusion_matrix(y_test,y_pred,range(2), figsize=(8, 8),normalize=False, save_as='04-confusion-matrix')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fidle.end()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "<img width=\"80px\" src=\"../fidle/img/logo-paysage.svg\"></img>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.9.2 ('fidle-env')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  },
  "vscode": {
   "interpreter": {
    "hash": "b3929042cc22c1274d74e3e946c52b845b57cb6d84f2d591ffe0519b38e4896d"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}