{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img width=\"800px\" src=\"../fidle/img/header.svg\"></img>\n",
"# <!-- TITLE --> [K3IMDB5] - Sentiment analysis with a RNN network\n",
"<!-- DESC --> Still the same problem, but with a network combining embedding and RNN, using Keras 3 and PyTorch\n",
"<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->\n",
"## Objectives :\n",
" - The objective is to guess whether film reviews are **positive or negative** based on the analysis of the text. \n",
" - Use of a model combining embedding and LSTM\n",
"\n",
"Original dataset can be find **[there](http://ai.stanford.edu/~amaas/data/sentiment/)** \n",
"Note that [IMDb.com](https://imdb.com) offers several easy-to-use [datasets](https://www.imdb.com/interfaces/) \n",
"For simplicity's sake, we'll use the dataset directly [embedded in Keras](https://keras.io/datasets)\n",
"## What we're going to do :\n",
"\n",
" - Retrieve data\n",
" - Preparing the data\n",
" - Build a Embedding/LSTM model\n",
" - Train the model\n",
" - Evaluate the result\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1 - Init python stuff"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"import os\n",
"os.environ['KERAS_BACKEND'] = 'torch'\n",
"import keras\n",
"import keras.datasets.imdb as imdb\n",
"import json,re\n",
"import numpy as np\n",
"run_id, run_dir, datasets_dir = fidle.init('K3IMDB5')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2 - Parameters\n",
"The words in the vocabulary are classified from the most frequent to the rarest. \n",
"`vocab_size` is the number of words we will remember in our vocabulary (the other words will be considered as unknown). \n",
"`hide_most_frequently` is the number of ignored words, among the most common ones \n",
"`review_len` is the review length \n",
"`dense_vector_size` is the size of the generated dense vectors \n",
"`fit_verbosity` is the verbosity during training : 0 = silent, 1 = progress bar, 2 = one line per epoch\\\n",
"`scale` is a dataset scale factor - note a scale=1 need a training time > 10'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"vocab_size = 10000\n",
"hide_most_frequently = 0\n",
"review_len = 256\n",
"dense_vector_size = 32\n",
"epochs = 10\n",
"batch_size = 128\n",
]
},
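{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick peek (a sketch, not part of the pipeline) at the frequency ranking described above : the raw Keras index maps each word to its frequency rank, rank 1 being the most frequent word."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ---- A sketch : peek at the raw frequency-ranked index\n",
"#      imdb.get_word_index() returns {word: rank}, rank 1 = most frequent\n",
"#\n",
"raw_index = imdb.get_word_index()\n",
"top_words = sorted(raw_index.items(), key=lambda kv: kv[1])[:10]\n",
"print(top_words)"
]
},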
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Override parameters (batch mode) - Just forget this cell"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fidle.override('vocab_size', 'hide_most_frequently', 'review_len', 'dense_vector_size')\n",
"fidle.override('batch_size', 'epochs', 'fit_verbosity', 'scale')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3 - Retrieve data\n",
"IMDb dataset can bet get directly from Keras - see [documentation](https://keras.io/api/datasets) \n",
"Note : Due to their nature, textual data can be somewhat complex."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For simplicity, we will use a pre-formatted dataset - See [documentation](https://keras.io/api/datasets/imdb/) \n",
"However, Keras offers some usefull tools for formatting textual data - See [documentation](https://keras.io/api/layers/preprocessing_layers/text/text_vectorization/) \n",
"execution_count": null,
"metadata": {},
"outputs": [],
"# ----- Retrieve x,y\n",
"#\n",
"start_char = 1 # Start of a sequence (padding is 0)\n",
"oov_char = 2 # Out-of-vocabulary\n",
"index_from = 3 # First word id\n",
"(x_train, y_train), (x_test, y_test) = imdb.load_data( num_words = vocab_size, \n",
" skip_top = hide_most_frequently,\n",
" start_char = start_char, \n",
" oov_char = oov_char, \n",
" index_from = index_from)\n",
"# ---- Rescale\n",
"#\n",
"n1 = int(scale * len(x_train))\n",
"n2 = int(scale * len(x_test))\n",
"x_train, y_train = x_train[:n1], y_train[:n1]\n",
"x_test, y_test = x_test[:n2], y_test[:n2]\n",
"\n",
"print(\"Max(x_train,x_test) : \", fidle.utils.rmax([x_train,x_test]) )\n",
"print(\"Min(x_train,x_test) : \", fidle.utils.rmin([x_train,x_test]) )\n",
"print(\"Len(x_train) : \", len(x_train))\n",
"print(\"Len(x_test) : \", len(x_test))"
]
},
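{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, a minimal sketch of the TextVectorization layer mentioned above - we do not use it here, and note that it relies on TensorFlow string ops under the hood, so TensorFlow may need to be installed even with the torch backend :"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ---- A sketch (not used below) : build a vocabulary from raw strings\n",
"#      and map sentences to integer sequences\n",
"#\n",
"demo_texts = ['this movie was great', 'this movie was awful']\n",
"vectorizer = keras.layers.TextVectorization(max_tokens=100, output_sequence_length=6)\n",
"vectorizer.adapt(demo_texts)\n",
"print(vectorizer(['this movie was really great']))"
]
},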
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.2 - Have a look for humans (optional)\n",
"When we loaded the dataset, we asked for using \\<start\\> as 1, \\<unknown word\\> as 2 \n",
"So, we shifted the dataset by 3 with the parameter index_from=3\n",
"\n",
"**Load dictionary :**"
"metadata": {},
"outputs": [],
"source": [
"# ---- Retrieve dictionary {word:index}, and encode it in ascii\n",
"# Shift the dictionary from +3\n",
"# Add <pad>, <start> and <unknown> tags\n",
"# Create a reverse dictionary : {index:word}\n",
"word_index = {w:(i+index_from) for w,i in word_index.items()}\n",
"word_index.update( {'<pad>':0, '<start>':1, '<unknown>':2, '<undef>':3,} )\n",
"index_word = {index:word for word,index in word_index.items()} \n",
"\n",
"# ---- A nice function to transpose :\n",
"#\n",
"def dataset2text(review):\n",
" return ' '.join([index_word.get(i, '?') for i in review])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Have a look :**"
]
},
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('\\nDictionary size : ', len(word_index))\n",
"for k in range(440,455):print(f'{k:2d} : {index_word[k]}' )\n",
"fidle.utils.subtitle('Review example :')\n",
"fidle.utils.subtitle('After translation :')\n",
"print(dataset2text(x_train[12]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4 - Preprocess the data (padding)\n",
"In order to be processed by an NN, all entries must have the **same length.** \n",
"We chose a review length of **review_len** \n",
"We will therefore complete them with a padding (of \\<pad\\>\\) "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x_train = keras.preprocessing.sequence.pad_sequences(x_train,\n",
" value = 0,\n",
" padding = 'post',\n",
" maxlen = review_len)\n",
"\n",
"x_test = keras.preprocessing.sequence.pad_sequences(x_test,\n",
" value = 0 ,\n",
" padding = 'post',\n",
" maxlen = review_len)\n",
"\n",
"fidle.utils.subtitle('After padding :')\n",
"fidle.utils.subtitle('In real words :')\n",
"print(dataset2text(x_train[12]))"
]
},
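{
"cell_type": "markdown",
"metadata": {},
"source": [
"A toy illustration (a sketch) of the padding and truncation behaviour : short sequences are padded at the end (padding='post'), while sequences longer than maxlen are truncated at the start (the default truncating='pre')."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ---- A sketch : pad_sequences on a toy example\n",
"#\n",
"demo = [[1, 2, 3], [1, 2, 3, 4, 5, 6]]\n",
"print(keras.preprocessing.sequence.pad_sequences(demo, value=0, padding='post', maxlen=4))\n",
"# -> [[1 2 3 0]     short sequence padded at the end\n",
"#     [3 4 5 6]]    long sequence truncated at the start"
]
},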
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"More documentation about this model functions :\n",
" - [Embedding](https://keras.io/api/layers/core_layers/embedding/)\n",
" - [GlobalAveragePooling1D](https://keras.io/api/layers/pooling_layers/global_average_pooling1d)"
"metadata": {},
"outputs": [],
"source": [
"model = keras.Sequential()\n",
"model.add(keras.layers.Embedding(input_dim = vocab_size, output_dim = dense_vector_size))\n",
"model.add(keras.layers.GRU(50))\n",
"model.add(keras.layers.Dense(1, activation='sigmoid'))\n",
"\n",
"model.compile(optimizer = 'rmsprop',\n",
" loss = 'binary_crossentropy',\n",
" metrics = ['accuracy'])\n",
"\n",
"model.summary()"
]
},
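{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check (a sketch) : even untrained, the model maps a batch of padded reviews to one probability per review, thanks to the final sigmoid unit."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ---- A sketch : shape check on two reviews\n",
"#\n",
"demo_batch = x_train[:2]                           # (2, review_len) integer ids\n",
"probas     = model.predict(demo_batch, verbose=0)  # (2, 1), values in [0,1]\n",
"print(probas.shape, probas.ravel())"
]
},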
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 6 - Train the model\n",
"### 6.1 - Add Callbacks"
"metadata": {},
"outputs": [],
"source": [
"os.makedirs(f'{run_dir}/models', mode=0o750, exist_ok=True)\n",
"save_dir = f'{run_dir}/models/best_model.keras'\n",
"\n",
"savemodel_callback = keras.callbacks.ModelCheckpoint( filepath=save_dir, monitor='val_accuracy', mode='max', save_best_only=True)"
]
},
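{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally (a sketch, assuming you want to cut training short), an EarlyStopping callback could complement the checkpoint above - if used, add it to the callbacks list passed to model.fit() below :"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ---- A sketch : stop when val_accuracy stops improving\n",
"#\n",
"earlystop_callback = keras.callbacks.EarlyStopping( monitor='val_accuracy',\n",
"                                                    mode='max',\n",
"                                                    patience=3,\n",
"                                                    restore_best_weights=True)"
]
},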
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 6.2 - Train it\n",
"Note : With a scale=0.2, batch_size=128, epochs=10 => Need 4' on a cpu laptop"
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"history = model.fit(x_train,\n",
" y_train,\n",
" batch_size = batch_size,\n",
" validation_data = (x_test, y_test),\n",
" callbacks = [savemodel_callback])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"execution_count": null,
"metadata": {},
"outputs": [],
"fidle.scrawler.history(history, save_as='02-history')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 7 - Evaluation\n",
"Reload and evaluate best model"
"execution_count": null,
"metadata": {},
"outputs": [],
"model = keras.models.load_model(f'{run_dir}/models/best_model.keras')\n",
"\n",
"# ---- Evaluate\n",
"score = model.evaluate(x_test, y_test, verbose=0)\n",
"\n",
"print('x_test / loss : {:5.4f}'.format(score[0]))\n",
"print('x_test / accuracy : {:5.4f}'.format(score[1]))\n",
"\n",
"values=[score[1], 1-score[1]]\n",
"fidle.scrawler.donut(values,[\"Accuracy\",\"Errors\"], title=\"#### Accuracy donut is :\", save_as='03-donut')\n",
"\n",
"# ---- Confusion matrix\n",
"\n",
"y_sigmoid = model.predict(x_test, verbose=fit_verbosity)\n",
"y_pred = y_sigmoid.copy()\n",
"y_pred[ y_sigmoid< 0.5 ] = 0\n",
"y_pred[ y_sigmoid>=0.5 ] = 1 \n",
"\n",
"fidle.scrawler.confusion_matrix_txt(y_test,y_pred,labels=range(2))\n",
"fidle.scrawler.confusion_matrix(y_test,y_pred,range(2), figsize=(8, 8),normalize=False, save_as='04-confusion-matrix')"
]
},
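{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a final sketch (not part of the original pipeline), we can score a hand-written review by encoding it with the word_index built in Step 3 - encode_review is a hypothetical helper introduced here for illustration :"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ---- A sketch : encode a raw review like the training data\n",
"#      (start tag, known words below vocab_size, <unknown> otherwise, then padding)\n",
"#\n",
"def encode_review(text):\n",
"    ids = [start_char]\n",
"    for w in re.findall(r\"[a-z']+\", text.lower()):\n",
"        i = word_index.get(w, oov_char)\n",
"        ids.append(i if i < vocab_size else oov_char)\n",
"    return keras.preprocessing.sequence.pad_sequences([ids], value=0, padding='post', maxlen=review_len)\n",
"\n",
"for review in ['this film is a masterpiece, I loved it', 'what a boring and terrible movie']:\n",
"    proba = model.predict(encode_review(review), verbose=0)[0,0]\n",
"    print(f'{proba:.2f}  <-  {review}')"
]
},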
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"<img width=\"80px\" src=\"../fidle/img/logo-paysage.svg\"></img>"
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.9.2 ('fidle-env')",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.2"
},
"vscode": {
"interpreter": {
"hash": "b3929042cc22c1274d74e3e946c52b845b57cb6d84f2d591ffe0519b38e4896d"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}