{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img width=\"800px\" src=\"../fidle/img/header.svg\"></img>\n",
"# <!-- TITLE --> [K3IMDB3] - Reload and reuse a saved model\n",
"<!-- DESC --> Retrieving a saved model to perform a sentiment analysis (movie review), using Keras 3 and PyTorch\n",
"<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->\n",
" - The objective is to guess whether our personal film reviews are **positive or negative** based on the analysis of the text. \n",
" - For this, we will use our **previously saved model**.\n",
"## What we're going to do :\n",
" - Retrieve our saved model\n",
" - Evaluate the result\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1 - Init python stuff"
]
},
{
"cell_type": "code",
"import os\n",
"os.environ['KERAS_BACKEND'] = 'torch'\n",
"import keras\n",
"import json,re\n",
"import numpy as np\n",
"run_id, run_dir, datasets_dir = fidle.init('K3IMDB3')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2 - Parameters\n",
"The words in the vocabulary are classified from the most frequent to the rarest. \n",
"`vocab_size` is the number of words we will remember in our vocabulary (the other words will be considered as unknown). \n",
"`review_len` is the review length \n",
"`saved_models` where our models were previously saved \n",
"`dictionaries_dir` is where we will go to save our dictionaries. (./data is a good choice)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"vocab_size = 10000\n",
"review_len = 256\n",
"\n",
"saved_models = './run/K3IMDB2'\n",
"dictionaries_dir = './data'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Override parameters (batch mode) - Just forget this cell"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fidle.override('vocab_size', 'review_len', 'saved_models', 'dictionaries_dir')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2 : Preparing the data\n",
"### 2.1 - Our reviews :"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"reviews = [ \"This film is particularly nice, a must see.\",\n",
" \"This film is a great classic that cannot be ignored.\",\n",
" \"I don't remember ever having seen such a movie...\",\n",
" \"This movie is just abominable and doesn't deserve to be seen!\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2 - Retrieve dictionaries\n",
"Note : This dictionary is generated by [02-Embedding-Keras](02-Keras-embedding.ipynb) notebook."
"metadata": {},
"outputs": [],
"source": [
"with open(f'{dictionaries_dir}/word_index.json', 'r') as fp:\n",
" index_word = { i:w for w,i in word_index.items() }\n",
" print('Dictionaries loaded. ', len(word_index), 'entries' )"
]
},
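{
"cell_type": "markdown",
"metadata": {},
"source": [
"Quick optional check (a minimal sketch): `word_index` maps a word to its rank, `index_word` does the reverse. The word `movie` below is just an illustrative example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative lookups (the word 'movie' is only an example)\n",
"print('Index of movie   :', word_index.get('movie', 'not in vocabulary'))\n",
"print('Word at index 20 :', index_word.get(20, '?'))"
]
},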
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3 - Clean, index and padd\n",
"Phases are split into words, punctuation is removed, sentence length is limited and padding is added... \n",
"**Note** : 1 is \"Start\" and 2 is \"unknown\""
"metadata": {},
"outputs": [],
"source": [
"start_char = 1 # Start of a sequence (padding is 0)\n",
"oov_char = 2 # Out-of-vocabulary\n",
"index_from = 3 # First word id\n",
"\n",
"nb_reviews = len(reviews)\n",
"x_data = []\n",
"\n",
"# ---- For all reviews\n",
"for review in reviews:\n",
" index_review=[start_char]\n",
" print(f'{start_char} ', end='')\n",
" # ---- For all words\n",
" for w in review.split(' '):\n",
" # ---- Clean it\n",
" w_clean = re.sub(r\"[^a-zA-Z0-9]\", \"\", w)\n",
" # ---- Not empty ?\n",
" if len(w_clean)>0:\n",
" # ---- Get the index - must be inside dict or is out of vocab (oov)\n",
" w_index = word_index.get(w, oov_char)\n",
" if w_index>vocab_size : w_index=oov_char\n",
" # ---- Add the index if < vocab_size\n",
" index_review.append(w_index)\n",
" x_data.append(index_review)\n",
" print()\n",
"x_data = keras.preprocessing.sequence.pad_sequences(x_data, value = 0, padding = 'post', maxlen = review_len)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.4 - Have a look"
]
},
{
"cell_type": "code",
"source": [
"def translate(x):\n",
" return ' '.join( [index_word.get(i,'?') for i in x] )\n",
"\n",
"for i in range(nb_reviews):\n",
" imax=np.where(x_data[i]==0)[0][0]+5\n",
" print(f'\\nText review {i} :', reviews[i])\n",
" print(f'tokens vector :', list(x_data[i][:imax]), '(...)')\n",
" print('Translation :', translate(x_data[i][:imax]), '(...)')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"metadata": {},
"outputs": [],
"source": [
"model = keras.models.load_model(f'{saved_models}/models/best_model.keras')"
]
},
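{
"cell_type": "markdown",
"metadata": {},
"source": [
"Quick optional check (a minimal sketch): display the architecture of the reloaded model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: summarize the reloaded model (assumes the load above succeeded)\n",
"model.summary()"
]
},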
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4 - Predict"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"y_pred = model.predict(x_data, verbose=0)"
]
},
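{
"cell_type": "markdown",
"metadata": {},
"source": [
"The raw output is one score per review, between 0 (negative) and 1 (positive). A minimal sketch, just to peek at it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# y_pred has shape (nb_reviews, 1) : one sentiment score per review\n",
"print('y_pred shape :', y_pred.shape)\n",
"print('Scores       :', y_pred[:,0])"
]
},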
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### And the winner is :"
]
},
{
"cell_type": "code",
"for i,review in enumerate(reviews):\n",
" rate = y_pred[i][0]\n",
" opinion = 'NEGATIVE :-(' if rate<0.5 else 'POSITIVE :-)' \n",
" print(f'{review:<70} => {rate:.2f} - {opinion}')"
{
"cell_type": "code",
"<img width=\"80px\" src=\"../fidle/img/logo-paysage.svg\"></img>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.9.2 ('fidle-env')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.2"
},
"vscode": {
"interpreter": {
"hash": "b3929042cc22c1274d74e3e946c52b845b57cb6d84f2d591ffe0519b38e4896d"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}