@@ -2,5 +2,11 @@
*/.ipynb_checkpoints/*
__pycache__
*/__pycache__/*
/run/**
*/data/*
run/
figs/
GTSRB/data
IMDB/data
MNIST/data
VAE/data
BHPD/data/*
!BHPD/data/BostonHousing.csv
%% Cell type:markdown id: tags:
<img width="800px" src="../fidle/img/00-Fidle-header-01.svg"></img>
# <!-- TITLE --> [GTS5] - CNN with GTSRB dataset - Full convolutions
<!-- DESC --> Episode 5 : A lot of models, a lot of datasets and a lot of results.
<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->
## Objectives :
- Try multiple solutions
- Design a generic and batch-usable code
The German Traffic Sign Recognition Benchmark (GTSRB) is a dataset with more than 50,000 photos of road signs, split into about 40 classes.
The final aim is to recognise them!
A description is available here: http://benchmark.ini.rub.de/?section=gtsrb&subsection=dataset
## What we're going to do :
Our main steps:
- Try several models with several datasets
- Save a Pandas/h5 report
- Write the code so it can also be run in batch mode
## Step 1 - Import
### 1.1 - Python
%% Cell type:code id: tags:
``` python
import tensorflow as tf
from tensorflow import keras
import numpy as np
import h5py
import os,time,json
import random
from IPython.display import display
VERSION='1.6'
```
%% Cell type:markdown id: tags:
### 1.2 - Where are we ?
%% Cell type:code id: tags:
``` python
# At GRICAD
dataset_dir = '/bettik/PROJECTS/pr-fidle/datasets/GTSRB/'
# At IDRIS
# dataset_dir = f'{os.getenv("WORK","")}/datasets/GTSRB'
# At Home
# dataset_dir = f'{os.getenv("HOME","")}/datasets/GTSRB'
print(f'We will use : dataset_dir={dataset_dir}')
```
%% Output
We will use : dataset_dir=/bettik/PROJECTS/pr-fidle/datasets/GTSRB
%% Cell type:markdown id: tags:
## Step 2 - Init and start
%% Cell type:code id: tags:
``` python
# ---- Where am I ?
now = time.strftime("%A %d %B %Y - %Hh%Mm%Ss")
here = os.getcwd()
random.seed(time.time())
tag_id = '{:06}'.format(random.randint(0,99999))
# ---- Who am I ?
oar_id = os.getenv("OAR_JOB_ID", "??")
slurm_id = os.getenv("SLURM_JOBID", "??")
print('\nFull Convolutions Notebook')
print(' Version : {}'.format(VERSION))
print(' Now is : {}'.format(now))
print(' OAR id : {}'.format(oar_id))
print(' SLURM id : {}'.format(slurm_id))
print(' Tag id : {}'.format(tag_id))
print(' Working directory : {}'.format(here))
print(' Dataset_dir : {}'.format(dataset_dir))
print(' TensorFlow version :',tf.__version__)
print(' Keras version :',tf.keras.__version__)
print(' for tensorboard : --logdir {}/run/logs_{}'.format(here,tag_id))
```
%% Output
Full Convolutions Notebook
Version : 1.6
Now is : Friday 28 February 2020 - 15h06m25s
OAR id : 5878410
SLURM id : ??
Tag id : 083052
Working directory : /home/paroutyj/fidle/GTSRB
Dataset_dir : /bettik/PROJECTS/pr-fidle/datasets/GTSRB
TensorFlow version : 2.0.0
Keras version : 2.2.4-tf
for tensorboard : --logdir /home/paroutyj/fidle/GTSRB/run/logs_083052
%% Cell type:markdown id: tags:
## Step 3 - Dataset loading
%% Cell type:code id: tags:
``` python
def read_dataset(dataset_dir, name):
    '''Reads h5 dataset from dataset_dir
    Args:
        dataset_dir : datasets dir
        name        : dataset name, without .h5
    Returns:  x_train,y_train,x_test,y_test data'''
    # ---- Read dataset
    filename = f'{dataset_dir}/{name}.h5'
    with h5py.File(filename,'r') as f:
        x_train = f['x_train'][:]
        y_train = f['y_train'][:]
        x_test  = f['x_test'][:]
        y_test  = f['y_test'][:]
    # ---- done
    return x_train,y_train,x_test,y_test
```
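%% Cell type:markdown id: tags:
As a quick sanity check, `read_dataset()` can be called directly. This is only a minimal sketch: the dataset name `set-24x24-L` is just an example and must exist in `dataset_dir`.
``` python
# Example only : 'set-24x24-L' must exist in dataset_dir
x_train, y_train, x_test, y_test = read_dataset(dataset_dir, 'set-24x24-L')

print('x_train :', x_train.shape)   # e.g. (n_train, 24, 24, 1) for a 24x24 grayscale set
print('y_train :', y_train.shape)
print('x_test  :', x_test.shape)
print('y_test  :', y_test.shape)
```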
%% Cell type:markdown id: tags:
## Step 4 - Models collection
%% Cell type:code id: tags:
``` python
# A basic model
#
def get_model_v1(lx,ly,lz):
    model = keras.models.Sequential()
    model.add( keras.layers.Conv2D(96, (3,3), activation='relu', input_shape=(lx,ly,lz)))
    model.add( keras.layers.MaxPooling2D((2, 2)))
    model.add( keras.layers.Dropout(0.2))
    model.add( keras.layers.Conv2D(192, (3, 3), activation='relu'))
    model.add( keras.layers.MaxPooling2D((2, 2)))
    model.add( keras.layers.Dropout(0.2))
    model.add( keras.layers.Flatten())
    model.add( keras.layers.Dense(1500, activation='relu'))
    model.add( keras.layers.Dropout(0.5))
    model.add( keras.layers.Dense(43, activation='softmax'))
    return model

# A more sophisticated model
#
def get_model_v2(lx,ly,lz):
    model = keras.models.Sequential()
    model.add( keras.layers.Conv2D(64, (3, 3), padding='same', input_shape=(lx,ly,lz), activation='relu'))
    model.add( keras.layers.Conv2D(64, (3, 3), activation='relu'))
    model.add( keras.layers.MaxPooling2D(pool_size=(2, 2)))
    model.add( keras.layers.Dropout(0.2))
    model.add( keras.layers.Conv2D(128, (3, 3), padding='same', activation='relu'))
    model.add( keras.layers.Conv2D(128, (3, 3), activation='relu'))
    model.add( keras.layers.MaxPooling2D(pool_size=(2, 2)))
    model.add( keras.layers.Dropout(0.2))
    model.add( keras.layers.Conv2D(256, (3, 3), padding='same',activation='relu'))
    model.add( keras.layers.Conv2D(256, (3, 3), activation='relu'))
    model.add( keras.layers.MaxPooling2D(pool_size=(2, 2)))
    model.add( keras.layers.Dropout(0.2))
    model.add( keras.layers.Flatten())
    model.add( keras.layers.Dense(512, activation='relu'))
    model.add( keras.layers.Dropout(0.5))
    model.add( keras.layers.Dense(43, activation='softmax'))
    return model

def get_model_v3(lx,ly,lz):
    model = keras.models.Sequential()
    model.add(tf.keras.layers.Conv2D(32, (5, 5), padding='same', activation='relu', input_shape=(lx,ly,lz)))
    model.add(tf.keras.layers.BatchNormalization(axis=-1))
    model.add(tf.keras.layers.MaxPooling2D(pool_size=(2, 2)))
    model.add(tf.keras.layers.Dropout(0.2))
    model.add(tf.keras.layers.Conv2D(64, (5, 5), padding='same', activation='relu'))
    model.add(tf.keras.layers.BatchNormalization(axis=-1))
    model.add(tf.keras.layers.Conv2D(128, (5, 5), padding='same', activation='relu'))
    model.add(tf.keras.layers.BatchNormalization(axis=-1))
    model.add(tf.keras.layers.MaxPooling2D(pool_size=(2, 2)))
    model.add(tf.keras.layers.Dropout(0.2))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(512, activation='relu'))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Dropout(0.4))
    model.add(tf.keras.layers.Dense(43, activation='softmax'))
    return model
```
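%% Cell type:markdown id: tags:
Each builder takes the input shape `(lx, ly, lz)` and returns an uncompiled Keras model. A minimal sketch to inspect one of them (the 24x24 RGB shape is only an example):
``` python
# Example only : build the basic model for a 24x24 RGB input and show its layers
model = get_model_v1(24, 24, 3)
model.summary()
```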
%% Cell type:markdown id: tags:
## Step 5 - Multiple datasets, multiple models ;-)
%% Cell type:code id: tags:
``` python
def multi_run(dataset_dir, datasets, models, datagen=None,
              train_size=1, test_size=1, batch_size=64, epochs=16,
              verbose=0, extension_dir='last'):
    """
    Launches every dataset-model combination
    args:
        dataset_dir   : directory of the datasets
        datasets      : list of dataset names (without .h5)
        models        : dict of models, like { "model name":get_model, ... }
        datagen       : data generator, or None (None)
        train_size    : fraction of the train dataset to use, 1 means all (1)
        test_size     : fraction of the test dataset to use, 1 means all (1)
        batch_size    : batch size (64)
        epochs        : number of epochs (16)
        verbose       : verbosity level (0)
        extension_dir : postfix for logs and models dir (last)
    return:
        report : report as a dict, ready for Pandas
    """
    # ---- Logs and models dir
    #
    os.makedirs(f'./run/logs_{extension_dir}',   mode=0o750, exist_ok=True)
    os.makedirs(f'./run/models_{extension_dir}', mode=0o750, exist_ok=True)
    # ---- Columns of output
    #
    output={}
    output['Dataset'] = []
    output['Size']    = []
    for m in models:
        output[m+'_Accuracy'] = []
        output[m+'_Duration'] = []
    # ---- Let's go
    #
    for d_name in datasets:
        print("\nDataset : ", d_name)
        # ---- Read dataset
        x_train,y_train,x_test,y_test = read_dataset(dataset_dir, d_name)
        d_size = os.path.getsize(f'{dataset_dir}/{d_name}.h5')/(1024*1024)
        output['Dataset'].append(d_name)
        output['Size'].append(d_size)
        # ---- Get the shape
        (n,lx,ly,lz) = x_train.shape
        n_train = int( x_train.shape[0] * train_size )
        n_test  = int( x_test.shape[0]  * test_size  )
        # ---- For each model
        for m_name,m_function in models.items():
            print("    Run model {} : ".format(m_name), end='')
            # ---- Get model
            try:
                model = m_function(lx,ly,lz)
                # ---- Compile it
                model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
                # ---- Callback : tensorboard
                log_dir = f"./run/logs_{extension_dir}/tb_{d_name}_{m_name}"
                tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)
                # ---- Callback : best model
                save_dir = f"./run/models_{extension_dir}/model_{d_name}_{m_name}.h5"
                bestmodel_callback = tf.keras.callbacks.ModelCheckpoint(filepath=save_dir, verbose=0, monitor='accuracy', save_best_only=True)
                # ---- Train
                start_time = time.time()
                if datagen is None:
                    # ---- No data augmentation (datagen=None) --------------------------------------
                    history = model.fit(x_train[:n_train], y_train[:n_train],
                                        batch_size      = batch_size,
                                        epochs          = epochs,
                                        verbose         = verbose,
                                        validation_data = (x_test[:n_test], y_test[:n_test]),
                                        callbacks       = [tensorboard_callback, bestmodel_callback])
                else:
                    # ---- Data augmentation (datagen given) ----------------------------------------
                    datagen.fit(x_train)
                    history = model.fit(datagen.flow(x_train, y_train, batch_size=batch_size),
                                        steps_per_epoch = int(n_train/batch_size),
                                        epochs          = epochs,
                                        verbose         = verbose,
                                        validation_data = (x_test[:n_test], y_test[:n_test]),
                                        callbacks       = [tensorboard_callback, bestmodel_callback])
                # ---- Result
                end_time = time.time()
                duration = end_time-start_time
                accuracy = max(history.history["val_accuracy"])*100
                #
                output[m_name+'_Accuracy'].append(accuracy)
                output[m_name+'_Duration'].append(duration)
                print(f"Accuracy={accuracy:.2f} and Duration={duration:.2f}")
            except Exception:
                output[m_name+'_Accuracy'].append(0)
                output[m_name+'_Duration'].append(999)
                print('-')
    return output
```
%% Cell type:markdown id: tags:
## Step 6 - Run !
%% Cell type:code id: tags:
``` python
start_time = time.time()
print('\n---- Run','-'*50)
# --------- Datasets, models, and more.. -----------------------------------
#
# ---- For tests
# datasets = ['set-24x24-L', 'set-24x24-RGB']
# models = {'v1':get_model_v1, 'v4':get_model_v2}
# batch_size = 64
# epochs = 2
# train_size = 0.1
# test_size = 0.1
# with_datagen = False
# verbose = 0
#
# ---- All possibilities
# datasets = ['set-24x24-L', 'set-24x24-RGB', 'set-48x48-L', 'set-48x48-RGB', 'set-24x24-L-LHE', 'set-24x24-RGB-HE', 'set-48x48-L-LHE', 'set-48x48-RGB-HE']
# models = {'v1':get_model_v1, 'v2':get_model_v2, 'v3':get_model_v3}
# batch_size = 64
# epochs = 16
# train_size = 1
# test_size = 1
# with_datagen = False
# verbose = 0
#
# ---- Data augmentation
datasets = ['set-48x48-RGB']
models = {'v2':get_model_v2}
batch_size = 64
epochs = 20
train_size = 1
test_size = 1
with_datagen = True
verbose = 0
#
# ---------------------------------------------------------------------------
# ---- Data augmentation
#
if with_datagen :
    datagen = keras.preprocessing.image.ImageDataGenerator(featurewise_center=False,
                                                            featurewise_std_normalization=False,
                                                            width_shift_range=0.1,
                                                            height_shift_range=0.1,
                                                            zoom_range=0.2,
                                                            shear_range=0.1,
                                                            rotation_range=10.)
else:
    datagen=None
# ---- Run
#
output = multi_run(dataset_dir,
                   datasets, models,
                   datagen=datagen,
                   train_size=train_size, test_size=test_size,
                   batch_size=batch_size, epochs=epochs,
                   verbose=verbose,
                   extension_dir=tag_id)
# ---- Save report
#
report={}
report['output']=output
report['description']='train_size={} test_size={} batch_size={} epochs={} data_aug={}'.format(train_size,test_size,batch_size,epochs,with_datagen)
report_name=f'./run/report_{tag_id}.json'
with open(report_name, 'w') as file:
    json.dump(report, file)
print('\nReport saved as ',report_name)
end_time = time.time()
duration = end_time-start_time
print(f'Duration : {duration:.2f} s')
print('-'*59)
```
%% Output
---- Run --------------------------------------------------
Dataset : set-24x24-L
Run model v1 : Accuracy=39.98 and Duration=2.23
Run model v4 : Accuracy=6.18 and Duration=2.17
Dataset : set-24x24-RGB
Run model v1 : Accuracy=53.52 and Duration=2.20
Run model v4 : Accuracy=11.80 and Duration=2.01
Report saved as ./run/report_083052.json
Duration : 10.37 s
-----------------------------------------------------------
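%% Cell type:markdown id: tags:
The report is saved as JSON; to exploit it with Pandas afterwards, a minimal sketch could look like this (assuming pandas is installed and the report file written above exists):
``` python
import json
import pandas as pd

# Example only : reload the report written above and turn it into a DataFrame
with open(f'./run/report_{tag_id}.json') as f:
    report = json.load(f)

df = pd.DataFrame(report['output'])
print(report['description'])
display(df)
```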
%% Cell type:markdown id: tags:
## Step 7 - That's all folks..
%% Cell type:code id: tags:
``` python
print('\n{}'.format(time.strftime("%A %-d %B %Y, %H:%M:%S")))
print("The work is done.\n")
```
%% Cell type:markdown id: tags:
---
<img width="80px" src="../fidle/img/00-Fidle-logo-01.svg"></img>
%% Cell type:markdown id: tags:
<img width="800px" src="../fidle/img/00-Fidle-header-01.svg"></img>
# <!-- TITLE --> [GTS6] - CNN with GTSRB dataset - Full convolutions as a batch
<!-- DESC --> Episode 6 : Run Full convolution notebook as a batch
<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->
## Objectives :
- Run a notebook code as a **job**
- Follow up with Tensorboard
The German Traffic Sign Recognition Benchmark (GTSRB) is a dataset with more than 50,000 photos of road signs, split into about 40 classes.
The final aim is to recognise them!
A description is available here: http://benchmark.ini.rub.de/?section=gtsrb&subsection=dataset
## What we're going to do :
Our main steps:
- Run Full-convolution.ipynb as a batch :
- Notebook mode
- Script mode
- Tensorboard follow up
%% Cell type:markdown id: tags:
### Step 0 - Just for convenience
%% Cell type:code id: tags:
``` python
import sys
sys.path.append('..')
import fidle.pwk as ooo
ooo.init()
```
%% Output
FIDLE 2020 - Practical Work Module
Version : 0.4.3
Run time : Friday 28 February 2020, 17:55:56
TensorFlow version : 2.0.0
Keras version : 2.2.4-tf
%% Cell type:markdown id: tags:
## Step 1 - Run a notebook as a batch
To run a notebook from the command line :
```jupyter nbconvert (...) --to notebook --execute <notebook>```
For example :
```jupyter nbconvert --ExecutePreprocessor.timeout=-1 --to notebook --output='./run/full_convolutions' --execute '05-Full-convolutions.ipynb'```
%% Cell type:markdown id: tags:
## Step 2 - Export as a script (What we're going to do this time)
To export a notebook as a script :
```jupyter nbconvert --to script <notebook>```
To run the script :
```ipython <script>```
%% Cell type:code id: tags:
``` python
%%bash
# ---- This will convert a notebook to a notebook.py script
#
jupyter nbconvert --to script --output='./run/full_convolutions_01' '05-Full-convolutions.ipynb'
```
%% Output
[NbConvertApp] Converting notebook 05-Full-convolutions.ipynb to script
[NbConvertApp] Writing 13061 bytes to ./run/full_convolutions_01.py
%% Cell type:code id: tags:
``` python
!ls -l ./run/*.py
```
%% Output
-rwxr-xr-x 1 paroutyj l-simap 13061 Feb 28 17:56 ./run/full_convolutions_01.py
%% Cell type:markdown id: tags:
## Step 3 - Batch submission
### 3.1 - Create the batch script :
%% Cell type:code id: tags:
``` python
%%writefile "./run/full_convolutions_01.sh"
#!/bin/bash
#OAR -n Full convolutions
#OAR -t gpu
#OAR -l /nodes=1/gpudevice=1,walltime=01:00:00
#OAR --stdout full_convolutions_%jobid%.out
#OAR --stderr full_convolutions_%jobid%.err
#OAR --project fidle
#---- With cpu
# use :
# OAR -l /nodes=1/core=32,walltime=02:00:00
# and add a 2>/dev/null to ipython xxx
# ----------------------------------
# _ _ _
# | |__ __ _| |_ ___| |__
# | '_ \ / _` | __/ __| '_ \
# | |_) | (_| | || (__| | | |
# |_.__/ \__,_|\__\___|_| |_|
# Full convolutions
# ----------------------------------
#
CONDA_ENV=fidle
RUN_DIR=~/fidle/GTSRB
RUN_SCRIPT=./run/full_convolutions_01.py
# ---- Cuda Conda initialization
#
echo '------------------------------------------------------------'
echo "Start : $0"
echo '------------------------------------------------------------'
#
source /applis/environments/cuda_env.sh dahu 10.0
source /applis/environments/conda.sh
#
conda activate "$CONDA_ENV"
# ---- Run it...
#
cd $RUN_DIR
ipython $RUN_SCRIPT
```
%% Output
Overwriting ./run/full_convolutions_01.sh
%% Cell type:markdown id: tags:
### 3.2 - Have a look
%% Cell type:code id: tags:
``` python
%%bash
chmod 755 ./run/*.sh
chmod 755 ./run/*.py
ls -l ./run/*full_convolutions*
```
%% Output
-rwxr-xr-x 1 paroutyj l-simap 13061 Feb 28 16:31 ./run/full_convolutions_01.py
-rwxr-xr-x 1 paroutyj l-simap 1015 Feb 28 16:31 ./run/full_convolutions_01.sh
%% Cell type:markdown id: tags:
### 3.3 - Job submission
This has to be done on the front-end (login) node :
```bash
# hostname
f-dahu
# pwd
/home/paroutyj
# oarsub -S ~/fidle/GTSRB/run/full_convolutions_01.sh
[GPUNODE] Adding gpu node restriction
[ADMISSION RULE] Modify resource description with type constraints
#oarstat -u
Job id S User Duration System message
--------- - -------- ---------- ------------------------------------------------
5878410 R paroutyj 0:19:56 R=8,W=1:0:0,J=I,P=fidle,T=gpu (Karma=0.005,quota_ok)
5896266 W paroutyj 0:00:00 R=8,W=1:0:0,J=B,N=Full convolutions,P=fidle,T=gpu
# ls -l
total 8
-rw-r--r-- 1 paroutyj l-simap 0 Feb 28 15:58 full_convolutions_5896266.err
-rw-r--r-- 1 paroutyj l-simap 5703 Feb 28 15:58 full_convolutions_5896266.out
```
%% Cell type:markdown id: tags:
<div class='todo'>
Your mission, should you choose to accept it: run our full_convolutions code in batch mode.<br>
To do so :
<ul>
<li>Validate the full_convolution notebook on short tests</li>
<li>Submit it in batch mode for validation</li>
<li>Modify the notebook for a full run and submit it :-)</li>
</ul>
</div>
%% Cell type:markdown id: tags:
---
<img width="80px" src="../fidle/img/00-Fidle-logo-01.svg"></img>
%% Cell type:markdown id: tags:
<img width="800px" src="../fidle/img/00-Fidle-header-01.svg"></img>
# <!-- TITLE --> [TSB1] - Tensorboard with/from Jupyter
<!-- DESC --> 4 ways to use Tensorboard from the Jupyter environment
<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->
## Objectives :
- Using Tensorboard
- ...and if possible, simply and easily !
About [Tensorboard](https://www.tensorflow.org/tensorboard/get_started)
## What we're going to do :
- Using Tensorboard
%% Cell type:markdown id: tags:
## Option 1 - From Jupyter
It's the easiest and most fun way: launch Tensorboard directly from Jupyter.
Unfortunately, this feature seems to be a bit capricious with recent versions of Jupyter...
It works on Jean-Zay (at **IDRIS**), but only with Jupyter Notebook.
%% Cell type:markdown id: tags:
## Option 2 - Shell command
That's what we're going to use at **GRICAD**.
In practice, this is like starting tensorboard from the command line.
More about it : `tensorboard --help`
%% Cell type:code id: tags:
``` python
%%bash
tensorboard_start --logdir ./run/logs
```
%% Cell type:code id: tags:
``` python
%%bash
tensorboard_status
```
%% Cell type:code id: tags:
``` python
%%bash
tensorboard_stop
```
%% Cell type:markdown id: tags:
## Option 3 - Magic command
**Start**
%% Cell type:code id: tags:
``` python
%load_ext tensorboard
```
%% Cell type:markdown id: tags:
For example for use on a GRICAD cluster :
%% Cell type:code id: tags:
``` python
%tensorboard --port 21277 --host 0.0.0.0 --logdir ./run/logs
```
%% Cell type:markdown id: tags:
**Stop**
No way... use bash method
## Option 4 - Tensorboard as a module
**Start**
%% Cell type:code id: tags:
``` python
import tensorboard.notebook as tsb
```
%% Cell type:code id: tags:
``` python
tsb.start('--port 21277 --host 0.0.0.0 --logdir ./run/logs')
```
%% Cell type:markdown id: tags:
**Check**
%% Cell type:code id: tags:
``` python
a=tsb.list()
```
%% Output
No known TensorBoard instances running.
%% Cell type:markdown id: tags:
**Stop**
No way... use bash method
%% Cell type:code id: tags:
``` python
!kill 214798
```
%% Cell type:markdown id: tags:
---
<img width="80px" src="../fidle/img/00-Fidle-logo-01.svg"></img>
%% Cell type:markdown id: tags:
<img width="800px" src="../fidle/img/00-Fidle-header-01.svg"></img>
# <!-- TITLE --> [IMDB2] - Text embedding with IMDB - Reloaded
<!-- DESC --> Example of reusing a previously saved model
<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->
## Objectives :
- The objective is to guess whether film reviews are **positive or negative** based on the analysis of the text.
- For this, we will use our **previously saved model**.
The original dataset can be found **[here](http://ai.stanford.edu/~amaas/data/sentiment/)**
Note that [IMDb.com](https://imdb.com) offers several easy-to-use [datasets](https://www.imdb.com/interfaces/)
For simplicity's sake, we'll use the dataset directly [embedded in Keras](https://www.tensorflow.org/api_docs/python/tf/keras/datasets)
## What we're going to do :
- Preparing the data
- Retrieve our saved model
- Evaluate the result
%% Cell type:markdown id: tags:
## Step 1 - Init python stuff
%% Cell type:code id: tags:
``` python
import numpy as np
import tensorflow as tf
import tensorflow.keras as keras
import tensorflow.keras.datasets.imdb as imdb
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import pandas as pd
import os,sys,h5py,json,re
from importlib import reload
sys.path.append('..')
import fidle.pwk as ooo
ooo.init()
```
%% Output
FIDLE 2020 - Practical Work Module
Version : 0.2.9
Run time : Wednesday 19 February 2020, 22:08:28
TensorFlow version : 2.0.0
Keras version : 2.2.4-tf
%% Cell type:markdown id: tags:
## Step 2 : Preparing the data
### 2.1 - Our reviews :
%% Cell type:code id: tags:
``` python
reviews = [ "This film is particularly nice, a must see.",
"Some films are great classics and cannot be ignored.",
"This movie is just abominable and doesn't deserve to be seen!"]
```
%% Cell type:markdown id: tags:
### 2.2 - Retrieve dictionaries
%% Cell type:code id: tags:
``` python
with open('./data/word_index.json', 'r') as fp:
    word_index = json.load(fp)

index_word = {index:word for word,index in word_index.items()}
```
%% Cell type:markdown id: tags:
### 2.3 - Clean, index and pad
%% Cell type:code id: tags:
``` python
max_len    = 256
vocab_size = 10000

nb_reviews = len(reviews)
x_data     = []

# ---- For all reviews
for review in reviews:
    # ---- First index must be <start>
    index_review=[1]
    # ---- For all words
    for w in review.split(' '):
        # ---- Clean it
        w_clean = re.sub(r"[^a-zA-Z0-9]", "", w)
        # ---- Not empty ?
        if len(w_clean)>0:
            # ---- Get the index
            w_index = word_index.get(w,2)
            if w_index>vocab_size : w_index=2
            # ---- Add the index if < vocab_size
            index_review.append(w_index)
    # ---- Add the indexed review
    x_data.append(index_review)

# ---- Padding
x_data = keras.preprocessing.sequence.pad_sequences(x_data, value=0, padding='post', maxlen=max_len)
```
%% Cell type:markdown id: tags:
### 2.4 - Have a look
%% Cell type:code id: tags:
``` python
def translate(x):
    return ' '.join( [index_word.get(i,'?') for i in x] )

for i in range(nb_reviews):
    imax = np.where(x_data[i]==0)[0][0]+5
    print(f'\nText review :', reviews[i])
    print(f'x_train[{i:}] :', list(x_data[i][:imax]), '(...)')
    print( 'Translation :', translate(x_data[i][:imax]), '(...)')
```
%% Output
Text review : This film is particularly nice, a must see.
x_train[0] : [1, 2, 22, 9, 572, 2, 6, 215, 2, 0, 0, 0, 0, 0] (...)
Translation : <start> <unknown> film is particularly <unknown> a must <unknown> <pad> <pad> <pad> <pad> <pad> (...)
Text review : Some films are great classics and cannot be ignored.
x_train[1] : [1, 2, 108, 26, 87, 2239, 5, 566, 30, 2, 0, 0, 0, 0, 0] (...)
Translation : <start> <unknown> films are great classics and cannot be <unknown> <pad> <pad> <pad> <pad> <pad> (...)
Text review : This movie is just abominable and doesn't deserve to be seen!
x_train[2] : [1, 2, 20, 9, 43, 2, 5, 152, 1833, 8, 30, 2, 0, 0, 0, 0, 0] (...)
Translation : <start> <unknown> movie is just <unknown> and doesn't deserve to be <unknown> <pad> <pad> <pad> <pad> <pad> (...)
%% Cell type:markdown id: tags:
## Step 3 - Bring back the model
%% Cell type:code id: tags:
``` python
model = keras.models.load_model('./run/models/best_model.h5')
```
%% Cell type:markdown id: tags:
## Step 4 - Predict
%% Cell type:code id: tags:
``` python
y_pred = model.predict(x_data)
```
%% Cell type:markdown id: tags:
#### And the winner is :
%% Cell type:code id: tags:
``` python
for i in range(nb_reviews):
    print(f'\n{reviews[i]:<70} =>', ('NEGATIVE' if y_pred[i][0]<0.5 else 'POSITIVE'), f'({y_pred[i][0]:.2f})')
```
%% Output
This film is particularly nice, a must see. => POSITIVE (0.54)
Some films are great classics and cannot be ignored. => POSITIVE (0.61)
This movie is just abominable and doesn't deserve to be seen! => NEGATIVE (0.33)
%% Cell type:markdown id: tags:
---
<img width="80px" src="../fidle/img/00-Fidle-logo-01.svg"></img>
%% Cell type:markdown id: tags:
<img width="800px" src="../fidle/img/00-Fidle-header-01.svg"></img>
# <!-- TITLE --> [IMDB3] - Text embedding/LSTM model with IMDB
<!-- DESC --> Still the same problem, but with a network combining embedding and LSTM
<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->
## Objectives :
- The objective is to guess whether film reviews are **positive or negative** based on the analysis of the text.
- Use of a model combining embedding and LSTM
The original dataset can be found **[here](http://ai.stanford.edu/~amaas/data/sentiment/)**
Note that [IMDb.com](https://imdb.com) offers several easy-to-use [datasets](https://www.imdb.com/interfaces/)
For simplicity's sake, we'll use the dataset directly [embedded in Keras](https://www.tensorflow.org/api_docs/python/tf/keras/datasets)
## What we're going to do :
- Retrieve data
- Preparing the data
- Build an Embedding/LSTM model
- Train the model
- Evaluate the result
%% Cell type:markdown id: tags:
## Step 1 - Init python stuff
%% Cell type:code id: tags:
``` python
import numpy as np
import tensorflow as tf
import tensorflow.keras as keras
import tensorflow.keras.datasets.imdb as imdb
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import os,sys,h5py,json
from importlib import reload
sys.path.append('..')
import fidle.pwk as ooo
ooo.init()
```
%% Cell type:markdown id: tags:
## Step 2 - Retrieve data
**From Keras :**
This IMDB dataset can be fetched directly from [Keras datasets](https://www.tensorflow.org/api_docs/python/tf/keras/datasets)
Due to their nature, textual data can be somewhat complex.
### 2.1 - Data structure :
The dataset is composed of 2 parts: **reviews** and **opinions** (positive/negative), plus a **dictionary**
- dataset = (reviews, opinions)
- reviews = \[ review_0, review_1, ... \]
- review_i = \[ int1, int2, ... \] where each int is the index of a word in the dictionary
- opinions = \[ int0, int1, ... \] where int_j is 0 if the opinion is negative and 1 if it is positive
- dictionary = { word1:int1, word2:int2, ... }
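For illustration only, such a structure could look like this (all values below are invented, not taken from the real dataset):
``` python
# Purely illustrative toy values
reviews    = [[1, 14, 22, 9, 2], [1, 43, 6, 105]]       # two reviews, as word indices
opinions   = [1, 0]                                      # positive, negative
dictionary = {'<start>': 1, 'film': 22, 'great': 105}    # word -> index (excerpt)
```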
%% Cell type:markdown id: tags:
### 2.2 - Get dataset
For simplicity, we will use a pre-formatted dataset.
See : https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb/load_data
However, Keras offers some useful tools for formatting textual data.
See : https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text
%% Cell type:code id: tags:
``` python
vocab_size = 10000
# ----- Retrieve x,y
#
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words  = vocab_size,
                                                       skip_top   = 0,
                                                       maxlen     = None,
                                                       seed       = 42,
                                                       start_char = 1,
                                                       oov_char   = 2,
                                                       index_from = 3)
```
%% Cell type:code id: tags:
``` python
print(" Max(x_train,x_test) : ", ooo.rmax([x_train,x_test]) )
print(" x_train : {} y_train : {}".format(x_train.shape, y_train.shape))
print(" x_test : {} y_test : {}".format(x_test.shape, y_test.shape))
print('\nReview example (x_train[12]) :\n\n',x_train[12])
```
%% Cell type:markdown id: tags:
### 2.3 - Have a look for humans (optional)
When we loaded the dataset, we asked for \<start\> to be coded as 1 and \<unknown word\> as 2.
Consequently, all word indices are shifted by 3, which is what the parameter index_from=3 does.
%% Cell type:code id: tags:
``` python
# ---- Retrieve the dictionary {word:index}
word_index = imdb.get_word_index()

# ---- Shift the dictionary by +3
word_index = {w:(i+3) for w,i in word_index.items()}

# ---- Add <pad>, <start> and <unknown> tags
word_index.update( {'<pad>':0, '<start>':1, '<unknown>':2} )

# ---- Create a reverse dictionary : {index:word}
index_word = {index:word for word,index in word_index.items()}

# ---- Add a nice function to transpose :
#
def dataset2text(review):
    return ' '.join([index_word.get(i, '?') for i in review])
```
%% Cell type:code id: tags:
``` python
print('\nDictionary size : ', len(word_index))
print('\nReview example (x_train[12]) :\n\n',x_train[12])
print('\nIn real words :\n\n', dataset2text(x_train[12]))
```
%% Cell type:markdown id: tags:
### 2.4 - Have a look for neurons
%% Cell type:code id: tags:
``` python
plt.figure(figsize=(12, 6))
ax=sns.distplot([len(i) for i in x_train],bins=60)
ax.set_title('Distribution of reviews by size')
plt.xlabel("Review's sizes")
plt.ylabel('Density')
ax.set_xlim(0, 1500)
plt.show()
```
%% Cell type:markdown id: tags:
## Step 3 - Preprocess the data
In order to be processed by a neural network, all entries must have the same length.
We choose a review length of **review_len**.
Shorter reviews will therefore be completed with padding (with \<pad\>).
%% Cell type:code id: tags:
``` python
review_len = 256

x_train = keras.preprocessing.sequence.pad_sequences(x_train,
                                                     value   = 0,
                                                     padding = 'post',
                                                     maxlen  = review_len)
x_test  = keras.preprocessing.sequence.pad_sequences(x_test,
                                                     value   = 0,
                                                     padding = 'post',
                                                     maxlen  = review_len)

print('\nReview example (x_train[12]) :\n\n', x_train[12])
print('\nIn real words :\n\n', dataset2text(x_train[12]))
```
%% Cell type:markdown id: tags:
### Save dataset and dictionary (can be useful)
%% Cell type:code id: tags:
``` python
os.makedirs('./data', mode=0o750, exist_ok=True)

with h5py.File('./data/dataset_imdb.h5', 'w') as f:
    f.create_dataset("x_train", data=x_train)
    f.create_dataset("y_train", data=y_train)
    f.create_dataset("x_test",  data=x_test)
    f.create_dataset("y_test",  data=y_test)

with open('./data/word_index.json', 'w') as fp:
    json.dump(word_index, fp)

with open('./data/index_word.json', 'w') as fp:
    json.dump(index_word, fp)

print('Saved.')
```
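%% Cell type:markdown id: tags:
To reuse these files later, a minimal reloading sketch could look like this (assuming the files above have just been written):
``` python
import json
import h5py

# Example only : reload what was saved above
with h5py.File('./data/dataset_imdb.h5', 'r') as f:
    x_train = f['x_train'][:]
    y_train = f['y_train'][:]
    x_test  = f['x_test'][:]
    y_test  = f['y_test'][:]

with open('./data/word_index.json', 'r') as fp:
    word_index = json.load(fp)

print(x_train.shape, x_test.shape, len(word_index))
```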
%% Cell type:markdown id: tags:
## Step 4 - Build the model
A few remarks :
1. We'll choose a dense vector size for the embedding output with **dense_vector_size**
2. **GlobalAveragePooling1D** averages over the sequence dimension : (None, lx, ly) -> (None, ly)
In other words: we average the set of word vectors of a sentence
3. The Keras embedding layer is trained in a supervised way. It is a layer going from *vocab_size* inputs to *n_neurons* outputs that maintains a table of vectors (its weights are the vectors). This layer does not compute an output the way a normal layer does: it returns the values of the stored vectors, n words => n vectors (which are then combined by the pooling). See the short sketch below, just before the build code.
See : https://stats.stackexchange.com/questions/324992/how-the-embedding-layer-is-trained-in-keras-embedding-layer
To go further : https://www.liip.ch/en/blog/sentiment-detection-with-keras-word-embeddings-and-lstm-deep-learning-networks
### 4.1 - Build
More documentation about :
- [Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding)
- [GlobalAveragePooling1D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GlobalAveragePooling1D)
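To make remark 3 above concrete before building the real model, here is a minimal, stand-alone sketch (the word indices are toy values invented for the example):
``` python
import numpy as np
from tensorflow import keras

# Toy batch : 2 "reviews" of 4 word indices each (arbitrary values)
toy = np.array([[1, 12, 7, 0],
                [1,  3, 3, 0]])

emb     = keras.layers.Embedding(input_dim=100, output_dim=8)
vectors = emb(toy)
print(vectors.shape)    # (2, 4, 8) : one 8-dim vector per word

pooled = keras.layers.GlobalAveragePooling1D()(vectors)
print(pooled.shape)     # (2, 8) : one averaged vector per review
```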
%% Cell type:code id: tags:
``` python
def get_model(dense_vector_size=128):
    model = keras.Sequential()
    model.add(keras.layers.Embedding(input_dim    = vocab_size,
                                     output_dim   = dense_vector_size,
                                     input_length = review_len))
    model.add(keras.layers.LSTM(128, dropout=0.2, recurrent_dropout=0.2))
    model.add(keras.layers.Dense(1, activation='sigmoid'))
    model.compile(optimizer = 'adam',
                  loss      = 'binary_crossentropy',
                  metrics   = ['accuracy'])
    return model
```
%% Cell type:markdown id: tags:
## Step 5 - Train the model
### 5.1 - Get it
%% Cell type:code id: tags:
``` python
model = get_model()
model.summary()
```
%% Cell type:markdown id: tags:
### 5.2 - Add callback
%% Cell type:code id: tags:
``` python
os.makedirs('./run/models', mode=0o750, exist_ok=True)
save_dir = "./run/models/best_model.h5"
savemodel_callback = tf.keras.callbacks.ModelCheckpoint(filepath=save_dir, verbose=0, save_best_only=True)
```
%% Cell type:markdown id: tags:
### 5.3 - Train it
GPU : batch_size=512 : 305s
%% Cell type:code id: tags:
``` python
%%time
n_epochs = 10
batch_size = 32
history = model.fit(x_train,
                    y_train,
                    epochs          = n_epochs,
                    batch_size      = batch_size,
                    validation_data = (x_test, y_test),
                    verbose         = 1,
                    callbacks       = [savemodel_callback])
```
%% Cell type:markdown id: tags:
## Step 6 - Evaluate
### 6.1 - Training history
%% Cell type:code id: tags:
``` python
ooo.plot_history(history)
```
%% Cell type:markdown id: tags:
### 6.2 - Reload and evaluate best model
%% Cell type:code id: tags:
``` python
model = keras.models.load_model('./run/models/best_model.h5')
# ---- Evaluate
reload(ooo)
score = model.evaluate(x_test, y_test, verbose=0)
print('x_test / loss : {:5.4f}'.format(score[0]))
print('x_test / accuracy : {:5.4f}'.format(score[1]))
values=[score[1], 1-score[1]]
ooo.plot_donut(values,["Accuracy","Errors"], title="#### Accuracy donut is :")
# ---- Confusion matrix
y_pred = model.predict_classes(x_test)
ooo.display_confusion_matrix(y_test,y_pred,labels=range(2),color='orange',font_size='20pt')
```
%% Cell type:markdown id: tags:
---
<img width="80px" src="../fidle/img/00-Fidle-logo-01.svg"></img>