Commit b0e3ba38 authored by Jean-Luc Parouty's avatar Jean-Luc Parouty

Check notebook after ci update

parent 7c2ca0c7
%% Cell type:markdown id: tags:
 
<img width="800px" src="../fidle/img/00-Fidle-header-01.svg"></img>
 
# <!-- TITLE --> [GTSRB1] - Dataset analysis and preparation
<!-- DESC --> Episode 1 : Analysis of the GTSRB dataset and creation of an enhanced dataset
<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->
 
## Objectives :
- Understand the **complexity associated with data**, even when it is only images
- Learn how to build up a simple and **usable image dataset**
 
The German Traffic Sign Recognition Benchmark (GTSRB) is a dataset with more than 50,000 photos of road signs from about 40 classes.
The final aim is to recognise them!
 
The description is available here: http://benchmark.ini.rub.de/?section=gtsrb&subsection=dataset
 
 
## What we're going to do :
 
- Understand the dataset
- Prepare and format an enhanced dataset
- Save the enhanced datasets in h5 file format
 
%% Cell type:markdown id: tags:
 
## Step 1 - Import and init
 
%% Cell type:code id: tags:
 
``` python
import os, time, sys
import csv
import math, random
 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import h5py
 
from skimage.morphology import disk
from skimage.util import img_as_ubyte
from skimage.filters import rank
from skimage import io, color, exposure, transform
 
from importlib import reload
 
sys.path.append('..')
import fidle.pwk as pwk
 
datasets_dir = pwk.init('GTSRB1')
```
 
%% Output
 
 
<br>**FIDLE 2020 - Practical Work Module**
 
Version : 1.2b1 DEV
Notebook id : GTSRB1
Run time : Monday 11 January 2021, 21:34:27
TensorFlow version : 2.2.0
Keras version : 2.3.0-tf
Datasets dir : /home/pjluc/datasets/fidle
Run dir : ./run
Update keras cache : False
 
%% Cell type:markdown id: tags:
 
## Step 2 - Parameters
The generation of datasets may require some time and space : **10' and 10 GB**.
 
You can choose to perform tests or generate the whole enhanced dataset by setting the following parameters:
`scale` : 1 means 100% of the dataset - set 0.1 for tests
`output_dir` : where to write the enhanced dataset; it can be:
- `./data`, for test purposes
- `<datasets_dir>/GTSRB/enhanced`, to add the enhanced datasets to your datasets dir.
 
Uncomment the right lines according to what you want :
 
%% Cell type:code id: tags:
 
``` python
# ---- For smart tests :
#
scale = 0.1
output_dir = './data'
 
# ---- For a Full dataset generation :
#
# scale = 1
# output_dir = f'{datasets_dir}/GTSRB/enhanced'
```
 
%% Cell type:markdown id: tags:
 
Override parameters (batch mode) - Just forget this cell
 
%% Cell type:code id: tags:
 
``` python
pwk.override('scale', 'output_dir')
```
 
%% Cell type:markdown id: tags:
 
## Step 3 - Read the dataset
The description is available here: http://benchmark.ini.rub.de/?section=gtsrb&subsection=dataset
- Each directory contains one CSV file with the annotations, `GT-<ClassID>.csv`, and the training images
- The first line gives the field names: `Filename ; Width ; Height ; Roi.X1 ; Roi.Y1 ; Roi.X2 ; Roi.Y2 ; ClassId`
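As an illustration, such a per-class annotation file can be read with pandas by giving the `;` separator explicitly. This is only a sketch: the `GT-00000.csv` path below is a hypothetical example and is not used in the rest of this notebook.
 
``` python
import pandas as pd

# Hypothetical example path : one per-class annotation file of the original layout
csv_file = f'{datasets_dir}/GTSRB/origine/Train/00000/GT-00000.csv'   # example only

# The GT-<ClassID>.csv files are semicolon separated
df = pd.read_csv(csv_file, sep=';', header=0)
print(df.head())
```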
 
### 3.1 - Understanding the dataset
The original dataset is in : **\<dataset_dir\>/GTSRB/origine.**
There are 3 subsets : **Train**, **Test** and **Meta.**
Each subset has a **csv file** and a **subdir** with **images**.
 
 
%% Cell type:code id: tags:
 
``` python
df = pd.read_csv(f'{datasets_dir}/GTSRB/origine/Test.csv', header=0)
display(df.head(10))
```
 
%% Output
 
 
%% Cell type:markdown id: tags:
 
### 3.2 - Useful functions
A handy function to read a subset :
 
%% Cell type:code id: tags:
 
``` python
def read_csv_dataset(csv_file):
    '''
    Reads traffic sign data from the German Traffic Sign Recognition Benchmark dataset.
    Arguments:
        csv_file : description file, example /data/GTSRB/Train.csv
    Returns:
        x,y : np array of images, np array of corresponding labels
    '''

    path = os.path.dirname(csv_file)
    name = os.path.basename(csv_file)

    # ---- Read csv file
    #
    df = pd.read_csv(csv_file, header=0)

    # ---- Get filenames and ClassIds
    #
    filenames = df['Path'].to_list()
    y         = df['ClassId'].to_list()
    x         = []

    # ---- Read images
    #
    for filename in filenames:
        image = io.imread(f'{path}/{filename}')
        x.append(image)
        pwk.update_progress(name, len(x), len(filenames))

    # ---- Return
    #
    return np.array(x, dtype=object), np.array(y)
```
 
%% Cell type:markdown id: tags:
 
### 3.3 - Read the data
We will read the following datasets:
- **Train** subset, for learning data as : `x_train, y_train`
- **Test** subset, for validation data as : `x_test, y_test`
- **Meta** subset, for visualisation as : `x_meta, y_meta`
 
The learning data will be randomly shuffled and the illustration data (Meta) sorted.
This takes about 1'30 on an HPC node.
 
%% Cell type:code id: tags:
 
``` python
pwk.chrono_start()
 
# ---- Read datasets
 
(x_train,y_train) = read_csv_dataset(f'{datasets_dir}/GTSRB/origine/Train.csv')
(x_test ,y_test) = read_csv_dataset(f'{datasets_dir}/GTSRB/origine/Test.csv')
(x_meta ,y_meta) = read_csv_dataset(f'{datasets_dir}/GTSRB/origine/Meta.csv')
 
# ---- Shuffle train set
 
x_train, y_train = pwk.shuffle_np_dataset(x_train, y_train)
 
# ---- Sort Meta
 
combined = list(zip(x_meta,y_meta))
combined.sort(key=lambda x: x[1])
x_meta,y_meta = zip(*combined)
 
pwk.chrono_show()
```
 
%% Output
 
Train.csv [########################################] 100.0% of 39209
Test.csv [########################################] 100.0% of 12630
Meta.csv [########################################] 100.0% of 43
Duration : 00:00:31 663ms
 
%% Cell type:markdown id: tags:
 
## Step 4 - A few statistics about the train dataset
We want to know whether our images are homogeneous in terms of size, ratio, width and height.
 
### 4.1 - Do statistics
 
%% Cell type:code id: tags:
 
``` python
train_size = []
train_ratio = []
train_lx = []
train_ly = []
 
test_size = []
test_ratio = []
test_lx = []
test_ly = []
 
for image in x_train:
    (lx,ly,lz) = image.shape
    train_size.append(lx*ly/1024)
    train_ratio.append(lx/ly)
    train_lx.append(lx)
    train_ly.append(ly)

for image in x_test:
    (lx,ly,lz) = image.shape
    test_size.append(lx*ly/1024)
    test_ratio.append(lx/ly)
    test_lx.append(lx)
    test_ly.append(ly)
```
 
%% Cell type:markdown id: tags:
 
### 4.2 - Show statistics
 
%% Cell type:code id: tags:
 
``` python
# ------ Global stuff
print("x_train shape : ",x_train.shape)
print("y_train shape : ",y_train.shape)
print("x_test shape : ",x_test.shape)
print("y_test shape : ",y_test.shape)
 
# ------ Statistics / sizes
plt.figure(figsize=(16,6))
plt.hist([train_size,test_size], bins=100)
plt.gca().set(title='Sizes in Kpixels - Train=[{:5.2f}, {:5.2f}]'.format(min(train_size),max(train_size)),
ylabel='Population', xlim=[0,30])
plt.legend(['Train','Test'])
pwk.save_fig('01-stats-sizes')
plt.show()
 
# ------ Statistics / ratio lx/ly
plt.figure(figsize=(16,6))
plt.hist([train_ratio,test_ratio], bins=100)
plt.gca().set(title='Ratio lx/ly - Train=[{:5.2f}, {:5.2f}]'.format(min(train_ratio),max(train_ratio)),
ylabel='Population', xlim=[0.8,1.2])
plt.legend(['Train','Test'])
pwk.save_fig('02-stats-ratios')
plt.show()
 
# ------ Statistics / lx
plt.figure(figsize=(16,6))
plt.hist([train_lx,test_lx], bins=100)
plt.gca().set(title='Images lx - Train=[{:5.2f}, {:5.2f}]'.format(min(train_lx),max(train_lx)),
ylabel='Population', xlim=[20,150])
plt.legend(['Train','Test'])
pwk.save_fig('03-stats-lx')
plt.show()
 
# ------ Statistics / ly
plt.figure(figsize=(16,6))
plt.hist([train_ly,test_ly], bins=100)
plt.gca().set(title='Images ly - Train=[{:5.2f}, {:5.2f}]'.format(min(train_ly),max(train_ly)),
ylabel='Population', xlim=[20,150])
plt.legend(['Train','Test'])
pwk.save_fig('04-stats-ly')
plt.show()
 
# ------ Statistics / classId
plt.figure(figsize=(16,6))
plt.hist([y_train,y_test], bins=43)
plt.gca().set(title='ClassesId', ylabel='Population', xlim=[0,43])
plt.legend(['Train','Test'])
pwk.save_fig('05-stats-classes')
plt.show()
```
 
%% Output
 
x_train shape : (39209,)
y_train shape : (39209,)
x_test shape : (12630,)
y_test shape : (12630,)
 
 
 
 
 
 
%% Cell type:markdown id: tags:
 
## Step 5 - List of classes
What are the 43 classes of our images...
 
%% Cell type:code id: tags:
 
``` python
pwk.plot_images(x_meta,y_meta, range(43), columns=8, x_size=2, y_size=2,
colorbar=False, y_pred=None, cm='binary', save_as='06-meta-signs')
```
 
%% Output
 
 
%% Cell type:markdown id: tags:
 
## Step 6 - What does it really look like
 
%% Cell type:code id: tags:
 
``` python
# ---- Get and show few images
 
samples = [ random.randint(0,len(x_train)-1) for i in range(32)]
pwk.plot_images(x_train,y_train, samples, columns=8, x_size=2, y_size=2,
colorbar=False, y_pred=None, cm='binary', save_as='07-real-signs')
```
 
%% Output
 
 
%% Cell type:markdown id: tags:
 
## Step 7 - Dataset cooking...
 
Images **must** :
- have the **same size**, to match the input size of the network,
- be **normalized**.
 
It is possible to work on **rgb** or **monochrome** images and to **equalize** the histograms.
 
See : [Exposure with scikit-image](https://scikit-image.org/docs/dev/api/skimage.exposure.html)
See : [Local histogram equalization](https://scikit-image.org/docs/dev/api/skimage.filters.rank.html#skimage.filters.rank.equalize)
See : [Histogram equalization](https://scikit-image.org/docs/dev/api/skimage.exposure.html#skimage.exposure.equalize_hist)
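To get a feel for what these equalizations do, here is a minimal, self-contained sketch on a standard scikit-image test image. It is purely illustrative and is not used in the rest of the notebook.
 
``` python
from skimage import data, exposure
from skimage.filters import rank
from skimage.morphology import disk
from skimage.util import img_as_ubyte
import matplotlib.pyplot as plt

img       = data.camera()                                        # 8-bit grayscale test image
img_he    = exposure.equalize_hist(img)                          # global histogram equalization
img_lhe   = rank.equalize(img_as_ubyte(img), disk(10)) / 255.    # local (rank-based) equalization
img_clahe = exposure.equalize_adapthist(img)                     # CLAHE

fig, axs = plt.subplots(1, 4, figsize=(12, 3))
for ax, im, title in zip(axs, [img, img_he, img_lhe, img_clahe],
                         ['Original', 'HE', 'LHE', 'CLAHE']):
    ax.imshow(im, cmap='gray')
    ax.set_title(title)
    ax.axis('off')
plt.show()
```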
 
### 7.1 - Enhancement cooking
 
%% Cell type:code id: tags:
 
``` python
def images_enhancement(images, width=25, height=25, mode='RGB'):
    '''
    Resize and convert images - doesn't change originals.
    Input images must be RGBA or RGB.
    Note : all outputs are fixed size numpy arrays of float64
    args:
        images       : images list
        width,height : new images size (25,25)
        mode         : RGB | RGB-HE | L | L-HE | L-LHE | L-CLAHE
    return:
        numpy array of enhanced images
    '''
    modes = { 'RGB':3, 'RGB-HE':3, 'L':1, 'L-HE':1, 'L-LHE':1, 'L-CLAHE':1}
    lz = modes[mode]

    out = []
    for img in images:

        # ---- If RGBA, convert to RGB
        if img.shape[2] == 4:
            img = color.rgba2rgb(img)

        # ---- Resize
        img = transform.resize(img, (width, height))

        # ---- RGB / Histogram Equalization
        if mode == 'RGB-HE':
            hsv = color.rgb2hsv(img.reshape(width, height, 3))
            hsv[:, :, 2] = exposure.equalize_hist(hsv[:, :, 2])
            img = color.hsv2rgb(hsv)

        # ---- Grayscale
        if mode == 'L':
            img = color.rgb2gray(img)

        # ---- Grayscale / Histogram Equalization
        if mode == 'L-HE':
            img = color.rgb2gray(img)
            img = exposure.equalize_hist(img)

        # ---- Grayscale / Local Histogram Equalization
        if mode == 'L-LHE':
            img = color.rgb2gray(img)
            img = img_as_ubyte(img)
            img = rank.equalize(img, disk(10)) / 255.

        # ---- Grayscale / Contrast Limited Adaptive Histogram Equalization (CLAHE)
        if mode == 'L-CLAHE':
            img = color.rgb2gray(img)
            img = exposure.equalize_adapthist(img)

        # ---- Add image to the output list
        out.append(img)
        pwk.update_progress('Enhancement: ', len(out), len(images))

    # ---- Reshape images
    #      (-1, width,height,1) for L
    #      (-1, width,height,3) for RGB
    #
    out = np.array(out, dtype='float64')
    out = out.reshape(-1, width, height, lz)
    return out
```
 
%% Cell type:markdown id: tags:
 
### 7.2 - To get an idea of the different recipes
 
%% Cell type:code id: tags:
 
``` python
i=random.randint(0,len(x_train)-16)
x_samples = x_train[i:i+16]
y_samples = y_train[i:i+16]
 
datasets = {}
 
datasets['RGB'] = images_enhancement( x_samples, width=25, height=25, mode='RGB' )
datasets['RGB-HE'] = images_enhancement( x_samples, width=25, height=25, mode='RGB-HE' )
datasets['L'] = images_enhancement( x_samples, width=25, height=25, mode='L' )
datasets['L-HE'] = images_enhancement( x_samples, width=25, height=25, mode='L-HE' )
datasets['L-LHE'] = images_enhancement( x_samples, width=25, height=25, mode='L-LHE' )
datasets['L-CLAHE'] = images_enhancement( x_samples, width=25, height=25, mode='L-CLAHE' )
 
pwk.subtitle('EXPECTED')
x_expected=[ x_meta[i] for i in y_samples]
pwk.plot_images(x_expected, y_samples, range(12), columns=12, x_size=1, y_size=1,
colorbar=False, y_pred=None, cm='binary', save_as='08-expected')
 
pwk.subtitle('ORIGINAL')
pwk.plot_images(x_samples, y_samples, range(12), columns=12, x_size=1, y_size=1,
colorbar=False, y_pred=None, cm='binary', save_as='09-original')
 
pwk.subtitle('ENHANCED')
n=10
for k,d in datasets.items():
    print("dataset : {}  min,max=[{:.3f},{:.3f}]  shape={}".format(k, d.min(), d.max(), d.shape))
    pwk.plot_images(d, y_samples, range(12), columns=12, x_size=1, y_size=1,
                    colorbar=False, y_pred=None, cm='binary', save_as=f'{n}-enhanced-{k}')
    n += 1
```
 
%% Output
 
Enhancement: [################] 100.0% of 16
Enhancement: [################] 100.0% of 16
Enhancement: [################] 100.0% of 16
Enhancement: [################] 100.0% of 16
Enhancement: [################] 100.0% of 16
Enhancement: [################] 100.0% of 16
 
<br>**EXPECTED**
 
 
<br>**ORIGINAL**
 
 
<br>**ENHANCED**
 
dataset : RGB min,max=[0.034,1.000] shape=(16, 25, 25, 3)
 
 
dataset : RGB-HE min,max=[0.001,1.000] shape=(16, 25, 25, 3)
 
 
dataset : L min,max=[0.042,1.000] shape=(16, 25, 25, 1)
 
 
dataset : L-HE min,max=[0.002,1.000] shape=(16, 25, 25, 1)
 
 
dataset : L-LHE min,max=[0.000,1.000] shape=(16, 25, 25, 1)
 
 
dataset : L-CLAHE min,max=[0.000,1.000] shape=(16, 25, 25, 1)
 
 
%% Cell type:markdown id: tags:
 
### 7.3 - Cook and save
A function to save a dataset
 
%% Cell type:code id: tags:
 
``` python
def save_h5_dataset(x_train, y_train, x_test, y_test, x_meta, y_meta, filename):

    # ---- Create h5 file
    with h5py.File(filename, "w") as f:
        f.create_dataset("x_train", data=x_train)
        f.create_dataset("y_train", data=y_train)
        f.create_dataset("x_test",  data=x_test)
        f.create_dataset("y_test",  data=y_test)
        f.create_dataset("x_meta",  data=x_meta)
        f.create_dataset("y_meta",  data=y_meta)

    # ---- Done
    size = os.path.getsize(filename)/(1024*1024)
    print('Dataset : {:24s}  shape : {:22s} size : {:6.1f} Mo   (saved)'.format(filename, str(x_train.shape), size))
```
 
%% Cell type:markdown id: tags:
 
Generate enhanced datasets :
 
%% Cell type:code id: tags:
 
``` python
pwk.chrono_start()

n_train = int( len(x_train)*scale )
n_test  = int( len(x_test)*scale )

pwk.subtitle('Parameters :')
print(f'Scale is : {scale}')
print(f'x_train length is : {n_train}')
print(f'x_test  length is : {n_test}')
print(f'output dir is     : {output_dir}\n')

pwk.subtitle('Running...')

pwk.mkdir(output_dir)

for s in [24, 48]:
    for m in ['RGB', 'RGB-HE', 'L', 'L-LHE']:
        # ---- A nice dataset name
        filename = f'{output_dir}/set-{s}x{s}-{m}.h5'
        pwk.subtitle(f'Dataset : {filename}')

        # ---- Enhancement
        #      Note : x_train is a numpy array of python objects (images with <> sizes)
        #             but images_enhancement() returns a real numpy array of float64 (images with the same size),
        #             so we can save it in nice h5 files
        #
        x_train_new = images_enhancement( x_train[:n_train], width=s, height=s, mode=m )
        x_test_new  = images_enhancement( x_test[:n_test],   width=s, height=s, mode=m )
        x_meta_new  = images_enhancement( x_meta,            width=s, height=s, mode='RGB' )

        # ---- Save
        save_h5_dataset( x_train_new, y_train[:n_train], x_test_new, y_test[:n_test], x_meta_new, y_meta, filename)

x_train_new, x_test_new = 0, 0

pwk.chrono_show()
```
 
%% Output
 
<br>**Parameters :**
 
Scale is : 0.1
x_train length is : 3920
x_test length is : 1263
output dir is : ./data
 
<br>**Running...**
 
<br>**Dataset : ./data/set-24x24-RGB.h5**
 
Enhancement: [########################################] 100.0% of 3920
Enhancement: [########################################] 100.0% of 1263
Enhancement: [########################################] 100.0% of 43
Dataset : ./data/set-24x24-RGB.h5 shape : (3920, 24, 24, 3) size : 68.9 Mo (saved)
 
<br>**Dataset : ./data/set-24x24-RGB-HE.h5**
 
Enhancement: [########################################] 100.0% of 3920
Enhancement: [########################################] 100.0% of 1263
Enhancement: [########################################] 100.0% of 43
Dataset : ./data/set-24x24-RGB-HE.h5 shape : (3920, 24, 24, 3) size : 68.9 Mo (saved)
 
<br>**Dataset : ./data/set-24x24-L.h5**
 
Enhancement: [########################################] 100.0% of 3920
Enhancement: [########################################] 100.0% of 1263
Enhancement: [########################################] 100.0% of 43
Dataset : ./data/set-24x24-L.h5 shape : (3920, 24, 24, 1) size : 23.4 Mo (saved)
 
<br>**Dataset : ./data/set-24x24-L-LHE.h5**
 
Enhancement: [########################################] 100.0% of 3920
Enhancement: [########################################] 100.0% of 1263
Enhancement: [########################################] 100.0% of 43
Dataset : ./data/set-24x24-L-LHE.h5 shape : (3920, 24, 24, 1) size : 23.4 Mo (saved)
 
<br>**Dataset : ./data/set-48x48-RGB.h5**
 
Enhancement: [########################################] 100.0% of 3920
Enhancement: [########################################] 100.0% of 1263
Enhancement: [########################################] 100.0% of 43
Dataset : ./data/set-48x48-RGB.h5 shape : (3920, 48, 48, 3) size : 275.6 Mo (saved)
 
<br>**Dataset : ./data/set-48x48-RGB-HE.h5**
 
Enhancement: [########################################] 100.0% of 3920
Enhancement: [########################################] 100.0% of 1263
Enhancement: [########################################] 100.0% of 43
Dataset : ./data/set-48x48-RGB-HE.h5 shape : (3920, 48, 48, 3) size : 275.6 Mo (saved)
 
<br>**Dataset : ./data/set-48x48-L.h5**
 
Enhancement: [########################################] 100.0% of 3920
Enhancement: [########################################] 100.0% of 1263
Enhancement: [########################################] 100.0% of 43
Dataset : ./data/set-48x48-L.h5 shape : (3920, 48, 48, 1) size : 93.4 Mo (saved)
 
<br>**Dataset : ./data/set-48x48-L-LHE.h5**
 
Enhancement: [########################################] 100.0% of 3920
Enhancement: [########################################] 100.0% of 1263
Enhancement: [########################################] 100.0% of 43
Dataset : ./data/set-48x48-L-LHE.h5 shape : (3920, 48, 48, 1) size : 93.4 Mo (saved)
Duration : 00:00:59 691ms
 
%% Cell type:markdown id: tags:
 
<div class='todo'>
Adapt the code below to read :
<ul>
<li>the different h5 datasets you saved in ./data,</li>
<li>The h5 datasets available in the Fidle project datasets directory.</li>
</ul>
 
</div>
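A possible starting point, as a sketch only: it assumes the h5 files were written by the `save_h5_dataset()` function above, and that `output_dir` points either to `./data` or to the Fidle datasets directory.
 
``` python
import glob, h5py

for filename in sorted(glob.glob(f'{output_dir}/*.h5')):
    with h5py.File(filename, 'r') as f:
        shape_train = f['x_train'].shape
        shape_test  = f['x_test'].shape
    print(f'{filename:40s}  x_train:{shape_train}  x_test:{shape_test}')
```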
 
%% Cell type:markdown id: tags:
 
## Step 8 - Reload data to be sure ;-)
 
%% Cell type:code id: tags:
 
``` python
pwk.chrono_start()
 
dataset='set-48x48-L'
samples=range(24)
 
with h5py.File(f'{output_dir}/{dataset}.h5','r') as f:
    x_tmp = f['x_train'][:]
    y_tmp = f['y_train'][:]
print("dataset loaded from h5 file.")
 
pwk.plot_images(x_tmp,y_tmp, samples, columns=8, x_size=2, y_size=2,
colorbar=False, y_pred=None, cm='binary', save_as='16-enhanced_images')
x_tmp,y_tmp=0,0
 
pwk.chrono_show()
```
 
%% Output
 
dataset loaded from h5 file.
 
 
Duration : 00:00:01 801ms
 
%% Cell type:code id: tags:
 
``` python
pwk.end()
```
 
%% Output
 
End time is : Monday 11 January 2021, 21:36:05
Duration is : 00:01:38 239ms
This notebook ends here
 
%% Cell type:markdown id: tags:
 
---
<img width="80px" src="../fidle/img/00-Fidle-logo-01.svg"></img>
%% Cell type:markdown id: tags:
<img width="800px" src="../fidle/img/00-Fidle-header-01.svg"></img>
# <!-- TITLE --> [GTSRB6] - Full convolutions as a batch
<!-- DESC --> Episode 6 : To compute bigger, use your notebook in batch mode
<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->
## Objectives :
- Run a notebook code as a **job**
- Follow up with Tensorboard
The German Traffic Sign Recognition Benchmark (GTSRB) is a dataset with more than 50,000 photos of road signs from about 40 classes.
The final aim is to recognise them!
The description is available here: http://benchmark.ini.rub.de/?section=gtsrb&subsection=dataset
## What we're going to do :
Our main steps:
- Run Full-convolution.ipynb as a batch :
- Notebook mode
- Script mode
- Tensorboard follow up
%% Cell type:markdown id: tags:
### Step 1 - Import and init
Not really useful here ;-)
%% Cell type:code id: tags:
``` python
import sys
sys.path.append('..')
import fidle.pwk as pwk
datasets_dir = pwk.init('GTSRB6')
```
%% Output
<br>**FIDLE 2020 - Practical Work Module**
Version : 2.0.7
Notebook id : GTSRB6
Run time : Wednesday 27 January 2021, 19:11:13
TensorFlow version : 2.2.0
Keras version : 2.3.0-tf
Datasets dir : /gpfswork/rech/mlh/uja62cb/datasets
Run dir : ./run
Update keras cache : False
Save figs : True
Path figs : ./run/figs
%% Cell type:markdown id: tags:
## Step 2 - How to run a notebook as a batch job?
Two simple solutions are possible :-)
- **Option 1 - As a notebook ! (a good choice)**
Very simple.
The result is the executed notebook, so we can retrieve all the cell outputs of the notebook:
```jupyter nbconvert (...) --to notebook --execute <notebook>```
Example:
```jupyter nbconvert --ExecutePreprocessor.timeout=-1 --to notebook --execute my_notebook.ipynb```
The result will be a notebook: 'my_notebook.nbconvert.ipynb'.
See: [nbconvert documentation](https://nbconvert.readthedocs.io/en/latest/usage.html#convert-notebook)
- **Option 2 - As a script**
Very simple too, but with some constraints on the notebook.
We will convert the notebook to a Python script (IPython, to be precise):
```jupyter nbconvert --to script <notebook>```
Then we can execute this script:
```ipython <script>```
See: [nbconvert documentation](https://nbconvert.readthedocs.io/en/latest/usage.html#executable-script)
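For completeness, option 1 can also be driven from Python with `nbformat` and nbconvert's `ExecutePreprocessor`. This is only a sketch; `my_notebook.ipynb` is a placeholder name.
``` python
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

# ---- Load the notebook (placeholder name)
with open('my_notebook.ipynb') as f:
    nb = nbformat.read(f, as_version=4)

# ---- Execute every cell, without time limit
ep = ExecutePreprocessor(timeout=-1, kernel_name='python3')
ep.preprocess(nb, {'metadata': {'path': '.'}})

# ---- Save the executed notebook, as nbconvert would do
with open('my_notebook.nbconvert.ipynb', 'w') as f:
    nbformat.write(nb, f)
```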
%% Cell type:markdown id: tags:
## Step 3 - Run as a script
Maybe not always the best solution, but this one is simple and rugged!
### 3.1 - Convert to IPython script :
%% Cell type:code id: tags:
``` python
! jupyter nbconvert --to script --output='05-full_convolutions' '05-Full-convolutions.ipynb'
! ls -l *.py
```
%% Output
[NbConvertApp] Converting notebook 05-Full-convolutions.ipynb to script
[NbConvertApp] Writing 12984 bytes to 05-full_convolutions.py
-rw-r--r-- 1 uja62cb mlh 12984 Jan 27 19:11 05-full_convolutions.py
%% Cell type:markdown id: tags:
### 3.2 - Batch submission
See the two examples of bash launch scripts :
- `batch_slurm.sh` using Slurm (like at IDRIS)
- `batch_oar.sh` using OAR (like at GRICAD)
%% Cell type:markdown id: tags:
#### Example at IDRIS
On the front-end node :
```bash
# hostname
jean-zay2
# sbatch $WORK/fidle/GTSRB/batch_slurm.sh
Submitted batch job 249794
#squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
249794 gpu_p1 GTSRB Fu uja62cb PD 0:00 1 (Resources)
# ls -l _batch/
total 32769
-rw-r--r-- 1 uja62cb gensim01 13349 Sep 10 11:32 GTSRB_249794.err
-rw-r--r-- 1 uja62cb gensim01 489 Sep 10 11:31 GTSRB_249794.out
```
%% Cell type:markdown id: tags:
#### Example at GRICAD
This has to be done on the front-end node :
```bash
# hostname
f-dahu
# pwd
/home/paroutyj
# oarsub -S ~/fidle/GTSRB/batch_oar.sh
[GPUNODE] Adding gpu node restriction
[ADMISSION RULE] Modify resource description with type constraints
#oarstat -u
Job id S User Duration System message
--------- - -------- ---------- ------------------------------------------------
5878410 R paroutyj 0:19:56 R=8,W=1:0:0,J=I,P=fidle,T=gpu (Karma=0.005,quota_ok)
5896266 W paroutyj 0:00:00 R=8,W=1:0:0,J=B,N=Full convolutions,P=fidle,T=gpu
# ls -l
total 8
-rw-r--r-- 1 paroutyj l-simap 0 Feb 28 15:58 batch_oar_5896266.err
-rw-r--r-- 1 paroutyj l-simap 5703 Feb 28 15:58 batch_oar_5896266.out
```
%% Cell type:code id: tags:
``` python
pwk.end()
```
%% Output
End time is : Wednesday 27 January 2021, 19:11:15
Duration is : 00:00:02 542ms
This notebook ends here
%% Cell type:markdown id: tags:
----
<div class='todo'>
Your mission if you accept it: Run our full_convolution code in batch mode.<br>
For that :
<ul>
<li>Validate the full_convolution notebook on short tests</li>
<li>Submit it in batch mode for validation</li>
<li>Modify the notebook for a full run and submit it :-)</li>
</ul>
</div>
%% Cell type:markdown id: tags:
---
<img width="80px" src="../fidle/img/00-Fidle-logo-01.svg"></img>
%% Cell type:markdown id: tags:
<img width="800px" src="../fidle/img/00-Fidle-header-01.svg"></img>
# <!-- TITLE --> [IMDB2] - Reload and reuse a saved model
<!-- DESC --> Retrieving a saved model to perform a sentiment analysis (movie review)
<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->
## Objectives :
- The objective is to guess whether film reviews are **positive or negative** based on the analysis of the text.
- For this, we will use our **previously saved model**.
The original dataset can be found **[here](http://ai.stanford.edu/~amaas/data/sentiment/)**
Note that [IMDb.com](https://imdb.com) offers several easy-to-use [datasets](https://www.imdb.com/interfaces/)
For simplicity's sake, we'll use the dataset directly [embedded in Keras](https://www.tensorflow.org/api_docs/python/tf/keras/datasets)
## What we're going to do :
- Preparing the data
- Retrieve our saved model
- Evaluate the result
%% Cell type:markdown id: tags:
## Step 1 - Init python stuff
%% Cell type:code id: tags:
``` python
import numpy as np
import tensorflow as tf
import tensorflow.keras as keras
import tensorflow.keras.datasets.imdb as imdb
import matplotlib.pyplot as plt
import matplotlib
import pandas as pd
import os,sys,h5py,json,re
from importlib import reload
sys.path.append('..')
import fidle.pwk as pwk
datasets_dir = pwk.init('IMDB2')
```
%% Output
<br>**FIDLE 2020 - Practical Work Module**
Version : 2.0.7
Notebook id : IMDB2
Run time : Wednesday 27 January 2021, 19:12:12
TensorFlow version : 2.2.0
Keras version : 2.3.0-tf
Datasets dir : /gpfswork/rech/mlh/uja62cb/datasets
Run dir : ./run
Update keras cache : False
Save figs : True
Path figs : ./run/figs
%% Cell type:markdown id: tags:
## Step 2 : Preparing the data
### 2.1 - Our reviews :
%% Cell type:code id: tags:
``` python
reviews = [ "This film is particularly nice, a must see.",
"Some films are great classics and cannot be ignored.",
"This movie is just abominable and doesn't deserve to be seen!"]
```
%% Cell type:markdown id: tags:
### 2.2 - Retrieve dictionaries
Note : This dictionary is generated by [01-Embedding-Keras](01-Embedding-Keras.ipynb) notebook.
%% Cell type:code id: tags:
``` python
with open('./data/word_index.json', 'r') as fp:
    word_index = json.load(fp)

index_word = {index:word for word,index in word_index.items()}
```
%% Cell type:markdown id: tags:
### 2.3 - Clean, index and pad
%% Cell type:code id: tags:
``` python
max_len    = 256
vocab_size = 10000

nb_reviews = len(reviews)
x_data     = []

# ---- For all reviews
for review in reviews:
    # ---- First index must be <start>
    index_review = [1]
    # ---- For all words
    for w in review.split(' '):
        # ---- Clean it
        w_clean = re.sub(r"[^a-zA-Z0-9]", "", w)
        # ---- Not empty ?
        if len(w_clean) > 0:
            # ---- Get the index
            w_index = word_index.get(w, 2)
            if w_index > vocab_size : w_index = 2
            # ---- Add the index if < vocab_size
            index_review.append(w_index)
    # ---- Add the indexed review
    x_data.append(index_review)

# ---- Padding
x_data = keras.preprocessing.sequence.pad_sequences(x_data, value=0, padding='post', maxlen=max_len)
```
%% Cell type:markdown id: tags:
### 2.4 - Have a look
%% Cell type:code id: tags:
``` python
def translate(x):
    return ' '.join( [index_word.get(i,'?') for i in x] )

for i in range(nb_reviews):
    imax = np.where(x_data[i]==0)[0][0] + 5
    print(f'\nText review :', reviews[i])
    print(f'x_train[{i:}] :', list(x_data[i][:imax]), '(...)')
    print( 'Translation :', translate(x_data[i][:imax]), '(...)')
```
%% Output
Text review : This film is particularly nice, a must see.
x_train[0] : [1, 2, 22, 9, 572, 2, 6, 215, 2, 0, 0, 0, 0, 0] (...)
Translation : <start> <unknown> film is particularly <unknown> a must <unknown> <pad> <pad> <pad> <pad> <pad> (...)
Text review : Some films are great classics and cannot be ignored.
x_train[1] : [1, 2, 108, 26, 87, 2239, 5, 566, 30, 2, 0, 0, 0, 0, 0] (...)
Translation : <start> <unknown> films are great classics and cannot be <unknown> <pad> <pad> <pad> <pad> <pad> (...)
Text review : This movie is just abominable and doesn't deserve to be seen!
x_train[2] : [1, 2, 20, 9, 43, 2, 5, 152, 1833, 8, 30, 2, 0, 0, 0, 0, 0] (...)
Translation : <start> <unknown> movie is just <unknown> and doesn't deserve to be <unknown> <pad> <pad> <pad> <pad> <pad> (...)
%% Cell type:markdown id: tags:
## Step 3 - Bring back the model
%% Cell type:code id: tags:
``` python
model = keras.models.load_model('./run/models/best_model.h5')
```
%% Cell type:markdown id: tags:
## Step 4 - Predict
%% Cell type:code id: tags:
``` python
y_pred = model.predict(x_data)
```
%% Cell type:markdown id: tags:
#### And the winner is :
%% Cell type:code id: tags:
``` python
for i in range(nb_reviews):
    print(f'\n{reviews[i]:<70} =>', ('NEGATIVE' if y_pred[i][0]<0.5 else 'POSITIVE'), f'({y_pred[i][0]:.2f})')
```
%% Output
This film is particularly nice, a must see. => POSITIVE (0.56)
Some films are great classics and cannot be ignored. => POSITIVE (0.62)
This movie is just abominable and doesn't deserve to be seen! => NEGATIVE (0.34)
%% Cell type:code id: tags:
``` python
pwk.end()
```
%% Output
End time is : Wednesday 27 January 2021, 19:12:14
Duration is : 00:00:02 516ms
This notebook ends here
%% Cell type:markdown id: tags:
---
<img width="80px" src="../fidle/img/00-Fidle-logo-01.svg"></img>