%% Cell type:markdown id: tags:
<img width="800px" src="../fidle/img/00-Fidle-header-01.svg"></img>
# <!-- TITLE --> [GTSRB1] - Dataset analysis and preparation
<!-- DESC --> Episode 1 : Analysis of the GTSRB dataset and creation of an enhanced dataset
<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->
## Objectives :
- Understand the **complexity associated with data**, even when it is only images
- Learn how to build up a simple and **usable image dataset**
The German Traffic Sign Recognition Benchmark (GTSRB) is a dataset with more than 50,000 photos of road signs, divided into 43 classes.
The final aim is to recognise them!
The description is available here : http://benchmark.ini.rub.de/?section=gtsrb&subsection=dataset
## What we're going to do :
- Understanding the dataset
- Preparing and formatting enhanced data
- Saving the enhanced datasets in h5 file format
%% Cell type:markdown id: tags:
## Step 1 - Import and init
%% Cell type:code id: tags:
``` python
import os, time, sys
import csv
import math, random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import h5py
from skimage.morphology import disk
from skimage.util import img_as_ubyte
from skimage.filters import rank
from skimage import io, color, exposure, transform
from importlib import reload
sys.path.append('..')
import fidle.pwk as pwk
datasets_dir = pwk.init('GTSRB1')
```
%% Cell type:markdown id: tags:
## Step 2 - Parameters
The generation of the datasets may require some time and space : **10' and 10 GB**.
You can choose to perform tests or generate the whole enhanced dataset by setting the following parameters:
`scale` : 1 means 100% of the dataset - set 0.1 for tests
`output_dir` : where to write the enhanced dataset, could be :
- `./data`, for test purposes
- `<datasets_dir>/GTSRB/enhanced` to add them to your datasets dir.
Uncomment the right lines according to what you want :
%% Cell type:code id: tags:
``` python
# ---- For smart tests :
#
scale      = 0.2
output_dir = './data'
# ---- For a full dataset generation :
#
# scale      = 1
# output_dir = f'{datasets_dir}/GTSRB/enhanced'
# ---- Verbosity - 0 = silent, 1 = progress bar, 2 = one line
#
progress_verbosity = 1
```
%% Cell type:markdown id: tags:
Override parameters (batch mode) - Just forget this cell
%% Cell type:code id: tags:
``` python
pwk.override('scale', 'output_dir', 'progress_verbosity')
```
%% Cell type:markdown id: tags:
## Step 3 - Read the dataset
The description is available here : http://benchmark.ini.rub.de/?section=gtsrb&subsection=dataset
- Each directory contains one CSV file with annotations, `GT-<ClassID>.csv`, and the training images
- The first line contains the field names : `Filename ; Width ; Height ; Roi.X1 ; Roi.Y1 ; Roi.X2 ; Roi.Y2 ; ClassId`
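As an illustration, such a semicolon-separated annotation file could be parsed as below - a minimal sketch, not part of the original notebook, where the file path is hypothetical and only the field names come from the description above :
%% Cell type:code id: tags:
``` python
def read_annotations(csv_file):
    '''
    Illustrative sketch : parse a per-class annotation file GT-<ClassID>.csv
    Fields (listed in the description above) are separated by ';'
    '''
    df = pd.read_csv(csv_file, sep=';')
    # ---- The Roi.* columns give the bounding box of the sign inside each image
    rois = df[['Roi.X1', 'Roi.Y1', 'Roi.X2', 'Roi.Y2']].values
    return df['Filename'].to_list(), rois, df['ClassId'].to_list()

# Example call (hypothetical path) :
# filenames, rois, class_ids = read_annotations(f'{datasets_dir}/GTSRB/origine/Train/00000/GT-00000.csv')
```
%% Cell type:markdown id: tags: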
### 3.1 - Understanding the dataset
The original dataset is in : **\<datasets_dir\>/GTSRB/origine.**
There are 3 subsets : **Train**, **Test** and **Meta.**
Each subset has a **csv file** and a **subdir** with the **images**.
%% Cell type:code id: tags:
``` python
df = pd.read_csv(f'{datasets_dir}/GTSRB/origine/Test.csv', header=0)
display(df.head(10))
```
%% Cell type:markdown id: tags:
### 3.2 - Useful functions
A nice function to read a subset :
%% Cell type:code id: tags:
``` python
def read_csv_dataset(csv_file):
    '''
    Reads traffic sign data from the German Traffic Sign Recognition Benchmark dataset.
    Arguments:
        csv_file : description file, example : /data/GTSRB/Train.csv
    Returns:
        x, y : np array of images, np array of corresponding labels
    '''
    path = os.path.dirname(csv_file)
    name = os.path.basename(csv_file)
    # ---- Read csv file
    #
    df = pd.read_csv(csv_file, header=0)
    # ---- Get filenames and ClassIds
    #
    filenames = df['Path'].to_list()
    y         = df['ClassId'].to_list()
    x         = []
    # ---- Read images
    #
    for filename in filenames:
        image = io.imread(f'{path}/{filename}')
        x.append(image)
        pwk.update_progress(name, len(x), len(filenames), verbosity=progress_verbosity)
    # ---- Return
    #
    return np.array(x, dtype=object), np.array(y)
```
%% Cell type:markdown id: tags:
### 3.3 - Read the data
We will read the following datasets:
- **Train** subset, for learning data, as : `x_train, y_train`
- **Test** subset, for validation data, as : `x_test, y_test`
- **Meta** subset, for visualisation, as : `x_meta, y_meta`
The learning data will be randomly shuffled and the illustration data (Meta) sorted.
This will take about 1'30 on an HPC node.
%% Cell type:code id: tags:
``` python
pwk.chrono_start()
# ---- Read datasets
(x_train,y_train) = read_csv_dataset(f'{datasets_dir}/GTSRB/origine/Train.csv')
(x_test ,y_test)  = read_csv_dataset(f'{datasets_dir}/GTSRB/origine/Test.csv')
(x_meta ,y_meta)  = read_csv_dataset(f'{datasets_dir}/GTSRB/origine/Meta.csv')
# ---- Shuffle train set
x_train, y_train = pwk.shuffle_np_dataset(x_train, y_train)
# ---- Sort Meta by ClassId
combined = list(zip(x_meta,y_meta))
combined.sort(key=lambda x: x[1])
x_meta,y_meta = zip(*combined)
pwk.chrono_show()
```
%% Cell type:markdown id: tags:
## Step 4 - A few statistics about the train dataset
We want to know if our images are homogeneous in terms of size, ratio, width or height.
### 4.1 - Do statistics
%% Cell type:code id: tags:
``` python
train_size  = []
train_ratio = []
train_lx    = []
train_ly    = []
test_size   = []
test_ratio  = []
test_lx     = []
test_ly     = []
for image in x_train:
    (lx,ly,lz) = image.shape
    train_size.append(lx*ly/1024)
    train_ratio.append(lx/ly)
    train_lx.append(lx)
    train_ly.append(ly)
for image in x_test:
    (lx,ly,lz) = image.shape
    test_size.append(lx*ly/1024)
    test_ratio.append(lx/ly)
    test_lx.append(lx)
    test_ly.append(ly)
```
%% Cell type:markdown id: tags:
### 4.2 - Show statistics
%% Cell type:code id: tags:
``` python
# ------ Global stuff
print("x_train shape : ", x_train.shape)
print("y_train shape : ", y_train.shape)
print("x_test  shape : ", x_test.shape)
print("y_test  shape : ", y_test.shape)
# ------ Statistics / sizes
plt.figure(figsize=(16,6))
plt.hist([train_size,test_size], bins=100)
plt.gca().set(title='Sizes in Kpixels - Train=[{:5.2f}, {:5.2f}]'.format(min(train_size),max(train_size)),
              ylabel='Population', xlim=[0,30])
plt.legend(['Train','Test'])
pwk.save_fig('01-stats-sizes')
plt.show()
# ------ Statistics / ratio lx/ly
plt.figure(figsize=(16,6))
plt.hist([train_ratio,test_ratio], bins=100)
plt.gca().set(title='Ratio lx/ly - Train=[{:5.2f}, {:5.2f}]'.format(min(train_ratio),max(train_ratio)),
              ylabel='Population', xlim=[0.8,1.2])
plt.legend(['Train','Test'])
pwk.save_fig('02-stats-ratios')
plt.show()
# ------ Statistics / lx
plt.figure(figsize=(16,6))
plt.hist([train_lx,test_lx], bins=100)
plt.gca().set(title='Images lx - Train=[{:5.2f}, {:5.2f}]'.format(min(train_lx),max(train_lx)),
              ylabel='Population', xlim=[20,150])
plt.legend(['Train','Test'])
pwk.save_fig('03-stats-lx')
plt.show()
# ------ Statistics / ly
plt.figure(figsize=(16,6))
plt.hist([train_ly,test_ly], bins=100)
plt.gca().set(title='Images ly - Train=[{:5.2f}, {:5.2f}]'.format(min(train_ly),max(train_ly)),
              ylabel='Population', xlim=[20,150])
plt.legend(['Train','Test'])
pwk.save_fig('04-stats-ly')
plt.show()
# ------ Statistics / classId
plt.figure(figsize=(16,6))
plt.hist([y_train,y_test], bins=43)
plt.gca().set(title='ClassesId', ylabel='Population', xlim=[0,43])
plt.legend(['Train','Test'])
pwk.save_fig('05-stats-classes')
plt.show()
```
%% Cell type:markdown id: tags:
## Step 5 - List of classes
What are the 43 classes of our images...
%% Cell type:code id: tags:
``` python
pwk.plot_images(x_meta,y_meta, range(43), columns=8, x_size=2, y_size=2,
                colorbar=False, y_pred=None, cm='binary', save_as='06-meta-signs')
```
%% Cell type:markdown id: tags:
## Step 6 - What does it really look like
%% Cell type:code id: tags:
``` python
# ---- Get and show a few images
samples = [ random.randint(0,len(x_train)-1) for i in range(32)]
pwk.plot_images(x_train,y_train, samples, columns=8, x_size=2, y_size=2,
                colorbar=False, y_pred=None, cm='binary', save_as='07-real-signs')
```
%% Cell type:markdown id: tags:
## Step 7 - Dataset cooking...
Images **must** :
- have the **same size** to match the input size of the network,
- be **normalized**.
It is possible to work on **rgb** or **monochrome** images and to **equalize** the histograms - see the sketch after the links below.
See : [Exposure with scikit-image](https://scikit-image.org/docs/dev/api/skimage.exposure.html)
See : [Local histogram equalization](https://scikit-image.org/docs/dev/api/skimage.filters.rank.html#skimage.filters.rank.equalize)
See : [Histogram equalization](https://scikit-image.org/docs/dev/api/skimage.exposure.html#skimage.exposure.equalize_hist)
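To connect those links to actual calls, here is a minimal sketch - not part of the original notebook - applying the three equalization variants to one training image; it assumes `x_train` has been loaded in Step 3 :
%% Cell type:code id: tags:
``` python
# ---- Minimal sketch (illustrative) : the three equalization variants from
#      the links above, applied to one resized, grayscale training image.
#      Assumes x_train has been loaded in Step 3.
img = transform.resize(color.rgb2gray(x_train[0][:,:,:3]), (48,48))   # keep RGB channels only, in case of RGBA

img_he    = exposure.equalize_hist(img)                         # global histogram equalization
img_lhe   = rank.equalize(img_as_ubyte(img), disk(10)) / 255.   # local histogram equalization
img_clahe = exposure.equalize_adapthist(img)                    # CLAHE

for name, im in [('HE',img_he), ('LHE',img_lhe), ('CLAHE',img_clahe)]:
    print(f'{name:6s} : min={im.min():.3f}  max={im.max():.3f}')
```
%% Cell type:markdown id: tags: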
### 7.1 - Enhancement cooking
%% Cell type:code id: tags:
``` python
def images_enhancement(images, width=25, height=25, mode='RGB'):
    '''
    Resize and convert images - doesn't change the originals.
    Input images must be RGBA or RGB.
    Note : all outputs are fixed-size numpy arrays of float64
    args:
        images       : list of images
        width,height : new image size (default 25,25)
        mode         : RGB | RGB-HE | L | L-HE | L-LHE | L-CLAHE
    return:
        numpy array of enhanced images
    '''
    modes = { 'RGB':3, 'RGB-HE':3, 'L':1, 'L-HE':1, 'L-LHE':1, 'L-CLAHE':1}
    lz    = modes[mode]
    out   = []
    for img in images:
        # ---- If RGBA, convert to RGB
        if img.shape[2]==4:
            img = color.rgba2rgb(img)
        # ---- Resize
        img = transform.resize(img, (width,height))
        # ---- RGB / Histogram Equalization
        if mode=='RGB-HE':
            hsv = color.rgb2hsv(img.reshape(width,height,3))
            hsv[:, :, 2] = exposure.equalize_hist(hsv[:, :, 2])
            img = color.hsv2rgb(hsv)
        # ---- Grayscale
        if mode=='L':
            img = color.rgb2gray(img)
        # ---- Grayscale / Histogram Equalization
        if mode=='L-HE':
            img = color.rgb2gray(img)
            img = exposure.equalize_hist(img)
        # ---- Grayscale / Local Histogram Equalization
        if mode=='L-LHE':
            img = color.rgb2gray(img)
            img = img_as_ubyte(img)
            img = rank.equalize(img, disk(10))/255.
        # ---- Grayscale / Contrast Limited Adaptive Histogram Equalization (CLAHE)
        if mode=='L-CLAHE':
            img = color.rgb2gray(img)
            img = exposure.equalize_adapthist(img)
        # ---- Add the image to the output list
        out.append(img)
        pwk.update_progress('Enhancement: ',len(out),len(images))
    # ---- Reshape images
    #      (-1, width,height,1) for L
    #      (-1, width,height,3) for RGB
    #
    out = np.array(out,dtype='float64')
    out = out.reshape(-1,width,height,lz)
    return out
```
%% Cell type:markdown id: tags:
### 7.2 - To get an idea of the different recipes
%% Cell type:code id: tags:
``` python
i = random.randint(0,len(x_train)-16)
x_samples = x_train[i:i+16]
y_samples = y_train[i:i+16]
datasets  = {}
datasets['RGB']     = images_enhancement( x_samples, width=25, height=25, mode='RGB'     )
datasets['RGB-HE']  = images_enhancement( x_samples, width=25, height=25, mode='RGB-HE'  )
datasets['L']       = images_enhancement( x_samples, width=25, height=25, mode='L'       )
datasets['L-HE']    = images_enhancement( x_samples, width=25, height=25, mode='L-HE'    )
datasets['L-LHE']   = images_enhancement( x_samples, width=25, height=25, mode='L-LHE'   )
datasets['L-CLAHE'] = images_enhancement( x_samples, width=25, height=25, mode='L-CLAHE' )
pwk.subtitle('EXPECTED')
x_expected = [ x_meta[i] for i in y_samples]
pwk.plot_images(x_expected, y_samples, range(12), columns=12, x_size=1, y_size=1,
                colorbar=False, y_pred=None, cm='binary', save_as='08-expected')
pwk.subtitle('ORIGINAL')
pwk.plot_images(x_samples, y_samples, range(12), columns=12, x_size=1, y_size=1,
                colorbar=False, y_pred=None, cm='binary', save_as='09-original')
pwk.subtitle('ENHANCED')
n = 10
for k,d in datasets.items():
    print("dataset : {}  min,max = [{:.3f},{:.3f}]  shape = {}".format(k, d.min(), d.max(), d.shape))
    pwk.plot_images(d, y_samples, range(12), columns=12, x_size=1, y_size=1,
                    colorbar=False, y_pred=None, cm='binary', save_as=f'{n}-enhanced-{k}')
    n += 1
```
%% Cell type:markdown id: tags:
### 7.3 - Cook and save
A function to save a dataset :
%% Cell type:code id: tags:
``` python
def save_h5_dataset(x_train, y_train, x_test, y_test, x_meta, y_meta, filename):
    # ---- Create h5 file
    with h5py.File(filename, "w") as f:
        f.create_dataset("x_train", data=x_train)
        f.create_dataset("y_train", data=y_train)
        f.create_dataset("x_test",  data=x_test)
        f.create_dataset("y_test",  data=y_test)
        f.create_dataset("x_meta",  data=x_meta)
        f.create_dataset("y_meta",  data=y_meta)
    # ---- Done
    size = os.path.getsize(filename)/(1024*1024)
    print('Dataset : {:24s}  shape : {:22s}  size : {:6.1f} MB (saved)'.format(filename, str(x_train.shape), size))
```
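%% Cell type:markdown id: tags:
A matching loader could look like this - a minimal sketch, not part of the original notebook (Step 8 below does the same thing inline) :
%% Cell type:code id: tags:
``` python
def read_h5_dataset(filename):
    '''
    Illustrative sketch : companion loader for save_h5_dataset().
    Returns the six arrays in the same order as they were saved.
    '''
    with h5py.File(filename, 'r') as f:
        x_train = f['x_train'][:]
        y_train = f['y_train'][:]
        x_test  = f['x_test'][:]
        y_test  = f['y_test'][:]
        x_meta  = f['x_meta'][:]
        y_meta  = f['y_meta'][:]
    return x_train, y_train, x_test, y_test, x_meta, y_meta
```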
%% Cell type:markdown id: tags:
Generate the enhanced datasets :
%% Cell type:code id: tags:
``` python
pwk.chrono_start()
n_train = int( len(x_train)*scale )
n_test  = int( len(x_test)*scale )
pwk.subtitle('Parameters :')
print(f'Scale is : {scale}')
print(f'x_train length is : {n_train}')
print(f'x_test  length is : {n_test}')
print(f'output dir is     : {output_dir}\n')
pwk.subtitle('Running...')
pwk.mkdir(output_dir)
for s in [24, 48]:
    for m in ['RGB', 'RGB-HE', 'L', 'L-LHE']:
        # ---- A nice dataset name
        filename = f'{output_dir}/set-{s}x{s}-{m}.h5'
        pwk.subtitle(f'Dataset : {filename}')
        # ---- Enhancement
        #      Note : x_train is a numpy array of python objects (images of different sizes),
        #      but images_enhancement() returns a regular numpy array of float64 (images all
        #      the same size), so we can save it in nice h5 files.
        #
        x_train_new = images_enhancement( x_train[:n_train], width=s, height=s, mode=m )
        x_test_new  = images_enhancement( x_test[:n_test],   width=s, height=s, mode=m )
        x_meta_new  = images_enhancement( x_meta,            width=s, height=s, mode='RGB' )
        # ---- Save
        save_h5_dataset( x_train_new, y_train[:n_train], x_test_new, y_test[:n_test], x_meta_new, y_meta, filename)
        x_train_new, x_test_new = 0, 0
pwk.chrono_show()
```
%% Cell type:markdown id: tags:
## Step 8 - Reload data to be sure ;-)
%% Cell type:code id: tags:
``` python
pwk.chrono_start()
dataset = 'set-48x48-L'
samples = range(24)
with h5py.File(f'{output_dir}/{dataset}.h5','r') as f:
    x_tmp = f['x_train'][:]
    y_tmp = f['y_train'][:]
    print("dataset loaded from h5 file.")
pwk.plot_images(x_tmp, y_tmp, samples, columns=8, x_size=2, y_size=2,
                colorbar=False, y_pred=None, cm='binary', save_as='16-enhanced_images')
x_tmp, y_tmp = 0, 0
pwk.chrono_show()
```
%% Cell type:code id: tags:
``` python
pwk.end()
```
%% Cell type:markdown id: tags:
---
<img width="80px" src="../fidle/img/00-Fidle-logo-01.svg"></img>