01-Preparation-of-data.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "German Traffic Sign Recognition Benchmark (GTSRB)\n",
    "=================================================\n",
    "---\n",
    "Introduction au Deep Learning  (IDLE) - S. Arias, E. Maldonado, JL. Parouty - CNRS/SARI/DEVLOG - 2020  \n",
    "\n",
    "## Episode 1 : Preparation of data\n",
    "\n",
    " - Understanding the dataset\n",
    " - Preparing and formatting enhanced data\n",
    " - Saving enhanced datasets in h5 file format\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1/ Import and init"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>\n",
       "\n",
       "div.warn {    \n",
       "    background-color: #fcf2f2;\n",
       "    border-color: #dFb5b4;\n",
       "    border-left: 5px solid #dfb5b4;\n",
       "    padding: 0.5em;\n",
       "    font-weight: bold;\n",
       "    font-size: 1.1em;;\n",
       "    }\n",
       "\n",
       "\n",
       "\n",
       "div.nota {    \n",
       "    background-color: #DAFFDE;\n",
       "    border-left: 5px solid #92CC99;\n",
       "    padding: 0.5em;\n",
       "    }\n",
       "\n",
       "\n",
       "\n",
       "</style>\n",
       "\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "FIDLE 2020 - Practical Work Module\n",
      "Version              : 0.2.7\n",
      "Run time             : Monday 10 February 2020, 09:29:27\n",
      "TensorFlow version   : 2.0.0\n",
      "Keras version        : 2.2.4-tf\n"
     ]
    }
   ],
   "source": [
    "import os, time, sys\n",
    "import csv\n",
    "import math, random\n",
    "\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import h5py\n",
    "\n",
    "from skimage.morphology import disk\n",
    "from skimage.filters import rank\n",
    "from skimage import io, color, exposure, transform\n",
    "\n",
    "from importlib import reload\n",
    "\n",
    "sys.path.append('..')\n",
    "import fidle.pwk as ooo\n",
    "\n",
    "ooo.init()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2/ Read the dataset\n",
    "A description is available here: http://benchmark.ini.rub.de/?section=gtsrb&subsection=dataset\n",
    " - Each directory contains one CSV file with annotations (\"GT-<ClassID>.csv\") and the training images\n",
    " - The first line holds the field names: Filename;Width;Height;Roi.X1;Roi.Y1;Roi.X2;Roi.Y2;ClassId  \n",
    "\n",
    "The function below reads one comma-separated CSV file per subset (for example Train.csv) and uses its `Path` and `ClassId` columns.\n",
    "    \n",
    "### 2.1/ Useful functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def read_dataset_dir(csv_filename):\n",
    "    '''Reads traffic sign data from German Traffic Sign Recognition Benchmark dataset.\n",
    "\n",
    "    Arguments:  csv filename\n",
    "                Example /data/GTSRB/Train.csv\n",
    "    Returns:   list of images (their sizes differ), np array of corresponding labels'''\n",
    "\n",
    "    # ---- csv filename and path\n",
    "    #\n",
    "    name=os.path.basename(csv_filename)\n",
    "    path=os.path.dirname(csv_filename)\n",
    "    \n",
    "    # ---- Read csv file (image paths are relative to the csv directory)\n",
    "    #\n",
    "    f,x,y = [],[],[]\n",
    "    with open(csv_filename) as csv_file:\n",
    "        reader = csv.DictReader(csv_file, delimiter=',')\n",
    "        for row in reader:\n",
    "            f.append( os.path.join(path, row['Path']) )\n",
    "            y.append( int(row['ClassId'])  )\n",
    "    nb_images = len(f)\n",
    "\n",
    "    # ---- Read images\n",
    "    #\n",
    "    for filename in f:\n",
    "        image=io.imread(filename)\n",
    "        x.append(image)\n",
    "        ooo.update_progress(name,len(x),nb_images)\n",
    "\n",
    "    # ---- Return\n",
    "    #      Images have different shapes, so they are kept in a list:\n",
    "    #      np.array() on such a list fails with recent NumPy versions.\n",
    "    return x,np.array(y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2/ Read the data\n",
    "We will read the following datasets:\n",
    " - **x_train, y_train** : Training data\n",
    " - **x_test, y_test** : Validation or test data\n",
    " - x_meta, y_meta : Illustration data\n",
    " \n",
    "The training data will be randomly shuffled and the illustration data sorted by class.  \n",
    "This takes about 2-3 minutes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train.csv        [#---------------------------------------]   2.5% of 39209\r"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "# ---- Read datasets\n",
    "(x_train,y_train) = read_dataset_dir('./data/origine/Train.csv')\n",
    "(x_test ,y_test)  = read_dataset_dir('./data/origine/Test.csv')\n",
    "(x_meta ,y_meta)  = read_dataset_dir('./data/origine/Meta.csv')\n",
    "    \n",
    "# ---- Shuffle train set\n",
    "combined = list(zip(x_train,y_train))\n",
    "random.shuffle(combined)\n",
    "x_train,y_train = zip(*combined)\n",
    "\n",
    "# ---- Sort Meta\n",
    "combined = list(zip(x_meta,y_meta))\n",
    "combined.sort(key=lambda x: x[1])\n",
    "x_meta,y_meta = zip(*combined)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3/ A few statistics about the train dataset\n",
    "We want to know if our images are homogeneous in terms of size, ratio, width or height.\n",
    "\n",
    "### 3.1/ Compute statistics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_size  = []\n",
    "train_ratio = []\n",
    "train_lx    = []\n",
    "train_ly    = []\n",
    "\n",
    "test_size   = []\n",
    "test_ratio  = []\n",
    "test_lx     = []\n",
    "test_ly     = []\n",
    "\n",
    "for image in x_train:\n",
    "    (lx,ly,lz) = image.shape\n",
    "    train_size.append(lx*ly/1024)\n",
    "    train_ratio.append(lx/ly)\n",
    "    train_lx.append(lx)\n",
    "    train_ly.append(ly)\n",
    "\n",
    "for image in x_test:\n",
    "    (lx,ly,lz) = image.shape\n",
    "    test_size.append(lx*ly/1024)\n",
    "    test_ratio.append(lx/ly)\n",
    "    test_lx.append(lx)\n",
    "    test_ly.append(ly)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2/ Show statistics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ------ Global stuff\n",
    "print(\"x_train size : \",len(x_train))\n",
    "print(\"y_train size : \",len(y_train))\n",
    "print(\"x_test size  : \",len(x_test))\n",
    "print(\"y_test size  : \",len(y_test))\n",
    "\n",
    "# ------ Statistics / sizes\n",
    "plt.figure(figsize=(16,6))\n",
    "plt.hist([train_size,test_size], bins=100)\n",
    "plt.gca().set(title='Sizes in Kpixels - Train=[{:5.2f}, {:5.2f}]'.format(min(train_size),max(train_size)), \n",
    "              ylabel='Population',\n",
    "              xlim=[0,30])\n",
    "plt.legend(['Train','Test'])\n",
    "plt.show()\n",
    "\n",
    "# ------ Statistics / ratio lx/ly\n",
    "plt.figure(figsize=(16,6))\n",
    "plt.hist([train_ratio,test_ratio], bins=100)\n",
    "plt.gca().set(title='Ratio lx/ly - Train=[{:5.2f}, {:5.2f}]'.format(min(train_ratio),max(train_ratio)), \n",
    "              ylabel='Population',\n",
    "              xlim=[0.8,1.2])\n",
    "plt.legend(['Train','Test'])\n",
    "plt.show()\n",
    "\n",
    "# ------ Statistics / lx\n",
    "plt.figure(figsize=(16,6))\n",
    "plt.hist([train_lx,test_lx], bins=100)\n",
    "plt.gca().set(title='Images lx - Train=[{:5.2f}, {:5.2f}]'.format(min(train_lx),max(train_lx)), \n",
    "              ylabel='Population',\n",
    "              xlim=[20,150])\n",
    "plt.legend(['Train','Test'])\n",
    "plt.show()\n",
    "\n",
    "# ------ Statistics / ly\n",
    "plt.figure(figsize=(16,6))\n",
    "plt.hist([train_ly,test_ly], bins=100)\n",
    "plt.gca().set(title='Images ly - Train=[{:5.2f}, {:5.2f}]'.format(min(train_ly),max(train_ly)), \n",
    "              ylabel='Population',\n",
    "              xlim=[20,150])\n",
    "plt.legend(['Train','Test'])\n",
    "plt.show()\n",
    "\n",
    "# ------ Statistics / classId\n",
    "plt.figure(figsize=(16,6))\n",
    "plt.hist([y_train,y_test], bins=43)\n",
    "plt.gca().set(title='ClassesId', \n",
    "              ylabel='Population',\n",
    "              xlim=[0,43])\n",
    "plt.legend(['Train','Test'])\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4/ List of classes\n",
    "What are the 43 classes of our images?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ooo.plot_images(x_meta,y_meta, range(43), columns=8, x_size=2, y_size=2, \n",
    "                                colorbar=False, y_pred=None, cm='binary')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5/ What does it really look like?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Get and show a few random images\n",
    "\n",
    "samples = [ random.randint(0,len(x_train)-1) for i in range(32)]\n",
    "ooo.plot_images(x_train,y_train, samples, columns=8, x_size=2, y_size=2, colorbar=False, y_pred=None, cm='binary')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6/ Dataset cooking...\n",
    "\n",
    "Images must all have the **same size**, matching the input size of the network.   \n",
    "It is possible to work on **rgb** or **monochrome** images and to **equalize** the histograms.   \n",
    "The data must be **normalized**.  \n",
    "\n",
    "See : [Exposure with scikit-image](https://scikit-image.org/docs/dev/api/skimage.exposure.html)  \n",
    "See : [Local histogram equalization](https://scikit-image.org/docs/dev/api/skimage.filters.rank.html#skimage.filters.rank.equalize)  \n",
    "See : [Histogram equalization](https://scikit-image.org/docs/dev/api/skimage.exposure.html#skimage.exposure.equalize_hist)  \n",
    "\n",
    "### 6.1/ Enhancement cook"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def images_enhancement(images, width=25, height=25, mode='RGB'):\n",
    "    '''\n",
    "    Resize and convert images - doesn't change originals.\n",
    "    Input images must be RGBA or RGB.\n",
    "    args:\n",
    "        images :         images list\n",
    "        width,height :   size of the new images (default 25,25)\n",
    "        mode :           RGB | RGB-HE | L | L-HE | L-LHE | L-CLAHE\n",
    "    return:\n",
    "        numpy array of enhanced images\n",
    "    '''\n",
    "    modes = { 'RGB':3, 'RGB-HE':3, 'L':1, 'L-HE':1, 'L-LHE':1, 'L-CLAHE':1}\n",
    "    lz=modes[mode]\n",
    "    \n",
    "    out=[]\n",
    "    for img in images:\n",
    "        \n",
    "        # ---- if RGBA, convert to RGB\n",
    "        if img.shape[2]==4:\n",
    "            img=color.rgba2rgb(img)\n",
    "            \n",
    "        # ---- Resize\n",
    "        img = transform.resize(img, (width,height))\n",
    "\n",
    "        # ---- RGB / Histogram Equalization\n",
    "        if mode=='RGB-HE':\n",
    "            hsv = color.rgb2hsv(img.reshape(width,height,3))\n",
    "            hsv[:, :, 2] = exposure.equalize_hist(hsv[:, :, 2])\n",
    "            img = color.hsv2rgb(hsv)\n",
    "        \n",
    "        # ---- Grayscale\n",
    "        if mode=='L':\n",
    "            img=color.rgb2gray(img)\n",
    "            \n",
    "        # ---- Grayscale / Histogram Equalization\n",
    "        if mode=='L-HE':\n",
    "            img=color.rgb2gray(img)\n",
    "            img=exposure.equalize_hist(img)\n",
    "            \n",
    "        # ---- Grayscale / Local Histogram Equalization\n",
    "        if mode=='L-LHE':\n",
    "            img=color.rgb2gray(img)\n",
    "            img=rank.equalize(img, disk(10))/255.\n",
    "        \n",
    "        # ---- Grayscale / Contrast Limited Adaptive Histogram Equalization (CLAHE)\n",
    "        if mode=='L-CLAHE':\n",
    "            img=color.rgb2gray(img)\n",
    "            img=exposure.equalize_adapthist(img)\n",
    "            \n",
    "        # ---- Append the image to the output list\n",
    "        out.append(img)\n",
    "        ooo.update_progress('Enhancement: ',len(out),len(images))\n",
    "\n",
    "    # ---- Reshape images\n",
    "    #     (-1, width,height,1) for L\n",
    "    #     (-1, width,height,3) for RGB\n",
    "    #\n",
    "    out = np.array(out,dtype='float64')\n",
    "    out = out.reshape(-1,width,height,lz)\n",
    "    return out"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6.2/ To get an idea of the different recipes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "i=random.randint(0,len(x_train)-16)\n",
    "x_samples = x_train[i:i+16]\n",
    "y_samples = y_train[i:i+16]\n",
    "\n",
    "datasets  = {}\n",
    "\n",
    "datasets['RGB']      = images_enhancement( x_samples, width=25, height=25, mode='RGB'  )\n",
    "datasets['RGB-HE']   = images_enhancement( x_samples, width=25, height=25, mode='RGB-HE'  )\n",
    "datasets['L']        = images_enhancement( x_samples, width=25, height=25, mode='L'  )\n",
    "datasets['L-HE']     = images_enhancement( x_samples, width=25, height=25, mode='L-HE'  )\n",
    "datasets['L-LHE']    = images_enhancement( x_samples, width=25, height=25, mode='L-LHE'  )\n",
    "datasets['L-CLAHE']  = images_enhancement( x_samples, width=25, height=25, mode='L-CLAHE'  )\n",
    "\n",
    "print('\\nEXPECTED (Meta) :\\n')\n",
    "x_expected=[ x_meta[i] for i in y_samples]\n",
    "ooo.plot_images(x_expected, y_samples, range(16), columns=16, x_size=1, y_size=1, colorbar=False, y_pred=None, cm='binary')\n",
    "\n",
    "print('\\nORIGINAL IMAGES :\\n')\n",
    "ooo.plot_images(x_samples,  y_samples, range(16), columns=16, x_size=1, y_size=1, colorbar=False, y_pred=None, cm='binary')\n",
    "\n",
    "print('\\nENHANCED :\\n')\n",
    "for k,d in datasets.items():\n",
    "    print(\"dataset : {}  min,max=[{:.3f},{:.3f}]  shape={}\".format(k,d.min(),d.max(), d.shape))\n",
    "    ooo.plot_images(d, y_samples, range(16), columns=16, x_size=1, y_size=1, colorbar=False, y_pred=None, cm='binary')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6.3/ Cook and save\n",
    "A function to save a dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def save_h5_dataset(x_train, y_train, x_test, y_test, x_meta,y_meta, h5name):\n",
    "    \n",
    "    # ---- Filename\n",
    "    filename='./data/'+h5name\n",
    "    \n",
    "    # ---- Create h5 file\n",
    "    with h5py.File(filename, \"w\") as f:\n",
    "        f.create_dataset(\"x_train\", data=x_train)\n",
    "        f.create_dataset(\"y_train\", data=y_train)\n",
    "        f.create_dataset(\"x_test\",  data=x_test)\n",
    "        f.create_dataset(\"y_test\",  data=y_test)\n",
    "        f.create_dataset(\"x_meta\",  data=x_meta)\n",
    "        f.create_dataset(\"y_meta\",  data=y_meta)\n",
    "        \n",
    "    # ---- Done: report the file size in megabytes\n",
    "    size=os.path.getsize(filename)/(1024*1024)\n",
    "    print('Dataset : {:24s}  shape : {:22s} size : {:6.1f} MB   (saved)\\n'.format(filename, str(x_train.shape),size))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create enhanced datasets, and save them...  \n",
    "This takes about 7-8 minutes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "for s in [24, 48]:\n",
    "    for m in ['RGB', 'RGB-HE', 'L', 'L-LHE']:\n",
    "        # ---- A nice dataset name\n",
    "        name='set-{}x{}-{}.h5'.format(s,s,m)\n",
    "        print(\"\\nDataset : \",name)\n",
    "        # ---- Enhancement\n",
    "        x_train_new = images_enhancement( x_train, width=s, height=s, mode=m )\n",
    "        x_test_new  = images_enhancement( x_test,  width=s, height=s, mode=m )\n",
    "        x_meta_new  = images_enhancement( x_meta,  width=s, height=s, mode='RGB' )\n",
    "        # ---- Save\n",
    "        save_h5_dataset( x_train_new, y_train, x_test_new, y_test, x_meta_new,y_meta, name)\n",
    "\n",
    "x_train_new,x_test_new=0,0\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7/ Reload data to be sure ;-)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "dataset='set-48x48-L'\n",
    "samples=range(24)\n",
    "\n",
    "with  h5py.File('./data/'+dataset+'.h5', 'r') as f:\n",
    "    x_tmp = f['x_train'][:]\n",
    "    y_tmp = f['y_train'][:]\n",
    "    print(\"dataset loaded from h5 file.\")\n",
    "\n",
    "ooo.plot_images(x_tmp,y_tmp, samples, columns=8, x_size=2, y_size=2, colorbar=False, y_pred=None, cm='binary')\n",
    "x_tmp,y_tmp=0,0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "That's all folks!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}