Skip to content
Snippets Groups Projects
01-Preparation-of-data.ipynb 2.02 MiB
Newer Older
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
    "<img width=\"800px\" src=\"../fidle/img/00-Fidle-header-01.svg\"></img>\n",
    "\n",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
    "# <!-- TITLE --> [GTS1] - CNN with GTSRB dataset - Data analysis and preparation\n",
    "<!-- DESC --> Episode 1 : Data analysis and creation of a usable dataset\n",
    "<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->\n",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
    "\n",
    "## Objectives :\n",
    " - Understand the **complexity associated with data**, even when it is only images\n",
    " - Learn how to build up a simple and **usable image dataset**\n",
    "\n",
    "The German Traffic Sign Recognition Benchmark (GTSRB) is a dataset with more than 50,000 photos of road signs from about 40 classes.  \n",
    "The final aim is to recognise them !  \n",
    "Description is available there : http://benchmark.ini.rub.de/?section=gtsrb&subsection=dataset\n",
    "\n",
    "\n",
    "## What we're going to do :\n",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
    "\n",
    " - Understanding the dataset\n",
    " - Preparing and formatting enhanced data\n",
    " - Save enhanced datasets in h5 file format\n"
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1 -  Import and init\n",
    "### 1.1 - Python"
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>\n",
       "\n",
       "div.warn {    \n",
       "    background-color: #fcf2f2;\n",
       "    border-color: #dFb5b4;\n",
       "    border-left: 5px solid #dfb5b4;\n",
       "    padding: 0.5em;\n",
       "    font-weight: bold;\n",
       "    font-size: 1.1em;;\n",
       "    }\n",
       "\n",
       "\n",
       "\n",
       "div.nota {    \n",
       "    background-color: #DAFFDE;\n",
       "    border-left: 5px solid #92CC99;\n",
       "    padding: 0.5em;\n",
       "    }\n",
       "\n",
       "div.todo:before { content:url();\n",
       "    float:left;\n",
       "    margin-right:20px;\n",
       "    margin-top:-20px;\n",
       "    margin-bottom:20px;\n",
       "}\n",
       "div.todo{\n",
       "    font-weight: bold;\n",
       "    font-size: 1.1em;\n",
       "    margin-top:40px;\n",
       "}\n",
       "div.todo ul{\n",
       "    margin: 0.2em;\n",
       "}\n",
       "div.todo li{\n",
       "    margin-left:60px;\n",
       "    margin-top:0;\n",
       "    margin-bottom:0;\n",
       "}\n",
       "\n",
       "\n",
       "</style>\n",
       "\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "FIDLE 2020 - Practical Work Module\n",
      "Version              : 0.4.2\n",
      "Run time             : Friday 28 February 2020, 09:23:59\n",
      "TensorFlow version   : 2.0.0\n",
      "Keras version        : 2.2.4-tf\n"
     ]
    }
   ],
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
   "source": [
    "import os, time, sys\n",
    "import csv\n",
    "import math, random\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
    "import matplotlib.pyplot as plt\n",
    "import h5py\n",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
    "\n",
    "from skimage.morphology import disk\n",
    "from skimage.filters import rank\n",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
    "from skimage import io, color, exposure, transform\n",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
    "\n",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
    "from importlib import reload\n",
    "sys.path.append('..')\n",
    "import fidle.pwk as ooo\n",
    "\n",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
    "ooo.init()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.2 - Where are we ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Well, we should be at GRICAD !\n",
      "We are going to use: /bettik/PROJECTS/pr-fidle/datasets/GTSRB\n"
     ]
    }
   ],
   "source": [
    "place, dataset_dir = ooo.good_place( { 'GRICAD' : f'{os.getenv(\"SCRATCH_DIR\",\"\")}/PROJECTS/pr-fidle/datasets/GTSRB',\n",
    "                                       'IDRIS'  : f'{os.getenv(\"WORK\",\"\")}/datasets/GTSRB',\n",
    "                                       'HOME'   : f'{os.getenv(\"HOME\",\"\")}/datasets/GTSRB'} )"
   ]
  },
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2 - Read the dataset\n",
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
    "Description is available there : http://benchmark.ini.rub.de/?section=gtsrb&subsection=dataset\n",
    " - Each directory contains one CSV file with annotations (\"GT-<ClassID>.csv\") and the training images\n",
    " - First line is fieldnames: Filename;Width;Height;Roi.X1;Roi.Y1;Roi.X2;Roi.Y2;ClassId  \n",
    "    \n",
    "### 2.1 - Understanding the dataset\n",
    "The original dataset is in : **\\<dataset_dir\\>/origine.**  \n",
    "There is 3 subsets : **Train**, **Test** and **Meta**.  \n",
    "Each subset have an **csv file** and a **subdir**.\n",
    "    "
Jean-Luc Parouty's avatar
Jean-Luc Parouty committed
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Width</th>\n",
Loading
Loading full blame...