{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img width=\"800px\" src=\"../fidle/img/00-Fidle-header-01.svg\"></img>\n",
"\n",
"# <!-- TITLE --> [VAE7] - Checking the clustered dataset\n",
"<!-- DESC --> Episode : 3 Clustered dataset verification and testing of our datagenerator\n",
"<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->\n",
"\n",
"## Objectives :\n",
" - Making sure our clustered dataset is correct\n",
" - Do a little bit of python while waiting to build and train our VAE model.\n",
"\n",
"The [CelebFaces Attributes Dataset (CelebA)](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html) contains about 200,000 images (202599,218,178,3). \n",
"\n",
"\n",
"## What we're going to do :\n",
"\n",
" - Reload our dataset\n",
" - Check and verify our clustered dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1 - Import and init\n",
"### 1.2 - Import"
]
},
{
"cell_type": "code",
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"import os,time,sys,json,glob,importlib\n",
"import math, random\n",
"\n",
"run_dir='./run/VAE7'\n",
"datasets_dir = pwk.init('VAE7', run_dir)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2 - Parameters\n",
"(Un)comment the right lines to be in accordance with the VAE6 notebook"
"metadata": {},
"outputs": [],
"source": [
"image_size = (128,128)\n",
"enhanced_dir = './data'\n",
"\n",
"# image_size = (192,160)\n",
"# enhanced_dir = f'{datasets_dir}/celeba/enhanced'"
]
},
{
"cell_type": "code",
"# ---- Used for continous integration - Just forget this line\n",
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2 - Data verification\n",
"What we're going to do:\n",
" - Recover all clusters by normalizing images\n",
" - Make some statistics to be sure we have all the data\n",
" - picking one image per cluster to check that everything is good."
]
},
{
"cell_type": "code",
"source": [
"# ---- Return a legend from a description \n",
"def get_legend(x_desc,i):\n",
" cols = x_desc.columns\n",
" desc = x_desc.iloc[i]\n",
" legend =[]\n",
" for i,v in enumerate(desc):\n",
" if v==1 : legend.append(cols[i])\n",
" return str('\\n'.join(legend))\n",
"\n",
"# ---- the place of the clusters files\n",
"#\n",
"lx,ly = image_size\n",
"train_dir = f'{enhanced_dir}/clusters-{lx}x{ly}'\n",
"\n",
"clusters_name = [ os.path.splitext(f)[0] for f in glob.glob( f'{train_dir}/*.npy') ]\n",
"\n",
"# ---- Counters set to 0\n",
"imax = len(clusters_name)\n",
"i,n1,n2,s = 0,0,0,0\n",
"imgs,desc = [],[]\n",
"\n",
"# ---- Reload all clusters\n",
"#\n",
"pwk.subtitle('Reload all clusters...')\n",
"pwk.update_progress('Load clusters :',i,imax, redraw=True)\n",
"for cluster_name in clusters_name: \n",
" \n",
" # ---- Reload images and normalize\n",
"\n",
" x_data = np.load(cluster_name+'.npy')\n",
" \n",
" # ---- Reload descriptions\n",
" \n",
" x_desc = pd.read_csv(cluster_name+'.csv', header=0)\n",
" \n",
" # ---- Counters\n",
" \n",
" n1 += len(x_data)\n",
" n2 += len(x_desc.index)\n",
" s += x_data.nbytes\n",
" i += 1\n",
" \n",
" # ---- Get somes images/legends\n",
" \n",
" j=random.randint(0,len(x_data)-1)\n",
" imgs.append( x_data[j].copy() )\n",
" desc.append( get_legend(x_desc,j) )\n",
" x_data=None\n",
" \n",
" # ---- To appear professional\n",
" \n",
" pwk.update_progress('Load clusters :',i,imax, redraw=True)\n",
"pwk.subtitle('Few stats :')\n",
"print(f'Loading time : {d:.2f} s or {pwk.hdelay(d)}')\n",
"print(f'Number of cluster : {i}')\n",
"print(f'Number of images : {n1}')\n",
"print(f'Number of desc. : {n2}')\n",
"print(f'Total size of img : {pwk.hsize(s)}')\n",
"\n",
"pwk.subtitle('Have a look (1 image/ cluster)...')\n",
"pwk.plot_images(imgs,desc,x_size=2,y_size=2,fontsize=8,columns=7,y_padding=2.5, save_as='01-images_and_desc')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class='nota'>\n",
" <b>Note :</b> With this approach, the use of data is much much more effective !\n",
" <ul>\n",
" <li>Data loading speed : <b>x 10</b> (81 s vs 16 min.)</li>\n",
" </ul>\n",
"</div>"
]
},
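{
"cell_type": "markdown",
"metadata": {},
"source": [
"Where does this speedup come from? Below is a minimal sketch (the file names are hypothetical, adapt them to your installation) that times a single `np.load()` of one cluster against decoding the same number of individual JPEG files:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ---- Hypothetical sketch : one big np.load vs many individual image reads\n",
"#      (both paths below are assumptions - adapt them before running)\n",
"import time, glob\n",
"import numpy as np\n",
"from PIL import Image\n",
"\n",
"t0 = time.time()\n",
"x  = np.load(f'{train_dir}/cluster-000.npy')        # hypothetical cluster file name\n",
"t_cluster = time.time() - t0\n",
"\n",
"t0 = time.time()\n",
"files = glob.glob(f'{datasets_dir}/celeba/origine/img_align_celeba/*.jpg')[:len(x)]   # hypothetical path\n",
"imgs  = [ np.array(Image.open(f)) for f in files ]  # one open+decode per image\n",
"t_files = time.time() - t0\n",
"\n",
"print(f'one np.load : {t_cluster:.2f} s    individual files : {t_files:.2f} s')"
]
},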
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3 - Using our DataGenerator\n",
"We are going to use a \"dataset generator\", which is an implementation of [tensorflow.keras.utils.Sequence](https://www.tensorflow.org/api_docs/python/tf/keras/utils/Sequence) \n",
"During the trainning, batches will be requested to our DataGenerator, which will read the clusters as they come in."
"source": [
"# ---- Our DataGenerator\n",
"\n",
"data_gen = DataGenerator(train_dir, batch_size=32, debug=True, scale=0.2)\n",
"\n",
"# ---- We ask him to retrieve all batchs\n",
"\n",
"batch_sizes=[]\n",
"for i in range( len(data_gen)):\n",
" x,y = data_gen[i]\n",
" batch_sizes.append(len(x))\n",
"\n",
"print(f'\\n\\ntotal number of items : {sum(batch_sizes)}')\n",
"print(f'batch sizes : {batch_sizes}')\n",
"print(f'Last batch shape : {x.shape}')\n"
]
},
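{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a reminder of the `Sequence` contract our DataGenerator follows : only `__len__()` (number of batches) and `__getitem__()` (one batch) are required. A minimal sketch, independent of our DataGenerator and fed with random data, could look like this :"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ---- Minimal Sequence sketch (not our DataGenerator, just the API contract)\n",
"import numpy as np\n",
"from tensorflow.keras.utils import Sequence\n",
"\n",
"class MiniGenerator(Sequence):\n",
"    def __init__(self, x, batch_size=32):\n",
"        self.x, self.batch_size = x, batch_size\n",
"    def __len__(self):\n",
"        # number of batches per epoch (last one may be smaller)\n",
"        return int(np.ceil(len(self.x)/self.batch_size))\n",
"    def __getitem__(self, i):\n",
"        # batch i : inputs == targets, as for an autoencoder\n",
"        batch = self.x[ i*self.batch_size : (i+1)*self.batch_size ]\n",
"        return batch, batch\n",
"\n",
"mini = MiniGenerator(np.random.rand(100,128,128,3), batch_size=32)\n",
"print(len(mini), mini[0][0].shape)    # -> 4 batches, the first of shape (32, 128, 128, 3)\n"
]
},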
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"<img width=\"80px\" src=\"../fidle/img/00-Fidle-logo-01.svg\"></img>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
}
},
"nbformat": 4,
"nbformat_minor": 4
}