{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img width=\"800px\" src=\"../fidle/img/00-Fidle-header-01.svg\"></img>\n",
    "\n",
    "# <!-- TITLE --> [VAE7] - Checking the clustered dataset\n",
    "<!-- DESC --> Episode : 3 Clustered dataset verification and testing of our datagenerator\n",
    "<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->\n",
    "\n",
    "## Objectives :\n",
    " - Making sure our clustered dataset is correct\n",
    " - Do a little bit of python while waiting to build and train our VAE model.\n",
    "\n",
    "The [CelebFaces Attributes Dataset (CelebA)](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html) contains about 200,000 images (202599,218,178,3).  \n",
    "\n",
    "\n",
    "## What we're going to do :\n",
    "\n",
    " - Reload our dataset\n",
    " - Check and verify our clustered dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1 - Import and init\n",
    "### 1.2 - Import"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "import os,sys,glob\n",
    "import random\n",
    "\n",
    "from modules.datagen import DataGenerator\n",
    "\n",
    "import fidle\n",
    "\n",
    "# Init Fidle environment\n",
    "run_id, run_dir, datasets_dir = fidle.init('VAE7')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.2 - Parameters\n",
    "(Un)comment the right lines to be in accordance with the VAE6 notebook  \n",
    "`image_size` : images size in the clusters, should be 128x128 or 192,160 - original size is (218,178)  \n",
    "`progress_verbosity`: Verbosity of progress bar: 0=silent, 1=progress bar, 2=One line  \n",
    "`enhanced_dir` : the place where clustered dataset was saved, can be :\n",
    "- `./data`, for tests purpose\n",
    "- `f{datasets_dir}/celeba/enhanced` in your datasets dir.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Tests\n",
    "#\n",
    "image_size   = (128,128)\n",
    "enhanced_dir = './data'\n",
    "\n",
    "# ----Full clusters generation\n",
    "#\n",
    "# image_size   = (192,160)\n",
    "# enhanced_dir = f'{datasets_dir}/celeba/enhanced'\n",
    "\n",
    "progress_verbosity = 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Used for continous integration - Just forget this line\n",
    "#\n",
    "fidle.override('image_size', 'enhanced_dir', 'progress_verbosity')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2 - Data verification\n",
    "What we're going to do:\n",
    " - Recover all clusters by normalizing images\n",
    " - Make some statistics to be sure we have all the data\n",
    " - picking one image per cluster to check that everything is good."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Return a legend from a description \n",
    "#\n",
    "def get_legend(x_desc,i):\n",
    "    cols  = x_desc.columns\n",
    "    desc  = x_desc.iloc[i]\n",
    "    legend =[]\n",
    "    for i,v in enumerate(desc):\n",
    "        if v==1 : legend.append(cols[i])\n",
    "    return str('\\n'.join(legend))\n",
    "\n",
    "chrono = fidle.Chrono()\n",
    "chrono.start()\n",
    "\n",
    "# ---- the place of the clusters files\n",
    "#\n",
    "lx,ly      = image_size\n",
    "train_dir  = f'{enhanced_dir}/clusters-{lx}x{ly}'\n",
    "\n",
    "# ---- get cluster list\n",
    "#\n",
    "clusters_name = [ os.path.splitext(f)[0] for f in glob.glob( f'{train_dir}/*.npy') ]\n",
    "\n",
    "# ---- Counters set to 0\n",
    "#\n",
    "imax  = len(clusters_name)\n",
    "i,n1,n2,s = 0,0,0,0\n",
    "imgs,desc = [],[]\n",
    "\n",
    "# ---- Reload all clusters\n",
    "#\n",
    "fidle.utils.subtitle('Reload all clusters...')\n",
    "for cluster_name in clusters_name:  \n",
    "    \n",
    "    # ---- Reload images and normalize\n",
    "\n",
    "    x_data = np.load(cluster_name+'.npy')\n",
    "    \n",
    "    # ---- Reload descriptions\n",
    "    \n",
    "    x_desc = pd.read_csv(cluster_name+'.csv', header=0)\n",
    "    \n",
    "    # ---- Counters\n",
    "    \n",
    "    n1 += len(x_data)\n",
    "    n2 += len(x_desc.index)\n",
    "    s  += x_data.nbytes\n",
    "    i  += 1\n",
    "    \n",
    "    # ---- Get somes images/legends\n",
    "    \n",
    "    j=random.randint(0,len(x_data)-1)\n",
    "    imgs.append( x_data[j].copy() )\n",
    "    desc.append( get_legend(x_desc,j) )\n",
    "    x_data=None\n",
    "    \n",
    "    # ---- To appear professional\n",
    "    \n",
    "    fidle.utils.update_progress('Load clusters :',i,imax, redraw=True, verbosity=progress_verbosity)\n",
    "\n",
    "d=chrono.get_delay(format='seconds')\n",
    "\n",
    "fidle.utils.subtitle('Few stats :')\n",
    "print(f'Loading time      : {d} s or {fidle.utils.hdelay(d)}')\n",
    "print(f'Number of cluster : {i}')\n",
    "print(f'Number of images  : {n1}')\n",
    "print(f'Number of desc.   : {n2}')\n",
    "print(f'Total size of img : {fidle.utils.hsize(s)}')\n",
    "\n",
    "fidle.utils.subtitle('Have a look (1 image/ cluster)...')\n",
    "fidle.scrawler.images(imgs,desc,x_size=2,y_size=2,fontsize=8,columns=7,y_padding=2.5, save_as='01-images_and_desc')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class='nota'>\n",
    "    <b>Note :</b> With this approach, the use of data is much much more effective !\n",
    "    <ul>\n",
    "        <li>Data loading speed : <b>x 10</b> (81 s vs 16 min.)</li>\n",
    "    </ul>\n",
    "</div>"
   ]
  },
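  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check of `get_legend()` itself, we can run it on a tiny, hand-made description table.  \n",
    "This is purely illustrative : the table below uses just three example attribute names, not the full CelebA column set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Illustrative check of get_legend() on a tiny, hand-made table\n",
    "#      (three example attributes only - not the full CelebA columns)\n",
    "#\n",
    "demo = pd.DataFrame({'Smiling':[1,-1], 'Young':[1,1], 'Eyeglasses':[-1,1]})\n",
    "print(get_legend(demo, 0))   # -> Smiling, Young\n",
    "print(get_legend(demo, 1))   # -> Young, Eyeglasses"
   ]
  },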
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3 - Using our DataGenerator\n",
    "We are going to use a \"dataset generator\", which is an implementation of [tensorflow.keras.utils.Sequence](https://www.tensorflow.org/api_docs/python/tf/keras/utils/Sequence)  \n",
    "During the trainning, batches will be requested to our DataGenerator, which will read the clusters as they come in."
   ]
  },
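  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the idea concrete, here is a minimal, illustrative `Sequence` - a toy class serving batches from a single in-memory array.  \n",
    "It only sketches the protocol (`__len__` gives the number of batches, `__getitem__` returns one batch) ; our real `DataGenerator` additionally reads the clusters from disk."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- A minimal, illustrative Sequence (NOT our real DataGenerator)\n",
    "#\n",
    "import math\n",
    "from tensorflow.keras.utils import Sequence\n",
    "\n",
    "class MinimalSequence(Sequence):\n",
    "    def __init__(self, x, batch_size=32):\n",
    "        super().__init__()\n",
    "        self.x, self.batch_size = x, batch_size\n",
    "    def __len__(self):\n",
    "        # Number of batches per epoch (the last one may be smaller)\n",
    "        return math.ceil(len(self.x)/self.batch_size)\n",
    "    def __getitem__(self, i):\n",
    "        batch = self.x[i*self.batch_size:(i+1)*self.batch_size]\n",
    "        return batch, batch   # for an autoencoder, inputs are also the targets\n",
    "\n",
    "toy = MinimalSequence(np.zeros((100,8,8,3)), batch_size=32)\n",
    "print(len(toy), toy[3][0].shape)   # -> 4 (4, 8, 8, 3)"
   ]
  },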
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Our DataGenerator\n",
    "\n",
    "data_gen = DataGenerator(train_dir, batch_size=32, debug=True, scale=0.2)\n",
    "\n",
    "# ---- We ask him to retrieve all batchs\n",
    "\n",
    "batch_sizes=[]\n",
    "for i in range( len(data_gen)):\n",
    "    x,y = data_gen[i]\n",
    "    batch_sizes.append(len(x))\n",
    "\n",
    "print(f'\\n\\ntotal number of items : {sum(batch_sizes)}')\n",
    "print(f'batch sizes      : {batch_sizes}')\n",
    "print(f'Last batch shape : {x.shape}')\n"
   ]
  },
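  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since our DataGenerator is a `keras.utils.Sequence`, it can later be passed directly to `model.fit(data_gen, ...)` to train our VAE, without ever loading the whole dataset into memory."
   ]
  },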
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fidle.end()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "<img width=\"80px\" src=\"../fidle/img/00-Fidle-logo-01.svg\"></img>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.9.2 ('fidle-env')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  },
  "vscode": {
   "interpreter": {
    "hash": "b3929042cc22c1274d74e3e946c52b845b57cb6d84f2d591ffe0519b38e4896d"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}