{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "<img width=\"800px\" src=\"../fidle/img/00-Fidle-header-01.svg\"></img>\n", "\n", "# <!-- TITLE --> [VAE7] - Checking the clustered dataset\n", "<!-- DESC --> Episode : 3 Clustered dataset verification and testing of our datagenerator\n", "<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->\n", "\n", "## Objectives :\n", " - Making sure our clustered dataset is correct\n", " - Do a little bit of python while waiting to build and train our VAE model.\n", "\n", "The [CelebFaces Attributes Dataset (CelebA)](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html) contains about 200,000 images (202599,218,178,3). \n", "\n", "\n", "## What we're going to do :\n", "\n", " - Reload our dataset\n", " - Check and verify our clustered dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1 - Import and init\n", "### 1.2 - Import" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "import os,time,sys,json,glob,importlib\n", "import math, random\n", "\n", "from modules.datagen import DataGenerator\n", "\n", "sys.path.append('..')\n", "import fidle.pwk as pwk\n", "\n", "run_dir='./run/VAE7'\n", "datasets_dir = pwk.init('VAE7', run_dir)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 - Parameters\n", "(Un)comment the right lines to be in accordance with the VAE6 notebook" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ---- Tests\n", "#\n", "image_size = (128,128)\n", "enhanced_dir = './data'\n", "\n", "# ----Full clusters generation\n", "#\n", "# image_size = (192,160)\n", "# enhanced_dir = f'{datasets_dir}/celeba/enhanced'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ---- Used for continous integration - Just forget this line\n", "#\n", "pwk.override('image_size', 
'enhanced_dir')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2 - Data verification\n", "What we're going to do:\n", " - Reload all the clusters\n", " - Make some statistics to be sure we have all the data\n", " - Pick one image per cluster to check that everything is good." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ---- Return a legend from a description\n", "#\n", "def get_legend(x_desc,i):\n", "    cols   = x_desc.columns\n", "    desc   = x_desc.iloc[i]\n", "    legend = []\n", "    for k,v in enumerate(desc):\n", "        if v==1 : legend.append(cols[k])\n", "    return str('\\n'.join(legend))\n", "\n", "pwk.chrono_start()\n", "\n", "# ---- The place of the clusters files\n", "#\n", "lx,ly = image_size\n", "train_dir = f'{enhanced_dir}/clusters-{lx}x{ly}'\n", "\n", "# ---- Get cluster list\n", "#\n", "clusters_name = [ os.path.splitext(f)[0] for f in glob.glob( f'{train_dir}/*.npy') ]\n", "\n", "# ---- Counters set to 0\n", "#\n", "imax = len(clusters_name)\n", "i,n1,n2,s = 0,0,0,0\n", "imgs,desc = [],[]\n", "\n", "# ---- Reload all clusters\n", "#\n", "pwk.subtitle('Reload all clusters...')\n", "pwk.update_progress('Load clusters :',i,imax, redraw=True)\n", "for cluster_name in clusters_name: \n", "    \n", "    # ---- Reload images\n", "\n", "    x_data = np.load(cluster_name+'.npy')\n", "    \n", "    # ---- Reload descriptions\n", "    \n", "    x_desc = pd.read_csv(cluster_name+'.csv', header=0)\n", "    \n", "    # ---- Counters\n", "    \n", "    n1 += len(x_data)\n", "    n2 += len(x_desc.index)\n", "    s += x_data.nbytes\n", "    i += 1\n", "    \n", "    # ---- Get some images/legends\n", "    \n", "    j=random.randint(0,len(x_data)-1)\n", "    imgs.append( x_data[j].copy() )\n", "    desc.append( get_legend(x_desc,j) )\n", "    x_data=None\n", "    \n", "    # ---- To appear professional\n", "    \n", "    pwk.update_progress('Load clusters :',i,imax, redraw=True)\n", "\n", "d=pwk.chrono_stop()\n", "\n", "pwk.subtitle('Few stats 
:')\n", "print(f'Loading time : {d:.2f} s or {pwk.hdelay(d)}')\n", "print(f'Number of clusters : {i}')\n", "print(f'Number of images : {n1}')\n", "print(f'Number of desc. : {n2}')\n", "print(f'Total size of img : {pwk.hsize(s)}')\n", "\n", "pwk.subtitle('Have a look (1 image / cluster)...')\n", "pwk.plot_images(imgs,desc,x_size=2,y_size=2,fontsize=8,columns=7,y_padding=2.5, save_as='01-images_and_desc')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<div class='nota'>\n", " <b>Note :</b> With this approach, the use of data is much more efficient !\n", " <ul>\n", " <li>Data loading speed : <b>x 10</b> (81 s vs 16 min.)</li>\n", " </ul>\n", "</div>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3 - Using our DataGenerator\n", "We are going to use a \"dataset generator\", which is an implementation of [tensorflow.keras.utils.Sequence](https://www.tensorflow.org/api_docs/python/tf/keras/utils/Sequence) \n", "During training, batches will be requested from our DataGenerator, which will read the clusters as they come in." 
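] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A `Sequence` only has to implement two methods : `__len__` (the number of batches per epoch) and `__getitem__` (one batch, by index). As an illustration, here is a minimal sketch of this protocol using a plain NumPy array instead of cluster files - `ToySequence` is a hypothetical class, not part of `modules.datagen` :" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import math\n", "import numpy as np\n", "\n", "class ToySequence:                 # sketch only - the real DataGenerator subclasses tf.keras.utils.Sequence\n", "    def __init__(self, x, batch_size=32):\n", "        self.x          = x\n", "        self.batch_size = batch_size\n", "    def __len__(self):\n", "        # Number of batches per epoch - the last one may be smaller\n", "        return math.ceil(len(self.x)/self.batch_size)\n", "    def __getitem__(self, i):\n", "        # One (inputs, targets) batch - for an autoencoder, targets are the inputs\n", "        batch = self.x[ i*self.batch_size : (i+1)*self.batch_size ]\n", "        return batch, batch\n", "\n", "seq = ToySequence( np.zeros( (100,8,8,3) ), batch_size=32)\n", "print('Number of batches :', len(seq))       # ceil(100/32) = 4\n", "print('Last batch size   :', len(seq[len(seq)-1][0]))   # 100 - 3*32 = 4" 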
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ---- Our DataGenerator\n", "\n", "data_gen = DataGenerator(train_dir, batch_size=32, debug=True, scale=0.2)\n", "\n", "# ---- We ask him to retrieve all batchs\n", "\n", "batch_sizes=[]\n", "for i in range( len(data_gen)):\n", " x,y = data_gen[i]\n", " batch_sizes.append(len(x))\n", "\n", "print(f'\\n\\ntotal number of items : {sum(batch_sizes)}')\n", "print(f'batch sizes : {batch_sizes}')\n", "print(f'Last batch shape : {x.shape}')\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pwk.end()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "<img width=\"80px\" src=\"../fidle/img/00-Fidle-logo-01.svg\"></img>" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }