<img width="800px" src="../fidle/img/00-Fidle-header-01.svg"></img>

# <!-- TITLE --> [VAE7] - Checking the clustered dataset
<!-- DESC --> Episode 3 : Clustered dataset verification and testing of our DataGenerator
<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->

## Objectives :
 - Make sure our clustered dataset is correct
 - Do a little bit of Python while waiting to build and train our VAE model.

The [CelebFaces Attributes Dataset (CelebA)](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html) contains about 200,000 images; the full array has shape (202599,218,178,3).
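
To get a feel for the volume of data involved, the raw memory footprint can be estimated from the shape given above. This is a minimal sketch that assumes one byte per channel value (uint8); any other dtype would scale the result accordingly.

```python
import math

# Shape quoted above: (images, height, width, channels)
shape = (202599, 218, 178, 3)

# Assumption: one byte per channel value (uint8)
bytes_per_value = 1

total_bytes = math.prod(shape) * bytes_per_value
print(f'Raw size : {total_bytes / 1e9:.1f} GB')
```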


## What we're going to do :

 - Reload our dataset
 - Check and verify our clustered dataset

## Step 1 - Import and init
### 1.1 - Import

In [None]:
import numpy as np
import pandas as pd

import os,time,sys,json,glob,importlib
import math, random

from modules.datagen import DataGenerator

sys.path.append('..')
import fidle.pwk as pwk

run_dir='./run/VAE7'
datasets_dir = pwk.init('VAE7', run_dir)

### 1.2 - Parameters
(Un)comment the appropriate lines so that these settings match those used in the VAE6 notebook

In [None]:
# ---- Tests
#
image_size = (128,128)
enhanced_dir = './data'

# ----Full clusters generation
#
# image_size = (192,160)
# enhanced_dir = f'{datasets_dir}/celeba/enhanced'

In [None]:
# ---- Used for continuous integration - Just forget this line
#
pwk.override('image_size', 'enhanced_dir')

## Step 2 - Data verification
What we're going to do:
 - Reload all clusters (images and descriptions)
 - Compute some statistics to be sure we have all the data
 - Pick one image per cluster to check that everything is good.
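
The checks listed above can be sketched as a small standalone helper. This is only an illustration: `check_clusters`, the synthetic file names and the attribute columns are hypothetical, and the sketch assumes each cluster is stored as a `.npy` file with a `.csv` description file of the same name.

```python
import glob
import os
import tempfile

import numpy as np
import pandas as pd


def check_clusters(cluster_dir):
    """Return (n_clusters, n_images, problems) for the .npy/.csv pairs found."""
    problems = []
    n_images = 0
    stems = sorted(os.path.splitext(f)[0]
                   for f in glob.glob(f'{cluster_dir}/*.npy'))
    for stem in stems:
        csv_file = stem + '.csv'
        if not os.path.isfile(csv_file):
            problems.append(f'{stem}: missing description file')
            continue
        x_data = np.load(stem + '.npy')
        x_desc = pd.read_csv(csv_file, header=0)
        # Each image must have exactly one description row
        if len(x_data) != len(x_desc.index):
            problems.append(f'{stem}: {len(x_data)} images, '
                            f'{len(x_desc.index)} descriptions')
        n_images += len(x_data)
    return len(stems), n_images, problems


# ---- Demo on two tiny synthetic clusters (hypothetical names and columns)
with tempfile.TemporaryDirectory() as tmp:
    for k, n in enumerate([4, 3]):
        images = np.zeros((n, 8, 8, 3), dtype=np.float32)
        np.save(f'{tmp}/cluster-{k:03d}.npy', images)
        attrs = pd.DataFrame({'Smiling': [1] * n, 'Eyeglasses': [-1] * n})
        attrs.to_csv(f'{tmp}/cluster-{k:03d}.csv', index=False)
    n_clusters, n_images, problems = check_clusters(tmp)

print(n_clusters, n_images, problems)
```

An empty `problems` list means every cluster has a description file with as many rows as it has images.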

In [None]:
# ---- Return a legend from a description 
#
def get_legend(x_desc,i):
    cols   = x_desc.columns
    desc   = x_desc.iloc[i]
    legend = []
    for k,v in enumerate(desc):
        # Keep only the attributes flagged as present (value 1)
        if v==1 : legend.append(cols[k])
    return '\n'.join(legend)

pwk.chrono_start()

# ---- Location of the cluster files
#
lx,ly = image_size
train_dir = f'{enhanced_dir}/clusters-{lx}x{ly}'

# ---- get cluster list
#
clusters_name = [ os.path.splitext(f)[0] for f in glob.glob( f'{train_dir}/*.npy') ]

# ---- Counters set to 0
#
imax = len(clusters_name)
i,n1,n2,s = 0,0,0,0
imgs,desc = [],[]

# ---- Reload all clusters
#
pwk.subtitle('Reload all clusters...')
pwk.update_progress('Load clusters :',i,imax, redraw=True)
for cluster_name in clusters_name:

    # ---- Reload images (no extra normalization is applied here)

    x_data = np.load(cluster_name+'.npy')

    # ---- Reload descriptions

    x_desc = pd.read_csv(cluster_name+'.csv', header=0)

    # ---- Counters

    n1 += len(x_data)
    n2 += len(x_desc.index)
    s  += x_data.nbytes
    i  += 1

    # ---- Get some images/legends

    j=random.randint(0,len(x_data)-1)
    imgs.append( x_data[j].copy() )
    desc.append( get_legend(x_desc,j) )
    x_data=None

    # ---- To appear professional

    pwk.update_progress('Load clusters :',i,imax, redraw=True)

d=pwk.chrono_stop()

pwk.subtitle('Few stats :')
print(f'Loading time : {d:.2f} s or {pwk.hdelay(d)}')
print(f'Number of clusters : {i}')
print(f'Number of images : {n1}')
print(f'Number of desc. : {n2}')
print(f'Total size of img : {pwk.hsize(s)}')

pwk.subtitle('Have a look (1 image/ cluster)...')
pwk.plot_images(imgs,desc,x_size=2,y_size=2,fontsize=8,columns=7,y_padding=2.5, save_as='01-images_and_desc')

<div class='nota'>
 <b>Note :</b> With this approach, loading the data is much more efficient !
 <ul>
 <li>Data loading speed : <b>x 10</b> (81 s vs 16 min.)</li>
 </ul>
</div>

## Step 3 - Using our DataGenerator
We are going to use a "dataset generator", which is an implementation of [tensorflow.keras.utils.Sequence](https://www.tensorflow.org/api_docs/python/tf/keras/utils/Sequence).  
During training, batches will be requested from our DataGenerator, which will read the clusters as needed.
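
To illustrate the protocol such a generator has to follow, here is a minimal sketch of a Sequence-like object exposing `__len__` (number of batches) and `__getitem__` (one batch). `TinyClusterGenerator` is a hypothetical, simplified stand-in and not the `DataGenerator` from `modules.datagen`: it keeps its clusters in memory rather than reading them from disk, never lets a batch span two clusters, and assumes the target `y` is the input itself, as is usual for an autoencoder.

```python
import numpy as np


class TinyClusterGenerator:
    """Hypothetical stand-in: serves batches from in-memory clusters."""

    def __init__(self, clusters, batch_size=32):
        self.clusters = clusters
        self.batch_size = batch_size
        # One (cluster id, start offset) entry per batch, built once
        self.index = [
            (c, start)
            for c, cluster in enumerate(clusters)
            for start in range(0, len(cluster), batch_size)
        ]

    def __len__(self):
        # Number of batches available in one pass over the data
        return len(self.index)

    def __getitem__(self, i):
        c, start = self.index[i]
        batch = self.clusters[c][start:start + self.batch_size]
        # Assumption: the target is the input itself (autoencoder)
        return batch, batch


# ---- Two synthetic clusters of 70 and 50 small "images"
clusters = [np.zeros((70, 8, 8, 3)), np.zeros((50, 8, 8, 3))]
gen = TinyClusterGenerator(clusters, batch_size=32)

sizes = [len(gen[i][0]) for i in range(len(gen))]
print(f'batch sizes : {sizes}')
print(f'total items : {sum(sizes)}')
```

Because batches stop at cluster boundaries in this sketch, the last batch of each cluster can be smaller than `batch_size`.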

In [None]:
# ---- Our DataGenerator

data_gen = DataGenerator(train_dir, batch_size=32, debug=True, scale=0.2)

# ---- Ask it to retrieve all batches

batch_sizes=[]
for i in range( len(data_gen)):
 x,y = data_gen[i]
 batch_sizes.append(len(x))

print(f'\n\ntotal number of items : {sum(batch_sizes)}')
print(f'batch sizes : {batch_sizes}')
print(f'Last batch shape : {x.shape}')


In [None]:
pwk.end()

---
<img width="80px" src="../fidle/img/00-Fidle-logo-01.svg"></img>