<img width="800px" src="../fidle/img/00-Fidle-header-01.svg"></img>

# <!-- TITLE --> [VAE6] - Generation of a clustered dataset
<!-- DESC --> Episode 2 : Analysis of the CelebA dataset and creation of a clustered and usable dataset
<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->

## Objectives :
 - Formatting our dataset in **cluster files**, using batch mode
 - Adapting a notebook for batch use


The [CelebFaces Attributes Dataset (CelebA)](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html) contains about 200,000 images. Loaded as a single array, its shape would be (202599,218,178,3), i.e. 202,599 RGB images of 218x178 pixels. 


## What we're going to do :
 - Read the images
 - Resize and normalize them
 - Build clusters of images in npy format


## Step 1 - Import and init
### 1.1 - Import

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from skimage import io, transform

import os,pathlib,time,sys,json,glob
import csv
import math, random

from importlib import reload

sys.path.append('..')
import fidle.pwk as pwk

datasets_dir = pwk.init('VAE6')

<br>**FIDLE 2020 - Practical Work Module**

Version : 2.0.7
Notebook id : VAE6
Run time : Wednesday 27 January 2021, 09:48:49
TensorFlow version : 2.2.0
Keras version : 2.3.0-tf
Datasets dir : /gpfswork/rech/mlh/uja62cb/datasets
Run dir : ./run
Update keras cache : False


### 1.2 - Parameters
The whole dataset will be used for training. 
Reading the 200,000 images can take a long time **(>20 minutes)** and a lot of space **(>170 GB)** 
Example : 
Image Sizes: 128x128 : 74 GB 
Image Sizes: 192x160 : 138 GB 
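
The sizes above are consistent with storing every pixel as a 64-bit float, which is what `skimage.transform.resize` returns for 8-bit input images. The back-of-the-envelope sketch below makes that arithmetic explicit. The function name `estimate_gib` is hypothetical, and the estimate ignores the small `.npy` headers and the CSV files.

```python
def estimate_gib(n_images, image_size, channels=3, bytes_per_value=8):
    """Rough storage estimate in GiB, assuming float64 pixels (8 bytes each)."""
    height, width = image_size
    total_bytes = n_images * height * width * channels * bytes_per_value
    return total_bytes / 2**30

# 202599 is the number of images in CelebA
for size in [(128, 128), (192, 160)]:
    print(size, f'{estimate_gib(202599, size):.1f} GiB')
```

For 128x128 this gives roughly 74 GiB, in line with the figure quoted above. The same function can be used to check that a test run with a small `scale` will fit in the target folder.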

You can define these parameters : 
`scale` : 1 means 100% of the dataset - set 0.05 for tests 
`image_size` : size of the images in the clusters, should be (128,128) or (192,160) (original is (218,178)) 
`output_dir` : where to write clusters, could be :
 - `./data`, for test purposes
 - `<datasets_dir>/celeba/enhanced` to add clusters in your datasets dir. 
 
`cluster_size` : number of images in a cluster, 10000 is fine. (it will be multiplied by `scale`)

**Note :** If the target folder already contains clusters and `exit_if_exist` is True, the build is skipped. 
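
To see how `scale` and `cluster_size` interact, the sketch below reproduces the arithmetic on plain integers. The helper name `scaled_layout` is hypothetical, and the `max(1, ...)` guard is an addition made in this sketch so that a very small `scale` cannot produce a cluster size of zero.

```python
import math

def scaled_layout(n_total, cluster_size, scale):
    """Return (images used, images per cluster, number of clusters, size of last cluster)."""
    n    = int(n_total * scale)                    # images kept after scaling
    size = max(1, int(cluster_size * scale))       # guard added in this sketch
    nb   = math.ceil(n / size)                     # last cluster may be partial
    last = n - (nb - 1) * size if nb > 0 else 0
    return n, size, nb, last

print(scaled_layout(202599, 10000, 0.02))   # → (4051, 200, 21, 51)
print(scaled_layout(202599, 10000, 1.0))    # → (202599, 10000, 21, 2599)
```

With the test settings, only 4051 images are read, in 20 full clusters of 200 images plus a final cluster of 51.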

In [2]:
# ---- Parameters you can change -----------------------------------

# ---- Tests
scale = 0.02
cluster_size = 10000
image_size = (128,128)
output_dir = './data'
exit_if_exist = False

# ---- Full clusters generation, medium size
# scale = 1.
# cluster_size = 10000
# image_size = (128,128)
# output_dir = f'{datasets_dir}/celeba/enhanced'
# exit_if_exist = True

# ---- Full clusters generation, large size
# scale = 1.
# cluster_size = 10000
# image_size = (192,160)
# output_dir = f'{datasets_dir}/celeba/enhanced'
# exit_if_exist = True

In [3]:
# ---- Used for continuous integration - Just forget this line
#
pwk.override('scale', 'cluster_size', 'image_size', 'output_dir', 'exit_if_exist')

### 1.3 - Directories and files :

In [4]:
dataset_csv = f'{datasets_dir}/celeba/origine/list_attr_celeba.csv'
dataset_img = f'{datasets_dir}/celeba/origine/img_align_celeba'

## Step 2 - Read and shuffle filenames catalog

In [5]:
dataset_desc = pd.read_csv(dataset_csv, header=0)
dataset_desc = dataset_desc.reindex(np.random.permutation(dataset_desc.index))
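
The permutation above changes on every run, so the content of each cluster differs between runs. If reproducible clusters are wanted, one option is to seed the shuffle. This is a sketch on a toy catalog, not the notebook's method; the column values and the seed are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy catalog standing in for list_attr_celeba.csv (values are made up)
catalog = pd.DataFrame({'image_id': [f'{i:06d}.jpg' for i in range(1, 6)],
                        'Smiling':  [1, -1, 1, 1, -1]})

# Seeded generator: the same seed always yields the same row order
rng = np.random.default_rng(seed=42)   # hypothetical seed
shuffled = catalog.reindex(rng.permutation(catalog.index.to_numpy()))

# Same rows as the catalog, in a shuffled but repeatable order
print(shuffled)
```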

## Step 3 - Save as clusters of n images

### 3.1 - Cooking function

In [6]:
def read_and_save( dataset_img, dataset_desc, scale=1,
                   cluster_size=1000, cluster_dir='./dataset_cluster', cluster_name='images',
                   image_size=(128,128),
                   exit_if_exist=True):

    def save_cluster(imgs,desc,cols,id):
        # Write one cluster: images as .npy, attributes as .csv
        file_img  = f'{cluster_dir}/{cluster_name}-{id:03d}.npy'
        file_desc = f'{cluster_dir}/{cluster_name}-{id:03d}.csv'
        np.save(file_img, np.array(imgs))
        df=pd.DataFrame(data=desc,columns=cols)
        df.to_csv(file_desc, index=False)
        # Return empty buffers and the next cluster id
        return [],[],id+1

    pwk.chrono_start()
    cols = list(dataset_desc.columns)

    # ---- Check if cluster files exist
    #
    if exit_if_exist and os.path.isfile(f'{cluster_dir}/{cluster_name}-000.npy'):
        print('\n*** Oops. There are already clusters in the target folder!\n')
        return 0,0
    pwk.mkdir(cluster_dir)

    # ---- Scale
    #
    n=int(len(dataset_desc)*scale)
    dataset = dataset_desc[:n]
    # Keep at least one image per cluster, even with a very small scale
    cluster_size = max(1, int(cluster_size*scale))
    pwk.subtitle('Parameters :')
    print(f'Scale is : {scale}')
    print(f'Image size is : {image_size}')
    print(f'dataset length is : {n}')
    print(f'cluster size is : {cluster_size}')
    print('clusters nb is :',math.ceil(n/cluster_size))
    print(f'cluster dir is : {cluster_dir}')

    # ---- Read and save clusters
    #
    pwk.subtitle('Running...')
    imgs, desc, cluster_id = [],[],0
    #
    for i,row in dataset.iterrows():
        #
        filename = f'{dataset_img}/{row.image_id}'
        #
        # ---- Read image, resize (and normalize)
        #      resize() also converts 8-bit pixels to floats in [0,1]
        #
        img = io.imread(filename)
        img = transform.resize(img, image_size)
        #
        # ---- Add image and description
        #
        imgs.append( img )
        desc.append( row.values )
        #
        # ---- Progress bar
        #
        pwk.update_progress(f'Cluster {cluster_id:03d} :',len(imgs),cluster_size)
        #
        # ---- Save cluster if full
        #
        if len(imgs)==cluster_size:
            imgs,desc,cluster_id=save_cluster(imgs,desc,cols, cluster_id)

    # ---- Save incomplete cluster
    if len(imgs)>0 : imgs,desc,cluster_id=save_cluster(imgs,desc,cols,cluster_id)

    duration=pwk.chrono_stop()
    return cluster_id,duration
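
Stripped of its file I/O, the loop above follows a "fill a buffer, flush when full, flush the remainder" pattern. A minimal standalone sketch of that pattern, with a hypothetical helper name `chunked`, may help when adapting the function to other data:

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items; the last one may be shorter."""
    if size < 1:
        raise ValueError('size must be at least 1')
    buffer = []
    for item in items:
        buffer.append(item)
        # Flush the buffer as soon as it is full
        if len(buffer) == size:
            yield buffer
            buffer = []
    # Flush whatever is left (incomplete last chunk)
    if buffer:
        yield buffer

sizes = [len(c) for c in chunked(range(4051), 200)]
print(len(sizes), sizes[-2:])   # → 21 [200, 51]
```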


### 3.2 - Cluster building

In [7]:
# ---- Build clusters
#
lx,ly = image_size
cluster_dir = f'{output_dir}/clusters-{lx}x{ly}'

cluster_nb,duration = read_and_save( dataset_img, dataset_desc,
 scale = scale,
 cluster_size = cluster_size, 
 cluster_dir = cluster_dir,
 image_size = image_size,
 exit_if_exist = exit_if_exist)

# ---- Conclusion...

directory = pathlib.Path(cluster_dir)
s=sum(f.stat().st_size for f in directory.glob('**/*') if f.is_file())

pwk.subtitle('Conclusion :')
print('Duration : ',pwk.hdelay(duration))
print('Size : ',pwk.hsize(s))

<br>**Parameters :**

Scale is : 0.02
Image size is : (128, 128)
dataset length is : 4051
cluster size is : 200
clusters nb is : 21
cluster dir is : ./data/clusters-128x128


<br>**Running...**

Cluster 000 : [########################################] 100.0% of 200
Cluster 001 : [########################################] 100.0% of 200
Cluster 002 : [########################################] 100.0% of 200
Cluster 003 : [########################################] 100.0% of 200
Cluster 004 : [########################################] 100.0% of 200
Cluster 005 : [########################################] 100.0% of 200
Cluster 006 : [########################################] 100.0% of 200
Cluster 007 : [########################################] 100.0% of 200
Cluster 008 : [########################################] 100.0% of 200
Cluster 009 : [########################################] 100.0% of 200
Cluster 010 : [########################################] 100.0% of 200
Cluster 011 : [########################################] 100.0% of 200
Cluster 012 : [########################################] 100.0% of 200
Cluster 013 : [########################################] 100.0% of 200
Cluste

<br>**Conclusion :**

Duration : 0:01:57
Size : 1.5 Go
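
Each cluster is a pair of files sharing an index: `images-NNN.npy` for the pixels and `images-NNN.csv` for the attributes. The sketch below shows one way to read such a pair back. To stay self-contained it first writes a tiny fake cluster to a temporary folder; the helper name `load_cluster` and the fake data are assumptions made for illustration.

```python
import tempfile
import numpy as np
import pandas as pd

def load_cluster(cluster_dir, cluster_id, cluster_name='images'):
    """Return (images, attributes) for one cluster stored as .npy + .csv."""
    base = f'{cluster_dir}/{cluster_name}-{cluster_id:03d}'
    imgs = np.load(f'{base}.npy')
    desc = pd.read_csv(f'{base}.csv', header=0)
    return imgs, desc

# ---- Fake cluster of 4 random 128x128 RGB images, for demonstration only
tmp_dir = tempfile.mkdtemp()
np.save(f'{tmp_dir}/images-000.npy', np.random.rand(4, 128, 128, 3))
pd.DataFrame({'image_id': ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg']}).to_csv(
    f'{tmp_dir}/images-000.csv', index=False)

# ---- Read it back: row k of the CSV describes image k of the array
imgs, desc = load_cluster(tmp_dir, 0)
print(imgs.shape, imgs.dtype, len(desc))
```

With real clusters, `tmp_dir` would be replaced by the cluster folder, for example `./data/clusters-128x128` for the test settings used here.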


In [8]:
pwk.end()

End time is : Wednesday 27 January 2021, 09:50:47
Duration is : 00:01:58 822ms
This notebook ends here


---
<img width="80px" src="../fidle/img/00-Fidle-logo-01.svg"></img>