<img width="800px" src="../fidle/img/00-Fidle-header-01.svg"></img>

# <!-- TITLE --> [VAE5] - Another game play : About the CelebA dataset
<!-- DESC --> Episode 1 : Presentation of the CelebA dataset and problems related to its size
<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->

## Objectives :
 - Data **analysis**
 - Problems related to the use of **more realistic datasets**

We'll do the same thing again but with a more interesting dataset:  **CelebFaces**  
"[CelebFaces Attributes Dataset (CelebA)](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html) is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The images in this dataset cover large pose variations and background clutter. CelebA has large diversities, large quantities, and rich annotations."

## Step 1 - Import and init

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from skimage import io, transform

import os,time,sys,json,glob
import csv
import math, random

from importlib import reload

import fidle

# Init Fidle environment
run_id, run_dir, datasets_dir = fidle.init('VAE5')

`progress_verbosity`: Verbosity of progress bar: 0=silent, 1=progress bar, 2=One line

In [None]:
progress_verbosity = 1

Override parameters (batch mode) - Just forget this cell

In [None]:
fidle.override('progress_verbosity')

## Step 2 - Understanding the dataset

### 2.1 - Read the catalog file

In [None]:
dataset_csv = f'{datasets_dir}/celeba/origine/list_attr_celeba.csv'
dataset_img = f'{datasets_dir}/celeba/origine/img_align_celeba'

# ---- Read dataset attributes

dataset_desc = pd.read_csv(dataset_csv, header=0)

# ---- Have a look

display(dataset_desc.head(10))

print(f'\nMissing data       : {dataset_desc.isna().sum().sum()}')
print(f'dataset_desc.shape : {dataset_desc.shape}')
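
Once the catalog is loaded, a natural next step is to see how often each attribute occurs. The sketch below uses a tiny hand-made table in place of `dataset_desc`. It assumes that attributes are encoded as `1` (present) and `-1` (absent), as in the commonly distributed CSV version of CelebA; the table `toy_desc` and its values are purely illustrative.

```python
import pandas as pd

# Hypothetical miniature catalog standing in for dataset_desc
# (assumption: attributes are encoded as 1 = present, -1 = absent)
toy_desc = pd.DataFrame({
    'image_id': ['000001.jpg', '000002.jpg', '000003.jpg', '000004.jpg'],
    'Smiling':  [ 1, -1,  1,  1],
    'Male':     [-1, -1,  1, -1],
})

# ---- Share of images carrying each attribute, most frequent first
attr_cols  = toy_desc.columns.drop('image_id')
attr_share = (toy_desc[attr_cols] == 1).mean().sort_values(ascending=False)

print(attr_share)
```

Applied to the real catalog, the same two lines would work on `dataset_desc`, provided its first column is indeed named `image_id`.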

### 2.2 - Load 5000 images

In [None]:
chrono = fidle.Chrono()
chrono.start()

nb_images=5000
filenames = [ f'{dataset_img}/{i}' for i in dataset_desc.image_id[:nb_images] ]
x=[]
for filename in filenames:
    image=io.imread(filename)
    x.append(image)
    fidle.utils.update_progress(f"{nb_images} images :",len(x),nb_images, verbosity=progress_verbosity)
x_data=np.array(x)
x=None
    
duration=chrono.get_delay(format='seconds')
print(f'\nDuration   : {duration} s')
print(f'Shape is   : {x_data.shape}')
print(f'Numpy type : {x_data.dtype}')

fidle.utils.display_md('<br>**Note :** Estimation for **200.000** normalized images : ')
x_data=x_data/255
k=200000/nb_images
print(f'Loading time  : {k*duration:.2f} s or {fidle.utils.hdelay(k*duration)}')
print(f'Numpy type    : {x_data.dtype}')
print(f'Memory size   : {fidle.utils.hsize(k*x_data.nbytes)}')
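
The estimate printed above scales the sample up to 200,000 images. The same arithmetic can be sketched by hand, under two assumptions that are not stated in this notebook: that every image is 218 x 178 pixels with 3 channels, and that dividing a `uint8` array by 255 yields `float64` values (NumPy's default behaviour). The variable names are illustrative.

```python
# Assumptions: 218 x 178 RGB images, 200,000 images in total
nb_total         = 200_000
height, width    = 218, 178
channels         = 3

values_per_image = height * width * channels

# ---- Bytes needed for the whole dataset, for three possible dtypes
size_uint8   = nb_total * values_per_image * 1   # raw pixels
size_float32 = nb_total * values_per_image * 4   # normalized, single precision
size_float64 = nb_total * values_per_image * 8   # normalized, NumPy default

for name, size in [('uint8', size_uint8), ('float32', size_float32), ('float64', size_float64)]:
    print(f'{name:8s}: {size / 1024**3:7.1f} GiB')
```

Under these assumptions, most of the memory cost comes from the `float64` conversion rather than from the pixels themselves, so keeping the raw `uint8` values and normalizing later is one possible way to reduce it.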

## Step 3 - Have a look

### 3.1 - Few statistics
We want to know if our images are homogeneous in terms of size, ratio, width or height.

In [None]:
data_size  = []
data_ratio = []
data_lx    = []
data_ly    = []

for image in x_data:
    # image.shape is (rows, columns, channels): lx is the height, ly the width
    (lx,ly,lz) = image.shape
    data_size.append(lx*ly/1024)
    data_ratio.append(lx/ly)
    data_lx.append(lx)
    data_ly.append(ly)

df=pd.DataFrame({'Size':data_size, 'Ratio':data_ratio, 'Lx':data_lx, 'Ly':data_ly})
display(df.describe().style.format("{0:.2f}").set_caption("About our images :"))
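
A more direct homogeneity check is to count the distinct image shapes. The sketch below does this on a few synthetic arrays standing in for the loaded images (`toy_images` is hypothetical). Note that if `np.array(x)` already produced a regular 4-D array, the images necessarily share a single shape.

```python
import numpy as np
from collections import Counter

# Hypothetical stand-in for the loaded images:
# three images of one shape and one of a different shape
toy_images = [np.zeros((218, 178, 3), dtype=np.uint8) for _ in range(3)]
toy_images.append(np.zeros((200, 178, 3), dtype=np.uint8))

# ---- Count how many images have each shape
shape_counts   = Counter(img.shape for img in toy_images)
is_homogeneous = (len(shape_counts) == 1)

print(f'Shapes found   : {dict(shape_counts)}')
print(f'Homogeneous ?  : {is_homogeneous}')
```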
    

### 3.2 - What does it really look like

In [None]:
samples = [ random.randint(0,len(x_data)-1) for i in range(32)]
fidle.scrawler.images(x_data, indices=samples, columns=8, x_size=2, y_size=2, save_as='01-celebA')

## AAArrrg !!
Fine ! :-)  
But how can we use this dataset effectively, considering its size and the number of files?  
We're talking about **10 to 20 minutes of loading time** and **170 GB of data**... ;-(  

The only solution will be to:
- group images into clusters, to limit the number of files,
- read the data gradually, because not all of it can be stored in memory
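
The two ideas above can be sketched as follows. This is only a minimal illustration, not the approach used in the next episodes: the helper names `save_clusters` and `iter_clusters`, the `.npy` file format and the file naming are all assumptions, and random pixels stand in for real images.

```python
import os, tempfile
import numpy as np

def save_clusters(images, cluster_size, out_dir):
    """Write images into .npy files holding at most cluster_size images each."""
    paths = []
    for k, start in enumerate(range(0, len(images), cluster_size)):
        path = os.path.join(out_dir, f'images-{k:03d}.npy')
        np.save(path, images[start:start + cluster_size])
        paths.append(path)
    return paths

def iter_clusters(paths):
    """Yield one normalized cluster at a time, so only one is in memory."""
    for path in paths:
        yield np.load(path).astype(np.float32) / 255

# ---- Toy data: 10 random 8x8 RGB "images" (hypothetical stand-in)
rng        = np.random.default_rng(0)
toy_images = rng.integers(0, 256, size=(10, 8, 8, 3), dtype=np.uint8)

with tempfile.TemporaryDirectory() as tmp_dir:
    cluster_paths  = save_clusters(toy_images, cluster_size=4, out_dir=tmp_dir)
    cluster_shapes = [ cluster.shape for cluster in iter_clusters(cluster_paths) ]

print(f'Number of files : {len(cluster_paths)}')
print(f'Cluster shapes  : {cluster_shapes}')
```

Storing the pixels as `uint8` and normalizing only when a cluster is read keeps the files small; whether that trade-off suits a given training loop is a design choice.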

Welcome to the real world ;-)


In [None]:
fidle.end()

---
<img width="80px" src="../fidle/img/00-Fidle-logo-01.svg"></img>