{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img width=\"800px\" src=\"../fidle/img/00-Fidle-header-01.svg\"></img>\n",
    "\n",
    "# <!-- TITLE --> [VAE5] - Another game play : About the CelebA dataset\n",
    "<!-- DESC --> Episode 1 : Presentation of the CelebA dataset and problems related to its size\n",
    "<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->\n",
    "\n",
    "## Objectives :\n",
    " - Data **analysis**\n",
    " - Problems related to the use of **more real datasets**\n",
    "\n",
    "We'll do the same thing again but with a more interesting dataset:  **CelebFaces**  \n",
    "\"[CelebFaces Attributes Dataset (CelebA)](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html) is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The images in this dataset cover large pose variations and background clutter. CelebA has large diversities, large quantities, and rich annotations.\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1 - Import and init"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import pandas as pd\n",
    "from skimage import io, transform\n",
    "\n",
    "import os,time,sys,json,glob\n",
    "import csv\n",
    "import math, random\n",
    "\n",
    "from importlib import reload\n",
    "\n",
    "import fidle\n",
    "\n",
    "# Init Fidle environment\n",
    "run_id, run_dir, datasets_dir = fidle.init('VAE5')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`progress_verbosity`: Verbosity of progress bar: 0=silent, 1=progress bar, 2=One line"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "progress_verbosity = 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Override parameters (batch mode) - Just forget this cell"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fidle.override('progress_verbosity')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2 - Understanding the dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1 - Read the catalog file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_csv = f'{datasets_dir}/celeba/origine/list_attr_celeba.csv'\n",
    "dataset_img = f'{datasets_dir}/celeba/origine/img_align_celeba'\n",
    "\n",
    "# ---- Read dataset attributes\n",
    "\n",
    "dataset_desc = pd.read_csv(dataset_csv, header=0)\n",
    "\n",
    "# ---- Have a look\n",
    "\n",
    "display(dataset_desc.head(10))\n",
    "\n",
    "print(f'\\nDonnées manquantes : {dataset_desc.isna().sum().sum()}')\n",
    "print(f'dataset_desc.shape : {dataset_desc.shape}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 - Load 1000 images"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "chrono = fidle.Chrono()\n",
    "chrono.start()\n",
    "\n",
    "nb_images=5000\n",
    "filenames = [ f'{dataset_img}/{i}' for i in dataset_desc.image_id[:nb_images] ]\n",
    "x=[]\n",
    "for filename in filenames:\n",
    "    image=io.imread(filename)\n",
    "    x.append(image)\n",
    "    fidle.utils.update_progress(f\"{nb_images} images :\",len(x),nb_images, verbosity=progress_verbosity)\n",
    "x_data=np.array(x)\n",
    "x=None\n",
    "    \n",
    "duration=chrono.get_delay(format='seconds')\n",
    "print(f'\\nDuration   : {duration} s')\n",
    "print(f'Shape is   : {x_data.shape}')\n",
    "print(f'Numpy type : {x_data.dtype}')\n",
    "\n",
    "fidle.utils.display_md('<br>**Note :** Estimation for **200.000** normalized images : ')\n",
    "x_data=x_data/255\n",
    "k=200000/nb_images\n",
    "print(f'Charging time : {k*duration:.2f} s or {fidle.utils.hdelay(k*duration)}')\n",
    "print(f'Numpy type    : {x_data.dtype}')\n",
    "print(f'Memory size   : {fidle.utils.hsize(k*x_data.nbytes)}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3 - Have a look\n",
    "\n",
    "### 3.1 - Few statistics\n",
    "We want to know if our images are homogeneous in terms of size, ratio, width or height."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_size  = []\n",
    "data_ratio = []\n",
    "data_lx    = []\n",
    "data_ly    = []\n",
    "\n",
    "for image in x_data:\n",
    "    (lx,ly,lz) = image.shape\n",
    "    data_size.append(lx*ly/1024)\n",
    "    data_ratio.append(lx/ly)\n",
    "    data_lx.append(lx)\n",
    "    data_ly.append(ly)\n",
    "\n",
    "df=pd.DataFrame({'Size':data_size, 'Ratio':data_ratio, 'Lx':data_lx, 'Ly':data_ly})\n",
    "display(df.describe().style.format(\"{0:.2f}\").set_caption(\"About our images :\"))\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 - What does it really look like"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "samples = [ random.randint(0,len(x_data)-1) for i in range(32)]\n",
    "fidle.scrawler.images(x_data, indices=samples, columns=8, x_size=2, y_size=2, save_as='01-celebA')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## AAArrrg !!\n",
    "Fine ! :-)  \n",
    "But how can we effectively use this dataset, considering its size and the number of files ?  \n",
    "We're talking about a **10' to 20' of loading time** and **170 GB of data**... ;-(  \n",
    "\n",
    "The only solution will be to:\n",
    "- group images into clusters, to limit the number of files,\n",
    "- read the data gradually, because not all of it can be stored in memory\n",
    "\n",
    "Welcome in the real world ;-)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fidle.end()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "<img width=\"80px\" src=\"../fidle/img/00-Fidle-logo-01.svg\"></img>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.9.2 ('fidle-env')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  },
  "vscode": {
   "interpreter": {
    "hash": "b3929042cc22c1274d74e3e946c52b845b57cb6d84f2d591ffe0519b38e4896d"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}