{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img width=\"800px\" src=\"../fidle/img/00-Fidle-header-01.svg\"></img>\n",
    "\n",
    "# <!-- TITLE --> [VAE6] - Generation of a clustered dataset\n",
    "<!-- DESC --> Episode 2 : Analysis of the CelebA dataset and creation of an clustered and usable dataset\n",
    "<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->\n",
    "\n",
    "## Objectives :\n",
    " - Formatting our dataset in **cluster files**, using batch mode\n",
    " - Adapting a notebook for batch use\n",
    "\n",
    "\n",
    "The [CelebFaces Attributes Dataset (CelebA)](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html) contains about **200,000 images** (202599,218,178,3).  \n",
    "The size and the number of files of this dataset make it impossible to use it as it is.  \n",
    "A formatting in the form of clusters of n images is essential.\n",
    "\n",
    "\n",
    "## What we're going to do :\n",
    " - Lire les images\n",
    " - redimensionner et normaliser celles-ci,\n",
    " - Constituer des clusters d'images en format npy\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1 - Import and init"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import pandas as pd\n",
    "from skimage import io, transform\n",
    "\n",
    "import os,pathlib,time,sys,json,glob\n",
    "import csv\n",
    "import math, random\n",
    "\n",
    "from importlib import reload\n",
    "\n",
    "sys.path.append('..')\n",
    "import fidle.pwk as pwk\n",
    "\n",
    "run_dir='./run/VAE6'\n",
    "datasets_dir = pwk.init('VAE6', run_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2 - Parameters\n",
    "All the dataset will be use for training  \n",
    "Reading the 200,000 images can take a long time **(>20 minutes)** and a lot of place **(>170 GB)**  \n",
    "Example :  \n",
    "Image Sizes: 128x128 : 74 GB  \n",
    "Image Sizes: 192x160 : 138 GB  \n",
    "\n",
    "You can define theses parameters :  \n",
    "`scale` : 1 mean 100% of the dataset - set 0.05 for tests  \n",
    "`image_size` : images size in the clusters, should be 128x128 or 192,160 - original size is (218,178)  \n",
    "`output_dir` : where to write clusters, could be :\n",
    " - `./data`, for tests purpose\n",
    " - `<datasets_dir>/celeba/enhanced` to add clusters in your datasets dir.  \n",
    " \n",
    "`cluster_size` : number of images in a cluster, 10000 is fine. (will be adjust by scale)\n",
    "\n",
    "**Note :** If the target folder is not empty and exit_if_exist is True, the construction is blocked.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Parameters you can change -----------------------------------\n",
    "\n",
    "# ---- Tests\n",
    "scale         = 0.05\n",
    "seed          = 123\n",
    "cluster_size  = 10000\n",
    "image_size    = (128,128)\n",
    "output_dir    = './data'\n",
    "exit_if_exist = False\n",
    "\n",
    "# ---- Full clusters generation, medium size\n",
    "# scale         = 1.\n",
    "# seed          = 123\n",
    "# cluster_size  = 10000\n",
    "# image_size    = (128,128)\n",
    "# output_dir    = f'{datasets_dir}/celeba/enhanced'\n",
    "# exit_if_exist = True\n",
    "\n",
    "# ---- Full clusters generation, large size\n",
    "# scale         = 1.\n",
    "# seed          = 123\n",
    "# cluster_size  = 10000\n",
    "# image_size    = (192,160)\n",
    "# output_dir    = f'{datasets_dir}/celeba/enhanced'\n",
    "# exit_if_exist = True"
   ]
  },
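  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before launching a full run, a quick back-of-the-envelope estimate can help.  \n",
    "The sketch below is illustrative only : it assumes images are stored as float64, which is what `skimage.transform.resize` produces by default."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Rough size estimate (illustrative sketch, not part of the build)\n",
    "#      Assumes float64 storage, the default output of skimage.transform.resize\n",
    "#\n",
    "n_images        = 202599                  # number of images in CelebA\n",
    "lx, ly          = image_size              # target image size\n",
    "bytes_per_image = lx * ly * 3 * 8         # 3 channels, 8 bytes per float64\n",
    "total           = scale * n_images * bytes_per_image\n",
    "print(f'Estimated clusters size : {total/2**30:.1f} GiB')"
   ]
  },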
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Used for continous integration - Just forget this line\n",
    "#\n",
    "pwk.override('scale', 'seed', 'cluster_size', 'image_size', 'output_dir', 'exit_if_exist')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3 - Cluster construction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.1 - Directories and files :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_csv = f'{datasets_dir}/celeba/origine/list_attr_celeba.csv'\n",
    "dataset_img = f'{datasets_dir}/celeba/origine/img_align_celeba'"
   ]
  },
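  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an optional sanity check (a small sketch, not required by the build), we can verify that the original CSV file and image directory are reachable before launching the construction :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Optional sanity check : make sure the original dataset is reachable\n",
    "#\n",
    "for path in [dataset_csv, dataset_img]:\n",
    "    status = 'ok' if os.path.exists(path) else 'MISSING'\n",
    "    print(f'{status:8s} {path}')"
   ]
  },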
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 - Cooking function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def read_and_save( dataset_csv, dataset_img, shuffle=True, seed=None, scale=1,\n",
    "                   cluster_size=1000, cluster_dir='./dataset_cluster', cluster_name='images',\n",
    "                   image_size=(128,128),\n",
    "                   exit_if_exist=True):\n",
    "    '''\n",
    "    Will read the images and save a clustered dataset\n",
    "    args:\n",
    "        dataset_csv : list and description of original images\n",
    "        dataset_img : original images directory\n",
    "        shuffle     : shuffle data if True  (True)\n",
    "        seed        : random seed value. False mean no seed, None mean using /dev/urandom (None)\n",
    "        scale       : scale of dataset to use. 1. mean 100% (1.)\n",
    "        cluster_size : Size of generated cluster (10000)\n",
    "        cluster_dir  : Directory of generated clusters (''./dataset_cluster')\n",
    "        cluster_name : Name of generated clusters ('images')\n",
    "        image_size   : Size of generated images (128,128)\n",
    "        exit_if_exist : Exit if clusters still exists.\n",
    "    '''\n",
    "    global pwk\n",
    "    \n",
    "    def save_cluster(imgs,desc,cols,id):\n",
    "        file_img  = f'{cluster_dir}/{cluster_name}-{id:03d}.npy'\n",
    "        file_desc = f'{cluster_dir}/{cluster_name}-{id:03d}.csv'\n",
    "        np.save(file_img,  np.array(imgs))\n",
    "        df=pd.DataFrame(data=desc,columns=cols)\n",
    "        df.to_csv(file_desc, index=False)\n",
    "        return [],[],id+1\n",
    "    \n",
    "    pwk.chrono_start()\n",
    "    \n",
    "    # ---- Seed\n",
    "    #\n",
    "    if seed is not False:\n",
    "        np.random.seed(seed)\n",
    "        print(f'Seeded ({seed})')\n",
    "            \n",
    "    # ---- Read dataset description\n",
    "    #\n",
    "    dataset_desc = pd.read_csv(dataset_csv, header=0)\n",
    "    n=len(dataset_desc)\n",
    "    print(f'Description loaded ({n} images).')\n",
    "    \n",
    "    # ---- Shuffle\n",
    "    #\n",
    "    if shuffle:\n",
    "        dataset_desc = dataset_desc.reindex(np.random.permutation(dataset_desc.index))\n",
    "        print('Shuffled.')\n",
    "    cols = list(dataset_desc.columns)\n",
    "\n",
    "    # ---- Check if cluster files exist\n",
    "    #\n",
    "    if exit_if_exist and os.path.isfile(f'{cluster_dir}/images-000.npy'):\n",
    "        print('\\n*** Oups. There are already clusters in the target folder!\\n')\n",
    "        return 0,0\n",
    "    pwk.mkdir(cluster_dir)\n",
    "\n",
    "    # ---- Rescale\n",
    "    #\n",
    "    n=int(len(dataset_desc)*scale)\n",
    "    dataset = dataset_desc[:n]\n",
    "    cluster_size = int(cluster_size*scale)\n",
    "    print('Rescaled.')\n",
    "    pwk.subtitle('Parameters :')\n",
    "    print(f'Scale is : {scale}')\n",
    "    print(f'Image size is     : {image_size}')\n",
    "    print(f'dataset length is : {n}')\n",
    "    print(f'cluster size is   : {cluster_size}')\n",
    "    print(f'clusters nb  is   :',int(n/cluster_size + 1))\n",
    "    print(f'cluster dir  is   : {cluster_dir}')\n",
    "    \n",
    "    # ---- Read and save clusters\n",
    "    #\n",
    "    pwk.subtitle('Running...')\n",
    "    imgs, desc, cluster_id = [],[],0\n",
    "    #\n",
    "    for i,row in dataset.iterrows():\n",
    "        #\n",
    "        filename = f'{dataset_img}/{row.image_id}'\n",
    "        #\n",
    "        # ---- Read image, resize (and normalize)\n",
    "        #\n",
    "        img = io.imread(filename)\n",
    "        img = transform.resize(img, image_size)\n",
    "        #\n",
    "        # ---- Add image and description\n",
    "        #\n",
    "        imgs.append( img )\n",
    "        desc.append( row.values )\n",
    "        #\n",
    "        # ---- Progress bar\n",
    "        #\n",
    "        pwk.update_progress(f'Cluster {cluster_id:03d} :',len(imgs),cluster_size)\n",
    "        #\n",
    "        # ---- Save cluster if full\n",
    "        #\n",
    "        if len(imgs)==cluster_size:\n",
    "            imgs,desc,cluster_id=save_cluster(imgs,desc,cols, cluster_id)\n",
    "\n",
    "    # ---- Save uncomplete cluster\n",
    "    if len(imgs)>0 : imgs,desc,cluster_id=save_cluster(imgs,desc,cols,cluster_id)\n",
    "\n",
    "    duration=pwk.chrono_stop()\n",
    "    return cluster_id,duration\n"
   ]
  },
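  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each cluster is saved as a pair of files : an `images-nnn.npy` array and a matching `images-nnn.csv` description.  \n",
    "As an illustration of how such pairs could later be consumed, here is a minimal sketch (the `iter_clusters` helper is hypothetical, not the loader used in the next episodes) :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Minimal sketch : iterate over generated (npy,csv) cluster pairs\n",
    "#      Illustration only - iter_clusters is a hypothetical helper\n",
    "#\n",
    "def iter_clusters(cluster_dir, cluster_name='images'):\n",
    "    for file_img in sorted(glob.glob(f'{cluster_dir}/{cluster_name}-*.npy')):\n",
    "        file_desc = file_img.replace('.npy', '.csv')\n",
    "        yield np.load(file_img), pd.read_csv(file_desc)"
   ]
  },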
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.3 - Clusters building"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Build clusters\n",
    "#\n",
    "lx,ly        = image_size\n",
    "cluster_dir  = f'{output_dir}/clusters-{lx}x{ly}'\n",
    "\n",
    "cluster_nb,duration = read_and_save( dataset_csv, dataset_img,\n",
    "                                     shuffle       = True,\n",
    "                                     seed          = seed,\n",
    "                                     scale         = scale,\n",
    "                                     cluster_size  = cluster_size, \n",
    "                                     cluster_dir   = cluster_dir,\n",
    "                                     image_size    = image_size,\n",
    "                                     exit_if_exist = exit_if_exist)\n",
    "\n",
    "# ---- Conclusion...\n",
    "\n",
    "directory = pathlib.Path(cluster_dir)\n",
    "s=sum(f.stat().st_size for f in directory.glob('**/*') if f.is_file())\n",
    "\n",
    "pwk.subtitle('Ressources :')\n",
    "print('Duration     : ',pwk.hdelay(duration))\n",
    "print('Size         : ',pwk.hsize(s))\n",
    "\n",
    "pwk.subtitle('Estimation with scale=1 :')\n",
    "print('Duration     : ',pwk.hdelay(duration*(1/scale)))\n",
    "print('Size         : ',pwk.hsize(s*(1/scale)))\n"
   ]
  },
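  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To check the result, we can reload the first cluster and display a few images (a minimal sketch, assuming at least one cluster was written in `cluster_dir`) :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Quick check : reload the first cluster and show a few images\n",
    "#      Assumes at least one cluster file exists in cluster_dir\n",
    "#\n",
    "imgs = np.load(f'{cluster_dir}/images-000.npy')\n",
    "desc = pd.read_csv(f'{cluster_dir}/images-000.csv')\n",
    "print('Cluster shape :', imgs.shape)\n",
    "\n",
    "fig, axs = plt.subplots(1, 5, figsize=(15,3))\n",
    "for ax, img in zip(axs, imgs[:5]):\n",
    "    ax.imshow(img)\n",
    "    ax.axis('off')\n",
    "plt.show()"
   ]
  },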
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pwk.end()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "<img width=\"80px\" src=\"../fidle/img/00-Fidle-logo-01.svg\"></img>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}