{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "<img width=\"800px\" src=\"../fidle/img/00-Fidle-header-01.svg\"></img>\n", "\n", "# <!-- TITLE --> [VAE6] - Generation of a clustered dataset\n", "<!-- DESC --> Episode 2 : Analysis of the CelebA dataset and creation of an clustered and usable dataset\n", "<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->\n", "\n", "## Objectives :\n", " - Formatting our dataset in **cluster files**, using batch mode\n", " - Adapting a notebook for batch use\n", "\n", "\n", "The [CelebFaces Attributes Dataset (CelebA)](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html) contains about **200,000 images** (202599,218,178,3). \n", "The size and the number of files of this dataset make it impossible to use it as it is. \n", "A formatting in the form of clusters of n images is essential.\n", "\n", "\n", "## What we're going to do :\n", " - Lire les images\n", " - redimensionner et normaliser celles-ci,\n", " - Constituer des clusters d'images en format npy\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1 - Import and init" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "from skimage import io, transform\n", "\n", "import os,pathlib,time,sys,json,glob\n", "import csv\n", "import math, random\n", "\n", "import fidle\n", "\n", "# Init Fidle environment\n", "run_id, run_dir, datasets_dir = fidle.init('VAE6')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2 - Parameters\n", "All the dataset will be use for training \n", "Reading the 200,000 images can take a long time **(>20 minutes)** and a lot of place **(>170 GB)** \n", "Example : \n", "Image Sizes: 128x128 : 74 GB \n", "Image Sizes: 192x160 : 138 GB \n", "\n", "You can define theses parameters : \n", "`scale` : 1 mean 100% of the dataset - set 0.05 for tests \n", "`image_size` : images size in the clusters, should be 128x128 or 192,160 - original size is (218,178) \n", "`output_dir` : where to write clusters, could be :\n", " - `./data`, for tests purpose\n", " - `<datasets_dir>/celeba/enhanced` to add clusters in your datasets dir. \n", " \n", "`cluster_size` : number of images in a cluster, 10000 is fine. (will be adjust by scale) \n", "`progress_verbosity`: Verbosity of progress bar: 0=silent, 1=progress bar, 2=One line \n", "\n", "**Note :** If the target folder is not empty and exit_if_exist is True, the construction is blocked. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ---- Parameters you can change -----------------------------------\n", "#\n", "progress_verbosity = 1\n", "\n", "# ---- Just for tests\n", "# Save clustered dataset in ./data\n", "#\n", "scale = 0.05\n", "seed = 123\n", "cluster_size = 10000\n", "image_size = (128,128)\n", "output_dir = './data'\n", "exit_if_exist = False\n", "\n", "# ---- Full clusters generation, medium size : 74 GB\n", "# Save clustered dataset in <datasets_dir> \n", "#\n", "# scale = 1.\n", "# seed = 123\n", "# cluster_size = 10000\n", "# image_size = (128,128)\n", "# output_dir = f'{datasets_dir}/celeba/enhanced'\n", "# exit_if_exist = True\n", "\n", "# ---- Just for tests\n", "# Save clustered dataset in ./data\n", "#\n", "# scale = 0.05\n", "# seed = 123\n", "# cluster_size = 10000\n", "# image_size = (192,160)\n", "# output_dir = './data'\n", "# exit_if_exist = False\n", "\n", "# ---- Full clusters generation, large size : 138 GB\n", "# Save clustered dataset in <datasets_dir> \n", "#\n", "# scale = 1.\n", "# seed = 123\n", "# cluster_size = 10000\n", "# image_size = (192,160)\n", "# output_dir = f'{datasets_dir}/celeba/enhanced'\n", "# exit_if_exist = True" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ---- Used for continous integration - Just forget these lines\n", "#\n", "fidle.override('progress_verbosity', 'scale', 'seed', )\n", "fidle.override('cluster_size', 'image_size', 'output_dir', 'exit_if_exist')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3 - Cluster construction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 - Directories and files :" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset_csv = f'{datasets_dir}/celeba/origine/list_attr_celeba.csv'\n", "dataset_img = f'{datasets_dir}/celeba/origine/img_align_celeba'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 - Cooking function" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def read_and_save( dataset_csv, dataset_img, shuffle=True, seed=None, scale=1,\n", " cluster_size=1000, cluster_dir='./dataset_cluster', cluster_name='images',\n", " image_size=(128,128), exit_if_exist=True, verbosity=1):\n", " '''\n", " Will read the images and save a clustered dataset\n", "\n", " Args:\n", " dataset_csv : list and description of original images\n", " dataset_img : original images directory\n", " shuffle : shuffle data if True (True)\n", " seed : random seed value. False mean no seed, None mean using /dev/urandom (None)\n", " scale : scale of dataset to use. 1. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 - Cooking function" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def read_and_save( dataset_csv, dataset_img, shuffle=True, seed=None, scale=1,\n", "                   cluster_size=1000, cluster_dir='./dataset_cluster', cluster_name='images',\n", "                   image_size=(128,128), exit_if_exist=True, verbosity=1):\n", "    '''\n", "    Read the original images and save them as a clustered dataset\n", "\n", "    Args:\n", "        dataset_csv : list and description of original images\n", "        dataset_img : original images directory\n", "        shuffle : shuffle data if True (True)\n", "        seed : random seed value. False means no seed, None means using /dev/urandom (None)\n", "        scale : scale of dataset to use. 1. means 100% (1.)\n", "        cluster_size : size of generated clusters (1000)\n", "        cluster_dir : directory of generated clusters ('./dataset_cluster')\n", "        cluster_name : name of generated clusters ('images')\n", "        image_size : size of generated images (128,128)\n", "        exit_if_exist : exit if clusters already exist (True)\n", "        verbosity : progress bar verbosity : 0=silent, 1=progress bar, 2=one line (1)\n", "\n", "    Returns:\n", "        nb_clusters : number of clusters\n", "        duration : total duration\n", "    '''\n", "\n", "    def save_cluster(imgs,desc,cols,id):\n", "        file_img = f'{cluster_dir}/{cluster_name}-{id:03d}.npy'\n", "        file_desc = f'{cluster_dir}/{cluster_name}-{id:03d}.csv'\n", "        np.save(file_img, np.array(imgs))\n", "        df=pd.DataFrame(data=desc,columns=cols)\n", "        df.to_csv(file_desc, index=False)\n", "        return [],[],id+1\n", "\n", "    chrono = fidle.Chrono()\n", "    chrono.start()\n", "\n", "    # ---- Seed\n", "    #\n", "    if seed is not False:\n", "        np.random.seed(seed)\n", "        print(f'Seeded ({seed})')\n", "\n", "    # ---- Read dataset description\n", "    #\n", "    dataset_desc = pd.read_csv(dataset_csv, header=0)\n", "    n=len(dataset_desc)\n", "    print(f'Description loaded ({n} images).')\n", "\n", "    # ---- Shuffle\n", "    #\n", "    if shuffle:\n", "        dataset_desc = dataset_desc.reindex(np.random.permutation(dataset_desc.index))\n", "        print('Shuffled.')\n", "    cols = list(dataset_desc.columns)\n", "\n", "    # ---- Check if cluster files already exist\n", "    #\n", "    if exit_if_exist and os.path.isfile(f'{cluster_dir}/{cluster_name}-000.npy'):\n", "        print('\\n*** Oops. There are already clusters in the target folder!\\n')\n", "        return 0,0\n", "    fidle.utils.mkdir(cluster_dir)\n", "\n", "    # ---- Rescale\n", "    #\n", "    n=int(len(dataset_desc)*scale)\n", "    dataset = dataset_desc[:n]\n", "    cluster_size = int(cluster_size*scale)\n", "    print('Rescaled.')\n", "    fidle.utils.subtitle('Parameters :')\n", "    print(f'Scale is : {scale}')\n", "    print(f'Image size is : {image_size}')\n", "    print(f'dataset length is : {n}')\n", "    print(f'cluster size is : {cluster_size}')\n", "    print(f'clusters nb is : {math.ceil(n/cluster_size)}')\n", "    print(f'cluster dir is : {cluster_dir}')\n", "\n", "    # ---- Read and save clusters\n", "    #\n", "    fidle.utils.subtitle('Running...')\n", "    imgs, desc, cluster_id = [],[],0\n", "    #\n", "    for i,row in dataset.iterrows():\n", "        #\n", "        filename = f'{dataset_img}/{row.image_id}'\n", "        #\n", "        # ---- Read image, resize (and normalize)\n", "        #\n", "        img = io.imread(filename)\n", "        img = transform.resize(img, image_size)\n", "        #\n", "        # ---- Add image and description\n", "        #\n", "        imgs.append( img )\n", "        desc.append( row.values )\n", "        #\n", "        # ---- Progress bar\n", "        #\n", "        fidle.utils.update_progress(f'Cluster {cluster_id:03d} :',len(imgs),\n", "                                    cluster_size, verbosity=verbosity)\n", "        #\n", "        # ---- Save cluster if full\n", "        #\n", "        if len(imgs)==cluster_size:\n", "            imgs,desc,cluster_id=save_cluster(imgs,desc,cols, cluster_id)\n", "\n", "    # ---- Save incomplete cluster\n", "    if len(imgs)>0 : imgs,desc,cluster_id=save_cluster(imgs,desc,cols,cluster_id)\n", "\n", "    duration=chrono.get_delay(format='seconds')\n", "    return cluster_id,duration\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 3.3 - Clusters building" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ---- Build clusters\n", "#\n", "lx,ly = image_size\n", "cluster_dir = f'{output_dir}/clusters-{lx}x{ly}'\n", "\n", "cluster_nb,duration = read_and_save( dataset_csv, dataset_img,\n", "                                     shuffle       = True,\n", "                                     seed          = seed,\n", "                                     scale         = scale,\n", "                                     cluster_size  = cluster_size,\n", "                                     cluster_dir   = cluster_dir,\n", "                                     image_size    = image_size,\n", "                                     exit_if_exist = exit_if_exist,\n", "                                     verbosity     = progress_verbosity )\n", "\n", "# ---- Conclusion...\n", "\n", "directory = pathlib.Path(cluster_dir)\n", "s=sum(f.stat().st_size for f in directory.glob('**/*') if f.is_file())\n", "\n", "fidle.utils.subtitle('Resources :')\n", "print('Duration : ',fidle.utils.hdelay(duration))\n", "print('Size : ',fidle.utils.hsize(s))\n", "\n", "fidle.utils.subtitle('Estimation with scale=1 :')\n", "print('Duration : ',fidle.utils.hdelay(duration*(1/scale)))\n", "print('Size : ',fidle.utils.hsize(s*(1/scale)))\n" ] },
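{ "cell_type": "markdown", "metadata": {}, "source": [ "To check that the clusters are actually usable, the cell below (a read-back sketch added for illustration, assuming at least one cluster was written above) reloads the first cluster and its description file, and displays one image.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ---- Read-back check (sketch) : reload the first generated cluster\n", "#\n", "imgs = np.load(f'{cluster_dir}/images-000.npy')\n", "desc = pd.read_csv(f'{cluster_dir}/images-000.csv', header=0)\n", "\n", "print('Images shape :', imgs.shape)   # (cluster_size, lx, ly, 3)\n", "print('Images dtype :', imgs.dtype)   # float64, values in [0,1]\n", "print('Descriptions :', len(desc))\n", "\n", "# Show one reloaded image\n", "plt.imshow(imgs[0])\n", "plt.axis('off')\n", "plt.show()" ] },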
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fidle.end()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "<img width=\"80px\" src=\"../fidle/img/00-Fidle-logo-01.svg\"></img>" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.2 ('fidle-env')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" }, "vscode": { "interpreter": { "hash": "b3929042cc22c1274d74e3e946c52b845b57cb6d84f2d591ffe0519b38e4896d" } } }, "nbformat": 4, "nbformat_minor": 4 }