{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "<img width=\"800px\" src=\"../fidle/img/00-Fidle-header-01.svg\"></img>\n", "\n", "# <!-- TITLE --> [VAE6] - Generation of a clustered dataset\n", "<!-- DESC --> Episode 2 : Analysis of the CelebA dataset and creation of a clustered and usable dataset\n", "<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->\n", "\n", "## Objectives :\n", " - Formatting our dataset in **cluster files**, using batch mode\n", " - Adapting a notebook for batch use\n", "\n", "\n", "The [CelebFaces Attributes Dataset (CelebA)](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html) contains about 200,000 images, with shape (202599,218,178,3).  \n", "\n", "\n", "## What we're going to do :\n", " - Read the images\n", " - Resize and normalize them\n", " - Build clusters of images in npy format\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1 - Import and init\n", "### 1.1 - Import" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<style>\n", "\n", "div.warn { \n", " background-color: #fcf2f2;\n", " border-color: #dFb5b4;\n", " border-left: 5px solid #dfb5b4;\n", " padding: 0.5em;\n", " font-weight: bold;\n", " font-size: 1.1em;;\n", " }\n", "\n", "\n", "\n", "div.nota { \n", " background-color: #DAFFDE;\n", " border-left: 5px solid #92CC99;\n", " padding: 0.5em;\n", " }\n", "\n", "div.todo:before { content:url();\n", " float:left;\n", " margin-right:20px;\n", " margin-top:-20px;\n", " margin-bottom:20px;\n", "}\n", "div.todo{\n", " font-weight: bold;\n", " font-size: 1.1em;\n", " margin-top:40px;\n", "}\n", "div.todo ul{\n", " margin: 0.2em;\n", "}\n", "div.todo li{\n", " margin-left:60px;\n", " margin-top:0;\n", " margin-bottom:0;\n", "}\n", "\n", "div .comment{\n", " font-size:0.8em;\n", " color:#696969;\n", "}\n", "\n", "\n", "\n", "</style>\n", "\n" ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {}, 
"output_type": "display_data" }, { "data": { "text/markdown": [ "<br>**FIDLE 2020 - Practical Work Module**" ], "text/plain": [ "<IPython.core.display.Markdown object>" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Version : 2.0.7\n", "Notebook id : VAE6\n", "Run time : Wednesday 27 January 2021, 09:48:49\n", "TensorFlow version : 2.2.0\n", "Keras version : 2.3.0-tf\n", "Datasets dir : /gpfswork/rech/mlh/uja62cb/datasets\n", "Run dir : ./run\n", "Update keras cache : False\n" ] } ], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "from skimage import io, transform\n", "\n", "import os,pathlib,time,sys,json,glob\n", "import csv\n", "import math, random\n", "\n", "from importlib import reload\n", "\n", "sys.path.append('..')\n", "import fidle.pwk as pwk\n", "\n", "datasets_dir = pwk.init('VAE6')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 - Parameters\n", "The whole dataset will be used for training.  \n", "Reading the 200,000 images can take a long time **(>20 minutes)** and a lot of disk space **(>170 GB)**  \n", "Examples :  \n", "Image size 128x128 : 74 GB  \n", "Image size 192x160 : 138 GB  \n", "\n", "You can define these parameters :  \n", "`scale` : 1 means 100% of the dataset - set 0.05 for tests  \n", "`image_size` : image size in the clusters, should be (128,128) or (192,160) (original is (218,178))  \n", "`output_dir` : where to write clusters, could be :\n", " - `./data`, for testing purposes\n", " - `<datasets_dir>/celeba/enhanced` to add clusters to your datasets dir.  \n", " \n", "`cluster_size` : number of images in a cluster, 10000 is fine. (will be adjusted by scale)\n", "\n", "**Note :** If the target folder already contains clusters and `exit_if_exist` is True, the construction is blocked. 
" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# ---- Parameters you can change -----------------------------------\n", "\n", "# ---- Tests\n", "scale = 0.02\n", "cluster_size = 10000\n", "image_size = (128,128)\n", "output_dir = './data'\n", "exit_if_exist = False\n", "\n", "# ---- Full clusters generation, medium size\n", "# scale = 1.\n", "# cluster_size = 10000\n", "# image_size = (128,128)\n", "# output_dir = f'{datasets_dir}/celeba/enhanced'\n", "# exit_if_exist = True\n", "\n", "# ---- Full clusters generation, large size\n", "# scale = 1.\n", "# cluster_size = 10000\n", "# image_size = (192,160)\n", "# output_dir = f'{datasets_dir}/celeba/enhanced'\n", "# exit_if_exist = True" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# ---- Used for continuous integration - Just forget this line\n", "#\n", "pwk.override('scale', 'cluster_size', 'image_size', 'output_dir', 'exit_if_exist')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.3 - Directories and files :" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "dataset_csv = f'{datasets_dir}/celeba/origine/list_attr_celeba.csv'\n", "dataset_img = f'{datasets_dir}/celeba/origine/img_align_celeba'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2 - Read and shuffle filenames catalog" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "dataset_desc = pd.read_csv(dataset_csv, header=0)\n", "dataset_desc = dataset_desc.reindex(np.random.permutation(dataset_desc.index))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3 - Save as clusters of n images" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.2 - Cooking function" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def read_and_save( dataset_img, dataset_desc, 
scale=1,\n", "                   cluster_size=1000, cluster_dir='./dataset_cluster', cluster_name='images',\n", "                   image_size=(128,128),\n", "                   exit_if_exist=True):\n", "    \n", "    def save_cluster(imgs,desc,cols,cid):\n", "        # Write one cluster (images as .npy, attributes as .csv), then reset the buffers\n", "        file_img = f'{cluster_dir}/{cluster_name}-{cid:03d}.npy'\n", "        file_desc = f'{cluster_dir}/{cluster_name}-{cid:03d}.csv'\n", "        np.save(file_img, np.array(imgs))\n", "        df=pd.DataFrame(data=desc,columns=cols)\n", "        df.to_csv(file_desc, index=False)\n", "        return [],[],cid+1\n", "    \n", "    pwk.chrono_start()\n", "    cols = list(dataset_desc.columns)\n", "\n", "    # ---- Check if cluster files exist\n", "    #\n", "    if exit_if_exist and os.path.isfile(f'{cluster_dir}/{cluster_name}-000.npy'):\n", "        print('\\n*** Oops. There are already clusters in the target folder!\\n')\n", "        return 0,0\n", "    pwk.mkdir(cluster_dir)\n", "\n", "    # ---- Scale\n", "    #\n", "    n=int(len(dataset_desc)*scale)\n", "    dataset = dataset_desc[:n]\n", "    # Keep at least one image per cluster, even for a very small scale\n", "    cluster_size = max(1, int(cluster_size*scale))\n", "    pwk.subtitle('Parameters :')\n", "    print(f'Scale is : {scale}')\n", "    print(f'Image size is : {image_size}')\n", "    print(f'dataset length is : {n}')\n", "    print(f'cluster size is : {cluster_size}')\n", "    print(f'clusters nb is : {math.ceil(n/cluster_size)}')\n", "    print(f'cluster dir is : {cluster_dir}')\n", "    \n", "    # ---- Read and save clusters\n", "    #\n", "    pwk.subtitle('Running...')\n", "    imgs, desc, cluster_id = [],[],0\n", "    #\n", "    for i,row in dataset.iterrows():\n", "        #\n", "        filename = f'{dataset_img}/{row.image_id}'\n", "        #\n", "        # ---- Read image, resize (and normalize)\n", "        #      transform.resize returns float values in [0,1]\n", "        #\n", "        img = io.imread(filename)\n", "        img = transform.resize(img, image_size)\n", "        #\n", "        # ---- Add image and description\n", "        #\n", "        imgs.append( img )\n", "        desc.append( row.values )\n", "        #\n", "        # ---- Progress bar\n", "        #\n", "        pwk.update_progress(f'Cluster {cluster_id:03d} :',len(imgs),cluster_size)\n", "        #\n", "        # ---- Save cluster if full\n", "        #\n", "        if len(imgs)==cluster_size:\n", "            
imgs,desc,cluster_id=save_cluster(imgs,desc,cols, cluster_id)\n", "\n", "    # ---- Save incomplete cluster\n", "    if len(imgs)>0 : imgs,desc,cluster_id=save_cluster(imgs,desc,cols,cluster_id)\n", "\n", "    duration=pwk.chrono_stop()\n", "    return cluster_id,duration\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.3 - Cluster building" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "<br>**Parameters :**" ], "text/plain": [ "<IPython.core.display.Markdown object>" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Scale is : 0.02\n", "Image size is : (128, 128)\n", "dataset length is : 4051\n", "cluster size is : 200\n", "clusters nb is : 21\n", "cluster dir is : ./data/clusters-128x128\n" ] }, { "data": { "text/markdown": [ "<br>**Running...**" ], "text/plain": [ "<IPython.core.display.Markdown object>" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Cluster 000 : [########################################] 100.0% of 200\n", "Cluster 001 : [########################################] 100.0% of 200\n", "Cluster 002 : [########################################] 100.0% of 200\n", "Cluster 003 : [########################################] 100.0% of 200\n", "Cluster 004 : [########################################] 100.0% of 200\n", "Cluster 005 : [########################################] 100.0% of 200\n", "Cluster 006 : [########################################] 100.0% of 200\n", "Cluster 007 : [########################################] 100.0% of 200\n", "Cluster 008 : [########################################] 100.0% of 200\n", "Cluster 009 : [########################################] 100.0% of 200\n", "Cluster 010 : [########################################] 100.0% of 200\n", "Cluster 011 : [########################################] 100.0% of 200\n", "Cluster 012 : 
[########################################] 100.0% of 200\n", "Cluster 013 : [########################################] 100.0% of 200\n", "Cluster 014 : [########################################] 100.0% of 200\n", "Cluster 015 : [########################################] 100.0% of 200\n", "Cluster 016 : [########################################] 100.0% of 200\n", "Cluster 017 : [########################################] 100.0% of 200\n", "Cluster 018 : [########################################] 100.0% of 200\n", "Cluster 019 : [########################################] 100.0% of 200\n", "Cluster 020 : [##########------------------------------] 25.0% of 200\r" ] }, { "data": { "text/markdown": [ "<br>**Conclusion :**" ], "text/plain": [ "<IPython.core.display.Markdown object>" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Duration : 0:01:57\n", "Size : 1.5 Go\n" ] } ], "source": [ "# ---- Build clusters\n", "#\n", "lx,ly = image_size\n", "cluster_dir = f'{output_dir}/clusters-{lx}x{ly}'\n", "\n", "cluster_nb,duration = read_and_save( dataset_img, dataset_desc,\n", " scale = scale,\n", " cluster_size = cluster_size, \n", " cluster_dir = cluster_dir,\n", " image_size = image_size,\n", " exit_if_exist = exit_if_exist)\n", "\n", "# ---- Conclusion...\n", "\n", "directory = pathlib.Path(cluster_dir)\n", "s=sum(f.stat().st_size for f in directory.glob('**/*') if f.is_file())\n", "\n", "pwk.subtitle('Conclusion :')\n", "print('Duration : ',pwk.hdelay(duration))\n", "print('Size : ',pwk.hsize(s))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "End time is : Wednesday 27 January 2021, 09:50:47\n", "Duration is : 00:01:58 822ms\n", "This notebook ends here\n" ] } ], "source": [ "pwk.end()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "<img width=\"80px\" src=\"../fidle/img/00-Fidle-logo-01.svg\"></img>" 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }