{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img width=\"800px\" src=\"../fidle/img/00-Fidle-header-01.svg\"></img>\n",
    "\n",
    "# <!-- TITLE --> [SYNOP1] - Preparation of data\n",
    "<!-- DESC --> Episode 1 : Data analysis and preparation of a usuable meteorological dataset (SYNOP)\n",
    "<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->\n",
    "\n",
    "## Objectives :\n",
    " - Undestand the data\n",
    " - cleanup a usable dataset\n",
    "\n",
    "\n",
    "SYNOP meteorological data, can be found on :  \n",
    "https://public.opendatasoft.com  \n",
    "\n",
    "About SYNOP datasets :  \n",
    "https://public.opendatasoft.com/explore/dataset/donnees-synop-essentielles-omm/information/?sort=date\n",
    "\n",
    "This dataset contains a set of measurements (temperature, pressure, ...) made every 3 hours at the LYS airport.  \n",
    "The objective will be to predict the evolution of the weather !\n",
    "\n",
    "## What we're going to do :\n",
    "\n",
    " - Read the data\n",
    " - Cleanup and build a usable dataset\n",
    "\n",
    "## Step 1 - Import and init"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "from tensorflow import keras\n",
    "from tensorflow.keras.callbacks import TensorBoard\n",
    "\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "import pandas as pd\n",
    "import h5py, json\n",
    "import os,time,sys\n",
    "import math, random\n",
    "\n",
    "from importlib import reload\n",
    "\n",
    "sys.path.append('..')\n",
    "import fidle.pwk as pwk\n",
    "\n",
    "run_dir = './run/SYNOP'\n",
    "datasets_dir = pwk.init('SYNOP1', run_dir)\n",
    "\n",
    "pd.set_option('display.max_rows',200)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2 - Parameters\n",
    "`output_dir` : where to save our enhanced dataset.  \n",
    "./data is a good choice because our dataset will be very small."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Our future enhanced dataset (no need to change)\n",
    "#\n",
    "dataset_filename     = 'synop-LYS.csv'\n",
    "description_filename = 'synop.json'\n",
    "output_dir           = './data'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Override parameters (batch mode) - Just forget this cell"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pwk.override('output_dir')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3 - Retrieve the dataset\n",
    "There are two parts to recover:\n",
    " - The data itself (csv)\n",
    " - Description of the data (json)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_filename   = 'origine/donnees-synop-essentielles-omm-LYS.csv'\n",
    "schema_filename = 'origine/schema.json'"
   ]
  },
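  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both files are expected under `{datasets_dir}/SYNOP/`. If they need to be rebuilt, the raw data can be exported from Opendatasoft.  \n",
    "The cell below is only a sketch : the export URL format is an assumption, and a raw export covers all stations, not just LYS."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Hypothetical download sketch (assumption : Opendatasoft '/download/?format=csv' endpoint)\n",
    "#      Note : a raw export covers every station ; this notebook expects a LYS-only extract.\n",
    "#\n",
    "import urllib.request\n",
    "\n",
    "url = ( 'https://public.opendatasoft.com/explore/dataset/'\n",
    "        'donnees-synop-essentielles-omm/download/?format=csv' )\n",
    "\n",
    "# Uncomment to actually fetch it (large file !) :\n",
    "# urllib.request.urlretrieve(url, 'donnees-synop-essentielles-omm.csv')\n",
    "print('Raw data export URL :', url)"
   ]
  },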
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.1 - Read dataset description\n",
    "We need the list and description of the columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(f'{datasets_dir}/SYNOP/{schema_filename}','r') as json_file:\n",
    "    schema = json.load(json_file)\n",
    "\n",
    "synop_codes=list( schema['definitions']['donnees-synop-essentielles-omm_records']['properties']['fields']['properties'].keys() )"
   ]
  },
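  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The chained lookup above assumes the schema JSON is shaped roughly as below (a minimal, abbreviated sketch ; the three field names are just examples) :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Minimal sketch of the expected schema structure (abbreviated, for illustration only)\n",
    "#\n",
    "example_schema = {\n",
    "    'definitions': {\n",
    "        'donnees-synop-essentielles-omm_records': {\n",
    "            'properties': {\n",
    "                'fields': {\n",
    "                    'properties': { 'date': {}, 'pmer': {}, 'tc': {} }   # one key per column code\n",
    "                }\n",
    "            }\n",
    "        }\n",
    "    }\n",
    "}\n",
    "\n",
    "# Same navigation as above :\n",
    "print(list( example_schema['definitions']['donnees-synop-essentielles-omm_records']['properties']['fields']['properties'].keys() ))"
   ]
  },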
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 - Read data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(f'{datasets_dir}/SYNOP/{data_filename}', header=0, sep=';')\n",
    "pwk.subtitle('Raw data :')\n",
    "display(df.tail(10))\n",
    "\n",
    "# ---- Get the columns name as descriptions\n",
    "synop_desc = list(df.columns)\n",
    "\n",
    "# ---- Set Codes as columns name\n",
    "df.columns   = synop_codes\n",
    "code2desc    = dict(zip(synop_codes, synop_desc))\n",
    "\n",
    "# ---- Count the na values by columns\n",
    "columns_na = df.isna().sum().tolist()\n",
    "\n",
    "# ---- Show all of that\n",
    "df_desc=pd.DataFrame({'Code':synop_codes, 'Description':synop_desc, 'Na':columns_na})\n",
    "\n",
    "pwk.subtitle('List of columns :')\n",
    "display(df_desc.style.set_properties(**{'text-align': 'left'}))\n",
    "\n",
    "print('Shape is : ', df.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4 - Prepare dataset\n",
    "### 4.1 - Keep only certain columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "columns_used=['date','pmer','tend','cod_tend','dd','ff','td','u','ww','pres','rafper','per','rr1','rr3','tc']\n",
    "\n",
    "# ---- Drop unused columns\n",
    "\n",
    "to_drop = df.columns.difference(columns_used)\n",
    "df.drop( to_drop, axis=1, inplace=True)\n",
    "\n",
    "# ---- Show all of that\n",
    "\n",
    "pwk.subtitle('Our selected columns :')\n",
    "display(df.head(20))\n",
    "\n",
    "pwk.subtitle('Few statistics :')\n",
    "display(df.describe().style.format('{:.2f}'))\n",
    "\n",
    "# ---- 'per' column is constant, we can drop it\n",
    "\n",
    "df.drop(['per'],axis=1,inplace=True)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.2 - Cleanup dataset\n",
    "Let's sort it and cook up some NaN values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- First of all, we have to sort on the date\n",
    "\n",
    "df.sort_values(['date'],  inplace=True)\n",
    "df.reset_index(drop=True, inplace=True)\n",
    "\n",
    "# ---- Before : Lines with NaN\n",
    "\n",
    "na_rows=df.isna().any(axis=1)\n",
    "pwk.subtitle('Before :')\n",
    "display( df[na_rows].head(10) )\n",
    "\n",
    "# ---- Nice interpolation for plugging holes\n",
    "\n",
    "df.interpolate(inplace=True)\n",
    "\n",
    "# ---- After\n",
    "\n",
    "pwk.subtitle('After :')\n",
    "display(df[na_rows].head(10))\n"
   ]
  },
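  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick sanity check (not part of the original notebook) : `interpolate()` is linear by default and cannot fill NaN located before the first valid value, so it is worth counting what remains."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Sanity check : count the NaN left after interpolation\n",
    "#      (leading NaN, before the first valid value, cannot be interpolated)\n",
    "remaining = int(df.isna().sum().sum())\n",
    "print(f'NaN remaining after interpolation : {remaining}')"
   ]
  },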
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5 - About our enhanced dataset\n",
    "### 5.1 - Summarize it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Count the na values by columns\n",
    "dataset_na    = df.isna().sum().tolist()\n",
    "dataset_cols  = df.columns.tolist()\n",
    "dataset_desc  = [ code2desc[c] for c in dataset_cols ]\n",
    "\n",
    "# ---- Show all of that\n",
    "df_desc=pd.DataFrame({'Columns':dataset_cols, 'Description':dataset_desc, 'Na':dataset_na})\n",
    "pwk.subtitle('Dataset columns :')\n",
    "display(df_desc.style.set_properties(**{'text-align': 'left'}))\n",
    "\n",
    "pwk.subtitle('Have a look :')\n",
    "display(df.tail(20))\n",
    "print('Shape is : ', df.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.2 - Have a look (1 month)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "i=random.randint(0,len(df)-240)\n",
    "df.iloc[i:i+240].plot(subplots=True, fontsize=12, figsize=(16,20))\n",
    "pwk.save_fig('01-one-month')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 6 - Save it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Save it\n",
    "#\n",
    "pwk.mkdir(output_dir)\n",
    "\n",
    "filedata = f'{output_dir}/{dataset_filename}'\n",
    "filedesc = f'{output_dir}/{description_filename}'\n",
    "\n",
    "df.to_csv(filedata, sep=';', index=False)\n",
    "size=os.path.getsize(filedata)/(1024*1024)\n",
    "print(f'Dataset saved. ({size:0.1f} Mo)')\n",
    "\n",
    "with open(filedesc, 'w', encoding='utf-8') as f:\n",
    "    json.dump(code2desc, f, indent=4)\n",
    "print('Synop description saved.')\n",
    "    "
   ]
  },
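  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An optional round-trip check (not in the original notebook) : reload the two files we just saved, to make sure they read back as expected."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Optional round-trip check (not part of the original workflow)\n",
    "#\n",
    "df_check = pd.read_csv(filedata, header=0, sep=';')\n",
    "with open(filedesc, 'r', encoding='utf-8') as f:\n",
    "    desc_check = json.load(f)\n",
    "\n",
    "print('Reloaded dataset      :', df_check.shape)\n",
    "print('Reloaded descriptions :', len(desc_check), 'columns')"
   ]
  },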
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pwk.end()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "<img width=\"80px\" src=\"../fidle/img/00-Fidle-logo-01.svg\"></img>"
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "8e38643e33497db9a306e3f311fa98cb1e65371278ca73ee4ea0c76aa5a4f387"
  },
  "kernelspec": {
   "display_name": "Python 3.9.7 64-bit ('fidle-cpu': conda)",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}