## BBC data import (quick) guide
### Process overview
The steps in this guide will let you move data from json/csv files into a relational DBMS such as PostgreSQL. A pack of SQL scripts restructures the data and constructs a database following the data model concept. The whole process is divided into phases:
A. Pre-configuration (sort of)
B. Data import
C. Mapping & flattening
D. Restructuring / Normalizing / Build
#### Process steps
Those phases are technically a sequence of the following steps:
##### phase A
- [joining multiple data files](#joining-multiple-files-into-one) - collections into one file _(optional)_
- [transferring](#locating-data) the data file to the server _(optional; admin access required)_
##### phase B
- [creating a database](#database-setup) with spatial extensions
- importing - supported for **json** and **csv** formats
- _(I)_ from a [json file](#importing-json-data--i) into a _(jsonb type)_ PostgreSQL table
- _(II)_ from a [csv file](#importing-from-csv--ii) into a flat PostgreSQL table
##### phase C
- [mapping and flattening](#flattening--mapping) data in a PostgreSQL table
##### phase D
- [preparing](#prepare-a-session) a session _(optional)_
- [building](#building-the-database) the database - **creating and populating tables**
> Database tables are built according to the [data model][bbc data model].
![data conditioning flow](data-conditioning_final.png "data conditioning flow")
_Add comments or modify this schematic representation of the import process by [opening][data conditioning] it in editing mode._
### Joining multiple files into one
This rather easy step can be done with the _cross-platform_ command _`cat`_, available via the console. Pressing <kbd>Ctrl</kbd>+<kbd>Alt</kbd>+<kbd>T</kbd> on _Linux_ systems will open a terminal window.
`$ cat /path/to/data_one.json /path/to/data_two.json > /path/to/data_all.json`
The **`head`** and **`tail`** commands will help you verify the contents of the concatenated file.
`$ head -5 /path/to/data_all.json`
`$ tail -5 /path/to/data_all.json`
Those will display the first five and the last five lines on _stdout_ _(without a numeric argument both commands output 10 lines)_.
>Any file with one entry per line can be processed this way, whether it is json, csv or any other type.
Count the total number of lines with
`$ wc -l /path/to/data_all.json`
>On a Windows system the _path_ should be prefixed with a drive letter, e.g. **`C:/`**, and the character __`\`__ should be used instead of **`/`** as the folder name separator.
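The join can be sanity-checked by comparing line counts before and after. A minimal sketch with throwaway stand-in files (file names and contents are illustrative):

```shell
# create two tiny json-lines stand-ins for the real collections
printf '%s\n' '{"id":1}' '{"id":2}' > data_one.json
printf '%s\n' '{"id":3}' > data_two.json

# join them into one file
cat data_one.json data_two.json > data_all.json

# the joined file should contain 2 + 1 = 3 lines
wc -l < data_all.json
```

If the total does not match the sum of the inputs, a missing trailing newline in one of the source files is the usual suspect.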
### Locating data
Transferring multiple GB of data to a desired location can be sped up with the _Common Internet File System (CIFS)_ protocol.
If necessary, install _`cifs-utils`_ using the apt command
`$ sudo apt-get install cifs-utils`
Become _**root**_
`$ sudo -i`
Choose a location (create a folder)
`# mkdir /mnt/BBC`
Then mount the repository
`# mount -t cifs // /mnt/BBC -o username=<user>`
Copy the contents
`# cp -a /mnt/BBC/garage/bbc/data/db_bbc_clean.csv /home/<user>/`
Then un-mount the point
`# umount -l /mnt/BBC`
Exit the root session
`# exit`
The last thing to do is to modify the owner and permissions of the transferred file.
`$ sudo chown <user> db_bbc_clean.csv`
`$ sudo chmod 664 db_bbc_clean.csv`
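The result of the two commands above can be verified before moving on. A small sketch, assuming GNU coreutils (`stat -c`; BSD/macOS uses different flags) and using a stand-in for the transferred file:

```shell
# stand-in for the transferred file (same name as in the guide)
touch db_bbc_clean.csv

# owner read/write, group read/write, others read-only
chmod 664 db_bbc_clean.csv

# print the octal mode to confirm the change (GNU stat)
stat -c '%a' db_bbc_clean.csv   # prints 664
```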
### Database setup
PostgreSQL should be installed on your system, as well as the spatial component PostGIS, prior to taking further action. You should also be able to access it either as **_postgres_** or as a **_user_** with privileges to create and manage databases.
Enter psql
`$ sudo -u postgres psql`
Spatial data requires a spatial extension, which has to be created in the database where spatial data operations will take place.
Connect to the _spatial_ DB and type
`=# \c bbcdataYY_MM`
`=# CREATE EXTENSION postgis;`
`=# CREATE EXTENSION postgis_topology;`
> A clean-up script [bbc_drop_types-tables.sql](bbc_drop_types-tables.sql) is available to delete previously constructed tables and types. It can be used if the script was executed on the wrong database by mistake, or simply to clean everything and restart the process from scratch.
### Importing json data _(I)_
> Skip to the [csv part](#importing-from-csv--ii) to process csv text files.
Table **`t`** will be used as an intermediate table to _inject_ json records. The table name as well as its column name __`j`__ should be equal to those defined in the mapping script _(see the [flattening & mapping section](#flattening--mapping))_. If a different table or column name is to be used in this step, it has to be modified accordingly in that script.
`=# CREATE TABLE t (j jsonb);`
The 'injection' is done with a simple transfer from a file into the **`t`** table _(mind the table name again)_.
`=# \COPY t FROM '/path/to/data12.json' CSV QUOTE e'\x01' DELIMITER e'\x02'`
Piping the file through `head` will limit the number of entries, which is useful during test phases.
`=# \COPY t FROM PROGRAM 'head -100000 /path/to/data12.json' CSV QUOTE e'\x01' DELIMITER e'\x02'`
The _csv mode_ allows defining the escape characters, which ensures correct data parsing. It is recommended to verify the special characters, the role of quotes, points and commas, and the encoding of the original data file.
Modifying the json file itself should be avoided and should never be considered an approach to managing data: changing large amounts of unreadable data will lead to undesirable results, even more so if multiple datasets have to be handled. If at any point parsing starts failing, the data treatment strategy has failed with it.
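A crude pre-import sanity check can catch obviously broken lines before `\COPY` does: every non-empty line of a json-lines file should at least look like a json object. This is only a structural check, not a real parser (a tool like `jq` would validate properly); the stand-in file below is illustrative:

```shell
# tiny stand-in for the real json-lines file
printf '%s\n' '{"id":1}' '{"id":2}' > data12.json

# count lines that do NOT start with '{' and end with '}'
# (grep exits non-zero when the count is 0, hence the || true)
bad=$(grep -cv '^{.*}$' data12.json || true)
echo "suspect lines: $bad"
```

A non-zero count points at the offending lines before any time is spent on a failed import.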
### Importing from csv _(II)_
Building the database from csv makes the whole process simpler and faster since the 'parsing' step is not needed. An init table called **`bbcinit`** has to be set up as the import target table. This is handled by the script [bbc_init4csv.sql](bbc_init4csv.sql).
To initiate the process run
`=# \i ./bbc/bbc_init4csv.sql`
List the existing tables with the _psql meta command_ to verify that the init table was created.
`=# \dt`
```
            List of relations
  Schema  |      Name       | Type  |  Owner
----------+-----------------+-------+----------
 public   | bbcinit         | table | postgres
 public   | spatial_ref_sys | table | postgres
 topology | layer           | table | postgres
 topology | topology        | table | postgres
(4 rows)
```
Then the import can start.
`=# \COPY bbcinit FROM './bbc/data/db_bbc_clean.csv' CSV HEADER DELIMITER ';'`
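Before running the `\COPY`, it is worth confirming that the file really is `;`-delimited and that every row has the same number of columns, since a single ragged row aborts the whole import. A sketch on a stand-in file (the header names and the column count of 3 are illustrative, not the real schema):

```shell
# tiny stand-in for the real csv export
printf '%s\n' 'id;name;geom' '1;foo;POINT(0 0)' > db_bbc_clean.csv

# number of columns in the header row
head -n 1 db_bbc_clean.csv | awk -F';' '{print NF}'   # prints 3

# count rows whose column count differs from the header's
awk -F';' 'NF != 3 {bad++} END {print bad+0}' db_bbc_clean.csv   # prints 0
```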
### Flattening & mapping
Flattening json data into a large relational table is implemented by a sql script [mapjson2flat.sql](mapjson2flat.sql).
The script does the following:
- defines data types
- maps the **`keys`** from json file to table __`columns`__
- extracts data from json
- flattens data into a table respecting the json structure
To initiate the process run
`=# \i ./bbc/mapjson2flat.sql`
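The key-to-column idea behind the flattening can be illustrated outside the database as well: once the keys are known, each json line maps to one flat row. A toy sketch with hypothetical key names (`id`, `name`), not the real mapping performed by the script:

```shell
# two json lines with known, fixed keys (stand-in data)
printf '%s\n' '{"id":1,"name":"a"}' '{"id":2,"name":"b"}' > t.json

# pull the "id" and "name" values out of each line into a ;-separated flat row
sed -n 's/.*"id":\([0-9]*\),"name":"\([^"]*\)".*/\1;\2/p' t.json
```

In the database the same idea is expressed with jsonb accessors and type casts instead of pattern matching, which is what makes defining the data types part of the mapping script's job.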
### Prepare a session
Populating the database can take a certain amount of time. During that period the computer might switch to stand-by mode and interrupt the ongoing session. The _`screen`_ command will prevent that, keeping the initiated processes alive.
You can also [jump](#building-the-database) straight to the building part and skip this additional precautionary step, which is more of an energy-saving measure than anything else.
To verify whether the current _\<user\>_ has any previous sessions left, list all current screen sessions:
`$ screen -ls`
If there are none, create a new screen session with
`$ screen`
To resume a detached session type
`$ screen -r <session>`
If the session is still attached and you would like to reattach it here, use
`$ screen -d -r <session>`
<sub>[more][screen] about screen</sub>
### Building the database
The build is the final step. While connected to the appropriate database, run the script to build and populate it.
- Run [bbc_build_all.sql](bbc_build_all.sql) if the data origin is a **json** file.
`=# \i './bbc/bbc_build_all.sql'`
- Run [bbc_csv_build.sql](bbc_csv_build.sql) if the source is a **csv** file.
`=# \i ./bbc/bbc_csv_build_new.sql`
The build will create new tables and relations based on the [data model][bbc data model]. If this is a server session, you are now able to explore the data using [pgAdmin][pgadmin] or via [psql].
Keep reading if your database is on a local computer and you would like to transfer it to a server.
### Moving data to a remote location
These instructions might be helpful to transfer the data to a server.
First, export the database on your local system
`$ sudo -u postgres pg_dump bbcdataYY_MM > /path/to/bbc_io.sql`
Then copy the file to the wanted location _(server access will be password protected)_.
`$ scp /path/to/bbc_io.sql`
>Don't forget to create a database before importing data, see the [database setup](#database-setup) section above.
Now login to the server where the data will be held and connect to the spatial database.
`$ sudo -u postgres psql`
`=# \c bbcdb`
and import the whole database
`=# \i /remote/location/destination/folder/bbc_io.sql`
Or execute the import directly from the console.
`$ sudo -u postgres psql bbcdb < /remote/location/destination/folder/bbc_io.sql`
### References
Sources and references used for writing the sql scripts mentioned in this manual can be found in comments within those scripts. See below for some SQL-related topics. There is also an [exhaustive archive of bookmarks](bbc_bmarks_20_01.html) at your disposal _(the section exceptional_bkmrks is exceptional, indeed)_.
- computing time intervals
- computing geometries
- handling arrays and strings
- managing json data
- managing database types and tables
- database performance issues
[bbc data model]:
[data conditioning]: