Commit 3e3d8159 authored by Eric Maldonado

add join example

parent 3f90ed71
......@@ -165,7 +165,7 @@
"\n",
"A Dictionnary of series where keys are column name\n",
"\n",
"<div><img style=\"float: left;margin-left : 70px\" src='fig/dataframe_type.png' height=\"600\" width=\"600\"/>\n"
"<div><img style=\"float: left;margin-left : 70px\" src='fig/dataframe_type.png' height=\"800\" width=\"800\"/>\n"
]
},
{
......@@ -345,7 +345,7 @@
},
{
"cell_type": "code",
"execution_count": 32,
"execution_count": 39,
"metadata": {
"slideshow": {
"slide_type": "subslide"
......@@ -1217,7 +1217,9 @@
}
},
"source": [
"## Join\n"
"## Join\n",
"\n",
"<div><img style=\"float: left;margin-left : 70px\" src='fig/join-example.png' height=\"600\" width=\"600\"/>"
]
},
{
......@@ -1468,7 +1470,7 @@
"\n",
"1. Count the number of movies\n",
"2. Display the lastest movies\n",
"3. Display movies between XXXX and XXXX\n",
"3. Display movies between 1939 and 1940\n",
"4. Diplay all the titleType available\n",
"5. Count the number of movies by titleType\n",
"6. Display Humphrey Bogart movies\n",
......
%% Cell type:markdown id: tags:
# Pandas : Python Data Analysis Library
<div><img style="float: left;margin-left : 70px" src='images/pandas_logo.png' height="500" width="500"/>
%% Cell type:markdown id: tags:
# Pandas a Data Analysis Library
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
GeoExtension : geopandas
%% Cell type:markdown id: tags:
# Webography
- Online Doc: https://pandas.pydata.org/
- CheatSheet : https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
- Tutorials:
  - https://www.dataschool.io/easier-data-analysis-with-pandas/
  - https://www.dataschool.io/data-science-best-practices-with-pandas/
- Pandas PySciDataGre Talk : https://python.univ-grenoble-alpes.fr/working-session-librairie-pandas.html
- GeoPandas : http://geopandas.org/
%% Cell type:markdown id: tags:
Code using pandas usually starts with the import statement
%% Cell type:code id: tags:
``` python
import pandas as pd
```
%% Cell type:markdown id: tags:
Pandas provides:
- 2 data structures (Series, DataFrame) for data analysis
- multiple methods for convenient data filtering
- toolkit utilities to perform Input/Output operations
It can read data from a variety of formats such as CSV, TSV, MS Excel, etc.
%% Cell type:markdown id: tags:
Pandas has two main data structures for data storage
- Series
- DataFrame
%% Cell type:code id: tags:
``` python
## Series structure
import pandas as pd
import numpy as np
series1 = pd.Series([1,2,3,4])
print(series1)
print(series1.sum())
print(series1.mean())
print(series1.to_csv())
fruits = np.array(['kiwi','orange','mango','apple'])
series2 = pd.Series(fruits)
print(series2)
```
%%%% Output: stream
0 1
1 2
2 3
3 4
dtype: int64
10
2.5
0,1
1,2
2,3
3,4
0 kiwi
1 orange
2 mango
3 apple
dtype: object
%% Cell type:markdown id: tags:
# Dataframe
A dictionary of Series where the keys are the column names
<div><img style="float: left;margin-left : 70px" src='fig/dataframe_type.png' height="600" width="600"/>
<div><img style="float: left;margin-left : 70px" src='fig/dataframe_type.png' height="800" width="800"/>
%% Cell type:markdown id: tags:
## How to create a DataFrame?
%% Cell type:markdown id: tags:
### From scratch
%% Cell type:code id: tags:
``` python
## DataFrame structure
import pandas as pd
# initialise the data as a dict of lists
data = {'Name':['John', 'Paul', 'Debby', 'Laura'], 'Sex':['Male','Male','Female','Female'],'Age':[20, 40, 19, 30]}
# Create DataFrame
df = pd.DataFrame(data)
print(df)
type(df.Age)
```
%%%% Output: stream
Name Sex Age
0 John Male 20
1 Paul Male 40
2 Debby Female 19
3 Laura Female 30
%%%% Output: execute_result
pandas.core.series.Series
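%% Cell type:markdown id: tags:
The result above hints at why a DataFrame can be seen as a dictionary of Series. The following is a minimal sketch of that idea, not part of the original notebook; the names `names`, `ages` and `df_demo` and the sample values are assumptions chosen for illustration.
%% Cell type:code id: tags:
```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
names = pd.Series(["John", "Paul", "Debby"])
ages = pd.Series([20, 40, 19])

# Each dict key becomes a column name; each Series becomes a column.
df_demo = pd.DataFrame({"Name": names, "Age": ages})

# Selecting one column gives back a Series.
age_column = df_demo["Age"]
print(type(age_column))
print(list(df_demo.columns))
```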
%% Cell type:markdown id: tags:
### From a file
%% Cell type:code id: tags:
``` python
import pandas as pd
df_person = pd.read_csv('files/person.txt', sep = ',',encoding = "utf-8", header=0)
print(df_person)
```
%%%% Output: stream
Name Sex Age
0 John Male 20
1 Paul Male 40
2 Debby Female 19
3 Laura Female 30
%% Cell type:markdown id: tags:
By default, a new integer index is created.
To use one of the fields as the index instead, specify it with the index_col parameter of the read_csv function:
df_person = pd.read_csv('files/person.txt', sep = ',',index_col='Name',encoding = "utf-8", header=0)
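The effect of index_col can be checked without touching the file system. This is a small sketch that assumes an in-memory CSV string (`csv_text`, a hypothetical stand-in for 'files/person.txt') and compares the default index with a Name-based index.
%% Cell type:code id: tags:
```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for 'files/person.txt'.
csv_text = "Name,Sex,Age\nJohn,Male,20\nPaul,Male,40\n"

# Default behaviour: a new integer index (0, 1, ...) is created.
df_default = pd.read_csv(io.StringIO(csv_text), sep=",", header=0)

# With index_col, the 'Name' field becomes the index and is no longer a column.
df_by_name = pd.read_csv(io.StringIO(csv_text), sep=",", index_col="Name", header=0)

print(df_default.index)
print(df_by_name.index)
# Rows can now be looked up by name.
print(df_by_name.loc["Paul", "Age"])
```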
%% Cell type:markdown id: tags:
## Basic commands
%% Cell type:code id: tags:
``` python
#display simple statistics
print(df_person.describe())
```
%%%% Output: stream
Age
count 4.000000
mean 29.250000
std 9.844626
min 21.000000
25% 21.750000
50% 27.000000
75% 34.500000
max 42.000000
%% Cell type:code id: tags:
``` python
#display the dataframe columns
print(df_person.columns)
```
%%%% Output: stream
Index(['Name', 'Sex', 'Age'], dtype='object')
%% Cell type:code id: tags:
``` python
#query one column
print(df_person["Age"])
# another method to query one column
print(df_person.Age)
```
%%%% Output: stream
0 22
1 42
2 21
3 32
Name: Age, dtype: int64
0 22
1 42
2 21
3 32
Name: Age, dtype: int64
%% Cell type:code id: tags:
``` python
#query multiple columns
print(df_person[['Name','Age']])
```
%%%% Output: stream
Name Age
0 John 22
1 Paul 42
2 Debby 21
3 Laura 32
%% Cell type:code id: tags:
``` python
# display the unique values of a column
print(df_person.Sex.unique())
```
%%%% Output: stream
['Male' 'Female']
%% Cell type:code id: tags:
``` python
# display the first 5 rows
print(df_person.head())
# display the last 5 rows
print(df_person.tail())
# display the first 2 rows
print(df_person[:2])
# display one row by integer position
print(df_person.iloc[2])
# display all rows by integer position
print(df_person.iloc[:])
```
%%%% Output: stream
Name Sex Age
0 John Male 22
1 Paul Male 42
2 Debby Female 21
3 Laura Female 32
Name Sex Age
0 John Male 22
1 Paul Male 42
2 Debby Female 21
3 Laura Female 32
Name Sex Age
0 John Male 22
1 Paul Male 42
Name Debby
Sex Female
Age 21
Name: 2, dtype: object
Name Sex Age
0 John Male 22
1 Paul Male 42
2 Debby Female 21
3 Laura Female 32
%% Cell type:code id: tags:
``` python
# display one row by index label
print(df_person.loc[0])
```
%%%% Output: stream
Name John
Sex Male
Age 20
Name: 0, dtype: object
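%% Cell type:markdown id: tags:
With the default integer index, iloc (position) and loc (label) look alike. The sketch below uses a hypothetical DataFrame `df_demo` with non-default labels (10, 20, 30) so that the difference becomes visible; it is an illustration only, not part of the original notebook.
%% Cell type:code id: tags:
```python
import pandas as pd

# Hypothetical data with explicit index labels 10, 20, 30.
df_demo = pd.DataFrame(
    {"Name": ["John", "Paul", "Debby"], "Age": [20, 40, 19]},
    index=[10, 20, 30],
)

# iloc selects by integer position, loc selects by index label.
row_by_position = df_demo.iloc[0]
row_by_label = df_demo.loc[10]

# After sorting, positions change but labels stay attached to their rows.
sorted_demo = df_demo.sort_values("Age")
first_after_sort = sorted_demo.iloc[0]

print(row_by_position["Name"], row_by_label["Name"])
print(first_after_sort["Name"], sorted_demo.loc[10, "Name"])
```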
%% Cell type:code id: tags:
``` python
# Basic operations on columns
df_person.Age = df_person.Age + 2
print(df_person.Age)
```
%%%% Output: stream
0 22
1 42
2 21
3 32
Name: Age, dtype: int64
%% Cell type:markdown id: tags:
## Add a row
%% Cell type:code id: tags:
``` python
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
df_person = pd.concat([df_person, pd.DataFrame([{'Name':'Glenn','Sex': 'Male','Age':10}])], ignore_index=True)
df_person
```
%%%% Output: execute_result
Name Sex Age
0 John Male 20
1 Paul Male 40
2 Debby Female 19
3 Laura Female 30
4 Glenn Male 10
%% Cell type:markdown id: tags:
## Add some rows
%% Cell type:code id: tags:
``` python
data = {'Name':['Marguerite', 'Annie', 'Stephen', 'Ava'], 'Sex':['Female','Female','Male','Female'],'Age':[34, 23, 49, 22]}
df_person = pd.concat([df_person, pd.DataFrame(data)], ignore_index=True)
df_person
```
%%%% Output: execute_result
Name Sex Age
0 John Male 20
1 Paul Male 40
2 Debby Female 19
3 Laura Female 30
4 Glenn Male 10
5 Marguerite Female 34
6 Annie Female 23
7 Stephen Male 49
8 Ava Female 22
%% Cell type:markdown id: tags:
## Add a column
%% Cell type:code id: tags:
``` python
df_person["Nationality"] = 'USA'
df_person
```
%%%% Output: execute_result
Name Sex Age Nationality
0 John Male 20 USA
1 Paul Male 40 USA
2 Debby Female 19 USA
3 Laura Female 30 USA
4 Glenn Male 10 USA
5 Marguerite Female 34 USA
6 Annie Female 23 USA
7 Stephen Male 49 USA
8 Ava Female 22 USA
%% Cell type:markdown id: tags:
## Basic statistics
%% Cell type:code id: tags:
``` python
type(df_person.Age)
```
%%%% Output: execute_result
pandas.core.series.Series
%% Cell type:code id: tags:
``` python
## Mean
print (df_person.Age.mean())
## Min and Max
print (df_person.Age.min())
print(df_person.Age.max())
print (df_person.Age.count())
```
%%%% Output: stream
27.444444444444443
10
49
9
%% Cell type:markdown id: tags:
## How to sort data?
%% Cell type:code id: tags:
``` python
df_person_sorted = df_person.sort_values(['Age'], ascending=True)
print(df_person_sorted)
```
%%%% Output: stream
Name Sex Age Nationality
4 Glenn Male 10 USA
2 Debby Female 19 USA
0 John Male 20 USA
8 Ava Female 22 USA
6 Annie Female 23 USA
3 Laura Female 30 USA
5 Marguerite Female 34 USA
1 Paul Male 40 USA
7 Stephen Male 49 USA
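%% Cell type:markdown id: tags:
sort_values also accepts several columns, each with its own direction. A short sketch on a hypothetical four-row DataFrame (`df_demo`), sorting by Sex ascending and then by Age descending:
%% Cell type:code id: tags:
```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df_demo = pd.DataFrame({
    "Name": ["John", "Paul", "Debby", "Laura"],
    "Sex": ["Male", "Male", "Female", "Female"],
    "Age": [20, 40, 19, 30],
})

# Sort by Sex (A to Z), then by Age from oldest to youngest within each Sex.
df_sorted = df_demo.sort_values(["Sex", "Age"], ascending=[True, False])
print(df_sorted)
```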
%% Cell type:markdown id: tags:
## Selection
%% Cell type:code id: tags:
``` python
# selection with one criterion
print(df_person[df_person['Sex']=='Female'])
print ("--------------------")
print(df_person[df_person['Age']<20])
print ("--------------------")
# selection with 2 criteria
print(df_person[(df_person['Sex'] =='Male') & (df_person['Age'] > 30)])
```
%%%% Output: stream
Name Sex Age Nationality
2 Debby Female 19 USA
3 Laura Female 30 USA
5 Marguerite Female 34 USA
6 Annie Female 23 USA
8 Ava Female 22 USA
--------------------
Name Sex Age Nationality
2 Debby Female 19 USA
4 Glenn Male 10 USA
--------------------
Name Sex Age Nationality
1 Paul Male 40 USA
7 Stephen Male 49 USA
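%% Cell type:markdown id: tags:
Beyond `&`, a few other selection forms are often useful: `|` for "or", `isin` for membership in a list, and `query` for a string expression. The sketch below uses a hypothetical DataFrame `df_demo` and is only meant as an illustration.
%% Cell type:code id: tags:
```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df_demo = pd.DataFrame({
    "Name": ["John", "Paul", "Debby", "Laura"],
    "Sex": ["Male", "Male", "Female", "Female"],
    "Age": [20, 40, 19, 30],
})

# "or" between two criteria: each criterion must be wrapped in parentheses.
young_or_female = df_demo[(df_demo["Age"] < 20) | (df_demo["Sex"] == "Female")]

# membership test against a list of values
selected = df_demo[df_demo["Name"].isin(["John", "Laura"])]

# the same kind of filter written as a query string
older_men = df_demo.query("Sex == 'Male' and Age > 30")

print(young_or_female)
print(selected)
print(older_men)
```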
%% Cell type:markdown id: tags:
## Update data
%% Cell type:code id: tags:
``` python
#change one value by index
df_person.loc[7,"Name"] = "Stephane"
print(df_person)
#change one value after a selection
df_person.loc[df_person["Name"] == 'Stephane', "Name"] = "Eric"
print(df_person)
```
%%%% Output: stream
Name Sex Age Nationality
0 John Male 20 USA
1 Paul Male 40 USA
2 Debby Female 19 USA
3 Laura Female 30 USA
4 Glenn Male 10 USA
5 Marguerite Female 34 USA
6 Annie Female 23 USA
7 Stephane Male 49 USA
8 Ava Female 22 USA
Name Sex Age Nationality
0 John Male 20 USA
1 Paul Male 40 USA
2 Debby Female 19 USA
3 Laura Female 30 USA
4 Glenn Male 10 USA
5 Marguerite Female 34 USA
6 Annie Female 23 USA
7 Eric Male 49 USA
8 Ava Female 22 USA
%% Cell type:code id: tags:
``` python
## Add a column
df_person["City"] = "City"
print(df_person)
## Delete a column
df_person = df_person.drop("City",axis=1)
print(df_person)
```
%%%% Output: stream
Name Sex Age Nationality City
0 John Male 20 USA City
1 Paul Male 40 USA City
2 Debby Female 19 USA City
3 Laura Female 30 USA City
4 Glenn Male 10 USA City
5 Marguerite Female 34 USA City
6 Annie Female 23 USA City
7 Eric Male 49 USA City
8 Ava Female 22 USA City
Name Sex Age Nationality
0 John Male 20 USA
1 Paul Male 40 USA
2 Debby Female 19 USA
3 Laura Female 30 USA
4 Glenn Male 10 USA
5 Marguerite Female 34 USA
6 Annie Female 23 USA
7 Eric Male 49 USA
8 Ava Female 22 USA
%% Cell type:markdown id: tags:
## Concat
<div><img style="float: left;margin-left : 70px" src='fig/concat-example.png' height="600" width="600"/>
%% Cell type:code id: tags:
``` python
data = {'Name':['Benedicte', 'Bernard', 'Nicolas', 'Anne'], 'Sex':['Female','Male','Male','Female'],'Age':[24, 34, 49, 42],'Nationality':['FR','FR','FR','FR']}
df_person_fr = pd.DataFrame(data)
list_person = [df_person,df_person_fr]
result = pd.concat(list_person)
print(result)
```
%%%% Output: stream
Name Sex Age Nationality
0 John Male 20 USA
1 Paul Male 40 USA
2 Debby Female 19 USA
3 Laura Female 30 USA
4 Glenn Male 10 USA
5 Marguerite Female 34 USA
6 Annie Female 23 USA
7 Eric Male 49 USA
8 Ava Female 22 USA
0 Benedicte Female 24 FR
1 Bernard Male 34 FR
2 Nicolas Male 49 FR
3 Anne Female 42 FR
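%% Cell type:markdown id: tags:
In the output above the index labels 0 to 3 appear twice, because concat keeps the original index of each DataFrame. A minimal sketch, using two hypothetical DataFrames `df_us` and `df_fr`, shows how ignore_index=True produces a fresh index instead.
%% Cell type:code id: tags:
```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df_us = pd.DataFrame({"Name": ["John", "Paul"], "Nationality": ["USA", "USA"]})
df_fr = pd.DataFrame({"Name": ["Benedicte", "Bernard"], "Nationality": ["FR", "FR"]})

# Default: the original index labels are kept, so they repeat.
stacked = pd.concat([df_us, df_fr])

# ignore_index=True renumbers the rows from 0.
renumbered = pd.concat([df_us, df_fr], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```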
%% Cell type:markdown id: tags:
## Join
<div><img style="float: left;margin-left : 70px" src='fig/join-example.png' height="600" width="600"/>
%% Cell type:code id: tags:
``` python
import random
data = {'id_Address':[0,1,2,3],'Address':['gordon street', 'aqua boulevard', 'st georges street', '5th street'], 'City':['Boston','Chicago','Charlotte','San Francisco']}
# Create DataFrame
df_address = pd.DataFrame(data)
print(df_address)
# assign each person a random address id between 0 and 3 (inclusive)
df_person["id_Address"] = [random.randint(0, 3) for _ in range(len(df_person))]
print(df_person)
result = pd.merge(df_person, df_address, how='left',on='id_Address')
print(result)
```
%%%% Output: stream
id_Address Address City
0 0 gordon street Boston
1 1 aqua boulevard Chicago
2 2 st georges street Charlotte
3 3 5th street San Francisco
Name Sex Age Nationality id_Address
0 John Male 20 USA 0
1 Paul Male 40 USA 0
2 Debby Female 19 USA 2
3 Laura Female 30 USA 1
4 Glenn Male 10 USA 3
5 Marguerite Female 34 USA 2
6 Annie Female 23 USA 2
7 Eric Male 49 USA 1
8 Ava Female 22 USA 0
Name Sex Age Nationality id_Address Address \
0 John Male 20 USA 0 gordon street
1 Paul Male 40 USA 0 gordon street
2 Debby Female 19 USA 2 st georges street
3 Laura Female 30 USA 1 aqua boulevard
4 Glenn Male 10 USA 3 5th street
5 Marguerite Female 34 USA 2 st georges street
6 Annie Female 23 USA 2 st georges street
7 Eric Male 49 USA 1 aqua boulevard
8 Ava Female 22 USA 0 gordon street
City
0 Boston
1 Boston
2 Charlotte
3 Chicago
4 San Francisco
5 Charlotte
6 Charlotte
7 Chicago
8 Boston
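%% Cell type:markdown id: tags:
The example above uses how='left'. The sketch below compares 'left', 'inner' and 'outer' on two small hypothetical DataFrames (`df_people`, `df_places`) where one person has no matching address and one address has no matching person; indicator=True adds a `_merge` column that tells where each row came from.
%% Cell type:code id: tags:
```python
import pandas as pd

# Hypothetical sample data: id 7 has no address, id 2 has no person.
df_people = pd.DataFrame({"Name": ["John", "Paul", "Debby"], "id_Address": [0, 1, 7]})
df_places = pd.DataFrame({"id_Address": [0, 1, 2], "City": ["Boston", "Chicago", "Charlotte"]})

# left: keep every person, fill missing address fields with NaN
left = pd.merge(df_people, df_places, how="left", on="id_Address")

# inner: keep only the rows whose key exists on both sides
inner = pd.merge(df_people, df_places, how="inner", on="id_Address")

# outer: keep everything; _merge is 'both', 'left_only' or 'right_only'
outer = pd.merge(df_people, df_places, how="outer", on="id_Address", indicator=True)

print(left)
print(inner)
print(outer)
```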
%% Cell type:markdown id: tags:
## Group By
- Splitting the data into groups based on some criteria.
- Applying a function to each group independently.
- Combining the results into a data structure.
<div><img style="float: left;margin-left : 70px" src='fig/groupby-example.png' height="800" width="800"/>
%% Cell type:code id: tags:
``` python
print(df_person.groupby('Sex')['Sex'].count())
print(df_person.groupby('Sex')['Age'].mean())
```
%%%% Output: stream
Sex
Female 5
Male 4
Name: Sex, dtype: int64
Sex
Female 25.60
Male 29.75
Name: Age, dtype: float64
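%% Cell type:markdown id: tags:
The three steps listed above (split, apply, combine) can be written out by hand and compared with what groupby does in one call. This is a sketch on a hypothetical DataFrame `df_demo`; `agg` with a list of function names computes several statistics per group at once.
%% Cell type:code id: tags:
```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df_demo = pd.DataFrame({
    "Name": ["John", "Paul", "Debby", "Laura"],
    "Sex": ["Male", "Male", "Female", "Female"],
    "Age": [20, 40, 19, 30],
})

# Manual version: split into groups, apply mean() to each, combine in a dict.
manual = {sex: group["Age"].mean() for sex, group in df_demo.groupby("Sex")}

# Same idea in one call, with several statistics per group.
summary = df_demo.groupby("Sex")["Age"].agg(["count", "mean", "min", "max"])

print(manual)
print(summary)
```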
%% Cell type:markdown id: tags:
## Export Data
%% Cell type:code id: tags:
``` python
# write the DataFrame to a CSV file, with the header and without the index column
df_person.to_csv('./files/export_person.csv', index=False, header=True)
```
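%% Cell type:markdown id: tags:
To check what an export contains without writing to disk, to_csv can also write to an in-memory buffer. The sketch below assumes a hypothetical DataFrame `df_demo` and reads the exported text back with read_csv to confirm that the round trip preserves the data.
%% Cell type:code id: tags:
```python
import io
import pandas as pd

# Hypothetical sample data, used only for illustration.
df_demo = pd.DataFrame({"Name": ["John", "Paul"], "Age": [20, 40]})

# Export to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False, header=True)
csv_text = buffer.getvalue()
print(csv_text)

# Read the exported text back and compare with the original.
df_back = pd.read_csv(io.StringIO(csv_text))
print(df_back.equals(df_demo))
```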
%% Cell type:markdown id: tags:
## Plot Data
%% Cell type:code id: tags:
``` python
%matplotlib inline
df_person.groupby('Sex')['Sex'].count().plot.bar()
```
%%%% Output: execute_result
<matplotlib.axes._subplots.AxesSubplot at 0x7ff2f50bd780>
%%%% Output: display_data
[Hidden Image Output]
%% Cell type:markdown id: tags:
# DIY
%% Cell type:markdown id: tags:
## Goals : Compute light statistics on IMDB Movies files
The goal of this session is to end up with a script that computes some simple statistics from IMDB Movies files. The files were modified and reduced for this exercise.
Material
The data are in 2 files in the directory named "files":
- name.tsv
This file contains the actors. The separator is a tab character ('\t'). The first line is the header.
nconst primaryName birthYear deathYear primaryProfession knownForTitles
- title.tsv
This file contains the movies. The separator is a comma (','). The first line is the header.
tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
We want to
- load data from tsv file
- compute some basic statistics
- save data to tsv file
%% Cell type:markdown id: tags:
## Compute some basic statistics
1. Count the number of movies
2. Display the latest movies
3. Display movies between XXXX and XXXX
3. Display movies between 1939 and 1940
4. Display all the titleType values available
5. Count the number of movies by titleType
6. Display Humphrey Bogart movies
7. Plot movie count by year between 1950 and 1960
%% Cell type:markdown id: tags:
## A possible correction
%% Cell type:code id: tags:
``` python
#load data
df_title = pd.read_csv('files/diy_12_pandas/title.tsv', sep = ',', encoding = "utf-8", header=0)
print(df_title.index)
print(df_title.head())
#1
print(df_title.primaryTitle.count())
#2
print(df_title.sort_values(['startYear'], ascending=False).head())
#3
print(df_title[(df_title.startYear>=1939) & (df_title.startYear<=1940)]['originalTitle'])
#4
print(df_title["titleType"].unique())
#5
print(df_title.groupby("titleType")["titleType"].count())
#6
df_name = pd.read_csv('files/diy_12_pandas/name.tsv', sep = '\t', encoding = "utf-8", header=0)
author_titles = df_name.loc[df_name['primaryName']=='Humphrey Bogart']['knownForTitles']
print(df_title.loc[df_title['tconst'].isin(author_titles.tolist()[0].split(','))])
#7
df_title[(df_title['startYear']>=1950) & (df_title['startYear']<=1960)].groupby("startYear")["startYear"].count().plot.bar()
```
%%%% Output: stream
RangeIndex(start=0, stop=135460, step=1)
tconst titleType primaryTitle \
0 tt0000009 movie Miss Jerry
1 tt0000020 short The Derby 1895
2 tt0000024 short Opening of the Kiel Canal
3 tt0000025 short The Oxford and Cambridge University Boat Race
4 tt0000165 short Riña en un café
originalTitle isAdult startYear endYear \
0 Miss Jerry 0 1894.0 NaN
1 The Derby 1895 0 1895.0 NaN
2 Opening of the Kiel Canal 0 1895.0 NaN
3 The Oxford and Cambridge University Boat Race 0 1895.0 NaN
4 Riña en un café 0 1897.0 NaN
runtimeMinutes genres
0 45.0 Romance
1 1.0 Documentary,Short,Sport
2 NaN News,Short
3 NaN News,Short,Sport
4 1.0 Short
135460
tconst titleType primaryTitle originalTitle \
123383 tt3554046 movie Space Jam 2 Space Jam 2
112158 tt1630029 movie Avatar 2 Avatar 2
132762 tt7304824 movie Ofrenda a la tormenta Ofrenda a la tormenta
134753 tt8984382 movie Jagdzeit Jagdzeit
131619 tt6615224 movie HeadShop HeadShop
isAdult startYear endYear runtimeMinutes \
123383 0 2021.0 NaN NaN
112158 0 2021.0 NaN NaN
132762 0 2020.0 NaN NaN
134753 0 2020.0 NaN NaN
131619 0 2020.0 NaN NaN
genres
123383 Action,Adventure,Animation
112158 Action,Adventure,Fantasy
132762 Crime,Thriller
134753 Drama
131619 Comedy
3530 Así es la vida
5342 Alhambra
5438 Dernière jeunesse
5484 El genio alegre
5577 María de la O
5611 Olympic Honeymoon
5648 El rayo
5756 Usted tiene ojos de mujer fatal
5780 3:1 a szerelem javára
5792 Allegri masnadieri
5933 Ho perduto mio marito
6024 Molinos de viento
6033 My Favorite Wife
6145 L'ultima nemica
6196 Aldeia da Roupa Branca
6216 La bestia negra
6253 Cossacks in Exile
6270 Le due madri
6274 Der Edelweißkönig
6277 Eravamo 7 sorelle
6281 Farinet ou l'or dans la montagne
6288 I figli del marchese Lucera
6335 Io, suo padre
6383 Marionette
6399 Napoli che non muore
6401 Narcisse
6424 Piccoli naufraghi
6433 Prinzessin Sissy
6495 Suspiros de España
6507 Terra di fuoco
...
106234 Swinguette
107773 Hatsukoi
107803 Shamisen bushi
111182 Stranger Than Fiction, #69
112706 Patriot
115430 Elnémult harangok
116809 Sai shang feng yun
117975 The Brown Bomber
121499 The Green Goddess
121557 Stozhary
122719 Yesterday Is Over Your Shoulder
122967 Sabakaruru onna
124321 Verena Stadler
125342 Boevye stranitsy
125380 Death at Newtown-Stewart
125391 The Sleeping Princess
125455 Chances Fair and Choosers True
125490 Plain Jane
127362 United Action Means Victory
127519 Salute to America
127520 Living in Hollywood
127874 Cavalcade of Variety
129070 The Tempest/II
129406 NBC News
131267 World in Flames
131626 Hirurgiya
131807 Hurricane Special
131951 Då länkarna smiddes
133077 En correctionnelle
134454 Hollywood Funtime, Program 2
Name: originalTitle, Length: 1004, dtype: object
['movie' 'short' 'tvMovie' 'tvSeries' 'tvMiniSeries' 'tvShort' 'tvSpecial'
'tvEpisode' 'video' 'videoGame']
titleType
movie 86687
short 12624
tvEpisode 13
tvMiniSeries 2651
tvMovie 9199
tvSeries 17863
tvShort 410
tvSpecial 695
video 4465
videoGame 853
Name: titleType, dtype: int64
tconst titleType primaryTitle originalTitle \
7458 tt0033870 movie The Maltese Falcon The Maltese Falcon
7678 tt0034583 movie Casablanca Casablanca
8620 tt0037382 movie To Have and Have Not To Have and Have Not
10879 tt0043265 movie The African Queen The African Queen
isAdult startYear endYear runtimeMinutes genres
7458 0 1941.0 NaN 100.0 Film-Noir,Mystery
7678 0 1942.0 NaN 102.0 Drama,Romance,War
8620 0 1944.0 NaN 100.0 Adventure,Comedy,Romance
10879 0 1951.0 NaN 105.0 Adventure,Drama,Romance
%%%% Output: execute_result
<matplotlib.axes._subplots.AxesSubplot at 0x7ff2ee79ab38>
%%%% Output: display_data
[Hidden Image Output]
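%% Cell type:markdown id: tags:
Step 6 of the correction relies on knownForTitles being a comma-separated string of tconst values. The sketch below replays that logic on two tiny hypothetical tables (`df_name_demo`, `df_title_demo`); the nconst values are made up, while the tconst values and titles are taken from the output above.
%% Cell type:code id: tags:
```python
import pandas as pd

# Hypothetical miniature versions of name.tsv and title.tsv.
df_name_demo = pd.DataFrame({
    "nconst": ["nm_demo_1", "nm_demo_2"],  # made-up identifiers
    "primaryName": ["Humphrey Bogart", "Someone Else"],
    "knownForTitles": ["tt0033870,tt0034583", "tt0000009"],
})
df_title_demo = pd.DataFrame({
    "tconst": ["tt0000009", "tt0033870", "tt0034583"],
    "primaryTitle": ["Miss Jerry", "The Maltese Falcon", "Casablanca"],
})

# Select the knownForTitles string of the actor, then split it into a list of ids.
known = df_name_demo.loc[df_name_demo["primaryName"] == "Humphrey Bogart", "knownForTitles"]
title_ids = known.iloc[0].split(",")

# Keep only the titles whose tconst is in that list.
bogart_titles = df_title_demo[df_title_demo["tconst"].isin(title_ids)]
print(bogart_titles)
```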
......