For simplicity, we will use a pre-formatted dataset - See [documentation](
However, Keras offers some useful tools for formatting textual data - See [documentation](
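To give an idea of what such formatting involves, here is a minimal sketch in plain Python. It is not the Keras tooling itself: `build_word_index` and `encode` are hypothetical helpers, and the reserved indices 0, 1 and 2 for \<pad\>, \<start\> and \<unknown\> are an assumption chosen to mirror the tokens displayed in this notebook.

``` python
from collections import Counter

# Reserved indices (assumption for this sketch)
PAD, START, UNKNOWN = 0, 1, 2

def build_word_index(texts, vocab_size):
    """Hypothetical helper: map the most frequent words to indices 3, 4, ..."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Keep only vocab_size - 3 words, since 3 indices are reserved
    most_common = counts.most_common(vocab_size - 3)
    return {word: i + 3 for i, (word, _) in enumerate(most_common)}

def encode(text, word_index):
    """Hypothetical helper: text -> [START, idx, idx, ...], rare words become UNKNOWN."""
    return [START] + [word_index.get(w, UNKNOWN) for w in text.lower().split()]

texts = ["this film was great", "this film was boring"]
word_index = build_word_index(texts, vocab_size=6)
encoded = encode("this film was great", word_index)
print(encoded)
```

With such a small vocabulary, only the three most frequent words get their own index; "great" falls outside the vocabulary and is encoded as \<unknown\>.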
**Load dataset :**
%% Cell type:code id: tags:
``` python
# ----- Retrieve x,y
# Uncomment this if you want to load dataset directly from keras (small size <20M)
# with h5py.File(f'{datasets_dir}/IMDB/origine/dataset_imdb.h5','r') as f:
#     x_train = f['x_train'][:]
#     y_train = f['y_train'][:]
#     x_test  = f['x_test'][:]
#     y_test  = f['y_test'][:]
%% Output
/home/pjluc/anaconda3/envs/fidle/lib/python3.7/site-packages/tensorflow_core/python/keras/datasets/ VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
<start> this film contains more action before the opening credits than are in entire hollywood films of this sort this film is produced by tsui <unknown> and stars jet li this team has brought you many worthy hong kong cinema productions including the once upon a time in china series the action was fast and furious with amazing wire work i only saw the <unknown> in two shots aside from the action the story itself was strong and not just used as filler to find any other action films to rival this you must look for a hong kong cinema <unknown> in your area they are really worth checking out and usually never disappoint
%% Cell type:markdown id: tags:
### 2.4 - Have a look for NN
%% Cell type:code id: tags:
``` python
plt.gca().set(title='Distribution of reviews by size - [{:5.2f}, {:5.2f}]'.format(min(sizes),max(sizes)),
%% Output
%% Cell type:markdown id: tags:
## Step 3 - Preprocess the data (padding)
To be processed by an NN, all entries must have the **same length.**
We chose a review length of **review_len**.
Shorter reviews are therefore completed with padding (\<pad\> tokens).
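As a rough illustration of the idea, here is a small NumPy sketch. `pad_post` is a hypothetical helper written for this example, and using index 0 for \<pad\> is an assumption. Keras provides `pad_sequences` (with `padding='post'`) for the same job; note that this sketch cuts long reviews at the end, whereas `pad_sequences` truncates from the beginning by default.

``` python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Hypothetical helper: return a (n_reviews, maxlen) array.
    Short reviews are padded at the end with `value`, long ones are cut at maxlen."""
    padded = np.full((len(sequences), maxlen), value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        kept = seq[:maxlen]                 # drop anything beyond maxlen
        padded[row, :len(kept)] = kept      # the remaining cells keep the pad value
    return padded

review_len = 6                              # illustrative value only
reviews = [[1, 14, 22], [1, 7, 9, 4, 30, 12, 8, 5]]
x = pad_post(reviews, review_len)
print(x.shape)
print(x)
```

After this step, every review is a row of exactly `review_len` indices, so the whole set fits in a single rectangular array.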
<start> this film contains more action before the opening credits than are in entire hollywood films of this sort this film is produced by tsui <unknown> and stars jet li this team has brought you many worthy hong kong cinema productions including the once upon a time in china series the action was fast and furious with amazing wire work i only saw the <unknown> in two shots aside from the action the story itself was strong and not just used as filler to find any other action films to rival this you must look for a hong kong cinema <unknown> in your area they are really worth checking out and usually never disappoint <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
%% Cell type:markdown id: tags:
**Save dataset and dictionary (For future use but not mandatory)**
%% Cell type:code id: tags:
``` python
# ---- Write dataset in a h5 file, could be useful
- We'll choose a dense vector size for the embedding output with **dense_vector_size**
- **GlobalAveragePooling1D** averages over the sequence dimension: (None, lx, ly) -> (None, ly)
  In other words: we average the word vectors of a sentence
- The Keras embedding works in a supervised way. It is a layer from *vocab_size* neurons to *n_neurons* that maintains a table of vectors (the weights are the vectors). Unlike ordinary layers, it does not compute an output but returns the values of the vectors: n words => n vectors (then combined by the pooling)
See: [More detailed explanation (en)](
as well as: [Sentiment detection with Keras](
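The two operations described above can be sketched in NumPy. This is only an illustration of the shapes involved, not the Keras implementation: the table here is random instead of learned, and the values of `vocab_size` and `dense_vector_size` are arbitrary assumptions.

``` python
import numpy as np

vocab_size = 10           # illustrative value only
dense_vector_size = 4     # illustrative value only
rng = np.random.default_rng(0)

# The embedding weights: one row (vector) per word index.
# In a real model this table is learned during training.
embedding_table = rng.normal(size=(vocab_size, dense_vector_size))

# A batch of 2 padded reviews, 5 word indices each: shape (2, 5)
x = np.array([[1, 4, 7, 0, 0],
              [1, 3, 3, 9, 2]])

# Embedding = table lookup: (None, lx) -> (None, lx, ly)
embedded = embedding_table[x]

# Global average pooling = mean over the sequence axis: (None, lx, ly) -> (None, ly)
pooled = embedded.mean(axis=1)

print(embedded.shape, pooled.shape)
```

In Keras, these two steps correspond to an `Embedding(input_dim=vocab_size, output_dim=dense_vector_size)` layer followed by `GlobalAveragePooling1D()`. Note that in this sketch the vectors of the padding positions are included in the average.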