Commit b6b5a9ad authored by Florent Chatelain

fix typos

parent a1415ee1
%% Cell type:markdown id: tags:
 
This notebook can be run on mybinder: [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/git/https%3A%2F%2Fgricad-gitlab.univ-grenoble-alpes.fr%2Fchatelaf%2Fml-sicom3a/master?urlpath=lab/tree/notebooks/8_Trees_Boosting/N1_Classif_tree.ipynb/N3_c_Random_forests_Sonar_Data.ipynb)
 
%% Cell type:markdown id: tags:
 
## SONAR DATA
 
This is a dataset that describes sonar chirp returns bouncing off different surfaces. The 60 input variables are the strengths of the returns at different angles. It is a binary classification problem that requires a model to differentiate rocks from metal cylinders.
 
It is a well-understood dataset. All of the variables are continuous and generally in the range of 0 to 1, so we will not have to normalize the input data (which is otherwise often a good practice, e.g. for the Perceptron algorithm). The output variable is the string “M” for mine or “R” for rock, which needs to be converted to the integers 1 and 0.
 
By always predicting the class with the most observations in the dataset (M, i.e. mines), the Zero Rule Algorithm achieves an accuracy of about 53%.
 
You can learn more about this dataset at the [UCI Machine Learning repository documentation](http://archive.ics.uci.edu/ml/datasets/connectionist+bench+(sonar,+mines+vs.+rocks)). You can download the dataset for free and place it in your working directory with the filename `sonar.all-data.csv` (also available in the GitLab repo of the course).
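
To sanity-check the 53% figure, here is a minimal sketch of that Zero Rule baseline, assuming only the class counts given in the description (111 mines, 97 rocks):

%% Cell type:code id: tags:

``` python
# Minimal sketch of the Zero Rule baseline: always predict the majority class.
# Class counts (111 M, 97 R) are taken from the dataset description above.
from collections import Counter

labels = ["M"] * 111 + ["R"] * 97
majority_class, majority_count = Counter(labels).most_common(1)[0]
print("ZeroR predicts '%s' -> accuracy = %.2f%%"
      % (majority_class, 100 * majority_count / len(labels)))
# ZeroR predicts 'M' -> accuracy = 53.37%
```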
 
%% Cell type:code id: tags:
 
``` python
from csv import reader
import numpy as np


# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, "r") as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset


# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())


# Convert string column to integer
# Note: the label-to-integer mapping comes from enumerating a set,
# so which class ends up as 0 or 1 is not fixed in advance.
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup


# load and prepare data
filename = "sonar.all-data.csv"
dataset = load_csv(filename)
for i in range(len(dataset[0]) - 1):
    str_column_to_float(dataset, i)
# convert string class to integers
str_column_to_int(dataset, len(dataset[0]) - 1)
print("size of sonar dataset = {}".format(np.asarray(dataset).shape))

# split features and labels
dataset_X = list()
dataset_y = list()
for i in range(len(dataset)):
    dataset_X.append(dataset[i][:-1])
    dataset_y.append(dataset[i][-1])
```
 
%%%% Output: stream
 
size of sonar dataset = (208, 61)
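
%% Cell type:markdown id: tags:

For reference, the same loading and label encoding can be written more compactly with pandas. This is an alternative sketch, assuming pandas is installed; here “M” is mapped to 1 explicitly rather than relying on set ordering:

%% Cell type:code id: tags:

``` python
# Alternative sketch with pandas (assumes pandas is available)
import pandas as pd

df = pd.read_csv("sonar.all-data.csv", header=None)
X = df.iloc[:, :-1].to_numpy(dtype=float)           # 60 continuous features
y = (df.iloc[:, -1] == "M").astype(int).to_numpy()  # M -> 1, R -> 0 explicitly
print(X.shape, y.shape)  # (208, 60) (208,)
```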
 
%% Cell type:markdown id: tags:
 
### About the Sonar data file
The file contains 111 patterns obtained by bouncing sonar signals off a metal cylinder at various angles and under various conditions. The file also contains 97 patterns obtained from rocks under similar conditions. The transmitted sonar signal is a frequency-modulated chirp, rising in frequency. The data set contains signals obtained from a variety of different aspect angles, spanning 90 degrees for the cylinder and 180 degrees for the rock.
 
Each pattern is a set of 60 numbers in the range 0.0 to 1.0. Each number represents the energy within a particular frequency band, integrated over a certain period of time. The integration aperture for higher frequencies occurs later in time, since these frequencies are transmitted later during the chirp.
 
These experiments were conducted to evaluate the possibility of detecting mines or pipes on the sea floor.
 
See the [UCI Machine Learning repository documentation](http://archive.ics.uci.edu/ml/datasets/connectionist+bench+(sonar,+mines+vs.+rocks)).
 
%% Cell type:markdown id: tags:
 
## Visualize a few realizations
 
%% Cell type:code id: tags:
 
``` python
%matplotlib inline
import matplotlib.pyplot as plt
 
print(np.asarray(dataset)[:, -1])  # integer labels of all 208 samples
plt.plot(np.asarray(dataset[0])[:-1], "b")
print(" sample 0 label = {}".format(dataset[0][-1]))
print(" sample 200 label = {}".format(dataset[200][-1]))
plt.plot(np.asarray(dataset[200])[:-1], "r")
plt.title("blue: rock echo example; red: mine echo example")
plt.xlabel("Angular sector")
plt.ylabel("Response strength");
```
 
%%%% Output: stream
 
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
sample 0 label = 1
sample 200 label = 0
 
%%%% Output: display_data
 
[Hidden Image Output]
 
%% Cell type:code id: tags:
 
``` python
# Split the samples into positive (label 1) and negative (label 0) groups
Positive = list()
Negative = list()
plt.figure(1)
for i in range(len(dataset)):
    r = dataset[i]
    if r[-1] == 1:
        Positive.append(r[:-1])
        # plt.subplot(211)
        # plt.plot(r[:-1])
    else:
        Negative.append(r[:-1])
        # plt.subplot(212)
        # plt.plot(r[:-1])

plt.matshow(Positive)
plt.title("Positive (presence of Mine/Pipe) signatures")
print("nb positive samples :", np.asarray(Positive).shape)
plt.matshow(Negative)
plt.title("Negative (rock only) signatures")
print("nb negative samples :", np.asarray(Negative).shape)
```
 
%%%% Output: stream
 
nb positive samples : (97, 60)
nb negative samples : (111, 60)
 
%%%% Output: display_data
 
 
%%%% Output: display_data
 
[Hidden Image Output]
 
%%%% Output: display_data
 
[Hidden Image Output]
 
%% Cell type:markdown id: tags:
 
## Compute a classification tree
 
%% Cell type:code id: tags:
 
``` python
from sklearn import tree
from sklearn import metrics
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

nbdepth = 8  # try 4 or 5 and decide whether more depth is useful...
estimator = tree.DecisionTreeClassifier(max_depth=nbdepth, criterion="entropy")

# Hold out 20% of the samples as a test set
X_train, X_test, y_train, y_test = train_test_split(
    dataset_X, dataset_y, test_size=0.2, random_state=0
)
print(
    "resp. length of X_train and X_test are {0} and {1} ".format(
        [len(X_train)], [len(X_test)]
    )
)

estimator = estimator.fit(X_train, y_train)
estimator
```
 
%%%% Output: stream
 
resp. length of X_train and X_test are [166] and [42]
 
%%%% Output: execute_result
 
DecisionTreeClassifier(criterion='entropy', max_depth=8)
 
%% Cell type:markdown id: tags:
 
## ...and visualize it
 
%% Cell type:code id: tags:
 
``` python
from sklearn.tree import plot_tree
 
plt.figure(figsize=(50, 30))
a = plot_tree(estimator, filled=True, rounded=True, fontsize=25)
```
 
%%%% Output: display_data
 
[Hidden Image Output]
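
%% Cell type:markdown id: tags:

If the rendered figure is hard to read at this depth, scikit-learn can also dump the fitted tree as plain text via `sklearn.tree.export_text`. A small sketch (the `max_depth` argument here only truncates the printout):

%% Cell type:code id: tags:

``` python
# Sketch: plain-text dump of the fitted tree, truncated for readability
from sklearn.tree import export_text

print(export_text(estimator, max_depth=2))
```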
 
%% Cell type:markdown id: tags:
 
## Evaluate its score
 
%% Cell type:code id: tags:
 
``` python
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import ShuffleSplit

# 10 random 80/20 splits for cross-validation
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=None)
scores = cross_val_score(estimator, dataset_X, dataset_y, cv=cv)
print(
    "Mean Accuracy and 95 percent confidence interval : %0.2f (+/- %0.2f)"
    % (scores.mean(), scores.std() * 2)
)
```
 
%%%% Output: stream
 
Mean Accuracy and 95 percent confidence interval : 0.69 (+/- 0.15)
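
%% Cell type:markdown id: tags:

The `nbdepth` comment above suggests trying shallower trees. A quick sketch of such a sweep, reusing the same `ShuffleSplit` object (scores will vary from run to run since `random_state=None`):

%% Cell type:code id: tags:

``` python
# Sketch: sweep max_depth to see whether deeper trees actually help
for depth in (2, 4, 8, 16):
    est = tree.DecisionTreeClassifier(max_depth=depth, criterion="entropy")
    s = cross_val_score(est, dataset_X, dataset_y, cv=cv)
    print("depth %2d: %.2f (+/- %.2f)" % (depth, s.mean(), s.std() * 2))
```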
 
%% Cell type:markdown id: tags:
 
## Compute a random forest
 
%% Cell type:code id: tags:
 
``` python
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier

# Random forest: bootstrap samples + random feature subsets at each split
clf = RandomForestClassifier(
    n_estimators=100,
    max_depth=None,
    random_state=None,
    max_features=4,
    min_samples_split=2,
    criterion="entropy",
)
scores = cross_val_score(clf, dataset_X, dataset_y, cv=10)
print(
    "Mean Accuracy and 95 percent confidence interval RandomForest: %0.2f (+/- %0.2f)"
    % (scores.mean(), scores.std() * 2)
)

# Extra-trees: additionally randomizes the split thresholds
clf = ExtraTreesClassifier(
    n_estimators=100,
    max_depth=None,
    random_state=None,
    max_features=4,
    min_samples_split=2,
    criterion="entropy",
)
scores = cross_val_score(clf, dataset_X, dataset_y, cv=10)
print(
    "Mean Accuracy and 95 percent confidence interval ExtraTrees: %0.2f (+/- %0.2f)"
    % (scores.mean(), scores.std() * 2)
)
```
 
%%%% Output: stream
 
Mean Accuracy and 95 percent confidence interval RandomForest: 0.72 (+/- 0.23)
Mean Accuracy and 95 percent confidence interval ExtraTrees: 0.70 (+/- 0.28)
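
%% Cell type:markdown id: tags:

As an aside, bagged ensembles come with a built-in generalization estimate: each tree is trained on a bootstrap sample, so the left-out ("out-of-bag") samples can score it without a separate cross-validation loop. A minimal sketch using scikit-learn's `oob_score` option (the `random_state=0` choice here is arbitrary):

%% Cell type:code id: tags:

``` python
# Sketch: out-of-bag (OOB) accuracy of a random forest
rf = RandomForestClassifier(
    n_estimators=100, max_features=4, criterion="entropy",
    oob_score=True, random_state=0,
)
rf.fit(dataset_X, dataset_y)
print("OOB accuracy: %0.2f" % rf.oob_score_)
```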
 
%% Cell type:markdown id: tags:
 
## ... and evaluate the importance or ranking of the features
 
%% Cell type:code id: tags:
 
``` python
clf.fit(dataset_X, dataset_y)
importances = clf.feature_importances_
# spread of the importances across the individual trees of the ensemble
# (loop variable `t` avoids shadowing the sklearn `tree` module)
std = np.std([t.feature_importances_ for t in clf.estimators_], axis=0)
print(std)
indices = np.argsort(importances)[::-1]
indices
```
 
%%%% Output: stream
 
[0.01505498 0.01217631 0.01575696 0.01541986 0.01481571 0.01501268
0.01067312 0.01489893 0.03196207 0.03109689 0.03909901 0.04122708
0.02227058 0.01552395 0.02136597 0.01577775 0.01767813 0.01353302
0.01689286 0.01990486 0.01862321 0.01849119 0.01738086 0.0117907
0.01110707 0.01458214 0.0172588 0.02041586 0.01817627 0.01361016
0.0173271 0.01496426 0.01303157 0.01404925 0.01548174 0.02159612
0.01964616 0.01599828 0.01399931 0.01561035 0.01506488 0.01715188
0.01751601 0.01996672 0.02415092 0.01928718 0.02165571 0.02637579
0.02089962 0.0116708 0.01912909 0.01799683 0.01239964 0.01620222
0.01254099 0.01002712 0.01150291 0.01269839 0.01158948 0.0115435 ]
 
%%%% Output: execute_result
 
array([11, 10, 35, 9, 47, 8, 44, 46, 48, 20, 36, 27, 45, 19, 26, 14, 21,
12, 50, 42, 43, 22, 28, 15, 18, 53, 16, 30, 34, 38, 7, 39, 0, 41,
51, 5, 25, 1, 3, 37, 17, 40, 2, 31, 13, 57, 4, 23, 52, 32, 33,
29, 49, 58, 54, 56, 24, 55, 6, 59])
 
%% Cell type:code id: tags:
 
``` python
# Print the feature ranking
print("Feature ranking:")

for f in range(10):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure(figsize=(15, 6))
plt.title("Feature importances")
plt.bar(
    range(np.asarray(dataset_X).shape[1]),
    importances[indices],
    color="r",
    yerr=std[indices],
    align="center",
)
plt.xticks(range(np.asarray(dataset_X).shape[1]), indices)
plt.xlim([-1, np.asarray(dataset_X).shape[1]])

plt.show()
```
 
%%%% Output: stream
 
Feature ranking:
1. feature 11 (0.037391)
2. feature 10 (0.033575)
3. feature 35 (0.026083)
4. feature 9 (0.026081)
5. feature 47 (0.026023)
6. feature 8 (0.025810)
7. feature 44 (0.024273)
8. feature 46 (0.023325)
9. feature 48 (0.022112)
10. feature 20 (0.021423)
 
%%%% Output: display_data
 
[Hidden Image Output]
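
%% Cell type:markdown id: tags:

Impurity-based importances such as `feature_importances_` can be biased toward features with many split points. As a cross-check, scikit-learn's `permutation_importance` measures the score drop when each feature is shuffled; a minimal sketch on the fitted `clf`:

%% Cell type:code id: tags:

``` python
# Sketch: permutation importance as a complementary feature ranking
from sklearn.inspection import permutation_importance

result = permutation_importance(clf, dataset_X, dataset_y, n_repeats=10, random_state=0)
top = np.argsort(result.importances_mean)[::-1][:10]
for rank, idx in enumerate(top, start=1):
    print("%d. feature %d (%f)" % (rank, idx, result.importances_mean[idx]))
```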
 