Commit 3db6ca76 authored by Florent Chatelain

add lasso nb

parent 6b3a4d9a
TV,Radio,Newspaper,Sales
230.1,37.8,69.2,22.1
44.5,39.3,45.1,10.4
17.2,45.9,69.3,9.3
151.5,41.3,58.5,18.5
180.8,10.8,58.4,12.9
8.7,48.9,75.0,7.2
57.5,32.8,23.5,11.8
120.2,19.6,11.6,13.2
8.6,2.1,1.0,4.8
199.8,2.6,21.2,10.6
66.1,5.8,24.2,8.6
214.7,24.0,4.0,17.4
23.8,35.1,65.9,9.2
97.5,7.6,7.2,9.7
204.1,32.9,46.0,19.0
195.4,47.7,52.9,22.4
67.8,36.6,114.0,12.5
281.4,39.6,55.8,24.4
69.2,20.5,18.3,11.3
147.3,23.9,19.1,14.6
218.4,27.7,53.4,18.0
237.4,5.1,23.5,12.5
13.2,15.9,49.6,5.6
228.3,16.9,26.2,15.5
62.3,12.6,18.3,9.7
262.9,3.5,19.5,12.0
142.9,29.3,12.6,15.0
240.1,16.7,22.9,15.9
248.8,27.1,22.9,18.9
70.6,16.0,40.8,10.5
292.9,28.3,43.2,21.4
112.9,17.4,38.6,11.9
97.2,1.5,30.0,9.6
265.6,20.0,0.3,17.4
95.7,1.4,7.4,9.5
290.7,4.1,8.5,12.8
266.9,43.8,5.0,25.4
74.7,49.4,45.7,14.7
43.1,26.7,35.1,10.1
228.0,37.7,32.0,21.5
202.5,22.3,31.6,16.6
177.0,33.4,38.7,17.1
293.6,27.7,1.8,20.7
206.9,8.4,26.4,12.9
25.1,25.7,43.3,8.5
175.1,22.5,31.5,14.9
89.7,9.9,35.7,10.6
239.9,41.5,18.5,23.2
227.2,15.8,49.9,14.8
66.9,11.7,36.8,9.7
199.8,3.1,34.6,11.4
100.4,9.6,3.6,10.7
216.4,41.7,39.6,22.6
182.6,46.2,58.7,21.2
262.7,28.8,15.9,20.2
198.9,49.4,60.0,23.7
7.3,28.1,41.4,5.5
136.2,19.2,16.6,13.2
210.8,49.6,37.7,23.8
210.7,29.5,9.3,18.4
53.5,2.0,21.4,8.1
261.3,42.7,54.7,24.2
239.3,15.5,27.3,15.7
102.7,29.6,8.4,14.0
131.1,42.8,28.9,18.0
69.0,9.3,0.9,9.3
31.5,24.6,2.2,9.5
139.3,14.5,10.2,13.4
237.4,27.5,11.0,18.9
216.8,43.9,27.2,22.3
199.1,30.6,38.7,18.3
109.8,14.3,31.7,12.4
26.8,33.0,19.3,8.8
129.4,5.7,31.3,11.0
213.4,24.6,13.1,17.0
16.9,43.7,89.4,8.7
27.5,1.6,20.7,6.9
120.5,28.5,14.2,14.2
5.4,29.9,9.4,5.3
116.0,7.7,23.1,11.0
76.4,26.7,22.3,11.8
239.8,4.1,36.9,12.3
75.3,20.3,32.5,11.3
68.4,44.5,35.6,13.6
213.5,43.0,33.8,21.7
193.2,18.4,65.7,15.2
76.3,27.5,16.0,12.0
110.7,40.6,63.2,16.0
88.3,25.5,73.4,12.9
109.8,47.8,51.4,16.7
134.3,4.9,9.3,11.2
28.6,1.5,33.0,7.3
217.7,33.5,59.0,19.4
250.9,36.5,72.3,22.2
107.4,14.0,10.9,11.5
163.3,31.6,52.9,16.9
197.6,3.5,5.9,11.7
184.9,21.0,22.0,15.5
289.7,42.3,51.2,25.4
135.2,41.7,45.9,17.2
222.4,4.3,49.8,11.7
296.4,36.3,100.9,23.8
280.2,10.1,21.4,14.8
187.9,17.2,17.9,14.7
238.2,34.3,5.3,20.7
137.9,46.4,59.0,19.2
25.0,11.0,29.7,7.2
90.4,0.3,23.2,8.7
13.1,0.4,25.6,5.3
255.4,26.9,5.5,19.8
225.8,8.2,56.5,13.4
241.7,38.0,23.2,21.8
175.7,15.4,2.4,14.1
209.6,20.6,10.7,15.9
78.2,46.8,34.5,14.6
75.1,35.0,52.7,12.6
139.2,14.3,25.6,12.2
76.4,0.8,14.8,9.4
125.7,36.9,79.2,15.9
19.4,16.0,22.3,6.6
141.3,26.8,46.2,15.5
18.8,21.7,50.4,7.0
224.0,2.4,15.6,11.6
123.1,34.6,12.4,15.2
229.5,32.3,74.2,19.7
87.2,11.8,25.9,10.6
7.8,38.9,50.6,6.6
80.2,0.0,9.2,8.8
220.3,49.0,3.2,24.7
59.6,12.0,43.1,9.7
0.7,39.6,8.7,1.6
265.2,2.9,43.0,12.7
8.4,27.2,2.1,5.7
219.8,33.5,45.1,19.6
36.9,38.6,65.6,10.8
48.3,47.0,8.5,11.6
25.6,39.0,9.3,9.5
273.7,28.9,59.7,20.8
43.0,25.9,20.5,9.6
184.9,43.9,1.7,20.7
73.4,17.0,12.9,10.9
193.7,35.4,75.6,19.2
220.5,33.2,37.9,20.1
104.6,5.7,34.4,10.4
96.2,14.8,38.9,11.4
140.3,1.9,9.0,10.3
240.1,7.3,8.7,13.2
243.2,49.0,44.3,25.4
38.0,40.3,11.9,10.9
44.7,25.8,20.6,10.1
280.7,13.9,37.0,16.1
121.0,8.4,48.7,11.6
197.6,23.3,14.2,16.6
171.3,39.7,37.7,19.0
187.8,21.1,9.5,15.6
4.1,11.6,5.7,3.2
93.9,43.5,50.5,15.3
149.8,1.3,24.3,10.1
11.7,36.9,45.2,7.3
131.7,18.4,34.6,12.9
172.5,18.1,30.7,14.4
85.7,35.8,49.3,13.3
188.4,18.1,25.6,14.9
163.5,36.8,7.4,18.0
117.2,14.7,5.4,11.9
234.5,3.4,84.8,11.9
17.9,37.6,21.6,8.0
206.8,5.2,19.4,12.2
215.4,23.6,57.6,17.1
284.3,10.6,6.4,15.0
50.0,11.6,18.4,8.4
164.5,20.9,47.4,14.5
19.6,20.1,17.0,7.6
168.4,7.1,12.8,11.7
222.4,3.4,13.1,11.5
276.9,48.9,41.8,27.0
248.4,30.2,20.3,20.2
170.2,7.8,35.2,11.7
276.7,2.3,23.7,11.8
165.6,10.0,17.6,12.6
156.6,2.6,8.3,10.5
218.5,5.4,27.4,12.2
56.2,5.7,29.7,8.7
287.6,43.0,71.8,26.2
253.8,21.3,30.0,17.6
205.0,45.1,19.6,22.6
139.5,2.1,26.6,10.3
191.1,28.7,18.2,17.3
286.0,13.9,3.7,15.9
18.7,12.1,23.4,6.7
39.5,41.1,5.8,10.8
75.5,10.8,6.0,9.9
17.2,4.1,31.6,5.9
166.8,42.0,3.6,19.6
149.7,35.6,6.0,17.3
38.2,3.7,13.8,7.6
94.2,4.9,8.1,9.7
177.0,9.3,6.4,12.8
283.6,42.0,66.2,25.5
232.1,8.6,8.7,13.4
%% Cell type:markdown id: tags:
# Regularization for Sparsity: L1 Regularization (lasso)
This exercise is based on the TensorFlow [playground](https://playground.tensorflow.org) program (developed by Google to teach machine learning principles).
You'll experiment with $L_1$ regularization on a small, noisy training data set to perform (supervised) binary classification. In this kind of setting, overfitting is a real concern. Fortunately, regularization might help.
The input data are bivariate, which yields the two features $X_1$ and $X_2$. Feature crosses such as their product $X_1 X_2$, their squared values $X_1^2$ and $X_2^2$, and their sines $\sin X_1$ and $\sin X_2$ are also included as inputs to the linear model.
%% Cell type:markdown id: tags:
## Task I
Consider now this [TensorFlow playground model](https://playground.tensorflow.org/#activation=linear&regularization=L2&batchSize=7&dataset=xor&regDataset=reg-plane&learningRate=0.01&regularizationRate=0.1&noise=30&networkShape=&seed=0.10080&showTestData=false&discretize=false&percTrainData=20&x=true&y=true&xTimesY=true&xSquared=true&ySquared=true&cosX=false&sinX=true&cosY=false&sinY=true&collectStats=false&problem=classification&initZero=false&hideText=false&playButton_hide=false&regularizationRate_hide=false&percTrainData_hide=false&numHiddenLayers_hide=true&noise_hide=false&problem_hide=true&regularization_hide=false&dataset_hide=false&activation_hide=true). The exercise consists of five related demos. To simplify comparisons, run each demo in a separate tab. Note that the thickness of each line connecting FEATURES and OUTPUT represents the relative weight (coefficient) of that feature.
| Demo | Regularization Type | Regularization Rate (lambda) |
|------|---------------------|------------------------------|
| 1 | $L_2$ | 0.1 |
| 2 | $L_2$ | 0.3 |
| 3 | $L_1$ | 0.1 |
| 4 | $L_1$ | 0.3 |
| 5 | $L_1$ | experiment |
%% Cell type:markdown id: tags:
**Questions:**
- How does switching from $L_2$ to $L_1$ regularization influence the delta between test loss and training loss?
- How does switching from $L_2$ to $L_1$ regularization influence the learned weights?
- How does increasing the $L_1$ regularization rate (lambda) influence the learned weights?
- Why does the $L_1$ penalty seem more appropriate than the $L_2$ one for this problem?
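%% Cell type:markdown id: tags:
The playground runs in the browser, but the same comparison can be reproduced offline. Below is a minimal sketch (not part of the original exercise): a noisy XOR-style data set, the same seven hand-crafted features as the playground, and `sklearn.linear_model.LogisticRegression` fitted with either an $L_2$ or an $L_1$ penalty. The data generation, noise level, and `C` value are illustrative assumptions; note that `C` is the *inverse* of the regularization rate, so decreasing `C` mimics increasing lambda.
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Noisy XOR-style data: label = sign of x1*x2, with 15% of labels flipped.
n = 200
X = rng.uniform(-3, 3, size=(n, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)
flip = rng.random(n) < 0.15
y[flip] = 1 - y[flip]

# The playground's feature set: x1, x2, x1*x2, x1^2, x2^2, sin(x1), sin(x2).
names = ["x1", "x2", "x1*x2", "x1^2", "x2^2", "sin(x1)", "sin(x2)"]
F = np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1],
                     X[:, 0] ** 2, X[:, 1] ** 2,
                     np.sin(X[:, 0]), np.sin(X[:, 1])])

# liblinear supports both penalties; smaller C = stronger regularization.
for penalty in ["l2", "l1"]:
    clf = LogisticRegression(penalty=penalty, C=0.5, solver="liblinear")
    clf.fit(F, y)
    print(penalty, dict(zip(names, np.round(clf.coef_.ravel(), 3))))
```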
%% Cell type:markdown id: tags:
## Task II
We now consider this [neural net model](https://playground.tensorflow.org/#activation=relu&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.01&regularizationRate=0.01&noise=50&networkShape=8,8,8,8,8,8&seed=0.91875&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=true&xSquared=true&ySquared=true&cosX=false&sinX=true&cosY=false&sinY=true&collectStats=false&problem=classification&initZero=false&hideText=false&playButton_hide=false&regularizationRate_hide=false&percTrainData_hide=false&numHiddenLayers_hide=false&noise_hide=false&problem_hide=true&regularization_hide=false&dataset_hide=false&activation_hide=false) with six hidden layers of eight units (neurons) each. We choose a non-linear activation function, namely the `ReLU` $g(x) = \max (0,x)$, which is a standard choice in deep learning ([why?](#Some-words-on-activation-functions-for-deep-neural-nets)).
Run the model as given, without regularization, for at least 1000 epochs. If necessary, adjust the learning rate (increase it to speed up convergence, or decrease it to improve stability).
Note the following:
- the delta between test and training loss,
- the learned weights for the features and hidden units (neurons).
Redo the same experiment in a separate tab, now with $L_1$ regularization and the given regularization rate.
**Questions:**
- How does introducing $L_1$ regularization
- influence the test loss and also the delta between test loss and training loss?
- influence the learned weights?
- Which features are the only ones selected with $L_1$ regularization? Is this in agreement with the 'optimal' decision boundary for this data set? Are the hidden layers useful here?
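%% Cell type:markdown id: tags:
The playground itself is a browser demo, but as a rough offline analogue one can train the same architecture with an explicit $L_1$ penalty, which is simply $\lambda \sum_k |w_k|$ added to the data loss. The PyTorch sketch below assumes illustrative data and hyperparameters (circle-style classes, $\lambda = 10^{-3}$); note that with a plain gradient optimizer the $L_1$ penalty drives weights *near* zero rather than exactly to zero.
%% Cell type:code id: tags:
``` python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

# Circle-style data (illustrative): class 1 in a disk, class 0 in a ring.
n = 400
r = torch.cat([2.0 * torch.rand(n // 2), 3.0 + 2.0 * torch.rand(n // 2)])
t = 2 * math.pi * torch.rand(n)
y = (r < 2.5).float().unsqueeze(1)
x1 = r * torch.cos(t) + 0.3 * torch.randn(n)   # noisy coordinates
x2 = r * torch.sin(t) + 0.3 * torch.randn(n)

# The playground's seven input features.
F = torch.stack([x1, x2, x1 * x2, x1**2, x2**2,
                 torch.sin(x1), torch.sin(x2)], dim=1)

# Six hidden layers of eight ReLU units, as in the linked model.
layers, d = [], 7
for _ in range(6):
    layers += [nn.Linear(d, 8), nn.ReLU()]
    d = 8
model = nn.Sequential(*layers, nn.Linear(d, 1))

lam = 1e-3                                      # assumed L1 rate
opt = torch.optim.Adam(model.parameters(), lr=0.01)
bce = nn.BCEWithLogitsLoss()

for epoch in range(1000):
    opt.zero_grad()
    loss = bce(model(F), y)                     # data term
    l1 = sum(p.abs().sum() for p in model.parameters())
    (loss + lam * l1).backward()                # add the L1 penalty
    opt.step()

small = sum((p.abs() < 1e-3).sum().item() for p in model.parameters())
total = sum(p.numel() for p in model.parameters())
print(f"data loss {loss.item():.3f}; near-zero weights: {small}/{total}")
```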
%% Cell type:markdown id: tags:
### Some words on activation functions for deep neural nets
(Deep) neural nets use nonlinear activation functions in the units, i.e. the neurons, of the hidden layers to define the output of each unit given its input. The logistic function (often called the *sigmoid*) and other sigmoidal functions, such as the hyperbolic tangent `tanh`, have been the typical choices since the 1990s and 2000s.
However, these functions turn out to be poorly suited to some deep neural net architectures, such as convolutional networks.
The adoption of the rectified linear unit (ReLU) activation function in the 2010s may be considered one of the milestones that now permit the routine development of very deep neural networks, for several reasons:
1. **Counter the _vanishing gradient problem_**
A general problem with the logistic and hyperbolic tangent `tanh` functions is that they saturate. For instance, the logistic function snaps to 1 for large positive inputs and to 0 for large negative inputs (`tanh` saturates at 1 and -1), and is only really sensitive to changes when its input is near 0.
Layers deep in large networks using these activation functions fail to receive useful gradient information. The error is back-propagated through the network from the output and used to update the weights, and its magnitude shrinks dramatically with each additional layer it passes through, since it is multiplied by the (small) derivative of the activation function at every layer. This is called the *vanishing gradient problem* and prevents deep networks from learning effectively (a short numerical sketch follows this list).
Because ReLU is piecewise linear, it preserves many of the properties that make linear models easy to optimize with gradient-based methods. In particular, it is linear for positive values, where its derivative is exactly 1, which protects against the *vanishing gradient problem*. It also preserves many of the properties that make linear models generalize well. Yet it is a nonlinear function, since negative inputs are mapped to zero: this allows more flexible prediction rules than purely linear ones, and networks of such units are universal function approximators.
2. **Make computation cheaper!**
ReLU is very cheap to compute: no multiplication or call to a complex function is needed (and the gradient is extremely simple: 1 for positive inputs and 0 for negative ones). This matters when the number of units reaches tens of millions or more in deep architectures!
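%% Cell type:markdown id: tags:
Here is the short numerical sketch of point 1 mentioned above (illustrative only: random pre-activations are assumed, and the weight matrices, which also enter the product, are ignored). The back-propagated gradient picks up one activation-derivative factor per layer; with the logistic function that factor is at most 0.25, while on an active ReLU path it is exactly 1.
%% Cell type:code id: tags:
``` python
import numpy as np

rng = np.random.default_rng(0)
depth = 30
z = rng.normal(size=depth)          # assumed pre-activation at each layer

s = 1.0 / (1.0 + np.exp(-z))
d_logistic = s * (1.0 - s)          # logistic derivative, <= 0.25 everywhere

g_logistic = np.prod(d_logistic)    # gradient factor after `depth` layers
g_relu = 1.0 ** depth               # ReLU derivative is 1 while z > 0

print(f"logistic: gradient shrunk by a factor ~ {g_logistic:.1e}")
print(f"ReLU (active path): factor stays at {g_relu:.0f}")
```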
%% Cell type:code id: tags:
``` python
```