"- Change the desired number of variables (number of nonzero coefs) from p to 1 to rank the features by signicance order (most significant variables must enter first in the model). Is it perfectly consistent with the previous results?\n",
"- What is hyperparameter to optimize for the OMP procedure? How can you estimate it?\n",
"- Replace the OMP procedure by [`linear_model.OrthogonalMatchingPursuitCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.OrthogonalMatchingPursuit.html#sklearn.linear_model.OrthogonalMatchingPursuit) to select the most significant variables. Is it in agreement with the previous results?\n",
"- *Optional:* Compared to greedy methods that are suboptimal procedures to approximate the best sparse solution (the best subset), Lasso penalty is often presented as a *convex relexation of the sparsity constraint*. Can you explain why? What are the possible benefits/disadvantages of the greedy and lasso procedures?\n"
"- *Optional:* Compared to greedy methods that are suboptimal procedures to approximate the best sparse solution (the best subset), Lasso penalty is often presented as a *convex relaxation of the sparsity constraint*. Can you explain why? What are the possible benefits/disadvantages of the greedy and lasso procedures?\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
...
...
%% Cell type:markdown id: tags:
This notebook can be run on mybinder: [launch the notebook](https://mybinder.org/v2/git/https%3A%2F%2Fgricad-gitlab.univ-grenoble-alpes.fr%2Fchatelaf%2Fml-sicom3a/master?urlpath=lab/tree/notebooks/5bis_linear_models_lasso_logistic/)
%% Cell type:markdown id: tags:
# South Africa Heart Diseases Data
A retrospective sample of males in a heart-disease high-risk region
of the Western Cape, South Africa. There are roughly two controls per
case of Coronary Heart Disease (CHD). Many of the CHD-positive men have undergone blood
pressure reduction treatment and other programs to reduce their risk
factors after their CHD event. In some cases the measurements were
made after these treatments. These data are taken from a larger
dataset, described in Rousseauw et al., 1983, South African Medical Journal.
- What does a positive, negative or near-zero weight mean for predicting heart disease?
- How do you interpret the weight of obesity, for instance?
- How can you explain such surprising findings?
%% Cell type:markdown id: tags:
## Compute $\ell_1$ penalized Logistic Regression and lasso path
Regularization functions such as the $\ell_1$ or $\ell_2$ penalty (or the combined $\ell_1$/$\ell_2$ penalty as in *elastic net*) can also be used in generalized linear models such as Logistic Regression. The residual sum of squares criterion used for linear regression is then replaced by the negative (conditional) log-likelihood.
For instance, for Logistic Regression with Lasso-type regularization ($\ell_1$ penalty), this yields for binary classification $y_i \in \{-1,+1\}$ the following optimization problem:
$$
\min_{w_0,\, w} \; \sum_{i=1}^{n} \log\left( 1 + e^{-y_i (w_0 + x_i^T w)} \right) \; + \; \lambda \|w\|_1
$$
Within scikit-learn, penalized Logistic Regression is available through the `penalty` parameter of [`linear_model.LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), the parameter `C` being the inverse of the regularization strength, i.e. $C=\frac{1}{\lambda}$.
%% Cell type:code id: tags:
``` python
# compute the lasso path
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from time import time
from sklearn.svm import l1_min_c
import matplotlib.pyplot as plt

# Standardize the data!
# Don't forget that scale matters for penalized regressions such as Lasso or ridge
# (this also holds for distance-based methods such as K-NN).
# If one variable has a larger magnitude than the others (imagine that you change the
# unit of a variable from kilograms to grams), this variable will be much less shrunken
# than the others.
# Advice: unless you have a good reason to do otherwise, standardize all your variables
# so that they are comparable.
sc = StandardScaler()
Xs = sc.fit_transform(X)  # center (zero mean) and reduce (unit variance) the variables

# generate useful values for the regularization parameter (log-scale)
```
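The rest of this cell is not shown here. Continuing from the last comment, here is a minimal sketch of how the regularization path could be computed and plotted; the grid of `C` values, the solver and the tolerance are illustrative choices, and `y` (binary CHD labels) and `names` (feature names) are assumed to be defined in earlier cells.
``` python
import numpy as np

# Smallest C for which the L1-penalized model is not identically zero,
# then a log-spaced grid of larger values (illustrative range)
cs = l1_min_c(Xs, y, loss="log") * np.logspace(0, 4, 30)

clf = LogisticRegression(penalty="l1", solver="liblinear", tol=1e-6, max_iter=int(1e6))
coefs = []
start = time()
for c in cs:
    clf.set_params(C=c)
    clf.fit(Xs, y)
    coefs.append(clf.coef_.ravel().copy())
print("Lasso path computed in {:.3f}s".format(time() - start))

# Each curve shows how one coefficient evolves as the regularization is relaxed
plt.plot(np.log10(cs), np.array(coefs), marker="o")
plt.xlabel("log10(C)")
plt.ylabel("coefficients")
plt.title("L1-penalized Logistic Regression path")
plt.legend(names, loc="best")
plt.show()
```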
%% Cell type:markdown id: tags:
- What are the only significant variables selected when the regularization strength is estimated by cross-validation?
- How can we rank them by significance order (*hint: look at the lasso path*)?
- Do these results seem more credible (than those obtained without regularization) to predict heart disease?
%% Cell type:markdown id: tags:
### Compute the predicted CHD probability for some patients
For this kind of problem, we are of course more interested in the model linking the input variables to the response, and in its interpretation, than in the mere prediction of the binary responses (CHD or not). With a generalized linear model such as LR, it is possible to compute a risk probability for each patient and to assess and interpret the influence of each variable.
%% Cell type:code id: tags:
``` python
# Get the predicted risk for the first patient
ipatient = 0  # patient index
x = Xs[ipatient, :].reshape(1, -1)
x = pd.DataFrame.from_records(x, columns=names)
ylabel = 'Case' if y[ipatient] else 'Control'
print("{} Patient with (standardized) features:\n{}\n".format(ylabel, x))

# TODO: increase/decrease tobacco consumption
x_copy = x.copy()
x_copy['tobacco'] += 0  # offset to add, in standardized units

# Probability of heart disease, without and with the tobacco offset
proba_CHD_0 = model.predict_proba(x)[0, 1]
proba_CHD = model.predict_proba(x_copy)[0, 1]
print("Proba of coronary heart disease (CHD): {:.3f}".format(proba_CHD))
print("Odds of CHD: {:.3f}".format(proba_CHD / (1 - proba_CHD)))
```
%% Cell type:markdown id: tags:
- Based on the logistic regression formula for the probability of each class (with or without CHD) and the values of the estimated weights, by what factor would the odds $$\frac{ \Pr(\textrm{"CHD"} |X=x)}{\Pr( \textrm{"no CHD"}|X=x)}$$ increase when the tobacco consumption increases by 1 (in standardized units)?
- Check this by adding an offset to the tobacco variable in the cell above and comparing the resulting odds (a small check is also sketched below).
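As a quick numerical check, here is a small sketch assuming `model` is the fitted penalized Logistic Regression and that `names`, `proba_CHD_0` and `proba_CHD` come from the cell above; it compares the observed odds ratio with the exponentiated tobacco weight.
``` python
import numpy as np

# Weight estimated for the (standardized) tobacco variable
w_tobacco = model.coef_[0, list(names).index('tobacco')]

# The log-odds is linear in x, so a +1 offset (in standardized units) on 'tobacco'
# multiplies the odds of CHD by exp(w_tobacco)
print("exp(w_tobacco) = {:.3f}".format(np.exp(w_tobacco)))
print("Observed odds ratio = {:.3f}".format(
    (proba_CHD / (1 - proba_CHD)) / (proba_CHD_0 / (1 - proba_CHD_0))))
```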
%% Cell type:markdown id: tags:
## Greedy variable selection procedure
An alternative to the Lasso penalty for selecting the most significant variables and regularizing the problem is to fit all the possible combinations of variables and keep the best one (for instance, the one that minimizes the test error). This is called the *best subset* criterion. However, this approach is computationally very demanding: with $p$ variables, we need to fit $2^p$ models. Additionally, if we want to use cross-validation to evaluate and compare their performances, the problem quickly becomes infeasible...
Greedy algorithms are useful procedures that select the best variable (forward selection) or remove the worst variable (backward selection) **step by step**. There is generally no longer any guarantee of converging towards the best subset solution, but this may give useful solutions at a much lower computational cost.
**Exercise:**
- For our heart disease dataset, using a $K=5$-fold CV, how many fits would be required to find the best subset?
Given a prediction rule that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select variables (features) by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features. Then, the least important features are pruned from the current set. This procedure is repeated recursively on the pruned set.
Within scikit-learn, recursive feature elimination is available with [`feature_selection.RFE`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html): for a linear model such as Logistic Regression, the importance of the features is obtained through their coefficients (remember to scale your data!). The elimination is repeated until the desired number of features to select is reached.
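The code cell that produced the output below is not shown here. As a rough sketch of what it could look like (assuming `Xs`, `y` and `names` are defined as above; the base estimator and the number of selected features are guesses based on the output below):
``` python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Recursively eliminate the least important feature (smallest |coefficient|)
# until only 4 features remain
rfe = RFE(LogisticRegression(), n_features_to_select=4, step=1)
rfe.fit(Xs, y)

print("The selected variables are:", np.array(names)[rfe.support_])

# Feature ranking: rank 1 = selected; the higher the rank, the earlier the
# feature was eliminated. Displayed with columns sorted by rank.
ranking = pd.DataFrame([rfe.ranking_], columns=names, index=["Rank"])
ranking.sort_values(by="Rank", axis=1)
```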
The selected variables are: ['tobacco' 'famhist' 'typea' 'age']
%%%% Output: execute_result
      tobacco  famhist  typea  age  ldl  obesity  adiposity  sbp  alcohol
Rank        1        1      1    1    2        3          4    5        6
%% Cell type:markdown id: tags:
**Exercise:**
- Change the desired number of features to 1 to rank the features by significance order (the most significant variables must enter the model first). Is it perfectly consistent with the Lasso path ranking?
- What is the hyperparameter to optimize for the RFE procedure? How can you estimate it?
- Replace the RFE procedure by [`feature_selection.RFECV`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html) to select the most significant variables (a usage sketch is given after this list). Is it in agreement with the Lasso results?
- Do you think that this procedure is still appropriate when the number of variables $p$ is greater than the sample size $n$ of the training set?
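As a starting point for the RFECV question, here is a minimal usage sketch (again assuming `Xs`, `y` and `names` from earlier cells; the 5-fold cross-validation setting is an illustrative choice):
``` python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Let 5-fold cross-validation choose how many features to keep
rfecv = RFECV(LogisticRegression(), step=1, cv=5)
rfecv.fit(Xs, y)

print("Optimal number of features:", rfecv.n_features_)
print("Selected variables:", np.array(names)[rfecv.support_])
```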
Orthogonal Matching Pursuit (OMP) is based on a greedy algorithm that selects, at each step, the most significant variable to enter the model. For a linear regression model, this is the variable most highly correlated with the current residual. It is similar to the simpler Matching Pursuit (MP) method, but better in that, at each iteration, the residual is recomputed using an orthogonal projection onto the span of the previously chosen variables.
OMP was introduced in
> S. G. Mallat, Z. Zhang, [Matching pursuits with time-frequency dictionaries](http://blanche.polytechnique.fr/~mallat/papiers/MallatPursuit93.pdf), IEEE Transactions on Signal Processing, Vol. 41, No. 12 (December 1993), pp. 3397-3415.
Note also that there exists another popular method in the statistical literature called *forward stepwise selection*. Its selection step differs from the one used in OMP in that it selects the variable that will lead to the minimum residual error *after* orthogonalization. See for instance the following paper for a comparison of the two methods:
> Blumensath, Thomas, and Mike E. Davies. ["On the difference between orthogonal matching pursuit and orthogonal least squares."](https://eprints.soton.ac.uk/142469/1/BDOMPvsOLS07.pdf) (2007).
This principle can be extended to generalized linear models such as Logistic Regression by replacing the residual sum of squares criterion with the negative log-likelihood.
Within scikit-learn, only OMP for the linear regression model is available, with [`linear_model.OrthogonalMatchingPursuit`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.OrthogonalMatchingPursuit.html). However, it can be applied directly to our binary classification problem by treating the binary labels as numerical responses.
%% Cell type:code id: tags:
``` python
# We use Forward selection, aka Orthogonal Matching Pursuit
```
Most significant features: ['tobacco' 'ldl' 'famhist' 'typea' 'age']
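The body of the cell above is not shown. Here is a minimal sketch of the forward selection with OMP (assuming `Xs`, `y` and `names` as before; the number of nonzero coefficients is a guess based on the output above):
``` python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

# OMP regresses the binary labels as numerical responses on the standardized inputs,
# adding at each step the variable most correlated with the current residual
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=5)
omp.fit(Xs, y)

# The selected variables are those with a nonzero coefficient
print("Most significant features:", np.array(names)[omp.coef_ != 0])
```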
%% Cell type:markdown id: tags:
**Exercise:**
- Change the desired number of variables (number of nonzero coefs) from $p$ to 1 to rank the features by significance order (the most significant variables must enter the model first). Is it perfectly consistent with the previous results?
- What is the hyperparameter to optimize for the OMP procedure? How can you estimate it?
- Replace the OMP procedure by [`linear_model.OrthogonalMatchingPursuitCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.OrthogonalMatchingPursuitCV.html) to select the most significant variables (see the sketch below). Is it in agreement with the previous results?
- *Optional:* Compared to greedy methods, which are suboptimal procedures to approximate the best sparse solution (the best subset), the Lasso penalty is often presented as a *convex relaxation of the sparsity constraint*. Can you explain why? What are the possible benefits/disadvantages of the greedy and lasso procedures?
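For the cross-validated variant, a minimal usage sketch (assuming `Xs`, `y` and `names` as before; the 5-fold CV setting is an illustrative choice):
``` python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuitCV

# Select the number of nonzero coefficients by cross-validation
ompcv = OrthogonalMatchingPursuitCV(cv=5)
ompcv.fit(Xs, y)

print("Selected number of nonzero coefficients:", ompcv.n_nonzero_coefs_)
print("Most significant features:", np.array(names)[ompcv.coef_ != 0])
```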