|
9 | 9 | "source": [ |
10 | 10 | "<!-- HTML file automatically generated from DocOnce source (https://github.com/doconce/doconce/)\n", |
11 | 11 | "doconce format html exercisesweek37.do.txt -->\n", |
12 | | - "<!-- dom:TITLE: Exercises week 37 -->" |
| 12 | + "<!-- dom:TITLE: Exercises week 37 -->\n" |
13 | 13 | ] |
14 | 14 | }, |
15 | 15 | { |
|
20 | 20 | }, |
21 | 21 | "source": [ |
22 | 22 | "# Exercises week 37\n", |
| 23 | + "\n", |
23 | 24 | "**Implementing gradient descent for Ridge and ordinary Least Squares Regression**\n", |
24 | 25 | "\n", |
25 | | - "Date: **September 8-12, 2025**" |
| 26 | + "Date: **September 8-12, 2025**\n" |
26 | 27 | ] |
27 | 28 | }, |
28 | 29 | { |
|
35 | 36 | "## Learning goals\n", |
36 | 37 | "\n", |
37 | 38 | "After having completed these exercises you will have:\n", |
| 39 | + "\n", |
38 | 40 | "1. Written your own code implementing the simplest gradient descent approach applied to ordinary least squares (OLS) and Ridge regression\n", |
39 | 41 | "\n", |
40 | 42 | "2. Compared the analytical expressions for OLS and Ridge regression with the gradient descent approach\n", |
41 | 43 | "\n", |
42 | 44 | "3. Explored the role of the learning rate in the gradient descent approach and the hyperparameter $\lambda$ in Ridge regression\n", |
43 | 45 | "\n", |
44 | | - "4. Scaled the data properly" |
| 46 | + "4. Scaled the data properly\n" |
45 | 47 | ] |
46 | 48 | }, |
47 | 49 | { |
|
53 | 55 | "source": [ |
54 | 56 | "## Simple one-dimensional second-order polynomial\n", |
55 | 57 | "\n", |
56 | | - "We start with a very simple function" |
| 58 | + "We start with a very simple function\n" |
57 | 59 | ] |
58 | 60 | }, |
59 | 61 | { |
|
65 | 67 | "source": [ |
66 | 68 | "$$\n", |
67 | 69 | "f(x)= 2-x+5x^2,\n", |
68 | | - "$$" |
| 70 | + "$$\n" |
69 | 71 | ] |
70 | 72 | }, |
71 | 73 | { |
|
75 | 77 | "editable": true |
76 | 78 | }, |
77 | 79 | "source": [ |
78 | | - "defined for $x\\in [-2,2]$. You can add noise if you wish. \n", |
| 80 | + "defined for $x\\in [-2,2]$. You can add noise if you wish.\n", |
79 | 81 | "\n", |
80 | 82 | "We are going to fit this function with a polynomial ansatz. The easiest thing is to set up a second-order polynomial and see if you can fit the above function.\n", |
81 | | - "Feel free to play around with higher-order polynomials." |
| 83 | + "Feel free to play around with higher-order polynomials.\n" |
82 | 84 | ] |
83 | 85 | }, |
84 | 86 | { |
|
94 | 96 | "standardize the features. This ensures all features are on a\n", |
95 | 97 | "comparable scale, which is especially important when using\n", |
96 | 98 | "regularization. Here we will perform standardization, scaling each\n", |
97 | | - "feature to have mean 0 and standard deviation 1." |
| 99 | + "feature to have mean 0 and standard deviation 1.\n" |
98 | 100 | ] |
99 | 101 | }, |
100 | 102 | { |
|
114 | 116 | "term, the data is shifted such that the intercept is effectively 0.\n", |
115 | 117 | "(In practice, one could include an intercept in the model and not\n", |
116 | 118 | "penalize it, but here we simplify by centering.)\n", |
117 | | - "Choose $n=100$ data points and set up $\\boldsymbol{x}, $\\boldsymbol{y}$ and the design matrix $\\boldsymbol{X}$." |
| 119 | + "Choose $n=100$ data points and set up $\\boldsymbol{x}$, $\\boldsymbol{y}$ and the design matrix $\\boldsymbol{X}$.\n" |
118 | 120 | ] |
119 | 121 | }, |
120 | 122 | { |
|
145 | 147 | "editable": true |
146 | 148 | }, |
147 | 149 | "source": [ |
148 | | - "Fill in the necessary details. Do we need to center the $y$-values? \n", |
| 150 | + "Fill in the necessary details. Do we need to center the $y$-values?\n", |
149 | 151 | "\n", |
150 | 152 | "After this preprocessing, each column of $\\boldsymbol{X}_{\\mathrm{norm}}$ has mean zero and standard deviation $1$\n", |
151 | 153 | "and $\\boldsymbol{y}_{\\mathrm{centered}}$ has mean 0. This makes the optimization landscape\n", |
152 | 154 | "nicer and ensures the regularization penalty $\\lambda \\sum_j\n", |
153 | 155 | "\\theta_j^2$ in Ridge regression treats each coefficient fairly (since features are on the\n", |
154 | | - "same scale)." |
| 156 | + "same scale).\n" |
155 | 157 | ] |
156 | 158 | }, |
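As a concrete starting point, the data generation and preprocessing described above can be sketched as follows (a minimal sketch under the stated setup; the names `X_norm` and `y_centered` are illustrative assumptions, not part of the notebook's template):

```python
import numpy as np

# f(x) = 2 - x + 5x^2 sampled at n = 100 points on [-2, 2]
n = 100
x = np.linspace(-2, 2, n)
y = 2 - x + 5 * x**2  # add noise here if desired

# Second-order polynomial design matrix without an intercept column;
# the intercept is handled by centering y instead
X = np.column_stack([x, x**2])

# Standardize each feature to mean 0 and standard deviation 1
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

# Center the targets so that no intercept needs to be fitted
y_centered = y - y.mean()
```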
157 | 159 | { |
|
163 | 165 | "source": [ |
164 | 166 | "## Exercise 2, calculate the gradients\n", |
165 | 167 | "\n", |
166 | | - "Find the gradients for OLS and Ridge regression using the mean-squared error as cost/loss function." |
| 168 | + "Find the gradients for OLS and Ridge regression using the mean-squared error as cost/loss function.\n" |
167 | 169 | ] |
168 | 170 | }, |
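As a check on your derivation: writing the costs as $C_{\mathrm{OLS}}(\boldsymbol{\theta})=\frac{1}{n}\lVert\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta}\rVert^2$ and $C_{\mathrm{Ridge}}(\boldsymbol{\theta})=C_{\mathrm{OLS}}(\boldsymbol{\theta})+\lambda\lVert\boldsymbol{\theta}\rVert^2$ (the prefactor conventions here are an assumption; they vary between texts), the gradients should come out as

$$
\nabla_{\boldsymbol{\theta}} C_{\mathrm{OLS}} = -\frac{2}{n}\boldsymbol{X}^T\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta}\right),
\qquad
\nabla_{\boldsymbol{\theta}} C_{\mathrm{Ridge}} = -\frac{2}{n}\boldsymbol{X}^T\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta}\right) + 2\lambda\boldsymbol{\theta}.
$$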
169 | 171 | { |
|
173 | 175 | "editable": true |
174 | 176 | }, |
175 | 177 | "source": [ |
176 | | - "## Exercise 3, using the analytical formulae for OLS and Ridge regression to find the optimal parameters $\boldsymbol{\theta}$" |
| 178 | + "## Exercise 3, using the analytical formulae for OLS and Ridge regression to find the optimal parameters $\boldsymbol{\theta}$\n" |
177 | 179 | ] |
178 | 180 | }, |
179 | 181 | { |
|
210 | 212 | "This computes the Ridge and OLS regression coefficients directly. The identity\n", |
211 | 213 | "matrix $I$ has the same size as $X^T X$. It adds $\\lambda$ to the diagonal of $X^T X$ for Ridge regression. We\n", |
212 | 214 | "then invert this matrix and multiply by $X^T y$. The result\n", |
213 | | - "for $\\boldsymbol{\\theta}$ is a NumPy array of shape (n$\\_$features,) containing the\n", |
214 | | - "fitted parameters $\\boldsymbol{\\theta}$." |
| 215 | + "for $\\boldsymbol{\\theta}$ is a NumPy array of shape (n$\\_$features,) containing the\n", |
| 216 | + "fitted parameters $\\boldsymbol{\\theta}$.\n" |
215 | 217 | ] |
216 | 218 | }, |
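The closed-form computation described above can be sketched like this (a sketch, not the notebook's own cell; `np.linalg.solve` is used instead of an explicit inverse for numerical stability, and all names are assumptions):

```python
import numpy as np

def closed_form(X, y, lam=0.0):
    """OLS (lam = 0) or Ridge (lam > 0) parameters via the normal equations."""
    p = X.shape[1]
    # Solve (X^T X + lam * I) theta = X^T y rather than inverting explicitly
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Sanity check on noise-free data: OLS recovers the generating parameters
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
theta_true = np.array([5.0, -3.0, 2.0])
y = X @ theta_true
theta_ols = closed_form(X, y)
theta_ridge = closed_form(X, y, lam=1.0)  # slightly shrunk toward zero
```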
217 | 219 | { |
|
223 | 225 | "source": [ |
224 | 226 | "### 3a)\n", |
225 | 227 | "\n", |
226 | | - "Finalize, in the above code, the OLS and Ridge regression determination of the optimal parameters $\\boldsymbol{\\theta}$." |
| 228 | + "Finalize, in the above code, the OLS and Ridge regression determination of the optimal parameters $\\boldsymbol{\\theta}$.\n" |
227 | 229 | ] |
228 | 230 | }, |
229 | 231 | { |
|
235 | 237 | "source": [ |
236 | 238 | "### 3b)\n", |
237 | 239 | "\n", |
238 | | - "Explore the results as a function of different values of the hyperparameter $\lambda$. See for example exercise 4 from week 36." |
| 240 | + "Explore the results as a function of different values of the hyperparameter $\lambda$. See for example exercise 4 from week 36.\n" |
239 | 241 | ] |
240 | 242 | }, |
241 | 243 | { |
|
252 | 254 | "necessary if $n$ and $p$ are so large that the closed-form might be\n", |
253 | 255 | "too slow or memory-intensive. We derive the gradients from the cost\n", |
254 | 256 | "functions defined above. Use the gradients of the Ridge and OLS cost functions with respect to\n", |
255 | | - "the parameters $\\boldsymbol{\\theta}$ and set up (using the template below) your own gradient descent code for OLS and Ridge regression.\n", |
| 257 | + "the parameters $\\boldsymbol{\\theta}$ and set up (using the template below) your own gradient descent code for OLS and Ridge regression.\n", |
256 | 258 | "\n", |
257 | | - "Below is a template code for a gradient descent implementation of Ridge regression:" |
| 259 | + "Below is a template code for a gradient descent implementation of Ridge regression:\n" |
258 | 260 | ] |
259 | 261 | }, |
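A bare-bones version of such a gradient descent loop might look like this (a hedged sketch, not the course template; the learning rate `eta` and iteration count are illustrative assumptions):

```python
import numpy as np

def gd_ridge(X, y, lam=0.1, eta=0.05, n_iters=2000):
    """Plain gradient descent on (1/n)||y - X theta||^2 + lam * ||theta||^2."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_iters):
        grad = -(2.0 / n) * X.T @ (y - X @ theta) + 2.0 * lam * theta
        theta -= eta * grad
    return theta

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
y = X @ np.array([1.5, -0.5])
theta_gd = gd_ridge(X, y, lam=0.0)  # lam = 0 reduces to OLS
```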
260 | 262 | { |
|
301 | 303 | "### 4a)\n", |
302 | 304 | "\n", |
303 | 305 | "Write first a gradient descent code for OLS only using the above template.\n", |
304 | | - "Discuss the results as a function of the learning rate parameter and the number of iterations." |
| 306 | + "Discuss the results as a function of the learning rate parameter and the number of iterations.\n" |
305 | 307 | ] |
306 | 308 | }, |
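To get the discussion in 4a going, one can sweep a few learning rates and compare the resulting MSE (the grid of `eta` values and the iteration budget are illustrative assumptions):

```python
import numpy as np

def gd_ols(X, y, eta, n_iters=500):
    """Gradient descent on the OLS cost (1/n)||y - X theta||^2."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_iters):
        theta -= eta * (-(2.0 / n) * X.T @ (y - X @ theta))
    return theta

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 2))
y = X @ np.array([1.0, 2.0])

mse = {}
for eta in (0.001, 0.01, 0.1):
    theta = gd_ols(X, y, eta)
    mse[eta] = np.mean((y - X @ theta) ** 2)
# Within the stable range, larger learning rates converge in fewer iterations
```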
307 | 309 | { |
|
314 | 316 | "### 4b)\n", |
315 | 317 | "\n", |
316 | 318 | "Write then a similar code for Ridge regression using the above template.\n", |
317 | | - "Try to add a stopping criterion based on the number of iterations and the difference between the new and old $\theta$ values. How would you define a stopping criterion?" |
| 319 | + "Try to add a stopping criterion based on the number of iterations and the difference between the new and old $\theta$ values. How would you define a stopping criterion?\n" |
318 | 320 | ] |
319 | 321 | }, |
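One reasonable stopping criterion (an illustration, not the only choice) is to halt when the change in $\theta$ between iterations drops below a tolerance, with a cap on the total number of iterations:

```python
import numpy as np

def gd_ridge_stop(X, y, lam=0.0, eta=0.05, max_iters=10000, tol=1e-8):
    """Gradient descent that stops when ||theta_new - theta_old|| < tol."""
    n, p = X.shape
    theta = np.zeros(p)
    for it in range(max_iters):
        grad = -(2.0 / n) * X.T @ (y - X @ theta) + 2.0 * lam * theta
        theta_new = theta - eta * grad
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new, it  # converged before hitting the cap
        theta = theta_new
    return theta, max_iters

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 2))
y = X @ np.array([2.0, 1.0])
theta_hat, iters_used = gd_ridge_stop(X, y)
```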
320 | 322 | { |
|
339 | 341 | "Then we sample feature values for $\\boldsymbol{X}$ randomly (e.g. from a normal distribution). We use a normal distribution so features are roughly centered around 0.\n", |
340 | 342 | "Then we compute the target values $y$ using the linear combination $\\boldsymbol{X}\\hat{\\boldsymbol{\\theta}}$ and add some noise (to simulate measurement error or unexplained variance).\n", |
341 | 343 | "\n", |
342 | | - "Below is the code to generate the dataset:" |
| 344 | + "Below is the code to generate the dataset:\n" |
343 | 345 | ] |
344 | 346 | }, |
345 | 347 | { |
346 | 348 | "cell_type": "code", |
347 | | - "execution_count": 4, |
| 349 | + "execution_count": null, |
348 | 350 | "id": "8be1cebe", |
349 | 351 | "metadata": { |
350 | 352 | "collapsed": false, |
|
368 | 370 | "X = np.random.randn(n_samples, n_features) # standard normal distribution\n", |
369 | 371 | "\n", |
370 | 372 | "# Generate target values y with a linear combination of X and theta_true, plus noise\n", |
371 | | - "noise = 0.5 * np.random.randn(n_samples) # Gaussian noise\n", |
| 373 | + "noise = 0.5 * np.random.randn(n_samples) # Gaussian noise\n", |
372 | 374 | "y = X @ theta_true + noise" |
373 | 375 | ] |
374 | 376 | }, |
|
383 | 385 | "significantly influence $\\boldsymbol{y}$. The rest of the features have zero true\n", |
384 | 386 | "coefficient. For example, feature 0 has\n", |
385 | 387 | "a true weight of 5.0, feature 1 has -3.0, and feature 6 has 2.0, so\n", |
386 | | - "the expected relationship is:" |
| 388 | + "the expected relationship is:\n" |
387 | 389 | ] |
388 | 390 | }, |
389 | 391 | { |
|
395 | 397 | "source": [ |
396 | 398 | "$$\n", |
397 | 399 | "y \\approx 5 \\times x_0 \\;-\\; 3 \\times x_1 \\;+\\; 2 \\times x_6 \\;+\\; \\text{noise}.\n", |
398 | | - "$$" |
| 400 | + "$$\n" |
399 | 401 | ] |
400 | 402 | }, |
401 | 403 | { |
|
405 | 407 | "editable": true |
406 | 408 | }, |
407 | 409 | "source": [ |
408 | | - "You can remove the noise if you wish to. \n", |
| 410 | + "You can remove the noise if you wish to.\n", |
409 | 411 | "\n", |
410 | 412 | "Try to fit the above data set using OLS and Ridge regression with the analytical expressions and your own gradient descent codes.\n", |
411 | 413 | "\n", |
412 | 414 | "If everything worked correctly, the learned coefficients should be\n", |
413 | 415 | "close to the true values [5.0, -3.0, 0.0, …, 2.0, …] that we used to\n", |
414 | 416 | "generate the data. Keep in mind that due to regularization and noise,\n", |
415 | 417 | "the learned values will not exactly equal the true ones, but they\n", |
416 | | - "should be in the same ballpark. Which method (OLS or Ridge) gives the best results?" |
| 418 | + "should be in the same ballpark. Which method (OLS or Ridge) gives the best results?\n" |
417 | 419 | ] |
418 | 420 | } |
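The final comparison can be sketched as below (the seed, `n_samples`, `n_features`, and $\lambda$ here are assumptions, since the generating cell above fixes its own values):

```python
import numpy as np

# Recreate a dataset of the same form as above (sizes and seed are assumptions)
rng = np.random.default_rng(42)
n_samples, n_features = 1000, 10
theta_true = np.zeros(n_features)
theta_true[0], theta_true[1], theta_true[6] = 5.0, -3.0, 2.0
X = rng.standard_normal((n_samples, n_features))
y = X @ theta_true + 0.5 * rng.standard_normal(n_samples)

# Analytical OLS and Ridge estimates via the normal equations
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)
lam = 1.0
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)
# Both should land close to [5, -3, 0, ..., 2, ...]; Ridge is slightly shrunk
```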
419 | 421 | ], |
420 | | - "metadata": {}, |
| 422 | + "metadata": { |
| 423 | + "language_info": { |
| 424 | + "name": "python" |
| 425 | + } |
| 426 | + }, |
421 | 427 | "nbformat": 4, |
422 | 428 | "nbformat_minor": 5 |
423 | 429 | } |