Commit a38d373 — Update week36.do.txt (1 parent a04208b)

1 file changed: doc/src/week36/week36.do.txt (153 additions, 0 deletions)
@@ -1552,10 +1552,163 @@ plt.xlabel(r'$x$')
plt.ylabel(r'$y$')
plt.title(r'Gradient descent example for Ridge')
plt.show()
!ec


!split
===== Ridge regression and a new synthetic dataset =====

We create a synthetic linear regression dataset with a sparse
underlying relationship. This means we have many features, but only a
few of them actually contribute to the target. In our example, we
use 10 features with only 3 non-zero weights in the true model. This
way, the target is generated as a linear combination of a few features
(with known coefficients) plus some random noise. The steps we include are:

o Decide on the number of samples and features (e.g. 100 samples, 10 features).
o Define the _true_ coefficient vector with mostly zeros (for sparsity). For example, we set $\hat{\bm{\theta}} = [5.0, -3.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]$, meaning only features 0, 1, and 6 have a real effect on $y$.
o Sample the feature values for $\bm{X}$ randomly (e.g. from a normal distribution). We use a normal distribution so that the features are roughly centered around 0.
o Compute the target values $y$ as the linear combination $\bm{X}\hat{\bm{\theta}}$ and add some noise (to simulate measurement error or unexplained variance).

Below is the code to generate the dataset:
!bc pycod
import numpy as np

# Set random seed for reproducibility
np.random.seed(0)

# Define dataset size
n_samples = 100
n_features = 10

# Define true coefficients (sparse linear relationship)
theta_true = np.array([5.0, -3.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])

# Generate feature matrix X (n_samples x n_features) with random values
X = np.random.randn(n_samples, n_features)  # standard normal distribution

# Generate target values y as a linear combination of X and theta_true, plus noise
noise = 0.5 * np.random.randn(n_samples)  # Gaussian noise
y = X @ theta_true + noise
!ec
This code produces a dataset where only features 0, 1, and 6
significantly influence $y$. The rest of the features have zero true
coefficient, so they only contribute noise. For example, feature 0 has
a true weight of 5.0, feature 1 has -3.0, and feature 6 has 2.0, so
the expected relationship is:
!bt
\[
y \approx 5 \times X_0 \;-\; 3 \times X_1 \;+\; 2 \times X_6 \;+\; \text{noise}.
\]
!et
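As a quick sanity check on this construction (a sketch that simply repeats the generating code above, with the same seed and sizes), an ordinary, unregularized least-squares fit should recover coefficients close to `theta_true`:

```python
import numpy as np

# Rebuild the dataset exactly as above (same seed, sizes and noise level)
np.random.seed(0)
n_samples, n_features = 100, 10
theta_true = np.array([5.0, -3.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
X = np.random.randn(n_samples, n_features)
y = X @ theta_true + 0.5 * np.random.randn(n_samples)

# An unregularized least-squares fit should land close to theta_true
theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(theta_ols, 1))
```

The three large coefficients should be clearly visible, while the remaining entries stay near zero.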
Before fitting a regression model, it is good practice to normalize or
standardize the features. This ensures all features are on a
comparable scale, which is especially important when using
regularization. Here we perform standardization, scaling each
feature to have mean 0 and standard deviation 1:

o Compute the mean and standard deviation of each column (feature) in $\bm{X}$.
o Subtract the mean and divide by the standard deviation for each feature.

We also center the target $\bm{y}$ to mean $0$. Centering $\bm{y}$ (and each feature) means the model will not require a separate intercept term; the data is shifted such that the intercept is effectively 0. (In practice, one could include an intercept in the model and not penalize it, but here we simplify by centering.)

!bc pycod
# Standardize features (zero mean, unit variance for each feature)
X_mean = X.mean(axis=0)
X_std = X.std(axis=0)
X_std[X_std == 0] = 1  # safeguard to avoid division by zero for constant features
X_norm = (X - X_mean) / X_std

# Center the target to zero mean (optional, to simplify intercept handling)
y_mean = y.mean()
y_centered = y - y_mean
!ec

After this preprocessing, each column of `X_norm` has mean zero and standard deviation $1$,
and `y_centered` has mean 0. This makes the optimization landscape
nicer and ensures the regularization penalty $\lambda \sum_j
\theta_j^2$ treats each coefficient fairly (since the features are on the
same scale).

!bc pycod
# Set regularization parameter
lam = 1.0

# Closed-form Ridge solution: theta = (X^T X + lam * I)^{-1} X^T y
I = np.eye(n_features)
theta_closed_form = np.linalg.inv(X_norm.T @ X_norm + lam * I) @ X_norm.T @ y_centered

print("Closed-form Ridge coefficients:", theta_closed_form)
!ec

This computes the ridge regression coefficients directly. The identity
matrix $I$ has the same size as $X^T X$ (which is `n_features` x
`n_features`), and `lam * I` adds $\lambda$ to the diagonal of $X^T X$. We
then invert this matrix and multiply by $X^T y$. The result
for $\bm{\theta}$ is a NumPy array of shape `(n_features,)` containing the
fitted weights.
Alternatively, we can fit the ridge regression model using gradient descent. This is useful for visualizing the iterative convergence and is necessary if $n$ and $p$ are so large that the closed-form solution would be too slow or memory-intensive. We derive the gradient from the cost function defined above. With the scaled cost

!bt
\[
C(\bm{\theta}) = \frac{1}{2m}\left(\bm{X}\bm{\theta} - \bm{y}\right)^T\left(\bm{X}\bm{\theta} - \bm{y}\right) + \frac{\lambda}{2m}\bm{\theta}^T\bm{\theta},
\]
!et

the gradient with respect to the weight vector $\bm{\theta}$ is

!bt
\[
\nabla_{\bm{\theta}} C = \frac{1}{m}\left(\bm{X}^T(\bm{X}\bm{\theta} - \bm{y}) + \lambda \bm{\theta}\right).
\]
!et

Below is the code for the gradient descent implementation of ridge:
!bc pycod
# Gradient descent parameters
alpha = 0.1
num_iters = 1000

# Initialize weights for gradient descent
theta = np.zeros(n_features)

# Array to store cost history for plotting
cost_history = np.zeros(num_iters)

# Gradient descent loop
m = n_samples  # number of examples
for t in range(num_iters):
    # Compute prediction error
    error = X_norm.dot(theta) - y_centered  # shape (m,)
    # Compute cost (MSE + regularization) for monitoring
    cost = (1/(2*m)) * np.dot(error, error) + (lam/(2*m)) * np.dot(theta, theta)
    cost_history[t] = cost
    # Compute gradient
    grad = (1/m) * (X_norm.T.dot(error) + lam * theta)
    # Update weights
    theta = theta - alpha * grad

# After the loop, theta contains the fitted coefficients
theta_gd = theta
print("Gradient Descent Ridge coefficients:", theta_gd)
!ec
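As a check on the convergence behaviour (a standalone sketch that repeats the setup above): for this convex quadratic cost, a step size below $1/L$, where $L$ is the largest eigenvalue of the Hessian $(\bm{X}^T\bm{X} + \lambda I)/m$, guarantees that the cost decreases monotonically, which holds here for `alpha = 0.1` and can be verified from `cost_history`:

```python
import numpy as np

# Reproduce data, preprocessing and the gradient descent loop from above
np.random.seed(0)
n_samples, n_features = 100, 10
theta_true = np.array([5.0, -3.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
X = np.random.randn(n_samples, n_features)
y = X @ theta_true + 0.5 * np.random.randn(n_samples)
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
y_centered = y - y.mean()

lam, alpha, num_iters, m = 1.0, 0.1, 1000, n_samples
theta = np.zeros(n_features)
cost_history = np.zeros(num_iters)
for t in range(num_iters):
    error = X_norm @ theta - y_centered
    cost_history[t] = (1/(2*m)) * error @ error + (lam/(2*m)) * theta @ theta
    theta -= alpha * (1/m) * (X_norm.T @ error + lam * theta)

# With a sufficiently small step size, the cost never increases
print("initial cost:", cost_history[0], "final cost:", cost_history[-1])
```

If the step size were pushed above $2/L$, the iterates would instead diverge and the cost history would blow up.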

Let us confirm that the two approaches (closed-form and gradient descent) give similar results, and then evaluate the model. First, we compare the learned coefficients with the true coefficients:

!bc pycod
print("True coefficients:", theta_true)
print("Closed-form learned coefficients:", theta_closed_form)
print("Gradient descent learned coefficients:", theta_gd)
!ec

If everything worked correctly, the learned coefficients should be
close to the true values [5.0, -3.0, 0.0, …, 2.0, …] that we used to
generate the data. Keep in mind that due to regularization and noise,
the learned values will not exactly equal the true ones, but they
should be in the same ballpark.
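The comparison can also be made quantitative. The sketch below repeats the full pipeline (so it runs standalone), reports the maximum coefficient difference between the two solvers, and evaluates the fit through the mean squared error of the predictions:

```python
import numpy as np

# Rebuild data, preprocessing, and both fits from the sections above
np.random.seed(0)
n_samples, n_features = 100, 10
theta_true = np.array([5.0, -3.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
X = np.random.randn(n_samples, n_features)
y = X @ theta_true + 0.5 * np.random.randn(n_samples)
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
y_mean = y.mean()
y_centered = y - y_mean

# Closed-form ridge via the normal equations
lam = 1.0
theta_cf = np.linalg.solve(X_norm.T @ X_norm + lam * np.eye(n_features),
                           X_norm.T @ y_centered)

# Gradient descent ridge, as in the loop above
theta = np.zeros(n_features)
m = n_samples
for _ in range(1000):
    error = X_norm @ theta - y_centered
    theta -= 0.1 * (1/m) * (X_norm.T @ error + lam * theta)

# The two solutions should agree to high precision
print("max |theta_gd - theta_cf|:", np.max(np.abs(theta - theta_cf)))

# Evaluate: mean squared error of the (uncentered) predictions
y_pred = X_norm @ theta_cf + y_mean
mse = np.mean((y - y_pred)**2)
print("MSE:", mse)
```

With the noise level used here (standard deviation 0.5), the MSE should land near the noise variance of 0.25, which is the best any linear model can do on this data.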
!split
===== Using gradient descent methods, limitations =====