Commit a38d373 — Update week36.do.txt (1 parent a04208b)

1 file changed: doc/src/week36/week36.do.txt (153 additions, 0 deletions)
@@ -1552,10 +1552,163 @@ plt.xlabel(r'$x$')
plt.ylabel(r'$y$')
plt.title(r'Gradient descent example for Ridge')
plt.show()
!ec


!split
===== Ridge regression and a new synthetic dataset =====

We create a synthetic linear regression dataset with a sparse
underlying relationship. This means we have many features, but only a
few of them actually contribute to the target. In our example, we
use 10 features with only 3 non-zero weights in the true model. This
way, the target is generated as a linear combination of a few features
(with known coefficients) plus some random noise. The steps we include are:

o Decide on the number of samples and features (e.g. 100 samples, 10 features).
o Define the _true_ coefficient vector with mostly zeros (for sparsity). For example, we set $\hat{\bm{\theta}} = [5.0, -3.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]$, meaning only features 0, 1, and 6 have a real effect on $y$.
o Sample the feature values for $\bm{X}$ randomly (e.g. from a normal distribution). We use a normal distribution so that the features are roughly centered around 0.
o Compute the target values $y$ as the linear combination $\bm{X}\hat{\bm{\theta}}$ and add some noise (to simulate measurement error or unexplained variance).

Below is the code to generate the dataset:
!bc pycod
import numpy as np

# Set random seed for reproducibility
np.random.seed(0)

# Define dataset size
n_samples = 100
n_features = 10

# Define true coefficients (sparse linear relationship)
theta_true = np.array([5.0, -3.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])

# Generate feature matrix X (n_samples x n_features) with random values
X = np.random.randn(n_samples, n_features)  # standard normal distribution

# Generate target values y as a linear combination of X and theta_true, plus noise
noise = 0.5 * np.random.randn(n_samples)  # Gaussian noise
y = X @ theta_true + noise
!ec
This code produces a dataset where only features 0, 1, and 6
significantly influence $y$. The rest of the features have zero true
coefficient, so they only contribute noise. For example, feature 0 has
a true weight of 5.0, feature 1 has -3.0, and feature 6 has 2.0, so
the expected relationship is:
!bt
\[
y \approx 5 \times X_0 \;-\; 3 \times X_1 \;+\; 2 \times X_6 \;+\; \text{noise}.
\]
!et
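As a quick sanity check on this construction (a sketch that simply repeats the generating code above, with the same seed and sizes), an ordinary, unregularized least-squares fit should recover coefficients close to `theta_true`:

```python
import numpy as np

# Rebuild the dataset exactly as above (same seed, sizes and noise level)
np.random.seed(0)
n_samples, n_features = 100, 10
theta_true = np.array([5.0, -3.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
X = np.random.randn(n_samples, n_features)
y = X @ theta_true + 0.5 * np.random.randn(n_samples)

# An unregularized least-squares fit should land close to theta_true
theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(theta_ols, 1))
```

The three large coefficients should be clearly visible, while the remaining entries stay near zero.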
Before fitting a regression model, it is good practice to normalize or
standardize the features. This ensures all features are on a
comparable scale, which is especially important when using
regularization. Here we perform standardization, scaling each
feature to have mean 0 and standard deviation 1:

o Compute the mean and standard deviation of each column (feature) in $\bm{X}$.
o Subtract the mean and divide by the standard deviation for each feature.

We also center the target $\bm{y}$ to mean $0$. Centering $\bm{y}$ (and each feature) means the model will not require a separate intercept term; the data is shifted such that the intercept is effectively 0. (In practice, one could include an intercept in the model and not penalize it, but here we simplify by centering.)

!bc pycod
# Standardize features (zero mean, unit variance for each feature)
X_mean = X.mean(axis=0)
X_std = X.std(axis=0)
X_std[X_std == 0] = 1  # safeguard to avoid division by zero for constant features
X_norm = (X - X_mean) / X_std

# Center the target to zero mean (optional, to simplify intercept handling)
y_mean = y.mean()
y_centered = y - y_mean
!ec

After this preprocessing, each column of `X_norm` has mean zero and standard deviation $1$,
and `y_centered` has mean 0. This makes the optimization landscape
nicer and ensures the regularization penalty $\lambda \sum_j
\theta_j^2$ treats each coefficient fairly (since the features are on the
same scale).

!bc pycod
# Set regularization parameter
lam = 1.0

# Closed-form Ridge solution: theta = (X^T X + lam * I)^{-1} X^T y
I = np.eye(n_features)
theta_closed_form = np.linalg.inv(X_norm.T @ X_norm + lam * I) @ X_norm.T @ y_centered

print("Closed-form Ridge coefficients:", theta_closed_form)
!ec

This computes the ridge regression coefficients directly. The identity
matrix $I$ has the same size as $X^T X$ (which is `n_features` x
`n_features`), and `lam * I` adds $\lambda$ to the diagonal of $X^T X$. We
then invert this matrix and multiply by $X^T y$. The result
for $\bm{\theta}$ is a NumPy array of shape `(n_features,)` containing the
fitted weights.
Alternatively, we can fit the ridge regression model using gradient descent. This is useful for visualizing the iterative convergence and is necessary if $n$ and $p$ are so large that the closed-form solution would be too slow or memory-intensive. We derive the gradient from the cost function defined above. With the scaled cost

!bt
\[
C(\bm{\theta}) = \frac{1}{2m}\left(\bm{X}\bm{\theta} - \bm{y}\right)^T\left(\bm{X}\bm{\theta} - \bm{y}\right) + \frac{\lambda}{2m}\bm{\theta}^T\bm{\theta},
\]
!et

the gradient with respect to the weight vector $\bm{\theta}$ is

!bt
\[
\nabla_{\bm{\theta}} C = \frac{1}{m}\left(\bm{X}^T(\bm{X}\bm{\theta} - \bm{y}) + \lambda \bm{\theta}\right).
\]
!et

Below is the code for the gradient descent implementation of ridge:
!bc pycod
# Gradient descent parameters
alpha = 0.1
num_iters = 1000

# Initialize weights for gradient descent
theta = np.zeros(n_features)

# Array to store cost history for plotting
cost_history = np.zeros(num_iters)

# Gradient descent loop
m = n_samples  # number of examples
for t in range(num_iters):
    # Compute prediction error
    error = X_norm.dot(theta) - y_centered  # shape (m,)
    # Compute cost (MSE + regularization) for monitoring
    cost = (1/(2*m)) * np.dot(error, error) + (lam/(2*m)) * np.dot(theta, theta)
    cost_history[t] = cost
    # Compute gradient
    grad = (1/m) * (X_norm.T.dot(error) + lam * theta)
    # Update weights
    theta = theta - alpha * grad

# After the loop, theta contains the fitted coefficients
theta_gd = theta
print("Gradient Descent Ridge coefficients:", theta_gd)
!ec
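As a check on the convergence behaviour (a standalone sketch that repeats the setup above): for this convex quadratic cost, a step size below $1/L$, where $L$ is the largest eigenvalue of the Hessian $(\bm{X}^T\bm{X} + \lambda I)/m$, guarantees that the cost decreases monotonically, which holds here for `alpha = 0.1` and can be verified from `cost_history`:

```python
import numpy as np

# Reproduce data, preprocessing and the gradient descent loop from above
np.random.seed(0)
n_samples, n_features = 100, 10
theta_true = np.array([5.0, -3.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
X = np.random.randn(n_samples, n_features)
y = X @ theta_true + 0.5 * np.random.randn(n_samples)
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
y_centered = y - y.mean()

lam, alpha, num_iters, m = 1.0, 0.1, 1000, n_samples
theta = np.zeros(n_features)
cost_history = np.zeros(num_iters)
for t in range(num_iters):
    error = X_norm @ theta - y_centered
    cost_history[t] = (1/(2*m)) * error @ error + (lam/(2*m)) * theta @ theta
    theta -= alpha * (1/m) * (X_norm.T @ error + lam * theta)

# With a sufficiently small step size, the cost never increases
print("initial cost:", cost_history[0], "final cost:", cost_history[-1])
```

If the step size were pushed above $2/L$, the iterates would instead diverge and the cost history would blow up.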

Let us confirm that the two approaches (closed-form and gradient descent) give similar results, and then evaluate the model. First, we compare the learned coefficients with the true coefficients:

!bc pycod
print("True coefficients:", theta_true)
print("Closed-form learned coefficients:", theta_closed_form)
print("Gradient descent learned coefficients:", theta_gd)
!ec

If everything worked correctly, the learned coefficients should be
close to the true values [5.0, -3.0, 0.0, …, 2.0, …] that we used to
generate the data. Keep in mind that due to regularization and noise,
the learned values will not exactly equal the true ones, but they
should be in the same ballpark.
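The comparison can also be made quantitative. The sketch below repeats the full pipeline (so it runs standalone), reports the maximum coefficient difference between the two solvers, and evaluates the fit through the mean squared error of the predictions:

```python
import numpy as np

# Rebuild data, preprocessing, and both fits from the sections above
np.random.seed(0)
n_samples, n_features = 100, 10
theta_true = np.array([5.0, -3.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
X = np.random.randn(n_samples, n_features)
y = X @ theta_true + 0.5 * np.random.randn(n_samples)
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
y_mean = y.mean()
y_centered = y - y_mean

# Closed-form ridge via the normal equations
lam = 1.0
theta_cf = np.linalg.solve(X_norm.T @ X_norm + lam * np.eye(n_features),
                           X_norm.T @ y_centered)

# Gradient descent ridge, as in the loop above
theta = np.zeros(n_features)
m = n_samples
for _ in range(1000):
    error = X_norm @ theta - y_centered
    theta -= 0.1 * (1/m) * (X_norm.T @ error + lam * theta)

# The two solutions should agree to high precision
print("max |theta_gd - theta_cf|:", np.max(np.abs(theta - theta_cf)))

# Evaluate: mean squared error of the (uncentered) predictions
y_pred = X_norm @ theta_cf + y_mean
mse = np.mean((y - y_pred)**2)
print("MSE:", mse)
```

With the noise level used here (standard deviation 0.5), the MSE should land near the noise variance of 0.25, which is the best any linear model can do on this data.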
!split
===== Using gradient descent methods, limitations =====