doc/src/week36/week36.do.txt (153 additions, 0 deletions)
@@ -1552,10 +1552,163 @@ plt.xlabel(r'$x$')
plt.ylabel(r'$y$')
plt.title(r'Gradient descent example for Ridge')
plt.show()
!ec


!split
===== Ridge regression and a new Synthetic Dataset =====


We create a synthetic linear regression dataset with a sparse
underlying relationship. This means we have many features but only a
few of them actually contribute to the target. In our example, we’ll
use 10 features with only 3 non-zero weights in the true model. This
way, the target is generated as a linear combination of a few features
(with known coefficients) plus some random noise. The steps we include are:

Decide on the number of samples and features (e.g. 100 samples, 10 features).
Define the _true_ coefficient vector with mostly zeros (for sparsity). For example, we set $\hat{\bm{\theta}} = [5.0, -3.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]$, meaning only features 0, 1, and 6 have a real effect on $y$.

Then we sample feature values for $\bm{X}$ randomly (e.g. from a normal distribution). We use a normal distribution so features are roughly centered around 0.
Then we compute the target values $y$ using the linear combination $\bm{X}\hat{\bm{\theta}}$ and add some noise (to simulate measurement error or unexplained variance).


Below is the code to generate the dataset:
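!bc pycod
import numpy as np

# Set random seed for reproducibility
np.random.seed(0)

# Define dataset size
n_samples = 100
n_features = 10

# Define true coefficients (sparse linear relationship)
# Values are those given in the text; the name theta_true is an assumption
theta_true = np.array([5.0, -3.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])

# Sample feature values from a standard normal distribution
X = np.random.randn(n_samples, n_features)

# Targets: linear combination plus Gaussian noise (the noise level 0.5 is an assumed value)
y = X @ theta_true + 0.5 * np.random.randn(n_samples)
!ec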

Before fitting a regression model, it’s good practice to normalize or
standardize the features. This ensures all features are on a
comparable scale, which is especially important when using
regularization. Here we will perform standardization, scaling each
feature to have mean 0 and standard deviation 1:

Compute the mean and standard deviation of each column (feature) in $\bm{X}$.
Subtract the mean and divide by the standard deviation for each feature.


We also center the target $\bm{y}$ to mean $0$. Centering $\bm{y}$ (and each feature) means the model does not require a separate intercept term: the data are shifted such that the fitted intercept is $0$. (In practice, one could include an intercept in the model and not penalize it, but here we simplify by centering.)

!bc pycod
# Standardize features (zero mean, unit variance for each feature)
X_mean = X.mean(axis=0)
X_std = X.std(axis=0)
X_std[X_std == 0] = 1  # safeguard to avoid division by zero for constant features
X_norm = (X - X_mean) / X_std

# Center the target to zero mean (optional, to simplify intercept handling)
y_mean = y.mean()
y_centered = y - y_mean
!ec

After this preprocessing, each column of $\bm{X}_{\mathrm{norm}}$ has mean zero and standard deviation $1$
and $\bm{y}_{\mathrm{centered}}$ has mean 0. This makes the optimization landscape
better conditioned and ensures the regularization penalty $\lambda \sum_j
\theta_j^2$ treats each coefficient fairly (since features are on the
same scale).
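
The claim above is easy to check numerically. The following is a minimal sketch on hypothetical stand-in data (random numbers with an arbitrary offset and scale, not the dataset from the text); the names `col_means` and `col_stds` are introduced here for illustration only.

```python
import numpy as np

# Hypothetical stand-in data with non-zero mean and non-unit scale
rng = np.random.default_rng(0)
X = 3.0 + 2.0 * rng.normal(size=(100, 10))
y = 1.5 + rng.normal(size=100)

# Same preprocessing as in the text
X_mean = X.mean(axis=0)
X_std = X.std(axis=0)
X_std[X_std == 0] = 1  # safeguard for constant features
X_norm = (X - X_mean) / X_std
y_centered = y - y.mean()

# Column-wise statistics after standardization
col_means = X_norm.mean(axis=0)
col_stds = X_norm.std(axis=0)

print("max |column mean|   :", np.abs(col_means).max())
print("max |column std - 1|:", np.abs(col_stds - 1.0).max())
print("|mean of y_centered|:", abs(y_centered.mean()))
```

All three printed numbers should be zero up to floating-point round-off.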
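

!bc pycod
# Set regularization parameter
lam = 1.0

# Closed-form Ridge solution: theta = (X^T X + lam * I)^{-1} X^T y
# (the name theta_closed is an assumption)
I = np.eye(n_features)
theta_closed = np.linalg.inv(X_norm.T @ X_norm + lam * I) @ (X_norm.T @ y_centered)
!ec
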
This computes the ridge regression coefficients directly. The identity
matrix $\bm{I}$ has the same size as $\bm{X}^T \bm{X}$ (which is n_features x
n_features), and lam * I adds $\lambda$ to the diagonal of $\bm{X}^T \bm{X}$. We
then invert this matrix and multiply by $\bm{X}^T \bm{y}$. The result
for $\bm{\theta}$ is a NumPy array of shape (n_features,) containing the
fitted weights.
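
Forming the inverse explicitly is not strictly required: solving the linear system $(\bm{X}^T\bm{X} + \lambda \bm{I})\bm{\theta} = \bm{X}^T\bm{y}$ yields the same coefficients and is usually the numerically preferable route. The sketch below compares the two on hypothetical random data (not the dataset from the text); the names `A`, `b`, `theta_inv` and `theta_solve` are assumptions made for this illustration.

```python
import numpy as np

# Hypothetical random data, used only to compare the two solution routes
rng = np.random.default_rng(1)
n_samples, n_features = 100, 10
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)
lam = 1.0

A = X.T @ X + lam * np.eye(n_features)  # shape (n_features, n_features)
b = X.T @ y                             # shape (n_features,)

theta_inv = np.linalg.inv(A) @ b        # explicit inverse, as described in the text
theta_solve = np.linalg.solve(A, b)     # solves A theta = b without forming the inverse

print("shape:", theta_solve.shape)
print("max difference:", np.abs(theta_inv - theta_solve).max())
```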



Alternatively, we can fit the ridge regression model using gradient descent. This is useful for visualizing the iterative convergence, and it becomes the practical choice when $n$ and $p$ are so large that the closed-form solution is too slow or memory-intensive. The update rule follows from the gradient of the ridge cost function defined above with respect to the weight vector $\bm{\theta}$.
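
As a worked step, assume the ridge cost has the form $C(\bm{\theta}) = \|\bm{y} - \bm{X}\bm{\theta}\|_2^2 + \lambda \|\bm{\theta}\|_2^2$. This form is an assumption here, chosen because its minimizer is the closed-form solution $(\bm{X}^T\bm{X} + \lambda\bm{I})^{-1}\bm{X}^T\bm{y}$ used above; if the cost is normalized differently, for example with a $1/n$ prefactor, the terms below change accordingly. The gradient is then

!bt
\[
\nabla_{\bm{\theta}} C(\bm{\theta}) = -2\bm{X}^T(\bm{y} - \bm{X}\bm{\theta}) + 2\lambda\bm{\theta}.
\]
!et

Setting the gradient to zero gives $(\bm{X}^T\bm{X} + \lambda\bm{I})\bm{\theta} = \bm{X}^T\bm{y}$, which recovers the closed-form expression. Gradient descent instead iterates $\bm{\theta} \leftarrow \bm{\theta} - \eta \nabla_{\bm{\theta}} C(\bm{\theta})$, where the learning rate $\eta$ is a symbol introduced here.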


Below is the code for a gradient descent implementation of ridge regression:
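
A minimal, self-contained sketch of such a loop follows. It assumes the cost $\|\bm{y} - \bm{X}\bm{\theta}\|_2^2 + \lambda\|\bm{\theta}\|_2^2$, regenerates a dataset as described earlier, and compares the result with the closed-form solution. The noise level `0.5`, the learning rate `eta`, the iteration count `n_iter`, and the names `theta_gd` and `theta_closed` are assumptions made for illustration.

```python
import numpy as np

# Regenerate a sparse synthetic dataset (noise level 0.5 is an assumed value)
rng = np.random.default_rng(0)
n_samples, n_features = 100, 10
theta_true = np.array([5.0, -3.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
X = rng.normal(size=(n_samples, n_features))
y = X @ theta_true + 0.5 * rng.normal(size=n_samples)

# Standardize features and center the target
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
y_centered = y - y.mean()

lam = 1.0      # regularization parameter
eta = 1e-3     # learning rate (assumed; must be small enough for stability)
n_iter = 1000  # number of iterations (assumed)

# Plain gradient descent starting from zero weights
theta_gd = np.zeros(n_features)
for _ in range(n_iter):
    residual = y_centered - X_norm @ theta_gd
    grad = -2.0 * X_norm.T @ residual + 2.0 * lam * theta_gd
    theta_gd -= eta * grad

# Closed-form reference solution for the same cost
theta_closed = np.linalg.solve(
    X_norm.T @ X_norm + lam * np.eye(n_features), X_norm.T @ y_centered
)

print("max |theta_gd - theta_closed|:", np.abs(theta_gd - theta_closed).max())
```

A fixed learning rate is the simplest choice; it has to stay below $2/L$, where $L$ is the largest eigenvalue of $2(\bm{X}^T\bm{X} + \lambda\bm{I})$, otherwise the iteration diverges.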
Let us confirm that the two approaches (closed-form and gradient descent) give similar results, and then evaluate the model. First, compare the learned coefficients to the true coefficients:
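
One caveat for this comparison: the model is fitted on standardized features, so the fitted weights refer to features measured in units of their standard deviation. The sketch below rescales them to the original feature scale before comparing with the true coefficients. It regenerates the data under the same assumptions as before (the noise level `0.5` is assumed), and the names `theta_orig`, `intercept`, `y_pred` and `mse` are introduced here for illustration.

```python
import numpy as np

# Regenerate a sparse synthetic dataset (noise level 0.5 is an assumed value)
rng = np.random.default_rng(0)
n_samples, n_features = 100, 10
theta_true = np.array([5.0, -3.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
X = rng.normal(size=(n_samples, n_features))
y = X @ theta_true + 0.5 * rng.normal(size=n_samples)

# Standardize features and center the target
X_mean, X_std = X.mean(axis=0), X.std(axis=0)
X_norm = (X - X_mean) / X_std
y_mean = y.mean()
y_centered = y - y_mean

# Closed-form ridge fit on the preprocessed data
lam = 1.0
theta_closed = np.linalg.solve(
    X_norm.T @ X_norm + lam * np.eye(n_features), X_norm.T @ y_centered
)

# Map the weights back to the original feature scale and recover the intercept
theta_orig = theta_closed / X_std
intercept = y_mean - X_mean @ theta_orig

for j in range(n_features):
    print(f"feature {j}: true = {theta_true[j]:5.2f}, ridge = {theta_orig[j]:6.3f}")

# Evaluate the fit on the training data
y_pred = X @ theta_orig + intercept
mse = np.mean((y - y_pred) ** 2)
print("training MSE:", mse)
```

The fitted weights for features 0, 1 and 6 should come out close to, but slightly shrunk relative to, the true values, since the ridge penalty pulls all coefficients toward zero.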