Commit 08c1dc0

Update 02-Regression.qmd
1 parent a72a0fe commit 08c1dc0

1 file changed

Lines changed: 25 additions & 31 deletions

@@ -23,7 +23,6 @@ import numpy as np
 from sklearn.model_selection import train_test_split
 import matplotlib.pyplot as plt
 from formulaic import model_matrix
-import statsmodels.api as sm
 from sklearn import linear_model
@@ -91,7 +90,7 @@ Some possible solutions:
 
 - We can detect an influential point by computing the studentized residuals or Cook's distance and decide whether it makes sense to remove it.
 
-- We can use a different linear regression method, called Huber loss regerssion, that allows more tolerance for outliers.
+- We can use a different linear regression method, called Huber loss regression, that allows more tolerance for outliers.
 
 ### Predictors are not collinear
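The Huber loss regression named in the corrected bullet can be sketched with scikit-learn's `HuberRegressor`. The toy data, the injected outliers, and the `epsilon` value below are illustrative choices, not from the course materials:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 1))
y = 2.0 * X.ravel() + 1.0 + rng.normal(0, 0.5, size=60)
y[:5] += 40  # inject a handful of extreme outliers

ols = LinearRegression().fit(X, y)
huber = HuberRegressor(epsilon=1.35).fit(X, y)  # smaller epsilon = more outlier tolerance

# Compare how far each slope estimate lands from the true slope of 2.
print("OLS slope:", ols.coef_[0], "Huber slope:", huber.coef_[0])
```

Because the Huber loss grows linearly (rather than quadratically) beyond `epsilon`, the few large residuals contribute far less to the fit.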

@@ -143,6 +142,10 @@ We recommend evaluate the assumptions of your linear regression model on the tra
 
 Let's revisit the non-linearity of the response-predictor relationship. To deal with the slight curve in the residual plot, we can extend our model to accommodate non-linear relationships via **Polynomial Regression**.
 
+Here is what polynomial regression is capable of, visually:
+
+![Source: https://madrury.github.io/jekyll/update/statistics/2017/08/04/basis-expansions.html](http://madrury.github.io/img/polynomial-various-degrees.png){width="700"}
+
 Recall our original equation:
 
 $$
@@ -206,7 +209,7 @@ $$
 MeanBloodPressure = \beta_0 + \beta_1 \cdot BMI + \beta_2 \cdot Gender
 $$
 
-According to our model, the relationship between $BMI$ and $MeanBloodPressure$ is a linear line with slope $\beta_1$, and the additional predictor of $Gender$ will change our prediction by only a constant, $\beta_2$. Visually, that would look like two parallel lines, with $\beta_2$ dictating the distance between parallel lines. However, this plot suggests that our original model isn't quite right: the additional predictor of Gender changes our prediction by more than a constant - it is dependent on $BMI$ also.
+According to our model, the relationship between $BMI$ and $MeanBloodPressure$ is a straight line with slope $\beta_1$, and the additional predictor $Gender$ changes our prediction by only a constant, $\beta_2$. Visually, that would look like two parallel lines, with $\beta_2$ dictating the distance between them. However, this plot suggests that our original model isn't quite right: the additional predictor $Gender$ changes our prediction by more than a constant - it also depends on $BMI$.
 
 When multiple predictors have a synergistic effect on the outcome, their effect occurs jointly - this is called an **Interaction**. To incorporate this into our model, we add an interaction term:
@@ -234,35 +237,24 @@ Which creates unique, non-parallel lines depending on the value of $Gender$.
 
 ## Overfitting
 
-Let's take a look at our model again:
+Last week, we discussed that as a model learns from data, it recognizes patterns, and sometimes it recognizes patterns that are specific to that particular data and not reproducible anywhere else. This is called **Overfitting**, and it is why we constructed the **training** and **testing** datasets: to identify this phenomenon.
+
+Let's look at overfitting in more detail: there can be different magnitudes of overfitting. The more flexible the model we employ, the higher the risk of overfitting, because flexible models can identify patterns too specific to the training data that do not generalize to the test data. For instance, linear regression is a fairly inflexible approach, because it just uses a straight line to model the data. However, if we use polynomial regression, especially with higher-degree polynomials, the model becomes more flexible, with a higher risk of overfitting.
+
+Let's take a look at our first model again:
 
 $$
 MeanBloodPressure = \beta_0 + \beta_1 \cdot Age
 $$
 
 ```{python}
-import pandas as pd
-import seaborn as sns
-import numpy as np
-from sklearn.model_selection import train_test_split
-from sklearn.metrics import (confusion_matrix, accuracy_score)
-import matplotlib.pyplot as plt
-from formulaic import model_matrix
-import statsmodels.api as sm
-
-nhanes = pd.read_csv("classroom_data/NHANES.csv")
-nhanes.drop_duplicates()
-
-nhanes['BloodPressure'] = nhanes['BPDiaAve'] + (nhanes['BPSysAve'] - nhanes['BPDiaAve']) / 3
-nhanes['Hypertension'] = (nhanes['BPDiaAve'] > 80) | (nhanes['BPSysAve'] > 130)
-
+# Use a small part of the data to illustrate overfitting.
 nhanes_tiny = nhanes.sample(n=300, random_state=1)
 
 y, X = model_matrix("BloodPressure ~ BMI", nhanes_tiny)
 
 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
-model = sm.OLS(y_train, X_train)
-linear_model = model.fit()
+linear_model = sm.OLS(y_train, X_train).fit()
 
 plt.clf()
 fig, (ax1, ax2) = plt.subplots(2, layout='constrained')
@@ -288,13 +280,7 @@ plt.show()
 
 We see that Training Error \< Testing Error.
 
-It turns out that the relationship between Training Error and Testing Error is connected to the **Model Complexity.**
-
-A model is more complex relative to another model if:
-
-- It contains more predictors, which can be a result of using higher-order polynomials.
-
-Let's look at what happens if we increase the complexity of the model by fitting it with a more smooth function. We use a polynomial function of order 2.
+Let's look at what happens if we increase the flexibility of the model by fitting it with a degree-2 polynomial:
 
 ```{python}
 p_degree = 2
@@ -411,12 +397,16 @@ plt.show()
 
 As our Polynomial Degree increased, the following happened:
 
-- In the linear model, we see that the Training Error is fairly high, and the Testing Error is even higher. This is an example of **Underfitting**, where our model failed to capture the complexity of the data in both the Training and Testing Set.
+- In the linear model, we see that the Training Error is fairly high, and the Testing Error is even higher. This makes sense, as the model does not generalize as well to the testing set.
+
+- As the degree increased, both training and testing error decreased slightly.
 
-- After degree 6, we see that the Training Error is low, but the Testing Error is huge! This is an example of **Overfitting**, in which our model fitted the shape of of the training set so well that it fails to generalize to the testing set.
+- After degree 6, we see that the Training Error continued to decrease, but the Testing Error blew up! This is an example of **Overfitting**, in which our model fit the shape of the training set so well that it fails to generalize to the testing set at all.
 
 We want to find a model that is "just right", one that neither underfits nor overfits the data. Usually, as the model becomes more flexible, the Training Error keeps decreasing, while the Testing Error decreases a bit before increasing again. It seems that our ideal prediction model is around a polynomial of degree 5 or 6, where the Testing Error is minimal.
 
+### Another example
+
 Here is another illustration of this phenomenon, using controlled synthetic data:
 
 On the left, the Training Data is shown as black dots. Three models are displayed: linear regression (orange line) and two other models of increasing flexibility in blue and green.
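A controlled synthetic comparison of this kind can be produced in a few lines. In the sketch below, the sine curve, noise level, and polynomial degrees are my illustrative choices, not necessarily those behind the figure:

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)  # the "true" relationship we control
x_train = rng.uniform(0, 1, 30)
x_test = rng.uniform(0, 1, 30)
y_train = true_f(x_train) + rng.normal(0, 0.3, 30)
y_test = true_f(x_test) + rng.normal(0, 0.3, 30)

# Fit polynomials of increasing flexibility and compare errors:
# training error keeps falling, testing error falls then rises.
for degree in (1, 4, 12):
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Because we know the true function, we can see directly that the extra flexibility past degree 4 is spent fitting noise.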
@@ -431,6 +421,8 @@ Hopefully you start to see the importance of examining the Testing Error instead
 
 ## Appendix: Inference
 
+For this course, we focus on prediction from our machine learning models. These models have an equally important use in statistical inference. This appendix gives a quick overview of what that entails.
+
 ### Population and Sample
 
 The way we formulate a machine learning model is based on some fundamental concepts in inferential statistics. We will review them quickly in the context of our problem. Recall the following definitions:
@@ -460,6 +452,8 @@ Suppose that from fitting the model on the Training Set, $\beta_1=2$. That means
 
 Let's see this in practice:
 
 ```{python}
+import statsmodels.api as sm
+
 y, X = model_matrix("BloodPressure ~ BMI", nhanes_tiny)
 
 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
@@ -468,7 +462,7 @@ linear_model = sm.OLS(y_train, X_train).fit()
 linear_model.summary()
 ```
 
-Based on the output, $\beta_0=69$, $\beta_1=.55$. We also see associated standard errors, p-values, and confidence intervals. This is necessarily to report and interpret because we derive these parameters based on a sample of the data (train or test set), so there are statistical uncertainties associated with them. For instance, the 95% confidence interval of true population parameter will fall between (.22, .87).
+Based on the output, $\beta_0=69$, $\beta_1=.55$. We also see the associated standard errors, p-values, and confidence intervals. It is necessary to report and interpret these because we derive the parameters from a sample of the data (the training set), so there is statistical uncertainty associated with them. For instance, the 95% confidence interval for the true population slope is (.22, .87). Notice that we used a different package, `statsmodels`, to examine the model's inference output.
 
 ####
