02-Regression.qmd
    import numpy as np
    from sklearn.model_selection import train_test_split
    import matplotlib.pyplot as plt
    from formulaic import model_matrix
    from sklearn import linear_model
Some possible solutions:

- We can detect an influential point by computing the studentized residuals or Cook's distance and decide whether it makes sense to remove it.
- We can use a different linear regression method, called Huber loss regression, that allows more tolerance for outliers.
### Predictors are not collinear
We recommend evaluating the assumptions of your linear regression model on the training set.
Let's go back and revisit the non-linearity of the response-predictor relationship. To deal with the slight curve in the residual plot, we can extend our model to accommodate non-linear relationships via **Polynomial Regression**.
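Concretely, polynomial regression just augments the design matrix with powers of a predictor and then fits an ordinary linear regression: the model stays linear in the coefficients. A minimal sketch with made-up $BMI$ values (hypothetical numbers, not the chapter's data):

```python
import pandas as pd

# Hypothetical BMI values, for illustration only
bmi = pd.Series([20.0, 25.0, 30.0], name="BMI")

# A degree-3 polynomial model only changes the design matrix;
# the fitting procedure is still ordinary least squares.
X = pd.DataFrame({
    "Intercept": 1.0,
    "BMI": bmi,
    "BMI^2": bmi ** 2,
    "BMI^3": bmi ** 3,
})
print(X)
```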
Here is what polynomial regression is capable of, visually:
According to our model, the relationship between $BMI$ and $MeanBloodPressure$ is a straight line with slope $\beta_1$, and the additional predictor $Gender$ will change our prediction by only a constant, $\beta_2$. Visually, that would look like two parallel lines, with $\beta_2$ dictating the distance between them. However, this plot suggests that our original model isn't quite right: the additional predictor $Gender$ changes our prediction by more than a constant, in a way that also depends on $BMI$.
When multiple predictors have a synergistic effect on the outcome, their effect on the outcome occurs jointly; this is called an **Interaction**. To incorporate this into our model, we add an interaction term:

$$
MeanBloodPressure = \beta_0 + \beta_1 \cdot BMI + \beta_2 \cdot Gender + \beta_3 \cdot BMI \cdot Gender
$$

Which creates unique, non-parallel lines depending on the value of $Gender$.
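To see what the interaction term buys us, here is a small simulation, a sketch only: the 0/1 gender coding, the coefficient values, and the noise level are made up, not the chapter's fitted model. When the true $BMI$ slope differs by gender, ordinary least squares with a product column recovers both slopes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 200
bmi = rng.uniform(18, 35, n)
gender = rng.integers(0, 2, n)  # hypothetical 0/1 dummy coding

# True model has an interaction: the BMI slope is 0.5 for gender 0
# and 0.5 + 1.5 = 2.0 for gender 1
y = 60 + 0.5 * bmi + 4 * gender + 1.5 * bmi * gender + rng.normal(0, 1, n)

# Design matrix: BMI, Gender, and their product
X = np.column_stack([bmi, gender, bmi * gender])
fit = LinearRegression().fit(X, y)
print(fit.coef_)  # roughly [0.5, 4.0, 1.5]
```

Without the third column the fit would be forced into two parallel lines; with it, each gender gets its own slope.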
## Overfitting
Last week, we discussed that as a model learns from data, it will learn to recognize patterns, and sometimes it will recognize patterns that are specific to that data and not reproducible anywhere else. This is called **Overfitting**, and it is why we constructed the **training** and **testing** datasets: to identify this phenomenon.

Let's look at overfitting in more detail: there can be different magnitudes of overfitting. The more flexible the model we employ, the higher the risk of overfitting, because flexible models will identify patterns too specific to the training data that do not generalize to the test data. For instance, linear regression is a fairly inflexible approach, because it just uses a straight line to model the data. However, if we use polynomial regression, especially with higher-degree polynomials, the model becomes more flexible, with a higher risk of overfitting.
Let's take a look at our first model again:

$$
MeanBloodPressure = \beta_0 + \beta_1 \cdot Age
$$
```{python}
# The body of this cell is partially elided in the diff; this is a
# minimal reconstruction (variable and column names are assumptions).
y, X = model_matrix("MeanBloodPressure ~ Age", nhanes_tiny)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
train_error = np.mean((model.predict(X_train) - np.ravel(y_train)) ** 2)
test_error = np.mean((model.predict(X_test) - np.ravel(y_test)) ** 2)
print(train_error, test_error)
```
It turns out that the relationship between Training Error and Testing Error is connected to the **Model Complexity**.

A model is more complex relative to another model if:

- It contains more predictors, which can be a result of using higher-order polynomials.
Let's look at what happens if we increase the flexibility of the model by fitting it with a degree 2 polynomial:
```{python}
p_degree = 2
# The rest of this cell is elided in the diff; a minimal reconstruction
# that fits a degree-2 polynomial of Age and plots the fitted curve
# (names are assumptions from the surrounding text).
coefs = np.polyfit(X_train["Age"], np.ravel(y_train), deg=p_degree)
grid = np.linspace(X_train["Age"].min(), X_train["Age"].max(), 100)
plt.scatter(X_train["Age"], y_train)
plt.plot(grid, np.polyval(coefs, grid))
plt.show()
```
As our Polynomial Degree increased, the following happened:
- In the linear model, we see that the Training Error is fairly high, and the Testing Error is even higher. This makes sense, as the model does not generalize as well to the testing set.
- As the degree increased, both training and testing error decreased slightly.
- After degree 6, we see that the Training Error continued to decrease, but the Testing Error blew up! This is an example of **Overfitting**, in which our model fitted the shape of the training set so well that it fails to generalize to the testing set at all.
We want to find a model that is "just right", one that neither underfits nor overfits the data. Usually, as the model becomes more flexible, the Training Error keeps decreasing, while the Testing Error decreases a bit before increasing. It seems that our ideal prediction model is around a polynomial of degree 5 or 6, where the Testing Error is minimal.
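The whole pattern can be reproduced on simulated data. This sketch assumes a sine-shaped true signal and uses `np.polyfit`, not the chapter's NHANES variables: sweep the polynomial degree, and the training error never increases while the testing error eventually turns back up.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 3, 20)        # small training set, easy to overfit
x_test = rng.uniform(0, 3, 200)
y_train = np.sin(x_train) + rng.normal(0, 0.3, x_train.size)
y_test = np.sin(x_test) + rng.normal(0, 0.3, x_test.size)

train_mse, test_mse = [], []
for degree in range(1, 11):
    coefs = np.polyfit(x_train, y_train, deg=degree)
    train_mse.append(np.mean((np.polyval(coefs, x_train) - y_train) ** 2))
    test_mse.append(np.mean((np.polyval(coefs, x_test) - y_test) ** 2))

print(np.round(train_mse, 3))  # non-increasing in the degree
print(np.round(test_mse, 3))   # typically dips, then rises at high degrees
```

Because the models are nested, the training error can only go down as the degree grows; only the testing error reveals which degree is "just right".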
### Another example
Here is another illustration of the phenomenon, using synthetic, controlled data:

The left panel shows the Training Data as black dots. Then, three models are displayed: linear regression (orange line) and two other models of increasing complexity (blue and green).
Hopefully you start to see the importance of examining the Testing Error instead of the Training Error.
## Appendix: Inference
For this course, we focus on prediction from our machine learning models. These models have an equally important use in statistical inference. This appendix gives a quick overview of what that is about.
### Population and Sample
The way we formulate a machine learning model is based on some fundamental concepts in inferential statistics. We will review these quickly in the context of our problem. Recall the following definitions:
Suppose that from fitting the model on the Training Set, $\beta_1=2$. That means that for every one-unit increase in $BMI$, the model's predicted $BloodPressure$ increases by 2 units.
Let's see this in practice:
```{python}
import statsmodels.api as sm

y, X = model_matrix("BloodPressure ~ BMI", nhanes_tiny)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
# The fitting step is elided in the diff; with statsmodels it would
# typically look like this:
fit = sm.OLS(y_train, X_train).fit()
fit.summary()
```
Based on the output, $\beta_0=69$ and $\beta_1=.55$. We also see the associated standard errors, p-values, and confidence intervals. It is necessary to report and interpret these because we derive the parameters from a sample of the data (the train or test set), so there are statistical uncertainties associated with them. For instance, with 95% confidence, the true population parameter falls between (.22, .87). Notice that we used a different package, `statsmodels`, to look at the model inference.