Skip to content

Commit 32eb7c8

Browse files
committed
in progress
1 parent d053928 commit 32eb7c8

2 files changed

Lines changed: 231 additions & 0 deletions

File tree

03-Classification.qmd

Lines changed: 230 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,230 @@
1+
# Classification
2+
3+
*In the third week of class, we will look at classification...*
4+
5+
Discrminiation vs calibration
6+
7+
Class distributions and priors
8+
9+
```{python}
10+
import pandas as pd
11+
import seaborn as sns
12+
import numpy as np
13+
from sklearn.model_selection import train_test_split
14+
from sklearn.metrics import mean_squared_error
15+
import matplotlib.pyplot as plt
16+
from formulaic import model_matrix
17+
from sklearn import linear_model
18+
import statsmodels.api as sm
19+
20+
nhanes = pd.read_csv("classroom_data/NHANES.csv")
21+
nhanes.drop_duplicates(inplace=True)
22+
nhanes['Hypertension'] = (nhanes['BPDiaAve'] > 80) | (nhanes['BPSysAve'] > 130)
23+
24+
nhanes['Hypertension2'] = nhanes['Hypertension'].replace({True: "Hypertension", False: "No Hypertension"})
25+
26+
#train test
27+
nhanes_train, nhanes_test = train_test_split(nhanes, test_size=0.2, random_state=42)
28+
29+
#class balance
30+
31+
nhanes_train['bins'] = pd.cut(nhanes_train['BMI'], bins=20)
32+
33+
34+
nhanes_train_binned = nhanes_train.groupby('bins')['Hypertension'].agg(['sum', 'count']).reset_index()
35+
nhanes_train_binned['p'] = nhanes_train_binned['sum'] / nhanes_train_binned['count']
36+
37+
nhanes_train_binned['log_odds'] = np.log(nhanes_train_binned['p'] / (1 - nhanes_train_binned['p']))
38+
nhanes_train_binned['bin_midpoint'] = nhanes_train_binned['bins'].apply(lambda x: x.mid)
39+
40+
#predictor vs probability
41+
plt.clf()
42+
plt.scatter(nhanes_train_binned['bin_midpoint'], nhanes_train_binned['p'], color='blue')
43+
plt.xlabel('BMI - Binned Midpoint')
44+
plt.ylabel('Empirical Hypertension Probability')
45+
plt.grid(True)
46+
plt.show()
47+
48+
49+
#predictor vs log odds
50+
plt.clf()
51+
plt.scatter(nhanes_train_binned['bin_midpoint'], nhanes_train_binned['log_odds'], color='blue')
52+
plt.xlabel('BMI - Binned Midpoint')
53+
plt.ylabel('Empirical Hypertension Log Odds')
54+
plt.grid(True)
55+
plt.show()
56+
57+
#wait, probability vs log odds?
58+
plt.clf()
59+
plt.scatter( nhanes_train_binned['log_odds'], nhanes_train_binned['p'], color='blue')
60+
plt.xlabel('Empirical Hypertension Log Odds')
61+
plt.ylabel('Empirical Hypertension Probability')
62+
plt.grid(True)
63+
plt.show()
64+
65+
plt.clf()
66+
ax = sns.boxplot(y="Hypertension2", x="BMI", data=nhanes_train)
67+
ax.set_ylabel('')
68+
plt.show()
69+
70+
71+
```
72+
73+
Now, let's build the model $P(Hypertension) = f(BMI)$ to make a prediction of $Hyptertension$ given $BMI$.
74+
75+
$P(Hypertension)=\beta_0+\beta_1 \cdot BMI$ does not give us outputs between 0 and 1.
76+
77+
$P(Hyptertension) = \frac{e^{\beta_0 + \beta_1X}}{1+e^{\beta_0 + \beta_1X}}$ does, however!
78+
79+
Let's look at this visually to understand.
80+
81+
```{python}
82+
y, X = model_matrix("Hypertension ~ BMI", nhanes)
83+
logit_model = sm.Logit(y, X).fit()
84+
85+
plt.clf()
86+
plt.scatter(X.BMI, logit_model.predict(), color="blue", label="Fitted Line")
87+
plt.scatter(X.BMI, y, alpha=.3, color="brown", label="Data")
88+
plt.xlabel('BMI')
89+
plt.ylabel('Probability of Hypertension')
90+
plt.legend()
91+
plt.show()
92+
```
93+
94+
This gets us to modeling the probability of the outcome (such as given a BMI of 30, there is a 20% chance the person has Hypertension), but ultimately we want a classification of Hyptertension or not.
95+
96+
A reasonable cutoff to start is 50%: if the probability of having Hypertension is \>=50%, then classify that person having Hypertension. Same for \< 50%. This is called the **Decision Boundary**.
97+
98+
```{python}
99+
plt.clf()
100+
plt.scatter(X.BMI, logit_model.predict(), color="blue", label="Fitted Line")
101+
plt.scatter(X.BMI, y, alpha=.3, color="brown", label="Data")
102+
plt.xlabel('BMI')
103+
plt.ylabel('Probability of Hypertension')
104+
plt.axhline(y=0.5, color='r', linestyle='--', label='Classification Cutoff')
105+
plt.legend();
106+
plt.show()
107+
```
108+
109+
Given this decision boundary, what is the accuracy?
110+
111+
```{python}
112+
113+
prediction_cut = [1 if x >= .5 else 0 for x in logit_model.predict()]
114+
print('Accuracy = ', accuracy_score(y, prediction_cut))
115+
```
116+
117+
Okay, that's a starting point!
118+
119+
We can break down classification accuracy to four additional results:
120+
121+
```{python}
122+
tn, fp, fn, tp = confusion_matrix(y, prediction_cut).ravel().tolist()
123+
print("True Positive:", tp, "\nFalse Positive: ", fp, "\nTrue Negative: ", tn, "\nFalse Negative:", fn)
124+
```
125+
126+
define tp, fp, tn, fn
127+
128+
define confusion matrix
129+
130+
```{python}
131+
cm = confusion_matrix(y, prediction_cut)
132+
print("Confusion Matrix : \n", cm)
133+
```
134+
135+
## Assumptions of logistic regression
136+
137+
### Linearity of log odds - predictor relationship
138+
139+
We can rewrite $P(Hyptertension) = \frac{e^{\beta_0 + \beta_1X}}{1+e^{\beta_0 + \beta_1X}}$ as $log(\frac{P(Hyptertension)}{1 - P(Hyptertension)}) = \beta_0 + \beta_1 \cdot BMI$
140+
141+
where the left hand side is called the **log odds** or the **logit**.
142+
143+
```{python}
144+
145+
```
146+
147+
### Predictors are not colinear
148+
149+
### No outliers
150+
151+
### Number of predictors is less than the number of samples
152+
153+
## Appendix: Inference for Logistic Regression
154+
155+
Let's do the same for our Logic Regression Classifier model, which has an equation of:
156+
157+
$$
158+
\frac{p(Hypertension)}{1-p(Hypertension)}=e^{\beta_0 + \beta_1 \cdot BMI}
159+
$$
160+
161+
On the left hand side of the equationis the **Odds** of having Hypertension.
162+
163+
$\beta_0$ is a parameter describing \_\_, and $\beta_1$ is a parameter describing \_\_\_
164+
165+
```{python}
166+
y, X = model_matrix("Hypertension ~ BMI", nhanes_tiny)
167+
168+
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
169+
logit_model = sm.Logit(y_train, X_train).fit()
170+
171+
logit_model.summary()
172+
```
173+
174+
##
175+
176+
```{python}
177+
import pandas as pd
178+
import statsmodels.api as sm
179+
import matplotlib.pyplot as plt
180+
import numpy as np
181+
182+
# 1. Example data (replace with your own data)
183+
data = {'X': np.random.rand(100) * 100, 'y': np.random.randint(0, 2, 100)}
184+
df = pd.DataFrame(data)
185+
186+
# To check the linearity assumption visually for an individual predictor,
187+
# a common method involves grouping the continuous predictor into bins
188+
# and calculating the empirical log-odds for each bin.
189+
190+
# Bin the predictor
191+
df['bins'] = pd.cut(df['X'], bins=10)
192+
193+
# Calculate proportion of '1's (p) and then empirical log-odds (ln(p/(1-p))) in each bin
194+
# Avoid bins where p is 0 or 1 to prevent undefined log-odds
195+
binned_data = df.groupby('bins')['y'].agg(['sum', 'count']).reset_index()
196+
binned_data['p'] = binned_data['sum'] / binned_data['count']
197+
# Filter out bins with 0 or 1 probability if needed for a "perfect" plot
198+
binned_data = binned_data[(binned_data['p'] > 0) & (binned_data['p'] < 1)]
199+
binned_data['log_odds'] = np.log(binned_data['p'] / (1 - binned_data['p']))
200+
binned_data['bin_midpoint'] = binned_data['bins'].apply(lambda x: x.mid)
201+
202+
# 2. Plotting the empirical log-odds
203+
plt.figure(figsize=(8, 5))
204+
plt.scatter(binned_data['bin_midpoint'], binned_data['log_odds'], color='blue')
205+
plt.xlabel('Predictor (X) - Binned Midpoint')
206+
plt.ylabel('Empirical Log Odds')
207+
plt.title('Empirical Log Odds vs. Predictor')
208+
plt.grid(True)
209+
plt.show()
210+
211+
# 3. Alternatively, plotting predicted log odds from a model for the linearity assumption check
212+
213+
# Add a constant to the predictor variable
214+
X = sm.add_constant(df['X'])
215+
# Fit a logistic regression model
216+
model = sm.Logit(df['y'], X).fit(disp=0) # disp=0 suppresses fit output
217+
218+
# Get predicted values in log-odds format (default type for Logit model predict)
219+
predicted_log_odds = model.predict(X)
220+
221+
# Plot predicted log odds (which should form a linear relationship if the model is correctly specified)
222+
plt.figure(figsize=(8, 5))
223+
plt.scatter(df['X'], predicted_log_odds, color='red', alpha=0.5)
224+
plt.xlabel('Predictor (X)')
225+
plt.ylabel('Predicted Log Odds (from model)')
226+
plt.title('Model Predicted Log Odds vs. Predictor')
227+
plt.grid(True)
228+
plt.show()
229+
230+
```

_quarto.yml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -17,6 +17,7 @@ book:
1717
- index.qmd
1818
- 01-Problem-Setup.qmd
1919
- 02-Regression.qmd
20+
- 03-Classification.qmd
2021
- conclusion.qmd
2122
- references.qmd
2223

0 commit comments

Comments
 (0)