# Problem Set-Up
*In the first week of class, we will go over what machine learning models are good for, and look at a classification model as an example to show the model development workflow. Then, we will get our hands on the NumPy package to prepare our data for our models for the rest of the course.*
## Classification model example
Suppose that we are given the [**N**ational **H**ealth **A**nd **N**utrition **E**xamination **S**urvey (NHANES) dataset](https://www.cdc.gov/nchs/nhanes/) and want to build a machine learning model to classify whether a person has hypertension (high blood pressure) based on clinical and demographic variables.
Using algebraic expressions, we formulate the following:
$$
Hypertension=f(Age, BMI)
$$
Here $f(Age, BMI)$ is a machine learning model that takes in the variables $Age$ and $BMI$, and makes a classification on whether someone has $Hypertension$.
A machine learning model, such as the one described above, has *two main uses:*
1. **Classification and Prediction (Focus of this course):** How accurately can we classify or predict the outcome?
- Classification: Given a new person's $Age, BMI$, classify whether the person has $Hypertension$. The outcome is a yes/no classification.
- Prediction: Given a person's $Age, BMI$, predict the person's $BloodPressure$ value. The outcome is a continuous value.
2. **Inference (Secondary in this course):** Which predictors are associated with the response, and how strong is the association?
- Classification model example: What is the odds ratio of $Age$ on $Hypertension$? If the odds ratio of $Age$ on $Hypertension$ is 2, then an increase of 1 in $Age$ multiplies the odds of $Hypertension$ by 2.
- Prediction model example: Suppose the model is described as $BloodPressure = f(Age,BMI)=20 + 3 \cdot Age - 0.2 \cdot BMI$. Each variable has a relationship to the outcome: an increase of $Age$ by 1, holding $BMI$ fixed, will lead to an increase of $BloodPressure$ by 3. This measures the strength of association between a variable and the outcome.
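To make the two inference examples concrete, here is a minimal sketch. The function name `blood_pressure`, the sample inputs, and the 20% baseline probability are assumptions made up for illustration; only the coefficients and the odds ratio of 2 come from the text above.

```python
def blood_pressure(age, bmi):
    """Prediction model from the text: 20 + 3*Age - 0.2*BMI."""
    return 20 + 3 * age - 0.2 * bmi

# Raising Age by 1 while holding BMI fixed changes the prediction
# by exactly the Age coefficient.
age_effect = blood_pressure(41, 25) - blood_pressure(40, 25)
print(round(age_effect, 6))

# Odds ratio example. Odds are defined as p / (1 - p).
baseline_prob = 0.20                                  # hypothetical starting probability
baseline_odds = baseline_prob / (1 - baseline_prob)   # 0.2 / 0.8
odds_ratio = 2                                        # value used in the text
new_odds = baseline_odds * odds_ratio                 # an odds ratio multiplies the odds
new_prob = new_odds / (1 + new_odds)                  # convert odds back to a probability
print(round(baseline_odds, 3), round(new_odds, 3), round(new_prob, 3))
```

Note that doubling the odds does not double the probability: in this made-up case the probability moves from 0.20 to about 0.33.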
Let's start with the easiest case, $Hypertension = f(BMI)$, which has a single predictor.
Before we fit models, we often visualize the data to get a sense of whether our setup makes sense.
```{python}
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import (confusion_matrix, accuracy_score)
import matplotlib.pyplot as plt
from formulaic import model_matrix
import statsmodels.api as sm
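# Load the NHANES data.
nhanes = pd.read_csv("classroom_data/NHANES.csv")
# drop_duplicates() returns a new DataFrame, so the result must be reassigned.
nhanes = nhanes.drop_duplicates()
# Flag Hypertension when average diastolic > 80 or average systolic > 130.
nhanes['Hypertension'] = (nhanes['BPDiaAve'] > 80) | (nhanes['BPSysAve'] > 130)
# Readable labels for plotting.
nhanes['Hypertension2'] = nhanes['Hypertension'].replace({True: "Hypertension", False: "No Hypertension"})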
plt.clf()
ax = sns.boxplot(y="Hypertension2", x="BMI", data=nhanes)
ax.set_ylabel('')
plt.show()
```
Okay, great, it looks like people with a higher BMI are more likely to have Hypertension.
Now, let's build the model $Hypertension = f(BMI)$ to make a prediction of $Hypertension$ given $BMI$.
```{python}
y, X = model_matrix("Hypertension ~ BMI", nhanes)
logit_model = sm.Logit(y, X).fit()
plt.clf()
plt.scatter(X.BMI, logit_model.predict(), color="blue", label="Fitted Line")
plt.scatter(X.BMI, y, alpha=.3, color="brown", label="Data")
plt.xlabel('BMI')
plt.ylabel('Probability of Hypertension')
plt.legend()
plt.show()
```
Instead of boxplots, we plotted the data just using points, with "Hypertension" having a probability of 1 and "No Hypertension" having a probability of 0. We see that we have a fitted line in blue for every value of BMI, which represents our machine learning model $f(BMI)$. This model is called **Logistic Regression**.
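The S-shaped fitted curve comes from the logistic function, which turns a linear score into a probability between 0 and 1. Here is a minimal sketch, assuming made-up values for `intercept` and `slope` (they are not the coefficients fitted on NHANES):

```python
import math

def logistic(z):
    """Map any real number to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))

# Hypothetical coefficients, chosen only for illustration.
intercept = -3.0
slope = 0.05

def f(bmi):
    """Logistic regression with one predictor: estimated P(Hypertension | BMI)."""
    return logistic(intercept + slope * bmi)

# With a positive slope, the probability rises with BMI but stays inside (0, 1).
for bmi in (20, 30, 40):
    print(bmi, round(f(bmi), 3))
```

For the model fitted above, the estimated intercept and slope can be inspected with `logit_model.params`.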
The first thing we want to investigate about this model is how well it performs in terms of Classification. Using only $BMI$ as a variable, what is the Accuracy of $f(BMI)$ in classifying whether a person has $Hypertension$? Notice first that $f(BMI)$ gives us continuous probability values: for example, given a BMI of 30, the model might say there is a 20% chance the person has Hypertension. To turn these probabilities into a yes/no classification, we need a discrete cutoff.
A reasonable cutoff to start with is 50%: if the probability of having Hypertension is \>= 50%, classify that person as having Hypertension; if it is \< 50%, classify that person as not having Hypertension. This cutoff is called the **Decision Boundary**.
```{python}
plt.clf()
plt.scatter(X.BMI, logit_model.predict(), color="blue", label="Fitted Line")
plt.scatter(X.BMI, y, alpha=.3, color="brown", label="Data")
plt.xlabel('BMI')
plt.ylabel('Probability of Hypertension')
plt.axhline(y=0.5, color='r', linestyle='--', label='Prediction Cutoff')
plt.legend()
plt.show()
```
Given this decision boundary, what is the accuracy?
```{python}
prediction_cut = [1 if x >= .5 else 0 for x in logit_model.predict()]
print('Accuracy = ', accuracy_score(y, prediction_cut))
```
Okay, that's a starting point!
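The cutoff and accuracy calculation can also be spelled out by hand. This is a minimal sketch on made-up probabilities and labels. The helper names `classify` and `accuracy` are hypothetical and are not part of scikit-learn.

```python
# Hypothetical predicted probabilities and true labels (1 = Hypertension).
probabilities = [0.10, 0.35, 0.50, 0.72, 0.64, 0.20]
true_labels = [0, 1, 1, 1, 0, 0]

def classify(probs, cutoff=0.5):
    """Label a probability as 1 when it is at or above the cutoff, else 0."""
    return [1 if p >= cutoff else 0 for p in probs]

def accuracy(truth, predicted):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(truth, predicted))
    return matches / len(truth)

predicted = classify(probabilities)
print(predicted)
print(accuracy(true_labels, predicted))
```

Passing a different `cutoff` to `classify()` moves the decision boundary, which generally changes the accuracy.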
We can break down classification accuracy into four counts:
```{python}
tn, fp, fn, tp = confusion_matrix(y, prediction_cut).ravel().tolist()
print("True Positive:", tp, "\nFalse Positive: ", fp, "\nTrue Negative: ", tn, "\nFalse Negative:", fn)
```
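- **True Positive (TP)**: the person has Hypertension, and the model classified them as having Hypertension.
- **False Positive (FP)**: the person does not have Hypertension, but the model classified them as having Hypertension.
- **True Negative (TN)**: the person does not have Hypertension, and the model classified them as not having Hypertension.
- **False Negative (FN)**: the person has Hypertension, but the model classified them as not having Hypertension.
A **confusion matrix** arranges these four counts in a 2-by-2 table. In the output of `confusion_matrix()`, rows are the true classes and columns are the predicted classes, so the layout is `[[TN, FP], [FN, TP]]`.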
```{python}
cm = confusion_matrix(y, prediction_cut)
print("Confusion Matrix : \n", cm)
```
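To see where the four counts come from, and how they turn into rates, here is a minimal sketch on made-up labels and predictions (not the NHANES results), using only plain Python:

```python
# Hypothetical true labels and model predictions (1 = Hypertension).
true_labels = [0, 1, 1, 1, 0, 0]
predicted = [0, 0, 1, 1, 1, 0]

pairs = list(zip(true_labels, predicted))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # has it, predicted yes
fp = sum(t == 0 and p == 1 for t, p in pairs)  # does not have it, predicted yes
tn = sum(t == 0 and p == 0 for t, p in pairs)  # does not have it, predicted no
fn = sum(t == 1 and p == 0 for t, p in pairs)  # has it, predicted no

# Same layout as scikit-learn's confusion_matrix: rows = truth, columns = prediction.
matrix = [[tn, fp], [fn, tp]]

accuracy = (tp + tn) / len(pairs)
true_positive_rate = tp / (tp + fn)   # share of actual positives that were caught
true_negative_rate = tn / (tn + fp)   # share of actual negatives that were cleared
print(matrix, accuracy, true_positive_rate, true_negative_rate)
```

The false positive rate and false negative rate are the complements: `1 - true_negative_rate` and `1 - true_positive_rate`.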
### Summary of Example
So what have we done so far? / Preview of what is to come:
1. Selected a predictor and binary outcome, and visualized it
- Eventually we will look at a continuous outcome, multiple predictors, and how to select multiple predictors
2. Fit it to a logistic regression model, which is a classification model
- Logistic regression is a type of linear model, and linear models are a building block for many machine learning models
3. We evaluated the model in terms of accuracy and the counts of true positives, true negatives, false positives, and false negatives
- We evaluated the model on the same data that we used to build the model. Ideally, we want to evaluate the model on data it has never seen before. More on this next week.
Before we race ahead... there are a lot of new Python data structures that we are working with in this course. So let's brush up on our data structures and how to make sense of the new ones coming our way!
## Review of Data Structures
We will be seeing *a lot* of different data structures in this course beyond DataFrames, Series, and Lists. So let's review how we think about data structures, to make our lives easier when we encounter new ones.
Let's review the List data structure. For any data structure, we ask the following:
- What does it contain (in terms of data)?
- What can it do (in terms of functions)?
And if it "makes sense" to us, then it is a well-designed data structure.
Formally, a data structure in Python (also known as an **Object**) may contain the following:
- **Value** that holds the essential data for the data structure.
- **Attributes** that hold subset or additional data for the data structure.
- Functions called **Methods** that belong to the data structure and *have to* take the object itself as an input, which is why they are called as `my_object.method()`.
Let's see how this applies to the **List**:
- **Value**: the contents of the list, such as `[2, 3, 4]`.
- **Attributes** that store additional values: Not relevant for lists.
- **Methods** that can be used on the object: [my_list.append(x)](https://docs.python.org/3/tutorial/datastructures.html)
How about **Dataframe**?
- **Value**: the 2-dimensional spreadsheet of the dataframe.
- **Attributes** that store additional values: `df.shape` gives the number of rows and columns. `df.my_col_name` accesses the column called "my_col_name".
- **Methods** that can be used on the object: [df.merge(other_df, on="column_name")](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)
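The value, attribute, and method ideas above can be tried out directly. Here is a minimal sketch on made-up data. The column names `id`, `BMI`, and `Age` and the variable names are illustrative assumptions.

```python
import pandas as pd

# List: the value is its contents; append() is a method that modifies it.
my_list = [2, 3, 4]
my_list.append(5)
print(my_list)

# Dataframe: the value is the 2-dimensional table.
df = pd.DataFrame({"id": [1, 2, 3], "BMI": [22.5, 31.0, 27.4]})
other_df = pd.DataFrame({"id": [1, 2, 3], "Age": [34, 58, 45]})

print(df.shape)   # attribute: (number of rows, number of columns)
print(df.BMI)     # attribute-style access to the column called "BMI"

# Method: merge() combines two Dataframes on a shared column.
merged = df.merge(other_df, on="id")
print(merged)
```

Attributes are read without parentheses (`df.shape`), while methods are called with parentheses (`df.merge(...)`).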
Feel free to look at the [cheatsheet on data structures from Intro to Python](https://docs.google.com/document/d/1IHD9_Edg3mbMY9lilAF0QWVjaKBK0eE0-hbo0fSywAA/edit?tab=t.0#heading=h.2bko76vfr8r6) to refresh yourself.
### NumPy
A new Data Structure we will work with in this course is NumPy's ndarray ("n-dimensional array") data structure. It is commonly referred to as a "**NumPy Array**". It is similar to a Dataframe, but has the following characteristics that suit building machine learning models:
- All elements share a single data type (homogeneous); in this course they will be numeric.
- There are no column or row names.
- Mathematical operations are optimized to be fast.
So, let's see some examples:
- **Value**: the 2-dimensional numerical table. It can actually have any number of dimensions, but we will only work with 1-dimensional (similar to a List) and 2-dimensional arrays.
- **Attributes** that store additional values:
- `data.shape` gives the shape of the NumPy Array. `data.ndim` gives the number of dimensions of the NumPy Array.
- **Subsetting**, similar to lists but in two dimensions:
- `data[:5, :3]` subsets for the first 5 rows and first 3 columns. `data[:5, [0, 2, 3]]` subsets for the first 5 rows and the 1st, 3rd, and 4th columns.
- **Methods** that can be used on the object:
- `data.sum(axis=0)` sums down the rows, giving one total per column. `data.sum(axis=1)` sums across the columns, giving one total per row.
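Here is a minimal sketch that exercises the pieces listed above on a small made-up array. The name `data` and its contents are illustrative.

```python
import numpy as np

# A 6-row, 4-column array holding the numbers 1 through 24.
data = np.arange(1, 25).reshape(6, 4)

print(data.shape)   # (rows, columns)
print(data.ndim)    # number of dimensions

# Two-dimensional subsetting: rows first, then columns.
first_block = data[:5, :3]        # first 5 rows, first 3 columns
picked = data[:5, [0, 2, 3]]      # first 5 rows, 1st, 3rd, and 4th columns

# axis=0 collapses the rows (one total per column);
# axis=1 collapses the columns (one total per row).
column_totals = data.sum(axis=0)
row_totals = data.sum(axis=1)
print(column_totals)
print(row_totals)
```

A quick way to check which axis is which: the result of `data.sum(axis=0)` has as many entries as there are columns, and `data.sum(axis=1)` has as many entries as there are rows.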
For this course, we often load a dataset in the Pandas Dataframe format, and then once we pick our outcome and predictors, we transform the Dataframe into the numeric matrices a model needs, such as in this line of code we saw earlier: `y, X = model_matrix("Hypertension ~ BMI", nhanes)`. We specify our outcome, predictor, and Dataframe for the `model_matrix()` function, and the outputs are two matrices, one for the outcome and one for the predictors. In the code above they still behave like Dataframes (we could write `X.BMI`), and they can be converted to NumPy Arrays when a model needs them. Any downstream Machine Learning modeling works off `y` and `X`.
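As an illustration of moving from a Dataframe to NumPy Arrays, here is a minimal sketch that uses plain pandas `.to_numpy()` on a made-up table. This is a swapped-in approach for illustration, not what `model_matrix()` does internally, and the `toy` Dataframe and its values are assumptions.

```python
import pandas as pd

# Hypothetical miniature dataset with one outcome and two predictors.
toy = pd.DataFrame({
    "Hypertension": [False, True, True, False],
    "BMI": [21.0, 33.5, 29.1, 24.8],
    "Age": [30, 61, 55, 42],
})

# Predictors: a 2-dimensional array with one row per person.
X = toy[["BMI", "Age"]].to_numpy()

# Outcome: a 1-dimensional array of 0/1 values.
y = toy["Hypertension"].to_numpy().astype(int)

print(type(X), X.shape, X.dtype)
print(type(y), y.shape, y)
```

Because a NumPy Array holds a single data type, the integer `Age` column is stored as floating point alongside `BMI`, and the column names are no longer attached.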
More introduction can be found on [NumPy's tutorial guide](https://numpy.org/devdocs/user/absolute_beginners.html).
### What is this data structure?
If you are not sure what your variable's data structure is, use the `type()` function, such as `type(mystery_data)`, and it will tell you.
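For example, here is a minimal sketch that checks three made-up objects, one for each data structure discussed in this section:

```python
import numpy as np
import pandas as pd

# Three hypothetical "mystery" variables.
mystery_values = [
    [2, 3, 4],
    np.array([2, 3, 4]),
    pd.DataFrame({"x": [2, 3, 4]}),
]

# type() returns the class of the object; __name__ is its short name.
type_names = [type(mystery_data).__name__ for mystery_data in mystery_values]
print(type_names)
```

Once you know the type, you know which attributes and methods are available and which documentation to look up.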
## Exercises
Exercises for week 1 can be [found here](https://colab.research.google.com/drive/1FmxlZoFAlSJczejg5x0nL4eM6UzfmINH?usp=sharing).