---
title: "Unbounded Count Regression"
format: html
---
When your outcome variable is an <i>unbounded count variable</i>, it has a lower bound of 0 and no upper bound. The lack of an upper bound comes from the nature of the variable: when a variable can, in principle, take on any non-negative count, and the counts are independent of one another, a <i>Poisson</i> or <i>Negative Binomial</i> regression is usually the way to go.
An unbounded count variable is easiest to understand and identify with an example. Imagine you are counting how many coins you pick up off the ground on your daily walk. On any given day you might collect many coins (with theoretically no maximum), and on some days you collect zero. We also assume that the number of coins you collect on one day does not affect the number you find on any other day. The count of coins is then well represented by a Poisson or Negative Binomial distribution, and we can use Poisson or Negative Binomial regression. At the end of the analysis, we can answer questions like "What is the average number of coins you would expect to pick up in a day?"
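To make this concrete, here is a minimal simulation sketch of the coin example (the rate of 2 coins per day is an arbitrary value chosen for illustration):
```{r}
## A sketch of the coin example: 365 independent days, true rate of 2 coins/day
set.seed(1)
coins <- rpois(n = 365, lambda = 2)
mean(coins) # close to 2, the expected number of coins per day
var(coins)  # for a Poisson variable, the variance is also close to 2
```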
## Count Events with a Poisson Distribution
The Poisson distribution is used when the outcome variable is a count variable; specifically, when the variable is discrete. Unlike other distributions (e.g., Gaussian, Binomial), the Poisson distribution is described by only one parameter: $\lambda$. The $\lambda$ parameter is the mean number of occurrences/events in a specific time interval. <b>Put another way, it's the average expected count of that variable.</b>
The important thing to know is that the goal of Poisson regression is to model the $\lambda$ parameter of a Poisson count distribution with a linear, additive combination of an intercept and the slope(s) of one or more predictor variables. But there is an inherent problem with modeling a count distribution with a (non-Poisson) linear model: the outcome variable is bounded on [0,$\infty$], and we already know that a linear model works best when the outcome is unbounded (and continuous).
## Defining the Poisson Regression Model and the Log Link Function
As mentioned in the previous section, the Poisson regression model takes into account that the outcome we are modeling is a count bounded on one side. Since a count variable is bounded on [0,$\infty$], we need a way to modify the linear model to work with this variable. We are motivated to do this because the linear model works best when the outcome it is modeling is unbounded (and continuous).
Let's summarize the details of this model in a more compact way:
$$ y_i \sim Poisson(\lambda)$$
Here we are saying that our outcome variable $y_i$ is Poisson distributed. This is not the part of the model that regression is specifically concerned with -- this is an assumption we are placing on the model knowing that the data is a bounded count variable between [0,$\infty$].
$$ g(\lambda) = \beta_0 + \beta_1 x_1$$
Here we are now defining how we use the simple Poisson regression model. We call it simple because there is only one predictor variable ($x_1$). The right side of the equation should be very familiar to us: it is a linear model where we have $\beta_0$ (the intercept), and $\beta_1$ (the slope of the predictor variable).
But the left side of this equation is unique to <b>generalized linear models</b>. First, we see that we are assigning $\beta_0$ and $\beta_1$ to estimate $\lambda$, which is the "rate" parameter of the Poisson distribution. However, we also see that $\lambda$ is inside a function called $g$, which we now define as our <i>link function</i>.
The link function will allow us to use the linear model (the right side of the equation that is usually assumed with a continuous and unbounded outcome) to estimate the rate, or average number of events, $\lambda$, which is discrete and bounded.
In Poisson regression, our $g$ is the logarithmic function, which is defined as:
$$ g(\lambda) = \log(\lambda) $$
or, equivalently,
$$ \lambda = e^{g(\lambda)} $$
Therefore, the full equation becomes,
$$ \log(\lambda) = \beta_0 + \beta_1 x_1 $$
When we use the log link function with linear predictors, we now can call this <i>"Poisson regression"</i>.
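As a small sketch of what the inverse link does, here is how hypothetical coefficient values (`b0` and `b1` are made up for illustration) turn linear-predictor values into expected counts:
```{r}
## Hypothetical coefficients, for illustration only
b0 <- 0.5
b1 <- 0.1
x1 <- 0:10
lambda_hat <- exp(b0 + b1 * x1) # inverse of the log link
lambda_hat # always positive, so the predictions respect the [0, Inf) bound
```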
::: {.content-visible when-format="html"}
::: {#fig-poisson}
```{=html}
<div align="center">
<iframe height="480" width="480" src="poisson_distribution.gif"></iframe>
</div>
```
Animation of the Poisson distribution for different values of $\lambda$. Notice how at smaller $\lambda$ values more of the probability mass sits near the bounded end, while at larger $\lambda$ values the Poisson distribution approximates a Normal distribution.
:::
:::
### Multiplicative Property
A property shared by the Poisson and Negative Binomial regression models is the "product property" of logarithmic functions. Specifically,
$$ \log_a(mn) = \log_a(m) + \log_a(n) $$
So when we interpret the estimates of this model, they are multiplicative effects. We will see in the Data Demonstration sections how this affects our interpretation of our parameters.
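To see the product property numerically, here is a one-line sketch with arbitrary coefficient values: adding on the log scale multiplies on the count scale.
```{r}
## Sums of log-scale effects become products of count-scale effects
all.equal(exp(0.5 + 0.1), exp(0.5) * exp(0.1)) # TRUE
```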
:::{.callout-tip title="Takeaway: Link Functions"}
When we use GLMs, we will be using many different link functions depending on the data. In the case of Poisson regression, all this function does is relate our linear predictors to a response that is discrete and bounded (between 0 and $\infty$). More generally, we need link functions because without them we would be using the linear predictors to predict a value as if it were continuous and unbounded when it may <b>not</b> be (like the discrete values of count variables). We also find that when we use the appropriate link function, other problems, like the shape of the residuals, often get fixed!
:::
## Overdispersion and Count Events with a Negative Binomial Distribution
While Poisson regression is used for unbounded count variables, Negative Binomial regression is used for the same kind of outcome. The choice between the two comes down to <b>overdispersion</b>. Overdispersion occurs when the variance of the count variable is greater than its mean. Overdispersion happens a lot in the real world, so we will always first fit a Poisson model, check for overdispersion, and default to Negative Binomial regression if it occurs.
It is worth emphasizing that the interpretations of parameter estimates are the same for Poisson and Negative Binomial regression, which makes these two models the key tools for unbounded count variables that are not ordinal.
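As a quick sketch of what overdispersion looks like, here is a simulated comparison (the values `mu = 4` and `size = 2` are arbitrary) of a Poisson and a Negative Binomial sample that share the same mean:
```{r}
set.seed(1)
pois <- rpois(1e5, lambda = 4)
nb <- rnbinom(1e5, mu = 4, size = 2) # size controls the extra dispersion
c(mean = mean(pois), var = var(pois)) # roughly 4 and 4: mean equals variance
c(mean = mean(nb), var = var(nb))     # roughly 4 and 12: var = mu + mu^2/size
```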
<b>This chapter will contain two Data Demonstrations in order to outline the thought process of choosing between a Poisson and Negative Binomial regression.</b>
## Data Demonstration #1: Poisson Regression
We will use the simulated toy dataset "fish_example.csv" to analyze and predict the relationship between the number of fish caught and predictor variables like age, sex, and years of experience.
- Data: <a href="./fish_example.csv" download> fish_example.csv </a>
### Loading the Packages and Reading in Data
```{r, warning = FALSE, message=FALSE}
## Load packages
library(tidyverse)
library(easystats)
library(ggeffects)
library(MASS) #needed for Negative Binomial Regression
## Read in data from file
fishing <- read_csv("fish_example.csv")
fishing
```
```{r, warning=FALSE,echo=FALSE}
mytheme <- theme_bw(
base_size = 12
)
```
This data shows records for 600 individual fisherpeople where each row is one fisher.
Here's a table of our variables in this dataset:
| Variable | Description | Values | Measurement |
|-------------|:--------------------------------|-----------:|:-----------:|
| `Fish`      | Number of Fish Caught           | Integer | Scale |
| `Age` | Age | Double | Scale |
| `Sex` | Female =0, Male = 1 | Integer | Nominal |
| `Experience` | Number of Years of Fisher Experience | Double | Scale |
### Prepare Data / Exploratory Data Analysis
For Poisson Regression, let's first treat the discrete variables in this dataset as factors and apply labels to make it easier to interpret.
```{r}
fishing$Sex <- factor(fishing$Sex, levels = c(0,1), labels = c("Female", "Male"))
```
Next, let's plot the distribution of the count of `Fish`. The main thing to consider is that this is a discrete variable and using Poisson regression inherently takes this into account.
```{r}
ggplot(fishing, aes(x = Fish)) +
geom_bar() + mytheme
```
Finally, let's consider a plot of the relationship between the count of fish and years of experience as a fisher.
```{r}
ggplot(data = fishing, mapping = aes(x = Fish, y = Experience)) +
geom_point()
```
As we can see, there seems to be a positive relationship between the count of fish and years of experience as a fisher.
### Models Without Interaction Terms
#### Less Correct: the General Linear Model
Before we fit the Poisson regression model, it would be worthwhile to try fitting the less appropriate regular general linear model. We'll use our old friend `lm()` to assess the relationship between the outcome variable `Fish` with the predictor variables `Sex` and `Experience` as a multiple regression example. Again, we know it is less appropriate because we are treating the outcome variable as a continuous variable rather than a count variable. For more information on general linear modeling, we covered these topics in our <b>Regression Recap</b> chapter.
```{r}
fitlm1 <- lm(
formula = Fish ~ Sex + Experience,
data = fishing
)
model_parameters(fitlm1) |> print_md()
```
As the results show, we have a significant partial effect of `Experience`, such that a 1 unit increase in `Experience` increases the number of fish by a small amount (0.24 fish) while controlling for `Sex`. It is worth noting that the regular `lm()` "worked" in that it was able to produce an estimate.
##### Checking Assumptions
When we fit a model, we always want to see if there were any violations in the assumptions. We will use the diagnostic function `check_model()` to see if there are any deviations from `lm()` assumptions. For this model, the most concerning assumption appears to be the normality of residuals, as the distribution is offset from the normal curve. Updating our model to a generalized linear model that takes into account our discrete outcome variable should help fix this problem.
```{r}
check_model(fitlm1, check = "normality")
check_model(fitlm1)
```
#### More Correct: Poisson Regression
Now let's compare a Poisson regression model of the same variables. <b>Our first step is to fit the Poisson model and check whether it is overdispersed. We will only proceed with the Poisson regression model if it is not.</b>
"Is there a relationship between sex, years of experience, and the amount of fishes caught?"
Again, because we have multiple predictors, multiple regression allows us to control for the effects of multiple predictors on the outcome, and we are left with the unique effect of a predictor when controlling for the other predictors. In this case, the outcome variable will be the count of `Fish`, and the predictor variables will be `Sex` (a categorical variable) and `Experience` (a continuous variable).
We did a lot of work with `lm()` in the "Regression Recap". The `glm()` function is in many ways the same as the `lm()` function. Even the syntax for how we write the formula is the same. In addition, we can use the same `model_parameters()` function to get our table of parameter estimates. Indeed, we will use the `glm()` function for the remainder of the online modules.
But notice that the `glm()` function has one more input argument: the `family` argument, which specifies which family of distributions to use and which link function to apply.
For Poisson regression, we supply the <b>poisson</b> distribution and the <b>log</b> link function.
```{r}
fit_poisson_1 <- glm(
formula = Fish ~ Sex + Experience,
family = poisson(link = "log"),
data = fishing
)
```
Now, before we consider any model parameters, we will check whether the model is overdispersed using `check_overdispersion()` from the `performance` package, which is included in `easystats`.
```{r}
check_overdispersion(fit_poisson_1)
```
Since no overdispersion was detected, the Poisson model is consistent with the data, and we can continue to use it for our unbounded count regression. <i>If the Poisson model had been found to be overdispersed, we would default to the Negative Binomial model (see Data Demonstration 2 in this module).</i>
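For intuition, a rough by-hand version of this kind of test statistic is the Pearson dispersion ratio, which should be near 1 when there is no overdispersion. This is a sketch of the idea, not necessarily the exact computation `check_overdispersion()` performs:
```{r}
## Pearson dispersion ratio: values near 1 suggest no overdispersion
sum(residuals(fit_poisson_1, type = "pearson")^2) / df.residual(fit_poisson_1)
```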
##### Parameter Table Interpretation: Log-Mean
Using the Poisson model, let's use the `model_parameters` function and interpret the estimates.
```{r}
model_parameters(fit_poisson_1) |> print_md()
```
The parameter table output from a model fit with `glm()` looks very similar to parameter table outputs from models fit with `lm()`. But there is a <b>big</b> difference -- in `glm()` models, we are no longer working in the same raw units!
This difference is apparent in the second column of the output table: "Log-Mean". Because we fit a Poisson regression with a log link function, the parameter values are in "log-mean" units.
Due to the nature of the dummy coded categorical variable `Sex`, we will assess the parameters from each level of that variable first.
> The intercept is the log mean of `Fish` for female fishers while accounting for the effect of `Experience`. In this case, female fishers with 0 years of fishing experience have a log mean of 0.55. This intercept value is significantly different from 0, p<0.001. It will make more sense once we exponentiate it: exp(0.55) = 1.73. Therefore, the baseline number of fish caught by female fishers with 0 years of experience is 1.73 fish.

> The log mean of `Fish` for male fishers while controlling for the effect of `Experience` is 0.07 more than for female fishers. This slope, which is the difference between male and female fishers, is not significantly different from a log mean of 0. Once again, this value makes more sense when we exponentiate it: exp(0.07) = 1.07, which is the multiplicative effect for male fishers compared to female fishers; in other words, male fishers with 0 years of fishing experience catch 1.07 times the expected number of fish of female fishers with 0 years of fishing experience. Note that 1.07 is greater than 1, so male fishers are expected to catch more fish than female fishers.

> As you increase `Experience` by 1 unit (or 1 year), the log mean for both male and female fishers increases by 0.09. This slope of experience is significantly different from 0, p<0.001. If we exponentiate this, we get exp(0.09) = 1.09, which is a 9% increase (because it is a multiplicative effect). We can interpret this as: for every 1 year increase in experience, there is a 9% increase in the number of fish caught.
Note that we used `exp()` a lot in order to make our log-mean estimates easier to understand. We could instead have printed the parameter table of `model_parameters()` with the argument `exponentiate = TRUE`:
```{r}
model_parameters(fit_poisson_1, exponentiate = TRUE) |> print_md()
```
When the log-mean values are exponentiated, they are called <b>"Incident Rate Ratios"</b> or IRRs. As aforementioned, they are multiplicative effects on the baseline value (in this case, the baseline group is female fishers with 0 years of experience). For interpretation purposes, we can consider the following guidelines (a small by-hand sketch follows the list):
- IRR > 1, the comparison group has a higher outcome value than the baseline.
- IRR < 1, the comparison group has a lower outcome value than the baseline.
- IRR = 1, there is little to no difference between comparison and baseline.
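If you want to see where the IRRs come from, one optional sketch is to exponentiate the raw coefficients (and their profile-likelihood confidence intervals) yourself:
```{r}
## IRRs by hand: exponentiate the log-mean coefficients
exp(coef(fit_poisson_1))
## ...and their profile-likelihood confidence intervals
exp(confint(fit_poisson_1))
```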
##### Checking Assumptions
With Poisson regression, we can see that the residuals, whose normality was violated in the regular multiple regression model, are now more in line with the model's assumptions.
```{r}
check_model(fit_poisson_1)
```
### Models with Interaction Terms
#### Less Correct: the General Linear Model
Now, let's consider a less appropriate linear model with an interaction term. Again, we'll use our old friend `lm()` this time to assess the relationship between the outcome variable `Fish` with the predictor variables `Sex` and `Experience` as a multiple regression example with an interaction term. Again, we know it is less appropriate because we are treating the outcome variable as a continuous variable rather than a count variable. For more information on general linear modeling, we covered these topics in our <b>Regression Recap</b> chapter.
```{r}
fitlm2 <- lm(
formula = Fish ~ Sex * Experience,
#equivalent formula: Fish ~ Sex + Experience + Sex:Experience
data = fishing
)
model_parameters(fitlm2) |> print_md()
```
As the results show, we have a significant partial effect of `Experience`, such that a 1 unit (or 1 year) increase in `Experience` increases the number of fish by a small amount (0.24 fish) while controlling for `Sex`. Our model here also includes an interaction term, which was not significant.
##### Checking Assumptions
Again, when we fit a model, we always want to see if there were any violations in the assumptions. We will use the diagnostic function `check_model()` to see if there are any deviations from `lm()` assumptions. For this model, the most concerning assumption appears to be the normality of residuals, as the distribution is offset from the normal curve. Updating our model to a generalized linear model that takes into account our discrete outcome variable should help fix this problem.
```{r}
check_model(fitlm2)
```
#### More Correct: Poisson Regression
Now let's compare a Poisson regression model of the same variables, also including the interaction term. <b>As always with the Poisson model, our first step is to fit it and check whether it is overdispersed. We will only proceed with the Poisson regression model if it is not.</b>
"Is there a relationship between sex, years of experience, and the amount of fishes caught, and does the effect of one predictor depend on the other?"
Again, because we have multiple predictors, multiple regression allows us to control for the effects of multiple predictors on the outcome, and we are left with the unique effect of a predictor when controlling for the other predictors. In this case, the outcome variable will be the count of `Fish`, the predictor variables will be `Sex` (a categorical variable) and `Experience` (a continuous variable), and the interaction term will assess the degree to which the effects of the predictor variables do or do not depend on each other.
For Poisson regression, we supply the <b>poisson</b> distribution and the <b>log</b> link function to the `glm()` function.
```{r}
fit_poisson_2 <- glm(
formula = Fish ~ Sex * Experience,
#equivalent formula: Fish ~ Sex + Experience + Sex:Experience
family = poisson(link = "log"),
data = fishing
)
```
##### Parameter Table Interpretation: Log-Mean
Let's use the `model_parameters` function and interpret the estimates of our model with an interaction term.
```{r}
model_parameters(fit_poisson_2) |> print_md()
```
Due to the nature of the dummy coded categorical variable `Sex`, we will assess the parameters from each level of that variable first.
> The intercept is the log mean of `Fish` for female fishers while accounting for the effect of `Experience`. In this case, the simple effect for female fishers with 0 years of fishing experience is a log mean of 0.55. This intercept value is significantly different from 0, p<0.001. It will make more sense once we exponentiate it: exp(0.55) = 1.73. Therefore, the baseline number of fish caught by female fishers with 0 years of experience is 1.73 fish.

> The log mean of `Fish` for male fishers while controlling for the effect of `Experience` is 0.07 more than for female fishers at 0 years of experience. This simple-effect slope, which is the difference between male and female fishers, is not significantly different from a log mean of 0. Once again, this value makes more sense when we exponentiate it: exp(0.07) = 1.07, which is the multiplicative effect for male fishers compared to female fishers at 0 years of experience; in other words, male fishers with 0 years of fishing experience catch 1.07 times the expected number of fish of female fishers with 0 years of fishing experience. Note that 1.07 is greater than 1, so male fishers are expected to catch more fish than female fishers.

> As you increase `Experience` by 1 unit (or 1 year), the log mean for female fishers increases by 0.09. This slope of experience is significantly different from 0, p<0.001. If we exponentiate this, we get exp(0.09) = 1.09, which is a 9% increase (because it is a multiplicative effect). We can interpret this as: for every 1 year increase in experience for female fishers, there is a 9% increase in the number of fish caught. (Note that this slope is only for female fishers; in a model with an interaction term, male fishers have their own slope.)

> The Sex-by-Experience interaction effect is the difference in the `Experience` slope between male and female fishers; equivalently, it is the expected change in the male/female difference in the number of fish caught for each +1 year change in experience. It is not significantly different from zero.
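To see how the two slopes relate to the coefficients, here is an optional sketch that recovers the per-sex `Experience` slopes by hand (it assumes the default treatment coding, where the interaction coefficient is named `SexMale:Experience` as in the parameter table):
```{r}
## Per-sex slopes: females get the Experience coefficient alone,
## males get Experience plus the interaction term
b <- coef(fit_poisson_2)
slope_female <- unname(b["Experience"])
slope_male <- unname(b["Experience"] + b["SexMale:Experience"])
exp(c(female = slope_female, male = slope_male)) # multiplicative effect per year
```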
Note that we used `exp()` a lot in order to make our log-mean estimates easier to understand. We could instead have printed the parameter table of `model_parameters()` with the argument `exponentiate = TRUE`:
```{r}
model_parameters(fit_poisson_2, exponentiate = TRUE) |> print_md()
```
Remember, when the log-mean values are exponentiated, they are called <b>"Incident Rate Ratios"</b> or IRRs. As aforementioned, they are multiplicative effects on the baseline value (in this case, the baseline group is female fishers at 0 years of experience). For interpretation purposes, we can consider the following guidelines:
- IRR > 1, the comparison group has a higher outcome value than the baseline.
- IRR < 1, the comparison group has a lower outcome value than the baseline.
- IRR = 1, there is little to no difference between comparison and baseline.
Here, the IRR for the interaction effect is essentially 1, so there is essentially no difference in the `Experience` slopes between the sexes when we allow the predictors to interact.
##### Checking Assumptions
With Poisson regression, we can see that the residuals, whose normality was violated in the regular multiple regression model, are now more in line with the model's assumptions.
```{r}
check_model(fit_poisson_2)
```
## Data Demonstration 2: Negative Binomial Regression
We will use the affairs.csv dataset and count regression to predict each participant's number of affairs from their `gender`, `age`, `satisfaction`, and `religiousness`. Since each participant's number of affairs is independent of the others, and the variable has a lower bound of 0 and no maximum, `affairs` is an example of an unbounded count variable.
- Data: <a href="./affairs.csv" download> affairs.csv </a>
### Loading the Packages and Reading in Data
```{r, warning = FALSE, message=FALSE}
## Load packages
library(tidyverse)
library(easystats)
library(ggeffects)
library(MASS) #needed for Negative Binomial Regression
## Read in data from file
affairs <- read_csv("affairs.csv")
affairs
```
```{r, warning=FALSE,echo=FALSE}
mytheme <- theme_bw(
base_size = 12
)
```
This data shows records for 601 individuals where each row is one participant.
Here's a table of our variables in this dataset:
| Variable | Description | Values | Measurement |
|-------------|:--------------------------------|-----------:|:-----------:|
| `affairs`      | Number of affairs                  | Integer | Scale   |
| `gender`       | Male or Female                     | String  | Nominal |
| `age`          | Age                                | Double  | Scale   |
| `yearsmarried` | Number of Years Married            | Double  | Scale   |
| `children`     | Yes or No                          | String  | Nominal |
| `religiousness`| Religiousness, 1 (Low) to 5 (High) | Integer | Scale   |
| `education`    | Number of Years of Education       | Integer | Scale   |
| `satisfaction` | Satisfaction, 1 (Low) to 5 (High)  | Integer | Scale   |
### Prepare Data / Exploratory Data Analysis
For unbounded count regression, let's first treat the discrete variables in this dataset as factors.
```{r}
affairs$gender <- factor(affairs$gender)
affairs$children <- factor(affairs$children)
```
Next, let's plot the distribution of the count of `affairs`. The main thing to consider is that this is a discrete variable and using Poisson regression inherently takes this into account.
```{r}
ggplot(affairs, aes(x = affairs)) +
geom_bar() + mytheme
```
Now, let's look at the relationship between the number of `affairs` and `religiousness`. Unlike the fishing example, any relationship here is hard to see by eye:
```{r}
ggplot(data = affairs, mapping = aes(x = affairs, y = religiousness)) +
geom_point()
```
### Checking for Overdispersion
As aforementioned, the key to choosing between Poisson and Negative Binomial regression is checking for overdispersion in the count variable. Specifically, if the mean and variance of the count variable are roughly the same, then using Poisson regression is reasonable, but if the variance is substantially greater than the mean, then it is better to default to Negative Binomial regression.
Therefore, we will fit the Poisson model first and test for overdispersion before choosing the Negative Binomial model.
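As a quick descriptive preview (the formal test below is what we will actually rely on), we can compare the raw mean and variance of the counts:
```{r}
## Raw mean vs. variance of the counts;
## a variance much larger than the mean hints at overdispersion
mean(affairs$affairs)
var(affairs$affairs)
```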
### Models Without Interaction Terms
#### Less Correct: The General Linear Model
Before we fit the Poisson or Negative Binomial regression model, it would be worthwhile to try fitting the less appropriate regular general linear model. We'll use our old friend `lm()` to assess the relationship between the outcome variable `affairs` with the predictor variables `gender` and `age`. For more information on general linear modeling, we covered these topics in our <b>Regression Recap</b> chapter.
```{r}
fitlm3 <- lm(
formula = affairs ~ gender + age,
data = affairs
)
model_parameters(fitlm3) |> print_md()
```
As the results show, we have a significant partial effect of `age`, such that a 1 unit increase in `age` increases the number of affairs by a small amount (0.03 affairs) while controlling for `gender`. Despite the challenge of interpreting non-integer numbers of affairs, it is worth noting that the regular `lm()` "worked" in that it was able to produce an estimate.
##### Checking Assumptions
When we fit a model, we always want to see if there were any violations of the assumptions. When using the general linear model via the `lm()` function for count responses, a plot of the residuals, using `check_residuals()` [^1], should be concerning.
[^1]: `check_residuals()` requires the package `qqplotr` to be installed.
```{r}
## Check residuals
check_residuals(fitlm3) |> plot() # looks bad
```
Yikes! As you can see, the dots <b>do not</b> fall neatly along the line, especially at the ends of the quantiles. When this or any other assumption of the general linear model is not met, it is a clear sign that parts of the model may not be trustworthy. In this case, the non-uniformity of the residuals badly compromises the model's uncertainty estimates. While the predictions may be okay, and others could argue this violation does not appear too detrimental, we can (and should) do better.
#### More Correct: The Generalized Linear Model
Now let's compare our modeling of the same variables using Poisson regression or Negative Binomial regression.
<b>First, we will decide which of the two models is better for this question based on an overdispersion test. Then, we will interpret the results of that model. </b>
"Is there a relationship between the age, gender, and the amount of affairs?"
Again, because we have multiple predictors, multiple regression allows us to control for the effects of multiple predictors on the outcome, and we are left with the unique effect of a predictor when controlling for other predictors. In this case, the outcome variable will be `affairs`, and the predictor variables will be `gender` (categorical variable) and `age` (continuous variable).
We did a lot of work with `lm()` in the "Regression Recap". The `glm()` function is in many ways the same as the `lm()` function. Even the syntax for how we write the formula is the same. In addition, we can use the same `model_parameters()` function to get our table of parameter estimates. Indeed, we will use the `glm()` function for the remainder of the online modules.
But notice that the `glm()` function has one more input argument: the `family` argument, which specifies which family of distributions to use and which link function to apply.
#### Choosing Between Poisson or Negative Binomial
For Poisson regression, we supply the <b>Poisson</b> distribution and the <b>log</b> link function.
```{r}
fit_poisson_3 <- glm(
formula = affairs ~ gender + age,
family = poisson(link = "log"),
data = affairs
)
```
Now let's check if there is overdispersion in the Poisson model:
```{r}
check_overdispersion(fit_poisson_3)
```
Since overdispersion was detected in the Poisson model, we will now use the Negative Binomial model to account for it. Remember, overdispersion in the Poisson model means that our model does not fit the data well.
For Negative Binomial regression, we will use a slightly different GLM function called `glm.nb()` from the `MASS` package. Instead of supplying a family/link argument, the function already knows we are fitting a Negative Binomial model, so the call to `glm.nb()` is more compact.
```{r}
fit_nb_1<- glm.nb(
formula = affairs ~ gender + age,
data = affairs
)
```
Therefore, for this combination of predictor variables and count outcome variable, we will use the Negative Binomial model. It takes overdispersion into account, and we still get to interpret the estimates in the same way as in the Poisson model.
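If you would like reassurance beyond the overdispersion test, one optional sanity check is to compare the two fits by AIC; a notably lower AIC for the Negative Binomial model is consistent with overdispersion:
```{r}
## Compare fits: the model with the lower AIC fits the data better
AIC(fit_poisson_3, fit_nb_1)
```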
#### Parameter Table Interpretation: Log-Mean (Negative Binomial)
Using the Negative Binomial model, let's use the `model_parameters()` function and interpret the estimates.
```{r}
model_parameters(fit_nb_1) |> print_md()
```
The parameter table output from a model fit with `glm.nb()` looks very similar to parameter table outputs from models fit with `lm()`. But there is a <b>big</b> difference -- in `glm()` and `glm.nb()` models, we are no longer working in the same raw units!
This difference is apparent in the second column of the output table: "Log-Mean". Because we fit a Negative Binomial regression with a log link function, the parameter values are in "log-mean" units.
Due to the nature of the dummy coded categorical variable `gender`, we will assess the parameters from each level of that variable first.
> The intercept is the log mean of `affairs` for female participants while accounting for the effect of `age`. In this case, female participants who are 0 years old have a log mean of -0.61. This intercept value is not significantly different from 0. It will make more sense once we exponentiate it: exp(-0.61) = 0.54. Therefore, the baseline number of affairs for female participants who are 0 years old is 0.54.

> The log mean of `affairs` for male participants while controlling for the effect of `age` is 0.04 less than for female participants. This slope is not significantly different from a log mean of 0. Once again, this value makes more sense when we exponentiate it: exp(-0.04) = 0.96, which is the multiplicative effect for male participants compared to females; in other words, male participants at age 0 have 0.96 times the expected number of affairs of females at age 0. Note that 0.96 is less than 1, so males have slightly fewer affairs than females.

> As you increase `age` by 1 unit, the log mean for both male and female participants increases by 0.03. This slope of age is significantly different from 0 (p<0.05). If we exponentiate this, exp(0.03) = 1.03, a 3% increase (because it is a multiplicative effect). We can interpret this as: for every 1 year increase in age for males and females, there is a 3% increase in the number of affairs.
Note that we used `exp()` a lot in order to make our log-mean estimates easier to understand. We could instead have printed the parameter table of `model_parameters()` with the argument `exponentiate = TRUE`:
```{r}
model_parameters(fit_nb_1, exponentiate = TRUE) |> print_md()
```
As in the Poisson model, when the log-mean values are exponentiated, they are called <b>"Incident Rate Ratios"</b> or IRRs. As aforementioned, they are multiplicative effects on the baseline value (in this case, the baseline group is female participants at age 0). For interpretation purposes, we can consider the following guidelines:
- IRR > 1, the comparison group has a higher outcome value than the baseline.
- IRR < 1, the comparison group has a lower outcome value than the baseline.
- IRR = 1, there is little to no difference between comparison and baseline.
#### Estimate means
Because we have a dummy-coded predictor, `gender`, we can estimate the mean number of `affairs` for females and males averaged over age with `estimate_means()`:
```{r}
gmeans1 <- estimate_means(fit_nb_1, by = c("gender"))
gmeans1 |> print_md()
```
Notice that if we take the ratio of the male mean to the female mean, we recover the IRR for males: 1.38/1.44 = 0.96; that is, males have 0.96 times the number of `affairs` compared to females.
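For intuition, here is a rough by-hand version of the same idea. This sketch assumes `estimate_means()` is evaluating each gender at the mean `age`, and that the dummy term is named `gendermale` as in the parameter table:
```{r}
## Expected counts per gender at the average age, built from the coefficients
b <- coef(fit_nb_1)
exp(b["(Intercept)"] + b["age"] * mean(affairs$age))                   # females
exp(b["(Intercept)"] + b["gendermale"] + b["age"] * mean(affairs$age)) # males
```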
##### Checking Assumptions
```{r}
check_residuals(fit_nb_1) |> plot()
```
### Models With Interaction Terms
#### Less Correct: The General Linear Model
Let's again consider a regular general linear model with an interaction term. We'll use `lm()` to assess the relationship between the outcome variable `affairs` and the predictor variables `gender`, `satisfaction`, and `religiousness`, with a two-way interaction between `gender` and `religiousness`. Again, for more information on general linear modeling, we covered these topics in our <b>Regression Recap</b> chapter.
```{r}
fitlm4 <- lm(
formula = affairs ~ gender + satisfaction + religiousness +
gender:religiousness,
data = affairs
)
model_parameters(fitlm4) |> print_md()
```
As the results show, we have a significant partial effect of `satisfaction`, such that a 1 unit increase in `satisfaction` decreases the number of `affairs` by 0.83 units while controlling for `gender` and `religiousness`, and a significant partial effect of `religiousness`, such that a 1 unit increase in `religiousness` decreases the number of `affairs` by 0.43 units while controlling for `gender` and `satisfaction`. Our model here also includes an interaction term, which was not significant.
##### Checking Assumptions
As seen using the `check_model()` function, the most concerning violations appear to be homogeneity of variance and normality of residuals. Updating our model to a generalized linear model that takes our discrete outcome variable into account should help fix these problems.
```{r}
check_model(fitlm4)
```
#### More Correct: the Generalized Linear Model
Now let's compare our modeling of the same variables with our interaction term using Poisson regression or Negative Binomial regression.
<b>First, we will decide which of the two models is better for this question based on an overdispersion test. Then, we will interpret the results of that model. </b>
"Is there a relationship between satisfaction, gender, and religiousness, with an interaction between gender and religiousness, on the amount of affairs?"
First, let's start by fitting the Poisson model and testing it for overdispersion.
#### Choosing Between Poisson or Negative Binomial
For Poisson regression, we supply the <b>Poisson</b> distribution and the <b>log</b> link function to the `glm()` function.
```{r}
fit_poisson_4 <- glm(
formula = affairs ~ gender + satisfaction + religiousness +
gender:religiousness,
family = poisson(link = "log"),
data = affairs
)
```
Now let's check if there is overdispersion in the Poisson model:
```{r}
check_overdispersion(fit_poisson_4)
```
Since overdispersion was detected in the Poisson model, we will again use the Negative Binomial model to account for it. Remember, overdispersion in the Poisson model means that our model does not fit the data well.
For Negative Binomial regression, again we will use the `glm.nb()` function from the `MASS` package. Instead of supplying a family/link argument, the function already knows we are fitting a Negative Binomial model, so the call is more compact.
```{r}
fit_nb_2 <- glm.nb(
formula = affairs ~ gender + satisfaction + religiousness +
religiousness:gender,
data = affairs
)
```
Therefore, for the same combination of predictor variables and count outcome variable, we will use the Negative Binomial model. This takes overdispersion into account and also takes advantage of the fact that we can interpret the estimates interchangeably between both models.
#### Parameter Table Interpretation: Log-Mean (Negative Binomial)
Using the Negative Binomial model, let's use the `model_parameters()` function and interpret the estimates of our model with an interaction term.
```{r}
model_parameters(fit_nb_2) |> print_md()
```
Due to the nature of the dummy coded categorical variable `gender`, we will assess the parameters from each level of that variable first.
> The intercept is the log mean of `affairs` for female participants while accounting for the effects of `satisfaction` and `religiousness`. Because `satisfaction` and `religiousness` are scaled from 1 to 5, we must include those slopes when computing the expected response at the lowest scores: female participants with scores of 1 on both `satisfaction` and `religiousness` have a log mean of 3.70 - 0.55 - 0.48 = 2.67. The intercept itself is significantly different from 0, p<0.001. This will make more sense once we exponentiate: exp(2.67) = 14.44. Therefore, female participants who are least satisfied and least religious are expected to have 14.44 affairs.

> The log mean of `affairs` for male participants while controlling for the effects of `satisfaction` and `religiousness` is 0.65 less than for female participants. This slope difference between males and females is not significantly different from a log mean of 0. Once again, this value makes more sense when we exponentiate it: exp(-0.65) = 0.52, which is the multiplicative effect for male participants compared to females; in other words, male participants with scores of 1 on both `satisfaction` and `religiousness` have 0.52 times the expected number of affairs of females with the same scores. Note that 0.52 is less than 1, so males have about half as many affairs as females.

> As you increase `satisfaction` by 1 unit, the log mean decreases by 0.55 while holding `religiousness` constant. This slope of `satisfaction` (which applies to both genders, since `satisfaction` is not part of the interaction) is significantly different from 0 (p<0.001). If we exponentiate this, exp(-0.55) = 0.58; since the expected count is multiplied by 0.58, that is a 42% decrease. We can interpret this as: for every 1 unit increase in `satisfaction`, there is a 42% decrease in the number of affairs.

> As you increase `religiousness` by 1 unit, the log mean for female participants decreases by 0.48 while holding `satisfaction` constant. This slope of `religiousness` for females is significantly different from 0 (p=0.001). If we exponentiate this, exp(-0.48) = 0.62; since the expected count is multiplied by 0.62, that is a 38% decrease. We can interpret this as: for every 1 unit increase in `religiousness` for females, there is a 38% decrease in the number of affairs.

> The `gender[male] x religiousness` interaction effect is not significant: the slope of `religiousness` does not differ between male and female participants.
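As an optional check of the intercept interpretation above, we can ask the fitted model directly for the expected count at those predictor values; this is a sketch, where `type = "response"` returns the prediction on the count scale:
```{r}
## Expected affairs for a female with the lowest satisfaction and religiousness scores
predict(
  fit_nb_2,
  newdata = data.frame(gender = "female", satisfaction = 1, religiousness = 1),
  type = "response"
) # should be close to exp(3.70 - 0.55 - 0.48), roughly 14.4
```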
Note that once more we used `exp()` a lot in order to make our log-mean estimates easier to understand. We could instead have printed the parameter table of `model_parameters()` with the argument `exponentiate = TRUE`:
```{r}
model_parameters(fit_nb_2, exponentiate = TRUE) |> print_md()
```
Now that the values are IRRs, effects combine multiplicatively: to build a prediction from the model, we multiply the relevant terms together rather than adding them.
##### Checking Assumptions
We benefit from using the Negative Binomial model over the general linear model and the Poisson model because it better satisfies the model assumptions (including accounting for overdispersion):
```{r}
check_model(fit_nb_2)
```
Both the homogeneity of variance and the uniformity of residuals show marked improvements compared to the violations found in the regular linear model.