-
Notifications
You must be signed in to change notification settings - Fork 9
Expand file tree
/
Copy pathREADME.Rmd
More file actions
363 lines (271 loc) · 17.3 KB
/
README.Rmd
File metadata and controls
363 lines (271 loc) · 17.3 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
<!-- This README requires access to the vignettes in the package. Please run
devtools::install(build_vignettes = TRUE) before knitting this to README.md -->
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
warning = FALSE,
comment = "##",
fig.path = "man/figures/README-",
fig.height = 5,
fig.width = 5
# out.width = "100%"
)
# get package versions
library(vcdExtra)
cran_version <- available.packages(repos = "https://cloud.r-project.org")["vcdExtra", "Version"]
dev_version <- getNamespaceVersion("vcdExtra")
```
<!-- badges: start -->
[](https://cran.r-project.org/package=vcdExtra)
[](https://friendly.r-universe.dev/vcdExtra)
[](https://github.com/friendly/vcdExtra/)
[](https://cran.r-project.org/package=vcdExtra)
[](https://lifecycle.r-lib.org/articles/stages.html#stable)
[](https://www.gnu.org/licenses/gpl-2.0.html)
[](https://bsky.app/profile/datavisFriendly.bsky.social)
<!--
[](https://github.com/friendly/vcdExtra/actions/workflows/R-CMD-check.yaml)
-->
<!-- badges: end -->
# vcdExtra <img src="man/figures/logo.png" style="float:right; height:200px;" />
## Extensions and additions to vcd: Visualizing Categorical Data
<!-- Version 0.8-6 -->
Version `r getNamespaceVersion("vcdExtra")`; documentation built for `pkgdown` `r Sys.Date()`
This package provides additional data sets, documentation, and many
functions designed to extend the [vcd](https://CRAN.R-project.org/package=vcd) package for *Visualizing Categorical Data*
and the [gnm](https://CRAN.R-project.org/package=gnm) package for *Generalized Nonlinear Models*.
In particular, `vcdExtra` extends mosaic, assoc and sieve plots from vcd to handle `glm()` and
`gnm()` models and
adds a 3D version in `mosaic3d()`.
The functions here use the "strucplot" framework (Meyer et-al., 2006), which is a lovely, natural conceptual system
for implementing visualization and other displays for _n_-way frequency tables which have a nested, hierarchical
structure in [vcd](https://CRAN.R-project.org/package=vcd).
[This can be compared to the "productplots" framework in [producplots](https://CRAN.R-project.org/package=productplots),
and the defunct `ggmosaic` package]
`vcdExtra` also adds extensions to modeling functions for models fit using `glm()` and `MASS::loglm()`, using the construct `glmlist()`
to construct a list of related models which can be summarized (via `LRstats()`) and graphed (via `mosaic.glmlist()`)
`vcdExtra` is a support package for the book [*Discrete Data Analysis with R*](https://www.routledge.com/Discrete-Data-Analysis-with-R-Visualization-and-Modeling-Techniques-for/Friendly-Meyer/p/book/9781498725835) (DDAR) by Michael Friendly and David Meyer. There is also a
[web site for DDAR](http://ddar.datavis.ca) with all figures and code samples from the book.
It is also used in my graduate course, [Psy 6136: Categorical Data Analysis](https://friendly.github.io/psy6136/).
A more general goal of `vcdExtra` is to contribute to the wider topics of _thinking about, analyzing
and visualizing categorical data_, extending this beyond the scope of our book. In this sense,
it continues to be a love letter 💌 to CDA.
## 📂 Installation
Get the released version (`r cran_version`) from CRAN:
install.packages("vcdExtra")
The current development version (`r dev_version`) can be installed from [R-universe](https://friendly.r-universe.dev/vcdExtra) or
directly from the [GitHub repo](https://github.com/friendly/vcdExtra) via:
if (!require(remotes)) install.packages("remotes")
install.packages("vcdExtra", repos = c('https://friendly.r-universe.dev')
# or
remotes::install_github("friendly/vcdExtra", build_vignettes = TRUE)
### Overview
The original purpose of this package was to serve as a sandbox for introducing extensions of
mosaic plots and related graphical methods that apply to loglinear models fitted using `MASS::loglm()`,
generalized linear models using `stats::glm()` and the related, generalized _nonlinear_ models fitted
with `gnm()` in the [gnm](https://CRAN.R-project.org/package=gnm) package.
A related purpose was to fill in some holes in the analysis of
categorical data in R, not provided in base R, [vcd](https://CRAN.R-project.org/package=vcd),
or other commonly used packages.
##### See also:
<a href="https://www.routledge.com/Discrete-Data-Analysis-with-R-Visualization-and-Modeling-Techniques-for/Friendly-Meyer/p/book/9781498725835"><img src="man/figures/ddar-cover.png" style="height:100px;"/></a>
<a href="https://friendly.github.io/psy6136/"><img src="https://friendly.github.io/psy6136/icons/psy6136-highres.png" style="height:100px;" /></a>
<a href="https://friendly.github.io/nestedLogit/"><img src="https://friendly.github.io/nestedLogit/logo.png" style="height:100px;" /></a>
* My book, [*Discrete Data Analysis with R: Visualization and Modeling Techniques for Categorical and Count Data*](https://www.routledge.com/Discrete-Data-Analysis-with-R-Visualization-and-Modeling-Techniques-for/Friendly-Meyer/p/book/9781498725835)
* My graduate course, [Psy 6136: Categorical Data Analysis](https://friendly.github.io/psy6136/)
* A companion package, [`nestedLogit`](https://friendly.github.io/nestedLogit/), for fitting nested dichotomy logistic regression models for a polytomous response.
## 💡 vcdExtra Highlights
What's in the box?
### Mosaic plot extensions
* The method `mosaic.glm()`
extends the `mosaic.loglm()` method in the vcd
package to this wider class of models, e.g., models for ordinal factors, which can't
be handled with `MASS::loglm()`.
This method also works for
the generalized _nonlinear_ models fit with the [gnm](https://CRAN.R-project.org/package=gnm) package,
including models for square tables and models with multiplicative associations (RC models).
* `mosaic3d()`
introduces a 3D generalization of mosaic displays using the
[rgl](https://CRAN.R-project.org/package=rgl) package.
* A new "labeling" method, `labeling_points()` for mosaic plots allows you to show the observed or
expected frequencies in cells as point symbols, thereby showing the data or model in a dot-density
representation. This goes back to an old paper, Friendly(1995), where I describe visual and
conceptual models for categorical data with a physical analog of gas molecules in chambers.
### Model extensions
* A new class, `glmlist`, is introduced for working with
collections of glm objects, e.g., `Kway()` for fitting
all K-way models from a basic marginal model, and `LRstats()`
for brief statistical summaries of goodness-of-fit for a collection of
models.
* Similarly, for loglinear models fit using `MASS::loglm()`, the function `seq_loglm()`
fits a series of sequential models to the 1-, 2-, ... _n_-way marginal tables, corresponding to a variety of types of models for joint, conditional, mutual, ... independence. It
returns an object of class `loglmlist`, each of which is a class `loglm` object.
The function `seq_mosaic()` generates the mosaic plots and other plots in the
`vcd::strucplot()` framework.
* For **square tables** with ordered factors, `Crossings()` supplements the
specification of terms in model formulas using
`gnm::Symm()`,
`gnm::Diag()`,
`gnm::Topo(),` etc. in the [gnm](https://CRAN.R-project.org/package=gnm) package.
### 🗃️ Datasets
Beyond the wide range of **datasets* in the `vcd` package, this `vcdExtra` package includes many new data sets, that I've found useful for illustrating various ideas, models, methods and visualization. Use `datasets("vcdExtra")` to see a list with titles and descriptions.
The vignette, `vignette("datasets", package="vcdExtra")` provides a classification of these
according to methods of analysis.
```{r vcdExtra-datasets}
vcdExtra::datasets("vcdExtra")[,1]
```
### 📖 Vignettes
A [collection of **tutorial vignettes**](https://cran.r-project.org/web/packages/vcdExtra/vignettes/). In the installed package, they can be viewed using `browseVignettes(package = "vcdExtra")`;
```{r vignettes}
vigns <- as.data.frame(tools::getVignetteInfo("vcdExtra")[,c("File", "Title")])
vigns$Title <- paste0("[", vigns$Title, "](https://friendly.github.io/vcdExtra/articles/",
tools::file_path_sans_ext(vigns$File), ".html)")
vigns |> knitr::kable()
```
* there is also a set of simple **demonstration files** illustrating analysis of datasets with
more detail than provided in their individual help files. Use `demo(package = "vcdExtra")`
to see the list and run `demo("occStatus") to run the analysis for this example, or
`demo("mental-glm")` for another one.
* a few useful **utility functions** for manipulating categorical data sets and working with models for
categorical data: `joint()`, `conditional()`, `mutual()`, `saturated()`. These make it easier to specify
`loglm()` and `glm()` models representing a statistical concept, like conditional association, rather than
figuring out a formula like `[AC] [BC]` for a 3-way table or `[AD] [BD] [CD]` for a 4-way table.
* A re-implementation of `vcd::woolf_test()` extends the analysis of homogeneity of odds ratios
in 2 x 2 x R x C tables to provide tests for differences among the R strata rows and C strata columns.
### Recent work
#### Visual tables
A new function, `color_table()` provides semi-graphic tables of frequency tables or residuals from a loglinear model. The essential idea is to use background shading of cells in the table to show patterns
not discernible in purely numeric tables.
#### Association graphs
I'm now experimenting with using graphical association representations of models in conjunction with
the other methods, and ways of specifying models here. `assoc_graph()`
Association graphs represent variables as nodes and their partial associations between pairs of variables as edges. They are useful for understanding
If two variables are not connected by an edge, they are conditionally independent given the other variables in the model.
How can we use this in practice, to understand a model, or how well it fits a given dataset?
There is now (rudimentary) a `plot()` method for association graphs which allows edges to be weighted by a measure
of the strength of association between variables: partial $G^2$ or Cramer's V. Still very much a WIP.
## 📊 Examples
These `README` examples provide simple illustrations of using some of the package functions in the
context of loglinear models for frequency tables fit using `glm()`, including
models for _structured associations_ taking ordinality into account.
### `Mental` dataset
The dataset `vcdExtra::Mental` is a data frame frequency table representing the cross-classification of mental health status (`mental`) of 1660 young New York residents by their parents' socioeconomic status (`ses`).
Both are _ordered_ factors. The questions are:
* Is `mental` health associated with parents `ses`?
* If so, what is the pattern/nature of the association?
* How can I take the ordinal nature of the factors into account?
```{r ex-mental1}
data(Mental)
str(Mental)
# show as frequency table
(Mental.tab <- xtabs(Freq ~ ses+mental, data=Mental))
```
#### Independence model
Fit the independence model, `Freq ~ mental + ses`, using `glm(..., family = poisson)`
This model is equivalent to the `chisq.test(Mental)` for general association; it
does not take ordinality into account. `LRstats()` provides a compact summary of
fit statistics for one or more models.
```{r ex-mental2}
indep <- glm(Freq ~ mental + ses,
family = poisson, data = Mental)
LRstats(indep)
```
`mosaic.glm()` is the mosaic method for `glm` objects.
The default mosaic display for these data:
```{r mental1}
mosaic(indep)
```
It is usually better to use _standardized residuals_ (`residuals_type="rstandard"`) in mosaic displays, rather than the default Pearson residuals.
Here we also add longer labels for the table factors (`set_varnames`)
and display the
values of residuals (`labeling=labeling_residuals`) in the cells.
The strucplot `formula` argument, `~ ses + mental`
here gives the order of the factors in the mosaic display,
not the statistical model for independence. That is, the
unit square is first split by `ses`, then by `mental` within
each level of `ses`.
```{r mental2}
# labels for table factors
long.labels <- list(set_varnames = c(mental="Mental Health Status",
ses="Parent SES"))
mosaic(indep, formula = ~ ses + mental,
residuals_type="rstandard",
labeling_args = long.labels,
labeling=labeling_residuals)
```
The **opposite-corner** pattern of the residuals clearly shows that association
between Parent SES and mental health depends on the _ordered_ levels of the factors:
higher Parent SES is associated with better mental health status. A principal virtue
of mosaic plots is to show the pattern of association that remains
after a model has been fit, and thus help suggest a better model.
#### Ordinal models
Ordinal models use **numeric** scores for the row and/or column variables.
These models typically use equally spaced _integer_ scores.
The test for association here is analogous to a test of the correlation
between the frequency-weighted scores, carried out using `CMHtest()`.
```{r}
CMHtest(Mental.tab)
```
In the data, `ses` and `mental` were declared to be ordered factors,
so using `as.numeric(Mental$ses)` is sufficient to create a new `Cscore`
variable. Similarly for the numeric version of `mental`, giving `Rscore`.
```{r mental-scores}
Cscore <- as.numeric(Mental$ses)
Rscore <- as.numeric(Mental$mental)
```
Using these, the term `Rscore:Cscore` represents an association
constrained to be **linear x linear**; that is, the slopes for profiles of
mental health status are assumed to vary linearly with those for Parent SES.
(This model asserts that only one parameter (a local odds ratio)
is sufficient to account for all association, and is also called the model of "uniform association".)
```{r mental3}
# fit linear x linear (uniform) association. Use integer scores for rows/cols
Cscore <- as.numeric(Mental$ses)
Rscore <- as.numeric(Mental$mental)
linlin <- glm(Freq ~ mental + ses + Rscore:Cscore,
family = poisson, data = Mental)
mosaic(linlin, ~ ses + mental,
residuals_type="rstandard",
labeling_args = long.labels,
labeling=labeling_residuals,
suppress=1,
gp=shading_Friendly,
main="Lin x Lin model")
```
Note that the test for linear x linear association consumes only 1 degree of freedom,
compared to the `(r-1)*(c-1) = 15` degrees of freedom for general association.
```{r}
anova(linlin, test="Chisq")
```
Other models are possible between the independence model, `Freq ~ mental + ses`,
and the saturated model `Freq ~ mental + ses + mental:ses`.
The `update.glm()` method make these easy to specify, as addition of terms to
the independence model.
```{r}
# use update.glm method to fit other models
linlin <- update(indep, . ~ . + Rscore:Cscore)
roweff <- update(indep, . ~ . + mental:Cscore)
coleff <- update(indep, . ~ . + Rscore:ses)
rowcol <- update(indep, . ~ . + Rscore:ses + mental:Cscore)
```
**Compare the models**:
For `glm` objects, the `print` and `summary` methods give too much information if all one wants to see is a brief summary of model goodness of fit, and there is no easy way to display a compact comparison of model goodness of fit for a collection of models fit to the same data.
`LRstats()` provides a brief summary for one or more models fit to the same dataset.
The likelihood ratio $\chi^2$ values (`LR Chisq`)test lack of fit.
By these tests, none of the ordinal models show significant lack of fit.
By the AIC and BIC statistics, the `linlin` model is the best, combining parsimony and goodness of fit.
```{r}
LRstats(indep, linlin, roweff, coleff, rowcol)
```
The `anova.glm()` function gives tests of nested models.
```{r}
anova(indep, linlin, roweff, test = "Chisq")
```
## References
Friendly, M. (1995). Conceptual and Visual Models for Categorical Data. _The American Statistician_, **49**, 153–160. http://www.datavis.ca/papers/amstat95.pdf
Friendly, M. & Meyer, D. (2016). _Discrete Data Analysis with R: Visualization and Modeling Techniques for Categorical and Count Data_. Boca Raton, FL: Chapman & Hall/CRC.
Meyer, D., Zeileis, A., & Hornik, K. (2006). The Strucplot Framework: Visualizing Multi-way Contingency Tables with vcd. _Journal of Statistical Software_, **17**(3), 1–48. http://www.jstatsoft.org/v17/i03/