-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy path3_many_points_scatterplot.qmd
More file actions
162 lines (129 loc) · 4.99 KB
/
3_many_points_scatterplot.qmd
File metadata and controls
162 lines (129 loc) · 4.99 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
---
title: "Scatter plots with *many* points"
author: "Michael Stadler, Charlotte Soneson"
format:
html:
toc: true
toc-location: left
embed-resources: true
link-external-icon: true
code-fold: true
editor_options:
chunk_output_type: console
---
## Summary
This document illustrates how to use `ggplot2` and `scattermore` to:
- make a scatter plot with many data points that shows the density of points
without saturation
- that is fast to view and arrange (for example when assembling figure panels)
- that can be stored into a compact file without loss in quality
## Prepare data
Run the following code to prepare the data used in this document:
```{r}
#| label: prepare_data
#| code-fold: false
suppressPackageStartupMessages({
library(ggplot2)
library(tibble)
})
# built-in `diamonds` dataset from the `ggplot2` package (see ?ggplot2::diamonds)
tibble(diamonds)
```
## Create figure
### Load packages
```{r}
#| label: load_packages
library(ggplot2)
library(scattermore)
```
### Plot
Let's first create a simple scatter plot to illustrate the problems.
The `diamonds` dataset has `r nrow(diamonds)` observations. If stored
into a vectorized graphics device such as `pdf()` or `svg()`, the file will be
large (each observation is individually represented as graphic elements)
and slow to open or arrange. Furthermore, the high number of data points
leads to saturation and we do not see the full underlying density of data points.
```{r}
#| label: plot
#| fig-width: 7
#| fig-height: 7
#| code-fold: false
# create base plot
gg <- ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
labs(x = "Weight of the diamond (carat)", y = "Price (US dollars)") +
theme_bw(20) +
theme(panel.grid = element_blank(),
legend.position = "bottom")
gg + geom_point()
```
A simple way to improve the saturation issue is to use transparency, so that
overlapping observations lead to darker colors. However, this does not
solve the "many points" problem yet.
```{r}
#| label: plot_transparency
#| fig-width: 7
#| fig-height: 7
# ... with transparency
gg + geom_point(color = alpha("black", 0.02))
```
A simple way to solve also the "many points" problem is to avoid showing
all individual observations and instead show the local density of points
using a color scale.
```{r}
#| label: plot_2d_density
#| fig-width: 7
#| fig-height: 7
# ... with marginal density plots by number of cylinders
gg + geom_density_2d_filled(bins = 48) +
coord_cartesian(expand = FALSE) +
theme(legend.position = "none")
```
The linear contour levels or color intervals (controlled by `bins` or `breaks`)
may not work well in a case like ours, where the density is very high in some
regions that will occupy almost the complete color scale and we lose resolution
in low density regions. You can use `breaks` to create non-linear intervals
(here combined with `ndensity` so that we know the range of densities: [0, 1])
and with `theme(panel.background)` to make the zero-density area dark blue.
```{r}
#| label: plot_2d_density_nonlinearbreaks
#| fig-width: 7
#| fig-height: 7
# ... with marginal density plots by number of cylinders
gg + geom_density_2d_filled(contour_var = "ndensity",
breaks = exp(seq(log(1e-4), log(1), length.out = 64))) +
coord_cartesian(expand = FALSE) +
theme(legend.position = "none",
panel.background = element_rect(fill = hcl.colors(64)[1]))
```
Finally, if you prefer to show individual observations and need
something that will scale to millions of points without getting slow or
hard to use, `scattermore` provides a solution for that too.
```{r}
#| label: plot_scattermore
#| fig-width: 7
#| fig-height: 7
# ... with marginal violin and labelled data points
gg + geom_scattermore(pointsize = 2, alpha = 0.02)
```
## Remarks
- [`scattermore`](https://github.com/exaexa/scattermore) does its magic by
rendering the `geom_point` layer into a bitmap, while keeping all the other
layers as they are, allowing you to create `pdf()` files that can be magnified
without losing readability of the axes.
- [`scattermore`](https://github.com/exaexa/scattermore)
provides two `ggplot2` layers: `geom_scattermore()` (which behaves mostly like
`geom_point()`), and `geom_scattermost()` which avoids much of the overhead
of a normal ggplot2 layer and thus is even more efficient, at the price that
it has a slightly different interface and needs to get the data directly as
the `xy` argument.
- An alternative, maybe even more convenient drop-in replacement for `geom_point()`
and other `ggplot2` geoms that follows a similar strategy is `geom_point_rast()`
from the [`ggrastr`](https://github.com/VPetukhov/ggrastr) package. This
package also provides the `rasterize()` function that can take any existing
`ggplot2` plot object and rasterize suitable layer in it. Compared to
`scattermore`, `ggrastr` does not seem to be as fast and scale as well, though.
## Session info
```{r}
#| label: session_info
sessioninfo::session_info()
```