---
title: "Scatter plots with *many* points"
author: "Michael Stadler, Charlotte Soneson"
format:
    html:
        toc: true
        toc-location: left
        embed-resources: true
        link-external-icon: true
        code-fold: true
editor_options:
    chunk_output_type: console
---

## Summary

This document illustrates how to use `ggplot2` and `scattermore` to make a
scatter plot with many data points that:

- shows the density of points without saturation
- is fast to view and arrange (for example when assembling figure panels)
- can be stored in a compact file without loss in quality


## Prepare data

Run the following code to prepare the data used in this document:

```{r}
#| label: prepare_data
#| code-fold: false

suppressPackageStartupMessages({
    library(ggplot2)
    library(tibble)
})

# built-in `diamonds` dataset from the `ggplot2` package (see ?ggplot2::diamonds)
tibble(diamonds)
```


## Create figure

### Load packages
```{r}
#| label: load_packages

library(ggplot2)
library(scattermore)
```


### Plot
Let's first create a simple scatter plot to illustrate the problems.
The `diamonds` dataset has `r nrow(diamonds)` observations. If the plot is saved
with a vector graphics device such as `pdf()` or `svg()`, the file will be
large (each observation is stored as an individual graphic element)
and slow to open or arrange. Furthermore, the large number of data points
leads to saturation (overplotting), so we do not see the underlying density of
data points.
```{r}
#| label: plot
#| fig-width: 7
#| fig-height: 7
#| code-fold: false

# create base plot
gg <- ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
    labs(x = "Weight of the diamond (carat)", y = "Price (US dollars)") +
    theme_bw(20) +
    theme(panel.grid = element_blank(),
          legend.position = "bottom")
gg + geom_point()
```

A simple way to reduce saturation is to use transparency, so that
overlapping observations lead to darker colors. However, this does not yet
solve the "many points" problem: every observation is still drawn individually.

```{r}
#| label: plot_transparency
#| fig-width: 7
#| fig-height: 7

# ... with transparency
gg + geom_point(color = alpha("black", 0.02))
```
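
To get a feeling for how quickly transparent points saturate, here is a small
worked example. It assumes standard "over" alpha compositing, where `n` stacked
points of opacity `a` give a combined opacity of `1 - (1 - a)^n`; the helper
name `stacked_opacity` is made up for this illustration.

```{r}
#| label: alpha_saturation

# combined opacity of n overlapping points with individual opacity a
# (assumes standard "over" alpha compositing)
stacked_opacity <- function(n, a = 0.02) 1 - (1 - a)^n

# opacity reached by 1, 10, 50 and 500 overlapping points
round(stacked_opacity(c(1, 10, 50, 500)), 3)

# number of overlapping points needed to reach 99% opacity
n99 <- ceiling(log(1 - 0.99) / log(1 - 0.02))
n99
```

Under this assumption, a few hundred overlapping points are already
indistinguishable from several thousand, so the densest regions of the plot can
still saturate even with `alpha = 0.02`.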

A simple way to also solve the "many points" problem is to avoid showing
all individual observations and instead show the local density of points
using a color scale.

```{r}
#| label: plot_2d_density
#| fig-width: 7
#| fig-height: 7

# ... with a filled 2D density estimate instead of individual points
gg + geom_density_2d_filled(bins = 48) +
    coord_cartesian(expand = FALSE) +
    theme(legend.position = "none")
```

The linearly spaced contour levels (controlled by `bins` or `breaks`)
may not work well in a case like ours: the density is very high in a few
regions, which then occupy almost the complete color scale, and we lose
resolution in low-density regions. You can use `breaks` to create non-linear
intervals (here combined with `contour_var = "ndensity"`, so that the densities
are scaled to a known range of [0, 1]) and `theme(panel.background)` to fill
the zero-density area with the darkest color of the palette.

```{r}
#| label: plot_2d_density_nonlinearbreaks
#| fig-width: 7
#| fig-height: 7

# ... with a filled 2D density estimate and log-spaced breaks
gg + geom_density_2d_filled(contour_var = "ndensity",
                            breaks = exp(seq(log(1e-4), log(1), length.out = 64))) +
    coord_cartesian(expand = FALSE) +
    theme(legend.position = "none",
          panel.background = element_rect(fill = hcl.colors(64)[1]))
```
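
The `breaks` used above are spaced evenly on a logarithmic scale. As a quick
check (a sketch using only base R; `brks` and `ratios` are just illustrative
variable names), consecutive breaks differ by a constant factor, so each order
of magnitude of density gets the same number of color intervals:

```{r}
#| label: log_spaced_breaks

# same breaks as passed to geom_density_2d_filled()
brks <- exp(seq(log(1e-4), log(1), length.out = 64))

# the breaks span four orders of magnitude ...
range(brks)

# ... and the ratio between consecutive breaks is constant
ratios <- brks[-1] / brks[-length(brks)]
all.equal(min(ratios), max(ratios))

# number of breaks at or below 1% of the maximum density
sum(brks <= 0.01)
```

With linearly spaced breaks, by contrast, almost all intervals would fall above
1% of the maximum density.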

Finally, if you prefer to show individual observations and need
something that will scale to millions of points without getting slow or
hard to use, `scattermore` provides a solution for that too.

```{r}
#| label: plot_scattermore
#| fig-width: 7
#| fig-height: 7

# ... with individual points rasterized by scattermore
gg + geom_scattermore(pointsize = 2, alpha = 0.02)
```
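
One way to check the effect on file size is to save both versions of the plot
as PDF and compare them. The following is a minimal sketch, not a benchmark:
the plot object `p` and the temporary file names are chosen for illustration,
and the resulting sizes will depend on the graphics device and its settings.

```{r}
#| label: compare_file_sizes

library(ggplot2)
library(scattermore)

p <- ggplot(diamonds, aes(x = carat, y = price))

# save a vector version (geom_point) and a rasterized version (geom_scattermore)
f_vector <- tempfile(fileext = ".pdf")
f_raster <- tempfile(fileext = ".pdf")
ggsave(f_vector, p + geom_point(alpha = 0.02), width = 7, height = 7)
ggsave(f_raster, p + geom_scattermore(pointsize = 2, alpha = 0.02),
       width = 7, height = 7)

# file sizes in kilobytes
round(c(vector = file.size(f_vector), raster = file.size(f_raster)) / 1024)
```

In the rasterized version only the point layer is stored as a bitmap, so its
size should depend largely on the `pixels` argument of `geom_scattermore()`
rather than on the number of observations.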


## Remarks

- [`scattermore`](https://github.com/exaexa/scattermore) does its magic by
  rendering the point layer into a bitmap while keeping all the other
  layers as they are, allowing you to create `pdf()` files that can be magnified
  without losing readability of the axes.
- [`scattermore`](https://github.com/exaexa/scattermore)
  provides two `ggplot2` layers: `geom_scattermore()` (which behaves mostly like
  `geom_point()`), and `geom_scattermost()` which avoids much of the overhead
  of a normal ggplot2 layer and thus is even more efficient, at the price that
  it has a slightly different interface and needs to get the data directly as
  the `xy` argument.
- An alternative, perhaps even more convenient drop-in replacement for
  `geom_point()` that follows a similar strategy is `geom_point_rast()`
  from the [`ggrastr`](https://github.com/VPetukhov/ggrastr) package, which
  offers rasterized versions of other `ggplot2` geoms as well. This
  package also provides the `rasterize()` function, which takes an existing
  `ggplot2` plot object and rasterizes suitable layers in it. Compared to
  `scattermore`, however, `ggrastr` does not seem to be as fast or to scale as
  well.
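
The interfaces mentioned above can be sketched as follows. This is an
illustration rather than a recommendation of specific settings: the object
names (`xy`, `p_most`, `p_rast`, `p_all`) and the `raster.dpi`/`dpi` values are
arbitrary choices, and the `ggrastr` part only runs if that package is
installed.

```{r}
#| label: scattermost_and_ggrastr

library(ggplot2)
library(scattermore)

# geom_scattermost() takes the coordinates directly as a two-column matrix
xy <- as.matrix(diamonds[, c("carat", "price")])
p_most <- ggplot() +
    geom_scattermost(xy, color = alpha("black", 0.02), pointsize = 2) +
    labs(x = "Weight of the diamond (carat)", y = "Price (US dollars)")
p_most

if (requireNamespace("ggrastr", quietly = TRUE)) {
    p <- ggplot(diamonds, aes(x = carat, y = price))

    # rasterized drop-in replacement for geom_point()
    p_rast <- p + ggrastr::geom_point_rast(alpha = 0.02, raster.dpi = 150)

    # rasterize the point layer of an existing plot object
    p_all <- ggrastr::rasterize(p + geom_point(alpha = 0.02),
                                layers = "Point", dpi = 150)
    print(p_rast)
    print(p_all)
}
```

Because `geom_scattermost()` does not use aesthetic mappings, per-point colors
have to be supplied as a vector through its `color` argument.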

## Session info

```{r}
#| label: session_info
sessioninfo::session_info()
```