`Monitor::stats` task silently freezes on every pool simultaneously at exactly 2^30 ms uptime #1017

## Summary

After **2^30 milliseconds** (≈ 12 days 10 h 15 m 42 s) of pgdog uptime, to within the precision of our timestamps, the per-pool `Monitor::stats` task stops firing on every pool simultaneously, with no log signal. `Monitor::maintenance` and the request-handling path keep running normally. `stats.averages` for every pool stays frozen at the value computed by the last `calc_averages` call.

This has reproduced twice in production, each freeze occurring within ~21 minutes of the 2^30 ms mark from process start:

| Run | Started (UTC) | Froze (UTC) | Uptime at freeze (s) | Offset from 2^30 ms (1,073,742 s) |
|---|---|---|---|---|
| 1 | Apr 16 12:20 | Apr 28 22:15 | 1,072,500 | −21 min |
| 2 | May 14 16:25 | May 27 02:50 | 1,074,300 | +9 min |

Both offsets are within the noise of "deploy timestamp vs actual process start" and "metrics scrape granularity", and the two intervals differ from each other by only 30 minutes out of 12.4 days. With two matching samples, the periodicity is unlikely to be coincidence.
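The arithmetic behind the table can be re-derived with a short sketch. The helper names (`split_ms`, `offset_from_boundary_secs`) are made up for illustration; the only inputs are the uptimes listed in the table.

```rust
/// 2^30 milliseconds, the uptime at which the freeze is reported.
const BOUNDARY_MS: u64 = 1 << 30;

/// Split a millisecond count into (days, hours, minutes, whole seconds).
/// Sub-second remainders are truncated, not rounded.
fn split_ms(ms: u64) -> (u64, u64, u64, u64) {
    let secs = ms / 1_000;
    (
        secs / 86_400,
        secs % 86_400 / 3_600,
        secs % 3_600 / 60,
        secs % 60,
    )
}

/// Signed offset, in whole seconds, of an observed uptime from the boundary.
fn offset_from_boundary_secs(observed_secs: i64) -> i64 {
    observed_secs - (BOUNDARY_MS / 1_000) as i64
}
```

`split_ms(BOUNDARY_MS)` yields 12 d 10 h 15 m 41 s after truncating the remaining 0.824 s. The two runs land 1,241 s before and 559 s after the boundary.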

2^30 ms is the level-4 / level-5 boundary in tokio's hashed timer wheel. Tokio is 1.52.3 in this build.
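Why 2^30 ms is a level boundary can be checked with a small model. This sketch assumes a wheel with 64 slots per level, a 1 ms tick at level 0, and levels counted from 0. Those parameters are assumptions chosen to match the boundary stated above, and `level_range_ms` is a hypothetical helper, not a tokio API.

```rust
/// Assumed wheel geometry: 64 slots per level, 1 ms resolution at level 0.
const SLOTS_PER_LEVEL: u64 = 64;

/// Time span, in milliseconds, of one full rotation of `level`.
/// Level 0 spans 64 ms; each higher level spans 64 times the one below.
fn level_range_ms(level: u32) -> u64 {
    SLOTS_PER_LEVEL.pow(level + 1)
}
```

Under these assumptions level 4 spans 64^5 ms = 2^30 ms, so an uptime of 2^30 ms is the first instant that no longer fits within one rotation of level 4.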

## Environment

- pgdog **v0.1.40**
- tokio **1.52.3**
- Linux x86_64
- `stats_period` default (15 s)

## Symptoms

For every pool on the affected host, every `avg_*` field in `SHOW POOLS` is stuck at constant decimal values. Example:

```
avg_xact_count         | 1956
avg_query_count        | 3313
avg_received           | 24626894
avg_sent               | 5648070
avg_xact_time          | 3.004
avg_idle_xact_time     | 2.348
avg_query_time         | 0.813
avg_bind_count         | 2455
avg_server_parse_count | 2455
... (every avg_* field frozen)
```

OpenMetrics gauges (`pgdog.avg_query_time`, `pgdog.avg_xact_time`, etc.) show min == max == avg over many hours, while counters like `pgdog.total_query_count.count` keep incrementing at the expected rate. Every database on the host is affected at the same instant.
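An external check for this signature (flat gauge, advancing counter) could look like the sketch below. `looks_frozen` is a hypothetical helper, and the sample slices stand in for whatever the metrics backend returns over the inspection window.

```rust
/// Returns true when every gauge sample is identical while the counter
/// increased between the first and last sample of the same window.
fn looks_frozen(gauge: &[f64], counter: &[u64]) -> bool {
    // At least two samples are needed to call a gauge "flat".
    let flat = gauge.len() >= 2 && gauge.windows(2).all(|w| w[0] == w[1]);
    // Traffic is still flowing if the counter moved forward.
    let advancing = match (counter.first(), counter.last()) {
        (Some(first), Some(last)) => last > first,
        _ => false,
    };
    flat && advancing
}
```

Requiring the counter to advance avoids flagging a pool that is simply idle, where constant averages are expected.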

## What's still working after the freeze

- Process is alive; same PID throughout.
- `Monitor::maintenance` fires every 333 ms: continuous `closing server connection ... reason: max age` and `new connection requested: reason=min` entries for every pool.
- Counters (`stats.counts`) on every pool continue growing.
- Clients keep connecting and being served; healthchecks succeed.

## What's broken

- `Monitor::stats` (`pgdog/src/backend/pool/monitor.rs:379`) is no longer calling `calc_averages` on any pool. There is no other writer to `stats.averages`, so the field stays frozen at the last computed value indefinitely. Only a process restart recovers.

## Log evidence

Filtering `journalctl -u pgdog --utc --since '<freeze − 30m>' --until '<freeze + 30m>'` for `panic|reload|shutdown|monitor|replication|config|orchestrat` returns **nothing**. The interval surrounding the freeze contains only normal pool maintenance INFO lines. No `replace_databases`, no SIGHUP, no replication reconnect, no panic.

## Probable root cause

The single code difference between the dead task and the surviving task is the tokio time primitive each uses:

- `Monitor::stats` calls `tokio::time::sleep(duration)` inside a `loop { select! { ... } }`, which constructs a fresh `Sleep` future every iteration and registers it with the timer driver each time.
- `Monitor::maintenance` uses `tokio::time::interval(MAINTENANCE)` and `.tick().await`, which holds a single `Sleep` and resets it in-place.

At exactly the 2^30 ms boundary the tokio hashed timer wheel transitions from level 4 to level 5. Short `Sleep` registrations that happen at or shortly after that transition appear not to be rescheduled, while pre-existing intervals continue to tick. This explains the all-pools-simultaneously freeze (it's process-wide, not per-pool), the lack of any log signal (no panic, no error path), the survival of maintenance and request handling, and the 2^30 ms periodicity.
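To see which registrations are special at the boundary, here is a simplified model of how a hashed hierarchical timer wheel picks a level for a deadline. It assumes 6 levels of 64 slots with 1 ms ticks and is meant for reasoning only; it is not a copy of tokio's source, and `level_for` here is illustrative.

```rust
/// Assumed geometry: 6 levels, 6 bits (64 slots) per level, 1 ms ticks.
const NUM_LEVELS: usize = 6;
const BITS_PER_LEVEL: usize = 6;
const SLOT_MASK: u64 = (1 << BITS_PER_LEVEL) - 1;
const MAX_DURATION: u64 = 1 << (BITS_PER_LEVEL * NUM_LEVELS);

/// Pick the wheel level for a timer registered at `elapsed_ms` that should
/// fire at `deadline_ms`. The level is set by the highest bit in which the
/// two timestamps differ, not by the length of the sleep.
fn level_for(elapsed_ms: u64, deadline_ms: u64) -> usize {
    // OR-ing in the slot mask keeps the value non-zero and floors it at level 0.
    let mut masked = (elapsed_ms ^ deadline_ms) | SLOT_MASK;
    if masked >= MAX_DURATION {
        masked = MAX_DURATION - 1; // clamp far-future deadlines to the top level
    }
    let significant = 63 - masked.leading_zeros() as usize;
    significant / BITS_PER_LEVEL
}
```

In this model a 15 s sleep normally lands on level 2. A 15 s sleep registered 5 s before the 2^30 ms mark has a deadline that differs from the current time in bit 30, so it is placed on level 5 and depends on the wheel cascading it back down correctly. A sleep registered just after the mark is on level 2 again. This does not prove where the defect is, but it shows that the registrations straddling the boundary take a code path that ordinary short sleeps never exercise.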

## Suggested mitigations on the pgdog side

Independent of any upstream fix, pgdog can harden against this class of failure:

1. **Switch `Monitor::stats` to `tokio::time::interval(stats_period)`** for symmetry with `Monitor::maintenance`. The evidence so far suggests this avoids the freeze: the `Monitor::maintenance` interval kept ticking across both occurrences.

2. **Supervise long-lived monitor tasks.** `Monitor::stats` is currently spawned with `tokio::spawn` and not tracked. If it exits for any reason (panic, runtime issue, future cancellation), there is no supervisor to restart it and no log line emitted. Wrap the spawn in a supervisor that re-spawns on exit and emits `warn!`. The same applies to `Monitor::maintenance`.

3. **Add a heartbeat metric.** A `pgdog.stats_calc_count` counter incremented inside `calc_averages` would let operators detect this externally before the gauge values diverge from reality.
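A minimal sketch of the heartbeat idea from item 3, using only the standard library. `StatsHeartbeat`, its fields, and the "three missed periods" threshold are hypothetical choices for illustration; how the counter would be wired into `calc_averages` and exported as `pgdog.stats_calc_count` is left open.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

/// Hypothetical heartbeat shared between the stats task and a checker.
struct StatsHeartbeat {
    calc_count: AtomicU64,   // number of completed averages computations
    last_calc_ms: AtomicU64, // process uptime (ms) at the last computation
}

impl StatsHeartbeat {
    const fn new() -> Self {
        Self {
            calc_count: AtomicU64::new(0),
            last_calc_ms: AtomicU64::new(0),
        }
    }

    /// Call once per averages computation, passing the current uptime in ms.
    fn record(&self, now_ms: u64) {
        self.calc_count.fetch_add(1, Ordering::Relaxed);
        self.last_calc_ms.store(now_ms, Ordering::Relaxed);
    }

    /// Value that would be exported as the counter metric.
    fn count(&self) -> u64 {
        self.calc_count.load(Ordering::Relaxed)
    }

    /// True when more than three stats periods have passed without a
    /// computation (the factor 3 is an arbitrary, assumed tolerance).
    fn is_stale(&self, now_ms: u64, stats_period: Duration) -> bool {
        let age_ms = now_ms.saturating_sub(self.last_calc_ms.load(Ordering::Relaxed));
        age_ms > 3 * stats_period.as_millis() as u64
    }
}
```

With the default 15 s `stats_period`, a checker driven by a task known to survive the freeze (for example the maintenance loop) would report staleness about 45 s after the last computation, instead of leaving the frozen gauges as the only hint.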
