Draft
3 changes: 3 additions & 0 deletions .gitignore
@@ -9,3 +9,6 @@ integration/testdata/local/

# Local test and metric artifacts
.coverage/

# Mac DS Store
.DS_Store
50 changes: 50 additions & 0 deletions README.md
@@ -86,3 +86,53 @@ make metrics_check
```

The defaults can be adjusted with `CYCLO_TOP`, `CYCLO_OVER`, `CRAP_TOP`, and `CRAP_OVER`.

## PostgreSQL Extensions

The PostgreSQL driver's schema bootstrap (`drivers/pg/query/sql/schema_up.sql`) installs the following extensions:

| Extension | Required | Purpose |
|---------------|----------|----------------------------------------------------------------------------------------------------------|
| `pg_trgm` | yes | GIN trigram indexes for `contains` / `starts with` / `ends with` lookups on graph entity properties. |
| `intarray` | yes | Extended integer array operations used when maintaining node kind arrays. |
| `pgstattuple` | no | Measures btree leaf density / fragmentation and GIN pending-list size for driver-managed optimization. |

`pgstattuple` is treated as best-effort. Installing it typically requires a superuser role, and some managed
Postgres deployments do not expose the contrib package at all. Bootstrap wraps the install in an exception
handler that downgrades any failure to a `WARNING`. At runtime, the driver's `Optimize` implementation re-checks
for the extension and logs a warning when it is missing rather than failing the caller. To enable index
optimization in such environments, have an administrator install the extension out-of-band:

```sql
create extension if not exists pgstattuple;
```

### Index and Table Optimization

When `pgstattuple` is available, the driver's `Optimize` method (exposed through the optional `graph.Optimizer`
interface) runs four phases on each invocation, scoped to partitions of the `node` and `edge` tables:

1. **Orphan cleanup.** Drops any `INVALID` index whose name matches the `_ccnew[N]` pattern, which Postgres
leaves behind when a prior `REINDEX CONCURRENTLY` was aborted. Cleanup failures are logged at `WARN` and are
never fatal; an orphan wastes disk but does not block subsequent rebuilds.
2. **GIN pending-list flush.** Measures every GIN index with `pgstatginindex` and flags those whose pending
list reaches `2048` pages (`16 MiB` at the default 8 KiB block size, four times the default
`gin_pending_list_limit` of 4 MB). Flagged indexes are flushed with `gin_clean_pending_list`. The threshold is
a starting default and may be tightened or relaxed once fleet samples have been collected.
3. **Vacuum / analyze.** Reads `pg_stat_user_tables` for each partition. A partition is flagged for
`VACUUM (ANALYZE)` when both its dead-tuple ratio reaches `20%` and its dead-tuple count reaches `10000`, or
when `n_mod_since_analyze` reaches `50000` and the most recent (auto)analyze is older than `24h`. The
dead-tuple ratio mirrors Postgres's default `autovacuum_vacuum_scale_factor`; the analyze staleness window is
aligned with the typical pipeline-level optimization cooldown. `VACUUM FULL` is never emitted.
4. **Btree reindex.** Measures every candidate btree index with `pgstatindex` and flags those whose average
leaf density falls below `60%` or whose leaf-page fragmentation reaches `40%`. Thresholds are calibrated
against production samples and a freshly rebuilt baseline (~73.8% density). Rebuilds run with
`REINDEX INDEX CONCURRENTLY`, smallest first so that an early cancellation still produces the maximum
number of completed rebuilds.
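
The flagging rules in phases 2 through 4 can be sketched as plain predicates. This is a minimal illustration, not
the driver's code: every type, constant, and function name below is hypothetical, and the dead-tuple ratio is
assumed to be `dead / (live + dead)`, which the text above does not specify.

```go
package main

import (
	"sort"
	"time"
)

// Thresholds quoted in the phase list. The constant names are illustrative only.
const (
	ginPendingPagesThreshold = 2048 // 2048 pages * 8 KiB = 16 MiB
	deadTupleRatioThreshold  = 0.20
	deadTupleCountThreshold  = 10000
	modSinceAnalyzeThreshold = 50000
	analyzeStalenessWindow   = 24 * time.Hour
	minLeafDensityPercent    = 60.0
	maxLeafFragmentationPct  = 40.0
)

// tableStats is a hypothetical projection of one pg_stat_user_tables row.
type tableStats struct {
	LiveTuples       int64
	DeadTuples       int64
	ModSinceAnalyze  int64
	SinceLastAnalyze time.Duration // age of the most recent (auto)analyze
}

// btreeStats is a hypothetical projection of pgstatindex output for one index.
type btreeStats struct {
	Name              string
	SizeBytes         int64
	AvgLeafDensity    float64 // percent
	LeafFragmentation float64 // percent
}

// ginNeedsFlush reports whether a pending list has reached the flush threshold.
func ginNeedsFlush(pendingPages int64) bool {
	return pendingPages >= ginPendingPagesThreshold
}

// needsVacuum applies the two alternative conditions from phase 3.
// Assumption: ratio = dead / (live + dead).
func needsVacuum(s tableStats) bool {
	total := s.LiveTuples + s.DeadTuples
	deadHeavy := total > 0 &&
		float64(s.DeadTuples)/float64(total) >= deadTupleRatioThreshold &&
		s.DeadTuples >= deadTupleCountThreshold
	staleStats := s.ModSinceAnalyze >= modSinceAnalyzeThreshold &&
		s.SinceLastAnalyze > analyzeStalenessWindow
	return deadHeavy || staleStats
}

// reindexCandidates keeps indexes that breach either btree threshold and
// orders them smallest first, so an early cancellation still completes as
// many rebuilds as possible.
func reindexCandidates(stats []btreeStats) []btreeStats {
	var out []btreeStats
	for _, s := range stats {
		if s.AvgLeafDensity < minLeafDensityPercent || s.LeafFragmentation >= maxLeafFragmentationPct {
			out = append(out, s)
		}
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].SizeBytes < out[j].SizeBytes })
	return out
}
```

Under these assumptions, an index at the freshly rebuilt baseline density (~73.8%) with low fragmentation is not
selected, while an index at 50% density is, regardless of its fragmentation.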

Phases run in the order above so that vacuum has a chance to reclaim index space before the btree assessment
quantifies bloat. Per-candidate failures within any phase are logged at `WARN` and the loop continues with the
next candidate. Context cancellation aborts before the next candidate; in-flight `REINDEX CONCURRENTLY` and
`VACUUM` statements run to a safe stopping point in Postgres.

The pass is not bounded by wall-clock time or candidate count. Callers should serialize `Optimize` against
their own scheduling loop so that passes do not overlap, and cancel the context if a pass must be cut short.