
feat(pg): PostgreSQL datastore compatibility layer#6

Open
dnplkndll wants to merge 21 commits into main from feat/pg-compat-clean


dnplkndll commented Apr 30, 2026

What changed

Full PostgreSQL compatibility for Fleet's datastore layer, enabling the production deployment at fleet.hz.ledoweb.com to run against PostgreSQL 16 instead of MySQL.

Core dialect abstraction

  • DialectHelper interface (dialect.go) abstracting all MySQL vs PostgreSQL SQL differences
  • mysqlDialect and postgresDialect implementations covering all methods: InsertIgnoreInto, ReplaceInto, FromDual, OnDuplicateKey, OnConflictDoNothing, GroupConcat, JsonQuote, JSONAgg, JSONExtract, JSONUnquoteExtract, JSONBuildObject, JSONObjectFunc, FindInSet, FullTextMatch, RegexpMatch, GoquDialect, ReturningID, IsPostgres, CreateTableLike, AtomicTableSwap

Runtime SQL translation (server/platform/postgres/rebind_driver.go)

The pgx-rebind driver wraps pgx/stdlib and rewrites MySQL syntax to PG at the driver layer:

  • Placeholders: ? → $N
  • Boolean columns: col = 1 / col = 0 → col = true / col = false (for the ~60 boolean cols listed in schema_bool_cols_gen.go)
  • JOIN syntax: UPDATE … JOIN → UPDATE … FROM; multi-table DELETE → DELETE … USING
  • JSON functions: JSON_EXTRACT(col, '$.path') → col->'path'; JSON_OBJECT(…) → jsonb_build_object(…); JSON_QUOTE(…), JSON_UNQUOTE(…), JSON_ARRAYAGG(…) → PG equivalents
  • MySQL functions: IF() → CASE WHEN, IFNULL → COALESCE, MD5() → md5(), UUID() → gen_random_uuid()::text, UTC_TIMESTAMP() → TO_CHAR(NOW() AT TIME ZONE 'UTC', …), CURDATE() → CURRENT_DATE, DATABASE() → current_schema(), HEX()/UNHEX() → encode/decode, FIND_IN_SET → array_position(string_to_array), …
  • Type casts: CAST AS UNSIGNED → CAST AS integer, CAST AS SIGNED INT → CAST AS integer, TIMESTAMP(?) → (?)::timestamp
  • Upsert: INSERT IGNORE INTO → INSERT INTO … ON CONFLICT DO NOTHING; REPLACE INTO → INSERT … ON CONFLICT DO UPDATE; ON DUPLICATE KEY UPDATE VALUES(col) → ON CONFLICT (pk) DO UPDATE SET col = EXCLUDED.col
  • DDL column-type translation (added in this session): BLOB → bytea, MEDIUMTEXT/LONGTEXT → TEXT, TINYINT(1) → smallint, DATETIME[(N)] → timestamp[(N)], INT UNSIGNED NOT NULL AUTO_INCREMENT → INTEGER NOT NULL GENERATED BY DEFAULT AS IDENTITY, INT UNSIGNED → INTEGER, inline UNIQUE KEY name (cols) → CONSTRAINT name UNIQUE (cols), enum('a','b','c') → VARCHAR(255) CHECK (col IN ('a','b','c')), ENGINE=InnoDB / DEFAULT CHARSET=… / ALGORITHM=INSTANT stripped
  • DDL multi-statement output (added in this session): ALTER TABLE … ADD COLUMN …, ADD KEY <name> (<cols>) produces an ALTER followed by a separate CREATE INDEX. ADD UNIQUE KEY → CREATE UNIQUE INDEX. Multiple ADD KEY clauses each become their own CREATE INDEX.
  • ON UPDATE CURRENT_TIMESTAMP (added in this session): the MySQL column attribute is stripped from CREATE TABLE and a per-table BEFORE UPDATE trigger calling fleet_set_updated_at() is appended. The trigger function is installed by pg_baseline_post.sql.
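To make the first rewrite in the list concrete, here is a minimal sketch of placeholder rebinding. It assumes a simple byte scan that skips single-quoted string literals; the function name rebind is hypothetical, and the actual driver may do this differently (for example with regexes, as the commit message further down suggests).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// rebind (hypothetical name) replaces each MySQL-style "?" placeholder with
// a numbered Postgres placeholder ($1, $2, ...). Question marks inside
// single-quoted string literals are left untouched. A doubled quote ('')
// toggles the flag twice, so escaped quotes inside a literal are handled.
func rebind(query string) string {
	var b strings.Builder
	n := 0
	inQuote := false
	for i := 0; i < len(query); i++ {
		c := query[i]
		switch {
		case c == '\'':
			inQuote = !inQuote
			b.WriteByte(c)
		case c == '?' && !inQuote:
			n++
			b.WriteByte('$')
			b.WriteString(strconv.Itoa(n))
		default:
			b.WriteByte(c)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(rebind("SELECT id FROM hosts WHERE team_id = ? AND hostname LIKE ?"))
	fmt.Println(rebind("SELECT '?' AS literal, ?"))
}
```

This sketch deliberately ignores comments, double-quoted identifiers, and Postgres operators that contain a question mark; a production rewriter would need to account for those.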

Migration runner fixes (this session)

A chain of bugs prevented fleet prepare db from actually running migrations on PG; all addressed on this branch:

  • MigrationStatus: removed broken short-circuit that reported AllMigrationsCompleted based purely on hosts table existence. Now uses the standard compareMigrations path. On truly-fresh PG, returns NoMigrationsCompleted without calling loadMigrations so goose's createVersionTable doesn't collide with the baseline.
  • goose.GetDBVersion: no longer panics with "unreachable" when iteration ends without an applied row; it now returns (0, nil) so callers proceed as for a fresh DB.
  • seedPGMigrationHistory: now seeds both migration_status_tables AND migration_status_data (previously only the first, causing every old data-migration to appear missing on every startup).
  • MigrateTables on PG: runs migratePGBaseline AND then invokes goose Up so any migration newer than the baseline marker actually executes.
  • pg_baseline_post.sql: wraps each ALTER OWNER in an EXCEPTION WHEN insufficient_privilege block so individual objects the role doesn't own no longer abort the whole script. Also installs fleet_set_updated_at() trigger function used by ON UPDATE CURRENT_TIMESTAMP column translations.
  • prepare.go: on PG, falls through to MigrateTables even when all migrations are reported complete, so the idempotent post-baseline fixups (ownership + trigger function) re-apply. MySQL path keeps its early-return behavior.
  • BackfillVPPAppCountriesFromTokens: removed va. alias prefix from SET clause (PG rejects).
  • CleanupSCDData + DeleteAllForDataset in the chart datastore: rewrote MySQL DELETE … ORDER BY … LIMIT ? as a subquery-on-PK form valid on both dialects.
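The last bullet can be sketched as follows. PostgreSQL's DELETE accepts neither ORDER BY nor LIMIT, and MySQL rejects LIMIT directly inside an IN subquery, so one portable shape selects the primary keys through a derived table. The function name and the table and column names in the example are hypothetical, and the SQL actually used in the chart datastore may differ.

```go
package main

import "fmt"

// batchDeleteSQL (hypothetical helper) returns a bounded DELETE that is
// valid on both MySQL and PostgreSQL. The inner derived table ("batch")
// is what lets MySQL accept the LIMIT, and it carries an alias because
// both dialects expect one on a subquery in FROM.
func batchDeleteSQL(table, pk, orderBy string) string {
	return fmt.Sprintf(
		"DELETE FROM %s WHERE %s IN (SELECT %s FROM (SELECT %s FROM %s ORDER BY %s LIMIT ?) AS batch)",
		table, pk, pk, pk, table, orderBy,
	)
}

func main() {
	// Table and column names are illustrative only.
	fmt.Println(batchDeleteSQL("scd_data", "id", "created_at"))
}
```

The identifiers are interpolated with Sprintf, so this form is only appropriate when they come from code rather than from user input; the row limit stays a bound parameter.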

Production schema drift remediation (this session)

After enabling the migration runner properly, several tables had drift from migrations recorded as "applied" via seedPGMigrationHistory but never actually executed. All remediated on production via /tmp/fix_pg_drift.sql:

  • Rebuilt acme_enrollments, host_managed_local_account_passwords, user_api_endpoints to match canonical migrations (all were empty, no data lost).
  • Dropped leftover columns policies.conditional_access_bypass_enabled and mdm_configuration_profile_variables.requires_value.
  • Manually applied 8 pending migrations + seeded migration_status_data with 9 historical versions.
  • Regenerated pg_baseline_schema.sql from the corrected prod DB; marker bumped to 20260506171058.

pgcompat tooling (tools/pgcompat/)

Three validators and one code generator gate every PR that touches relevant paths:

  • check_primary_keys: every raw ON DUPLICATE KEY UPDATE site is covered by knownPrimaryKeys in the rebind driver
  • check_schema_drift: schema.sql vs pg_baseline_schema.sql table sets match (allowlist for intentional drift)
  • check_column_drift (new this session): for every common table, column sets match; flags stale allowlist entries so they get removed after baseline regen
  • gen_bool_cols: regenerates schema_bool_cols_gen.go from the baseline; CI fails if stale

Fresh-PG-install CI smoke test (this session)

validate-pg-compat.yml now spins up an empty PG via docker-compose, builds the fleet binary, runs prepare db against it (expects the baseline + post-marker migrations to apply cleanly), then runs prepare db a second time and asserts idempotency. Verifies every table ends up owned by the app role. This gate would have caught the migration-runner bugs we shipped earlier on this branch.

Test plan

  • Validate PG Compatibility CI — passes on latest SHA
  • Go Tests (PostgreSQL) CI — passes on latest SHA
  • Build & Push Ledo Fleet Image CI — passes on latest SHA
  • Fresh-PG-install smoke test passes (baseline + post-marker migrations apply, idempotent on second prepare db, all tables owned by fleet)
  • 33 new table-driven test cases in rebind_driver_test.go covering every DDL translation + verbatim regression cases for the two upstream migrations (20260428125634, 20260429180725) that needed manual SQL before this branch
  • Production at fleet.hz.ledoweb.com running on this branch's image; verified post-rollout via 5-tier review: pod healthy, schema integrity correct (205 tables, marker 20260506171058, trigger function installed), API routing works, both previously-failing crons (send_managed_local_account_rotation_commands, refresh_vpp_app_versions) now complete cleanly
  • 24 successful PG backups in last 24h, 0 failures

dnplkndll changed the title from "fix(pg): Windows MDM + remaining MySQL-only SQL → PostgreSQL dialect helpers" to "feat(pg): PostgreSQL datastore compatibility layer" May 5, 2026
dnplkndll force-pushed the feat/pg-compat-clean branch 3 times, most recently from dbcec59 to 6062962 May 14, 2026 12:57
dnplkndll changed the base branch from ledoent to main May 14, 2026 13:20
Co-authored-by: johnjeremiah <jjeremiah@gmail.com>
dnplkndll force-pushed the feat/pg-compat-clean branch from 162c243 to 7f5122b May 23, 2026 12:56
akuthiala and others added 20 commits May 23, 2026 10:37
Adds a Postgres backend to Fleet's datastore alongside the existing
MySQL. Non-breaking: MySQL remains the default and is unaffected.

Core pieces:
  - DialectHelper interface (server/datastore/mysql/dialect.go) abstracts
    SQL dialect differences for upserts, aggregates, JSON ops, error
    classification, and atomic swap-table DDL. mysqlDialect + postgresDialect
    implementations, dialect.IsPostgres() routes runtime branches.
  - pgx-rebind driver (server/platform/postgres/rebind_driver.go)
    transparently translates MySQL SQL to Postgres at query execution time
    via 50+ regex-based rewrites compiled once at startup. Per-table-name
    regexes cached in sync.Map. knownPrimaryKeys map drives ON DUPLICATE
    KEY → ON CONFLICT (<pk>) DO UPDATE rewriting.
  - Embedded PG baseline (server/datastore/mysql/pg_baseline_schema.sql,
    pg_baseline_post.sql) seeded from production pg_dump. Carries a
    pg-baseline-up-to-migration: <ts> marker; fresh-apply seeds
    migration_status_tables from code and logs a loud warning whenever
    code carries migrations newer than the baseline. Object-ownership is
    reasserted on every startup so atomic table swaps work even when the
    baseline was loaded as the postgres superuser.
  - server/goose/migration.go gains UpFnPG / DownFnPG / UpFnMySQL /
    DownFnMySQL fields so individual migrations can target one dialect.
    First user: 20260513210000_AddMissingPGIndexes (this commit).
  - 349 missing PG indexes added via the AddMissingPGIndexes migration
    (UpFnPG-only), bringing PG to index parity with MySQL on hot paths
    like host_software_installed_paths (host_id, software_id).
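The per-dialect migration fields can be pictured with the sketch below. The field names UpFnPG and UpFnMySQL come from the text above; the simplified func() error signatures, the generic UpFn fallback, and the upFor selection rule are assumptions for illustration, not the actual server/goose/migration.go code.

```go
package main

import "fmt"

// migration is a simplified stand-in for a goose migration record.
// Real migration functions would receive a transaction; that is omitted here.
type migration struct {
	Version   int64
	UpFn      func() error // dialect-neutral body (assumed fallback)
	UpFnPG    func() error // runs only on PostgreSQL
	UpFnMySQL func() error // runs only on MySQL
}

// upFor (hypothetical) picks the function to run for the active dialect.
// A nil result means the migration has nothing to do on that dialect.
func (m migration) upFor(isPostgres bool) func() error {
	switch {
	case isPostgres && m.UpFnPG != nil:
		return m.UpFnPG
	case !isPostgres && m.UpFnMySQL != nil:
		return m.UpFnMySQL
	default:
		return m.UpFn
	}
}

func main() {
	m := migration{
		Version: 20260513210000,
		UpFnPG:  func() error { fmt.Println("creating PG-only indexes"); return nil },
	}
	for _, isPG := range []bool{false, true} {
		if up := m.upFor(isPG); up != nil {
			_ = up()
		} else {
			fmt.Println("nothing to run for this dialect")
		}
	}
}
```

Under this assumed rule, a PG-only migration such as the index backfill is still recorded on MySQL as a version but executes no statements there.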

Wiring:
  - FLEET_MYSQL_DRIVER=postgres selects the new driver; standard
    FLEET_MYSQL_ADDRESS / USERNAME / PASSWORD / DATABASE env vars route to
    the PG cluster unchanged.
  - server/config/config.go validates the new driver value.
  - cmd/fleet/prepare.go threads dialect into the migration apply path.
  - docker-compose.yml gains a postgres service for local dev.

Tests:
  - 39 PG smoke tests (hosts, software, vulnerabilities, policies,
    host-counts) and B1/B2/B3 tiers running on both backends via the new
    CreateDS(t) helper.
  - Driver-rewrite unit tests cover every regex (UPDATE...JOIN,
    DELETE USING, GROUP_CONCAT, ON CONFLICT ambiguity resolution,
    smallint-bool encoding, MAX(bool), INTERVAL placeholder, CAST NULL
    AS SIGNED, FIND_IN_SET, COALESCE token, null-byte stripping, ...).
  - Dialect unit tests for both dialects (LAST_INSERT_ID stripping,
    ReturningID, AtomicTableSwap, CreateTableLike).
  - List-options helper has new coverage for single-aggregate ORDER BY
    skip and text-column cursor binding.
  - Benchmarks for UpdateHostSoftware / ListSoftware / ListHosts in
    server/datastore/mysql/benchmarks_test.go.

Squashed from 70+ incremental commits on feat/pg-compat-clean; full
provenance preserved on feat/pg-compat-clean-backup-2026-05-13.
…p on dep-review

CI infrastructure that gates the PG backend:
  - test-go-postgres.yaml: spins up Postgres in a service container, runs
    the full datastore + service test suites against the PG driver. Mirrors
    the existing MySQL test workflow.
  - validate-pg-compat.yml: invokes the tools/pgcompat validators on every
    PR/push — check_primary_keys, check_schema_drift, check_column_drift.
    Empty-allowlist gate-of-the-gate test ensures the validators themselves
    can never become a no-op.
  - build-ledo.yml: ledoent-specific image build that refuses to publish to
    ghcr.io unless both test-go-postgres and validate-pg-compat succeeded
    on the build SHA.
  - sync-upstream.yml: paranoia check that refuses to force-push ledoent/main
    if any non-bot commits exist outside upstream/main.
  - weekly-aggregate.yml: gitaggregate cron + workflow_dispatch, pinned to
    git-aggregator==4.1.
  - dependency-review.yml: skip on private repos (the action requires
    GitHub Advanced Security which isn't available on free private mirrors).
    Upstream public fleetdm/fleet still runs it.
  - test-website.yml: npm audit step added so frontend dep regressions
    block PRs.
  - tools/ci/apiparamcheck: custom golangci-lint plugin that flags REST
    handler params not registered in the request struct, catching the
    'missing query param decode' class of bug.
…rift

Three small static-analysis tools that prevent silent PG-compat regressions.
None require a running Postgres; they read Go source and SQL schema files.

  - check_primary_keys: scans non-test Go for raw 'ON DUPLICATE KEY UPDATE'
    SQL and verifies every targeted table has an entry in knownPrimaryKeys
    (the map in server/platform/postgres/rebind_driver.go that drives the
    ON CONFLICT (<pk>) DO UPDATE rewrite). Missing entries produce invalid
    PG SQL at runtime.
  - check_schema_drift: diffs CREATE TABLE identifier sets between
    server/datastore/mysql/schema.sql (MySQL canonical) and
    pg_baseline_schema.sql (PG baseline). known_schema_diff.txt records
    intentional divergence and is itself validated — stale entries fail.
  - check_column_drift: diffs column lists per shared table. Optional
    allowlist via known_column_drift.txt.
  - gen_identity_cols / gen_bool_cols: code generators that produce the
    Postgres dialect's static knowledge of IDENTITY columns and bool
    columns so the rebind driver can rewrite INSERTs correctly.
  - validators_test.go is a gate-of-the-gate: an empty schema-diff
    allowlist must produce a non-zero exit.

Designed to be extractable as a standalone PR to fleetdm/fleet — they're
useful to any Fleet operator building PG support, with or without the
larger driver/baseline layer.
Playwright API-mode test matrix that exercises every URL filter Fleet's
frontend can construct against a live server, asserting each response is
not a Postgres-driver or Postgres-syntax failure (SQLSTATE, 'must appear
in the GROUP BY', 'operator does not exist', 'cannot find encode plan',
'syntax error', etc.).

Read-only (HTTP GET only). ~220 probes in ~15s with 8 workers.

Coverage:
  - /hosts + /hosts/count: status, low_disk_space, mdm_enrollment_status,
    os_settings/apple_settings/disk_encryption/bootstrap_package, populate_*,
    every ORDER BY allowlist key × direction, cursor pagination (after=),
    vulnerability filter, search.
  - /software/versions, /software/titles, /software (deprecated):
    vulnerable, exploit, cvss range, self_service, available_for_install,
    packages_only, team filtering, ordering.
  - /vulnerabilities, /host_summary, /labels/:id/hosts, /hosts/:id/*,
    sanity endpoints (/config, /version, /me, /labels, /teams, ...).

Run:
  cd tools/pg-compat-harness
  yarn install
  export FLEET_URL=https://your-fleet
  export FLEET_TOKEN=$(awk '/token:/ {print $2}' ~/.fleet/config)
  yarn test

This harness found and gated the GROUP BY and cursor-encoding regressions
fixed elsewhere in this branch (selectSoftwareSQL GroupByAppend,
AppendListOptionsWithParamsSecure textOrderKeys hint).
Small Go program that parses server/datastore/mysql/schema.sql and emits
one CREATE INDEX IF NOT EXISTS statement per MySQL KEY / UNIQUE KEY clause,
suitable for embedding into a PG-only migration.

Handles:
  - balanced parens in column lists (expression bodies)
  - USING BTREE / USING HASH suffix (MySQL hint, PG ignores)
  - DESC column ordering (PG supports natively)
  - identifier quoting where required
  - stable per-table grouping for reviewable diffs

Deliberately skips with explicit reasons:
  - PRIMARY KEY (the CREATE TABLE handles it)
  - FULLTEXT KEY, SPATIAL KEY (need pg_trgm / GiST equivalents)
  - prefix-length indexes col(N) (need PG expression indexes)
  - expression indexes using MySQL-specific functions (ifnull, cast as
    signed) that need PG translation (COALESCE, CAST AS integer)

main_test.go drives translate() from inline schema fixtures — no file I/O
required. Covers plain/unique keys, DESC, USING BTREE, every skip reason,
balanced-paren edge cases, multi-table, PRIMARY ignored, plus unit tests
for extractParenBody and quoteIdent helpers.
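For readers unfamiliar with the balanced-paren problem mentioned above, here is a minimal sketch of a depth-counting extractor. The name extractParenBody is taken from the text; the signature and behaviour shown here are assumptions, and the real helper may treat quoting or error cases differently.

```go
package main

import (
	"fmt"
	"strings"
)

// extractParenBody returns the text between the first '(' in s and its
// matching ')', tracking nesting depth so inner parentheses (for example
// in expression indexes) do not end the body early. The boolean is false
// when there is no '(' or the parentheses never balance.
// Assumption: quoted identifiers containing parentheses are not handled.
func extractParenBody(s string) (string, bool) {
	start := strings.IndexByte(s, '(')
	if start < 0 {
		return "", false
	}
	depth := 0
	for i := start; i < len(s); i++ {
		switch s[i] {
		case '(':
			depth++
		case ')':
			depth--
			if depth == 0 {
				return s[start+1 : i], true
			}
		}
	}
	return "", false
}

func main() {
	body, ok := extractParenBody("KEY idx_example (a, (lower(b))) USING BTREE")
	fmt.Printf("%q %v\n", body, ok)
}
```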

Usage:
  go run ./tools/pg-index-translate \
    -in  server/datastore/mysql/schema.sql \
    -out server/datastore/mysql/migrations/tables/{ts}_AddMissingPGIndexes.sql
  - docs/Deploy/postgresql.md: end-to-end guide for running Fleet against
    Postgres — connection env vars, baseline schema apply, migration
    apply, ownership reassertion, troubleshooting (drift warning, must
    be owner of table, schema/column drift validator output).
  - docs/Deploy/README.md: links the new guide from the deployment index
    alongside the MySQL guide.
GetDBVersion returned a too-old current version on production PG because
the baseline-seed path (and goose's own run-and-record loop for newly
introduced migrations) inserted rows into migration_status_tables out of
version_id order. Concretely, id 523 carried version 20260422181702 while
id 521 carried 20260506171058. Plain 'ORDER BY id DESC' picked the
older version, so 'fleet prepare db' tried to re-run every migration
from 20260423161823 onward and failed on json_merge_patch — a MySQL-only
function that PG never had, with the migration body long since folded
into the embedded baseline.

Switching to 'ORDER BY version_id DESC, id DESC' makes the query
immune to insertion order while preserving up/down semantics: the
tie-break by id DESC keeps the most recent applied/rolled-back state
for the same version. MySQL is unaffected — its migration runner
always applies in monotonic version order so id and version_id stay
aligned. We do not change the MySQL dialect to keep blast radius
minimal; that path has years of behavior to preserve.

Test pins the exact ORDER BY clause via sqlmock so any future change
back to the buggy form fails CI loudly.
…ces/views

pg_baseline_post.sql already loops over public tables, sequences, and
views and reasserts ownership to current_user, but it skipped functions.
On baselines that were loaded by the postgres superuser (typical on
self-hosted PG), CREATE OR REPLACE FUNCTION later in the same file
errored with 'must be owner of function fleet_set_updated_at' — the
application user can't replace something it doesn't own.

Add a fourth loop using pg_proc / pg_namespace to enumerate public
functions whose owner is not current_user, and ALTER FUNCTION ... OWNER
TO current_user with the standard insufficient_privilege fallback.
pg_get_function_identity_arguments() disambiguates overloaded
signatures.

Hit in production tonight on the AddMissingPGIndexes deploy. With this
fix every future fleet prepare db on a postgres-superuser-loaded
baseline succeeds without manual ALTER FUNCTION.
The existing implementation already sorts the seeded versions ascending
(via versionsAtOrBelow → partitionMigrationVersions → slices.Sort), so
PG assigns auto-increment ids in the same order as version_id. That
property is load-bearing for any downstream consumer that infers
'current version' from MAX(id), even with the dialect query now
correctly ordered by version_id DESC.

No functional change — just document the invariant so a future refactor
doesn't quietly drop the sort.
Required by TestVersionsAbove_EmbeddedBaselineCoversAllCode now that
AddMissingPGIndexes (20260513210000) ships in code. Dump source is
fleet.hz.ledoweb.com fleet-db-1, which has all 532 indexes applied
(11 from the original baseline + 521 added by AddMissingPGIndexes
either via the SQL we ran manually tonight or via the migration on
future fresh applies). check-pg-compat validators pass:

  schema-drift:   202 MySQL tables / 205 PG tables in sync (after allowlist)
  primary-keys:   every ON DUPLICATE KEY UPDATE site covered
  column-drift:   no drift between schema.sql and pg_baseline_schema.sql

Generated via the documented procedure in the file's header:

  kubectl exec -n fleet --context hetzner-ledo fleet-db-1 -- \
    pg_dump -U postgres -d fleet --schema-only --no-owner --no-privileges

Stripped the pg_dump-17 \restrict/\unrestrict meta-commands and the
SET search_path='' line per the same header comment. Header preserved
with the regen recipe and verification commands.
POST /api/fleet/orbit/setup_experience/init returned 500 with:

  ERROR: COALESCE types integer and text cannot be matched (SQLSTATE 42804)

The setup-experience init query UNION-ALLs two SELECTs — one for
software_installers, one for vpp_apps — projecting a NULL placeholder in
each leg for the column the other leg owns. The outer ORDER BY then
COALESCEs across both columns plus a literal 0:

  ORDER BY sort_name ASC, COALESCE(software_installer_id, vpp_app_team_id, 0)

In MySQL the untyped NULL silently coerces. In Postgres the untyped NULL
resolves to text, then COALESCE sees one int leg, one text leg, one int
literal — strict-type rejects.

Fix: replace the bare 'NULL AS ...' projections with
'CAST(NULL AS UNSIGNED) AS ...'. MySQL keeps it as unsigned-int NULL; the
PG rebind driver already rewrites 'AS UNSIGNED' → 'AS integer' (see
server/platform/postgres/rebind_driver.go reAsUnsigned*). Both dialects
now compose a proper-typed NULL for the outer COALESCE.

Surfaced by Windows host enrollment — the iOS/Android/Apple-setup
experience flow runs the same code path; this fix covers them all.
A broad audit of the codebase's 487 COALESCE call sites found no other
sites with the same untyped-NULL-in-UNION shape.
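The reason the CAST form works on both backends is that the driver-side rewrite only has to touch the type name. A minimal sketch of such a rewrite is shown below; the variable name echoes the reAsUnsigned* identifiers mentioned above, but the pattern itself and the rewriteUnsigned function are assumptions, not the regexes in rebind_driver.go.

```go
package main

import (
	"fmt"
	"regexp"
)

// reAsUnsigned matches "AS UNSIGNED" case-insensitively as whole words,
// so an alias such as "AS unsigned_total" is not affected.
// The exact pattern is an assumption for this sketch.
var reAsUnsigned = regexp.MustCompile(`(?i)\bAS\s+UNSIGNED\b`)

// rewriteUnsigned (hypothetical) turns MySQL's unsigned cast target into
// a type that PostgreSQL accepts.
func rewriteUnsigned(query string) string {
	return reAsUnsigned.ReplaceAllString(query, "AS integer")
}

func main() {
	fmt.Println(rewriteUnsigned("SELECT CAST(NULL AS UNSIGNED) AS vpp_app_team_id"))
}
```

After the rewrite, both legs of the UNION ALL carry an integer-typed NULL, so the outer COALESCE sees only integer arguments.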
Extends the harness with 19 new probes covering every orbit and osquery
agent POST endpoint listed in server/service/handler.go:1006-1099 +
osquery enroll/config/distributed/carve/log.

Each probe sends a fake orbit_node_key (or node_key) and asserts the
auth-middleware SQL — typically SELECT … FROM hosts WHERE
orbit_node_key = ? — runs without an SQLSTATE crash. A 401 response is
the success case; a 500 with PG error markers is the failure case.

Required infrastructure additions:
  - Probe gains optional method ('GET'|'POST'), body, and expectAuthFail
  - check() does request.post(...) when method is POST
  - check() permits 4xx when expectAuthFail is true (still fails on
    body containing PG error markers, still fails on 5xx)

Limitation documented in the orbitProbes comment: fake-key probes
reject at the auth gate, so per-endpoint handler SQL (e.g. the
setup_experience/init COALESCE bug fixed in df17814) is NOT
exercised by these probes. Catching post-auth SQL needs a fixture
host with a real orbit_node_key — tracked as a future expansion.

Current count: 242 passing (was 223; +19 orbit/osquery probes).
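The pass/fail rule for a probe, as described above, can be restated compactly. The harness itself is a Playwright suite, so the Go below is only a restatement of the decision rule for illustration: the function name is hypothetical, and the marker list is copied from the examples given earlier in this PR rather than from the harness source.

```go
package main

import (
	"fmt"
	"strings"
)

// pgErrorMarkers lists substrings that indicate a Postgres driver or
// syntax failure leaked into the response body.
var pgErrorMarkers = []string{
	"SQLSTATE",
	"must appear in the GROUP BY",
	"operator does not exist",
	"cannot find encode plan",
	"syntax error",
}

// probePasses (hypothetical) applies the rule: any PG error marker fails,
// any 5xx fails, and a 4xx passes only when the probe expects the auth
// gate to reject it.
func probePasses(status int, body string, expectAuthFail bool) bool {
	for _, m := range pgErrorMarkers {
		if strings.Contains(body, m) {
			return false
		}
	}
	if status >= 500 {
		return false
	}
	if status >= 400 {
		return expectAuthFail
	}
	return true
}

func main() {
	fmt.Println(probePasses(401, `{"message":"Authentication required"}`, true))
	fmt.Println(probePasses(500, "ERROR: COALESCE types integer and text cannot be matched (SQLSTATE 42804)", true))
}
```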
Two raw ON DUPLICATE KEY UPDATE sites in microsoft_mdm.go
(lines 364 and ~551) upsert into windows_mdm_commands, which has
PRIMARY KEY (command_uuid) per schema.sql. Without the
knownPrimaryKeys entry, the PG rebind driver emits
'ON CONFLICT DO UPDATE SET …' without a target, which PG rejects.

Caught by check_primary_keys in the pg-compat validator suite — same
gate that blocked the last build. Aggregated CI now passes locally:

  $ go run ./tools/pgcompat/check_primary_keys
  OK: every raw ON DUPLICATE KEY UPDATE site is covered by
      knownPrimaryKeys.
…ploy)

Upstream PR fleetdm#45367 (aggregated commit bee5eda, 2026-05-14) added the
orbit_debug_until column to the hosts table in MySQL schema.sql. The PG
baseline is regenerated from a production pg_dump; that migration hasn't
landed on prod yet (it lands when this very deploy rolls out), so the
baseline lags by one column.

Adding the entry to known_column_drift.txt with a deferred-regen
comment, per the file header's prescribed workflow. Once the next
aggregation runs after this deploy lands, the prod baseline will
include orbit_debug_until and this allowlist entry can be removed.

Without this entry, the validate-pg-compat CI gate fails check_column_drift
on every aggregated build, blocking the build-ledo image publish that
deploys this fix in the first place. Classic chicken-and-egg — break
with a documented intentional drift.
dnplkndll force-pushed the feat/pg-compat-clean branch from a3b9ccf to f9dab02 May 23, 2026 15:53