feat(pg): PostgreSQL datastore compatibility layer#6
Open
dnplkndll wants to merge 21 commits into
Co-authored-by: johnjeremiah <jjeremiah@gmail.com>
Adds a Postgres backend to Fleet's datastore alongside the existing
MySQL backend. Non-breaking: MySQL remains the default and is unaffected.
Core pieces:
- DialectHelper interface (server/datastore/mysql/dialect.go) abstracts
SQL dialect differences for upserts, aggregates, JSON ops, error
classification, and atomic swap-table DDL. It has mysqlDialect and
postgresDialect implementations; dialect.IsPostgres() routes runtime branches.
- pgx-rebind driver (server/platform/postgres/rebind_driver.go)
transparently translates MySQL SQL to Postgres at query execution time
via 50+ regex-based rewrites compiled once at startup. Per-table-name
regexes cached in sync.Map. knownPrimaryKeys map drives ON DUPLICATE
KEY → ON CONFLICT (<pk>) DO UPDATE rewriting.
- Embedded PG baseline (server/datastore/mysql/pg_baseline_schema.sql,
pg_baseline_post.sql) seeded from production pg_dump. Carries a
pg-baseline-up-to-migration: <ts> marker; fresh-apply seeds
migration_status_tables from code and logs a loud warning whenever
code carries migrations newer than the baseline. Object-ownership is
reasserted on every startup so atomic table swaps work even when the
baseline was loaded as the postgres superuser.
- server/goose/migration.go gains UpFnPG / DownFnPG / UpFnMySQL /
DownFnMySQL fields so individual migrations can target one dialect.
First user: 20260513210000_AddMissingPGIndexes (this commit).
- 349 missing PG indexes added via the AddMissingPGIndexes migration
(UpFnPG-only), bringing PG to index parity with MySQL on hot paths
like host_software_installed_paths (host_id, software_id).
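The ON DUPLICATE KEY rewrite described above can be pictured with a small sketch. This is not the driver's actual code: rewriteUpsert, the three regexes, and the raw_command column are hypothetical, and the only map entry shown is the windows_mdm_commands primary key mentioned later in this PR. It assumes a single-table INSERT with VALUES(col) references only.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// knownPrimaryKeys is a one-entry stand-in for the real map in
// rebind_driver.go; only this entry is taken from the PR text.
var knownPrimaryKeys = map[string][]string{
	"windows_mdm_commands": {"command_uuid"},
}

// Compiled once at package init, mirroring the "compiled once at startup" idea.
var (
	reInsertTable = regexp.MustCompile(`(?i)^\s*INSERT\s+INTO\s+([a-z_][a-z0-9_]*)`)
	reOnDuplicate = regexp.MustCompile(`(?i)\s+ON\s+DUPLICATE\s+KEY\s+UPDATE\s+`)
	reValuesRef   = regexp.MustCompile(`(?i)\bVALUES\(\s*([a-z_][a-z0-9_]*)\s*\)`)
)

// rewriteUpsert turns a MySQL upsert into a Postgres one. It returns an error
// when the target table has no primary-key entry, because ON CONFLICT ... DO
// UPDATE needs an explicit conflict target.
func rewriteUpsert(sql string) (string, error) {
	loc := reOnDuplicate.FindStringIndex(sql)
	if loc == nil {
		return sql, nil // not an upsert, leave untouched
	}
	m := reInsertTable.FindStringSubmatch(sql)
	if m == nil {
		return "", fmt.Errorf("cannot find target table in %q", sql)
	}
	pk, ok := knownPrimaryKeys[strings.ToLower(m[1])]
	if !ok {
		return "", fmt.Errorf("no knownPrimaryKeys entry for table %q", m[1])
	}
	head, tail := sql[:loc[0]], sql[loc[1]:]
	// Only the SET list is rewritten, so the INSERT's own VALUES (...) is safe.
	tail = reValuesRef.ReplaceAllString(tail, "EXCLUDED.${1}")
	return fmt.Sprintf("%s ON CONFLICT (%s) DO UPDATE SET %s",
		head, strings.Join(pk, ", "), tail), nil
}

func main() {
	out, err := rewriteUpsert("INSERT INTO windows_mdm_commands (command_uuid, raw_command) " +
		"VALUES (?, ?) ON DUPLICATE KEY UPDATE raw_command = VALUES(raw_command)")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

Returning an error for an unmapped table is a choice made for this sketch; it makes the failure visible before Postgres rejects the statement.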
Wiring:
- FLEET_MYSQL_DRIVER=postgres selects the new driver; standard
FLEET_MYSQL_ADDRESS / USERNAME / PASSWORD / DATABASE env vars route to
the PG cluster unchanged.
- server/config/config.go validates the new driver value.
- cmd/fleet/prepare.go threads dialect into the migration apply path.
- docker-compose.yml gains a postgres service for local dev.
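As a rough illustration of the driver-value validation mentioned in the wiring list, the sketch below accepts two values. The function name validateDriver and the assumption that the MySQL default is spelled "mysql" are mine, not taken from server/config/config.go.

```go
package main

import (
	"fmt"
	"os"
)

// validateDriver normalizes the FLEET_MYSQL_DRIVER value. This sketch assumes
// the valid names are "mysql" (also used when unset) and "postgres".
func validateDriver(v string) (string, error) {
	switch v {
	case "", "mysql":
		return "mysql", nil
	case "postgres":
		return "postgres", nil
	default:
		return "", fmt.Errorf("unsupported FLEET_MYSQL_DRIVER %q (want mysql or postgres)", v)
	}
}

func main() {
	driver, err := validateDriver(os.Getenv("FLEET_MYSQL_DRIVER"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("datastore driver:", driver)
}
```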
Tests:
- 39 PG smoke tests (hosts, software, vulnerabilities, policies,
host-counts) and B1/B2/B3 tiers running on both backends via the new
CreateDS(t) helper.
- Driver-rewrite unit tests cover every regex (UPDATE...JOIN,
DELETE USING, GROUP_CONCAT, ON CONFLICT ambiguity resolution,
smallint-bool encoding, MAX(bool), INTERVAL placeholder, CAST NULL
AS SIGNED, FIND_IN_SET, COALESCE token, null-byte stripping, ...).
- Dialect unit tests for both dialects (LAST_INSERT_ID stripping,
ReturningID, AtomicTableSwap, CreateTableLike).
- List-options helper has new coverage for single-aggregate ORDER BY
skip and text-column cursor binding.
- Benchmarks for UpdateHostSoftware / ListSoftware / ListHosts in
server/datastore/mysql/benchmarks_test.go.
Squashed from 70+ incremental commits on feat/pg-compat-clean; full
provenance preserved on feat/pg-compat-clean-backup-2026-05-13.
…p on dep-review
CI infrastructure that gates the PG backend:
- test-go-postgres.yaml: spins up Postgres in a service container, runs
the full datastore + service test suites against the PG driver. Mirrors
the existing MySQL test workflow.
- validate-pg-compat.yml: invokes the tools/pgcompat validators on every
PR/push — check_primary_keys, check_schema_drift, check_column_drift.
Empty-allowlist gate-of-the-gate test ensures the validators themselves
can never become a no-op.
- build-ledo.yml: ledoent-specific image build that refuses to publish to
ghcr.io unless both test-go-postgres and validate-pg-compat succeeded
on the build SHA.
- sync-upstream.yml: paranoia check that refuses to force-push ledoent/main
if any non-bot commits exist outside upstream/main.
- weekly-aggregate.yml: gitaggregate cron + workflow_dispatch, pinned to
git-aggregator==4.1.
- dependency-review.yml: skip on private repos (the action requires
GitHub Advanced Security which isn't available on free private mirrors).
Upstream public fleetdm/fleet still runs it.
- test-website.yml: npm audit step added so frontend dep regressions
block PRs.
- tools/ci/apiparamcheck: custom golangci-lint plugin that flags REST
handler params not registered in the request struct, catching the
'missing query param decode' class of bug.
…rift
Three small static-analysis tools that prevent silent PG-compat regressions.
None require a running Postgres; they read Go source and SQL schema files.
- check_primary_keys: scans non-test Go for raw 'ON DUPLICATE KEY UPDATE'
SQL and verifies every targeted table has an entry in knownPrimaryKeys
(the map in server/platform/postgres/rebind_driver.go that drives the
ON CONFLICT (<pk>) DO UPDATE rewrite). Missing entries produce invalid
PG SQL at runtime.
- check_schema_drift: diffs CREATE TABLE identifier sets between
server/datastore/mysql/schema.sql (MySQL canonical) and
pg_baseline_schema.sql (PG baseline). known_schema_diff.txt records
intentional divergence and is itself validated — stale entries fail.
- check_column_drift: diffs column lists per shared table. Optional
allowlist via known_column_drift.txt.
- gen_identity_cols / gen_bool_cols: code generators that produce the
Postgres dialect's static knowledge of IDENTITY columns and bool
columns so the rebind driver can rewrite INSERTs correctly.
- validators_test.go is a gate-of-the-gate: an empty schema-diff
allowlist must produce a non-zero exit.
Designed to be extractable as a standalone PR to fleetdm/fleet — they're
useful to any Fleet operator building PG support, with or without the
larger driver/baseline layer.
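A minimal sketch of the table-set comparison behind check_schema_drift, including the rule that stale allowlist entries fail. The regex, the helper names tableSet and drift, and the sample table names other than hosts and labels are assumptions for illustration; the real tool may parse the schema files differently.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// reCreateTable captures the table name after CREATE TABLE, tolerating MySQL
// backticks and the "public." prefix that pg_dump emits.
var reCreateTable = regexp.MustCompile(
	"(?i)CREATE\\s+TABLE\\s+(?:IF\\s+NOT\\s+EXISTS\\s+)?(?:public\\.)?[`\"]?([a-z_][a-z0-9_]*)[`\"]?")

// tableSet returns the set of table names declared in a schema dump.
func tableSet(schema string) map[string]bool {
	set := map[string]bool{}
	for _, m := range reCreateTable.FindAllStringSubmatch(schema, -1) {
		set[strings.ToLower(m[1])] = true
	}
	return set
}

// drift reports tables present in exactly one schema and not allowlisted
// (unexpected), plus allowlist entries that no longer match a real
// difference (stale). Either list being non-empty should fail the gate.
func drift(mysql, pg, allow map[string]bool) (unexpected, stale []string) {
	diff := map[string]bool{}
	for t := range mysql {
		if !pg[t] {
			diff[t] = true
		}
	}
	for t := range pg {
		if !mysql[t] {
			diff[t] = true
		}
	}
	for t := range diff {
		if !allow[t] {
			unexpected = append(unexpected, t)
		}
	}
	for t := range allow {
		if !diff[t] {
			stale = append(stale, t)
		}
	}
	sort.Strings(unexpected)
	sort.Strings(stale)
	return unexpected, stale
}

func main() {
	mysql := tableSet("CREATE TABLE `hosts` (id int);\nCREATE TABLE `labels` (id int);")
	pg := tableSet("CREATE TABLE public.hosts (id integer);\nCREATE TABLE public.pg_only (id integer);")
	allow := map[string]bool{"pg_only": true, "gone": true}
	unexpected, stale := drift(mysql, pg, allow)
	fmt.Println("unexpected:", unexpected, "stale:", stale)
}
```

With an empty allowlist every real difference lands in the unexpected list, which is the property the gate-of-the-gate test relies on.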
Playwright API-mode test matrix that exercises every URL filter Fleet's
frontend can construct against a live server, asserting each response is
not a Postgres-driver or Postgres-syntax failure (SQLSTATE, 'must appear
in the GROUP BY', 'operator does not exist', 'cannot find encode plan',
'syntax error', etc.).
Read-only (HTTP GET only). ~220 probes in ~15s with 8 workers.
Coverage:
- /hosts + /hosts/count: status, low_disk_space, mdm_enrollment_status,
os_settings/apple_settings/disk_encryption/bootstrap_package, populate_*,
every ORDER BY allowlist key × direction, cursor pagination (after=),
vulnerability filter, search.
- /software/versions, /software/titles, /software (deprecated):
vulnerable, exploit, cvss range, self_service, available_for_install,
packages_only, team filtering, ordering.
- /vulnerabilities, /host_summary, /labels/:id/hosts, /hosts/:id/*,
sanity endpoints (/config, /version, /me, /labels, /teams, ...).
Run:
    cd tools/pg-compat-harness
    yarn install
    export FLEET_URL=https://your-fleet
    export FLEET_TOKEN=$(awk '/token:/ {print $2}' ~/.fleet/config)
    yarn test
This harness found and gated the GROUP BY and cursor-encoding regressions
fixed elsewhere in this branch (selectSoftwareSQL GroupByAppend,
AppendListOptionsWithParamsSecure textOrderKeys hint).
Small Go program that parses server/datastore/mysql/schema.sql and emits
one CREATE INDEX IF NOT EXISTS statement per MySQL KEY / UNIQUE KEY clause,
suitable for embedding into a PG-only migration.
Handles:
- balanced parens in column lists (expression bodies)
- USING BTREE / USING HASH suffix (MySQL hint, PG ignores)
- DESC column ordering (PG supports natively)
- identifier quoting where required
- stable per-table grouping for reviewable diffs
Deliberately skips with explicit reasons:
- PRIMARY KEY (the CREATE TABLE handles it)
- FULLTEXT KEY, SPATIAL KEY (need pg_trgm / GiST equivalents)
- prefix-length indexes col(N) (need PG expression indexes)
- expression indexes using MySQL-specific functions (ifnull, cast as
signed) that need PG translation (COALESCE, CAST AS integer)
main_test.go drives translate() from inline schema fixtures — no file I/O
required. Covers plain/unique keys, DESC, USING BTREE, every skip reason,
balanced-paren edge cases, multi-table, PRIMARY ignored, plus unit tests
for extractParenBody and quoteIdent helpers.
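To make the balanced-paren handling concrete, here is a hedged sketch of what a helper like extractParenBody could look like, with a toy translateKey built on top of it. Both bodies are my own guesses at the shape, not the tool's code: the sketch ignores parentheses inside string literals, does no identifier quoting, and implements none of the skip rules beyond ignoring clauses that do not start with KEY or UNIQUE KEY. The index and column names are made up.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// extractParenBody returns the text between the '(' at s[open] and its
// matching ')', plus the index just past that ')'.
func extractParenBody(s string, open int) (body string, end int, ok bool) {
	if open < 0 || open >= len(s) || s[open] != '(' {
		return "", 0, false
	}
	depth := 0
	for i := open; i < len(s); i++ {
		switch s[i] {
		case '(':
			depth++
		case ')':
			depth--
			if depth == 0 {
				return s[open+1 : i], i + 1, true
			}
		}
	}
	return "", 0, false // unbalanced input
}

// reKeyHead matches "KEY `name` " or "UNIQUE KEY `name` " at the start of a
// clause; PRIMARY KEY, FULLTEXT KEY and SPATIAL KEY do not match.
var reKeyHead = regexp.MustCompile("^\\s*(UNIQUE\\s+)?KEY\\s+`([^`]+)`\\s*")

// translateKey emits one CREATE INDEX statement for a MySQL key clause.
// Anything after the column list (such as USING BTREE) is dropped.
func translateKey(table, clause string) (string, bool) {
	m := reKeyHead.FindStringSubmatchIndex(clause)
	if m == nil {
		return "", false
	}
	body, _, ok := extractParenBody(clause, m[1])
	if !ok {
		return "", false
	}
	unique := ""
	if m[2] >= 0 { // the optional UNIQUE group participated in the match
		unique = "UNIQUE "
	}
	name := clause[m[4]:m[5]]
	cols := strings.ReplaceAll(body, "`", "")
	return fmt.Sprintf("CREATE %sINDEX IF NOT EXISTS %s ON %s (%s);",
		unique, name, table, cols), true
}

func main() {
	stmt, ok := translateKey("hosts", "KEY `idx_example` (`team_id`,`created_at` DESC) USING BTREE")
	fmt.Println(ok, stmt)
}
```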
Usage:
    go run ./tools/pg-index-translate \
      -in server/datastore/mysql/schema.sql \
      -out server/datastore/mysql/migrations/tables/{ts}_AddMissingPGIndexes.sql
- docs/Deploy/postgresql.md: end-to-end guide for running Fleet against
Postgres — connection env vars, baseline schema apply, migration
apply, ownership reassertion, troubleshooting (drift warning, must
be owner of table, schema/column drift validator output).
- docs/Deploy/README.md: links the new guide from the deployment index
alongside the MySQL guide.
GetDBVersion returned a too-old current version on production PG because the baseline-seed path (and goose's own run-and-record loop for newly introduced migrations) inserted rows into migration_status_tables out of version_id order. Concretely, id 523 carried version 20260422181702 while id 521 carried 20260506171058. Plain 'ORDER BY id DESC' picked the older version, so 'fleet prepare db' tried to re-run every migration from 20260423161823 onward and failed on json_merge_patch, a MySQL-only function that PG never had, with the migration body long since folded into the embedded baseline.
Switching to 'ORDER BY version_id DESC, id DESC' makes the query immune to insertion order while preserving up/down semantics: the tie-break by id DESC keeps the most recent applied/rolled-back state for the same version.
MySQL is unaffected: its migration runner always applies in monotonic version order, so id and version_id stay aligned. To keep blast radius minimal, we do not change the MySQL dialect; that path has years of behavior to preserve.
Test pins the exact ORDER BY clause via sqlmock so any future change back to the buggy form fails CI loudly.
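The effect of the two orderings can be reproduced in memory with the two rows quoted above. This is a simplified model of a "newest row wins" scan, not goose's actual implementation; statusRow and dbVersion are hypothetical names, and both rows are assumed to be applied.

```go
package main

import (
	"fmt"
	"sort"
)

// statusRow models one row of migration_status_tables.
type statusRow struct {
	ID        int64
	VersionID int64
	IsApplied bool
}

// dbVersion scans rows newest-first and returns the first version whose most
// recent record is "applied". With byVersion=false it orders by id DESC (the
// buggy form); with byVersion=true it orders by version_id DESC, id DESC.
// It returns 0 when no applied row exists.
func dbVersion(rows []statusRow, byVersion bool) int64 {
	sorted := append([]statusRow(nil), rows...)
	sort.Slice(sorted, func(i, j int) bool {
		a, b := sorted[i], sorted[j]
		if byVersion && a.VersionID != b.VersionID {
			return a.VersionID > b.VersionID
		}
		return a.ID > b.ID
	})
	rolledBack := map[int64]bool{}
	for _, r := range sorted {
		if rolledBack[r.VersionID] {
			continue // a newer record already marked this version as rolled back
		}
		if r.IsApplied {
			return r.VersionID
		}
		rolledBack[r.VersionID] = true
	}
	return 0
}

func main() {
	rows := []statusRow{
		{ID: 521, VersionID: 20260506171058, IsApplied: true},
		{ID: 523, VersionID: 20260422181702, IsApplied: true},
	}
	fmt.Println("ORDER BY id DESC:                ", dbVersion(rows, false))
	fmt.Println("ORDER BY version_id DESC, id DESC:", dbVersion(rows, true))
}
```

Ordering by id alone reports 20260422181702 as current, while the version-first ordering reports 20260506171058.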
…ces/views
pg_baseline_post.sql already loops over public tables, sequences, and views and reasserts ownership to current_user, but it skipped functions. On baselines that were loaded by the postgres superuser (typical on self-hosted PG), CREATE OR REPLACE FUNCTION later in the same file errored with 'must be owner of function fleet_set_updated_at': the application user can't replace something it doesn't own.
Add a fourth loop using pg_proc / pg_namespace to enumerate public functions whose owner is not current_user, and ALTER FUNCTION ... OWNER TO current_user with the standard insufficient_privilege fallback. pg_get_function_identity_arguments() disambiguates overloaded signatures.
Hit in production tonight on the AddMissingPGIndexes deploy. With this fix every future fleet prepare db on a postgres-superuser-loaded baseline succeeds without manual ALTER FUNCTION.
The existing implementation already sorts the seeded versions ascending (via versionsAtOrBelow → partitionMigrationVersions → slices.Sort), so PG assigns auto-increment ids in the same order as version_id. That property is load-bearing for any downstream consumer that infers 'current version' from MAX(id), even with the dialect query now correctly ordered by version_id DESC. No functional change — just document the invariant so a future refactor doesn't quietly drop the sort.
Required by TestVersionsAbove_EmbeddedBaselineCoversAllCode now that
AddMissingPGIndexes (20260513210000) ships in code. Dump source is
fleet.hz.ledoweb.com fleet-db-1, which has all 532 indexes applied
(11 from the original baseline + 521 added by AddMissingPGIndexes
either via the SQL we ran manually tonight or via the migration on
future fresh applies). check-pg-compat validators pass:
schema-drift: 202 MySQL tables / 205 PG tables in sync (after allowlist)
primary-keys: every ON DUPLICATE KEY UPDATE site covered
column-drift: no drift between schema.sql and pg_baseline_schema.sql
Generated via the documented procedure in the file's header:
    kubectl exec -n fleet --context hetzner-ledo fleet-db-1 -- \
      pg_dump -U postgres -d fleet --schema-only --no-owner --no-privileges
Stripped the pg_dump-17 \restrict/\unrestrict meta-commands and the
SET search_path='' line per the same header comment. Header preserved
with the regen recipe and verification commands.
POST /api/fleet/orbit/setup_experience/init returned 500 with:
    ERROR: COALESCE types integer and text cannot be matched (SQLSTATE 42804)
The setup-experience init query UNION-ALLs two SELECTs (one for software_installers, one for vpp_apps), projecting a NULL placeholder in each leg for the column the other leg owns. The outer ORDER BY then COALESCEs across both columns plus a literal 0:
    ORDER BY sort_name ASC, COALESCE(software_installer_id, vpp_app_team_id, 0)
In MySQL the untyped NULL silently coerces. In Postgres the untyped NULL resolves to text, so COALESCE sees one int leg, one text leg, and one int literal, and strict typing rejects the mix.
Fix: replace the bare 'NULL AS ...' projections with 'CAST(NULL AS UNSIGNED) AS ...'. MySQL keeps it as unsigned-int NULL; the PG rebind driver already rewrites 'AS UNSIGNED' → 'AS integer' (see server/platform/postgres/rebind_driver.go reAsUnsigned*). Both dialects now compose a proper-typed NULL for the outer COALESCE.
Surfaced by Windows host enrollment; the iOS/Android/Apple-setup experience flow runs the same code path, so this fix covers them all. A broad audit of the codebase's 487 COALESCE call sites found no other sites with the same untyped-NULL-in-UNION shape.
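The 'AS UNSIGNED' to 'AS integer' step that the fix relies on can be sketched as a single regex replacement. The pattern and the function name rewriteCastUnsigned are assumptions; the real driver has a family of reAsUnsigned* expressions whose exact form is not shown in this PR. The id column in the sample query is also made up.

```go
package main

import (
	"fmt"
	"regexp"
)

// reAsUnsigned matches the "AS UNSIGNED" tail of a MySQL CAST,
// case-insensitively and on word boundaries.
var reAsUnsigned = regexp.MustCompile(`(?i)\bAS\s+UNSIGNED\b`)

// rewriteCastUnsigned maps MySQL's unsigned cast target to a Postgres integer,
// so CAST(NULL AS UNSIGNED) becomes a typed integer NULL on Postgres.
func rewriteCastUnsigned(sql string) string {
	return reAsUnsigned.ReplaceAllString(sql, "AS integer")
}

func main() {
	mysqlSQL := "SELECT id AS software_installer_id, " +
		"CAST(NULL AS UNSIGNED) AS vpp_app_team_id FROM software_installers"
	fmt.Println(rewriteCastUnsigned(mysqlSQL))
}
```

Because the column alias "AS vpp_app_team_id" does not match the pattern, only the cast target changes.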
Extends the harness with 19 new probes covering every orbit and osquery
agent POST endpoint listed in server/service/handler.go:1006-1099 +
osquery enroll/config/distributed/carve/log.
Each probe sends a fake orbit_node_key (or node_key) and asserts the
auth-middleware SQL — typically SELECT … FROM hosts WHERE
orbit_node_key = ? — runs without an SQLSTATE crash. A 401 response is
the success case; a 500 with PG error markers is the failure case.
Required infrastructure additions:
- Probe gains optional method ('GET'|'POST'), body, and expectAuthFail
- check() does request.post(...) when method is POST
- check() permits 4xx when expectAuthFail is true (still fails on
body containing PG error markers, still fails on 5xx)
Limitation documented in the orbitProbes comment: fake-key probes
reject at the auth gate, so per-endpoint handler SQL (e.g. the
setup_experience/init COALESCE bug fixed in df17814) is NOT
exercised by these probes. Catching post-auth SQL needs a fixture
host with a real orbit_node_key — tracked as a future expansion.
Current count: 242 passing (was 223; +19 orbit/osquery probes).
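The pass/fail rule for these probes can be summarized in a few lines. The harness itself is a Playwright project, so the Go below is only a transliteration of the rule as described: checkProbe is a hypothetical name, the marker list is the one quoted in the harness description, and failing on 4xx when expectAuthFail is false is my assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// pgErrorMarkers are substrings that indicate a Postgres driver or syntax
// failure leaked into an HTTP response body.
var pgErrorMarkers = []string{
	"SQLSTATE",
	"must appear in the GROUP BY",
	"operator does not exist",
	"cannot find encode plan",
	"syntax error",
}

// checkProbe returns nil when a response counts as a pass. A body containing
// a PG error marker or a 5xx status always fails; a 4xx status passes only
// when the probe expects to be rejected at the auth gate.
func checkProbe(status int, body string, expectAuthFail bool) error {
	for _, marker := range pgErrorMarkers {
		if strings.Contains(body, marker) {
			return fmt.Errorf("PG error marker %q in response body", marker)
		}
	}
	switch {
	case status >= 500:
		return fmt.Errorf("server error %d", status)
	case status >= 400 && !expectAuthFail:
		return fmt.Errorf("unexpected client error %d", status)
	}
	return nil
}

func main() {
	fmt.Println(checkProbe(401, `{"message":"Authentication required"}`, true))
	fmt.Println(checkProbe(500, "ERROR: COALESCE types integer and text cannot be matched (SQLSTATE 42804)", true))
}
```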
Two raw ON DUPLICATE KEY UPDATE sites in microsoft_mdm.go
(lines 364 and ~551) upsert into windows_mdm_commands, which has
PRIMARY KEY (command_uuid) per schema.sql. Without the
knownPrimaryKeys entry, the PG rebind driver emits
'ON CONFLICT DO UPDATE SET …' without a target, which PG rejects.
Caught by check_primary_keys in the pg-compat validator suite — same
gate that blocked the last build. Aggregated CI now passes locally:
    $ go run ./tools/pgcompat/check_primary_keys
    OK: every raw ON DUPLICATE KEY UPDATE site is covered by
    knownPrimaryKeys.
…ploy)
Upstream PR fleetdm#45367 (aggregated commit bee5eda, 2026-05-14) added the orbit_debug_until column to the hosts table in MySQL schema.sql. The PG baseline is regenerated from a production pg_dump; that migration hasn't landed on prod yet (it lands when this very deploy rolls out), so the baseline lags by one column.
Adding the entry to known_column_drift.txt with a deferred-regen comment, per the file header's prescribed workflow. Once the next aggregation runs after this deploy lands, the prod baseline will include orbit_debug_until and this allowlist entry can be removed.
Without this entry, the validate-pg-compat CI gate fails check_column_drift on every aggregated build, blocking the build-ledo image publish that deploys this fix in the first place. Classic chicken-and-egg: break with a documented intentional drift.
What changed
Full PostgreSQL compatibility for Fleet's datastore layer, enabling the production deployment at fleet.hz.ledoweb.com to run against PostgreSQL 16 instead of MySQL.
Core dialect abstraction
- DialectHelper interface (dialect.go) abstracting all MySQL vs PostgreSQL SQL differences
- mysqlDialect and postgresDialect implementations covering all methods: InsertIgnoreInto, ReplaceInto, FromDual, OnDuplicateKey, OnConflictDoNothing, GroupConcat, JsonQuote, JSONAgg, JSONExtract, JSONUnquoteExtract, JSONBuildObject, JSONObjectFunc, FindInSet, FullTextMatch, RegexpMatch, GoquDialect, ReturningID, IsPostgres, CreateTableLike, AtomicTableSwap
Runtime SQL translation (server/platform/postgres/rebind_driver.go)
The pgx-rebind driver wraps pgx/stdlib and rewrites MySQL syntax to PG at the driver layer:
- ? → $N
- col = 1 / col = 0 → col = true / col = false (for the ~60 boolean cols listed in schema_bool_cols_gen.go)
- UPDATE … JOIN → UPDATE … FROM; multi-table DELETE → DELETE … USING
- JSON_EXTRACT(col, '$.path') → col->'path'; JSON_OBJECT(…) → jsonb_build_object(…); JSON_QUOTE(…), JSON_UNQUOTE(…), JSON_ARRAYAGG(…) → PG equivalents
- IF() → CASE WHEN, IFNULL → COALESCE, MD5() → md5(), UUID() → gen_random_uuid()::text, UTC_TIMESTAMP() → TO_CHAR(NOW() AT TIME ZONE 'UTC', …), CURDATE() → CURRENT_DATE, DATABASE() → current_schema(), HEX()/UNHEX() → encode/decode, FIND_IN_SET → array_position(string_to_array), …
- CAST AS UNSIGNED → CAST AS integer, CAST AS SIGNED INT → CAST AS integer, TIMESTAMP(?) → (?)::timestamp
- INSERT IGNORE INTO → INSERT INTO … ON CONFLICT DO NOTHING; REPLACE INTO → INSERT … ON CONFLICT DO UPDATE; ON DUPLICATE KEY UPDATE VALUES(col) → ON CONFLICT (pk) DO UPDATE SET col = EXCLUDED.col
- BLOB → bytea, MEDIUMTEXT/LONGTEXT → TEXT, TINYINT(1) → smallint, DATETIME[(N)] → timestamp[(N)], INT UNSIGNED NOT NULL AUTO_INCREMENT → INTEGER NOT NULL GENERATED BY DEFAULT AS IDENTITY, INT UNSIGNED → INTEGER, inline UNIQUE KEY name (cols) → CONSTRAINT name UNIQUE (cols), enum('a','b','c') → VARCHAR(255) CHECK (col IN ('a','b','c')), ENGINE=InnoDB / DEFAULT CHARSET=… / ALGORITHM=INSTANT stripped
- ALTER TABLE … ADD COLUMN …, ADD KEY <name> (<cols>) produces an ALTER followed by a separate CREATE INDEX. ADD UNIQUE KEY → CREATE UNIQUE INDEX. Multiple ADD KEY clauses each become their own CREATE INDEX.
- fleet_set_updated_at() is appended. The trigger function is installed by pg_baseline_post.sql.
Migration runner fixes (this session)
A chain of bugs prevented fleet prepare db from actually running migrations on PG; all addressed on this branch:
- MigrationStatus: removed broken short-circuit that reported AllMigrationsCompleted based purely on hosts table existence. Now uses the standard compareMigrations path. On truly-fresh PG, returns NoMigrationsCompleted without calling loadMigrations so goose's createVersionTable doesn't collide with the baseline.
- goose.GetDBVersion: no longer panics unreachable when iteration ends without an applied row; returns (0, nil) so callers proceed as for a fresh DB.
- seedPGMigrationHistory: now seeds both migration_status_tables AND migration_status_data (previously only the first, causing every old data-migration to appear missing on every startup).
- MigrateTables on PG: runs migratePGBaseline AND then invokes goose Up so any migration newer than the baseline marker actually executes.
- pg_baseline_post.sql: wraps each ALTER OWNER in an EXCEPTION WHEN insufficient_privilege block so individual objects the role doesn't own no longer abort the whole script. Also installs fleet_set_updated_at() trigger function used by ON UPDATE CURRENT_TIMESTAMP column translations.
- prepare.go: on PG, falls through to MigrateTables even when all migrations are reported complete, so the idempotent post-baseline fixups (ownership + trigger function) re-apply. MySQL path keeps its early-return behavior.
- BackfillVPPAppCountriesFromTokens: removed va. alias prefix from SET clause (PG rejects).
- CleanupSCDData + DeleteAllForDataset in the chart datastore: rewrote MySQL DELETE … ORDER BY … LIMIT ? as a subquery-on-PK form valid on both dialects.
Production schema drift remediation (this session)
After enabling the migration runner properly, several tables had drift from migrations recorded as "applied" via seedPGMigrationHistory but never actually executed. All remediated on production via /tmp/fix_pg_drift.sql:
- acme_enrollments, host_managed_local_account_passwords, user_api_endpoints to match canonical migrations (all were empty, no data lost).
- policies.conditional_access_bypass_enabled and mdm_configuration_profile_variables.requires_value.
- migration_status_data with 9 historical versions.
- pg_baseline_schema.sql from the corrected prod DB; marker bumped to 20260506171058.
pgcompat tooling (tools/pgcompat/)
Three validators that gate every PR on PR-relevant paths:
- check_primary_keys: every raw ON DUPLICATE KEY UPDATE site is covered by knownPrimaryKeys in the rebind driver
- check_schema_drift: schema.sql vs pg_baseline_schema.sql table sets match (allowlist for intentional drift)
- check_column_drift (new this session): for every common table, column sets match; flags stale allowlist entries so they get removed after baseline regen
- gen_bool_cols: regenerates schema_bool_cols_gen.go from the baseline; CI fails if stale
Fresh-PG-install CI smoke test (this session)
validate-pg-compat.yml now spins up an empty PG via docker-compose, builds the fleet binary, runs prepare db against it (expects the baseline + post-marker migrations to apply cleanly), then runs prepare db a second time and asserts idempotency. Verifies every table ends up owned by the app role. This gate would have caught the migration-runner bugs we shipped earlier on this branch.
Test plan
- Validate PG Compatibility CI: passes on latest SHA
- Go Tests (PostgreSQL) CI: passes on latest SHA
- Build & Push Ledo Fleet Image CI: passes on latest SHA
- rebind_driver_test.go covering every DDL translation + verbatim regression cases for the two upstream migrations (20260428125634, 20260429180725) that needed manual SQL before this branch
- send_managed_local_account_rotation_commands, refresh_vpp_app_versions) now complete cleanly