fix: prevent backend segfault from user-mapping FFI and harden the FDW#36
fix: prevent backend segfault from user-mapping FFI and harden the FDW#36postgresql007 wants to merge 1 commit into
Conversation
A scan running under a role with no user mapping could crash the backend
with SIGSEGV (signal 11). In new(), GetUserMapping() raises a Postgres
ERROR (a C longjmp) when no mapping exists; this unwound past the already
constructed multi-threaded tokio runtime, skipping its Drop and leaking
worker threads. Such a worker pool can also receive job-control signals on
a thread other than the main backend thread. pg_timetable's repeated
scheduling (CALL timetable.f_control_importers()) made it reliably fatal.
Crash fix:
- Read the user mapping BEFORE creating the runtime, so a longjmp from
GetUserMapping can no longer skip a live runtime.
- Use the framework's create_async_runtime() (a current-thread runtime):
a Postgres backend is single-threaded and the FDW only block_on's, so a
worker pool is both unnecessary and unsafe here.
- Connect to etcd lazily in begin_scan / the modify path instead of in
new(), which the framework may call at planning time.
Robustness:
- insert/update/delete: read the key/value out of the Cell directly
instead of Display + trim_matches('), which corrupted any key/value that
legitimately contained single quotes.
- iter_scan: convert non-UTF-8 etcd keys/values lossily instead of
panicking, and iterate by index instead of draining the vec front
(was O(n^2) per scan).
- begin_scan: do not panic when incrementing a prefix yields invalid
UTF-8; fall back to the wider, still-correct range.
- insert/update: return an error on NULL key/value cells instead of
panicking; do not panic on an empty user-mapping options list.
- update/delete: error on a missing key instead of silently creating it.
Tests:
- Regression tests: a scan with no user mapping returns a clean error
instead of crashing; keys/values containing single quotes round-trip
exactly.
Build / CI / reproducibility:
- CI: always install cargo-pgrx. Its binary lives in ~/.cargo/bin (the
deps cache, keyed on Cargo.lock), not the pgrx cache, so gating the
install on the pgrx cache left it missing once Cargo.lock was committed.
- Make testcontainers an optional dependency enabled by the pg_test
feature (built for cargo pgrx test, not for release); drop unused serde.
- Track Cargo.lock (pins the wrappers git dependency) for reproducible
builds.
- pg_regress: CREATE EXTENSION etcd_fdw (was a stale template name).
393cebf to
e067d82
Compare
|
I have tested this but with the procedure I will adjust procedure to materialized approach for now to make it work, |
|
hey @if-loop69420 this PR doesn't solve the issue but it makes some improvements worth to consider. The minimal reproducible code from @hlihhovac shows the problem. When you call that proc twice, the second is crashing. |
|
Hello. I just came across this issue/PR and did a small initial investigation. I can reproduce the crash scenario of I also set Also, debugger showed me segfault at line in let colnames: Vec<String> = columns.iter().map(|x| x.name.clone()).collect();At this point, it seems that issue comes from pgrx or wrappers, but I can't confirm it. Just wanted to share my quick initial findings related to this issue. The validation checks/improvements in this PR itself are nice! |
|
Alright. Currently i think the problem lies mostly with etcd_client. It's using hyper (which uses tokio) internally which means we'd want a way to run a tokio runtime, but that is causing issues when running inside postgres, because the FdwState struct in supabase-wrappers is getting reused when the query plan is cached. Even if we create a new tokio runtime in every function there's still stale state (possibly in the client itself) which is causing problems. I see 2 ways in which this can be fixed. One would be using something like This PR doesn't fix the segfault problem itself, but i think it has some nice changes that we can merge if we want to. |
|
As mentioned in #39 I found another one of the bugs and wrote an etcd client (http) that is sync. So no more tokio. |
|
this PR has new client implementation or should we close it? |
Problem
A scan running under a role with no user mapping crashed the backend with
SIGSEGV(signal 11):In
EtcdFdw::new(),pg_sys::GetUserMapping()raises a PostgresERROR(a Clongjmp) when the current role has no mapping. Thatlongjmpunwound past the already-constructed multi-threaded tokio runtime, skipping itsDropand leaking worker threads. Under pg_timetable's repeated scheduling this accumulated until the backend crashed.Fix
Crash:
GetUserMappinglongjmpcan no longer skip a live runtime.block_ons) — no worker threads to leak.begin_scan/ the modify path rather than innew()(which can run at planning time).Robustness (panics → clean errors):
iter_scan: lossy UTF-8 for arbitrary etcd bytes; index-based iteration instead ofdrain(0..1)(was O(n^2)).begin_scan: no panic when a prefix increment yields invalid UTF-8.insert/update: error on NULL key/value cells; no panic on an empty user-mapping options list.update/delete: error on a missing key instead of silently creating it.Build / reproducibility:
testcontainersis now an optional dependency enabled by thepg_testfeature (kept out of release builds, still available forcargo pgrx test); dropped the unusedserdedependency.Cargo.lock(pins thewrappersgit dependency) for reproducible builds.pg_regress:CREATE EXTENSION etcd_fdw(was a stale template name).Validation
cargo pgrx testagainst PostgreSQL 17 — 8/8 pass, including a new regression test (test_scan_without_user_mapping_errors_cleanly) that asserts a scan with no user mapping returns a clean error instead of crashing the backend: