Skip to content

fix: prevent backend segfault from user-mapping FFI and harden the FDW#36

Open
postgresql007 wants to merge 1 commit into
mainfrom
fix/etcd-fdw-stability
Open

fix: prevent backend segfault from user-mapping FFI and harden the FDW#36
postgresql007 wants to merge 1 commit into
mainfrom
fix/etcd-fdw-stability

Conversation

@postgresql007

Copy link
Copy Markdown

Problem

A scan running under a role with no user mapping crashed the backend with SIGSEGV (signal 11):

LOG:  server process (PID 31) was terminated by signal 11: Segmentation fault
DETAIL:  Failed process was running: CALL timetable.f_control_importers();

In EtcdFdw::new(), pg_sys::GetUserMapping() raises a Postgres ERROR (a C longjmp) when the current role has no mapping. That longjmp unwound past the already-constructed multi-threaded tokio runtime, skipping its Drop and leaking worker threads. Under pg_timetable's repeated scheduling this accumulated until the backend crashed.

Fix

Crash:

  • Read the user mapping before creating the runtime, so the GetUserMapping longjmp can no longer skip a live runtime.
  • Use a current-thread tokio runtime (a PG backend is single-threaded and the FDW only block_ons) — no worker threads to leak.
  • Connect to etcd lazily in begin_scan / the modify path rather than in new() (which can run at planning time).

Robustness (panics → clean errors):

  • iter_scan: lossy UTF-8 for arbitrary etcd bytes; index-based iteration instead of drain(0..1) (was O(n^2)).
  • begin_scan: no panic when a prefix increment yields invalid UTF-8.
  • insert / update: error on NULL key/value cells; no panic on an empty user-mapping options list.
  • update / delete: error on a missing key instead of silently creating it.

Build / reproducibility:

  • testcontainers is now an optional dependency enabled by the pg_test feature (kept out of release builds, still available for cargo pgrx test); dropped the unused serde dependency.
  • Track Cargo.lock (pins the wrappers git dependency) for reproducible builds.
  • pg_regress: CREATE EXTENSION etcd_fdw (was a stale template name).

Validation

cargo pgrx test against PostgreSQL 17 — 8/8 pass, including a new regression test (test_scan_without_user_mapping_errors_cleanly) that asserts a scan with no user mapping returns a clean error instead of crashing the backend:

test result: ok. 8 passed; 0 failed; 0 ignored

A scan running under a role with no user mapping could crash the backend
with SIGSEGV (signal 11). In new(), GetUserMapping() raises a Postgres
ERROR (a C longjmp) when no mapping exists; this unwound past the already
constructed multi-threaded tokio runtime, skipping its Drop and leaking
worker threads. Such a worker pool can also receive job-control signals on
a thread other than the main backend thread. pg_timetable's repeated
scheduling (CALL timetable.f_control_importers()) made it reliably fatal.

Crash fix:
- Read the user mapping BEFORE creating the runtime, so a longjmp from
  GetUserMapping can no longer skip a live runtime.
- Use the framework's create_async_runtime() (a current-thread runtime):
  a Postgres backend is single-threaded and the FDW only block_on's, so a
  worker pool is both unnecessary and unsafe here.
- Connect to etcd lazily in begin_scan / the modify path instead of in
  new(), which the framework may call at planning time.

Robustness:
- insert/update/delete: read the key/value out of the Cell directly
  instead of Display + trim_matches('), which corrupted any key/value that
  legitimately contained single quotes.
- iter_scan: convert non-UTF-8 etcd keys/values lossily instead of
  panicking, and iterate by index instead of draining the vec front
  (was O(n^2) per scan).
- begin_scan: do not panic when incrementing a prefix yields invalid
  UTF-8; fall back to the wider, still-correct range.
- insert/update: return an error on NULL key/value cells instead of
  panicking; do not panic on an empty user-mapping options list.
- update/delete: error on a missing key instead of silently creating it.

Tests:
- Regression tests: a scan with no user mapping returns a clean error
  instead of crashing; keys/values containing single quotes round-trip
  exactly.

Build / CI / reproducibility:
- CI: always install cargo-pgrx. Its binary lives in ~/.cargo/bin (the
  deps cache, keyed on Cargo.lock), not the pgrx cache, so gating the
  install on the pgrx cache left it missing once Cargo.lock was committed.
- Make testcontainers an optional dependency enabled by the pg_test
  feature (built for cargo pgrx test, not for release); drop unused serde.
- Track Cargo.lock (pins the wrappers git dependency) for reproducible
  builds.
- pg_regress: CREATE EXTENSION etcd_fdw (was a stale template name).
@postgresql007 postgresql007 force-pushed the fix/etcd-fdw-stability branch from 393cebf to e067d82 Compare June 9, 2026 19:09
@hlihhovac

hlihhovac commented Jun 9, 2026

Copy link
Copy Markdown

I have tested this but with the procedure timetable.f_control_importers that is used in secure gather consecutive call to execute procedure crashes the client connection

I will adjust procedure to materialized approach for now to make it work,
but for further investigation following functions for further testing:

*** PLAIN LOOP APPROACH *** FAILS

CREATE OR REPLACE FUNCTION reproduce_etcd_fdw_crash()
RETURNS void
LANGUAGE plpgsql
AS $$
DECLARE
    r record;
BEGIN
    FOR r IN
        SELECT * FROM timetable.t_shared
    LOOP
        NULL;
    END LOOP;
END;
$$;

import=# select reproduce_etcd_fdw_crash();
 reproduce_etcd_fdw_crash 
--------------------------
 
(1 row)

import=# select reproduce_etcd_fdw_crash();
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
!?> \q


*** LOOP WITH DYNAMIC SQL  APPROACH *** WORKS

CREATE OR REPLACE FUNCTION reproduce_etcd_fdw_crash_dynamic()
RETURNS void
LANGUAGE plpgsql
AS $$
DECLARE
    r record;
BEGIN
    FOR r IN
        EXECUTE 'SELECT * FROM timetable.t_shared'
    LOOP
        NULL;
    END LOOP;
END;
$$;

import=# select reproduce_etcd_fdw_crash_dynamic();
 reproduce_etcd_fdw_crash_dynamic 
----------------------------------
 
(1 row)

import=# select reproduce_etcd_fdw_crash_dynamic();
 reproduce_etcd_fdw_crash_dynamic 
----------------------------------
 
(1 row)


*** LOOP WITH MATERIALIZED SQL  APPROACH *** WORKS

CREATE OR REPLACE FUNCTION reproduce_etcd_fdw_crash_materialized()
RETURNS void
LANGUAGE plpgsql
AS $$
DECLARE
    r record;
BEGIN
    CREATE TEMP TABLE tmp_etcd AS
    SELECT * FROM timetable.t_shared;

    FOR r IN
        SELECT * FROM tmp_etcd
    LOOP
        NULL;
    END LOOP;

    DROP TABLE tmp_etcd;
END;
$$;

import=# select reproduce_etcd_fdw_crash_materialized();
 reproduce_etcd_fdw_crash_materialized 
---------------------------------------
 
(1 row)

import=# select reproduce_etcd_fdw_crash_materialized();
 reproduce_etcd_fdw_crash_materialized 
---------------------------------------
 
(1 row)
 

@pashagolub

Copy link
Copy Markdown
Contributor

hey @if-loop69420 this PR doesn't solve the issue but it makes some improvements worth to consider. The minimal reproducible code from @hlihhovac shows the problem. When you call that proc twice, the second is crashing.

@Hunaid2000

Copy link
Copy Markdown
Contributor

Hello. I just came across this issue/PR and did a small initial investigation.

I can reproduce the crash scenario of reproduce_etcd_fdw_crash() on 10c07bc (before user_mapping feature). So, it seems that the bug is not related to user mapping and is perhaps an old unkown bug.

I also set client_min_messages = 'debug' so you can observe some logs from supabase/wrappers.

etcd_fdw=# CREATE foreign data wrapper etcd_fdw handler etcd_fdw_handler validator etcd_fdw_validator;
CREATE FOREIGN DATA WRAPPER
etcd_fdw=# CREATE SERVER my_etcd_server foreign data wrapper etcd_fdw options (connstr '127.0.0.1:2379');
CREATE SERVER
etcd_fdw=# CREATE foreign table test (key text, value text) server my_etcd_server options(rowid_column 'key');
CREATE FOREIGN TABLE
etcd_fdw=# SET client_min_messages = 'debug';
SET
etcd_fdw=# select reproduce_etcd_fdw_crash();
DEBUG:  ---> get_foreign_rel_size
DEBUG:  ---> get_foreign_paths
DEBUG:  ---> get_foreign_plan
DEBUG:  ---> begin_foreign_scan
DEBUG:  ---> end_foreign_scan
 reproduce_etcd_fdw_crash
--------------------------

(1 row)

etcd_fdw=# select reproduce_etcd_fdw_crash();
DEBUG:  ---> begin_foreign_scan
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.

Also, debugger showed me segfault at line in begin_scan:

let colnames: Vec<String> = columns.iter().map(|x| x.name.clone()).collect();

At this point, it seems that issue comes from pgrx or wrappers, but I can't confirm it. Just wanted to share my quick initial findings related to this issue.

The validation checks/improvements in this PR itself are nice!

@if-loop69420

Copy link
Copy Markdown
Contributor

Alright.
I was able to reproduce it as well.
I've tried to fix it for 2.5h now, but I don't think there's an easy fix.

Currently i think the problem lies mostly with etcd_client. It's using hyper (which uses tokio) internally which means we'd want a way to run a tokio runtime, but that is causing issues when running inside postgres, because the FdwState struct in supabase-wrappers is getting reused when the query plan is cached. Even if we create a new tokio runtime in every function there's still stale state (possibly in the client itself) which is causing problems.

I see 2 ways in which this can be fixed. One would be using something like ureq to send http requests manually (because i couldn't find any actively maintained etcd client library for rust that doesn't use async), or what @postgresql007 is doing currently and rewriting in c.

This PR doesn't fix the segfault problem itself, but i think it has some nice changes that we can merge if we want to.

@if-loop69420

Copy link
Copy Markdown
Contributor

As mentioned in #39 I found another one of the bugs and wrote an etcd client (http) that is sync. So no more tokio.
The stored procedure from @hlihhovac also works now.

@pashagolub

Copy link
Copy Markdown
Contributor

this PR has new client implementation or should we close it?

@if-loop69420

if-loop69420 commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

PR #39 has a new client, that doesn't use any async runtime. So it's completely sync, which is what we need for this usecase i think. I'd merge both this PR and PR #39 once we've established that #39 is fine for our current deployments

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants