fix: prevent backend segfault from user-mapping FFI and harden the FDW by postgresql007 · Pull Request #36 · cybertec-postgresql/etcd_fdw

postgresql007 · 2026-06-09T18:48:45Z

Problem

A scan running under a role with no user mapping crashed the backend with SIGSEGV (signal 11):

LOG:  server process (PID 31) was terminated by signal 11: Segmentation fault
DETAIL:  Failed process was running: CALL timetable.f_control_importers();

In EtcdFdw::new(), pg_sys::GetUserMapping() raises a Postgres ERROR (a C longjmp) when the current role has no mapping. That longjmp unwound past the already-constructed multi-threaded tokio runtime, skipping its Drop and leaking worker threads. Under pg_timetable's repeated scheduling this accumulated until the backend crashed.

Fix

Crash:

Read the user mapping before creating the runtime, so the GetUserMapping longjmp can no longer skip a live runtime.
Use a current-thread tokio runtime (a PG backend is single-threaded and the FDW only block_ons) — no worker threads to leak.
Connect to etcd lazily in begin_scan / the modify path rather than in new() (which can run at planning time).

Robustness (panics → clean errors):

iter_scan: lossy UTF-8 for arbitrary etcd bytes; index-based iteration instead of drain(0..1) (was O(n^2)).
begin_scan: no panic when a prefix increment yields invalid UTF-8.
insert / update: error on NULL key/value cells; no panic on an empty user-mapping options list.
update / delete: error on a missing key instead of silently creating it.

Build / reproducibility:

testcontainers is now an optional dependency enabled by the pg_test feature (kept out of release builds, still available for cargo pgrx test); dropped the unused serde dependency.
Track Cargo.lock (pins the wrappers git dependency) for reproducible builds.
pg_regress: CREATE EXTENSION etcd_fdw (was a stale template name).

Validation

cargo pgrx test against PostgreSQL 17 — 8/8 pass, including a new regression test (test_scan_without_user_mapping_errors_cleanly) that asserts a scan with no user mapping returns a clean error instead of crashing the backend:

test result: ok. 8 passed; 0 failed; 0 ignored

A scan running under a role with no user mapping could crash the backend with SIGSEGV (signal 11). In new(), GetUserMapping() raises a Postgres ERROR (a C longjmp) when no mapping exists; this unwound past the already constructed multi-threaded tokio runtime, skipping its Drop and leaking worker threads. Such a worker pool can also receive job-control signals on a thread other than the main backend thread. pg_timetable's repeated scheduling (CALL timetable.f_control_importers()) made it reliably fatal. Crash fix: - Read the user mapping BEFORE creating the runtime, so a longjmp from GetUserMapping can no longer skip a live runtime. - Use the framework's create_async_runtime() (a current-thread runtime): a Postgres backend is single-threaded and the FDW only block_on's, so a worker pool is both unnecessary and unsafe here. - Connect to etcd lazily in begin_scan / the modify path instead of in new(), which the framework may call at planning time. Robustness: - insert/update/delete: read the key/value out of the Cell directly instead of Display + trim_matches('), which corrupted any key/value that legitimately contained single quotes. - iter_scan: convert non-UTF-8 etcd keys/values lossily instead of panicking, and iterate by index instead of draining the vec front (was O(n^2) per scan). - begin_scan: do not panic when incrementing a prefix yields invalid UTF-8; fall back to the wider, still-correct range. - insert/update: return an error on NULL key/value cells instead of panicking; do not panic on an empty user-mapping options list. - update/delete: error on a missing key instead of silently creating it. Tests: - Regression tests: a scan with no user mapping returns a clean error instead of crashing; keys/values containing single quotes round-trip exactly. Build / CI / reproducibility: - CI: always install cargo-pgrx. Its binary lives in ~/.cargo/bin (the deps cache, keyed on Cargo.lock), not the pgrx cache, so gating the install on the pgrx cache left it missing once Cargo.lock was committed. - Make testcontainers an optional dependency enabled by the pg_test feature (built for cargo pgrx test, not for release); drop unused serde. - Track Cargo.lock (pins the wrappers git dependency) for reproducible builds. - pg_regress: CREATE EXTENSION etcd_fdw (was a stale template name).

hlihhovac · 2026-06-09T23:01:21Z

I have tested this but with the procedure timetable.f_control_importers that is used in secure gather consecutive call to execute procedure crashes the client connection

I will adjust procedure to materialized approach for now to make it work,
but for further investigation following functions for further testing:

*** PLAIN LOOP APPROACH *** FAILS

CREATE OR REPLACE FUNCTION reproduce_etcd_fdw_crash()
RETURNS void
LANGUAGE plpgsql
AS $$
DECLARE
    r record;
BEGIN
    FOR r IN
        SELECT * FROM timetable.t_shared
    LOOP
        NULL;
    END LOOP;
END;
$$;

import=# select reproduce_etcd_fdw_crash();
 reproduce_etcd_fdw_crash 
--------------------------
 
(1 row)

import=# select reproduce_etcd_fdw_crash();
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
!?> \q


*** LOOP WITH DYNAMIC SQL  APPROACH *** WORKS

CREATE OR REPLACE FUNCTION reproduce_etcd_fdw_crash_dynamic()
RETURNS void
LANGUAGE plpgsql
AS $$
DECLARE
    r record;
BEGIN
    FOR r IN
        EXECUTE 'SELECT * FROM timetable.t_shared'
    LOOP
        NULL;
    END LOOP;
END;
$$;

import=# select reproduce_etcd_fdw_crash_dynamic();
 reproduce_etcd_fdw_crash_dynamic 
----------------------------------
 
(1 row)

import=# select reproduce_etcd_fdw_crash_dynamic();
 reproduce_etcd_fdw_crash_dynamic 
----------------------------------
 
(1 row)


*** LOOP WITH MATERIALIZED SQL  APPROACH *** WORKS

CREATE OR REPLACE FUNCTION reproduce_etcd_fdw_crash_materialized()
RETURNS void
LANGUAGE plpgsql
AS $$
DECLARE
    r record;
BEGIN
    CREATE TEMP TABLE tmp_etcd AS
    SELECT * FROM timetable.t_shared;

    FOR r IN
        SELECT * FROM tmp_etcd
    LOOP
        NULL;
    END LOOP;

    DROP TABLE tmp_etcd;
END;
$$;

import=# select reproduce_etcd_fdw_crash_materialized();
 reproduce_etcd_fdw_crash_materialized 
---------------------------------------
 
(1 row)

import=# select reproduce_etcd_fdw_crash_materialized();
 reproduce_etcd_fdw_crash_materialized 
---------------------------------------
 
(1 row)

pashagolub · 2026-06-10T07:34:06Z

hey @if-loop69420 this PR doesn't solve the issue but it makes some improvements worth to consider. The minimal reproducible code from @hlihhovac shows the problem. When you call that proc twice, the second is crashing.

Hunaid2000 · 2026-06-15T12:01:06Z

Hello. I just came across this issue/PR and did a small initial investigation.

I can reproduce the crash scenario of reproduce_etcd_fdw_crash() on 10c07bc (before user_mapping feature). So, it seems that the bug is not related to user mapping and is perhaps an old unkown bug.

I also set client_min_messages = 'debug' so you can observe some logs from supabase/wrappers.

etcd_fdw=# CREATE foreign data wrapper etcd_fdw handler etcd_fdw_handler validator etcd_fdw_validator;
CREATE FOREIGN DATA WRAPPER
etcd_fdw=# CREATE SERVER my_etcd_server foreign data wrapper etcd_fdw options (connstr '127.0.0.1:2379');
CREATE SERVER
etcd_fdw=# CREATE foreign table test (key text, value text) server my_etcd_server options(rowid_column 'key');
CREATE FOREIGN TABLE
etcd_fdw=# SET client_min_messages = 'debug';
SET
etcd_fdw=# select reproduce_etcd_fdw_crash();
DEBUG:  ---> get_foreign_rel_size
DEBUG:  ---> get_foreign_paths
DEBUG:  ---> get_foreign_plan
DEBUG:  ---> begin_foreign_scan
DEBUG:  ---> end_foreign_scan
 reproduce_etcd_fdw_crash
--------------------------

(1 row)

etcd_fdw=# select reproduce_etcd_fdw_crash();
DEBUG:  ---> begin_foreign_scan
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.

Also, debugger showed me segfault at line in begin_scan:

let colnames: Vec<String> = columns.iter().map(|x| x.name.clone()).collect();

At this point, it seems that issue comes from pgrx or wrappers, but I can't confirm it. Just wanted to share my quick initial findings related to this issue.

The validation checks/improvements in this PR itself are nice!

if-loop69420 · 2026-06-16T08:18:41Z

Alright.
I was able to reproduce it as well.
I've tried to fix it for 2.5h now, but I don't think there's an easy fix.

Currently i think the problem lies mostly with etcd_client. It's using hyper (which uses tokio) internally which means we'd want a way to run a tokio runtime, but that is causing issues when running inside postgres, because the FdwState struct in supabase-wrappers is getting reused when the query plan is cached. Even if we create a new tokio runtime in every function there's still stale state (possibly in the client itself) which is causing problems.

I see 2 ways in which this can be fixed. One would be using something like ureq to send http requests manually (because i couldn't find any actively maintained etcd client library for rust that doesn't use async), or what @postgresql007 is doing currently and rewriting in c.

This PR doesn't fix the segfault problem itself, but i think it has some nice changes that we can merge if we want to.

if-loop69420 · 2026-06-16T17:12:59Z

As mentioned in #39 I found another one of the bugs and wrote an etcd client (http) that is sync. So no more tokio.
The stored procedure from @hlihhovac also works now.

pashagolub · 2026-06-17T06:55:02Z

this PR has new client implementation or should we close it?

if-loop69420 · 2026-06-17T10:42:03Z

PR #39 has a new client, that doesn't use any async runtime. So it's completely sync, which is what we need for this usecase i think. I'd merge both this PR and PR #39 once we've established that #39 is fine for our current deployments

postgresql007 force-pushed the fix/etcd-fdw-stability branch from 393cebf to e067d82 Compare June 9, 2026 19:09

This was referenced Jun 10, 2026

feat: support multiple comma-separated etcd endpoints in connstr #37

Open

feat: atomic writes via etcd transactions and key TTLs via leases #38

Open

pashagolub assigned if-loop69420 Jun 10, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: prevent backend segfault from user-mapping FFI and harden the FDW#36

fix: prevent backend segfault from user-mapping FFI and harden the FDW#36
postgresql007 wants to merge 1 commit into
mainfrom
fix/etcd-fdw-stability

postgresql007 commented Jun 9, 2026

Uh oh!

hlihhovac commented Jun 9, 2026 •

edited

Loading

Uh oh!

pashagolub commented Jun 10, 2026

Uh oh!

Hunaid2000 commented Jun 15, 2026

Uh oh!

if-loop69420 commented Jun 16, 2026

Uh oh!

if-loop69420 commented Jun 16, 2026

Uh oh!

pashagolub commented Jun 17, 2026

Uh oh!

if-loop69420 commented Jun 17, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

postgresql007 commented Jun 9, 2026

Problem

Fix

Validation

Uh oh!

hlihhovac commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pashagolub commented Jun 10, 2026

Uh oh!

Hunaid2000 commented Jun 15, 2026

Uh oh!

if-loop69420 commented Jun 16, 2026

Uh oh!

if-loop69420 commented Jun 16, 2026

Uh oh!

pashagolub commented Jun 17, 2026

Uh oh!

if-loop69420 commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

hlihhovac commented Jun 9, 2026 •

edited

Loading

if-loop69420 commented Jun 17, 2026 •

edited

Loading