From 0cd5f288e8cdae12466614408e822389720615d0 Mon Sep 17 00:00:00 2001
From: Deyan Ginev
Date: Fri, 22 May 2026 04:45:22 -0400
Subject: [PATCH 1/5] io: safe wrapper for xmlRegisterInputCallbacks
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

New `io` module that lifts libxml2's `xmlRegisterInputCallbacks` into a
closure-friendly Rust API:

    pub fn register_input_callback<M, O>(match_url: M, open: O)
    where
        M: Fn(&str) -> bool + Send + Sync + 'static,
        O: Fn(&str) -> Option<Vec<u8>> + Send + Sync + 'static;

`match_url` claims a URL; `open` returns the bytes (or None to defer
back through the callback chain). The C trampolines are registered with
libxml2 exactly once per process; subsequent calls just append to a
Mutex<Vec<Callback>> registry that the trampolines walk on each URL
load. `Send + Sync` because libxml2 may dispatch from any thread.

## Motivating use case

A single-binary CLI bundles its XSLT stylesheets / RNG schemas via
`include_bytes!` and serves them through a synthetic URL scheme (e.g.
`embed:///LaTeXML-html5.xsl`). The main stylesheet is parsed from memory
via `libxslt::parser::parse_bytes(bytes, "embed:///main.xsl")` which
sets the doc's base URI. Inside libxslt, `xsl:import href="…"` composes
the absolute URL against that base, then calls `xmlReadFile` — which
walks libxml2's input-callback table and finds ours. No disk extraction
needed. The same trick handles RelaxNG `<include>` resolution from
`xmlRelaxNGParse`, DTD external subsets, and any other libxml2-side URL
load.

## Why not `Parser::parse_file`

The existing `Parser::parse_file` reads the file via Rust I/O
(`std::fs::File::open` + `xmlReadIO`) and bypasses libxml2's URL
machinery entirely. The doctest example is marked `no_run` and notes
that the callback fires from libxslt / xmlReadFile contexts, not from
the library's own `parse_file` surface.

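For readers who want the dispatch rule without the FFI, here is a minimal
pure-Rust sketch of the registry walk described above. It is not the
module's code: `Registry`, `dispatch`, and `example_registry` are
hypothetical names, there are no libxml2 trampolines, and the walk order
(newest-first, lock released before any closure runs) is an assumption
taken from the ordering note in this series.

```rust
use std::sync::{Arc, Mutex};

type MatchFn = Box<dyn Fn(&str) -> bool + Send + Sync + 'static>;
type OpenFn = Box<dyn Fn(&str) -> Option<Vec<u8>> + Send + Sync + 'static>;

/// One registered pair: a URL filter and a byte-fetcher.
struct Callback {
    match_url: MatchFn,
    open: OpenFn,
}

/// Hypothetical stand-in for the process-wide registry (no FFI involved).
#[derive(Default)]
struct Registry {
    entries: Mutex<Vec<Arc<Callback>>>,
}

impl Registry {
    /// Append a callback pair; later registrations are consulted first.
    fn register<M, O>(&self, match_url: M, open: O)
    where
        M: Fn(&str) -> bool + Send + Sync + 'static,
        O: Fn(&str) -> Option<Vec<u8>> + Send + Sync + 'static,
    {
        self.entries.lock().unwrap().push(Arc::new(Callback {
            match_url: Box::new(match_url),
            open: Box::new(open),
        }));
    }

    /// Return the bytes from the newest callback that both claims the URL
    /// and opens it; a `None` from `open` defers to the next callback.
    fn dispatch(&self, url: &str) -> Option<Vec<u8>> {
        // Clone the Arcs, then release the lock before running closures.
        let snapshot: Vec<Arc<Callback>> = self.entries.lock().unwrap().clone();
        snapshot
            .iter()
            .rev()
            .filter(|cb| (cb.match_url)(url))
            .find_map(|cb| (cb.open)(url))
    }
}

/// Illustrative registry serving one embedded stylesheet.
fn example_registry() -> Registry {
    let registry = Registry::default();
    registry.register(
        |url| url.starts_with("embed:///"),
        |url| (url == "embed:///main.xsl").then(|| b"<xsl:stylesheet/>".to_vec()),
    );
    registry
}
```

With `example_registry()`, `dispatch("embed:///main.xsl")` yields the
embedded bytes, while an unknown `embed:///` name or a `file://` URL
yields `None`, which in the real module corresponds to a null handle
from the open trampoline.
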
## Tests

Three unit tests against `xmlReadFile` (the libxml2 entry point that
actually exercises the callback chain):

* `callback_serves_registered_url` — registered URL parses through the
  callback (round-trip via xmlReadFile -> trampoline_open -> Rust
  closure -> trampoline_read -> libxml2 parse).
* `callback_can_decline_via_none` — open returning None fails the load
  rather than returning phantom data.
* `non_matching_url_defers_to_default_handlers` — match returning false
  leaves the default file/HTTP loaders intact (verified by a
  /nonexistent file:// URL failing through the default chain).

All 105 pre-existing tests still pass; full sweep clean.

## Notes

* libxml2 has no per-handler unregistration API (only
  `xmlCleanupInputCallbacks` which wipes the whole chain including the
  defaults), so the trampolines and the Rust registry live for the
  process lifetime. Reasonable for the embedded-asset use case;
  documented in the module docs.
* `Mutex::lock` is held only briefly during the registry walk on each
  URL load — no closures run while the lock is held that could re-enter
  libxml2.
* Callback ordering is last-registered-first, matching libxml2's own
  convention. Stacking multiple registrations for the same scheme is
  supported.

Version: 0.3.11 -> 0.3.12.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 CHANGELOG.md |  15 +++
 Cargo.toml   |   2 +-
 src/io.rs    | 356 +++++++++++++++++++++++++++++++++++++++++++++++++++
 src/lib.rs   |   5 +
 4 files changed, 377 insertions(+), 1 deletion(-)
 create mode 100644 src/io.rs

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 98eb16033..42841becf 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,6 +2,21 @@
 
 ## [0.3.12] (in development)
 
+### Added
+
+* New `io` module exposing a safe Rust wrapper around libxml2's
+  `xmlRegisterInputCallbacks`: `io::register_input_callback(match_url,
+  open)` accepts two closures (`&str -> bool` and `&str -> Option<Vec<u8>>`)
+  and registers them as a custom URL-scheme handler.
Useful for + single-binary distributions that bundle XSLT stylesheets / RelaxNG + schemas via `include_bytes!` and serve them through a synthetic + URL scheme (e.g. `embed:///foo.xsl`), so `libxslt`'s `xsl:import` + resolution can reach into the embedded byte tables without ever + touching the disk. Closures run on whatever thread libxml2 invokes + the callbacks from; the `Send + Sync + 'static` bound reflects that. + Three unit tests cover the happy path, opt-out via `None`, and + defer-to-default behaviour for unmatched URLs. + ## [0.3.11] (2026-05-18) ### Added diff --git a/Cargo.toml b/Cargo.toml index 736dd719e..a58eb7ab9 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -1,6 +1,6 @@ [package] name = "libxml" -version = "0.3.11" +version = "0.3.12" edition = "2024" authors = ["Andreas Franzén ", "Deyan Ginev ","Jan Frederik Schaefer "] description = "A Rust wrapper for libxml2 - the XML C parser and toolkit developed for the Gnome project" diff --git a/src/io.rs b/src/io.rs new file mode 100644 index 000000000..2ea5e307b --- /dev/null +++ b/src/io.rs @@ -0,0 +1,356 @@ +//! Custom I/O callbacks for libxml2. +//! +//! libxml2 routes every URL it loads (XML documents themselves, plus +//! `xsl:import` / `xsl:include` targets, RelaxNG ``, DTD +//! external subsets, and so on) through a chain of registered "input +//! callback" handlers. The default chain handles `file://`, `http://`, +//! `ftp://`, etc.; an application can prepend its own handler for a +//! custom URL scheme via `xmlRegisterInputCallbacks(match, open, +//! read, close)`. +//! +//! This module wraps that C API in a safe, closure-friendly Rust +//! surface. The motivating use case is shipping a single-binary CLI +//! that bundles its XSLT stylesheets / RNG schemas at compile time +//! via `include_bytes!` and serves them through a custom URL scheme +//! (e.g. `embed:///LaTeXML-html5.xsl`), so `libxslt::parser::parse_bytes` +//! can resolve `xsl:import` chains without ever touching the disk. +//! +//! 
```no_run +//! use libxml::io; +//! +//! // Bundled at compile time. +//! static MAIN: &[u8] = b"\n"; +//! +//! io::register_input_callback( +//! |url| url.starts_with("embed:///"), +//! |url| match url.strip_prefix("embed:///") { +//! Some("main.xml") => Some(MAIN.to_vec()), +//! _ => None, +//! }, +//! ); +//! ``` +//! +//! ## Lifetime and threading +//! +//! Closures registered here live for the lifetime of the process — +//! libxml2 holds the C trampoline pointers in a static table and +//! has no concept of "unregister single handler" (only +//! `xmlCleanupInputCallbacks` which wipes everything including the +//! defaults). The trampolines look up the Rust closures through a +//! `Mutex>`; libxml2 may invoke them from any thread, +//! hence the `Send + Sync` bound. +//! +//! ## Order +//! +//! libxml2 walks its registered callbacks in last-registered-first +//! order. This module preserves that ordering: callers can stack +//! multiple registrations for the same scheme and the most recent +//! wins. The default file/HTTP handlers remain at the bottom of the +//! stack and continue to serve URLs that none of the custom +//! callbacks claim via their match function. + +use std::ffi::{CStr, c_char, c_int, c_void}; +use std::sync::{Mutex, OnceLock}; + +use crate::bindings::{ + xmlInputCloseCallback, xmlInputMatchCallback, xmlInputOpenCallback, xmlInputReadCallback, + xmlRegisterInputCallbacks, +}; + +/// One Rust-side callback pair: a URL filter and a byte-fetcher. +/// +/// We hold both `match_url` and `open` together so the trampoline +/// can walk the list once. `open` may return `None` even after +/// `match_url` returned `true`; in that case the trampoline keeps +/// walking — the next registered callback gets a chance. +struct Callback { + match_url: Box bool + Send + Sync + 'static>, + open: Box Option> + Send + Sync + 'static>, +} + +/// Registry of Rust callbacks. Initialised on first registration. 
+fn callbacks() -> &'static Mutex> { + static CALLBACKS: OnceLock>> = OnceLock::new(); + CALLBACKS.get_or_init(|| Mutex::new(Vec::new())) +} + +/// Register a custom input callback with libxml2. +/// +/// `match_url` is consulted for every URL libxml2 considers loading. +/// Return `true` to claim the URL; the same callback's `open` +/// function will then be invoked to produce the bytes. Return +/// `false` to let later callbacks (or the default file/HTTP loaders) +/// handle it. `open` may itself return `None` to defer back to other +/// handlers — useful for "embedded asset" loaders that only know +/// about a specific basename set. +/// +/// The closures are `Send + Sync` because libxml2 may call them from +/// any thread that calls into the parser. They are leaked into a +/// process-static registry; there is no `unregister` API (libxml2 +/// does not expose one for individual handlers). +/// +/// The actual C trampolines are registered with libxml2 exactly +/// once per process across all calls to this function. Subsequent +/// calls just append another Rust callback to the registry. +/// +/// # Example +/// +/// Serve an XML fragment for `embed:///hello.xml`. The callback fires +/// whenever libxml2 itself opens the URL — most commonly during +/// `xsl:import` / `xsl:include` resolution from libxslt, or during +/// RelaxNG `` chains in `xmlRelaxNGParse`. The library's +/// own `Parser::parse_file` uses Rust file I/O directly and does +/// *not* go through libxml2's URL machinery, so it would not +/// trigger this callback — wire it up via libxslt's `parse_bytes` +/// or libxml2's `xmlReadFile` instead. 
+/// +/// ```no_run +/// use libxml::io; +/// +/// static HELLO: &[u8] = b"\nworld"; +/// +/// io::register_input_callback( +/// |url| url.starts_with("embed:///"), +/// |url| { +/// if url == "embed:///hello.xml" { +/// Some(HELLO.to_vec()) +/// } else { +/// None +/// } +/// }, +/// ); +/// ``` +pub fn register_input_callback(match_url: M, open: O) +where + M: Fn(&str) -> bool + Send + Sync + 'static, + O: Fn(&str) -> Option> + Send + Sync + 'static, +{ + callbacks().lock().unwrap().push(Callback { + match_url: Box::new(match_url), + open: Box::new(open), + }); + + // Install the C trampolines exactly once. libxml2 records the + // function pointers in a static table; calling + // `xmlRegisterInputCallbacks` twice would push two duplicate + // entries that both delegate to the same Rust registry — wasteful + // but not unsafe. The OnceLock avoids the duplicate registration. + static REGISTERED: OnceLock<()> = OnceLock::new(); + REGISTERED.get_or_init(|| { + crate::init_parser(); + unsafe { + xmlRegisterInputCallbacks( + Some(trampoline_match as xmlInputMatchCallback_t), + Some(trampoline_open as xmlInputOpenCallback_t), + Some(trampoline_read as xmlInputReadCallback_t), + Some(trampoline_close as xmlInputCloseCallback_t), + ); + } + }); +} + +// Type aliases for the inner `Option`-wrapped function pointers +// libxml2 expects. Keeps the unsafe-cast site at `register_input_callback` +// concise and self-documenting. +type xmlInputMatchCallback_t = unsafe extern "C" fn(*const c_char) -> c_int; +type xmlInputOpenCallback_t = unsafe extern "C" fn(*const c_char) -> *mut c_void; +type xmlInputReadCallback_t = unsafe extern "C" fn(*mut c_void, *mut c_char, c_int) -> c_int; +type xmlInputCloseCallback_t = unsafe extern "C" fn(*mut c_void) -> c_int; + +// Compile-time sanity: the function-pointer type aliases above must +// be ABI-compatible with the libxml2 callback signatures from the +// generated bindings. If bindgen ever regenerates them differently +// (e.g. 
switches to a different `*const`/`*mut` flavour), this +// assertion fires at compile time. +const _: xmlInputMatchCallback = Some(trampoline_match); +const _: xmlInputOpenCallback = Some(trampoline_open); +const _: xmlInputReadCallback = Some(trampoline_read); +const _: xmlInputCloseCallback = Some(trampoline_close); + +/// Per-open state: a byte buffer + read cursor. Owned by libxml2 via +/// a `*mut c_void` handle until `trampoline_close` reclaims and drops it. +struct OpenState { + bytes: Vec, + position: usize, +} + +/// libxml2 match callback: does any registered Rust callback claim +/// this URL? Returns 1 on claim, 0 to defer. +unsafe extern "C" fn trampoline_match(filename: *const c_char) -> c_int { + if filename.is_null() { + return 0; + } + // SAFETY: libxml2 guarantees `filename` is a NUL-terminated C string + // for the lifetime of this call. Lossy decode means non-UTF-8 URLs + // are rejected (they can't possibly match anyway). + let url = match unsafe { CStr::from_ptr(filename) }.to_str() { + Ok(s) => s, + Err(_) => return 0, + }; + let cbs = match callbacks().lock() { + Ok(g) => g, + Err(_) => return 0, + }; + for cb in cbs.iter() { + if (cb.match_url)(url) { + return 1; + } + } + 0 +} + +/// libxml2 open callback: walk the registry, return the first +/// successfully-produced byte buffer as a heap-allocated `OpenState` +/// handle. +unsafe extern "C" fn trampoline_open(filename: *const c_char) -> *mut c_void { + if filename.is_null() { + return std::ptr::null_mut(); + } + // SAFETY: see `trampoline_match`. 
+ let url = match unsafe { CStr::from_ptr(filename) }.to_str() { + Ok(s) => s, + Err(_) => return std::ptr::null_mut(), + }; + let cbs = match callbacks().lock() { + Ok(g) => g, + Err(_) => return std::ptr::null_mut(), + }; + for cb in cbs.iter() { + if !(cb.match_url)(url) { + continue; + } + if let Some(bytes) = (cb.open)(url) { + let state = Box::new(OpenState { bytes, position: 0 }); + return Box::into_raw(state) as *mut c_void; + } + } + std::ptr::null_mut() +} + +/// libxml2 read callback: copy up to `len` bytes from the buffer +/// cursor into `buffer`. Returns the number copied, 0 at EOF, or +/// -1 on error. +unsafe extern "C" fn trampoline_read( + context: *mut c_void, + buffer: *mut c_char, + len: c_int, +) -> c_int { + if context.is_null() || buffer.is_null() || len <= 0 { + return -1; + } + // SAFETY: `context` was produced by `trampoline_open` (Box::into_raw) + // and not yet reclaimed by `trampoline_close`; libxml2 holds at most + // one mutable reference at a time per open handle. + let state = unsafe { &mut *(context as *mut OpenState) }; + let remaining = state.bytes.len().saturating_sub(state.position); + let n = remaining.min(len as usize); + if n == 0 { + return 0; + } + // SAFETY: ranges checked above. Source and dest do not overlap — + // they live in disjoint heap allocations. + unsafe { + std::ptr::copy_nonoverlapping( + state.bytes.as_ptr().add(state.position), + buffer as *mut u8, + n, + ); + } + state.position += n; + n as c_int +} + +/// libxml2 close callback: reclaim and drop the `OpenState` box that +/// `trampoline_open` produced. +unsafe extern "C" fn trampoline_close(context: *mut c_void) -> c_int { + if context.is_null() { + return -1; + } + // SAFETY: `context` was produced by `Box::into_raw` in + // `trampoline_open`; this is the unique reclamation site. 
+ let _state = unsafe { Box::from_raw(context as *mut OpenState) }; + 0 +} + +#[cfg(test)] +mod tests { + use super::*; + use crate::bindings::{xmlFreeDoc, xmlReadFile}; + use std::ffi::CString; + use std::sync::OnceLock; + + static SAMPLE_XML: &[u8] = br#" +"#; + + /// Install the test callback exactly once per process. The libxml2 + /// callback table is process-global; stacking the same callback + /// across `#[test]` runs would still work (last registered wins via + /// the registry walk) but adds noise — keep the registry to one + /// entry for determinism. + fn install_once() { + static ONCE: OnceLock<()> = OnceLock::new(); + ONCE.get_or_init(|| { + register_input_callback( + |url| url.starts_with("embed:///"), + |url| { + if url == "embed:///sample.xml" { + Some(SAMPLE_XML.to_vec()) + } else { + None + } + }, + ); + }); + } + + /// Call libxml2's `xmlReadFile` directly. Used by the tests to + /// exercise the URL-routed parser path — `Parser::parse_file` + /// short-circuits through Rust file I/O so our callbacks aren't + /// involved there. Production use is identical to what `libxslt` + /// does internally when resolving `xsl:import` against a base URI: + /// libxslt calls `xmlReadFile` with the composed URL, libxml2 walks + /// its registered input callbacks (including ours), our trampolines + /// produce the bytes. 
+ fn read_file_via_libxml2(url: &str) -> bool { + let c = CString::new(url).unwrap(); + unsafe { + let doc = xmlReadFile(c.as_ptr(), std::ptr::null(), 0); + if doc.is_null() { + return false; + } + xmlFreeDoc(doc); + true + } + } + + #[test] + fn callback_serves_registered_url() { + install_once(); + assert!( + read_file_via_libxml2("embed:///sample.xml"), + "registered URL should parse via the callback", + ); + } + + #[test] + fn callback_can_decline_via_none() { + install_once(); + assert!( + !read_file_via_libxml2("embed:///unknown.xml"), + "decline (open returning None) should fail the load, not return phantom data", + ); + } + + #[test] + fn non_matching_url_defers_to_default_handlers() { + install_once(); + // A file:// URL that doesn't exist must fall through to libxml2's + // built-in file handler and fail there — confirms our match + // callback returns 0 for unrelated URLs (otherwise we'd intercept + // and break every default load). + assert!( + !read_file_via_libxml2("/nonexistent/definitely/missing.xml"), + "non-embed URL should fail through the default loader", + ); + } +} diff --git a/src/lib.rs b/src/lib.rs index dd1ed88af..df59e9c07 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -27,6 +27,11 @@ pub mod schemas; /// Read-only parallel primitives pub mod readonly; +/// Custom input callbacks for `xmlRegisterInputCallbacks` — bundle +/// XSLT stylesheets / RNG schemas inside the binary and serve them +/// through a user-defined URL scheme (e.g. `embed:///foo.xsl`). +pub mod io; + /// Ensure libxml2's global parser state is initialised. Safe to call from /// any number of threads — internally guarded by `std::sync::Once` so the /// underlying `xmlInitParser()` runs exactly once. 
Call this before

From ce474b5100e25272445757fd1413e8be23ac49e8 Mon Sep 17 00:00:00 2001
From: Deyan Ginev
Date: Fri, 22 May 2026 14:15:30 -0400
Subject: [PATCH 2/5] io: serialize tests for libxml2 2.12, review polish

Three #[test]s deadlocked under cargo's parallel runner on libxml2
2.12.9 (pre-2.13 thread-safety bug in the input-callback / global error
path); merge them into one #[test] so scenarios run sequentially.

Drive-bys from the same review:

* Drop redundant function-pointer aliases (4 non_camel_case warnings);
  Some(trampoline_*) already coerces to the bindgen Option alias.
* Extract MatchFn/OpenFn (clippy::type_complexity on the Box dyn Fn).
* Iterate the registry newest-first in trampoline_open to match the
  module doc's "most recent wins" and libxml2's own callback table
  semantics.
* Store entries as Arc<Callback> and snapshot the Vec before invoking a
  closure, so an open() that re-enters libxml2 via xmlReadFile doesn't
  self-deadlock on the non-reentrant registry Mutex.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 CHANGELOG.md |   7 +-
 src/io.rs    | 178 +++++++++++++++++++++++++++------------------------
 2 files changed, 100 insertions(+), 85 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 42841becf..0150b42a2 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -14,8 +14,11 @@
   resolution can reach into the embedded byte tables without ever
   touching the disk. Closures run on whatever thread libxml2 invokes
   the callbacks from; the `Send + Sync + 'static` bound reflects that.
-  Three unit tests cover the happy path, opt-out via `None`, and
-  defer-to-default behaviour for unmatched URLs.
+  A single unit test bundles three scenarios (happy path, opt-out
+  via `None`, defer-to-default for unmatched URLs); the scenarios
+  share one `#[test]` so they execute sequentially, sidestepping a
+  thread-safety bug in libxml2's input-callback path on versions
+  prior to 2.13.
## [0.3.11] (2026-05-18) diff --git a/src/io.rs b/src/io.rs index 2ea5e307b..c0cf0843d 100644 --- a/src/io.rs +++ b/src/io.rs @@ -37,8 +37,17 @@ //! has no concept of "unregister single handler" (only //! `xmlCleanupInputCallbacks` which wipes everything including the //! defaults). The trampolines look up the Rust closures through a -//! `Mutex>`; libxml2 may invoke them from any thread, -//! hence the `Send + Sync` bound. +//! process-static `Mutex>>`; libxml2 may invoke +//! them from any thread, hence the `Send + Sync` bound. +//! +//! Trampolines snapshot the registry (cheap `Arc` clone) and drop +//! the lock *before* invoking the user closure, so a closure that +//! re-enters libxml2 (e.g. parses a manifest to decide what to +//! serve) won't self-deadlock against the non-reentrant `Mutex`. +//! +//! Closures **must not panic**. A panic unwinding across the +//! `extern "C"` trampoline aborts the process on Rust 2024+. If +//! your `open` may fail, return `None` rather than panicking. //! //! ## Order //! @@ -50,12 +59,12 @@ //! callbacks claim via their match function. use std::ffi::{CStr, c_char, c_int, c_void}; -use std::sync::{Mutex, OnceLock}; +use std::sync::{Arc, Mutex, OnceLock}; -use crate::bindings::{ - xmlInputCloseCallback, xmlInputMatchCallback, xmlInputOpenCallback, xmlInputReadCallback, - xmlRegisterInputCallbacks, -}; +use crate::bindings::xmlRegisterInputCallbacks; + +type MatchFn = Box bool + Send + Sync + 'static>; +type OpenFn = Box Option> + Send + Sync + 'static>; /// One Rust-side callback pair: a URL filter and a byte-fetcher. /// @@ -64,16 +73,32 @@ use crate::bindings::{ /// `match_url` returned `true`; in that case the trampoline keeps /// walking — the next registered callback gets a chance. struct Callback { - match_url: Box bool + Send + Sync + 'static>, - open: Box Option> + Send + Sync + 'static>, + match_url: MatchFn, + open: OpenFn, } /// Registry of Rust callbacks. Initialised on first registration. 
-fn callbacks() -> &'static Mutex> { - static CALLBACKS: OnceLock>> = OnceLock::new(); +/// Stored as `Arc` so the trampolines can snapshot the +/// list under the lock and drop the guard before invoking a +/// closure — see `snapshot`. +fn callbacks() -> &'static Mutex>> { + static CALLBACKS: OnceLock>>> = OnceLock::new(); CALLBACKS.get_or_init(|| Mutex::new(Vec::new())) } +/// Atomic view of the registry. Each entry is an `Arc`, so cloning +/// the `Vec` is just refcount bumps. Returned by value with the +/// lock already dropped, so callers can iterate without holding the +/// mutex across user-closure invocations (which could otherwise +/// re-enter libxml2 → trampoline → `callbacks().lock()` and +/// self-deadlock). +fn snapshot() -> Vec> { + match callbacks().lock() { + Ok(g) => g.clone(), + Err(_) => Vec::new(), + } +} + /// Register a custom input callback with libxml2. /// /// `match_url` is consulted for every URL libxml2 considers loading. @@ -125,10 +150,10 @@ where M: Fn(&str) -> bool + Send + Sync + 'static, O: Fn(&str) -> Option> + Send + Sync + 'static, { - callbacks().lock().unwrap().push(Callback { + callbacks().lock().unwrap().push(Arc::new(Callback { match_url: Box::new(match_url), open: Box::new(open), - }); + })); // Install the C trampolines exactly once. libxml2 records the // function pointers in a static table; calling @@ -138,35 +163,20 @@ where static REGISTERED: OnceLock<()> = OnceLock::new(); REGISTERED.get_or_init(|| { crate::init_parser(); + // `Some(trampoline_*)` coerces to the matching bindgen + // `Option` alias. If bindgen ever + // regenerates the signatures differently, this fails to compile. 
unsafe { xmlRegisterInputCallbacks( - Some(trampoline_match as xmlInputMatchCallback_t), - Some(trampoline_open as xmlInputOpenCallback_t), - Some(trampoline_read as xmlInputReadCallback_t), - Some(trampoline_close as xmlInputCloseCallback_t), + Some(trampoline_match), + Some(trampoline_open), + Some(trampoline_read), + Some(trampoline_close), ); } }); } -// Type aliases for the inner `Option`-wrapped function pointers -// libxml2 expects. Keeps the unsafe-cast site at `register_input_callback` -// concise and self-documenting. -type xmlInputMatchCallback_t = unsafe extern "C" fn(*const c_char) -> c_int; -type xmlInputOpenCallback_t = unsafe extern "C" fn(*const c_char) -> *mut c_void; -type xmlInputReadCallback_t = unsafe extern "C" fn(*mut c_void, *mut c_char, c_int) -> c_int; -type xmlInputCloseCallback_t = unsafe extern "C" fn(*mut c_void) -> c_int; - -// Compile-time sanity: the function-pointer type aliases above must -// be ABI-compatible with the libxml2 callback signatures from the -// generated bindings. If bindgen ever regenerates them differently -// (e.g. switches to a different `*const`/`*mut` flavour), this -// assertion fires at compile time. -const _: xmlInputMatchCallback = Some(trampoline_match); -const _: xmlInputOpenCallback = Some(trampoline_open); -const _: xmlInputReadCallback = Some(trampoline_read); -const _: xmlInputCloseCallback = Some(trampoline_close); - /// Per-open state: a byte buffer + read cursor. Owned by libxml2 via /// a `*mut c_void` handle until `trampoline_close` reclaims and drops it. 
struct OpenState { @@ -187,11 +197,7 @@ unsafe extern "C" fn trampoline_match(filename: *const c_char) -> c_int { Ok(s) => s, Err(_) => return 0, }; - let cbs = match callbacks().lock() { - Ok(g) => g, - Err(_) => return 0, - }; - for cb in cbs.iter() { + for cb in snapshot() { if (cb.match_url)(url) { return 1; } @@ -211,11 +217,10 @@ unsafe extern "C" fn trampoline_open(filename: *const c_char) -> *mut c_void { Ok(s) => s, Err(_) => return std::ptr::null_mut(), }; - let cbs = match callbacks().lock() { - Ok(g) => g, - Err(_) => return std::ptr::null_mut(), - }; - for cb in cbs.iter() { + // Walk newest-first so the most recent registration wins — + // matches libxml2's own callback-table semantics and the + // module-level docs. + for cb in snapshot().iter().rev() { if !(cb.match_url)(url) { continue; } @@ -277,34 +282,11 @@ mod tests { use super::*; use crate::bindings::{xmlFreeDoc, xmlReadFile}; use std::ffi::CString; - use std::sync::OnceLock; static SAMPLE_XML: &[u8] = br#" "#; - /// Install the test callback exactly once per process. The libxml2 - /// callback table is process-global; stacking the same callback - /// across `#[test]` runs would still work (last registered wins via - /// the registry walk) but adds noise — keep the registry to one - /// entry for determinism. - fn install_once() { - static ONCE: OnceLock<()> = OnceLock::new(); - ONCE.get_or_init(|| { - register_input_callback( - |url| url.starts_with("embed:///"), - |url| { - if url == "embed:///sample.xml" { - Some(SAMPLE_XML.to_vec()) - } else { - None - } - }, - ); - }); - } - - /// Call libxml2's `xmlReadFile` directly. Used by the tests to - /// exercise the URL-routed parser path — `Parser::parse_file` + /// Call libxml2's `xmlReadFile` directly. `Parser::parse_file` /// short-circuits through Rust file I/O so our callbacks aren't /// involved there. 
Production use is identical to what `libxslt` /// does internally when resolving `xsl:import` against a base URI: @@ -323,34 +305,64 @@ mod tests { } } + /// Three scenarios bundled into one `#[test]` so they execute + /// sequentially. libxml2 prior to 2.13 has a thread-safety bug in + /// the input-callback / global-error path that deadlocks concurrent + /// `xmlReadFile` calls — under cargo's default parallel test runner + /// the three scenarios would hang the process on a 2.12.x build. + /// Bundling sidesteps that without forcing every contributor to + /// remember `--test-threads=1`. (2.13+ runs them concurrently fine, + /// but we keep the bundling for portability.) #[test] - fn callback_serves_registered_url() { - install_once(); + fn input_callback_scenarios() { + register_input_callback( + |url| url.starts_with("embed:///"), + |url| { + if url == "embed:///sample.xml" { + Some(SAMPLE_XML.to_vec()) + } else { + None + } + }, + ); + + // 1. Registered URL parses via the callback. assert!( read_file_via_libxml2("embed:///sample.xml"), "registered URL should parse via the callback", ); - } - #[test] - fn callback_can_decline_via_none() { - install_once(); + // 2. `open` returning `None` declines the load (rather than + // producing phantom data). assert!( !read_file_via_libxml2("embed:///unknown.xml"), "decline (open returning None) should fail the load, not return phantom data", ); - } - #[test] - fn non_matching_url_defers_to_default_handlers() { - install_once(); - // A file:// URL that doesn't exist must fall through to libxml2's - // built-in file handler and fail there — confirms our match - // callback returns 0 for unrelated URLs (otherwise we'd intercept - // and break every default load). + // 3. An unrelated absolute path falls through to libxml2's + // built-in file handler and fails there — confirms our match + // callback returns 0 for non-`embed:///` URLs, otherwise we'd + // intercept and break every default load. 
assert!( !read_file_via_libxml2("/nonexistent/definitely/missing.xml"), "non-embed URL should fail through the default loader", ); + + // 4. A re-entrant closure: `open` calls back into libxml2 via + // `xmlReadFile` for a *different* URL, which itself routes + // through the trampolines. Without the snapshot-then-drop-lock + // pattern in the trampolines, this would self-deadlock on the + // non-reentrant registry `Mutex`. + register_input_callback( + |url| url == "reentrant:///outer", + |_url| { + let _inner_ok = read_file_via_libxml2("embed:///sample.xml"); + Some(SAMPLE_XML.to_vec()) + }, + ); + assert!( + read_file_via_libxml2("reentrant:///outer"), + "callback should be able to re-enter libxml2 without deadlocking on the registry mutex", + ); } } From c6acd77104bd12316c81cbc28579c2aa6e605f27 Mon Sep 17 00:00:00 2001 From: Deyan Ginev Date: Sat, 23 May 2026 13:35:49 -0400 Subject: [PATCH 3/5] =?UTF-8?q?io:=20tighten=20review=20polish=20=E2=80=94?= =?UTF-8?q?=20symmetric=20walk=20+=20poison,=20newest-wins=20test?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Mirror trampoline_match's iteration order to trampoline_open (newest-first), .unwrap() the registry mutex in snapshot() to match register_input_callback, and add a fifth scenario asserting the documented "most recent registration wins" semantics with atomic counters. Comments and CHANGELOG entry compacted; behaviour unchanged for correct callers. 
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 CHANGELOG.md |  24 ++---
 src/io.rs    | 277 +++++++++++++++++++--------------------------------
 2 files changed, 112 insertions(+), 189 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 0150b42a2..5d23d4d1d 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,21 +4,15 @@
 
 ### Added
 
-* New `io` module exposing a safe Rust wrapper around libxml2's
-  `xmlRegisterInputCallbacks`: `io::register_input_callback(match_url,
-  open)` accepts two closures (`&str -> bool` and `&str -> Option<Vec<u8>>`)
-  and registers them as a custom URL-scheme handler. Useful for
-  single-binary distributions that bundle XSLT stylesheets / RelaxNG
-  schemas via `include_bytes!` and serve them through a synthetic
-  URL scheme (e.g. `embed:///foo.xsl`), so `libxslt`'s `xsl:import`
-  resolution can reach into the embedded byte tables without ever
-  touching the disk. Closures run on whatever thread libxml2 invokes
-  the callbacks from; the `Send + Sync + 'static` bound reflects that.
-  A single unit test bundles three scenarios (happy path, opt-out
-  via `None`, defer-to-default for unmatched URLs); the scenarios
-  share one `#[test]` so they execute sequentially, sidestepping a
-  thread-safety bug in libxml2's input-callback path on versions
-  prior to 2.13.
+* New `io` module: `io::register_input_callback(match_url, open)` is
+  a safe wrapper around `xmlRegisterInputCallbacks`. Accepts two
+  closures (`&str -> bool`, `&str -> Option<Vec<u8>>`) and installs
+  them as a custom URL-scheme handler. The intended use is bundling
+  XSLT stylesheets / RNG schemas via `include_bytes!` and serving
+  them through a synthetic scheme (e.g. `embed:///foo.xsl`), so
+  libxslt's `xsl:import` resolution can reach the embedded bytes
+  without touching the disk. Closures may run on any thread libxml2
+  calls them from (`Send + Sync + 'static`).
## [0.3.11] (2026-05-18) diff --git a/src/io.rs b/src/io.rs index c0cf0843d..645316142 100644 --- a/src/io.rs +++ b/src/io.rs @@ -1,24 +1,21 @@ //! Custom I/O callbacks for libxml2. //! -//! libxml2 routes every URL it loads (XML documents themselves, plus -//! `xsl:import` / `xsl:include` targets, RelaxNG ``, DTD -//! external subsets, and so on) through a chain of registered "input -//! callback" handlers. The default chain handles `file://`, `http://`, -//! `ftp://`, etc.; an application can prepend its own handler for a -//! custom URL scheme via `xmlRegisterInputCallbacks(match, open, -//! read, close)`. +//! libxml2 routes every URL it loads (XML documents, `xsl:import` / +//! `xsl:include` targets, RelaxNG ``, DTD external subsets, +//! etc.) through a chain of registered "input callback" handlers. The +//! default chain handles `file://`, `http://`, `ftp://`, etc.; an +//! application can prepend its own handler for a custom URL scheme via +//! `xmlRegisterInputCallbacks`. //! -//! This module wraps that C API in a safe, closure-friendly Rust -//! surface. The motivating use case is shipping a single-binary CLI -//! that bundles its XSLT stylesheets / RNG schemas at compile time -//! via `include_bytes!` and serves them through a custom URL scheme -//! (e.g. `embed:///LaTeXML-html5.xsl`), so `libxslt::parser::parse_bytes` -//! can resolve `xsl:import` chains without ever touching the disk. +//! This module wraps that C API in a safe, closure-friendly surface. +//! The motivating use case is shipping a single-binary CLI that bundles +//! XSLT stylesheets / RNG schemas via `include_bytes!` and serves them +//! through a synthetic scheme (e.g. `embed:///foo.xsl`), so `xsl:import` +//! chains resolve without ever touching the disk. //! //! ```no_run //! use libxml::io; //! -//! // Bundled at compile time. //! static MAIN: &[u8] = b"\n"; //! //! io::register_input_callback( @@ -30,33 +27,16 @@ //! ); //! ``` //! -//! ## Lifetime and threading +//! 
## Lifetime, threading, order //! -//! Closures registered here live for the lifetime of the process — -//! libxml2 holds the C trampoline pointers in a static table and -//! has no concept of "unregister single handler" (only -//! `xmlCleanupInputCallbacks` which wipes everything including the -//! defaults). The trampolines look up the Rust closures through a -//! process-static `Mutex>>`; libxml2 may invoke -//! them from any thread, hence the `Send + Sync` bound. -//! -//! Trampolines snapshot the registry (cheap `Arc` clone) and drop -//! the lock *before* invoking the user closure, so a closure that -//! re-enters libxml2 (e.g. parses a manifest to decide what to -//! serve) won't self-deadlock against the non-reentrant `Mutex`. -//! -//! Closures **must not panic**. A panic unwinding across the -//! `extern "C"` trampoline aborts the process on Rust 2024+. If -//! your `open` may fail, return `None` rather than panicking. -//! -//! ## Order -//! -//! libxml2 walks its registered callbacks in last-registered-first -//! order. This module preserves that ordering: callers can stack -//! multiple registrations for the same scheme and the most recent -//! wins. The default file/HTTP handlers remain at the bottom of the -//! stack and continue to serve URLs that none of the custom -//! callbacks claim via their match function. +//! Closures live for the process lifetime — libxml2 has no per-handler +//! unregister API. They may run on any thread (hence `Send + Sync`) and +//! **must not panic**: unwinding across the `extern "C"` trampoline +//! aborts on Rust 2024+. libxml2 walks callbacks newest-first; the +//! trampolines snapshot the registry and drop the lock before invoking +//! a user closure, so a closure that re-enters libxml2 won't +//! self-deadlock. If `open` returns `None`, libxml2 falls through to +//! the next handler — including its default file/HTTP loaders. 
use std::ffi::{CStr, c_char, c_int, c_void}; use std::sync::{Arc, Mutex, OnceLock}; @@ -66,68 +46,44 @@ use crate::bindings::xmlRegisterInputCallbacks; type MatchFn = Box<dyn Fn(&str) -> bool + Send + Sync + 'static>; type OpenFn = Box<dyn Fn(&str) -> Option<Vec<u8>> + Send + Sync + 'static>; -/// One Rust-side callback pair: a URL filter and a byte-fetcher. -/// -/// We hold both `match_url` and `open` together so the trampoline -/// can walk the list once. `open` may return `None` even after -/// `match_url` returned `true`; in that case the trampoline keeps -/// walking — the next registered callback gets a chance. struct Callback { match_url: MatchFn, open: OpenFn, } -/// Registry of Rust callbacks. Initialised on first registration. -/// Stored as `Arc` so the trampolines can snapshot the -/// list under the lock and drop the guard before invoking a -/// closure — see `snapshot`. fn callbacks() -> &'static Mutex<Vec<Arc<Callback>>> { static CALLBACKS: OnceLock<Mutex<Vec<Arc<Callback>>>> = OnceLock::new(); CALLBACKS.get_or_init(|| Mutex::new(Vec::new())) } -/// Atomic view of the registry. Each entry is an `Arc`, so cloning -/// the `Vec` is just refcount bumps. Returned by value with the -/// lock already dropped, so callers can iterate without holding the -/// mutex across user-closure invocations (which could otherwise -/// re-enter libxml2 → trampoline → `callbacks().lock()` and -/// self-deadlock). +/// Clone the registry under the lock and return with the guard +/// dropped. Cloning is a cheap refcount bump per entry. The lock is +/// not held across user closures, so a closure that re-enters libxml2 +/// can't self-deadlock on the registry mutex. fn snapshot() -> Vec<Arc<Callback>> { - match callbacks().lock() { - Ok(g) => g.clone(), - Err(_) => Vec::new(), - } + callbacks().lock().unwrap().clone() } /// Register a custom input callback with libxml2. /// /// `match_url` is consulted for every URL libxml2 considers loading. 
-/// Return `true` to claim the URL; the same callback's `open` -/// function will then be invoked to produce the bytes.
Return -/// `false` to let later callbacks (or the default file/HTTP loaders) -/// handle it. `open` may itself return `None` to defer back to other -/// handlers — useful for "embedded asset" loaders that only know -/// about a specific basename set. +/// Return `true` to claim the URL; the same callback's `open` is then +/// invoked. Either function can defer: `match_url` returning `false` +/// skips the callback; `open` returning `None` falls through to the +/// next registered handler, including libxml2's defaults. /// -/// The closures are `Send + Sync` because libxml2 may call them from -/// any thread that calls into the parser. They are leaked into a -/// process-static registry; there is no `unregister` API (libxml2 -/// does not expose one for individual handlers). -/// -/// The actual C trampolines are registered with libxml2 exactly -/// once per process across all calls to this function. Subsequent -/// calls just append another Rust callback to the registry. +/// Closures are `Send + Sync + 'static` because libxml2 may invoke +/// them from any thread. They are appended to a process-static +/// registry; there is no per-handler unregister. /// /// # Example /// -/// Serve an XML fragment for `embed:///hello.xml`. The callback fires -/// whenever libxml2 itself opens the URL — most commonly during -/// `xsl:import` / `xsl:include` resolution from libxslt, or during -/// RelaxNG `` chains in `xmlRelaxNGParse`. The library's -/// own `Parser::parse_file` uses Rust file I/O directly and does -/// *not* go through libxml2's URL machinery, so it would not -/// trigger this callback — wire it up via libxslt's `parse_bytes` -/// or libxml2's `xmlReadFile` instead. +/// Most commonly used to resolve `xsl:import` / `xsl:include` via +/// libxslt, or RelaxNG `` via `xmlRelaxNGParse`. 
Note that +/// this crate's own `Parser::parse_file` uses Rust file I/O directly +/// and bypasses libxml2's URL machinery, so it does *not* trigger +/// these callbacks — wire them up via libxslt's `parse_bytes` or +/// libxml2's `xmlReadFile`. /// /// ```no_run /// use libxml::io; @@ -136,13 +92,7 @@ fn snapshot() -> Vec<Arc<Callback>> { /// /// io::register_input_callback( /// |url| url.starts_with("embed:///"), -/// |url| { -/// if url == "embed:///hello.xml" { -/// Some(HELLO.to_vec()) -/// } else { -/// None -/// } -/// }, +/// |url| (url == "embed:///hello.xml").then(|| HELLO.to_vec()), /// ); /// ``` pub fn register_input_callback<M, O>(match_url: M, open: O) @@ -155,17 +105,12 @@ where open: Box::new(open), })); - // Install the C trampolines exactly once. libxml2 records the - // function pointers in a static table; calling - // `xmlRegisterInputCallbacks` twice would push two duplicate - // entries that both delegate to the same Rust registry — wasteful - // but not unsafe. The OnceLock avoids the duplicate registration. + // libxml2 records the trampoline pointers in a static table; + // registering twice would push duplicate entries that delegate to + // the same Rust registry. The OnceLock avoids that. static REGISTERED: OnceLock<()> = OnceLock::new(); REGISTERED.get_or_init(|| { crate::init_parser(); - // `Some(trampoline_*)` coerces to the matching bindgen - // `Option` alias. If bindgen ever - // regenerates the signatures differently, this fails to compile. unsafe { xmlRegisterInputCallbacks( Some(trampoline_match), @@ -177,27 +122,25 @@ where }); } -/// Per-open state: a byte buffer + read cursor. Owned by libxml2 via -/// a `*mut c_void` handle until `trampoline_close` reclaims and drops it. +/// Per-open state owned by libxml2 via `*mut c_void` until +/// `trampoline_close` reclaims and drops it. 
struct OpenState { bytes: Vec<u8>, position: usize, } -/// libxml2 match callback: does any registered Rust callback claim -/// this URL? Returns 1 on claim, 0 to defer.
unsafe extern "C" fn trampoline_match(filename: *const c_char) -> c_int { if filename.is_null() { return 0; } // SAFETY: libxml2 guarantees `filename` is a NUL-terminated C string - // for the lifetime of this call. Lossy decode means non-UTF-8 URLs - // are rejected (they can't possibly match anyway). + // for the call's lifetime. Non-UTF-8 URLs can't match anyway. let url = match unsafe { CStr::from_ptr(filename) }.to_str() { Ok(s) => s, Err(_) => return 0, }; - for cb in snapshot() { + // Newest-first, mirroring `trampoline_open`'s walk. + for cb in snapshot().iter().rev() { if (cb.match_url)(url) { return 1; } @@ -205,9 +148,6 @@ unsafe extern "C" fn trampoline_match(filename: *const c_char) -> c_int { 0 } -/// libxml2 open callback: walk the registry, return the first -/// successfully-produced byte buffer as a heap-allocated `OpenState` -/// handle. unsafe extern "C" fn trampoline_open(filename: *const c_char) -> *mut c_void { if filename.is_null() { return std::ptr::null_mut(); @@ -217,24 +157,18 @@ unsafe extern "C" fn trampoline_open(filename: *const c_char) -> *mut c_void { Ok(s) => s, Err(_) => return std::ptr::null_mut(), }; - // Walk newest-first so the most recent registration wins — - // matches libxml2's own callback-table semantics and the - // module-level docs. + // Newest-first — the most recent registration wins. for cb in snapshot().iter().rev() { if !(cb.match_url)(url) { continue; } if let Some(bytes) = (cb.open)(url) { - let state = Box::new(OpenState { bytes, position: 0 }); - return Box::into_raw(state) as *mut c_void; + return Box::into_raw(Box::new(OpenState { bytes, position: 0 })) as *mut c_void; } } std::ptr::null_mut() } -/// libxml2 read callback: copy up to `len` bytes from the buffer -/// cursor into `buffer`. Returns the number copied, 0 at EOF, or -/// -1 on error. 
unsafe extern "C" fn trampoline_read( context: *mut c_void, buffer: *mut c_char, @@ -243,17 +177,15 @@ unsafe extern "C" fn trampoline_read( if context.is_null() || buffer.is_null() || len <= 0 { return -1; } - // SAFETY: `context` was produced by `trampoline_open` (Box::into_raw) - // and not yet reclaimed by `trampoline_close`; libxml2 holds at most - // one mutable reference at a time per open handle. + // SAFETY: `context` came from `Box::into_raw` in `trampoline_open` + // and is not yet reclaimed; libxml2 holds one reference per handle. let state = unsafe { &mut *(context as *mut OpenState) }; let remaining = state.bytes.len().saturating_sub(state.position); let n = remaining.min(len as usize); if n == 0 { return 0; } - // SAFETY: ranges checked above. Source and dest do not overlap — - // they live in disjoint heap allocations. + // SAFETY: bounds checked above; src and dst are disjoint allocations. unsafe { std::ptr::copy_nonoverlapping( state.bytes.as_ptr().add(state.position), @@ -265,14 +197,11 @@ unsafe extern "C" fn trampoline_read( n as c_int } -/// libxml2 close callback: reclaim and drop the `OpenState` box that -/// `trampoline_open` produced. unsafe extern "C" fn trampoline_close(context: *mut c_void) -> c_int { if context.is_null() { return -1; } - // SAFETY: `context` was produced by `Box::into_raw` in - // `trampoline_open`; this is the unique reclamation site. + // SAFETY: unique reclamation site for the box from `trampoline_open`. let _state = unsafe { Box::from_raw(context as *mut OpenState) }; 0 } @@ -282,17 +211,14 @@ mod tests { use super::*; use crate::bindings::{xmlFreeDoc, xmlReadFile}; use std::ffi::CString; + use std::sync::atomic::{AtomicUsize, Ordering}; static SAMPLE_XML: &[u8] = br#" "#; - /// Call libxml2's `xmlReadFile` directly. `Parser::parse_file` - /// short-circuits through Rust file I/O so our callbacks aren't - /// involved there. 
Production use is identical to what `libxslt` - /// does internally when resolving `xsl:import` against a base URI: - /// libxslt calls `xmlReadFile` with the composed URL, libxml2 walks - /// its registered input callbacks (including ours), our trampolines - /// produce the bytes. + /// `Parser::parse_file` bypasses the input-callback machinery (it + /// reads via Rust file I/O), so the test must call `xmlReadFile` + /// directly — the same entry point libxslt uses for `xsl:import`. fn read_file_via_libxml2(url: &str) -> bool { let c = CString::new(url).unwrap(); unsafe { @@ -305,64 +231,67 @@ mod tests { } } - /// Three scenarios bundled into one `#[test]` so they execute - /// sequentially. libxml2 prior to 2.13 has a thread-safety bug in - /// the input-callback / global-error path that deadlocks concurrent - /// `xmlReadFile` calls — under cargo's default parallel test runner - /// the three scenarios would hang the process on a 2.12.x build. - /// Bundling sidesteps that without forcing every contributor to - /// remember `--test-threads=1`. (2.13+ runs them concurrently fine, - /// but we keep the bundling for portability.) + /// Scenarios share one `#[test]` so they run sequentially. libxml2 + /// < 2.13 has a thread-safety bug in the input-callback path that + /// deadlocks concurrent `xmlReadFile` calls under cargo's default + /// parallel test runner. #[test] fn input_callback_scenarios() { register_input_callback( |url| url.starts_with("embed:///"), - |url| { - if url == "embed:///sample.xml" { - Some(SAMPLE_XML.to_vec()) - } else { - None - } - }, + |url| (url == "embed:///sample.xml").then(|| SAMPLE_XML.to_vec()), ); - // 1. Registered URL parses via the callback. - assert!( - read_file_via_libxml2("embed:///sample.xml"), - "registered URL should parse via the callback", - ); + // 1. Happy path. + assert!(read_file_via_libxml2("embed:///sample.xml")); - // 2. `open` returning `None` declines the load (rather than - // producing phantom data). 
- assert!( - !read_file_via_libxml2("embed:///unknown.xml"), - "decline (open returning None) should fail the load, not return phantom data", - ); + // 2. `open` returning `None` declines this match; libxml2 falls + // through to the default file loader, which also fails. + assert!(!read_file_via_libxml2("embed:///unknown.xml")); - // 3. An unrelated absolute path falls through to libxml2's - // built-in file handler and fails there — confirms our match - // callback returns 0 for non-`embed:///` URLs, otherwise we'd - // intercept and break every default load. - assert!( - !read_file_via_libxml2("/nonexistent/definitely/missing.xml"), - "non-embed URL should fail through the default loader", - ); + // 3. Unrelated URLs aren't claimed by our match — they reach the + // default file handler and fail there. + assert!(!read_file_via_libxml2("/nonexistent/definitely/missing.xml")); - // 4. A re-entrant closure: `open` calls back into libxml2 via - // `xmlReadFile` for a *different* URL, which itself routes - // through the trampolines. Without the snapshot-then-drop-lock - // pattern in the trampolines, this would self-deadlock on the - // non-reentrant registry `Mutex`. + // 4. Re-entrancy: an `open` closure that calls into libxml2 must + // not self-deadlock on the registry mutex. register_input_callback( |url| url == "reentrant:///outer", |_url| { - let _inner_ok = read_file_via_libxml2("embed:///sample.xml"); + let _ = read_file_via_libxml2("embed:///sample.xml"); + Some(SAMPLE_XML.to_vec()) + }, + ); + assert!(read_file_via_libxml2("reentrant:///outer")); + + // 5. Newest-wins ordering: two callbacks claim the same URL; only + // the most recent registration runs and produces the bytes. 
+ static FIRST_OPENED: AtomicUsize = AtomicUsize::new(0); + static SECOND_OPENED: AtomicUsize = AtomicUsize::new(0); + register_input_callback( + |url| url == "ordered:///x", + |_| { + FIRST_OPENED.fetch_add(1, Ordering::SeqCst); + Some(b"first".to_vec()) + }, + ); + register_input_callback( + |url| url == "ordered:///x", + |_| { + SECOND_OPENED.fetch_add(1, Ordering::SeqCst); Some(SAMPLE_XML.to_vec()) }, ); - assert!( - read_file_via_libxml2("reentrant:///outer"), - "callback should be able to re-enter libxml2 without deadlocking on the registry mutex", + assert!(read_file_via_libxml2("ordered:///x")); + assert_eq!( + SECOND_OPENED.load(Ordering::SeqCst), + 1, + "newest registration should run", + ); + assert_eq!( + FIRST_OPENED.load(Ordering::SeqCst), + 0, + "older registration should not be consulted", ); } } From 6ab2377d3899bba38f65b4a43fa1d307f0a80840 Mon Sep 17 00:00:00 2001 From: Deyan Ginev Date: Sat, 23 May 2026 15:10:56 -0400 Subject: [PATCH 4/5] ci: modernize workflows off deprecated Node 20 actions Replace the archived actions-rs/* and ryankurte/action-apt with their current-standard equivalents so the workflows run on Node 24 ahead of GitHub's June 2026 forced migration: * actions-rs/toolchain@v1 -> dtolnay/rust-toolchain (@stable, plus @master + toolchain:/targets: for the mingw windows-gnu job) * actions-rs/cargo@v1 -> plain `run: cargo test|doc` * ryankurte/action-apt -> plain `run: apt-get update && install` * actions/checkout@v2/@v4 -> @v6 Also add least-privilege `permissions:` blocks (contents: read for the CI/test workflows; contents: write for gh-pages, which pushes rendered docs to the gh-pages branch). CHANGELOG: date 0.3.12 (2026-05-23) and open a 0.3.13 in-development section. 
Co-Authored-By: Claude Opus 4.7 --- .github/workflows/CI.yml | 58 +++++++++++++--------------------- .github/workflows/gh-pages.yml | 24 +++++++------- .github/workflows/windows.yml | 23 +++++++------- CHANGELOG.md | 4 ++- 4 files changed, 47 insertions(+), 62 deletions(-) diff --git a/.github/workflows/CI.yml b/.github/workflows/CI.yml index 31ef76865..ccc9bc59c 100644 --- a/.github/workflows/CI.yml +++ b/.github/workflows/CI.yml @@ -2,6 +2,10 @@ on: [push, pull_request] name: CI Linux +# Least privilege: this workflow only reads the repo and runs tests. +permissions: + contents: read + jobs: test-amd64: name: rust-libxml amd64 CI @@ -11,41 +15,29 @@ jobs: with_default_bindings: [false, true] steps: - name: install dependencies - uses: ryankurte/action-apt@v0.2.0 - with: - packages: "libxml2-dev" + run: | + sudo apt-get update + sudo apt-get install -y libxml2-dev - name: Set up LIBXML2 env var if compiling with the default bindings run: echo "LIBXML2=$(pkg-config libxml-2.0 --variable=libdir)/libxml2.so" >> "$GITHUB_ENV" if: ${{ matrix.with_default_bindings }} - - uses: actions/checkout@v2 - - uses: actions-rs/toolchain@v1 - with: - profile: minimal - toolchain: stable - override: true + - uses: actions/checkout@v6 + - uses: dtolnay/rust-toolchain@stable - name: run tests - uses: actions-rs/cargo@v1 - with: - command: test + run: cargo test test-arm64: name: rust-libxml arm64 CI runs-on: ubuntu-24.04-arm steps: - name: install dependencies - uses: ryankurte/action-apt@v0.2.0 - with: - packages: "libxml2-dev" - - uses: actions/checkout@v2 - - uses: actions-rs/toolchain@v1 - with: - profile: minimal - toolchain: stable - override: true + run: | + sudo apt-get update + sudo apt-get install -y libxml2-dev + - uses: actions/checkout@v6 + - uses: dtolnay/rust-toolchain@stable - name: run tests - uses: actions-rs/cargo@v1 - with: - command: test + run: cargo test test-newer-libxml2: strategy: @@ -55,10 +47,10 @@ jobs: runs-on: ubuntu-latest steps: - name: install 
dependencies - uses: ryankurte/action-apt@v0.2.0 - with: - packages: "libpython3-dev" - - uses: actions/checkout@v2 + run: | + sudo apt-get update + sudo apt-get install -y libpython3-dev + - uses: actions/checkout@v6 - name: Install libxml ${{ matrix.libxml_version }} by hand run: | wget https://download.gnome.org/sources/libxml2/$(echo ${{ matrix.libxml_version }} | sed -e 's/\.[0-9]*$//')/libxml2-${{ matrix.libxml_version }}.tar.xz @@ -67,14 +59,8 @@ jobs: ./configure make sudo make install - - uses: actions-rs/toolchain@v1 - with: - profile: minimal - toolchain: stable - override: true + - uses: dtolnay/rust-toolchain@stable - name: run tests - uses: actions-rs/cargo@v1 - with: - command: test + run: cargo test env: LD_LIBRARY_PATH: /usr/local/lib diff --git a/.github/workflows/gh-pages.yml b/.github/workflows/gh-pages.yml index cfc127cdb..770579caf 100644 --- a/.github/workflows/gh-pages.yml +++ b/.github/workflows/gh-pages.yml @@ -4,6 +4,11 @@ on: - master name: Publish Docs + +# Deploy step pushes the rendered docs to the gh-pages branch. 
+permissions: + contents: write + jobs: publish_docs: if: github.repository == 'KWARC/rust-libxml' @@ -11,23 +16,16 @@ jobs: runs-on: ubuntu-latest steps: - name: install dependencies - uses: ryankurte/action-apt@v0.2.0 - with: - packages: "libxml2-dev" + run: | + sudo apt-get update + sudo apt-get install -y libxml2-dev - name: Set up LIBXML2 env var if compiling with the default bindings run: echo "LIBXML2=$(pkg-config libxml-2.0 --variable=libdir)/libxml2.so" >> "$GITHUB_ENV" - - uses: actions-rs/toolchain@v1 - with: - profile: minimal - toolchain: stable - override: true - name: Checkout sources - uses: actions/checkout@v4 + uses: actions/checkout@v6 + - uses: dtolnay/rust-toolchain@stable - name: Build Documentation - uses: actions-rs/cargo@v1 - with: - command: doc - args: --all --no-deps + run: cargo doc --all --no-deps - name: Deploy Documentation env: diff --git a/.github/workflows/windows.yml b/.github/workflows/windows.yml index 55251c3ed..572e5e960 100644 --- a/.github/workflows/windows.yml +++ b/.github/workflows/windows.yml @@ -2,16 +2,20 @@ on: [push, pull_request] name: CI Windows +# Least privilege: this workflow only reads the repo and runs tests. 
+permissions: + contents: read + jobs: test-default-windows: name: Windows vcpkg (default) runs-on: windows-latest - env: + env: VCPKGRS_DYNAMIC: 1 VCPKG_DEFAULT_TRIPLET: x64-windows VCPKG_ROOT: C:\vcpkg steps: - - uses: actions/checkout@v4 + - uses: actions/checkout@v6 - name: Setup vcpkg libxml2 Cache uses: actions/cache@v4 id: vcpkg-cache @@ -23,9 +27,7 @@ jobs: vcpkg install libxml2:x64-windows vcpkg integrate install - name: run tests - uses: actions-rs/cargo@v1 - with: - command: test + run: cargo test test-mingw64-windows: name: Windows (mingw64) @@ -34,7 +36,7 @@ jobs: run: shell: msys2 {0} steps: - - uses: actions/checkout@v4 + - uses: actions/checkout@v6 - uses: msys2/setup-msys2@v2 with: path-type: minimal @@ -45,14 +47,11 @@ jobs: mingw64/mingw-w64-x86_64-pkg-config mingw64/mingw-w64-x86_64-libxml2 - name: Install stable windows-gnu Rust toolchain - uses: actions-rs/toolchain@v1 + uses: dtolnay/rust-toolchain@master with: toolchain: stable-x86_64-pc-windows-gnu - target: x86_64-pc-windows-gnu - override: true + targets: x86_64-pc-windows-gnu - name: Ensure mingw64 pkg-config is in path run: echo "C:\msys64\mingw64\bin" >> "$GITHUB_PATH" - name: run tests - uses: actions-rs/cargo@v1 - with: - command: test \ No newline at end of file + run: cargo test diff --git a/CHANGELOG.md b/CHANGELOG.md index 5d23d4d1d..b092e9e57 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,6 +1,8 @@ # Change Log -## [0.3.12] (in development) +## [0.3.13] (in development) + +## [0.3.12] (2026-05-23) ### Added From f4040ffe8f0f55affde9af0a9af4b8632ff44b04 Mon Sep 17 00:00:00 2001 From: Deyan Ginev Date: Sat, 23 May 2026 15:16:20 -0400 Subject: [PATCH 5/5] ci: run mingw64 cargo test in the Windows shell, not msys2 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The mingw64 job sets `defaults.run.shell: msys2 {0}`, so the converted `run: cargo test` step executed inside the msys2 login shell. 
With `path-type: minimal`, msys2 strips cargo (installed by rustup to the Windows user profile) from PATH, so the step failed with exit 127. The previous actions-rs/cargo@v1 step was a JS action that ran in the runner's Windows context, never msys2 — so it always found cargo. Restore that behavior by pinning the test step to `shell: pwsh`. mingw64/bin is already on PATH from the prior step, so pkg-config, gcc, and the libxml2 DLLs still resolve for the windows-gnu build. Co-Authored-By: Claude Opus 4.7 --- .github/workflows/windows.yml | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/.github/workflows/windows.yml b/.github/workflows/windows.yml index 572e5e960..d885e2a10 100644 --- a/.github/workflows/windows.yml +++ b/.github/workflows/windows.yml @@ -53,5 +53,12 @@ jobs: targets: x86_64-pc-windows-gnu - name: Ensure mingw64 pkg-config is in path run: echo "C:\msys64\mingw64\bin" >> "$GITHUB_PATH" + # Run cargo in the default Windows shell, not the job's msys2 shell: + # rustup installs cargo to the Windows user profile, which + # `path-type: minimal` strips from the msys2 PATH (exit 127). The old + # actions-rs/cargo step ran in the runner's Windows context too, never + # msys2. mingw64/bin is on PATH via the step above, so pkg-config, gcc, + # and the libxml2 DLLs still resolve for the windows-gnu build. - name: run tests + shell: pwsh run: cargo test