4 changes: 4 additions & 0 deletions src/ros2_medkit_fault_manager/CHANGELOG.rst
@@ -2,6 +2,10 @@
Changelog for package ros2_medkit_fault_manager
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Forthcoming
-----------
* Optional append-only, hash-chained audit log of fault state transitions: each transition appends one immutable row (``record_hash = sha256(prev_hash + canonical(event))`` via OpenSSL EVP SHA-256) with a persisted chain head, a ``verify`` routine, a read API, and retention that seals a segment anchor before pruning. Time-based (PREFAILED->CONFIRMED) auto-confirmations are also audited. ``verify`` reads the chain head directly from the database, so deleting the newest row together with the head row is reported as tampering instead of being silently recovered. ``BEFORE UPDATE`` / ``BEFORE DELETE`` triggers reject out-of-band edits as defense-in-depth. The chain is unkeyed and stored in a single writable file, so ``verify`` detects only edits and deletions that did not recompute the chain (casual or accidental tampering); it is not a defense against an attacker who can rewrite the whole file. Off by default (`#483 <https://github.com/selfpatch/ros2_medkit/issues/483>`_)

0.6.0 (2026-06-22)
------------------
* Bounded concurrent snapshot capture under fault storms with a ``CaptureThreadPool`` and configurable capture pool / queue / overflow-policy parameters. The rosbag leg is serialized and the cooldown map is bounded, so a burst of simultaneous faults can no longer exhaust capture threads or grow memory without limit (`#456 <https://github.com/selfpatch/ros2_medkit/pull/456>`_)
8 changes: 8 additions & 0 deletions src/ros2_medkit_fault_manager/CMakeLists.txt
@@ -41,6 +41,8 @@ find_package(ros2_medkit_msgs REQUIRED)
find_package(ros2_medkit_serialization REQUIRED)
find_package(SQLite3 REQUIRED)
find_package(nlohmann_json REQUIRED)
# OpenSSL EVP SHA-256 for the tamper-evident audit log hash chain
find_package(OpenSSL REQUIRED)
# yaml-cpp is required as transitive dependency from ros2_medkit_serialization
medkit_find_yaml_cpp()
# rosbag2 for time-window snapshot recording
@@ -55,6 +57,7 @@ add_library(fault_manager_lib STATIC
src/fault_manager_node.cpp
src/fault_storage.cpp
src/sqlite_fault_storage.cpp
src/fault_audit_log.cpp
src/snapshot_capture.cpp
src/rosbag_capture.cpp
src/correlation/types.cpp
@@ -81,6 +84,7 @@ target_link_libraries(fault_manager_lib PUBLIC
SQLite::SQLite3
nlohmann_json::nlohmann_json
yaml-cpp::yaml-cpp
OpenSSL::Crypto
)

medkit_apply_compat_defs(fault_manager_lib)
@@ -143,6 +147,10 @@ if(BUILD_TESTING)
medkit_target_dependencies(test_sqlite_storage rclcpp ros2_medkit_msgs)
medkit_set_test_domain(test_sqlite_storage)

# Fault audit log tests (hash chain, verify, rotation, reopen)
ament_add_gtest(test_fault_audit_log test/test_fault_audit_log.cpp)
target_link_libraries(test_fault_audit_log fault_manager_lib)

# Snapshot capture tests
ament_add_gtest(test_snapshot_capture test/test_snapshot_capture.cpp)
target_link_libraries(test_snapshot_capture fault_manager_lib)
23 changes: 23 additions & 0 deletions src/ros2_medkit_fault_manager/README.md
@@ -53,6 +53,7 @@ ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
- **Debounce filtering** (optional): AUTOSAR DEM-style counter-based fault confirmation with per-entity threshold overrides
- **Snapshot capture**: Captures topic data when faults are confirmed for debugging (snapshots are deleted when fault is cleared)
- **Fault correlation** (optional): Root cause analysis with symptom muting and auto-clear
- **Tamper-evident audit log** (optional): Append-only, hash-chained record of fault state transitions for verifiable history

## Parameters

@@ -109,6 +110,28 @@ patterns:

**Memory**: Faults are stored in memory only. Useful for testing or when persistence is not required.

## Advanced: Tamper-Evident Audit Log

An optional append-only, hash-chained audit log records every fault state transition (`occurred`, `confirmed`, `healed`, `cleared`) so the fault history is independently verifiable. Auto-recovery (a fault reaching the healing threshold via PASSED events) is recorded as a distinct `healed` row with source `auto_heal`, so the end of the fault appears in the timeline and is not confused with a manual `cleared`. The manager has no acknowledge action separate from clearing, so `~/clear_fault` is recorded as `cleared` (clear == ack); there is no `ack` kind. The log also records its own lifecycle with `logging_activated` / `logging_deactivated` markers (CIR (EU) 2024/2690 sec. 3.2) at start and stop. It is **off by default** because it adds a write and storage cost per transition.

Each transition appends one immutable row holding `record_hash = sha256(prev_hash + canonical(event))` (OpenSSL EVP SHA-256), the `prev_hash` it links to, and a monotonic `seq`. The hash is computed once at insert and never recomputed. A persisted chain head lets the chain resume across restarts. The log is stored in its own SQLite database (separate from the fault store) and is treated as append-only: the manager only ever inserts rows, and `BEFORE UPDATE` / `BEFORE DELETE` triggers reject out-of-band edits (the guarded rotation prune excepted).

**Completeness is an integrity property.** `verify()` detects rows that were *deleted* from the chain, but it cannot detect a transition that was *never appended*: a silently dropped append is a hole `verify()` can never see. Every transition on the write path is therefore audited (occurred, timer/threshold confirmations, auto-heal, and clears), and an append failure is never swallowed silently: it increments a dropped-writes health counter and clears an "audit healthy" flag. With `audit_log.fail_closed` set, an append failure is a **hard error** that aborts the operation, so a compliance-strict deployment learns that the audit broke instead of losing records unnoticed. The default (`fail_closed=false`) keeps fault processing running but still surfaces the gap via the health counter.
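
The fail-closed versus fail-open behaviour described above can be sketched as follows. This is a minimal illustration, not the manager's code: `AuditHealth`, `audited_append`, and the assumption that a failed append surfaces as a thrown `std::exception` (for example `std::runtime_error`) are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <exception>
#include <functional>
#include <stdexcept>

// Hypothetical health state; names are illustrative, not the manager's API.
struct AuditHealth {
  uint64_t dropped_writes{0};  // counts every append that failed
  bool healthy{true};          // cleared on the first failed append
};

// Runs `append` (assumed to throw, e.g. std::runtime_error, on failure).
// Returns true if the record was written. On failure the gap is always
// counted; with fail_closed the error is rethrown to abort the caller.
bool audited_append(const std::function<void()> & append, bool fail_closed, AuditHealth & health) {
  assert(append && "append callable must be set");
  try {
    append();
    return true;
  } catch (const std::exception &) {
    ++health.dropped_writes;
    health.healthy = false;
    if (fail_closed) {
      throw;  // hard error: the calling operation is aborted
    }
    return false;  // fail-open: processing continues, the counter shows the gap
  }
}
```

In both modes the counter and the flag are updated before the policy decision, which is what keeps a dropped record visible even when fault processing continues.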

`verify()` walks the persisted chain oldest-first and recomputes every link. Editing a row breaks its `record_hash`; deleting a row breaks the next row's `prev_hash` linkage; and deleting the newest row is caught by the persisted-head check, because the head row is read straight from the DB.
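
To make the append and verify steps concrete, here is a small in-memory sketch. It rests on several assumptions: `toy_hash_hex` is a non-cryptographic FNV-1a stand-in for SHA-256 so the example needs no OpenSSL, the `seq|event` canonical form is invented for illustration, and `Row`, `Head`, `Verdict`, `append`, and `verify` are hypothetical names rather than the `FaultAuditLog` API.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Stand-in for SHA-256 (64-bit FNV-1a). NOT cryptographic; it only keeps this
// sketch free of an OpenSSL dependency.
std::string toy_hash_hex(const std::string & data) {
  uint64_t h = 14695981039346656037ULL;
  for (unsigned char c : data) {
    h ^= c;
    h *= 1099511628211ULL;
  }
  char buf[17];
  std::snprintf(buf, sizeof(buf), "%016llx", static_cast<unsigned long long>(h));
  return std::string(buf);
}

struct Row {  // hypothetical row layout
  int64_t seq;
  std::string event;
  std::string prev_hash;
  std::string record_hash;
};
struct Head {  // persisted chain head
  int64_t seq;
  std::string hash;
};
struct Verdict {
  bool ok;
  int64_t bad_seq;  // seq of the first offending record, 0 if ok
};

// Assumed canonical form: the seq is hashed together with the event text.
std::string canonical(int64_t seq, const std::string & event) {
  return std::to_string(seq) + "|" + event;
}

// Append one row: link to the current head, hash once, advance the head.
void append(std::vector<Row> & rows, Head & head, const std::string & event) {
  assert(rows.empty() || rows.back().record_hash == head.hash);
  Row row{head.seq + 1, event, head.hash, ""};
  row.record_hash = toy_hash_hex(row.prev_hash + canonical(row.seq, row.event));
  rows.push_back(row);
  head = Head{row.seq, row.record_hash};
}

// Walk oldest-first, recompute every link, then compare against the head.
Verdict verify(const std::vector<Row> & rows, const Head & head, const std::string & genesis) {
  std::string expected_prev = genesis;
  int64_t expected_seq = 1;
  for (const Row & row : rows) {
    const bool linked = row.seq == expected_seq && row.prev_hash == expected_prev;
    const bool intact =
      row.record_hash == toy_hash_hex(row.prev_hash + canonical(row.seq, row.event));
    if (!linked || !intact) {
      return {false, row.seq};
    }
    expected_prev = row.record_hash;
    ++expected_seq;
  }
  // Persisted-head check: a deleted newest row leaves the head pointing past
  // the last row that was walked.
  if (head.seq != expected_seq - 1 || head.hash != expected_prev) {
    return {false, head.seq};
  }
  return {true, 0};
}
```

With three appended rows, editing the second row's event makes `verify` report `bad_seq == 2`, erasing the second row reports `bad_seq == 3`, and dropping the last row fails the head check.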

**Threat model (read this).** The chain is **unkeyed**, and the head and segment anchors live in the **same writable SQLite file** as the rows. `verify()` therefore catches edits or deletions that did **not** also recompute the chain - that is, casual or accidental tampering, and the bookkeeping bugs that would otherwise lose records. The append-only triggers are defense-in-depth: `audit_log` rejects out-of-band UPDATE/DELETE, and the rotation-prune guard (`audit_prune_guard`) is itself protected by a trigger so an external writer cannot simply flip it open and then delete a prefix - that flip is only permitted from the in-process connection that holds a per-connection temp marker. The single-row chain head (`audit_chain_head`) is intentionally **not** trigger-protected (a trigger there would block the legitimate head update inside the append transaction); a casual edit or delete of the head is instead caught by `verify()` via the seq/hash/head-mismatch checks. None of this stops an attacker with write access to the file: such an attacker can create the same temp marker or drop the triggers, and recompute the entire chain (head and anchors included) to forge a self-consistent history. The triggers are **not** a security boundary - this is tamper-**evident**, not tamper-**proof**. True tamper-*proofing* requires a key or signature over the head (so it cannot be recomputed without the key) or external anchoring of the head hash to an append-only store you do not control; both are out of scope here and belong to the audit-log exporter / signing follow-up.

**Retention/rotation**: when more than `audit_log.retention_max_records` rows are retained, the oldest segment is *sealed* (its final `seq` + hash are persisted as an anchor) and then pruned. The surviving tail still verifies because the oldest retained row links back to the sealed anchor.
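
The seal-then-prune order can be sketched in memory as below. The assumptions are stated loudly: `toy_hash_hex` is a non-cryptographic FNV-1a stand-in for SHA-256, the sketch prunes exactly the excess rows (the real segment size is not specified here), and `Row`, `Anchor`, `append`, `rotate_if_needed`, and `tail_verifies` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Stand-in for SHA-256 (64-bit FNV-1a). NOT cryptographic; illustration only.
std::string toy_hash_hex(const std::string & data) {
  uint64_t h = 14695981039346656037ULL;
  for (unsigned char c : data) {
    h ^= c;
    h *= 1099511628211ULL;
  }
  char buf[17];
  std::snprintf(buf, sizeof(buf), "%016llx", static_cast<unsigned long long>(h));
  return std::string(buf);
}

struct Row {  // hypothetical row layout
  int64_t seq;
  std::string event;
  std::string prev_hash;
  std::string record_hash;
};
struct Anchor {  // final seq + hash of the sealed (pruned) prefix
  int64_t seq;
  std::string hash;
};

// Append a row that links to the last retained row, or to the anchor if none.
void append(std::vector<Row> & rows, const Anchor & anchor, const std::string & event) {
  const int64_t prev_seq = rows.empty() ? anchor.seq : rows.back().seq;
  const std::string prev_hash = rows.empty() ? anchor.hash : rows.back().record_hash;
  Row row{prev_seq + 1, event, prev_hash, ""};
  row.record_hash = toy_hash_hex(row.prev_hash + std::to_string(row.seq) + "|" + row.event);
  rows.push_back(row);
}

// Seal first, prune second: the anchor is recorded before any row is dropped.
void rotate_if_needed(std::vector<Row> & rows, Anchor & anchor, std::size_t max_records) {
  if (max_records == 0 || rows.size() <= max_records) {
    return;  // 0 means unlimited retention
  }
  const std::size_t excess = rows.size() - max_records;
  const Row & last_sealed = rows[excess - 1];
  anchor = Anchor{last_sealed.seq, last_sealed.record_hash};
  rows.erase(rows.begin(), rows.begin() + static_cast<std::ptrdiff_t>(excess));
  // The oldest surviving row must link back to the sealed anchor.
  assert(rows.front().prev_hash == anchor.hash);
}

// Verify the retained tail, starting from the anchor instead of genesis.
bool tail_verifies(const std::vector<Row> & rows, const Anchor & anchor) {
  int64_t expected_seq = anchor.seq + 1;
  std::string expected_prev = anchor.hash;
  for (const Row & row : rows) {
    const std::string recomputed =
      toy_hash_hex(row.prev_hash + std::to_string(row.seq) + "|" + row.event);
    if (row.seq != expected_seq || row.prev_hash != expected_prev ||
        row.record_hash != recomputed) {
      return false;
    }
    ++expected_seq;
    expected_prev = row.record_hash;
  }
  return true;
}
```

Starting from a genesis anchor with `seq` 0, five appends followed by `rotate_if_needed(rows, anchor, 3)` leave rows 3 to 5 retained and an anchor at `seq` 2, and the tail still verifies against that anchor.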

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `audit_log.enabled` | bool | `false` | Enable the tamper-evident audit log |
| `audit_log.transitions` | string | `"all"` | Which transitions to record: `"all"` (occurred/confirmed/healed/cleared) or `"confirmed_only"`. Lifecycle markers are always recorded. |
| `audit_log.database_path` | string | `""` | SQLite path. Empty => sibling `fault_audit.db` next to the fault DB (or `:memory:` for in-memory fault stores) |
| `audit_log.retention_max_records` | int | `0` | Seal + prune the oldest segment beyond this many retained records (0 = unlimited) |
| `audit_log.fail_closed` | bool | `false` | When `true`, an audit append failure is a hard error that aborts the operation (compliance-strict). When `false`, the failure is logged and counted but fault processing continues. Either way the gap is visible via the dropped-writes health counter. |

## Usage

### Launch
23 changes: 23 additions & 0 deletions src/ros2_medkit_fault_manager/config/fault_manager.yaml
@@ -27,3 +27,26 @@ fault_manager:
# snapshots.capture_pool_size: 2 # max concurrent capture threads (>= 1)
# snapshots.capture_queue_depth: 16 # max pending captures (>= 1)
# snapshots.capture_queue_full_policy: reject_newest # reject_newest | drop_oldest

# Append-only, hash-chained audit log of fault state transitions
# (occurred/confirmed/healed/cleared). OFF by default: it adds a write +
# storage cost per transition. When enabled, each transition appends one
# immutable, hash-chained row, so verify() detects edits or deletions that did
# NOT also recompute the chain (casual/accidental tampering). The chain is
# unkeyed and lives in a single writable file, so it is NOT proof against an
# attacker who can rewrite the whole file; true tamper-proofing needs a signed
# head or external anchoring (out of scope here). See README "Threat model".
audit_log.enabled: false
# Which transitions to record: "all" or "confirmed_only".
# audit_log.transitions: all
# SQLite path. Empty => sibling "fault_audit.db" next to the fault DB
# (or :memory: when the fault store is in-memory).
# audit_log.database_path: ""
# Seal + prune the oldest segment beyond this many retained records
# (0 = unlimited). A sealed anchor keeps the surviving tail verifiable.
# audit_log.retention_max_records: 0
# Fail-closed (compliance-strict): when true, an audit append failure is a hard
# error that aborts the operation instead of being logged and dropped. Default
# false keeps fault processing running; either way a dropped-writes health
# counter makes the gap visible.
# audit_log.fail_closed: false
@@ -0,0 +1,170 @@
// Copyright 2026 mfaferek93, bburda
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#pragma once

#include <sqlite3.h>

#include <cstdint>
#include <mutex>
#include <optional>
#include <string>
#include <vector>

namespace ros2_medkit_fault_manager {

/// A single fault state-transition to record in the audit log.
///
/// `transition` is one of the kTransition* constants below. The remaining
/// fields describe the fault at the moment of the transition; all of them feed
/// the canonical serialization that the hash chain is computed over, so an edit
/// to a stored row that does not also recompute the chain breaks verify().
struct AuditEvent {
  std::string fault_code;
  std::string transition;     ///< one of the kTransition* constants below
  uint8_t severity{0};        ///< severity at the time of the transition
  std::string status;         ///< resulting fault status (e.g. CONFIRMED)
  std::string source_id;      ///< reporting source that drove the transition
  std::string description;    ///< human-readable description
  int64_t occurred_at_ns{0};  ///< wall-clock timestamp of the transition
};

/// Canonical transition kinds. Stored verbatim, so they are part of the hash.
constexpr const char * kTransitionOccurred = "occurred";
constexpr const char * kTransitionConfirmed = "confirmed";
constexpr const char * kTransitionCleared = "cleared";
/// Auto-recovery: a fault reached the healing threshold via PASSED events. Kept
/// distinct from kTransitionCleared so an automatic recovery is not mistaken for
/// a manual clear in the timeline.
constexpr const char * kTransitionHealed = "healed";
/// Audit-log lifecycle markers (CIR (EU) 2024/2690 sec. 3.2: activation /
/// deactivation of logging). Appended directly, independent of the per-fault
/// transition filter, so the log records its own start and stop.
constexpr const char * kTransitionLoggingActivated = "logging_activated";
constexpr const char * kTransitionLoggingDeactivated = "logging_deactivated";
// NOTE: there is deliberately no "ack" kind. The open fault_manager has no
// acknowledge action separate from clearing: ~/clear_fault IS the acknowledge,
// and it is recorded as kTransitionCleared (clear == ack). A separate "ack" kind
// would never be written, so defining it would only mislead readers of the log.

/// One immutable, hash-chained row read back from the audit log.
struct AuditRecord {
  int64_t seq{0};
  AuditEvent event;
  std::string prev_hash;
  std::string record_hash;
};

/// Persisted head of the hash chain.
struct ChainHead {
  int64_t seq{0};           ///< 0 when the chain is empty
  std::string record_hash;  ///< genesis hash when the chain is empty
};

/// Result of verifying the persisted chain.
struct AuditVerifyResult {
  bool ok{true};
  int64_t checked{0};  ///< number of records walked
  int64_t bad_seq{0};  ///< seq of the first offending record (0 if ok)
  std::string error;   ///< human-readable reason when !ok
};

/// Append-only, hash-chained audit log of fault state transitions.
///
/// Each appended row stores `record_hash = sha256(prev_hash + canonical(event))`
/// using OpenSSL's EVP SHA-256 (the same primitive the gateway links). The hash
/// is computed once at insert and never recomputed. A persisted chain head lets
/// the chain resume across process restarts, and rotation seals a segment by
/// persisting an anchor (its final seq + hash) before pruning so the surviving
/// history stays verifiable.
///
/// The table is treated as append-only: this class only ever INSERTs rows (and,
/// on rotation, deletes a sealed prefix). It never UPDATEs an existing record.
/// BEFORE UPDATE / BEFORE DELETE triggers reject out-of-band edits (the guarded
/// rotation prune excepted) as defense-in-depth.
///
/// Threat model: the hash chain is UNKEYED and the head/anchors live in the same
/// writable file. verify() catches edits or deletions that did not also recompute
/// the chain (casual or accidental tampering), but anyone with write access to the
/// file can recompute the whole chain (and drop the triggers) and forge a
/// consistent history. True tamper-proofing needs a key/signature over the head or
/// external anchoring; that is out of scope here.
class FaultAuditLog {
 public:
  /// Open (or create) the audit log database.
  /// @param db_path SQLite path. Use ":memory:" for an in-memory log.
  /// @param retention_max_records Max records to retain before rotation seals
  /// and prunes the oldest segment. 0 disables rotation (unlimited).
  /// @throws std::runtime_error if the database cannot be opened or initialized.
  explicit FaultAuditLog(const std::string & db_path, int64_t retention_max_records = 0);

  ~FaultAuditLog();

  FaultAuditLog(const FaultAuditLog &) = delete;
  FaultAuditLog & operator=(const FaultAuditLog &) = delete;
  FaultAuditLog(FaultAuditLog &&) = delete;
  FaultAuditLog & operator=(FaultAuditLog &&) = delete;

  /// Append one transition. Computes the chained hash, inserts the row, and
  /// advances the persisted head, atomically.
  /// @return the monotonic seq assigned to the new record.
  int64_t append(const AuditEvent & event);

  /// Walk the persisted chain oldest-first and validate every link.
  AuditVerifyResult verify() const;

  /// Read records oldest-first.
  /// @param limit Max records to return (0 = all).
  /// @param after_seq Only return records with seq > after_seq (0 = from start).
  std::vector<AuditRecord> read(int64_t limit = 0, int64_t after_seq = 0) const;

  /// Current persisted chain head.
  ChainHead head() const;

  /// Number of records currently retained (excludes pruned/sealed rows).
  int64_t record_count() const;

  /// Deterministic canonical serialization of an event at a given seq.
  /// Stable field order so verify is reproducible across processes.
  static std::string canonicalize(int64_t seq, const AuditEvent & event);

  /// Genesis hash used as prev_hash for the very first record.
  static std::string genesis_hash();

  /// SHA-256 of `data` as a lowercase hex string (OpenSSL EVP).
  static std::string sha256_hex(const std::string & data);

  const std::string & db_path() const {
    return db_path_;
  }

 private:
  void initialize_schema();
  /// Read the persisted chain head row (audit_chain_head id=1) straight from the
  /// DB. Returns nullopt when the row is absent. verify() relies on this so a
  /// deleted head row is treated as tampering rather than silently recovered.
  std::optional<ChainHead> read_head_row_locked() const;
  ChainHead load_head_locked() const;
  void store_head_locked(const ChainHead & head_record);
  /// Seal + prune the oldest segment if the retained count exceeds the limit.
  void rotate_if_needed_locked();

  std::string db_path_;
  int64_t retention_max_records_{0};
  sqlite3 * db_{nullptr};
  mutable std::mutex mutex_;
  ChainHead head_;  ///< cached head, kept in sync with the head table
};

} // namespace ros2_medkit_fault_manager