selfpatch · mfaferek93 · Jun 30, 2026
diff --git a/src/ros2_medkit_fault_manager/CHANGELOG.rst b/src/ros2_medkit_fault_manager/CHANGELOG.rst
@@ -2,6 +2,10 @@
 Changelog for package ros2_medkit_fault_manager
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
+Forthcoming
+-----------
+* Optional append-only, hash-chained audit log of fault state transitions: each transition appends one immutable row (``record_hash = sha256(prev_hash + canonical(event))`` via OpenSSL EVP SHA-256) with a persisted chain head, a ``verify`` routine, a read API, and retention that seals a segment anchor before pruning. Time-based (PREFAILED->CONFIRMED) auto-confirmations are also audited. ``verify`` reads the chain head directly from the database, so deleting the newest row together with the head row is reported as tampering instead of silently recovering. ``BEFORE UPDATE`` / ``BEFORE DELETE`` triggers reject out-of-band edits as defense-in-depth. The chain is unkeyed and stored in a single writable file, so ``verify`` detects edits/deletions that did not recompute the chain (casual or accidental tampering); it is not a defense against an attacker who can rewrite the whole file. Off by default (`#483 <https://github.com/selfpatch/ros2_medkit/issues/483>`_)
+
 0.6.0 (2026-06-22)
 ------------------
 * Bounded concurrent snapshot capture under fault storms with a ``CaptureThreadPool`` and configurable capture pool / queue / overflow-policy parameters. The rosbag leg is serialized and the cooldown map is bounded, so a burst of simultaneous faults can no longer exhaust capture threads or grow memory without limit (`#456 <https://github.com/selfpatch/ros2_medkit/pull/456>`_)

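Reviewer note: the chaining rule in the changelog entry above can be illustrated with a small in-memory model. This is a minimal sketch, not the code in `fault_audit_log.cpp`: `toy_hash_hex` is a non-cryptographic FNV-1a stand-in for the SHA-256 digest the entry names (used only to keep the sketch dependency-free), and `Row`, `Chain`, the `"genesis"` seed, and the event strings are hypothetical names invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in digest: 64-bit FNV-1a, hex-encoded. NOT cryptographic and NOT what
// the patch uses (the patch names OpenSSL EVP SHA-256).
std::string toy_hash_hex(const std::string & data) {
  std::uint64_t h = 14695981039346656037ULL;
  for (unsigned char c : data) {
    h ^= c;
    h *= 1099511628211ULL;
  }
  static const char * digits = "0123456789abcdef";
  std::string out(16, '0');
  for (int i = 15; i >= 0; --i) {
    out[static_cast<std::size_t>(i)] = digits[h & 0xF];
    h >>= 4;
  }
  return out;
}

struct Row {
  std::int64_t seq;
  std::string event;  // already-canonical event text (hypothetical format)
  std::string prev_hash;
  std::string record_hash;
};

struct Chain {
  std::vector<Row> rows;
  std::int64_t head_seq{0};
  std::string head_hash{toy_hash_hex("genesis")};  // hypothetical genesis seed

  // record_hash = H(prev_hash + canonical(event)); the head advances with the row.
  void append(const std::string & event) {
    Row r{head_seq + 1, event, head_hash, toy_hash_hex(head_hash + event)};
    rows.push_back(r);
    head_seq = r.seq;
    head_hash = r.record_hash;
  }

  // Walk oldest-first, recompute every link, then compare against the stored head.
  bool verify() const {
    std::int64_t expected_seq = 1;
    std::string expected_prev = toy_hash_hex("genesis");
    for (const Row & r : rows) {
      const bool linked = r.seq == expected_seq && r.prev_hash == expected_prev;
      const bool hashed = r.record_hash == toy_hash_hex(r.prev_hash + r.event);
      if (!linked || !hashed) {
        return false;
      }
      expected_prev = r.record_hash;
      ++expected_seq;
    }
    return head_seq == expected_seq - 1 && head_hash == expected_prev;
  }
};
```

In this model an edited row fails the recomputed-hash check, a row removed from the middle fails the linkage check on its successor, and a removed newest row fails the final comparison against the stored head. Anyone who rewrites the rows and the head together still passes, which is the limitation the changelog entry states.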
diff --git a/src/ros2_medkit_fault_manager/CMakeLists.txt b/src/ros2_medkit_fault_manager/CMakeLists.txt
@@ -41,6 +41,8 @@ find_package(ros2_medkit_msgs REQUIRED)
 find_package(ros2_medkit_serialization REQUIRED)
 find_package(SQLite3 REQUIRED)
 find_package(nlohmann_json REQUIRED)
+# OpenSSL EVP SHA-256 for the tamper-evident audit log hash chain
+find_package(OpenSSL REQUIRED)
 # yaml-cpp is required as transitive dependency from ros2_medkit_serialization
 medkit_find_yaml_cpp()
 # rosbag2 for time-window snapshot recording
@@ -55,6 +57,7 @@ add_library(fault_manager_lib STATIC
   src/fault_manager_node.cpp
   src/fault_storage.cpp
   src/sqlite_fault_storage.cpp
+  src/fault_audit_log.cpp
   src/snapshot_capture.cpp
   src/rosbag_capture.cpp
   src/correlation/types.cpp
@@ -81,6 +84,7 @@ target_link_libraries(fault_manager_lib PUBLIC
   SQLite::SQLite3
   nlohmann_json::nlohmann_json
   yaml-cpp::yaml-cpp
+  OpenSSL::Crypto
 )
 
 medkit_apply_compat_defs(fault_manager_lib)
@@ -143,6 +147,10 @@ if(BUILD_TESTING)
   medkit_target_dependencies(test_sqlite_storage rclcpp ros2_medkit_msgs)
   medkit_set_test_domain(test_sqlite_storage)
 
+  # Fault audit log tests (hash chain, verify, rotation, reopen)
+  ament_add_gtest(test_fault_audit_log test/test_fault_audit_log.cpp)
+  target_link_libraries(test_fault_audit_log fault_manager_lib)
+
   # Snapshot capture tests
   ament_add_gtest(test_snapshot_capture test/test_snapshot_capture.cpp)
   target_link_libraries(test_snapshot_capture fault_manager_lib)

diff --git a/src/ros2_medkit_fault_manager/README.md b/src/ros2_medkit_fault_manager/README.md
@@ -53,6 +53,7 @@ ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
 - **Debounce filtering** (optional): AUTOSAR DEM-style counter-based fault confirmation with per-entity threshold overrides
 - **Snapshot capture**: Captures topic data when faults are confirmed for debugging (snapshots are deleted when fault is cleared)
 - **Fault correlation** (optional): Root cause analysis with symptom muting and auto-clear
+- **Tamper-evident audit log** (optional): Append-only, hash-chained record of fault state transitions for verifiable history
 
 ## Parameters
 
@@ -109,6 +110,28 @@ patterns:
 
 **Memory**: Faults are stored in memory only. Useful for testing or when persistence is not required.
 
+## Advanced: Tamper-Evident Audit Log
+
+An optional append-only, hash-chained audit log records every fault state transition (`occurred`, `confirmed`, `healed`, `cleared`) so the fault history is independently verifiable. Auto-recovery (a fault reaching the healing threshold via PASSED events) is recorded as a distinct `healed` row with source `auto_heal`, so the fault's END is in the timeline and is not confused with a manual `cleared`. The manager has no acknowledge action separate from clearing, so `~/clear_fault` is recorded as `cleared` (clear == ack); there is no `ack` kind. The log also records its own lifecycle with `logging_activated` / `logging_deactivated` markers (CIR (EU) 2024/2690 sec. 3.2) at start and stop. It is **off by default** because it adds a write and storage cost per transition.
+
+Each transition appends one immutable row holding `record_hash = sha256(prev_hash + canonical(event))` (OpenSSL EVP SHA-256), the `prev_hash` it links to, and a monotonic `seq`. The hash is computed once at insert and never recomputed. A persisted chain head lets the chain resume across restarts. The log is stored in its own SQLite database (separate from the fault store) and is treated as append-only: the manager only ever inserts rows, and `BEFORE UPDATE` / `BEFORE DELETE` triggers reject out-of-band edits (the guarded rotation prune excepted).
+
+**Completeness is an integrity property.** `verify()` detects rows that were *deleted* from the chain, but it cannot detect a transition that was *never appended*: a silently dropped append is a hole `verify()` can never see. Every transition on the write path is therefore audited (occurred, timer/threshold confirmations, auto-heal, and clears), and an append failure is never swallowed: it increments a dropped-writes health counter and clears an "audit healthy" flag. With `audit_log.fail_closed` set, an append failure is a **hard error** that aborts the operation, so a compliance-strict deployment learns the audit broke instead of losing records unnoticed. The default (`fail_closed=false`) keeps fault processing running but still surfaces the gap via the health counter.
+
+`verify()` walks the persisted chain oldest-first and recomputes every link: editing a row breaks its `record_hash`, deleting a row breaks the next row's `prev_hash` linkage, and deleting the newest row is caught by the persisted-head check, because the head row is read straight from the DB instead of being silently recovered.
+
+**Threat model (read this).** The chain is **unkeyed**, and the head and segment anchors live in the **same writable SQLite file** as the rows. `verify()` therefore catches edits or deletions that did **not** also recompute the chain - that is, casual or accidental tampering, and the bookkeeping bugs that would otherwise lose records. The append-only triggers are defense-in-depth: `audit_log` rejects out-of-band UPDATE/DELETE, and the rotation-prune guard (`audit_prune_guard`) is itself protected by a trigger so an external writer cannot simply flip it open and then delete a prefix - that flip is only permitted from the in-process connection that holds a per-connection temp marker. The single-row chain head (`audit_chain_head`) is intentionally **not** trigger-protected (a trigger there would block the legitimate head update inside the append transaction); a casual edit or delete of the head is instead caught by `verify()` via the seq/hash/head-mismatch checks. None of this stops an attacker with write access to the file: such an attacker can create the same temp marker or drop the triggers, and recompute the entire chain (head and anchors included) to forge a self-consistent history. The triggers are **not** a security boundary - this is tamper-**evident**, not tamper-**proof**. True tamper-*proofing* requires a key or signature over the head (so it cannot be recomputed without the key) or external anchoring of the head hash to an append-only store you do not control; both are out of scope here and belong to the audit-log exporter / signing follow-up.
+
+**Retention/rotation**: when more than `audit_log.retention_max_records` rows are retained, the oldest segment is *sealed* (its final `seq` + hash are persisted as an anchor) and then pruned. The surviving tail still verifies because the oldest retained row links back to the sealed anchor.
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `audit_log.enabled` | bool | `false` | Enable the tamper-evident audit log |
+| `audit_log.transitions` | string | `"all"` | Which transitions to record: `"all"` (occurred/confirmed/healed/cleared) or `"confirmed_only"`. Lifecycle markers are always recorded. |
+| `audit_log.database_path` | string | `""` | SQLite path. Empty => sibling `fault_audit.db` next to the fault DB (or `:memory:` for in-memory fault stores) |
+| `audit_log.retention_max_records` | int | `0` | Seal + prune the oldest segment beyond this many retained records (0 = unlimited) |
+| `audit_log.fail_closed` | bool | `false` | When `true`, an audit append failure is a hard error that aborts the operation (compliance-strict). When `false`, the failure is logged and counted but fault processing continues. Either way the gap is visible via the dropped-writes health counter. |
+
 ## Usage
 
 ### Launch

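Reviewer note: the retention paragraph in the README hunk above (seal an anchor, then prune, and the tail still verifies) can be sketched the same way. This is an illustrative model under stated assumptions, not the patch's SQLite implementation: `toy_hash_hex` is a non-cryptographic FNV-1a stand-in for SHA-256, `RotatingChain` and `Anchor` are hypothetical names, and the "segment" sealed on each rotation is a single row because this sketch rotates on every append.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>

// Stand-in digest (64-bit FNV-1a, hex). NOT cryptographic; the patch uses SHA-256.
std::string toy_hash_hex(const std::string & data) {
  std::uint64_t h = 14695981039346656037ULL;
  for (unsigned char c : data) {
    h ^= c;
    h *= 1099511628211ULL;
  }
  static const char * digits = "0123456789abcdef";
  std::string out(16, '0');
  for (int i = 15; i >= 0; --i) {
    out[static_cast<std::size_t>(i)] = digits[h & 0xF];
    h >>= 4;
  }
  return out;
}

struct Row {
  std::int64_t seq;
  std::string event;
  std::string prev_hash;
  std::string record_hash;
};

// Final seq + hash of a position in the chain (sealed segment end, or head).
struct Anchor {
  std::int64_t seq;
  std::string hash;
};

struct RotatingChain {
  std::deque<Row> rows;
  Anchor anchor{0, toy_hash_hex("genesis")};  // nothing sealed yet
  Anchor head{0, toy_hash_hex("genesis")};
  std::size_t max_records{0};  // 0 = unlimited, mirroring retention_max_records

  void append(const std::string & event) {
    rows.push_back({head.seq + 1, event, head.hash, toy_hash_hex(head.hash + event)});
    head = {rows.back().seq, rows.back().record_hash};
    while (max_records != 0 && rows.size() > max_records) {
      // Seal first: remember the final seq + hash of what is about to go...
      anchor = {rows.front().seq, rows.front().record_hash};
      rows.pop_front();  // ...then prune it.
    }
  }

  // The oldest retained row must link back to the sealed anchor.
  bool verify() const {
    std::int64_t expected_seq = anchor.seq + 1;
    std::string expected_prev = anchor.hash;
    for (const Row & r : rows) {
      if (r.seq != expected_seq || r.prev_hash != expected_prev ||
          r.record_hash != toy_hash_hex(r.prev_hash + r.event)) {
        return false;
      }
      expected_prev = r.record_hash;
      ++expected_seq;
    }
    return head.seq == expected_seq - 1 && head.hash == expected_prev;
  }
};
```

Sealing before pruning matters for ordering: if the prune ran first and the process stopped before the anchor was written, the oldest retained row would point at a `prev_hash` that no longer exists anywhere, and verification would fail on a log nobody tampered with.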
diff --git a/src/ros2_medkit_fault_manager/config/fault_manager.yaml b/src/ros2_medkit_fault_manager/config/fault_manager.yaml
@@ -27,3 +27,26 @@ fault_manager:
     # snapshots.capture_pool_size: 2                  # max concurrent capture threads (>= 1)
     # snapshots.capture_queue_depth: 16               # max pending captures (>= 1)
     # snapshots.capture_queue_full_policy: reject_newest  # reject_newest | drop_oldest
+
+    # Append-only, hash-chained audit log of fault state transitions
+    # (occurred/confirmed/healed/cleared). OFF by default: it adds a write +
+    # storage cost per transition. When enabled, each transition appends one
+    # immutable, hash-chained row, so verify() detects edits or deletions that did
+    # NOT also recompute the chain (casual/accidental tampering). The chain is
+    # unkeyed and lives in a single writable file, so it is NOT proof against an
+    # attacker who can rewrite the whole file; true tamper-proofing needs a signed
+    # head or external anchoring (out of scope here). See README "Threat model".
+    audit_log.enabled: false
+    # Which transitions to record: "all" or "confirmed_only".
+    # audit_log.transitions: all
+    # SQLite path. Empty => sibling "fault_audit.db" next to the fault DB
+    # (or :memory: when the fault store is in-memory).
+    # audit_log.database_path: ""
+    # Seal + prune the oldest segment beyond this many retained records
+    # (0 = unlimited). A sealed anchor keeps the surviving tail verifiable.
+    # audit_log.retention_max_records: 0
+    # Fail-closed (compliance-strict): when true, an audit append failure is a hard
+    # error that aborts the operation instead of being logged and dropped. Default
+    # false keeps fault processing running; either way a dropped-writes health
+    # counter makes the gap visible.
+    # audit_log.fail_closed: false
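
Reviewer note: the `audit_log.fail_closed` comment above describes two failure policies that share the same bookkeeping. A minimal sketch of that decision follows, assuming the append signals failure by throwing. `AuditHealth` and `audited_append` are hypothetical names for illustration; the patch's actual counter and flag are not shown in this diff.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <stdexcept>

// Hypothetical health state: a dropped-writes counter plus an "audit healthy" flag.
struct AuditHealth {
  std::uint64_t dropped_writes{0};
  bool healthy{true};
};

// Runs `append`, which is assumed to throw on failure.
// Returns true when the row was written, false when it was dropped (fail-open).
// Rethrows the original exception when fail_closed is set.
bool audited_append(const std::function<void()> & append, bool fail_closed, AuditHealth & health) {
  assert(append && "append callback must be set");
  try {
    append();
    return true;
  } catch (const std::exception &) {
    // Both policies record the gap, so a dropped write is never invisible.
    ++health.dropped_writes;
    health.healthy = false;
    if (fail_closed) {
      throw;  // hard error: the caller aborts the operation
    }
    return false;  // fail-open: fault processing continues
  }
}
```

The counter is updated before the policy branch, so the gap is visible under either setting, as the config comment describes.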
diff --git a/src/ros2_medkit_fault_manager/include/ros2_medkit_fault_manager/fault_audit_log.hpp b/src/ros2_medkit_fault_manager/include/ros2_medkit_fault_manager/fault_audit_log.hpp
@@ -0,0 +1,170 @@
+// Copyright 2026 mfaferek93, bburda
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#pragma once
+
+#include <sqlite3.h>
+
+#include <cstdint>
+#include <mutex>
+#include <optional>
+#include <string>
+#include <vector>
+
+namespace ros2_medkit_fault_manager {
+
+/// A single fault state-transition to record in the audit log.
+///
+/// `transition` is one of the kTransition* constants below. The remaining
+/// fields describe the fault at the moment of the transition; all of them feed
+/// the canonical serialization that the hash chain is computed over, so an edit
+/// to a stored row that does not also recompute the chain breaks verify().
+struct AuditEvent {
+  std::string fault_code;
+  std::string transition;     ///< one of the kTransition* constants below
+  uint8_t severity{0};        ///< severity at the time of the transition
+  std::string status;         ///< resulting fault status (e.g. CONFIRMED)
+  std::string source_id;      ///< reporting source that drove the transition
+  std::string description;    ///< human-readable description
+  int64_t occurred_at_ns{0};  ///< wall-clock timestamp of the transition, in nanoseconds
+};
+
+/// Canonical transition kinds. Stored verbatim, so they are part of the hash.
+constexpr const char * kTransitionOccurred = "occurred";
+constexpr const char * kTransitionConfirmed = "confirmed";
+constexpr const char * kTransitionCleared = "cleared";
+/// Auto-recovery: a fault reached the healing threshold via PASSED events. Kept
+/// distinct from kTransitionCleared so an automatic recovery is not mistaken for
+/// a manual clear in the timeline.
+constexpr const char * kTransitionHealed = "healed";
+/// Audit-log lifecycle markers (CIR (EU) 2024/2690 sec. 3.2: activation /
+/// deactivation of logging). Appended directly, independent of the per-fault
+/// transition filter, so the log records its own start and stop.
+constexpr const char * kTransitionLoggingActivated = "logging_activated";
+constexpr const char * kTransitionLoggingDeactivated = "logging_deactivated";
+// NOTE: there is deliberately no "ack" kind. The open fault_manager has no
+// acknowledge action separate from clearing: ~/clear_fault IS the acknowledge,
+// and it is recorded as kTransitionCleared (clear == ack). A separate "ack" kind
+// would never be written, so defining it would only mislead readers of the log.
+
+/// One immutable, hash-chained row read back from the audit log.
+struct AuditRecord {
+  int64_t seq{0};
+  AuditEvent event;
+  std::string prev_hash;
+  std::string record_hash;
+};
+
+/// Persisted head of the hash chain.
+struct ChainHead {
+  int64_t seq{0};           ///< 0 when the chain is empty
+  std::string record_hash;  ///< genesis hash when the chain is empty
+};
+
+/// Result of verifying the persisted chain.
+struct AuditVerifyResult {
+  bool ok{true};
+  int64_t checked{0};  ///< number of records walked
+  int64_t bad_seq{0};  ///< seq of the first offending record (0 if ok)
+  std::string error;   ///< human-readable reason when !ok
+};
+
+/// Append-only, hash-chained audit log of fault state transitions.
+///
+/// Each appended row stores `record_hash = sha256(prev_hash + canonical(event))`
+/// using OpenSSL's EVP SHA-256 (the same primitive the gateway links). The hash
+/// is computed once at insert and never recomputed. A persisted chain head lets
+/// the chain resume across process restarts, and rotation seals a segment by
+/// persisting an anchor (its final seq + hash) before pruning so the surviving
+/// history stays verifiable.
+///
+/// The table is treated as append-only: this class only ever INSERTs rows (and,
+/// on rotation, deletes a sealed prefix). It never UPDATEs an existing record.
+/// BEFORE UPDATE / BEFORE DELETE triggers reject out-of-band edits (the guarded
+/// rotation prune excepted) as defense-in-depth.
+///
+/// Threat model: the hash chain is UNKEYED and the head/anchors live in the same
+/// writable file. verify() catches edits or deletions that did not also recompute
+/// the chain (casual or accidental tampering), but anyone with write access to the
+/// file can recompute the whole chain (and drop the triggers) and forge a
+/// consistent history. True tamper-proofing needs a key/signature over the head or
+/// external anchoring; that is out of scope here.
+class FaultAuditLog {
+ public:
+  /// Open (or create) the audit log database.
+  /// @param db_path SQLite path. Use ":memory:" for an in-memory log.
+  /// @param retention_max_records Max records to retain before rotation seals
+  ///        and prunes the oldest segment. 0 disables rotation (unlimited).
+  /// @throws std::runtime_error if the database cannot be opened or initialized.
+  explicit FaultAuditLog(const std::string & db_path, int64_t retention_max_records = 0);
+
+  ~FaultAuditLog();
+
+  FaultAuditLog(const FaultAuditLog &) = delete;
+  FaultAuditLog & operator=(const FaultAuditLog &) = delete;
+  FaultAuditLog(FaultAuditLog &&) = delete;
+  FaultAuditLog & operator=(FaultAuditLog &&) = delete;
+
+  /// Append one transition. Computes the chained hash, inserts the row, and
+  /// advances the persisted head, atomically.
+  /// @return the monotonic seq assigned to the new record.
+  int64_t append(const AuditEvent & event);
+
+  /// Walk the persisted chain oldest-first and validate every link.
+  AuditVerifyResult verify() const;
+
+  /// Read records oldest-first.
+  /// @param limit Max records to return (0 = all).
+  /// @param after_seq Only return records with seq > after_seq (0 = from start).
+  std::vector<AuditRecord> read(int64_t limit = 0, int64_t after_seq = 0) const;
+
+  /// Current persisted chain head.
+  ChainHead head() const;
+
+  /// Number of records currently retained (excludes pruned/sealed rows).
+  int64_t record_count() const;
+
+  /// Deterministic canonical serialization of an event at a given seq.
+  /// Stable field order so verify is reproducible across processes.
+  static std::string canonicalize(int64_t seq, const AuditEvent & event);
+
+  /// Genesis hash used as prev_hash for the very first record.
+  static std::string genesis_hash();
+
+  /// SHA-256 of `data` as a lowercase hex string (OpenSSL EVP).
+  static std::string sha256_hex(const std::string & data);
+
+  const std::string & db_path() const {
+    return db_path_;
+  }
+
+ private:
+  void initialize_schema();
+  /// Read the persisted chain head row (audit_chain_head id=1) straight from the
+  /// DB. Returns nullopt when the row is absent. verify() relies on this so a
+  /// deleted head row is treated as tampering rather than silently recovered.
+  std::optional<ChainHead> read_head_row_locked() const;
+  ChainHead load_head_locked() const;
+  void store_head_locked(const ChainHead & head_record);
+  /// Seal + prune the oldest segment if the retained count exceeds the limit.
+  void rotate_if_needed_locked();
+
+  std::string db_path_;
+  int64_t retention_max_records_{0};
+  sqlite3 * db_{nullptr};
+  mutable std::mutex mutex_;
+  ChainHead head_;  ///< cached head, kept in sync with the head table
+};
+
+}  // namespace ros2_medkit_fault_manager
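
Reviewer note: the header declares `canonicalize(seq, event)` with a "stable field order" but the encoding itself lives in `fault_audit_log.cpp`, which is not part of this excerpt. The sketch below shows one way such a serialization could avoid ambiguity, using length-prefixed fields. The encoding, the `put` helper, and the name `canonicalize_sketch` are assumptions for illustration and may differ from what the patch does; the `AuditEvent` struct mirrors the fields declared in the header.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Mirror of the AuditEvent fields declared in fault_audit_log.hpp.
struct AuditEvent {
  std::string fault_code;
  std::string transition;
  std::uint8_t severity{0};
  std::string status;
  std::string source_id;
  std::string description;
  std::int64_t occurred_at_ns{0};
};

// Hypothetical field encoding: "<byte length>:<bytes>;".
// The length prefix keeps field boundaries unambiguous even if a value
// contains ':' or ';'.
static void put(std::string & out, const std::string & field) {
  out += std::to_string(field.size());
  out += ':';
  out += field;
  out += ';';
}

// Fixed field order, so the same (seq, event) always yields the same bytes.
std::string canonicalize_sketch(std::int64_t seq, const AuditEvent & e) {
  std::string out;
  put(out, std::to_string(seq));
  put(out, e.fault_code);
  put(out, e.transition);
  put(out, std::to_string(static_cast<unsigned>(e.severity)));
  put(out, e.status);
  put(out, e.source_id);
  put(out, e.description);
  put(out, std::to_string(e.occurred_at_ns));
  return out;
}
```

A plain delimiter-joined string would hash `fault_code="ab", transition="c"` and `fault_code="a", transition="bc"` identically if the delimiter were omitted or could appear inside a value; length prefixes rule that out. Including `seq` in the hashed bytes also means a row cannot be renumbered without breaking its `record_hash`.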