
Conversation

@HardMax71 HardMax71 (Owner) commented Jan 19, 2026

Summary

Remove nullable fields where possible; they are a source of errors.

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • CI/CD or tooling changes

Test Plan

  • Unit tests pass (uv run pytest)
  • E2E tests pass (npx playwright test)
  • Manual testing performed

Manual test steps:

Checklist

  • My code follows the project's style guidelines
  • I have run pre-commit run --all-files and all checks pass
  • I have added tests that prove my fix/feature works
  • I have updated documentation if needed
  • My changes don't introduce new warnings

Summary by cubic

Replaced in-process lifecycles with DI and moved message handling to FastStream. Removed UnifiedConsumer/EventDispatcher, the coordinator, and EventBus; NotificationService now self-starts; SSE uses a stateless event router with a connection registry.

  • Refactors

    • Providers & DI: remove EventBus; start Kafka producer via DI; replace EventStoreConsumer with EventStoreService; remove factory helpers; add BoundaryClientProvider/RedisServicesProvider; replace SSE Kafka→Redis bridge with SSEEventRouter + SSEConnectionRegistry.
    • Messaging & Workers: drop UnifiedConsumer/EventDispatcher; adopt FastStream for K8s worker, result processor, saga, and DLQ processor; delete ExecutionCoordinator/QueueManager/ResourceManager; keep UnifiedProducer.
    • SSE & Notifications: remove SSEShutdownManager; SSEService requires explicit SSE_HEARTBEAT_INTERVAL; NotificationService starts background tasks on creation; UserSettingsService now uses Redis-backed cache (no EventBus invalidation).
    • Schema & Idempotency: switch to dataclasses-avroschema (Confluent wire format); add ResourceUsageAvro with a domain adapter; replace IdempotentEventHandler/wrapper with FastStream IdempotencyMiddleware.
  • Migration

    • Remove EventBusManager/EventBus; publish via KafkaEventService; delete any initialize(...) or manager.get_event_bus() calls.
    • Stop using create_* helpers (e.g., SSEKafkaRedisBridge, SSEShutdownManager, DLQManager, SagaOrchestrator, PodMonitor); rely on DI; use SSEEventRouter and SSEConnectionRegistry for SSE.
    • Remove UnifiedConsumer/EventDispatcher usage; use FastStream subscribers and the new workers (including DLQ).
    • Update references to K8sWorkerLogic, ProcessorLogic, and SagaLogic; ExecutionCoordinator/QueueManager/ResourceManager are removed.
    • Drop is_running flags; use status snapshots or shutdown events as needed.
    • Replace IdempotentEventHandler/IdempotentConsumerWrapper/idempotent_handler with FastStream IdempotencyMiddleware.
    • Migrate from pydantic-avro to dataclasses-avroschema; use ResourceUsageAvro in events (domain adapter provided); a model sketch follows this list.
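
A minimal sketch of what an event model might look like after the dataclasses-avroschema migration. The field names and Meta options are illustrative assumptions, not the repository's actual ResourceUsageAvro definition:

from dataclasses import dataclass

from dataclasses_avroschema import AvroModel


@dataclass
class ResourceUsageAvro(AvroModel):
    # Hypothetical fields, for illustration only
    execution_id: str
    cpu_seconds: float
    memory_mib: int

    class Meta:
        namespace = "integr8scode.events"  # assumed namespace


# dataclasses-avroschema derives the Avro schema from the dataclass; a
# Confluent-style serializer registers this schema and prefixes payloads
# with the magic byte + schema ID (the wire format mentioned above).
print(ResourceUsageAvro.avro_schema())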

Written for commit 4b104db. Summary will update on new commits.

Summary by CodeRabbit

  • New Features

    • SSE bridge with reliable Redis-backed routing and connection registry; expanded Kubernetes worker (pod lifecycle, daemonset pre-pull, wait-for-active creations); saga and result-processing logic exposed for DI; user settings now use Redis-backed cache.
  • Refactor

    • Centralized dependency and lifecycle management via DI providers; idempotency enforced as FastStream middleware (a sketch follows this list); notifications simplified to batch/internal delivery.
  • Removed

    • Legacy coordinator and EventBus flows, older consumer/dispatcher and idempotency wrappers.
  • Tests

    • Expanded DI-friendly fixtures, fakes and updated tests.
  • Chores

    • Dependency and packaging updates.
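
To make the middleware direction concrete, here is a rough sketch in FastStream's BaseMiddleware style, assuming consume_scope wraps handler invocation; the dedup key and the in-memory store are placeholders for the repository's Redis-backed IdempotencyManager:

from typing import Any

from faststream import BaseMiddleware


class SketchIdempotencyMiddleware(BaseMiddleware):
    """Illustrative only, not the PR's IdempotencyMiddleware."""

    _seen: set[str] = set()  # stand-in for a Redis-backed idempotency store

    async def consume_scope(self, call_next, msg) -> Any:
        key = msg.message_id  # assumed dedup key; real code may derive its own
        if key in self._seen:
            return None  # duplicate delivery: skip the handler
        result = await call_next(msg)
        self._seen.add(key)  # mark processed only after success
        return result

A broker would register it via something like KafkaBroker(middlewares=[SketchIdempotencyMiddleware]).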


@coderabbitai coderabbitai bot commented Jan 19, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Centralizes external clients into DI providers, replaces many LifecycleEnabled services with DI-managed logic and FastStream entrypoints, removes the coordinator and related queue/resource modules, and rewires SSE/K8s/saga/DLQ/idempotency/schema-registry handling, plus extensive test and docs updates.

Changes

Cohort / File(s) | Summary

Core DI & providers
`backend/app/core/providers.py`, `backend/app/core/container.py`, `backend/app/core/dishka_lifespan.py`
New BoundaryClientProvider/RedisServicesProvider/ResultProcessorProvider and container wiring; lifespan simplified to DI-driven startup. Audit provider method signatures and container factory changes.

Lifecycle primitives removed
`backend/app/core/lifecycle.py`, multiple service modules
Deleted LifecycleEnabled and removed lifecycle start/stop/init patterns across services — many classes lost on_start/_on_stop and context-manager behavior. Check impacted classes for new lifecycle expectations.

Messaging core & event plumbing
`backend/app/services/event_bus.py`, `backend/app/events/core/dispatcher.py`, `backend/app/events/core/consumer.py`, `backend/app/events/core/producer.py`, `backend/app/events/core/types.py`, `backend/app/events/event_store_consumer.py`
EventBus and EventDispatcher removed; UnifiedConsumer/UnifiedProducer simplified to DI-managed logic; EventStoreConsumer → EventStoreService. Review removed lifecycle/status APIs and producer/consumer lifecycle changes.

SSE: registry, router, redis bus
`backend/app/services/sse/sse_service.py`; removed: `.../sse_shutdown_manager.py`, `.../kafka_redis_bridge.py`; added: `backend/app/services/sse/sse_connection_registry.py`, `backend/app/services/sse/event_router.py`, `backend/app/services/sse/redis_bus.py`
Replace shutdown/bridge with SSEConnectionRegistry and SSEEventRouter; SSERedisBus gains internal pub/sub and tightened subscription typing. Validate handshake, registry usage, and new internal pub/sub APIs.

K8s worker & pod monitor
`backend/app/services/k8s_worker/worker_logic.py`, `backend/app/services/pod_monitor/monitor.py`; removed: `backend/app/core/k8s_clients.py`; test fakes: `backend/tests/helpers/fakes/kubernetes.py`
New K8sWorkerLogic class and run-style PodMonitor; k8s client factory removed in favor of DI-provided clients and test fakes. Inspect Kubernetes API DI surfaces and daemonset/configmap logic.

Saga orchestration
`backend/app/services/saga/saga_logic.py`, `backend/app/services/saga/__init__.py`, `backend/app/services/saga/saga_service.py`, `backend/app/services/saga/execution_saga.py`
SagaOrchestrator → SagaLogic with register/handle API and topic discovery; AcceptExecutionStep added. Verify saga registration, handle_event API, and topic discovery methods.

DLQ & result processing
`backend/app/dlq/manager.py`, `backend/app/services/result_processor/processor_logic.py`; tests: `backend/tests/integration/dlq/`, `backend/tests/integration/result_processor/`
DLQManager now processes messages individually and checks retries; ResultProcessor → ProcessorLogic without lifecycle/dispatcher. Update DI wiring for Kafka producer and retry policy configuration.

Idempotency & middleware
`backend/app/services/idempotency/idempotency_manager.py`; removed: `backend/app/services/idempotency/middleware.py`; added: `backend/app/services/idempotency/faststream_middleware.py`, `backend/app/services/idempotency/__init__.py`
Removed decorator/wrapper middleware; IdempotencyManager constructor changed; new FastStream middleware enforces idempotency in the broker pipeline. Review middleware error/skip behavior.

Coordinator & queue/resource removal
removed: `backend/app/services/coordinator/`, `backend/workers/run_coordinator.py`, `docs/**/coordinator`
Coordinator, QueueManager, ResourceManager, and the coordinator worker removed; public exports and docs updated. Search for dangling references and CI/docker-compose adjustments.

Workers → FastStream entrypoints
`backend/workers/run_k8s_worker.py`, `backend/workers/run_saga_orchestrator.py`, `backend/workers/run_result_processor.py`, `backend/workers/run_pod_monitor.py`, `backend/workers/run_sse_bridge.py`, `backend/workers/dlq_processor.py`
Manual run loops replaced by FastStream + Dishka main() functions and DI lifespans; note the new pipeline middleware (idempotency) and Avro decode routing. A minimal entrypoint sketch follows this table.

User settings, notifications & tests
`backend/app/services/user_settings_service.py`, `backend/app/services/notification_service.py`, `backend/tests/unit/conftest.py`, `backend/tests/integration/conftest.py`, `backend/tests/helpers/fakes/*`
user_settings switched to a Redis-backed cache; NotificationService simplified (batch API); tests migrate to DI container fixtures and many new fakes. Validate Redis client usage and test fixture compatibility.

Schema registry & Avro models
`backend/app/events/schema/schema_registry.py`, `backend/app/domain/events/typed.py`, `backend/tests/helpers/fakes/schema_registry.py`
Schema registry now lazy-registers schemas and caches IDs; Avro model changes (ResourceUsageAvro, AvroBaseModel). Verify serialization/deserialization and subject registration semantics.

SSERedisBus internals
`backend/app/services/sse/redis_bus.py`, `backend/app/services/sse/sse_connection_registry.py`
SSERedisBus adds internal pub/sub and InternalSubscription; SSERedisSubscription.get generic typing tightened. Check internal subscription lifecycle and JSON handling.

Tests, docs & infra
`backend/tests/`, `docs/`, `docker-compose.yaml`, `backend/pyproject.toml`, `.github/workflows/stack-tests.yml`
Many tests removed/rewritten; new test helpers/fakes added; docs updated to remove the coordinator and add SSE Bridge/FastStream; dependencies updated (dishka, faststream, avro lib). Review CI and docker-compose service removals.
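
As a rough illustration of the new worker shape (topic name, logic class, and provider are assumptions; only the faststream and dishka integration APIs are taken as given):

import asyncio

from dishka import Provider, Scope, make_async_container, provide
from dishka.integrations.faststream import FromDishka, inject, setup_dishka
from faststream import FastStream
from faststream.kafka import KafkaBroker

broker = KafkaBroker("localhost:9092")  # placeholder bootstrap server
app = FastStream(broker)


class WorkerLogic:
    """Stand-in for a DI-resolved class such as K8sWorkerLogic."""

    async def handle(self, payload: dict) -> None:
        print("handled:", payload)


class WorkerProvider(Provider):
    worker = provide(WorkerLogic, scope=Scope.APP)


@broker.subscriber("create-pod-commands")  # illustrative topic name
@inject
async def on_command(payload: dict, logic: FromDishka[WorkerLogic]) -> None:
    await logic.handle(payload)


def main() -> None:
    container = make_async_container(WorkerProvider())
    setup_dishka(container, app)
    asyncio.run(app.run())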

Sequence Diagram(s)

sequenceDiagram
participant FastStream as FastStream App
participant Dishka as Dishka DI Container
participant Schema as SchemaRegistryManager
participant Kafka as Kafka Broker
participant Logic as K8sWorkerLogic
participant K8s as Kubernetes API

FastStream->>Dishka: build container (BoundaryClientProvider, EventProvider, K8sWorkerProvider...)
Dishka-->>Logic: resolve K8sWorkerLogic
Kafka->>FastStream: deliver Avro message
FastStream->>Schema: decode Avro -> DomainEvent
Schema-->>FastStream: DomainEvent
FastStream->>Logic: handle_create_pod_command(event)
Logic->>K8s: create ConfigMap/Pod (CoreV1Api / AppsV1Api)
K8s-->>Logic: response/status
Logic->>Kafka: publish PodCreated / ExecutionFailed events

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes


Suggested labels

enhancement

Poem

🐰 I hopped through code with nimble paws,
I swapped the lifecycles and rewired the laws.
Redis and Kafka now sing in line,
FastStream wakes workers — everything's fine. ✨🐇

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Title check ⚠️ Warning: The title 'chore: removal of nullable fields' is misleading and does not accurately reflect the actual scope and nature of the changes in this PR. Resolution: update the title to describe the major refactoring, e.g. 'refactor: remove lifecycle patterns and migrate to DI-based FastStream workers' or 'refactor: remove EventBus and consolidate messaging via FastStream and DI'.
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 48.13%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.



@cubic-dev-ai cubic-dev-ai bot left a comment

No issues found across 2 files

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/services/notification_service.py (1)

136-156: Add _subscribe_to_events() to background tasks so execution event notifications fire.

_subscribe_to_events() is never invoked, so the Kafka consumer for execution-event notifications (EXECUTION_COMPLETED, EXECUTION_FAILED, EXECUTION_TIMEOUT) never starts. Add it to the background task queue:

🔧 Proposed fix
 def _start_background_tasks(self) -> None:
     """Start background processing tasks."""
     tasks = [
+        asyncio.create_task(self._subscribe_to_events()),
         asyncio.create_task(self._process_pending_notifications()),
         asyncio.create_task(self._cleanup_old_notifications()),
     ]

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/services/notification_service.py (1)

118-155: _subscribe_to_events() is never invoked—execution-event notifications won't be consumed.

The async method _subscribe_to_events() is defined (line 207) but never called anywhere in the codebase. This means the service will not set up Kafka consumer subscriptions for execution events, breaking event-driven notifications entirely.

Additionally, _start_background_tasks() calls asyncio.create_task() in __init__ without explicitly verifying a running loop. While lazy DI instantiation during async route handling may avoid an immediate RuntimeError, this pattern is fragile and doesn't fail fast.

Move event subscription setup into an async initialization point (e.g., a start() method called from the lifespan handler, or schedule _subscribe_to_events() as a background task once the loop is guaranteed to be running).

sse shutdown manager: moved to DI
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/core/providers.py (1)

515-531: Remove task creation from NotificationService.__init__ or make the provider async-aware.

NotificationService.__init__ calls asyncio.create_task() on lines 199-200 to start background tasks. Since get_notification_service is a synchronous provider, this will raise RuntimeError: no running event loop if resolved in any context before an event loop is active. Either defer task startup to an explicit async method (e.g., async def start()) called after service resolution in an async context, or convert the provider to resolve the service in an async scope.
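
A small sketch of the deferral both notification comments point at: nothing in __init__ touches the event loop, and tasks start from an explicit async entry point (the private method names are placeholders):

import asyncio


class NotificationServiceSketch:
    def __init__(self) -> None:
        self._tasks: list[asyncio.Task] = []  # no create_task() here

    async def start(self) -> None:
        # Called from the lifespan handler, so a running loop is guaranteed.
        self._tasks = [
            asyncio.create_task(self._subscribe_to_events()),
            asyncio.create_task(self._process_pending_notifications()),
        ]

    async def stop(self) -> None:
        for task in self._tasks:
            task.cancel()
        await asyncio.gather(*self._tasks, return_exceptions=True)

    async def _subscribe_to_events(self) -> None: ...

    async def _process_pending_notifications(self) -> None: ...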

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
backend/app/services/k8s_worker/worker.py (1)

169-175: Clear idempotent_consumer on shutdown to avoid stale “running” status.

get_status now reports "running" based on idempotent_consumer is not None, but __aexit__ doesn’t null it after stopping. This can report running even after shutdown.

🐛 Proposed fix
@@
         if self.idempotent_consumer:
             await self.idempotent_consumer.stop()
+            self.idempotent_consumer = None

Also applies to: 430-434

backend/app/services/sse/kafka_redis_bridge.py (1)

16-60: Make __aenter__ idempotent and exception-safe.

Tests call __aenter__ twice; current code overwrites self.consumers without stopping the already-started set, which can leak running consumers. Also, a partial start failure leaves running consumers behind. Guard re-entry and clean up on failure.

🔧 Suggested fix
 async def __aenter__(self) -> "SSEKafkaRedisBridge":
     """Start the SSE Kafka→Redis bridge."""
     self.logger.info(f"Starting SSE Kafka→Redis bridge with {self.num_consumers} consumers")
 
+    if self.consumers:
+        self.logger.debug("SSE Kafka→Redis bridge already started")
+        return self
+
     # Phase 1: Build all consumers and track them immediately (no I/O)
     self.consumers = [self._build_consumer(i) for i in range(self.num_consumers)]
 
     # Phase 2: Start all in parallel - already tracked in self.consumers for cleanup
     topics = list(CONSUMER_GROUP_SUBSCRIPTIONS[GroupId.WEBSOCKET_GATEWAY])
-    await asyncio.gather(*[c.start(topics) for c in self.consumers])
+    try:
+        await asyncio.gather(*[c.start(topics) for c in self.consumers])
+    except Exception:
+        await asyncio.gather(*[c.stop() for c in self.consumers], return_exceptions=True)
+        self.consumers.clear()
+        raise
🤖 Fix all issues with AI agents
In `@backend/app/services/coordinator/coordinator.py`:
- Around line 112-117: The log message in ExecutionCoordinator.__aenter__ is
misleading because queue_manager is constructed in __init__; update __aenter__
to not claim initialization: remove or replace the "Queue manager initialized"
logger call in __aenter__ (referencing ExecutionCoordinator.__aenter__ and the
queue_manager created in ExecutionCoordinator.__init__) and, if you need to
indicate readiness here, change it to a truthful message such as
"ExecutionCoordinator service started" or "Queue manager ready" so the log
reflects actual work performed in __aenter__.

In `@backend/app/services/pod_monitor/monitor.py`:
- Around line 289-297: The current _backoff() raises RuntimeError when reconnect
attempts exceed config.max_reconnect_attempts which propagates out of
_watch_loop() and kills the _watch_task; instead, modify _backoff() to not raise
but mark the monitor as paused (e.g., set self._paused = True and log the
failure with details) and/or schedule a retry via a new helper (e.g.,
_ensure_watch_running) that runs in background to attempt restart, and update
_watch_loop() to check self._paused and exit gracefully without killing the
background manager; ensure references to _backoff, _watch_loop, _watch_task, and
config.max_reconnect_attempts are used so the new paused flag or background
restart task can detect and restart the watcher.
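
A compressed sketch of the non-raising backoff this prompt describes; the attribute names follow the prompt, and the rest of PodMonitor is elided:

import asyncio


class PodMonitorSketch:
    def __init__(self, max_reconnect_attempts: int) -> None:
        self.max_reconnect_attempts = max_reconnect_attempts
        self._reconnect_attempts = 0
        self._paused = False

    async def _backoff(self) -> None:
        self._reconnect_attempts += 1
        if self._reconnect_attempts > self.max_reconnect_attempts:
            # Pause instead of raising, so the watch task ends cleanly.
            self._paused = True
            return
        await asyncio.sleep(min(2 ** self._reconnect_attempts, 60.0))

    async def _watch_loop(self) -> None:
        while not self._paused:
            try:
                await self._watch_once()  # placeholder for the k8s watch
                self._reconnect_attempts = 0  # reset after a healthy cycle
            except Exception:
                await self._backoff()

    async def _watch_once(self) -> None: ...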
🧹 Nitpick comments (6)
backend/app/services/coordinator/coordinator.py (1)

196-197: Redundant hasattr check.

idempotency_manager is always assigned in __init__ (line 97), so hasattr is unnecessary. The None check alone is sufficient.

Suggested simplification
-        if hasattr(self, "idempotency_manager") and self.idempotency_manager:
+        if self.idempotency_manager:
             await self.idempotency_manager.close()
backend/workers/run_k8s_worker.py (1)

42-46: Consider exiting if the worker stops unexpectedly.

The loop now waits only on shutdown_event. If the worker/consumer dies, the process may keep running and just log status. If you want automatic recovery or a clean exit, consider breaking/setting the shutdown event when status indicates not running.

♻️ Possible adjustment
         while not shutdown_event.is_set():
             await asyncio.sleep(60)
             status = await worker.get_status()
             logger.info(f"Kubernetes worker status: {status}")
+            if not status.get("running", True):
+                logger.error("KubernetesWorker not running; initiating shutdown.")
+                shutdown_event.set()
backend/app/services/pod_monitor/monitor.py (1)

122-138: Consider logging exceptions from the watch task during shutdown.

If _watch_task terminates due to an exception other than CancelledError (e.g., RuntimeError from _backoff), it will be silently ignored during __aexit__. Consider retrieving and logging any exception.

🔧 Suggested improvement
         if self._watch_task:
             self._watch_task.cancel()
             try:
                 await self._watch_task
             except asyncio.CancelledError:
                 pass
+            except Exception as e:
+                self.logger.warning(f"Watch task terminated with error: {e}")
backend/tests/unit/services/saga/test_saga_orchestrator_unit.py (1)

118-125: Assertion may not validate event handling outcome.

The assertion len(orch._sagas) > 0 verifies that register_saga(_Saga) populated _sagas, not that _handle_event processed the event. Based on the relevant code snippet showing _handle_event iterates over self._sagas, this assertion passes due to the register_saga call on line 121, not because of the event handling on line 123.

Consider asserting on observable side effects of event handling, such as checking the fake repository for saved sagas.

🔧 Suggested improvement
 @pytest.mark.asyncio
 async def test_min_success_flow(event_metrics: EventMetrics) -> None:
-    orch = _orch(event_metrics)
-    orch.register_saga(_Saga)
+    fake_repo = _FakeRepo()
+    orch = SagaOrchestrator(
+        config=SagaConfig(name="t", enable_compensation=True, store_events=True, publish_commands=False),
+        saga_repository=fake_repo,
+        producer=_FakeProd(),
+        schema_registry_manager=MagicMock(spec=SchemaRegistryManager),
+        settings=MagicMock(spec=Settings),
+        event_store=_FakeStore(),
+        idempotency_manager=_FakeIdem(),
+        resource_allocation_repository=_FakeAlloc(),
+        logger=_test_logger,
+        event_metrics=event_metrics,
+    )
+    orch.register_saga(_Saga)
     # Handle the event
     await orch._handle_event(make_execution_requested_event(execution_id="e"))
-    # basic sanity; deep behavior covered by integration
-    assert len(orch._sagas) > 0
+    # Verify saga was created and saved
+    assert len(fake_repo.saved) > 0
backend/workers/run_saga_orchestrator.py (1)

31-43: Consider liveness handling after dropping the is_running gate.
With only shutdown_event, the worker will keep sleeping even if the orchestrator stops unexpectedly; please confirm that’s intended or wire a failure hook to trigger shutdown.

backend/app/services/saga/saga_orchestrator.py (1)

361-384: Consider re-raising CancelledError for cleaner task semantics.

While suppressing CancelledError works here since asyncio.gather(..., return_exceptions=True) is used in __aexit__, re-raising is the recommended pattern for asyncio tasks. This ensures the task is properly marked as cancelled rather than completing normally.

♻️ Optional: Re-raise CancelledError
         except asyncio.CancelledError:
             self.logger.info("Timeout checker cancelled")
+            raise

@cubic-dev-ai cubic-dev-ai bot left a comment

8 issues found across 32 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/workers/run_pod_monitor.py">

<violation number="1" location="backend/workers/run_pod_monitor.py:45">
P2: The loop now ignores the PodMonitor state, so if the monitor stops internally the worker will keep polling forever instead of exiting. Consider reintroducing a state check (or equivalent stop condition) to avoid hanging on a stopped monitor.</violation>
</file>

<file name="backend/app/services/saga/saga_orchestrator.py">

<violation number="1" location="backend/app/services/saga/saga_orchestrator.py:364">
P2: The timeout checker now stops permanently on any unexpected exception because the loop no longer catches and logs `Exception`. A transient repository error will kill the task and timeout checks will never resume. Wrap the body of the loop in a try/except to log and continue.</violation>
</file>

<file name="backend/app/services/k8s_worker/worker.py">

<violation number="1" location="backend/app/services/k8s_worker/worker.py:433">
P2: `running` is now derived from `idempotent_consumer` existence, but the shutdown path stops the consumer without clearing the reference. After `__aexit__`, `get_status()` will still return running=True even though the worker has stopped. Track an explicit running flag or clear `self.idempotent_consumer` on shutdown so status reflects the actual lifecycle state.</violation>
</file>

<file name="backend/tests/unit/services/saga/test_saga_orchestrator_unit.py">

<violation number="1" location="backend/tests/unit/services/saga/test_saga_orchestrator_unit.py:125">
P3: This assertion no longer validates that `_handle_event` did anything; `_sagas` is populated by `register_saga` before the handler runs, so the test will pass even if event handling fails. Assert on a side effect of `_handle_event` (e.g., running instances or saved saga) instead.</violation>
</file>

<file name="backend/workers/run_k8s_worker.py">

<violation number="1" location="backend/workers/run_k8s_worker.py:43">
P2: The run loop no longer exits when the KubernetesWorker stops. If the worker shuts down unexpectedly, this loop keeps running and prevents graceful cleanup until a signal arrives. Consider restoring the `worker.is_running` guard (or breaking when status indicates stopped) so the process terminates cleanly when the worker stops.</violation>
</file>

<file name="backend/app/events/event_store_consumer.py">

<violation number="1" location="backend/app/events/event_store_consumer.py:131">
P2: Unhandled exceptions in the batch processor will terminate the background task and stop future flushes; add error handling inside the loop to log and continue.</violation>
</file>

<file name="backend/workers/run_coordinator.py">

<violation number="1" location="backend/workers/run_coordinator.py:43">
P2: The loop no longer exits when the coordinator stops, so the worker can keep polling indefinitely after a coordinator failure instead of shutting down and cleaning up.</violation>
</file>

<file name="backend/app/services/pod_monitor/monitor.py">

<violation number="1" location="backend/app/services/pod_monitor/monitor.py:297">
P2: Raising RuntimeError in the watch backoff path will terminate the background watch task without a clean shutdown; since the task is created with `create_task` and `__aexit__` only suppresses `CancelledError`, this exception will bubble up later on shutdown or leave the monitor silently stopped. Consider exiting the watch loop via cancellation instead of throwing a RuntimeError.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@backend/tests/integration/conftest.py`:
- Around line 74-83: The SSERedisBus constructor is called with arguments in the
wrong order in the sse_redis_bus fixture: SSERedisBus expects
(redis_client, logger, exec_prefix, notif_prefix) but the fixture passes
exec_prefix and notif_prefix positionally and logger last; fix by passing logger
as the second argument (either positionally after redis_client or as
logger=<logger>) and providing exec_prefix and notif_prefix afterwards (or as
keywords) so the call matches the constructor signature, e.g.
SSERedisBus(redis_client, logger, exec_prefix=..., notif_prefix=...).
🧹 Nitpick comments (1)
backend/tests/integration/idempotency/test_consumer_idempotent.py (1)

9-10: Redundant import of EventDispatcher.

EventDispatcher is imported twice—once from app.events.core (line 9) and again aliased as Disp from app.events.core.dispatcher (line 10). Since app.events.core likely re-exports from the dispatcher module, these are the same class.

Suggested fix
-from app.events.core import ConsumerConfig, EventDispatcher, UnifiedConsumer, UnifiedProducer
-from app.events.core.dispatcher import EventDispatcher as Disp
+from app.events.core import ConsumerConfig, EventDispatcher, UnifiedConsumer, UnifiedProducer

Then update line 47 to use EventDispatcher directly:

-    disp: Disp = EventDispatcher(logger=_test_logger)
+    disp = EventDispatcher(logger=_test_logger)

@cubic-dev-ai cubic-dev-ai bot left a comment

13 issues found across 47 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/workers/run_k8s_worker.py">

<violation number="1" location="backend/workers/run_k8s_worker.py:53">
P1: Exceptions from completed tasks in `done` set are silently swallowed. If `consumer.run()` or `logic.ensure_daemonset_task()` raises an exception, the worker will silently proceed to shutdown without logging or propagating the error. Add exception handling for tasks in `done`.</violation>
</file>

<file name="backend/workers/run_result_processor.py">

<violation number="1" location="backend/workers/run_result_processor.py:42">
P2: Exceptions from completed tasks in `done` set are never retrieved. If `consumer.run()` crashes with an exception before the shutdown signal, the error is silently ignored. Check for exceptions in completed tasks to properly log failures.</violation>
</file>

<file name="backend/app/core/dishka_lifespan.py">

<violation number="1" location="backend/app/core/dishka_lifespan.py:116">
P3: Inconsistent logging style: use structured logging with `extra` parameter instead of f-string interpolation to match the rest of this file's logging pattern. This improves log aggregation and follows the established convention.</violation>
</file>

<file name="backend/app/services/coordinator/coordinator_logic.py">

<violation number="1" location="backend/app/services/coordinator/coordinator_logic.py:280">
P1: `event.priority` is an `int`, so calling `.name` on it will raise `AttributeError`. The priority value must be converted to `QueuePriority` enum first.</violation>
</file>

<file name="backend/app/services/result_processor/processor_logic.py">

<violation number="1" location="backend/app/services/result_processor/processor_logic.py:33">
P3: Misleading docstring: The class is no longer stateful after this refactoring. The removed `_state`, `_consumer`, and `_dispatcher` attributes were what made it stateful. Consider updating the docstring to reflect that this class is now a stateless service that processes events.</violation>
</file>

<file name="backend/app/services/sse/sse_connection_registry.py">

<violation number="1" location="backend/app/services/sse/sse_connection_registry.py:49">
P1: Metrics are decremented even when the connection wasn't actually registered, which can lead to incorrect (potentially negative) connection counts. The code should only decrement metrics when a connection was actually removed from the registry.</violation>
</file>

<file name="backend/app/events/core/consumer.py">

<violation number="1" location="backend/app/events/core/consumer.py:109">
P1: `event` may be undefined when the error callback is invoked. If `self._schema_registry.deserialize_event()` throws an exception, `event` was never assigned, causing a `NameError`. Initialize `event` before the try block or guard the callback invocation.</violation>
</file>

<file name="backend/workers/run_coordinator.py">

<violation number="1" location="backend/workers/run_coordinator.py:53">
P1: Exceptions from completed tasks are silently lost. If `run_task` finishes with an exception (e.g., consumer or scheduling loop crashes), the error is never retrieved or logged. You should check tasks in `done` for exceptions to avoid silent failures.</violation>
</file>

<file name="backend/app/services/sse/event_router.py">

<violation number="1" location="backend/app/services/sse/event_router.py:44">
P2: Inefficient use of `model_dump()` just to extract a single field. Since all SSE-relevant events have `execution_id` as a direct attribute, use `getattr(event, 'execution_id', None)` to avoid serializing the entire model.</violation>
</file>

<file name="backend/workers/run_pod_monitor.py">

<violation number="1" location="backend/workers/run_pod_monitor.py:45">
P1: Exceptions from completed tasks are silently swallowed. When using `asyncio.wait()`, you must check completed tasks for exceptions, otherwise failures in `monitor.run()` will go unnoticed. Add exception handling for tasks in `done`.</violation>
</file>

<file name="backend/workers/run_saga_orchestrator.py">

<violation number="1" location="backend/workers/run_saga_orchestrator.py:65">
P1: Exceptions from completed tasks are silently swallowed. If `run_task` fails with an exception, it ends up in `done` but the exception is never checked or re-raised. Add exception handling for completed tasks to avoid silent failures.</violation>
</file>

<file name="backend/tests/integration/result_processor/test_result_processor.py">

<violation number="1" location="backend/tests/integration/result_processor/test_result_processor.py:130">
P2: Unused parameter `consumer_config` after refactoring. The function now creates its own `processor_consumer_config` and `stored_consumer_config` locally instead of using the injected fixture. Remove the unused parameter from the function signature.</violation>
</file>

<file name="backend/app/services/pod_monitor/monitor.py">

<violation number="1" location="backend/app/services/pod_monitor/monitor.py:123">
P2: `asyncio.CancelledError` is caught but not re-raised, breaking Python's async cancellation contract. The task will appear to complete normally instead of being cancelled, which can cause issues with task lifecycle management (e.g., `task.cancelled()` returns False, callers won't see the exception).</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.


    async def route_event(self, event: DomainEvent) -> None:
        """Route an event to Redis for SSE delivery."""
        data = event.model_dump()
@cubic-dev-ai cubic-dev-ai bot Jan 20, 2026

P2: Inefficient use of model_dump() just to extract a single field. Since all SSE-relevant events have execution_id as a direct attribute, use getattr(event, 'execution_id', None) to avoid serializing the entire model.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/app/services/sse/event_router.py, line 44:

<comment>Inefficient use of `model_dump()` just to extract a single field. Since all SSE-relevant events have `execution_id` as a direct attribute, use `getattr(event, 'execution_id', None)` to avoid serializing the entire model.</comment>

<file context>
@@ -0,0 +1,63 @@
+
+    async def route_event(self, event: DomainEvent) -> None:
+        """Route an event to Redis for SSE delivery."""
+        data = event.model_dump()
+        execution_id = data.get("execution_id")
+
</file context>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
backend/app/services/coordinator/coordinator_logic.py (1)

231-281: Cast priority to QueuePriority before using .name.

If event.priority is an int (as it appears elsewhere), priority.name will raise and the schedule path will fail.

🐛 Proposed fix: normalize to QueuePriority
-                priority = getattr(event, "priority", QueuePriority.NORMAL)
-                self.metrics.record_coordinator_queue_time(queue_time, priority.name)
+                priority = QueuePriority(getattr(event, "priority", QueuePriority.NORMAL))
+                self.metrics.record_coordinator_queue_time(queue_time, priority.name)
backend/app/services/result_processor/processor_logic.py (1)

88-93: Guard memory limit parsing to avoid handler failure.

Line 90 uses rstrip("Mi"), which will mis-handle Gi or other formats and can raise ValueError, aborting the handler. Consider a minimal parse guard to keep result processing resilient.

🛠️ Suggested fix
-            settings_limit = self._settings.K8S_POD_MEMORY_LIMIT
-            memory_limit_mib = int(settings_limit.rstrip("Mi"))  # TODO: Less brittle acquisition of limit
-            memory_percent = (memory_mib / memory_limit_mib) * 100
-            self._metrics.memory_utilization_percent.record(
-                memory_percent, attributes={"lang_and_version": lang_and_version}
-            )
+            settings_limit = self._settings.K8S_POD_MEMORY_LIMIT
+            try:
+                if settings_limit.endswith("Gi"):
+                    memory_limit_mib = int(settings_limit[:-2]) * 1024
+                elif settings_limit.endswith("Mi"):
+                    memory_limit_mib = int(settings_limit[:-2])
+                else:
+                    raise ValueError(f"Unsupported unit: {settings_limit}")
+                memory_percent = (memory_mib / memory_limit_mib) * 100
+                self._metrics.memory_utilization_percent.record(
+                    memory_percent, attributes={"lang_and_version": lang_and_version}
+                )
+            except ValueError:
+                self.logger.warning(
+                    "Skipping memory utilization metric due to unparseable limit",
+                    extra={"limit": settings_limit},
+                )
🤖 Fix all issues with AI agents
In `@backend/app/core/dishka_lifespan.py`:
- Around line 102-127: Background tasks created in the tasks list
(run_sse_consumers, event_store_consumer.run, notification_service.run) are
fire-and-forget and only awaited at shutdown, so runtime exceptions can be lost;
update the creation of each asyncio.create_task to attach an add_done_callback
that checks task.exception(), logs the error via logger, and on first non-None
exception cancels all other tasks in the tasks collection (safely skipping
already done/cancelled tasks) to implement immediate supervision and fail-fast
behavior, ensuring you still await asyncio.gather(*tasks,
return_exceptions=True) in the finally block to finish cleanup.

In `@backend/app/core/providers.py`:
- Around line 463-472: The provider get_sse_connection_registry currently
creates an SSEConnectionRegistry per request (provide(scope=Scope.REQUEST)),
which resets _active_connections and causes get_connection_count() to always
return 0; change the provider to use application scope
(provide(scope=Scope.APP)) so SSEConnectionRegistry is a singleton for the app,
preserving the asyncio.Lock() and shared _active_connections across requests and
allowing registrations from SSE streams to be visible to the health endpoint.

In `@backend/app/events/core/consumer.py`:
- Around line 104-110: The variable event can be unbound if deserialize_event()
raises, causing a NameError when awaiting self._error_callback(e, event); fix by
initializing event = None before the try/except where deserialize_event() is
called (so the callback always receives a defined value), and keep the existing
error logging and metrics calls (e.g., the except block that calls
self.logger.error, self._event_metrics.record_kafka_consumption_error and awaits
self._error_callback).

In `@backend/app/services/idempotency/middleware.py`:
- Around line 123-144: register_idempotent_handler wraps a handler in
IdempotentEventHandler then passes it to EventDispatcher.register which expects
handler.__name__; set the wrapper's __name__ (or use
functools.update_wrapper/functools.wraps) so the wrapped instance exposes the
original handler's __name__ to avoid AttributeError. In practice, after creating
wrapped in register_idempotent_handler, assign wrapped.__name__ =
getattr(handler, "__name__", type(handler).__name__) (or call
update_wrapper(wrapped, handler)) before calling
self._dispatcher.register(event_type)(wrapped).

In `@backend/app/services/sse/sse_connection_registry.py`:
- Around line 30-49: The register_connection and unregister_connection currently
call metrics.increment_sse_connections / decrement_sse_connections
unconditionally, causing drift on duplicate adds/removes; modify
register_connection to only call metrics.increment_sse_connections("executions")
(and log) if connection_id was not already present in
self._active_connections[execution_id], and modify unregister_connection to only
call metrics.decrement_sse_connections("executions") (and log) if connection_id
was actually removed (i.e., it existed before discard and the set changed or
length decreased); keep the async with self._lock around these checks and
mutations and use the same symbols (_active_connections, register_connection,
unregister_connection, metrics.increment_sse_connections,
metrics.decrement_sse_connections) to locate the changes.

In `@backend/tests/integration/events/test_consumer_lifecycle.py`:
- Around line 40-52: The test signals readiness too early: move the
consumer_started.set() so it runs after the internal consumer is initialized
inside run() (or after consumer.run() completes its startup logic) to avoid
calling consumer.seek_to_beginning()/consumer.seek_to_end() before
consumer._consumer is assigned; locate the run_with_signal coroutine and ensure
the readiness event is set only once consumer.run() has created/started the
AIOKafkaConsumer (or add an explicit check/wait for consumer._consumer to be
non-None before performing seeks).

In `@backend/workers/run_coordinator.py`:
- Around line 42-64: The code awaits asyncio.wait([run_task, shutdown_task]) but
only inspects pending tasks, so exceptions from completed tasks (e.g., run_task
from run_coordinator_tasks()) are never re-raised; update the logic after
asyncio.wait to iterate over the done set and call task.result() (or check
task.exception()) for each done task and re-raise or surface any exceptions (at
least for run_task) before cancelling pending tasks; locate
run_coordinator_tasks, run_task, shutdown_task, and the asyncio.wait call to
implement this change so failures in run_coordinator_tasks propagate correctly.

In `@backend/workers/run_k8s_worker.py`:
- Around line 49-64: The code creates run_task =
asyncio.create_task(run_worker_tasks()) and shutdown_task =
asyncio.create_task(shutdown_event.wait()) then awaits asyncio.wait but never
retrieves results from completed tasks, so exceptions from run_worker_tasks()
are swallowed; after the asyncio.wait call inspect each task in the done set
(e.g., check task.done() and call task.result() or task.exception()) and
re-raise or propagate any exception found (for example by calling task.result()
which will raise) before cancelling pending tasks so that worker failures
surface instead of being masked.

In `@backend/workers/run_result_processor.py`:
- Around line 35-53: The current shutdown sequence swallows exceptions from
consumer.run() because run_task's exception is never retrieved; after the
asyncio.wait completes, inspect the completed tasks (e.g., check if run_task is
in the done set), call run_task.result() (or handle/run_task.exception()) to
surface any exception, log it with logger.error including context, and re-raise
or set an exit error before cancelling pending tasks (shutdown_task) so failures
in consumer.run() are not treated as normal shutdown; update the logic around
run_task/run_task.result(), shutdown_task, and shutdown_event accordingly.

In `@backend/workers/run_saga_orchestrator.py`:
- Around line 54-70: The current shutdown logic cancels pending tasks but can
silently swallow exceptions from completed tasks like run_task; after
asyncio.wait returns, inspect the done set (e.g., run_task in done) and
propagate any errors by awaiting the completed tasks or checking
task.exception() and re-raising it before returning/finishing; keep existing
cancellation/await handling for pending tasks (shutdown_task) but ensure
exceptions from run_orchestrator_tasks are not ignored so the orchestrator
failure surfaces to the caller.
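
Across these worker prompts the underlying fix is the same asyncio.wait pattern; a minimal standalone sketch (task names mirror the workers above):

import asyncio


async def supervise(run_task: asyncio.Task, shutdown_task: asyncio.Task) -> None:
    done, pending = await asyncio.wait(
        {run_task, shutdown_task}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    for task in done:
        task.result()  # re-raises the task's exception instead of swallowing it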
♻️ Duplicate comments (4)
backend/app/events/event_store_consumer.py (1)

102-104: Add error handling for batch storage and commit failures.

If _store_batch() or consumer.commit() throws an exception (e.g., transient MongoDB or Kafka broker issue), the consumer will terminate entirely. Consider wrapping these operations in a try-except to log and continue processing subsequent batches.

🐛 Proposed fix
                 if events:
-                    await self._store_batch(events)
-                    await consumer.commit()
+                    try:
+                        await self._store_batch(events)
+                        await consumer.commit()
+                    except Exception as e:
+                        self.logger.error(f"Failed to store/commit batch: {e}", exc_info=True)
+                        # Continue processing - Kafka will redeliver uncommitted messages
backend/app/services/pod_monitor/monitor.py (1)

277-285: Max reconnect RuntimeError terminates monitor ungracefully.

This issue was previously flagged. When _backoff() raises RuntimeError after max attempts, it propagates through _watch_loop() to run(), which only catches CancelledError. The monitor stops without cleanup or recovery path. Consider either catching this in run() for graceful degradation or using cancellation-based exit.

backend/app/services/saga/saga_logic.py (1)

299-321: Keep the timeout loop resilient to transient errors.

As written, any unexpected exception will terminate the loop permanently. This matches the earlier concern and still applies here.

🐛 Proposed fix: log and continue on unexpected errors
     async def check_timeouts_loop(self) -> None:
         """Check for saga timeouts (runs until cancelled)."""
         try:
             while True:
-                # Check every 30 seconds
-                await asyncio.sleep(30)
-
-                cutoff_time = datetime.now(UTC) - timedelta(seconds=self.config.timeout_seconds)
-
-                timed_out = await self._repo.find_timed_out_sagas(cutoff_time)
-
-                for instance in timed_out:
-                    self.logger.warning(f"Saga {instance.saga_id} timed out")
-
-                    instance.state = SagaState.TIMEOUT
-                    instance.error_message = f"Saga timed out after {self.config.timeout_seconds} seconds"
-                    instance.completed_at = datetime.now(UTC)
-
-                    await self._save_saga(instance)
-                    self._running_instances.pop(instance.saga_id, None)
+                try:
+                    # Check every 30 seconds
+                    await asyncio.sleep(30)
+
+                    cutoff_time = datetime.now(UTC) - timedelta(seconds=self.config.timeout_seconds)
+
+                    timed_out = await self._repo.find_timed_out_sagas(cutoff_time)
+
+                    for instance in timed_out:
+                        self.logger.warning(f"Saga {instance.saga_id} timed out")
+
+                        instance.state = SagaState.TIMEOUT
+                        instance.error_message = f"Saga timed out after {self.config.timeout_seconds} seconds"
+                        instance.completed_at = datetime.now(UTC)
+
+                        await self._save_saga(instance)
+                        self._running_instances.pop(instance.saga_id, None)
+                except Exception as e:
+                    self.logger.error(f"Error in timeout checker: {e}", exc_info=True)
+                    await asyncio.sleep(5)
         except asyncio.CancelledError:
             self.logger.info("Timeout checker cancelled")
backend/tests/unit/services/saga/test_saga_orchestrator_unit.py (1)

93-99: Assertion doesn’t validate handle_event side effects.

register_saga already populates _sagas, so the current assertion can pass even if handle_event does nothing. Consider asserting on a side effect (e.g., repo save or saga state change) after the handler runs.

🧹 Nitpick comments (8)
backend/app/services/k8s_worker/worker_logic.py (4)

56-56: Redundant fallback for non-optional parameter.

The config parameter is typed as K8sWorkerConfig (non-optional), yet line 56 has a fallback config or K8sWorkerConfig(). Either make the parameter Optional[K8sWorkerConfig] or remove the fallback to match the signature.

-        self.config = config or K8sWorkerConfig()
+        self.config = config

154-154: Fire-and-forget task lacks exception propagation.

The task created here is not stored, so any unhandled exception will be silently lost (logged only). While the current implementation handles exceptions internally in _create_pod_for_execution, consider storing task references for better observability and graceful shutdown handling.


305-308: Defensive null checks on required dependency.

self.producer is assigned from a required constructor parameter (non-optional UnifiedProducer). These null checks suggest the producer might be None, but that would indicate a constructor bug. Either remove the checks or type the field as Optional[UnifiedProducer] if nullability is intentional.

Also applies to: 319-322, 336-339


356-356: Undocumented hardcoded delay.

The 5-second sleep before DaemonSet operations appears arbitrary. Consider documenting the reason (e.g., waiting for API server readiness) or making it configurable via K8sWorkerConfig.

backend/tests/e2e/test_k8s_worker_create_pod.py (1)

83-91: Resource cleanup may be skipped on test failure.

If the assertions on lines 85-87 fail, the cleanup on lines 90-91 won't execute, potentially leaving orphaned Kubernetes resources. Consider wrapping the verification and cleanup in a try/finally block.

🧹 Suggested fix for reliable cleanup
     # Verify resources exist
-    got_cm = logic.v1.read_namespaced_config_map(name=f"script-{exec_id}", namespace=ns)
-    assert got_cm is not None
-    got_pod = logic.v1.read_namespaced_pod(name=f"executor-{exec_id}", namespace=ns)
-    assert got_pod is not None
-
-    # Cleanup
-    logic.v1.delete_namespaced_pod(name=f"executor-{exec_id}", namespace=ns)
-    logic.v1.delete_namespaced_config_map(name=f"script-{exec_id}", namespace=ns)
+    try:
+        got_cm = logic.v1.read_namespaced_config_map(name=f"script-{exec_id}", namespace=ns)
+        assert got_cm is not None
+        got_pod = logic.v1.read_namespaced_pod(name=f"executor-{exec_id}", namespace=ns)
+        assert got_pod is not None
+    finally:
+        # Cleanup regardless of assertion outcome
+        logic.v1.delete_namespaced_pod(name=f"executor-{exec_id}", namespace=ns)
+        logic.v1.delete_namespaced_config_map(name=f"script-{exec_id}", namespace=ns)
backend/app/services/sse/event_router.py (1)

42-46: Avoid full model_dump() just to read execution_id.
model_dump() allocates a full dict (potentially large payloads). A direct attribute lookup avoids the extra work.

Suggested refactor
-        data = event.model_dump()
-        execution_id = data.get("execution_id")
+        execution_id = getattr(event, "execution_id", None)
backend/tests/integration/services/sse/test_partitioned_event_router.py (1)

39-41: Consider closing the subscription after the test.

The test opens a Redis subscription but doesn't explicitly close it. While the fixture cleanup may handle the Redis client, explicitly closing the subscription would be cleaner.

♻️ Suggested cleanup
     # Await the subscription - verify event arrived in Redis
     msg = await asyncio.wait_for(subscription.get(RedisSSEMessage), timeout=2.0)
     assert msg is not None
     assert str(msg.event_type) == str(ev.event_type)
+
+    await subscription.close()
backend/app/services/saga/saga_service.py (1)

162-178: Consider renaming method to reflect new implementation.

The method get_saga_status_from_orchestrator references "orchestrator" which no longer exists. Consider renaming to something like get_saga_status_live or get_live_saga_status to better reflect it fetches from SagaLogic's in-memory state first.

Comment on lines 463 to 472
 @provide(scope=Scope.REQUEST)
-def get_sse_shutdown_manager(
-    self, logger: logging.Logger, connection_metrics: ConnectionMetrics
-) -> SSEShutdownManager:
-    return create_sse_shutdown_manager(logger=logger, connection_metrics=connection_metrics)
+def get_sse_connection_registry(
+    self,
+    logger: logging.Logger,
+    connection_metrics: ConnectionMetrics,
+) -> SSEConnectionRegistry:
+    return SSEConnectionRegistry(
+        logger=logger,
+        connection_metrics=connection_metrics,
+    )

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Find SSEConnectionRegistry class definition
fd -t f "\.py$" | xargs grep -l "class SSEConnectionRegistry"

Repository: HardMax71/Integr8sCode

Length of output: 188


🏁 Script executed:

# Search for get_connection_count usage
rg "get_connection_count" --type py -A 3 -B 3

Repository: HardMax71/Integr8sCode

Length of output: 6270


🏁 Script executed:

# Search for SSEConnectionRegistry usage
rg "SSEConnectionRegistry" --type py -A 2 -B 2

Repository: HardMax71/Integr8sCode

Length of output: 6034


🏁 Script executed:

# Check how SSEService is provided
rg "@provide.*SSEService" --type py -A 5 -B 2

Repository: HardMax71/Integr8sCode

Length of output: 48


🏁 Script executed:

# Check for Scope definitions and their meanings
rg "class Scope" --type py -A 10

Repository: HardMax71/Integr8sCode

Length of output: 48


🏁 Script executed:

# Look at sse_connection_registry.py to understand persistence expectations
cat -n backend/app/services/sse/sse_connection_registry.py | head -50

Repository: HardMax71/Integr8sCode

Length of output: 2296


🏁 Script executed:

# Search for SSEService provider definition more broadly
rg "get_sse_service|SSEService" backend/app/core/providers.py -A 5 -B 5

Repository: HardMax71/Integr8sCode

Length of output: 1278


🏁 Script executed:

# Look at the full providers.py around the SSEService definition
rg "def get_" backend/app/core/providers.py | grep -i sse

Repository: HardMax71/Integr8sCode

Length of output: 355


🏁 Script executed:

# Check SSEService class to understand if it needs app-wide state
cat -n backend/app/services/sse/sse_service.py | head -80

Repository: HardMax71/Integr8sCode

Length of output: 3831


🏁 Script executed:

# Check if there's any caching or special handling for REQUEST scope singletons
rg "Scope\." backend/app/core/providers.py -B 2 -A 1 | head -40

Repository: HardMax71/Integr8sCode

Length of output: 706


🏁 Script executed:

# Look for health status endpoint that uses get_connection_count
rg "get_health_status" --type py -A 10 -B 5

Repository: HardMax71/Integr8sCode

Length of output: 3730


🏁 Script executed:

# Search for any test that shows REQUEST vs APP scope behavior
rg "Scope\.(REQUEST|APP)" backend/tests --type py | head -20

Repository: HardMax71/Integr8sCode

Length of output: 48


SSEConnectionRegistry scope must be Scope.APP, not Scope.REQUEST.

With Scope.REQUEST, a new registry instance is created per HTTP request, resulting in an empty _active_connections dictionary on every health status check. This causes get_connection_count() to always return 0. The registry is designed to track active connections globally across the application—registrations from one request's SSE stream must be visible when the health endpoint is called from a different request. The internal asyncio.Lock() and the persistent state expectations confirm the registry is intended as application-wide infrastructure.

🤖 Prompt for AI Agents
In `@backend/app/core/providers.py` around lines 463 - 472, The provider
get_sse_connection_registry currently creates an SSEConnectionRegistry per
request (provide(scope=Scope.REQUEST)), which resets _active_connections and
causes get_connection_count() to always return 0; change the provider to use
application scope (provide(scope=Scope.APP)) so SSEConnectionRegistry is a
singleton for the app, preserving the asyncio.Lock() and shared
_active_connections across requests and allowing registrations from SSE streams
to be visible to the health endpoint.
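
For reference, a minimal sketch of the app-scoped provider, using the dishka-style Provider/Scope API already in use in providers.py (the enclosing class name and import line are illustrative):

import logging

from dishka import Provider, Scope, provide

class SSEProvider(Provider):  # illustrative container class
    @provide(scope=Scope.APP)  # one shared registry for the whole application
    def get_sse_connection_registry(
        self,
        logger: logging.Logger,
        connection_metrics: ConnectionMetrics,
    ) -> SSEConnectionRegistry:
        return SSEConnectionRegistry(
            logger=logger,
            connection_metrics=connection_metrics,
        )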

Comment on lines 30 to 49
async def register_connection(self, execution_id: str, connection_id: str) -> None:
"""Register a new SSE connection."""
async with self._lock:
if execution_id not in self._active_connections:
self._active_connections[execution_id] = set()

self._active_connections[execution_id].add(connection_id)
self.logger.debug("Registered SSE connection", extra={"connection_id": connection_id})
self.metrics.increment_sse_connections("executions")

async def unregister_connection(self, execution_id: str, connection_id: str) -> None:
"""Unregister an SSE connection."""
async with self._lock:
if execution_id in self._active_connections:
self._active_connections[execution_id].discard(connection_id)
if not self._active_connections[execution_id]:
del self._active_connections[execution_id]

self.logger.debug("Unregistered SSE connection", extra={"connection_id": connection_id})
self.metrics.decrement_sse_connections("executions")

⚠️ Potential issue | 🟡 Minor

Prevent metrics drift on duplicate register/unregister.

Line 36 and Line 49 update metrics even when the connection was already present/absent, which can over/under-count (and potentially go negative). Consider only updating metrics when the set actually changes.

🛠️ Suggested fix
 async def register_connection(self, execution_id: str, connection_id: str) -> None:
     """Register a new SSE connection."""
     async with self._lock:
-        if execution_id not in self._active_connections:
-            self._active_connections[execution_id] = set()
-
-        self._active_connections[execution_id].add(connection_id)
-        self.logger.debug("Registered SSE connection", extra={"connection_id": connection_id})
-        self.metrics.increment_sse_connections("executions")
+        conns = self._active_connections.setdefault(execution_id, set())
+        if connection_id not in conns:
+            conns.add(connection_id)
+            self.metrics.increment_sse_connections("executions")
+        self.logger.debug("Registered SSE connection", extra={"connection_id": connection_id})

 async def unregister_connection(self, execution_id: str, connection_id: str) -> None:
     """Unregister an SSE connection."""
     async with self._lock:
-        if execution_id in self._active_connections:
-            self._active_connections[execution_id].discard(connection_id)
-            if not self._active_connections[execution_id]:
-                del self._active_connections[execution_id]
-
-        self.logger.debug("Unregistered SSE connection", extra={"connection_id": connection_id})
-        self.metrics.decrement_sse_connections("executions")
+        conns = self._active_connections.get(execution_id)
+        if conns and connection_id in conns:
+            conns.remove(connection_id)
+            if not conns:
+                del self._active_connections[execution_id]
+            self.metrics.decrement_sse_connections("executions")
+        self.logger.debug("Unregistered SSE connection", extra={"connection_id": connection_id})
🤖 Prompt for AI Agents
In `@backend/app/services/sse/sse_connection_registry.py` around lines 30 - 49,
The register_connection and unregister_connection currently call
metrics.increment_sse_connections / decrement_sse_connections unconditionally,
causing drift on duplicate adds/removes; modify register_connection to only call
metrics.increment_sse_connections("executions") (and log) if connection_id was
not already present in self._active_connections[execution_id], and modify
unregister_connection to only call
metrics.decrement_sse_connections("executions") (and log) if connection_id was
actually removed (i.e., it existed before discard and the set changed or length
decreased); keep the async with self._lock around these checks and mutations and
use the same symbols (_active_connections, register_connection,
unregister_connection, metrics.increment_sse_connections,
metrics.decrement_sse_connections) to locate the changes.

Comment on lines 40 to 52
 async def run_with_signal() -> None:
     consumer_started.set()
     await consumer.run()

 task = asyncio.create_task(run_with_signal())

 try:
-    st = c.get_status()
-    assert st.state == "running" and st.is_running is True
-    # Exercise seek functions; don't force specific partition offsets
-    await c.seek_to_beginning()
-    await c.seek_to_end()
-    # No need to sleep; just ensure we can call seek APIs while running
+    # Wait for consumer to start
+    await asyncio.wait_for(consumer_started.wait(), timeout=5.0)
+
+    # Exercise seek functions while consumer is running
+    await consumer.seek_to_beginning()
+    await consumer.seek_to_end()

⚠️ Potential issue | 🟡 Minor

Test may have a race: consumer_started signals before consumer is fully running.

consumer_started.set() fires before consumer.run() is called, so the seek operations at lines 51-52 may execute before the internal AIOKafkaConsumer is started. The seek_to_beginning() and seek_to_end() methods guard with if self._consumer, but self._consumer is only assigned inside run() after construction.

Consider moving the event signal after consumer initialization:

💡 Suggested improvement
     async def run_with_signal() -> None:
-        consumer_started.set()
-        await consumer.run()
+        # Small delay to ensure consumer.run() has started and set self._consumer
+        await asyncio.sleep(0.1)
+        consumer_started.set()
+        await consumer.run()

Alternatively, add a short delay before exercising seek methods, or check that consumer._consumer is not None before proceeding.

🤖 Prompt for AI Agents
In `@backend/tests/integration/events/test_consumer_lifecycle.py` around lines 40
- 52, The test signals readiness too early: move the consumer_started.set() so
it runs after the internal consumer is initialized inside run() (or after
consumer.run() completes its startup logic) to avoid calling
consumer.seek_to_beginning()/consumer.seek_to_end() before consumer._consumer is
assigned; locate the run_with_signal coroutine and ensure the readiness event is
set only once consumer.run() has created/started the AIOKafkaConsumer (or add an
explicit check/wait for consumer._consumer to be non-None before performing
seeks).
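
A sketch of that check-based alternative, polling until run() has assigned the internal consumer (test-only use of the private attribute, mirroring the comment above):

async def wait_until_consumer_ready(consumer: Any, timeout: float = 5.0) -> None:
    # Poll until consumer.run() has constructed the internal AIOKafkaConsumer.
    deadline = asyncio.get_running_loop().time() + timeout
    while consumer._consumer is None:  # noqa: SLF001 - test-only introspection
        if asyncio.get_running_loop().time() > deadline:
            raise TimeoutError("consumer did not become ready in time")
        await asyncio.sleep(0.05)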

@cubic-dev-ai cubic-dev-ai bot left a comment

5 issues found across 38 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/app/services/idempotency/faststream_middleware.py">

<violation number="1" location="backend/app/services/idempotency/faststream_middleware.py:80">
P2: If `mark_failed` throws an exception (e.g., network error), the original handler exception will be masked. Wrap `mark_failed` in a try-except to preserve the original exception.</violation>
</file>

<file name="backend/pyproject.toml">

<violation number="1" location="backend/pyproject.toml:92">
P2: Pin the dataclasses-avroschema dependency to a specific version to keep dependency resolution reproducible and consistent with the rest of the project.</violation>

<violation number="2" location="backend/pyproject.toml:127">
P2: Pin faststream to a specific version (or add an upper bound) to avoid unexpected breaking changes during dependency resolution.</violation>
</file>

<file name="backend/workers/run_result_processor.py">

<violation number="1" location="backend/workers/run_result_processor.py:157">
P2: The catch-all handler silently discards unhandled events without logging. This could mask routing issues or missing handlers for new event types. Consider logging a warning when unknown events are received.</violation>
</file>

<file name="backend/workers/run_k8s_worker.py">

<violation number="1" location="backend/workers/run_k8s_worker.py:142">
P2: Silent catch-all handler will discard unmatched events without any logging. This can mask bugs in filter logic or cause silent data loss when new event types are added. Consider logging unmatched events at warning level to aid debugging.</violation>
</file>


await idempotency.mark_completed(event=event, key_strategy="event_based")
return handler_result
except Exception as e:
await idempotency.mark_failed(event=event, error=str(e), key_strategy="event_based")
@cubic-dev-ai cubic-dev-ai bot Jan 20, 2026

P2: If mark_failed throws an exception (e.g., network error), the original handler exception will be masked. Wrap mark_failed in a try-except to preserve the original exception.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/app/services/idempotency/faststream_middleware.py, line 80:

<comment>If `mark_failed` throws an exception (e.g., network error), the original handler exception will be masked. Wrap `mark_failed` in a try-except to preserve the original exception.</comment>

<file context>
@@ -0,0 +1,81 @@
+            await idempotency.mark_completed(event=event, key_strategy="event_based")
+            return handler_result
+        except Exception as e:
+            await idempotency.mark_failed(event=event, error=str(e), key_strategy="event_based")
+            raise
</file context>
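
A sketch of the suggested guard, keeping the call shape from the context above (the module-level logger name is an assumption):

        except Exception as e:
            try:
                await idempotency.mark_failed(event=event, error=str(e), key_strategy="event_based")
            except Exception:
                # Never let bookkeeping errors mask the handler's original exception.
                logger.warning("mark_failed raised while recording failure", exc_info=True)
            raise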

"yarl==1.20.1",
"zipp==3.20.2",
"monggregate==0.22.1",
"faststream[kafka]>=0.6.0",
@cubic-dev-ai cubic-dev-ai bot Jan 20, 2026

P2: Pin faststream to a specific version (or add an upper bound) to avoid unexpected breaking changes during dependency resolution.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/pyproject.toml, line 127:

<comment>Pin faststream to a specific version (or add an upper bound) to avoid unexpected breaking changes during dependency resolution.</comment>

<file context>
@@ -124,6 +124,7 @@ dependencies = [
     "yarl==1.20.1",
     "zipp==3.20.2",
     "monggregate==0.22.1",
+    "faststream[kafka]>=0.6.0",
 ]
 
</file context>
Suggested change
"faststream[kafka]>=0.6.0",
"faststream[kafka]==0.6.0",

"pyasn1_modules==0.4.2",
"pydantic==2.9.2",
"pydantic-avro==0.9.1",
"dataclasses-avroschema[pydantic]>=0.65.0",
@cubic-dev-ai cubic-dev-ai bot Jan 20, 2026

P2: Pin the dataclasses-avroschema dependency to a specific version to keep dependency resolution reproducible and consistent with the rest of the project.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/pyproject.toml, line 92:

<comment>Pin the dataclasses-avroschema dependency to a specific version to keep dependency resolution reproducible and consistent with the rest of the project.</comment>

<file context>
@@ -89,7 +89,7 @@ dependencies = [
     "pyasn1_modules==0.4.2",
     "pydantic==2.9.2",
-    "pydantic-avro==0.9.1",
+    "dataclasses-avroschema[pydantic]>=0.65.0",
     "pydantic-settings==2.5.2",
     "pydantic_core==2.23.4",
</file context>
Suggested change
"dataclasses-avroschema[pydantic]>=0.65.0",
"dataclasses-avroschema[pydantic]==0.65.0",


# Default handler for unmatched events (prevents message loss)
@subscriber()
async def handle_other(event: DomainEvent) -> None:
@cubic-dev-ai cubic-dev-ai bot Jan 20, 2026

P2: The catch-all handler silently discards unhandled events without logging. This could mask routing issues or missing handlers for new event types. Consider logging a warning when unknown events are received.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/workers/run_result_processor.py, line 157:

<comment>The catch-all handler silently discards unhandled events without logging. This could mask routing issues or missing handlers for new event types. Consider logging a warning when unknown events are received.</comment>

<file context>
@@ -76,9 +75,106 @@ def main() -> None:
+
+        # Default handler for unmatched events (prevents message loss)
+        @subscriber()
+        async def handle_other(event: DomainEvent) -> None:
+            pass
+
</file context>


# Default handler for unmatched events (prevents message loss)
@subscriber()
async def handle_other(event: DomainEvent) -> None:
@cubic-dev-ai cubic-dev-ai bot Jan 20, 2026

P2: Silent catch-all handler will discard unmatched events without any logging. This can mask bugs in filter logic or cause silent data loss when new event types are added. Consider logging unmatched events at warning level to aid debugging.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/workers/run_k8s_worker.py, line 142:

<comment>Silent catch-all handler will discard unmatched events without any logging. This can mask bugs in filter logic or cause silent data loss when new event types are added. Consider logging unmatched events at warning level to aid debugging.</comment>

<file context>
@@ -88,9 +69,99 @@ def main() -> None:
+
+        # Default handler for unmatched events (prevents message loss)
+        @subscriber()
+        async def handle_other(event: DomainEvent) -> None:
+            pass
+
</file context>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/services/pod_monitor/event_mapper.py (1)

11-21: Guard ResourceUsageAvro construction when resource_usage is null.
If executor output includes "resource_usage": null, the current call raises and drops the event. Use a safe fallback to {}.

🐛 Suggested fix
-        return PodLogs(
+        return PodLogs(
             stdout=data.get("stdout", ""),
             stderr=data.get("stderr", ""),
             exit_code=data.get("exit_code", 0),
-            resource_usage=ResourceUsageAvro(**data.get("resource_usage", {})),
+            resource_usage=ResourceUsageAvro(**(data.get("resource_usage") or {})),
         )

Also applies to: 40-47, 491-502

🤖 Fix all issues with AI agents
In `@backend/tests/integration/conftest.py`:
- Around line 74-83: The sse_redis_bus fixture calls SSERedisBus with
redis_client as the only positional argument but SSERedisBus.__init__ expects
logger as the second positional parameter; fix by passing the test logger as the
second positional argument (e.g., SSERedisBus(redis_client, _test_logger,
exec_prefix=..., notif_prefix=...)) or convert to all keywords including
logger=_test_logger to ensure the logger is provided correctly.

In `@backend/tests/integration/result_processor/test_result_processor.py`:
- Around line 72-83: Change the consumer setup in the test to avoid replaying
historical Kafka events: when creating the test ConsumerConfig (the instance
assigned to processor_consumer_config) set auto_offset_reset="latest" instead of
"earliest" and avoid extremely small batch sizes (e.g., remove or increase
max_poll_records=1) so the consumer won't process long retention history slowly;
also ensure test code starts the consumer (the consumer created from
processor_consumer_config) before producing the test events so only new messages
are consumed.
♻️ Duplicate comments (4)
backend/app/core/providers.py (1)

461-470: SSEConnectionRegistry must use Scope.APP, not Scope.REQUEST.

With Scope.REQUEST, each HTTP request creates a new registry instance with an empty _active_connections dictionary. This causes get_connection_count() to always return 0 from the health endpoint since registrations from SSE streams are invisible to other requests. The registry is designed as application-wide infrastructure to track connections globally.

-    @provide(scope=Scope.REQUEST)
+    @provide(scope=Scope.APP)
     def get_sse_connection_registry(
backend/workers/run_pod_monitor.py (1)

74-88: Exceptions from monitor.run() are silently swallowed.

When asyncio.wait() returns, if monitor_task completed with an exception (e.g., K8s API failure), that exception is never retrieved from the done set. The worker exits normally without logging the actual failure cause.

🐛 Proposed fix
         done, pending = await asyncio.wait(
             [monitor_task, shutdown_task],
             return_when=asyncio.FIRST_COMPLETED,
         )

+        # Check for exceptions in completed tasks
+        for task in done:
+            if task is not shutdown_task and not task.cancelled():
+                exc = task.exception()
+                if exc:
+                    logger.error("PodMonitor crashed", exc_info=exc)
+                    raise exc
+
         for task in pending:
             task.cancel()
             with suppress(asyncio.CancelledError):
                 await task
backend/app/core/dishka_lifespan.py (1)

105-119: Background task failures may go unnoticed until shutdown.

Tasks created with asyncio.create_task() that fail during normal operation will have their exceptions silently swallowed until shutdown when gather(*tasks, return_exceptions=True) is called. If run_sse_consumers(), event_store_consumer.run(), or notification_service.run() raises an exception, the application continues running in a degraded state without any indication.

Consider adding done callbacks to log failures and optionally trigger shutdown on critical task failures.

🛠️ Suggested supervision pattern
     tasks = [
         asyncio.create_task(run_sse_consumers(), name="sse_consumers"),
         asyncio.create_task(event_store_consumer.run(), name="event_store_consumer"),
         asyncio.create_task(notification_service.run(), name="notification_service"),
     ]
+
+    def _handle_task_done(task: asyncio.Task) -> None:
+        if task.cancelled():
+            return
+        exc = task.exception()
+        if exc:
+            logger.error(
+                "Background task failed",
+                extra={"task_name": task.get_name()},
+                exc_info=exc,
+            )
+
+    for task in tasks:
+        task.add_done_callback(_handle_task_done)
+
     logger.info(
         "Background services started",
         extra={"sse_consumer_count": len(sse_consumers)},
     )
backend/app/services/saga/saga_logic.py (1)

315-325: Timeout loop should not die on transient exceptions.

Any unexpected exception inside check_timeouts_once() will terminate the loop and stop timeout enforcement. Wrap the loop body to log-and-continue on generic exceptions.

🛠️ Suggested hardening
     async def check_timeouts_loop(self) -> None:
         """Check for saga timeouts (runs until cancelled)."""
         try:
             while True:
                 # Check every 30 seconds
                 await asyncio.sleep(30)
-                await self.check_timeouts_once()
+                try:
+                    await self.check_timeouts_once()
+                except Exception:
+                    self.logger.error("Timeout checker failed; continuing loop", exc_info=True)
 
         except asyncio.CancelledError:
             self.logger.info("Timeout checker cancelled")
🧹 Nitpick comments (8)
backend/app/services/saga/execution_saga.py (1)

62-100: Use event.execution_id as a fallback for robustness.
If execution_id is ever missing in context (future refactors or alternate entry points), this step will fail before publishing. A simple fallback makes it resilient.

♻️ Suggested tweak
-            execution_id = context.get("execution_id")
+            execution_id = context.get("execution_id") or event.execution_id
backend/workers/run_result_processor.py (1)

134-158: Handlers call private methods (_handle_*) on ProcessorLogic.

The handlers directly invoke methods prefixed with _ (e.g., _handle_completed, _handle_failed), which conventionally indicates internal/private methods not intended for external use. Consider either making these methods public (remove underscore prefix) or providing a public interface method.
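
A sketch of the public-interface option (wrapper names are illustrative):

class ProcessorLogic:
    ...
    async def handle_completed(self, event: ExecutionCompletedEvent) -> None:
        # Public entry point so workers stop reaching into private methods.
        await self._handle_completed(event)

    # handle_failed / handle_timeout would delegate the same way.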

backend/app/services/idempotency/faststream_middleware.py (2)

44-50: Fail-open approach silently skips idempotency on DI failures.

The broad except Exception catch ensures message processing continues even when dependencies are unavailable. Consider logging a warning here to aid debugging when idempotency is unexpectedly skipped due to misconfiguration.

♻️ Suggested improvement
         # Resolve dependencies from request-scoped container
         try:
             idempotency = await container.get(IdempotencyManager)
             schema_registry = await container.get(SchemaRegistryManager)
-        except Exception:
+        except Exception as e:
             # Dependencies not available - skip idempotency
+            # Log at debug level to help diagnose misconfiguration
+            import logging
+            logging.getLogger(__name__).debug(
+                "Skipping idempotency: failed to resolve dependencies: %s", e
+            )
             return await call_next(msg)

58-62: Hardcoded topic string "idempotency" passed to deserialize_event.

The topic parameter in deserialize_event appears to be used only for error context. If the actual topic is available from msg, consider passing it for more accurate error messages during debugging.
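
A sketch of that idea, assuming FastStream exposes the underlying aiokafka record as msg.raw_message (aiokafka's ConsumerRecord carries .topic; the deserialize call shape is illustrative):

        raw = getattr(msg, "raw_message", None)
        topic = getattr(raw, "topic", "idempotency")  # fall back to the current literal
        event = await schema_registry.deserialize_event(payload, topic)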

backend/workers/run_saga_orchestrator.py (1)

160-161: Consider structured logging for consistency.

Using f-string interpolation for logging is inconsistent with the structured logging pattern (extra={}) used elsewhere in the codebase.

♻️ Suggested change
-        app_logger.info(f"Subscribing to topics: {topics}")
+        app_logger.info("Subscribing to topics", extra={"topics": topics})
backend/app/events/schema/schema_registry.py (1)

84-102: Minor redundancy: AvroSchema created twice.

_ensure_schema_registered already creates an AvroSchema object (line 61), but serialize_event creates it again (line 92). Consider caching the AvroSchema object alongside the schema ID to avoid redundant creation.
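
A sketch of such a cache, keyed by event class (the cache attribute, tuple layout, and AvroSchema constructor call are assumptions of this sketch):

        # self._schema_cache: dict[type, tuple[int, AvroSchema]]
        cached = self._schema_cache.get(type(event))
        if cached is None:
            parsed = AvroSchema(type(event).avro_schema())  # build the AvroSchema once
            schema_id = await self._ensure_schema_registered(event)  # existing helper
            cached = self._schema_cache[type(event)] = (schema_id, parsed)
        schema_id, parsed = cached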

backend/app/services/result_processor/processor_logic.py (2)

54-66: Consider explicit type check instead of assert for production safety.

Using assert isinstance(event, ...) for type narrowing works, but assertions are stripped when Python runs with -O (optimized mode). If this could happen in production, the type check would be silently skipped.

♻️ Safer alternative
     async def _handle_completed_wrapper(self, event: DomainEvent) -> None:
-        assert isinstance(event, ExecutionCompletedEvent)
+        if not isinstance(event, ExecutionCompletedEvent):
+            raise TypeError(f"Expected ExecutionCompletedEvent, got {type(event).__name__}")
         await self._handle_completed(event)

Apply similarly to _handle_failed_wrapper and _handle_timeout_wrapper.


177-193: Hardcoded service_version="1.0.0" in event metadata.

Consider reading the version from settings or a shared constant to avoid manual updates when the service version changes.
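
One hedged option is to resolve the version once from package metadata (the distribution name below is a placeholder; a pydantic-settings field would work just as well):

from importlib.metadata import version

SERVICE_VERSION = version("integr8scode")  # placeholder distribution name

# then, when building event metadata:
#   service_version=SERVICE_VERSION  # instead of the literal "1.0.0"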

Comment on lines +74 to +83
@pytest_asyncio.fixture
async def sse_redis_bus(redis_client: redis.Redis) -> SSERedisBus:
"""Provide SSERedisBus with unique prefixes for test isolation."""
suffix = uuid.uuid4().hex[:6]
return SSERedisBus(
redis_client,
exec_prefix=f"sse:exec:{suffix}:",
notif_prefix=f"sse:notif:{suffix}:",
logger=_test_logger,
)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Scripts executed to verify:

#!/bin/bash
# Find SSERedisBus class definition
fd -t f -e py . | xargs rg -l "class SSERedisBus"

# Read the SSERedisBus class definition to check __init__ signature
cat -n backend/app/services/sse/redis_bus.py | head -80


Fix argument order in SSERedisBus initialization.

The SSERedisBus.__init__ constructor requires logger as the second positional parameter (not optional). The fixture passes redis_client positionally but omits logger from the positional arguments, causing a runtime TypeError. Either pass logger as the second positional argument or use all keyword arguments:

SSERedisBus(
    redis_client,
    _test_logger,  # Pass as 2nd positional arg
    exec_prefix=f"sse:exec:{suffix}:",
    notif_prefix=f"sse:notif:{suffix}:",
)
🤖 Prompt for AI Agents
In `@backend/tests/integration/conftest.py` around lines 74 - 83, The
sse_redis_bus fixture calls SSERedisBus with redis_client as the only positional
argument but SSERedisBus.__init__ expects logger as the second positional
parameter; fix by passing the test logger as the second positional argument
(e.g., SSERedisBus(redis_client, _test_logger, exec_prefix=...,
notif_prefix=...)) or convert to all keywords including logger=_test_logger to
ensure the logger is provided correctly.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/events/core/producer.py (1)

44-77: Potential UnboundLocalError if serialization fails before topic assignment.

If serialize_event raises an exception at line 49, topic is never assigned, but line 75 references it in the error handler.

Proposed fix
     async def produce(
         self, event_to_produce: DomainEvent, key: str | None = None, headers: dict[str, str] | None = None
     ) -> None:
         """Produce a message to Kafka."""
+        topic = f"{self._topic_prefix}{EVENT_TYPE_TO_TOPIC[event_to_produce.event_type]}"
         try:
             serialized_value = await self._schema_registry.serialize_event(event_to_produce)
-            topic = f"{self._topic_prefix}{EVENT_TYPE_TO_TOPIC[event_to_produce.event_type]}"
 
             # Convert headers to list of tuples format
             header_list = [(k, v.encode()) for k, v in headers.items()] if headers else None
🤖 Fix all issues with AI agents
In `@backend/app/events/event_store_consumer.py`:
- Around line 41-69: The process_batch method currently logs deserialization
errors from schema_registry_manager.deserialize_event and then commits offsets
via consumer.commit, which drops bad messages; change process_batch to collect
failures (e.g., a failed_messages list or counter) while iterating messages, for
each failed msg call a DLQ sender or increment event_metrics (e.g.,
event_metrics.record_deserialization_failure(topic=..., consumer_group=...)) and
only commit offsets for partitions/messages that succeeded (or skip committing
if any partition has failures), ensuring _store_batch is called only with
successfully deserialized events and consumer.commit is adjusted to avoid losing
failed messages; update logger.error to include a unique identifier for the
failed message to aid tracing.

In `@backend/app/services/k8s_worker/worker_logic.py`:
- Around line 70-163: The duplicate-creation bug occurs because
handle_create_pod_command checks _active_creations but only schedules the task,
and the ID is added later inside _create_pod_for_execution, so two concurrent
calls can both schedule tasks; fix by reserving the execution_id immediately in
handle_create_pod_command (add execution_id to self._active_creations and call
self.metrics.update_k8s_active_creations) before asyncio.create_task, and remove
the duplicate add inside _create_pod_for_execution (leave the finally block that
discards the ID and updates metrics). Also guard the scheduling with a
try/except so if task creation fails you remove the reserved ID and update
metrics.

In `@backend/app/services/pod_monitor/monitor.py`:
- Around line 160-182: The synchronous kubernetes iterator returned by
self._watch.stream in _run_watch is blocking the event loop; replace the direct
call with an awaited asyncio.to_thread call (await
asyncio.to_thread(self._watch.stream, self._v1.list_namespaced_pod, **kwargs))
so the blocking stream creation runs off the event loop, then keep iterating and
awaiting self._process_raw_event(event) as before and call
_update_resource_version(stream) in the finally block; target the _run_watch
method and the usages of self._watch.stream, _process_raw_event, and
_update_resource_version when making this change.

In `@backend/tests/unit/services/saga/test_execution_saga_steps.py`:
- Line 204: The assertion in test_execution_saga_steps.py is a no-op because `or
True` makes it always pass; update the test to either remove the `or True` or
replace the assertion with a real type check such as asserting that
QueueExecutionStep().get_compensation() returns an instance/subclass of the
SagaCompensation base (e.g., use isinstance(result, SagaCompensation) or assert
issubclass(type(result), SagaCompensation) and reference
QueueExecutionStep.get_compensation and DeletePodCompensation to ensure the
intended compensation type is validated).
♻️ Duplicate comments (9)
backend/app/services/sse/event_router.py (1)

45-46: Inefficient use of model_dump() to extract a single attribute.

Since SSE-relevant events have execution_id as a direct attribute, use getattr(event, 'execution_id', None) to avoid serializing the entire model. Additionally, SSERedisBus.publish_event calls model_dump() internally, resulting in double serialization.

♻️ Proposed fix
     async def route_event(self, event: DomainEvent) -> None:
         """Route an event to Redis for SSE delivery."""
-        data = event.model_dump()
-        execution_id = data.get("execution_id")
+        execution_id = getattr(event, "execution_id", None)
backend/app/services/pod_monitor/monitor.py (2)

276-295: RuntimeError on max reconnect attempts terminates monitor without recovery path.

As noted in previous reviews, raising RuntimeError here terminates _watch_loop and the run() method without any recovery mechanism. The monitor silently stops functioning.

Consider either:

  1. Set a paused/failed state flag and exit gracefully
  2. Add a supervisor/restart mechanism
  3. Propagate via a more specific exception that callers can handle

109-129: CancelledError should be re-raised to maintain async cancellation contract.

As noted in previous reviews, catching CancelledError without re-raising breaks Python's async cancellation semantics. The task will appear to complete normally instead of being cancelled.

Fix: Re-raise CancelledError
         try:
             await self._watch_loop()
         except asyncio.CancelledError:
             self.logger.info("PodMonitor cancelled")
+            raise
         finally:
backend/app/services/saga/saga_logic.py (1)

308-318: Keep the timeout checker alive after transient errors.

check_timeouts_loop only catches CancelledError, so any repository error will terminate the loop permanently. Wrap each iteration in a catch-all and log.

💡 Proposed fix
 async def check_timeouts_loop(self) -> None:
     """Check for saga timeouts (runs until cancelled)."""
     try:
         while True:
             # Check every 30 seconds
             await asyncio.sleep(30)
-            await self.check_timeouts_once()
+            try:
+                await self.check_timeouts_once()
+            except Exception as e:
+                self.logger.error(f"Timeout checker error: {e}", exc_info=True)
 
     except asyncio.CancelledError:
         self.logger.info("Timeout checker cancelled")
backend/tests/unit/services/saga/test_saga_orchestrator_unit.py (1)

67-69: Strengthen the success-flow assertion.

len(logic._sagas) > 0 is already true after register_saga, so it doesn’t validate handle_event. Assert on a side effect, e.g., saga persisted in the repo.

💡 Proposed fix
-    assert len(logic._sagas) > 0  # noqa: SLF001
+    found = await saga_repo.get_saga_by_execution_and_name("e", "s")
+    assert found is not None
backend/app/core/providers.py (1)

567-576: SSEConnectionRegistry should be app-scoped.

With Scope.REQUEST, each request gets a fresh registry and _active_connections resets, so connection counts and cross-request visibility break. Use Scope.APP to keep a singleton registry.

💡 Proposed fix
-    @provide(scope=Scope.REQUEST)
+    @provide(scope=Scope.APP)
     def get_sse_connection_registry(
             self,
             logger: logging.Logger,
             connection_metrics: ConnectionMetrics,
     ) -> SSEConnectionRegistry:
backend/workers/run_k8s_worker.py (1)

147-150: Silent catch-all handler discards unmatched events without logging.

This catch-all handler silently discards any events that don't match the specific filters. While the comment says "prevents message loss," it actually causes silent data loss. Consider logging unmatched events at warning level to aid debugging and detect routing issues or missing handlers for new event types.

💡 Suggested fix
         # Default handler for unmatched events (prevents message loss)
        @subscriber()
-        async def handle_other(event: DomainEvent) -> None:
-            pass
+        async def handle_other(event: DomainEvent) -> None:
+            app_logger.warning(
+                "Unhandled event type received",
+                extra={"event_type": getattr(event, "event_type", "unknown")},
+            )
backend/workers/run_pod_monitor.py (1)

84-92: Exceptions from completed tasks are silently swallowed.

When monitor.run() fails with an exception, the task will be in the done set, but the exception is never retrieved or logged. This masks critical failures and makes debugging production issues difficult.

🐛 Proposed fix
         done, pending = await asyncio.wait(
             [monitor_task, shutdown_task],
             return_when=asyncio.FIRST_COMPLETED,
         )

+        # Surface failures from completed tasks
+        for task in done:
+            if task is not shutdown_task and not task.cancelled():
+                if exc := task.exception():
+                    logger.error("PodMonitor task failed", exc_info=exc)
+                    raise exc
+
         for task in pending:
             task.cancel()
             with suppress(asyncio.CancelledError):
                 await task
backend/workers/run_result_processor.py (1)

162-165: Silent catch-all handler discards unmatched events without logging.

Same issue as in run_k8s_worker.py. The catch-all handler silently discards events that don't match specific filters. Add logging to aid debugging.

💡 Suggested fix
         # Default handler for unmatched events (prevents message loss)
        @subscriber()
-        async def handle_other(event: DomainEvent) -> None:
-            pass
+        async def handle_other(event: DomainEvent) -> None:
+            app_logger.warning(
+                "Unhandled event type received",
+                extra={"event_type": getattr(event, "event_type", "unknown")},
+            )
🧹 Nitpick comments (10)
backend/workers/run_saga_orchestrator.py (1)

161-165: Opportunistic timeout check may miss timeouts during idle periods.

If no events arrive, check_timeouts_once() won't be called. For sagas with strict timeout requirements, consider adding a dedicated background task or using FastStream's scheduling capabilities.

💡 Alternative: Add a background timeout checker
# In lifespan, after handler registration:
async def timeout_loop() -> None:
    while True:
        await asyncio.sleep(30)
        await logic.check_timeouts_once()

timeout_task = asyncio.create_task(timeout_loop())
# Cancel in finally block

However, if event throughput is consistently high, the opportunistic approach may be sufficient.

backend/tests/helpers/fakes/kubernetes.py (1)

37-41: FakeK8sAppsV1Api has no methods.

If tests need to interact with AppsV1Api methods (e.g., deployments, statefulsets), you'll need to add stub methods. Currently, it only satisfies basic DI wiring.
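
If that changes, a stub in the same spirit as the other fakes could record calls for assertions; create_namespaced_daemon_set below matches the real kubernetes client's signature, while the recording attribute is an invention of this sketch:

class FakeK8sAppsV1Api:
    def __init__(self) -> None:
        self.created_daemon_sets: list[tuple[str, Any]] = []

    def create_namespaced_daemon_set(self, namespace: str, body: Any) -> Any:
        # Record the call so tests can assert on it, then echo the body back.
        self.created_daemon_sets.append((namespace, body))
        return body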

backend/tests/helpers/fakes/kafka.py (1)

38-78: Consider guarding __anext__ against consuming when not started.

The async iterator doesn't check _started state before yielding messages. While acceptable for test fakes, this could mask bugs where tests consume from an unstarted consumer.

Optional: Add started guard
     async def __anext__(self) -> Any:
+        if not self._started:
+            raise StopAsyncIteration
         if self._messages:
             return self._messages.pop(0)
         raise StopAsyncIteration
backend/tests/helpers/fakes/schema_registry.py (2)

17-22: Type introspection is fragile - consider adding defensive handling.

The _get_all_event_classes function assumes DomainEvent is Annotated[Union[...], Discriminator]. If the type structure changes, this will silently return incorrect results.

Add defensive check
 @lru_cache(maxsize=1)
 def _get_all_event_classes() -> list[type[DomainEvent]]:
     """Get all concrete event classes from DomainEvent union."""
-    union_type = get_args(DomainEvent)[0]  # Annotated[Union[...], Discriminator] -> Union
-    return list(get_args(union_type)) if get_origin(union_type) else [union_type]
+    args = get_args(DomainEvent)
+    if not args:
+        raise TypeError("DomainEvent must be a generic type with arguments")
+    union_type = args[0]  # Annotated[Union[...], Discriminator] -> Union
+    inner_args = get_args(union_type)
+    if not inner_args:
+        return [union_type] if isinstance(union_type, type) else []
+    return list(inner_args)

99-106: Consider caching the event class mapping.

The mapping is rebuilt on every deserialize_json call. Since _get_all_event_classes() is already cached, the mapping could be cached too.

Cache the mapping
+    @lru_cache(maxsize=1)
+    def _get_event_class_mapping(self) -> dict[EventType, type[DomainEvent]]:
+        """Get cached mapping from event type to class."""
+        return {cls.model_fields["event_type"].default: cls for cls in _get_all_event_classes()}
+
     def deserialize_json(self, data: dict[str, Any]) -> DomainEvent:
         """Deserialize JSON data to DomainEvent using event_type field."""
         if not (event_type_str := data.get("event_type")):
             raise ValueError("Missing event_type in event data")
-        mapping = {cls.model_fields["event_type"].default: cls for cls in _get_all_event_classes()}
+        mapping = self._get_event_class_mapping()
         if not (event_class := mapping.get(EventType(event_type_str))):
             raise ValueError(f"No event class found for event type: {event_type_str}")
         return event_class.model_validate(data)

Note: Since self is involved, you'd need functools.cached_property or a class-level cache instead of lru_cache.

backend/workers/run_result_processor.py (1)

140-160: Consider exposing public handler methods instead of calling private methods.

The handlers directly call underscore-prefixed methods (logic._handle_completed, logic._handle_failed, logic._handle_timeout), which are conventionally private. While this works, it couples the handlers to internal implementation details.

Consider adding public wrapper methods in ProcessorLogic (e.g., handle_completed, handle_failed, handle_timeout) that delegate to the private implementations, or rename the existing methods to be public if they're intended to be called externally.

backend/tests/e2e/test_k8s_worker_create_pod.py (1)

85-92: Consider wrapping cleanup in try/except to handle partial creation failures.

If the test fails between ConfigMap creation and Pod creation, the cleanup at lines 91-92 will attempt to delete a non-existent pod, potentially raising an exception that masks the original failure.

💡 Suggested improvement
     # Cleanup
-    logic.v1.delete_namespaced_pod(name=f"executor-{exec_id}", namespace=ns)
-    logic.v1.delete_namespaced_config_map(name=f"script-{exec_id}", namespace=ns)
+    try:
+        logic.v1.delete_namespaced_pod(name=f"executor-{exec_id}", namespace=ns)
+    except ApiException:
+        pass  # Pod may not exist if test failed early
+    try:
+        logic.v1.delete_namespaced_config_map(name=f"script-{exec_id}", namespace=ns)
+    except ApiException:
+        pass  # ConfigMap may not exist if test failed early

Alternatively, consider using a pytest fixture with cleanup as a finalizer.

backend/app/services/sse/sse_service.py (1)

236-254: Consider dynamic health status instead of hardcoded values.

The health status returns hardcoded status="healthy" and kafka_enabled=True, but:

  1. The service doesn't validate its actual health (e.g., Redis connectivity)
  2. kafka_enabled=True may be misleading since Kafka is now handled by the separate SSE bridge worker

If this is intentional for the refactoring phase, consider adding a TODO or updating the field semantics in SSEHealthDomain.
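
If dynamic health is preferred instead, a hedged sketch (the _redis attribute is illustrative; redis.asyncio clients do expose ping()):

    async def get_health_status(self) -> SSEHealthDomain:
        try:
            await self._redis.ping()
            status = "healthy"
        except Exception:
            status = "degraded"
        # Field set shown is partial; populate the remaining SSEHealthDomain fields as today.
        return SSEHealthDomain(status=status)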

backend/workers/run_sse_bridge.py (1)

92-129: Lifespan structure is correct but consider error handling for schema initialization.

The lifespan correctly:

  • Initializes schema registry before defining the subscriber
  • Cleans up the container in the finally block

However, if initialize_event_schemas fails, the exception will propagate and prevent startup, but no specific error logging occurs. Consider wrapping with explicit error handling for better diagnostics.

💡 Optional: Add explicit error handling
         # Initialize schema registry
         schema_registry = await container.get(SchemaRegistryManager)
-        await initialize_event_schemas(schema_registry)
+        try:
+            await initialize_event_schemas(schema_registry)
+        except Exception as e:
+            app_logger.error(f"Failed to initialize schemas: {e}")
+            raise
backend/app/services/event_bus.py (1)

111-123: Consider propagating publish errors to callers.

The publish method catches all exceptions and logs them but doesn't raise, which means callers won't know if their event failed to publish. For cross-instance cache invalidation, this might be acceptable, but for critical events, callers may need to know about failures.

If silent failures are intentional for this use case, consider documenting this behavior.
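
If callers do need the signal, a sketch that keeps today's default while making the contract explicit (the flag and producer attribute names are illustrative):

    async def publish(self, event: DomainEvent, *, raise_on_error: bool = False) -> None:
        try:
            await self._producer.produce(event)
        except Exception:
            self.logger.error("EventBus publish failed", exc_info=True)
            if raise_on_error:
                raise  # critical callers opt in to observing the failure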

Comment on lines +70 to +163
async def handle_create_pod_command(self, command: CreatePodCommandEvent) -> None:
"""Handle create pod command from saga orchestrator."""
execution_id = command.execution_id

# Check if already processing
if execution_id in self._active_creations:
self.logger.warning(f"Already creating pod for execution {execution_id}")
return

# Create pod asynchronously
asyncio.create_task(self._create_pod_for_execution(command))

async def handle_delete_pod_command(self, command: DeletePodCommandEvent) -> None:
"""Handle delete pod command from saga orchestrator (compensation)."""
execution_id = command.execution_id
self.logger.info(f"Deleting pod for execution {execution_id} due to: {command.reason}")

try:
# Delete the pod
pod_name = f"executor-{execution_id}"
await asyncio.to_thread(
self.v1.delete_namespaced_pod,
name=pod_name,
namespace=self.config.namespace,
grace_period_seconds=30,
)
self.logger.info(f"Successfully deleted pod {pod_name}")

# Delete associated ConfigMap
configmap_name = f"script-{execution_id}"
await asyncio.to_thread(
self.v1.delete_namespaced_config_map, name=configmap_name, namespace=self.config.namespace
)
self.logger.info(f"Successfully deleted ConfigMap {configmap_name}")

# NetworkPolicy cleanup is managed via a static cluster policy; no per-execution NP deletion

except ApiException as e:
if e.status == 404:
self.logger.warning(
f"Resources for execution {execution_id} not found (may have already been deleted)"
)
else:
self.logger.error(f"Failed to delete resources for execution {execution_id}: {e}")

async def _create_pod_for_execution(self, command: CreatePodCommandEvent) -> None:
"""Create pod for execution."""
async with self._creation_semaphore:
execution_id = command.execution_id
self._active_creations.add(execution_id)
self.metrics.update_k8s_active_creations(len(self._active_creations))

start_time = time.time()

try:
# We now have the CreatePodCommandEvent directly from saga
script_content = command.script
entrypoint_content = await self._get_entrypoint_script()

# Create ConfigMap
config_map = self.pod_builder.build_config_map(
command=command, script_content=script_content, entrypoint_content=entrypoint_content
)

await self._create_config_map(config_map)

pod = self.pod_builder.build_pod_manifest(command=command)
await self._create_pod(pod)

# Publish PodCreated event
await self._publish_pod_created(command, pod)

# Update metrics
duration = time.time() - start_time
self.metrics.record_k8s_pod_creation_duration(duration, command.language)
self.metrics.record_k8s_pod_created("success", command.language)

self.logger.info(
f"Successfully created pod {pod.metadata.name} for execution {execution_id}. "
f"Duration: {duration:.2f}s"
)

except Exception as e:
self.logger.error(f"Failed to create pod for execution {execution_id}: {e}", exc_info=True)

# Update metrics
self.metrics.record_k8s_pod_created("failed", "unknown")

# Publish failure event
await self._publish_pod_creation_failed(command, str(e))

finally:
self._active_creations.discard(execution_id)
self.metrics.update_k8s_active_creations(len(self._active_creations))

⚠️ Potential issue | 🟠 Major

Prevent duplicate pod creation on concurrent commands.

handle_create_pod_command checks _active_creations before the task adds the ID, so two concurrent calls for the same execution can both enqueue tasks and emit duplicate events. Reserve the ID before scheduling the task and avoid double-add inside _create_pod_for_execution.

💡 Proposed fix
 async def handle_create_pod_command(self, command: CreatePodCommandEvent) -> None:
     """Handle create pod command from saga orchestrator."""
     execution_id = command.execution_id

     # Check if already processing
     if execution_id in self._active_creations:
         self.logger.warning(f"Already creating pod for execution {execution_id}")
         return

+    # Reserve execution_id before scheduling to prevent dupes
+    self._active_creations.add(execution_id)
+    self.metrics.update_k8s_active_creations(len(self._active_creations))
+
     # Create pod asynchronously
     asyncio.create_task(self._create_pod_for_execution(command))

 async def _create_pod_for_execution(self, command: CreatePodCommandEvent) -> None:
     """Create pod for execution."""
     async with self._creation_semaphore:
         execution_id = command.execution_id
-        self._active_creations.add(execution_id)
-        self.metrics.update_k8s_active_creations(len(self._active_creations))
🤖 Prompt for AI Agents
In `@backend/app/services/k8s_worker/worker_logic.py` around lines 70 - 163, The
duplicate-creation bug occurs because handle_create_pod_command checks
_active_creations but only schedules the task, and the ID is added later inside
_create_pod_for_execution, so two concurrent calls can both schedule tasks; fix
by reserving the execution_id immediately in handle_create_pod_command (add
execution_id to self._active_creations and call
self.metrics.update_k8s_active_creations) before asyncio.create_task, and remove
the duplicate add inside _create_pod_for_execution (leave the finally block that
discards the ID and updates metrics). Also guard the scheduling with a
try/except so if task creation fails you remove the reserved ID and update
metrics.

Comment on lines +160 to 182
 async def _run_watch(self) -> None:
     """Run a single watch session."""
     self.logger.info(
         f"Starting pod watch: selector={self.config.label_selector}, namespace={self.config.namespace}"
     )

     # Create watch stream
-    kwargs = {
-        "namespace": context.namespace,
-        "label_selector": context.label_selector,
-        "timeout_seconds": context.timeout_seconds,
+    kwargs: dict[str, Any] = {
+        "namespace": self.config.namespace,
+        "label_selector": self.config.label_selector,
+        "timeout_seconds": self.config.watch_timeout_seconds,
     }
+    if self.config.field_selector:
+        kwargs["field_selector"] = self.config.field_selector
+    if self._last_resource_version:
+        kwargs["resource_version"] = self._last_resource_version

-    if context.field_selector:
-        kwargs["field_selector"] = context.field_selector
-
-    if context.resource_version:
-        kwargs["resource_version"] = context.resource_version
-
     # Watch stream (clients guaranteed by __init__)
     stream = self._watch.stream(self._v1.list_namespaced_pod, **kwargs)

     try:
         for event in stream:
             if self._state != MonitorState.RUNNING:
                 break

             await self._process_raw_event(event)

     finally:
         # Store resource version for next watch
         self._update_resource_version(stream)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed to verify:

# First, let's check the kubernetes watch.stream implementation and usage patterns
rg -n "watch.stream" --type=py -B 2 -A 5

# Check if there's async handling elsewhere
rg -n "to_thread.*stream" --type=py

# Also check how self._watch is initialized
rg -n "self\._watch.*=" --type=py -B 1 -A 1

# Check the full monitor.py file to understand the context
wc -l backend/app/services/pod_monitor/monitor.py

# Get the imports in monitor.py to understand what libraries are being used
head -50 backend/app/services/pod_monitor/monitor.py

# Check for asyncio.to_thread or executor usage around this code
rg -n "to_thread|run_in_executor|ThreadPoolExecutor" --type=py backend/app/services/pod_monitor/

# Get the full _run_watch method and surrounding context
sed -n '160,185p' backend/app/services/pod_monitor/monitor.py

# Check for any middleware or context managers that might handle this
rg -n "contextmanager|generator|async for" --type=py backend/app/services/pod_monitor/monitor.py

# Check what _process_raw_event does
rg -n "def _process_raw_event" --type=py -A 15 backend/app/services/pod_monitor/monitor.py

# Look at the broader _run_watch context including the try-except
sed -n '155,195p' backend/app/services/pod_monitor/monitor.py

# Check if there are any known async watch patterns in tests or elsewhere
rg -n "watch|stream" --type=py backend/tests/ | head -30

# Check the requirements to see which version of kubernetes library is used
cat backend/requirements.txt | grep -i kubernetes

# Look for any existing issues or comments about watch blocking
rg -n "block|thread|concurrent" --type=py backend/app/services/pod_monitor/monitor.py

# Check if there's any configuration or documentation about this behavior
find backend -name "*.md" -o -name "*.rst" -o -name "*.txt" | xargs grep -l "watch\|async" 2>/dev/null | head -5

# Search for requirements files in various locations
find backend -name "*requirements*" -o -name "pyproject.toml" -o -name "poetry.lock" | head -10

# Check the FakeWatch implementation to understand the expected pattern
sed -n '150,180p' backend/tests/helpers/k8s_fakes.py

# Check if there are any GitHub issues or comments in the codebase about async watch
rg -n "TODO|FIXME|XXX|HACK|BUG" --type=py backend/app/services/pod_monitor/monitor.py

# Check pyproject.toml for kubernetes dependency
cat backend/pyproject.toml | grep -A 10 -B 2 "kubernetes\|dependencies"


Wrap the watch stream iteration in asyncio.to_thread() to avoid blocking the event loop.

The self._watch.stream() call returns a synchronous iterator from the kubernetes library. Using for event in stream: blocks the event loop while waiting for events, preventing other async tasks from running. The codebase already uses the asyncio.to_thread() pattern elsewhere (lines 117, 305) for blocking I/O. Apply the same pattern here:

stream = await asyncio.to_thread(
    self._watch.stream, self._v1.list_namespaced_pod, **kwargs
)

try:
    for event in stream:
        await self._process_raw_event(event)
🤖 Prompt for AI Agents
In `@backend/app/services/pod_monitor/monitor.py` around lines 160 - 182, The
synchronous kubernetes iterator returned by self._watch.stream in _run_watch is
blocking the event loop; replace the direct call with an awaited
asyncio.to_thread call (await asyncio.to_thread(self._watch.stream,
self._v1.list_namespaced_pod, **kwargs)) so the blocking stream creation runs
off the event loop, then keep iterating and awaiting
self._process_raw_event(event) as before and call
_update_resource_version(stream) in the finally block; target the _run_watch
method and the usages of self._watch.stream, _process_raw_event, and
_update_resource_version when making this change.

 assert ValidateExecutionStep().get_compensation() is None
-assert isinstance(AllocateResourcesStep(_FakeAllocRepo()).get_compensation(), ReleaseResourcesCompensation)
+assert isinstance(AllocateResourcesStep(alloc_repo).get_compensation(), ReleaseResourcesCompensation)
 assert isinstance(QueueExecutionStep().get_compensation(), type(DeletePodCompensation(None)).__bases__[0]) or True

⚠️ Potential issue | 🟡 Minor

Assertion is a no-op (or True makes it always pass).

The expression isinstance(...) or True will always evaluate to True, making this assertion meaningless. Either fix the assertion or remove it if the check isn't needed.

🐛 Proposed fix
-    assert isinstance(QueueExecutionStep().get_compensation(), type(DeletePodCompensation(None)).__bases__[0]) or True
+    # QueueExecutionStep has no compensation
+    assert QueueExecutionStep().get_compensation() is None

Or if the intent was to check that it returns a SagaCompensation subclass:

-    assert isinstance(QueueExecutionStep().get_compensation(), type(DeletePodCompensation(None)).__bases__[0]) or True
+    from app.services.saga.saga_step import SagaCompensation
+    comp = QueueExecutionStep().get_compensation()
+    assert comp is None or isinstance(comp, SagaCompensation)
🤖 Prompt for AI Agents
In `@backend/tests/unit/services/saga/test_execution_saga_steps.py` at line 204,
The assertion in test_execution_saga_steps.py is a no-op because `or True` makes
it always pass; update the test to either remove the `or True` or replace the
assertion with a real type check such as asserting that
QueueExecutionStep().get_compensation() returns an instance/subclass of the
SagaCompensation base (e.g., use isinstance(result, SagaCompensation) or assert
issubclass(type(result), SagaCompensation) and reference
QueueExecutionStep.get_compensation and DeletePodCompensation to ensure the
intended compensation type is validated).

@cubic-dev-ai bot left a comment

1 issue found across 47 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/tests/unit/services/saga/test_execution_saga_steps.py">

<violation number="1" location="backend/tests/unit/services/saga/test_execution_saga_steps.py:60">
P2: This test inserts 100 active allocation documents into the session-scoped unit DB but never removes them, so data persists across tests and can cause order-dependent failures. Clean up the inserted allocations (e.g., delete them in a finally block) to keep unit tests isolated.</violation>
</file>


assert ctx.get("allocation_id") is not None

# Test 2: Limit exceeded (insert 100 active allocations)
for i in range(100):
@cubic-dev-ai bot Jan 21, 2026

P2: This test inserts 100 active allocation documents into the session-scoped unit DB but never removes them, so data persists across tests and can cause order-dependent failures. Clean up the inserted allocations (e.g., delete them in a finally block) to keep unit tests isolated.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/tests/unit/services/saga/test_execution_saga_steps.py, line 60:

<comment>This test inserts 100 active allocation documents into the session-scoped unit DB but never removes them, so data persists across tests and can cause order-dependent failures. Clean up the inserted allocations (e.g., delete them in a finally block) to keep unit tests isolated.</comment>

<file context>
@@ -42,50 +44,41 @@ async def test_validate_execution_step_success_and_failures() -> None:
+    assert ctx.get("allocation_id") is not None
+
+    # Test 2: Limit exceeded (insert 100 active allocations)
+    for i in range(100):
+        doc = ResourceAllocationDocument(
+            allocation_id=f"limit-test-alloc-{i}",
</file context>
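
A minimal sketch of the suggested cleanup, assuming Beanie's FindMany.delete() and the allocation_id prefix used above; the document's other constructor fields are elided:

    try:
        # ... insert the 100 "limit-test-alloc-*" allocations and run the
        # active-allocation-limit assertions as in the test above ...
        ...
    finally:
        # Remove everything this test inserted, even if an assertion failed
        await ResourceAllocationDocument.find(
            {"allocation_id": {"$regex": "^limit-test-alloc-"}}
        ).delete()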

@cubic-dev-ai bot left a comment

3 issues found across 49 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/pyproject.toml">

<violation number="1" location="backend/pyproject.toml:140">
P2: New dev dependencies are added with >= ranges while the rest of the group is pinned to exact versions, which makes dev/test environments non-reproducible. Pin these to exact versions to align with the existing dependency policy.</violation>
</file>

<file name="backend/tests/unit/services/saga/test_execution_saga_steps.py">

<violation number="1" location="backend/tests/unit/services/saga/test_execution_saga_steps.py:71">
P2: This test inserts 100 active ResourceAllocationDocument entries but never cleans them up. Since unit_container is session-scoped and there’s no per-test DB cleanup, these documents can leak into other unit tests and make them flaky (e.g., causing active-allocation limits to trip). Add cleanup to delete the inserted documents or use a fixture that isolates the collection for this test.</violation>
</file>

<file name="backend/app/core/providers.py">

<violation number="1" location="backend/app/core/providers.py:192">
P2: MongoDB client is never closed after yielding the database, which can leak connections on shutdown. Wrap the yield in a try/finally and close the client in the finally block.</violation>
</file>


[dependency-groups]
dev = [
    "coverage==7.13.0",
    "fakeredis>=2.33.0",
@cubic-dev-ai bot Jan 21, 2026

P2: New dev dependencies are added with >= ranges while the rest of the group is pinned to exact versions, which makes dev/test environments non-reproducible. Pin these to exact versions to align with the existing dependency policy.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/pyproject.toml, line 140:

<comment>New dev dependencies are added with >= ranges while the rest of the group is pinned to exact versions, which makes dev/test environments non-reproducible. Pin these to exact versions to align with the existing dependency policy.</comment>

<file context>
@@ -137,9 +137,11 @@ packages = ["app", "workers"]
 [dependency-groups]
 dev = [
     "coverage==7.13.0",
+    "fakeredis>=2.33.0",
     "hypothesis==6.103.4",
     "iniconfig==2.0.0",
</file context>
Suggested change
-    "fakeredis>=2.33.0",
+    "fakeredis==2.33.0",

memory_limit="512Mi",
status="active",
)
await doc.insert()
@cubic-dev-ai bot Jan 21, 2026

P2: This test inserts 100 active ResourceAllocationDocument entries but never cleans them up. Since unit_container is session-scoped and there’s no per-test DB cleanup, these documents can leak into other unit tests and make them flaky (e.g., causing active-allocation limits to trip). Add cleanup to delete the inserted documents or use a fixture that isolates the collection for this test.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/tests/unit/services/saga/test_execution_saga_steps.py, line 71:

<comment>This test inserts 100 active ResourceAllocationDocument entries but never cleans them up. Since unit_container is session-scoped and there’s no per-test DB cleanup, these documents can leak into other unit tests and make them flaky (e.g., causing active-allocation limits to trip). Add cleanup to delete the inserted documents or use a fixture that isolates the collection for this test.</comment>

<file context>
@@ -42,50 +44,41 @@ async def test_validate_execution_step_success_and_failures() -> None:
+            memory_limit="512Mi",
+            status="active",
+        )
+        await doc.insert()
 
-    # Limit exceeded
</file context>

@coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@backend/app/core/dishka_lifespan.py`:
- Around line 7-11: The AsyncIterable providers are not closing their resources
on shutdown: update get_database to close the AsyncMongoClient after the yield
(await client.close()), update get_dlq_manager to call await consumer.stop()
(and any DLQ manager cleanup/close) after the yield instead of only cancelling
tasks, and add proper cleanup for get_schema_registry (e.g., close or shutdown
the schema registry client/resource after yield); follow the same pattern used
in get_event_bus and get_event_store_consumer where consumers are stopped with
await consumer.stop() after yielding.

In `@backend/workers/run_sse_bridge.py`:
- Around line 101-102: The decode_avro function currently calls
schema_registry.deserialize_event(body, "sse_bridge") with a hardcoded topic
string; update decode_avro to either pass the real subscribed topic from the
message context (e.g., use the topic variable available where messages are
received) into schema_registry.deserialize_event, or remove the topic argument
entirely if deserialize_event truly does not use it. Locate decode_avro and the
call sites where messages and topic are available, then replace the literal
"sse_bridge" with the dynamic topic variable or adjust deserialize_event usage
accordingly to avoid the misleading hardcoded value.
♻️ Duplicate comments (4)
backend/workers/run_result_processor.py (1)

156-159: Catch-all handler silently discards unmatched events.

This handler prevents message loss but provides no visibility into unexpected events. Consider logging at debug/warning level to aid troubleshooting.

Suggested fix
        # Default handler for unmatched events (prevents message loss)
        @subscriber()
        async def handle_other(event: DomainEvent) -> None:
-            pass
+            app_logger.debug(
+                "Received unhandled event type",
+                extra={"event_type": getattr(event, "event_type", "unknown")},
+            )
backend/app/core/providers.py (1)

577-586: SSEConnectionRegistry scope must be Scope.APP, not Scope.REQUEST.

With Scope.REQUEST, each HTTP request gets a new registry instance with an empty _active_connections dictionary. This breaks cross-request connection tracking and causes get_connection_count() to always return 0 in health checks.

Suggested fix
-    @provide(scope=Scope.REQUEST)
+    @provide(scope=Scope.APP)
    def get_sse_connection_registry(
            self,
            logger: logging.Logger,
            connection_metrics: ConnectionMetrics,
    ) -> SSEConnectionRegistry:
backend/workers/run_pod_monitor.py (1)

75-88: Exceptions from completed tasks are still silently swallowed.

The asyncio.wait pattern returns completed tasks in done, but exceptions are never checked. If monitor.run() raises an exception, it ends up in done but the exception is silently ignored. This makes debugging production failures difficult.

Suggested fix
         done, pending = await asyncio.wait(
             [monitor_task, shutdown_task],
             return_when=asyncio.FIRST_COMPLETED,
         )

+        # Check for exceptions in completed tasks
+        for task in done:
+            if task is not shutdown_task and not task.cancelled():
+                exc = task.exception()
+                if exc:
+                    logger.error("Monitor task failed", exc_info=exc)
+                    raise exc
+
         for task in pending:
             task.cancel()
             with suppress(asyncio.CancelledError):
                 await task
backend/workers/run_k8s_worker.py (1)

146-149: Don’t silently drop unmatched events.

The no‑op catch‑all handler will ack and discard events that don’t match filters, masking new event types or header bugs. Consider logging or routing to a dead‑letter path.

🪵 Suggested minimal logging
        @subscriber()
        async def handle_other(event: DomainEvent) -> None:
-            pass
+            app_logger.warning(
+                "Unhandled event received: %s",
+                getattr(event, "event_type", type(event).__name__),
+            )
🧹 Nitpick comments (5)
backend/app/events/schema/schema_registry.py (1)

82-102: Schema object is recreated unnecessarily in serialize_event.

_ensure_schema_registered already creates an AvroSchema object internally but doesn't return it. Then serialize_event creates it again on line 92. Consider returning the schema object from _ensure_schema_registered to avoid redundant schema parsing.

Suggested optimization
-    async def _ensure_schema_registered(self, event_class: type[DomainEvent]) -> int:
+    async def _ensure_schema_registered(self, event_class: type[DomainEvent]) -> tuple[int, schema.AvroSchema]:
         """Lazily register schema and return schema ID."""
         if event_class in self._schema_id_cache:
-            return self._schema_id_cache[event_class]
+            avro_schema_dict = event_class.avro_schema_to_python()
+            return self._schema_id_cache[event_class], schema.AvroSchema(avro_schema_dict)

         subject = f"{self.subject_prefix}{event_class.__name__}-value"
         avro_schema_dict = event_class.avro_schema_to_python()
         avro_schema_obj = schema.AvroSchema(avro_schema_dict)

         schema_id: int = await self._client.register(subject, avro_schema_obj)
         self._schema_id_cache[event_class] = schema_id
         self._id_to_class_cache[schema_id] = event_class
         self.logger.debug(f"Registered schema {event_class.__name__}: ID {schema_id}")
-        return schema_id
+        return schema_id, avro_schema_obj

     async def serialize_event(self, event: DomainEvent) -> bytes:
         """Serialize event to Confluent wire format."""
         event_class = event.__class__
         subject = f"{self.subject_prefix}{event_class.__name__}-value"

-        await self._ensure_schema_registered(event_class)
-
-        avro_schema_dict = event_class.avro_schema_to_python()
-        avro_schema_obj = schema.AvroSchema(avro_schema_dict)
+        _, avro_schema_obj = await self._ensure_schema_registered(event_class)
backend/app/core/providers.py (1)

271-293: DLQManager background tasks lack graceful shutdown signaling.

The monitor_loop uses while True and relies on task cancellation. Consider adding an explicit stop flag to allow graceful completion of the current iteration before shutdown.
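
One possible shape for that flag, sketched with a hypothetical process_pending() standing in for one DLQ iteration:

import asyncio


class DLQMonitor:
    """Sketch only: cooperative shutdown for a while-True monitor loop."""

    def __init__(self) -> None:
        self._stop = asyncio.Event()

    async def monitor_loop(self) -> None:
        while not self._stop.is_set():
            await self.process_pending()  # current iteration always completes
            try:
                # Sleep between iterations, but wake immediately on stop()
                await asyncio.wait_for(self._stop.wait(), timeout=5.0)
            except asyncio.TimeoutError:
                pass

    def stop(self) -> None:
        self._stop.set()

    async def process_pending(self) -> None:
        ...  # hypothetical: drain/retry one batch of DLQ messages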

backend/workers/run_saga_orchestrator.py (3)

98-99: Closure-based state for timeout checking works but could be clearer.

Using a mutable dict for state across handler invocations is a valid pattern. Consider using a simple class or dataclass for better readability.
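
For illustration, the closure dict could become a small dataclass; the field and method names here are invented:

import time
from dataclasses import dataclass, field


@dataclass
class TimeoutCheckState:
    """Tracks when saga timeouts were last swept (sketch, names assumed)."""
    last_check: float = field(default_factory=time.monotonic)

    def due(self, interval: float = 30.0) -> bool:
        now = time.monotonic()
        if now - self.last_check >= interval:
            self.last_check = now
            return True
        return False

A single TimeoutCheckState() created at wiring time then replaces the mutable dict, and handlers call state.due() instead of reading and writing raw keys.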


161-162: Minor: f-string logging is used inconsistently.

Line 161 uses f-string formatting while other parts of the codebase use structured logging with extra. This is minor but affects log aggregation consistency.


142-159: Opportunistic timeout check may miss sagas during low traffic periods.

Timeout checking only runs when events arrive via check_timeouts_once(). During periods of no traffic, timed-out sagas won't be detected until the next event. A check_timeouts_loop() method exists in SagaLogic that runs every 30 seconds but is not currently used. Consider whether to use the available background loop instead of the opportunistic approach, especially given the 5-minute maximum timeout constraint.
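
If the background loop is the better fit, a minimal wiring sketch for the worker lifespan (SagaLogic.check_timeouts_loop is the method named above; the lifespan shape and parameter passing are assumptions):

import asyncio
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager, suppress


@asynccontextmanager
async def timeout_sweep(saga_logic: "SagaLogic") -> AsyncIterator[None]:
    # Run the 30-second timeout sweep independently of event traffic
    task = asyncio.create_task(saga_logic.check_timeouts_loop())
    try:
        yield
    finally:
        task.cancel()
        with suppress(asyncio.CancelledError):
            await task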

@coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@backend/tests/unit/conftest.py`:
- Around line 58-90: The session-scoped unit_container fixture holds a single
AsyncMongoMockClient across the whole test session causing DB state leakage; fix
by making isolation explicit: either change unit_container to function scope or
create/attach a fresh AsyncMongoMockClient per test and re-run init_beanie for
that client (references: unit_container, AsyncMongoMockClient, init_beanie), or
add a teardown that drops all collections between tests (callable in unit_scope)
and ensure tests use the existing unit_scope fixture instead of unit_container
so each test gets a clean DB state.

In `@backend/workers/run_event_replay.py`:
- Around line 34-38: The AsyncMongoClient instance (mongo_client) is created
before calling init_beanie and may leak if init_beanie raises; wrap the
init_beanie call in a try/finally so that on any exception you call await
mongo_client.close() to ensure the client is closed, then re-raise the
exception; keep the AsyncExitStack usage for normal shutdown but ensure the
finally branch only closes mongo_client when initialization fails (i.e., before
the stack takes ownership).

In `@backend/workers/run_result_processor.py`:
- Around line 132-137: The subscriber is created with auto_commit=False which
prevents offsets from being committed and combined with FastStream's default
AckPolicy.ACK_FIRST causes messages to never be acknowledged; update the
broker.subscriber(...) invocation (the call that currently passes topics,
group_id, auto_commit=False, decoder=decode_avro) to either set
ack_policy=AckPolicy.ACK to commit offsets after successful handling or remove
auto_commit=False to enable auto-commit, and then modify the no-op handle_other
handler so it explicitly acknowledges or returns success for unknown events to
ensure offsets advance for those messages as well.

In `@backend/workers/run_saga_orchestrator.py`:
- Around line 142-147: The subscriber decorator currently uses the deprecated
auto_commit parameter; replace auto_commit=False in the broker.subscriber(...)
call with an explicit ack_policy by importing AckPolicy from faststream and
using ack_policy=AckPolicy.ACK (recommended) or ack_policy=AckPolicy.MANUAL if
you want manual control; if you choose MANUAL, update the corresponding message
handler function (the consumer callback tied to broker.subscriber with topics,
group_id and decoder=decode_avro) to call message.ack() after successful
processing, and remove the auto_commit argument; apply the same change to the
identical patterns in run_result_processor.py and run_k8s_worker.py.
♻️ Duplicate comments (4)
backend/app/core/providers.py (1)

559-568: SSEConnectionRegistry should be app-scoped (singleton).

With Scope.REQUEST, the registry is recreated per request and _active_connections is always empty across requests (e.g., health checks). Please switch this provider to Scope.APP.

🔧 Suggested fix
-    @provide(scope=Scope.REQUEST)
+    @provide(scope=Scope.APP)
    def get_sse_connection_registry(
            self,
            logger: logging.Logger,
            connection_metrics: ConnectionMetrics,
    ) -> SSEConnectionRegistry:
backend/workers/run_result_processor.py (1)

161-164: Don’t silently drop unmatched events.

handle_other currently no-ops, which can hide routing/config errors and new event types. Add at least a warning log.

🔧 Suggested fix
        @subscriber()
        async def handle_other(event: DomainEvent) -> None:
-            pass
+            logger.warning("Unhandled result-processor event_type: %s", event.event_type)
backend/workers/run_k8s_worker.py (1)

142-145: Log unmatched events to aid debugging.

The catch-all handler silently discards events that don't match the CREATE_POD_COMMAND or DELETE_POD_COMMAND filters. This can mask filter bugs or cause silent data loss when new event types are introduced.

♻️ Proposed fix: add warning log for unmatched events
        # Default handler for unmatched events (prevents message loss)
        @subscriber()
        async def handle_other(event: DomainEvent) -> None:
-            pass
+            app_logger.warning(
+                "Unmatched event received",
+                extra={"event_type": getattr(event, "event_type", "unknown")},
+            )
backend/workers/run_pod_monitor.py (1)

85-93: Exceptions from completed tasks are still silently swallowed.

When monitor.run() raises an exception, that exception is stored in monitor_task but never retrieved. The process will exit through the finally block without logging or propagating the error, masking critical failures.

🐛 Proposed fix: surface exceptions from completed tasks
         done, pending = await asyncio.wait(
             [monitor_task, shutdown_task],
             return_when=asyncio.FIRST_COMPLETED,
         )

+        # Surface failures from completed tasks
+        for task in done:
+            if task != shutdown_task:
+                try:
+                    task.result()  # Raises if task failed
+                except Exception:
+                    logger.exception("Monitor task failed")
+                    raise
+
         for task in pending:
             task.cancel()
             with suppress(asyncio.CancelledError):
                 await task
🧹 Nitpick comments (4)
backend/workers/dlq_processor.py (1)

87-90: MongoDB connection parameters are hardcoded.

The connection settings (tz_aware=True, serverSelectionTimeoutMS=5000) are duplicated here and in dishka_lifespan.py. Consider extracting these to Settings or a shared factory if consistency matters across workers and the main app.
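
A shared factory would keep the two call sites aligned; this sketch assumes both already receive the project's Settings object:

from pymongo import AsyncMongoClient


def make_mongo_client(settings: "Settings") -> "AsyncMongoClient[dict[str, object]]":
    # Single source of truth for Mongo connection parameters
    return AsyncMongoClient(
        settings.MONGODB_URL,
        tz_aware=True,
        serverSelectionTimeoutMS=5000,
    )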

backend/app/core/dishka_lifespan.py (1)

17-25: Consider try/finally to ensure MongoDB client cleanup on initialization failure.

If init_beanie() raises an exception, the AsyncMongoClient won't be closed since the code won't reach line 24. While unlikely in practice, wrapping the yield block ensures proper cleanup.

Suggested fix
     client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
         settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
     )
-    await init_beanie(database=client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
-
-    yield
-
-    await client.close()
-    await container.close()
+    try:
+        await init_beanie(database=client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+        yield
+    finally:
+        await client.close()
+        await container.close()
backend/app/core/providers.py (1)

253-265: Guard long‑running background loops from silent task death.

These while True tasks (DLQ processing, event store batch loop, event bus listener, notification loops) will stop permanently if an exception escapes. Consider adding a try/except with logging and a short backoff so failures surface without killing the loop.

Example pattern
         async def batch_loop() -> None:
             while True:
-                await event_store_consumer.process_batch()
+                try:
+                    await event_store_consumer.process_batch()
+                except Exception:
+                    logger.exception("Event store consumer loop crashed; continuing")
+                    await asyncio.sleep(1)

Also applies to: 359-371, 413-428, 693-712

backend/workers/run_pod_monitor.py (1)

57-61: Consider moving MongoDB client lifecycle into DI for consistency.

The MongoDB client is created manually outside the DI container while other boundary clients (Kafka, Redis) are DI-managed. This works correctly since it's properly closed in the finally block, but moving it into a provider would improve consistency with other workers.
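
A dishka provider could own that lifecycle; this sketch reuses the make_mongo_client factory idea from the earlier nitpick, and the provider name is invented:

from collections.abc import AsyncIterator

from dishka import Provider, Scope, provide
from pymongo import AsyncMongoClient


class MongoClientProvider(Provider):
    @provide(scope=Scope.APP)
    async def get_mongo_client(
        self, settings: "Settings"
    ) -> AsyncIterator["AsyncMongoClient[dict[str, object]]"]:
        client: "AsyncMongoClient[dict[str, object]]" = AsyncMongoClient(
            settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
        )
        try:
            yield client  # container finalizes the provider on shutdown
        finally:
            await client.close()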

Comment on lines +58 to +90
@pytest_asyncio.fixture(scope="session")
async def unit_container(test_settings: Settings) -> AsyncGenerator[AsyncContainer, None]:
    """DI container for unit tests with fake boundary clients.

    Provides:
    - Fake Redis, Kafka, K8s (boundary clients)
    - Real metrics, repositories, services (internal)
    - Real MongoDB via init_beanie
    """
    container = make_async_container(
        SettingsProvider(),
        LoggingProvider(),
        FakeBoundaryClientProvider(),
        RedisServicesProvider(),
        MetricsProvider(),
        EventProvider(),
        FakeSchemaRegistryProvider(),
        MessagingProvider(),
        CoreServicesProvider(),
        KafkaServicesProvider(),
        RepositoryProvider(),
        context={Settings: test_settings},
    )

    mongo_client: AsyncMongoMockClient[dict[str, object]] = AsyncMongoMockClient()
    await init_beanie(
        database=mongo_client[test_settings.DATABASE_NAME],  # type: ignore[arg-type]
        document_models=ALL_DOCUMENTS,
    )

    yield container

    await container.close()

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Look for test cleanup patterns mentioned in the concern
rg -n "clean|reset|clear|delete_many|drop" backend/tests/unit/ --type py -C2 | head -100

Length of output: 7819


🏁 Script executed:

#!/bin/bash
# Check how the unit_container fixture is actually used in tests
rg -n "def test_.*unit_container" backend/tests/unit/ --type py -C1

Length of output: 1803


🏁 Script executed:

#!/bin/bash
# Look for fixture scope declarations and test isolation patterns
rg -n "@pytest\.fixture|@pytest_asyncio\.fixture" backend/tests/unit/conftest.py --type py -C1

Length of output: 3240


🏁 Script executed:

#!/bin/bash
# Check if mongomock client is being reset or if there's any state cleanup
rg -n "AsyncMongoMockClient|mongo_client" backend/tests/unit/conftest.py --type py -C3

Length of output: 721


🏁 Script executed:

#!/bin/bash
# Examine the full conftest.py structure to understand fixture relationships
head -300 backend/tests/unit/conftest.py | tail -250

Length of output: 7553


Confirm: session-scoped container creates genuine state leakage risk.

The unit_container fixture is session-scoped with a single AsyncMongoMockClient instance that persists across all unit tests. No database cleanup is performed between tests—the mongomock instance accumulates data from all writes. Tests using repositories that modify database state will pollute subsequent tests in the session.

A function-scoped unit_scope fixture exists (line 93) but tests use unit_container directly, bypassing isolation. Either switch test usage to unit_scope, add cleanup hooks between tests, or accept and document the leakage risk.
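
If cleanup hooks are the chosen route, one sketch is an autouse fixture that wipes every Beanie model after each test (ALL_DOCUMENTS is the model list this conftest already imports; the fixture name is invented):

from collections.abc import AsyncIterator

import pytest_asyncio


@pytest_asyncio.fixture(autouse=True)
async def _clean_db() -> AsyncIterator[None]:
    yield
    # Wipe whatever the test wrote while keeping the session-scoped client
    for model in ALL_DOCUMENTS:
        await model.delete_all()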

🤖 Prompt for AI Agents
In `@backend/tests/unit/conftest.py` around lines 58 - 90, The session-scoped
unit_container fixture holds a single AsyncMongoMockClient across the whole test
session causing DB state leakage; fix by making isolation explicit: either
change unit_container to function scope or create/attach a fresh
AsyncMongoMockClient per test and re-run init_beanie for that client
(references: unit_container, AsyncMongoMockClient, init_beanie), or add a
teardown that drops all collections between tests (callable in unit_scope) and
ensure tests use the existing unit_scope fixture instead of unit_container so
each test gets a clean DB state.

Comment on lines +34 to 38
    mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
        settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
    )
    await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)


⚠️ Potential issue | 🟡 Minor

Ensure Mongo client is closed if initialization fails.

If init_beanie raises, the mongo_client isn’t closed because the AsyncExitStack is entered later. Consider a small try/finally to avoid leaking the client on startup failure.

🔧 Suggested fix
-    mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
-        settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
-    )
-    await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+    mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
+        settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
+    )
+    try:
+        await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+    except Exception:
+        await mongo_client.close()
+        raise
🤖 Prompt for AI Agents
In `@backend/workers/run_event_replay.py` around lines 34 - 38, The
AsyncMongoClient instance (mongo_client) is created before calling init_beanie
and may leak if init_beanie raises; wrap the init_beanie call in a try/finally
so that on any exception you call await mongo_client.close() to ensure the
client is closed, then re-raise the exception; keep the AsyncExitStack usage for
normal shutdown but ensure the finally branch only closes mongo_client when
initialization fails (i.e., before the stack takes ownership).

Comment on lines 132 to 137
        subscriber = broker.subscriber(
            *topics,
            group_id=group_id,
            auto_commit=False,
            decoder=decode_avro,
        )

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd backend/workers && wc -l run_result_processor.py

Length of output: 93


🏁 Script executed:

cat -n backend/workers/run_result_processor.py | head -200

Length of output: 7837


🏁 Script executed:

cat -n backend/workers/run_result_processor.py | sed -n '120,160p'

Length of output: 2074


🌐 Web query:

FastStream 0.6 Kafka subscriber auto_commit=False offset commit behavior handler success

💡 Result:

Short answer: With FastStream 0.6 you must use the ACK-style acknowledgement (not auto-commit) to have offsets committed only after handler success — AckPolicy.ACK commits the offset after the handler returns (ACK_FIRST uses client auto-commit; MANUAL does nothing until you call msg.ack(); NACK_ON_ERROR will avoid committing on errors). See FastStream docs. [1]

Sources

  • FastStream 0.6 acknowledgement docs (AckPolicy behavior and examples). [1]
  • FastStream 0.5 docs showing auto_commit=False example (for comparison). [2]

🏁 Script executed:

grep -rn "AckPolicy\|\.ack()" backend/workers/

Length of output: 48


🏁 Script executed:

grep -rn "from faststream" backend/workers/run_result_processor.py

Length of output: 146


🌐 Web query:

FastStream 0.6 KafkaBroker subscriber default AckPolicy when auto_commit=False

💡 Result:

Default AckPolicy is AckPolicy.ACK_FIRST. FastStream’s docs state ACK_FIRST (which relies on Kafka’s enable.auto.commit) is the default; to get at-least-once you must set ack_policy=AckPolicy.ACK (or use manual ack). [1][2]

Sources:

  • FastStream Acknowledgement docs (0.6). [1]
  • KafkaBroker/subscriber API showing subscriber options (auto_commit / ack config). [2]

Add explicit offset commit strategy or enable auto-commit.

With auto_commit=False (line 135), offsets are not committed automatically. FastStream's default AckPolicy.ACK_FIRST relies on Kafka's enable.auto.commit, which is disabled here. This results in messages never being marked as processed—they will be redelivered on consumer restart.

Set ack_policy=AckPolicy.ACK in the broker.subscriber() call to commit offsets after handler success, or remove auto_commit=False to use default auto-commit. Additionally, the no-op handle_other handler silently drops unknown events without acknowledging them, preventing offset advancement for those messages.
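
Concretely, the wiring would look roughly like this, with AckPolicy imported from faststream as the 0.6 docs above describe; broker, topics, and group_id are placeholders for this worker's real values, and the Avro decoder stays as before:

from faststream import AckPolicy
from faststream.kafka import KafkaBroker

broker = KafkaBroker("localhost:9092")  # placeholder bootstrap server
topics = ["execution.results"]          # placeholder topic list
group_id = "result-processor"           # placeholder consumer group

subscriber = broker.subscriber(
    *topics,
    group_id=group_id,
    ack_policy=AckPolicy.ACK,  # commit offsets only after the handler succeeds
)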

🤖 Prompt for AI Agents
In `@backend/workers/run_result_processor.py` around lines 132 - 137, The
subscriber is created with auto_commit=False which prevents offsets from being
committed and combined with FastStream's default AckPolicy.ACK_FIRST causes
messages to never be acknowledged; update the broker.subscriber(...) invocation
(the call that currently passes topics, group_id, auto_commit=False,
decoder=decode_avro) to either set ack_policy=AckPolicy.ACK to commit offsets
after successful handling or remove auto_commit=False to enable auto-commit, and
then modify the no-op handle_other handler so it explicitly acknowledges or
returns success for unknown events to ensure offsets advance for those messages
as well.

@coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@backend/app/services/notification_service.py`:
- Around line 85-95: The docstring incorrectly mentions a run() method; update
it to remove the run() reference and instead state that background work is
started by the application's DI/lifespan scheduler which calls this service's
batch and cleanup methods (for example
NotificationService.batch_process_pending_notifications and
NotificationService.cleanup_notifications); briefly describe that these methods
handle pending notification processing and periodic cleanup and that they should
be wired into app lifespan/DI.

In `@backend/app/services/sse/redis_bus.py`:
- Around line 156-174: The start() method can be called multiple times which
will spawn duplicate listener tasks; add a guard in start() that checks
self._task (and whether it's not done) and either returns early or raises if a
listener is already running, or cancel/wait for the existing task before
creating a new one; update the async def start(self) to inspect self._task and
only call asyncio.create_task(listener()) when there is no active task (use
self._task.done() to determine activity) and ensure subscription/confirmation
logic is not re-run for an already-started listener (references: start, _task,
listener, _pubsub).

In `@backend/tests/integration/db/repositories/test_admin_settings_repository.py`:
- Around line 29-36: The test test_update_and_reset_settings currently asserts
AuditLogDocument.count() >= N which is brittle; capture a baseline count before
calling repo.update_system_settings and repo.reset_system_settings (use
AuditLogDocument.count()), then after update_system_settings assert the count
increased by one and after reset_system_settings assert the count increased by
two total (or increased by one relative to the post-update count). Reference the
test function test_update_and_reset_settings, the repository methods
update_system_settings and reset_system_settings, and AuditLogDocument.count()
when making these delta-based assertions.
♻️ Duplicate comments (5)
backend/app/core/dishka_lifespan.py (1)

51-64: Task supervision during runtime is still missing.

If either pending_notification_task or cleanup_notification_task fails with an unexpected exception during normal operation (between yield and shutdown), the failure goes unnoticed until shutdown. The exception handling inside the loops catches and logs errors, but a complete task crash (e.g., from an unhandled exception type) would be silent.

Consider adding a done callback to log failures immediately:

♻️ Suggested supervision pattern
     pending_task = asyncio.create_task(pending_notification_task())
     cleanup_task = asyncio.create_task(cleanup_notification_task())
+
+    def _on_task_done(task: asyncio.Task[None]) -> None:
+        if task.cancelled():
+            return
+        if exc := task.exception():
+            logger.error("Background task failed unexpectedly", exc_info=exc)
+
+    pending_task.add_done_callback(_on_task_done)
+    cleanup_task.add_done_callback(_on_task_done)
     logger.info("NotificationService background tasks started")
backend/app/services/pod_monitor/monitor.py (2)

120-129: Re-raise asyncio.CancelledError to preserve cancellation semantics.

Catching CancelledError without re-raising breaks Python's async cancellation contract. The task will appear to complete normally instead of being cancelled, causing task.cancelled() to return False.

🛠️ Suggested fix
         try:
             await self._watch_loop()
         except asyncio.CancelledError:
             self.logger.info("PodMonitor cancelled")
+            raise
         finally:

176-182: The synchronous watch.stream() iteration blocks the event loop.

The for event in stream: loop uses a synchronous iterator from the kubernetes library, which blocks the event loop while waiting for events. This prevents other async tasks from running. The codebase already uses asyncio.to_thread() elsewhere (lines 117, 295).

Consider wrapping the blocking iteration or processing events in a thread:

🛠️ Suggested approach

One option is to process each event fetch in a thread:

async def _run_watch(self) -> None:
    # ... setup code ...
    stream = self._watch.stream(self._v1.list_namespaced_pod, **kwargs)
    
    try:
        while True:
            event = await asyncio.to_thread(next, iter(stream), None)
            if event is None:
                break
            await self._process_raw_event(event)
    finally:
        self._update_resource_version(stream)

Alternatively, consider using an async kubernetes client if available.

backend/app/services/sse/sse_connection_registry.py (1)

30-38: Metrics can still drift on duplicate register_connection calls.

The unregister_connection method now correctly guards against decrementing when the connection doesn't exist. However, register_connection still unconditionally increments metrics even if the connection_id is already in the set (since set.add() is idempotent but the metrics call isn't).

🛠️ Suggested fix
     async def register_connection(self, execution_id: str, connection_id: str) -> None:
         """Register a new SSE connection."""
         async with self._lock:
-            if execution_id not in self._active_connections:
-                self._active_connections[execution_id] = set()
-
-            self._active_connections[execution_id].add(connection_id)
-            self.logger.debug("Registered SSE connection", extra={"connection_id": connection_id})
-            self.metrics.increment_sse_connections("executions")
+            conns = self._active_connections.setdefault(execution_id, set())
+            if connection_id not in conns:
+                conns.add(connection_id)
+                self.metrics.increment_sse_connections("executions")
+            self.logger.debug("Registered SSE connection", extra={"connection_id": connection_id})
backend/app/core/providers.py (1)

409-446: SSEConnectionRegistry should be app‑scoped.

Request scope recreates the registry per request, which breaks shared connection tracking across requests. This should be app‑wide.

🔧 Proposed fix
-    @provide(scope=Scope.REQUEST)
+    @provide(scope=Scope.APP)
    def get_sse_connection_registry(
🧹 Nitpick comments (2)
backend/app/events/event_store_consumer.py (1)

33-46: Trace span closes before metrics recording.

The trace_span context manager exits after line 38, so the metrics recording (line 41) and logging happen outside the span. This is minor, but for observability consistency, consider including the post-storage operations within the span.

♻️ Optional: Include metrics/logging in span scope
     async def store_event(self, event: DomainEvent, topic: str, consumer_group: str) -> bool:
         """Store a single event. Called by FastStream handler.

         Returns True if stored, False if duplicate/failed.
         """
         with trace_span(
             name="event_store_service.store_event",
             kind=SpanKind.CONSUMER,
             attributes={"event.type": event.event_type, "event.id": event.event_id},
         ):
             stored = await self.event_store.store_event(event)

-        if stored:
-            self.event_metrics.record_kafka_message_consumed(topic=topic, consumer_group=consumer_group)
-            self.logger.debug(f"Stored event {event.event_id}")
-        else:
-            self.logger.debug(f"Duplicate event {event.event_id}, skipped")
+            if stored:
+                self.event_metrics.record_kafka_message_consumed(topic=topic, consumer_group=consumer_group)
+                self.logger.debug(f"Stored event {event.event_id}")
+            else:
+                self.logger.debug(f"Duplicate event {event.event_id}, skipped")

         return stored
backend/app/services/notification_service.py (1)

126-149: Prefer logger.exception for batch/cleanup failures.

This preserves stack traces and is more actionable in ops.

♻️ Suggested change
-        except Exception as e:
-            self.logger.error(f"Error processing pending notifications: {e}")
+        except Exception:
+            self.logger.exception("Error processing pending notifications")
...
-        except Exception as e:
-            self.logger.error(f"Error cleaning up old notifications: {e}")
+        except Exception:
+            self.logger.exception("Error cleaning up old notifications")

Comment on lines +85 to +95
"""Service for creating and managing notifications.

This service handles:
- Creating notifications (user and system)
- Delivering notifications via channels (in-app, webhook, slack)
- Managing notification subscriptions
- Rate limiting via throttle cache

Background tasks (pending notification processing, cleanup) are started
via the run() method, which should be called from app lifespan.
"""

⚠️ Potential issue | 🟡 Minor

Docstring references a non‑existent run() method.

NotificationService no longer defines run(), so this is misleading. Please update the docstring to reflect the DI/lifespan scheduler that calls the new batch/cleanup methods.

📝 Proposed fix
-    Background tasks (pending notification processing, cleanup) are started
-    via the run() method, which should be called from app lifespan.
+    Background tasks (pending notification processing, cleanup) are started
+    by the DI/lifespan scheduler that calls process_pending_batch() and cleanup_old().
🤖 Prompt for AI Agents
In `@backend/app/services/notification_service.py` around lines 85 - 95, The
docstring incorrectly mentions a run() method; update it to remove the run()
reference and instead state that background work is started by the application's
DI/lifespan scheduler which calls this service's batch and cleanup methods (for
example NotificationService.batch_process_pending_notifications and
NotificationService.cleanup_notifications); briefly describe that these methods
handle pending notification processing and periodic cleanup and that they should
be wired into app lifespan/DI.

Comment on lines +156 to +174
    async def start(self) -> None:
        """Start listening for messages."""
        await self._pubsub.subscribe(self._channel)
        await self._pubsub.get_message(timeout=1.0)  # Consume subscribe confirmation

        async def listener() -> None:
            while True:
                try:
                    msg = await self._pubsub.get_message(ignore_subscribe_messages=True, timeout=0.5)
                    if msg and msg.get("type") == "message":
                        data = json.loads(msg["data"])
                        await self._handler(data)
                except asyncio.CancelledError:
                    break
                except Exception as e:
                    self._logger.error(f"Error processing internal message on {self._channel}: {e}")

        self._task = asyncio.create_task(listener())
        self._logger.debug(f"Started internal subscription on {self._channel}")

⚠️ Potential issue | 🟡 Minor

Guard against double start() calls.

If start() is called twice without an intervening close(), a new listener task is created while the old one continues running, causing duplicate message handling and a leaked task.

🛠️ Suggested fix
     async def start(self) -> None:
         """Start listening for messages."""
+        if self._task is not None:
+            return  # Already started
+
         await self._pubsub.subscribe(self._channel)
         await self._pubsub.get_message(timeout=1.0)  # Consume subscribe confirmation
🤖 Prompt for AI Agents
In `@backend/app/services/sse/redis_bus.py` around lines 156 - 174, The start()
method can be called multiple times which will spawn duplicate listener tasks;
add a guard in start() that checks self._task (and whether it's not done) and
either returns early or raises if a listener is already running, or cancel/wait
for the existing task before creating a new one; update the async def
start(self) to inspect self._task and only call asyncio.create_task(listener())
when there is no active task (use self._task.done() to determine activity) and
ensure subscription/confirmation logic is not re-run for an already-started
listener (references: start, _task, listener, _pubsub).

Comment on lines +29 to +36
async def test_update_and_reset_settings(repo: AdminSettingsRepository) -> None:
    s = SystemSettings()
    updated = await repo.update_system_settings(s, updated_by="admin", user_id="u1")
    assert isinstance(updated, SystemSettings)
    # verify audit log entry exists
    assert await db.get_collection("audit_log").count_documents({}) >= 1
    assert await AuditLogDocument.count() >= 1
    reset = await repo.reset_system_settings("admin", "u1")
    assert isinstance(reset, SystemSettings)
    assert await db.get_collection("audit_log").count_documents({}) >= 2
    assert await AuditLogDocument.count() >= 2

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "test_admin_settings_repository.py" 2>/dev/null | head -20

Length of output: 188


🏁 Script executed:

git ls-files | grep -E "(test_admin|admin_settings)" | head -20

Length of output: 457


🏁 Script executed:

cat -n backend/tests/integration/db/repositories/test_admin_settings_repository.py | head -50

Length of output: 1653


🏁 Script executed:

rg "def db" backend/tests --type py -B 2 -A 5 | head -100

Length of output: 485


🏁 Script executed:

rg "fixture|scope|autouse" backend/tests -B 2 -A 2 --type py | grep -E "(db|audit)" -A 3 -B 1 | head -80

Length of output: 2217


🏁 Script executed:

find . -path "*/tests/integration/conftest.py" -type f

Length of output: 105


🏁 Script executed:

cat -n backend/tests/integration/conftest.py 2>/dev/null | head -100

Length of output: 4140


🏁 Script executed:

cat -n backend/app/db/docs/admin_settings.py

Length of output: 2515


🏁 Script executed:

rg "AuditLogDocument" backend/tests --type py -B 2 -A 2 | head -80

Length of output: 1361


🏁 Script executed:

find . -path "*/tests/helpers/cleanup.py" -type f

Length of output: 100


🏁 Script executed:

cat -n tests/helpers/cleanup.py

Length of output: 122


🏁 Script executed:

rg "scope.*AsyncContainer|@pytest.fixture.*scope" backend/tests --type py -B 2 -A 3 | head -80

Length of output: 5389


🏁 Script executed:

cat -n backend/tests/helpers/cleanup.py

Length of output: 911


🏁 Script executed:

rg "def db\(\)" backend/tests --type py -B 3 -A 10

Length of output: 795


🏁 Script executed:

head -100 backend/tests/conftest.py

Length of output: 3987


Make audit-log assertions delta-based to avoid false positives.

Global AuditLogDocument.count() assertions are fragile; they pass whenever any audit logs exist, not necessarily ones from this test. Use a baseline capture and assert the increment to ensure the test actually validates its own operations.

🔧 Suggested adjustment (baseline + delta)
 async def test_update_and_reset_settings(repo: AdminSettingsRepository) -> None:
+    baseline = await AuditLogDocument.count()
     s = SystemSettings()
     updated = await repo.update_system_settings(s, updated_by="admin", user_id="u1")
     assert isinstance(updated, SystemSettings)
-    assert await AuditLogDocument.count() >= 1
+    assert await AuditLogDocument.count() >= baseline + 1
     reset = await repo.reset_system_settings("admin", "u1")
     assert isinstance(reset, SystemSettings)
-    assert await AuditLogDocument.count() >= 2
+    assert await AuditLogDocument.count() >= baseline + 2
🤖 Prompt for AI Agents
In `@backend/tests/integration/db/repositories/test_admin_settings_repository.py`
around lines 29 - 36, The test test_update_and_reset_settings currently asserts
AuditLogDocument.count() >= N which is brittle; capture a baseline count before
calling repo.update_system_settings and repo.reset_system_settings (use
AuditLogDocument.count()), then after update_system_settings assert the count
increased by one and after reset_system_settings assert the count increased by
two total (or increased by one relative to the post-update count). Reference the
test function test_update_and_reset_settings, the repository methods
update_system_settings and reset_system_settings, and AuditLogDocument.count()
when making these delta-based assertions.

@cubic-dev-ai bot left a comment

5 issues found across 33 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/workers/dlq_processor.py">

<violation number="1" location="backend/workers/dlq_processor.py:185">
P2: auto_commit=False disables auto-acknowledgment, but the handler never explicitly acknowledges messages. This leaves Kafka offsets uncommitted, so DLQ messages will be reprocessed on restart or rebalance. Either enable auto-commit or accept a message object and call ack/nack after processing.</violation>
</file>

<file name="backend/tests/unit/conftest.py">

<violation number="1" location="backend/tests/unit/conftest.py:65">
P3: The fixture uses `AsyncMongoMockClient`, so the docstring claiming a "Real MongoDB" is misleading. Update it to reflect that this is an in-memory/mock MongoDB instance.</violation>
</file>

<file name="backend/app/dlq/manager.py">

<violation number="1" location="backend/app/dlq/manager.py:80">
P2: Wrap the processing block in try/except/finally so failures are logged and duration metrics are recorded even when _process_dlq_message raises.</violation>
</file>

<file name="backend/workers/run_saga_orchestrator.py">

<violation number="1" location="backend/workers/run_saga_orchestrator.py:109">
P2: Mongo client is created before the `try/finally`, so failures during `init_beanie` or schema registry initialization skip cleanup. Wrap the initialization in the `try/finally` (or use an async context manager) so the client is always closed on startup errors.</violation>
</file>

<file name="backend/workers/run_result_processor.py">

<violation number="1" location="backend/workers/run_result_processor.py:115">
P2: Mongo client initialization is outside the `try/finally`, so a failure during `init_beanie` or other setup will skip cleanup and leave the client unclosed. Wrap initialization in the same `try/finally` (or guard `close()` when init fails) to avoid resource leaks on startup errors.</violation>
</file>


Comment on lines +80 to +100
        start = asyncio.get_running_loop().time()

        # Record metrics
        self.metrics.record_dlq_message_received(message.original_topic, message.event.event_type)
        self.metrics.record_dlq_message_age((datetime.now(timezone.utc) - message.failed_at).total_seconds())

        # Process with tracing
        ctx = extract_trace_context(message.headers)
        with get_tracer().start_as_current_span(
            name="dlq.consume",
            context=ctx,
            kind=SpanKind.CONSUMER,
            attributes={
                EventAttributes.KAFKA_TOPIC: self.dlq_topic,
                EventAttributes.EVENT_TYPE: message.event.event_type,
                EventAttributes.EVENT_ID: message.event.event_id,
            },
        ):
            await self._process_dlq_message(message)

        self.metrics.record_dlq_processing_duration(asyncio.get_running_loop().time() - start, "process")
@cubic-dev-ai bot Jan 21, 2026

P2: Wrap the processing block in try/except/finally so failures are logged and duration metrics are recorded even when _process_dlq_message raises.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/app/dlq/manager.py, line 80:

<comment>Wrap the processing block in try/except/finally so failures are logged and duration metrics are recorded even when _process_dlq_message raises.</comment>

<file context>
@@ -70,42 +69,35 @@ def __init__(
+        Args:
+            message: Typed DLQMessage (deserialized by FastStream/Avro)
+        """
+        start = asyncio.get_running_loop().time()
+
+        # Record metrics
</file context>
Suggested change (wrap the block in try/except/finally):

        start = asyncio.get_running_loop().time()
        try:
            # Record metrics
            self.metrics.record_dlq_message_received(message.original_topic, message.event.event_type)
            self.metrics.record_dlq_message_age((datetime.now(timezone.utc) - message.failed_at).total_seconds())

            # Process with tracing
            ctx = extract_trace_context(message.headers)
            with get_tracer().start_as_current_span(
                name="dlq.consume",
                context=ctx,
                kind=SpanKind.CONSUMER,
                attributes={
                    EventAttributes.KAFKA_TOPIC: self.dlq_topic,
                    EventAttributes.EVENT_TYPE: message.event.event_type,
                    EventAttributes.EVENT_ID: message.event.event_id,
                },
            ):
                await self._process_dlq_message(message)
        except Exception as e:
            self.logger.error(f"Error processing DLQ message: {e}")
            raise
        finally:
            self.metrics.record_dlq_processing_duration(asyncio.get_running_loop().time() - start, "process")

app_logger.info("SagaOrchestrator starting...")

# Initialize MongoDB + Beanie
mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(

@cubic-dev-ai cubic-dev-ai bot Jan 21, 2026


P2: Mongo client is created before the try/finally, so failures during init_beanie or schema registry initialization skip cleanup. Wrap the initialization in the try/finally (or use an async context manager) so the client is always closed on startup errors.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/workers/run_saga_orchestrator.py, line 109:

<comment>Mongo client is created before the `try/finally`, so failures during `init_beanie` or schema registry initialization skip cleanup. Wrap the initialization in the `try/finally` (or use an async context manager) so the client is always closed on startup errors.</comment>

<file context>
@@ -105,9 +105,14 @@ async def lifespan(app: FastStream) -> AsyncIterator[None]:
-        await container.get(Database)  # Triggers init_beanie via DatabaseProvider
-        schema_registry = await container.get(SchemaRegistryManager)  # Triggers initialize_schemas
+        # Initialize MongoDB + Beanie
+        mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
+            settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
+        )
</file context>
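A minimal sketch of the wrapped shape, for illustration only (it reuses the settings, ALL_DOCUMENTS, and AsyncMongoClient names from the file context above; the real lifespan does more work inside the try):

from contextlib import asynccontextmanager
from typing import AsyncIterator

from beanie import init_beanie
from pymongo import AsyncMongoClient


@asynccontextmanager
async def lifespan() -> AsyncIterator[None]:
    # Create the client, then guard everything that can fail so close()
    # always runs, even when init_beanie or schema setup raises.
    mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
        settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
    )
    try:
        await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
        yield
    finally:
        await mongo_client.close()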

app_logger.info("ResultProcessor starting...")

# Initialize MongoDB + Beanie
mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(

@cubic-dev-ai cubic-dev-ai bot Jan 21, 2026


P2: Mongo client initialization is outside the try/finally, so a failure during init_beanie or other setup will skip cleanup and leave the client unclosed. Wrap initialization in the same try/finally (or guard close() when init fails) to avoid resource leaks on startup errors.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/workers/run_result_processor.py, line 115:

<comment>Mongo client initialization is outside the `try/finally`, so a failure during `init_beanie` or other setup will skip cleanup and leave the client unclosed. Wrap initialization in the same `try/finally` (or guard `close()` when init fails) to avoid resource leaks on startup errors.</comment>

<file context>
@@ -111,9 +111,14 @@ async def lifespan(app: FastStream) -> AsyncIterator[None]:
-        await container.get(Database)  # Triggers init_beanie via DatabaseProvider
-        schema_registry = await container.get(SchemaRegistryManager)  # Triggers initialize_schemas
+        # Initialize MongoDB + Beanie
+        mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
+            settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
+        )
</file context>

Provides:
- Fake Redis, Kafka, K8s (boundary clients)
- Real metrics, repositories, services (internal)
- Real MongoDB via init_beanie

@cubic-dev-ai cubic-dev-ai bot Jan 21, 2026


P3: The fixture uses AsyncMongoMockClient, so the docstring claiming a "Real MongoDB" is misleading. Update it to reflect that this is an in-memory/mock MongoDB instance.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/tests/unit/conftest.py, line 65:

<comment>The fixture uses `AsyncMongoMockClient`, so the docstring claiming a "Real MongoDB" is misleading. Update it to reflect that this is an in-memory/mock MongoDB instance.</comment>

<file context>
@@ -60,29 +60,33 @@ async def unit_container(test_settings: Settings) -> AsyncGenerator[AsyncContain
-    - Fake Redis, Kafka, K8s, MongoDB (boundary clients)
+    - Fake Redis, Kafka, K8s (boundary clients)
     - Real metrics, repositories, services (internal)
+    - Real MongoDB via init_beanie
     """
     container = make_async_container(
</file context>
Suggested change
- Real MongoDB via init_beanie
- MongoDB mock via init_beanie


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
backend/app/core/providers.py (1)

103-116: Redis client lacks lifecycle management.

The Kafka producer uses AsyncIterable with proper start()/stop() lifecycle, but the Redis client is returned directly without cleanup. Redis connections may leak on application shutdown.

🔧 Proposed fix with proper lifecycle
-    @provide
-    def get_redis_client(self, settings: Settings, logger: logging.Logger) -> redis.Redis:
-        logger.info(f"Redis configured: {settings.REDIS_HOST}:{settings.REDIS_PORT}/{settings.REDIS_DB}")
-        return redis.Redis(
+    @provide
+    async def get_redis_client(self, settings: Settings, logger: logging.Logger) -> AsyncIterable[redis.Redis]:
+        logger.info(f"Redis configured: {settings.REDIS_HOST}:{settings.REDIS_PORT}/{settings.REDIS_DB}")
+        client = redis.Redis(
             host=settings.REDIS_HOST,
             port=settings.REDIS_PORT,
             db=settings.REDIS_DB,
             password=settings.REDIS_PASSWORD,
             ssl=settings.REDIS_SSL,
             max_connections=settings.REDIS_MAX_CONNECTIONS,
             decode_responses=settings.REDIS_DECODE_RESPONSES,
             socket_connect_timeout=5,
             socket_timeout=5,
         )
+        yield client
+        await client.aclose()
+        logger.info("Redis client closed")
backend/tests/integration/dlq/test_dlq_manager.py (1)

55-85: Wait for partition assignment before processing to avoid test flakiness.

With auto_offset_reset="latest" and no committed offsets, the consumer's initial offset is set to the partition's log end when assignment completes. If process_message() produces the event before partition assignment finishes, the consumer may start fetching from a position after the message, causing it to be skipped and the test to flake.

Add a check to ensure partition assignment is complete before creating the consume task:

🔧 Suggested stabilization
 await consumer.start()
+while not consumer.assignment():
+    await asyncio.sleep(0.05)
 consume_task = asyncio.create_task(consume_dlq_events())

 # Process message directly via DLQManager (simulating FastStream handler)
 await dlq_manager.process_message(dlq_message)
🤖 Fix all issues with AI agents
In `@backend/app/services/user_settings_service.py`:
- Around line 216-221: The invalidate_cache method currently calls
self._redis.delete(cache_key) without handling Redis errors; update
invalidate_cache to catch Redis-related exceptions (matching the pattern used in
your cache read/write methods), log an error via self.logger.error including the
cache_key and exception details, and swallow the exception so callers like
reset_user_settings won't fail due to cache layer issues while preserving the
existing debug log when deletion succeeds (references: invalidate_cache,
self._redis.delete, self.logger.debug, reset_user_settings).
- Around line 223-227: The _set_cache method currently lets Redis errors
propagate; change it to swallow cache write failures and only log them so
database operations like update_user_settings and restore_settings_to_point
don't fail when caching fails. Wrap the await self._redis.setex(...) call in a
try/except catching broad Redis/connection exceptions (or Exception if specific
client exceptions aren't imported), log the error with self.logger.warning or
self.logger.debug including the cache_key (from _cache_key) and
_CACHE_TTL_SECONDS, and return None silently on failure; leave successful
behavior unchanged (still log debug on success). Ensure no exception is
re-raised so callers of _set_cache (e.g., update_user_settings,
restore_settings_to_point) continue normally when caching fails.
- Around line 55-63: Wrap the Redis read in get_user_settings with a try/except
to catch connection/timeouts from self._redis.get(cache_key) (and optionally
DomainUserSettings.model_validate_json parsing errors); on exception log a
non-fatal warning via self.logger.warning including the exception and cache_key,
then fall back to calling and returning get_user_settings_fresh(user_id). Ensure
you still log the cache hit when cached succeeds and only use the cached data
when model_validate_json succeeds; do not let Redis exceptions propagate and
crash the request.
♻️ Duplicate comments (1)
backend/app/core/providers.py (1)

421-430: SSEConnectionRegistry scope issue persists from previous review.

As flagged in a prior review, Scope.REQUEST creates a new registry per HTTP request, causing get_connection_count() to always return 0. The registry should use Scope.APP to track connections globally across requests.

🔧 Fix: Change scope to APP
-    @provide(scope=Scope.REQUEST)
+    @provide(scope=Scope.APP)
     def get_sse_connection_registry(
             self,
             logger: logging.Logger,
             connection_metrics: ConnectionMetrics,
     ) -> SSEConnectionRegistry:
🧹 Nitpick comments (2)
backend/app/services/user_settings_service.py (1)

40-45: Type annotation uses sync redis.Redis but operations are async.

The redis_client parameter is typed as redis.Redis, but the code uses await with Redis operations (e.g., await self._redis.get(cache_key)). The redis.asyncio module provides redis.asyncio.Redis as the async client type. While the current annotation works at runtime due to Python's duck typing, it's technically incorrect for type checkers.

🔧 Suggested fix
     def __init__(
         self,
         repository: UserSettingsRepository,
         event_service: KafkaEventService,
         logger: logging.Logger,
-        redis_client: redis.Redis,
+        redis_client: redis.Redis,  # type: ignore[type-arg]  # async Redis client
     ) -> None:

Alternatively, use the more explicit async type if available in your redis version:

redis_client: "redis.asyncio.Redis[str]",
backend/tests/integration/services/user_settings/test_user_settings_service.py (1)

39-43: Consider asserting the results of update wrappers.

The update wrapper calls at the end of the test don't verify their return values. While this exercises the code paths, adding assertions would catch regressions more effectively.

💡 Example assertions
     # Update wrappers
-    await svc.update_theme(user_id, Theme.DARK)
-    await svc.update_notification_settings(user_id, DomainNotificationSettings())
-    await svc.update_editor_settings(user_id, DomainEditorSettings(tab_size=2))
-    await svc.update_custom_setting(user_id, "k", "v")
+    theme_result = await svc.update_theme(user_id, Theme.DARK)
+    assert theme_result.theme == Theme.DARK
+    
+    notif_result = await svc.update_notification_settings(user_id, DomainNotificationSettings())
+    assert notif_result.notifications is not None
+    
+    editor_result = await svc.update_editor_settings(user_id, DomainEditorSettings(tab_size=2))
+    assert editor_result.editor.tab_size == 2
+    
+    custom_result = await svc.update_custom_setting(user_id, "k", "v")
+    assert custom_result.custom_settings.get("k") == "v"

Comment on lines 55 to 63
     async def get_user_settings(self, user_id: str) -> DomainUserSettings:
-        """Get settings with cache; rebuild and cache on miss."""
-        if user_id in self._cache:
-            cached = self._cache[user_id]
-            self.logger.debug(f"Settings cache hit for user {user_id}", extra={"cache_size": len(self._cache)})
-            return cached
+        """Get settings with Redis cache; rebuild and cache on miss."""
+        cache_key = self._cache_key(user_id)
+        cached = await self._redis.get(cache_key)
+        if cached:
+            self.logger.debug(f"Settings cache hit for user {user_id}")
+            return DomainUserSettings.model_validate_json(cached)

         return await self.get_user_settings_fresh(user_id)


⚠️ Potential issue | 🟠 Major

Redis cache operations lack error handling.

If Redis becomes unavailable or times out, the await self._redis.get(cache_key) call will raise an exception, causing the entire request to fail. For a cache layer, Redis failures should be non-fatal—falling back to fresh data retrieval from the database.

🛡️ Proposed fix with graceful degradation
     async def get_user_settings(self, user_id: str) -> DomainUserSettings:
         """Get settings with Redis cache; rebuild and cache on miss."""
         cache_key = self._cache_key(user_id)
-        cached = await self._redis.get(cache_key)
-        if cached:
-            self.logger.debug(f"Settings cache hit for user {user_id}")
-            return DomainUserSettings.model_validate_json(cached)
+        try:
+            cached = await self._redis.get(cache_key)
+            if cached:
+                self.logger.debug(f"Settings cache hit for user {user_id}")
+                return DomainUserSettings.model_validate_json(cached)
+        except Exception as e:
+            self.logger.warning(f"Redis cache read failed for user {user_id}: {e}")
 
         return await self.get_user_settings_fresh(user_id)
🤖 Prompt for AI Agents
In `@backend/app/services/user_settings_service.py` around lines 55 - 63, Wrap the
Redis read in get_user_settings with a try/except to catch connection/timeouts
from self._redis.get(cache_key) (and optionally
DomainUserSettings.model_validate_json parsing errors); on exception log a
non-fatal warning via self.logger.warning including the exception and cache_key,
then fall back to calling and returning get_user_settings_fresh(user_id). Ensure
you still log the cache hit when cached succeeds and only use the cached data
when model_validate_json succeeds; do not let Redis exceptions propagate and
crash the request.

Comment on lines 216 to +221
     async def invalidate_cache(self, user_id: str) -> None:
         """Invalidate cached settings for a user."""
-        if self._cache.pop(user_id, None) is not None:
-            self.logger.debug(f"Invalidated cache for user {user_id}", extra={"cache_size": len(self._cache)})
-
-    def _add_to_cache(self, user_id: str, settings: DomainUserSettings) -> None:
-        """Add settings to TTL+LRU cache."""
-        self._cache[user_id] = settings
-        self.logger.debug(f"Cached settings for user {user_id}", extra={"cache_size": len(self._cache)})
-
-    def get_cache_stats(self) -> dict[str, Any]:
-        """Get cache statistics for monitoring."""
-        return {
-            "cache_size": len(self._cache),
-            "max_cache_size": self._max_cache_size,
-            "expired_entries": 0,
-            "cache_ttl_seconds": self._cache_ttl.total_seconds(),
-        }
+        cache_key = self._cache_key(user_id)
+        deleted = await self._redis.delete(cache_key)
+        if deleted:
+            self.logger.debug(f"Invalidated cache for user {user_id}")


⚠️ Potential issue | 🟡 Minor

Cache invalidation should also handle Redis failures.

For consistency with the cache read/write error handling strategy, invalidate_cache should catch Redis exceptions to ensure reset_user_settings doesn't fail due to cache layer issues.

🛡️ Proposed fix
     async def invalidate_cache(self, user_id: str) -> None:
         """Invalidate cached settings for a user."""
         cache_key = self._cache_key(user_id)
-        deleted = await self._redis.delete(cache_key)
-        if deleted:
-            self.logger.debug(f"Invalidated cache for user {user_id}")
+        try:
+            deleted = await self._redis.delete(cache_key)
+            if deleted:
+                self.logger.debug(f"Invalidated cache for user {user_id}")
+        except Exception as e:
+            self.logger.warning(f"Failed to invalidate cache for user {user_id}: {e}")
📝 Committable suggestion


Suggested change
    async def invalidate_cache(self, user_id: str) -> None:
        """Invalidate cached settings for a user."""
        if self._cache.pop(user_id, None) is not None:
            self.logger.debug(f"Invalidated cache for user {user_id}", extra={"cache_size": len(self._cache)})

    def _add_to_cache(self, user_id: str, settings: DomainUserSettings) -> None:
        """Add settings to TTL+LRU cache."""
        self._cache[user_id] = settings
        self.logger.debug(f"Cached settings for user {user_id}", extra={"cache_size": len(self._cache)})

    def get_cache_stats(self) -> dict[str, Any]:
        """Get cache statistics for monitoring."""
        return {
            "cache_size": len(self._cache),
            "max_cache_size": self._max_cache_size,
            "expired_entries": 0,
            "cache_ttl_seconds": self._cache_ttl.total_seconds(),
        }

        cache_key = self._cache_key(user_id)
        deleted = await self._redis.delete(cache_key)
        if deleted:
            self.logger.debug(f"Invalidated cache for user {user_id}")

    async def invalidate_cache(self, user_id: str) -> None:
        """Invalidate cached settings for a user."""
        cache_key = self._cache_key(user_id)
        try:
            deleted = await self._redis.delete(cache_key)
            if deleted:
                self.logger.debug(f"Invalidated cache for user {user_id}")
        except Exception as e:
            self.logger.warning(f"Failed to invalidate cache for user {user_id}: {e}")
🤖 Prompt for AI Agents
In `@backend/app/services/user_settings_service.py` around lines 216 - 221, The
invalidate_cache method currently calls self._redis.delete(cache_key) without
handling Redis errors; update invalidate_cache to catch Redis-related exceptions
(matching the pattern used in your cache read/write methods), log an error via
self.logger.error including the cache_key and exception details, and swallow the
exception so callers like reset_user_settings won't fail due to cache layer
issues while preserving the existing debug log when deletion succeeds
(references: invalidate_cache, self._redis.delete, self.logger.debug,
reset_user_settings).

Comment on lines +223 to +227
    async def _set_cache(self, user_id: str, settings: DomainUserSettings) -> None:
        """Set settings in Redis cache with TTL."""
        cache_key = self._cache_key(user_id)
        await self._redis.setex(cache_key, self._CACHE_TTL_SECONDS, settings.model_dump_json())
        self.logger.debug(f"Cached settings for user {user_id}")


⚠️ Potential issue | 🟠 Major

Cache write failures should not propagate.

Similar to the read path, _set_cache should handle Redis failures gracefully. A failed cache write shouldn't cause update_user_settings or restore_settings_to_point to fail when the primary database operation succeeded.

🛡️ Proposed fix
     async def _set_cache(self, user_id: str, settings: DomainUserSettings) -> None:
         """Set settings in Redis cache with TTL."""
         cache_key = self._cache_key(user_id)
-        await self._redis.setex(cache_key, self._CACHE_TTL_SECONDS, settings.model_dump_json())
-        self.logger.debug(f"Cached settings for user {user_id}")
+        try:
+            await self._redis.setex(cache_key, self._CACHE_TTL_SECONDS, settings.model_dump_json())
+            self.logger.debug(f"Cached settings for user {user_id}")
+        except Exception as e:
+            self.logger.warning(f"Failed to cache settings for user {user_id}: {e}")
📝 Committable suggestion


Suggested change
    async def _set_cache(self, user_id: str, settings: DomainUserSettings) -> None:
        """Set settings in Redis cache with TTL."""
        cache_key = self._cache_key(user_id)
        await self._redis.setex(cache_key, self._CACHE_TTL_SECONDS, settings.model_dump_json())
        self.logger.debug(f"Cached settings for user {user_id}")

    async def _set_cache(self, user_id: str, settings: DomainUserSettings) -> None:
        """Set settings in Redis cache with TTL."""
        cache_key = self._cache_key(user_id)
        try:
            await self._redis.setex(cache_key, self._CACHE_TTL_SECONDS, settings.model_dump_json())
            self.logger.debug(f"Cached settings for user {user_id}")
        except Exception as e:
            self.logger.warning(f"Failed to cache settings for user {user_id}: {e}")
🤖 Prompt for AI Agents
In `@backend/app/services/user_settings_service.py` around lines 223 - 227, The
_set_cache method currently lets Redis errors propagate; change it to swallow
cache write failures and only log them so database operations like
update_user_settings and restore_settings_to_point don't fail when caching
fails. Wrap the await self._redis.setex(...) call in a try/except catching broad
Redis/connection exceptions (or Exception if specific client exceptions aren't
imported), log the error with self.logger.warning or self.logger.debug including
the cache_key (from _cache_key) and _CACHE_TTL_SECONDS, and return None silently
on failure; leave successful behavior unchanged (still log debug on success).
Ensure no exception is re-raised so callers of _set_cache (e.g.,
update_user_settings, restore_settings_to_point) continue normally when caching
fails.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@backend/tests/integration/dlq/test_dlq_manager.py`:
- Around line 79-83: The test uses a fixed asyncio.sleep after await
consumer.start(), which can race with asynchronous partition assignment when
auto_offset_reset="latest"; instead, poll until the consumer has assignments
(e.g., check consumer.assignment() or wait for non-empty consumer._assignment)
and then call consumer.seek_to_end() on each assigned partition before
producing/starting consume_dlq_events(); update the setup that currently
references consumer.start() and consume_dlq_events() to wait for assignment and
seek_to_end() deterministically rather than relying on the 0.5s sleep.
🧹 Nitpick comments (2)
backend/app/events/schema/schema_registry.py (2)

90-92: Redundant schema construction.

_ensure_schema_registered already calls avro_schema_to_python() and creates an AvroSchema object. These lines duplicate that work on every serialization call. Consider caching the AvroSchema object in _ensure_schema_registered and retrieving it here.

Suggested approach

Add a cache for the schema object in __init__:

self._avro_schema_cache: dict[type[DomainEvent], schema.AvroSchema] = {}

Update _ensure_schema_registered to cache and return the schema object, then reuse it in serialize_event.


98-102: Move datetime import to module level and add recursive datetime conversion for robustness.

  1. The `from datetime import datetime` statement on line 99 sits inside the function; move it to module level.

  2. The avro_schema_to_python() call happens twice: once in _ensure_schema_registered() (line 70) and again in serialize_event() (line 91). While both calls are necessary due to the schema_id-only caching, this repeated conversion could be optimized. If this becomes a performance concern, consider caching the AvroSchema object by schema_id.

  3. The datetime conversion on lines 100–102 only handles top-level fields. While no nested models currently contain datetime fields, adding a recursive helper would make the code more resilient to future changes where nested Pydantic models might include datetime fields.

Suggested refactor for module-level import and recursive conversion

Add to module imports:

from datetime import datetime

Replace the conversion logic with a recursive helper:

# Convert datetime fields to milliseconds for Avro timestamp-millis logical type
payload = self._convert_datetimes(payload)

Add helper method:

def _convert_datetimes(self, obj: Any) -> Any:
    """Recursively convert datetime fields to milliseconds."""
    if isinstance(obj, datetime):
        return int(obj.timestamp() * 1_000)
    if isinstance(obj, dict):
        return {k: self._convert_datetimes(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [self._convert_datetimes(item) for item in obj]
    return obj

Comment on lines +79 to 83
    # Start consumer BEFORE processing (auto_offset_reset="latest")
    await consumer.start()
    # Small delay to ensure consumer is fully subscribed and ready
    await asyncio.sleep(0.5)
    consume_task = asyncio.create_task(consume_dlq_events())


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "test_dlq_manager.py" | head -5

Repository: HardMax71/Integr8sCode

Length of output: 117


🏁 Script executed:

cat -n backend/tests/integration/dlq/test_dlq_manager.py | head -100

Repository: HardMax71/Integr8sCode

Length of output: 4860


🏁 Script executed:

cat -n backend/tests/integration/dlq/test_dlq_manager.py | sed -n '70,95p'

Repository: HardMax71/Integr8sCode

Length of output: 1405


🏁 Script executed:

rg -i "aiokafka" backend/tests/integration/dlq/test_dlq_manager.py -A 3 -B 3

Repository: HardMax71/Integr8sCode

Length of output: 570


🏁 Script executed:

rg "consumer\.(assignment|seek)" --type py | head -20

Repository: HardMax71/Integr8sCode

Length of output: 48


🏁 Script executed:

rg "auto_offset_reset" --type py -B 2 -A 5 | head -40

Repository: HardMax71/Integr8sCode

Length of output: 3247


🏁 Script executed:

rg "\.assignment\(\)|\.seek_to_end\(\)|\.position\(\)" --type py

Repository: HardMax71/Integr8sCode

Length of output: 48


🏁 Script executed:

python3 << 'EOF'
import subprocess
result = subprocess.run(['pip', 'show', 'aiokafka'], capture_output=True, text=True)
print(result.stdout)
print(result.stderr)
EOF

Repository: HardMax71/Integr8sCode

Length of output: 152


🌐 Web query:

aiokafka AIOKafkaConsumer assignment seek_to_end API documentation

💡 Result:

Summary — AIOKafkaConsumer.seek_to_end

  • Signature: async seek_to_end(*partitions). [1]
  • Behavior: set the consumer position to the most-recent available offset for the given TopicPartition(s). If no partitions are passed, it operates on all currently assigned partitions. [1]
  • Errors: raises IllegalStateError if any partition is not currently assigned; raises TypeError if passed objects are not TopicPartition instances. [1]
  • Note for read_committed isolation: seek_to_end returns the Last Stable Offset (LSO) (not necessarily the high watermark) when isolation_level='read_committed'. [2]

Sources:
[1] aiokafka API docs — AIOKafkaConsumer.seek_to_end.
[2] aiokafka consumer docs — explanation of LSO / read_committed behavior.


🌐 Web query:

aiokafka AIOKafkaConsumer assignment() method API

💡 Result:

AIOKafkaConsumer.assignment()

  • Purpose: return the set of TopicPartition objects currently assigned to the consumer.
  • Signature / usage: consumer.assignment() → set(TopicPartition). If you used consumer.assign([...]) it returns those partitions; if you used consumer.subscribe(...) it returns the partitions currently assigned by the group (may be empty if assignment hasn’t occurred yet or is in rebalance).
  • Notes:
    • Manual assign() and subscribe() cannot be mixed; assign() replaces previous assignment.
    • assignment() does not change positions; it only reports the current assignment.
    • assign()/subscribe() state rules and IllegalStateError behavior described in the docs (e.g., assign() raises IllegalStateError if subscribe() was already called).

Sources: aiokafka API & consumer docs. [1][2]

References
[1] aiokafka API documentation — AIOKafkaConsumer.assign / assignment.
[2] aiokafka consumer guide & source (assign/subscribe and assignment behavior).


🏁 Script executed:

rg "\.position\(|\.committed\(|\.seek\(" --type py backend/tests/integration/ -B 2 -A 2

Repository: HardMax71/Integr8sCode

Length of output: 48


🏁 Script executed:

rg "auto_offset_reset.*latest" --type py -B 5 -A 10 | head -60

Repository: HardMax71/Integr8sCode

Length of output: 2963


Avoid race condition with auto_offset_reset="latest" and fixed sleep; wait for consumer assignment.

The hard-coded 0.5s sleep does not guarantee the consumer has completed partition assignment. With subscribe() and auto_offset_reset="latest", assignment happens asynchronously after start(). If the message is produced before assignment finishes, the consumer may miss it or start at an incorrect offset. Use assignment polling and seek_to_end() for deterministic behavior.

🛠️ Suggested fix
     # Start consumer BEFORE processing (auto_offset_reset="latest")
     await consumer.start()
-    # Small delay to ensure consumer is fully subscribed and ready
-    await asyncio.sleep(0.5)
+    # Wait for assignment, then seek to end to ensure we capture the next event
+    while not consumer.assignment():
+        await asyncio.sleep(0.05)
+    await consumer.seek_to_end()
     consume_task = asyncio.create_task(consume_dlq_events())
📝 Committable suggestion


Suggested change
    # Start consumer BEFORE processing (auto_offset_reset="latest")
    await consumer.start()
    # Small delay to ensure consumer is fully subscribed and ready
    await asyncio.sleep(0.5)
    consume_task = asyncio.create_task(consume_dlq_events())

    # Start consumer BEFORE processing (auto_offset_reset="latest")
    await consumer.start()
    # Wait for assignment, then seek to end to ensure we capture the next event
    while not consumer.assignment():
        await asyncio.sleep(0.05)
    await consumer.seek_to_end()
    consume_task = asyncio.create_task(consume_dlq_events())
🤖 Prompt for AI Agents
In `@backend/tests/integration/dlq/test_dlq_manager.py` around lines 79 - 83, The
test uses a fixed asyncio.sleep after await consumer.start(), which can race
with asynchronous partition assignment when auto_offset_reset="latest"; instead,
poll until the consumer has assignments (e.g., check consumer.assignment() or
wait for non-empty consumer._assignment) and then call consumer.seek_to_end() on
each assigned partition before producing/starting consume_dlq_events(); update
the setup that currently references consumer.start() and consume_dlq_events() to
wait for assignment and seek_to_end() deterministically rather than relying on
the 0.5s sleep.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@backend/workers/run_sse_bridge.py`:
- Around line 106-110: The subscriber currently sets auto_commit=True in the
broker.subscriber call which, combined with swallowing exceptions in
route_event, can commit Kafka offsets before Redis publish succeeds; change to
manual commit by switching the subscriber's commit policy to AckPolicy.ACK (or
the equivalent manual/manual-ack option) and ensure route_event propagates
exceptions instead of catching/silencing them so that you only
acknowledge/commit after a successful Redis publish; specifically, update the
broker.subscriber invocation (the decorator using *topics, group_id,
auto_commit, decoder=decode_avro) to use manual/ACK commit and modify
route_event to re-raise errors from the Redis publish path so the broker can
avoid committing on failure.
♻️ Duplicate comments (9)
backend/workers/dlq_processor.py (1)

184-195: Manual ack missing with auto_commit=False.

With auto-commit disabled, offsets won’t be committed unless the handler explicitly acknowledges. Right now the handler only processes the message, so DLQ entries will replay on restart or rebalance. Either enable auto-commit or accept the broker message and call ack/nack after processing.

✅ Minimal fix (enable auto-commit)
        @broker.subscriber(
             *topics,
             group_id=group_id,
-            auto_commit=False,
+            auto_commit=True,
             decoder=decode_dlq_json,
         )
backend/workers/run_result_processor.py (3)

117-120: Mongo client initialization outside try/finally risks resource leak on startup failure.

If init_beanie() or subsequent initialization steps fail, mongo_client won't be closed since it's created before the try block begins at line 170.

Proposed fix
    @asynccontextmanager
     async def lifespan() -> AsyncIterator[None]:
         """Initialize infrastructure before app starts."""
         app_logger = await container.get(logging.Logger)
         app_logger.info("ResultProcessor starting...")
 
         # Initialize MongoDB + Beanie
         mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
             settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
         )
-        await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
-
-        # Resolve schema registry (initialization handled by provider)
-        schema_registry = await container.get(SchemaRegistryManager)
+        try:
+            await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
 
+            # Resolve schema registry (initialization handled by provider)
+            schema_registry = await container.get(SchemaRegistryManager)

Then continue the existing logic inside the new try block, adjusting the finally accordingly.


134-139: auto_commit=False is deprecated in FastStream 0.6; use AckPolicy instead.

FastStream 0.6 replaces the auto_commit parameter with AckPolicy. With auto_commit=False ignored and no explicit AckPolicy, offsets may not be committed correctly after handler success, leading to message redelivery on consumer restart.

Proposed fix
+from faststream.kafka import AckPolicy
+
         # Create subscriber with Avro decoder
         subscriber = broker.subscriber(
             *topics,
             group_id=group_id,
-            auto_commit=False,
+            ack_policy=AckPolicy.ACK,
             decoder=decode_avro,
         )
FastStream 0.6 KafkaBroker subscriber AckPolicy parameter

163-166: Silent catch-all handler discards unmatched events without logging.

The handle_other handler silently passes for any event not matching the specific filters. This could mask routing issues or silently drop events if new event types are added. Consider logging at warning level to aid debugging.

Proposed fix
         # Default handler for unmatched events (prevents message loss)
        @subscriber()
         async def handle_other(event: DomainEvent) -> None:
-            pass
+            app_logger.warning(
+                "Unhandled event type received: %s (event_id=%s)",
+                event.event_type,
+                event.event_id,
+            )
backend/workers/run_saga_orchestrator.py (2)

111-114: Mongo client initialization outside try/finally risks resource leak on startup failure.

Same issue as in run_result_processor.py: if init_beanie() or subsequent steps fail before the try block at line 172, the mongo client won't be closed. Note that the early-exit path (lines 129-132) does handle cleanup correctly.


144-149: auto_commit=False is deprecated in FastStream 0.6; use AckPolicy instead.

Same issue as in run_result_processor.py. Replace with explicit AckPolicy.ACK for at-least-once semantics.

Proposed fix
+from faststream.kafka import AckPolicy
+
        @broker.subscriber(
             *topics,
             group_id=group_id,
-            auto_commit=False,
+            ack_policy=AckPolicy.ACK,
             decoder=decode_avro,
         )
backend/workers/run_k8s_worker.py (2)

120-125: auto_commit=False is deprecated in FastStream 0.6; use AckPolicy instead.

Same issue as in other workers. Replace with explicit AckPolicy.ACK.

Proposed fix
+from faststream.kafka import AckPolicy
+
         # Create subscriber with Avro decoder
         subscriber = broker.subscriber(
             *topics,
             group_id=group_id,
-            auto_commit=False,
+            ack_policy=AckPolicy.ACK,
             decoder=decode_avro,
         )

142-145: Silent catch-all handler discards unmatched events without logging.

Same issue as in run_result_processor.py. Consider logging unhandled events to aid debugging.

Proposed fix
         # Default handler for unmatched events (prevents message loss)
        @subscriber()
         async def handle_other(event: DomainEvent) -> None:
-            pass
+            app_logger.warning(
+                "Unhandled event type received: %s (event_id=%s)",
+                event.event_type,
+                event.event_id,
+            )
backend/workers/run_sse_bridge.py (1)

101-102: Avoid hardcoded topic string in Avro decoding.

deserialize_event(..., "sse_bridge") is misleading when the subscriber handles multiple configured topics. Pass the actual topic from the message context or remove the parameter if unused.

🧹 Nitpick comments (2)
backend/workers/run_result_processor.py (1)

142-147: Calling private methods (_handle_*) from outside the class.

The handlers invoke logic._handle_completed(), logic._handle_failed(), and logic._handle_timeout() directly. These underscore-prefixed methods are conventionally private. Consider either exposing public methods on ProcessorLogic or documenting this as an intentional internal API usage.

backend/workers/run_saga_orchestrator.py (1)

163-167: Opportunistic timeout checking approach is reasonable but has caveats.

The timeout check runs only when events arrive, so timed-out sagas go undetected during quiet periods. The approach is documented as "opportunistic," which is acceptable, but if timely timeout detection is critical in production, document the limitation or add a background task, as sketched below.
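If timely detection ever becomes critical, a background sweeper along these lines would decouple timeout checks from event arrival (a sketch; check_timeouts is a hypothetical stand-in for whatever the orchestrator's timeout check is called):

import asyncio
from collections.abc import Awaitable, Callable


async def run_timeout_sweeper(
    check_timeouts: Callable[[], Awaitable[None]],
    interval_seconds: float = 30.0,
) -> None:
    # Run the timeout check on a fixed cadence so timed-out sagas are
    # detected even when no events arrive on the trigger topics.
    while True:
        await asyncio.sleep(interval_seconds)
        await check_timeouts()


# Usage inside the worker lifespan (before yield):
#   sweeper = asyncio.create_task(run_timeout_sweeper(saga_logic.check_timeouts))
# On shutdown: sweeper.cancel()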

Comment on lines 106 to 110
    @broker.subscriber(
        *topics,
        group_id=group_id,
        auto_commit=True,  # SSE bridge is idempotent (Redis pubsub)
        decoder=decode_avro,


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

FastStream 0.6.0 auto_commit behavior Kafka broker offset commit timing

💡 Result:

Short answer: FastStream 0.6.0 leaves offset commit timing to the underlying Kafka client when using the default ACK_FIRST (auto-ack) behavior — it uses enable.auto.commit and auto_commit_interval_ms from the Confluent/consumer client. That means offsets are committed asynchronously by the Kafka client at the configured auto-commit interval (and are tied to the client's poll/shutdown behavior); if you need at-least-once semantics commit after processing (use AckPolicy.ACK or MANUAL). [1][2][3]

Sources:

  • FastStream ack docs (explains ACK_FIRST uses enable.auto.commit and describes ACK / MANUAL behaviors). [1]
  • FastStream AsyncConfluentConsumer API showing enable_auto_commit and auto_commit_interval_ms config defaults. [2]
  • Kafka client behavior: auto commits occur according to enable.auto.commit and auto.commit.interval.ms (commits tied to poll/interval; sync on shutdown). [3]

🏁 Script executed:

# First, let's locate and examine the actual code file
find . -name "run_sse_bridge.py" -type f

Repository: HardMax71/Integr8sCode

Length of output: 101


🏁 Script executed:

# Now let's examine the actual code context around lines 106-110
cat -n ./backend/workers/run_sse_bridge.py | sed -n '100,130p'

Repository: HardMax71/Integr8sCode

Length of output: 1483


🏁 Script executed:

# Let's examine the SSEEventRouter and route_event to understand error handling
rg -A 15 "class SSEEventRouter" ./backend/workers/run_sse_bridge.py

Repository: HardMax71/Integr8sCode

Length of output: 48


🏁 Script executed:

# Also check if route_event has any error handling or retry logic
rg -B 5 -A 20 "def route_event" ./backend

Repository: HardMax71/Integr8sCode

Length of output: 1851


🏁 Script executed:

# Let's check the error handling in the handler and see if exceptions are re-raised
cat -n ./backend/workers/run_sse_bridge.py | sed -n '112,119p'

Repository: HardMax71/Integr8sCode

Length of output: 426


🏁 Script executed:

# Check if there's any other commit logic or error handling patterns in the file
rg -n "commit\|ack\|ACK" ./backend/workers/run_sse_bridge.py

Repository: HardMax71/Integr8sCode

Length of output: 48


Switch to manual commit or handle Redis failures by re-raising exceptions from route_event.

FastStream 0.6.0's auto_commit=True relies on the Kafka client's asynchronous offset commit (tied to poll/interval, not handler completion). The handler here swallows exceptions from route_event without re-raising them, so if a Redis publish fails transiently, the Kafka offset is committed anyway and the event is lost, breaking at-least-once semantics. Either use AckPolicy.ACK (commit only after the handler succeeds) or let handler exceptions propagate so FastStream can decide whether to commit.
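A sketch of the propagate-on-failure variant (method and helper names like _channel_for are hypothetical; the point is only that a Redis publish error must escape the handler so the offset is not committed):

async def route_event(self, event: DomainEvent) -> None:
    try:
        await self._redis.publish(self._channel_for(event), event.model_dump_json())
    except Exception:
        # Re-raise so FastStream skips the offset commit for this message
        # instead of silently dropping the event.
        self.logger.exception("SSE route_event failed; not committing offset")
        raise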

🤖 Prompt for AI Agents
In `@backend/workers/run_sse_bridge.py` around lines 106 - 110, The subscriber
currently sets auto_commit=True in the broker.subscriber call which, combined
with swallowing exceptions in route_event, can commit Kafka offsets before Redis
publish succeeds; change to manual commit by switching the subscriber's commit
policy to AckPolicy.ACK (or the equivalent manual/manual-ack option) and ensure
route_event propagates exceptions instead of catching/silencing them so that you
only acknowledge/commit after a successful Redis publish; specifically, update
the broker.subscriber invocation (the decorator using *topics, group_id,
auto_commit, decoder=decode_avro) to use manual/ACK commit and modify
route_event to re-raise errors from the Redis publish path so the broker can
avoid committing on failure.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@backend/workers/dlq_processor.py`:
- Around line 180-197: The DLQ subscriber (decode_dlq_json and
handle_dlq_message) is being registered inside the lifespan block which
FastStream does not support; move the broker.subscriber registration out of the
lifespan context by placing the decode_dlq_json function and the decorated
handle_dlq_message subscriber at module level (retaining group_id,
auto_commit=False, decoder=decode_dlq_json) so it is registered before
FastStream(broker) is created, or alternatively register them in an
`@app.after_startup` hook or use broker.setup_subscriber(...) followed by await
subscriber.start() at runtime; ensure DLQManager injection
(FromDishka[DLQManager]) and the DLQMessage type remain unchanged.
♻️ Duplicate comments (9)
backend/workers/run_result_processor.py (3)

118-133: MongoDB client leak on initialization failure.

If init_beanie or subsequent initialization steps fail, mongo_client created at line 119 won't be closed. Wrap in a nested try/finally or use AsyncExitStack.

🐛 Suggested fix
         # Initialize MongoDB + Beanie
         mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
             settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
         )
-        await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+        try:
+            await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+        except Exception:
+            await mongo_client.close()
+            raise
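The AsyncExitStack alternative could look like this (a sketch, assuming pymongo's async client where close() is awaitable, as used elsewhere in this thread):

from contextlib import AsyncExitStack, asynccontextmanager
from typing import AsyncIterator

from pymongo import AsyncMongoClient


@asynccontextmanager
async def mongo_client_lifespan(url: str) -> AsyncIterator[AsyncMongoClient]:
    async with AsyncExitStack() as stack:
        client: AsyncMongoClient = AsyncMongoClient(url, tz_aware=True, serverSelectionTimeoutMS=5000)
        # Register cleanup immediately after creation, so any later startup
        # failure (init_beanie, schema registry, ...) still closes the client.
        stack.push_async_callback(client.close)
        yield client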

136-141: Specify explicit AckPolicy with auto_commit=False.

With auto_commit=False, FastStream's default AckPolicy.ACK_FIRST won't commit offsets since it relies on Kafka's enable.auto.commit. Messages will be redelivered on consumer restart.

🐛 Suggested fix
+from faststream.kafka import AckPolicy
+
         # Create subscriber with Avro decoder
         subscriber = broker.subscriber(
             *topics,
             group_id=group_id,
             auto_commit=False,
             decoder=decode_avro,
+            ack_policy=AckPolicy.ACK,
         )
FastStream 0.6 Kafka AckPolicy.ACK behavior with auto_commit=False

165-168: Silent catch-all handler masks routing issues.

The handle_other handler silently discards unmatched events. If new event types are added or routing fails, this will mask the problem. Log a warning to aid debugging.

🔧 Suggested fix
         # Default handler for unmatched events (prevents message loss)
        @subscriber()
         async def handle_other(event: DomainEvent) -> None:
-            pass
+            app_logger.warning(
+                "Unhandled event type received",
+                extra={"event_type": type(event).__name__, "event": str(event)[:200]},
+            )
backend/workers/run_sse_bridge.py (2)

102-105: Avoid hardcoded topic in decode_avro.

Line 104 passes a literal "sse_bridge"; this is misleading if topics change or the schema layer starts using the topic value. Consider passing the actual subscribed topic or removing the parameter if unused.


108-121: Revisit auto-commit with handler failure semantics.

With auto_commit=True on Line 111, offsets may be committed even if Redis publish fails (depending on FastStream/Kafka client behavior). Prefer manual/ACK commit or ensure failures propagate so offsets aren’t committed on error. Please verify the correct FastStream 0.6.0 configuration.

FastStream 0.6.0 Kafka subscriber manual/ack commit configuration and auto_commit behavior
backend/workers/run_k8s_worker.py (1)

144-147: Silent catch-all handler discards unmatched events without logging.

This concern was raised in a previous review and remains unaddressed. The current implementation will silently consume any events that don't match the CREATE_POD_COMMAND or DELETE_POD_COMMAND filters, which can mask filter logic bugs or cause silent data loss when new event types are introduced.

♻️ Proposed fix: log unmatched events
         # Default handler for unmatched events (prevents message loss)
        @subscriber()
-        async def handle_other(event: DomainEvent) -> None:
-            pass
+        async def handle_other(
+            event: DomainEvent,
+            logger: FromDishka[logging.Logger],
+        ) -> None:
+            logger.warning(
+                "K8s worker received unhandled event type: %s (event_id=%s)",
+                type(event).__name__,
+                getattr(event, "event_id", "unknown"),
+            )
backend/workers/dlq_processor.py (1)

186-197: auto_commit=False without manual acknowledgment leaves offsets uncommitted.

The subscriber is configured with auto_commit=False, but the handler never explicitly acknowledges messages via ack(). This means Kafka offsets will not be committed, and messages will be reprocessed on consumer restart or rebalance.

Either enable auto_commit=True (default behavior) or inject the raw message and call ack() after successful processing:

🔧 Option 1: Enable auto-commit
        @broker.subscriber(
             *topics,
             group_id=group_id,
-            auto_commit=False,
+            auto_commit=True,
             decoder=decode_dlq_json,
         )
🔧 Option 2: Manual acknowledgment
        @broker.subscriber(
             *topics,
             group_id=group_id,
             auto_commit=False,
             decoder=decode_dlq_json,
         )
         async def handle_dlq_message(
             message: DLQMessage,
+            raw_message: StreamMessage[Any],
             dlq_manager: FromDishka[DLQManager],
         ) -> None:
             """Handle incoming DLQ messages - invoked by FastStream when message arrives."""
             await dlq_manager.process_message(message)
+            await raw_message.ack()
backend/workers/run_saga_orchestrator.py (2)

112-123: Ensure startup failures still close the Mongo client and DI container.

Line 112 creates AsyncMongoClient before the try/finally at Line 174, so exceptions from init_beanie, schema registry init, or producer init can skip cleanup. Wrap the full initialization inside the try/finally so resources are always released.

🛠️ Suggested fix (wrap initialization in the existing try/finally)
-        # Initialize MongoDB + Beanie
-        mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
-            settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
-        )
-        await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
-
-        # Resolve schema registry (initialization handled by provider)
-        schema_registry = await container.get(SchemaRegistryManager)
-
-        # Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
-        await container.get(UnifiedProducer)
-        app_logger.info("Kafka producer ready")
+        mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
+            settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
+        )
+        try:
+            # Initialize MongoDB + Beanie
+            await init_beanie(
+                database=mongo_client[settings.DATABASE_NAME],
+                document_models=ALL_DOCUMENTS,
+            )
+
+            # Resolve schema registry (initialization handled by provider)
+            schema_registry = await container.get(SchemaRegistryManager)
+
+            # Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
+            await container.get(UnifiedProducer)
+            app_logger.info("Kafka producer ready")
@@
-        if not trigger_topics:
-            app_logger.warning("No saga triggers configured, shutting down")
-            yield
-            await mongo_client.close()
-            await container.close()
-            return
+            if not trigger_topics:
+                app_logger.warning("No saga triggers configured, shutting down")
+                yield
+                return
@@
-        try:
-            yield
-        finally:
-            app_logger.info("SagaOrchestrator shutting down...")
-            await mongo_client.close()
-            await container.close()
-            app_logger.info("SagaOrchestrator shutdown complete")
+            yield
+        finally:
+            app_logger.info("SagaOrchestrator shutting down...")
+            await mongo_client.close()
+            await container.close()
+            app_logger.info("SagaOrchestrator shutdown complete")

Also applies to: 174-180


146-150: Replace auto_commit with explicit AckPolicy for FastStream 0.6.

Line 146–150 uses auto_commit=False. In FastStream 0.6, offset behavior is governed by AckPolicy (and auto_commit may be ignored), which can lead to incorrect or stalled offset advancement. Please verify against your FastStream version and switch to an explicit ack_policy (or MANUAL with explicit message.ack()).

🛠️ Suggested fix (verify AckPolicy import path for your FastStream version)
-from faststream.kafka import KafkaBroker
+from faststream.kafka import KafkaBroker, AckPolicy
@@
        @broker.subscriber(
             *topics,
             group_id=group_id,
-            auto_commit=False,
+            ack_policy=AckPolicy.ACK,
             decoder=decode_avro,
         )
FastStream 0.6 Kafka subscriber auto_commit AckPolicy
🧹 Nitpick comments (3)
backend/workers/run_result_processor.py (1)

144-163: Consider exposing handler methods as public API.

The handlers call underscore-prefixed methods (_handle_completed, _handle_failed, _handle_timeout) on ProcessorLogic. Since FastStream handlers now directly invoke these, consider renaming them to public methods (e.g., handle_completed) in ProcessorLogic to clarify intent.

backend/workers/dlq_processor.py (2)

47-48: Unused logger parameter.

The logger parameter is declared but never used within _configure_retry_policies. Consider removing it or adding debug logging for policy configuration.

♻️ Suggested fix
-def _configure_retry_policies(manager: DLQManager, logger: logging.Logger) -> None:
+def _configure_retry_policies(manager: DLQManager) -> None:
     """Configure topic-specific retry policies."""

91-92: Unused logger parameter.

Same as _configure_retry_policies, the logger parameter is unused here as well.

♻️ Suggested fix
-def _configure_filters(manager: DLQManager, testing: bool, logger: logging.Logger) -> None:
+def _configure_filters(manager: DLQManager, testing: bool) -> None:
     """Configure message filters."""

Comment on lines 180 to 197
        # Decoder: JSON message → typed DLQMessage
        def decode_dlq_json(msg: StreamMessage[Any]) -> DLQMessage:
            data = json.loads(msg.body)
            return DLQMessage.model_validate(data)

        # Register subscriber for DLQ messages
        @broker.subscriber(
            *topics,
            group_id=group_id,
            auto_commit=False,
            decoder=decode_dlq_json,
        )
        async def handle_dlq_message(
            message: DLQMessage,
            dlq_manager: FromDishka[DLQManager],
        ) -> None:
            """Handle incoming DLQ messages - invoked by FastStream when message arrives."""
            await dlq_manager.process_message(message)


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

FastStream KafkaBroker subscriber registration inside lifespan context manager

💡 Result:

Short answer

  • Don't register/start subscribers inside the lifespan() context because lifespan executes before brokers start and after they stop. Use app.after_startup (or register subscribers before creating/running the app) so KafkaBroker will actually start their consumers. [1][2]

Why

  • FastStream's lifespan runs before the broker starts and after it stops, so subscribers registered there won't be able to start consuming. The docs explicitly warn: "lifespan starts BEFORE your broker started ... so you can't publish any messages here." [1]
  • KafkaBroker subscriber handlers must be registered while the broker is able to start them (either at import-time/app-creation-time or in after_startup hooks). [2]

Minimal example

  • Register at app creation (preferred):

    from faststream import FastStream
    from faststream.kafka import KafkaBroker

    broker = KafkaBroker("localhost:9092")

    @broker.subscriber("in-topic")
    async def handler(msg):
    print(msg)

    app = FastStream(broker)

  • Or register in after_startup if you must register dynamically at runtime:

    app = FastStream(broker)

    @app.after_startup
    async def on_start():
    @broker.subscriber("in-topic")
    async def handler(msg):
    print(msg)

Notes

  • FastStream recently fixed dynamic subscriber registration issues; if you hit problems make sure you're on a recent release (see release notes). [3]

Sources

  • FastStream Lifespan / context manager docs. [1]
  • FastStream KafkaBroker API / subscriber docs. [2]
  • FastStream release notes (dynamic subscriber registration / KafkaBroker updates). [3]

🌐 Web query:

FastStream Lifespan context manager broker subscriber registration timing

💡 Result:

Short answer: register subscribers only after the broker is running — not inside a lifespan() context manager. FastStream's lifespan() (context manager) runs before the broker is started and after it is stopped, so you cannot publish or rely on a running broker there. Use @app.after_startup (or broker.setup_subscriber / start for dynamic runtime registration) to register/start subscribers while the broker is active. [1][2][3]

Sources:

  • Lifespan context manager runs before broker start / after broker shutdown. [1]
  • Hook call order and semantics: on_startup runs before broker start; after_startup runs after broker is launched (use after_startup for actions requiring a running broker). [2]
  • Release notes: examples for registering/starting subscribers at runtime with an already-started broker. [3]

🌐 Web query:

FastStream best practices subscriber registration when to register subscribers

💡 Result:

Short answer

  • Prefer registering subscribers at application startup (before starting the broker/worker) so subscriptions are set up deterministically and you don't miss messages. [1]
  • Only register at runtime when you need dynamic/late registration (feature supported), and then call the broker APIs to set up and start the subscriber explicitly. Expect extra complexity (lifecycle, race conditions, and potential missed messages while wiring). [2][3]

Best-practice checklist

  • Register handlers during app initialization (module import or app startup) and call broker.setup/start afterwards so the broker knows all endpoints before message flow begins. This avoids race conditions and simplifies graceful shutdown/restart. [1]
  • If you must register at runtime:
    • Create the Subscriber object, call broker.setup_subscriber(subscriber), then await subscriber.start() (and ensure you later await subscriber.close()). Use this for truly dynamic handlers only. [2]
    • Protect against missing messages during the gap between creating the subscriber and starting it (buffering, backpressure, or feature-specific guarantees depend on the broker). [1][2]
  • Keep registration idempotent and observable (log subscriber start/stop), and include health checks so orchestration can confirm handlers are active before traffic is allowed. [1]
  • Prefer declarative/static registration for simpler deployments, testing, and observability; use dynamic registration for plugin systems or admin-driven handler changes. [1][2]

References

  • FastStream subscription docs (channel/subscriber usage). [1]
  • FastStream release notes: runtime registration support and API example. [2]
  • FastStream subscriber API / proto reference. [3]



A subscriber registered inside the lifespan will not work as intended.

Registering subscribers inside the lifespan context manager is not supported by FastStream. The lifespan runs before the broker starts and after it stops, so subscribers registered there won't actually start consuming messages. FastStream's documentation explicitly warns against this pattern.

Register the subscriber either:

  1. At module level (preferred) before FastStream(broker) is created
  2. In an @app.after_startup hook if dynamic registration is needed
  3. Using broker.setup_subscriber() followed by await subscriber.start() for runtime registration

This code will silently fail to consume DLQ messages.
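
For illustration, a minimal sketch of option 1 under stated assumptions: the topic and group names are placeholders, DLQMessage and DLQManager are the project types from the diff above, and the StreamMessage and Dishka import paths are assumed for FastStream 0.6 with the dishka integration installed.

import json
from typing import Any

from dishka.integrations.faststream import FromDishka  # assumed integration path
from faststream import FastStream
from faststream.kafka import KafkaBroker
from faststream.message import StreamMessage  # assumed 0.6 location

from app.dlq import DLQManager, DLQMessage  # hypothetical project imports

broker = KafkaBroker("localhost:9092")

def decode_dlq_json(msg: StreamMessage[Any]) -> DLQMessage:
    # Same decoder as in the diff above: raw JSON body → typed DLQMessage
    return DLQMessage.model_validate(json.loads(msg.body))

# Registered at module scope, before FastStream(broker) is constructed
@broker.subscriber(
    "dlq.messages",            # placeholder topic
    group_id="dlq-processor",  # placeholder group id
    auto_commit=False,
    decoder=decode_dlq_json,
)
async def handle_dlq_message(
    message: DLQMessage,
    dlq_manager: FromDishka[DLQManager],
) -> None:
    await dlq_manager.process_message(message)

app = FastStream(broker)  # the broker already knows the subscriber here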

🤖 Prompt for AI Agents
In `@backend/workers/dlq_processor.py` around lines 180 - 197, The DLQ subscriber
(decode_dlq_json and handle_dlq_message) is being registered inside the lifespan
block which FastStream does not support; move the broker.subscriber registration
out of the lifespan context by placing the decode_dlq_json function and the
decorated handle_dlq_message subscriber at module level (retaining group_id,
auto_commit=False, decoder=decode_dlq_json) so it is registered before
FastStream(broker) is created, or alternatively register them in an
`@app.after_startup` hook or use broker.setup_subscriber(...) followed by await
subscriber.start() at runtime; ensure DLQManager injection
(FromDishka[DLQManager]) and the DLQMessage type remain unchanged.


@cubic-dev-ai cubic-dev-ai bot left a comment


3 issues found across 17 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="docs/components/workers/index.md">

<violation number="1" location="docs/components/workers/index.md:61">
P3: The `REJECT_ON_ERROR` row says "At least once", but the note below states the message is committed and not redelivered. That behavior is at-most-once delivery, so the table is contradictory and misleading.</violation>
</file>

<file name="backend/workers/run_sse_bridge.py">

<violation number="1" location="backend/workers/run_sse_bridge.py:112">
P2: `AckPolicy.ACK_FIRST` acks before processing, so any handler error (e.g., Redis outage) will drop SSE events permanently. Use an ack policy that commits after successful processing (or retries on error) to avoid losing events.</violation>
</file>

<file name="docs/architecture/lifecycle.md">

<violation number="1" location="docs/architecture/lifecycle.md:11">
P3: This line still references the removed SSE Kafka→Redis bridge and EventBus. The current code uses SSEEventRouter and has no EventBus class, so the lifecycle list is now misleading. Remove the obsolete items from the list.</violation>
</file>


@broker.subscriber(
    *topics,
    group_id=group_id,
    ack_policy=AckPolicy.ACK_FIRST,  # SSE bridge is idempotent (Redis pubsub)

@cubic-dev-ai cubic-dev-ai bot Jan 22, 2026


P2: AckPolicy.ACK_FIRST acks before processing, so any handler error (e.g., Redis outage) will drop SSE events permanently. Use an ack policy that commits after successful processing (or retries on error) to avoid losing events.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/workers/run_sse_bridge.py, line 112:

<comment>`AckPolicy.ACK_FIRST` acks before processing, so any handler error (e.g., Redis outage) will drop SSE events permanently. Use an ack policy that commits after successful processing (or retries on error) to avoid losing events.</comment>

<file context>
@@ -108,7 +109,7 @@ async def decode_avro(msg: StreamMessage[Any]) -> DomainEvent:
             *topics,
             group_id=group_id,
-            auto_commit=True,  # SSE bridge is idempotent (Redis pubsub)
+            ack_policy=AckPolicy.ACK_FIRST,  # SSE bridge is idempotent (Redis pubsub)
             decoder=decode_avro,
         )
</file context>

| `ACK_FIRST` | Commit before processing | Already committed | At most once | High throughput, idempotent operations |
| `ACK` | Commit after processing | Commit anyway (no retry) | At least once | Reliable processing without retry needs |
| `NACK_ON_ERROR` | Commit after processing | Seek back for redelivery | At least once | Auto-retry on transient failures |
| `REJECT_ON_ERROR` | Commit after processing | Commit (discard message) | At least once | Permanent failures, no retry desired |

@cubic-dev-ai cubic-dev-ai bot Jan 22, 2026


P3: The REJECT_ON_ERROR row says "At least once", but the note below states the message is committed and not redelivered. That behavior is at-most-once delivery, so the table is contradictory and misleading.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/components/workers/index.md, line 61:

<comment>The `REJECT_ON_ERROR` row says "At least once", but the note below states the message is committed and not redelivered. That behavior is at-most-once delivery, so the table is contradictory and misleading.</comment>

<file context>
@@ -21,22 +19,97 @@ graph LR
+| `ACK_FIRST`         | Commit before processing  | Already committed           | At most once       | High throughput, idempotent operations      |
+| `ACK`               | Commit after processing   | Commit anyway (no retry)    | At least once      | Reliable processing without retry needs     |
+| `NACK_ON_ERROR`     | Commit after processing   | Seek back for redelivery    | At least once      | Auto-retry on transient failures            |
+| `REJECT_ON_ERROR`   | Commit after processing   | Commit (discard message)    | At least once      | Permanent failures, no retry desired        |
+| `MANUAL`            | User calls `msg.ack()`    | User calls `msg.nack()`     | User-controlled    | Complex conditional acknowledgement         |
+
</file context>
Suggested change
| `REJECT_ON_ERROR` | Commit after processing | Commit (discard message) | At least once | Permanent failures, no retry desired |
| `REJECT_ON_ERROR` | Commit after processing | Commit (discard message) | At most once | Permanent failures, no retry desired |

## What changed

-Services with long-running background work now implement the async context manager protocol. Coordinator, KubernetesWorker, PodMonitor, SSE Kafka→Redis bridge, EventStoreConsumer, ResultProcessor, DLQManager, EventBus, and the Kafka producer all expose `__aenter__`/`__aexit__` that call `start`/`stop`.
+Services with long-running background work now implement the async context manager protocol. KubernetesWorker, PodMonitor, SSE Kafka→Redis bridge, EventStoreConsumer, ResultProcessor, DLQManager, EventBus, and the Kafka producer all expose `__aenter__`/`__aexit__` that call `start`/`stop`.

@cubic-dev-ai cubic-dev-ai bot Jan 22, 2026


P3: This line still references the removed SSE Kafka→Redis bridge and EventBus. The current code uses SSEEventRouter and has no EventBus class, so the lifecycle list is now misleading. Remove the obsolete items from the list.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/architecture/lifecycle.md, line 11:

<comment>This line still references the removed SSE Kafka→Redis bridge and EventBus. The current code uses SSEEventRouter and has no EventBus class, so the lifecycle list is now misleading. Remove the obsolete items from the list.</comment>

<file context>
@@ -8,11 +8,11 @@ The pattern that actually fits Python and asyncio is the language's own RAII: as
 ## What changed
 
-Services with long-running background work now implement the async context manager protocol. Coordinator, KubernetesWorker, PodMonitor, SSE Kafka→Redis bridge, EventStoreConsumer, ResultProcessor, DLQManager, EventBus, and the Kafka producer all expose `__aenter__`/`__aexit__` that call `start`/`stop`.
+Services with long-running background work now implement the async context manager protocol. KubernetesWorker, PodMonitor, SSE Kafka→Redis bridge, EventStoreConsumer, ResultProcessor, DLQManager, EventBus, and the Kafka producer all expose `__aenter__`/`__aexit__` that call `start`/`stop`.
 
 DI providers return unstarted instances for these services. The FastAPI lifespan acquires them and uses an `AsyncExitStack` to start/stop them in a single place. That removed scattered start/stop logic from providers and made shutdown order explicit.
</file context>
Suggested change
Services with long-running background work now implement the async context manager protocol. KubernetesWorker, PodMonitor, SSE Kafka→Redis bridge, EventStoreConsumer, ResultProcessor, DLQManager, EventBus, and the Kafka producer all expose `__aenter__`/`__aexit__` that call `start`/`stop`.
Services with long-running background work now implement the async context manager protocol. KubernetesWorker, PodMonitor, EventStoreConsumer, ResultProcessor, DLQManager, and the Kafka producer all expose `__aenter__`/`__aexit__` that call `start`/`stop`.
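
For reference, a minimal sketch of the AsyncExitStack pattern the surrounding prose describes; the service instances are placeholders for the DI-provided ones, not the project's actual wiring.

from collections.abc import AsyncIterator
from contextlib import AsyncExitStack, asynccontextmanager

from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncIterator[None]:
    async with AsyncExitStack() as stack:
        # enter_async_context calls __aenter__, i.e. start(); acquisition
        # order here determines the (reversed) shutdown order
        for service in (kubernetes_worker, pod_monitor, result_processor):  # placeholders
            await stack.enter_async_context(service)
        yield
    # leaving the block unwinds the stack, calling each __aexit__ → stop()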


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
docs/architecture/kafka-topic-architecture.md (2)

110-116: Verify the Replay diagram against the new messaging stack.

The replay flow still shows UnifiedProducer. The PR summary keeps UnifiedProducer but removes the legacy consumer plumbing (UnifiedConsumer/EventDispatcher) and EventBus, so please confirm the diagram still matches the FastStream-based flow and update any references to the removed components.


148-155: Event schema section appears outdated (Pydantic AvroBase + unified plumbing).

This section references Pydantic AvroBase and events/core/{producer,consumer,dispatcher}.py, which conflicts with the stated migration to dataclasses-avroschema and FastStream-based producers/consumers. Please update the filenames/tech choices to match the new schema tooling and messaging stack.

🤖 Fix all issues with AI agents
In `@backend/workers/run_result_processor.py`:
- Around line 119-123: The MongoDB client created with AsyncMongoClient
(mongo_client) can leak on exceptions thrown during init_beanie() or later setup
before yield; wrap the initialization and the subsequent yield in a single
try/finally so that mongo_client.close()/await mongo_client.close() (use the
appropriate async close) is always called in the finally block; apply the same
pattern to the other startup block around lines 173-178 to guarantee cleanup
even if init_beanie(), schema registry setup, or any pre-yield logic raises.

In `@backend/workers/run_saga_orchestrator.py`:
- Around line 113-124: The startup code creates an AsyncMongoClient and then
calls init_beanie and container.get(...) but doesn't close mongo_client if one
of those calls raises, so wrap the initialization sequence that creates
mongo_client and resolves SchemaRegistryManager/UnifiedProducer in a
try/except/finally (or a try block that closes on exception) so that on any
exception you call mongo_client.close() (and optionally await
mongo_client.close_async if applicable) before re-raising; reference the
AsyncMongoClient instantiation, init_beanie(..., document_models=ALL_DOCUMENTS),
container.get(SchemaRegistryManager) and container.get(UnifiedProducer) to
locate where to add the try/finally and ensure cleanup happens prior to leaving
the startup routine.

In `@docs/architecture/lifecycle.md`:
- Line 11: Update the lifecycle list in lifecycle.md to reflect the new
surfaces: remove references to EventBus and "SSE Kafka→Redis bridge" and instead
list SSEEventRouter and SSEConnectionRegistry (and the DI/FastStream surface) as
the services that implement the async context manager protocol; keep existing
entries for KubernetesWorker, PodMonitor, EventStoreConsumer, ResultProcessor,
DLQManager, and the Kafka producer, and ensure the description says these
services expose __aenter__/__aexit__ that call start/stop.

In `@docs/architecture/services-overview.md`:
- Line 7: Update the legacy name "Saga Orchestrator" to the new component name
"SagaLogic" (or "SagaLogic/DI" where dependency-injection context is relevant)
in the sentence describing system components and any adjacent mentions in this
document; specifically replace occurrences of "Saga Orchestrator" with
"SagaLogic" (or "SagaLogic/DI" if the surrounding text refers to DI behavior) so
the documentation matches the PR's renamed component and avoids confusion.
- Line 73: Replace the outdated "Saga Orchestrator" label in the execution flow
string "User → API → Saga Orchestrator → K8s Worker → Pod → Pod Monitor → Result
Processor" with the current component name(s) — e.g., "SagaLogic/FastStream
workers" or "SagaLogic (FastStream workers)". Update the sentence so it reads
"User → API → SagaLogic/FastStream workers → K8s Worker → Pod → Pod Monitor →
Result Processor" (or your preferred phrasing) to ensure terminology matches the
current components.

In `@docs/components/workers/index.md`:
- Line 78: Update the table row for "SSE Bridge" with ACK_FIRST to explicitly
state the crash window where events can be lost: mention that ACK_FIRST may drop
events if a worker crashes after committing but before publishing to Redis, and
only acceptable when downstream replay or client refresh handles missed events;
update the cell text referencing "SSE Bridge" and "ACK_FIRST" to include this
clarification and avoid implying zero data loss.
♻️ Duplicate comments (3)
backend/workers/run_k8s_worker.py (1)

145-148: Log or surface unmatched events instead of silently discarding.

handle_other is still a no-op, so unrecognized events are silently dropped. At minimum, consider warning-level logging to surface routing/config issues.

🔧 Minimal adjustment
         # Default handler for unmatched events (prevents message loss)
         @subscriber()
         async def handle_other(event: DomainEvent) -> None:
-            pass
+            app_logger.warning("Unhandled k8s_worker event_type: %s", event.event_type)
backend/workers/run_result_processor.py (1)

166-169: Log or surface unmatched events instead of silently discarding.

handle_other still drops unknown events without any logging, which can mask routing or schema issues.

🔧 Minimal adjustment
         # Default handler for unmatched events (prevents message loss)
         @subscriber()
         async def handle_other(event: DomainEvent) -> None:
-            pass
+            app_logger.warning("Unhandled result_processor event_type: %s", event.event_type)
backend/workers/dlq_processor.py (1)

181-199: Move subscriber registration out of lifespan.
FastStream’s lifespan runs before the broker starts and after it stops; registering @broker.subscriber inside it can leave the consumer inactive. Define the decoder/subscriber before creating FastStream(broker) (or register in @app.after_startup with explicit setup/start).

🔧 Suggested adjustment (register before lifespan)
     broker = KafkaBroker(
         settings.KAFKA_BOOTSTRAP_SERVERS,
         request_timeout_ms=settings.KAFKA_REQUEST_TIMEOUT_MS,
     )

+    # Decoder: JSON message → typed DLQMessage
+    def decode_dlq_json(msg: StreamMessage[Any]) -> DLQMessage:
+        data = json.loads(msg.body)
+        return DLQMessage.model_validate(data)
+
+    @broker.subscriber(
+        *topics,
+        group_id=group_id,
+        ack_policy=AckPolicy.ACK,
+        decoder=decode_dlq_json,
+    )
+    async def handle_dlq_message(
+        message: DLQMessage,
+        dlq_manager: FromDishka[DLQManager],
+    ) -> None:
+        await dlq_manager.process_message(message)
+
     @asynccontextmanager
     async def lifespan() -> AsyncIterator[None]:
         """Initialize infrastructure and start scheduled retry timer."""
         app_logger = await container.get(logging.Logger)
         app_logger.info("DLQ Processor starting...")
@@
-        # Decoder: JSON message → typed DLQMessage
-        def decode_dlq_json(msg: StreamMessage[Any]) -> DLQMessage:
-            data = json.loads(msg.body)
-            return DLQMessage.model_validate(data)
-
-        # Register subscriber for DLQ messages
-        @broker.subscriber(
-            *topics,
-            group_id=group_id,
-            ack_policy=AckPolicy.ACK,
-            decoder=decode_dlq_json,
-        )
-        async def handle_dlq_message(
-            message: DLQMessage,
-            dlq_manager: FromDishka[DLQManager],
-        ) -> None:
-            """Handle incoming DLQ messages - invoked by FastStream when message arrives."""
-            await dlq_manager.process_message(message)

Comment on lines +119 to +123
# Initialize MongoDB + Beanie
mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
    settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
)
await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)


⚠️ Potential issue | 🟡 Minor

Ensure Mongo client is closed on startup failures.

As in saga orchestrator, exceptions during init_beanie() or schema registry setup before yield will bypass the finally block. Wrap init + yield in a single try/finally to guarantee cleanup.
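
Distilled, the fix below amounts to a single try/finally that owns the client from creation onward; make_client and initialize are hypothetical stand-ins for the real setup calls.

from collections.abc import AsyncIterator
from contextlib import asynccontextmanager

@asynccontextmanager
async def lifespan() -> AsyncIterator[None]:
    client = make_client()        # hypothetical factory; construction itself is cheap
    try:
        await initialize(client)  # a failure here still reaches the finally block
        yield                     # normal operation
    finally:
        await client.close()      # runs on init failure and on clean shutdown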

🔧 Proposed fix
-        # Initialize MongoDB + Beanie
-        mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
-            settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
-        )
-        await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+        # Initialize MongoDB + Beanie
+        mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
+            settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
+        )
 
-        # Resolve schema registry (initialization handled by provider)
-        schema_registry = await container.get(SchemaRegistryManager)
+        try:
+            await init_beanie(
+                database=mongo_client[settings.DATABASE_NAME],
+                document_models=ALL_DOCUMENTS,
+            )
 
-        # Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
-        await container.get(UnifiedProducer)
-        app_logger.info("Kafka producer ready")
+            # Resolve schema registry (initialization handled by provider)
+            schema_registry = await container.get(SchemaRegistryManager)
+
+            # Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
+            await container.get(UnifiedProducer)
+            app_logger.info("Kafka producer ready")
 
-        # Decoder: Avro message → typed DomainEvent
-        async def decode_avro(msg: StreamMessage[Any]) -> DomainEvent:
-            return await schema_registry.deserialize_event(msg.body, "result_processor")
+            # Decoder: Avro message → typed DomainEvent
+            async def decode_avro(msg: StreamMessage[Any]) -> DomainEvent:
+                return await schema_registry.deserialize_event(msg.body, "result_processor")
 
-        # Create subscriber with Avro decoder
-        subscriber = broker.subscriber(
-            *topics,
-            group_id=group_id,
-            ack_policy=AckPolicy.ACK,
-            decoder=decode_avro,
-        )
+            # Create subscriber with Avro decoder
+            subscriber = broker.subscriber(
+                *topics,
+                group_id=group_id,
+                ack_policy=AckPolicy.ACK,
+                decoder=decode_avro,
+            )
 
-        # Route by event_type header (producer sets this, Kafka stores as bytes)
-        @subscriber(filter=lambda msg: msg.headers.get("event_type") == EventType.EXECUTION_COMPLETED.encode())
-        async def handle_completed(
-            event: ExecutionCompletedEvent,
-            logic: FromDishka[ProcessorLogic],
-        ) -> None:
-            await logic._handle_completed(event)
+            # Route by event_type header (producer sets this, Kafka stores as bytes)
+            @subscriber(filter=lambda msg: msg.headers.get("event_type") == EventType.EXECUTION_COMPLETED.encode())
+            async def handle_completed(
+                event: ExecutionCompletedEvent,
+                logic: FromDishka[ProcessorLogic],
+            ) -> None:
+                await logic._handle_completed(event)
 
-        @subscriber(filter=lambda msg: msg.headers.get("event_type") == EventType.EXECUTION_FAILED.encode())
-        async def handle_failed(
-            event: ExecutionFailedEvent,
-            logic: FromDishka[ProcessorLogic],
-        ) -> None:
-            await logic._handle_failed(event)
+            @subscriber(filter=lambda msg: msg.headers.get("event_type") == EventType.EXECUTION_FAILED.encode())
+            async def handle_failed(
+                event: ExecutionFailedEvent,
+                logic: FromDishka[ProcessorLogic],
+            ) -> None:
+                await logic._handle_failed(event)
 
-        @subscriber(filter=lambda msg: msg.headers.get("event_type") == EventType.EXECUTION_TIMEOUT.encode())
-        async def handle_timeout(
-            event: ExecutionTimeoutEvent,
-            logic: FromDishka[ProcessorLogic],
-        ) -> None:
-            await logic._handle_timeout(event)
+            @subscriber(filter=lambda msg: msg.headers.get("event_type") == EventType.EXECUTION_TIMEOUT.encode())
+            async def handle_timeout(
+                event: ExecutionTimeoutEvent,
+                logic: FromDishka[ProcessorLogic],
+            ) -> None:
+                await logic._handle_timeout(event)
 
-        # Default handler for unmatched events (prevents message loss)
-        @subscriber()
-        async def handle_other(event: DomainEvent) -> None:
-            pass
+            # Default handler for unmatched events (prevents message loss)
+            @subscriber()
+            async def handle_other(event: DomainEvent) -> None:
+                pass
 
-        app_logger.info("Infrastructure initialized, starting event processing...")
-
-        try:
-            yield
-        finally:
-            app_logger.info("ResultProcessor shutting down...")
-            await mongo_client.close()
-            await container.close()
-            app_logger.info("ResultProcessor shutdown complete")
+            app_logger.info("Infrastructure initialized, starting event processing...")
+            yield
+        finally:
+            app_logger.info("ResultProcessor shutting down...")
+            await mongo_client.close()
+            await container.close()
+            app_logger.info("ResultProcessor shutdown complete")

Also applies to: 173-178

🤖 Prompt for AI Agents
In `@backend/workers/run_result_processor.py` around lines 119 - 123, The MongoDB
client created with AsyncMongoClient (mongo_client) can leak on exceptions
thrown during init_beanie() or later setup before yield; wrap the initialization
and the subsequent yield in a single try/finally so that
mongo_client.close()/await mongo_client.close() (use the appropriate async
close) is always called in the finally block; apply the same pattern to the
other startup block around lines 173-178 to guarantee cleanup even if
init_beanie(), schema registry setup, or any pre-yield logic raises.

Comment on lines +113 to +124
# Initialize MongoDB + Beanie
mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
    settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
)
await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)

# Resolve schema registry (initialization handled by provider)
schema_registry = await container.get(SchemaRegistryManager)

# Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
await container.get(UnifiedProducer)
app_logger.info("Kafka producer ready")


⚠️ Potential issue | 🟡 Minor

Ensure Mongo client is closed on startup failures.

If init_beanie() or schema registry init throws before yield, the finally block won’t run, leaving AsyncMongoClient (and possibly the container) unclosed. Consider wrapping the initialization in the same try/finally that handles shutdown.

🔧 Proposed fix
-        # Initialize MongoDB + Beanie
-        mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
-            settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
-        )
-        await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+        # Initialize MongoDB + Beanie
+        mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
+            settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
+        )
 
-        # Resolve schema registry (initialization handled by provider)
-        schema_registry = await container.get(SchemaRegistryManager)
+        try:
+            await init_beanie(
+                database=mongo_client[settings.DATABASE_NAME],
+                document_models=ALL_DOCUMENTS,
+            )
 
-        # Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
-        await container.get(UnifiedProducer)
-        app_logger.info("Kafka producer ready")
+            # Resolve schema registry (initialization handled by provider)
+            schema_registry = await container.get(SchemaRegistryManager)
+
+            # Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
+            await container.get(UnifiedProducer)
+            app_logger.info("Kafka producer ready")
 
-        # Get saga logic to determine topics
-        logic = await container.get(SagaLogic)
-        trigger_topics = logic.get_trigger_topics()
+            # Get saga logic to determine topics
+            logic = await container.get(SagaLogic)
+            trigger_topics = logic.get_trigger_topics()
 
-        if not trigger_topics:
-            app_logger.warning("No saga triggers configured, shutting down")
-            yield
-            await mongo_client.close()
-            await container.close()
-            return
+            if not trigger_topics:
+                app_logger.warning("No saga triggers configured, shutting down")
+                yield
+                return
 
-        # Build topic list with prefix
-        topics = [f"{settings.KAFKA_TOPIC_PREFIX}{t}" for t in trigger_topics]
-        group_id = f"{GroupId.SAGA_ORCHESTRATOR}.{settings.KAFKA_GROUP_SUFFIX}"
+            # Build topic list with prefix
+            topics = [f"{settings.KAFKA_TOPIC_PREFIX}{t}" for t in trigger_topics]
+            group_id = f"{GroupId.SAGA_ORCHESTRATOR}.{settings.KAFKA_GROUP_SUFFIX}"
 
-        # Decoder: Avro message → typed DomainEvent
-        async def decode_avro(msg: StreamMessage[Any]) -> DomainEvent:
-            return await schema_registry.deserialize_event(msg.body, "saga_orchestrator")
+            # Decoder: Avro message → typed DomainEvent
+            async def decode_avro(msg: StreamMessage[Any]) -> DomainEvent:
+                return await schema_registry.deserialize_event(msg.body, "saga_orchestrator")
 
-        # Register handler dynamically after determining topics
-        # Saga orchestrator uses single handler - routing is internal to SagaLogic
-        @broker.subscriber(
-            *topics,
-            group_id=group_id,
-            ack_policy=AckPolicy.ACK,
-            decoder=decode_avro,
-        )
-        async def handle_saga_event(
-            event: DomainEvent,
-            saga_logic: FromDishka[SagaLogic],
-        ) -> None:
-            """
-            Handle saga trigger events.
-
-            Dependencies are automatically injected via Dishka.
-            Routing is handled internally by SagaLogic based on saga configuration.
-            """
-            # Handle the event through saga logic (internal routing)
-            await saga_logic.handle_event(event)
-
-            # Opportunistic timeout check (replaces background loop)
-            now = time.monotonic()
-            if now - timeout_check_state["last_check"] >= timeout_check_state["interval"]:
-                await saga_logic.check_timeouts_once()
-                timeout_check_state["last_check"] = now
-
-        app_logger.info(f"Subscribing to topics: {topics}")
-        app_logger.info("Infrastructure initialized, starting event processing...")
-
-        try:
-            yield
-        finally:
-            app_logger.info("SagaOrchestrator shutting down...")
-            await mongo_client.close()
-            await container.close()
-            app_logger.info("SagaOrchestrator shutdown complete")
+            # Register handler dynamically after determining topics
+            # Saga orchestrator uses single handler - routing is internal to SagaLogic
+            @broker.subscriber(
+                *topics,
+                group_id=group_id,
+                ack_policy=AckPolicy.ACK,
+                decoder=decode_avro,
+            )
+            async def handle_saga_event(
+                event: DomainEvent,
+                saga_logic: FromDishka[SagaLogic],
+            ) -> None:
+                """
+                Handle saga trigger events.
+
+                Dependencies are automatically injected via Dishka.
+                Routing is handled internally by SagaLogic based on saga configuration.
+                """
+                # Handle the event through saga logic (internal routing)
+                await saga_logic.handle_event(event)
+
+                # Opportunistic timeout check (replaces background loop)
+                now = time.monotonic()
+                if now - timeout_check_state["last_check"] >= timeout_check_state["interval"]:
+                    await saga_logic.check_timeouts_once()
+                    timeout_check_state["last_check"] = now
+
+            app_logger.info(f"Subscribing to topics: {topics}")
+            app_logger.info("Infrastructure initialized, starting event processing...")
+            yield
+        finally:
+            app_logger.info("SagaOrchestrator shutting down...")
+            await mongo_client.close()
+            await container.close()
+            app_logger.info("SagaOrchestrator shutdown complete")

Also applies to: 175-180

🤖 Prompt for AI Agents
In `@backend/workers/run_saga_orchestrator.py` around lines 113 - 124, The startup
code creates an AsyncMongoClient and then calls init_beanie and
container.get(...) but doesn't close mongo_client if one of those calls raises,
so wrap the initialization sequence that creates mongo_client and resolves
SchemaRegistryManager/UnifiedProducer in a try/except/finally (or a try block
that closes on exception) so that on any exception you call mongo_client.close()
(and optionally await mongo_client.close_async if applicable) before re-raising;
reference the AsyncMongoClient instantiation, init_beanie(...,
document_models=ALL_DOCUMENTS), container.get(SchemaRegistryManager) and
container.get(UnifiedProducer) to locate where to add the try/finally and ensure
cleanup happens prior to leaving the startup routine.

## What changed

-Services with long-running background work now implement the async context manager protocol. Coordinator, KubernetesWorker, PodMonitor, SSE Kafka→Redis bridge, EventStoreConsumer, ResultProcessor, DLQManager, EventBus, and the Kafka producer all expose `__aenter__`/`__aexit__` that call `start`/`stop`.
+Services with long-running background work now implement the async context manager protocol. KubernetesWorker, PodMonitor, SSE Kafka→Redis bridge, EventStoreConsumer, ResultProcessor, DLQManager, EventBus, and the Kafka producer all expose `__aenter__`/`__aexit__` that call `start`/`stop`.


⚠️ Potential issue | 🟡 Minor

Update lifecycle list to match new surfaces.

This line still references EventBus and the “SSE Kafka→Redis bridge,” which the PR objectives say were removed/replaced by SSEEventRouter/SSEConnectionRegistry. Please align the lifecycle list with the new DI/FastStream surface to avoid misleading docs.

🤖 Prompt for AI Agents
In `@docs/architecture/lifecycle.md` at line 11, Update the lifecycle list in
lifecycle.md to reflect the new surfaces: remove references to EventBus and "SSE
Kafka→Redis bridge" and instead list SSEEventRouter and SSEConnectionRegistry
(and the DI/FastStream surface) as the services that implement the async context
manager protocol; keep existing entries for KubernetesWorker, PodMonitor,
EventStoreConsumer, ResultProcessor, DLQManager, and the Kafka producer, and
ensure the description says these services expose __aenter__/__aexit__ that call
start/stop.

## High-level architecture

-The API (FastAPI) receives user requests for auth, execute, events, scripts, and settings. The Coordinator accepts validated execution requests and enqueues them to Kafka with metadata and idempotency guards. The Saga Orchestrator drives stateful execution via events and publishes commands to the K8s Worker. The K8s Worker builds and creates per-execution pods and supporting ConfigMaps with network isolation enforced at cluster level via Cilium policy. Pod Monitor watches K8s and translates pod phases and logs into domain events. Result Processor consumes completion/failure/timeout events, updates DB, and cleans resources. SSE Router fans execution events out to connected clients. DLQ Processor and Event Replay support reliability and investigations.
+The API (FastAPI) receives user requests for auth, execute, events, scripts, and settings. The Saga Orchestrator drives stateful execution via events and publishes commands to the K8s Worker. The K8s Worker builds and creates per-execution pods and supporting ConfigMaps with network isolation enforced at cluster level via Cilium policy. Pod Monitor watches K8s and translates pod phases and logs into domain events. Result Processor consumes completion/failure/timeout events, updates DB, and cleans resources. SSE Router fans execution events out to connected clients. DLQ Processor and Event Replay support reliability and investigations.


⚠️ Potential issue | 🟡 Minor

Align saga naming with the new SagaLogic/DI terminology.
The PR objectives say SagaLogic replaces SagaOrchestrator, but this line still uses “Saga Orchestrator.” Consider renaming here (and in adjacent sections) to match the new component name and avoid confusion.

🔧 Suggested wording tweak
-The Saga Orchestrator drives stateful execution via events and publishes commands to the K8s Worker.
+The Saga Logic (FastStream subscriber) drives stateful execution via events and publishes commands to the K8s Worker.
📝 Committable suggestion


Suggested change
The API (FastAPI) receives user requests for auth, execute, events, scripts, and settings. The Saga Orchestrator drives stateful execution via events and publishes commands to the K8s Worker. The K8s Worker builds and creates per-execution pods and supporting ConfigMaps with network isolation enforced at cluster level via Cilium policy. Pod Monitor watches K8s and translates pod phases and logs into domain events. Result Processor consumes completion/failure/timeout events, updates DB, and cleans resources. SSE Router fans execution events out to connected clients. DLQ Processor and Event Replay support reliability and investigations.
The API (FastAPI) receives user requests for auth, execute, events, scripts, and settings. The Saga Logic (FastStream subscriber) drives stateful execution via events and publishes commands to the K8s Worker. The K8s Worker builds and creates per-execution pods and supporting ConfigMaps with network isolation enforced at cluster level via Cilium policy. Pod Monitor watches K8s and translates pod phases and logs into domain events. Result Processor consumes completion/failure/timeout events, updates DB, and cleans resources. SSE Router fans execution events out to connected clients. DLQ Processor and Event Replay support reliability and investigations.
🤖 Prompt for AI Agents
In `@docs/architecture/services-overview.md` at line 7, Update the legacy name
"Saga Orchestrator" to the new component name "SagaLogic" (or "SagaLogic/DI"
where dependency-injection context is relevant) in the sentence describing
system components and any adjacent mentions in this document; specifically
replace occurrences of "Saga Orchestrator" with "SagaLogic" (or "SagaLogic/DI"
if the surrounding text refers to DI behavior) so the documentation matches the
PR's renamed component and avoids confusion.

## Common flows

-The main execution flow goes: User → API → Coordinator → Saga Orchestrator → K8s Worker → Pod → Pod Monitor → Result Processor. See [Lifecycle](lifecycle.md) for the full execution state machine.
+The main execution flow goes: User → API → Saga Orchestrator → K8s Worker → Pod → Pod Monitor → Result Processor. See [Lifecycle](lifecycle.md) for the full execution state machine.


⚠️ Potential issue | 🟡 Minor

Update the execution flow label to match current component names.
If “Saga Orchestrator” has been replaced by SagaLogic/FastStream workers, the flow line should reflect that to avoid conflicting terminology.

🔧 Suggested wording tweak
-The main execution flow goes: User → API → Saga Orchestrator → K8s Worker → Pod → Pod Monitor → Result Processor.
+The main execution flow goes: User → API → Saga Logic (FastStream) → K8s Worker → Pod → Pod Monitor → Result Processor.
📝 Committable suggestion


Suggested change
The main execution flow goes: User → API → Saga Orchestrator → K8s Worker → Pod → Pod Monitor → Result Processor. See [Lifecycle](lifecycle.md) for the full execution state machine.
The main execution flow goes: User → API → Saga Logic (FastStream) → K8s Worker → Pod → Pod Monitor → Result Processor. See [Lifecycle](lifecycle.md) for the full execution state machine.
🤖 Prompt for AI Agents
In `@docs/architecture/services-overview.md` at line 73, Replace the outdated
"Saga Orchestrator" label in the execution flow string "User → API → Saga
Orchestrator → K8s Worker → Pod → Pod Monitor → Result Processor" with the
current component name(s) — e.g., "SagaLogic/FastStream workers" or "SagaLogic
(FastStream workers)". Update the sentence so it reads "User → API →
SagaLogic/FastStream workers → K8s Worker → Pod → Pod Monitor → Result
Processor" (or your preferred phrasing) to ensure terminology matches the
current components.

| Saga Orchestrator | `ACK` | Saga state is persisted to MongoDB; duplicate processing is idempotent |
| K8s Worker | `ACK` | Pod creation is idempotent (same execution ID); uses idempotency middleware |
| Result Processor | `ACK` | Result persistence is idempotent; duplicate writes are safe |
| SSE Bridge | `ACK_FIRST` | Redis pubsub is fire-and-forget; missing an event causes a client retry, not data loss |


⚠️ Potential issue | 🟡 Minor

Clarify ACK_FIRST rationale for SSE Bridge to avoid implying no data loss.
ACK_FIRST can drop events if the worker crashes after commit but before Redis publish. If loss is acceptable due to downstream replay or client refresh, state that explicitly.

💬 Suggested wording tweak
-| SSE Bridge          | `ACK_FIRST`  | Redis pubsub is fire-and-forget; missing an event causes a client retry, not data loss |
+| SSE Bridge          | `ACK_FIRST`  | Redis pubsub is fire-and-forget; ACK_FIRST may drop events on crash, which is acceptable only if clients can refresh/replay from another source |
📝 Committable suggestion


Suggested change
| SSE Bridge | `ACK_FIRST` | Redis pubsub is fire-and-forget; missing an event causes a client retry, not data loss |
| SSE Bridge | `ACK_FIRST` | Redis pubsub is fire-and-forget; ACK_FIRST may drop events on crash, which is acceptable only if clients can refresh/replay from another source |
🤖 Prompt for AI Agents
In `@docs/components/workers/index.md` at line 78, Update the table row for "SSE
Bridge" with ACK_FIRST to explicitly state the crash window where events can be
lost: mention that ACK_FIRST may drop events if a worker crashes after
committing but before publishing to Redis, and only acceptable when downstream
replay or client refresh handles missed events; update the cell text referencing
"SSE Bridge" and "ACK_FIRST" to include this clarification and avoid implying
zero data loss.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@backend/workers/dlq_processor.py`:
- Around line 161-178: The initialization sequence can leak the AsyncMongoClient
if init_beanie() or any subsequent container.get calls raise before the outer
finally runs; wrap the MongoDB setup so mongo_client is closed on error by
adding a nested try/finally (or try/except + close) immediately after creating
mongo_client and before awaiting init_beanie(), ensuring mongo_client.close() is
called on any exception; update the block around AsyncMongoClient creation,
init_beanie(database=...), and the subsequent container.get(...) calls
(including SchemaRegistryManager, UnifiedProducer, and DLQManager/_configure_*
calls) to guarantee mongo_client is closed if initialization fails before
yielding.
♻️ Duplicate comments (7)
backend/workers/run_k8s_worker.py (1)

144-147: Silent catch-all handler discards unmatched events without logging.

The handle_other handler silently ignores events that don't match the CREATE_POD_COMMAND or DELETE_POD_COMMAND filters. This can mask routing bugs or missing handlers for new event types. Consider logging at debug/warning level.

🔧 Suggested improvement
         # Default handler for unmatched events (prevents message loss)
         @subscriber()
-        async def handle_other(event: DomainEvent) -> None:
-            pass
+        async def handle_other(event: DomainEvent) -> None:
+            app_logger.debug(f"Ignoring unmatched event: {event.event_type}")
backend/workers/run_result_processor.py (2)

118-129: Ensure MongoDB client is closed on initialization failures.

Same issue as in dlq_processor.py - if init_beanie() or subsequent setup calls fail before yield, the mongo_client won't be closed.

🔧 Proposed fix pattern
         mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
             settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
         )
-        await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+        try:
+            await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+            # ... rest of initialization ...
+        except Exception:
+            await mongo_client.close()
+            raise

165-168: Silent catch-all handler discards unmatched events without logging.

Same pattern as other workers - unmatched events are silently dropped. Consider logging for debugging purposes.

backend/workers/run_saga_orchestrator.py (1)

112-123: Ensure MongoDB client is closed on initialization failures.

Same pattern as other workers - if init_beanie() or subsequent setup fails before yield, mongo_client won't be closed.

backend/workers/run_sse_bridge.py (2)

108-120: AckPolicy.ACK_FIRST may lose events on handler failures.

ACK_FIRST commits the offset before route_event runs. If Redis is temporarily unavailable, the event is lost since the offset is already committed. While the comment notes SSE is "idempotent," it's actually "lossy" - clients may miss events with no way to recover them from this worker.

If occasional event loss is acceptable (clients can refresh state), this is fine. Otherwise, consider AckPolicy.ACK to commit after successful Redis publish.


102-104: Hardcoded topic string "sse_bridge" in deserialize_event.

The topic parameter passed to deserialize_event is hardcoded as "sse_bridge" while the actual subscribed topics come from CONSUMER_GROUP_SUBSCRIPTIONS. While the parameter appears unused in the current implementation, this creates maintenance risk if it becomes used later.

docs/components/workers/index.md (1)

56-62: REJECT_ON_ERROR delivery guarantee is incorrect.

Line 61 states REJECT_ON_ERROR provides "At least once" delivery, but the behavior described (commit on error, discard message) is actually at most once - failed messages are never redelivered. The note below correctly explains this, but the table row is misleading.

-| `REJECT_ON_ERROR`   | Commit after processing   | Commit (discard message)    | At least once      | Permanent failures, no retry desired        |
+| `REJECT_ON_ERROR`   | Commit after processing   | Commit (discard message)    | At most once       | Permanent failures, no retry desired        |
🧹 Nitpick comments (1)
backend/workers/run_result_processor.py (1)

144-163: Consider exposing public handler methods in ProcessorLogic.

The handlers call private methods (_handle_completed, _handle_failed, _handle_timeout) directly. While functional, this breaks encapsulation. Consider adding public wrapper methods in ProcessorLogic for cleaner API boundaries.
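
One possible shape for such wrappers, assuming the private handlers keep their current signatures:

class ProcessorLogic:
    async def handle_completed(self, event: ExecutionCompletedEvent) -> None:
        await self._handle_completed(event)

    async def handle_failed(self, event: ExecutionFailedEvent) -> None:
        await self._handle_failed(event)

    async def handle_timeout(self, event: ExecutionTimeoutEvent) -> None:
        await self._handle_timeout(event)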

Comment on lines +161 to +178
# Initialize MongoDB + Beanie
mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
    settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
)
await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)

# Resolve schema registry (initialization handled by provider)
await container.get(SchemaRegistryManager)

# Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
await container.get(UnifiedProducer)
app_logger.info("Kafka producer ready")

# Get DLQ manager and configure policies
manager = await container.get(DLQManager)
_configure_retry_policies(manager, app_logger)
_configure_filters(manager, testing=settings.TESTING, logger=app_logger)
app_logger.info("DLQ Manager configured")


⚠️ Potential issue | 🟡 Minor

Ensure MongoDB client is closed on initialization failures.

If init_beanie() or subsequent setup calls throw before reaching yield, the finally block won't execute, leaving mongo_client unclosed. Wrap the initialization in a nested try/finally or handle exceptions explicitly.

🔧 Proposed fix
         # Initialize MongoDB + Beanie
         mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
             settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
         )
-        await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
-
-        # Resolve schema registry (initialization handled by provider)
-        await container.get(SchemaRegistryManager)
-
-        # Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
-        await container.get(UnifiedProducer)
-        app_logger.info("Kafka producer ready")
-
-        # Get DLQ manager and configure policies
-        manager = await container.get(DLQManager)
-        _configure_retry_policies(manager, app_logger)
-        _configure_filters(manager, testing=settings.TESTING, logger=app_logger)
-        app_logger.info("DLQ Manager configured")
+        try:
+            await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+
+            # Resolve schema registry (initialization handled by provider)
+            await container.get(SchemaRegistryManager)
+
+            # Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
+            await container.get(UnifiedProducer)
+            app_logger.info("Kafka producer ready")
+
+            # Get DLQ manager and configure policies
+            manager = await container.get(DLQManager)
+            _configure_retry_policies(manager, app_logger)
+            _configure_filters(manager, testing=settings.TESTING, logger=app_logger)
+            app_logger.info("DLQ Manager configured")
+        except Exception:
+            await mongo_client.close()
+            raise
🤖 Prompt for AI Agents
In `@backend/workers/dlq_processor.py` around lines 161 - 178, The initialization
sequence can leak the AsyncMongoClient if init_beanie() or any subsequent
container.get calls raise before the outer finally runs; wrap the MongoDB setup
so mongo_client is closed on error by adding a nested try/finally (or try/except
+ close) immediately after creating mongo_client and before awaiting
init_beanie(), ensuring mongo_client.close() is called on any exception; update
the block around AsyncMongoClient creation, init_beanie(database=...), and the
subsequent container.get(...) calls (including SchemaRegistryManager,
UnifiedProducer, and DLQManager/_configure_* calls) to guarantee mongo_client is
closed if initialization fails before yielding.

@sonarqubecloud

Quality Gate failed

Failed conditions
Reliability Rating on New Code: C (required ≥ A)

See analysis details on SonarQube Cloud



@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 8 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/app/events/core/producer.py">

<violation number="1" location="backend/app/events/core/producer.py:55">
P2: The merge order lets caller-supplied headers override the required `event_type` header, so routing can break if `headers` contains `event_type`. Preserve the `event_type` value after merging additional headers.</violation>
</file>


# Always include event_type header for routing, merge with any additional headers
all_headers = {"event_type": str(event_to_produce.event_type)}
if headers:
    all_headers.update(headers)

@cubic-dev-ai cubic-dev-ai bot Jan 22, 2026


P2: The merge order lets caller-supplied headers override the required event_type header, so routing can break if headers contains event_type. Preserve the event_type value after merging additional headers.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/app/events/core/producer.py, line 55:

<comment>The merge order lets caller-supplied headers override the required `event_type` header, so routing can break if `headers` contains `event_type`. Preserve the `event_type` value after merging additional headers.</comment>

<file context>
@@ -49,8 +49,13 @@ async def produce(
+            # Always include event_type header for routing, merge with any additional headers
+            all_headers = {"event_type": str(event_to_produce.event_type)}
+            if headers:
+                all_headers.update(headers)
+
             # Convert headers to list of tuples format
</file context>
Suggested change
all_headers.update(headers)
all_headers = {**headers, "event_type": str(event_to_produce.event_type)}
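
The underlying issue is plain dict semantics: the last write wins, so the merge direction decides which event_type survives.

caller = {"event_type": "user.supplied", "trace_id": "abc"}

bad = {"event_type": "execution.completed"}
bad.update(caller)  # the caller's event_type overwrites the routing value
assert bad["event_type"] == "user.supplied"

good = {**caller, "event_type": "execution.completed"}  # routing header wins
assert good["event_type"] == "execution.completed"
assert good["trace_id"] == "abc"  # other caller headers are preserved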


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/domain/enums/kafka.py (1)

56-67: Remove unused EVENT_STORE_CONSUMER and DLQ_MANAGER enum values or add missing dictionary entries.

These two GroupId enum values have no corresponding entries in CONSUMER_GROUP_SUBSCRIPTIONS or CONSUMER_GROUP_EVENTS, and are not referenced anywhere in the codebase. If any code attempts to look them up, it will raise a KeyError. Either remove these enum values if they are unused, or add the missing dictionary entries if they are intended for future use.
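
If the values are kept for future use, a hedged sketch of an exhaustiveness test that turns the latent KeyError into an immediate failure (names taken from the comment above):

def test_every_group_id_is_mapped() -> None:
    # Fails loudly if any GroupId lacks subscription/event mappings,
    # instead of raising KeyError at a later lookup site.
    for mapping in (CONSUMER_GROUP_SUBSCRIPTIONS, CONSUMER_GROUP_EVENTS):
        missing = [g for g in GroupId if g not in mapping]
        assert not missing, f"GroupId values without mapping: {missing}"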

🤖 Fix all issues with AI agents
In `@backend/app/events/core/producer.py`:
- Around line 52-58: The merge currently lets caller-supplied headers override
the routing header; to make event_type authoritative, change the merge so caller
headers are applied first and then set/overwrite "event_type" from
event_to_produce.event_type (i.e., build all_headers from headers or update with
headers, then set all_headers["event_type"] = str(event_to_produce.event_type)),
keeping the existing conversion to header_list = [(k, v.encode()) for k, v in
all_headers.items()]; reference the variables event_to_produce, headers,
all_headers, and header_list when locating the change.
♻️ Duplicate comments (8)
backend/workers/dlq_processor.py (2)

185-205: Verify FastStream supports registering subscribers inside lifespan.
If FastStream 0.6 doesn’t activate subscribers registered within lifespan, the DLQ consumer may never start. Please confirm; if unsupported, move subscriber registration to module scope or @app.after_startup.

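If registration inside the lifespan turns out to be unsupported, the conventional FastStream pattern is module-scope decoration, so the broker knows every handler before startup; a minimal sketch with placeholder topic, group id, and handler names (not the project's actual identifiers):

```python
from faststream import FastStream
from faststream.kafka import KafkaBroker

broker = KafkaBroker("localhost:9092")
app = FastStream(broker)


# Registered at import time, so the subscriber is active as soon as the app starts.
@broker.subscriber("dlq-topic", group_id="dlq-processor")
async def handle_dlq(message: dict) -> None:
    ...
```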

161-178: Close Mongo client on startup failures before lifespan yields.
If init_beanie or any container.get(...) raises before the try: yield, the Mongo client won’t be closed. Wrap the init sequence to ensure cleanup on errors.

🐛 Proposed fix
         mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
             settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
         )
-        await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
-
-        # Resolve schema registry (initialization handled by provider)
-        await container.get(SchemaRegistryManager)
-
-        # Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
-        await container.get(UnifiedProducer)
-        app_logger.info("Kafka producer ready")
-
-        # Get DLQ manager and configure policies
-        manager = await container.get(DLQManager)
-        _configure_retry_policies(manager, app_logger)
-        _configure_filters(manager, testing=settings.TESTING, logger=app_logger)
-        app_logger.info("DLQ Manager configured")
+        try:
+            await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+
+            # Resolve schema registry (initialization handled by provider)
+            await container.get(SchemaRegistryManager)
+
+            # Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
+            await container.get(UnifiedProducer)
+            app_logger.info("Kafka producer ready")
+
+            # Get DLQ manager and configure policies
+            manager = await container.get(DLQManager)
+            _configure_retry_policies(manager, app_logger)
+            _configure_filters(manager, testing=settings.TESTING, logger=app_logger)
+            app_logger.info("DLQ Manager configured")
+        except Exception:
+            await mongo_client.close()
+            await container.close()
+            raise
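
An alternative to the explicit try/except is `contextlib.AsyncExitStack`, which unwinds whatever cleanups were registered if any later startup step raises; a sketch reusing the names from the diff above (`settings`, `container`, `ALL_DOCUMENTS`, and the service types are assumed to be in scope):

```python
from contextlib import AsyncExitStack, asynccontextmanager


@asynccontextmanager
async def lifespan(app):
    async with AsyncExitStack() as stack:
        mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
            settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
        )
        # Registered cleanups run LIFO on any exception below,
        # and again on normal shutdown after the yield.
        stack.push_async_callback(mongo_client.close)
        stack.push_async_callback(container.close)

        await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
        await container.get(SchemaRegistryManager)
        await container.get(UnifiedProducer)
        manager = await container.get(DLQManager)
        _configure_retry_policies(manager, app_logger)
        _configure_filters(manager, testing=settings.TESTING, logger=app_logger)

        yield  # a failure anywhere above can no longer leak the Mongo client
```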
backend/workers/run_sse_bridge.py (2)

102-105: Avoid hardcoded topic string in deserialize_event.
If deserialize_event ever uses the topic argument, the literal "sse_bridge" will be misleading. Consider passing the actual topic or documenting why a constant is correct here.
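
Deriving the topic from the consumed record would remove the misleading constant; a hypothetical sketch, assuming the decoder receives FastStream's `KafkaMessage` whose `raw_message` is an aiokafka `ConsumerRecord` (exposing `.value` and `.topic`), and that `deserialize_event` accepts the topic as its second argument:

```python
# Hypothetical decoder: pass the record's real topic instead of "sse_bridge".
async def decode_avro(msg):
    record = msg.raw_message  # aiokafka ConsumerRecord
    return deserialize_event(record.value, record.topic)
```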


107-111: ACK_FIRST risks at‑most‑once delivery; prefer ACK after handler success.
ACK_FIRST commits before the handler runs; if Redis publish fails (or errors are suppressed), events can be lost. Consider AckPolicy.ACK (or NACK_ON_ERROR) and ensure failures propagate.

🐛 Proposed fix
         subscriber = broker.subscriber(
             *topics,
             group_id=group_id,
-            ack_policy=AckPolicy.ACK_FIRST,  # SSE bridge is idempotent (Redis pubsub)
+            ack_policy=AckPolicy.ACK,  # commit after successful handler
             decoder=decode_avro,
         )
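Whichever policy is chosen, it only helps if handler failures actually propagate; a minimal sketch of the handler side, with `publish_to_redis` as a hypothetical stand-in for the bridge's real publish call:

```python
# With AckPolicy.ACK the offset is committed only after the handler returns,
# so re-raising here leaves the message uncommitted for redelivery.
@subscriber()
async def bridge_to_redis(event: DomainEvent) -> None:
    try:
        await publish_to_redis(event)  # hypothetical stand-in
    except Exception:
        app_logger.exception("SSE bridge publish failed; offset not committed")
        raise
```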
backend/workers/run_k8s_worker.py (1)

144-147: Warn on unmatched events instead of silently dropping them.
A warning helps detect new/incorrect event routing.

💡 Suggested tweak
        @subscriber()
         async def handle_other(event: DomainEvent) -> None:
-            pass
+            app_logger.warning("Unhandled K8s worker event_type=%s", event.event_type)
backend/workers/run_result_processor.py (2)

118-129: Close Mongo client on startup failures before lifespan yields.
Init errors before the try: yield skip cleanup and leave the client open.

🐛 Proposed fix
         mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
             settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
         )
-        await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
-
-        # Resolve schema registry (initialization handled by provider)
-        schema_registry = await container.get(SchemaRegistryManager)
-
-        # Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
-        await container.get(UnifiedProducer)
-        app_logger.info("Kafka producer ready")
+        try:
+            await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+
+            # Resolve schema registry (initialization handled by provider)
+            schema_registry = await container.get(SchemaRegistryManager)
+
+            # Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
+            await container.get(UnifiedProducer)
+            app_logger.info("Kafka producer ready")
+        except Exception:
+            await mongo_client.close()
+            await container.close()
+            raise

165-168: Warn on unmatched events instead of silently dropping them.
Helps catch new event types or filter mismatches.

💡 Suggested tweak
        @subscriber()
         async def handle_other(event: DomainEvent) -> None:
-            pass
+            app_logger.warning("Unhandled result-processor event_type=%s", event.event_type)
backend/workers/run_saga_orchestrator.py (1)

112-123: Close Mongo client on startup failures before lifespan yields.
If init_beanie or provider resolution fails before try: yield, the client isn’t closed.

🐛 Proposed fix
         mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
             settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
         )
-        await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
-
-        # Resolve schema registry (initialization handled by provider)
-        schema_registry = await container.get(SchemaRegistryManager)
-
-        # Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
-        await container.get(UnifiedProducer)
-        app_logger.info("Kafka producer ready")
+        try:
+            await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+
+            # Resolve schema registry (initialization handled by provider)
+            schema_registry = await container.get(SchemaRegistryManager)
+
+            # Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
+            await container.get(UnifiedProducer)
+            app_logger.info("Kafka producer ready")
+        except Exception:
+            await mongo_client.close()
+            await container.close()
+            raise
🧹 Nitpick comments (2)
backend/workers/dlq_processor.py (1)

202-205: Log unexpected DLQ messages instead of silently discarding.
A warning makes it easier to spot misrouted or malformed messages.

💡 Suggested tweak
        @subscriber()
         async def handle_other(message: DLQMessage) -> None:
-            pass
+            app_logger.warning(
+                "Unhandled DLQ message without original_topic header: %s",
+                message,
+            )
backend/workers/run_saga_orchestrator.py (1)

169-172: Consider logging unmatched events.
A warning would surface unexpected saga events or filter mismatches.

💡 Suggested tweak
        @subscriber()
         async def handle_other(event: DomainEvent) -> None:
-            pass
+            app_logger.warning("Unhandled saga event_type=%s", event.event_type)

Comment on lines +52 to +58
+        # Always include event_type header for routing, merge with any additional headers
+        all_headers = {"event_type": str(event_to_produce.event_type)}
+        if headers:
+            all_headers.update(headers)
+
         # Convert headers to list of tuples format
-        header_list = [(k, v.encode()) for k, v in headers.items()] if headers else None
+        header_list = [(k, v.encode()) for k, v in all_headers.items()]


⚠️ Potential issue | 🟡 Minor

Header merge order allows accidental override of `event_type`.

The comment states the intent to "always include event_type header for routing," but the current merge order lets callers override it via the `headers` dict. If `event_type` should be authoritative for routing, reverse the merge order so the event's type always wins.

Proposed fix to ensure event_type cannot be overridden
         # Always include event_type header for routing, merge with any additional headers
-        all_headers = {"event_type": str(event_to_produce.event_type)}
-        if headers:
-            all_headers.update(headers)
+        all_headers = dict(headers) if headers else {}
+        all_headers["event_type"] = str(event_to_produce.event_type)  # Authoritative
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
         # Always include event_type header for routing, merge with any additional headers
-        all_headers = {"event_type": str(event_to_produce.event_type)}
-        if headers:
-            all_headers.update(headers)
+        all_headers = dict(headers) if headers else {}
+        all_headers["event_type"] = str(event_to_produce.event_type)  # Authoritative

         # Convert headers to list of tuples format
         header_list = [(k, v.encode()) for k, v in all_headers.items()]
🤖 Prompt for AI Agents
In `@backend/app/events/core/producer.py` around lines 52 - 58, The merge
currently lets caller-supplied headers override the routing header; to make
event_type authoritative, change the merge so caller headers are applied first
and then set/overwrite "event_type" from event_to_produce.event_type (i.e.,
build all_headers from headers or update with headers, then set
all_headers["event_type"] = str(event_to_produce.event_type)), keeping the
existing conversion to header_list = [(k, v.encode()) for k, v in
all_headers.items()]; reference the variables event_to_produce, headers,
all_headers, and header_list when locating the change.
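
A small regression test could pin the authoritative behavior; a hypothetical sketch that exercises the merge logic in isolation rather than the real producer:

```python
def merge_headers(event_type: str, headers: dict[str, str] | None) -> dict[str, str]:
    # Mirrors the suggested fix: caller headers first, event_type set last so it wins.
    all_headers = dict(headers) if headers else {}
    all_headers["event_type"] = event_type
    return all_headers


def test_event_type_cannot_be_overridden() -> None:
    merged = merge_headers("ExecutionCompleted", {"event_type": "spoofed", "trace": "abc"})
    assert merged["event_type"] == "ExecutionCompleted"
    assert merged["trace"] == "abc"
```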
