chore: removal of nullable fields #96
base: main
Conversation
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Centralizes external clients into DI providers, replaces many LifecycleEnabled services with DI-managed logic and FastStream entrypoints, removes the coordinator and related queue/resource modules, and rewires SSE/K8s/saga/DLQ/idempotency/schema-registry, plus extensive test and docs updates.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant FastStream as FastStream App
    participant Dishka as Dishka DI Container
    participant Schema as SchemaRegistryManager
    participant Kafka as Kafka Broker
    participant Logic as K8sWorkerLogic
    participant K8s as Kubernetes API
    FastStream->>Dishka: build container (BoundaryClientProvider, EventProvider, K8sWorkerProvider...)
    Dishka-->>Logic: resolve K8sWorkerLogic
    Kafka->>FastStream: deliver Avro message
    FastStream->>Schema: decode Avro -> DomainEvent
    Schema-->>FastStream: DomainEvent
    FastStream->>Logic: handle_create_pod_command(event)
    Logic->>K8s: create ConfigMap/Pod (CoreV1Api / AppsV1Api)
    K8s-->>Logic: response/status
    Logic->>Kafka: publish PodCreated / ExecutionFailed events
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
No issues found across 2 files
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/services/notification_service.py (1)
136-156: Add `_subscribe_to_events()` to background tasks so execution event notifications fire.

`_subscribe_to_events()` is never invoked, so the Kafka consumer for execution-event notifications (EXECUTION_COMPLETED, EXECUTION_FAILED, EXECUTION_TIMEOUT) never starts. Add it to the background task queue:

🔧 Proposed fix

```diff
 def _start_background_tasks(self) -> None:
     """Start background processing tasks."""
     tasks = [
+        asyncio.create_task(self._subscribe_to_events()),
         asyncio.create_task(self._process_pending_notifications()),
         asyncio.create_task(self._cleanup_old_notifications()),
     ]
```
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/services/notification_service.py (1)
118-155: `_subscribe_to_events()` is never invoked; execution-event notifications won't be consumed.

The async method `_subscribe_to_events()` is defined (line 207) but never called anywhere in the codebase. This means the service will not set up Kafka consumer subscriptions for execution events, breaking event-driven notifications entirely.

Additionally, `_start_background_tasks()` calls `asyncio.create_task()` in `__init__` without explicitly verifying a running loop. While lazy DI instantiation during async route handling may avoid an immediate `RuntimeError`, this pattern is fragile and doesn't fail fast.

Move event subscription setup into an async initialization point (e.g., a `start()` method called from the lifespan handler, or schedule `_subscribe_to_events()` as a background task once the loop is guaranteed to be running).
sse shutdown manager: moved to DI
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/core/providers.py (1)
515-531: Remove task creation from `NotificationService.__init__` or make the provider async-aware.

`NotificationService.__init__` calls `asyncio.create_task()` on lines 199-200 to start background tasks. Since `get_notification_service` is a synchronous provider, this will raise `RuntimeError: no running event loop` if resolved in any context before an event loop is active. Either defer task startup to an explicit async method (e.g., `async def start()`) called after service resolution in an async context, or convert the provider to resolve the service in an async scope.
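A minimal sketch of the deferred-startup shape this comment suggests; apart from the method names quoted in the reviews, the class body here is an illustrative assumption, not the service's actual API:

```python
import asyncio
import logging


class NotificationService:
    """Sketch: no tasks are created in __init__, so sync DI resolution is safe."""

    def __init__(self, logger: logging.Logger) -> None:
        self.logger = logger
        self._tasks: list[asyncio.Task[None]] = []

    async def start(self) -> None:
        # Called from an async lifespan handler, where a loop is guaranteed.
        self._tasks = [
            asyncio.create_task(self._subscribe_to_events()),
            asyncio.create_task(self._process_pending_notifications()),
        ]
        self.logger.info("Notification background tasks started")

    async def stop(self) -> None:
        for task in self._tasks:
            task.cancel()
        await asyncio.gather(*self._tasks, return_exceptions=True)

    async def _subscribe_to_events(self) -> None:
        ...  # Kafka consumer loop would live here

    async def _process_pending_notifications(self) -> None:
        ...  # periodic flush loop would live here
```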
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
backend/app/services/k8s_worker/worker.py (1)
169-175: Clear `idempotent_consumer` on shutdown to avoid stale “running” status.

`get_status` now reports `"running"` based on `idempotent_consumer is not None`, but `__aexit__` doesn’t null it after stopping. This can report running even after shutdown.

🐛 Proposed fix

```diff
@@
     if self.idempotent_consumer:
         await self.idempotent_consumer.stop()
+        self.idempotent_consumer = None
```

Also applies to: 430-434
backend/app/services/sse/kafka_redis_bridge.py (1)
16-60: Make `__aenter__` idempotent and exception-safe.

Tests call `__aenter__` twice; current code overwrites `self.consumers` without stopping the already-started set, which can leak running consumers. Also, a partial start failure leaves running consumers behind. Guard re-entry and clean up on failure.

🔧 Suggested fix

```diff
 async def __aenter__(self) -> "SSEKafkaRedisBridge":
     """Start the SSE Kafka→Redis bridge."""
     self.logger.info(f"Starting SSE Kafka→Redis bridge with {self.num_consumers} consumers")
+    if self.consumers:
+        self.logger.debug("SSE Kafka→Redis bridge already started")
+        return self
+
     # Phase 1: Build all consumers and track them immediately (no I/O)
     self.consumers = [self._build_consumer(i) for i in range(self.num_consumers)]

     # Phase 2: Start all in parallel - already tracked in self.consumers for cleanup
     topics = list(CONSUMER_GROUP_SUBSCRIPTIONS[GroupId.WEBSOCKET_GATEWAY])
-    await asyncio.gather(*[c.start(topics) for c in self.consumers])
+    try:
+        await asyncio.gather(*[c.start(topics) for c in self.consumers])
+    except Exception:
+        await asyncio.gather(*[c.stop() for c in self.consumers], return_exceptions=True)
+        self.consumers.clear()
+        raise
```
🤖 Fix all issues with AI agents
In `@backend/app/services/coordinator/coordinator.py`:
- Around line 112-117: The log message in ExecutionCoordinator.__aenter__ is
misleading because queue_manager is constructed in __init__; update __aenter__
to not claim initialization: remove or replace the "Queue manager initialized"
logger call in __aenter__ (referencing ExecutionCoordinator.__aenter__ and the
queue_manager created in ExecutionCoordinator.__init__) and, if you need to
indicate readiness here, change it to a truthful message such as
"ExecutionCoordinator service started" or "Queue manager ready" so the log
reflects actual work performed in __aenter__.
In `@backend/app/services/pod_monitor/monitor.py`:
- Around line 289-297: The current _backoff() raises RuntimeError when reconnect
attempts exceed config.max_reconnect_attempts which propagates out of
_watch_loop() and kills the _watch_task; instead, modify _backoff() to not raise
but mark the monitor as paused (e.g., set self._paused = True and log the
failure with details) and/or schedule a retry via a new helper (e.g.,
_ensure_watch_running) that runs in background to attempt restart, and update
_watch_loop() to check self._paused and exit gracefully without killing the
background manager; ensure references to _backoff, _watch_loop, _watch_task, and
config.max_reconnect_attempts are used so the new paused flag or background
restart task can detect and restart the watcher.
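A rough sketch of the pause-instead-of-raise flow this prompt describes; beyond the names it references (`_backoff`, `_watch_loop`, `max_reconnect_attempts`, `_paused`), the structure below is an illustrative assumption:

```python
import asyncio
import logging


class PodMonitorSketch:
    """Illustrative fragment: pause the watch loop instead of raising."""

    def __init__(self, max_reconnect_attempts: int, logger: logging.Logger) -> None:
        self.max_reconnect_attempts = max_reconnect_attempts
        self.logger = logger
        self._reconnect_attempts = 0
        self._paused = False

    async def _backoff(self) -> None:
        self._reconnect_attempts += 1
        if self._reconnect_attempts > self.max_reconnect_attempts:
            # Pause rather than raise, so the exception never kills
            # the background watch task created with create_task().
            self._paused = True
            self.logger.error("Max reconnect attempts exceeded; pausing watch")
            return
        await asyncio.sleep(min(2 ** self._reconnect_attempts, 60.0))

    async def _watch_loop(self) -> None:
        while not self._paused:
            try:
                await self._watch_once()
            except asyncio.CancelledError:
                raise
            except Exception:
                await self._backoff()

    async def _watch_once(self) -> None: ...
```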
🧹 Nitpick comments (6)
backend/app/services/coordinator/coordinator.py (1)
196-197: Redundant `hasattr` check.

`idempotency_manager` is always assigned in `__init__` (line 97), so `hasattr` is unnecessary. The None check alone is sufficient.

Suggested simplification

```diff
-    if hasattr(self, "idempotency_manager") and self.idempotency_manager:
+    if self.idempotency_manager:
         await self.idempotency_manager.close()
```

backend/workers/run_k8s_worker.py (1)
42-46: Consider exiting if the worker stops unexpectedly.

The loop now waits only on `shutdown_event`. If the worker/consumer dies, the process may keep running and just log status. If you want automatic recovery or a clean exit, consider breaking/setting the shutdown event when status indicates not running.

♻️ Possible adjustment

```diff
 while not shutdown_event.is_set():
     await asyncio.sleep(60)
     status = await worker.get_status()
     logger.info(f"Kubernetes worker status: {status}")
+    if not status.get("running", True):
+        logger.error("KubernetesWorker not running; initiating shutdown.")
+        shutdown_event.set()
```

backend/app/services/pod_monitor/monitor.py (1)
122-138: Consider logging exceptions from the watch task during shutdown.

If `_watch_task` terminates due to an exception other than `CancelledError` (e.g., `RuntimeError` from `_backoff`), it will be silently ignored during `__aexit__`. Consider retrieving and logging any exception.

🔧 Suggested improvement

```diff
 if self._watch_task:
     self._watch_task.cancel()
     try:
         await self._watch_task
     except asyncio.CancelledError:
         pass
+    except Exception as e:
+        self.logger.warning(f"Watch task terminated with error: {e}")
```

backend/tests/unit/services/saga/test_saga_orchestrator_unit.py (1)
118-125: Assertion may not validate event handling outcome.

The assertion `len(orch._sagas) > 0` verifies that `register_saga(_Saga)` populated `_sagas`, not that `_handle_event` processed the event. Based on the relevant code snippet showing `_handle_event` iterates over `self._sagas`, this assertion passes due to the `register_saga` call on line 121, not because of the event handling on line 123.

Consider asserting on observable side effects of event handling, such as checking the fake repository for saved sagas.
🔧 Suggested improvement
```diff
 @pytest.mark.asyncio
 async def test_min_success_flow(event_metrics: EventMetrics) -> None:
-    orch = _orch(event_metrics)
-    orch.register_saga(_Saga)
+    fake_repo = _FakeRepo()
+    orch = SagaOrchestrator(
+        config=SagaConfig(name="t", enable_compensation=True, store_events=True, publish_commands=False),
+        saga_repository=fake_repo,
+        producer=_FakeProd(),
+        schema_registry_manager=MagicMock(spec=SchemaRegistryManager),
+        settings=MagicMock(spec=Settings),
+        event_store=_FakeStore(),
+        idempotency_manager=_FakeIdem(),
+        resource_allocation_repository=_FakeAlloc(),
+        logger=_test_logger,
+        event_metrics=event_metrics,
+    )
+    orch.register_saga(_Saga)

     # Handle the event
     await orch._handle_event(make_execution_requested_event(execution_id="e"))

-    # basic sanity; deep behavior covered by integration
-    assert len(orch._sagas) > 0
+    # Verify saga was created and saved
+    assert len(fake_repo.saved) > 0
```

backend/workers/run_saga_orchestrator.py (1)
31-43: Consider liveness handling after dropping the `is_running` gate.

With only `shutdown_event`, the worker will keep sleeping even if the orchestrator stops unexpectedly; please confirm that’s intended or wire a failure hook to trigger shutdown.

backend/app/services/saga/saga_orchestrator.py (1)
361-384: Consider re-raising `CancelledError` for cleaner task semantics.

While suppressing `CancelledError` works here since `asyncio.gather(..., return_exceptions=True)` is used in `__aexit__`, re-raising is the recommended pattern for asyncio tasks. This ensures the task is properly marked as cancelled rather than completing normally.

♻️ Optional: Re-raise CancelledError

```diff
 except asyncio.CancelledError:
     self.logger.info("Timeout checker cancelled")
+    raise
```
8 issues found across 32 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="backend/workers/run_pod_monitor.py">
<violation number="1" location="backend/workers/run_pod_monitor.py:45">
P2: The loop now ignores the PodMonitor state, so if the monitor stops internally the worker will keep polling forever instead of exiting. Consider reintroducing a state check (or equivalent stop condition) to avoid hanging on a stopped monitor.</violation>
</file>
<file name="backend/app/services/saga/saga_orchestrator.py">
<violation number="1" location="backend/app/services/saga/saga_orchestrator.py:364">
P2: The timeout checker now stops permanently on any unexpected exception because the loop no longer catches and logs `Exception`. A transient repository error will kill the task and timeout checks will never resume. Wrap the body of the loop in a try/except to log and continue.</violation>
</file>
<file name="backend/app/services/k8s_worker/worker.py">
<violation number="1" location="backend/app/services/k8s_worker/worker.py:433">
P2: `running` is now derived from `idempotent_consumer` existence, but the shutdown path stops the consumer without clearing the reference. After `__aexit__`, `get_status()` will still return running=True even though the worker has stopped. Track an explicit running flag or clear `self.idempotent_consumer` on shutdown so status reflects the actual lifecycle state.</violation>
</file>
<file name="backend/tests/unit/services/saga/test_saga_orchestrator_unit.py">
<violation number="1" location="backend/tests/unit/services/saga/test_saga_orchestrator_unit.py:125">
P3: This assertion no longer validates that `_handle_event` did anything; `_sagas` is populated by `register_saga` before the handler runs, so the test will pass even if event handling fails. Assert on a side effect of `_handle_event` (e.g., running instances or saved saga) instead.</violation>
</file>
<file name="backend/workers/run_k8s_worker.py">
<violation number="1" location="backend/workers/run_k8s_worker.py:43">
P2: The run loop no longer exits when the KubernetesWorker stops. If the worker shuts down unexpectedly, this loop keeps running and prevents graceful cleanup until a signal arrives. Consider restoring the `worker.is_running` guard (or breaking when status indicates stopped) so the process terminates cleanly when the worker stops.</violation>
</file>
<file name="backend/app/events/event_store_consumer.py">
<violation number="1" location="backend/app/events/event_store_consumer.py:131">
P2: Unhandled exceptions in the batch processor will terminate the background task and stop future flushes; add error handling inside the loop to log and continue.</violation>
</file>
<file name="backend/workers/run_coordinator.py">
<violation number="1" location="backend/workers/run_coordinator.py:43">
P2: The loop no longer exits when the coordinator stops, so the worker can keep polling indefinitely after a coordinator failure instead of shutting down and cleaning up.</violation>
</file>
<file name="backend/app/services/pod_monitor/monitor.py">
<violation number="1" location="backend/app/services/pod_monitor/monitor.py:297">
P2: Raising RuntimeError in the watch backoff path will terminate the background watch task without a clean shutdown; since the task is created with `create_task` and `__aexit__` only suppresses `CancelledError`, this exception will bubble up later on shutdown or leave the monitor silently stopped. Consider exiting the watch loop via cancellation instead of throwing a RuntimeError.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@backend/tests/integration/conftest.py`:
- Around line 74-83: The SSERedisBus constructor is being called with arguments
in the wrong order in the sse_redis_bus fixture: SSERedisBus expects
(redis_client, logger, exec_prefix, notif_prefix) but the fixture passes
exec_prefix and notif_prefix positionally and logger last; fix by passing logger
as the second argument (either positionally after redis_client or as
logger=<logger>) and ensure exec_prefix and notif_prefix are provided afterwards
(or as keywords) so the call, e.g. SSERedisBus(redis_client, logger,
exec_prefix=..., notif_prefix=...), matches the constructor signature.
🧹 Nitpick comments (1)
backend/tests/integration/idempotency/test_consumer_idempotent.py (1)
9-10: Redundant import of `EventDispatcher`.

`EventDispatcher` is imported twice: once from `app.events.core` (line 9) and again aliased as `Disp` from `app.events.core.dispatcher` (line 10). Since `app.events.core` likely re-exports from the dispatcher module, these are the same class.

Suggested fix

```diff
-from app.events.core import ConsumerConfig, EventDispatcher, UnifiedConsumer, UnifiedProducer
-from app.events.core.dispatcher import EventDispatcher as Disp
+from app.events.core import ConsumerConfig, EventDispatcher, UnifiedConsumer, UnifiedProducer
```

Then update line 47 to use `EventDispatcher` directly:

```diff
-    disp: Disp = EventDispatcher(logger=_test_logger)
+    disp = EventDispatcher(logger=_test_logger)
```
13 issues found across 47 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="backend/workers/run_k8s_worker.py">
<violation number="1" location="backend/workers/run_k8s_worker.py:53">
P1: Exceptions from completed tasks in `done` set are silently swallowed. If `consumer.run()` or `logic.ensure_daemonset_task()` raises an exception, the worker will silently proceed to shutdown without logging or propagating the error. Add exception handling for tasks in `done`.</violation>
</file>
<file name="backend/workers/run_result_processor.py">
<violation number="1" location="backend/workers/run_result_processor.py:42">
P2: Exceptions from completed tasks in `done` set are never retrieved. If `consumer.run()` crashes with an exception before the shutdown signal, the error is silently ignored. Check for exceptions in completed tasks to properly log failures.</violation>
</file>
<file name="backend/app/core/dishka_lifespan.py">
<violation number="1" location="backend/app/core/dishka_lifespan.py:116">
P3: Inconsistent logging style: use structured logging with `extra` parameter instead of f-string interpolation to match the rest of this file's logging pattern. This improves log aggregation and follows the established convention.</violation>
</file>
<file name="backend/app/services/coordinator/coordinator_logic.py">
<violation number="1" location="backend/app/services/coordinator/coordinator_logic.py:280">
P1: `event.priority` is an `int`, so calling `.name` on it will raise `AttributeError`. The priority value must be converted to `QueuePriority` enum first.</violation>
</file>
<file name="backend/app/services/result_processor/processor_logic.py">
<violation number="1" location="backend/app/services/result_processor/processor_logic.py:33">
P3: Misleading docstring: The class is no longer stateful after this refactoring. The removed `_state`, `_consumer`, and `_dispatcher` attributes were what made it stateful. Consider updating the docstring to reflect that this class is now a stateless service that processes events.</violation>
</file>
<file name="backend/app/services/sse/sse_connection_registry.py">
<violation number="1" location="backend/app/services/sse/sse_connection_registry.py:49">
P1: Metrics are decremented even when the connection wasn't actually registered, which can lead to incorrect (potentially negative) connection counts. The code should only decrement metrics when a connection was actually removed from the registry.</violation>
</file>
<file name="backend/app/events/core/consumer.py">
<violation number="1" location="backend/app/events/core/consumer.py:109">
P1: `event` may be undefined when the error callback is invoked. If `self._schema_registry.deserialize_event()` throws an exception, `event` was never assigned, causing a `NameError`. Initialize `event` before the try block or guard the callback invocation.</violation>
</file>
<file name="backend/workers/run_coordinator.py">
<violation number="1" location="backend/workers/run_coordinator.py:53">
P1: Exceptions from completed tasks are silently lost. If `run_task` finishes with an exception (e.g., consumer or scheduling loop crashes), the error is never retrieved or logged. You should check tasks in `done` for exceptions to avoid silent failures.</violation>
</file>
<file name="backend/app/services/sse/event_router.py">
<violation number="1" location="backend/app/services/sse/event_router.py:44">
P2: Inefficient use of `model_dump()` just to extract a single field. Since all SSE-relevant events have `execution_id` as a direct attribute, use `getattr(event, 'execution_id', None)` to avoid serializing the entire model.</violation>
</file>
<file name="backend/workers/run_pod_monitor.py">
<violation number="1" location="backend/workers/run_pod_monitor.py:45">
P1: Exceptions from completed tasks are silently swallowed. When using `asyncio.wait()`, you must check completed tasks for exceptions, otherwise failures in `monitor.run()` will go unnoticed. Add exception handling for tasks in `done`.</violation>
</file>
<file name="backend/workers/run_saga_orchestrator.py">
<violation number="1" location="backend/workers/run_saga_orchestrator.py:65">
P1: Exceptions from completed tasks are silently swallowed. If `run_task` fails with an exception, it ends up in `done` but the exception is never checked or re-raised. Add exception handling for completed tasks to avoid silent failures.</violation>
</file>
<file name="backend/tests/integration/result_processor/test_result_processor.py">
<violation number="1" location="backend/tests/integration/result_processor/test_result_processor.py:130">
P2: Unused parameter `consumer_config` after refactoring. The function now creates its own `processor_consumer_config` and `stored_consumer_config` locally instead of using the injected fixture. Remove the unused parameter from the function signature.</violation>
</file>
<file name="backend/app/services/pod_monitor/monitor.py">
<violation number="1" location="backend/app/services/pod_monitor/monitor.py:123">
P2: `asyncio.CancelledError` is caught but not re-raised, breaking Python's async cancellation contract. The task will appear to complete normally instead of being cancelled, which can cause issues with task lifecycle management (e.g., `task.cancelled()` returns False, callers won't see the exception).</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
```python
async def route_event(self, event: DomainEvent) -> None:
    """Route an event to Redis for SSE delivery."""
    data = event.model_dump()
```
P2: Inefficient use of model_dump() just to extract a single field. Since all SSE-relevant events have execution_id as a direct attribute, use getattr(event, 'execution_id', None) to avoid serializing the entire model.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/app/services/sse/event_router.py, line 44:
<comment>Inefficient use of `model_dump()` just to extract a single field. Since all SSE-relevant events have `execution_id` as a direct attribute, use `getattr(event, 'execution_id', None)` to avoid serializing the entire model.</comment>
<file context>
@@ -0,0 +1,63 @@
+
+ async def route_event(self, event: DomainEvent) -> None:
+ """Route an event to Redis for SSE delivery."""
+ data = event.model_dump()
+ execution_id = data.get("execution_id")
+
</file context>
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
backend/app/services/coordinator/coordinator_logic.py (1)
231-281: Cast priority to `QueuePriority` before using `.name`.

If `event.priority` is an int (as it appears elsewhere), `priority.name` will raise and the schedule path will fail.

🐛 Proposed fix: normalize to QueuePriority

```diff
-    priority = getattr(event, "priority", QueuePriority.NORMAL)
-    self.metrics.record_coordinator_queue_time(queue_time, priority.name)
+    priority = QueuePriority(getattr(event, "priority", QueuePriority.NORMAL))
+    self.metrics.record_coordinator_queue_time(queue_time, priority.name)
```

backend/app/services/result_processor/processor_logic.py (1)
88-93: Guard memory limit parsing to avoid handler failure.

Line 90 uses `rstrip("Mi")`, which will mis-handle `Gi` or other formats and can raise `ValueError`, aborting the handler. Consider a minimal parse guard to keep result processing resilient.

🛠️ Suggested fix

```diff
-    settings_limit = self._settings.K8S_POD_MEMORY_LIMIT
-    memory_limit_mib = int(settings_limit.rstrip("Mi"))  # TODO: Less brittle acquisition of limit
-    memory_percent = (memory_mib / memory_limit_mib) * 100
-    self._metrics.memory_utilization_percent.record(
-        memory_percent, attributes={"lang_and_version": lang_and_version}
-    )
+    settings_limit = self._settings.K8S_POD_MEMORY_LIMIT
+    try:
+        if settings_limit.endswith("Gi"):
+            memory_limit_mib = int(settings_limit[:-2]) * 1024
+        elif settings_limit.endswith("Mi"):
+            memory_limit_mib = int(settings_limit[:-2])
+        else:
+            raise ValueError(f"Unsupported unit: {settings_limit}")
+        memory_percent = (memory_mib / memory_limit_mib) * 100
+        self._metrics.memory_utilization_percent.record(
+            memory_percent, attributes={"lang_and_version": lang_and_version}
+        )
+    except ValueError:
+        self.logger.warning(
+            "Skipping memory utilization metric due to unparseable limit",
+            extra={"limit": settings_limit},
+        )
```
🤖 Fix all issues with AI agents
In `@backend/app/core/dishka_lifespan.py`:
- Around line 102-127: Background tasks created in the tasks list
(run_sse_consumers, event_store_consumer.run, notification_service.run) are
fire-and-forget and only awaited at shutdown, so runtime exceptions can be lost;
update the creation of each asyncio.create_task to attach an add_done_callback
that checks task.exception(), logs the error via logger, and on first non-None
exception cancels all other tasks in the tasks collection (safely skipping
already done/cancelled tasks) to implement immediate supervision and fail-fast
behavior, ensuring you still await asyncio.gather(*tasks,
return_exceptions=True) in the finally block to finish cleanup.
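A sketch of the supervision pattern this prompt describes: each task gets a done-callback that logs the first failure and cancels its siblings. The function and names below are illustrative, not the file's actual code:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


def supervise(tasks: list[asyncio.Task[None]]) -> None:
    """Attach a done-callback that logs failures and cancels sibling tasks."""

    def _on_done(task: asyncio.Task[None]) -> None:
        if task.cancelled():
            return
        exc = task.exception()
        if exc is None:
            return
        logger.error("Background task failed: %r", exc)
        for other in tasks:
            if other is not task and not other.done():
                other.cancel()

    for task in tasks:
        task.add_done_callback(_on_done)
```

The shutdown path can then still `await asyncio.gather(*tasks, return_exceptions=True)` in the `finally` block to drain the cancelled tasks.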
In `@backend/app/core/providers.py`:
- Around line 463-472: The provider get_sse_connection_registry currently
creates an SSEConnectionRegistry per request (provide(scope=Scope.REQUEST)),
which resets _active_connections and causes get_connection_count() to always
return 0; change the provider to use application scope
(provide(scope=Scope.APP)) so SSEConnectionRegistry is a singleton for the app,
preserving the asyncio.Lock() and shared _active_connections across requests and
allowing registrations from SSE streams to be visible to the health endpoint.
In `@backend/app/events/core/consumer.py`:
- Around line 104-110: The variable event can be unbound if deserialize_event()
raises, causing a NameError when awaiting self._error_callback(e, event); fix by
initializing event = None before the try/except where deserialize_event() is
called (so the callback always receives a defined value), and keep the existing
error logging and metrics calls (e.g., the except block that calls
self.logger.error, self._event_metrics.record_kafka_consumption_error and awaits
self._error_callback).
In `@backend/app/services/idempotency/middleware.py`:
- Around line 123-144: register_idempotent_handler wraps a handler in
IdempotentEventHandler then passes it to EventDispatcher.register which expects
handler.__name__; set the wrapper's __name__ (or use
functools.update_wrapper/functools.wraps) so the wrapped instance exposes the
original handler's __name__ to avoid AttributeError. In practice, after creating
wrapped in register_idempotent_handler, assign wrapped.__name__ =
getattr(handler, "__name__", type(handler).__name__) (or call
update_wrapper(wrapped, handler)) before calling
self._dispatcher.register(event_type)(wrapped).
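One way to satisfy this, sketched under the assumption that `IdempotentEventHandler` is a plain callable class (the real constructor signature may differ):

```python
from functools import update_wrapper
from typing import Any, Awaitable, Callable


class IdempotentEventHandler:
    """Callable wrapper that keeps the wrapped handler's identity."""

    def __init__(self, handler: Callable[[Any], Awaitable[None]]) -> None:
        self._handler = handler
        # Copies __name__, __qualname__, __doc__, etc. onto this instance,
        # so EventDispatcher.register can keep reading handler.__name__.
        update_wrapper(self, handler)

    async def __call__(self, event: Any) -> None:
        # Idempotency checks would wrap this call in the real implementation.
        await self._handler(event)


async def on_execution_requested(event: Any) -> None:
    ...


wrapped = IdempotentEventHandler(on_execution_requested)
assert wrapped.__name__ == "on_execution_requested"
```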
In `@backend/app/services/sse/sse_connection_registry.py`:
- Around line 30-49: The register_connection and unregister_connection currently
call metrics.increment_sse_connections / decrement_sse_connections
unconditionally, causing drift on duplicate adds/removes; modify
register_connection to only call metrics.increment_sse_connections("executions")
(and log) if connection_id was not already present in
self._active_connections[execution_id], and modify unregister_connection to only
call metrics.decrement_sse_connections("executions") (and log) if connection_id
was actually removed (i.e., it existed before discard and the set changed or
length decreased); keep the async with self._lock around these checks and
mutations and use the same symbols (_active_connections, register_connection,
unregister_connection, metrics.increment_sse_connections,
metrics.decrement_sse_connections) to locate the changes.
In `@backend/tests/integration/events/test_consumer_lifecycle.py`:
- Around line 40-52: The test signals readiness too early: move the
consumer_started.set() so it runs after the internal consumer is initialized
inside run() (or after consumer.run() completes its startup logic) to avoid
calling consumer.seek_to_beginning()/consumer.seek_to_end() before
consumer._consumer is assigned; locate the run_with_signal coroutine and ensure
the readiness event is set only once consumer.run() has created/started the
AIOKafkaConsumer (or add an explicit check/wait for consumer._consumer to be
non-None before performing seeks).
In `@backend/workers/run_coordinator.py`:
- Around line 42-64: The code awaits asyncio.wait([run_task, shutdown_task]) but
only inspects pending tasks, so exceptions from completed tasks (e.g., run_task
from run_coordinator_tasks()) are never re-raised; update the logic after
asyncio.wait to iterate over the done set and call task.result() (or check
task.exception()) for each done task and re-raise or surface any exceptions (at
least for run_task) before cancelling pending tasks; locate
run_coordinator_tasks, run_task, shutdown_task, and the asyncio.wait call to
implement this change so failures in run_coordinator_tasks propagate correctly.
In `@backend/workers/run_k8s_worker.py`:
- Around line 49-64: The code creates run_task =
asyncio.create_task(run_worker_tasks()) and shutdown_task =
asyncio.create_task(shutdown_event.wait()) then awaits asyncio.wait but never
retrieves results from completed tasks, so exceptions from run_worker_tasks()
are swallowed; after the asyncio.wait call inspect each task in the done set
(e.g., check task.done() and call task.result() or task.exception()) and
re-raise or propagate any exception found (for example by calling task.result()
which will raise) before cancelling pending tasks so that worker failures
surface instead of being masked.
In `@backend/workers/run_result_processor.py`:
- Around line 35-53: The current shutdown sequence swallows exceptions from
consumer.run() because run_task's exception is never retrieved; after the
asyncio.wait completes, inspect the completed tasks (e.g., check if run_task is
in the done set), call run_task.result() (or handle/run_task.exception()) to
surface any exception, log it with logger.error including context, and re-raise
or set an exit error before cancelling pending tasks (shutdown_task) so failures
in consumer.run() are not treated as normal shutdown; update the logic around
run_task/run_task.result(), shutdown_task, and shutdown_event accordingly.
In `@backend/workers/run_saga_orchestrator.py`:
- Around line 54-70: The current shutdown logic cancels pending tasks but can
silently swallow exceptions from completed tasks like run_task; after
asyncio.wait returns, inspect the done set (e.g., run_task in done) and
propagate any errors by awaiting the completed tasks or checking
task.exception() and re-raising it before returning/finishing; keep existing
cancellation/await handling for pending tasks (shutdown_task) but ensure
exceptions from run_orchestrator_tasks are not ignored so the orchestrator
failure surfaces to the caller.
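All five worker prompts above ask for the same pattern; a minimal sketch (the function and variable names are illustrative):

```python
import asyncio


async def wait_and_surface_errors(
    run_task: asyncio.Task[None],
    shutdown_task: asyncio.Task[None],
) -> None:
    """Wait for either task, then re-raise any exception stored in done tasks."""
    done, pending = await asyncio.wait(
        {run_task, shutdown_task}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    for task in done:
        # result() re-raises any stored exception, so a crashed
        # consumer/orchestrator surfaces instead of being swallowed.
        task.result()
```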
♻️ Duplicate comments (4)
backend/app/events/event_store_consumer.py (1)
102-104: Add error handling for batch storage and commit failures.

If `_store_batch()` or `consumer.commit()` throws an exception (e.g., transient MongoDB or Kafka broker issue), the consumer will terminate entirely. Consider wrapping these operations in a try-except to log and continue processing subsequent batches.

🐛 Proposed fix

```diff
 if events:
-    await self._store_batch(events)
-    await consumer.commit()
+    try:
+        await self._store_batch(events)
+        await consumer.commit()
+    except Exception as e:
+        self.logger.error(f"Failed to store/commit batch: {e}", exc_info=True)
+        # Continue processing - Kafka will redeliver uncommitted messages
```

backend/app/services/pod_monitor/monitor.py (1)
277-285: Max reconnect RuntimeError terminates monitor ungracefully.

This issue was previously flagged. When `_backoff()` raises `RuntimeError` after max attempts, it propagates through `_watch_loop()` to `run()`, which only catches `CancelledError`. The monitor stops without cleanup or recovery path. Consider either catching this in `run()` for graceful degradation or using cancellation-based exit.

backend/app/services/saga/saga_logic.py (1)
299-321: Keep the timeout loop resilient to transient errors.

As written, any unexpected exception will terminate the loop permanently. This matches the earlier concern and still applies here.
🐛 Proposed fix: log and continue on unexpected errors
```diff
 async def check_timeouts_loop(self) -> None:
     """Check for saga timeouts (runs until cancelled)."""
     try:
         while True:
-            # Check every 30 seconds
-            await asyncio.sleep(30)
-
-            cutoff_time = datetime.now(UTC) - timedelta(seconds=self.config.timeout_seconds)
-
-            timed_out = await self._repo.find_timed_out_sagas(cutoff_time)
-
-            for instance in timed_out:
-                self.logger.warning(f"Saga {instance.saga_id} timed out")
-
-                instance.state = SagaState.TIMEOUT
-                instance.error_message = f"Saga timed out after {self.config.timeout_seconds} seconds"
-                instance.completed_at = datetime.now(UTC)
-
-                await self._save_saga(instance)
-                self._running_instances.pop(instance.saga_id, None)
+            try:
+                # Check every 30 seconds
+                await asyncio.sleep(30)
+
+                cutoff_time = datetime.now(UTC) - timedelta(seconds=self.config.timeout_seconds)
+
+                timed_out = await self._repo.find_timed_out_sagas(cutoff_time)
+
+                for instance in timed_out:
+                    self.logger.warning(f"Saga {instance.saga_id} timed out")
+
+                    instance.state = SagaState.TIMEOUT
+                    instance.error_message = f"Saga timed out after {self.config.timeout_seconds} seconds"
+                    instance.completed_at = datetime.now(UTC)
+
+                    await self._save_saga(instance)
+                    self._running_instances.pop(instance.saga_id, None)
+            except Exception as e:
+                self.logger.error(f"Error in timeout checker: {e}", exc_info=True)
+                await asyncio.sleep(5)
     except asyncio.CancelledError:
         self.logger.info("Timeout checker cancelled")
```

backend/tests/unit/services/saga/test_saga_orchestrator_unit.py (1)
93-99: Assertion doesn’t validate handle_event side effects.

`register_saga` already populates `_sagas`, so the current assertion can pass even if `handle_event` does nothing. Consider asserting on a side effect (e.g., repo save or saga state change) after the handler runs.
🧹 Nitpick comments (8)
backend/app/services/k8s_worker/worker_logic.py (4)
56-56: Redundant fallback for non-optional parameter.

The `config` parameter is typed as `K8sWorkerConfig` (non-optional), yet line 56 has a fallback `config or K8sWorkerConfig()`. Either make the parameter `Optional[K8sWorkerConfig]` or remove the fallback to match the signature.

```diff
-    self.config = config or K8sWorkerConfig()
+    self.config = config
```
154-154: Fire-and-forget task lacks exception propagation.

The task created here is not stored, so any unhandled exception will be silently lost (logged only). While the current implementation handles exceptions internally in `_create_pod_for_execution`, consider storing task references for better observability and graceful shutdown handling.
305-308: Defensive null checks on required dependency.

`self.producer` is assigned from a required constructor parameter (non-optional `UnifiedProducer`). These null checks suggest the producer might be None, but that would indicate a constructor bug. Either remove the checks or type the field as `Optional[UnifiedProducer]` if nullability is intentional.

Also applies to: 319-322, 336-339
356-356: Undocumented hardcoded delay.

The 5-second sleep before DaemonSet operations appears arbitrary. Consider documenting the reason (e.g., waiting for API server readiness) or making it configurable via `K8sWorkerConfig`.

backend/tests/e2e/test_k8s_worker_create_pod.py (1)
83-91: Resource cleanup may be skipped on test failure.

If the assertions on lines 85-87 fail, the cleanup on lines 90-91 won't execute, potentially leaving orphaned Kubernetes resources. Consider wrapping the verification and cleanup in a try/finally block.
🧹 Suggested fix for reliable cleanup
```diff
 # Verify resources exist
-got_cm = logic.v1.read_namespaced_config_map(name=f"script-{exec_id}", namespace=ns)
-assert got_cm is not None
-got_pod = logic.v1.read_namespaced_pod(name=f"executor-{exec_id}", namespace=ns)
-assert got_pod is not None
-
-# Cleanup
-logic.v1.delete_namespaced_pod(name=f"executor-{exec_id}", namespace=ns)
-logic.v1.delete_namespaced_config_map(name=f"script-{exec_id}", namespace=ns)
+try:
+    got_cm = logic.v1.read_namespaced_config_map(name=f"script-{exec_id}", namespace=ns)
+    assert got_cm is not None
+    got_pod = logic.v1.read_namespaced_pod(name=f"executor-{exec_id}", namespace=ns)
+    assert got_pod is not None
+finally:
+    # Cleanup regardless of assertion outcome
+    logic.v1.delete_namespaced_pod(name=f"executor-{exec_id}", namespace=ns)
+    logic.v1.delete_namespaced_config_map(name=f"script-{exec_id}", namespace=ns)
```

backend/app/services/sse/event_router.py (1)
42-46: Avoid full `model_dump()` just to read `execution_id`.

`model_dump()` allocates a full dict (potentially large payloads). A direct attribute lookup avoids the extra work.

Suggested refactor

```diff
-    data = event.model_dump()
-    execution_id = data.get("execution_id")
+    execution_id = getattr(event, "execution_id", None)
```

backend/tests/integration/services/sse/test_partitioned_event_router.py (1)
39-41: Consider closing the subscription after the test.

The test opens a Redis subscription but doesn't explicitly close it. While the fixture cleanup may handle the Redis client, explicitly closing the subscription would be cleaner.
♻️ Suggested cleanup
```diff
 # Await the subscription - verify event arrived in Redis
 msg = await asyncio.wait_for(subscription.get(RedisSSEMessage), timeout=2.0)
 assert msg is not None
 assert str(msg.event_type) == str(ev.event_type)
+
+await subscription.close()
```

backend/app/services/saga/saga_service.py (1)
162-178: Consider renaming method to reflect new implementation.

The method `get_saga_status_from_orchestrator` references "orchestrator" which no longer exists. Consider renaming to something like `get_saga_status_live` or `get_live_saga_status` to better reflect that it fetches from `SagaLogic`'s in-memory state first.
```python
@provide(scope=Scope.REQUEST)
def get_sse_shutdown_manager(
    self, logger: logging.Logger, connection_metrics: ConnectionMetrics
) -> SSEShutdownManager:
    return create_sse_shutdown_manager(logger=logger, connection_metrics=connection_metrics)

def get_sse_connection_registry(
    self,
    logger: logging.Logger,
    connection_metrics: ConnectionMetrics,
) -> SSEConnectionRegistry:
    return SSEConnectionRegistry(
        logger=logger,
        connection_metrics=connection_metrics,
    )
```
🧩 Analysis chain
🏁 Script executed:

```shell
# Find SSEConnectionRegistry class definition
fd -t f "\.py$" | xargs grep -l "class SSEConnectionRegistry"
```

Repository: HardMax71/Integr8sCode
Length of output: 188

🏁 Script executed:

```shell
# Search for get_connection_count usage
rg "get_connection_count" --type py -A 3 -B 3
```

Repository: HardMax71/Integr8sCode
Length of output: 6270

🏁 Script executed:

```shell
# Search for SSEConnectionRegistry usage
rg "SSEConnectionRegistry" --type py -A 2 -B 2
```

Repository: HardMax71/Integr8sCode
Length of output: 6034

🏁 Script executed:

```shell
# Check how SSEService is provided
rg "@provide.*SSEService" --type py -A 5 -B 2
```

Repository: HardMax71/Integr8sCode
Length of output: 48

🏁 Script executed:

```shell
# Check for Scope definitions and their meanings
rg "class Scope" --type py -A 10
```

Repository: HardMax71/Integr8sCode
Length of output: 48

🏁 Script executed:

```shell
# Look at sse_connection_registry.py to understand persistence expectations
cat -n backend/app/services/sse/sse_connection_registry.py | head -50
```

Repository: HardMax71/Integr8sCode
Length of output: 2296

🏁 Script executed:

```shell
# Search for SSEService provider definition more broadly
rg "get_sse_service|SSEService" backend/app/core/providers.py -A 5 -B 5
```

Repository: HardMax71/Integr8sCode
Length of output: 1278

🏁 Script executed:

```shell
# Look at the full providers.py around the SSEService definition
rg "def get_" backend/app/core/providers.py | grep -i sse
```

Repository: HardMax71/Integr8sCode
Length of output: 355

🏁 Script executed:

```shell
# Check SSEService class to understand if it needs app-wide state
cat -n backend/app/services/sse/sse_service.py | head -80
```

Repository: HardMax71/Integr8sCode
Length of output: 3831

🏁 Script executed:

```shell
# Check if there's any caching or special handling for REQUEST scope singletons
rg "Scope\." backend/app/core/providers.py -B 2 -A 1 | head -40
```

Repository: HardMax71/Integr8sCode
Length of output: 706

🏁 Script executed:

```shell
# Look for health status endpoint that uses get_connection_count
rg "get_health_status" --type py -A 10 -B 5
```

Repository: HardMax71/Integr8sCode
Length of output: 3730

🏁 Script executed:

```shell
# Search for any test that shows REQUEST vs APP scope behavior
rg "Scope\.(REQUEST|APP)" backend/tests --type py | head -20
```

Repository: HardMax71/Integr8sCode
Length of output: 48
SSEConnectionRegistry scope must be Scope.APP, not Scope.REQUEST.
With Scope.REQUEST, a new registry instance is created per HTTP request, resulting in an empty _active_connections dictionary on every health status check. This causes get_connection_count() to always return 0. The registry is designed to track active connections globally across the application—registrations from one request's SSE stream must be visible when the health endpoint is called from a different request. The internal asyncio.Lock() and the persistent state expectations confirm the registry is intended as application-wide infrastructure.
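The fix amounts to one scope change in the provider; a sketch assuming dishka's `Provider`/`provide` API and the project's own `SSEConnectionRegistry` and `ConnectionMetrics` types from the snippet above:

```python
import logging

from dishka import Provider, Scope, provide


class SSEProvider(Provider):
    @provide(scope=Scope.APP)  # was Scope.REQUEST; one registry per app
    def get_sse_connection_registry(
        self,
        logger: logging.Logger,
        connection_metrics: ConnectionMetrics,
    ) -> SSEConnectionRegistry:
        return SSEConnectionRegistry(
            logger=logger,
            connection_metrics=connection_metrics,
        )
```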
🤖 Prompt for AI Agents
In `@backend/app/core/providers.py` around lines 463 - 472, The provider
get_sse_connection_registry currently creates an SSEConnectionRegistry per
request (provide(scope=Scope.REQUEST)), which resets _active_connections and
causes get_connection_count() to always return 0; change the provider to use
application scope (provide(scope=Scope.APP)) so SSEConnectionRegistry is a
singleton for the app, preserving the asyncio.Lock() and shared
_active_connections across requests and allowing registrations from SSE streams
to be visible to the health endpoint.
```python
async def register_connection(self, execution_id: str, connection_id: str) -> None:
    """Register a new SSE connection."""
    async with self._lock:
        if execution_id not in self._active_connections:
            self._active_connections[execution_id] = set()

        self._active_connections[execution_id].add(connection_id)
        self.logger.debug("Registered SSE connection", extra={"connection_id": connection_id})
        self.metrics.increment_sse_connections("executions")

async def unregister_connection(self, execution_id: str, connection_id: str) -> None:
    """Unregister an SSE connection."""
    async with self._lock:
        if execution_id in self._active_connections:
            self._active_connections[execution_id].discard(connection_id)
            if not self._active_connections[execution_id]:
                del self._active_connections[execution_id]

        self.logger.debug("Unregistered SSE connection", extra={"connection_id": connection_id})
        self.metrics.decrement_sse_connections("executions")
```
Prevent metrics drift on duplicate register/unregister.
Line 36 and Line 49 update metrics even when the connection was already present/absent, which can over/under-count (and potentially go negative). Consider only updating metrics when the set actually changes.
🛠️ Suggested fix
```diff
 async def register_connection(self, execution_id: str, connection_id: str) -> None:
     """Register a new SSE connection."""
     async with self._lock:
-        if execution_id not in self._active_connections:
-            self._active_connections[execution_id] = set()
-
-        self._active_connections[execution_id].add(connection_id)
-        self.logger.debug("Registered SSE connection", extra={"connection_id": connection_id})
-        self.metrics.increment_sse_connections("executions")
+        conns = self._active_connections.setdefault(execution_id, set())
+        if connection_id not in conns:
+            conns.add(connection_id)
+            self.metrics.increment_sse_connections("executions")
+        self.logger.debug("Registered SSE connection", extra={"connection_id": connection_id})

 async def unregister_connection(self, execution_id: str, connection_id: str) -> None:
     """Unregister an SSE connection."""
     async with self._lock:
-        if execution_id in self._active_connections:
-            self._active_connections[execution_id].discard(connection_id)
-            if not self._active_connections[execution_id]:
-                del self._active_connections[execution_id]
-
-        self.logger.debug("Unregistered SSE connection", extra={"connection_id": connection_id})
-        self.metrics.decrement_sse_connections("executions")
+        conns = self._active_connections.get(execution_id)
+        if conns and connection_id in conns:
+            conns.remove(connection_id)
+            if not conns:
+                del self._active_connections[execution_id]
+            self.metrics.decrement_sse_connections("executions")
+        self.logger.debug("Unregistered SSE connection", extra={"connection_id": connection_id})
```

🤖 Prompt for AI Agents
+ self.logger.debug("Unregistered SSE connection", extra={"connection_id": connection_id})🤖 Prompt for AI Agents
In `@backend/app/services/sse/sse_connection_registry.py` around lines 30 - 49,
The register_connection and unregister_connection currently call
metrics.increment_sse_connections / decrement_sse_connections unconditionally,
causing drift on duplicate adds/removes; modify register_connection to only call
metrics.increment_sse_connections("executions") (and log) if connection_id was
not already present in self._active_connections[execution_id], and modify
unregister_connection to only call
metrics.decrement_sse_connections("executions") (and log) if connection_id was
actually removed (i.e., it existed before discard and the set changed or length
decreased); keep the async with self._lock around these checks and mutations and
use the same symbols (_active_connections, register_connection,
unregister_connection, metrics.increment_sse_connections,
metrics.decrement_sse_connections) to locate the changes.
```diff
 async def run_with_signal() -> None:
     consumer_started.set()
     await consumer.run()

 task = asyncio.create_task(run_with_signal())

 try:
-    st = c.get_status()
-    assert st.state == "running" and st.is_running is True
-    # Exercise seek functions; don't force specific partition offsets
-    await c.seek_to_beginning()
-    await c.seek_to_end()
-    # No need to sleep; just ensure we can call seek APIs while running
+    # Wait for consumer to start
+    await asyncio.wait_for(consumer_started.wait(), timeout=5.0)

+    # Exercise seek functions while consumer is running
+    await consumer.seek_to_beginning()
+    await consumer.seek_to_end()
```
Test may have a race: consumer_started signals before consumer is fully running.
consumer_started.set() fires before consumer.run() is called, so the seek operations at lines 51-52 may execute before the internal AIOKafkaConsumer is started. The seek_to_beginning() and seek_to_end() methods guard with if self._consumer, but self._consumer is only assigned inside run() after construction.
Consider moving the event signal after consumer initialization:
💡 Suggested improvement
```diff
 async def run_with_signal() -> None:
-    consumer_started.set()
-    await consumer.run()
+    # Small delay to ensure consumer.run() has started and set self._consumer
+    await asyncio.sleep(0.1)
+    consumer_started.set()
+    await consumer.run()
```

Alternatively, add a short delay before exercising seek methods, or check that consumer._consumer is not None before proceeding.
🤖 Prompt for AI Agents
In `@backend/tests/integration/events/test_consumer_lifecycle.py` around lines 40
- 52, The test signals readiness too early: move the consumer_started.set() so
it runs after the internal consumer is initialized inside run() (or after
consumer.run() completes its startup logic) to avoid calling
consumer.seek_to_beginning()/consumer.seek_to_end() before consumer._consumer is
assigned; locate the run_with_signal coroutine and ensure the readiness event is
set only once consumer.run() has created/started the AIOKafkaConsumer (or add an
explicit check/wait for consumer._consumer to be non-None before performing
seeks).
5 issues found across 38 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="backend/app/services/idempotency/faststream_middleware.py">
<violation number="1" location="backend/app/services/idempotency/faststream_middleware.py:80">
P2: If `mark_failed` throws an exception (e.g., network error), the original handler exception will be masked. Wrap `mark_failed` in a try-except to preserve the original exception.</violation>
</file>
<file name="backend/pyproject.toml">
<violation number="1" location="backend/pyproject.toml:92">
P2: Pin the dataclasses-avroschema dependency to a specific version to keep dependency resolution reproducible and consistent with the rest of the project.</violation>
<violation number="2" location="backend/pyproject.toml:127">
P2: Pin faststream to a specific version (or add an upper bound) to avoid unexpected breaking changes during dependency resolution.</violation>
</file>
<file name="backend/workers/run_result_processor.py">
<violation number="1" location="backend/workers/run_result_processor.py:157">
P2: The catch-all handler silently discards unhandled events without logging. This could mask routing issues or missing handlers for new event types. Consider logging a warning when unknown events are received.</violation>
</file>
<file name="backend/workers/run_k8s_worker.py">
<violation number="1" location="backend/workers/run_k8s_worker.py:142">
P2: Silent catch-all handler will discard unmatched events without any logging. This can mask bugs in filter logic or cause silent data loss when new event types are added. Consider logging unmatched events at warning level to aid debugging.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
```python
    await idempotency.mark_completed(event=event, key_strategy="event_based")
    return handler_result
except Exception as e:
    await idempotency.mark_failed(event=event, error=str(e), key_strategy="event_based")
```
P2: If mark_failed throws an exception (e.g., network error), the original handler exception will be masked. Wrap mark_failed in a try-except to preserve the original exception.
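A sketch of the guard this comment asks for, reusing the call shapes from the snippet above; the surrounding function is hypothetical:

```python
import logging
from typing import Any, Awaitable, Callable

logger = logging.getLogger(__name__)


async def run_idempotent(
    idempotency: Any,
    event: Any,
    handler: Callable[[Any], Awaitable[Any]],
) -> Any:
    try:
        result = await handler(event)
        await idempotency.mark_completed(event=event, key_strategy="event_based")
        return result
    except Exception as e:
        try:
            await idempotency.mark_failed(event=event, error=str(e), key_strategy="event_based")
        except Exception:
            # Recording the failure must not replace the handler's
            # original exception, so log and fall through to `raise`.
            logger.exception("mark_failed itself failed")
        raise
```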
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/app/services/idempotency/faststream_middleware.py, line 80:
<comment>If `mark_failed` throws an exception (e.g., network error), the original handler exception will be masked. Wrap `mark_failed` in a try-except to preserve the original exception.</comment>
<file context>
@@ -0,0 +1,81 @@
+ await idempotency.mark_completed(event=event, key_strategy="event_based")
+ return handler_result
+ except Exception as e:
+ await idempotency.mark_failed(event=event, error=str(e), key_strategy="event_based")
+ raise
</file context>
| "yarl==1.20.1", | ||
| "zipp==3.20.2", | ||
| "monggregate==0.22.1", | ||
| "faststream[kafka]>=0.6.0", |
P2: Pin faststream to a specific version (or add an upper bound) to avoid unexpected breaking changes during dependency resolution.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/pyproject.toml, line 127:
<comment>Pin faststream to a specific version (or add an upper bound) to avoid unexpected breaking changes during dependency resolution.</comment>
<file context>
@@ -124,6 +124,7 @@ dependencies = [
"yarl==1.20.1",
"zipp==3.20.2",
"monggregate==0.22.1",
+ "faststream[kafka]>=0.6.0",
]
</file context>
| "faststream[kafka]>=0.6.0", | |
| "faststream[kafka]==0.6.0", |
| "pyasn1_modules==0.4.2", | ||
| "pydantic==2.9.2", | ||
| "pydantic-avro==0.9.1", | ||
| "dataclasses-avroschema[pydantic]>=0.65.0", |
P2: Pin the dataclasses-avroschema dependency to a specific version to keep dependency resolution reproducible and consistent with the rest of the project.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/pyproject.toml, line 92:
<comment>Pin the dataclasses-avroschema dependency to a specific version to keep dependency resolution reproducible and consistent with the rest of the project.</comment>
<file context>
@@ -89,7 +89,7 @@ dependencies = [
"pyasn1_modules==0.4.2",
"pydantic==2.9.2",
- "pydantic-avro==0.9.1",
+ "dataclasses-avroschema[pydantic]>=0.65.0",
"pydantic-settings==2.5.2",
"pydantic_core==2.23.4",
</file context>
| "dataclasses-avroschema[pydantic]>=0.65.0", | |
| "dataclasses-avroschema[pydantic]==0.65.0", |
```python
# Default handler for unmatched events (prevents message loss)
@subscriber()
async def handle_other(event: DomainEvent) -> None:
```
P2: The catch-all handler silently discards unhandled events without logging. This could mask routing issues or missing handlers for new event types. Consider logging a warning when unknown events are received.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/workers/run_result_processor.py, line 157:
<comment>The catch-all handler silently discards unhandled events without logging. This could mask routing issues or missing handlers for new event types. Consider logging a warning when unknown events are received.</comment>
<file context>
@@ -76,9 +75,106 @@ def main() -> None:
+
+ # Default handler for unmatched events (prevents message loss)
+ @subscriber()
+ async def handle_other(event: DomainEvent) -> None:
+ pass
+
</file context>
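A hedged sketch of the logging variant this comment asks for, reusing the `@subscriber()` decorator and `DomainEvent` type from the diff context (not redefined here); the logger name is an assumption:

```python
import logging

logger = logging.getLogger("result_processor")


# Default handler for unmatched events: still acks them, but leaves a trace
@subscriber()
async def handle_other(event: DomainEvent) -> None:
    logger.warning(
        "Discarding unmatched event",
        extra={"event_type": getattr(event, "event_type", type(event).__name__)},
    )
```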
```python
# Default handler for unmatched events (prevents message loss)
@subscriber()
async def handle_other(event: DomainEvent) -> None:
```
P2: Silent catch-all handler will discard unmatched events without any logging. This can mask bugs in filter logic or cause silent data loss when new event types are added. Consider logging unmatched events at warning level to aid debugging.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/workers/run_k8s_worker.py, line 142:
<comment>Silent catch-all handler will discard unmatched events without any logging. This can mask bugs in filter logic or cause silent data loss when new event types are added. Consider logging unmatched events at warning level to aid debugging.</comment>
<file context>
@@ -88,9 +69,99 @@ def main() -> None:
+
+ # Default handler for unmatched events (prevents message loss)
+ @subscriber()
+ async def handle_other(event: DomainEvent) -> None:
+ pass
+
</file context>
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/services/pod_monitor/event_mapper.py (1)
11-21: Guard `ResourceUsageAvro` construction when `resource_usage` is null.
If executor output includes `"resource_usage": null`, the current call raises and drops the event. Use a safe fallback to `{}`.

🐛 Suggested fix

```diff
 return PodLogs(
     stdout=data.get("stdout", ""),
     stderr=data.get("stderr", ""),
     exit_code=data.get("exit_code", 0),
-    resource_usage=ResourceUsageAvro(**data.get("resource_usage", {})),
+    resource_usage=ResourceUsageAvro(**(data.get("resource_usage") or {})),
 )
```

Also applies to: 40-47, 491-502
🤖 Fix all issues with AI agents
In `@backend/tests/integration/conftest.py`:
- Around line 74-83: The sse_redis_bus fixture calls SSERedisBus with
redis_client as the only positional argument but SSERedisBus.__init__ expects
logger as the second positional parameter; fix by passing the test logger as the
second positional argument (e.g., SSERedisBus(redis_client, _test_logger,
exec_prefix=..., notif_prefix=...)) or convert to all keywords including
logger=_test_logger to ensure the logger is provided correctly.
In `@backend/tests/integration/result_processor/test_result_processor.py`:
- Around line 72-83: Change the consumer setup in the test to avoid replaying
historical Kafka events: when creating the test ConsumerConfig (the instance
assigned to processor_consumer_config) set auto_offset_reset="latest" instead of
"earliest" and avoid extremely small batch sizes (e.g., remove or increase
max_poll_records=1) so the consumer won't process long retention history slowly;
also ensure test code starts the consumer (the consumer created from
processor_consumer_config) before producing the test events so only new messages
are consumed.
♻️ Duplicate comments (4)
backend/app/core/providers.py (1)
461-470: SSEConnectionRegistry must use Scope.APP, not Scope.REQUEST.
With Scope.REQUEST, each HTTP request creates a new registry instance with an empty _active_connections dictionary. This causes get_connection_count() to always return 0 from the health endpoint since registrations from SSE streams are invisible to other requests. The registry is designed as application-wide infrastructure to track connections globally.

```diff
-@provide(scope=Scope.REQUEST)
+@provide(scope=Scope.APP)
 def get_sse_connection_registry(
```

backend/workers/run_pod_monitor.py (1)

74-88: Exceptions from monitor.run() are silently swallowed.
When asyncio.wait() returns, if monitor_task completed with an exception (e.g., K8s API failure), that exception is never retrieved from the done set. The worker exits normally without logging the actual failure cause.

🐛 Proposed fix

```diff
 done, pending = await asyncio.wait(
     [monitor_task, shutdown_task],
     return_when=asyncio.FIRST_COMPLETED,
 )
+# Check for exceptions in completed tasks
+for task in done:
+    if task is not shutdown_task and not task.cancelled():
+        exc = task.exception()
+        if exc:
+            logger.error("PodMonitor crashed", exc_info=exc)
+            raise exc
+
 for task in pending:
     task.cancel()
     with suppress(asyncio.CancelledError):
         await task
```

backend/app/core/dishka_lifespan.py (1)

105-119: Background task failures may go unnoticed until shutdown.
Tasks created with asyncio.create_task() that fail during normal operation will have their exceptions silently swallowed until shutdown, when gather(*tasks, return_exceptions=True) is called. If run_sse_consumers(), event_store_consumer.run(), or notification_service.run() raises an exception, the application continues running in a degraded state without any indication.
Consider adding done callbacks to log failures and optionally trigger shutdown on critical task failures.

🛠️ Suggested supervision pattern

```diff
 tasks = [
     asyncio.create_task(run_sse_consumers(), name="sse_consumers"),
     asyncio.create_task(event_store_consumer.run(), name="event_store_consumer"),
     asyncio.create_task(notification_service.run(), name="notification_service"),
 ]
+
+def _handle_task_done(task: asyncio.Task) -> None:
+    if task.cancelled():
+        return
+    exc = task.exception()
+    if exc:
+        logger.error(
+            "Background task failed",
+            extra={"task_name": task.get_name()},
+            exc_info=exc,
+        )
+
+for task in tasks:
+    task.add_done_callback(_handle_task_done)
+
 logger.info(
     "Background services started",
     extra={"sse_consumer_count": len(sse_consumers)},
 )
```

backend/app/services/saga/saga_logic.py (1)

315-325: Timeout loop should not die on transient exceptions.
Any unexpected exception inside check_timeouts_once() will terminate the loop and stop timeout enforcement. Wrap the loop body to log-and-continue on generic exceptions.

🛠️ Suggested hardening

```diff
 async def check_timeouts_loop(self) -> None:
     """Check for saga timeouts (runs until cancelled)."""
     try:
         while True:
             # Check every 30 seconds
             await asyncio.sleep(30)
-            await self.check_timeouts_once()
+            try:
+                await self.check_timeouts_once()
+            except Exception:
+                self.logger.error("Timeout checker failed; continuing loop", exc_info=True)
     except asyncio.CancelledError:
         self.logger.info("Timeout checker cancelled")
```
🧹 Nitpick comments (8)
backend/app/services/saga/execution_saga.py (1)
62-100: Use event.execution_id as a fallback for robustness.
If execution_id is ever missing in context (future refactors or alternate entry points), this step will fail before publishing. A simple fallback makes it resilient.

♻️ Suggested tweak

```diff
-execution_id = context.get("execution_id")
+execution_id = context.get("execution_id") or event.execution_id
```

backend/workers/run_result_processor.py (1)

134-158: Handlers call private methods (_handle_*) on ProcessorLogic.
The handlers directly invoke methods prefixed with _ (e.g., _handle_completed, _handle_failed), which conventionally indicates internal/private methods not intended for external use. Consider either making these methods public (remove the underscore prefix) or providing a public interface method; a facade sketch follows this section.

backend/app/services/idempotency/faststream_middleware.py (2)

44-50: Fail-open approach silently skips idempotency on DI failures.
The broad except Exception catch ensures message processing continues even when dependencies are unavailable. Consider logging a warning here to aid debugging when idempotency is unexpectedly skipped due to misconfiguration.

♻️ Suggested improvement

```diff
 # Resolve dependencies from request-scoped container
 try:
     idempotency = await container.get(IdempotencyManager)
     schema_registry = await container.get(SchemaRegistryManager)
-except Exception:  # Dependencies not available - skip idempotency
+except Exception as e:  # Dependencies not available - skip idempotency
+    # Log at debug level to help diagnose misconfiguration
+    import logging
+    logging.getLogger(__name__).debug(
+        "Skipping idempotency: failed to resolve dependencies: %s", e
+    )
     return await call_next(msg)
```

58-62: Hardcoded topic string "idempotency" passed to deserialize_event.
The topic parameter in deserialize_event appears to be used only for error context. If the actual topic is available from msg, consider passing it for more accurate error messages during debugging.

backend/workers/run_saga_orchestrator.py (1)

160-161: Consider structured logging for consistency.
Using f-string interpolation for logging is inconsistent with the structured logging pattern (extra={}) used elsewhere in the codebase.

♻️ Suggested change

```diff
-app_logger.info(f"Subscribing to topics: {topics}")
+app_logger.info("Subscribing to topics", extra={"topics": topics})
```

backend/app/events/schema/schema_registry.py (1)

84-102: Minor redundancy: AvroSchema created twice.
_ensure_schema_registered already creates an AvroSchema object (line 61), but serialize_event creates it again (line 92). Consider caching the AvroSchema object alongside the schema ID to avoid redundant creation.

backend/app/services/result_processor/processor_logic.py (2)

54-66: Consider explicit type check instead of assert for production safety.
Using assert isinstance(event, ...) for type narrowing works, but assertions are stripped when Python runs with -O (optimized mode). If this could happen in production, the type check would be silently skipped.

♻️ Safer alternative

```diff
 async def _handle_completed_wrapper(self, event: DomainEvent) -> None:
-    assert isinstance(event, ExecutionCompletedEvent)
+    if not isinstance(event, ExecutionCompletedEvent):
+        raise TypeError(f"Expected ExecutionCompletedEvent, got {type(event).__name__}")
     await self._handle_completed(event)
```

Apply similarly to _handle_failed_wrapper and _handle_timeout_wrapper.

177-193: Hardcoded service_version="1.0.0" in event metadata.
Consider reading the version from settings or a shared constant to avoid manual updates when the service version changes.
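Returning to the private-method nitpick above, a thin public facade is one option; a sketch assuming the `_handle_*` methods keep their current signatures (the event class names are taken from this review and may differ in the codebase):

```python
class ProcessorLogic:
    # ... existing fields and private _handle_* implementations ...

    async def handle_completed(self, event: "ExecutionCompletedEvent") -> None:
        """Public entrypoint; delegates to the private implementation."""
        await self._handle_completed(event)

    async def handle_failed(self, event: "ExecutionFailedEvent") -> None:
        await self._handle_failed(event)

    async def handle_timeout(self, event: "ExecutionTimeoutEvent") -> None:
        await self._handle_timeout(event)
```

The worker entrypoints then call `logic.handle_completed(event)` and friends, leaving the underscore methods free to change shape internally.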
```python
@pytest_asyncio.fixture
async def sse_redis_bus(redis_client: redis.Redis) -> SSERedisBus:
    """Provide SSERedisBus with unique prefixes for test isolation."""
    suffix = uuid.uuid4().hex[:6]
    return SSERedisBus(
        redis_client,
        exec_prefix=f"sse:exec:{suffix}:",
        notif_prefix=f"sse:notif:{suffix}:",
        logger=_test_logger,
    )
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find SSERedisBus class definition
fd -t f -e py . | xargs rg -l "class SSERedisBus"Repository: HardMax71/Integr8sCode
Length of output: 103
🏁 Script executed:
#!/bin/bash
# Read the SSERedisBus class definition to check __init__ signature
cat -n backend/app/services/sse/redis_bus.py | head -80Repository: HardMax71/Integr8sCode
Length of output: 3507
Fix argument order in SSERedisBus initialization.
The SSERedisBus.__init__ constructor requires logger as the second positional parameter (not optional). The fixture passes redis_client positionally but omits logger from the positional arguments, causing a runtime TypeError. Either pass logger as the second positional argument or use all keyword arguments:
```python
SSERedisBus(
    redis_client,
    _test_logger,  # Pass as 2nd positional arg
    exec_prefix=f"sse:exec:{suffix}:",
    notif_prefix=f"sse:notif:{suffix}:",
)
```

🤖 Prompt for AI Agents
In `@backend/tests/integration/conftest.py` around lines 74 - 83, The
sse_redis_bus fixture calls SSERedisBus with redis_client as the only positional
argument but SSERedisBus.__init__ expects logger as the second positional
parameter; fix by passing the test logger as the second positional argument
(e.g., SSERedisBus(redis_client, _test_logger, exec_prefix=...,
notif_prefix=...)) or convert to all keywords including logger=_test_logger to
ensure the logger is provided correctly.
backend/tests/integration/result_processor/test_result_processor.py (outdated comment, resolved)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/events/core/producer.py (1)
44-77: Potential UnboundLocalError if serialization fails before topic assignment.
If serialize_event raises an exception at line 49, topic is never assigned, but line 75 references it in the error handler.

Proposed fix

```diff
 async def produce(
     self, event_to_produce: DomainEvent, key: str | None = None, headers: dict[str, str] | None = None
 ) -> None:
     """Produce a message to Kafka."""
+    topic = f"{self._topic_prefix}{EVENT_TYPE_TO_TOPIC[event_to_produce.event_type]}"
     try:
         serialized_value = await self._schema_registry.serialize_event(event_to_produce)
-        topic = f"{self._topic_prefix}{EVENT_TYPE_TO_TOPIC[event_to_produce.event_type]}"

         # Convert headers to list of tuples format
         header_list = [(k, v.encode()) for k, v in headers.items()] if headers else None
```
🤖 Fix all issues with AI agents
In `@backend/app/events/event_store_consumer.py`:
- Around line 41-69: The process_batch method currently logs deserialization
errors from schema_registry_manager.deserialize_event and then commits offsets
via consumer.commit, which drops bad messages; change process_batch to collect
failures (e.g., a failed_messages list or counter) while iterating messages, for
each failed msg call a DLQ sender or increment event_metrics (e.g.,
event_metrics.record_deserialization_failure(topic=..., consumer_group=...)) and
only commit offsets for partitions/messages that succeeded (or skip committing
if any partition has failures), ensuring _store_batch is called only with
successfully deserialized events and consumer.commit is adjusted to avoid losing
failed messages; update logger.error to include a unique identifier for the
failed message to aid tracing.
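A rough shape of that collect-then-commit flow; `dlq_producer`, `event_metrics`, and the message fields here are assumptions standing in for the project's real interfaces, not its actual API:

```python
async def process_batch(messages, schema_registry, dlq_producer, event_metrics, store, consumer, logger) -> None:
    """Deserialize a batch, routing failures to a DLQ instead of dropping them."""
    events, failed = [], []
    for msg in messages:
        try:
            events.append(await schema_registry.deserialize_event(msg.value, msg.topic))
        except Exception:
            failed.append(msg)
            logger.error("Deserialization failed", extra={"topic": msg.topic, "offset": msg.offset})
    for msg in failed:
        await dlq_producer.send(msg)  # keep the raw bytes around for replay
        event_metrics.record_deserialization_failure(topic=msg.topic)
    if events:
        await store.store_batch(events)
    if not failed:  # per the prompt's stricter option: only commit a fully clean batch
        await consumer.commit()
```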
In `@backend/app/services/k8s_worker/worker_logic.py`:
- Around line 70-163: The duplicate-creation bug occurs because
handle_create_pod_command checks _active_creations but only schedules the task,
and the ID is added later inside _create_pod_for_execution, so two concurrent
calls can both schedule tasks; fix by reserving the execution_id immediately in
handle_create_pod_command (add execution_id to self._active_creations and call
self.metrics.update_k8s_active_creations) before asyncio.create_task, and remove
the duplicate add inside _create_pod_for_execution (leave the finally block that
discards the ID and updates metrics). Also guard the scheduling with a
try/except so if task creation fails you remove the reserved ID and update
metrics.
In `@backend/app/services/pod_monitor/monitor.py`:
- Around line 160-182: The synchronous kubernetes iterator returned by
self._watch.stream in _run_watch is blocking the event loop; replace the direct
call with an awaited asyncio.to_thread call (await
asyncio.to_thread(self._watch.stream, self._v1.list_namespaced_pod, **kwargs))
so the blocking stream creation runs off the event loop, then keep iterating and
awaiting self._process_raw_event(event) as before and call
_update_resource_version(stream) in the finally block; target the _run_watch
method and the usages of self._watch.stream, _process_raw_event, and
_update_resource_version when making this change.
In `@backend/tests/unit/services/saga/test_execution_saga_steps.py`:
- Line 204: The assertion in test_execution_saga_steps.py is a no-op because `or
True` makes it always pass; update the test to either remove the `or True` or
replace the assertion with a real type check such as asserting that
QueueExecutionStep().get_compensation() returns an instance/subclass of the
SagaCompensation base (e.g., use isinstance(result, SagaCompensation) or assert
issubclass(type(result), SagaCompensation) and reference
QueueExecutionStep.get_compensation and DeletePodCompensation to ensure the
intended compensation type is validated).
♻️ Duplicate comments (9)
backend/app/services/sse/event_router.py (1)
45-46: Inefficient use of model_dump() to extract a single attribute.
Since SSE-relevant events have execution_id as a direct attribute, use getattr(event, 'execution_id', None) to avoid serializing the entire model. Additionally, SSERedisBus.publish_event calls model_dump() internally, resulting in double serialization.

♻️ Proposed fix

```diff
 async def route_event(self, event: DomainEvent) -> None:
     """Route an event to Redis for SSE delivery."""
-    data = event.model_dump()
-    execution_id = data.get("execution_id")
+    execution_id = getattr(event, "execution_id", None)
```

backend/app/services/pod_monitor/monitor.py (2)

276-295: RuntimeError on max reconnect attempts terminates monitor without recovery path.
As noted in previous reviews, raising RuntimeError here terminates _watch_loop and the run() method without any recovery mechanism. The monitor silently stops functioning. Consider either (a supervisor sketch follows this section):

- Set a paused/failed state flag and exit gracefully
- Add a supervisor/restart mechanism
- Propagate via a more specific exception that callers can handle

109-129: CancelledError should be re-raised to maintain async cancellation contract.
As noted in previous reviews, catching CancelledError without re-raising breaks Python's async cancellation semantics. The task will appear to complete normally instead of being cancelled.

Fix: Re-raise CancelledError

```diff
 try:
     await self._watch_loop()
 except asyncio.CancelledError:
     self.logger.info("PodMonitor cancelled")
+    raise
 finally:
```

backend/app/services/saga/saga_logic.py (1)

308-318: Keep the timeout checker alive after transient errors.
check_timeouts_loop only catches CancelledError, so any repository error will terminate the loop permanently. Wrap each iteration in a catch-all and log.

💡 Proposed fix

```diff
 async def check_timeouts_loop(self) -> None:
     """Check for saga timeouts (runs until cancelled)."""
     try:
         while True:
             # Check every 30 seconds
             await asyncio.sleep(30)
-            await self.check_timeouts_once()
+            try:
+                await self.check_timeouts_once()
+            except Exception as e:
+                self.logger.error(f"Timeout checker error: {e}", exc_info=True)
     except asyncio.CancelledError:
         self.logger.info("Timeout checker cancelled")
```

backend/tests/unit/services/saga/test_saga_orchestrator_unit.py (1)

67-69: Strengthen the success-flow assertion.
len(logic._sagas) > 0 is already true after register_saga, so it doesn't validate handle_event. Assert on a side effect, e.g., saga persisted in the repo.

💡 Proposed fix

```diff
-assert len(logic._sagas) > 0  # noqa: SLF001
+found = await saga_repo.get_saga_by_execution_and_name("e", "s")
+assert found is not None
```

backend/app/core/providers.py (1)

567-576: SSEConnectionRegistry should be app-scoped.
With Scope.REQUEST, each request gets a fresh registry and _active_connections resets, so connection counts and cross-request visibility break. Use Scope.APP to keep a singleton registry.

💡 Proposed fix

```diff
-@provide(scope=Scope.REQUEST)
+@provide(scope=Scope.APP)
 def get_sse_connection_registry(
     self,
     logger: logging.Logger,
     connection_metrics: ConnectionMetrics,
 ) -> SSEConnectionRegistry:
```

backend/workers/run_k8s_worker.py (1)

147-150: Silent catch-all handler discards unmatched events without logging.
This catch-all handler silently discards any events that don't match the specific filters. While the comment says "prevents message loss," it actually causes silent data loss. Consider logging unmatched events at warning level to aid debugging and detect routing issues or missing handlers for new event types.

💡 Suggested fix

```diff
 # Default handler for unmatched events (prevents message loss)
 @subscriber()
-async def handle_other(event: DomainEvent) -> None:
-    pass
+async def handle_other(event: DomainEvent) -> None:
+    app_logger.warning(
+        "Unhandled event type received",
+        extra={"event_type": getattr(event, "event_type", "unknown")},
+    )
```

backend/workers/run_pod_monitor.py (1)

84-92: Exceptions from completed tasks are silently swallowed.
When monitor.run() fails with an exception, the task will be in the done set, but the exception is never retrieved or logged. This masks critical failures and makes debugging production issues difficult.

🐛 Proposed fix

```diff
 done, pending = await asyncio.wait(
     [monitor_task, shutdown_task],
     return_when=asyncio.FIRST_COMPLETED,
 )
+# Surface failures from completed tasks
+for task in done:
+    if task is not shutdown_task and not task.cancelled():
+        if exc := task.exception():
+            logger.error("PodMonitor task failed", exc_info=exc)
+            raise exc
+
 for task in pending:
     task.cancel()
     with suppress(asyncio.CancelledError):
         await task
```

backend/workers/run_result_processor.py (1)

162-165: Silent catch-all handler discards unmatched events without logging.
Same issue as in run_k8s_worker.py. The catch-all handler silently discards events that don't match specific filters. Add logging to aid debugging.

💡 Suggested fix

```diff
 # Default handler for unmatched events (prevents message loss)
 @subscriber()
-async def handle_other(event: DomainEvent) -> None:
-    pass
+async def handle_other(event: DomainEvent) -> None:
+    app_logger.warning(
+        "Unhandled event type received",
+        extra={"event_type": getattr(event, "event_type", "unknown")},
+    )
```
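For the supervisor/restart option raised in the 276-295 comment above, a standalone sketch; the `PodMonitor.run()` coroutine and the backoff values are illustrative, not the project's API:

```python
import asyncio
import logging

logger = logging.getLogger("pod_monitor")


async def supervise(monitor, max_backoff: float = 60.0) -> None:
    """Restart the monitor with exponential backoff instead of dying on RuntimeError."""
    backoff = 1.0
    while True:
        try:
            await monitor.run()
            return  # clean exit, e.g. shutdown was requested
        except asyncio.CancelledError:
            raise  # preserve cancellation semantics
        except Exception:
            logger.error("Monitor crashed; restarting in %.0fs", backoff, exc_info=True)
            await asyncio.sleep(backoff)
            backoff = min(backoff * 2, max_backoff)
```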
🧹 Nitpick comments (10)
backend/workers/run_saga_orchestrator.py (1)
161-165: Opportunistic timeout check may miss timeouts during idle periods.
If no events arrive, check_timeouts_once() won't be called. For sagas with strict timeout requirements, consider adding a dedicated background task or using FastStream's scheduling capabilities.

💡 Alternative: Add a background timeout checker

```python
# In lifespan, after handler registration:
async def timeout_loop() -> None:
    while True:
        await asyncio.sleep(30)
        await logic.check_timeouts_once()

timeout_task = asyncio.create_task(timeout_loop())  # Cancel in finally block
```

However, if event throughput is consistently high, the opportunistic approach may be sufficient.

backend/tests/helpers/fakes/kubernetes.py (1)

37-41: FakeK8sAppsV1Api has no methods.
If tests need to interact with AppsV1Api methods (e.g., deployments, statefulsets), you'll need to add stub methods. Currently, it only satisfies basic DI wiring.

backend/tests/helpers/fakes/kafka.py (1)

38-78: Consider guarding __anext__ against consuming when not started.
The async iterator doesn't check _started state before yielding messages. While acceptable for test fakes, this could mask bugs where tests consume from an unstarted consumer.

Optional: Add started guard

```diff
 async def __anext__(self) -> Any:
+    if not self._started:
+        raise StopAsyncIteration
     if self._messages:
         return self._messages.pop(0)
     raise StopAsyncIteration
```

backend/tests/helpers/fakes/schema_registry.py (2)

17-22: Type introspection is fragile - consider adding defensive handling.
The _get_all_event_classes function assumes DomainEvent is Annotated[Union[...], Discriminator]. If the type structure changes, this will silently return incorrect results.

Add defensive check

```diff
 @lru_cache(maxsize=1)
 def _get_all_event_classes() -> list[type[DomainEvent]]:
     """Get all concrete event classes from DomainEvent union."""
-    union_type = get_args(DomainEvent)[0]  # Annotated[Union[...], Discriminator] -> Union
-    return list(get_args(union_type)) if get_origin(union_type) else [union_type]
+    args = get_args(DomainEvent)
+    if not args:
+        raise TypeError("DomainEvent must be a generic type with arguments")
+    union_type = args[0]  # Annotated[Union[...], Discriminator] -> Union
+    inner_args = get_args(union_type)
+    if not inner_args:
+        return [union_type] if isinstance(union_type, type) else []
+    return list(inner_args)
```

99-106: Consider caching the event class mapping.
The mapping is rebuilt on every deserialize_json call. Since _get_all_event_classes() is already cached, the mapping could be cached too.

Cache the mapping

```diff
+@lru_cache(maxsize=1)
+def _get_event_class_mapping(self) -> dict[EventType, type[DomainEvent]]:
+    """Get cached mapping from event type to class."""
+    return {cls.model_fields["event_type"].default: cls for cls in _get_all_event_classes()}
+
 def deserialize_json(self, data: dict[str, Any]) -> DomainEvent:
     """Deserialize JSON data to DomainEvent using event_type field."""
     if not (event_type_str := data.get("event_type")):
         raise ValueError("Missing event_type in event data")
-    mapping = {cls.model_fields["event_type"].default: cls for cls in _get_all_event_classes()}
+    mapping = self._get_event_class_mapping()
     if not (event_class := mapping.get(EventType(event_type_str))):
         raise ValueError(f"No event class found for event type: {event_type_str}")
     return event_class.model_validate(data)
```

Note: Since self is involved, you'd need functools.cached_property or a class-level cache instead of lru_cache.

backend/workers/run_result_processor.py (1)

140-160: Consider exposing public handler methods instead of calling private methods.
The handlers directly call underscore-prefixed methods (logic._handle_completed, logic._handle_failed, logic._handle_timeout), which are conventionally private. While this works, it couples the handlers to internal implementation details.
Consider adding public wrapper methods in ProcessorLogic (e.g., handle_completed, handle_failed, handle_timeout) that delegate to the private implementations, or rename the existing methods to be public if they're intended to be called externally.

backend/tests/e2e/test_k8s_worker_create_pod.py (1)

85-92: Consider wrapping cleanup in try/except to handle partial creation failures.
If the test fails between ConfigMap creation and Pod creation, the cleanup at lines 91-92 will attempt to delete a non-existent pod, potentially raising an exception that masks the original failure.

💡 Suggested improvement

```diff
 # Cleanup
-logic.v1.delete_namespaced_pod(name=f"executor-{exec_id}", namespace=ns)
-logic.v1.delete_namespaced_config_map(name=f"script-{exec_id}", namespace=ns)
+try:
+    logic.v1.delete_namespaced_pod(name=f"executor-{exec_id}", namespace=ns)
+except ApiException:
+    pass  # Pod may not exist if test failed early
+try:
+    logic.v1.delete_namespaced_config_map(name=f"script-{exec_id}", namespace=ns)
+except ApiException:
+    pass  # ConfigMap may not exist if test failed early
```

Alternatively, consider using a pytest fixture with cleanup as a finalizer.

backend/app/services/sse/sse_service.py (1)

236-254: Consider dynamic health status instead of hardcoded values.
The health status returns hardcoded status="healthy" and kafka_enabled=True, but:

- The service doesn't validate its actual health (e.g., Redis connectivity)
- kafka_enabled=True may be misleading since Kafka is now handled by the separate SSE bridge worker

If this is intentional for the refactoring phase, consider adding a TODO or updating the field semantics in SSEHealthDomain. A dynamic health sketch follows this section.

backend/workers/run_sse_bridge.py (1)

92-129: Lifespan structure is correct but consider error handling for schema initialization.
The lifespan correctly:

- Initializes schema registry before defining the subscriber
- Cleans up the container in the finally block

However, if initialize_event_schemas fails, the exception will propagate and prevent startup, but no specific error logging occurs. Consider wrapping with explicit error handling for better diagnostics.

💡 Optional: Add explicit error handling

```diff
 # Initialize schema registry
 schema_registry = await container.get(SchemaRegistryManager)
-await initialize_event_schemas(schema_registry)
+try:
+    await initialize_event_schemas(schema_registry)
+except Exception as e:
+    app_logger.error(f"Failed to initialize schemas: {e}")
+    raise
```

backend/app/services/event_bus.py (1)

111-123: Consider propagating publish errors to callers.
The publish method catches all exceptions and logs them but doesn't raise, which means callers won't know if their event failed to publish. For cross-instance cache invalidation, this might be acceptable, but for critical events, callers may need to know about failures. If silent failures are intentional for this use case, consider documenting this behavior. A re-raise sketch follows this section.
```python
async def handle_create_pod_command(self, command: CreatePodCommandEvent) -> None:
    """Handle create pod command from saga orchestrator."""
    execution_id = command.execution_id

    # Check if already processing
    if execution_id in self._active_creations:
        self.logger.warning(f"Already creating pod for execution {execution_id}")
        return

    # Create pod asynchronously
    asyncio.create_task(self._create_pod_for_execution(command))

async def handle_delete_pod_command(self, command: DeletePodCommandEvent) -> None:
    """Handle delete pod command from saga orchestrator (compensation)."""
    execution_id = command.execution_id
    self.logger.info(f"Deleting pod for execution {execution_id} due to: {command.reason}")

    try:
        # Delete the pod
        pod_name = f"executor-{execution_id}"
        await asyncio.to_thread(
            self.v1.delete_namespaced_pod,
            name=pod_name,
            namespace=self.config.namespace,
            grace_period_seconds=30,
        )
        self.logger.info(f"Successfully deleted pod {pod_name}")

        # Delete associated ConfigMap
        configmap_name = f"script-{execution_id}"
        await asyncio.to_thread(
            self.v1.delete_namespaced_config_map, name=configmap_name, namespace=self.config.namespace
        )
        self.logger.info(f"Successfully deleted ConfigMap {configmap_name}")

        # NetworkPolicy cleanup is managed via a static cluster policy; no per-execution NP deletion

    except ApiException as e:
        if e.status == 404:
            self.logger.warning(
                f"Resources for execution {execution_id} not found (may have already been deleted)"
            )
        else:
            self.logger.error(f"Failed to delete resources for execution {execution_id}: {e}")

async def _create_pod_for_execution(self, command: CreatePodCommandEvent) -> None:
    """Create pod for execution."""
    async with self._creation_semaphore:
        execution_id = command.execution_id
        self._active_creations.add(execution_id)
        self.metrics.update_k8s_active_creations(len(self._active_creations))

        start_time = time.time()

        try:
            # We now have the CreatePodCommandEvent directly from saga
            script_content = command.script
            entrypoint_content = await self._get_entrypoint_script()

            # Create ConfigMap
            config_map = self.pod_builder.build_config_map(
                command=command, script_content=script_content, entrypoint_content=entrypoint_content
            )

            await self._create_config_map(config_map)

            pod = self.pod_builder.build_pod_manifest(command=command)
            await self._create_pod(pod)

            # Publish PodCreated event
            await self._publish_pod_created(command, pod)

            # Update metrics
            duration = time.time() - start_time
            self.metrics.record_k8s_pod_creation_duration(duration, command.language)
            self.metrics.record_k8s_pod_created("success", command.language)

            self.logger.info(
                f"Successfully created pod {pod.metadata.name} for execution {execution_id}. "
                f"Duration: {duration:.2f}s"
            )

        except Exception as e:
            self.logger.error(f"Failed to create pod for execution {execution_id}: {e}", exc_info=True)

            # Update metrics
            self.metrics.record_k8s_pod_created("failed", "unknown")

            # Publish failure event
            await self._publish_pod_creation_failed(command, str(e))

        finally:
            self._active_creations.discard(execution_id)
            self.metrics.update_k8s_active_creations(len(self._active_creations))
```
Prevent duplicate pod creation on concurrent commands.
handle_create_pod_command checks _active_creations before the task adds the ID, so two concurrent calls for the same execution can both enqueue tasks and emit duplicate events. Reserve the ID before scheduling the task and avoid double-add inside _create_pod_for_execution.
💡 Proposed fix
async def handle_create_pod_command(self, command: CreatePodCommandEvent) -> None:
"""Handle create pod command from saga orchestrator."""
execution_id = command.execution_id
# Check if already processing
if execution_id in self._active_creations:
self.logger.warning(f"Already creating pod for execution {execution_id}")
return
+ # Reserve execution_id before scheduling to prevent dupes
+ self._active_creations.add(execution_id)
+ self.metrics.update_k8s_active_creations(len(self._active_creations))
+
# Create pod asynchronously
asyncio.create_task(self._create_pod_for_execution(command))
async def _create_pod_for_execution(self, command: CreatePodCommandEvent) -> None:
"""Create pod for execution."""
async with self._creation_semaphore:
execution_id = command.execution_id
- self._active_creations.add(execution_id)
- self.metrics.update_k8s_active_creations(len(self._active_creations))🤖 Prompt for AI Agents
In `@backend/app/services/k8s_worker/worker_logic.py` around lines 70 - 163, The
duplicate-creation bug occurs because handle_create_pod_command checks
_active_creations but only schedules the task, and the ID is added later inside
_create_pod_for_execution, so two concurrent calls can both schedule tasks; fix
by reserving the execution_id immediately in handle_create_pod_command (add
execution_id to self._active_creations and call
self.metrics.update_k8s_active_creations) before asyncio.create_task, and remove
the duplicate add inside _create_pod_for_execution (leave the finally block that
discards the ID and updates metrics). Also guard the scheduling with a
try/except so if task creation fails you remove the reserved ID and update
metrics.
```diff
 async def _run_watch(self) -> None:
     """Run a single watch session."""
     self.logger.info(
         f"Starting pod watch: selector={self.config.label_selector}, namespace={self.config.namespace}"
     )

     # Create watch stream
-    kwargs = {
-        "namespace": context.namespace,
-        "label_selector": context.label_selector,
-        "timeout_seconds": context.timeout_seconds,
+    kwargs: dict[str, Any] = {
+        "namespace": self.config.namespace,
+        "label_selector": self.config.label_selector,
+        "timeout_seconds": self.config.watch_timeout_seconds,
     }
+    if self.config.field_selector:
+        kwargs["field_selector"] = self.config.field_selector
+    if self._last_resource_version:
+        kwargs["resource_version"] = self._last_resource_version

-    if context.field_selector:
-        kwargs["field_selector"] = context.field_selector
-
-    if context.resource_version:
-        kwargs["resource_version"] = context.resource_version
-
     # Watch stream (clients guaranteed by __init__)
     stream = self._watch.stream(self._v1.list_namespaced_pod, **kwargs)

     try:
         for event in stream:
             if self._state != MonitorState.RUNNING:
                 break

             await self._process_raw_event(event)

     finally:
         # Store resource version for next watch
         self._update_resource_version(stream)
```
🧩 Analysis chain
🏁 Script executed:
# First, let's check the kubernetes watch.stream implementation and usage patterns
rg -n "watch.stream" --type=py -B 2 -A 5
# Check if there's async handling elsewhere
rg -n "to_thread.*stream" --type=pyRepository: HardMax71/Integr8sCode
Length of output: 1456
🏁 Script executed:
# Also check how self._watch is initialized
rg -n "self\._watch.*=" --type=py -B 1 -A 1Repository: HardMax71/Integr8sCode
Length of output: 265
🏁 Script executed:
# Check the full monitor.py file to understand the context
wc -l backend/app/services/pod_monitor/monitor.pyRepository: HardMax71/Integr8sCode
Length of output: 113
🏁 Script executed:
# Get the imports in monitor.py to understand what libraries are being used
head -50 backend/app/services/pod_monitor/monitor.pyRepository: HardMax71/Integr8sCode
Length of output: 1273
🏁 Script executed:
# Check for asyncio.to_thread or executor usage around this code
rg -n "to_thread|run_in_executor|ThreadPoolExecutor" --type=py backend/app/services/pod_monitor/Repository: HardMax71/Integr8sCode
Length of output: 265
🏁 Script executed:
# Get the full _run_watch method and surrounding context
sed -n '160,185p' backend/app/services/pod_monitor/monitor.pyRepository: HardMax71/Integr8sCode
Length of output: 1085
🏁 Script executed:
# Check for any middleware or context managers that might handle this
rg -n "contextmanager|generator|async for" --type=py backend/app/services/pod_monitor/monitor.pyRepository: HardMax71/Integr8sCode
Length of output: 48
🏁 Script executed:
# Check what _process_raw_event does
rg -n "def _process_raw_event" --type=py -A 15 backend/app/services/pod_monitor/monitor.pyRepository: HardMax71/Integr8sCode
Length of output: 762
🏁 Script executed:
# Look at the broader _run_watch context including the try-except
sed -n '155,195p' backend/app/services/pod_monitor/monitor.pyRepository: HardMax71/Integr8sCode
Length of output: 1706
🏁 Script executed:
# Check if there are any known async watch patterns in tests or elsewhere
rg -n "watch|stream" --type=py backend/tests/ | head -30Repository: HardMax71/Integr8sCode
Length of output: 3314
🏁 Script executed:
# Check the requirements to see which version of kubernetes library is used
cat backend/requirements.txt | grep -i kubernetesRepository: HardMax71/Integr8sCode
Length of output: 122
🏁 Script executed:
# Look for any existing issues or comments about watch blocking
rg -n "block|thread|concurrent" --type=py backend/app/services/pod_monitor/monitor.pyRepository: HardMax71/Integr8sCode
Length of output: 177
🏁 Script executed:
# Check if there's any configuration or documentation about this behavior
find backend -name "*.md" -o -name "*.rst" -o -name "*.txt" | xargs grep -l "watch\|async" 2>/dev/null | head -5Repository: HardMax71/Integr8sCode
Length of output: 48
🏁 Script executed:
# Search for requirements files in various locations
find backend -name "*requirements*" -o -name "pyproject.toml" -o -name "poetry.lock" | head -10Repository: HardMax71/Integr8sCode
Length of output: 88
🏁 Script executed:
# Check the FakeWatch implementation to understand the expected pattern
sed -n '150,180p' backend/tests/helpers/k8s_fakes.pyRepository: HardMax71/Integr8sCode
Length of output: 1005
🏁 Script executed:
# Check if there are any GitHub issues or comments in the codebase about async watch
rg -n "TODO|FIXME|XXX|HACK|BUG" --type=py backend/app/services/pod_monitor/monitor.pyRepository: HardMax71/Integr8sCode
Length of output: 48
🏁 Script executed:
# Check pyproject.toml for kubernetes dependency
cat backend/pyproject.toml | grep -A 10 -B 2 "kubernetes\|dependencies"Repository: HardMax71/Integr8sCode
Length of output: 804
Wrap the watch stream iteration in asyncio.to_thread() to avoid blocking the event loop.
The self._watch.stream() call returns a synchronous iterator from the kubernetes library. Using for event in stream: blocks the event loop while waiting for events, preventing other async tasks from running. The codebase already uses the asyncio.to_thread() pattern elsewhere (lines 117, 305) for blocking I/O. Apply the same pattern here:
```python
stream = await asyncio.to_thread(
    self._watch.stream, self._v1.list_namespaced_pod, **kwargs
)
try:
    for event in stream:
        await self._process_raw_event(event)
```
In `@backend/app/services/pod_monitor/monitor.py` around lines 160 - 182, The
synchronous kubernetes iterator returned by self._watch.stream in _run_watch is
blocking the event loop; replace the direct call with an awaited
asyncio.to_thread call (await asyncio.to_thread(self._watch.stream,
self._v1.list_namespaced_pod, **kwargs)) so the blocking stream creation runs
off the event loop, then keep iterating and awaiting
self._process_raw_event(event) as before and call
_update_resource_version(stream) in the finally block; target the _run_watch
method and the usages of self._watch.stream, _process_raw_event, and
_update_resource_version when making this change.
```diff
 assert ValidateExecutionStep().get_compensation() is None
-assert isinstance(AllocateResourcesStep(_FakeAllocRepo()).get_compensation(), ReleaseResourcesCompensation)
+assert isinstance(AllocateResourcesStep(alloc_repo).get_compensation(), ReleaseResourcesCompensation)
 assert isinstance(QueueExecutionStep().get_compensation(), type(DeletePodCompensation(None)).__bases__[0]) or True
```
Assertion is a no-op (or True makes it always pass).
The expression isinstance(...) or True will always evaluate to True, making this assertion meaningless. Either fix the assertion or remove it if the check isn't needed.
🐛 Proposed fix
- assert isinstance(QueueExecutionStep().get_compensation(), type(DeletePodCompensation(None)).__bases__[0]) or True
+ # QueueExecutionStep has no compensation
+ assert QueueExecutionStep().get_compensation() is NoneOr if the intent was to check that it returns a SagaCompensation subclass:
- assert isinstance(QueueExecutionStep().get_compensation(), type(DeletePodCompensation(None)).__bases__[0]) or True
+ from app.services.saga.saga_step import SagaCompensation
+ comp = QueueExecutionStep().get_compensation()
+ assert comp is None or isinstance(comp, SagaCompensation)🤖 Prompt for AI Agents
In `@backend/tests/unit/services/saga/test_execution_saga_steps.py` at line 204,
The assertion in test_execution_saga_steps.py is a no-op because `or True` makes
it always pass; update the test to either remove the `or True` or replace the
assertion with a real type check such as asserting that
QueueExecutionStep().get_compensation() returns an instance/subclass of the
SagaCompensation base (e.g., use isinstance(result, SagaCompensation) or assert
issubclass(type(result), SagaCompensation) and reference
QueueExecutionStep.get_compensation and DeletePodCompensation to ensure the
intended compensation type is validated).
1 issue found across 47 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="backend/tests/unit/services/saga/test_execution_saga_steps.py">
<violation number="1" location="backend/tests/unit/services/saga/test_execution_saga_steps.py:60">
P2: This test inserts 100 active allocation documents into the session-scoped unit DB but never removes them, so data persists across tests and can cause order-dependent failures. Clean up the inserted allocations (e.g., delete them in a finally block) to keep unit tests isolated.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| assert ctx.get("allocation_id") is not None | ||
|
|
||
| # Test 2: Limit exceeded (insert 100 active allocations) | ||
| for i in range(100): |
P2: This test inserts 100 active allocation documents into the session-scoped unit DB but never removes them, so data persists across tests and can cause order-dependent failures. Clean up the inserted allocations (e.g., delete them in a finally block) to keep unit tests isolated.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/tests/unit/services/saga/test_execution_saga_steps.py, line 60:
<comment>This test inserts 100 active allocation documents into the session-scoped unit DB but never removes them, so data persists across tests and can cause order-dependent failures. Clean up the inserted allocations (e.g., delete them in a finally block) to keep unit tests isolated.</comment>
<file context>
@@ -42,50 +44,41 @@ async def test_validate_execution_step_success_and_failures() -> None:
+ assert ctx.get("allocation_id") is not None
+
+ # Test 2: Limit exceeded (insert 100 active allocations)
+ for i in range(100):
+ doc = ResourceAllocationDocument(
+ allocation_id=f"limit-test-alloc-{i}",
</file context>
3 issues found across 49 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="backend/pyproject.toml">
<violation number="1" location="backend/pyproject.toml:140">
P2: New dev dependencies are added with >= ranges while the rest of the group is pinned to exact versions, which makes dev/test environments non-reproducible. Pin these to exact versions to align with the existing dependency policy.</violation>
</file>
<file name="backend/tests/unit/services/saga/test_execution_saga_steps.py">
<violation number="1" location="backend/tests/unit/services/saga/test_execution_saga_steps.py:71">
P2: This test inserts 100 active ResourceAllocationDocument entries but never cleans them up. Since unit_container is session-scoped and there’s no per-test DB cleanup, these documents can leak into other unit tests and make them flaky (e.g., causing active-allocation limits to trip). Add cleanup to delete the inserted documents or use a fixture that isolates the collection for this test.</violation>
</file>
<file name="backend/app/core/providers.py">
<violation number="1" location="backend/app/core/providers.py:192">
P2: MongoDB client is never closed after yielding the database, which can leak connections on shutdown. Wrap the yield in a try/finally and close the client in the finally block.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
```diff
 [dependency-groups]
 dev = [
     "coverage==7.13.0",
+    "fakeredis>=2.33.0",
```
P2: New dev dependencies are added with >= ranges while the rest of the group is pinned to exact versions, which makes dev/test environments non-reproducible. Pin these to exact versions to align with the existing dependency policy.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/pyproject.toml, line 140:
<comment>New dev dependencies are added with >= ranges while the rest of the group is pinned to exact versions, which makes dev/test environments non-reproducible. Pin these to exact versions to align with the existing dependency policy.</comment>
<file context>
@@ -137,9 +137,11 @@ packages = ["app", "workers"]
[dependency-groups]
dev = [
"coverage==7.13.0",
+ "fakeredis>=2.33.0",
"hypothesis==6.103.4",
"iniconfig==2.0.0",
</file context>
| "fakeredis>=2.33.0", | |
| "fakeredis==2.33.0", |
| memory_limit="512Mi", | ||
| status="active", | ||
| ) | ||
| await doc.insert() |
P2: This test inserts 100 active ResourceAllocationDocument entries but never cleans them up. Since unit_container is session-scoped and there’s no per-test DB cleanup, these documents can leak into other unit tests and make them flaky (e.g., causing active-allocation limits to trip). Add cleanup to delete the inserted documents or use a fixture that isolates the collection for this test.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/tests/unit/services/saga/test_execution_saga_steps.py, line 71:
<comment>This test inserts 100 active ResourceAllocationDocument entries but never cleans them up. Since unit_container is session-scoped and there’s no per-test DB cleanup, these documents can leak into other unit tests and make them flaky (e.g., causing active-allocation limits to trip). Add cleanup to delete the inserted documents or use a fixture that isolates the collection for this test.</comment>
<file context>
@@ -42,50 +44,41 @@ async def test_validate_execution_step_success_and_failures() -> None:
+ memory_limit="512Mi",
+ status="active",
+ )
+ await doc.insert()
- # Limit exceeded
</file context>
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@backend/app/core/dishka_lifespan.py`:
- Around line 7-11: The AsyncIterable providers are not closing their resources
on shutdown: update get_database to close the AsyncMongoClient after the yield
(await client.close()), update get_dlq_manager to call await consumer.stop()
(and any DLQ manager cleanup/close) after the yield instead of only cancelling
tasks, and add proper cleanup for get_schema_registry (e.g., close or shutdown
the schema registry client/resource after yield); follow the same pattern used
in get_event_bus and get_event_store_consumer where consumers are stopped with
await consumer.stop() after yielding.
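A minimal sketch of the generator-provider cleanup pattern the prompt describes, assuming Dishka's `@provide` and PyMongo's `AsyncMongoClient`; the URL and database name are placeholders:

```python
from collections.abc import AsyncIterable
from typing import Any

from dishka import Provider, Scope, provide
from pymongo import AsyncMongoClient


class DatabaseProvider(Provider):
    @provide(scope=Scope.APP)
    async def get_database(self) -> AsyncIterable[Any]:
        client = AsyncMongoClient("mongodb://localhost:27017")
        try:
            yield client["app"]  # the container injects this database object
        finally:
            await client.close()  # runs once, when the app container closes
```

The same try/finally-around-yield shape applies to the DLQ manager (`await consumer.stop()` in the finally) and the schema registry client.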
In `@backend/workers/run_sse_bridge.py`:
- Around line 101-102: The decode_avro function currently calls
schema_registry.deserialize_event(body, "sse_bridge") with a hardcoded topic
string; update decode_avro to either pass the real subscribed topic from the
message context (e.g., use the topic variable available where messages are
received) into schema_registry.deserialize_event, or remove the topic argument
entirely if deserialize_event truly does not use it. Locate decode_avro and the
call sites where messages and topic are available, then replace the literal
"sse_bridge" with the dynamic topic variable or adjust deserialize_event usage
accordingly to avoid the misleading hardcoded value.
♻️ Duplicate comments (4)
backend/workers/run_result_processor.py (1)

156-159: Catch-all handler silently discards unmatched events.
This handler prevents message loss but provides no visibility into unexpected events. Consider logging at debug/warning level to aid troubleshooting.

Suggested fix

```diff
 # Default handler for unmatched events (prevents message loss)
 @subscriber()
 async def handle_other(event: DomainEvent) -> None:
-    pass
+    app_logger.debug(
+        "Received unhandled event type",
+        extra={"event_type": getattr(event, "event_type", "unknown")},
+    )
```

backend/app/core/providers.py (1)

577-586: SSEConnectionRegistry scope must be Scope.APP, not Scope.REQUEST.
With Scope.REQUEST, each HTTP request gets a new registry instance with an empty _active_connections dictionary. This breaks cross-request connection tracking and causes get_connection_count() to always return 0 in health checks.

Suggested fix

```diff
-@provide(scope=Scope.REQUEST)
+@provide(scope=Scope.APP)
 def get_sse_connection_registry(
     self,
     logger: logging.Logger,
     connection_metrics: ConnectionMetrics,
 ) -> SSEConnectionRegistry:
```

backend/workers/run_pod_monitor.py (1)

75-88: Exceptions from completed tasks are still silently swallowed.
The asyncio.wait pattern returns completed tasks in done, but exceptions are never checked. If monitor.run() raises an exception, it ends up in done but the exception is silently ignored. This makes debugging production failures difficult.

Suggested fix

```diff
 done, pending = await asyncio.wait(
     [monitor_task, shutdown_task],
     return_when=asyncio.FIRST_COMPLETED,
 )
+# Surface failures from completed tasks
+for task in done:
+    if task is not shutdown_task and not task.cancelled():
+        if exc := task.exception():
+            logger.error("Monitor task failed", exc_info=exc)
+            raise exc
+
 for task in pending:
     task.cancel()
     with suppress(asyncio.CancelledError):
         await task
```

backend/workers/run_k8s_worker.py (1)

146-149: Don't silently drop unmatched events.
The no-op catch-all handler will ack and discard events that don't match filters, masking new event types or header bugs. Consider logging or routing to a dead-letter path.

🪵 Suggested minimal logging

```diff
 @subscriber()
 async def handle_other(event: DomainEvent) -> None:
-    pass
+    app_logger.warning(
+        "Unhandled event received: %s",
+        getattr(event, "event_type", type(event).__name__),
+    )
```
🧹 Nitpick comments (5)
backend/app/events/schema/schema_registry.py (1)

82-102: Schema object is recreated unnecessarily in serialize_event.
_ensure_schema_registered already creates an AvroSchema object internally but doesn't return it. Then serialize_event creates it again on line 92. Consider returning the schema object from _ensure_schema_registered to avoid redundant schema parsing.

Suggested optimization

```diff
-async def _ensure_schema_registered(self, event_class: type[DomainEvent]) -> int:
+async def _ensure_schema_registered(self, event_class: type[DomainEvent]) -> tuple[int, schema.AvroSchema]:
     """Lazily register schema and return schema ID."""
     if event_class in self._schema_id_cache:
-        return self._schema_id_cache[event_class]
+        avro_schema_dict = event_class.avro_schema_to_python()
+        return self._schema_id_cache[event_class], schema.AvroSchema(avro_schema_dict)

     subject = f"{self.subject_prefix}{event_class.__name__}-value"
     avro_schema_dict = event_class.avro_schema_to_python()
     avro_schema_obj = schema.AvroSchema(avro_schema_dict)
     schema_id: int = await self._client.register(subject, avro_schema_obj)
     self._schema_id_cache[event_class] = schema_id
     self._id_to_class_cache[schema_id] = event_class
     self.logger.debug(f"Registered schema {event_class.__name__}: ID {schema_id}")
-    return schema_id
+    return schema_id, avro_schema_obj

 async def serialize_event(self, event: DomainEvent) -> bytes:
     """Serialize event to Confluent wire format."""
     event_class = event.__class__
     subject = f"{self.subject_prefix}{event_class.__name__}-value"
-    await self._ensure_schema_registered(event_class)
-
-    avro_schema_dict = event_class.avro_schema_to_python()
-    avro_schema_obj = schema.AvroSchema(avro_schema_dict)
+    _, avro_schema_obj = await self._ensure_schema_registered(event_class)
```

backend/app/core/providers.py (1)

271-293: DLQManager background tasks lack graceful shutdown signaling.
The monitor_loop uses while True and relies on task cancellation. Consider adding an explicit stop flag to allow graceful completion of the current iteration before shutdown; a stop-event sketch follows this section.

backend/workers/run_saga_orchestrator.py (3)

98-99: Closure-based state for timeout checking works but could be clearer.
Using a mutable dict for state across handler invocations is a valid pattern. Consider using a simple class or dataclass for better readability.

161-162: Minor: f-string logging is used inconsistently.
Line 161 uses f-string formatting while other parts of the codebase use structured logging with extra. This is minor but affects log aggregation consistency.

142-159: Opportunistic timeout check may miss sagas during low traffic periods.
Timeout checking only runs when events arrive via check_timeouts_once(). During periods of no traffic, timed-out sagas won't be detected until the next event. A check_timeouts_loop() method exists in SagaLogic that runs every 30 seconds but is not currently used. Consider whether to use the available background loop instead of the opportunistic approach, especially given the 5-minute maximum timeout constraint.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@backend/tests/unit/conftest.py`:
- Around line 58-90: The session-scoped unit_container fixture holds a single
AsyncMongoMockClient across the whole test session causing DB state leakage; fix
by making isolation explicit: either change unit_container to function scope or
create/attach a fresh AsyncMongoMockClient per test and re-run init_beanie for
that client (references: unit_container, AsyncMongoMockClient, init_beanie), or
add a teardown that drops all collections between tests (callable in unit_scope)
and ensure tests use the existing unit_scope fixture instead of unit_container
so each test gets a clean DB state.
In `@backend/workers/run_event_replay.py`:
- Around line 34-38: The AsyncMongoClient instance (mongo_client) is created
before calling init_beanie and may leak if init_beanie raises; wrap the
init_beanie call in a try/finally so that on any exception you call await
mongo_client.close() to ensure the client is closed, then re-raise the
exception; keep the AsyncExitStack usage for normal shutdown but ensure the
finally branch only closes mongo_client when initialization fails (i.e., before
the stack takes ownership).
In `@backend/workers/run_result_processor.py`:
- Around line 132-137: The subscriber is created with auto_commit=False which
prevents offsets from being committed and combined with FastStream's default
AckPolicy.ACK_FIRST causes messages to never be acknowledged; update the
broker.subscriber(...) invocation (the call that currently passes topics,
group_id, auto_commit=False, decoder=decode_avro) to either set
ack_policy=AckPolicy.ACK to commit offsets after successful handling or remove
auto_commit=False to enable auto-commit, and then modify the no-op handle_other
handler so it explicitly acknowledges or returns success for unknown events to
ensure offsets advance for those messages as well.
In `@backend/workers/run_saga_orchestrator.py`:
- Around line 142-147: The subscriber decorator currently uses the deprecated
auto_commit parameter; replace auto_commit=False in the broker.subscriber(...)
call with an explicit ack_policy by importing AckPolicy from faststream and
using ack_policy=AckPolicy.ACK (recommended) or ack_policy=AckPolicy.MANUAL if
you want manual control; if you choose MANUAL, update the corresponding message
handler function (the consumer callback tied to broker.subscriber with topics,
group_id and decoder=decode_avro) to call message.ack() after successful
processing, and remove the auto_commit argument; apply the same change to the
identical patterns in run_result_processor.py and run_k8s_worker.py.
♻️ Duplicate comments (4)
backend/app/core/providers.py (1)
559-568: SSEConnectionRegistry should be app-scoped (singleton).

With `Scope.REQUEST`, the registry is recreated per request and `_active_connections` is always empty across requests (e.g., health checks). Please switch this provider to `Scope.APP`.

🔧 Suggested fix
```diff
-    @provide(scope=Scope.REQUEST)
+    @provide(scope=Scope.APP)
     def get_sse_connection_registry(
         self,
         logger: logging.Logger,
         connection_metrics: ConnectionMetrics,
     ) -> SSEConnectionRegistry:
```
backend/workers/run_result_processor.py (1)
161-164: Don’t silently drop unmatched events.
handle_othercurrently no-ops, which can hide routing/config errors and new event types. Add at least a warning log.🔧 Suggested fix
```diff
 @subscriber()
 async def handle_other(event: DomainEvent) -> None:
-    pass
+    logger.warning("Unhandled result-processor event_type: %s", event.event_type)
```
backend/workers/run_k8s_worker.py (1)
142-145: Log unmatched events to aid debugging.

The catch-all handler silently discards events that don't match the CREATE_POD_COMMAND or DELETE_POD_COMMAND filters. This can mask filter bugs or cause silent data loss when new event types are introduced.
♻️ Proposed fix: add warning log for unmatched events
```diff
 # Default handler for unmatched events (prevents message loss)
 @subscriber()
 async def handle_other(event: DomainEvent) -> None:
-    pass
+    app_logger.warning(
+        "Unmatched event received",
+        extra={"event_type": getattr(event, "event_type", "unknown")},
+    )
```
backend/workers/run_pod_monitor.py (1)
85-93: Exceptions from completed tasks are still silently swallowed.

When `monitor.run()` raises an exception, that exception is stored in `monitor_task` but never retrieved. The process will exit through the `finally` block without logging or propagating the error, masking critical failures.

🐛 Proposed fix: surface exceptions from completed tasks
```diff
 done, pending = await asyncio.wait(
     [monitor_task, shutdown_task],
     return_when=asyncio.FIRST_COMPLETED,
 )
+# Surface failures from completed tasks
+for task in done:
+    if task != shutdown_task:
+        try:
+            task.result()  # Raises if task failed
+        except Exception:
+            logger.exception("Monitor task failed")
+            raise
+
 for task in pending:
     task.cancel()
     with suppress(asyncio.CancelledError):
         await task
```
🧹 Nitpick comments (4)
backend/workers/dlq_processor.py (1)
87-90: MongoDB connection parameters are hardcoded.

The connection settings (`tz_aware=True`, `serverSelectionTimeoutMS=5000`) are duplicated here and in `dishka_lifespan.py`. Consider extracting these to `Settings` or a shared factory if consistency matters across workers and the main app.

backend/app/core/dishka_lifespan.py (1)
17-25: Consider try/finally to ensure MongoDB client cleanup on initialization failure.

If `init_beanie()` raises an exception, the `AsyncMongoClient` won't be closed since the code won't reach line 24. While unlikely in practice, wrapping the yield block ensures proper cleanup.

Suggested fix
```diff
     client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
         settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
     )
-    await init_beanie(database=client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
-
-    yield
-
-    await client.close()
-    await container.close()
+    try:
+        await init_beanie(database=client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+        yield
+    finally:
+        await client.close()
+        await container.close()
```
backend/app/core/providers.py (1)
253-265: Guard long‑running background loops from silent task death.

These `while True` tasks (DLQ processing, event store batch loop, event bus listener, notification loops) will stop permanently if an exception escapes. Consider adding a try/except with logging and a short backoff so failures surface without killing the loop; a reusable helper is sketched after the example.

Example pattern
```diff
 async def batch_loop() -> None:
     while True:
-        await event_store_consumer.process_batch()
+        try:
+            await event_store_consumer.process_batch()
+        except Exception:
+            logger.exception("Event store consumer loop crashed; continuing")
+            await asyncio.sleep(1)
```

Also applies to: 359-371, 413-428, 693-712
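The same pattern can be factored into a small supervisor so each provider doesn't repeat the try/except. A minimal sketch; the function name and the one-second default backoff are illustrative:

```python
import asyncio
import logging
from collections.abc import Awaitable, Callable

logger = logging.getLogger(__name__)


async def supervise(name: str, step: Callable[[], Awaitable[None]], backoff: float = 1.0) -> None:
    """Run one unit of work forever; log and back off on failure instead of dying."""
    while True:
        try:
            await step()
        except asyncio.CancelledError:
            raise  # preserve task-cancellation semantics on shutdown
        except Exception:
            logger.exception("%s loop crashed; retrying in %.1fs", name, backoff)
            await asyncio.sleep(backoff)


# Usage: asyncio.create_task(supervise("event-store-batch", event_store_consumer.process_batch))
```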
backend/workers/run_pod_monitor.py (1)
57-61: Consider moving MongoDB client lifecycle into DI for consistency.

The MongoDB client is created manually outside the DI container while other boundary clients (Kafka, Redis) are DI-managed. This works correctly since it's properly closed in the `finally` block, but moving it into a provider would improve consistency with other workers.
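A sketch of what such a provider could look like, assuming Dishka's generator-provider finalization; the provider class name is hypothetical and `Settings` is assumed importable from the app's config module:

```python
from collections.abc import AsyncIterable

from dishka import Provider, Scope, provide
from pymongo import AsyncMongoClient

# from app.core.config import Settings  # assumed project path


class MongoClientProvider(Provider):  # hypothetical name
    @provide(scope=Scope.APP)
    async def get_mongo_client(
        self, settings: Settings
    ) -> AsyncIterable[AsyncMongoClient[dict[str, object]]]:
        client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
            settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
        )
        yield client
        await client.close()
```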
```python
@pytest_asyncio.fixture(scope="session")
async def unit_container(test_settings: Settings) -> AsyncGenerator[AsyncContainer, None]:
    """DI container for unit tests with fake boundary clients.

    Provides:
    - Fake Redis, Kafka, K8s (boundary clients)
    - Real metrics, repositories, services (internal)
    - Real MongoDB via init_beanie
    """
    container = make_async_container(
        SettingsProvider(),
        LoggingProvider(),
        FakeBoundaryClientProvider(),
        RedisServicesProvider(),
        MetricsProvider(),
        EventProvider(),
        FakeSchemaRegistryProvider(),
        MessagingProvider(),
        CoreServicesProvider(),
        KafkaServicesProvider(),
        RepositoryProvider(),
        context={Settings: test_settings},
    )

    mongo_client: AsyncMongoMockClient[dict[str, object]] = AsyncMongoMockClient()
    await init_beanie(
        database=mongo_client[test_settings.DATABASE_NAME],  # type: ignore[arg-type]
        document_models=ALL_DOCUMENTS,
    )

    yield container

    await container.close()
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Look for test cleanup patterns mentioned in the concern
rg -n "clean|reset|clear|delete_many|drop" backend/tests/unit/ --type py -C2 | head -100
```
Repository: HardMax71/Integr8sCode
Length of output: 7819

🏁 Script executed:
```bash
#!/bin/bash
# Check how the unit_container fixture is actually used in tests
rg -n "def test_.*unit_container" backend/tests/unit/ --type py -C1
```
Repository: HardMax71/Integr8sCode
Length of output: 1803

🏁 Script executed:
```bash
#!/bin/bash
# Look for fixture scope declarations and test isolation patterns
rg -n "@pytest\.fixture|@pytest_asyncio\.fixture" backend/tests/unit/conftest.py --type py -C1
```
Repository: HardMax71/Integr8sCode
Length of output: 3240

🏁 Script executed:
```bash
#!/bin/bash
# Check if mongomock client is being reset or if there's any state cleanup
rg -n "AsyncMongoMockClient|mongo_client" backend/tests/unit/conftest.py --type py -C3
```
Repository: HardMax71/Integr8sCode
Length of output: 721

🏁 Script executed:
```bash
#!/bin/bash
# Examine the full conftest.py structure to understand fixture relationships
head -300 backend/tests/unit/conftest.py | tail -250
```
Repository: HardMax71/Integr8sCode
Length of output: 7553
Confirm: session-scoped container creates genuine state leakage risk.
The unit_container fixture is session-scoped with a single AsyncMongoMockClient instance that persists across all unit tests. No database cleanup is performed between tests—the mongomock instance accumulates data from all writes. Tests using repositories that modify database state will pollute subsequent tests in the session.
A function-scoped unit_scope fixture exists (line 93) but tests use unit_container directly, bypassing isolation. Either switch test usage to unit_scope, add cleanup hooks between tests, or accept and document the leakage risk.
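A minimal sketch of the cleanup-hook option, assuming the `ALL_DOCUMENTS` list already imported in this conftest and Beanie's `delete_all()` on document models; the fixture name is hypothetical:

```python
from collections.abc import AsyncIterator

import pytest_asyncio


@pytest_asyncio.fixture(autouse=True)
async def _clean_db() -> AsyncIterator[None]:
    """Wipe every Beanie model after each test so the shared mock client starts empty."""
    yield
    for model in ALL_DOCUMENTS:  # ALL_DOCUMENTS from the existing conftest imports
        await model.delete_all()
```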
🤖 Prompt for AI Agents
In `@backend/tests/unit/conftest.py` around lines 58 - 90, The session-scoped
unit_container fixture holds a single AsyncMongoMockClient across the whole test
session causing DB state leakage; fix by making isolation explicit: either
change unit_container to function scope or create/attach a fresh
AsyncMongoMockClient per test and re-run init_beanie for that client
(references: unit_container, AsyncMongoMockClient, init_beanie), or add a
teardown that drops all collections between tests (callable in unit_scope) and
ensure tests use the existing unit_scope fixture instead of unit_container so
each test gets a clean DB state.
```python
mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
    settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
)
await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
```
Ensure Mongo client is closed if initialization fails.
If init_beanie raises, the mongo_client isn’t closed because the AsyncExitStack is entered later. Consider a small try/finally to avoid leaking the client on startup failure.
🔧 Suggested fix
```diff
-    mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
-        settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
-    )
-    await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+    mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
+        settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
+    )
+    try:
+        await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+    except Exception:
+        await mongo_client.close()
+        raise
```
+ raise📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient( | |
| settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000 | |
| ) | |
| await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS) | |
| mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient( | |
| settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000 | |
| ) | |
| try: | |
| await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS) | |
| except Exception: | |
| await mongo_client.close() | |
| raise |
🤖 Prompt for AI Agents
In `@backend/workers/run_event_replay.py` around lines 34 - 38, The
AsyncMongoClient instance (mongo_client) is created before calling init_beanie
and may leak if init_beanie raises; wrap the init_beanie call in a try/finally
so that on any exception you call await mongo_client.close() to ensure the
client is closed, then re-raise the exception; keep the AsyncExitStack usage for
normal shutdown but ensure the finally branch only closes mongo_client when
initialization fails (i.e., before the stack takes ownership).
```python
subscriber = broker.subscriber(
    *topics,
    group_id=group_id,
    auto_commit=False,
    decoder=decode_avro,
)
```
🧩 Analysis chain
🏁 Script executed:
```bash
cd backend/workers && wc -l run_result_processor.py
```
Repository: HardMax71/Integr8sCode
Length of output: 93

🏁 Script executed:
```bash
cat -n backend/workers/run_result_processor.py | head -200
```
Repository: HardMax71/Integr8sCode
Length of output: 7837

🏁 Script executed:
```bash
cat -n backend/workers/run_result_processor.py | sed -n '120,160p'
```
Repository: HardMax71/Integr8sCode
Length of output: 2074
🌐 Web query:
FastStream 0.6 Kafka subscriber auto_commit=False offset commit behavior handler success
💡 Result:
Short answer: With FastStream 0.6 you must use the ACK-style acknowledgement (not auto-commit) to have offsets committed only after handler success — AckPolicy.ACK commits the offset after the handler returns (ACK_FIRST uses client auto-commit; MANUAL does nothing until you call msg.ack(); NACK_ON_ERROR will avoid committing on errors). See FastStream docs. [1]
Sources
- FastStream 0.6 acknowledgement docs (AckPolicy behavior and examples). [1]
- FastStream 0.5 docs showing auto_commit=False example (for comparison). [2]
🏁 Script executed:
```bash
grep -rn "AckPolicy\|\.ack()" backend/workers/
```
Repository: HardMax71/Integr8sCode
Length of output: 48

🏁 Script executed:
```bash
grep -rn "from faststream" backend/workers/run_result_processor.py
```
Repository: HardMax71/Integr8sCode
Length of output: 146
🌐 Web query:
FastStream 0.6 KafkaBroker subscriber default AckPolicy when auto_commit=False
💡 Result:
Default AckPolicy is AckPolicy.ACK_FIRST. FastStream’s docs state ACK_FIRST (which relies on Kafka’s enable.auto.commit) is the default; to get at-least-once you must set ack_policy=AckPolicy.ACK (or use manual ack). [1][2]
Sources:
- FastStream Acknowledgement docs (0.6). [1]
- KafkaBroker/subscriber API showing subscriber options (auto_commit / ack config). [2]
Add explicit offset commit strategy or enable auto-commit.
With auto_commit=False (line 135), offsets are not committed automatically. FastStream's default AckPolicy.ACK_FIRST relies on Kafka's enable.auto.commit, which is disabled here. This results in messages never being marked as processed—they will be redelivered on consumer restart.
Set ack_policy=AckPolicy.ACK in the broker.subscriber() call to commit offsets after handler success, or remove auto_commit=False to use default auto-commit. Additionally, the no-op handle_other handler silently drops unknown events without acknowledging them, preventing offset advancement for those messages.
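A sketch of the first option against the existing subscriber call; `broker`, `topics`, `group_id`, and `decode_avro` are the names already in this file, and the import path follows the FastStream 0.6 docs cited above:

```python
from faststream import AckPolicy

subscriber = broker.subscriber(
    *topics,
    group_id=group_id,
    ack_policy=AckPolicy.ACK,  # commit offsets only after the handler succeeds
    decoder=decode_avro,
)
```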
🤖 Prompt for AI Agents
In `@backend/workers/run_result_processor.py` around lines 132 - 137, The
subscriber is created with auto_commit=False which prevents offsets from being
committed and combined with FastStream's default AckPolicy.ACK_FIRST causes
messages to never be acknowledged; update the broker.subscriber(...) invocation
(the call that currently passes topics, group_id, auto_commit=False,
decoder=decode_avro) to either set ack_policy=AckPolicy.ACK to commit offsets
after successful handling or remove auto_commit=False to enable auto-commit, and
then modify the no-op handle_other handler so it explicitly acknowledges or
returns success for unknown events to ensure offsets advance for those messages
as well.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@backend/app/services/notification_service.py`:
- Around line 85-95: The docstring incorrectly mentions a run() method; update
it to remove the run() reference and instead state that background work is
started by the application's DI/lifespan scheduler which calls this service's
batch and cleanup methods (for example
NotificationService.batch_process_pending_notifications and
NotificationService.cleanup_notifications); briefly describe that these methods
handle pending notification processing and periodic cleanup and that they should
be wired into app lifespan/DI.
In `@backend/app/services/sse/redis_bus.py`:
- Around line 156-174: The start() method can be called multiple times which
will spawn duplicate listener tasks; add a guard in start() that checks
self._task (and whether it's not done) and either returns early or raises if a
listener is already running, or cancel/wait for the existing task before
creating a new one; update the async def start(self) to inspect self._task and
only call asyncio.create_task(listener()) when there is no active task (use
self._task.done() to determine activity) and ensure subscription/confirmation
logic is not re-run for an already-started listener (references: start, _task,
listener, _pubsub).
In `@backend/tests/integration/db/repositories/test_admin_settings_repository.py`:
- Around line 29-36: The test test_update_and_reset_settings currently asserts
AuditLogDocument.count() >= N which is brittle; capture a baseline count before
calling repo.update_system_settings and repo.reset_system_settings (use
AuditLogDocument.count()), then after update_system_settings assert the count
increased by one and after reset_system_settings assert the count increased by
two total (or increased by one relative to the post-update count). Reference the
test function test_update_and_reset_settings, the repository methods
update_system_settings and reset_system_settings, and AuditLogDocument.count()
when making these delta-based assertions.
♻️ Duplicate comments (5)
backend/app/core/dishka_lifespan.py (1)
51-64: Task supervision during runtime is still missing.

If either `pending_notification_task` or `cleanup_notification_task` fails with an unexpected exception during normal operation (between `yield` and shutdown), the failure goes unnoticed until shutdown. The exception handling inside the loops catches and logs errors, but a complete task crash (e.g., from an unhandled exception type) would be silent.

Consider adding a done callback to log failures immediately:
♻️ Suggested supervision pattern
```diff
 pending_task = asyncio.create_task(pending_notification_task())
 cleanup_task = asyncio.create_task(cleanup_notification_task())
+
+def _on_task_done(task: asyncio.Task[None]) -> None:
+    if task.cancelled():
+        return
+    if exc := task.exception():
+        logger.error("Background task failed unexpectedly", exc_info=exc)
+
+pending_task.add_done_callback(_on_task_done)
+cleanup_task.add_done_callback(_on_task_done)
 logger.info("NotificationService background tasks started")
```
backend/app/services/pod_monitor/monitor.py (2)
120-129: Re-raise `asyncio.CancelledError` to preserve cancellation semantics.

Catching `CancelledError` without re-raising breaks Python's async cancellation contract. The task will appear to complete normally instead of being cancelled, causing `task.cancelled()` to return `False`.

🛠️ Suggested fix
```diff
 try:
     await self._watch_loop()
 except asyncio.CancelledError:
     self.logger.info("PodMonitor cancelled")
+    raise
 finally:
```
176-182: The synchronous `watch.stream()` iteration blocks the event loop.

The `for event in stream:` loop uses a synchronous iterator from the kubernetes library, which blocks the event loop while waiting for events. This prevents other async tasks from running. The codebase already uses `asyncio.to_thread()` elsewhere (lines 117, 295).

Consider wrapping the blocking iteration or processing events in a thread:
🛠️ Suggested approach
One option is to process each event fetch in a thread:
```python
async def _run_watch(self) -> None:
    # ... setup code ...
    stream = self._watch.stream(self._v1.list_namespaced_pod, **kwargs)
    it = iter(stream)
    try:
        while True:
            # Fetch the next watch event in a worker thread so the loop stays free
            event = await asyncio.to_thread(next, it, None)
            if event is None:
                break
            await self._process_raw_event(event)
    finally:
        self._update_resource_version(stream)
```

Alternatively, consider using an async kubernetes client if available.
backend/app/services/sse/sse_connection_registry.py (1)
30-38: Metrics can still drift on duplicate `register_connection` calls.

The `unregister_connection` method now correctly guards against decrementing when the connection doesn't exist. However, `register_connection` still unconditionally increments metrics even if the `connection_id` is already in the set (since `set.add()` is idempotent but the metrics call isn't).

🛠️ Suggested fix
```diff
 async def register_connection(self, execution_id: str, connection_id: str) -> None:
     """Register a new SSE connection."""
     async with self._lock:
-        if execution_id not in self._active_connections:
-            self._active_connections[execution_id] = set()
-
-        self._active_connections[execution_id].add(connection_id)
-        self.logger.debug("Registered SSE connection", extra={"connection_id": connection_id})
-        self.metrics.increment_sse_connections("executions")
+        conns = self._active_connections.setdefault(execution_id, set())
+        if connection_id not in conns:
+            conns.add(connection_id)
+            self.metrics.increment_sse_connections("executions")
+            self.logger.debug("Registered SSE connection", extra={"connection_id": connection_id})
```
backend/app/core/providers.py (1)
409-446: SSEConnectionRegistry should be app‑scoped.

Request scope recreates the registry per request, which breaks shared connection tracking across requests. This should be app‑wide.
🔧 Proposed fix
```diff
-    @provide(scope=Scope.REQUEST)
+    @provide(scope=Scope.APP)
     def get_sse_connection_registry(
```
🧹 Nitpick comments (2)
backend/app/events/event_store_consumer.py (1)
33-46: Trace span closes before metrics recording.

The `trace_span` context manager exits after line 38, so the metrics recording (line 41) and logging happen outside the span. This is minor, but for observability consistency, consider including the post-storage operations within the span.

♻️ Optional: Include metrics/logging in span scope
```diff
     async def store_event(self, event: DomainEvent, topic: str, consumer_group: str) -> bool:
         """Store a single event. Called by FastStream handler.

         Returns True if stored, False if duplicate/failed.
         """
         with trace_span(
             name="event_store_service.store_event",
             kind=SpanKind.CONSUMER,
             attributes={"event.type": event.event_type, "event.id": event.event_id},
         ):
             stored = await self.event_store.store_event(event)
-        if stored:
-            self.event_metrics.record_kafka_message_consumed(topic=topic, consumer_group=consumer_group)
-            self.logger.debug(f"Stored event {event.event_id}")
-        else:
-            self.logger.debug(f"Duplicate event {event.event_id}, skipped")
+            if stored:
+                self.event_metrics.record_kafka_message_consumed(topic=topic, consumer_group=consumer_group)
+                self.logger.debug(f"Stored event {event.event_id}")
+            else:
+                self.logger.debug(f"Duplicate event {event.event_id}, skipped")

         return stored
```
backend/app/services/notification_service.py (1)
126-149: Prefer `logger.exception` for batch/cleanup failures.

This preserves stack traces and is more actionable in ops.
♻️ Suggested change
```diff
-        except Exception as e:
-            self.logger.error(f"Error processing pending notifications: {e}")
+        except Exception:
+            self.logger.exception("Error processing pending notifications")
 ...
-        except Exception as e:
-            self.logger.error(f"Error cleaning up old notifications: {e}")
+        except Exception:
+            self.logger.exception("Error cleaning up old notifications")
```
| """Service for creating and managing notifications. | ||
|
|
||
| This service handles: | ||
| - Creating notifications (user and system) | ||
| - Delivering notifications via channels (in-app, webhook, slack) | ||
| - Managing notification subscriptions | ||
| - Rate limiting via throttle cache | ||
|
|
||
| Background tasks (pending notification processing, cleanup) are started | ||
| via the run() method, which should be called from app lifespan. | ||
| """ |
Docstring references a non‑existent run() method.
NotificationService no longer defines run(), so this is misleading. Please update the docstring to reflect the DI/lifespan scheduler that calls the new batch/cleanup methods.
📝 Proposed fix
```diff
-    Background tasks (pending notification processing, cleanup) are started
-    via the run() method, which should be called from app lifespan.
+    Background tasks (pending notification processing, cleanup) are started
+    by the DI/lifespan scheduler that calls process_pending_batch() and cleanup_old().
```
In `@backend/app/services/notification_service.py` around lines 85 - 95, The
docstring incorrectly mentions a run() method; update it to remove the run()
reference and instead state that background work is started by the application's
DI/lifespan scheduler which calls this service's batch and cleanup methods (for
example NotificationService.batch_process_pending_notifications and
NotificationService.cleanup_notifications); briefly describe that these methods
handle pending notification processing and periodic cleanup and that they should
be wired into app lifespan/DI.
```python
async def start(self) -> None:
    """Start listening for messages."""
    await self._pubsub.subscribe(self._channel)
    await self._pubsub.get_message(timeout=1.0)  # Consume subscribe confirmation

    async def listener() -> None:
        while True:
            try:
                msg = await self._pubsub.get_message(ignore_subscribe_messages=True, timeout=0.5)
                if msg and msg.get("type") == "message":
                    data = json.loads(msg["data"])
                    await self._handler(data)
            except asyncio.CancelledError:
                break
            except Exception as e:
                self._logger.error(f"Error processing internal message on {self._channel}: {e}")

    self._task = asyncio.create_task(listener())
    self._logger.debug(f"Started internal subscription on {self._channel}")
```
Guard against double start() calls.
If start() is called twice without an intervening close(), a new listener task is created while the old one continues running, causing duplicate message handling and a leaked task.
🛠️ Suggested fix
```diff
 async def start(self) -> None:
     """Start listening for messages."""
+    if self._task is not None:
+        return  # Already started
+
     await self._pubsub.subscribe(self._channel)
     await self._pubsub.get_message(timeout=1.0)  # Consume subscribe confirmation
```
🤖 Prompt for AI Agents
In `@backend/app/services/sse/redis_bus.py` around lines 156 - 174, The start()
method can be called multiple times which will spawn duplicate listener tasks;
add a guard in start() that checks self._task (and whether it's not done) and
either returns early or raises if a listener is already running, or cancel/wait
for the existing task before creating a new one; update the async def
start(self) to inspect self._task and only call asyncio.create_task(listener())
when there is no active task (use self._task.done() to determine activity) and
ensure subscription/confirmation logic is not re-run for an already-started
listener (references: start, _task, listener, _pubsub).
```diff
 async def test_update_and_reset_settings(repo: AdminSettingsRepository) -> None:
     s = SystemSettings()
     updated = await repo.update_system_settings(s, updated_by="admin", user_id="u1")
     assert isinstance(updated, SystemSettings)
     # verify audit log entry exists
-    assert await db.get_collection("audit_log").count_documents({}) >= 1
+    assert await AuditLogDocument.count() >= 1
     reset = await repo.reset_system_settings("admin", "u1")
     assert isinstance(reset, SystemSettings)
-    assert await db.get_collection("audit_log").count_documents({}) >= 2
+    assert await AuditLogDocument.count() >= 2
```
🧩 Analysis chain
🏁 Script executed:
```bash
find . -type f -name "test_admin_settings_repository.py" 2>/dev/null | head -20
```
Repository: HardMax71/Integr8sCode
Length of output: 188

🏁 Script executed:
```bash
git ls-files | grep -E "(test_admin|admin_settings)" | head -20
```
Repository: HardMax71/Integr8sCode
Length of output: 457

🏁 Script executed:
```bash
cat -n backend/tests/integration/db/repositories/test_admin_settings_repository.py | head -50
```
Repository: HardMax71/Integr8sCode
Length of output: 1653

🏁 Script executed:
```bash
rg "def db" backend/tests --type py -B 2 -A 5 | head -100
```
Repository: HardMax71/Integr8sCode
Length of output: 485

🏁 Script executed:
```bash
rg "fixture|scope|autouse" backend/tests -B 2 -A 2 --type py | grep -E "(db|audit)" -A 3 -B 1 | head -80
```
Repository: HardMax71/Integr8sCode
Length of output: 2217

🏁 Script executed:
```bash
find . -path "*/tests/integration/conftest.py" -type f
```
Repository: HardMax71/Integr8sCode
Length of output: 105

🏁 Script executed:
```bash
cat -n backend/tests/integration/conftest.py 2>/dev/null | head -100
```
Repository: HardMax71/Integr8sCode
Length of output: 4140

🏁 Script executed:
```bash
cat -n backend/app/db/docs/admin_settings.py
```
Repository: HardMax71/Integr8sCode
Length of output: 2515

🏁 Script executed:
```bash
rg "AuditLogDocument" backend/tests --type py -B 2 -A 2 | head -80
```
Repository: HardMax71/Integr8sCode
Length of output: 1361

🏁 Script executed:
```bash
find . -path "*/tests/helpers/cleanup.py" -type f
```
Repository: HardMax71/Integr8sCode
Length of output: 100

🏁 Script executed:
```bash
cat -n tests/helpers/cleanup.py
```
Repository: HardMax71/Integr8sCode
Length of output: 122

🏁 Script executed:
```bash
rg "scope.*AsyncContainer|@pytest.fixture.*scope" backend/tests --type py -B 2 -A 3 | head -80
```
Repository: HardMax71/Integr8sCode
Length of output: 5389

🏁 Script executed:
```bash
cat -n backend/tests/helpers/cleanup.py
```
Repository: HardMax71/Integr8sCode
Length of output: 911

🏁 Script executed:
```bash
rg "def db\(\)" backend/tests --type py -B 3 -A 10
```
Repository: HardMax71/Integr8sCode
Length of output: 795

🏁 Script executed:
```bash
head -100 backend/tests/conftest.py
```
Repository: HardMax71/Integr8sCode
Length of output: 3987
Make audit-log assertions delta-based to avoid false positives.
Global AuditLogDocument.count() assertions are fragile; they pass whenever any audit logs exist, not necessarily ones from this test. Use a baseline capture and assert the increment to ensure the test actually validates its own operations.
🔧 Suggested adjustment (baseline + delta)
```diff
 async def test_update_and_reset_settings(repo: AdminSettingsRepository) -> None:
+    baseline = await AuditLogDocument.count()
     s = SystemSettings()
     updated = await repo.update_system_settings(s, updated_by="admin", user_id="u1")
     assert isinstance(updated, SystemSettings)
-    assert await AuditLogDocument.count() >= 1
+    assert await AuditLogDocument.count() >= baseline + 1
     reset = await repo.reset_system_settings("admin", "u1")
     assert isinstance(reset, SystemSettings)
-    assert await AuditLogDocument.count() >= 2
+    assert await AuditLogDocument.count() >= baseline + 2
```
In `@backend/tests/integration/db/repositories/test_admin_settings_repository.py`
around lines 29 - 36, The test test_update_and_reset_settings currently asserts
AuditLogDocument.count() >= N which is brittle; capture a baseline count before
calling repo.update_system_settings and repo.reset_system_settings (use
AuditLogDocument.count()), then after update_system_settings assert the count
increased by one and after reset_system_settings assert the count increased by
two total (or increased by one relative to the post-update count). Reference the
test function test_update_and_reset_settings, the repository methods
update_system_settings and reset_system_settings, and AuditLogDocument.count()
when making these delta-based assertions.
5 issues found across 33 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="backend/workers/dlq_processor.py">
<violation number="1" location="backend/workers/dlq_processor.py:185">
P2: auto_commit=False disables auto-acknowledgment, but the handler never explicitly acknowledges messages. This leaves Kafka offsets uncommitted, so DLQ messages will be reprocessed on restart or rebalance. Either enable auto-commit or accept a message object and call ack/nack after processing.</violation>
</file>
<file name="backend/tests/unit/conftest.py">
<violation number="1" location="backend/tests/unit/conftest.py:65">
P3: The fixture uses `AsyncMongoMockClient`, so the docstring claiming a "Real MongoDB" is misleading. Update it to reflect that this is an in-memory/mock MongoDB instance.</violation>
</file>
<file name="backend/app/dlq/manager.py">
<violation number="1" location="backend/app/dlq/manager.py:80">
P2: Wrap the processing block in try/except/finally so failures are logged and duration metrics are recorded even when _process_dlq_message raises.</violation>
</file>
<file name="backend/workers/run_saga_orchestrator.py">
<violation number="1" location="backend/workers/run_saga_orchestrator.py:109">
P2: Mongo client is created before the `try/finally`, so failures during `init_beanie` or schema registry initialization skip cleanup. Wrap the initialization in the `try/finally` (or use an async context manager) so the client is always closed on startup errors.</violation>
</file>
<file name="backend/workers/run_result_processor.py">
<violation number="1" location="backend/workers/run_result_processor.py:115">
P2: Mongo client initialization is outside the `try/finally`, so a failure during `init_beanie` or other setup will skip cleanup and leave the client unclosed. Wrap initialization in the same `try/finally` (or guard `close()` when init fails) to avoid resource leaks on startup errors.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
```python
start = asyncio.get_running_loop().time()

# Record metrics
self.metrics.record_dlq_message_received(message.original_topic, message.event.event_type)
self.metrics.record_dlq_message_age((datetime.now(timezone.utc) - message.failed_at).total_seconds())

# Process with tracing
ctx = extract_trace_context(message.headers)
with get_tracer().start_as_current_span(
    name="dlq.consume",
    context=ctx,
    kind=SpanKind.CONSUMER,
    attributes={
        EventAttributes.KAFKA_TOPIC: self.dlq_topic,
        EventAttributes.EVENT_TYPE: message.event.event_type,
        EventAttributes.EVENT_ID: message.event.event_id,
    },
):
    await self._process_dlq_message(message)

self.metrics.record_dlq_processing_duration(asyncio.get_running_loop().time() - start, "process")
```
P2: Wrap the processing block in try/except/finally so failures are logged and duration metrics are recorded even when _process_dlq_message raises.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/app/dlq/manager.py, line 80:
<comment>Wrap the processing block in try/except/finally so failures are logged and duration metrics are recorded even when _process_dlq_message raises.</comment>
<file context>
@@ -70,42 +69,35 @@ def __init__(
+ Args:
+ message: Typed DLQMessage (deserialized by FastStream/Avro)
+ """
+ start = asyncio.get_running_loop().time()
+
+ # Record metrics
</file context>
Suggested change:
```python
start = asyncio.get_running_loop().time()
try:
    # Record metrics
    self.metrics.record_dlq_message_received(message.original_topic, message.event.event_type)
    self.metrics.record_dlq_message_age((datetime.now(timezone.utc) - message.failed_at).total_seconds())

    # Process with tracing
    ctx = extract_trace_context(message.headers)
    with get_tracer().start_as_current_span(
        name="dlq.consume",
        context=ctx,
        kind=SpanKind.CONSUMER,
        attributes={
            EventAttributes.KAFKA_TOPIC: self.dlq_topic,
            EventAttributes.EVENT_TYPE: message.event.event_type,
            EventAttributes.EVENT_ID: message.event.event_id,
        },
    ):
        await self._process_dlq_message(message)
except Exception as e:
    self.logger.error(f"Error processing DLQ message: {e}")
    raise
finally:
    self.metrics.record_dlq_processing_duration(asyncio.get_running_loop().time() - start, "process")
```
| app_logger.info("SagaOrchestrator starting...") | ||
|
|
||
| # Initialize MongoDB + Beanie | ||
| mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient( |
P2: Mongo client is created before the try/finally, so failures during init_beanie or schema registry initialization skip cleanup. Wrap the initialization in the try/finally (or use an async context manager) so the client is always closed on startup errors.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/workers/run_saga_orchestrator.py, line 109:
<comment>Mongo client is created before the `try/finally`, so failures during `init_beanie` or schema registry initialization skip cleanup. Wrap the initialization in the `try/finally` (or use an async context manager) so the client is always closed on startup errors.</comment>
<file context>
@@ -105,9 +105,14 @@ async def lifespan(app: FastStream) -> AsyncIterator[None]:
- await container.get(Database) # Triggers init_beanie via DatabaseProvider
- schema_registry = await container.get(SchemaRegistryManager) # Triggers initialize_schemas
+ # Initialize MongoDB + Beanie
+ mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
+ settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
+ )
</file context>
| app_logger.info("ResultProcessor starting...") | ||
|
|
||
| # Initialize MongoDB + Beanie | ||
| mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient( |
P2: Mongo client initialization is outside the try/finally, so a failure during init_beanie or other setup will skip cleanup and leave the client unclosed. Wrap initialization in the same try/finally (or guard close() when init fails) to avoid resource leaks on startup errors.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/workers/run_result_processor.py, line 115:
<comment>Mongo client initialization is outside the `try/finally`, so a failure during `init_beanie` or other setup will skip cleanup and leave the client unclosed. Wrap initialization in the same `try/finally` (or guard `close()` when init fails) to avoid resource leaks on startup errors.</comment>
<file context>
@@ -111,9 +111,14 @@ async def lifespan(app: FastStream) -> AsyncIterator[None]:
- await container.get(Database) # Triggers init_beanie via DatabaseProvider
- schema_registry = await container.get(SchemaRegistryManager) # Triggers initialize_schemas
+ # Initialize MongoDB + Beanie
+ mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
+ settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
+ )
</file context>
```text
Provides:
- Fake Redis, Kafka, K8s (boundary clients)
- Real metrics, repositories, services (internal)
- Real MongoDB via init_beanie
```
P3: The fixture uses AsyncMongoMockClient, so the docstring claiming a "Real MongoDB" is misleading. Update it to reflect that this is an in-memory/mock MongoDB instance.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/tests/unit/conftest.py, line 65:
<comment>The fixture uses `AsyncMongoMockClient`, so the docstring claiming a "Real MongoDB" is misleading. Update it to reflect that this is an in-memory/mock MongoDB instance.</comment>
<file context>
@@ -60,29 +60,33 @@ async def unit_container(test_settings: Settings) -> AsyncGenerator[AsyncContain
- - Fake Redis, Kafka, K8s, MongoDB (boundary clients)
+ - Fake Redis, Kafka, K8s (boundary clients)
- Real metrics, repositories, services (internal)
+ - Real MongoDB via init_beanie
"""
container = make_async_container(
</file context>
```diff
-    - Real MongoDB via init_beanie
+    - MongoDB mock via init_beanie
```
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
backend/app/core/providers.py (1)
103-116: Redis client lacks lifecycle management.

The Kafka producer uses `AsyncIterable` with proper `start()`/`stop()` lifecycle, but the Redis client is returned directly without cleanup. Redis connections may leak on application shutdown.

🔧 Proposed fix with proper lifecycle
```diff
-    @provide
-    def get_redis_client(self, settings: Settings, logger: logging.Logger) -> redis.Redis:
-        logger.info(f"Redis configured: {settings.REDIS_HOST}:{settings.REDIS_PORT}/{settings.REDIS_DB}")
-        return redis.Redis(
+    @provide
+    async def get_redis_client(self, settings: Settings, logger: logging.Logger) -> AsyncIterable[redis.Redis]:
+        logger.info(f"Redis configured: {settings.REDIS_HOST}:{settings.REDIS_PORT}/{settings.REDIS_DB}")
+        client = redis.Redis(
             host=settings.REDIS_HOST,
             port=settings.REDIS_PORT,
             db=settings.REDIS_DB,
             password=settings.REDIS_PASSWORD,
             ssl=settings.REDIS_SSL,
             max_connections=settings.REDIS_MAX_CONNECTIONS,
             decode_responses=settings.REDIS_DECODE_RESPONSES,
             socket_connect_timeout=5,
             socket_timeout=5,
         )
+        yield client
+        await client.aclose()
+        logger.info("Redis client closed")
```
backend/tests/integration/dlq/test_dlq_manager.py (1)
55-85: Wait for partition assignment before processing to avoid test flakiness.

With `auto_offset_reset="latest"` and no committed offsets, the consumer's initial offset is set to the partition's log end when assignment completes. If `process_message()` produces the event before partition assignment finishes, the consumer may start fetching from a position after the message, causing it to be skipped and the test to flake.

Add a check to ensure partition assignment is complete before creating the consume task:
🔧 Suggested stabilization
```diff
 await consumer.start()
+while not consumer.assignment():
+    await asyncio.sleep(0.05)
 consume_task = asyncio.create_task(consume_dlq_events())

 # Process message directly via DLQManager (simulating FastStream handler)
 await dlq_manager.process_message(dlq_message)
```
🤖 Fix all issues with AI agents
In `@backend/app/services/user_settings_service.py`:
- Around line 216-221: The invalidate_cache method currently calls
self._redis.delete(cache_key) without handling Redis errors; update
invalidate_cache to catch Redis-related exceptions (matching the pattern used in
your cache read/write methods), log an error via self.logger.error including the
cache_key and exception details, and swallow the exception so callers like
reset_user_settings won't fail due to cache layer issues while preserving the
existing debug log when deletion succeeds (references: invalidate_cache,
self._redis.delete, self.logger.debug, reset_user_settings).
- Around line 223-227: The _set_cache method currently lets Redis errors
propagate; change it to swallow cache write failures and only log them so
database operations like update_user_settings and restore_settings_to_point
don't fail when caching fails. Wrap the await self._redis.setex(...) call in a
try/except catching broad Redis/connection exceptions (or Exception if specific
client exceptions aren't imported), log the error with self.logger.warning or
self.logger.debug including the cache_key (from _cache_key) and
_CACHE_TTL_SECONDS, and return None silently on failure; leave successful
behavior unchanged (still log debug on success). Ensure no exception is
re-raised so callers of _set_cache (e.g., update_user_settings,
restore_settings_to_point) continue normally when caching fails.
- Around line 55-63: Wrap the Redis read in get_user_settings with a try/except
to catch connection/timeouts from self._redis.get(cache_key) (and optionally
DomainUserSettings.model_validate_json parsing errors); on exception log a
non-fatal warning via self.logger.warning including the exception and cache_key,
then fall back to calling and returning get_user_settings_fresh(user_id). Ensure
you still log the cache hit when cached succeeds and only use the cached data
when model_validate_json succeeds; do not let Redis exceptions propagate and
crash the request.
♻️ Duplicate comments (1)
backend/app/core/providers.py (1)
421-430: SSEConnectionRegistry scope issue persists from previous review.

As flagged in a prior review, `Scope.REQUEST` creates a new registry per HTTP request, causing `get_connection_count()` to always return 0. The registry should use `Scope.APP` to track connections globally across requests.

🔧 Fix: Change scope to APP
```diff
-    @provide(scope=Scope.REQUEST)
+    @provide(scope=Scope.APP)
     def get_sse_connection_registry(
         self,
         logger: logging.Logger,
         connection_metrics: ConnectionMetrics,
     ) -> SSEConnectionRegistry:
```
🧹 Nitpick comments (2)
backend/app/services/user_settings_service.py (1)
40-45: Type annotation uses sync `redis.Redis` but operations are async.

The `redis_client` parameter is typed as `redis.Redis`, but the code uses `await` with Redis operations (e.g., `await self._redis.get(cache_key)`). The `redis.asyncio` module provides `redis.asyncio.Redis` as the async client type. While the current annotation works at runtime due to Python's duck typing, it's technically incorrect for type checkers.

🔧 Suggested fix
```diff
 def __init__(
     self,
     repository: UserSettingsRepository,
     event_service: KafkaEventService,
     logger: logging.Logger,
-    redis_client: redis.Redis,
+    redis_client: redis.Redis,  # type: ignore[type-arg]  # async Redis client
 ) -> None:
```

Alternatively, use the more explicit async type if available in your redis version:

```python
redis_client: "redis.asyncio.Redis[str]",
```

backend/tests/integration/services/user_settings/test_user_settings_service.py (1)
39-43: Consider asserting the results of update wrappers.

The update wrapper calls at the end of the test don't verify their return values. While this exercises the code paths, adding assertions would catch regressions more effectively.
💡 Example assertions
```diff
 # Update wrappers
-await svc.update_theme(user_id, Theme.DARK)
-await svc.update_notification_settings(user_id, DomainNotificationSettings())
-await svc.update_editor_settings(user_id, DomainEditorSettings(tab_size=2))
-await svc.update_custom_setting(user_id, "k", "v")
+theme_result = await svc.update_theme(user_id, Theme.DARK)
+assert theme_result.theme == Theme.DARK
+
+notif_result = await svc.update_notification_settings(user_id, DomainNotificationSettings())
+assert notif_result.notifications is not None
+
+editor_result = await svc.update_editor_settings(user_id, DomainEditorSettings(tab_size=2))
+assert editor_result.editor.tab_size == 2
+
+custom_result = await svc.update_custom_setting(user_id, "k", "v")
+assert custom_result.custom_settings.get("k") == "v"
```
```diff
 async def get_user_settings(self, user_id: str) -> DomainUserSettings:
-    """Get settings with cache; rebuild and cache on miss."""
-    if user_id in self._cache:
-        cached = self._cache[user_id]
-        self.logger.debug(f"Settings cache hit for user {user_id}", extra={"cache_size": len(self._cache)})
-        return cached
+    """Get settings with Redis cache; rebuild and cache on miss."""
+    cache_key = self._cache_key(user_id)
+    cached = await self._redis.get(cache_key)
+    if cached:
+        self.logger.debug(f"Settings cache hit for user {user_id}")
+        return DomainUserSettings.model_validate_json(cached)

     return await self.get_user_settings_fresh(user_id)
```
Redis cache operations lack error handling.
If Redis becomes unavailable or times out, the await self._redis.get(cache_key) call will raise an exception, causing the entire request to fail. For a cache layer, Redis failures should be non-fatal—falling back to fresh data retrieval from the database.
🛡️ Proposed fix with graceful degradation
```diff
 async def get_user_settings(self, user_id: str) -> DomainUserSettings:
     """Get settings with Redis cache; rebuild and cache on miss."""
     cache_key = self._cache_key(user_id)
-    cached = await self._redis.get(cache_key)
-    if cached:
-        self.logger.debug(f"Settings cache hit for user {user_id}")
-        return DomainUserSettings.model_validate_json(cached)
+    try:
+        cached = await self._redis.get(cache_key)
+        if cached:
+            self.logger.debug(f"Settings cache hit for user {user_id}")
+            return DomainUserSettings.model_validate_json(cached)
+    except Exception as e:
+        self.logger.warning(f"Redis cache read failed for user {user_id}: {e}")

     return await self.get_user_settings_fresh(user_id)
```
🤖 Prompt for AI Agents
In `@backend/app/services/user_settings_service.py` around lines 55 - 63, Wrap the
Redis read in get_user_settings with a try/except to catch connection/timeouts
from self._redis.get(cache_key) (and optionally
DomainUserSettings.model_validate_json parsing errors); on exception log a
non-fatal warning via self.logger.warning including the exception and cache_key,
then fall back to calling and returning get_user_settings_fresh(user_id). Ensure
you still log the cache hit when cached succeeds and only use the cached data
when model_validate_json succeeds; do not let Redis exceptions propagate and
crash the request.
```diff
 async def invalidate_cache(self, user_id: str) -> None:
     """Invalidate cached settings for a user."""
-    if self._cache.pop(user_id, None) is not None:
-        self.logger.debug(f"Invalidated cache for user {user_id}", extra={"cache_size": len(self._cache)})
-
-def _add_to_cache(self, user_id: str, settings: DomainUserSettings) -> None:
-    """Add settings to TTL+LRU cache."""
-    self._cache[user_id] = settings
-    self.logger.debug(f"Cached settings for user {user_id}", extra={"cache_size": len(self._cache)})
-
-def get_cache_stats(self) -> dict[str, Any]:
-    """Get cache statistics for monitoring."""
-    return {
-        "cache_size": len(self._cache),
-        "max_cache_size": self._max_cache_size,
-        "expired_entries": 0,
-        "cache_ttl_seconds": self._cache_ttl.total_seconds(),
-    }
+    cache_key = self._cache_key(user_id)
+    deleted = await self._redis.delete(cache_key)
+    if deleted:
+        self.logger.debug(f"Invalidated cache for user {user_id}")
```
Cache invalidation should also handle Redis failures.
For consistency with the cache read/write error handling strategy, invalidate_cache should catch Redis exceptions to ensure reset_user_settings doesn't fail due to cache layer issues.
🛡️ Proposed fix
```diff
 async def invalidate_cache(self, user_id: str) -> None:
     """Invalidate cached settings for a user."""
     cache_key = self._cache_key(user_id)
-    deleted = await self._redis.delete(cache_key)
-    if deleted:
-        self.logger.debug(f"Invalidated cache for user {user_id}")
+    try:
+        deleted = await self._redis.delete(cache_key)
+        if deleted:
+            self.logger.debug(f"Invalidated cache for user {user_id}")
+    except Exception as e:
+        self.logger.warning(f"Failed to invalidate cache for user {user_id}: {e}")
```
+ self.logger.warning(f"Failed to invalidate cache for user {user_id}: {e}")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    async def invalidate_cache(self, user_id: str) -> None:
        """Invalidate cached settings for a user."""
        cache_key = self._cache_key(user_id)
        try:
            deleted = await self._redis.delete(cache_key)
            if deleted:
                self.logger.debug(f"Invalidated cache for user {user_id}")
        except Exception as e:
            self.logger.warning(f"Failed to invalidate cache for user {user_id}: {e}")
```
🤖 Prompt for AI Agents
In `@backend/app/services/user_settings_service.py` around lines 216 - 221, The
invalidate_cache method currently calls self._redis.delete(cache_key) without
handling Redis errors; update invalidate_cache to catch Redis-related exceptions
(matching the pattern used in your cache read/write methods), log an error via
self.logger.error including the cache_key and exception details, and swallow the
exception so callers like reset_user_settings won't fail due to cache layer
issues while preserving the existing debug log when deletion succeeds
(references: invalidate_cache, self._redis.delete, self.logger.debug,
reset_user_settings).
```python
    async def _set_cache(self, user_id: str, settings: DomainUserSettings) -> None:
        """Set settings in Redis cache with TTL."""
        cache_key = self._cache_key(user_id)
        await self._redis.setex(cache_key, self._CACHE_TTL_SECONDS, settings.model_dump_json())
        self.logger.debug(f"Cached settings for user {user_id}")
```
Cache write failures should not propagate.
Similar to the read path, _set_cache should handle Redis failures gracefully. A failed cache write shouldn't cause update_user_settings or restore_settings_to_point to fail when the primary database operation succeeded.
🛡️ Proposed fix
```diff
 async def _set_cache(self, user_id: str, settings: DomainUserSettings) -> None:
     """Set settings in Redis cache with TTL."""
     cache_key = self._cache_key(user_id)
-    await self._redis.setex(cache_key, self._CACHE_TTL_SECONDS, settings.model_dump_json())
-    self.logger.debug(f"Cached settings for user {user_id}")
+    try:
+        await self._redis.setex(cache_key, self._CACHE_TTL_SECONDS, settings.model_dump_json())
+        self.logger.debug(f"Cached settings for user {user_id}")
+    except Exception as e:
+        self.logger.warning(f"Failed to cache settings for user {user_id}: {e}")
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    async def _set_cache(self, user_id: str, settings: DomainUserSettings) -> None:
        """Set settings in Redis cache with TTL."""
        cache_key = self._cache_key(user_id)
        try:
            await self._redis.setex(cache_key, self._CACHE_TTL_SECONDS, settings.model_dump_json())
            self.logger.debug(f"Cached settings for user {user_id}")
        except Exception as e:
            self.logger.warning(f"Failed to cache settings for user {user_id}: {e}")
```
🤖 Prompt for AI Agents
In `@backend/app/services/user_settings_service.py` around lines 223 - 227, The
_set_cache method currently lets Redis errors propagate; change it to swallow
cache write failures and only log them so database operations like
update_user_settings and restore_settings_to_point don't fail when caching
fails. Wrap the await self._redis.setex(...) call in a try/except catching broad
Redis/connection exceptions (or Exception if specific client exceptions aren't
imported), log the error with self.logger.warning or self.logger.debug including
the cache_key (from _cache_key) and _CACHE_TTL_SECONDS, and return None silently
on failure; leave successful behavior unchanged (still log debug on success).
Ensure no exception is re-raised so callers of _set_cache (e.g.,
update_user_settings, restore_settings_to_point) continue normally when caching
fails.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@backend/tests/integration/dlq/test_dlq_manager.py`:
- Around line 79-83: The test uses a fixed asyncio.sleep after await
consumer.start(), which can race with asynchronous partition assignment when
auto_offset_reset="latest"; instead, poll until the consumer has assignments
(e.g., check consumer.assignment() or wait for non-empty consumer._assignment)
and then call consumer.seek_to_end() on each assigned partition before
producing/starting consume_dlq_events(); update the setup that currently
references consumer.start() and consume_dlq_events() to wait for assignment and
seek_to_end() deterministically rather than relying on the 0.5s sleep.
🧹 Nitpick comments (2)
backend/app/events/schema/schema_registry.py (2)
90-92: Redundant schema construction.
`_ensure_schema_registered` already calls `avro_schema_to_python()` and creates an `AvroSchema` object. These lines duplicate that work on every serialization call. Consider caching the `AvroSchema` object in `_ensure_schema_registered` and retrieving it here.

Suggested approach

Add a cache for the schema object in `__init__`:

```python
self._avro_schema_cache: dict[type[DomainEvent], schema.AvroSchema] = {}
```

Update `_ensure_schema_registered` to cache and return the schema object, then reuse it in `serialize_event`.
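A minimal sketch of that caching idea, assuming `_ensure_schema_registered` is where the `AvroSchema` object is first built (registration details elided; signature and sync/async shape should match the real method):

```python
async def _ensure_schema_registered(self, event_type: type[DomainEvent]) -> schema.AvroSchema:
    # Return the cached schema object if we already built and registered it
    cached = self._avro_schema_cache.get(event_type)
    if cached is not None:
        return cached
    # Build the Avro schema object once per event type
    avro_schema = schema.AvroSchema(avro_schema_to_python(event_type))
    # ... existing schema-registry registration for the schema id stays here ...
    self._avro_schema_cache[event_type] = avro_schema
    return avro_schema
```

`serialize_event` would then reuse the returned object instead of calling `avro_schema_to_python()` a second time.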
98-102: Move `datetime` import to module level and add recursive datetime conversion for robustness.

The `from datetime import datetime` import on line 99 is inside the function and should be at module level.

The `avro_schema_to_python()` call happens twice: once in `_ensure_schema_registered()` (line 70) and again in `serialize_event()` (line 91). While both calls are necessary due to the schema_id-only caching, this repeated conversion could be optimized. If this becomes a performance concern, consider caching the AvroSchema object by schema_id.

The datetime conversion on lines 100–102 only handles top-level fields. While no nested models currently contain datetime fields, adding a recursive helper would make the code more resilient to future changes where nested Pydantic models might include datetime fields.

Suggested refactor for module-level import and recursive conversion

Add to module imports:

```python
from datetime import datetime
```

Replace the conversion logic with a recursive helper:

```python
# Convert datetime fields to milliseconds for Avro timestamp-millis logical type
payload = self._convert_datetimes(payload)
```

Add helper method:

```python
def _convert_datetimes(self, obj: Any) -> Any:
    """Recursively convert datetime fields to milliseconds."""
    if isinstance(obj, datetime):
        return int(obj.timestamp() * 1_000)
    if isinstance(obj, dict):
        return {k: self._convert_datetimes(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [self._convert_datetimes(item) for item in obj]
    return obj
```
```python
    # Start consumer BEFORE processing (auto_offset_reset="latest")
    await consumer.start()
    # Small delay to ensure consumer is fully subscribed and ready
    await asyncio.sleep(0.5)
    consume_task = asyncio.create_task(consume_dlq_events())
```
🧩 Analysis chain
🏁 Script executed:
find . -type f -name "test_dlq_manager.py" | head -5
🏁 Script executed:
cat -n backend/tests/integration/dlq/test_dlq_manager.py | head -100
🏁 Script executed:
cat -n backend/tests/integration/dlq/test_dlq_manager.py | sed -n '70,95p'
🏁 Script executed:
rg -i "aiokafka" backend/tests/integration/dlq/test_dlq_manager.py -A 3 -B 3Repository: HardMax71/Integr8sCode
Length of output: 570
🏁 Script executed:
rg "consumer\.(assignment|seek)" --type py | head -20Repository: HardMax71/Integr8sCode
Length of output: 48
🏁 Script executed:
rg "auto_offset_reset" --type py -B 2 -A 5 | head -40Repository: HardMax71/Integr8sCode
Length of output: 3247
🏁 Script executed:
rg "\.assignment\(\)|\.seek_to_end\(\)|\.position\(\)" --type pyRepository: HardMax71/Integr8sCode
Length of output: 48
🏁 Script executed:
python3 << 'EOF'
import subprocess
result = subprocess.run(['pip', 'show', 'aiokafka'], capture_output=True, text=True)
print(result.stdout)
print(result.stderr)
EOF
🌐 Web query:
aiokafka AIOKafkaConsumer assignment seek_to_end API documentation
💡 Result:
Summary — AIOKafkaConsumer.seek_to_end
- Signature: async seek_to_end(*partitions). [1]
- Behavior: set the consumer position to the most-recent available offset for the given TopicPartition(s). If no partitions are passed, it operates on all currently assigned partitions. [1]
- Errors: raises IllegalStateError if any partition is not currently assigned; raises TypeError if passed objects are not TopicPartition instances. [1]
- Note for read_committed isolation: seek_to_end returns the Last Stable Offset (LSO) (not necessarily the high watermark) when isolation_level='read_committed'. [2]
Sources:
[1] aiokafka API docs — AIOKafkaConsumer.seek_to_end.
[2] aiokafka consumer docs — explanation of LSO / read_committed behavior.
🌐 Web query:
aiokafka AIOKafkaConsumer assignment() method API
💡 Result:
AIOKafkaConsumer.assignment()
- Purpose: return the set of TopicPartition objects currently assigned to the consumer.
- Signature / usage: consumer.assignment() → set(TopicPartition). If you used consumer.assign([...]) it returns those partitions; if you used consumer.subscribe(...) it returns the partitions currently assigned by the group (may be empty if assignment hasn’t occurred yet or is in rebalance).
- Notes:
- Manual assign() and subscribe() cannot be mixed; assign() replaces previous assignment.
- assignment() does not change positions; it only reports the current assignment.
- assign()/subscribe() state rules and IllegalStateError behavior described in the docs (e.g., assign() raises IllegalStateError if subscribe() was already called).
Sources: aiokafka API & consumer docs. [1][2]
References
[1] aiokafka API documentation — AIOKafkaConsumer.assign / assignment.
[2] aiokafka consumer guide & source (assign/subscribe and assignment behavior).
🏁 Script executed:
rg "\.position\(|\.committed\(|\.seek\(" --type py backend/tests/integration/ -B 2 -A 2Repository: HardMax71/Integr8sCode
Length of output: 48
🏁 Script executed:
rg "auto_offset_reset.*latest" --type py -B 5 -A 10 | head -60Repository: HardMax71/Integr8sCode
Length of output: 2963
Avoid race condition with auto_offset_reset="latest" and fixed sleep; wait for consumer assignment.
The hard-coded 0.5s sleep does not guarantee the consumer has completed partition assignment. With subscribe() and auto_offset_reset="latest", assignment happens asynchronously after start(). If the message is produced before assignment finishes, the consumer may miss it or start at an incorrect offset. Use assignment polling and seek_to_end() for deterministic behavior.
🛠️ Suggested fix
```diff
 # Start consumer BEFORE processing (auto_offset_reset="latest")
 await consumer.start()
-# Small delay to ensure consumer is fully subscribed and ready
-await asyncio.sleep(0.5)
+# Wait for assignment, then seek to end to ensure we capture the next event
+while not consumer.assignment():
+    await asyncio.sleep(0.05)
+await consumer.seek_to_end()
 consume_task = asyncio.create_task(consume_dlq_events())
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    # Start consumer BEFORE processing (auto_offset_reset="latest")
    await consumer.start()
    # Wait for assignment, then seek to end to ensure we capture the next event
    while not consumer.assignment():
        await asyncio.sleep(0.05)
    await consumer.seek_to_end()
    consume_task = asyncio.create_task(consume_dlq_events())
```
🤖 Prompt for AI Agents
In `@backend/tests/integration/dlq/test_dlq_manager.py` around lines 79 - 83, The
test uses a fixed asyncio.sleep after await consumer.start(), which can race
with asynchronous partition assignment when auto_offset_reset="latest"; instead,
poll until the consumer has assignments (e.g., check consumer.assignment() or
wait for non-empty consumer._assignment) and then call consumer.seek_to_end() on
each assigned partition before producing/starting consume_dlq_events(); update
the setup that currently references consumer.start() and consume_dlq_events() to
wait for assignment and seek_to_end() deterministically rather than relying on
the 0.5s sleep.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@backend/workers/run_sse_bridge.py`:
- Around line 106-110: The subscriber currently sets auto_commit=True in the
broker.subscriber call which, combined with swallowing exceptions in
route_event, can commit Kafka offsets before Redis publish succeeds; change to
manual commit by switching the subscriber's commit policy to AckPolicy.ACK (or
the equivalent manual/manual-ack option) and ensure route_event propagates
exceptions instead of catching/silencing them so that you only
acknowledge/commit after a successful Redis publish; specifically, update the
broker.subscriber invocation (the decorator using *topics, group_id,
auto_commit, decoder=decode_avro) to use manual/ACK commit and modify
route_event to re-raise errors from the Redis publish path so the broker can
avoid committing on failure.
♻️ Duplicate comments (9)
backend/workers/dlq_processor.py (1)
184-195: Manual ack missing with `auto_commit=False`.

With auto-commit disabled, offsets won't be committed unless the handler explicitly acknowledges. Right now the handler only processes the message, so DLQ entries will replay on restart or rebalance. Either enable auto-commit or accept the broker message and call ack/nack after processing.

✅ Minimal fix (enable auto-commit)

```diff
 @broker.subscriber(
     *topics,
     group_id=group_id,
-    auto_commit=False,
+    auto_commit=True,
     decoder=decode_dlq_json,
 )
```

backend/workers/run_result_processor.py (3)
117-120: Mongo client initialization outside `try`/`finally` risks resource leak on startup failure.

If `init_beanie()` or subsequent initialization steps fail, `mongo_client` won't be closed since it's created before the `try` block begins at line 170.

Proposed fix

```diff
 @asynccontextmanager
 async def lifespan() -> AsyncIterator[None]:
     """Initialize infrastructure before app starts."""
     app_logger = await container.get(logging.Logger)
     app_logger.info("ResultProcessor starting...")

     # Initialize MongoDB + Beanie
     mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
         settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
     )
-    await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
-
-    # Resolve schema registry (initialization handled by provider)
-    schema_registry = await container.get(SchemaRegistryManager)
+    try:
+        await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+        # Resolve schema registry (initialization handled by provider)
+        schema_registry = await container.get(SchemaRegistryManager)
```

Then continue the existing logic inside the new `try` block, adjusting the `finally` accordingly.
134-139: `auto_commit=False` is deprecated in FastStream 0.6; use `AckPolicy` instead.

FastStream 0.6 replaces the `auto_commit` parameter with `AckPolicy`. With `auto_commit=False` ignored and no explicit `AckPolicy`, offsets may not be committed correctly after handler success, leading to message redelivery on consumer restart.

Proposed fix

```diff
+from faststream.kafka import AckPolicy
+
 # Create subscriber with Avro decoder
 subscriber = broker.subscriber(
     *topics,
     group_id=group_id,
-    auto_commit=False,
+    ack_policy=AckPolicy.ACK,
     decoder=decode_avro,
 )
```

FastStream 0.6 KafkaBroker subscriber AckPolicy parameter
163-166: Silent catch-all handler discards unmatched events without logging.

The `handle_other` handler silently passes for any event not matching the specific filters. This could mask routing issues or silently drop events if new event types are added. Consider logging at warning level to aid debugging.

Proposed fix

```diff
 # Default handler for unmatched events (prevents message loss)
 @subscriber()
 async def handle_other(event: DomainEvent) -> None:
-    pass
+    app_logger.warning(
+        "Unhandled event type received: %s (event_id=%s)",
+        event.event_type,
+        event.event_id,
+    )
```

backend/workers/run_saga_orchestrator.py (2)
111-114: Mongo client initialization outside `try`/`finally` risks resource leak on startup failure.

Same issue as in `run_result_processor.py`: if `init_beanie()` or subsequent steps fail before the `try` block at line 172, the mongo client won't be closed. Note that the early-exit path (lines 129-132) does handle cleanup correctly.
144-149: `auto_commit=False` is deprecated in FastStream 0.6; use `AckPolicy` instead.

Same issue as in `run_result_processor.py`. Replace with explicit `AckPolicy.ACK` for at-least-once semantics.

Proposed fix

```diff
+from faststream.kafka import AckPolicy
+
 @broker.subscriber(
     *topics,
     group_id=group_id,
-    auto_commit=False,
+    ack_policy=AckPolicy.ACK,
     decoder=decode_avro,
 )
```

backend/workers/run_k8s_worker.py (2)
120-125: `auto_commit=False` is deprecated in FastStream 0.6; use `AckPolicy` instead.

Same issue as in other workers. Replace with explicit `AckPolicy.ACK`.

Proposed fix

```diff
+from faststream.kafka import AckPolicy
+
 # Create subscriber with Avro decoder
 subscriber = broker.subscriber(
     *topics,
     group_id=group_id,
-    auto_commit=False,
+    ack_policy=AckPolicy.ACK,
     decoder=decode_avro,
 )
```

142-145: Silent catch-all handler discards unmatched events without logging.

Same issue as in `run_result_processor.py`. Consider logging unhandled events to aid debugging.

Proposed fix

```diff
 # Default handler for unmatched events (prevents message loss)
 @subscriber()
 async def handle_other(event: DomainEvent) -> None:
-    pass
+    app_logger.warning(
+        "Unhandled event type received: %s (event_id=%s)",
+        event.event_type,
+        event.event_id,
+    )
```

backend/workers/run_sse_bridge.py (1)
# Default handler for unmatched events (prevents message loss) `@subscriber`() async def handle_other(event: DomainEvent) -> None: - pass + app_logger.warning( + "Unhandled event type received: %s (event_id=%s)", + event.event_type, + event.event_id, + )backend/workers/run_sse_bridge.py (1)
101-102: Avoid hardcoded topic string in Avro decoding.
deserialize_event(..., "sse_bridge")is misleading when the subscriber handles multiple configured topics. Pass the actual topic from the message context or remove the parameter if unused.
🧹 Nitpick comments (2)
backend/workers/run_result_processor.py (1)
142-147: Calling private methods (`_handle_*`) from outside the class.

The handlers invoke `logic._handle_completed()`, `logic._handle_failed()`, and `logic._handle_timeout()` directly. These underscore-prefixed methods are conventionally private. Consider either exposing public methods on `ProcessorLogic` or documenting this as an intentional internal API usage.

backend/workers/run_saga_orchestrator.py (1)
163-167: Opportunistic timeout checking approach is reasonable but has caveats.

The timeout check runs only when events arrive. If no events are received for extended periods, timed-out sagas won't be detected. This is documented as "opportunistic", which is acceptable, but consider documenting this limitation or implementing a background task for production if timely timeout detection is critical.
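A hedged sketch of that background-task alternative, assuming the saga logic exposes a timeout-checking coroutine (`check_timeouts()` is an illustrative name, not a confirmed API):

```python
import asyncio

async def saga_timeout_loop(logic: "SagaLogic", interval_seconds: float = 30.0) -> None:
    """Check for timed-out sagas on a fixed cadence, independent of event traffic."""
    while True:
        await logic.check_timeouts()  # hypothetical method; see note above
        await asyncio.sleep(interval_seconds)
```

Started from the worker lifespan with `asyncio.create_task(...)` and cancelled on shutdown, this removes the dependency on incoming events for timeout detection.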
backend/workers/run_sse_bridge.py
Outdated
```python
@broker.subscriber(
    *topics,
    group_id=group_id,
    auto_commit=True,  # SSE bridge is idempotent (Redis pubsub)
    decoder=decode_avro,
)
```
🧩 Analysis chain
🌐 Web query:
FastStream 0.6.0 auto_commit behavior Kafka broker offset commit timing
💡 Result:
Short answer: FastStream 0.6.0 leaves offset commit timing to the underlying Kafka client when using the default ACK_FIRST (auto-ack) behavior — it uses enable.auto.commit and auto_commit_interval_ms from the Confluent/consumer client. That means offsets are committed asynchronously by the Kafka client at the configured auto-commit interval (and are tied to the client's poll/shutdown behavior); if you need at-least-once semantics commit after processing (use AckPolicy.ACK or MANUAL). [1][2][3]
Sources:
- FastStream ack docs (explains ACK_FIRST uses enable.auto.commit and describes ACK / MANUAL behaviors). [1]
- FastStream AsyncConfluentConsumer API showing enable_auto_commit and auto_commit_interval_ms config defaults. [2]
- Kafka client behavior: auto commits occur according to enable.auto.commit and auto.commit.interval.ms (commits tied to poll/interval; sync on shutdown). [3]
🏁 Script executed:
# First, let's locate and examine the actual code file
find . -name "run_sse_bridge.py" -type f
🏁 Script executed:
# Now let's examine the actual code context around lines 106-110
cat -n ./backend/workers/run_sse_bridge.py | sed -n '100,130p'
🏁 Script executed:
# Let's examine the SSEEventRouter and route_event to understand error handling
rg -A 15 "class SSEEventRouter" ./backend/workers/run_sse_bridge.pyRepository: HardMax71/Integr8sCode
Length of output: 48
🏁 Script executed:
# Also check if route_event has any error handling or retry logic
rg -B 5 -A 20 "def route_event" ./backendRepository: HardMax71/Integr8sCode
Length of output: 1851
🏁 Script executed:
# Let's check the error handling in the handler and see if exceptions are re-raised
cat -n ./backend/workers/run_sse_bridge.py | sed -n '112,119p'
🏁 Script executed:
# Check if there's any other commit logic or error handling patterns in the file
rg -n "commit\|ack\|ACK" ./backend/workers/run_sse_bridge.pyRepository: HardMax71/Integr8sCode
Length of output: 48
Switch to manual commit or handle Redis failures by re-raising exceptions from route_event.
FastStream 0.6.0's auto_commit=True relies on the Kafka client's asynchronous offset commit (tied to poll/interval, not handler completion). The handler here swallows exceptions from route_event without re-raising them, so if Redis publish fails transiently, the Kafka offset is committed anyway and the event is lost—breaking at-least-once semantics. Either use AckPolicy.ACK (manual commit after handler success) or let handler exceptions propagate so FastStream can handle commit based on failure.
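A hedged sketch of the propagate-on-failure option, reusing the subscriber shape shown earlier in this thread (`SSEEventRouter` injection and `AckPolicy.ACK` availability are assumptions here, not confirmed project code):

```python
@broker.subscriber(*topics, group_id=group_id, ack_policy=AckPolicy.ACK, decoder=decode_avro)
async def handle_sse_event(event: DomainEvent, router: FromDishka[SSEEventRouter]) -> None:
    # Deliberately no try/except: if the Redis publish inside route_event()
    # raises, the handler fails, the offset is not committed, and Kafka
    # redelivers the message instead of silently dropping the SSE event.
    await router.route_event(event)
```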
🤖 Prompt for AI Agents
In `@backend/workers/run_sse_bridge.py` around lines 106 - 110, The subscriber
currently sets auto_commit=True in the broker.subscriber call which, combined
with swallowing exceptions in route_event, can commit Kafka offsets before Redis
publish succeeds; change to manual commit by switching the subscriber's commit
policy to AckPolicy.ACK (or the equivalent manual/manual-ack option) and ensure
route_event propagates exceptions instead of catching/silencing them so that you
only acknowledge/commit after a successful Redis publish; specifically, update
the broker.subscriber invocation (the decorator using *topics, group_id,
auto_commit, decoder=decode_avro) to use manual/ACK commit and modify
route_event to re-raise errors from the Redis publish path so the broker can
avoid committing on failure.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@backend/workers/dlq_processor.py`:
- Around line 180-197: The DLQ subscriber (decode_dlq_json and
handle_dlq_message) is being registered inside the lifespan block which
FastStream does not support; move the broker.subscriber registration out of the
lifespan context by placing the decode_dlq_json function and the decorated
handle_dlq_message subscriber at module level (retaining group_id,
auto_commit=False, decoder=decode_dlq_json) so it is registered before
FastStream(broker) is created, or alternatively register them in an
`@app.after_startup` hook or use broker.setup_subscriber(...) followed by await
subscriber.start() at runtime; ensure DLQManager injection
(FromDishka[DLQManager]) and the DLQMessage type remain unchanged.
♻️ Duplicate comments (9)
backend/workers/run_result_processor.py (3)
118-133: MongoDB client leak on initialization failure.

If `init_beanie` or subsequent initialization steps fail, `mongo_client` created at line 119 won't be closed. Wrap in a nested try/finally or use `AsyncExitStack`.

🐛 Suggested fix

```diff
 # Initialize MongoDB + Beanie
 mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
     settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
 )
-await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+try:
+    await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+except Exception:
+    await mongo_client.close()
+    raise
```
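For the `AsyncExitStack` variant mentioned above, a minimal sketch (assuming `mongo_client.close()` is awaitable, as the surrounding code treats it):

```python
from contextlib import AsyncExitStack

async with AsyncExitStack() as stack:
    mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
        settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
    )
    # Registered first, so it runs even if any later init step raises
    stack.push_async_callback(mongo_client.close)
    await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
    # ... resolve schema registry, producer, and the rest of startup ...
```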
136-141: Specify explicit `AckPolicy` with `auto_commit=False`.

With `auto_commit=False`, FastStream's default `AckPolicy.ACK_FIRST` won't commit offsets since it relies on Kafka's `enable.auto.commit`. Messages will be redelivered on consumer restart.

🐛 Suggested fix

```diff
+from faststream.kafka import AckPolicy
+
 # Create subscriber with Avro decoder
 subscriber = broker.subscriber(
     *topics,
     group_id=group_id,
     auto_commit=False,
     decoder=decode_avro,
+    ack_policy=AckPolicy.ACK,
 )
```

FastStream 0.6 Kafka AckPolicy.ACK behavior with auto_commit=False

165-168: Silent catch-all handler masks routing issues.

The `handle_other` handler silently discards unmatched events. If new event types are added or routing fails, this will mask the problem. Log a warning to aid debugging.

🔧 Suggested fix

```diff
 # Default handler for unmatched events (prevents message loss)
 @subscriber()
 async def handle_other(event: DomainEvent) -> None:
-    pass
+    app_logger.warning(
+        "Unhandled event type received",
+        extra={"event_type": type(event).__name__, "event": str(event)[:200]},
+    )
```

backend/workers/run_sse_bridge.py (2)
102-105: Avoid hardcoded topic in decode_avro.

Line 104 passes a literal `"sse_bridge"`; this is misleading if topics change or the schema layer starts using the topic value. Consider passing the actual topic from the message context or removing the parameter if unused.

108-121: Revisit auto-commit with handler failure semantics.

With `auto_commit=True` on Line 111, offsets may be committed even if Redis publish fails (depending on FastStream/Kafka client behavior). Prefer manual/ACK commit or ensure failures propagate so offsets aren't committed on error. Please verify the correct FastStream 0.6.0 configuration.

FastStream 0.6.0 Kafka subscriber manual/ack commit configuration and auto_commit behavior

backend/workers/run_k8s_worker.py (1)
144-147: Silent catch-all handler discards unmatched events without logging.

This concern was raised in a previous review and remains unaddressed. The current implementation will silently consume any events that don't match the `CREATE_POD_COMMAND` or `DELETE_POD_COMMAND` filters, which can mask filter logic bugs or cause silent data loss when new event types are introduced.

♻️ Proposed fix: log unmatched events

```diff
 # Default handler for unmatched events (prevents message loss)
 @subscriber()
-async def handle_other(event: DomainEvent) -> None:
-    pass
+async def handle_other(
+    event: DomainEvent,
+    logger: FromDishka[logging.Logger],
+) -> None:
+    logger.warning(
+        "K8s worker received unhandled event type: %s (event_id=%s)",
+        type(event).__name__,
+        getattr(event, "event_id", "unknown"),
+    )
```

backend/workers/dlq_processor.py (1)
186-197: `auto_commit=False` without manual acknowledgment leaves offsets uncommitted.

The subscriber is configured with `auto_commit=False`, but the handler never explicitly acknowledges messages via `ack()`. This means Kafka offsets will not be committed, and messages will be reprocessed on consumer restart or rebalance.

Either enable `auto_commit=True` (default behavior) or inject the raw message and call `ack()` after successful processing:

🔧 Option 1: Enable auto-commit

```diff
 @broker.subscriber(
     *topics,
     group_id=group_id,
-    auto_commit=False,
+    auto_commit=True,
     decoder=decode_dlq_json,
 )
```

🔧 Option 2: Manual acknowledgment

```diff
 @broker.subscriber(
     *topics,
     group_id=group_id,
     auto_commit=False,
     decoder=decode_dlq_json,
 )
 async def handle_dlq_message(
     message: DLQMessage,
+    raw_message: StreamMessage[Any],
     dlq_manager: FromDishka[DLQManager],
 ) -> None:
     """Handle incoming DLQ messages - invoked by FastStream when message arrives."""
     await dlq_manager.process_message(message)
+    await raw_message.ack()
```

backend/workers/run_saga_orchestrator.py (2)
112-123: Ensure startup failures still close the Mongo client and DI container.

Line 112 creates `AsyncMongoClient` before the `try`/`finally` at Line 174, so exceptions from `init_beanie`, schema registry init, or producer init can skip cleanup. Wrap the full initialization inside the `try`/`finally` so resources are always released.

🛠️ Suggested fix (wrap initialization in the existing try/finally)

```diff
-    # Initialize MongoDB + Beanie
-    mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
-        settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
-    )
-    await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
-
-    # Resolve schema registry (initialization handled by provider)
-    schema_registry = await container.get(SchemaRegistryManager)
-
-    # Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
-    await container.get(UnifiedProducer)
-    app_logger.info("Kafka producer ready")
+    mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
+        settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
+    )
+    try:
+        # Initialize MongoDB + Beanie
+        await init_beanie(
+            database=mongo_client[settings.DATABASE_NAME],
+            document_models=ALL_DOCUMENTS,
+        )
+
+        # Resolve schema registry (initialization handled by provider)
+        schema_registry = await container.get(SchemaRegistryManager)
+
+        # Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
+        await container.get(UnifiedProducer)
+        app_logger.info("Kafka producer ready")
@@
-        if not trigger_topics:
-            app_logger.warning("No saga triggers configured, shutting down")
-            yield
-            await mongo_client.close()
-            await container.close()
-            return
+        if not trigger_topics:
+            app_logger.warning("No saga triggers configured, shutting down")
+            yield
+            return
@@
-    try:
-        yield
-    finally:
-        app_logger.info("SagaOrchestrator shutting down...")
-        await mongo_client.close()
-        await container.close()
-        app_logger.info("SagaOrchestrator shutdown complete")
+        yield
+    finally:
+        app_logger.info("SagaOrchestrator shutting down...")
+        await mongo_client.close()
+        await container.close()
+        app_logger.info("SagaOrchestrator shutdown complete")
```

Also applies to: 174-180
146-150: Replace `auto_commit` with explicit AckPolicy for FastStream 0.6.

Line 146–150 uses `auto_commit=False`. In FastStream 0.6, offset behavior is governed by `AckPolicy` (and `auto_commit` may be ignored), which can lead to incorrect or stalled offset advancement. Please verify against your FastStream version and switch to an explicit `ack_policy` (or MANUAL with explicit `message.ack()`).

🛠️ Suggested fix (verify AckPolicy import path for your FastStream version)

```diff
-from faststream.kafka import KafkaBroker
+from faststream.kafka import KafkaBroker, AckPolicy
@@
 @broker.subscriber(
     *topics,
     group_id=group_id,
-    auto_commit=False,
+    ack_policy=AckPolicy.ACK,
     decoder=decode_avro,
 )
```

FastStream 0.6 Kafka subscriber auto_commit AckPolicy
🧹 Nitpick comments (3)
backend/workers/run_result_processor.py (1)
144-163: Consider exposing handler methods as public API.

The handlers call underscore-prefixed methods (`_handle_completed`, `_handle_failed`, and `_handle_timeout`) on `ProcessorLogic`. Since FastStream handlers now directly invoke these, consider renaming them to public methods (e.g., `handle_completed`) in `ProcessorLogic` to clarify intent.

backend/workers/dlq_processor.py (2)

47-48: Unused `logger` parameter.

The `logger` parameter is declared but never used within `_configure_retry_policies`. Consider removing it or adding debug logging for policy configuration.

♻️ Suggested fix

```diff
-def _configure_retry_policies(manager: DLQManager, logger: logging.Logger) -> None:
+def _configure_retry_policies(manager: DLQManager) -> None:
     """Configure topic-specific retry policies."""
```

91-92: Unused `logger` parameter.

Same as `_configure_retry_policies`, the `logger` parameter is unused here as well.

♻️ Suggested fix

```diff
-def _configure_filters(manager: DLQManager, testing: bool, logger: logging.Logger) -> None:
+def _configure_filters(manager: DLQManager, testing: bool) -> None:
     """Configure message filters."""
```
```python
# Decoder: JSON message → typed DLQMessage
def decode_dlq_json(msg: StreamMessage[Any]) -> DLQMessage:
    data = json.loads(msg.body)
    return DLQMessage.model_validate(data)


# Register subscriber for DLQ messages
@broker.subscriber(
    *topics,
    group_id=group_id,
    auto_commit=False,
    decoder=decode_dlq_json,
)
async def handle_dlq_message(
    message: DLQMessage,
    dlq_manager: FromDishka[DLQManager],
) -> None:
    """Handle incoming DLQ messages - invoked by FastStream when message arrives."""
    await dlq_manager.process_message(message)
```
🧩 Analysis chain
🌐 Web query:
FastStream KafkaBroker subscriber registration inside lifespan context manager
💡 Result:
Short answer
- Don't register/start subscribers inside the lifespan() context because lifespan executes before brokers start and after they stop. Use app.after_startup (or register subscribers before creating/running the app) so KafkaBroker will actually start their consumers. [1][2]
Why
- FastStream's lifespan runs before the broker starts and after it stops, so subscribers registered there won't be able to start consuming. The docs explicitly warn: "lifespan starts BEFORE your broker started ... so you can't publish any messages here." [1]
- KafkaBroker subscriber handlers must be registered while the broker is able to start them (either at import-time/app-creation-time or in after_startup hooks). [2]
Minimal example
- Register at app creation (preferred):

```python
from faststream import FastStream
from faststream.kafka import KafkaBroker

broker = KafkaBroker("localhost:9092")

@broker.subscriber("in-topic")
async def handler(msg):
    print(msg)

app = FastStream(broker)
```

- Or register in after_startup if you must register dynamically at runtime:

```python
app = FastStream(broker)

@app.after_startup
async def on_start():
    @broker.subscriber("in-topic")
    async def handler(msg):
        print(msg)
```
Notes
- FastStream recently fixed dynamic subscriber registration issues; if you hit problems make sure you're on a recent release (see release notes). [3]
Sources
- FastStream Lifespan / context manager docs. [1]
- FastStream KafkaBroker API / subscriber docs. [2]
- FastStream release notes (dynamic subscriber registration / KafkaBroker updates). [3]
🌐 Web query:
FastStream Lifespan context manager broker subscriber registration timing
💡 Result:
Short answer: register subscribers only after the broker is running — not inside a lifespan() context manager. FastStream's lifespan() (context manager) runs before the broker is started and after it is stopped, so you cannot publish or rely on a running broker there. Use @app.after_startup (or broker.setup_subscriber / start for dynamic runtime registration) to register/start subscribers while the broker is active. [1][2][3]
Sources:
- Lifespan context manager runs before broker start / after broker shutdown. [1]
- Hook call order and semantics: on_startup runs before broker start; after_startup runs after broker is launched (use after_startup for actions requiring a running broker). [2]
- Release notes: examples for registering/starting subscribers at runtime with an already-started broker. [3]
🌐 Web query:
FastStream best practices subscriber registration when to register subscribers
💡 Result:
Short answer
- Prefer registering subscribers at application startup (before starting the broker/worker) so subscriptions are set up deterministically and you don't miss messages. [1]
- Only register at runtime when you need dynamic/late registration (feature supported), and then call the broker APIs to set up and start the subscriber explicitly. Expect extra complexity (lifecycle, race conditions, and potential missed messages while wiring). [2][3]
Best-practice checklist
- Register handlers during app initialization (module import or app startup) and call broker.setup/start afterwards so the broker knows all endpoints before message flow begins. This avoids race conditions and simplifies graceful shutdown/restart. [1]
- If you must register at runtime:
- Create the Subscriber object, call broker.setup_subscriber(subscriber), then await subscriber.start() (and ensure you later await subscriber.close()). Use this for truly dynamic handlers only. [2]
- Protect against missing messages during the gap between creating the subscriber and starting it (buffering, backpressure, or feature-specific guarantees depend on the broker). [1][2]
- Keep registration idempotent and observable (log subscriber start/stop), and include health checks so orchestration can confirm handlers are active before traffic is allowed. [1]
- Prefer declarative/static registration for simpler deployments, testing, and observability; use dynamic registration for plugin systems or admin-driven handler changes. [1][2]
References
- FastStream subscription docs (channel/subscriber usage). [1]
- FastStream release notes: runtime registration support and API example. [2]
- FastStream subscriber API / proto reference. [3]
Would you like a short code example showing both startup-time and runtime registration?
Subscriber registered inside lifespan will not work as intended.
Registering subscribers inside the lifespan context manager is not supported by FastStream. The lifespan runs before the broker starts and after it stops, so subscribers registered there won't actually start consuming messages. FastStream's documentation explicitly warns against this pattern.
Register the subscriber either:
- At module level (preferred) before `FastStream(broker)` is created
- In an `@app.after_startup` hook if dynamic registration is needed
- Using `broker.setup_subscriber()` followed by `await subscriber.start()` for runtime registration (a hedged sketch follows below)
In `@backend/workers/dlq_processor.py` around lines 180 - 197, The DLQ subscriber
(decode_dlq_json and handle_dlq_message) is being registered inside the lifespan
block which FastStream does not support; move the broker.subscriber registration
out of the lifespan context by placing the decode_dlq_json function and the
decorated handle_dlq_message subscriber at module level (retaining group_id,
auto_commit=False, decoder=decode_dlq_json) so it is registered before
FastStream(broker) is created, or alternatively register them in an
`@app.after_startup` hook or use broker.setup_subscriber(...) followed by await
subscriber.start() at runtime; ensure DLQManager injection
(FromDishka[DLQManager]) and the DLQMessage type remain unchanged.
…hmea registry - from now warning max
3 issues found across 17 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="docs/components/workers/index.md">
<violation number="1" location="docs/components/workers/index.md:61">
P3: The `REJECT_ON_ERROR` row says "At least once", but the note below states the message is committed and not redelivered. That behavior is at-most-once delivery, so the table is contradictory and misleading.</violation>
</file>
<file name="backend/workers/run_sse_bridge.py">
<violation number="1" location="backend/workers/run_sse_bridge.py:112">
P2: `AckPolicy.ACK_FIRST` acks before processing, so any handler error (e.g., Redis outage) will drop SSE events permanently. Use an ack policy that commits after successful processing (or retries on error) to avoid losing events.</violation>
</file>
<file name="docs/architecture/lifecycle.md">
<violation number="1" location="docs/architecture/lifecycle.md:11">
P3: This line still references the removed SSE Kafka→Redis bridge and EventBus. The current code uses SSEEventRouter and has no EventBus class, so the lifecycle list is now misleading. Remove the obsolete items from the list.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
```python
@broker.subscriber(
    *topics,
    group_id=group_id,
    ack_policy=AckPolicy.ACK_FIRST,  # SSE bridge is idempotent (Redis pubsub)
    decoder=decode_avro,
)
```
P2: AckPolicy.ACK_FIRST acks before processing, so any handler error (e.g., Redis outage) will drop SSE events permanently. Use an ack policy that commits after successful processing (or retries on error) to avoid losing events.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/workers/run_sse_bridge.py, line 112:
<comment>`AckPolicy.ACK_FIRST` acks before processing, so any handler error (e.g., Redis outage) will drop SSE events permanently. Use an ack policy that commits after successful processing (or retries on error) to avoid losing events.</comment>
<file context>
@@ -108,7 +109,7 @@ async def decode_avro(msg: StreamMessage[Any]) -> DomainEvent:
*topics,
group_id=group_id,
- auto_commit=True, # SSE bridge is idempotent (Redis pubsub)
+ ack_policy=AckPolicy.ACK_FIRST, # SSE bridge is idempotent (Redis pubsub)
decoder=decode_avro,
)
</file context>
| `ACK_FIRST` | Commit before processing | Already committed | At most once | High throughput, idempotent operations |
| `ACK` | Commit after processing | Commit anyway (no retry) | At least once | Reliable processing without retry needs |
| `NACK_ON_ERROR` | Commit after processing | Seek back for redelivery | At least once | Auto-retry on transient failures |
| `REJECT_ON_ERROR` | Commit after processing | Commit (discard message) | At least once | Permanent failures, no retry desired |
P3: The REJECT_ON_ERROR row says "At least once", but the note below states the message is committed and not redelivered. That behavior is at-most-once delivery, so the table is contradictory and misleading.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/components/workers/index.md, line 61:
<comment>The `REJECT_ON_ERROR` row says "At least once", but the note below states the message is committed and not redelivered. That behavior is at-most-once delivery, so the table is contradictory and misleading.</comment>
<file context>
@@ -21,22 +19,97 @@ graph LR
+| `ACK_FIRST` | Commit before processing | Already committed | At most once | High throughput, idempotent operations |
+| `ACK` | Commit after processing | Commit anyway (no retry) | At least once | Reliable processing without retry needs |
+| `NACK_ON_ERROR` | Commit after processing | Seek back for redelivery | At least once | Auto-retry on transient failures |
+| `REJECT_ON_ERROR` | Commit after processing | Commit (discard message) | At least once | Permanent failures, no retry desired |
+| `MANUAL` | User calls `msg.ack()` | User calls `msg.nack()` | User-controlled | Complex conditional acknowledgement |
+
</file context>
```diff
-| `REJECT_ON_ERROR` | Commit after processing | Commit (discard message) | At least once | Permanent failures, no retry desired |
+| `REJECT_ON_ERROR` | Commit after processing | Commit (discard message) | At most once | Permanent failures, no retry desired |
```
```diff
 ## What changed

-Services with long-running background work now implement the async context manager protocol. Coordinator, KubernetesWorker, PodMonitor, SSE Kafka→Redis bridge, EventStoreConsumer, ResultProcessor, DLQManager, EventBus, and the Kafka producer all expose `__aenter__`/`__aexit__` that call `start`/`stop`.
+Services with long-running background work now implement the async context manager protocol. KubernetesWorker, PodMonitor, SSE Kafka→Redis bridge, EventStoreConsumer, ResultProcessor, DLQManager, EventBus, and the Kafka producer all expose `__aenter__`/`__aexit__` that call `start`/`stop`.
```
P3: This line still references the removed SSE Kafka→Redis bridge and EventBus. The current code uses SSEEventRouter and has no EventBus class, so the lifecycle list is now misleading. Remove the obsolete items from the list.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/architecture/lifecycle.md, line 11:
<comment>This line still references the removed SSE Kafka→Redis bridge and EventBus. The current code uses SSEEventRouter and has no EventBus class, so the lifecycle list is now misleading. Remove the obsolete items from the list.</comment>
<file context>
@@ -8,11 +8,11 @@ The pattern that actually fits Python and asyncio is the language's own RAII: as
## What changed
-Services with long-running background work now implement the async context manager protocol. Coordinator, KubernetesWorker, PodMonitor, SSE Kafka→Redis bridge, EventStoreConsumer, ResultProcessor, DLQManager, EventBus, and the Kafka producer all expose `__aenter__`/`__aexit__` that call `start`/`stop`.
+Services with long-running background work now implement the async context manager protocol. KubernetesWorker, PodMonitor, SSE Kafka→Redis bridge, EventStoreConsumer, ResultProcessor, DLQManager, EventBus, and the Kafka producer all expose `__aenter__`/`__aexit__` that call `start`/`stop`.
DI providers return unstarted instances for these services. The FastAPI lifespan acquires them and uses an `AsyncExitStack` to start/stop them in a single place. That removed scattered start/stop logic from providers and made shutdown order explicit.
</file context>
```diff
-Services with long-running background work now implement the async context manager protocol. KubernetesWorker, PodMonitor, SSE Kafka→Redis bridge, EventStoreConsumer, ResultProcessor, DLQManager, EventBus, and the Kafka producer all expose `__aenter__`/`__aexit__` that call `start`/`stop`.
+Services with long-running background work now implement the async context manager protocol. KubernetesWorker, PodMonitor, EventStoreConsumer, ResultProcessor, DLQManager, and the Kafka producer all expose `__aenter__`/`__aexit__` that call `start`/`stop`.
```
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
docs/architecture/kafka-topic-architecture.md (2)
110-116: Update Replay diagram: `UnifiedProducer` likely removed.

The replay flow still shows `UnifiedProducer`, but the PR summary indicates the migration to FastStream and removal of UnifiedProducer/EventBus. Please update the diagram to reflect the new producer path (e.g., FastStream producer or DI-provided publisher).

148-155: Event schema section appears outdated (Pydantic AvroBase + unified plumbing).

This section references Pydantic `AvroBase` and `events/core/{producer,consumer,dispatcher}.py`, which conflicts with the stated migration to `dataclasses-avroschema` and the FastStream-based messaging stack. Please update the filenames/tech choices to match the new schema tooling and messaging stack.
🤖 Fix all issues with AI agents
In `@backend/workers/run_result_processor.py`:
- Around line 119-123: The MongoDB client created with AsyncMongoClient
(mongo_client) can leak on exceptions thrown during init_beanie() or later setup
before yield; wrap the initialization and the subsequent yield in a single
try/finally so that mongo_client.close()/await mongo_client.close() (use the
appropriate async close) is always called in the finally block; apply the same
pattern to the other startup block around lines 173-178 to guarantee cleanup
even if init_beanie(), schema registry setup, or any pre-yield logic raises.
In `@backend/workers/run_saga_orchestrator.py`:
- Around line 113-124: The startup code creates an AsyncMongoClient and then
calls init_beanie and container.get(...) but doesn't close mongo_client if one
of those calls raises, so wrap the initialization sequence that creates
mongo_client and resolves SchemaRegistryManager/UnifiedProducer in a
try/except/finally (or a try block that closes on exception) so that on any
exception you call mongo_client.close() (and optionally await
mongo_client.close_async if applicable) before re-raising; reference the
AsyncMongoClient instantiation, init_beanie(..., document_models=ALL_DOCUMENTS),
container.get(SchemaRegistryManager) and container.get(UnifiedProducer) to
locate where to add the try/finally and ensure cleanup happens prior to leaving
the startup routine.
In `@docs/architecture/lifecycle.md`:
- Line 11: Update the lifecycle list in lifecycle.md to reflect the new
surfaces: remove references to EventBus and "SSE Kafka→Redis bridge" and instead
list SSEEventRouter and SSEConnectionRegistry (and the DI/FastStream surface) as
the services that implement the async context manager protocol; keep existing
entries for KubernetesWorker, PodMonitor, EventStoreConsumer, ResultProcessor,
DLQManager, and the Kafka producer, and ensure the description says these
services expose __aenter__/__aexit__ that call start/stop.
In `@docs/architecture/services-overview.md`:
- Line 7: Update the legacy name "Saga Orchestrator" to the new component name
"SagaLogic" (or "SagaLogic/DI" where dependency-injection context is relevant)
in the sentence describing system components and any adjacent mentions in this
document; specifically replace occurrences of "Saga Orchestrator" with
"SagaLogic" (or "SagaLogic/DI" if the surrounding text refers to DI behavior) so
the documentation matches the PR's renamed component and avoids confusion.
- Line 73: Replace the outdated "Saga Orchestrator" label in the execution flow
string "User → API → Saga Orchestrator → K8s Worker → Pod → Pod Monitor → Result
Processor" with the current component name(s) — e.g., "SagaLogic/FastStream
workers" or "SagaLogic (FastStream workers)". Update the sentence so it reads
"User → API → SagaLogic/FastStream workers → K8s Worker → Pod → Pod Monitor →
Result Processor" (or your preferred phrasing) to ensure terminology matches the
current components.
In `@docs/components/workers/index.md`:
- Line 78: Update the table row for "SSE Bridge" with ACK_FIRST to explicitly
state the crash window where events can be lost: mention that ACK_FIRST may drop
events if a worker crashes after committing but before publishing to Redis, and
only acceptable when downstream replay or client refresh handles missed events;
update the cell text referencing "SSE Bridge" and "ACK_FIRST" to include this
clarification and avoid implying zero data loss.
♻️ Duplicate comments (3)
backend/workers/run_k8s_worker.py (1)
145-148: Log or surface unmatched events instead of silently discarding.

`handle_other` is still a no-op, so unrecognized events are silently dropped. At minimum, consider warning-level logging to surface routing/config issues.

🔧 Minimal adjustment

```diff
 # Default handler for unmatched events (prevents message loss)
 @subscriber()
 async def handle_other(event: DomainEvent) -> None:
-    pass
+    app_logger.warning("Unhandled k8s_worker event_type: %s", event.event_type)
```

backend/workers/run_result_processor.py (1)

166-169: Log or surface unmatched events instead of silently discarding.

`handle_other` still drops unknown events without any logging, which can mask routing or schema issues.

🔧 Minimal adjustment

```diff
 # Default handler for unmatched events (prevents message loss)
 @subscriber()
 async def handle_other(event: DomainEvent) -> None:
-    pass
+    app_logger.warning("Unhandled result_processor event_type: %s", event.event_type)
```

backend/workers/dlq_processor.py (1)

181-199: Move subscriber registration out of lifespan.

FastStream's lifespan runs before the broker starts and after it stops; registering `@broker.subscriber` inside it can leave the consumer inactive. Define the decoder/subscriber before creating `FastStream(broker)` (or register in `@app.after_startup` with explicit setup/start).

🔧 Suggested adjustment (register before lifespan)

```diff
 broker = KafkaBroker(
     settings.KAFKA_BOOTSTRAP_SERVERS,
     request_timeout_ms=settings.KAFKA_REQUEST_TIMEOUT_MS,
 )

+# Decoder: JSON message → typed DLQMessage
+def decode_dlq_json(msg: StreamMessage[Any]) -> DLQMessage:
+    data = json.loads(msg.body)
+    return DLQMessage.model_validate(data)
+
+@broker.subscriber(
+    *topics,
+    group_id=group_id,
+    ack_policy=AckPolicy.ACK,
+    decoder=decode_dlq_json,
+)
+async def handle_dlq_message(
+    message: DLQMessage,
+    dlq_manager: FromDishka[DLQManager],
+) -> None:
+    await dlq_manager.process_message(message)
+
 @asynccontextmanager
 async def lifespan() -> AsyncIterator[None]:
     """Initialize infrastructure and start scheduled retry timer."""
     app_logger = await container.get(logging.Logger)
     app_logger.info("DLQ Processor starting...")
@@
-    # Decoder: JSON message → typed DLQMessage
-    def decode_dlq_json(msg: StreamMessage[Any]) -> DLQMessage:
-        data = json.loads(msg.body)
-        return DLQMessage.model_validate(data)
-
-    # Register subscriber for DLQ messages
-    @broker.subscriber(
-        *topics,
-        group_id=group_id,
-        ack_policy=AckPolicy.ACK,
-        decoder=decode_dlq_json,
-    )
-    async def handle_dlq_message(
-        message: DLQMessage,
-        dlq_manager: FromDishka[DLQManager],
-    ) -> None:
-        """Handle incoming DLQ messages - invoked by FastStream when message arrives."""
-        await dlq_manager.process_message(message)
```

FastStream 0.6.0 lifespan context manager register subscriber before broker start after_startup
```python
# Initialize MongoDB + Beanie
mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
    settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
)
await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
```
Ensure Mongo client is closed on startup failures.
As in saga orchestrator, exceptions during init_beanie() or schema registry setup before yield will bypass the finally block. Wrap init + yield in a single try/finally to guarantee cleanup.
🔧 Proposed fix
- # Initialize MongoDB + Beanie
- mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
- settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
- )
- await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+ # Initialize MongoDB + Beanie
+ mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
+ settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
+ )
- # Resolve schema registry (initialization handled by provider)
- schema_registry = await container.get(SchemaRegistryManager)
+ try:
+ await init_beanie(
+ database=mongo_client[settings.DATABASE_NAME],
+ document_models=ALL_DOCUMENTS,
+ )
- # Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
- await container.get(UnifiedProducer)
- app_logger.info("Kafka producer ready")
+ # Resolve schema registry (initialization handled by provider)
+ schema_registry = await container.get(SchemaRegistryManager)
+
+ # Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
+ await container.get(UnifiedProducer)
+ app_logger.info("Kafka producer ready")
- # Decoder: Avro message → typed DomainEvent
- async def decode_avro(msg: StreamMessage[Any]) -> DomainEvent:
- return await schema_registry.deserialize_event(msg.body, "result_processor")
+ # Decoder: Avro message → typed DomainEvent
+ async def decode_avro(msg: StreamMessage[Any]) -> DomainEvent:
+ return await schema_registry.deserialize_event(msg.body, "result_processor")
- # Create subscriber with Avro decoder
- subscriber = broker.subscriber(
- *topics,
- group_id=group_id,
- ack_policy=AckPolicy.ACK,
- decoder=decode_avro,
- )
+ # Create subscriber with Avro decoder
+ subscriber = broker.subscriber(
+ *topics,
+ group_id=group_id,
+ ack_policy=AckPolicy.ACK,
+ decoder=decode_avro,
+ )
- # Route by event_type header (producer sets this, Kafka stores as bytes)
- `@subscriber`(filter=lambda msg: msg.headers.get("event_type") == EventType.EXECUTION_COMPLETED.encode())
- async def handle_completed(
- event: ExecutionCompletedEvent,
- logic: FromDishka[ProcessorLogic],
- ) -> None:
- await logic._handle_completed(event)
+ # Route by event_type header (producer sets this, Kafka stores as bytes)
+ `@subscriber`(filter=lambda msg: msg.headers.get("event_type") == EventType.EXECUTION_COMPLETED.encode())
+ async def handle_completed(
+ event: ExecutionCompletedEvent,
+ logic: FromDishka[ProcessorLogic],
+ ) -> None:
+ await logic._handle_completed(event)
- `@subscriber`(filter=lambda msg: msg.headers.get("event_type") == EventType.EXECUTION_FAILED.encode())
- async def handle_failed(
- event: ExecutionFailedEvent,
- logic: FromDishka[ProcessorLogic],
- ) -> None:
- await logic._handle_failed(event)
+ `@subscriber`(filter=lambda msg: msg.headers.get("event_type") == EventType.EXECUTION_FAILED.encode())
+ async def handle_failed(
+ event: ExecutionFailedEvent,
+ logic: FromDishka[ProcessorLogic],
+ ) -> None:
+ await logic._handle_failed(event)
- `@subscriber`(filter=lambda msg: msg.headers.get("event_type") == EventType.EXECUTION_TIMEOUT.encode())
- async def handle_timeout(
- event: ExecutionTimeoutEvent,
- logic: FromDishka[ProcessorLogic],
- ) -> None:
- await logic._handle_timeout(event)
+ `@subscriber`(filter=lambda msg: msg.headers.get("event_type") == EventType.EXECUTION_TIMEOUT.encode())
+ async def handle_timeout(
+ event: ExecutionTimeoutEvent,
+ logic: FromDishka[ProcessorLogic],
+ ) -> None:
+ await logic._handle_timeout(event)
- # Default handler for unmatched events (prevents message loss)
- @subscriber()
- async def handle_other(event: DomainEvent) -> None:
- pass
+ # Default handler for unmatched events (prevents message loss)
+ @subscriber()
+ async def handle_other(event: DomainEvent) -> None:
+ pass
- app_logger.info("Infrastructure initialized, starting event processing...")
-
- try:
- yield
- finally:
- app_logger.info("ResultProcessor shutting down...")
- await mongo_client.close()
- await container.close()
- app_logger.info("ResultProcessor shutdown complete")
+ app_logger.info("Infrastructure initialized, starting event processing...")
+ yield
+ finally:
+ app_logger.info("ResultProcessor shutting down...")
+ await mongo_client.close()
+ await container.close()
+ app_logger.info("ResultProcessor shutdown complete")Also applies to: 173-178
🤖 Prompt for AI Agents
In `@backend/workers/run_result_processor.py` around lines 119 - 123, The MongoDB
client created with AsyncMongoClient (mongo_client) can leak on exceptions
thrown during init_beanie() or later setup before yield; wrap the initialization
and the subsequent yield in a single try/finally so that
mongo_client.close()/await mongo_client.close() (use the appropriate async
close) is always called in the finally block; apply the same pattern to the
other startup block around lines 173-178 to guarantee cleanup even if
init_beanie(), schema registry setup, or any pre-yield logic raises.
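Stepping back from the specific files, the pattern these prompts describe reduces to one rule: once the client exists, every exit path runs cleanup. A minimal, self-contained sketch assuming PyMongo 4.9+'s async client; `init_resources` is a hypothetical stand-in for `init_beanie()` and the DI lookups.

```python
from contextlib import asynccontextmanager
from typing import AsyncIterator, Awaitable, Callable

from pymongo import AsyncMongoClient


@asynccontextmanager
async def mongo_lifespan(
    url: str,
    init_resources: Callable[[AsyncMongoClient], Awaitable[None]],
) -> AsyncIterator[AsyncMongoClient]:
    client: AsyncMongoClient = AsyncMongoClient(url, tz_aware=True)
    try:
        # May raise before the app ever serves traffic; cleanup still runs.
        await init_resources(client)
        yield client
    finally:
        await client.close()
```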
# Initialize MongoDB + Beanie
mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
    settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
)
await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)

# Resolve schema registry (initialization handled by provider)
schema_registry = await container.get(SchemaRegistryManager)

# Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
await container.get(UnifiedProducer)
app_logger.info("Kafka producer ready")
Ensure Mongo client is closed on startup failures.
If init_beanie() or schema registry init throws before yield, the finally block won’t run, leaving AsyncMongoClient (and possibly the container) unclosed. Consider wrapping the initialization in the same try/finally that handles shutdown.
🔧 Proposed fix
- # Initialize MongoDB + Beanie
- mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
- settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
- )
- await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+ # Initialize MongoDB + Beanie
+ mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
+ settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
+ )
- # Resolve schema registry (initialization handled by provider)
- schema_registry = await container.get(SchemaRegistryManager)
+ try:
+ await init_beanie(
+ database=mongo_client[settings.DATABASE_NAME],
+ document_models=ALL_DOCUMENTS,
+ )
- # Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
- await container.get(UnifiedProducer)
- app_logger.info("Kafka producer ready")
+ # Resolve schema registry (initialization handled by provider)
+ schema_registry = await container.get(SchemaRegistryManager)
+
+ # Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
+ await container.get(UnifiedProducer)
+ app_logger.info("Kafka producer ready")
- # Get saga logic to determine topics
- logic = await container.get(SagaLogic)
- trigger_topics = logic.get_trigger_topics()
+ # Get saga logic to determine topics
+ logic = await container.get(SagaLogic)
+ trigger_topics = logic.get_trigger_topics()
- if not trigger_topics:
- app_logger.warning("No saga triggers configured, shutting down")
- yield
- await mongo_client.close()
- await container.close()
- return
+ if not trigger_topics:
+ app_logger.warning("No saga triggers configured, shutting down")
+ yield
+ return
- # Build topic list with prefix
- topics = [f"{settings.KAFKA_TOPIC_PREFIX}{t}" for t in trigger_topics]
- group_id = f"{GroupId.SAGA_ORCHESTRATOR}.{settings.KAFKA_GROUP_SUFFIX}"
+ # Build topic list with prefix
+ topics = [f"{settings.KAFKA_TOPIC_PREFIX}{t}" for t in trigger_topics]
+ group_id = f"{GroupId.SAGA_ORCHESTRATOR}.{settings.KAFKA_GROUP_SUFFIX}"
- # Decoder: Avro message → typed DomainEvent
- async def decode_avro(msg: StreamMessage[Any]) -> DomainEvent:
- return await schema_registry.deserialize_event(msg.body, "saga_orchestrator")
+ # Decoder: Avro message → typed DomainEvent
+ async def decode_avro(msg: StreamMessage[Any]) -> DomainEvent:
+ return await schema_registry.deserialize_event(msg.body, "saga_orchestrator")
- # Register handler dynamically after determining topics
- # Saga orchestrator uses single handler - routing is internal to SagaLogic
- @broker.subscriber(
- *topics,
- group_id=group_id,
- ack_policy=AckPolicy.ACK,
- decoder=decode_avro,
- )
- async def handle_saga_event(
- event: DomainEvent,
- saga_logic: FromDishka[SagaLogic],
- ) -> None:
- """
- Handle saga trigger events.
-
- Dependencies are automatically injected via Dishka.
- Routing is handled internally by SagaLogic based on saga configuration.
- """
- # Handle the event through saga logic (internal routing)
- await saga_logic.handle_event(event)
-
- # Opportunistic timeout check (replaces background loop)
- now = time.monotonic()
- if now - timeout_check_state["last_check"] >= timeout_check_state["interval"]:
- await saga_logic.check_timeouts_once()
- timeout_check_state["last_check"] = now
-
- app_logger.info(f"Subscribing to topics: {topics}")
- app_logger.info("Infrastructure initialized, starting event processing...")
-
- try:
- yield
- finally:
- app_logger.info("SagaOrchestrator shutting down...")
- await mongo_client.close()
- await container.close()
- app_logger.info("SagaOrchestrator shutdown complete")
+ # Register handler dynamically after determining topics
+ # Saga orchestrator uses single handler - routing is internal to SagaLogic
+ @broker.subscriber(
+ *topics,
+ group_id=group_id,
+ ack_policy=AckPolicy.ACK,
+ decoder=decode_avro,
+ )
+ async def handle_saga_event(
+ event: DomainEvent,
+ saga_logic: FromDishka[SagaLogic],
+ ) -> None:
+ """
+ Handle saga trigger events.
+
+ Dependencies are automatically injected via Dishka.
+ Routing is handled internally by SagaLogic based on saga configuration.
+ """
+ # Handle the event through saga logic (internal routing)
+ await saga_logic.handle_event(event)
+
+ # Opportunistic timeout check (replaces background loop)
+ now = time.monotonic()
+ if now - timeout_check_state["last_check"] >= timeout_check_state["interval"]:
+ await saga_logic.check_timeouts_once()
+ timeout_check_state["last_check"] = now
+
+ app_logger.info(f"Subscribing to topics: {topics}")
+ app_logger.info("Infrastructure initialized, starting event processing...")
+ yield
+ finally:
+ app_logger.info("SagaOrchestrator shutting down...")
+ await mongo_client.close()
+ await container.close()
+ app_logger.info("SagaOrchestrator shutdown complete")Also applies to: 175-180
🤖 Prompt for AI Agents
In `@backend/workers/run_saga_orchestrator.py` around lines 113 - 124, The startup
code creates an AsyncMongoClient and then calls init_beanie and
container.get(...) but doesn't close mongo_client if one of those calls raises,
so wrap the initialization sequence that creates mongo_client and resolves
SchemaRegistryManager/UnifiedProducer in a try/except/finally (or a try block
that closes on exception) so that on any exception you call mongo_client.close()
(and optionally await mongo_client.close_async if applicable) before re-raising;
reference the AsyncMongoClient instantiation, init_beanie(...,
document_models=ALL_DOCUMENTS), container.get(SchemaRegistryManager) and
container.get(UnifiedProducer) to locate where to add the try/finally and ensure
cleanup happens prior to leaving the startup routine.
## What changed
- Services with long-running background work now implement the async context manager protocol. Coordinator, KubernetesWorker, PodMonitor, SSE Kafka→Redis bridge, EventStoreConsumer, ResultProcessor, DLQManager, EventBus, and the Kafka producer all expose `__aenter__`/`__aexit__` that call `start`/`stop`.
+ Services with long-running background work now implement the async context manager protocol. KubernetesWorker, PodMonitor, SSE Kafka→Redis bridge, EventStoreConsumer, ResultProcessor, DLQManager, EventBus, and the Kafka producer all expose `__aenter__`/`__aexit__` that call `start`/`stop`.
Update lifecycle list to match new surfaces.
This line still references EventBus and the “SSE Kafka→Redis bridge,” which the PR objectives say were removed/replaced by SSEEventRouter/SSEConnectionRegistry. Please align the lifecycle list with the new DI/FastStream surface to avoid misleading docs.
🤖 Prompt for AI Agents
In `@docs/architecture/lifecycle.md` at line 11, Update the lifecycle list in
lifecycle.md to reflect the new surfaces: remove references to EventBus and "SSE
Kafka→Redis bridge" and instead list SSEEventRouter and SSEConnectionRegistry
(and the DI/FastStream surface) as the services that implement the async context
manager protocol; keep existing entries for KubernetesWorker, PodMonitor,
EventStoreConsumer, ResultProcessor, DLQManager, and the Kafka producer, and
ensure the description says these services expose __aenter__/__aexit__ that call
start/stop.
## High-level architecture
- The API (FastAPI) receives user requests for auth, execute, events, scripts, and settings. The Coordinator accepts validated execution requests and enqueues them to Kafka with metadata and idempotency guards. The Saga Orchestrator drives stateful execution via events and publishes commands to the K8s Worker. The K8s Worker builds and creates per-execution pods and supporting ConfigMaps with network isolation enforced at cluster level via Cilium policy. Pod Monitor watches K8s and translates pod phases and logs into domain events. Result Processor consumes completion/failure/timeout events, updates DB, and cleans resources. SSE Router fans execution events out to connected clients. DLQ Processor and Event Replay support reliability and investigations.
+ The API (FastAPI) receives user requests for auth, execute, events, scripts, and settings. The Saga Orchestrator drives stateful execution via events and publishes commands to the K8s Worker. The K8s Worker builds and creates per-execution pods and supporting ConfigMaps with network isolation enforced at cluster level via Cilium policy. Pod Monitor watches K8s and translates pod phases and logs into domain events. Result Processor consumes completion/failure/timeout events, updates DB, and cleans resources. SSE Router fans execution events out to connected clients. DLQ Processor and Event Replay support reliability and investigations.
Align saga naming with the new SagaLogic/DI terminology.
The PR objectives say SagaLogic replaces SagaOrchestrator, but this line still uses “Saga Orchestrator.” Consider renaming here (and in adjacent sections) to match the new component name and avoid confusion.
🔧 Suggested wording tweak
-The Saga Orchestrator drives stateful execution via events and publishes commands to the K8s Worker.
+The Saga Logic (FastStream subscriber) drives stateful execution via events and publishes commands to the K8s Worker.
📝 Committable suggestion
- The API (FastAPI) receives user requests for auth, execute, events, scripts, and settings. The Saga Orchestrator drives stateful execution via events and publishes commands to the K8s Worker. The K8s Worker builds and creates per-execution pods and supporting ConfigMaps with network isolation enforced at cluster level via Cilium policy. Pod Monitor watches K8s and translates pod phases and logs into domain events. Result Processor consumes completion/failure/timeout events, updates DB, and cleans resources. SSE Router fans execution events out to connected clients. DLQ Processor and Event Replay support reliability and investigations.
+ The API (FastAPI) receives user requests for auth, execute, events, scripts, and settings. The Saga Logic (FastStream subscriber) drives stateful execution via events and publishes commands to the K8s Worker. The K8s Worker builds and creates per-execution pods and supporting ConfigMaps with network isolation enforced at cluster level via Cilium policy. Pod Monitor watches K8s and translates pod phases and logs into domain events. Result Processor consumes completion/failure/timeout events, updates DB, and cleans resources. SSE Router fans execution events out to connected clients. DLQ Processor and Event Replay support reliability and investigations.
🤖 Prompt for AI Agents
In `@docs/architecture/services-overview.md` at line 7, Update the legacy name
"Saga Orchestrator" to the new component name "SagaLogic" (or "SagaLogic/DI"
where dependency-injection context is relevant) in the sentence describing
system components and any adjacent mentions in this document; specifically
replace occurrences of "Saga Orchestrator" with "SagaLogic" (or "SagaLogic/DI"
if the surrounding text refers to DI behavior) so the documentation matches the
PR's renamed component and avoids confusion.
## Common flows
- The main execution flow goes: User → API → Coordinator → Saga Orchestrator → K8s Worker → Pod → Pod Monitor → Result Processor. See [Lifecycle](lifecycle.md) for the full execution state machine.
+ The main execution flow goes: User → API → Saga Orchestrator → K8s Worker → Pod → Pod Monitor → Result Processor. See [Lifecycle](lifecycle.md) for the full execution state machine.
Update the execution flow label to match current component names.
If “Saga Orchestrator” has been replaced by SagaLogic/FastStream workers, the flow line should reflect that to avoid conflicting terminology.
🔧 Suggested wording tweak
-The main execution flow goes: User → API → Saga Orchestrator → K8s Worker → Pod → Pod Monitor → Result Processor.
+The main execution flow goes: User → API → Saga Logic (FastStream) → K8s Worker → Pod → Pod Monitor → Result Processor.
📝 Committable suggestion
- The main execution flow goes: User → API → Saga Orchestrator → K8s Worker → Pod → Pod Monitor → Result Processor. See [Lifecycle](lifecycle.md) for the full execution state machine.
+ The main execution flow goes: User → API → Saga Logic (FastStream) → K8s Worker → Pod → Pod Monitor → Result Processor. See [Lifecycle](lifecycle.md) for the full execution state machine.
🤖 Prompt for AI Agents
In `@docs/architecture/services-overview.md` at line 73, Replace the outdated
"Saga Orchestrator" label in the execution flow string "User → API → Saga
Orchestrator → K8s Worker → Pod → Pod Monitor → Result Processor" with the
current component name(s) — e.g., "SagaLogic/FastStream workers" or "SagaLogic
(FastStream workers)". Update the sentence so it reads "User → API →
SagaLogic/FastStream workers → K8s Worker → Pod → Pod Monitor → Result
Processor" (or your preferred phrasing) to ensure terminology matches the
current components.
| Saga Orchestrator | `ACK` | Saga state is persisted to MongoDB; duplicate processing is idempotent |
| K8s Worker | `ACK` | Pod creation is idempotent (same execution ID); uses idempotency middleware |
| Result Processor | `ACK` | Result persistence is idempotent; duplicate writes are safe |
| SSE Bridge | `ACK_FIRST` | Redis pubsub is fire-and-forget; missing an event causes a client retry, not data loss |
Clarify ACK_FIRST rationale for SSE Bridge to avoid implying no data loss.
ACK_FIRST can drop events if the worker crashes after commit but before Redis publish. If loss is acceptable due to downstream replay or client refresh, state that explicitly.
💬 Suggested wording tweak
-| SSE Bridge | `ACK_FIRST` | Redis pubsub is fire-and-forget; missing an event causes a client retry, not data loss |
+| SSE Bridge | `ACK_FIRST` | Redis pubsub is fire-and-forget; ACK_FIRST may drop events on crash, which is acceptable only if clients can refresh/replay from another source |
📝 Committable suggestion
-| SSE Bridge | `ACK_FIRST` | Redis pubsub is fire-and-forget; missing an event causes a client retry, not data loss |
+| SSE Bridge | `ACK_FIRST` | Redis pubsub is fire-and-forget; ACK_FIRST may drop events on crash, which is acceptable only if clients can refresh/replay from another source |
🤖 Prompt for AI Agents
In `@docs/components/workers/index.md` at line 78, Update the table row for "SSE
Bridge" with ACK_FIRST to explicitly state the crash window where events can be
lost: mention that ACK_FIRST may drop events if a worker crashes after
committing but before publishing to Redis, and only acceptable when downstream
replay or client refresh handles missed events; update the cell text referencing
"SSE Bridge" and "ACK_FIRST" to include this clarification and avoid implying
zero data loss.
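To make the table's semantics concrete, here is a minimal FastStream sketch contrasting the two policies. This is an illustration under assumptions: topic names, group IDs, and handler bodies are invented, and the imports assume FastStream 0.6's Kafka support.

```python
from faststream import AckPolicy, FastStream
from faststream.kafka import KafkaBroker

broker = KafkaBroker("localhost:9092")

# At-least-once: the offset commits only after the handler returns, so a
# crash mid-handler means the broker redelivers the event (duplicates possible).
@broker.subscriber("execution.results", group_id="result-processor",
                   ack_policy=AckPolicy.ACK)
async def persist_result(event: dict) -> None:
    ...  # idempotent write, safe to replay

# At-most-once: the offset commits before the handler runs, so a crash between
# commit and publish silently drops the event (the SSE Bridge trade-off).
@broker.subscriber("execution.events", group_id="sse-bridge",
                   ack_policy=AckPolicy.ACK_FIRST)
async def fan_out(event: dict) -> None:
    ...  # fire-and-forget Redis publish

app = FastStream(broker)
```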
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@backend/workers/dlq_processor.py`:
- Around line 161-178: The initialization sequence can leak the AsyncMongoClient
if init_beanie() or any subsequent container.get calls raise before the outer
finally runs; wrap the MongoDB setup so mongo_client is closed on error by
adding a nested try/finally (or try/except + close) immediately after creating
mongo_client and before awaiting init_beanie(), ensuring mongo_client.close() is
called on any exception; update the block around AsyncMongoClient creation,
init_beanie(database=...), and the subsequent container.get(...) calls
(including SchemaRegistryManager, UnifiedProducer, and DLQManager/_configure_*
calls) to guarantee mongo_client is closed if initialization fails before
yielding.
♻️ Duplicate comments (7)
backend/workers/run_k8s_worker.py (1)
144-147: Silent catch-all handler discards unmatched events without logging.
The `handle_other` handler silently ignores events that don't match the CREATE_POD_COMMAND or DELETE_POD_COMMAND filters. This can mask routing bugs or missing handlers for new event types. Consider logging at debug/warning level.
🔧 Suggested improvement
 # Default handler for unmatched events (prevents message loss)
 @subscriber()
-async def handle_other(event: DomainEvent) -> None:
-    pass
+async def handle_other(event: DomainEvent) -> None:
+    app_logger.debug(f"Ignoring unmatched event: {event.event_type}")
backend/workers/run_result_processor.py (2)
118-129: Ensure MongoDB client is closed on initialization failures.
Same issue as in dlq_processor.py - if `init_beanie()` or subsequent setup calls fail before `yield`, the `mongo_client` won't be closed.
🔧 Proposed fix pattern
 mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
     settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
 )
-await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+try:
+    await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+    # ... rest of initialization ...
+except Exception:
+    await mongo_client.close()
+    raise
165-168: Silent catch-all handler discards unmatched events without logging.
Same pattern as other workers - unmatched events are silently dropped. Consider logging for debugging purposes.
backend/workers/run_saga_orchestrator.py (1)
112-123: Ensure MongoDB client is closed on initialization failures.
Same pattern as other workers - if `init_beanie()` or subsequent setup fails before `yield`, `mongo_client` won't be closed.
backend/workers/run_sse_bridge.py (2)
108-120: `AckPolicy.ACK_FIRST` may lose events on handler failures.
ACK_FIRST commits the offset before `route_event` runs. If Redis is temporarily unavailable, the event is lost since the offset is already committed. While the comment notes SSE is "idempotent," it's actually "lossy" - clients may miss events with no way to recover them from this worker.
If occasional event loss is acceptable (clients can refresh state), this is fine. Otherwise, consider `AckPolicy.ACK` to commit after successful Redis publish.
102-104: Hardcoded topic string "sse_bridge" in deserialize_event.
The `topic` parameter passed to `deserialize_event` is hardcoded as `"sse_bridge"` while the actual subscribed topics come from `CONSUMER_GROUP_SUBSCRIPTIONS`. While the parameter appears unused in the current implementation, this creates maintenance risk if it becomes used later.
docs/components/workers/index.md (1)
56-62: `REJECT_ON_ERROR` delivery guarantee is incorrect.
Line 61 states REJECT_ON_ERROR provides "At least once" delivery, but the behavior described (commit on error, discard message) is actually at most once - failed messages are never redelivered. The note below correctly explains this, but the table row is misleading.
-| `REJECT_ON_ERROR` | Commit after processing | Commit (discard message) | At least once | Permanent failures, no retry desired |
+| `REJECT_ON_ERROR` | Commit after processing | Commit (discard message) | At most once | Permanent failures, no retry desired |
🧹 Nitpick comments (1)
backend/workers/run_result_processor.py (1)
144-163: Consider exposing public handler methods in ProcessorLogic.
The handlers call private methods (`_handle_completed`, `_handle_failed`, `_handle_timeout`) directly. While functional, this breaks encapsulation. Consider adding public wrapper methods in ProcessorLogic for cleaner API boundaries.
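A hypothetical shape for those wrappers - class and method names mirror the comment above, not the project's actual signatures:

```python
class ProcessorLogic:
    """Sketch: public entrypoints delegating to the existing private handlers."""

    async def handle_completed(self, event: object) -> None:
        await self._handle_completed(event)

    async def handle_failed(self, event: object) -> None:
        await self._handle_failed(event)

    async def handle_timeout(self, event: object) -> None:
        await self._handle_timeout(event)

    # The private methods below stand in for the current implementation.
    async def _handle_completed(self, event: object) -> None: ...
    async def _handle_failed(self, event: object) -> None: ...
    async def _handle_timeout(self, event: object) -> None: ...
```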
# Initialize MongoDB + Beanie
mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
    settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
)
await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)

# Resolve schema registry (initialization handled by provider)
await container.get(SchemaRegistryManager)

# Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
await container.get(UnifiedProducer)
app_logger.info("Kafka producer ready")

# Get DLQ manager and configure policies
manager = await container.get(DLQManager)
_configure_retry_policies(manager, app_logger)
_configure_filters(manager, testing=settings.TESTING, logger=app_logger)
app_logger.info("DLQ Manager configured")
Ensure MongoDB client is closed on initialization failures.
If init_beanie() or subsequent setup calls throw before reaching yield, the finally block won't execute, leaving mongo_client unclosed. Wrap the initialization in a nested try/finally or handle exceptions explicitly.
🔧 Proposed fix
# Initialize MongoDB + Beanie
mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
)
- await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
-
- # Resolve schema registry (initialization handled by provider)
- await container.get(SchemaRegistryManager)
-
- # Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
- await container.get(UnifiedProducer)
- app_logger.info("Kafka producer ready")
-
- # Get DLQ manager and configure policies
- manager = await container.get(DLQManager)
- _configure_retry_policies(manager, app_logger)
- _configure_filters(manager, testing=settings.TESTING, logger=app_logger)
- app_logger.info("DLQ Manager configured")
+ try:
+ await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+
+ # Resolve schema registry (initialization handled by provider)
+ await container.get(SchemaRegistryManager)
+
+ # Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
+ await container.get(UnifiedProducer)
+ app_logger.info("Kafka producer ready")
+
+ # Get DLQ manager and configure policies
+ manager = await container.get(DLQManager)
+ _configure_retry_policies(manager, app_logger)
+ _configure_filters(manager, testing=settings.TESTING, logger=app_logger)
+ app_logger.info("DLQ Manager configured")
+ except Exception:
+ await mongo_client.close()
+ raise🤖 Prompt for AI Agents
In `@backend/workers/dlq_processor.py` around lines 161 - 178, The initialization
sequence can leak the AsyncMongoClient if init_beanie() or any subsequent
container.get calls raise before the outer finally runs; wrap the MongoDB setup
so mongo_client is closed on error by adding a nested try/finally (or try/except
+ close) immediately after creating mongo_client and before awaiting
init_beanie(), ensuring mongo_client.close() is called on any exception; update
the block around AsyncMongoClient creation, init_beanie(database=...), and the
subsequent container.get(...) calls (including SchemaRegistryManager,
UnifiedProducer, and DLQManager/_configure_* calls) to guarantee mongo_client is
closed if initialization fails before yielding.
|
1 issue found across 8 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="backend/app/events/core/producer.py">
<violation number="1" location="backend/app/events/core/producer.py:55">
P2: The merge order lets caller-supplied headers override the required `event_type` header, so routing can break if `headers` contains `event_type`. Preserve the `event_type` value after merging additional headers.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
# Always include event_type header for routing, merge with any additional headers
all_headers = {"event_type": str(event_to_produce.event_type)}
if headers:
    all_headers.update(headers)
P2: The merge order lets caller-supplied headers override the required event_type header, so routing can break if headers contains event_type. Preserve the event_type value after merging additional headers.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/app/events/core/producer.py, line 55:
<comment>The merge order lets caller-supplied headers override the required `event_type` header, so routing can break if `headers` contains `event_type`. Preserve the `event_type` value after merging additional headers.</comment>
<file context>
@@ -49,8 +49,13 @@ async def produce(
+ # Always include event_type header for routing, merge with any additional headers
+ all_headers = {"event_type": str(event_to_produce.event_type)}
+ if headers:
+ all_headers.update(headers)
+
# Convert headers to list of tuples format
</file context>
- all_headers.update(headers)
+ all_headers = {**headers, "event_type": str(event_to_produce.event_type)}
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/domain/enums/kafka.py (1)
56-67: Remove unused `EVENT_STORE_CONSUMER` and `DLQ_MANAGER` enum values or add missing dictionary entries.
These two `GroupId` enum values have no corresponding entries in `CONSUMER_GROUP_SUBSCRIPTIONS` or `CONSUMER_GROUP_EVENTS`, and are not referenced anywhere in the codebase. If any code attempts to look them up, it will raise a `KeyError`. Either remove these enum values if they are unused, or add the missing dictionary entries if they are intended for future use.
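One cheap guard against this class of drift is a parametrized test asserting the mappings cover every enum member. The import path and names below are taken from the comment and may not match the codebase exactly:

```python
import pytest

from app.domain.enums.kafka import (
    CONSUMER_GROUP_EVENTS,
    CONSUMER_GROUP_SUBSCRIPTIONS,
    GroupId,
)


@pytest.mark.parametrize("group_id", list(GroupId))
def test_every_group_id_is_mapped(group_id: GroupId) -> None:
    # A missing entry would otherwise surface as a KeyError at runtime.
    assert group_id in CONSUMER_GROUP_SUBSCRIPTIONS
    assert group_id in CONSUMER_GROUP_EVENTS
```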
🤖 Fix all issues with AI agents
In `@backend/app/events/core/producer.py`:
- Around line 52-58: The merge currently lets caller-supplied headers override
the routing header; to make event_type authoritative, change the merge so caller
headers are applied first and then set/overwrite "event_type" from
event_to_produce.event_type (i.e., build all_headers from headers or update with
headers, then set all_headers["event_type"] = str(event_to_produce.event_type)),
keeping the existing conversion to header_list = [(k, v.encode()) for k, v in
all_headers.items()]; reference the variables event_to_produce, headers,
all_headers, and header_list when locating the change.
♻️ Duplicate comments (8)
backend/workers/dlq_processor.py (2)
185-205: Verify FastStream supports registering subscribers inside lifespan.
If FastStream 0.6 doesn’t activate subscribers registered within `lifespan`, the DLQ consumer may never start. Please confirm; if unsupported, move subscriber registration to module scope or `@app.after_startup`.
161-178: Close Mongo client on startup failures before lifespan yields.
If init_beanie or any `container.get(...)` raises before the `try: yield`, the Mongo client won’t be closed. Wrap the init sequence to ensure cleanup on errors.
🐛 Proposed fix
 mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
     settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
 )
-await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
-
-# Resolve schema registry (initialization handled by provider)
-await container.get(SchemaRegistryManager)
-
-# Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
-await container.get(UnifiedProducer)
-app_logger.info("Kafka producer ready")
-
-# Get DLQ manager and configure policies
-manager = await container.get(DLQManager)
-_configure_retry_policies(manager, app_logger)
-_configure_filters(manager, testing=settings.TESTING, logger=app_logger)
-app_logger.info("DLQ Manager configured")
+try:
+    await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+
+    # Resolve schema registry (initialization handled by provider)
+    await container.get(SchemaRegistryManager)
+
+    # Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
+    await container.get(UnifiedProducer)
+    app_logger.info("Kafka producer ready")
+
+    # Get DLQ manager and configure policies
+    manager = await container.get(DLQManager)
+    _configure_retry_policies(manager, app_logger)
+    _configure_filters(manager, testing=settings.TESTING, logger=app_logger)
+    app_logger.info("DLQ Manager configured")
+except Exception:
+    await mongo_client.close()
+    await container.close()
+    raise
backend/workers/run_sse_bridge.py (2)
102-105: Avoid hardcoded topic string in deserialize_event.
If `deserialize_event` ever uses the topic argument, the literal `"sse_bridge"` will be misleading. Consider passing the actual topic or documenting why a constant is correct here.
107-111: ACK_FIRST risks at‑most‑once delivery; prefer ACK after handler success.
ACK_FIRST commits before the handler runs; if Redis publish fails (or errors are suppressed), events can be lost. Consider `AckPolicy.ACK` (or NACK_ON_ERROR) and ensure failures propagate.
🐛 Proposed fix
 subscriber = broker.subscriber(
     *topics,
     group_id=group_id,
-    ack_policy=AckPolicy.ACK_FIRST,  # SSE bridge is idempotent (Redis pubsub)
+    ack_policy=AckPolicy.ACK,  # commit after successful handler
     decoder=decode_avro,
 )
backend/workers/run_k8s_worker.py (1)
144-147: Warn on unmatched events instead of silently dropping them.
A warning helps detect new/incorrect event routing.
💡 Suggested tweak
 @subscriber()
 async def handle_other(event: DomainEvent) -> None:
-    pass
+    app_logger.warning("Unhandled K8s worker event_type=%s", event.event_type)
backend/workers/run_result_processor.py (2)
118-129: Close Mongo client on startup failures before lifespan yields.
Init errors before the `try: yield` skip cleanup and leave the client open.
🐛 Proposed fix
 mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
     settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
 )
-await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
-
-# Resolve schema registry (initialization handled by provider)
-schema_registry = await container.get(SchemaRegistryManager)
-
-# Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
-await container.get(UnifiedProducer)
-app_logger.info("Kafka producer ready")
+try:
+    await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+
+    # Resolve schema registry (initialization handled by provider)
+    schema_registry = await container.get(SchemaRegistryManager)
+
+    # Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
+    await container.get(UnifiedProducer)
+    app_logger.info("Kafka producer ready")
+except Exception:
+    await mongo_client.close()
+    await container.close()
+    raise
165-168: Warn on unmatched events instead of silently dropping them.
Helps catch new event types or filter mismatches.
💡 Suggested tweak
 @subscriber()
 async def handle_other(event: DomainEvent) -> None:
-    pass
+    app_logger.warning("Unhandled result-processor event_type=%s", event.event_type)
backend/workers/run_saga_orchestrator.py (1)
112-123: Close Mongo client on startup failures before lifespan yields.
If init_beanie or provider resolution fails before `try: yield`, the client isn’t closed.
🐛 Proposed fix
 mongo_client: AsyncMongoClient[dict[str, object]] = AsyncMongoClient(
     settings.MONGODB_URL, tz_aware=True, serverSelectionTimeoutMS=5000
 )
-await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
-
-# Resolve schema registry (initialization handled by provider)
-schema_registry = await container.get(SchemaRegistryManager)
-
-# Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
-await container.get(UnifiedProducer)
-app_logger.info("Kafka producer ready")
+try:
+    await init_beanie(database=mongo_client[settings.DATABASE_NAME], document_models=ALL_DOCUMENTS)
+
+    # Resolve schema registry (initialization handled by provider)
+    schema_registry = await container.get(SchemaRegistryManager)
+
+    # Resolve Kafka producer (lifecycle managed by DI - BoundaryClientProvider starts it)
+    await container.get(UnifiedProducer)
+    app_logger.info("Kafka producer ready")
+except Exception:
+    await mongo_client.close()
+    await container.close()
+    raise
🧹 Nitpick comments (2)
backend/workers/dlq_processor.py (1)
202-205: Log unexpected DLQ messages instead of silently discarding.
A warning makes it easier to spot misrouted or malformed messages.
💡 Suggested tweak
 @subscriber()
 async def handle_other(message: DLQMessage) -> None:
-    pass
+    app_logger.warning(
+        "Unhandled DLQ message without original_topic header: %s",
+        message,
+    )
backend/workers/run_saga_orchestrator.py (1)
169-172: Consider logging unmatched events.
A warning would surface unexpected saga events or filter mismatches.
💡 Suggested tweak
 @subscriber()
 async def handle_other(event: DomainEvent) -> None:
-    pass
+    app_logger.warning("Unhandled saga event_type=%s", event.event_type)
# Always include event_type header for routing, merge with any additional headers
all_headers = {"event_type": str(event_to_produce.event_type)}
if headers:
    all_headers.update(headers)

# Convert headers to list of tuples format
- header_list = [(k, v.encode()) for k, v in headers.items()] if headers else None
+ header_list = [(k, v.encode()) for k, v in all_headers.items()]
Header merge order allows accidental override of event_type.
The comment states the intent to "always include event_type header for routing," but the current merge order allows callers to override it via the headers dict. If the event_type should be authoritative for routing, reverse the merge order so the event's type always wins.
Proposed fix to ensure event_type cannot be overridden
# Always include event_type header for routing, merge with any additional headers
- all_headers = {"event_type": str(event_to_produce.event_type)}
- if headers:
- all_headers.update(headers)
+ all_headers = dict(headers) if headers else {}
+ all_headers["event_type"] = str(event_to_produce.event_type) # Authoritative📝 Committable suggestion
- # Always include event_type header for routing, merge with any additional headers
- all_headers = {"event_type": str(event_to_produce.event_type)}
- if headers:
-     all_headers.update(headers)
+ # Always include event_type header for routing, merge with any additional headers
+ all_headers = dict(headers) if headers else {}
+ all_headers["event_type"] = str(event_to_produce.event_type)  # Authoritative

# Convert headers to list of tuples format
header_list = [(k, v.encode()) for k, v in all_headers.items()]
🤖 Prompt for AI Agents
In `@backend/app/events/core/producer.py` around lines 52 - 58, The merge
currently lets caller-supplied headers override the routing header; to make
event_type authoritative, change the merge so caller headers are applied first
and then set/overwrite "event_type" from event_to_produce.event_type (i.e.,
build all_headers from headers or update with headers, then set
all_headers["event_type"] = str(event_to_produce.event_type)), keeping the
existing conversion to header_list = [(k, v.encode()) for k, v in
all_headers.items()]; reference the variables event_to_produce, headers,
all_headers, and header_list when locating the change.
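Both bots converge on the same fix: merge caller headers first, then stamp the authoritative value. A standalone sketch of that semantics (plain dicts, no Kafka; names follow the snippet above):

```python
def build_headers(event_type: str, extra: dict[str, str] | None = None) -> list[tuple[str, bytes]]:
    """Caller headers first, then the authoritative event_type wins."""
    all_headers = dict(extra or {})
    all_headers["event_type"] = event_type  # cannot be overridden by callers
    return [(k, v.encode()) for k, v in all_headers.items()]


# Even a spoofed event_type in caller headers is overwritten:
assert ("event_type", b"pod.created") in build_headers("pod.created", {"event_type": "spoofed"})
```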




Summary
Removing nullable fields where possible - source of errors.
Type of Change
Test Plan
- `uv run pytest`
- `npx playwright test`
- Manual test steps:
Checklist
- `pre-commit run --all-files` and all checks pass
Summary by cubic
Replaced in-process lifecycles with DI and moved message handling to FastStream. Removed UnifiedConsumer/EventDispatcher and the coordinator; EventBus is removed; NotificationService self-starts; SSE now uses a stateless event router with a connection registry.
Refactors
Migration
Written for commit 4b104db. Summary will update on new commits.
Summary by CodeRabbit
New Features
Refactor
Removed
Tests
Chores