fix: resilience improvements for retry logic, context trimming, blob listing, and logging #139
Merged
Dongbumlee merged 10 commits into v2 on Mar 19, 2026
- QdrantMemoryStore: in-process Qdrant embedded vector store, per-process isolation
- SharedMemoryContextProvider: ContextProvider that reads/writes shared memories
- invoking(): queries Qdrant for relevant memories before each LLM call
- invoked(): stores agent responses into shared memory after each LLM call
- OrchestratorBase: auto-initializes memory store + attaches provider to expert agents
- Enabled by default, controlled via SHARED_MEMORY_ENABLED env var
- Requires AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME for embedding generation
- mem0_async_memory: reduced max_tokens from 100K to 4K for extraction calls
- All 77 existing tests pass

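The "enabled by default" behavior of SHARED_MEMORY_ENABLED, combined with the embedding deployment requirement, could be sketched roughly as below. This is an illustration only: the function names and the set of accepted false-like spellings are assumptions, not code from this PR.

```python
import os

# Hypothetical set of spellings treated as "off"; the PR does not list them.
_FALSE_VALUES = {"0", "false", "no", "off"}


def shared_memory_enabled(env=None):
    """Return True unless SHARED_MEMORY_ENABLED is set to a false-like value."""
    env = os.environ if env is None else env
    raw = env.get("SHARED_MEMORY_ENABLED")
    if raw is None:
        return True  # feature is enabled by default
    return raw.strip().lower() not in _FALSE_VALUES


def memory_can_start(env=None):
    """Shared memory also needs an embedding deployment name to be configured."""
    env = os.environ if env is None else env
    deployment = env.get("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME", "").strip()
    return shared_memory_enabled(env) and bool(deployment)
```

Passing a plain dict as `env` keeps the check easy to unit test without touching the real process environment.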
… tests
- MigrationProcessor: creates QdrantMemoryStore at workflow start, disposes in finally
- Memory persists across all 4 steps (analysis→design→convert→documentation)
- OrchestratorBase: resolves memory from AppContext instead of creating its own
- SharedMemoryContextProvider: fix duck typing for isinstance checks
- 18 tests for QdrantMemoryStore (init, add, search, workflow lifecycle)
- 20 tests for SharedMemoryContextProvider (invoking, invoked, edge cases)
- All 115 tests pass (77 existing + 38 new)

- SharedMemoryContextProvider: log inject count + stored content per agent turn
- MigrationProcessor: log total memory count after each step completes
- Enables real-time monitoring of memory flow across workflow steps

- Fix: use get_bearer_token_provider() instead of async variant (AzureCliCredential await error)
- Add print() statements for memory init diagnostics (embedding deployment found/missing/failed)
- Tested locally: 20 memories across 4 steps, workflow completed successfully in 19m 25s

- Workspace context injected into agent system instructions (never trimmed)
- keep_last_messages reduced 50→20, max_total_chars 600K→400K
- ResultGenerator prompts moved to prompt_resultgenerator.txt (4 steps)
- Step transition phase shows 'Initializing {Step}' instead of step name
- flush_agent_memories() fixed: use agent.context_provider.providers
- Guard against uninitialized store in _flush_memory()
- Same-step memory skip (only search cross-step memories)
- Buffered storage (only last response per agent stored)
- Debug log for memory store resolution per step
- Tested: 17m 23s with keep_last_messages=20, all 4 steps PASS
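
The trimming limits in this commit (keep_last_messages=20, max_total_chars=400K, workspace context never trimmed) could work along the following lines. This is a minimal sketch under stated assumptions: the message shape, the function name, and the rule that the newest message is always kept are hypothetical and not taken from the PR's code.

```python
from typing import Dict, List

Message = Dict[str, str]  # assumed shape: {"role": ..., "content": ...}


def trim_context(
    messages: List[Message],
    keep_last_messages: int = 20,
    max_total_chars: int = 400_000,
) -> List[Message]:
    """Keep all system messages, then the newest other messages that fit."""
    # System instructions (where workspace context lives) are never trimmed.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    # First limit: only the last N non-system messages are candidates.
    recent = rest[-keep_last_messages:] if keep_last_messages > 0 else []

    # Second limit: total character budget, spent newest-first.
    budget = max_total_chars - sum(len(m["content"]) for m in system)
    kept: List[Message] = []
    for message in reversed(recent):
        size = len(message["content"])
        if kept and size > budget:
            break  # older messages no longer fit
        kept.append(message)  # the newest message is always kept (assumption)
        budget -= size
    kept.reverse()  # restore chronological order
    return system + kept
```

Note that this sketch moves system messages to the front of the returned list, which is harmless when they are already first.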
…WARE ROUTING
- Added rule 6: route to YAML Expert if their sign-off is PENDING before Chief Architect finalizes
- Same pattern as Chief Architect PENDING fix in design coordinator

… UI fixes
- Bicep: add text-embedding-3-large model deployment (capacity 500) alongside GPT5.1
- Bicep: add AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME to App Config keys
- mem0_async_memory: replace hardcoded endpoints with env vars (AZURE_OPENAI_*)
- keep_last_messages adjusted 20→30 for analysis step stability
- Analysis executor: phase shows 'Initializing Analysis' instead of 'Analysis'
- Test assertions updated for new phase name

…fix, logging
- Fix list_blobs_in_container trailing-slash bug causing intermittent 'files not found'
- Remove tool-result truncation; only summarize save_content_to_blob writes
- Protect last message from per-message truncation
- Increase retry config: 8 retries, 5s base, 120s max with exponential backoff
- Add cooldown delay on context-trim retries to avoid triggering 429s
- Retry transient errors: empty messages, 5xx server errors
- Add embedding retry logic (3 retries) in QdrantMemoryStore
- Reduce keep_last_messages 30→15; disable per-message truncation
- Fix duplicate yaml_conversion/yaml telemetry key
- Clear OrchestratorBase._client_cache between processes
- Convert all runtime print() to logger.info/error/warning
- Remove text2art dependency
- Add debug logging to SharedMemoryContextProvider invoked/flush
- Prohibit Markdown footnotes in documentation reports
- Add diagnostic logging for _embed and _flush_memory failures
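
To illustrate the kind of trailing-slash problem fixed in list_blobs_in_container, the sketch below normalizes a folder path into a prefix that always ends in a single "/" before filtering and computing relative paths. It is not the PR's implementation: the helper names are hypothetical, and it works on a plain list of blob names rather than calling the Azure SDK.

```python
from typing import Iterable, List


def normalize_prefix(folder: str) -> str:
    """Return a blob prefix ending in exactly one '/', or '' for the container root."""
    cleaned = folder.strip("/")
    return f"{cleaned}/" if cleaned else ""


def relative_paths(blob_names: Iterable[str], folder: str) -> List[str]:
    """Paths of blobs under `folder`, relative to that folder."""
    prefix = normalize_prefix(folder)
    return [
        name[len(prefix):]
        for name in blob_names
        # Matching on the normalized prefix avoids picking up sibling folders
        # such as "proc/source-old" when asked for "proc/source".
        if name.startswith(prefix) and name != prefix
    ]
```

With the normalized prefix, "proc/source" and "proc/source/" yield the same listing, so callers no longer get different results depending on how the folder path was written.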
Purpose
Does this introduce a breaking change?
Golden Path Validation
Deployment Validation
What to Check
Verify that the following are valid
- `[MEMORY] Stored memory from` logs appear across all steps, NOT `_flush_memory skipped — memory store not initialized`
- `[MEMORY] Injecting N memories for {Agent}` in design/convert/documentation steps
- `[MEMORY] Workflow complete — closing memory store (N memories)` with N > 0
- `[MEMORY] _embed skipped — client=None` warnings
- `[AOAI_RETRY]` logs show exponential backoff with 5s base delay
- `yaml` key (no duplicate `yaml_conversion`)
- `SHARED_MEMORY_ENABLED=false` falls back to original v2 behavior without errors
Key Changes
Critical Bug Fixes
- `name_starts_with` prefix and `relative_path` computation now use normalized path with trailing /
- `AppContext._instances` cache held the old closed store from previous runs, causing `_initialized=False` on all subsequent runs. Now cleared before re-registering
Retry and Resilience
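The commit notes give the retry configuration as 8 retries, 5s base, 120s max with exponential backoff. A minimal sketch of such a schedule follows, assuming the delay doubles on each attempt and no jitter is applied (neither detail is stated in the PR). The function names and the `is_transient` callback are hypothetical.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(
    max_retries: int = 8, base_delay: float = 5.0, max_delay: float = 120.0
) -> List[float]:
    """Delay before each retry: base * 2**attempt, capped at max_delay."""
    return [min(base_delay * 2 ** attempt, max_delay) for attempt in range(max_retries)]


def call_with_retry(
    func: Callable[[], T],
    is_transient: Callable[[Exception], bool],
    max_retries: int = 8,
    base_delay: float = 5.0,
    max_delay: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying transient failures with exponential backoff."""
    delays = backoff_delays(max_retries, base_delay, max_delay)
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception as exc:
            # Give up on the final attempt or on non-transient errors.
            if attempt == max_retries or not is_transient(exc):
                raise
            sleep(delays[attempt])
    raise RuntimeError("unreachable")  # loop always returns or raises
```

With these defaults the waits are 5, 10, 20, 40, 80, 120, 120, and 120 seconds. The `sleep` parameter is injectable so tests can record delays instead of actually waiting.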
Context Window
Logging and Telemetry
Other
Other Information
Configuration: