
Fix/sub thread routing and apitoken access #78

Merged
rbuergi merged 123 commits into main from fix/sub-thread-routing-and-apitoken-access
Apr 9, 2026
Conversation

@rbuergi (Contributor) commented Apr 3, 2026

No description provided.

rbuergi and others added 30 commits March 31, 2026 22:24
Mark CosmosImport and PostgreSqlImport tools as IsPackable=false
to fix NU5019 errors during dotnet pack.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Map ApiToken nodeType to Permission.Api in CreateNodePermissionAttribute
  (same satellite pattern as Thread/Comment)
- Set IsSatelliteType=true on ApiToken node, add validation cache
- Fix delegation null delivery check and add cancellation registration
  to prevent infinite hang on sub-thread routing failures
- Add tests for ApiToken creation and delegation failure handling

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The RoutingGrain was passing deliveries to grains without updating
the target address when path resolution split prefix/remainder.
This caused routing loops for deeply nested sub-thread paths (6+
segments) because the grain received a delivery whose target didn't
match its hub address.

Now mirrors RoutingServiceBase behavior: sets UnifiedPath property
and updates delivery target to the resolved prefix address.

Also adds InternalsVisibleTo for MeshWeaver.Hosting.Orleans to
access WithTarget, and Orleans tests for sub-thread routing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… leak

Organization instances (e.g., PartnerRe) were visible to all authenticated
users in search because of three layers of public read access:
- ConfigureNodeTypeAccess(WithPublicRead) bypassed partition access in SQL
- WithPublicRead() on hub config allowed unauthenticated hub reads
- Access rule returned true for all reads

Now Organization instances require partition-level permissions for read
access. The Organization type definition itself remains visible (it's
nodeType=NodeType which has its own WithPublicRead). Routing is unaffected
as MeshCatalog path resolution is unprotected by design.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The GenerateAccessControlClause was using OR between public_read and
partition_access, meaning any node type with public_read=true (Markdown,
User, Organization) was visible to all authenticated users across ALL
partitions. This leaked cross-partition data in search results.

Now: partition_access is always required for schema-qualified queries.
public_read only skips node-level permission checks within accessible
partitions, not the partition check itself.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
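The corrected precedence can be sketched as follows (illustrative C#, not the actual SQL stored procedure; the names are assumptions):

```csharp
// Hypothetical sketch of the corrected GenerateAccessControlClause logic:
// partition_access is always required, and public_read only waives the
// node-level permission check inside partitions the user can access.
static class AccessClause
{
    public static string Generate(bool publicRead) =>
        publicRead
            ? "partition_access"                        // partition check still applies
            : "partition_access AND node_permission";   // plus the node-level check
}
```

The key design point is that `public_read` no longer appears on the other side of an OR from `partition_access`, so no clause can be satisfied without the partition check.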
Updates the stored procedure so partition_access is always required.
public_read only skips node-level permission checks, not the partition
check. Prevents cross-partition data leakage in global search.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests now reflect that public_read does not bypass partition_access.
- GlobalAdmin tests: grant partition_access to all org schemas
- PublicRead test: verifies no results without partition_access
- CrossPartition access test: asserts other orgs are excluded
- Renamed PartnerRe references to FutuRe in test data

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AsyncLocal doesn't flow through the AI framework's async streaming
and tool invocation chain, so MeshPlugin tool calls (Get, Search,
Create, Update, Patch, Delete) ran without user identity. This caused
"Access denied" when agents tried to update nodes in partitions the
user had access to.

Now each tool call explicitly restores the user's AccessContext from
ThreadExecutionContext.UserAccessContext via SwitchAccessContext.

Also fixes FutuRe schema reference in CrossPartitionSearchTests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verifies that Get and Patch work when AsyncLocal context is cleared
(simulating AI framework tool invocation). The plugin must restore
the user's identity from ThreadExecutionContext.UserAccessContext.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SetContext directly instead of SwitchAccessContext — no await needed,
no disposal needed. The AsyncLocal is scoped to the thread's InvokeAsync
async flow so setting it once per tool call is sufficient.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The delegation tool calls meshService.CreateNode(subThreadNode) which
requires Permission.Thread. Without access context (AsyncLocal lost in
AI framework's tool invocation), this fails silently → delegation returns
an error → the AI retries infinitely, creating endless delegation attempts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests end-to-end delegation: agent calls delegate_to_agent tool,
which creates a sub-thread and submits to it. Verifies access context
flows through the AI tool invocation chain.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- RoutingGrain: log resolution details at Info level for debugging
- Delegation: guard against depth >= 3 to prevent infinite sub-threads
- ThreadPathResolutionTest: verifies PostgreSQL correctly resolves
  deeply nested _Thread paths via satellite table (all 5 tests pass)
- OrleansDelegationFlowTest: skeleton for end-to-end delegation test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace per-tool RestoreUserContext() with AccessContextAIFunction
(DelegatingAIFunction) wrapper applied to ALL tools in CreateAgentCore.
Every tool invocation — MeshPlugin, delegation, PlanStorage, etc. —
now automatically restores the user's identity from
ThreadExecutionContext.UserAccessContext before executing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
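The wrapper pattern can be sketched generically (illustrative only; this is not the real `DelegatingAIFunction` API from Microsoft.Extensions.AI, and `AccessContext` here is a stand-in type):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Stand-in for the user's access context, carried in an AsyncLocal slot.
public sealed class AccessContext
{
    private static readonly AsyncLocal<AccessContext?> Slot = new();
    public static AccessContext? Current
    {
        get => Slot.Value;
        set => Slot.Value = value;
    }

    public AccessContext(string userId) => UserId = userId;
    public string UserId { get; }
}

public static class ToolWrapper
{
    // Wrap a tool delegate so the stored user context is restored before
    // the tool body runs. AsyncLocal does not flow through the AI
    // framework's streaming and tool-invocation chain, so the identity
    // must be re-established explicitly on every call.
    public static Func<string, Task<string>> WithUserContext(
        Func<string, Task<string>> tool, AccessContext userContext) =>
        async input =>
        {
            AccessContext.Current = userContext;
            return await tool(input);
        };
}
```

Applying this wrapper uniformly at agent-construction time is what removes the need for a per-tool `RestoreUserContext()` call.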
StreamingCompact was recursively embedding sub-thread StreamingArea
via LayoutAreaControl when tc.Result == null. This caused infinite
grain activations when the sub-thread didn't exist (CreateNode failed
due to missing access context) — each failed activation triggered
another embed attempt.

Now delegation links are static with status indicators (dot/checkmark).
No recursive LayoutAreaControl embedding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- AccessContextToolCallTest: verifies tool calls restore user identity
  from ThreadExecutionContext, even when AsyncLocal is cleared
- StreamingRecursionTest: verifies delegation ToolCalls don't trigger
  recursive LayoutAreaControl embedding
- DelegationDepthGuard: verifies depth >= 3 is detected correctly

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous guard counted _Thread segments, which failed to detect
Worker→Worker→Worker recursion. Now it counts the segments after _Thread/
to determine the real delegation depth (each level adds msgId/subId = 2
segments). Maximum depth = 2 (one delegation level).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
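The described counting rule can be sketched as follows (illustrative names; it assumes sub-thread paths shaped like `.../_Thread/threadId(/msgId/subId)*`, per the path format discussed elsewhere in this PR):

```csharp
using System;

static class DelegationGuard
{
    // Depth = number of segments after "_Thread", minus the thread id,
    // divided by two (each delegation level appends msgId/subId).
    public static int Depth(string path)
    {
        var segments = path.Split('/');
        int thread = Array.IndexOf(segments, "_Thread");
        if (thread < 0) return 0;
        int after = segments.Length - thread - 1; // segments past "_Thread"
        return Math.Max(0, (after - 1) / 2);      // minus threadId, two per level
    }
}
```

Counting trailing segments rather than `_Thread` occurrences is what catches same-agent (Worker→Worker→Worker) recursion, since recursive delegations add path segments without adding `_Thread` markers.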
- ToolStatusFormatter: show agent name (without Agent/ prefix) + task
  preview instead of "Delegating to Agent/Worker"
- appsettings: set MeshWeaver.AI and RoutingGrain to Information level
  so delegation and routing traces appear in App Insights
- Fix delegation depth guard to count actual nesting from path segments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Delegation entries now use <details> (same as regular tool calls) so
users can expand to see the full task and result. Removed the recursive
LayoutAreaView embed for in-progress delegations that caused stack
overflow via cascading grain activations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NotifyParentCompletion was using DelegationTracker (static in-memory
dictionary) which can't work across Orleans silos. Now posts a second
SubmitMessageResponse with Status=ExecutionCompleted back through the
hub, which the parent's RegisterCallback receives to resolve the
delegation TCS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests that SubmitMessageRequest produces both CellsCreated and
ExecutionCompleted responses via RegisterCallback. This is the exact
pattern used by the delegation tool — without the second response,
the parent thread hangs forever.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ration

Root cause: RegisterCallback removes callbacks after first invocation.
The CellsCreated response consumed the callback, leaving nothing for
ExecutionCompleted → parent thread hung forever.

Fix:
- HandleSubmitMessage registers CompletionCallbacks[threadPath] closure
  that posts ResponseFor(originalDelivery) on the thread hub
- NotifyParentCompletion invokes the callback to send ExecutionCompleted
- Delegation tool re-registers callback after CellsCreated response

DelegationCompletionTest verifies both responses arrive via RegisterCallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
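The one-shot callback behavior at the root of this bug can be modeled with a minimal registry (illustrative, not the real MeshWeaver `RegisterCallback` API):

```csharp
using System;
using System.Collections.Generic;

// Callbacks are one-shot: removed on first delivery. This is why the
// delegation tool must re-register after the CellsCreated response in
// order to also receive ExecutionCompleted.
public sealed class CallbackRegistry
{
    private readonly Dictionary<string, Action<string>> callbacks = new();

    public void Register(string threadPath, Action<string> callback) =>
        callbacks[threadPath] = callback;

    public void Deliver(string threadPath, string status)
    {
        if (callbacks.Remove(threadPath, out var cb)) // consumed on first use
            cb(status);
    }
}
```

Without the re-registration step, the second `Deliver` is silently dropped, which is the modeled equivalent of the parent thread hanging forever on `ExecutionCompleted`.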
StreamingView now shows:
- Thread title as clickable link
- Executing message's Overview (bubble with text + tool calls)

For executing delegations in the bubble, embeds the sub-thread's
StreamingView (bounded by delegation depth guard, max 2 levels).
For completed delegations, shows expandable details.

No infinite recursion: StreamingView → Overview → Streaming is bounded
by the max delegation depth (2), not by rendering depth.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
StreamingView: if thread has executing cell, return its default area.
Otherwise null. No title, no wrapping — simple passthrough.

StreamingCompact delegation rendering:
- Running (Result==null): show name + embed sub-thread's Streaming area
- Completed (Result!=null): show title with link (checkmark)

Recursion is bounded by delegation depth guard (max 2 levels):
StreamingView → Overview → StreamingCompact → sub-thread Streaming →
sub-thread Overview → done (no further delegation at max depth).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…alls

When recovering a stale executing thread after restart, delegation
tool calls now check their sub-thread's status:
- Sub-thread completed (IsExecuting=false): mark as done
- Sub-thread still running: mark as cancelled (parent can't re-subscribe)
- Non-delegation: mark as cancelled

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…de API

- AzureClaudeChatClient: handle DataContent as base64 document/image
  blocks in the Claude API format
- AgentChatClient: detect content: prefix with binary extensions (.pdf,
  .png, .jpg, etc.), load via IContentService as Stream, create
  DataContent and include in ChatMessage.Contents
- Path resolution: local (content:file.pdf → context path) or absolute
  (@OrgA/Doc/content:file.pdf)
- ChatMessage supports mixed content: TextContent + DataContent
- Tests for serialization, path parsing, and content type detection

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Foundation for document format conversion pipeline:
- IContentTransformer interface for binary-to-markdown conversion
- ContentCollection.GetContentAsTextAsync uses registered transformers
- DocSharp.Markdown package added for docx → markdown conversion

Next: restore ContentPlugin with xlsx/docx/pdf readers, wire into
content browser and agent attachments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
rbuergi and others added 28 commits April 7, 2026 00:02
…space streams

History was loaded via workspace.GetRemoteStream().Current, which returns
null in Orleans (workspace streams don't propagate). Now the history is
queried from IMeshService directly (reading from persistence), which is reliable.
This fixes agents losing context between messages and not knowing what
was discussed or what nodes to update.

Also reduced streaming throttle from 1s to 3s to prevent grain scheduler
overload (Orleans messages were expiring before delivery).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ompts

- Load conversation history reactively: CombineLatest on remote streams
  with 10s timeout and per-thread cache for re-submissions
- Reduce streaming throttle from 1s to 3s to prevent grain scheduler overload
- Worker prompt: mandatory read→adapt→write workflow, max 3 Gets, must Patch
- Orchestrator: prescriptive Worker delegation ("Get X, change Y, Patch it")

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Single DB query (namespace:{threadPath} nodeType:ThreadMessage) gets all
messages reliably. Remote streams to child grains never connected in time.
Results ordered by Thread.Messages list. Cached for re-submissions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Optimistic rendering: pending cells render instantly (no skeletons)
- Thread creation: BuildThreadWithMessages + AutoExecutePendingMessage
- History: GetDataRequest to each ThreadMessage node via CombineLatest
- Agent reuse: AgentCache per thread path
- Proper ChatMessage list passed to GetStreamingResponseAsync
- Retry with error on API 500, ReduceToMeshNode fallback
- Resubmit: click handler creates output cell, server skips creation
- Worker/Orchestrator prompts: mandatory write-back workflow
- Orleans test: ColdStart_AgentSeesAllPreviousMessages (FAILING - repro)
  Response never routes back from grain to test client
- Monolith test: ThreeMessages_AgentSeesFullHistory (PASSING)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- XUnitLogger: fall back to XUnitFileOutputRegistry.GetAnyActiveOutputHelper()
  when testOutputHelperAccessor.OutputHelper is null (silo-side logging)
- SharedOrleansFixture: register test client on SILO's routing service via
  reflection (InProcessSiloHandle.SiloHost.Services). Without this, response
  routing tried to activate a grain for the client address → failed.
- AccessContextGrainCallFilter: swallow NullRef for Orleans internal Stop/Close
- Result: silo logs now visible, SubmitMessageRequest routes + responds,
  history loads 5/5 messages, agent receives 6 messages. Test times out
  on QueryAsync polling (persistence flush delay) — execution itself works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- SharedOrleansFixture: register client on silo's IRoutingService via
  reflection (siloHost.Services). Fixes response routing back to client.
- OrleansChatHistoryTest: use completion callback instead of QueryAsync polling
- XUnitLogger: fall back to static GetAnyActiveOutputHelper() for silo logs
- Test passes: 5 history messages assembled, 6 sent to agent, completes <1s

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… history)

ColdStart_AgentSeesAllPreviousMessages now correctly fails with:
Expected "I received 1 messages" to contain "6 messages"

Root cause: AgentChatClient.BuildMessageWithContextAsync() merges all
ChatMessage objects into a single text prompt. The 6 messages
(4 history + 1 input cell + 1 new user) become 1 flattened string.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Call agent.ChatClient.GetStreamingResponseAsync(allMessages) directly,
bypassing AgentChatClient.BuildMessageWithContextAsync which merged all
6 ChatMessage objects into 1 text blob. The ChatClientAgent already has
system prompt in its instructions; FunctionInvokingChatClient handles tools.

- Add AgentChatClient.GetAgent() to expose the ChatClientAgent
- ThreadExecution: call agent.ChatClient directly with full message list
- Orleans test: GREEN (agent sees 6 messages: system + 4 history + 1 new)
- Monolith tests: GREEN (2→4→6 messages across 3 turns)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- AgentChatClient: prepend agent.Instructions as system message, pass all
  messages as separate turns to agent.ChatClient (includes FunctionInvokingChatClient)
- ThreadExecution: history = all messages EXCEPT last 2 (current input + output)
- All tests GREEN: Orleans ColdStart + monolith ThreeMessages + TwoMessages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assistant messages now include tool call details (name, args, truncated
result) so the agent knows what it did in previous turns. Without this,
the agent lost context about data it read or actions it took.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Check hub.RunLevel before creating streams. Catch ObjectDisposedException
if hub disposes between the check and stream creation (race during F5/nav).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
BuildThreadNode/BuildThreadWithMessages added /_Thread/ partition
unconditionally. For delegations, contextPath already contains /_Thread/
(it's inside a thread). This created paths like:
  User/rbuergi/_Thread/thread-id/msg-id/_Thread/sub-thread-id
Now detects existing _Thread in path and skips the partition.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ThreadExecution: state-driven execution via WatchForExecution (watches
  workspace stream for IsExecuting=true, replaces command-driven flow)
- HandleSubmitMessage: thin state updater only (no cell creation, no
  execution start)
- AgentChatClient: pass tools via ChatOptions from FunctionInvokingChatClient
  .AdditionalTools (was null → Claude never saw tool definitions)
- GUI: remove pendingCells, render LayoutAreaView for every message from
  the start. New thread flow: create thread → verify → create cells →
  verify → submit → navigate on response
- WatchForExecution: idempotent cell creation (handles both GUI-initiated
  and delegation sub-threads)
- MarkdownView: create kernel node before posting code submissions (fixes
  Orleans grain activation for interactive markdown)
- HandleSubmitMessage: deduplicate Messages with Contains check
- Fix $type serialization: PendingAttachments → ToImmutableList()
- Fix ThreadsCatalog test for nested _Thread path changes
- Update all thread tests to create cells before SubmitMessageRequest

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- HandleSubmitMessage: don't set PendingUserMessage (reserved for delegations)
- WatchForExecution: PendingUserMessage=null → StartExecution() directly (no
  slow meshService.CreateNode for existing cells)
- ExecuteMessageAsync: include current user cell in history (count-1 instead
  of count-2), don't add UserMessageText separately when history has it

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ReadOnlyListConverter now always uses JsonDocument per-element parsing
instead of Deserialize<T[]>. Old data in PostgreSQL may have $type not
as the first property, which crashes the array deserializer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
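The per-element approach described above can be sketched like this (a simplified stand-in for the converter, not the actual `ReadOnlyListConverter` implementation):

```csharp
using System.Collections.Generic;
using System.Text.Json;

static class ResilientListParser
{
    // Parse the array with JsonDocument and deserialize each element from
    // its own raw text. Elements whose "$type" discriminator is not the
    // first property then no longer crash the bulk array deserializer.
    public static List<T> DeserializePerElement<T>(
        string json, JsonSerializerOptions? options = null)
    {
        var result = new List<T>();
        using var doc = JsonDocument.Parse(json);
        foreach (var element in doc.RootElement.EnumerateArray())
            result.Add(JsonSerializer.Deserialize<T>(element.GetRawText(), options)!);
        return result;
    }
}
```

The trade-off is an extra parse per element, accepted here for compatibility with old PostgreSQL data whose property ordering cannot be controlled.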
On hub init, both RecoverStaleExecutingThread and WatchForExecution see
IsExecuting=true. Recovery clears it, but WatchForExecution already
captured the stale state and starts a doomed execution. Fix: skip the
first stream emission — recovery handles stale state, WatchForExecution
only reacts to new state changes from HandleSubmitMessage.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tivation

Portal and client hubs are not grains — they register as memory stream
subscribers in OrleansRoutingService.RegisterStreamAsync. The RoutingGrain
now publishes to the stream for portal/ and client/ addresses instead of
trying grain activation (which fails with "node not found").

This fixes cross-silo response routing: portal on silo A receives
responses from grains on silo B via the Orleans memory stream.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Skip-first-emission broke delegation sub-threads (their first emission
IS the legitimate trigger). Instead, check ExecutionStartedAt: if older
than 2 minutes, it's stale (recovery handles it). Fresh executions
(delegations, HandleSubmitMessage) proceed normally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
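The shared staleness rule can be sketched as a single predicate used by both code paths (illustrative names, assuming the two-minute threshold stated above):

```csharp
using System;

static class ExecutionStaleness
{
    // One threshold for both recovery and WatchForExecution: an execution
    // is stale only when ExecutionStartedAt is older than two minutes.
    public static readonly TimeSpan StaleAfter = TimeSpan.FromMinutes(2);

    public static bool IsStale(DateTimeOffset startedAt, DateTimeOffset now) =>
        now - startedAt > StaleAfter;
}
```

Centralizing the check avoids the earlier race where recovery and the watcher disagreed about whether a fresh delegation sub-thread was stale.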
RecoverStaleExecutingThread was killing fresh delegation sub-threads by
clearing IsExecuting before WatchForExecution could trigger. Now both
recovery and watch use the same 2-minute age threshold on ExecutionStartedAt.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The delegation tool now creates both ThreadMessage cells BEFORE creating
the thread node. Previously only the thread was created, relying on
WatchForExecution to create cells — which was unreliable (race with
recovery, cross-silo timing). Now cells exist when the grain activates.

Added info-level logging around delegation cell/thread creation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verifies the exact delegation flow: create user cell → create response
cell → create thread with IsExecuting=true → WatchForExecution triggers
→ execution completes → response cell has agent text.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HandleSubmitMessage now starts execution directly after updating state,
avoiding the unreliable meshService.CreateNode round-trip inside the
WatchForExecution subscription (which hangs due to reentrancy).

- GUI flow (client provides cell IDs): respond + start immediately
- Server flow (no IDs): create cells fire-and-forget, then start;
  respond with error if cell creation fails
- WatchForExecution: only handles BuildThreadWithMessages auto-execute
  (Take(1) on hub startup, delegation flow)

Fixes 12 CI test failures (10 threading + 2 security).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Thread message cells must have MainNode = the thread's content node
(e.g., "PartnerRe/AIConsulting"), not the thread path. This is required
for SatelliteAccessRule to delegate read permissions correctly.

- HandleSubmitMessage: read MainNode from thread workspace node
- WatchForExecution: read MainNode from stream node
- ChatClientAgentFactory delegation: use execCtx.ContextPath
- ThreadChatView (existing threads): don't fall back to threadPath
- SetThreadHubIdentity: set hub access context from thread.CreatedBy

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Thread message nodes created from the UI had MainNode set to the thread
path (e.g., "Org/_Thread/thread-id") instead of the content node ("Org").
This caused "Access denied" when SatelliteAccessRule delegated read
permissions to MainNode.

Fix: UPDATE main_node = split_part(main_node, '/_Thread/', 1) for all
ThreadMessage nodes where main_node contains /_Thread/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
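A C# equivalent of the migration's `split_part(main_node, '/_Thread/', 1)` might look like this (hypothetical helper, shown only to illustrate the transformation):

```csharp
using System;

static class MainNodeFix
{
    // Keep only the content-node prefix of main_node, dropping everything
    // from the first "/_Thread/" onward (mirrors split_part(..., 1)).
    public static string Normalize(string mainNode)
    {
        int i = mainNode.IndexOf("/_Thread/", StringComparison.Ordinal);
        return i < 0 ? mainNode : mainNode[..i];
    }
}
```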
When a silo goes down, grain delivery throws OrleansMessageRejectionException.
Without the stream fallback, this returned delivery.Failed() which caused
cascading DeliveryFailureExceptions across all UI clients. Restoring the
fallback to Orleans memory stream ensures graceful degradation during
silo restarts instead of hard crashes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@rbuergi rbuergi merged commit 035cdda into main Apr 9, 2026
2 checks passed
@rbuergi rbuergi deleted the fix/sub-thread-routing-and-apitoken-access branch April 9, 2026 10:36
