
v2026.06.03#366

Merged
duguwanglong merged 100 commits into main from dev on Jun 3, 2026

@stephamie7
Contributor

No description provided.

John Yin and others added 30 commits May 22, 2026 18:01
Co-authored-by: Cursor <cursoragent@cursor.com>
Clone a workflow-local provider instead of mutating the shared instance
with locks and event-loop markers, preventing cross-loop client reuse
and session config races during workflow llm.ask() calls.
…gent (#315)

* fix(chat): stabilize upload paths and dedupe document attachments

Overwrite duplicate chat uploads instead of auto-renaming so workspace
paths stay consistent. Dedupe composer document attachments by path,
reposition the user avatar in SessionChat, and enable rex_junior delegation.

* feat(agent): consolidate planning into Prometheus subagent

Replace metis and momus with prometheus for interview-style planning and
verified plan output under .flocks/plans/. Route /plan and delegate_task
session permissions through the new agent, preserve YAML permission rules
when resolving tool lists, and show structured todowrite summaries in SessionChat.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(tool): relocate task and skill_load to logical modules

Move task to tool/agent and skill_load to tool/skill, add an enabled flag
to register_function, clarify flocks_skills vs skill_load tool guidance,
limit browser setup to one retry, and update prometheus planning description.
Ensure the uploaded document attachment type guard preserves the generic item shape so listUploadedDocumentPaths narrows workspacePath to string during filtering.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ument-guard

fix(webui): narrow uploaded document attachment type
Align security product skills with browser-use cdp-direct, rename
skyeye-sensor-data-fetch to skyeye-sensor-use, and polish browser doctor,
provider credential delete response, and edit tool mismatch hints
#319)

* fix(storage): prevent and recover from SQLite "file is not a database"

Root-cause fixes for the recurring `sqlite3.DatabaseError: file is not a
database` crash that brought down server startup and disabled session
features.

Application-level corruption vectors closed:

* No code path ever ran `PRAGMA wal_checkpoint`. The WAL grew up to the
  default 1000-page (~4 MB) threshold, so every kill -9 / power loss left
  a non-trivial WAL that had to be replayed on the next start - and
  replay rewrites main-DB page 1 (the header). `Storage.shutdown()` now
  runs `wal_checkpoint(TRUNCATE)` at the very end of the FastAPI
  lifespan, and `Storage.init()` does the same on startup to drain any
  residual WAL left by an earlier crash before a second one can land
  during recovery.
* Auto-checkpoint threshold lowered to 200 pages (~800 KB) via
  `PRAGMA wal_autocheckpoint=200`, shrinking the un-persisted window 5x.
* `synchronous=NORMAL` is now set explicitly so the WAL durability
  contract cannot drift to `OFF` via a stray pragma.
* Long-lived SQLite connections in `Storage`, `TaskStore`, and
  `session_binding` now record their owning PID and rebuild after a
  detected `fork()` (uvicorn --reload / multi-worker). Sharing a
  connection across `fork()` is a documented SQLite corruption vector.
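
The pragmas and the PID check listed above can be sketched with the standard-library `sqlite3` module. This is a minimal illustration, not the project's `Storage` class: `ForkAwareConnection` and its method names are hypothetical, and what to do with the handle inherited from the parent process is left out of scope.

```python
import os
import sqlite3
from typing import Optional, Tuple


class ForkAwareConnection:
    """Hypothetical wrapper: reopens the connection when the owning PID changes."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._conn: Optional[sqlite3.Connection] = None
        self._owner_pid: Optional[int] = None

    def get(self) -> sqlite3.Connection:
        # A connection opened before fork() must not be reused in the child,
        # so rebuild whenever the recorded PID differs from the current one.
        if self._conn is None or self._owner_pid != os.getpid():
            self._conn = self._open()
            self._owner_pid = os.getpid()
        return self._conn

    def _open(self) -> sqlite3.Connection:
        conn = sqlite3.connect(self._path)
        conn.execute("PRAGMA journal_mode=WAL")
        conn.execute("PRAGMA synchronous=NORMAL")      # set explicitly, never OFF
        conn.execute("PRAGMA wal_autocheckpoint=200")  # ~800 KB with 4 KB pages
        return conn

    def checkpoint_truncate(self) -> Tuple[int, int, int]:
        # The pragma returns one row: (busy, log_pages, checkpointed_pages).
        return self.get().execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
```

Calling `checkpoint_truncate()` once at startup and once at shutdown mirrors the "drain the WAL at both ends" idea described in the commit message.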

Safety net for irreducible external causes (power loss, NFS, AV,
disk-full):

* Pre-flight SQLite magic-header probe before opening so that a corrupt
  file is quarantined *before* `aiosqlite` can delete its `-wal`/`-shm`
  sidecars - the offline `scripts/recover_raw_flocks_db.py` needs them.
* If `Storage.init()` still trips a `NOTADB`/`SQLITE_CORRUPT` /
  "database disk image is malformed" error, the main DB and its sidecars
  are renamed to `<name>.corrupt.<UTC-timestamp>` and a fresh empty DB
  is created so the server can keep booting; a loud log explains how to
  run the recovery script offline.
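
A rough sketch of the pre-flight probe and quarantine step, using only the standard library. The 16-byte magic string is part of the SQLite file format; the function names, the timestamp format, and the way sidecars are renamed here are assumptions made for illustration rather than the code in this PR.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

SQLITE_MAGIC = b"SQLite format 3\x00"  # first 16 bytes of a SQLite database file


def has_invalid_sqlite_header(path: Path) -> bool:
    """Return True only for a non-empty file that lacks the SQLite magic."""
    if not path.exists() or path.stat().st_size == 0:
        return False  # a missing or empty file is a valid "new database" state
    with path.open("rb") as fh:
        return fh.read(len(SQLITE_MAGIC)) != SQLITE_MAGIC


def quarantine(path: Path) -> Optional[Path]:
    """Rename the DB and its -wal/-shm sidecars; return None if a rename fails."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = Path(f"{path}.corrupt.{stamp}")
    try:
        for suffix in ("", "-wal", "-shm"):
            src = Path(f"{path}{suffix}")
            if src.exists():
                src.rename(f"{target}{suffix}")
    except OSError:
        return None
    return target
```

Because the probe only reads raw bytes, it never asks SQLite to open the file, so the sidecars stay on disk for offline recovery.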

Tests added/updated:

* corruption quarantine (fast path via magic header, slow path via
  `PRAGMA` failure),
* `_is_db_corruption_error` and `_file_has_invalid_sqlite_header`
  classifiers,
* shutdown TRUNCATE, startup TRUNCATE of residual WAL,
* fork-detection re-init,
* `synchronous=NORMAL` contract assertion on both async and sync paths.

Regression: 346 tests across `tests/storage/`, `tests/server/concurrency/`,
`tests/task/`, `tests/integration/test_task_queue_integration.py`, and
`tests/channel/` all pass.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(storage): surface wal_checkpoint busy + abort init on quarantine failure

Addresses two review findings on the previous commit:

1. **HIGH — `Storage._checkpoint` ignored the PRAGMA result row.** SQLite
   reports contention via the return row ``(busy, log_pages,
   checkpointed_pages)`` rather than a SQL exception, so a TRUNCATE
   blocked by an active reader/writer would return ``(1, n, 0)`` and our
   code would still log ``storage.shutdown.checkpoint.done`` even though
   the WAL was not actually drained — defeating the core goal of "next
   startup must not need WAL recovery".

   `_checkpoint` now fetches that row, returns the full tuple, and
   raises the new ``CheckpointBusyError`` when ``busy=1``.
   ``Storage.shutdown()`` retries TRUNCATE a few times with a short
   backoff, then logs a structured ``checkpoint.unfinished`` warning
   (never ``done``) when the WAL is still occupied — so operators can
   spot the residual-WAL risk in logs.

2. **MEDIUM — fast-path quarantine return value was ignored.** When the
   magic-header pre-flight detects a non-SQLite file, the recovery flow
   depends on quarantining the file *before* SQLite touches its WAL/SHM
   sidecars.  The previous code ignored a possible ``None`` (rename
   failure) and continued to ``_bootstrap_schema``, which would let
   SQLite open the bad file and delete the very sidecars we wanted to
   preserve.

   `Storage.init()` now raises ``StorageError`` when the fast-path
   quarantine fails, so the operator can move the file aside manually
   instead of losing recovery data.
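
The first finding can be illustrated as follows. This is a sketch under assumptions: `CheckpointBusyError` is named in the commit message, but the helper signatures, retry count, and backoff values below are hypothetical.

```python
import sqlite3
import time
from typing import Tuple


class CheckpointBusyError(RuntimeError):
    """Raised when SQLite reports busy=1 in the wal_checkpoint result row."""


def checkpoint(conn, mode: str = "TRUNCATE") -> Tuple[int, int, int]:
    # Contention is reported in the returned row, not as an exception,
    # so the row has to be fetched and inspected.
    busy, log_pages, checkpointed = conn.execute(
        f"PRAGMA wal_checkpoint({mode})"
    ).fetchone()
    if busy:
        raise CheckpointBusyError(
            f"checkpoint blocked: log={log_pages} checkpointed={checkpointed}"
        )
    return busy, log_pages, checkpointed


def checkpoint_with_retry(conn, attempts: int = 3, delay: float = 0.05) -> bool:
    """Return True when the WAL was drained, False when every attempt was busy."""
    for attempt in range(attempts):
        try:
            checkpoint(conn)
            return True
        except CheckpointBusyError:
            time.sleep(delay * (attempt + 1))  # short linear backoff
    return False
```

A caller would log a "done" event only when `checkpoint_with_retry` returns `True`, and an "unfinished" warning otherwise.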

Tests:
* ``test_storage_checkpoint_raises_when_sqlite_reports_busy`` — holds an
  active reader transaction across a TRUNCATE checkpoint and asserts
  ``CheckpointBusyError`` is raised (exposes the original silent
  failure mode).
* ``test_storage_shutdown_reports_unfinished_on_persistent_busy`` —
  spies on the logger to confirm shutdown logs ``unfinished`` (not
  ``done``) when every retry is busy, while still clearing
  ``_initialized`` because the process is exiting anyway.
* ``test_storage_init_raises_when_quarantine_fails_on_invalid_header`` —
  patches the quarantine to return ``None`` and verifies init aborts
  loudly and leaves the WAL/SHM sidecars untouched.

Regression: 349 tests across `tests/storage/`,
`tests/server/concurrency/`, `tests/task/`,
`tests/integration/test_task_queue_integration.py`, and `tests/channel/`
all pass.
Normalize stdio/local MCP configs so legacy ``env`` maps to the canonical
``environment`` field at load time and when connecting servers.
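
As a small illustration of the normalization, assuming the config is a plain dict and that the canonical `environment` value wins when both keys are present (the precedence is a guess, and `normalize_mcp_config` is a hypothetical name):

```python
from typing import Any, Dict


def normalize_mcp_config(cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Map the legacy ``env`` key onto ``environment`` without mutating the input."""
    out = dict(cfg)
    if "env" in out:
        legacy = out.pop("env") or {}
        canonical = out.get("environment") or {}
        # Assumption: entries already under ``environment`` take precedence.
        out["environment"] = {**legacy, **canonical}
    return out
```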
…runing and bug fixes

Rewrites the context compaction pipeline to be faster, more accurate, and
available in channel (IM) sessions. Key changes:

Core algorithm (aligning with hermes-agent approach):
- Replace tiktoken with chars/4 estimation; remove system-prompt and
  tool-schema overhead fields from CompactionPolicy
- Fix overflow threshold to a fixed 85% × context_window
- Switch to single-pass LLM summarisation (drop chunked/iterative paths)
- Add hermes-style pre-pruning before summarisation: MD5 dedup of
  identical content, semantic one-line compression of large old messages
  (>200 chars), token-budget tail protection (20% of overflow threshold),
  and stripping of multimodal content with text placeholders
- Add per-message content truncation (head 4000 + tail 1500 chars) before
  feeding messages to the summariser
- Implement error-type-based cooldown for summary failures (60 s for auth
  errors, 30 s for rate-limit, 10 s for transient errors)
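
The numeric rules in the list above can be written down as a few small helpers. The constants come from the commit message; the function names, the placeholder text inserted on truncation, and the cooldown keys are assumptions for illustration only.

```python
def estimate_tokens(text: str) -> int:
    """chars/4 estimate used in place of a tokenizer."""
    return len(text) // 4


def overflow_threshold(context_window: int, percent: int = 85) -> int:
    # Integer arithmetic avoids float rounding surprises.
    return context_window * percent // 100


def tail_budget(context_window: int) -> int:
    """Token budget protected at the tail: 20% of the overflow threshold."""
    return overflow_threshold(context_window) * 20 // 100


def truncate_for_summary(text: str, head: int = 4000, tail: int = 1500) -> str:
    """Keep the first `head` and last `tail` characters of an oversized message."""
    if len(text) <= head + tail:
        return text
    omitted = len(text) - head - tail
    return f"{text[:head]}\n[... {omitted} chars omitted ...]\n{text[-tail:]}"


# Cooldown in seconds per failure class (key names are hypothetical).
SUMMARY_COOLDOWN_SECONDS = {"auth": 60, "rate_limit": 30, "transient": 10}
```

For a 200,000-token context window this gives an overflow threshold of 170,000 tokens and a protected tail of 34,000 tokens.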

Bug fixes:
- Fix overflow detection never using provider-reported token counts:
  last_finished.tokens is a Pydantic TokenUsage model, not a dict, so the
  isinstance(…, dict) guard always fell through to the chars/4 estimate.
  Now normalises TokenUsage → dict via model_dump() before comparison.
- Fix target_chars conversion factor from ×2 to ×4 (chars/4 ↔ tokens)
  in both compaction.py and memory/flush.py
- Return "continue" (skip archive) instead of falling back to a stub
  summary when the summariser is in cooldown, preventing history loss
- Pass original (un-pruned) messages to memory flush to preserve
  full content density for memory extraction

Channel support:
- Expose /compact command in channel surfaces (visible_surfaces + channel_safe)
- Add InboundDispatcher._handle_compact_command: resolves model via
  SessionLoop._resolve_model, runs run_compaction, delivers status reply;
  supports /compact <focus> to bias what the summariser preserves

UI:
- Align the compaction progress indicator with regular assistant message
  bubbles (avatar + "Rex" header row) instead of a bare amber box

Observability:
- Promote loop.tokens_decision from debug to info level so channel-session
  token decisions appear in production logs
- Differentiate "provider temporarily unavailable" from "context genuinely
  too large" in the overflow-exhausted user-facing error message

Tests:
- Remove test_compaction_chunked_strategy.py (chunked path retired)
- Update test_compaction_policy, test_prompt_tokens for new thresholds
- Add overflow_ratio override tests (5 new cases)

Co-authored-by: Cursor <cursoragent@cursor.com>
…ention messages

WeCom (and other IM platforms) prefix the bot's display name to group
messages before delivering them, so "/compact" arrives as "- test /compact"
rather than starting with "/".  This caused two separate failures:

1. _parse_slash_command used a strict startswith("/") check and returned
   (None, ""), bypassing the slash-command path entirely and letting the
   message fall through to the LLM.

2. Even after adding a regex fallback that detected the command, the
   UserInputEvent was constructed with the raw "- test /compact" text.
   dispatch_user_input then re-parsed event.text with the same strict
   parser (input/dispatcher.py parse_slash_command), got None, and
   called sink.run_llm() — producing the confusing error
   "The command `- test /compact` cannot currently be executed as a slash command in this channel."

Fix:
- Extend _parse_slash_command with a channel fallback: scan for the last
  "/<word> [args]" token in the text and accept it only when <word>
  resolves in the command registry, preventing false positives on paths
  like "/tmp/foo.log".  Emit a log line (dispatcher.slash_command.
  fallback_matched) when the fallback activates.
- After parsing, normalise event.text to the canonical "/cmd [args]" form
  before constructing UserInputEvent, so the strict re-parse inside
  dispatch_user_input always succeeds.  The original raw text is
  preserved in event.display_text and event.metadata["original_text"]
  for error messages and audit logging.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…tify

fix(skill): reduce watcher inotify usage
Hard-coded constants in read.py (MAX_LINES, MAX_BYTES, MAX_LINE_LENGTH)
are now overridable through a new `toolOutput` section in flocks.json,
mirroring hermes-agent's tool_output_limits design.

- Add ToolOutputConfig model and ConfigInfo.toolOutput field in config.py
- Add flocks/tool/tool_output_limits.py with sync cache-first config read
- Update read.py to resolve limits at call time via tool_output_limits
- Add toolOutput defaults to .flocks/flocks.json.example

Co-authored-by: Cursor <cursoragent@cursor.com>
The device connectivity probe (POST /devices/{id}/test) previously
required a ``base_url`` field on every device. Providers like Sangfor
SIP store ``host`` + ``port`` instead (e.g. ``192.168.1.100`` + ``7443``)
and never set ``base_url``, so every test attempt returned the misleading
error "Device address (base_url) is not configured; please fill it in first" even when the IP was
correctly filled in.

Backend (server/routes/device.py):
  * When ``base_url`` is empty, fall back to ``host`` + ``port`` from
    the resolved credentials to build ``https://{host}:{port}``.
  * Respect an already-present scheme on ``host`` (e.g. operator typed
    ``http://10.1.2.3``) instead of double-prefixing into
    ``https://http://...``.
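
The fallback described above amounts to a small URL-building rule. A minimal sketch, assuming a hypothetical helper named `build_probe_url` (the real route code may be structured differently):

```python
from typing import Optional, Union


def build_probe_url(
    base_url: Optional[str],
    host: Optional[str],
    port: Union[int, str, None],
) -> Optional[str]:
    """Prefer base_url; otherwise derive a URL from host and optional port."""
    if base_url and base_url.strip():
        return base_url.strip()
    host = (host or "").strip()
    if not host:
        return None  # caller reports "address not configured"
    # Respect a scheme the operator typed into the host field.
    prefix = host if "://" in host else f"https://{host}"
    return f"{prefix}:{port}" if port else prefix
```

With the example values from the commit message, `build_probe_url(None, "192.168.1.100", 7443)` yields `https://192.168.1.100:7443`, and a host of `http://10.1.2.3` is passed through without a second scheme.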

Frontend (DeviceIntegration/index.tsx):
  * The probe button now applies the same fallback against the form's
    current values so unsaved edits can be tested before clicking save.

Co-authored-by: Cursor <cursoragent@cursor.com>
``ToolRegistry._sync_api_service_states`` used to be a one-way switch:
when an API service was disabled it forced every tool of that provider
to ``enabled=False``, but when the service later became enabled again
the tools stayed off forever. The visible symptom: deleting the last
device of a given provider and then re-adding it left every related
tool greyed out — the only recovery was to toggle each tool by hand.

Changes:
  * tool/registry.py: make ``_sync_api_service_states`` bi-directional.
    When a service flips to enabled, restore each owned tool to its
    factory default captured at register time (``_enabled_defaults``).
    Already-enabled tools are left untouched to avoid spurious writes.
  * tool/device/sync.py: call ``_apply_tool_settings`` after the sync
    so user-level overrides (``tool_settings[<name>]``) are re-applied
    on top of the restored defaults. This keeps tools the user
    explicitly disabled off, even after the bounce-back path runs.
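
The two-step contract (restore factory defaults, then re-apply user overrides) can be modelled in a few lines. The `Tool` dataclass, the dict shapes, and both function names below are simplified stand-ins, not the registry's real types.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Tool:
    name: str
    provider: str
    enabled: bool


def sync_service_states(
    tools: List[Tool],
    service_enabled: Dict[str, bool],
    enabled_defaults: Dict[str, bool],
) -> None:
    for tool in tools:
        if not service_enabled.get(tool.provider, True):
            tool.enabled = False  # service off: force every owned tool off
        elif not tool.enabled:
            # Service on again: restore the default captured at register time.
            tool.enabled = enabled_defaults.get(tool.name, True)
        # Already-enabled tools are left untouched.


def apply_tool_settings(tools: List[Tool], tool_settings: Dict[str, dict]) -> None:
    """Re-apply explicit user overrides on top of the restored defaults."""
    for tool in tools:
        override = tool_settings.get(tool.name, {})
        if "enabled" in override:
            tool.enabled = bool(override["enabled"])
```

Calling `sync_service_states` without the follow-up `apply_tool_settings` would turn a user-disabled tool back on, which is why the two calls are paired.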
Co-authored-by: Cursor <cursoragent@cursor.com>
``PUT /api/mcp/{name}`` decided whether to reconnect the server based
solely on ``was_connected``. When the user installed a catalog entry
with ``enabled=false`` and then later flipped it to ``enabled=true``
via this endpoint, the runtime status dict had no entry for the server
at all, so ``was_connected`` was False and the reconnect step was
skipped. The result: the server's tools never registered into
``ToolRegistry`` and stayed invisible until the next process restart.

The new ``should_reconnect`` condition reconnects in two cases:
  1. Was already connected (existing behaviour — config change).
  2. Was not present in the runtime status at all AND the new config
     has ``enabled != False`` AND ``get_connect_block_reason`` reports
     no credential/config gap. This covers the first-enable flow
     without surprising operators by auto-connecting servers that the
     handler intentionally left in a non-running state.

Co-authored-by: Cursor <cursoragent@cursor.com>
``device_startup._sync_all`` swept every storage_key it saw under
``api_services``, including pure-API integrations such as
``tdp_api_v3_3_10`` whose ``_provider.yaml`` declares
``integration_type: api``. Those services never have rows in the
``device_integrations`` table, so ``sync_service_tool_state`` always
counted zero enabled devices and flipped
``api_services[<sk>].enabled = false`` on every restart, silently
disabling the tools. Operators saw the tools come back when they
toggled them manually, only to disappear again on the next restart.

Fix: introduce ``_device_type_storage_keys()`` which scans descriptor
``_provider.yaml`` files once per call and returns the set of
``storage_key`` values whose ``integration_type`` is ``"device"``.
``_sync_all`` consults this set when sweeping the config so pure-API
services are no longer touched by the device subsystem at all.

Notes:
  * The scan is O(N) over discovered plugins and runs once per startup
    sync; a previous draft used O(N²) per-key lookups.
  * Broken or unparseable ``_provider.yaml`` files are tolerated — they
    are simply excluded from the device set rather than aborting the
    whole sync.
Co-authored-by: Cursor <cursoragent@cursor.com>
Locks in the contracts established by the preceding fixes so they
can't silently regress.

tests/tool/test_apply_tool_settings.py (4 new cases):
  * sync_restores_tool_when_service_re_enabled — the headline
    regression: a tool whose YAML default is True bounces back to True
    after its service flips disabled → enabled.
  * sync_does_not_resurrect_user_disabled_tool — a user's explicit
    disable in ``tool_settings`` wins over the bounce-back path.
  * sync_does_not_flip_factory_disabled_tool — tools whose YAML
    default is ``enabled: false`` stay off when the service is enabled;
    only an explicit overlay can open them.
  * sync_leaves_already_enabled_tool_alone — sync is a true no-op when
    tool and service are both already enabled.

tests/tool/test_device_startup_sync.py (new file, 5 cases):
  * TestDeviceTypeStorageKeys covers ``_device_type_storage_keys()``:
    empty plugins dir, device/api separation, YAML parse failures,
    and unknown ``integration_type`` values.
  * TestSyncAllScope.test_skips_pure_api_services_with_no_db_rows
    drives ``_sync_all`` end-to-end with stubbed Storage/ConfigWriter
    and asserts that ``sync_service_tool_state`` is invoked for
    ``integration_type=device`` services and never for
    ``integration_type=api`` services.

Co-authored-by: Cursor <cursoragent@cursor.com>
R1 — skipped_no_summary no longer masquerades as successful compaction
  process() now returns "skipped" (instead of "continue") for both
  anti-thrashing cooldown and summary-provider cooldown paths.
  session_loop only updates ctx.last_compaction_step and publishes the
  context.compacted event when the result is "continue" (real success);
  "skipped" is logged and the loop continues without touching cooldown state.

R2 — channel fallback slash parser tightened to bot-mention-only prefix
  The regex fallback in dispatcher._parse_slash_command now validates that
  the text before the matched /<command> is a bot-mention prefix of the
  form "- Name" or "@name".  Natural-language sentences such as
  "请解释一下 /help" ("please explain /help") or "prefix /new thanks" are rejected so they are
  handled by the LLM instead of being dispatched as slash commands.
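
One way to express the tightened fallback is a pair of anchored regular expressions. This is a sketch, not the dispatcher's actual code: the patterns, the `registry` set, and the choice to skip the registry check on the strict path are all assumptions.

```python
import re
from typing import Optional, Set, Tuple

# Strict form: the text itself starts with "/<command>".
_STRICT = re.compile(r"^/(\w+)(?:\s+(.*))?$", re.DOTALL)
# Fallback form: a single bot mention ("- Name" or "@name") and nothing else
# may precede the command.
_MENTIONED = re.compile(r"^(?:-\s*\S+|@\S+)\s+/(\w+)(?:\s+(.*))?$", re.DOTALL)


def parse_slash_command(text: str, registry: Set[str]) -> Tuple[Optional[str], str]:
    stripped = text.strip()
    match = _STRICT.match(stripped)
    if match:
        return match.group(1), (match.group(2) or "").strip()
    match = _MENTIONED.match(stripped)
    # The fallback is accepted only for commands known to the registry.
    if match and match.group(1) in registry:
        return match.group(1), (match.group(2) or "").strip()
    return None, ""


def canonical_text(name: str, args: str) -> str:
    """Rebuild the "/cmd [args]" form so a later strict re-parse succeeds."""
    return f"/{name} {args}".strip()
```

Natural-language sentences fail the fallback because their leading text is not a lone mention, and path-like input such as `/tmp/foo.log` fails because a command name must be followed by whitespace or the end of the text.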

R3 — ToolOutputConfig fields gain camelCase aliases
  Added alias="readMaxLines" / "readMaxBytes" / "readMaxLineLength" and
  populate_by_name=True so the flocks.json.example entries actually parse
  into the model fields instead of being silently discarded.

Co-authored-by: Cursor <cursoragent@cursor.com>
R1 follow-ups identified in self-review:
  - Update SessionCompaction.process return type to
    Literal["continue", "stop", "skipped"] to match the new third state.
  - Update run_compaction orchestrator return type and document the three
    states so callers know "skipped" must not be treated as success.
  - Channel /compact handler now delivers a distinct "this compaction round was skipped" text
    when the result is "skipped"; previously it reported "compaction complete"
    even when nothing was archived.
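
A caller-side sketch of the three-state contract. The `Literal` type matches the commit message; `LoopState`, `after_compaction`, and the assumption that "stop" ends the loop are illustrative guesses.

```python
from typing import List, Literal, Optional

CompactionResult = Literal["continue", "stop", "skipped"]


class LoopState:
    """Hypothetical stand-in for the session loop's context object."""

    def __init__(self) -> None:
        self.last_compaction_step: Optional[int] = None
        self.events: List[str] = []


def after_compaction(result: CompactionResult, step: int, state: LoopState) -> bool:
    """Record success only for "continue"; return whether the loop keeps going."""
    if result == "continue":
        state.last_compaction_step = step
        state.events.append("context.compacted")
    # "skipped" leaves the state untouched and the loop carries on;
    # treating "stop" as loop termination is an assumption.
    return result != "stop"
```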

Co-authored-by: Cursor <cursoragent@cursor.com>
R1 — anti-thrashing & summary cooldown returns "skipped"
  tests/session/test_compaction_skipped_return.py (4 tests)
    * cooldown_remaining > 0 returns "skipped" (not "continue")
    * total_skipped / total_attempts / cooldown_remaining counters move
    * skip branch must NOT invoke _archive_and_write_summary
    * summary_cooldown_until in the future returns "skipped" and
      archive is not called either

R2 — channel slash fallback bot-mention guard
  tests/channel/test_channel.py::TestParseSlashCommand (12 tests)
    * strict /command  + arg path still works
    * "- BotName /cmd" WeCom prefix accepted
    * "@botName /cmd"  Feishu  prefix accepted
    * Unicode (CJK) bot names accepted
    * 6 negative sentence cases rejected (Chinese, English, /tmp/foo.log,
      bare-word prefixes, multi-word leading text)
    * Unknown command rejected even with valid mention prefix
    * Empty / whitespace-only inputs rejected

R3 — ToolOutputConfig alias acceptance
  tests/config/test_tool_output_config.py (12 tests)
    * camelCase keys populate snake_case fields
    * snake_case keys also accepted (populate_by_name)
    * Partial overrides leave others as None
    * Zero / negative values rejected by gt=0 validator
    * ConfigInfo round-trip from toolOutput / tool_output keys
    * Runtime helpers fall back to defaults without cached config
    * Cached config overrides defaults
    * Sync flocks.json fallback path for CLI one-shot mode
    * Section loader swallows internal errors defensively

Co-authored-by: Cursor <cursoragent@cursor.com>
PR #323 review pointed out that ``should_reconnect`` only fired for the
"first enable with no runtime status" path. Once a server had been
touched in this process — even if the previous attempt ended in
``FAILED`` or ``DISCONNECTED`` — the route would silently skip the
reconnect, forcing users to click Connect by hand after fixing
credentials.

Simplify the condition to: reconnect whenever the new config requests
``enabled`` AND ``get_connect_block_reason`` reports no pending-
credentials issue. ``MCP.remove`` runs unconditionally beforehand
when a previous status existed, so we always start from a clean
runtime slot.
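
The simplified decision can be sketched as below. `Runtime` is a hypothetical stand-in for the MCP manager, and `block_reason` represents whatever `get_connect_block_reason` returned (assumed here to be `None` when nothing blocks the connection).

```python
from typing import List, Optional, Tuple


class Runtime:
    """Hypothetical recorder standing in for the MCP runtime manager."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, str]] = []

    def remove(self, name: str) -> None:
        self.calls.append(("remove", name))

    def connect(self, name: str, config: dict) -> None:
        self.calls.append(("connect", name))


def apply_update(
    runtime: Runtime,
    name: str,
    new_config: dict,
    had_status: bool,
    block_reason: Optional[str],
) -> bool:
    if had_status:
        runtime.remove(name)  # always start from a clean runtime slot
    should_reconnect = new_config.get("enabled") is not False and block_reason is None
    if should_reconnect:
        runtime.connect(name, new_config)
    return should_reconnect
```

The same code path then covers first enable, reconnect after a failure, and the pending-credentials case where no connection is attempted.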

Tests (tests/server/routes/test_mcp_routes.py, +3):
  * connects_on_first_enable_without_prior_status — no runtime entry,
    a local server flips enabled=true → MCP.connect is invoked.
  * reconnects_after_previous_failure — runtime status was FAILED,
    user saved a corrected command → reconnect runs without an extra
    click and the old state is removed first.
  * skips_connect_when_credentials_blank — auth.value="" marks the
    config as pending credentials → connect must NOT run.

Co-authored-by: Cursor <cursoragent@cursor.com>
PR #323 review flagged two gaps in the new host+port path:

  1. Error text still only mentioned ``base_url``; users on host+port
     providers (Sangfor SIP) who left both blank were told to fill in
     a field that doesn't exist on their form.
  2. There were no route-level tests for the fallback logic.

Changes:
  * route_test_device: error message now says "Device address (base_url
    or host) is not configured; please fill it in first" so the prompt
    matches the actual provider fields.
  * sangfor_sip_v92/_provider.yaml: notes section explains that
    ``host`` defaults to https:// and tells operators how to force
    http:// by typing the scheme into the host field itself.
  * tests/server/routes/test_device_routes.py (new, 6 cases):
      - host+port → https://host:port
      - host only → https://host (no dangling colon)
      - host carries scheme (http://...) → no double prefix
      - body.base_url override beats persisted host
      - empty fields → error message mentions both base_url AND host,
        and the probe never runs
      - unknown device → 404
Co-authored-by: Cursor <cursoragent@cursor.com>
PR #323 review noted that ``_sync_api_service_states`` silently
overwrites ``tool.info.enabled`` with the factory default whenever a
service becomes enabled — any caller that fails to follow up with
``_apply_tool_settings`` would clobber the user overlay (tools the
user explicitly disabled would pop back on).

The two production call sites (plugin bootstrap and device sync)
already pair the calls; add the contract to the docstring so the
next contributor can't accidentally introduce a new call site that
drops the user overlay.

Co-authored-by: Cursor <cursoragent@cursor.com>
Before invoking any device-specific tool (tdp_*, onesec_*, onesig_*,
qingteng_*, skyeye_*, sangfor_xdr_*), the agent must first call
device_context to list all registered devices, match the user-supplied
device name to the correct device_id, and pass that id on every
subsequent tool call.

Without this step the agent could silently hit the wrong device when
multiple instances of the same product are configured (e.g. "TDP v4"
vs "TDP v6"), or omit device_id entirely when the parameter is marked
optional in the schema.

A dedicated "Device location (first step, must not be skipped)" section has been added at
the top of the API-mode guide in each affected skill, covering:
- mandatory device_context call before the first tool invocation
- name-based matching and device_id extraction
- ask-user-to-confirm when multiple devices match or none match
- sangfor-edr-use (browser-only) adapted to resolve the access URL
  from device_context instead of always prompting for it

Co-authored-by: Cursor <cursoragent@cursor.com>
…istence (#322)

* perf(webui): lazy-load modals and heavy routes, memoize sidebar nav

Shrink the initial bundle by code-splitting Layout modals and Session/Agent/auth pages, and stabilize sidebar navigation with useMemo to reduce route-switch re-renders.

* fix(session): persist message parts per message key

Write new sessions to message_parts:<session_id>:<message_id> so tool-call hot paths avoid rewriting the full session blob; keep legacy aggregated blob reads/writes for existing data. Align CLI import and add persistence tests.
chenjie-booker and others added 27 commits June 1, 2026 16:54
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* feat(workflow): add generic background poller with API and WebUI

Introduce WorkflowPollerManager for scheduled workflow runs, REST
endpoints for config/status/run-once, server startup integration, and
Integration tab controls with tests.

* fix(workflow): persist poller runs and improve stop lifecycle

Record poller executions via execution_store, normalize business-failure
outcomes, and let in-flight runs finish during stop instead of cancelling.

* fix(webui): dedupe uploaded document attachments by path

Export helpers to keep only the latest successful non-image upload per
workspace path when building attachment blocks and after batch upload.
* fix(model): require base URL for openai-compatible providers

Align the add-provider UI with backend validation by making Base URL required for openai-compatible provider creation, and surface localized provider descriptions in the catalog flow.

* fix(model): remove unintended provider description changes

Keep this branch focused on the requested Base URL validation fix by reverting the extra provider description code path changes that were added by mistake.
* feat(device): add 360 WAF v5.5 integration

* refactor(device): use built-in confirmations for 360 WAF
* feat(provider): add MiniMax M3 catalog support

Add MiniMax M3 to the built-in provider catalog and apply the same MiniMax runtime handling to any minimax model ID so newer variants work without extra per-model logic.

* fix(provider): align MiniMax M3 catalog tests

Keep the MiniMax catalog test expectations in sync with the configured 512k context and output limits so the provider suite reflects the intended catalog values.
…350)

* fix(session): clear history and surface Feishu websocket disconnects

Make /clear remove stored session messages across CLI and WebUI so the UI stays in sync. Fail fast when Feishu websocket workers disconnect so supervisors can restart them instead of hanging silently.

* fix(channel): keep Feishu websocket siblings running

Observe disconnects in both Feishu websocket client paths without treating a single account failure as a global stop. Clear queued prompts before deleting session history so /clear leaves the session fully reset.

* fix(channel): restart failed Feishu websocket accounts

Retry failed Feishu websocket accounts with backoff so one dropped connection can recover without interrupting healthy siblings. Cover the restart path with regression tests.
Under sustained syslog throughput, _trim_execution_history was called as
a fire-and-forget background task every 5 messages per workflow.  Each
invocation called Storage.list_entries("workflow_execution/") which
json.loads every execution record in the table into Python objects.
Execution records included full alert payloads (resp_body, req_header,
etc.) written by the dedup_and_write node, making each record several
hundred KB.  As the table grew to 3+ GB, a single trim scan materialised
gigabytes of transient objects — py-spy confirmed 100% GIL time in
json.raw_decode attributed to _trim_execution_history → list_entries.
Because multiple trim tasks could be in flight simultaneously (one spawned
per 5 messages, each taking tens of seconds to complete), the transient
allocations multiplied, driving RSS to 20 GB without bound.

Fix:
- Storage.list_raw(): new method that returns (key, raw_value_str) pairs
  without any Python-side JSON parsing, compatible with all SQLite versions.
- _trim_execution_history: replaced list_entries + json.loads with
  list_raw + regex extraction of workflowId/startedAt from the first 400
  bytes of each value string.  Avoids constructing large Python objects
  entirely; regex on a 400-byte prefix is ~100x cheaper than full parse.
- _trim_in_flight: Set[str] guard ensures at most one trim task per
  workflow runs at a time, preventing concurrent scans from multiplying
  peak memory usage.

Co-authored-by: Cursor <cursoragent@cursor.com>
…#353)

* fix(kafka): tighten ingest backpressure and compact execution storage

Reduce concurrent Kafka workflow runs and fetch buffering to limit memory
on large payloads. Summarize raw inputs and execution history for storage,
and remove experimental Kafka output producer from API and WebUI.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(workflow): add summary history mode for high-throughput ingest

Summarize step inputs/outputs in-engine for Kafka runs instead of retaining
full payloads in memory. Skip dedup hashing in summary mode and propagate
final workflow outputs on ExecutionResult.

* fix(workflow): summarize final outputs in summary history mode

Keep RunWorkflowResult.outputs bounded for Kafka and other high-throughput
runs by applying the same observability summarization used for step history.

* fix(workflow): clear REPL globals after each node in summary mode

Prevent large node-local variables from accumulating across steps when
Kafka and other high-throughput runs use history_mode=summary.
…extract

fix: eliminate OOM in execution history trim under syslog load
Align the OSS Pro upgrade application dialog with the new business requirements by removing sales-rep collection, requiring applicant contact info, and validating email/phone formats consistently in frontend and backend.

Co-authored-by: Cursor <cursoragent@cursor.com>
…-hook3-into-dev

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…h-hook3

Sync/main into dev after auth hook3
#352)

Backfill missing assistant/user fields and tool state timestamps so old
sessions deserialize without dropping the whole cache; skip invalid entries.
Allow channel surfaces to accept /clear and execute the existing session history reset flow so IM conversations can clear state consistently with WebUI.
Align EN/ZH README with WebUI same-origin /api proxy defaults, simplify
remote start flags, and expand reverse-proxy and auth recovery notes.
Ensure @mention-selected agents are preserved when messages are queued during streaming, and align agent picker copy with the default-Rex new-session behavior.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Add minimax-m3 (1M context, 128K output) to threatbook-cn-llm,
  threatbook-io-llm, and minimax providers
- Update deepseek-v4-flash context window 200K→1M, max output 128K→384K
  in both threatbook-cn-llm and threatbook-io-llm
- Reorder ThreatBook provider models: minimax group (m3/m2.7/m2.5) first
- Fix minimax provider: align minimax-m3 family field and add pricing

Co-authored-by: Cursor <cursoragent@cursor.com>
* feat(device): add Huorong EDR and Huawei Cloud WAF device plugins

- Add huorong_edr_v1_0: HMAC-SHA1 signed API integration for Huorong
  endpoint security platform, covering group management (group_list/
  create/rename/delete), client management (online/list/info/rename/
  group/leak), and task management (virus scan task creation)

- Add huaweicloud_waf_v1: AK/SK (SDK-HMAC-SHA256) and Token dual-auth
  integration for Huawei Cloud WAF, covering protected domain management
  (cloud mode + dedicated mode), policy and rule management (CC rules,
  custom rules, blacklist/whitelist, geo-IP), attack event queries, and
  security overview statistics

Both plugins follow the standard device plugin layout with _provider.yaml,
_test.yaml, handler.py, and grouped tool YAML files.

* refactor(device): rename huaweicloud_waf to v39 and bump version

Rename huaweicloud_waf_v1 → huaweicloud_waf_v39 to match the official
WAF API reference document version (v39, 2026-04-08), and update
version/product_version fields in _provider.yaml from "1.0" to "39".
…361)

* feat(workflow): merge Kafka configured inputs with consumed messages

Persist extra workflow inputs on Kafka consumer config and apply them
when triggering runs, with WebUI JSON editing aligned to the poller UX.

* fix(workflow): strip _comment keys from Kafka configured inputs

Strip execution-only comment fields when saving Kafka inputs in the
WebUI and when persisting or applying configured inputs at runtime.
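
A minimal sketch of the stripping step, assuming the comment field is a top-level key named exactly `_comment` and that configured inputs are a flat dict (`strip_comment_keys` is a hypothetical name):

```python
from typing import Any, Dict


def strip_comment_keys(inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the configured inputs without ``_comment`` entries."""
    return {key: value for key, value in inputs.items() if key != "_comment"}
```

Applying the same helper on save and again at runtime keeps older persisted configs, which may still carry the key, from leaking it into workflow runs.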

* fix(workflow): prefer processed cache size in poller status

Surface processed_cache_size_after in poller run summaries when present
and rename the WebUI label to reflect total processed count.

Co-authored-by: Cursor <cursoragent@cursor.com>
feat(provider): add minimax-m3 and update model limits in catalog
stephamie7 requested a review from duguwanglong on June 3, 2026 06:57
duguwanglong merged commit 6dab96a into main on Jun 3, 2026
2 checks passed

6 participants