
UN-3436 [FIX] Reroute system-bucket worker IO to fsspec-cached FileSystem and shorten rate-limit stale cutoff #1935

Draft

Deepak-Kesavan wants to merge 1 commit into main from UN-3436-load-testing-config-after-async-changes


Conversation


@Deepak-Kesavan Deepak-Kesavan commented Apr 29, 2026

What

  • Workers route the UnstractCloudStorage (UCS, pcs|b8cd25cd-...) connector ID through the cached unstract.filesystem.FileSystem path instead of loading MinioFS via connectorkit.
  • Rate-limit ZSET gets a separate stale-entry cutoff (default 1h) split from the existing 6h Redis key TTL.

Companion cloud PR: https://github.com/Zipstack/unstract-cloud/pull/1468 (executor capacity / config bump for the integration env load test).

Why

Sustained API-deployment load tests against the integration env plateaued at roughly 250 concurrent Locust users (u250), with two compounding failures:

  1. Executor pods accumulated boto3 sessions and OOM'd. Workers were loading UnstractCloudStorage (a system-bucket connector that inherits MinioFS) per workflow execution; MinioFS opts out of fsspec's instance cache, so each instantiation produced a fresh boto3 session that urllib3's idle reaper could not keep up with under load. Prompt-service already reaches the same bucket via the SDK1 fsspec-cached path (FileSystem(...).get_file_storage()), so workers should too (see the cache sketch after this list).
  2. Rate-limit ZSET filled with stale entries. _release_api_deployment_rate_limit only fires from update_execution_status(COMPLETED|ERROR|STOPPED). Worker OOM/SIGKILL/timeout never reaches that line, so the entry sat for the full 6h key TTL, causing false 429s, a Locust retry storm, and more MinIO connection churn.
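A minimal sketch of the fsspec behavior at play (not project code), assuming s3fs is installed; the credentials are placeholders and no network call is made:

```python
import fsspec

# With default settings, fsspec caches filesystem instances keyed by their
# constructor arguments, so repeated lookups reuse one S3FileSystem (and its
# single underlying botocore session).
a = fsspec.filesystem("s3", key="dummy", secret="dummy")
b = fsspec.filesystem("s3", key="dummy", secret="dummy")
assert a is b

# Passing skip_instance_cache=True (what MinioFS effectively does) bypasses
# that cache, so every instantiation builds a brand-new filesystem and a
# brand-new session.
c = fsspec.filesystem("s3", key="dummy", secret="dummy", skip_instance_cache=True)
assert c is not a
```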

Round 3 (after a manual Redis flush) confirmed the same workers held u300 cleanly, so both failure modes are real.

How

workers/shared/workflow/connectors/utils.py

  • New _SYSTEM_INTERNAL_CONNECTOR_IDS set listing the UCS connector ID.
  • _SystemFileStorageAdapter is a thin UnstractFileSystem wrapper around FileSystem(storage_type).get_file_storage() exposing get_fsspec_fs() and test_credentials().
  • get_connector_instance short-circuits on a system ID and otherwise falls back to the existing connectorkit path (sketched below). User S3/MinIO/GCS connectors are unchanged.
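A hedged sketch of the short-circuit described above. Names follow the PR description, but the exact signatures, the FileStorageType enum value, the attribute holding the fsspec object, and the connectorkit fallback helper are assumptions rather than the repo's actual code:

```python
from unstract.filesystem import FileStorageType, FileSystem  # import path per PR description

# System-internal connector IDs that bypass connectorkit (UCS id truncated as in the PR text).
_SYSTEM_INTERNAL_CONNECTOR_IDS = {"pcs|b8cd25cd-..."}


class _SystemFileStorageAdapter:
    """Thin wrapper exposing only what worker call sites need for the system bucket."""

    def __init__(self, storage_type: FileStorageType) -> None:
        # FileSystem(...).get_file_storage() is the same fsspec-cached path
        # prompt-service already uses for this bucket.
        self._file_storage = FileSystem(storage_type).get_file_storage()

    def get_fsspec_fs(self):
        # Returns the cached fsspec S3FileSystem; the attribute name is an assumption.
        return self._file_storage.fs

    def test_credentials(self) -> bool:
        # System-bucket credentials come from the environment; assume valid.
        return True


def get_connector_instance(connector_id: str, connector_settings: dict):
    if connector_id in _SYSTEM_INTERNAL_CONNECTOR_IDS:
        return _SystemFileStorageAdapter(FileStorageType.WORKFLOW_EXECUTION)
    # Existing connectorkit path for user-configured S3/MinIO/GCS connectors
    # (hypothetical helper name standing in for the unchanged code).
    return _load_via_connectorkit(connector_id, connector_settings)
```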

backend/api_v2/rate_limit_constants.py + backend/api_v2/rate_limiter.py

  • New DEFAULT_STALE_ENTRY_HOURS = 1 constant alongside the existing DEFAULT_TTL_HOURS = 6.
  • _get_stale_entry_seconds() reads API_DEPLOYMENT_RATE_LIMIT_STALE_ENTRY_HOURS setting (override-able).
  • _get_cutoff_timestamp() now uses the stale cutoff (not the key TTL) so leaked entries reap within an hour rather than six. pipe.expire(org_key, ttl_seconds) continues to use the full 6h key TTL. See the sketch below.
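A hedged sketch of the cutoff split: the constant and setting names come from this PR, but the function bodies are assumed, not copied from the repo:

```python
import time

from django.conf import settings

DEFAULT_TTL_HOURS = 6          # existing Redis key TTL
DEFAULT_STALE_ENTRY_HOURS = 1  # new per-entry stale cutoff


def _get_stale_entry_seconds() -> int:
    hours = getattr(
        settings,
        "API_DEPLOYMENT_RATE_LIMIT_STALE_ENTRY_HOURS",
        DEFAULT_STALE_ENTRY_HOURS,
    )
    return int(hours) * 3600


def _get_cutoff_timestamp() -> float:
    # ZSET members with a score older than this are treated as leaked slots
    # and removed; previously this used the 6h key TTL, now the 1h stale cutoff.
    return time.time() - _get_stale_entry_seconds()
```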

The connector layer (unstract/connectors/.../filesystems/minio/minio.py and the UCS subclass) is intentionally not touched. Workers no longer reach for it for the system bucket; user-configured S3/MinIO connectors keep their existing behavior including skip_instance_cache=True.

Can this PR break any existing features? If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled)

  • Worker call sites that resolved a UCS-typed connector via get_connector_instance now get a FileStorage-backed adapter instead of a MinioFS instance. The fsspec object exposed by get_fsspec_fs() is S3FileSystem in both cases, so any code that opens/reads/writes via fsspec is unaffected. The adapter implements get_fsspec_fs() and test_credentials() only; MinioFS-specific helpers (extract_metadata_file_hash, extract_modified_date, is_dir_by_metadata, get_connector_root_dir, ...) are not implemented because the only call site (source.py:_get_fs_connector -> list_files_from_file_connector, FILESYSTEM source) is not exercised with the UCS id in current product flows.
  • The destination-side path (WorkerConnectorService._get_destination_connector -> ConnectorOperations.get_fs_connector) is a separate code path that this PR does not touch. If the UCS id ever flows through it, it will still load MinioFS. This was deliberate scope-limiting; flagged for follow-up if observed in profiling.
  • The rate-limit cutoff is tunable via the new setting; its default lowers the stale window from 6h to 1h. Active executions (less than 1h elapsed) are unaffected. Workflows running longer than 1h would be released early — current integration test profiles cap at ~10 min so this is fine; worth confirming with prod traffic patterns before promotion.

Database Migrations

  • None.

Env Config

  • API_DEPLOYMENT_RATE_LIMIT_STALE_ENTRY_HOURS (new, optional, default 1) — per-entry cutoff in hours for stale ZSET releases.
  • API_DEPLOYMENT_RATE_LIMIT_TTL_HOURS (existing, optional, default 6) — Redis key TTL in hours. (Example settings wiring sketched below.)
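For illustration only, a hedged sketch of how these env vars might be surfaced as Django settings; the actual wiring in the repo may differ:

```python
import os

API_DEPLOYMENT_RATE_LIMIT_STALE_ENTRY_HOURS = int(
    os.environ.get("API_DEPLOYMENT_RATE_LIMIT_STALE_ENTRY_HOURS", 1)
)
API_DEPLOYMENT_RATE_LIMIT_TTL_HOURS = int(
    os.environ.get("API_DEPLOYMENT_RATE_LIMIT_TTL_HOURS", 6)
)
```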

Relevant Docs

  • N/A.

Related Issues or PRs

Dependencies Versions

  • None.

Notes on Testing

  • Integration env Round 4: ramp 80->300 with no Redis flush mid-run. Pre-fix, executor pods sat at 11–12.7 GB RSS by stage 4; post-fix, RSS is expected to plateau. manage.py get_org_rate_limit org_qijtoAkJNhznYhNt should report 0 within seconds of Locust shutdown at every stage.
  • Unit-level: existing connectorkit tests still pass (system ID set is empty for non-cloud OSS deployments, so behavior is identical there).
  • Trace check: _SystemFileStorageAdapter.get_fsspec_fs() returns FileStorageHelper.file_storage_init(...)'s result, which is a plain fsspec.filesystem(protocol, **storage_config) call. fsspec's instance cache is in effect (no skip_instance_cache is passed), confirming we hit the same cached S3FileSystem instance prompt-service uses.

Screenshots

N/A.

Checklist

I have read and understood the Contribution Guidelines.

Commit: UN-3436 [FIX] Reroute system-bucket worker IO to fsspec-cached FileSystem and shorten rate-limit stale cutoff

Two fixes targeting the u250 cliff under sustained API-deployment load.

1. Workers short-circuit system-internal storage. UnstractCloudStorage
   ("pcs|...") points at the same bucket workers already reach via
   FileSystem(...) — i.e., the SDK1 fsspec-cached path that prompt-service
   uses. Routing it through connectorkit instead loaded MinioFS, which
   creates a fresh boto3 session per workflow execution; under load,
   urllib3's idle-connection reaper could not keep up. The new system
   adapter in workers/shared/workflow/connectors/utils.py returns the
   cached FileSystem path without touching the connector class. User-
   configured S3/MinIO/GCS connectors keep going through connectorkit
   exactly as before.

2. Rate-limit ZSET gets a separate stale-entry cutoff (default 1h) split
   from the key TTL (still 6h). Leaks caused by non-graceful worker
   termination (OOM/SIGKILL before release_slot fires) now reap within an
   hour rather than waiting for the whole key to expire. Both knobs are
   override-able via Django settings (API_DEPLOYMENT_RATE_LIMIT_TTL_HOURS
   and API_DEPLOYMENT_RATE_LIMIT_STALE_ENTRY_HOURS).
@coderabbitai Bot commented Apr 29, 2026

Review skipped: draft PR detected. To trigger a single review, comment @coderabbitai review.
