
UN-3436 [FIX] Reroute system-bucket worker IO to fsspec-cached FileSystem and shorten rate-limit stale cutoff #1935

Draft

Deepak-Kesavan wants to merge 1 commit into main from UN-3436-load-testing-config-after-async-changes


Conversation


@Deepak-Kesavan Deepak-Kesavan commented Apr 29, 2026

What

  • Workers route the UnstractCloudStorage (UCS, pcs|b8cd25cd-...) connector ID through the cached unstract.filesystem.FileSystem path instead of loading MinioFS via connectorkit.
  • Rate-limit ZSET gets a separate stale-entry cutoff (default 1h) split from the existing 6h Redis key TTL.

Companion cloud PR: https://github.com/Zipstack/unstract-cloud/pull/1468 (executor capacity / config bump for the integration env load test).

Why

Sustained API-deployment load tests against the integration env plateaued at roughly 250 concurrent Locust users (u250), with two compounding failures:

  1. Executor pods accumulated boto3 sessions and OOM'd. Workers were loading UnstractCloudStorage (a system-bucket connector that inherits MinioFS) per workflow execution; MinioFS opts out of fsspec's instance cache, so each instantiation produced a fresh boto3 session that urllib3's idle reaper could not keep up with under load. Prompt-service already reaches the same bucket via the SDK1 fsspec-cached path (FileSystem(...).get_file_storage()), so workers should too (see the cache sketch after this list).
  2. Rate-limit ZSET filled with stale entries. _release_api_deployment_rate_limit only fires from update_execution_status(COMPLETED|ERROR|STOPPED). Worker OOM/SIGKILL/timeout never reaches that line, so the entry sat for the full 6h key TTL, causing false 429s, a Locust retry storm, and more MinIO connection churn.
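A minimal sketch of the fsspec behavior at play (not project code), assuming s3fs is installed; the credentials are placeholders and no network call is made:

```python
import fsspec

# With default settings, fsspec caches filesystem instances keyed by their
# constructor arguments, so repeated lookups reuse one S3FileSystem (and its
# single underlying botocore session).
a = fsspec.filesystem("s3", key="dummy", secret="dummy")
b = fsspec.filesystem("s3", key="dummy", secret="dummy")
assert a is b

# Passing skip_instance_cache=True (what MinioFS effectively does) bypasses
# that cache, so every instantiation builds a brand-new filesystem and a
# brand-new session.
c = fsspec.filesystem("s3", key="dummy", secret="dummy", skip_instance_cache=True)
assert c is not a
```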

Round 3 (after a manual Redis flush) confirmed the same workers held u300 cleanly, so both failure modes are real.

How

workers/shared/workflow/connectors/utils.py

  • New _SYSTEM_INTERNAL_CONNECTOR_IDS set listing the UCS connector ID.
  • _SystemFileStorageAdapter is a thin UnstractFileSystem wrapper around FileSystem(storage_type).get_file_storage() exposing get_fsspec_fs() and test_credentials().
  • get_connector_instance short-circuits on a system ID and otherwise falls back to the existing connectorkit path (sketched below). User S3/MinIO/GCS connectors are unchanged.
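A hedged sketch of the short-circuit described above. Names follow the PR description, but the exact signatures, the FileStorageType enum value, the attribute holding the fsspec object, and the connectorkit fallback helper are assumptions rather than the repo's actual code:

```python
from unstract.filesystem import FileStorageType, FileSystem  # import path per PR description

# System-internal connector IDs that bypass connectorkit (UCS id truncated as in the PR text).
_SYSTEM_INTERNAL_CONNECTOR_IDS = {"pcs|b8cd25cd-..."}


class _SystemFileStorageAdapter:
    """Thin wrapper exposing only what worker call sites need for the system bucket."""

    def __init__(self, storage_type: FileStorageType) -> None:
        # FileSystem(...).get_file_storage() is the same fsspec-cached path
        # prompt-service already uses for this bucket.
        self._file_storage = FileSystem(storage_type).get_file_storage()

    def get_fsspec_fs(self):
        # Returns the cached fsspec S3FileSystem; the attribute name is an assumption.
        return self._file_storage.fs

    def test_credentials(self) -> bool:
        # System-bucket credentials come from the environment; assume valid.
        return True


def get_connector_instance(connector_id: str, connector_settings: dict):
    if connector_id in _SYSTEM_INTERNAL_CONNECTOR_IDS:
        return _SystemFileStorageAdapter(FileStorageType.WORKFLOW_EXECUTION)
    # Existing connectorkit path for user-configured S3/MinIO/GCS connectors
    # (hypothetical helper name standing in for the unchanged code).
    return _load_via_connectorkit(connector_id, connector_settings)
```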

backend/api_v2/rate_limit_constants.py + backend/api_v2/rate_limiter.py

  • New DEFAULT_STALE_ENTRY_HOURS = 1 constant alongside the existing DEFAULT_TTL_HOURS = 6.
  • _get_stale_entry_seconds() reads API_DEPLOYMENT_RATE_LIMIT_STALE_ENTRY_HOURS setting (override-able).
  • _get_cutoff_timestamp() now uses the stale cutoff (not the key TTL) so leaked entries reap within an hour rather than six. pipe.expire(org_key, ttl_seconds) continues to use the full 6h key TTL. See the sketch below.
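A hedged sketch of the cutoff split: the constant and setting names come from this PR, but the function bodies are assumed, not copied from the repo:

```python
import time

from django.conf import settings

DEFAULT_TTL_HOURS = 6          # existing Redis key TTL
DEFAULT_STALE_ENTRY_HOURS = 1  # new per-entry stale cutoff


def _get_stale_entry_seconds() -> int:
    hours = getattr(
        settings,
        "API_DEPLOYMENT_RATE_LIMIT_STALE_ENTRY_HOURS",
        DEFAULT_STALE_ENTRY_HOURS,
    )
    return int(hours) * 3600


def _get_cutoff_timestamp() -> float:
    # ZSET members with a score older than this are treated as leaked slots
    # and removed; previously this used the 6h key TTL, now the 1h stale cutoff.
    return time.time() - _get_stale_entry_seconds()
```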

The connector layer (unstract/connectors/.../filesystems/minio/minio.py and the UCS subclass) is intentionally not touched. Workers no longer reach for it for the system bucket; user-configured S3/MinIO connectors keep their existing behavior including skip_instance_cache=True.

Can this PR break any existing features? If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled)

  • Worker call sites that resolved a UCS-typed connector via get_connector_instance now get a FileStorage-backed adapter instead of a MinioFS instance. The fsspec object exposed by get_fsspec_fs() is S3FileSystem in both cases, so any code that opens/reads/writes via fsspec is unaffected. The adapter implements get_fsspec_fs() and test_credentials() only; MinioFS-specific helpers (extract_metadata_file_hash, extract_modified_date, is_dir_by_metadata, get_connector_root_dir, ...) are not implemented because the only call site (source.py:_get_fs_connector -> list_files_from_file_connector, FILESYSTEM source) is not exercised with the UCS id in current product flows.
  • The destination-side path (WorkerConnectorService._get_destination_connector -> ConnectorOperations.get_fs_connector) is a separate code path that this PR does not touch. If the UCS id ever flows through it, it will still load MinioFS. This was deliberate scope-limiting; flagged for follow-up if observed in profiling.
  • The rate-limit cutoff is tunable via the new setting; its default lowers the stale window from 6h to 1h. Active executions (less than 1h elapsed) are unaffected. Workflows running longer than 1h would be released early — current integration test profiles cap at ~10 min so this is fine; worth confirming with prod traffic patterns before promotion.

Database Migrations

  • None.

Env Config

  • API_DEPLOYMENT_RATE_LIMIT_STALE_ENTRY_HOURS (new, optional, default 1) — per-entry cutoff in hours for stale ZSET releases.
  • API_DEPLOYMENT_RATE_LIMIT_TTL_HOURS (existing, optional, default 6) — Redis key TTL in hours. (Example settings wiring sketched below.)
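For illustration only, a hedged sketch of how these env vars might be surfaced as Django settings; the actual wiring in the repo may differ:

```python
import os

API_DEPLOYMENT_RATE_LIMIT_STALE_ENTRY_HOURS = int(
    os.environ.get("API_DEPLOYMENT_RATE_LIMIT_STALE_ENTRY_HOURS", 1)
)
API_DEPLOYMENT_RATE_LIMIT_TTL_HOURS = int(
    os.environ.get("API_DEPLOYMENT_RATE_LIMIT_TTL_HOURS", 6)
)
```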

Relevant Docs

  • N/A.

Related Issues or PRs

Dependencies Versions

  • None.

Notes on Testing

  • Integration env Round 4: ramp 80->300 with no Redis flush mid-run. Pre-fix, executor pods sat at 11–12.7 GB RSS by stage 4; post-fix, RSS is expected to plateau. manage.py get_org_rate_limit org_qijtoAkJNhznYhNt should report 0 within seconds of Locust shutdown at every stage.
  • Unit-level: existing connectorkit tests still pass (system ID set is empty for non-cloud OSS deployments, so behavior is identical there).
  • Trace check: _SystemFileStorageAdapter.get_fsspec_fs() returns FileStorageHelper.file_storage_init(...)'s result, which is a plain fsspec.filesystem(protocol, **storage_config) call. fsspec's instance cache is in effect (no skip_instance_cache is passed), confirming we hit the same cached S3FileSystem instance prompt-service uses.

Screenshots

N/A.

Checklist

I have read and understood the Contribution Guidelines.

Commit: UN-3436 [FIX] Reroute system-bucket worker IO to fsspec-cached FileSystem and shorten rate-limit stale cutoff

Two fixes targeting the u250 cliff under sustained API-deployment load.

1. Workers short-circuit system-internal storage. UnstractCloudStorage
   ("pcs|...") points at the same bucket workers already reach via
   FileSystem(...) — i.e., the SDK1 fsspec-cached path that prompt-service
   uses. Routing it through connectorkit instead loaded MinioFS, which
   creates a fresh boto3 session per workflow execution; under load,
   urllib3's idle-connection reaper could not keep up. The new system
   adapter in workers/shared/workflow/connectors/utils.py returns the
   cached FileSystem path without touching the connector class. User-
   configured S3/MinIO/GCS connectors keep going through connectorkit
   exactly as before.

2. Rate-limit ZSET gets a separate stale-entry cutoff (default 1h) split
   from the key TTL (still 6h). Leaks caused by non-graceful worker
   termination (OOM/SIGKILL before release_slot fires) now reap within an
   hour rather than waiting for the whole key to expire. Both knobs are
   override-able via Django settings (API_DEPLOYMENT_RATE_LIMIT_TTL_HOURS
   and API_DEPLOYMENT_RATE_LIMIT_STALE_ENTRY_HOURS).
@coderabbitai Bot commented Apr 29, 2026

Review skipped: draft PR detected. To trigger a single review, comment @coderabbitai review.
