UN-3436 [FIX] Reroute system-bucket worker IO to fsspec-cached FileSystem and shorten rate-limit stale cutoff #1935
Conversation
Two fixes targeting the u250 cliff under sustained API-deployment load.
1. Workers short-circuit system-internal storage. UnstractCloudStorage
("pcs|...") points at the same bucket workers already reach via
FileSystem(...) — i.e., the SDK1 fsspec-cached path that prompt-service
uses. Routing it through connectorkit instead loaded MinioFS, which
creates a fresh boto3 session per workflow execution; under load,
urllib3's idle-connection reaper could not keep up. The new system
adapter in workers/shared/workflow/connectors/utils.py returns the
cached FileSystem path without touching the connector class. User-
configured S3/MinIO/GCS connectors keep going through connectorkit
exactly as before.
2. Rate-limit ZSET gets a separate stale-entry cutoff (default 1h) split
from the key TTL (still 6h). Leaks caused by non-graceful worker
termination (OOM/SIGKILL before release_slot fires) now reap within
hours rather than waiting for the whole key to expire. Both knobs are
overridable via Django settings (API_DEPLOYMENT_RATE_LIMIT_TTL_HOURS
and API_DEPLOYMENT_RATE_LIMIT_STALE_ENTRY_HOURS).
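A minimal pure-Python sketch of the split-cutoff reaping described in (2). The constants mirror the stated defaults; the in-memory dict stands in for the Redis ZSET (member -> acquisition timestamp) and `reap_stale_entries` for a `ZREMRANGEBYSCORE` call — all function names here are illustrative, not the PR's actual code:

```python
import time

# Defaults as described above (assumed shape; real code reads Django settings).
DEFAULT_TTL_HOURS = 6          # Redis key TTL (unchanged)
DEFAULT_STALE_ENTRY_HOURS = 1  # new per-entry stale cutoff

def get_cutoff_timestamp(now: float, stale_hours: int = DEFAULT_STALE_ENTRY_HOURS) -> float:
    """Entries scored at or before this timestamp are treated as leaked."""
    return now - stale_hours * 3600

def reap_stale_entries(zset: dict[str, float], now: float) -> int:
    """Simulate ZREMRANGEBYSCORE 0..cutoff on an in-memory 'ZSET'."""
    cutoff = get_cutoff_timestamp(now)
    stale = [member for member, score in zset.items() if score <= cutoff]
    for member in stale:
        del zset[member]
    return len(stale)

# A worker killed by OOM 90 minutes ago never called release_slot:
now = time.time()
slots = {"exec-leaked": now - 90 * 60, "exec-live": now - 60}
removed = reap_stale_entries(slots, now)
# The leaked slot is reaped within the 1h stale window instead of
# lingering for the full 6h key TTL.
```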
What
Route the system-internal UnstractCloudStorage connector ID (`pcs|b8cd25cd-...`) through the cached `unstract.filesystem.FileSystem` path instead of loading `MinioFS` via connectorkit.

Companion cloud PR: https://github.com/Zipstack/unstract-cloud/pull/1468 (executor capacity / config bump for the integration env load test).
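The short-circuit can be sketched as follows. `_SYSTEM_INTERNAL_CONNECTOR_IDS` and `get_connector_instance` are names from this PR, but the ID placeholder, the factory stubs, and the function bodies are hypothetical stand-ins for illustration:

```python
# Placeholder ID — the real UCS connector ID is elided in this PR description.
_SYSTEM_INTERNAL_CONNECTOR_IDS = {"pcs|00000000-placeholder"}

def make_system_adapter(connector_id: str):
    # Stand-in for the FileStorage-backed adapter (cached SDK path).
    return ("system", connector_id)

def load_via_connectorkit(connector_id: str, settings: dict):
    # Stand-in for the unchanged connectorkit path (user S3/MinIO/GCS).
    return ("connectorkit", connector_id)

def get_connector_instance(connector_id: str, settings: dict):
    if connector_id in _SYSTEM_INTERNAL_CONNECTOR_IDS:
        return make_system_adapter(connector_id)
    return load_via_connectorkit(connector_id, settings)
```

User-configured connectors never match the set, so their code path is byte-for-byte the same as before.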
Why
Sustained API-deployment load tests against the integration env plateaued at ~u250 with two compounding failures:
- Workers instantiated `UnstractCloudStorage` (a system-bucket connector that inherits `MinioFS`) per workflow execution; `MinioFS` opts out of fsspec's instance cache, so each instantiation produced a fresh boto3 session that urllib3's idle reaper could not keep up with under load. Prompt-service already reaches the same bucket via the SDK1 fsspec-cached path (`FileSystem(...).get_file_storage()`), so workers should too.
- `_release_api_deployment_rate_limit` only fires from `update_execution_status` (COMPLETED|ERROR|STOPPED). Worker OOM/SIGKILL/timeout never reaches that line, so the leaked entry sat for the full 6h key TTL, causing false 429s, a locust retry storm, and more MinIO connection churn.

Round 3 (after a manual Redis flush) confirmed the same workers held u300 cleanly, so both failure modes are real.
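To illustrate the caching behavior at issue — a minimal model of an instance cache with an opt-out flag, not fsspec's actual implementation — a class that memoizes instances by construction arguments unless the caller opts out, the way `skip_instance_cache` does:

```python
class _FS:
    """Toy filesystem with an fsspec-style instance cache (illustrative only)."""

    _cache: dict = {}

    def __new__(cls, protocol: str, *, skip_instance_cache: bool = False, **kw):
        key = (cls, protocol, tuple(sorted(kw.items())))
        if not skip_instance_cache and key in cls._cache:
            return cls._cache[key]          # reuse: no new session
        inst = super().__new__(cls)
        inst.fresh_session = True           # stand-in for a new boto3 session
        if not skip_instance_cache:
            cls._cache[key] = inst
        return inst

# SDK1/prompt-service style: cached, one instance reused across executions.
a = _FS("s3", endpoint="http://minio:9000")
b = _FS("s3", endpoint="http://minio:9000")

# MinioFS style: cache opt-out, every workflow execution pays for a new session.
c = _FS("s3", endpoint="http://minio:9000", skip_instance_cache=True)
```

Under sustained load, the opt-out variant is what piles up idle connections faster than urllib3's reaper can close them.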
How
- `workers/shared/workflow/connectors/utils.py`
  - `_SYSTEM_INTERNAL_CONNECTOR_IDS` set listing the UCS connector ID.
  - `_SystemFileStorageAdapter` is a thin `UnstractFileSystem` wrapper around `FileSystem(storage_type).get_file_storage()` exposing `get_fsspec_fs()` and `test_credentials()`.
  - `get_connector_instance` short-circuits on a system ID and otherwise falls back to the existing connectorkit path. User S3/MinIO/GCS connectors are unchanged.
- `backend/api_v2/rate_limit_constants.py` + `backend/api_v2/rate_limiter.py`
  - `DEFAULT_STALE_ENTRY_HOURS = 1` constant alongside the existing `DEFAULT_TTL_HOURS = 6`.
  - `_get_stale_entry_seconds()` reads the `API_DEPLOYMENT_RATE_LIMIT_STALE_ENTRY_HOURS` setting (overridable).
  - `_get_cutoff_timestamp()` now uses the stale cutoff (not the key TTL) so leaked entries reap within an hour rather than six.
  - `pipe.expire(org_key, ttl_seconds)` continues to use the full 6h key TTL.

The connector layer (`unstract/connectors/.../filesystems/minio/minio.py` and the UCS subclass) is intentionally not touched. Workers no longer reach for it for the system bucket; user-configured S3/MinIO connectors keep their existing behavior, including `skip_instance_cache=True`.

Can this PR break any existing features? If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled)
- Callers of `get_connector_instance` now get a `FileStorage`-backed adapter instead of a `MinioFS` instance. The fsspec object exposed by `get_fsspec_fs()` is `S3FileSystem` in both cases, so any code that opens/reads/writes via fsspec is unaffected. The adapter implements `get_fsspec_fs()` and `test_credentials()` only; MinioFS-specific helpers (`extract_metadata_file_hash`, `extract_modified_date`, `is_dir_by_metadata`, `get_connector_root_dir`, ...) are not implemented because the only call site (`source.py:_get_fs_connector` -> `list_files_from_file_connector`, FILESYSTEM source) is not exercised with the UCS id in current product flows.
- The destination path (`WorkerConnectorService._get_destination_connector` -> `ConnectorOperations.get_fs_connector`) is a separate code path that this PR does not touch. If the UCS id ever flows through it, it will still load `MinioFS`. This was deliberate scope-limiting; flagged for follow-up if observed in profiling.

Database Migrations
Env Config
- `API_DEPLOYMENT_RATE_LIMIT_STALE_ENTRY_HOURS` (new, optional, default 1) — per-entry cutoff in hours for stale ZSET releases.
- `API_DEPLOYMENT_RATE_LIMIT_TTL_HOURS` (existing, optional, default 6) — Redis key TTL in hours.

Relevant Docs
Related Issues or PRs
Dependencies Versions
Notes on Testing
- `manage.py get_org_rate_limit org_qijtoAkJNhznYhNt` should drop to 0 within seconds of locust shutdown at every stage.
- `_SystemFileStorageAdapter.get_fsspec_fs()` returns `FileStorageHelper.file_storage_init(...)`'s result, which is a plain `fsspec.filesystem(protocol, **storage_config)` call. fsspec's instance cache is in effect (no `skip_instance_cache` is passed), confirming we hit the same cached `S3FileSystem` instance prompt-service uses.

Screenshots
N/A.
Checklist
I have read and understood the Contribution Guidelines.