fix(gc): reuse fs in pull to avoid connections explosion#10888
fix(gc): reuse fs in pull to avoid connections explosion#10888shcheklein wants to merge 1 commit intomainfrom
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR implements filesystem connection reuse in the Remote class to prevent connection exhaustion during garbage collection operations. When DVC pulls missing .dir files individually during GC, it was creating separate filesystem connections for each file, overwhelming servers especially over SSH.
- Added filesystem caching mechanism using class-level cache and name-to-key mapping
- Implemented cache key generation based on remote name, filesystem class, configuration, and path
- Added proper cleanup logic to close unused filesystem connections
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| fs_config = dict(config) | ||
| fs = cls(**fs_config) | ||
| runtime_config = {**fs_config, "tmp_dir": self.repo.site_cache_dir} |
There was a problem hiding this comment.
The separation of fs_config and runtime_config creates confusion about which configuration is used where. Consider using more descriptive variable names like filesystem_config and remote_config to clarify their purposes.
| config: dict, | ||
| fs_path: str, | ||
| ) -> tuple[str, ...]: | ||
| serialized_config = json.dumps(config, sort_keys=True, default=str) |
There was a problem hiding this comment.
Using default=str in json.dumps may produce inconsistent serialization for complex objects. Objects of the same type could serialize differently depending on their string representation, leading to cache misses for equivalent configurations.
| _CACHE: ClassVar[dict[tuple[str, ...], "FileSystem"]] = {} | ||
| _NAME_TO_KEY: ClassVar[dict[str, tuple[str, ...]]] = {} |
There was a problem hiding this comment.
Class-level caches without size limits or cleanup mechanisms can lead to memory leaks in long-running processes. Consider implementing cache size limits or periodic cleanup.
|
For the record, garbage collection issue was fixed upstream by fsspec/sshfs#62. Turns out, we were creating reference cycle with |
On GC we
pullmissing.dirfiles one by one and thus we exhaust and overwhelm the server. Especially this is problematic on SSH. Her is an attempt to reuse the existingfs(need to check if it will be enough to reuse pool or connection). I don't like this additional layer tbh, I can also probably simplify it further a bit - this is just a rough draft. If there are other ideas how to force it reuse the same pool - please let me know (may we on the fsspec level we can utilize some caching?)Note: AI was used here. Not every line is review yet. This is a draft to discuss.
cc @skshetry
❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible. 🙏