Add concurrent downloads to get_file and cat_file #1007
Conversation
Use parallel byte-range GET requests for large file downloads, mirroring the existing `_upload_file_part_concurrent` pattern. Files larger than `chunksize` with concurrency > 1 are downloaded in batches of concurrent range requests; smaller files fall through to the existing sequential path unchanged.
Use parallel byte-range GET requests for `cat_file`, mirroring the approach in `_get_file`. When no `start`/`end` range is specified and concurrency > 1, large files are read using batched concurrent range requests and assembled in memory. Also correct the `max_concurrency` docstring to reflect that `pipe()` already supports concurrency.
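As a rough illustration of the batching pattern (a minimal sketch using plain HTTP range requests via aiohttp, not the aiobotocore calls the PR actually makes; `url`, `size`, and both function names are hypothetical):

```python
import asyncio

import aiohttp


async def _fetch_range(session, url, start, end):
    # Request one byte range; a server that supports ranges replies
    # with 206 Partial Content and exactly these bytes.
    headers = {"Range": f"bytes={start}-{end}"}
    async with session.get(url, headers=headers) as resp:
        resp.raise_for_status()
        return start, await resp.read()


async def download_concurrent(url, size, chunksize=5 * 2**20, max_concurrency=10):
    # Split [0, size) into chunksize-sized ranges, then fetch them in
    # batches of max_concurrency concurrent requests and reassemble.
    ranges = [
        (start, min(start + chunksize, size) - 1)
        for start in range(0, size, chunksize)
    ]
    out = bytearray(size)
    async with aiohttp.ClientSession() as session:
        for i in range(0, len(ranges), max_concurrency):
            batch = ranges[i : i + max_concurrency]
            results = await asyncio.gather(
                *(_fetch_range(session, url, s, e) for s, e in batch)
            )
            for start, chunk in results:
                out[start : start + len(chunk)] = chunk
    return bytes(out)
```

Batching (rather than launching every range at once) caps the number of in-flight connections at `max_concurrency`, which is the same trade-off the existing `_upload_file_part_concurrent` makes.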
Out of interest, do you have any measurements of the difference this change makes for your specific use case?
martindurant left a comment
I am generally positive about the changes here.
The problem with concurrency on slow networks is that connections will time out if they are launched at the same time but one then swamps the bandwidth. Therefore, it seems best to default to concurrency 1 unless we can reliably determine the bandwidth to the data, or whether the caller is co-resident with it.
```python
s3.pipe(test_bucket_name + "/parallel_test", data)

test_file = str(tmpdir.join("parallel_test"))
cb = Callback()
```
Overriding `Callback`'s `relative_update` will give a way to measure how many chunks are actually read, to make sure everything is correct.
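One possible shape for that helper, assuming fsspec's `Callback` base class (the `CountingCallback` name and the final assertion are illustrative, not part of the PR):

```python
import math

from fsspec.callbacks import Callback


class CountingCallback(Callback):
    # Counts relative_update calls so the test can assert that exactly
    # one update arrives per downloaded chunk.
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.chunks = 0

    def relative_update(self, inc=1):
        self.chunks += 1
        super().relative_update(inc)


# In the test, after the download completes:
# assert cb.chunks == math.ceil(len(data) / chunksize)
```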
```python
@pytest.mark.parametrize("factor", [1, 5, 6])
def test_get_file_parallel_integrity(s3, tmpdir, factor):
    chunksize = 5 * 2**20
    data = os.urandom(chunksize * factor)
```
Could also include non-integer factors, e.g., a parameter that adds 0 or 1 to this data size.
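For example (an illustrative variation on the test above, not code from the PR), an extra parameter could push the size one byte off the chunk boundary:

```python
@pytest.mark.parametrize("extra", [0, 1])
@pytest.mark.parametrize("factor", [1, 5, 6])
def test_get_file_parallel_integrity(s3, tmpdir, factor, extra):
    chunksize = 5 * 2**20
    # extra=1 makes the final range request shorter than chunksize
    data = os.urandom(chunksize * factor + extra)
```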
```python
    ]
)
for offset, data in results:
    f0.seek(offset)
```
I believe this should work on Windows, but I seem to recall only POSIX has true sparse-file support - do we need to expand the file to full size and open in update mode?
Should we ensure that the block size is a factor of 2, to align with disk block sizes? The default, at least, is.
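A sketch of that suggestion, assuming the file size is known up front (e.g., from `info()`); `file_size` and `results` stand in for the surrounding test's values:

```python
# Preallocate the output file so offset writes never depend on
# platform sparse-file semantics, then reopen in update mode.
with open(test_file, "wb") as f0:
    f0.truncate(file_size)
with open(test_file, "r+b") as f0:
    for offset, chunk in results:
        f0.seek(offset)
        f0.write(chunk)
```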
At least for a large file (20GB) in a cloud instance with a lot of bandwidth, this reduced a download from ~5 to ~1.5 minutes. I tested this locally on my home internet, and it made a difference, but not nearly as dramatic, so I think defaulting to 1 is probably fair.
Do you want to split the max_concurrency option somehow? I think boto3 also defaults to 10, FWIW: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3.html#concurrent-transfer-operations
s3fs is single-threaded, though. If parallelising on threads, each connection gets a turn (although the NIC can still get swamped by one of them).
Closes #841
Adds parallel byte-range GET support to `_get_file` and `_cat_file`, mirroring the existing `_upload_file_part_concurrent` pattern used by `put_file`.

When a file is larger than `chunksize` and `max_concurrency > 1`, downloads are split into byte-range requests executed in batches via `asyncio.gather`. Smaller files and explicit range requests (`start`/`end` on `cat_file`) fall through to the existing sequential path unchanged.

As for my motivation: boto3's `download_file` was a lot faster than s3fs `get_file` for a 20GB file in Cloudflare R2 from our cloud instances.