Update FAQ with chunking and virtualization details #1007

Open
TomNicholas wants to merge 1 commit into main from docs-faq-contiguous-byte-ranges2

@TomNicholas

Member

Added explanation about format chunk requirements and virtualization restrictions for multi-file datasets.

What I did

Acceptance criteria:

  • Closes #xxxx
  • Tests added
  • Tests passing
  • No test coverage regression
  • Full type hint coverage
  • Changes are documented in docs/releases.md
  • New functions/methods are listed in an appropriate *.md file under docs/api
  • New functionality has documentation

@TomNicholas added the documentation label on Jun 2, 2026
@TomNicholas enabled auto-merge (squash) on June 2, 2026 02:43

@maxrjones maxrjones left a comment

Member

I made a couple of recommendations for improved readability. Thanks for opening this!

Comment thread docs/faq.md
@@ -25,6 +25,8 @@ Depends on some details of your data.
VirtualiZarr works by mapping your data to the zarr data model from whatever data model is used by the format it was saved in.
This means that if your data contains anything that cannot be represented within the zarr data model, it cannot be virtualized.

Member

Suggested change
This means that if your data contains anything that cannot be represented within the zarr data model, it cannot be virtualized.
This means that if your data contains anything that cannot be represented within the zarr data model, it cannot be virtualized. The following restrictions influence whether you can virtualize a data file.

I think a preamble to the list would help.

Comment thread docs/faq.md
VirtualiZarr works by mapping your data to the zarr data model from whatever data model is used by the format it was saved in.
This means that if your data contains anything that cannot be represented within the zarr data model, it cannot be virtualized.

- **Format chunks span contiguous byte ranges** - It's only possible to efficiently access individual chunks of data inside blobs in object storage if each chunk can be fetched via a single HTTP range request, which requires each chunk to occupy a contiguous localized series of bytes within the file layout. Well-designed formats such as netCDF and GRIB have this property, but other formats such as CSV do not. Note also this means that any additional processing which scrambles the byte locations will prevent virtualization - a single netCDF file is virtualizable, but a zipped or gzipped netCDF file is not!

Member

Suggested change
- **Format chunks span contiguous byte ranges** - It's only possible to efficiently access individual chunks of data inside blobs in object storage if each chunk can be fetched via a single HTTP range request, which requires each chunk to occupy a contiguous localized series of bytes within the file layout. Well-designed formats such as netCDF and GRIB have this property, but other formats such as CSV do not. Note also this means that any additional processing which scrambles the byte locations will prevent virtualization - a single netCDF file is virtualizable, but a zipped or gzipped netCDF file is not!
- **File must contain chunks of data, where each chunk spans a contiguous segment of the file** - For virtualization to work, each chunk must occupy a contiguous localized series of bytes within the file layout, so that the chunks can be fetched via a single HTTP range request. Well-designed formats such as netCDF and GRIB have this property, but other formats such as CSV do not. Note also this means that any additional processing which scrambles the byte locations will prevent virtualization - a single netCDF file is virtualizable, but a zipped or gzipped netCDF file is not!
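
To illustrate the contiguous-byte-range requirement discussed in this thread, here is a minimal sketch using only the Python standard library. It does not use the VirtualiZarr API: the file layout (a 16-byte header followed by fixed-size chunks) and the helper names `chunk_byte_range` and `read_range` are hypothetical, and slicing an in-memory blob stands in for a single HTTP range request.

```python
import gzip
import struct

# Hypothetical layout: a 16-byte header followed by 4 chunks of 8 float64 values.
HEADER = b"FAKEHEADER".ljust(16, b"\0")
CHUNK_LEN = 8
chunks = [[float(c * CHUNK_LEN + i) for i in range(CHUNK_LEN)] for c in range(4)]
payload = HEADER + b"".join(struct.pack(f"<{CHUNK_LEN}d", *c) for c in chunks)


def chunk_byte_range(index):
    """Return (offset, length) of chunk `index` in the uncompressed layout."""
    nbytes = CHUNK_LEN * 8  # 8 bytes per float64
    return len(HEADER) + index * nbytes, nbytes


def read_range(blob, offset, length):
    """Stand-in for one HTTP range request against an object in storage."""
    return blob[offset:offset + length]


# Each chunk is contiguous, so one ranged read is enough to decode it.
offset, length = chunk_byte_range(2)
raw = read_range(payload, offset, length)
values = list(struct.unpack(f"<{CHUNK_LEN}d", raw))

# After gzipping the whole file, the chunk's bytes are no longer present as a
# contiguous run, so no byte range of the compressed object yields the chunk.
gzipped = gzip.compress(payload)
chunk_still_addressable = raw in gzipped
```

In this sketch the chunk can only be recovered from the gzipped object by decompressing the stream first, which is the situation the FAQ entry describes for zipped or gzipped netCDF files.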

Labels

documentation Improvements or additions to documentation

2 participants