Extract DeletionVector logic from PuffinFile#3491
Conversation
cf9a2ce to
374d25c
Compare
374d25c to
74e0d7b
Compare
|
Thanks for your review! Addressed comments. |
|
Thanks for this contribution @ebyhr . I've approved it, and I'll leave it open for a day for any additional reviews before we merge. |
|
Thanks for splitting this out — the read path looks behavior-preserving and the separation is clean. For the write side, we'd like to coordinate before building on top of this: once it lands we're planning to rebase #3474 (the DV writer, currently Given that, would it make sense to decide now whether the write/serialize path should live on Our preference would be to mirror the read-side split — keep the DV bitmap serialization on |
| class DeletionVector: | ||
| _deletion_vectors: dict[str, list[BitMap]] |
There was a problem hiding this comment.
Hi @ebyhr - I thought about this a bit more and I think we'd benefit from revisiting the class design.
A single PuffinFile can hold multiple DVs, and that's read correctly here. But the DeletionVector class ends up encapsulating a map of deletion vectors (keyed by referenced-data-file), which I find a little counterintuitive given the name.
Would it make sense to have DeletionVector map 1-1 to a single DV blob instead? That'd line up with Java, where one DV = one data file's PositionDeleteIndex, loaded via the entry's content_offset/content_size. The from_puffin_file function could then be moved to be a standalone function and return a list of DeletionVectors. WDYT?
There was a problem hiding this comment.
It makes sense to me. My initial motivation was to create a helper class similar to DVUtil in Iceberg Java.
I've updated this PR. Please let me know if the last change doesn't align with your intention.
Hi @moomindani - thanks for flagging this, and for thinking about how the two paths line up. I'm leaning the same way you are: keep the DV bitmap serialization on |
|
Works for me, thanks @sungwy. The 1-1 |
|
merged. Thanks a lot @ebyhr , and thanks @moomindani for the review! |
…inWriter Following the design agreed in apache#3491, split the write path into: - DeletionVector serialization (domain): from_positions(), _serialize_bitmap() and to_blob() as the write-side counterparts of _deserialize_bitmap() and to_vector(), reusing MAX_JAVA_SIGNED / PROPERTY_REFERENCED_DATA_FILE. - A generic, blob-agnostic PuffinWriter (format) that assembles a Puffin file from PuffinBlob payloads, mirroring PuffinFile on the read side and scaling to future blob types (e.g. apache-datasketches-theta-v1). PuffinWriter is a context manager (consistent with ManifestWriter); the file size is available via len(output_file) after exit. The created-by default uses pyiceberg.__version__. Co-authored-by: Isaac
Rationale for this change
PuffinFile handles two tasks: format parsing (magic bytes, footer, blobs) and deletion vector domain logic (bitmap deserialization and PyArrow conversion).
This will become problematic when we introduce support for the NDV
apache-datasketches-theta-v1blob in the future.Are these changes tested?
Yes
Are there any user-facing changes?
Yes - PuffinFile class user needs to call DeletionVector.