-
Notifications
You must be signed in to change notification settings - Fork 3k
API, Core: Introduce foundational types for V4 manifest support #15049
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| DATA(0), | ||
| POSITION_DELETES(1), | ||
| EQUALITY_DELETES(2); | ||
| EQUALITY_DELETES(2), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to make this API change so that TrackingInfo can return the contentType. This is backward compatible since this is a new enum value. If we don't want to make any API changes, I will need to change TrackingInfo.contentType() to return an int instead.
amogh-jahagirdar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a cursory review, still need to review in more depth when I get a chance but thank you for publishing this @anoopj !
| * | ||
| * <p>Used for applying manifest deletion vectors. | ||
| */ | ||
| Long pos(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a primitive long? as There would always be some non-null ordinal position
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor Preference: Would prefer to use the full name position()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to be consistent with the existing ContentFile.pos() API that returns a Long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that we will always project this.
Also, this is in TrackingInfo so I don't think we need it here.
| * Returns the tracking information for this entry. | ||
| * | ||
| * <p>Contains status, snapshot ID, sequence numbers, and first-row-id. Optional - may be null if | ||
| * tracking info is inherited. | ||
| */ | ||
| TrackingInfo trackingInfo(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry if I missed the discussion on the original PR, so looks like TrackingInfo has a manifestLocation and ordinal as well. I think the intent there is so that TrackingInfo has enough information self contained on it's structure so it can stand alone?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah just saw the top level comment in TrackingInfo, cool!
| * <p>Must be defined if location is defined. | ||
| */ | ||
| Long fileSizeInBytes(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a miss on the proposal but this would be required always no? And as a result primitive long?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I changed the proposal (left it as a suggestion). I think we had it optional when we had inline manifest DVs as a separate record, which is not the case anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with long.
| * @return a DataFile representation | ||
| * @throws IllegalStateException if content_type is not DATA | ||
| */ | ||
| DataFile asDataFile(PartitionSpec spec); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
according to #14533 (comment) this should probably not have any arguments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
| "referenced_file", | ||
| Types.StringType.get(), | ||
| "Location of data file that a DV references if content_type is POSITION_DELETES. " | ||
| + "Location of affiliated data manifest if content_type is DELETE_MANIFEST, or null if unaffiliated"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These descriptions should be shorter. This explains what the field is, but it doesn't need to document all idiosyncrasies.
| */ | ||
| interface TrackingInfo { | ||
| /** Status of an entry in a tracked file */ | ||
| enum Status { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had to copy this because it was in ManifestEntry. It's probably better to make this top-level to avoid that problem in the future.
| * | ||
| * <p>Must be defined if content_type is POSITION_DELETES, must be null otherwise. | ||
| */ | ||
| ContentInfo contentInfo(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a problem: I'm debating whether to have a separate struct for this. Primarily motivated by how awkward the name "content info" is. It's nice to have an optional struct with required fields, but that is a very minor benefit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am neutral on this. Can change this if you feel strongly.
| * | ||
| * <p>TODO: Define ContentStats structure per V4 proposal. | ||
| */ | ||
| Object contentStats(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a really important part of the implementation, so we should focus on getting this ready.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack. @nastra is working on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ContentStats is already in core: https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/stats/ContentStats.java#L25
But in order to use it here we would have to move it into the api module
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @nastra. Incorporated. We don't need to move it to api for now because this PR is currently against core.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah got it, great. Somehow I thought we're in the api module. All good then
|
Looking good overall, but I think we need to get |
These types follow the https://s.apache.org/iceberg-single-file-commit and will be used by subsequent PRs for manifest reading/writing. For now, we are adding these as package-private interfaces in core, and eventually we will move them into api. Changes API: - Add DATA_MANIFEST and DELETE_MANIFEST to FileContent enum for V4 manifest entry types Core: - TrackedFile - Unified V4 manifest entry representation (equivalent of ContentFile for V4) - TrackingInfo - Entry status, snapshot ID, sequence numbers, and first-row-id - ContentInfo - Metadata for externally stored content (deletion vectors) - ManifestStats - Manifest-level statistics (file/row counts, min sequence number)
47542d3 to
67c20b6
Compare
These types will be used for the v4 adaptive metadata tree and will be used by subsequent PRs for manifest reading/writing.
For now, we are adding these as package-private interfaces in core, and eventually we will move them into api.
Prototype PR: #14533
Changes
API:
Core: