Skip to content

Conversation

@anoopj
Copy link
Contributor

@anoopj anoopj commented Jan 14, 2026

These types will be used for the v4 adaptive metadata tree and will be used by subsequent PRs for manifest reading/writing.

For now, we are adding these as package-private interfaces in core, and eventually we will move them into api.

Prototype PR: #14533

Changes

API:

  • Add DATA_MANIFEST and DELETE_MANIFEST to FileContent enum for V4 manifest entry types

Core:

  • TrackedFile - Unified V4 manifest entry representation (equivalent of ContentFile for V4)
  • TrackingInfo - Entry status, snapshot ID, sequence numbers, and first-row-id
  • ContentInfo - Metadata for externally stored content (deletion vectors)
  • ManifestStats - Manifest-level statistics (file/row counts, min sequence number)

DATA(0),
POSITION_DELETES(1),
EQUALITY_DELETES(2);
EQUALITY_DELETES(2),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to make this API change so that TrackingInfo can return the contentType. This is backward compatible since this is a new enum value. If we don't want to make any API changes, I will need to change TrackingInfo.contentType() to return an int instead.

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a cursory review, still need to review in more depth when I get a chance but thank you for publishing this @anoopj !

*
* <p>Used for applying manifest deletion vectors.
*/
Long pos();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a primitive long? as There would always be some non-null ordinal position

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor Preference: Would prefer to use the full name position()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to be consistent with the existing ContentFile.pos() API that returns a Long.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that we will always project this.

Also, this is in TrackingInfo so I don't think we need it here.

Comment on lines +130 to +134
* Returns the tracking information for this entry.
*
* <p>Contains status, snapshot ID, sequence numbers, and first-row-id. Optional - may be null if
* tracking info is inherited.
*/
TrackingInfo trackingInfo();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if I missed the discussion on the original PR, so looks like TrackingInfo has a manifestLocation and ordinal as well. I think the intent there is so that TrackingInfo has enough information self contained on it's structure so it can stand alone?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah just saw the top level comment in TrackingInfo, cool!

Comment on lines 173 to 175
* <p>Must be defined if location is defined.
*/
Long fileSizeInBytes();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a miss on the proposal but this would be required always no? And as a result primitive long?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. I changed the proposal (left it as a suggestion). I think we had it optional when we had inline manifest DVs as a separate record, which is not the case anymore.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with long.

@nastra nastra changed the title Introduce foundational types for V4 manifest support API, Core: Introduce foundational types for V4 manifest support Jan 15, 2026
* @return a DataFile representation
* @throws IllegalStateException if content_type is not DATA
*/
DataFile asDataFile(PartitionSpec spec);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

according to #14533 (comment) this should probably not have any arguments

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

"referenced_file",
Types.StringType.get(),
"Location of data file that a DV references if content_type is POSITION_DELETES. "
+ "Location of affiliated data manifest if content_type is DELETE_MANIFEST, or null if unaffiliated");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These descriptions should be shorter. This explains what the field is, but it doesn't need to document all idiosyncrasies.

*/
interface TrackingInfo {
/** Status of an entry in a tracked file */
enum Status {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had to copy this because it was in ManifestEntry. It's probably better to make this top-level to avoid that problem in the future.

*
* <p>Must be defined if content_type is POSITION_DELETES, must be null otherwise.
*/
ContentInfo contentInfo();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a problem: I'm debating whether to have a separate struct for this. Primarily motivated by how awkward the name "content info" is. It's nice to have an optional struct with required fields, but that is a very minor benefit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am neutral on this. Can change this if you feel strongly.

*
* <p>TODO: Define ContentStats structure per V4 proposal.
*/
Object contentStats();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a really important part of the implementation, so we should focus on getting this ready.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. @nastra is working on this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ContentStats is already in core: https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/stats/ContentStats.java#L25
But in order to use it here we would have to move it into the api module

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @nastra. Incorporated. We don't need to move it to api for now because this PR is currently against core.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah got it, great. Somehow I thought we're in the api module. All good then

@rdblue
Copy link
Contributor

rdblue commented Feb 6, 2026

Looking good overall, but I think we need to get ContentStats defined and then make sure the translation to DataFile and DeleteFile is reasonable.

These types follow the https://s.apache.org/iceberg-single-file-commit
and will be used by subsequent PRs for manifest reading/writing.

For now, we are adding these as package-private interfaces in core, and
eventually we will move them into api.

Changes

API:
 - Add DATA_MANIFEST and DELETE_MANIFEST to FileContent enum for V4 manifest entry types

Core:
  - TrackedFile - Unified V4 manifest entry representation (equivalent of ContentFile for V4)
  - TrackingInfo - Entry status, snapshot ID, sequence numbers, and first-row-id
  - ContentInfo - Metadata for externally stored content (deletion vectors)
  - ManifestStats - Manifest-level statistics (file/row counts, min sequence number)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants