Fix reading columns with dot-separated names from Iceberg #1319
base: antalya-25.8
@codex review
Codex Review: Didn't find any major issues. You're on a roll.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you open a pull request for review, mark a draft as ready, or comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d6b0a778de
/// Strip "current_path." prefix to get child name (preserves dots in child names)
std::string_view mapped = it->second;
if (mapped.starts_with(current_path) && mapped.size() > current_path.size()
    && mapped[current_path.size()] == '.')
    return mapped.substr(current_path.size() + 1);
Normalize case before prefix stripping
When case_insensitive_column_matching is enabled, node.name is rewritten to the query’s casing in processSubtree, so current_path can differ in case from the Iceberg-mapped name. The starts_with check here is case-sensitive; if the user queries a tuple/struct column with different case, the prefix won’t be stripped and child names get built with the full mapped path (e.g., mystruct.MyStruct.child), which then fails tryGetPositionByName and can yield missing tuple elements or errors. Consider normalizing current_path/mapped with the same case-insensitive rules before comparing.
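The prefix check discussed above can be tried in isolation. The following is a minimal sketch, not the PR's actual code: `stripParentPrefix` is a hypothetical free function, and returning `mapped` unchanged when the prefix does not match is an assumption, since the quoted diff does not show the fallback path. It uses `substr` comparison instead of `starts_with` only so that it also compiles as C++17.

```cpp
#include <string_view>

// Hypothetical helper for illustration; the name and the fallback behavior
// (return `mapped` unchanged) are assumptions, not taken from the PR.
// If `mapped` begins with `current_path` followed by '.', return the rest,
// keeping any dots that belong to the child name itself.
std::string_view stripParentPrefix(std::string_view mapped, std::string_view current_path)
{
    const bool has_prefix = mapped.size() > current_path.size()
        && mapped.substr(0, current_path.size()) == current_path  // case-sensitive comparison
        && mapped[current_path.size()] == '.';
    return has_prefix ? mapped.substr(current_path.size() + 1) : mapped;
}
```

With these assumptions, `stripParentPrefix("my.tuple.foo.bar", "my.tuple")` yields `foo.bar`, while a case mismatch such as `stripParentPrefix("MyStruct.child", "mystruct")` leaves the path unstripped, which is the situation the review comment describes.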
The setting input_format_parquet_case_insensitive_column_matching does not apply to ColumnMapper. It is used when column names in the ClickHouse table schema differ from what the Parquet file contains (see https://github.com/Altinity/ClickHouse/blob/antalya-25.8/tests/queries/0_stateless/03285_orc_arrow_parquet_tuple_field_matching.sh). ColumnMapper, by contrast, is part of the ClickHouse analyzer, which is always case-sensitive for identifiers.
That being said, this is what this setting affects:
# Int64 on write is `foo.bar`
:) INSERT INTO function file('dots.parquet', 'parquet', '`my.tuple` Tuple(`foo.bar` Int64, `bar.baz` String)') SELECT (1, '2') SETTINGS engine_file_truncate_on_insert = 1;
# Int64 on read is `Foo.bar`, case_insensitive_column_matching = 0, no value returned
:) SELECT `my.tuple.Foo.bar`
FROM file('dots.parquet', 'parquet', '`my.tuple` Tuple(`Foo.bar` Int64, `bar.baz` String)')
SETTINGS input_format_parquet_case_insensitive_column_matching = 0
Query id: 622d67b9-06e0-44a3-9c43-1229cbf4eea4
   ┌─my.tuple.Foo.bar─┐
1. │                0 │
   └──────────────────┘
# Int64 on read is `Foo.bar`, case_insensitive_column_matching = 1, a proper value returned
:) SELECT `my.tuple.Foo.bar`
FROM file('dots.parquet', 'parquet', '`my.tuple` Tuple(`Foo.bar` Int64, `bar.baz` String)')
SETTINGS input_format_parquet_case_insensitive_column_matching = 1
Query id: 8f5f3d83-380e-46d5-97d5-af94d4b5ed17
   ┌─my.tuple.Foo.bar─┐
1. │                1 │
   └──────────────────┘
The correct value here is 1. This is an example of reading it from the current antalya-25.8, not this branch.
Trying to SELECT my.tuple.Foo.bar when schema has my.tuple.foo.bar will always fail, because this setting doesn't affect what you're selecting vs what's in the schema. It affects what's in the schema vs what's in the file.
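To make the distinction concrete, here is a rough sketch of the schema-versus-file comparison that the setting toggles. It is not ClickHouse code: `equalsIgnoreCase` and `schemaMatchesFile` are hypothetical names, and ASCII-only case folding is an assumption made for brevity.

```cpp
#include <algorithm>
#include <cctype>
#include <string_view>

// Hypothetical helper: ASCII case-insensitive equality of two names.
bool equalsIgnoreCase(std::string_view a, std::string_view b)
{
    return a.size() == b.size()
        && std::equal(a.begin(), a.end(), b.begin(), [](unsigned char x, unsigned char y) {
               return std::tolower(x) == std::tolower(y);
           });
}

// Hypothetical model of the setting: it decides how a name declared in the
// table schema is matched against a name stored in the file. It plays no role
// in resolving the identifier written in the SELECT against the schema.
bool schemaMatchesFile(std::string_view schema_name, std::string_view file_name, bool case_insensitive)
{
    return case_insensitive ? equalsIgnoreCase(schema_name, file_name) : schema_name == file_name;
}
```

In this model, the schema name `Foo.bar` matches the file's `foo.bar` only when `case_insensitive` is true, which mirrors the two queries above returning 0 and 1 respectively.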
arthurpassos
left a comment
LGTM
Fixes #1301. Upstream PR: ClickHouse#94335
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fixes an issue where Iceberg columns with dots in their names returned NULL values.
CI/CD Options
Exclude tests:
Regression jobs to run: