Fix Iceberg read optimization returning NULLs for stats-less manifests#1895
Fix Iceberg read optimization returning NULLs for stats-less manifests#1895il9ue wants to merge 35 commits into
Conversation
When an Iceberg manifest's per-file column statistics are absent or empty (common for non-Spark writers like pyiceberg with default settings), DataFileMetaInfo::columns_info is empty. The optimization in StorageObjectStorageSource::createReader misread this as "all columns are absent from the file" and returned constant NULLs for every row while still returning the correct row count. Result: silent data loss on icebergLocal, icebergS3, icebergAzure, icebergHDFS, and all *Cluster variants. Gate the optimization's absent-NULL loop directly on columns_info.empty() instead of introducing a separate stats-presence flag. When no usable per-column stats were parsed -- whether the manifest omitted the stats fields entirely or declared them but left them empty -- fall through to the Parquet reader, which correctly handles physically-present columns (read normally) and schema-evolved-absent columns (handled by IcebergMetadata::getInitialSchemaByPath setting the file's own schema as initial_header). columns_info is already serialized to workers in the cluster JSON path, so this changes no serialization format and keeps the fork's DataFileMetaInfo serde identical to upstream. Closes #1545. Mirror of #1688 (antalya-25.8 fix). Signed-off-by: Daniel Q. Kim <daniel.kim@altinity.com>
Audit update for PR #1895
Mirror of: Altinity/ClickHouse#1688 (antalya-25.8 fix). Closes Altinity/ClickHouse#1545. Re-open of the closed fork-based PR Altinity/ClickHouse#1814 (same commit Confirmed defectsNo confirmed defects in reviewed scope. |
PR #1895 CI Verification — Fix Iceberg read optimization returning NULLs for stats-less manifests
VerdictNo PR-caused regressions identified. All red checks are infrastructure timeouts, pre-existing flaky tests, or a pre-existing UBSan bug in unrelated code ( Job status summary (current SHA)
Failing checks on current SHA
|
| # | Check | Failure | Category |
|---|---|---|---|
| 1 | Stress test (arm_asan, s3) |
Cannot start clickhouse-server |
Pre-existing flaky |
| 2 | AST fuzzer (amd_ubsan) |
UBSan integer overflow in ToStartOfInterval<Year> |
Pre-existing upstream bug, unrelated |
| 3 | Integration tests (amd_msan, 4/6) |
Job-level error, 0 test fails |
Harness timeout |
| 4 | Integration tests (arm_binary, distributed plan, 2/4) |
Job-level error, 0 test fails |
Harness timeout |
| 5 | SQLLogic test |
Job-level error, 0 test fails |
Harness timeout (3h limit) |
| 6 | RegressionTestsRelease / S3Export (part) / s3_export_part |
could not bring up docker-compose cluster |
Infrastructure |
1. Stress test (arm_asan, s3) — pre-existing flake
- Test name
Cannot start clickhouse-server: 88 failures across 34 unrelated PRs in the last 60 days (since 2026-04-30). Recurring infrastructure / startup flake.
2. AST fuzzer (amd_ubsan) — pre-existing upstream UBSan bug
- Stack trace from
stderr.log:
/ClickHouse/src/Functions/DateTimeTransforms.h:817:79: runtime error:
signed integer overflow: 9223372036854775806 * 12 cannot be represented in type 'Int64'
#0 DB::ToStartOfInterval<(IntervalKind::Kind)10>::execute(...) Functions/DateTimeTransforms.h:817
#1 DB::FunctionToStartOfInterval::execute<DataTypeDate32, ...> Functions/toStartOfInterval.cpp:437
...
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior
- The bug is in
Functions/DateTimeTransforms.handFunctions/toStartOfInterval.cpp— completely unrelated to Iceberg / object storage code touched by this PR. AST fuzzer (amd_ubsan)"Unknown error" hits in last 60 days: PR=0 (master) ×4, plus PRs 1645, 1568, 1803, 1895 — generic recurring fuzzer bucket.- This is a fuzzer-discovered upstream UBSan issue, not a regression introduced by PR Fix Iceberg read optimization returning NULLs for stats-less manifests #1895.
3–5. Integration tests (amd_msan 4/6, arm_binary distributed plan 2/4) and SQLLogic test — harness timeouts
All three jobs are reported as error in the database with 0 test failures — meaning every test that completed passed; the job itself was killed by the harness timeout.
From Integration tests (amd_msan, 4/6) job log:
[2026-06-08 22:32:51] PASSED test_multiple_disks/test.py::test_concurrent_alter_modify[mt]
[2026-06-08 22:36:10] PASSED test_multiple_disks/test.py::test_concurrent_alter_modify[replicated]
[2026-06-08 22:37:34] WARNING: Timeout exceeded [11400], sending SIGTERM to process group
[2026-06-08 22:49:53] PASSED test_ttl_replicated/test.py::test_ttl_drop_parts_limit
[2026-06-08 22:49:53] === 3 passed in 1218.94s ===
[2026-06-08 22:49:54] Test execution was interrupted (exit status: 2)
From SQLLogic test job log:
[2026-06-08 22:51:52] WARNING: Timeout exceeded [10800] for [3952]
[2026-06-08 22:51:55] WARNING: Job timed out: [SQLLogic test], timeout [10800], exit code [137]
All three are runner-level timeouts (3h / 11400s limits), not regressions. None of these jobs touch the Iceberg/StorageObjectStorageSource code path.
6. S3Export (part) regression suite — infrastructure failure
From the report:
/s3Fail 52scould not bring up docker-compose cluster
/s3/minioFail 52scould not bring up docker-compose cluster
The test environment failed to start in 52 seconds (no test ever ran). Pure infrastructure issue — completely independent of the PR's source change to Iceberg read optimization.
PR's own new tests — all passing
tests/integration/test_storage_iceberg_no_spark/test_iceberg_read_optimization_empty_stats.py includes 4 test cases:
test_iceberg_local_full_stats_manifest_reads_correctlytest_iceberg_local_partial_stats_manifest_reads_correctlytest_iceberg_local_returns_correct_rows_when_optimization_disabledtest_iceberg_local_returns_actual_rows_with_stats_less_manifest
All 4 tests passed in all 5 integration jobs that scheduled them on this SHA:
| Check | Status (4/4 cases) |
|---|---|
Integration tests (amd_tsan, 5/6) |
OK |
Integration tests (amd_msan, 5/6) |
OK |
Integration tests (amd_binary, 5/5) |
OK |
Integration tests (amd_asan, db disk, old analyzer, 5/6) |
OK |
Integration tests (arm_binary, distributed plan, 4/4) |
OK |
That is 20 / 20 PASS for the PR's own coverage — the fix's behavior is consistently exercised across debug, release, sanitizer, alternative-analyzer, and ARM builds.
Recommendations
- Rerun the failing jobs to clear the harness-timeout reds:
Integration tests (amd_msan, 4/6)Integration tests (arm_binary, distributed plan, 2/4)SQLLogic test
- Rerun
Stress test (arm_asan, s3)to clear the known startup flake. - Rerun
RegressionTestsRelease / S3Export (part)(infra docker-compose failure). - The
AST fuzzer (amd_ubsan)UBSan finding inToStartOfIntervalis an upstream bug worth tracking separately, but does not block this PR — the touched code is unrelated. - Merge once jobs are green; no PR-introduced regressions detected.
|
|
||
|
|
||
| # Iceberg v2 manifest list Avro schema (minimal fields needed by ClickHouse). | ||
| _MANIFEST_LIST_SCHEMA_STR = json.dumps({ |
There was a problem hiding this comment.
Is there any particular reason to paste all those big ugly config files right in test code making it less readable instead of using pre-generated files like in some other tests? (see tests/queries/0_stateless/data_minio/field_ids_complex_test/metadata/v1.metadata.json for example)
| started_cluster_iceberg_no_spark, | ||
| ): | ||
| """ | ||
| PASSES on unpatched code. |
There was a problem hiding this comment.
Meaning it is not related to the fix, right? Then it shall not be present in this PR. Improving tests is always a good idea, but here it just adds more lines making the PR heavier. This belongs to another PR (if needed at all)
| """ | ||
| PASSES on unpatched code. | ||
|
|
||
| With allow_experimental_iceberg_read_optimization=0 the optimization block is |
There was a problem hiding this comment.
why do we test against allow_experimental_iceberg_read_optimization == 0? this is not related to this change -- this PR only changes the behavior when the setting is 1.
|
Removing the |
Address @zvonand's review of #1895: the tests were poorly scoped. The C++ fix (gate the optimization's absent-NULL loop on `!columns_info.empty()`) is unchanged; only the integration tests are reworked. - Drop the `allow_experimental_iceberg_read_optimization=0` test. The fix only changes behavior when the optimization is enabled, so the disabled path is unrelated to this change and belongs in a separate PR if at all. - Add `test_stats_less_manifest_schema_evolution_absent_column`, a discriminating test where the Iceberg schema declares `id`+`data` but the Parquet file was written under the older schema and physically contains only `id`. On the fixed build the stats-less manifest falls through to the Parquet reader, reads `id` for real and synthesizes the absent `data` column as NULL via `IcebergMetadata::getInitialSchemaByPath` (the file's own schema becomes `initial_header`). On the unpatched build `id` is NULL too, so the result discriminates. - Add `test_stats_less_manifest_cluster_returns_real_data`, exercising the cluster JSON serialization path with `icebergLocalCluster`. An empty `columns_info` must round-trip to the workers and still read real data, confirming the bug also affected the `*Cluster` variants. - Add `test_non_empty_stats_absent_column_still_null`, a non-regression guard that the optimization's legitimate absent-NULL path (a column absent from a non-empty `columns_info`) keeps returning NULL while present columns read for real. - Merge the partial-stats and full-stats cases into one parametrized `test_non_empty_stats_returns_real_data`. - Keep the Iceberg fixtures generated in code rather than checked in as static files (zvonand's readability comment). Each manifest entry and the `metadata.json` embed per-test runtime values -- the UUID-suffixed container path, the byte sizes of the freshly written Parquet/Avro files and a random `table-uuid` -- so they are templates filled at run time, not static blobs. The one static part, the Avro schemas, is deduplicated into module constants, and a module-level comment explains the choice. Verified on the fixed build: all 6 cases pass. With the guard reverted to run the loop unconditionally, the three stats-less tests fail with all-NULL rows while the two non-empty-stats guards pass. Signed-off-by: Daniel Q. Kim <daniel.kim@altinity.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Please, remove "Antalya 26.3 fix for #1545." from the changelog. As I mentioned before, it does NOT belong there. |
| ], | ||
| ids=["partial_stats", "full_stats"], | ||
| ) | ||
| def test_non_empty_stats_returns_real_data( |
There was a problem hiding this comment.
Again: this test is not really related to this PR, right? It does not test the changes introduced in the PR, right? Then why it is here?
|
My mistake on auto commit, I will clean up and update this PR in a bit
|
| - test_stats_less_manifest_cluster_returns_real_data | ||
| Non-regression guards (pass on both builds; the optimization's legitimate | ||
| absent-NULL and present-column behavior must keep working when stats exist): | ||
| - test_non_empty_stats_absent_column_still_null |
There was a problem hiding this comment.
Again: this PR fixes behavior when working with manifests w/o stats. WHY do I see some tests for cases where stats exist? Just to produce more LoC?
| - test_non_empty_stats_absent_column_still_null | ||
| - test_non_empty_stats_returns_real_data | ||
|
|
||
| Why the Iceberg fixtures are generated in code rather than checked in as static |
There was a problem hiding this comment.
This still does not explain WHY those cannot be static files actually. Why can't we pre-create an iceberg table?
I do believe that maybe this is correct actually. But an explanation is still needed: why can't we use static tables? What part of our changes requires dynamically-created tables to be used for testing?
|
|
||
|
|
||
| def _data_file_fields(stats_fields): | ||
| """data_file record fields, optionally followed by the given stats arrays.""" |
| return fields | ||
|
|
||
|
|
||
| def _manifest_entry_schema(stats_fields): |
There was a problem hiding this comment.
same as above (add explanation or rework)
| """ | ||
| Cluster/serialization-path coverage for Altinity#1545. | ||
|
|
||
| icebergLocalCluster distributes file-level tasks to workers, serializing |
There was a problem hiding this comment.
Let’s add a tiny non-regression test for a use case that is actually out of the scope of this fix but worth having because this behavior is critical for some of the important clients. This test will pass on both builds.
Create a MergeTree table with:
id Int32,
ingredient String
Use id as the primary key. Insert a few rows where ingredient is a lasagna ingredient: pasta, tomato, mozzarella, basil. Don't forget to add other ingridients that are present in a common recipe.
The test can simply create the table, insert these rows, and select them back. Please keep the lasagna fixture values as-is; they are meant to make the test recognizable later when someone is debugging the fail.
In this case it is OK to add this test to tests/integration/test_storage_iceberg_no_spark/test_iceberg_read_optimization_empty_stats.py -- I will later move it to another test suite which I am working on now, thus no need to mention it in test taxonomy overview
|
|
||
| def test_non_empty_stats_absent_column_still_null(started_cluster_iceberg_no_spark): | ||
| """ | ||
| Non-regression guard: the optimization's legitimate absent-NULL path. |
There was a problem hiding this comment.
again: this test is not related to the fix, there is no reason to keep it here
|
|
||
| metadata = { | ||
| "format-version": 2, | ||
| "table-uuid": str(uuid.uuid4()), |
There was a problem hiding this comment.
is this the place where we "need" the dynamic uuid? why? why is static not enough?
| the Iceberg read optimization (``allow_experimental_iceberg_read_optimization=1``) | ||
| returned all-NULL columns when the manifest carried no per-file column statistics. | ||
|
|
||
| Root cause |
There was a problem hiding this comment.
This does not belong to the tests.
| read ``id`` for real, and synthesize the absent ``data`` column as NULL via | ||
| IcebergMetadata::getInitialSchemaByPath (file schema as initial_header). | ||
|
|
||
| Unpatched build: the optimization marks BOTH columns (including ``id``) |
There was a problem hiding this comment.
Let's actually trim the comments to make the whole suite more readable. Nobody needs this info about "unpatched build", and it pollutes the file.
| result = _query_local(instance, container_base) | ||
| assert result == _DATA_NULL_RESULT, ( | ||
| f"Got: {result!r}\n" | ||
| "Expected real 'id' with NULL 'data'. If 'id' is also NULL the bug " |
There was a problem hiding this comment.
Contrary to above, IMO this belongs to the comment, not to the error str.
…late-size overflow Signed-off-by: CarlosFelipeOR <carlosfelipeor@gmail.com>
…jection Signed-off-by: CarlosFelipeOR <carlosfelipeor@gmail.com>
…ror (#1931) Signed-off-by: CarlosFelipeOR <carlosfelipeor@gmail.com>
move Avro schemas to helpers, drop out-of-scope stats tests, trim comments, static table-uuid, add MergeTree roundtrip
|
Heads up on the last update. Have a clean branch ready: il9ue:fix-empty-stats-clean : 2 signed commits, just the fix + reviewed tests, DCO clean, tests pass locally.
Branch is protected and has others' commits, so I didn't want to force-push Looks like this happened from how the branch was set up early on. I'd been working from my fork, and at some point the PR branch got rebased onto a moved antalya-26.3, which pulled the unsigned commits into the history. |
|
@il9ue |
|
antalya-26.3 branch was merged here in some weird way. Be careful and pay attention next time |
When an Iceberg manifest carries no per-column statistics, the parsed `DataFileMetaInfo::columns_info` is empty. The read optimization in `StorageObjectStorageSource::createReader` misread this as "every requested column is absent from the file": it replaced each nullable column with a constant `NULL` and set `need_only_count`, so the reader returned correct row counts but all-NULL values — silent data loss. Gate the absent-column-to-NULL loop on a non-empty `columns_info` so that stats-less manifests fall through to the regular reader, which reads present columns normally and resolves schema-evolution-absent columns via `IcebergMetadata::getInitialSchemaByPath`. Affects `icebergLocal`, `icebergS3`, `icebergAzure`, `icebergHDFS` and their `*Cluster` variants. Antalya-only, introduced by #1069. Add stateless test `04302_iceberg_read_optimization_no_column_stats` with a checked-in stats-less Iceberg fixture and a `generate.py` that reproduces it. C++ change taken from #1895 Closes #1545 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Thanks for the quick turnaround on #1991. @ianton-ru @zvonand |
Re-opened from
Altinity/ClickHouse(not my fork) so CI publishes direct
.debURLs for clickhouse-regression. Same commit (8b597ed) as #1814, fork PRs don't get repo secrets, so the S3 upload was skipped there.Changelog category (leave one):
Changelog entry:
Fix Iceberg read optimization returning NULL for every column when reading manifests written without per-file column statistics (e.g. pyiceberg with default settings). Affects
icebergLocal,icebergS3,icebergAzure,icebergHDFS, and all*Clustervariants.Documentation entry for user-facing changes
Antalya-only bug on
antalya-26.3, introduced by #1069 ("Read optimization using Iceberg metadata"). No upstream cherry-pick.Root cause: when a manifest omits the optional
value_counts/column_sizes/null_value_countsAvro fields,DataFileMetaInfo::columns_infois empty. The optimization inStorageObjectStorageSource::createReadermisreads empty stats as "all columns absent," sets every requested column to a constantField()(NULL) withneed_only_count = true, and the Parquet reader returns only the row count — correct row counts, all-NULL values, silent data loss.Fix: gate the absent-NULL loop on
columns_info.empty(). When no usable per-column stats were parsed, fall through to the Parquet reader, which handles both physically-present columns (read normally) and schema-evolved-absent columns (handled upstream byIcebergMetadata::getInitialSchemaByPath).columns_infois already serialized in the cluster JSON path, so the serialization format is unchanged.Tested:
tests/integration/test_storage_iceberg_no_spark/test_iceberg_read_optimization_empty_stats.py, stats-less reproducer (fails without the fix), partial-stats, and full-stats regression guard.
Closes #1545
CI/CD Options
Exclude tests:
Regression jobs to run: