Skip to content

Fix Iceberg read optimization returning NULLs for stats-less manifests#1895

Open
il9ue wants to merge 35 commits into
antalya-26.3from
fix/iceberg-empty-stats-26.3
Open

Fix Iceberg read optimization returning NULLs for stats-less manifests#1895
il9ue wants to merge 35 commits into
antalya-26.3from
fix/iceberg-empty-stats-26.3

Conversation

@il9ue

@il9ue il9ue commented Jun 8, 2026

Copy link
Copy Markdown

Re-opened from Altinity/ClickHouse
(not my fork) so CI publishes direct .deb URLs for clickhouse-regression. Same commit (8b597ed) as #1814, fork PRs don't get repo secrets, so the S3 upload was skipped there.

Changelog category (leave one):

  • Bug Fix

Changelog entry:

Fix Iceberg read optimization returning NULL for every column when reading manifests written without per-file column statistics (e.g. pyiceberg with default settings). Affects icebergLocal, icebergS3, icebergAzure, icebergHDFS, and all *Cluster variants.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Antalya-only bug on antalya-26.3, introduced by #1069 ("Read optimization using Iceberg metadata"). No upstream cherry-pick.

Root cause: when a manifest omits the optional value_counts / column_sizes / null_value_counts Avro fields, DataFileMetaInfo::columns_info is empty. The optimization in StorageObjectStorageSource::createReader misreads empty stats as "all columns absent," sets every requested column to a constant Field() (NULL) with need_only_count = true, and the Parquet reader returns only the row count — correct row counts, all-NULL values, silent data loss.

Fix: gate the absent-NULL loop on columns_info.empty(). When no usable per-column stats were parsed, fall through to the Parquet reader, which handles both physically-present columns (read normally) and schema-evolved-absent columns (handled upstream by IcebergMetadata::getInitialSchemaByPath). columns_info is already serialized in the cluster JSON path, so the serialization format is unchanged.

Tested:

  • tests/integration/test_storage_iceberg_no_spark/test_iceberg_read_optimization_empty_stats.py
    , stats-less reproducer (fails without the fix), partial-stats, and full-stats regression guard.

Closes #1545


CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)

When an Iceberg manifest's per-file column statistics are absent or
empty (common for non-Spark writers like pyiceberg with default
settings), DataFileMetaInfo::columns_info is empty. The optimization
in StorageObjectStorageSource::createReader misread this as "all
columns are absent from the file" and returned constant NULLs for
every row while still returning the correct row count. Result: silent
data loss on icebergLocal, icebergS3, icebergAzure, icebergHDFS, and
all *Cluster variants.

Gate the optimization's absent-NULL loop directly on
columns_info.empty() instead of introducing a separate stats-presence
flag. When no usable per-column stats were parsed -- whether the
manifest omitted the stats fields entirely or declared them but left
them empty -- fall through to the Parquet reader, which correctly
handles physically-present columns (read normally) and
schema-evolved-absent columns (handled by
IcebergMetadata::getInitialSchemaByPath setting the file's own schema
as initial_header). columns_info is already serialized to workers in
the cluster JSON path, so this changes no serialization format and
keeps the fork's DataFileMetaInfo serde identical to upstream.

Closes #1545.
Mirror of #1688 (antalya-25.8 fix).

Signed-off-by: Daniel Q. Kim <daniel.kim@altinity.com>
@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown

Workflow [PR], commit [2bc7944]

ianton-ru
ianton-ru previously approved these changes Jun 8, 2026
@alsugiliazova

Copy link
Copy Markdown
Member

Audit update for PR #1895

AI audit note: This review comment was generated by AI (claude-opus-4.7).

Mirror of: Altinity/ClickHouse#1688 (antalya-25.8 fix). Closes Altinity/ClickHouse#1545. Re-open of the closed fork-based PR Altinity/ClickHouse#1814 (same commit 8b597ed / 5ec03a9), re-filed from a repo branch so CI secrets are available and the .deb package URLs are published for clickhouse-regression.

Confirmed defects

No confirmed defects in reviewed scope.

@alsugiliazova

Copy link
Copy Markdown
Member

PR #1895 CI Verification — Fix Iceberg read optimization returning NULLs for stats-less manifests

  • Head branch: fix/iceberg-empty-stats-26.3
  • Head SHA: 5ec03a9700f77ba6dd465e84b4775d4a7e4323e9
  • Verification date: 2026-06-11

Verdict

No PR-caused regressions identified. All red checks are infrastructure timeouts, pre-existing flaky tests, or a pre-existing UBSan bug in unrelated code (ToStartOfInterval integer overflow). The PR's own 4 new integration tests pass on every job that ran them across 5 build variants.

Job status summary (current SHA)

Status Count
SUCCESS 564
FAILURE 13 (jobs + rollup rows)
ERROR 3
SKIPPED 199 (build matrix exclusions)

Failing checks on current SHA 5ec03a97 (latest run 27143685017)

# Check Failure Category
1 Stress test (arm_asan, s3) Cannot start clickhouse-server Pre-existing flaky
2 AST fuzzer (amd_ubsan) UBSan integer overflow in ToStartOfInterval<Year> Pre-existing upstream bug, unrelated
3 Integration tests (amd_msan, 4/6) Job-level error, 0 test fails Harness timeout
4 Integration tests (arm_binary, distributed plan, 2/4) Job-level error, 0 test fails Harness timeout
5 SQLLogic test Job-level error, 0 test fails Harness timeout (3h limit)
6 RegressionTestsRelease / S3Export (part) / s3_export_part could not bring up docker-compose cluster Infrastructure

1. Stress test (arm_asan, s3) — pre-existing flake

  • Test name Cannot start clickhouse-server: 88 failures across 34 unrelated PRs in the last 60 days (since 2026-04-30). Recurring infrastructure / startup flake.

2. AST fuzzer (amd_ubsan) — pre-existing upstream UBSan bug

  • Stack trace from stderr.log:
/ClickHouse/src/Functions/DateTimeTransforms.h:817:79: runtime error:
  signed integer overflow: 9223372036854775806 * 12 cannot be represented in type 'Int64'
  #0  DB::ToStartOfInterval<(IntervalKind::Kind)10>::execute(...)  Functions/DateTimeTransforms.h:817
  #1  DB::FunctionToStartOfInterval::execute<DataTypeDate32, ...>  Functions/toStartOfInterval.cpp:437
  ...
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior
  • The bug is in Functions/DateTimeTransforms.h and Functions/toStartOfInterval.cpp — completely unrelated to Iceberg / object storage code touched by this PR.
  • AST fuzzer (amd_ubsan) "Unknown error" hits in last 60 days: PR=0 (master) ×4, plus PRs 1645, 1568, 1803, 1895 — generic recurring fuzzer bucket.
  • This is a fuzzer-discovered upstream UBSan issue, not a regression introduced by PR Fix Iceberg read optimization returning NULLs for stats-less manifests #1895.

3–5. Integration tests (amd_msan 4/6, arm_binary distributed plan 2/4) and SQLLogic test — harness timeouts

All three jobs are reported as error in the database with 0 test failures — meaning every test that completed passed; the job itself was killed by the harness timeout.

From Integration tests (amd_msan, 4/6) job log:

[2026-06-08 22:32:51] PASSED test_multiple_disks/test.py::test_concurrent_alter_modify[mt]
[2026-06-08 22:36:10] PASSED test_multiple_disks/test.py::test_concurrent_alter_modify[replicated]
[2026-06-08 22:37:34] WARNING: Timeout exceeded [11400], sending SIGTERM to process group
[2026-06-08 22:49:53] PASSED test_ttl_replicated/test.py::test_ttl_drop_parts_limit
[2026-06-08 22:49:53] === 3 passed in 1218.94s ===
[2026-06-08 22:49:54] Test execution was interrupted (exit status: 2)

From SQLLogic test job log:

[2026-06-08 22:51:52] WARNING: Timeout exceeded [10800] for [3952]
[2026-06-08 22:51:55] WARNING: Job timed out: [SQLLogic test], timeout [10800], exit code [137]

All three are runner-level timeouts (3h / 11400s limits), not regressions. None of these jobs touch the Iceberg/StorageObjectStorageSource code path.

6. S3Export (part) regression suite — infrastructure failure

From the report:

/s3 Fail 52s could not bring up docker-compose cluster
/s3/minio Fail 52s could not bring up docker-compose cluster

The test environment failed to start in 52 seconds (no test ever ran). Pure infrastructure issue — completely independent of the PR's source change to Iceberg read optimization.

PR's own new tests — all passing

tests/integration/test_storage_iceberg_no_spark/test_iceberg_read_optimization_empty_stats.py includes 4 test cases:

  • test_iceberg_local_full_stats_manifest_reads_correctly
  • test_iceberg_local_partial_stats_manifest_reads_correctly
  • test_iceberg_local_returns_correct_rows_when_optimization_disabled
  • test_iceberg_local_returns_actual_rows_with_stats_less_manifest

All 4 tests passed in all 5 integration jobs that scheduled them on this SHA:

Check Status (4/4 cases)
Integration tests (amd_tsan, 5/6) OK
Integration tests (amd_msan, 5/6) OK
Integration tests (amd_binary, 5/5) OK
Integration tests (amd_asan, db disk, old analyzer, 5/6) OK
Integration tests (arm_binary, distributed plan, 4/4) OK

That is 20 / 20 PASS for the PR's own coverage — the fix's behavior is consistently exercised across debug, release, sanitizer, alternative-analyzer, and ARM builds.

Recommendations

  1. Rerun the failing jobs to clear the harness-timeout reds:
    • Integration tests (amd_msan, 4/6)
    • Integration tests (arm_binary, distributed plan, 2/4)
    • SQLLogic test
  2. Rerun Stress test (arm_asan, s3) to clear the known startup flake.
  3. Rerun RegressionTestsRelease / S3Export (part) (infra docker-compose failure).
  4. The AST fuzzer (amd_ubsan) UBSan finding in ToStartOfInterval is an upstream bug worth tracking separately, but does not block this PR — the touched code is unrelated.
  5. Merge once jobs are green; no PR-introduced regressions detected.

@alsugiliazova alsugiliazova added the verified Approved for release label Jun 11, 2026
@svb-alt svb-alt requested a review from zvonand June 15, 2026 15:46


# Iceberg v2 manifest list Avro schema (minimal fields needed by ClickHouse).
_MANIFEST_LIST_SCHEMA_STR = json.dumps({

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any particular reason to paste all those big ugly config files right in test code making it less readable instead of using pre-generated files like in some other tests? (see tests/queries/0_stateless/data_minio/field_ids_complex_test/metadata/v1.metadata.json for example)

started_cluster_iceberg_no_spark,
):
"""
PASSES on unpatched code.

@zvonand zvonand Jun 15, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meaning it is not related to the fix, right? Then it shall not be present in this PR. Improving tests is always a good idea, but here it just adds more lines making the PR heavier. This belongs to another PR (if needed at all)

"""
PASSES on unpatched code.

With allow_experimental_iceberg_read_optimization=0 the optimization block is

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we test against allow_experimental_iceberg_read_optimization == 0? this is not related to this change -- this PR only changes the behavior when the setting is 1.

@zvonand zvonand removed the verified Approved for release label Jun 15, 2026
@zvonand

zvonand commented Jun 15, 2026

Copy link
Copy Markdown
Member

Removing the verified label: the C++ code is fine, but tests need fixing (they are poorly designed and partially meaningless). This PR needs adjustments before it can be merged.

il9ue and others added 2 commits June 22, 2026 11:37
Address @zvonand's review of #1895: the tests were poorly scoped. The C++
fix (gate the optimization's absent-NULL loop on `!columns_info.empty()`) is
unchanged; only the integration tests are reworked.

- Drop the `allow_experimental_iceberg_read_optimization=0` test. The fix
  only changes behavior when the optimization is enabled, so the disabled
  path is unrelated to this change and belongs in a separate PR if at all.

- Add `test_stats_less_manifest_schema_evolution_absent_column`, a
  discriminating test where the Iceberg schema declares `id`+`data` but the
  Parquet file was written under the older schema and physically contains
  only `id`. On the fixed build the stats-less manifest falls through to the
  Parquet reader, reads `id` for real and synthesizes the absent `data`
  column as NULL via `IcebergMetadata::getInitialSchemaByPath` (the file's
  own schema becomes `initial_header`). On the unpatched build `id` is NULL
  too, so the result discriminates.

- Add `test_stats_less_manifest_cluster_returns_real_data`, exercising the
  cluster JSON serialization path with `icebergLocalCluster`. An empty
  `columns_info` must round-trip to the workers and still read real data,
  confirming the bug also affected the `*Cluster` variants.

- Add `test_non_empty_stats_absent_column_still_null`, a non-regression
  guard that the optimization's legitimate absent-NULL path (a column absent
  from a non-empty `columns_info`) keeps returning NULL while present columns
  read for real.

- Merge the partial-stats and full-stats cases into one parametrized
  `test_non_empty_stats_returns_real_data`.

- Keep the Iceberg fixtures generated in code rather than checked in as
  static files (zvonand's readability comment). Each manifest entry and the
  `metadata.json` embed per-test runtime values -- the UUID-suffixed
  container path, the byte sizes of the freshly written Parquet/Avro files
  and a random `table-uuid` -- so they are templates filled at run time, not
  static blobs. The one static part, the Avro schemas, is deduplicated into
  module constants, and a module-level comment explains the choice.

Verified on the fixed build: all 6 cases pass. With the guard reverted to
run the loop unconditionally, the three stats-less tests fail with all-NULL
rows while the two non-empty-stats guards pass.

Signed-off-by: Daniel Q. Kim <daniel.kim@altinity.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@zvonand

zvonand commented Jun 22, 2026

Copy link
Copy Markdown
Member

Please, remove "Antalya 26.3 fix for #1545." from the changelog. As I mentioned before, it does NOT belong there.

],
ids=["partial_stats", "full_stats"],
)
def test_non_empty_stats_returns_real_data(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again: this test is not really related to this PR, right? It does not test the changes introduced in the PR, right? Then why it is here?

@il9ue

il9ue commented Jun 22, 2026

Copy link
Copy Markdown
Author

My mistake on auto commit, I will clean up and update this PR in a bit

Please, remove "Antalya 26.3 fix for #1545." from the changelog. As I mentioned before, it does NOT belong there.

- test_stats_less_manifest_cluster_returns_real_data
Non-regression guards (pass on both builds; the optimization's legitimate
absent-NULL and present-column behavior must keep working when stats exist):
- test_non_empty_stats_absent_column_still_null

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again: this PR fixes behavior when working with manifests w/o stats. WHY do I see some tests for cases where stats exist? Just to produce more LoC?

- test_non_empty_stats_absent_column_still_null
- test_non_empty_stats_returns_real_data

Why the Iceberg fixtures are generated in code rather than checked in as static

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still does not explain WHY those cannot be static files actually. Why can't we pre-create an iceberg table?

I do believe that maybe this is correct actually. But an explanation is still needed: why can't we use static tables? What part of our changes requires dynamically-created tables to be used for testing?



def _data_file_fields(stats_fields):
"""data_file record fields, optionally followed by the given stats arrays."""

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

return fields


def _manifest_entry_schema(stats_fields):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above (add explanation or rework)

"""
Cluster/serialization-path coverage for Altinity#1545.

icebergLocalCluster distributes file-level tasks to workers, serializing

@zvonand zvonand Jun 28, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let’s add a tiny non-regression test for a use case that is actually out of the scope of this fix but worth having because this behavior is critical for some of the important clients. This test will pass on both builds.

Create a MergeTree table with:

id Int32,

ingredient String

Use id as the primary key. Insert a few rows where ingredient is a lasagna ingredient: pasta, tomato, mozzarella, basil. Don't forget to add other ingridients that are present in a common recipe.

The test can simply create the table, insert these rows, and select them back. Please keep the lasagna fixture values as-is; they are meant to make the test recognizable later when someone is debugging the fail.

In this case it is OK to add this test to tests/integration/test_storage_iceberg_no_spark/test_iceberg_read_optimization_empty_stats.py -- I will later move it to another test suite which I am working on now, thus no need to mention it in test taxonomy overview


def test_non_empty_stats_absent_column_still_null(started_cluster_iceberg_no_spark):
"""
Non-regression guard: the optimization's legitimate absent-NULL path.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again: this test is not related to the fix, there is no reason to keep it here


metadata = {
"format-version": 2,
"table-uuid": str(uuid.uuid4()),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this the place where we "need" the dynamic uuid? why? why is static not enough?

the Iceberg read optimization (``allow_experimental_iceberg_read_optimization=1``)
returned all-NULL columns when the manifest carried no per-file column statistics.

Root cause

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not belong to the tests.

read ``id`` for real, and synthesize the absent ``data`` column as NULL via
IcebergMetadata::getInitialSchemaByPath (file schema as initial_header).

Unpatched build: the optimization marks BOTH columns (including ``id``)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's actually trim the comments to make the whole suite more readable. Nobody needs this info about "unpatched build", and it pollutes the file.

result = _query_local(instance, container_base)
assert result == _DATA_NULL_RESULT, (
f"Got: {result!r}\n"
"Expected real 'id' with NULL 'data'. If 'id' is also NULL the bug "

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contrary to above, IMO this belongs to the comment, not to the error str.

@il9ue

il9ue commented Jun 30, 2026

Copy link
Copy Markdown
Author

Heads up on the last update.
fix/iceberg-empty-stats-26.3 pulled in ~30 unsigned commits from others (zvonand, strtgbb, arthurpassos) when antalya-26.3 got merged after a rebase, so DCO is failing and the commit list is bloated. The file diff is still clean and it's mergeable.

Have a clean branch ready: il9ue:fix-empty-stats-clean : 2 signed commits, just the fix + reviewed tests, DCO clean, tests pass locally.
Three ways to land it, your call:

  1. Squash-merge Fix Iceberg read optimization returning NULLs for stats-less manifests #1895 (I'll sign the squash)
  2. Reset this branch to my clean commit (keeps the PR + threads)
  3. Fresh PR from the clean branch

Branch is protected and has others' commits, so I didn't want to force-push
@arthurpassos @ianton-ru

Looks like this happened from how the branch was set up early on. I'd been working from my fork, and at some point the PR branch got rebased onto a moved antalya-26.3, which pulled the unsigned commits into the history.

@ianton-ru

ianton-ru commented Jun 30, 2026

Copy link
Copy Markdown

@il9ue
I guess the best way is an option 3 - close this PR and create another fresh.

@zvonand

zvonand commented Jun 30, 2026

Copy link
Copy Markdown
Member

antalya-26.3 branch was merged here in some weird way. Be careful and pay attention next time

zvonand added a commit that referenced this pull request Jun 30, 2026
When an Iceberg manifest carries no per-column statistics, the parsed
`DataFileMetaInfo::columns_info` is empty. The read optimization in
`StorageObjectStorageSource::createReader` misread this as "every requested
column is absent from the file": it replaced each nullable column with a
constant `NULL` and set `need_only_count`, so the reader returned correct row
counts but all-NULL values — silent data loss.

Gate the absent-column-to-NULL loop on a non-empty `columns_info` so that
stats-less manifests fall through to the regular reader, which reads present
columns normally and resolves schema-evolution-absent columns via
`IcebergMetadata::getInitialSchemaByPath`.

Affects `icebergLocal`, `icebergS3`, `icebergAzure`, `icebergHDFS` and their
`*Cluster` variants. Antalya-only, introduced by #1069.

Add stateless test `04302_iceberg_read_optimization_no_column_stats` with a
checked-in stats-less Iceberg fixture and a `generate.py` that reproduces it.

C++ change taken from #1895
Closes #1545

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@il9ue

il9ue commented Jul 1, 2026

Copy link
Copy Markdown
Author

Thanks for the quick turnaround on #1991. @ianton-ru @zvonand
The static fixture is cleaner than what I had, and noted on the base merge, that one's on me.
Superseded by #1991, so I'll close this in favor of it unless you'd rather keep it open until #1991 lands.
Please let me know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Read using IcebergLocal returns Nulls instead of actual rows

8 participants