[Enhancement](udf) Support volatility property for scalar UDF#62698
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
|
/review |
|
OpenCode automated review failed and did not complete. Error: Review step was skipped (possibly timeout or cancelled) Please inspect the workflow logs and rerun the review after the underlying issue is resolved. |
FE UT Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
f2edf43 to
3678922
Compare
|
run buildall |
|
/review |
There was a problem hiding this comment.
I found blocking correctness issues.
-
Blanket
isDeterministic() == falsefor all Java/Python UDF/UDAF/UDTF classes is too broad. The modified regression coverage uses deterministic helpers (IntTest,MySumInt,FloatTest,float_test.py), so this change now rejects pure external UDFs from MV/MTMV unless users opt intoenable_nondeterministic_function=trueand stops existing MV rewrite coverage from applying. -
The optimizer fix is incomplete. Several rewrite paths that can duplicate or relocate expressions still key off
containsUniqueFunction()rather than!isDeterministic(), so external UDFs can still be evaluated multiple times or moved incorrectly. A simplefilter(project(udf(...) as a))case is still vulnerable. -
The bundled BE fix correctly handles unaligned
Decimal256Arrow reads, but the adjacentTYPE_DECIMALV2Arrow branch still performs the same unsafereinterpret_castload and keeps the misalignment UB for DECIMALV2.
Critical checkpoints:
- Goal / correctness: The PR fixes the specific CTE nondeterminism example, but not the end-to-end optimizer correctness problem and it broadens semantics for all external UDFs.
- Scope / minimality: Not minimal; it changes default behavior for every Java/Python UDF/UDAF/UDTF and rewrites existing deterministic-UDF MV tests around that change.
- Concurrency / lifecycle: No new concurrency or lifecycle issue found in the touched code.
- Config / compatibility: No new config or compatibility issue found.
- Parallel paths: Not all relevant rewrite paths were updated.
- Tests: Added CTE/MTMV coverage is useful, but the modified float/java tests now encode the broader regression for deterministic UDFs and there is no regression case for the remaining filter/project duplication path.
- Observability / transaction / FE-BE variable passing: Not applicable in this diff.
- Performance: Blanket nondeterminism disables valid optimizations for pure UDFs.
No additional user-provided focus was supplied.
3678922 to
f66187b
Compare
|
run buildall |
|
/review |
FE Regression Coverage ReportIncrement line coverage |
There was a problem hiding this comment.
Additional blockers beyond the existing inline threads:
- Pre-upgrade UDF metadata is not backward-compatible with the new persisted
deterministicfield. Old Java/Python UDFs/UDAFs/UDTFs replay as nondeterministic after upgrade, which silently changesSHOW CREATE FUNCTIONoutput and can strand existing MTMVs inSCHEMA_CHANGEwhenensureMTMVQueryUsable()re-analyzes the stored query. - The new per-UDF
deterministicproperty is still not honored in cache paths. Deterministic external UDFs remain blanket-ineligible for SQL cache and fragment query cache because those paths still special-caseUdf/AliasFunctioninstead of consultingisDeterministic().
Critical checkpoints:
- Goal: Partially met. The PR plumbs a user-facing determinism flag through create/show/Nereids translation and adds regression coverage for CTE/MTMV scenarios, but it does not preserve compatibility for existing persisted UDFs and it does not propagate the flag to cache consumers.
- Scope/focus: The code changes are localized, but the end-to-end contract of the new property is incomplete.
- Concurrency: No new lock-order or thread-safety issue found in the touched code; the blocking problems are replay/analysis-path correctness issues.
- Lifecycle/static init: No special lifecycle or static initialization issue found.
- Config/properties: A new user-visible UDF property (
deterministic) is added and parsed, but not all relevant consumers honor it. - Compatibility/persistence: Blocking issue.
FunctionJSON persistence changed without a backward-compatible upgrade path for existing function metadata. - Parallel code paths: Blocking issue. Cache/query-normalization paths still bypass the new flag.
- Special conditional checks: No additional issue beyond the above.
- Test coverage: Added DDL/CTE/MV coverage is useful, but there is no coverage for upgrade/replay compatibility or SQL/query cache behavior.
- Test result files: The new
.outfiles look consistent with the added regression cases. - Observability: No new observability gap found for this change.
- Transaction/persistence/data writes: Metadata persistence is affected; the replay compatibility issue above is the main blocker.
- FE-BE variable passing: Not applicable here.
- Performance: No material performance regression found in the touched code.
- Other issues: No additional distinct blocker beyond the two comments here and the already-open review threads.
User focus: none provided.
FE Regression Coverage ReportIncrement line coverage |
f66187b to
9d73507
Compare
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
|
/review |
There was a problem hiding this comment.
I am requesting changes based on the already-open inline review threads; I did not add duplicate inline comments for the same issues.
Critical checkpoint conclusions:
- Goal/test coverage: The PR aims to add an explicit deterministic property for Java/Python UDF, UDAF, and UDTF handling. The property is plumbed through catalog/Nereids classes and new tests exist, but existing review threads identify unresolved end-to-end gaps in cache/planner behavior and optimizer rewrites.
- Scope/focus: The code changes are mostly localized, but the behavior change is broad because UDF determinism affects MV/MTMV eligibility, SQL cache, fragment query cache, CTE inlining, and optimizer rewrites.
- Concurrency/lifecycle: I did not find new lock ordering, shared mutable state, or static initialization risks in this diff.
- Configuration: No new config item is added; this is a persisted function property.
- Compatibility/persistence: Existing review context already flags that old persisted functions lack the new field and may be replayed with a changed default, which is a blocking compatibility concern.
- Parallel paths: Java/Python scalar, aggregate, and table-function wrappers are touched, but existing review context already flags remaining parallel paths that still do not respect the new determinism bit.
- Conditional checks/error handling: Boolean property parsing follows existing property parsing style; I did not find an additional distinct error-handling issue.
- Tests/results: Regression and FE unit coverage were added, with generated
.outfiles present. Coverage still does not close the existing end-to-end gaps noted in the inline threads. - Observability/transaction/data-write/FE-BE protocol: No new observability need, data write path, or FE-BE protocol field was identified beyond persisted function metadata.
- Performance: I did not find an additional distinct performance regression in the changed code.
User focus: No additional user-provided review focus was specified. I reviewed the full PR with extra attention to determinism propagation and did not find a new non-duplicate issue beyond the existing review threads.
|
I think a boolean
So for Python UDF we probably need to identify which of these cases we want to support. The current deterministic true/false split can distinguish immutable from non-immutable, but it cannot distinguish stable functions from volatile functions, while optimizer rules need different behavior for those two cases. |
9d73507 to
2bbd937
Compare
|
/review |
There was a problem hiding this comment.
Review completed. I am not adding duplicate inline comments because the remaining blocking concern is already covered by the existing thread on CreateFunctionCommand: the new UDF volatility flag is still not effective end-to-end for SQL cache / fragment query cache eligibility. In the current head, ExpressionAnalyzer still marks any Udf/AliasUdfBuilder as cannotProcessExpression, and PhysicalPlanTranslator.hasUndeterministicExpression() still returns true for any Udf.class/AliasFunction.class before relying on isDeterministic(), so a scalar UDF declared immutable/volatility=immutable is still treated as uncacheable in those paths.
Critical checkpoint conclusions:
- Goal/test: The PR aims to let users classify scalar Java/Python UDF volatility and protect optimizer rewrites for volatile calls. The current tests cover construction/equality and some rewrite behavior, but they do not prove the cache-related advertised deterministic behavior end-to-end.
- Scope/focus: The implementation is mostly focused on Nereids UDF volatility plumbing, though several end-to-end paths still need alignment.
- Concurrency/lifecycle: No new concurrent mutable state or non-obvious lifecycle issue found in the reviewed FE changes.
- Configuration/compatibility: No new config item. Persisted Function compatibility now has a null/default guard for old metadata.
- Parallel paths: One important parallel path remains incomplete: SQL cache and fragment query cache still use UDF-class checks that bypass the new volatility semantics.
- Special conditions: The VOLATILE identity logic is localized and guarded; no additional distinct issue found there.
- Test coverage: Missing cache eligibility tests for immutable UDFs and negative tests for volatile UDFs through the cache decision paths.
- Observability/transactions/data writes: Not applicable for this FE optimizer/catalog metadata change beyond normal function persistence.
- FE/BE variable passing: Runtime execution does not appear to require BE volatility propagation; this is FE planning metadata.
User focus: No additional user-provided review focus was specified.
|
run buildall |
|
run buildall |
TPC-H: Total hot run time: 29197 ms |
TPC-DS: Total hot run time: 170621 ms |
FE Regression Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
|
run buildall |
TPC-H: Total hot run time: 30926 ms |
TPC-DS: Total hot run time: 169275 ms |
FE Regression Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
|
/review |
There was a problem hiding this comment.
I reviewed the GitHub PR diff and did not find an additional distinct issue that is not already covered by the existing inline review threads. The previously raised UDF volatility/replay/cache compatibility threads remain the relevant blockers to resolve before approval.\n\nCheckpoint conclusions:\n- Goal/test: the PR adds UDF volatility metadata and optimizer safeguards with unit/regression coverage; existing threads cover remaining replay and end-to-end effectiveness gaps.\n- Scope/focus: the current diff is focused on FE UDF volatility and affected rewrites/tests.\n- Concurrency/lifecycle/config: no new concurrency, lifecycle, or dynamic-config issue found in the current GitHub diff.\n- Compatibility/persistence: existing inline threads already cover persistence/replay compatibility concerns; I did not duplicate them.\n- Parallel paths: reviewed changed Nereids rewrite guards and UDF builder paths; no additional distinct missed path found beyond existing comments.\n- Tests/results: added tests cover volatility parsing, SHOW CREATE replay, and selected rewrite behavior, with prior comments already covering failing/missing cases.\n- Observability/performance/data correctness: no additional distinct issue found in the current diff.\n- User focus: no additional user-provided review focus was specified.
|
PR approved by at least one committer and no changes requested. |
…#62698) Problem Summary: Previously, UDFs could be treated as deterministic in optimizer-related paths, which is unsafe for UDFs whose results are not stable across evaluations. That may cause invalid rewrite/planning decisions and lead to incorrect query semantics in some cases. Introduce `immutable`, `stable`, and `volatile` semantics through `"volatility" = "immutable|stable|volatile"`, persist the property in function metadata, and use it to drive deterministic and volatile-expression behavior in Nereids. Immutable UDFs are treated as deterministic, stable UDFs avoid volatile identity handling while remaining non-deterministic, and volatile UDFs receive per-call volatile identities to protect unsafe rewrites. ```sql CREATE TABLE cte_uuid_seed (id INT) ENGINE=OLAP DUPLICATE KEY(id) DISTRIBUTED BY HASH(id) BUCKETS 1 PROPERTIES ("replication_num" = "1"); INSERT INTO cte_uuid_seed VALUES (1),(2),(3); DROP FUNCTION IF EXISTS py_uuid_token(INT); CREATE FUNCTION py_uuid_token(INT) RETURNS STRING PROPERTIES ( "type" = "PYTHON_UDF", "symbol" = "py_uuid_token_impl", "always_nullable" = "false", "runtime_version" = "3.12.11" ) AS $$ import uuid def py_uuid_token_impl(x): return f"{x}-{uuid.uuid4()}" $$; ``` before: ```sql SET enable_cte_materialize = true; SET inline_cte_referenced_threshold = 10; -- treated as volatile func(UniqueFunction), which caused wrong planning WITH cte AS (SELECT id, py_uuid_token(id) AS token FROM cte_uuid_seed) SELECT id, COUNT(DISTINCT token) AS distinct_tokens FROM (SELECT id, token FROM cte UNION ALL SELECT id, token FROM cte) u GROUP BY id ORDER BY id; +------+-----------------+ | id | distinct_tokens | +------+-----------------+ | 1 | 2 | | 2 | 2 | | 3 | 2 | +------+-----------------+ ``` now ```sql +------+-----------------+ | id | distinct_tokens | +------+-----------------+ | 1 | 1 | | 2 | 1 | | 3 | 1 | +------+-----------------+ ``` doc: apache/doris-website#3570
…#62698) Problem Summary: Previously, UDFs could be treated as deterministic in optimizer-related paths, which is unsafe for UDFs whose results are not stable across evaluations. That may cause invalid rewrite/planning decisions and lead to incorrect query semantics in some cases. Introduce `immutable`, `stable`, and `volatile` semantics through `"volatility" = "immutable|stable|volatile"`, persist the property in function metadata, and use it to drive deterministic and volatile-expression behavior in Nereids. Immutable UDFs are treated as deterministic, stable UDFs avoid volatile identity handling while remaining non-deterministic, and volatile UDFs receive per-call volatile identities to protect unsafe rewrites. ```sql CREATE TABLE cte_uuid_seed (id INT) ENGINE=OLAP DUPLICATE KEY(id) DISTRIBUTED BY HASH(id) BUCKETS 1 PROPERTIES ("replication_num" = "1"); INSERT INTO cte_uuid_seed VALUES (1),(2),(3); DROP FUNCTION IF EXISTS py_uuid_token(INT); CREATE FUNCTION py_uuid_token(INT) RETURNS STRING PROPERTIES ( "type" = "PYTHON_UDF", "symbol" = "py_uuid_token_impl", "always_nullable" = "false", "runtime_version" = "3.12.11" ) AS $$ import uuid def py_uuid_token_impl(x): return f"{x}-{uuid.uuid4()}" $$; ``` before: ```sql SET enable_cte_materialize = true; SET inline_cte_referenced_threshold = 10; -- treated as volatile func(UniqueFunction), which caused wrong planning WITH cte AS (SELECT id, py_uuid_token(id) AS token FROM cte_uuid_seed) SELECT id, COUNT(DISTINCT token) AS distinct_tokens FROM (SELECT id, token FROM cte UNION ALL SELECT id, token FROM cte) u GROUP BY id ORDER BY id; +------+-----------------+ | id | distinct_tokens | +------+-----------------+ | 1 | 2 | | 2 | 2 | | 3 | 2 | +------+-----------------+ ``` now ```sql +------+-----------------+ | id | distinct_tokens | +------+-----------------+ | 1 | 1 | | 2 | 1 | | 3 | 1 | +------+-----------------+ ``` doc: apache/doris-website#3570
…#62698) Problem Summary: Previously, UDFs could be treated as deterministic in optimizer-related paths, which is unsafe for UDFs whose results are not stable across evaluations. That may cause invalid rewrite/planning decisions and lead to incorrect query semantics in some cases. Introduce `immutable`, `stable`, and `volatile` semantics through `"volatility" = "immutable|stable|volatile"`, persist the property in function metadata, and use it to drive deterministic and volatile-expression behavior in Nereids. Immutable UDFs are treated as deterministic, stable UDFs avoid volatile identity handling while remaining non-deterministic, and volatile UDFs receive per-call volatile identities to protect unsafe rewrites. ```sql CREATE TABLE cte_uuid_seed (id INT) ENGINE=OLAP DUPLICATE KEY(id) DISTRIBUTED BY HASH(id) BUCKETS 1 PROPERTIES ("replication_num" = "1"); INSERT INTO cte_uuid_seed VALUES (1),(2),(3); DROP FUNCTION IF EXISTS py_uuid_token(INT); CREATE FUNCTION py_uuid_token(INT) RETURNS STRING PROPERTIES ( "type" = "PYTHON_UDF", "symbol" = "py_uuid_token_impl", "always_nullable" = "false", "runtime_version" = "3.12.11" ) AS $$ import uuid def py_uuid_token_impl(x): return f"{x}-{uuid.uuid4()}" $$; ``` before: ```sql SET enable_cte_materialize = true; SET inline_cte_referenced_threshold = 10; -- treated as volatile func(UniqueFunction), which caused wrong planning WITH cte AS (SELECT id, py_uuid_token(id) AS token FROM cte_uuid_seed) SELECT id, COUNT(DISTINCT token) AS distinct_tokens FROM (SELECT id, token FROM cte UNION ALL SELECT id, token FROM cte) u GROUP BY id ORDER BY id; +------+-----------------+ | id | distinct_tokens | +------+-----------------+ | 1 | 2 | | 2 | 2 | | 3 | 2 | +------+-----------------+ ``` now ```sql +------+-----------------+ | id | distinct_tokens | +------+-----------------+ | 1 | 1 | | 2 | 1 | | 3 | 1 | +------+-----------------+ ``` doc: apache/doris-website#3570
…#62698) Problem Summary: Previously, UDFs could be treated as deterministic in optimizer-related paths, which is unsafe for UDFs whose results are not stable across evaluations. That may cause invalid rewrite/planning decisions and lead to incorrect query semantics in some cases. Introduce `immutable`, `stable`, and `volatile` semantics through `"volatility" = "immutable|stable|volatile"`, persist the property in function metadata, and use it to drive deterministic and volatile-expression behavior in Nereids. Immutable UDFs are treated as deterministic, stable UDFs avoid volatile identity handling while remaining non-deterministic, and volatile UDFs receive per-call volatile identities to protect unsafe rewrites. ```sql CREATE TABLE cte_uuid_seed (id INT) ENGINE=OLAP DUPLICATE KEY(id) DISTRIBUTED BY HASH(id) BUCKETS 1 PROPERTIES ("replication_num" = "1"); INSERT INTO cte_uuid_seed VALUES (1),(2),(3); DROP FUNCTION IF EXISTS py_uuid_token(INT); CREATE FUNCTION py_uuid_token(INT) RETURNS STRING PROPERTIES ( "type" = "PYTHON_UDF", "symbol" = "py_uuid_token_impl", "always_nullable" = "false", "runtime_version" = "3.12.11" ) AS $$ import uuid def py_uuid_token_impl(x): return f"{x}-{uuid.uuid4()}" $$; ``` before: ```sql SET enable_cte_materialize = true; SET inline_cte_referenced_threshold = 10; -- treated as volatile func(UniqueFunction), which caused wrong planning WITH cte AS (SELECT id, py_uuid_token(id) AS token FROM cte_uuid_seed) SELECT id, COUNT(DISTINCT token) AS distinct_tokens FROM (SELECT id, token FROM cte UNION ALL SELECT id, token FROM cte) u GROUP BY id ORDER BY id; +------+-----------------+ | id | distinct_tokens | +------+-----------------+ | 1 | 2 | | 2 | 2 | | 3 | 2 | +------+-----------------+ ``` now ```sql +------+-----------------+ | id | distinct_tokens | +------+-----------------+ | 1 | 1 | | 2 | 1 | | 3 | 1 | +------+-----------------+ ``` doc: apache/doris-website#3570
Related PR: #62698 Problem Summary: PR #62698 introduced the UDF volatility property and added `VolatileExpression` / `VolatileIdentity` so volatile UDF calls can carry per-call identity and avoid unsafe optimizer rewrites. This PR is a follow-up refactoring for that change. It removes duplicated unique identity state from `UniqueFunction`, keeps `VolatileIdentity` as the single identity holder, moves volatile-expression helper methods into `ExpressionTrait`, and expands the former `AddProjectForUniqueFunction` rewrite to operate on volatile expressions rather than only unique functions.
Problem Summary:
Previously, UDFs could be treated as deterministic in optimizer-related paths, which is unsafe for UDFs whose results are not stable across evaluations. That may cause invalid rewrite/planning decisions and lead to incorrect query semantics in some cases.
Introduce
immutable,stable, andvolatilesemantics through"volatility" = "immutable|stable|volatile", persist the property in function metadata, and use it to drive deterministic and volatile-expression behavior in Nereids.Immutable UDFs are treated as deterministic, stable UDFs avoid volatile identity handling while remaining non-deterministic, and volatile UDFs receive per-call volatile identities to protect unsafe rewrites.
before:
now
doc: apache/doris-website#3570