docs: replace barrier() with KNN join behavior documentation #635

Merged
paleolimbot merged 6 commits into apache:main from Kontinuation:docs/remove-barrier-update-knn-docs
Feb 19, 2026

Conversation

@Kontinuation
Member

Summary

  • Remove the barrier() UDF, which was an optimization-barrier workaround for KNN joins. It had no external consumers (no Python bindings, no integration tests, no doc references) and is no longer needed, since KNN joins inherently block filter pushdown through extension node semantics.
  • Replace the "Optimization Barrier" docs section in sql-joins.md with a "KNN Join Caveats" section that accurately documents:
    • No Filter Pushdown: KNN joins do not push filters into input tables; all predicates are post-filters. Notes that query-side pushdown is a valid future optimization.
    • ST_KNN Predicate Precedence: ST_KNN is always extracted first when combined with other predicates via AND; equivalent examples shown for ON ... AND vs WHERE placement.

Changes

  • docs/reference/sql-joins.md — Replaced "Optimization Barrier" section with "KNN Join Caveats"
  • rust/sedona-functions/src/barrier.rs — Deleted (649 lines)
  • rust/sedona-functions/src/lib.rs — Removed mod barrier;
  • rust/sedona-functions/src/register.rs — Removed barrier_udf registration

Testing

  • cargo test -p sedona-functions — 344 tests pass
  • cargo test -p sedona-spatial-join — 171 tests pass
  • All doc claims verified experimentally via Python SedonaContext

@Kontinuation Kontinuation force-pushed the docs/remove-barrier-update-knn-docs branch from 199e083 to 19fce52 on February 18, 2026 15:17
@jiayuasu
Member

jiayuasu commented Feb 18, 2026

Can you clarify why we need to remove the barrier function?

This function lets users state their intention, because pushing a filter down through a KNN join is not wrong behavior. I don't think simply blocking all filter pushdown will work unless we can achieve something similar via a CTE.

In addition, SedonaSpark also has the barrier function: https://sedona.apache.org/latest/api/sql/NearestNeighbourSearching/

@paleolimbot
Member

While it's not hurting anybody for it to continue to exist, we should definitely recommend more explicit syntax now that we have it available. You have to be a database expert familiar with the concept of barrier() to know what will happen here, and you have to have read the documentation very closely to know to use:

SELECT h.name AS hotel, r.name AS restaurant, r.rating
FROM hotels AS h
INNER JOIN restaurants AS r ON ST_KNN(h.geometry, r.geometry, 3, false)
WHERE barrier('rating > 4.0 AND stars >= 4', 'rating', r.rating, 'stars', h.stars)

Since we can now type this:

SELECT h.name AS hotel, r.name AS restaurant, r.rating
FROM hotels AS h
INNER JOIN restaurants AS r ON ST_KNN(h.geometry, r.geometry, 3, false)
WHERE rating > 4.0 AND stars >= 4

...we may as well recommend it and remove the hack before it becomes widely used. We can always add it back if it is requested.

@paleolimbot
Member

As I understand it, we can also optimize rating > 4.0 AND stars >= 4 by pushing stars >= 4 through one side of the join (which we can't do with barrier()), in addition to the other built-in optimizations that DataFusion performs (e.g., constant folding, common subexpression elimination).
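
To illustrate why pushing stars >= 4 to the hotel (query) side should not change the result, here is a minimal brute-force sketch in plain Python. It is not SedonaDB code: the toy tables, the field names, and the `knn_join` helper are all made up for illustration. Each hotel's k nearest restaurants are computed independently of the other hotels, so filtering hotels before or after the join yields the same pairs.

```python
from math import dist

# Hypothetical toy tables; coordinates stand in for geometries.
hotels = [
    {"name": "H1", "stars": 5, "xy": (0.0, 0.0)},
    {"name": "H2", "stars": 3, "xy": (10.0, 0.0)},
]
restaurants = [
    {"name": "R1", "rating": 4.5, "xy": (1.0, 0.0)},
    {"name": "R2", "rating": 3.0, "xy": (2.0, 0.0)},
    {"name": "R3", "rating": 4.8, "xy": (9.0, 0.0)},
    {"name": "R4", "rating": 4.9, "xy": (5.0, 0.0)},
]


def knn_join(queries, objects, k):
    """Brute-force KNN join: pair each query row with its k nearest object rows."""
    pairs = []
    for q in queries:
        nearest = sorted(objects, key=lambda o: dist(q["xy"], o["xy"]))[:k]
        pairs.extend((q, o) for o in nearest)
    return pairs


# Join first, then keep only pairs whose hotel has stars >= 4 (post-filter).
post_filtered = [(h, r) for h, r in knn_join(hotels, restaurants, 2) if h["stars"] >= 4]

# Filter hotels first, then join (query-side pushdown).
pushed_down = knn_join([h for h in hotels if h["stars"] >= 4], restaurants, 2)

print(post_filtered == pushed_down)  # True
```

The same does not hold for a predicate on the restaurant side, because removing restaurants changes which rows count as the nearest neighbors.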

@jiayuasu
Member

jiayuasu commented Feb 18, 2026

I am fine removing the barrier function. I agree it is ugly.

But is there a way to let users clearly state their intention, i.e., whether they want the filter first or the join first? I think we discussed this before and the suggestion was to use a CTE?

SELECT h.name AS hotel, r.name AS restaurant, r.rating
FROM hotels AS h
INNER JOIN restaurants AS r ON ST_KNN(h.geometry, r.geometry, 3, false)
WHERE rating > 4.0 AND stars >= 4

@paleolimbot
Member

Yes, I think either a CTE or a subquery will work if the filter should be applied first.
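
A rough sketch of why the order matters on the restaurant side, in plain Python with made-up data and a hypothetical brute-force `knn_join` helper (not SedonaDB's implementation). Filtering restaurants first, which is what a CTE or subquery over restaurants would express, changes the candidate set that the k nearest neighbors are drawn from, whereas a WHERE clause after the join only trims the pairs that were already selected.

```python
from math import dist

# Hypothetical toy tables; coordinates stand in for geometries.
hotels = [{"name": "H1", "xy": (0.0, 0.0)}]
restaurants = [
    {"name": "R1", "rating": 3.0, "xy": (1.0, 0.0)},
    {"name": "R2", "rating": 4.5, "xy": (2.0, 0.0)},
    {"name": "R3", "rating": 4.8, "xy": (3.0, 0.0)},
]


def knn_join(queries, objects, k):
    """Brute-force KNN join: pair each query row with its k nearest object rows."""
    pairs = []
    for q in queries:
        nearest = sorted(objects, key=lambda o: dist(q["xy"], o["xy"]))[:k]
        pairs.extend((q, o) for o in nearest)
    return pairs


# Join first (WHERE after the KNN join): the 2 nearest are R1 and R2,
# and only R2 survives the rating filter.
join_first = [
    (h["name"], r["name"])
    for h, r in knn_join(hotels, restaurants, 2)
    if r["rating"] > 4.0
]

# Filter first (CTE or subquery on restaurants): candidates are R2 and R3,
# so both come back as the 2 nearest well-rated restaurants.
well_rated = [r for r in restaurants if r["rating"] > 4.0]
filter_first = [(h["name"], r["name"]) for h, r in knn_join(hotels, well_rated, 2)]

print(join_first)    # [('H1', 'R2')]
print(filter_first)  # [('H1', 'R2'), ('H1', 'R3')]
```

Join-first can return fewer than k rows per hotel, while filter-first returns up to k rows that all satisfy the predicate. Both are valid queries; they simply answer different questions.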

@jiayuasu
Member

OK. As long as we document the CTE approach, I am fine with removing the function

KNN joins now block all filter pushdown automatically, so the barrier()
function is no longer needed. Replace the Optimization Barrier section
with a KNN Join Behavior section that documents:

- No filter pushdown: WHERE predicates are evaluated after KNN candidate
  selection, not pushed into input tables
- ST_KNN predicate precedence: ST_KNN is always extracted first when
  combined with other predicates via AND

The barrier() function was a workaround to prevent filter pushdown past
KNN joins by evaluating boolean expressions as opaque strings at runtime.
KNN joins now block all filter pushdown automatically via the
KnnJoinEarlyRewrite optimizer rule, making barrier() unnecessary.

The function had no external consumers: no Python bindings, no
integration tests, no documentation references, and no other Rust
modules importing it.
@Kontinuation Kontinuation force-pushed the docs/remove-barrier-update-knn-docs branch from 19fce52 to b30eb2f on February 19, 2026 02:30
@Kontinuation
Member Author

I have updated the doc to include subquery and CTE examples for manually pushing down the filters. This can serve as a workaround for now. We should definitely implement query-side predicate pushdown for KNN joins in future patches.

@petern48
Contributor

For what it's worth, the other day I stumbled across LanceDB handling this exact scenario. They offer a prefilter parameter for their approximate k-nearest-neighbors vector search. It's not quite SQL, but it's worth noting that another project has encountered this and supports both cases.

# Assumes `table` is an open LanceDB table and `query_embed` is the query vector.
results_post_filtered = (
    table.search(query_embed)
    .where("label > 1", prefilter=False)  # prefilter parameter allows user to choose
    .select(["text", "keywords", "label"])
    .limit(5)
    .to_pandas()
)

https://docs.lancedb.com/search/vector-search#vector-search-with-postfiltering

@Kontinuation
Member Author

LanceDB's fluent API does not allow something like table.where(...).search(embedding). where can only appear after search so it has an optional parameter for distinguishing whether it is a pre-filter or not.

SQL is more flexible than LanceDB's query builder API, and there are unambiguous ways to express pre- and post-filtering in SQL, so I don't think we need barrier-like annotations. We only need to faithfully carry out the semantics of the SQL.

@Kontinuation Kontinuation marked this pull request as ready for review February 19, 2026 10:48
Member

@paleolimbot paleolimbot left a comment

Thank you!

@paleolimbot paleolimbot merged commit 6b08cf2 into apache:main Feb 19, 2026
17 checks passed
@paleolimbot paleolimbot added this to the 0.3.0 milestone Feb 19, 2026

4 participants