feat: add fts support by egolearner · Pull Request #408 · alibaba/zvec

egolearner · 2026-05-15T09:17:45Z

address #397

JalinWang · 2026-05-22T10:29:26Z

+
+import pytest
+
+from zvec.model.param.query import Fts, Query


This naming "Fts" is a little bit too generic. Would it be more precise to name it after its underlying dependency, like _FtsQuery (binding) or FtsQueryParam (C++)?

The current design is: Query is the top-level request containing multiple retrieval fields (vector, FTS, filter, etc.), and Fts is just one component within it — similar to how you'd have Vector or Filter as sibling fields.

Naming it FtsQuery would imply it's a standalone query type at the same level as Query, which is misleading. If we later split Query into multiple specialized types, FtsQuery would then be the right name for a full FTS query request.

We'll also align the C++ side to drop the "Query" suffix from the param struct for consistency.

What do you think? Any other naming suggestions?

You’re right — FtsQuery doesn’t feel quite right.

That said, Fts alone also feels a bit awkward, since it doesn’t make it clear that this is a component of Query, not a standalone type.

Maybe we should consider a name that better reflects that role, such as FtsQueryParam, FtsClause, FtsCondition, FtsOption, or FtsFilter.

… subdirs

…s on same block

…ring hot path

…PLT indirect calls

Move the per-query filter check from the column-reader loop into the Disjunction/Conjunction/Phrase iterators so filtered docs no longer pay for block-max binary search, do_next alignment, or phase-2 position verification ($POS CF reads). TermDocIterator inherits the base-class default and stays unchanged.

block_max_info_for() now returns {score, last_doc} in one binary search (with a small cache), so the standalone current_block_max_score(), skip_to_next_block(), block_max_score_for() and block_max_last_doc_for() methods have no live callers. Remove them from BitPackedPostingIterator, the DocIterator base, and TermDocIterator, along with the now-dead current_block_max_score_ member and its decode_block assignment. Tests adjusted to query via block_max_info_for().

When an invert filter is highly selective compared to the FTS posting size, posting-driven evaluation walks far more docs than necessary. Mirror the existing vector_recall pattern: when invert match_count is below fts_brute_force_by_keys_ratio * doc_count, extract the small id set and AND it into the FTS root via a new CandidateDocIterator. The candidate iterator becomes the lead by cost, turning the posting walk into per-candidate advance() + matches() + score() and fully reusing the existing AND / filter-pushdown / BM25 machinery. - new CandidateDocIterator: ascending segment-local ids, lower_bound advance, zero score contribution - FtsColumnIndexer::search wraps root_iter in Conjunction when FtsQueryParams.candidate_ids is non-empty - new GlobalConfig::fts_brute_force_by_keys_ratio (default 0.05, independent from the vector knob because per-candidate FTS cost is higher due to phrase phase-2 IO), wired through C API + Python binding - DocFilter::get_bf_by_keys_and_update now takes an explicit ratio so the two callers (vector vs FTS) pick the right knob; on the brute- force branch invert_filter_ is cleared so DocFilter never re-checks the same ids - 9 iterator unit tests + 7 reader equivalence tests (Term / OR / AND / Phrase / Nested, coexistence with IndexFilter, empty-candidate fallback) + config default / validation asserts

create_term_iterator_from_raw used to open BitPackedPostingIterator twice on the same buffer: an outer probe to read df/max_score, then a second open inside TermDocIterator's BitPacked constructor. Drop the probe and let the constructor read df/max_score from bp_iter_ after its single open(). On parse failure the iterator now reports cost()==0 (with the existing LOG_ERROR) and the caller skips it just like an empty term.

- FtsQueryParser now requires a tokenizer pipeline; phrase contents and bare terms are tokenized the same way the index was, so CJK queries match the doc-side segmentation instead of falling back to ASCII whitespace split. Multi-token bare terms compose via default_op. - JiebaTokenizer.position becomes a per-output sequence number. CutForSearch's overlapping sub-words shared a unicode_offset, which broke PhraseDocIterator's strict anchor+i adjacency check; sequence numbers stay contiguous so doc/query phrases align. - sqlengine_impl hoists pipeline creation above the query/match-string branch and feeds the parser.

Use DictTrie / HMMModel + per-mode segmenter directly so constructing JiebaTokenizer no longer force-loads idf.utf8 and stop_words.utf8 — files we never use, that previously forced ~12MB of required files on every deployment and ~10MB of resident memory at runtime. cppjieba aborts when those files are missing, which made the dict path a deploy-time foot-gun. Mode-aware required-field checks: cut_mode=hmm no longer demands dict_path, cut_mode=full no longer demands model_path. Drops the "idf_path" / "stop_word_path" config keys that were silently inert (KeywordExtractor uses them; segmenters don't).

When the analyzer drops every term (all stop-words, pure punctuation, or all-whitespace match_string), both query_string and match_string paths used to fail with InvalidArgument. Now they parse to an explicit EmptyNode that matches zero documents, mirroring Lucene's MatchNoDocsQuery and aligning the two entry points. The new node composes naturally with AND (whole conjunction matches nothing) and OR (child is skipped) via the existing null-iterator paths in build_iterator/search, so executor changes are limited to one switch case.

A single post-order pass over the FTS AST collapses sibling duplicates, flattens same-type composites for better block-max WAND pruning, detects must vs must_not contradictions, propagates EmptyNode, and folds single children. Duplicate TermNode/PhraseNode siblings are merged into one node whose boost is the linear sum, so the post-rewrite per-doc score equals the pre-rewrite "sum of N independent scorers" output exactly -- zero behavior change, only N posting lookups and N scorings collapsed to one per unique term. Flattening uses Lucene-style guards (pure-disjunction OR, no must_not in AND inner) to avoid altering match sets. Boost is plumbed through BM25Scorer::score_with_idf and applied to TermDocIterator's max_score so WAND pivot stays correct.

When an OrNode has any must_not children alongside positives, the rewriter now lifts them into a wrapping AndNode -- AND(OR(positives), must_nots...) -- so OrNode never carries must_not children downstream. Single-positive case skips the intermediate OR entirely, yielding the flatter AND(positive, must_nots...). Two follow-on cleanups: * build_or_iterator drops the disjunction-then-must_not-Conjunction wrap and now only handles SHOULD-style positives. A defensive error fires if a must_not slips through, surfacing rewriter bypasses loudly instead of silently producing wrong scores. * `apple -apple` style self-contradictions now collapse to EmptyNode for free: after canonicalization they become AND(apple, -apple), which the existing and_has_mustnot_conflict check rewrites to empty. FtsColumnIndexer::search() is reached via two paths: sqlengine (always runs simplify in parse_fts_query) and the indexer unit tests (helper search_ok now also calls simplify). Both paths therefore deliver a canonical AST, matching the new build_or_iterator invariant.

… anchor verify_phrase_positions previously issued one Get per (term, doc_id) *inside* the anchor loop, re-reading the same non-anchor term up to |anchor_positions| times. It also allocated a fresh key string per call, copied values into std::string, and always picked terms_[0] as anchor even when that term was the most frequent in the doc. Rewrite the phase-2 check to: - Dedup repeated terms within the phrase (e.g., "to be or not to be"). - Issue a single MultiGet across the unique terms, using PinnableSlice and a single reserved key buffer so Slice pointers stay valid and per-call allocations drop to O(1). - Decode every position list once into a local cache. - Pick the shortest position list as anchor so the outer loop iterates the rarest term, not the phrase-order first term. - Validate adjacency entirely in memory. Add append_doc_term_key as an in-place variant of make_doc_term_key for buffer-reuse callers. Add regression tests for repeated-term phrases and high-frequency leading terms.

JalinWang · 2026-05-27T09:53:36Z

+        q = Query(
+            field_name="embedding",
+            vector=[0.1, 0.2, 0.3],
+            fts=Fts(match_string="deep learning"),


Also, would it be better to rename fts= here to something more explicit, like full_text= / full_text_search= / text_search=? It feels a bit clearer and makes the intent more obvious, especially for users who may not immediately recognize fts as full-text search.

egolearner requested review from JalinWang, chinaux, feihongxu0824 and zhourrr as code owners May 15, 2026 09:17

github-actions Bot assigned egolearner May 15, 2026

egolearner force-pushed the feat/fts branch 6 times, most recently from a2882f1 to 122fb51 Compare May 22, 2026 07:43

egolearner requested a review from Cuiyus as a code owner May 22, 2026 07:43

egolearner changed the title ~~feat: add fts support in db layer~~ feat: add fts support May 22, 2026

JalinWang reviewed May 22, 2026

View reviewed changes

egolearner force-pushed the feat/fts branch 2 times, most recently from 6073f24 to 922e7ed Compare May 26, 2026 12:09

egolearner added 14 commits May 27, 2026 16:33

feat: add fts support

58771a3

fix mac compile & ci

a3dfb4e

refactor parse fts & add fts debug text

c922ccc

fix some problems

9aad98c

refactor(fts_column): reorganize into tokenizer/, posting/, iterator/…

47e3996

… subdirs

perf: or use multi_get

75981f3

perf: optimize disjunction iterator

a203cc6

perf: fts use hashskiplist

b9d2c7a

refactor batch_get_postings

f7a84ad

perf: optimize iterator virtual function

7b474f5

bench limit max_queries

714299a

perf: use PinnableSlice

c16ebaf

perf: bitpacked avx2

d206ec6

chore: rm unnecessary checkpoint

0e4a65e

egolearner added 21 commits May 27, 2026 16:33

perf: cache block_max_info_for result to skip repeated binary searche…

b008466

…s on same block

perf: precompute BM25 IDF weight per term to eliminate log() from sco…

d5bf0f5

…ring hot path

perf: cache SIMD dispatch function pointers in iterator to eliminate …

59199bb

…PLT indirect calls

rename

ee1e159

PartialMerge no optimize

e4175eb

fix fts score

24161cf

python binding support fts

b97c581

feat: wire FTS reduce into Optimize compaction

583b46b

feat: allow schema without vector fields

c657e25

rename cpp FtsQuery to Fts and fix compile

9763a7c

add c binding

b071137

egolearner force-pushed the feat/fts branch from e0e714b to 2120dc8 Compare May 27, 2026 08:44

fix some problems

ee9bef9

egolearner force-pushed the feat/fts branch from 2120dc8 to ee9bef9 Compare May 27, 2026 09:48

JalinWang reviewed May 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add fts support#408

feat: add fts support#408
egolearner wants to merge 36 commits into
alibaba:mainfrom
egolearner:feat/fts

egolearner commented May 15, 2026

Uh oh!

JalinWang May 22, 2026 •

edited

Loading

Uh oh!

egolearner May 25, 2026

Uh oh!

JalinWang May 27, 2026

Uh oh!

JalinWang May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

egolearner commented May 15, 2026

Uh oh!

JalinWang May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

egolearner May 25, 2026

Choose a reason for hiding this comment

Uh oh!

JalinWang May 27, 2026

Choose a reason for hiding this comment

Uh oh!

JalinWang May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

JalinWang May 22, 2026 •

edited

Loading