Skip to content

feat: add fts support#408

Open
egolearner wants to merge 36 commits into
alibaba:mainfrom
egolearner:feat/fts
Open

feat: add fts support#408
egolearner wants to merge 36 commits into
alibaba:mainfrom
egolearner:feat/fts

Conversation

@egolearner
Copy link
Copy Markdown
Collaborator

address #397

@egolearner egolearner force-pushed the feat/fts branch 6 times, most recently from a2882f1 to 122fb51 Compare May 22, 2026 07:43
@egolearner egolearner requested a review from Cuiyus as a code owner May 22, 2026 07:43
@egolearner egolearner changed the title feat: add fts support in db layer feat: add fts support May 22, 2026

import pytest

from zvec.model.param.query import Fts, Query
Copy link
Copy Markdown
Collaborator

@JalinWang JalinWang May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This naming "Fts" is a little bit too generic. Would it be more precise to name it after its underlying dependency, like _FtsQuery (binding) or FtsQueryParam (C++)?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current design is: Query is the top-level request containing multiple retrieval fields (vector, FTS, filter, etc.), and Fts is just one component within it — similar to how you'd have Vector or Filter as sibling fields.

Naming it FtsQuery would imply it's a standalone query type at the same level as Query, which is misleading. If we later split Query into multiple specialized types, FtsQuery would then be the right name for a full FTS query request.

We'll also align the C++ side to drop the "Query" suffix from the param struct for consistency.

What do you think? Any other naming suggestions?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You’re right — FtsQuery doesn’t feel quite right.

That said, Fts alone also feels a bit awkward, since it doesn’t make it clear that this is a component of Query, not a standalone type.

Maybe we should consider a name that better reflects that role, such as FtsQueryParam, FtsClause, FtsCondition, FtsOption, or FtsFilter.

@egolearner egolearner force-pushed the feat/fts branch 2 times, most recently from 6073f24 to 922e7ed Compare May 26, 2026 12:09
egolearner added 21 commits May 27, 2026 16:33
Move the per-query filter check from the column-reader loop into the
Disjunction/Conjunction/Phrase iterators so filtered docs no longer pay
for block-max binary search, do_next alignment, or phase-2 position
verification ($POS CF reads). TermDocIterator inherits the base-class
default and stays unchanged.
block_max_info_for() now returns {score, last_doc} in one binary search
(with a small cache), so the standalone current_block_max_score(),
skip_to_next_block(), block_max_score_for() and block_max_last_doc_for()
methods have no live callers. Remove them from BitPackedPostingIterator,
the DocIterator base, and TermDocIterator, along with the now-dead
current_block_max_score_ member and its decode_block assignment. Tests
adjusted to query via block_max_info_for().
When an invert filter is highly selective compared to the FTS posting
size, posting-driven evaluation walks far more docs than necessary.
Mirror the existing vector_recall pattern: when invert match_count is
below fts_brute_force_by_keys_ratio * doc_count, extract the small id
set and AND it into the FTS root via a new CandidateDocIterator. The
candidate iterator becomes the lead by cost, turning the posting walk
into per-candidate advance() + matches() + score() and fully reusing
the existing AND / filter-pushdown / BM25 machinery.

- new CandidateDocIterator: ascending segment-local ids, lower_bound
  advance, zero score contribution
- FtsColumnIndexer::search wraps root_iter in Conjunction when
  FtsQueryParams.candidate_ids is non-empty
- new GlobalConfig::fts_brute_force_by_keys_ratio (default 0.05,
  independent from the vector knob because per-candidate FTS cost is
  higher due to phrase phase-2 IO), wired through C API + Python binding
- DocFilter::get_bf_by_keys_and_update now takes an explicit ratio so
  the two callers (vector vs FTS) pick the right knob; on the brute-
  force branch invert_filter_ is cleared so DocFilter never re-checks
  the same ids
- 9 iterator unit tests + 7 reader equivalence tests (Term / OR / AND /
  Phrase / Nested, coexistence with IndexFilter, empty-candidate
  fallback) + config default / validation asserts
create_term_iterator_from_raw used to open BitPackedPostingIterator twice
on the same buffer: an outer probe to read df/max_score, then a second
open inside TermDocIterator's BitPacked constructor. Drop the probe and
let the constructor read df/max_score from bp_iter_ after its single
open(). On parse failure the iterator now reports cost()==0 (with the
existing LOG_ERROR) and the caller skips it just like an empty term.
- FtsQueryParser now requires a tokenizer pipeline; phrase contents and
  bare terms are tokenized the same way the index was, so CJK queries
  match the doc-side segmentation instead of falling back to ASCII
  whitespace split. Multi-token bare terms compose via default_op.
- JiebaTokenizer.position becomes a per-output sequence number.
  CutForSearch's overlapping sub-words shared a unicode_offset, which
  broke PhraseDocIterator's strict anchor+i adjacency check; sequence
  numbers stay contiguous so doc/query phrases align.
- sqlengine_impl hoists pipeline creation above the query/match-string
  branch and feeds the parser.
Use DictTrie / HMMModel + per-mode segmenter directly so constructing
JiebaTokenizer no longer force-loads idf.utf8 and stop_words.utf8 — files
we never use, that previously forced ~12MB of required files on every
deployment and ~10MB of resident memory at runtime. cppjieba aborts when
those files are missing, which made the dict path a deploy-time foot-gun.

Mode-aware required-field checks: cut_mode=hmm no longer demands
dict_path, cut_mode=full no longer demands model_path. Drops the
"idf_path" / "stop_word_path" config keys that were silently inert
(KeywordExtractor uses them; segmenters don't).
When the analyzer drops every term (all stop-words, pure punctuation, or
all-whitespace match_string), both query_string and match_string paths
used to fail with InvalidArgument. Now they parse to an explicit
EmptyNode that matches zero documents, mirroring Lucene's
MatchNoDocsQuery and aligning the two entry points.

The new node composes naturally with AND (whole conjunction matches
nothing) and OR (child is skipped) via the existing null-iterator
paths in build_iterator/search, so executor changes are limited to
one switch case.
A single post-order pass over the FTS AST collapses sibling duplicates,
flattens same-type composites for better block-max WAND pruning, detects
must vs must_not contradictions, propagates EmptyNode, and folds single
children. Duplicate TermNode/PhraseNode siblings are merged into one
node whose boost is the linear sum, so the post-rewrite per-doc score
equals the pre-rewrite "sum of N independent scorers" output exactly --
zero behavior change, only N posting lookups and N scorings collapsed
to one per unique term.

Flattening uses Lucene-style guards (pure-disjunction OR, no must_not
in AND inner) to avoid altering match sets. Boost is plumbed through
BM25Scorer::score_with_idf and applied to TermDocIterator's max_score
so WAND pivot stays correct.
When an OrNode has any must_not children alongside positives, the
rewriter now lifts them into a wrapping AndNode -- AND(OR(positives),
must_nots...) -- so OrNode never carries must_not children downstream.
Single-positive case skips the intermediate OR entirely, yielding the
flatter AND(positive, must_nots...).

Two follow-on cleanups:

* build_or_iterator drops the disjunction-then-must_not-Conjunction
  wrap and now only handles SHOULD-style positives. A defensive error
  fires if a must_not slips through, surfacing rewriter bypasses
  loudly instead of silently producing wrong scores.

* `apple -apple` style self-contradictions now collapse to EmptyNode
  for free: after canonicalization they become AND(apple, -apple),
  which the existing and_has_mustnot_conflict check rewrites to empty.

FtsColumnIndexer::search() is reached via two paths: sqlengine (always
runs simplify in parse_fts_query) and the indexer unit tests (helper
search_ok now also calls simplify). Both paths therefore deliver a
canonical AST, matching the new build_or_iterator invariant.
… anchor

verify_phrase_positions previously issued one Get per (term, doc_id)
*inside* the anchor loop, re-reading the same non-anchor term up to
|anchor_positions| times. It also allocated a fresh key string per call,
copied values into std::string, and always picked terms_[0] as anchor
even when that term was the most frequent in the doc.

Rewrite the phase-2 check to:
- Dedup repeated terms within the phrase (e.g., "to be or not to be").
- Issue a single MultiGet across the unique terms, using PinnableSlice
  and a single reserved key buffer so Slice pointers stay valid and
  per-call allocations drop to O(1).
- Decode every position list once into a local cache.
- Pick the shortest position list as anchor so the outer loop iterates
  the rarest term, not the phrase-order first term.
- Validate adjacency entirely in memory.

Add append_doc_term_key as an in-place variant of make_doc_term_key for
buffer-reuse callers. Add regression tests for repeated-term phrases
and high-frequency leading terms.
q = Query(
field_name="embedding",
vector=[0.1, 0.2, 0.3],
fts=Fts(match_string="deep learning"),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, would it be better to rename fts= here to something more explicit, like full_text= / full_text_search= / text_search=? It feels a bit clearer and makes the intent more obvious, especially for users who may not immediately recognize fts as full-text search.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants