Add pgvector, tsvector/tsquery, and pg_trgm support #1390

Open
lachiewalker wants to merge 7 commits into piccolo-orm:master from lachiewalker:feat/pgvector-and-trgm-support

Conversation

@lachiewalker

Adds first-class support for three PostgreSQL-specific extensions/features: pgvector, pg_trgm, and tsvector/tsquery.

pgvector (vector similarity search)

  • Vector(dimensions=N) column type
  • .cosine_distance(), .l2_distance(), .max_inner_product() methods
    returning QueryString for use in select() and order_by()
  • asyncpg codec registered per-connection so values round-trip as Python
    list[float] rather than strings
  • IndexMethod.hnsw and IndexMethod.ivfflat added to CreateIndex
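
For readers less familiar with pgvector, here is a plain-Python sketch of the text form pgvector uses for values and of what the three distance operators compute. It is an illustration only, not the code in this PR: the helper names (`encode_vector`, `decode_vector`, and the three distance functions) are hypothetical, and the commented `set_type_codec` call is an assumption about how such a codec would typically be wired up with asyncpg.

```python
import math
from typing import List, Sequence


def encode_vector(value: Sequence[float]) -> str:
    """Serialise a sequence of numbers into pgvector's text form: '[1.0,2.5]'."""
    return "[" + ",".join(str(float(v)) for v in value) + "]"


def decode_vector(text: str) -> List[float]:
    """Parse pgvector's text form back into a list of floats."""
    inner = text.strip()[1:-1].strip()
    return [float(part) for part in inner.split(",")] if inner else []


# Assumption (not taken from the PR): with asyncpg, a text codec like this
# would typically be registered per connection, roughly as
#   await conn.set_type_codec(
#       "vector", encoder=encode_vector, decoder=decode_vector, format="text"
#   )


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """What the <-> operator computes: Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """What the <=> operator computes: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def negative_inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """What the <#> operator computes: the inner product, negated."""
    return -sum(x * y for x, y in zip(a, b))


embedding = [0.1, 0.2, 0.3]
# Values survive a round trip through the text form as list[float].
assert decode_vector(encode_vector(embedding)) == embedding
```

All three operators yield a float where smaller means closer, which is why `<#>` negates the inner product: ascending order then puts the largest inner product first.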

tsvector / tsquery (full-text search)

  • Tsvector and Tsquery column types (native Postgres, no extension)
  • Tsvector.matches(query) for @@ operator in where()
  • Text search functions: ToTsvector, ToTsquery, PlaintoTsquery,
    PhrasetoTsquery, WebsearchToTsquery, TsRank, TsRankCd,
    TsHeadline
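
As a rough guide to what these functions correspond to on the Postgres side, the sketch below renders the equivalent SQL by hand. It assumes each class maps onto the built-in PostgreSQL function of the same name; the `render` helper, the `article` table, and its `title` and `body` columns are hypothetical and are not part of this PR.

```python
# Assumed mapping from the PR's class names to built-in PostgreSQL functions.
SQL_NAMES = {
    "ToTsvector": "to_tsvector",
    "ToTsquery": "to_tsquery",
    "PlaintoTsquery": "plainto_tsquery",
    "PhrasetoTsquery": "phraseto_tsquery",
    "WebsearchToTsquery": "websearch_to_tsquery",
    "TsRank": "ts_rank",
    "TsRankCd": "ts_rank_cd",
    "TsHeadline": "ts_headline",
}


def render(func: str, *args: str) -> str:
    """Render a SQL function call from already-formatted argument strings."""
    return f"{SQL_NAMES[func]}({', '.join(args)})"


# Hypothetical table `article` with text columns `title` and `body`.
document = render("ToTsvector", "'english'", "body")
query = render("WebsearchToTsquery", "'english'", "$1")

# `@@` is the match operator that Tsvector.matches(query) stands for.
where_clause = f"{document} @@ {query}"
rank = render("TsRank", document, query)

sql = (
    f"SELECT title, {rank} AS rank "
    f"FROM article WHERE {where_clause} ORDER BY rank DESC"
)
print(sql)
```

The search terms are passed as the bound parameter `$1` rather than interpolated, which is how user input should reach any of the `*_tsquery` functions.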

pg_trgm (trigram similarity)

  • TrigramMixin on Varchar and Text adds .trigram_similar() (where)
    and .trigram_distance() (order_by)
  • Similarity and WordSimilarity query functions
  • operator_class and index_params added to CreateIndex for
    gin_trgm_ops / vector_cosine_ops indexes
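
For intuition about what `.trigram_similar()` and `.trigram_distance()` measure, here is an approximate pure-Python re-implementation of pg_trgm's similarity score. It is a simplified sketch: it only handles ASCII letters and digits, the function names are hypothetical, and the 0.3 threshold mirrors the default of the `pg_trgm.similarity_threshold` setting rather than anything configured by this PR.

```python
import re
from typing import Set


def trigrams(text: str) -> Set[str]:
    """Split text into words, pad each with two leading spaces and one
    trailing space, and collect every 3-character window."""
    grams: Set[str] = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def distance(a: str, b: str) -> float:
    """What the <-> operator returns for text: 1 - similarity."""
    return 1.0 - similarity(a, b)


def is_similar(a: str, b: str, threshold: float = 0.3) -> bool:
    """What the % operator tests: similarity above the threshold."""
    return similarity(a, b) > threshold


print(sorted(trigrams("cat")))
print(similarity("word", "two words"))
```

The split between the two methods follows from the return types: the `%` test is a boolean and belongs in `where()`, while the distance is a float and belongs in `order_by()`.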

All new additions are covered by tests.

I tried hard to fit with the existing style of the code base. Some design decisions I made along the way:

  1. Distance methods return QueryString rather than Where because <=>, <->, and <#> produce a float, not a bool. order_by() accepts QueryString directly, and so does where() (wrapping it in WhereRaw internally), so one return type serves both ANN ordering and threshold filtering without a separate API for each.

  2. order_by() was extended to accept QueryString to support .order_by(Item.embedding.cosine_distance(vec)). As part of this change the DISTINCT ON validator was tightened: OrderByRaw and QueryString are now correctly rejected as the first order_by column when using distinct(on=...).

  3. Codec registration lives in engine/extensions.py rather than engine/postgres.py because all extension-related concerns (type checks, codec registration) are co-located in one module. postgres.py stays focused on connection and pool management. asyncpg calls register_codecs on each new connection and at pool creation via init=register_codecs. The codec silently no-ops if pgvector is not installed.

  4. Extension checks fire at DDL time rather than import time because a DB connection is not guaranteed at import and checking earlier would couple module loading to database availability. DDL execution is the last moment before Postgres would emit a cryptic error, and the earliest moment the extension is actually required. Create.run() and CreateIndex.run() query pg_extension immediately before executing DDL and raise ExtensionNotInstalled with an actionable message. The check runs once per create_table() or CreateIndex call and never during normal query operation.

  5. Extension knowledge is distributed rather than centralised to keep the column class as the single source of truth. A central map would require updates in two places whenever a new extension-backed type is added. Instead, column types declare required_extension = "vector" as a class attribute and check_extensions_for_index derives index requirements from method and operator_class directly. Adding a new extension-backed column type requires only setting the attribute on the class.

  6. TrigramMixin carries no required_extension because trgm requirements surface through operator_class in CreateIndex, not at column definition time. A Varchar column does not require pg_trgm to exist; only creating a trgm index does.
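
Decisions 4 to 6 can be sketched together in plain Python. This is a simplified stand-in, not the code in this PR: only `required_extension` and `ExtensionNotInstalled` are names taken from the description above, while the helper functions, the minimal `Column` classes, and the `installed` set (standing in for the result of `SELECT extname FROM pg_extension`) are hypothetical.

```python
from typing import Iterable, Optional, Set


class ExtensionNotInstalled(Exception):
    """Stand-in for the exception named in the PR description."""


class Column:
    # Each column type declares its own requirement; None means no extension.
    required_extension: Optional[str] = None


class Varchar(Column):
    pass  # trigram methods need nothing at column-definition time


class Vector(Column):
    required_extension = "vector"


def required_for_columns(columns: Iterable[Column]) -> Set[str]:
    """Collect requirements straight from the column classes."""
    return {c.required_extension for c in columns if c.required_extension}


def required_for_index(
    method: Optional[str] = None, operator_class: Optional[str] = None
) -> Set[str]:
    """Derive index requirements from the method and operator class."""
    needed: Set[str] = set()
    if method in ("hnsw", "ivfflat") or (operator_class or "").startswith("vector_"):
        needed.add("vector")
    if operator_class in ("gin_trgm_ops", "gist_trgm_ops"):
        needed.add("pg_trgm")
    return needed


def check_extensions(needed: Iterable[str], installed: Iterable[str]) -> None:
    """Run immediately before DDL; raise with an actionable message."""
    missing = sorted(set(needed) - set(installed))
    if missing:
        hints = "; ".join(f'CREATE EXTENSION IF NOT EXISTS "{m}"' for m in missing)
        raise ExtensionNotInstalled(
            f"Missing extension(s): {', '.join(missing)}. Try: {hints}"
        )


installed = {"plpgsql"}  # hypothetical contents of pg_extension
try:
    check_extensions(required_for_columns([Varchar(), Vector()]), installed)
except ExtensionNotInstalled as exc:
    print(exc)
```

In this sketch a table with only `Varchar` columns passes the check with no extensions installed, and `pg_trgm` is demanded only once an index is created with a trigram operator class.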

@dantownsend

Member

@lachiewalker Thanks for this!

@lachiewalker

Author

@dantownsend No worries, glad to (hopefully) contribute! Forgot to lint first time around. Should be sorted now.
