Add pgvector, tsvector/tsquery, and pg_trgm support #1390
Open
lachiewalker wants to merge 7 commits into
Member: @lachiewalker Thanks for this!
Author: @dantownsend No worries, glad to (hopefully) contribute! Forgot to lint first time around. Should be sorted now.
Adds first-class support for three PostgreSQL-specific extensions/features: pgvector, pg_trgm, and tsvector/tsquery.
pgvector (vector similarity search)
- `Vector(dimensions=N)` column type
- `.cosine_distance()`, `.l2_distance()`, `.max_inner_product()` methods returning `QueryString` for use in `select()` and `order_by()`
- Vector values are handled as `list[float]` rather than strings
- `IndexMethod.hnsw` and `IndexMethod.ivfflat` added to `CreateIndex`

tsvector / tsquery (full-text search)
- `Tsvector` and `Tsquery` column types (native Postgres, no extension)
- `Tsvector.matches(query)` for the `@@` operator in `where()`
- `ToTsvector`, `ToTsquery`, `PlaintoTsquery`, `PhrasetoTsquery`, `WebsearchToTsquery`, `TsRank`, `TsRankCd`, `TsHeadline`

pg_trgm (trigram similarity)
- `TrigramMixin` on `Varchar` and `Text` adds `.trigram_similar()` (where) and `.trigram_distance()` (order_by)
- `Similarity` and `WordSimilarity` query functions
- `operator_class` and `index_params` added to `CreateIndex` for `gin_trgm_ops` / `vector_cosine_ops` indexes

All new additions also have tests.
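
For reviewers less familiar with pgvector, here is a rough plain-Python sketch of the quantities the three distance operators return (`<->` for L2 distance, `<#>` for negative inner product, `<=>` for cosine distance). It is not code from this PR and does not use Piccolo; the function names and sample data are made up for illustration. It also shows why a float result can serve both ordering and threshold filtering.

```python
import math
from typing import Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance, the quantity pgvector's `<->` operator returns."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def negative_inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Negated dot product, the quantity pgvector's `<#>` operator returns."""
    return -sum(x * y for x, y in zip(a, b))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity, the quantity pgvector's `<=>` operator returns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical in-memory "rows" standing in for a table with an embedding column.
query = [1.0, 0.0]
rows = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}

# Ordering by the float distance (nearest first), as an ORDER BY on `<=>` would.
nearest_first = sorted(rows, key=lambda name: cosine_distance(rows[name], query))

# Comparing the same float against a threshold gives a boolean filter.
within_threshold = [
    name for name in rows if cosine_distance(rows[name], query) < 0.5
]
```

Because each operator yields a number rather than a truth value, the same expression can be sorted on or compared against a cutoff, which is the behaviour the distance methods expose.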
I tried hard to fit with the existing style of the code base. Some design decisions I made along the way:
- Distance methods return `QueryString` rather than `Where` because `<=>`, `<->`, and `<#>` produce a float, not a bool. `order_by()` accepts `QueryString` directly, and so does `where()` (wrapping it in `WhereRaw` internally), so one return type serves both ANN ordering and threshold filtering without a separate API for each.
- `order_by()` was extended to accept `QueryString` to support `.order_by(Item.embedding.cosine_distance(vec))`. As part of this change the `DISTINCT ON` validator was tightened: `OrderByRaw` and `QueryString` are now correctly rejected as the first `order_by` column when using `distinct(on=...)`.
- Codec registration lives in `engine/extensions.py` rather than `engine/postgres.py` because all extension-related concerns (type checks, codec registration) are co-located in one module. `postgres.py` stays focused on connection and pool management. asyncpg calls `register_codecs` on each new connection and at pool creation via `init=register_codecs`. The codec silently no-ops if pgvector is not installed.
- Extension checks fire at DDL time rather than import time because a DB connection is not guaranteed at import and checking earlier would couple module loading to database availability. DDL execution is the last moment before Postgres would emit a cryptic error, and the earliest moment the extension is actually required. `Create.run()` and `CreateIndex.run()` query `pg_extension` immediately before executing DDL and raise `ExtensionNotInstalled` with an actionable message. The check runs once per `create_table()` or `CreateIndex` call and never during normal query operation.
- Extension knowledge is distributed rather than centralised to keep the column class as the single source of truth. A central map would require updates in two places whenever a new extension-backed type is added. Instead, column types declare `required_extension = "vector"` as a class attribute and `check_extensions_for_index` derives index requirements from `method` and `operator_class` directly. Adding a new extension-backed column type requires only setting the attribute on the class.
- `TrigramMixin` carries no `required_extension` because trgm requirements surface through `operator_class` in `CreateIndex`, not at column definition time. A `Varchar` column does not require `pg_trgm` to exist; only creating a trgm index does.
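
To make the distributed extension check easier to picture, here is a minimal standalone sketch of the idea. It is not the code in this PR: `Column`, `Varchar`, and `Vector` are simplified stand-ins, and `required_extensions_for_columns`, `required_extensions_for_index`, and `ensure_installed` are hypothetical helper names. Only `required_extension` and `ExtensionNotInstalled` are taken from the description above, and the operator-class matching rules are an assumption.

```python
from __future__ import annotations

from typing import Iterable, Optional


class ExtensionNotInstalled(Exception):
    """Stand-in for the exception named in the PR description."""


class Column:
    # Columns that need no extension leave this as None.
    required_extension: Optional[str] = None


class Varchar(Column):
    pass


class Vector(Column):
    # The column class itself declares what it needs.
    required_extension = "vector"


def required_extensions_for_columns(columns: Iterable[Column]) -> set[str]:
    """Collect extension names declared on the column classes."""
    return {c.required_extension for c in columns if c.required_extension}


def required_extensions_for_index(
    method: Optional[str] = None, operator_class: Optional[str] = None
) -> set[str]:
    """Derive index requirements from the index method and operator class."""
    required: set[str] = set()
    if method in {"hnsw", "ivfflat"}:
        required.add("vector")
    if operator_class:
        if operator_class.startswith("vector_"):
            required.add("vector")
        elif operator_class.endswith("_trgm_ops"):
            required.add("pg_trgm")
    return required


def ensure_installed(required: Iterable[str], installed: Iterable[str]) -> None:
    """Raise with an actionable message if any required extension is missing."""
    missing = sorted(set(required) - set(installed))
    if missing:
        hints = "; ".join(f"CREATE EXTENSION {name}" for name in missing)
        raise ExtensionNotInstalled(
            f"Missing Postgres extension(s): {', '.join(missing)}. Try: {hints}"
        )


# In a real run, `installed` would come from `SELECT extname FROM pg_extension`.
installed = {"plpgsql", "vector"}
ensure_installed(required_extensions_for_columns([Varchar(), Vector()]), installed)
```

In this sketch a trigram index on a `Varchar` column is the only thing that asks for `pg_trgm`, since the requirement comes from the operator class rather than from the column type.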