A Clojure library for multi-space vector embedding, similarity search, and retrieval. Compose numeric, categorical, time-based, and text embedding spaces into a single index, then query with weighted similarity across all of them.
- Java 21+
- Leiningen
Add the dependency to your `project.clj`:

```clojure
[superlinked/superlinked-clj "0.1.0-SNAPSHOT"]
```

Everything is available through a single namespace:

```clojure
(require '[superlinked.core :as sl])
```
```clojure
;; 1. Define spaces
(def idx
  (sl/index
    (sl/number-space :price :min 0 :max 100)
    (sl/categorical-space :color [:red :green :blue])))

;; 2. Ingest data
(sl/ingest-all idx [{:id "a" :price 50 :color :red}
                    {:id "b" :price 30 :color :green}
                    {:id "c" :price 90 :color :blue}])

;; 3. Query
(-> (sl/query idx)
    (sl/similar :price 50)
    (sl/similar :color :red :weight 2.0)
    (sl/limit 5)
    sl/execute)
;; => ({:id "a" :score 0.99 :entity {...}} ...)
```

For semantic text search, include a text space with a neural embedding model:
```clojure
(def text-idx
  (sl/index
    (sl/text-space :description "djl://ai.djl.huggingface.pytorch/sentence-transformers/all-MiniLM-L6-v2")
    (sl/categorical-space :category [:electronics :books :clothing])))

(sl/ingest-all text-idx
               [{:id "laptop" :description "High-performance gaming laptop" :category :electronics}
                {:id "novel" :description "Science fiction adventure story" :category :books}
                {:id "shirt" :description "Cotton casual wear" :category :clothing}])

(-> (sl/query text-idx)
    (sl/similar :description "computer for gaming")
    (sl/limit 3)
    sl/execute)
;; => ({:id "laptop" :score 0.85 :entity {...}} ...)

;; Remember to close when done
(sl/close text-idx)
```

Each space encodes one field of your entities into a vector representation.
| Constructor | Description | Key options |
|---|---|---|
| `number-space` | Sin/cos quarter-circle encoding for numeric values | `:min`, `:max`, `:mode` (`:similar`, `:maximum`, `:minimum`), `:log-scale?` |
| `categorical-space` | N-hot encoding for categorical values | Second arg is the list of valid categories |
| `recency-space` | Sin/cos periodic encoding for timestamps | `:periods`, `:weights`, `:max-period-secs` |
| `text-space` | Neural text embeddings via DJL (PyTorch) | Second arg is a DJL model URL |
All constructors accept keyword or string names. The entity field to read from defaults to `(keyword name)` but can be overridden with `:field`:

```clojure
(sl/number-space :cost :min 0 :max 1000 :field :item-price)
```

Create an index from one or more spaces. Field mappings are derived automatically from space metadata.
```clojure
;; Varargs
(sl/index space1 space2 space3)

;; Or pass a vector
(sl/index [space1 space2 space3])
```

Every entity must have an `:id` key. Other keys are matched to spaces by field mapping.
```clojure
(sl/ingest idx {:id "x" :price 42 :color :red})
(sl/ingest-all idx [{...} {...}])
```

Both return the index, so you can chain calls. Each ingested entry is automatically timestamped with the current wall-clock time.
Remove entries older than a given age in milliseconds:
```clojure
;; Drop entries older than 7 days
(sl/drop-older-than idx (* 7 24 60 60 1000))
;; => 3 (number of entries removed)

;; Drop entries older than 1 hour
(sl/drop-older-than idx (* 60 60 1000))
```

`drop-older-than` compares each entry's ingestion timestamp against `(System/currentTimeMillis)` and removes those that exceed the age threshold. It returns the count of entries removed. Entries loaded from legacy files that lack a timestamp are treated as infinitely old and will be dropped.
Build queries with a fluent API:
```clojure
(-> (sl/query idx)
    (sl/similar :price 50)                    ;; query value for the :price space
    (sl/similar :color :red :weight 2.0)      ;; with custom weight
    (sl/where :category #(= % :electronics))  ;; hard filter
    (sl/radius 0.8)                           ;; minimum similarity threshold
    (sl/limit 10)                             ;; max results
    sl/execute)
```

Results are a seq of `{:id, :score, :entity}` maps sorted by score descending.
Save and load indexes to JSONL files:
```clojure
;; Save
(sl/save idx "index.jsonl")

;; Load using the existing index as schema
(sl/load-index "index.jsonl" idx)

;; Or provide fresh spaces (cold start)
(sl/load-index "index.jsonl"
               (sl/number-space :price :min 0 :max 100)
               (sl/categorical-space :color [:red :green :blue]))

;; Or pass a vector of spaces
(sl/load-index "index.jsonl" [price-space color-space])
```

If your index includes text spaces (which hold native model resources), close them when done:

```clojure
(sl/close idx)
```

This is a no-op for indexes without text spaces.
```
superlinked-clj/
  src/superlinked/
    core.clj             -- Public API (single-import facade)
    index.clj            -- Multi-space index and ingestion
    query.clj            -- Query builder and KNN search
    io.clj               -- JSONL persistence
    vector.clj           -- SLVector data structure
    normalization.clj    -- Normalization strategies (L2, L1, categorical)
    similarity.clj       -- Cosine similarity
    aggregation.clj      -- Weighted sum/average, min/max
    schema.clj           -- Schema definitions
    util.clj             -- Utility functions
    space/
      protocol.clj       -- Space protocol definition
      number.clj         -- NumberSpace
      categorical.clj    -- CategoricalSimilaritySpace
      recency.clj        -- RecencySpace
      text.clj           -- TextSimilaritySpace (DJL/PyTorch)
    embedding/
      text_engine.clj    -- DJL text embedding engine
  dev/superlinked/
    scripts/
      download_model.clj -- Model download script (dev-only, not in JAR)
  test/superlinked/      -- Test suite
  resources/
    embedding_models/    -- Cached model files
  project.clj
```
The `dev/` directory is on the classpath only during development (via the `:dev` Lein profile). It is not included in the published JAR.
superlinked-clj turns heterogeneous entity fields (numbers, categories, timestamps, free text) into a single composite vector per entity, then ranks entities by cosine similarity against a composite query vector. The pipeline is: embed per-space -> normalize -> concatenate -> score.
Every module is a plain `.clj` file with no macros beyond `defrecord`, `defprotocol`, and `defmulti`. State is confined to a single atom inside each `Index`. All vector math operates on Java `double` arrays wrapped in a defrecord, keeping allocations visible and GC-friendly.
`SLVector` (`vector.clj`) is the fundamental data structure. It is a defrecord with three fields:

```clojure
(defrecord SLVector [^doubles values negative-filter-indices ^double denormalizer])
```

- `values` -- a Java `double` array holding the raw embedding dimensions. Using a primitive array avoids boxing overhead on every arithmetic operation; type hints (`^doubles`) let the JIT emit direct array loads.
- `negative-filter-indices` -- a Clojure set of dimension indices that act as sentinel channels. Spaces use these to mark "this entity has no value here", and the query pipeline can zero them out to exclude entries. This keeps filtering inside the vector dot product rather than requiring a separate filter pass.
- `denormalizer` -- a scalar that records the original magnitude before normalization. After L2-normalizing a vector to unit length, the original norm is stashed here so the operation can be reversed later (e.g. for persistence or aggregation).
Arithmetic helpers (`vmul`, `vdiv`, `vadd`, `vsub`, `vconcatenate`, `vsplit`) are implemented with `dotimes` loops over the backing array. They return new `SLVector` instances, keeping the API immutable while the inner loop is mutable for speed. `vconcatenate` uses `System/arraycopy` and adjusts negative-filter offsets so that per-space vectors can be stitched into a single composite without losing sentinel information.
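The pattern these helpers follow can be sketched in plain Clojure. This is an illustrative sketch only; the names mirror the library's but the real implementations live in `vector.clj` and also carry the negative-filter and denormalizer bookkeeping:

```clojure
;; Illustrative sketch of the dotimes-over-double-array pattern,
;; not the library's actual implementation (no SLVector wrapping here).
(defn vadd* [^doubles a ^doubles b]
  (let [n   (alength a)
        out (double-array n)]
    (dotimes [i n]
      (aset out i (+ (aget a i) (aget b i))))
    out))

(defn vconcat* [^doubles a ^doubles b]
  (let [na  (alength a)
        nb  (alength b)
        out (double-array (+ na nb))]
    ;; System/arraycopy avoids an element-by-element loop
    (System/arraycopy a 0 out 0 na)
    (System/arraycopy b 0 out na nb)
    out))

(seq (vadd* (double-array [1 2]) (double-array [3 4])))   ;; => (4.0 6.0)
(seq (vconcat* (double-array [1 2]) (double-array [3])))  ;; => (1.0 2.0 3.0)
```

The mutation is confined to the freshly allocated `out` array, so callers still see a pure function.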
The `Space` protocol (`space/protocol.clj`) defines the contract every embedding space must satisfy:

```clojure
(defprotocol Space
  (space-name [this])
  (dimension [this])
  (embed [this value])
  (embed-query [this value])
  (normalization-type [this]))
```

`embed` and `embed-query` are separated because some spaces are asymmetric -- the encoding of a stored entity may differ from the encoding of a query value. For example, `CategoricalSimilaritySpace` scales stored embeddings by `1/sqrt(n)` but query embeddings by `sqrt(n)/n` so that the dot product of an exact match equals 1.0. Symmetric spaces (like `TextSimilaritySpace`) implement both identically.
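A quick standalone arithmetic check of that scaling (not library code): with `n` active categories, the stored slots hold `1/sqrt(n)` and the query slots `sqrt(n)/n`, which is the same value, so an exact match dots to exactly 1.0:

```clojure
;; Standalone check of the stored/query categorical scaling; not library code.
(defn dot [a b] (reduce + (map * a b)))

(let [n      4
      stored (repeat n (/ 1.0 (Math/sqrt n)))  ;; entity side: 1/sqrt(n)
      query  (repeat n (/ (Math/sqrt n) n))]   ;; query side: sqrt(n)/n = 1/sqrt(n)
  (dot stored query))
;; => 1.0
```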
`normalization-type` returns a keyword (`:l2`, `:categorical`, `:none`, etc.) that the normalization multimethod dispatches on. This keeps normalization logic out of individual spaces.
Four implementations exist:
| Space | Encoding | Dimensions | Normalization |
|---|---|---|---|
| `NumberSpace` | Sin/cos quarter-circle: value in [0,1] maps to `[sin(v*pi/2), cos(v*pi/2), 0]`. The third dimension is a negative-filter sentinel. | 3 | `:l2` |
| `CategoricalSimilaritySpace` | N-hot vector: one slot per category plus a sentinel dimension. Active categories get `1/sqrt(active-count)`. | n_categories + 1 | `:categorical` |
| `RecencySpace` | Sin/cos periodic encoding of `(now - timestamp)` across configurable period lengths, weighted per period. Plus a sentinel. | 2 * n_periods + 1 | `:l2` |
| `TextSimilaritySpace` | Delegates to a neural model via DJL. Outputs a dense float vector (384 dims for all-MiniLM-L6-v2). | model-dependent | `:l2` |
Why sin/cos encoding for numbers? A raw scalar in a dot product is linear -- doubling a value doubles the contribution. The quarter-circle mapping [sin(v*pi/2), cos(v*pi/2)] curves the relationship: values near each other on the [0,1] range produce vectors with high cosine similarity, while distant values diverge. This gives a smooth, bounded similarity signal that composes well when concatenated with other spaces.
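A standalone sketch (not the library's code) makes the effect concrete: the dot product of two quarter-circle encodings collapses to `cos((a-b) * pi/2)`, so nearby values score high and distant values score low:

```clojure
;; Standalone sketch of quarter-circle encoding; not library code.
(defn encode [v]
  [(Math/sin (* v (/ Math/PI 2)))
   (Math/cos (* v (/ Math/PI 2)))])

(defn dot [a b] (reduce + (map * a b)))

;; Both encodings are unit vectors, so the dot product IS cosine similarity.
(dot (encode 0.50) (encode 0.55))  ;; ~0.997 -- nearby values stay similar
(dot (encode 0.10) (encode 0.90))  ;; ~0.309 -- distant values diverge
```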
Why defrecord for spaces? Each space carries immutable configuration (min/max bounds, category lists, model handles) that should live alongside the protocol methods. defrecord gives field access with no map-lookup overhead, implements the protocol inline, and plays well with Java interop (the DJL model objects are Java classes).
Text embedding follows a two-layer design:
- `text-engine.clj` -- a thin wrapper around DJL (Deep Java Library). `create-engine` builds a `Criteria` object specifying the model URL, the PyTorch backend, and a `TextEmbeddingTranslatorFactory` that handles tokenization. It then calls `.loadModel` to download/cache the model and `.newPredictor` to obtain an inference handle. The engine is stored as a plain map `{:model ... :predictor ...}`.
- `TextSimilaritySpace` (`space/text.clj`) -- implements the `Space` protocol by calling `te/embed-text`, which runs `.predict` on the DJL predictor and converts the resulting `float[]` to a `double` array wrapped in an `SLVector`.
```
DJL model URL
  -> Criteria   (PyTorch engine, TextEmbeddingTranslatorFactory)
  -> ZooModel   (.loadModel -- downloads/caches weights)
  -> Predictor  (.newPredictor -- inference handle)
  -> .predict(String) -> float[]
  -> double-array -> SLVector
```
The model URL scheme `djl://ai.djl.huggingface.pytorch/...` tells DJL to fetch from Hugging Face's model hub. On first use DJL downloads the ONNX/PyTorch weights to a local cache (`~/.djl.ai/`); subsequent loads are instant. `embedding-dimension` is discovered at creation time by embedding an empty string and checking the array length.

`close-engine` releases both the predictor and the model. The top-level `sl/close` walks all spaces in an index and closes any `TextSimilaritySpace` instances, making cleanup a single call.
`Index` (`index.clj`) is a defrecord with three fields:

```clojure
(defrecord Index [spaces field-mappings entries])
```

- `spaces` -- an ordered vector of `Space` instances. Order matters: it determines the layout of the composite vector.
- `field-mappings` -- a map from space name to entity field keyword, derived automatically from space metadata at construction time. This lets the index know that space `"price"` should read from `:price` on each entity.
- `entries` -- an atom holding `{id -> {:id, :entity, :vector, :per-space}}`. The atom is the sole piece of mutable state. Using a Clojure atom means ingestion is thread-safe (compare-and-swap) and reads are lock-free.
Ingestion pipeline (`ingest`):

1. For each space, extract the entity field value via `field-mappings`.
2. Call `proto/embed` to get a per-space `SLVector`.
3. Normalize it according to the space's `normalization-type` (dispatched via `defmulti`).
4. Concatenate all per-space vectors into a single composite vector with `vconcatenate`.
5. `swap!` the atom to store the entry.
Per-space vectors are kept alongside the composite so that persistence and debugging can inspect individual space contributions.
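The flow can be sketched with stub spaces -- plain maps of functions standing in for the real `Space` protocol (all names here are hypothetical, and the real code produces `SLVector` values, not plain vectors):

```clojure
;; Sketch of the ingest flow using stub spaces; the library's real code
;; dispatches through the Space protocol and SLVector, not plain maps.
(defn ingest* [{:keys [spaces field-mappings entries]} entity]
  (let [per-space (mapv (fn [{:keys [name embed normalize]}]
                          (normalize (embed (get entity (field-mappings name)))))
                        spaces)
        ;; concatenate per-space vectors into one composite vector
        composite (vec (apply concat per-space))]
    (swap! entries assoc (:id entity)
           {:id (:id entity) :entity entity
            :vector composite :per-space per-space})))

(def demo-idx
  {:spaces         [{:name      "price"
                     :embed     (fn [v] [(double v)])
                     :normalize identity}]
   :field-mappings {"price" :price}
   :entries        (atom {})})

(ingest* demo-idx {:id "a" :price 42})
@(:entries demo-idx)
;; => {"a" {:id "a", :entity {:id "a", :price 42},
;;          :vector [42.0], :per-space [[42.0]]}}
```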
`normalization.clj` uses a `defmulti` dispatching on a keyword (`:l2`, `:l1`, `:categorical`, `:constant`, `:none`). Each method takes an `SLVector` and returns a new one with unit norm and the original magnitude stashed in `:denormalizer`.
All methods guard against zero-magnitude vectors (norm < 1e-12) by returning the vector unchanged with `denormalizer` set to 0.0. This prevents NaN propagation from a divide-by-zero during normalization.
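A minimal sketch of the L2 strategy with that guard, simplified to plain vectors (the library's methods operate on `SLVector`):

```clojure
;; Simplified L2 normalization with the zero-magnitude guard;
;; the library's version works on SLVector, not plain vectors.
(defn l2-normalize [xs]
  (let [norm (Math/sqrt (reduce + (map * xs xs)))]
    (if (< norm 1e-12)
      {:values xs :denormalizer 0.0}        ;; guard: avoid 0/0 -> NaN
      {:values       (mapv #(/ % norm) xs)  ;; scale to unit length
       :denormalizer norm})))               ;; stash norm for later reversal

(l2-normalize [3.0 4.0])  ;; => {:values [0.6 0.8], :denormalizer 5.0}
(l2-normalize [0.0 0.0])  ;; => {:values [0.0 0.0], :denormalizer 0.0}
```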
The choice of defmulti over a protocol here is deliberate: normalization is a cross-cutting concern that doesn't belong to any single space type, and new strategies can be added by defining a new defmethod without touching existing code.
The query builder (`query.clj`) uses an immutable `Query` record that accumulates parameters through a chain of functions (`with-similar`, `with-weight`, `with-filter`, `with-k`, `with-radius`). `core.clj` wraps these in a user-facing fluent API.
`execute` builds a query vector through the same embed-normalize pipeline as ingestion (but using `embed-query` instead of `embed`), multiplies each per-space vector by its weight, concatenates, then scores every stored entry via brute-force cosine similarity. Results are sorted descending and truncated to `k`.
Brute-force KNN is a deliberate choice for the current scope: it has zero indexing overhead, is correct by construction, and performs well for datasets up to low tens of thousands of entries. The search is a single map + sort-by over the entries atom, with no locking.
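In outline, the search looks like this (a sketch with plain vectors standing in for the library's `SLVector` types):

```clojure
;; Sketch of brute-force KNN: score every entry with a dot product,
;; sort descending, truncate to k. Plain vectors for illustration.
(defn dot [a b] (reduce + (map * a b)))

(defn knn [entries qvec k]
  (->> (vals entries)
       (map (fn [e] (assoc e :score (dot qvec (:vector e)))))
       (sort-by :score >)
       (take k)))

(knn {"a" {:id "a" :vector [1.0 0.0]}
      "b" {:id "b" :vector [0.0 1.0]}
      "c" {:id "c" :vector [0.7 0.7]}}
     [1.0 0.0]  ;; query vector
     2)
;; => ({:id "a", :vector [1.0 0.0], :score 1.0}
;;     {:id "c", :vector [0.7 0.7], :score 0.7})
```

Because the stored vectors are pre-normalized, the dot product here is cosine similarity, and the whole search is one pass over the entries map with no locking.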
`io.clj` serializes indexes to JSONL (one JSON object per line). The first line is a metadata header with space dimensions and field mappings. Each subsequent line stores an entry with its entity serialized as an EDN string (preserving Clojure keywords, sets, etc.) and vectors as JSON arrays. On load, vector dimensions are validated against the provided space schema to catch mismatches early.
Run the test suite and build a jar with Leiningen:

```shell
lein test
lein jar
```
MIT