A Clojure library for multi-space vector embedding, similarity search, and retrieval. Compose numeric, categorical, time-based, and text embedding spaces into a single index, then query with weighted similarity across all of them.
- Java 21+
- Leiningen
Add the dependency to your `project.clj`:

```clojure
[superlinked/superlinked-clj "0.1.0-SNAPSHOT"]
```

Everything is available through a single namespace:

```clojure
(require '[superlinked.core :as sl])
```
```clojure
;; 1. Define spaces
(def idx
  (sl/index
    (sl/number-space :price :min 0 :max 100)
    (sl/categorical-space :color [:red :green :blue])))

;; 2. Ingest data
(sl/ingest-all idx [{:id "a" :price 50 :color :red}
                    {:id "b" :price 30 :color :green}
                    {:id "c" :price 90 :color :blue}])

;; 3. Query
(-> (sl/query idx)
    (sl/similar :price 50)
    (sl/similar :color :red :weight 2.0)
    (sl/limit 5)
    sl/execute)
;; => ({:id "a" :score 0.99 :entity {...}} ...)
```

For semantic text search, include a text space with a neural embedding model:
```clojure
(def text-idx
  (sl/index
    (sl/text-space :description "djl://ai.djl.huggingface.pytorch/sentence-transformers/all-MiniLM-L6-v2")
    (sl/categorical-space :category [:electronics :books :clothing])))

(sl/ingest-all text-idx
               [{:id "laptop" :description "High-performance gaming laptop" :category :electronics}
                {:id "novel" :description "Science fiction adventure story" :category :books}
                {:id "shirt" :description "Cotton casual wear" :category :clothing}])

(-> (sl/query text-idx)
    (sl/similar :description "computer for gaming")
    (sl/limit 3)
    sl/execute)
;; => ({:id "laptop" :score 0.85 :entity {...}} ...)

;; Remember to close when done
(sl/close text-idx)
```

Each space encodes one field of your entities into a vector representation.
| Constructor | Description | Key options |
|---|---|---|
| `number-space` | Sin/cos quarter-circle encoding for numeric values | `:min`, `:max`, `:mode` (`:similar`, `:maximum`, `:minimum`), `:log-scale?` |
| `categorical-space` | N-hot encoding for categorical values | Second arg is the list of valid categories |
| `recency-space` | Sin/cos periodic encoding for timestamps | `:periods`, `:weights`, `:max-period-secs` |
| `text-space` | Neural text embeddings via DJL (PyTorch) | Second arg is a DJL model URL |
All constructors accept keyword or string names. The entity field to read from defaults to `(keyword name)` but can be overridden with `:field`:

```clojure
(sl/number-space :cost :min 0 :max 1000 :field :item-price)
```

Create an index from one or more spaces. Field mappings are derived automatically from space metadata.
```clojure
;; Varargs
(sl/index space1 space2 space3)

;; Or pass a vector
(sl/index [space1 space2 space3])
```

Every entity must have an `:id` key. Other keys are matched to spaces by field mapping.
```clojure
(sl/ingest idx {:id "x" :price 42 :color :red})
(sl/ingest-all idx [{...} {...}])
```

Both return the index, so you can chain calls. Each ingested entry is automatically timestamped with the current wall-clock time.
Remove entries older than a given age in milliseconds:
```clojure
;; Drop entries older than 7 days
(sl/drop-older-than idx (* 7 24 60 60 1000))
;; => 3 (number of entries removed)

;; Drop entries older than 1 hour
(sl/drop-older-than idx (* 60 60 1000))
```

`drop-older-than` compares each entry's ingestion timestamp against `(System/currentTimeMillis)` and removes those that exceed the age threshold. It returns the count of entries removed. Entries loaded from legacy files that lack a timestamp are treated as infinitely old and will be dropped.
Build queries with a fluent API:
```clojure
(-> (sl/query idx)
    (sl/similar :price 50)                    ;; query value for the :price space
    (sl/similar :color :red :weight 2.0)      ;; with custom weight
    (sl/where :category #(= % :electronics))  ;; hard filter
    (sl/radius 0.8)                           ;; minimum similarity threshold
    (sl/limit 10)                             ;; max results
    sl/execute)
```

Results are a seq of `{:id, :score, :entity}` maps sorted by score descending.
Save and load indexes to JSONL files:
```clojure
;; Save
(sl/save idx "index.jsonl")

;; Load using the existing index as schema
(sl/load-index "index.jsonl" idx)

;; Or provide fresh spaces (cold start)
(sl/load-index "index.jsonl"
               (sl/number-space :price :min 0 :max 100)
               (sl/categorical-space :color [:red :green :blue]))

;; Or pass a vector of spaces
(sl/load-index "index.jsonl" [price-space color-space])
```

If your index includes text spaces (which hold native model resources), close them when done:

```clojure
(sl/close idx)
```

This is a no-op for indexes without text spaces.
```
superlinked-clj/
  src/superlinked/
    core.clj             -- Public API (single-import facade)
    index.clj            -- Multi-space index and ingestion
    query.clj            -- Query builder and KNN search
    io.clj               -- JSONL persistence
    vector.clj           -- SLVector data structure
    normalization.clj    -- Normalization strategies (L2, L1, categorical)
    similarity.clj       -- Cosine similarity
    aggregation.clj      -- Weighted sum/average, min/max
    schema.clj           -- Schema definitions
    util.clj             -- Utility functions
    space/
      protocol.clj       -- Space protocol definition
      number.clj         -- NumberSpace
      categorical.clj    -- CategoricalSimilaritySpace
      recency.clj        -- RecencySpace
      text.clj           -- TextSimilaritySpace (DJL/PyTorch)
    embedding/
      text_engine.clj    -- DJL text embedding engine
  dev/superlinked/
    scripts/
      download_model.clj -- Model download script (dev-only, not in JAR)
  test/superlinked/      -- Test suite
  resources/
    embedding_models/    -- Cached model files
  project.clj
```
The `dev/` directory is on the classpath only during development (via the `:dev` Lein profile). It is not included in the published JAR.
superlinked-clj turns heterogeneous entity fields (numbers, categories, timestamps, free text) into a single composite vector per entity, then ranks entities by cosine similarity against a composite query vector. The pipeline is: embed per-space -> normalize -> concatenate -> score.
Every module is a plain `.clj` file with no macros beyond `defrecord`, `defprotocol`, and `defmulti`. State is confined to a single atom inside each `Index`. All vector math operates on Java `double` arrays wrapped in a defrecord, keeping allocations visible and GC-friendly.
`SLVector` (`vector.clj`) is the fundamental data structure. It is a defrecord with three fields:

```clojure
(defrecord SLVector [^doubles values negative-filter-indices ^double denormalizer])
```

- `values` -- a Java `double` array holding the raw embedding dimensions. Using a primitive array avoids boxing overhead on every arithmetic operation; type hints (`^doubles`) let the JIT emit direct array loads.
- `negative-filter-indices` -- a Clojure set of dimension indices that act as sentinel channels. Spaces use these to mark "this entity has no value here", and the query pipeline can zero them out to exclude entries. This keeps filtering inside the vector dot product rather than requiring a separate filter pass.
- `denormalizer` -- a scalar that records the original magnitude before normalization. After L2-normalizing a vector to unit length, the original norm is stashed here so the operation can be reversed later (e.g. for persistence or aggregation).
Arithmetic helpers (`vmul`, `vdiv`, `vadd`, `vsub`, `vconcatenate`, `vsplit`) are implemented with `dotimes` loops over the backing array. They return new `SLVector` instances, keeping the API immutable while the inner loop is mutable for speed. `vconcatenate` uses `System/arraycopy` and adjusts negative-filter offsets so that per-space vectors can be stitched into a single composite without losing sentinel information.
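The pattern these helpers follow can be sketched in plain Clojure. This is an illustrative sketch only; the names mirror the library's but the real implementations live in `vector.clj` and also carry the negative-filter and denormalizer bookkeeping:

```clojure
;; Illustrative sketch of the dotimes-over-double-array pattern,
;; not the library's actual implementation (no SLVector wrapping here).
(defn vadd* [^doubles a ^doubles b]
  (let [n   (alength a)
        out (double-array n)]
    (dotimes [i n]
      (aset out i (+ (aget a i) (aget b i))))
    out))

(defn vconcat* [^doubles a ^doubles b]
  (let [na  (alength a)
        nb  (alength b)
        out (double-array (+ na nb))]
    ;; System/arraycopy avoids an element-by-element loop
    (System/arraycopy a 0 out 0 na)
    (System/arraycopy b 0 out na nb)
    out))

(seq (vadd* (double-array [1 2]) (double-array [3 4])))   ;; => (4.0 6.0)
(seq (vconcat* (double-array [1 2]) (double-array [3])))  ;; => (1.0 2.0 3.0)
```

The mutation is confined to the freshly allocated `out` array, so callers still see a pure function.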
The `Space` protocol (`space/protocol.clj`) defines the contract every embedding space must satisfy:

```clojure
(defprotocol Space
  (space-name [this])
  (dimension [this])
  (embed [this value])
  (embed-query [this value])
  (normalization-type [this]))
```

`embed` and `embed-query` are separated because some spaces are asymmetric -- the encoding of a stored entity may differ from the encoding of a query value. For example, `CategoricalSimilaritySpace` scales stored embeddings by `1/sqrt(n)` but query embeddings by `sqrt(n)/n` so that the dot product of an exact match equals 1.0. Symmetric spaces (like `TextSimilaritySpace`) implement both identically.
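A quick standalone arithmetic check of that scaling (not library code): with `n` active categories, the stored slots hold `1/sqrt(n)` and the query slots `sqrt(n)/n`, which is the same value, so an exact match dots to exactly 1.0:

```clojure
;; Standalone check of the stored/query categorical scaling; not library code.
(defn dot [a b] (reduce + (map * a b)))

(let [n      4
      stored (repeat n (/ 1.0 (Math/sqrt n)))  ;; entity side: 1/sqrt(n)
      query  (repeat n (/ (Math/sqrt n) n))]   ;; query side: sqrt(n)/n = 1/sqrt(n)
  (dot stored query))
;; => 1.0
```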
`normalization-type` returns a keyword (`:l2`, `:categorical`, `:none`, etc.) that the normalization multimethod dispatches on. This keeps normalization logic out of individual spaces.
Four implementations exist:
| Space | Encoding | Dimensions | Normalization |
|---|---|---|---|
| `NumberSpace` | Sin/cos quarter-circle: value in [0,1] maps to `[sin(v*pi/2), cos(v*pi/2), 0]`. The third dimension is a negative-filter sentinel. | 3 | `:l2` |
| `CategoricalSimilaritySpace` | N-hot vector: one slot per category plus a sentinel dimension. Active categories get `1/sqrt(active-count)`. | n_categories + 1 | `:categorical` |
| `RecencySpace` | Sin/cos periodic encoding of `(now - timestamp)` across configurable period lengths, weighted per period. Plus a sentinel. | 2 * n_periods + 1 | `:l2` |
| `TextSimilaritySpace` | Delegates to a neural model via DJL. Outputs a dense float vector (384 dims for all-MiniLM-L6-v2). | model-dependent | `:l2` |
Why sin/cos encoding for numbers? A raw scalar in a dot product is linear -- doubling a value doubles the contribution. The quarter-circle mapping [sin(v*pi/2), cos(v*pi/2)] curves the relationship: values near each other on the [0,1] range produce vectors with high cosine similarity, while distant values diverge. This gives a smooth, bounded similarity signal that composes well when concatenated with other spaces.
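A standalone sketch (not the library's code) makes the effect concrete: the dot product of two quarter-circle encodings collapses to `cos((a-b) * pi/2)`, so nearby values score high and distant values score low:

```clojure
;; Standalone sketch of quarter-circle encoding; not library code.
(defn encode [v]
  [(Math/sin (* v (/ Math/PI 2)))
   (Math/cos (* v (/ Math/PI 2)))])

(defn dot [a b] (reduce + (map * a b)))

;; Both encodings are unit vectors, so the dot product IS cosine similarity.
(dot (encode 0.50) (encode 0.55))  ;; ~0.997 -- nearby values stay similar
(dot (encode 0.10) (encode 0.90))  ;; ~0.309 -- distant values diverge
```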
Why defrecord for spaces? Each space carries immutable configuration (min/max bounds, category lists, model handles) that should live alongside the protocol methods. defrecord gives field access with no map-lookup overhead, implements the protocol inline, and plays well with Java interop (the DJL model objects are Java classes).
Text embedding follows a two-layer design:
- `text-engine.clj` -- a thin wrapper around DJL (Deep Java Library). `create-engine` builds a `Criteria` object specifying the model URL, the PyTorch backend, and a `TextEmbeddingTranslatorFactory` that handles tokenization. It then calls `.loadModel` to download/cache the model and `.newPredictor` to obtain an inference handle. The engine is stored as a plain map `{:model ... :predictor ...}`.
- `TextSimilaritySpace` (`space/text.clj`) -- implements the `Space` protocol by calling `te/embed-text`, which runs `.predict` on the DJL predictor and converts the resulting `float[]` to a `double` array wrapped in an `SLVector`.
```
DJL model URL
  -> Criteria   (PyTorch engine, TextEmbeddingTranslatorFactory)
  -> ZooModel   (.loadModel -- downloads/caches weights)
  -> Predictor  (.newPredictor -- inference handle)
  -> .predict(String) -> float[]
  -> double-array -> SLVector
```
The model URL scheme `djl://ai.djl.huggingface.pytorch/...` tells DJL to fetch from Hugging Face's model hub. On first use DJL downloads the ONNX/PyTorch weights to a local cache (`~/.djl.ai/`); subsequent loads are instant. `embedding-dimension` is discovered at creation time by embedding an empty string and checking the array length.

`close-engine` releases both the predictor and the model. The top-level `sl/close` walks all spaces in an index and closes any `TextSimilaritySpace` instances, making cleanup a single call.
`Index` (`index.clj`) is a defrecord with three fields:

```clojure
(defrecord Index [spaces field-mappings entries])
```

- `spaces` -- an ordered vector of `Space` instances. Order matters: it determines the layout of the composite vector.
- `field-mappings` -- a map from space name to entity field keyword, derived automatically from space metadata at construction time. This lets the index know that space `"price"` should read from `:price` on each entity.
- `entries` -- an atom holding `{id -> {:id, :entity, :vector, :per-space}}`. The atom is the sole piece of mutable state. Using a Clojure atom means ingestion is thread-safe (compare-and-swap) and reads are lock-free.
Ingestion pipeline (`ingest`):

1. For each space, extract the entity field value via `field-mappings`.
2. Call `proto/embed` to get a per-space `SLVector`.
3. Normalize it according to the space's `normalization-type` (dispatched via `defmulti`).
4. Concatenate all per-space vectors into a single composite vector with `vconcatenate`.
5. `swap!` the atom to store the entry.
Per-space vectors are kept alongside the composite so that persistence and debugging can inspect individual space contributions.
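The flow can be sketched with stub spaces -- plain maps of functions standing in for the real `Space` protocol (all names here are hypothetical, and the real code produces `SLVector` values, not plain vectors):

```clojure
;; Sketch of the ingest flow using stub spaces; the library's real code
;; dispatches through the Space protocol and SLVector, not plain maps.
(defn ingest* [{:keys [spaces field-mappings entries]} entity]
  (let [per-space (mapv (fn [{:keys [name embed normalize]}]
                          (normalize (embed (get entity (field-mappings name)))))
                        spaces)
        ;; concatenate per-space vectors into one composite vector
        composite (vec (apply concat per-space))]
    (swap! entries assoc (:id entity)
           {:id (:id entity) :entity entity
            :vector composite :per-space per-space})))

(def demo-idx
  {:spaces         [{:name      "price"
                     :embed     (fn [v] [(double v)])
                     :normalize identity}]
   :field-mappings {"price" :price}
   :entries        (atom {})})

(ingest* demo-idx {:id "a" :price 42})
@(:entries demo-idx)
;; => {"a" {:id "a", :entity {:id "a", :price 42},
;;          :vector [42.0], :per-space [[42.0]]}}
```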
`normalization.clj` uses a `defmulti` dispatching on a keyword (`:l2`, `:l1`, `:categorical`, `:constant`, `:none`). Each method takes an `SLVector` and returns a new one with unit norm and the original magnitude stashed in `:denormalizer`.
All methods guard against zero-magnitude vectors (norm < 1e-12) by returning the vector unchanged with `denormalizer` set to 0.0. This prevents NaN propagation from a divide-by-zero during normalization.
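A minimal sketch of the L2 strategy with that guard, simplified to plain vectors (the library's methods operate on `SLVector`):

```clojure
;; Simplified L2 normalization with the zero-magnitude guard;
;; the library's version works on SLVector, not plain vectors.
(defn l2-normalize [xs]
  (let [norm (Math/sqrt (reduce + (map * xs xs)))]
    (if (< norm 1e-12)
      {:values xs :denormalizer 0.0}        ;; guard: avoid 0/0 -> NaN
      {:values       (mapv #(/ % norm) xs)  ;; scale to unit length
       :denormalizer norm})))               ;; stash norm for later reversal

(l2-normalize [3.0 4.0])  ;; => {:values [0.6 0.8], :denormalizer 5.0}
(l2-normalize [0.0 0.0])  ;; => {:values [0.0 0.0], :denormalizer 0.0}
```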
The choice of defmulti over a protocol here is deliberate: normalization is a cross-cutting concern that doesn't belong to any single space type, and new strategies can be added by defining a new defmethod without touching existing code.
The query builder (`query.clj`) uses an immutable `Query` record that accumulates parameters through a chain of functions (`with-similar`, `with-weight`, `with-filter`, `with-k`, `with-radius`). `core.clj` wraps these in a user-facing fluent API.
`execute` builds a query vector through the same embed-normalize pipeline as ingestion (but using `embed-query` instead of `embed`), multiplies each per-space vector by its weight, concatenates, then scores every stored entry via brute-force cosine similarity. Results are sorted descending and truncated to `k`.
Brute-force KNN is a deliberate choice for the current scope: it has zero indexing overhead, is correct by construction, and performs well for datasets up to low tens of thousands of entries. The search is a single map + sort-by over the entries atom, with no locking.
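In outline, the search looks like this (a sketch with plain vectors standing in for the library's `SLVector` types):

```clojure
;; Sketch of brute-force KNN: score every entry with a dot product,
;; sort descending, truncate to k. Plain vectors for illustration.
(defn dot [a b] (reduce + (map * a b)))

(defn knn [entries qvec k]
  (->> (vals entries)
       (map (fn [e] (assoc e :score (dot qvec (:vector e)))))
       (sort-by :score >)
       (take k)))

(knn {"a" {:id "a" :vector [1.0 0.0]}
      "b" {:id "b" :vector [0.0 1.0]}
      "c" {:id "c" :vector [0.7 0.7]}}
     [1.0 0.0]  ;; query vector
     2)
;; => ({:id "a", :vector [1.0 0.0], :score 1.0}
;;     {:id "c", :vector [0.7 0.7], :score 0.7})
```

Because the stored vectors are pre-normalized, the dot product here is cosine similarity, and the whole search is one pass over the entries map with no locking.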
`io.clj` serializes indexes to JSONL (one JSON object per line). The first line is a metadata header with space dimensions and field mappings. Each subsequent line stores an entry with its entity serialized as an EDN string (preserving Clojure keywords, sets, etc.) and vectors as JSON arrays. On load, vector dimensions are validated against the provided space schema to catch mismatches early.
Run the test suite and build a jar with Leiningen:

```shell
lein test
lein jar
```
MIT