From aaa2a9125a4104d57b57c97bba55ba33f7042424 Mon Sep 17 00:00:00 2001 From: Connor Tsui Date: Fri, 6 Mar 2026 11:03:41 -0500 Subject: [PATCH 01/17] first commit Signed-off-by: Connor Tsui --- proposed/0000-types.md | 74 +++++ proposed/vortex-type-theory-background.md | 349 ++++++++++++++++++++++ 2 files changed, 423 insertions(+) create mode 100644 proposed/0000-types.md create mode 100644 proposed/vortex-type-theory-background.md diff --git a/proposed/0000-types.md b/proposed/0000-types.md new file mode 100644 index 0000000..72bf95c --- /dev/null +++ b/proposed/0000-types.md @@ -0,0 +1,74 @@ +- Start Date: YYYY-MM-DD (today's date) +- RFC PR: [vortex-data/rfcs#0000](https://github.com/vortex-data/rfcs/pull/0000) + +# RFC Template + +## Summary + +One paragraph explanation of the proposed change. + +## Motivation + +What problem does this solve? Include concrete use cases where possible. + +- What specific use cases does this enable or improve? +- What workflows or operations are painful, slow, or impossible today? + +## Design + +Describe the proposed design in enough detail that someone familiar with Vortex could implement it. This should cover: + +- New or modified APIs, traits, or vtable entries. +- How this interacts with existing components (encodings, layouts, scan, file format, etc.). +- Key implementation details and corner cases. +- Why is this the best approach in the space of possible designs? +- Which crates are affected and how the dependency graph changes, if at all. + +Use code examples and diagrams where they might help, like this: + +```rust +pub fn main() { + let x = f32::to_bits(100.0f32); + dbg!(x); +} +``` + +## Compatibility + +- Does this change the file format or wire format? Is it backward/forward compatible? +- Does this break any public APIs? If so, what is the migration path? +- Are there performance implications? + +If there are no compatibility concerns, briefly state why. + +## Drawbacks + +- Why should we _not_ do this? +- What is the maintenance cost of this change? +- Does this add complexity that could be avoided? + +## Alternatives + +- What other designs were considered and why were they rejected? +- What is the cost of **not** doing this? +- Is there a simpler approach that gets us most of the way there? + +## Prior Art + +How have other systems solved this or similar problems? Consider: + +- Other columnar formats (Parquet, Arrow, etc.). +- Database internals (DuckDB, DataFusion, Velox, etc.). +- Relevant academic papers or blog posts. + +This section helps frame the design in a broader context. If there is no relevant prior art, that is fine. + +## Unresolved Questions + +- What parts of the design need to be resolved during the RFC process? +- What is explicitly out of scope for this RFC? +- Are there open questions that can be deferred to implementation? + +## Future Possibilities + +What natural extensions or follow-on work does this enable? This is a good place to note related ideas that are out of scope for this RFC but worth capturing. diff --git a/proposed/vortex-type-theory-background.md b/proposed/vortex-type-theory-background.md new file mode 100644 index 0000000..f5943eb --- /dev/null +++ b/proposed/vortex-type-theory-background.md @@ -0,0 +1,349 @@ +# Theoretical Background: Logical and Physical Types in Vortex + +## Purpose + +This document provides the type-theoretic foundations for reasoning about Vortex's `DType` system, its relationship to physical encodings, and the design space for canonicalization. It is intended as background material for an RFC on evolving how the Vortex type system handles canonical representations — and on whether additions like `FixedSizeBinary` belong in `DType`. + +--- + +## 1. Core Concepts + +### 1.1 Equivalence Classes + +An **equivalence class** is the set of all elements that are "the same" under some equivalence relation. An equivalence relation `~` on a set `S` must be reflexive (`a ~ a`), symmetric (`a ~ b ⟹ b ~ a`), and transitive (`a ~ b` and `b ~ c` ⟹ `a ~ c`). + +**In Vortex:** Consider the set of all physical array representations — dictionary-encoded int32 arrays, run-length-encoded int32 arrays, bitpacked int32 arrays, flat Arrow int32 arrays, etc. Two representations are equivalent if and only if they produce the same logical sequence of values when decoded. This equivalence relation partitions the space of all physical arrays into equivalence classes, where each class corresponds to a single logical column of data. + +The key insight is that a Vortex `DType` like `Primitive(I32, NonNullable)` _names_ one of these equivalence classes. It tells you what logical data you're working with, but says nothing about which physical representative you hold. + +**Further reading:** + +- Partition of a set / equivalence class: [https://en.wikipedia.org/wiki/Equivalence_class](https://en.wikipedia.org/wiki/Equivalence_class) + +### 1.2 Quotient Types + +A **quotient type** `A / ~` is a type formed by taking a base type `A` and collapsing it by an equivalence relation `~`. Elements of the quotient type are the equivalence classes themselves. Operations on the quotient type must be _well-defined_ — they can't depend on which representative you pick from within a class. + +Formally, if `f : A → B` is a function that respects the equivalence relation (`a ~ a' ⟹ f(a) = f(a')`), then `f` descends to a well-defined function on the quotient `f' : A/~ → B`. + +**In Vortex:** `DType` is a quotient type over the space of physical representations. The equivalence relation is decode-equality ("these two arrays decode to the same logical values"). Query operations like `filter`, `take`, `scalar_at` are well-defined on the quotient: their results depend only on the logical data, not on the encoding. This is exactly the requirement for a function to descend to the quotient. + +An operation that _doesn't_ descend to the quotient — say, "return the offset buffer" — is one that depends on the specific representative (ListView has offset-size pairs; List has monotonic offsets; dictionary encoding has neither). Such operations are physical-layer concerns and must live below the DType abstraction. + +**Further reading:** + +- Quotient types in type theory: [https://ncatlab.org/nlab/show/quotient+type](https://ncatlab.org/nlab/show/quotient+type) +- Quotient types in programming languages (Altenkirch, Anberree, Li): [https://arxiv.org/abs/1901.01006](https://arxiv.org/abs/1901.01006) + +### 1.3 Sections + +Given a surjection (a function that maps from a detailed representation to a collapsed one), a **section** is a right inverse: a function that picks one representative from each equivalence class. + +Concretely, if `π : Encoding → DType` is the projection that sends each physical encoding to its logical type, a section is a function `s : DType → Encoding` such that `π(s(d)) = d` for all DTypes `d`. In other words, `s` picks a specific physical representation for each logical type, and that representation actually has the right logical type. + +A section is also sometimes called a **choice function** or a **splitting** of the projection. + +**In Vortex:** The current `to_canonical` function is a section. It calls `execute` on a lazy array tree that defines the execution plan to reach a canonical form. For each `DType`, it selects exactly one physical form: + +| DType | Current canonical form (section) | +| ------------------- | ---------------------------------------------------------------------------------------- | +| `Primitive(I32, _)` | Flat Arrow-compatible buffer | +| `Utf8(_)` | VarBin with UTF-8 bytes | +| `List(T, _)` | ListView (offset-size pairs) | +| `Struct(fields, _)` | Struct with canonicalized top-level layout (children remain in their existing encodings) | + +Note that canonicalization is _shallow_ — it applies only to the outermost array, not recursively to children. A canonical `Struct` has its top-level field pointers in canonical form, but each child field may still be dictionary-encoded, run-length-encoded, etc. This is an important design choice: it means canonicalization is a local operation, not a full tree traversal. + +The critical property is that this is a _choice_. Nothing in the theory privileges ListView over List as the canonical representative for variable-length list data. Both are valid sections. The current system hardcodes one. + +**Further reading:** + +- Section in algebra / category theory: [https://en.wikipedia.org/wiki/Section\_(category_theory)]() +- Splitting lemma (analogous structure in group theory): [https://en.wikipedia.org/wiki/Splitting_lemma](https://en.wikipedia.org/wiki/Splitting_lemma) + +### 1.4 The Church-Rosser Property + +A rewriting system has the **Church-Rosser property** (or is **confluent**) if, whenever a term can be reduced in two different ways, both reduction paths can be continued to reach the same normal form. Equivalently: every term has at most one normal form. + +This is the property that makes "just keep simplifying until you're done" a well-defined strategy. If a system is Church-Rosser, the order in which you apply reduction rules doesn't matter — you always arrive at the same result. + +**In Vortex (current state):** The current `to_canonical` pathway is confluent by construction. No matter how deeply nested the encodings are (e.g., dictionary-of-run-length-of-bitpacked), executing the lazy canonicalization plan converges to the same canonical form. There's one normal form per DType, and every reduction path reaches it. This is clean and simple. + +**In Vortex (if we add multiple canonical targets):** We would be moving to a _non-confluent_ rewriting system. A `List(Int32)` encoded as dictionary-of-run-length could reduce to _either_ a flat `List` (contiguous monotonic offsets) or a `ListView` (offset-size pairs with potential sharing), depending on which reduction strategy is applied. The system no longer has a unique normal form per logical type. + +This isn't pathological — it's well-studied. The standard approach is to define **separate reduction relations**, each of which is internally confluent. For example, you might have one reduction strategy that targets ListView as the normal form for list types, and another that targets List. Each strategy independently satisfies Church-Rosser — within a given strategy, all paths converge. The strategies simply converge to _different_ normal forms. The choice of strategy is made by the consumer (the caller of `to_canonical`), not by the encoded array itself. + +**Further reading:** + +- Church-Rosser theorem: [https://en.wikipedia.org/wiki/Church%E2%80%93Rosser_theorem](https://en.wikipedia.org/wiki/Church%E2%80%93Rosser_theorem) +- Confluence in rewriting systems: [https://en.wikipedia.org/wiki/Confluence\_(abstract_rewriting)]() +- Term rewriting and all that (Baader, Nipkow): [https://www.cambridge.org/core/books/term-rewriting-and-all-that/71768055C83EA8B18A58B8B09BEF3AB5](https://www.cambridge.org/core/books/term-rewriting-and-all-that/71768055C83EA8B18A58B8B09BEF3AB5) + +### 1.5 Representation Independence and Existential Types + +**Representation independence** is the principle that clients of an abstract data type cannot observe which concrete representation is in use. Mitchell and Plotkin (1988) proved that abstract data types correspond to **existential types** in System F: + +``` +∃Repr. { + encode : LogicalBuffer → Repr, + decode : Repr → LogicalBuffer, + ops : Repr → Results +} +``` + +The `Repr` type variable is hidden from the client. The client can only interact with the data through the provided operations, which are guaranteed to respect the logical semantics. Two implementations that expose the same interface and produce the same results are interchangeable. + +**In Vortex:** Each encoding (dictionary, RLE, bitpacked, etc.) is an existential package. The query engine is a client that interacts through `scalar_at`, `slice`, `filter`, `take`, etc. The physical layout is the hidden `Repr`. Representation independence is what makes the whole layered architecture work — the query engine doesn't branch on encoding type. + +**Further reading:** + +- Mitchell, Plotkin, "Abstract Types Have Existential Type": [https://doi.org/10.1145/44501.45065](https://doi.org/10.1145/44501.45065) +- Existential types in type theory: [https://ncatlab.org/nlab/show/existential+type](https://ncatlab.org/nlab/show/existential+type) + +--- + +## 2. Structural vs. Nominal Typing in DType + +### 2.1 The Distinction + +In a **structural** type system, two types are the same if they have the same structure. In a **nominal** type system, two types are the same only if they have the same name (declaration), even if their structures are identical. + +**Further reading:** + +- Structural vs. nominal type systems: [https://en.wikipedia.org/wiki/Nominal_type_system](https://en.wikipedia.org/wiki/Nominal_type_system) + +### 2.2 Where Vortex's DType is Nominal + +Several DType variants are structurally isomorphic but intentionally kept distinct: + +| Type A | Type B | Structurally identical? | Semantically distinct? | +| ---------------------- | ----------------------------------- | ----------------------------------------------- | --------------------------------------------------------------------------- | +| `Utf8` | `Binary` | Yes (both are variable-length byte sequences) | **Yes** — `Utf8` carries a validity invariant and enables string operations | +| `FixedSizeList(U8, n)` | (hypothetical) `FixedSizeBinary(n)` | Yes (both are fixed-width byte windows) | **Unclear** — this is the open question | +| `List(T)` | `ListView(T)` (if it were in DType) | Yes (both are variable-length sequences of `T`) | **No** — same logical semantics, different physical form | + +The first case (Utf8 vs Binary) is a clear **refinement type**: `Utf8` is `Binary` refined by the predicate `valid_utf8(bytes)`. This predicate gates operations (string functions) and imposes invariants. The nominal distinction is load-bearing. + +The third case (List vs ListView) is clearly _not_ a nominal type distinction — it's a physical encoding difference that should not appear in DType. This is already handled correctly. + +The second case (FixedSizeList vs FixedSizeBinary) is the interesting question, analyzed in Section 3. + +### 2.3 List vs ListView: The Canonical Example of a Section Choice + +The relationship between List and ListView deserves detailed treatment because it is the cleanest example of how the logical/physical split works in Vortex — and the example that most directly motivates the multi-section proposal. + +**Logical equivalence.** List and ListView represent exactly the same logical data: a sequence of variable-length sub-arrays. Given an array of type `List(Int32)`, element `i` is a variable-length sequence of `Int32` values. This is true regardless of whether the physical layout uses monotonic offsets (List) or offset-size pairs (ListView). + +**Physical difference.** The distinction is entirely in the buffer layout: + +- **List** stores a single offsets buffer where `offsets[i]..offsets[i+1]` defines the range for element `i`. Offsets are monotonically increasing. The child values buffer is contiguous and non-overlapping — every byte belongs to exactly one logical element. +- **ListView** stores separate offsets and sizes buffers, where `offsets[i]..offsets[i]+sizes[i]` defines the range for element `i`. This allows _overlapping ranges_ (two logical elements can share backing data) and _gaps_ (regions of the values buffer that belong to no element). + +**Why this is purely physical.** No query operation can observe the difference. `scalar_at(i)` returns the same sub-array. `filter`, `take`, `slice` all produce logically identical results. The overlapping and gaps in ListView are properties of the physical layout, not the logical data. A consumer that reads element `i` gets the same list of values regardless of which representation backs it. + +This is precisely the quotient type property: List and ListView are two representatives of the same equivalence class. Any function that "descends to the quotient" — that is well-defined on the logical type — produces the same result on both. + +**Why it matters for canonicalization.** The current system chooses ListView as the canonical form for `List(T)`. This is a valid section. But it is not the _only_ valid section. Some consumers (particularly Arrow FFI boundaries) would prefer List, because Arrow's canonical layout uses monotonic offsets. Other consumers might prefer ListView because it avoids the compaction cost of eliminating sharing. + +This is the motivating case for parameterized sections: the same logical type, two valid canonical forms, and different consumers with different preferences. The theory says this is fine — you just need each section to be internally consistent (Church-Rosser within itself). + +**Compaction as a morphism.** Converting from ListView to List requires _compaction_ — eliminating overlaps and gaps by copying values into a contiguous buffer and recomputing monotonic offsets. Converting from List to ListView is trivial (sizes are just offset deltas). This asymmetry is interesting: the section that targets List is more expensive to compute, but produces a form with stronger structural guarantees (no aliasing, no gaps). The section that targets ListView is cheaper but permits aliasing. Both are correct. + +### 2.4 Refinement Types + +A **refinement type** `{ x : T | P(x) }` is a type `T` restricted to values satisfying predicate `P`. Refinement types let you express subtypes without changing the underlying representation. + +**In Vortex:** + +``` +Utf8 ≅ { b : Binary | valid_utf8(b) } +``` + +This is a proper refinement: every Utf8 value is a valid Binary value, but not every Binary value is valid Utf8. The predicate `valid_utf8` is what justifies the separate DType variant. Without it, Utf8 and Binary would be the same type, and maintaining both would be redundant. + +**The test for whether a new DType variant is justified:** Does it carry a predicate that gates operations or imposes invariants that the base type does not? If yes, it's a refinement type and belongs in DType. If no, it may be better modeled as a canonical form (section), an encoding, or a type alias. + +**Further reading:** + +- Refinement types: [https://en.wikipedia.org/wiki/Refinement_type](https://en.wikipedia.org/wiki/Refinement_type) +- Liquid types (Rondon, Kawaguci, Jhala): [https://doi.org/10.1145/1375581.1375602](https://doi.org/10.1145/1375581.1375602) + +--- + +## 3. Analysis: Should FixedSizeBinary Be a DType? + +### 3.1 The Case For (Refinement / Nominal Argument) + +FixedSizeBinary could be justified if it carries a semantic distinction that `FixedSizeList(U8, n)` does not: + +- **Intent signaling:** FixedSizeBinary says "this is opaque binary data" (UUIDs, hashes, IP addresses), while FixedSizeList says "this is a list of bytes that happens to have a fixed length." The operations you'd want on each might differ (hex encoding, byte-order operations for FixedSizeBinary vs. element-wise list operations for FixedSizeList). +- **Schema compatibility:** Arrow, Parquet, and other formats distinguish these types. A FixedSizeBinary DType makes round-tripping schemas lossless. +- **Potential invariant:** FixedSizeBinary could carry the invariant "elements are not individually addressable / meaningful" — an opacity predicate. This is weaker than `valid_utf8` but still semantic. + +Under this reading, FixedSizeBinary is a lightweight refinement type, and the nominal distinction earns its place. + +### 3.2 The Case Against (Canonical Form / Section Argument) + +FixedSizeBinary could instead be a canonical form (section target) of `FixedSizeList(U8, n)`: + +- **No gating predicate:** If no operations are meaningful on FixedSizeBinary that aren't on FixedSizeList, then the predicate is empty and the refinement is trivial. +- **Leaky abstraction risk:** Every query engine function that handles list types would need to additionally handle FixedSizeBinary, or you need coercion rules. If the handling is always identical, the DType distinction adds complexity without semantic payoff. +- **Schema mapping as metadata:** The "this came from a Parquet FixedSizeBinary column" information could live in extension type metadata rather than in the core DType enum, keeping the logical layer minimal. + +Under this reading, FixedSizeBinary is an encoding or canonical form, not a logical type. + +### 3.3 Decision Framework + +The following questions can help determine where a proposed type belongs: + +``` + Does it gate different query operations? + │ + Yes ──────┼────── No + │ │ + Add to DType Is it structurally distinct + (refinement type) from an existing DType? + │ + Yes ──────┼────── No + │ │ + Add to DType Model as canonical form + (new structure) or encoding +``` + +For FixedSizeBinary: if it gates operations → DType. If it's structurally and operationally identical to FixedSizeList → canonical form or extension metadata. + +--- + +## 4. The Multi-Section Architecture + +### 4.1 Current State: Single Section + +Today, `to_canonical` is a single global section: + +``` +to_canonical : EncodedArray → CanonicalArray +``` + +It executes a lazy array tree (the canonicalization plan) to map every encoded array to a unique canonical form determined by its DType. The system is confluent (Church-Rosser) — all reduction paths converge. Canonicalization is shallow: only the outermost layer is reduced to canonical form. + +### 4.2 Proposed State: Parameterized Sections + +The system would support multiple named sections, each targeting a different canonical form: + +``` +to_canonical : EncodedArray × Target → CanonicalArray +``` + +where `Target` selects from a family of valid canonical forms for the given DType. + +**Concrete example for list types:** + +| DType | `Target::A` | `Target::B` | +| ------------------- | ----------------------------------------------- | -------------------------------------- | +| `List(T, _)` | ListView (offset-size pairs, potential sharing) | List (monotonic offsets, contiguous) | +| `Utf8(_)` | VarBin (offsets + sizes + data) | Flat UTF-8 (offsets + contiguous data) | +| `Primitive(I32, _)` | Flat buffer | Flat buffer (same in both) | + +The targets are not necessarily "Vortex" vs "Arrow" — the naming and number of targets is a design choice. A target might exist for a specific FFI boundary, a serialization format, or an optimization preference. The point is that each target defines an internally confluent reduction. The choice of target is made by the consumer (query engine, FFI boundary, serializer), not by the encoded array. + +### 4.3 Formal Structure + +The layered architecture becomes a commutative diagram: + +``` + EncodedArray + / | \ + / | \ + section_a section_b section_c ... + / | \ + / | \ + ListView List (other form) + \ | / + \ | / + ─── DType (quotient) ─── +``` + +All paths through the diagram commute: decoding any canonical form to logical values yields the same result. This is exactly the universal property of the quotient — any two representatives in the same equivalence class decode to the same data. + +### 4.4 Implementation Sketch + +```rust +/// A canonicalization target — selects which section to use. +/// +/// The specific variants here are illustrative. The actual set of targets +/// is a design decision based on which canonical forms consumers need. +enum CanonicalTarget { + /// Default canonical forms (e.g., ListView for lists, VarBin for strings). + Default, + /// Alternative canonical forms (e.g., List for lists, contiguous strings). + /// Could be motivated by FFI boundaries, serialization, or optimization. + Contiguous, +} + +/// Canonicalize an encoded array into the target representation. +/// +/// Executes the lazy array tree to reduce the outermost encoding to a +/// canonical form selected by `target`. Children are not recursively +/// canonicalized. +fn to_canonical(array: &Array, target: CanonicalTarget) -> VortexResult { + // Each target is an internally confluent reduction strategy. + // The logical DType is preserved regardless of target. + match target { + CanonicalTarget::Default => to_canonical_default(array), + CanonicalTarget::Contiguous => to_canonical_contiguous(array), + } +} +``` + +The invariant to maintain is that for any `array`, `target_a`, `target_b`: + +``` +decode(to_canonical(array, target_a)) == decode(to_canonical(array, target_b)) +``` + +That is: different canonical forms of the same data decode to the same logical values. This is the quotient compatibility condition. + +--- + +## 5. Summary of Theoretical Landscape + +| Concept | Role in Vortex | Key Reference | +| --------------------------------- | ----------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------- | +| **Equivalence class** | All physical arrays with the same logical data form a class | Standard set theory | +| **Quotient type** | `DType` is the quotient over encodings by decode-equality | [nLab: quotient type](https://ncatlab.org/nlab/show/quotient+type) | +| **Section** | `to_canonical` picks one representative per class | [Section (category theory)]() | +| **Church-Rosser** | Single-section canonicalization is confluent; multi-section is confluent per-target | [Confluence]() | +| **Existential types** | Each encoding hides its representation behind the DType interface | [Mitchell & Plotkin 1988](https://doi.org/10.1145/44501.45065) | +| **Representation independence** | Query engine can't observe which encoding is in use | Follows from existential typing | +| **Refinement types** | Utf8 refines Binary with `valid_utf8`; justifies separate DType | [Liquid Types](https://doi.org/10.1145/1375581.1375602) | +| **Structural vs. nominal typing** | DType is nominal — structurally isomorphic types can be distinct | [Nominal type systems](https://en.wikipedia.org/wiki/Nominal_type_system) | + +--- + +## 6. Further Reading + +### Foundational + +- **Types and Programming Languages** (Pierce, 2002) — Chapters on existential types, subtyping, and type equivalence. The standard graduate textbook for type theory. +- **Term Rewriting and All That** (Baader & Nipkow, 1998) — Rigorous treatment of confluence, normal forms, and the Church-Rosser property. +- **Abstract Types Have Existential Type** (Mitchell & Plotkin, 1988) — The original paper connecting data abstraction to existential quantification. [DOI](https://doi.org/10.1145/44501.45065) + +### On Quotient Types in Programming + +- **Quotient Types for Programmers** (Angiuli, Coquand, 2021) — Accessible introduction to quotient types from an HoTT/cubical perspective. [nLab](https://ncatlab.org/nlab/show/quotient+type) +- **Observational Equality, Now!** (Altenkirch, McBride, 2007) — Explores when two representations should be considered "the same" in type theory. [PDF](http://www.cs.nott.ac.uk/~psztxa/publ/obseqnow.pdf) + +### On Refinement Types + +- **Refinement Types for ML** (Freeman & Pfenning, 1991) — The original paper on refinement types. +- **Liquid Types** (Rondon, Kawaguci, Jhala, 2008) — Decidable refinement type inference. [DOI](https://doi.org/10.1145/1375581.1375602) + +### On Representation Independence + +- **Abstraction and Specification in Program Development** (Liskov & Guttag, 1986) — Practical treatment of representation independence in software engineering. +- **Parametricity and Representation Independence** (Plotkin & Abadi, 1993) — Formalizes how parametric polymorphism guarantees representation independence. + +### Columnar Format Context + +- **The BtrBlocks Paper** — Direct ancestor of Vortex's compression approach. +- **Apache Arrow Specification** — The de facto standard for canonical columnar layout; relevant for understanding which canonical forms matter in practice. [https://arrow.apache.org/docs/format/Columnar.html](https://arrow.apache.org/docs/format/Columnar.html) From 3f44c9b35240a9f87d77e0368c276abfd5d39300 Mon Sep 17 00:00:00 2001 From: Connor Tsui Date: Fri, 6 Mar 2026 12:20:37 -0500 Subject: [PATCH 02/17] add motivation section Signed-off-by: Connor Tsui --- proposed/0000-types.md | 30 +++++++++++++++++++++++++----- 1 file changed, 25 insertions(+), 5 deletions(-) diff --git a/proposed/0000-types.md b/proposed/0000-types.md index 72bf95c..5dac31a 100644 --- a/proposed/0000-types.md +++ b/proposed/0000-types.md @@ -1,7 +1,7 @@ -- Start Date: YYYY-MM-DD (today's date) +- Start Date: 2026-03-06 - RFC PR: [vortex-data/rfcs#0000](https://github.com/vortex-data/rfcs/pull/0000) -# RFC Template +# Formalize the Vortex Type System ## Summary @@ -9,10 +9,30 @@ One paragraph explanation of the proposed change. ## Motivation -What problem does this solve? Include concrete use cases where possible. +Many of the Vortex maintainers have a good understanding of how the Vortex type system works: we +define a set of logical types, each of which can represent many physical data encodings. We +additionally define a set of `Canonical` encodings that represent the different targets that arrays +can decompress into. -- What specific use cases does this enable or improve? -- What workflows or operations are painful, slow, or impossible today? +This definition has mostly worked well for us. However, several recent discussions have revealed +that this loose definition may be insufficient. + +For example, we would like to add a `FixedSizeBinary` type, but it is unclear if this is +necessary when it is mostly equivalent to `FixedSizeList`. Are these actually different +logical types? What does a "different" logical type even mean? + +Another discussion we have had is if the choice of a canonical `ListView` is better or worse than a +canonical `List` ([vortex#4699](https://github.com/vortex-data/vortex/issues/4699)). Both have the +exact same logical type (same domain of values), but we are stuck choosing a single "canonical" +encoding that we force every array of type `List` to decompress into. Is forcing everyone to +decompress into the same physical encoding really what we want? + +This RFC makes 2 proposals. The first is a more formalized definition of the Vortex type system, and +this serves to justify the second proposal. + +The second proposal is to relax (or extend) the concept of a "canonical" type from choosing a unique +physical encoding for every logical type (a unique normal form) to allowing many possible canonical +targets (multiple normal forms). ## Design From c5c772cc64fb3a624bb031f286316bb5c0142c6f Mon Sep 17 00:00:00 2001 From: Connor Tsui Date: Fri, 6 Mar 2026 14:17:24 -0500 Subject: [PATCH 03/17] add some basic type theory background Signed-off-by: Connor Tsui --- proposed/0000-types.md | 79 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 79 insertions(+) diff --git a/proposed/0000-types.md b/proposed/0000-types.md index 5dac31a..9e95c7a 100644 --- a/proposed/0000-types.md +++ b/proposed/0000-types.md @@ -34,6 +34,58 @@ The second proposal is to relax (or extend) the concept of a "canonical" type fr physical encoding for every logical type (a unique normal form) to allowing many possible canonical targets (multiple normal forms). +# Type Theory Background + +This section introduces the type-theoretic concepts that underpin Vortex's `DType` system and its +relationship to physical encodings. To reiterate, most of the maintainers understand these concepts +intuitively, but there is value in mapping these implicit concepts to explicit theory. + +Note that this section made heavy use of LLMs to help research and identify terms and definitions, +as the author of this RFC is notably _not_ a type theory expert. + +## Equivalence Classes and `DType` as a Quotient Type + +### In Theory + +An **equivalence relation** `~` on a set `S` is a relation that is reflexive (`a ~ a`), symmetric +(`a ~ b` implies `b ~ a`), and transitive (`a ~ b` and `b ~ c` implies `a ~ c`). An equivalence +relation partitions `S` into disjoint subsets called **equivalence classes**, where each class +contains all elements that are equivalent to one another. + +A **quotient type** is a data type that falls under the general class of algebraic data types. +Formally, a quotient type `A / ~` is formed by taking a type `A` and collapsing it by an equivalence +relation `~`. The elements of the quotient type are the equivalence classes themselves: not +individual values, but entire groups of values that are considered "the same." + +The critical property of a quotient type is that operations on it must be **well-defined**: they +must produce the same result regardless of which member of the class you operate on. Formally, if +`f : A → B` respects the equivalence relation (`a ~ a'` implies `f(a) = f(a')`), then `f` descends +to a well-defined function on the quotient `f' : A/~ → B`. + +### In Vortex + +Consider the set of all physical array representations / encodings in Vortex: a dictionary-encoded +`i32` array, a run-end-encoded `i32` array, a bitpacked `i32` array, a flat Arrow `i32` buffer, +etc. + +Two physical encodings are logically equivalent if and only if they produce the same logical +sequence of values when decoded / decompressed. This equivalence relation partitions the space of +all physical encodings into equivalence classes, where each class corresponds to a single logical +column of data. + +A Vortex `DType` like `Primitive(I32, NonNullable)` **names** one of these equivalence classes. It +tells us what logical data we are working with, but says nothing about which physical encoding is +representing it. Thus, we can say that logical types in Vortex form equivalence classes, and `DType` +is the set of equivalence classes. More formally, `DType` is the quotient type over the space of +physical encodings, collapsed by decoded / decompressed equivalence relation. + +This quotient structure imposes a concrete requirement: any operation defined on `DType` must +produce the same result regardless of which physical encoding backs the data. + +For example, operations like `filter`, `take`, and `scalar_at` all satisfy this: they depend only on +the logical values, not on how those values are stored. However, an operation like "return the +`ends` buffer" is not well-defined on the quotient type as that only exists for run-end encoding. + ## Design Describe the proposed design in enough detail that someone familiar with Vortex could implement it. This should cover: @@ -92,3 +144,30 @@ This section helps frame the design in a broader context. If there is no relevan ## Future Possibilities What natural extensions or follow-on work does this enable? This is a good place to note related ideas that are out of scope for this RFC but worth capturing. + +## Further Reading + +- **Equivalence classes and partitions.** + [Wikipedia: Equivalence class](https://en.wikipedia.org/wiki/Equivalence_class). +- **Quotient types in type theory.** + [nLab: quotient type](https://ncatlab.org/nlab/show/quotient+type). + Altenkirch, Anberree, Li, "Quotient Types for Programmers" + ([arXiv:1901.01006](https://arxiv.org/abs/1901.01006)). +- **Sections in category theory.** + [Wikipedia: Section (category theory)](). +- **Church-Rosser property and confluence.** + [Wikipedia: Church-Rosser theorem](https://en.wikipedia.org/wiki/Church%E2%80%93Rosser_theorem). + [Wikipedia: Confluence](). + Baader & Nipkow, _Term Rewriting and All That_ (Cambridge University Press, 1998). +- **Refinement types.** + [Wikipedia: Refinement type](https://en.wikipedia.org/wiki/Refinement_type). + Rondon, Kawaguci, Jhala, "Liquid Types" + ([DOI:10.1145/1375581.1375602](https://doi.org/10.1145/1375581.1375602)). +- **Abstract types and existential quantification.** + Mitchell & Plotkin, "Abstract Types Have Existential Type" + ([DOI:10.1145/44501.45065](https://doi.org/10.1145/44501.45065)). +- **Type theory textbook.** + Pierce, _Types and Programming Languages_ (MIT Press, 2002). Chapters on existential types, + subtyping, and type equivalence. +- **Arrow columnar format.** + [Apache Arrow Columnar Format](https://arrow.apache.org/docs/format/Columnar.html). From 4b3c4d09b4b1a0b165fd81fdd3dcbfe04192d7ad Mon Sep 17 00:00:00 2001 From: Connor Tsui Date: Fri, 6 Mar 2026 14:48:23 -0500 Subject: [PATCH 04/17] add sections and canonical description Signed-off-by: Connor Tsui --- proposed/0000-types.md | 55 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 55 insertions(+) diff --git a/proposed/0000-types.md b/proposed/0000-types.md index 9e95c7a..9864789 100644 --- a/proposed/0000-types.md +++ b/proposed/0000-types.md @@ -86,6 +86,61 @@ For example, operations like `filter`, `take`, and `scalar_at` all satisfy this: the logical values, not on how those values are stored. However, an operation like "return the `ends` buffer" is not well-defined on the quotient type as that only exists for run-end encoding. +## Sections and Canonicalization + +Observe that every physical array (a specific encoding combined with actual data) maps back to a +`DType`. A run-end-encoded `i32` array maps to `Primitive(I32)`, as does a dictionary-encoded `i32` +array. A `VarBinView` array can map to either `Utf8` or `Binary`, depending on whether its contents +are valid UTF-8. Call this projection `π : Array → DType`. + +A **section** is a function going the other direction: `s : DType → Encoding`, that picks one +specific physical encoding for each logical type, such that projecting back gives you the original +`DType` (`π(s(d)) = d`). In other words, a section answers the question: "given a logical type, +which physical encoding should I use to represent it?" + +**In Vortex**, the current `to_canonical` function is a section. For each `DType`, it selects +exactly one canonical physical form. Observe how the `Canonical` enum is essentially identical to +`DType` enum (with the exception of `VarBinView` with `Utf8` and `Binary`): + +```rust +/// The different logical types in Vortex (the different equivalence classes). +/// This is the quotient type! +pub enum DType { + Null, + Bool(Nullability), + Primitive(PType, Nullability), + Decimal(DecimalDType, Nullability), + Utf8(Nullability), + Binary(Nullability), + List(Arc, Nullability), + FixedSizeList(Arc, u32, Nullability), + Struct(StructFields, Nullability), + Extension(ExtDTypeRef), +} + +/// We "choose" the set of representatives of each of the logical types. +/// This is the image/result of the `to_canonical` function (where `to_canonical` is the section). +pub enum Canonical { + Null(NullArray), + Bool(BoolArray), + Primitive(PrimitiveArray), + Decimal(DecimalArray), + VarBinView(VarBinViewArray), // Note that `VarBinView` maps to both `Utf8` and `Binary`. + List(ListViewArray), + FixedSizeList(FixedSizeListArray), + Struct(StructArray), + Extension(ExtensionArray), +} +``` + +More formally, `Canonical` enumerates the **image** of the section function `to_canonical`. + +The critical insight is that `Canonical` represents several arbitrary **choices**. For example, +nothing in the theory privileges `ListView` over `List` as the canonical representative for +variable-length list data. Both are valid sections (since both pick a representative from the same +equivalence class), and both satisfy `π(s(d)) = d`. The current system in Vortex simply hardcodes +one particular section. The second proposal in this RFC is to allow _multiple sections_. + ## Design Describe the proposed design in enough detail that someone familiar with Vortex could implement it. This should cover: From 72f41e881dd99c59f8cbbb2a9bd4c7d4c4ba0e4b Mon Sep 17 00:00:00 2001 From: Connor Tsui Date: Fri, 6 Mar 2026 15:34:02 -0500 Subject: [PATCH 05/17] add section on confluence Signed-off-by: Connor Tsui --- proposed/0000-types.md | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+) diff --git a/proposed/0000-types.md b/proposed/0000-types.md index 9864789..bac205b 100644 --- a/proposed/0000-types.md +++ b/proposed/0000-types.md @@ -141,6 +141,30 @@ variable-length list data. Both are valid sections (since both pick a representa equivalence class), and both satisfy `π(s(d)) = d`. The current system in Vortex simply hardcodes one particular section. The second proposal in this RFC is to allow _multiple sections_. +## The Church-Rosser Property (Confluence) + +A rewriting system has the **Church-Rosser property** (or is **confluent**) if, whenever a term can +be reduced in two different ways, both reduction paths can be continued to reach the same final +result (called a **normal form**). For example, the expression `(2 + 3) * (1 + 1)` can be reduced +by evaluating the left subexpression first (`5 * (1 + 1)`) or the right first (`(2 + 3) * 2`), but +both paths arrive at `10`. + +**In current Vortex**, `to_canonical` is confluent by construction. Applying `take`, `filter`, or +`scalar_at` before or after canonicalization produces the same logical values. There is one normal +form per `DType`, and every reduction path reaches it. + +A **non-confluent** rewriting system is one where two reduction paths from the same starting point +can arrive at different normal forms. Non-confluent systems are well-studied, and the standard +approach is to define **separate reduction relations**, each of which is internally confluent. + +For example, instead of one global set of reduction rules, you define two strategies: strategy A +always reduces to normal form X, and strategy B always reduces to normal form Y. Each strategy +satisfies Church-Rosser independently, the only difference is which normal form they target. + +In Vortex, a similar scenario would be defining multiple strategies of canonicalization, where one +strategy could target `List` as a canonical target (normal form) for list data, and another strategy +could target `ListView`. See the [`List` vs. `ListView`](#list-vs-listview) section for more info. + ## Design Describe the proposed design in enough detail that someone familiar with Vortex could implement it. This should cover: From 1ca24a8d7c03091684cc3d81e2cb04e1ac1a2663 Mon Sep 17 00:00:00 2001 From: Connor Tsui Date: Fri, 6 Mar 2026 15:39:16 -0500 Subject: [PATCH 06/17] rename Signed-off-by: Connor Tsui --- proposed/{0000-types.md => 9999-types.md} | 0 ...-type-theory-background.md => vortex-type-theory-background.md | 0 2 files changed, 0 insertions(+), 0 deletions(-) rename proposed/{0000-types.md => 9999-types.md} (100%) rename proposed/vortex-type-theory-background.md => vortex-type-theory-background.md (100%) diff --git a/proposed/0000-types.md b/proposed/9999-types.md similarity index 100% rename from proposed/0000-types.md rename to proposed/9999-types.md diff --git a/proposed/vortex-type-theory-background.md b/vortex-type-theory-background.md similarity index 100% rename from proposed/vortex-type-theory-background.md rename to vortex-type-theory-background.md From e5a1ef55c9578403461a400050b70adc3ab0669b Mon Sep 17 00:00:00 2001 From: Connor Tsui Date: Fri, 6 Mar 2026 16:11:53 -0500 Subject: [PATCH 07/17] list vs list view Signed-off-by: Connor Tsui --- proposed/9999-types.md | 116 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 116 insertions(+) diff --git a/proposed/9999-types.md b/proposed/9999-types.md index bac205b..6d613b7 100644 --- a/proposed/9999-types.md +++ b/proposed/9999-types.md @@ -165,6 +165,120 @@ In Vortex, a similar scenario would be defining multiple strategies of canonical strategy could target `List` as a canonical target (normal form) for list data, and another strategy could target `ListView`. See the [`List` vs. `ListView`](#list-vs-listview) section for more info. +## Refinement Types and the DType Decision Framework + +A **refinement type** `{ x : T | P(x) }` is a type `T` restricted to values satisfying a predicate +`P`. Refinement types express subtypes without changing the underlying representation, instead they +add constraints that gate operations or impose invariants. + +For example in Vortex, `Utf8` is a refinement of `Binary`: + +``` +Utf8 ~= { b : Binary | valid_utf8(b) } +``` + +Every `Utf8` value is a valid `Binary` value, but not every `Binary` value is valid `Utf8`. The +predicate `valid_utf8` is what justifies the separate `DType` variant: it gates string operations +(like substring, regex matching, case conversion) that are not meaningful on arbitrary binary data. +Without this predicate, `Utf8` and `Binary` would be the same type, and maintaining both would be +redundant. + +This gives us a concrete decision tree for whether a new `DType` variant is justified: + +``` + Does it gate different query operations? + │ + Yes ──────┼────── No + │ │ + Add to DType Is it structurally distinct + (refinement type) from an existing DType? + │ + Yes ──────┼────── No + │ │ + Add to DType Model as canonical form + (new structure) or encoding +``` + +## Should `FixedSizeBinary` Be a `DType`? + +Applying the decision framework above to `FixedSizeBinary` vs. `FixedSizeList`: + +### The Case For (Refinement / Nominal Argument) + +`FixedSizeBinary` could be justified as a refinement type if it carries semantic distinctions +that `FixedSizeList` does not: + +- **Intent signaling.** `FixedSizeBinary` says "this is opaque binary data" (UUIDs, hashes, IP + addresses), while `FixedSizeList` says "this is a list of bytes that happens to have a + fixed length." +- **Schema compatibility.** Arrow, Parquet, and other formats distinguish these types. A + `FixedSizeBinary` `DType` makes round-tripping schemas lossless. +- **Potential invariant.** `FixedSizeBinary` could carry the invariant that elements are not + individually addressable or meaningful. This is weaker than `valid_utf8` but still semantic. + +Under this reading, `FixedSizeBinary` is a lightweight refinement type, and the nominal distinction +earns its place in `DType`. + +### The Case Against (Canonical Form / Section Argument) + +`FixedSizeBinary` could instead be modeled as a canonical form (section target) of +`FixedSizeList`, or as extension type metadata: + +- **No gating predicate.** If no operations are meaningful on `FixedSizeBinary` that are not also + meaningful on `FixedSizeList`, then the predicate is empty and the refinement is trivial. +- **Leaky abstraction risk.** Every query engine function that handles list types would need to + additionally handle `FixedSizeBinary`, or we would need coercion rules. If the handling is always + identical, the `DType` distinction adds complexity without semantic payoff. +- **Schema mapping as metadata.** The information "this came from a Parquet `FixedSizeBinary` + column" could live in extension type metadata rather than in the core `DType` enum, keeping the + logical layer minimal. +- **Complexity.** Adding yet another variant to the `DType` enum has a large surface area of change. + +Under this reading, `FixedSizeBinary` is an encoding or canonical form, not a logical type. + +### Decision + +It is somewhat hard to decide which is the right way to go. However, this section provides some more +structure to the discussions we have been holding. + +## List vs. ListView + +There is also a separate question about whether the current `Canonical` system is the most ideal. +The relationship between `List` and `ListView` ties all of the above concepts together and most +directly motivates a multi-section (multiple normal form) proposal. + +`List` and `ListView` represent exactly the same logical data: a sequence of variable-length +sub-arrays. Given an array of type `List(Int32)`, element `i` is a variable-length sequence of +`Int32` values. This is true regardless of the physical layout. + +The distinction is entirely in the buffer layout: + +- **`List`** stores a single offsets buffer where `offsets[i]..offsets[i+1]` defines the range for + element `i`. Offsets are monotonically increasing. The child values buffer is contiguous and + non-overlapping, and every byte belongs to exactly one logical element. +- **`ListView`** stores separate offsets and sizes buffers, where + `offsets[i]..offsets[i] + sizes[i]` defines the range for element `i`. This allows overlapping + views (two logical elements can share backing data) and gaps (regions of the values buffer that + belong to no element). + +This is a purely physical distinction, as no query operation can observe the difference. +For example, `scalar_at(i)` will always returns the same list, and `filter`, `take`, and `slice` all +produce logically identical results. + +However, this physical distinction has massive performance implications. Converting from `ListView` +to `List` requires rebuilding the entire array to eliminate overlaps and gaps. On the other hand, +converting from `List` to `ListView` is trivial (sizes are just offset deltas). + +This asymmetry is notable: the section that targets `List` is more expensive to compute but produces +a form with stronger structural guarantees (no aliasing, no gaps). The section that targets +`ListView` is cheaper and permits aliasing and faster random access. + +Many of our consumers (particularly Arrow FFI boundaries and consumers of DataFusion) prefer `List` +because Arrow has is adding support for `ListView` slowly. Other consumers prefer `ListView` because +some operations (namely random access and potentially dependent operations) are faster. + +TODO + ## Design Describe the proposed design in enough detail that someone familiar with Vortex could implement it. This should cover: @@ -216,6 +330,8 @@ This section helps frame the design in a broader context. If there is no relevan ## Unresolved Questions +- Should `FixedSizeBinary` be a `DType` variant (refinement type) or extension type metadata? + See the [analysis above](#should-fixedsizebinary-be-a-dtype) for the case for and against. - What parts of the design need to be resolved during the RFC process? - What is explicitly out of scope for this RFC? - Are there open questions that can be deferred to implementation? From d80cfe161ab8897e0cd746559a26dc8ffee2c1ae Mon Sep 17 00:00:00 2001 From: Connor Tsui Date: Fri, 6 Mar 2026 16:27:18 -0500 Subject: [PATCH 08/17] add design section Signed-off-by: Connor Tsui --- proposed/9999-types.md | 69 +++++++++++++++++++++++++++++++++++------- 1 file changed, 58 insertions(+), 11 deletions(-) diff --git a/proposed/9999-types.md b/proposed/9999-types.md index 6d613b7..557bee6 100644 --- a/proposed/9999-types.md +++ b/proposed/9999-types.md @@ -277,27 +277,74 @@ Many of our consumers (particularly Arrow FFI boundaries and consumers of DataFu because Arrow has is adding support for `ListView` slowly. Other consumers prefer `ListView` because some operations (namely random access and potentially dependent operations) are faster. -TODO +It is highly unfortunate that we cannot canonicalize into different targets, and we are forced to +always decompress into `ListView`. We have the same constraint on `VarBinView` vs `VarBin`, and +while we haven't seen as many performance problems, it feels like a constraint that is too +restrictive in Vortex. ## Design -Describe the proposed design in enough detail that someone familiar with Vortex could implement it. This should cover: +The proposal is to add a new `to_canonical_target` function that accepts a `CanonicalTarget` +parameter, while keeping the existing `to_canonical` as a convenience that uses the default target. +This minimizes breaking changes. -- New or modified APIs, traits, or vtable entries. -- How this interacts with existing components (encodings, layouts, scan, file format, etc.). -- Key implementation details and corner cases. -- Why is this the best approach in the space of possible designs? -- Which crates are affected and how the dependency graph changes, if at all. +```rust +/// Canonicalize using the default target. Existing behavior, no breaking change. +fn to_canonical(array: &Array) -> VortexResult { + to_canonical_target(array, CanonicalTarget::Default) +} -Use code examples and diagrams where they might help, like this: +/// Canonicalize into a specific target. +fn to_canonical_target(array: &Array, target: CanonicalTarget) -> VortexResult; +``` + +Where `CanonicalTarget` selects from a family of valid canonical forms: ```rust -pub fn main() { - let x = f32::to_bits(100.0f32); - dbg!(x); +/// A canonicalization target (selects which section to use). +enum CanonicalTarget { + /// The default canonical forms. This is what the current system uses. + /// For lists: ListView. For strings: VarBinView. + Default, + /// Contiguous canonical forms, motivated by Arrow FFI boundaries and + /// consumers that need stronger structural guarantees. + /// For lists: List (monotonic offsets). For strings: VarBin (contiguous data). + Contiguous, } ``` +The `Canonical` enum would use inner enums for the types where multiple canonical forms exist: + +```rust +pub enum CanonicalList { + List(ListArray), + ListView(ListViewArray), +} + +pub enum CanonicalVarBin { + VarBin(VarBinArray), + VarBinView(VarBinViewArray), +} + +pub enum Canonical { + ... + VarBin(CanonicalVarBin), // Maps to both Utf8 and Binary DTypes. + List(CanonicalList), // Maps to List DType. + ... +} + +array.to_canonical()?.scalar_at(i) == array.to_canonical_target(target)?.scalar_at(i) +``` + +Each target defines an internally confluent reduction strategy. Within `CanonicalTarget::Default`, +all paths converge to `ListView`/`VarBinView`. Within `CanonicalTarget::Contiguous`, all paths +converge to `List`/`VarBin`. The choice of target is thus made by the consumer (query engine, +FFI boundary, serializer), not by the encoded array. + +For `DType`s where the distinction does not apply (e.g., `Primitive`, `Bool`, `Null`), both targets +produce the same canonical form. The parameterization only has an effect where multiple valid +canonical forms exist. + ## Compatibility - Does this change the file format or wire format? Is it backward/forward compatible? From 9764c73f43bb47ea480d03f48ed9e9b7bec829ae Mon Sep 17 00:00:00 2001 From: Connor Tsui Date: Fri, 6 Mar 2026 16:39:41 -0500 Subject: [PATCH 09/17] almost done Signed-off-by: Connor Tsui --- proposed/9999-types.md | 33 ++++++++++++++++++++------------- 1 file changed, 20 insertions(+), 13 deletions(-) diff --git a/proposed/9999-types.md b/proposed/9999-types.md index 557bee6..dfcd68b 100644 --- a/proposed/9999-types.md +++ b/proposed/9999-types.md @@ -1,7 +1,7 @@ - Start Date: 2026-03-06 - RFC PR: [vortex-data/rfcs#0000](https://github.com/vortex-data/rfcs/pull/0000) -# Formalize the Vortex Type System +# Vortex Type System ## Summary @@ -159,7 +159,7 @@ approach is to define **separate reduction relations**, each of which is interna For example, instead of one global set of reduction rules, you define two strategies: strategy A always reduces to normal form X, and strategy B always reduces to normal form Y. Each strategy -satisfies Church-Rosser independently, the only difference is which normal form they target. +satisfies Church-Rosser independently, the only difference is which normal form they target. In Vortex, a similar scenario would be defining multiple strategies of canonicalization, where one strategy could target `List` as a canonical target (normal form) for list data, and another strategy @@ -345,28 +345,30 @@ For `DType`s where the distinction does not apply (e.g., `Primitive`, `Bool`, `N produce the same canonical form. The parameterization only has an effect where multiple valid canonical forms exist. -## Compatibility +TODO wrap everything up and explain how the type theory helps justify this. -- Does this change the file format or wire format? Is it backward/forward compatible? -- Does this break any public APIs? If so, what is the migration path? -- Are there performance implications? +## Compatibility -If there are no compatibility concerns, briefly state why. +There shouldn't be any compatibility concerns here because even under a specific `DType`, the array +tree is fully serialized, and consumers can always convert back and forth between `List` and +`ListView` if they really need to. ## Drawbacks -- Why should we _not_ do this? -- What is the maintenance cost of this change? -- Does this add complexity that could be avoided? +The drawback is extra complexity in supporting multiple canonical targets. However, we've also had +to spend time making optimizations and fixes (`is_zero_copy_to_list` for `ListView`, see +[vortex#5129](https://github.com/vortex-data/vortex/pull/5129)) because we were forced to always +canonicalize into a single target. So there is an obvious tradeoff here. ## Alternatives -- What other designs were considered and why were they rejected? -- What is the cost of **not** doing this? -- Is there a simpler approach that gets us most of the way there? +The alternative is to just not do this. We continue to find workarounds when canonical encodings do +not fit the use case. ## Prior Art +TODO + How have other systems solved this or similar problems? Consider: - Other columnar formats (Parquet, Arrow, etc.). @@ -379,12 +381,17 @@ This section helps frame the design in a broader context. If there is no relevan - Should `FixedSizeBinary` be a `DType` variant (refinement type) or extension type metadata? See the [analysis above](#should-fixedsizebinary-be-a-dtype) for the case for and against. + +TODO + - What parts of the design need to be resolved during the RFC process? - What is explicitly out of scope for this RFC? - Are there open questions that can be deferred to implementation? ## Future Possibilities +TODO + What natural extensions or follow-on work does this enable? This is a good place to note related ideas that are out of scope for this RFC but worth capturing. ## Further Reading From f947ccffa73f2be99160601f68bf1584482c3f2f Mon Sep 17 00:00:00 2001 From: Connor Tsui Date: Fri, 6 Mar 2026 16:59:54 -0500 Subject: [PATCH 10/17] first draft done Signed-off-by: Connor Tsui --- proposed/9999-types.md | 92 ++++---- vortex-type-theory-background.md | 349 ------------------------------- 2 files changed, 55 insertions(+), 386 deletions(-) delete mode 100644 vortex-type-theory-background.md diff --git a/proposed/9999-types.md b/proposed/9999-types.md index dfcd68b..4bd1124 100644 --- a/proposed/9999-types.md +++ b/proposed/9999-types.md @@ -5,7 +5,11 @@ ## Summary -One paragraph explanation of the proposed change. +This RFC formalizes the Vortex type system by grounding `DType` as a quotient type over physical +encodings and establishes a decision framework (based on refinement types) for when new `DType` +variants are justified. It additionally proposes extending canonicalization from a single target +(one canonical "normal form" per `DType`) to parameterized "sections" via `to_canonical_target`, +allowing consumers to choose among multiple valid canonical targets (e.g., `List` vs. `ListView`). ## Motivation @@ -34,7 +38,7 @@ The second proposal is to relax (or extend) the concept of a "canonical" type fr physical encoding for every logical type (a unique normal form) to allowing many possible canonical targets (multiple normal forms). -# Type Theory Background +## Type Theory Background This section introduces the type-theoretic concepts that underpin Vortex's `DType` system and its relationship to physical encodings. To reiterate, most of the maintainers understand these concepts @@ -43,9 +47,9 @@ intuitively, but there is value in mapping these implicit concepts to explicit t Note that this section made heavy use of LLMs to help research and identify terms and definitions, as the author of this RFC is notably _not_ a type theory expert. -## Equivalence Classes and `DType` as a Quotient Type +### Equivalence Classes and `DType` as a Quotient Type -### In Theory +#### In Theory An **equivalence relation** `~` on a set `S` is a relation that is reflexive (`a ~ a`), symmetric (`a ~ b` implies `b ~ a`), and transitive (`a ~ b` and `b ~ c` implies `a ~ c`). An equivalence @@ -62,7 +66,7 @@ must produce the same result regardless of which member of the class you operate `f : A → B` respects the equivalence relation (`a ~ a'` implies `f(a) = f(a')`), then `f` descends to a well-defined function on the quotient `f' : A/~ → B`. -### In Vortex +#### In Vortex Consider the set of all physical array representations / encodings in Vortex: a dictionary-encoded `i32` array, a run-end-encoded `i32` array, a bitpacked `i32` array, a flat Arrow `i32` buffer, @@ -77,7 +81,7 @@ A Vortex `DType` like `Primitive(I32, NonNullable)` **names** one of these equiv tells us what logical data we are working with, but says nothing about which physical encoding is representing it. Thus, we can say that logical types in Vortex form equivalence classes, and `DType` is the set of equivalence classes. More formally, `DType` is the quotient type over the space of -physical encodings, collapsed by decoded / decompressed equivalence relation. +physical encodings, collapsed by the decoded / decompressed equivalence relation. This quotient structure imposes a concrete requirement: any operation defined on `DType` must produce the same result regardless of which physical encoding backs the data. @@ -86,7 +90,7 @@ For example, operations like `filter`, `take`, and `scalar_at` all satisfy this: the logical values, not on how those values are stored. However, an operation like "return the `ends` buffer" is not well-defined on the quotient type as that only exists for run-end encoding. -## Sections and Canonicalization +### Sections and Canonicalization Observe that every physical array (a specific encoding combined with actual data) maps back to a `DType`. A run-end-encoded `i32` array maps to `Primitive(I32)`, as does a dictionary-encoded `i32` @@ -100,7 +104,7 @@ which physical encoding should I use to represent it?" **In Vortex**, the current `to_canonical` function is a section. For each `DType`, it selects exactly one canonical physical form. Observe how the `Canonical` enum is essentially identical to -`DType` enum (with the exception of `VarBinView` with `Utf8` and `Binary`): +the `DType` enum (with the exception of `VarBinView` with `Utf8` and `Binary`): ```rust /// The different logical types in Vortex (the different equivalence classes). @@ -141,7 +145,7 @@ variable-length list data. Both are valid sections (since both pick a representa equivalence class), and both satisfy `π(s(d)) = d`. The current system in Vortex simply hardcodes one particular section. The second proposal in this RFC is to allow _multiple sections_. -## The Church-Rosser Property (Confluence) +### The Church-Rosser Property (Confluence) A rewriting system has the **Church-Rosser property** (or is **confluent**) if, whenever a term can be reduced in two different ways, both reduction paths can be continued to reach the same final @@ -165,13 +169,13 @@ In Vortex, a similar scenario would be defining multiple strategies of canonical strategy could target `List` as a canonical target (normal form) for list data, and another strategy could target `ListView`. See the [`List` vs. `ListView`](#list-vs-listview) section for more info. -## Refinement Types and the DType Decision Framework +### Refinement Types and the DType Decision Framework A **refinement type** `{ x : T | P(x) }` is a type `T` restricted to values satisfying a predicate `P`. Refinement types express subtypes without changing the underlying representation, instead they add constraints that gate operations or impose invariants. -For example in Vortex, `Utf8` is a refinement of `Binary`: +For example, in Vortex, `Utf8` is a refinement of `Binary`: ``` Utf8 ~= { b : Binary | valid_utf8(b) } @@ -236,11 +240,6 @@ earns its place in `DType`. Under this reading, `FixedSizeBinary` is an encoding or canonical form, not a logical type. -### Decision - -It is somewhat hard to decide which is the right way to go. However, this section provides some more -structure to the discussions we have been holding. - ## List vs. ListView There is also a separate question about whether the current `Canonical` system is the most ideal. @@ -262,7 +261,7 @@ The distinction is entirely in the buffer layout: belong to no element). This is a purely physical distinction, as no query operation can observe the difference. -For example, `scalar_at(i)` will always returns the same list, and `filter`, `take`, and `slice` all +For example, `scalar_at(i)` will always return the same list, and `filter`, `take`, and `slice` all produce logically identical results. However, this physical distinction has massive performance implications. Converting from `ListView` @@ -274,7 +273,7 @@ a form with stronger structural guarantees (no aliasing, no gaps). The section t `ListView` is cheaper and permits aliasing and faster random access. Many of our consumers (particularly Arrow FFI boundaries and consumers of DataFusion) prefer `List` -because Arrow has is adding support for `ListView` slowly. Other consumers prefer `ListView` because +because Arrow is adding support for `ListView` slowly. Other consumers prefer `ListView` because some operations (namely random access and potentially dependent operations) are faster. It is highly unfortunate that we cannot canonicalize into different targets, and we are forced to @@ -345,7 +344,26 @@ For `DType`s where the distinction does not apply (e.g., `Primitive`, `Bool`, `N produce the same canonical form. The parameterization only has an effect where multiple valid canonical forms exist. -TODO wrap everything up and explain how the type theory helps justify this. +Because `DType` is a quotient type and canonicalization is a section on that quotient, supporting +multiple canonical targets is theoretically sound (adding this feature will not weaken any invariant +of the Vortex type system). Because each `CanonicalTarget` defines an internally confluent reduction +strategy, the quotient structure of `DType` guarantees that all well-defined operations produce +the same logical results regardless of which target is chosen. + +``` +Logical type (DType: purely semantic, not physical) + ▲ + │ section 1: Default (ListView, VarBinView) + │ section 2: Contiguous (List, VarBin) + │ section n: ... (future targets) + ▼ +Canonical form (a specific "normal form" encoding chosen by the section) + ▲ + │ encode + │ decode + ▼ +Physical encoding (dictionary, REE, bitpacked, etc.) +``` ## Compatibility @@ -367,32 +385,32 @@ not fit the use case. ## Prior Art -TODO - -How have other systems solved this or similar problems? Consider: - -- Other columnar formats (Parquet, Arrow, etc.). -- Database internals (DuckDB, DataFusion, Velox, etc.). -- Relevant academic papers or blog posts. - -This section helps frame the design in a broader context. If there is no relevant prior art, that is fine. +- **Arrow** defines both `List` and `ListView` (and `LargeList`/`LargeListView`) as separate type + IDs in its columnar format. Arrow's canonical layout uses monotonic offsets (`List`), and + `ListView` support is being added incrementally across implementations. Arrow also distinguishes + `FixedSizeBinary` from `FixedSizeList` at the type level. +- **Parquet** has `FIXED_LEN_BYTE_ARRAY` as a distinct primitive type, separate from repeated + groups. This is a nominal distinction similar to the `FixedSizeBinary` question. +- **DuckDB** uses a single canonical form per logical type and does not support multiple + canonicalization targets. ## Unresolved Questions - Should `FixedSizeBinary` be a `DType` variant (refinement type) or extension type metadata? See the [analysis above](#should-fixedsizebinary-be-a-dtype) for the case for and against. - -TODO - -- What parts of the design need to be resolved during the RFC process? -- What is explicitly out of scope for this RFC? -- Are there open questions that can be deferred to implementation? +- What is the concrete set of `CanonicalTarget` variants? This RFC proposes `Default` and + `Contiguous`, but there may be other useful targets. Theoretically, we could have a target that + would allow us to return compressed arrays as a `Canonical` type! ## Future Possibilities -TODO - -What natural extensions or follow-on work does this enable? This is a good place to note related ideas that are out of scope for this RFC but worth capturing. +- We can have extension types be a generalization of the refinement type pattern: for example we + could enforce user-defined predicates that gate custom operations on existing `DType`s. +- User-defined or plugin-defined canonical targets for custom FFI boundaries or serialization + formats. +- Using the decision framework to audit existing `DType` variants and determine if any should be + consolidated or split, as well as to make decisions about logical types we want to add (namely + `FixedSizeBinary` and `Variant`). ## Further Reading diff --git a/vortex-type-theory-background.md b/vortex-type-theory-background.md deleted file mode 100644 index f5943eb..0000000 --- a/vortex-type-theory-background.md +++ /dev/null @@ -1,349 +0,0 @@ -# Theoretical Background: Logical and Physical Types in Vortex - -## Purpose - -This document provides the type-theoretic foundations for reasoning about Vortex's `DType` system, its relationship to physical encodings, and the design space for canonicalization. It is intended as background material for an RFC on evolving how the Vortex type system handles canonical representations — and on whether additions like `FixedSizeBinary` belong in `DType`. - ---- - -## 1. Core Concepts - -### 1.1 Equivalence Classes - -An **equivalence class** is the set of all elements that are "the same" under some equivalence relation. An equivalence relation `~` on a set `S` must be reflexive (`a ~ a`), symmetric (`a ~ b ⟹ b ~ a`), and transitive (`a ~ b` and `b ~ c` ⟹ `a ~ c`). - -**In Vortex:** Consider the set of all physical array representations — dictionary-encoded int32 arrays, run-length-encoded int32 arrays, bitpacked int32 arrays, flat Arrow int32 arrays, etc. Two representations are equivalent if and only if they produce the same logical sequence of values when decoded. This equivalence relation partitions the space of all physical arrays into equivalence classes, where each class corresponds to a single logical column of data. - -The key insight is that a Vortex `DType` like `Primitive(I32, NonNullable)` _names_ one of these equivalence classes. It tells you what logical data you're working with, but says nothing about which physical representative you hold. - -**Further reading:** - -- Partition of a set / equivalence class: [https://en.wikipedia.org/wiki/Equivalence_class](https://en.wikipedia.org/wiki/Equivalence_class) - -### 1.2 Quotient Types - -A **quotient type** `A / ~` is a type formed by taking a base type `A` and collapsing it by an equivalence relation `~`. Elements of the quotient type are the equivalence classes themselves. Operations on the quotient type must be _well-defined_ — they can't depend on which representative you pick from within a class. - -Formally, if `f : A → B` is a function that respects the equivalence relation (`a ~ a' ⟹ f(a) = f(a')`), then `f` descends to a well-defined function on the quotient `f' : A/~ → B`. - -**In Vortex:** `DType` is a quotient type over the space of physical representations. The equivalence relation is decode-equality ("these two arrays decode to the same logical values"). Query operations like `filter`, `take`, `scalar_at` are well-defined on the quotient: their results depend only on the logical data, not on the encoding. This is exactly the requirement for a function to descend to the quotient. - -An operation that _doesn't_ descend to the quotient — say, "return the offset buffer" — is one that depends on the specific representative (ListView has offset-size pairs; List has monotonic offsets; dictionary encoding has neither). Such operations are physical-layer concerns and must live below the DType abstraction. - -**Further reading:** - -- Quotient types in type theory: [https://ncatlab.org/nlab/show/quotient+type](https://ncatlab.org/nlab/show/quotient+type) -- Quotient types in programming languages (Altenkirch, Anberree, Li): [https://arxiv.org/abs/1901.01006](https://arxiv.org/abs/1901.01006) - -### 1.3 Sections - -Given a surjection (a function that maps from a detailed representation to a collapsed one), a **section** is a right inverse: a function that picks one representative from each equivalence class. - -Concretely, if `π : Encoding → DType` is the projection that sends each physical encoding to its logical type, a section is a function `s : DType → Encoding` such that `π(s(d)) = d` for all DTypes `d`. In other words, `s` picks a specific physical representation for each logical type, and that representation actually has the right logical type. - -A section is also sometimes called a **choice function** or a **splitting** of the projection. - -**In Vortex:** The current `to_canonical` function is a section. It calls `execute` on a lazy array tree that defines the execution plan to reach a canonical form. For each `DType`, it selects exactly one physical form: - -| DType | Current canonical form (section) | -| ------------------- | ---------------------------------------------------------------------------------------- | -| `Primitive(I32, _)` | Flat Arrow-compatible buffer | -| `Utf8(_)` | VarBin with UTF-8 bytes | -| `List(T, _)` | ListView (offset-size pairs) | -| `Struct(fields, _)` | Struct with canonicalized top-level layout (children remain in their existing encodings) | - -Note that canonicalization is _shallow_ — it applies only to the outermost array, not recursively to children. A canonical `Struct` has its top-level field pointers in canonical form, but each child field may still be dictionary-encoded, run-length-encoded, etc. This is an important design choice: it means canonicalization is a local operation, not a full tree traversal. - -The critical property is that this is a _choice_. Nothing in the theory privileges ListView over List as the canonical representative for variable-length list data. Both are valid sections. The current system hardcodes one. - -**Further reading:** - -- Section in algebra / category theory: [https://en.wikipedia.org/wiki/Section\_(category_theory)]() -- Splitting lemma (analogous structure in group theory): [https://en.wikipedia.org/wiki/Splitting_lemma](https://en.wikipedia.org/wiki/Splitting_lemma) - -### 1.4 The Church-Rosser Property - -A rewriting system has the **Church-Rosser property** (or is **confluent**) if, whenever a term can be reduced in two different ways, both reduction paths can be continued to reach the same normal form. Equivalently: every term has at most one normal form. - -This is the property that makes "just keep simplifying until you're done" a well-defined strategy. If a system is Church-Rosser, the order in which you apply reduction rules doesn't matter — you always arrive at the same result. - -**In Vortex (current state):** The current `to_canonical` pathway is confluent by construction. No matter how deeply nested the encodings are (e.g., dictionary-of-run-length-of-bitpacked), executing the lazy canonicalization plan converges to the same canonical form. There's one normal form per DType, and every reduction path reaches it. This is clean and simple. - -**In Vortex (if we add multiple canonical targets):** We would be moving to a _non-confluent_ rewriting system. A `List(Int32)` encoded as dictionary-of-run-length could reduce to _either_ a flat `List` (contiguous monotonic offsets) or a `ListView` (offset-size pairs with potential sharing), depending on which reduction strategy is applied. The system no longer has a unique normal form per logical type. - -This isn't pathological — it's well-studied. The standard approach is to define **separate reduction relations**, each of which is internally confluent. For example, you might have one reduction strategy that targets ListView as the normal form for list types, and another that targets List. Each strategy independently satisfies Church-Rosser — within a given strategy, all paths converge. The strategies simply converge to _different_ normal forms. The choice of strategy is made by the consumer (the caller of `to_canonical`), not by the encoded array itself. - -**Further reading:** - -- Church-Rosser theorem: [https://en.wikipedia.org/wiki/Church%E2%80%93Rosser_theorem](https://en.wikipedia.org/wiki/Church%E2%80%93Rosser_theorem) -- Confluence in rewriting systems: [https://en.wikipedia.org/wiki/Confluence\_(abstract_rewriting)]() -- Term rewriting and all that (Baader, Nipkow): [https://www.cambridge.org/core/books/term-rewriting-and-all-that/71768055C83EA8B18A58B8B09BEF3AB5](https://www.cambridge.org/core/books/term-rewriting-and-all-that/71768055C83EA8B18A58B8B09BEF3AB5) - -### 1.5 Representation Independence and Existential Types - -**Representation independence** is the principle that clients of an abstract data type cannot observe which concrete representation is in use. Mitchell and Plotkin (1988) proved that abstract data types correspond to **existential types** in System F: - -``` -∃Repr. { - encode : LogicalBuffer → Repr, - decode : Repr → LogicalBuffer, - ops : Repr → Results -} -``` - -The `Repr` type variable is hidden from the client. The client can only interact with the data through the provided operations, which are guaranteed to respect the logical semantics. Two implementations that expose the same interface and produce the same results are interchangeable. - -**In Vortex:** Each encoding (dictionary, RLE, bitpacked, etc.) is an existential package. The query engine is a client that interacts through `scalar_at`, `slice`, `filter`, `take`, etc. The physical layout is the hidden `Repr`. Representation independence is what makes the whole layered architecture work — the query engine doesn't branch on encoding type. - -**Further reading:** - -- Mitchell, Plotkin, "Abstract Types Have Existential Type": [https://doi.org/10.1145/44501.45065](https://doi.org/10.1145/44501.45065) -- Existential types in type theory: [https://ncatlab.org/nlab/show/existential+type](https://ncatlab.org/nlab/show/existential+type) - ---- - -## 2. Structural vs. Nominal Typing in DType - -### 2.1 The Distinction - -In a **structural** type system, two types are the same if they have the same structure. In a **nominal** type system, two types are the same only if they have the same name (declaration), even if their structures are identical. - -**Further reading:** - -- Structural vs. nominal type systems: [https://en.wikipedia.org/wiki/Nominal_type_system](https://en.wikipedia.org/wiki/Nominal_type_system) - -### 2.2 Where Vortex's DType is Nominal - -Several DType variants are structurally isomorphic but intentionally kept distinct: - -| Type A | Type B | Structurally identical? | Semantically distinct? | -| ---------------------- | ----------------------------------- | ----------------------------------------------- | --------------------------------------------------------------------------- | -| `Utf8` | `Binary` | Yes (both are variable-length byte sequences) | **Yes** — `Utf8` carries a validity invariant and enables string operations | -| `FixedSizeList(U8, n)` | (hypothetical) `FixedSizeBinary(n)` | Yes (both are fixed-width byte windows) | **Unclear** — this is the open question | -| `List(T)` | `ListView(T)` (if it were in DType) | Yes (both are variable-length sequences of `T`) | **No** — same logical semantics, different physical form | - -The first case (Utf8 vs Binary) is a clear **refinement type**: `Utf8` is `Binary` refined by the predicate `valid_utf8(bytes)`. This predicate gates operations (string functions) and imposes invariants. The nominal distinction is load-bearing. - -The third case (List vs ListView) is clearly _not_ a nominal type distinction — it's a physical encoding difference that should not appear in DType. This is already handled correctly. - -The second case (FixedSizeList vs FixedSizeBinary) is the interesting question, analyzed in Section 3. - -### 2.3 List vs ListView: The Canonical Example of a Section Choice - -The relationship between List and ListView deserves detailed treatment because it is the cleanest example of how the logical/physical split works in Vortex — and the example that most directly motivates the multi-section proposal. - -**Logical equivalence.** List and ListView represent exactly the same logical data: a sequence of variable-length sub-arrays. Given an array of type `List(Int32)`, element `i` is a variable-length sequence of `Int32` values. This is true regardless of whether the physical layout uses monotonic offsets (List) or offset-size pairs (ListView). - -**Physical difference.** The distinction is entirely in the buffer layout: - -- **List** stores a single offsets buffer where `offsets[i]..offsets[i+1]` defines the range for element `i`. Offsets are monotonically increasing. The child values buffer is contiguous and non-overlapping — every byte belongs to exactly one logical element. -- **ListView** stores separate offsets and sizes buffers, where `offsets[i]..offsets[i]+sizes[i]` defines the range for element `i`. This allows _overlapping ranges_ (two logical elements can share backing data) and _gaps_ (regions of the values buffer that belong to no element). - -**Why this is purely physical.** No query operation can observe the difference. `scalar_at(i)` returns the same sub-array. `filter`, `take`, `slice` all produce logically identical results. The overlapping and gaps in ListView are properties of the physical layout, not the logical data. A consumer that reads element `i` gets the same list of values regardless of which representation backs it. - -This is precisely the quotient type property: List and ListView are two representatives of the same equivalence class. Any function that "descends to the quotient" — that is well-defined on the logical type — produces the same result on both. - -**Why it matters for canonicalization.** The current system chooses ListView as the canonical form for `List(T)`. This is a valid section. But it is not the _only_ valid section. Some consumers (particularly Arrow FFI boundaries) would prefer List, because Arrow's canonical layout uses monotonic offsets. Other consumers might prefer ListView because it avoids the compaction cost of eliminating sharing. - -This is the motivating case for parameterized sections: the same logical type, two valid canonical forms, and different consumers with different preferences. The theory says this is fine — you just need each section to be internally consistent (Church-Rosser within itself). - -**Compaction as a morphism.** Converting from ListView to List requires _compaction_ — eliminating overlaps and gaps by copying values into a contiguous buffer and recomputing monotonic offsets. Converting from List to ListView is trivial (sizes are just offset deltas). This asymmetry is interesting: the section that targets List is more expensive to compute, but produces a form with stronger structural guarantees (no aliasing, no gaps). The section that targets ListView is cheaper but permits aliasing. Both are correct. - -### 2.4 Refinement Types - -A **refinement type** `{ x : T | P(x) }` is a type `T` restricted to values satisfying predicate `P`. Refinement types let you express subtypes without changing the underlying representation. - -**In Vortex:** - -``` -Utf8 ≅ { b : Binary | valid_utf8(b) } -``` - -This is a proper refinement: every Utf8 value is a valid Binary value, but not every Binary value is valid Utf8. The predicate `valid_utf8` is what justifies the separate DType variant. Without it, Utf8 and Binary would be the same type, and maintaining both would be redundant. - -**The test for whether a new DType variant is justified:** Does it carry a predicate that gates operations or imposes invariants that the base type does not? If yes, it's a refinement type and belongs in DType. If no, it may be better modeled as a canonical form (section), an encoding, or a type alias. - -**Further reading:** - -- Refinement types: [https://en.wikipedia.org/wiki/Refinement_type](https://en.wikipedia.org/wiki/Refinement_type) -- Liquid types (Rondon, Kawaguci, Jhala): [https://doi.org/10.1145/1375581.1375602](https://doi.org/10.1145/1375581.1375602) - ---- - -## 3. Analysis: Should FixedSizeBinary Be a DType? - -### 3.1 The Case For (Refinement / Nominal Argument) - -FixedSizeBinary could be justified if it carries a semantic distinction that `FixedSizeList(U8, n)` does not: - -- **Intent signaling:** FixedSizeBinary says "this is opaque binary data" (UUIDs, hashes, IP addresses), while FixedSizeList says "this is a list of bytes that happens to have a fixed length." The operations you'd want on each might differ (hex encoding, byte-order operations for FixedSizeBinary vs. element-wise list operations for FixedSizeList). -- **Schema compatibility:** Arrow, Parquet, and other formats distinguish these types. A FixedSizeBinary DType makes round-tripping schemas lossless. -- **Potential invariant:** FixedSizeBinary could carry the invariant "elements are not individually addressable / meaningful" — an opacity predicate. This is weaker than `valid_utf8` but still semantic. - -Under this reading, FixedSizeBinary is a lightweight refinement type, and the nominal distinction earns its place. - -### 3.2 The Case Against (Canonical Form / Section Argument) - -FixedSizeBinary could instead be a canonical form (section target) of `FixedSizeList(U8, n)`: - -- **No gating predicate:** If no operations are meaningful on FixedSizeBinary that aren't on FixedSizeList, then the predicate is empty and the refinement is trivial. -- **Leaky abstraction risk:** Every query engine function that handles list types would need to additionally handle FixedSizeBinary, or you need coercion rules. If the handling is always identical, the DType distinction adds complexity without semantic payoff. -- **Schema mapping as metadata:** The "this came from a Parquet FixedSizeBinary column" information could live in extension type metadata rather than in the core DType enum, keeping the logical layer minimal. - -Under this reading, FixedSizeBinary is an encoding or canonical form, not a logical type. - -### 3.3 Decision Framework - -The following questions can help determine where a proposed type belongs: - -``` - Does it gate different query operations? - │ - Yes ──────┼────── No - │ │ - Add to DType Is it structurally distinct - (refinement type) from an existing DType? - │ - Yes ──────┼────── No - │ │ - Add to DType Model as canonical form - (new structure) or encoding -``` - -For FixedSizeBinary: if it gates operations → DType. If it's structurally and operationally identical to FixedSizeList → canonical form or extension metadata. - ---- - -## 4. The Multi-Section Architecture - -### 4.1 Current State: Single Section - -Today, `to_canonical` is a single global section: - -``` -to_canonical : EncodedArray → CanonicalArray -``` - -It executes a lazy array tree (the canonicalization plan) to map every encoded array to a unique canonical form determined by its DType. The system is confluent (Church-Rosser) — all reduction paths converge. Canonicalization is shallow: only the outermost layer is reduced to canonical form. - -### 4.2 Proposed State: Parameterized Sections - -The system would support multiple named sections, each targeting a different canonical form: - -``` -to_canonical : EncodedArray × Target → CanonicalArray -``` - -where `Target` selects from a family of valid canonical forms for the given DType. - -**Concrete example for list types:** - -| DType | `Target::A` | `Target::B` | -| ------------------- | ----------------------------------------------- | -------------------------------------- | -| `List(T, _)` | ListView (offset-size pairs, potential sharing) | List (monotonic offsets, contiguous) | -| `Utf8(_)` | VarBin (offsets + sizes + data) | Flat UTF-8 (offsets + contiguous data) | -| `Primitive(I32, _)` | Flat buffer | Flat buffer (same in both) | - -The targets are not necessarily "Vortex" vs "Arrow" — the naming and number of targets is a design choice. A target might exist for a specific FFI boundary, a serialization format, or an optimization preference. The point is that each target defines an internally confluent reduction. The choice of target is made by the consumer (query engine, FFI boundary, serializer), not by the encoded array. - -### 4.3 Formal Structure - -The layered architecture becomes a commutative diagram: - -``` - EncodedArray - / | \ - / | \ - section_a section_b section_c ... - / | \ - / | \ - ListView List (other form) - \ | / - \ | / - ─── DType (quotient) ─── -``` - -All paths through the diagram commute: decoding any canonical form to logical values yields the same result. This is exactly the universal property of the quotient — any two representatives in the same equivalence class decode to the same data. - -### 4.4 Implementation Sketch - -```rust -/// A canonicalization target — selects which section to use. -/// -/// The specific variants here are illustrative. The actual set of targets -/// is a design decision based on which canonical forms consumers need. -enum CanonicalTarget { - /// Default canonical forms (e.g., ListView for lists, VarBin for strings). - Default, - /// Alternative canonical forms (e.g., List for lists, contiguous strings). - /// Could be motivated by FFI boundaries, serialization, or optimization. - Contiguous, -} - -/// Canonicalize an encoded array into the target representation. -/// -/// Executes the lazy array tree to reduce the outermost encoding to a -/// canonical form selected by `target`. Children are not recursively -/// canonicalized. -fn to_canonical(array: &Array, target: CanonicalTarget) -> VortexResult { - // Each target is an internally confluent reduction strategy. - // The logical DType is preserved regardless of target. - match target { - CanonicalTarget::Default => to_canonical_default(array), - CanonicalTarget::Contiguous => to_canonical_contiguous(array), - } -} -``` - -The invariant to maintain is that for any `array`, `target_a`, `target_b`: - -``` -decode(to_canonical(array, target_a)) == decode(to_canonical(array, target_b)) -``` - -That is: different canonical forms of the same data decode to the same logical values. This is the quotient compatibility condition. - ---- - -## 5. Summary of Theoretical Landscape - -| Concept | Role in Vortex | Key Reference | -| --------------------------------- | ----------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------- | -| **Equivalence class** | All physical arrays with the same logical data form a class | Standard set theory | -| **Quotient type** | `DType` is the quotient over encodings by decode-equality | [nLab: quotient type](https://ncatlab.org/nlab/show/quotient+type) | -| **Section** | `to_canonical` picks one representative per class | [Section (category theory)]() | -| **Church-Rosser** | Single-section canonicalization is confluent; multi-section is confluent per-target | [Confluence]() | -| **Existential types** | Each encoding hides its representation behind the DType interface | [Mitchell & Plotkin 1988](https://doi.org/10.1145/44501.45065) | -| **Representation independence** | Query engine can't observe which encoding is in use | Follows from existential typing | -| **Refinement types** | Utf8 refines Binary with `valid_utf8`; justifies separate DType | [Liquid Types](https://doi.org/10.1145/1375581.1375602) | -| **Structural vs. nominal typing** | DType is nominal — structurally isomorphic types can be distinct | [Nominal type systems](https://en.wikipedia.org/wiki/Nominal_type_system) | - ---- - -## 6. Further Reading - -### Foundational - -- **Types and Programming Languages** (Pierce, 2002) — Chapters on existential types, subtyping, and type equivalence. The standard graduate textbook for type theory. -- **Term Rewriting and All That** (Baader & Nipkow, 1998) — Rigorous treatment of confluence, normal forms, and the Church-Rosser property. -- **Abstract Types Have Existential Type** (Mitchell & Plotkin, 1988) — The original paper connecting data abstraction to existential quantification. [DOI](https://doi.org/10.1145/44501.45065) - -### On Quotient Types in Programming - -- **Quotient Types for Programmers** (Angiuli, Coquand, 2021) — Accessible introduction to quotient types from an HoTT/cubical perspective. [nLab](https://ncatlab.org/nlab/show/quotient+type) -- **Observational Equality, Now!** (Altenkirch, McBride, 2007) — Explores when two representations should be considered "the same" in type theory. [PDF](http://www.cs.nott.ac.uk/~psztxa/publ/obseqnow.pdf) - -### On Refinement Types - -- **Refinement Types for ML** (Freeman & Pfenning, 1991) — The original paper on refinement types. -- **Liquid Types** (Rondon, Kawaguci, Jhala, 2008) — Decidable refinement type inference. [DOI](https://doi.org/10.1145/1375581.1375602) - -### On Representation Independence - -- **Abstraction and Specification in Program Development** (Liskov & Guttag, 1986) — Practical treatment of representation independence in software engineering. -- **Parametricity and Representation Independence** (Plotkin & Abadi, 1993) — Formalizes how parametric polymorphism guarantees representation independence. - -### Columnar Format Context - -- **The BtrBlocks Paper** — Direct ancestor of Vortex's compression approach. -- **Apache Arrow Specification** — The de facto standard for canonical columnar layout; relevant for understanding which canonical forms matter in practice. [https://arrow.apache.org/docs/format/Columnar.html](https://arrow.apache.org/docs/format/Columnar.html) From e41c1cb6a9bc666eaf647a3b60ef4c29d5d58088 Mon Sep 17 00:00:00 2001 From: Connor Tsui Date: Fri, 6 Mar 2026 17:20:14 -0500 Subject: [PATCH 11/17] fix Signed-off-by: Connor Tsui --- proposed/{9999-types.md => 0029-types.md} | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) rename proposed/{9999-types.md => 0029-types.md} (99%) diff --git a/proposed/9999-types.md b/proposed/0029-types.md similarity index 99% rename from proposed/9999-types.md rename to proposed/0029-types.md index 4bd1124..7405be3 100644 --- a/proposed/9999-types.md +++ b/proposed/0029-types.md @@ -1,5 +1,5 @@ - Start Date: 2026-03-06 -- RFC PR: [vortex-data/rfcs#0000](https://github.com/vortex-data/rfcs/pull/0000) +- RFC PR: [vortex-data/rfcs#29](https://github.com/vortex-data/rfcs/pull/29) # Vortex Type System From 4e4e6fd6257fdb896e543a479dc4562695988276 Mon Sep 17 00:00:00 2001 From: Connor Tsui Date: Mon, 9 Mar 2026 09:42:38 -0400 Subject: [PATCH 12/17] add clarity Signed-off-by: Connor Tsui --- proposed/0029-types.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/proposed/0029-types.md b/proposed/0029-types.md index 7405be3..2c5bc44 100644 --- a/proposed/0029-types.md +++ b/proposed/0029-types.md @@ -281,7 +281,11 @@ always decompress into `ListView`. We have the same constraint on `VarBinView` v while we haven't seen as many performance problems, it feels like a constraint that is too restrictive in Vortex. -## Design +## Non-confluent Rewriting Design + +As a reminder, Vortex (maybe unintentionally) currently has a confluent rewriting system, where we +"rewrite" (`to_canonical`) everything under an equivalence class (`DType`) into a singular normal +form (`Canonical`). This part of the RFC proposes a design of a non-conflunet rewriting system. The proposal is to add a new `to_canonical_target` function that accepts a `CanonicalTarget` parameter, while keeping the existing `to_canonical` as a convenience that uses the default target. From 70655ef943c46b3b16400c9b6577442f62011076 Mon Sep 17 00:00:00 2001 From: Connor Tsui Date: Mon, 9 Mar 2026 10:03:44 -0400 Subject: [PATCH 13/17] split into 2 RFCs Signed-off-by: Connor Tsui --- proposed/0029-types.md | 188 ++-------------------------- proposed/9999-canonical-targets.md | 192 +++++++++++++++++++++++++++++ 2 files changed, 204 insertions(+), 176 deletions(-) create mode 100644 proposed/9999-canonical-targets.md diff --git a/proposed/0029-types.md b/proposed/0029-types.md index 2c5bc44..0b5472a 100644 --- a/proposed/0029-types.md +++ b/proposed/0029-types.md @@ -7,9 +7,7 @@ This RFC formalizes the Vortex type system by grounding `DType` as a quotient type over physical encodings and establishes a decision framework (based on refinement types) for when new `DType` -variants are justified. It additionally proposes extending canonicalization from a single target -(one canonical "normal form" per `DType`) to parameterized "sections" via `to_canonical_target`, -allowing consumers to choose among multiple valid canonical targets (e.g., `List` vs. `ListView`). +variants are justified. ## Motivation @@ -28,15 +26,11 @@ logical types? What does a "different" logical type even mean? Another discussion we have had is if the choice of a canonical `ListView` is better or worse than a canonical `List` ([vortex#4699](https://github.com/vortex-data/vortex/issues/4699)). Both have the exact same logical type (same domain of values), but we are stuck choosing a single "canonical" -encoding that we force every array of type `List` to decompress into. Is forcing everyone to -decompress into the same physical encoding really what we want? +encoding that we force every array of type `List` to decompress into. This question is explored +further in a companion RFC ([RFC 9999](./9999-canonical-targets.md)). -This RFC makes 2 proposals. The first is a more formalized definition of the Vortex type system, and -this serves to justify the second proposal. - -The second proposal is to relax (or extend) the concept of a "canonical" type from choosing a unique -physical encoding for every logical type (a unique normal form) to allowing many possible canonical -targets (multiple normal forms). +This RFC formalizes the Vortex type system definitions, and this formalization serves as the +foundation for reasoning about questions like these. ## Type Theory Background @@ -143,7 +137,7 @@ The critical insight is that `Canonical` represents several arbitrary **choices* nothing in the theory privileges `ListView` over `List` as the canonical representative for variable-length list data. Both are valid sections (since both pick a representative from the same equivalence class), and both satisfy `π(s(d)) = d`. The current system in Vortex simply hardcodes -one particular section. The second proposal in this RFC is to allow _multiple sections_. +one particular section. ### The Church-Rosser Property (Confluence) @@ -165,10 +159,6 @@ For example, instead of one global set of reduction rules, you define two strate always reduces to normal form X, and strategy B always reduces to normal form Y. Each strategy satisfies Church-Rosser independently, the only difference is which normal form they target. -In Vortex, a similar scenario would be defining multiple strategies of canonicalization, where one -strategy could target `List` as a canonical target (normal form) for list data, and another strategy -could target `ListView`. See the [`List` vs. `ListView`](#list-vs-listview) section for more info. - ### Refinement Types and the DType Decision Framework A **refinement type** `{ x : T | P(x) }` is a type `T` restricted to values satisfying a predicate @@ -240,178 +230,24 @@ earns its place in `DType`. Under this reading, `FixedSizeBinary` is an encoding or canonical form, not a logical type. -## List vs. ListView - -There is also a separate question about whether the current `Canonical` system is the most ideal. -The relationship between `List` and `ListView` ties all of the above concepts together and most -directly motivates a multi-section (multiple normal form) proposal. - -`List` and `ListView` represent exactly the same logical data: a sequence of variable-length -sub-arrays. Given an array of type `List(Int32)`, element `i` is a variable-length sequence of -`Int32` values. This is true regardless of the physical layout. - -The distinction is entirely in the buffer layout: - -- **`List`** stores a single offsets buffer where `offsets[i]..offsets[i+1]` defines the range for - element `i`. Offsets are monotonically increasing. The child values buffer is contiguous and - non-overlapping, and every byte belongs to exactly one logical element. -- **`ListView`** stores separate offsets and sizes buffers, where - `offsets[i]..offsets[i] + sizes[i]` defines the range for element `i`. This allows overlapping - views (two logical elements can share backing data) and gaps (regions of the values buffer that - belong to no element). - -This is a purely physical distinction, as no query operation can observe the difference. -For example, `scalar_at(i)` will always return the same list, and `filter`, `take`, and `slice` all -produce logically identical results. - -However, this physical distinction has massive performance implications. Converting from `ListView` -to `List` requires rebuilding the entire array to eliminate overlaps and gaps. On the other hand, -converting from `List` to `ListView` is trivial (sizes are just offset deltas). - -This asymmetry is notable: the section that targets `List` is more expensive to compute but produces -a form with stronger structural guarantees (no aliasing, no gaps). The section that targets -`ListView` is cheaper and permits aliasing and faster random access. - -Many of our consumers (particularly Arrow FFI boundaries and consumers of DataFusion) prefer `List` -because Arrow is adding support for `ListView` slowly. Other consumers prefer `ListView` because -some operations (namely random access and potentially dependent operations) are faster. - -It is highly unfortunate that we cannot canonicalize into different targets, and we are forced to -always decompress into `ListView`. We have the same constraint on `VarBinView` vs `VarBin`, and -while we haven't seen as many performance problems, it feels like a constraint that is too -restrictive in Vortex. - -## Non-confluent Rewriting Design - -As a reminder, Vortex (maybe unintentionally) currently has a confluent rewriting system, where we -"rewrite" (`to_canonical`) everything under an equivalence class (`DType`) into a singular normal -form (`Canonical`). This part of the RFC proposes a design of a non-conflunet rewriting system. - -The proposal is to add a new `to_canonical_target` function that accepts a `CanonicalTarget` -parameter, while keeping the existing `to_canonical` as a convenience that uses the default target. -This minimizes breaking changes. - -```rust -/// Canonicalize using the default target. Existing behavior, no breaking change. -fn to_canonical(array: &Array) -> VortexResult { - to_canonical_target(array, CanonicalTarget::Default) -} - -/// Canonicalize into a specific target. -fn to_canonical_target(array: &Array, target: CanonicalTarget) -> VortexResult; -``` - -Where `CanonicalTarget` selects from a family of valid canonical forms: - -```rust -/// A canonicalization target (selects which section to use). -enum CanonicalTarget { - /// The default canonical forms. This is what the current system uses. - /// For lists: ListView. For strings: VarBinView. - Default, - /// Contiguous canonical forms, motivated by Arrow FFI boundaries and - /// consumers that need stronger structural guarantees. - /// For lists: List (monotonic offsets). For strings: VarBin (contiguous data). - Contiguous, -} -``` - -The `Canonical` enum would use inner enums for the types where multiple canonical forms exist: - -```rust -pub enum CanonicalList { - List(ListArray), - ListView(ListViewArray), -} - -pub enum CanonicalVarBin { - VarBin(VarBinArray), - VarBinView(VarBinViewArray), -} - -pub enum Canonical { - ... - VarBin(CanonicalVarBin), // Maps to both Utf8 and Binary DTypes. - List(CanonicalList), // Maps to List DType. - ... -} - -array.to_canonical()?.scalar_at(i) == array.to_canonical_target(target)?.scalar_at(i) -``` - -Each target defines an internally confluent reduction strategy. Within `CanonicalTarget::Default`, -all paths converge to `ListView`/`VarBinView`. Within `CanonicalTarget::Contiguous`, all paths -converge to `List`/`VarBin`. The choice of target is thus made by the consumer (query engine, -FFI boundary, serializer), not by the encoded array. - -For `DType`s where the distinction does not apply (e.g., `Primitive`, `Bool`, `Null`), both targets -produce the same canonical form. The parameterization only has an effect where multiple valid -canonical forms exist. - -Because `DType` is a quotient type and canonicalization is a section on that quotient, supporting -multiple canonical targets is theoretically sound (adding this feature will not weaken any invariant -of the Vortex type system). Because each `CanonicalTarget` defines an internally confluent reduction -strategy, the quotient structure of `DType` guarantees that all well-defined operations produce -the same logical results regardless of which target is chosen. - -``` -Logical type (DType: purely semantic, not physical) - ▲ - │ section 1: Default (ListView, VarBinView) - │ section 2: Contiguous (List, VarBin) - │ section n: ... (future targets) - ▼ -Canonical form (a specific "normal form" encoding chosen by the section) - ▲ - │ encode - │ decode - ▼ -Physical encoding (dictionary, REE, bitpacked, etc.) -``` - -## Compatibility - -There shouldn't be any compatibility concerns here because even under a specific `DType`, the array -tree is fully serialized, and consumers can always convert back and forth between `List` and -`ListView` if they really need to. - -## Drawbacks - -The drawback is extra complexity in supporting multiple canonical targets. However, we've also had -to spend time making optimizations and fixes (`is_zero_copy_to_list` for `ListView`, see -[vortex#5129](https://github.com/vortex-data/vortex/pull/5129)) because we were forced to always -canonicalize into a single target. So there is an obvious tradeoff here. - -## Alternatives - -The alternative is to just not do this. We continue to find workarounds when canonical encodings do -not fit the use case. - ## Prior Art -- **Arrow** defines both `List` and `ListView` (and `LargeList`/`LargeListView`) as separate type - IDs in its columnar format. Arrow's canonical layout uses monotonic offsets (`List`), and - `ListView` support is being added incrementally across implementations. Arrow also distinguishes - `FixedSizeBinary` from `FixedSizeList` at the type level. -- **Parquet** has `FIXED_LEN_BYTE_ARRAY` as a distinct primitive type, separate from repeated - groups. This is a nominal distinction similar to the `FixedSizeBinary` question. -- **DuckDB** uses a single canonical form per logical type and does not support multiple - canonicalization targets. +- **Arrow** has a physical type system, defining both `List` and `ListView` (and + `LargeList`/`LargeListView`) as separate type IDs in its columnar format. Arrow also + distinguishes `FixedSizeBinary` from `FixedSizeList` at the type level. +- **Parquet** also has a physical type system, with `FIXED_LEN_BYTE_ARRAY` as a distinct primitive + type separate from repeated groups. This is a nominal distinction similar to the + `FixedSizeBinary` question. ## Unresolved Questions - Should `FixedSizeBinary` be a `DType` variant (refinement type) or extension type metadata? See the [analysis above](#should-fixedsizebinary-be-a-dtype) for the case for and against. -- What is the concrete set of `CanonicalTarget` variants? This RFC proposes `Default` and - `Contiguous`, but there may be other useful targets. Theoretically, we could have a target that - would allow us to return compressed arrays as a `Canonical` type! ## Future Possibilities - We can have extension types be a generalization of the refinement type pattern: for example we could enforce user-defined predicates that gate custom operations on existing `DType`s. -- User-defined or plugin-defined canonical targets for custom FFI boundaries or serialization - formats. - Using the decision framework to audit existing `DType` variants and determine if any should be consolidated or split, as well as to make decisions about logical types we want to add (namely `FixedSizeBinary` and `Variant`). diff --git a/proposed/9999-canonical-targets.md b/proposed/9999-canonical-targets.md new file mode 100644 index 0000000..3b80320 --- /dev/null +++ b/proposed/9999-canonical-targets.md @@ -0,0 +1,192 @@ +- Start Date: 2026-03-06 +- RFC PR: TBD + +# Multiple Canonical Targets + +## Summary + +This RFC proposes extending canonicalization from a single target (one canonical "normal form" per +`DType`) to parameterized targets via `to_canonical_target`, allowing consumers to choose among +multiple valid canonical representations (e.g., `List` vs. `ListView`). This builds on the type +system formalization in [RFC 0029](./0029-types.md). + +## Motivation + +There is an open question about whether the current `Canonical` system is the most ideal. The +relationship between `List` and `ListView` ties the concepts from +[RFC 0029](./0029-types.md) together and most directly motivates a multi-section (multiple normal +form) proposal. + +`List` and `ListView` represent exactly the same logical data: a sequence of variable-length +sub-arrays. Given an array of type `List(Int32)`, element `i` is a variable-length sequence of +`Int32` values. This is true regardless of the physical layout. + +The distinction is entirely in the buffer layout: + +- **`List`** stores a single offsets buffer where `offsets[i]..offsets[i+1]` defines the range for + element `i`. Offsets are monotonically increasing. The child values buffer is contiguous and + non-overlapping, and every byte belongs to exactly one logical element. +- **`ListView`** stores separate offsets and sizes buffers, where + `offsets[i]..offsets[i] + sizes[i]` defines the range for element `i`. This allows overlapping + views (two logical elements can share backing data) and gaps (regions of the values buffer that + belong to no element). + +This is a purely physical distinction, as no query operation can observe the difference. +For example, `scalar_at(i)` will always return the same list, and `filter`, `take`, and `slice` all +produce logically identical results. + +However, this physical distinction has massive performance implications. Converting from `ListView` +to `List` requires rebuilding the entire array to eliminate overlaps and gaps. On the other hand, +converting from `List` to `ListView` is trivial (sizes are just offset deltas). + +This asymmetry is notable: the section that targets `List` is more expensive to compute but produces +a form with stronger structural guarantees (no aliasing, no gaps). The section that targets +`ListView` is cheaper and permits aliasing and faster random access. + +Many of our consumers (particularly Arrow FFI boundaries and consumers of DataFusion) prefer `List` +because Arrow is adding support for `ListView` slowly. Other consumers prefer `ListView` because +some operations (namely random access and potentially dependent operations) are faster. + +It is highly unfortunate that we cannot canonicalize into different targets, and we are forced to +always decompress into `ListView`. We have the same constraint on `VarBinView` vs `VarBin`, and +while we haven't seen as many performance problems, it feels like a constraint that is too +restrictive in Vortex. + +## Background + +As described in [RFC 0029](./0029-types.md), Vortex's `to_canonical` function is a **section** on +the quotient type `DType`: it picks one canonical physical encoding for each logical type. The +current system is **confluent** (Church-Rosser): there is one normal form per `DType`, and every +reduction path reaches it. + +A **non-confluent** rewriting system is one where two reduction paths from the same starting point +can arrive at different normal forms. The standard approach for handling non-confluent systems is to +define **separate reduction relations**, each of which is internally confluent. For example, instead +of one global set of reduction rules, you define two strategies: strategy A always reduces to normal +form X, and strategy B always reduces to normal form Y. Each strategy satisfies Church-Rosser +independently, the only difference is which normal form they target. + +In Vortex, a similar scenario would be defining multiple strategies of canonicalization, where one +strategy could target `List` as a canonical target (normal form) for list data, and another strategy +could target `ListView`. + +## Design + +The proposal is to add a new `to_canonical_target` function that accepts a `CanonicalTarget` +parameter, while keeping the existing `to_canonical` as a convenience that uses the default target. +This minimizes breaking changes. + +```rust +/// Canonicalize using the default target. Existing behavior, no breaking change. +fn to_canonical(array: &Array) -> VortexResult { + to_canonical_target(array, CanonicalTarget::Default) +} + +/// Canonicalize into a specific target. +fn to_canonical_target(array: &Array, target: CanonicalTarget) -> VortexResult; +``` + +Where `CanonicalTarget` selects from a family of valid canonical forms: + +```rust +/// A canonicalization target (selects which section to use). +enum CanonicalTarget { + /// The default canonical forms. This is what the current system uses. + /// For lists: ListView. For strings: VarBinView. + Default, + /// Contiguous canonical forms, motivated by Arrow FFI boundaries and + /// consumers that need stronger structural guarantees. + /// For lists: List (monotonic offsets). For strings: VarBin (contiguous data). + Contiguous, +} +``` + +The `Canonical` enum would use inner enums for the types where multiple canonical forms exist: + +```rust +pub enum CanonicalList { + List(ListArray), + ListView(ListViewArray), +} + +pub enum CanonicalVarBin { + VarBin(VarBinArray), + VarBinView(VarBinViewArray), +} + +pub enum Canonical { + ... + VarBin(CanonicalVarBin), // Maps to both Utf8 and Binary DTypes. + List(CanonicalList), // Maps to List DType. + ... +} + +array.to_canonical()?.scalar_at(i) == array.to_canonical_target(target)?.scalar_at(i) +``` + +Each target defines an internally confluent reduction strategy. Within `CanonicalTarget::Default`, +all paths converge to `ListView`/`VarBinView`. Within `CanonicalTarget::Contiguous`, all paths +converge to `List`/`VarBin`. The choice of target is thus made by the consumer (query engine, +FFI boundary, serializer), not by the encoded array. + +For `DType`s where the distinction does not apply (e.g., `Primitive`, `Bool`, `Null`), both targets +produce the same canonical form. The parameterization only has an effect where multiple valid +canonical forms exist. + +Because `DType` is a quotient type and canonicalization is a section on that quotient, supporting +multiple canonical targets is theoretically sound (adding this feature will not weaken any invariant +of the Vortex type system). Because each `CanonicalTarget` defines an internally confluent reduction +strategy, the quotient structure of `DType` guarantees that all well-defined operations produce +the same logical results regardless of which target is chosen. + +``` +Logical type (DType: purely semantic, not physical) + ▲ + │ section 1: Default (ListView, VarBinView) + │ section 2: Contiguous (List, VarBin) + │ section n: ... (future targets) + ▼ +Canonical form (a specific "normal form" encoding chosen by the section) + ▲ + │ encode + │ decode + ▼ +Physical encoding (dictionary, REE, bitpacked, etc.) +``` + +## Compatibility + +There shouldn't be any compatibility concerns here because even under a specific `DType`, the array +tree is fully serialized, and consumers can always convert back and forth between `List` and +`ListView` if they really need to. + +## Drawbacks + +The drawback is extra complexity in supporting multiple canonical targets. However, we've also had +to spend time making optimizations and fixes (`is_zero_copy_to_list` for `ListView`, see +[vortex#5129](https://github.com/vortex-data/vortex/pull/5129)) because we were forced to always +canonicalize into a single target. So there is an obvious tradeoff here. + +## Alternatives + +The alternative is to just not do this. We continue to find workarounds when canonical encodings do +not fit the use case. + +## Prior Art + +- **Arrow** defines both `List` and `ListView` (and `LargeList`/`LargeListView`) as separate type + IDs in its columnar format. Arrow's canonical layout uses monotonic offsets (`List`), and + `ListView` support is being added incrementally across implementations. +- **DuckDB** uses a single canonical form per logical type and does not support multiple + canonicalization targets. + +## Unresolved Questions + +- What is the concrete set of `CanonicalTarget` variants? This RFC proposes `Default` and + `Contiguous`, but there may be other useful targets. Theoretically, we could have a target that + would allow us to return compressed arrays as a `Canonical` type! + +## Future Possibilities + +- User-defined or plugin-defined canonical targets for custom FFI boundaries or serialization + formats. From 556d91c92dfc26b3206fa4fc0be1f4d751287d92 Mon Sep 17 00:00:00 2001 From: Connor Tsui Date: Mon, 9 Mar 2026 10:57:01 -0400 Subject: [PATCH 14/17] add better background Signed-off-by: Connor Tsui --- proposed/0029-types.md | 63 ++++++++++++++++++++++++++++++++++-------- 1 file changed, 51 insertions(+), 12 deletions(-) diff --git a/proposed/0029-types.md b/proposed/0029-types.md index 0b5472a..99b0b36 100644 --- a/proposed/0029-types.md +++ b/proposed/0029-types.md @@ -16,6 +16,37 @@ define a set of logical types, each of which can represent many physical data en additionally define a set of `Canonical` encodings that represent the different targets that arrays can decompress into. +## Overview + +### Logical vs. Physical Types + +A **logical type** (`DType`) describes what the data means, independent of how it is stored (e.g., +`Primitive(I32)`, `Utf8`, `List(Primitive(I32))`). A **physical encoding** describes how data is +laid out in memory or on disk (e.g., flat buffer, dictionary-encoded, run-end-encoded, bitpacked). +Many physical encodings can represent the same logical type. + +Vortex separates these two concepts so that encodings and compute can evolve independently. Without +this separation, implementing `M` operations across `N` encodings requires `N * M` implementations. +With it, each encoding only needs to decompress itself and each operation only needs to target +decompressed forms, reducing the cost to `N + M`. See this +[blog post](https://spiraldb.com/post/logical-vs-physical-data-types) for more information. + +### What is a `Canonical` encoding? + +The `N + M` argument relies on a common decompression target that operations are implemented +against. A **canonical encoding** is a physical encoding chosen as this representative for a logical +type (e.g., `VarBinView` for `Utf8`, `ListView` for `List`). The choice is deliberate and +optimized for the common compute case, but not fundamental: nothing in theory privileges `ListView` +over `List` for list data, for example. + +### Extension Types + +Vortex's built-in set of logical types will not cover every use case. Extension types allow external +consumers to define their own logical types on top of existing `DType`s without modifying the core +type system. See [RFC 0005](./0005-extension.md) for the full design. + +### Formalization + This definition has mostly worked well for us. However, several recent discussions have revealed that this loose definition may be insufficient. @@ -180,19 +211,24 @@ redundant. This gives us a concrete decision tree for whether a new `DType` variant is justified: ``` - Does it gate different query operations? - │ - Yes ──────┼────── No - │ │ - Add to DType Is it structurally distinct - (refinement type) from an existing DType? - │ - Yes ──────┼────── No - │ │ - Add to DType Model as canonical form - (new structure) or encoding + Does it gate different query operations? + │ + Yes ───────────────┼─────────────── No + │ │ + Does Vortex core Is it structurally distinct + own those ops? from an existing DType? + │ │ + Yes ──────┼────── No Yes ──────┼────── No + │ │ │ │ + Add to DType Extension Add to DType Model as canonical + (refinement type) type (new structure) form or encoding ``` +As described in the [motivation](#why-extension-types), the "Yes" branch distinguishes between +first-class `DType` variants and extension types based on who owns the gated operations. If Vortex +core provides kernels that require the predicate, it belongs in `DType`. If only external consumers +need it, an extension type suffices (see [RFC 0005](./0005-extension.md)). + ## Should `FixedSizeBinary` Be a `DType`? Applying the decision framework above to `FixedSizeBinary` vs. `FixedSizeList`: @@ -230,6 +266,8 @@ earns its place in `DType`. Under this reading, `FixedSizeBinary` is an encoding or canonical form, not a logical type. + + ## Prior Art - **Arrow** has a physical type system, defining both `List` and `ListView` (and @@ -242,7 +280,8 @@ Under this reading, `FixedSizeBinary` is an encoding or canonical form, not a lo ## Unresolved Questions - Should `FixedSizeBinary` be a `DType` variant (refinement type) or extension type metadata? - See the [analysis above](#should-fixedsizebinary-be-a-dtype) for the case for and against. + See the [analysis above](#should-fixedsizebinary-be-a-dtype) for the case for and against. It is + not so easy to claim one argument here is better than the other. Comments would be appreciated! ## Future Possibilities From 60f487091c499cd95c5dc39a5bf62b59535484dc Mon Sep 17 00:00:00 2001 From: Connor Tsui Date: Mon, 9 Mar 2026 12:24:04 -0400 Subject: [PATCH 15/17] remove proposal Signed-off-by: Connor Tsui --- proposed/0029-types.md | 5 +- proposed/9999-canonical-targets.md | 192 ----------------------------- 2 files changed, 2 insertions(+), 195 deletions(-) delete mode 100644 proposed/9999-canonical-targets.md diff --git a/proposed/0029-types.md b/proposed/0029-types.md index 99b0b36..570e75a 100644 --- a/proposed/0029-types.md +++ b/proposed/0029-types.md @@ -57,8 +57,7 @@ logical types? What does a "different" logical type even mean? Another discussion we have had is if the choice of a canonical `ListView` is better or worse than a canonical `List` ([vortex#4699](https://github.com/vortex-data/vortex/issues/4699)). Both have the exact same logical type (same domain of values), but we are stuck choosing a single "canonical" -encoding that we force every array of type `List` to decompress into. This question is explored -further in a companion RFC ([RFC 9999](./9999-canonical-targets.md)). +encoding that we force every array of type `List` to decompress into. This RFC formalizes the Vortex type system definitions, and this formalization serves as the foundation for reasoning about questions like these. @@ -224,7 +223,7 @@ This gives us a concrete decision tree for whether a new `DType` variant is just (refinement type) type (new structure) form or encoding ``` -As described in the [motivation](#why-extension-types), the "Yes" branch distinguishes between +As described in the [overview](#extension-types), the "Yes" branch distinguishes between first-class `DType` variants and extension types based on who owns the gated operations. If Vortex core provides kernels that require the predicate, it belongs in `DType`. If only external consumers need it, an extension type suffices (see [RFC 0005](./0005-extension.md)). diff --git a/proposed/9999-canonical-targets.md b/proposed/9999-canonical-targets.md deleted file mode 100644 index 3b80320..0000000 --- a/proposed/9999-canonical-targets.md +++ /dev/null @@ -1,192 +0,0 @@ -- Start Date: 2026-03-06 -- RFC PR: TBD - -# Multiple Canonical Targets - -## Summary - -This RFC proposes extending canonicalization from a single target (one canonical "normal form" per -`DType`) to parameterized targets via `to_canonical_target`, allowing consumers to choose among -multiple valid canonical representations (e.g., `List` vs. `ListView`). This builds on the type -system formalization in [RFC 0029](./0029-types.md). - -## Motivation - -There is an open question about whether the current `Canonical` system is the most ideal. The -relationship between `List` and `ListView` ties the concepts from -[RFC 0029](./0029-types.md) together and most directly motivates a multi-section (multiple normal -form) proposal. - -`List` and `ListView` represent exactly the same logical data: a sequence of variable-length -sub-arrays. Given an array of type `List(Int32)`, element `i` is a variable-length sequence of -`Int32` values. This is true regardless of the physical layout. - -The distinction is entirely in the buffer layout: - -- **`List`** stores a single offsets buffer where `offsets[i]..offsets[i+1]` defines the range for - element `i`. Offsets are monotonically increasing. The child values buffer is contiguous and - non-overlapping, and every byte belongs to exactly one logical element. -- **`ListView`** stores separate offsets and sizes buffers, where - `offsets[i]..offsets[i] + sizes[i]` defines the range for element `i`. This allows overlapping - views (two logical elements can share backing data) and gaps (regions of the values buffer that - belong to no element). - -This is a purely physical distinction, as no query operation can observe the difference. -For example, `scalar_at(i)` will always return the same list, and `filter`, `take`, and `slice` all -produce logically identical results. - -However, this physical distinction has massive performance implications. Converting from `ListView` -to `List` requires rebuilding the entire array to eliminate overlaps and gaps. On the other hand, -converting from `List` to `ListView` is trivial (sizes are just offset deltas). - -This asymmetry is notable: the section that targets `List` is more expensive to compute but produces -a form with stronger structural guarantees (no aliasing, no gaps). The section that targets -`ListView` is cheaper and permits aliasing and faster random access. - -Many of our consumers (particularly Arrow FFI boundaries and consumers of DataFusion) prefer `List` -because Arrow is adding support for `ListView` slowly. Other consumers prefer `ListView` because -some operations (namely random access and potentially dependent operations) are faster. - -It is highly unfortunate that we cannot canonicalize into different targets, and we are forced to -always decompress into `ListView`. We have the same constraint on `VarBinView` vs `VarBin`, and -while we haven't seen as many performance problems, it feels like a constraint that is too -restrictive in Vortex. - -## Background - -As described in [RFC 0029](./0029-types.md), Vortex's `to_canonical` function is a **section** on -the quotient type `DType`: it picks one canonical physical encoding for each logical type. The -current system is **confluent** (Church-Rosser): there is one normal form per `DType`, and every -reduction path reaches it. - -A **non-confluent** rewriting system is one where two reduction paths from the same starting point -can arrive at different normal forms. The standard approach for handling non-confluent systems is to -define **separate reduction relations**, each of which is internally confluent. For example, instead -of one global set of reduction rules, you define two strategies: strategy A always reduces to normal -form X, and strategy B always reduces to normal form Y. Each strategy satisfies Church-Rosser -independently, the only difference is which normal form they target. - -In Vortex, a similar scenario would be defining multiple strategies of canonicalization, where one -strategy could target `List` as a canonical target (normal form) for list data, and another strategy -could target `ListView`. - -## Design - -The proposal is to add a new `to_canonical_target` function that accepts a `CanonicalTarget` -parameter, while keeping the existing `to_canonical` as a convenience that uses the default target. -This minimizes breaking changes. - -```rust -/// Canonicalize using the default target. Existing behavior, no breaking change. -fn to_canonical(array: &Array) -> VortexResult { - to_canonical_target(array, CanonicalTarget::Default) -} - -/// Canonicalize into a specific target. -fn to_canonical_target(array: &Array, target: CanonicalTarget) -> VortexResult; -``` - -Where `CanonicalTarget` selects from a family of valid canonical forms: - -```rust -/// A canonicalization target (selects which section to use). -enum CanonicalTarget { - /// The default canonical forms. This is what the current system uses. - /// For lists: ListView. For strings: VarBinView. - Default, - /// Contiguous canonical forms, motivated by Arrow FFI boundaries and - /// consumers that need stronger structural guarantees. - /// For lists: List (monotonic offsets). For strings: VarBin (contiguous data). - Contiguous, -} -``` - -The `Canonical` enum would use inner enums for the types where multiple canonical forms exist: - -```rust -pub enum CanonicalList { - List(ListArray), - ListView(ListViewArray), -} - -pub enum CanonicalVarBin { - VarBin(VarBinArray), - VarBinView(VarBinViewArray), -} - -pub enum Canonical { - ... - VarBin(CanonicalVarBin), // Maps to both Utf8 and Binary DTypes. - List(CanonicalList), // Maps to List DType. - ... -} - -array.to_canonical()?.scalar_at(i) == array.to_canonical_target(target)?.scalar_at(i) -``` - -Each target defines an internally confluent reduction strategy. Within `CanonicalTarget::Default`, -all paths converge to `ListView`/`VarBinView`. Within `CanonicalTarget::Contiguous`, all paths -converge to `List`/`VarBin`. The choice of target is thus made by the consumer (query engine, -FFI boundary, serializer), not by the encoded array. - -For `DType`s where the distinction does not apply (e.g., `Primitive`, `Bool`, `Null`), both targets -produce the same canonical form. The parameterization only has an effect where multiple valid -canonical forms exist. - -Because `DType` is a quotient type and canonicalization is a section on that quotient, supporting -multiple canonical targets is theoretically sound (adding this feature will not weaken any invariant -of the Vortex type system). Because each `CanonicalTarget` defines an internally confluent reduction -strategy, the quotient structure of `DType` guarantees that all well-defined operations produce -the same logical results regardless of which target is chosen. - -``` -Logical type (DType: purely semantic, not physical) - ▲ - │ section 1: Default (ListView, VarBinView) - │ section 2: Contiguous (List, VarBin) - │ section n: ... (future targets) - ▼ -Canonical form (a specific "normal form" encoding chosen by the section) - ▲ - │ encode - │ decode - ▼ -Physical encoding (dictionary, REE, bitpacked, etc.) -``` - -## Compatibility - -There shouldn't be any compatibility concerns here because even under a specific `DType`, the array -tree is fully serialized, and consumers can always convert back and forth between `List` and -`ListView` if they really need to. - -## Drawbacks - -The drawback is extra complexity in supporting multiple canonical targets. However, we've also had -to spend time making optimizations and fixes (`is_zero_copy_to_list` for `ListView`, see -[vortex#5129](https://github.com/vortex-data/vortex/pull/5129)) because we were forced to always -canonicalize into a single target. So there is an obvious tradeoff here. - -## Alternatives - -The alternative is to just not do this. We continue to find workarounds when canonical encodings do -not fit the use case. - -## Prior Art - -- **Arrow** defines both `List` and `ListView` (and `LargeList`/`LargeListView`) as separate type - IDs in its columnar format. Arrow's canonical layout uses monotonic offsets (`List`), and - `ListView` support is being added incrementally across implementations. -- **DuckDB** uses a single canonical form per logical type and does not support multiple - canonicalization targets. - -## Unresolved Questions - -- What is the concrete set of `CanonicalTarget` variants? This RFC proposes `Default` and - `Contiguous`, but there may be other useful targets. Theoretically, we could have a target that - would allow us to return compressed arrays as a `Canonical` type! - -## Future Possibilities - -- User-defined or plugin-defined canonical targets for custom FFI boundaries or serialization - formats. From 19c46be9c1e9dbc1ac81328b6825e84df52c0f30 Mon Sep 17 00:00:00 2001 From: Connor Tsui Date: Mon, 9 Mar 2026 12:53:09 -0400 Subject: [PATCH 16/17] revise decision framework Signed-off-by: Connor Tsui --- proposed/0029-types.md | 116 +++++++++++++++++++++-------------------- 1 file changed, 60 insertions(+), 56 deletions(-) diff --git a/proposed/0029-types.md b/proposed/0029-types.md index 570e75a..c5f654e 100644 --- a/proposed/0029-types.md +++ b/proposed/0029-types.md @@ -9,15 +9,11 @@ This RFC formalizes the Vortex type system by grounding `DType` as a quotient ty encodings and establishes a decision framework (based on refinement types) for when new `DType` variants are justified. -## Motivation - -Many of the Vortex maintainers have a good understanding of how the Vortex type system works: we -define a set of logical types, each of which can represent many physical data encodings. We -additionally define a set of `Canonical` encodings that represent the different targets that arrays -can decompress into. - ## Overview +Vortex defines a set of logical types, each of which can represent many physical data encodings, and +a set of `Canonical` encodings that represent the different targets that arrays can decompress into. + ### Logical vs. Physical Types A **logical type** (`DType`) describes what the data means, independent of how it is stored (e.g., @@ -45,7 +41,7 @@ Vortex's built-in set of logical types will not cover every use case. Extension consumers to define their own logical types on top of existing `DType`s without modifying the core type system. See [RFC 0005](./0005-extension.md) for the full design. -### Formalization +## Motivation This definition has mostly worked well for us. However, several recent discussions have revealed that this loose definition may be insufficient. @@ -169,6 +165,10 @@ variable-length list data. Both are valid sections (since both pick a representa equivalence class), and both satisfy `π(s(d)) = d`. The current system in Vortex simply hardcodes one particular section. +Note that even dictionary encoding or run-end encoding are theoretically valid sections (they +satisfy `π(s(d)) = d`). The fact that we choose flat, uncompressed forms as canonical is a design +choice optimized for compute, not a theoretical requirement. + ### The Church-Rosser Property (Confluence) A rewriting system has the **Church-Rosser property** (or is **confluent**) if, whenever a term can @@ -187,9 +187,11 @@ approach is to define **separate reduction relations**, each of which is interna For example, instead of one global set of reduction rules, you define two strategies: strategy A always reduces to normal form X, and strategy B always reduces to normal form Y. Each strategy -satisfies Church-Rosser independently, the only difference is which normal form they target. +satisfies Church-Rosser independently, the only difference is which normal form they target. This +is relevant for future work: if Vortex were to support multiple canonical targets (e.g., both `List` +and `ListView`), each target would define an internally confluent reduction strategy. -### Refinement Types and the DType Decision Framework +### Refinement Types A **refinement type** `{ x : T | P(x) }` is a type `T` restricted to values satisfying a predicate `P`. Refinement types express subtypes without changing the underlying representation, instead they @@ -207,65 +209,67 @@ predicate `valid_utf8` is what justifies the separate `DType` variant: it gates Without this predicate, `Utf8` and `Binary` would be the same type, and maintaining both would be redundant. -This gives us a concrete decision tree for whether a new `DType` variant is justified: +## What justifies a new type? -``` - Does it gate different query operations? - │ - Yes ───────────────┼─────────────── No - │ │ - Does Vortex core Is it structurally distinct - own those ops? from an existing DType? - │ │ - Yes ──────┼────── No Yes ──────┼────── No - │ │ │ │ - Add to DType Extension Add to DType Model as canonical - (refinement type) type (new structure) form or encoding -``` +With the formalizations above, we have a framework that gives us a set of questions to guide whether +a new `DType` variant is justified: + +1. **Does it gate different query operations?** If yes, does Vortex core own those operations? If + so, it should be a first-class `DType` (refinement type). If only external consumers need them, + an extension type suffices (see [RFC 0005](./0005-extension.md)). +2. **Is it structurally distinct from an existing `DType`?** If yes, there may be a case for a new + `DType` variant. However, this depends on whether the structural difference provides enough + practical benefit to justify the added complexity. +3. If neither, it should just be a new physical encoding. -As described in the [overview](#extension-types), the "Yes" branch distinguishes between -first-class `DType` variants and extension types based on who owns the gated operations. If Vortex -core provides kernels that require the predicate, it belongs in `DType`. If only external consumers -need it, an extension type suffices (see [RFC 0005](./0005-extension.md)). +These decisions are mostly design choices rather than strict theoretical requirements. For example, +`Utf8` could theoretically be an extension type over `Binary` with a `valid_utf8` predicate. The +main reason it is a first-class `DType` is because we want to have optimized string kernels in the +core Vortex library. -## Should `FixedSizeBinary` Be a `DType`? +### Should `FixedSizeList` be a type? -Applying the decision framework above to `FixedSizeBinary` vs. `FixedSizeList`: +We can apply this framework to an existing type, `FixedSizeList`: -### The Case For (Refinement / Nominal Argument) +**Does it gate different query operations?** No. A fixed-size list is a list with the additional +constraint that every element has the same length `n`, but `scalar_at`, `filter`, `take`, etc. all +behave identically regardless of whether the list is fixed-size. -`FixedSizeBinary` could be justified as a refinement type if it carries semantic distinctions -that `FixedSizeList` does not: +**Is it structurally distinct from an existing `DType`?** Yes. `FixedSizeList` has a different +physical layout (no offsets buffer, since all elements have the same size). However, this does not +automatically justify a new `DType` variant. A `List` whose offsets are a constant stride would +compress extremely well (a `SequenceArray`), and encodings in Vortex are designed to exploit exactly +this kind of redundancy. -- **Intent signaling.** `FixedSizeBinary` says "this is opaque binary data" (UUIDs, hashes, IP - addresses), while `FixedSizeList` says "this is a list of bytes that happens to have a - fixed length." -- **Schema compatibility.** Arrow, Parquet, and other formats distinguish these types. A - `FixedSizeBinary` `DType` makes round-tripping schemas lossless. -- **Potential invariant.** `FixedSizeBinary` could carry the invariant that elements are not - individually addressable or meaningful. This is weaker than `valid_utf8` but still semantic. +The argument for keeping `FixedSizeList` as its own `DType` is that it makes the fixed-size +invariant explicit at the type level, which simplifies downstream consumers that want to rely on +it (e.g., fixed-shape tensors). The argument against is that it adds a variant to the `DType` enum +that is logically equivalent to `List` with a constraint. -Under this reading, `FixedSizeBinary` is a lightweight refinement type, and the nominal distinction -earns its place in `DType`. +Ultimately, we decided in the past that the structural difference merits its own `DType` variant, +but an argument can be made that it does not warrant one. -### The Case Against (Canonical Form / Section Argument) +### Should `FixedSizeBinary` be a type? -`FixedSizeBinary` could instead be modeled as a canonical form (section target) of -`FixedSizeList`, or as extension type metadata: +A similar question applies to `FixedSizeBinary` vs. `FixedSizeList`: -- **No gating predicate.** If no operations are meaningful on `FixedSizeBinary` that are not also - meaningful on `FixedSizeList`, then the predicate is empty and the refinement is trivial. -- **Leaky abstraction risk.** Every query engine function that handles list types would need to - additionally handle `FixedSizeBinary`, or we would need coercion rules. If the handling is always - identical, the `DType` distinction adds complexity without semantic payoff. -- **Schema mapping as metadata.** The information "this came from a Parquet `FixedSizeBinary` - column" could live in extension type metadata rather than in the core `DType` enum, keeping the - logical layer minimal. -- **Complexity.** Adding yet another variant to the `DType` enum has a large surface area of change. +**Does it gate different query operations?** This is unclear. `FixedSizeBinary` signals "opaque +binary data" (UUIDs, hashes, IP addresses) whereas `FixedSizeList` signals "a list of bytes +that happens to have a fixed length." One could argue that `FixedSizeBinary` carries the invariant +that individual bytes are not independently meaningful, which would gate byte-level list operations. +However, this invariant is weaker than something like `valid_utf8`, and it is not obvious that the +core Vortex library would ship any operations gated by it. -Under this reading, `FixedSizeBinary` is an encoding or canonical form, not a logical type. +If the answer is **yes** (it gates operations), then the next question is whether Vortex core owns +those operations. If so, `FixedSizeBinary` is a first-class refinement type. If not, it should be +an extension type. +If the answer is **no** (it does not gate operations), then: **is it structurally distinct from an +existing `DType`?** `FixedSizeBinary` has the same physical layout as `FixedSizeList` (a +flat buffer of `n`-byte elements), so the answer is no. By the decision framework, this means it +should be modeled as a canonical form or extension type metadata rather than a new `DType` variant. +This question remains unresolved. See [Unresolved Questions](#unresolved-questions). ## Prior Art @@ -279,7 +283,7 @@ Under this reading, `FixedSizeBinary` is an encoding or canonical form, not a lo ## Unresolved Questions - Should `FixedSizeBinary` be a `DType` variant (refinement type) or extension type metadata? - See the [analysis above](#should-fixedsizebinary-be-a-dtype) for the case for and against. It is + See the [analysis above](#should-fixedsizebinary-be-a-type) for the case for and against. It is not so easy to claim one argument here is better than the other. Comments would be appreciated! ## Future Possibilities From 448bffe03b6234fbae6222dd9273abe3c9c5e816 Mon Sep 17 00:00:00 2001 From: Connor Tsui Date: Tue, 10 Mar 2026 10:47:49 -0400 Subject: [PATCH 17/17] add author Signed-off-by: Connor Tsui --- proposed/0029-types.md | 1 + 1 file changed, 1 insertion(+) diff --git a/proposed/0029-types.md b/proposed/0029-types.md index c5f654e..d8cc5c1 100644 --- a/proposed/0029-types.md +++ b/proposed/0029-types.md @@ -1,4 +1,5 @@ - Start Date: 2026-03-06 +- Authors: Connor Tsui - RFC PR: [vortex-data/rfcs#29](https://github.com/vortex-data/rfcs/pull/29) # Vortex Type System