Commit 913fa41

first draft
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
1 parent a08d67c commit 913fa41

1 file changed: +236 -0 lines changed
proposed/0005-extension.md

Lines changed: 236 additions & 0 deletions
@@ -0,0 +1,236 @@
- Start Date: (2026-02-27)
- RFC PR: [vortex-data/rfcs#5](https://github.com/vortex-data/rfcs/pull/5)
- Tracking Issue: [vortex-data/vortex#6547](https://github.com/vortex-data/vortex/issues/6547)

## Summary

We would like to build a more robust system for extension data types (or `DType`s). This RFC
proposes a direction for extending the `ExtVTable` trait to support richer behavior (beyond
forwarding to the storage type), lays out the completed and in-progress work, and identifies the
open questions that remain.

## Motivation

A limitation of the current type system in Vortex is that we cannot easily add new logical types.
For example, the efforts to add `FixedSizeList`
([vortex#4372](https://github.com/vortex-data/vortex/issues/4372)) and to change `List` to
`ListView` ([vortex#4699](https://github.com/vortex-data/vortex/issues/4699)) were very intrusive.
It is much easier to add wrappers around canonical types (treating the canonical dtype as a
"storage type") and implement some additional logic than to add a new variant to the `DType` enum.

Vortex provides an `Extension` variant of `DType` to help with this. Currently, implementors can add
a new extension type by defining an extension ID (for example, `vortex.time` or `vortex.date`) and
specifying a canonical storage type that behaves like the "physical" type of the extension type.
For example, the time extension types use a primitive storage type, meaning they wrap the primitive
scalars or primitive arrays with some extra logic on top (mostly validating that the timestamps are
valid).

We would like to add many more extension types. Some notable extension types (and their likely
storage types) include:

- **Matrix / Tensor**: This would be an extension over `FixedSizeList`, where dimensions correspond
  to levels of nesting. There are many open questions on the design of this, but they are out of
  scope for this RFC.
- **Union**: The sum type of an algebraic data type, like a Rust enum. One approach is to implement
  this with a type tag paired with a `Struct` (so `Struct { Primitive, Struct { types } }`).
  Vortex is well suited to represent this because it can compress each of the type field arrays
  independently, so we do not need to distinguish between "Sparse" and "Dense" Unions.
- **UUID**: Since this is a 128-bit number, we likely want to add `FixedSizeBinary` as a canonical
  type. This is out of scope for this RFC.

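As a rough illustration of the Union storage layout described above, the following sketch uses plain `Vec`s in place of Vortex arrays. `UnionColumn`, `Value`, and the two-variant setup are hypothetical names chosen for this example and are not part of Vortex. Each row carries a type tag, and each variant gets a full-length child column with nulls in the unselected slots.

```rust
/// A decoded union value with two hypothetical variants.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Int(i64),
    Text(String),
}

/// Hypothetical stand-in for `Struct { Primitive, Struct { types } }`:
/// one tag per row plus one full-length child column per variant.
pub struct UnionColumn {
    /// Type tag per row: 0 selects `ints`, 1 selects `texts`.
    tags: Vec<u8>,
    /// Child columns have the same length as `tags`; unselected slots are `None`.
    ints: Vec<Option<i64>>,
    texts: Vec<Option<String>>,
}

impl UnionColumn {
    /// Builds the column by writing each value into its variant's child column.
    pub fn from_values(values: &[Value]) -> Self {
        let mut col = UnionColumn { tags: vec![], ints: vec![], texts: vec![] };
        for value in values {
            match value {
                Value::Int(i) => {
                    col.tags.push(0);
                    col.ints.push(Some(*i));
                    col.texts.push(None);
                }
                Value::Text(s) => {
                    col.tags.push(1);
                    col.ints.push(None);
                    col.texts.push(Some(s.clone()));
                }
            }
        }
        col
    }

    /// Reads row `i` by dispatching on its tag; returns `None` when out of bounds.
    pub fn get(&self, i: usize) -> Option<Value> {
        match *self.tags.get(i)? {
            0 => self.ints[i].map(Value::Int),
            1 => self.texts[i].clone().map(Value::Text),
            _ => None,
        }
    }
}
```

Because every child column has the same length as the tag column, this corresponds to the "sparse" layout. The expectation stated above is that the mostly-null children can be compressed independently, which is why a separate "dense" layout may not be needed.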
The issue with the current system is that it only forwards logic to the underlying storage type.
The only other behavior we support is serializing and pretty-printing extension arrays. This means
that we cannot define custom compute logic for extension types.

Take the time extension types as an example of where this limitation does not matter. If we want to
run a `compare` expression over a timestamp array, we just run the `compare` over the underlying
primitive array. For simple types like timestamps, this is sufficient (and this is what we do right
now). For types like Tensors (which are simply type aliases over `FixedSizeList`), this is also
fine.

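The forwarding behavior can be sketched as follows, assuming a hypothetical `ExtArray` that pairs an extension ID with `i64` storage. None of these names are Vortex APIs; the point is only that the extension-level comparison adds no logic of its own beyond checking that both sides are the same extension type.

```rust
/// Hypothetical stand-in for an extension array: an ID plus primitive storage.
pub struct ExtArray {
    pub ext_id: &'static str,
    pub storage: Vec<i64>,
}

/// Element-wise `<` over plain primitive storage.
pub fn compare_lt(lhs: &[i64], rhs: &[i64]) -> Vec<bool> {
    lhs.iter().zip(rhs).map(|(a, b)| a < b).collect()
}

/// Comparing two extension arrays simply forwards to their storage arrays.
/// Returns `None` if the extension IDs or lengths do not match.
pub fn compare_lt_ext(lhs: &ExtArray, rhs: &ExtArray) -> Option<Vec<bool>> {
    if lhs.ext_id != rhs.ext_id || lhs.storage.len() != rhs.storage.len() {
        return None;
    }
    Some(compare_lt(&lhs.storage, &rhs.storage))
}
```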
However, for more complex types like UUID, Union, or JSON, forwarding to the storage type is likely
insufficient as these types need custom compute logic. Given that, we want a more robust
implementation path instead of wrapping `ExtensionArray` and performing significant internal
dispatch work.

## Design

### Background

[vortex#6081](https://github.com/vortex-data/vortex/pull/6081) introduced vtables (virtual tables,
or Rust unit structs with methods) for extension `DType`s. Each extension type (e.g., `Timestamp`)
now implements `ExtDTypeVTable`, which handles validation, serialization, and metadata.
The type-erased `ExtDTypeRef` carries this vtable with it inside `DType::Extension`.

There were a few blockers (detailed in the tracking issue
[vortex#6547](https://github.com/vortex-data/vortex/issues/6547)),
but now that those have been resolved, we can move forward.

### Proposed Design

Now that `vortex-scalar` and `vortex-dtype` have been merged into `vortex-array`, we can place
all extension logic (for types, scalars, and arrays) onto an `ExtVTable` (renamed from
`ExtDTypeVTable`).

It will look something like the following:

```rust
// Note: naming should be considered unstable.

/// The public API for defining new extension types.
///
/// This is the non-object-safe trait that plugin authors implement to define a new extension
/// type. It specifies the type's identity, metadata, serialization, and validation.
pub trait ExtVTable: 'static + Sized + Send + Sync + Clone + Debug + Eq + Hash {
    /// Associated type containing the deserialized metadata for this extension type.
    type Metadata: 'static + Send + Sync + Clone + Debug + Display + Eq + Hash;

    /// A native Rust value that represents a scalar of the extension type.
    ///
    /// The value only represents non-null values. We denote nullable values as `Option<Value>`.
    type NativeValue<'a>: Display;

    /// Returns the ID for this extension type.
    fn id(&self) -> ExtId;

    // Methods related to the extension `DType`.

    /// Serialize the metadata into a byte vector.
    fn serialize_metadata(&self, metadata: &Self::Metadata) -> VortexResult<Vec<u8>>;

    /// Deserialize the metadata from a byte slice.
    fn deserialize_metadata(&self, metadata: &[u8]) -> VortexResult<Self::Metadata>;

    /// Validate that the given storage type is compatible with this extension type.
    fn validate_dtype(&self, metadata: &Self::Metadata, storage_dtype: &DType) -> VortexResult<()>;

    // Methods related to the extension scalar values.

    /// Validate the given storage value is compatible with the extension type.
    ///
    /// By default, this calls [`unpack_native()`](ExtVTable::unpack_native) and discards the
    /// result.
    ///
    /// # Errors
    ///
    /// Returns an error if the storage [`ScalarValue`] is not compatible with the extension type.
    fn validate_scalar_value(
        &self,
        metadata: &Self::Metadata,
        storage_dtype: &DType,
        storage_value: &ScalarValue,
    ) -> VortexResult<()> {
        self.unpack_native(metadata, storage_dtype, storage_value)
            .map(|_| ())
    }

    /// Validate and unpack a native value from the storage [`ScalarValue`].
    ///
    /// Note that [`ExtVTable::validate_dtype()`] is always called first to validate the storage
    /// [`DType`], and the [`Scalar`](crate::scalar::Scalar) implementation will verify that the
    /// storage value is compatible with the storage dtype on construction.
    ///
    /// # Errors
    ///
    /// Returns an error if the storage [`ScalarValue`] is not compatible with the extension type.
    fn unpack_native<'a>(
        &self,
        metadata: &'a Self::Metadata,
        storage_dtype: &'a DType,
        storage_value: &'a ScalarValue,
    ) -> VortexResult<Self::NativeValue<'a>>;

    // Methods related to extension arrays (`ArrayRef`). The API here is still an open question.

    /// Validate that the given storage array is compatible with this extension type.
    fn validate_array(&self, metadata: &Self::Metadata, storage_array: &ArrayRef) -> VortexResult<()>;

    /// Cast the array to the target `DType` (default implementation elided).
    fn cast_array(&self, metadata: &Self::Metadata, array: &ArrayRef, target: &DType) -> VortexResult<ArrayRef> { ... }

    // Additional compute methods TBD.
}
```

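To give a feel for what implementing the trait might involve, here is a self-contained sketch against a trimmed-down copy of `ExtVTable`. `DType`, `ScalarValue`, and `ExtResult` are simplified stand-ins rather than the real Vortex types, and `TimeOfDay` (with the ID `example.time_of_day`) is a hypothetical extension type invented for illustration. Serialization and array methods are left out.

```rust
use std::fmt;

// Simplified stand-ins for the Vortex types (assumptions, not the real API).
#[derive(Debug, Clone, PartialEq)]
pub enum DType { I64, Utf8 }

#[derive(Debug, Clone, PartialEq)]
pub enum ScalarValue { I64(i64), Utf8(String) }

pub type ExtResult<T> = Result<T, String>;

/// Trimmed-down version of the proposed trait: identity, dtype validation, unpacking.
pub trait ExtVTable {
    type Metadata;
    type NativeValue<'a>: fmt::Display;

    fn id(&self) -> &'static str;
    fn validate_dtype(&self, metadata: &Self::Metadata, storage_dtype: &DType) -> ExtResult<()>;
    fn unpack_native<'a>(
        &self,
        metadata: &'a Self::Metadata,
        storage_dtype: &'a DType,
        storage_value: &'a ScalarValue,
    ) -> ExtResult<Self::NativeValue<'a>>;

    /// Default: a storage value is valid if it unpacks successfully.
    fn validate_scalar_value(
        &self,
        metadata: &Self::Metadata,
        storage_dtype: &DType,
        storage_value: &ScalarValue,
    ) -> ExtResult<()> {
        self.unpack_native(metadata, storage_dtype, storage_value).map(|_| ())
    }
}

/// Metadata for the hypothetical type: the unit of the stored tick count.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TimeUnit { Seconds, Millis }

/// Hypothetical extension type: a time of day stored as an `i64` tick count.
pub struct TimeOfDay;

/// Native value that borrows its unit from the metadata.
pub struct TimeValue<'a> { unit: &'a TimeUnit, ticks: i64 }

impl fmt::Display for TimeValue<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let suffix = match self.unit { TimeUnit::Seconds => "s", TimeUnit::Millis => "ms" };
        write!(f, "{}{}", self.ticks, suffix)
    }
}

impl ExtVTable for TimeOfDay {
    type Metadata = TimeUnit;
    type NativeValue<'a> = TimeValue<'a>;

    fn id(&self) -> &'static str { "example.time_of_day" }

    /// Only `I64` storage is accepted.
    fn validate_dtype(&self, _metadata: &TimeUnit, storage_dtype: &DType) -> ExtResult<()> {
        match storage_dtype {
            DType::I64 => Ok(()),
            other => Err(format!("expected I64 storage, got {other:?}")),
        }
    }

    /// Accepts tick counts in `[0, one day)` for the given unit.
    fn unpack_native<'a>(
        &self,
        metadata: &'a TimeUnit,
        _storage_dtype: &'a DType,
        storage_value: &'a ScalarValue,
    ) -> ExtResult<TimeValue<'a>> {
        let ticks_per_day: i64 = match metadata {
            TimeUnit::Seconds => 86_400,
            TimeUnit::Millis => 86_400_000,
        };
        match storage_value {
            ScalarValue::I64(t) if (0..ticks_per_day).contains(t) => {
                Ok(TimeValue { unit: metadata, ticks: *t })
            }
            other => Err(format!("invalid time-of-day value: {other:?}")),
        }
    }
}
```

Note that `TimeOfDay` does not override `validate_scalar_value`: the default implementation reuses `unpack_native`, so the range check lives in one place. The same value (`90_000`) is rejected when the unit is seconds and accepted when the unit is milliseconds, which shows why validation needs the metadata as context.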
Most of the implementation work will be making sure that `ExtDTypeRef` (which we pass around as the
`Extension` variant of `DType`) has the correct methods that access the internal, type-erased
`ExtVTable`.

Take extension scalars as an example. The only behavior we need from extension scalars is validating
that they have correct values, displaying them, and unpacking them into native types. So we added
these methods to `ExtDTypeRef`:

```rust
impl ExtDTypeRef {
    /// Formats an extension scalar value using the current dtype for metadata context.
    pub fn fmt_storage_value<'a>(
        &'a self,
        f: &mut fmt::Formatter<'_>,
        storage_value: &'a ScalarValue,
    ) -> fmt::Result { ... }

    /// Validates that the given storage scalar value is valid for this dtype.
    pub fn validate_storage_value(&self, storage_value: &ScalarValue) -> VortexResult<()> { ... }
}
```

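One way `ExtDTypeRef` could reach a type-erased vtable is to wrap the typed trait in an object-safe adapter. The sketch below assumes this design: `DynExt`, `Typed`, `ElapsedTime`, and `display_storage_value` are hypothetical names, and the trait is a reduced stand-in for `ExtVTable`, so this is not necessarily how Vortex implements it.

```rust
use std::fmt;
use std::sync::Arc;

/// Simplified stand-in for the Vortex scalar value type (an assumption).
#[derive(Debug, Clone, PartialEq)]
pub enum ScalarValue { I64(i64), Utf8(String) }

/// Typed, non-object-safe trait (reduced from the proposal).
pub trait ExtVTable {
    type Metadata;
    type NativeValue<'a>: fmt::Display;

    fn unpack_native<'a>(
        &self,
        metadata: &'a Self::Metadata,
        storage_value: &'a ScalarValue,
    ) -> Result<Self::NativeValue<'a>, String>;
}

/// Object-safe counterpart with the associated types erased (hypothetical).
trait DynExt {
    fn validate(&self, storage_value: &ScalarValue) -> Result<(), String>;
    fn display(&self, storage_value: &ScalarValue) -> String;
}

/// Pairs a vtable with its metadata so the erased methods need no type parameters.
struct Typed<V: ExtVTable> { vtable: V, metadata: V::Metadata }

impl<V: ExtVTable> DynExt for Typed<V> {
    fn validate(&self, storage_value: &ScalarValue) -> Result<(), String> {
        self.vtable.unpack_native(&self.metadata, storage_value).map(|_| ())
    }

    fn display(&self, storage_value: &ScalarValue) -> String {
        match self.vtable.unpack_native(&self.metadata, storage_value) {
            Ok(native) => native.to_string(),
            Err(e) => format!("<invalid: {e}>"),
        }
    }
}

/// Type-erased handle, analogous in spirit to `ExtDTypeRef`.
#[derive(Clone)]
pub struct ExtDTypeRef(Arc<dyn DynExt>);

impl ExtDTypeRef {
    pub fn new<V: ExtVTable + 'static>(vtable: V, metadata: V::Metadata) -> Self {
        ExtDTypeRef(Arc::new(Typed { vtable, metadata }))
    }

    pub fn validate_storage_value(&self, storage_value: &ScalarValue) -> Result<(), String> {
        self.0.validate(storage_value)
    }

    pub fn display_storage_value(&self, storage_value: &ScalarValue) -> String {
        self.0.display(storage_value)
    }
}

/// Hypothetical extension type used only to exercise the erased handle.
pub struct ElapsedTime;

/// Native value: a tick count labeled with the unit taken from the metadata.
pub struct Labeled<'a> { ticks: i64, unit: &'a str }

impl fmt::Display for Labeled<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{} {}", self.ticks, self.unit)
    }
}

impl ExtVTable for ElapsedTime {
    type Metadata = String; // Unit label, e.g. "ms".
    type NativeValue<'a> = Labeled<'a>;

    fn unpack_native<'a>(
        &self,
        metadata: &'a String,
        storage_value: &'a ScalarValue,
    ) -> Result<Labeled<'a>, String> {
        match storage_value {
            ScalarValue::I64(t) if *t >= 0 => Ok(Labeled { ticks: *t, unit: metadata.as_str() }),
            other => Err(format!("not a non-negative i64: {other:?}")),
        }
    }
}
```

Callers holding an `ExtDTypeRef` never name `ElapsedTime` or its associated types; they only see the erased methods. Array-related methods could presumably be threaded through the same adapter once their shape is decided.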
**Open question**: What should the API for extension arrays look like? The answer will determine
what additional methods `ExtDTypeRef` needs beyond the scalar-related ones shown above.

## Compatibility

This should not break anything because extension types are mostly related to in-memory APIs (since
data is read from and written to disk as the storage type).

## Drawbacks

If forwarding to the storage type turns out to be sufficient for all extension types, the
additional vtable surface area adds complexity without clear benefit.

## Alternatives

We could have many `ExtensionArray` wrappers with custom logic. This approach would be clunky and
may not scale.

## Prior Art

Apache Arrow allows defining
[extension types](https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types)
and also provides a
[set of canonical extension types](https://arrow.apache.org/docs/format/CanonicalExtensions.html).

## Unresolved Questions

- Is forwarding to the storage type actually insufficient? If so, which extension types genuinely
  need custom compute logic?
- What should the `ExtVTable` API for extension arrays look like? What methods beyond
  `validate_array` are needed?
- How should compute expressions be defined and dispatched for extension types?

## Future Possibilities

If we can get extension types working well, we can add all of the following types:

- `DateTimeParts` (`Primitive`)
- Matrix (`FixedSizeList`)
- Tensor (`FixedSizeList`)
- UUID (Do we need to add `FixedSizeBinary` as a canonical type?)
- JSON (`Utf8`)
- [PDX](https://arxiv.org/pdf/2503.04422v1) (`FixedSizeList`)
- Union
  - Sparse (`Struct { Primitive, Struct { types } }`)
  - Dense[^1]
- Map (`List<Struct { K, V }>`)
- Tags: See this
  [discussion](https://github.com/vortex-data/vortex/discussions/5772#discussioncomment-15279892),
  where we think we can represent this with `ListView<Utf8>`
- `Struct` but with protobuf-style field numbers (`Struct`)
- **NOT** Variant[^2]
- And likely more.

[^1]:
    `Struct` doesn't work here because children can have different lengths, but what we could do
    is simply force the inner `Struct { types }` to hold `SparseArray` fields, which would
    effectively be the exact same but with the overhead of tracking indices for each of the child
    fields. In that case, it might just be better to always use a "sparse" union and let the
    compressor decide what to do.

[^2]:
    We likely cannot implement `Variant` as an extension type because we have no way of defining
    what the storage type would be (since the schema is not known ahead of time for each row).
