Skip to content

Commit b0fc69f

Browse files
committed
more clarity on open questions
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
1 parent 4759c5c commit b0fc69f

1 file changed

Lines changed: 26 additions & 24 deletions

File tree

proposals/0005-extension-types.md

Lines changed: 26 additions & 24 deletions
Original file line numberDiff line numberDiff line change
@@ -22,18 +22,18 @@ Vortex provides an `Extension` variant of `DType` to help with this. Currently,
2222
a new extension type by defining an extension ID (for example, `vortex.time` or `vortex.date`) and
2323
specifying a canonical storage type that behaves like the "physical" type of the extension type.
2424
For example, the time extension types use a primitive storage type, meaning they wrap the primitive
25-
scalars or primitive arrays with some extra logic on top of it (mostly validating that the
26-
timestamps are valid).
25+
scalars or primitive arrays with some extra logic on top (mostly validating that the timestamps are
26+
valid).
2727

2828
We would like to add many more extension types. Some notable extension types (and their likely
2929
storage types) include:
3030

3131
- **Matrix / Tensor**: This would be an extension over `FixedSizeList`, where dimensions correspond
3232
to levels of nesting. There are many open questions on the design of this, but that is out of
3333
scope of this RFC.
34-
- **Union**: The sum type of an algebraic data type, like a Rust enum. We would likely implement
34+
- **Union**: The sum type of an algebraic data type, like a Rust enum. One approach is to implement
3535
this with a type tag paired with a `Struct` (so `Struct { Primitive, Struct { types } }`).
36-
Vortex may be well suited to represent this because it can compress each of the type field arrays
36+
Vortex is well suited to represent this because it can compress each of the type field arrays
3737
independently, so we do not need to distinguish between a "Sparse" or "Dense" Union.
3838
- **UUID**: Since this is a 128-bit number, we likely want to add `FixedSizeBinary`. This is out of
3939
scope for this RFC.
@@ -46,16 +46,16 @@ Take the time extension types as an example of where this limitation does not ma
4646
run a `compare` expression over a timestamp array, we just run the `compare` over the underlying
4747
primitive array. For simple types like timestamps, this is sufficient (and this is what we do right
4848
now). For types like Tensors (which are simply type aliases over `FixedSizeList`), this is also
49-
probably fine.
49+
fine.
5050

51-
However, for more complex types like UUID, Union, or JSON, this is likely not sufficient. It
52-
remains an open question whether forwarding to the storage type is always enough, but assuming we do
53-
need richer behavior, we want a more robust implementation path (instead of wrapping
54-
`ExtensionArray` and performing significant internal dispatch work).
51+
However, for more complex types like UUID, Union, or JSON, forwarding to the storage type is likely
52+
insufficient as these types need custom compute logic. Given that, we want a more robust
53+
implementation path instead of wrapping `ExtensionArray` and performing significant internal
54+
dispatch work.
5555

5656
## Design
5757

58-
### Completed Work
58+
### Background
5959

6060
[vortex#6081](https://github.com/vortex-data/vortex/pull/6081) introduced vtables (virtual tables,
6161
or Rust unit structs with methods) for extension `DType`s. Each extension type (e.g., `Timestamp`)
@@ -66,10 +66,12 @@ There were a few blockers (detailed in the tracking issue
6666
[vortex#6547](https://github.com/vortex-data/vortex/issues/6547)),
6767
but now that those have been resolved, we can move forward.
6868

69-
### Current Work
69+
### Proposed Design
7070

7171
Now that `vortex-scalar` and `vortex-dtype` have been merged into `vortex-array`, we can place
72-
all extension logic (for types, scalars, and arrays) onto an `ExtVTable` (it has been renamed now).
72+
all extension logic (for types, scalars, and arrays) onto an `ExtVTable` (renamed from
73+
`ExtDTypeVTable`).
74+
7375
It will look something like the following:
7476

7577
```rust
@@ -168,25 +170,23 @@ impl ExtDTypeRef {
168170
}
169171
```
170172

171-
We do not yet know what the API for extension arrays should look like, so it is hard to say what
172-
other methods should exist on `ExtDTypeRef`.
173+
**Open question**: What should the API for extension arrays look like? The answer will determine
174+
what additional methods `ExtDTypeRef` needs beyond the scalar-related ones shown above.
173175

174176
## Compatibility
175177

176178
This should not break anything because extension types are mostly related to in-memory APIs (since
177179
data is read from and written to disk as the storage type).
178180

179-
There could potentially be performance optimizations we could make with the new extension type
180-
system, but it is hard to know for sure.
181-
182181
## Drawbacks
183182

184-
As stated before, this might be overkill.
183+
If forwarding to the storage type turns out to be sufficient for all extension types, the
184+
additional vtable surface area adds complexity without clear benefit.
185185

186186
## Alternatives
187187

188-
We could have many `ExtensionArray` wrappers with custom logic. This
189-
might not even work, but even if it did, it would likely be very clunky.
188+
We could have many `ExtensionArray` wrappers with custom logic. This approach would be clunky and
189+
may not scale.
190190

191191
## Prior Art
192192

@@ -197,9 +197,11 @@ and also provides a
197197

198198
## Unresolved Questions
199199

200-
It is not yet clear whether changes to our extension types are strictly necessary, or whether
201-
forwarding to the storage type will always suffice. We also do not yet know how to define compute
202-
expressions over extension arrays.
200+
- Is forwarding to the storage type insufficient, and which extension types genuinely need custom
201+
compute logic?
202+
- What should the `ExtVTable` API for extension arrays look like? What methods beyond
203+
`validate_array` are needed?
204+
- How should compute expressions be defined and dispatched for extension types?
203205

204206
## Future Possibilities
205207

@@ -231,4 +233,4 @@ If we can get extension types working well, we can add all of the following type
231233

232234
[^2]:
233235
We likely cannot implement `Variant` as an extension type because we have no way of defining
234-
what the storage type would be (since we have no idea what the schema is for each row).
236+
what the storage type would be (since the schema is not known ahead of time for each row).

0 commit comments

Comments
 (0)