Commit bf683c1

Updated plan with new blosc2.transform() descriptor
1 parent 9b02394 commit bf683c1

1 file changed

Lines changed: 86 additions & 60 deletions

plans/ctable-ndarray-cols.md

@@ -126,105 +126,109 @@ Suggested message:
126126

127127
```text
128128
Cannot compare ndarray column 'embedding' directly; the result would not be a
129-
1-D row mask. Use an element projection like t.embedding[:, 0] > 0.5 or a
130-
row-wise reduction like t.embedding.row_max() > 0.5.
129+
1-D row mask. Use an element projection like t.embedding[:, 0] > 0.5 or an
130+
axis-aware reduction like t.embedding.max(axis=1) > 0.5.
131131
```
132132

133133
---
134134

135135
## 5. Practical user-facing analysis helpers
136136

137-
This is an extra practical addition beyond the existing plans: provide explicit
138-
row-wise reduction methods on ndarray columns. These make it easy to
139-
analyze fixed-shape columns without leaving `CTable` ergonomics.
137+
Treat an ndarray column as a logical array with shape `(nrows, *item_shape)`.
138+
Column reductions should follow NumPy / Blosc2 NDArray axis semantics over that
139+
logical shape.
140140

141-
### 5.1 Row-wise reductions
141+
### 5.1 Axis-aware reductions
142142

143-
Add methods that reduce only the inner item axes and return one value per row:
143+
For scalar columns, existing behavior remains unchanged:
144144

145145
```python
146-
t.embedding.row_sum()
147-
t.embedding.row_mean()
148-
t.embedding.row_min()
149-
t.embedding.row_max()
150-
t.embedding.row_std()
151-
t.embedding.row_var()
152-
t.embedding.row_any()
153-
t.embedding.row_all()
154-
t.embedding.row_norm(ord=2)
146+
t.price.shape # (nrows,)
147+
t.price.sum() # scalar reduction over rows
155148
```
156149

157-
For `item_shape=(768,)`, each returns shape `(nrows,)`.
158-
For `item_shape=(H, W, C)`, each reduces axes `(1, 2, 3)` by default.
150+
For ndarray columns:
159151

160-
Optional `axis=` can reduce selected inner axes:
152+
```python
153+
t.embedding.shape # (nrows, dim)
154+
t.embedding.sum() # scalar full reduction, same as axis=None
155+
t.embedding.sum(axis=None) # scalar full reduction
156+
t.embedding.sum(axis=0) # reduce rows -> shape (dim,)
157+
t.embedding.sum(axis=1) # reduce embedding coords -> shape (nrows,)
158+
t.embedding.norm(axis=1) # row-wise norm -> shape (nrows,)
159+
```
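The axis conventions in the block above can be illustrated without Blosc2. This is a minimal pure-Python sketch, not the CTable implementation. The helper names (`sum_all`, `sum_axis0`, `sum_axis1`, `norm_axis1`) and the sample data are hypothetical, and a nested list stands in for a column of logical shape `(nrows, dim)`.

```python
import math

# Hypothetical stand-in for an ndarray column with logical shape (nrows, dim) = (2, 2).
embedding = [[3.0, 4.0], [6.0, 8.0]]


def sum_all(rows):
    """Full reduction (axis=None): one scalar for the whole column."""
    return sum(sum(row) for row in rows)


def sum_axis0(rows):
    """Reduce over rows (axis=0): the result has shape (dim,)."""
    return [sum(col) for col in zip(*rows)]


def sum_axis1(rows):
    """Reduce over item coordinates (axis=1): the result has shape (nrows,)."""
    return [sum(row) for row in rows]


def norm_axis1(rows):
    """Row-wise Euclidean (ord=2) norm: the result has shape (nrows,)."""
    return [math.sqrt(sum(x * x for x in row)) for row in rows]
```

Of these, only the `axis=1` results have one value per table row, which is the property a row mask needs.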
160+
161+
For image-like columns:
161162

162163
```python
163-
t.image.row_mean(axis=(1, 2)) # mean over H,W, keep channel dimension
164+
t.image.shape # (nrows, H, W, C)
165+
t.image.mean() # scalar full reduction
166+
t.image.mean(axis=0) # mean image over rows -> shape (H, W, C)
167+
t.image.mean(axis=(1, 2)) # per-row per-channel mean -> shape (nrows, C)
168+
t.image.sum(axis=(1, 2, 3)) # per-row total -> shape (nrows,)
164169
```
165170

166-
Naming these `row_*` avoids ambiguity with existing scalar-column `.sum()` /
167-
`.max()` methods, which currently mean “reduce the whole column to a scalar”.
171+
This avoids special `row_*` methods and minimizes surprise: the table row is
172+
always the leading axis (`axis=0`), exactly as if the column were an NDArray of
173+
shape `(nrows, *item_shape)`.
168174

169-
### 5.2 Materialize row-wise reductions as generated columns
175+
`CTable.where()` still requires a 1-D row mask. For example,
176+
`t.embedding.norm(axis=1) > 5` is valid, whereas
177+
`t.image.mean(axis=(1, 2)) > 0.5` returns shape `(nrows, C)` and should be
178+
rejected unless further reduced to `(nrows,)`.
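
The shape rule behind this check can be sketched in a few lines of plain Python. The helpers `reduced_shape` and `is_row_mask_shape` are hypothetical names used for illustration only. They assume NumPy-style reductions without `keepdims` and are not part of the planned API.

```python
def reduced_shape(shape, axis=None):
    """Return the shape left after reducing `shape` over `axis` (no keepdims)."""
    if axis is None:
        return ()  # full reduction yields a scalar
    axes = (axis,) if isinstance(axis, int) else tuple(axis)
    ndim = len(shape)
    normalized = set()
    for ax in axes:
        if not -ndim <= ax < ndim:
            raise ValueError(f"axis {ax} is out of bounds for {ndim} dimensions")
        normalized.add(ax % ndim)  # map negative axes onto 0..ndim-1
    return tuple(n for i, n in enumerate(shape) if i not in normalized)


def is_row_mask_shape(result_shape, nrows):
    """True when a result has exactly one value per table row, i.e. shape (nrows,)."""
    return tuple(result_shape) == (nrows,)
```

For an image column of logical shape `(nrows, H, W, C)`, `reduced_shape(shape, (1, 2))` leaves `(nrows, C)`, which fails the row-mask check, while `axis=(1, 2, 3)` leaves `(nrows,)` and passes.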
179+
180+
### 5.2 Materialize reductions as generated columns
170181

171182
Use `CTable.add_generated_column()` as the canonical API for storing generated
172183
scalar/vector columns. This is consistent with the existing
173184
`add_computed_column()` name while making the storage/maintenance semantics
174185
explicit.
175186

176-
The first use case is ndarray row-wise reductions:
187+
The first use case is ndarray reductions that produce one value per row:
177188

178189
```python
179190
t.add_generated_column(
180191
"embedding_norm",
181-
source_columns=["embedding"],
182-
transform={"kind": "ndarray_row_reduction", "op": "norm", "ord": 2},
192+
transform=blosc2.transform.norm("embedding", axis=1, ord=2),
183193
dtype=blosc2.float64(),
184194
create_index=True,
185195
)
186196

187197
t.add_generated_column(
188198
"embedding_max",
189-
source_columns=["embedding"],
190-
transform={"kind": "ndarray_row_reduction", "op": "max"},
199+
transform=blosc2.transform.max("embedding", axis=1),
191200
dtype=blosc2.float32(),
192201
)
193202

194203
t.add_generated_column(
195204
"embedding_0",
196-
source_columns=["embedding"],
197-
transform={"kind": "ndarray_element", "key": 0},
205+
transform=blosc2.transform["embedding", 0],
198206
dtype=blosc2.float32(),
199207
)
200208
```
201209

202210
The concept is general: a **generated column** is a real stored column
203211
maintained from one or more source columns. `add_generated_column()` should
204-
support two explicit generation modes:
205-
206-
- `expr=` for scalar expressions, mirroring `add_computed_column()`
207-
- `transform=` for structured transforms such as ndarray row-wise reductions
212+
accept a `blosc2.Transform` object built by the `blosc2.transform` singleton
213+
namespace/factory.
208214

209-
Exactly one of `expr` or `transform` should be provided. Do not overload
210-
`transform=` to also accept expression strings; keeping `expr=` separate makes
211-
validation and documentation clearer.
212-
213-
Scalar generated-column example:
215+
Scalar expression transform example:
214216

215217
```python
216218
t.add_generated_column(
217-
"total", expr="price * qty", dtype=blosc2.float64(), create_index=True
219+
"total",
220+
transform=blosc2.transform("price * qty"),
221+
dtype=blosc2.float64(),
222+
create_index=True,
218223
)
219224
```
220225

221-
Structured transform examples:
226+
Other transform examples:
222227

223228
```python
224229
t.add_generated_column(
225230
"price_with_tax",
226-
source_columns=["price"],
227-
transform={"kind": "scalar_unary", "op": "mul", "value": 1.21},
231+
transform=blosc2.transform("price * 1.21"),
228232
dtype=blosc2.float64(),
229233
)
230234
```
@@ -239,8 +243,7 @@ Example:
239243
```python
240244
t.add_generated_column(
241245
"embedding_norm",
242-
source_columns=["embedding"],
243-
transform={"kind": "ndarray_row_reduction", "op": "norm", "ord": 2},
246+
transform=blosc2.transform.norm("embedding", axis=1, ord=2),
244247
dtype=blosc2.float64(),
245248
create_index=True,
246249
)
@@ -252,7 +255,7 @@ view = t.where(t.embedding_norm > 5.0) # can use the index
252255
Equivalent manual workflow today would be:
253256

254257
```python
255-
values = t.embedding.row_norm()[:]
258+
values = t.embedding.norm(axis=1)[:]
256259
t.add_column("embedding_norm", blosc2.field(blosc2.float64(), default=0.0))
257260
t.embedding_norm[:] = values
258261
t.create_index("embedding_norm")
@@ -265,22 +268,43 @@ def add_generated_column(
265268
self,
266269
name: str,
267270
*,
268-
expr=None,
269-
source_columns: list[str] | None = None,
270-
transform: dict | None = None,
271+
transform: blosc2.Transform,
271272
dtype=None,
272273
create_index: bool = False,
273274
) -> None: ...
274275
```
275276

276277
Validation rules:
277278

278-
- exactly one of `expr` or `transform` is required
279-
- `expr` accepts the same expression forms as `add_computed_column()` where practical
280-
- `source_columns` may be inferred for expression strings / LazyExpr callables when possible
281-
- `source_columns` is required for structured `transform=` descriptors
279+
- `transform` must be a `blosc2.Transform` instance
280+
- string expressions are represented as `blosc2.transform("price * qty")`
281+
- ndarray projections are represented as `blosc2.transform["embedding", 0]`
282+
- reductions follow logical `(nrows, *item_shape)` axis semantics, e.g. `blosc2.transform.norm("embedding", axis=1)`
283+
- source columns are taken from `transform.source_columns`
282284
- generated columns are always stored and maintained; virtual columns remain the job of `add_computed_column()`
283285

286+
`blosc2.transform` API sketch:
287+
288+
```python
289+
blosc2.transform("price * qty") # expression transform
290+
blosc2.transform.norm("embedding", axis=1, ord=2)
291+
blosc2.transform.max("embedding", axis=1)
292+
blosc2.transform.mean("image", axis=(1, 2)) # shape (nrows, C)
293+
blosc2.transform["embedding", 0] # per-row item projection
294+
blosc2.transform["image", :, :, 0] # per-row item projection with slices
295+
```
296+
297+
It is a callable singleton namespace/factory. All constructors return a
298+
`blosc2.Transform` object with at least:
299+
300+
```python
301+
transform.kind
302+
transform.source_columns
303+
transform.to_metadata()
304+
transform.evaluate_existing(table)
305+
transform.evaluate_batch(raw_columns)
306+
```
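
One way a single object could support call syntax, named constructors, and indexing at once is sketched below. This is an illustrative assumption, not the `blosc2` implementation: the class names, the `params` field, and the `"reduction"` / `"projection"` kind strings are all hypothetical. The evaluation methods are omitted, and source-column inference for expression strings is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Transform:
    """Hypothetical descriptor; the real blosc2.Transform may differ."""

    kind: str
    source_columns: tuple
    params: dict = field(default_factory=dict)

    def to_metadata(self):
        # Flatten kind and parameters into one plain dict.
        return {"kind": self.kind, **self.params}


class _TransformFactory:
    """Callable namespace: supports calls, named constructors, and indexing."""

    def __call__(self, expression):
        # transform("price * qty"); no source columns are inferred in this sketch.
        return Transform("expression", (), {"expression": expression})

    def norm(self, column, axis=None, ord=2):
        return Transform("reduction", (column,), {"op": "norm", "axis": axis, "ord": ord})

    def max(self, column, axis=None):
        return Transform("reduction", (column,), {"op": "max", "axis": axis})

    def __getitem__(self, key):
        # transform["embedding", 0] arrives here as the tuple ("embedding", 0).
        column, *item_key = key if isinstance(key, tuple) else (key,)
        return Transform("projection", (column,), {"key": tuple(item_key)})


# Single module-level instance, playing the role of `blosc2.transform`.
transform = _TransformFactory()
```

With this layout, `transform("price * qty")`, `transform.norm("embedding", axis=1, ord=2)`, and `transform["image", :, :, 0]` all return the same descriptor type, so a consumer such as `add_generated_column()` would only need to handle one kind of object.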
307+
284308
The helper should:
285309

286310
- choose or validate output dtype
@@ -305,7 +329,7 @@ Suggested metadata shape:
305329
"kind": "ndarray_row_reduction",
306330
"op": "norm",
307331
"ord": 2,
308-
"axis": None,
332+
"axis": 1,
309333
},
310334
"index": {
311335
"create": True,
@@ -314,7 +338,7 @@ Suggested metadata shape:
314338
}
315339
```
316340

317-
Expression-generated columns use `expr` metadata instead of `transform`:
341+
Expression-generated columns use expression transform metadata:
318342

319343
```python
320344
{
@@ -325,7 +349,10 @@ Expression-generated columns use `expr` metadata instead of `transform`:
325349
"maintain_on_append": True,
326350
"stale_on_source_update": True,
327351
"dtype": "float64",
328-
"expr": "price * qty",
352+
"transform": {
353+
"kind": "expression",
354+
"expression": "price * qty",
355+
},
329356
}
330357
```
331358

@@ -334,8 +361,7 @@ Expected maintenance behavior:
334361
```python
335362
t.add_generated_column(
336363
"embedding_norm",
337-
source_columns=["embedding"],
338-
transform={"kind": "ndarray_row_reduction", "op": "norm", "ord": 2},
364+
transform=blosc2.transform.norm("embedding", axis=1, ord=2),
339365
dtype=blosc2.float64(),
340366
create_index=True,
341367
)
@@ -378,7 +404,7 @@ ndarray column 'embedding'
378404
dtype : float32
379405
storage : NDArray shape=(1048576, 768), chunks=(..., ...), blocks=(..., ...)
380406
cbytes : ...
381-
row stats : min(row_norm)=..., mean(row_norm)=..., max(row_norm)=...
407+
row stats : min(norm(axis=1))=..., mean(norm(axis=1))=..., max(norm(axis=1))=...
382408
```
383409

384410
Keep this opt-in so normal table display stays compact.
@@ -491,7 +517,7 @@ as structured subarrays.
491517
- sort/groupby/index guard messages
492518
- display of small and large item shapes
493519
- `Column.ndim`, `size`, `item_shape`
494-
- row-wise reduction helpers
520+
- axis-aware ndarray column reductions
495521
- `add_generated_column()` with optional indexing
496522
- generated-column auto-fill on `append()` / `extend()`
497523
- generated-column staleness / refresh behavior after source-column mutation
