@@ -126,105 +126,109 @@ Suggested message:
126126
127127``` text
128128Cannot compare ndarray column 'embedding' directly; the result would not be a
129- 1-D row mask. Use an element projection like t.embedding[:, 0] > 0.5 or a
130- row-wise reduction like t.embedding.row_max() > 0.5.
129+ 1-D row mask. Use an element projection like t.embedding[:, 0] > 0.5 or an
130+ axis-aware reduction like t.embedding.max(axis=1) > 0.5.
131131```
132132
133133---
134134
135135## 5. Practical user-facing analysis helpers
136136
137- This is an extra practical addition beyond the existing plans: provide explicit
138- row-wise reduction methods on ndarray columns. These make it easy to
139- analyze fixed-shape columns without leaving `CTable` ergonomics.
137+ Treat an ndarray column as a logical array with shape `(nrows, *item_shape)`.
138+ Column reductions should follow NumPy / Blosc2 NDArray axis semantics over that
139+ logical shape.
140140
141- ### 5.1 Row-wise reductions
141+ ### 5.1 Axis-aware reductions
142142
143- Add methods that reduce only the inner item axes and return one value per row:
143+ For scalar columns, existing behavior remains unchanged:
144144
145145``` python
146- t.embedding.row_sum()
147- t.embedding.row_mean()
148- t.embedding.row_min()
149- t.embedding.row_max()
150- t.embedding.row_std()
151- t.embedding.row_var()
152- t.embedding.row_any()
153- t.embedding.row_all()
154- t.embedding.row_norm(ord=2)
146+ t.price.shape # (nrows,)
147+ t.price.sum() # scalar reduction over rows
155148```
156149
157- For `item_shape=(768,)`, each returns shape `(nrows,)`.
158- For `item_shape=(H, W, C)`, each reduces axes `(1, 2, 3)` by default.
150+ For ndarray columns:
159151
160- Optional `axis=` can reduce selected inner axes:
152+ ``` python
153+ t.embedding.shape # (nrows, dim)
154+ t.embedding.sum() # scalar full reduction, same as axis=None
155+ t.embedding.sum(axis=None)  # scalar full reduction
156+ t.embedding.sum(axis=0)  # reduce rows -> shape (dim,)
157+ t.embedding.sum(axis=1)  # reduce embedding coords -> shape (nrows,)
158+ t.embedding.norm(axis=1)  # row-wise norm -> shape (nrows,)
159+ ```
160+
161+ For image-like columns:
161162
162163``` python
163- t.image.row_mean(axis=(1, 2))  # mean over H,W, keep channel dimension
164+ t.image.shape # (nrows, H, W, C)
165+ t.image.mean() # scalar full reduction
166+ t.image.mean(axis=0)  # mean image over rows -> shape (H, W, C)
167+ t.image.mean(axis=(1, 2))  # per-row per-channel mean -> shape (nrows, C)
168+ t.image.sum(axis=(1, 2, 3))  # per-row total -> shape (nrows,)
164169```
165170
166- Naming these `row_*` avoids ambiguity with existing scalar-column `.sum()` /
167- `.max()` methods, which currently mean “reduce the whole column to a scalar”.
171+ This avoids special `row_*` methods and minimizes surprise: the table row is
172+ always the leading axis (`axis=0`), exactly as if the column were an NDArray of
173+ shape `(nrows, *item_shape)`.
168174
169- ### 5.2 Materialize row-wise reductions as generated columns
175+ `CTable.where()` still requires a 1-D row mask. For example,
176+ `t.embedding.norm(axis=1) > 5` is valid, whereas
177+ `t.image.mean(axis=(1, 2)) > 0.5` returns shape `(nrows, C)` and should be
178+ rejected unless further reduced to `(nrows,)`.
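
The axis semantics above can be checked against plain NumPy, which the plan names as the reference behavior. The following is only a sketch: the arrays are NumPy stand-ins for the logical `(nrows, *item_shape)` column arrays, not `CTable` columns, and `is_row_mask` is a hypothetical helper illustrating the 1-D mask rule rather than an existing API.

```python
import numpy as np

# NumPy stand-ins for the logical column arrays; this is not the CTable API.
nrows, dim = 4, 3
embedding = np.arange(nrows * dim, dtype=np.float32).reshape(nrows, dim)

total = embedding.sum()  # axis=None: full reduction to a scalar
per_coord = embedding.sum(axis=0)  # reduce rows -> shape (dim,)
per_row = embedding.sum(axis=1)  # reduce item axis -> shape (nrows,)
row_norm = np.linalg.norm(embedding, axis=1)  # row-wise norm -> shape (nrows,)

H, W, C = 2, 2, 3
image = np.ones((nrows, H, W, C), dtype=np.float32)
per_channel = image.mean(axis=(1, 2))  # shape (nrows, C): not a row mask
per_image = image.sum(axis=(1, 2, 3))  # shape (nrows,)


def is_row_mask(mask, nrows):
    """Hypothetical check: a where() mask must be 1-D boolean, one entry per row."""
    return mask.ndim == 1 and mask.shape[0] == nrows and mask.dtype == np.bool_


print(is_row_mask(row_norm > 5, nrows))  # 1-D comparison result
print(is_row_mask(per_channel > 0.5, nrows))  # 2-D comparison result
```

Reducing `per_channel` once more over its last axis, for example with `.max(axis=1)`, would bring it back to shape `(nrows,)`.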
179+
180+ ### 5.2 Materialize reductions as generated columns
170181
171182Use `CTable.add_generated_column()` as the canonical API for storing generated
172183scalar/vector columns. This is consistent with the existing
173184`add_computed_column()` name while making the storage/maintenance semantics
174185explicit.
175186
176- The first use case is ndarray row-wise reductions:
187+ The first use case is ndarray reductions that produce one value per row:
177188
178189``` python
179190t.add_generated_column(
180191    "embedding_norm",
181-     source_columns=["embedding"],
182-     transform={"kind": "ndarray_row_reduction", "op": "norm", "ord": 2},
192+     transform=blosc2.transform.norm("embedding", axis=1, ord=2),
183193    dtype=blosc2.float64(),
184194    create_index=True,
185195)
186196
187197t.add_generated_column(
188198    "embedding_max",
189-     source_columns=["embedding"],
190-     transform={"kind": "ndarray_row_reduction", "op": "max"},
199+     transform=blosc2.transform.max("embedding", axis=1),
191200    dtype=blosc2.float32(),
192201)
193202
194203t.add_generated_column(
195204    "embedding_0",
196-     source_columns=["embedding"],
197-     transform={"kind": "ndarray_element", "key": 0},
205+     transform=blosc2.transform["embedding", 0],
198206    dtype=blosc2.float32(),
199207)
200208```
201209
202210The concept is general: a **generated column** is a real stored column
203211maintained from one or more source columns. `add_generated_column()` should
204- support two explicit generation modes:
205-
206- - `expr=` for scalar expressions, mirroring `add_computed_column()`
207- - `transform=` for structured transforms such as ndarray row-wise reductions
212+ accept a `blosc2.Transform` object built by the `blosc2.transform` singleton
213+ namespace/factory.
208214
209- Exactly one of `expr` or `transform` should be provided. Do not overload
210- `transform=` to also accept expression strings; keeping `expr=` separate makes
211- validation and documentation clearer.
212-
213- Scalar generated-column example:
215+ Scalar expression transform example:
214216
215217``` python
216218t.add_generated_column(
217-     "total", expr="price * qty", dtype=blosc2.float64(), create_index=True
219+     "total",
220+     transform=blosc2.transform("price * qty"),
221+     dtype=blosc2.float64(),
222+     create_index=True,
218223)
219224```
220225
221- Structured transform examples:
226+ Other transform examples:
222227
223228``` python
224229t.add_generated_column(
225230    "price_with_tax",
226-     source_columns=["price"],
227-     transform={"kind": "scalar_unary", "op": "mul", "value": 1.21},
231+     transform=blosc2.transform("price * 1.21"),
228232    dtype=blosc2.float64(),
229233)
230234```
@@ -239,8 +243,7 @@ Example:
239243``` python
240244t.add_generated_column(
241245    "embedding_norm",
242-     source_columns=["embedding"],
243-     transform={"kind": "ndarray_row_reduction", "op": "norm", "ord": 2},
246+     transform=blosc2.transform.norm("embedding", axis=1, ord=2),
244247    dtype=blosc2.float64(),
245248    create_index=True,
246249)
@@ -252,7 +255,7 @@ view = t.where(t.embedding_norm > 5.0) # can use the index
252255Equivalent manual workflow today would be:
253256
254257``` python
255- values = t.embedding.row_norm()[:]
258+ values = t.embedding.norm(axis=1)[:]
256259t.add_column("embedding_norm", blosc2.field(blosc2.float64(), default=0.0))
257260t.embedding_norm[:] = values
258261t.create_index("embedding_norm")
@@ -265,22 +268,43 @@ def add_generated_column(
265268    self,
266269    name: str,
267270    *,
268-     expr=None,
269-     source_columns: list[str] | None = None,
270-     transform: dict | None = None,
271+     transform: blosc2.Transform,
271272    dtype=None,
272273    create_index: bool = False,
273274) -> None: ...
274275```
275276
276277Validation rules:
277278
278- - exactly one of `expr` or `transform` is required
279- - `expr` accepts the same expression forms as `add_computed_column()` where practical
280- - `source_columns` may be inferred for expression strings / LazyExpr callables when possible
281- - `source_columns` is required for structured `transform=` descriptors
279+ - `transform` must be a `blosc2.Transform` instance
280+ - string expressions are represented as `blosc2.transform("price * qty")`
281+ - ndarray projections are represented as `blosc2.transform["embedding", 0]`
282+ - reductions follow logical `(nrows, *item_shape)` axis semantics, e.g. `blosc2.transform.norm("embedding", axis=1)`
283+ - source columns are taken from `transform.source_columns`
282284- generated columns are always stored and maintained; virtual columns remain the job of `add_computed_column()`
283285
286+ `blosc2.transform` API sketch:
287+
288+ ``` python
289+ blosc2.transform("price * qty")  # expression transform
290+ blosc2.transform.norm("embedding", axis=1, ord=2)
291+ blosc2.transform.max("embedding", axis=1)
292+ blosc2.transform.mean("image", axis=(1, 2))  # shape (nrows, C)
293+ blosc2.transform["embedding", 0]  # per-row item projection
294+ blosc2.transform["image", :, :, 0]  # per-row item projection with slices
295+ ```
296+
297+ `blosc2.transform` is a callable singleton namespace/factory. All constructors
298+ return a `blosc2.Transform` object with at least:
299+
300+ ``` python
301+ transform.kind
302+ transform.source_columns
303+ transform.to_metadata()
304+ transform.evaluate_existing(table)
305+ transform.evaluate_batch(raw_columns)
306+ ```
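
To make the attribute list above more concrete, here is a minimal sketch of what one such transform object could look like. Everything in it is an assumption for illustration: `NormTransform` is a hypothetical stand-in, not the proposed `blosc2.Transform` class, the metadata keys simply mirror the metadata shape suggested elsewhere in this plan, and `evaluate_batch` is assumed to receive a mapping from column name to a NumPy block of shape `(batch_rows, *item_shape)`.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class NormTransform:
    """Hypothetical stand-in for the object returned by transform.norm(...)."""

    column: str
    axis: int = 1
    ord: int = 2
    kind: str = "ndarray_row_reduction"  # label borrowed from the plan's metadata

    @property
    def source_columns(self):
        # A single-source transform; expression transforms could list several.
        return [self.column]

    def to_metadata(self):
        # Plain dict so it can be stored next to the generated column.
        return {"kind": self.kind, "op": "norm", "ord": self.ord, "axis": self.axis}

    def evaluate_batch(self, raw_columns):
        # Assumed input: {column_name: array of shape (batch_rows, *item_shape)}.
        block = np.asarray(raw_columns[self.column])
        return np.linalg.norm(block, ord=self.ord, axis=self.axis)


t = NormTransform("embedding")
norms = t.evaluate_batch({"embedding": np.array([[3.0, 4.0], [0.0, 0.0]])})
print(t.source_columns, t.to_metadata(), norms)
```

Keeping the object immutable and serializable through `to_metadata()` is one way to let the table re-create the transform when it is reopened; whether the real design does this is not stated in the plan.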
307+
284308The helper should:
285309
286310- choose or validate output dtype
@@ -305,7 +329,7 @@ Suggested metadata shape:
305329        "kind": "ndarray_row_reduction",
306330        "op": "norm",
307331        "ord": 2,
308-         "axis": None,
332+         "axis": 1,
309333    },
310334    "index": {
311335        "create": True,
@@ -314,7 +338,7 @@ Suggested metadata shape:
314338}
315339```
316340
317- Expression-generated columns use `expr` metadata instead of `transform`:
341+ Expression-generated columns use expression transform metadata:
318342
319343``` python
320344{
@@ -325,7 +349,10 @@ Expression-generated columns use `expr` metadata instead of `transform`:
325349    "maintain_on_append": True,
326350    "stale_on_source_update": True,
327351    "dtype": "float64",
328-     "expr": "price * qty",
352+     "transform": {
353+         "kind": "expression",
354+         "expression": "price * qty",
355+     },
329356}
330357```
331358
@@ -334,8 +361,7 @@ Expected maintenance behavior:
334361``` python
335362t.add_generated_column(
336363    "embedding_norm",
337-     source_columns=["embedding"],
338-     transform={"kind": "ndarray_row_reduction", "op": "norm", "ord": 2},
364+     transform=blosc2.transform.norm("embedding", axis=1, ord=2),
339365    dtype=blosc2.float64(),
340366    create_index=True,
341367)
@@ -378,7 +404,7 @@ ndarray column 'embedding'
378404 dtype : float32
379405 storage : NDArray shape=(1048576, 768), chunks=(..., ...), blocks=(..., ...)
380406 cbytes : ...
381- row stats : min(row_norm)=..., mean(row_norm)=..., max(row_norm)=...
407+ row stats : min(norm(axis=1))=..., mean(norm(axis=1))=..., max(norm(axis=1))=...
382408```
383409
384410Keep this opt-in so normal table display stays compact.
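
As a rough illustration of how the `row stats` line could be derived, the sketch below computes the three summary values from per-row norms. It uses a small NumPy array as a stand-in for the column data, and `row_norm_stats` is a hypothetical helper name, not part of the proposed API.

```python
import numpy as np


def row_norm_stats(column):
    """Summarize per-row L2 norms of a 2-D (nrows, dim) array (NumPy stand-in)."""
    norms = np.linalg.norm(column, axis=1)  # one norm per row
    return {
        "min": float(norms.min()),
        "mean": float(norms.mean()),
        "max": float(norms.max()),
    }


embedding = np.array([[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]], dtype=np.float32)
stats = row_norm_stats(embedding)
print(
    "row stats : min(norm(axis=1))={min:.1f}, "
    "mean(norm(axis=1))={mean:.1f}, max(norm(axis=1))={max:.1f}".format(**stats)
)
```

Since this pass touches every row, it supports the opt-in choice above for large columns.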
@@ -491,7 +517,7 @@ as structured subarrays.
491517- sort/groupby/index guard messages
492518- display of small and large item shapes
493519- `Column.ndim`, `size`, `item_shape`
494- - row-wise reduction helpers
520+ - axis-aware ndarray column reductions
495521- `add_generated_column()` with optional indexing
496522- generated-column auto-fill on `append()` / `extend()`
497523- generated-column staleness / refresh behavior after source-column mutation