Background
CTable stores list-typed columns (declared as blosc2.list(item_spec)) in a ListArray
object backed by a BatchArray (or ObjectArray). The BatchArray stores rows in
compressed batches of fixed size (batch_rows). Individual rows are located using a
prefix-sum array over batch lengths.
Example schema:
@dataclass
class Food:
id: str = blosc2.field(blosc2.string(max_length=64))
ingredients: list = blosc2.field(blosc2.list(blosc2.string(max_length=32)))
Current problem: sort_by() on tables with list columns is expensive
When CTable.sort_by("id") is called on a table that also has a ListArray column
(ingredients), the column must be physically rewritten in the new order. The relevant
code is in ctable.py, sort_by():
# inplace=True path (simplified)
new_arr = ListArray(spec=col.spec)
new_arr.extend((arr[int(pos)] for pos in sorted_pos), validate=False)
new_arr.flush()
self._cols[name] = new_arr
Each arr[int(pos)] call locates the batch that contains row pos, decompresses it,
and extracts the value. For a table with n rows and batch_rows=B, this causes up to
n batch decompressions (O(n) decompress + O(n) recompress into the new array).
For large tables with many/large list values this is very slow.
Proposed solution: offset indirection array
Add a companion NDArray of int64 to each ListArray column. This array is called the
offset and maps logical_physical_pos → true_physical_pos_in_batch_array.
Initial state (4 rows, no sort yet)
id column (NDArray): ["chocolate", "tortilla", "paella", "albondigas"]
ingredients (BatchArray): [choc_list, tort_list, pae_list, alb_list ] ← physical
ingredients offset (NDArray):[0, 1, 2, 3 ] ← identity
After sort_by("id") — only the offset changes, BatchArray untouched
id column (NDArray): ["albondigas", "chocolate", "paella", "tortilla"]
ingredients (BatchArray): [choc_list, tort_list, pae_list, alb_list ] ← unchanged
ingredients offset (NDArray):[3, 0, 2, 1 ]
Reading ingredients for row 1 ("chocolate"):
- Look up
offset[1] → 0
- Read
batch_array[0] → choc_list
Sort cost is now O(n) NDArray write instead of O(n) BatchArray decompress+recompress.
The offset is lazy
The BatchArray is only accessed when list data is actually needed. Operations that sort
or filter on other columns never touch the BatchArray at all. This means:
- sort_by("id") → permute offset only, BatchArray untouched
- where("id == 'paella'") → BatchArray untouched
- head() / tail() → only a few offset lookups, trivial cost
- aggregate on other cols → BatchArray untouched
The only time the fragmented-access cost is paid is a full scan of the list column
(e.g. to_arrow(), to_csv(), or iterating every row's ingredients). Even then the data
is correct; it just may access BatchArray batches out of sequential order.
Compaction restores physical order
CTable.copy(compact=True) is the existing idiom to reclaim space and rewrite data.
When compact copy is triggered, the BatchArray should be rewritten following the current
offset order and the offset reset to identity:
After copy(compact=True) of the sorted example above (and "chocolate" deleted):
id column (NDArray): ["albondigas", "paella", "tortilla"]
ingredients (BatchArray): [alb_list, pae_list, tort_list ] ← physically reordered
ingredients offset (NDArray):[0, 1, 2 ] ← identity again
Impact on each CTable operation
| Operation |
Current behaviour (list cols) |
With offset |
| sort_by |
new ListArray, extend one-by-one via generator |
permute offset NDArray only; BatchArray untouched |
| append |
push to _pending_cells / _backend |
same + append next_phys_idx to offset |
| extend |
batch-extend _pending_cells |
same + append range of phys indices to offset |
| getitem (read) |
valid_rows → phys_pos → batch_array[phys_pos] |
valid_rows → phys_pos → offset[phys_pos] → batch_array |
| setitem (write) |
batch_array[phys_pos] = val |
batch_array[offset[phys_pos]] = val |
| delete |
tombstone in _valid_rows, ListArray untouched |
same, offset untouched |
| compact |
new ListArray with live rows |
new ListArray in offset order, reset offset to identity |
| copy(compact=True) |
iterate live positions, fresh ListArray |
iterate via offset order, fresh identity offset |
| copy(compact=False) |
copy all rows including gaps |
copy all rows (offset included as-is) |
| to_arrow / to_csv |
sequential scan, BatchArray cache-friendly |
permuted scan — consider copy(compact=True) first |
Implementation notes
Where the offset lives
The offset NDArray is a peer of the other NDArray columns inside CTable._cols or can
be stored as internal state inside ListArray itself. Keeping it inside ListArray
is the cleaner option: ListArray owns its own permutation state and callers need no
changes to how they access it.
If kept inside ListArray:
self._offset: blosc2.NDArray | None (None = identity, lazily allocated on first sort)
_resolve(phys_pos) → self._offset[phys_pos] if self._offset is not None else phys_pos
Pending cells and flush()
ListArray for storage="batch" buffers rows in self._pending_cells before they are
flushed into the BatchArray. When a flush happens, the newly persisted rows get physical
indices [prev_persisted_count, ..., prev_persisted_count + batch_size - 1]. These
need to be appended to the offset array in the same order they were in _pending_cells.
As long as appends always go to the end (no insert-in-middle), the offset for new rows
is always the identity range and the invariant is easy to maintain.
Persistence
For file-backed tables the offset needs to be stored alongside the BatchArray. A simple
option is a .b2nd file next to the .b2b file:
_cols/ingredients.b2b ← BatchArray data (unchanged)
_cols/ingredients.offset.b2nd ← offset NDArray (new)
On open, if the offset file is missing it is treated as an identity permutation (backward
compatibility with existing archives).
Backward compatibility
- Existing
.b2z and .b2d archives without an offset file → treat as identity (no migration needed).
- Schema serialization does not need to change; the offset is operational state, not schema.
Known trade-offs
-
Read overhead after unsorted state. A full sequential scan of a list column after
sort (without compacting) accesses BatchArray batches out of order. For large batch_rows
this means decompressing a full batch to read one element. Mitigation: call
copy(compact=True) before bulk-reading list data if performance matters.
-
Extra indirection on every list read. One additional NDArray lookup per row.
For small tables or point lookups this is negligible. For large scans the NDArray is
accessed sequentially (offset itself is sequential even if the values it points to are
not), so it stays cache-warm.
-
_pending_cells bookkeeping. The offset must be kept in sync with the pending buffer
during flush. This is the trickiest part of the implementation; see notes above.
Files to touch
src/blosc2/list_array.py — add _offset field, _resolve(), update all read/write paths
src/blosc2/ctable.py — update sort_by() to permute offset instead of rewriting ListArray;
update compact() and copy() to rebuild offset
src/blosc2/ctable_storage.py — persist/load .offset.b2nd sidecar file alongside .b2b
tests/ctable/ — add tests: sort with list col, read after sort, compact resets offset,
copy with compact, persistence round-trip, backward compat (no offset file)
Acceptance criteria
Background
CTable stores list-typed columns (declared as
blosc2.list(item_spec)) in aListArrayobject backed by a
BatchArray(orObjectArray). TheBatchArraystores rows incompressed batches of fixed size (
batch_rows). Individual rows are located using aprefix-sum array over batch lengths.
Example schema:
Current problem: sort_by() on tables with list columns is expensive
When
CTable.sort_by("id")is called on a table that also has aListArraycolumn(
ingredients), the column must be physically rewritten in the new order. The relevantcode is in
ctable.py,sort_by():Each
arr[int(pos)]call locates the batch that contains rowpos, decompresses it,and extracts the value. For a table with n rows and batch_rows=B, this causes up to
n batch decompressions (O(n) decompress + O(n) recompress into the new array).
For large tables with many/large list values this is very slow.
Proposed solution: offset indirection array
Add a companion NDArray of int64 to each ListArray column. This array is called the
offset and maps
logical_physical_pos → true_physical_pos_in_batch_array.Initial state (4 rows, no sort yet)
After sort_by("id") — only the offset changes, BatchArray untouched
Reading
ingredientsfor row 1 ("chocolate"):offset[1]→0batch_array[0]→choc_listSort cost is now O(n) NDArray write instead of O(n) BatchArray decompress+recompress.
The offset is lazy
The BatchArray is only accessed when list data is actually needed. Operations that sort
or filter on other columns never touch the BatchArray at all. This means:
The only time the fragmented-access cost is paid is a full scan of the list column
(e.g. to_arrow(), to_csv(), or iterating every row's ingredients). Even then the data
is correct; it just may access BatchArray batches out of sequential order.
Compaction restores physical order
CTable.copy(compact=True)is the existing idiom to reclaim space and rewrite data.When compact copy is triggered, the BatchArray should be rewritten following the current
offset order and the offset reset to identity:
Impact on each CTable operation
Implementation notes
Where the offset lives
The offset NDArray is a peer of the other NDArray columns inside
CTable._colsor canbe stored as internal state inside
ListArrayitself. Keeping it insideListArrayis the cleaner option: ListArray owns its own permutation state and callers need no
changes to how they access it.
If kept inside
ListArray:self._offset: blosc2.NDArray | None(None = identity, lazily allocated on first sort)_resolve(phys_pos)→self._offset[phys_pos] if self._offset is not None else phys_posPending cells and flush()
ListArrayforstorage="batch"buffers rows inself._pending_cellsbefore they areflushed into the BatchArray. When a flush happens, the newly persisted rows get physical
indices
[prev_persisted_count, ..., prev_persisted_count + batch_size - 1]. Theseneed to be appended to the offset array in the same order they were in
_pending_cells.As long as appends always go to the end (no insert-in-middle), the offset for new rows
is always the identity range and the invariant is easy to maintain.
Persistence
For file-backed tables the offset needs to be stored alongside the BatchArray. A simple
option is a
.b2ndfile next to the.b2bfile:On open, if the offset file is missing it is treated as an identity permutation (backward
compatibility with existing archives).
Backward compatibility
.b2zand.b2darchives without an offset file → treat as identity (no migration needed).Known trade-offs
Read overhead after unsorted state. A full sequential scan of a list column after
sort (without compacting) accesses BatchArray batches out of order. For large
batch_rowsthis means decompressing a full batch to read one element. Mitigation: call
copy(compact=True)before bulk-reading list data if performance matters.Extra indirection on every list read. One additional NDArray lookup per row.
For small tables or point lookups this is negligible. For large scans the NDArray is
accessed sequentially (offset itself is sequential even if the values it points to are
not), so it stays cache-warm.
_pending_cells bookkeeping. The offset must be kept in sync with the pending buffer
during flush. This is the trickiest part of the implementation; see notes above.
Files to touch
src/blosc2/list_array.py— add_offsetfield,_resolve(), update all read/write pathssrc/blosc2/ctable.py— updatesort_by()to permute offset instead of rewriting ListArray;update
compact()andcopy()to rebuild offsetsrc/blosc2/ctable_storage.py— persist/load.offset.b2ndsidecar file alongside.b2btests/ctable/— add tests: sort with list col, read after sort, compact resets offset,copy with compact, persistence round-trip, backward compat (no offset file)
Acceptance criteria
sort_byon a table with alistcolumn does not rewrite the BatchArraycopy(compact=True)produces a table where the offset is identity and BatchArrayis physically in the correct order
.b2zfiles without an offset file open correctly (identity fallback)