Skip to content

[CTable] Add offset indirection layer to ListArray for a better sort cost #635

@Jacc4224

Description

@Jacc4224

Background

CTable stores list-typed columns (declared as blosc2.list(item_spec)) in a ListArray
object backed by a BatchArray (or ObjectArray). The BatchArray stores rows in
compressed batches of fixed size (batch_rows). Individual rows are located using a
prefix-sum array over batch lengths.

Example schema:

@dataclass
class Food:
    id: str = blosc2.field(blosc2.string(max_length=64))
    ingredients: list = blosc2.field(blosc2.list(blosc2.string(max_length=32)))

Current problem: sort_by() on tables with list columns is expensive

When CTable.sort_by("id") is called on a table that also has a ListArray column
(ingredients), the column must be physically rewritten in the new order. The relevant
code is in ctable.py, sort_by():

# inplace=True path (simplified)
new_arr = ListArray(spec=col.spec)
new_arr.extend((arr[int(pos)] for pos in sorted_pos), validate=False)
new_arr.flush()
self._cols[name] = new_arr

Each arr[int(pos)] call locates the batch that contains row pos, decompresses it,
and extracts the value. For a table with n rows and batch_rows=B, this causes up to
n batch decompressions (O(n) decompress + O(n) recompress into the new array).

For large tables with many/large list values this is very slow.


Proposed solution: offset indirection array

Add a companion NDArray of int64 to each ListArray column. This array is called the
offset and maps logical_physical_pos → true_physical_pos_in_batch_array.

Initial state (4 rows, no sort yet)

id column (NDArray):         ["chocolate", "tortilla", "paella", "albondigas"]
ingredients (BatchArray):    [choc_list,   tort_list,  pae_list, alb_list    ]  ← physical
ingredients offset (NDArray):[0,           1,          2,        3           ]  ← identity

After sort_by("id") — only the offset changes, BatchArray untouched

id column (NDArray):         ["albondigas", "chocolate", "paella", "tortilla"]
ingredients (BatchArray):    [choc_list,    tort_list,   pae_list, alb_list  ]  ← unchanged
ingredients offset (NDArray):[3,            0,           2,        1         ]

Reading ingredients for row 1 ("chocolate"):

  1. Look up offset[1]0
  2. Read batch_array[0]choc_list

Sort cost is now O(n) NDArray write instead of O(n) BatchArray decompress+recompress.

The offset is lazy

The BatchArray is only accessed when list data is actually needed. Operations that sort
or filter on other columns never touch the BatchArray at all. This means:

  • sort_by("id") → permute offset only, BatchArray untouched
  • where("id == 'paella'") → BatchArray untouched
  • head() / tail() → only a few offset lookups, trivial cost
  • aggregate on other cols → BatchArray untouched

The only time the fragmented-access cost is paid is a full scan of the list column
(e.g. to_arrow(), to_csv(), or iterating every row's ingredients). Even then the data
is correct; it just may access BatchArray batches out of sequential order.

Compaction restores physical order

CTable.copy(compact=True) is the existing idiom to reclaim space and rewrite data.
When compact copy is triggered, the BatchArray should be rewritten following the current
offset order and the offset reset to identity:

After copy(compact=True) of the sorted example above (and "chocolate" deleted):

id column (NDArray):         ["albondigas", "paella", "tortilla"]
ingredients (BatchArray):    [alb_list,     pae_list, tort_list ]  ← physically reordered
ingredients offset (NDArray):[0,            1,        2         ]  ← identity again

Impact on each CTable operation

Operation Current behaviour (list cols) With offset
sort_by new ListArray, extend one-by-one via generator permute offset NDArray only; BatchArray untouched
append push to _pending_cells / _backend same + append next_phys_idx to offset
extend batch-extend _pending_cells same + append range of phys indices to offset
getitem (read) valid_rows → phys_pos → batch_array[phys_pos] valid_rows → phys_pos → offset[phys_pos] → batch_array
setitem (write) batch_array[phys_pos] = val batch_array[offset[phys_pos]] = val
delete tombstone in _valid_rows, ListArray untouched same, offset untouched
compact new ListArray with live rows new ListArray in offset order, reset offset to identity
copy(compact=True) iterate live positions, fresh ListArray iterate via offset order, fresh identity offset
copy(compact=False) copy all rows including gaps copy all rows (offset included as-is)
to_arrow / to_csv sequential scan, BatchArray cache-friendly permuted scan — consider copy(compact=True) first

Implementation notes

Where the offset lives

The offset NDArray is a peer of the other NDArray columns inside CTable._cols or can
be stored as internal state inside ListArray itself. Keeping it inside ListArray
is the cleaner option: ListArray owns its own permutation state and callers need no
changes to how they access it.

If kept inside ListArray:

  • self._offset: blosc2.NDArray | None (None = identity, lazily allocated on first sort)
  • _resolve(phys_pos)self._offset[phys_pos] if self._offset is not None else phys_pos

Pending cells and flush()

ListArray for storage="batch" buffers rows in self._pending_cells before they are
flushed into the BatchArray. When a flush happens, the newly persisted rows get physical
indices [prev_persisted_count, ..., prev_persisted_count + batch_size - 1]. These
need to be appended to the offset array in the same order they were in _pending_cells.

As long as appends always go to the end (no insert-in-middle), the offset for new rows
is always the identity range and the invariant is easy to maintain.

Persistence

For file-backed tables the offset needs to be stored alongside the BatchArray. A simple
option is a .b2nd file next to the .b2b file:

_cols/ingredients.b2b      ← BatchArray data (unchanged)
_cols/ingredients.offset.b2nd ← offset NDArray (new)

On open, if the offset file is missing it is treated as an identity permutation (backward
compatibility with existing archives).

Backward compatibility

  • Existing .b2z and .b2d archives without an offset file → treat as identity (no migration needed).
  • Schema serialization does not need to change; the offset is operational state, not schema.

Known trade-offs

  1. Read overhead after unsorted state. A full sequential scan of a list column after
    sort (without compacting) accesses BatchArray batches out of order. For large batch_rows
    this means decompressing a full batch to read one element. Mitigation: call
    copy(compact=True) before bulk-reading list data if performance matters.

  2. Extra indirection on every list read. One additional NDArray lookup per row.
    For small tables or point lookups this is negligible. For large scans the NDArray is
    accessed sequentially (offset itself is sequential even if the values it points to are
    not), so it stays cache-warm.

  3. _pending_cells bookkeeping. The offset must be kept in sync with the pending buffer
    during flush. This is the trickiest part of the implementation; see notes above.


Files to touch

  • src/blosc2/list_array.py — add _offset field, _resolve(), update all read/write paths
  • src/blosc2/ctable.py — update sort_by() to permute offset instead of rewriting ListArray;
    update compact() and copy() to rebuild offset
  • src/blosc2/ctable_storage.py — persist/load .offset.b2nd sidecar file alongside .b2b
  • tests/ctable/ — add tests: sort with list col, read after sort, compact resets offset,
    copy with compact, persistence round-trip, backward compat (no offset file)

Acceptance criteria

  • sort_by on a table with a list column does not rewrite the BatchArray
  • Reading list values after sort returns values in the new sorted order
  • copy(compact=True) produces a table where the offset is identity and BatchArray
    is physically in the correct order
  • Existing .b2z files without an offset file open correctly (identity fallback)
  • All existing ctable tests pass

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions