[CTable] Add offset indirection layer to ListArray for a better sort cost

## Background

CTable stores list-typed columns (declared as `blosc2.list(item_spec)`) in a `ListArray`
object backed by a `BatchArray` (or `ObjectArray`). The `BatchArray` stores rows in
compressed batches of fixed size (`batch_rows`). Individual rows are located using a
prefix-sum array over batch lengths.

Example schema:

    @dataclass
    class Food:
        id: str = blosc2.field(blosc2.string(max_length=64))
        ingredients: list = blosc2.field(blosc2.list(blosc2.string(max_length=32)))

---

## Current problem: sort_by() on tables with list columns is expensive

When `CTable.sort_by("id")` is called on a table that also has a `ListArray` column
(`ingredients`), the column must be physically rewritten in the new order. The relevant
code is in `ctable.py`, `sort_by()`:

    # inplace=True path (simplified)
    new_arr = ListArray(spec=col.spec)
    new_arr.extend((arr[int(pos)] for pos in sorted_pos), validate=False)
    new_arr.flush()
    self._cols[name] = new_arr

Each `arr[int(pos)]` call locates the batch that contains row `pos`, decompresses it,
and extracts the value. For a table with n rows and batch_rows=B, this causes up to
n batch decompressions (O(n) decompress + O(n) recompress into the new array).

For large tables with many/large list values this is very slow.

---

## Proposed solution: offset indirection array

Add a companion NDArray of int64 to each ListArray column. This array is called the
**offset** and maps `logical_physical_pos → true_physical_pos_in_batch_array`.

### Initial state (4 rows, no sort yet)

    id column (NDArray):         ["chocolate", "tortilla", "paella", "albondigas"]
    ingredients (BatchArray):    [choc_list,   tort_list,  pae_list, alb_list    ]  ← physical
    ingredients offset (NDArray):[0,           1,          2,        3           ]  ← identity

### After sort_by("id") — only the offset changes, BatchArray untouched

    id column (NDArray):         ["albondigas", "chocolate", "paella", "tortilla"]
    ingredients (BatchArray):    [choc_list,    tort_list,   pae_list, alb_list  ]  ← unchanged
    ingredients offset (NDArray):[3,            0,           2,        1         ]

Reading `ingredients` for row 1 ("chocolate"):
  1. Look up `offset[1]` → `0`
  2. Read `batch_array[0]` → `choc_list`

Sort cost is now O(n) NDArray write instead of O(n) BatchArray decompress+recompress.

### The offset is lazy

The BatchArray is only accessed when list data is actually needed. Operations that sort
or filter on other columns never touch the BatchArray at all. This means:

  - sort_by("id")            → permute offset only, BatchArray untouched
  - where("id == 'paella'")  → BatchArray untouched
  - head() / tail()          → only a few offset lookups, trivial cost
  - aggregate on other cols  → BatchArray untouched

The only time the fragmented-access cost is paid is a full scan of the list column
(e.g. to_arrow(), to_csv(), or iterating every row's ingredients). Even then the data
is correct; it just may access BatchArray batches out of sequential order.

### Compaction restores physical order

`CTable.copy(compact=True)` is the existing idiom to reclaim space and rewrite data.
When compact copy is triggered, the BatchArray should be rewritten following the current
offset order and the offset reset to identity:

    After copy(compact=True) of the sorted example above (and "chocolate" deleted):

    id column (NDArray):         ["albondigas", "paella", "tortilla"]
    ingredients (BatchArray):    [alb_list,     pae_list, tort_list ]  ← physically reordered
    ingredients offset (NDArray):[0,            1,        2         ]  ← identity again

---

## Impact on each CTable operation

| Operation          | Current behaviour (list cols)                         | With offset                                              |
|--------------------|-------------------------------------------------------|----------------------------------------------------------|
| sort_by            | new ListArray, extend one-by-one via generator        | permute offset NDArray only; BatchArray untouched        |
| append             | push to _pending_cells / _backend                     | same + append next_phys_idx to offset                    |
| extend             | batch-extend _pending_cells                           | same + append range of phys indices to offset            |
| __getitem__ (read) | valid_rows → phys_pos → batch_array[phys_pos]         | valid_rows → phys_pos → offset[phys_pos] → batch_array  |
| __setitem__ (write)| batch_array[phys_pos] = val                           | batch_array[offset[phys_pos]] = val                      |
| delete             | tombstone in _valid_rows, ListArray untouched         | same, offset untouched                                   |
| compact            | new ListArray with live rows                          | new ListArray in offset order, reset offset to identity  |
| copy(compact=True) | iterate live positions, fresh ListArray               | iterate via offset order, fresh identity offset          |
| copy(compact=False)| copy all rows including gaps                          | copy all rows (offset included as-is)                    |
| to_arrow / to_csv  | sequential scan, BatchArray cache-friendly            | permuted scan — consider copy(compact=True) first        |

---

## Implementation notes

### Where the offset lives

The offset NDArray is a peer of the other NDArray columns inside `CTable._cols` or can
be stored as internal state inside `ListArray` itself. Keeping it inside `ListArray`
is the cleaner option: ListArray owns its own permutation state and callers need no
changes to how they access it.

If kept inside `ListArray`:
  - `self._offset: blosc2.NDArray | None`  (None = identity, lazily allocated on first sort)
  - `_resolve(phys_pos)` → `self._offset[phys_pos] if self._offset is not None else phys_pos`

### Pending cells and flush()

`ListArray` for `storage="batch"` buffers rows in `self._pending_cells` before they are
flushed into the BatchArray. When a flush happens, the newly persisted rows get physical
indices `[prev_persisted_count, ..., prev_persisted_count + batch_size - 1]`. These
need to be appended to the offset array in the same order they were in `_pending_cells`.

As long as appends always go to the end (no insert-in-middle), the offset for new rows
is always the identity range and the invariant is easy to maintain.

### Persistence

For file-backed tables the offset needs to be stored alongside the BatchArray. A simple
option is a `.b2nd` file next to the `.b2b` file:

    _cols/ingredients.b2b      ← BatchArray data (unchanged)
    _cols/ingredients.offset.b2nd ← offset NDArray (new)

On open, if the offset file is missing it is treated as an identity permutation (backward
compatibility with existing archives).

### Backward compatibility

- Existing `.b2z` and `.b2d` archives without an offset file → treat as identity (no migration needed).
- Schema serialization does not need to change; the offset is operational state, not schema.

---

## Known trade-offs

1. **Read overhead after unsorted state.** A full sequential scan of a list column after
   sort (without compacting) accesses BatchArray batches out of order. For large `batch_rows`
   this means decompressing a full batch to read one element. Mitigation: call
   `copy(compact=True)` before bulk-reading list data if performance matters.

2. **Extra indirection on every list read.** One additional NDArray lookup per row.
   For small tables or point lookups this is negligible. For large scans the NDArray is
   accessed sequentially (offset itself is sequential even if the values it points to are
   not), so it stays cache-warm.

3. **_pending_cells bookkeeping.** The offset must be kept in sync with the pending buffer
   during flush. This is the trickiest part of the implementation; see notes above.

---

## Files to touch

- `src/blosc2/list_array.py`   — add `_offset` field, `_resolve()`, update all read/write paths
- `src/blosc2/ctable.py`       — update `sort_by()` to permute offset instead of rewriting ListArray;
                                  update `compact()` and `copy()` to rebuild offset
- `src/blosc2/ctable_storage.py` — persist/load `.offset.b2nd` sidecar file alongside `.b2b`
- `tests/ctable/`              — add tests: sort with list col, read after sort, compact resets offset,
                                  copy with compact, persistence round-trip, backward compat (no offset file)

---

## Acceptance criteria

- [ ] `sort_by` on a table with a `list` column does not rewrite the BatchArray
- [ ] Reading list values after sort returns values in the new sorted order
- [ ] `copy(compact=True)` produces a table where the offset is identity and BatchArray
      is physically in the correct order
- [ ] Existing `.b2z` files without an offset file open correctly (identity fallback)
- [ ] All existing ctable tests pass


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[CTable] Add offset indirection layer to ListArray for a better sort cost #635

Background

Current problem: sort_by() on tables with list columns is expensive

Proposed solution: offset indirection array

Initial state (4 rows, no sort yet)

After sort_by("id") — only the offset changes, BatchArray untouched

The offset is lazy

Compaction restores physical order

Impact on each CTable operation

Implementation notes

Where the offset lives

Pending cells and flush()

Persistence

Backward compatibility

Known trade-offs

Files to touch

Acceptance criteria

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Operation	Current behaviour (list cols)	With offset
sort_by	new ListArray, extend one-by-one via generator	permute offset NDArray only; BatchArray untouched
append	push to _pending_cells / _backend	same + append next_phys_idx to offset
extend	batch-extend _pending_cells	same + append range of phys indices to offset
getitem (read)	valid_rows → phys_pos → batch_array[phys_pos]	valid_rows → phys_pos → offset[phys_pos] → batch_array
setitem (write)	batch_array[phys_pos] = val	batch_array[offset[phys_pos]] = val
delete	tombstone in _valid_rows, ListArray untouched	same, offset untouched
compact	new ListArray with live rows	new ListArray in offset order, reset offset to identity
copy(compact=True)	iterate live positions, fresh ListArray	iterate via offset order, fresh identity offset
copy(compact=False)	copy all rows including gaps	copy all rows (offset included as-is)
to_arrow / to_csv	sequential scan, BatchArray cache-friendly	permuted scan — consider copy(compact=True) first

Uh oh!

Uh oh!

[CTable] Add offset indirection layer to ListArray for a better sort cost #635

Description

Background

Current problem: sort_by() on tables with list columns is expensive

Proposed solution: offset indirection array

Initial state (4 rows, no sort yet)

After sort_by("id") — only the offset changes, BatchArray untouched

The offset is lazy

Compaction restores physical order

Impact on each CTable operation

Implementation notes

Where the offset lives

Pending cells and flush()

Persistence

Backward compatibility

Known trade-offs

Files to touch

Acceptance criteria

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions