Optional byte-level DBF reader with predicate push-down for large files (dbfread OOM on SIA-PA)

## Summary

I'd like to propose (and contribute) an **optional fast path** for reading DATASUS
DBF data — the step after DBC decompression — that **filters records at the byte
level before materializing them**, instead of building one Python object per record
via `dbfread`.

On large files this is the difference between **12 min / ~14 GB RAM** and
**45 s / ~0.8 GB RAM**, and it removes a class of OOM failures on SIA-PA.

PySUS currently depends on `dbfread==2.0.7` (DBF→rows) and `pyreaddbc>=2.0.4`
(DBC→DBF). Both are fine for small/medium files but struggle at SIA-PA scale.
I'm opening this as a design discussion before investing in a PR.

## Problem

- `dbfread` instantiates a Python object per record. A single SIA-PA file
  (São Paulo) can hold **5–17M+ records** and expand to **~1.1 GB DBF** from
  ~110 MB DBC.
- Loading it into pandas to *then* filter (e.g. by CID/disease) costs ~14 GB RAM
  and can OOM — even though the analyst usually wants a tiny subset (a handful of
  CIDs).
- The whole-file-then-filter model pays full materialization cost regardless of
  selectivity.

## Proposed approach: filter at the byte level, materialize only matches

DATASUS DBF records are fixed-width, so we can:

1. Parse the DBF header once with `struct` → record count, record size, field
   descriptors (name / type / length).
2. Compute fixed byte offsets per column.
3. Read the record block as raw bytes and scan **only the target column(s)**
   against byte-encoded target values, collecting matching row indices — without
   building any per-row Python object.
4. Materialize only the matched rows into a DataFrame / Arrow table.

This maps directly onto PySUS's existing filter parameters (disease/CID, UF,
year): the filter becomes a **predicate push-down** into the scan, so unmatched
rows are never materialized or written to Parquet.

## Benchmark (our ETL: dbfread-based path vs. byte-level reader)

| File | Records | Matched | dbfread path | byte-level + push-down |
|------|--------:|--------:|-------------:|-----------------------:|
| SIA-PA, São Paulo, 2023-01 | 17.2M | 1,284 (neuro CIDs) | 12 min, ~14 GB | **45 s, ~0.8 GB** |

~16× faster, ~1/17 the memory, on real disseminação files.

## Correctness details a general reader must handle (we'd bake these into tests)

- **Encoding is latin-1, never utf-8.** A large share of files contain bytes that
  are invalid in utf-8; decoding must be `latin-1` with `errors="replace"`.
- **Embedded NUL bytes** appear inside character fields; values need NUL stripping
  plus `.strip()`.
- **CID prefix vs. exact match:** a 3-char CID (`G40`) should match `G400/G401/…`,
  but a 4-char CID (`E104`) must match exactly — prefixing on `E10` would wrongly
  pull in `E109`. The scan must distinguish the two.
- **Multi-token fields** (SIM `LINHAA`–`LINHAD`, `LINHAII`) can pack several
  space-separated CIDs in one field; the scan must tokenize.

## Related robustness note: isolating `pyreaddbc` SIGSEGV

Separately (possibly a `pyreaddbc` fix rather than PySUS): `blast.c` can **SIGSEGV**
on corrupt/malformed DBC headers, killing the host process with no `try/except`
recourse — one bad file aborts an entire batch. We run `dbc2dbf` in a subprocess
with retries to contain this. Happy to share the pattern if it's useful for overall
pipeline robustness.

## Scope / positioning

- This is a **DATASUS-specialized fast path**, not a general `dbfread` replacement:
  it reads character fields as strings and leaves type coercion to the downstream
  Parquet / type-inference step (which I believe matches PySUS's current pipeline).
- The reference implementation uses a Python loop over record indices (already
  ~16× over dbfread). A **NumPy `np.frombuffer` + structured-dtype** version can
  vectorize the scan for another order of magnitude and slot cleanly into the
  Arrow/Parquet path.

## What I can contribute

If the approach sounds good, I can open a PR with:
- a self-contained reader module (no new deps beyond numpy/pandas);
- **golden tests** asserting row-for-row equivalence against `dbfread` on a small
  corpus of real DBC across systems (SIH/SIM/SIA/CNES) **and across years**
  (layouts drift silently between years);
- a reproducible benchmark harness.

Context: this comes out of building a full ETL over SIH/SIM/SIA/CNES for all 27
UFs, 2008–2025. We also accumulated a large catalog of DATASUS layout gotchas
(column drift across years, mislabeled names in third-party data dictionaries,
sentinel nulls, per-system sex/age/date encodings) that could feed the
documentation effort (#276) if of interest.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Optional byte-level DBF reader with predicate push-down for large files (dbfread OOM on SIA-PA) #288

Summary

Problem

Proposed approach: filter at the byte level, materialize only matches

Benchmark (our ETL: dbfread-based path vs. byte-level reader)

Correctness details a general reader must handle (we'd bake these into tests)

Related robustness note: isolating `pyreaddbc` SIGSEGV

Scope / positioning

What I can contribute

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Uh oh!

Optional byte-level DBF reader with predicate push-down for large files (dbfread OOM on SIA-PA) #288

Description

Summary

Problem

Proposed approach: filter at the byte level, materialize only matches

Benchmark (our ETL: dbfread-based path vs. byte-level reader)

Correctness details a general reader must handle (we'd bake these into tests)

Related robustness note: isolating pyreaddbc SIGSEGV

Scope / positioning

What I can contribute

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Related robustness note: isolating `pyreaddbc` SIGSEGV