Skip to content

Optional byte-level DBF reader with predicate push-down for large files (dbfread OOM on SIA-PA) #288

Description

@atiweb

Summary

I'd like to propose (and contribute) an optional fast path for reading DATASUS
DBF data — the step after DBC decompression — that filters records at the byte
level before materializing them
, instead of building one Python object per record
via dbfread.

On large files this is the difference between 12 min / ~14 GB RAM and
45 s / ~0.8 GB RAM, and it removes a class of OOM failures on SIA-PA.

PySUS currently depends on dbfread==2.0.7 (DBF→rows) and pyreaddbc>=2.0.4
(DBC→DBF). Both are fine for small/medium files but struggle at SIA-PA scale.
I'm opening this as a design discussion before investing in a PR.

Problem

  • dbfread instantiates a Python object per record. A single SIA-PA file
    (São Paulo) can hold 5–17M+ records and expand to ~1.1 GB DBF from
    ~110 MB DBC.
  • Loading it into pandas to then filter (e.g. by CID/disease) costs ~14 GB RAM
    and can OOM — even though the analyst usually wants a tiny subset (a handful of
    CIDs).
  • The whole-file-then-filter model pays full materialization cost regardless of
    selectivity.

Proposed approach: filter at the byte level, materialize only matches

DATASUS DBF records are fixed-width, so we can:

  1. Parse the DBF header once with struct → record count, record size, field
    descriptors (name / type / length).
  2. Compute fixed byte offsets per column.
  3. Read the record block as raw bytes and scan only the target column(s)
    against byte-encoded target values, collecting matching row indices — without
    building any per-row Python object.
  4. Materialize only the matched rows into a DataFrame / Arrow table.

This maps directly onto PySUS's existing filter parameters (disease/CID, UF,
year): the filter becomes a predicate push-down into the scan, so unmatched
rows are never materialized or written to Parquet.

Benchmark (our ETL: dbfread-based path vs. byte-level reader)

File Records Matched dbfread path byte-level + push-down
SIA-PA, São Paulo, 2023-01 17.2M 1,284 (neuro CIDs) 12 min, ~14 GB 45 s, ~0.8 GB

~16× faster, ~1/17 the memory, on real disseminação files.

Correctness details a general reader must handle (we'd bake these into tests)

  • Encoding is latin-1, never utf-8. A large share of files contain bytes that
    are invalid in utf-8; decoding must be latin-1 with errors="replace".
  • Embedded NUL bytes appear inside character fields; values need NUL stripping
    plus .strip().
  • CID prefix vs. exact match: a 3-char CID (G40) should match G400/G401/…,
    but a 4-char CID (E104) must match exactly — prefixing on E10 would wrongly
    pull in E109. The scan must distinguish the two.
  • Multi-token fields (SIM LINHAALINHAD, LINHAII) can pack several
    space-separated CIDs in one field; the scan must tokenize.

Related robustness note: isolating pyreaddbc SIGSEGV

Separately (possibly a pyreaddbc fix rather than PySUS): blast.c can SIGSEGV
on corrupt/malformed DBC headers, killing the host process with no try/except
recourse — one bad file aborts an entire batch. We run dbc2dbf in a subprocess
with retries to contain this. Happy to share the pattern if it's useful for overall
pipeline robustness.

Scope / positioning

  • This is a DATASUS-specialized fast path, not a general dbfread replacement:
    it reads character fields as strings and leaves type coercion to the downstream
    Parquet / type-inference step (which I believe matches PySUS's current pipeline).
  • The reference implementation uses a Python loop over record indices (already
    ~16× over dbfread). A NumPy np.frombuffer + structured-dtype version can
    vectorize the scan for another order of magnitude and slot cleanly into the
    Arrow/Parquet path.

What I can contribute

If the approach sounds good, I can open a PR with:

  • a self-contained reader module (no new deps beyond numpy/pandas);
  • golden tests asserting row-for-row equivalence against dbfread on a small
    corpus of real DBC across systems (SIH/SIM/SIA/CNES) and across years
    (layouts drift silently between years);
  • a reproducible benchmark harness.

Context: this comes out of building a full ETL over SIH/SIM/SIA/CNES for all 27
UFs, 2008–2025. We also accumulated a large catalog of DATASUS layout gotchas
(column drift across years, mislabeled names in third-party data dictionaries,
sentinel nulls, per-system sex/age/date encodings) that could feed the
documentation effort (#276) if of interest.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions