Summary
I'd like to propose (and contribute) an optional fast path for reading DATASUS
DBF data — the step after DBC decompression — that filters records at the byte
level before materializing them, instead of building one Python object per record
via dbfread.
On large files this is the difference between 12 min / ~14 GB RAM and
45 s / ~0.8 GB RAM, and it removes a class of OOM failures on SIA-PA.
PySUS currently depends on dbfread==2.0.7 (DBF→rows) and pyreaddbc>=2.0.4
(DBC→DBF). Both are fine for small/medium files but struggle at SIA-PA scale.
I'm opening this as a design discussion before investing in a PR.
Problem
dbfread instantiates a Python object per record. A single SIA-PA file
(São Paulo) can hold 5–17M+ records and expand to ~1.1 GB DBF from
~110 MB DBC.
- Loading it into pandas to then filter (e.g. by CID/disease) costs ~14 GB RAM
and can OOM — even though the analyst usually wants a tiny subset (a handful of
CIDs).
- The whole-file-then-filter model pays full materialization cost regardless of
selectivity.
Proposed approach: filter at the byte level, materialize only matches
DATASUS DBF records are fixed-width, so we can:
- Parse the DBF header once with
struct → record count, record size, field
descriptors (name / type / length).
- Compute fixed byte offsets per column.
- Read the record block as raw bytes and scan only the target column(s)
against byte-encoded target values, collecting matching row indices — without
building any per-row Python object.
- Materialize only the matched rows into a DataFrame / Arrow table.
This maps directly onto PySUS's existing filter parameters (disease/CID, UF,
year): the filter becomes a predicate push-down into the scan, so unmatched
rows are never materialized or written to Parquet.
Benchmark (our ETL: dbfread-based path vs. byte-level reader)
| File |
Records |
Matched |
dbfread path |
byte-level + push-down |
| SIA-PA, São Paulo, 2023-01 |
17.2M |
1,284 (neuro CIDs) |
12 min, ~14 GB |
45 s, ~0.8 GB |
~16× faster, ~1/17 the memory, on real disseminação files.
Correctness details a general reader must handle (we'd bake these into tests)
- Encoding is latin-1, never utf-8. A large share of files contain bytes that
are invalid in utf-8; decoding must be latin-1 with errors="replace".
- Embedded NUL bytes appear inside character fields; values need NUL stripping
plus .strip().
- CID prefix vs. exact match: a 3-char CID (
G40) should match G400/G401/…,
but a 4-char CID (E104) must match exactly — prefixing on E10 would wrongly
pull in E109. The scan must distinguish the two.
- Multi-token fields (SIM
LINHAA–LINHAD, LINHAII) can pack several
space-separated CIDs in one field; the scan must tokenize.
Related robustness note: isolating pyreaddbc SIGSEGV
Separately (possibly a pyreaddbc fix rather than PySUS): blast.c can SIGSEGV
on corrupt/malformed DBC headers, killing the host process with no try/except
recourse — one bad file aborts an entire batch. We run dbc2dbf in a subprocess
with retries to contain this. Happy to share the pattern if it's useful for overall
pipeline robustness.
Scope / positioning
- This is a DATASUS-specialized fast path, not a general
dbfread replacement:
it reads character fields as strings and leaves type coercion to the downstream
Parquet / type-inference step (which I believe matches PySUS's current pipeline).
- The reference implementation uses a Python loop over record indices (already
~16× over dbfread). A NumPy np.frombuffer + structured-dtype version can
vectorize the scan for another order of magnitude and slot cleanly into the
Arrow/Parquet path.
What I can contribute
If the approach sounds good, I can open a PR with:
- a self-contained reader module (no new deps beyond numpy/pandas);
- golden tests asserting row-for-row equivalence against
dbfread on a small
corpus of real DBC across systems (SIH/SIM/SIA/CNES) and across years
(layouts drift silently between years);
- a reproducible benchmark harness.
Context: this comes out of building a full ETL over SIH/SIM/SIA/CNES for all 27
UFs, 2008–2025. We also accumulated a large catalog of DATASUS layout gotchas
(column drift across years, mislabeled names in third-party data dictionaries,
sentinel nulls, per-system sex/age/date encodings) that could feed the
documentation effort (#276) if of interest.
Summary
I'd like to propose (and contribute) an optional fast path for reading DATASUS
DBF data — the step after DBC decompression — that filters records at the byte
level before materializing them, instead of building one Python object per record
via
dbfread.On large files this is the difference between 12 min / ~14 GB RAM and
45 s / ~0.8 GB RAM, and it removes a class of OOM failures on SIA-PA.
PySUS currently depends on
dbfread==2.0.7(DBF→rows) andpyreaddbc>=2.0.4(DBC→DBF). Both are fine for small/medium files but struggle at SIA-PA scale.
I'm opening this as a design discussion before investing in a PR.
Problem
dbfreadinstantiates a Python object per record. A single SIA-PA file(São Paulo) can hold 5–17M+ records and expand to ~1.1 GB DBF from
~110 MB DBC.
and can OOM — even though the analyst usually wants a tiny subset (a handful of
CIDs).
selectivity.
Proposed approach: filter at the byte level, materialize only matches
DATASUS DBF records are fixed-width, so we can:
struct→ record count, record size, fielddescriptors (name / type / length).
against byte-encoded target values, collecting matching row indices — without
building any per-row Python object.
This maps directly onto PySUS's existing filter parameters (disease/CID, UF,
year): the filter becomes a predicate push-down into the scan, so unmatched
rows are never materialized or written to Parquet.
Benchmark (our ETL: dbfread-based path vs. byte-level reader)
~16× faster, ~1/17 the memory, on real disseminação files.
Correctness details a general reader must handle (we'd bake these into tests)
are invalid in utf-8; decoding must be
latin-1witherrors="replace".plus
.strip().G40) should matchG400/G401/…,but a 4-char CID (
E104) must match exactly — prefixing onE10would wronglypull in
E109. The scan must distinguish the two.LINHAA–LINHAD,LINHAII) can pack severalspace-separated CIDs in one field; the scan must tokenize.
Related robustness note: isolating
pyreaddbcSIGSEGVSeparately (possibly a
pyreaddbcfix rather than PySUS):blast.ccan SIGSEGVon corrupt/malformed DBC headers, killing the host process with no
try/exceptrecourse — one bad file aborts an entire batch. We run
dbc2dbfin a subprocesswith retries to contain this. Happy to share the pattern if it's useful for overall
pipeline robustness.
Scope / positioning
dbfreadreplacement:it reads character fields as strings and leaves type coercion to the downstream
Parquet / type-inference step (which I believe matches PySUS's current pipeline).
~16× over dbfread). A NumPy
np.frombuffer+ structured-dtype version canvectorize the scan for another order of magnitude and slot cleanly into the
Arrow/Parquet path.
What I can contribute
If the approach sounds good, I can open a PR with:
dbfreadon a smallcorpus of real DBC across systems (SIH/SIM/SIA/CNES) and across years
(layouts drift silently between years);
Context: this comes out of building a full ETL over SIH/SIM/SIA/CNES for all 27
UFs, 2008–2025. We also accumulated a large catalog of DATASUS layout gotchas
(column drift across years, mislabeled names in third-party data dictionaries,
sentinel nulls, per-system sex/age/date encodings) that could feed the
documentation effort (#276) if of interest.