|
1 | | -# wip |
| 1 | +# klaw-dbase |
| 2 | + |
| 3 | +A Polars plugin for reading and writing dBase III files (.DBF), with built-in support for DATASUS compressed files (.DBC). |
| 4 | + |
| 5 | +## Features |
| 6 | + |
| 7 | +- **Polars IO plugin** with lazy scanning, projection pushdown, and predicate pushdown |
| 8 | +- **DATASUS .DBC support** for compressed Brazilian health system files |
| 9 | +- **Parallel reading** across multiple files |
| 10 | +- **Flexible encodings** (`cp1252`, `utf-8`, `iso-8859-1`, etc.) |
| 11 | +- **Globbing and directory scanning** |
| 12 | + |
| 13 | +## Installation |
| 14 | + |
| 15 | +```bash |
| 16 | +pip install klaw-dbase |
| 17 | +``` |
| 18 | + |
| 19 | +**Requirements:** Python 3.13+ |
| 20 | + |
| 21 | +## Quickstart |
| 22 | + |
| 23 | +### Read a .DBF file |
| 24 | + |
| 25 | +```python |
| 26 | +from klaw_dbase import read_dbase |
| 27 | + |
| 28 | +df = read_dbase('data.dbf') |
| 29 | +``` |
| 30 | + |
| 31 | +### Lazy scan for large files |
| 32 | + |
| 33 | +```python |
| 34 | +import polars as pl |
| 35 | +from klaw_dbase import scan_dbase |
| 36 | + |
| 37 | +lf = scan_dbase('data.dbf') |
| 38 | +result = lf.filter(pl.col('age') > 30).select('name', 'age').collect() |
| 39 | +``` |
| 40 | + |
| 41 | +### Write a DataFrame |
| 42 | + |
| 43 | +```python |
| 44 | +import polars as pl |
| 45 | +from klaw_dbase import write_dbase |
| 46 | + |
| 47 | +df = pl.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]}) |
| 48 | +write_dbase(df, 'output.dbf', overwrite=True) |
| 49 | +``` |
| 50 | + |
| 51 | +## DATASUS .DBC Files |
| 52 | + |
| 53 | +The primary use case for this library is handling DATASUS files from Brazil's public health system—both compressed (.DBC) and uncompressed (.DBF). |
| 54 | + |
| 55 | +### Read a compressed .DBC file |
| 56 | + |
| 57 | +```python |
| 58 | +from klaw_dbase import read_dbase |
| 59 | + |
| 60 | +# Auto-detected by .dbc extension |
| 61 | +df = read_dbase('RDPA2402.dbc') |
| 62 | + |
| 63 | +# Or explicitly |
| 64 | +df = read_dbase('RDPA2402.dbc', compressed=True) |
| 65 | +``` |
| 66 | + |
| 67 | +### Read multiple DATASUS files |
| 68 | + |
| 69 | +```python |
| 70 | +from klaw_dbase import read_dbase |
| 71 | + |
| 72 | +files = [ |
| 73 | + 'RDPA2401.dbc', |
| 74 | + 'RDPA2402.dbc', |
| 75 | + 'RDPA2403.dbc', |
| 76 | +] |
| 77 | +df = read_dbase(files) |
| 78 | +``` |
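For longer date ranges, the file list can be generated rather than typed out. The sketch below assumes the names follow the pattern visible in the example above (a fixed prefix followed by a two-digit year and a two-digit month); `monthly_files` is a hypothetical helper, not part of klaw-dbase.

```python
def monthly_files(prefix: str, year: int, months: range, ext: str = ".dbc") -> list[str]:
    """Build names like 'RDPA2401.dbc' from a prefix, a year, and a range of months."""
    yy = year % 100  # two-digit year, as in the example file names
    return [f"{prefix}{yy:02d}{month:02d}{ext}" for month in months]


# January through March 2024 (hypothetical helper, assumed naming pattern)
files = monthly_files("RDPA", 2024, range(1, 4))
print(files)
```

The resulting list can then be passed to `read_dbase(files)` in the same way as the hand-written list.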
| 79 | + |
| 80 | +### Lazy scan with glob patterns |
| 81 | + |
| 82 | +```python |
| 83 | +import polars as pl |
| 84 | +from klaw_dbase import scan_dbase |
| 85 | + |
| 86 | +lf = scan_dbase('data/RDPA24*.dbc') |
| 87 | +summary = lf.filter(pl.col('IDADE') >= 65).group_by('UF_RESID').agg(pl.len().alias('count')).collect() |
| 88 | +``` |
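Before scanning, it can help to preview which files a pattern would match. The sketch below uses Python's standard `glob` module on a throwaway directory with placeholder files. It assumes the plugin expands patterns in a shell-like way, and the plugin's own globbing may differ in detail.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Create empty placeholder files; the names are illustrative only.
    for name in ("RDPA2401.dbc", "RDPA2402.dbc", "RDPA2312.dbc", "notes.txt"):
        open(os.path.join(root, name), "w").close()

    # Same pattern shape as in the scan example: every 2024 file for one prefix.
    pattern = os.path.join(root, "RDPA24*.dbc")
    matched = sorted(os.path.basename(p) for p in glob.glob(pattern))

print(matched)
```

Only the two `RDPA24` files match; the 2023 file and the text file are skipped.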
| 89 | + |
| 90 | +### Get record count without loading data |
| 91 | + |
| 92 | +```python |
| 93 | +from klaw_dbase import get_dbase_record_count |
| 94 | + |
| 95 | +n = get_dbase_record_count('RDPA2402.dbc') |
| 96 | +``` |
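A record count can be obtained cheaply because the dBase header stores it as a little-endian 32-bit integer at byte offsets 4 to 7. The sketch below reads that field from a plain `.dbf` file using only the standard library. It illustrates the file format and is not necessarily how `get_dbase_record_count` is implemented. `dbf_record_count` is a hypothetical name, and compressed `.dbc` files are not covered here.

```python
import os
import struct
import tempfile


def dbf_record_count(path: str) -> int:
    """Read the record count from the header of an uncompressed .dbf file."""
    with open(path, "rb") as f:
        header = f.read(12)
    if len(header) < 12:
        raise ValueError("file too short to contain a dBase header")
    # Bytes 4-7: record count (uint32 LE); 8-9: header size; 10-11: record size.
    n_records, _header_size, _record_size = struct.unpack_from("<IHH", header, 4)
    return n_records


# Build a minimal 32-byte header for demonstration: version byte, date (YY MM DD),
# record count, header size, record size, then 20 reserved bytes.
header = struct.pack("<B3BIHH20x", 0x03, 124, 2, 1, 1234, 33, 1)

with tempfile.NamedTemporaryFile(suffix=".dbf", delete=False) as tmp:
    tmp.write(header + b"\r\x1a")  # header terminator and end-of-file marker
    tmp_path = tmp.name

count = dbf_record_count(tmp_path)
os.remove(tmp_path)
print(count)
```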
| 97 | + |
| 98 | +## API Reference |
| 99 | + |
| 100 | +### `read_dbase` |
| 101 | + |
| 102 | +```python |
| 103 | +read_dbase( |
| 104 | + sources, # path, list of paths, directory, or glob pattern |
| 105 | + *, |
| 106 | + columns=None, # columns to select (names or indices) |
| 107 | + n_rows=None, # limit number of rows |
| 108 | + row_index_name=None, # add row index column |
| 109 | + row_index_offset=0, |
| 110 | + rechunk=False, |
| 111 | + batch_size=8192, |
| 112 | + n_workers=None, # parallel readers (default: all CPUs) |
| 113 | + glob=True, |
| 114 | + encoding="cp1252", |
| 115 | + character_trim="begin_end", |
| 116 | + skip_deleted=True, |
| 117 | + validate_schema=True, |
| 118 | + compressed=False, # auto-detected for .dbc files |
| 119 | +) -> pl.DataFrame |
| 120 | +``` |
| 121 | + |
| 122 | +### `scan_dbase` |
| 123 | + |
| 124 | +```python |
| 125 | +scan_dbase( |
| 126 | + sources, |
| 127 | + *, |
| 128 | + batch_size=8192, |
| 129 | + n_workers=None, |
| 130 | + single_col_name=None, |
| 131 | + encoding="cp1252", |
| 132 | + character_trim="begin_end", |
| 133 | + skip_deleted=True, |
| 134 | + validate_schema=True, |
| 135 | + compressed=False, |
| 136 | + glob=True, |
| 137 | + progress=False, |
| 138 | +) -> pl.LazyFrame |
| 139 | +``` |
| 140 | + |
| 141 | +### `write_dbase` |
| 142 | + |
| 143 | +```python |
| 144 | +write_dbase( |
| 145 | + df, # polars DataFrame |
| 146 | + dest, # path or file-like object |
| 147 | + *, |
| 148 | + batch_size=None, |
| 149 | + encoding="cp1252", |
| 150 | + overwrite=False, |
| 151 | +) -> None |
| 152 | +``` |
| 153 | + |
| 154 | +### `get_dbase_record_count` |
| 155 | + |
| 156 | +```python |
| 157 | +get_dbase_record_count(path) -> int |
| 158 | +``` |
| 159 | + |
| 160 | +## Encodings |
| 161 | + |
| 162 | +Common encodings for dBase files: |
| 163 | + |
| 164 | +| Encoding | Use case | |
| 165 | +| ------------- | --------------------------------------------- | |
| 166 | +| `cp1252` | Windows Latin-1 (default, common for DATASUS) | |
| 167 | +| `utf-8` | Unicode | |
| 168 | +| `iso-8859-1` | Latin-1 | |
| 169 | +| `iso-8859-15` | Latin-9 (Euro sign) | |
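The choice of encoding matters because the same bytes decode differently under each code page. The sketch below uses only Python's built-in codecs on hand-written byte strings (illustrative values, not taken from a real DATASUS file) to show where the encodings in the table agree and where they diverge.

```python
# "São Paulo" as it would be stored in a single-byte Western European encoding.
raw = b"S\xe3o Paulo"

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("iso-8859-1")  # same result for this byte range

# The same bytes are not valid UTF-8, so decoding with the wrong encoding fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# The Euro sign lives at different byte values depending on the code page.
euro_cp1252 = b"\x80".decode("cp1252")
euro_latin9 = b"\xa4".decode("iso-8859-15")
currency_latin1 = b"\xa4".decode("iso-8859-1")  # generic currency sign instead

print(as_cp1252, as_latin1, utf8_ok)
print(euro_cp1252, euro_latin9, currency_latin1)
```

If accented characters come out wrong after a read, trying a different `encoding` value from the table is a reasonable first step.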
| 170 | + |
| 171 | +## Error Handling |
| 172 | + |
| 173 | +| Exception | When raised | |
| 174 | +| ---------------- | ------------------------------------------ | |
| 175 | +| `DbaseError` | Corrupted or invalid dBase file | |
| 176 | +| `DbcError` | Compression-specific problems | |
| 177 | +| `EmptySources` | No input files or empty DataFrame on write | |
| 178 | +| `SchemaMismatch` | Multiple files with incompatible schemas | |
| 179 | +| `EncodingError` | Invalid or unsupported encoding | |
| 180 | + |
| 181 | +```python |
| 182 | +from klaw_dbase import DbaseError, DbcError, EmptySources, read_dbase |
| 183 | + |
| 184 | +try: |
| 185 | + df = read_dbase('corrupted.dbf') |
| 186 | +except DbaseError as e: |
| 187 | + print(f'Failed to read: {e}') |
| 188 | +``` |
| 189 | + |
| 190 | +## License |
| 191 | + |
| 192 | +MIT |