Commit a231048

feat: add readme docs for klaw-dbase
1 parent 4f15407 commit a231048

2 files changed

Lines changed: 203 additions & 12 deletions

.github/workflows/release.yml

Lines changed: 11 additions & 11 deletions
@@ -17,21 +17,21 @@ permissions:
1717

1818
env:
1919
PACKAGE_NAME: klaw-dbase
20-
PYTHON_VERSION: "3.12"
20+
PYTHON_VERSION: "3.13"
2121

2222
jobs:
2323
linux:
2424
runs-on: ${{ matrix.platform.runner }}
2525
strategy:
2626
matrix:
2727
platform:
28-
- runner: blacksmith-4vcpu-ubuntu-2404
28+
- runner: depot-ubuntu-24.04-4
2929
target: x86_64
30-
- runner: blacksmith-4vcpu-ubuntu-2404
30+
- runner: depot-ubuntu-24.04-4
3131
target: x86
32-
- runner: blacksmith-4vcpu-ubuntu-2404
32+
- runner: depot-ubuntu-24.04-4
3333
target: aarch64
34-
- runner: blacksmith-4vcpu-ubuntu-2404
34+
- runner: depot-ubuntu-24.04-4
3535
target: armv7
3636

3737
defaults:
@@ -67,9 +67,9 @@ jobs:
6767
strategy:
6868
matrix:
6969
platform:
70-
- runner: blacksmith-4vcpu-ubuntu-2404
70+
- runner: depot-ubuntu-24.04-4
7171
target: x86_64
72-
- runner: blacksmith-4vcpu-ubuntu-2404
72+
- runner: depot-ubuntu-24.04-4
7373
target: aarch64
7474

7575
defaults:
@@ -105,9 +105,9 @@ jobs:
105105
strategy:
106106
matrix:
107107
platform:
108-
- runner: blacksmith-4vcpu-windows-2025
108+
- runner: depot-windows-2022-4
109109
target: x64
110-
- runner: blacksmith-4vcpu-windows-2025
110+
- runner: depot-windows-2022-4
111111
target: x86
112112

113113
defaults:
@@ -175,7 +175,7 @@ jobs:
175175
path: workspaces/rust/klaw-dbase/dist
176176

177177
sdist:
178-
runs-on: blacksmith-4vcpu-ubuntu-2404
178+
runs-on: depot-ubuntu-24.04-4
179179

180180
defaults:
181181
run:
@@ -199,7 +199,7 @@ jobs:
199199

200200
publish:
201201
name: Publish to PyPI
202-
runs-on: blacksmith-4vcpu-ubuntu-2404
202+
runs-on: depot-ubuntu-24.04-4
203203
needs: [linux, musllinux, windows, macos, sdist]
204204
if: startsWith(github.ref, 'refs/tags/') && !inputs.dry_run
205205

Lines changed: 192 additions & 1 deletion
@@ -1 +1,192 @@
1-
# wip
1+
# klaw-dbase
2+
3+
A Polars plugin for reading and writing dBase III files (.DBF), with built-in support for DATASUS compressed files (.DBC).
4+
5+
## Features
6+
7+
- **Polars IO plugin** with lazy scanning, projection pushdown, and predicate pushdown
8+
- **DATASUS .DBC support** for compressed Brazilian health system files
9+
- **Parallel reading** across multiple files
10+
- **Flexible encodings** (`cp1252`, `utf-8`, `iso-8859-1`, etc.)
11+
- **Globbing and directory scanning**
12+
13+
## Installation
14+
15+
```bash
16+
pip install klaw-dbase
17+
```
18+
19+
**Requirements:** Python 3.13+
20+
21+
## Quickstart
22+
23+
### Read a .DBF file
24+
25+
```python
26+
from klaw_dbase import read_dbase
27+
28+
df = read_dbase('data.dbf')
29+
```
30+
31+
### Lazy scan for large files
32+
33+
```python
34+
import polars as pl
35+
from klaw_dbase import scan_dbase
36+
37+
lf = scan_dbase('data.dbf')
38+
result = lf.filter(pl.col('age') > 30).select('name', 'age').collect()
39+
```
40+
41+
### Write a DataFrame
42+
43+
```python
44+
import polars as pl
45+
from klaw_dbase import write_dbase
46+
47+
df = pl.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})
48+
write_dbase(df, 'output.dbf', overwrite=True)
49+
```
50+
51+
## DATASUS .DBC Files
52+
53+
The primary use case for this library is handling DATASUS files from Brazil's public health system—both compressed (.DBC) and uncompressed (.DBF).
54+
55+
### Read a compressed .DBC file
56+
57+
```python
58+
from klaw_dbase import read_dbase
59+
60+
# Auto-detected by .dbc extension
61+
df = read_dbase('RDPA2402.dbc')
62+
63+
# Or explicitly
64+
df = read_dbase('RDPA2402.dbc', compressed=True)
65+
```
66+
67+
### Read multiple DATASUS files
68+
69+
```python
70+
from klaw_dbase import read_dbase
71+
72+
files = [
73+
'RDPA2401.dbc',
74+
'RDPA2402.dbc',
75+
'RDPA2403.dbc',
76+
]
77+
df = read_dbase(files)
78+
```
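
The file list above can also be generated rather than typed out. The sketch below assumes only the naming pattern visible in these examples (a fixed prefix followed by a two-digit year and a two-digit month); `monthly_files` is a hypothetical helper for illustration, not part of klaw-dbase.

```python
from typing import Iterable, List


def monthly_files(prefix: str, year: int, months: Iterable[int]) -> List[str]:
    """Build names like 'RDPA2401.dbc' from a prefix, 2-digit year, and months.

    Hypothetical helper: it only mirrors the file names used in the README examples.
    """
    return [f"{prefix}{year:02d}{month:02d}.dbc" for month in months]


# First quarter of 2024 for the 'RDPA' prefix used in the examples.
files = monthly_files("RDPA", 24, range(1, 4))
print(files)
```

The resulting list can be passed to `read_dbase(files)` exactly like the hand-written one.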
79+
80+
### Lazy scan with glob patterns
81+
82+
```python
83+
import polars as pl
84+
from klaw_dbase import scan_dbase
85+
86+
lf = scan_dbase('data/RDPA24*.dbc')
87+
summary = lf.filter(pl.col('IDADE') >= 65).group_by('UF_RESID').agg(pl.len().alias('count')).collect()
88+
```
89+
90+
### Get record count without loading data
91+
92+
```python
93+
from klaw_dbase import get_dbase_record_count
94+
95+
n = get_dbase_record_count('RDPA2402.dbc')
96+
```
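
For background, a count can be obtained cheaply because the dBase III header stores the number of records as a little-endian unsigned 32-bit integer at byte offset 4. The following is a minimal standard-library sketch of that idea for an uncompressed header. It is not necessarily how `get_dbase_record_count` is implemented, and it does not cover the `.dbc` compressed container; `dbf_record_count` and `demo_header` are illustrative names.

```python
import struct


def dbf_record_count(header: bytes) -> int:
    """Return the record count stored in a dBase III header.

    Bytes 4-7 hold the number of records as a little-endian uint32.
    """
    if len(header) < 8:
        raise ValueError("header too short to contain a record count")
    (count,) = struct.unpack_from("<I", header, 4)
    return count


# Hand-built 32-byte header for demonstration: version byte 0x03,
# last-update date (YY, MM, DD), record count, header length,
# record length, then 20 reserved bytes.
demo_header = struct.pack("<B3BIHH20x", 0x03, 24, 2, 1, 1234, 33, 1)

print(dbf_record_count(demo_header))
```

With a real file, the same function could be applied to the first 32 bytes read from disk.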
97+
98+
## API Reference
99+
100+
### `read_dbase`
101+
102+
```python
103+
read_dbase(
104+
sources, # path, list of paths, directory, or glob pattern
105+
*,
106+
columns=None, # columns to select (names or indices)
107+
n_rows=None, # limit number of rows
108+
row_index_name=None, # add row index column
109+
row_index_offset=0,
110+
rechunk=False,
111+
batch_size=8192,
112+
n_workers=None, # parallel readers (default: all CPUs)
113+
glob=True,
114+
encoding="cp1252",
115+
character_trim="begin_end",
116+
skip_deleted=True,
117+
validate_schema=True,
118+
compressed=False, # auto-detected for .dbc files
119+
) -> pl.DataFrame
120+
```
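
As background for `skip_deleted`: in the dBase format every record begins with a one-byte flag, a space (`0x20`) for a live record and an asterisk (`0x2A`) for a record marked as deleted. The sketch below shows how filtering on that flag might look over raw record bytes. It is an illustration under that assumption, not the library's implementation; `iter_records` and the sample data are hypothetical.

```python
from typing import Iterator

DELETED_FLAG = b"*"  # 0x2A marks a record flagged for deletion; live records use a space


def iter_records(data: bytes, record_length: int, skip_deleted: bool = True) -> Iterator[bytes]:
    """Yield the field bytes of each fixed-width record in `data`."""
    for start in range(0, len(data) - record_length + 1, record_length):
        record = data[start:start + record_length]
        if skip_deleted and record[:1] == DELETED_FLAG:
            continue
        yield record[1:]  # drop the flag byte, keep the field data


# Three 6-byte records: one flag byte plus a 5-byte character field.
demo = b" Alice" + b"*Bob  " + b" Carol"
kept = [r.decode("cp1252").strip() for r in iter_records(demo, 6)]
everything = [r.decode("cp1252").strip() for r in iter_records(demo, 6, skip_deleted=False)]
print(kept, everything)
```

Deleted records stay in the file until it is packed, which is why a reader has to decide whether to surface them.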
121+
122+
### `scan_dbase`
123+
124+
```python
125+
scan_dbase(
126+
sources,
127+
*,
128+
batch_size=8192,
129+
n_workers=None,
130+
single_col_name=None,
131+
encoding="cp1252",
132+
character_trim="begin_end",
133+
skip_deleted=True,
134+
validate_schema=True,
135+
compressed=False,
136+
glob=True,
137+
progress=False,
138+
) -> pl.LazyFrame
139+
```
140+
141+
### `write_dbase`
142+
143+
```python
144+
write_dbase(
145+
df, # polars DataFrame
146+
dest, # path or file-like object
147+
*,
148+
batch_size=None,
149+
encoding="cp1252",
150+
overwrite=False,
151+
) -> None
152+
```
153+
154+
### `get_dbase_record_count`
155+
156+
```python
157+
get_dbase_record_count(path) -> int
158+
```
159+
160+
## Encodings
161+
162+
Common encodings for dBase files:
163+
164+
| Encoding | Use case |
165+
| ------------- | --------------------------------------------- |
166+
| `cp1252` | Windows Latin-1 (default, common for DATASUS) |
167+
| `utf-8` | Unicode |
168+
| `iso-8859-1` | Latin-1 |
169+
| `iso-8859-15` | Latin-9 (Euro sign) |
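
The encodings in the table agree on most accented Latin letters but differ in a few positions, which is where a wrong `encoding` choice usually shows up. The standard-library sketch below illustrates this with hand-written bytes; it does not call klaw-dbase, and the sample bytes are made up for the example.

```python
raw = b"S\xe3o Paulo"  # "São Paulo" stored in a single-byte Latin encoding

# The byte 0xE3 decodes to the same letter in all three single-byte encodings.
decoded = {enc: raw.decode(enc) for enc in ("cp1252", "iso-8859-1", "iso-8859-15")}

# The Euro sign is where they diverge.
euro_cp1252 = b"\x80".decode("cp1252")          # Euro sign in Windows-1252
euro_latin9 = b"\xa4".decode("iso-8859-15")     # Euro sign in Latin-9
currency_latin1 = b"\xa4".decode("iso-8859-1")  # generic currency sign in Latin-1

# The same bytes are not valid UTF-8, so decoding fails instead of guessing.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(decoded, euro_cp1252, euro_latin9, currency_latin1, utf8_ok)
```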
170+
171+
## Error Handling
172+
173+
| Exception | When raised |
174+
| ---------------- | ------------------------------------------ |
175+
| `DbaseError` | Corrupted or invalid dBase file |
176+
| `DbcError` | Compression-specific problems |
177+
| `EmptySources` | No input files or empty DataFrame on write |
178+
| `SchemaMismatch` | Multiple files with incompatible schemas |
179+
| `EncodingError` | Invalid or unsupported encoding |
180+
181+
```python
182+
from klaw_dbase import DbaseError, DbcError, EmptySources, read_dbase
183+
184+
try:
185+
df = read_dbase('corrupted.dbf')
186+
except DbaseError as e:
187+
print(f'Failed to read: {e}')
188+
```
189+
190+
## License
191+
192+
MIT
