Commit ce3e7ae (parent 882fc25): "doc update"
1 file changed: docs/index.md (63 additions, 28 deletions)
<div align="center">
  <img src="assets/images/logo_wordmark_white.png" alt="TransformPlan Logo" width="450">
</div>

# TransformPlan
TransformPlan tracks transformation history and validates operations against DataFrame schemas.
- **Declarative transformations**: Build transformation pipelines using method chaining
- **Schema validation**: Validate operations before execution with dry-run capability
- **Audit trails**: Generate complete audit protocols with deterministic DataFrame hashing
- **Multi-backend support**: Works with both Polars (primary) and Pandas DataFrames
- **Serializable pipelines**: Save and load transformation plans as JSON

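The "deterministic DataFrame hashing" mentioned above can be illustrated in plain Python. This is only a conceptual sketch (TransformPlan's actual hashing algorithm is not documented on this page): hash a canonical serialization of the data, so identical content always yields an identical digest.

```python
import hashlib
import json

def dataframe_digest(rows: list[dict]) -> str:
    """Illustrative deterministic hash of tabular data (not TransformPlan's
    actual algorithm): serialize rows canonically, then hash the bytes."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]

# Key order does not matter: the canonical form sorts keys first
assert dataframe_digest([{"age": 25, "name": "Alice"}]) == \
    dataframe_digest([{"name": "Alice", "age": 25}])
```

Because such a digest depends only on content, rerunning the same plan on the same input reproduces the same hashes, which is what makes an audit trail verifiable.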
## Quick Example

```python
from transformplan import TransformPlan, Col

# `df` is assumed to be an existing Polars (or Pandas) DataFrame
# with "PatientID" and "DOB" columns.

# Build readable pipelines with 70+ chainable operations
plan = (
    TransformPlan()
    # Standardize column names
    .col_rename(column="PatientID", new_name="patient_id")
    .col_rename(column="DOB", new_name="date_of_birth")
    .str_strip(column="patient_id")

    # Calculate derived values
    .dt_age_years(column="date_of_birth", new_column="age")
    .math_clamp(column="age", min_value=0, max_value=120)

    # Categorize patients by age
    .map_discretize(column="age", bins=[18, 40, 65], labels=["young", "adult", "senior"], new_column="age_group")

    # Filter and clean
    .rows_filter(Col("age") >= 18)
    .rows_drop_nulls(columns=["patient_id", "age"])
    .col_drop(column="date_of_birth")
)

# Execute with schema validation — catch errors before they hit production
df_result, protocol = plan.process(df, validate=True)

# Serialize pipelines to JSON — version control your transformations
plan.to_json("patient_transform.json")

# Reload and reapply — reproducible results across environments
plan = TransformPlan.from_json("patient_transform.json")
df_result, protocol = plan.process(new_data)  # new_data: another DataFrame with the same schema
```
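A serialized plan is just data, so it can be diffed and code-reviewed like any other file. TransformPlan's actual JSON schema is not shown on this page; the sketch below uses a hypothetical structure (an ordered list of operation records) purely to illustrate why the round trip is lossless:

```python
import json

# Hypothetical plan structure: an ordered list of {op, params} records.
# (TransformPlan's real on-disk JSON schema is not documented here.)
plan_steps = [
    {"op": "col_rename", "params": {"column": "PatientID", "new_name": "patient_id"}},
    {"op": "rows_filter", "params": {"expr": "age >= 18"}},
]

serialized = json.dumps(plan_steps, indent=2)
restored = json.loads(serialized)

# The round trip preserves operation order and parameters exactly
assert restored == plan_steps
```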

### Full Audit Trail — Every Step Tracked and Hashed

```python
protocol.print(show_params=False)
```

```
======================================================================
TRANSFORM PROTOCOL
======================================================================
Input:       1000 rows × 5 cols   [a4f8b2c1]
Output:       847 rows × 6 cols   [e7d3f9a2]
Total time:  0.0247s
----------------------------------------------------------------------

 #  Operation         Rows          Cols     Time      Hash
----------------------------------------------------------------------
 0  input             1000          5        -         a4f8b2c1
 1  col_rename        1000          5        0.0012s   b2e4a7f3
 2  col_rename        1000          5        0.0008s   c9d1e5b8
 3  str_strip         1000          5        0.0013s   c9d1e5b8 ○
 4  dt_age_years      1000          6 (+1)   0.0041s   d4f2c8a1
 5  math_clamp        1000          6        0.0015s   e1b7d3f9
 6  map_discretize    1000          7 (+1)   0.0028s   f8a4c2e6
 7  rows_filter        858 (-142)   7        0.0037s   a2e9f4b7
 8  rows_drop_nulls    847 (-11)    7        0.0019s   b5c1d8e3
 9  col_drop           847          6 (-1)   0.0006s   e7d3f9a2
======================================================================
○ = no effect (step 3 did not change data)
```
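The ○ marker works because every step's output is hashed: if a step's digest equals the previous step's digest, the step provably changed nothing on this data. A minimal sketch of that check, using digests in the spirit of the protocol above and a hypothetical record format:

```python
# Each protocol step records the digest of its output; a step is a no-op
# when its digest equals the previous step's digest (hypothetical format).
steps = [
    ("input", "a4f8b2c1"),
    ("col_rename", "b2e4a7f3"),
    ("str_strip", "b2e4a7f3"),  # same digest as the previous step -> no effect
    ("rows_filter", "a2e9f4b7"),
]

no_ops = [
    name
    for (name, digest), (_, prev_digest) in zip(steps[1:], steps)
    if digest == prev_digest
]
assert no_ops == ["str_strip"]
```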

## Available Operations

| Category | Description | Examples |
|----------|-------------|----------|
| **col_** | Column operations | `col_rename`, `col_drop`, `col_cast`, `col_add`, `col_select` |
| **math_** | Arithmetic operations | `math_add`, `math_multiply`, `math_clamp`, `math_round`, `math_abs` |
| **rows_** | Row filtering & reshaping | `rows_filter`, `rows_drop_nulls`, `rows_sort`, `rows_unique`, `rows_pivot` |
| **str_** | String operations | `str_lower`, `str_upper`, `str_strip`, `str_replace`, `str_split` |
| **dt_** | Datetime operations | `dt_year`, `dt_month`, `dt_parse`, `dt_age_years`, `dt_diff_days` |
| **map_** | Value mapping | `map_values`, `map_discretize`, `map_case`, `map_from_column` |
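As an illustration of the **map_** family, here is `map_discretize`-style binning sketched with the standard-library `bisect` module. The exact edge semantics of TransformPlan's implementation are not documented here; this sketch treats each bin as closed on the left and leaves the top bin open-ended:

```python
import bisect

def discretize(value: float, bins: list[float], labels: list[str]) -> str:
    """Map a value to a label by its position among sorted bin edges.
    With bins [18, 40, 65] and labels ["young", "adult", "senior"]:
    [18, 40) -> "young", [40, 65) -> "adult", >= 65 -> "senior"."""
    idx = bisect.bisect_right(bins, value) - 1
    if idx < 0 or idx >= len(labels):
        raise ValueError(f"value {value} falls below the lowest bin edge")
    return labels[idx]

assert discretize(25, [18, 40, 65], ["young", "adult", "senior"]) == "young"
assert discretize(70, [18, 40, 65], ["young", "adult", "senior"]) == "senior"
```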

## Getting Started
