<div align="center">
  <img src="assets/images/logo_wordmark_white.png" alt="TransformPlan Logo" width="450">
</div>

# TransformPlan
- **Declarative transformations**: Build transformation pipelines using method chaining
- **Schema validation**: Validate operations before execution with dry-run capability
- **Audit trails**: Generate complete audit protocols with deterministic DataFrame hashing
- **Multi-backend support**: Works with both Polars (primary) and Pandas DataFrames
- **Serializable pipelines**: Save and load transformation plans as JSON

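The deterministic hashing behind the audit trail can be illustrated without the library. A minimal sketch, assuming nothing about TransformPlan's internals (the `frame_hash` helper below is hypothetical, not part of the API): hash a canonical serialization of the data, so identical contents always produce the identical digest.

```python
import hashlib
import json

def frame_hash(data: dict, length: int = 8) -> str:
    """Hypothetical helper: short deterministic hash of column-oriented data."""
    # Canonical serialization (sorted keys) makes the digest reproducible
    # across runs and machines.
    payload = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:length]

data = {"name": ["Alice", "Bob"], "age": [25, 30]}
print(frame_hash(data))  # same data always prints the same 8-char digest
```

Because the digest depends only on the contents, equal inputs can be verified to produce equal outputs without storing the data itself.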
## Quick Example

```python
from transformplan import TransformPlan, Col

# Build readable pipelines with 70+ chainable operations
plan = (
    TransformPlan()
    # Standardize column names
    .col_rename(column="PatientID", new_name="patient_id")
    .col_rename(column="DOB", new_name="date_of_birth")
    .str_strip(column="patient_id")

    # Calculate derived values
    .dt_age_years(column="date_of_birth", new_column="age")
    .math_clamp(column="age", min_value=0, max_value=120)

    # Categorize patients by age
    .map_discretize(column="age", bins=[18, 40, 65], labels=["young", "adult", "senior"], new_column="age_group")

    # Filter and clean
    .rows_filter(Col("age") >= 18)
    .rows_drop_nulls(columns=["patient_id", "age"])
    .col_drop(column="date_of_birth")
)

# Execute with schema validation: catch errors before they hit production
# (df is your input Polars or Pandas DataFrame)
df_result, protocol = plan.process(df, validate=True)

# Serialize pipelines to JSON: version control your transformations
plan.to_json("patient_transform.json")

# Reload and reapply: reproducible results across environments
plan = TransformPlan.from_json("patient_transform.json")
df_result, protocol = plan.process(new_data)
```
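Expressions like `Col("age") >= 18` rely on Python operator overloading: the comparison is not evaluated immediately but captured as data, which is what makes a plan inspectable and serializable. A minimal sketch of the idea (not TransformPlan's actual `Col` implementation):

```python
class Col:
    """Hypothetical stand-in for an expression-building column reference."""
    def __init__(self, name: str):
        self.name = name

    def __ge__(self, other):
        # Record the comparison as data instead of evaluating it.
        return {"op": ">=", "column": self.name, "value": other}

expr = Col("age") >= 18
print(expr)  # {'op': '>=', 'column': 'age', 'value': 18}
```

The captured structure can later be applied to any backend DataFrame, or written out as JSON alongside the rest of the plan.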

### Full Audit Trail: Every Step Tracked and Hashed

```python
protocol.print(show_params=False)
```

```
======================================================================
TRANSFORM PROTOCOL
======================================================================
Input:  1000 rows × 5 cols   [a4f8b2c1]
Output:  847 rows × 6 cols   [e7d3f9a2]
Total time: 0.0247s
----------------------------------------------------------------------

 #  Operation         Rows          Cols     Time      Hash
----------------------------------------------------------------------
 0  input             1000          5        -         a4f8b2c1
 1  col_rename        1000          5        0.0012s   b2e4a7f3
 2  col_rename        1000          5        0.0008s   c9d1e5b8
 3  str_strip         1000          5        0.0013s   c9d1e5b8 ○
 4  dt_age_years      1000          6 (+1)   0.0041s   d4f2c8a1
 5  math_clamp        1000          6        0.0015s   e1b7d3f9
 6  map_discretize    1000          7 (+1)   0.0028s   f8a4c2e6
 7  rows_filter        858 (-142)   7        0.0037s   a2e9f4b7
 8  rows_drop_nulls    847 (-11)    7        0.0019s   b5c1d8e3
 9  col_drop           847          6 (-1)   0.0006s   e7d3f9a2
======================================================================
○ = no effect (step 3 did not change data)
```
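The ○ marker falls out of the hashing scheme: a step whose output hash equals the previous step's hash cannot have changed the data. A small sketch of that check, using the hashes from the table above:

```python
# Step hashes from the protocol above: (operation name, output hash).
steps = [
    ("input", "a4f8b2c1"),
    ("col_rename", "b2e4a7f3"),
    ("col_rename", "c9d1e5b8"),
    ("str_strip", "c9d1e5b8"),  # identical to the previous hash
]

# A step had no effect if its hash matches the preceding step's hash.
no_effect = [i for i in range(1, len(steps)) if steps[i][1] == steps[i - 1][1]]
print(no_effect)  # [3] -> step 3 (str_strip) did not change the data
```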

## Available Operations

| Category | Description | Examples |
|----------|-------------|----------|
| **col_** | Column operations | `col_rename`, `col_drop`, `col_cast`, `col_add`, `col_select` |
| **math_** | Arithmetic operations | `math_add`, `math_multiply`, `math_clamp`, `math_round`, `math_abs` |
| **rows_** | Row filtering & reshaping | `rows_filter`, `rows_drop_nulls`, `rows_sort`, `rows_unique`, `rows_pivot` |
| **str_** | String operations | `str_lower`, `str_upper`, `str_strip`, `str_replace`, `str_split` |
| **dt_** | Datetime operations | `dt_year`, `dt_month`, `dt_parse`, `dt_age_years`, `dt_diff_days` |
| **map_** | Value mapping | `map_values`, `map_discretize`, `map_case`, `map_from_column` |

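Because every operation is named and parameterized, a whole plan reduces to plain data, which is what `to_json` and `from_json` rely on. The actual file format is defined by the library; the shape below is a hypothetical illustration of why such a document round-trips losslessly:

```python
import json

# Hypothetical serialized-plan shape; the real format produced by
# TransformPlan.to_json may differ.
plan_doc = {
    "version": 1,
    "steps": [
        {"op": "col_rename", "params": {"column": "PatientID", "new_name": "patient_id"}},
        {"op": "rows_filter", "params": {"op": ">=", "column": "age", "value": 18}},
    ],
}

# Everything is JSON-native (strings, numbers, lists, dicts),
# so a dump/load round trip reproduces the plan exactly.
assert json.loads(json.dumps(plan_doc)) == plan_doc
```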
## Getting Started
