cleanframe detects common column types (email, phone, date, name) and automatically normalizes them with simple rule-based fixers and validators.
Quick start
- Install dependencies
python -m pip install pandas
- Run unit tests
pip install pytest
pytest -q
- Use the library from a script
python - <<'PY'
import sys
sys.path.insert(0, 'datacleaner')
import pandas as pd
from cleanframe import fix
df = pd.DataFrame({
'email': ['AASHAY@GMAIL.COM', 'vansh@EXAMPLE.com', 'bad'],
'phone': ['800-123-4567', '(800) 333-4444', 'nope'],
'date': ['2023-13-01', '2020-02-29', 'invalid'],
'name': ['aashay', 'VANSH', '3rd street'],
})
cleaned = fix(df)
print(cleaned)
PY
Usage note
- To clean a CSV file, call cleanframe.fix from a Python script and write the resulting DataFrame to disk.
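The CSV workflow described above could be wrapped roughly as follows. This is a minimal sketch, not part of cleanframe: the helper name clean_csv and the file names are hypothetical, and the cleaning function is passed in as an argument so the helper makes no assumptions about how cleanframe is installed. Reading every column as text is also an assumption, made here so that phone numbers and dates are not reinterpreted before cleaning.

```python
import pandas as pd


def clean_csv(src, dst, fix_fn):
    """Read a CSV, apply a cleaning function, and write the result.

    clean_csv is a hypothetical helper; fix_fn is expected to take a
    DataFrame and return a DataFrame (for example, cleanframe.fix).
    """
    # Assumption: load all columns as strings so values such as phone
    # numbers are not parsed as numbers before cleaning.
    df = pd.read_csv(src, dtype=str)
    cleaned = fix_fn(df)
    # index=False keeps the pandas row index out of the output file.
    cleaned.to_csv(dst, index=False)
    return cleaned


# Example call, assuming cleanframe is importable and input.csv exists:
#   from cleanframe import fix
#   clean_csv("input.csv", "cleaned.csv", fix)
```

Passing the cleaning function as a parameter also makes the helper easy to exercise in tests with a stand-in function.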
Removed files and tools
- This repo was cleaned to remove several helper/experimental scripts. Use pytest for tests and import cleanframe.fix in your Python code to perform cleaning.
Design notes
- Detection: heuristics based on validators and pattern matching
- Fixing: deterministic normalization, no external APIs or ML
- Invalid values become NaN after cleaning (safe for later processing)
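To make the design notes concrete, here is a small standard-library sketch of the same idea: validators drive detection, fixing is deterministic, and invalid values become NaN. It is an illustration only and does not reproduce cleanframe's actual code. The names is_email, is_date, detect_type and fix_column, the regular expression, the YYYY-MM-DD date format, and the 0.5 detection threshold are all assumptions made for the example.

```python
import math
import re
from datetime import datetime

# Deliberately loose pattern for illustration; not a full email validator.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_email(value):
    return isinstance(value, str) and EMAIL_RE.match(value.strip()) is not None


def is_date(value):
    # Assumption: dates are written as YYYY-MM-DD.
    try:
        datetime.strptime(value.strip(), "%Y-%m-%d")
        return True
    except (ValueError, AttributeError):
        return False


VALIDATORS = {"email": is_email, "date": is_date}


def detect_type(values, threshold=0.5):
    """Return the type whose validator accepts the largest share of values.

    Returns None when no validator reaches the (assumed) threshold.
    """
    best, best_share = None, 0.0
    for name, validator in VALIDATORS.items():
        share = sum(validator(v) for v in values) / max(len(values), 1)
        if share >= threshold and share > best_share:
            best, best_share = name, share
    return best


def fix_column(values):
    """Normalize valid values and replace invalid ones with NaN."""
    kind = detect_type(values)
    if kind == "email":
        return [v.strip().lower() if is_email(v) else math.nan for v in values]
    if kind == "date":
        return [v.strip() if is_date(v) else math.nan for v in values]
    # Unrecognized columns are returned unchanged.
    return list(values)
```

In this sketch a column is only treated as a given type when at least half of its values pass that type's validator, so a column of mostly free text is left alone rather than being filled with NaN.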