Commit 0cdddca

Merge pull request #6 from limebit/encodings
Encodings
2 parents ce3e7ae + aaeaaa8 commit 0cdddca

9 files changed

Lines changed: 1125 additions & 2 deletions

README.md

Lines changed: 1 addition & 0 deletions
@@ -95,6 +95,7 @@ Total time: 0.0247s
 | **str_** | String operations | `str_lower`, `str_upper`, `str_strip`, `str_replace`, `str_split` |
 | **dt_** | Datetime operations | `dt_year`, `dt_month`, `dt_parse`, `dt_age_years`, `dt_diff_days` |
 | **map_** | Value mapping | `map_values`, `map_discretize`, `map_case`, `map_from_column` |
+| **enc_** | Categorical encoding | `enc_onehot`, `enc_ordinal`, `enc_label` |

 ## Installation

docs/api/ops/encoding.md

Lines changed: 187 additions & 0 deletions
@@ -0,0 +1,187 @@
# Encoding Operations

Categorical encoding operations for machine learning preparation.

## Overview

Encoding operations transform categorical columns into numeric representations suitable for machine learning models. They support one-hot encoding, ordinal encoding, and label encoding.

```python
from transformplan import TransformPlan

plan = (
    TransformPlan()
    .enc_onehot("color", categories=["red", "green", "blue"], drop="first")
    .enc_ordinal("size", categories=["small", "medium", "large"])
)
```

## Class Reference

::: transformplan.ops.encoding.EncodingOps
    options:
      show_root_heading: true
      members:
        - enc_onehot
        - enc_ordinal
        - enc_label

## Examples

### One-Hot Encoding

Creates binary indicator columns (0/1) for each category.

```python
# Basic one-hot encoding
plan = TransformPlan().enc_onehot(
    column="color",
    categories=["red", "green", "blue"]
)
# Creates columns: color_red, color_green, color_blue

# Drop first category to avoid multicollinearity (for regression models)
plan = TransformPlan().enc_onehot(
    column="color",
    categories=["red", "green", "blue"],
    drop="first"
)
# Creates columns: color_green, color_blue (drops color_red)

# Drop last category
plan = TransformPlan().enc_onehot(
    column="color",
    categories=["red", "green", "blue"],
    drop="last"
)
# Creates columns: color_red, color_green (drops color_blue)

# Drop specific category
plan = TransformPlan().enc_onehot(
    column="color",
    categories=["red", "green", "blue"],
    drop="green"
)
# Creates columns: color_red, color_blue (drops color_green)

# Custom prefix for new columns
plan = TransformPlan().enc_onehot(
    column="color",
    categories=["red", "green", "blue"],
    prefix="c"
)
# Creates columns: c_red, c_green, c_blue

# Keep original column
plan = TransformPlan().enc_onehot(
    column="color",
    categories=["red", "green", "blue"],
    drop_original=False
)
# Keeps color column alongside color_red, color_green, color_blue
```
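
To make the effect of `drop` and `prefix` concrete, here is a minimal plain-Python sketch of the column naming and 0/1 indicators described above. It is not transformplan's implementation; the helper names `onehot_columns` and `onehot_row` are hypothetical and exist only for illustration.

```python
# Illustrative sketch only: plain Python, independent of transformplan.
def onehot_columns(categories, column, drop=None, prefix=None):
    """Return the indicator column names a one-hot encoding would produce."""
    prefix = column if prefix is None else prefix
    if drop == "first":
        kept = categories[1:]
    elif drop == "last":
        kept = categories[:-1]
    elif drop is not None:
        # Any other value is treated as the name of the category to drop.
        kept = [c for c in categories if c != drop]
    else:
        kept = list(categories)
    return [f"{prefix}_{c}" for c in kept]


def onehot_row(value, categories, column, drop=None, prefix=None):
    """Encode a single value as a dict mapping column name to 0 or 1."""
    prefix = column if prefix is None else prefix
    names = onehot_columns(categories, column, drop, prefix)
    return {name: int(name == f"{prefix}_{value}") for name in names}


cats = ["red", "green", "blue"]
print(onehot_columns(cats, "color"))                # all three indicator columns
print(onehot_columns(cats, "color", drop="first"))  # color_red is omitted
print(onehot_row("green", cats, "color"))           # only color_green is 1
print(onehot_row("red", cats, "color", drop="first"))  # dropped category: all zeros
```

With a dropped category, a row holding that category is represented by zeros in every remaining indicator column, so it acts as the baseline.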

### Ordinal Encoding

Maps categories to integers based on explicit ordering (first=0, second=1, etc.).

```python
# Ordinal encoding with meaningful order
plan = TransformPlan().enc_ordinal(
    column="size",
    categories=["small", "medium", "large"]
)
# Maps: small -> 0, medium -> 1, large -> 2

# Output to new column
plan = TransformPlan().enc_ordinal(
    column="size",
    categories=["small", "medium", "large"],
    new_column="size_encoded"
)

# Custom unknown value
plan = TransformPlan().enc_ordinal(
    column="size",
    categories=["small", "medium", "large"],
    unknown_value=-1  # Default
)
# Values not in categories get -1
```
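
The mapping rule above (position in the category list, with a fallback for unseen values) can be sketched in a few lines of plain Python. This is an illustration of the semantics only, not the library's code; `ordinal_encode` is a hypothetical helper.

```python
# Illustrative sketch only: plain Python, independent of transformplan.
def ordinal_encode(values, categories, unknown_value=-1):
    """Map each value to its position in `categories`, or to `unknown_value`."""
    index = {cat: i for i, cat in enumerate(categories)}
    return [index.get(v, unknown_value) for v in values]


sizes = ["medium", "small", "large", "xl"]  # "xl" is not a listed category
encoded = ordinal_encode(sizes, ["small", "medium", "large"])
print(encoded)  # [1, 0, 2, -1]
```

Because the integers follow the order of the list, reordering `categories` changes the encoding, which is why the order should reflect the real ranking of the values.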

### Label Encoding

Simple integer encoding, alphabetically sorted by default. Similar to ordinal encoding but without semantic ordering.

```python
# Label encoding (alphabetically sorted)
plan = TransformPlan().enc_label(column="department")
# Maps alphabetically: Engineering -> 0, HR -> 1, Sales -> 2

# With explicit categories
plan = TransformPlan().enc_label(
    column="department",
    categories=["HR", "Engineering", "Sales"]
)
# Maps: HR -> 0, Engineering -> 1, Sales -> 2
```
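
As a rough sketch of the default behaviour described above, the following plain-Python helper derives the categories by sorting the unique values when none are given. The name `label_encode` is hypothetical, and Python's `sorted` orders strings by code point, which may differ from the library's notion of alphabetical order for mixed-case data.

```python
# Illustrative sketch only: plain Python, independent of transformplan.
def label_encode(values, categories=None, unknown_value=-1):
    """Return (codes, categories); categories default to the sorted unique values."""
    if categories is None:
        categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    return [index.get(v, unknown_value) for v in values], categories


departments = ["Sales", "HR", "Engineering", "HR"]

codes, cats_used = label_encode(departments)
print(cats_used)  # ['Engineering', 'HR', 'Sales']
print(codes)      # [2, 1, 0, 1]

explicit_codes, _ = label_encode(departments, ["HR", "Engineering", "Sales"])
print(explicit_codes)  # [2, 0, 1, 0]
```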

## Use Cases

### Preparing Data for Machine Learning

```python
# One-hot encode categorical features, dropping first to avoid multicollinearity
plan = (
    TransformPlan()
    .enc_onehot("color", categories=["red", "green", "blue"], drop="first")
    .enc_onehot("size", categories=["S", "M", "L", "XL"], drop="first")
    .enc_ordinal("quality", categories=["low", "medium", "high"])
)
```

### Handling Unknown Categories

```python
# Unknown values get all zeros (one-hot)
plan = TransformPlan().enc_onehot(
    column="color",
    categories=["red", "green", "blue"],
    unknown_value="all_zero"  # Default
)

# Unknown values get -1 (ordinal/label)
plan = TransformPlan().enc_ordinal(
    column="size",
    categories=["small", "medium", "large"],
    unknown_value=-1
)
```

### Deriving Categories from Data

When categories are not specified, they are derived from the data (sorted alphabetically):

```python
# Categories derived from data
plan = TransformPlan().enc_onehot("color")
# Uses sorted unique values from the column

# Note: For reproducibility, explicitly specify categories
plan = TransformPlan().enc_onehot(
    column="color",
    categories=["blue", "green", "red"]  # Explicit is better
)
```
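
The reproducibility concern can be seen without the library at all. The sketch below, in plain Python with a hypothetical `derived_columns` helper, derives column names from two datasets and shows that they disagree when one dataset is missing a category.

```python
# Illustrative sketch only: plain Python, independent of transformplan.
def derived_columns(values, column):
    """Column names obtained by deriving categories from the data itself."""
    return [f"{column}_{c}" for c in sorted(set(values))]


train = ["red", "green", "blue", "green"]
test = ["red", "green"]  # "blue" never appears in this batch

train_cols = derived_columns(train, "color")
test_cols = derived_columns(test, "color")

print(train_cols)               # ['color_blue', 'color_green', 'color_red']
print(test_cols)                # ['color_green', 'color_red']
print(train_cols == test_cols)  # False: the two batches get different schemas
```

Passing the same explicit `categories` list for both datasets keeps the output columns identical regardless of which values happen to be present.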

## Multicollinearity Note

When using one-hot encoding for linear models (regression, logistic regression), you should drop one category to avoid the [dummy variable trap](https://en.wikipedia.org/wiki/Dummy_variable_(statistics)). Use the `drop` parameter:

```python
# For regression models, drop one category
plan = TransformPlan().enc_onehot("color", drop="first")

# Tree-based models (random forest, XGBoost) don't require this
plan = TransformPlan().enc_onehot("color")  # Keep all
```
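
A small plain-Python check shows where the trap comes from: with every category kept, the indicator columns of each row sum to exactly 1, which duplicates the intercept column of a linear model. The data and variable names below are made up for illustration and do not use transformplan.

```python
# Illustrative sketch only: plain Python, independent of transformplan.
categories = ["red", "green", "blue"]
colors = ["red", "blue", "green", "red"]

# Full encoding: one indicator per category.
full = [[int(v == c) for c in categories] for v in colors]
# Reduced encoding: first category ("red") dropped and used as the baseline.
reduced = [[int(v == c) for c in categories[1:]] for v in colors]

print([sum(row) for row in full])     # [1, 1, 1, 1]
print([sum(row) for row in reduced])  # [0, 1, 1, 0]
```

In the full encoding the row sums are constant, so the indicators are perfectly collinear with a constant term; after dropping one category the sums vary and that linear dependence disappears.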

docs/index.md

Lines changed: 1 addition & 0 deletions
@@ -94,6 +94,7 @@ Total time: 0.0247s
 | **str_** | String operations | `str_lower`, `str_upper`, `str_strip`, `str_replace`, `str_split` |
 | **dt_** | Datetime operations | `dt_year`, `dt_month`, `dt_parse`, `dt_age_years`, `dt_diff_days` |
 | **map_** | Value mapping | `map_values`, `map_discretize`, `map_case`, `map_from_column` |
+| **enc_** | Categorical encoding | `enc_onehot`, `enc_ordinal`, `enc_label` |


 ## Getting Started

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -91,3 +91,4 @@ nav:
 - String Operations: api/ops/string.md
 - Datetime Operations: api/ops/datetime.md
 - Map Operations: api/ops/map.md
+- Encoding Operations: api/ops/encoding.md
