Commit 3db66ce

Merge branch 'release/4.2.0'
2 parents dc78d4a + 42047b4 commit 3db66ce

33 files changed

Lines changed: 5676 additions & 666 deletions

.github/ISSUE_TEMPLATE/bug_report.md

Lines changed: 5 additions & 3 deletions
@@ -9,6 +9,7 @@ A clear and concise description of what the bug is.
 
 **To Reproduce**
 Steps to reproduce the behavior:
+
 1. `import mailparser`
 2. `mail = mailparser.parse_from_file(f)`
 3. '....'
@@ -23,9 +24,10 @@ You can use a `gist` like [this](https://gist.github.com/fedelemantuano/5dd70200
 The issues without raw mail will be closed.
 
 **Environment:**
-- OS: [e.g. Linux, Windows]
-- Docker: [yes or no]
-- mail-parser version [e.g. 3.6.0]
+
+- OS: [e.g. Linux, Windows]
+- Docker: [yes or no]
+- mail-parser version [e.g. 3.6.0]
 
 **Additional context**
 Add any other context about the problem here (e.g. stack traceback error).

.github/copilot-instructions.md

Lines changed: 226 additions & 0 deletions
@@ -0,0 +1,226 @@
# Copilot Instructions for mail-parser

mail-parser is a **production-grade email parsing library** for Python that transforms raw email messages into
structured Python objects. Originally built as the foundation for [SpamScope](https://github.com/SpamScope/spamscope),
it excels at security analysis, forensics, and RFC-compliant email processing.

## Core Architecture

### Factory-Based API Pattern

**Always use factory functions** instead of direct `MailParser()` instantiation:

```python
import mailparser
mail = mailparser.parse_from_file(filepath)  # Standard email files
mail = mailparser.parse_from_string(raw_email)  # Email as string
mail = mailparser.parse_from_bytes(email_bytes)  # Email as bytes
mail = mailparser.parse_from_file_msg(msg_file)  # Outlook .msg files
```

### Triple-Format Property Access

Every parsed component offers **three access patterns** (`src/mailparser/core.py:550-570`):

```python
mail.subject  # Python object (decoded string)
mail.subject_raw  # Raw header value (JSON list)
mail.subject_json  # JSON-serialized version
```

This pattern applies to all properties via `__getattr__` magic in `core.py`.

### Property Naming Convention

Headers with hyphens use **underscore substitution** (`core.py:__getattr__`):

```python
mail.X_MSMail_Priority  # Accesses "X-MSMail-Priority" header
mail.Content_Type  # Accesses "Content-Type" header
```
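
The following is a minimal sketch of how suffix-based access and underscore substitution can be built on `__getattr__`. It is not mail-parser's implementation: `HeaderView` is a hypothetical class built on the standard library `email` package, and it simplifies `_raw` to a plain Python list.

```python
import email
import json


class HeaderView:
    """Hypothetical wrapper for illustration; not mail-parser's MailParser."""

    def __init__(self, raw_email):
        self._message = email.message_from_string(raw_email)

    def __getattr__(self, name):
        # __getattr__ only runs when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        # A suffix selects the format: "_raw" -> list, "_json" -> JSON string.
        for suffix in ("_raw", "_json"):
            if name.endswith(suffix):
                base, fmt = name[: -len(suffix)], suffix
                break
        else:
            base, fmt = name, ""
        # Underscores in the attribute name stand for hyphens in the header.
        values = self._message.get_all(base.replace("_", "-"), [])
        if fmt == "_raw":
            return values
        text = values[0] if values else None
        return json.dumps(text) if fmt == "_json" else text


RAW = "Subject: Hello\nX-MSMail-Priority: Normal\n\nBody text\n"
view = HeaderView(RAW)
print(view.subject)
print(view.X_MSMail_Priority)
print(view.subject_raw)
print(view.subject_json)
```

Because header lookup in the `email` package is case-insensitive, `view.subject` and `view.Subject` resolve to the same header in this sketch.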
41+
42+
## Development Workflows
43+
44+
### Dependency Management with uv
45+
46+
The project uses **[uv](https://github.com/astral-sh/uv)** (modern pip/virtualenv replacement) exclusively:
47+
48+
```bash
49+
uv sync # Install all dev/test dependencies (defined in pyproject.toml)
50+
make install # Alias for uv sync
51+
```
52+
53+
Never use `pip` directly—all commands in Makefile use `uv run` prefix.
54+
55+
### Testing Patterns
56+
57+
```bash
58+
make test # pytest with coverage (generates coverage.xml, junit.xml, htmlcov/)
59+
make lint # ruff check .
60+
make format # ruff format .
61+
make check # lint + test
62+
make pre-commit # Run all pre-commit hooks
63+
```
64+
65+
When adding features or fixing bugs you MUST follow these steps:
66+
67+
1. Add relevant test email to `tests/mails/` if demonstrating new case
68+
2. Write tests in the corresponding test file following existing patterns, under `tests/`
69+
3. Run `make test` to verify all tests pass before committing
70+
4. Run `uv run mail-parser -f tests/mails/mail_test_11 -j` to manually verify JSON output and that new changes
71+
work as expected
72+
5. Run `make pre-commit` to ensure code style compliance before pushing
73+
74+
**Test data location**: `tests/mails/` contains malformed emails, Outlook files, and various encodings
75+
(`mail_test_1` through `mail_test_17`, `mail_malformed_1-3`, `mail_outlook_1`).
76+
77+
**Critical testing rule**: When modifying parsing logic, test against malformed emails to ensure security defect
78+
detection still works.
79+
80+
### Build & Release Process
81+
82+
```bash
83+
make build # uv build → creates dist/*.tar.gz and dist/*.whl
84+
make release # build + twine upload to PyPI
85+
```
86+
87+
Version is **dynamically loaded** from `src/mailparser/version.py` (see
88+
`pyproject.toml:tool.hatch.version`).

## Security-First Parsing

### Defect Detection System

The parser identifies RFC violations that could indicate malicious intent (`core.py:240-268`):

```python
mail.has_defects  # Boolean flag
mail.defects  # List of defect dicts by content type
mail.defects_categories  # Set of defect class names (e.g., "StartBoundaryNotFoundDefect")
```

**Epilogue defect handling** (`core.py:320-335`): When `EPILOGUE_DEFECTS` are detected, the parser extracts hidden
content between MIME boundaries that could contain malicious payloads.
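
As a rough illustration of where such defects come from, the sketch below uses only the standard library `email` package, which records defects on each message part. The helper `collect_defect_categories` and the sample message are assumptions made for this example; mail-parser's own bookkeeping in `core.py` may differ.

```python
import email

# Deliberately malformed: the declared boundary never appears in the body.
RAW = (
    'Content-Type: multipart/mixed; boundary="XXX"\n'
    "Subject: broken\n"
    "\n"
    "payload without any boundary line\n"
)


def collect_defect_categories(message):
    """Return the class names of all defects recorded on any MIME part."""
    categories = set()
    for part in message.walk():
        for defect in part.defects:
            categories.add(type(defect).__name__)
    return categories


message = email.message_from_string(RAW)
categories = collect_defect_categories(message)
has_defects = bool(categories)
print(has_defects, sorted(categories))
```

A test for a parsing change could assert on the resulting set in the same way, so that a regression in defect detection fails loudly.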

### IP Address Extraction

The `get_server_ipaddress(trust)` method (`core.py:487-528`) extracts sender IPs with **trust-level validation**:

```python
# Finds first non-private IP in trusted headers
mail.get_server_ipaddress(trust="Received")
```

It filters out private IP ranges using Python's `ipaddress` module.
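
The filtering step can be sketched with the standard library alone. The function `first_public_ip` and its IPv4-only regex are hypothetical and much simpler than whatever `get_server_ipaddress` does internally; the sketch only shows the role of `ipaddress` in discarding private ranges.

```python
import ipaddress
import re

# Loose IPv4 candidate pattern; validation is left to the ipaddress module.
IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def first_public_ip(header_value):
    """Return the first non-private IPv4 address in the text, or None."""
    for candidate in IPV4_CANDIDATE.findall(header_value):
        try:
            address = ipaddress.ip_address(candidate)
        except ValueError:
            continue  # e.g. "999.1.1.1" matches the regex but is not valid
        if not address.is_private:
            return str(address)
    return None


header = "from relay.example.org (10.0.0.5) by mx.example.org (8.8.8.8)"
print(first_public_ip(header))
```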

### Received Header Parsing

Complex regex-based parsing (`utils.py:302-360`, patterns in `const.py:24-73`) extracts hop-by-hop routing:

```python
# Returns list of dicts with: by, from, date, date_utc, delay, envelope_from, hop, with
mail.received
```

**Key pattern**: `RECEIVED_COMPILED_LIST` contains pre-compiled regexes for "from", "by", "with", "id", "for",
"via", "envelope-from", "envelope-sender", and date patterns. Recent fixes addressed IBM gateway duplicate matches
(see comments in `const.py:26-38`).

If parsing fails, the parser falls back to `receiveds_not_parsed()`, which returns a `{"raw": <header>, "hop": <n>}`
structure.
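
The clause-by-clause approach with a raw fallback can be sketched as follows. The three patterns and the `parse_received` function are illustrative assumptions, far simpler than the real entries in `const.py`, and hops are numbered in the order the headers are passed in.

```python
import re

# Hypothetical, simplified patterns; the real ones live in const.py.
CLAUSE_PATTERNS = {
    "from": re.compile(r"\bfrom\s+(\S+)", re.IGNORECASE),
    "by": re.compile(r"\bby\s+(\S+)", re.IGNORECASE),
    "with": re.compile(r"\bwith\s+(\S+)", re.IGNORECASE),
}


def parse_received(headers):
    """Parse each Received value into a dict, keeping the raw text on failure."""
    hops = []
    for hop, raw in enumerate(headers, start=1):
        fields = {}
        for name, pattern in CLAUSE_PATTERNS.items():
            match = pattern.search(raw)
            if match:
                fields[name] = match.group(1)
        if not fields:
            # Fallback: nothing matched, so keep the raw header and its position.
            hops.append({"raw": raw, "hop": hop})
            continue
        fields["hop"] = hop
        hops.append(fields)
    return hops


hops = parse_received([
    "from a.example.org by b.example.org with ESMTP id 123",
    "unparseable header value",
])
print(hops)
```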

## Project Structure Specifics

### src/ Layout

Package uses modern **src-layout** (`src/mailparser/`) for cleaner imports and testing isolation:

```text
src/mailparser/
├── __init__.py      # Exports factory functions
├── __main__.py      # CLI entry point (mail-parser command)
├── core.py          # MailParser class (760 lines)
├── utils.py         # Parsing utilities (582 lines)
├── const.py         # Regex patterns and constants
├── exceptions.py    # Exception hierarchy
└── version.py       # Version string
```

### External Dependency: Outlook Support

Outlook `.msg` file parsing requires a **system-level Perl module**:

```bash
apt-get install libemail-outlook-message-perl  # Debian/Ubuntu
```

Conversion is triggered via the `msgconvert()` function in `utils.py`, which shells out to a Perl script. It raises
`MailParserOutlookError` if the module is unavailable.

### CLI Tool Pattern

`__main__.py` provides a production CLI with mutually exclusive input modes (`-f`, `-s`, `-k`), JSON output (`-j`),
and selective printing (`-b`, `-a`, `-r`, `-t`).

**Entry point defined** in `pyproject.toml:project.scripts`:

```toml
[project.scripts]
mail-parser = "mailparser.__main__:main"
```
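
Mutually exclusive input modes are typically expressed in `argparse` with a mutually exclusive group. The sketch below is an assumption about the general shape, not a copy of `__main__.py`: the `dest` names and the meaning given to each flag are hypothetical, and only the flag letters come from the description above.

```python
import argparse


def build_parser():
    """Build a small parser with one required, mutually exclusive input mode."""
    parser = argparse.ArgumentParser(prog="mail-parser-sketch")
    source = parser.add_mutually_exclusive_group(required=True)
    # Flag meanings below are assumptions made for this sketch.
    source.add_argument("-f", dest="file", help="path of the raw mail file")
    source.add_argument("-s", dest="string", help="raw mail passed as a string")
    source.add_argument("-k", dest="stdin", action="store_true",
                        help="read the raw mail from stdin")
    parser.add_argument("-j", dest="json", action="store_true",
                        help="print the parsed mail as JSON")
    return parser


args = build_parser().parse_args(["-f", "tests/mails/mail_test_11", "-j"])
print(args.file, args.json)
```

Passing two input flags at once (for example `-f` and `-s`) makes `argparse` print a usage error and exit, which is what enforces the exclusivity.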

## Code Style & Tooling

### Ruff Configuration

Ruff is the single linter/formatter (replaces black, isort, flake8):

```toml
[tool.ruff.lint]
select = ["E", "F", "I"]  # pycodestyle, pyflakes, isort
# "UP", "B", "SIM", "S", "PT" commented out in pyproject.toml
```

### Pytest Configuration

Key settings in `pyproject.toml:tool.pytest.ini_options`:

- `integration`: marks integration tests
- Coverage outputs: XML (for CI), HTML (for local), terminal
- JUnit XML for CI integration

## Common Pitfalls

1. **Don't instantiate `MailParser()` directly**: use factory functions from `__init__.py`
2. **Don't use `pip`**: always use `uv` or Makefile targets
3. **Don't ignore defects**: they're critical for security analysis
4. **Don't assume headers exist**: use `.get()` pattern or handle `None`
5. **Test against malformed emails**: `tests/mails/mail_malformed_*` files exist for this reason
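
For pitfall 4, the defensive pattern looks like this with the standard library `email.message.Message` API. This is an analogy only; mail-parser's own accessors may behave differently for missing headers, so check the behavior in `core.py` before relying on it.

```python
import email

message = email.message_from_string("Subject: Hello\n\nBody\n")

# Message.get returns None (or the supplied default) for a missing header.
reply_to = message.get("Reply-To")
priority = message.get("X-Priority", "normal")

if reply_to is None:
    print("no Reply-To header; priority:", priority)
```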

## Docker Development

Dockerfile uses **Python 3.10-slim-bookworm** with Outlook dependencies pre-installed. Container runs as non-root
`mailparser` user.

```bash
docker build -t mail-parser .
docker run mail-parser -f /path/to/email
```

## Key Reference Points

- **Property implementation**: `core.py:540-730` (all `@property` decorators)
- **Attachment extraction**: `core.py:355-475` (walks multipart, handles encoding)
- **Received parsing logic**: `utils.py:302-455` + `const.py:24-73` (regex patterns)
- **CLI implementation**: `__main__.py:30-347` (argparse + output formatting)
- **Exception hierarchy**: `exceptions.py:20-60` (5 exception types)

## Testing Strategy

When adding features:

1. Add test email to `tests/mails/` if demonstrating new case
2. Write tests in `tests/test_mail_parser.py` following existing patterns
3. Test both normal and `_raw`/`_json` property variants
4. Verify defect detection for security-relevant changes
5. Run `make check` before committing
