|
| 1 | +# Copilot Instructions for mail-parser |
| 2 | + |
| 3 | +mail-parser is a **production-grade email parsing library** for Python that transforms raw email messages into |
| 4 | +structured Python objects. Originally built as the foundation for [SpamScope](https://github.com/SpamScope/spamscope), |
| 5 | +it excels at security analysis, forensics, and RFC-compliant email processing. |
| 6 | + |
| 7 | +## Core Architecture |
| 8 | + |
| 9 | +### Factory-Based API Pattern |
| 10 | + |
| 11 | +**Always use factory functions** instead of direct `MailParser()` instantiation: |
| 12 | + |
| 13 | +```python |
| 14 | +import mailparser |
| 15 | +mail = mailparser.parse_from_file(filepath) # Standard email files |
| 16 | +mail = mailparser.parse_from_string(raw_email) # Email as string |
| 17 | +mail = mailparser.parse_from_bytes(email_bytes) # Email as bytes |
| 18 | +mail = mailparser.parse_from_file_msg(msg_file) # Outlook .msg files |
| 19 | +``` |
| 20 | + |
| 21 | +### Triple-Format Property Access |
| 22 | + |
| 23 | +Every parsed component offers **three access patterns** (`src/mailparser/core.py:550-570`): |
| 24 | + |
| 25 | +```python |
| 26 | +mail.subject # Python object (decoded string) |
| 27 | +mail.subject_raw # Raw header value (JSON list) |
| 28 | +mail.subject_json # JSON-serialized version |
| 29 | +``` |
| 30 | + |
| 31 | +This pattern applies to all properties via `__getattr__` magic in `core.py`. |
| 32 | + |
| 33 | +### Property Naming Convention |
| 34 | + |
| 35 | +Headers with hyphens use **underscore substitution** (`core.py:__getattr__`): |
| 36 | + |
| 37 | +```python |
| 38 | +mail.X_MSMail_Priority # Accesses "X-MSMail-Priority" header |
| 39 | +mail.Content_Type # Accesses "Content-Type" header |
| 40 | +``` |
| 41 | + |
| 42 | +## Development Workflows |
| 43 | + |
| 44 | +### Dependency Management with uv |
| 45 | + |
| 46 | +The project uses **[uv](https://github.com/astral-sh/uv)** (modern pip/virtualenv replacement) exclusively: |
| 47 | + |
| 48 | +```bash |
| 49 | +uv sync # Install all dev/test dependencies (defined in pyproject.toml) |
| 50 | +make install # Alias for uv sync |
| 51 | +``` |
| 52 | + |
| 53 | +Never use `pip` directly; all commands in the Makefile use the `uv run` prefix. |
| 54 | + |
| 55 | +### Testing Patterns |
| 56 | + |
| 57 | +```bash |
| 58 | +make test # pytest with coverage (generates coverage.xml, junit.xml, htmlcov/) |
| 59 | +make lint # ruff check . |
| 60 | +make format # ruff format . |
| 61 | +make check # lint + test |
| 62 | +make pre-commit # Run all pre-commit hooks |
| 63 | +``` |
| 64 | + |
| 65 | +When adding features or fixing bugs you MUST follow these steps: |
| 66 | + |
| 67 | +1. Add relevant test email to `tests/mails/` if demonstrating new case |
| 68 | +2. Write tests in the corresponding test file following existing patterns, under `tests/` |
| 69 | +3. Run `make test` to verify all tests pass before committing |
| 70 | +4. Run `uv run mail-parser -f tests/mails/mail_test_11 -j` to manually verify JSON output and that new changes |
| 71 | + work as expected |
| 72 | +5. Run `make pre-commit` to ensure code style compliance before pushing |
| 73 | + |
| 74 | +**Test data location**: `tests/mails/` contains malformed emails, Outlook files, and various encodings |
| 75 | +(`mail_test_1` through `mail_test_17`, `mail_malformed_1-3`, `mail_outlook_1`). |
| 76 | + |
| 77 | +**Critical testing rule**: When modifying parsing logic, test against malformed emails to ensure security defect |
| 78 | +detection still works. |
| 79 | + |
| 80 | +### Build & Release Process |
| 81 | + |
| 82 | +```bash |
| 83 | +make build # uv build → creates dist/*.tar.gz and dist/*.whl |
| 84 | +make release # build + twine upload to PyPI |
| 85 | +``` |
| 86 | + |
| 87 | +Version is **dynamically loaded** from `src/mailparser/version.py` (see |
| 88 | +`pyproject.toml:tool.hatch.version`). |
| 89 | + |
| 90 | +## Security-First Parsing |
| 91 | + |
| 92 | +### Defect Detection System |
| 93 | + |
| 94 | +The parser identifies RFC violations that could indicate malicious intent (`core.py:240-268`): |
| 95 | + |
| 96 | +```python |
| 97 | +mail.has_defects # Boolean flag |
| 98 | +mail.defects # List of defect dicts by content type |
| 99 | +mail.defects_categories # Set of defect class names (e.g., "StartBoundaryNotFoundDefect") |
| 100 | +``` |
| 101 | + |
| 102 | +**Epilogue defect handling** (`core.py:320-335`): When `EPILOGUE_DEFECTS` are detected, parser extracts hidden |
| 103 | +content between MIME boundaries that could contain malicious payloads. |
| 104 | + |
| 105 | +### IP Address Extraction |
| 106 | + |
| 107 | +`get_server_ipaddress(trust)` method (`core.py:487-528`) extracts sender IPs with **trust-level validation**: |
| 108 | + |
| 109 | +```python |
| 110 | +# Finds first non-private IP in trusted headers |
| 111 | +mail.get_server_ipaddress(trust="Received") |
| 112 | +``` |
| 113 | + |
| 114 | +The method filters out private IP ranges using Python's `ipaddress` module. |
| 115 | + |
| 116 | +### Received Header Parsing |
| 117 | + |
| 118 | +Complex regex-based parsing (`utils.py:302-360`, patterns in `const.py:24-73`) extracts hop-by-hop routing: |
| 119 | + |
| 120 | +```python |
| 121 | +# Returns list of dicts with: by, from, date, date_utc, delay, envelope_from, hop, with |
| 122 | +mail.received |
| 123 | +``` |
| 124 | + |
| 125 | +**Key pattern**: `RECEIVED_COMPILED_LIST` contains pre-compiled regexes for "from", "by", "with", "id", "for", |
| 126 | +"via", "envelope-from", "envelope-sender", and date patterns. Recent fixes addressed IBM gateway duplicate matches |
| 127 | +(see comments in `const.py:26-38`). |
| 128 | + |
| 129 | +If parsing fails, the parser falls back to `receiveds_not_parsed()`, which returns a |
| 130 | +`{"raw": <header>, "hop": <n>}` structure. |
| 131 | + |
| 132 | +## Project Structure Specifics |
| 133 | + |
| 134 | +### src/ Layout |
| 135 | + |
| 136 | +Package uses modern **src-layout** (`src/mailparser/`) for cleaner imports and testing isolation: |
| 137 | + |
| 138 | +```text |
| 139 | +src/mailparser/ |
| 140 | +├── __init__.py # Exports factory functions |
| 141 | +├── __main__.py # CLI entry point (mail-parser command) |
| 142 | +├── core.py # MailParser class (760 lines) |
| 143 | +├── utils.py # Parsing utilities (582 lines) |
| 144 | +├── const.py # Regex patterns and constants |
| 145 | +├── exceptions.py # Exception hierarchy |
| 146 | +└── version.py # Version string |
| 147 | +``` |
| 148 | + |
| 149 | +### External Dependency: Outlook Support |
| 150 | + |
| 151 | +Outlook `.msg` file parsing requires **system-level Perl module**: |
| 152 | + |
| 153 | +```bash |
| 154 | +apt-get install libemail-outlook-message-perl # Debian/Ubuntu |
| 155 | +``` |
| 156 | + |
| 157 | +Conversion is triggered via the `msgconvert()` function in `utils.py`, which shells out to a Perl script and raises |
| 158 | +`MailParserOutlookError` if the script is unavailable. |
| 159 | + |
| 160 | +### CLI Tool Pattern |
| 161 | + |
| 162 | +`__main__.py` provides production CLI with mutually exclusive input modes (`-f`, `-s`, `-k`), JSON output (`-j`), |
| 163 | +and selective printing (`-b`, `-a`, `-r`, `-t`). |
| 164 | + |
| 165 | +**Entry point defined** in `pyproject.toml:project.scripts`: |
| 166 | + |
| 167 | +```toml |
| 168 | +[project.scripts] |
| 169 | +mail-parser = "mailparser.__main__:main" |
| 170 | +``` |
| 171 | + |
| 172 | +## Code Style & Tooling |
| 173 | + |
| 174 | +### Ruff Configuration |
| 175 | + |
| 176 | +Single linter/formatter (replaces black, isort, flake8): |
| 177 | + |
| 178 | +```toml |
| 179 | +[tool.ruff.lint] |
| 180 | +select = ["E", "F", "I"] # pycodestyle, pyflakes, isort |
| 181 | +# "UP", "B", "SIM", "S", "PT" commented out in pyproject.toml |
| 182 | +``` |
| 183 | + |
| 184 | +### Pytest Configuration |
| 185 | + |
| 186 | +Key settings in `pyproject.toml:tool.pytest.ini_options`: |
| 187 | + |
| 188 | +- `integration`: marks integration tests |
| 189 | +- Coverage outputs: XML (for CI), HTML (for local), terminal |
| 190 | +- JUnit XML for CI integration |
| 191 | + |
| 192 | +## Common Pitfalls |
| 193 | + |
| 194 | +1. **Don't instantiate `MailParser()` directly**—use factory functions from `__init__.py` |
| 195 | +2. **Don't use `pip`**—always use `uv` or Makefile targets |
| 196 | +3. **Don't ignore defects**—they're critical for security analysis |
| 197 | +4. **Don't assume headers exist**—use `.get()` pattern or handle `None` |
| 198 | +5. **Test against malformed emails**—`tests/mails/mail_malformed_*` files exist for this reason |
| 199 | + |
| 200 | +## Docker Development |
| 201 | + |
| 202 | +Dockerfile uses **Python 3.10-slim-bookworm** with Outlook dependencies pre-installed. Container runs as non-root |
| 203 | +`mailparser` user. |
| 204 | + |
| 205 | +```bash |
| 206 | +docker build -t mail-parser . |
| 207 | +docker run mail-parser -f /path/to/email |
| 208 | +``` |
| 209 | + |
| 210 | +## Key Reference Points |
| 211 | + |
| 212 | +- **Property implementation**: `core.py:540-730` (all `@property` decorators) |
| 213 | +- **Attachment extraction**: `core.py:355-475` (walks multipart, handles encoding) |
| 214 | +- **Received parsing logic**: `utils.py:302-455` + `const.py:24-73` (regex patterns) |
| 215 | +- **CLI implementation**: `__main__.py:30-347` (argparse + output formatting) |
| 216 | +- **Exception hierarchy**: `exceptions.py:20-60` (5 exception types) |
| 217 | + |
| 218 | +## Testing Strategy |
| 219 | + |
| 220 | +When adding features: |
| 221 | + |
| 222 | +1. Add test email to `tests/mails/` if demonstrating new case |
| 223 | +2. Write tests in `tests/test_mail_parser.py` following existing patterns |
| 224 | +3. Test both normal and `_raw`/`_json` property variants |
| 225 | +4. Verify defect detection for security-relevant changes |
| 226 | +5. Run `make check` before committing |
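
The "Triple-Format Property Access" and "Property Naming Convention" sections describe attribute names that map onto header names through `__getattr__`. The following is a minimal, stdlib-only sketch of that idea, not mail-parser's actual implementation: `HeaderView` and `SAMPLE_EMAIL` are hypothetical names introduced here, and the exact shape of the `_raw` and `_json` values is an assumption based only on the comments in the instructions above.

```python
import json
from email import message_from_string

# Hypothetical sample message used only for this illustration.
SAMPLE_EMAIL = (
    "Subject: Hello\n"
    "X-MSMail-Priority: High\n"
    "Content-Type: text/plain\n"
    "\n"
    "body\n"
)


class HeaderView:
    """Illustrative stand-in; this is NOT mail-parser's MailParser class."""

    def __init__(self, raw_email: str) -> None:
        self._message = message_from_string(raw_email)

    def __getattr__(self, name: str):
        # __getattr__ only runs when normal attribute lookup fails.
        # Refusing private names avoids recursion if _message is not set yet.
        if name.startswith("_"):
            raise AttributeError(name)

        # Strip an optional "_raw" or "_json" suffix to recover the base name.
        suffix = ""
        for candidate in ("_raw", "_json"):
            if name.endswith(candidate):
                name, suffix = name[: -len(candidate)], candidate
                break

        # Underscores in the attribute name stand for hyphens in the header name.
        header_name = name.replace("_", "-")
        values = self._message.get_all(header_name, [])

        if suffix == "_raw":
            return json.dumps(values)  # assumed shape: JSON list of raw values
        first = values[0] if values else None
        if suffix == "_json":
            return json.dumps(first)  # assumed shape: JSON-encoded single value
        return first


view = HeaderView(SAMPLE_EMAIL)
print(view.subject, view.X_MSMail_Priority, view.subject_raw)
```

Because the lookup returns `None` for a missing header instead of raising, the sketch also shows why callers should not assume a header exists.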
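
The "IP Address Extraction" section says private ranges are filtered out with Python's `ipaddress` module. A rough sketch of that filtering step alone could look like the following. `first_public_ip` is a hypothetical helper, not part of mail-parser, and it assumes the candidate addresses have already been pulled out of the trusted header as strings.

```python
import ipaddress
from typing import Iterable, Optional


def first_public_ip(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate that is a valid, non-private IP address.

    Hypothetical helper for illustration; mail-parser's own logic may differ.
    """
    for candidate in candidates:
        try:
            address = ipaddress.ip_address(candidate.strip())
        except ValueError:
            # Not an IP address at all (for example a hostname); skip it.
            continue
        if not address.is_private:
            return str(address)
    return None


print(first_public_ip(["10.0.0.5", "not-an-ip", "192.168.1.1", "8.8.8.8"]))
```

Returning `None` when nothing qualifies keeps the "no public sender IP found" case explicit for the caller.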