Commit 1e250a5

ekscrypto and claude committed
Improve RFC compliance and add comprehensive unit tests
- Fix IPv6 zone identifier validation (RFC 5321 Section 4.1.3)
- Reject C1 control characters U+0080-U+009F (RFC 5198)
- Reject bidirectional formatting characters for security
- Fix supplementary Unicode plane support (emoji, etc.) via CharacterSet bug workaround
- Add 48 new unit tests covering edge cases, boundaries, and RFC compliance
- Add CHANGELOG.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 0c2487e commit 1e250a5

6 files changed

Lines changed: 635 additions & 7 deletions

CHANGELOG.md

Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Added

#### New Unit Tests (48 tests across 3 files)

**EmailSyntaxValidatorTests.swift**
- `testLocalPartExactly63Characters` - Boundary test for 63-character local part
- `testLocalPartExactlyOneCharacter` - Minimum valid local part
- `testLocalPartEmptyString` - Empty local part rejection
- `testUnicodeLocalPartCharacterVsByteCount` - 30 four-byte Unicode chars (120 bytes, 30 chars)
- `testUnicodeLocalPartExceeds64Characters` - 65+ Unicode character rejection
- `testEmojiInLocalPart` - Emoji validation in Unicode mode
- `testCombiningMarksInLocalPart` - Diacritics and combining characters
- `testHighUnicodeRanges` - Characters beyond BMP (U+1D400+)
- `testZeroWidthCharacters` - ZWSP, ZWJ, ZWNJ handling
- `testBidirectionalOverrideCharacters` - RTL/LTR control character rejection
- `testC1ControlCharactersRejected` - C1 control character rejection (U+0080-U+009F)
- `testRFC2047EncodedWithIPv4AddressLiteral` - RFC2047 with IPv4 literal
- `testRFC2047EncodedWithIPv6AddressLiteral` - RFC2047 with IPv6 literal
- `testQuotedStringWithMultipleAtSymbols` - Multiple @ in quoted strings
- `testQuotedStringWithRFC2047Decoding` - RFC2047 decoded quoted strings
- `testAutoEncodeToRfc2047WithAddressLiteral` - Combined options testing
- `testCustomDomainValidatorAcceptsAnyDomain` - Permissive validator
- `testCustomDomainValidatorRejectsAllDomains` - Restrictive validator
- `testCustomDomainValidatorWithSpecificTLDs` - TLD-specific validation
- `testCustomDomainValidatorReceivesCorrectDomain` - Domain parameter verification
- `testCustomDomainValidatorWithUnicodeDomain` - IDN domain handling
- `testMultipleDotsInVariousPositions` - Valid multi-dot local parts
- `testSingleCharactersBetweenDots` - Minimal segments between dots
- `testMaxConsecutiveSpecialCharacters` - Consecutive special characters
- `testSpecialCharactersAtBoundaries` - Special chars at start/end of segments
- `testExtremelyLongLocalPart` - 1000 character local part rejection
- `testExtremelyLongDomain` - 500+ character domain handling
- `testVeryLongRFC2047EncodedString` - Near 76-char limit RFC2047
- `testManyUnicodeCharactersInLocalPart` - 64 diverse Unicode characters

**RFC2047CoderTests.swift**
- `testDecodingUTF16B` - Base64 with UTF-16 charset
- `testDecodingUTF32B` - Base64 with UTF-32 charset
- `testDecodingUTF16InvalidData` - Malformed UTF-16 rejection
- `testDecodingUTF32InvalidData` - Malformed UTF-32 rejection
- `testEncodeDecodeRoundTripSimpleASCII` - ASCII round-trip
- `testEncodeDecodeRoundTripUnicode` - Unicode round-trip
- `testEncodeDecodeRoundTripSpecialCharacters` - Special character round-trip
- `testDecodingLatin2QPolishCharacters` - Polish special characters
- `testDecodingLatin2QCzechCharacters` - Czech special characters
- `testDecodingLatin2InvalidControlCharacter` - Invalid byte handling
- `testEncodeEmptyString` - Empty string encoding
- `testDecodeWithMixedCaseCharset` - Case-insensitive charset
- `testDecodeWithMixedCaseEncoding` - Case-insensitive encoding type
- `testDecodeWithWhitespaceInEncodedWord` - Whitespace handling

**IPAddressValidatorTests.swift**
- `testIPv6ZoneIdentifiers` - Zone identifier rejection per RFC 5321
- `testIPv6LoopbackVariants` - `::1` variations
- `testIPv4MappedIPv6Extended` - `::ffff:` mapped addresses
- `testIPv4LeadingZeros` - Leading zeros handling
- `testEmptyIPAddressStrings` - Empty/whitespace rejection

### Changed

- **EmailSyntaxValidator.swift**: Reordered CharacterSet construction to work around Foundation bug where `.subtracting()` corrupts supplementary Unicode plane data. Supplementary planes (U+10000-U+10FFFF) are now added last, after all subtractions.

### Fixed

#### RFC 5321 Compliance
- **IPAddressSyntaxValidator.swift**: IPv6 zone identifiers (e.g., `fe80::1%eth0`) are now correctly rejected. Per RFC 5321 Section 4.1.3, zone identifiers are not valid in email address literals.

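As a rough illustration of the zone identifier rule, the sketch below flags an IPv6 candidate that carries a zone suffix. The helper name `containsZoneIdentifier` is hypothetical and is not part of the library, which removes the zone pattern from its IPv6 regular expression instead.

```swift
/// Hypothetical helper (not part of SwiftEmailValidator): reports whether an
/// IPv6 candidate carries a zone identifier such as "%eth0".
func containsZoneIdentifier(_ candidate: String) -> Bool {
    // "%" does not appear anywhere in the IPv6 address literal grammar of
    // RFC 5321 Section 4.1.3, so its presence alone marks the literal as invalid.
    return candidate.contains("%")
}

let linkLocalWithZone = "fe80::1%eth0"
let loopback = "::1"

print(containsZoneIdentifier(linkLocalWithZone)) // true
print(containsZoneIdentifier(loopback))          // false
```

A pre-check like this only detects the zone suffix; the rest of the address still has to pass the syntax validation.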
#### RFC 5198 Compliance
- **EmailSyntaxValidator.swift**: C1 control characters (U+0080-U+009F) are now rejected in Unicode mode. Per RFC 5198 Section 2, these control characters should be avoided in network interchange.

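The C1 range can be checked directly on Unicode scalar values. The following is a minimal sketch, assuming a standalone helper named `containsC1Control` (hypothetical; the validator itself excludes the range from its `CharacterSet`).

```swift
/// Hypothetical helper: true when any scalar lies in the C1 control range U+0080-U+009F.
func containsC1Control(_ text: String) -> Bool {
    let c1Range: ClosedRange<UInt32> = 0x80...0x9F
    return text.unicodeScalars.contains { c1Range.contains($0.value) }
}

let withNextLine = "user\u{0085}name" // U+0085 NEXT LINE is a C1 control
let withNoBreakSpace = "a\u{00A0}b"   // U+00A0 is the first scalar after the C1 range
let accented = "jos\u{00E9}"          // U+00E9 is an ordinary Latin-1 letter

print(containsC1Control(withNextLine))     // true
print(containsC1Control(withNoBreakSpace)) // false
print(containsC1Control(accented))         // false
```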
#### RFC 6531 Compliance
- **EmailSyntaxValidator.swift**: Fixed supplementary Unicode plane support (U+10000-U+10FFFF). Emoji, mathematical symbols, and other characters beyond the Basic Multilingual Plane now correctly validate in Unicode mode.

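Characters beyond the Basic Multilingual Plane are single Unicode scalars that take four bytes in UTF-8, which is why character and byte counts diverge for such local parts. The sketch below only illustrates that arithmetic with the Swift standard library; how the validator measures length is not shown in this commit, and the variable names are illustrative.

```swift
// 30 copies of U+1F600, a scalar in the Supplementary Multilingual Plane.
let localPart = String(repeating: "\u{1F600}", count: 30)

let characterCount = localPart.count             // user-perceived characters
let scalarCount = localPart.unicodeScalars.count // Unicode scalars
let utf16Count = localPart.utf16.count           // UTF-16 code units (surrogate pairs)
let utf8Count = localPart.utf8.count             // UTF-8 bytes

print(characterCount, scalarCount, utf16Count, utf8Count) // 30 30 60 120

// Every scalar in the string lies above the BMP (U+FFFF).
let allSupplementary = localPart.unicodeScalars.allSatisfy { $0.value > 0xFFFF }
print(allSupplementary) // true
```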
#### Security Improvements
- **EmailSyntaxValidator.swift**: Bidirectional formatting characters are now rejected:
  - Left-to-Right Mark / Right-to-Left Mark (U+200E-U+200F)
  - Directional embeddings and overrides (U+202A-U+202E)
  - Directional isolates (U+2066-U+2069)
  - Deprecated format characters (U+206A-U+206F)

  These characters can reorder how text is displayed and can be exploited for visual spoofing of email addresses.

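The four ranges listed for the bidirectional and deprecated format characters can be expressed as a scalar-level check. This is a sketch under assumptions: `rejectedFormatRanges` and `containsDirectionalFormatting` are hypothetical names, and the validator itself subtracts these ranges from a `CharacterSet` rather than scanning the string.

```swift
/// Scalar ranges taken from the changelog entry (names are illustrative).
let rejectedFormatRanges: [ClosedRange<UInt32>] = [
    0x200E...0x200F, // LRM, RLM
    0x202A...0x202E, // LRE, RLE, PDF, LRO, RLO
    0x2066...0x2069, // LRI, RLI, FSI, PDI
    0x206A...0x206F  // deprecated format characters
]

/// Hypothetical helper: true when the text contains any scalar from the ranges above.
func containsDirectionalFormatting(_ text: String) -> Bool {
    return text.unicodeScalars.contains { scalar in
        rejectedFormatRanges.contains { $0.contains(scalar.value) }
    }
}

let overridden = "user\u{202E}moc.elpmaxe" // contains RIGHT-TO-LEFT OVERRIDE
let plain = "user"

print(containsDirectionalFormatting(overridden)) // true
print(containsDirectionalFormatting(plain))      // false
```

Zero-width joiners such as U+200D fall outside these ranges, so this check does not flag them.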
### Technical Notes

#### CharacterSet Bug Workaround
Foundation's `CharacterSet` has a bug where calling `.subtracting()` on a set that includes supplementary Unicode planes (U+10000+) corrupts the supplementary plane data, even when the subtracted characters don't overlap. The workaround is to add supplementary planes as the final `.union()` call, after all `.subtracting()` operations are complete.

```swift
// WRONG - supplementary planes get corrupted by subsequent subtractions
let charset = baseSet
    .union(supplementaryPlanes)  // Added here...
    .subtracting(c1Controls)     // ...corrupted here

// CORRECT - add supplementary planes last
let charset = baseSet
    .subtracting(c1Controls)     // All subtractions first
    .union(supplementaryPlanes)  // Add supplementary planes last
```
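
A self-contained way to check the recommended ordering is sketched below. It mirrors the ranges used in the commit, but the top-level constant names are stand-ins for the validator's private members, and the membership results are what the workaround is expected to produce rather than output captured from the project.

```swift
import Foundation

// Stand-ins for the validator's private sets (names are assumptions).
let asciiRange: ClosedRange<Unicode.Scalar> = Unicode.Scalar(0x00)!...Unicode.Scalar(0x7F)!
let c1Controls = CharacterSet(charactersIn: Unicode.Scalar(0x80)!...Unicode.Scalar(0x9F)!)
let supplementaryPlanes = CharacterSet(charactersIn: Unicode.Scalar(0x10000)!...Unicode.Scalar(0x10FFFF)!)

// All subtractions first, supplementary planes added by the final union.
let allowedNonASCII = CharacterSet(charactersIn: asciiRange).inverted
    .subtracting(c1Controls)
    .union(supplementaryPlanes)

let emoji: Unicode.Scalar = "\u{1F600}"    // supplementary plane scalar
let nextLine: Unicode.Scalar = "\u{0085}"  // C1 control

print(allowedNonASCII.contains(emoji))     // expected: true
print(allowedNonASCII.contains(nextLine))  // expected: false
```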

Sources/SwiftEmailValidator/EmailSyntaxValidator.swift

Lines changed: 40 additions & 2 deletions
@@ -7,10 +7,12 @@
 //
 // References:
 // * RFC2047 https://datatracker.ietf.org/doc/html/rfc2047
+// * RFC5198 https://datatracker.ietf.org/doc/html/rfc5198 (Unicode Format for Network Interchange)
 // * RFC5321 https://datatracker.ietf.org/doc/html/rfc5321 Section 4.1.2 & Section 4.1.3
 // * RFC5322 https://datatracker.ietf.org/doc/html/rfc5322 Section 3.2.3 & Section 3.4.1
 // * RFC5234 https://datatracker.ietf.org/doc/html/rfc5234 Appendix B.1
 // * RFC6531 https://datatracker.ietf.org/doc/html/rfc6531
+// * RFC6532 https://datatracker.ietf.org/doc/html/rfc6532
 
 import Foundation
 import SwiftPublicSuffixList
@@ -194,17 +196,53 @@ public final class EmailSyntaxValidator {
         .union(CharacterSet(charactersIn: digitRange))
         .union(CharacterSet(charactersIn: #"!#$%&'*+-/=?^_`{|}~"#)) // Ref RFC5322 section 3.2.3 Atom, definition of atext
     private static let asciiRange: ClosedRange<Unicode.Scalar> = Unicode.Scalar(0x00)!...Unicode.Scalar(0x7F)!
+
+    // RFC6531 extends atext to include UTF8-non-ascii (U+0080+)
+    // RFC5198 Section 2: Control characters (U+0000-U+001F, U+007F-U+009F) should be avoided
+    // We also exclude other problematic characters per security best practices:
+    // - Bidirectional formatting characters (U+200E-U+200F, U+202A-U+202E, U+2066-U+2069)
+    // - Deprecated format characters (U+206A-U+206F)
+    private static let c1ControlRange: ClosedRange<Unicode.Scalar> = Unicode.Scalar(0x80)!...Unicode.Scalar(0x9F)! // C1 control chars
+    private static let bidiFormattingChars: CharacterSet = CharacterSet(charactersIn: Unicode.Scalar(0x200E)!...Unicode.Scalar(0x200F)!) // LRM, RLM
+        .union(CharacterSet(charactersIn: Unicode.Scalar(0x202A)!...Unicode.Scalar(0x202E)!)) // LRE, RLE, PDF, LRO, RLO
+        .union(CharacterSet(charactersIn: Unicode.Scalar(0x2066)!...Unicode.Scalar(0x2069)!)) // LRI, RLI, FSI, PDI
+    private static let deprecatedFormatChars: CharacterSet = CharacterSet(charactersIn: Unicode.Scalar(0x206A)!...Unicode.Scalar(0x206F)!) // Deprecated formatting
+
+    // Note: CharacterSet.inverted doesn't properly include supplementary planes (U+10000+)
+    // We must explicitly include them. Unicode planes:
+    // - BMP (U+0000-U+FFFF) - included via asciiRange.inverted
+    // - SMP (U+10000-U+1FFFF) - Supplementary Multilingual Plane (emoji, historic scripts)
+    // - SIP (U+20000-U+2FFFF) - Supplementary Ideographic Plane (CJK)
+    // - TIP (U+30000-U+3FFFF) - Tertiary Ideographic Plane
+    // - Planes 4-13 (U+40000-U+DFFFF) - Unassigned
+    // - SSP (U+E0000-U+EFFFF) - Supplementary Special-purpose Plane
+    // - PUA (U+F0000-U+10FFFF) - Private Use Areas
+    private static let supplementaryPlanes: CharacterSet = CharacterSet(charactersIn: Unicode.Scalar(0x10000)!...Unicode.Scalar(0x10FFFF)!)
+
+    // Note: CharacterSet has a bug where .subtracting() corrupts supplementary plane data
+    // We must add supplementaryPlanes LAST, after all subtractions are complete
     private static let atextUnicodeCharacterSet: CharacterSet = atextCharacterSet
-        .union(CharacterSet(charactersIn: asciiRange).inverted)
+        .union(CharacterSet(charactersIn: asciiRange).inverted) // BMP non-ASCII
+        .subtracting(CharacterSet(charactersIn: c1ControlRange)) // Exclude C1 control characters per RFC5198
+        .subtracting(bidiFormattingChars) // Exclude bidirectional formatting (security)
+        .subtracting(deprecatedFormatChars) // Exclude deprecated format characters
+        .union(supplementaryPlanes) // Supplementary planes (emoji, etc.) - MUST BE LAST (after subtractions)
+
     private static let quotedPairSMTP: ClosedRange<Unicode.Scalar> = Unicode.Scalar(0x20)!...Unicode.Scalar(0x7E)!
     private static let qtextSMTP1: ClosedRange<Unicode.Scalar> = Unicode.Scalar(0x20)!...Unicode.Scalar(0x21)!
     private static let qtextSMTP2: ClosedRange<Unicode.Scalar> = Unicode.Scalar(0x23)!...Unicode.Scalar(0x5B)!
     private static let qtextSMTP3: ClosedRange<Unicode.Scalar> = Unicode.Scalar(0x5D)!...Unicode.Scalar(0x7E)!
     private static let qtextSMTPCharacterSet: CharacterSet = CharacterSet(charactersIn: qtextSMTP1)
         .union(CharacterSet(charactersIn: qtextSMTP2))
         .union(CharacterSet(charactersIn: qtextSMTP3))
+    // Note: CharacterSet has a bug where .subtracting() corrupts supplementary plane data
+    // We must add supplementaryPlanes LAST, after all subtractions are complete
     private static let qtextUnicodeSMTPCharacterSet = qtextSMTPCharacterSet
-        .union(CharacterSet(charactersIn: asciiRange).inverted)
+        .union(CharacterSet(charactersIn: asciiRange).inverted) // BMP non-ASCII
+        .subtracting(CharacterSet(charactersIn: c1ControlRange)) // Exclude C1 control characters per RFC5198
+        .subtracting(bidiFormattingChars) // Exclude bidirectional formatting (security)
+        .subtracting(deprecatedFormatChars) // Exclude deprecated format characters
+        .union(supplementaryPlanes) // Supplementary planes (emoji, etc.) - MUST BE LAST (after subtractions)
 
     private static func extractDotAtom(_ candidate: String, compatibility: Compatibility) -> String? {
         guard !candidate.hasPrefix("\""),

Sources/SwiftEmailValidator/IPAddressSyntaxValidator.swift

Lines changed: 7 additions & 4 deletions
@@ -25,12 +25,15 @@ final public class IPAddressSyntaxValidator {
         return candidate.range(of: v4regex, options: .regularExpression) != nil
     }
 
-    /// Validates that the candidate string respects the IPv6 syntax
+    /// Validates that the candidate string respects the IPv6 syntax per RFC 5321
     /// - Parameter candidate: String to validate
-    /// - Returns: true if syntax eems valid, false otherwise
+    /// - Returns: true if syntax seems valid, false otherwise
+    /// - Note: Zone identifiers (e.g., %eth0) are NOT allowed per RFC 5321 for email addresses
     static func matchIPv6(_ candidate: String) -> Bool {
-        // Source: https://gist.github.com/syzdek/6086792
-        let v6regex = #"^(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))$"#
+        // Based on: https://gist.github.com/syzdek/6086792
+        // Modified: Removed zone identifier pattern (fe80:...%...) as zone IDs are not valid
+        // in email address literals per RFC 5321 Section 4.1.3
+        let v6regex = #"^(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))$"#
         return candidate.range(of: v6regex, options: .regularExpression) != nil
     }
 }
