┌─────────────────────────────┐ ┌─────────────────────────────┐
│        GUI Interface        │ │        CLI Interface        │
│ (PySide6 / Modern Card UI)  │ │   (Click-based commands)    │
└──────────────┬──────────────┘ └──────────────┬──────────────┘
               │                               │
┌──────────────▼───────────────────────────────▼──────────────┐
│                      Core Engine Layer                      │
│  • MetadataEngine: Orchestrates all operations              │
│  • BackupManager: Handles file backups and restoration      │
│  • BatchProcessor: Parallel processing for directories      │
└────────┬────────────────────────┬───────────────────────────┘
         │                        │
┌────────▼────────┐      ┌────────▼───────────────────────────┐
│ Format Handlers │      │ Privacy & Analysis Layer           │
│ • ImageHandler  │      │ • PrivacyAnalyzer: Risk scoring    │
│ • PDFHandler    │      │ • DummyGenerator: Fake metadata    │
│ • AudioHandler  │      │ • DiffGenerator: Compare changes   │
│ • VideoHandler  │      └────────────────────────────────────┘
│ • DocHandler    │
└─────────────────┘
Data Models (Pydantic)
• MetadataRecord: Complete file metadata
• MetadataField: Individual field with risk assessment
• PrivacyRisk: Risk analysis results
• ProcessingResult: Operation outcomes
Role: Central orchestrator for all metadata operations
Key Methods:
- extract(file_path) → MetadataRecord
- modify(file_path, fields_to_remove, fields_to_modify) → ProcessingResult
- clean(file_path, profile) → ProcessingResult
- restore(file_path) → ProcessingResult
Responsibilities:
- Route files to appropriate handlers
- Manage backup lifecycle
- Coordinate between privacy analyzer and handlers
- Error handling and result aggregation
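The routing responsibility can be illustrated with a small extension-based lookup. This is only a sketch: `HANDLER_REGISTRY` and `resolve_handler` are hypothetical names, the extension list is an assumption, and handler names are plain strings here rather than real handler instances.

```python
from pathlib import Path
from typing import Dict

# Hypothetical registry: lowercase extension -> handler name.
# The extension list is illustrative, not the project's actual mapping.
HANDLER_REGISTRY: Dict[str, str] = {
    ".jpg": "ImageHandler",
    ".jpeg": "ImageHandler",
    ".png": "ImageHandler",
    ".pdf": "PDFHandler",
    ".mp3": "AudioHandler",
    ".mp4": "VideoHandler",
    ".docx": "DocumentHandler",
}


def resolve_handler(file_path: str) -> str:
    """Return the handler name registered for the file's extension."""
    suffix = Path(file_path).suffix.lower()
    try:
        return HANDLER_REGISTRY[suffix]
    except KeyError:
        # Unknown extensions are reported as errors rather than guessed at.
        raise ValueError(f"No handler registered for {suffix!r}") from None
```

Lowercasing the suffix means `photo.JPG` and `photo.jpg` reach the same handler. An extension lookup alone says nothing about the real file content.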
Role: Interactive user interface
Components:
- MetaSanitizeApp (app.py): Application entry point, Qt Style Sheet (QSS) theming.
- MainWindow (main_window.py): Main card-based layout logic.
- WorkerThread: Handles long-running tasks (scanning, cleaning) to keep the UI responsive.
Design Pattern:
- Card-Based UI: Uses custom Card widgets for modular layout.
- Responsive Threading: All file operations run in background QThreads.
Role: Format-specific metadata extraction/modification
Base Interface (BaseHandler):
- supported_extensions: List[str]
- extract_metadata(file_path) → MetadataRecord
- modify_metadata(file_path, fields, values) → int
- clean_metadata(file_path, profile) → int
Implementations:
- ImageHandler: EXIF/XMP/IPTC via Pillow + piexif
- PDFHandler: PDF Info Dict + XMP via PyPDF2/pikepdf
- AudioHandler: ID3v2/Vorbis via mutagen
- VideoHandler: Container metadata via pymediainfo
- DocumentHandler: Office XML properties via python-docx
Design Principle: Each handler is self-contained, testable in isolation
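A minimal sketch of the base interface as an abstract class, with a test double that shows what "testable in isolation" could look like. Several things here are assumptions: plain dicts stand in for `MetadataRecord`, the `int` return values are read as "number of fields changed", and `supports` and `InMemoryHandler` are hypothetical additions.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Dict, List


class BaseHandler(ABC):
    """Interface sketch; plain dicts stand in for MetadataRecord."""

    supported_extensions: List[str] = []

    def supports(self, file_path: str) -> bool:
        # Hypothetical helper: case-insensitive extension check.
        return Path(file_path).suffix.lower() in self.supported_extensions

    @abstractmethod
    def extract_metadata(self, file_path: str) -> Dict[str, Any]: ...

    @abstractmethod
    def modify_metadata(self, file_path: str, fields: List[str],
                        values: Dict[str, Any]) -> int: ...

    @abstractmethod
    def clean_metadata(self, file_path: str, profile: str) -> int: ...


class InMemoryHandler(BaseHandler):
    """Hypothetical test double: metadata lives in a dict, not in files."""

    supported_extensions = [".fake"]

    def __init__(self, store: Dict[str, Dict[str, Any]]):
        self.store = store

    def extract_metadata(self, file_path):
        return dict(self.store.get(file_path, {}))

    def modify_metadata(self, file_path, fields, values):
        record = self.store.setdefault(file_path, {})
        changed = 0
        for name in fields:
            if name in values and record.get(name) != values[name]:
                record[name] = values[name]
                changed += 1
        return changed  # assumed meaning: number of fields changed

    def clean_metadata(self, file_path, profile):
        # The profile is ignored in this double; everything is removed.
        record = self.store.get(file_path, {})
        removed = len(record)
        record.clear()
        return removed
```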
Role: Risk assessment and comparison
Key Features:
- Weighted risk scoring by category (Location=1.0, Technical=0.2)
- PII detection (names, GPS, device IDs)
- Risk levels: CRITICAL (80-100), HIGH (60-79), MEDIUM (40-59), LOW (0-39)
- Metadata diff generation (removed/modified/added fields)
- Human-readable reports
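The category weights and risk-level bands can be combined in a weighted average over per-field scores. The sketch below assumes each field carries a 0-100 risk score. The default weight of 0.5 for unlisted categories is an assumption, as are the function names.

```python
from typing import Iterable, Tuple

# Location and Technical weights come from the feature list;
# the default for any other category is an assumption.
CATEGORY_WEIGHTS = {"location": 1.0, "technical": 0.2}
DEFAULT_WEIGHT = 0.5


def weighted_score(fields: Iterable[Tuple[str, float]]) -> float:
    """Weighted average of (category, risk_score) pairs, on a 0-100 scale."""
    numerator = 0.0
    denominator = 0.0
    for category, risk_score in fields:
        weight = CATEGORY_WEIGHTS.get(category, DEFAULT_WEIGHT)
        numerator += risk_score * weight
        denominator += weight
    # With no fields there is nothing to leak, so report zero risk.
    return numerator / denominator if denominator else 0.0


def risk_level(score: float) -> str:
    """Map a score to the documented bands."""
    if score >= 80:
        return "CRITICAL"
    if score >= 60:
        return "HIGH"
    if score >= 40:
        return "MEDIUM"
    return "LOW"
```

For example, one location field scored 100 and one technical field scored 10 give (100 × 1.0 + 10 × 0.2) / 1.2 = 85, which falls in the CRITICAL band.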
Risk Calculation:
weighted_score = Σ(field.risk_score × category_weight) / Σ(category_weight)
Role: Generate realistic but fake metadata
Consistency Rules:
- GPS coordinates → Timezone matching
- Camera Make → Compatible Lens Model
- Timestamps: No future dates, chronologically valid
- Locale: Language ↔ Region alignment
Profiles:
- smartphone: iPhone/Android with phone-typical settings
- dslr: Professional camera bodies with compatible lenses
- office: MS Office/LibreOffice/Google Docs
- scanner: Flatbed scanner metadata
Role: Efficient multi-file processing
Features:
- Parallel execution (ThreadPoolExecutor, configurable workers)
- Extension filtering
- Recursive directory traversal
- Progress callbacks
- Aggregate reporting (success rate, total changes, errors)
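A rough sketch of how parallel execution, progress callbacks, and per-file error isolation might fit together with `ThreadPoolExecutor`. The name `process_batch` and its return shape are assumptions, and `worker` stands in for whatever per-file operation the engine exposes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Dict, Iterable, Optional, Tuple


def process_batch(
    paths: Iterable[str],
    worker: Callable[[str], Any],
    max_workers: int = 4,
    progress: Optional[Callable[[int, int, str], None]] = None,
) -> Tuple[Dict[str, Any], Dict[str, Exception]]:
    """Run worker(path) for every path; collect results and errors separately."""
    paths = list(paths)
    results: Dict[str, Any] = {}
    errors: Dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, p): p for p in paths}
        for done, future in enumerate(as_completed(futures), start=1):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                # One failing file is recorded and the batch continues.
                errors[path] = exc
            if progress is not None:
                progress(done, len(paths), path)
    return results, errors
```

The progress callback runs in the calling thread, in completion order, so it can update a progress bar without extra locking. A success rate can be derived as `len(results) / (len(results) + len(errors))` when the batch is non-empty.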
User: scan photo.jpg
↓
CLI: Parse args, create Engine
↓
Engine.extract(photo.jpg)
↓
Registry → ImageHandler (based on .jpg)
↓
ImageHandler: Pillow.open() → getexif() → categorize fields
↓
MetadataRecord (50 fields)
↓
PrivacyAnalyzer.analyze(record)
↓
PrivacyRisk (score=85, CRITICAL, GPS detected)
↓
CLI: Display rich table + recommendations
User: clean photo.jpg --profile safe
↓
Engine.clean(photo.jpg, profile='safe')
↓
1. Extract original metadata (risk_before)
2. Create backup → .originals/photo_20260114_143022.jpg
3. ImageHandler.clean_metadata('safe')
- Remove GPS, Author, Make, Model
- Strip thumbnail
- Keep technical (Orientation, ColorSpace)
4. Re-extract metadata (risk_after)
↓
ProcessingResult:
- success: True
- changes: 8
- risk_reduction: 65 points
- backup_path: .originals/...
↓
CLI: Display summary with green checkmark
| Component | Library | Why? |
|---|---|---|
| Images | Pillow + piexif | Broad format support, piexif is pure Python, thumbnail handling |
| PDF | PyPDF2 + pikepdf | PyPDF2 for basic, pikepdf for advanced XMP stripping |
| Audio | mutagen | Format-agnostic (ID3, Vorbis, MP4), battle-tested |
| Video | pymediainfo | Wrapper for MediaInfo library, comprehensive |
| Office | python-docx | Direct XML access to DOCX core properties |
| CLI | Click + Rich | Click for args, Rich for beautiful tables/progress bars |
| Data Models | Pydantic | Validation, JSON serialization, type safety |
| Dummy Data | Faker | Realistic names/locations, locale support |
Problem: Some metadata is required for file validity
Solution: Three-tier profile system
- complete: Strip everything (risk: may break some viewers)
- safe: Remove PII, keep technical
- minimal: Keep only essential (orientation, dimensions)
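The three tiers can be sketched as filters over a metadata dictionary. The PII field names follow the fields named in the clean example (GPS, Author, Make, Model). The essential field names and the function name `apply_profile` are assumptions for illustration.

```python
from typing import Any, Dict

# Field names are illustrative; real tag names depend on the format.
PII_FIELDS = {"GPSInfo", "Author", "Make", "Model"}
ESSENTIAL_FIELDS = {"Orientation", "ImageWidth", "ImageLength"}  # assumed


def apply_profile(metadata: Dict[str, Any], profile: str) -> Dict[str, Any]:
    """Return the metadata that survives cleaning under the given profile."""
    if profile == "complete":
        return {}  # strip everything
    if profile == "safe":
        return {k: v for k, v in metadata.items() if k not in PII_FIELDS}
    if profile == "minimal":
        return {k: v for k, v in metadata.items() if k in ESSENTIAL_FIELDS}
    raise ValueError(f"Unknown profile: {profile!r}")
```

Note that `safe` is a deny-list while `minimal` is an allow-list, so a tag the tool has never seen survives `safe` but not `minimal`.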
Problem: EXIF thumbnails can leak unedited image data
Options:
- Strip entirely (safe, breaks some viewers)
- Regenerate from cleaned image (safer, compatible)
- Leave untouched (mark as high risk)
Implementation: Profile-based (safe=strip, minimal=regenerate)
Problem: Metadata forensics can detect mismatches
Solution: Pre-mapped city coordinates with timezones
'New York': {'lat': 40.7128, 'lon': -74.0060, 'tz': 'America/New_York'}
Generate timestamps using the city's timezone.
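One way this could look with the standard-library `zoneinfo` module is sketched below. The names `CITIES`, `fake_capture`, and `max_age_days` are hypothetical. The fallback to UTC when the tz database is missing is a convenience of this sketch and would itself weaken consistency in a real generator.

```python
import random
from datetime import datetime, timedelta, timezone, tzinfo
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

# Pre-mapped cities; only the New York entry comes from the design notes.
CITIES = {
    "New York": {"lat": 40.7128, "lon": -74.0060, "tz": "America/New_York"},
}


def _load_tz(name: str) -> tzinfo:
    try:
        return ZoneInfo(name)
    except ZoneInfoNotFoundError:
        # Sketch-only fallback for systems without a tz database.
        return timezone.utc


def fake_capture(city: str, rng: random.Random, max_age_days: int = 365) -> dict:
    """Pick a past timestamp and express it in the city's local time."""
    entry = CITIES[city]
    now = datetime.now(timezone.utc)
    # At least one minute in the past, so the result is never a future date.
    age = timedelta(seconds=rng.randint(60, max_age_days * 86400))
    local_time = (now - age).astimezone(_load_tz(entry["tz"]))
    return {"lat": entry["lat"], "lon": entry["lon"], "timestamp": local_time}
```

Passing in a `random.Random` instance keeps the generator reproducible in tests.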
Structure:
.originals/
├── photo_20260114_143022.jpg
├── document_20260114_143156.pdf
└── ...
- Timestamped filenames prevent collisions
- Restore command finds most recent backup by pattern match
- Optional: .metadata.json sidecar for restoration metadata
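The naming scheme and the "most recent backup by pattern match" lookup might be sketched as follows. The function names are hypothetical, and the lookup relies on the fact that `YYYYMMDD_HHMMSS` timestamps sort chronologically as strings.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional

BACKUP_DIR = ".originals"


def create_backup(file_path: Path, now: Optional[datetime] = None) -> Path:
    """Copy the file to .originals/<stem>_<YYYYMMDD_HHMMSS><suffix>."""
    now = now or datetime.now()
    backup_dir = file_path.parent / BACKUP_DIR
    backup_dir.mkdir(exist_ok=True)
    target = backup_dir / f"{file_path.stem}_{now:%Y%m%d_%H%M%S}{file_path.suffix}"
    shutil.copy2(file_path, target)  # copy2 also preserves file timestamps
    return target


def latest_backup(file_path: Path) -> Optional[Path]:
    """Return the newest backup for file_path, or None if there is none."""
    backup_dir = file_path.parent / BACKUP_DIR
    if not backup_dir.is_dir():
        return None
    # Timestamped names sort lexicographically in chronological order.
    candidates = sorted(backup_dir.glob(f"{file_path.stem}_*{file_path.suffix}"))
    return candidates[-1] if candidates else None
```

A plain glob is a simplification: `photo_*.jpg` would also match backups of a file named `photo_edit.jpg`, and stems containing glob characters would need escaping. A stricter version could match the timestamp with a regular expression.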
- Graceful degradation: Partial metadata extraction better than total failure
- Warnings vs Errors: Unsupported tags → warning, file not found → error
- Atomic operations: Backup before modify, revert on failure
- Batch isolation: One file failure doesn't stop batch processing
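The "backup before modify, revert on failure" rule can be shown with a small wrapper. This is an illustrative sketch: `modify_with_rollback` is a hypothetical name, and `modify` stands in for any handler operation that edits the file in place and returns a change count.

```python
import shutil
from pathlib import Path
from typing import Callable


def modify_with_rollback(
    file_path: Path,
    backup_path: Path,
    modify: Callable[[Path], int],
) -> int:
    """Back up the file, run modify(), and restore the backup if it raises."""
    shutil.copy2(file_path, backup_path)
    try:
        return modify(file_path)
    except Exception:
        # Put the original bytes back, then let the caller see the error.
        shutil.copy2(backup_path, file_path)
        raise
```

Re-raising after the restore lets a batch runner record the failure for this file and move on to the next one.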
Why: Irreversible data loss if operation fails
Do: Always create backup first, then modify
Why: May break file format, leave orphaned references
Do: Format-specific cleaning (e.g., PDF: rewrite streams, JPEG: re-encode EXIF)
Why: Forensically detectable (GPS in ocean, Feb 30, etc.)
Do: Use constrained randomness with validation
Why: malicious.jpg.exe, renamed_video.jpg
Do: Validate file format via magic bytes or library detection
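A small sketch of a magic-byte check that compares a file's leading bytes with what its extension claims. The signature table is deliberately short and the function names are hypothetical. A real implementation would likely lean on the format libraries' own detection as well.

```python
from pathlib import Path
from typing import Optional

# Well-known leading signatures; this list is illustrative, not exhaustive.
MAGIC_SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"%PDF-": "pdf",
    b"ID3": "mp3",           # only MP3 files that start with an ID3v2 tag
    b"PK\x03\x04": "zip",    # DOCX files are ZIP containers
}

EXT_TO_FORMAT = {
    ".jpg": "jpeg", ".jpeg": "jpeg", ".png": "png",
    ".pdf": "pdf", ".mp3": "mp3", ".docx": "zip",
}


def sniff_format(header: bytes) -> Optional[str]:
    """Return the format whose signature prefixes the header, if any."""
    for signature, fmt in MAGIC_SIGNATURES.items():
        if header.startswith(signature):
            return fmt
    return None


def extension_matches_content(path: Path) -> bool:
    """True only if the extension is known and agrees with the file's bytes."""
    expected = EXT_TO_FORMAT.get(path.suffix.lower())
    with path.open("rb") as fh:
        actual = sniff_format(fh.read(16))
    return expected is not None and expected == actual
```

With this check, a Windows executable renamed to `.jpg` is rejected because its first bytes are `MZ`, not a JPEG marker.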
Why: Metadata standards evolve (EXIF 2.31, XMP 2.0)
Do: Pattern matching + namespace awareness
Why: ZIP containers in DOCX, MP4 tracks, multi-page TIFFs
Do: Recursive extraction for container formats
Why: Legacy EXIF uses ASCII, XMP uses UTF-8, ID3 has multiple encodings
Do: Handle encoding detection per format
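As one concrete case, ID3v2 text frames start with a byte that names the encoding of the text that follows. The sketch below decodes such a frame body by hand to show the idea. In practice mutagen handles this internally, and `decode_id3_text` is a hypothetical helper.

```python
# ID3v2 text-encoding byte -> Python codec name.
# Values 2 and 3 were added in ID3v2.4.
ID3_ENCODINGS = {
    0: "latin-1",    # ISO-8859-1
    1: "utf-16",     # UTF-16 with byte-order mark
    2: "utf-16-be",  # UTF-16 big-endian, no BOM
    3: "utf-8",
}


def decode_id3_text(frame_body: bytes) -> str:
    """Decode a text frame body: one encoding byte followed by the text."""
    if not frame_body:
        return ""
    codec = ID3_ENCODINGS.get(frame_body[0])
    if codec is None:
        raise ValueError(f"Unknown ID3 text encoding byte: {frame_body[0]}")
    # Undecodable bytes become U+FFFD instead of aborting the extraction.
    text = frame_body[1:].decode(codec, errors="replace")
    return text.rstrip("\x00")  # drop optional null terminators
```

Using `errors="replace"` fits the graceful-degradation approach: a damaged tag yields a partly readable value instead of failing the whole file.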
Adversary: Forensic analyst with metadata extraction tools
Goals:
- Remove PII leakage (names, locations, device IDs)
- Prevent device fingerprinting
- Avoid timestamp correlation attacks
- Strip edit history
Non-Goals:
- Hiding evidence of criminal activity (unethical, illegal)
- Defeating steganography detection
- Bypassing digital signatures (will break file)
Post-Clean Checks:
- Re-extract metadata → ensure sensitive fields gone
- Hash comparison → detect unintended changes
- File format validation → ensure still openable
- Thumbnail check → verify stripping/regeneration
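The first check, re-extracting and confirming that sensitive fields are gone, can be expressed as a comparison of two metadata dictionaries. This sketch assumes metadata has already been extracted into plain dicts. The function names and the shape of the diff are assumptions.

```python
from typing import Any, Dict, Iterable, List


def diff_metadata(before: Dict[str, Any], after: Dict[str, Any]) -> Dict[str, List[str]]:
    """List field names that were removed, modified, or added by cleaning."""
    shared = before.keys() & after.keys()
    return {
        "removed": sorted(before.keys() - after.keys()),
        "modified": sorted(k for k in shared if before[k] != after[k]),
        "added": sorted(after.keys() - before.keys()),
    }


def remaining_sensitive(after: Dict[str, Any], sensitive: Iterable[str]) -> List[str]:
    """Return sensitive field names still present after cleaning."""
    return sorted(set(after) & set(sensitive))
```

An empty result from `remaining_sensitive` is the pass condition. A non-empty `added` list in the diff is worth flagging too, since cleaning should not normally introduce fields unless dummy metadata was requested.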
Optional Feature: Mark files as sanitized
Pros: Transparency, provenance tracking
Cons: Reveals use of privacy tool
Implementation: XMP xmp:History entry or custom tag
BatchProcessor(max_workers=4)
# CPU-bound: workers = CPU count
# I/O-bound: workers = 2-4x CPU count
- Reuse Pillow Image objects when modifying + re-extracting
- Cache handler instances in registry
- Stream large files (PDF pages, video chunks)
- Truncate binary blobs in display (show first 1KB)
def progress_callback(current, total, file_path):
    print(f"[{current}/{total}] {file_path.name}")
- Handler-specific: Mock file I/O, test field extraction
- Privacy analyzer: Test risk scoring edge cases
- Dummy generator: Validate consistency rules
- End-to-end: scan → clean → verify
- Format-specific: Real sample files (JPEG, PDF, MP3)
- Corrupted metadata (partial EXIF)
- Empty files (0 bytes)
- Extremely large metadata (1MB XMP block)
- Unicode in tags (emoji in MP3 artist)
- Nested containers (DOCX with embedded images)
- Web Interface: Flask/FastAPI + drag-drop UI
- Format Support: RAW images (CR2, NEF), more video codecs
- ML-Based Risk Detection: Train classifier on PII patterns
- Blockchain Provenance: Immutable sanitization audit trail
- Cloud Integration: S3/Azure Blob batch processing
- Steganography Detection: Warn about hidden data beyond metadata
- GDPR Compliance Mode: Auto-generate Article 17 deletion report
Dependencies:
- Most libraries: MIT/BSD (permissive)
- ExifTool: Perl Artistic/GPL (if using PyExifTool wrapper)
Recommendation:
- Keep core MIT licensed
- Document GPL dependency clearly
- Offer MIT-only mode (without ExifTool integration)
MetaSanitize balances usability (simple CLI), security (thorough cleaning), and realism (forensically plausible dummy data). The modular architecture enables format-specific optimizations while maintaining a consistent interface.
Next Steps:
- Implement comprehensive test suite
- Create sample files library
- Benchmark performance on 1000+ file batches
- Document forensic validation methodology