MetaSanitize - Design Document

Architecture Overview

┌─────────────────────────────┐   ┌─────────────────────────────┐
│       GUI Interface         │   │       CLI Interface         │
│ (PySide6 / Modern Card UI)  │   │  (Click-based commands)     │
└──────────────┬──────────────┘   └──────────────┬──────────────┘
               │                                 │
┌──────────────▼─────────────────────────────────▼──────────────┐
│                       Core Engine Layer                       │
│  • MetadataEngine: Orchestrates all operations                │
│  • BackupManager: Handles file backups and restoration        │
│  • BatchProcessor: Parallel processing for directories        │
└────────┬───────────────────────┬──────────────────────────────┘
         │                       │
┌────────▼────────┐     ┌────────▼─────────────────────────────┐
│  Format Handlers│     │    Privacy & Analysis Layer          │
│  • ImageHandler │     │  • PrivacyAnalyzer: Risk scoring     │
│  • PDFHandler   │     │  • DummyGenerator: Fake metadata     │
│  • AudioHandler │     │  • DiffGenerator: Compare changes    │
│  • VideoHandler │     └──────────────────────────────────────┘
│  • DocHandler   │
└─────────────────┘

Data Models (Pydantic)
• MetadataRecord: Complete file metadata
• MetadataField: Individual field with risk assessment
• PrivacyRisk: Risk analysis results
• ProcessingResult: Operation outcomes
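
A minimal Pydantic sketch of these models (field names beyond those listed above are illustrative, not the project's exact schema):

```python
from enum import Enum
from pydantic import BaseModel

class RiskLevel(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"

class MetadataField(BaseModel):
    name: str
    value: str
    category: str            # e.g. "location", "device", "technical"
    risk_score: float = 0.0  # 0-100, assigned by the analyzer

class MetadataRecord(BaseModel):
    file_path: str
    fields: list[MetadataField] = []

class PrivacyRisk(BaseModel):
    score: float
    level: RiskLevel
    findings: list[str] = []
```

Pydantic gives validation on construction plus JSON round-tripping for free, which is why it is preferred over plain dataclasses here.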

Component Responsibilities

1. Core Engine (core/engine.py)

Role: Central orchestrator for all metadata operations

Key Methods:

  • extract(file_path) → MetadataRecord
  • modify(file_path, fields_to_remove, fields_to_modify) → ProcessingResult
  • clean(file_path, profile) → ProcessingResult
  • restore(file_path) → ProcessingResult

Responsibilities:

  • Route files to appropriate handlers
  • Manage backup lifecycle
  • Coordinate between privacy analyzer and handlers
  • Error handling and result aggregation
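
The routing responsibility can be sketched as an extension-keyed registry (class internals here are hypothetical; only the `extract` signature and `supported_extensions` come from this document):

```python
from pathlib import Path

class MetadataEngine:
    def __init__(self, handlers):
        # Build a lookup table: ".jpg" -> ImageHandler, ".pdf" -> PDFHandler, ...
        self._registry = {}
        for handler in handlers:
            for ext in handler.supported_extensions:
                self._registry[ext.lower()] = handler

    def _handler_for(self, path: Path):
        handler = self._registry.get(path.suffix.lower())
        if handler is None:
            raise ValueError(f"Unsupported format: {path.suffix}")
        return handler

    def extract(self, file_path):
        # Route to the matching handler; analysis happens downstream
        path = Path(file_path)
        return self._handler_for(path).extract_metadata(path)
```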

2. GUI Application (gui/)

Role: Interactive user interface

Components:

  • MetaSanitizeApp (app.py): Application entry point, QStyleSheet theming.
  • MainWindow (main_window.py): Main card-based layout logic.
  • WorkerThread: Handles long-running tasks (scanning, cleaning) to keep UI responsive.

Design Pattern:

  • Card-Based UI: Uses custom Card widgets for modular layout.
  • Responsive Threading: All file operations run in background QThreads.

3. Format Handlers (handlers/)

Role: Format-specific metadata extraction/modification

Base Interface (BaseHandler):

- supported_extensions: List[str]
- extract_metadata(file_path) → MetadataRecord
- modify_metadata(file_path, fields, values) → int
- clean_metadata(file_path, profile) → int
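
A possible rendering of the base interface as a Python ABC (signatures follow the list above; docstrings and types are a sketch, not the project's exact API):

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import List

class BaseHandler(ABC):
    supported_extensions: List[str] = []

    @abstractmethod
    def extract_metadata(self, file_path: Path):
        """Return a MetadataRecord for the file."""

    @abstractmethod
    def modify_metadata(self, file_path: Path, fields, values) -> int:
        """Apply the requested changes; return the number of fields modified."""

    @abstractmethod
    def clean_metadata(self, file_path: Path, profile: str) -> int:
        """Remove fields according to the profile; return the number removed."""
```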

Implementations:

  • ImageHandler: EXIF/XMP/IPTC via Pillow + piexif
  • PDFHandler: PDF Info Dict + XMP via PyPDF2/pikepdf
  • AudioHandler: ID3v2/Vorbis via mutagen
  • VideoHandler: Container metadata via pymediainfo
  • DocumentHandler: Office XML properties via python-docx

Design Principle: Each handler is self-contained, testable in isolation

4. Privacy Analyzer (privacy/analyzer.py)

Role: Risk assessment and comparison

Key Features:

  • Weighted risk scoring by category (Location=1.0, Technical=0.2)
  • PII detection (names, GPS, device IDs)
  • Risk levels: CRITICAL (80-100), HIGH (60-79), MEDIUM (40-59), LOW (0-39)
  • Metadata diff generation (removed/modified/added fields)
  • Human-readable reports

Risk Calculation:

weighted_score = Σ(field.risk_score × category_weight) / Σ(category_weight)
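
In code, the weighted score and the level thresholds above might look like the following (only Location=1.0 and Technical=0.2 are given in this document; the other weights are assumptions):

```python
# Illustrative weights; only "location" and "technical" come from the design note.
CATEGORY_WEIGHTS = {"location": 1.0, "identity": 0.9, "device": 0.6, "technical": 0.2}

def weighted_risk_score(fields):
    """fields: iterable of (risk_score, category) pairs, scores in 0-100."""
    num = den = 0.0
    for risk_score, category in fields:
        weight = CATEGORY_WEIGHTS.get(category, 0.5)  # neutral default for unknowns
        num += risk_score * weight
        den += weight
    return num / den if den else 0.0

def risk_level(score):
    if score >= 80: return "CRITICAL"
    if score >= 60: return "HIGH"
    if score >= 40: return "MEDIUM"
    return "LOW"
```

A file with a GPS field (score 100) and one technical field (score 20) lands at (100·1.0 + 20·0.2)/1.2 ≈ 86.7, i.e. CRITICAL, which matches the intuition that location data should dominate the score.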

5. Dummy Metadata Generator (profiles/generator.py)

Role: Generate realistic but fake metadata

Consistency Rules:

  • GPS coordinates → Timezone matching
  • Camera Make → Compatible Lens Model
  • Timestamps: No future dates, chronologically valid
  • Locale: Language ↔ Region alignment

Profiles:

  • smartphone: iPhone/Android with phone-typical settings
  • dslr: Professional camera bodies with compatible lenses
  • office: MS Office/LibreOffice/Google Docs
  • scanner: Flatbed scanner metadata

6. Batch Processor (utils/batch.py)

Role: Efficient multi-file processing

Features:

  • Parallel execution (ThreadPoolExecutor, configurable workers)
  • Extension filtering
  • Recursive directory traversal
  • Progress callbacks
  • Aggregate reporting (success rate, total changes, errors)
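
The features above can be sketched as a single parallel loop with batch isolation and a progress hook (function and parameter names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

def process_batch(paths, worker_fn, max_workers=4, extensions=None, progress=None):
    """Run worker_fn over paths in parallel; one file's failure never stops the batch."""
    if extensions:
        paths = [p for p in paths if Path(p).suffix.lower() in extensions]
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker_fn, p): p for p in paths}
        for done, future in enumerate(as_completed(futures), 1):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # batch isolation: record and keep going
                errors[path] = exc
            if progress:
                progress(done, len(futures), path)
    return results, errors
```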

Data Flow Examples

Example 1: Scan Operation

User: scan photo.jpg
  ↓
CLI: Parse args, create Engine
  ↓
Engine.extract(photo.jpg)
  ↓
Registry → ImageHandler (based on .jpg)
  ↓
ImageHandler: Pillow.open() → getexif() → categorize fields
  ↓
MetadataRecord (50 fields)
  ↓
PrivacyAnalyzer.analyze(record)
  ↓
PrivacyRisk (score=85, CRITICAL, GPS detected)
  ↓
CLI: Display rich table + recommendations

Example 2: Clean Operation with Backup

User: clean photo.jpg --profile safe
  ↓
Engine.clean(photo.jpg, profile='safe')
  ↓
1. Extract original metadata (risk_before)
2. Create backup → .originals/photo_20260114_143022.jpg
3. ImageHandler.clean_metadata('safe')
   - Remove GPS, Author, Make, Model
   - Strip thumbnail
   - Keep technical (Orientation, ColorSpace)
4. Re-extract metadata (risk_after)
  ↓
ProcessingResult:
  - success: True
  - changes: 8
  - risk_reduction: 65 points
  - backup_path: .originals/...
  ↓
CLI: Display summary with green checkmark

Tech Stack Rationale

| Component   | Library          | Why?                                                    |
|-------------|------------------|---------------------------------------------------------|
| Images      | Pillow + piexif  | Universal support, pure Python, thumbnail handling      |
| PDF         | PyPDF2 + pikepdf | PyPDF2 for basic, pikepdf for advanced XMP stripping    |
| Audio       | mutagen          | Format-agnostic (ID3, Vorbis, MP4), battle-tested       |
| Video       | pymediainfo      | Wrapper for MediaInfo library, comprehensive            |
| Office      | python-docx      | Direct XML access to DOCX core properties               |
| CLI         | Click + Rich     | Click for args, Rich for beautiful tables/progress bars |
| Data Models | Pydantic         | Validation, JSON serialization, type safety             |
| Dummy Data  | Faker            | Realistic names/locations, locale support               |

Key Implementation Considerations

1. Metadata Preservation Logic

Problem: Some metadata is required for file validity

Solution: Three-tier profile system

  • complete: Strip everything (risk: may break some viewers)
  • safe: Remove PII, keep technical
  • minimal: Keep only essential (orientation, dimensions)
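
One way to encode the three tiers is a keep-list table consulted at clean time (the field sets are illustrative, drawn from the examples elsewhere in this document):

```python
# Illustrative profile table; the real field lists live in the format handlers.
PROFILES = {
    "complete": {"keep": set()},  # strip everything
    "safe":     {"keep": {"Orientation", "ColorSpace", "ImageWidth", "ImageLength"}},
    "minimal":  {"keep": {"Orientation", "ImageWidth", "ImageLength"}},
}

def fields_to_strip(all_fields, profile):
    """Everything not on the profile's keep-list gets removed."""
    keep = PROFILES[profile]["keep"]
    return [f for f in all_fields if f not in keep]
```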

2. Thumbnail Handling

Problem: EXIF thumbnails can leak unedited image data

Options:

  1. Strip entirely (safe, breaks some viewers)
  2. Regenerate from cleaned image (safer, compatible)
  3. Leave untouched (mark as high risk)

Implementation: Profile-based (safe=strip, minimal=regenerate)

3. GPS → Timezone Consistency

Problem: Metadata forensics can detect mismatches

Solution: Pre-mapped city coordinates with timezones

'New York': {lat: 40.7128, lon: -74.0060, tz: 'America/New_York'}

Generate timestamps using city's timezone
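
A sketch of consistent GPS + timestamp generation using the stdlib zoneinfo module (the second city entry and the jitter range are assumptions; only the New York entry comes from the note above):

```python
import random
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

# Pre-mapped city coordinates with timezones, as in the design note.
CITIES = {
    "New York": {"lat": 40.7128, "lon": -74.0060, "tz": "America/New_York"},
    "Berlin":   {"lat": 52.5200, "lon": 13.4050,  "tz": "Europe/Berlin"},
}

def dummy_capture(city_name, rng=random):
    """Generate GPS coordinates plus a timestamp local to the same city."""
    city = CITIES[city_name]
    # Jitter coordinates slightly so repeated files don't share an exact point
    lat = city["lat"] + rng.uniform(-0.01, 0.01)
    lon = city["lon"] + rng.uniform(-0.01, 0.01)
    # Past-only timestamp, expressed in the city's timezone (no future dates)
    taken = datetime.now(ZoneInfo(city["tz"])) - timedelta(days=rng.randint(1, 365))
    return {"lat": lat, "lon": lon, "timestamp": taken}
```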

4. Backup Strategy

Structure:

.originals/
  ├── photo_20260114_143022.jpg
  ├── document_20260114_143156.pdf
  └── ...

  • Timestamped filenames prevent collisions
  • Restore command finds most recent backup by pattern match
  • Optional: .metadata.json sidecar for restoration metadata
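
The timestamped backup and most-recent lookup might be sketched as (function names are illustrative):

```python
import shutil
from datetime import datetime
from pathlib import Path

BACKUP_DIR = ".originals"

def create_backup(file_path):
    """Copy the file into .originals/ under a timestamped, collision-free name."""
    src = Path(file_path)
    backup_dir = src.parent / BACKUP_DIR
    backup_dir.mkdir(exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    dest = backup_dir / f"{src.stem}_{stamp}{src.suffix}"
    shutil.copy2(src, dest)  # copy2 also preserves timestamps for restoration
    return dest

def find_latest_backup(file_path):
    """Most recent backup for a file, matched by stem pattern."""
    src = Path(file_path)
    candidates = sorted((src.parent / BACKUP_DIR).glob(f"{src.stem}_*{src.suffix}"))
    return candidates[-1] if candidates else None
```

Lexicographic sorting works here because the timestamp format is fixed-width, year first.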

5. Error Handling Philosophy

  • Graceful degradation: Partial metadata extraction better than total failure
  • Warnings vs Errors: Unsupported tags → warning, file not found → error
  • Atomic operations: Backup before modify, revert on failure
  • Batch isolation: One file failure doesn't stop batch processing
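
The atomic-operation rule (backup before modify, revert on failure) can be sketched as:

```python
import shutil
from pathlib import Path

def atomic_modify(file_path, modify_fn, backup_fn):
    """Back up first, run the modification, and restore the backup on any failure."""
    backup = backup_fn(file_path)
    try:
        return modify_fn(file_path)
    except Exception:
        shutil.copy2(backup, file_path)  # revert to the pristine copy
        raise                            # surface the original error to the caller
```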

Common Mistakes to Avoid

❌ Don't: Modify files in-place without backup

Why: Irreversible data loss if operation fails

Do: Always create backup first, then modify

❌ Don't: Use simple deletion for all cleaning

Why: May break file format, leave orphaned references

Do: Format-specific cleaning (e.g., PDF: rewrite streams, JPEG: re-encode EXIF)

❌ Don't: Generate completely random dummy metadata

Why: Forensically detectable (GPS in ocean, Feb 30, etc.)

Do: Use constrained randomness with validation

❌ Don't: Trust file extensions

Why: malicious.jpg.exe, renamed_video.jpg

Do: Validate file format via magic bytes or library detection
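
A minimal magic-byte check looks like this (the signature table is deliberately tiny and illustrative; production code should prefer a dedicated detector such as python-magic, or the parsing library's own format sniffing):

```python
from pathlib import Path

MAGIC_SIGNATURES = {
    b"\xff\xd8\xff":        "jpeg",
    b"\x89PNG\r\n\x1a\n":   "png",
    b"%PDF-":               "pdf",
}
EXTENSION_FORMATS = {".jpg": "jpeg", ".jpeg": "jpeg", ".png": "png", ".pdf": "pdf"}

def sniff_format(path):
    """Identify the real format from leading magic bytes, not the filename."""
    with open(path, "rb") as f:
        head = f.read(16)
    for signature, fmt in MAGIC_SIGNATURES.items():
        if head.startswith(signature):
            return fmt
    return None

def extension_matches(path):
    """True only when the claimed extension agrees with the sniffed format."""
    claimed = EXTENSION_FORMATS.get(Path(path).suffix.lower())
    return claimed is not None and claimed == sniff_format(path)
```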

❌ Don't: Hardcode field names

Why: Metadata standards evolve (EXIF 2.31, XMP 2.0)

Do: Pattern matching + namespace awareness

❌ Don't: Ignore nested metadata

Why: ZIP containers in DOCX, MP4 tracks, multi-page TIFFs

Do: Recursive extraction for container formats

❌ Don't: Assume UTF-8 everywhere

Why: Legacy EXIF uses ASCII, XMP uses UTF-8, ID3 has multiple encodings

Do: Handle encoding detection per format

Security Considerations

Threat Model

Adversary: Forensic analyst with metadata extraction tools

Goals:

  1. Remove PII leakage (names, locations, device IDs)
  2. Prevent device fingerprinting
  3. Avoid timestamp correlation attacks
  4. Strip edit history

Non-Goals:

  • Hiding evidence of criminal activity (unethical, illegal)
  • Defeating steganography detection
  • Bypassing digital signatures (will break file)

Sanitization Verification

Post-Clean Checks:

  1. Re-extract metadata → ensure sensitive fields gone
  2. Hash comparison → detect unintended changes
  3. File format validation → ensure still openable
  4. Thumbnail check → verify stripping/regeneration

Watermarking Ethics

Optional Feature: Mark files as sanitized

Pros: Transparency, provenance tracking

Cons: Reveals use of privacy tool

Implementation: XMP xmp:History entry or custom tag

Performance Optimization

Parallel Processing

BatchProcessor(max_workers=4)
# CPU-bound: workers = CPU count
# I/O-bound: workers = 2-4x CPU count

Caching

  • Reuse Pillow Image objects when modifying + re-extracting
  • Cache handler instances in registry

Memory Management

  • Stream large files (PDF pages, video chunks)
  • Truncate binary blobs in display (show first 1KB)

Progress Reporting

def progress_callback(current, total, file_path):
    print(f"[{current}/{total}] {file_path.name}")

Testing Strategy

Unit Tests

  • Handler-specific: Mock file I/O, test field extraction
  • Privacy analyzer: Test risk scoring edge cases
  • Dummy generator: Validate consistency rules

Integration Tests

  • End-to-end: scan → clean → verify
  • Format-specific: Real sample files (JPEG, PDF, MP3)

Edge Cases

  • Corrupted metadata (partial EXIF)
  • Empty files (0 bytes)
  • Extremely large metadata (1MB XMP block)
  • Unicode in tags (emoji in MP3 artist)
  • Nested containers (DOCX with embedded images)

Future Enhancements

  1. Web Interface: Flask/FastAPI + drag-drop UI
  2. Format Support: RAW images (CR2, NEF), more video codecs
  3. ML-Based Risk Detection: Train classifier on PII patterns
  4. Blockchain Provenance: Immutable sanitization audit trail
  5. Cloud Integration: S3/Azure Blob batch processing
  6. Steganography Detection: Warn about hidden data beyond metadata
  7. GDPR Compliance Mode: Auto-generate Article 17 deletion report

License Considerations

Dependencies:

  • Most libraries: MIT/BSD (permissive)
  • ExifTool: Perl Artistic/GPL (if using PyExifTool wrapper)

Recommendation:

  • Keep core MIT licensed
  • Document GPL dependency clearly
  • Offer MIT-only mode (without ExifTool integration)

Conclusion

MetaSanitize balances usability (simple CLI), security (thorough cleaning), and realism (forensically plausible dummy data). The modular architecture enables format-specific optimizations while maintaining a consistent interface.

Next Steps:

  1. Implement comprehensive test suite
  2. Create sample files library
  3. Benchmark performance on 1000+ file batches
  4. Document forensic validation methodology