MetaSanitize - Design Document

Architecture Overview

┌─────────────────────────────┐   ┌─────────────────────────────┐
│       GUI Interface         │   │       CLI Interface         │
│ (PySide6 / Modern Card UI)  │   │  (Click-based commands)     │
└──────────────┬──────────────┘   └──────────────┬──────────────┘
               │                                 │
┌──────────────▼─────────────────────────────────▼──────────────┐
│                       Core Engine Layer                       │
│  • MetadataEngine: Orchestrates all operations                │
│  • BackupManager: Handles file backups and restoration        │
│  • BatchProcessor: Parallel processing for directories        │
└────────┬───────────────────────┬──────────────────────────────┘
         │                       │
┌────────▼────────┐     ┌────────▼─────────────────────────────┐
│  Format Handlers│     │    Privacy & Analysis Layer          │
│  • ImageHandler │     │  • PrivacyAnalyzer: Risk scoring     │
│  • PDFHandler   │     │  • DummyGenerator: Fake metadata     │
│  • AudioHandler │     │  • DiffGenerator: Compare changes    │
│  • VideoHandler │     └──────────────────────────────────────┘
│  • DocHandler   │
└─────────────────┘

Data Models (Pydantic)
• MetadataRecord: Complete file metadata
• MetadataField: Individual field with risk assessment
• PrivacyRisk: Risk analysis results
• ProcessingResult: Operation outcomes
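
A minimal Pydantic sketch of these models (field names beyond those listed above are illustrative, not the project's exact schema):

```python
from enum import Enum
from pydantic import BaseModel

class RiskLevel(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"

class MetadataField(BaseModel):
    name: str
    value: str
    category: str            # e.g. "location", "device", "technical"
    risk_score: float = 0.0  # 0-100, assigned by the analyzer

class MetadataRecord(BaseModel):
    file_path: str
    fields: list[MetadataField] = []

class PrivacyRisk(BaseModel):
    score: float
    level: RiskLevel
    findings: list[str] = []
```

Pydantic gives validation on construction plus JSON round-tripping for free, which is why it is preferred over plain dataclasses here.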

Component Responsibilities

1. Core Engine (core/engine.py)

Role: Central orchestrator for all metadata operations

Key Methods:

  • extract(file_path) → MetadataRecord
  • modify(file_path, fields_to_remove, fields_to_modify) → ProcessingResult
  • clean(file_path, profile) → ProcessingResult
  • restore(file_path) → ProcessingResult

Responsibilities:

  • Route files to appropriate handlers
  • Manage backup lifecycle
  • Coordinate between privacy analyzer and handlers
  • Error handling and result aggregation
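
The routing responsibility can be sketched as an extension-keyed registry (class internals here are hypothetical; only the `extract` signature and `supported_extensions` come from this document):

```python
from pathlib import Path

class MetadataEngine:
    def __init__(self, handlers):
        # Build a lookup table: ".jpg" -> ImageHandler, ".pdf" -> PDFHandler, ...
        self._registry = {}
        for handler in handlers:
            for ext in handler.supported_extensions:
                self._registry[ext.lower()] = handler

    def _handler_for(self, path: Path):
        handler = self._registry.get(path.suffix.lower())
        if handler is None:
            raise ValueError(f"Unsupported format: {path.suffix}")
        return handler

    def extract(self, file_path):
        # Route to the matching handler; analysis happens downstream
        path = Path(file_path)
        return self._handler_for(path).extract_metadata(path)
```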

2. GUI Application (gui/)

Role: Interactive user interface

Components:

  • MetaSanitizeApp (app.py): Application entry point, QStyleSheet theming.
  • MainWindow (main_window.py): Main card-based layout logic.
  • WorkerThread: Handles long-running tasks (scanning, cleaning) to keep UI responsive.

Design Pattern:

  • Card-Based UI: Uses custom Card widgets for modular layout.
  • Responsive Threading: All file operations run in background QThreads.

3. Format Handlers (handlers/)

Role: Format-specific metadata extraction/modification

Base Interface (BaseHandler):

- supported_extensions: List[str]
- extract_metadata(file_path) → MetadataRecord
- modify_metadata(file_path, fields, values) → int
- clean_metadata(file_path, profile) → int
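
A possible rendering of the base interface as a Python ABC (signatures follow the list above; docstrings and types are a sketch, not the project's exact API):

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import List

class BaseHandler(ABC):
    supported_extensions: List[str] = []

    @abstractmethod
    def extract_metadata(self, file_path: Path):
        """Return a MetadataRecord for the file."""

    @abstractmethod
    def modify_metadata(self, file_path: Path, fields, values) -> int:
        """Apply the requested changes; return the number of fields modified."""

    @abstractmethod
    def clean_metadata(self, file_path: Path, profile: str) -> int:
        """Remove fields according to the profile; return the number removed."""
```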

Implementations:

  • ImageHandler: EXIF/XMP/IPTC via Pillow + piexif
  • PDFHandler: PDF Info Dict + XMP via PyPDF2/pikepdf
  • AudioHandler: ID3v2/Vorbis via mutagen
  • VideoHandler: Container metadata via pymediainfo
  • DocumentHandler: Office XML properties via python-docx

Design Principle: Each handler is self-contained, testable in isolation

4. Privacy Analyzer (privacy/analyzer.py)

Role: Risk assessment and comparison

Key Features:

  • Weighted risk scoring by category (Location=1.0, Technical=0.2)
  • PII detection (names, GPS, device IDs)
  • Risk levels: CRITICAL (80-100), HIGH (60-79), MEDIUM (40-59), LOW (0-39)
  • Metadata diff generation (removed/modified/added fields)
  • Human-readable reports

Risk Calculation:

weighted_score = Σ(field.risk_score × category_weight) / Σ(category_weight)
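
In code, the weighted score and the level thresholds above might look like the following (only Location=1.0 and Technical=0.2 are given in this document; the other weights are assumptions):

```python
# Illustrative weights; only "location" and "technical" come from the design note.
CATEGORY_WEIGHTS = {"location": 1.0, "identity": 0.9, "device": 0.6, "technical": 0.2}

def weighted_risk_score(fields):
    """fields: iterable of (risk_score, category) pairs, scores in 0-100."""
    num = den = 0.0
    for risk_score, category in fields:
        weight = CATEGORY_WEIGHTS.get(category, 0.5)  # neutral default for unknowns
        num += risk_score * weight
        den += weight
    return num / den if den else 0.0

def risk_level(score):
    if score >= 80: return "CRITICAL"
    if score >= 60: return "HIGH"
    if score >= 40: return "MEDIUM"
    return "LOW"
```

A file with a GPS field (score 100) and one technical field (score 20) lands at (100·1.0 + 20·0.2)/1.2 ≈ 86.7, i.e. CRITICAL, which matches the intuition that location data should dominate the score.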

5. Dummy Metadata Generator (profiles/generator.py)

Role: Generate realistic but fake metadata

Consistency Rules:

  • GPS coordinates → Timezone matching
  • Camera Make → Compatible Lens Model
  • Timestamps: No future dates, chronologically valid
  • Locale: Language ↔ Region alignment

Profiles:

  • smartphone: iPhone/Android with phone-typical settings
  • dslr: Professional camera bodies with compatible lenses
  • office: MS Office/LibreOffice/Google Docs
  • scanner: Flatbed scanner metadata

6. Batch Processor (utils/batch.py)

Role: Efficient multi-file processing

Features:

  • Parallel execution (ThreadPoolExecutor, configurable workers)
  • Extension filtering
  • Recursive directory traversal
  • Progress callbacks
  • Aggregate reporting (success rate, total changes, errors)
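
The features above can be sketched as a single parallel loop with batch isolation and a progress hook (function and parameter names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

def process_batch(paths, worker_fn, max_workers=4, extensions=None, progress=None):
    """Run worker_fn over paths in parallel; one file's failure never stops the batch."""
    if extensions:
        paths = [p for p in paths if Path(p).suffix.lower() in extensions]
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker_fn, p): p for p in paths}
        for done, future in enumerate(as_completed(futures), 1):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # batch isolation: record and keep going
                errors[path] = exc
            if progress:
                progress(done, len(futures), path)
    return results, errors
```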

Data Flow Examples

Example 1: Scan Operation

User: scan photo.jpg
  ↓
CLI: Parse args, create Engine
  ↓
Engine.extract(photo.jpg)
  ↓
Registry → ImageHandler (based on .jpg)
  ↓
ImageHandler: Pillow.open() → getexif() → categorize fields
  ↓
MetadataRecord (50 fields)
  ↓
PrivacyAnalyzer.analyze(record)
  ↓
PrivacyRisk (score=85, CRITICAL, GPS detected)
  ↓
CLI: Display rich table + recommendations

Example 2: Clean Operation with Backup

User: clean photo.jpg --profile safe
  ↓
Engine.clean(photo.jpg, profile='safe')
  ↓
1. Extract original metadata (risk_before)
2. Create backup → .originals/photo_20260114_143022.jpg
3. ImageHandler.clean_metadata('safe')
   - Remove GPS, Author, Make, Model
   - Strip thumbnail
   - Keep technical (Orientation, ColorSpace)
4. Re-extract metadata (risk_after)
  ↓
ProcessingResult:
  - success: True
  - changes: 8
  - risk_reduction: 65 points
  - backup_path: .originals/...
  ↓
CLI: Display summary with green checkmark

Tech Stack Rationale

| Component   | Library          | Why?                                                    |
|-------------|------------------|---------------------------------------------------------|
| Images      | Pillow + piexif  | Universal support, pure Python, thumbnail handling      |
| PDF         | PyPDF2 + pikepdf | PyPDF2 for basic, pikepdf for advanced XMP stripping    |
| Audio       | mutagen          | Format-agnostic (ID3, Vorbis, MP4), battle-tested       |
| Video       | pymediainfo      | Wrapper for MediaInfo library, comprehensive            |
| Office      | python-docx      | Direct XML access to DOCX core properties               |
| CLI         | Click + Rich     | Click for args, Rich for beautiful tables/progress bars |
| Data Models | Pydantic         | Validation, JSON serialization, type safety             |
| Dummy Data  | Faker            | Realistic names/locations, locale support               |

Key Implementation Considerations

1. Metadata Preservation Logic

Problem: Some metadata is required for file validity

Solution: Three-tier profile system

  • complete: Strip everything (risk: may break some viewers)
  • safe: Remove PII, keep technical
  • minimal: Keep only essential (orientation, dimensions)
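
One way to encode the three tiers is a keep-list table consulted at clean time (the field sets are illustrative, drawn from the examples elsewhere in this document):

```python
# Illustrative profile table; the real field lists live in the format handlers.
PROFILES = {
    "complete": {"keep": set()},  # strip everything
    "safe":     {"keep": {"Orientation", "ColorSpace", "ImageWidth", "ImageLength"}},
    "minimal":  {"keep": {"Orientation", "ImageWidth", "ImageLength"}},
}

def fields_to_strip(all_fields, profile):
    """Everything not on the profile's keep-list gets removed."""
    keep = PROFILES[profile]["keep"]
    return [f for f in all_fields if f not in keep]
```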

2. Thumbnail Handling

Problem: EXIF thumbnails can leak unedited image data

Options:

  1. Strip entirely (safe, breaks some viewers)
  2. Regenerate from cleaned image (safer, compatible)
  3. Leave untouched (mark as high risk)

Implementation: Profile-based (safe=strip, minimal=regenerate)

3. GPS → Timezone Consistency

Problem: Metadata forensics can detect mismatches

Solution: Pre-mapped city coordinates with timezones

'New York': {lat: 40.7128, lon: -74.0060, tz: 'America/New_York'}

Generate timestamps using city's timezone
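
A sketch of consistent GPS + timestamp generation using the stdlib zoneinfo module (the second city entry and the jitter range are assumptions; only the New York entry comes from the note above):

```python
import random
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

# Pre-mapped city coordinates with timezones, as in the design note.
CITIES = {
    "New York": {"lat": 40.7128, "lon": -74.0060, "tz": "America/New_York"},
    "Berlin":   {"lat": 52.5200, "lon": 13.4050,  "tz": "Europe/Berlin"},
}

def dummy_capture(city_name, rng=random):
    """Generate GPS coordinates plus a timestamp local to the same city."""
    city = CITIES[city_name]
    # Jitter coordinates slightly so repeated files don't share an exact point
    lat = city["lat"] + rng.uniform(-0.01, 0.01)
    lon = city["lon"] + rng.uniform(-0.01, 0.01)
    # Past-only timestamp, expressed in the city's timezone (no future dates)
    taken = datetime.now(ZoneInfo(city["tz"])) - timedelta(days=rng.randint(1, 365))
    return {"lat": lat, "lon": lon, "timestamp": taken}
```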

4. Backup Strategy

Structure:

.originals/
  ├── photo_20260114_143022.jpg
  ├── document_20260114_143156.pdf
  └── ...

  • Timestamped filenames prevent collisions
  • Restore command finds most recent backup by pattern match
  • Optional: .metadata.json sidecar for restoration metadata
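
The timestamped backup and most-recent lookup might be sketched as (function names are illustrative):

```python
import shutil
from datetime import datetime
from pathlib import Path

BACKUP_DIR = ".originals"

def create_backup(file_path):
    """Copy the file into .originals/ under a timestamped, collision-free name."""
    src = Path(file_path)
    backup_dir = src.parent / BACKUP_DIR
    backup_dir.mkdir(exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    dest = backup_dir / f"{src.stem}_{stamp}{src.suffix}"
    shutil.copy2(src, dest)  # copy2 also preserves timestamps for restoration
    return dest

def find_latest_backup(file_path):
    """Most recent backup for a file, matched by stem pattern."""
    src = Path(file_path)
    candidates = sorted((src.parent / BACKUP_DIR).glob(f"{src.stem}_*{src.suffix}"))
    return candidates[-1] if candidates else None
```

Lexicographic sorting works here because the timestamp format is fixed-width, year first.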

5. Error Handling Philosophy

  • Graceful degradation: Partial metadata extraction better than total failure
  • Warnings vs Errors: Unsupported tags → warning, file not found → error
  • Atomic operations: Backup before modify, revert on failure
  • Batch isolation: One file failure doesn't stop batch processing
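
The atomic-operation rule (backup before modify, revert on failure) can be sketched as:

```python
import shutil
from pathlib import Path

def atomic_modify(file_path, modify_fn, backup_fn):
    """Back up first, run the modification, and restore the backup on any failure."""
    backup = backup_fn(file_path)
    try:
        return modify_fn(file_path)
    except Exception:
        shutil.copy2(backup, file_path)  # revert to the pristine copy
        raise                            # surface the original error to the caller
```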

Common Mistakes to Avoid

❌ Don't: Modify files in-place without backup

Why: Irreversible data loss if operation fails

Do: Always create backup first, then modify

❌ Don't: Use simple deletion for all cleaning

Why: May break file format, leave orphaned references

Do: Format-specific cleaning (e.g., PDF: rewrite streams, JPEG: re-encode EXIF)

❌ Don't: Generate completely random dummy metadata

Why: Forensically detectable (GPS in ocean, Feb 30, etc.)

Do: Use constrained randomness with validation

❌ Don't: Trust file extensions

Why: malicious.jpg.exe, renamed_video.jpg

Do: Validate file format via magic bytes or library detection
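
A minimal magic-byte check looks like this (the signature table is deliberately tiny and illustrative; production code should prefer a dedicated detector such as python-magic, or the parsing library's own format sniffing):

```python
from pathlib import Path

MAGIC_SIGNATURES = {
    b"\xff\xd8\xff":        "jpeg",
    b"\x89PNG\r\n\x1a\n":   "png",
    b"%PDF-":               "pdf",
}
EXTENSION_FORMATS = {".jpg": "jpeg", ".jpeg": "jpeg", ".png": "png", ".pdf": "pdf"}

def sniff_format(path):
    """Identify the real format from leading magic bytes, not the filename."""
    with open(path, "rb") as f:
        head = f.read(16)
    for signature, fmt in MAGIC_SIGNATURES.items():
        if head.startswith(signature):
            return fmt
    return None

def extension_matches(path):
    """True only when the claimed extension agrees with the sniffed format."""
    claimed = EXTENSION_FORMATS.get(Path(path).suffix.lower())
    return claimed is not None and claimed == sniff_format(path)
```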

❌ Don't: Hardcode field names

Why: Metadata standards evolve (EXIF 2.31, XMP 2.0)

Do: Pattern matching + namespace awareness

❌ Don't: Ignore nested metadata

Why: ZIP containers in DOCX, MP4 tracks, multi-page TIFFs

Do: Recursive extraction for container formats

❌ Don't: Assume UTF-8 everywhere

Why: Legacy EXIF uses ASCII, XMP uses UTF-8, ID3 has multiple encodings

Do: Handle encoding detection per format

Security Considerations

Threat Model

Adversary: Forensic analyst with metadata extraction tools

Goals:

  1. Remove PII leakage (names, locations, device IDs)
  2. Prevent device fingerprinting
  3. Avoid timestamp correlation attacks
  4. Strip edit history

Non-Goals:

  • Hiding evidence of criminal activity (unethical, illegal)
  • Defeating steganography detection
  • Bypassing digital signatures (will break file)

Sanitization Verification

Post-Clean Checks:

  1. Re-extract metadata → ensure sensitive fields gone
  2. Hash comparison → detect unintended changes
  3. File format validation → ensure still openable
  4. Thumbnail check → verify stripping/regeneration

Watermarking Ethics

Optional Feature: Mark files as sanitized

Pros: Transparency, provenance tracking

Cons: Reveals use of privacy tool

Implementation: XMP xmp:History entry or custom tag

Performance Optimization

Parallel Processing

BatchProcessor(max_workers=4)
# CPU-bound: workers = CPU count
# I/O-bound: workers = 2-4x CPU count

Caching

  • Reuse Pillow Image objects when modifying + re-extracting
  • Cache handler instances in registry

Memory Management

  • Stream large files (PDF pages, video chunks)
  • Truncate binary blobs in display (show first 1KB)

Progress Reporting

def progress_callback(current, total, file_path):
    print(f"[{current}/{total}] {file_path.name}")

Testing Strategy

Unit Tests

  • Handler-specific: Mock file I/O, test field extraction
  • Privacy analyzer: Test risk scoring edge cases
  • Dummy generator: Validate consistency rules

Integration Tests

  • End-to-end: scan → clean → verify
  • Format-specific: Real sample files (JPEG, PDF, MP3)

Edge Cases

  • Corrupted metadata (partial EXIF)
  • Empty files (0 bytes)
  • Extremely large metadata (1MB XMP block)
  • Unicode in tags (emoji in MP3 artist)
  • Nested containers (DOCX with embedded images)

Future Enhancements

  1. Web Interface: Flask/FastAPI + drag-drop UI
  2. Format Support: RAW images (CR2, NEF), more video codecs
  3. ML-Based Risk Detection: Train classifier on PII patterns
  4. Blockchain Provenance: Immutable sanitization audit trail
  5. Cloud Integration: S3/Azure Blob batch processing
  6. Steganography Detection: Warn about hidden data beyond metadata
  7. GDPR Compliance Mode: Auto-generate Article 17 deletion report

License Considerations

Dependencies:

  • Most libraries: MIT/BSD (permissive)
  • ExifTool: Perl Artistic/GPL (if using PyExifTool wrapper)

Recommendation:

  • Keep core MIT licensed
  • Document GPL dependency clearly
  • Offer MIT-only mode (without ExifTool integration)

Conclusion

MetaSanitize balances usability (simple CLI), security (thorough cleaning), and realism (forensically plausible dummy data). The modular architecture enables format-specific optimizations while maintaining a consistent interface.

Next Steps:

  1. Implement comprehensive test suite
  2. Create sample files library
  3. Benchmark performance on 1000+ file batches
  4. Document forensic validation methodology