parsa-hemmati
diff --git a/.claude/ccpm/epics/de-identification-module/001.md b/.claude/ccpm/epics/de-identification-module/001.md
Lines changed: 157 additions & 0 deletions
diff --git a/.claude/ccpm/epics/de-identification-module/002.md b/.claude/ccpm/epics/de-identification-module/002.md
Lines changed: 193 additions & 0 deletions
@@ -0,0 +1,157 @@
+---
+name: Fine-tune MedCAT for PHI Detection
+status: open
+created: 2025-11-21T17:03:23Z
+updated: 2025-11-21T17:03:23Z
+github:
+depends_on: []
+parallel: true
+conflicts_with: []
+priority: P0
+agent_type: ml-engineer
+estimated_hours: 120
+---
+
+# Task #001: Fine-tune MedCAT for PHI Detection
+
+## Description
+
+Fine-tune the existing MedCAT NER model to detect 18 HIPAA Safe Harbor PHI identifiers with target precision >95% and recall >90% (F1 score >0.92).
+
+## Phase
+
+Phase 1: PHI Detection Model (Weeks 1-3)
+
+## Technical Specifications
+
+**Goal**: Create a PHI-detection model that identifies all 18 HIPAA Safe Harbor identifiers
+
+**18 HIPAA Identifiers to Detect**:
+1. Names (patients, relatives, employers)
+2. Geographic locations (smaller than state)
+3. Dates (except year)
+4. Telephone numbers
+5. Fax numbers
+6. Email addresses
+7. Social Security Numbers
+8. Medical record numbers
+9. Health plan beneficiary numbers
+10. Account numbers
+11. Certificate/license numbers
+12. Vehicle identifiers
+13. Device identifiers
+14. URLs
+15. IP addresses
+16. Biometric identifiers
+17. Full-face photographs and comparable images (references and metadata)
+18. Any other unique identifying number, characteristic, or code
+
+**Training Dataset**: i2b2 2014 De-identification Challenge corpus
+- 1,304 longitudinal clinical records (296 patients) with PHI annotations
+- ~35,000 PHI entity annotations
+- Covers diverse note types (H&P, discharge summaries, progress notes)
+
+## Implementation Requirements
+
+### 1. Dataset Acquisition & Preparation
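+- Obtain the i2b2 2014 corpus (distributed with the n2c2 NLP Research Data Sets through the Harvard DBMI Data Portal, under a data use agreement)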
+- Parse XML annotations to extract PHI entities
+- Split into train (70%), validation (15%), test (15%)
+- Convert to MedCAT training format
+- Verify annotation quality (inter-annotator agreement)
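A minimal sketch of the parsing and splitting steps above. It assumes each note is one XML file with the body in a `TEXT` element and PHI spans as children of `TAGS` carrying `start`, `end`, `text`, and `TYPE` attributes. The sample document, the helper names `parse_note` and `split_notes`, and the fixed seed are illustrative assumptions to check against the actual corpus files.

```python
import random
import xml.etree.ElementTree as ET

# Hypothetical sample that mimics the assumed annotation layout.
SAMPLE_XML = """<deIdi2b2>
<TEXT><![CDATA[Seen by Dr. Smith on 03/14.]]></TEXT>
<TAGS>
<NAME id="P0" start="12" end="17" text="Smith" TYPE="DOCTOR"/>
<DATE id="P1" start="21" end="26" text="03/14" TYPE="DATE"/>
</TAGS>
</deIdi2b2>"""


def parse_note(xml_string):
    """Return (note_text, spans); each span is a dict describing one PHI tag."""
    root = ET.fromstring(xml_string)
    text = root.findtext("TEXT")
    spans = []
    for tag in root.find("TAGS"):
        start, end = int(tag.get("start")), int(tag.get("end"))
        # Guard against offset drift between the annotation and the note text.
        if text[start:end] != tag.get("text"):
            raise ValueError(f"offset mismatch for tag {tag.get('id')}")
        spans.append({"label": tag.tag, "subtype": tag.get("TYPE"),
                      "start": start, "end": end, "text": tag.get("text")})
    return text, spans


def split_notes(note_ids, seed=42, train=0.70, val=0.15):
    """Shuffle note IDs reproducibly and cut them into train/validation/test."""
    ids = sorted(note_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * train)
    n_val = round(len(ids) * val)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


text, spans = parse_note(SAMPLE_XML)
train_ids, val_ids, test_ids = split_notes(range(100))
```

If several records belong to the same patient, splitting by patient ID rather than by note ID avoids leaking one patient's identifiers from the training set into the test set.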
+
+### 2. Model Fine-tuning
+- Load pre-trained MedCAT model (from Phase 3)
+- Add PHI entity types to model vocabulary
+- Fine-tune NER component with PHI annotations
+- Use transfer learning (freeze embedding layers initially)
+- Train for 10-20 epochs with early stopping
+- Hyperparameter tuning (learning rate, batch size)
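The training schedule above could be driven by a patience-based early-stopping check such as the sketch below. It is framework-agnostic and does not call MedCAT. The `EarlyStopping` class, the patience of 3, and the simulated validation scores are assumptions for illustration, not values from this task.

```python
class EarlyStopping:
    """Stop when the validation score has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_score = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, score):
        """Record one epoch's score; return True if training should stop."""
        if score > self.best_score + self.min_delta:
            self.best_score, self.best_epoch = score, epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Simulated validation F1 per epoch (made-up numbers, not real results).
simulated_f1 = [0.80, 0.86, 0.90, 0.91, 0.905, 0.908, 0.909, 0.91]
stopper = EarlyStopping(patience=3)
for epoch, f1 in enumerate(simulated_f1, start=1):
    if stopper.step(epoch, f1):
        break
```

In a real run the checkpoint saved at `stopper.best_epoch` would be the one carried forward to validation and deployment.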
+
+### 3. Validation & Testing
+- Evaluate on the held-out test set (the 15% split, roughly 195 notes)
+- Calculate per-entity precision, recall, F1 score
+- Generate confusion matrix (which PHI types are missed)
+- Analyze false positives and false negatives
+- Iterate on training if F1 <0.92
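One way to compute the per-entity metrics above is exact-match scoring over `(label, start, end)` spans, sketched below. Exact span matching is an assumption; the task does not say whether partial overlaps should count. The helper name `per_label_prf` and the toy spans are hypothetical.

```python
from collections import Counter


def per_label_prf(gold, predicted):
    """Exact-match precision, recall, and F1 per label.

    gold and predicted are sets of (label, start, end) tuples.
    """
    tp = Counter(label for label, _, _ in gold & predicted)
    fp = Counter(label for label, _, _ in predicted - gold)
    fn = Counter(label for label, _, _ in gold - predicted)
    scores = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Toy data: one NAME is missed, one spurious DATE is predicted.
gold = {("NAME", 0, 5), ("NAME", 10, 15), ("DATE", 20, 25)}
pred = {("NAME", 0, 5), ("DATE", 20, 25), ("DATE", 30, 35)}
scores = per_label_prf(gold, pred)
```

For reference, precision 0.95 and recall 0.90 give F1 = 2 × 0.95 × 0.90 / (0.95 + 0.90) ≈ 0.924, so the F1 target of 0.92 is consistent with the precision and recall targets.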
+
+### 4. Model Deployment
+- Save fine-tuned model weights
+- Update MedCAT service configuration
+- Test model loading and inference
+- Benchmark inference speed (target: <2 min per 10-page note)
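The inference-speed check above could use a small timing harness like this sketch. `fake_detect` is a stand-in for the real model call, and the figure of roughly 3,000 characters per page is an assumption; the task does not define what a page is.

```python
import time


def benchmark(detect, text, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        started = time.perf_counter()
        detect(text)
        timings.append(time.perf_counter() - started)
    return min(timings)


def fake_detect(text):
    """Stand-in for the real model call; returns no entities."""
    return []


# Assumption: about 3,000 characters per page, so 10 pages is about 30,000.
ten_page_note = "Patient seen in clinic today. " * 1000
elapsed = benchmark(fake_detect, ten_page_note)
within_target = elapsed < 120  # target from this task: under 2 minutes
```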
+
+## Acceptance Criteria
+
+- [ ] i2b2 2014 corpus acquired and preprocessed
+- [ ] MedCAT model fine-tuned on PHI entities
+- [ ] Validation metrics meet targets:
+  - Precision >95%
+  - Recall >90%
+  - F1 score >0.92
+- [ ] Per-entity F1 scores all >0.85
+- [ ] Model deployed to MedCAT service
+- [ ] Inference speed <2 min per 10-page note
+- [ ] Documentation: model card with training details
+
+## Testing Requirements
+
+**Unit Tests**:
+- Test dataset loading and preprocessing
+- Test annotation format conversion
+- Test model inference on sample text
+
+**Validation Tests**:
+- F1 score calculation on test set
+- Confusion matrix generation
+- Performance benchmarking
+
+**Integration Tests**:
+- Model loads correctly in MedCAT service
+- Inference API returns PHI entities
+- Confidence scores in valid range [0, 1]
+
+## Dependencies
+
+**External**:
+- i2b2 2014 De-identification Challenge corpus (n2c2 data sets via the Harvard DBMI Data Portal; requires a signed data use agreement)
+- MedCAT library (Phase 3 installation)
+- PyTorch GPU environment
+
+**Internal**:
+- MedCAT Service (Phase 3) must be operational
+- GPU resources for training (8-16GB VRAM)
+
+## Deliverables
+
+1. **Fine-tuned Model**: `models/medcat_phi_v1.0.model`
+2. **Training Report**: `reports/phi_model_training_report.md` with:
+   - Training hyperparameters
+   - Validation metrics (precision, recall, F1)
+   - Confusion matrix
+   - Error analysis
+3. **Model Card**: `models/medcat_phi_model_card.md` (documentation)
+4. **Test Suite**: `tests/test_phi_detection.py`
+
+## Estimated Time
+
+**Total**: 120 hours (3 weeks × 40 hours)
+
+**Breakdown**:
+- Dataset acquisition & preprocessing: 20 hours
+- Model fine-tuning & hyperparameter tuning: 60 hours
+- Validation & error analysis: 20 hours
+- Model deployment & testing: 10 hours
+- Documentation: 10 hours
+
+## Success Metrics
+
+- F1 score >0.92 on held-out test set
+- All 18 PHI categories detected with F1 >0.85
+- Inference speed: 10-page note in <2 minutes
+- False negative rate <10%, i.e. recall >90% (acceptable with human review)
+
+## Related Tasks
+
+- Blocks: #002 (PHI detection service needs this model)
+- Blocks: #003 (De-identification service needs PHI detection)
@@ -0,0 +1,193 @@
+---
+name: Create PHI Detection Service
+status: open
+created: 2025-11-21T17:06:45Z
+updated: 2025-11-21T17:06:45Z
+github:
+depends_on: [001]
+parallel: false
+conflicts_with: []
+priority: P0
+agent_type: developer
+estimated_hours: 20
+---
+
+# Task #002: Create PHI Detection Service
+
+## Description
+
+Implement `phi_detection_service.py`, which integrates with the fine-tuned MedCAT model to detect PHI entities in clinical notes. The service wraps the MedCAT client and exposes a clean interface for PHI detection with confidence scores.
+
+## Phase
+
+Phase 2: Backend API (Week 4)
+
+## Technical Specifications
+
+**Goal**: Backend service for PHI entity detection using MedCAT
+
+**Service Design**:
+```text
+backend/app/services/phi_detection_service.py
+├── PHIDetectionService (class)
+│   ├── __init__(medcat_client: MedCATClient)
+│   ├── detect_phi(text: str) → List[PHIEntity]
+│   ├── detect_phi_batch(texts: List[str]) → List[List[PHIEntity]]
+│   └── get_model_info() → ModelInfo
+```
+
+**PHI Entity Schema**:
+```python
+from typing import Optional
+
+from pydantic import BaseModel, Field
+
+
+class PHIEntity(BaseModel):
+    entity_type: str   # PHI category label: NAME, DATE, MRN, PHONE, etc.
+    text: str          # Actual PHI text found
+    start: int         # Character offset start
+    end: int           # Character offset end
+    confidence: float = Field(ge=0.0, le=1.0)  # 0.0 - 1.0
+    cui: Optional[str] = None  # UMLS CUI if applicable
+```
+
+**18 PHI Entity Types to Detect**:
+1. NAME - Patient, relative, employer names
+2. LOCATION - Cities, streets, zip codes
+3. DATE - All dates except year
+4. PHONE - Telephone numbers
+5. FAX - Fax numbers
+6. EMAIL - Email addresses
+7. SSN - Social Security Numbers
+8. MRN - Medical record numbers
+9. HEALTHPLAN - Health plan beneficiary numbers
+10. ACCOUNT - Account numbers
+11. LICENSE - Certificate/license numbers
+12. VEHICLE - Vehicle identifiers
+13. DEVICE - Device identifiers/serial numbers
+14. URL - Web URLs
+15. IPADDR - IP addresses
+16. BIOMETRIC - Biometric identifiers
+17. PHOTO - Full-face photo references
+18. IDENTIFIER - Other unique identifying numbers
+
+## Implementation Requirements
+
+### 1. MedCAT Client Integration
+- Reuse existing `MedCATClient` from Phase 3
+- Load fine-tuned PHI model (from Task 001)
+- Configure model path in settings
+- Handle connection errors gracefully
+
+### 2. PHI Entity Detection
+- Call MedCAT NER on input text
+- Filter entities by PHI categories only
+- Extract entity type, text, offsets, confidence
+- Return structured PHIEntity objects
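A minimal sketch of the filtering step above. It assumes the MedCAT call returns a dict whose `entities` values carry `source_value`, `start`, `end`, `acc`, `cui`, and a `types` list holding the PHI label. Verify that shape against the deployed model, since the fine-tuned model may expose labels differently. A plain dataclass stands in for the Pydantic `PHIEntity` so the example has no third-party dependency, and `to_phi_entities` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Optional

PHI_TYPES = {
    "NAME", "LOCATION", "DATE", "PHONE", "FAX", "EMAIL", "SSN", "MRN",
    "HEALTHPLAN", "ACCOUNT", "LICENSE", "VEHICLE", "DEVICE", "URL",
    "IPADDR", "BIOMETRIC", "PHOTO", "IDENTIFIER",
}


@dataclass
class PHIEntity:
    entity_type: str
    text: str
    start: int
    end: int
    confidence: float
    cui: Optional[str] = None


def to_phi_entities(raw, threshold=0.7):
    """Keep PHI-typed entities at or above the threshold, sorted by offset."""
    found = []
    for ent in raw.get("entities", {}).values():
        label = next((t for t in ent.get("types", []) if t in PHI_TYPES), None)
        if label is None or ent["acc"] < threshold:
            continue  # clinical concept or low-confidence hit
        found.append(PHIEntity(label, ent["source_value"], ent["start"],
                               ent["end"], ent["acc"], ent.get("cui")))
    return sorted(found, key=lambda e: e.start)


# Hypothetical raw output: one PHI hit, one clinical concept, one weak hit.
raw = {"entities": {
    0: {"source_value": "John Doe", "start": 0, "end": 8,
        "acc": 0.98, "types": ["NAME"]},
    1: {"source_value": "diabetes", "start": 13, "end": 21,
        "acc": 0.99, "types": ["T047"], "cui": "C0011849"},
    2: {"source_value": "May 3", "start": 28, "end": 33,
        "acc": 0.41, "types": ["DATE"]},
}}
entities = to_phi_entities(raw)
```

With the default threshold of 0.7 only the `NAME` entity survives: the clinical concept is dropped by the category filter and the weak `DATE` hit by the confidence filter.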
+
+### 3. Batch Processing Support
+- Process multiple notes efficiently
+- Maintain order of results
+- Handle partial failures (some notes fail, others succeed)
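The batch requirements above (stable ordering, partial failures) could be met with a per-note result pair, sketched below. Returning `(entities, error)` pairs is one possible design and differs from the `List[List[PHIEntity]]` signature in the service design; `detect_batch` and `toy_detect` are hypothetical names.

```python
def detect_batch(detect, texts):
    """Run `detect` on each text, preserving input order.

    Returns one (entities, error) pair per input; exactly one of the two
    is None, so a failed note never shifts or drops its neighbours.
    """
    results = []
    for text in texts:
        try:
            results.append((detect(text), None))
        except Exception as exc:  # record the failure and keep going
            results.append((None, str(exc)))
    return results


def toy_detect(text):
    """Hypothetical detector: rejects empty input, otherwise finds nothing."""
    if not text.strip():
        raise ValueError("empty text")
    return []


batch = detect_batch(toy_detect, ["note one", "", "note three"])
```

Because every input yields exactly one output slot, callers can zip the results back onto the original note IDs without bookkeeping.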
+
+### 4. Error Handling
+- Connection timeout (MedCAT service down)
+- Model not loaded
+- Invalid input (empty text, too large)
+- Return meaningful error messages
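The error cases above could map onto a small exception hierarchy plus an input check, as in this sketch. All class names and the `MAX_TEXT_CHARS` limit are hypothetical; the task does not specify a size limit or exception types.

```python
class PHIDetectionError(Exception):
    """Base class for errors raised by the detection service."""


class InvalidInputError(PHIDetectionError):
    """Input text is empty or exceeds the size limit."""


class ServiceUnavailableError(PHIDetectionError):
    """The MedCAT service timed out or could not be reached."""


class ModelNotLoadedError(PHIDetectionError):
    """The PHI model has not been loaded by the MedCAT service."""


MAX_TEXT_CHARS = 1_000_000  # hypothetical limit; the task does not set one


def validate_text(text, max_chars=MAX_TEXT_CHARS):
    """Return the text unchanged, or raise InvalidInputError with a clear message."""
    if not isinstance(text, str) or not text.strip():
        raise InvalidInputError("text must be a non-empty string")
    if len(text) > max_chars:
        raise InvalidInputError(
            f"text has {len(text)} characters; limit is {max_chars}")
    return text
```

A shared base class lets API handlers catch `PHIDetectionError` once and translate it into a single error response format.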
+
+### 5. Performance Optimization
+- Connection pooling (reuse MedCAT client)
+- Caching for repeated queries (optional)
+- Async support for concurrent requests
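For the async requirement above, bounded concurrency with `asyncio` is one option, sketched here. `detect_many`, `fake_detect_async`, and the concurrency limit of 4 are assumptions; the real detector would be an awaitable call to the MedCAT service.

```python
import asyncio


async def detect_many(detect_async, texts, max_concurrency=4):
    """Run an async detector over texts with bounded concurrency, keeping order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(text):
        async with semaphore:
            return await detect_async(text)

    # gather returns results in the order of its arguments.
    return await asyncio.gather(*(run_one(t) for t in texts))


async def fake_detect_async(text):
    """Stand-in for a network call to the MedCAT service."""
    await asyncio.sleep(0.01)
    return [text.upper()]


results = asyncio.run(detect_many(fake_detect_async, ["a", "b", "c"]))
```

The semaphore caps how many requests reach the service at once, which keeps a large batch from exhausting the connection pool.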
+
+### 6. Configuration
+- Model path configurable via environment variable
+- MedCAT service URL configurable
+- Confidence threshold configurable (default: 0.7)
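The configuration items above could be read from the environment as in this sketch. `PHI_MODEL_PATH` and the default model path come from this epic; the variable names `MEDCAT_SERVICE_URL` and `PHI_CONFIDENCE_THRESHOLD` and the localhost URL are hypothetical placeholders.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class PHISettings:
    model_path: str
    medcat_url: str
    confidence_threshold: float


def load_settings(env=os.environ):
    """Build settings from environment variables, falling back to defaults."""
    threshold = float(env.get("PHI_CONFIDENCE_THRESHOLD", "0.7"))
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("PHI_CONFIDENCE_THRESHOLD must be within [0.0, 1.0]")
    return PHISettings(
        model_path=env.get("PHI_MODEL_PATH", "models/medcat_phi_v1.0.model"),
        medcat_url=env.get("MEDCAT_SERVICE_URL", "http://localhost:5000"),
        confidence_threshold=threshold,
    )


# Passing a plain dict instead of os.environ keeps the example deterministic.
settings = load_settings({"PHI_MODEL_PATH": "/opt/models/phi.model"})
```

Validating the threshold at load time surfaces a bad value at startup instead of during the first detection request.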
+
+## Acceptance Criteria
+
+- [ ] `PHIDetectionService` class implemented
+- [ ] Detects all 18 PHI entity types
+- [ ] Returns structured PHIEntity objects with offsets
+- [ ] Confidence scores included (0.0 - 1.0 range)
+- [ ] Batch processing works correctly
+- [ ] Error handling for service failures
+- [ ] Configuration via environment variables
+- [ ] Unit tests (mock MedCAT responses) - 15 tests
+- [ ] Integration tests (real MedCAT service) - 5 tests
+- [ ] Test coverage >90%
+- [ ] API documentation (docstrings)
+
+## Testing Requirements
+
+**Unit Tests** (`backend/tests/unit/services/test_phi_detection_service.py`):
+```python
+def test_detect_phi_returns_entities(mock_medcat_client):
+    """Test PHI detection returns PHIEntity objects"""
+
+def test_detect_phi_filters_non_phi_entities(mock_medcat_client):
+    """Test only PHI categories returned, not clinical concepts"""
+
+def test_detect_phi_confidence_threshold(mock_medcat_client):
+    """Test low-confidence entities filtered out"""
+
+def test_detect_phi_batch_processing(mock_medcat_client):
+    """Test batch processing maintains order"""
+
+def test_detect_phi_handles_empty_text(mock_medcat_client):
+    """Test graceful handling of empty input"""
+```
+
+**Integration Tests** (`backend/tests/integration/test_phi_detection_integration.py`):
+```python
+def test_phi_detection_with_real_medcat_service():
+    """Test detection with actual MedCAT service"""
+
+def test_phi_detection_all_18_categories():
+    """Test sample text with all 18 PHI types detected"""
+
+def test_phi_detection_performance_benchmark():
+    """Test 10-page note processed in <2 minutes"""
+```
+
+## Dependencies
+
+**External**:
+- Fine-tuned MedCAT model (Task 001) - **BLOCKS THIS TASK**
+- MedCAT Service running (Phase 3)
+- Redis (optional for caching)
+
+**Internal**:
+- `MedCATClient` (Phase 3 infrastructure)
+- Pydantic schemas
+- FastAPI settings
+
+## Deliverables
+
+1. **Service Implementation**: `backend/app/services/phi_detection_service.py`
+2. **Pydantic Schema**: `backend/app/schemas/phi_entity.py`
+3. **Unit Tests**: `backend/tests/unit/services/test_phi_detection_service.py`
+4. **Integration Tests**: `backend/tests/integration/test_phi_detection_integration.py`
+5. **Configuration**: Add `PHI_MODEL_PATH` to `.env`
+
+## Estimated Time
+
+**Total**: 20 hours (Week 4)
+
+**Breakdown**:
+- Service implementation: 8 hours
+- Error handling & edge cases: 4 hours
+- Unit tests: 4 hours
+- Integration tests: 2 hours
+- Documentation: 2 hours
+
+## Success Metrics
+
+- PHI detection works for all 18 categories
+- Performance: <2 minutes per 10-page note
+- Test coverage: >90%
+- No crashes on invalid input
+
+## Related Tasks
+
+- **Depends on**: #001 (Fine-tuned MedCAT model)
+- **Blocks**: #003 (De-identification service needs PHI detection)
+- **Blocks**: #004 (Batch API needs PHI detection)