Commit 01b80dd

feat(ccpm): Decompose de-identification-module into 8 tasks via /pm:epic-decompose
Changes:
- Created Task #1: Fine-tune MedCAT for PHI Detection (ML, 120h, P0)
- Created Task #2: Create PHI Detection Service (Backend, 20h, P0)
- Created Task #3: Create De-identification Service (Backend, 24h, P0)
- Created Task #4: Create Batch Processing API and Celery Tasks (Backend, 32h, P0)
- Created Task #5: Implement Audit Logging and Database Schema (Backend, 16h, P0, parallel)
- Created Task #6: Create Upload and Review UI (Frontend, 40h, P0)
- Created Task #7: Create Manual Annotation Tool and Job Tracking (Frontend, 32h, P1)
- Created Task #8: IRB Submission and Pilot Study (Validation, 40h, P0)

Rationale:
- Following proper CCPM workflow (/pm:epic-decompose command)
- Simplified from 20-30 typical tasks to 8 core tasks (per CCPM guidance: "≤10 tasks")
- Applied 5 simplification strategies:
  1. Reuse search module components (entity highlighting, sanitization)
  2. Reuse MedCAT infrastructure (no new NLP service)
  3. Minimal database schema (2 PostgreSQL tables, 2 Elasticsearch indexes)
  4. Focus on Safe Harbor method initially
  5. Batch-only processing (no real-time API in Phase 1)
- Total estimated effort: 324 hours, the sum of the eight task estimates above (about 9 person-weeks across 12 calendar weeks)

Task Dependencies:
- Task #1 blocks #2 (PHI detection needs fine-tuned model)
- Task #2 blocks #3, #4 (services need PHI detection)
- Task #3 blocks #4 (batch API needs de-identification logic)
- Task #4 blocks #6 (frontend needs API)
- Task #5 parallel (infrastructure setup)
- Task #6 blocks #7 (annotation extends review UI)
- Tasks #6, #7 block #8 (IRB needs complete system)

AI Context:
- Command: /pm:epic-decompose de-identification-module
- Epic: .claude/ccpm/epics/de-identification-module/epic.md
- PRD: .claude/ccpm/prds/de-identification-module.md
- Session: 2025-11-21
1 parent 2cde410 commit 01b80dd

8 files changed

Lines changed: 2478 additions & 0 deletions

Lines changed: 157 additions & 0 deletions
@@ -0,0 +1,157 @@
---
name: Fine-tune MedCAT for PHI Detection
status: open
created: 2025-11-21T17:03:23Z
updated: 2025-11-21T17:03:23Z
github:
depends_on: []
parallel: true
conflicts_with: []
priority: P0
agent_type: ml-engineer
estimated_hours: 120
---

# Task #001: Fine-tune MedCAT for PHI Detection

## Description

Fine-tune the existing MedCAT NER model to detect 18 HIPAA Safe Harbor PHI identifiers with target precision >95% and recall >90% (F1 score >0.92).
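
As a quick sanity check on these targets: F1 is the harmonic mean of precision and recall, so a model sitting exactly at the precision and recall floors would score about 0.924, which is consistent with the F1 >0.92 target. A minimal sketch (the helper name `f1_score` is illustrative and not part of MedCAT):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall floors taken from the task description.
floor_f1 = f1_score(0.95, 0.90)
print(f"F1 at the precision/recall floors: {floor_f1:.3f}")  # prints 0.924
```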

## Phase

Phase 1: PHI Detection Model (Weeks 1-3)

## Technical Specifications

**Goal**: Create a PHI-detection model that identifies all 18 HIPAA Safe Harbor identifiers

**18 HIPAA Identifiers to Detect**:
1. Names (patients, relatives, employers)
2. Geographic locations (smaller than state)
3. Dates (except year)
4. Telephone numbers
5. Fax numbers
6. Email addresses
7. Social Security Numbers
8. Medical record numbers
9. Health plan beneficiary numbers
10. Account numbers
11. Certificate/license numbers
12. Vehicle identifiers
13. Device identifiers
14. URLs
15. IP addresses
16. Biometric identifiers
17. Full-face photos (metadata)
18. Unique identifying numbers

**Training Dataset**: i2b2 2014 De-identification Challenge corpus
- 1,296 clinical notes with PHI annotations
- ~35,000 PHI entity annotations
- Covers diverse note types (H&P, discharge summaries, progress notes)

## Implementation Requirements

### 1. Dataset Acquisition & Preparation
- Download i2b2 2014 corpus from PhysioNet
- Parse XML annotations to extract PHI entities
- Split into train (70%), validation (15%), test (15%)
- Convert to MedCAT training format
- Verify annotation quality (inter-annotator agreement)
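
The 70/15/15 split above could be made reproducible along the following lines. This is a sketch under assumptions: the function name `split_notes`, the fixed seed, and the synthetic note IDs are all illustrative, and nothing here depends on the real corpus layout.

```python
import random
from typing import List, Sequence, Tuple


def split_notes(
    note_ids: Sequence[str],
    train_frac: float = 0.70,
    val_frac: float = 0.15,
    seed: int = 42,
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle note IDs reproducibly and split into train/validation/test."""
    ids = list(note_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split can be regenerated
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # remainder, so no note is dropped
    return train, val, test


# Hypothetical IDs standing in for the 1,296 notes cited above.
train, val, test = split_notes([f"note-{i:04d}" for i in range(1296)])
print(len(train), len(val), len(test))  # prints 907 194 195
```

If the corpus contains several notes per patient, splitting on patient IDs rather than note IDs would avoid leaking a patient's identifiers between the training and test sets.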

### 2. Model Fine-tuning
- Load pre-trained MedCAT model (from Phase 3)
- Add PHI entity types to model vocabulary
- Fine-tune NER component with PHI annotations
- Use transfer learning (freeze embedding layers initially)
- Train for 10-20 epochs with early stopping
- Hyperparameter tuning (learning rate, batch size)
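
The "10-20 epochs with early stopping" step can be illustrated independently of any training framework. The sketch below is not MedCAT's training API: `run_epoch` is a hypothetical callback that trains one epoch and returns validation F1, and the `patience` and `min_delta` defaults are assumptions.

```python
from typing import Callable, Tuple


def train_with_early_stopping(
    run_epoch: Callable[[int], float],
    max_epochs: int = 20,
    patience: int = 3,
    min_delta: float = 1e-4,
) -> Tuple[int, float]:
    """Run epochs until validation F1 stops improving.

    Returns (best_epoch, best_f1), with epochs counted from 0.
    """
    best_epoch, best_f1 = -1, float("-inf")
    for epoch in range(max_epochs):
        val_f1 = run_epoch(epoch)
        if val_f1 > best_f1 + min_delta:
            # A real loop would also save a checkpoint here.
            best_epoch, best_f1 = epoch, val_f1
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_f1


# Toy validation curve standing in for real training runs.
curve = [0.80, 0.86, 0.90, 0.91, 0.905, 0.908, 0.909, 0.90]
best_epoch, best_f1 = train_with_early_stopping(
    lambda epoch: curve[epoch], max_epochs=len(curve)
)
print(best_epoch, best_f1)  # prints 3 0.91
```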

### 3. Validation & Testing
- Evaluate on held-out test set (roughly 194 notes, i.e. 15% of the 1,296)
- Calculate per-entity precision, recall, F1 score
- Generate confusion matrix (which PHI types are missed)
- Analyze false positives and false negatives
- Iterate on training if F1 <0.92
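
Per-entity precision, recall, and F1 can be computed from gold and predicted spans as sketched below. The exact-match criterion and the `(note_id, start, end, entity_type)` tuple layout are assumptions made for illustration; the task does not specify a matching rule, and a relaxed (overlap-based) criterion would give different numbers.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Span = Tuple[str, int, int, str]  # (note_id, start, end, entity_type)


def per_type_metrics(
    gold: Iterable[Span], pred: Iterable[Span]
) -> Dict[str, Dict[str, float]]:
    """Exact-match precision, recall and F1 for each entity type."""
    gold_set, pred_set = set(gold), set(pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for span in pred_set:
        # Right offsets with the wrong type count as a false positive here
        # and as a false negative for the gold type below.
        (tp if span in gold_set else fp)[span[3]] += 1
    for span in gold_set - pred_set:
        fn[span[3]] += 1

    report = {}
    for etype in sorted(set(tp) | set(fp) | set(fn)):
        p = tp[etype] / (tp[etype] + fp[etype]) if tp[etype] + fp[etype] else 0.0
        r = tp[etype] / (tp[etype] + fn[etype]) if tp[etype] + fn[etype] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        report[etype] = {"precision": p, "recall": r, "f1": f1}
    return report


# Synthetic spans: the DATE at 20-30 is mislabelled as NAME by the model.
gold = [("n1", 0, 8, "NAME"), ("n1", 20, 30, "DATE"), ("n2", 5, 16, "MRN")]
pred = [("n1", 0, 8, "NAME"), ("n1", 20, 30, "NAME"), ("n2", 5, 16, "MRN")]
for etype, scores in per_type_metrics(gold, pred).items():
    print(etype, scores)
```

Types whose F1 falls below the 0.85 per-entity floor in the acceptance criteria are the natural starting point for error analysis.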

### 4. Model Deployment
- Save fine-tuned model weights
- Update MedCAT service configuration
- Test model loading and inference
- Benchmark inference speed (target: <2 min per 10-page note)
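
A simple way to check the inference-speed target is to time repeated calls and compare the slowest run against the 2-minute budget. In this sketch the `infer` callable is a stand-in for the real model call, and treating a 10-page note as roughly 5,000 words is an assumption.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(infer: Callable[[str], object], note: str, runs: int = 5) -> Dict[str, float]:
    """Time `infer(note)` over several runs and report seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(note)
        timings.append(time.perf_counter() - start)
    return {"median_s": statistics.median(timings), "max_s": max(timings)}


TARGET_SECONDS = 120  # "<2 min per 10-page note" from the task

# Stand-in workload; replace the lambda with the real inference call.
fake_note = "word " * 5000
result = benchmark(lambda text: text.split(), fake_note)
print(result, "meets target:", result["max_s"] < TARGET_SECONDS)
```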

## Acceptance Criteria

- [ ] i2b2 2014 corpus acquired and preprocessed
- [ ] MedCAT model fine-tuned on PHI entities
- [ ] Validation metrics meet targets:
  - Precision >95%
  - Recall >90%
  - F1 score >0.92
- [ ] Per-entity F1 scores all >0.85
- [ ] Model deployed to MedCAT service
- [ ] Inference speed <2 min per 10-page note
- [ ] Documentation: model card with training details

## Testing Requirements

**Unit Tests**:
- Test dataset loading and preprocessing
- Test annotation format conversion
- Test model inference on sample text

**Validation Tests**:
- F1 score calculation on test set
- Confusion matrix generation
- Performance benchmarking

**Integration Tests**:
- Model loads correctly in MedCAT service
- Inference API returns PHI entities
- Confidence scores in valid range [0, 1]

## Dependencies

**External**:
- i2b2 2014 De-identification Challenge corpus (PhysioNet - requires CITI training)
- MedCAT library (Phase 3 installation)
- PyTorch GPU environment

**Internal**:
- MedCAT Service (Phase 3) must be operational
- GPU resources for training (8-16GB VRAM)

## Deliverables

1. **Fine-tuned Model**: `models/medcat_phi_v1.0.model`
2. **Training Report**: `reports/phi_model_training_report.md` with:
   - Training hyperparameters
   - Validation metrics (precision, recall, F1)
   - Confusion matrix
   - Error analysis
3. **Model Card**: `models/medcat_phi_model_card.md` (documentation)
4. **Test Suite**: `tests/test_phi_detection.py`

## Estimated Time

**Total**: 120 hours (3 weeks × 40 hours)

**Breakdown**:
- Dataset acquisition & preprocessing: 20 hours
- Model fine-tuning & hyperparameter tuning: 60 hours
- Validation & error analysis: 20 hours
- Model deployment & testing: 10 hours
- Documentation: 10 hours

## Success Metrics

- F1 score >0.92 on held-out test set
- All 18 PHI categories detected with F1 >0.85
- Inference speed: 10-page note in <2 minutes
- False negative rate <10% (acceptable with human review)

## Related Tasks

- Blocks: #002 (PHI detection service needs this model)
- Blocks: #003 (De-identification service needs PHI detection)
Lines changed: 193 additions & 0 deletions
@@ -0,0 +1,193 @@
---
name: Create PHI Detection Service
status: open
created: 2025-11-21T17:06:45Z
updated: 2025-11-21T17:06:45Z
github:
depends_on: [001]
parallel: false
conflicts_with: []
priority: P0
agent_type: developer
estimated_hours: 20
---

# Task #002: Create PHI Detection Service

## Description

Implement `phi_detection_service.py` that integrates with the fine-tuned MedCAT model to detect PHI entities in clinical notes. This service wraps the MedCAT client and provides a clean interface for PHI detection with confidence scores.

## Phase

Phase 2: Backend API (Week 4)

## Technical Specifications

**Goal**: Backend service for PHI entity detection using MedCAT

**Service Design**:
```text
backend/app/services/phi_detection_service.py
├── PHIDetectionService (class)
│   ├── __init__(medcat_client: MedCATClient)
│   ├── detect_phi(text: str) → List[PHIEntity]
│   ├── detect_phi_batch(texts: List[str]) → List[List[PHIEntity]]
│   └── get_model_info() → ModelInfo
```

**PHI Entity Schema**:
```python
from typing import Optional

from pydantic import BaseModel


class PHIEntity(BaseModel):
    entity_type: str  # name, date, mrn, phone, etc.
    text: str  # Actual PHI text found
    start: int  # Character offset start
    end: int  # Character offset end
    confidence: float  # 0.0 - 1.0
    cui: Optional[str] = None  # UMLS CUI if applicable
```

**18 PHI Entity Types to Detect**:
1. NAME - Patient, relative, employer names
2. LOCATION - Cities, streets, zip codes
3. DATE - All dates except year
4. PHONE - Telephone numbers
5. FAX - Fax numbers
6. EMAIL - Email addresses
7. SSN - Social Security Numbers
8. MRN - Medical record numbers
9. HEALTHPLAN - Health plan beneficiary numbers
10. ACCOUNT - Account numbers
11. LICENSE - Certificate/license numbers
12. VEHICLE - Vehicle identifiers
13. DEVICE - Device identifiers/serial numbers
14. URL - Web URLs
15. IPADDR - IP addresses
16. BIOMETRIC - Biometric identifiers
17. PHOTO - Full-face photo references
18. IDENTIFIER - Other unique identifying numbers

## Implementation Requirements

### 1. MedCAT Client Integration
- Reuse existing `MedCATClient` from Phase 3
- Load fine-tuned PHI model (from Task 001)
- Configure model path in settings
- Handle connection errors gracefully

### 2. PHI Entity Detection
- Call MedCAT NER on input text
- Filter entities by PHI categories only
- Extract entity type, text, offsets, confidence
- Return structured PHIEntity objects
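
The filtering step above might look roughly like the following. The raw entity shape (`label`, `start`, `end`, `confidence`, `cui` keys) is an assumption for illustration, since the response format of the project's `MedCATClient` is not shown here, and the dataclass is a dependency-free stand-in for the Pydantic `PHIEntity` schema.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Optional

# The 18 entity type labels listed in this task.
PHI_TYPES = frozenset({
    "NAME", "LOCATION", "DATE", "PHONE", "FAX", "EMAIL", "SSN", "MRN",
    "HEALTHPLAN", "ACCOUNT", "LICENSE", "VEHICLE", "DEVICE", "URL",
    "IPADDR", "BIOMETRIC", "PHOTO", "IDENTIFIER",
})


@dataclass(frozen=True)
class PHIEntity:
    """Stdlib stand-in for the Pydantic PHIEntity schema."""
    entity_type: str
    text: str
    start: int
    end: int
    confidence: float
    cui: Optional[str] = None


def to_phi_entities(
    text: str, raw_entities: Iterable[Dict[str, Any]], threshold: float = 0.7
) -> List[PHIEntity]:
    """Keep PHI-typed entities at or above the confidence threshold."""
    entities = []
    for raw in raw_entities:
        label = str(raw.get("label", "")).upper()
        confidence = float(raw.get("confidence", 0.0))
        if label not in PHI_TYPES or confidence < threshold:
            continue  # clinical concept or low-confidence hit
        start, end = int(raw["start"]), int(raw["end"])
        entities.append(
            PHIEntity(label, text[start:end], start, end, confidence, raw.get("cui"))
        )
    return sorted(entities, key=lambda entity: entity.start)


# Entirely synthetic note and model output.
note = "Pt John Smith seen 03/14 for hypertension."
raw = [
    {"label": "DATE", "start": 19, "end": 24, "confidence": 0.93},
    {"label": "NAME", "start": 3, "end": 13, "confidence": 0.98},
    {"label": "DISORDER", "start": 29, "end": 41, "confidence": 0.99},
    {"label": "PHONE", "start": 0, "end": 2, "confidence": 0.40},
]
for entity in to_phi_entities(note, raw):
    print(entity.entity_type, repr(entity.text), entity.confidence)
```

Slicing the text by offsets, rather than trusting a text field in the response, keeps `text`, `start`, and `end` consistent with one another.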

### 3. Batch Processing Support
- Process multiple notes efficiently
- Maintain order of results
- Handle partial failures (some notes fail, others succeed)
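
One way to keep results in order while tolerating per-note failures is to return one result object per input, carrying either the entities or an error message. This is a design sketch rather than the specified interface: the service design declares `detect_phi_batch` as returning `List[List[PHIEntity]]`, and the `BatchItemResult` wrapper and `detect_one` callback here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


@dataclass
class BatchItemResult:
    index: int  # position of the note in the input batch
    entities: List[dict] = field(default_factory=list)
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def detect_phi_batch(
    texts: Sequence[str], detect_one: Callable[[str], List[dict]]
) -> List[BatchItemResult]:
    """Run detection per note; a failure on one note does not stop the rest."""
    results = []
    for index, text in enumerate(texts):
        try:
            results.append(BatchItemResult(index, entities=detect_one(text)))
        except Exception as exc:  # isolate the failure to this note
            results.append(BatchItemResult(index, error=f"{type(exc).__name__}: {exc}"))
    return results


def fake_detect(text: str) -> List[dict]:
    """Stand-in detector that rejects empty notes."""
    if not text.strip():
        raise ValueError("empty text")
    return [{"entity_type": "NAME", "text": text.split()[0]}]


results = detect_phi_batch(["Alice seen today", "", "Bob discharged"], fake_detect)
for item in results:
    print(item.index, item.ok, item.entities or item.error)
```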

### 4. Error Handling
- Connection timeout (MedCAT service down)
- Model not loaded
- Invalid input (empty text, too large)
- Return meaningful error messages
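
The failure modes listed above map naturally onto a small exception hierarchy. Everything named below (the exception classes, `MAX_TEXT_CHARS`, `validate_text`, `call_service`) is hypothetical, and the sketch assumes the underlying client raises the built-in `TimeoutError` or `ConnectionError`; a real HTTP client library may raise its own exception types instead.

```python
class PHIDetectionError(Exception):
    """Base class for errors raised by the detection service."""


class InvalidInputError(PHIDetectionError):
    """Input text is missing, empty, or too large."""


class ServiceUnavailableError(PHIDetectionError):
    """The MedCAT service could not be reached in time."""


class ModelNotLoadedError(PHIDetectionError):
    """The service is up but the PHI model is not loaded."""


MAX_TEXT_CHARS = 1_000_000  # illustrative limit, not from the task


def validate_text(text: object, max_chars: int = MAX_TEXT_CHARS) -> str:
    """Return the text unchanged, or raise InvalidInputError with a clear message."""
    if not isinstance(text, str):
        raise InvalidInputError(f"expected str, got {type(text).__name__}")
    if not text.strip():
        raise InvalidInputError("text is empty")
    if len(text) > max_chars:
        raise InvalidInputError(f"text has {len(text)} characters; limit is {max_chars}")
    return text


def call_service(fn, *args, **kwargs):
    """Translate low-level connection failures into a service-level error."""
    try:
        return fn(*args, **kwargs)
    except (TimeoutError, ConnectionError) as exc:
        raise ServiceUnavailableError(f"MedCAT service unavailable: {exc}") from exc


try:
    validate_text("   ")
except InvalidInputError as exc:
    print("rejected:", exc)  # prints rejected: text is empty
```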

### 5. Performance Optimization
- Connection pooling (reuse MedCAT client)
- Caching for repeated queries (optional)
- Async support for concurrent requests
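
Async support for concurrent requests could be sketched with `asyncio.gather` plus a semaphore that caps in-flight calls. The `detect_one` coroutine and the concurrency limit of 4 are assumptions; `fake_detect` merely simulates a network round trip.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def detect_concurrently(
    texts: Sequence[str],
    detect_one: Callable[[str], Awaitable[List[str]]],
    max_concurrency: int = 4,
) -> List[List[str]]:
    """Run detection on many notes with a cap on concurrent requests."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(text: str) -> List[str]:
        async with semaphore:
            return await detect_one(text)

    # gather() returns results in the order of its arguments,
    # so output order matches input order.
    return await asyncio.gather(*(guarded(text) for text in texts))


async def fake_detect(text: str) -> List[str]:
    await asyncio.sleep(0.01)  # stands in for a call to the MedCAT service
    return [word for word in text.split() if word.istitle()]


results = asyncio.run(detect_concurrently(["Alice met Bob", "no names here"], fake_detect))
print(results)  # prints [['Alice', 'Bob'], []]
```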

### 6. Configuration
- Model path configurable via environment variable
- MedCAT service URL configurable
- Confidence threshold configurable (default: 0.7)
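
A dependency-free sketch of this configuration follows. `PHI_MODEL_PATH` is the variable named in the deliverables and the default model path comes from Task 001; the names `MEDCAT_SERVICE_URL` and `PHI_CONFIDENCE_THRESHOLD` and the default URL are assumptions. The project itself may prefer its existing FastAPI settings mechanism over a plain dataclass.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class PHIDetectionSettings:
    model_path: str
    medcat_url: str
    confidence_threshold: float = 0.7


def load_settings(env: Mapping[str, str] = os.environ) -> PHIDetectionSettings:
    """Read settings from environment variables, falling back to defaults."""
    threshold = float(env.get("PHI_CONFIDENCE_THRESHOLD", "0.7"))
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"confidence threshold must be in [0, 1], got {threshold}")
    return PHIDetectionSettings(
        model_path=env.get("PHI_MODEL_PATH", "models/medcat_phi_v1.0.model"),
        # Hypothetical default URL; adjust to the actual deployment.
        medcat_url=env.get("MEDCAT_SERVICE_URL", "http://localhost:5000"),
        confidence_threshold=threshold,
    )


# Passing a plain dict makes the loader easy to exercise in tests.
settings = load_settings({"PHI_CONFIDENCE_THRESHOLD": "0.85"})
print(settings)
```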

## Acceptance Criteria

- [ ] `PHIDetectionService` class implemented
- [ ] Detects all 18 PHI entity types
- [ ] Returns structured PHIEntity objects with offsets
- [ ] Confidence scores included (0.0 - 1.0 range)
- [ ] Batch processing works correctly
- [ ] Error handling for service failures
- [ ] Configuration via environment variables
- [ ] Unit tests (mock MedCAT responses) - 15 tests
- [ ] Integration tests (real MedCAT service) - 5 tests
- [ ] Test coverage >90%
- [ ] API documentation (docstrings)

## Testing Requirements

**Unit Tests** (`backend/tests/unit/services/test_phi_detection_service.py`):
```python
def test_detect_phi_returns_entities(mock_medcat_client):
    """Test PHI detection returns PHIEntity objects"""

def test_detect_phi_filters_non_phi_entities(mock_medcat_client):
    """Test only PHI categories returned, not clinical concepts"""

def test_detect_phi_confidence_threshold(mock_medcat_client):
    """Test low-confidence entities filtered out"""

def test_detect_phi_batch_processing(mock_medcat_client):
    """Test batch processing maintains order"""

def test_detect_phi_handles_empty_text(mock_medcat_client):
    """Test graceful handling of empty input"""
```

**Integration Tests** (`backend/tests/integration/test_phi_detection_integration.py`):
```python
def test_phi_detection_with_real_medcat_service():
    """Test detection with actual MedCAT service"""

def test_phi_detection_all_18_categories():
    """Test sample text with all 18 PHI types detected"""

def test_phi_detection_performance_benchmark():
    """Test 10-page note processed in <2 minutes"""
```

## Dependencies

**External**:
- Fine-tuned MedCAT model (Task 001) - **BLOCKS THIS TASK**
- MedCAT Service running (Phase 3)
- Redis (optional for caching)

**Internal**:
- `MedCATClient` (Phase 3 infrastructure)
- Pydantic schemas
- FastAPI settings

## Deliverables

1. **Service Implementation**: `backend/app/services/phi_detection_service.py`
2. **Pydantic Schema**: `backend/app/schemas/phi_entity.py`
3. **Unit Tests**: `backend/tests/unit/services/test_phi_detection_service.py`
4. **Integration Tests**: `backend/tests/integration/test_phi_detection_integration.py`
5. **Configuration**: Add `PHI_MODEL_PATH` to `.env`

## Estimated Time

**Total**: 20 hours (Week 4)

**Breakdown**:
- Service implementation: 8 hours
- Error handling & edge cases: 4 hours
- Unit tests: 4 hours
- Integration tests: 2 hours
- Documentation: 2 hours

## Success Metrics

- PHI detection works for all 18 categories
- Performance: <2 minutes per 10-page note
- Test coverage: >90%
- No crashes on invalid input

## Related Tasks

- **Depends on**: #001 (Fine-tuned MedCAT model)
- **Blocks**: #003 (De-identification service needs PHI detection)
- **Blocks**: #004 (Batch API needs PHI detection)
