DeepCatch is an open-source computational framework for multi-cancer early detection (MCED) from cell-free DNA (cfDNA). It fuses seven complementary molecular modalities through a self-supervised Transformer foundation model, tracks patients longitudinally with Bayesian Kalman filtering, and predicts tissue of origin, all within a single CET (Capture → Enhance → Triage) pipeline: two processing stages (Capture, Enhance) followed by a triage decision.
v2.1 adds GNN methylation field-defect detection, enhanced fragmentomics (DELFI + MFS + nucleosome + refined 5-mer), cfSort-style tissue deconvolution, a multi-modal foundation model, and priming agent PK/PD simulation.
⚠️ Research-stage software. Not for clinical diagnosis. See §11 for real-plasma validation status.
```
cfDNA Sample
    │
    ├── Stage 1 (Capture): 7 Modalities ──────────────────────────┐
    │   ├── Fragmentomics Basic      MFR, FSI, CAFF, FEM          │
    │   ├── Enhanced Fragmentomics   DELFI + MFS + nucleosome     │
    │   ├── CNV                      6-D chromosomal instability  │
    │   ├── Serological              PG-I, PG-II, G-17, Hp        │
    │   ├── GNN Methylation Network  GATv2 field defect detection │
    │   ├── Tissue Deconvolution     cfSort-style DNN (24-D)      │
    │   └── Priming Agents           PK/PD + denoising            │
    │                                                             │
    └──→ Multi-Modal Foundation Model (Transformer) ←─────────────┘
              │
              ├── Stage 2 (Enhance): Longitudinal
              │   └── Bayesian Kalman Filter (BSSLM)
              │
              └──→ Detection Decision: p_cancer > τ
```
```bash
git clone https://github.com/rollroyces/deepcatch.git
cd deepcatch
pip install -r requirements_py.txt
```

Minimum dependencies:

```bash
pip install numpy scipy scikit-learn pandas
```

With deep learning (GNN, foundation model, tissue deconv):

```bash
# Quote the specifier so the shell does not treat ">" as an output redirect
pip install "torch>=2.0.0" torch-geometric
```

Optional, for BAM/FASTQ processing:

```bash
pip install pysam statsmodels
```

Docker:

```bash
docker build -t deepcatch:latest .
docker run --rm -v "$(pwd)/results:/app/results" deepcatch:latest
```

Per-modality feature extraction:

```python
import numpy as np
import scipy.stats  # required for scipy.stats.entropy in the CNV block

from src.fragmentomics import EnhancedFragmentomics
from src.fragmentomics.themis_features import (
    MFRCalculator, FSICalculator, CAFFCalculator, FEMCalculator
)
from src.methylation_gnn import RegulatoryGraphBuilder, MethylationGNNPredictor
from src.tissue_deconv import DEConvIntegration
from src.priming.pharmacokinetics import PKModel, OptimalDosingSchedule

# Inputs such as coverage, cpg_density, fragment_lengths, fragments, end_seqs,
# tss_positions, meth_data, cnv_profile, cnv_log2_ratios, and the serological
# values are placeholders for data you have already loaded.

# ── Fragmentomics Basic ──
mfr = MFRCalculator()
fsi = FSICalculator()
caff = CAFFCalculator()
fem = FEMCalculator()
frag_basic = {
    "mfr": mfr.compute(coverage, cpg_density),
    "fsi": fsi.compute(fragment_lengths),
    "caff": caff.compute(cnv_profile),
    "fem": fem.compute(end_motif_counts),
}

# ── Enhanced Fragmentomics (DELFI + MFS + nucleosome + 5-mer) ──
ef = EnhancedFragmentomics()
frag_enhanced = ef.extract_all(
    fragment_lengths=fragment_lengths,
    fragments=fragments,
    end_sequences=end_seqs,
    tss_positions=tss_positions,
)
# → dict of ~70 scalar features

# ── GNN Methylation Network ──
gnn = MethylationGNNPredictor.load("checkpoints/gnn_pretrained.pt")
graph = RegulatoryGraphBuilder().build_graph(
    sample_name="S001", methylation_data=meth_data
)
field_defect_score = gnn.predict_sample(
    sample_name="S001", methylation_data=meth_data
)

# ── Tissue Deconvolution ──
deconv = DEConvIntegration(checkpoint="checkpoints/deconv.pt")
# Or train from scratch on synthetic mixtures:
# deconv.fit_synthetic(n_samples=2000)
tissue_fractions = deconv.predict_tissue_fractions(meth_data)
tissue_features = deconv.extract_all(meth_data, tissue_fractions)
# → dict of 24 scalar features

# ── CNV ──
# max_arm_imbalance is a user-supplied helper; it is not imported above.
cnv_features = {
    "cnv_burden": np.mean(np.abs(cnv_log2_ratios)),
    "cnv_entropy": scipy.stats.entropy(cnv_segment_lengths),
    "arm_imbalance": max_arm_imbalance(cnv_profile),
}

# ── Serological ──
sero_features = {
    "pg1": pg1_value, "pg2": pg2_value,
    "g17": g17_value, "hp": hp_igg_value,
}

# ── Priming Agent PK/PD ──
pk = PKModel()
pk_result = pk.simulate(
    agent="scFv", dose_mg=100, patient_weight_kg=70,
    duration_hours=48,
)
dosing = OptimalDosingSchedule().compute(
    agent="scFv", patient_data={"weight_kg": 70}
)
```

Fusion with the foundation model:

```python
from src.foundation import FoundationDownstream, FoundationConfig

# Assemble modalities dict (n_samples × dim for each key)
modalities = {
    "frag_basic": np.array(frag_basic_array),        # (N, 4)
    "frag_enhanced": np.array(frag_enhanced_array),  # (N, 44)
    "cnv": np.array(cnv_array),                      # (N, 6)
    "sero": np.array(sero_array),                    # (N, 4)
    "gnn": np.array(gnn_scores),                     # (N, 1)
    "tissue": np.array(tissue_array),                # (N, 24)
}

# Use pre-trained checkpoint
fusion = FoundationDownstream(pretrained=True)
fusion.fit(modalities, labels)
proba = fusion.predict_proba(modalities)   # shape (N, 2)
predictions = fusion.predict(modalities)   # shape (N,)

# Or train from scratch (no pre-training needed)
fusion = FoundationDownstream(pretrained=False)
fusion.fit(modalities, labels, n_epochs=50, batch_size=32)
proba = fusion.predict_proba(modalities)
```

Fusion with `CrossAttentionFusion`:

```python
from src.multimodal_fusion.advanced_fusion import CrossAttentionFusion

# List of 1-D score arrays per modality
scores = [mfr_scores, fsi_scores, caff_scores, fem_scores, cnv_scores]
fusion = CrossAttentionFusion(n_modalities=5)
fusion.fit(scores, labels)
proba = fusion.predict_proba(scores)
```

Clinical report generation:

```python
from src.clinical import ClinicalReportGenerator

crg = ClinicalReportGenerator(cet_df, fusion_result)
print(crg.generate_briefing())      # One-paragraph summary
crg.export_json("report.json")      # Machine-readable export
with open("report.html", "w") as f:
    f.write(crg.generate_html_report())  # Full HTML report
```

One-command validation:

```bash
bash RUN_ALL.sh          # Full pipeline
bash RUN_ALL.sh --quick  # 2-minute smoke test
```

Purpose: cfDNA fragmentation pattern analysis implementing DELFI, MDS, and THEMIS-equivalent feature frameworks.
| Class / Function | Description |
|---|---|
| `MFRCalculator` | Methylated Fragment Ratio via CpG density scoring |
| `FSICalculator` | Fragment Size Index: short/long ratio + GMM sub-nucleosomal fraction |
| `CAFFCalculator` | Chromosomal Aneuploidy: CNA burden scoring from whole-genome bins |
| `FEMCalculator` | Fragment End Motif: 4-mer MDS + motif embeddings (Jiang 2020) |
| `FragmentLengthGMM` | 4-component Gaussian Mixture Model (sub-/mono-/di-/tri-nucleosomal) |
| `DELFI_style_normalization` | LOESS GC-bias correction + mappability filter |
| `compute_MDS` | Motif Diversity Score from 4/5-mer counts |
| `EnhancedFragmentomics` | Unified extractor: DELFI + MFS + nucleosome footprint + refined 5-mer |
| `extract_4mer_end_motifs` | 4-mer extraction from BAM files |
| `extract_end_motifs_from_fastq` | 4-mer extraction from FASTQ |
Input: BAM/FASTQ files, or fragment length arrays + end sequences
Output: Scalar features (4–80+), GMM component statistics, MDS scores
Tests: 47 (test_enhanced_features.py)
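The Motif Diversity Score listed in the table can be illustrated with a short sketch. It assumes the usual definition of MDS as the Shannon entropy of end-motif frequencies, normalized by the log of the number of possible motifs (256 for 4-mers). The function name `motif_diversity_score` and the toy inputs are hypothetical; this is not the repository's `compute_MDS` implementation.

```python
import math
from collections import Counter
from itertools import product


def motif_diversity_score(end_motifs, k=4):
    """Normalized Shannon entropy of k-mer end-motif frequencies, in [0, 1]."""
    valid = {"".join(p) for p in product("ACGT", repeat=k)}
    # Ignore motifs with ambiguous bases (e.g. "N") or the wrong length.
    counts = Counter(m for m in end_motifs if m in valid)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no valid k-mer end motifs supplied")
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(valid))


# Toy inputs: every 4-mer seen once (maximal diversity) vs. a single motif.
uniform_motifs = ["".join(p) for p in product("ACGT", repeat=4)]
single_motif = ["CCCA"] * 100

mds_uniform = motif_diversity_score(uniform_motifs)
mds_single = motif_diversity_score(single_motif)
```

A perfectly uniform motif distribution scores 1 and a single repeated motif scores 0; real samples fall in between.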
Purpose: Detect pre-cancer epigenetic field defects via GATv2 graph attention on methylation regulatory graphs.
| Class / Function | Description |
|---|---|
| `RegulatoryGraphBuilder` | Constructs heterogeneous graphs from methylation + Hi-C contacts |
| `MethylationGNN` | GATv2 model with reconstruction decoder + anomaly head |
| `GNNTrainer` | 3-phase training: masked pre-training → joint → fine-tuning |
| `GNNInference` / `MethylationGNNPredictor` | Lightweight inference producing `field_defect_score` |
| `ReferenceDataCatalog` | Downloads UCSC CpG islands, ENCODE Hi-C, GENCODE promoters, FANTOM5 enhancers |
| `MethylationBranchAdapter` | Drop-in adapter for `CrossAttentionFusion` compatibility |
Input: cfDNA methylation beta values + reference Hi-C/chromatin data
Output: Graph-level field_defect_score (scalar) per sample
Tests: 54 (test_integration.py)
Purpose: Predict tissue-of-origin cfDNA fractions from methylation data using a cfSort-style DNN.
| Class / Function | Description |
|---|---|
| `TissueAtlas` | 29-tissue reference methylation profile store |
| `TissueDeconvolutionModel` | Lightweight DNN (~500K params): [256, 128, 64] + BN + ReLU + Dropout |
| `TissueDeconvolutionEnsemble` | 3-model ensemble with seed diversity |
| `TissueDeconvTrainer` | KL divergence + L1 sparsity + entropy regularization on synthetic mixtures |
| `TissueDeconvolutionFeatures` | Extracts 24-D feature vector from tissue fractions |
| `DEConvIntegration` | Full integration class compatible with existing pipeline |
Input: cfDNA methylation beta values (or synthetic atlas for training)
Output: Per-tissue fraction vector + 24-D feature vector
Tests: 54 (test_integration.py)
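Training on synthetic mixtures can be pictured as follows: draw tissue fractions that sum to one, then blend the per-tissue reference methylation profiles with those weights. The sketch below is a minimal illustration under stated assumptions (Dirichlet-distributed fractions, a linear mixing model, a three-tissue toy atlas). The helper names `sample_fractions` and `mix_profiles` are hypothetical and do not describe how `TissueDeconvTrainer` or `fit_synthetic` actually generate data.

```python
import random


def sample_fractions(n_tissues, alpha=0.5, rng=None):
    """Draw Dirichlet(alpha) tissue fractions via normalized gamma draws."""
    rng = rng if rng is not None else random.Random()
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_tissues)]
    total = sum(draws)
    return [d / total for d in draws]


def mix_profiles(atlas, fractions):
    """Fraction-weighted average of per-tissue beta values at each CpG site."""
    n_cpg = len(atlas[0])
    return [
        sum(f * profile[j] for f, profile in zip(fractions, atlas))
        for j in range(n_cpg)
    ]


# Toy atlas: 3 tissues x 4 CpG sites, beta values in [0, 1] (illustrative only).
toy_atlas = [
    [0.9, 0.1, 0.5, 0.2],
    [0.1, 0.8, 0.5, 0.7],
    [0.3, 0.3, 0.9, 0.1],
]

rng = random.Random(0)
fractions = sample_fractions(len(toy_atlas), rng=rng)  # training target
mixture = mix_profiles(toy_atlas, fractions)           # training input
```

Each `(mixture, fractions)` pair is one synthetic training example; a small `alpha` favours sparse mixtures dominated by a few tissues.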
Purpose: Self-supervised multi-modal Transformer pre-training for cfDNA. Drop-in replacement for CrossAttentionFusion.
| Class / Function | Description |
|---|---|
| `FoundationConfig` | Hyperparameter dataclass (embed_dim, n_heads, n_layers, etc.) |
| `MultiModalEncoder` | 4-layer TransformerEncoder with per-modality linear projections |
| `PretrainHead` | Masked modality prediction head |
| `ContrastiveHead` | Cross-modal contrastive loss (InfoNCE) |
| `FoundationPretrainer` | Self-supervised pre-training orchestrator |
| `FoundationDownstream` | Downstream fine-tuning with `CrossAttentionFusion`-compatible API |
| `FoundationCompatibilityWrapper` | Wrapper for seamless replacement of `CrossAttentionFusion` |
| `MultiModalDataGenerator` | Synthetic multi-modal data generator for pre-training |
Pre-training tasks:
- Masked modality prediction — reconstruct masked modalities from context
- Cross-modal contrastive — InfoNCE between modalities of same sample
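To make the cross-modal contrastive task concrete, here is a minimal pure-Python sketch of an InfoNCE loss. It assumes cosine similarity, a temperature of 0.1, and that row `i` of each list holds embeddings of the same sample from two different modalities. The function names are hypothetical, and the repository's `ContrastiveHead` may differ in similarity measure, temperature, and batching.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; anchors[i] and positives[i] share a sample."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of picking the matching positive.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings for two samples seen through two modalities.
modality_a = [[1.0, 0.0], [0.0, 1.0]]
modality_b_aligned = [[0.9, 0.1], [0.1, 0.9]]
modality_b_shuffled = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce(modality_a, modality_b_aligned)
loss_shuffled = info_nce(modality_a, modality_b_shuffled)
```

The loss is small when each sample's two modality embeddings agree and large when they are mismatched, which is the signal that pulls same-sample modalities together during pre-training.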
API compatibility:
```python
# CrossAttentionFusion (old)
fusion = CrossAttentionFusion(n_modalities=6)
fusion.fit(scores, labels)                # scores: list of 1-D arrays
proba = fusion.predict_proba(scores)

# FoundationDownstream (new, drop-in)
fusion = FoundationDownstream(pretrained=True)
fusion.fit(modalities, labels)            # modalities: dict of (N, D) arrays
proba = fusion.predict_proba(modalities)  # shape (N, 2)
```

Input: Dict of modality arrays {name: np.ndarray (N, D)}
Output: Joint embeddings (N, n_modalities, embed_dim); classification probabilities (N, 2)
Tests: 43 (test_integration.py)
Purpose: Simulate PK/PD of cfDNA priming agents (Amplifyer Bio) and their effect on ctDNA detection.
| Class / Function | Description |
|---|---|
| `PKModel` | 1-compartment PK model with first-order elimination |
| `OptimalDosingSchedule` | Computes optimal dosing for 5 agent types |
| `PrimingConfig` | Dataclass with literature-based PK parameters |

Agents: scFv, liposome, nanoparticle, polymeric micelle, dendrimer
Input: Agent type, dose, patient weight, liver function
Output: Concentration-time profiles, ctDNA boost factor, optimal dosing schedule
Reference: Martin-Alonso et al. (2024) Science
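A one-compartment model with first-order elimination has a simple closed form after an intravenous bolus: C(t) = (dose / V) · exp(−k·t), with k = ln 2 / half-life. The sketch below evaluates that formula. The volume of distribution (0.1 L/kg) and the 6-hour half-life are placeholder values chosen for illustration, not the literature-based parameters in `PrimingConfig`, and `concentration_mg_per_l` is a hypothetical helper rather than part of `PKModel`.

```python
import math


def concentration_mg_per_l(dose_mg, weight_kg, t_hours,
                           vd_l_per_kg=0.1, half_life_h=6.0):
    """Plasma concentration after an IV bolus, one-compartment model."""
    k_el = math.log(2) / half_life_h          # first-order elimination rate (1/h)
    c0 = dose_mg / (vd_l_per_kg * weight_kg)  # initial concentration (mg/L)
    return c0 * math.exp(-k_el * t_hours)


# 100 mg dose, 70 kg patient, sampled every 6 hours over 48 hours.
profile = [(t, concentration_mg_per_l(100, 70, t)) for t in range(0, 49, 6)]
```

With these placeholder parameters the concentration halves at every 6-hour sample, which makes the profile easy to check by eye before swapping in agent-specific values.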
| Class / Function | Description |
|---|---|
| `CrossAttentionFusion` | Relation-aware cross-attention between modality embeddings |
| `GCNTissueOfOrigin` | Heterogeneous GCN for TOO prediction at low sequencing depth |
| `EarlyLateFusion` | Sample-modality evaluator MLP |

| Class / Function | Description |
|---|---|
| `SerologicalFusion` | Fuses PG-I, PG-II, G-17, H. pylori with cfDNA predictions |
| `IntegrativeScoringSystem` | Unified risk scoring across all modalities |
| `ClinicalReportGenerator` | Generates clinician-friendly HTML/JSON reports |
| `NestedCETValidator` | Nested cross-validation for unbiased motif-based CET evaluation |
| `FrequencyDataset` | Loads pre-computed 4-mer frequency vectors (Jiang lab format) |
Bayesian Kalman filter (BSSLM) for longitudinal evidence accumulation across quarterly blood draws. Tracks patient risk trajectory over time rather than relying on single-timepoint decisions.
MAML-based few-shot adaptation for cancer subtype detection.
Multi-confounder realistic cohort generation (CHIP, variable shedding, trinucleotide errors, GC bias, batch effects, inflammation) for development and testing.
```bash
# All tests
python -m pytest src/ -v

# Or with unittest
python -m unittest discover -s src -p "test_*.py"

# Per-module
python src/foundation/test_integration.py            # 43 tests
python src/methylation_gnn/test_integration.py       # 54 tests
python src/tissue_deconv/test_integration.py         # 54 tests
python src/fragmentomics/test_enhanced_features.py   # 47 tests

# Quick smoke test
python -c "from src.foundation import FoundationConfig; print('OK')"
```

| Module | Tests | Status |
|---|---|---|
| Enhanced Fragmentomics | 47 | ✅ All passing |
| GNN Methylation | 54 | ✅ All passing |
| Tissue Deconvolution | 54 | ✅ All passing |
| Foundation Model | 43 | ✅ All passing |
| Total | 198 | ✅ |
Seven independent modalities extract signal from the same cfDNA sample. Each produces a fixed-length feature vector (a single score in the case of the GNN). The foundation model fuses these into a joint embedding: per-modality linear projections feed a 4-layer Transformer encoder.
Longitudinal tracking via Bayesian Kalman filter (BSSLM). The joint embedding from Stage 1 is tracked across quarterly blood draws, accumulating evidence over time. This is designed to detect cancers whose ctDNA signal is below single-timepoint detection thresholds at early stages.
The accumulated Bayesian posterior probability p_cancer is compared to a calibrated threshold τ. Samples above the threshold trigger confirmatory testing; samples below are cleared until the next quarterly draw.
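The accumulate-then-threshold logic can be sketched with a scalar Kalman filter. This is a minimal illustration, not the BSSLM implementation in `src/longitudinal/`: it assumes each quarterly draw yields one noisy risk score on the logit scale, that the latent risk follows a random walk, and that `p_cancer` is the sigmoid of the filtered mean. All variances, the prior, and τ = 0.5 are placeholder values.

```python
import math


def kalman_update(mean, var, obs, process_var=0.05, obs_var=1.0):
    """One predict/update step of a scalar random-walk Kalman filter."""
    var_pred = var + process_var             # predict: uncertainty grows
    gain = var_pred / (var_pred + obs_var)   # weight given to the new draw
    new_mean = mean + gain * (obs - mean)
    new_var = (1.0 - gain) * var_pred
    return new_mean, new_var


def track_patient(logit_scores, prior_mean=-3.0, prior_var=1.0, tau=0.5):
    """Return (p_cancer, flagged) after each successive blood draw."""
    mean, var = prior_mean, prior_var
    history = []
    for score in logit_scores:
        mean, var = kalman_update(mean, var, score)
        p_cancer = 1.0 / (1.0 + math.exp(-mean))
        history.append((p_cancer, p_cancer > tau))
    return history


# Five quarterly draws with a slowly rising signal (toy values).
history = track_patient([-1.0, 0.5, 1.5, 2.5, 3.0])
```

Because the prior starts low and each draw only shifts the estimate by the Kalman gain, a single elevated score does not trigger a flag; a sustained upward trend does.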
| Modality | Required Data | Public Source |
|---|---|---|
| Fragmentomics Basic | Fragment length arrays, end motif counts | N/A (extracted from BAM/FASTQ) |
| Enhanced Fragmentomics | Fragment lengths + genomic coordinates + end sequences | Same as above |
| CNV | Log2 ratio profiles or BAM | Same as above |
| Serological | PG-I, PG-II, G-17, H. pylori IgG | Clinical lab |
| GNN Methylation | cfDNA methylation beta values | TCGA, GEO |
| Tissue Deconvolution | cfDNA methylation beta values | TCGA, cfSort atlas |
| Priming Agents | Agent PK parameters | Literature |
| Resource | URL |
|---|---|
| ENCODE Hi-C | https://www.encodeproject.org/ |
| UCSC CpG Islands | http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/ |
| GENCODE promoters | https://www.gencodegenes.org/human/ |
| FANTOM5 enhancers | https://fantom.gsc.riken.jp/5/ |
| TCGA methylation | https://portal.gdc.cancer.gov/ |
| cfSort atlas | https://github.com/stephenrcraig/cfSort |
All modules support fully synthetic data for development and testing. Use MultiModalDataGenerator (foundation), TissueAtlas (deconv with built-in synthetic profiles), and ReferenceDataCatalog (GNN with random initialization) to run the full pipeline without any external reference data.
```
deepcatch/
├── README.md                # This file
├── LICENSE                  # MIT
├── CITATION.cff             # Academic citation metadata
├── requirements_py.txt      # Python dependencies
├── RUN_ALL.sh               # One-command validation
├── Dockerfile
│
├── src/
│   ├── fragmentomics/       # FragmentoSign: DELFI, MDS, GMM, LOESS, enhanced
│   ├── methylation_gnn/     # GATv2 graph attention for field defect detection
│   ├── tissue_deconv/       # cfSort-style DNN for tissue-of-origin
│   ├── foundation/          # Self-supervised Transformer foundation model
│   ├── priming/             # PK/PD priming agent simulation
│   ├── multimodal_fusion/   # CrossAttentionFusion, GCN, EarlyLate
│   ├── clinical/            # Serological fusion, clinical reports, CET validation
│   ├── longitudinal/        # Bayesian Kalman filter (Stage 2)
│   ├── ensemble/            # MAML meta-learning
│   ├── synthetic_data/      # Realistic cohort generation
│   ├── variant_calling/     # Bayesian + contrastive DL
│   └── preprocessing/       # CHIP filter
│
├── validation/              # Statistical validation suite
│   ├── py/                  # Python validation modules (11)
│   ├── tcga/                # TCGA data loaders + validators
│   └── *.py                 # 10 bioinformatics-grade modules
│
├── test/                    # Additional test suites
├── results/                 # Output reports + figures
├── paper/                   # LaTeX manuscript
├── docs/                    # User guide
└── review/                  # Peer review history
```
- Create module directory under `src/your_modality/`
- Implement feature extractor with `extract_all()` or `predict_sample()` entry point
- Define config with dataclass `YourModalityConfig`
- Add integration class that wraps your module for the fusion API
- Write tests: aim for ≥20 tests covering config, forward pass, edge cases, and integration
- Update `MODALITY_DIMS` in `src/foundation/config.py`
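The steps above can be sketched as a minimal skeleton. Everything here is illustrative: the config fields, the three summary features, and the assumed shape of `MODALITY_DIMS` (a mapping from modality name to feature dimension) are hypothetical, so check `src/foundation/config.py` for the real structure before copying this pattern.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Dict, Optional, Sequence


@dataclass
class YourModalityConfig:
    """Hypothetical config: only the clipping range is configurable."""
    clip_min: float = 0.0
    clip_max: float = 1.0


class YourModalityExtractor:
    """Toy extractor exposing the extract_all() entry point."""

    def __init__(self, config: Optional[YourModalityConfig] = None) -> None:
        self.config = config if config is not None else YourModalityConfig()

    def extract_all(self, values: Sequence[float]) -> Dict[str, float]:
        """Return a fixed-length dict of scalar features for one sample."""
        if not values:
            raise ValueError("values must be non-empty")
        cfg = self.config
        clipped = [min(max(v, cfg.clip_min), cfg.clip_max) for v in values]
        return {
            "mean": mean(clipped),
            "std": pstdev(clipped),
            "max": max(clipped),
        }


# Assumed registry shape: modality name -> number of features.
MODALITY_DIMS = {"your_modality": 3}

features = YourModalityExtractor().extract_all([0.2, 0.4, 1.7])
```

Keeping the number of returned features equal to the registered dimension is the one invariant worth testing first, since the fusion model sizes its input projection from it.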
- Type hints on all public APIs
- NumPy docstring style with Parameters/Returns sections
- Tests use pytest or unittest; run them before submitting
Open an issue first to discuss scope. Target main branch. PRs must pass all existing tests.
Preliminary validation on 129 real plasma samples from the Jiang lab (CUHK), using 4-mer end-motif frequency vectors. The metrics below cover the HCC vs. control subset (72 samples):
| Metric | Value |
|---|---|
| Samples (HCC vs Control) | 72 (34 HCC, 38 Control) |
| Nested CV AUC | 0.986 |
| Bonferroni-significant motifs | 108 / 256 |
| Biological pattern | CG-rich depletion, AT-rich enrichment |
Caveats: HCC only (other types n≤17), processed frequency data (not raw BAM), single centre. Not a clinical assay.
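For readers unfamiliar with the "Bonferroni-significant" count, the correction divides the significance level by the number of tests, here the 256 possible 4-mer motifs. The sketch below applies that rule to a dictionary of per-motif p-values. The significance level of 0.05, the toy p-values, and the function name are assumptions; the source does not state which per-motif test or alpha produced the 108 / 256 figure.

```python
from itertools import product


def bonferroni_significant(p_values, alpha=0.05):
    """Return motifs whose p-value is below alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return sorted(m for m, p in p_values.items() if p < threshold)


# Toy p-values for all 256 4-mers (illustrative, not real results).
motifs = ["".join(k) for k in product("ACGT", repeat=4)]
p_values = {m: 0.5 for m in motifs}
p_values["CCCA"] = 1e-6   # below 0.05 / 256, survives correction
p_values["AAAT"] = 1e-3   # below 0.05 but fails correction

significant = bonferroni_significant(p_values)
```

With 256 tests the per-motif threshold is roughly 2 × 10⁻⁴, so a motif that looks significant at the nominal 0.05 level can still be rejected.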
License: MIT — see LICENSE.
Cite as:
```bibtex
@software{deepcatch2026,
  title   = {{DeepCatch}: Multi-Modal Longitudinal MCED Framework
             for Early Cancer Detection from cfDNA},
  author  = {Royce and DeepCatch Contributors},
  year    = {2026},
  version = {2.1.0},
  url     = {https://github.com/rollroyces/deepcatch},
}
```

Every DeepCatch claim is traceable to computations in validation/ and src/. No numbers are invented. No clinical claims are intended. 🧬