A disease-agnostic biomarker discovery and risk stratification framework for precision medicine.
Developed by PulseLogic Biosciences Inc. under the discipline of Computational Bio-AI Engineering (CBAE).
Biomarker Probability Fusion (BPF) is a deterministic, auditable pipeline for transcriptomic biomarker discovery and patient risk stratification. BPF performs:
- Univariate biomarker ranking -- AUC-based discriminative power assessment with direction tracking for each gene/feature
- Adaptive gene selection -- Statistical filtering (AUC threshold + p-value) with configurable panel size
- Weighted score fusion -- AUC-weighted z-score composition with direction correction
- Risk stratification -- Probabilistic patient scoring with bootstrap confidence intervals
- Cross-validation -- 5x5 repeated stratified k-fold with complete within-fold feature selection (no information leakage)
BPF is disease-agnostic: the same algorithm and parameters have been validated across oncology, Alzheimer's disease, and Parkinson's disease without domain-specific modifications.
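The ranking, selection, and fusion steps listed above can be illustrated with a minimal sketch. This is not the BPF implementation: the function names (`rank_features`, `select_panel`, `fuse`), the normalization of weights to sum to one, and the samples x features layout are assumptions made for illustration. Only the thresholds are the defaults documented in this README.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def rank_features(X, y):
    """Return direction-corrected AUC, direction (+1/-1), and p-value per column.

    X is (n_samples, n_features); y holds binary labels (0 or 1).
    """
    pos, neg = X[y == 1], X[y == 0]
    aucs, dirs, pvals = [], [], []
    for j in range(X.shape[1]):
        u, p = mannwhitneyu(pos[:, j], neg[:, j], alternative="two-sided")
        raw = u / (len(pos) * len(neg))      # U statistic rescaled to an AUC
        dirs.append(1.0 if raw >= 0.5 else -1.0)
        aucs.append(max(raw, 1.0 - raw))     # fold AUC < 0.5 onto [0.5, 1]
        pvals.append(p)
    return np.array(aucs), np.array(dirs), np.array(pvals)


def select_panel(aucs, pvals, min_auc=0.55, pvalue_threshold=0.05, max_genes=100):
    """Indices passing both filters, best AUC first, capped at max_genes."""
    keep = np.flatnonzero((aucs >= min_auc) & (pvals <= pvalue_threshold))
    return keep[np.argsort(-aucs[keep], kind="stable")][:max_genes]


def fuse(X, panel, aucs, dirs):
    """AUC-weighted sum of direction-corrected z-scores (assumed weighting)."""
    z = (X[:, panel] - X[:, panel].mean(axis=0)) / X[:, panel].std(axis=0)
    weights = aucs[panel] / aucs[panel].sum()
    return (z * dirs[panel]) @ weights


# Synthetic demo: column 0 is raised and column 1 lowered in the positive class.
rng = np.random.default_rng(42)
y = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 20))
X[y == 1, 0] += 2.0
X[y == 1, 1] -= 2.0

aucs, dirs, pvals = rank_features(X, y)
panel = select_panel(aucs, pvals)
scores = fuse(X, panel, aucs, dirs)
print("panel:", panel, "directions:", dirs[panel])
```

Direction tracking lets a down-regulated feature contribute positively to the fused score: its z-score is multiplied by -1 before weighting, so higher scores consistently point toward the positive class.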
| Domain | Internal Benchmark | External Datasets | External Patients | External Mean AUC |
|---|---|---|---|---|
| Oncology | 39 TCGA + GDC + CGCI cohorts (16,958 patients, mean AUC 0.8077) | 40 | 14,578 | 0.8015 |
| Alzheimer's Disease | ADNI training reference | 8 | 2,972 | 0.8184 |
| Parkinson's Disease | PPMI training reference | 8 | 753 | 0.7928 |
| External Total | | 56 | 18,303 | |
| Full Portfolio (incl. internal benchmark) | | 95 | 35,261 | |
95 cohorts and datasets. 35,261 patients. Three disease domains. Tier breakdown: 48 GOLD / 24 CONDITIONAL / 23 EXPLORATORY.
Master evidence lock: SSRN_EVIDENCE_LOCK_v2 (SHA-256: 418c149e9fc541006d3158db69dfd3ebcfd9c9eeaa9be8b954083c69ce5d5f0e).
See BENCHMARK_SUMMARY.md for head-to-head benchmark results against LASSO, ElasticNet, and Random Forest.
pulselogic-bpf/
|-- README.md
|-- LICENSE
|-- CITATION.cff
|-- BENCHMARK_SUMMARY.md  # Benchmark v2.3 results vs. baselines
|-- PUBLICATIONS_AND_IP.md  # Manuscript status, patents, DOIs
|-- requirements.txt
|-- setup.py
|
|-- bpf/  # Core BPF pipeline
|   |-- __init__.py
|   |-- pipeline.py  # BPF v1.0.0 locked canonical pipeline
|   |-- pipeline_v2.py  # BPF v2.0 (full dataset, no CV)
|   |-- ranking.py  # Univariate AUC ranking with direction tracking
|   |-- selection.py  # Adaptive gene selection
|   |-- fusion.py  # AUC-weighted z-score fusion
|   |-- evaluation.py  # Bootstrap CI, risk stratification
|   `-- utils.py  # Preprocessing, I/O, gene mapping
|
|-- scripts/  # Execution scripts
|   |-- run_single_cohort.py  # Process a single dataset
|   |-- run_batch.py  # Batch processing across multiple datasets
|   `-- run_cross_validation.py  # 5x5 repeated stratified k-fold
|
|-- configs/  # Parameter configurations
|   |-- default_params.yaml  # Default BPF parameters
|   |-- oncology_params.yaml  # Phase 1 oncology configuration
|   |-- alzheimer_params.yaml  # Phase 2 AD configuration
|   `-- parkinson_params.yaml  # Phase 3 PD configuration
|
|-- data/  # Sample data for testing
|   `-- sample_expression.csv  # Small synthetic dataset for CI/CD
|
|-- tests/  # Unit and integration tests
|   |-- test_ranking.py
|   |-- test_selection.py
|   |-- test_fusion.py
|   |-- test_pipeline.py
|   `-- test_reproducibility.py  # Determinism verification (seed=42)
|
|-- results/  # Output directory (gitignored except examples)
|   `-- example_output/
|       |-- DATA.json
|       |-- DETAILED_STATS.txt
|       |-- EXECUTIVE_SUMMARY.txt
|       |-- FULL_AUC_RANKING.txt
|       |-- GENE_PANEL.txt
|       `-- SAMPLES.txt
|
`-- docs/
    |-- METHODS.md  # Detailed methodology documentation
    |-- PARAMETERS.md  # Parameter reference
    |-- OUTPUT_FORMAT.md  # Output file specifications
    `-- VALIDATION.md  # External validation summary
git clone https://github.com/pulselogicbio/pulselogic-bpf.git
cd pulselogic-bpf
pip install -r requirements.txt

- Python >= 3.11
- NumPy >= 1.21
- Pandas >= 1.3
- Scikit-learn >= 1.0
- SciPy >= 1.7
from bpf import BPFPipeline
# Initialize with default parameters
pipeline = BPFPipeline(
    min_auc=0.55,
    pvalue_threshold=0.05,
    max_genes=100,
    variance_threshold=0.01,
    seed=42,
)
# Load expression data (genes x samples) and binary outcome
X, y = pipeline.load_data("expression_matrix.tsv", "clinical_data.tsv")
# Run full pipeline
results = pipeline.run(X, y, cohort_name="MY_COHORT")
# Run with cross-validation
cv_results = pipeline.run_cv(X, y, n_splits=5, n_repeats=5)
# Save all 6 output files
pipeline.save_results(results, output_dir="results/MY_COHORT/")

| Parameter | Default | Description |
|---|---|---|
| min_auc | 0.55 | Minimum univariate AUC for gene inclusion |
| pvalue_threshold | 0.05 | Maximum Mann-Whitney p-value for gene inclusion |
| max_genes | 100 | Maximum genes in the fusion panel |
| variance_threshold | 0.01 | Minimum variance for gene retention |
| seed | 42 | Random seed for reproducibility |
| n_splits | 5 | Number of CV folds |
| n_repeats | 5 | Number of CV repeats |
| n_bootstrap | 1000 | Bootstrap resamples for confidence intervals |
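To show how `n_splits`, `n_repeats`, and `seed` interact with within-fold feature selection, here is a hedged sketch built on scikit-learn's `RepeatedStratifiedKFold`. The helpers `fit_panel`, `score`, and `cross_validate` are hypothetical names and not the repository's API, and the p-value filter is omitted for brevity. The point it illustrates is that selection, weights, and z-score statistics are all derived from the training rows of each fold, so the held-out rows never influence them.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RepeatedStratifiedKFold


def fit_panel(X, y, min_auc=0.55, max_genes=100):
    """Pick features and fusion weights from the training fold only."""
    raw = np.array([roc_auc_score(y, X[:, j]) for j in range(X.shape[1])])
    auc = np.maximum(raw, 1.0 - raw)            # direction-corrected AUC
    sign = np.where(raw >= 0.5, 1.0, -1.0)      # +1: higher in positives
    idx = np.argsort(-auc, kind="stable")[:max_genes]
    idx = idx[auc[idx] >= min_auc]
    return {
        "idx": idx,
        "weights": sign[idx] * auc[idx] / auc[idx].sum(),
        "mean": X[:, idx].mean(axis=0),
        "std": X[:, idx].std(axis=0) + 1e-12,   # guard against zero variance
    }


def score(panel, X):
    """Apply a fitted panel to new rows using the training-fold statistics."""
    z = (X[:, panel["idx"]] - panel["mean"]) / panel["std"]
    return z @ panel["weights"]


def cross_validate(X, y, n_splits=5, n_repeats=5, seed=42):
    """Held-out AUC for each of the n_splits * n_repeats folds."""
    cv = RepeatedStratifiedKFold(
        n_splits=n_splits, n_repeats=n_repeats, random_state=seed
    )
    fold_aucs = []
    for train, test in cv.split(X, y):
        panel = fit_panel(X[train], y[train])   # selection sees training rows only
        fold_aucs.append(roc_auc_score(y[test], score(panel, X[test])))
    return np.array(fold_aucs)


# Synthetic demo: five informative columns out of thirty.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 60)
X = rng.normal(size=(120, 30))
X[y == 1, :5] += 1.5
fold_aucs = cross_validate(X, y)
print(len(fold_aucs), "folds, mean held-out AUC:", round(fold_aucs.mean(), 3))
```

With the defaults, the 5x5 scheme yields 25 held-out AUC values. Fixing `random_state` makes the fold assignment, and therefore every downstream number, repeatable.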
Each BPF run produces 6 standardized output files:
| File | Description |
|---|---|
| DATA.json | Complete results in machine-readable format |
| DETAILED_STATS.txt | Full statistical report |
| EXECUTIVE_SUMMARY.txt | One-page summary with key metrics |
| FULL_AUC_RANKING.txt | All genes ranked by univariate AUC |
| GENE_PANEL.txt | Selected biomarker panel with directions |
| SAMPLES.txt | Per-patient BPF scores and risk groups |
BPF is fully deterministic: given the same input data, parameters, and seed, the pipeline produces identical results. This is verified by test_reproducibility.py, which checks bit-for-bit output consistency.
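As an illustration of how a stochastic step can still be bit-for-bit reproducible, the sketch below computes a seeded bootstrap confidence interval for an AUC and confirms that two runs agree exactly. The percentile method and the helper names (`auc`, `bootstrap_auc_ci`) are assumptions. This README only states that bootstrap CIs use `n_bootstrap=1000` and `seed=42`.

```python
import numpy as np


def auc(y, s):
    """Rank-based AUC; tied pairs count as half."""
    pos, neg = s[y == 1], s[y == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size


def bootstrap_auc_ci(y, s, n_bootstrap=1000, seed=42, alpha=0.05):
    """Percentile bootstrap CI for the AUC, driven by a single seeded generator."""
    rng = np.random.default_rng(seed)
    n = len(y)
    stats = []
    while len(stats) < n_bootstrap:
        idx = rng.integers(0, n, size=n)
        if y[idx].min() == y[idx].max():    # resample lost one class; redraw
            continue
        stats.append(auc(y[idx], s[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)


# Synthetic scores: the positive class is shifted up by one unit.
data_rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
s = data_rng.normal(size=100) + y

first = bootstrap_auc_ci(y, s)
second = bootstrap_auc_ci(y, s)
print(first, first == second)
```

Because the generator is created inside the function from the seed, repeated calls do not share hidden state, which is what makes the exact equality check meaningful.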
The locked canonical pipeline (BPF_LOCKED_PIPELINE_v1.py) is version-controlled and hash-verified via PIPELINE_AUDIT.json.
- Random seed: 42 (all stochastic operations)
- Pipeline version: v1.0.0 (locked February 11, 2026)
- Input verification: SHA-256 checksums on all expression and survival files
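Input verification of this kind can be sketched with Python's standard `hashlib`. The manifest layout (a mapping from path to expected digest) and the function names are hypothetical. The actual format of PIPELINE_AUDIT.json is not described here.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large matrices need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_inputs(expected):
    """Return the paths whose current digest differs from the recorded one."""
    return [p for p, want in expected.items() if sha256_of_file(p) != want]


# Demo on a throwaway file: record a digest, then detect a modification.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "expression_matrix.tsv"
    sample.write_bytes(b"abc")
    digest = sha256_of_file(sample)
    manifest = {str(sample): digest}
    unchanged = verify_inputs(manifest)
    sample.write_bytes(b"abcd")             # simulate a modified input
    changed = verify_inputs(manifest)

print(digest)
print("mismatches before edit:", unchanged)
print("mismatches after edit:", len(changed))
```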
Please cite this work as:
@article{dowden2026bpf,
  title={A stability-governed, tuning-free framework for feature selection
         in high-dimensional transcriptomic biomarker discovery},
  author={Dowden, Christopher B.},
  journal={Bioinformatics},
  year={2026},
  note={Submitted. Manuscript ID: BIOINF-2026-0795}
}

Software citation (for reproducibility artifacts):
@software{dowden2026bpfcode,
  title={PulseLogic BPF: Biomarker Probability Fusion Pipeline},
  author={Dowden, Christopher B.},
  year={2026},
  url={https://github.com/pulselogicbio/pulselogic-bpf},
  doi={10.5281/zenodo.19342790},
  version={1.0.1}
}

Alternatively, click the "Cite this repository" button (top-right of this page) to auto-generate citations from CITATION.cff.
Provisional patent: US 63/942,422 (filed December 2025) -- BPF methodology and multi-modal fusion framework. All IP assigned to PulseLogic Biosciences Inc. Methodology implementation is proprietary. This repository presents the research artifact companion to the submitted manuscript.
Christopher B. Dowden, Founder & CEO, PulseLogic Biosciences Inc. | ceo@pulselogic.bio | ORCID: 0009-0008-5690-3723
MIT License. See LICENSE for details.
Research software. Not for clinical decision-making. Not FDA cleared or approved.