GitHub - drgmo/WSI-Analytics: Explore the hidden information in your Whole Slide Images and optimize your DL-Training workflow. No more frustration with bad slides.

Whole Slide Image (WSI) dataset quality control and analytics platform for computational pathology. Provides comprehensive tile-level QC, class imbalance analysis, and optional integration with AtlasPatch for SAM2-based tissue detection.

Features

Triple pipeline architecture: NativeQC (built-in Otsu/OD masking + grid tiling), StampQC (STAMP-style brightness + Canny edge filtering), or AtlasPatchQC (SAM2-based tissue detection via external tool)
Per-tile QC scoring: tissue fraction, brightness, blur (Laplacian variance), white/background fraction, pen marks, folds
Multi-level statistics: tile, slide, patient, and label-level summaries with bag ratio tracking
Class imbalance analysis: entropy, effective N, imbalance ratio, before/after QC comparison
Interactive UI: Streamlit app with 3-tab workflow (Input, Parameters, Run & Results)
Headless CLI: batch processing with YAML config, presets, and CLI overrides
Exportable reports: HTML reports with embedded Plotly plots, CSV/Parquet exports
Multi-backend WSI reading: OpenSlide, tifffile, pyvips (uses whichever is installed)

Quick Start

Install

pip install -r requirements.txt

Install at least one WSI backend:

# OpenSlide (recommended)
pip install openslide-python
# macOS: brew install openslide
# Linux: apt install openslide-tools

# OR tifffile
pip install tifffile imagecodecs

# OR pyvips
pip install pyvips
# macOS: brew install vips
# Linux: apt install libvips-dev

Run Self-Check

python selfcheck.py

Streamlit UI

streamlit run app.py

CLI

# Preview run (30 slides)
python cli.py --csv /path/to/slides.csv

# Full run
python cli.py --csv /path/to/slides.csv --full --output runs/

# With AtlasPatch pipeline
python cli.py --csv /path/to/slides.csv --pipeline AtlasPatchQC --full

# Apply a preset
python cli.py --csv /path/to/slides.csv --preset conservative_tissue --full

# Custom config
python cli.py --csv /path/to/slides.csv --config configs/default.yaml --full

# Generate HTML report
python cli.py --csv /path/to/slides.csv --full --report

# Check available backends
python cli.py --backends

# Validate config
python cli.py --csv /path/to/slides.csv --validate

CSV Manifest Format

The input CSV must have these columns:

Column	Description
`patient_id`	Patient identifier
`slide_id`	Unique slide identifier
`slide_path`	Absolute path to WSI file (.svs, .ndpi, .tif, .mrxs, etc.)
`label`	Class label (e.g., tumor, normal)

Example:

patient_id,slide_id,slide_path,label
P001,S001,/data/slides/slide_001.svs,tumor
P001,S002,/data/slides/slide_002.svs,tumor
P002,S003,/data/slides/slide_003.svs,normal
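
A quick pre-flight check of the manifest can catch problems before a long run. The sketch below is illustrative only: `check_manifest` is a hypothetical helper, not a function from this repository, and the uniqueness and absolute-path checks are simply the column descriptions above turned into code.

```python
import csv
import io
import os

REQUIRED_COLUMNS = ("patient_id", "slide_id", "slide_path", "label")


def check_manifest(handle):
    """Return a list of problems found in a manifest (hypothetical helper)."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    seen = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        slide_id = row["slide_id"]
        if slide_id in seen:
            problems.append(f"line {line_no}: duplicate slide_id {slide_id}")
        seen.add(slide_id)
        if not os.path.isabs(row["slide_path"] or ""):
            problems.append(f"line {line_no}: slide_path is not absolute")
    return problems


example = io.StringIO(
    "patient_id,slide_id,slide_path,label\n"
    "P001,S001,slides/slide_001.svs,tumor\n"
    "P001,S001,slides/slide_002.svs,tumor\n"
)
for problem in check_manifest(example):
    print(problem)
```

The helper accepts any file-like object, so it works with `open("slides.csv", newline="")` as well as the in-memory example.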

Project Structure

WSI-Analytics/
├── app.py                          # Streamlit UI
├── cli.py                          # Headless batch CLI
├── selfcheck.py                    # Self-validation test suite (17 tests)
├── requirements.txt
├── configs/
│   └── default.yaml                # Default configuration
├── wsi_analyticspro/               # Core package
│   ├── __init__.py
│   ├── config.py                   # YAML config + validation + presets
│   ├── io/
│   │   ├── csv_ingest.py           # CSV loading + preview subset
│   │   └── artifacts.py            # Run directory + manifest + exports
│   ├── wsi/
│   │   ├── reader.py               # WSI reader + masking + tiling
│   │   └── metadata.py             # MPP/mag utilities + AtlasPatch check
│   ├── pipelines/
│   │   ├── qc_native.py            # NativeQC pipeline + SlideResult
│   │   ├── qc_stamp.py             # StampQC pipeline (STAMP-style tile selection)
│   │   ├── atlaspatch_wrapper.py   # AtlasPatch CLI wrapper + HDF5 parsing
│   │   └── orchestrator.py         # Pipeline dispatcher + PipelineResult
│   ├── stats/
│   │   ├── tile_stats.py           # Per-tile statistics
│   │   ├── slide_stats.py          # Per-slide statistics
│   │   ├── patient_stats.py        # Per-patient statistics
│   │   ├── label_stats.py          # Per-label statistics
│   │   └── imbalance.py            # Class imbalance analysis
│   └── viz/
│       ├── plots.py                # Plotly plots + montages
│       ├── overlays.py             # Mask/grid overlays
│       └── report.py               # HTML report builder
├── wsi_reader.py                   # (legacy) Multi-backend WSI reader
├── masking.py                      # (legacy) Tissue masking
├── tiling.py                       # (legacy) Tile generation + QC scoring
├── metrics.py                      # (legacy) Statistics
├── viz.py                          # (legacy) Visualizations
├── sweeps.py                       # (legacy) Parameter sweep engine
└── qc_pipeline.py                  # (legacy) Pipeline orchestration

Pipelines

NativeQC

Built-in pipeline using Otsu/OD thresholding for tissue masking and grid-based tiling:

Read slide with auto-detected backend
Generate thumbnail and tissue mask
Generate tile coordinates on tissue regions
Score each tile (tissue fraction, brightness, blur, white fraction)
Accept/reject based on QC thresholds
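
The scoring step can be sketched with NumPy alone. This is a minimal illustration under assumptions, not the code in `qc_native.py`: `score_tile` is a hypothetical name, grayscale is taken as the plain channel mean, the Laplacian uses a 4-neighbour stencil, and `white_level=220` is an assumed cutoff for counting white pixels.

```python
import numpy as np


def score_tile(rgb, white_level=220):
    """Score one RGB tile of shape (H, W, 3), dtype uint8 (hypothetical helper)."""
    gray = rgb.astype(np.float64).mean(axis=2)
    # 4-neighbour Laplacian evaluated on interior pixels only.
    lap = (
        gray[:-2, 1:-1] + gray[2:, 1:-1]
        + gray[1:-1, :-2] + gray[1:-1, 2:]
        - 4.0 * gray[1:-1, 1:-1]
    )
    return {
        "brightness": float(gray.mean()),          # mean intensity, 0-255
        "blur": float(lap.var()),                  # low variance suggests a blurry tile
        "white_fraction": float((gray >= white_level).mean()),
    }


# A flat white tile has no edges, so its Laplacian variance is zero.
flat = np.full((8, 8, 3), 255, dtype=np.uint8)
print(score_tile(flat))
```

Tissue fraction is omitted here because it depends on the tissue mask produced in the earlier step.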

StampQC

Replicates the STAMP tile-selection pipeline for compatibility with STAMP-based MIL workflows:

Generate non-overlapping tile grid (stride = tile size, no partial edge tiles)
Coarse brightness rejection at supertile level (grayscale intensity ≥ cutoff → background)
Fine Canny edge filter per tile (edge fraction < cutoff → discard)
Tiles surviving both filters are accepted
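
As a rough sketch of the selection logic above (not the code in `qc_stamp.py`), the grid and the two filters could look like this. `tile_grid` and `passes_stamp_filters` are hypothetical names, and the inputs `supertile_gray` and `edge_fraction` are assumed to be computed elsewhere, for example from a grayscale mean and a Canny edge map.

```python
def tile_grid(width, height, tile):
    """Top-left corners of a non-overlapping grid; partial edge tiles are dropped."""
    return [
        (x, y)
        for y in range(0, height - tile + 1, tile)
        for x in range(0, width - tile + 1, tile)
    ]


def passes_stamp_filters(supertile_gray, edge_fraction,
                         brightness_cutoff=240, canny_cutoff=0.02):
    """Apply both filters; a cutoff of None disables that filter."""
    if brightness_cutoff is not None and supertile_gray >= brightness_cutoff:
        return False  # bright supertile is treated as background
    if canny_cutoff is not None and edge_fraction < canny_cutoff:
        return False  # too few edges, tile is discarded
    return True


# A 1000 x 500 px region holds 4 x 2 full tiles of 224 px.
print(len(tile_grid(1000, 500, 224)))      # 8
print(passes_stamp_filters(200, 0.05))     # True
```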

STAMP QC defaults (applied automatically when pipeline is StampQC):

Parameter	Default	Description
`tile_size_um`	256.0	Physical tile size in µm
`tile_size_px`	224	Output tile size in pixels
`brightness_cutoff`	240	Supertile brightness threshold (set to `null` to disable)
`canny_cutoff`	0.02	Canny edge fraction threshold (set to `null` to disable)

Hardcoded constants (not configurable):

Canny low/high thresholds: 40 / 100
Overlap: 0 (non-overlapping grid)
Read level: 0 (full resolution)
Max supertile size: 1024 px

CLI example:

python cli.py --csv /path/to/slides.csv --pipeline StampQC --full
python cli.py --csv /path/to/slides.csv --pipeline StampQC --stamp-brightness-cutoff 230
python cli.py --csv /path/to/slides.csv --pipeline StampQC --stamp-disable-brightness

AtlasPatchQC

Wraps the external AtlasPatch tool (CC-BY-NC-SA-4.0):

Run AtlasPatch CLI on slide (SAM2-based tissue detection)
Parse HDF5 output (coords dataset: N x 5 int32)
Optionally score tiles with native QC metrics
Apply same accept/reject logic

Set up AtlasPatch:

pip install atlas-patch
# or clone and install from source

Configuration

Presets

Preset	Description
`fast_preview`	Low thresholds, fast iteration
`conservative_tissue`	High tissue fraction, strict brightness
`high_purity`	Maximum purity with blur + artifact detection

Key Parameters

Parameter	Default	Description
`tile_size_um`	256.0	Physical tile size in micrometers
`tile_size_px`	224	Tile size in pixels
`min_tissue_fraction`	0.5	Minimum tissue to accept tile
`brightness_max`	220	Maximum mean brightness (0-255)
`blur_threshold`	15.0	Minimum Laplacian variance
`min_tiles_per_slide`	5	Below this, slide is QC_FAIL
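
To show how these thresholds might combine, here is a minimal sketch using the defaults from the table. The function names, the reject-reason strings, and the `QC_PASS` status are assumptions made for illustration; only `QC_FAIL` and the parameter names come from this README.

```python
DEFAULTS = {
    "min_tissue_fraction": 0.5,
    "brightness_max": 220,
    "blur_threshold": 15.0,
    "min_tiles_per_slide": 5,
}


def reject_reasons(tile, cfg=DEFAULTS):
    """Return the reasons a tile fails QC; an empty list means accepted."""
    reasons = []
    if tile["tissue_fraction"] < cfg["min_tissue_fraction"]:
        reasons.append("low_tissue")
    if tile["brightness"] > cfg["brightness_max"]:
        reasons.append("too_bright")
    if tile["blur"] < cfg["blur_threshold"]:
        reasons.append("blurry")
    return reasons


def slide_status(tiles, cfg=DEFAULTS):
    """A slide fails when fewer than min_tiles_per_slide tiles are accepted."""
    accepted = sum(1 for t in tiles if not reject_reasons(t, cfg))
    return "QC_FAIL" if accepted < cfg["min_tiles_per_slide"] else "QC_PASS"


good = {"tissue_fraction": 0.8, "brightness": 180, "blur": 40.0}
bad = {"tissue_fraction": 0.2, "brightness": 230, "blur": 5.0}
print(reject_reasons(bad))                 # ['low_tissue', 'too_bright', 'blurry']
print(slide_status([good] * 4 + [bad]))    # QC_FAIL
```

Collecting every reason, rather than stopping at the first, is one way to feed a reject-reason breakdown like the one listed under Statistics.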

Resolution Rule

The target resolution is derived from tile sizing defaults:

Default: tile_size_um = 256.0, tile_size_px = 224
Derived: target_mpp = 256.0 / 224 ≈ 1.143 µm/px

You can override any one of these three values and the others will be computed.
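
The relationship is `target_mpp = tile_size_um / tile_size_px`. The sketch below is one possible reading of the rule, not the logic in `config.py`: `resolve_tile_geometry` is a hypothetical helper, and keeping the default pixel size when only `target_mpp` is given is an assumption.

```python
DEFAULT_UM = 256.0
DEFAULT_PX = 224


def resolve_tile_geometry(tile_size_um=None, tile_size_px=None, target_mpp=None):
    """Return (tile_size_um, tile_size_px, target_mpp) with the gaps filled in."""
    if target_mpp is None:
        um = DEFAULT_UM if tile_size_um is None else tile_size_um
        px = DEFAULT_PX if tile_size_px is None else tile_size_px
        return um, px, um / px
    if tile_size_um is None:
        # Assumption: fall back to the default pixel size if none was given.
        px = DEFAULT_PX if tile_size_px is None else tile_size_px
        return target_mpp * px, px, target_mpp
    if tile_size_px is None:
        return tile_size_um, round(tile_size_um / target_mpp), target_mpp
    raise ValueError("all three values given; leave one out so it can be derived")


um, px, mpp = resolve_tile_geometry()
print(um, px, round(mpp, 3))                      # 256.0 224 1.143
print(resolve_tile_geometry(tile_size_um=256.0, target_mpp=0.5))
```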

Run Artifacts

Each run creates a directory under runs/ with:

runs/<run_id>/
├── manifest.json           # Reproducibility metadata
├── config.yaml             # Frozen config
├── tiles.csv               # All tile records
├── tiles.parquet           # Tiles in Parquet format
├── slides_summary.csv      # Per-slide statistics
├── patients_summary.csv    # Per-patient statistics
├── labels_summary.csv      # Per-label statistics
├── report.html             # HTML report (if --report)
├── run.log                 # Processing log
├── thumbnails/             # Per-slide thumbnails + overlays
└── debug/                  # Debug artifacts

Statistics

Tile-Level

Tissue fraction, brightness, blur score, white fraction distributions
Reject reason breakdown with percentages
Borderline tile counts (near QC thresholds)

Slide-Level

Acceptance rate, bag ratio (accepted/candidates)
Per-slide metric distributions
Low-acceptance slide detection

Patient-Level

Tiles per patient, slides per patient
Patient outlier detection (>10% tile share)
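
The tile-share check can be illustrated with a few lines of standard-library Python. `patient_outliers` is a hypothetical helper, and whether the real statistic counts accepted tiles or all candidate tiles is not stated here, so the input is just one patient ID per counted tile.

```python
from collections import Counter


def patient_outliers(tile_patient_ids, max_share=0.10):
    """Map patient ID to tile share for patients above max_share of all tiles."""
    counts = Counter(tile_patient_ids)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {p: n / total for p, n in counts.items() if n / total > max_share}


# One patient contributes 50 of 100 tiles; fifty others contribute one each.
ids = ["P001"] * 50 + [f"Q{i:03d}" for i in range(50)]
print(patient_outliers(ids))   # {'P001': 0.5}
```

With ten or fewer patients at least one always exceeds a 10% share unless all ten are exactly equal, so the flag is most informative on larger cohorts.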

Label-Level

Before/after QC tile counts per label
Per-label acceptance rates and bag ratios

Class Imbalance

Shannon entropy and effective N classes
Imbalance ratio (max/min)
Before vs after QC comparison
Automatic warnings for worsened or severe imbalance
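
These metrics can be computed from per-label tile counts. The sketch below is an assumption-laden illustration rather than the code in `imbalance.py`: it uses the natural logarithm for Shannon entropy and takes effective N to be `exp(entropy)`, which is one common definition. `imbalance_summary` is a hypothetical name.

```python
import math


def imbalance_summary(counts):
    """Summarise class balance from a {label: tile_count} mapping."""
    nonzero = [n for n in counts.values() if n > 0]
    total = sum(nonzero)
    probs = [n / total for n in nonzero]
    entropy = -sum(p * math.log(p) for p in probs)
    return {
        "entropy": entropy,                       # in nats
        "effective_n": math.exp(entropy),         # equals class count when balanced
        "imbalance_ratio": max(nonzero) / min(nonzero),
    }


before = imbalance_summary({"tumor": 300, "normal": 100})
after = imbalance_summary({"tumor": 280, "normal": 40})
# QC removed proportionally more normal tiles, so the ratio grows from 3 to 7.
print(before["imbalance_ratio"], after["imbalance_ratio"])   # 3.0 7.0
```

Comparing the two summaries is the kind of before/after check that could trigger a "worsened imbalance" warning.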

Troubleshooting

OpenSlide not found

brew install openslide          # macOS
apt install openslide-tools     # Linux
pip install openslide-python

If you see `Library not loaded: libopenslide.0.dylib` on macOS:

export DYLD_LIBRARY_PATH="$(brew --prefix openslide)/lib:$DYLD_LIBRARY_PATH"

No backend available

The tool still runs but marks every slide as unreadable. Install at least one backend (OpenSlide recommended).

Streamlit port conflict

streamlit run app.py --server.port 8502

Large datasets are slow

Use preview mode first (default: 30 slides)
Increase mask_workscale_mpp (e.g., 16.0) for faster masking
Set max_tiles_per_slide to limit tile count
Use the CLI for batch processing
