Explore the hidden information in your Whole Slide Images and optimize your DL-Training workflow. No more frustration with bad slides.

drgmo/WSI-Analytics

Whole Slide Image (WSI) dataset quality control and analytics platform for computational pathology. Provides comprehensive tile-level QC, class imbalance analysis, and optional integration with AtlasPatch for SAM2-based tissue detection.

Features

  • Triple pipeline architecture: NativeQC (built-in Otsu/OD masking + grid tiling), StampQC (STAMP-style brightness + Canny edge filtering), or AtlasPatchQC (SAM2-based tissue detection via external tool)
  • Per-tile QC scoring: tissue fraction, brightness, blur (Laplacian variance), white/background fraction, pen marks, folds
  • Multi-level statistics: tile, slide, patient, and label-level summaries with bag ratio tracking
  • Class imbalance analysis: entropy, effective N, imbalance ratio, before/after QC comparison
  • Interactive UI: Streamlit app with 3-tab workflow (Input, Parameters, Run & Results)
  • Headless CLI: batch processing with YAML config, presets, and CLI overrides
  • Exportable reports: HTML reports with embedded Plotly plots, CSV/Parquet exports
  • Multi-backend WSI reading: OpenSlide, tifffile, pyvips (uses whichever is installed)

Quick Start

Install

pip install -r requirements.txt

Install at least one WSI backend:

# OpenSlide (recommended)
pip install openslide-python
# macOS: brew install openslide
# Linux: apt install openslide-tools

# OR tifffile
pip install tifffile imagecodecs

# OR pyvips
pip install pyvips
# macOS: brew install vips
# Linux: apt install libvips-dev

Run Self-Check

python selfcheck.py

Streamlit UI

streamlit run app.py

CLI

# Preview run (30 slides)
python cli.py --csv /path/to/slides.csv

# Full run
python cli.py --csv /path/to/slides.csv --full --output runs/

# With AtlasPatch pipeline
python cli.py --csv /path/to/slides.csv --pipeline AtlasPatchQC --full

# Apply a preset
python cli.py --csv /path/to/slides.csv --preset conservative_tissue --full

# Custom config
python cli.py --csv /path/to/slides.csv --config configs/default.yaml --full

# Generate HTML report
python cli.py --csv /path/to/slides.csv --full --report

# Check available backends
python cli.py --backends

# Validate config
python cli.py --csv /path/to/slides.csv --validate

CSV Manifest Format

The input CSV must have these columns:

Column      Description
patient_id  Patient identifier
slide_id    Unique slide identifier
slide_path  Absolute path to WSI file (.svs, .ndpi, .tif, .mrxs, etc.)
label       Class label (e.g., tumor, normal)

Example:

patient_id,slide_id,slide_path,label
P001,S001,/data/slides/slide_001.svs,tumor
P001,S002,/data/slides/slide_002.svs,tumor
P002,S003,/data/slides/slide_003.svs,normal
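
A manifest can be checked before a run with a few lines of standard-library Python. This is only a sketch and not part of the tool: `validate_manifest` and its messages are hypothetical names, and the duplicate `slide_id` check is an assumption based on the column being described as unique.

```python
import csv
import io

REQUIRED_COLUMNS = ("patient_id", "slide_id", "slide_path", "label")


def validate_manifest(handle):
    """Return a list of problems found in a manifest; an empty list means none."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    seen = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for col in REQUIRED_COLUMNS:
            if not (row[col] or "").strip():
                problems.append(f"line {line_no}: empty {col}")
        slide_id = (row["slide_id"] or "").strip()
        if slide_id and slide_id in seen:
            problems.append(f"line {line_no}: duplicate slide_id {slide_id}")
        seen.add(slide_id)
    return problems


example = io.StringIO(
    "patient_id,slide_id,slide_path,label\n"
    "P001,S001,/data/slides/slide_001.svs,tumor\n"
    "P001,S002,/data/slides/slide_002.svs,tumor\n"
    "P002,S003,/data/slides/slide_003.svs,normal\n"
)
print(validate_manifest(example))  # prints []
```

For a file on disk, pass `open(path, newline="")` instead of the `StringIO` object.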

Project Structure

WSI-Analytics/
├── app.py                          # Streamlit UI
├── cli.py                          # Headless batch CLI
├── selfcheck.py                    # Self-validation test suite (17 tests)
├── requirements.txt
├── configs/
│   └── default.yaml                # Default configuration
├── wsi_analyticspro/               # Core package
│   ├── __init__.py
│   ├── config.py                   # YAML config + validation + presets
│   ├── io/
│   │   ├── csv_ingest.py           # CSV loading + preview subset
│   │   └── artifacts.py            # Run directory + manifest + exports
│   ├── wsi/
│   │   ├── reader.py               # WSI reader + masking + tiling
│   │   └── metadata.py             # MPP/mag utilities + AtlasPatch check
│   ├── pipelines/
│   │   ├── qc_native.py            # NativeQC pipeline + SlideResult
│   │   ├── qc_stamp.py             # StampQC pipeline (STAMP-style tile selection)
│   │   ├── atlaspatch_wrapper.py   # AtlasPatch CLI wrapper + HDF5 parsing
│   │   └── orchestrator.py         # Pipeline dispatcher + PipelineResult
│   ├── stats/
│   │   ├── tile_stats.py           # Per-tile statistics
│   │   ├── slide_stats.py          # Per-slide statistics
│   │   ├── patient_stats.py        # Per-patient statistics
│   │   ├── label_stats.py          # Per-label statistics
│   │   └── imbalance.py            # Class imbalance analysis
│   └── viz/
│       ├── plots.py                # Plotly plots + montages
│       ├── overlays.py             # Mask/grid overlays
│       └── report.py               # HTML report builder
├── wsi_reader.py                   # (legacy) Multi-backend WSI reader
├── masking.py                      # (legacy) Tissue masking
├── tiling.py                       # (legacy) Tile generation + QC scoring
├── metrics.py                      # (legacy) Statistics
├── viz.py                          # (legacy) Visualizations
├── sweeps.py                       # (legacy) Parameter sweep engine
└── qc_pipeline.py                  # (legacy) Pipeline orchestration

Pipelines

NativeQC

Built-in pipeline using Otsu/OD thresholding for tissue masking and grid-based tiling:

  1. Read slide with auto-detected backend
  2. Generate thumbnail and tissue mask
  3. Generate tile coordinates on tissue regions
  4. Score each tile (tissue fraction, brightness, blur, white fraction)
  5. Accept/reject based on QC thresholds
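
The per-tile metrics in step 4 can be illustrated with NumPy. This is a rough sketch, not the code in `qc_native.py`: `score_tile` and its `white_level` cutoff are hypothetical, the grayscale conversion is a plain channel mean, and the blur score is the variance of a 4-neighbour Laplacian, so the numbers will not necessarily match the tool's scale.

```python
import numpy as np


def score_tile(rgb, tissue_mask, white_level=220):
    """Compute simple QC metrics for one tile.

    rgb: uint8 array of shape (H, W, 3).
    tissue_mask: bool array of shape (H, W), True where tissue was detected.
    white_level: assumed cutoff for counting a pixel as white/background.
    """
    gray = rgb.astype(np.float64).mean(axis=2)
    # 4-neighbour Laplacian on interior pixels; low variance suggests blur.
    lap = (
        gray[:-2, 1:-1] + gray[2:, 1:-1] + gray[1:-1, :-2] + gray[1:-1, 2:]
        - 4.0 * gray[1:-1, 1:-1]
    )
    return {
        "tissue_fraction": float(tissue_mask.mean()),
        "brightness": float(gray.mean()),
        "blur": float(lap.var()),
        "white_fraction": float((gray >= white_level).mean()),
    }


# A random tile with a mask covering its left half.
rng = np.random.default_rng(0)
tile = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
mask = np.zeros((224, 224), dtype=bool)
mask[:, :112] = True
print(score_tile(tile, mask))
```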

StampQC

Replicates the STAMP tile-selection pipeline for compatibility with STAMP-based MIL workflows:

  1. Generate non-overlapping tile grid (stride = tile size, no partial edge tiles)
  2. Coarse brightness rejection at supertile level (grayscale intensity ≥ cutoff → background)
  3. Fine Canny edge filter per tile (edge fraction < cutoff → discard)
  4. Tiles surviving both filters are accepted
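
The grid and the two filters can be sketched in plain Python. The function names here are hypothetical, and the brightness and edge-fraction values are assumed to be computed elsewhere (the Canny step itself is not shown). The comparisons follow the steps above: brightness at or above the cutoff means background, and an edge fraction below the cutoff means discard. A cutoff of `None` disables that filter.

```python
def tile_grid(width, height, tile):
    """Top-left (x, y) of every full tile in a non-overlapping grid.

    The stride equals the tile size, and tiles that would cross the
    right or bottom border are skipped.
    """
    return [
        (x, y)
        for y in range(0, height - tile + 1, tile)
        for x in range(0, width - tile + 1, tile)
    ]


def accept_tile(supertile_brightness, edge_fraction,
                brightness_cutoff=240, canny_cutoff=0.02):
    """Apply the coarse brightness filter, then the fine edge filter."""
    if brightness_cutoff is not None and supertile_brightness >= brightness_cutoff:
        return False  # supertile is treated as background
    if canny_cutoff is not None and edge_fraction < canny_cutoff:
        return False  # too few edges to be informative
    return True


coords = tile_grid(1000, 500, 224)
print(len(coords))             # prints 8
print(accept_tile(200, 0.05))  # prints True
print(accept_tile(245, 0.05))  # prints False
```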

StampQC defaults (applied automatically when the pipeline is StampQC):

Parameter          Default  Description
tile_size_um       256.0    Physical tile size in µm
tile_size_px       224      Output tile size in pixels
brightness_cutoff  240      Supertile brightness threshold (set to null to disable)
canny_cutoff       0.02     Canny edge fraction threshold (set to null to disable)

Hardcoded constants (not configurable):

  • Canny low/high thresholds: 40 / 100
  • Overlap: 0 (non-overlapping grid)
  • Read level: 0 (full resolution)
  • Max supertile size: 1024 px

CLI examples:

python cli.py --csv /path/to/slides.csv --pipeline StampQC --full
python cli.py --csv /path/to/slides.csv --pipeline StampQC --stamp-brightness-cutoff 230
python cli.py --csv /path/to/slides.csv --pipeline StampQC --stamp-disable-brightness

AtlasPatchQC

Wraps the external AtlasPatch tool (CC-BY-NC-SA-4.0):

  1. Run AtlasPatch CLI on slide (SAM2-based tissue detection)
  2. Parse HDF5 output (coords dataset: N x 5 int32)
  3. Optionally score tiles with native QC metrics
  4. Apply same accept/reject logic

Set up AtlasPatch:

pip install atlas-patch
# or clone and install from source

Configuration

Presets

Preset               Description
fast_preview         Low thresholds, fast iteration
conservative_tissue  High tissue fraction, strict brightness
high_purity          Maximum purity with blur + artifact detection

Key Parameters

Parameter            Default  Description
tile_size_um         256.0    Physical tile size in micrometers
tile_size_px         224      Tile size in pixels
min_tissue_fraction  0.5      Minimum tissue to accept tile
brightness_max       220      Maximum mean brightness (0-255)
blur_threshold       15.0     Minimum Laplacian variance
min_tiles_per_slide  5        Below this, slide is QC_FAIL
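
To make the thresholds concrete, here is a minimal sketch of how they might combine into an accept/reject decision. It is not the tool's implementation: the reject-reason strings, the `QC_PASS` status, the order of checks, and the assumption that `min_tiles_per_slide` counts accepted tiles are all illustrative guesses. Only the parameter names, defaults, and `QC_FAIL` come from the table above.

```python
DEFAULTS = {
    "min_tissue_fraction": 0.5,
    "brightness_max": 220,
    "blur_threshold": 15.0,
    "min_tiles_per_slide": 5,
}


def reject_reason(tile, cfg=DEFAULTS):
    """Return None if the tile passes, else a (hypothetical) reason string."""
    if tile["tissue_fraction"] < cfg["min_tissue_fraction"]:
        return "low_tissue"
    if tile["brightness"] > cfg["brightness_max"]:
        return "too_bright"
    if tile["blur"] < cfg["blur_threshold"]:
        return "blurry"
    return None


def slide_status(tiles, cfg=DEFAULTS):
    """Fail the slide when too few tiles survive the per-tile checks."""
    accepted = sum(1 for t in tiles if reject_reason(t, cfg) is None)
    return "QC_FAIL" if accepted < cfg["min_tiles_per_slide"] else "QC_PASS"


good = {"tissue_fraction": 0.9, "brightness": 180.0, "blur": 40.0}
pale = {"tissue_fraction": 0.9, "brightness": 235.0, "blur": 40.0}
print(reject_reason(good))               # prints None
print(reject_reason(pale))               # prints too_bright
print(slide_status([good] * 4 + [pale])) # prints QC_FAIL
```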

Resolution Rule

The target resolution is derived from tile sizing defaults:

  • Default: tile_size_um = 256.0, tile_size_px = 224
  • Derived: target_mpp = 256.0 / 224 ≈ 1.143 µm/px

The three values are linked by target_mpp = tile_size_um / tile_size_px. You can override any one of them and the remaining values are recomputed to keep that relation.
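
The relation behind this rule is `target_mpp = tile_size_um / tile_size_px`. The sketch below fills in the third value from any two; `complete_geometry` is a hypothetical helper. Which default the tool holds fixed when only one value is overridden is not stated here, so the sketch asks for two explicit values instead of guessing.

```python
def complete_geometry(tile_size_um=None, tile_size_px=None, target_mpp=None):
    """Given exactly two of the three values, compute the third.

    Returns (tile_size_um, tile_size_px, target_mpp).
    """
    given = sum(v is not None for v in (tile_size_um, tile_size_px, target_mpp))
    if given != 2:
        raise ValueError("provide exactly two of the three values")
    if target_mpp is None:
        target_mpp = tile_size_um / tile_size_px
    elif tile_size_um is None:
        tile_size_um = tile_size_px * target_mpp
    else:
        # Pixel sizes are whole numbers, so round to the nearest pixel.
        tile_size_px = round(tile_size_um / target_mpp)
    return tile_size_um, tile_size_px, target_mpp


um, px, mpp = complete_geometry(tile_size_um=256.0, tile_size_px=224)
print(round(mpp, 3))  # prints 1.143
print(complete_geometry(tile_size_um=256.0, target_mpp=0.5))  # prints (256.0, 512, 0.5)
```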

Run Artifacts

Each run creates a directory under runs/ with:

runs/<run_id>/
├── manifest.json           # Reproducibility metadata
├── config.yaml             # Frozen config
├── tiles.csv               # All tile records
├── tiles.parquet           # Tiles in Parquet format
├── slides_summary.csv      # Per-slide statistics
├── patients_summary.csv    # Per-patient statistics
├── labels_summary.csv      # Per-label statistics
├── report.html             # HTML report (if --report)
├── run.log                 # Processing log
├── thumbnails/             # Per-slide thumbnails + overlays
└── debug/                  # Debug artifacts

Statistics

Tile-Level

  • Tissue fraction, brightness, blur score, white fraction distributions
  • Reject reason breakdown with percentages
  • Borderline tile counts (near QC thresholds)

Slide-Level

  • Acceptance rate, bag ratio (accepted/candidates)
  • Per-slide metric distributions
  • Low-acceptance slide detection

Patient-Level

  • Tiles per patient, slides per patient
  • Patient outlier detection (>10% tile share)
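
One way to read the outlier rule is that a patient is flagged when they contribute more than 10% of all tiles in the dataset. The sketch below implements that reading; the function names are hypothetical, and whether the tool counts accepted or candidate tiles is an assumption left to the caller.

```python
from collections import Counter


def patient_tile_share(tile_patient_ids):
    """Map each patient to their fraction of all tiles."""
    counts = Counter(tile_patient_ids)
    total = sum(counts.values())
    return {patient: n / total for patient, n in counts.items()}


def outlier_patients(tile_patient_ids, max_share=0.10):
    """Patients whose tile share exceeds max_share, sorted by ID."""
    shares = patient_tile_share(tile_patient_ids)
    return sorted(p for p, share in shares.items() if share > max_share)


# One patient_id per tile: P001 has 30 tiles, 70 other patients have 1 each.
ids = ["P001"] * 30 + [f"P{i:03d}" for i in range(2, 72)]
print(outlier_patients(ids))  # prints ['P001']
```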

Label-Level

  • Before/after QC tile counts per label
  • Per-label acceptance rates and bag ratios

Class Imbalance

  • Shannon entropy and effective N classes
  • Imbalance ratio (max/min)
  • Before vs after QC comparison
  • Automatic warnings for worsened or severe imbalance
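
These metrics can be sketched from per-label tile counts. The exact definitions in `imbalance.py` are not shown here, so this assumes natural-log Shannon entropy, effective N as the exponential of that entropy, and the imbalance ratio as the largest class count divided by the smallest non-empty one. `imbalance_summary` is a hypothetical name.

```python
import math


def imbalance_summary(counts):
    """counts: mapping of label -> tile count. Empty classes are ignored."""
    positive = [n for n in counts.values() if n > 0]
    if not positive:
        raise ValueError("no tiles in any class")
    total = sum(positive)
    entropy = -sum((n / total) * math.log(n / total) for n in positive)
    return {
        "entropy": entropy,
        "effective_n": math.exp(entropy),
        "imbalance_ratio": max(positive) / min(positive),
    }


# Hypothetical tile counts before and after QC.
before = imbalance_summary({"tumor": 900, "normal": 300})
after = imbalance_summary({"tumor": 800, "normal": 100})
print(before["imbalance_ratio"], after["imbalance_ratio"])  # prints 3.0 8.0
if after["imbalance_ratio"] > before["imbalance_ratio"]:
    print("warning: QC worsened class imbalance")
```

In this made-up example QC removes proportionally more normal tiles, so the ratio rises from 3.0 to 8.0 and effective N drops. That is the kind of shift the before/after comparison is meant to surface.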

Troubleshooting

OpenSlide not found

brew install openslide          # macOS
apt install openslide-tools     # Linux
pip install openslide-python

If you see Library not loaded: libopenslide.0.dylib:

export DYLD_LIBRARY_PATH=$(brew --prefix openslide)/lib:$DYLD_LIBRARY_PATH

No backend available

Without a backend the tool still runs but marks every slide as unreadable. Install at least one backend (OpenSlide is recommended).

Streamlit port conflict

streamlit run app.py --server.port 8502

Large datasets are slow

  • Use preview mode first (default: 30 slides)
  • Increase mask_workscale_mpp (e.g., 16.0) for faster masking
  • Set max_tiles_per_slide to limit tile count
  • Use the CLI for batch processing
