Prevent ENCODE starch files from being misrouted into the BED→tabix pipeline and poisoning the cache — Closes #69 by conradbzura · Pull Request #71 · abdenlab/cfdb

conradbzura · 2026-06-24T01:01:28Z

Summary

Stop ENCODE starch files from poisoning the content-addressed cache. ENCODE labels BEDOPS starch archives as BED upstream, so the tabix processor ran zcat -f | sort | bgzip on them; because starch is neither gzip nor text, the pipeline produced a valid BGZF wrapping scrambled bytes and committed it before tabix failed, leaving a corrupt data artifact that later reads served and every retry re-failed against.

Add a byte-sniff guard in TabixIntervalProcessor that inspects the downloaded source's leading bytes and fails before any cache write when they are neither gzip nor plain text. The guard is format-agnostic (it catches starch and any other surprise binary), needs no new system dependencies, and leaves the ontology mapping and routing untouched. bigBed is exempt because its binary source is handled by bigBedToBed. Real starch support via the BEDOPS toolchain is intentionally out of scope and left as a follow-up.

Closes #69

Proposed changes

Reject non-gzip/non-text sources before caching (`src/cfdb/workflows/processors/tabix.py`)

Add _source_looks_processable(prefix): accept a gzip/bgzip magic prefix or a NUL-free prefix that decodes as UTF-8 (trimming up to three trailing bytes so a multi-byte character split at the sniff boundary is not falsely rejected); treat an empty prefix as acceptable so the existing zero-data-line guard still owns the empty-source case.
Add _read_prefix to read the leading bytes and _assert_source_processable to raise a clear RuntimeError naming the format, local id, and observed magic when the source is unsupported.
Invoke the guard in _stage_prepare immediately after download_source and before the pipeline (hence before any cache.put), for every format except bigBed.

Cover the guard (`tests/test_workflows/test_processors_tabix.py`)

Add predicate tests (examples plus two Hypothesis properties) and two run()-level behavior tests.

Test cases

#	Test Suite	Given	When	Then	Coverage Target
1	`_source_looks_processable`	A gzip/bgzip magic prefix	The predicate is called	It returns `True`	Gzip acceptance
2	`_source_looks_processable`	A plain BED text prefix	The predicate is called	It returns `True`	Text acceptance
3	`_source_looks_processable`	A BEDOPS starch magic prefix	The predicate is called	It returns `False`	Starch rejection
4	`_source_looks_processable`	A NUL-containing binary prefix	The predicate is called	It returns `False`	Binary rejection
5	`_source_looks_processable`	A short non-text prefix shorter than the trim window	The predicate is called	It returns `False`	Trim-to-empty branch
6	`_source_looks_processable`	UTF-8 text whose final multi-byte char is cut at the boundary	The predicate is called	It returns `True`	Boundary-split recovery
7	`_source_looks_processable`	An empty prefix	The predicate is called	It returns `True`	Empty deferred to downstream guard
8	`_source_looks_processable`	Any bytes appended to the gzip magic	The predicate is called	It always returns `True`	Property: gzip prefix accepted
9	`_source_looks_processable`	Any bytes appended to the starch magic	The predicate is called	It always returns `False`	Property: starch prefix rejected
10	`TestTabixIntervalProcessorPipelines`	A BED file whose source is a starch archive	`run` is awaited	It raises and commits no cache artifact	Fail-before-commit on starch
11	`TestTabixIntervalProcessorPipelines`	A bigBed file with a binary source	`run` is awaited	It completes and commits the artifact	bigBed guard exemption

ENCODE starch files (BEDOPS archives) are mapped to the BED format upstream, so the tabix processor ran zcat -f, sort and bgzip on them. starch is neither gzip nor text, so the pipeline produced a valid BGZF wrapping scrambled bytes and committed it to the content-addressed cache before tabix failed. That poisoned the cache: later reads served corrupt bytes and every retry was a stage-1 cache hit that re-failed the index step forever. Sniff the downloaded source's leading bytes and fail before any cache write when they are neither gzip nor plain text, so an unsupported serialization fails cleanly instead of committing a corrupt artifact. bigBed is exempt because its binary source is handled by bigBedToBed.

Add predicate tests for the gzip, text, starch, NUL-binary and empty cases plus boundary-split and starch-prefixed property checks, and run-level tests asserting a starch source raises without committing a cache artifact while bigBed binary sources stay exempt from the guard.

Address the round-1 principal-engineer review of the source-encoding guard: - Bump processor_version so an artifact poisoned by the pre-guard code (committed under the old version) is re-keyed to a miss, re-enters _stage_prepare, and is rejected instead of served. The guard was otherwise unreachable for already-cached entries. - Reject known non-gzip compression magics (BEDOPS starch, bzip2, xz, zstd, zip) explicitly before the UTF-8 text test, so a mislabeled compressed source is caught by its magic rather than incidentally. - Name the bigBed exemption as a set and document the safety chain (the tool validates its own input and fails under pipefail before any cache commit), so the carve-out is greppable and self-explaining. - Scope the docstring honestly (a gzip member is trusted on its magic alone, so a doubly-compressed source is out of scope; latin-1/CP1252 text is deliberately rejected) and add the dcc to the error message.

Extend the source-encoding guard coverage: - Parametrize the starch-rejection run test over BED, VCF and GTF so the guard is pinned ahead of the intermediate-decompress branches, not just the single-pipeline BED path. - Add predicate cases for the bzip2, xz, zstd and zip magic denylist. - Add a bigBed test where the tool raises, asserting the exemption commits nothing when bigBedToBed rejects a bad source. - Add a regression test that a poisoned artifact under the old processor version is re-keyed and never served.

conradbzura self-assigned this Jun 24, 2026

conradbzura added 2 commits June 23, 2026 21:48

conradbzura force-pushed the 69-prevent-starch-cache-poisoning branch from 7dca39c to 12df4ab Compare June 24, 2026 01:50

conradbzura marked this pull request as ready for review June 24, 2026 01:58

conradbzura added 2 commits June 24, 2026 13:09

conradbzura mentioned this pull request Jun 24, 2026

Support ENCODE starch peak files end-to-end via BEDOPS unstarch #72

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Prevent ENCODE starch files from being misrouted into the BED→tabix pipeline and poisoning the cache — Closes #69#71

Prevent ENCODE starch files from being misrouted into the BED→tabix pipeline and poisoning the cache — Closes #69#71
conradbzura wants to merge 4 commits into
masterfrom
69-prevent-starch-cache-poisoning

conradbzura commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

conradbzura commented Jun 24, 2026

Summary

Proposed changes

Reject non-gzip/non-text sources before caching (src/cfdb/workflows/processors/tabix.py)

Cover the guard (tests/test_workflows/test_processors_tabix.py)

Test cases

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Reject non-gzip/non-text sources before caching (`src/cfdb/workflows/processors/tabix.py`)

Cover the guard (`tests/test_workflows/test_processors_tabix.py`)