Skip to content

Prevent ENCODE starch files from being misrouted into the BED→tabix pipeline and poisoning the cache — Closes #69#71

Open
conradbzura wants to merge 4 commits into
masterfrom
69-prevent-starch-cache-poisoning
Open

Prevent ENCODE starch files from being misrouted into the BED→tabix pipeline and poisoning the cache — Closes #69#71
conradbzura wants to merge 4 commits into
masterfrom
69-prevent-starch-cache-poisoning

Conversation

@conradbzura

Copy link
Copy Markdown
Collaborator

Summary

Stop ENCODE starch files from poisoning the content-addressed cache. ENCODE labels BEDOPS starch archives as BED upstream, so the tabix processor ran zcat -f | sort | bgzip on them; because starch is neither gzip nor text, the pipeline produced a valid BGZF wrapping scrambled bytes and committed it before tabix failed, leaving a corrupt data artifact that later reads served and every retry re-failed against.

Add a byte-sniff guard in TabixIntervalProcessor that inspects the downloaded source's leading bytes and fails before any cache write when they are neither gzip nor plain text. The guard is format-agnostic (it catches starch and any other surprise binary), needs no new system dependencies, and leaves the ontology mapping and routing untouched. bigBed is exempt because its binary source is handled by bigBedToBed. Real starch support via the BEDOPS toolchain is intentionally out of scope and left as a follow-up.

Closes #69

Proposed changes

Reject non-gzip/non-text sources before caching (src/cfdb/workflows/processors/tabix.py)

  • Add _source_looks_processable(prefix): accept a gzip/bgzip magic prefix or a NUL-free prefix that decodes as UTF-8 (trimming up to three trailing bytes so a multi-byte character split at the sniff boundary is not falsely rejected); treat an empty prefix as acceptable so the existing zero-data-line guard still owns the empty-source case.
  • Add _read_prefix to read the leading bytes and _assert_source_processable to raise a clear RuntimeError naming the format, local id, and observed magic when the source is unsupported.
  • Invoke the guard in _stage_prepare immediately after download_source and before the pipeline (hence before any cache.put), for every format except bigBed.

Cover the guard (tests/test_workflows/test_processors_tabix.py)

  • Add predicate tests (examples plus two Hypothesis properties) and two run()-level behavior tests.

Test cases

# Test Suite Given When Then Coverage Target
1 _source_looks_processable A gzip/bgzip magic prefix The predicate is called It returns True Gzip acceptance
2 _source_looks_processable A plain BED text prefix The predicate is called It returns True Text acceptance
3 _source_looks_processable A BEDOPS starch magic prefix The predicate is called It returns False Starch rejection
4 _source_looks_processable A NUL-containing binary prefix The predicate is called It returns False Binary rejection
5 _source_looks_processable A short non-text prefix shorter than the trim window The predicate is called It returns False Trim-to-empty branch
6 _source_looks_processable UTF-8 text whose final multi-byte char is cut at the boundary The predicate is called It returns True Boundary-split recovery
7 _source_looks_processable An empty prefix The predicate is called It returns True Empty deferred to downstream guard
8 _source_looks_processable Any bytes appended to the gzip magic The predicate is called It always returns True Property: gzip prefix accepted
9 _source_looks_processable Any bytes appended to the starch magic The predicate is called It always returns False Property: starch prefix rejected
10 TestTabixIntervalProcessorPipelines A BED file whose source is a starch archive run is awaited It raises and commits no cache artifact Fail-before-commit on starch
11 TestTabixIntervalProcessorPipelines A bigBed file with a binary source run is awaited It completes and commits the artifact bigBed guard exemption

@conradbzura conradbzura self-assigned this Jun 24, 2026
ENCODE starch files (BEDOPS archives) are mapped to the BED format
upstream, so the tabix processor ran zcat -f, sort and bgzip on them.
starch is neither gzip nor text, so the pipeline produced a valid BGZF
wrapping scrambled bytes and committed it to the content-addressed
cache before tabix failed. That poisoned the cache: later reads served
corrupt bytes and every retry was a stage-1 cache hit that re-failed
the index step forever.

Sniff the downloaded source's leading bytes and fail before any cache
write when they are neither gzip nor plain text, so an unsupported
serialization fails cleanly instead of committing a corrupt artifact.
bigBed is exempt because its binary source is handled by bigBedToBed.
Add predicate tests for the gzip, text, starch, NUL-binary and empty
cases plus boundary-split and starch-prefixed property checks, and
run-level tests asserting a starch source raises without committing a
cache artifact while bigBed binary sources stay exempt from the guard.
@conradbzura conradbzura force-pushed the 69-prevent-starch-cache-poisoning branch from 7dca39c to 12df4ab Compare June 24, 2026 01:50
@conradbzura conradbzura marked this pull request as ready for review June 24, 2026 01:58
Address the round-1 principal-engineer review of the source-encoding
guard:

- Bump processor_version so an artifact poisoned by the pre-guard code
  (committed under the old version) is re-keyed to a miss, re-enters
  _stage_prepare, and is rejected instead of served. The guard was
  otherwise unreachable for already-cached entries.
- Reject known non-gzip compression magics (BEDOPS starch, bzip2, xz,
  zstd, zip) explicitly before the UTF-8 text test, so a mislabeled
  compressed source is caught by its magic rather than incidentally.
- Name the bigBed exemption as a set and document the safety chain
  (the tool validates its own input and fails under pipefail before any
  cache commit), so the carve-out is greppable and self-explaining.
- Scope the docstring honestly (a gzip member is trusted on its magic
  alone, so a doubly-compressed source is out of scope; latin-1/CP1252
  text is deliberately rejected) and add the dcc to the error message.
Extend the source-encoding guard coverage:

- Parametrize the starch-rejection run test over BED, VCF and GTF so the
  guard is pinned ahead of the intermediate-decompress branches, not
  just the single-pipeline BED path.
- Add predicate cases for the bzip2, xz, zstd and zip magic denylist.
- Add a bigBed test where the tool raises, asserting the exemption
  commits nothing when bigBedToBed rejects a bad source.
- Add a regression test that a poisoned artifact under the old
  processor version is re-keyed and never served.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Prevent ENCODE starch files from being misrouted into the BED→tabix pipeline and poisoning the cache

1 participant