Prevent ENCODE starch files from being misrouted into the BED→tabix pipeline and poisoning the cache — Closes #69#71
Open
conradbzura wants to merge 4 commits into
Open
Conversation
ENCODE starch files (BEDOPS archives) are mapped to the BED format upstream, so the tabix processor ran zcat -f, sort and bgzip on them. starch is neither gzip nor text, so the pipeline produced a valid BGZF wrapping scrambled bytes and committed it to the content-addressed cache before tabix failed. That poisoned the cache: later reads served corrupt bytes and every retry was a stage-1 cache hit that re-failed the index step forever. Sniff the downloaded source's leading bytes and fail before any cache write when they are neither gzip nor plain text, so an unsupported serialization fails cleanly instead of committing a corrupt artifact. bigBed is exempt because its binary source is handled by bigBedToBed.
Add predicate tests for the gzip, text, starch, NUL-binary and empty cases plus boundary-split and starch-prefixed property checks, and run-level tests asserting a starch source raises without committing a cache artifact while bigBed binary sources stay exempt from the guard.
7dca39c to
12df4ab
Compare
Address the round-1 principal-engineer review of the source-encoding guard: - Bump processor_version so an artifact poisoned by the pre-guard code (committed under the old version) is re-keyed to a miss, re-enters _stage_prepare, and is rejected instead of served. The guard was otherwise unreachable for already-cached entries. - Reject known non-gzip compression magics (BEDOPS starch, bzip2, xz, zstd, zip) explicitly before the UTF-8 text test, so a mislabeled compressed source is caught by its magic rather than incidentally. - Name the bigBed exemption as a set and document the safety chain (the tool validates its own input and fails under pipefail before any cache commit), so the carve-out is greppable and self-explaining. - Scope the docstring honestly (a gzip member is trusted on its magic alone, so a doubly-compressed source is out of scope; latin-1/CP1252 text is deliberately rejected) and add the dcc to the error message.
Extend the source-encoding guard coverage: - Parametrize the starch-rejection run test over BED, VCF and GTF so the guard is pinned ahead of the intermediate-decompress branches, not just the single-pipeline BED path. - Add predicate cases for the bzip2, xz, zstd and zip magic denylist. - Add a bigBed test where the tool raises, asserting the exemption commits nothing when bigBedToBed rejects a bad source. - Add a regression test that a poisoned artifact under the old processor version is re-keyed and never served.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Stop ENCODE
starchfiles from poisoning the content-addressed cache. ENCODE labels BEDOPSstarcharchives asBEDupstream, so the tabix processor ranzcat -f | sort | bgzipon them; becausestarchis neither gzip nor text, the pipeline produced a valid BGZF wrapping scrambled bytes and committed it beforetabixfailed, leaving a corruptdataartifact that later reads served and every retry re-failed against.Add a byte-sniff guard in
TabixIntervalProcessorthat inspects the downloaded source's leading bytes and fails before any cache write when they are neither gzip nor plain text. The guard is format-agnostic (it catchesstarchand any other surprise binary), needs no new system dependencies, and leaves the ontology mapping and routing untouched.bigBedis exempt because its binary source is handled bybigBedToBed. Realstarchsupport via the BEDOPS toolchain is intentionally out of scope and left as a follow-up.Closes #69
Proposed changes
Reject non-gzip/non-text sources before caching (
src/cfdb/workflows/processors/tabix.py)_source_looks_processable(prefix): accept a gzip/bgzip magic prefix or a NUL-free prefix that decodes as UTF-8 (trimming up to three trailing bytes so a multi-byte character split at the sniff boundary is not falsely rejected); treat an empty prefix as acceptable so the existing zero-data-line guard still owns the empty-source case._read_prefixto read the leading bytes and_assert_source_processableto raise a clearRuntimeErrornaming the format, local id, and observed magic when the source is unsupported._stage_prepareimmediately afterdownload_sourceand before the pipeline (hence before anycache.put), for every format exceptbigBed.Cover the guard (
tests/test_workflows/test_processors_tabix.py)run()-level behavior tests.Test cases
_source_looks_processableTrue_source_looks_processableTrue_source_looks_processableFalse_source_looks_processableFalse_source_looks_processableFalse_source_looks_processableTrue_source_looks_processableTrue_source_looks_processableTrue_source_looks_processableFalseTestTabixIntervalProcessorPipelinesrunis awaitedTestTabixIntervalProcessorPipelinesrunis awaited