fix(deps): promote msgpack to core dependency for Flax scanner #583
yash2998chhabria wants to merge 9 commits into main
Conversation
Expand ALWAYS_DANGEROUS_MODULES with ~45 new modules not covered by PR #518:

- network: smtplib, imaplib, poplib, nntplib, xmlrpc, socketserver, ssl, requests, aiohttp
- compilation: codeop, marshal, types, compileall, py_compile
- FFI: _ctypes
- debugging: bdb, trace
- operator bypasses: _operator, functools
- pickle recursion: pickle, _pickle, dill, cloudpickle, joblib
- filesystem: tempfile, filecmp, fileinput, glob, distutils, pydoc, pexpect
- venv/pip: venv, ensurepip, pip
- threading: signal, _signal, threading
- more: webbrowser, asyncio, mmap, select, selectors, logging, syslog, tarfile, zipfile, shelve, sqlite3, _sqlite3, doctest, idlelib, lib2to3, uuid

Add uuid as a novel detection (VULN-6: _get_command_stdout calls subprocess.Popen) and pkgutil.resolve_name as a dynamic-resolution trampoline. Mirror the new modules in SUSPICIOUS_GLOBALS for defense in depth. Fix multi-stream opcode analysis so detailed checks run across all pickle streams. Add NEWOBJ_EX to all opcode check lists. Includes 10 regression tests covering the pkgutil trampoline, uuid RCE, multi-stream exploit, NEWOBJ_EX, and spot-checks for smtplib, sqlite3, tarfile, marshal, cloudpickle, and webbrowser.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
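The blocklist check implied by the commit can be sketched as follows. This is a hypothetical illustration: `module_is_dangerous` is not the scanner's real API, and the set below is a small subset of the full list above. The key idea is matching on the top-level package so dotted lookups inside a dangerous module are still caught.

```python
# Illustrative subset of the blocklist; the real scanner's data structures differ.
ALWAYS_DANGEROUS_MODULES = {
    "smtplib", "ssl", "marshal", "pickle", "tempfile",
    "venv", "threading", "webbrowser", "tarfile", "uuid",
}

def module_is_dangerous(qualified_name: str) -> bool:
    """Match on the first dotted segment so 'uuid._get_command_stdout' is caught."""
    return qualified_name.split(".")[0] in ALWAYS_DANGEROUS_MODULES

print(module_is_dangerous("uuid._get_command_stdout"))  # flagged: uuid is blocklisted
print(module_is_dangerous("numpy.core.multiarray"))     # not flagged
```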
…e scanning

- Fix test regression: track first pickle STOP position and use it for binary content scanning instead of f.tell() (which advances past multi-stream data)
- Fix multi-stream resync bypass: skip junk separator bytes between streams (up to 256 bytes) instead of returning immediately
- Gracefully handle ValueErrors on subsequent streams (non-pickle data after a valid pickle) instead of propagating them as file-level errors
- Add type hints: ScanResult return type on _scan_bytes, -> None on all test methods
- Add separator-byte resync regression test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Buffer opcodes for subsequent pickle streams and discard partial streams (binary tensor data misinterpreted as opcodes) to prevent MANY_DANGEROUS_OPCODES false positives
- Add resync logic to skip up to 256 non-pickle separator bytes between streams, preventing single-byte bypass of multi-stream detection
- Track pickle memo (BINPUT/BINGET) for safe ML globals so REDUCE calls referencing safe callables via memo are not counted as dangerous opcodes
- Reset dangerous opcode counters at STOP boundaries so each stream is evaluated independently
- Use first_pickle_end_pos for binary content scanning instead of f.tell(), which advances past tensor data during multi-stream parsing

Validated against 5 HuggingFace models with A/B comparison to main branch — no new false positives introduced.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
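The memo-tracking idea above (remembering what BINPUT stored so a later BINGET of a safe callable is not miscounted as a dangerous REDUCE target) can be illustrated with `pickletools.genops`. The `safe_memo` dict and the single-variable stack tracking below are deliberate simplifications, not the scanner's real state:

```python
import pickle
import pickletools

# Pickle an ordinary dict at protocol 2, where memoization uses BINPUT.
data = pickle.dumps({"weights": [1.0, 2.0], "bias": [0.5]}, protocol=2)

safe_memo = {}     # memo index -> last string pushed (crude simplification)
last_pushed = None

for opcode, arg, pos in pickletools.genops(data):
    if opcode.name in ("BINUNICODE", "SHORT_BINUNICODE"):
        last_pushed = arg
    elif opcode.name in ("BINPUT", "LONG_BINPUT"):
        # The unpickler stores the current stack top at memo index `arg`;
        # a later BINGET/LONG_BINGET with the same index re-pushes it.
        safe_memo[arg] = last_pushed
    elif opcode.name == "MEMOIZE":  # protocol 4+ variant with an implicit index
        safe_memo[len(safe_memo)] = last_pushed

print(sorted(safe_memo))
```

A real implementation would track the full unpickler stack, but even this sketch shows how memo indices map back to previously seen globals.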
The flax_msgpack scanner was registered but failed to load when msgpack was not installed, silently falling back to generic scanning. Since msgpack is a lightweight zero-dependency C extension, promote it from optional (extras: flax) to a core dependency so Flax/JAX .msgpack files are scanned correctly out of the box. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
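The silent-fallback failure mode described here follows from the usual optional-import pattern. A minimal sketch of why a missing extra degrades scanning with no error (names here are illustrative, not modelaudit's real module layout):

```python
# Sketch of the optional-dependency pattern behind the silent fallback.
try:
    import msgpack  # only present when the 'flax' extra was installed
    HAVE_MSGPACK = True
except ImportError:
    msgpack = None
    HAVE_MSGPACK = False

def pick_scanner(path: str) -> str:
    """Hypothetical dispatch: which scanner handles this file?"""
    if path.endswith(".msgpack") and HAVE_MSGPACK:
        return "flax_msgpack"
    # Without msgpack the specialized scanner cannot load, so the file
    # falls through to generic scanning with no warning to the user.
    return "generic"
```

Promoting msgpack to a core dependency makes the first branch always reachable for `.msgpack` files.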
Walkthrough

Expands SUSPICIOUS_GLOBALS and ALWAYS_DANGEROUS_* lists, adds multi-stream/resync parsing, NEWOBJ_EX handling, and first_pickle_end_pos propagation in the pickle scanner; adds tests for blocklist and multi-stream cases; updates the msgpack dependency; improves Flax/msgpack model detection.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant FileObj as FileObj
    participant Parser as PickleParser
    participant Buffer as StreamBuffer
    participant Detector as OpcodeDetector
    participant Binary as BinaryScanner
    FileObj->>Parser: read bytes
    Parser->>Parser: attempt parse stream
    alt first stream completes (STOP)
        Parser->>Detector: yield ops for analysis
        Parser->>Parser: set first_pickle_end_pos
    else subsequent stream
        Parser->>Buffer: buffer opcodes
        Buffer->>Parser: validate buffered stream
        alt buffered parse succeeds
            Buffer->>Detector: yield ops for analysis
        else within _MAX_RESYNC_BYTES
            Parser->>FileObj: skip separator bytes (resync)
            Parser->>Parser: continue parsing
        end
    end
    Detector->>Binary: attach first_pickle_end_pos to metadata
    Binary->>Detector: report binary-scan findings
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ Passed checks (3 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 52-54: The scanner currently terminates multi-stream scanning when
a later stream yields at least one opcode before an error; instead, in the
scanning loop that uses resync_skipped and _MAX_RESYNC_BYTES you must remove the
early return on partial parse and implement resync logic: when a stream errors
after yielding opcodes, do not return — increment resync_skipped, advance the
input by one byte and attempt to resync/continue scanning this stream (and
subsequent streams) until a match or until resync_skipped exceeds a sensible
limit; do not treat a single opcode+junk separator as final. Replace the
immediate return with a continue/resync flow, reset resync_skipped on successful
parse, and revise the hard-coded 256-byte ceiling (_MAX_RESYNC_BYTES) to either
be higher or configurable so simple padding cannot evade detection; add tests
for benign and malicious samples to verify detection still fires after padding.
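The continue-instead-of-return flow requested above can be sketched over an in-memory buffer. `scan_streams`, the broad exception handling, and the budget reset are assumptions about shape, not the scanner's actual code; the real implementation works over file objects and buffers opcodes per stream:

```python
import pickle
import pickletools

MAX_RESYNC_BYTES = 256  # the ceiling under discussion; see the note on making it configurable

def scan_streams(blob: bytes):
    """Yield the opcode list of every pickle stream found, skipping junk separators."""
    pos = 0
    resync_skipped = 0
    while pos < len(blob):
        try:
            ops = list(pickletools.genops(blob[pos:]))  # stops at the STOP opcode
            yield ops
            stop_pos = ops[-1][2]        # position of STOP, relative to `pos`
            pos += stop_pos + 1          # STOP is one byte; jump past it
            resync_skipped = 0           # reset the budget after a successful parse
        except Exception:
            resync_skipped += 1
            if resync_skipped > MAX_RESYNC_BYTES:
                return                   # budget exhausted; give up on resync
            pos += 1                     # skip one junk byte and try again

payload = pickle.dumps([1, 2]) + b"\xff" * 8 + pickle.dumps({"a": 3})
streams = list(scan_streams(payload))
print(len(streams))
```

With eight junk bytes between the two pickles, both streams are still found; a single early `return` in the except branch would have hidden the second one.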
ℹ️ Review info
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)

- modelaudit/detectors/suspicious_symbols.py
- modelaudit/scanners/pickle_scanner.py
- pyproject.toml
- tests/scanners/test_pickle_scanner.py
Some HuggingFace Flax models (e.g. bert-for-token-classification) use a nested structure where model weights are wrapped under a model-name key (e.g. "bert", "classifier") rather than placing transformer keys like "encoder" or "embeddings" at the top level. This caused an INFO-level false positive: "Suspicious data structure - does not match known ML model patterns". Add detection for nested model structures by checking if top-level keys contain transformer sub-keys, and add a set of common HuggingFace model name keys to recognize known architectures. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflicts in modelaudit/scanners/pickle_scanner.py:

- ALWAYS_DANGEROUS_GLOBALS: keep both branch additions (pkgutil.resolve_name, uuid internals) and main additions (cProfile/profile/pdb/timeit, ctypes)
- ALWAYS_DANGEROUS_MODULES: merge the branch's comprehensive module blocklist with main's PR #518 additions (ctypes, profiling, thread/process, module loading) and preserve main's linecache/logging.config exclusion note
- REDUCE opcode handler: adopt main's resolved_callables approach with the branch's _safe_memo fallback for BINGET memo patterns

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 52-54: Document the security trade-off for _MAX_RESYNC_BYTES in
pickle_scanner.py: update the comment near the _MAX_RESYNC_BYTES and
resync_skipped definitions to explain that the 256-byte limit prevents skipping
arbitrarily large gaps between pickle streams but can be evaded by inserting
>=256 bytes of non-pickle data before a malicious stream, note the rationale for
stopping on partial opcode sequences to avoid misinterpreting binary tensor
data, and recommend making _MAX_RESYNC_BYTES configurable (and covered by tests)
if callers need a different performance/security trade-off.
- Around line 409-411: The scan currently lists "functools" in
ALWAYS_DANGEROUS_MODULES which will produce false positives for legitimate ML
usages; update ML_SAFE_GLOBALS to whitelist safe functools members (e.g.,
partial, reduce, lru_cache, wraps) and adjust any checks that treat the entire
module as dangerous so they allow these specific attributes, or add a clarifying
comment (similar to the existing logging/linecache rationale) explaining why the
module is treated specially; reference ALWAYS_DANGEROUS_MODULES and
ML_SAFE_GLOBALS in your change so the scanner accepts common safe usages like
functools.partial or functools.lru_cache while still flagging dangerous patterns
that use functools to wrap unsafe callables.
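The module-level block plus attribute-level allowlist the reviewer asks for can be sketched like this. The set contents are illustrative and `is_dangerous_global` is a hypothetical helper, not the scanner's real interface:

```python
# Illustrative subsets; the real lists are far larger.
ALWAYS_DANGEROUS_MODULES = {"functools", "subprocess"}
ML_SAFE_GLOBALS = {
    ("functools", "partial"),
    ("functools", "reduce"),
    ("functools", "lru_cache"),
    ("functools", "wraps"),
}

def is_dangerous_global(module: str, attr: str) -> bool:
    """Attribute-level allowlist overrides the module-level blocklist."""
    if (module, attr) in ML_SAFE_GLOBALS:
        return False  # common, benign ML usage (e.g. functools.partial)
    return module.split(".")[0] in ALWAYS_DANGEROUS_MODULES
```

This keeps `functools.partial` in legitimate checkpoints from triggering findings while still flagging other functools members used to wrap unsafe callables.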
📒 Files selected for processing (4)

- modelaudit/detectors/suspicious_symbols.py
- modelaudit/scanners/flax_msgpack_scanner.py
- modelaudit/scanners/pickle_scanner.py
- tests/scanners/test_pickle_scanner.py
```python
# Maximum number of consecutive non-pickle bytes to skip when resyncing
_MAX_RESYNC_BYTES = 256
resync_skipped = 0
```
🧹 Nitpick | 🔵 Trivial
Consider documenting the security trade-off for _MAX_RESYNC_BYTES.
The 256-byte resync limit prevents the scanner from skipping arbitrarily large amounts of data between pickle streams. However, this limit could potentially be bypassed by inserting 256+ bytes of non-pickle data before a malicious stream. While increasing this limit has performance implications, documenting this trade-off would help future maintainers understand the security model.
The current behavior (stopping after encountering a partial stream with opcodes) is a reasonable defense against misinterpreting binary tensor data as pickle opcodes, but it creates a potential evasion vector for multi-stream attacks.
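The evasion vector can be made concrete with a toy model of the budget. The STOP-finding heuristic and the protocol-4 header check below are deliberate simplifications of what the real scanner does, but they show why a fixed limit is bypassable by padding:

```python
import pickle

LIMIT = 256  # mirrors _MAX_RESYNC_BYTES

def find_second_stream(blob: bytes, limit: int) -> bool:
    """Return True if a second pickle stream is found within `limit` junk bytes."""
    stop = blob.find(b".")            # STOP opcode of the first stream (toy heuristic)
    skipped = 0
    pos = stop + 1
    while pos < len(blob) and skipped <= limit:
        if blob[pos:pos + 2] == b"\x80\x04":  # protocol-4 PROTO header
            return True
        pos += 1
        skipped += 1
    return False

benign = pickle.dumps("x", protocol=4)
hidden = pickle.dumps("evil", protocol=4)
caught = benign + b"\xff" * 100 + hidden    # padding within the 256-byte budget
evaded = benign + b"\xff" * 1000 + hidden   # padding exceeds the budget
```

Anything past the budget is simply never examined, which is exactly the trade-off the comment asks to document.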
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelaudit/scanners/pickle_scanner.py` around lines 52 - 54, Document the
security trade-off for _MAX_RESYNC_BYTES in pickle_scanner.py: update the
comment near the _MAX_RESYNC_BYTES and resync_skipped definitions to explain
that the 256-byte limit prevents skipping arbitrarily large gaps between pickle
streams but can be evaded by inserting >=256 bytes of non-pickle data before a
malicious stream, note the rationale for stopping on partial opcode sequences to
avoid misinterpreting binary tensor data, and recommend making _MAX_RESYNC_BYTES
configurable (and covered by tests) if callers need a different
performance/security trade-off.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelaudit/scanners/pickle_scanner.py (1)
2132-2138: ⚠️ Potential issue | 🟠 Major

Split INST from OBJ/NEWOBJ/NEWOBJ_EX handling: the current condition disables the latter three.

In pickletools.genops(), INST yields a string argument (module and class name), but OBJ, NEWOBJ, and NEWOBJ_EX have no inline arguments and yield arg=None. The isinstance(arg, str) guard prevents these three opcodes from ever triggering the detection pattern, making their addition ineffective.

Suggested fix

```diff
-        if opcode.name in ["INST", "OBJ", "NEWOBJ", "NEWOBJ_EX"] and isinstance(arg, str):
-            return {
-                "pattern": f"{opcode.name}_EXECUTION",
-                "argument": arg,
-                "position": pos,
-                "opcode": opcode.name,
-            }
+        if opcode.name == "INST" and isinstance(arg, str):
+            return {
+                "pattern": f"{opcode.name}_EXECUTION",
+                "argument": arg,
+                "position": pos,
+                "opcode": opcode.name,
+            }
+
+        if opcode.name in {"OBJ", "NEWOBJ", "NEWOBJ_EX"}:
+            return {
+                "pattern": f"{opcode.name}_EXECUTION",
+                "argument": arg,
+                "position": pos,
+                "opcode": opcode.name,
+            }
```

This violates the guideline: "Preserve or strengthen security detections; test both benign and malicious samples when modifying scanners."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelaudit/scanners/pickle_scanner.py` around lines 2132 - 2138, The current check conflates INST with OBJ/NEWOBJ/NEWOBJ_EX by requiring isinstance(arg, str), which blocks OBJ/NEWOBJ/NEWOBJ_EX because they yield arg=None; update the logic in pickle_scanner.py (around the opcode handling using opcode.name and arg, e.g., in the code that references pickletools.genops and opcode.name) to split cases: handle opcode.name == "INST" with the isinstance(arg, str) guard and emit the INST_EXECUTION pattern including the string argument, and add a separate branch for opcode.name in ["OBJ","NEWOBJ","NEWOBJ_EX"] that emits a distinct pattern (e.g., OBJ_EXECUTION or NEWOBJ_EXECUTION) using position and opcode.name while not requiring arg to be a string so those opcodes are detected correctly.
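The factual core of this finding (OBJ, NEWOBJ, and NEWOBJ_EX carry no inline argument, so `isinstance(arg, str)` can never hold for them) can be checked directly against pickletools' own opcode table:

```python
import pickletools

by_name = {op.name: op for op in pickletools.opcodes}

# INST encodes module and class name inline, as a string argument.
assert by_name["INST"].arg is not None

# OBJ / NEWOBJ / NEWOBJ_EX take their class and arguments from the stack,
# so genops() yields arg=None for them; an isinstance(arg, str) guard
# therefore silently disables detection for these three opcodes.
for name in ("OBJ", "NEWOBJ", "NEWOBJ_EX"):
    assert by_name[name].arg is None

print("opcode argument table confirms the review finding")
```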
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelaudit/scanners/flax_msgpack_scanner.py`:
- Around line 726-737: The loop that checks top-level keys (using found_keys,
obj, transformer_keys, model_name_keys, has_transformer_keys) can call
key.lower() on non-string keys and raise AttributeError; guard against that by
first verifying the key is a string (e.g., if isinstance(key, str)) before
calling key.lower(), and only perform the model-name membership checks
(key.lower() in model_name_keys and any(mk in key.lower() for mk in
model_name_keys)) when the key is a string so the scanner won't crash while
preserving the same transformer-key detection logic.
---
Outside diff comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 2132-2138: The current check conflates INST with
OBJ/NEWOBJ/NEWOBJ_EX by requiring isinstance(arg, str), which blocks
OBJ/NEWOBJ/NEWOBJ_EX because they yield arg=None; update the logic in
pickle_scanner.py (around the opcode handling using opcode.name and arg, e.g.,
in the code that references pickletools.genops and opcode.name) to split cases:
handle opcode.name == "INST" with the isinstance(arg, str) guard and emit the
INST_EXECUTION pattern including the string argument, and add a separate branch
for opcode.name in ["OBJ","NEWOBJ","NEWOBJ_EX"] that emits a distinct pattern
(e.g., OBJ_EXECUTION or NEWOBJ_EXECUTION) using position and opcode.name while
not requiring arg to be a string so those opcodes are detected correctly.
⛔ Files ignored due to path filters (1)

- uv.lock is excluded by !**/*.lock

📒 Files selected for processing (2)

- modelaudit/scanners/flax_msgpack_scanner.py
- modelaudit/scanners/pickle_scanner.py
- Guard key.lower() for non-string msgpack keys to prevent an AttributeError that could abort the scan (evasion vector)
- Add functools safe functions to ML_SAFE_GLOBALS to whitelist common ML utilities like partial, reduce, lru_cache
- Increase _MAX_RESYNC_BYTES from 256 to 8192 so binary tensor data between pickle streams doesn't cause early scan termination
- Continue scanning on partial stream errors instead of returning early, preventing evasion via hidden payloads after binary blobs
- Clear _safe_memo at STOP boundaries so safe callables from one pickle stream cannot whitelist dangerous REDUCEs in the next

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelaudit/scanners/flax_msgpack_scanner.py`:
- Around line 695-740: The substring check using model_name_keys (especially
short tokens like "opt" or "score") is too permissive; update the logic in the
block that computes has_transformer_keys (references: model_name_keys,
has_transformer_keys, transformer_keys, key_lower, found_keys, obj.get) to use
stricter token/boundary matching instead of any(mk in key_lower for mk in
model_name_keys): normalize key_lower and match whole tokens (e.g., split on
non-alphanumerics or use regex word boundaries) and explicitly check both exact
matches and token-level membership for entries in model_name_keys, and
add/update unit tests to include adversarial keys (e.g., "optional_score",
"desktop") and true positives to ensure the heuristic no longer false-positives
benign payloads.
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 95-100: When the branch with "if stream_error and had_opcodes" is
taken inside the PickleScanner scanning loop, increment the resync budget by
adding the number of bytes consumed by the discarded buffer to resync_skipped
and abort (raise/stop) when resync_skipped exceeds _MAX_RESYNC_BYTES;
specifically, compute the consumed byte count for the current partial stream
(e.g., bytes between the previous sync point and current file position or
len(buffer) being discarded), add that value to resync_skipped, and enforce the
existing _MAX_RESYNC_BYTES check before continuing so a crafted binary region
cannot indefinitely bypass the resync limit (update references in the scanning
loop where stream_error, had_opcodes, and resync_skipped are used).
📒 Files selected for processing (2)

- modelaudit/scanners/flax_msgpack_scanner.py
- modelaudit/scanners/pickle_scanner.py
```python
# Common HuggingFace model name keys that wrap transformer substructure
model_name_keys = {
    "bert",
    "roberta",
    "distilbert",
    "albert",
    "electra",
    "xlm",
    "gpt2",
    "gpt_neo",
    "gpt_neox",
    "gptj",
    "opt",
    "llama",
    "t5",
    "bart",
    "pegasus",
    "mbart",
    "blenderbot",
    "vit",
    "clip",
    "whisper",
    "wav2vec2",
    "flax_model",
    "classifier",
    "qa_outputs",
    "lm_head",
    "score",
}
has_transformer_keys = any(key in found_keys for key in transformer_keys)

# Check if any top-level key contains transformer sub-keys (nested model structure)
if not has_transformer_keys:
    for key in found_keys:
        value = obj.get(key)
        if isinstance(value, dict):
            sub_keys = set(value.keys())
            if sub_keys & transformer_keys:
                has_transformer_keys = True
                break
        # Also check if the top-level key is a known model name
        # Guard against non-string keys (msgpack allows int/bytes keys)
        key_lower = key.lower() if isinstance(key, str) else str(key).lower()
        if key_lower in model_name_keys or any(mk in key_lower for mk in model_name_keys):
            has_transformer_keys = True
            break
```
model_name_keys substring matching is too permissive and can weaken structural detection.
At Line 738, any(mk in key_lower for mk in model_name_keys) plus short keys like opt/score can classify unrelated payloads as transformer checkpoints, which then short-circuits deeper structure analysis. Please switch to token/boundary matching and add negative/adversarial tests for this heuristic.
🔧 Suggested tightening

```diff
-        key_lower = key.lower() if isinstance(key, str) else str(key).lower()
-        if key_lower in model_name_keys or any(mk in key_lower for mk in model_name_keys):
+        key_lower = key.lower() if isinstance(key, str) else str(key).lower()
+        key_tokens = {tok for tok in re.split(r"[^a-z0-9]+", key_lower) if tok}
+        if key_lower in model_name_keys or bool(key_tokens & model_name_keys):
             has_transformer_keys = True
             break
```

As per coding guidelines: "Preserve or strengthen security detections; test both benign and malicious samples when modifying scanners."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelaudit/scanners/flax_msgpack_scanner.py` around lines 695 - 740, The
substring check using model_name_keys (especially short tokens like "opt" or
"score") is too permissive; update the logic in the block that computes
has_transformer_keys (references: model_name_keys, has_transformer_keys,
transformer_keys, key_lower, found_keys, obj.get) to use stricter token/boundary
matching instead of any(mk in key_lower for mk in model_name_keys): normalize
key_lower and match whole tokens (e.g., split on non-alphanumerics or use regex
word boundaries) and explicitly check both exact matches and token-level
membership for entries in model_name_keys, and add/update unit tests to include
adversarial keys (e.g., "optional_score", "desktop") and true positives to
ensure the heuristic no longer false-positives benign payloads.
```python
if stream_error and had_opcodes:
    # Partial stream: binary data was misinterpreted as opcodes.
    # Discard the buffer but keep scanning — a malicious payload
    # could be hiding after the binary blob. Resync from the
    # current file position to find the next valid stream.
    continue
```
Partial-stream recovery bypasses the resync budget.
When stream_error and had_opcodes is hit, the code continues without increasing resync_skipped. A crafted binary region can repeatedly trigger this path and force excessive scanning work. Count consumed bytes in this branch and enforce _MAX_RESYNC_BYTES.
🔧 Suggested guard

```diff
 if stream_error and had_opcodes:
     # Partial stream: binary data was misinterpreted as opcodes.
     # Discard the buffer but keep scanning — a malicious payload
     # could be hiding after the binary blob. Resync from the
     # current file position to find the next valid stream.
+    advanced = file_obj.tell() - stream_start
+    if advanced <= 0:
+        file_obj.seek(stream_start, 0)
+        if not file_obj.read(1):
+            return
+        advanced = 1
+    resync_skipped += advanced
+    if resync_skipped >= _MAX_RESYNC_BYTES:
+        return
     continue
```

🤖 Prompt for AI Agents
continue🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelaudit/scanners/pickle_scanner.py` around lines 95 - 100, When the branch
with "if stream_error and had_opcodes" is taken inside the PickleScanner
scanning loop, increment the resync budget by adding the number of bytes
consumed by the discarded buffer to resync_skipped and abort (raise/stop) when
resync_skipped exceeds _MAX_RESYNC_BYTES; specifically, compute the consumed
byte count for the current partial stream (e.g., bytes between the previous sync
point and current file position or len(buffer) being discarded), add that value
to resync_skipped, and enforce the existing _MAX_RESYNC_BYTES check before
continuing so a crafted binary region cannot indefinitely bypass the resync
limit (update references in the scanning loop where stream_error, had_opcodes,
and resync_skipped are used).
Summary

- Promote `msgpack>=1.0.0,<2.0` from an optional dependency (extras: flax) to a core dependency in `pyproject.toml`
- The `flax_msgpack` scanner was registered but silently fell back to generic scanning when `msgpack` was not installed, causing `.msgpack` Flax/JAX model files to miss specialized security analysis
- `msgpack` is a lightweight, zero-dependency C extension (~100KB) that is appropriate as a core dependency

Test plan

- `FlaxMsgpackScanner` loads successfully with `python -c "from modelaudit.scanners.flax_msgpack_scanner import FlaxMsgpackScanner; print('OK')"`
- `sshleifer/tiny-gpt2`, `hf-internal-testing/tiny-bert-flax-only`, `patrickvonplaten/t5-tiny-random`: all recognized by the `flax_msgpack` scanner (not the generic fallback)
- `ArthurZ/tiny-random-bert-flax-only`, `sanchit-gandhi/tiny-random-flax-bert`: both scanned correctly
- Full test suite passes (`python -m pytest tests/ -x -q --timeout=60`)

🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Behavior Changes
Tests
Chores