fix: report actual file size in scan summary when scanner exits early #587
yash2998chhabria wants to merge 6 commits into main from
Conversation
Expand ALWAYS_DANGEROUS_MODULES with ~45 new modules not covered by PR #518:
- network (smtplib, imaplib, poplib, nntplib, xmlrpc, socketserver, ssl, requests, aiohttp)
- compilation (codeop, marshal, types, compileall, py_compile)
- FFI (_ctypes)
- debugging (bdb, trace)
- operator bypasses (_operator, functools)
- pickle recursion (pickle, _pickle, dill, cloudpickle, joblib)
- filesystem (tempfile, filecmp, fileinput, glob, distutils, pydoc, pexpect)
- venv/pip (venv, ensurepip, pip)
- threading (signal, _signal, threading)
- more (webbrowser, asyncio, mmap, select, selectors, logging, syslog, tarfile, zipfile, shelve, sqlite3, _sqlite3, doctest, idlelib, lib2to3, uuid)

Add uuid as a novel detection (VULN-6: _get_command_stdout calls subprocess.Popen) and pkgutil.resolve_name as a dynamic-resolution trampoline. Mirror the new modules in SUSPICIOUS_GLOBALS for defense-in-depth. Fix multi-stream opcode analysis so detailed checks run across all pickle streams. Add NEWOBJ_EX to all opcode check lists. Includes 10 regression tests covering the pkgutil trampoline, uuid RCE, multi-stream exploit, NEWOBJ_EX, and spot checks for smtplib, sqlite3, tarfile, marshal, cloudpickle, and webbrowser.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e scanning
- Fix test regression: track first pickle STOP position and use it for binary content scanning instead of f.tell() (which advances past multi-stream data)
- Fix multi-stream resync bypass: skip junk separator bytes between streams (up to 256 bytes) instead of returning immediately
- Gracefully handle ValueErrors on subsequent streams (non-pickle data after a valid pickle) instead of propagating them as file-level errors
- Add type hints: ScanResult return type on _scan_bytes, -> None on all test methods
- Add separator-byte resync regression test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Buffer opcodes for subsequent pickle streams and discard partial streams (binary tensor data misinterpreted as opcodes) to prevent MANY_DANGEROUS_OPCODES false positives
- Add resync logic to skip up to 256 non-pickle separator bytes between streams, preventing single-byte bypass of multi-stream detection
- Track pickle memo (BINPUT/BINGET) for safe ML globals so REDUCE calls referencing safe callables via memo are not counted as dangerous opcodes
- Reset dangerous opcode counters at STOP boundaries so each stream is evaluated independently
- Use first_pickle_end_pos for binary content scanning instead of f.tell(), which advances past tensor data during multi-stream parsing

Validated against 5 HuggingFace models with A/B comparison to main branch — no new false positives introduced.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
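The resync behavior this commit describes can be sketched as a standalone generator. The cap value and the name _MAX_RESYNC_BYTES come from the commit message, but the helper itself is an illustrative assumption, not the project's scanner code:

```python
import io
import pickle
import pickletools

# Cap on junk separator bytes skipped between streams (value taken from
# the commit message; this whole helper is an illustrative sketch).
_MAX_RESYNC_BYTES = 256

def iter_pickle_streams(data: bytes):
    """Yield the opcode list of each pickle stream in `data`, skipping
    up to _MAX_RESYNC_BYTES of non-pickle separator bytes between them."""
    pos = 0
    skipped = 0
    while pos < len(data) and skipped <= _MAX_RESYNC_BYTES:
        try:
            ops = list(pickletools.genops(io.BytesIO(data[pos:])))
        except Exception:
            ops = []
        if ops and ops[-1][0].name == "STOP":
            yield ops
            pos += ops[-1][2] + 1  # advance past this stream's STOP byte
            skipped = 0            # a valid stream resets the resync budget
        else:
            pos += 1               # resync: skip one junk byte and retry
            skipped += 1

# Two pickles separated by three junk bytes: both streams are recovered,
# which is what defeats the single-byte-separator bypass.
data = pickle.dumps([1, 2]) + b"\xff\xfe\xfd" + pickle.dumps({"a": 1})
streams = list(iter_pickle_streams(data))
```

Restarting the parse one byte at a time is quadratic in the worst case; a real scanner would bound this with the resync cap, as the commit does.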
Scanners that return early (missing optional dependency, parse error, etc.) left bytes_scanned at its default value of 0, causing the scan summary to always display "Size: 0 bytes". This affected ONNX, TFLite, Flax, XGBoost, and several other format scanners. Add a fallback in _scan_file_internal that sets bytes_scanned to the actual file size (already computed via os.path.getsize) whenever a scanner leaves it at zero. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
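A minimal sketch of the fallback described above. The ScanResult shape and function names here are simplified assumptions based on this description, not the project's actual signatures:

```python
import os
import tempfile
from dataclasses import dataclass

@dataclass
class ScanResult:
    # simplified stand-in for the project's ScanResult
    bytes_scanned: int = 0

def scan_file_with_fallback(path: str, scanner) -> ScanResult:
    """Run a scanner; if it exits early and leaves bytes_scanned at 0,
    fall back to the file size already computed via os.path.getsize."""
    file_size = os.path.getsize(path)
    result = scanner(path)
    if result.bytes_scanned == 0:
        result.bytes_scanned = file_size
    return result

# A scanner that bails out early (e.g. a missing optional dependency)
# and never updates bytes_scanned:
def early_exit_scanner(path: str) -> ScanResult:
    return ScanResult()

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * 128)
    tmp_path = f.name

result = scan_file_with_fallback(tmp_path, early_exit_scanner)
os.unlink(tmp_path)
```

With the fallback, the summary reports the real file size instead of "Size: 0 bytes" whenever a scanner returns without touching bytes_scanned.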
Walkthrough
Adds a bytes_scanned fallback in core scanning, vastly extends the suspicious-globals blocklist, upgrades the pickle scanner for multi-stream parsing and NEWOBJ_EX-aware detection, and introduces regression tests covering multi-stream resync, opcode variants, and blocklist hardening.
Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as Client
    participant Scanner as PickleScanner
    participant Parser as _genops_with_fallback
    participant Opcode as OpcodeProcessor
    participant Matcher as PatternMatcher
    Client->>Scanner: scan(file_obj)
    Scanner->>Parser: _genops_with_fallback(file_obj, multi_stream=True)
    Parser->>Parser: parse stream, detect first_pickle_end_pos, resync/buffer subsequent streams
    Parser->>Opcode: yield opcodes (including NEWOBJ_EX)
    Opcode->>Matcher: evaluate opcode sequences (REDUCE/INST/NEWOBJ/NEWOBJ_EX)
    Matcher->>Scanner: flag dangerous modules/functions via ALWAYS_DANGEROUS_* checks
    Scanner->>Client: return ScanResult (metadata includes first_pickle_end_pos, bytes_scanned)
```
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 52-54: The fixed 256-byte resync cap (_MAX_RESYNC_BYTES) combined
with the resync_skipped counter can hide malicious second pickle streams; update
the resynchronization logic in pickle_scanner.py to remove or substantially
raise the fixed cap and instead scan for additional pickle stream opcodes until
EOF (or until a safe global size-based limit tied to the input length), i.e.,
revise the loop that increments resync_skipped and checks _MAX_RESYNC_BYTES
(around the code using resync_skipped and _MAX_RESYNC_BYTES between lines
~106-115) so it continues scanning for stream markers across larger/padded gaps
(or uses a sliding-window opcode search) rather than stopping after 256 bytes.
- Around line 1549-1553: When handling the pickle stream boundary (the branch
where opcode.name == "STOP") you currently reset counters but not the
stream-local memo; add logic to clear or reinitialize the scanner's _safe_memo
at the same place you reset dangerous_opcode_count, consecutive_dangerous and
max_consecutive so memo indexes from a previous stream cannot be reused to mark
BINGET->REDUCE pairs in a new stream as safe. Update the STOP handling block in
pickle_scanner.py to reset _safe_memo (e.g., assign a fresh empty memo
structure) while keeping the existing counter resets to preserve detection
behavior.
- Line 1477: The current conditional incorrectly requires isinstance(arg, str)
for OBJ/NEWOBJ/NEWOBJ_EX, making those branches unreachable; split the logic so
INST is handled when opcode.name == "INST" and isinstance(arg, str) as before,
but handle opcode.name in ["OBJ","NEWOBJ","NEWOBJ_EX"] in a separate branch that
does not require arg to be a string and instead inspects the stack or preceding
GLOBAL/STACK items to resolve the class being instantiated (so NEWOBJ_EX
following a dangerous GLOBAL is flagged); update the logic in pickle_scanner.py
where opcode and arg are checked and ensure the test
test_newobj_ex_dangerous_class can detect the dangerous pattern.
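A hedged sketch of the split this comment asks for. The helper name classify_instantiation and the last_global parameter are hypothetical stand-ins for the scanner's GLOBAL/stack tracking, not the project's code:

```python
from typing import Optional

def classify_instantiation(opcode_name: str, arg, last_global: Optional[str]):
    """Separate INST (class name carried in the string argument) from the
    stack-based OBJ/NEWOBJ/NEWOBJ_EX opcodes, which have no string arg."""
    if opcode_name == "INST" and isinstance(arg, str):
        return {"pattern": "INST_EXECUTION", "argument": arg}
    if opcode_name in {"OBJ", "NEWOBJ", "NEWOBJ_EX"}:
        # Resolve the class from the preceding GLOBAL/stack item instead
        # of requiring a string argument that these opcodes never carry.
        return {
            "pattern": f"{opcode_name}_EXECUTION",
            "argument": last_global if last_global is not None else "<unresolved>",
        }
    return None

# NEWOBJ_EX has arg=None, so the old `isinstance(arg, str)` gate made
# this branch unreachable; resolving via the tracked GLOBAL fixes it.
hit = classify_instantiation("NEWOBJ_EX", None, "os._wrap_close")
```

Gating only INST on the string check while branching on opcode name first is the core of the fix; the real scanner would additionally vet the resolved class against its dangerous-module lists.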
In `@tests/scanners/test_pickle_scanner.py`:
- Around line 647-695: The module-blocklist spot tests (e.g.,
test_smtplib_blocked, test_sqlite3_blocked, test_tarfile_blocked,
test_marshal_blocked, test_cloudpickle_blocked, test_webbrowser_blocked) only
assert result.has_errors and may pass on unrelated failure paths; add an
explicit assert result.success (immediately after obtaining result from
self._scan_bytes(...)) in each of these test_* methods to ensure the
CRITICAL/module checks validate successful detection flow before the existing
severity/message assertions.
ℹ️ Review info
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
- modelaudit/core.py
- modelaudit/detectors/suspicious_symbols.py
- modelaudit/scanners/pickle_scanner.py
- tests/scanners/test_pickle_scanner.py
Resolve conflicts in pickle_scanner.py:
- DANGEROUS_FUNCTIONS: keep both branch additions (pkgutil, uuid) and main additions (cProfile, profile, pdb, timeit, ctypes functions)
- ALWAYS_DANGEROUS_MODULES: merge main's core blocklist (ctypes, cProfile, profile, pdb, timeit, _thread, multiprocessing, zipimport, importlib, runpy) with the branch's extended coverage (pkgutil, smtplib, marshal, functools, pickle, cloudpickle, tempfile, uuid, etc.) and preserve main's NOTE about linecache/logging.config
- REDUCE opcode handling: adopt main's resolved_callables lookup with the branch's _safe_memo fallback for BINGET/memo patterns

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 2
♻️ Duplicate comments (4)
modelaudit/scanners/pickle_scanner.py (3)
2209-2213: ⚠️ Potential issue | 🟠 Major
Reset _safe_memo at each stream boundary (STOP). Line 2209 resets counters only; memo-index safety state can leak into the next stream.
🔧 Suggested fix

```diff
 if opcode.name == "STOP":
     dangerous_opcode_count = 0
     consecutive_dangerous = 0
     max_consecutive = 0
+    _safe_memo.clear()
     continue
```

Based on learnings: "Preserve or strengthen security detections; test both benign and malicious samples when modifying scanners".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelaudit/scanners/pickle_scanner.py` around lines 2209 - 2213, At each pickle stream boundary where the code checks opcode.name == "STOP" (inside the PickleScanner/scan loop), also reset the memo safety state by clearing or reinitializing _safe_memo so memo-index safety does not leak between streams; update the STOP handling block that currently resets dangerous_opcode_count, consecutive_dangerous, and max_consecutive to additionally reset _safe_memo (and any related memo tracking structures) and run unit tests for both benign and malicious samples to ensure detections are preserved.
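The leak the comment describes can be shown with a toy simulation. The names _safe_memo-style tracking and the pseudo-opcode BINGET_REDUCE mirror the review discussion but are assumptions, not the scanner's real control flow:

```python
# Toy simulation: memo slot 0 is marked safe in stream 1; clearing the
# memo at STOP prevents stream 2 from inheriting that trust.
safe_memo: set[int] = set()
flags: list[str] = []

def handle(opcode_name: str, arg, is_safe_global: bool = False) -> None:
    if opcode_name == "BINPUT" and is_safe_global:
        safe_memo.add(arg)   # memo slot now holds a known-safe callable
    elif opcode_name == "BINGET_REDUCE":
        # a REDUCE whose callable came from memo slot `arg`
        if arg not in safe_memo:
            flags.append(f"dangerous memo ref {arg}")
    elif opcode_name == "STOP":
        safe_memo.clear()    # per-stream reset (the fix suggested above)

# Stream 1 marks slot 0 safe; STOP clears the memo at the boundary.
handle("BINPUT", 0, is_safe_global=True)
handle("STOP", None)
# Stream 2 reuses slot 0 without re-marking it safe, so it is flagged.
handle("BINGET_REDUCE", 0)
```

Without the clear() at STOP, the second stream's memo reference would be whitelisted by stale state from the first stream.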
2132-2138: ⚠️ Potential issue | 🟠 Major
OBJ/NEWOBJ/NEWOBJ_EX branch is unreachable behind isinstance(arg, str). Line 2132 gates stack-based opcodes behind a string-arg check; only INST can satisfy this in practice.
🔧 Suggested fix

```diff
-if opcode.name in ["INST", "OBJ", "NEWOBJ", "NEWOBJ_EX"] and isinstance(arg, str):
+if opcode.name in {"OBJ", "NEWOBJ", "NEWOBJ_EX"}:
+    return {
+        "pattern": f"{opcode.name}_EXECUTION",
+        "argument": str(arg),
+        "position": pos,
+        "opcode": opcode.name,
+    }
+if opcode.name == "INST" and isinstance(arg, str):
     return {
         "pattern": f"{opcode.name}_EXECUTION",
         "argument": arg,
         "position": pos,
         "opcode": opcode.name,
     }
```

```bash
#!/bin/bash
# Verify pickletools opcode argument descriptors for the affected opcodes.
python - <<'PY'
import pickletools
for name in ("INST", "OBJ", "NEWOBJ", "NEWOBJ_EX"):
    op = next(o for o in pickletools.opcodes if o.name == name)
    print(f"{name}: has_stream_argument={op.arg is not None}")
PY
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelaudit/scanners/pickle_scanner.py` around lines 2132 - 2138, The branch currently requires isinstance(arg, str) for all opcodes, making the OBJ/NEWOBJ/NEWOBJ_EX cases unreachable; update the logic in pickle_scanner.py (around the block handling opcode.name in ["INST","OBJ","NEWOBJ","NEWOBJ_EX"]) to separate handling: keep the INST path gated by isinstance(arg, str) to return the INST_EXECUTION pattern, but remove the str requirement for OBJ, NEWOBJ, and NEWOBJ_EX so they return their respective _EXECUTION pattern even when arg is not a string (handle None/int safely), i.e., branch on opcode.name first and only apply isinstance(arg, str) for INST to avoid crashes and correctly detect stack-based opcodes.
52-54: ⚠️ Potential issue | 🟠 Major
Fixed 256-byte resync limit is bypassable with larger separators.
Line 113 hard-stops resync after 256 bytes, so padded separators can hide later malicious streams.
🔧 Suggested hardening

```diff
-# Maximum number of consecutive non-pickle bytes to skip when resyncing
-_MAX_RESYNC_BYTES = 256
@@
-        if resync_skipped >= _MAX_RESYNC_BYTES:
-            return
```

Based on learnings: "Preserve or strengthen security detections; test both benign and malicious samples when modifying scanners".
Also applies to: 113-114
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelaudit/scanners/pickle_scanner.py` around lines 52 - 54, The current fixed resync limit (_MAX_RESYNC_BYTES = 256) and resync_skipped counter can be bypassed by large padded separators; update the resync logic in the pickle scanner to stop relying on a small hard cap and instead scan the incoming stream for valid pickle start/marker patterns until a legitimate pickle header or EOF is found. Make _MAX_RESYNC_BYTES configurable (or remove the fixed cap) and change the loop that increments resync_skipped to perform a sliding search for known pickle opcodes/headers and reset resync_skipped when a candidate is found; ensure the scanner still fails safely on EOF and include tests for both short and very large padded separators to validate detection.

tests/scanners/test_pickle_scanner.py (1)
744-792: ⚠️ Potential issue | 🟡 Minor
Add explicit result.success assertions in module spot checks. These tests currently assert has_errors only; they can pass on non-target failure paths.
🔧 Suggested consolidation

```diff
+def _assert_module_blocked(self, module: str, func: str) -> None:
+    result = self._scan_bytes(self._craft_global_reduce_pickle(module, func))
+    assert result.success
+    assert result.has_errors
+    assert any(
+        i.severity == IssueSeverity.CRITICAL and module in i.message
+        for i in result.issues
+    ), f"Expected CRITICAL {module} issue, got: {[i.message for i in result.issues]}"
@@
 def test_smtplib_blocked(self) -> None:
-    result = self._scan_bytes(self._craft_global_reduce_pickle("smtplib", "SMTP"))
-    assert result.has_errors
-    assert any(i.severity == IssueSeverity.CRITICAL and "smtplib" in i.message for i in result.issues), (
-        f"Expected CRITICAL smtplib issue, got: {[i.message for i in result.issues]}"
-    )
+    self._assert_module_blocked("smtplib", "SMTP")
```

Based on learnings: "Add comprehensive tests including edge cases and regression coverage for scanner changes".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/scanners/test_pickle_scanner.py` around lines 744 - 792, The tests (e.g., test_smtplib_blocked, test_sqlite3_blocked, test_tarfile_blocked, test_marshal_blocked, test_cloudpickle_blocked, test_webbrowser_blocked) only assert result.has_errors and can pass on unrelated failure paths; update each to also assert the explicit success flag (e.g., assert result.success is False or assert not result.success) after calling _scan_bytes(_craft_global_reduce_pickle(...)) so the test guarantees the scanner reported a failure state for these dangerous-module pickles.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 435-451: The list literal that adds "logging" to the
always-dangerous modules conflicts with the nearby intent note that
"logging.config" is intentionally excluded; either remove "logging" from the
always-dangerous/modules blocklist or update the comment to explicitly state
that only specific submodules (e.g., "logging.handlers" or others) are
considered dangerous while "logging.config" is excluded. Locate the module list
in pickle_scanner.py (the literal containing "webbrowser", "asyncio", ...,
"logging", ...) and make the change so the code and comment consistently reflect
the intended policy.
- Around line 91-95: The early return on a partial-decode inside the pickle
scanner (the block that checks stream_error and had_opcodes) aborts scanning the
rest of the file and can miss later malicious streams; instead, discard the
current buffer and continue scanning subsequent streams. Edit the code in
pickle_scanner.py (the block using stream_error and had_opcodes inside the
stream-processing loop) to remove the return and ensure control continues to the
next stream (e.g., clear/reset the per-stream buffer/state and use continue or
otherwise advance the outer loop), preserving any detection bookkeeping so later
streams are still scanned.
---
Duplicate comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 2209-2213: At each pickle stream boundary where the code checks
opcode.name == "STOP" (inside the PickleScanner/scan loop), also reset the memo
safety state by clearing or reinitializing _safe_memo so memo-index safety does
not leak between streams; update the STOP handling block that currently resets
dangerous_opcode_count, consecutive_dangerous, and max_consecutive to
additionally reset _safe_memo (and any related memo tracking structures) and run
unit tests for both benign and malicious samples to ensure detections are
preserved.
- Around line 2132-2138: The branch currently requires isinstance(arg, str) for
all opcodes, making the OBJ/NEWOBJ/NEWOBJ_EX cases unreachable; update the logic
in pickle_scanner.py (around the block handling opcode.name in
["INST","OBJ","NEWOBJ","NEWOBJ_EX"]) to separate handling: keep the INST path
gated by isinstance(arg, str) to return the INST_EXECUTION pattern, but remove
the str requirement for OBJ, NEWOBJ, and NEWOBJ_EX so they return their
respective _EXECUTION pattern even when arg is not a string (handle None/int
safely), i.e., branch on opcode.name first and only apply isinstance(arg, str)
for INST to avoid crashes and correctly detect stack-based opcodes.
- Around line 52-54: The current fixed resync limit (_MAX_RESYNC_BYTES = 256)
and resync_skipped counter can be bypassed by large padded separators; update
the resync logic in the pickle scanner to stop relying on a small hard cap and
instead scan the incoming stream for valid pickle start/marker patterns until a
legitimate pickle header or EOF is found. Make _MAX_RESYNC_BYTES configurable
(or remove the fixed cap) and change the loop that increments resync_skipped to
perform a sliding search for known pickle opcodes/headers and reset
resync_skipped when a candidate is found; ensure the scanner still fails safely
on EOF and include tests for both short and very large padded separators to
validate detection.
In `@tests/scanners/test_pickle_scanner.py`:
- Around line 744-792: The tests (e.g., test_smtplib_blocked,
test_sqlite3_blocked, test_tarfile_blocked, test_marshal_blocked,
test_cloudpickle_blocked, test_webbrowser_blocked) only assert result.has_errors
and can pass on unrelated failure paths; update each to also assert the explicit
success flag (e.g., assert result.success is False or assert not result.success)
after calling _scan_bytes(_craft_global_reduce_pickle(...)) so the test
guarantees the scanner reported a failure state for these dangerous-module
pickles.
ℹ️ Review info
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- modelaudit/detectors/suspicious_symbols.py
- modelaudit/scanners/pickle_scanner.py
- tests/scanners/test_pickle_scanner.py
```python
"webbrowser",
"asyncio",
"mmap",
"select",
"selectors",
"logging",
"syslog",
"tarfile",
"zipfile",
"shelve",
"sqlite3",
"_sqlite3",
"doctest",
"idlelib",
"lib2to3",
# uuid — _get_command_stdout internally calls subprocess.Popen (VULN-6)
"uuid",
```
logging blocklist addition conflicts with nearby intent note.
Line 440 adds logging to always-dangerous modules, but the nearby note states logging.config is intentionally not in this set. Please align code or comment to avoid policy ambiguity.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelaudit/scanners/pickle_scanner.py` around lines 435 - 451, The list
literal that adds "logging" to the always-dangerous modules conflicts with the
nearby intent note that "logging.config" is intentionally excluded; either
remove "logging" from the always-dangerous/modules blocklist or update the
comment to explicitly state that only specific submodules (e.g.,
"logging.handlers" or others) are considered dangerous while "logging.config" is
excluded. Locate the module list in pickle_scanner.py (the literal containing
"webbrowser", "asyncio", ..., "logging", ...) and make the change so the code
and comment consistently reflect the intended policy.
- Increase _MAX_RESYNC_BYTES from 256 to 8192 to prevent attackers from hiding malicious streams behind padding gaps larger than the resync window
- Continue scanning on partial stream decode errors instead of returning early, so malicious payloads later in the file are not missed
- Clear _safe_memo at STOP boundaries so stale memo entries from a prior stream cannot mark BINGET->REDUCE pairs in a new stream as safe
- Split OBJ/NEWOBJ/NEWOBJ_EX detection from INST: these stack-based opcodes have no string arg, so resolve the class via callable_refs instead
- Strengthen module blocklist test assertions with explicit result.success and result.issues checks before severity/message checks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelaudit/scanners/pickle_scanner.py (1)
2149-2156: ⚠️ Potential issue | 🟠 Major
INST path currently flags all string-arg INST opcodes as dangerous. Unlike OBJ/NEWOBJ/NEWOBJ_EX, the INST branch returns immediately without checking whether the resolved target is actually dangerous. This can create false critical findings.
Proposed fix

```diff
 # INST encodes the class directly in the string argument.
 if opcode.name == "INST" and isinstance(arg, str):
-    return {
-        "pattern": f"{opcode.name}_EXECUTION",
-        "argument": arg,
-        "position": pos,
-        "opcode": opcode.name,
-    }
+    parsed = _parse_module_function(arg)
+    if parsed:
+        mod, func = parsed
+        if _is_dangerous_ref(mod, func):
+            return {
+                "pattern": f"{opcode.name}_EXECUTION",
+                "argument": f"{mod}.{func}",
+                "position": pos,
+                "opcode": opcode.name,
+            }
```
♻️ Duplicate comments (1)
modelaudit/scanners/pickle_scanner.py (1)
52-55: ⚠️ Potential issue | 🟠 Major
Fixed resync cap still allows padded multi-stream bypass.
A malicious stream placed after >8192 non-pickle bytes will still be skipped because scanning terminates at Line 117–Line 118.
Based on learnings: "Preserve or strengthen security detections; test both benign and malicious samples when modifying scanners".
Hardening direction

```diff
-_MAX_RESYNC_BYTES = 8192
+# Avoid fixed-size resync cutoffs that can be bypassed with padding.
+# Runtime is already bounded by scanner timeout / opcode limits.
@@
-        if resync_skipped >= _MAX_RESYNC_BYTES:
-            return
         continue
```

Also applies to: 117-118
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelaudit/scanners/pickle_scanner.py` around lines 52 - 55, The scanner's fixed resync cap (_MAX_RESYNC_BYTES = 8192) plus the early termination in the resync logic can be bypassed by padding a malicious pickle after a larger gap; update the resync handling to not silently stop scanning once the skipped byte count exceeds _MAX_RESYNC_BYTES—either (a) remove the premature termination and continue scanning the remainder of the stream for additional pickle markers, or (b) treat exceeding the cap as a detection (fail/flag) instead of skipping further checks; locate the resync loop/branch that references _MAX_RESYNC_BYTES and change its control flow accordingly so no multi-stream payload can hide beyond the cap.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 93-103: The code currently resets stream_error and then still
yields the prior buffered opcodes, allowing false positives; in the block
handling a partial stream (the branch that checks if stream_error and
had_opcodes) explicitly discard the buffered opcode store (e.g., clear or
reassign buffered = [] or buffered.clear()) before resetting
had_opcodes/stream_error so no buffered data is yielded later; update the logic
around the variables buffered, had_opcodes, and stream_error in
pickle_scanner.py to only yield from buffered when stream_error is false and
buffered was not discarded.
In `@tests/scanners/test_pickle_scanner.py`:
- Around line 723-739: The test test_newobj_ex_dangerous_class currently can
pass due to the GLOBAL 'os._wrap_close' match rather than verifying NEWOBJ_EX
handling; update the assertion to specifically verify that the reported issue is
tied to NEWOBJ_EX by checking the issue details from result.issues (e.g.,
message contains "NEWOBJ_EX" or an opcode/source field indicating NEWOBJ_EX) and
still ensure it's CRITICAL; use the existing test helper _scan_bytes and
IssueSeverity.CRITICAL to locate the relevant issue and assert that at least one
issue is NEWOBJ_EX-specific so the test fails if NEWOBJ_EX detection regresses.
---
Duplicate comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 52-55: The scanner's fixed resync cap (_MAX_RESYNC_BYTES = 8192)
plus the early termination in the resync logic can be bypassed by padding a
malicious pickle after a larger gap; update the resync handling to not silently
stop scanning once the skipped byte count exceeds _MAX_RESYNC_BYTES—either (a)
remove the premature termination and continue scanning the remainder of the
stream for additional pickle markers, or (b) treat exceeding the cap as a
detection (fail/flag) instead of skipping further checks; locate the resync
loop/branch that references _MAX_RESYNC_BYTES and change its control flow
accordingly so no multi-stream payload can hide beyond the cap.
ℹ️ Review info
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- modelaudit/scanners/pickle_scanner.py
- tests/scanners/test_pickle_scanner.py
```python
if stream_error and had_opcodes:
    # Partial stream: binary data was misinterpreted as opcodes.
    # Discard the buffer but keep scanning — a valid malicious
    # stream may follow later in the file.
    had_opcodes = False
    stream_error = False

if not stream_error:
    # Stream completed successfully — yield buffered opcodes
    yield from buffered
```
Partial stream opcodes are not actually discarded after parse errors.
At Line 97–Line 102, the code marks the stream as non-valid but still yields buffered because stream_error is reset to False. That can surface false positives from garbage bytes.
Proposed fix
```diff
 if stream_error and had_opcodes:
     # Partial stream: binary data was misinterpreted as opcodes.
     # Discard the buffer but keep scanning — a valid malicious
     # stream may follow later in the file.
-    had_opcodes = False
-    stream_error = False
+    buffered.clear()
+    had_opcodes = False
+    # Keep stream_error=True so partial buffered opcodes are not yielded
 if not stream_error:
     # Stream completed successfully — yield buffered opcodes
     yield from buffered
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelaudit/scanners/pickle_scanner.py` around lines 93 - 103, The code
currently resets stream_error and then still yields the prior buffered opcodes,
allowing false positives; in the block handling a partial stream (the branch
that checks if stream_error and had_opcodes) explicitly discard the buffered
opcode store (e.g., clear or reassign buffered = [] or buffered.clear()) before
resetting had_opcodes/stream_error so no buffered data is yielded later; update
the logic around the variables buffered, had_opcodes, and stream_error in
pickle_scanner.py to only yield from buffered when stream_error is false and
buffered was not discarded.
```python
def test_newobj_ex_dangerous_class(self) -> None:
    """NEWOBJ_EX opcode with a dangerous class should be flagged."""
    # Craft pickle: PROTO 4 | GLOBAL 'os _wrap_close' | EMPTY_TUPLE | EMPTY_DICT | NEWOBJ_EX | STOP
    # Protocol 4 is needed for NEWOBJ_EX (opcode 0x92)
    proto = b"\x80\x04"
    global_op = b"c" + b"os\n_wrap_close\n"
    empty_tuple = b")"
    empty_dict = b"}"
    newobj_ex = b"\x92"  # NEWOBJ_EX opcode
    stop = b"."
    data = proto + global_op + empty_tuple + empty_dict + newobj_ex + stop

    result = self._scan_bytes(data)
    assert result.success
    assert result.has_errors
    os_issues = [i for i in result.issues if i.severity == IssueSeverity.CRITICAL and "os" in i.message.lower()]
    assert os_issues, f"Expected CRITICAL os issue for NEWOBJ_EX, got: {[i.message for i in result.issues]}"
```
test_newobj_ex_dangerous_class is too broad to validate NEWOBJ_EX-specific detection.
This test can pass from the GLOBAL os._wrap_close finding alone, even if NEWOBJ_EX handling regresses. Please assert on a check/detail tied to opcode == "NEWOBJ_EX" (or NEWOBJ_EX-specific pattern) to make it a true regression guard.
Based on learnings: "Add comprehensive tests including edge cases and regression coverage for scanner changes".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/scanners/test_pickle_scanner.py` around lines 723 - 739, The test
test_newobj_ex_dangerous_class currently can pass due to the GLOBAL
'os._wrap_close' match rather than verifying NEWOBJ_EX handling; update the
assertion to specifically verify that the reported issue is tied to NEWOBJ_EX by
checking the issue details from result.issues (e.g., message contains
"NEWOBJ_EX" or an opcode/source field indicating NEWOBJ_EX) and still ensure
it's CRITICAL; use the existing test helper _scan_bytes and
IssueSeverity.CRITICAL to locate the relevant issue and assert that at least one
issue is NEWOBJ_EX-specific so the test fails if NEWOBJ_EX detection regresses.
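The crafted bytes in the test above can be sanity-checked with the standard library's pickletools.dis, which confirms the stream disassembles to a GLOBAL followed by NEWOBJ_EX. This is only a verification aid for the test fixture, not part of the scanner:

```python
import io
import pickletools

# Same byte layout as the test fixture: PROTO 4, GLOBAL 'os _wrap_close',
# EMPTY_TUPLE, EMPTY_DICT, NEWOBJ_EX (0x92), STOP.
data = b"\x80\x04" + b"c" + b"os\n_wrap_close\n" + b")" + b"}" + b"\x92" + b"."

out = io.StringIO()
pickletools.dis(data, out=out)  # raises if the opcode stream is malformed
listing = out.getvalue()
print(listing)
```

Because dis simulates the stack (GLOBAL pushes one item, EMPTY_TUPLE and EMPTY_DICT push two more, NEWOBJ_EX pops three), a clean disassembly shows the fixture is a well-formed stream that exercises the NEWOBJ_EX path without ever unpickling it.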
Summary
- Add a fallback in _scan_file_internal (core.py) that sets bytes_scanned to the actual file size whenever a scanner leaves it at zero
- Affected scanners no longer report a bytes_scanned of 0 in the scan summary

Test plan
- python -m pytest tests/ -x -q

🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
New Features
Tests