fix(security): harden pickle scanner blocklist and multi-stream analysis #581
yash2998chhabria wants to merge 5 commits into `main` from
Conversation
Expand ALWAYS_DANGEROUS_MODULES with ~45 new modules not covered by PR #518:
- network (smtplib, imaplib, poplib, nntplib, xmlrpc, socketserver, ssl, requests, aiohttp)
- compilation (codeop, marshal, types, compileall, py_compile)
- FFI (_ctypes)
- debugging (bdb, trace)
- operator bypasses (_operator, functools)
- pickle recursion (pickle, _pickle, dill, cloudpickle, joblib)
- filesystem (tempfile, filecmp, fileinput, glob, distutils, pydoc, pexpect)
- venv/pip (venv, ensurepip, pip)
- threading (signal, _signal, threading)
- and more (webbrowser, asyncio, mmap, select, selectors, logging, syslog, tarfile, zipfile, shelve, sqlite3, _sqlite3, doctest, idlelib, lib2to3, uuid)

Add uuid as a novel detection (VULN-6: _get_command_stdout calls subprocess.Popen) and pkgutil.resolve_name as a dynamic resolution trampoline. Mirror the new modules in SUSPICIOUS_GLOBALS for defense-in-depth. Fix multi-stream opcode analysis so detailed checks run across all pickle streams. Add NEWOBJ_EX to all opcode check lists. Includes 10 regression tests covering the pkgutil trampoline, uuid RCE, multi-stream exploit, NEWOBJ_EX, and spot-checks for smtplib, sqlite3, tarfile, marshal, cloudpickle, and webbrowser.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
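The `pkgutil.resolve_name` trampoline described above can be sketched with a small, safe probe. `ResolveTrampoline` is a hypothetical gadget class for illustration; the payload is only inspected statically with `pickletools` and never unpickled:

```python
import pickle
import pickletools
import pkgutil


class ResolveTrampoline:
    """Hypothetical gadget: its REDUCE calls pkgutil.resolve_name("os.system")."""

    def __reduce__(self):
        # resolve_name performs a dotted-path import/lookup, so the pickle
        # reaches os.system without ever naming the os module as a GLOBAL.
        return (pkgutil.resolve_name, ("os.system",))


payload = pickle.dumps(ResolveTrampoline(), protocol=2)

# Static inspection only -- never unpickle untrusted bytes.
names = [
    str(arg)
    for opcode, arg, pos in pickletools.genops(payload)
    if opcode.name in ("GLOBAL", "STACK_GLOBAL") and arg is not None
]
print(names)
```

A blocklist that only matches `os`/`posix` globals misses this payload: the lone GLOBAL opcode resolves `pkgutil.resolve_name`, while `"os.system"` travels as an opaque string argument.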
Walkthrough

Expanded the suspicious-symbols blocklist with many modules/functions and hardened the pickle scanner: added multi-stream parsing with resynchronization, first_pickle_end_pos metadata, NEWOBJ_EX handling, extended dangerous module/function lists, and regression tests validating these behaviors.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant File as File (binary blob)
    participant Picker as PickleScanner
    participant Gen as _genops_with_fallback
    participant Result as ScanResult/Metadata
    participant Binary as BinaryScanner
    File->>Picker: open file_obj
    Picker->>Gen: iterate ops (multi_stream=True)
    Gen-->>Picker: yield opcodes and STOP boundaries
    Gen-->>Picker: emit first_pickle_end_pos (after first stream)
    Picker->>Result: record findings + metadata(first_pickle_end_pos)
    alt trailing bytes exist after first_pickle_end_pos
        Picker->>Binary: hand off trailing bytes
        Binary-->>Result: emit binary scan findings
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 50-73: The loop exits early when had_opcodes is False, allowing a
single non-pickle separator byte to bypass multi_stream scanning; instead of
returning in the multi_stream branch inside the genops iteration, implement a
resynchronization step in the scanner (the while True loop handling
pickletools.genops over file_obj) that, when had_opcodes is False and
multi_stream is True, advances/searches forward through file_obj until either
EOF or a likely pickle start/opcode is found (e.g., bytes indicating a protocol
header or known opcode) and then continues the outer loop so subsequent pickle
streams are scanned; update logic around had_opcodes, next_byte and
file_obj.seek usage to perform this resync rather than immediate return so
malicious streams after junk bytes are not skipped.
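The resynchronization loop this comment asks for can be sketched standalone against `pickletools`. The function name, the 256-byte cap, and the error handling are illustrative, not the scanner's actual API:

```python
import io
import pickle
import pickletools


def scan_streams(blob: bytes, max_resync: int = 256):
    """Yield (start_offset, opcode_count) for each pickle stream in blob,
    skipping up to max_resync junk separator bytes between streams."""
    f = io.BytesIO(blob)
    skipped = 0
    while True:
        start = f.tell()
        count = 0
        try:
            for opcode, arg, pos in pickletools.genops(f):
                count += 1  # genops returns once it decodes STOP
        except ValueError:
            count = 0  # no valid stream begins at this offset
        if count:
            yield start, count
            skipped = 0
            continue
        # Resync: skip one byte and keep searching for the next stream.
        f.seek(start)
        if not f.read(1):
            return  # EOF
        skipped += 1
        if skipped > max_resync:
            return


# Benign stream, one junk separator byte, then a second stream.
blob = pickle.dumps([1, 2]) + b"\x00" + pickle.dumps({"a": 3})
print(list(scan_streams(blob)))
```

Without the resync step, the single `b"\x00"` separator would abort scanning after the first stream, which is exactly the bypass the review describes.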
In `@tests/scanners/test_pickle_scanner.py`:
- Around line 579-599: Update the test_multi_stream_benign_then_malicious case
to also cover separator-byte resynchronization: construct data as the benign
pickle + b"\x00" + the malicious stream produced by _craft_global_reduce_pickle,
call _scan_bytes on that combined buffer, and assert result.success is True,
result.has_errors is True and that a CRITICAL os/posix issue
(IssueSeverity.CRITICAL) appears in result.issues just like the contiguous case;
use the same os_issues filtering logic to validate detection so we catch
regressions in stream resynchronization handling.
- Around line 538-673: Add explicit type hints: annotate the helper _scan_bytes
with a concrete return type (e.g. -> Any if the ScanResult type isn't imported)
and add -> None to each test method you added
(test_pkgutil_resolve_name_critical, test_uuid_get_command_stdout_critical,
test_multi_stream_benign_then_malicious, test_newobj_ex_dangerous_class,
test_smtplib_blocked, test_sqlite3_blocked, test_tarfile_blocked,
test_marshal_blocked, test_cloudpickle_blocked, test_webbrowser_blocked); also
add the necessary import (from typing import Any) if you use Any or import the
actual ScanResult type used by PickleScanner so _scan_bytes returns the correct
annotated type.
ℹ️ Review info
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- modelaudit/detectors/suspicious_symbols.py
- modelaudit/scanners/pickle_scanner.py
- tests/scanners/test_pickle_scanner.py
…e scanning

- Fix test regression: track first pickle STOP position and use it for binary content scanning instead of f.tell() (which advances past multi-stream data)
- Fix multi-stream resync bypass: skip junk separator bytes between streams (up to 256 bytes) instead of returning immediately
- Gracefully handle ValueErrors on subsequent streams (non-pickle data after valid pickle) instead of propagating them as file-level errors
- Add type hints: ScanResult return type on _scan_bytes, -> None on all test methods
- Add separator-byte resync regression test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelaudit/scanners/pickle_scanner.py (1)
1453-1459: ⚠️ Potential issue | 🟡 Minor

Remove dead-code path: OBJ/NEWOBJ/NEWOBJ_EX detection requires removing the `isinstance(arg, str)` guard.

The condition on line 1453 prevents detection of OBJ, NEWOBJ, and NEWOBJ_EX opcodes because pickletools always returns `None` (not a string) for their arguments, making these branches unreachable. Only INST has a string argument. Separate the logic to detect these dangerous opcodes without the string guard, while keeping the guard for INST:

Fix

```diff
-        if opcode.name in ["INST", "OBJ", "NEWOBJ", "NEWOBJ_EX"] and isinstance(arg, str):
+        if opcode.name in ["OBJ", "NEWOBJ", "NEWOBJ_EX"]:
+            return {
+                "pattern": f"{opcode.name}_EXECUTION",
+                "argument": str(arg),
+                "position": pos,
+                "opcode": opcode.name,
+            }
+        if opcode.name == "INST" and isinstance(arg, str):
             return {
                 "pattern": f"{opcode.name}_EXECUTION",
                 "argument": arg,
                 "position": pos,
                 "opcode": opcode.name,
             }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelaudit/scanners/pickle_scanner.py` around lines 1453 - 1459, The current combined condition blocks detection of OBJ/NEWOBJ/NEWOBJ_EX because it requires isinstance(arg, str); split the logic in pickle_scanner.py so INST keeps the isinstance(arg, str) guard but OBJ/NEWOBJ/NEWOBJ_EX are detected regardless of arg type. Concretely, replace the single if with two branches: one checking opcode.name == "INST" and isinstance(arg, str) to return the INST_EXECUTION record, and another checking opcode.name in ["OBJ","NEWOBJ","NEWOBJ_EX"] (no isinstance check) to return the corresponding { "pattern": f"{opcode.name}_EXECUTION", "argument": arg, "position": pos, "opcode": opcode.name } result.
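The `arg=None` behaviour behind this finding is easy to confirm empirically. `Widget` is a throwaway class for illustration:

```python
import pickle
import pickletools


class Widget:
    """Plain class: protocol 2 pickles instances via the NEWOBJ opcode."""


payload = pickle.dumps(Widget(), protocol=2)
args_by_opcode = {op.name: arg for op, arg, pos in pickletools.genops(payload)}

# NEWOBJ takes its operands from the stack, so pickletools reports no
# inline argument -- a guard like isinstance(arg, str) is never true for it.
print("NEWOBJ" in args_by_opcode, args_by_opcode.get("NEWOBJ"))
```

The same holds for OBJ and NEWOBJ_EX: only INST carries a string argument inline, which is why the combined guard makes the other branches unreachable.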
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 82-91: The resync logic in pickle_scanner.py currently aborts
scanning after resync_skipped reaches _MAX_RESYNC_BYTES, allowing an attacker to
hide a second stream behind >256 bytes of padding; update the loop around
had_opcodes/resync_skipped to remove or raise the fixed cap: either eliminate
the _MAX_RESYNC_BYTES check entirely so scanning continues until EOF (using
file_obj.read(1) to detect EOF) or replace the cap with a computed
remaining-file-length limit (e.g., derive remaining bytes via file_obj.seek/
tell and use that instead of _MAX_RESYNC_BYTES); adjust references to
resync_skipped, file_obj.seek(stream_start, 0), and file_obj.read(1) accordingly
so multi-stream resync continues until EOF rather than stopping after 256 bytes.
---
Outside diff comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 1453-1459: The current combined condition blocks detection of
OBJ/NEWOBJ/NEWOBJ_EX because it requires isinstance(arg, str); split the logic
in pickle_scanner.py so INST keeps the isinstance(arg, str) guard but
OBJ/NEWOBJ/NEWOBJ_EX are detected regardless of arg type. Concretely, replace
the single if with two branches: one checking opcode.name == "INST" and
isinstance(arg, str) to return the INST_EXECUTION record, and another checking
opcode.name in ["OBJ","NEWOBJ","NEWOBJ_EX"] (no isinstance check) to return the
corresponding { "pattern": f"{opcode.name}_EXECUTION", "argument": arg,
"position": pos, "opcode": opcode.name } result.
📒 Files selected for processing (2)
- modelaudit/scanners/pickle_scanner.py
- tests/scanners/test_pickle_scanner.py
```python
                if not had_opcodes:
                    # Resync: the current byte was not a valid pickle start.
                    # Skip one byte and keep searching for the next stream, up to a limit.
                    file_obj.seek(stream_start, 0)
                    if not file_obj.read(1):
                        return  # EOF
                    resync_skipped += 1
                    if resync_skipped >= _MAX_RESYNC_BYTES:
                        return
                    continue
```
Avoid terminating multi-stream resync after 256 bytes; this enables a padding bypass.
At Line 89, the scanner stops after a fixed 256-byte junk window. An attacker can prepend >256 separator bytes before a malicious second stream and evade opcode-level multi-stream detection.
🔧 Suggested hardening

```diff
-                    if resync_skipped >= _MAX_RESYNC_BYTES:
-                        return
+                    if resync_skipped >= _MAX_RESYNC_BYTES:
+                        # Bounded fast-forward search for the next likely protocol header
+                        # so simple padding cannot terminate multi-stream scanning.
+                        probe_start = file_obj.tell()
+                        probe = file_obj.read(64 * 1024)
+                        candidate = next(
+                            (
+                                i
+                                for i in range(len(probe) - 1)
+                                if probe[i] == 0x80 and probe[i + 1] in (2, 3, 4, 5)
+                            ),
+                            -1,
+                        )
+                        if candidate < 0:
+                            return
+                        file_obj.seek(probe_start + candidate, 0)
+                        resync_skipped = 0
```

Based on learnings: "Preserve or strengthen security detections; test both benign and malicious samples when modifying scanners".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelaudit/scanners/pickle_scanner.py` around lines 82 - 91, The resync logic
in pickle_scanner.py currently aborts scanning after resync_skipped reaches
_MAX_RESYNC_BYTES, allowing an attacker to hide a second stream behind >256
bytes of padding; update the loop around had_opcodes/resync_skipped to remove or
raise the fixed cap: either eliminate the _MAX_RESYNC_BYTES check entirely so
scanning continues until EOF (using file_obj.read(1) to detect EOF) or replace
the cap with a computed remaining-file-length limit (e.g., derive remaining
bytes via file_obj.seek/ tell and use that instead of _MAX_RESYNC_BYTES); adjust
references to resync_skipped, file_obj.seek(stream_start, 0), and
file_obj.read(1) accordingly so multi-stream resync continues until EOF rather
than stopping after 256 bytes.
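The padding bypass and the header-probe remedy can be demonstrated in isolation. The probe mirrors the suggested hardening; the byte counts and the second stream (a benign dict standing in for a malicious payload) are illustrative:

```python
import pickle

# A benign stream, >256 padding bytes, then a second stream.
benign = pickle.dumps("ok", protocol=2)
second = pickle.dumps({"payload": 1}, protocol=2)
blob = benign + b"\xaa" * 300 + second

# Byte-at-a-time resync capped at 256 bytes gives up inside the padding
# (300 > 256), so it never reaches the second stream.

# A forward probe for a plausible PROTO header (0x80 followed by a
# protocol version byte) finds the second stream regardless of padding:
idx = next(
    i
    for i in range(len(benign), len(blob) - 1)
    if blob[i] == 0x80 and blob[i + 1] in (2, 3, 4, 5)
)
print(idx, idx == len(benign) + 300)
```

Protocol 2+ pickles begin with the PROTO opcode (`0x80` plus a version byte), which is what makes this cheap signature search viable; protocol 0/1 streams have no such marker and would still need byte-wise resync.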
Thanks rabbit
@yash2998chhabria You're welcome! I'm glad the security analysis was helpful. Feel free to reach out if you have any questions about the suggested hardening approach or need assistance with testing the fix. 🐰
🧠 Learnings used
Learnt from: CR
Repo: promptfoo/modelaudit PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-23T07:15:18.667Z
Learning: Applies to modelaudit/scanners/**/*.py : Preserve or strengthen security detections; test both benign and malicious samples when modifying scanners
- Buffer opcodes for subsequent pickle streams and discard partial streams (binary tensor data misinterpreted as opcodes) to prevent MANY_DANGEROUS_OPCODES false positives
- Add resync logic to skip up to 256 non-pickle separator bytes between streams, preventing single-byte bypass of multi-stream detection
- Track pickle memo (BINPUT/BINGET) for safe ML globals so REDUCE calls referencing safe callables via memo are not counted as dangerous opcodes
- Reset dangerous opcode counters at STOP boundaries so each stream is evaluated independently
- Use first_pickle_end_pos for binary content scanning instead of f.tell(), which advances past tensor data during multi-stream parsing

Validated against 5 HuggingFace models with A/B comparison to main branch — no new false positives introduced.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
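The STOP-boundary counter reset described in this commit can be sketched as follows. The `DANGEROUS` set and the `Reducer` class are illustrative only, not the scanner's actual lists:

```python
import io
import pickle
import pickletools

# Illustrative subset of opcode names treated as dangerous.
DANGEROUS = {"REDUCE", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX"}


def per_stream_counts(blob: bytes):
    """Count dangerous opcodes per stream; the tally resets at each STOP."""
    f = io.BytesIO(blob)
    counts = []
    while f.tell() < len(blob):
        current = 0
        # genops returns after decoding STOP, i.e. at a stream boundary.
        for opcode, arg, pos in pickletools.genops(f):
            if opcode.name in DANGEROUS:
                current += 1
        counts.append(current)
    return counts


class Reducer:
    """Hypothetical object whose pickling emits one REDUCE opcode."""

    def __reduce__(self):
        return (sorted, ([3, 1, 2],))


blob = pickle.dumps([1]) + pickle.dumps(Reducer())
print(per_stream_counts(blob))
```

Evaluating each stream independently is what prevents a long benign first stream from diluting, or inflating, the dangerous-opcode count of a later stream.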
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelaudit/scanners/pickle_scanner.py (1)
1477-1483: ⚠️ Potential issue | 🟡 Minor

Remove `NEWOBJ_EX` from unreachable condition in `is_dangerous_reduce_pattern`

`NEWOBJ_EX` (opcode `0x92`) is stack-based and yields `arg=None` from `pickletools.genops()`. The condition `isinstance(arg, str)` is permanently `False` for it, making this branch unreachable dead code. `NEWOBJ_EX` was added to the list but will never match, creating false coverage. Either remove `NEWOBJ_EX` from the condition or restructure to handle stack-based opcodes separately:

Suggested fix

```diff
-        if opcode.name in ["INST", "OBJ", "NEWOBJ", "NEWOBJ_EX"] and isinstance(arg, str):
+        if opcode.name == "INST" and isinstance(arg, str):
             return {
                 "pattern": f"{opcode.name}_EXECUTION",
                 "argument": arg,
                 "position": pos,
                 "opcode": opcode.name,
             }
```

If stack-based opcodes (`OBJ`, `NEWOBJ`, `NEWOBJ_EX`) need detection, handle them with separate logic that inspects the stack state instead of `arg`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelaudit/scanners/pickle_scanner.py` around lines 1477 - 1483, The check in is_dangerous_reduce_pattern currently includes "NEWOBJ_EX" in the list tested alongside "INST","OBJ","NEWOBJ" but also requires isinstance(arg, str), which is always False for stack-based NEWOBJ_EX (arg is None) so the branch is unreachable; remove "NEWOBJ_EX" from that list to eliminate dead code, or alternatively add separate logic in is_dangerous_reduce_pattern to detect stack-based opcodes (like OBJ/NEWOBJ/NEWOBJ_EX) by inspecting the stack state rather than relying on arg being a str—update the opcode membership or add a distinct branch that handles stack-based opcode detection accordingly.
♻️ Duplicate comments (1)
modelaudit/scanners/pickle_scanner.py (1)
106-115: ⚠️ Potential issue | 🟠 Major

`_MAX_RESYNC_BYTES = 256` cap still allows padding-based evasion

The 256-byte hard stop on resync was flagged in the previous review (not yet addressed). An attacker can prefix a second malicious pickle stream with >256 bytes of padding, causing the scanner to return before reaching the payload.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelaudit/scanners/pickle_scanner.py` around lines 106 - 115, The current resync loop uses a fixed _MAX_RESYNC_BYTES cap which allows evasion by padding; change the resync logic in the block that references had_opcodes, resync_skipped, file_obj.seek and file_obj.read to stop based on actual remaining file length instead of an arbitrary 256 bytes: compute remaining = (file_size - stream_start) (use file_obj.seek(0,2) then tell() and restore position) and either remove the hard cap or set max_resync = min(_MAX_RESYNC_BYTES, remaining) so the loop continues until EOF (or remaining bytes exhausted) rather than returning after 256 bytes, ensuring the scanner will find a later pickle stream even if padded.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 362-405: ALWAYS_DANGEROUS_MODULES was widened to include
"logging", "functools", and "types" which causes CRITICAL false positives
because _is_actually_dangerous_global treats any reference as dangerous unless
ML_SAFE_GLOBALS contains explicit allowlists; update the scanner by either (a)
removing these three module names from ALWAYS_DANGEROUS_MODULES and instead
adding only their truly dangerous callables to ALWAYS_DANGEROUS_FUNCTIONS, or
(b) add explicit safe entries for the common benign symbols (e.g.,
logging.Logger, logging.Formatter, functools.partial, types.SimpleNamespace,
types.MappingProxyType) into ML_SAFE_GLOBALS so _is_actually_dangerous_global
will not escalate legitimate uses; modify the definitions of
ALWAYS_DANGEROUS_MODULES, ALWAYS_DANGEROUS_FUNCTIONS, and ML_SAFE_GLOBALS in
pickle_scanner.py accordingly and run existing unit tests and benign/malicious
serialization samples to verify no regressions.
- Around line 1539-1553: The loop that iterates "for i, (opcode, arg, pos) in
enumerate(opcodes)" resets counters on opcode.name == "STOP" but does not clear
the pickle memo map _safe_memo, which allows safe flags to leak across streams;
update the STOP handler to clear or reinitialize _safe_memo (e.g.,
_safe_memo.clear() or _safe_memo = {}) so subsequent streams cannot inherit
earlier memoized safe globals referenced by the BINGET → REDUCE path that checks
_safe_memo.get(...).
- Around line 319-321: Remove the stale "uuid._popen" entry from the pickle
scanner's blocklist and leave only "uuid._get_command_stdout" (i.e., delete the
"uuid._popen" string from the list where the uuid internal functions are
enumerated) so the blocklist matches Python 3.10+; ensure no other references to
uuid._popen remain in the module such as in PICKLE_BLOCKLIST or similar
variables.
- Around line 319-321: Remove the dead entry "uuid._popen" from the
internal-functions list and leave only valid helpers (e.g.
"uuid._get_command_stdout"); locate the list where these internal UUID functions
are declared (search for the array containing "uuid._get_command_stdout" and
"uuid._popen") and delete the "uuid._popen" string or replace it with a short
comment noting it was removed in Python 3.9 so it will never match on supported
versions 3.10–3.13.
- Around line 2707-2711: The calculation assigning first_pickle_end_pos in the
STOP opcode handler is wrong because pickletools.genops yields absolute file
positions; change the assignment in the branch handling opcode.name == "STOP"
(where first_pickle_end_pos is set) to use pos + 1 instead of start_pos + pos +
1 so first_pickle_end_pos reflects the true end offset; update the STOP branch
that references current_stack_depth, first_pickle_end_pos, start_pos, and pos
accordingly.
---
Outside diff comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 1477-1483: The check in is_dangerous_reduce_pattern currently
includes "NEWOBJ_EX" in the list tested alongside "INST","OBJ","NEWOBJ" but also
requires isinstance(arg, str), which is always False for stack-based NEWOBJ_EX
(arg is None) so the branch is unreachable; remove "NEWOBJ_EX" from that list to
eliminate dead code, or alternatively add separate logic in
is_dangerous_reduce_pattern to detect stack-based opcodes (like
OBJ/NEWOBJ/NEWOBJ_EX) by inspecting the stack state rather than relying on arg
being a str—update the opcode membership or add a distinct branch that handles
stack-based opcode detection accordingly.
---
Duplicate comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 106-115: The current resync loop uses a fixed _MAX_RESYNC_BYTES
cap which allows evasion by padding; change the resync logic in the block that
references had_opcodes, resync_skipped, file_obj.seek and file_obj.read to stop
based on actual remaining file length instead of an arbitrary 256 bytes: compute
remaining = (file_size - stream_start) (use file_obj.seek(0,2) then tell() and
restore position) and either remove the hard cap or set max_resync =
min(_MAX_RESYNC_BYTES, remaining) so the loop continues until EOF (or remaining
bytes exhausted) rather than returning after 256 bytes, ensuring the scanner
will find a later pickle stream even if padded.
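The uuid gadget flagged in the inline comments above can be shown with a hand-written protocol-0 pickle, inspected statically and never unpickled. The `"hostname"` argument is illustrative:

```python
import pickletools

# Hand-written protocol-0 pickle that would call
# uuid._get_command_stdout("hostname") if it were ever unpickled.
payload = (
    b"cuuid\n_get_command_stdout\n"  # c = GLOBAL uuid._get_command_stdout
    b"(S'hostname'\nt"               # MARK, STRING argument, TUPLE
    b"R"                             # REDUCE: calls the resolved global
    b"."                             # STOP
)

globals_seen = [
    str(arg)
    for opcode, arg, pos in pickletools.genops(payload)
    if opcode.name == "GLOBAL"
]
print(globals_seen)
```

Because `_get_command_stdout` internally shells out via `subprocess.Popen`, a blocklist that only watches `subprocess`/`os` globals never sees the dangerous name; the scanner has to block the `uuid` helper itself, which is what this PR adds.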
📒 Files selected for processing (1)
modelaudit/scanners/pickle_scanner.py
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Merge main's _build_symbolic_reference_maps for REDUCE/memo resolution with our multi-stream buffering, STOP-boundary resets, and expanded blocklist entries. Both sets of dangerous functions/modules are combined. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
♻️ Duplicate comments (2)
modelaudit/scanners/pickle_scanner.py (2)
91-115: ⚠️ Potential issue | 🔴 Critical

Multi-stream resync still has bypass paths.

At Line 91 and Line 113, scanning stops on (a) a partial candidate stream and (b) 256 separator bytes. Both allow hiding a later malicious stream behind decoy bytes/padding.

🔧 Proposed hardening

```diff
-        if stream_error and had_opcodes:
-            # Partial stream: binary data was misinterpreted as opcodes.
-            # Discard the buffer and stop — no more valid streams.
-            return
+        if stream_error:
+            # Failed candidate stream; resync instead of terminating.
+            had_opcodes = False
 ...
-        if resync_skipped >= _MAX_RESYNC_BYTES:
-            return
+        if resync_skipped >= _MAX_RESYNC_BYTES:
+            # Fast-forward probe for next protocol header to avoid fixed-window bypass.
+            probe_start = file_obj.tell()
+            probe = file_obj.read(64 * 1024)
+            candidate = next(
+                (i for i in range(len(probe) - 1) if probe[i] == 0x80 and probe[i + 1] in (2, 3, 4, 5)),
+                -1,
+            )
+            if candidate < 0:
+                return
+            file_obj.seek(probe_start + candidate, 0)
+            resync_skipped = 0
```

Based on learnings: "Preserve or strengthen security detections; test both benign and malicious samples when modifying scanners".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelaudit/scanners/pickle_scanner.py` around lines 91 - 115, The multi-stream resync currently aborts scanning on a partial candidate (when stream_error and had_opcodes) and when resync_skipped reaches _MAX_RESYNC_BYTES, which allows later malicious streams to be skipped; change the logic in the loop handling multi_stream so that: (1) when stream_error and had_opcodes occur do not return early — discard the buffer, advance the file pointer by one byte (use file_obj.seek(stream_start, 0) then file_obj.read(1) or equivalent) and increment resync_skipped, then continue scanning; (2) when resync_skipped >= _MAX_RESYNC_BYTES do not return immediately but reset resync_skipped (or escalate detection) and continue scanning from the next byte; update uses of stream_error, had_opcodes, stream_start, resync_skipped, and _MAX_RESYNC_BYTES accordingly so parsing proceeds byte-by-byte until EOF while still yielding valid buffered opcodes where appropriate (yield from buffered).
3271-3272: ⚠️ Potential issue | 🟡 Minor

`first_pickle_end_pos` can be over-offset when scanning from a non-zero position.

At Line 3272, `start_pos + pos + 1` may double-count `start_pos` (if `pos` is already absolute), which can shift downstream binary-scan boundaries.

🔧 Proposed fix

```diff
-        if first_pickle_end_pos is None:
-            first_pickle_end_pos = start_pos + pos + 1
+        if first_pickle_end_pos is None:
+            first_pickle_end_pos = pos + 1
```

In CPython, when `pickletools.genops(file_obj)` is called on a seeked file object (non-zero `tell()`), is the yielded `pos` absolute file offset or relative stream offset?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelaudit/scanners/pickle_scanner.py` around lines 3271 - 3272, The code incorrectly assumes `pos` from pickletools.genops is always relative and sets first_pickle_end_pos = start_pos + pos + 1, which can double-count when `pos` is already an absolute file offset; update the assignment in the scan loop (where `first_pickle_end_pos` is set) to normalize `pos`: if `pos` is >= `start_pos` treat `pos` as absolute and set `first_pickle_end_pos = pos + 1`, otherwise treat it as relative and set `first_pickle_end_pos = start_pos + pos + 1`; reference the usage around pickletools.genops/file seek handling and the variables `first_pickle_end_pos`, `start_pos`, and `pos` to locate the fix.
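The embedded question can be answered empirically: CPython's `pickletools` takes `pos` from the file object's `tell()` whenever one is available, so the yielded offsets are absolute file positions. A quick probe:

```python
import io
import pickle
import pickletools

first = pickle.dumps("a", protocol=2)
second = pickle.dumps("b", protocol=2)
f = io.BytesIO(first + second)

f.seek(len(first))  # start scanning at the second stream
positions = [pos for opcode, arg, pos in pickletools.genops(f)]
# pickletools reads pos via file_obj.tell(), so offsets are absolute
# file positions, not relative to where iteration began.
print(positions[0])
```

This supports the proposed fix: since `pos` is already absolute for file objects, `first_pickle_end_pos = pos + 1` is the correct boundary and adding `start_pos` double-counts the offset.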
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 91-115: The multi-stream resync currently aborts scanning on a
partial candidate (when stream_error and had_opcodes) and when resync_skipped
reaches _MAX_RESYNC_BYTES, which allows later malicious streams to be skipped;
change the logic in the loop handling multi_stream so that: (1) when
stream_error and had_opcodes occur do not return early — discard the buffer,
advance the file pointer by one byte (use file_obj.seek(stream_start, 0) then
file_obj.read(1) or equivalent) and increment resync_skipped, then continue
scanning; (2) when resync_skipped >= _MAX_RESYNC_BYTES do not return immediately
but reset resync_skipped (or escalate detection) and continue scanning from the
next byte; update uses of stream_error, had_opcodes, stream_start,
resync_skipped, and _MAX_RESYNC_BYTES accordingly so parsing proceeds
byte-by-byte until EOF while still yielding valid buffered opcodes where
appropriate (yield from buffered).
- Around line 3271-3272: The code incorrectly assumes `pos` from
pickletools.genops is always relative and sets first_pickle_end_pos = start_pos
+ pos + 1, which can double-count when `pos` is already an absolute file offset;
update the assignment in the scan loop (where `first_pickle_end_pos` is set) to
normalize `pos`: if `pos` is >= `start_pos` treat `pos` as absolute and set
`first_pickle_end_pos = pos + 1`, otherwise treat it as relative and set
`first_pickle_end_pos = start_pos + pos + 1`; reference the usage around
pickletools.genops/file seek handling and the variables `first_pickle_end_pos`,
`start_pos`, and `pos` to locate the fix.
📒 Files selected for processing (3)
- modelaudit/detectors/suspicious_symbols.py
- modelaudit/scanners/pickle_scanner.py
- tests/scanners/test_pickle_scanner.py
Summary

- Expand ALWAYS_DANGEROUS_MODULES with ~45 new modules not covered by fix(security): close scanner RCE bypasses and add regressions #518 (network, compilation, FFI, debugging, operator bypasses, pickle recursion, filesystem, venv/pip, threading, and more)
- Add uuid as novel detection (VULN-6: _get_command_stdout calls subprocess.Popen) and pkgutil.resolve_name as dynamic resolution trampoline to ALWAYS_DANGEROUS_FUNCTIONS
- Mirror new modules in SUSPICIOUS_GLOBALS for defense-in-depth
- Add NEWOBJ_EX to all 5 opcode check lists where it was missing

Relationship to #518

This PR is designed to land after #518. It avoids duplicating the 10 modules #518 already adds to ALWAYS_DANGEROUS_MODULES (ctypes, cProfile, profile, pdb, timeit, _thread, multiprocessing, zipimport, importlib, runpy). The two PRs are complementary.

Test plan

- TestPickleScannerBlocklistHardening

🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Tests