
fix(deps): promote msgpack to core dependency for Flax scanner #583

Open
yash2998chhabria wants to merge 9 commits into main from fix/flax-dep

Conversation

yash2998chhabria (Contributor) commented Feb 25, 2026

Summary

  • Promotes msgpack>=1.0.0,<2.0 from optional dependency (extras: flax) to a core dependency in pyproject.toml
  • The flax_msgpack scanner was registered but silently fell back to generic scanning when msgpack was not installed, causing .msgpack Flax/JAX model files to miss specialized security analysis
  • msgpack is a lightweight, zero-dependency C extension (~100KB) that is appropriate as a core dependency
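
In pyproject.toml terms, the change looks roughly like the following (a sketch; the surrounding entries and layout are illustrative, not the project's actual file):

```toml
# Before: msgpack only arrived via the optional extra
[project.optional-dependencies]
flax = ["msgpack>=1.0.0,<2.0"]

# After: msgpack is a core dependency, installed unconditionally
[project]
dependencies = [
    # ...existing core dependencies...
    "msgpack>=1.0.0,<2.0",
]
```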

Test plan

  • Verified FlaxMsgpackScanner loads successfully with python -c "from modelaudit.scanners.flax_msgpack_scanner import FlaxMsgpackScanner; print('OK')"
  • Scanned 3 HuggingFace Flax models: sshleifer/tiny-gpt2, hf-internal-testing/tiny-bert-flax-only, patrickvonplaten/t5-tiny-random -- all recognized by flax_msgpack scanner (not generic fallback)
  • Regression tested 2 additional Flax models: ArthurZ/tiny-random-bert-flax-only, sanchit-gandhi/tiny-random-flax-bert -- both scanned correctly
  • Full test suite passes: 377 passed, 0 failed (python -m pytest tests/ -x -q --timeout=60)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Pickle scanner: multi-stream parsing, improved opcode coverage, and propagation of pickle-boundary metadata for downstream scanning.
    • Flax/msgpack scanner: enhanced transformer-model detection including nested-model and name-key checks.
  • Behavior Changes

    • Scanner now flags a much larger set of standard-library and third-party modules/functions as suspicious in pickle contexts.
  • Tests

    • Added regression suite covering crafted pickle payloads, multi-stream cases, and newly flagged modules.
  • Chores

    • Added msgpack dependency.

yash2998chhabria and others added 4 commits February 25, 2026 10:46
Expand ALWAYS_DANGEROUS_MODULES with ~45 new modules not covered by
PR #518: network (smtplib, imaplib, poplib, nntplib, xmlrpc, socketserver,
ssl, requests, aiohttp), compilation (codeop, marshal, types, compileall,
py_compile), FFI (_ctypes), debugging (bdb, trace), operator bypasses
(_operator, functools), pickle recursion (pickle, _pickle, dill,
cloudpickle, joblib), filesystem (tempfile, filecmp, fileinput, glob,
distutils, pydoc, pexpect), venv/pip (venv, ensurepip, pip), threading
(signal, _signal, threading), and more (webbrowser, asyncio, mmap,
select, selectors, logging, syslog, tarfile, zipfile, shelve, sqlite3,
_sqlite3, doctest, idlelib, lib2to3, uuid).

Add uuid as novel detection (VULN-6: _get_command_stdout calls
subprocess.Popen) and pkgutil.resolve_name as dynamic resolution
trampoline.

Mirror new modules in SUSPICIOUS_GLOBALS for defense-in-depth. Fix
multi-stream opcode analysis so detailed checks run across all pickle
streams. Add NEWOBJ_EX to all opcode check lists.

Includes 10 regression tests covering pkgutil trampoline, uuid RCE,
multi-stream exploit, NEWOBJ_EX, and spot-checks for smtplib, sqlite3,
tarfile, marshal, cloudpickle, and webbrowser.
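
For context on how these blocklist entries fire: the scanner statically walks opcodes with pickletools.genops and checks GLOBAL arguments against the module lists, without ever unpickling the payload. A minimal sketch (the payload and the tiny blocklist excerpt are illustrative, not the scanner's actual code):

```python
import pickletools

# Hand-crafted pickle: GLOBAL pushes webbrowser.open, REDUCE would call it.
# pickletools.genops only parses opcodes; nothing here is executed.
payload = b"cwebbrowser\nopen\n(S'https://example.com'\ntR."

ALWAYS_DANGEROUS_MODULES = {"webbrowser", "smtplib", "marshal"}  # tiny excerpt

findings = []
for opcode, arg, pos in pickletools.genops(payload):
    if opcode.name == "GLOBAL":
        # GLOBAL's argument is "module name" joined by a space
        module, _, name = arg.partition(" ")
        if module in ALWAYS_DANGEROUS_MODULES:
            findings.append((module, name, pos))

print(findings)  # [('webbrowser', 'open', 0)]
```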

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e scanning

- Fix test regression: track first pickle STOP position and use it for
  binary content scanning instead of f.tell() (which advances past
  multi-stream data)
- Fix multi-stream resync bypass: skip junk separator bytes between
  streams (up to 256 bytes) instead of returning immediately
- Gracefully handle ValueErrors on subsequent streams (non-pickle data
  after valid pickle) instead of propagating them as file-level errors
- Add type hints: ScanResult return type on _scan_bytes, -> None on all
  test methods
- Add separator-byte resync regression test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Buffer opcodes for subsequent pickle streams and discard partial
  streams (binary tensor data misinterpreted as opcodes) to prevent
  MANY_DANGEROUS_OPCODES false positives
- Add resync logic to skip up to 256 non-pickle separator bytes
  between streams, preventing single-byte bypass of multi-stream
  detection
- Track pickle memo (BINPUT/BINGET) for safe ML globals so REDUCE
  calls referencing safe callables via memo are not counted as
  dangerous opcodes
- Reset dangerous opcode counters at STOP boundaries so each stream
  is evaluated independently
- Use first_pickle_end_pos for binary content scanning instead of
  f.tell() which advances past tensor data during multi-stream parsing
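
The parsing flow above can be sketched as a simplified model (not the scanner's actual code; the 256-byte budget mirrors _MAX_RESYNC_BYTES from this commit):

```python
import io
import pickle
import pickletools

_MAX_RESYNC_BYTES = 256  # resync budget from the commit message

def iter_pickle_streams(data: bytes):
    """Yield (offset, opcode_names) for each pickle stream, skipping up to
    _MAX_RESYNC_BYTES of junk separator bytes between streams."""
    pos, skipped = 0, 0
    while pos < len(data):
        buf = io.BytesIO(data[pos:])
        try:
            # genops parses one stream and stops at its STOP opcode
            names = [op.name for op, arg, p in pickletools.genops(buf)]
        except Exception:
            # Not a valid stream here: skip one separator byte and retry
            pos += 1
            skipped += 1
            if skipped > _MAX_RESYNC_BYTES:
                return
            continue
        yield pos, names
        pos += buf.tell()
        skipped = 0  # reset the budget after a successful parse

# Two pickles separated by a junk byte -- both streams are still found.
blob = pickle.dumps([1, 2]) + b"\xff" + pickle.dumps({"a": 1})
streams = list(iter_pickle_streams(blob))
print(len(streams))  # 2
```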

Validated against 5 HuggingFace models with A/B comparison to main
branch — no new false positives introduced.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The flax_msgpack scanner was registered but failed to load when msgpack
was not installed, silently falling back to generic scanning. Since
msgpack is a lightweight zero-dependency C extension, promote it from
optional (extras: flax) to a core dependency so Flax/JAX .msgpack files
are scanned correctly out of the box.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
coderabbitai bot (Contributor) commented Feb 25, 2026

Walkthrough

Expands SUSPICIOUS_GLOBALS and ALWAYS_DANGEROUS_* lists, adds multi-stream/resync parsing, NEWOBJ_EX handling, and first_pickle_end_pos propagation in the pickle scanner; adds tests for blocklist and multi-stream cases; updates msgpack dependency; improves flax/msgpack model detection.

Changes

Cohort / File(s) Summary
Suspicious symbols mapping
modelaudit/detectors/suspicious_symbols.py
Expanded SUSPICIOUS_GLOBALS with many new module/function entries (networking, execution, FFI, threading, packaging, tooling, stdlib trampolines).
Pickle scanner logic
modelaudit/scanners/pickle_scanner.py
Added multi-stream parsing/resync, _MAX_RESYNC_BYTES, buffering of subsequent streams, tracking/propagation of first_pickle_end_pos, NEWOBJ_EX handling in opcode checks, and broadened ALWAYS_DANGEROUS_FUNCTIONS/ALWAYS_DANGEROUS_MODULES.
Pickle scanner tests
tests/scanners/test_pickle_scanner.py
New test class validating GLOBAL+REDUCE payload detection, NEWOBJ_EX cases, multi-stream benign→malicious flows, resync with separator bytes, and multiple module blocklist checks.
Flax/msgpack scanner
modelaudit/scanners/flax_msgpack_scanner.py
Improved transformer-model detection by inspecting nested dict keys and top-level names containing common model-name tokens when transformer keys are absent.
Dependencies
pyproject.toml
Added msgpack>=1.0.0,<2.0 constraint.

Sequence Diagram(s)

sequenceDiagram
    participant FileObj as FileObj
    participant Parser as PickleParser
    participant Buffer as StreamBuffer
    participant Detector as OpcodeDetector
    participant Binary as BinaryScanner

    FileObj->>Parser: read bytes
    Parser->>Parser: attempt parse stream
    alt first stream completes (STOP)
        Parser->>Detector: yield ops for analysis
        Parser->>Parser: set first_pickle_end_pos
    else subsequent stream
        Parser->>Buffer: buffer opcodes
        Buffer->>Parser: validate buffered stream
        alt buffered parse succeeds
            Buffer->>Detector: yield ops for analysis
        else within _MAX_RESYNC_BYTES
            Parser->>FileObj: skip separator bytes (resync)
            Parser->>Parser: continue parsing
        end
    end
    Detector->>Binary: attach first_pickle_end_pos to metadata
    Binary->>Detector: report binary-scan findings

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

"A rabbit nibble, then a hop of cheer—
Pickles parsed, strange modules drawn near.
Streams resync, buffers keep tight,
NEWOBJ_EX gleams in the night.
🐇✨"

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check: ✅ Passed. The title accurately describes the main change: promoting msgpack from an optional dependency to a core dependency for the Flax scanner, which is the primary objective of this PR.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 95.45%, which is sufficient; the required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 52-54: The scanner currently terminates multi-stream scanning when
a later stream yields at least one opcode before an error; instead, in the
scanning loop that uses resync_skipped and _MAX_RESYNC_BYTES you must remove the
early return on partial parse and implement resync logic: when a stream errors
after yielding opcodes, do not return — increment resync_skipped, advance the
input by one byte and attempt to resync/continue scanning this stream (and
subsequent streams) until a match or until resync_skipped exceeds a sensible
limit; do not treat a single opcode+junk separator as final. Replace the
immediate return with a continue/resync flow, reset resync_skipped on successful
parse, and revise the hard-coded 256-byte ceiling (_MAX_RESYNC_BYTES) to either
be higher or configurable so simple padding cannot evade detection; add tests
for benign and malicious samples to verify detection still fires after padding.

ℹ️ Review info

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e736ebb and b1ff559.

📒 Files selected for processing (4)
  • modelaudit/detectors/suspicious_symbols.py
  • modelaudit/scanners/pickle_scanner.py
  • pyproject.toml
  • tests/scanners/test_pickle_scanner.py

yash2998chhabria and others added 3 commits February 25, 2026 12:34
Some HuggingFace Flax models (e.g. bert-for-token-classification) use a
nested structure where model weights are wrapped under a model-name key
(e.g. "bert", "classifier") rather than placing transformer keys like
"encoder" or "embeddings" at the top level. This caused an INFO-level
false positive: "Suspicious data structure - does not match known ML
model patterns".

Add detection for nested model structures by checking if top-level keys
contain transformer sub-keys, and add a set of common HuggingFace model
name keys to recognize known architectures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflicts in modelaudit/scanners/pickle_scanner.py:
- ALWAYS_DANGEROUS_GLOBALS: keep both branch additions (pkgutil.resolve_name,
  uuid internals) and main additions (cProfile/profile/pdb/timeit, ctypes)
- ALWAYS_DANGEROUS_MODULES: merge branch's comprehensive module blocklist with
  main's PR #518 additions (ctypes, profiling, thread/process, module loading)
  and preserve main's linecache/logging.config exclusion note
- REDUCE opcode handler: adopt main's resolved_callables approach with branch's
  _safe_memo fallback for BINGET memo patterns

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 52-54: Document the security trade-off for _MAX_RESYNC_BYTES in
pickle_scanner.py: update the comment near the _MAX_RESYNC_BYTES and
resync_skipped definitions to explain that the 256-byte limit prevents skipping
arbitrarily large gaps between pickle streams but can be evaded by inserting
>=256 bytes of non-pickle data before a malicious stream, note the rationale for
stopping on partial opcode sequences to avoid misinterpreting binary tensor
data, and recommend making _MAX_RESYNC_BYTES configurable (and covered by tests)
if callers need a different performance/security trade-off.
- Around line 409-411: The scan currently lists "functools" in
ALWAYS_DANGEROUS_MODULES which will produce false positives for legitimate ML
usages; update ML_SAFE_GLOBALS to whitelist safe functools members (e.g.,
partial, reduce, lru_cache, wraps) and adjust any checks that treat the entire
module as dangerous so they allow these specific attributes, or add a clarifying
comment (similar to the existing logging/linecache rationale) explaining why the
module is treated specially; reference ALWAYS_DANGEROUS_MODULES and
ML_SAFE_GLOBALS in your change so the scanner accepts common safe usages like
functools.partial or functools.lru_cache while still flagging dangerous patterns
that use functools to wrap unsafe callables.

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between b1ff559 and c3b8150.

📒 Files selected for processing (4)
  • modelaudit/detectors/suspicious_symbols.py
  • modelaudit/scanners/flax_msgpack_scanner.py
  • modelaudit/scanners/pickle_scanner.py
  • tests/scanners/test_pickle_scanner.py

Comment on lines 52 to 54
# Maximum number of consecutive non-pickle bytes to skip when resyncing
_MAX_RESYNC_BYTES = 256
resync_skipped = 0

🧹 Nitpick | 🔵 Trivial

Consider documenting the security trade-off for _MAX_RESYNC_BYTES.

The 256-byte resync limit prevents the scanner from skipping arbitrarily large amounts of data between pickle streams. However, this limit could potentially be bypassed by inserting 256+ bytes of non-pickle data before a malicious stream. While increasing this limit has performance implications, documenting this trade-off would help future maintainers understand the security model.

The current behavior (stopping after encountering a partial stream with opcodes) is a reasonable defense against misinterpreting binary tensor data as pickle opcodes, but it creates a potential evasion vector for multi-stream attacks.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelaudit/scanners/pickle_scanner.py` around lines 52 - 54, Document the
security trade-off for _MAX_RESYNC_BYTES in pickle_scanner.py: update the
comment near the _MAX_RESYNC_BYTES and resync_skipped definitions to explain
that the 256-byte limit prevents skipping arbitrarily large gaps between pickle
streams but can be evaded by inserting >=256 bytes of non-pickle data before a
malicious stream, note the rationale for stopping on partial opcode sequences to
avoid misinterpreting binary tensor data, and recommend making _MAX_RESYNC_BYTES
configurable (and covered by tests) if callers need a different
performance/security trade-off.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelaudit/scanners/pickle_scanner.py (1)

2132-2138: ⚠️ Potential issue | 🟠 Major

Split INST from OBJ/NEWOBJ/NEWOBJ_EX handling—the current condition disables the latter three.

In pickletools.genops(), INST yields a string argument (module and class name), but OBJ, NEWOBJ, and NEWOBJ_EX have no inline arguments and yield arg=None. The isinstance(arg, str) guard prevents these three opcodes from ever triggering the detection pattern, making their addition ineffective.

Suggested fix
-        if opcode.name in ["INST", "OBJ", "NEWOBJ", "NEWOBJ_EX"] and isinstance(arg, str):
-            return {
-                "pattern": f"{opcode.name}_EXECUTION",
-                "argument": arg,
-                "position": pos,
-                "opcode": opcode.name,
-            }
+        if opcode.name == "INST" and isinstance(arg, str):
+            return {
+                "pattern": f"{opcode.name}_EXECUTION",
+                "argument": arg,
+                "position": pos,
+                "opcode": opcode.name,
+            }
+
+        if opcode.name in {"OBJ", "NEWOBJ", "NEWOBJ_EX"}:
+            return {
+                "pattern": f"{opcode.name}_EXECUTION",
+                "argument": arg,
+                "position": pos,
+                "opcode": opcode.name,
+            }

This violates the guideline: "Preserve or strengthen security detections; test both benign and malicious samples when modifying scanners."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelaudit/scanners/pickle_scanner.py` around lines 2132 - 2138, The current
check conflates INST with OBJ/NEWOBJ/NEWOBJ_EX by requiring isinstance(arg,
str), which blocks OBJ/NEWOBJ/NEWOBJ_EX because they yield arg=None; update the
logic in pickle_scanner.py (around the opcode handling using opcode.name and
arg, e.g., in the code that references pickletools.genops and opcode.name) to
split cases: handle opcode.name == "INST" with the isinstance(arg, str) guard
and emit the INST_EXECUTION pattern including the string argument, and add a
separate branch for opcode.name in ["OBJ","NEWOBJ","NEWOBJ_EX"] that emits a
distinct pattern (e.g., OBJ_EXECUTION or NEWOBJ_EXECUTION) using position and
opcode.name while not requiring arg to be a string so those opcodes are detected
correctly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/scanners/flax_msgpack_scanner.py`:
- Around line 726-737: The loop that checks top-level keys (using found_keys,
obj, transformer_keys, model_name_keys, has_transformer_keys) can call
key.lower() on non-string keys and raise AttributeError; guard against that by
first verifying the key is a string (e.g., if isinstance(key, str)) before
calling key.lower(), and only perform the model-name membership checks
(key.lower() in model_name_keys and any(mk in key.lower() for mk in
model_name_keys)) when the key is a string so the scanner won't crash while
preserving the same transformer-key detection logic.

---

Outside diff comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 2132-2138: The current check conflates INST with
OBJ/NEWOBJ/NEWOBJ_EX by requiring isinstance(arg, str), which blocks
OBJ/NEWOBJ/NEWOBJ_EX because they yield arg=None; update the logic in
pickle_scanner.py (around the opcode handling using opcode.name and arg, e.g.,
in the code that references pickletools.genops and opcode.name) to split cases:
handle opcode.name == "INST" with the isinstance(arg, str) guard and emit the
INST_EXECUTION pattern including the string argument, and add a separate branch
for opcode.name in ["OBJ","NEWOBJ","NEWOBJ_EX"] that emits a distinct pattern
(e.g., OBJ_EXECUTION or NEWOBJ_EXECUTION) using position and opcode.name while
not requiring arg to be a string so those opcodes are detected correctly.

ℹ️ Review info

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c3b8150 and 1869d08.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
  • modelaudit/scanners/flax_msgpack_scanner.py
  • modelaudit/scanners/pickle_scanner.py

- Guard key.lower() for non-string msgpack keys to prevent
  AttributeError that could abort the scan (evasion vector)
- Add functools safe functions to ML_SAFE_GLOBALS to whitelist
  common ML utilities like partial, reduce, lru_cache
- Increase _MAX_RESYNC_BYTES from 256 to 8192 so binary tensor
  data between pickle streams doesn't cause early scan termination
- Continue scanning on partial stream errors instead of returning
  early, preventing evasion via hidden payloads after binary blobs
- Clear _safe_memo at STOP boundaries so safe callables from one
  pickle stream cannot whitelist dangerous REDUCEs in the next
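
Because pickletools.genops stops after a single stream, a scanner loops over it once per stream; the reset described in the last bullet lands at that loop boundary. A simplified sketch (state names taken from the commit message, logic assumed):

```python
import io
import pickle
import pickletools

def scan_streams(data: bytes):
    """Return opcode-name lists, one per concatenated pickle stream."""
    buf = io.BytesIO(data)
    streams = []
    while buf.tell() < len(data):
        names = [op.name for op, arg, pos in pickletools.genops(buf)]
        streams.append(names)
        # STOP boundary: per-stream state such as the safe-callable memo
        # (_safe_memo) and dangerous-opcode counters would be cleared here,
        # so one stream's safe entries cannot whitelist the next stream's
        # REDUCE calls.
    return streams

blob = pickle.dumps("a") + pickle.dumps("b")
out = scan_streams(blob)
print(len(out))  # 2
```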

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/scanners/flax_msgpack_scanner.py`:
- Around line 695-740: The substring check using model_name_keys (especially
short tokens like "opt" or "score") is too permissive; update the logic in the
block that computes has_transformer_keys (references: model_name_keys,
has_transformer_keys, transformer_keys, key_lower, found_keys, obj.get) to use
stricter token/boundary matching instead of any(mk in key_lower for mk in
model_name_keys): normalize key_lower and match whole tokens (e.g., split on
non-alphanumerics or use regex word boundaries) and explicitly check both exact
matches and token-level membership for entries in model_name_keys, and
add/update unit tests to include adversarial keys (e.g., "optional_score",
"desktop") and true positives to ensure the heuristic no longer false-positives
benign payloads.

In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 95-100: When the branch with "if stream_error and had_opcodes" is
taken inside the PickleScanner scanning loop, increment the resync budget by
adding the number of bytes consumed by the discarded buffer to resync_skipped
and abort (raise/stop) when resync_skipped exceeds _MAX_RESYNC_BYTES;
specifically, compute the consumed byte count for the current partial stream
(e.g., bytes between the previous sync point and current file position or
len(buffer) being discarded), add that value to resync_skipped, and enforce the
existing _MAX_RESYNC_BYTES check before continuing so a crafted binary region
cannot indefinitely bypass the resync limit (update references in the scanning
loop where stream_error, had_opcodes, and resync_skipped are used).

ℹ️ Review info

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 1869d08 and 6fa1a0e.

📒 Files selected for processing (2)
  • modelaudit/scanners/flax_msgpack_scanner.py
  • modelaudit/scanners/pickle_scanner.py

Comment on lines +695 to +740
# Common HuggingFace model name keys that wrap transformer substructure
model_name_keys = {
"bert",
"roberta",
"distilbert",
"albert",
"electra",
"xlm",
"gpt2",
"gpt_neo",
"gpt_neox",
"gptj",
"opt",
"llama",
"t5",
"bart",
"pegasus",
"mbart",
"blenderbot",
"vit",
"clip",
"whisper",
"wav2vec2",
"flax_model",
"classifier",
"qa_outputs",
"lm_head",
"score",
}
has_transformer_keys = any(key in found_keys for key in transformer_keys)

# Check if any top-level key contains transformer sub-keys (nested model structure)
if not has_transformer_keys:
for key in found_keys:
value = obj.get(key)
if isinstance(value, dict):
sub_keys = set(value.keys())
if sub_keys & transformer_keys:
has_transformer_keys = True
break
# Also check if the top-level key is a known model name
# Guard against non-string keys (msgpack allows int/bytes keys)
key_lower = key.lower() if isinstance(key, str) else str(key).lower()
if key_lower in model_name_keys or any(mk in key_lower for mk in model_name_keys):
has_transformer_keys = True
break

⚠️ Potential issue | 🟠 Major

model_name_keys substring matching is too permissive and can weaken structural detection.

At Line 738, any(mk in key_lower for mk in model_name_keys) plus short keys like opt/score can classify unrelated payloads as transformer checkpoints, which then short-circuits deeper structure analysis. Please switch to token/boundary matching and add negative/adversarial tests for this heuristic.

🔧 Suggested tightening
-                    key_lower = key.lower() if isinstance(key, str) else str(key).lower()
-                    if key_lower in model_name_keys or any(mk in key_lower for mk in model_name_keys):
+                    key_lower = key.lower() if isinstance(key, str) else str(key).lower()
+                    key_tokens = {tok for tok in re.split(r"[^a-z0-9]+", key_lower) if tok}
+                    if key_lower in model_name_keys or bool(key_tokens & model_name_keys):
                         has_transformer_keys = True
                         break

As per coding guidelines "Preserve or strengthen security detections; test both benign and malicious samples when modifying scanners".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelaudit/scanners/flax_msgpack_scanner.py` around lines 695 - 740, The
substring check using model_name_keys (especially short tokens like "opt" or
"score") is too permissive; update the logic in the block that computes
has_transformer_keys (references: model_name_keys, has_transformer_keys,
transformer_keys, key_lower, found_keys, obj.get) to use stricter token/boundary
matching instead of any(mk in key_lower for mk in model_name_keys): normalize
key_lower and match whole tokens (e.g., split on non-alphanumerics or use regex
word boundaries) and explicitly check both exact matches and token-level
membership for entries in model_name_keys, and add/update unit tests to include
adversarial keys (e.g., "optional_score", "desktop") and true positives to
ensure the heuristic no longer false-positives benign payloads.

Comment on lines +95 to +100
if stream_error and had_opcodes:
# Partial stream: binary data was misinterpreted as opcodes.
# Discard the buffer but keep scanning — a malicious payload
# could be hiding after the binary blob. Resync from the
# current file position to find the next valid stream.
continue

⚠️ Potential issue | 🟠 Major

Partial-stream recovery bypasses the resync budget.

When stream_error and had_opcodes is hit, the code continues without increasing resync_skipped. A crafted binary region can repeatedly trigger this path and force excessive scanning work. Count consumed bytes in this branch and enforce _MAX_RESYNC_BYTES.

🔧 Suggested guard
             if stream_error and had_opcodes:
                 # Partial stream: binary data was misinterpreted as opcodes.
                 # Discard the buffer but keep scanning — a malicious payload
                 # could be hiding after the binary blob.  Resync from the
                 # current file position to find the next valid stream.
+                advanced = file_obj.tell() - stream_start
+                if advanced <= 0:
+                    file_obj.seek(stream_start, 0)
+                    if not file_obj.read(1):
+                        return
+                    advanced = 1
+                resync_skipped += advanced
+                if resync_skipped >= _MAX_RESYNC_BYTES:
+                    return
                 continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelaudit/scanners/pickle_scanner.py` around lines 95 - 100, When the branch
with "if stream_error and had_opcodes" is taken inside the PickleScanner
scanning loop, increment the resync budget by adding the number of bytes
consumed by the discarded buffer to resync_skipped and abort (raise/stop) when
resync_skipped exceeds _MAX_RESYNC_BYTES; specifically, compute the consumed
byte count for the current partial stream (e.g., bytes between the previous sync
point and current file position or len(buffer) being discarded), add that value
to resync_skipped, and enforce the existing _MAX_RESYNC_BYTES check before
continuing so a crafted binary region cannot indefinitely bypass the resync
limit (update references in the scanning loop where stream_error, had_opcodes,
and resync_skipped are used).
