Add security hardening, fuzzing, and release automation #34

Merged
Neverdecel merged 2 commits into master from claude/kind-albattani-lrt3vf on Jun 16, 2026

@Neverdecel

Owner

Summary

This PR adds comprehensive security hardening, continuous fuzzing infrastructure, and automated release workflows to CodeRAG. It introduces keyless Sigstore signing for releases, Python SAST analysis via bandit, fuzzing of the source chunker via Atheris, and ClusterFuzzLite/OSS-Fuzz integration.

Key Changes

Security & Release Automation:

  • Release workflow (release.yml): Automated build, sign (keyless Sigstore), and publish of sdist + wheel artifacts on version tags with OpenSSF Scorecard compliance
  • Bandit SAST job (security.yml): Added Python-focused static security analysis gating on MEDIUM+ severity findings
  • Least privilege permissions: Refactored docker-beta.yml and codeql.yml to apply write scopes only at the job level, not workflow-wide (OpenSSF Token-Permissions compliance)

Fuzzing Infrastructure:

  • Atheris fuzzer (fuzz/fuzz_chunk_file.py): Continuous fuzzing harness for chunk_file() (the most exposed parser), with bounded runs on PRs and longer weekly bursts; checks that chunking never crashes and that output chunks satisfy structural invariants
  • Fuzz workflow (fuzz.yml): Triggers on chunker-related changes or weekly schedule; runs 50k iterations on PRs, 600s on schedule
  • ClusterFuzzLite/OSS-Fuzz integration (.clusterfuzzlite/): Dockerfile and build script for integration with continuous fuzzing platforms
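
As a rough illustration of the harness shape described above, here is a minimal sketch. It is not the contents of fuzz/fuzz_chunk_file.py: the `chunk_file` below is a hypothetical line-window stand-in for CodeRAG's chunker (the real harness would import it, per the PR, under `atheris.instrument_imports()`), and the chunk fields and invariants are assumptions chosen for the example.

```python
import sys


def chunk_file(path: str, text: str, window: int = 40) -> list[dict]:
    """Hypothetical stand-in for CodeRAG's chunker: fixed-size line windows."""
    lines = text.splitlines()
    return [
        {
            "start": i + 1,
            "end": min(i + window, len(lines)),
            "text": "\n".join(lines[i:i + window]),
        }
        for i in range(0, len(lines), window)
    ]


def check_invariants(text: str, chunks: list[dict]) -> None:
    """Assumed invariants: spans are in bounds, ascending, and non-overlapping."""
    total = len(text.splitlines())
    previous_end = 0
    for chunk in chunks:
        assert 1 <= chunk["start"] <= chunk["end"] <= total
        assert chunk["start"] > previous_end
        previous_end = chunk["end"]


def TestOneInput(data: bytes) -> None:
    """Fuzz entry point: any uncaught exception is reported as a crash."""
    text = data.decode("utf-8", errors="replace")
    check_invariants(text, chunk_file("fuzz_input.ts", text))


def main() -> None:
    # Imported lazily so the helpers above run without Atheris installed.
    import atheris

    atheris.Setup(sys.argv, TestOneInput)
    atheris.Fuzz()
```

Calling `main()` under Atheris starts the fuzz loop. The PR/scheduled bounds (50k iterations, 600s) would presumably map to libFuzzer's `-runs=` and `-max_total_time=` flags, though the workflow's exact arguments are not shown here.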

Housekeeping:

  • Renamed LICENSE-2.0.txt → LICENSE and updated all references (pyproject.toml, Dockerfile, README.md)
  • Added # nosec B104 annotation in http_api.py for a false-positive security finding (host classification, not socket binding)
  • Fixed SQL query formatting in sqlite_store.py (line continuation)
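
For context on the B104 annotation: Bandit's B104 check (hardcoded bind-all-interfaces) fires on the string literal `0.0.0.0` even when it is only compared against. The sketch below shows how a host classifier could trigger it; `is_public_host` is a hypothetical name and body, not the code in http_api.py.

```python
import ipaddress


def is_public_host(host: str) -> bool:
    """Hypothetical classifier: True if `host` is reachable beyond loopback."""
    # The literal is only compared against, never bound to a socket,
    # so the B104 finding on this line is a false positive.
    if host in ("0.0.0.0", "::"):  # nosec B104
        return True
    if host == "localhost":
        return False
    try:
        return not ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal: treat other hostnames as potentially public.
        return True
```

Scoping the suppression to `B104` rather than using a bare `# nosec` keeps other Bandit checks active on the same line.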

Notable Details

  • Fuzzer uses atheris.instrument_imports() to enable coverage-guided fuzzing of CodeRAG's chunking logic
  • Release artifacts include .sigstore bundles for provenance verification via sigstore verify
  • Bandit runs with -ll (MEDIUM+ only) to reduce noise while catching real issues
  • All GitHub Actions use pinned commit SHAs for supply-chain security
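
The SHA-pinning claim can be spot-checked mechanically. The following is a small sketch, not part of this PR: `unpinned_actions` is a hypothetical helper that scans workflow text for `uses:` references whose ref is not a full 40-character commit SHA.

```python
import re

# Matches `uses: owner/repo[/path]@ref`, with or without a leading list dash.
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*(\S+?)@(\S+)")
SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return `action@ref` strings whose ref is not a full commit SHA."""
    found = []
    for line in workflow_text.splitlines():
        match = USES_RE.match(line)
        if match and not SHA_RE.match(match.group(2)):
            found.append(f"{match.group(1)}@{match.group(2)}")
    return found
```

Local actions (`uses: ./path`) carry no `@ref` and are skipped; `docker://` references pinned by digest would be reported and need separate handling.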

https://claude.ai/code/session_01VgY3wMWzuBw6QFNivhXZYy

claude added 2 commits June 16, 2026 19:13
…zzing, signed releases)

Address the low-scoring OpenSSF Scorecard checks (overall 6.5 at 075f9e0):

Token-Permissions (0 -> 10):
- docker-beta.yml: drop top-level write scopes to `contents: read`; move
  packages/id-token/attestations write to the build job only.
- codeql.yml: declare a top-level `permissions: contents: read`.

Fuzzing (0):
- Add an Atheris fuzz target for the source chunker (the most exposed parser),
  plus ClusterFuzzLite config (.clusterfuzzlite/) and a bounded fuzz CI workflow.

Signed-Releases:
- Add release.yml: on a v* tag, build sdist+wheel, keyless-sign each with
  Sigstore (OIDC), and publish a GitHub Release with the .sigstore bundles.

SAST (8):
- Add a bandit job to security.yml; annotate two confirmed false positives
  (parameterized SQL IN-clause; a public-host classifier) with `# nosec`.

License (9 -> 10):
- Rename LICENSE-2.0.txt -> LICENSE (standard name) and update references in
  pyproject.toml, README.md, Dockerfile.

https://claude.ai/code/session_01VgY3wMWzuBw6QFNivhXZYy
…t hang indexing

The Atheris fuzz target (added in this PR) found a ~180-byte TypeScript input
(mostly newlines with a few stray tokens) that drove the tree-sitter grammar's
GLR error recovery into super-linear time: a single parse ran for minutes and
ballooned RSS past 2 GB. Indexing arbitrary repos must never let one
hostile/garbled file hang or OOM the indexer.

Fix: set a per-parse time budget (timeout_micros, 2s) on the cached tree-sitter
parsers. A parse that blows the budget raises, and chunk_file falls back to line
windows — its existing graceful-degradation contract. Real source parses in
single-digit milliseconds, so the guard never trips on legitimate code.
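
The budget-then-fallback pattern described above can be sketched as follows. This is an illustration under assumptions, not the patched chunker: `chunk_with_budget`, `line_windows`, and `slow_parse` are hypothetical names, and the stand-in parser checks a deadline cooperatively where the real code relies on the tree-sitter parser's `timeout_micros` setting.

```python
import time


class ParseBudgetExceeded(Exception):
    """Raised by the stand-in parser when a parse overruns its budget."""


def line_windows(text: str, size: int = 40) -> list[str]:
    """Fallback chunking: fixed-size windows of lines."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


def chunk_with_budget(text, parse, budget_s=2.0):
    """Try the structured parse; on any parser failure, fall back to windows."""
    deadline = time.monotonic() + budget_s
    try:
        return parse(text, deadline)
    except Exception:
        return line_windows(text)


def slow_parse(text, deadline):
    """Stand-in for pathological error recovery: spins until the deadline."""
    while True:
        if time.monotonic() > deadline:
            raise ParseBudgetExceeded
```

With `slow_parse` and a small `budget_s`, `chunk_with_budget` returns line windows instead of hanging; a parser that returns promptly has its result passed through unchanged.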

Regression tests in tests/test_chunking.py: assert every tree-sitter parser
carries the budget, and that the exact fuzzer-found input degrades to windows
(SIGALRM-bounded so a future regression fails fast instead of hanging).
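
One way to bound a test with SIGALRM is sketched below. `time_limit` is a hypothetical helper, not the code in tests/test_chunking.py; it is Unix-only and must run in the main thread.

```python
import signal
import time
from contextlib import contextmanager


@contextmanager
def time_limit(seconds: float):
    """Raise TimeoutError if the body runs longer than `seconds`."""
    def on_alarm(signum, frame):
        raise TimeoutError(f"exceeded {seconds}s")

    previous = signal.signal(signal.SIGALRM, on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel the pending alarm
        signal.signal(signal.SIGALRM, previous)
```

Python-level signal handlers run only when the interpreter regains control, so a guard around a long native call may need to leave SIGALRM at its default action (process termination) to fail fast.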

https://claude.ai/code/session_01VgY3wMWzuBw6QFNivhXZYy
@Neverdecel Neverdecel merged commit 0144a0a into master Jun 16, 2026
13 checks passed
@Neverdecel Neverdecel deleted the claude/kind-albattani-lrt3vf branch June 18, 2026 08:01
