feat: v11.2 — AST clone detection + debug artifact rules (140 commands, 102 MCP tools)

CosmoHac · claude · CosmoHac · commit 9023ed76922d · 2026-02-27T20:13:09.000+02:00
New `roam clones` command detects Type-2 structural clones via tree-sitter
AST subtree hashing with Jaccard similarity scoring and Union-Find clustering.
9 debug artifact rules (COR-560–568) catch leftover print/breakpoint/debugger
statements in Python, JS, TS, and Java.
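
For reviewers unfamiliar with the technique, the pipeline named above
(subtree hashing, Jaccard scoring, Union-Find clustering) can be sketched
roughly as follows. This is an illustrative sketch only, not the code in
`clone_detect.py`: it swaps tree-sitter for Python's stdlib `ast` module,
and every name in it (`subtree_hashes`, `jaccard`, `cluster_clones`,
`SAMPLE`) is hypothetical. It also assumes function names are unique
within the input.

```python
from __future__ import annotations

import ast
import hashlib
from itertools import combinations


def subtree_hashes(func: ast.AST) -> set[str]:
    """Hash every subtree by node type only, so identifiers and literals are ignored."""
    hashes: set[str] = set()

    def visit(node: ast.AST) -> str:
        children = [visit(child) for child in ast.iter_child_nodes(node)]
        shape = type(node).__name__ + "(" + ",".join(children) + ")"
        digest = hashlib.sha1(shape.encode()).hexdigest()
        hashes.add(digest)
        return digest

    visit(func)
    return hashes


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def _find(parent: dict[str, str], x: str) -> str:
    """Union-Find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def cluster_clones(source: str, threshold: float = 0.70) -> list[list[str]]:
    """Group functions whose pairwise Jaccard similarity meets the threshold."""
    funcs = [
        node
        for node in ast.walk(ast.parse(source))
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    hashes = {f.name: subtree_hashes(f) for f in funcs}
    parent = {name: name for name in hashes}
    for a, b in combinations(hashes, 2):
        if jaccard(hashes[a], hashes[b]) >= threshold:
            parent[_find(parent, a)] = _find(parent, b)  # union the two sets
    groups: dict[str, list[str]] = {}
    for name in hashes:
        groups.setdefault(_find(parent, name), []).append(name)
    # Only groups with at least two members count as clone clusters.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


SAMPLE = '''
def total_price(items):
    result = 0
    for item in items:
        if item > 10:
            result += item
    return result

def sum_scores(scores):
    acc = 1
    for s in scores:
        if s > 50:
            acc += s
    return acc

def greet(name):
    return "hello " + name
'''

if __name__ == "__main__":
    print(cluster_clones(SAMPLE))  # [['sum_scores', 'total_price']]
```

The two accumulator functions differ only in identifiers and literals, so
their hash sets are identical and they land in one cluster; `greet` shares
only a few leaf shapes with them and stays below the 0.70 default.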

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
diff --git a/README.md b/README.md
@@ -4,7 +4,7 @@
 
 **The architectural intelligence layer for AI coding agents. Structural graph, architecture governance, multi-agent orchestration, vulnerability mapping, runtime analysis -- one CLI, zero API keys.**
 
-*139 commands · 101 MCP tools · 27 languages · 100% local*
+*140 commands · 102 MCP tools · 27 languages · 100% local*
 
 [![PyPI version](https://img.shields.io/pypi/v/roam-code?style=flat-square&color=blue)](https://pypi.org/project/roam-code/)
 [![GitHub stars](https://img.shields.io/github/stars/Cranot/roam-code?style=flat-square)](https://github.com/Cranot/roam-code/stargazers)
@@ -64,14 +64,19 @@ $ roam diff                    # blast radius of uncommitted changes
 
 ## What's New in v11
 
+### v11.2 -- AST Clone Detection + Debug Artifact Rules
+- **`roam clones`**: New AST structural clone detection via subtree hashing. Finds Type-2 clones (identical control flow, different identifiers/literals) with Jaccard similarity scoring, Union-Find clustering, and automated refactoring suggestions. More precise than the metric-based `duplicates` command.
+- **9 debug artifact rules** (COR-560 through COR-568): Detect leftover `print()`, `breakpoint()`, `pdb.set_trace()`, `console.log()`, `debugger`, and `System.out.println()` in Python, JavaScript, TypeScript, and Java code. All use `ast_match` type with test file exemptions.
+- **140 commands, 102 MCP tools**.
+
 ### v11.1.2 -- SQL + Scala Tier 1, 27 Languages
 - **SQL DDL promoted to Tier 1** with dedicated `SqlExtractor` -- tables, columns, views, functions, triggers, schemas, types (enums), sequences, ALTER TABLE ADD COLUMN. Foreign keys produce graph edges; views and triggers reference source tables. Database-schema projects now work with `roam health`, `roam layers`, `roam impact`, `roam coupling` and all graph commands.
 - **Scala promoted to Tier 1** with dedicated `ScalaExtractor` -- classes, traits, objects, case classes, sealed hierarchies, val/var properties, type aliases, imports, and inheritance. Full `extends` + `with` trait mixin resolution.
 - **27 languages** with 16 dedicated Tier 1 extractors.
 - `server.json` for official MCP Registry submission.
 
 ### v11.1.1 -- Command Quality Audit
-- **Full command audit**: all 139 commands reviewed for usefulness, duplicates, and test coverage. ~20 bugs fixed, 21 new test files (700+ tests), every command docstring updated with cross-references to related commands.
+- **Full command audit**: all 140 commands reviewed for usefulness, duplicates, and test coverage. ~20 bugs fixed, 21 new test files (700+ tests), every command docstring updated with cross-references to related commands.
 - **Kotlin promoted to Tier 1** via new YAML-based declarative extractor architecture. Classes, interfaces, enums, objects, functions, methods, properties, and inheritance fully extracted.
 - **7 new commands**: `roam congestion`, `roam adrs`, `roam flag-dead`, `roam test-scaffold`, `roam sbom`, `roam triage`, `roam ci-setup`.
 - **CI templates**: `roam ci-setup` generates pipelines for GitHub Actions, GitLab CI, Azure Pipelines, Jenkins, and Bitbucket.
@@ -226,7 +231,7 @@ roam health
 
 ## Commands
 
-The [5 core commands](#core-commands) shown above cover ~80% of agent workflows. All 139 commands are organized into 7 categories.
+The [5 core commands](#core-commands) shown above cover ~80% of agent workflows. All 140 commands are organized into 7 categories.
 
 <details>
 <summary><strong>Full command reference</strong></summary>
@@ -484,6 +489,7 @@ The sentinel pair `<!-- roam:minimap -->` / `<!-- /roam:minimap -->` is replaced
 | `roam safe-zones` | Graph-based containment boundaries |
 | `roam coverage-gaps` | Unprotected entry points with no path to gate symbols |
 | `roam duplicates [--threshold T] [--min-lines N]` | Semantic duplicate detector: functionally equivalent code clusters with divergent edge-case handling |
+| `roam clones [--threshold T] [--min-lines N] [--scope P]` | AST structural clone detection: Type-2 clones via subtree hashing (more precise than `duplicates`) |
 
 ### Exploration
 
@@ -822,7 +828,7 @@ pip install "roam-code[mcp]"
 roam mcp
 ```
 
-101 tools, 10 resources, and 5 prompts are available in the full preset. Most tools are read-only index queries; side-effect tools are explicitly annotated.
+102 tools, 10 resources, and 5 prompts are available in the full preset. Most tools are read-only index queries; side-effect tools are explicitly annotated.
 
 **MCP v2 highlights (v11):**
 - In-process MCP execution (no subprocess shell-out per call)
@@ -889,6 +895,7 @@ Core preset tools: `roam_affected_tests`, `roam_batch_get`, `roam_batch_search`,
 | `roam_bisect_blame` | Architectural git bisect |
 | `roam_doc_intent` | Doc-to-code linking |
 | `roam_cut_analysis` | Minimum graph cut analysis |
+| `roam_clones` | AST structural clone detection (Type-2 clones) |
 | `roam_annotate_symbol` | Attach persistent notes to symbols |
 | `roam_get_annotations` | View stored annotations |
 | `roam_relate` | Show relationship between two symbols |
@@ -1465,8 +1472,8 @@ roam-code/
 ├── action.yml                         # Reusable GitHub Action
 ├── src/roam/
 │   ├── __init__.py                    # Version (from pyproject.toml)
-│   ├── cli.py                         # Click CLI (139 commands)
-│   ├── mcp_server.py                  # MCP server (101 tools, 10 resources, 5 prompts)
+│   ├── cli.py                         # Click CLI (140 commands)
+│   ├── mcp_server.py                  # MCP server (102 tools, 10 resources, 5 prompts)
 │   ├── db/
 │   │   ├── connection.py              # SQLite (WAL, pragmas, batched IN)
 │   │   ├── schema.py                  # Tables, indexes, migrations
@@ -1560,7 +1567,7 @@ Optional: Local semantic ONNX stack (`numpy`, `onnxruntime`, `tokenizers`) via `
 ### Shipped
 
 - [x] MCP v2 agent surface: in-process execution, compound operations, presets, schemas, annotations, and compatibility profiles.
-- [x] Full command and MCP inventory parity in docs: 139 CLI commands and 101 MCP tools.
+- [x] Full command and MCP inventory parity in docs: 140 CLI commands and 102 MCP tools.
 - [x] CI hardening: composite action, changed-only mode, trend-aware gates, sticky PR updater, and SARIF guardrails.
 - [x] Performance foundation: FTS5/BM25 search, O(changed) incremental indexing, DB/index optimizations.
 - [x] Agent governance suite: `vibe-check`, `ai-readiness`, `verify`, `ai-ratio`, `duplicates`, advanced `algo` scoring/SARIF.
diff --git a/llms-install.md b/llms-install.md
@@ -1,7 +1,7 @@
 # Installing roam-code
 
 roam-code provides instant codebase comprehension for AI coding agents.
-139 commands, 101 MCP tools, 27 languages, 100% local, zero API keys.
+140 commands, 102 MCP tools, 27 languages, 100% local, zero API keys.
 
 ## Documentation Hub
 
@@ -78,4 +78,4 @@ Add to your MCP config:
 | `roam context <symbol>` | Files and line ranges to read |
 | `roam diff` | Blast radius of uncommitted changes |
 
-Run `roam --help` for all 139 commands (+3 alias pairs: 142 total entries).
+Run `roam --help` for all 140 commands (+3 alias pairs: 143 total entries).
diff --git a/pyproject.toml b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "roam-code"
-version = "11.1.3"
+version = "11.2.0"
 description = "Instant codebase comprehension for AI coding agents"
 readme = "README.md"
 requires-python = ">=3.9"
diff --git a/rules/community/correctness/COR-560-py-debug-print.yaml b/rules/community/correctness/COR-560-py-debug-print.yaml
@@ -0,0 +1,10 @@
+name: py-debug-print
+description: "Leftover print() call: debug output in production code."
+severity: warning
+type: ast_match
+match:
+  ast: "print($EXPR)"
+  language: python
+  file_glob: "**/*.py"
+exempt:
+  files: ["**/tests/**", "**/test_**"]
diff --git a/rules/community/correctness/COR-561-py-breakpoint.yaml b/rules/community/correctness/COR-561-py-breakpoint.yaml
@@ -0,0 +1,10 @@
+name: py-breakpoint
+description: "Leftover breakpoint() call — debugger entry point in production code."
+severity: error
+type: ast_match
+match:
+  ast: "breakpoint()"
+  language: python
+  file_glob: "**/*.py"
+exempt:
+  files: ["**/tests/**", "**/test_**"]
diff --git a/rules/community/correctness/COR-562-py-pdb-import.yaml b/rules/community/correctness/COR-562-py-pdb-import.yaml
@@ -0,0 +1,10 @@
+name: py-pdb-import
+description: "Leftover pdb import: debugger module in production code."
+severity: error
+type: ast_match
+match:
+  ast: "import pdb"
+  language: python
+  file_glob: "**/*.py"
+exempt:
+  files: ["**/tests/**", "**/test_**"]
diff --git a/rules/community/correctness/COR-563-py-set-trace.yaml b/rules/community/correctness/COR-563-py-set-trace.yaml
@@ -0,0 +1,10 @@
+name: py-set-trace
+description: "Leftover pdb.set_trace() call — debugger in production code."
+severity: error
+type: ast_match
+match:
+  ast: "pdb.set_trace()"
+  language: python
+  file_glob: "**/*.py"
+exempt:
+  files: ["**/tests/**", "**/test_**"]
diff --git a/rules/community/correctness/COR-564-js-console-log.yaml b/rules/community/correctness/COR-564-js-console-log.yaml
@@ -0,0 +1,10 @@
+name: js-console-log
+description: "Leftover console.log() call — debug output in production code."
+severity: warning
+type: ast_match
+match:
+  ast: "console.log($EXPR)"
+  language: javascript
+  file_glob: "**/*.{js,jsx,mjs,cjs}"
+exempt:
+  files: ["**/tests/**", "**/test/**", "**/__tests__/**", "**/*.test.*", "**/*.spec.*"]
diff --git a/rules/community/correctness/COR-565-ts-console-log.yaml b/rules/community/correctness/COR-565-ts-console-log.yaml
@@ -0,0 +1,10 @@
+name: ts-console-log
+description: "Leftover console.log() call — debug output in production code."
+severity: warning
+type: ast_match
+match:
+  ast: "console.log($EXPR)"
+  language: typescript
+  file_glob: "**/*.{ts,tsx,mts,cts}"
+exempt:
+  files: ["**/tests/**", "**/test/**", "**/__tests__/**", "**/*.test.*", "**/*.spec.*"]
diff --git a/rules/community/correctness/COR-566-js-debugger.yaml b/rules/community/correctness/COR-566-js-debugger.yaml
@@ -0,0 +1,10 @@
+name: js-debugger
+description: "Leftover debugger statement — pauses execution in browser devtools."
+severity: error
+type: ast_match
+match:
+  ast: "debugger"
+  language: javascript
+  file_glob: "**/*.{js,jsx,mjs,cjs}"
+exempt:
+  files: ["**/tests/**", "**/test/**", "**/__tests__/**"]
diff --git a/rules/community/correctness/COR-567-ts-debugger.yaml b/rules/community/correctness/COR-567-ts-debugger.yaml
@@ -0,0 +1,10 @@
+name: ts-debugger
+description: "Leftover debugger statement — pauses execution in browser devtools."
+severity: error
+type: ast_match
+match:
+  ast: "debugger"
+  language: typescript
+  file_glob: "**/*.{ts,tsx,mts,cts}"
+exempt:
+  files: ["**/tests/**", "**/test/**", "**/__tests__/**"]
diff --git a/rules/community/correctness/COR-568-java-sysout.yaml b/rules/community/correctness/COR-568-java-sysout.yaml
@@ -0,0 +1,10 @@
+name: java-sysout-debug
+description: "Leftover System.out.println() — use a logger instead."
+severity: warning
+type: ast_match
+match:
+  ast: "System.out.println($EXPR)"
+  language: java
+  file_glob: "**/*.java"
+exempt:
+  files: ["**/test/**", "**/tests/**", "**/Test*.java"]
diff --git a/src/roam/cli.py b/src/roam/cli.py
@@ -14,7 +14,7 @@
 
 # Lazy-loading command group: imports command modules only when invoked.
 # This avoids importing networkx (~500ms) on every CLI call.
-# Total: 142 invokable command names (139 canonical commands + 3 alias pairs).
+# Total: 143 invokable command names (140 canonical commands + 3 alias pairs).
 # If this changes, update README.md, CLAUDE.md, llms-install.md, and docs copy.
 _COMMANDS = {
     "index": ("roam.commands.cmd_index", "index"),
@@ -159,6 +159,7 @@
     "sbom": ("roam.commands.cmd_sbom", "sbom"),
     "triage": ("roam.commands.cmd_triage", "triage"),
     "ci-setup": ("roam.commands.cmd_ci_setup", "ci_setup"),
+    "clones": ("roam.commands.cmd_clones", "clones"),
 }
 
 # Command categories for organized --help display
@@ -252,6 +253,7 @@
         "visualize",
         "x-lang",
         "fingerprint",
+        "clones",
     ],
     "Exploration": [
         "search",
diff --git a/src/roam/commands/cmd_clones.py b/src/roam/commands/cmd_clones.py
@@ -0,0 +1,163 @@
+"""Detect near-duplicate code via AST structural hashing.
+
+Unlike ``duplicates`` (which uses metric-based similarity from the DB),
+this command re-parses source files and compares actual AST subtree
+structures.  Detects Type-2 clones: identical control flow with different
+identifiers or literals.
+
+Related commands: ``duplicates`` (metric-based), ``suggest-refactoring``,
+``split`` (extract responsibilities).
+"""
+
+from __future__ import annotations
+
+import click
+
+from roam.commands.resolve import ensure_index
+from roam.db.connection import open_db
+from roam.output.formatter import (
+    abbrev_kind,
+    format_table,
+    json_envelope,
+    loc,
+    to_json,
+)
+
+
+@click.command()
+@click.option(
+    "--threshold",
+    default=0.70,
+    show_default=True,
+    type=click.FloatRange(0.0, 1.0),
+    help="Minimum Jaccard similarity (0.0-1.0)",
+)
+@click.option(
+    "--min-lines",
+    default=5,
+    show_default=True,
+    type=int,
+    help="Skip functions shorter than N lines",
+)
+@click.option("--scope", default=None, type=str, help="Limit to files under this path prefix")
+@click.option("--top", default=0, type=int, help="Show only top N clusters (0=all)")
+@click.pass_context
+def clones(ctx, threshold, min_lines, scope, top):
+    """Detect near-duplicate code via AST structural hashing.
+
+    Re-parses source files and compares function AST structures via subtree
+    hashing.  Finds Type-2 clones: identical control flow with different
+    identifiers or literals.
+
+    Unlike ``duplicates`` (metric-based), this uses actual tree-sitter AST
+    comparison for higher precision.
+    """
+    json_mode = ctx.obj.get("json") if ctx.obj else False
+    token_budget = ctx.obj.get("budget", 0) if ctx.obj else 0
+    ensure_index()
+
+    from roam.graph.clone_detect import detect_clones
+
+    with open_db(readonly=True) as conn:
+        pairs, clusters = detect_clones(
+            conn,
+            min_similarity=threshold,
+            min_lines=min_lines,
+            scope=scope,
+        )
+
+        if top > 0:
+            clusters = clusters[:top]
+
+        # Summary stats
+        total_functions = sum(len(c.members) for c in clusters)
+        total_pairs = len(pairs)
+        avg_sim = sum(c.avg_similarity for c in clusters) / len(clusters) if clusters else 0.0
+
+        # Estimate reducible lines
+        reducible_lines = 0
+        for c in clusters:
+            lines = sorted(m["line_end"] - m["line_start"] + 1 for m in c.members)
+            if len(lines) > 1:
+                reducible_lines += sum(lines[:-1])
+
+        verdict = (
+            f"{len(clusters)} clone cluster{'s' if len(clusters) != 1 else ''} "
+            f"found ({total_functions} functions, {round(avg_sim * 100)}% avg similarity)"
+            if clusters
+            else "No structural clones detected"
+        )
+
+        if json_mode:
+            clusters_json = []
+            for c in clusters:
+                clusters_json.append(
+                    {
+                        "cluster_id": c.cluster_id,
+                        "avg_similarity": c.avg_similarity,
+                        "size": len(c.members),
+                        "members": c.members,
+                        "pattern": c.pattern,
+                        "suggestion": c.suggestion,
+                    }
+                )
+
+            pairs_json = [
+                {
+                    "file_a": p.file_a,
+                    "func_a": p.func_a,
+                    "line_a": p.line_a,
+                    "file_b": p.file_b,
+                    "func_b": p.func_b,
+                    "line_b": p.line_b,
+                    "similarity": p.similarity,
+                }
+                for p in pairs[:50]  # Cap pair output
+            ]
+
+            click.echo(
+                to_json(
+                    json_envelope(
+                        "clones",
+                        summary={
+                            "verdict": verdict,
+                            "clusters": len(clusters),
+                            "clone_pairs": total_pairs,
+                            "total_functions": total_functions,
+                            "avg_similarity": round(avg_sim, 3),
+                            "estimated_reducible_lines": reducible_lines,
+                        },
+                        budget=token_budget,
+                        clusters=clusters_json,
+                        pairs=pairs_json,
+                    )
+                )
+            )
+            return
+
+        # Text output
+        click.echo(f"VERDICT: {verdict}")
+
+        if not clusters:
+            return
+
+        click.echo()
+        for c in clusters:
+            sim_pct = round(c.avg_similarity * 100)
+            click.echo(f"CLUSTER {c.cluster_id} -- {sim_pct}% similarity, {len(c.members)} functions:")
+            for m in c.members:
+                lines = m["line_end"] - m["line_start"] + 1
+                click.echo(
+                    f"  {m['function']:<40s} "
+                    f"{loc(m['file'], m['line_start'])}"
+                    f"  ({lines} lines, {m['ast_nodes']} AST nodes)"
+                )
+            click.echo(f"  Pattern: {c.pattern}")
+            click.echo(f"  Suggestion: {c.suggestion}")
+            click.echo()
+
+        click.echo(
+            f"SUMMARY: {len(clusters)} clusters, "
+            f"{total_functions} functions, "
+            f"~{reducible_lines} lines of reducible duplication"
+        )
diff --git a/src/roam/graph/clone_detect.py b/src/roam/graph/clone_detect.py
diff --git a/src/roam/mcp_server.py b/src/roam/mcp_server.py
diff --git a/tests/test_clones.py b/tests/test_clones.py
diff --git a/tests/test_surface_counts.py b/tests/test_surface_counts.py