Symptom
Despite the README promising "every Claude Code session is automatically traced to a Databricks MLflow experiment — zero configuration required", no traces are written. The Stop hook fires, but the upstream MLflow handler returns before processing the transcript.
Root cause
setup_mlflow.py:36 writes the wrong value into ~/.claude/settings.json:
settings["env"]["MLFLOW_CLAUDE_TRACING_ENABLED"] = "false"
This is the env var the upstream mlflow.claude_code package uses as its master enable/disable switch (mlflow-skinny 3.11.1):
mlflow/claude_code/config.py:24 — MLFLOW_TRACING_ENABLED = "MLFLOW_CLAUDE_TRACING_ENABLED"
mlflow/claude_code/tracing.py:128 — is_tracing_enabled() returns True only when the value is in ("true","1","yes")
mlflow/claude_code/hooks.py:199 — stop_hook_handler() early-returns if not is_tracing_enabled():
def stop_hook_handler() -> None:
    if not is_tracing_enabled():
        response = get_hook_response()
        print(json.dumps(response))
        return
    # ... the transcript-processing path is below this guard
So with "false", the hook prints an empty response and exits — no trace, no error, no log line.
README disagrees with code
The README documents the correct value but the code writes the opposite:
| Variable | Value | Description |
| --- | --- | --- |
| MLFLOW_CLAUDE_TRACING_ENABLED | true | Enables Claude Code tracing |
(README.md, "Tracing is configured during app startup" section, line 123.)
Tests lock in the broken behavior
tests/test_mlflow_tracing.py asserts the wrong value, so the test passes against a non-functional setup:
# line 68 and line 143
assert settings["env"]["MLFLOW_CLAUDE_TRACING_ENABLED"] == "false"
A green test suite here is misleading — it certifies the bug.
Repro
- Deploy the app (make deploy PROFILE=<profile>).
- Open the app, start a Claude Code session, run a few prompts.
- Type exit to fire the Stop hook.
- Open the MLflow experiment at /Users/{app_owner}/coding-agents.
Expected: A trace per session, with prompts/tool calls/outputs visible.
Actual: Experiment is empty (or contains only traces from before this regression landed).
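A quick programmatic check for the last repro step (a sketch; assumes the mlflow client is installed locally with Databricks auth configured, and {app_owner} is a placeholder to fill in):

```python
import mlflow

mlflow.set_tracking_uri("databricks")

# get_experiment_by_name returns None when the experiment was never created,
# which is what the short-circuited Stop hook leaves behind: no writes, no experiment.
exp = mlflow.get_experiment_by_name("/Users/{app_owner}/coding-agents")
print(exp.experiment_id if exp else "experiment does not exist")
```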
Suggested fix
- settings["env"]["MLFLOW_CLAUDE_TRACING_ENABLED"] = "false"
+ settings["env"]["MLFLOW_CLAUDE_TRACING_ENABLED"] = "true"
Plus update the two test assertions in tests/test_mlflow_tracing.py (lines 68 and 143).
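If the flag is flipped, the assertions flip with it (sketch; the surrounding test bodies are unchanged):

```python
# tests/test_mlflow_tracing.py, lines 68 and 143
assert settings["env"]["MLFLOW_CLAUDE_TRACING_ENABLED"] == "true"
```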
Notes
- The OTEL endpoint override on the next line (OTEL_EXPORTER_OTLP_ENDPOINT = "") is unrelated and looks correct — it disables the container's OTLP collector so MLflow uses its native exporter.
- git blame setup_mlflow.py would show when "false" was introduced and whether it predates the upstream gating change in mlflow-skinny 3.11.x — worth a glance to decide whether to backport the fix to older releases.
Update — root cause was intentional
git log on setup_mlflow.py shows the disable was intentional. Commit b8a06c9 (2026-03-28) is titled "fix: always create fresh session on page load, disable MLflow tracing by default" and explicitly flips the flag from "true" to "false".
So this is deliberate-disable + stale-docs, not a stray flag. Two ways to resolve:
- (A) Re-enable — flip back to "true", verify whatever motivated b8a06c9 is gone (the commit message bundled "fresh session on page load" with the disable, so the underlying issue may have been addressed alongside).
- (B) Keep disabled, update docs — leave "false", rewrite the README "Tracing is configured during app startup" section to say tracing is opt-in, restore the env-driven override that existed in earlier history (commit 66ab612, sketched below), and rename test_tracing_enabled to match.
Either way the README, the test name, and the flag value need to align with whichever direction is chosen.
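For option (B), the restored override could look roughly like this (a sketch only; the actual shape in commit 66ab612 may differ, and reusing the same variable name as the operator-facing override is an assumption):

```python
import os

# Opt-in tracing: honour an operator-supplied env var, defaulting to disabled.
# `settings` is the dict setup_mlflow.py writes to ~/.claude/settings.json.
requested = os.environ.get("MLFLOW_CLAUDE_TRACING_ENABLED", "false")
settings["env"]["MLFLOW_CLAUDE_TRACING_ENABLED"] = (
    "true" if requested.lower() in ("true", "1", "yes") else "false"
)
```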
Empirical confirmation (2026-05-06, v0.18.1 on daveok)
Deployed databrickslabs/main (v0.18.1, commit dc14bf8) to a fresh Databricks Apps workspace, completed setup, and verified:
1. The flag is "false" in the running container. Read directly from the deployed ~/.claude/settings.json:
FLAG= false EXP= /Users/<owner>/coding-agents OTEL= ''
2. The MLflow experiment doesn't exist. Querying GET /api/2.0/mlflow/experiments/get-by-name?experiment_name=/Users/<owner>/coding-agents returns:
{"error_code": "RESOURCE_DOES_NOT_EXIST",
"message": "Node /Users/<owner>/coding-agents does not exist."}
So nothing is being written. MLflow auto-creates experiments on first write — since the Stop hook short-circuits before any write, no experiment ever materialises. The user has no breadcrumb to follow when they go looking for the traces the README promised them.
Adjacent finding: misleading log line
setup_mlflow.py:62 prints "MLflow tracing enabled: experiment={...}" even though it just wrote MLFLOW_CLAUDE_TRACING_ENABLED="false" two lines earlier. The startup log says "enabled" — the user has every reason to believe tracing is working. This makes the silent-failure mode worse, not better.
Whichever direction is chosen for the master fix, this print message should match reality (or be removed).
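A minimal sketch of an honest version (the experiment variable name is assumed; only the settings dict appears in the quoted code):

```python
enabled = settings["env"]["MLFLOW_CLAUDE_TRACING_ENABLED"] == "true"
state = "enabled" if enabled else "disabled"
# experiment_path is a stand-in for whatever setup_mlflow.py actually formats in.
print(f"MLflow tracing {state}: experiment={experiment_path}")
```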