diff --git a/README.md b/README.md
index a2a8e6f..c2c9164 100644
--- a/README.md
+++ b/README.md
@@ -25,7 +25,7 @@ Every method can be used in one of two modes, and the distinction runs through t
 
 Each is independent and self-contained — pick the one that matches the problem you care about, and read that directory's `README.md` for the full walkthrough. They are numbered in a recommended order that mirrors the bootcamp progression — conventional numerical methods → LLM Processes → agents → agentic evaluation — but any one stands on its own, so jump straight to the problem you care about.
 
-**Start here → #0 [`getting_started/`](implementations/getting_started/)** — one CPI series, one month ahead. The smallest end-to-end loop: a `Predictor`, a `BacktestSpec` and `EvalSpec`, naive + AutoARIMA baselines, CRPS scoring. The place to learn the evaluation framework before picking a domain below.
+**Start here → #0 [`getting_started/`](implementations/getting_started/)** — one CPI series, one month ahead. The smallest end-to-end loop: a `Predictor`, a `BacktestSpec` and `EvalSpec`, naive + AutoARIMA baselines, CRPS scoring. The place to learn the evaluation framework before picking a domain below. Also includes [`99_repo_concierge.ipynb`](implementations/getting_started/99_repo_concierge.ipynb) — a lite-model repo guide for “how does this codebase work?” questions (`uv run adk run implementations/getting_started/concierge_agent` from the repo root).
 
 | #   | Implementation                                                       | The problem                                                                     | Concepts & techniques it demonstrates                                                                                                                                                                                                                                                                       |
 | --- | -------------------------------------------------------------------- | ------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
diff --git a/implementations/README.md b/implementations/README.md
index bf89a80..cbe7a51 100644
--- a/implementations/README.md
+++ b/implementations/README.md
@@ -32,6 +32,8 @@ YAML backtest and eval specs live under each use case in `specs/`. Each director
 
 Every domain use case (all except `getting_started`) also ships a `starter_agent/` module and a `99_starter_agent.ipynb` — a fresh, hackable **starter agent** that is the consistent "build your own" entry point for that use case (toggleable news search + code execution, two lightweight tool-usage skills, an interactive cell, and one scored forecast).
 
+`getting_started/` additionally ships a **`concierge_agent/`** module and **`99_repo_concierge.ipynb`** — a repo onboarding helper (not a forecaster) that answers questions about how the codebase works from a committed catalog snapshot of public `main`. From the repository root: `uv run adk run implementations/getting_started/concierge_agent`. See [`getting_started/README.md`](getting_started/README.md) and the notebook for full usage.
+
 ---
 
 ## Relationship to `aieng-forecasting`
diff --git a/implementations/getting_started/99_repo_concierge.ipynb b/implementations/getting_started/99_repo_concierge.ipynb
new file mode 100644
index 0000000..8691ae4
--- /dev/null
+++ b/implementations/getting_started/99_repo_concierge.ipynb
@@ -0,0 +1,241 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "# Repo Concierge — ask questions about this codebase\n",
+        "\n",
+        "> **Note:** This agent uses a snapshot of the public `main` branch (not your local\n",
+        "> uncommitted changes or `data/` cache). Like any LLM, it can be wrong — verify\n",
+        "> important details against the repo or ask a facilitator.\n",
+        "\n",
+        "**Not sure how something works? Start here.**\n",
+        "\n",
+        "The repo concierge helps you **find your way** — it answers questions, points you\n",
+        "to the right notebooks and modules, and can quote short snippets so you know\n",
+        "where to dig deeper. Example questions:\n",
+        "\n",
+        "- *How do I create a new data service?*\n",
+        "- *How do I customize the way context is presented to an LLMP?*\n",
+        "- *What's the difference between `backtest()` and `evaluate()`?*\n",
+        "\n",
+        "It searches a committed **catalog** of the codebase (`search_repo_catalog` →\n",
+        "`fetch_repo_artifact`) covering the full `aieng/forecasting` library, the reference\n",
+        "implementations, and the notebooks (markdown + code cells). Each domain's\n",
+        "`99_starter_agent.ipynb` is for building forecasters; this notebook is your map of the repo.\n",
+        "\n",
+        "Live cells are gated by `RUN_AGENT` so `Run All` is safe and free; set it to `True`\n",
+        "to call the model."
+      ],
+      "id": "cell-00"
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "import warnings\n",
+        "from pathlib import Path\n",
+        "\n",
+        "from IPython.display import Markdown, display  # noqa: A004\n",
+        "\n",
+        "\n",
+        "warnings.filterwarnings(\"ignore\")\n",
+        "\n",
+        "from dotenv import load_dotenv\n",
+        "\n",
+        "\n",
+        "def find_repo_root(start: Path | None = None) -> Path:\n",
+        "    \"\"\"Walk upward until we find the workspace root.\"\"\"\n",
+        "    here = (start or Path.cwd()).resolve()\n",
+        "    for cand in (here, *here.parents):\n",
+        "        if (cand / \"pyproject.toml\").exists() and (cand / \"aieng-forecasting\").is_dir():\n",
+        "            return cand\n",
+        "    return Path.cwd().resolve().parents[1]\n",
+        "\n",
+        "\n",
+        "ROOT = find_repo_root()\n",
+        "load_dotenv(ROOT / \".env\", override=False)\n",
+        "\n",
+        "# ── Model selection ───────────────────────────────────\n",
+        "# Concierge uses the lite/default model only.\n",
+        "AGENT_MODEL = \"gemini-3.1-flash-lite-preview\"\n",
+        "\n",
+        "# ── Run guard ──────────────────────────────────────\n",
+        "RUN_AGENT = False  # set to True to call the model\n",
+        "\n",
+        "from getting_started.concierge_agent import build_concierge_config\n",
+        "\n",
+        "\n",
+        "print(\"RUN_AGENT =\", RUN_AGENT, \"| model =\", AGENT_MODEL)"
+      ],
+      "execution_count": null,
+      "outputs": [],
+      "id": "cell-01"
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "---\n",
+        "## 1. Meet the concierge\n",
+        "\n",
+        "The agent uses a **catalog + artifacts** knowledge pack shipped under `concierge_agent/context/` — no build step for participants.\n",
+        "\n",
+        "1. **`search_repo_catalog`** — search metadata (paths, summaries, domains); cheap, run first.\n",
+        "2. **`fetch_repo_artifact`** — fetch full content for a catalog path (Python modules, READMEs, notebooks with **markdown + code cells**).\n",
+        "\n",
+        "Maintainers regenerate the pack from public `main` with `scripts/build_concierge_context.py` when library code or notebooks change. The `repo-navigation` skill has reference guides (no scripts)."
+      ],
+      "id": "cell-02"
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "config = build_concierge_config(model=AGENT_MODEL)\n",
+        "\n",
+        "print(\"Agent:\", config.name)\n",
+        "print(\"Search enabled:    \", config.context_retrieval.enabled)\n",
+        "print(\"Code-exec enabled: \", config.code_execution.enabled)\n",
+        "print(\"Skills loaded:     \", [p.name for p in config.skills_dirs])\n",
+        "print(\"Extra tools:       \", [getattr(t, \"__name__\", repr(t)) for t in config.extra_tools])\n",
+        "display(Markdown(\"### System instruction\\n\\n*Edit in `concierge_agent/agent.py`*\"))\n",
+        "display(Markdown(config.instruction))"
+      ],
+      "execution_count": null,
+      "outputs": [],
+      "id": "cell-03"
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "---\n",
+        "## 2. Try a seed question\n",
+        "\n",
+        "Edit `QUESTION` below, or jump to the next section for a multi-turn conversation."
+      ],
+      "id": "cell-04"
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "from aieng.forecasting.methods.agentic import build_adk_agent\n",
+        "from aieng.forecasting.methods.agentic.adk_runner import AdkTextRunner, AdkTextRunnerConfig\n",
+        "\n",
+        "\n",
+        "QUESTION = \"How do I create a new data service?\"\n",
+        "\n",
+        "if RUN_AGENT:\n",
+        "    chat_agent = build_adk_agent(config)\n",
+        "    runner = AdkTextRunner(chat_agent, config=AdkTextRunnerConfig(app_name=\"repo_concierge_chat\"))\n",
+        "    reply = await runner.run_text_async(QUESTION)  # noqa: F704, PLE1142\n",
+        "    display(Markdown(reply))\n",
+        "else:\n",
+        "    print(\"RUN_AGENT is False — set it to True in the setup cell to ask the concierge.\")"
+      ],
+      "execution_count": null,
+      "outputs": [],
+      "id": "cell-05"
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "QUESTION = \"How do I customize the way context is presented to an LLMP?\"\n",
+        "\n",
+        "if RUN_AGENT:\n",
+        "    reply = await runner.run_text_async(QUESTION)  # noqa: F704, F821, PLE1142\n",
+        "    display(Markdown(reply))\n",
+        "else:\n",
+        "    print(\"RUN_AGENT is False — set it to True to run this cell.\")"
+      ],
+      "execution_count": null,
+      "outputs": [],
+      "id": "cell-05b"
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "QUESTION = \"What's the difference between backtest() and evaluate()?\"\n",
+        "\n",
+        "if RUN_AGENT:\n",
+        "    reply = await runner.run_text_async(QUESTION)  # noqa: F704, F821, PLE1142\n",
+        "    display(Markdown(reply))\n",
+        "else:\n",
+        "    print(\"RUN_AGENT is False — set it to True to run this cell.\")"
+      ],
+      "execution_count": null,
+      "outputs": [],
+      "id": "cell-05c"
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "QUESTION = \"Where should I go after getting_started if I want to build agents?\"\n",
+        "\n",
+        "if RUN_AGENT:\n",
+        "    reply = await runner.run_text_async(QUESTION)  # noqa: F704, F821, PLE1142\n",
+        "    display(Markdown(reply))\n",
+        "else:\n",
+        "    print(\"RUN_AGENT is False — set it to True to run this cell.\")"
+      ],
+      "execution_count": null,
+      "outputs": [],
+      "id": "cell-05d"
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "---\n",
+        "## 3. Terminal mode — multi-turn conversations\n",
+        "\n",
+        "For extended back-and-forth, use the ADK CLI from the **repository root**:\n",
+        "\n",
+        "```bash\n",
+        "uv run adk run implementations/getting_started/concierge_agent\n",
+        "```\n",
+        "\n",
+        "That loads the same `repo_concierge` agent (`gemini-3.1-flash-lite-preview`) with\n",
+        "`search_repo_catalog`, `fetch_repo_artifact`, and the repo-navigation skill.\n",
+        "\n",
+        "**Alternative:** `uv run adk web implementations/getting_started/concierge_agent`\n",
+        "opens a browser UI (same agent). From `implementations/getting_started/`, you can\n",
+        "also use the shorter `uv run adk run concierge_agent`.\n",
+        "\n",
+        "---\n",
+        "\n",
+        "**Where next?** Forecasting starter agents live in each domain implementation's\n",
+        "`99_starter_agent.ipynb` (food, energy, BoC, S&P 500). This concierge helps you\n",
+        "navigate the repo — open one of those when you're ready to build and score a forecaster."
+      ],
+      "id": "cell-08"
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "display_name": ".venv",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.12.3"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 5
+}
diff --git a/implementations/getting_started/README.md b/implementations/getting_started/README.md
index f548cfb..99e5638 100644
--- a/implementations/getting_started/README.md
+++ b/implementations/getting_started/README.md
@@ -139,6 +139,33 @@ against [`cpi_gasoline_eval_2025.yaml`](specs/cpi_gasoline_eval_2025.yaml)
 — monthly origins from Jan 2025 through Mar 2026, all currently resolved.
 `max_runs: 5` — spend deliberately.
 
+### 6. Ask the repo concierge — `99_repo_concierge.ipynb`
+
+**Questions about how the repository works?** Open
+[`99_repo_concierge.ipynb`](99_repo_concierge.ipynb) — a lite-model **repo
+concierge** that answers onboarding questions, points you to the right notebooks
+and modules, and can quote snippets from the committed public-`main` catalog.
+
+- Notebook cells are gated by `RUN_AGENT` (safe `Run All`).
+- For longer conversations, run the ADK CLI from the **repository root**:
+
+  ```bash
+  uv run adk run implementations/getting_started/concierge_agent
+  ```
+
+  (`uv run adk web implementations/getting_started/concierge_agent` opens the same
+  agent in a browser.)
+
+  From `implementations/getting_started/`, the shorter `uv run adk run concierge_agent`
+  works too.
+
+This is different from each domain's `99_starter_agent.ipynb` — those are
+hackable **forecasting** agents; the concierge only explains the repo.
+
+Maintainers regenerate the catalog with
+`uv run python scripts/build_concierge_context.py` when library code,
+implementations, or notebooks change.
+
 ---
 
 ## Where to go next
@@ -166,9 +193,11 @@ what you're building:
 getting_started/                 # this directory
 ├── README.md
 ├── specs/                       # backtest and eval YAML
+├── concierge_agent/             # repo concierge ADK agent + catalog + artifacts
 ├── 00_environment_check.ipynb   # self-guided setup preflight — run this first
 ├── 01_cpi_data_exploration.ipynb
-└── 02_cpi_backtest_demo.ipynb
+├── 02_cpi_backtest_demo.ipynb
+└── 99_repo_concierge.ipynb      # ask questions about the repo (onboarding helper)
 ```
 
 Reference predictors live in the `aieng-forecasting` package under
diff --git a/implementations/getting_started/__init__.py b/implementations/getting_started/__init__.py
new file mode 100644
index 0000000..13bc933
--- /dev/null
+++ b/implementations/getting_started/__init__.py
@@ -0,0 +1 @@
+"""Getting started reference implementation (notebooks + repo concierge agent)."""
diff --git a/implementations/getting_started/concierge_agent/__init__.py b/implementations/getting_started/concierge_agent/__init__.py
new file mode 100644
index 0000000..c84fffe
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/__init__.py
@@ -0,0 +1,21 @@
+"""Repo concierge agent — onboarding helper for the agentic-forecasting codebase.
+
+Exports the :class:`AgentConfig` factory and the catalog and knowledge tools. Pair
+with ``99_repo_concierge.ipynb`` or ``adk run implementations/getting_started/concierge_agent``
+from the repository root.
+"""
+
+from getting_started.concierge_agent.agent import build_concierge_config
+from getting_started.concierge_agent.catalog import (
+    fetch_repo_artifact,
+    search_repo_catalog,
+)
+from getting_started.concierge_agent.knowledge import search_repo_knowledge
+
+
+__all__ = [
+    "build_concierge_config",
+    "fetch_repo_artifact",
+    "search_repo_catalog",
+    "search_repo_knowledge",
+]
diff --git a/implementations/getting_started/concierge_agent/agent.py b/implementations/getting_started/concierge_agent/agent.py
new file mode 100644
index 0000000..22ac40d
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/agent.py
@@ -0,0 +1,108 @@
+"""Repo concierge agent — onboarding helper for the agentic-forecasting codebase.
+
+A lightweight ADK agent powered by ``LITE_MODEL`` (``gemini-3.1-flash-lite-preview``).
+It answers questions about the repository using a committed **catalog + artifacts**
+snapshot of public ``main`` — not the participant's local workspace.
+
+Pair with ``99_repo_concierge.ipynb`` or ``adk run implementations/getting_started/concierge_agent``
+from the repository root.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+from typing import Any
+
+from aieng.forecasting.methods.agentic import build_adk_agent
+from aieng.forecasting.methods.agentic.agent_factory import (
+    AgentConfig,
+    CodeExecutionConfig,
+    ContextRetrievalConfig,
+)
+from aieng.forecasting.models import LITE_MODEL
+from getting_started.concierge_agent.catalog import fetch_repo_artifact, search_repo_catalog
+
+
+_SKILLS_ROOT = Path(__file__).parent / "skills"
+_REPO_NAV_SKILL = _SKILLS_ROOT / "repo-navigation"
+
+
+def _build_concierge_instruction() -> str:
+    return (
+        "## Role\n\n"
+        "You are the **repo concierge** for the agentic-forecasting bootcamp — a "
+        "friendly guide who helps participants understand the repository and find "
+        "their way to the right notebooks, modules, and patterns.\n\n"
+        "Answer questions clearly. Point people to **concrete paths** in the "
+        "codebase (READMEs, notebooks, specs, library modules) where they can "
+        "read more or try things themselves. When it helps, quote short snippets "
+        "from fetched artifacts — especially from notebooks and reference "
+        "implementations.\n\n"
+        "## How you work\n\n"
+        "- Ground answers in the committed catalog: call "
+        "``search_repo_catalog`` first, then ``fetch_repo_artifact`` for the "
+        "paths you need (usually one to three per question).\n"
+        "- Prefer showing *where* something lives and *how it fits together* "
+        "over long generic explanations.\n"
+        "- If someone is debugging or extending code, walk them through the "
+        "relevant files and patterns you find in the catalog; suggest what to "
+        "open next in their editor.\n"
+        "- Your knowledge reflects the committed public ``main`` snapshot — not "
+        "the participant's local ``.env``, ``data/`` cache, or uncommitted "
+        "changes. If the catalog does not cover something, say so and name the "
+        "best file to open or a facilitator to ask.\n\n"
+        "## Tone\n\n"
+        "- Concise, welcoming, and practical — short paragraphs and bullet lists.\n"
+        "- Always cite paths returned by the catalog.\n"
+    )
+
+
+_CONCIERGE_INSTRUCTION = _build_concierge_instruction()
+
+_SKILLS_SUPPLEMENT = """
+
+## Skills
+
+You have one read-only skill: `repo-navigation` with reference files (catalog guide,
+domain map). Load them via `load_skill_resource` when you need routing hints.
+
+**To use a skill:**
+1. Call `list_skills` → `load_skill` → `load_skill_resource` as needed.
+
+This skill has NO scripts. Do not call `run_skill_script`.
+
+## Repo catalog tools (required workflow)
+
+1. **`search_repo_catalog(query, domain=None, kind=None)`** — search metadata only
+   (paths, summaries, section titles). Use `domain` filters like `core.data`,
+   `core.methods`, `impl.energy_oil_forecasting`, `scripts`, `docs`.
+   Use `kind` filters: `python`, `notebook`, `markdown`, `yaml`.
+2. **`fetch_repo_artifact(path, section=None)`** — fetch full content for one catalog
+   path (optionally one heading/section). Fetch 1–3 artifacts per question.
+
+Do not answer implementation or API questions without fetching the relevant paths.\
+"""
+
+
+def _full_instruction() -> str:
+    return _CONCIERGE_INSTRUCTION + _SKILLS_SUPPLEMENT
+
+
+def build_concierge_config(*, model: str = LITE_MODEL) -> AgentConfig:
+    """Build the repo-concierge :class:`AgentConfig`."""
+    return AgentConfig(
+        name="repo_concierge",
+        model=model,
+        instruction=_full_instruction(),
+        context_retrieval=ContextRetrievalConfig(),
+        code_execution=CodeExecutionConfig(),
+        skills_dirs=[_REPO_NAV_SKILL],
+        extra_tools=[search_repo_catalog, fetch_repo_artifact],
+    )
+
+
+def __getattr__(name: str) -> Any:
+    """Expose ``root_agent`` lazily for schema-free interactive use via ADK CLI."""
+    if name == "root_agent":
+        return build_adk_agent(build_concierge_config())
+    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
diff --git a/implementations/getting_started/concierge_agent/catalog.py b/implementations/getting_started/concierge_agent/catalog.py
new file mode 100644
index 0000000..379040e
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/catalog.py
@@ -0,0 +1,238 @@
+"""Runtime catalog search and artifact fetch for the repo concierge agent."""
+
+from __future__ import annotations
+
+import re
+from dataclasses import dataclass
+from functools import lru_cache
+from pathlib import Path
+from typing import Any
+
+import yaml
+from getting_started.concierge_agent.catalog_build import CatalogEntry
+
+
+_CONTEXT_DIR = Path(__file__).parent / "context"
+_MAX_CATALOG_HITS = 8
+_DEFAULT_FETCH_MAX_CHARS = 6000
+_MIN_SCORE = 1
+_HEADING_RE = re.compile(r"^#{1,4}\s+(.+)$", re.MULTILINE)
+
+
+@dataclass(frozen=True)
+class CatalogHit:
+    """A ranked catalog match (metadata only)."""
+
+    path: str
+    kind: str
+    domain: str
+    summary: str
+    score: int
+    artifact: str
+    sections: list[str]
+
+
+def _entry_from_dict(data: dict[str, Any]) -> CatalogEntry:
+    return CatalogEntry(
+        path=str(data["path"]),
+        kind=str(data.get("kind", "other")),
+        domain=str(data.get("domain", "other")),
+        summary=str(data.get("summary", "")),
+        symbols=[str(s) for s in data.get("symbols", [])],
+        sections=[str(s) for s in data.get("sections", [])],
+        chars=int(data.get("chars", 0)),
+        artifact=str(data["artifact"]),
+    )
+
+
+@lru_cache(maxsize=1)
+def _load_catalog() -> dict[str, Any]:
+    catalog_path = _CONTEXT_DIR / "catalog.yaml"
+    if not catalog_path.is_file():
+        msg = f"Concierge catalog not found: {catalog_path}. Run scripts/build_concierge_context.py"
+        raise FileNotFoundError(msg)
+    with catalog_path.open(encoding="utf-8") as fh:
+        data = yaml.safe_load(fh)
+    if not isinstance(data, dict):
+        msg = f"Invalid catalog format in {catalog_path}"
+        raise ValueError(msg)
+    return data
+
+
+@lru_cache(maxsize=1)
+def _load_entries() -> tuple[CatalogEntry, ...]:
+    catalog = _load_catalog()
+    raw_entries = catalog.get("entries", [])
+    if not isinstance(raw_entries, list):
+        return ()
+    return tuple(_entry_from_dict(item) for item in raw_entries if isinstance(item, dict))
+
+
+def _tokenize(query: str) -> list[str]:
+    return [t for t in re.findall(r"[a-zA-Z0-9_./-]+", query.lower()) if len(t) > 2]
+
+
+def _score_entry(entry: CatalogEntry, terms: list[str], domain: str | None, kind: str | None) -> int:
+    haystack = " ".join(
+        [
+            entry.path,
+            entry.summary,
+            entry.domain,
+            entry.kind,
+            " ".join(entry.symbols),
+            " ".join(entry.sections),
+        ]
+    ).lower()
+    score = sum(haystack.count(term) for term in terms)
+    if score and domain and entry.domain == domain:  # bonus only when a term matched
+        score += 5
+    if score and kind and entry.kind == kind:  # a filter alone must not create a hit
+        score += 3
+    return score
+
+
+def _normalize_domain(domain: str | None) -> str | None:
+    if domain is None:
+        return None
+    return domain.strip().lower()
+
+
+def _normalize_kind(kind: str | None) -> str | None:
+    if kind is None:
+        return None
+    return kind.strip().lower()
+
+
+def search_repo_catalog(
+    query: str,
+    domain: str | None = None,
+    kind: str | None = None,
+) -> str:
+    """Search the committed repo catalog (metadata only).
+
+    Returns matching paths, summaries, and section titles — not file bodies.
+    Follow up with :func:`fetch_repo_artifact` for content.
+    """
+    terms = _tokenize(query)
+    if not terms:
+        return "No search terms found. Try e.g. 'DataService register' or 'energy notebook 02 agentic'."
+
+    domain_filter = _normalize_domain(domain)
+    kind_filter = _normalize_kind(kind)
+    ranked: list[CatalogHit] = []
+    for entry in _load_entries():
+        if domain_filter and entry.domain != domain_filter:
+            continue
+        if kind_filter and entry.kind != kind_filter:
+            continue
+        score = _score_entry(entry, terms, domain_filter, kind_filter)
+        if score >= _MIN_SCORE:
+            ranked.append(
+                CatalogHit(
+                    path=entry.path,
+                    kind=entry.kind,
+                    domain=entry.domain,
+                    summary=entry.summary,
+                    score=score,
+                    artifact=entry.artifact,
+                    sections=entry.sections[:5],
+                )
+            )
+
+    if not ranked:
+        domains = sorted({e.domain for e in _load_entries()})
+        return (
+            f"No catalog matches for query={query!r}"
+            + (f", domain={domain!r}" if domain else "")
+            + (f", kind={kind!r}" if kind else "")
+            + f". Available domains: {', '.join(domains)}."
+        )
+
+    ranked.sort(key=lambda hit: hit.score, reverse=True)
+    top = ranked[:_MAX_CATALOG_HITS]
+
+    lines = [
+        f"# Catalog search: {query}",
+        "",
+        "Metadata only — call `fetch_repo_artifact(path)` for full content.",
+        "",
+    ]
+    for i, hit in enumerate(top, start=1):
+        lines.append(f"## Match {i} (score={hit.score})")
+        lines.append(f"- **path:** `{hit.path}`")
+        lines.append(f"- **kind:** `{hit.kind}` | **domain:** `{hit.domain}`")
+        lines.append(f"- **summary:** {hit.summary}")
+        if hit.sections:
+            lines.append(f"- **sections:** {'; '.join(hit.sections[:3])}")
+        lines.append("")
+    return "\n".join(lines)
+
+
+def _find_entry_by_path(path: str) -> CatalogEntry | None:
+    normalized = path.strip().replace("\\", "/")
+    for entry in _load_entries():
+        if entry.path == normalized:
+            return entry
+    return None
+
+
+def _extract_section(body: str, section: str) -> str | None:
+    needle = section.strip().lower()
+    if not needle:
+        return None
+    parts = re.split(r"\n(?=#{1,4} )", body)
+    for part in parts:
+        heading_match = _HEADING_RE.match(part.strip())
+        if heading_match and needle in heading_match.group(1).lower():
+            return part.strip()
+        if needle in part[:120].lower():
+            return part.strip()
+    return None
+
+
+def fetch_repo_artifact(
+    path: str,
+    section: str | None = None,
+    max_chars: int = _DEFAULT_FETCH_MAX_CHARS,
+) -> str:
+    """Fetch one pre-built artifact by repo-relative ``path``.
+
+    Parameters
+    ----------
+    path : str
+        Repo-relative path as listed in the catalog (e.g.
+        ``aieng-forecasting/aieng/forecasting/data/service.py``).
+    section : str or None
+        Optional heading substring to return one section only.
+    max_chars : int
+        Cap on the returned body; a truncation marker is appended when it is exceeded.
+    """
+    entry = _find_entry_by_path(path)
+    if entry is None:
+        return f"No catalog entry for path={path!r}. Call `search_repo_catalog` first."
+
+    artifact_path = _CONTEXT_DIR / entry.artifact
+    if not artifact_path.is_file():
+        return f"Artifact missing for {path!r}: {entry.artifact}"
+
+    body = artifact_path.read_text(encoding="utf-8")
+    if section:
+        extracted = _extract_section(body, section)
+        body = extracted or (f"(Section {section!r} not found in artifact; showing beginning.)\n\n" + body[:max_chars])
+
+    if len(body) > max_chars:
+        body = body[:max_chars] + "\n…\n"
+    return body
+
+
+def clear_catalog_cache() -> None:
+    """Clear cached catalog reads (for tests)."""
+    _load_catalog.cache_clear()
+    _load_entries.cache_clear()
+
+
+__all__ = [
+    "clear_catalog_cache",
+    "fetch_repo_artifact",
+    "search_repo_catalog",
+]
diff --git a/implementations/getting_started/concierge_agent/catalog_build.py b/implementations/getting_started/concierge_agent/catalog_build.py
new file mode 100644
index 0000000..9b5638f
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/catalog_build.py
@@ -0,0 +1,341 @@
+"""Build the repo concierge catalog and per-source artifacts (maintainer-only)."""
+
+from __future__ import annotations
+
+import ast
+import json
+import re
+import shutil
+import subprocess
+from dataclasses import dataclass
+from datetime import UTC, datetime
+from pathlib import Path
+from typing import Any, Literal
+
+import yaml
+
+
+REPO_URL = "https://github.com/VectorInstitute/agentic-forecasting"
+DEFAULT_BRANCH = "main"
+CORE_PREFIX = "aieng-forecasting/aieng/forecasting"
+
+Kind = Literal["python", "markdown", "notebook", "yaml", "shell"]
+
+_SKIP_IMPL_PARTS = frozenset({"tests", "context", "__pycache__"})
+_HEADING_RE = re.compile(r"^#{1,4}\s+(.+)$", re.MULTILINE)
+
+
+@dataclass(frozen=True)
+class CatalogEntry:
+    """One indexed source file in the public repo snapshot."""
+
+    path: str
+    kind: str
+    domain: str
+    summary: str
+    symbols: list[str]
+    sections: list[str]
+    chars: int
+    artifact: str
+
+
+def repo_root_from_here() -> Path:
+    """Return repository root (parent of ``implementations/``)."""
+    return Path(__file__).resolve().parents[3]
+
+
+def context_dir(repo_root: Path | None = None) -> Path:
+    root = repo_root or repo_root_from_here()
+    return root / "implementations/getting_started/concierge_agent/context"
+
+
+def path_to_artifact_slug(rel_path: str) -> str:
+    return rel_path.replace("/", "__")
+
+
+_DOMAIN_RULES: tuple[tuple[str, str], ...] = (
+    (f"{CORE_PREFIX}/data", "core.data"),
+    (f"{CORE_PREFIX}/evaluation", "core.evaluation"),
+    (f"{CORE_PREFIX}/methods", "core.methods"),
+    (f"{CORE_PREFIX}/documents", "core.documents"),
+    (f"{CORE_PREFIX}/", "core.root"),
+)
+
+
+def infer_domain(rel_path: str) -> str:
+    """Map a repo-relative path to a catalog domain tag."""
+    for prefix, domain in _DOMAIN_RULES:
+        if rel_path.startswith(prefix):
+            return domain
+    if rel_path.startswith("implementations/"):
+        parts = rel_path.split("/")
+        if len(parts) >= 2:
+            return f"impl.{parts[1]}"
+    if rel_path.startswith("scripts/"):
+        return "scripts"
+    if rel_path.startswith(("docs/", "planning-docs/")) or rel_path in {"README.md", "AGENTS.md"}:
+        return "docs"
+    return "other"
+
+
+def infer_kind(rel_path: str) -> Kind:
+    suffix = Path(rel_path).suffix.lower()
+    if suffix == ".py":
+        return "python"
+    if suffix == ".ipynb":
+        return "notebook"
+    if suffix in {".yaml", ".yml"}:
+        return "yaml"
+    if suffix == ".md":
+        return "markdown"
+    return "shell"
+
+
+def _first_paragraph(text: str) -> str:
+    stripped = text.strip()
+    if not stripped:
+        return ""
+    return stripped.split("\n\n")[0].replace("\n", " ").strip()[:240]
+
+
+def _extract_headings(text: str) -> list[str]:
+    return [m.group(1).strip() for m in _HEADING_RE.finditer(text)][:40]
+
+
+def _analyze_python(source: str) -> tuple[str, list[str]]:
+    try:
+        tree = ast.parse(source)
+    except SyntaxError:
+        return "", []
+    summary = _first_paragraph(ast.get_docstring(tree) or "")
+    symbols: list[str] = []
+    for node in tree.body:
+        if isinstance(node, ast.ClassDef):
+            symbols.append(node.name)
+        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
+            if not node.name.startswith("_"):
+                symbols.append(node.name)
+        elif isinstance(node, ast.Assign):
+            for target in node.targets:
+                if (
+                    isinstance(target, ast.Name)
+                    and target.id == "__all__"
+                    and isinstance(node.value, (ast.List, ast.Tuple))
+                ):
+                    for elt in node.value.elts:
+                        if isinstance(elt, ast.Constant) and isinstance(elt.value, str):
+                            symbols.append(elt.value)
+    return summary, symbols[:30]
+
+
+def _notebook_to_markdown(rel_path: str, raw: str) -> tuple[str, str, list[str]]:
+    nb = json.loads(raw)
+    lines = [f"# Source: {rel_path}", "", "kind: notebook", ""]
+    sections: list[str] = []
+    for idx, cell in enumerate(nb.get("cells", []), start=1):
+        cell_type = cell.get("cell_type", "")
+        source = "".join(cell.get("source", []))
+        if not source.strip():
+            continue
+        if cell_type == "markdown":
+            lines.extend([f"## Cell {idx} (markdown)", "", source.rstrip(), ""])
+            first = source.strip().splitlines()[0]  # non-empty: blank cells were skipped above
+            if first.startswith("#"):
+                sections.append(first.lstrip("#").strip())
+        elif cell_type == "code":
+            lines.extend([f"## Cell {idx} (code)", "", "```python", source.rstrip(), "```", ""])
+    body = "\n".join(lines)
+    title = sections[0] if sections else Path(rel_path).stem.replace("_", " ")
+    summary = title[:240]
+    return body, summary, sections[:40]
+
+
+def _markdown_summary_and_sections(body: str, *, fallback: str) -> tuple[str, list[str]]:
+    sections = _extract_headings(body)
+    summary = sections[0] if sections else (_first_paragraph(body) or fallback)
+    return summary[:240], sections
+
+
+def _collect_core_paths(repo_root: Path) -> set[Path]:
+    paths: set[Path] = set()
+    core = repo_root / CORE_PREFIX
+    if core.is_dir():
+        for path in core.rglob("*"):
+            if path.is_file() and path.suffix in {".py", ".md"} and "__pycache__" not in path.parts:
+                paths.add(path)
+    return paths
+
+
+def _collect_impl_paths(repo_root: Path) -> set[Path]:
+    paths: set[Path] = set()
+    impl_root = repo_root / "implementations"
+    if not impl_root.is_dir():
+        return paths
+    for path in impl_root.rglob("*"):
+        if not path.is_file():
+            continue
+        if _SKIP_IMPL_PARTS.intersection(path.parts):
+            continue
+        if "curriculum" in path.parts and "context" in path.parts:
+            continue
+        if path.suffix in {".py", ".md", ".ipynb"} or (
+            path.parent.name == "specs" and path.suffix in {".yaml", ".yml"}
+        ):
+            paths.add(path)
+    return paths
+
+
+def collect_source_paths(repo_root: Path) -> list[Path]:
+    """Collect all concierge-indexed paths under the repo snapshot."""
+    paths = _collect_core_paths(repo_root) | _collect_impl_paths(repo_root)
+
+    for rel in (
+        "README.md",
+        "AGENTS.md",
+        "implementations/README.md",
+        "planning-docs/roadmap.md",
+        "docs/adk-skills-guide.md",
+    ):
+        candidate = repo_root / rel
+        if candidate.is_file():
+            paths.add(candidate)
+
+    scripts = repo_root / "scripts"
+    if scripts.is_dir():
+        for path in scripts.glob("fetch_*.py"):
+            paths.add(path)
+
+    return sorted(paths, key=lambda p: p.relative_to(repo_root).as_posix())
+
+
+def build_entry(repo_root: Path, path: Path) -> tuple[CatalogEntry, str]:
+    rel = path.relative_to(repo_root).as_posix()
+    kind = infer_kind(rel)
+    domain = infer_domain(rel)
+    raw = path.read_text(encoding="utf-8", errors="replace")
+
+    symbols: list[str] = []
+    sections: list[str] = []
+    if kind == "python":
+        summary, symbols = _analyze_python(raw)
+        if not summary:
+            summary = Path(rel).name
+        body = f"# Source: {rel}\n\nkind: python\n\n```python\n{raw.rstrip()}\n```\n"
+    elif kind == "notebook":
+        body, summary, sections = _notebook_to_markdown(rel, raw)
+    elif kind == "markdown":
+        summary, sections = _markdown_summary_and_sections(raw, fallback=Path(rel).name)
+        body = f"# Source: {rel}\n\nkind: markdown\n\n{raw.rstrip()}\n"
+    elif kind == "yaml":
+        summary, sections = _markdown_summary_and_sections(raw, fallback=Path(rel).name)
+        body = f"# Source: {rel}\n\nkind: yaml\n\n```yaml\n{raw.rstrip()}\n```\n"
+    else:
+        summary = Path(rel).name
+        body = f"# Source: {rel}\n\nkind: shell\n\n```\n{raw.rstrip()}\n```\n"
+
+    artifact_rel = f"artifacts/{path_to_artifact_slug(rel)}.md"
+    entry = CatalogEntry(
+        path=rel,
+        kind=kind,
+        domain=domain,
+        summary=summary,
+        symbols=symbols,
+        sections=sections,
+        chars=len(body),
+        artifact=artifact_rel,
+    )
+    return entry, body
+
+
+def git_ref(repo_root: Path) -> str:
+    try:
+        return subprocess.check_output(
+            ["git", "rev-parse", "HEAD"],
+            cwd=repo_root,
+            text=True,
+        ).strip()
+    except (subprocess.CalledProcessError, FileNotFoundError):
+        return "unknown"
+
+
+def build_catalog(repo_root: Path | None = None) -> Path:
+    """Walk the repo, write ``catalog.yaml`` and per-source artifacts."""
+    root = repo_root or repo_root_from_here()
+    out_dir = context_dir(root)
+    artifacts_dir = out_dir / "artifacts"
+    if artifacts_dir.exists():
+        shutil.rmtree(artifacts_dir)
+    artifacts_dir.mkdir(parents=True, exist_ok=True)
+
+    entries: list[CatalogEntry] = []
+    for path in collect_source_paths(root):
+        entry, body = build_entry(root, path)
+        entries.append(entry)
+        artifact_path = out_dir / entry.artifact
+        artifact_path.parent.mkdir(parents=True, exist_ok=True)
+        artifact_path.write_text(body, encoding="utf-8")
+
+    built_at = datetime.now(tz=UTC).replace(microsecond=0).isoformat()
+    catalog = {
+        "source_url": REPO_URL,
+        "git_ref": git_ref(root),
+        "branch": DEFAULT_BRANCH,
+        "built_at": built_at,
+        "ingest_source": str(root),
+        "entry_count": len(entries),
+        "entries": [
+            {
+                "path": e.path,
+                "kind": e.kind,
+                "domain": e.domain,
+                "summary": e.summary,
+                "symbols": e.symbols,
+                "sections": e.sections,
+                "chars": e.chars,
+                "artifact": e.artifact,
+            }
+            for e in entries
+        ],
+    }
+    catalog_path = out_dir / "catalog.yaml"
+    catalog_path.write_text(yaml.safe_dump(catalog, sort_keys=False), encoding="utf-8")
+
+    # Remove legacy topic-blob digests if present.
+    for legacy in (
+        "overview.md",
+        "core_library.md",
+        "methods.md",
+        "implementations.md",
+        "extension_guides.md",
+        "manifest.yaml",
+    ):
+        legacy_path = out_dir / legacy
+        if legacy_path.is_file():
+            legacy_path.unlink()
+
+    _sync_skill_catalog_summary(catalog, root)
+    return out_dir
+
+
+def _sync_skill_catalog_summary(catalog: dict[str, Any], repo_root: Path) -> None:
+    """Write a compact domain summary for the repo-navigation skill."""
+    entries = catalog.get("entries", [])
+    domains: dict[str, int] = {}
+    for entry in entries:
+        if isinstance(entry, dict):
+            domain = str(entry.get("domain", "other"))
+            domains[domain] = domains.get(domain, 0) + 1
+    summary = {
+        "source_url": catalog.get("source_url"),
+        "branch": catalog.get("branch"),
+        "built_at": catalog.get("built_at"),
+        "git_ref": catalog.get("git_ref"),
+        "entry_count": catalog.get("entry_count"),
+        "domains": domains,
+    }
+    out = (
+        repo_root
+        / "implementations/getting_started/concierge_agent/skills/repo-navigation/references/catalog-summary.yaml"
+    )
+    header = "# Concierge catalog summary (regenerated by scripts/build_concierge_context.py)\n"
+    out.write_text(header + yaml.safe_dump(summary, sort_keys=False), encoding="utf-8")
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/AGENTS.md.md b/implementations/getting_started/concierge_agent/context/artifacts/AGENTS.md.md
new file mode 100644
index 0000000..a7d37df
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/AGENTS.md.md
@@ -0,0 +1,67 @@
+# Source: AGENTS.md
+
+kind: markdown
+
+# AGENTS.md
+
+## How to use this file
+
+Instructions here are **general when possible, specific when needed.** Prefer patterns and principles over static lists — static lists go stale. When something is specific (a command, a maintenance contract, a non-obvious convention), it is specific for a reason.
+
+---
+
+## Project documentation
+
+### Documentation is part of every change (hard rule)
+
+**Any change to code, features, datasets, methods, specs, notebooks, or observable behavior must update the docs that describe it, in the same change.** Docs are part of the product and part of the definition of "done." A change that lands working code but leaves a README, the root `README.md`, the method catalog, or `planning-docs/roadmap.md` describing the old reality is a **regression** — treat it exactly as you would a failing test, not as follow-up work.
+
+So "done" always includes a documentation reconciliation step. Before considering any change complete:
+
+1. **Grep for what you touched** across docs — the feature name, the module/class/function, the dataset, the spec, the notebook. The fast version: `grep -rn "<thing>" --include="*.md" .` (and check notebook markdown cells). Don't rely on memory for where something is mentioned.
+2. **Reconcile every hit.** If a doc calls something "planned", "deferred", "not yet wired in", a "seam", or "out of scope" and you just made it real, update that wording. If a doc lists files, notebooks, predictors, specs, or data sources and you added or removed one, fix the list. If you changed a default, a metric, or a command, fix it everywhere it appears.
+3. **Update the layered docs together**, not just the nearest one: the use-case README (most detail), the reference-implementations table in the root `README.md`, the method catalog (`aieng-forecasting/aieng/forecasting/methods/README.md`) when you touch a reusable predictor, and `planning-docs/roadmap.md` when something moves from "extension idea" to "shipped".
+
+Concrete example: integrating Canada's Food Price Report PDFs into the food-price LLM-Process prompt is **not done when the code runs** — it is done when `implementations/food_price_forecasting/README.md` (which currently frames report→prompt wiring as a deferred extension) and the "Reports as predictor context" entry in `planning-docs/roadmap.md` no longer describe it as future work. Shipping the code while those still say "deferred" is the regression the reviewer should catch.
+
+The two subsections below are the map of where docs live, so the reconciliation in step 3 is quick.
+
+### planning-docs/
+
+`./planning-docs/roadmap.md` captures the architecture principles worth preserving and the catalog of extension ideas. It is the place for cross-cutting design notes, not per-task tracking.
+
+The older planning log, backlog, project charter, and technical-design files under `planning-docs/` (and `planning-docs/archive/`) are retired and kept only for continuity — do not add new decisions to them. When a change affects architecture, datasets, repo layout, or the set of reference implementations, update `planning-docs/roadmap.md` (for an architectural principle or a new extension idea) and the relevant README files in the same session.
+
+Project shape to keep in mind:
+
+- The core library `aieng.forecasting` owns stable infrastructure; reusable predictors live in `aieng.forecasting.methods`; use-case material lives in `implementations/<use-case>/`.
+- YAML specs are co-located under `implementations/<use-case>/specs/`.
+- Reference implementations: Getting Started, Food Price Forecasting, Energy/Oil (stateless capability track plus an adaptive learning agent), BoC Rate Decisions (quantitative path, cutoff-aware press-release ingestion, and a reasoning-alignment evaluator), and S&P 500 (in active development).
+- Energy/oil's older information-session notebooks are archived under `playground/energy_case_study/`.
+- Continuous and discrete-event forecasts are output modalities; numerical methods, LLM Processes, and agentic forecasters are method families that apply to either.
+
+### README files
+
+Search the repo for `README.md` files (excluding `.venv/`) to find every README — there is one at the root, one per package (`aieng-forecasting/`, `implementations/`), the method catalog under `aieng-forecasting/aieng/forecasting/methods/`, and one per use case under `implementations/<use-case>/`. These are the primary user surface and the first thing a new contributor reads; the reconciliation rule above applies to all of them. Keep them accurate and production-quality: describe what the code does and what you can build from it, with no internal program, scheduling, or ownership framing.
+
+---
+
+## Development conventions
+
+### Data cache
+
+Historical data is stored in `data/` at the repo root (gitignored). Before running notebooks or scripts that depend on live data, populate the cache by running the relevant script in `scripts/` (e.g. `uv run python scripts/fetch_cpi.py`). Never commit data files.
+
+### Model selection
+
+The project standardizes on **two** Vector-proxy models so examples stay consistent: `gemini-3.1-flash-lite-preview` (the **lite / default** model) and `gemini-3.5-flash` (the **advanced** model, used for the adaptive-agent path and curriculum runs). Both are defined once in `aieng.forecasting.models` as `LITE_MODEL` / `ADVANCED_MODEL` (`DEFAULT_MODEL = LITE_MODEL`). Reference these constants in code rather than hardcoding model strings; notebooks pick one of the two literals with the other shown as a commented alternative. See `planning-docs/vector-llm-proxy.md` for the full convention.
+
+### Code quality (not on commit)
+
+Git commits **do not** run automated hooks locally. Run **`make lint`** (ruff format + ruff check + mypy on `aieng`) before pushing — a passing `make lint` means CI will be happy with the code. To fully mirror CI (yaml checks, uv-lock, etc.) run **`uv run pre-commit run --all-files`**. CI on `main` runs the same `pre-commit` config.
+
+Notebook outputs **are** committed at the author's discretion — `nbstripout` is not in the pre-commit config. Strip outputs manually before committing if you don't want them in the repo.
+
+### Test philosophy
+
+Tests should justify their existence. Write tests for: non-obvious logic that is easy to get wrong, defensive contracts (e.g. copy-on-return), and error paths where the message matters. Do not write tests for: Pydantic model construction (Pydantic already validates this), trivial Python behaviour (sorted lists, empty dicts), or mock-interaction assertions that test implementation rather than behaviour. When in doubt, fewer focused tests are better than many shallow ones.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/README.md.md b/implementations/getting_started/concierge_agent/context/artifacts/README.md.md
new file mode 100644
index 0000000..bd4d6a0
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/README.md.md
@@ -0,0 +1,207 @@
+# Source: README.md
+
+kind: markdown
+
+# Agentic Forecasting
+
+A foundation for building, evaluating, and comparing forecasting systems — conventional numerical models, LLM Processes, and agentic forecasters — on real economic, financial, and event-prediction tasks.
+
+The repository pairs a small, stable core library with a set of self-contained reference implementations. The library gives you cutoff-safe data handling, a single `Predictor` interface, and a backtest/evaluation harness. Each reference implementation is a worked example of a different forecasting problem and the techniques that suit it. Start from whichever one is closest to what you want to build.
+
+> **👉 First time here? Run the environment check.** After `uv sync` (see [Setup](#setup)), open [`implementations/getting_started/00_environment_check.ipynb`](implementations/getting_started/00_environment_check.ipynb) and run it top to bottom. It's a self-guided preflight that verifies every capability — proxy LLM inference, Langfuse, E2B code execution, StatCan/FRED data access, and an end-to-end mini backtest — and tells you exactly what to fix when something isn't set up. **Do this before anything else.**
+
+## What's here
+
+- **Core library** — `aieng-forecasting` (`aieng.forecasting`): data services, cutoff enforcement, forecasting tasks, prediction payloads, backtesting, evaluation, and artifacts.
+- **Reusable methods** — `aieng.forecasting.methods`: `Predictor` implementations including naive baselines (continuous, binary, and categorical), Darts numerical predictors, LLM-process predictors (continuous, binary-probability, and categorical-probability), and ADK-based agentic infrastructure (`build_adk_agent`, `AdkTextRunner`, `AgentPredictor`).
+- **Reference implementations** — `implementations/<use-case>/`: notebooks, helper modules, task-specific configuration, and co-located YAML specs.
+- **Tracing** — Langfuse / OpenTelemetry bootstrap (`aieng.forecasting.langfuse_tracing`) for LiteLLM and Google ADK.
+- **Data scripts** — `scripts/`: one fetch script per data source, plus `build_e2b_template.py` for the agentic code-execution sandbox.
+
+## Two ways to use a forecaster
+
+Every method can be used in one of two modes, and the distinction runs through the library:
+
+- **Track 1 — evaluated prediction.** Numerical methods, LLM Processes, and agentic forecasters emit standardized `Prediction` objects and are compared head-to-head with the evaluation harness (CRPS, Brier, RPS, calibration).
+- **Track 2 — interactive analysis.** The same agents can do scenario analysis, monitoring, open-ended Q&A, code-backed analysis, and reasoning over evidence — useful work that isn't reduced to a single score.
+
+## Reference implementations
+
+Each is independent and self-contained — pick the one that matches the problem you care about, and read that directory's `README.md` for the full walkthrough. They are numbered in a recommended order that mirrors the bootcamp progression — conventional numerical methods → LLM Processes → agents → agentic evaluation — but any one stands on its own, so jump straight to the problem you care about.
+
+**Start here → #0 [`getting_started/`](implementations/getting_started/)** — one CPI series, one month ahead. The smallest end-to-end loop: a `Predictor`, a `BacktestSpec` and `EvalSpec`, naive + AutoARIMA baselines, CRPS scoring. The place to learn the evaluation framework before picking a domain below. Also includes [`99_repo_concierge.ipynb`](implementations/getting_started/99_repo_concierge.ipynb) — a lite-model repo guide for “how does this codebase work?” questions.
+
+| #   | Implementation                                                       | The problem                                                                     | Concepts & techniques it demonstrates                                                                                                                                                                                                                                                                       |
+| --- | -------------------------------------------------------------------- | ------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| 1   | [`sp500_forecasting/`](implementations/sp500_forecasting/)           | S&P 500 returns under a macro/market covariate panel.                           | A head-to-head of conventional numerical methods (naive, ETS, Kalman, AutoARIMA, linear regression, LightGBM) plus a covariate-aware LLM-Process, all reading the same leak-safe covariate panel. Cumulative-return targets at 1/5/21-business-day horizons, CRPS + direction metrics, config-driven specs. |
+| 2   | [`food_price_forecasting/`](implementations/food_price_forecasting/) | A multivariate food-CPI trajectory, in the style of Canada's Food Price Report. | Nine correlated sub-indices, a 12-step trajectory, a domain metric (avg/avg YoY), baselines vs LLM-Process predictors, leakage-aware backtests, and cached artifacts for fast iteration.                                                                                                                    |
+| 3   | [`energy_oil_forecasting/`](implementations/energy_oil_forecasting/) | Daily WTI crude-oil price under regime-breaking news.                           | A capability progression — Prophet → LLM-Process → news-grounded agent → code-executing agent — plus an adaptive agent that learns a strategy from data and is scored before vs after. Continuous trajectories, a binary up-shock task, and interactive scenario analysis.                                  |
+| 4   | [`boc_rate_decisions/`](implementations/boc_rate_decisions/)         | Will the Bank of Canada cut, hold, or hike at its next meeting?                 | Discrete-event forecasting: ordered-categorical outcomes on an irregular calendar, RPS scoring and one-vs-rest calibration (instead of CRPS), a binary (Brier) special case, cutoff-aware document ingestion, and an LLM-as-judge that scores an agent's reasoning against the official rationale.          |
+
+**Not sure where to start building?** Each of the four domain implementations above ends with a `99_starter_agent.ipynb` — a fresh, hackable **starter agent** (a `starter_agent/` module) with toggleable news search and code execution, two lightweight tool-usage skills, an interactive cell, and one scored forecast. It's the consistent "continue from here" entry point for taking any reference use case in an agentic direction, and a quick end-to-end test of that use case's agent stack.
+
+## Time series data sources
+
+- **StatCan** — Canadian CPI and related macroeconomic series.
+- **FRED** — macroeconomic and commodity series.
+- **yfinance** — equities, indices, and commodity futures.
+
+Historical data is cached locally under `data/` and is not committed. Each implementation's README names the fetch script(s) it needs.
+
+### FRED API key
+
+Several reference implementations (S&P 500, BoC rate decisions) fetch data from the Federal Reserve Economic Data (FRED) API, which requires a free personal API key. **We cannot provide this key for you** — each participant must request their own at:
+
+> [https://fred.stlouisfed.org/docs/api/api_key.html](https://fred.stlouisfed.org/docs/api/api_key.html)
+
+FRED keys are free and approval is typically quick, but it can occasionally take some time, so request yours early. When asked for a use-case description, something along these lines works well:
+
+> "Requesting an API key to explore the effectiveness of various forecasting techniques on economic data."
+
+Once you have the key, add it to your repo-root `.env`:
+
+```
+FRED_API_KEY=your_fred_api_key
+```
+
+On Coder workspaces, bootcamp keys (`OPENAI_*`, `E2B_*`, `LANGFUSE_*`) live in your shell environment, **not** in the repo `.env`. See [Coder Workspaces](#coder-workspaces).
+
+## Repository layout
+
+```text
+aieng-forecasting/   # Installable library: import as aieng.forecasting
+implementations/     # Self-contained reference implementations + co-located specs
+scripts/             # Data-fetch scripts + E2B template builder
+tests/               # Onboarding integration tests (not run in CI)
+planning-docs/       # Architecture notes and the extension/roadmap catalog
+playground/          # Exploration and archived demos (not reference implementations)
+```
+
+## Setup
+
+Install dependencies from the repo root:
+
+```bash
+git clone <repo-url>  # If running locally; Coder environment setup clones the repo automatically.
+cd agentic-forecasting
+uv sync --dev
+```
+
+**macOS — LightGBM and OpenMP.** The library depends on **LightGBM** (used by `DartsLightGBMPredictor` and some notebooks). The PyPI wheel expects **OpenMP** at runtime. If you see `Library not loaded: @rpath/libomp.dylib` when importing or training, install Homebrew's OpenMP once and restart your shell or Jupyter kernel:
+
+```bash
+brew install libomp
+```
+
+On Apple Silicon the dylib is typically under `/opt/homebrew/opt/libomp/lib/`; on Intel Homebrew, `/usr/local/opt/libomp/lib/`.
+
+### Coder Workspaces
+
+When you open a **Coder workspace**, startup runs automatically in the background. By the time you connect you should have:
+
+- The repo cloned, a Python venv, and dependencies installed
+- Bootcamp API keys (`OPENAI_*`, `E2B_*`, `LANGFUSE_*`) available in your shell (not in `.env`)
+- A shell that opens in the repo with the venv activated
+
+**Your next step:** run [`00_environment_check.ipynb`](implementations/getting_started/00_environment_check.ipynb) top to bottom. That notebook will confirm that startup succeeded.
+
+On first boot, keys are verified against live services and your onboarding status is recorded. Workspace restarts reload keys without re-running the full test suite.
+
+**Local machine or troubleshooting** — fetch and verify keys manually:
+
+```bash
+eval "$(onboard --bootcamp-name agentic-forecasting --test-script tests/test_integration.py)"
+```
+
+Reload keys in a new shell without re-testing:
+
+```bash
+eval "$(onboard --bootcamp-name agentic-forecasting --skip-test)"
+```
+
+Headless verification (same checks as first-boot onboarding):
+
+```bash
+uv sync --all-extras --dev --all-packages
+uv run pytest tests/test_integration.py -v
+```
+
+**Credential model:** bootcamp keys live in your shell environment. Optional personal keys (e.g. `FRED_API_KEY`) go only in your repo-root `.env`; see [`.env.example`](.env.example).
+
+### Verify your environment first
+
+New to the project? Open [`implementations/getting_started/00_environment_check.ipynb`](implementations/getting_started/00_environment_check.ipynb) and run it top to bottom. It's a self-guided preflight that checks every major capability — proxy LLM inference, Langfuse, E2B code execution, StatCan/FRED data access, and a full end-to-end mini backtest — one cell at a time, and tells you exactly what to fix when something isn't set up (most often a missing or placeholder key in your `.env`). It's the fastest way to confirm setup before working through the reference implementations.
+
+### Populate the data cache
+
+Data is fetched once and cached locally (gitignored). Each implementation names the fetch script(s) it needs in its own `README.md` — for example `scripts/fetch_cpi.py` (getting started), `scripts/fetch_sp500_market.py` + `scripts/fetch_fred.py` (S&P 500), `scripts/fetch_wti.py` (energy), and `scripts/fetch_boc.py` and `scripts/fetch_boc_press_releases.py` (BoC). Run the relevant one before opening that implementation's notebooks:
+
+```bash
+uv run python scripts/fetch_cpi.py
+```
+
+### Build the E2B sandbox image (agentic implementations only)
+
+Agentic forecasters can run code in an E2B cloud sandbox. For bootcamp participants, E2B credentials should be injected into the environment automatically; confirm the setup by running [`00_environment_check.ipynb`](implementations/getting_started/00_environment_check.ipynb).
+
+If that check fails, or if you want to run E2B in a different environment, do this once before enabling code execution in `build_adk_agent`:
+
+1. Create a free account at [e2b.dev](https://e2b.dev) and copy your API key.
+2. Add it to your `.env` file alongside the other keys (see `.env.example`):
+
+   ```
+   E2B_API_KEY=your_e2b_api_key
+   ```
+
+3. Build the template (takes a few minutes on first run):
+
+   ```bash
+   uv run --env-file .env scripts/build_e2b_template.py
+   ```
+
+The template name is the default in `CodeExecutionConfig.template_name`, so notebooks pick it up automatically.
+
+## Core concepts
+
+`Predictor` is the interface every forecasting method implements:
+
+```python
+class MyPredictor(Predictor):
+    @property
+    def predictor_id(self) -> str:
+        return "my_predictor"
+
+    def predict(self, task: ForecastingTask, context: ForecastContext) -> list[Prediction]:
+        series = context.get_series(task.target_series_id)
+        ...
+        return [Prediction(...)]
+```
+
+`ForecastContext` is cutoff-scoped. Predictors only see observations available as of the forecast origin, which keeps backtests honest.
+
+`backtest()` is the open iteration loop against historical data. `evaluate()` is the budgeted protected-window loop.
+
+## Extending the foundation
+
+This repo is a starting point, not a finished product. The shape of a new forecaster is always the same: implement `Predictor`, declare a spec, and run `backtest()` / `evaluate()` to compare it against the baselines. Each reference implementation's README ends with concrete extension ideas; `planning-docs/roadmap.md` collects the cross-cutting ones (new data sources, additional methods, live forecasting, deeper agent work).
+
+## Code quality
+
+```bash
+make lint
+make format
+```
+
+`make lint` runs the expected pre-push quality checks. Git commits do not run hooks locally. To mirror the full pre-commit suite, run:
+
+```bash
+uv run pre-commit run --all-files
+```
+
+## Documentation
+
+- Per-implementation READMEs under [`implementations/`](implementations/) — the primary user surface.
+- [`aieng-forecasting/README.md`](aieng-forecasting/README.md) and [`aieng-forecasting/aieng/forecasting/methods/README.md`](aieng-forecasting/aieng/forecasting/methods/README.md) — the library and the method catalog.
+- [`planning-docs/roadmap.md`](planning-docs/roadmap.md) — architecture principles and extension ideas.
+
+Keep code, notebooks, specs, and these docs in sync when you change behavior, setup, layout, or datasets.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting____init__.py.md
new file mode 100644
index 0000000..1f44b7d
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting____init__.py.md
@@ -0,0 +1,12 @@
+# Source: aieng-forecasting/aieng/forecasting/__init__.py
+
+kind: python
+
+```python
+"""Agentic Forecasting — data service and evaluation harness."""
+
+from aieng.forecasting.models import ADVANCED_MODEL, DEFAULT_MODEL, LITE_MODEL
+
+
+__all__ = ["ADVANCED_MODEL", "DEFAULT_MODEL", "LITE_MODEL"]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data____init__.py.md
new file mode 100644
index 0000000..47acf3f
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data____init__.py.md
@@ -0,0 +1,14 @@
+# Source: aieng-forecasting/aieng/forecasting/data/__init__.py
+
+kind: python
+
+```python
+"""Data service: adapters, series store, and cutoff enforcement."""
+
+from aieng.forecasting.data.context import ForecastContext
+from aieng.forecasting.data.models import SeriesMetadata, SeriesRecord
+from aieng.forecasting.data.service import DataService
+
+
+__all__ = ["DataService", "ForecastContext", "SeriesMetadata", "SeriesRecord"]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters____init__.py.md
new file mode 100644
index 0000000..52e5f2f
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters____init__.py.md
@@ -0,0 +1,15 @@
+# Source: aieng-forecasting/aieng/forecasting/data/adapters/__init__.py
+
+kind: python
+
+```python
+"""Adapter implementations for ingesting data into the SeriesStore."""
+
+from aieng.forecasting.data.adapters.base import BaseAdapter
+from aieng.forecasting.data.adapters.fred import FREDAdapter
+from aieng.forecasting.data.adapters.statcan import StatCanAdapter
+from aieng.forecasting.data.adapters.yfinance import YFinanceDailyAdapter
+
+
+__all__ = ["BaseAdapter", "FREDAdapter", "StatCanAdapter", "YFinanceDailyAdapter"]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters__base.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters__base.py.md
new file mode 100644
index 0000000..dd46a88
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters__base.py.md
@@ -0,0 +1,61 @@
+# Source: aieng-forecasting/aieng/forecasting/data/adapters/base.py
+
+kind: python
+
+```python
+"""Base adapter protocol for data ingestion."""
+
+from abc import ABC, abstractmethod
+
+import pandas as pd
+
+
+class BaseAdapter(ABC):
+    """Abstract base class for all data adapters.
+
+    An adapter is responsible for fetching data from a single source and
+    returning it in the canonical internal format understood by ``SeriesStore``.
+
+    Each adapter instance represents **one series**. If a source provides
+    multiple series (e.g. a StatCan table with many product groups), create
+    one adapter instance per series.
+
+    The canonical format returned by ``fetch()`` is a ``pandas.DataFrame``
+    with the following columns:
+
+    - ``timestamp`` (``datetime64[ns]``): observation time / reference period.
+    - ``value`` (``float64``): the observed quantity.
+    - ``released_at`` (``datetime64[ns]``, optional): when the data point
+      became publicly available. If absent, ``CutoffEnforcer`` falls back to
+      ``timestamp``.
+
+    The ``series_id`` is **not** a column — it is the key used when
+    registering the adapter with ``DataService``.
+
+    Notes
+    -----
+    Adapters should be **offline-safe** after initial data retrieval. All
+    network calls belong in ``fetch()``, which is called once by a
+    data-loading script ahead of sessions. During sessions or backtests,
+    ``DataService.get_series()`` serves from the in-memory store with no
+    further network access.
+    """
+
+    @abstractmethod
+    def fetch(self) -> pd.DataFrame:
+        """Fetch the series and return it in canonical format.
+
+        Returns
+        -------
+        pd.DataFrame
+            DataFrame with columns ``timestamp`` (datetime64) and ``value``
+            (float64). The optional ``released_at`` column (datetime64) should
+            be included when the source provides reliable publication dates.
+            Rows are sorted ascending by ``timestamp``.
+
+        Raises
+        ------
+        RuntimeError
+            If the fetch fails (network error, missing data, etc.).
+        """
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters__fred.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters__fred.py.md
new file mode 100644
index 0000000..1e8f285
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters__fred.py.md
@@ -0,0 +1,193 @@
+# Source: aieng-forecasting/aieng/forecasting/data/adapters/fred.py
+
+kind: python
+
+```python
+"""FRED (Federal Reserve Economic Data) adapter for the SeriesStore.
+
+``FREDAdapter`` fetches a single FRED series and returns it in the canonical
+internal format understood by :class:`~aieng.forecasting.data.store.SeriesStore`.
+
+Caching
+-------
+When ``cache_dir`` is provided, the adapter persists each series to
+``{cache_dir}/{series_id}.parquet`` on first fetch and reads from the parquet
+file on all subsequent calls.  This mirrors the ``StatCanAdapter`` pattern:
+run ``scripts/fetch_fred.py`` once to populate the cache, then notebooks and
+backtests read from disk with no further network access.
+
+**API key requirement:** FRED requires a free API key obtained from
+https://fred.stlouisfed.org/docs/api/api_key.html.  Provide it via the
+``FRED_API_KEY`` environment variable (recommended) or the ``api_key``
+constructor argument.  The key is only needed when the local cache is empty
+or ``refresh=True``.
+
+**``released_at`` approximation:** The plain ``Fred.get_series`` call used here
+returns observations only, without vintage / release dates.  The adapter sets
+``released_at = timestamp``, which is correct for series that are available
+at their reference period end (e.g. monthly averages published at or shortly
+after month end).  For series with significant publication lags this is
+optimistic and may be refined in a later pass using FRED's
+``get_series_vintage_dates`` endpoint.
+"""
+
+from __future__ import annotations
+
+import os
+from pathlib import Path
+
+import pandas as pd
+from aieng.forecasting.data.adapters.base import BaseAdapter
+
+
+class FREDAdapter(BaseAdapter):
+    """Adapter that fetches a single FRED series, with optional disk cache.
+
+    Parameters
+    ----------
+    series_id : str
+        FRED series identifier, e.g. ``"CPIFABSL"`` or ``"EXCAUS"``.
+    api_key : str or None
+        FRED API key.  If ``None``, the value is read from the
+        ``FRED_API_KEY`` environment variable.  The key is only consulted
+        when a network fetch is actually required (cache miss or
+        ``refresh=True``); adapters pointing at a populated cache can be
+        instantiated without a key.
+    cache_dir : str, Path, or None
+        Directory to read/write parquet cache files.  When ``None``,
+        caching is disabled and every ``fetch()`` call hits the FRED API.
+        When set, the adapter reads from ``{cache_dir}/{series_id}.parquet``
+        if present; otherwise it fetches from FRED and writes the parquet
+        before returning.  Default: ``"data/fred"``.
+    refresh : bool
+        When ``True``, force a network fetch even if a cache file exists
+        (and overwrite the cache).  Default: ``False``.
+
+    Raises
+    ------
+    ValueError
+        When a network fetch is required but no API key is available.
+
+    Examples
+    --------
+    Populate the cache once::
+
+        >>> adapter = FREDAdapter("EXCAUS")          # uses FRED_API_KEY env var
+        >>> df = adapter.fetch()                     # hits API, writes parquet
+
+    Subsequent reads never touch the network::
+
+        >>> adapter = FREDAdapter("EXCAUS")
+        >>> df = adapter.fetch()                     # reads parquet
+    """
+
+    DEFAULT_CACHE_DIR = "data/fred"
+
+    def __init__(
+        self,
+        series_id: str,
+        api_key: str | None = None,
+        cache_dir: str | Path | None = DEFAULT_CACHE_DIR,
+        refresh: bool = False,
+    ) -> None:
+        self._series_id = series_id
+        self._api_key = api_key or os.environ.get("FRED_API_KEY")
+        self._cache_dir = Path(cache_dir) if cache_dir is not None else None
+        self._refresh = refresh
+
+    @property
+    def series_id(self) -> str:
+        """FRED series identifier."""
+        return self._series_id
+
+    @property
+    def cache_path(self) -> Path | None:
+        """Full path to this adapter's parquet cache file, or ``None`` if disabled."""
+        if self._cache_dir is None:
+            return None
+        return self._cache_dir / f"{self._series_id}.parquet"
+
+    def fetch(self) -> pd.DataFrame:
+        """Return the series in canonical format, using the disk cache when available.
+
+        Flow:
+
+        1. If ``cache_dir`` is set and the parquet file exists and ``refresh=False``,
+           read and return it.
+        2. Otherwise fetch from the FRED API, normalize, write to parquet (when
+           caching is enabled), and return.
+
+        Returns
+        -------
+        pd.DataFrame
+            Columns: ``timestamp`` (datetime64[ns]), ``value`` (float64),
+            ``released_at`` (datetime64[ns]).  Sorted ascending by
+            ``timestamp``.  Index is a default RangeIndex.
+
+        Raises
+        ------
+        ValueError
+            If a network fetch is required but no API key is available.
+        RuntimeError
+            If the FRED API request fails or returns no data.
+        """
+        cache_path = self.cache_path
+        if cache_path is not None and cache_path.exists() and not self._refresh:
+            return self._read_cache(cache_path)
+
+        df = self._fetch_from_api()
+
+        if cache_path is not None:
+            cache_path.parent.mkdir(parents=True, exist_ok=True)
+            df.to_parquet(cache_path, index=False)
+
+        return df
+
+    def _fetch_from_api(self) -> pd.DataFrame:
+        """Fetch the series directly from the FRED API."""
+        if not self._api_key:
+            raise ValueError(
+                "FRED API key not provided.  Set the FRED_API_KEY environment variable "
+                "or pass api_key= to FREDAdapter.  (Key is only required on cache miss; "
+                "populated caches can be read without one.)"
+            )
+
+        try:
+            from fredapi import Fred  # noqa: PLC0415
+        except ImportError as exc:
+            raise RuntimeError("fredapi is not installed. Run `uv add fredapi` to install it.") from exc
+
+        fred = Fred(api_key=self._api_key)
+
+        try:
+            raw: pd.Series = fred.get_series(self._series_id)
+        except Exception as exc:
+            raise RuntimeError(f"Failed to fetch FRED series '{self._series_id}': {exc}") from exc
+
+        if raw.empty:
+            raise RuntimeError(f"FRED series '{self._series_id}' returned no data.")
+
+        df = raw.reset_index()
+        df.columns = pd.Index(["timestamp", "value"])
+        df["timestamp"] = pd.to_datetime(df["timestamp"])
+        df["value"] = pd.to_numeric(df["value"], errors="coerce")
+        df = df.dropna(subset=["value"])
+        df["released_at"] = df["timestamp"]
+        df = df.sort_values("timestamp").reset_index(drop=True)
+
+        return df[["timestamp", "value", "released_at"]]
+
+    @staticmethod
+    def _read_cache(cache_path: Path) -> pd.DataFrame:
+        """Read a cached parquet and normalize dtypes defensively."""
+        df = pd.read_parquet(cache_path)
+        df["timestamp"] = pd.to_datetime(df["timestamp"])
+        df["released_at"] = pd.to_datetime(df["released_at"])
+        df["value"] = pd.to_numeric(df["value"], errors="coerce")
+        return df[["timestamp", "value", "released_at"]].reset_index(drop=True)
+
+    def __repr__(self) -> str:
+        """Return a short representation without exposing the API key."""
+        cache = self._cache_dir if self._cache_dir is not None else "disabled"
+        return f"FREDAdapter(series_id={self._series_id!r}, cache_dir={cache!r})"
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters__statcan.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters__statcan.py.md
new file mode 100644
index 0000000..3b25f7a
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters__statcan.py.md
@@ -0,0 +1,219 @@
+# Source: aieng-forecasting/aieng/forecasting/data/adapters/statcan.py
+
+kind: python
+
+```python
+"""Statistics Canada adapter using the stats-can library."""
+
+import re
+import zipfile
+from pathlib import Path
+
+import pandas as pd
+from aieng.forecasting.data.adapters.base import BaseAdapter
+
+
+# Canonical column names in StatCan CSV exports (stable across tables).
+_STATCAN_DATE_COL = "REF_DATE"
+_STATCAN_VALUE_COL = "VALUE"
+
+
+def _normalize_table_id(table_id: str) -> str:
+    """Strip non-numeric characters and take the first 8 digits.
+
+    Statistics Canada table IDs like ``"18-10-0004-13"`` map to the zip filename
+    ``"18100004-eng.zip"`` — the last two digits are a product variant suffix
+    not used in the filename.
+    """
+    return re.sub(r"\D", "", table_id)[:8]
+
+
+def _read_zip(zip_path: Path, normalized_id: str) -> pd.DataFrame:
+    """Read the CSV from a StatCan zip file into a raw DataFrame.
+
+    Uses ``errors="coerce"`` for date parsing (avoiding the pandas-3
+    incompatibility in ``stats_can.zip_table_to_dataframe`` which used
+    the now-removed ``errors="ignore"``).
+    """
+    csv_name = f"{normalized_id}.csv"
+    with zipfile.ZipFile(zip_path) as zf:
+        with zf.open(csv_name) as f:
+            col_names = pd.read_csv(f, nrows=0).columns.tolist()
+        types_dict: dict[str, type | str] = {_STATCAN_VALUE_COL: float}
+        types_dict.update({col: str for col in col_names if col not in types_dict})
+        with zf.open(csv_name) as f:
+            df = pd.read_csv(f, dtype=types_dict)
+
+    df[_STATCAN_DATE_COL] = pd.to_datetime(df[_STATCAN_DATE_COL], errors="coerce")
+    return df
+
+
+class StatCanAdapter(BaseAdapter):
+    """Adapter for a single series from a Statistics Canada table.
+
+    Uses the ``stats-can`` library (v3+) to download tables and caches the
+    raw zip locally. The CSV inside the zip is read directly with pandas to
+    avoid a pandas-3 incompatibility in ``stats_can.zip_table_to_dataframe``.
+    After the initial download, all data is served from the local cache —
+    no further network calls are made unless the cache is cleared.
+
+    Each instance represents **one series**, identified by a set of filter
+    criteria (e.g. geography + product group). For tables that contain many
+    series, instantiate one ``StatCanAdapter`` per series and register each
+    with ``DataService`` under a distinct ``series_id``.
+
+    Parameters
+    ----------
+    table_id : str
+        Statistics Canada table identifier (e.g. ``"18-10-0004-13"``).
+    member_filter : dict[str, str]
+        Column-value pairs used to select a single series from the table.
+        For example: ``{"GEO": "Canada", "Products and product groups": "All-items"}``.
+        All specified columns must be present in the downloaded table.
+    cache_dir : str or Path
+        Directory where the ``stats-can`` library stores its local table cache.
+        Defaults to ``"data/statcan"`` relative to the current working directory.
+    release_lag_days : int
+        Days added to ``timestamp`` to populate ``released_at``. The default
+        of 21 is a deliberately loose approximation for monthly survey
+        tables; note the lag is measured from the *month-start* timestamp,
+        while StatCan publishes CPI roughly three weeks after the month
+        *ends* (~51 days after the timestamp), so the default is optimistic
+        by about one month. Consumers that use monthly series as covariates
+        should add their own conservative lag (see the BoC use case). Daily
+        financial-market tables (e.g. 10-10-0139-01 interest rates) are
+        published the next business day — pass ``release_lag_days=1`` for
+        those so backtests do not hide three weeks of perfectly public
+        market data from predictors.
+
+    Notes
+    -----
+    **Information cutoff**: StatCan publishes CPI data roughly 3 weeks after
+    the reference month ends. For example, January CPI is released in
+    mid-to-late February. This adapter populates ``released_at`` as
+    ``timestamp + release_lag_days``, a fixed-lag approximation. A more
+    precise implementation would query StatCan's release calendar API; the
+    fixed lag only reduces the optimistic bias (see ``release_lag_days``).
+
+    Examples
+    --------
+    >>> adapter = StatCanAdapter(
+    ...     table_id="18-10-0004-13",
+    ...     member_filter={
+    ...         "GEO": "Canada",
+    ...         "Products and product groups": "All-items",
+    ...     },
+    ... )
+    >>> df = adapter.fetch()
+    >>> df.columns.tolist()
+    ['timestamp', 'value', 'released_at']
+    """
+
+    def __init__(
+        self,
+        table_id: str,
+        member_filter: dict[str, str],
+        cache_dir: str | Path = "data/statcan",
+        release_lag_days: int = 21,
+    ) -> None:
+        if release_lag_days < 0:
+            raise ValueError(f"release_lag_days must be non-negative; got {release_lag_days}")
+        self._table_id = table_id
+        self._member_filter = member_filter
+        self._cache_dir = Path(cache_dir)
+        self._release_lag_days = release_lag_days
+
+    @property
+    def table_id(self) -> str:
+        """Return the StatCan table identifier."""
+        return self._table_id
+
+    @property
+    def member_filter(self) -> dict[str, str]:
+        """Return the filter criteria that identify this series."""
+        return dict(self._member_filter)
+
+    def fetch(self) -> pd.DataFrame:
+        """Download (or load from cache) and return the series in canonical format.
+
+        Returns
+        -------
+        pd.DataFrame
+            DataFrame with columns ``timestamp`` (datetime64[ns]), ``value``
+            (float64), and ``released_at`` (datetime64[ns]), sorted ascending
+            by ``timestamp``. ``released_at`` is set to
+            ``timestamp + release_lag_days`` to approximate StatCan's
+            publication lag. Rows with missing values are dropped.
+
+        Raises
+        ------
+        RuntimeError
+            If the table cannot be downloaded or the filter criteria do not
+            match any rows.
+        ValueError
+            If a column named in ``member_filter`` is not present in the table.
+        """
+        import stats_can.sc as _sc  # noqa: PLC0415 — lazy import after package checks
+
+        self._cache_dir.mkdir(parents=True, exist_ok=True)
+
+        normalized = _normalize_table_id(self._table_id)
+        zip_path = self._cache_dir / f"{normalized}-eng.zip"
+
+        if not zip_path.exists():
+            try:
+                _sc.download_tables([normalized], path=self._cache_dir)
+            except Exception as exc:
+                raise RuntimeError(f"Failed to download StatCan table {self._table_id!r}: {exc}") from exc
+
+        try:
+            raw = _read_zip(zip_path, normalized)
+        except Exception as exc:
+            raise RuntimeError(f"Failed to fetch StatCan table {self._table_id!r}: {exc}") from exc
+
+        # Validate that all filter columns exist before filtering.
+        missing_cols = [col for col in self._member_filter if col not in raw.columns]
+        if missing_cols:
+            raise ValueError(
+                f"Filter column(s) {missing_cols} not found in table {self._table_id!r}. "
+                f"Available columns: {raw.columns.tolist()}"
+            )
+
+        # Apply member filter to isolate the target series.
+        mask = pd.Series(True, index=raw.index)
+        for col, val in self._member_filter.items():
+            mask &= raw[col] == val
+
+        filtered = raw.loc[mask].copy()
+
+        if filtered.empty:
+            raise RuntimeError(f"No rows matched filter {self._member_filter} in table {self._table_id!r}.")
+
+        if _STATCAN_VALUE_COL not in filtered.columns:
+            raise ValueError(
+                f"Expected value column {_STATCAN_VALUE_COL!r} not found in table. "
+                f"Available columns: {filtered.columns.tolist()}"
+            )
+
+        if _STATCAN_DATE_COL not in filtered.columns:
+            raise ValueError(
+                f"Expected date column {_STATCAN_DATE_COL!r} not found in table. "
+                f"Available columns: {filtered.columns.tolist()}"
+            )
+
+        # Build canonical output: (timestamp, value, released_at).
+        timestamps = pd.to_datetime(filtered[_STATCAN_DATE_COL])
+        result = pd.DataFrame(
+            {
+                "timestamp": timestamps,
+                "value": pd.to_numeric(filtered[_STATCAN_VALUE_COL], errors="coerce"),
+                # Approximate the table's publication lag (default 21 days for
+                # monthly survey tables like CPI; 1 day for daily market data).
+                "released_at": timestamps + pd.DateOffset(days=self._release_lag_days),
+            }
+        )
+
+        # Drop rows with missing values (StatCan uses blank VALUE for suppressed data).
+        result = result.dropna(subset=["value"])
+        return result.sort_values("timestamp").reset_index(drop=True)
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters__yfinance.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters__yfinance.py.md
new file mode 100644
index 0000000..6fc3163
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters__yfinance.py.md
@@ -0,0 +1,340 @@
+# Source: aieng-forecasting/aieng/forecasting/data/adapters/yfinance.py
+
+kind: python
+
+```python
+"""Yahoo Finance adapter for daily market series.
+
+``YFinanceDailyAdapter`` fetches one ticker/field pair from Yahoo Finance via
+``yfinance`` and returns the canonical internal format understood by
+:class:`~aieng.forecasting.data.store.SeriesStore`.
+
+Caching
+-------
+When ``cache_dir`` is provided, the adapter persists each ticker/field pair to
+``{cache_dir}/{stem}.parquet`` on first fetch, where ``stem`` is the lowercased,
+underscore-sanitized ``{ticker}_{field}_1d``, and reads from that file afterwards.
+The cache is only used when it fully covers the requested ``start``/``end``
+window; if the cached data starts too late *or* ends too early, a fresh yfinance
+request is made and the cache is overwritten. Use ``refresh=True`` to force a fetch.
+
+Information cutoff
+------------------
+Yahoo Finance daily bars do not include a reliable point-in-time availability
+timestamp. For daily bars, this adapter sets ``released_at`` to the next
+business day after the observation timestamp. That is a conservative default
+for close-based daily forecasting and avoids treating a session close as known
+at the start of that same session. It is not an exchange-grade release calendar
+and should be revisited for intraday or contract-specific futures workflows.
+"""
+
+from __future__ import annotations
+
+import re
+from pathlib import Path
+from typing import Any, Literal
+
+import pandas as pd
+from aieng.forecasting.data.adapters.base import BaseAdapter
+from pydantic import BaseModel, ConfigDict, Field, field_validator, model_validator
+
+
+# Supported Yahoo Finance daily history fields.
+YFinanceField = Literal["Open", "High", "Low", "Close", "Adj Close", "Volume"]
+
+
+# Supported yfinance interval for this adapter.
+YFinanceInterval = Literal["1d"]
+
+
+_DEFAULT_FIELD: YFinanceField = "Adj Close"
+_DEFAULT_INTERVAL: YFinanceInterval = "1d"
+
+
+def _cache_stem(ticker: str, field: YFinanceField, interval: YFinanceInterval) -> str:
+    """Return a filesystem-safe cache stem for a ticker/field/interval combination."""
+    key = f"{ticker}_{field}_{interval}".lower()
+    sanitized = re.sub(r"[^a-z0-9]+", "_", key).strip("_")
+    if not sanitized:
+        raise ValueError("ticker and field produced an empty cache key")
+    return sanitized
+
+
+class YFinanceDailyConfig(BaseModel):
+    """Validated configuration for :class:`YFinanceDailyAdapter`."""
+
+    model_config = ConfigDict(frozen=True)
+
+    ticker: str = Field(min_length=1)
+    field: YFinanceField = _DEFAULT_FIELD
+    start: str | None = None
+    end: str | None = None
+    interval: YFinanceInterval = _DEFAULT_INTERVAL
+
+    @field_validator("ticker")
+    @classmethod
+    def ticker_must_not_be_blank(cls, value: str) -> str:
+        """Normalize ticker whitespace and reject blank values."""
+        stripped = value.strip()
+        if not stripped:
+            raise ValueError("ticker must not be blank")
+        return stripped
+
+    @model_validator(mode="after")
+    def end_must_be_after_start(self) -> "YFinanceDailyConfig":
+        """Validate the requested date window."""
+        if self.start is not None and self.end is not None:
+            start = pd.Timestamp(self.start)
+            end = pd.Timestamp(self.end)
+            if end <= start:
+                raise ValueError(f"end ({self.end!r}) must be after start ({self.start!r})")
+        return self
+
+
+class YFinanceDailyAdapter(BaseAdapter):
+    """Adapter that fetches a single Yahoo Finance daily ticker field.
+
+    Parameters
+    ----------
+    ticker : str
+        Yahoo Finance symbol, e.g. ``"^GSPC"``, ``"CL=F"``, or ``"XLE"``.
+    field : {"Open", "High", "Low", "Close", "Adj Close", "Volume"}
+        Daily history column to expose as canonical ``value``. Defaults to
+        ``"Adj Close"``.
+    start : str or None
+        Inclusive start date passed to yfinance and applied to cache reads.
+    end : str or None
+        Exclusive end date passed to yfinance and applied to cache reads.
+    cache_dir : str, Path, or None
+        Directory for parquet cache files. When ``None``, caching is disabled
+        and every ``fetch()`` call hits yfinance. Default: ``"data/yfinance"``.
+    refresh : bool
+        When ``True``, force a network fetch even if a cache file exists.
+    """
+
+    DEFAULT_CACHE_DIR = "data/yfinance"
+
+    def __init__(
+        self,
+        ticker: str,
+        *,
+        field: YFinanceField = _DEFAULT_FIELD,
+        start: str | None = None,
+        end: str | None = None,
+        cache_dir: str | Path | None = DEFAULT_CACHE_DIR,
+        refresh: bool = False,
+    ) -> None:
+        self._config = YFinanceDailyConfig(
+            ticker=ticker,
+            field=field,
+            start=start,
+            end=end,
+            interval=_DEFAULT_INTERVAL,
+        )
+        self._cache_dir = Path(cache_dir) if cache_dir is not None else None
+        self._refresh = refresh
+
+    @property
+    def ticker(self) -> str:
+        """Yahoo Finance ticker symbol."""
+        return self._config.ticker
+
+    @property
+    def field(self) -> YFinanceField:
+        """Yahoo Finance daily history field exposed as ``value``."""
+        return self._config.field
+
+    @property
+    def start(self) -> str | None:
+        """Inclusive start date for the requested window."""
+        return self._config.start
+
+    @property
+    def end(self) -> str | None:
+        """Exclusive end date for the requested window."""
+        return self._config.end
+
+    @property
+    def cache_path(self) -> Path | None:
+        """Full path to this adapter's parquet cache file, or ``None`` if disabled."""
+        if self._cache_dir is None:
+            return None
+        stem = _cache_stem(self._config.ticker, self._config.field, self._config.interval)
+        return self._cache_dir / f"{stem}.parquet"
+
+    def fetch(self) -> pd.DataFrame:
+        """Return the series in canonical format, using disk cache when available.
+
+        Returns
+        -------
+        pd.DataFrame
+            Columns: ``timestamp`` (datetime64[ns]), ``value`` (float64), and
+            ``released_at`` (datetime64[ns]). Rows are sorted ascending by
+            ``timestamp`` and filtered to the configured ``start`` / ``end``
+            window.
+
+        Raises
+        ------
+        RuntimeError
+            If yfinance cannot be imported, the request fails, or no rows are
+            available after normalization and date filtering.
+        ValueError
+            If the Yahoo response is missing the configured field.
+        """
+        cache_path = self.cache_path
+        if cache_path is not None and cache_path.exists() and not self._refresh:
+            cached = self._read_cache(cache_path)
+            if self._cache_covers_range(cached):
+                return self._apply_date_range(cached)
+
+        df = self._fetch_from_yfinance()
+
+        if cache_path is not None:
+            cache_path.parent.mkdir(parents=True, exist_ok=True)
+            df.to_parquet(cache_path, index=False)
+
+        return self._apply_date_range(df)
+
+    def _cache_covers_range(self, df: pd.DataFrame) -> bool:
+        """Return whether cached data fully covers the requested date range.
+
+        Both the start and end boundaries are checked. If either falls outside
+        the cached window we fall through to a live yfinance fetch so the caller
+        always receives the exact rows they asked for.
+
+        Start boundary: the cache is considered sufficient when it opens on or
+        before the first business day on or after the requested ``start``. This
+        handles non-trading days (weekends, public holidays) at the boundary
+        without accepting a cache that is genuinely missing earlier data. For
+        example, a ``start`` of ``"2005-01-01"`` (Saturday) is satisfied by a
+        cache that begins on ``"2005-01-03"`` (Monday), but a ``start`` of
+        ``"2024-01-02"`` (Tuesday) would *not* be satisfied by a cache that
+        begins on ``"2024-01-03"``.
+        """
+        if df.empty:
+            return False
+        if self._config.start is not None:
+            cache_start = df["timestamp"].min()
+            first_trading_day = pd.bdate_range(start=self._config.start, periods=1)[0].normalize()
+            if cache_start > first_trading_day:
+                return False
+        if self._config.end is not None:
+            cache_end = df["timestamp"].max()
+            # end is exclusive, so the cache must reach at least the calendar day
+            # before it. Non-trading days directly before ``end`` are not
+            # tolerated: a cache ending on Friday fails for a Monday ``end``.
+            if cache_end < pd.Timestamp(self._config.end) - pd.Timedelta(days=1):
+                return False
+        return True
+
+    def _fetch_from_yfinance(self) -> pd.DataFrame:
+        """Fetch and normalize a daily history frame from yfinance."""
+        try:
+            import yfinance as yf  # noqa: PLC0415
+        except ImportError as exc:
+            raise RuntimeError("yfinance is not installed. Run `uv add yfinance` to install it.") from exc
+
+        try:
+            ticker = yf.Ticker(self._config.ticker)
+            raw: pd.DataFrame = ticker.history(
+                start=self._config.start,
+                end=self._config.end,
+                interval=self._config.interval,
+                auto_adjust=False,
+            )
+        except Exception as exc:
+            raise RuntimeError(f"Failed to fetch yfinance ticker {self._config.ticker!r}: {exc}") from exc
+
+        if raw.empty:
+            raise RuntimeError(
+                f"Yahoo Finance returned no rows for ticker {self._config.ticker!r} "
+                f"between {self._config.start!r} and {self._config.end!r}."
+            )
+
+        return self._normalize_history(raw)
+
+    def _normalize_history(self, raw: pd.DataFrame) -> pd.DataFrame:
+        """Normalize a yfinance history frame to canonical columns."""
+        if self._config.field not in raw.columns:
+            raise ValueError(
+                f"Yahoo Finance response for {self._config.ticker!r} is missing field "
+                f"{self._config.field!r}. Available columns: {raw.columns.tolist()}"
+            )
+
+        df = raw.reset_index()
+        timestamp_col = self._find_timestamp_column(df)
+        result = pd.DataFrame(
+            {
+                "timestamp": self._normalize_timestamp(df[timestamp_col]),
+                "value": pd.to_numeric(df[self._config.field], errors="coerce"),
+            }
+        )
+        result["released_at"] = result["timestamp"] + pd.offsets.BDay(1)
+        result = result.dropna(subset=["timestamp", "value"])
+        result = result.sort_values("timestamp").reset_index(drop=True)
+
+        if result.empty:
+            raise RuntimeError(
+                f"Yahoo Finance returned no usable {self._config.field!r} values for ticker {self._config.ticker!r}."
+            )
+
+        return result[["timestamp", "value", "released_at"]]
+
+    def _apply_date_range(self, df: pd.DataFrame) -> pd.DataFrame:
+        """Apply the configured ``start`` / ``end`` window to cached or fetched data."""
+        result = df.copy()
+        if self._config.start is not None:
+            result = result[result["timestamp"] >= pd.Timestamp(self._config.start)]
+        if self._config.end is not None:
+            result = result[result["timestamp"] < pd.Timestamp(self._config.end)]
+        result = result.reset_index(drop=True)
+        if result.empty:
+            raise RuntimeError(
+                f"No rows left after applying date range start={self._config.start!r} "
+                f"end={self._config.end!r} for ticker {self._config.ticker!r}."
+            )
+        return result
+
+    @staticmethod
+    def _find_timestamp_column(df: pd.DataFrame) -> str:
+        """Return the yfinance date/datetime column created by ``reset_index``."""
+        for candidate in ("Date", "Datetime"):
+            if candidate in df.columns:
+                return candidate
+        return str(df.columns[0])
+
+    @staticmethod
+    def _normalize_timestamp(values: Any) -> pd.Series:
+        """Return timezone-naive pandas timestamps."""
+        timestamps = pd.to_datetime(values, errors="coerce")
+        if isinstance(timestamps.dtype, pd.DatetimeTZDtype):
+            timestamps = timestamps.dt.tz_localize(None)
+        return timestamps.astype("datetime64[ns]")
+
+    @staticmethod
+    def _read_cache(cache_path: Path) -> pd.DataFrame:
+        """Read a cached parquet and normalize dtypes defensively."""
+        df = pd.read_parquet(cache_path)
+        missing = {"timestamp", "value", "released_at"} - set(df.columns)
+        if missing:
+            raise ValueError(f"Cached yfinance file {cache_path} is missing column(s): {sorted(missing)}")
+        result = pd.DataFrame(
+            {
+                "timestamp": YFinanceDailyAdapter._normalize_timestamp(df["timestamp"]),
+                "value": pd.to_numeric(df["value"], errors="coerce"),
+                "released_at": YFinanceDailyAdapter._normalize_timestamp(df["released_at"]),
+            }
+        )
+        result = result.dropna(subset=["timestamp", "value", "released_at"])
+        return result.sort_values("timestamp").reset_index(drop=True)
+
+    def __repr__(self) -> str:
+        """Return a short representation of this adapter."""
+        cache = self._cache_dir if self._cache_dir is not None else "disabled"
+        return (
+            f"YFinanceDailyAdapter(ticker={self._config.ticker!r}, field={self._config.field!r}, cache_dir={cache!r})"
+        )
+
+
+__all__ = ["YFinanceDailyAdapter", "YFinanceDailyConfig", "YFinanceField", "YFinanceInterval"]
+```
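The start-boundary rule described in `_cache_covers_range` can be illustrated without pandas. This is a minimal sketch, not the adapter's code: `first_weekday_on_or_after` and `cache_covers_start` are hypothetical helpers, and the weekend roll-forward is an assumed stand-in for `pd.bdate_range(start=..., periods=1)`, which also skips weekends only.

```python
from datetime import date, timedelta


def first_weekday_on_or_after(day: date) -> date:
    """Roll a Saturday or Sunday forward to the following Monday."""
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day += timedelta(days=1)
    return day


def cache_covers_start(cache_start: date, requested_start: date) -> bool:
    """Hypothetical helper: True when the cache opens early enough."""
    return cache_start <= first_weekday_on_or_after(requested_start)


# The two cases from the docstring: a Saturday start and a Tuesday start.
weekend_case = cache_covers_start(date(2005, 1, 3), date(2005, 1, 1))
weekday_case = cache_covers_start(date(2024, 1, 3), date(2024, 1, 2))
print(weekend_case, weekday_case)
```

The Saturday request is covered by a cache that opens on the Monday, while the Tuesday request is not covered by a cache that opens a day late.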
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__context.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__context.py.md
new file mode 100644
index 0000000..04cd9b7
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__context.py.md
@@ -0,0 +1,155 @@
+# Source: aieng-forecasting/aieng/forecasting/data/context.py
+
+kind: python
+
+```python
+"""ForecastContext: the predictor-facing, cutoff-scoped data view."""
+
+from __future__ import annotations
+
+from datetime import datetime
+from typing import TYPE_CHECKING
+
+import pandas as pd
+from aieng.forecasting.data.cutoff import CutoffEnforcer
+from aieng.forecasting.data.models import SeriesMetadata
+from aieng.forecasting.data.store import SeriesStore
+
+
+if TYPE_CHECKING:
+    from aieng.forecasting.documents.models import ExtractedDocument
+    from aieng.forecasting.documents.store import DocumentStore
+
+
+class ForecastContext:
+    """Read-only, cutoff-scoped data view passed to predictors.
+
+    ``ForecastContext`` is the object predictors receive during backtesting or
+    live evaluation. It bakes in an ``as_of`` date so that ``get_series()``
+    always enforces the information cutoff automatically — a predictor cannot
+    accidentally access data that was not available at forecast time.
+
+    The harness creates a ``ForecastContext`` for each backtest origin via
+    ``DataService.context(as_of)``. In live mode the same factory is called
+    with the current date. The predictor interface is identical in both modes.
+
+    Intended predictor usage
+    ------------------------
+    >>> def predict(task: ForecastingTask, context: ForecastContext) -> Prediction:
+    ...     series = context.get_series(task.target_series_id)
+    ...     # series contains only observations available as of context.as_of
+    ...
+    ...     # Optionally retrieve cutoff-filtered documents:
+    ...     docs = context.get_documents("cfpr")
+
+    Parameters
+    ----------
+    store : SeriesStore
+        The underlying series store (owned by the ``DataService``).
+    as_of : datetime
+        The information cutoff. All ``get_series`` queries are filtered to
+        data available on or before this date.
+    doc_store : DocumentStore or None
+        Optional document store for report integration. When ``None``,
+        ``get_documents()`` returns an empty list.
+    """
+
+    def __init__(
+        self,
+        store: SeriesStore,
+        as_of: datetime,
+        doc_store: DocumentStore | None = None,
+    ) -> None:
+        self._store = store
+        self._as_of = as_of
+        self._cutoff = CutoffEnforcer()
+        self._doc_store = doc_store
+
+    @property
+    def as_of(self) -> datetime:
+        """The information cutoff date for this context."""
+        return self._as_of
+
+    def get_series(self, series_id: str) -> pd.DataFrame:
+        """Return a series filtered to observations available as of the cutoff.
+
+        Parameters
+        ----------
+        series_id : str
+            The series to retrieve.
+
+        Returns
+        -------
+        pd.DataFrame
+            DataFrame with columns ``timestamp`` and ``value`` (and optionally
+            ``released_at``), containing only rows available as of
+            ``self.as_of``, sorted ascending by ``timestamp``.
+
+        Raises
+        ------
+        KeyError
+            If ``series_id`` is not registered.
+        """
+        raw = self._store.get(series_id)
+        return self._cutoff.filter(raw, self._as_of)
+
+    def get_metadata(self, series_id: str) -> SeriesMetadata:
+        """Return metadata for a registered series.
+
+        Parameters
+        ----------
+        series_id : str
+            The series identifier.
+
+        Returns
+        -------
+        SeriesMetadata
+            Metadata for the series.
+
+        Raises
+        ------
+        KeyError
+            If ``series_id`` is not registered.
+        """
+        return self._store.get_metadata(series_id)
+
+    @property
+    def series_ids(self) -> list[str]:
+        """Return a sorted list of registered series identifiers."""
+        return self._store.series_ids
+
+    # ------------------------------------------------------------------
+    # Document access
+    # ------------------------------------------------------------------
+
+    def get_documents(self, source: str) -> list[ExtractedDocument]:
+        """Return cutoff-filtered documents for ``source``.
+
+        Only documents whose ``publication_date`` is on or before
+        ``self.as_of`` are returned.  Returns an empty list when no
+        ``DocumentStore`` is attached.
+
+        Parameters
+        ----------
+        source : str
+            Source key (e.g. ``"cfpr"``).
+
+        Returns
+        -------
+        list[ExtractedDocument]
+            Cutoff-filtered documents in chronological order.
+        """
+        if self._doc_store is None:
+            return []
+        return self._doc_store.list_docs(source, as_of=self._as_of)
+
+    @property
+    def document_sources(self) -> list[str]:
+        """Return sorted list of known document source keys.
+
+        Returns an empty list when no ``DocumentStore`` is attached.
+        """
+        if self._doc_store is None:
+            return []
+        return self._doc_store.sources
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__cutoff.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__cutoff.py.md
new file mode 100644
index 0000000..bbbc8a1
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__cutoff.py.md
@@ -0,0 +1,73 @@
+# Source: aieng-forecasting/aieng/forecasting/data/cutoff.py
+
+kind: python
+
+```python
+"""Information cutoff enforcement."""
+
+from datetime import datetime
+
+import pandas as pd
+
+
+class CutoffEnforcer:
+    """Enforces information cutoff discipline on series data.
+
+    Ensures that no model or agent receives data that would not have been
+    available at the time a forecast was issued. This is the mechanism that
+    makes backtesting honest: a predictor running as-of 2022-01-01 sees
+    exactly the data that existed on that date, nothing more.
+
+    **Cutoff logic:**
+
+    - If the DataFrame includes a ``released_at`` column, rows where
+      ``released_at > as_of`` are excluded.
+    - If ``released_at`` is absent or null for a row, ``timestamp`` is used
+      as the fallback. This is correct for custom datasets where data is
+      available at observation time, but introduces a slight optimistic bias
+      for official datasets that have publication lags (e.g. StatCan CPI is
+      published ~3 weeks after the reference month).
+
+    Notes
+    -----
+    This class is stateless — it is a pure function wrapped in a class for
+    testability and future extension (e.g. injecting release calendars).
+    """
+
+    def filter(self, df: pd.DataFrame, as_of: datetime) -> pd.DataFrame:
+        """Return only rows available as of the given date.
+
+        Parameters
+        ----------
+        df : pd.DataFrame
+            Series DataFrame with columns ``timestamp`` and ``value``.
+            Optionally includes ``released_at``.
+        as_of : datetime
+            The information cutoff point. Rows with an effective release date
+            after this point are excluded.
+
+        Returns
+        -------
+        pd.DataFrame
+            Filtered copy of ``df`` containing only rows available as of
+            ``as_of``, sorted ascending by ``timestamp``.
+
+        Raises
+        ------
+        ValueError
+            If ``df`` does not contain a ``timestamp`` column.
+        """
+        if "timestamp" not in df.columns:
+            raise ValueError("DataFrame must contain a 'timestamp' column.")
+
+        as_of_ts = pd.Timestamp(as_of)
+
+        if "released_at" in df.columns:
+            # Use released_at when available, fall back to timestamp for null values.
+            effective_release = df["released_at"].fillna(df["timestamp"])
+            mask = pd.to_datetime(effective_release) <= as_of_ts
+        else:
+            mask = pd.to_datetime(df["timestamp"]) <= as_of_ts
+
+        return df.loc[mask].copy().sort_values("timestamp").reset_index(drop=True)
+```
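The cutoff rule above (use `released_at` when present, otherwise fall back to `timestamp`) can be sketched with plain dictionaries. This is an illustration under stated assumptions rather than the library's implementation: `visible_rows` is a hypothetical helper, and the sample observations are made up for the example.

```python
from datetime import datetime


def visible_rows(rows: list[dict], as_of: datetime) -> list[dict]:
    """Hypothetical stand-in for the cutoff filter, using plain dicts."""
    kept = []
    for row in rows:
        # Prefer released_at; fall back to timestamp when it is missing or None.
        effective_release = row.get("released_at") or row["timestamp"]
        if effective_release <= as_of:
            kept.append(row)
    return sorted(kept, key=lambda row: row["timestamp"])


# Made-up sample data: one row has no release date, one is released late.
rows = [
    {"timestamp": datetime(2021, 12, 1), "value": 102.0},
    {"timestamp": datetime(2021, 10, 1), "value": 100.0, "released_at": datetime(2021, 10, 20)},
    {"timestamp": datetime(2021, 11, 1), "value": 101.0, "released_at": datetime(2021, 12, 15)},
]
visible = visible_rows(rows, as_of=datetime(2021, 12, 1))
print([row["value"] for row in visible])
```

The November observation is hidden because it was released after the cutoff, even though its reference period precedes it. The December row has no release date, so it is judged by its timestamp and stays visible, which is the optimistic fallback the docstring warns about.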
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__models.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__models.py.md
new file mode 100644
index 0000000..c61a853
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__models.py.md
@@ -0,0 +1,86 @@
+# Source: aieng-forecasting/aieng/forecasting/data/models.py
+
+kind: python
+
+```python
+"""Pydantic models for the data service layer."""
+
+from datetime import datetime
+
+from pydantic import BaseModel, Field, model_validator
+
+
+class SeriesRecord(BaseModel):
+    """A single timestamped observation of a series.
+
+    Parameters
+    ----------
+    timestamp : datetime
+        The observation time (when the measurement was taken / the reference period).
+    value : float
+        The observed quantity.
+    released_at : datetime or None
+        When this data point became publicly available. If None, the
+        CutoffEnforcer falls back to ``timestamp``. For official datasets with
+        known release lags (e.g. StatCan CPI published ~3 weeks after the
+        reference month), this should be set explicitly to ensure backtests
+        respect information cutoff discipline.
+    """
+
+    timestamp: datetime
+    value: float
+    released_at: datetime | None = Field(
+        default=None,
+        description="Publication date; None means available at observation time.",
+    )
+
+    @model_validator(mode="after")
+    def released_at_not_before_timestamp(self) -> "SeriesRecord":
+        """Validate that released_at is not before timestamp.
+
+        Returns
+        -------
+        SeriesRecord
+            The validated instance.
+
+        Raises
+        ------
+        ValueError
+            If released_at is before timestamp.
+        """
+        if self.released_at is not None and self.released_at < self.timestamp:
+            raise ValueError(f"released_at ({self.released_at}) cannot be before timestamp ({self.timestamp})")
+        return self
+
+
+class SeriesMetadata(BaseModel):
+    """Descriptive metadata for a registered series.
+
+    Parameters
+    ----------
+    series_id : str
+        Unique identifier used as the key in SeriesStore.
+    description : str
+        Human-readable description of what the series measures.
+    source : str
+        Data source (e.g. "StatCan", "FRED", "yfinance").
+    units : str
+        Unit of measure (e.g. "Index 2002=100", "Percentage change").
+    frequency : str
+        Pandas offset alias for the series frequency (e.g. "MS" for month-start,
+        "h" for hourly). Used as a hint for gap-filling at the Darts conversion
+        boundary; the SeriesStore itself does not enforce regularity.
+    table_id : str or None
+        Source table or dataset identifier, if applicable.
+    """
+
+    series_id: str
+    description: str
+    source: str
+    units: str
+    frequency: str = Field(description="Pandas offset alias, e.g. 'MS', 'h', 'D'.")
+    table_id: str | None = Field(
+        default=None,
+        description="Source table or dataset identifier.",
+    )
+```
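The ordering constraint on `SeriesRecord` (a release date may not precede the observation it describes) can be shown with a standard-library dataclass. This is a sketch only: `Observation` is a hypothetical stand-in that uses `__post_init__` where the real model uses a Pydantic validator, and the dates are invented.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Observation:
    """Hypothetical dataclass stand-in for SeriesRecord."""

    timestamp: datetime
    value: float
    released_at: Optional[datetime] = None

    def __post_init__(self) -> None:
        # Same rule as the validator: release cannot precede observation.
        if self.released_at is not None and self.released_at < self.timestamp:
            raise ValueError("released_at cannot be before timestamp")


# A release after the reference period is accepted.
ok = Observation(datetime(2021, 11, 1), 101.0, released_at=datetime(2021, 11, 17))

# A release before the reference period is rejected.
try:
    Observation(datetime(2021, 11, 1), 101.0, released_at=datetime(2021, 10, 31))
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```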
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__service.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__service.py.md
new file mode 100644
index 0000000..f8e25e3
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__service.py.md
@@ -0,0 +1,204 @@
+# Source: aieng-forecasting/aieng/forecasting/data/service.py
+
+kind: python
+
+```python
+"""DataService: registration and management of time series data."""
+
+from __future__ import annotations
+
+from datetime import datetime
+from typing import TYPE_CHECKING
+
+import pandas as pd
+from aieng.forecasting.data.adapters.base import BaseAdapter
+from aieng.forecasting.data.context import ForecastContext
+from aieng.forecasting.data.cutoff import CutoffEnforcer
+from aieng.forecasting.data.models import SeriesMetadata
+from aieng.forecasting.data.store import SeriesStore
+
+
+if TYPE_CHECKING:
+    from aieng.forecasting.documents.store import DocumentStore
+
+
+class DataService:
+    """Registration and management layer for time series data.
+
+    ``DataService`` owns the ``SeriesStore`` and exposes two distinct
+    responsibilities:
+
+    1. **Registration** — ``register()`` fetches data via an adapter and
+       stores it in memory. Called by setup scripts (e.g.
+       ``scripts/fetch_cpi.py``) once at startup; no further network access
+       occurs after that.
+    2. **Context creation** — ``context(as_of)`` creates a
+       :class:`ForecastContext` scoped to a specific date. This is what the
+       backtesting harness (and live evaluation harness) passes to predictors.
+       Predictors should never receive a raw ``DataService``; they should
+       receive a ``ForecastContext``.
+
+    **Notebooks and scripts** may also call ``get_series`` directly for
+    ad-hoc exploration — this is the same cutoff-filtered query that
+    ``ForecastContext`` wraps, exposed here for convenience.
+
+    Examples
+    --------
+    >>> from aieng.forecasting.data import DataService, SeriesMetadata
+    >>> from aieng.forecasting.data.adapters import StatCanAdapter
+    >>> svc = DataService()
+    >>> adapter = StatCanAdapter(
+    ...     table_id="18-10-0004-11",
+    ...     member_filter={"GEO": "Canada", "Products and product groups": "All-items"},
+    ... )
+    >>> meta = SeriesMetadata(
+    ...     series_id="cpi_all_items_canada",
+    ...     description="CPI All-items, Canada (2002=100)",
+    ...     source="StatCan",
+    ...     units="Index 2002=100",
+    ...     frequency="MS",
+    ...     table_id="18-10-0004-11",
+    ... )
+    >>> svc.register("cpi_all_items_canada", adapter, meta)
+    >>> df = svc.get_series("cpi_all_items_canada", as_of=datetime(2023, 1, 1))
+    """
+
+    def __init__(self, doc_store: DocumentStore | None = None) -> None:
+        self._store = SeriesStore()
+        self._cutoff = CutoffEnforcer()
+        self._doc_store = doc_store
+
+    def register(
+        self,
+        series_id: str,
+        adapter: BaseAdapter,
+        metadata: SeriesMetadata,
+    ) -> None:
+        """Fetch data via an adapter and register the series in the store.
+
+        Parameters
+        ----------
+        series_id : str
+            Unique identifier for the series. Used as the lookup key in
+            subsequent ``get_series`` calls.
+        adapter : BaseAdapter
+            Adapter responsible for fetching the data. ``adapter.fetch()`` is
+            called exactly once; the result is stored in memory.
+        metadata : SeriesMetadata
+            Descriptive metadata (units, source, frequency, etc.).
+
+        Raises
+        ------
+        RuntimeError
+            If the adapter fails to fetch data.
+        ValueError
+            If the fetched DataFrame is missing required columns.
+        """
+        df = adapter.fetch()
+        self._store.put(series_id, df, metadata)
+
+    def get_series(self, series_id: str, as_of: datetime) -> pd.DataFrame:
+        """Return a series filtered to observations available as of ``as_of``.
+
+        The ``CutoffEnforcer`` ensures that only data published on or before
+        ``as_of`` is returned. This guarantees that backtests and live
+        forecasts share the same information discipline.
+
+        Parameters
+        ----------
+        series_id : str
+            The series to retrieve.
+        as_of : datetime
+            Information cutoff point. Observations released after this date
+            are excluded.
+
+        Returns
+        -------
+        pd.DataFrame
+            DataFrame with columns ``timestamp`` and ``value`` (and optionally
+            ``released_at``), containing only rows available as of ``as_of``,
+            sorted ascending by ``timestamp``.
+
+        Raises
+        ------
+        KeyError
+            If ``series_id`` is not registered.
+        """
+        raw = self._store.get(series_id)
+        return self._cutoff.filter(raw, as_of)
+
+    def context(self, as_of: datetime) -> ForecastContext:
+        """Create a :class:`ForecastContext` scoped to the given as-of date.
+
+        This is the factory method used by the backtesting harness (and live
+        evaluation harness) to create the object passed to predictors. The
+        returned context bakes in ``as_of`` so that ``get_series()`` always
+        enforces the information cutoff automatically.
+
+        If a ``DocumentStore`` was provided at construction, it is wired into
+        every context so predictors can call ``context.get_documents()``.
+
+        Parameters
+        ----------
+        as_of : datetime
+            The information cutoff date.
+
+        Returns
+        -------
+        ForecastContext
+            A read-only, cutoff-scoped view of the series store.
+        """
+        return ForecastContext(self._store, as_of, doc_store=self._doc_store)
+
+    def get_metadata(self, series_id: str) -> SeriesMetadata:
+        """Return metadata for a registered series.
+
+        Parameters
+        ----------
+        series_id : str
+            The series identifier.
+
+        Returns
+        -------
+        SeriesMetadata
+            Metadata for the series.
+
+        Raises
+        ------
+        KeyError
+            If ``series_id`` is not registered.
+        """
+        return self._store.get_metadata(series_id)
+
+    @property
+    def series_ids(self) -> list[str]:
+        """Return a sorted list of registered series identifiers."""
+        return self._store.series_ids
+
+    def summary(self) -> pd.DataFrame:
+        """Return a summary table of all registered series.
+
+        Returns
+        -------
+        pd.DataFrame
+            One row per series with columns: ``series_id``, ``description``,
+            ``source``, ``units``, ``frequency``, ``n_obs``, ``start``, ``end``.
+        """
+        rows = []
+        for sid in self._store.series_ids:
+            df = self._store.get(sid)
+            meta = self._store.get_metadata(sid)
+            rows.append(
+                {
+                    "series_id": sid,
+                    "description": meta.description,
+                    "source": meta.source,
+                    "units": meta.units,
+                    "frequency": meta.frequency,
+                    "n_obs": len(df),
+                    "start": df["timestamp"].min() if len(df) > 0 else None,
+                    "end": df["timestamp"].max() if len(df) > 0 else None,
+                }
+            )
+        return pd.DataFrame(rows)
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__store.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__store.py.md
new file mode 100644
index 0000000..3470703
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__store.py.md
@@ -0,0 +1,117 @@
+# Source: aieng-forecasting/aieng/forecasting/data/store.py
+
+kind: python
+
+```python
+"""In-memory series store."""
+
+import pandas as pd
+from aieng.forecasting.data.models import SeriesMetadata
+
+
+class SeriesStore:
+    """In-memory store for historical time series.
+
+    Stores each series as a ``pandas.DataFrame`` with columns ``timestamp``,
+    ``value``, and optionally ``released_at``. Series are keyed by
+    ``series_id``; there is no ``series_id`` column in the stored DataFrame.
+
+    This class is intentionally thin — it is a dict with type-checked access
+    and basic introspection helpers. All filtering (cutoff enforcement) happens
+    in ``CutoffEnforcer`` before data reaches callers.
+
+    Notes
+    -----
+    The store makes no guarantees about temporal regularity. Series may be
+    irregularly spaced, sparse, or contain gaps. Gap-filling to a regular
+    frequency is a predictor-level concern performed at the Darts conversion
+    boundary, not here.
+    """
+
+    def __init__(self) -> None:
+        self._data: dict[str, pd.DataFrame] = {}
+        self._metadata: dict[str, SeriesMetadata] = {}
+
+    def put(self, series_id: str, df: pd.DataFrame, metadata: SeriesMetadata) -> None:
+        """Store a series and its metadata.
+
+        Parameters
+        ----------
+        series_id : str
+            Unique identifier for the series. Used as the lookup key.
+        df : pd.DataFrame
+            DataFrame with columns ``timestamp`` (datetime64) and ``value``
+            (float64). Optionally includes ``released_at`` (datetime64).
+            Rows should be sorted ascending by ``timestamp``.
+        metadata : SeriesMetadata
+            Descriptive metadata for the series.
+
+        Raises
+        ------
+        ValueError
+            If ``df`` is missing required columns ``timestamp`` or ``value``.
+        """
+        required = {"timestamp", "value"}
+        missing = required - set(df.columns)
+        if missing:
+            raise ValueError(f"DataFrame for series {series_id!r} is missing required columns: {missing}")
+        self._data[series_id] = df.copy()
+        self._metadata[series_id] = metadata
+
+    def get(self, series_id: str) -> pd.DataFrame:
+        """Return the full (unfiltered) DataFrame for a series.
+
+        Parameters
+        ----------
+        series_id : str
+            The series identifier.
+
+        Returns
+        -------
+        pd.DataFrame
+            A copy of the stored DataFrame.
+
+        Raises
+        ------
+        KeyError
+            If ``series_id`` is not registered.
+        """
+        if series_id not in self._data:
+            raise KeyError(f"Series {series_id!r} not found. Registered series: {self.series_ids}")
+        return self._data[series_id].copy()
+
+    def get_metadata(self, series_id: str) -> SeriesMetadata:
+        """Return metadata for a series.
+
+        Parameters
+        ----------
+        series_id : str
+            The series identifier.
+
+        Returns
+        -------
+        SeriesMetadata
+            The metadata for the series.
+
+        Raises
+        ------
+        KeyError
+            If ``series_id`` is not registered.
+        """
+        if series_id not in self._metadata:
+            raise KeyError(f"Series {series_id!r} not found. Registered series: {self.series_ids}")
+        return self._metadata[series_id]
+
+    @property
+    def series_ids(self) -> list[str]:
+        """Return a sorted list of registered series identifiers."""
+        return sorted(self._data.keys())
+
+    def __contains__(self, series_id: str) -> bool:
+        """Return True if series_id is registered."""
+        return series_id in self._data
+
+    def __len__(self) -> int:
+        """Return the number of registered series."""
+        return len(self._data)
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents____init__.py.md
new file mode 100644
index 0000000..ed56429
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents____init__.py.md
@@ -0,0 +1,41 @@
+# Source: aieng-forecasting/aieng/forecasting/documents/__init__.py
+
+kind: python
+
+```python
+"""Document extraction: source-agnostic PDF -> full text + cutoff metadata.
+
+This sub-package turns published document PDFs (e.g. Canada's Food Price Report,
+Bank of Canada Monetary Policy Report) into minimal, cutoff-stamped
+:class:`ExtractedDocument` artifacts -- full text plus a ``publication_date``
+and size counts.  It intentionally models no source-specific structure; the
+cutoff-aware ``DocumentStore`` exported here consumes these artifacts for
+LLM-P report integration.
+
+The extractor depends on the optional ``documents`` extra (``pymupdf4llm``) and
+imports it lazily, so importing this package is cheap.
+"""
+
+from aieng.forecasting.documents.extract import extract_document
+from aieng.forecasting.documents.models import DocumentMeta, ExtractedDocument, estimate_tokens
+from aieng.forecasting.documents.pdf_upload import (
+    MIME_PDF,
+    inject_pdf_parts,
+    pdf_bytes_to_content_part,
+    pdf_to_content_part,
+)
+from aieng.forecasting.documents.store import DocumentStore
+
+
+__all__ = [
+    "DocumentMeta",
+    "DocumentStore",
+    "ExtractedDocument",
+    "MIME_PDF",
+    "estimate_tokens",
+    "extract_document",
+    "inject_pdf_parts",
+    "pdf_bytes_to_content_part",
+    "pdf_to_content_part",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents__extract.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents__extract.py.md
new file mode 100644
index 0000000..2aa7985
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents__extract.py.md
@@ -0,0 +1,92 @@
+# Source: aieng-forecasting/aieng/forecasting/documents/extract.py
+
+kind: python
+
+```python
+"""Document text extraction.
+
+A single, source-agnostic function turns any born-digital PDF into full text
+plus size counts.  The bootcamp reference pipeline targets born-digital report
+PDFs (Canada's Food Price Report, Bank of Canada Monetary Policy Report), where
+a lightweight, deterministic, CPU-only parser captures the text well.
+
+We use the classic ``pymupdf4llm`` engine rather than its OCR layout engine so
+extraction is deterministic and reproducible for honest backtests.  No section
+or heading structure is reconstructed -- callers get the whole document text.
+
+``pymupdf4llm`` is an optional dependency (the ``documents`` extra); it is
+imported lazily so importing this module never requires the package.
+"""
+
+from __future__ import annotations
+
+from datetime import datetime, timezone
+from pathlib import Path
+
+from aieng.forecasting.documents.models import DocumentMeta, ExtractedDocument, estimate_tokens
+
+
+def extract_document(
+    pdf_path: Path, meta: DocumentMeta, *, dpi: int = 150, min_chars_per_page: int = 20
+) -> ExtractedDocument:
+    """Extract full text and size counts from a born-digital PDF.
+
+    Parameters
+    ----------
+    pdf_path : Path
+        Path to the source PDF.
+    meta : DocumentMeta
+        Provenance/cutoff metadata, carried through to the result.  The
+        ``publication_date`` is supplied by the caller (from the committed
+        manifest), not parsed from the PDF.
+    dpi : int
+        Render DPI passed to the engine.  Affects only any rasterization the
+        engine performs internally; text extraction is unaffected.
+    min_chars_per_page : int
+        Fail loudly if the extracted text averages fewer characters per page
+        than this -- a near-empty result signals a scanned/encrypted/image-only
+        PDF that this text-only path cannot handle. Set ``0`` to disable.
+
+    Returns
+    -------
+    ExtractedDocument
+        Full text plus page count, character count, and an approximate token
+        count.
+
+    Raises
+    ------
+    FileNotFoundError
+        If ``pdf_path`` does not exist.
+    ValueError
+        If extraction yields implausibly little text (see ``min_chars_per_page``).
+    """
+    if not pdf_path.exists():
+        raise FileNotFoundError(f"PDF not found: {pdf_path}")
+
+    # Lazy import: the ``documents`` optional dependency need not be installed
+    # to import this module (only to actually run extraction).
+    from pymupdf4llm.helpers.pymupdf_rag import to_markdown  # noqa: PLC0415
+
+    # ``table_strategy=None`` disables table detection: the detector trips an
+    # upstream empty-cell ValueError on several real report PDFs and is not needed
+    # for whole-document text extraction.  ``page_chunks=True`` yields one entry per
+    # page, giving us the page count without a separately-typed pymupdf call.
+    chunks = to_markdown(str(pdf_path), page_chunks=True, table_strategy=None, dpi=dpi, show_progress=False)
+    page_count = len(chunks)
+    text = "\n\n".join(str(chunk.get("text", "")) for chunk in chunks).strip()
+
+    n_chars = len(text)
+    if min_chars_per_page > 0 and n_chars < min_chars_per_page * max(page_count, 1):
+        raise ValueError(
+            f"Extracted only {n_chars} chars from {page_count} page(s) of {pdf_path.name}; "
+            "likely a scanned/encrypted/image-only PDF that the text-only extractor cannot read.",
+        )
+    return ExtractedDocument(
+        meta=meta,
+        text=text,
+        page_count=page_count,
+        n_chars=n_chars,
+        est_tokens=estimate_tokens(n_chars),
+        extracted_at=datetime.now(tz=timezone.utc).replace(tzinfo=None),
+    )
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents__models.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents__models.py.md
new file mode 100644
index 0000000..b5d1f9c
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents__models.py.md
@@ -0,0 +1,108 @@
+# Source: aieng-forecasting/aieng/forecasting/documents/models.py
+
+kind: python
+
+```python
+"""Pydantic models for extracted documents.
+
+A document here is intentionally minimal: full text plus the metadata needed to
+use it honestly in a backtest.  We deliberately do *not* model sections,
+segments, or any source-specific structure -- different report families (e.g.
+Canada's Food Price Report vs. Bank of Canada Monetary Policy Report) have
+nothing in common structurally, and the planned LLM-P report formats consume the
+whole document at its single publication date rather than hand-picked sections.
+
+The field that matters most for honest backtesting is
+:attr:`DocumentMeta.publication_date`.  The cutoff-aware ``DocumentStore``
+filters documents with ``publication_date <= as_of`` using the same
+information-cutoff discipline that ``CutoffEnforcer`` (see
+:mod:`aieng.forecasting.data.cutoff`) applies to numeric series.
+"""
+
+from __future__ import annotations
+
+from datetime import date, datetime
+
+from pydantic import BaseModel, Field
+
+
+class DocumentMeta(BaseModel):
+    """Provenance and cutoff metadata for a single document.
+
+    Parameters
+    ----------
+    source : str
+        Short source key, e.g. ``"cfpr"`` (Canada's Food Price Report) or
+        ``"boc"`` (Bank of Canada Monetary Policy Report).
+    doc_id : str
+        Stable per-document identifier, unique within ``source`` (e.g.
+        ``"2026_en"``).  Used as the cache filename stem.
+    publication_date : date
+        The date the document became publicly available.  This is the cutoff
+        key: a forecast issued before this date must not see this document.
+    title : str or None
+        Document title, if known.
+    lang : str
+        Two-letter language code, e.g. ``"en"``.
+    """
+
+    source: str
+    doc_id: str
+    publication_date: date = Field(description="Public release date; the cutoff key for honest backtests.")
+    title: str | None = None
+    lang: str = "en"
+
+
+def estimate_tokens(n_chars: int) -> int:
+    """Roughly estimate token count from character count.
+
+    Uses the common ``~4 chars/token`` rule of thumb.  This is a deliberately
+    crude, model-agnostic ballpark for context-budget planning -- not an exact
+    count for any specific tokenizer.
+
+    Parameters
+    ----------
+    n_chars : int
+        Number of characters.
+
+    Returns
+    -------
+    int
+        Approximate token count.
+    """
+    return (n_chars + 3) // 4
+
+
+class ExtractedDocument(BaseModel):
+    """The full-text result of extracting one document.
+
+    Parameters
+    ----------
+    meta : DocumentMeta
+        Provenance and cutoff metadata.
+    text : str
+        The full extracted text (markdown).
+    page_count : int
+        Number of pages in the source document.
+    n_chars : int
+        Character count of ``text`` (context-cost signal).
+    est_tokens : int
+        Approximate token count (``~n_chars / 4``); see :func:`estimate_tokens`.
+    extracted_at : datetime
+        UTC timestamp when extraction ran.
+    pdf_path : str or None
+        Local filesystem path to the source PDF, resolved at load time by
+        :class:`~aieng.forecasting.documents.store.DocumentStore` for native
+        document ingestion.  Runtime-only and machine-specific — it is *not*
+        part of the persisted artifact contract; serialized artifacts leave it
+        ``None``.
+    """
+
+    meta: DocumentMeta
+    text: str
+    page_count: int = Field(ge=0)
+    n_chars: int = Field(ge=0)
+    est_tokens: int = Field(ge=0)
+    extracted_at: datetime
+    pdf_path: str | None = None
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents__pdf_upload.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents__pdf_upload.py.md
new file mode 100644
index 0000000..d88e6c0
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents__pdf_upload.py.md
@@ -0,0 +1,208 @@
+# Source: aieng-forecasting/aieng/forecasting/documents/pdf_upload.py
+
+kind: python
+
+```python
+"""PDF-to-message-part conversion for native document ingestion.
+
+Converts a PDF into a content-part dict that a model can read directly,
+**dispatched by backend family** because each provider's native API expects a
+different document-block shape and the Vector Proxy forwards content blocks to
+each backend largely untranslated:
+
+- **Anthropic** (``claude-*``): ``{"type": "document", "source": {...}}``
+- **OpenAI** (``gpt-*``, ``o*``): ``{"type": "file", "file": {...}}``
+- **Google** (``gemini-*``): **not supported through the proxy yet** — the
+  proxy routes Gemini via Google's OpenAI-compatibility endpoint, which drops
+  document (and image) parts. ``pdf_to_content_part`` raises
+  :class:`NotImplementedError` for Gemini models. See the ``TODO(proxy-pdf)``
+  below: once the proxy routes Gemini through the native ``generateContent``
+  API with ``inline_data``, emit that part here and Gemini becomes just another
+  branch — at which point native ingestion can be configured uniformly
+  alongside text extraction for every model.
+
+Usage::
+
+    from pathlib import Path
+    from aieng.forecasting.documents.pdf_upload import inject_pdf_parts, pdf_to_content_part
+    part = pdf_to_content_part(Path("report.pdf"), model="claude-sonnet-4-6")
+    messages = [{"role": "user", "content": "Summarize this document."}]
+    messages = inject_pdf_parts(messages, [part])
+"""
+
+from __future__ import annotations
+
+import base64
+from pathlib import Path
+from typing import Any
+
+
+#: MIME type used for PDF document parts.
+MIME_PDF = "application/pdf"
+
+
+def _backend_family(model: str) -> str:
+    """Map a proxy model name to its backend family.
+
+    The model may carry a LiteLLM provider prefix (e.g. ``openai/gpt-4o``);
+    only the bare name after the last ``/`` is inspected.
+
+    Returns one of ``"anthropic"``, ``"openai"``, ``"google"``.
+
+    Raises
+    ------
+    ValueError
+        If the model name does not match a known family.
+    """
+    name = model.lower().rsplit("/", 1)[-1]
+    if name.startswith("claude"):
+        return "anthropic"
+    if name.startswith(("gpt", "o1", "o3", "o4")):
+        return "openai"
+    if name.startswith("gemini"):
+        return "google"
+    raise ValueError(
+        f"Cannot determine backend family for model {model!r}; native PDF "
+        "ingestion supports Anthropic ('claude-*') and OpenAI ('gpt-*', 'o*') "
+        "models. Use text extraction (report_ingestion='text') for others."
+    )
+
+
+def pdf_bytes_to_content_part(
+    pdf_bytes: bytes,
+    model: str,
+    *,
+    filename: str = "document.pdf",
+) -> dict[str, Any]:
+    """Convert raw PDF bytes into a backend-appropriate content-part dict.
+
+    Parameters
+    ----------
+    pdf_bytes : bytes
+        Raw PDF file bytes.
+    model : str
+        Target model name (bare or provider-prefixed). Selects the block shape.
+    filename : str
+        Filename advertised to OpenAI's ``file`` block. Ignored by Anthropic.
+
+    Returns
+    -------
+    dict
+        A content-part dict in the target backend's native document format.
+
+    Raises
+    ------
+    ValueError
+        If ``model`` is not a recognised Anthropic/OpenAI family member.
+    NotImplementedError
+        If ``model`` is a Gemini model (unsupported through the proxy today).
+    """
+    family = _backend_family(model)
+    b64 = base64.b64encode(pdf_bytes).decode("utf-8")
+    if family == "anthropic":
+        return {
+            "type": "document",
+            "source": {"type": "base64", "media_type": MIME_PDF, "data": b64},
+        }
+    if family == "openai":
+        return {
+            "type": "file",
+            "file": {"filename": filename, "file_data": f"data:{MIME_PDF};base64,{b64}"},
+        }
+    # Remaining family: Google (Gemini).
+    # TODO(proxy-pdf): the Vector Proxy currently routes Gemini through Google's
+    # OpenAI-compatibility endpoint, which silently drops document/image parts
+    # (verified: multimodal content reaches Gemini as 0 added prompt tokens).
+    # Once the proxy routes Gemini via the native generateContent API, emit a
+    # Gemini-native inline_data part here (a "file"/"file_data" data-URI part
+    # that the proxy translates to inline_data) and delete this guard, so native
+    # ingestion becomes configurable uniformly for every model alongside text
+    # extraction.
+    raise NotImplementedError(
+        f"Native PDF ingestion for Gemini model {model!r} is not supported "
+        "through the Vector Proxy yet: the proxy routes Gemini via Google's "
+        "OpenAI-compatibility endpoint, which drops document parts. Use text "
+        "extraction (report_ingestion='text') for Gemini, or a Claude/GPT "
+        "model for native ingestion."
+    )
+
+
+def pdf_to_content_part(pdf_path: Path, model: str) -> dict[str, Any]:
+    """Read a PDF file and convert it to a backend-appropriate content part.
+
+    Parameters
+    ----------
+    pdf_path : Path
+        Path to the PDF file. Must exist and be readable. The filename is
+        forwarded to OpenAI's ``file`` block.
+    model : str
+        Target model name (bare or provider-prefixed). Selects the block shape.
+
+    Returns
+    -------
+    dict
+        A content-part dict in the target backend's native document format.
+
+    Raises
+    ------
+    FileNotFoundError
+        If ``pdf_path`` does not exist.
+    ValueError
+        If ``model`` is not a recognised Anthropic/OpenAI family member.
+    NotImplementedError
+        If ``model`` is a Gemini model (unsupported through the proxy today).
+    """
+    if not pdf_path.exists():
+        raise FileNotFoundError(f"PDF not found: {pdf_path}")
+    return pdf_bytes_to_content_part(pdf_path.read_bytes(), model, filename=pdf_path.name)
+
+
+def inject_pdf_parts(
+    messages: list[dict[str, Any]],
+    pdf_parts: list[dict[str, Any]],
+    *,
+    target_role: str = "user",
+) -> list[dict[str, Any]]:
+    """Inject PDF content parts into the first message matching ``target_role``.
+
+    If the target message's ``content`` is a string, it is converted to a
+    content-part list with the original text as a ``"text"`` part.  PDF parts
+    are prepended so the model sees the document before the instruction text.
+
+    When no message matches ``target_role``, a new message with that role
+    and only the PDF parts is appended as a fallback.
+
+    Parameters
+    ----------
+    messages : list[dict]
+        Existing messages list (mutated in place and returned for chaining).
+    pdf_parts : list[dict]
+        One or more content-part dicts from :func:`pdf_to_content_part`.
+    target_role : str
+        Role of the message to inject into (default ``"user"``).
+
+    Returns
+    -------
+    list[dict]
+        The same ``messages`` list (mutated in place).
+    """
+    for msg in messages:
+        if msg.get("role") == target_role:
+            content = msg["content"]
+            if isinstance(content, str):
+                msg["content"] = [{"type": "text", "text": content}]
+            # Prepend PDF parts before text instruction.
+            msg["content"] = pdf_parts + list(msg["content"])
+            return messages
+    # Fallback: append a new target_role message with only the PDF parts.
+    messages.append({"role": target_role, "content": pdf_parts})
+    return messages
+
+
+__all__ = [
+    "MIME_PDF",
+    "inject_pdf_parts",
+    "pdf_bytes_to_content_part",
+    "pdf_to_content_part",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents__store.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents__store.py.md
new file mode 100644
index 0000000..848778f
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents__store.py.md
@@ -0,0 +1,200 @@
+# Source: aieng-forecasting/aieng/forecasting/documents/store.py
+
+kind: python
+
+```python
+"""Cutoff-aware in-memory store for extracted documents.
+
+``DocumentStore`` loads :class:`ExtractedDocument` JSON artifacts written by
+``scripts/extract_reports.py`` and makes them queryable by source and ``as_of``
+date — the same information-discipline pattern that ``SeriesStore`` enforces for
+numeric series.
+
+Artifact layout (one directory per source)::
+
+    data/reports/<source>/
+    ├── <doc_id>.pdf        # cached PDF (source of extraction)
+    ├── <doc_id>.md         # extracted full text
+    └── <doc_id>.json       # ExtractedDocument metadata + text_path pointer
+"""
+
+from __future__ import annotations
+
+import json
+from datetime import date, datetime
+from pathlib import Path
+from typing import Any
+
+from aieng.forecasting.documents.models import DocumentMeta, ExtractedDocument
+
+
+class DocumentStore:
+    """In-memory store for extracted documents, indexed by ``(source, doc_id)``.
+
+    Populated by ``load_dir()`` from the JSON artifacts written by
+    ``scripts/extract_reports.py``.  Supports cutoff-filtered listing via
+    ``list_docs()`` so that predictors can only see documents whose
+    ``publication_date`` is <= the forecast ``as_of`` date.
+
+    Parameters
+    ----------
+    source_dirs : dict[str, Path] or None
+        Mapping of ``source`` keys to artifact directories.  When ``None``,
+        the store starts empty; call ``load_dir()`` to populate.
+    """
+
+    def __init__(self, source_dirs: dict[str, Path] | None = None) -> None:
+        self._docs: dict[tuple[str, str], ExtractedDocument] = {}
+        self._source_names: set[str] = set()
+        if source_dirs:
+            for source, directory in source_dirs.items():
+                self.load_dir(source, directory)
+
+    # ------------------------------------------------------------------
+    # Population
+    # ------------------------------------------------------------------
+
+    def load_dir(self, source: str, directory: Path) -> int:
+        """Load all ``*.json`` artifacts from ``directory`` into the store.
+
+        Each ``.json`` file must be a serialized :class:`ExtractedDocument`
+        (the shape written by ``scripts/extract_reports.py``).  When the JSON
+        carries no inline ``text``, it is read from the ``.md`` companion file
+        with the same stem, falling back to the stored ``text_path`` pointer.
+
+        Parameters
+        ----------
+        source : str
+            Source key (e.g. ``"cfpr"``).
+        directory : Path
+            Directory containing ``<doc_id>.json`` artifacts.
+
+        Returns
+        -------
+        int
+            Number of documents loaded.
+        """
+        if not directory.is_dir():
+            self._source_names.add(source)
+            return 0
+        count = 0
+        for json_path in sorted(directory.glob("*.json")):
+            doc = self._load_one(source, json_path)
+            if doc is not None:
+                self._docs[(source, doc.meta.doc_id)] = doc
+                count += 1
+        self._source_names.add(source)
+        return count
+
+    def _load_one(self, source: str, json_path: Path) -> ExtractedDocument | None:
+        """Parse one ``<doc_id>.json`` artifact and resolve its text."""
+        try:
+            raw: dict[str, Any] = json.loads(json_path.read_text(encoding="utf-8"))
+        except (json.JSONDecodeError, OSError):
+            return None
+
+        meta_raw = raw.get("meta", {})
+        text = raw.get("text", "") or ""
+
+        # If text is empty (extract_reports.py excludes it from the JSON), load
+        # it from the companion .md.  Prefer the co-located ``<doc_id>.md`` next
+        # to the JSON — it is CWD-independent.  The stored ``text_path`` is only
+        # a fallback and may be repo-root-relative, so resolve it against the
+        # JSON's own directory rather than the current working directory.
+        if not text:
+            md_companion = json_path.with_suffix(".md")
+            text_path_str = raw.get("text_path")
+            if md_companion.exists():
+                text = md_companion.read_text(encoding="utf-8")
+            elif text_path_str:
+                candidate = Path(text_path_str)
+                if not candidate.is_absolute():
+                    candidate = json_path.parent / candidate.name
+                text = candidate.read_text(encoding="utf-8")
+
+        meta = DocumentMeta(
+            source=source,
+            doc_id=meta_raw.get("doc_id", json_path.stem),
+            publication_date=date.fromisoformat(meta_raw["publication_date"]),
+            title=meta_raw.get("title"),
+            lang=meta_raw.get("lang", "en"),
+        )
+        # Resolve the companion PDF (``<doc_id>.pdf``) for native ingestion.
+        # Runtime-only; not persisted in the JSON artifact.
+        pdf_companion = json_path.with_suffix(".pdf")
+        pdf_path = str(pdf_companion) if pdf_companion.exists() else None
+        return ExtractedDocument(
+            meta=meta,
+            text=text,
+            page_count=raw.get("page_count", 0),
+            n_chars=len(text),
+            est_tokens=raw.get("est_tokens", 0),
+            extracted_at=datetime.fromisoformat(raw["extracted_at"]) if raw.get("extracted_at") else datetime.now(),
+            pdf_path=pdf_path,
+        )
+
+    # ------------------------------------------------------------------
+    # Query
+    # ------------------------------------------------------------------
+
+    def get(self, source: str, doc_id: str) -> ExtractedDocument:
+        """Return a single document by source and doc_id.
+
+        Raises
+        ------
+        KeyError
+            If ``(source, doc_id)`` is not in the store.
+        """
+        key = (source, doc_id)
+        if key not in self._docs:
+            available = [f"{s}/{d}" for s, d in self._docs]
+            raise KeyError(f"Document '{source}/{doc_id}' not found. Available: {sorted(available)}")
+        return self._docs[key]
+
+    def list_docs(
+        self,
+        source: str,
+        *,
+        as_of: date | datetime | None = None,
+    ) -> list[ExtractedDocument]:
+        """Return documents for ``source``, optionally cutoff-filtered.
+
+        Documents are sorted by ``publication_date`` ascending then by
+        ``doc_id`` for stable ordering.
+
+        Parameters
+        ----------
+        source : str
+            Source key (e.g. ``"cfpr"``).
+        as_of : date or datetime or None
+            When set, only documents with ``publication_date <= as_of`` are
+            returned.  ``None`` returns all documents for the source.
+
+        Returns
+        -------
+        list[ExtractedDocument]
+            Cutoff-filtered, chronologically sorted documents.
+        """
+        candidates = [doc for (s, _), doc in self._docs.items() if s == source]
+        if as_of is not None:
+            as_of_date = as_of.date() if isinstance(as_of, datetime) else as_of
+            candidates = [d for d in candidates if d.meta.publication_date <= as_of_date]
+        candidates.sort(key=lambda d: (d.meta.publication_date, d.meta.doc_id))
+        return candidates
+
+    @property
+    def sources(self) -> list[str]:
+        """Return sorted list of known document source keys."""
+        return sorted(self._source_names)
+
+    def __contains__(self, key: tuple[str, str]) -> bool:
+        """Check whether ``(source, doc_id)`` is in the store."""
+        return key in self._docs
+
+    def __len__(self) -> int:
+        """Return total number of loaded documents across all sources."""
+        return len(self._docs)
+
+
+__all__ = ["DocumentStore"]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation____init__.py.md
new file mode 100644
index 0000000..3b9a937
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation____init__.py.md
@@ -0,0 +1,84 @@
+# Source: aieng-forecasting/aieng/forecasting/evaluation/__init__.py
+
+kind: python
+
+```python
+"""Evaluation harness: forecasting tasks, prediction payloads, and scoring."""
+
+from aieng.forecasting.evaluation.artifacts import (
+    DEFAULT_STORE_DIR,
+    cached_backtest,
+    cached_multi_backtest,
+    load_backtest_result,
+    load_multi_backtest_results,
+    save_backtest_result,
+    save_eval_result,
+    save_multi_backtest_results,
+    save_multi_eval_results,
+)
+from aieng.forecasting.evaluation.backtest import (
+    BacktestResult,
+    BacktestSpec,
+    MultiTargetBacktestSpec,
+    backtest,
+    compute_brier_score,
+    compute_rps,
+    multi_backtest,
+)
+from aieng.forecasting.evaluation.describe import describe_spec, describe_task
+from aieng.forecasting.evaluation.eval import (
+    EvalBudgetExceededError,
+    EvalResult,
+    EvalSpec,
+    EvalTracker,
+    MultiTargetEvalSpec,
+    evaluate,
+    multi_evaluate,
+)
+from aieng.forecasting.evaluation.prediction import (
+    STANDARD_QUANTILES,
+    BinaryForecast,
+    CategoricalForecast,
+    ContinuousForecast,
+    Prediction,
+)
+from aieng.forecasting.evaluation.predictor import Predictor
+from aieng.forecasting.evaluation.task import ForecastingTask, TaskCategory
+
+
+__all__ = [
+    "DEFAULT_STORE_DIR",
+    "BacktestResult",
+    "BacktestSpec",
+    "BinaryForecast",
+    "CategoricalForecast",
+    "ContinuousForecast",
+    "EvalBudgetExceededError",
+    "EvalResult",
+    "EvalSpec",
+    "EvalTracker",
+    "ForecastingTask",
+    "MultiTargetBacktestSpec",
+    "MultiTargetEvalSpec",
+    "Prediction",
+    "Predictor",
+    "STANDARD_QUANTILES",
+    "TaskCategory",
+    "backtest",
+    "cached_backtest",
+    "cached_multi_backtest",
+    "compute_brier_score",
+    "compute_rps",
+    "describe_spec",
+    "describe_task",
+    "evaluate",
+    "load_backtest_result",
+    "load_multi_backtest_results",
+    "multi_backtest",
+    "multi_evaluate",
+    "save_backtest_result",
+    "save_eval_result",
+    "save_multi_backtest_results",
+    "save_multi_eval_results",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__artifacts.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__artifacts.py.md
new file mode 100644
index 0000000..a9fba49
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__artifacts.py.md
@@ -0,0 +1,440 @@
+# Source: aieng-forecasting/aieng/forecasting/evaluation/artifacts.py
+
+kind: python
+
+```python
+"""Persist backtest and eval results to a filesystem artefact store.
+
+Backtests can be expensive to run — especially for agentic or LLM-based
+predictors — and their outputs are the primary input to downstream analysis,
+plotting, and leaderboard computation.  This module provides a small
+filesystem-backed store so that results can be saved once and re-read many
+times across notebook sessions.
+
+Layout
+------
+Results are stored as YAML files under a store directory:
+
+.. code-block:: text
+
+    data/predictions/
+        <spec_id>/
+            <predictor_id>.yaml                         # single-target backtest
+            <predictor_id>__<task_id>.yaml              # one file per task for multi-target
+            <predictor_id>__<task_id>__eval_run<N>.yaml # multi-target eval run N
+
+Single-target :class:`BacktestResult` files live at ``<store>/<spec_id>/<predictor_id>.yaml``;
+single-target :class:`EvalResult` files at ``<store>/<spec_id>/<predictor_id>__eval_run<N>.yaml``.
+
+Multi-target results (one result per task under a single
+:class:`MultiTargetBacktestSpec` / :class:`MultiTargetEvalSpec`) are split
+across one YAML file per task.  This keeps individual files readable and
+makes partial caching straightforward: re-running after a new task is added
+to the spec only has to compute the missing task.
+
+Caching semantics
+-----------------
+:func:`cached_backtest` and :func:`cached_multi_backtest` implement a simple
+load-or-compute policy:
+
+- If all expected files exist under the store, load and return them.
+- Otherwise, run the backtest, save the result(s), and return them.
+- ``force_refresh=True`` always recomputes and overwrites.
+
+**Eval runs are never silently cached.**  Each :func:`evaluate` /
+:func:`multi_evaluate` call consumes one run from the budget in
+:class:`EvalTracker`, so caching would obscure budget spend.  Eval helpers
+are write-only: :func:`save_eval_result` / :func:`save_multi_eval_results`.
+
+YAML (not parquet or pickle) is the on-disk format because
+:class:`BacktestResult` and :class:`EvalResult` are Pydantic models — the
+YAML round-trip is straightforward and the result is human-readable, which
+matters more than disk footprint at bootcamp scale.
+"""
+
+from __future__ import annotations
+
+import logging
+from pathlib import Path
+
+import yaml
+from aieng.forecasting.data.service import DataService
+from aieng.forecasting.evaluation.backtest import (
+    BacktestResult,
+    BacktestSpec,
+    MultiTargetBacktestSpec,
+    backtest,
+)
+from aieng.forecasting.evaluation.eval import EvalResult, MultiTargetEvalSpec
+from aieng.forecasting.evaluation.predictor import Predictor
+
+
+#: Default store location, relative to the caller's working directory.
+DEFAULT_STORE_DIR = Path("data/predictions")
+
+
+# ---------------------------------------------------------------------------
+# Internal helpers
+# ---------------------------------------------------------------------------
+
+
+def _resolve_store(store_dir: Path | None) -> Path:
+    """Return the effective store directory, falling back to the default."""
+    return Path(store_dir) if store_dir is not None else DEFAULT_STORE_DIR
+
+
+def _dump_yaml(model: BacktestResult | EvalResult, path: Path) -> None:
+    path.parent.mkdir(parents=True, exist_ok=True)
+    data = model.model_dump(mode="json")
+    with path.open("w") as f:
+        yaml.safe_dump(data, f, default_flow_style=False, sort_keys=False)
+
+
+def _load_yaml(path: Path) -> dict[str, object]:
+    with path.open() as f:
+        loaded = yaml.safe_load(f)
+    if not isinstance(loaded, dict):
+        raise ValueError(f"Expected a mapping at {path}, got {type(loaded).__name__}")
+    return loaded
+
+
+def _backtest_path(store_dir: Path, spec_id: str, predictor_id: str, task_id: str | None = None) -> Path:
+    """Return the artefact path for a single :class:`BacktestResult`.
+
+    For single-target backtests pass ``task_id=None`` — the filename is
+    ``<predictor_id>.yaml``.  For multi-target the filename becomes
+    ``<predictor_id>__<task_id>.yaml`` to keep all tasks for a spec in one
+    directory.
+    """
+    if task_id is None:
+        return store_dir / spec_id / f"{predictor_id}.yaml"
+    return store_dir / spec_id / f"{predictor_id}__{task_id}.yaml"
+
+
+def _eval_path(store_dir: Path, spec_id: str, predictor_id: str, run_number: int, task_id: str | None = None) -> Path:
+    """Return the artefact path for a single :class:`EvalResult`.
+
+    Eval filenames include ``run_number`` because each eval run consumes the
+    budget and we want all runs persisted rather than overwriting a previous
+    one.
+    """
+    if task_id is None:
+        return store_dir / spec_id / f"{predictor_id}__eval_run{run_number}.yaml"
+    return store_dir / spec_id / f"{predictor_id}__{task_id}__eval_run{run_number}.yaml"
+
+
+# ---------------------------------------------------------------------------
+# Single-target backtest artefacts
+# ---------------------------------------------------------------------------
+
+
+def save_backtest_result(
+    result: BacktestResult,
+    spec_id: str,
+    store_dir: Path | None = None,
+) -> Path:
+    """Persist a :class:`BacktestResult` to the artefact store.
+
+    Parameters
+    ----------
+    result : BacktestResult
+        The result to persist.
+    spec_id : str
+        Directory key under the store.  For single-target backtests the
+        :class:`BacktestSpec` does not carry a ``spec_id`` field, so callers
+        must supply one explicitly.
+    store_dir : Path or None
+        Store root.  Defaults to :data:`DEFAULT_STORE_DIR`.
+
+    Returns
+    -------
+    Path
+        The path the result was written to.
+    """
+    store = _resolve_store(store_dir)
+    path = _backtest_path(store, spec_id, result.predictor_id)
+    _dump_yaml(result, path)
+    return path
+
+
+def load_backtest_result(
+    spec_id: str,
+    predictor_id: str,
+    store_dir: Path | None = None,
+) -> BacktestResult | None:
+    """Load a previously persisted :class:`BacktestResult` from the store.
+
+    Parameters
+    ----------
+    spec_id : str
+        Directory key under the store.
+    predictor_id : str
+        Predictor whose result to load.
+    store_dir : Path or None
+        Store root.  Defaults to :data:`DEFAULT_STORE_DIR`.
+
+    Returns
+    -------
+    BacktestResult or None
+        The loaded result, or ``None`` if no file exists for this combination.
+    """
+    store = _resolve_store(store_dir)
+    path = _backtest_path(store, spec_id, predictor_id)
+    if not path.exists():
+        return None
+    return BacktestResult.model_validate(_load_yaml(path))
+
+
+def cached_backtest(
+    predictor: Predictor,
+    spec: BacktestSpec,
+    spec_id: str,
+    data_service: DataService,
+    store_dir: Path | None = None,
+    force_refresh: bool = False,
+) -> BacktestResult:
+    """Run :func:`backtest` with a load-or-compute cache.
+
+    If a result already exists under ``<store>/<spec_id>/<predictor_id>.yaml``
+    and ``force_refresh`` is ``False``, the cached result is returned.
+    Otherwise the backtest is run and the result is persisted before return.
+
+    Parameters
+    ----------
+    predictor : Predictor
+        Forecasting model to evaluate.
+    spec : BacktestSpec
+        Backtest specification.
+    spec_id : str
+        Directory key used to locate / persist the artefact.
+    data_service : DataService
+        Pre-populated data service.
+    store_dir : Path or None
+        Store root.  Defaults to :data:`DEFAULT_STORE_DIR`.
+    force_refresh : bool
+        When ``True`` always recompute even if a cached file exists.
+
+    Returns
+    -------
+    BacktestResult
+        The (possibly cached) backtest result.
+    """
+    if not force_refresh:
+        cached = load_backtest_result(spec_id, predictor.predictor_id, store_dir=store_dir)
+        if cached is not None:
+            return cached
+    result = backtest(predictor=predictor, spec=spec, data_service=data_service)
+    save_backtest_result(result, spec_id=spec_id, store_dir=store_dir)
+    return result
+
+
+# ---------------------------------------------------------------------------
+# Multi-target backtest artefacts
+# ---------------------------------------------------------------------------
+
+
+def save_multi_backtest_results(
+    results: dict[str, BacktestResult],
+    spec: MultiTargetBacktestSpec,
+    store_dir: Path | None = None,
+) -> dict[str, Path]:
+    """Persist a full multi-target backtest result set (one file per task).
+
+    Parameters
+    ----------
+    results : dict[str, BacktestResult]
+        Output of :func:`multi_backtest`, keyed by ``task_id``.
+    spec : MultiTargetBacktestSpec
+        The parent spec; supplies ``spec_id`` used as the store subdirectory.
+    store_dir : Path or None
+        Store root.  Defaults to :data:`DEFAULT_STORE_DIR`.
+
+    Returns
+    -------
+    dict[str, Path]
+        Map from ``task_id`` to the written artefact path.
+    """
+    store = _resolve_store(store_dir)
+    paths: dict[str, Path] = {}
+    for task_id, result in results.items():
+        path = _backtest_path(store, spec.spec_id, result.predictor_id, task_id=task_id)
+        _dump_yaml(result, path)
+        paths[task_id] = path
+    return paths
+
+
+def load_multi_backtest_results(
+    spec: MultiTargetBacktestSpec,
+    predictor_id: str,
+    store_dir: Path | None = None,
+) -> dict[str, BacktestResult] | None:
+    """Load persisted multi-target results if *all* tasks have an artefact.
+
+    Parameters
+    ----------
+    spec : MultiTargetBacktestSpec
+        The parent spec.  Its ``spec_id`` keys the lookup and its ``tasks``
+        list enumerates which artefacts to load.
+    predictor_id : str
+        Predictor whose results to load.
+    store_dir : Path or None
+        Store root.  Defaults to :data:`DEFAULT_STORE_DIR`.
+
+    Returns
+    -------
+    dict[str, BacktestResult] or None
+        Full result dict keyed by ``task_id``.  Returns ``None`` if any task
+        is missing — partial caches are never returned, to avoid hiding
+        incomplete state from callers.
+    """
+    store = _resolve_store(store_dir)
+    results: dict[str, BacktestResult] = {}
+    for task in spec.tasks:
+        path = _backtest_path(store, spec.spec_id, predictor_id, task_id=task.task_id)
+        if not path.exists():
+            return None
+        results[task.task_id] = BacktestResult.model_validate(_load_yaml(path))
+    return results
+
+
+_log = logging.getLogger(__name__)
+
+
+def cached_multi_backtest(
+    predictor: Predictor,
+    spec: MultiTargetBacktestSpec,
+    data_service: DataService,
+    store_dir: Path | None = None,
+    force_refresh: bool = False,
+    max_retries: int = 2,
+    retry_delay: float = 2.0,
+) -> dict[str, BacktestResult]:
+    """Run :func:`multi_backtest` with a per-task load-or-compute cache.
+
+    Each task is cached independently under
+    ``<store>/<spec_id>/<predictor_id>__<task_id>.yaml``.  On a fresh run a
+    completed task's file is written immediately so a crash mid-run leaves all
+    prior tasks intact.  Re-running after a crash skips every already-cached
+    task and only retries the ones that didn't complete.
+
+    If a task fails even after the retry logic inside :func:`run_eval_loop`
+    has been exhausted, the failure is logged at WARNING level and the task is
+    omitted from the returned dict rather than propagating the exception.  This
+    keeps the outer experiment loop running so all other predictors still
+    complete.
+
+    Parameters
+    ----------
+    predictor : Predictor
+        Forecasting model to evaluate.
+    spec : MultiTargetBacktestSpec
+        Multi-target backtest specification.
+    data_service : DataService
+        Pre-populated data service.
+    store_dir : Path or None
+        Store root.  Defaults to :data:`DEFAULT_STORE_DIR`.
+    force_refresh : bool
+        When ``True`` always recompute even if cached files exist.
+    max_retries : int, default=2
+        Passed through to :func:`~aieng.forecasting.evaluation.backtest.backtest`.
+        Number of retry attempts per failing origin.
+    retry_delay : float, default=2.0
+        Seconds to wait between per-origin retry attempts.
+
+    Returns
+    -------
+    dict[str, BacktestResult]
+        Results keyed by ``task_id``.  Tasks that failed are absent from the
+        dict; a WARNING log entry is emitted for each failure.
+    """
+    store = _resolve_store(store_dir)
+    results: dict[str, BacktestResult] = {}
+    for single_spec in spec.specs():
+        task_id = single_spec.task.task_id
+        path = _backtest_path(store, spec.spec_id, predictor.predictor_id, task_id=task_id)
+        if not force_refresh and path.exists():
+            results[task_id] = BacktestResult.model_validate(_load_yaml(path))
+            continue
+        try:
+            result = backtest(
+                predictor=predictor,
+                spec=single_spec,
+                data_service=data_service,
+                max_retries=max_retries,
+                retry_delay=retry_delay,
+            )
+        except Exception as exc:
+            _log.warning(
+                "Backtest failed for predictor=%s task=%s — skipping task: %s",
+                predictor.predictor_id,
+                task_id,
+                exc,
+            )
+            continue
+        _dump_yaml(result, path)
+        results[task_id] = result
+    return results
+
+
+# ---------------------------------------------------------------------------
+# Eval artefacts (write-only — eval is never silently cached)
+# ---------------------------------------------------------------------------
+
+
+def save_eval_result(
+    result: EvalResult,
+    store_dir: Path | None = None,
+) -> Path:
+    """Persist a single :class:`EvalResult` to the artefact store.
+
+    The filename encodes ``run_number`` so that successive eval runs are all
+    preserved rather than overwriting each other.
+
+    Parameters
+    ----------
+    result : EvalResult
+        The eval result to persist.  Its ``eval_spec.spec_id`` determines the
+        subdirectory under the store.
+    store_dir : Path or None
+        Store root.  Defaults to :data:`DEFAULT_STORE_DIR`.
+
+    Returns
+    -------
+    Path
+        The path the result was written to.
+    """
+    store = _resolve_store(store_dir)
+    path = _eval_path(store, result.eval_spec.spec_id, result.predictor_id, result.run_number)
+    _dump_yaml(result, path)
+    return path
+
+
+def save_multi_eval_results(
+    results: dict[str, EvalResult],
+    spec: MultiTargetEvalSpec,
+    store_dir: Path | None = None,
+) -> dict[str, Path]:
+    """Persist a full multi-target eval run (one file per task).
+
+    Parameters
+    ----------
+    results : dict[str, EvalResult]
+        Output of :func:`multi_evaluate`, keyed by ``task_id``.
+    spec : MultiTargetEvalSpec
+        Parent spec; supplies ``spec_id`` used as the store subdirectory.
+    store_dir : Path or None
+        Store root.  Defaults to :data:`DEFAULT_STORE_DIR`.
+
+    Returns
+    -------
+    dict[str, Path]
+        Map from ``task_id`` to the written artefact path.
+    """
+    store = _resolve_store(store_dir)
+    paths: dict[str, Path] = {}
+    for task_id, result in results.items():
+        path = _eval_path(store, spec.spec_id, result.predictor_id, result.run_number, task_id=task_id)
+        _dump_yaml(result, path)
+        paths[task_id] = path
+    return paths
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__backtest.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__backtest.py.md
new file mode 100644
index 0000000..e83f5e2
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__backtest.py.md
@@ -0,0 +1,826 @@
+# Source: aieng-forecasting/aieng/forecasting/evaluation/backtest.py
+
+kind: python
+
+```python
+"""BacktestSpec, BacktestResult, and the backtest() harness.
+
+This module also provides :class:`MultiTargetBacktestSpec` and
+:func:`multi_backtest` for running a single predictor across a collection of
+related forecasting tasks (e.g. all food CPI sub-categories) under identical
+evaluation window parameters.
+"""
+
+from __future__ import annotations
+
+import logging
+import math
+import time
+from datetime import datetime, timezone
+from typing import Literal
+
+import numpy as np
+import pandas as pd
+import properscoring as ps
+from aieng.forecasting.data.service import DataService
+from aieng.forecasting.evaluation.prediction import BinaryForecast, CategoricalForecast, ContinuousForecast, Prediction
+from aieng.forecasting.evaluation.predictor import Predictor
+from aieng.forecasting.evaluation.task import ForecastingTask
+from pydantic import AliasChoices, BaseModel, Field, model_validator
+
+
+ScoreMetric = Literal["crps", "brier", "rps"]
+
+#: Score metric names, keyed by ``ForecastingTask.payload_type``.
+METRIC_BY_PAYLOAD_TYPE: dict[str, ScoreMetric] = {"continuous": "crps", "binary": "brier", "categorical": "rps"}
+
+
+logger = logging.getLogger(__name__)
+
+
+def _compute_origins(start: datetime, end: datetime, frequency: str, stride: int) -> list[datetime]:
+    """Compute strided forecast origin dates for a spec window.
+
+    Shared by :class:`BacktestSpec` and
+    :class:`~aieng.forecasting.evaluation.eval.EvalSpec` to avoid duplicating
+    the striding logic.
+
+    Parameters
+    ----------
+    start : datetime
+        First candidate origin.
+    end : datetime
+        Last candidate origin (inclusive).
+    frequency : str
+        Pandas offset alias (e.g. ``"MS"``).
+    stride : int
+        Step size between origins in frequency units.
+
+    Returns
+    -------
+    list[datetime]
+        Candidate forecast origin dates, sorted ascending.
+    """
+    all_dates = pd.date_range(start=start, end=end, freq=frequency)
+    strided = all_dates[::stride]
+    return [ts.to_pydatetime() for ts in strided]
+
+
+class BacktestSpec(BaseModel):
+    """Specifies when and how often to evaluate a predictor against a task.
+
+    ``BacktestSpec`` separates the *evaluation window* from the prediction
+    problem itself. A :class:`ForecastingTask` defines *what* to forecast;
+    ``BacktestSpec`` wraps a task and adds *when* and *how often* to run
+    the harness.
+
+    Because ``BacktestSpec`` is a Pydantic model it is YAML-serializable,
+    making evaluation windows shareable and reproducible. Reference specs for
+    canonical tasks live in ``implementations/<use-case>/specs/``.
+
+    Parameters
+    ----------
+    task : ForecastingTask
+        The prediction problem to evaluate.
+    start : datetime
+        First candidate forecast origin.
+    end : datetime
+        Last candidate forecast origin (inclusive).
+    stride : int
+        Step size between origins in task-frequency units. ``stride=1`` means
+        every period; ``stride=6`` on monthly data means twice per year
+        (January and July when ``start`` is 1 January).
+    origin_dates : list[datetime] or None
+        Optional explicit forecast origins. When provided, :meth:`origins`
+        returns exactly these dates (sorted ascending) instead of deriving a
+        regular grid from ``start``/``end``/``stride``. This supports
+        irregular event calendars — for example Bank of Canada fixed
+        announcement dates, which occur eight times per year on dates that no
+        pandas frequency alias can generate. All dates must fall within
+        ``[start, end]`` so the window fields remain an honest summary of the
+        evaluation period.
+    warmup : int
+        Minimum number of observations required in the cutoff-filtered series
+        before a forecast origin is used. Origins that do not have enough
+        history are silently skipped.
+    description : str
+        Free-form prose description of the backtest intent (methodology,
+        origin rationale, etc.). Optional — defaults to an empty string.
+        Consumers such as :func:`aieng.forecasting.evaluation.describe.describe_spec`
+        and LLM-based predictors surface this to provide qualitative context
+        alongside the quantitative task definition.
+
+    Examples
+    --------
+    >>> from datetime import datetime
+    >>> spec = BacktestSpec(
+    ...     task=ForecastingTask(
+    ...         task_id="cpi_gasoline_canada_1m",
+    ...         target_series_id="cpi_gasoline_canada",
+    ...         horizon=1,
+    ...         frequency="MS",
+    ...         description="CPI Gasoline Canada, 1-month ahead forecast",
+    ...     ),
+    ...     start=datetime(2000, 1, 1),
+    ...     end=datetime(2025, 1, 1),
+    ...     stride=1,
+    ...     warmup=24,
+    ... )
+    >>> origins = spec.origins()
+    >>> len(origins) > 0
+    True
+    """
+
+    task: ForecastingTask
+    start: datetime = Field(description="First candidate forecast origin.")
+    end: datetime = Field(description="Last candidate forecast origin (inclusive).")
+    stride: int = Field(default=1, ge=1, description="Step size between origins in task-frequency units.")
+    origin_dates: list[datetime] | None = Field(
+        default=None,
+        description=(
+            "Optional explicit forecast origins for irregular calendars (e.g. central bank "
+            "announcement dates). When set, overrides the start/end/stride grid derivation."
+        ),
+    )
+    warmup: int = Field(default=0, ge=0, description="Minimum observations required before first forecast.")
+    description: str = Field(
+        default="",
+        description="Free-form prose description of the backtest intent (methodology, origin rationale, etc.).",
+    )
+
+    @model_validator(mode="after")
+    def start_before_end(self) -> "BacktestSpec":
+        """Validate that start precedes end."""
+        if self.start >= self.end:
+            raise ValueError(f"start ({self.start}) must be before end ({self.end})")
+        return self
+
+    @model_validator(mode="after")
+    def origin_dates_in_window(self) -> "BacktestSpec":
+        """Validate that explicit origin dates fall within [start, end]."""
+        if self.origin_dates is not None:
+            if not self.origin_dates:
+                raise ValueError("origin_dates must be non-empty when provided; omit it to derive origins.")
+            out_of_window = [d for d in self.origin_dates if not (self.start <= d <= self.end)]
+            if out_of_window:
+                raise ValueError(
+                    f"All origin_dates must fall within [start, end] = [{self.start}, {self.end}]. "
+                    f"Out of window: {out_of_window}"
+                )
+        return self
+
+    def origins(self) -> list[datetime]:
+        """Return the candidate forecast origins derived from this spec.
+
+        When ``origin_dates`` is set, those dates are returned sorted
+        ascending. Otherwise origins are generated using ``pd.date_range``
+        with the task's frequency and the configured stride. The returned
+        list does not apply the warmup filter — that is applied inside
+        :func:`backtest` where the actual series data is available.
+
+        Returns
+        -------
+        list[datetime]
+            Candidate forecast origin dates, sorted ascending.
+        """
+        if self.origin_dates is not None:
+            return sorted(self.origin_dates)
+        return _compute_origins(self.start, self.end, self.task.frequency, self.stride)
+
+
+class BacktestResult(BaseModel):
+    """The outcome of a backtest run — a self-contained, serializable record.
+
+    ``BacktestResult`` is a first-class Pydantic model (not just a DataFrame
+    of numbers). It is designed to be YAML-roundtrippable so that results can
+    be persisted alongside predictor implementations, fed to downstream agents
+    as structured context, or used as submission artefacts in a future
+    competition mechanism.
+
+    Parameters
+    ----------
+    spec : BacktestSpec
+        The exact spec that was evaluated.
+    predictor_id : str
+        Identifier for the predictor that produced these forecasts.
+    predictions : list[Prediction]
+        Flat list of scored predictions. For single-horizon tasks this is one
+        entry per evaluated origin; for multi-horizon tasks it is
+        ``origins_scored × len(task.horizons)`` (minus any future steps that
+        could not yet be resolved). Ordered by origin then by horizon.
+    scores : list[float]
+        Score for each prediction, parallel to ``predictions``. CRPS for
+        continuous tasks, Brier for binary tasks, RPS for categorical tasks.
+        Lower is better.
+    metric : {"crps", "brier", "rps"}
+        Which scoring rule produced ``scores`` / ``mean_score``. Determined by
+        the task's ``payload_type``. Defaults to ``"crps"`` so artefacts
+        written before binary support existed still load correctly.
+    mean_score : float
+        Mean score across all scored (origin, horizon) pairs. Older artefacts
+        serialized this field as ``mean_crps``; both keys are accepted on load.
+    ran_at : datetime
+        UTC wall-clock time when the backtest was executed.
+    skipped_origins : int
+        Number of candidate origins where no horizon could be scored (either
+        warmup not met, or all forecast dates were unresolvable).
+    """
+
+    spec: BacktestSpec
+    predictor_id: str
+    predictions: list[Prediction]
+    scores: list[float]
+    metric: ScoreMetric = Field(
+        default="crps",
+        description="Scoring rule used: 'crps' (continuous), 'brier' (binary), or 'rps' (categorical).",
+    )
+    mean_score: float = Field(
+        validation_alias=AliasChoices("mean_score", "mean_crps"),
+        description="Mean score across all scored predictions (CRPS, Brier, or RPS; lower is better).",
+    )
+    ran_at: datetime
+    skipped_origins: int = Field(default=0, description="Candidate origins where no horizon could be scored.")
+
+    @model_validator(mode="after")
+    def lengths_match(self) -> "BacktestResult":
+        """Validate that predictions and scores have the same length."""
+        if len(self.predictions) != len(self.scores):
+            raise ValueError(
+                f"predictions ({len(self.predictions)}) and scores ({len(self.scores)}) must have the same length"
+            )
+        return self
+
+
+def _crps_for_prediction(prediction: Prediction, actual: float) -> float:
+    """Compute CRPS for a single ContinuousForecast against an observed value.
+
+    Uses ``properscoring.crps_ensemble`` with the quantile forecast values
+    as an ensemble. While quantile values are not independent samples from
+    the predictive distribution, this gives a reasonable CRPS approximation
+    when the quantile grid is sufficiently fine.
+
+    Parameters
+    ----------
+    prediction : Prediction
+        Must have a :class:`ContinuousForecast` payload.
+    actual : float
+        The observed value at the forecast date.
+
+    Returns
+    -------
+    float
+        CRPS score (lower is better).
+    """
+    if not isinstance(prediction.payload, ContinuousForecast):
+        raise TypeError("CRPS scoring requires a ContinuousForecast payload.")
+    payload = prediction.payload
+    ensemble = np.array(sorted(payload.quantiles.values()), dtype=float)
+    return float(ps.crps_ensemble(actual, ensemble))
+
+
+def compute_brier_score(probabilities: list[float], outcomes: list[float]) -> float:
+    """Mean Brier score for a batch of binary forecasts.
+
+    The Brier score is ``mean((p - y)**2)`` over forecast/outcome pairs. It is
+    a strictly proper scoring rule for binary events: it is minimised in
+    expectation only by reporting the true event probability.
+
+    Parameters
+    ----------
+    probabilities : list[float]
+        Predicted P(event), each in [0, 1].
+    outcomes : list[float]
+        Realised outcomes (0 or 1), parallel to ``probabilities``.
+
+    Returns
+    -------
+    float
+        Mean Brier score in [0, 1]; lower is better. ``nan`` for empty input.
+    """
+    if not probabilities:
+        return float("nan")
+    if len(probabilities) != len(outcomes):
+        raise ValueError(
+            f"probabilities ({len(probabilities)}) and outcomes ({len(outcomes)}) must have the same length"
+        )
+    probs = np.asarray(probabilities, dtype=float)
+    ys = np.asarray(outcomes, dtype=float)
+    return float(np.mean((probs - ys) ** 2))
+
+
+def compute_rps(probabilities: list[list[float]], outcome_indices: list[int]) -> float:
+    """Mean Ranked Probability Score for ordered-categorical forecasts.
+
+    RPS is a strictly proper scoring rule for ordinal outcomes. For one
+    forecast with ``K`` ordered category probabilities ``p`` and realised
+    category index ``j``, this implementation uses the standard unnormalized
+    Epstein/Murphy convention:
+    ``sum((cumsum(p)[k] - I[j <= k])**2 for k in range(K - 1))``.
+
+    For ``K=2`` it equals the binary Brier score ``(p - y)**2`` as implemented
+    by :func:`compute_brier_score`. This convention is one half of Brier's
+    original 1950 multi-category score, which is noted here because both
+    normalizations appear in the literature.
+
+    Parameters
+    ----------
+    probabilities : list[list[float]]
+        Ordered category probability rows, one row per forecast. All rows must
+        have the same length ``K >= 2``.
+    outcome_indices : list[int]
+        Realised category indices in ``[0, K)``, parallel to ``probabilities``.
+
+    Returns
+    -------
+    float
+        Mean RPS in ``[0, K-1]``; lower is better. ``nan`` for empty input.
+    """
+    if not probabilities:
+        return float("nan")
+    if len(probabilities) != len(outcome_indices):
+        raise ValueError(
+            f"probabilities ({len(probabilities)}) and outcome_indices ({len(outcome_indices)}) "
+            "must have the same length"
+        )
+
+    row_length = len(probabilities[0])
+    if row_length < 2:
+        raise ValueError(f"RPS probability rows must have length K >= 2; got {row_length}.")
+    for row in probabilities:
+        if len(row) != row_length:
+            raise ValueError("RPS probability rows must all have the same length.")
+
+    scores: list[float] = []
+    for row, outcome_index in zip(probabilities, outcome_indices, strict=True):
+        if outcome_index < 0 or outcome_index >= row_length:
+            raise ValueError(f"RPS outcome index {outcome_index} is out of range for K={row_length}.")
+        cumulative = np.cumsum(np.asarray(row, dtype=float))[:-1]
+        observed = np.asarray([1.0 if outcome_index <= k else 0.0 for k in range(row_length - 1)], dtype=float)
+        scores.append(float(np.sum((cumulative - observed) ** 2)))
+    return float(np.mean(scores))
+
+
+def _brier_for_prediction(prediction: Prediction, actual: float) -> float:
+    """Compute the Brier score for a single BinaryForecast against an observed outcome.
+
+    The Brier score is the squared error between the forecast probability and
+    the realised binary outcome: ``(p - y)**2``. It is a strictly proper
+    scoring rule for binary events — the binary counterpart of CRPS.
+
+    Parameters
+    ----------
+    prediction : Prediction
+        Must have a :class:`BinaryForecast` payload.
+    actual : float
+        The observed outcome at the forecast date. Must be 0.0 or 1.0
+        (binary tasks resolve against a 0/1 event series).
+
+    Returns
+    -------
+    float
+        Brier score in [0, 1] (lower is better).
+    """
+    if not isinstance(prediction.payload, BinaryForecast):
+        raise TypeError("Brier scoring requires a BinaryForecast payload.")
+    if actual not in (0.0, 1.0):
+        raise ValueError(
+            f"Brier scoring requires a binary (0/1) resolved outcome; got {actual}. "
+            f"Check that the task's target series is a 0/1 event series."
+        )
+    return compute_brier_score([prediction.payload.probability], [actual])
+
+
+def _rps_for_prediction(task: ForecastingTask, prediction: Prediction, actual: float) -> float:
+    """Compute RPS for a single CategoricalForecast against an observed outcome."""
+    if task.categories is None:
+        raise ValueError(f"Task '{task.task_id}' declares payload_type='categorical' but has no categories.")
+    if not isinstance(prediction.payload, CategoricalForecast):
+        raise TypeError("RPS scoring requires a CategoricalForecast payload.")
+
+    categories = task.categories
+    labels = [category.label for category in categories]
+    values = [category.value for category in categories]
+    expected_labels = set(labels)
+    predicted_labels = set(prediction.payload.probabilities)
+    if predicted_labels != expected_labels:
+        missing = sorted(expected_labels - predicted_labels)
+        extra = sorted(predicted_labels - expected_labels)
+        raise ValueError(
+            f"Categorical prediction from predictor '{prediction.predictor_id}' must contain exactly the task "
+            f"category labels. Missing labels: {missing}; extra labels: {extra}."
+        )
+
+    outcome_index: int | None = None
+    for index, value in enumerate(values):
+        if math.isclose(actual, value, abs_tol=1e-9):
+            outcome_index = index
+            break
+    if outcome_index is None:
+        raise ValueError(
+            f"Categorical resolved outcome {actual} does not match any task category value. Allowed values: {values}."
+        )
+
+    ordered_probabilities = [prediction.payload.probabilities[label] for label in labels]
+    return compute_rps([ordered_probabilities], [outcome_index])
+
+
+def _score_for_prediction(task: ForecastingTask, prediction: Prediction, actual: float) -> float:
+    """Score a prediction with the metric implied by the task's payload type.
+
+    Dispatches to CRPS for ``payload_type="continuous"``, Brier for
+    ``payload_type="binary"``, and RPS for ``payload_type="categorical"``,
+    after validating that the payload the predictor returned actually matches
+    the task declaration. A mismatch fails loudly: a probability scored with
+    CRPS (or quantiles scored with Brier/RPS) would be silently meaningless.
+
+    Parameters
+    ----------
+    task : ForecastingTask
+        Declares the expected payload modality.
+    prediction : Prediction
+        The prediction to score.
+    actual : float
+        The resolved ground-truth value.
+
+    Returns
+    -------
+    float
+        CRPS, Brier, or RPS score (lower is better).
+    """
+    if task.payload_type == "binary":
+        if not isinstance(prediction.payload, BinaryForecast):
+            raise TypeError(
+                f"Task '{task.task_id}' declares payload_type='binary' but predictor "
+                f"'{prediction.predictor_id}' returned a {type(prediction.payload).__name__} payload."
+            )
+        return _brier_for_prediction(prediction, actual)
+    if task.payload_type == "categorical":
+        if not isinstance(prediction.payload, CategoricalForecast):
+            raise TypeError(
+                f"Task '{task.task_id}' declares payload_type='categorical' but predictor "
+                f"'{prediction.predictor_id}' returned a {type(prediction.payload).__name__} payload."
+            )
+        return _rps_for_prediction(task, prediction, actual)
+    if not isinstance(prediction.payload, ContinuousForecast):
+        raise TypeError(
+            f"Task '{task.task_id}' declares payload_type='continuous' but predictor "
+            f"'{prediction.predictor_id}' returned a {type(prediction.payload).__name__} payload."
+        )
+    return _crps_for_prediction(prediction, actual)
+
+
+def _resolve(task: ForecastingTask, forecast_date: datetime, data_service: DataService) -> float | None:
+    """Look up the observed value at a forecast date.
+
+    Queries the data service with a sufficiently late ``as_of`` to ensure the
+    observation is available. Returns ``None`` if the observation is not found
+    (e.g. the forecast date is in the future).
+
+    Parameters
+    ----------
+    task : ForecastingTask
+        Used to identify the target series.
+    forecast_date : datetime
+        The date whose observed value is needed.
+    data_service : DataService
+        The data service to query.
+
+    Returns
+    -------
+    float or None
+        The observed value, or ``None`` if unavailable.
+    """
+    # Query with as_of=now so all available data is visible, including observations after the origin.
+    as_of_now = datetime.now(tz=timezone.utc).replace(tzinfo=None)
+    full_series = data_service.get_series(task.target_series_id, as_of=as_of_now)
+
+    target_ts = pd.Timestamp(forecast_date)
+    match = full_series[pd.to_datetime(full_series["timestamp"]) == target_ts]
+    if match.empty:
+        return None
+    return float(match["value"].iloc[0])
+
+
+def run_eval_loop(
+    predictor: Predictor,
+    task: ForecastingTask,
+    origins: list[datetime],
+    warmup: int,
+    data_service: DataService,
+    max_retries: int = 2,
+    retry_delay: float = 2.0,
+) -> tuple[list[Prediction], list[float], int]:
+    """Core evaluation loop shared by ``backtest()`` and ``evaluate()``.
+
+    Iterates over ``origins``, calls the predictor at each origin, resolves
+    predictions against the observed series, and scores with the metric
+    implied by the task's ``payload_type`` (CRPS for continuous, Brier for
+    binary, RPS for categorical).
+
+    Parameters
+    ----------
+    predictor : Predictor
+        The forecasting model to evaluate.
+    task : ForecastingTask
+        The prediction problem being evaluated.
+    origins : list[datetime]
+        Candidate forecast origin dates (already strided / derived from a spec).
+    warmup : int
+        Minimum number of observations required before a forecast origin is used.
+    data_service : DataService
+        Pre-populated data service. Must have the target series registered.
+    max_retries : int, default=2
+        Number of times to retry a failing ``predictor.predict()`` call before
+        skipping the origin.  Handles transient model errors (e.g. malformed
+        structured output) without crashing the whole backtest.
+    retry_delay : float, default=2.0
+        Seconds to wait between retry attempts.
+
+    Returns
+    -------
+    tuple[list[Prediction], list[float], int]
+        ``(predictions, scores, skipped)`` — parallel lists of predictions and
+        scores, plus the count of origins that were skipped.
+
+    Raises
+    ------
+    ValueError
+        If no origins produce a resolvable prediction.
+    """
+    predictions: list[Prediction] = []
+    scores: list[float] = []
+    skipped = 0
+
+    for origin in origins:
+        ctx = data_service.context(as_of=origin)
+
+        if warmup > 0:
+            series = ctx.get_series(task.target_series_id)
+            if len(series) < warmup:
+                skipped += 1
+                continue
+
+        origin_predictions: list[Prediction] = []
+        last_exc: BaseException | None = None
+        for attempt in range(max_retries + 1):
+            try:
+                origin_predictions = predictor.predict(task, ctx)
+                last_exc = None
+                break
+            except Exception as exc:
+                last_exc = exc
+                if attempt < max_retries:
+                    logger.warning(
+                        "predict() failed at origin %s (attempt %d/%d): %s — retrying in %.0fs",
+                        origin.date(),
+                        attempt + 1,
+                        max_retries + 1,
+                        exc,
+                        retry_delay,
+                    )
+                    time.sleep(retry_delay)
+
+        if last_exc is not None:
+            logger.warning(
+                "predict() failed at origin %s after %d attempt(s) — skipping origin: %s",
+                origin.date(),
+                max_retries + 1,
+                last_exc,
+            )
+            skipped += 1
+            continue
+
+        origin_scored = 0
+        for pred in origin_predictions:
+            actual = _resolve(task, pred.forecast_date, data_service)
+            if actual is None:
+                continue
+            score = _score_for_prediction(task, pred, actual)
+            predictions.append(pred)
+            scores.append(score)
+            origin_scored += 1
+
+        if origin_scored == 0:
+            skipped += 1
+
+    if not predictions:
+        raise ValueError(
+            f"No predictions were scored. All {len(origins)} candidate origins were skipped. "
+            f"Check that the target series covers the evaluation window and that warmup ({warmup}) "
+            f"is not too large."
+        )
+
+    return predictions, scores, skipped
+
+
+def backtest(
+    predictor: Predictor,
+    spec: BacktestSpec,
+    data_service: DataService,
+    max_retries: int = 2,
+    retry_delay: float = 2.0,
+) -> BacktestResult:
+    """Run a backtest of a predictor against a BacktestSpec.
+
+    Iterates over forecast origins derived from the spec, calls the predictor
+    at each origin (with a :class:`~aieng.forecasting.data.context.ForecastContext`
+    scoped to that date), resolves predictions against the observed series, and
+    scores with the metric implied by the task's ``payload_type`` (CRPS for
+    continuous tasks, Brier for binary tasks, RPS for categorical tasks).
+
+    Origins with insufficient history (fewer than ``spec.warmup`` observations
+    in the cutoff-filtered series) are silently skipped. Origins for which no
+    ``forecast_date`` has been observed yet are also counted as skipped.
+
+    Parameters
+    ----------
+    predictor : Predictor
+        The forecasting model to evaluate.
+    spec : BacktestSpec
+        Defines the task, evaluation window, stride, and warmup.
+    data_service : DataService
+        Pre-populated data service. Must have the target series registered.
+    max_retries : int, default=2
+        Passed through to :func:`run_eval_loop`.  Number of retry attempts per
+        failing origin before it is counted as skipped.
+    retry_delay : float, default=2.0
+        Seconds to wait between retry attempts.
+
+    Returns
+    -------
+    BacktestResult
+        A fully populated result record including all predictions and scores.
+
+    Raises
+    ------
+    KeyError
+        If the target series is not registered in the data service.
+    ValueError
+        If no origins produce a resolvable prediction (all skipped).
+
+    Examples
+    --------
+    >>> results = backtest(predictor=my_predictor, spec=spec, data_service=svc)
+    >>> print(f"Mean {results.metric.upper()}: {results.mean_score:.4f}")
+    """
+    predictions, scores, skipped = run_eval_loop(
+        predictor=predictor,
+        task=spec.task,
+        origins=spec.origins(),
+        warmup=spec.warmup,
+        data_service=data_service,
+        max_retries=max_retries,
+        retry_delay=retry_delay,
+    )
+    return BacktestResult(
+        spec=spec,
+        predictor_id=predictor.predictor_id,
+        predictions=predictions,
+        scores=scores,
+        metric=METRIC_BY_PAYLOAD_TYPE[spec.task.payload_type],
+        mean_score=float(np.mean(scores)),
+        ran_at=datetime.now(tz=timezone.utc).replace(tzinfo=None),
+        skipped_origins=skipped,
+    )
+
+
+# ---------------------------------------------------------------------------
+# MultiTargetBacktestSpec and multi_backtest()  # noqa: ERA001
+# ---------------------------------------------------------------------------
+
+
+class MultiTargetBacktestSpec(BaseModel):
+    """Backtest spec that evaluates a predictor across multiple related tasks.
+
+    ``MultiTargetBacktestSpec`` groups several :class:`ForecastingTask` objects
+    under a single shared evaluation window (``start``, ``end``, ``stride``,
+    ``warmup``).  All tasks must share the same ``frequency`` — this is
+    enforced at construction time.
+
+    A typical use case is evaluating a predictor on all food CPI sub-categories
+    simultaneously: each category is a separate task, but they all use monthly
+    data and the same historical window.
+
+    The spec can be decomposed into a list of standard :class:`BacktestSpec`
+    objects via :meth:`specs`, or evaluated directly with :func:`multi_backtest`.
+
+    Parameters
+    ----------
+    spec_id : str
+        Stable identifier for this spec. Used as the directory key for
+        persisted artefacts (see
+        :mod:`aieng.forecasting.evaluation.artifacts`) and for surfacing the
+        spec in logs and agent context. Should be unique across all spec files.
+    tasks : list[ForecastingTask]
+        The prediction problems to evaluate.  All must share the same
+        ``frequency``.
+    start : datetime
+        First candidate forecast origin.
+    end : datetime
+        Last candidate forecast origin (inclusive).
+    stride : int
+        Step size between origins in task-frequency units.
+    warmup : int
+        Minimum number of observations required before a forecast origin is used.
+    description : str
+        Free-form prose description of the backtest intent (methodology,
+        origin rationale, etc.). Optional — defaults to an empty string.
+
+    Examples
+    --------
+    >>> spec = MultiTargetBacktestSpec(
+    ...     spec_id="food_cpi_cfpr_backtest",
+    ...     tasks=[task_food, task_meat, task_dairy],
+    ...     start=datetime(2000, 1, 1),
+    ...     end=datetime(2026, 1, 1),
+    ...     stride=6,
+    ...     warmup=24,
+    ... )
+    >>> per_task_results = multi_backtest(my_predictor, spec, svc)
+    >>> for task_id, result in per_task_results.items():
+    ...     print(f"{task_id}: mean CRPS = {result.mean_score:.4f}")
+    """
+
+    spec_id: str = Field(description="Stable identifier for this spec; keys the artefact store.")
+    tasks: list[ForecastingTask] = Field(
+        min_length=1, description="Prediction problems; all must share the same frequency."
+    )
+    start: datetime = Field(description="First candidate forecast origin.")
+    end: datetime = Field(description="Last candidate forecast origin (inclusive).")
+    stride: int = Field(default=1, ge=1, description="Step size between origins in task-frequency units.")
+    warmup: int = Field(default=0, ge=0, description="Minimum observations required before first forecast.")
+    description: str = Field(
+        default="",
+        description="Free-form prose description of the backtest intent (methodology, origin rationale, etc.).",
+    )
+
+    @model_validator(mode="after")
+    def _validate(self) -> "MultiTargetBacktestSpec":
+        if self.start >= self.end:
+            raise ValueError(f"start ({self.start}) must be before end ({self.end})")
+        frequencies = {t.frequency for t in self.tasks}
+        if len(frequencies) > 1:
+            raise ValueError(
+                f"All tasks in a MultiTargetBacktestSpec must share the same frequency. Found: {sorted(frequencies)}"
+            )
+        return self
+
+    def specs(self) -> list[BacktestSpec]:
+        """Decompose into one :class:`BacktestSpec` per task.
+
+        Returns
+        -------
+        list[BacktestSpec]
+            One spec per task, all sharing the same window parameters.
+        """
+        return [
+            BacktestSpec(
+                task=t,
+                start=self.start,
+                end=self.end,
+                stride=self.stride,
+                warmup=self.warmup,
+                description=self.description,
+            )
+            for t in self.tasks
+        ]
+
+
+def multi_backtest(
+    predictor: Predictor, spec: MultiTargetBacktestSpec, data_service: DataService
+) -> dict[str, BacktestResult]:
+    """Run a backtest of a predictor across all tasks in a MultiTargetBacktestSpec.
+
+    Calls :func:`backtest` once per task and returns the results keyed by
+    ``task_id``.  All tasks share the same evaluation window, stride, and warmup
+    defined in the spec.
+
+    Parameters
+    ----------
+    predictor : Predictor
+        The forecasting model to evaluate.
+    spec : MultiTargetBacktestSpec
+        Defines the tasks, shared evaluation window, stride, and warmup.
+    data_service : DataService
+        Pre-populated data service.  Must have all target series registered.
+
+    Returns
+    -------
+    dict[str, BacktestResult]
+        Backtest results keyed by ``task_id``, one entry per task.
+
+    Raises
+    ------
+    KeyError
+        If any target series is not registered in the data service.
+    ValueError
+        If any task has no origins that can be scored.
+
+    Examples
+    --------
+    >>> results = multi_backtest(predictor=my_predictor, spec=spec, data_service=svc)
+    >>> for task_id, result in results.items():
+    ...     print(f"{task_id}: {result.mean_score:.4f}")
+    """
+    return {single_spec.task.task_id: backtest(predictor, single_spec, data_service) for single_spec in spec.specs()}
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__describe.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__describe.py.md
new file mode 100644
index 0000000..e44aa07
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__describe.py.md
@@ -0,0 +1,183 @@
+# Source: aieng-forecasting/aieng/forecasting/evaluation/describe.py
+
+kind: python
+
+```python
+"""Human-readable descriptions of forecasting tasks and specs.
+
+These helpers turn a :class:`ForecastingTask` / :class:`BacktestSpec` /
+:class:`EvalSpec` and their multi-target counterparts into a plain-text
+block suitable for printing in a notebook or piping into an LLM predictor
+prompt.  They are the simplest form of "spec as source of truth": one
+input (the spec, optionally a :class:`DataService` for metadata lookup),
+one output (a string that captures the full problem definition).
+
+The output format is intentionally minimal and stable — it is not an API,
+and production code should depend on the model fields directly.  It is
+purely for display / prompt-construction use cases.
+"""
+
+from __future__ import annotations
+
+from aieng.forecasting.data.service import DataService
+from aieng.forecasting.evaluation.backtest import BacktestSpec, MultiTargetBacktestSpec
+from aieng.forecasting.evaluation.eval import EvalSpec, MultiTargetEvalSpec
+from aieng.forecasting.evaluation.task import ForecastingTask
+
+
+def _series_line(series_id: str, data_service: DataService | None) -> str:
+    """Return a display line for ``series_id``, with metadata if available."""
+    if data_service is None:
+        return f"- target_series_id: {series_id}"
+    try:
+        meta = data_service.get_metadata(series_id)
+    except KeyError:
+        return f"- target_series_id: {series_id}  (not registered in data_service)"
+    return (
+        f"- target_series_id: {series_id}\n"
+        f"    description:    {meta.description}\n"
+        f"    source:         {meta.source}\n"
+        f"    units:          {meta.units}\n"
+        f"    frequency:      {meta.frequency}"
+    )
+
+
+def describe_task(task: ForecastingTask, data_service: DataService | None = None) -> str:
+    """Return a plain-text description of a :class:`ForecastingTask`.
+
+    Parameters
+    ----------
+    task : ForecastingTask
+        The task to describe.
+    data_service : DataService or None
+        Optional data service.  When provided, metadata for
+        ``target_series_id`` is included in the description.
+
+    Returns
+    -------
+    str
+        Multi-line description suitable for printing or embedding in a prompt.
+    """
+    horizons_display = task.horizons[0] if len(task.horizons) == 1 else f"{task.horizons} (len={len(task.horizons)})"
+    lines = [
+        f"Task: {task.task_id}",
+        f"  description: {task.description}",
+        f"  horizons:    {horizons_display}",
+        f"  frequency:   {task.frequency}",
+        f"  payload:     {task.payload_type}",
+    ]
+    if task.payload_type == "categorical" and task.categories is not None:
+        categories = " < ".join(f"{category.label}({category.value:g})" for category in task.categories)
+        lines.append(f"  categories:  {categories}")
+    lines.extend(
+        [
+            f"  resolution:  {task.resolution_fn}",
+            _series_line(task.target_series_id, data_service),
+        ]
+    )
+    return "\n".join(lines)
+
+
+def _window_lines(start: object, end: object, stride: int, warmup: int) -> list[str]:
+    return [
+        f"  start:       {start}",
+        f"  end:         {end}",
+        f"  stride:      {stride}",
+        f"  warmup:      {warmup}",
+    ]
+
+
+def _describe_backtest_spec(spec: BacktestSpec, data_service: DataService | None) -> str:
+    lines = [
+        "BacktestSpec",
+    ]
+    if spec.description:
+        lines.append(f"  description: {spec.description}")
+    lines.extend(_window_lines(spec.start, spec.end, spec.stride, spec.warmup))
+    if spec.origin_dates is not None:
+        lines.append(f"  origins:     {len(spec.origin_dates)} explicit dates (irregular calendar)")
+    lines.append("")
+    lines.append(describe_task(spec.task, data_service))
+    return "\n".join(lines)
+
+
+def _describe_eval_spec(spec: EvalSpec, data_service: DataService | None) -> str:
+    lines = [
+        f"EvalSpec (spec_id={spec.spec_id})",
+    ]
+    if spec.description:
+        lines.append(f"  description: {spec.description}")
+    lines.extend(_window_lines(spec.start, spec.end, spec.stride, spec.warmup))
+    if spec.origin_dates is not None:
+        lines.append(f"  origins:     {len(spec.origin_dates)} explicit dates (irregular calendar)")
+    lines.append(f"  max_runs:    {spec.max_runs}")
+    lines.append("")
+    lines.append(describe_task(spec.task, data_service))
+    return "\n".join(lines)
+
+
+def _describe_multi_target_backtest_spec(spec: MultiTargetBacktestSpec, data_service: DataService | None) -> str:
+    lines = [
+        f"MultiTargetBacktestSpec (spec_id={spec.spec_id})",
+    ]
+    if spec.description:
+        lines.append(f"  description: {spec.description}")
+    lines.extend(_window_lines(spec.start, spec.end, spec.stride, spec.warmup))
+    lines.append(f"  tasks:       {len(spec.tasks)}")
+    lines.append("")
+    for task in spec.tasks:
+        lines.append(describe_task(task, data_service))
+        lines.append("")
+    return "\n".join(lines).rstrip() + "\n"
+
+
+def _describe_multi_target_eval_spec(spec: MultiTargetEvalSpec, data_service: DataService | None) -> str:
+    lines = [
+        f"MultiTargetEvalSpec (spec_id={spec.spec_id})",
+    ]
+    if spec.description:
+        lines.append(f"  description: {spec.description}")
+    lines.extend(_window_lines(spec.start, spec.end, spec.stride, spec.warmup))
+    lines.append(f"  max_runs:    {spec.max_runs}")
+    lines.append(f"  tasks:       {len(spec.tasks)}")
+    lines.append("")
+    for task in spec.tasks:
+        lines.append(describe_task(task, data_service))
+        lines.append("")
+    return "\n".join(lines).rstrip() + "\n"
+
+
+def describe_spec(
+    spec: BacktestSpec | EvalSpec | MultiTargetBacktestSpec | MultiTargetEvalSpec,
+    data_service: DataService | None = None,
+) -> str:
+    """Return a plain-text description of any supported spec.
+
+    Dispatches on the spec type and produces a consistent multi-line layout
+    covering the window parameters, budget / run-count (where applicable),
+    and the full task definition(s).
+
+    Parameters
+    ----------
+    spec : BacktestSpec | EvalSpec | MultiTargetBacktestSpec | MultiTargetEvalSpec
+        The specification to describe.
+    data_service : DataService or None
+        Optional data service used to enrich target-series lines with
+        metadata (description, source, units, frequency).
+
+    Returns
+    -------
+    str
+        Multi-line description suitable for printing or embedding in a
+        prompt.
+    """
+    if isinstance(spec, MultiTargetBacktestSpec):
+        return _describe_multi_target_backtest_spec(spec, data_service)
+    if isinstance(spec, MultiTargetEvalSpec):
+        return _describe_multi_target_eval_spec(spec, data_service)
+    if isinstance(spec, EvalSpec):
+        return _describe_eval_spec(spec, data_service)
+    if isinstance(spec, BacktestSpec):
+        return _describe_backtest_spec(spec, data_service)
+    raise TypeError(f"Unsupported spec type: {type(spec).__name__}")
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__eval.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__eval.py.md
new file mode 100644
index 0000000..3f06a68
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__eval.py.md
@@ -0,0 +1,676 @@
+# Source: aieng-forecasting/aieng/forecasting/evaluation/eval.py
+
+kind: python
+
+```python
+"""EvalSpec, EvalResult, EvalTracker, and the evaluate() harness.
+
+Eval mode is distinct from backtesting: it is intended to estimate how well
+learned or backtested results generalise to recent, held-out data.  The key
+differences from a backtest are:
+
+- **Protected window** — the evaluation window should cover recent data that
+  has not been used for tuning or learning.  By convention, reference eval
+  specs live in ``implementations/<use-case>/specs/`` and are not modified
+  by participants.
+
+- **Run-budget control** — ``EvalSpec.max_runs`` optionally caps how many
+  times a participant is allowed to run a given eval.  When an
+  :class:`EvalTracker` is supplied to :func:`evaluate`, the budget is checked
+  before the run and the counter is incremented on success.  This prevents
+  inadvertent over-fitting to the held-out window.
+
+This module also provides :class:`MultiTargetEvalSpec` and
+:func:`multi_evaluate` for evaluating a predictor across multiple related tasks
+under a single shared budget.  A single ``multi_evaluate`` call counts as one
+run against the budget regardless of how many tasks are included.
+
+Intended usage in a bootcamp session::
+
+    import yaml
+    from pathlib import Path
+    from aieng.forecasting.evaluation import EvalSpec, EvalTracker, evaluate
+
+    with open("implementations/getting_started/specs/cpi_gasoline_eval_2025.yaml") as f:
+        spec = EvalSpec.model_validate(yaml.safe_load(f))
+
+    tracker = EvalTracker(Path("eval_runs.yaml"))
+    result = evaluate(my_predictor, spec, svc, tracker=tracker)
+    print(f"Eval mean {result.metric.upper()}: {result.mean_score:.4f}")
+
+If ``tracker`` is omitted, :func:`evaluate` runs unconditionally and sets
+``run_number=1``.
+"""
+
+from __future__ import annotations
+
+from datetime import datetime, timezone
+from pathlib import Path
+
+import numpy as np
+import yaml
+from aieng.forecasting.data.service import DataService
+from aieng.forecasting.evaluation.backtest import METRIC_BY_PAYLOAD_TYPE, ScoreMetric, _compute_origins, run_eval_loop
+from aieng.forecasting.evaluation.prediction import Prediction
+from aieng.forecasting.evaluation.predictor import Predictor
+from aieng.forecasting.evaluation.task import ForecastingTask
+from pydantic import AliasChoices, BaseModel, Field, model_validator
+
+
+# ---------------------------------------------------------------------------
+# Exceptions
+# ---------------------------------------------------------------------------
+
+
+class EvalBudgetExceededError(ValueError):
+    """Raised when an :class:`EvalTracker` has exhausted the run budget for a spec.
+
+    Parameters
+    ----------
+    spec_id : str
+        The identifier of the eval spec whose budget was exceeded.
+    runs_used : int
+        How many runs have already been recorded for this spec.
+    max_runs : int
+        The budget cap declared on the spec.
+    """
+
+    def __init__(self, spec_id: str, runs_used: int, max_runs: int) -> None:
+        self.spec_id = spec_id
+        self.runs_used = runs_used
+        self.max_runs = max_runs
+        super().__init__(
+            f"Eval budget exhausted for '{spec_id}': "
+            f"{runs_used}/{max_runs} runs already used. "
+            f"Run fewer evaluations against the held-out window to avoid over-fitting."
+        )
+
+
+# ---------------------------------------------------------------------------
+# EvalSpec
+# ---------------------------------------------------------------------------
+
+
+class EvalSpec(BaseModel):
+    """Specifies a protected evaluation window for estimating generalisation.
+
+    ``EvalSpec`` mirrors :class:`~aieng.forecasting.evaluation.backtest.BacktestSpec`
+    but adds two fields that make it suitable as a held-out, budget-controlled
+    evaluation mode:
+
+    - ``spec_id`` — a stable, human-readable identifier used by
+      :class:`EvalTracker` to key run counts.  Should be unique per spec file.
+    - ``max_runs`` — an optional cap on how many times this spec may be
+      evaluated by a single participant.  ``None`` means unlimited.
+
+    Like ``BacktestSpec``, ``EvalSpec`` is fully YAML-serializable.  Reference
+    eval specs live in ``implementations/<use-case>/specs/`` and are
+    versioned in the repo so that the exact window used for evaluation is
+    always reproducible.
+
+    Parameters
+    ----------
+    spec_id : str
+        Stable identifier for this spec, used by :class:`EvalTracker` to key
+        run counts.  Should be unique across all spec files.
+    task : ForecastingTask
+        The prediction problem to evaluate.
+    start : datetime
+        First candidate forecast origin.
+    end : datetime
+        Last candidate forecast origin (inclusive).
+    stride : int
+        Step size between origins in task-frequency units.
+    origin_dates : list[datetime] or None
+        Optional explicit forecast origins. When provided, :meth:`origins`
+        returns exactly these dates (sorted ascending) instead of deriving a
+        regular grid from ``start``/``end``/``stride``. Supports irregular
+        event calendars (e.g. Bank of Canada fixed announcement dates). All
+        dates must fall within ``[start, end]``.
+    warmup : int
+        Minimum number of observations required before a forecast origin is used.
+    max_runs : int or None
+        Maximum number of times this spec may be evaluated (per tracker).
+        ``None`` means unlimited.
+
+    Examples
+    --------
+    >>> spec = EvalSpec(
+    ...     spec_id="cpi_gasoline_eval_2025",
+    ...     task=ForecastingTask(
+    ...         task_id="cpi_gasoline_canada_1m",
+    ...         target_series_id="cpi_gasoline_canada",
+    ...         horizon=1,
+    ...         frequency="MS",
+    ...         description="CPI Gasoline Canada, 1-month ahead forecast",
+    ...     ),
+    ...     start=datetime(2025, 1, 1),
+    ...     end=datetime(2026, 3, 1),
+    ...     stride=1,
+    ...     warmup=24,
+    ...     max_runs=5,
+    ... )
+    """
+
+    spec_id: str = Field(description="Stable identifier for tracking; keyed by EvalTracker.")
+    task: ForecastingTask
+    start: datetime = Field(description="First candidate forecast origin.")
+    end: datetime = Field(description="Last candidate forecast origin (inclusive).")
+    stride: int = Field(default=1, ge=1, description="Step size between origins in task-frequency units.")
+    origin_dates: list[datetime] | None = Field(
+        default=None,
+        description=(
+            "Optional explicit forecast origins for irregular calendars (e.g. central bank "
+            "announcement dates). When set, overrides the start/end/stride grid derivation."
+        ),
+    )
+    warmup: int = Field(default=0, ge=0, description="Minimum observations required before first forecast.")
+    max_runs: int | None = Field(
+        default=None,
+        ge=1,
+        description="Maximum allowed evaluations against this spec (per tracker). None = unlimited.",
+    )
+    description: str = Field(
+        default="",
+        description="Free-form prose description of the eval intent (methodology, origin rationale, etc.).",
+    )
+
+    @model_validator(mode="after")
+    def start_before_end(self) -> "EvalSpec":
+        """Validate that start precedes end."""
+        if self.start >= self.end:
+            raise ValueError(f"start ({self.start}) must be before end ({self.end})")
+        return self
+
+    @model_validator(mode="after")
+    def origin_dates_in_window(self) -> "EvalSpec":
+        """Validate that explicit origin dates fall within [start, end]."""
+        if self.origin_dates is not None:
+            if not self.origin_dates:
+                raise ValueError("origin_dates must be non-empty when provided; omit it to derive origins.")
+            out_of_window = [d for d in self.origin_dates if not (self.start <= d <= self.end)]
+            if out_of_window:
+                raise ValueError(
+                    f"All origin_dates must fall within [start, end] = [{self.start}, {self.end}]. "
+                    f"Out of window: {out_of_window}"
+                )
+        return self
+
+    def origins(self) -> list[datetime]:
+        """Return the candidate forecast origins derived from this spec.
+
+        When ``origin_dates`` is set, those dates are returned sorted
+        ascending. Otherwise origins are derived from
+        ``start``/``end``/``stride`` on the task's frequency grid.
+
+        Returns
+        -------
+        list[datetime]
+            Candidate forecast origin dates, sorted ascending.
+        """
+        if self.origin_dates is not None:
+            return sorted(self.origin_dates)
+        return _compute_origins(self.start, self.end, self.task.frequency, self.stride)
+
+
+# ---------------------------------------------------------------------------
+# EvalResult
+# ---------------------------------------------------------------------------
+
+
+class EvalResult(BaseModel):
+    """The outcome of an eval run — analogous to ``BacktestResult`` for eval mode.
+
+    ``EvalResult`` carries the same payload as
+    :class:`~aieng.forecasting.evaluation.backtest.BacktestResult` plus
+    ``run_number``, which records which run against this spec this was (1st,
+    2nd, …).  This provenance field is populated automatically by
+    :func:`evaluate` using the :class:`EvalTracker`.
+
+    Parameters
+    ----------
+    eval_spec : EvalSpec
+        The exact spec that was evaluated.
+    predictor_id : str
+        Identifier for the predictor that produced these forecasts.
+    predictions : list[Prediction]
+        Every scored ``Prediction`` across the evaluated origins, in chronological order.
+    scores : list[float]
+        Score for each prediction. CRPS for continuous tasks, Brier for binary
+        tasks, RPS for categorical tasks. Lower is better.
+    metric : {"crps", "brier", "rps"}
+        Which scoring rule produced ``scores`` / ``mean_score``. Determined by
+        the task's ``payload_type``. Defaults to ``"crps"`` so artefacts
+        written before binary support existed still load correctly.
+    mean_score : float
+        Mean score across all evaluated origins. Older artefacts serialized
+        this field as ``mean_crps``; both keys are accepted on load.
+    ran_at : datetime
+        UTC wall-clock time when the eval was executed.
+    skipped_origins : int
+        Number of candidate origins skipped due to insufficient warmup,
+        repeated ``predict()`` failures, or missing ground truth.
+    run_number : int
+        Which run against this spec this was (1-indexed).  Set to 1 when no
+        tracker is supplied to :func:`evaluate`.
+    """
+
+    eval_spec: EvalSpec
+    predictor_id: str
+    predictions: list[Prediction]
+    scores: list[float]
+    metric: ScoreMetric = Field(
+        default="crps",
+        description="Scoring rule used: 'crps' (continuous), 'brier' (binary), or 'rps' (categorical).",
+    )
+    mean_score: float = Field(
+        validation_alias=AliasChoices("mean_score", "mean_crps"),
+        description="Mean score across all scored predictions (CRPS, Brier, or RPS; lower is better).",
+    )
+    ran_at: datetime
+    skipped_origins: int = Field(default=0)
+    run_number: int = Field(default=1, ge=1, description="Which run against this spec this was (1-indexed).")
+
+    @model_validator(mode="after")
+    def lengths_match(self) -> "EvalResult":
+        """Validate that predictions and scores have the same length."""
+        if len(self.predictions) != len(self.scores):
+            raise ValueError(
+                f"predictions ({len(self.predictions)}) and scores ({len(self.scores)}) must have the same length"
+            )
+        return self
+
+
+# ---------------------------------------------------------------------------
+# EvalTracker
+# ---------------------------------------------------------------------------
+
+
+class EvalTracker:
+    """Persists run counts for eval specs to a YAML file.
+
+    Each call to :meth:`record` increments the run counter for the given
+    ``spec_id`` and writes the updated state to disk.  On the next call to
+    :func:`evaluate`, the counter is read back via :meth:`runs_for` before the
+    run begins so that the budget cap in :attr:`EvalSpec.max_runs` can be
+    enforced.
+
+    The tracking file is created on first write; the directory must already
+    exist.
+
+    Tracking file format::
+
+        cpi_gasoline_eval_2025:
+          runs: 2
+          last_run_at: "2026-04-02T10:00:00"
+
+    Parameters
+    ----------
+    path : Path
+        Path to the YAML tracking file.
+
+    Examples
+    --------
+    >>> tracker = EvalTracker(Path("eval_runs.yaml"))
+    >>> tracker.runs_for("my_spec")
+    0
+    >>> tracker.record("my_spec", datetime.now(timezone.utc).replace(tzinfo=None))
+    >>> tracker.runs_for("my_spec")
+    1
+    """
+
+    def __init__(self, path: Path) -> None:
+        self._path = path
+
+    @property
+    def path(self) -> Path:
+        """Path to the YAML tracking file."""
+        return self._path
+
+    def _load(self) -> dict[str, dict[str, object]]:
+        if not self._path.exists():
+            return {}
+        with self._path.open() as f:
+            data = yaml.safe_load(f)
+        return data if isinstance(data, dict) else {}
+
+    def _save(self, data: dict[str, dict[str, object]]) -> None:
+        with self._path.open("w") as f:
+            yaml.dump(data, f, default_flow_style=False, sort_keys=True)
+
+    def runs_for(self, spec_id: str) -> int:
+        """Return the number of runs already recorded for ``spec_id``.
+
+        Parameters
+        ----------
+        spec_id : str
+            The eval spec identifier to query.
+
+        Returns
+        -------
+        int
+            Number of runs recorded; 0 if ``spec_id`` has never been run.
+        """
+        data = self._load()
+        entry = data.get(spec_id, {})
+        runs_val = entry.get("runs", 0)
+        return runs_val if isinstance(runs_val, int) else int(str(runs_val))
+
+    def record(self, spec_id: str, ran_at: datetime) -> None:
+        """Increment the run counter for ``spec_id`` and persist to disk.
+
+        Parameters
+        ----------
+        spec_id : str
+            The eval spec identifier to update.
+        ran_at : datetime
+            The UTC time of the run being recorded.
+        """
+        data = self._load()
+        entry = data.get(spec_id, {"runs": 0})
+        runs_val = entry.get("runs", 0)
+        current = runs_val if isinstance(runs_val, int) else int(str(runs_val))
+        entry["runs"] = current + 1
+        entry["last_run_at"] = ran_at.isoformat()
+        data[spec_id] = entry
+        self._save(data)
+
+
+# ---------------------------------------------------------------------------
+# evaluate() harness
+# ---------------------------------------------------------------------------
+
+
+def evaluate(
+    predictor: Predictor,
+    spec: EvalSpec,
+    data_service: DataService,
+    tracker: EvalTracker | None = None,
+) -> EvalResult:
+    """Run an evaluation of a predictor against a protected :class:`EvalSpec`.
+
+    Behaves identically to :func:`~aieng.forecasting.evaluation.backtest.backtest`
+    at the forecast level, but additionally:
+
+    1. **Budget check** — if ``tracker`` is provided and ``spec.max_runs`` is
+       set, the run is refused with :exc:`EvalBudgetExceededError` if the
+       budget has been exhausted.
+    2. **Run recording** — after a successful run, ``tracker.record()`` is
+       called so the run counts against the budget on subsequent attempts.
+    3. **Provenance** — :attr:`EvalResult.run_number` records which run this
+       was (derived from the tracker, or 1 if no tracker is supplied).
+
+    Parameters
+    ----------
+    predictor : Predictor
+        The forecasting model to evaluate.
+    spec : EvalSpec
+        Defines the task, evaluation window, stride, warmup, and optional
+        run budget.
+    data_service : DataService
+        Pre-populated data service. Must have the target series registered.
+    tracker : EvalTracker or None
+        Optional tracker for budget enforcement and run-count provenance.
+        If ``None``, the run proceeds unconditionally and ``run_number`` is 1.
+
+    Returns
+    -------
+    EvalResult
+        A fully populated result record including all predictions, scores,
+        and run provenance.
+
+    Raises
+    ------
+    EvalBudgetExceededError
+        If ``tracker`` is provided, ``spec.max_runs`` is set, and the budget
+        has been exhausted.
+    KeyError
+        If the target series is not registered in the data service.
+    ValueError
+        If no origins produce a resolvable prediction (all skipped).
+
+    Examples
+    --------
+    >>> result = evaluate(predictor=my_predictor, spec=spec, data_service=svc)
+    >>> print(f"Eval mean {result.metric.upper()}: {result.mean_score:.4f}")
+    """
+    runs_used = tracker.runs_for(spec.spec_id) if tracker is not None else 0
+
+    if tracker is not None and spec.max_runs is not None and runs_used >= spec.max_runs:
+        raise EvalBudgetExceededError(
+            spec_id=spec.spec_id,
+            runs_used=runs_used,
+            max_runs=spec.max_runs,
+        )
+
+    ran_at = datetime.now(tz=timezone.utc).replace(tzinfo=None)
+
+    predictions, scores, skipped = run_eval_loop(
+        predictor=predictor,
+        task=spec.task,
+        origins=spec.origins(),
+        warmup=spec.warmup,
+        data_service=data_service,
+    )
+
+    if tracker is not None:
+        tracker.record(spec.spec_id, ran_at)
+
+    return EvalResult(
+        eval_spec=spec,
+        predictor_id=predictor.predictor_id,
+        predictions=predictions,
+        scores=scores,
+        metric=METRIC_BY_PAYLOAD_TYPE[spec.task.payload_type],
+        mean_score=float(np.mean(scores)),
+        ran_at=ran_at,
+        skipped_origins=skipped,
+        run_number=runs_used + 1,
+    )
+
+
+# ---------------------------------------------------------------------------
+# MultiTargetEvalSpec and multi_evaluate()  # noqa: ERA001
+# ---------------------------------------------------------------------------
+
+
+class MultiTargetEvalSpec(BaseModel):
+    """Eval spec that assesses a predictor across multiple related tasks.
+
+    ``MultiTargetEvalSpec`` is the eval-mode counterpart to
+    :class:`~aieng.forecasting.evaluation.backtest.MultiTargetBacktestSpec`.
+    It groups several :class:`ForecastingTask` objects under a single shared
+    evaluation window and a single run budget.
+
+    **Budget semantics:** One call to :func:`multi_evaluate` counts as *one*
+    run against ``max_runs``, regardless of how many tasks are included.  This
+    means the budget governs "evaluation sessions", not individual series.
+
+    All tasks must share the same ``frequency``; this is enforced at
+    construction time.
+
+    Parameters
+    ----------
+    spec_id : str
+        Stable identifier for this spec, used by :class:`EvalTracker` to key
+        run counts.  Should be unique across all spec files.
+    tasks : list[ForecastingTask]
+        The prediction problems to evaluate.  All must share the same
+        ``frequency``.
+    start : datetime
+        First candidate forecast origin.
+    end : datetime
+        Last candidate forecast origin (inclusive).
+    stride : int
+        Step size between origins in task-frequency units.
+    warmup : int
+        Minimum observations required before a forecast origin is used.
+    max_runs : int or None
+        Maximum number of ``multi_evaluate`` calls allowed (per tracker).
+        ``None`` means unlimited.
+
+    Examples
+    --------
+    >>> spec = MultiTargetEvalSpec(
+    ...     spec_id="food_cpi_18m_eval",
+    ...     tasks=[task_food, task_meat, task_dairy],
+    ...     start=datetime(2022, 7, 1),
+    ...     end=datetime(2024, 7, 1),
+    ...     stride=6,
+    ...     warmup=24,
+    ...     max_runs=5,
+    ... )
+    """
+
+    spec_id: str = Field(description="Stable identifier for tracking; keyed by EvalTracker.")
+    tasks: list[ForecastingTask] = Field(
+        min_length=1, description="Prediction problems; all must share the same frequency."
+    )
+    start: datetime = Field(description="First candidate forecast origin.")
+    end: datetime = Field(description="Last candidate forecast origin (inclusive).")
+    stride: int = Field(default=1, ge=1, description="Step size between origins in task-frequency units.")
+    warmup: int = Field(default=0, ge=0, description="Minimum observations required before first forecast.")
+    max_runs: int | None = Field(
+        default=None,
+        ge=1,
+        description="Maximum allowed evaluation sessions against this spec (per tracker). None = unlimited.",
+    )
+    description: str = Field(
+        default="",
+        description="Free-form prose description of the eval intent (methodology, origin rationale, etc.).",
+    )
+
+    @model_validator(mode="after")
+    def _validate(self) -> "MultiTargetEvalSpec":
+        if self.start >= self.end:
+            raise ValueError(f"start ({self.start}) must be before end ({self.end})")
+        frequencies = {t.frequency for t in self.tasks}
+        if len(frequencies) > 1:
+            raise ValueError(
+                f"All tasks in a MultiTargetEvalSpec must share the same frequency. Found: {sorted(frequencies)}"
+            )
+        return self
+
+    def specs(self) -> list[EvalSpec]:
+        """Decompose into one :class:`EvalSpec` per task.
+
+        The individual specs share ``spec_id`` and window parameters.  They are
+        intended for internal use by :func:`multi_evaluate` — the budget is
+        enforced once at the multi-target level, not per task.
+
+        Returns
+        -------
+        list[EvalSpec]
+            One spec per task, sharing ``spec_id``, window, and budget fields.
+        """
+        return [
+            EvalSpec(
+                spec_id=self.spec_id,
+                task=t,
+                start=self.start,
+                end=self.end,
+                stride=self.stride,
+                warmup=self.warmup,
+                max_runs=self.max_runs,
+                description=self.description,
+            )
+            for t in self.tasks
+        ]
+
+
+def multi_evaluate(
+    predictor: Predictor,
+    spec: MultiTargetEvalSpec,
+    data_service: DataService,
+    tracker: EvalTracker | None = None,
+) -> dict[str, EvalResult]:
+    """Run an evaluation of a predictor across all tasks in a MultiTargetEvalSpec.
+
+    The budget check and tracker increment happen *once* for the entire
+    multi-target evaluation — one call counts as one run regardless of how
+    many tasks are in the spec.  All tasks then run using the same
+    underlying :func:`evaluate`-level loop, but without re-checking the budget
+    for each individual task.
+
+    Parameters
+    ----------
+    predictor : Predictor
+        The forecasting model to evaluate.
+    spec : MultiTargetEvalSpec
+        Defines the tasks, shared evaluation window, stride, warmup, and
+        optional run budget.
+    data_service : DataService
+        Pre-populated data service.  Must have all target series registered.
+    tracker : EvalTracker or None
+        Optional tracker for budget enforcement and run-count provenance.
+        If ``None``, runs unconditionally and ``run_number`` is 1 on all results.
+
+    Returns
+    -------
+    dict[str, EvalResult]
+        Eval results keyed by ``task_id``, one entry per task.
+
+    Raises
+    ------
+    EvalBudgetExceededError
+        If ``tracker`` is provided, ``spec.max_runs`` is set, and the budget
+        has been exhausted.
+    KeyError
+        If any target series is not registered in the data service.
+    ValueError
+        If no origins can be scored for any task.
+
+    Examples
+    --------
+    >>> results = multi_evaluate(my_predictor, spec, svc, tracker=tracker)
+    >>> for task_id, result in results.items():
+    ...     print(f"{task_id}: mean {result.metric.upper()} = {result.mean_score:.4f}")
+    """
+    runs_used = tracker.runs_for(spec.spec_id) if tracker is not None else 0
+
+    if tracker is not None and spec.max_runs is not None and runs_used >= spec.max_runs:
+        raise EvalBudgetExceededError(
+            spec_id=spec.spec_id,
+            runs_used=runs_used,
+            max_runs=spec.max_runs,
+        )
+
+    ran_at = datetime.now(tz=timezone.utc).replace(tzinfo=None)
+    run_number = runs_used + 1
+
+    results: dict[str, EvalResult] = {}
+    for task in spec.tasks:
+        predictions, scores, skipped = run_eval_loop(
+            predictor=predictor,
+            task=task,
+            origins=_compute_origins(spec.start, spec.end, task.frequency, spec.stride),
+            warmup=spec.warmup,
+            data_service=data_service,
+        )
+        task_eval_spec = EvalSpec(
+            spec_id=spec.spec_id,
+            task=task,
+            start=spec.start,
+            end=spec.end,
+            stride=spec.stride,
+            warmup=spec.warmup,
+            max_runs=spec.max_runs,
+            description=spec.description,
+        )
+        results[task.task_id] = EvalResult(
+            eval_spec=task_eval_spec,
+            predictor_id=predictor.predictor_id,
+            predictions=predictions,
+            scores=scores,
+            metric=METRIC_BY_PAYLOAD_TYPE[task.payload_type],
+            mean_score=float(np.mean(scores)),
+            ran_at=ran_at,
+            skipped_origins=skipped,
+            run_number=run_number,
+        )
+
+    if tracker is not None:
+        tracker.record(spec.spec_id, ran_at)
+
+    return results
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__langfuse_traces.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__langfuse_traces.py.md
new file mode 100644
index 0000000..319fa3e
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__langfuse_traces.py.md
@@ -0,0 +1,298 @@
+# Source: aieng-forecasting/aieng/forecasting/evaluation/langfuse_traces.py
+
+kind: python
+
+```python
+"""Langfuse trace-evaluation plumbing: stamp forecasts, fetch traces, push scores.
+
+This module makes the **Langfuse trace** the canonical record a trace evaluator
+reads from and writes back to. It owns three jobs:
+
+1. **Stamp** the structured forecast onto the trace at generation time
+   (:func:`stamp_forecast_on_trace`) so the model's rationale and distribution can
+   be read straight back from the trace rather than from a local cache. The
+   forecast is written as the ``output`` of a dedicated ``forecast`` child
+   observation: observation I/O is the supported surface (trace-level
+   ``input``/``output`` is deprecated in the v4 SDK). Stamping works either in the
+   active trace context (the LLMP path) or **post-hoc by trace id** (the agent
+   path, whose trace is created on a worker thread the caller cannot see).
+2. **Fetch** a trace by id with readiness polling
+   (:func:`fetch_trace_with_wait`), because trace ingestion is asynchronous — a
+   freshly-emitted trace may not yet carry the ``forecast`` observation when the
+   evaluator looks.
+3. **Push** an evaluation result back as a Langfuse score
+   (:func:`push_trace_score`), dispatching the score ``data_type`` from the
+   Python value type.
+
+The fetch/score/readiness pattern is a trimmed, **Langfuse v4** adaptation of the
+trace-evaluation pass in VectorInstitute/eval-agents
+(``aieng/agent_evals/evaluation/trace.py``); that reference targets the Langfuse
+v3 SDK, so the API calls here (``client.api.trace.get`` /
+``start_as_current_observation`` / ``create_score``) are the v4 equivalents.
+
+Langfuse is an optional dependency (the ``llm`` / ``agentic`` extras); every entry
+point imports it lazily and degrades to a guarded no-op when it is absent, so
+importing this module never requires the package.
+"""
+
+from __future__ import annotations
+
+import logging
+import time
+from typing import Any, Callable, Sequence
+
+from aieng.forecasting.evaluation.prediction import CategoricalForecast, Prediction
+
+
+logger = logging.getLogger(__name__)
+
+
+#: Name of the child observation whose ``output`` holds the stamped forecasts.
+FORECAST_OBSERVATION_NAME = "forecast"
+
+#: Key under the observation ``output`` that holds the list of forecast dicts.
+FORECAST_TRACE_OUTPUT_KEY = "forecasts"
+
+
+def _get_client(client: Any | None = None) -> Any:
+    """Return the given client, or the process-wide Langfuse client."""
+    if client is not None:
+        return client
+    from langfuse import get_client  # noqa: PLC0415
+
+    return get_client()
+
+
+# --------------------------------------------------------------------------- #
+# Stamp: generation side
+# --------------------------------------------------------------------------- #
+def _forecast_to_dict(pred: Prediction) -> dict[str, Any] | None:
+    """Project one rationale-bearing categorical prediction to a trace-output dict.
+
+    Returns ``None`` for predictions the rationale evaluator ignores
+    (non-categorical, or without a stated rationale), so they are not stamped.
+    """
+    if not isinstance(pred.payload, CategoricalForecast):
+        return None
+    metadata = pred.metadata or {}
+    rationale = str(metadata.get("rationale", "") or "").strip()
+    if not rationale:
+        return None
+    forecast_date = pred.forecast_date
+    return {
+        "predictor_id": pred.predictor_id,
+        "task_id": pred.task_id,
+        "forecast_date": forecast_date.isoformat() if hasattr(forecast_date, "isoformat") else str(forecast_date),
+        "probabilities": dict(pred.payload.probabilities),
+        "rationale": rationale,
+        "key_signals": list(metadata.get("key_signals", []) or []),
+        "confidence": str(metadata.get("confidence", "") or ""),
+    }
+
+
+def stamp_forecast_on_trace(
+    predictions: Sequence[Prediction], *, trace_id: str | None = None, client: Any | None = None
+) -> bool:
+    """Write the structured forecast(s) onto a ``forecast`` observation in the trace.
+
+    Creates a child observation named ``forecast`` whose ``output`` carries the
+    rationale, predicted distribution, and forecast date, so they can be read back
+    by :func:`read_forecasts_from_trace`. Only rationale-bearing categorical
+    predictions are stamped.
+
+    Parameters
+    ----------
+    predictions : sequence of Prediction
+        The predictions to stamp (filtered to rationale-bearing categorical ones).
+    trace_id : str or None
+        When given, the observation is attached to that trace **post-hoc** (the
+        agent path, whose trace is created on a worker thread). When ``None``, it
+        is created in the active trace context (the LLMP path, inside ``@observe``).
+    client : Langfuse client, optional
+        Defaults to the process-wide client.
+
+    Returns ``True`` when something was stamped, ``False`` on no-op (nothing to
+    stamp, or Langfuse unavailable). Best-effort: never raises.
+    """
+    forecasts = [d for d in (_forecast_to_dict(p) for p in predictions) if d is not None]
+    if not forecasts:
+        return False
+    try:
+        client = _get_client(client)
+        kwargs: dict[str, Any] = {"name": FORECAST_OBSERVATION_NAME, "as_type": "span"}
+        if trace_id is not None:
+            from langfuse.types import TraceContext  # noqa: PLC0415
+
+            kwargs["trace_context"] = TraceContext(trace_id=trace_id)
+        with client.start_as_current_observation(**kwargs) as observation:
+            observation.update(output={FORECAST_TRACE_OUTPUT_KEY: forecasts})
+        return True
+    except Exception:  # pragma: no cover - guarded no-op when tracing is unavailable
+        logger.debug("Could not stamp forecast onto Langfuse trace.", exc_info=True)
+        return False
+
+
+def _forecasts_from_output(output: Any) -> list[dict[str, Any]]:
+    """Extract the forecast list from an observation/trace ``output`` payload."""
+    if isinstance(output, dict):
+        forecasts = output.get(FORECAST_TRACE_OUTPUT_KEY)
+        if isinstance(forecasts, list):
+            return [f for f in forecasts if isinstance(f, dict)]
+    return []
+
+
+def read_forecasts_from_trace(trace: Any) -> list[dict[str, Any]]:
+    """Return the stamped forecast dicts from a fetched trace, or ``[]``.
+
+    Reads the ``forecast`` child observation's output (the supported surface);
+    falls back to trace-level ``output`` for older traces stamped at trace level.
+    """
+    for observation in getattr(trace, "observations", None) or []:
+        if getattr(observation, "name", None) == FORECAST_OBSERVATION_NAME:
+            forecasts = _forecasts_from_output(getattr(observation, "output", None))
+            if forecasts:
+                return forecasts
+    return _forecasts_from_output(getattr(trace, "output", None))
+
+
+def trace_has_forecast(trace: Any) -> bool:
+    """Readiness predicate: the trace carries at least one stamped forecast."""
+    return bool(read_forecasts_from_trace(trace))
+
+
+# --------------------------------------------------------------------------- #
+# Fetch: evaluation side (readiness polling)
+# --------------------------------------------------------------------------- #
+def _is_retryable_trace_fetch_error(exc: BaseException) -> bool:
+    """Whether a trace-fetch error is worth retrying (ingestion still in flight)."""
+    name = type(exc).__name__
+    if name == "NotFoundError":  # trace id not yet ingested
+        return True
+    if name in {"ConnectError", "ConnectTimeout", "ReadError", "ReadTimeout", "RemoteProtocolError", "TimeoutError"}:
+        return True
+    status = getattr(exc, "status_code", None)
+    return isinstance(status, int) and (status in (408, 429) or 500 <= status < 600)
+
+
+def fetch_trace_with_wait(
+    trace_id: str,
+    *,
+    client: Any | None = None,
+    max_wait_s: float = 30.0,
+    initial_delay_s: float = 1.0,
+    max_delay_s: float = 8.0,
+    ready: Callable[[Any], bool] = trace_has_forecast,
+    sleep: Callable[[float], None] = time.sleep,
+) -> Any | None:
+    """Fetch a trace by id, polling until ``ready(trace)`` or the budget expires.
+
+    Trace ingestion is asynchronous, so a just-emitted trace may 404 or lack its
+    output briefly. This retries on transient/not-found errors with exponential
+    backoff up to ``max_wait_s``.
+
+    Returns the ready trace, or ``None`` if it never became ready within the
+    budget (the caller should *skip*, not fail — mirrors eval-agents' SKIPPED
+    bucket). Raises only on non-retryable errors.
+    """
+    client = _get_client(client)
+    delay = initial_delay_s
+    waited = 0.0
+    while True:
+        try:
+            trace = client.api.trace.get(trace_id)
+            if ready(trace):
+                return trace
+        except Exception as exc:  # noqa: BLE001 - re-raised below unless retryable
+            if not _is_retryable_trace_fetch_error(exc):
+                raise
+        if waited >= max_wait_s:
+            return None
+        step = min(delay, max_delay_s, max_wait_s - waited)
+        sleep(step)
+        waited += step
+        delay *= 2
+
+
+def list_trace_ids(
+    *,
+    client: Any | None = None,
+    name: str | None = None,
+    tags: Sequence[str] | str | None = None,
+    since: Any | None = None,
+    limit: int = 50,
+) -> list[str]:
+    """Discover trace ids by name/tags/time window (best-effort; ``[]`` on error)."""
+    try:
+        client = _get_client(client)
+        response = client.api.trace.list(name=name, tags=tags, from_timestamp=since, limit=limit)
+        data = getattr(response, "data", None) or []
+        return [trace.id for trace in data if getattr(trace, "id", None)]
+    except Exception:  # pragma: no cover - guarded no-op when listing is unavailable
+        logger.debug("Could not list Langfuse traces.", exc_info=True)
+        return []
+
+
+# --------------------------------------------------------------------------- #
+# Push: write evaluation results back as scores
+# --------------------------------------------------------------------------- #
+def push_trace_score(
+    trace_id: str,
+    name: str,
+    value: bool | int | float | str,
+    *,
+    client: Any | None = None,
+    comment: str | None = None,
+    metadata: dict[str, Any] | None = None,
+    config_id: str | None = None,
+) -> bool:
+    """Push one Langfuse score to ``trace_id``, picking ``data_type`` from ``value``.
+
+    ``bool`` -> ``BOOLEAN``, ``int``/``float`` -> ``NUMERIC``, ``str`` ->
+    ``CATEGORICAL`` (mirrors eval-agents' ``_upload_trace_scores``). Returns
+    whether the score was pushed; guarded no-op (``False``) without Langfuse.
+    """
+    data_type: str
+    score_value: bool | float | str
+    if isinstance(value, bool):  # bool before int: bool is an int subclass
+        data_type, score_value = "BOOLEAN", value
+    elif isinstance(value, (int, float)):
+        data_type, score_value = "NUMERIC", float(value)
+    else:
+        data_type, score_value = "CATEGORICAL", str(value)
+    try:
+        client = _get_client(client)
+        client.create_score(
+            name=name,
+            value=score_value,
+            trace_id=trace_id,
+            data_type=data_type,
+            comment=comment,
+            metadata=metadata,
+            config_id=config_id,
+        )
+        return True
+    except Exception:  # pragma: no cover - guarded no-op when scoring is unavailable
+        logger.debug("Could not push Langfuse score %r to trace %s.", name, trace_id, exc_info=True)
+        return False
+
+
+def flush_scores(client: Any | None = None) -> None:
+    """Flush pending score/trace exports (best-effort)."""
+    try:
+        _get_client(client).flush()
+    except Exception:  # pragma: no cover - guarded no-op when tracing is unavailable
+        logger.debug("Langfuse flush failed.", exc_info=True)
+
+
+__all__ = [
+    "FORECAST_OBSERVATION_NAME",
+    "FORECAST_TRACE_OUTPUT_KEY",
+    "fetch_trace_with_wait",
+    "flush_scores",
+    "list_trace_ids",
+    "push_trace_score",
+    "read_forecasts_from_trace",
+    "stamp_forecast_on_trace",
+    "trace_has_forecast",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__prediction.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__prediction.py.md
new file mode 100644
index 0000000..abd8b64
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__prediction.py.md
@@ -0,0 +1,191 @@
+# Source: aieng-forecasting/aieng/forecasting/evaluation/prediction.py
+
+kind: python
+
+```python
+"""Prediction payload types and the Prediction metadata wrapper."""
+
+from datetime import datetime
+from math import isfinite
+from typing import Any
+
+from pydantic import BaseModel, Field, field_validator
+
+
+#: Standard quantile levels recommended for every ContinuousForecast.
+STANDARD_QUANTILES: list[float] = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
+
+
+class ContinuousForecast(BaseModel):
+    """Probabilistic forecast payload for one continuous future target.
+
+    Stores a point estimate and a set of quantile forecasts at standard levels.
+    The quantile grid (0.05 … 0.95) is dense enough to compute a good CRPS
+    approximation and compact enough to be stored in YAML alongside the
+    full prediction record.
+
+    Parameters
+    ----------
+    point_forecast : float
+        Central estimate — typically the median (0.50 quantile) of the
+        predictive distribution.
+    quantiles : dict[float, float]
+        Mapping from quantile level to forecast value. Keys must
+        be strictly in ``(0, 1)``; values are the corresponding forecast
+        values. The standard levels in :data:`STANDARD_QUANTILES` are
+        recommended for compatibility with the CRPS scorer, but any set of
+        quantile keys in range is accepted.
+
+    Examples
+    --------
+    >>> fc = ContinuousForecast(
+    ...     point_forecast=160.5,
+    ...     quantiles={0.05: 155.0, 0.50: 160.5, 0.95: 166.0},
+    ... )
+    """
+
+    point_forecast: float = Field(description="Central estimate of the predictive distribution.")
+    quantiles: dict[float, float] = Field(
+        description="Quantile forecasts. Keys are quantile levels in (0, 1); values are forecast values."
+    )
+
+    @field_validator("quantiles")
+    @classmethod
+    def quantile_keys_in_range(cls, v: dict[float, float]) -> dict[float, float]:
+        """Validate that all quantile keys are strictly in (0, 1)."""
+        bad = [q for q in v if not (0.0 < q < 1.0)]
+        if bad:
+            raise ValueError(f"Quantile keys must be in (0, 1). Invalid keys: {bad}")
+        return v
+
+
+class BinaryForecast(BaseModel):
+    """Binary event probability payload for discrete-event forecasting tasks.
+
+    Parameters
+    ----------
+    probability : float
+        Predicted probability that the event resolves ``True``, in ``[0, 1]``.
+    """
+
+    probability: float = Field(ge=0.0, le=1.0, description="Predicted probability the event occurs.")
+
+    @field_validator("probability")
+    @classmethod
+    def probability_is_finite(cls, value: float) -> float:
+        """Reject NaN and infinite probabilities."""
+        if not isfinite(value):
+            raise ValueError("Probability must be a finite number.")
+        return value
+
+
+class CategoricalForecast(BaseModel):
+    """Ordered-categorical probability payload for one future target.
+
+    The category order and allowed label set are declared on the
+    :class:`~aieng.forecasting.evaluation.task.ForecastingTask` via
+    ``task.categories``. The scorer aligns this probability dictionary to that
+    task-declared order before computing the Ranked Probability Score.
+
+    Parameters
+    ----------
+    probabilities : dict[str, float]
+        Mapping from category label to predicted probability. Values must be
+        finite probabilities in ``[0, 1]`` and sum to 1 within absolute
+        tolerance ``1e-6``.
+    """
+
+    probabilities: dict[str, float] = Field(description="Predicted probability for each category label.")
+
+    @field_validator("probabilities")
+    @classmethod
+    def probabilities_are_valid(cls, value: dict[str, float]) -> dict[str, float]:
+        """Validate that probabilities form a finite distribution."""
+        if len(value) < 2:
+            raise ValueError("Categorical probabilities must include at least two categories.")
+        bad_finite = [label for label, probability in value.items() if not isfinite(probability)]
+        if bad_finite:
+            raise ValueError(f"Categorical probabilities must be finite. Invalid labels: {bad_finite}")
+        bad_range = [label for label, probability in value.items() if not (0.0 <= probability <= 1.0)]
+        if bad_range:
+            raise ValueError(f"Categorical probabilities must be in [0, 1]. Invalid labels: {bad_range}")
+        total = sum(value.values())
+        if abs(total - 1.0) > 1e-6:
+            raise ValueError(f"Categorical probabilities must sum to 1 within 1e-6; got {total}.")
+        return value
+
+
+ForecastPayload = ContinuousForecast | BinaryForecast | CategoricalForecast
+
+
+class Prediction(BaseModel):
+    """A single forecast submission — metadata wrapper around a forecast payload.
+
+    ``Prediction`` is the unit of exchange between a :class:`Predictor` and the
+    evaluation harness. It carries all the metadata needed to score, persist,
+    and compare forecasts independently of the system that produced them.
+
+    Designed to be YAML-serializable so it can be:
+
+    - Persisted alongside a predictor implementation.
+    - Passed as structured context to downstream agents.
+    - Used as the unit of submission in a live evaluation or competition.
+
+    Parameters
+    ----------
+    predictor_id : str
+        Identifier for the predictor that issued this forecast.
+    task_id : str
+        Identifier for the
+        :class:`~aieng.forecasting.evaluation.task.ForecastingTask` this
+        prediction is for.
+    issued_at : datetime
+        Wall-clock time when the prediction was generated.
+    as_of : datetime
+        Information cutoff used — the ``as_of`` date of the
+        :class:`~aieng.forecasting.data.context.ForecastContext` passed to the
+        predictor.
+    forecast_date : datetime
+        The future date being predicted (``as_of`` + horizon steps).
+    payload : ContinuousForecast | BinaryForecast | CategoricalForecast
+        The forecast payload.
+    metadata : dict[str, Any]
+        Optional free-form metadata the predictor wants to return alongside the
+        forecast. The evaluation harness never reads or validates this field —
+        it passes through transparently into ``BacktestResult.predictions`` and
+        ``EvalResult.predictions``. Use it to surface structured side-channel
+        data: token counts, source lists, intermediate statistics, agent trace
+        IDs, etc. Anything requiring richer structure should be stored
+        externally (e.g. in Langfuse) and referenced here by ID.
+
+    Examples
+    --------
+    >>> from datetime import datetime
+    >>> pred = Prediction(
+    ...     predictor_id="arima_auto",
+    ...     task_id="cpi_all_items_canada_12m",
+    ...     issued_at=datetime(2024, 1, 1),
+    ...     as_of=datetime(2024, 1, 1),
+    ...     forecast_date=datetime(2025, 1, 1),
+    ...     payload=ContinuousForecast(
+    ...         point_forecast=162.3,
+    ...         quantiles={0.05: 157.0, 0.50: 162.3, 0.95: 167.8},
+    ...     ),
+    ... )
+    """
+
+    predictor_id: str = Field(description="Identifier for the predictor that issued this forecast.")
+    task_id: str = Field(description="Identifier for the ForecastingTask this prediction answers.")
+    issued_at: datetime = Field(description="Wall-clock time when the prediction was generated.")
+    as_of: datetime = Field(description="Information cutoff used when generating this prediction.")
+    forecast_date: datetime = Field(description="The future date being predicted.")
+    payload: ForecastPayload = Field(description="The forecast payload.")
+    metadata: dict[str, Any] = Field(
+        default_factory=dict,
+        description=(
+            "Optional free-form metadata returned alongside the forecast. "
+            "Ignored by the evaluation harness; passes through transparently. "
+            "Use for token counts, source lists, trace IDs, etc."
+        ),
+    )
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__predictor.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__predictor.py.md
new file mode 100644
index 0000000..e8cef8f
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__predictor.py.md
@@ -0,0 +1,134 @@
+# Source: aieng-forecasting/aieng/forecasting/evaluation/predictor.py
+
+kind: python
+
+```python
+"""Predictor ABC — the interface all forecasting models must implement."""
+
+from abc import ABC, abstractmethod
+
+from aieng.forecasting.data.context import ForecastContext
+from aieng.forecasting.evaluation.prediction import Prediction
+from aieng.forecasting.evaluation.task import ForecastingTask
+
+
+class Predictor(ABC):
+    """Abstract base class for all forecasting models.
+
+    A ``Predictor`` encapsulates everything about *how* a forecasting problem
+    is solved: which series to request from the data service, how to handle
+    gaps, what model or agent to use, and how to produce probabilistic
+    forecasts.
+
+    The interface is deliberately minimal — a ``predict`` method and a
+    ``predictor_id`` property. This means any two implementations — a vanilla
+    ARIMA and a multi-step LLM agent — can be evaluated against the same
+    :class:`~aieng.forecasting.evaluation.task.ForecastingTask` without the
+    evaluation harness needing to know anything about either.
+
+    **Multi-horizon forecasting:** ``predict()`` returns ``list[Prediction]``
+    — one ``Prediction`` per horizon step declared in ``task.horizons``.
+    Single-horizon tasks produce a one-element list; multi-horizon tasks
+    produce one element per requested step.
+
+    This design lets trajectory-based models (Darts, LLMs) produce a coherent
+    forecast path in one call, while also making single-step predictors natural
+    (just return a one-element list). The evaluation harness scores each
+    ``Prediction`` in the list independently and accumulates the results in a
+    flat ``BacktestResult``.
+
+    **Backtesting vs live evaluation:** the predictor never knows which mode
+    it is in. The harness creates a
+    :class:`~aieng.forecasting.data.context.ForecastContext` scoped to the
+    appropriate ``as_of`` date and passes it in. The predictor's code is
+    identical in both modes.
+
+    **Information discipline:** the ``ForecastContext`` enforces the
+    information cutoff for deterministic data (historical series). For
+    agentic predictors that use live tools (web search, news APIs), the cutoff
+    cannot be enforced structurally — this is a known limitation and is part
+    of the challenge of evaluating such predictors.
+
+    **Side-effects and metadata:** predictors are free to write logs, traces,
+    or other artifacts as side-effects of ``predict()``. For structured data
+    that should travel *with* each prediction (token counts, source lists,
+    agent trace IDs), populate the ``Prediction.metadata`` dict. The harness
+    passes it through transparently.
+
+    Examples
+    --------
+    Implementing a trivial constant predictor::
+
+        class ConstantPredictor(Predictor):
+            def __init__(self, value: float) -> None:
+                self._value = value
+
+            @property
+            def predictor_id(self) -> str:
+                return "constant"
+
+            def predict(
+                self,
+                task: ForecastingTask,
+                context: ForecastContext,
+            ) -> list[Prediction]:
+                from datetime import datetime, timezone
+                import pandas as pd
+                from aieng.forecasting.evaluation.prediction import (
+                    ContinuousForecast,
+                    STANDARD_QUANTILES,
+                )
+
+                offset = pd.tseries.frequencies.to_offset(task.frequency)
+                start = pd.Timestamp(context.as_of)
+                payload = ContinuousForecast(
+                    point_forecast=self._value,
+                    quantiles={q: self._value for q in STANDARD_QUANTILES},
+                )
+                return [
+                    Prediction(
+                        predictor_id=self.predictor_id,
+                        task_id=task.task_id,
+                        issued_at=datetime.now(timezone.utc).replace(tzinfo=None),
+                        as_of=context.as_of,
+                        forecast_date=(start + offset * h).to_pydatetime(),
+                        payload=payload,
+                    )
+                    for h in task.horizons
+                ]
+    """
+
+    @property
+    @abstractmethod
+    def predictor_id(self) -> str:
+        """Unique, human-readable identifier for this predictor.
+
+        Used in :class:`~aieng.forecasting.evaluation.backtest.BacktestResult`
+        and in persisted :class:`~aieng.forecasting.evaluation.prediction.Prediction`
+        records to identify which predictor produced a forecast.
+        """
+
+    @abstractmethod
+    def predict(self, task: ForecastingTask, context: ForecastContext) -> list[Prediction]:
+        """Produce probabilistic forecasts for the given task and context.
+
+        Parameters
+        ----------
+        task : ForecastingTask
+            Defines the prediction problem — target series, horizon(s),
+            frequency, and resolution logic. The predictor must not modify
+            the task.
+        context : ForecastContext
+            The information state available at forecast time. All calls to
+            ``context.get_series()`` are automatically filtered to
+            ``context.as_of`` — the predictor cannot accidentally access
+            future data from the series store.
+
+        Returns
+        -------
+        list[Prediction]
+            One ``Prediction`` per horizon step in ``task.horizons``, each
+            with ``as_of = context.as_of`` and ``forecast_date`` set to the
+            corresponding step ahead of the origin. The list must not be empty.
+        """
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__task.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__task.py.md
new file mode 100644
index 0000000..450e9fd
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__task.py.md
@@ -0,0 +1,214 @@
+# Source: aieng-forecasting/aieng/forecasting/evaluation/task.py
+
+kind: python
+
+```python
+"""ForecastingTask: defines a prediction problem against the data service."""
+
+from __future__ import annotations
+
+from typing import Literal
+
+from pydantic import BaseModel, Field, model_validator
+
+
+class TaskCategory(BaseModel):
+    """Ordered category declaration for a categorical forecasting task.
+
+    Parameters
+    ----------
+    label : str
+        Non-empty category label used by categorical forecast payloads.
+    value : float
+        Numeric value stored in the observed target series for this category.
+    """
+
+    label: str = Field(min_length=1, description="Category label used by categorical forecast payloads.")
+    value: float = Field(description="Numeric value stored in the observed target series for this category.")
+
+
+class ForecastingTask(BaseModel):
+    """Defines a prediction problem, independent of how it is solved.
+
+    A ``ForecastingTask`` specifies *what* to forecast: the target series,
+    the horizon(s), the temporal resolution, and how to determine ground truth.
+    It says nothing about *how* a predictor should solve the problem —
+    covariate selection, gap-filling, and model choice are all predictor
+    concerns.
+
+    This separation means any two predictors (a vanilla ARIMA and a
+    multi-step LLM agent) can be evaluated against the same task without
+    the task needing to know anything about either of them.
+
+    Parameters
+    ----------
+    task_id : str
+        Unique identifier for this forecasting task.
+    target_series_id : str
+        The ``series_id`` (key in ``SeriesStore``) of the series to forecast.
+    horizons : list[int]
+        One or more horizon steps to forecast.  Horizon ``h`` means ``h``
+        frequency-units ahead of the forecast origin.  For example,
+        ``horizons=[18]`` on monthly data means 18 months ahead;
+        ``horizons=[6, 7, 8, ..., 17]`` produces a full trajectory.
+
+        **Backward compatibility:** you may pass ``horizon=N`` (singular, int)
+        and it will be silently coerced to ``horizons=[N]``.  This keeps
+        existing YAML specs, notebook code, and tests working without changes.
+    frequency : str
+        Pandas offset alias for the forecast frequency (e.g. ``"MS"`` for
+        month-start, ``"h"`` for hourly, ``"D"`` for daily). Combined with
+        ``horizons``, this determines the forecast window.
+    description : str
+        Human-readable description of the prediction problem.
+    payload_type : {"continuous", "binary", "categorical"}
+        The forecast payload modality this task expects. ``"continuous"``
+        (the default) means predictors must return
+        :class:`~aieng.forecasting.evaluation.prediction.ContinuousForecast`
+        payloads, scored with CRPS. ``"binary"`` means the target series is a
+        0/1 event series and predictors must return
+        :class:`~aieng.forecasting.evaluation.prediction.BinaryForecast`
+        payloads, scored with the Brier score. ``"categorical"`` means the
+        target series stores ordered category values declared in
+        ``categories`` and predictors must return
+        :class:`~aieng.forecasting.evaluation.prediction.CategoricalForecast`
+        payloads, scored with RPS. The evaluation harness validates payloads
+        against this declaration and fails loudly on a mismatch rather than
+        producing meaningless scores.
+    categories : list[TaskCategory] or None
+        Ordered category declarations for ``payload_type="categorical"``.
+        The list order is the ordinal order used by RPS, e.g.
+        ``cut < hold < hike``. Must be omitted for non-categorical tasks.
+    resolution_fn : str
+        How ground truth is determined. Defaults to
+        ``"observed_value_at_resolution_timestamp"``, meaning the resolution
+        is the actual observed value of ``target_series_id`` at the target
+        timestamp.
+
+        .. note::
+            **This field is currently a placeholder.** The evaluation harness
+            always uses ``"observed_value_at_resolution_timestamp"`` regardless
+            of this value. Dispatch on alternative strategies is deferred.
+
+    Notes
+    -----
+    The evaluation loop is identical for backtesting and live forecasting:
+
+    .. code-block:: text
+
+        ForecastingTask  →  defines the question
+        Predictor        →  decides how to answer it
+        list[Prediction] →  the answers (one per horizon)
+        Resolution       →  ground truth
+        Score            →  how well each answer matched
+
+    In backtest mode, the harness iterates over historical forecast origins.
+    In live mode, it waits for the resolution date. The task definition does
+    not change between modes.
+
+    Examples
+    --------
+    Single horizon (equivalent to the legacy ``horizon=18``):
+
+    >>> task = ForecastingTask(
+    ...     task_id="cpi_food_18m",
+    ...     target_series_id="cpi_food_canada",
+    ...     horizons=[18],
+    ...     frequency="MS",
+    ...     description="Forecast Canada food CPI 18 months ahead.",
+    ... )
+
+    Multi-horizon trajectory (horizons 6–17 → January through December of Y+1
+    from a July origin):
+
+    >>> task = ForecastingTask(
+    ...     task_id="cpi_food_cfpr_trajectory",
+    ...     target_series_id="cpi_food_canada",
+    ...     horizons=list(range(6, 18)),
+    ...     frequency="MS",
+    ...     description="Full 12-step trajectory for CFPR average-year analysis.",
+    ... )
+
+    The legacy ``horizon=N`` syntax still works:
+
+    >>> task = ForecastingTask(
+    ...     task_id="cpi_all_items_1m_ahead",
+    ...     target_series_id="cpi_all_items_canada",
+    ...     horizon=1,
+    ...     frequency="MS",
+    ...     description="Forecast Canada All-items CPI one month ahead.",
+    ... )
+    """
+
+    task_id: str = Field(description="Unique identifier for this forecasting task.")
+    target_series_id: str = Field(description="The series_id of the series to forecast.")
+    horizons: list[int] = Field(
+        min_length=1,
+        description=(
+            "One or more horizon steps to forecast. Horizon h means h frequency-units ahead of the forecast origin."
+        ),
+    )
+    frequency: str = Field(description="Pandas offset alias for the forecast frequency, e.g. 'MS', 'h', 'D'.")
+    description: str = Field(description="Human-readable description of the prediction problem.")
+    payload_type: Literal["continuous", "binary", "categorical"] = Field(
+        default="continuous",
+        description=(
+            "Forecast payload modality: 'continuous' (ContinuousForecast, CRPS-scored), "
+            "'binary' (BinaryForecast against a 0/1 event series, Brier-scored), or "
+            "'categorical' (CategoricalForecast against ordered categories, RPS-scored)."
+        ),
+    )
+    categories: list[TaskCategory] | None = Field(
+        default=None,
+        description="Ordered categories for categorical tasks; omitted for continuous and binary tasks.",
+    )
+    resolution_fn: str = Field(
+        default="observed_value_at_resolution_timestamp",
+        description=(
+            "How ground truth is determined. Placeholder — harness currently always uses "
+            "'observed_value_at_resolution_timestamp' regardless of this value. "
+            "Dispatch on alternative strategies is deferred."
+        ),
+    )
+
+    @model_validator(mode="before")
+    @classmethod
+    def _coerce_single_horizon(cls, data: object) -> object:
+        """Accept legacy ``horizon=N`` and coerce to ``horizons=[N]``."""
+        if isinstance(data, dict) and "horizon" in data and "horizons" not in data:
+            data = dict(data)
+            data["horizons"] = [int(data.pop("horizon"))]
+        return data
+
+    @model_validator(mode="after")
+    def _validate_categories(self) -> "ForecastingTask":
+        """Validate the categorical task contract."""
+        if self.payload_type == "categorical":
+            if self.categories is None:
+                raise ValueError("Categorical tasks must define categories with at least two entries.")
+            if len(self.categories) < 2:
+                raise ValueError("Categorical tasks must define at least two categories.")
+            labels = [category.label for category in self.categories]
+            duplicate_labels = sorted({label for label in labels if labels.count(label) > 1})
+            if duplicate_labels:
+                raise ValueError(f"Categorical task category labels must be unique. Duplicates: {duplicate_labels}")
+            values = [category.value for category in self.categories]
+            duplicate_values = sorted({value for value in values if values.count(value) > 1})
+            if duplicate_values:
+                raise ValueError(f"Categorical task category values must be unique. Duplicates: {duplicate_values}")
+            return self
+        if self.categories is not None:
+            raise ValueError("categories must be omitted unless payload_type='categorical'.")
+        return self
+
+    @property
+    def horizon(self) -> int:
+        """The maximum (outermost) horizon step.
+
+        For single-horizon tasks this is the only element of ``horizons``.
+        For multi-horizon tasks this is ``max(horizons)``, which is what
+        Darts models and other trajectory-based predictors need as their
+        ``n`` parameter.
+        """
+        return max(self.horizons)
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__langfuse_tracing.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__langfuse_tracing.py.md
new file mode 100644
index 0000000..105042e
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__langfuse_tracing.py.md
@@ -0,0 +1,192 @@
+# Source: aieng-forecasting/aieng/forecasting/langfuse_tracing.py
+
+kind: python
+
+```python
+"""Langfuse-oriented tracing bootstrap for LiteLLM and Google ADK.
+
+Call :func:`init_langfuse_tracing` once at process startup when using the
+``llm`` or ``agentic`` extras and Langfuse credentials are set in the
+environment.
+
+Call :func:`print_langfuse_trace_url` after a ``predict()`` call to flush
+pending spans and print a clickable Langfuse UI link.
+"""
+
+from __future__ import annotations
+
+import logging
+import os
+
+
+logger = logging.getLogger(__name__)
+
+
+def _langfuse_credentials_present() -> bool:
+    pub = os.environ.get("LANGFUSE_PUBLIC_KEY", "").strip()
+    sec = os.environ.get("LANGFUSE_SECRET_KEY", "").strip()
+    return bool(pub and sec)
+
+
+class _LangfuseTracingBootstrap:
+    """Registers LiteLLM + ADK exporters at most once per process."""
+
+    __slots__ = ("_google_adk_instrumented", "_langfuse_client_initialized", "_litellm_instrumented")
+
+    def __init__(self) -> None:
+        self._litellm_instrumented = False
+        self._google_adk_instrumented = False
+        self._langfuse_client_initialized = False
+
+    def init(self) -> None:
+        """Initialize Langfuse tracing when credentials and dependencies exist."""
+        if not _langfuse_credentials_present():
+            logger.debug(
+                "Skipping Langfuse tracing: set LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY.",
+            )
+            return
+
+        # OpenInference's ADK instrumentor uses the *global* OTel tracer provider.
+        # Langfuse attaches its span processor when the SDK client is created; without
+        # this, ADK spans are emitted into a no-op provider and never reach Langfuse.
+        self._ensure_langfuse_client()
+
+        self._register_litellm_langfuse_otel()
+        self._instrument_google_adk()
+
+    def _ensure_langfuse_client(self) -> None:
+        if self._langfuse_client_initialized:
+            return
+        try:
+            from langfuse import get_client  # noqa: PLC0415
+        except ImportError:
+            logger.debug("langfuse not installed; skipping Langfuse client initialization.")
+            return
+        try:
+            get_client()
+        except Exception:
+            logger.exception("Langfuse get_client() failed; ADK spans may not export.")
+            return
+        self._langfuse_client_initialized = True
+
+    def _register_litellm_langfuse_otel(self) -> None:
+        """Register LiteLLM Langfuse callback."""
+        if self._litellm_instrumented:
+            return
+        try:
+            import litellm  # noqa: PLC0415
+        except ImportError:
+            logger.debug("litellm not installed; skipping LiteLLM Langfuse callback.")
+            return
+
+        existing = list(getattr(litellm, "callbacks", None) or [])
+        if "langfuse_otel" not in existing:
+            litellm.callbacks = [*existing, "langfuse_otel"]
+        self._litellm_instrumented = True
+
+    def _instrument_google_adk(self) -> None:
+        """Instrument Google ADK."""
+        if self._google_adk_instrumented:
+            return
+        try:
+            from openinference.instrumentation.google_adk import (  # noqa: PLC0415
+                GoogleADKInstrumentor,
+            )
+        except ImportError:
+            logger.debug(
+                "openinference-instrumentation-google-adk not installed; skipping ADK instrumentation.",
+            )
+            return
+
+        try:
+            GoogleADKInstrumentor().instrument()
+        except Exception:
+            logger.exception("GoogleADKInstrumentor().instrument() failed.")
+            return
+
+        self._google_adk_instrumented = True
+
+
+_bootstrap = _LangfuseTracingBootstrap()
+
+
+def init_langfuse_tracing() -> None:
+    """Wire LiteLLM and Google ADK to Langfuse.
+
+    No-ops when ``LANGFUSE_PUBLIC_KEY`` or ``LANGFUSE_SECRET_KEY`` is unset or
+    blank in the environment.  Safe to call multiple times.
+
+    Notes
+    -----
+    When both environment keys are present, performs up to three one-time
+    registrations:
+
+    1. Calls ``langfuse.get_client()`` so the global OpenTelemetry
+       ``TracerProvider`` receives Langfuse's span processor.  This is required
+       for ADK spans emitted via ``openinference-instrumentation-google-adk``
+       to reach Langfuse.
+    2. Appends ``"langfuse_otel"`` to ``litellm.callbacks`` once (if
+       ``litellm`` is importable).
+    3. Runs ``GoogleADKInstrumentor().instrument()`` once (if
+       ``openinference-instrumentation-google-adk`` is importable).
+
+    Set ``LANGFUSE_HOST`` or ``LANGFUSE_BASE_URL`` for non-default regions.
+    For short-lived processes, call ``langfuse.get_client().flush()`` before
+    exit so pending spans are exported.
+    """
+    _bootstrap.init()
+
+
+def print_langfuse_trace_url(
+    trace_id: str | None = None,
+    *,
+    trace_name: str | None = None,
+) -> str | None:
+    """Flush pending spans and print a Langfuse trace URL (no API trace fetch).
+
+    Uses the in-process trace id when available (``get_current_trace_id``).
+    Does **not** call ``api.trace.list``, so it is safe to use from notebooks
+    where the trace list/get API calls time out. If no trace id is available,
+    prints the project traces page and the ``trace_name`` to filter by in the UI.
+
+    Parameters
+    ----------
+    trace_id : str, optional
+        Explicit trace id. When omitted, uses ``get_current_trace_id()`` if set.
+    trace_name : str, optional
+        ``trace_name`` from ``propagate_attributes`` (for manual UI lookup).
+
+    Returns
+    -------
+    str or None
+        Trace URL when resolved, else ``None``.
+    """
+    if not _langfuse_credentials_present():
+        print("Langfuse: set LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY in .env to export traces.")
+        return None
+
+    try:
+        from langfuse import get_client  # noqa: PLC0415
+    except ImportError:
+        print("Langfuse package not installed.")
+        return None
+
+    init_langfuse_tracing()
+    client = get_client()
+    client.flush()
+
+    resolved_id = trace_id or client.get_current_trace_id()
+    url = client.get_trace_url(trace_id=resolved_id)
+    if url:
+        print(f"Langfuse trace: {url}")
+        return url
+
+    project_id = client._get_project_id()  # noqa: SLF001
+    base = getattr(client, "_base_url", None) or os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com")
+    traces_page = f"{base}/project/{project_id}/traces" if project_id else base
+    print("Langfuse: trace id not available in this process after flush.")
+    print(f"  Open traces: {traces_page}")
+    if trace_name:
+        print(f"  Filter by trace name: {trace_name!r}")
+    return None
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__README.md.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__README.md.md
new file mode 100644
index 0000000..e7beeec
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__README.md.md
@@ -0,0 +1,119 @@
+# Source: aieng-forecasting/aieng/forecasting/methods/README.md
+
+kind: markdown
+
+# Methods
+
+This directory contains **reference predictor implementations** — concrete
+`Predictor` subclasses that are reusable across more than one forecasting
+experiment.
+
+The package is organized by method family:
+
+```text
+methods/
+├── baselines/       # simple floor baselines and teaching references
+├── numerical/       # classical / ML numerical forecasters
+├── llm_processes/   # LLM-process predictors (sampled trajectories, quantile grids, etc.)
+└── agentic/         # reusable ADK runners, agent factory, predictors, and output schemas
+```
+
+---
+
+## What belongs here
+
+- Concrete `Predictor` subclasses that are **not** tied to a specific use case
+- Implementations that a participant would use as-is or as a copy-paste
+  starting point across more than one experiment
+- Well-documented, linted Python modules (not notebooks)
+
+## What does NOT belong here
+
+- Task-specific configuration (prompts tuned for CFPR, specs, task YAMLs) —
+  those live in `implementations/<use-case>/`
+- Notebooks or experiment scripts — those live in `implementations/<use-case>/`
+- Infrastructure or ABCs — those live elsewhere in `aieng.forecasting`
+  (`data/`, `evaluation/`, future `agents/`)
+
+---
+
+## Import patterns
+
+Common imports:
+
+```python
+from aieng.forecasting.methods import (
+    DartsAutoARIMAPredictor,
+    DartsLightGBMPredictor,
+    DartsLinearRegressionPredictor,
+    LastValuePredictor,
+)
+```
+
+Sub-package imports are also fine when you want to signal the method family:
+
+```python
+from aieng.forecasting.methods.baselines import LastValuePredictor
+from aieng.forecasting.methods.numerical import DartsAutoARIMAPredictor
+```
+
+Agentic runner, factory, and output schemas:
+
+```python
+from aieng.forecasting.methods.agentic import (
+    AdkTextRunner,
+    AdkTextRunnerConfig,
+    AgentConfig,
+    AgentPredictor,
+    ContinuousAgentForecastOutput,
+    build_adk_agent,
+)
+```
+
+---
+
+## Current contents
+
+### Baselines
+
+| Module | Class | Description |
+|---|---|---|
+| `baselines/naive.py` | `LastValuePredictor` | Last-value naive baseline. Predicts the most recently observed value at all quantiles. The floor every predictor must beat. Also the annotated reference implementation — read this to understand the `Predictor` interface. |
+| `baselines/historical_frequency.py` | `HistoricalFrequencyPredictor` | Binary floor baseline: the constant historical base rate of the event, optionally over a trailing window. |
+| `baselines/categorical_frequency.py` | `CategoricalFrequencyPredictor` | Categorical floor baseline: the constant climatological distribution over the task-declared ordered categories. |
+
+### Numerical
+
+| Module | Class | Description |
+|---|---|---|
+| `numerical/darts_arima.py` | `DartsAutoARIMAPredictor` | Univariate Darts AutoARIMA with probabilistic multi-horizon output via Monte Carlo sampling. |
+| `numerical/darts_classical.py` | `DartsExponentialSmoothingPredictor` | Univariate state-space exponential smoothing (ETS); fast probabilistic baseline (non-seasonal by default, optional `seasonal_periods`). |
+| `numerical/darts_classical.py` | `DartsKalmanForecasterPredictor` | Univariate linear Gaussian state-space (Kalman) forecaster; fast probabilistic baseline with configurable latent dimension `dim_x`. |
+| `numerical/darts_regression.py` | `DartsLinearRegressionPredictor` | Darts linear regression predictor with optional past covariates and probabilistic output. |
+| `numerical/darts_regression.py` | `DartsLightGBMPredictor` | Darts LightGBM quantile-regression predictor with optional past covariates. |
+
+### LLM Processes
+
+| Module | Class | Description |
+|---|---|---|
+| `llm_processes/sampled_trajectory.py` | `SampledTrajectoryLLMPredictor` | Samples full trajectories from an LLM, then computes empirical quantiles per horizon. Supports optional covariates: set `covariate_series_ids` to serialize labeled exogenous-series history into the prompt (Context-is-Key §5.4). |
+| `llm_processes/quantile_grid.py` | `QuantileGridLLMPredictor` | Asks an LLM for the standard quantile grid in one structured completion. |
+| `llm_processes/binary_probability.py` | `BinaryProbabilityLLMPredictor` | Direct elicitation of one calibrated event probability for binary tasks (Brier-scored), in one structured completion. |
+| `llm_processes/categorical_probability.py` | `CategoricalProbabilityLLMPredictor` | Direct elicitation of a calibrated distribution over the task-declared ordered categories (RPS-scored); history serialized as category labels. |
+| `llm_processes/point_intervals.py` | — | Placeholder for a compact point-plus-interval contract; may become configurable sparse quantile-grid elicitation. |
+
+### Agentic
+
+| Module | Class / Function | Description |
+|---|---|---|
+| `agentic/adk_runner.py` | `AdkTextRunner` | Async text-in / text-out wrapper around ADK `InMemoryRunner`. Manages ADK sessions (fresh-per-message or sticky) and optionally traces each turn to Langfuse via `propagate_attributes`. |
+| `agentic/adk_runner.py` | `AdkTextRunnerConfig` | Pydantic configuration for `AdkTextRunner` (session mode, Langfuse fields). |
+| `agentic/agent_factory.py` | `build_adk_agent` | Generic ADK `LlmAgent` factory with optional code execution, context retrieval, skills, generation controls, and structured output schema. |
+| `agentic/agent_factory.py` | `AgentConfig` | Pydantic configuration for reusable ADK agents. `output_schema=None` supports interactive/free-form agents; a structured `AgentForecastOutput` schema supports Track 1 predictors. The `function_tools` field attaches conventional ADK tools (e.g. `ForecastTool`). Use-case-specific prompts and presets should live in `implementations/<use-case>/`. |
+| `agentic/forecast_tool.py` | `ForecastTool` | Conventional ADK `FunctionTool` that runs a pre-specified `Predictor` (AutoARIMA by default) on any registered series at a given cutoff/horizon, returning a structured JSON forecast. A controlled, reproducible alternative to open-ended code execution; series data never enters the LLM context. |
+| `agentic/outputs.py` | `AgentForecastOutput` | Abstract output adapter interface for converting structured agent JSON into evaluation `Prediction` objects. |
+| `agentic/outputs.py` | `ContinuousAgentForecastOutput` | Canonical continuous forecasting output schema. Declares `modality = "continuous"`, requires one forecast per task horizon and the standard quantile grid, then converts to `ContinuousForecast` payloads. |
+| `agentic/outputs.py` | `DiscreteAgentForecastOutput` | Binary event output schema (`modality = "discrete"`): one probability plus `reasoning` / `key_signals` metadata, converted to a `BinaryForecast` payload. |
+| `agentic/outputs.py` | `CategoricalAgentForecastOutput` | Ordered-categorical output schema (`modality = "categorical"`): one `{label, probability}` row per task category, validated against `task.categories` and converted to a `CategoricalForecast` payload. |
+| `agentic/predictor.py` | `AgentPredictor` | Track 1 `Predictor` that builds prompts, runs an ADK agent through `AdkTextRunner`, validates structured JSON, and converts it to `Prediction` objects. Accepts an optional injected runner for tests or custom observability. |
+| `agentic/predictor.py` | `ForecastPromptBuilder` | Protocol for task-specific prompt builders that turn `(task, context)` into the text passed to the agent. |
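Note on the `SampledTrajectoryLLMPredictor` row in the LLM Processes table above: the "empirical quantiles per horizon" step can be sketched in plain Python. This is a minimal illustration, not the package's implementation. The function names, the three-level quantile grid, and the linear-interpolation rule are all assumptions chosen for the example.

```python
from typing import Sequence

# Hypothetical quantile grid; the package's standard grid is not shown in this diff.
QUANTILE_LEVELS = (0.1, 0.5, 0.9)


def empirical_quantile(values: Sequence[float], q: float) -> float:
    """Linearly interpolated empirical quantile of ``values`` at level ``q``."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)  # fractional index into the sorted sample
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


def quantiles_per_horizon(
    trajectories: Sequence[Sequence[float]],
    levels: Sequence[float] = QUANTILE_LEVELS,
) -> list[dict[float, float]]:
    """Collapse sampled trajectories into one quantile dict per horizon step."""
    n_horizons = len(trajectories[0])
    if any(len(t) != n_horizons for t in trajectories):
        raise ValueError("all trajectories must cover the same horizons")
    result = []
    for h in range(n_horizons):
        column = [t[h] for t in trajectories]  # all samples for horizon h
        result.append({q: empirical_quantile(column, q) for q in levels})
    return result


# Three sampled trajectories, two horizons each.
samples = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
per_horizon = quantiles_per_horizon(samples)
```

Each trajectory contributes one value per horizon, so the quantiles are taken column-wise across samples rather than along a single trajectory.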
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods____init__.py.md
new file mode 100644
index 0000000..151e290
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods____init__.py.md
@@ -0,0 +1,90 @@
+# Source: aieng-forecasting/aieng/forecasting/methods/__init__.py
+
+kind: python
+
+```python
+"""Reference predictor implementations for ``aieng.forecasting``.
+
+This package groups concrete :class:`~aieng.forecasting.evaluation.predictor.Predictor`
+implementations by method family:
+
+- :mod:`baselines` — simple floor baselines and teaching references
+- :mod:`numerical` — classical / ML numerical forecasters
+- :mod:`llm_processes` — LLM-process predictors
+- :mod:`agentic` — tool-using / hybrid agentic predictors
+
+"""
+
+# ---------------------------------------------------------------------------
+# Patch: suppress spurious OTel cross-context ValueError in Jupyter / ADK
+# ---------------------------------------------------------------------------
+# When ADK or openinference-instrumented code runs async generators inside
+# Jupyter's nested event loop, pending tasks are garbage-collected mid-span.
+# GeneratorExit is thrown into OTel's start_as_current_span context manager,
+# which then tries to detach a contextvars Token that was created in a
+# different asyncio.Context, raising:
+#   ValueError: <Token ...> was created in a different Context
+#
+# Patching opentelemetry.context.detach (the module attribute) is insufficient
+# because openinference captures a direct `from opentelemetry.context import
+# detach` reference at instrumentation time, bypassing any later module-level
+# reassignment. Patching at the ContextVarsRuntimeContext *class* level is the
+# correct fix: it intercepts the call site that actually raises the error,
+# regardless of when or how callers imported the detach function.
+#
+# This patch is applied here, before any LLM or ADK imports, to ensure it is
+# in place before openinference instruments litellm.
+try:
+    import contextlib
+
+    from opentelemetry.context.contextvars_context import ContextVarsRuntimeContext as _CtxVarsRC
+
+    _orig_ctx_detach = _CtxVarsRC.detach
+
+    def _safe_ctx_detach(self, token):  # type: ignore[no-untyped-def]
+        with contextlib.suppress(ValueError):
+            _orig_ctx_detach(self, token)
+
+    _CtxVarsRC.detach = _safe_ctx_detach  # type: ignore[method-assign]
+except ImportError:
+    pass  # opentelemetry not installed; nothing to patch
+
+from .baselines import CategoricalFrequencyPredictor, HistoricalFrequencyPredictor, LastValuePredictor
+from .llm_processes import (
+    BinaryProbabilityLLMPredictor,
+    BinaryProbabilityLLMPredictorConfig,
+    CategoricalProbabilityLLMPredictor,
+    CategoricalProbabilityLLMPredictorConfig,
+    QuantileGridLLMPredictor,
+    QuantileGridLLMPredictorConfig,
+    SampledTrajectoryLLMPredictor,
+    SampledTrajectoryLLMPredictorConfig,
+)
+from .numerical import (
+    DartsAutoARIMAPredictor,
+    DartsExponentialSmoothingPredictor,
+    DartsKalmanForecasterPredictor,
+    DartsLightGBMPredictor,
+    DartsLinearRegressionPredictor,
+)
+
+
+__all__ = [
+    "BinaryProbabilityLLMPredictor",
+    "BinaryProbabilityLLMPredictorConfig",
+    "CategoricalFrequencyPredictor",
+    "CategoricalProbabilityLLMPredictor",
+    "CategoricalProbabilityLLMPredictorConfig",
+    "DartsAutoARIMAPredictor",
+    "DartsExponentialSmoothingPredictor",
+    "DartsKalmanForecasterPredictor",
+    "DartsLightGBMPredictor",
+    "DartsLinearRegressionPredictor",
+    "HistoricalFrequencyPredictor",
+    "LastValuePredictor",
+    "QuantileGridLLMPredictor",
+    "QuantileGridLLMPredictorConfig",
+    "SampledTrajectoryLLMPredictor",
+    "SampledTrajectoryLLMPredictorConfig",
+]
+```
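Note on the patch comment in `methods/__init__.py` above: the reason a class-level patch reaches callers that imported `detach` early can be shown with the standard library alone. The sketch below uses hypothetical stand-ins (`DemoRuntimeContext`, a module-level `detach`, `captured_detach`) and only mirrors the shape of the situation; it does not import or reproduce OpenTelemetry.

```python
import contextlib
import contextvars

_demo_var = contextvars.ContextVar("demo_var", default="unset")


class DemoRuntimeContext:
    """Hypothetical stand-in for a runtime-context class with a ``detach`` method."""

    def detach(self, token):
        # Raises ValueError if the token was created in a different Context.
        _demo_var.reset(token)


_RUNTIME = DemoRuntimeContext()


def detach(token):
    """Module-level helper that looks up the method at call time."""
    _RUNTIME.detach(token)


# What an early ``from module import detach`` would hold on to.
captured_detach = detach


def make_foreign_token():
    """Create a token inside a copied Context, so it is foreign to the current one."""
    return contextvars.copy_context().run(_demo_var.set, "value")


def raises_value_error(func, token):
    try:
        func(token)
    except ValueError:
        return True
    return False


raised_before = raises_value_error(captured_detach, make_foreign_token())

# Class-level patch: the captured function still resolves ``detach`` on the class.
_orig_detach = DemoRuntimeContext.detach


def _safe_detach(self, token):
    with contextlib.suppress(ValueError):
        _orig_detach(self, token)


DemoRuntimeContext.detach = _safe_detach

raised_after = raises_value_error(captured_detach, make_foreign_token())
```

Reassigning the module attribute `detach` would leave `captured_detach` untouched, whereas replacing the method on the class changes what every existing reference ends up calling.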
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic____init__.py.md
new file mode 100644
index 0000000..167d828
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic____init__.py.md
@@ -0,0 +1,116 @@
+# Source: aieng-forecasting/aieng/forecasting/methods/agentic/__init__.py
+
+kind: python
+
+```python
+"""ADK-based agentic predictors.
+
+Concrete forecasting components that use tool execution, code interpreters,
+or hybrid numerical reasoning to produce forecasts.
+
+This subpackage requires the ``agentic`` extra. Install it with::
+
+    pip install aieng-forecasting[agentic]
+
+Importing any name from this package (or its submodules) without the extra
+raises :class:`ImportError` with installation guidance.
+
+Public API
+----------
+AdaptiveSkillState, AdaptiveSkillStore
+    Abstract base and generic persistence layer for learnable agent skills.
+    Subclass ``AdaptiveSkillState`` with domain-specific fields and implement
+    ``build_markdown()`` to render the strategy to a ``SKILL.md`` the agent
+    reads.  ``AdaptiveSkillStore`` handles YAML serialisation, ``SKILL.md``
+    rendering, and timestamped backup on every mutation.
+format_backtest_report, load_context_documents, build_curriculum_prompt
+    Curriculum assembly utilities for adaptive agent training.  Format a
+    :class:`~aieng.forecasting.evaluation.backtest.BacktestResult` as a
+    structured markdown document, load pre-cached context files by date, and
+    assemble both into a single curriculum message for the agent.
+AgentConfig, CodeExecutionConfig, ContextRetrievalConfig
+    Pydantic configuration for building an ADK ``LlmAgent`` with optional
+    code execution and a Google Search sub-agent.
+build_adk_agent
+    Factory that turns an :class:`AgentConfig` into a configured
+    :class:`google.adk.agents.LlmAgent`.
+AdkTextRunner, AdkTextRunnerConfig
+    Text-in / text-out wrapper around ADK's ``InMemoryRunner`` with session
+    management and optional Langfuse tracing.
+AgentForecastOutput, ContinuousAgentForecastOutput, ...
+    Schemas for structured agent output and conversion to evaluation
+    :class:`~aieng.forecasting.evaluation.prediction.Prediction` objects.
+    See :mod:`aieng.forecasting.methods.agentic.outputs`.
+AgentPredictor, ForecastPromptBuilder
+    :class:`~aieng.forecasting.evaluation.predictor.Predictor`
+    that drives an ADK agent and converts its structured output into
+    predictions, plus the prompt-builder protocol it depends on.
+
+Examples
+--------
+Building a predictor from a config::
+
+    from aieng.forecasting.methods.agentic import (
+        AgentConfig,
+        AgentPredictor,
+        ContinuousAgentForecastOutput,
+    )
+
+    config = AgentConfig(instruction="Forecast the target series.")
+    predictor = AgentPredictor(
+        config,
+        my_prompt_builder,
+        output_schema=ContinuousAgentForecastOutput,
+    )
+"""
+
+from aieng.forecasting.methods.agentic.adaptive_skill import AdaptiveSkillState, AdaptiveSkillStore
+from aieng.forecasting.methods.agentic.adk_runner import AdkTextRunner, AdkTextRunnerConfig
+from aieng.forecasting.methods.agentic.agent_factory import (
+    AgentConfig,
+    CodeExecutionConfig,
+    ContextRetrievalConfig,
+    build_adk_agent,
+)
+from aieng.forecasting.methods.agentic.curriculum import (
+    build_curriculum_prompt,
+    format_backtest_report,
+    load_context_documents,
+)
+from aieng.forecasting.methods.agentic.forecast_tool import ForecastTool
+from aieng.forecasting.methods.agentic.outputs import (
+    AgentCategoryProbability,
+    AgentForecastOutput,
+    AgentQuantileForecast,
+    CategoricalAgentForecastOutput,
+    ContinuousAgentForecastOutput,
+    ContinuousAgentHorizonForecast,
+    DiscreteAgentForecastOutput,
+)
+from aieng.forecasting.methods.agentic.predictor import AgentPredictor, ForecastPromptBuilder
+
+
+__all__: list[str] = [
+    "AdaptiveSkillState",
+    "AdaptiveSkillStore",
+    "AdkTextRunner",
+    "AdkTextRunnerConfig",
+    "AgentCategoryProbability",
+    "AgentConfig",
+    "AgentForecastOutput",
+    "AgentPredictor",
+    "AgentQuantileForecast",
+    "CategoricalAgentForecastOutput",
+    "CodeExecutionConfig",
+    "ContinuousAgentForecastOutput",
+    "ContinuousAgentHorizonForecast",
+    "ContextRetrievalConfig",
+    "DiscreteAgentForecastOutput",
+    "ForecastPromptBuilder",
+    "ForecastTool",
+    "build_adk_agent",
+    "build_curriculum_prompt",
+    "format_backtest_report",
+    "load_context_documents",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__adaptive_skill.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__adaptive_skill.py.md
new file mode 100644
index 0000000..e9a4af9
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__adaptive_skill.py.md
@@ -0,0 +1,241 @@
+# Source: aieng-forecasting/aieng/forecasting/methods/agentic/adaptive_skill.py
+
+kind: python
+
+```python
+"""Generic adaptive skill infrastructure for learnable agent strategies.
+
+An *adaptive skill* is a skill whose content can be mutated by the agent
+through typed tool calls — as opposed to read-only skills that are authored
+once and never changed by the agent.
+
+The pattern has three layers:
+
+``AdaptiveSkillState`` (abstract)
+    A Pydantic model that is the source of truth for a skill's content.
+    Concrete subclasses define the domain-specific fields and implement
+    ``build_markdown()`` to render those fields into the ``SKILL.md`` text
+    that the ADK ``SkillToolset`` injects into the agent's context.
+
+``AdaptiveSkillStore``
+    Persistence layer for a concrete ``AdaptiveSkillState``.  Manages three
+    artefacts in a skill directory:
+
+    - ``skill_state.yaml`` — serialized state; the source of truth.
+    - ``SKILL.md`` — rendered from state on every save; what the agent reads.
+    - ``.history/`` — timestamped backups written before each save so that
+      every mutation is reversible without git involvement.
+
+    The ``confirmation_threshold`` (how many confirmed hypothesis outcomes are
+    required before ``graduate_hypothesis`` is permitted) is a *store*
+    parameter, not a *state* field.  This keeps the evidence bar outside the
+    agent's reach — the agent cannot lower the bar by mutating state.
+
+Usage pattern
+-------------
+Define a concrete state and instantiate a store once at module level in the
+implementation's ``skill_tools.py``::
+
+    from aieng.forecasting.methods.agentic.adaptive_skill import (
+        AdaptiveSkillState,
+        AdaptiveSkillStore,
+    )
+
+
+    class MyStrategyState(AdaptiveSkillState):
+        approach_narrative: str
+        ...
+
+        def build_markdown(self, skill_name: str | None = None) -> str: ...
+
+
+    STORE: AdaptiveSkillStore[MyStrategyState] = AdaptiveSkillStore(
+        skill_dir=Path(__file__).parent / "skills" / "my-strategy",
+        state_type=MyStrategyState,
+        confirmation_threshold=3,
+    )
+
+Then write one thin tool function per mutation operation and register them via
+``AgentConfig(extra_tools=[...])``.
+"""
+
+from __future__ import annotations
+
+import shutil
+from abc import ABC, abstractmethod
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import Generic, TypeVar
+
+import yaml
+from pydantic import BaseModel
+
+
+# ---------------------------------------------------------------------------
+# Abstract base
+# ---------------------------------------------------------------------------
+
+
+class AdaptiveSkillState(BaseModel, ABC):
+    """Abstract base for skill states that an agent can modify through typed tool calls.
+
+    Subclasses define domain-specific fields (observations, hypotheses,
+    calibration corrections, etc.) and implement ``build_markdown()`` to
+    render those fields into a complete ``SKILL.md`` document — including
+    YAML frontmatter.
+
+    The rendered ``SKILL.md`` is what the ADK ``SkillToolset`` loads and
+    injects into the agent's context.  The state itself is serialized to
+    ``skill_state.yaml`` as the authoritative source of truth.
+    """
+
+    schema_version: str = "1.0"
+
+    @abstractmethod
+    def build_markdown(self, skill_name: str | None = None) -> str:
+        """Render the current state to full ``SKILL.md`` content.
+
+        The output must include valid YAML frontmatter (``---`` fences) at the
+        top so that ``load_skill_from_dir`` can parse the skill metadata.
+
+        Parameters
+        ----------
+        skill_name : str or None
+            The value to embed in the frontmatter ``name:`` field.  Must match
+            the containing directory name exactly (ADK enforces this).  When
+            ``None``, subclasses should fall back to their default skill name.
+
+        Returns
+        -------
+        str
+            Complete ``SKILL.md`` text, ready to write to disk.
+        """
+        ...
+
+
+# ---------------------------------------------------------------------------
+# Generic store
+# ---------------------------------------------------------------------------
+
+S = TypeVar("S", bound=AdaptiveSkillState)
+
+_YAML_STATE_FILENAME = "skill_state.yaml"
+_SKILL_MD_FILENAME = "SKILL.md"
+_HISTORY_DIR = ".history"
+_GENERATED_HEADER = "<!-- Generated by AdaptiveSkillStore — do not edit by hand. -->\n"
+
+
+class AdaptiveSkillStore(Generic[S]):
+    """Persistence layer for an :class:`AdaptiveSkillState`.
+
+    Manages the three artefacts in a skill directory:
+
+    - ``skill_state.yaml`` — YAML-serialised state (source of truth).
+    - ``SKILL.md`` — rendered from state on every save.
+    - ``.history/`` — timestamped backups for reversibility.
+
+    Parameters
+    ----------
+    skill_dir : Path
+        Directory containing the skill (the one passed to
+        ``load_skill_from_dir``).  Must exist.
+    state_type : type[S]
+        Concrete ``AdaptiveSkillState`` subclass.  Used for deserialisation.
+    confirmation_threshold : int, default=3
+        Minimum ``confirmations`` count a hypothesis must reach before
+        ``graduate_hypothesis`` is allowed to promote it to a calibration
+        correction.  Kept here (not in state) so the agent cannot lower it.
+    """
+
+    def __init__(
+        self,
+        skill_dir: Path,
+        state_type: type[S],
+        confirmation_threshold: int = 3,
+    ) -> None:
+        if not skill_dir.is_dir():
+            raise ValueError(f"Skill directory does not exist: {skill_dir}")
+        self._skill_dir = skill_dir
+        self._state_type = state_type
+        self.confirmation_threshold = confirmation_threshold
+
+    # ------------------------------------------------------------------
+    # Paths
+    # ------------------------------------------------------------------
+
+    @property
+    def state_path(self) -> Path:
+        """Path to ``skill_state.yaml``."""
+        return self._skill_dir / _YAML_STATE_FILENAME
+
+    @property
+    def skill_md_path(self) -> Path:
+        """Path to the rendered ``SKILL.md``."""
+        return self._skill_dir / _SKILL_MD_FILENAME
+
+    @property
+    def history_dir(self) -> Path:
+        """Path to the ``.history/`` backup directory."""
+        return self._skill_dir / _HISTORY_DIR
+
+    # ------------------------------------------------------------------
+    # Load / save
+    # ------------------------------------------------------------------
+
+    def load(self) -> S:
+        """Deserialise state from ``skill_state.yaml``.
+
+        Raises
+        ------
+        FileNotFoundError
+            If ``skill_state.yaml`` does not exist.  Seed it first with
+            ``save(initial_state)`` before registering mutation tools.
+        """
+        if not self.state_path.exists():
+            raise FileNotFoundError(
+                f"skill_state.yaml not found in {self._skill_dir}. Run the seed script to initialise the skill state."
+            )
+        raw = yaml.safe_load(self.state_path.read_text(encoding="utf-8"))
+        return self._state_type.model_validate(raw)
+
+    def save(self, state: S) -> str:
+        """Persist *state* to disk and re-render ``SKILL.md``.
+
+        Steps:
+
+        1. Back up the current ``skill_state.yaml`` to ``.history/`` with a
+           UTC ISO-8601 timestamp suffix (before overwriting).
+        2. Write ``skill_state.yaml`` from ``state.model_dump()``.
+        3. Re-render ``SKILL.md`` via ``state.build_markdown()``.
+
+        Parameters
+        ----------
+        state : S
+            Updated state to persist.
+
+        Returns
+        -------
+        str
+            Human-readable confirmation message suitable for returning from a
+            tool call.
+        """
+        # 1. Backup
+        if self.state_path.exists():
+            self.history_dir.mkdir(exist_ok=True)
+            ts = datetime.now(tz=timezone.utc).strftime("%Y%m%dT%H%M%SZ")
+            backup_path = self.history_dir / f"skill_state_{ts}.yaml"
+            shutil.copy2(self.state_path, backup_path)
+
+        # 2. Write YAML state
+        data = state.model_dump(mode="json")
+        self.state_path.write_text(
+            yaml.dump(data, default_flow_style=False, allow_unicode=True, sort_keys=False),
+            encoding="utf-8",
+        )
+
+        # 3. Re-render SKILL.md — pass dir name so frontmatter matches ADK requirement
+        rendered = state.build_markdown(skill_name=self._skill_dir.name)
+        self.skill_md_path.write_text(rendered, encoding="utf-8")
+
+        return f"State saved to {self.state_path.name}. SKILL.md re-rendered. Any previous state was backed up to {_HISTORY_DIR}/."
+```
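Note on `AdaptiveSkillStore.save` above: the backup-before-overwrite step can be tried in isolation with the standard library. This is a sketch under stated assumptions, not the store itself. It writes JSON instead of YAML to avoid third-party imports, `save_with_backup` is a hypothetical helper name, and the timestamp adds microseconds (`%f`), which the original format does not, so that two saves within the same second get distinct backup files.

```python
from __future__ import annotations

import json
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_with_backup(state: dict, skill_dir: Path) -> Path | None:
    """Write ``state`` to disk, copying any existing state file to ``.history/`` first.

    Returns the backup path, or ``None`` when there was nothing to back up.
    """
    state_path = skill_dir / "skill_state.json"  # JSON here; the store uses YAML
    backup_path = None
    if state_path.exists():
        history_dir = skill_dir / ".history"
        history_dir.mkdir(exist_ok=True)
        # Assumption: microsecond precision to avoid same-second collisions.
        ts = datetime.now(tz=timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
        backup_path = history_dir / f"skill_state_{ts}.json"
        shutil.copy2(state_path, backup_path)
    state_path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return backup_path


with tempfile.TemporaryDirectory() as tmp:
    skill_dir = Path(tmp)
    first_backup = save_with_backup({"confirmations": 0}, skill_dir)
    second_backup = save_with_backup({"confirmations": 1}, skill_dir)
    backed_up_state = json.loads(second_backup.read_text(encoding="utf-8"))
    current_state = json.loads(
        (skill_dir / "skill_state.json").read_text(encoding="utf-8")
    )
```

The first save has no prior file, so it produces no backup; from the second save onward, the previous state is always recoverable from `.history/`.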
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__adk_runner.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__adk_runner.py.md
new file mode 100644
index 0000000..6337eb9
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__adk_runner.py.md
@@ -0,0 +1,375 @@
+# Source: aieng-forecasting/aieng/forecasting/methods/agentic/adk_runner.py
+
+kind: python
+
+```python
+"""General-purpose ADK runner: text-in / text-out over ``InMemoryRunner``.
+
+This module provides :class:`AdkTextRunner`, a thin wrapper around Google
+ADK's :class:`~google.adk.runners.InMemoryRunner` that exposes a single
+``run_text_async(prompt) -> str`` method, manages per-user session lifecycle,
+and optionally propagates Langfuse trace attributes for each turn.
+
+This module requires the ``agentic`` extra; importing it without the extra
+raises :class:`ImportError`.
+"""
+
+from __future__ import annotations
+
+import types as py_types
+from typing import Any
+
+from pydantic import BaseModel, Field
+
+
+try:
+    from google.adk.agents.base_agent import BaseAgent
+    from google.adk.agents.run_config import RunConfig
+    from google.adk.runners import InMemoryRunner
+    from google.genai import types as genai_types
+except ModuleNotFoundError as exc:
+    raise ImportError(
+        "This module requires the 'agentic' extra. Install it with 'pip install aieng-forecasting[agentic]'."
+    ) from exc
+
+
+class AdkTextRunnerConfig(BaseModel):
+    """Configuration for :class:`AdkTextRunner`.
+
+    Attributes
+    ----------
+    app_name : str
+        Application id shared by the session service and runner.
+    default_user_id : str
+        Fallback user id when :meth:`~AdkTextRunner.run_text_async` is called
+        without an explicit ``user_id``.
+    fresh_session_per_message : bool
+        When ``True`` (default), each :meth:`~AdkTextRunner.run_text_async`
+        call creates a fresh ADK session and any supplied ``session_id`` is
+        ignored.  When ``False``, sessions are reused per ``user_id``
+        (sticky conversation).
+    enable_langfuse_tracing : bool
+        When ``True``, initialise Langfuse at construction time and wrap every
+        turn with ``propagate_attributes``.  Requires the ``agentic`` extra.
+    langfuse_tags : list of str or None
+        Tags forwarded to Langfuse ``propagate_attributes``.
+    langfuse_propagate_metadata : dict of str to str, or None
+        Extra key/value metadata merged with ``adk_app_name`` and forwarded
+        to ``propagate_attributes``.
+    langfuse_trace_name : str or None
+        ``trace_name`` forwarded to Langfuse ``propagate_attributes``.
+    langfuse_version : str or None
+        ``version`` forwarded to Langfuse ``propagate_attributes``.
+
+    Notes
+    -----
+    When ``enable_langfuse_tracing`` is ``True``, ``user_id``, ``session_id``,
+    ``trace_name``, and every key/value in ``langfuse_propagate_metadata`` must
+    be US-ASCII and ≤ 200 characters each; Langfuse may drop non-conforming
+    values, logging a warning.
+    """
+
+    app_name: str = Field(
+        ...,
+        description="Application id shared by session service and runner.",
+    )
+    default_user_id: str = Field(
+        default="user",
+        description=(
+            "Used when ``run_text_async`` is called without ``user_id``. "
+            "If Langfuse tracing is enabled, must be US-ASCII and ≤ 200 characters."
+        ),
+    )
+    fresh_session_per_message: bool = Field(
+        default=True,
+        description=(
+            "If True, each ``run_text_async`` creates a new session (``session_id`` is ignored). "
+            "If False, turns for the same ``user_id`` reuse one session: the first call creates it, "
+            "later calls omit ``session_id`` unless switching threads; optional explicit "
+            "``session_id`` joins or replaces the sticky session for that user."
+        ),
+    )
+    enable_langfuse_tracing: bool = Field(
+        default=False,
+        description=(
+            "If True, call :func:`~aieng.forecasting.langfuse_tracing.init_langfuse_tracing` "
+            "at runner construction and wrap each turn with Langfuse "
+            "``propagate_attributes``. Forwards resolved ``user_id`` and ADK ``session_id`` "
+            "plus optional fields below. Langfuse requires propagated identifiers to be "
+            "US-ASCII and ≤ 200 characters; invalid values may be dropped with warnings. "
+            "Requires the ``agentic`` extra (``langfuse``)."
+        ),
+    )
+    langfuse_tags: list[str] | None = Field(
+        default=None,
+        description=("Optional tags for ``propagate_attributes`` to categorize observations in Langfuse."),
+    )
+    langfuse_propagate_metadata: dict[str, str] | None = Field(
+        default=None,
+        description=(
+            "Extra metadata merged with ``adk_app_name`` for ``propagate_attributes``. "
+            "Keys and values must be US-ASCII strings ≤ 200 characters each; avoid large "
+            "payloads or sensitive data (non-conforming entries may be dropped with warnings)."
+        ),
+    )
+    langfuse_trace_name: str | None = Field(
+        default=None,
+        description=("Optional ``trace_name`` for ``propagate_attributes``: US-ASCII, ≤ 200 characters."),
+    )
+    langfuse_version: str | None = Field(
+        default=None,
+        description=(
+            "Optional ``version`` for independently versioned parts of the app (e.g. agent "
+            "revision). Use short US-ASCII values suitable for span attributes."
+        ),
+    )
+
+    model_config = {"extra": "forbid"}
+
+
+class AdkTextRunner:
+    """Wrap ``InMemoryRunner`` with session helpers.
+
+    Parameters
+    ----------
+    agent : BaseAgent
+        The ADK agent to run.
+    config : AdkTextRunnerConfig
+        The configuration for the runner.
+
+    Examples
+    --------
+    Build a runner from an :class:`AgentConfig` and send one prompt:
+
+    >>> from aieng.forecasting.methods.agentic import (
+    ...     AgentConfig,
+    ...     build_adk_agent,
+    ... )
+    >>> from aieng.forecasting.methods.agentic.adk_runner import (
+    ...     AdkTextRunner,
+    ...     AdkTextRunnerConfig,
+    ... )
+    >>> agent = build_adk_agent(AgentConfig(instruction="You are a helpful assistant."))
+    >>> runner = AdkTextRunner(
+    ...     agent,
+    ...     config=AdkTextRunnerConfig(app_name="demo"),
+    ... )
+    >>> reply = await runner.run_text_async("Hello.")
+    """
+
+    def __init__(self, agent: BaseAgent, *, config: AdkTextRunnerConfig) -> None:
+        """Construct the runner and optionally initialise Langfuse tracing."""
+        self.config = config
+        self.agent = agent
+        self._runner = InMemoryRunner(agent=agent, app_name=config.app_name)
+        # Sticky ADK session per user when ``fresh_session_per_message`` is False.
+        self._conversation_session_by_user: dict[str, str] = {}
+        # Trace id captured during the most recent traced run (see ``last_trace_id``).
+        self._last_trace_id: str | None = None
+        if config.enable_langfuse_tracing:
+            from aieng.forecasting.langfuse_tracing import init_langfuse_tracing  # noqa: PLC0415
+
+            init_langfuse_tracing()
+
+    @property
+    def last_trace_id(self) -> str | None:
+        """Langfuse trace id captured during the most recent traced run, if any.
+
+        The agent runs on a worker event loop whose trace context the caller's
+        thread cannot see; the runner captures the id here so a predictor can link
+        and score the trace after the run. ``None`` when tracing is off or the last
+        run produced no trace.
+        """
+        return self._last_trace_id
+
+    @property
+    def runner(self) -> InMemoryRunner:
+        """Underlying ADK runner (session, artifact, memory services)."""
+        return self._runner
+
+    async def _resolve_session_id(self, user_id: str | None, session_id: str | None) -> str:
+        """Return the ADK session id to use for a single turn.
+
+        Parameters
+        ----------
+        user_id : str or None
+            Resolved user id; falls back to ``default_user_id`` when ``None``.
+        session_id : str or None
+            Explicit session id from the caller.  ``None`` triggers sticky-session
+            lookup or new-session creation depending on ``fresh_session_per_message``.
+
+        Returns
+        -------
+        str
+            ADK session id for this turn.
+        """
+        if user_id is None:
+            user_id = self.config.default_user_id
+
+        if self.config.fresh_session_per_message:
+            new_session = await self._runner.session_service.create_session(
+                app_name=self.config.app_name,
+                user_id=user_id,
+            )
+            sid = new_session.id
+        elif session_id is not None:
+            sid = session_id
+            self._conversation_session_by_user[user_id] = sid
+        elif user_id in self._conversation_session_by_user:
+            sid = self._conversation_session_by_user[user_id]
+        else:
+            new_session = await self._runner.session_service.create_session(
+                app_name=self.config.app_name,
+                user_id=user_id,
+            )
+            sid = new_session.id
+            self._conversation_session_by_user[user_id] = sid
+
+        return sid
+
+    async def run_text_async(
+        self,
+        prompt: str,
+        *,
+        user_id: str | None = None,
+        session_id: str | None = None,
+        run_config: RunConfig | None = None,
+    ) -> str:
+        """Run one user turn; return the first final model text or an empty string.
+
+        Parameters
+        ----------
+        prompt : str
+            The user prompt to run.
+        user_id : str | None, optional
+            The user id to use for the session. If not provided, the default
+            user id is used. With Langfuse tracing, must be US-ASCII and ≤ 200
+            characters for propagation.
+        session_id : str | None, optional
+            The ADK session id to use. If omitted, the user's sticky session is
+            reused, or a new one is created (see Notes). With Langfuse tracing,
+            the id must remain US-ASCII and ≤ 200 characters for propagation.
+        run_config : RunConfig | None, optional
+            The run configuration to use for the run. If not provided, the default
+            run configuration is used.
+
+        Returns
+        -------
+        str
+            The first final model text or an empty string.
+
+        Notes
+        -----
+        If ``fresh_session_per_message`` is True, each call uses a new ADK session and
+        ``session_id`` is ignored.
+
+        If it is False, the runner keeps a session per ``user_id``: omit ``session_id``
+        after the first message to continue the same conversation. Pass ``session_id``
+        to attach to an existing session or switch threads; that id is remembered for
+        later calls with ``session_id`` omitted (same user).
+
+        When ``enable_langfuse_tracing`` is True, each turn runs inside Langfuse
+        ``propagate_attributes`` using the resolved ``user_id`` and ADK ``session_id``.
+        """
+        from aieng.forecasting.methods.agentic.agent_factory import SMR_STATE_KEY  # noqa: PLC0415
+
+        user_id = user_id or self.config.default_user_id
+
+        session_id = await self._resolve_session_id(user_id, session_id)
+
+        content = genai_types.Content(role="user", parts=[genai_types.Part(text=prompt)])
+
+        async def drain_run() -> str:
+            async for event in self._runner.run_async(
+                user_id=user_id,
+                session_id=session_id,
+                new_message=content,
+                run_config=run_config,
+            ):
+                if event.is_final_response() and event.content and event.content.parts:
+                    return event.content.parts[0].text or ""
+            return ""
+
+        async def run_and_resolve() -> str:
+            """Run the agent and return the best available output string.
+
+            When the agent uses our set_model_response shim (LiteLlm path with
+            tools + output_schema), the structured JSON is stored in session
+            state under SMR_STATE_KEY.  We prefer that over the model's
+            subsequent "Task complete." text response.
+            """
+            text = await drain_run()
+            session = await self._runner.session_service.get_session(
+                app_name=self.config.app_name,
+                user_id=user_id,
+                session_id=session_id,
+            )
+            if session is not None and SMR_STATE_KEY in (session.state or {}):
+                return str(session.state[SMR_STATE_KEY])
+            return text
+
+        if self.config.enable_langfuse_tracing:
+            from langfuse import get_client, propagate_attributes  # noqa: PLC0415
+
+            metadata: dict[str, str] = {"adk_app_name": self.config.app_name}
+            if self.config.langfuse_propagate_metadata:
+                metadata = {**metadata, **self.config.langfuse_propagate_metadata}
+
+            pa_kw: dict[str, Any] = {
+                k: v
+                for k, v in {
+                    "user_id": user_id,
+                    "session_id": session_id,
+                    "metadata": metadata,
+                    "tags": self.config.langfuse_tags,
+                    "trace_name": self.config.langfuse_trace_name,
+                    "version": self.config.langfuse_version,
+                }.items()
+                if v is not None
+            }
+            # Wrap the run in an explicit Langfuse span so (a) the ADK spans nest
+            # under one root trace and (b) we can capture the trace id while its
+            # context is active — the caller's thread cannot see it otherwise.
+            self._last_trace_id = None
+            client = get_client()
+            root_name = self.config.langfuse_trace_name or self.config.app_name
+            with client.start_as_current_observation(name=root_name, as_type="agent"):
+                with propagate_attributes(**pa_kw):
+                    result = await run_and_resolve()
+                self._last_trace_id = client.get_current_trace_id()
+            return result
+
+        return await run_and_resolve()
+
+    def clear_conversation(self, *, user_id: str | None = None) -> None:
+        """Drop sticky session id(s). Next ``run_text_async`` starts a new chat.
+
+        With ``user_id``, clear only that user. With ``None``, clear every user.
+        No effect when ``fresh_session_per_message`` is True.
+
+        Parameters
+        ----------
+        user_id : str | None, optional
+            The user id whose sticky session id is dropped. If not provided,
+            the sticky session ids of all users are dropped.
+        """
+        if user_id is None:
+            self._conversation_session_by_user.clear()
+        else:
+            self._conversation_session_by_user.pop(user_id, None)
+
+    async def aclose(self) -> None:
+        """Close the underlying runner (plugins, toolsets)."""
+        self._conversation_session_by_user.clear()
+        await self._runner.close()  # type: ignore[no-untyped-call]
+
+    async def __aenter__(self) -> AdkTextRunner:
+        """Return self for use as an ``async with`` target."""
+        return self
+
+    async def __aexit__(
+        self, exc_type: type[BaseException] | None, exc_val: BaseException | None, exc_tb: py_types.TracebackType | None
+    ) -> None:
+        """Close the runner when leaving the ``async with`` block."""
+        await self.aclose()
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__agent_factory.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__agent_factory.py.md
new file mode 100644
index 0000000..494c0ed
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__agent_factory.py.md
@@ -0,0 +1,576 @@
+# Source: aieng-forecasting/aieng/forecasting/methods/agentic/agent_factory.py
+
+kind: python
+
+```python
+"""Factory functions for building Google ADK agents for forecasting.
+
+This module exposes :class:`AgentConfig` plus its nested
+:class:`CodeExecutionConfig` and :class:`ContextRetrievalConfig` configs,
+and the :func:`build_adk_agent` factory that turns a config into a fully
+configured :class:`google.adk.agents.LlmAgent` (with optional E2B-backed
+code execution and a proxy-grounded web-search tool for context retrieval).
+
+This module requires the ``agentic`` extra; importing it without the extra
+raises :class:`ImportError` with installation guidance.
+"""
+
+from __future__ import annotations
+
+import logging
+import os
+import warnings
+from pathlib import Path
+from typing import Any, Callable, Sequence
+
+from aieng.forecasting.methods.agentic.outputs import AgentForecastOutput
+from aieng.forecasting.models import LITE_MODEL
+from google.adk.models.base_llm import BaseLlm
+from pydantic import BaseModel, Field, field_validator, model_validator
+
+
+# ---------------------------------------------------------------------------
+# Suppress LiteLLM startup and OTEL noise
+# ---------------------------------------------------------------------------
+# LiteLLM logs Bedrock/SageMaker "no botocore" warnings and an OTEL proxy-
+# server notice on every import — all harmless when using the Vector proxy.
+# OTEL span-lifecycle warnings ("Tried calling set_status on an ended span")
+# fire when LiteLLM callbacks run after spans close; also benign.
+# These filters run at module-import time so they are active before the first
+# litellm import (which happens lazily inside search_web / build_adk_agent).
+
+
+class _LiteLLMNoiseFilter(logging.Filter):
+    _NOISE = ("botocore", "Proxy Server is not installed")
+
+    def filter(self, record: logging.LogRecord) -> bool:
+        return not any(n in record.getMessage() for n in self._NOISE)
+
+
+logging.getLogger("LiteLLM").addFilter(_LiteLLMNoiseFilter())
+logging.getLogger("opentelemetry").setLevel(logging.ERROR)
+warnings.filterwarnings("ignore", message="Tried calling set_status on an ended span")
+warnings.filterwarnings("ignore", message="Setting attribute on ended span")
+
+
+try:
+    from aieng.agents.tools.code_interpreter import CodeInterpreter
+    from google.adk.agents import LlmAgent
+    from google.adk.skills import load_skill_from_dir
+    from google.adk.skills.models import Skill
+    from google.adk.tools.function_tool import FunctionTool
+    from google.adk.tools.skill_toolset import SkillToolset
+    from google.adk.tools.tool_context import ToolContext
+    from google.genai.types import (
+        AutomaticFunctionCallingConfig,
+        GenerateContentConfig,
+        ThinkingConfig,
+        ThinkingLevel,
+    )
+except ModuleNotFoundError as exc:
+    raise ImportError(
+        "This module requires the 'agentic' extra. Install it with 'pip install aieng-forecasting[agentic]'."
+    ) from exc
+
+
+# Session-state key used by our proxy-compatible set_model_response shim.
+# When a LiteLlm agent has both output_schema and tools, we register a flat
+# set_model_response(json_response: str) tool that stores the JSON here.
+# AdkTextRunner reads this key after each run and returns it in place of the
+# final text, giving the predictor the structured JSON it expects.
+SMR_STATE_KEY = "__smr_output__"
+
+
+def _build_set_model_response_tool() -> FunctionTool:
+    """Return a proxy-compatible ``set_model_response`` shim.
+
+    Gemini thinking models call ``set_model_response`` when they produce
+    structured output alongside other tools — regardless of whether ADK
+    registered the tool.  The real ``SetModelResponseTool`` uses a nested
+    Pydantic schema for its function declaration, which Gemini rejects via the
+    OpenAI-compatible proxy (``$defs``/``$ref`` not supported).
+
+    This shim accepts the JSON as a plain string and stores it in session
+    state under :data:`SMR_STATE_KEY`.  :class:`AdkTextRunner` reads that key
+    after the run and returns it as the final output, bypassing the model's
+    subsequent text response (e.g. "Done." or "Task complete.").
+    """
+
+    async def set_model_response(json_response: str, tool_context: ToolContext) -> str:
+        """Submit your final structured JSON response as a string.
+
+        Call this tool once, passing the complete JSON object that satisfies
+        the required output schema. Do not produce any further text after
+        calling this tool.
+        """
+        tool_context.state[SMR_STATE_KEY] = json_response
+        return "Response submitted. Task complete."
+
+    return FunctionTool(set_model_response)
+
+
+class ContextRetrievalConfig(BaseModel):
+    """Configuration for the web-search context-retrieval tool.
+
+    When enabled, :func:`build_adk_agent` attaches a ``search_web``
+    :class:`~google.adk.tools.FunctionTool` to the agent.  The tool calls
+    the Vector proxy with Gemini's ``googleSearch`` server-side extension so
+    the calling agent can retrieve grounded, sourced web context without a
+    direct Gemini API key.
+
+    Temporal cutoff enforcement is soft (LLM-judgment-based): when
+    ``enforce_cutoff`` is ``True`` and the calling agent passes a
+    ``cutoff_date`` to the tool, the inner proxy prompt explicitly asks the
+    model to exclude post-cutoff sources.  This is the same trust model used
+    by the prior Google Search sub-agent: the cutoff is best-effort, not a hard
+    guarantee, and backtest leakage is a pedagogically useful discussion point.
+
+    Attributes
+    ----------
+    enabled : bool, default=False
+        Whether to enable context retrieval. Disabled by default.
+    search_model : str, default=LITE_MODEL (``"gemini-3.1-flash-lite-preview"``)
+        Proxy model used inside the ``search_web`` tool call.  Must be a
+        model that supports the ``googleSearch`` server-side tool extension.
+    instruction : str
+        System prompt passed to the inner proxy call.  Should describe the
+        search persona and what kind of output to return.  Must be non-empty
+        when ``enabled`` is ``True``.
+    enforce_cutoff : bool, default=True
+        When ``True``, the ``search_web`` tool appends a cutoff-date
+        constraint to the user prompt whenever ``cutoff_date`` is supplied by
+        the calling agent.  Set to ``False`` for live (non-backtest) agents
+        where no temporal fence is needed.
+    temperature : float | None, default=None
+        Sampling temperature for the inner search call.
+    max_output_tokens : int | None, default=None
+        Maximum output tokens for the inner search call.
+    """
+
+    model_config = {"extra": "forbid"}
+
+    enabled: bool = False
+    search_model: str = LITE_MODEL
+    instruction: str = (
+        "You are a specialized web search assistant.\n\n"
+        "Search for information relevant to the query and return a concise, "
+        "grounded summary with source URLs."
+    )
+    enforce_cutoff: bool = True
+    temperature: float | None = Field(default=None, ge=0.0, le=2.0)
+    max_output_tokens: int | None = Field(default=None, ge=1)
+
+
+class CodeExecutionConfig(BaseModel):
+    """Configuration for the E2B code execution tool.
+
+    Code runs in an E2B-backed sandbox managed by the
+    :class:`~aieng.agents.tools.code_interpreter.CodeInterpreter` tool.
+
+    Attributes
+    ----------
+    enabled : bool, default=False
+        Whether to enable code execution. Disabled by default.
+    template_name : str | None, default="agentic-forecasting-bootcamp"
+        E2B template name.
+    sandbox_timeout_seconds : int, default=3600
+        E2B sandbox lifetime in seconds.
+    code_execution_timeout_seconds : float | None, default=3300
+        Per-execution timeout in seconds.
+    """
+
+    model_config = {"extra": "forbid"}
+
+    enabled: bool = False
+    template_name: str | None = "agentic-forecasting-bootcamp"
+    sandbox_timeout_seconds: int = Field(default=3600, ge=1, le=3600)
+    code_execution_timeout_seconds: float | None = Field(default=3300, gt=0)
+
+    @model_validator(mode="after")
+    def _timeouts_consistent(self) -> "CodeExecutionConfig":
+        """Ensure code execution cannot outlive the sandbox itself."""
+        if (
+            self.code_execution_timeout_seconds is not None
+            and self.code_execution_timeout_seconds > self.sandbox_timeout_seconds
+        ):
+            raise ValueError("code_execution_timeout_seconds cannot exceed sandbox_timeout_seconds")
+        return self
+
+
+def _build_automatic_function_calling_config(
+    config: AgentConfig,
+    *,
+    tools: list[Any],
+    output_schema: type[AgentForecastOutput] | None,
+) -> AutomaticFunctionCallingConfig | None:
+    """Disable genai AFC when ADK orchestrates tools or schemas."""
+    disable = config.disable_automatic_function_calling
+    if disable is None:
+        disable = bool(tools or output_schema is not None)
+    if not disable:
+        return None
+    return AutomaticFunctionCallingConfig(disable=True)
+
+
+def _build_search_tool(
+    config: ContextRetrievalConfig,
+    *,
+    openai_base_url: str,
+    openai_api_key: str | None,
+) -> Callable[..., Any]:
+    """Return an async ``search_web`` tool function backed by the proxy's googleSearch.
+
+    The returned coroutine function is registered as an ADK tool.  It calls
+    the proxy with ``"tools": [{"googleSearch": {}}]`` so the model does
+    server-side grounding and returns a synthesised answer plus source URLs
+    extracted from ``choices[0].provider_specific_fields["grounding_metadata"]``.
+    """
+
+    async def search_web(query: str, cutoff_date: str | None = None) -> str:
+        """Search the web and return a grounded summary with source URLs.
+
+        Args:
+            query: What to search for.
+            cutoff_date: ISO date (YYYY-MM-DD). When provided, only include
+                         information published strictly before this date.
+
+        Returns
+        -------
+            A grounded summary of search results, with source URLs appended.
+        """
+        import litellm  # noqa: PLC0415
+
+        user_content = query
+        if cutoff_date and config.enforce_cutoff:
+            user_content += f"\n\nOnly include and cite information published strictly before {cutoff_date}."
+        search_model = config.search_model
+        if not search_model.startswith("openai/"):
+            search_model = f"openai/{search_model}"
+        resp = await litellm.acompletion(
+            model=search_model,
+            api_base=openai_base_url,
+            api_key=openai_api_key,
+            messages=[
+                {"role": "system", "content": config.instruction},
+                {"role": "user", "content": user_content},
+            ],
+            tools=[{"googleSearch": {}}],
+            max_tokens=config.max_output_tokens or 4096,
+            temperature=config.temperature or 0.0,
+            timeout=60.0,
+        )
+        content = resp.choices[0].message.content or ""
+        psf = getattr(resp.choices[0], "provider_specific_fields", {}) or {}
+        gm = psf.get("grounding_metadata") or {}
+        sources: list[str] = [
+            uri for c in gm.get("groundingChunks", []) if (uri := (c.get("web") or {}).get("uri")) is not None
+        ]
+        if sources:
+            content += "\n\nSources:\n" + "\n".join(sources[:5])
+        return content
+
+    return search_web
+
+
+class AgentConfig(BaseModel):
+    """Configuration for building an ADK agent for forecasting tasks.
+
+    Attributes
+    ----------
+    name : str, default="adk_forecasting_agent"
+        Name of the agent.
+    model : str | BaseLlm, default=LITE_MODEL (``"gemini-3.1-flash-lite-preview"``)
+        Model name (bare, no provider prefix) or a custom
+        :class:`~google.adk.models.base_llm.BaseLlm` instance.  When
+        ``openai_base_url`` is set and ``model`` is a plain string,
+        :func:`build_adk_agent` wraps it in a
+        :class:`~google.adk.models.lite_llm.LiteLlm` instance pointing to
+        the proxy.  Pass a ``BaseLlm`` directly to skip automatic wrapping.
+    openai_base_url : str | None, default=OPENAI_BASE_URL env var
+        Base URL for the OpenAI-compatible LLM proxy.  Defaults to the
+        ``OPENAI_BASE_URL`` environment variable.  When set, the agent (and
+        the ``search_web`` tool) route all calls through the proxy.
+    openai_api_key : str | None, default=OPENAI_API_KEY env var
+        API key for the proxy.  Defaults to the ``OPENAI_API_KEY``
+        environment variable.
+    description : str, default=""
+        Description of the agent. Useful when the agent is used as a sub-agent.
+    instruction : str, default=""
+        Instruction for the agent.
+    skills_dirs : Sequence[Path], default=()
+        Sequence of paths to skill directories.
+    function_tools : Sequence[Any], default=()
+        Conventional ADK tools (e.g. :class:`~google.adk.tools.FunctionTool`
+        instances or plain callables) appended directly to the agent's tool
+        list. Use this to give the agent a rigid, pre-specified capability such
+        as the
+        :class:`~aieng.forecasting.methods.agentic.forecast_tool.ForecastTool`
+        (in contrast to open-ended code execution). Stored as-is; not validated.
+    seed : int or None, default=None
+        Generation seed forwarded to the model for reproducibility.
+    temperature : float or None, default=None
+        Sampling temperature; ``None`` uses the model default.
+    max_output_tokens : int or None, default=None
+        Maximum tokens per model response; ``None`` uses the model default.
+    thinking_budget : int or None, default=None
+        Token budget for extended thinking (Gemini thinking models only).
+        **Proxy-path caveat:** when routing through the Vector proxy (or any
+        OpenAI-compatible proxy), ``thinking_budget`` is passed via ADK's
+        ``ThinkingConfig`` → ``GenerateContentConfig``. Whether LiteLLM's
+        ``drop_params`` strips it on the proxy path is untested — if you set
+        this and see no change in thinking behaviour, treat it as silently
+        dropped (same root cause as the ``reasoning_effort`` stripping issue
+        documented in ``planning-docs/vector-llm-proxy.md``).
+    thinking_level : ThinkingLevel or None, default=None
+        Thinking-level preset; overrides ``thinking_budget`` when both are set.
+        Subject to the same proxy-path caveat as ``thinking_budget``.
+    code_execution : CodeExecutionConfig
+        Configuration for E2B code execution. Disabled by default.
+    context_retrieval : ContextRetrievalConfig
+        Configuration for web-search context retrieval. Disabled by default.
+    disable_automatic_function_calling : bool or None, default=None
+        When ``True``, sets ``automatic_function_calling.disable`` on the
+        Gemini request config.  ADK agents execute tools via the ADK runtime,
+        not the genai SDK's Automatic Function Calling (AFC) helper.
+        ``None`` (default) auto-disables AFC whenever tools or an
+        ``output_schema`` are configured.
+    extra_tools : Sequence[Callable[..., Any]], default=()
+        Additional callable tools to register with the agent beyond the
+        standard code-execution and context-retrieval tools.  Use this to
+        inject implementation-specific tools (e.g. adaptive skill mutation
+        tools) without coupling the shared factory to implementation code.
+        Each callable is appended to the tool list after skills are loaded
+        and will be wrapped by ADK as a ``FunctionTool``.
+    """
+
+    model_config = {"extra": "forbid", "arbitrary_types_allowed": True}
+
+    name: str = "adk_forecasting_agent"
+    model: str | BaseLlm = LITE_MODEL
+    openai_base_url: str | None = Field(
+        default_factory=lambda: os.getenv("OPENAI_BASE_URL"),
+        description=(
+            "Base URL for the OpenAI-compatible LLM proxy. Defaults to the OPENAI_BASE_URL environment variable."
+        ),
+    )
+    openai_api_key: str | None = Field(
+        default_factory=lambda: os.getenv("OPENAI_API_KEY"),
+        description="API key for the proxy. Defaults to the OPENAI_API_KEY environment variable.",
+    )
+    description: str = ""
+    instruction: str = ""
+    skills_dirs: Sequence[Path] = ()
+    function_tools: Sequence[Any] = ()
+    # Optional generation overrides (None = model/provider defaults).
+    seed: int | None = None
+    temperature: float | None = None
+    max_output_tokens: int | None = None
+    thinking_budget: int | None = None
+    thinking_level: ThinkingLevel | None = None
+
+    # Capabilities
+    code_execution: CodeExecutionConfig = Field(default_factory=CodeExecutionConfig)
+    context_retrieval: ContextRetrievalConfig = Field(default_factory=ContextRetrievalConfig)
+    disable_automatic_function_calling: bool | None = None
+    extra_tools: Sequence[Callable[..., Any]] = ()
+
+    @field_validator("skills_dirs")
+    @classmethod
+    def _skill_dirs_exist(cls, dirs: Sequence[Path]) -> Sequence[Path]:
+        """Reject skill directories that do not resolve to a real directory."""
+        missing = [p for p in dirs if not p.is_dir()]
+        if missing:
+            raise ValueError(f"Skill directories do not exist: {missing}")
+        return dirs
+
+    @model_validator(mode="after")
+    def _enabled_requires_instruction(self) -> "AgentConfig":
+        """Require non-empty instructions for the root and context-retrieval agents."""
+        if self.context_retrieval.enabled and not self.context_retrieval.instruction.strip():
+            raise ValueError(
+                "Expected non-empty instruction for context retrieval agent. "
+                "Please provide an instruction in the agent configuration."
+            )
+        if not self.instruction.strip():
+            raise ValueError(
+                "Expected non-empty instruction for root agent. "
+                "Please provide an instruction in the agent configuration."
+            )
+        return self
+
+
+def build_adk_agent(
+    config: AgentConfig,
+    *,
+    output_schema: type[AgentForecastOutput] | None = None,
+) -> LlmAgent:
+    """Build an ADK agent for forecasting tasks with the given configuration.
+
+    Code execution (E2B) and the web-search context-retrieval tool are wired
+    only when the corresponding capability blocks in ``config`` are enabled.
+
+    When ``config.openai_base_url`` is set and ``config.model`` is a plain
+    string, the model is automatically wrapped in a
+    :class:`~google.adk.models.lite_llm.LiteLlm` instance that routes all
+    calls through the proxy.  Pass a ``BaseLlm`` instance directly to bypass
+    automatic wrapping.
+
+    Parameters
+    ----------
+    config : AgentConfig
+        Configuration for the agent.  ``config.instruction`` must be
+        non-empty; if ``config.context_retrieval.enabled`` is ``True``,
+        ``config.context_retrieval.instruction`` must also be non-empty
+        (enforced by :class:`AgentConfig`).
+    output_schema : type[AgentForecastOutput] or None, default=None
+        When provided, configures the agent to return JSON constrained to
+        this schema.  Typically supplied by :class:`AgentPredictor`.
+
+        Note: avoid ``str | None`` optional fields on schemas that also
+        contain ``list[BaseModel]`` fields; use string defaults (e.g.
+        ``rationale=""``) to stay compatible with ADK's
+        ``set_model_response`` tool.
+
+    Returns
+    -------
+    LlmAgent
+        Configured ADK agent with tools and skills attached.
+
+    Examples
+    --------
+    Interactive analyst — free-form output, no schema constraint:
+
+    >>> from aieng.forecasting.methods.agentic import AgentConfig, build_adk_agent
+    >>> agent = build_adk_agent(AgentConfig(instruction="You are a helpful analyst."))
+
+    Predictor role — structured JSON output constrained to a schema:
+
+    >>> from aieng.forecasting.methods.agentic import (
+    ...     AgentConfig,
+    ...     ContinuousAgentForecastOutput,
+    ...     build_adk_agent,
+    ... )
+    >>> agent = build_adk_agent(
+    ...     AgentConfig(instruction="Forecast the supplied series."),
+    ...     output_schema=ContinuousAgentForecastOutput,
+    ... )
+    """
+    # Resolve model: wrap bare string in LiteLlm when proxy is configured.
+    model: str | BaseLlm = config.model
+    if isinstance(model, str) and config.openai_base_url:
+        from google.adk.models.lite_llm import LiteLlm  # noqa: PLC0415
+
+        # Prefix with "openai/" so LiteLLM uses the OpenAI-compatible path.
+        # LiteLLM strips the prefix before sending, so the proxy receives the
+        # bare model name.
+        litellm_model = model if model.startswith("openai/") else f"openai/{model}"
+        model = LiteLlm(
+            model=litellm_model,
+            api_base=config.openai_base_url,
+            api_key=config.openai_api_key,
+        )
+
+    # Configure tools
+    tools: list[Any] = []
+
+    if config.code_execution.enabled:
+        tools.append(
+            CodeInterpreter(
+                template_name=config.code_execution.template_name,
+                sandbox_timeout_seconds=config.code_execution.sandbox_timeout_seconds,
+                code_execution_timeout_seconds=config.code_execution.code_execution_timeout_seconds,
+            ).run_code
+        )
+
+    if config.context_retrieval.enabled:
+        openai_base_url = config.openai_base_url or os.getenv("OPENAI_BASE_URL") or ""
+        tools.append(
+            _build_search_tool(
+                config.context_retrieval,
+                openai_base_url=openai_base_url,
+                openai_api_key=config.openai_api_key,
+            )
+        )
+
+    # Load skills
+    skills: list[Skill] = []
+    for skills_dir in config.skills_dirs:
+        skills.append(load_skill_from_dir(skills_dir))
+
+    if skills:
+        tools.append(SkillToolset(skills=skills))
+
+    # Append any extra implementation-specific tools (e.g. adaptive skill
+    # mutation tools).  These run in the host process, not in E2B.
+    for extra in config.extra_tools:
+        tools.append(extra)
+
+    # For LiteLlm agents with both output_schema and tools, ADK's
+    # can_use_output_schema_with_tools() returns True and skips set_model_response
+    # injection, using response_format instead.  However, Gemini thinking models
+    # (e.g. gemini-3.5-flash) are trained to call set_model_response when
+    # producing structured output alongside other tools — and they do so even when
+    # output_schema=None on the Python side.
+    #
+    # The real SetModelResponseTool fails here because its function declaration
+    # uses JSON Schema $defs/$ref (from the Pydantic output schema), which Gemini
+    # rejects via the OpenAI-compatible proxy.
+    #
+    # Fix: register our flat-schema shim (_build_set_model_response_tool) that
+    # accepts the JSON as a plain string and parks it in session state.  Clear
+    # output_schema so ADK does not also try to enforce it via response_format.
+    # AdkTextRunner reads the state key after the run and returns the captured
+    # JSON as the final output.
+    #
+    # This applies to *every* proxy-routed (LiteLlm) agent with an output_schema,
+    # not only tool-bearing ones: a schema-only agent with no other tools (e.g. a
+    # bare AgentPredictor) would otherwise send the Pydantic schema as Gemini's
+    # response_schema and 400 on $defs/$ref/additionalProperties through the proxy.
+    # When the shim is the only tool, the model emits the JSON via set_model_response
+    # (or as plain text, which AdkTextRunner returns as a fallback) — both paths are
+    # handled downstream. Direct-Gemini (non-LiteLlm) agents keep the native schema.
+    effective_output_schema = output_schema
+    try:
+        from google.adk.models.lite_llm import LiteLlm as _LiteLlm  # noqa: PLC0415
+
+        if output_schema is not None and isinstance(model, _LiteLlm):
+            tools.append(_build_set_model_response_tool())
+            effective_output_schema = None
+    except ImportError:
+        pass
+
+    # Conventional function tools (e.g. ForecastTool) attach directly.
+    tools.extend(config.function_tools)
+
+    thinking_config = (
+        ThinkingConfig(
+            include_thoughts=True,
+            thinking_budget=config.thinking_budget,
+            thinking_level=config.thinking_level,
+        )
+        if config.thinking_budget is not None or config.thinking_level is not None
+        else None
+    )
+
+    automatic_function_calling = _build_automatic_function_calling_config(
+        config,
+        tools=tools,
+        output_schema=output_schema,
+    )
+
+    return LlmAgent(
+        name=config.name,
+        description=config.description,
+        model=model,
+        instruction=config.instruction,
+        tools=tools,
+        output_schema=effective_output_schema,
+        generate_content_config=GenerateContentConfig(
+            seed=config.seed,
+            temperature=config.temperature,
+            max_output_tokens=config.max_output_tokens,
+            thinking_config=thinking_config,
+            automatic_function_calling=automatic_function_calling,
+        ),
+    )
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__curriculum.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__curriculum.py.md
new file mode 100644
index 0000000..402af8c
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__curriculum.py.md
@@ -0,0 +1,520 @@
+# Source: aieng-forecasting/aieng/forecasting/methods/agentic/curriculum.py
+
+kind: python
+
+```python
+"""Curriculum assembly utilities for adaptive agent training.
+
+These functions help prepare structured learning material from historical
+backtest results and cached context documents, and assemble it into a single
+curriculum prompt that can be sent to an adaptive agent via
+:class:`~aieng.forecasting.methods.agentic.adk_runner.AdkTextRunner`.
+
+The paradigm is **curriculum learning** — the agent studies evidence as a new
+analyst would study case files, rather than simulating itself going back in
+time.  The curriculum utility functions are domain-agnostic; domain-specific
+curriculum builders in each implementation assemble and pass the right content.
+
+Typical usage::
+
+    from aieng.forecasting.methods.agentic.curriculum import (
+        format_backtest_report,
+        load_context_documents,
+        build_curriculum_prompt,
+    )
+
+    report = format_backtest_report(
+        result=backtest_result,
+        actuals=actuals_dict,
+        title="2024 WTI Baseline Backtest",
+        training_start=date(2024, 1, 1),
+        training_end=date(2024, 12, 31),
+    )
+
+    context_docs = load_context_documents(
+        context_dir=Path("adaptive_agent/curriculum/context"),
+        dates=["2024-03-04", "2024-06-03", ...],
+    )
+
+    prompt = build_curriculum_prompt(
+        report=report,
+        context_documents=context_docs,
+        as_of="2024-12-31",
+        preamble="Review 2024 WTI forecasting performance for systematic patterns.",
+    )
+
+    reply = await runner.run_text_async(prompt)
+"""
+
+from __future__ import annotations
+
+import logging
+import math
+import warnings
+from datetime import date, datetime
+from pathlib import Path
+from typing import TYPE_CHECKING
+
+import numpy as np
+from aieng.forecasting.evaluation.backtest import BacktestResult
+from aieng.forecasting.evaluation.prediction import ContinuousForecast, Prediction
+
+
+if TYPE_CHECKING:
+    import pandas as pd
+
+logger = logging.getLogger(__name__)
+
+# ---------------------------------------------------------------------------
+# Vol-regime helper
+# ---------------------------------------------------------------------------
+
+_VOL_REGIMES = [
+    (15.0, "low"),
+    (30.0, "medium"),
+    (50.0, "elevated"),
+    (math.inf, "extreme"),
+]
+
+
+_MIN_VOL_WINDOW = 5
+_COV_LOW = 0.70
+_COV_HIGH = 0.90
+_BIAS_FRACTION = 0.3
+_COV_TREND_THRESHOLD = 0.05
+_MAE_TREND_THRESHOLD = 1.1
+_MIN_HORIZONS_FOR_NARRATIVE = 2
+
+
+def _vol_regime(price_series: pd.DataFrame, as_of: datetime, lookback: int = 21) -> str:
+    """Classify the vol regime at *as_of*.
+
+    Uses the last *lookback* observations (``lookback - 1`` log returns).
+    """
+    import pandas as pd  # noqa: PLC0415 — conditional import for optional dep
+
+    ts = pd.to_datetime(price_series["timestamp"])
+    vals = price_series.loc[ts <= pd.Timestamp(as_of), "value"].values
+    window = vals[-lookback:]
+    if len(window) < _MIN_VOL_WINDOW:
+        return "unknown"
+    log_returns = np.diff(np.log(window.astype(float)))
+    annualized_vol = float(np.std(log_returns) * np.sqrt(252) * 100)
+    for threshold, label in _VOL_REGIMES:
+        if annualized_vol < threshold:
+            return label
+    return "extreme"
+
+
+# ---------------------------------------------------------------------------
+# Report
+# ---------------------------------------------------------------------------
+
+
+def format_backtest_report(  # noqa: PLR0912, PLR0913, PLR0915
+    result: BacktestResult,
+    actuals: dict[tuple[str, int], float],
+    *,
+    title: str = "Backtest Report",
+    training_start: date | None = None,
+    training_end: date | None = None,
+    baseline_result: BacktestResult | None = None,
+    price_series: pd.DataFrame | None = None,
+) -> str:
+    """Render a backtest result as a curriculum document.
+
+    Formats a :class:`~aieng.forecasting.evaluation.backtest.BacktestResult`
+    as a structured markdown document for curriculum delivery.
+
+    Produces a header, an optional naive-baseline comparison table, per-horizon
+    detail sections, and a cross-horizon pattern narrative.  Each per-horizon
+    section includes:
+
+    - **Coverage** — fraction of actuals inside the 80% CI (target: 0.80).
+    - **Mean bias** — signed mean error (positive = over-forecasting).
+    - **MAE** — mean absolute error of the point forecast.
+    - **Interval width** — average 80% CI width, vs. width needed for 80% coverage.
+    - **Regime breakdown** — coverage and MAE by vol regime (if *price_series* given).
+
+    Parameters
+    ----------
+    result : BacktestResult
+        Completed backtest result.
+    actuals : dict[tuple[str, int], float]
+        Mapping from ``(as_of_date_str, horizon_days)`` to the realised value.
+        ``as_of_date_str`` must match ``str(prediction.as_of.date())``.
+    title : str, default="Backtest Report"
+        Section heading at the top of the document.
+    training_start : date or None
+        If provided, only predictions with ``as_of.date() >= training_start``
+        are included.
+    training_end : date or None
+        If provided, only predictions with ``as_of.date() <= training_end``
+        are included.
+    baseline_result : BacktestResult or None
+        Optional naive/last-value backtest result for the relative-skill
+        comparison table.  The same *actuals* dict is used for scoring.
+    price_series : DataFrame or None
+        Full price series returned by ``data_service.get_series()`` (columns:
+        ``timestamp``, ``value``).  When provided, each origin is classified
+        into a vol regime (low / medium / elevated / extreme) based on 21-day
+        realized volatility, and per-regime coverage/MAE tables are appended
+        to each horizon section.
+
+    Returns
+    -------
+    str
+        Markdown-formatted curriculum document.
+    """
+    preds = result.predictions
+
+    if training_start is not None:
+        preds = [p for p in preds if p.as_of.date() >= training_start]
+    if training_end is not None:
+        preds = [p for p in preds if p.as_of.date() <= training_end]
+
+    if not preds:
+        return f"# {title}\n\nNo predictions in the specified training window.\n"
+
+    # Organise by horizon
+    horizons: dict[int, list[Prediction]] = {}
+    for pred in preds:
+        h = (pred.forecast_date - pred.as_of).days
+        horizons.setdefault(h, []).append(pred)
+
+    # Pre-compute vol regime per origin (optional)
+    regime_at: dict[str, str] = {}
+    if price_series is not None:
+        for pred in preds:
+            key = str(pred.as_of.date())
+            if key not in regime_at:
+                regime_at[key] = _vol_regime(price_series, pred.as_of)
+
+    # ── Header ───────────────────────────────────────────────────────────────
+    lines: list[str] = [
+        f"# {title}",
+        "",
+        f"**Predictor:** {result.predictor_id}  ",
+        f"**Origins included:** {len({str(p.as_of.date()) for p in preds})}  ",
+        f"**Mean CRPS (all horizons):** {result.mean_score:.4f}",
+        "",
+    ]
+
+    # ── Naive comparison (optional) ──────────────────────────────────────────
+    if baseline_result is not None:
+        b_preds = baseline_result.predictions
+        if training_start is not None:
+            b_preds = [p for p in b_preds if p.as_of.date() >= training_start]
+        if training_end is not None:
+            b_preds = [p for p in b_preds if p.as_of.date() <= training_end]
+
+        b_horizons: dict[int, list[Prediction]] = {}
+        for pred in b_preds:
+            h = (pred.forecast_date - pred.as_of).days
+            b_horizons.setdefault(h, []).append(pred)
+
+        lines += [
+            "## Relative skill vs. naive baseline",
+            "",
+            f"Baseline predictor: **{baseline_result.predictor_id}**  ",
+            f"Baseline mean CRPS: {baseline_result.mean_score:.4f}  ",
+            f"This predictor mean CRPS: {result.mean_score:.4f}",
+            "",
+            "| Horizon | This MAE | Baseline MAE | Skill (lower MAE is better) |",
+            "|---------|----------|--------------|-------------------------|",
+        ]
+
+        def _mae(pred_list: list[Prediction], horizon: int) -> float:
+            errs = []
+            for p in pred_list:
+                k = (str(p.as_of.date()), horizon)
+                a = actuals.get(k)
+                if a is not None and isinstance(p.payload, ContinuousForecast):
+                    errs.append(abs(p.payload.point_forecast - a))
+            return float(np.mean(errs)) if errs else float("nan")
+
+        for h in sorted(horizons):
+            this_mae = _mae(horizons[h], h)
+            base_mae = _mae(b_horizons.get(h, []), h)
+            skill = (
+                f"{this_mae:.2f} vs {base_mae:.2f} "
+                f"({'better' if this_mae < base_mae else 'worse'} by {abs(this_mae - base_mae):.2f})"
+                if not math.isnan(base_mae)
+                else f"{this_mae:.2f} (no baseline)"
+            )
+            lines.append(f"| {h}d | {this_mae:.2f} | {base_mae:.2f} | {skill} |")
+        lines += ["", "---", ""]
+
+    # ── Per-horizon detail ────────────────────────────────────────────────────
+    horizon_summaries: list[str] = []  # bullet lines for the narrative section
+    _cov_vals: list[float] = []
+    _mae_vals: list[float] = []
+    _bias_vals: list[float] = []
+
+    for h in sorted(horizons):
+        h_preds = horizons[h]
+        resolved: list[tuple[Prediction, float, bool, float, float, float]] = []
+        unresolved_count = 0
+
+        for pred in h_preds:
+            ak = (str(pred.as_of.date()), h)
+            actual = actuals.get(ak)
+            if actual is None:
+                unresolved_count += 1
+                continue
+            if not isinstance(pred.payload, ContinuousForecast):
+                continue
+            lower = pred.payload.quantiles.get(0.1, float("nan"))
+            upper = pred.payload.quantiles.get(0.9, float("nan"))
+            covered = lower <= actual <= upper
+            error = abs(pred.payload.point_forecast - actual)
+            bias = pred.payload.point_forecast - actual
+            ci_width = upper - lower
+            resolved.append((pred, actual, covered, error, bias, ci_width))
+
+        if not resolved:
+            lines += [
+                f"## Horizon: {h} days",
+                "",
+                f"No resolved predictions (unresolved: {unresolved_count}).",
+                "",
+            ]
+            continue
+
+        n = len(resolved)
+        coverage = sum(1 for r in resolved if r[2]) / n
+        mae = float(np.mean([r[3] for r in resolved]))
+        mean_bias = float(np.mean([r[4] for r in resolved]))
+        avg_ci_width = float(np.mean([r[5] for r in resolved]))
+        # Half-width needed for a symmetric interval to achieve 80% coverage
+        required_half_width = float(np.percentile([r[3] for r in resolved], 80))
+
+        lines += [
+            f"## Horizon: {h} days",
+            "",
+            "| Metric | Value |",
+            "|--------|-------|",
+            f"| Predictions resolved | {n} |",
+            f"| 80% CI coverage | {coverage:.1%} (target 80%) |",
+            f"| Mean bias (forecast − actual) | {mean_bias:+.2f} "
+            f"({'over-forecasting' if mean_bias > 0 else 'under-forecasting'}) |",
+            f"| Mean absolute error | {mae:.2f} |",
+            f"| Average 80% CI width | {avg_ci_width:.2f} |",
+            f"| Width needed for 80% coverage | ±{required_half_width:.2f} "
+            f"(current half-width: ±{avg_ci_width / 2:.2f}) |",
+        ]
+        if unresolved_count:
+            lines.append(f"| Unresolved (skipped) | {unresolved_count} |")
+        lines.append("")
+
+        # Coverage / width commentary
+        if coverage < _COV_LOW:
+            ratio = required_half_width / (avg_ci_width / 2) if avg_ci_width > 0 else float("nan")
+            if mean_bias > mae * _BIAS_FRACTION:
+                bias_note = "intervals are also off-center (systematic over-forecast)."
+            elif mean_bias < -mae * _BIAS_FRACTION:
+                bias_note = "intervals are also off-center (systematic under-forecast)."
+            else:
+                bias_note = "point forecasts are roughly unbiased; the issue is interval width alone."
+            lines.append(
+                f"> **Coverage {coverage:.1%} is well below target.** "
+                f"Intervals are too narrow — they would need to be "
+                f"~{ratio:.1f}× wider to capture 80% of actuals. "
+                f"Mean bias of {mean_bias:+.2f} suggests {bias_note}"
+            )
+        elif coverage > _COV_HIGH:
+            lines.append(
+                f"> **Coverage {coverage:.1%} is above target** — intervals may be overly conservative at this horizon."
+            )
+        lines.append("")
+
+        # Regime breakdown (optional)
+        if regime_at:
+            regime_buckets: dict[str, list[tuple[bool, float, float]]] = {}
+            for r in resolved:
+                pred_obj = r[0]
+                regime = regime_at.get(str(pred_obj.as_of.date()), "unknown")
+                regime_buckets.setdefault(regime, []).append((r[2], r[3], r[4]))
+
+            regime_order = ["low", "medium", "elevated", "extreme", "unknown"]
+            present = [reg for reg in regime_order if reg in regime_buckets]
+            if len(present) > 1:
+                lines += [
+                    "**Regime breakdown:**",
+                    "",
+                    "| Vol regime | N | Coverage | MAE | Mean bias |",
+                    "|-----------|---|----------|-----|-----------|",
+                ]
+                for reg in present:
+                    bucket = regime_buckets[reg]
+                    reg_cov = sum(1 for c, _, _ in bucket if c) / len(bucket)
+                    reg_mae = float(np.mean([e for _, e, _ in bucket]))
+                    reg_bias = float(np.mean([b for _, _, b in bucket]))
+                    lines.append(f"| {reg} | {len(bucket)} | {reg_cov:.1%} | {reg_mae:.2f} | {reg_bias:+.2f} |")
+                lines.append("")
+
+        # Collect values for cross-horizon narrative
+        _cov_vals.append(coverage)
+        _mae_vals.append(mae)
+        _bias_vals.append(mean_bias)
+
+        bias_dir = "over" if mean_bias > 0 else "under"
+        horizon_summaries.append(
+            f"h={h}d: coverage {coverage:.1%}, MAE {mae:.2f}, "
+            f"bias {mean_bias:+.2f} ({bias_dir}), "
+            f"CI width {avg_ci_width:.2f} (needed {required_half_width * 2:.2f})"
+        )
+
+    # ── Cross-horizon narrative ────────────────────────────────────────────────
+    if len(horizon_summaries) > 1:
+        lines += [
+            "---",
+            "",
+            "## Cross-horizon pattern summary",
+            "",
+        ]
+        lines += [f"- {s}" for s in horizon_summaries]
+        lines.append("")
+
+        # Synthesize from values already collected in the per-horizon loop
+        if len(_cov_vals) >= _MIN_HORIZONS_FOR_NARRATIVE:
+            if _cov_vals[-1] < _cov_vals[0] - _COV_TREND_THRESHOLD:
+                cov_trend = "worsens"
+            elif _cov_vals[-1] > _cov_vals[0] + _COV_TREND_THRESHOLD:
+                cov_trend = "improves"
+            else:
+                cov_trend = "is roughly flat"
+            mae_trend = "increases" if _mae_vals[-1] > _mae_vals[0] * _MAE_TREND_THRESHOLD else "is flat"
+            bias_consistent = all(b > 0 for b in _bias_vals) or all(b < 0 for b in _bias_vals)
+            if bias_consistent:
+                bias_note = (
+                    f"Bias is **consistent in direction** across all horizons "
+                    f"({'+' if _bias_vals[0] > 0 else '-'}), suggesting a structural "
+                    "over/under-forecast rather than a horizon-specific issue."
+                )
+            else:
+                bias_note = "Bias **changes direction** across horizons, suggesting a more complex error pattern."
+            lines += [
+                f"Coverage **{cov_trend}** across horizons. MAE **{mae_trend}** with horizon. {bias_note}",
+                "",
+            ]
+
+    return "\n".join(lines)
+
+
+def load_context_documents(
+    context_dir: Path,
+    dates: list[str],
+) -> list[tuple[str, str]]:
+    """Load pre-cached context markdown files for a list of dates.
+
+    Files are expected to be named ``<prefix>_<YYYY-MM-DD>.md`` (any prefix).
+    Matching is by date suffix: any file in ``context_dir`` whose stem ends with
+    the date string matches, and the first match in sorted order is used. Dates
+    with no matching file trigger a warning and are skipped.
+
+    Parameters
+    ----------
+    context_dir : Path
+        Directory containing pre-cached context files.
+    dates : list[str]
+        ISO-8601 date strings to load (e.g. ``["2024-03-04", "2024-06-03"]``).
+
+    Returns
+    -------
+    list[tuple[str, str]]
+        ``(date_str, content)`` pairs for each date that had a cached file,
+        sorted by date ascending.
+    """
+    results: list[tuple[str, str]] = []
+    for d in dates:
+        matches = sorted(context_dir.glob(f"*{d}.md"))
+        if not matches:
+            warnings.warn(
+                f"No cached context file found for date {d} in {context_dir}. Skipping.",
+                stacklevel=2,
+            )
+            continue
+        if len(matches) > 1:
+            logger.warning("Multiple context files match date %s; using %s", d, matches[0])
+        results.append((d, matches[0].read_text(encoding="utf-8")))
+
+    return sorted(results, key=lambda x: x[0])
+
+
+def build_curriculum_prompt(
+    report: str,
+    context_documents: list[tuple[str, str]],
+    *,
+    as_of: str,
+    preamble: str = "",
+) -> str:
+    """Assemble a structured curriculum message for the agent.
+
+    Combines a backtest report and any number of dated context documents into a
+    single prompt the agent receives as a curriculum delivery message.  The
+    agent is expected to:
+
+    1. Read the backtest report and identify systematic patterns.
+    2. Read the context documents to understand what information was available
+       at each date.
+    3. Decide whether any findings meet the evidence threshold in the
+       ``meta-learning`` skill and call the appropriate mutation tools.
+
+    Parameters
+    ----------
+    report : str
+        Backtest report markdown (from :func:`format_backtest_report`).
+    context_documents : list[tuple[str, str]]
+        ``(date_str, content)`` pairs from :func:`load_context_documents`.
+        May be empty for a statistics-only curriculum.
+    as_of : str
+        The end date of the training period.  Included in the prompt header
+        so the agent knows the temporal scope of the curriculum.
+    preamble : str, optional
+        Domain-specific framing text prepended before the report.  Use this to
+        orient the agent (e.g. "You are reviewing your 2024 WTI forecasting
+        performance to identify systematic patterns.").
+
+    Returns
+    -------
+    str
+        Complete curriculum message, ready to send via
+        :class:`~aieng.forecasting.methods.agentic.adk_runner.AdkTextRunner`.
+    """
+    parts: list[str] = []
+
+    parts.append(
+        f"## Curriculum delivery — training period ending {as_of}\n\n"
+        "This is a structured self-study session, not a prediction request. "
+        "Read the materials below, identify any systematic patterns in your "
+        "forecasting behaviour, and decide whether any findings meet the "
+        "evidence threshold described in your `meta-learning` skill. "
+        "Call mutation tools only if the evidence warrants it."
+    )
+
+    if preamble.strip():
+        parts.append(f"\n{preamble.strip()}")
+
+    parts.append(f"\n---\n\n{report}")
+
+    if context_documents:
+        parts.append(
+            "\n---\n\n## Market context at key dates\n\n"
+            "The following summaries describe what market and news context was "
+            "available at selected dates during the training period. Use them "
+            "to assess whether your information-weighting approach was well-calibrated."
+        )
+        for d, content in context_documents:
+            parts.append(f"\n### Context as of {d}\n\n{content.strip()}")
+
+    parts.append(
+        "\n---\n\n"
+        "Review the materials above. If you identify a pattern meeting the "
+        "evidence threshold, call the appropriate tool(s) (`record_observation`, "
+        "`open_hypothesis`, etc.). If the evidence is insufficient, state why "
+        "and what additional resolutions would be needed."
+    )
+
+    return "\n".join(parts)
+```
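
Reviewer note on `curriculum.py`: the per-horizon metrics in `format_backtest_report` can be checked by hand with a small standard-library sketch. This is not the module's code. The names `percentile_linear` and `interval_diagnostics` and the sample rows are hypothetical, and the percentile helper assumes the linear interpolation that `np.percentile` uses by default.

```python
from statistics import fmean


def percentile_linear(values, q):
    """q-th percentile (0-100), interpolating linearly between order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must be non-empty")
    rank = (len(ordered) - 1) * q / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def interval_diagnostics(rows):
    """Summarise (point_forecast, p10, p90, actual) tuples for one horizon."""
    rows = list(rows)
    abs_errors = [abs(point - actual) for point, _, _, actual in rows]
    return {
        # Fraction of actuals that fall inside the 80% interval [p10, p90].
        "coverage": fmean(1.0 if lo <= actual <= hi else 0.0 for _, lo, hi, actual in rows),
        # Signed error: positive means the point forecast sat above the actual.
        "mean_bias": fmean(point - actual for point, _, _, actual in rows),
        "mae": fmean(abs_errors),
        "avg_ci_width": fmean(hi - lo for _, lo, hi, _ in rows),
        # Half-width a symmetric interval around the point forecast would have
        # needed to contain 80% of the actuals.
        "required_half_width": percentile_linear(abs_errors, 80),
    }


# Hypothetical sample: four origins with identical forecasts and varying actuals.
rows = [
    (100.0, 95.0, 105.0, 102.0),
    (100.0, 95.0, 105.0, 110.0),
    (100.0, 95.0, 105.0, 99.0),
    (100.0, 95.0, 105.0, 90.0),
]
diagnostics = interval_diagnostics(rows)
```

With these rows the average half-width is 5.0 and the required half-width is 10.0, so the report's "would need to be ~N× wider" ratio comes out at 2.0.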
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__forecast_tool.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__forecast_tool.py.md
new file mode 100644
index 0000000..ef69185
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__forecast_tool.py.md
@@ -0,0 +1,264 @@
+# Source: aieng-forecasting/aieng/forecasting/methods/agentic/forecast_tool.py
+
+kind: python
+
+```python
+"""A conventional ADK function tool that runs a forecasting model on demand.
+
+:class:`ForecastTool` exposes a single, rigidly-typed callable that lets an
+analyst agent ask: *"show me what a statistical forecast would look like on
+this series, using the data available up to this date, for these horizons."*
+
+Unlike the open-ended code-execution path, this tool gives the agent a fixed,
+auditable interface to a pre-specified
+:class:`~aieng.forecasting.evaluation.predictor.Predictor`. The agent supplies
+only metadata (series id, cutoff date, horizons, frequency); the underlying
+series data never passes through the LLM context window.
+
+The tool is constructed with a
+:class:`~aieng.forecasting.data.service.DataService` and a ``Predictor``
+(dependency injection). At call time it builds a
+:class:`~aieng.forecasting.data.context.ForecastContext` scoped to the requested
+cutoff date and invokes the predictor against it, so the same
+information-cutoff discipline used in backtests applies here.
+
+Scope: the tool reports continuous (numeric) forecasts. The injected predictor
+must emit
+:class:`~aieng.forecasting.evaluation.prediction.ContinuousForecast` payloads;
+other modalities are out of scope.
+
+This module requires the ``agentic`` extra; importing it without the extra
+raises :class:`ImportError` with installation guidance.
+"""
+
+from __future__ import annotations
+
+import json
+from datetime import datetime
+
+import pandas as pd
+from aieng.forecasting.data.service import DataService
+from aieng.forecasting.evaluation.prediction import ContinuousForecast, Prediction
+from aieng.forecasting.evaluation.predictor import Predictor
+from aieng.forecasting.evaluation.task import ForecastingTask
+from aieng.forecasting.methods.numerical.darts_arima import DartsAutoARIMAPredictor
+
+
+try:
+    from google.adk.tools.function_tool import FunctionTool
+except ModuleNotFoundError as exc:
+    raise ImportError(
+        "This module requires the 'agentic' extra. Install it with 'pip install aieng-forecasting[agentic]'."
+    ) from exc
+
+
+#: Prediction-interval bounds reported by the tool, keyed by nominal coverage.
+#: 95% is intentionally absent: the standard quantile grid tops out at p05/p95,
+#: so the widest honest interval is 90%. Reporting a "95%" interval here would
+#: require extrapolation beyond what the model actually produces.
+_INTERVAL_QUANTILES: dict[str, tuple[float, float]] = {
+    "80%": (0.10, 0.90),
+    "90%": (0.05, 0.95),
+}
+
+
+class ForecastTool:
+    """ADK function tool that runs a forecasting predictor on a registered series.
+
+    Wraps a :class:`~aieng.forecasting.evaluation.predictor.Predictor` behind a
+    rigid, JSON-native callable signature suitable for registration as a Google
+    ADK :class:`~google.adk.tools.FunctionTool`. The tool is general-purpose: it
+    forecasts any series registered in the injected
+    :class:`~aieng.forecasting.data.service.DataService`, selected by
+    ``series_id`` at call time.
+
+    The wrapped predictor is fixed at construction time. To expose a different
+    method, construct a new tool with a different predictor; the predictor's
+    identity belongs in the tool description shown to the agent.
+
+    Parameters
+    ----------
+    data_service : DataService
+        Already-populated data service. The tool reads from it but never
+        fetches from external APIs. Series are selected by ``series_id``.
+    predictor : Predictor or None, default=None
+        Predictor to invoke. When ``None``, a
+        :class:`~aieng.forecasting.methods.numerical.darts_arima.DartsAutoARIMAPredictor`
+        is constructed with its own defaults. To tune it (e.g. reduce
+        ``num_samples`` to bound agent latency), pass an explicit instance such
+        as ``DartsAutoARIMAPredictor(num_samples=200)``. The predictor must emit
+        :class:`~aieng.forecasting.evaluation.prediction.ContinuousForecast`
+        payloads.
+
+    Examples
+    --------
+    >>> from aieng.forecasting.methods.agentic import ForecastTool
+    >>> tool = ForecastTool(data_service=svc)
+    >>> function_tool = tool.as_function_tool()  # register on an AgentConfig
+    """
+
+    def __init__(
+        self,
+        data_service: DataService,
+        *,
+        predictor: Predictor | None = None,
+    ) -> None:
+        self._data_service = data_service
+        self._predictor: Predictor = predictor or DartsAutoARIMAPredictor()
+
+    def as_function_tool(self) -> FunctionTool:
+        """Wrap :meth:`run_forecast` as an ADK :class:`FunctionTool`.
+
+        Returns
+        -------
+        FunctionTool
+            Ready to append to an agent's tool list (e.g. via
+            ``AgentConfig.function_tools``). ADK introspects the bound method's
+            signature and docstring to build the tool schema.
+        """
+        return FunctionTool(func=self.run_forecast)
+
+    def run_forecast(
+        self,
+        series_id: str,
+        cutoff_date: str,
+        horizons: list[int],
+        frequency: str,
+    ) -> str:
+        """Fit a forecasting model up to a cutoff date and return its forecast.
+
+        Runs the configured statistical predictor on the requested series using
+        only data available on or before ``cutoff_date``, and returns its point
+        forecasts and prediction intervals for each horizon. Use this to ground
+        your reasoning in a conventional statistical forecast before combining
+        it with retrieved market context.
+
+        Args:
+            series_id: Identifier of the registered series to forecast (e.g.
+                "wti_crude_oil_price").
+            cutoff_date: Forecast origin / information cutoff in YYYY-MM-DD
+                format. Only data on or before this date is used.
+            horizons: Steps ahead to forecast, in units of ``frequency`` (e.g.
+                [1, 5, 10] for a daily/business-day series).
+            frequency: Pandas offset alias matching the series sampling, e.g.
+                "B" (business day), "D" (daily), "MS" (month start).
+
+        Returns
+        -------
+            A JSON string with the point forecast, 80% and 90% prediction
+            interval bounds, and the full quantile grid for each horizon, plus
+            the series description, units, and cutoff date used.
+        """
+        try:
+            as_of = datetime.strptime(cutoff_date, "%Y-%m-%d")
+        except ValueError:
+            return self._error(
+                f"Invalid cutoff_date '{cutoff_date}'. Expected format YYYY-MM-DD.",
+                series_id=series_id,
+                cutoff_date=cutoff_date,
+            )
+
+        clean_horizons = [int(h) for h in horizons]
+        if not clean_horizons or any(h < 1 for h in clean_horizons):
+            return self._error(
+                "horizons must be a non-empty list of positive integers.",
+                series_id=series_id,
+                cutoff_date=cutoff_date,
+            )
+
+        try:
+            context = self._data_service.context(as_of)
+            metadata = context.get_metadata(series_id)
+            history = context.get_series(series_id)
+        except KeyError:
+            return self._error(
+                f"Series '{series_id}' is not registered. Available series: "
+                f"{', '.join(self._data_service.series_ids)}.",
+                series_id=series_id,
+                cutoff_date=cutoff_date,
+            )
+
+        if history.empty:
+            return self._error(
+                f"No observations available for '{series_id}' on or before {cutoff_date}.",
+                series_id=series_id,
+                cutoff_date=cutoff_date,
+            )
+
+        task = ForecastingTask(
+            task_id=f"forecast_{series_id}_{cutoff_date}",
+            target_series_id=series_id,
+            horizons=clean_horizons,
+            frequency=frequency,
+            description=f"Forecast for {series_id} as of {cutoff_date}.",
+        )
+
+        try:
+            predictions = self._predictor.predict(task, context)
+        except Exception as exc:  # noqa: BLE001 - surface model failures to the agent as data
+            return self._error(
+                f"Forecast model failed: {type(exc).__name__}: {exc}",
+                series_id=series_id,
+                cutoff_date=cutoff_date,
+            )
+
+        last_row = history.iloc[-1]
+        result = {
+            "status": "ok",
+            "series_id": series_id,
+            "series_description": metadata.description,
+            "units": metadata.units,
+            "frequency": frequency,
+            "cutoff_date": cutoff_date,
+            "n_observations_at_cutoff": int(len(history)),
+            "last_observed": {
+                "date": str(pd.Timestamp(last_row["timestamp"]).date()),
+                "value": float(last_row["value"]),
+            },
+            "forecasts": [
+                self._format_prediction(horizon, prediction)
+                for horizon, prediction in zip(clean_horizons, predictions, strict=True)
+            ],
+            "notes": (
+                "Point forecast is the predictive median. Intervals are derived "
+                "from the model's Monte Carlo quantiles. A 95% interval is not "
+                "reported because the standard quantile grid tops out at p05/p95 "
+                "(widest interval shown is 90%)."
+            ),
+        }
+        return json.dumps(result, indent=2)
+
+    @staticmethod
+    def _format_prediction(horizon: int, prediction: Prediction) -> dict[str, object]:
+        """Render a single :class:`Prediction` as a JSON-friendly dict."""
+        payload = prediction.payload
+        if not isinstance(payload, ContinuousForecast):  # pragma: no cover - defensive
+            raise TypeError(f"Expected ContinuousForecast payload, got {type(payload).__name__}.")
+
+        quantiles = payload.quantiles
+        intervals = {
+            label: {"lower": quantiles[lo], "upper": quantiles[hi]}
+            for label, (lo, hi) in _INTERVAL_QUANTILES.items()
+            if lo in quantiles and hi in quantiles
+        }
+        return {
+            "horizon": horizon,
+            "forecast_date": str(pd.Timestamp(prediction.forecast_date).date()),
+            "point_forecast": payload.point_forecast,
+            "intervals": intervals,
+            "quantiles": {str(q): v for q, v in sorted(quantiles.items())},
+        }
+
+    @staticmethod
+    def _error(message: str, *, series_id: str, cutoff_date: str) -> str:
+        """Return a structured error payload the agent can read and react to."""
+        return json.dumps(
+            {
+                "status": "error",
+                "series_id": series_id,
+                "cutoff_date": cutoff_date,
+                "error": message,
+            },
+            indent=2,
+        )
+```
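A minimal, stdlib-only sketch of how `_format_prediction` in the snapshot above turns a quantile mapping into labelled intervals. The `INTERVAL_QUANTILES` table and the sample values are assumptions for illustration; the module's real `_INTERVAL_QUANTILES` is defined outside this excerpt and may differ.

```python
# Hypothetical stand-in for the module's _INTERVAL_QUANTILES (not shown in the excerpt).
INTERVAL_QUANTILES: dict[str, tuple[float, float]] = {
    "50%": (0.25, 0.75),
    "80%": (0.10, 0.90),
    "90%": (0.05, 0.95),
}


def intervals_from_quantiles(quantiles: dict[float, float]) -> dict[str, dict[str, float]]:
    """Build labelled intervals, skipping any whose bounds are missing."""
    return {
        label: {"lower": quantiles[lo], "upper": quantiles[hi]}
        for label, (lo, hi) in INTERVAL_QUANTILES.items()
        if lo in quantiles and hi in quantiles
    }


# Made-up quantile values; 0.10 and 0.90 are deliberately absent.
sample = {0.05: 1.0, 0.25: 2.0, 0.50: 3.0, 0.75: 4.0, 0.95: 5.0}
intervals = intervals_from_quantiles(sample)
print(intervals)
```

Because the comprehension filters on both bounds being present, an interval whose quantiles are missing is simply omitted rather than raising a `KeyError`.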
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__outputs.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__outputs.py.md
new file mode 100644
index 0000000..0a39edc
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__outputs.py.md
@@ -0,0 +1,634 @@
+# Source: aieng-forecasting/aieng/forecasting/methods/agentic/outputs.py
+
+kind: python
+
+```python
+"""Output schemas for agentic forecasting.
+
+This module defines the structured output contract that an ADK agent must
+satisfy to be driven by
+:class:`~aieng.forecasting.methods.agentic.predictor.AgentPredictor`.
+
+:class:`AgentForecastOutput` is the abstract base; concrete subclasses
+declare their forecast modality via the ``modality`` ``ClassVar`` and
+implement :meth:`AgentForecastOutput.to_predictions` to convert validated
+agent JSON into evaluation
+:class:`~aieng.forecasting.evaluation.prediction.Prediction` objects.
+
+:class:`ContinuousAgentForecastOutput` is the canonical schema for
+continuous forecasting tasks; it enforces the standard quantile
+grid, non-crossing quantiles, and ``point_forecast`` consistency with the
+median. :class:`DiscreteAgentForecastOutput` covers binary event tasks, and
+:class:`CategoricalAgentForecastOutput` covers ordered-categorical tasks
+whose category set is declared on the task.
+"""
+
+import json
+from abc import ABC, abstractmethod
+from datetime import datetime
+from math import isclose, isfinite
+from typing import Any, ClassVar, Literal
+
+import pandas as pd
+from aieng.forecasting.data.context import ForecastContext
+from aieng.forecasting.evaluation.prediction import (
+    STANDARD_QUANTILES,
+    BinaryForecast,
+    CategoricalForecast,
+    ContinuousForecast,
+    Prediction,
+)
+from aieng.forecasting.evaluation.task import ForecastingTask
+from pydantic import BaseModel, Field, field_validator, model_validator
+
+
+class AgentForecastOutput(BaseModel, ABC):
+    """Base class for structured agent forecast output.
+
+    Subclasses declare the forecast modality they produce via the
+    ``modality`` ``ClassVar`` and implement :meth:`to_predictions` to
+    convert validated agent JSON into evaluation
+    :class:`~aieng.forecasting.evaluation.prediction.Prediction` objects.
+
+    Attributes
+    ----------
+    modality : ClassVar[Literal["continuous", "discrete", "categorical"]]
+        Forecast modality this schema produces. Concrete subclasses must
+        set this; :class:`~aieng.forecasting.methods.agentic.predictor.AgentPredictor`
+        reads it to derive its ``predictor_id`` and tracing metadata.
+
+    Notes
+    -----
+    Subclasses must use ``model_config = {"extra": "ignore"}`` (not
+    ``"forbid"``) so that Pydantic does not emit ``additionalProperties:
+    false`` in the JSON schema — that key is rejected by the Gemini API
+    when the schema is used as a response constraint.  All field-level
+    validations (types, constraints, required presence) still apply.
+    """
+
+    modality: ClassVar[Literal["continuous", "discrete", "categorical"]]
+
+    @abstractmethod
+    def to_predictions(
+        self,
+        *,
+        task: ForecastingTask,
+        context: ForecastContext,
+        predictor_id: str,
+        metadata: dict[str, Any] | None = None,
+    ) -> list[Prediction]:
+        """Convert the forecast output to a list of predictions.
+
+        Parameters
+        ----------
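+        task : ForecastingTask
+            Source task; supplies ``task_id``, ``horizons``, and ``frequency``.
+        context : ForecastContext
+            Forecast context whose ``as_of`` anchors each prediction's
+            ``forecast_date``.
+        predictor_id : str
+            Identifier of the predictor that produced this output.
+        metadata : dict[str, Any] | None, default=None
+            Extra metadata merged into every generated ``Prediction.metadata``.
+
+        Returns
+        -------
+        list[Prediction]
+            The predictions derived from this output.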
+        """
+        ...
+
+
+class AgentQuantileForecast(BaseModel):
+    """A single quantile forecast value emitted by an agent.
+
+    Attributes
+    ----------
+    quantile : float
+        Quantile level in the open interval ``(0, 1)``, e.g. ``0.50``.
+    value : float
+        Forecast value at this quantile level. Must be finite.
+    """
+
+    model_config = {"extra": "ignore"}
+
+    quantile: float = Field(description="Quantile level in (0, 1), e.g. 0.50.")
+    value: float = Field(description="Forecast value at this quantile level.")
+
+    @field_validator("quantile", "value")
+    @classmethod
+    def _values_are_finite(cls, value: float) -> float:
+        """Reject NaN and infinite quantile levels and values."""
+        if not isfinite(value):
+            raise ValueError("Forecast quantile levels and values must be finite numbers.")
+        return value
+
+
+class ContinuousAgentHorizonForecast(BaseModel):
+    """Agent output for one continuous forecast horizon.
+
+    Attributes
+    ----------
+    horizon : int
+        Forecast horizon step (>= 1) corresponding to one entry of
+        :attr:`~aieng.forecasting.evaluation.task.ForecastingTask.horizons`.
+    point_forecast : float
+        Central forecast for this horizon. Must equal the 0.50 quantile.
+    quantiles : list[AgentQuantileForecast]
+        Forecast values at every level of
+        :data:`~aieng.forecasting.evaluation.prediction.STANDARD_QUANTILES`,
+        with no duplicates and non-decreasing values.
+    rationale : str
+        Optional horizon-specific explanation propagated to
+        ``Prediction.metadata["horizon_rationale"]`` when non-empty.
+    """
+
+    model_config = {"extra": "ignore"}
+
+    horizon: int = Field(ge=1, description="Forecast horizon step from the task, e.g. 1 for one period ahead.")
+    point_forecast: float = Field(
+        description="Central forecast. This must match the 0.50 quantile to avoid contradictory output."
+    )
+    quantiles: list[AgentQuantileForecast] = Field(
+        description="Forecast values for every standard quantile level.",
+    )
+    rationale: str = Field(default="", description="Optional horizon-specific explanation; omit when not needed.")
+
+    @field_validator("point_forecast")
+    @classmethod
+    def _point_forecast_is_finite(cls, value: float) -> float:
+        """Reject NaN and infinite point forecasts."""
+        if not isfinite(value):
+            raise ValueError("Point forecast must be a finite number.")
+        return value
+
+    @model_validator(mode="after")
+    def _validate_quantiles(self) -> "ContinuousAgentHorizonForecast":
+        """Require the standard quantile grid and a non-crossing distribution."""
+        by_level: dict[float, float] = {}
+        duplicates: list[float] = []
+        for forecast in self.quantiles:
+            if forecast.quantile in by_level:
+                duplicates.append(forecast.quantile)
+            by_level[forecast.quantile] = forecast.value
+
+        if duplicates:
+            raise ValueError(f"Duplicate quantile levels are not allowed: {duplicates}")
+
+        expected = set(STANDARD_QUANTILES)
+        actual = set(by_level)
+        missing = sorted(expected - actual)
+        extra = sorted(actual - expected)
+        if missing or extra:
+            raise ValueError(
+                "Continuous agent forecasts must include exactly the standard quantiles. "
+                f"Missing: {missing}; extra: {extra}"
+            )
+
+        values = [by_level[q] for q in STANDARD_QUANTILES]
+        if any(left > right for left, right in zip(values, values[1:])):
+            raise ValueError("Quantile forecasts must be non-decreasing as quantile levels increase.")
+
+        median = by_level[0.50]
+        if not isclose(self.point_forecast, median, rel_tol=1e-9, abs_tol=1e-9):
+            raise ValueError("point_forecast must match the 0.50 quantile.")
+
+        return self
+
+    def quantile_dict(self) -> dict[float, float]:
+        """Return quantiles as the evaluation payload mapping.
+
+        Returns
+        -------
+        dict[float, float]
+            Mapping from each quantile level in
+            :data:`~aieng.forecasting.evaluation.prediction.STANDARD_QUANTILES`
+            to its forecast value, in standard-quantile order.
+        """
+        by_level = {forecast.quantile: forecast.value for forecast in self.quantiles}
+        return {q: by_level[q] for q in STANDARD_QUANTILES}
+
+
+class ContinuousAgentForecastOutput(AgentForecastOutput):
+    """Canonical agent output for continuous forecasting tasks.
+
+    The agent supplies only forecast values and optional explanatory metadata.
+    Task-owned fields such as ``task_id``, ``as_of``, and ``forecast_date`` are
+    derived during conversion so the output cannot drift from the evaluation
+    contract.
+
+    Attributes
+    ----------
+    forecasts : list[ContinuousAgentHorizonForecast]
+        One forecast per requested task horizon. Horizon values must be
+        unique; :meth:`to_predictions` additionally requires the set of
+        horizons to match ``task.horizons`` exactly.
+    rationale : str
+        Optional overall explanation propagated to
+        ``Prediction.metadata["rationale"]`` when non-empty.
+
+    Examples
+    --------
+    Validating an agent JSON response and converting it to predictions:
+
+    >>> output = ContinuousAgentForecastOutput.model_validate_json(
+    ...     raw_json,
+    ... )
+    >>> predictions = output.to_predictions(
+    ...     task=task,
+    ...     context=context,
+    ...     predictor_id="my_predictor",
+    ... )
+    """
+
+    modality: ClassVar[Literal["continuous", "discrete", "categorical"]] = "continuous"
+
+    model_config = {"extra": "ignore"}
+
+    forecasts: list[ContinuousAgentHorizonForecast] = Field(
+        description="One forecast object for each requested task horizon.",
+    )
+    rationale: str = Field(
+        default="", description="Optional overall explanation for the forecast; omit when not needed."
+    )
+
+    @model_validator(mode="after")
+    def _forecast_horizons_are_unique(self) -> "ContinuousAgentForecastOutput":
+        """Reject empty or duplicate horizon forecasts before task-level conversion."""
+        if not self.forecasts:
+            raise ValueError("forecasts must contain at least one horizon forecast.")
+        seen: set[int] = set()
+        duplicates: list[int] = []
+        for forecast in self.forecasts:
+            if forecast.horizon in seen:
+                duplicates.append(forecast.horizon)
+            seen.add(forecast.horizon)
+
+        if duplicates:
+            raise ValueError(f"Duplicate forecast horizons are not allowed: {duplicates}")
+        return self
+
+    @classmethod
+    def prompt_schema_json(cls) -> str:
+        """Return a JSON template for use in agent instruction strings.
+
+        The quantile list is derived from :data:`STANDARD_QUANTILES` so the
+        template stays in sync automatically when the standard grid changes.
+        Use this in agent instructions instead of a hardcoded JSON block.
+
+        Returns
+        -------
+        str
+            Indented JSON string showing the exact structure the agent must
+            pass to ``set_model_response``.
+        """
+        quantile_entries = [{"quantile": float(q), "value": "<float>"} for q in STANDARD_QUANTILES]
+        template: dict[str, object] = {
+            "forecasts": [
+                {
+                    "horizon": "<integer — one entry per horizon from the task>",
+                    "point_forecast": "<float — must equal the 0.50 quantile value>",
+                    "quantiles": quantile_entries,
+                    "rationale": "<string>",
+                }
+            ],
+            "rationale": "<string, optional overall explanation>",
+        }
+        return json.dumps(template, indent=2)
+
+    def to_predictions(
+        self,
+        *,
+        task: ForecastingTask,
+        context: ForecastContext,
+        predictor_id: str,
+        metadata: dict[str, Any] | None = None,
+    ) -> list[Prediction]:
+        """Convert agent output to evaluation ``Prediction`` objects.
+
+        Parameters
+        ----------
+        task : ForecastingTask
+            Source task. The set of forecast horizons in ``self.forecasts``
+            must match ``task.horizons`` exactly.
+        context : ForecastContext
+            Forecast context whose ``as_of`` anchors each prediction's
+            ``forecast_date`` via ``task.frequency`` arithmetic.
+        predictor_id : str
+            Identifier of the predictor that produced this output.
+        metadata : dict[str, Any] | None, default=None
+            Extra metadata merged into every generated ``Prediction.metadata``.
+            ``rationale`` keys are written after this merge and cannot be
+            overridden here.
+
+        Returns
+        -------
+        list[Prediction]
+            One :class:`~aieng.forecasting.evaluation.prediction.Prediction`
+            per ``task.horizons`` entry, in task-horizon order.
+
+        Raises
+        ------
+        ValueError
+            If the horizons in ``self.forecasts`` do not match ``task.horizons``.
+        """
+        by_horizon = {forecast.horizon: forecast for forecast in self.forecasts}
+        expected = set(task.horizons)
+        actual = set(by_horizon)
+        missing = sorted(expected - actual)
+        extra = sorted(actual - expected)
+        if missing or extra:
+            raise ValueError(
+                f"Continuous agent output must contain exactly the task horizons. Missing: {missing}; extra: {extra}"
+            )
+
+        issued_at = datetime.utcnow()  # naive UTC; Prediction.issued_at expects timezone-naive
+        offset = pd.tseries.frequencies.to_offset(task.frequency)
+        base_metadata: dict[str, Any] = dict(metadata) if metadata is not None else {}
+        if self.rationale.strip():
+            base_metadata["rationale"] = self.rationale
+
+        predictions: list[Prediction] = []
+        for horizon in task.horizons:
+            forecast = by_horizon[horizon]
+            prediction_metadata = dict(base_metadata)
+            if forecast.rationale.strip():
+                prediction_metadata["horizon_rationale"] = forecast.rationale
+
+            quantiles = forecast.quantile_dict()
+            predictions.append(
+                Prediction(
+                    predictor_id=predictor_id,
+                    task_id=task.task_id,
+                    issued_at=issued_at,
+                    as_of=context.as_of,
+                    forecast_date=(pd.Timestamp(context.as_of) + offset * horizon).to_pydatetime(),
+                    payload=ContinuousForecast(
+                        point_forecast=forecast.point_forecast,
+                        quantiles=quantiles,
+                    ),
+                    metadata=prediction_metadata,
+                )
+            )
+
+        return predictions
+
+
+class DiscreteAgentForecastOutput(AgentForecastOutput):
+    """Agent output for binary / discrete-event forecasting tasks.
+
+    Attributes
+    ----------
+    probability : float
+        Predicted probability the event resolves True, in ``[0, 1]``.
+    reasoning : str
+        Optional explanation propagated to
+        ``Prediction.metadata["rationale"]`` when non-empty.
+    direction_bias : str
+        Optional directional label (``up``, ``down``, ``neutral``).
+    key_signals : list[str]
+        Optional list of supporting signals for the forecast.
+    confidence : str
+        Optional self-reported confidence label.
+    """
+
+    modality: ClassVar[Literal["continuous", "discrete", "categorical"]] = "discrete"
+
+    model_config = {"extra": "ignore"}
+
+    probability: float = Field(ge=0.0, le=1.0, description="Predicted probability the event occurs.")
+    reasoning: str = Field(default="", description="Optional explanation for the probability estimate.")
+    direction_bias: str = Field(default="", description="Optional directional label: up, down, or neutral.")
+    key_signals: list[str] = Field(default_factory=list, description="Key signals supporting the estimate.")
+    confidence: str = Field(default="", description="Optional self-reported confidence: high, medium, or low.")
+
+    @classmethod
+    def prompt_schema_json(cls) -> str:
+        """Return a JSON template for use in agent instruction strings.
+
+        Returns
+        -------
+        str
+            Indented JSON string showing the exact structure the agent must
+            pass to ``set_model_response``.
+        """
+        template: dict[str, object] = {
+            "probability": "<float in [0, 1]>",
+            "direction_bias": "<'up' | 'down' | 'neutral'>",
+            "reasoning": "<string>",
+            "key_signals": ["<signal 1>", "<signal 2>"],
+            "confidence": "<'high' | 'medium' | 'low'>",
+        }
+        return json.dumps(template, indent=2)
+
+    def to_predictions(
+        self,
+        *,
+        task: ForecastingTask,
+        context: ForecastContext,
+        predictor_id: str,
+        metadata: dict[str, Any] | None = None,
+    ) -> list[Prediction]:
+        """Convert agent output to a single binary :class:`Prediction`.
+
+        Raises
+        ------
+        ValueError
+            If ``task.horizons`` does not contain exactly one horizon.
+        """
+        if len(task.horizons) != 1:
+            raise ValueError("Discrete agent output expects exactly one task horizon.")
+
+        horizon = task.horizons[0]
+        issued_at = datetime.utcnow()  # naive UTC; Prediction.issued_at expects timezone-naive
+        offset = pd.tseries.frequencies.to_offset(task.frequency)
+        prediction_metadata: dict[str, Any] = dict(metadata) if metadata is not None else {}
+        if self.reasoning.strip():
+            prediction_metadata["rationale"] = self.reasoning
+        if self.direction_bias.strip():
+            prediction_metadata["direction_bias"] = self.direction_bias
+        if self.key_signals:
+            prediction_metadata["key_signals"] = list(self.key_signals)
+        if self.confidence.strip():
+            prediction_metadata["confidence"] = self.confidence
+
+        return [
+            Prediction(
+                predictor_id=predictor_id,
+                task_id=task.task_id,
+                issued_at=issued_at,
+                as_of=context.as_of,
+                forecast_date=(pd.Timestamp(context.as_of) + offset * horizon).to_pydatetime(),
+                payload=BinaryForecast(probability=self.probability),
+                metadata=prediction_metadata,
+            )
+        ]
+
+
+#: Maximum allowed |sum - 1| before a categorical agent distribution is
+#: rejected instead of renormalized in ``to_predictions``.
+CATEGORICAL_RENORMALIZATION_TOLERANCE: float = 0.05
+
+
+class AgentCategoryProbability(BaseModel):
+    """One (label, probability) row of a categorical agent forecast.
+
+    Attributes
+    ----------
+    label : str
+        Category label. Must match one of the task's declared category
+        labels; checked during :meth:`CategoricalAgentForecastOutput.to_predictions`.
+    probability : float
+        Predicted probability of this category, in ``[0, 1]``.
+    """
+
+    model_config = {"extra": "ignore"}
+
+    label: str = Field(min_length=1, description="Category label from the task's declared category set.")
+    probability: float = Field(ge=0.0, le=1.0, description="Predicted probability of this category.")
+
+
+class CategoricalAgentForecastOutput(AgentForecastOutput):
+    """Agent output for ordered-categorical forecasting tasks.
+
+    The agent supplies one probability per category label plus optional
+    explanatory metadata. The category order, label set, and series-value
+    mapping live on the task (``task.categories``); :meth:`to_predictions`
+    validates the agent's labels against that declaration, so the schema
+    itself stays task-agnostic.
+
+    Schema validation enforces per-row constraints only. Cross-row
+    constraints (exact label-set match, probabilities summing to 1) are
+    enforced in :meth:`to_predictions`, where the task is available. Sums
+    within :data:`CATEGORICAL_RENORMALIZATION_TOLERANCE` of 1 are
+    renormalized — LLMs routinely emit 0.99 totals — with the raw sum
+    recorded in ``Prediction.metadata["probability_sum_raw"]``; sums further
+    off raise.
+
+    Attributes
+    ----------
+    probabilities : list[AgentCategoryProbability]
+        One probability per category label, in any order.
+    reasoning : str
+        Optional explanation propagated to
+        ``Prediction.metadata["rationale"]`` when non-empty.
+    key_signals : list[str]
+        Optional list of supporting signals for the forecast.
+    confidence : str
+        Optional self-reported confidence label.
+    """
+
+    modality: ClassVar[Literal["continuous", "discrete", "categorical"]] = "categorical"
+
+    model_config = {"extra": "ignore"}
+
+    probabilities: list[AgentCategoryProbability] = Field(
+        description="One {label, probability} entry per task category."
+    )
+    reasoning: str = Field(default="", description="Optional explanation for the distribution.")
+    key_signals: list[str] = Field(default_factory=list, description="Key signals supporting the estimate.")
+    confidence: str = Field(default="", description="Optional self-reported confidence: high, medium, or low.")
+
+    @model_validator(mode="after")
+    def _labels_are_unique(self) -> "CategoricalAgentForecastOutput":
+        """Reject empty distributions and duplicate labels before conversion."""
+        if not self.probabilities:
+            raise ValueError("probabilities must contain at least one category entry.")
+        labels = [row.label for row in self.probabilities]
+        duplicates = sorted({label for label in labels if labels.count(label) > 1})
+        if duplicates:
+            raise ValueError(f"Duplicate category labels are not allowed: {duplicates}")
+        return self
+
+    @classmethod
+    def prompt_schema_json(cls, labels: list[str] | None = None) -> str:
+        """Return a JSON template for use in agent instruction strings.
+
+        Parameters
+        ----------
+        labels : list[str], optional
+            Category labels to render in the template, in task order. When
+            given, the template shows one concrete entry per label; otherwise
+            it shows generic placeholders.
+
+        Returns
+        -------
+        str
+            Indented JSON string showing the exact structure the agent must
+            pass to ``set_model_response``.
+        """
+        if labels:
+            entries: list[dict[str, object]] = [
+                {"label": label, "probability": "<float in [0, 1]>"} for label in labels
+            ]
+        else:
+            entries = [{"label": "<category label from the task>", "probability": "<float in [0, 1]>"}]
+        template: dict[str, object] = {
+            "probabilities": entries,
+            "reasoning": "<string>",
+            "key_signals": ["<signal 1>", "<signal 2>"],
+            "confidence": "<'high' | 'medium' | 'low'>",
+        }
+        return json.dumps(template, indent=2)
+
+    def to_predictions(
+        self,
+        *,
+        task: ForecastingTask,
+        context: ForecastContext,
+        predictor_id: str,
+        metadata: dict[str, Any] | None = None,
+    ) -> list[Prediction]:
+        """Convert agent output to a single categorical :class:`Prediction`.
+
+        Raises
+        ------
+        ValueError
+            If the task is not a single-horizon categorical task, if the
+            output labels do not exactly match ``task.categories``, or if the
+            probabilities sum outside
+            ``1 +/- CATEGORICAL_RENORMALIZATION_TOLERANCE``.
+        """
+        if task.payload_type != "categorical" or task.categories is None:
+            raise ValueError(
+                f"Categorical agent output requires a categorical task with declared categories; "
+                f"task '{task.task_id}' declares payload_type='{task.payload_type}'."
+            )
+        if len(task.horizons) != 1:
+            raise ValueError("Categorical agent output expects exactly one task horizon.")
+
+        by_label = {row.label: row.probability for row in self.probabilities}
+        expected = {category.label for category in task.categories}
+        actual = set(by_label)
+        if actual != expected:
+            missing = sorted(expected - actual)
+            extra = sorted(actual - expected)
+            raise ValueError(
+                f"Categorical agent output must contain exactly the task category labels. "
+                f"Missing: {missing}; extra: {extra}."
+            )
+
+        raw_sum = sum(by_label.values())
+        if abs(raw_sum - 1.0) > CATEGORICAL_RENORMALIZATION_TOLERANCE or raw_sum <= 0.0:
+            raise ValueError(
+                f"Categorical agent probabilities sum to {raw_sum}, outside the renormalization "
+                f"tolerance of 1 +/- {CATEGORICAL_RENORMALIZATION_TOLERANCE}."
+            )
+        probabilities = {category.label: by_label[category.label] / raw_sum for category in task.categories}
+
+        horizon = task.horizons[0]
+        issued_at = datetime.utcnow()  # naive UTC; Prediction.issued_at expects timezone-naive
+        offset = pd.tseries.frequencies.to_offset(task.frequency)
+        prediction_metadata: dict[str, Any] = dict(metadata) if metadata is not None else {}
+        if self.reasoning.strip():
+            prediction_metadata["rationale"] = self.reasoning
+        if self.key_signals:
+            prediction_metadata["key_signals"] = list(self.key_signals)
+        if self.confidence.strip():
+            prediction_metadata["confidence"] = self.confidence
+        if not isclose(raw_sum, 1.0, abs_tol=1e-9):
+            prediction_metadata["probability_sum_raw"] = raw_sum
+
+        return [
+            Prediction(
+                predictor_id=predictor_id,
+                task_id=task.task_id,
+                issued_at=issued_at,
+                as_of=context.as_of,
+                forecast_date=(pd.Timestamp(context.as_of) + offset * horizon).to_pydatetime(),
+                payload=CategoricalForecast(probabilities=probabilities),
+                metadata=prediction_metadata,
+            )
+        ]
+```
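A minimal, stdlib-only sketch of the renormalization rule that `CategoricalAgentForecastOutput.to_predictions` applies in the snapshot above: sums within the tolerance of 1 are rescaled, sums further off raise. The `renormalize` helper, the labels, and the sample probabilities are hypothetical; only the tolerance value and the arithmetic are taken from the source.

```python
from math import isclose

TOLERANCE = 0.05  # mirrors CATEGORICAL_RENORMALIZATION_TOLERANCE in the snapshot


def renormalize(by_label: dict[str, float], order: list[str]) -> tuple[dict[str, float], float]:
    """Return probabilities rescaled to sum to 1 in ``order``, plus the raw sum."""
    if set(by_label) != set(order):
        raise ValueError("labels must match the declared categories exactly")
    raw_sum = sum(by_label.values())
    if abs(raw_sum - 1.0) > TOLERANCE or raw_sum <= 0.0:
        raise ValueError(f"probabilities sum to {raw_sum}, outside 1 +/- {TOLERANCE}")
    return {label: by_label[label] / raw_sum for label in order}, raw_sum


# Hypothetical labels and values; the total is about 0.99.
probs, raw = renormalize({"down": 0.20, "flat": 0.50, "up": 0.29}, ["down", "flat", "up"])
print(probs, raw)
```

Returning the raw sum alongside the rescaled values mirrors how the source records `probability_sum_raw` in the prediction metadata whenever renormalization changed anything.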
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__predictor.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__predictor.py.md
new file mode 100644
index 0000000..86401bc
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__predictor.py.md
@@ -0,0 +1,323 @@
+# Source: aieng-forecasting/aieng/forecasting/methods/agentic/predictor.py
+
+kind: python
+
+```python
+"""Predictor that uses an ADK agent for forecasting.
+
+This module provides :class:`AgentPredictor`, the agentic
+:class:`~aieng.forecasting.evaluation.predictor.Predictor` that drives an
+ADK agent through an
+:class:`~aieng.forecasting.methods.agentic.adk_runner.AdkTextRunner`,
+parses the agent's structured JSON response against an
+:class:`~aieng.forecasting.methods.agentic.outputs.AgentForecastOutput`
+schema, and converts it into evaluation
+:class:`~aieng.forecasting.evaluation.prediction.Prediction` objects.
+
+It also defines the :class:`ForecastPromptBuilder` ``Protocol`` that
+task-specific prompt builders must satisfy.
+
+This module requires the ``agentic`` extra; importing it without the extra
+raises :class:`ImportError`.
+"""
+
+import asyncio
+import json
+import logging
+import threading
+from collections.abc import Coroutine
+from typing import Any, Protocol, TypeVar, cast
+
+from aieng.forecasting.data.context import ForecastContext
+from aieng.forecasting.evaluation.langfuse_traces import stamp_forecast_on_trace
+from aieng.forecasting.evaluation.prediction import Prediction
+from aieng.forecasting.evaluation.predictor import Predictor
+from aieng.forecasting.evaluation.task import ForecastingTask
+from aieng.forecasting.methods.agentic.adk_runner import AdkTextRunner, AdkTextRunnerConfig
+from aieng.forecasting.methods.agentic.agent_factory import AgentConfig, build_adk_agent
+from aieng.forecasting.methods.agentic.outputs import AgentForecastOutput
+from aieng.forecasting.methods.llm_processes._client import strip_markdown_fence, trace_url_for
+from google.adk.agents.base_agent import BaseAgent
+from pydantic import ValidationError
+
+
+logger: logging.Logger = logging.getLogger(__name__)
+T = TypeVar("T")
+
+
+def _run_coroutine_sync(coro: Coroutine[Any, Any, T]) -> T:
+    """Run an async coroutine from the sync ``Predictor`` interface.
+
+    If no event loop is running on the current thread, the coroutine is
+    executed via :func:`asyncio.run`. If a loop is already running (e.g.
+    inside a Jupyter notebook), the coroutine is executed on a fresh loop
+    in a daemon thread so the caller's loop is not disturbed.
+    """
+    try:
+        asyncio.get_running_loop()
+    except RuntimeError:
+        return asyncio.run(coro)
+
+    result: T | None = None
+    error: BaseException | None = None
+
+    def run_in_thread() -> None:
+        nonlocal error, result
+        loop = asyncio.new_event_loop()
+        try:
+            asyncio.set_event_loop(loop)
+            result = loop.run_until_complete(coro)
+        except BaseException as exc:  # pragma: no cover - defensive thread boundary
+            error = exc
+        finally:
+            # Cancel and drain any background tasks (e.g. LiteLLM's LoggingWorker)
+            # before closing the loop.  Without this, Python emits
+            # "Task was destroyed but it is pending!" warnings for every run.
+            try:
+                pending = asyncio.all_tasks(loop)
+                if pending:
+                    for task in pending:
+                        task.cancel()
+                    loop.run_until_complete(asyncio.gather(*pending, return_exceptions=True))
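+            except Exception:
+                # Best-effort cleanup: a failure while draining tasks must not
+                # mask the coroutine's own result or error.
+                pass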
+            finally:
+                loop.close()
+
+    thread = threading.Thread(target=run_in_thread, daemon=True)
+    thread.start()
+    thread.join()
+    if error is not None:
+        raise error
+    return cast("T", result)
+
+
+class ForecastPromptBuilder(Protocol):
+    """Protocol for building prompts for forecasting agents.
+
+    Implementations return the prompt text used to invoke the ADK agent for
+    one ``(task, context)`` pair.
+    """
+
+    def __call__(self, *, task: ForecastingTask, context: ForecastContext) -> str:
+        """Build the prompt for the forecasting agent.
+
+        Parameters
+        ----------
+        task : ForecastingTask
+            Defines the prediction problem: target series, horizon(s),
+            frequency, and resolution logic. The builder must not modify
+            the task.
+        context : ForecastContext
+            The information state available at forecast time. All calls to
+            ``context.get_series()`` are automatically filtered to
+            ``context.as_of``, so the builder cannot accidentally access
+            future data from the series store.
+
+        Returns
+        -------
+        str
+            The prompt for the forecasting agent.
+        """
+        ...
+
+
+class AgentPredictor(Predictor):
+    """Predictor that drives an ADK agent to produce forecasts.
+
+    On each :meth:`predict` call, the predictor:
+
+    1. Builds a prompt with ``prompt_builder(task=task, context=context)``.
+    2. Runs the prompt through the ADK runner (synchronously, even from
+       inside a running event loop).
+    3. Validates the agent's JSON response against ``output_schema``.
+    4. Converts the validated output to a list of
+       :class:`~aieng.forecasting.evaluation.prediction.Prediction` via
+       :meth:`AgentForecastOutput.to_predictions`.
+
+    Conversion errors are logged and surfaced as an empty prediction list
+    so a single bad agent response does not abort a backtest loop. Schema
+    validation errors are *not* swallowed.
+
+    The ``output_schema`` is separate from ``agent_config`` by design:
+    ``AgentConfig`` captures the agent's *identity* (instruction, model,
+    skills), while ``output_schema`` declares the agent's *role* in a
+    specific experiment. The same config can be used to build a free-form
+    interactive analyst (via :func:`build_adk_agent` with no schema) or
+    wired into different predictors with different output contracts.
+
+    Parameters
+    ----------
+    agent_config : AgentConfig
+        Configuration for the underlying ADK agent — instruction, model,
+        skills, and capability toggles. The output format is *not* part
+        of the agent config; it is declared via ``output_schema``.
+    prompt_builder : ForecastPromptBuilder
+        Callable that produces the prompt text for one ``(task, context)``
+        pair. See :class:`ForecastPromptBuilder` for the contract.
+    output_schema : type[AgentForecastOutput]
+        Structured output schema the agent must satisfy. The forecast
+        modality is derived from ``output_schema.modality``. Supplied at
+        predictor instantiation time so the same agent config can be reused
+        with different schemas or in interactive (schema-free) mode.
+    enable_langfuse_tracing : bool, optional
+        Whether to wrap each turn in Langfuse ``propagate_attributes``.
+        ``None`` (default) auto-detects: enabled when the ``langfuse``
+        package is importable, disabled otherwise. Ignored when ``runner``
+        is supplied — the supplied runner's tracing config takes precedence.
+    runner : AdkTextRunner, optional
+        Custom runner to use. When ``None`` (default), the predictor
+        builds its own ADK agent and runner from ``agent_config``. Supply
+        a runner for tests (with a stub agent) or to share one runner
+        across predictors.
+
+    Examples
+    --------
+    >>> from aieng.forecasting.methods.agentic import (
+    ...     AgentConfig,
+    ...     AgentPredictor,
+    ...     ContinuousAgentForecastOutput,
+    ... )
+    >>> predictor = AgentPredictor(
+    ...     AgentConfig(instruction="Forecast the supplied series."),
+    ...     my_prompt_builder,
+    ...     output_schema=ContinuousAgentForecastOutput,
+    ... )
+    >>> predictions = predictor.predict(task, context)
+    """
+
+    def __init__(
+        self,
+        agent_config: AgentConfig,
+        prompt_builder: ForecastPromptBuilder,
+        *,
+        output_schema: type[AgentForecastOutput],
+        enable_langfuse_tracing: bool | None = None,
+        runner: AdkTextRunner | None = None,
+    ) -> None:
+        """Store the schema, derive the modality, and build or accept a runner."""
+        if enable_langfuse_tracing is None:
+            # Auto-detect: enable Langfuse tracing iff the package is importable.
+            try:
+                import langfuse  # noqa: F401, PLC0415
+
+                enable_langfuse_tracing = True
+            except ModuleNotFoundError:
+                enable_langfuse_tracing = False
+
+        self.prompt_builder = prompt_builder
+        self.agent_config = agent_config
+        self.output_schema: type[AgentForecastOutput] = output_schema
+        self.enable_langfuse_tracing = enable_langfuse_tracing
+
+        self._forecast_output_modality = output_schema.modality
+
+        if runner is None:
+            built_agent = build_adk_agent(agent_config, output_schema=output_schema)
+            self._agent: BaseAgent = built_agent
+            self._runner = AdkTextRunner(
+                agent=built_agent,
+                config=AdkTextRunnerConfig(
+                    app_name="agentic_forecasting_predictor",
+                    default_user_id="forecasting_agent",
+                    fresh_session_per_message=True,
+                    enable_langfuse_tracing=self.enable_langfuse_tracing,
+                    langfuse_tags=["agent_predictor", "track1"],
+                    langfuse_trace_name=self.predictor_id,
+                    langfuse_propagate_metadata={
+                        "predictor_id": self.predictor_id,
+                        "agent_name": built_agent.name,
+                        "model": str(built_agent.model),
+                        "output_modality": self._forecast_output_modality,
+                    },
+                ),
+            )
+        else:
+            self._runner = runner
+            self._agent = runner.agent
+
+    @property
+    def predictor_id(self) -> str:
+        """Stable identifier for this predictor.
+
+        This is used to identify the predictor in the evaluation results.
+        """
+        model = getattr(self._agent, "model", None)
+        model_suffix = f"_{model}" if isinstance(model, str) else ""
+        return f"agent_predictor_{self._agent.name}{model_suffix}_{self._forecast_output_modality}"
+
+    def predict(self, task: ForecastingTask, context: ForecastContext) -> list[Prediction]:
+        """Produce probabilistic forecasts for the given task and context.
+
+        Parameters
+        ----------
+        task : ForecastingTask
+            Defines the prediction problem — target series, horizon(s),
+            frequency, and resolution logic. The predictor must not modify
+            the task.
+        context : ForecastContext
+            The information state available at forecast time. All calls to
+            ``context.get_series()`` are automatically filtered to
+            ``context.as_of`` — the predictor cannot accidentally access
+            future data from the series store.
+
+        Returns
+        -------
+        list[Prediction]
+            One ``Prediction`` per horizon step in ``task.horizons``, each
+            with ``as_of = context.as_of`` and ``forecast_date`` set to the
+            corresponding step ahead of the origin. An empty list is
+            returned when the agent's structured output cannot be
+            converted to predictions (the error is logged); schema
+            validation errors on the agent's JSON are not swallowed.
+        """
+        prompt = self.prompt_builder(task=task, context=context)
+        output_str = _run_coroutine_sync(self._runner.run_text_async(prompt))
+
+        # Normalise: strip markdown fences before validation so any model can
+        # be swapped in without breaking the parse layer.
+        output_str = strip_markdown_fence(output_str)
+
+        # Validate the output against the output schema; tolerate JSON
+        # responses that ``model_validate_json`` cannot parse but
+        # ``json.loads`` + ``model_validate`` can.
+        try:
+            output = self.output_schema.model_validate_json(output_str)
+        except ValidationError:
+            try:
+                output = self.output_schema.model_validate(json.loads(output_str))
+            except Exception:
+                logger.warning("Raw agent response (schema validation failed):\n%s", output_str)
+                raise
+
+        # Convert output to list of predictions
+        try:
+            predictions = output.to_predictions(
+                task=task,
+                context=context,
+                predictor_id=self.predictor_id,
+            )
+        except Exception as e:
+            # Log the error and return an empty list of predictions
+            logger.error("Error converting output to list of predictions: %s", e)
+            return []
+
+        # Link each prediction back to its Langfuse trace so side-channel
+        # evaluators can attach scores. The agent runs on a worker event loop whose
+        # trace context isn't active here, so use the id the runner captured during
+        # the run (not the current context, which is empty on this thread).
+        trace_id = self._runner.last_trace_id
+        if trace_id is not None:
+            trace_url = trace_url_for(trace_id)
+            for prediction in predictions:
+                prediction.metadata.setdefault("langfuse_trace_id", trace_id)
+                if trace_url is not None:
+                    prediction.metadata.setdefault("langfuse_trace_url", trace_url)
+
+            # Make the trace the canonical record for rationale evaluation: stamp the
+            # structured forecast onto that trace (post-hoc, by id) so the evaluator
+            # reads the rationale + distribution from Langfuse, not from a cached run.
+            stamp_forecast_on_trace(predictions, trace_id=trace_id)
+
+        return predictions
+```
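Review note: the sync-over-async bridge at the top of this file can be sketched in isolation. This is a minimal sketch, not the module's implementation. `run_sync`, `double`, and `caller` are hypothetical names, and the worker uses `asyncio.run` (which cancels leftover tasks on exit) in place of the manual cancel-and-drain cleanup shown in the diff.

```python
import asyncio
import threading
from typing import Any, Coroutine, TypeVar

T = TypeVar("T")


def run_sync(coro: Coroutine[Any, Any, T]) -> T:
    """Run ``coro`` to completion from synchronous code (hypothetical helper)."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running on this thread, so asyncio.run is safe.
        return asyncio.run(coro)

    # A loop is already running here; run the coroutine on a fresh loop
    # owned by a separate thread and hand the outcome back through a dict.
    box: dict[str, Any] = {}

    def worker() -> None:
        try:
            box["result"] = asyncio.run(coro)
        except BaseException as exc:  # re-raised on the calling thread below
            box["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join()  # blocks the caller (and its loop) until the worker finishes
    if "error" in box:
        raise box["error"]
    return box["result"]


async def double(x: int) -> int:
    await asyncio.sleep(0)
    return 2 * x


async def caller() -> int:
    # Inside a running loop, a direct asyncio.run(...) would raise RuntimeError.
    return run_sync(double(21))


from_sync = run_sync(double(2))
from_loop = asyncio.run(caller())
```

The trade-off is that the calling loop is blocked while the worker thread runs, which is acceptable for a notebook or backtest loop but not for a latency-sensitive server.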
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__baselines____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__baselines____init__.py.md
new file mode 100644
index 0000000..289e10f
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__baselines____init__.py.md
@@ -0,0 +1,18 @@
+# Source: aieng-forecasting/aieng/forecasting/methods/baselines/__init__.py
+
+kind: python
+
+```python
+"""Baseline predictor implementations.
+
+Baselines provide fast, low-dependency reference points that every more complex
+predictor should be compared against.
+"""
+
+from .categorical_frequency import CategoricalFrequencyPredictor
+from .historical_frequency import HistoricalFrequencyPredictor
+from .naive import LastValuePredictor
+
+
+__all__ = ["CategoricalFrequencyPredictor", "HistoricalFrequencyPredictor", "LastValuePredictor"]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__baselines__categorical_frequency.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__baselines__categorical_frequency.py.md
new file mode 100644
index 0000000..75a6f81
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__baselines__categorical_frequency.py.md
@@ -0,0 +1,136 @@
+# Source: aieng-forecasting/aieng/forecasting/methods/baselines/categorical_frequency.py
+
+kind: python
+
+```python
+"""Categorical-frequency predictor — the floor baseline for ordinal tasks.
+
+``CategoricalFrequencyPredictor`` predicts each ordered category with the
+probability it has occurred historically (the climatological category
+distribution). It is the categorical counterpart of
+:class:`~aieng.forecasting.methods.baselines.historical_frequency.HistoricalFrequencyPredictor`:
+zero modelling, pure persistence of the empirical distribution.
+
+Unseen categories receive probability 0. Run this first on any new
+ordered-categorical task; every conditioned model should beat this floor
+baseline on RPS.
+
+Usage::
+
+    from aieng.forecasting.methods import CategoricalFrequencyPredictor
+    from aieng.forecasting.evaluation import backtest, BacktestSpec
+
+    predictor = CategoricalFrequencyPredictor()
+    result = backtest(predictor=predictor, spec=spec, data_service=svc)
+    print(f"Category-frequency mean RPS: {result.mean_score:.4f}")  # must be beaten
+"""
+
+from __future__ import annotations
+
+import math
+from datetime import datetime, timezone
+
+import pandas as pd
+from aieng.forecasting.data.context import ForecastContext
+from aieng.forecasting.evaluation.prediction import CategoricalForecast, Prediction
+from aieng.forecasting.evaluation.predictor import Predictor
+from aieng.forecasting.evaluation.task import ForecastingTask, TaskCategory
+
+
+class CategoricalFrequencyPredictor(Predictor):
+    """Categorical baseline: forecast the empirical category frequencies.
+
+    The target series must store one value per resolution opportunity, with
+    every observed value matching one of ``task.categories``. The predicted
+    probabilities are raw empirical frequencies from the cutoff-filtered
+    history, optionally restricted to a trailing window. There is no smoothing:
+    categories absent from the history receive probability 0.
+
+    Parameters
+    ----------
+    window : int or None
+        If set, only the last ``window`` observations are used to compute the
+        category frequencies, making the baseline responsive to slow regime
+        change. ``None`` uses the full history.
+    """
+
+    def __init__(self, window: int | None = None) -> None:
+        if window is not None and window < 1:
+            raise ValueError(f"window must be a positive integer or None; got {window}")
+        self._window = window
+
+    @property
+    def predictor_id(self) -> str:
+        """Return a stable identifier for this predictor."""
+        if self._window is not None:
+            return f"categorical_frequency_w{self._window}"
+        return "categorical_frequency"
+
+    def predict(self, task: ForecastingTask, context: ForecastContext) -> list[Prediction]:
+        """Produce category-frequency forecasts for the task's single horizon.
+
+        Raises
+        ------
+        ValueError
+            If the task does not declare ``payload_type='categorical'``, does
+            not have exactly one horizon, or defines no categories; or if the
+            history is empty or contains a value matching no task category.
+        """
+        if task.payload_type != "categorical":
+            raise ValueError(
+                f"{type(self).__name__} requires a categorical task (payload_type='categorical'); "
+                f"task '{task.task_id}' declares payload_type='{task.payload_type}'."
+            )
+        if len(task.horizons) != 1:
+            raise ValueError(f"{type(self).__name__} requires exactly one horizon; got {task.horizons}.")
+        if task.categories is None:
+            raise ValueError(f"Categorical task '{task.task_id}' must define categories.")
+
+        series_df = context.get_series(task.target_series_id)
+        if series_df.empty:
+            raise ValueError(f"History for '{task.target_series_id}' is empty at as_of={context.as_of}.")
+
+        values = series_df["value"].astype(float)
+        if self._window is not None:
+            values = values.tail(self._window)
+        if values.empty:
+            raise ValueError(f"History for '{task.target_series_id}' is empty after applying window={self._window}.")
+
+        counts = {category.label: 0 for category in task.categories}
+        for observed in values:
+            category = _matching_category(float(observed), task.categories)
+            if category is None:
+                allowed = [category.value for category in task.categories]
+                raise ValueError(
+                    f"Target series '{task.target_series_id}' contains value {float(observed)} that does not "
+                    f"match any task category value. Allowed values: {allowed}."
+                )
+            counts[category.label] += 1
+
+        n_observations = int(len(values))
+        probabilities = {label: count / n_observations for label, count in counts.items()}
+        payload = CategoricalForecast(probabilities=probabilities)
+        offset = pd.tseries.frequencies.to_offset(task.frequency)
+        issued_at = datetime.now(tz=timezone.utc).replace(tzinfo=None)
+        horizon = task.horizons[0]
+
+        return [
+            Prediction(
+                predictor_id=self.predictor_id,
+                task_id=task.task_id,
+                issued_at=issued_at,
+                as_of=context.as_of,
+                forecast_date=(pd.Timestamp(context.as_of) + offset * horizon).to_pydatetime(),
+                payload=payload,
+                metadata={"n_observations": n_observations, "window": self._window},
+            )
+        ]
+
+
+def _matching_category(value: float, categories: list[TaskCategory]) -> TaskCategory | None:
+    """Return the task category whose series value matches ``value``."""
+    for category in categories:
+        if math.isclose(value, category.value, abs_tol=1e-9):
+            return category
+    return None
+```
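Review note: the counting logic in `predict` can be illustrated without pandas or the task classes. This is a sketch under stated assumptions: categories are modelled as hypothetical `(label, value)` tuples standing in for `TaskCategory`, the history is a plain list, and the category labels and values are invented for illustration.

```python
import math


def category_frequencies(values, categories, window=None):
    """Empirical frequency per category label, with no smoothing.

    ``categories`` is a list of (label, value) pairs, a stand-in for the
    task's category objects. ``window`` keeps only the trailing observations.
    """
    if window is not None:
        values = values[-window:]
    if not values:
        raise ValueError("history is empty")

    counts = {label: 0 for label, _ in categories}
    for observed in values:
        # Match on value with a small absolute tolerance, as the diff does.
        label = next(
            (lab for lab, val in categories if math.isclose(observed, val, abs_tol=1e-9)),
            None,
        )
        if label is None:
            raise ValueError(f"value {observed} matches no category")
        counts[label] += 1

    n = len(values)
    return {label: count / n for label, count in counts.items()}


categories = [("cut", -1.0), ("hold", 0.0), ("hike", 1.0)]
history = [1.0, 0.0, 0.0, 0.0, -1.0, 0.0, -1.0, -1.0]

full = category_frequencies(history, categories)
recent = category_frequencies(history, categories, window=4)
```

With the full history the single early hike still carries weight; with `window=4` it drops out and "hike" receives probability 0, which is the no-smoothing behaviour the class docstring describes.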
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__baselines__historical_frequency.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__baselines__historical_frequency.py.md
new file mode 100644
index 0000000..9d63ae5
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__baselines__historical_frequency.py.md
@@ -0,0 +1,113 @@
+# Source: aieng-forecasting/aieng/forecasting/methods/baselines/historical_frequency.py
+
+kind: python
+
+```python
+"""Historical-frequency predictor — the floor baseline for binary-event tasks.
+
+``HistoricalFrequencyPredictor`` predicts that a binary event occurs with the
+probability it has occurred historically (the climatological base rate). It is
+the binary counterpart of
+:class:`~aieng.forecasting.methods.baselines.naive.LastValuePredictor`: zero
+modelling, pure persistence of the empirical distribution.
+
+A constant base-rate forecast is surprisingly hard to beat on Brier score for
+rare or regime-driven events — any model that reacts to conditions must react
+*correctly* to win. Run this first on any new binary task; every other
+predictor should beat it.
+
+Usage::
+
+    from aieng.forecasting.methods import HistoricalFrequencyPredictor
+    from aieng.forecasting.evaluation import backtest, BacktestSpec
+
+    predictor = HistoricalFrequencyPredictor()
+    result = backtest(predictor=predictor, spec=spec, data_service=svc)
+    print(f"Base-rate mean Brier: {result.mean_score:.4f}")  # must be beaten
+"""
+
+from __future__ import annotations
+
+from datetime import datetime, timezone
+
+import pandas as pd
+from aieng.forecasting.data.context import ForecastContext
+from aieng.forecasting.evaluation.prediction import BinaryForecast, Prediction
+from aieng.forecasting.evaluation.predictor import Predictor
+from aieng.forecasting.evaluation.task import ForecastingTask
+
+
+class HistoricalFrequencyPredictor(Predictor):
+    """Binary baseline: forecast the empirical event frequency as the probability.
+
+    The target series must be a 0/1 event series (one row per resolution
+    opportunity, e.g. one row per central-bank meeting). The predicted
+    probability is the mean of the cutoff-filtered history, optionally
+    restricted to a trailing window.
+
+    Parameters
+    ----------
+    window : int or None
+        If set, only the last ``window`` observations are used to compute the
+        base rate, making the baseline responsive to slow regime change
+        (e.g. "share of cuts in the last 16 meetings" rather than all-time).
+        ``None`` uses the full history.
+    """
+
+    def __init__(self, window: int | None = None) -> None:
+        if window is not None and window < 1:
+            raise ValueError(f"window must be a positive integer or None; got {window}")
+        self._window = window
+
+    @property
+    def predictor_id(self) -> str:
+        """Return a stable identifier for this predictor."""
+        if self._window is not None:
+            return f"historical_frequency_w{self._window}"
+        return "historical_frequency"
+
+    def predict(self, task: ForecastingTask, context: ForecastContext) -> list[Prediction]:
+        """Produce base-rate probability forecasts for every horizon in the task.
+
+        Raises
+        ------
+        ValueError
+            If the task does not declare ``payload_type='binary'``, or if the
+            cutoff-filtered history is empty or contains non-0/1 values.
+        """
+        if task.payload_type != "binary":
+            raise ValueError(
+                f"{type(self).__name__} requires a binary task (payload_type='binary'); "
+                f"task '{task.task_id}' declares payload_type='{task.payload_type}'."
+            )
+
+        series_df = context.get_series(task.target_series_id)
+        if series_df.empty:
+            raise ValueError(f"History for '{task.target_series_id}' is empty at as_of={context.as_of}.")
+
+        values = series_df["value"].astype(float)
+        if not values.isin([0.0, 1.0]).all():
+            bad = sorted(set(values[~values.isin([0.0, 1.0])]))
+            raise ValueError(f"Target series '{task.target_series_id}' must be a 0/1 event series; found values {bad}.")
+
+        if self._window is not None:
+            values = values.tail(self._window)
+        base_rate = float(values.mean())
+
+        payload = BinaryForecast(probability=base_rate)
+        offset = pd.tseries.frequencies.to_offset(task.frequency)
+        issued_at = datetime.now(tz=timezone.utc).replace(tzinfo=None)
+
+        return [
+            Prediction(
+                predictor_id=self.predictor_id,
+                task_id=task.task_id,
+                issued_at=issued_at,
+                as_of=context.as_of,
+                forecast_date=(pd.Timestamp(context.as_of) + offset * h).to_pydatetime(),
+                payload=payload,
+                metadata={"n_observations": int(len(values)), "window": self._window},
+            )
+            for h in task.horizons
+        ]
+```
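Review note: the module docstring's claim that a constant base rate is hard to beat on Brier score can be checked with a small worked example. The data below is invented for illustration, and `base_rate` and `brier` are hypothetical helpers, not part of the package.

```python
def base_rate(history, window=None):
    """Mean of a 0/1 event history, optionally over a trailing window."""
    if window is not None:
        history = history[-window:]
    return sum(history) / len(history)


def brier(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(outcomes)


# 16 past meetings with 2 events: an all-time base rate of 0.125.
history = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
rate_all = base_rate(history)
rate_recent = base_rate(history, window=4)

# 8 future meetings with 1 event.
outcomes = [0, 0, 0, 1, 0, 0, 0, 0]

# Constant base-rate forecast versus a model that reacts, but twice wrongly.
constant = [rate_all] * len(outcomes)
reactive = [0.05, 0.05, 0.6, 0.3, 0.05, 0.05, 0.6, 0.05]

brier_constant = brier(constant, outcomes)
brier_reactive = brier(reactive, outcomes)
```

Here the reactive model is sharper on most meetings, yet two confident false alarms cost more than the constant forecast loses everywhere else, so the base rate scores better.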
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__baselines__naive.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__baselines__naive.py.md
new file mode 100644
index 0000000..5a03df9
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__baselines__naive.py.md
@@ -0,0 +1,135 @@
+# Source: aieng-forecasting/aieng/forecasting/methods/baselines/naive.py
+
+kind: python
+
+```python
+"""Naive last-value predictor — the floor baseline for any continuous forecasting task.
+
+``LastValuePredictor`` predicts that the next observation will equal the most
+recently observed value, with no uncertainty spread (all quantiles equal the
+point forecast). It is task-agnostic and applies to any ``ForecastingTask``
+with a continuous series target.
+
+Use this as:
+
+1. **A performance floor.** Run it first on any new task. Every other predictor
+   should beat it. If yours doesn't, something is wrong with your model.
+
+2. **A readable reference implementation.** The code is annotated step-by-step
+   to show exactly how to satisfy the ``Predictor`` ABC — what fields are
+   required, how to compute ``forecast_date``, and how to construct a
+   ``Prediction``. Copy the structure and replace the forecast logic.
+
+Usage::
+
+    from aieng.forecasting.methods.baselines import LastValuePredictor
+    from aieng.forecasting.evaluation import backtest, BacktestSpec
+
+    result = backtest(predictor=LastValuePredictor(), spec=spec, data_service=svc)
+    print(f"Naive mean CRPS: {result.mean_score:.4f}")  # your model must beat this
+"""
+
+from __future__ import annotations
+
+from datetime import datetime, timezone
+
+import pandas as pd
+from aieng.forecasting.data.context import ForecastContext
+from aieng.forecasting.evaluation.prediction import STANDARD_QUANTILES, ContinuousForecast, Prediction
+from aieng.forecasting.evaluation.predictor import Predictor
+from aieng.forecasting.evaluation.task import ForecastingTask
+
+
+class LastValuePredictor(Predictor):
+    """Naive baseline: forecast the most recently observed value at all quantiles.
+
+    All quantile levels receive the same value as the point forecast, producing
+    a degenerate distribution with zero spread. Such a forecast is poorly
+    calibrated by construction: a well-calibrated model spreads its quantiles
+    to reflect genuine uncertainty.
+
+    For multi-horizon tasks (``len(task.horizons) > 1``), the same last value
+    is carried forward as a flat forecast for every requested step — equivalent
+    to the "persistence" or "random-walk" assumption.
+
+    Parameters
+    ----------
+    None
+    """
+
+    # ------------------------------------------------------------------
+    # Step 1: give your predictor a stable string ID.
+    # This appears in BacktestResult and every Prediction record,
+    # so changing it mid-experiment will break comparisons.
+    # ------------------------------------------------------------------
+    @property
+    def predictor_id(self) -> str:
+        """Return a stable identifier for this predictor."""
+        return "last_value_naive"
+
+    # ------------------------------------------------------------------
+    # Step 2: implement predict().
+    #
+    # Arguments:
+    #   task    — ForecastingTask: defines the problem (target series,
+    #             horizons, frequency). Read-only; do not modify it.
+    #   context — ForecastContext: your data access object. All series
+    #             returned by context.get_series() are already filtered
+    #             to context.as_of — you cannot accidentally access
+    #             future data.
+    #
+    # Return:
+    #   list[Prediction] — one per horizon step in task.horizons.
+    # ------------------------------------------------------------------
+    def predict(self, task: ForecastingTask, context: ForecastContext) -> list[Prediction]:
+        """Produce last-value naive forecasts for every horizon in the task."""
+        # ------------------------------------------------------------------
+        # Step 3: fetch the target series.
+        # Returns a DataFrame with columns: timestamp, value, released_at.
+        # Rows are already cut off at context.as_of.
+        # ------------------------------------------------------------------
+        series_df = context.get_series(task.target_series_id)
+
+        # ------------------------------------------------------------------
+        # Step 4: produce a forecast.
+        # Replace everything below with your model logic.
+        # Here we just take the last observed value as the point forecast.
+        # ------------------------------------------------------------------
+        last_value = float(series_df["value"].iloc[-1])
+
+        # ------------------------------------------------------------------
+        # Step 5: build the ContinuousForecast payload.
+        # point_forecast: your central estimate (typically median).
+        # quantiles: a dict mapping quantile level → forecast value.
+        #   STANDARD_QUANTILES = [0.05, 0.10, ..., 0.90, 0.95]  # noqa: ERA001
+        #   The evaluation engine uses these to compute CRPS.
+        #   A naive predictor with no uncertainty puts the same value
+        #   at every quantile — real models spread them out.
+        # ------------------------------------------------------------------
+        payload = ContinuousForecast(
+            point_forecast=last_value,
+            quantiles=dict.fromkeys(STANDARD_QUANTILES, last_value),
+        )
+
+        # ------------------------------------------------------------------
+        # Step 6: build one Prediction per requested horizon.
+        # task.horizons is a list of integer steps (e.g. [18] or [6..17]).
+        # For each step h, the forecast date is as_of + h × frequency.
+        # The harness uses each forecast_date to look up the ground-truth
+        # observation and score the prediction.
+        # ------------------------------------------------------------------
+        offset = pd.tseries.frequencies.to_offset(task.frequency)
+        issued_at = datetime.now(tz=timezone.utc).replace(tzinfo=None)
+
+        return [
+            Prediction(
+                predictor_id=self.predictor_id,
+                task_id=task.task_id,
+                issued_at=issued_at,
+                as_of=context.as_of,
+                forecast_date=(pd.Timestamp(context.as_of) + offset * h).to_pydatetime(),
+                payload=payload,
+            )
+            for h in task.horizons
+        ]
+```
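Review note: a short sketch of what the zero-spread payload implies for scoring. The quantile grid below is an assumption based on the comment in Step 5 (0.05 to 0.95 in steps of 0.05), and the pinball-loss helpers are hypothetical. How the package itself computes CRPS is not shown in this diff.

```python
# Assumed grid, mirroring the "[0.05, 0.10, ..., 0.90, 0.95]" comment above.
QUANTILES = [round(0.05 * k, 2) for k in range(1, 20)]


def pinball(q, forecast, observed):
    """Quantile (pinball) loss for one quantile level."""
    diff = observed - forecast
    return max(q * diff, (q - 1) * diff)


def mean_pinball(quantiles, observed):
    """Average pinball loss over a {level: forecast} mapping."""
    return sum(pinball(q, f, observed) for q, f in quantiles.items()) / len(quantiles)


last_value = 3.0
observed = 3.8

# Same construction as the payload in the diff: every level maps to one value.
degenerate = dict.fromkeys(QUANTILES, last_value)
loss = mean_pinball(degenerate, observed)

# On a symmetric grid the levels average to 0.5, so the mean pinball loss of
# a point mass is half the absolute error.
half_abs_error = 0.5 * abs(observed - last_value)
```

Twice the average pinball loss is a common grid approximation of CRPS, and the CRPS of a point mass is the absolute error, so under this assumption the naive baseline's score reduces to a mean absolute error.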
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes____init__.py.md
new file mode 100644
index 0000000..ecd5ed5
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes____init__.py.md
@@ -0,0 +1,82 @@
+# Source: aieng-forecasting/aieng/forecasting/methods/llm_processes/__init__.py
+
+kind: python
+
+```python
+"""LLM-process predictor implementations.
+
+Predictors that use an LLM directly as the forecasting engine (no agent loop,
+no tool use). Concrete subclasses are organised by target type and elicitation
+strategy:
+
+- :class:`SampledTrajectoryLLMPredictor` — sample-based empirical quantiles for
+  continuous targets (Gruver / Context-is-Key Direct Prompt path).
+- :class:`QuantileGridLLMPredictor` — direct elicitation of the standard
+  quantile grid for continuous targets.
+- :class:`BinaryProbabilityLLMPredictor` — direct elicitation of one
+  calibrated probability for binary-event tasks
+  (``ForecastingTask.payload_type == "binary"``), scored with Brier.
+- :class:`CategoricalProbabilityLLMPredictor` — direct elicitation of a
+  calibrated distribution over the task-declared ordered categories
+  (``ForecastingTask.payload_type == "categorical"``), scored with RPS.
+- ``point_intervals`` — design placeholder for a token-efficient point-plus-
+  interval contract. It may become a configurable sparse quantile grid rather
+  than a separate predictor.
+
+Method *variants* from the literature (Requeima A-LLMP / I-LLMP, logprob-based
+hierarchical density, conformal-wrapped predictors) belong as additional
+sibling classes here, **not** as configurations of an existing class. The same
+rule applies to binary elicitation: sampled-outcome, logprob, or
+conformal-wrapped binary forecasters should be siblings of
+:class:`BinaryProbabilityLLMPredictor`, not modes on it.
+
+---
+
+Placeholder method design notes
+-------------------------------
+
+``point_intervals.py`` is intentionally non-exported. A point-plus-interval
+prompt asks for a central path plus compact uncertainty bands (for example
+``q10``, ``q50``, ``q90``). That contract is attractive for larger,
+reasoning-capable LLMs because it is much cheaper than a full quantile grid,
+but it is also just sparse quantile elicitation. Before implementing it, decide
+whether configurable quantile sets belong on :class:`QuantileGridLLMPredictor`
+instead, and how sparse intervals map to the standard ``ContinuousForecast``
+quantiles used for scoring.
+"""
+
+from aieng.forecasting.methods.llm_processes.base import (
+    LLMPredictor,
+    LLMPredictorConfig,
+)
+from aieng.forecasting.methods.llm_processes.binary_probability import (
+    BinaryProbabilityLLMPredictor,
+    BinaryProbabilityLLMPredictorConfig,
+)
+from aieng.forecasting.methods.llm_processes.categorical_probability import (
+    CategoricalProbabilityLLMPredictor,
+    CategoricalProbabilityLLMPredictorConfig,
+)
+from aieng.forecasting.methods.llm_processes.quantile_grid import (
+    QuantileGridLLMPredictor,
+    QuantileGridLLMPredictorConfig,
+)
+from aieng.forecasting.methods.llm_processes.sampled_trajectory import (
+    SampledTrajectoryLLMPredictor,
+    SampledTrajectoryLLMPredictorConfig,
+)
+
+
+__all__ = [
+    "BinaryProbabilityLLMPredictor",
+    "BinaryProbabilityLLMPredictorConfig",
+    "CategoricalProbabilityLLMPredictor",
+    "CategoricalProbabilityLLMPredictorConfig",
+    "SampledTrajectoryLLMPredictor",
+    "SampledTrajectoryLLMPredictorConfig",
+    "QuantileGridLLMPredictor",
+    "QuantileGridLLMPredictorConfig",
+    "LLMPredictor",
+    "LLMPredictorConfig",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes___client.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes___client.py.md
new file mode 100644
index 0000000..877bd51
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes___client.py.md
@@ -0,0 +1,487 @@
+# Source: aieng-forecasting/aieng/forecasting/methods/llm_processes/_client.py
+
+kind: python
+
+```python
+"""Shared LiteLLM call seam for all ``llm_processes`` predictors.
+
+This module owns:
+
+- Idempotent module-level bootstrap of LiteLLM callbacks.
+- Async single-completion seam with one retry on parse failure.
+- Parallel ``asyncio.gather`` fan-out for ``N``-sample elicitation.
+- A small ``run_async`` shim that works in scripts, pytest, and Jupyter.
+- Langfuse ``@observe`` decorator factory and trace-info helpers.
+
+Continuous, binary, and categorical predictors share this seam so the
+LLM-call contract (request shape, retry policy, tracing) lives in exactly
+one place.
+
+LiteLLM caching is intentionally **not** wired here: ``litellm[caching]``
+is an optional extra and disk caching collapses repeated identical prompts
+into a single response, which would defeat sample-based forecasting.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import contextvars
+import json
+import logging
+import os
+import warnings
+from concurrent.futures import ThreadPoolExecutor
+from typing import Any, Callable, TypeVar
+
+from pydantic import BaseModel, ValidationError
+
+
+logger = logging.getLogger(__name__)
+
+T = TypeVar("T", bound=BaseModel)
+
+_BOOTSTRAP_DONE = False
+
+
+def bootstrap_litellm() -> None:
+    """One-time wiring of LiteLLM callbacks.
+
+    Lazy and idempotent so non-LLM predictors do not require Langfuse env vars.
+    The Langfuse OTEL callback is registered only when ``LANGFUSE_PUBLIC_KEY``
+    is set in the environment.
+    """
+    global _BOOTSTRAP_DONE  # noqa: PLW0603
+    if _BOOTSTRAP_DONE:
+        return
+    import litellm  # noqa: PLC0415
+
+    if os.environ.get("LANGFUSE_PUBLIC_KEY"):
+        existing = list(getattr(litellm, "callbacks", []) or [])
+        if "langfuse_otel" not in existing:
+            litellm.callbacks = [*existing, "langfuse_otel"]
+
+    # Suppress LiteLLM startup and OTEL noise (mirrors agent_factory.py filter).
+    # Bedrock/SageMaker "no botocore" and OTEL proxy-server notices are harmless.
+    # OTEL span-lifecycle warnings fire when callbacks run after spans close.
+    class _NoiseFilter(logging.Filter):
+        _NOISE = ("botocore", "Proxy Server is not installed")
+
+        def filter(self, record: logging.LogRecord) -> bool:
+            return not any(n in record.getMessage() for n in self._NOISE)
+
+    logging.getLogger("LiteLLM").addFilter(_NoiseFilter())
+    warnings.filterwarnings("ignore", message="Tried calling set_status on an ended span")
+    warnings.filterwarnings("ignore", message="Setting attribute on ended span")
+    logging.getLogger("opentelemetry").setLevel(logging.ERROR)
+
+    _BOOTSTRAP_DONE = True
+
+
+def langfuse_observe(name: str) -> Callable[..., Any]:
+    """Return Langfuse's ``@observe`` decorator with the given span name.
+
+    Falls back to a no-op decorator if Langfuse is not installed or fails to
+    import, so the predictor remains usable without the ``agentic`` extra.
+    """
+    try:
+        from langfuse import observe  # noqa: PLC0415
+
+        return observe(name=name)
+    except Exception:  # pragma: no cover
+        logger.debug("langfuse not available; skipping @observe decoration")
+
+        def _noop(fn: Any) -> Any:
+            return fn
+
+        return _noop
+
+
+def current_trace_info() -> tuple[str | None, str | None]:
+    """Return ``(trace_id, trace_url)`` from the active Langfuse client, if any."""
+    try:
+        from langfuse import get_client  # noqa: PLC0415
+    except Exception:
+        return None, None
+    try:
+        client = get_client()
+        return client.get_current_trace_id(), client.get_trace_url()
+    except Exception:  # pragma: no cover
+        return None, None
+
+
+def trace_url_for(trace_id: str) -> str | None:
+    """Return the Langfuse UI URL for a specific ``trace_id``, or ``None``.
+
+    Unlike :func:`current_trace_info`, this resolves a URL for a trace by id even
+    when no trace context is active (e.g. the agent path, whose trace id is
+    captured on a worker thread). No-op when Langfuse is unavailable.
+    """
+    try:
+        from langfuse import get_client  # noqa: PLC0415
+
+        return get_client().get_trace_url(trace_id=trace_id)
+    except Exception:
+        return None
+
+
+def set_current_trace_name(name: str) -> None:
+    """Name the active Langfuse trace, if any, so it is identifiable in the UI.
+
+    LLMP predictors call this with their ``predictor_id`` at the top of
+    ``predict``. Because ``predict`` is the ``@observe``-wrapped root span, its
+    name is what Langfuse shows as the trace name; renaming the current span
+    therefore renames the trace to the same identifier used by leaderboards and
+    artifact storage — matching how agent predictors name their traces. No-op
+    when Langfuse is not installed or no span is active.
+    """
+    try:
+        from langfuse import get_client  # noqa: PLC0415
+    except Exception:
+        return
+    try:
+        get_client().update_current_span(name=name)
+    except Exception:  # pragma: no cover
+        logger.debug("update_current_span(name=%r) failed; trace name unchanged.", name)
+
+
+def _strip_additional_properties(node: Any) -> Any:
+    """Recursively drop ``additionalProperties`` keys from a JSON schema.
+
+    The Vector proxy's Gemini ``response_schema`` route rejects
+    ``additionalProperties`` (``Unknown name "additionalProperties" at
+    'generation_config.response_schema'``), even though OpenAI strict mode
+    expects ``additionalProperties: false``. We strip it centrally so the same
+    predictor schemas route through the proxy unchanged; ``strict: True`` still
+    pins the model to the declared fields. (If a direct OpenAI-strict route is
+    ever added, that path would need ``additionalProperties: false`` restored.)
+    """
+    if isinstance(node, dict):
+        return {k: _strip_additional_properties(v) for k, v in node.items() if k != "additionalProperties"}
+    if isinstance(node, list):
+        return [_strip_additional_properties(v) for v in node]
+    return node
+
+
+def make_json_schema_response_format(name: str, schema: dict[str, Any]) -> dict[str, Any]:
+    """Build the explicit ``json_schema`` ``response_format`` dict.
+
+    Always pass this dict form to ``litellm.completion`` rather than a Pydantic
+    class — the class-to-schema conversion path has known regressions on
+    Anthropic providers. ``additionalProperties`` is stripped from the schema
+    for proxy/Gemini compatibility (see :func:`_strip_additional_properties`).
+    """
+    return {
+        "type": "json_schema",
+        "json_schema": {"name": name, "schema": _strip_additional_properties(schema), "strict": True},
+    }
+
+
+def strip_markdown_fence(content: str) -> str:
+    r"""Normalise an LLM response down to its JSON payload.
+
+    Defends the parse layer against two model/proxy quirks so participants can
+    swap models freely without hitting parse failures:
+
+    1. **Markdown fences.** Some models wrap JSON in a ```json ... ``` fence
+       even when ``response_format`` is set.
+    2. **Surrounding prose.** Some models (notably Claude through the proxy)
+       append an explanation *after* the JSON — e.g. ``{...}\n\n**Method:**
+       ...`` — or leak a stray closing fence when prose follows it. This is a
+       Predictor-interface concern, not LLMP-specific: every methodology that
+       parses a structured JSON response needs the payload isolated.
+
+    The prose-trimming step is best-effort: it isolates the first complete
+    JSON object via :meth:`json.JSONDecoder.raw_decode` and discards anything
+    after it. When no JSON object is present the fence-stripped string is
+    returned unchanged, so non-JSON content passes through untouched.
+
+    Parameters
+    ----------
+    content : str
+        Raw LLM response content, possibly fenced and/or surrounded by prose.
+
+    Returns
+    -------
+    str
+        The isolated JSON payload, or the fence-stripped, whitespace-trimmed
+        input when no JSON object can be located.
+    """
+    stripped = content.strip()
+    if stripped.startswith("```"):
+        lines = stripped.splitlines()
+        # Drop opening fence line (```json or ```)
+        inner_lines = lines[1:]
+        # Drop closing fence line if present
+        if inner_lines and inner_lines[-1].strip() == "```":
+            inner_lines = inner_lines[:-1]
+        stripped = "\n".join(inner_lines).strip()
+    payload = _extract_json_payload(stripped)
+    return payload if payload is not None else stripped
+
+
+def _extract_json_payload(text: str) -> str | None:
+    """Return the first complete JSON object in ``text``, or ``None``.
+
+    Scans for the first ``{`` and uses ``raw_decode`` to consume a single
+    balanced JSON object, ignoring any trailing (or leading) prose. Candidate
+    start positions that do not begin a valid object are skipped, so a stray
+    brace inside prose cannot derail extraction.
+
+    Only objects are matched (not arrays): every structured forecast payload in
+    the Predictor interface is a top-level JSON object, so anchoring on ``{``
+    avoids accidentally capturing an echoed numeric array (e.g. the input
+    series) that some models repeat in their prose.
+    """
+    decoder = json.JSONDecoder()
+    for start, char in enumerate(text):
+        if char != "{":
+            continue
+        try:
+            _, end = decoder.raw_decode(text, start)
+        except json.JSONDecodeError:
+            continue
+        return text[start:end]
+    return None
+
+
+# ---------------------------------------------------------------------------
+# Async sampling seam
+# ---------------------------------------------------------------------------
+
+
+async def _one_completion_async(
+    *,
+    model: str,
+    messages: list[dict[str, Any]],
+    response_format: dict[str, Any],
+    temperature: float,
+    max_tokens: int,
+    timeout_s: float,
+    reasoning_effort: str | None,
+    api_base: str | None = None,
+    api_key: str | None = None,
+) -> tuple[str | None, float, int, int]:
+    """Issue a single ``litellm.acompletion`` and return content + usage."""
+    import litellm  # noqa: PLC0415
+
+    kwargs: dict[str, Any] = {
+        "model": model,
+        "messages": messages,
+        "response_format": response_format,
+        "temperature": temperature,
+        "max_tokens": max_tokens,
+        "timeout": timeout_s,
+    }
+    if api_base is not None:
+        kwargs["api_base"] = api_base
+        # Prefix the model with "openai/" so LiteLLM routes via the
+        # OpenAI-compatible path.  LiteLLM strips the prefix before sending
+        # the request, so the proxy receives the bare model name as expected.
+        if not model.startswith("openai/"):
+            kwargs["model"] = f"openai/{model}"
+    if api_key is not None:
+        kwargs["api_key"] = api_key
+    if reasoning_effort is not None:
+        # LiteLLM unifies the per-provider reasoning-budget kwargs behind
+        # ``reasoning_effort`` ∈ {"disable", "low", "medium", "high"}. The
+        # config defaults to ``None`` (no forced reasoning) because CoT-induced
+        # overconfidence is well-documented for continuous probabilistic
+        # forecasting (Welch 2026, Marzoev 2026).
+        #
+        # IMPORTANT: when routing through an OpenAI-compatible proxy (api_base
+        # set), LiteLLM treats the model as a generic OpenAI model and does not
+        # list ``reasoning_effort`` as a supported param for non-o1/o3 model
+        # names (confirmed via litellm.get_supported_openai_params). With
+        # ``drop_params=True`` it is silently stripped before the request
+        # reaches the proxy, so the thinking model runs unconstrained.
+        # Workaround: inject via ``extra_body``, which bypasses LiteLLM's
+        # param-filtering step and is merged directly into the request JSON.
+        if api_base is not None:
+            kwargs.setdefault("extra_body", {})["reasoning_effort"] = reasoning_effort
+        else:
+            kwargs["reasoning_effort"] = reasoning_effort
+        # drop_params=True is still needed for other non-standard params on
+        # models that don't support them (e.g. temperature on some o-series).
+        kwargs["drop_params"] = True
+
+    resp = await litellm.acompletion(**kwargs)
+    cost = float(getattr(resp, "_hidden_params", {}).get("response_cost") or 0.0)
+    usage = getattr(resp, "usage", None)
+    in_tok = int(getattr(usage, "prompt_tokens", 0) or 0) if usage is not None else 0
+    out_tok = int(getattr(usage, "completion_tokens", 0) or 0) if usage is not None else 0
+    # Log full usage so we can see thinking-token breakdown when available.
+    # The proxy may populate completion_tokens_details.reasoning_tokens.
+    if usage is not None:
+        logger.debug("LLM usage: %s", vars(usage) if hasattr(usage, "__dict__") else usage)
+    raw = resp.choices[0].message.content
+    content = strip_markdown_fence(raw) if raw else raw
+    return content, cost, in_tok, out_tok
+
+
+async def _one_completion_with_transient_retry(
+    *,
+    model: str,
+    messages: list[dict[str, Any]],
+    response_format: dict[str, Any],
+    temperature: float,
+    max_tokens: int,
+    timeout_s: float,
+    reasoning_effort: str | None,
+    api_base: str | None = None,
+    api_key: str | None = None,
+) -> tuple[str | None, float, int, int]:
+    """Call ``_one_completion_async`` with retries for transient API errors.
+
+    Makes up to 3 attempts (2 retries) on 503 / rate-limit responses, backing
+    off exponentially (5 s, then 15 s).  Non-transient errors propagate immediately.
+    """
+    from litellm.exceptions import RateLimitError, ServiceUnavailableError  # noqa: PLC0415
+
+    _transient = (ServiceUnavailableError, RateLimitError)
+    for attempt in range(3):
+        try:
+            return await _one_completion_async(
+                model=model,
+                messages=messages,
+                response_format=response_format,
+                temperature=temperature,
+                max_tokens=max_tokens,
+                timeout_s=timeout_s,
+                reasoning_effort=reasoning_effort,
+                api_base=api_base,
+                api_key=api_key,
+            )
+        except _transient as exc:
+            if attempt == 2:
+                raise
+            wait_s = 5 * (3**attempt)  # 5 s, 15 s
+            logger.warning(
+                "Transient API error (attempt %d/3), retrying in %ds: %s",
+                attempt + 1,
+                wait_s,
+                exc,
+            )
+            await asyncio.sleep(wait_s)
+    raise RuntimeError("unreachable")  # pragma: no cover
+
+
+async def _sample_one_with_retry(
+    *,
+    schema_cls: type[T],
+    model: str,
+    base_messages: list[dict[str, Any]],
+    response_format: dict[str, Any],
+    temperature: float,
+    max_tokens: int,
+    timeout_s: float,
+    reasoning_effort: str | None,
+    sample_index: int,
+    api_base: str | None = None,
+    api_key: str | None = None,
+) -> tuple[T | None, float, int, int, int]:
+    """Single sample with one retry on parse failure and transient-error backoff."""
+    cost = 0.0
+    in_tok = 0
+    out_tok = 0
+    failures = 0
+
+    for attempt in range(2):
+        content, c, i, o = await _one_completion_with_transient_retry(
+            model=model,
+            messages=base_messages,
+            response_format=response_format,
+            temperature=temperature,
+            max_tokens=max_tokens,
+            timeout_s=timeout_s,
+            reasoning_effort=reasoning_effort,
+            api_base=api_base,
+            api_key=api_key,
+        )
+        cost += c
+        in_tok += i
+        out_tok += o
+        try:
+            parsed = schema_cls.model_validate(json.loads(content or ""))
+            return parsed, cost, in_tok, out_tok, failures
+        except (json.JSONDecodeError, ValidationError) as exc:
+            failures += 1
+            logger.warning(
+                "Sample %d parse failure on attempt %d: %s",
+                sample_index + 1,
+                attempt + 1,
+                exc,
+            )
+
+    return None, cost, in_tok, out_tok, failures
+
+
+async def sample_n_async(
+    *,
+    schema_cls: type[T],
+    model: str,
+    base_messages: list[dict[str, Any]],
+    response_format: dict[str, Any],
+    n_samples: int,
+    temperature: float,
+    max_tokens: int,
+    timeout_s: float,
+    reasoning_effort: str | None,
+    api_base: str | None = None,
+    api_key: str | None = None,
+) -> tuple[list[T], float, int, int, int]:
+    """Fan ``n_samples`` calls out via ``asyncio.gather`` and aggregate usage.
+
+    Returns ``(parsed_samples, total_cost, total_in_tokens, total_out_tokens,
+    total_parse_failures)``. Samples that fail both parse attempts are dropped
+    here (each failure is logged); the caller must handle an empty parsed list.
+    """
+    coros = [
+        _sample_one_with_retry(
+            schema_cls=schema_cls,
+            model=model,
+            base_messages=base_messages,
+            response_format=response_format,
+            temperature=temperature,
+            max_tokens=max_tokens,
+            timeout_s=timeout_s,
+            reasoning_effort=reasoning_effort,
+            sample_index=i,
+            api_base=api_base,
+            api_key=api_key,
+        )
+        for i in range(n_samples)
+    ]
+    results = await asyncio.gather(*coros)
+
+    parsed: list[T] = []
+    total_cost = 0.0
+    total_in = 0
+    total_out = 0
+    total_failures = 0
+    for sample, c, i, o, f in results:
+        total_cost += c
+        total_in += i
+        total_out += o
+        total_failures += f
+        if sample is not None:
+            parsed.append(sample)
+    return parsed, total_cost, total_in, total_out, total_failures
+
+
+def run_async(coro: Any) -> Any:
+    """Run an async coroutine from sync code; works in scripts and Jupyter.
+
+    If no event loop is running (scripts, pytest), uses ``asyncio.run``.
+    If a loop is already running (Jupyter), runs the coroutine on a fresh
+    loop in a worker thread with the current ``contextvars`` context copied
+    across, so Langfuse trace context propagates into the async sampling.
+    """
+    try:
+        asyncio.get_running_loop()
+    except RuntimeError:
+        return asyncio.run(coro)
+
+    ctx = contextvars.copy_context()
+    with ThreadPoolExecutor(max_workers=1) as pool:
+        return pool.submit(ctx.run, asyncio.run, coro).result()
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__base.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__base.py.md
new file mode 100644
index 0000000..f943f74
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__base.py.md
@@ -0,0 +1,480 @@
+# Source: aieng-forecasting/aieng/forecasting/methods/llm_processes/base.py
+
+kind: python
+
+```python
+"""Abstract base class and shared config for LLM-process predictors.
+
+``LLMPredictor`` is the abstract parent shared by every concrete predictor in
+this package (:class:`SampledTrajectoryLLMPredictor`, :class:`QuantileGridLLMPredictor`,
+:class:`BinaryProbabilityLLMPredictor`, :class:`CategoricalProbabilityLLMPredictor`).
+It is **never instantiated directly**: users instantiate one of the concrete
+subclasses re-exported from :mod:`aieng.forecasting.methods`.
+"""
+
+from __future__ import annotations
+
+import os
+from pathlib import Path
+from typing import TYPE_CHECKING, Any, ClassVar, Literal, Mapping
+
+import pandas as pd
+from aieng.forecasting.documents.models import ExtractedDocument
+from aieng.forecasting.documents.pdf_upload import pdf_to_content_part
+from aieng.forecasting.evaluation.predictor import Predictor
+from aieng.forecasting.methods.llm_processes._client import bootstrap_litellm, current_trace_info
+from aieng.forecasting.models import LITE_MODEL
+from pydantic import BaseModel, ConfigDict, Field
+
+
+if TYPE_CHECKING:
+    from aieng.forecasting.data.context import ForecastContext
+    from aieng.forecasting.data.models import SeriesMetadata
+    from aieng.forecasting.evaluation.task import ForecastingTask
+
+
+class LLMPredictorConfig(BaseModel):
+    """Frozen base config: provider-agnostic LLM-call settings.
+
+    Subclasses extend with modality-specific fields (e.g. ``n_samples``,
+    ``precision`` for the continuous case).
+    """
+
+    model_config = ConfigDict(frozen=True)
+
+    model: str = Field(
+        default=LITE_MODEL,
+        description=(
+            "Model name as expected by the proxy (bare, no provider prefix), "
+            "e.g. 'gemini-3.1-flash-lite-preview', 'gpt-4o-mini'. "
+            "When openai_base_url is set, LiteLLM routes this to the proxy via "
+            "custom_llm_provider='openai'."
+        ),
+    )
+    openai_base_url: str | None = Field(
+        default_factory=lambda: os.getenv("OPENAI_BASE_URL"),
+        description=(
+            "Base URL for an OpenAI-compatible LLM proxy. Defaults to the "
+            "``OPENAI_BASE_URL`` environment variable. When set, all completions "
+            "are routed through the proxy using ``api_base`` + "
+            "``custom_llm_provider='openai'``."
+        ),
+    )
+    openai_api_key: str | None = Field(
+        default_factory=lambda: os.getenv("OPENAI_API_KEY"),
+        description=("API key for the proxy. Defaults to the ``OPENAI_API_KEY`` environment variable."),
+    )
+    temperature: float = Field(default=1.0, ge=0.0, le=2.0, description="Sampling temperature.")
+    max_tokens: int = Field(
+        default=16384,
+        ge=1,
+        description=(
+            "Per-call output token budget. "
+            "Thinking models (e.g. gemini-3.1-pro-preview) consume thinking tokens "
+            "from this same budget via the OpenAI-compatible proxy — the 16 k default "
+            "is intentionally generous to prevent truncation; the model only generates "
+            "tokens it needs, so non-thinking models are not affected in cost."
+        ),
+    )
+    timeout_s: float = Field(default=120.0, gt=0.0, description="Per-call timeout in seconds.")
+    reasoning_effort: Literal["disable", "low", "medium", "high"] | None = Field(
+        default=None,
+        description=(
+            "Reasoning budget passed through to LiteLLM. ``None`` (default) sends "
+            "no ``reasoning_effort`` and lets the provider use its own default — "
+            "for the project's Gemini-via-proxy setup the lite model does not "
+            "force chain-of-thought, which suits calibration-sensitive "
+            "forecasting (CoT-induced overconfidence is well-documented for "
+            "continuous probabilistic forecasting). ``'medium'`` / ``'high'`` "
+            "request more reasoning. NOTE: the Vector proxy currently rejects "
+            "``'disable'`` and ``'low'`` for Gemini models (valid: "
+            "minimal/medium/high) — those literals are retained for other "
+            "providers but will 400 through the proxy."
+        ),
+    )
+    variant_tag: str | None = Field(
+        default=None,
+        description=(
+            "Optional short identifier for a method recipe (e.g. ``'food_cpi_v1_h60_n3'``, "
+            "``'short_history'``). When set, it is folded into :attr:`predictor_id` "
+            "as ``<method_tag>_<variant_tag>[<model>]`` so artifact storage, cached "
+            "backtests, and leaderboards keep recipes distinct. ``None`` preserves "
+            "the bare ``<method_tag>[<model>]`` form used by ad-hoc construction."
+        ),
+    )
+    report_sources: list[str] | None = Field(
+        default=None,
+        description=(
+            "Optional list of document source keys (e.g. ``['cfpr']``) to include "
+            "as a report preamble in the prompt.  When set, the predictor calls "
+            "``context.get_documents(source)`` for each source and prepends the "
+            "extracted text to the user prompt in CiK-style Format A.  Requires a "
+            "``DocumentStore`` to be attached to the ``DataService``."
+        ),
+    )
+    report_max_chars: int | None = Field(
+        default=None,
+        ge=1,
+        description=(
+            "Per-report character truncation limit.  Reports can be ~80,000 chars "
+            "each; set this to keep context windows manageable.  Truncation is "
+            "applied per-report before concatenation.  ``None`` means no truncation. "
+            "Only used by the ``'text'`` ingestion mode."
+        ),
+    )
+    report_ingestion: Literal["text", "native"] = Field(
+        default="text",
+        description=(
+            "How report documents are fed to the model when ``report_sources`` is "
+            "set.  ``'text'`` (default) injects pymupdf4llm-extracted markdown as a "
+            "CiK-style text preamble — works for every model through the proxy.  "
+            "``'native'`` uploads the source PDFs as backend-native document parts "
+            "so the model reads the original (tables/figures intact).  "
+            "TEMPORARY LIMITATION: native ingestion works only for Claude/GPT "
+            "models — the proxy drops document parts on the Gemini route.  Once the "
+            "proxy routes Gemini natively (see TODO(proxy-pdf) in "
+            "documents/pdf_upload.py), native ingestion will apply uniformly and "
+            "this becomes a free text-vs-native choice for any model."
+        ),
+    )
+
+
+def serialize_history(df: pd.DataFrame, precision: int) -> str:
+    """Render a cutoff-filtered series as one ``<date>: value`` line per row.
+
+    Uses ``YYYY-MM-DD`` format when any timestamp falls on a day other than 1
+    (i.e. the series is sub-monthly), and ``YYYY-MM`` format otherwise.
+
+    .. TODO(history-format): the day-!= 1 heuristic handles monthly vs daily but
+       breaks for quarterly, weekly, or truly irregular series.  A future revision
+       should accept an explicit ``fmt`` or ``frequency`` parameter so callers
+       have full control over the date representation sent to the LLM.
+    """
+    timestamps = [pd.Timestamp(ts) for ts in df["timestamp"]]
+    is_sub_monthly = any(ts.day != 1 for ts in timestamps)
+    fmt = "%Y-%m-%d" if is_sub_monthly else "%Y-%m"
+    lines = [f"{ts.strftime(fmt)}: {v:.{precision}f}" for ts, v in zip(timestamps, df["value"])]
+    return "\n".join(lines)
+
+
+def build_covariate_block(
+    context: ForecastContext,
+    covariate_series_ids: list[str],
+    *,
+    precision: int,
+    history_window: int | None = None,
+) -> str:
+    """Serialize covariate histories into labeled blocks for the LLM prompt.
+
+    Each registered covariate series is rendered cutoff-safe (via
+    ``context.get_series``) as a labeled block: a description / units header
+    (from :meth:`get_metadata` when available) followed by its
+    :func:`serialize_history` rendering. Series with no observations at the
+    cutoff are skipped. When ``history_window`` is set, each covariate is
+    truncated to its last ``history_window`` observations, matching the target.
+
+    This is the Context-is-Key §5.4 "labeled covariate blocks" pattern: the
+    model sees the target history plus the recent trajectory of each exogenous
+    series and may condition on cross-series structure.
+
+    Returns an empty string when ``covariate_series_ids`` is empty or no
+    covariate has usable history, so callers can unconditionally interpolate the
+    result into a prompt.
+    """
+    blocks: list[str] = []
+    for cov_id in covariate_series_ids:
+        cov_df = context.get_series(cov_id)
+        if cov_df.empty:
+            continue
+        if history_window is not None:
+            cov_df = cov_df.tail(history_window).reset_index(drop=True)
+        try:
+            cov_meta: SeriesMetadata | None = context.get_metadata(cov_id)
+        except KeyError:
+            cov_meta = None
+        if cov_meta is not None:
+            header = f"Covariate: {cov_meta.description} (source: {cov_meta.source})\nUnits: {cov_meta.units}"
+        else:
+            header = f"Covariate: {cov_id}"
+        blocks.append(f"{header}\n{serialize_history(cov_df, precision=precision)}")
+    if not blocks:
+        return ""
+    intro = (
+        "Covariates (exogenous series observed through the forecast origin; "
+        "use as additional context for your forecast):"
+    )
+    return intro + "\n\n" + "\n\n".join(blocks)
+
+
+def get_history_and_meta(
+    task: ForecastingTask,
+    context: ForecastContext,
+) -> tuple[pd.DataFrame, SeriesMetadata | None]:
+    """Fetch the target series and its metadata, respecting the cutoff.
+
+    Raises ``ValueError`` if the series has no observations at ``context.as_of``.
+    Returns ``(df, None)`` for series whose adapter did not register metadata.
+    """
+    series_df = context.get_series(task.target_series_id)
+    if series_df.empty:
+        raise ValueError(f"History for '{task.target_series_id}' is empty at as_of={context.as_of}.")
+    try:
+        series_meta = context.get_metadata(task.target_series_id)
+    except KeyError:
+        series_meta = None
+    return series_df, series_meta
+
+
+def fetch_report_docs(
+    *,
+    config: LLMPredictorConfig,
+    context: ForecastContext,
+) -> list[ExtractedDocument]:
+    """Fetch cutoff-filtered report documents per ``config.report_sources``.
+
+    Parameters
+    ----------
+    config : LLMPredictorConfig
+        Config whose ``report_sources`` field lists the document sources to fetch.
+    context : ForecastContext
+        Cutoff-scoped context with optional ``DocumentStore``.
+
+    Returns
+    -------
+    list[ExtractedDocument]
+        Cutoff-filtered, chronologically sorted documents.  Empty when
+        ``report_sources`` is ``None`` or empty, or no ``DocumentStore`` is attached.
+    """
+    if not config.report_sources:
+        return []
+    docs: list[ExtractedDocument] = []
+    for source in config.report_sources:
+        docs.extend(context.get_documents(source))
+    docs.sort(key=lambda d: (d.meta.publication_date, d.meta.doc_id))
+    return docs
+
+
+def build_report_preamble(
+    docs: list[ExtractedDocument],
+    *,
+    max_chars: int | None = None,
+) -> str:
+    """Build a CiK-style Format A report preamble from a list of documents.
+
+    Each document is formatted as a titled, dated block::
+
+        === Canada's Food Price Report 2025 (15th edition) ===
+        Source: cfpr
+        Published: 2024-12-05
+        <extracted text>
+
+    When ``max_chars`` is set, each report's text is truncated to that limit
+    with a ``[...]`` marker appended.  Documents are rendered in the order
+    provided (typically chronological).
+
+    Parameters
+    ----------
+    docs : list[ExtractedDocument]
+        Documents to include in the preamble.
+    max_chars : int or None
+        Per-report character truncation limit.  ``None`` means no truncation.
+
+    Returns
+    -------
+    str
+        Formatted preamble string, or an empty string when ``docs`` is empty.
+    """
+    if not docs:
+        return ""
+    blocks: list[str] = []
+    for doc in docs:
+        title = doc.meta.title or f"{doc.meta.source}/{doc.meta.doc_id}"
+        text = doc.text
+        if max_chars is not None and len(text) > max_chars:
+            text = text[:max_chars] + "\n\n[...]"
+        block = (
+            f"=== {title} ===\nSource: {doc.meta.source}\nPublished: {doc.meta.publication_date.isoformat()}\n\n{text}"
+        )
+        blocks.append(block)
+    return "\n\n".join(blocks)
+
+
+#: Shared framing line that introduces report context in both ingestion modes.
+_REPORT_INTRO = (
+    "You are provided with the following economic report(s) "
+    "published before the forecast date. Use them as context "
+    "for your forecast."
+)
+
+
+def apply_report_context(
+    *,
+    config: LLMPredictorConfig,
+    docs: list[ExtractedDocument],
+    user_prompt: str,
+) -> str | list[dict[str, Any]]:
+    """Apply report context to the user prompt in the configured ingestion mode.
+
+    Centralizes the report-injection logic shared by every LLMP predictor so the
+    text-vs-native decision lives in one place.
+
+    Modes (``config.report_ingestion``):
+
+    - ``"text"`` (default): build a CiK-style text preamble via
+      :func:`build_report_preamble` and prepend it to ``user_prompt``.  Returns
+      a single string.  Works for every model through the proxy.
+    - ``"native"``: emit the source PDFs as backend-native document content
+      parts (:func:`~aieng.forecasting.documents.pdf_upload.pdf_to_content_part`)
+      so the model reads the originals directly.  Returns a content-part list
+      ``[intro_text, <pdf parts...>, prompt_text]``.  Requires each document to
+      carry a resolvable ``pdf_path`` and a Claude/GPT model — Gemini native
+      ingestion is not supported through the proxy yet (see ``pdf_upload.py``).
+
+    When ``docs`` is empty the bare ``user_prompt`` is returned unchanged, so
+    callers can pass the result straight through as message content regardless
+    of whether any reports were configured.
+
+    Returns
+    -------
+    str or list[dict]
+        A string (text mode / no docs) or a list of content-part dicts (native
+        mode), suitable as the ``content`` of a user message.
+    """
+    if not docs:
+        return user_prompt
+    if config.report_ingestion == "native":
+        return _build_native_report_content(config=config, docs=docs, user_prompt=user_prompt)
+    preamble = build_report_preamble(docs, max_chars=config.report_max_chars)
+    if not preamble:
+        return user_prompt
+    return f"{_REPORT_INTRO}\n\n{preamble}\n\n---\n\n{user_prompt}"
+
+
+def _build_native_report_content(
+    *,
+    config: LLMPredictorConfig,
+    docs: list[ExtractedDocument],
+    user_prompt: str,
+) -> list[dict[str, Any]]:
+    """Build a content-part list with native PDF document parts + the prompt.
+
+    Order: a brief intro text part, one backend-native document part per source
+    PDF (in the order given), then the user prompt as a trailing text part.
+
+    Raises
+    ------
+    ValueError
+        If any document lacks a resolved ``pdf_path``.
+    NotImplementedError
+        If ``config.model`` is a Gemini model (proxy limitation; raised by
+        :func:`~aieng.forecasting.documents.pdf_upload.pdf_to_content_part`).
+    """
+    parts: list[dict[str, Any]] = [{"type": "text", "text": _REPORT_INTRO}]
+    for doc in docs:
+        if not doc.pdf_path:
+            raise ValueError(
+                f"Native report ingestion requested but document "
+                f"'{doc.meta.source}/{doc.meta.doc_id}' has no resolved pdf_path. "
+                "Ensure the source PDF sits beside its .json artifact, or use "
+                "report_ingestion='text'."
+            )
+        parts.append(pdf_to_content_part(Path(doc.pdf_path), config.model))
+    parts.append({"type": "text", "text": f"---\n\n{user_prompt}"})
+    return parts
+
+
+class LLMPredictor(Predictor):
+    """Abstract parent for all LLM-process predictors.
+
+    Concrete subclasses differ in:
+
+    - The config type they accept (extends :class:`LLMPredictorConfig`).
+    - The output schema they request from the LLM.
+    - How they aggregate one or many LLM responses into ``Prediction`` objects.
+
+    What this base provides:
+
+    - LiteLLM bootstrap on construction (lazy, idempotent).
+    - ``predictor_id`` derived from the class-level ``_method_tag``.
+    - ``cfg`` storage with the right modality-specific type.
+
+    Subclasses must:
+
+    - Set the class attribute ``_method_tag`` (e.g. ``"llmp_sampled_trajectories"``).
+    - Override ``_default_config`` to return their concrete config type.
+    - Implement ``predict``.
+    """
+
+    #: Stable, human-readable family tag used in :attr:`predictor_id`.
+    #: Subclasses must override (e.g. ``"llmp_sampled_trajectories"``).
+    _method_tag: ClassVar[str] = ""
+
+    def __init__(self, cfg: LLMPredictorConfig | None = None) -> None:
+        if not self._method_tag:
+            raise TypeError(
+                f"{type(self).__name__} must set the class attribute '_method_tag'.",
+            )
+        self.cfg = cfg if cfg is not None else self._default_config()
+        bootstrap_litellm()
+
+    @classmethod
+    def _default_config(cls) -> LLMPredictorConfig:
+        """Return a default config; subclasses override with their own config type."""
+        return LLMPredictorConfig()
+
+    @property
+    def predictor_id(self) -> str:
+        """Stable identifier folding method tag, optional variant tag, and model.
+
+        Format:
+
+        - ``<method_tag>[<model>]`` when ``cfg.variant_tag`` is ``None`` (default).
+        - ``<method_tag>_<variant_tag>[<model>]`` otherwise.
+
+        Recipes (see ``implementations/<use-case>/predictors/``) set
+        ``variant_tag`` so their cached backtests and leaderboard rows stay
+        distinct from ad-hoc bare-config runs.  Examples:
+
+        - ``llmp_sampled_trajectories[anthropic/claude-sonnet-4-5]``
+        - ``llmp_sampled_trajectories_food_cpi_v1_h60_n3[<model>]``
+        - ``llmp_quantile_grid_food_cpi_v1_h60_rlow[<model>]``
+        """
+        if self.cfg.variant_tag:
+            return f"{self._method_tag}_{self.cfg.variant_tag}[{self.cfg.model}]"
+        return f"{self._method_tag}[{self.cfg.model}]"
+
+    def _build_metadata(
+        self,
+        *,
+        cost_usd: float,
+        in_tokens: int,
+        out_tokens: int,
+        parse_failures: int,
+        history_window: int | None = None,
+        extra: Mapping[str, Any] | None = None,
+    ) -> dict[str, Any]:
+        """Build common metadata for an LLM-backed prediction."""
+        trace_id, trace_url = current_trace_info()
+        metadata: dict[str, Any] = {"model": self.cfg.model}
+        if extra is not None:
+            metadata.update(extra)
+        metadata.update(
+            {
+                "temperature": self.cfg.temperature,
+                "reasoning_effort": self.cfg.reasoning_effort,
+                "cost_usd": cost_usd,
+                "input_tokens": in_tokens,
+                "output_tokens": out_tokens,
+                "parse_failures": parse_failures,
+            }
+        )
+        if self.cfg.variant_tag is not None:
+            metadata["variant_tag"] = self.cfg.variant_tag
+        if history_window is not None:
+            metadata["history_window"] = history_window
+        if trace_id is not None:
+            metadata["langfuse_trace_id"] = trace_id
+        if trace_url is not None:
+            metadata["langfuse_trace_url"] = trace_url
+        return metadata
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__binary_probability.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__binary_probability.py.md
new file mode 100644
index 0000000..0c81afc
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__binary_probability.py.md
@@ -0,0 +1,340 @@
+# Source: aieng-forecasting/aieng/forecasting/methods/llm_processes/binary_probability.py
+
+kind: python
+
+```python
+"""BinaryProbabilityLLMPredictor — direct probability elicitation for binary events.
+
+Asks an LLM for a single calibrated probability that a binary event resolves
+``True``, via one structured completion per forecast origin. This is the
+binary counterpart of
+:class:`~aieng.forecasting.methods.llm_processes.quantile_grid.QuantileGridLLMPredictor`:
+where the quantile grid elicits a full predictive distribution for a
+continuous target, this class elicits the one number that fully describes a
+Bernoulli predictive distribution.
+
+Direct probabilities are token-efficient and easy to score (Brier), but the
+prompt must distinguish *calibrated probability* from *model confidence* —
+the system prompt below is explicit about calibration semantics. Sampled-outcome,
+logprob, and conformal variants should be implemented as sibling classes if
+they prove useful, not as modes on this predictor.
+"""
+
+from __future__ import annotations
+
+from datetime import datetime, timezone
+from typing import TYPE_CHECKING, Any, ClassVar
+
+import pandas as pd
+from aieng.forecasting.evaluation.prediction import BinaryForecast, Prediction
+from aieng.forecasting.methods.llm_processes._client import (
+    langfuse_observe,
+    make_json_schema_response_format,
+    run_async,
+    sample_n_async,
+    set_current_trace_name,
+)
+from aieng.forecasting.methods.llm_processes.base import (
+    LLMPredictor,
+    LLMPredictorConfig,
+    apply_report_context,
+    fetch_report_docs,
+    get_history_and_meta,
+    serialize_history,
+)
+from pydantic import BaseModel, ConfigDict, Field
+
+
+if TYPE_CHECKING:
+    from aieng.forecasting.data.context import ForecastContext
+    from aieng.forecasting.data.models import SeriesMetadata
+    from aieng.forecasting.evaluation.task import ForecastingTask
+
+
+class BinaryProbabilityLLMPredictorConfig(LLMPredictorConfig):
+    """Frozen configuration for :class:`BinaryProbabilityLLMPredictor`.
+
+    Adds only binary-task prompt controls that preserve the direct-probability
+    contract. The predictor makes one structured completion per forecast
+    origin and does not expose ``n_samples``.
+    """
+
+    model_config = ConfigDict(frozen=True)
+
+    precision: int = Field(
+        default=0,
+        ge=0,
+        le=10,
+        description="Decimal places used when serializing the (0/1) event history.",
+    )
+    history_window: int | None = Field(
+        default=None,
+        ge=1,
+        description="If set, only the last N cutoff-filtered observations are serialized into the prompt.",
+    )
+    series_description: str | None = Field(
+        default=None,
+        description="Optional replacement for the metadata-derived series description block.",
+    )
+    elicit_reasoning: bool = Field(
+        default=True,
+        description=(
+            "When True, ask the model for a short free-text 'reasoning' field alongside the "
+            "probability, captured into Prediction.metadata['rationale'] for inspection and "
+            "downstream reasoning evaluation. The field is requested *after* the probability so "
+            "the model commits to the number first, keeping the answer-first ordering that "
+            "protects calibration. Set False to restore the bare probability-only elicitation."
+        ),
+    )
+    system_prompt_override: str | None = Field(
+        default=None,
+        description="Full replacement for the built-in binary-probability system prompt.",
+    )
+    user_prompt_suffix: str | None = Field(
+        default=None,
+        description=(
+            "Free-form text appended to the user prompt after the standard question. "
+            "Use-case recipes use this to inject domain context (covariate summaries, "
+            "report excerpts) without changing the elicitation contract."
+        ),
+    )
+
+
+class _BinaryProbability(BaseModel):
+    """Internal Pydantic schema for one directly elicited event probability.
+
+    ``reasoning`` is optional so parsing succeeds whether or not the field was
+    requested (controlled by ``elicit_reasoning`` on the config).
+    """
+
+    probability: float = Field(ge=0.0, le=1.0)
+    reasoning: str = Field(default="")
+
+
+def _build_binary_probability_schema(elicit_reasoning: bool) -> dict[str, Any]:
+    """Build the strict ``json_schema`` for one event probability.
+
+    ``probability`` comes first so the model commits to the number before any
+    justification. When ``elicit_reasoning`` is True, a free-text ``reasoning``
+    field is appended; strict mode with ``additionalProperties: False`` requires
+    every property to be listed in ``required``.
+    """
+    properties: dict[str, Any] = {
+        "probability": {"type": "number", "minimum": 0.0, "maximum": 1.0},
+    }
+    required = ["probability"]
+    if elicit_reasoning:
+        properties["reasoning"] = {"type": "string"}
+        required.append("reasoning")
+    return {
+        "type": "object",
+        "properties": properties,
+        "required": required,
+        "additionalProperties": False,
+    }
+
+
+def _build_system_prompt(override: str | None = None, *, elicit_reasoning: bool = False) -> str:
+    """Return the binary-probability system prompt, or ``override`` verbatim."""
+    if override is not None:
+        return override
+    reasoning_rule = (
+        "- Decide your probability first, then briefly justify it in plain text in the "
+        "'reasoning' field (a few sentences naming the key drivers).\n"
+        if elicit_reasoning
+        else ""
+    )
+    return (
+        "You are a probabilistic forecaster of binary events. Given the history of past "
+        "outcomes and a question about a future event, return one calibrated probability "
+        "that the event occurs.\n"
+        "\n"
+        "Rules:\n"
+        "- Return ONLY a JSON object matching the provided schema. No prose, no markdown.\n"
+        "- 'probability' is the probability the event resolves TRUE (1), in [0, 1].\n"
+        "- Report a CALIBRATED probability, not your confidence in a point answer: across "
+        "many questions where you answer 0.7, the event should occur about 70% of the time.\n"
+        "- Avoid 0.0 and 1.0 unless the outcome is logically certain.\n"
+        f"{reasoning_rule}"
+        "- Base rates matter: anchor on how often the event has occurred historically, then "
+        "adjust for the current situation."
+    )
+
+
+def _build_user_prompt(
+    task: ForecastingTask,
+    history_str: str,
+    series_meta: SeriesMetadata | None,
+    forecast_date: pd.Timestamp,
+    series_description_override: str | None = None,
+    suffix: str | None = None,
+) -> str:
+    """Build the binary-probability user prompt."""
+    if series_description_override is not None:
+        meta_block = series_description_override
+    else:
+        meta_lines: list[str] = []
+        if series_meta is not None:
+            meta_lines.append(f"Event series: {series_meta.description} (source: {series_meta.source})")
+            meta_lines.append(f"Units: {series_meta.units}")
+        else:
+            meta_lines.append(f"Event series: {task.target_series_id}")
+        meta_block = "\n".join(meta_lines)
+
+    base = (
+        f"Question: {task.description}\n"
+        "\n"
+        f"{meta_block}\n"
+        "\n"
+        "History of past outcomes (1 = event occurred, 0 = it did not):\n"
+        f"{history_str}\n"
+        "\n"
+        f"The event resolves on {forecast_date.strftime('%Y-%m-%d')}.\n"
+        "Return a JSON object whose 'probability' field is the calibrated probability "
+        "that the event occurs (resolves to 1)."
+    )
+    if suffix:
+        base = f"{base}\n\n{suffix.lstrip(chr(10))}"
+    return base
+
+
+def _sample_probability(
+    *,
+    cfg: BinaryProbabilityLLMPredictorConfig,
+    system_prompt: str,
+    user_prompt: str | list[dict[str, Any]],
+) -> tuple[_BinaryProbability, float, int, int, int]:
+    """Issue one structured completion and return the parsed probability."""
+    base_messages: list[dict[str, Any]] = [
+        {"role": "system", "content": system_prompt},
+        {"role": "user", "content": user_prompt},
+    ]
+    response_format = make_json_schema_response_format(
+        "BinaryProbability", _build_binary_probability_schema(cfg.elicit_reasoning)
+    )
+
+    parsed, cost_usd, in_tokens, out_tokens, parse_failures = run_async(
+        sample_n_async(
+            schema_cls=_BinaryProbability,
+            model=cfg.model,
+            base_messages=base_messages,
+            response_format=response_format,
+            n_samples=1,
+            temperature=cfg.temperature,
+            max_tokens=cfg.max_tokens,
+            timeout_s=cfg.timeout_s,
+            reasoning_effort=cfg.reasoning_effort,
+            api_base=cfg.openai_base_url,
+            api_key=cfg.openai_api_key,
+        ),
+    )
+    if not parsed:
+        raise RuntimeError("No valid binary-probability response returned by LLM.")
+    return parsed[0], cost_usd, in_tokens, out_tokens, parse_failures
+
+
+class BinaryProbabilityLLMPredictor(LLMPredictor):
+    """Binary-event LLM forecaster using direct probability elicitation."""
+
+    _method_tag: ClassVar[str] = "llmp_binary_probability"
+
+    cfg: BinaryProbabilityLLMPredictorConfig
+
+    def __init__(self, cfg: BinaryProbabilityLLMPredictorConfig | None = None) -> None:
+        super().__init__(cfg)
+
+    @classmethod
+    def _default_config(cls) -> BinaryProbabilityLLMPredictorConfig:
+        return BinaryProbabilityLLMPredictorConfig()
+
+    @langfuse_observe("BinaryProbabilityLLMPredictor.predict")
+    def predict(
+        self,
+        task: ForecastingTask,
+        context: ForecastContext,
+    ) -> list[Prediction]:
+        """Produce one BinaryForecast prediction from a directly elicited probability.
+
+        Raises
+        ------
+        ValueError
+            If the task does not declare ``payload_type='binary'`` or does not
+            declare exactly one horizon: a single probability maps to exactly
+            one resolution date.
+        """
+        if task.payload_type != "binary":
+            raise ValueError(
+                f"{type(self).__name__} requires a binary task (payload_type='binary'); "
+                f"task '{task.task_id}' declares payload_type='{task.payload_type}'."
+            )
+        if len(task.horizons) != 1:
+            raise ValueError(
+                f"{type(self).__name__} supports exactly one horizon per task; "
+                f"task '{task.task_id}' declares horizons={task.horizons}."
+            )
+
+        set_current_trace_name(self.predictor_id)
+        series_df, series_meta = get_history_and_meta(task, context)
+        if self.cfg.history_window is not None:
+            series_df = series_df.tail(self.cfg.history_window).reset_index(drop=True)
+
+        offset = pd.tseries.frequencies.to_offset(task.frequency)
+        horizon = task.horizons[0]
+        forecast_date = (pd.Timestamp(context.as_of) + offset * horizon).normalize()
+
+        history_str = serialize_history(series_df, precision=self.cfg.precision)
+
+        # Report context (before the task/history block): text preamble (CiK
+        # Format A) or native PDF parts, per cfg.report_ingestion.
+        report_docs = fetch_report_docs(config=self.cfg, context=context)
+
+        system_prompt = _build_system_prompt(
+            self.cfg.system_prompt_override, elicit_reasoning=self.cfg.elicit_reasoning
+        )
+        user_prompt = _build_user_prompt(
+            task,
+            history_str,
+            series_meta,
+            forecast_date,
+            series_description_override=self.cfg.series_description,
+            suffix=self.cfg.user_prompt_suffix,
+        )
+        user_content = apply_report_context(config=self.cfg, docs=report_docs, user_prompt=user_prompt)
+
+        parsed, cost_usd, in_tokens, out_tokens, parse_failures = _sample_probability(
+            cfg=self.cfg,
+            system_prompt=system_prompt,
+            user_prompt=user_content,
+        )
+
+        rationale = parsed.reasoning.strip()
+        issued_at = datetime.now(tz=timezone.utc).replace(tzinfo=None)
+        return [
+            Prediction(
+                predictor_id=self.predictor_id,
+                task_id=task.task_id,
+                issued_at=issued_at,
+                as_of=context.as_of,
+                forecast_date=forecast_date.to_pydatetime(),
+                payload=BinaryForecast(probability=float(parsed.probability)),
+                metadata=self._build_metadata(
+                    cost_usd=cost_usd,
+                    in_tokens=in_tokens,
+                    out_tokens=out_tokens,
+                    parse_failures=parse_failures,
+                    history_window=self.cfg.history_window,
+                    extra={
+                        **({"rationale": rationale} if rationale else {}),
+                        "n_report_docs": len(report_docs),
+                        **({"report_sources": self.cfg.report_sources} if self.cfg.report_sources else {}),
+                    },
+                ),
+            ),
+        ]
+
+
+__all__ = [
+    "BinaryProbabilityLLMPredictor",
+    "BinaryProbabilityLLMPredictorConfig",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__categorical_probability.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__categorical_probability.py.md
new file mode 100644
index 0000000..4a3088c
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__categorical_probability.py.md
@@ -0,0 +1,485 @@
+# Source: aieng-forecasting/aieng/forecasting/methods/llm_processes/categorical_probability.py
+
+kind: python
+
+```python
+"""CategoricalProbabilityLLMPredictor — direct categorical distribution elicitation.
+
+Asks an LLM for one calibrated probability per ordered category, via one
+structured completion per forecast origin. This is the categorical
+counterpart of
+:class:`~aieng.forecasting.methods.llm_processes.binary_probability.BinaryProbabilityLLMPredictor`:
+where the binary predictor elicits the single number describing a Bernoulli
+distribution, this class elicits the full probability vector over the task's
+ordered categories (e.g. cut < hold < hike), scored with RPS.
+
+The category order, labels, and series-value mapping all come from
+``task.categories`` — the predictor never invents its own label set. Observed
+history is serialized using category *labels* rather than raw series values so
+the LLM reasons over "cut/hold/hike" instead of "-1/0/1".
+
+LLMs frequently return distributions that sum to 0.99 or 1.01 (e.g. three
+"0.33" entries). Rather than failing on the payload validator's 1e-6 sum
+tolerance, responses within :data:`RENORMALIZATION_TOLERANCE` of 1 are
+renormalized (and the raw sum recorded in prediction metadata); responses
+further off than that are treated as malformed and raise.
+"""
+
+from __future__ import annotations
+
+from datetime import datetime, timezone
+from math import isclose
+from typing import TYPE_CHECKING, Any, ClassVar
+
+import pandas as pd
+from aieng.forecasting.evaluation.langfuse_traces import stamp_forecast_on_trace
+from aieng.forecasting.evaluation.prediction import CategoricalForecast, Prediction
+from aieng.forecasting.methods.llm_processes._client import (
+    langfuse_observe,
+    make_json_schema_response_format,
+    run_async,
+    sample_n_async,
+    set_current_trace_name,
+)
+from aieng.forecasting.methods.llm_processes.base import (
+    LLMPredictor,
+    LLMPredictorConfig,
+    apply_report_context,
+    fetch_report_docs,
+    get_history_and_meta,
+)
+from pydantic import BaseModel, ConfigDict, Field, field_validator
+
+
+if TYPE_CHECKING:
+    from aieng.forecasting.data.context import ForecastContext
+    from aieng.forecasting.data.models import SeriesMetadata
+    from aieng.forecasting.evaluation.task import ForecastingTask, TaskCategory
+
+
+#: Maximum allowed |sum - 1| before an elicited distribution is rejected
+#: instead of renormalized.
+RENORMALIZATION_TOLERANCE: float = 0.05
+
+
+class CategoricalProbabilityLLMPredictorConfig(LLMPredictorConfig):
+    """Frozen configuration for :class:`CategoricalProbabilityLLMPredictor`.
+
+    Adds only categorical-task prompt controls that preserve the
+    direct-distribution contract. The predictor makes one structured
+    completion per forecast origin and does not expose ``n_samples``.
+    """
+
+    model_config = ConfigDict(frozen=True)
+
+    history_window: int | None = Field(
+        default=None,
+        ge=1,
+        description="If set, only the last N cutoff-filtered observations are serialized into the prompt.",
+    )
+    series_description: str | None = Field(
+        default=None,
+        description="Optional replacement for the metadata-derived series description block.",
+    )
+    elicit_reasoning: bool = Field(
+        default=True,
+        description=(
+            "When True, ask the model for a short free-text 'reasoning' field alongside the "
+            "distribution, captured into Prediction.metadata['rationale'] for inspection and "
+            "downstream reasoning evaluation. The field is requested *after* the probabilities so "
+            "the model commits to the distribution first, keeping the answer-first ordering that "
+            "protects calibration. Set False to restore the bare distribution-only elicitation."
+        ),
+    )
+    system_prompt_override: str | None = Field(
+        default=None,
+        description="Full replacement for the built-in categorical-probability system prompt.",
+    )
+    user_prompt_suffix: str | None = Field(
+        default=None,
+        description=(
+            "Free-form text appended to the user prompt after the standard question. "
+            "Use-case recipes use this to inject domain context (covariate summaries, "
+            "report excerpts) without changing the elicitation contract."
+        ),
+    )
+
+
+class _CategoryProbability(BaseModel):
+    """One (label, probability) row of an elicited categorical distribution."""
+
+    label: str
+    probability: float = Field(ge=0.0, le=1.0)
+
+
+class _CategoricalDistribution(BaseModel):
+    """Internal Pydantic schema for one directly elicited distribution.
+
+    ``reasoning`` is optional so parsing succeeds whether or not the field was
+    requested (controlled by ``elicit_reasoning`` on the config).
+    """
+
+    probabilities: list[_CategoryProbability]
+    reasoning: str = Field(default="")
+
+    @field_validator("probabilities", mode="before")
+    @classmethod
+    def _coerce_mapping_to_rows(cls, value: Any) -> Any:
+        """Accept a ``{label: probability}`` mapping as well as the list form.
+
+        Despite the strict ``{label, probability}`` array schema, some models
+        (and some proxy routes) return the distribution as a JSON object
+        mapping label to probability, e.g. ``{"cut": 0.25, "hold": 0.7,
+        "hike": 0.05}``. Coerce that shape into the canonical list of rows so a
+        well-formed answer is not discarded as a parse failure.
+        """
+        if isinstance(value, dict):
+            return [{"label": label, "probability": probability} for label, probability in value.items()]
+        return value
+
+
+def _build_categorical_distribution_schema(elicit_reasoning: bool) -> dict[str, Any]:
+    """Build the strict ``json_schema`` for one elicited categorical distribution.
+
+    ``probabilities`` comes first so the model commits to the distribution
+    before any justification. When ``elicit_reasoning`` is True, a free-text
+    ``reasoning`` field is appended; strict mode with
+    ``additionalProperties: False`` requires every property in ``required``.
+    """
+    properties: dict[str, Any] = {
+        "probabilities": {
+            "type": "array",
+            "items": {
+                "type": "object",
+                "properties": {
+                    "label": {"type": "string"},
+                    "probability": {"type": "number", "minimum": 0.0, "maximum": 1.0},
+                },
+                "required": ["label", "probability"],
+                "additionalProperties": False,
+            },
+        },
+    }
+    required = ["probabilities"]
+    if elicit_reasoning:
+        properties["reasoning"] = {"type": "string"}
+        required.append("reasoning")
+    return {
+        "type": "object",
+        "properties": properties,
+        "required": required,
+        "additionalProperties": False,
+    }
+
+
+def serialize_categorical_history(df: pd.DataFrame, categories: list[TaskCategory]) -> str:
+    """Render a categorical series as one ``<date>: <label>`` line per row.
+
+    Mirrors the date-format heuristic of
+    :func:`~aieng.forecasting.methods.llm_processes.base.serialize_history`
+    but replaces raw series values with their task-declared category labels,
+    so the LLM sees ``2024-06-05: cut`` rather than ``2024-06-05: -1``.
+
+    Raises
+    ------
+    ValueError
+        If any observed value does not match a declared category value.
+    """
+    timestamps = [pd.Timestamp(ts) for ts in df["timestamp"]]
+    is_sub_monthly = any(ts.day != 1 for ts in timestamps)
+    fmt = "%Y-%m-%d" if is_sub_monthly else "%Y-%m"
+
+    lines: list[str] = []
+    for ts, value in zip(timestamps, df["value"]):
+        label = _label_for_value(float(value), categories)
+        if label is None:
+            allowed = [category.value for category in categories]
+            raise ValueError(
+                f"Observed value {float(value)} does not match any task category value. Allowed values: {allowed}."
+            )
+        lines.append(f"{ts.strftime(fmt)}: {label}")
+    return "\n".join(lines)
+
+
+def _label_for_value(value: float, categories: list[TaskCategory]) -> str | None:
+    """Return the label of the category whose series value matches ``value``."""
+    for category in categories:
+        if isclose(value, category.value, abs_tol=1e-9):
+            return category.label
+    return None
+
+
+def _build_system_prompt(override: str | None = None, *, elicit_reasoning: bool = False) -> str:
+    """Return the categorical-probability system prompt, or ``override`` verbatim."""
+    if override is not None:
+        return override
+    reasoning_rule = (
+        "- Decide the distribution first, then briefly justify it in plain text in the "
+        "'reasoning' field (a few sentences naming the key drivers).\n"
+        if elicit_reasoning
+        else ""
+    )
+    return (
+        "You are a probabilistic forecaster of categorical events. Given the history of "
+        "past outcomes and a question about a future event with a fixed set of ordered "
+        "possible outcomes, return one calibrated probability for each outcome.\n"
+        "\n"
+        "Rules:\n"
+        "- Return ONLY a JSON object matching the provided schema. No prose, no markdown.\n"
+        "- Include exactly one entry per listed outcome label, using the labels verbatim.\n"
+        "- Probabilities must be in [0, 1] and sum to 1.\n"
+        "- Report CALIBRATED probabilities, not your confidence in a point answer: across "
+        "many questions where you assign 0.7 to an outcome, that outcome should occur "
+        "about 70% of the time.\n"
+        "- Avoid 0.0 and 1.0 unless an outcome is logically impossible or certain.\n"
+        f"{reasoning_rule}"
+        "- Base rates matter: anchor on how often each outcome has occurred historically, "
+        "then adjust for the current situation."
+    )
+
+
+def _build_user_prompt(
+    task: ForecastingTask,
+    history_str: str,
+    series_meta: SeriesMetadata | None,
+    forecast_date: pd.Timestamp,
+    series_description_override: str | None = None,
+    suffix: str | None = None,
+) -> str:
+    """Build the categorical-probability user prompt."""
+    if series_description_override is not None:
+        meta_block = series_description_override
+    else:
+        meta_lines: list[str] = []
+        if series_meta is not None:
+            meta_lines.append(f"Outcome series: {series_meta.description} (source: {series_meta.source})")
+            meta_lines.append(f"Units: {series_meta.units}")
+        else:
+            meta_lines.append(f"Outcome series: {task.target_series_id}")
+        meta_block = "\n".join(meta_lines)
+
+    categories = task.categories or []
+    labels_ordered = " < ".join(category.label for category in categories)
+    labels_json = ", ".join(f"'{category.label}'" for category in categories)
+
+    base = (
+        f"Question: {task.description}\n"
+        "\n"
+        f"{meta_block}\n"
+        "\n"
+        f"Possible outcomes, in order: {labels_ordered}\n"
+        "\n"
+        "History of past outcomes:\n"
+        f"{history_str}\n"
+        "\n"
+        f"The event resolves on {forecast_date.strftime('%Y-%m-%d')}.\n"
+        "Return a JSON object with a 'probabilities' array containing exactly one "
+        f"{{label, probability}} entry for each of: {labels_json}. "
+        "The probabilities must sum to 1."
+    )
+    if suffix:
+        base = f"{base}\n\n{suffix.lstrip(chr(10))}"
+    return base
+
+
+def _sample_distribution(
+    *,
+    cfg: CategoricalProbabilityLLMPredictorConfig,
+    system_prompt: str,
+    user_prompt: str | list[dict[str, Any]],
+) -> tuple[_CategoricalDistribution, float, int, int, int]:
+    """Issue one structured completion; return the distribution, cost, token counts, and parse failures."""
+    base_messages: list[dict[str, Any]] = [
+        {"role": "system", "content": system_prompt},
+        {"role": "user", "content": user_prompt},
+    ]
+    response_format = make_json_schema_response_format(
+        "CategoricalDistribution", _build_categorical_distribution_schema(cfg.elicit_reasoning)
+    )
+
+    parsed, cost_usd, in_tokens, out_tokens, parse_failures = run_async(
+        sample_n_async(
+            schema_cls=_CategoricalDistribution,
+            model=cfg.model,
+            base_messages=base_messages,
+            response_format=response_format,
+            n_samples=1,
+            temperature=cfg.temperature,
+            max_tokens=cfg.max_tokens,
+            timeout_s=cfg.timeout_s,
+            reasoning_effort=cfg.reasoning_effort,
+            api_base=cfg.openai_base_url,
+            api_key=cfg.openai_api_key,
+        ),
+    )
+    if not parsed:
+        raise RuntimeError("No valid categorical-distribution response returned by LLM.")
+    return parsed[0], cost_usd, in_tokens, out_tokens, parse_failures
+
+
+def _align_and_normalize(
+    parsed: _CategoricalDistribution,
+    categories: list[TaskCategory],
+) -> tuple[dict[str, float], float]:
+    """Validate the elicited rows against the task labels and normalize the sum.
+
+    Returns the label-aligned probability dict and the raw (pre-normalization)
+    probability sum.
+
+    Raises
+    ------
+    RuntimeError
+        If the response labels do not exactly match the task labels, a label
+        is duplicated, or the probabilities sum outside
+        ``1 +/- RENORMALIZATION_TOLERANCE``.
+    """
+    by_label: dict[str, float] = {}
+    duplicates: list[str] = []
+    for row in parsed.probabilities:
+        if row.label in by_label:
+            duplicates.append(row.label)
+        by_label[row.label] = row.probability
+    if duplicates:
+        raise RuntimeError(f"LLM returned duplicate category labels: {sorted(set(duplicates))}.")
+
+    expected = {category.label for category in categories}
+    actual = set(by_label)
+    if actual != expected:
+        missing = sorted(expected - actual)
+        extra = sorted(actual - expected)
+        raise RuntimeError(
+            f"LLM response labels must exactly match the task categories. Missing: {missing}; extra: {extra}."
+        )
+
+    raw_sum = sum(by_label.values())
+    if abs(raw_sum - 1.0) > RENORMALIZATION_TOLERANCE or raw_sum <= 0.0:
+        raise RuntimeError(
+            f"LLM probabilities sum to {raw_sum}, outside the renormalization tolerance "
+            f"of 1 +/- {RENORMALIZATION_TOLERANCE}."
+        )
+
+    probabilities = {category.label: by_label[category.label] / raw_sum for category in categories}
+    return probabilities, raw_sum
+
+
+class CategoricalProbabilityLLMPredictor(LLMPredictor):
+    """Ordered-categorical LLM forecaster using direct distribution elicitation."""
+
+    _method_tag: ClassVar[str] = "llmp_categorical_probability"
+
+    cfg: CategoricalProbabilityLLMPredictorConfig
+
+    def __init__(self, cfg: CategoricalProbabilityLLMPredictorConfig | None = None) -> None:
+        super().__init__(cfg)
+
+    @classmethod
+    def _default_config(cls) -> CategoricalProbabilityLLMPredictorConfig:
+        return CategoricalProbabilityLLMPredictorConfig()
+
+    @langfuse_observe("CategoricalProbabilityLLMPredictor.predict")
+    def predict(
+        self,
+        task: ForecastingTask,
+        context: ForecastContext,
+    ) -> list[Prediction]:
+        """Produce one CategoricalForecast prediction from an elicited distribution.
+
+        Raises
+        ------
+        ValueError
+            If the task does not declare ``payload_type='categorical'``, defines
+            no categories, or does not request exactly one horizon (one
+            distribution maps to exactly one resolution date).
+        RuntimeError
+            If the LLM response labels do not match the task categories or the
+            probabilities sum outside ``1 +/- RENORMALIZATION_TOLERANCE``.
+        """
+        if task.payload_type != "categorical":
+            raise ValueError(
+                f"{type(self).__name__} requires a categorical task (payload_type='categorical'); "
+                f"task '{task.task_id}' declares payload_type='{task.payload_type}'."
+            )
+        if task.categories is None:
+            raise ValueError(f"Categorical task '{task.task_id}' must define categories.")
+        if len(task.horizons) != 1:
+            raise ValueError(
+                f"{type(self).__name__} supports exactly one horizon per task; "
+                f"task '{task.task_id}' declares horizons={task.horizons}."
+            )
+
+        set_current_trace_name(self.predictor_id)
+        series_df, series_meta = get_history_and_meta(task, context)
+        if self.cfg.history_window is not None:
+            series_df = series_df.tail(self.cfg.history_window).reset_index(drop=True)
+
+        offset = pd.tseries.frequencies.to_offset(task.frequency)
+        horizon = task.horizons[0]
+        forecast_date = (pd.Timestamp(context.as_of) + offset * horizon).normalize()
+
+        history_str = serialize_categorical_history(series_df, task.categories)
+
+        # Report context (before the task/history block): text preamble (CiK
+        # Format A) or native PDF parts, per cfg.report_ingestion.
+        report_docs = fetch_report_docs(config=self.cfg, context=context)
+
+        system_prompt = _build_system_prompt(
+            self.cfg.system_prompt_override, elicit_reasoning=self.cfg.elicit_reasoning
+        )
+        user_prompt = _build_user_prompt(
+            task,
+            history_str,
+            series_meta,
+            forecast_date,
+            series_description_override=self.cfg.series_description,
+            suffix=self.cfg.user_prompt_suffix,
+        )
+        user_content = apply_report_context(config=self.cfg, docs=report_docs, user_prompt=user_prompt)
+
+        parsed, cost_usd, in_tokens, out_tokens, parse_failures = _sample_distribution(
+            cfg=self.cfg,
+            system_prompt=system_prompt,
+            user_prompt=user_content,
+        )
+        probabilities, raw_sum = _align_and_normalize(parsed, task.categories)
+
+        rationale = parsed.reasoning.strip()
+        metadata = self._build_metadata(
+            cost_usd=cost_usd,
+            in_tokens=in_tokens,
+            out_tokens=out_tokens,
+            parse_failures=parse_failures,
+            history_window=self.cfg.history_window,
+            extra={
+                **({"rationale": rationale} if rationale else {}),
+                "n_report_docs": len(report_docs),
+                **({"report_sources": self.cfg.report_sources} if self.cfg.report_sources else {}),
+            },
+        )
+        if not isclose(raw_sum, 1.0, abs_tol=1e-9):
+            metadata["probability_sum_raw"] = raw_sum
+
+        issued_at = datetime.now(tz=timezone.utc).replace(tzinfo=None)
+        predictions = [
+            Prediction(
+                predictor_id=self.predictor_id,
+                task_id=task.task_id,
+                issued_at=issued_at,
+                as_of=context.as_of,
+                forecast_date=forecast_date.to_pydatetime(),
+                payload=CategoricalForecast(probabilities=probabilities),
+                metadata=metadata,
+            ),
+        ]
+        # Make the trace the canonical record for rationale evaluation: stamp the
+        # structured forecast onto the active trace so a trace evaluator reads the
+        # rationale + distribution straight from Langfuse, not from a cached run.
+        stamp_forecast_on_trace(predictions)
+        return predictions
+
+
+__all__ = [
+    "CategoricalProbabilityLLMPredictor",
+    "CategoricalProbabilityLLMPredictorConfig",
+    "serialize_categorical_history",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__point_intervals.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__point_intervals.py.md
new file mode 100644
index 0000000..d36030f
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__point_intervals.py.md
@@ -0,0 +1,31 @@
+# Source: aieng-forecasting/aieng/forecasting/methods/llm_processes/point_intervals.py
+
+kind: python
+
+```python
+"""Design placeholder for point-plus-interval LLM forecasting.
+
+This module intentionally exports no predictor yet. The candidate contract is a
+single structured LLM response containing a central path plus interval endpoints
+for each horizon, for example ``q10``, ``q50``, and ``q90``. That is
+substantially more token-efficient than eliciting the full standard quantile
+grid, while preserving a continuous forecast with uncertainty.
+
+Trade-offs to resolve before implementation:
+
+- A point-plus-interval response is mathematically a sparse quantile grid. If
+  we only need configurable quantile density, this may belong as a quantile-set
+  option on ``QuantileGridLLMPredictorConfig`` rather than a separate method.
+- Sparse intervals require interpolation or explicit downstream support before
+  they can satisfy the current ``ContinuousForecast`` standard-quantile
+  contract.
+- The smaller schema may work better with larger reasoning-capable models, but
+  it gives the model fewer opportunities to express tail shape than the full
+  quantile-grid method.
+
+Keep this as a design note until we have calibration results showing that the
+compact interval contract is worth a distinct implementation surface.
+"""
+
+__all__: list[str] = []
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__quantile_grid.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__quantile_grid.py.md
new file mode 100644
index 0000000..0e1d2c0
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__quantile_grid.py.md
@@ -0,0 +1,351 @@
+# Source: aieng-forecasting/aieng/forecasting/methods/llm_processes/quantile_grid.py
+
+kind: python
+
+```python
+"""QuantileGridLLMPredictor — one-shot quantile forecaster.
+
+Asks an LLM for the full standard quantile grid in a single structured
+completion, then converts the returned grid into one :class:`Prediction` per
+requested horizon. This is a sibling elicitation strategy to
+:class:`~aieng.forecasting.methods.llm_processes.sampled_trajectory.SampledTrajectoryLLMPredictor`:
+continuous sampled trajectories estimate quantiles empirically; this class
+elicits the quantiles directly.
+"""
+
+from __future__ import annotations
+
+from datetime import datetime, timezone
+from typing import TYPE_CHECKING, Any, ClassVar
+
+import numpy as np
+import pandas as pd
+from aieng.forecasting.evaluation.prediction import (
+    STANDARD_QUANTILES,
+    ContinuousForecast,
+    Prediction,
+)
+from aieng.forecasting.methods.llm_processes._client import (
+    langfuse_observe,
+    make_json_schema_response_format,
+    run_async,
+    sample_n_async,
+    set_current_trace_name,
+)
+from aieng.forecasting.methods.llm_processes.base import (
+    LLMPredictor,
+    LLMPredictorConfig,
+    apply_report_context,
+    fetch_report_docs,
+    get_history_and_meta,
+    serialize_history,
+)
+from pydantic import BaseModel, ConfigDict, Field
+
+
+if TYPE_CHECKING:
+    from aieng.forecasting.data.context import ForecastContext
+    from aieng.forecasting.data.models import SeriesMetadata
+    from aieng.forecasting.evaluation.task import ForecastingTask
+
+
+class QuantileGridLLMPredictorConfig(LLMPredictorConfig):
+    """Frozen configuration for :class:`QuantileGridLLMPredictor`.
+
+    Quantile levels are fixed to :data:`STANDARD_QUANTILES` and not exposed.
+    This method makes one structured completion per forecast origin; it does
+    not expose ``n_samples`` because it does not aggregate sampled trajectories.
+    """
+
+    model_config = ConfigDict(frozen=True)
+
+    precision: int = Field(default=2, ge=0, le=10, description="Decimal places used when serializing values.")
+    history_window: int | None = Field(
+        default=None,
+        ge=1,
+        description="If set, only the last N cutoff-filtered observations are serialized into the prompt.",
+    )
+    series_description: str | None = Field(
+        default=None,
+        description="Optional replacement for the metadata-derived series description block.",
+    )
+    system_prompt_override: str | None = Field(
+        default=None,
+        description="Full replacement for the built-in quantile-grid system prompt.",
+    )
+    user_prompt_suffix: str | None = Field(
+        default=None,
+        description="Free-form text appended to the user prompt after the standard forecast instruction.",
+    )
+
+
+class _QuantileStep(BaseModel):
+    """Flat standard-quantile fields for one forecast step."""
+
+    q05: float
+    q10: float
+    q20: float
+    q30: float
+    q40: float
+    q50: float
+    q60: float
+    q70: float
+    q80: float
+    q90: float
+    q95: float
+
+
+class _QuantileTrajectory(BaseModel):
+    """Internal Pydantic schema for one directly elicited quantile trajectory."""
+
+    forecasts: list[_QuantileStep]
+
+
+_STEP_PROPERTIES: dict[str, dict[str, str]] = {
+    "q05": {"type": "number"},
+    "q10": {"type": "number"},
+    "q20": {"type": "number"},
+    "q30": {"type": "number"},
+    "q40": {"type": "number"},
+    "q50": {"type": "number"},
+    "q60": {"type": "number"},
+    "q70": {"type": "number"},
+    "q80": {"type": "number"},
+    "q90": {"type": "number"},
+    "q95": {"type": "number"},
+}
+
+_QUANTILE_TRAJECTORY_JSON_SCHEMA: dict[str, Any] = {
+    "type": "object",
+    "properties": {
+        "forecasts": {
+            "type": "array",
+            "items": {
+                "type": "object",
+                "properties": _STEP_PROPERTIES,
+                "required": list(_STEP_PROPERTIES),
+                "additionalProperties": False,
+            },
+        },
+    },
+    "required": ["forecasts"],
+    "additionalProperties": False,
+}
+
+_FIELD_BY_QUANTILE: dict[float, str] = {
+    0.05: "q05",
+    0.10: "q10",
+    0.20: "q20",
+    0.30: "q30",
+    0.40: "q40",
+    0.50: "q50",
+    0.60: "q60",
+    0.70: "q70",
+    0.80: "q80",
+    0.90: "q90",
+    0.95: "q95",
+}
+
+
+def _build_system_prompt(override: str | None = None) -> str:
+    """Return the quantile-grid system prompt, or ``override`` verbatim."""
+    if override is not None:
+        return override
+    return (
+        "You are a probabilistic time-series forecaster. Given a historical series and a "
+        "task description, return calibrated predictive quantiles for every requested "
+        "forecast step.\n"
+        "\n"
+        "Rules:\n"
+        "- Return ONLY a JSON object matching the provided schema. No prose, no markdown.\n"
+        "- The 'forecasts' array MUST have exactly the requested number of elements, one "
+        "per forecast step in chronological order.\n"
+        "- Each forecast object MUST contain q05, q10, q20, q30, q40, q50, q60, q70, "
+        "q80, q90, and q95.\n"
+        "- Quantiles should be in the same units as the input series.\n"
+        "- Quantiles should be monotone non-decreasing within each forecast step."
+    )
+
+
+def _build_user_prompt(
+    task: ForecastingTask,
+    history_str: str,
+    series_meta: SeriesMetadata | None,
+    forecast_start: pd.Timestamp,
+    forecast_end: pd.Timestamp,
+    n_steps: int,
+    series_description_override: str | None = None,
+    suffix: str | None = None,
+) -> str:
+    """Build the quantile-grid user prompt."""
+    if series_description_override is not None:
+        meta_block = series_description_override
+    else:
+        meta_lines: list[str] = []
+        if series_meta is not None:
+            meta_lines.append(f"Series: {series_meta.description} (source: {series_meta.source})")
+            meta_lines.append(f"Units: {series_meta.units}")
+        else:
+            meta_lines.append(f"Series: {task.target_series_id}")
+        meta_lines.append(f"Frequency: {task.frequency}")
+        meta_block = "\n".join(meta_lines)
+
+    base = (
+        f"Task: {task.description}\n"
+        "\n"
+        f"{meta_block}\n"
+        "\n"
+        "History:\n"
+        f"{history_str}\n"
+        "\n"
+        f"Forecast the next {n_steps} {task.frequency} values "
+        f"({forecast_start.strftime('%Y-%m-%d')} through {forecast_end.strftime('%Y-%m-%d')}).\n"
+        "Return a JSON object with a 'forecasts' array of length "
+        f"{n_steps}; each item contains the standard quantile fields q05 through q95."
+    )
+    if suffix:
+        base = f"{base}\n\n{suffix.lstrip(chr(10))}"
+    return base
+
+
+def _quantile_grid_from_response(response: _QuantileTrajectory, n_steps: int) -> np.ndarray:
+    """Convert a parsed LLM response into a monotone quantile grid."""
+    if len(response.forecasts) != n_steps:
+        raise RuntimeError(
+            f"Quantile-grid response had {len(response.forecasts)} forecast steps; expected {n_steps}.",
+        )
+    rows = [[float(getattr(step, _FIELD_BY_QUANTILE[q])) for q in STANDARD_QUANTILES] for step in response.forecasts]
+    q_grid = np.asarray(rows, dtype=float)
+    q_grid.sort(axis=1)  # Repair quantile crossings: each step's row becomes non-decreasing.
+    return q_grid
+
+
+def _sample_quantile_grid(
+    *,
+    cfg: QuantileGridLLMPredictorConfig,
+    system_prompt: str,
+    user_prompt: str | list[dict[str, Any]],
+) -> tuple[_QuantileTrajectory, float, int, int, int]:
+    """Issue one structured completion; return the trajectory, cost, token counts, and parse failures."""
+    base_messages: list[dict[str, Any]] = [
+        {"role": "system", "content": system_prompt},
+        {"role": "user", "content": user_prompt},
+    ]
+    response_format = make_json_schema_response_format("QuantileTrajectory", _QUANTILE_TRAJECTORY_JSON_SCHEMA)
+
+    parsed, cost_usd, in_tokens, out_tokens, parse_failures = run_async(
+        sample_n_async(
+            schema_cls=_QuantileTrajectory,
+            model=cfg.model,
+            base_messages=base_messages,
+            response_format=response_format,
+            n_samples=1,
+            temperature=cfg.temperature,
+            max_tokens=cfg.max_tokens,
+            timeout_s=cfg.timeout_s,
+            reasoning_effort=cfg.reasoning_effort,
+            api_base=cfg.openai_base_url,
+            api_key=cfg.openai_api_key,
+        ),
+    )
+    if not parsed:
+        raise RuntimeError("No valid quantile-grid response returned by LLM.")
+    return parsed[0], cost_usd, in_tokens, out_tokens, parse_failures
+
+
+class QuantileGridLLMPredictor(LLMPredictor):
+    """Continuous-target LLM forecaster using quantile-grid elicitation."""
+
+    _method_tag: ClassVar[str] = "llmp_quantile_grid"
+
+    cfg: QuantileGridLLMPredictorConfig
+
+    def __init__(self, cfg: QuantileGridLLMPredictorConfig | None = None) -> None:
+        super().__init__(cfg)
+
+    @classmethod
+    def _default_config(cls) -> QuantileGridLLMPredictorConfig:
+        return QuantileGridLLMPredictorConfig()
+
+    @langfuse_observe("QuantileGridLLMPredictor.predict")
+    def predict(
+        self,
+        task: ForecastingTask,
+        context: ForecastContext,
+    ) -> list[Prediction]:
+        """Produce one ContinuousForecast per requested horizon from directly elicited quantiles."""
+        set_current_trace_name(self.predictor_id)
+        series_df, series_meta = get_history_and_meta(task, context)
+        if self.cfg.history_window is not None:
+            series_df = series_df.tail(self.cfg.history_window).reset_index(drop=True)
+
+        offset = pd.tseries.frequencies.to_offset(task.frequency)
+        n_steps = task.horizon
+        forecast_start = (pd.Timestamp(context.as_of) + offset * 1).normalize()
+        forecast_end = (pd.Timestamp(context.as_of) + offset * n_steps).normalize()
+
+        history_str = serialize_history(series_df, precision=self.cfg.precision)
+
+        # Report context (before the task/history block): text preamble (CiK
+        # Format A) or native PDF parts, per cfg.report_ingestion.
+        report_docs = fetch_report_docs(config=self.cfg, context=context)
+
+        system_prompt = _build_system_prompt(self.cfg.system_prompt_override)
+        user_prompt = _build_user_prompt(
+            task,
+            history_str,
+            series_meta,
+            forecast_start,
+            forecast_end,
+            n_steps,
+            series_description_override=self.cfg.series_description,
+            suffix=self.cfg.user_prompt_suffix,
+        )
+        user_content = apply_report_context(config=self.cfg, docs=report_docs, user_prompt=user_prompt)
+
+        parsed, cost_usd, in_tokens, out_tokens, parse_failures = _sample_quantile_grid(
+            cfg=self.cfg,
+            system_prompt=system_prompt,
+            user_prompt=user_content,
+        )
+        q_grid = _quantile_grid_from_response(parsed, n_steps=n_steps)
+
+        issued_at = datetime.now(tz=timezone.utc).replace(tzinfo=None)
+        median_idx = STANDARD_QUANTILES.index(0.50)
+        predictions: list[Prediction] = []
+        for h in task.horizons:
+            row = q_grid[h - 1]
+            quantiles = {q: float(row[i]) for i, q in enumerate(STANDARD_QUANTILES)}
+            payload = ContinuousForecast(
+                point_forecast=float(row[median_idx]),
+                quantiles=quantiles,
+            )
+            predictions.append(
+                Prediction(
+                    predictor_id=self.predictor_id,
+                    task_id=task.task_id,
+                    issued_at=issued_at,
+                    as_of=context.as_of,
+                    forecast_date=(pd.Timestamp(context.as_of) + offset * h).to_pydatetime(),
+                    payload=payload,
+                    metadata=self._build_metadata(
+                        cost_usd=cost_usd,
+                        in_tokens=in_tokens,
+                        out_tokens=out_tokens,
+                        parse_failures=parse_failures,
+                        history_window=self.cfg.history_window,
+                        extra={
+                            "n_report_docs": len(report_docs),
+                            **({"report_sources": self.cfg.report_sources} if self.cfg.report_sources else {}),
+                        },
+                    ),
+                ),
+            )
+        return predictions
+
+
+__all__ = [
+    "QuantileGridLLMPredictor",
+    "QuantileGridLLMPredictorConfig",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__sampled_trajectory.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__sampled_trajectory.py.md
new file mode 100644
index 0000000..b560e35
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__sampled_trajectory.py.md
@@ -0,0 +1,440 @@
+# Source: aieng-forecasting/aieng/forecasting/methods/llm_processes/sampled_trajectory.py
+
+kind: python
+
+```python
+"""SampledTrajectoryLLMPredictor — sample-based quantile forecaster.
+
+Asks an LLM for ``N`` numerical trajectories spanning ``max(task.horizons)``
+steps, stacks them, and computes per-step empirical quantiles at
+:data:`STANDARD_QUANTILES`.  One :class:`Prediction` is returned per horizon
+step in ``task.horizons``.
+
+This is the Gruver / Context-is-Key "Direct Prompt" path: no chain-of-thought
+and no logprob density.  Optional **covariates** are supported — set
+``covariate_series_ids`` to serialize labeled exogenous-series history into the
+prompt (Context-is-Key §5.4).  Method variants from the literature
+(``LLMProcessPredictor`` for Requeima A-LLMP, logprob-density variants,
+conformal wrappers) belong as sibling classes in this package, not as
+configurations of this class.
+
+Usage::
+
+    from aieng.forecasting.methods import (
+        SampledTrajectoryLLMPredictor,
+        SampledTrajectoryLLMPredictorConfig,
+    )
+
+    predictor = SampledTrajectoryLLMPredictor(
+        SampledTrajectoryLLMPredictorConfig(model="gemini-3.1-flash-lite-preview"),
+    )
+"""
+
+from __future__ import annotations
+
+import logging
+from datetime import datetime, timezone
+from typing import TYPE_CHECKING, Any, ClassVar
+
+import numpy as np
+import pandas as pd
+from aieng.forecasting.evaluation.prediction import (
+    STANDARD_QUANTILES,
+    ContinuousForecast,
+    Prediction,
+)
+from aieng.forecasting.methods.llm_processes._client import (
+    langfuse_observe,
+    make_json_schema_response_format,
+    run_async,
+    sample_n_async,
+    set_current_trace_name,
+)
+from aieng.forecasting.methods.llm_processes.base import (
+    LLMPredictor,
+    LLMPredictorConfig,
+    apply_report_context,
+    build_covariate_block,
+    fetch_report_docs,
+    get_history_and_meta,
+    serialize_history,
+)
+from pydantic import BaseModel, ConfigDict, Field
+
+
+if TYPE_CHECKING:
+    from aieng.forecasting.data.context import ForecastContext
+    from aieng.forecasting.data.models import SeriesMetadata
+    from aieng.forecasting.evaluation.task import ForecastingTask
+
+
+logger = logging.getLogger(__name__)
+
+
+class SampledTrajectoryLLMPredictorConfig(LLMPredictorConfig):
+    """Frozen configuration for :class:`SampledTrajectoryLLMPredictor`.
+
+    Quantile levels are fixed to :data:`STANDARD_QUANTILES` and not exposed.
+
+    The string overrides (``series_description``, ``system_prompt_override``,
+    ``user_prompt_suffix``) plus ``history_window`` are the degrees of freedom
+    intended for use-case recipes under ``implementations/<use-case>/predictors/``
+    — they reshape what the LLM sees without changing the predictor's
+    statistical contract (sampled trajectories → empirical quantiles).
+    """
+
+    model_config = ConfigDict(frozen=True)
+
+    n_samples: int = Field(default=20, ge=1, description="Number of trajectory samples per forecast origin.")
+    precision: int = Field(default=2, ge=0, le=10, description="Decimal places used when serializing values.")
+    history_window: int | None = Field(
+        default=None,
+        ge=1,
+        description=(
+            "If set, only the last ``history_window`` cutoff-filtered observations "
+            "are serialized into the prompt. ``None`` (default) sends the full "
+            "available history. Useful for keeping prompts short on larger models "
+            "(Sonnet, Gemini Pro) where per-call cost dominates."
+        ),
+    )
+    series_description: str | None = Field(
+        default=None,
+        description=(
+            "Optional override for the metadata-derived series description block. "
+            "When set, replaces the ``Series: ... / Units: ... / Frequency: ...`` "
+            "lines in the user prompt. Use to inject task-specific economic or "
+            "domain framing that the bare adapter metadata does not capture."
+        ),
+    )
+    system_prompt_override: str | None = Field(
+        default=None,
+        description=(
+            "Full replacement for the built-in system prompt. ``None`` (default) "
+            "uses the calibration-tuned base prompt. Recipes that change the "
+            "output contract or impose domain rules should set this."
+        ),
+    )
+    user_prompt_suffix: str | None = Field(
+        default=None,
+        description=(
+            "Free-form text appended to the user prompt after the standard "
+            "task / history / forecast-window blocks. Use for recipe-specific "
+            "hints (non-negativity, plausible-range anchors, known events) "
+            "without rewriting the system prompt."
+        ),
+    )
+    covariate_series_ids: list[str] | None = Field(
+        default=None,
+        description=(
+            "Optional list of registered covariate series ids to serialize into "
+            "the prompt as labeled, cutoff-safe history blocks (Context-is-Key "
+            "§5.4 style), letting the model condition on exogenous series. Each "
+            "is fetched via ``context.get_series`` and truncated to "
+            "``history_window``. ``None`` (default) is target-only. Set a "
+            "distinct ``variant_tag`` to keep covariate vs target-only runs "
+            "separate on leaderboards and in artifact storage."
+        ),
+    )
+
+
+class _Trajectory(BaseModel):
+    """Internal Pydantic schema for one numerical trajectory."""
+
+    values: list[float]
+
+
+_TRAJECTORY_JSON_SCHEMA: dict[str, Any] = {
+    "type": "object",
+    "properties": {
+        "values": {"type": "array", "items": {"type": "number"}},
+    },
+    "required": ["values"],
+    "additionalProperties": False,
+}
+
+
+def _build_system_prompt(override: str | None = None) -> str:
+    """Stable, cacheable system prompt carrying the output contract and rules.
+
+    When ``override`` is provided, it replaces the built-in prompt verbatim.
+    Recipes pass the override through
+    :attr:`SampledTrajectoryLLMPredictorConfig.system_prompt_override`.
+    """
+    if override is not None:
+        return override
+    return (
+        "You are a probabilistic time-series forecaster. Given a historical series and a "
+        "task description, return a single numerical trajectory covering the requested "
+        "forecast window.\n"
+        "\n"
+        "Rules:\n"
+        "- Return ONLY a JSON object matching the provided schema. No prose, no markdown, "
+        "no chain-of-thought reasoning.\n"
+        "- The 'values' array MUST have exactly the requested number of elements, one per "
+        "forecast step in chronological order.\n"
+        "- Use the same units and the same number of decimal places as the input series.\n"
+        "- Account for trend and seasonality implicitly. Do not emit reasoning tokens.\n"
+        "- Respect any constraints stated in the task description (non-negativity, domain "
+        "bounds, known future events)."
+    )
+
+
+def _build_user_prompt(
+    task: ForecastingTask,
+    history_str: str,
+    series_meta: SeriesMetadata | None,
+    forecast_start: pd.Timestamp,
+    forecast_end: pd.Timestamp,
+    n_steps: int,
+    series_description_override: str | None = None,
+    suffix: str | None = None,
+    covariate_block: str = "",
+) -> str:
+    """Task description + series metadata + history + explicit forecast window.
+
+    ``series_description_override`` replaces the metadata-derived series block;
+    ``covariate_block`` (when non-empty) is inserted as labeled exogenous-series
+    context between the target history and the forecast instruction; ``suffix``
+    is appended verbatim at the end of the prompt. All are surfaced to recipes
+    via :class:`SampledTrajectoryLLMPredictorConfig`.
+    """
+    if series_description_override is not None:
+        meta_block = series_description_override
+    else:
+        meta_lines: list[str] = []
+        if series_meta is not None:
+            meta_lines.append(f"Series: {series_meta.description} (source: {series_meta.source})")
+            meta_lines.append(f"Units: {series_meta.units}")
+        else:
+            meta_lines.append(f"Series: {task.target_series_id}")
+        meta_lines.append(f"Frequency: {task.frequency}")
+        meta_block = "\n".join(meta_lines)
+
+    covariate_section = f"\n{covariate_block}\n" if covariate_block else ""
+    base = (
+        f"Task: {task.description}\n"
+        "\n" + meta_block + "\n"
+        "\n"
+        "History:\n"
+        f"{history_str}\n"
+        f"{covariate_section}"
+        "\n"
+        f"Forecast the next {n_steps} {task.frequency} values "
+        f"({forecast_start.strftime('%Y-%m-%d')} through {forecast_end.strftime('%Y-%m-%d')}).\n"
+        f"Return a JSON object with a single 'values' array of length {n_steps}."
+    )
+    if suffix:
+        base = f"{base}\n\n{suffix.lstrip(chr(10))}"
+    return base
+
+
+def _stack_trajectories(trajectories: list[list[float]], n_steps: int) -> np.ndarray:
+    """Stack ``N`` length-``n_steps`` trajectories into ``(N, n_steps)``.
+
+    Wrong-length trajectories are dropped with a warning; at least one valid
+    trajectory must remain.
+    """
+    valid = [np.asarray(t, dtype=float) for t in trajectories if len(t) == n_steps]
+    dropped = len(trajectories) - len(valid)
+    if dropped:
+        logger.warning("Dropped %d/%d trajectories with wrong length", dropped, len(trajectories))
+    if not valid:
+        raise RuntimeError(
+            f"No valid trajectories from LLM ({len(trajectories)} parsed, none of length {n_steps}).",
+        )
+    return np.vstack(valid)
+
+
+def _quantiles_per_step(samples: np.ndarray) -> np.ndarray:
+    """Compute :data:`STANDARD_QUANTILES` per column, sort each row monotone.
+
+    Parameters
+    ----------
+    samples : np.ndarray
+        Shape ``(N, n_steps)``.
+
+    Returns
+    -------
+    np.ndarray
+        Shape ``(n_steps, len(STANDARD_QUANTILES))``, monotone non-decreasing
+        per row.
+    """
+    q = np.quantile(samples, STANDARD_QUANTILES, axis=0).T
+    q.sort(axis=1)
+    return np.asarray(q)
+
+
+def _sample_trajectories(
+    *,
+    cfg: SampledTrajectoryLLMPredictorConfig,
+    system_prompt: str,
+    user_prompt: str | list[dict[str, Any]],
+) -> tuple[list[_Trajectory], float, int, int, int]:
+    """Issue ``cfg.n_samples`` parallel completions and return parsed trajectories.
+
+    Returns ``(parsed, total_cost_usd, total_input_tokens, total_output_tokens,
+    parse_failures)``.
+    """
+    base_messages: list[dict[str, Any]] = [
+        {"role": "system", "content": system_prompt},
+        {"role": "user", "content": user_prompt},
+    ]
+    response_format = make_json_schema_response_format("Trajectory", _TRAJECTORY_JSON_SCHEMA)
+
+    result: tuple[list[_Trajectory], float, int, int, int] = run_async(
+        sample_n_async(
+            schema_cls=_Trajectory,
+            model=cfg.model,
+            base_messages=base_messages,
+            response_format=response_format,
+            n_samples=cfg.n_samples,
+            temperature=cfg.temperature,
+            max_tokens=cfg.max_tokens,
+            timeout_s=cfg.timeout_s,
+            reasoning_effort=cfg.reasoning_effort,
+            api_base=cfg.openai_base_url,
+            api_key=cfg.openai_api_key,
+        ),
+    )
+    return result
+
+
+class SampledTrajectoryLLMPredictor(LLMPredictor):
+    """Continuous-modality LLM forecaster (sample-based empirical quantiles).
+
+    Issues ``cfg.n_samples`` completion calls in parallel via
+    ``asyncio.gather``, each returning a numerical trajectory of length
+    ``max(task.horizons)``.  Per-step empirical quantiles are computed across
+    samples and sorted for monotonicity.  Returns one :class:`Prediction` per
+    horizon step in ``task.horizons``.
+
+    Notes
+    -----
+    - Each sampled call appends a per-draw disambiguator to the user message
+      so LiteLLM's disk cache yields distinct entries per sample.
+    - Covariates are optional (``covariate_series_ids``) and off by default.
+      Chain-of-thought is also off: ``reasoning_effort`` defaults to
+      ``"disable"``, per the calibration evidence.
+    """
+
+    _method_tag: ClassVar[str] = "llmp_sampled_trajectories"
+
+    cfg: SampledTrajectoryLLMPredictorConfig  # type narrowing for static checkers
+
+    def __init__(self, cfg: SampledTrajectoryLLMPredictorConfig | None = None) -> None:
+        super().__init__(cfg)
+
+    @classmethod
+    def _default_config(cls) -> SampledTrajectoryLLMPredictorConfig:
+        return SampledTrajectoryLLMPredictorConfig()
+
+    @langfuse_observe("SampledTrajectoryLLMPredictor.predict")
+    def predict(
+        self,
+        task: ForecastingTask,
+        context: ForecastContext,
+    ) -> list[Prediction]:
+        """Produce per-horizon probabilistic forecasts.
+
+        Parameters
+        ----------
+        task : ForecastingTask
+            Defines the target series, horizons, and frequency.
+        context : ForecastContext
+            Cutoff-scoped data view.  All series returned respect
+            ``context.as_of``.
+
+        Returns
+        -------
+        list[Prediction]
+            One :class:`Prediction` per horizon step in ``task.horizons``,
+            with ``point_forecast`` equal to the sample median at that step.
+        """
+        set_current_trace_name(self.predictor_id)
+        series_df, series_meta = get_history_and_meta(task, context)
+        if self.cfg.history_window is not None:
+            series_df = series_df.tail(self.cfg.history_window).reset_index(drop=True)
+
+        offset = pd.tseries.frequencies.to_offset(task.frequency)
+        n_steps = task.horizon
+        forecast_start = (pd.Timestamp(context.as_of) + offset * 1).normalize()
+        forecast_end = (pd.Timestamp(context.as_of) + offset * n_steps).normalize()
+
+        history_str = serialize_history(series_df, precision=self.cfg.precision)
+
+        # Labeled covariate blocks (Context-is-Key §5.4): cutoff-safe history of
+        # each exogenous series, truncated to the same window as the target.
+        covariate_block = ""
+        if self.cfg.covariate_series_ids:
+            covariate_block = build_covariate_block(
+                context,
+                self.cfg.covariate_series_ids,
+                precision=self.cfg.precision,
+                history_window=self.cfg.history_window,
+            )
+
+        # Report context (before the task/history block): text preamble (CiK
+        # Format A) or native PDF parts, per cfg.report_ingestion.
+        report_docs = fetch_report_docs(config=self.cfg, context=context)
+
+        system_prompt = _build_system_prompt(self.cfg.system_prompt_override)
+        user_prompt = _build_user_prompt(
+            task,
+            history_str,
+            series_meta,
+            forecast_start,
+            forecast_end,
+            n_steps,
+            series_description_override=self.cfg.series_description,
+            suffix=self.cfg.user_prompt_suffix,
+            covariate_block=covariate_block,
+        )
+        user_content = apply_report_context(config=self.cfg, docs=report_docs, user_prompt=user_prompt)
+
+        parsed, cost_usd, in_tokens, out_tokens, parse_failures = _sample_trajectories(
+            cfg=self.cfg,
+            system_prompt=system_prompt,
+            user_prompt=user_content,
+        )
+        samples = _stack_trajectories([t.values for t in parsed], n_steps=n_steps)
+        q_grid = _quantiles_per_step(samples)
+
+        issued_at = datetime.now(tz=timezone.utc).replace(tzinfo=None)
+        median_idx = STANDARD_QUANTILES.index(0.50)
+        predictions: list[Prediction] = []
+        for h in task.horizons:
+            row = q_grid[h - 1]
+            quantiles = {q: float(row[i]) for i, q in enumerate(STANDARD_QUANTILES)}
+            payload = ContinuousForecast(
+                point_forecast=float(row[median_idx]),
+                quantiles=quantiles,
+            )
+            predictions.append(
+                Prediction(
+                    predictor_id=self.predictor_id,
+                    task_id=task.task_id,
+                    issued_at=issued_at,
+                    as_of=context.as_of,
+                    forecast_date=(pd.Timestamp(context.as_of) + offset * h).to_pydatetime(),
+                    payload=payload,
+                    metadata=self._build_metadata(
+                        cost_usd=cost_usd,
+                        in_tokens=in_tokens,
+                        out_tokens=out_tokens,
+                        parse_failures=parse_failures,
+                        history_window=self.cfg.history_window,
+                        extra={
+                            "n_samples": self.cfg.n_samples,
+                            "n_report_docs": len(report_docs),
+                            **({"report_sources": self.cfg.report_sources} if self.cfg.report_sources else {}),
+                            **(
+                                {"covariate_series_ids": list(self.cfg.covariate_series_ids)}
+                                if self.cfg.covariate_series_ids
+                                else {}
+                            ),
+                        },
+                    ),
+                ),
+            )
+        return predictions
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__numerical____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__numerical____init__.py.md
new file mode 100644
index 0000000..679086a
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__numerical____init__.py.md
@@ -0,0 +1,25 @@
+# Source: aieng-forecasting/aieng/forecasting/methods/numerical/__init__.py
+
+kind: python
+
+```python
+"""Numerical forecasting predictor implementations.
+
+These predictors wrap classical or machine-learning time-series models behind
+the shared :class:`~aieng.forecasting.evaluation.predictor.Predictor`
+interface.
+"""
+
+from .darts_arima import DartsAutoARIMAPredictor
+from .darts_classical import DartsExponentialSmoothingPredictor, DartsKalmanForecasterPredictor
+from .darts_regression import DartsLightGBMPredictor, DartsLinearRegressionPredictor
+
+
+__all__ = [
+    "DartsAutoARIMAPredictor",
+    "DartsExponentialSmoothingPredictor",
+    "DartsKalmanForecasterPredictor",
+    "DartsLightGBMPredictor",
+    "DartsLinearRegressionPredictor",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__numerical__darts_arima.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__numerical__darts_arima.py.md
new file mode 100644
index 0000000..0c6ffb6
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__numerical__darts_arima.py.md
@@ -0,0 +1,141 @@
+# Source: aieng-forecasting/aieng/forecasting/methods/numerical/darts_arima.py
+
+kind: python
+
+```python
+"""Darts AutoARIMA predictor — probabilistic forecast via Monte Carlo sampling.
+
+``DartsAutoARIMAPredictor`` wraps Darts ``AutoARIMA`` on the target series only
+(univariate). Darts' ``AutoARIMA`` implementation used here does not support
+exogenous covariates; this class does not expose any covariate parameters.
+
+The probabilistic forecast is produced via Monte Carlo sampling (``num_samples``
+draws from the predictive distribution).  Point forecast is the median;
+quantiles use :data:`~aieng.forecasting.evaluation.prediction.STANDARD_QUANTILES`
+levels.
+
+For multi-horizon tasks, the model is fitted once and forecast out to
+``n = max(task.horizons)`` steps; samples are then read off the resulting
+trajectory at each requested horizon. This avoids refitting per horizon.
+
+Usage::
+
+    from aieng.forecasting.methods.numerical.darts_arima import DartsAutoARIMAPredictor
+    from aieng.forecasting.evaluation import backtest, BacktestSpec
+
+    predictor = DartsAutoARIMAPredictor()
+    result = backtest(predictor=predictor, spec=spec, data_service=svc)
+    print(f"Mean CRPS: {result.mean_score:.4f}")
+"""
+
+from __future__ import annotations
+
+from datetime import datetime, timezone
+from typing import Any
+
+import numpy as np
+import pandas as pd
+from aieng.forecasting.data.context import ForecastContext
+from aieng.forecasting.evaluation.prediction import STANDARD_QUANTILES, ContinuousForecast, Prediction
+from aieng.forecasting.evaluation.predictor import Predictor
+from aieng.forecasting.evaluation.task import ForecastingTask
+
+
+class DartsAutoARIMAPredictor(Predictor):
+    """Probabilistic predictor wrapping Darts AutoARIMA (univariate).
+
+    Fits AutoARIMA on the target series history available at the forecast
+    origin, then generates a probabilistic trajectory via Monte Carlo sampling.
+    One :class:`~aieng.forecasting.evaluation.prediction.Prediction` is
+    returned per horizon step declared in ``task.horizons``.
+
+    Parameters
+    ----------
+    num_samples : int
+        Number of Monte Carlo samples used to build the predictive distribution.
+        Higher values give smoother quantile estimates at the cost of compute.
+        Default: 500.
+
+    Notes
+    -----
+    - **Darts AutoARIMA** requires ``statsforecast`` (already a project
+      dependency).  No additional install is needed.
+    - AutoARIMA can be slow (seconds to tens of seconds per origin). For rapid
+      iteration use
+      :class:`~aieng.forecasting.methods.numerical.darts_regression.DartsLinearRegressionPredictor`
+      instead.
+    """
+
+    def __init__(self, num_samples: int = 500) -> None:
+        self._num_samples = num_samples
+
+    @property
+    def predictor_id(self) -> str:
+        """Return a stable string identifier for this predictor."""
+        return "darts_autoarima"
+
+    def predict(self, task: ForecastingTask, context: ForecastContext) -> list[Prediction]:
+        """Produce probabilistic AutoARIMA forecasts for every horizon in the task.
+
+        Parameters
+        ----------
+        task : ForecastingTask
+            Defines the target series, horizons, and frequency.
+        context : ForecastContext
+            Cutoff-scoped data view.  All series returned respect
+            ``context.as_of``.
+
+        Returns
+        -------
+        list[Prediction]
+            One ``ContinuousForecast`` per horizon step in ``task.horizons``,
+            with ``point_forecast`` equal to the median of the predictive
+            sample at that step.
+        """
+        from darts import TimeSeries  # noqa: PLC0415
+        from darts.models import AutoARIMA  # noqa: PLC0415  # type: ignore[import-untyped]
+
+        series_df = context.get_series(task.target_series_id)
+
+        ts = TimeSeries.from_dataframe(
+            series_df,
+            time_col="timestamp",
+            value_cols="value",
+            fill_missing_dates=True,
+            freq=task.frequency,
+        )
+
+        model = AutoARIMA()
+        model.fit(ts)
+
+        # Fit once and forecast to the max horizon; extract samples at each requested step.
+        # all_values() shape: (n_steps, n_components, n_samples), 0-indexed.
+        forecast_ts: Any = model.predict(
+            n=task.horizon,
+            num_samples=self._num_samples,
+        )
+
+        offset = pd.tseries.frequencies.to_offset(task.frequency)
+        issued_at = datetime.now(tz=timezone.utc).replace(tzinfo=None)
+        predictions: list[Prediction] = []
+
+        for h in task.horizons:
+            samples: np.ndarray = forecast_ts.all_values()[h - 1, 0, :]
+            payload = ContinuousForecast(
+                point_forecast=float(np.median(samples)),
+                quantiles={q: float(np.quantile(samples, q)) for q in STANDARD_QUANTILES},
+            )
+            forecast_date: datetime = (pd.Timestamp(context.as_of) + offset * h).to_pydatetime()
+            predictions.append(
+                Prediction(
+                    predictor_id=self.predictor_id,
+                    task_id=task.task_id,
+                    issued_at=issued_at,
+                    as_of=context.as_of,
+                    forecast_date=forecast_date,
+                    payload=payload,
+                )
+            )
+
+        return predictions
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__numerical__darts_classical.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__numerical__darts_classical.py.md
new file mode 100644
index 0000000..a59a5ca
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__numerical__darts_classical.py.md
@@ -0,0 +1,210 @@
+# Source: aieng-forecasting/aieng/forecasting/methods/numerical/darts_classical.py
+
+kind: python
+
+```python
+"""Fast classical Darts predictors — Exponential Smoothing and Kalman filter.
+
+Two lightweight, **univariate**, probabilistic forecasters that round out the
+conventional-methods comparison alongside
+:class:`~aieng.forecasting.methods.numerical.darts_arima.DartsAutoARIMAPredictor`
+and the regression models in
+:mod:`aieng.forecasting.methods.numerical.darts_regression`:
+
+- :class:`DartsExponentialSmoothingPredictor` — state-space exponential
+  smoothing (ETS).  Defaults to a non-seasonal, non-trend specification
+  (simple exponential smoothing), which is the robust, fast choice for
+  stationary return series; pass ``seasonal_periods`` to enable additive
+  seasonality.
+- :class:`DartsKalmanForecasterPredictor` — a linear Gaussian state-space
+  (Kalman filter) model.  ``dim_x`` sets the latent state dimension.
+
+Both produce a probabilistic forecast via Monte Carlo sampling (``num_samples``
+draws); the point forecast is the median and quantiles use
+:data:`~aieng.forecasting.evaluation.prediction.STANDARD_QUANTILES`.  Like
+``DartsAutoARIMAPredictor``, neither model consumes exogenous covariates — for
+covariate-aware numerical forecasting use the Darts regression models, and for
+covariate-aware LLM forecasting use
+:class:`~aieng.forecasting.methods.llm_processes.SampledTrajectoryLLMPredictor`.
+
+For multi-horizon tasks the model is fitted once, forecast out to
+``n = max(task.horizons)`` steps, and sampled at each requested horizon index.
+
+Usage::
+
+    from aieng.forecasting.methods import DartsExponentialSmoothingPredictor
+
+    predictor = DartsExponentialSmoothingPredictor()
+    result = backtest(predictor=predictor, spec=spec, data_service=svc)
+"""
+
+from __future__ import annotations
+
+from datetime import datetime, timezone
+from typing import Any
+
+import numpy as np
+import pandas as pd
+from aieng.forecasting.data.context import ForecastContext
+from aieng.forecasting.evaluation.prediction import STANDARD_QUANTILES, ContinuousForecast, Prediction
+from aieng.forecasting.evaluation.predictor import Predictor
+from aieng.forecasting.evaluation.task import ForecastingTask
+
+
+def _target_timeseries(task: ForecastingTask, context: ForecastContext) -> Any:
+    """Build a gap-filled Darts ``TimeSeries`` from the cutoff-scoped target.
+
+    Pandas ``B`` frequency treats US market holidays as business days even
+    though daily series have no observation then, which injects NaN rows that
+    ETS / Kalman reject.  We fill those gaps by interpolation (``fill="auto"``;
+    a no-op when there are none), mirroring ``darts_regression._to_timeseries``.
+    """
+    from darts import TimeSeries  # noqa: PLC0415
+    from darts.utils.missing_values import fill_missing_values  # noqa: PLC0415
+
+    series_df = context.get_series(task.target_series_id)
+    ts = TimeSeries.from_dataframe(
+        series_df,
+        time_col="timestamp",
+        value_cols="value",
+        fill_missing_dates=True,
+        freq=task.frequency,
+    )
+    return fill_missing_values(ts, fill="auto")
+
+
+def _predictions_from_samples(
+    *,
+    forecast_ts: Any,
+    task: ForecastingTask,
+    context: ForecastContext,
+    predictor_id: str,
+) -> list[Prediction]:
+    """Turn a sampled Darts forecast into one ``Prediction`` per horizon step.
+
+    ``forecast_ts.all_values()`` has shape ``(n_steps, n_components, n_samples)``
+    and is 0-indexed, so horizon ``h`` reads row ``h - 1``.
+    """
+    offset = pd.tseries.frequencies.to_offset(task.frequency)
+    issued_at = datetime.now(tz=timezone.utc).replace(tzinfo=None)
+    predictions: list[Prediction] = []
+    for h in task.horizons:
+        samples: np.ndarray = forecast_ts.all_values()[h - 1, 0, :]
+        payload = ContinuousForecast(
+            point_forecast=float(np.median(samples)),
+            quantiles={q: float(np.quantile(samples, q)) for q in STANDARD_QUANTILES},
+        )
+        forecast_date: datetime = (pd.Timestamp(context.as_of) + offset * h).to_pydatetime()
+        predictions.append(
+            Prediction(
+                predictor_id=predictor_id,
+                task_id=task.task_id,
+                issued_at=issued_at,
+                as_of=context.as_of,
+                forecast_date=forecast_date,
+                payload=payload,
+            )
+        )
+    return predictions
+
+
+class DartsExponentialSmoothingPredictor(Predictor):
+    """Probabilistic predictor wrapping Darts ``ExponentialSmoothing`` (univariate).
+
+    Parameters
+    ----------
+    num_samples : int
+        Number of Monte Carlo samples used to build the predictive distribution.
+        Default: 500.
+    seasonal_periods : int or None
+        When set, enables **additive** seasonality with this period length (e.g.
+        ``5`` for a weekly cycle on business-day data).  ``None`` (default)
+        disables seasonality, giving a fast, robust simple-exponential-smoothing
+        specification suited to stationary return series.
+
+    Notes
+    -----
+    Darts ``ExponentialSmoothing`` wraps statsmodels ETS (already a project
+    dependency).  Fitting is fast (well under a second per origin), making this a
+    good cheap classical baseline.  Does not support exogenous covariates.
+    """
+
+    def __init__(self, num_samples: int = 500, seasonal_periods: int | None = None) -> None:
+        self._num_samples = num_samples
+        self._seasonal_periods = seasonal_periods
+
+    @property
+    def predictor_id(self) -> str:
+        """Return a stable string identifier for this predictor."""
+        return "darts_ets"
+
+    def predict(self, task: ForecastingTask, context: ForecastContext) -> list[Prediction]:
+        """Produce probabilistic ETS forecasts for every horizon in the task."""
+        from darts.models import ExponentialSmoothing  # noqa: PLC0415  # type: ignore[import-untyped]
+        from darts.utils.utils import ModelMode, SeasonalityMode  # noqa: PLC0415
+
+        ts = _target_timeseries(task, context)
+
+        if self._seasonal_periods is not None:
+            model = ExponentialSmoothing(
+                trend=ModelMode.ADDITIVE,
+                seasonal=SeasonalityMode.ADDITIVE,
+                seasonal_periods=self._seasonal_periods,
+            )
+        else:
+            # Non-seasonal, non-trend: robust simple exponential smoothing.
+            model = ExponentialSmoothing(trend=ModelMode.NONE, seasonal=SeasonalityMode.NONE)
+
+        model.fit(ts)
+        forecast_ts = model.predict(n=task.horizon, num_samples=self._num_samples)
+        return _predictions_from_samples(
+            forecast_ts=forecast_ts,
+            task=task,
+            context=context,
+            predictor_id=self.predictor_id,
+        )
+
+
+class DartsKalmanForecasterPredictor(Predictor):
+    """Probabilistic predictor wrapping Darts ``KalmanForecaster`` (univariate).
+
+    Parameters
+    ----------
+    num_samples : int
+        Number of Monte Carlo samples used to build the predictive distribution.
+        Default: 500.
+    dim_x : int
+        Latent state-space dimension of the Kalman filter.  ``1`` (default) is a
+        fast local-level specification well-suited to stationary return series;
+        higher values capture richer dynamics at some fitting cost.
+
+    Notes
+    -----
+    Darts fits the linear Gaussian state-space model with N4SID system
+    identification.  Fast per origin.  Does not support exogenous covariates.
+    """
+
+    def __init__(self, num_samples: int = 500, dim_x: int = 1) -> None:
+        self._num_samples = num_samples
+        self._dim_x = dim_x
+
+    @property
+    def predictor_id(self) -> str:
+        """Return a stable string identifier for this predictor."""
+        return "darts_kalman"
+
+    def predict(self, task: ForecastingTask, context: ForecastContext) -> list[Prediction]:
+        """Produce probabilistic Kalman forecasts for every horizon in the task."""
+        from darts.models import KalmanForecaster  # noqa: PLC0415  # type: ignore[import-untyped]
+
+        ts = _target_timeseries(task, context)
+        model = KalmanForecaster(dim_x=self._dim_x)
+        model.fit(ts)
+        forecast_ts = model.predict(n=task.horizon, num_samples=self._num_samples)
+        return _predictions_from_samples(
+            forecast_ts=forecast_ts,
+            task=task,
+            context=context,
+            predictor_id=self.predictor_id,
+        )
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__numerical__darts_regression.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__numerical__darts_regression.py.md
new file mode 100644
index 0000000..a032d7c
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__numerical__darts_regression.py.md
@@ -0,0 +1,403 @@
+# Source: aieng-forecasting/aieng/forecasting/methods/numerical/darts_regression.py
+
+kind: python
+
+```python
+"""Darts regression-model predictors — LinearRegression and LightGBM.
+
+Two concrete :class:`~aieng.forecasting.evaluation.predictor.Predictor`
+subclasses, built on Darts' sklearn-style regression forecasters:
+
+- :class:`DartsLinearRegressionPredictor` — thin wrapper around
+  :class:`darts.models.LinearRegressionModel`.
+- :class:`DartsLightGBMPredictor` — thin wrapper around
+  :class:`darts.models.LightGBMModel`.
+
+Both are **per-target** models — one independent fit per :class:`ForecastingTask`.
+Both optionally accept a list of ``covariate_series_ids`` to use as *past*
+covariates; covariates are fetched from the :class:`ForecastContext` (so the
+information cutoff is enforced by the harness, not by the predictor).
+
+Probabilistic forecasts are produced via Darts' ``likelihood="quantile"``
+configuration: the model fits one quantile regression per requested level and
+draws ``num_samples`` from the implied predictive distribution at predict time.
+The point forecast is the sample median; quantiles at
+:data:`~aieng.forecasting.evaluation.prediction.STANDARD_QUANTILES` are read
+off the sample distribution.
+
+Multi-horizon support
+---------------------
+Both predictors honour ``task.horizons``.  The model is fitted once and
+forecast out to ``n = max(task.horizons)`` steps; samples are then extracted at
+each requested horizon index from the resulting trajectory array.  A 12-step
+trajectory therefore needs the same single fit as a one-step forecast; only
+the prediction length and the sample extraction loop change.
+
+LightGBM notes
+--------------
+On macOS the LightGBM wheel requires an OpenMP runtime (``brew install libomp``).
+If you see ``Library not loaded: @rpath/libomp.dylib`` when instantiating the
+predictor, install ``libomp`` and retry.
+
+Usage
+-----
+::
+
+    from aieng.forecasting.methods.numerical.darts_regression import (
+        DartsLinearRegressionPredictor,
+        DartsLightGBMPredictor,
+    )
+    from aieng.forecasting.evaluation import backtest
+
+    # Univariate (target only)
+    pred = DartsLinearRegressionPredictor(lags=12)
+    result = backtest(predictor=pred, spec=spec, data_service=svc)
+
+    # With past covariates (e.g. FRED macro series)
+    pred = DartsLinearRegressionPredictor(
+        lags=12,
+        lags_past_covariates=12,
+        covariate_series_ids=[
+            "fred_canada_us_exchange_rate",
+            "fred_canada_10yr_bond_yield",
+        ],
+    )
+"""
+
+from __future__ import annotations
+
+from datetime import datetime, timezone
+from typing import Any, Protocol
+
+import numpy as np
+import pandas as pd
+from aieng.forecasting.data.context import ForecastContext
+from aieng.forecasting.evaluation.prediction import STANDARD_QUANTILES, ContinuousForecast, Prediction
+from aieng.forecasting.evaluation.predictor import Predictor
+from aieng.forecasting.evaluation.task import ForecastingTask
+
+
+# Quantile levels Darts fits internally.  A denser grid than STANDARD_QUANTILES
+# so sample-based quantile recovery at the reporting levels is stable.
+_TRAINING_QUANTILES: list[float] = [
+    0.025,
+    0.05,
+    0.1,
+    0.2,
+    0.3,
+    0.4,
+    0.5,
+    0.6,
+    0.7,
+    0.8,
+    0.9,
+    0.95,
+    0.975,
+]
+
+
+class _DartsRegressionModel(Protocol):
+    """Structural protocol for the Darts sklearn-style forecasters we support.
+
+    Declared as a Protocol so static typing works without importing Darts at
+    module load time (Darts pulls in LightGBM, which needs libomp on macOS).
+    """
+
+    def fit(self, series: Any, past_covariates: Any | None = ...) -> Any: ...
+    def predict(self, n: int, num_samples: int = ...) -> Any: ...
+
+
+def _to_timeseries(df: pd.DataFrame, frequency: str) -> Any:
+    """Convert a ``(timestamp, value)`` DataFrame to a Darts ``TimeSeries``.
+
+    Missing timestamps are inserted with ``fill_missing_dates=True`` so the
+    regression models, which need regularly spaced observations, can consume
+    the result. Pandas' ``B`` frequency counts US trading holidays as business
+    days even though daily market series have no observation on those days,
+    which would otherwise leave NaN rows that the sklearn-style fitters reject.
+    ``fill_missing_values(fill="auto")`` then fills those gaps (by
+    interpolation, per the Darts documentation) so the model sees a contiguous
+    series; it is a no-op for inputs with no missing days.
+    """
+    from darts import TimeSeries  # noqa: PLC0415
+    from darts.utils.missing_values import fill_missing_values  # noqa: PLC0415
+
+    ts = TimeSeries.from_dataframe(
+        df,
+        time_col="timestamp",
+        value_cols="value",
+        fill_missing_dates=True,
+        freq=frequency,
+    )
+    return fill_missing_values(ts, fill="auto")
+
+
+def _build_past_covariates(context: ForecastContext, series_ids: list[str], frequency: str) -> Any:
+    """Build a single multivariate ``TimeSeries`` of past covariates.
+
+    Each covariate is converted to a Darts ``TimeSeries`` and stacked into one
+    multivariate series on the intersection of their time indices.  Callers
+    must supply at least one covariate id.
+    """
+    from darts import concatenate  # noqa: PLC0415
+
+    pieces = []
+    for cov_id in series_ids:
+        cov_df = context.get_series(cov_id)
+        cov_ts = _to_timeseries(cov_df, frequency)
+        cov_ts = cov_ts.with_columns_renamed(["value"], [cov_id])
+        pieces.append(cov_ts)
+
+    # Intersect time indices so stacking is well-defined.
+    start = max(p.start_time() for p in pieces)
+    end = min(p.end_time() for p in pieces)
+    pieces = [p.slice(start, end) for p in pieces]
+    return concatenate(pieces, axis=1)
+
+
+def _compute_forecast_payload(samples: np.ndarray) -> ContinuousForecast:
+    """Derive point forecast and STANDARD_QUANTILES from a 1-D sample vector."""
+    point_forecast = float(np.median(samples))
+    quantiles = {q: float(np.quantile(samples, q)) for q in STANDARD_QUANTILES}
+    return ContinuousForecast(point_forecast=point_forecast, quantiles=quantiles)
+
+
+def _fit_and_sample(
+    *,
+    model: _DartsRegressionModel,
+    task: ForecastingTask,
+    context: ForecastContext,
+    covariate_series_ids: list[str] | None,
+    num_samples: int,
+) -> dict[int, np.ndarray]:
+    """Fit a Darts regression model and return horizon-indexed sample arrays.
+
+    Parameters
+    ----------
+    model :
+        A Darts regression model already configured with
+        ``likelihood="quantile"`` and appropriate lag parameters.
+    task :
+        The forecasting task; supplies ``target_series_id``, ``horizons`` and
+        ``frequency``.  The fitted model predicts ``n = task.horizon`` steps
+        (i.e. ``max(task.horizons)``) so every requested step is available in
+        the trajectory.
+    context :
+        Cutoff-scoped data view.  All series returned respect
+        ``context.as_of``.
+    covariate_series_ids :
+        Optional list of series to use as past covariates.  ``None`` is
+        equivalent to an empty list (univariate fit).
+    num_samples :
+        Monte Carlo samples drawn from the predictive distribution.
+
+    Returns
+    -------
+    dict[int, np.ndarray]
+        Mapping from horizon step ``h`` → 1-D array of ``num_samples`` draws
+        from the distribution at that step.  Only the steps listed in
+        ``task.horizons`` are included.
+    """
+    target_df = context.get_series(task.target_series_id)
+    target_ts = _to_timeseries(target_df, task.frequency)
+
+    past_covariates: Any | None = None
+    if covariate_series_ids:
+        past_covariates = _build_past_covariates(context, covariate_series_ids, task.frequency)
+
+    model.fit(target_ts, past_covariates=past_covariates)
+    # Predict once out to the outermost horizon; all steps 1..horizon are available.
+    forecast_ts = model.predict(n=task.horizon, num_samples=num_samples)
+
+    # all_values() shape: (n_steps, n_components, n_samples); step h sits at index h - 1.
+    # Materialise the array once instead of once per requested horizon.
+    values = np.asarray(forecast_ts.all_values())
+    return {h: values[h - 1, 0, :] for h in task.horizons}
+
+
+def _build_predictions(
+    *,
+    predictor_id: str,
+    task: ForecastingTask,
+    context: ForecastContext,
+    samples_by_horizon: dict[int, np.ndarray],
+    metadata: dict[str, Any] | None = None,
+) -> list[Prediction]:
+    """Assemble one ``Prediction`` per horizon from per-horizon sample arrays."""
+    offset = pd.tseries.frequencies.to_offset(task.frequency)
+    issued_at = datetime.now(tz=timezone.utc).replace(tzinfo=None)
+    return [
+        Prediction(
+            predictor_id=predictor_id,
+            task_id=task.task_id,
+            issued_at=issued_at,
+            as_of=context.as_of,
+            forecast_date=(pd.Timestamp(context.as_of) + offset * h).to_pydatetime(),
+            payload=_compute_forecast_payload(samples),
+            metadata=metadata or {},
+        )
+        for h, samples in samples_by_horizon.items()
+    ]
+
+
+class DartsLinearRegressionPredictor(Predictor):
+    """Probabilistic predictor wrapping Darts :class:`LinearRegressionModel`.
+
+    Fits a per-target quantile regression on lagged target values (and,
+    optionally, lagged covariate values) at every forecast origin, then draws
+    ``num_samples`` from the implied predictive distribution at predict time.
+
+    Returns one :class:`~aieng.forecasting.evaluation.prediction.Prediction`
+    per horizon step in ``task.horizons``.  The model is fitted once to the
+    outermost horizon so the cost is the same regardless of how many horizon
+    steps are requested.
+
+    Parameters
+    ----------
+    lags : int
+        Number of lagged target observations used as features.  Defaults to 12
+        — sufficient to capture one annual cycle in monthly data.
+    lags_past_covariates : int or None
+        Number of lagged covariate observations used as features when
+        ``covariate_series_ids`` is non-empty.  Ignored otherwise.
+        Defaults to 12.
+    covariate_series_ids : list[str] or None
+        Series ids to fetch from the :class:`ForecastContext` and stack as
+        past covariates.  ``None`` means univariate (no covariates).
+    num_samples : int
+        Monte Carlo samples drawn at predict time to compute quantiles.
+        Defaults to 500.
+    """
+
+    def __init__(
+        self,
+        lags: int = 12,
+        lags_past_covariates: int | None = 12,
+        covariate_series_ids: list[str] | None = None,
+        num_samples: int = 500,
+    ) -> None:
+        self._lags = lags
+        self._lags_past_covariates = lags_past_covariates
+        self._covariate_series_ids = list(covariate_series_ids) if covariate_series_ids else None
+        self._num_samples = num_samples
+
+    @property
+    def predictor_id(self) -> str:
+        """Return a stable identifier, suffixed ``_cov`` when covariates are used."""
+        suffix = "_cov" if self._covariate_series_ids else ""
+        return f"darts_linreg{suffix}"
+
+    def predict(self, task: ForecastingTask, context: ForecastContext) -> list[Prediction]:
+        """Produce probabilistic linear-regression forecasts for every horizon in the task."""
+        from darts.models import LinearRegressionModel  # noqa: PLC0415
+
+        model = LinearRegressionModel(
+            lags=self._lags,
+            lags_past_covariates=(self._lags_past_covariates if self._covariate_series_ids else None),
+            output_chunk_length=task.horizon,
+            likelihood="quantile",
+            quantiles=_TRAINING_QUANTILES,
+        )
+
+        samples_by_horizon = _fit_and_sample(
+            model=model,
+            task=task,
+            context=context,
+            covariate_series_ids=self._covariate_series_ids,
+            num_samples=self._num_samples,
+        )
+
+        return _build_predictions(
+            predictor_id=self.predictor_id,
+            task=task,
+            context=context,
+            samples_by_horizon=samples_by_horizon,
+            metadata={"covariates": self._covariate_series_ids or []},
+        )
+
+
+class DartsLightGBMPredictor(Predictor):
+    """Probabilistic predictor wrapping Darts :class:`LightGBMModel`.
+
+    Fits a per-target quantile-regression gradient booster on lagged target
+    and covariate values.  Predicted distributions are drawn via Darts'
+    Monte Carlo sampling over the fitted quantile regressors.
+
+    Returns one :class:`~aieng.forecasting.evaluation.prediction.Prediction`
+    per horizon step in ``task.horizons``.
+
+    Parameters
+    ----------
+    lags : int
+        Number of lagged target observations used as features.  Defaults to 12.
+    lags_past_covariates : int or None
+        Number of lagged covariate observations used as features when
+        ``covariate_series_ids`` is non-empty.  Ignored otherwise.
+        Defaults to 12.
+    covariate_series_ids : list[str] or None
+        Series ids to fetch from the :class:`ForecastContext` and stack as
+        past covariates.  ``None`` means univariate (no covariates).
+    num_samples : int
+        Monte Carlo samples drawn at predict time to compute quantiles.
+        Defaults to 500.
+    lgbm_kwargs : dict[str, Any] or None
+        Extra keyword arguments passed through to
+        :class:`darts.models.LightGBMModel`.  Use this to tune tree depth,
+        leaf count, regularisation, etc.  ``verbose=-1`` is injected by
+        default; a caller-supplied ``verbose`` takes precedence.
+
+    Notes
+    -----
+    On macOS you must have ``libomp`` installed (``brew install libomp``) for
+    LightGBM to load.  The import is deferred until :meth:`predict` so that
+    users without libomp can still use other predictors in this module.
+    """
+
+    def __init__(
+        self,
+        lags: int = 12,
+        lags_past_covariates: int | None = 12,
+        covariate_series_ids: list[str] | None = None,
+        num_samples: int = 500,
+        lgbm_kwargs: dict[str, Any] | None = None,
+    ) -> None:
+        self._lags = lags
+        self._lags_past_covariates = lags_past_covariates
+        self._covariate_series_ids = list(covariate_series_ids) if covariate_series_ids else None
+        self._num_samples = num_samples
+        kwargs = dict(lgbm_kwargs or {})
+        kwargs.setdefault("verbose", -1)
+        self._lgbm_kwargs = kwargs
+
+    @property
+    def predictor_id(self) -> str:
+        """Return a stable identifier, suffixed ``_cov`` when covariates are used."""
+        suffix = "_cov" if self._covariate_series_ids else ""
+        return f"darts_lightgbm{suffix}"
+
+    def predict(self, task: ForecastingTask, context: ForecastContext) -> list[Prediction]:
+        """Produce probabilistic LightGBM forecasts for every horizon in the task."""
+        from darts.models import LightGBMModel  # noqa: PLC0415
+
+        model = LightGBMModel(
+            lags=self._lags,
+            lags_past_covariates=(self._lags_past_covariates if self._covariate_series_ids else None),
+            output_chunk_length=task.horizon,
+            likelihood="quantile",
+            quantiles=_TRAINING_QUANTILES,
+            **self._lgbm_kwargs,
+        )
+
+        samples_by_horizon = _fit_and_sample(
+            model=model,
+            task=task,
+            context=context,
+            covariate_series_ids=self._covariate_series_ids,
+            num_samples=self._num_samples,
+        )
+
+        return _build_predictions(
+            predictor_id=self.predictor_id,
+            task=task,
+            context=context,
+            samples_by_horizon=samples_by_horizon,
+            metadata={"covariates": self._covariate_series_ids or []},
+        )
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__models.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__models.py.md
new file mode 100644
index 0000000..77e33c8
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__models.py.md
@@ -0,0 +1,38 @@
+# Source: aieng-forecasting/aieng/forecasting/models.py
+
+kind: python
+
+```python
+"""Canonical proxy model identifiers used across the project.
+
+The bootcamp standardizes on exactly two Vector-proxy models so examples,
+defaults, and notebooks stay consistent for participants:
+
+- :data:`LITE_MODEL` — the default / lite model. Fast and cheap; used
+  everywhere unless a task specifically benefits from the advanced model.
+- :data:`ADVANCED_MODEL` — the advanced model. Higher capability; reserved for
+  the adaptive-agent path and production-quality / curriculum-generation runs.
+
+Reference these constants instead of hardcoding model strings, so a model
+swap is a one-line change here rather than a repo-wide find-and-replace.
+
+This module is intentionally dependency-free (it imports nothing from the rest
+of the package) so it can be imported from anywhere without risking an import
+cycle.
+"""
+
+from __future__ import annotations
+
+
+#: Default / lite model — fast and cheap; the project-wide default.
+LITE_MODEL = "gemini-3.1-flash-lite-preview"
+
+#: Advanced model — higher capability; adaptive-agent and production runs.
+ADVANCED_MODEL = "gemini-3.5-flash"
+
+#: Alias for the project-wide default model (the lite model).
+DEFAULT_MODEL = LITE_MODEL
+
+
+__all__ = ["ADVANCED_MODEL", "DEFAULT_MODEL", "LITE_MODEL"]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/docs__adk-skills-guide.md.md b/implementations/getting_started/concierge_agent/context/artifacts/docs__adk-skills-guide.md.md
new file mode 100644
index 0000000..ae7c4ae
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/docs__adk-skills-guide.md.md
@@ -0,0 +1,196 @@
+# Source: docs/adk-skills-guide.md
+
+kind: markdown
+
+# ADK Skills and Code Execution — How-To for This Repo
+
+How agentic forecasters in this repo extend their capabilities, and the rules
+for adding each correctly the first time. The patterns here are not
+hypothetical — they are the ones the energy implementation
+(`implementations/energy_oil_forecasting/`) uses today, and this guide points
+at those skills as the canonical examples.
+
+All of this is wired through `AgentConfig` /
+`build_adk_agent` in
+[`aieng-forecasting/aieng/forecasting/methods/agentic/agent_factory.py`](../aieng-forecasting/aieng/forecasting/methods/agentic/agent_factory.py).
+
+---
+
+## 1. The three ways to extend an agent
+
+Pick the lightest mechanism that does the job. They compose — the energy
+agents use all three.
+
+| Mechanism | `AgentConfig` field | Runs where | Use it for |
+|---|---|---|---|
+| **Read-only skill** (ADK `SkillToolset`) | `skills_dirs: Sequence[Path]` | Content injected into the model context; files loaded on demand | Reference data and instructions too large or specific for the system prompt — benchmark tables, calibration stats, code patterns, series metadata |
+| **Function tool** | `function_tools` (pre-built ADK tools) and `extra_tools` (plain callables wrapped as `FunctionTool`) | The **host process** | Deterministic, auditable operations: a pre-specified `ForecastTool`, or typed state-mutation tools (see §5) |
+| **Code execution** | `code_execution: CodeExecutionConfig` | An **E2B cloud sandbox** | Open-ended ad-hoc computation the LLM writes itself — rolling indicators, interval calibration, exploratory analysis |
+
+A read-only skill *describes* how to do something; a function tool *does* a
+fixed thing reproducibly; code execution *lets the model do anything* inside a
+sandbox. Reach for code execution only when the flexibility is the point —
+otherwise a function tool is more controllable.
+
+---
+
+## 2. Code execution is E2B-only
+
+This repo standardized on **E2B** for code execution. There is no
+Gemini-native / built-in code-execution path. When `code_execution.enabled` is
+true, `build_adk_agent` attaches the E2B `CodeInterpreter(...).run_code` tool;
+code runs in a sandbox built from the image named in
+`CodeExecutionConfig.template_name`.
+
+- Code execution is **disabled by default** (`CodeExecutionConfig`).
+- Build the sandbox image once before enabling it — see
+  [Getting Started → Build the E2B sandbox image](../README.md) and
+  `scripts/build_e2b_template.py`.
+- Function tools and skill-mutation tools (§5) run in the **host process, not
+  in the sandbox**; only the model's own `run_code` calls execute in E2B.
+
+> **Prompt hygiene.** Do not tell the model it "may execute code" unless
+> `code_execution.enabled` is true. With no `run_code` tool available, the
+> model will look for the nearest substitute (historically, a hallucinated
+> `run_skill_script` call). Match the prompt to the tools actually attached.
+
+---
+
+## 3. How ADK skills work
+
+An ADK skill is a directory:
+
+```
+my-skill/
+├── SKILL.md          # required — frontmatter (name, description) + body instructions
+├── references/       # optional — docs or data files, loaded via load_skill_resource
+├── assets/           # optional — templates or other resources, loaded via load_skill_resource
+└── scripts/          # optional — Python/bash scripts, executed via run_skill_script
+```
+
+You attach skills by listing their directories in `AgentConfig.skills_dirs`;
+`build_adk_agent` calls `load_skill_from_dir` on each and wraps them in a
+single `SkillToolset`. When a `SkillToolset` is present, ADK registers **four
+tools** for every model call, regardless of which subdirectories actually
+exist:
+
+| Tool | What it does |
+|------|-------------|
+| `list_skills` | Returns each skill's `name` + `description` from its SKILL.md frontmatter (L1 metadata). |
+| `load_skill` | Returns the full SKILL.md body for a named skill (L2 instructions). |
+| `load_skill_resource` | Loads a file from `references/`, `assets/`, or `scripts/`. |
+| `run_skill_script` | Executes a Python or bash script from `scripts/`. |
+
+ADK also injects a fixed paragraph into the system prompt describing these
+folders **unconditionally** — there is no public API to suppress it. The model
+reads it and concludes that scripts exist, **even when the skill has none.**
+That single fact drives the rules below.
+
+---
+
+## 4. The design rules
+
+### Rule 1 — Don't attach a skill that has no files in `references/`, `assets/`, or `scripts/`.
+
+A skill with only a `SKILL.md` body is a system-prompt fragment wearing four
+extra tool declarations. It adds the ADK injection (which advertises scripts
+that don't exist) for zero benefit. If all you have is body text, put it in the
+agent instruction and leave `skills_dirs` empty.
+
+A skill earns its place when it provides reference **data** loaded on demand
+(`load_skill_resource`) or executable **scripts** (`run_skill_script`).
+
+> **Why this rule exists (the food-CPI incident).** The first skill in the repo
+> — `forecast-food-cpi` — had only a `SKILL.md` body, no `references/` or
+> `scripts/`. The ADK injection told the model scripts existed, so it invented
+> plausible names (`scripts/setup.py`, `scripts/forecast.py`) and burned three
+> tool round-trips on `SCRIPT_NOT_FOUND` before giving up and reasoning from the
+> prompt directly — which is all it ever needed to do. The skill was removed and
+> its content folded back into the system prompt. The rules here are the lesson.
+
+### Rule 2 — If a skill has references but no scripts, say so in the prompt.
+
+ADK will advertise `run_skill_script` regardless. Pre-empt the hallucination
+with an explicit instruction. The energy analyst agent does exactly this —
+after telling the model to use `list_skills` → `load_skill` →
+`load_skill_resource`, it adds:
+
+> These skills have NO scripts. Do not call `run_skill_script`.
+
+### Rule 3 — Keep the SKILL.md body minimal.
+
+Only instructions specific to the reference data or scripts. Anything that
+duplicates the system prompt belongs in the system prompt.
+
+---
+
+## 5. Worked examples in the repo
+
+### Read-only skills — `energy_oil_forecasting/analyst_agent/skills/`
+
+The code-executing analyst variant attaches two skills via `skills_dirs` (see
+`analyst_agent/agent.py`):
+
+- **`statistical-analysis/`** — `SKILL.md` plus
+  `references/analysis-patterns.md` and `references/wti_benchmarks.json`
+  (seasonal/volatility benchmarks loaded via `load_skill_resource`).
+- **`trend-projection/`** — `SKILL.md` plus `references/projection-examples.md`
+  (code patterns for fitting a trend and calibrating intervals).
+
+Both follow Rule 1 (real `references/` content) and the agent prompt follows
+Rule 2 (explicit "no scripts"). This is the calibration-benchmarks idea that
+earlier versions of this guide only sketched — now realized in working code.
+
+### Adaptive skills — a learnable strategy
+
+The adaptive agent (`energy_oil_forecasting/adaptive_agent/`) introduces a
+fourth idea: a skill whose content the agent **mutates** over a study session,
+rather than only reading it. The infrastructure is generic and lives in
+[`aieng/forecasting/methods/agentic/adaptive_skill.py`](../aieng-forecasting/aieng/forecasting/methods/agentic/adaptive_skill.py):
+
+- **`AdaptiveSkillState`** — an abstract Pydantic model that is the source of
+  truth for the skill's content; subclasses implement `build_markdown()` to
+  render the state into the `SKILL.md` the `SkillToolset` injects.
+- **`AdaptiveSkillStore`** — persists one skill directory: `skill_state.yaml`
+  (the source of truth), `SKILL.md` (re-rendered from state on every save), and
+  `.history/` (a timestamped backup before each save, so every mutation is
+  reversible without git). Its `confirmation_threshold` lives on the *store*,
+  not the *state*, so the agent cannot lower its own evidence bar by mutating
+  state.
+
+The mutations are exposed as **typed function tools**, not as `run_skill_script`
+scripts. The implementation writes one thin callable per operation
+(`record_observation`, `open_hypothesis`, `graduate_hypothesis`, …) in
+`adaptive_agent/skill_tools.py` and registers them with
+`AgentConfig(extra_tools=build_skill_tools(strategy_dir))`. They run in the host
+process and persist through the store. The agent reads its current strategy as a
+normal read-only skill (`skills_dirs` includes the strategy directory) and
+updates it through the tools — read and write are deliberately separate
+surfaces.
+
+Use this pattern when an agent should *learn* a durable strategy; use a plain
+read-only skill when the reference material is fixed.
+
+---
+
+## 6. Checklist for adding a skill
+
+- [ ] The skill has at least one file in `references/`, `assets/`, or `scripts/`. If not, move the SKILL.md content into the agent instruction and leave it out of `skills_dirs` (Rule 1).
+- [ ] If it has `references/`/`assets/` **but no `scripts/`**, the agent instruction says so explicitly (Rule 2) — ADK advertises `run_skill_script` regardless.
+- [ ] If it has `scripts/`, every script the model is likely to call actually exists, and the SKILL.md body lists the available scripts.
+- [ ] The SKILL.md body is minimal — nothing that duplicates the system prompt (Rule 3).
+- [ ] For an adaptive skill: state lives in an `AdaptiveSkillState` subclass, mutations go through `extra_tools` (never `run_skill_script`), and the evidence threshold stays on the store.
+- [ ] A test confirms the skill directory loads and its L1 metadata (name, description) is what you expect.
+- [ ] After wiring it up, run one trace and confirm no spurious `run_skill_script` / `load_skill_resource` errors appear.
+
+---
+
+## 7. Current status
+
+- **Energy** uses read-only skills (analyst agent) and adaptive skills
+  (adaptive agent), all following the rules above.
+- **Food Price Forecasting** is a numerical-predictor path and runs **without
+  any ADK skills** — there is no agent or skill directory under it. If an
+  agentic food-CPI path is added later, the `statistical-analysis` skill in
+  energy is the closest template for a benchmarks-style read-only skill.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__README.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__README.md.md
new file mode 100644
index 0000000..000ae99
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__README.md.md
@@ -0,0 +1,62 @@
+# Source: implementations/README.md
+
+kind: markdown
+
+# implementations
+
+Self-contained reference implementations and their helper code.
+
+This is a local uv workspace package. It is installed automatically when you run `uv sync` from the repository root, but it is not a separately published public API.
+
+Some use cases are notebook-only. Others expose a small importable helper package so shared analysis, plotting, or data-registration code can live in Python modules instead of large notebook cells.
+
+---
+
+## Directory layout
+
+Numbered in the recommended order (mirrors the bootcamp progression: conventional numerical methods → LLM Processes → agents → agentic evaluation). The directories are not renamed — the numbers are an ordering convention used across the docs, and each directory stays an importable package (`from sp500_forecasting.data import ...`).
+
+```text
+implementations/
+|-- getting_started/          # 0 · CPI gasoline hello-world (start here)
+|   `-- specs/                #     backtest and eval YAML
+|-- sp500_forecasting/        # 1 · S&P 500 multivariate numerical comparison (financial markets)
+|   `-- specs/                #     backtest YAML (smoke + full)
+|-- food_price_forecasting/   # 2 · CFPR-style food CPI experiment
+|   `-- specs/                #     backtest YAML
+|-- energy_oil_forecasting/   # 3 · Daily WTI oil price forecasting experiment
+|   `-- specs/                #     backtest and eval YAML
+|-- boc_rate_decisions/       # 4 · Discrete-event reference: BoC cut/hold/hike direction
+|   `-- specs/                #     direction + binary backtest / eval / smoke YAML
+|-- tests/                    # tests for implementation-specific helper modules
+`-- pyproject.toml            # local workspace packaging
+```
+
+YAML backtest and eval specs live under each use case in `specs/`. Each directory is independent; see its `README.md` for the walkthrough.
+
+Every domain use case (all except `getting_started`) also ships a `starter_agent/` module and a `99_starter_agent.ipynb` — a fresh, hackable **starter agent** that is the consistent "build your own" entry point for that use case (toggleable news search + code execution, two lightweight tool-usage skills, an interactive cell, and one scored forecast).
+
+`getting_started/` additionally ships a **`concierge_agent/`** module and **`99_repo_concierge.ipynb`** — a repo onboarding helper (not a forecaster) that answers questions about how the codebase works using a committed public-`main` knowledge digest. See that notebook for both in-notebook and `adk run` usage.
+
+---
+
+## Relationship to `aieng-forecasting`
+
+- `aieng-forecasting` (`aieng.forecasting`) owns reusable infrastructure and reusable reference predictors under `aieng.forecasting.methods`.
+- `implementations/` owns use-case material: walkthrough notebooks, experiment-specific helper modules, plotting/analysis code, and task-specific framing.
+
+If code becomes broadly reusable across use cases, promote it into `aieng-forecasting`.
+
+---
+
+## Adding a new use case
+
+1. Create `implementations/<use-case>/`.
+2. Add a `README.md` describing the task, the data, and what the notebooks cover.
+3. Add YAML specs under `implementations/<use-case>/specs/`.
+4. Start with notebooks as the primary user surface.
+5. If notebook code becomes bulky or repeated, extract small helper modules into that use-case directory.
+6. Add tests under `implementations/tests/<use-case>/` for non-trivial helper logic.
+7. Promote code into `aieng-forecasting` once it is clearly reusable across more than one use case.
+
+For architecture principles and cross-cutting extension ideas, see `planning-docs/roadmap.md`.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations____init__.py.md
new file mode 100644
index 0000000..f724a08
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations____init__.py.md
@@ -0,0 +1,7 @@
+# Source: implementations/__init__.py
+
+kind: python
+
+```python
+"""Namespace package root for per-topic experiment notebooks and helpers."""
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__01_boc_data_exploration.ipynb.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__01_boc_data_exploration.ipynb.md
new file mode 100644
index 0000000..54cf4bc
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__01_boc_data_exploration.ipynb.md
@@ -0,0 +1,417 @@
+# Source: implementations/boc_rate_decisions/01_boc_data_exploration.ipynb
+
+kind: notebook
+
+## Cell 1 (markdown)
+
+# BoC Rate Decisions — Data Exploration & Problem Framing
+
+This notebook frames the Bank of Canada rate-decision use case and walks
+through its data layer. It is the warm-up for the experiment notebook
+(`02_boc_rate_direction_experiment.ipynb`), which runs the predictors.
+
+**The question.** At each of the Bank of Canada's eight fixed announcement
+dates per year: *will the Bank **cut**, **hold**, or **hike** its target for
+the overnight rate?* A predictor must emit a **probability distribution over
+the three ordered outcomes** four weeks before each announcement, and is
+scored with the **Ranked Probability Score (RPS)** — squared error
+accumulated over the cumulative distribution, so putting mass on *hike* when
+the Bank cuts costs more than putting it on *hold*.
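+
+A worked example may make the asymmetry concrete. This is a sketch under one
+common convention (the unnormalized RPS; some definitions divide by $K-1$, and
+which one the harness uses is not stated in this notebook). The probabilities
+below are illustrative, not outputs of any predictor.
+
+```latex
+% K = 3 ordered outcomes: cut < hold < hike.
+% p_j = forecast probability, o_j = 1 for the realized outcome and 0 otherwise.
+% F_k = cumulative forecast probability, O_k = cumulative outcome indicator.
+\mathrm{RPS} = \sum_{k=1}^{K-1} \left( F_k - O_k \right)^2,
+\qquad F_k = \sum_{j \le k} p_j,
+\qquad O_k = \sum_{j \le k} o_j
+
+% Realized outcome: cut, so o = (1, 0, 0) and (O_1, O_2) = (1, 1).
+% Forecast A: p = (0.5, 0.5, 0.0)  =>  (F_1, F_2) = (0.5, 1.0)
+\mathrm{RPS}_A = (0.5 - 1)^2 + (1.0 - 1)^2 = 0.25
+% Forecast B: p = (0.5, 0.0, 0.5)  =>  (F_1, F_2) = (0.5, 0.5)
+\mathrm{RPS}_B = (0.5 - 1)^2 + (0.5 - 1)^2 = 0.50
+```
+
+Both forecasts give *cut* the same probability, but B scores twice as badly
+because its remaining mass sits on *hike*, two categories away from the
+realized outcome.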
+
+**Why four weeks?** On the eve of a decision the bond market has already
+converged: the 2-year GoC yield prices the announcement to near-certainty,
+and a "forecast" at that point mostly reads market consensus off a curve. At
+a 28-day lead the decision is genuinely uncertain — the skill being measured
+is *anticipating cycle turns before the market does*. An eve-of-decision
+(T−1) variant is kept as a diagnostic; notebook 02 compares the two leads
+directly.
+
+**Why this is a different kind of problem.** Every other use case in this
+repository forecasts a *continuous trajectory* (CPI levels, oil prices) and
+scores it with CRPS. Here the target is a *discrete decision* on an
+*irregular calendar*:
+
+- There is no curve to extrapolate — the outcome space is {cut, hold, hike},
+  and the categories are *ordered* (a cut is "further" from a hike than from
+  a hold).
+- Observations occur only on meeting dates, which don't fall on a fixed grid.
+- The classes are heavily imbalanced (holds dominate; cuts and hikes are rare
+  and clustered), so calibration — not classification accuracy — is what
+  matters.
+
+This is the reference example for ordered-categorical tasks in the evaluation
+harness (`ForecastingTask.payload_type == "categorical"`), exercising the
+explicit `origin_dates` calendar and RPS scoring that were added to the core
+package for exactly this class of problem. The binary special case — *cut vs
+no cut*, scored with Brier — is kept alongside as a compact copy-paste
+reference for naturally binary problems (prediction-market style questions);
+the experiment notebook opens with it as a warm-up.
+
+## Cell 2 (markdown)
+
+---
+## 1. Setup
+
+Four ingredients, all registered on a `DataService` by
+`boc_rate_decisions.data.build_boc_service()`:
+
+| Ingredient | Source | Role |
+|---|---|---|
+| Daily target for the overnight rate | StatCan 10-10-0139-01 | Raw policy-rate path |
+| Fixed announcement dates 2009–2026 | `meeting_schedule.yaml` (curated, source-cited) | The meeting calendar — required to observe *holds*, which no published series encodes |
+| Derived `boc_rate_decision_direction` series | `BoCDecisionEventAdapter` | The −1/0/+1 target: one observation per meeting (cut / hold / hike) |
+| Derived `boc_rate_cut_event` series | `BoCDecisionEventAdapter` | The 0/1 binary view of the same decisions, for the compact binary reference |
+
+Plus three macro covariates: the 2-year GoC benchmark yield (daily, StatCan),
+headline CPI (monthly, StatCan), and the unemployment rate (monthly, FRED).
+
+Populate the local cache once before running:
+
+```bash
+uv run python scripts/fetch_boc.py
+```
+
+## Cell 3 (code)
+
+```python
+from __future__ import annotations
+
+from datetime import datetime, timezone
+from pathlib import Path
+
+import matplotlib.pyplot as plt
+import pandas as pd
+
+
+ROOT = Path.cwd().resolve().parents[1]
+STATCAN_CACHE = ROOT / "data" / "statcan"
+FRED_CACHE = ROOT / "data" / "fred"
+
+from boc_rate_decisions.data import (
+    BOND_YIELD_2YR_SERIES_ID,
+    CPI_SERIES_ID,
+    DIRECTION_SERIES_ID,
+    RATE_CUT_EVENT_SERIES_ID,
+    TARGET_RATE_SERIES_ID,
+    UNEMPLOYMENT_SERIES_ID,
+    build_boc_service,
+    load_meeting_schedule,
+)
+
+
+svc = build_boc_service(statcan_cache_dir=STATCAN_CACHE, fred_cache_dir=FRED_CACHE)
+
+_as_of = datetime.now(tz=timezone.utc).replace(tzinfo=None)
+rate_df = svc.get_series(TARGET_RATE_SERIES_ID, as_of=_as_of)
+direction_df = svc.get_series(DIRECTION_SERIES_ID, as_of=_as_of)
+event_df = svc.get_series(RATE_CUT_EVENT_SERIES_ID, as_of=_as_of)
+meeting_dates = load_meeting_schedule()
+
+n_cuts = int((direction_df["value"] == -1.0).sum())
+n_holds = int((direction_df["value"] == 0.0).sum())
+n_hikes = int((direction_df["value"] == 1.0).sum())
+print(
+    f"Target rate:   {rate_df['timestamp'].min().date()} → {rate_df['timestamp'].max().date()}  ({len(rate_df)} days)"
+)
+print(f"Meetings:      {meeting_dates[0].date()} → {meeting_dates[-1].date()}  ({len(meeting_dates)} scheduled)")
+print(f"Directions:    {len(direction_df)} resolved meetings — {n_cuts} cuts, {n_holds} holds, {n_hikes} hikes")
+print(f"Binary view:   {len(event_df)} resolved meetings, {int(event_df['value'].sum())} cut events")
+```
+
+## Cell 4 (markdown)
+
+---
+## 2. The policy rate and its decisions
+
+The chart below is the visual motivation for the whole experiment. The policy
+rate is a step function — long flat stretches punctuated by short bursts of
+movement. Cuts (red down-triangles) cluster into easing cycles — 2015 (oil
+shock), 2020 (COVID), 2024–25 (post-inflation normalisation) — and hikes
+(teal up-triangles) into tightening cycles — 2010, 2017–18, and the steep
+2022–23 inflation fight. Moves of either kind are rare; the grey dots
+(holds) dominate.
+
+That clustering is what makes this problem interesting. An unconditional
+base-rate forecast is hard to beat *on average*, but it is wrong in exactly
+the periods that matter most. A good predictor has to recognise which cycle
+the Bank is in — easing, tightening, or neither — and the ordered outcome
+space means it should essentially never hesitate *between* cut and hike at
+the same meeting.
+
+## Cell 5 (code)
+
+```python
+from boc_rate_decisions.plots import plot_policy_rate_with_decisions
+
+
+fig, _ = plot_policy_rate_with_decisions(rate_df, direction_df, kind="direction")
+plt.show()
+```
+
+## Cell 6 (markdown)
+
+---
+## 3. Deriving the decision series
+
+No published series encodes BoC *decisions* — the daily rate tells you the
+level, but a flat line is ambiguous between "no meeting happened" and "a
+meeting happened and the Bank held". That's why the meeting calendar is
+committed as a curated YAML (`meeting_schedule.yaml`, built from the Bank's
+own announcement archive) and joined against the daily rate:
+
+- For each scheduled announcement date, compare the rate strictly *before*
+  the date with the rate shortly *after* it.
+- `−1` if the rate decreased (a cut of any size), `+1` if it increased, `0`
+  if it held — that is `boc_rate_decision_direction`, the primary target.
+  The binary `boc_rate_cut_event` series is a thin wrapper over the same
+  comparison (`1.0` exactly where the direction is `−1`).
+- The comparison uses a post-meeting lookahead window because the effective
+  date of a change moved from same-day to next-day in 2021 — deriving from
+  levels on both sides of the announcement is robust to that regime change.
+
+`validate_schedule_against_rate_series` cross-checks the curated calendar:
+every observed rate change must be attributable to a scheduled meeting or a
+known unscheduled announcement (there is exactly one since 2009: the
+emergency COVID cut of March 27, 2020, which is *excluded* from the task —
+predicting emergency moves is a different problem).
+
+## Cell 7 (code)
+
+```python
+from boc_rate_decisions.data import load_unscheduled_announcements, validate_schedule_against_rate_series
+
+
+unattributed = validate_schedule_against_rate_series(
+    rate_df, meeting_dates, unscheduled_dates=load_unscheduled_announcements()
+)
+print(f"Rate changes not attributable to a scheduled/known announcement: {len(unattributed)}")
+
+# The last 10 resolved meetings, with the rate on each side of the announcement.
+_LABELS = {-1.0: "CUT", 0.0: "HOLD", 1.0: "HIKE"}
+recent = direction_df.tail(10).copy()
+rate_by_date = rate_df.set_index("timestamp")["value"]
+rows = []
+for ts, direction in zip(recent["timestamp"], recent["value"]):
+    before = float(rate_by_date[rate_by_date.index < ts].iloc[-1])
+    after_window = rate_by_date[(rate_by_date.index >= ts) & (rate_by_date.index <= ts + pd.Timedelta(days=7))]
+    after = float(after_window.iloc[-1]) if not after_window.empty else float("nan")
+    rows.append(
+        {
+            "meeting": ts.date(),
+            "rate_before": before,
+            "rate_after": after,
+            "outcome": _LABELS[float(direction)],
+            "direction_value": int(direction),
+        }
+    )
+print()
+print(pd.DataFrame(rows).to_string(index=False))
+```
+
+## Cell 8 (markdown)
+
+---
+## 4. Class imbalance and the climatology floor
+
+Holds are roughly three meetings in four since 2009; cuts and hikes split the
+rest — and the per-year table below shows how lumpy those averages are: most
+years are all-hold, while a handful of cycle years account for nearly all the
+moves.
+
+This sets the **floor baseline** for the experiment. The RPS of a forecast
+over \(K\) ordered categories decomposes into \(K-1\) binary Brier scores of
+the *cumulative* events — for cut < hold < hike:
+
+\[ \mathrm{RPS} = \big(P(\text{cut}) - y_{\le\text{cut}}\big)^2 +
+   \big(P(\text{cut}) + P(\text{hold}) - y_{\le\text{hold}}\big)^2, \]
+
+where \(y_{\le k}\) indicates the realised outcome was at or below category
+\(k\). Each binary term is minimised in expectation by the corresponding
+cumulative base rate, so the constant *climatological distribution* — the
+empirical cut/hold/hike frequencies — is the optimal conditions-blind
+forecast. Any predictor that can't beat it has learned nothing from
+conditions; that is exactly what `CategoricalFrequencyPredictor` implements
+in notebook 02. (Set \(K = 2\) and the decomposition collapses to the
+familiar binary Brier score \((p - y)^2\) — the identity the binary warm-up
+verifies.)
+
+## Cell 9 (code)
+
+```python
+import numpy as np
+from aieng.forecasting.evaluation import compute_rps
+from boc_rate_decisions.analysis import yearly_outcome_table
+
+
+yearly = yearly_outcome_table(direction_df, labels={-1.0: "cut", 0.0: "hold", 1.0: "hike"})
+print(yearly.to_string())
+
+base_rates = direction_df["value"].map({-1.0: "cut", 0.0: "hold", 1.0: "hike"}).value_counts(normalize=True)
+climatology = [base_rates.get("cut", 0.0), base_rates.get("hold", 0.0), base_rates.get("hike", 0.0)]
+outcome_idx = direction_df["value"].map({-1.0: 0, 0.0: 1, 1.0: 2}).tolist()
+print(f"\nOverall base rates: cut {climatology[0]:.3f}, hold {climatology[1]:.3f}, hike {climatology[2]:.3f}")
+print(f"RPS of always predicting the climatology:  {compute_rps([climatology] * len(outcome_idx), outcome_idx):.4f}")
+uniform = [1 / 3, 1 / 3, 1 / 3]
+print(f"RPS of always predicting uniform thirds:   {compute_rps([uniform] * len(outcome_idx), outcome_idx):.4f}")
+always_hold = [0.0, 1.0, 0.0]
+print(f"RPS of always predicting a certain hold:   {compute_rps([always_hold] * len(outcome_idx), outcome_idx):.4f}")
+
+fig, ax = plt.subplots(figsize=(11, 3))
+bottom = np.zeros(len(yearly))
+for column, color, label in [
+    ("n_cut", "#d62728", "Cuts"),
+    ("n_hold", "#cccccc", "Holds"),
+    ("n_hike", "#1b7a76", "Hikes"),
+]:
+    ax.bar(yearly.index, yearly[column], bottom=bottom, color=color, label=label)
+    bottom += yearly[column].to_numpy()
+ax.set_ylabel("Meetings")
+ax.set_title("BoC fixed announcements per year by outcome")
+ax.legend(fontsize=9)
+ax.grid(axis="y", alpha=0.3)
+plt.tight_layout()
+plt.show()
+```
+
+## Cell 10 (markdown)
+
+---
+## 5. Cutoff discipline at announcement dates
+
+Forecast origins sit **four weeks before each announcement** (`as_of =
+meeting − 28 days`, `horizons=[28]`, frequency `"D"`), so the forecast date
+lands exactly on the meeting and the harness's cutoff enforcement excludes
+everything after the origin. Scheduled meetings are never closer than 35
+days apart, so the *previous* decision is always visible at the origin.
+Because the meeting calendar is irregular, the backtest spec lists its
+origins explicitly via `origin_dates` rather than deriving them from a
+stride.
+
+Each series carries a `released_at` column reflecting when the data was
+*available*, not when it was *measured*. For daily market series (target
+rate, bond yields) this is precise: next business day. For the monthly
+series it is an **approximation** — `timestamp + 21 days` for StatCan CPI,
+FRED's own approximate stamps for unemployment — and the approximation is
+*optimistic*: StatCan actually publishes CPI for reference month *m* about
+three weeks after the *end* of *m*, roughly 51 days after the month-start
+timestamp.
+
+This is why the predictors apply an extra month of conservative lag on
+monthly covariates: they drop the newest reference month visible in the
+context, which lands exactly on the month that was genuinely public at the
+origin. The output below shows both layers: the approximate stamp and the extra lag.
+
+The cell below freezes the world four weeks before the June 2024 meeting —
+the first cut of the 2024 easing cycle — and shows exactly what a predictor
+was allowed to see.
+
+## Cell 11 (code)
+
+```python
+# Freeze the information state 28 days before the 2024-06-05 announcement.
+ctx = svc.context(as_of=datetime(2024, 5, 8))
+
+visible_rate = ctx.get_series(TARGET_RATE_SERIES_ID)
+visible_directions = ctx.get_series(DIRECTION_SERIES_ID)
+visible_cpi = ctx.get_series(CPI_SERIES_ID)
+
+print("As-of 2024-05-08 (announcement on 2024-06-05, 28 days out):")
+print(
+    f"  Last visible target rate:   {visible_rate['value'].iloc[-1]:.2f}%  on {visible_rate['timestamp'].iloc[-1].date()}"
+)
+print(
+    f"  Last visible meeting:       {visible_directions['timestamp'].iloc[-1].date()}  "
+    f"(outcome: {_LABELS[float(visible_directions['value'].iloc[-1])].lower()})"
+)
+print(
+    f"  Last visible CPI month:     {visible_cpi['timestamp'].iloc[-1].date()}  "
+    f"(approx. released_at {visible_cpi['released_at'].iloc[-1].date()})"
+)
+print(f"  CPI month predictors USE:   {visible_cpi['timestamp'].iloc[-2].date()}  (after the extra-month lag)")
+print()
+print("In reality, the April CPI print (released May 21, 2024) was NOT yet")
+print("available on May 8 — the approximate released_at makes April look")
+print("visible, and the predictors' extra-month lag corrects for exactly that.")
+print()
+print("Neither the 2024-06-05 meeting NOR anything after the origin is visible:")
+print(f"  Meetings visible: {len(visible_directions)} of {len(direction_df)} resolved")
+print()
+print("What actually happened on 2024-06-05: the Bank cut 25bp to 4.75% —")
+print("the first cut of the easing cycle. A predictor at this origin had the")
+print("rate path, April's hold, and March CPI. Four weeks before the decision,")
+print("markets had not yet fully converged on a June cut.")
+```
+
+## Cell 12 (markdown)
+
+---
+## 6. Macro covariates
+
+Three series the Bank itself watches, used by the logistic baseline and fed
+to the LLM-based predictors:
+
+- **2-year GoC yield vs the policy rate** — the bond market's implied policy
+  path. The 2-year trading well *below* the overnight rate means markets are
+  pricing cuts; well *above*, hikes. It is the single most informative
+  pre-meeting signal, and it is naturally *directional*.
+- **CPI inflation vs the 2% target** — the Bank's mandate. Sustained
+  below-target inflation creates room to cut; above-target overshoots force
+  hikes.
+- **Unemployment momentum** — a rising unemployment rate pressures the Bank
+  toward easing; a tight labour market supports tightening.
+
+## Cell 13 (code)
+
+```python
+yield_df = svc.get_series(BOND_YIELD_2YR_SERIES_ID, as_of=_as_of)
+cpi_df = svc.get_series(CPI_SERIES_ID, as_of=_as_of)
+unemp_df = svc.get_series(UNEMPLOYMENT_SERIES_ID, as_of=_as_of)
+
+cpi_yoy = cpi_df.set_index("timestamp")["value"].pct_change(12) * 100
+
+fig, axes = plt.subplots(3, 1, figsize=(12, 9), sharex=True)
+
+axes[0].plot(rate_df["timestamp"], rate_df["value"], color="k", linewidth=1.2, label="Target rate")
+axes[0].plot(yield_df["timestamp"], yield_df["value"], color="#1f77b4", linewidth=0.9, alpha=0.8, label="2yr GoC yield")
+axes[0].set_ylabel("%")
+axes[0].set_title("Policy rate vs 2-year GoC yield (yield below rate = cuts priced in)", fontsize=10)
+axes[0].legend(fontsize=8)
+
+axes[1].axhline(2.0, color="#d62728", linewidth=0.8, linestyle="--", label="2% target")
+axes[1].plot(cpi_yoy.index, cpi_yoy.values, color="#2ca02c", linewidth=1.1, label="CPI YoY")
+axes[1].set_ylabel("YoY %")
+axes[1].set_title("Headline CPI inflation vs the 2% target", fontsize=10)
+axes[1].legend(fontsize=8)
+
+axes[2].plot(unemp_df["timestamp"], unemp_df["value"], color="#9467bd", linewidth=1.1)
+axes[2].set_ylabel("%")
+axes[2].set_title("Unemployment rate (FRED, monthly)", fontsize=10)
+
+for ax in axes:
+    ax.grid(axis="y", alpha=0.3)
+    ax.set_xlim(pd.Timestamp("2009-01-01"), None)
+plt.tight_layout()
+plt.show()
+```
+
+## Cell 14 (markdown)
+
+---
+## 7. What's next
+
+`02_boc_rate_direction_experiment.ipynb` opens with a compact binary warm-up
+(*cut vs no cut*, Brier-scored — the copy-paste reference for naturally
+binary problems), then runs four predictors against the 3-way direction
+task at the canonical 28-day lead — the climatology floor, a fit-at-origin
+multinomial logistic regression on the covariates above, a
+direct-elicitation categorical LLMP, and an agentic BoC analyst — and
+compares them on RPS, one-vs-rest calibration, and the decision timeline.
+It closes the loop on the lead-time question with a T−28 vs T−1
+comparison: how much of each predictor's skill is anticipation, and how
+much is reading the market on the eve of the decision?
+
+Two components are deliberately deferred and have explicit seams in the
+code: grounding the LLM-based predictors in the Bank's own communications
+(press releases, Monetary Policy Reports), and an LLM evaluator that scores
+*reasoning alignment* between the agent's rationale and the Bank's published
+one. See the use-case `README.md` for the roadmap.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__02_boc_rate_direction_experiment.ipynb.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__02_boc_rate_direction_experiment.ipynb.md
new file mode 100644
index 0000000..f7909ba
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__02_boc_rate_direction_experiment.ipynb.md
@@ -0,0 +1,626 @@
+# Source: implementations/boc_rate_decisions/02_boc_rate_direction_experiment.ipynb
+
+kind: notebook
+
+## Cell 1 (markdown)
+
+# BoC Rate Decisions — 3-Way Direction Prediction Experiment
+
+This notebook runs the full predictor lineup on the discrete BoC task:
+**a probability distribution over {cut, hold, hike} at the next fixed
+announcement date, issued four weeks (28 days) before the announcement**,
+scored with the **Ranked Probability Score (RPS)**. Read
+`01_boc_data_exploration.ipynb` first for the problem framing, data layer,
+and cutoff-discipline walkthrough. A compact **binary warm-up** (*cut vs no
+cut*, Brier-scored) opens the experiment as a copy-paste reference for
+naturally binary problems — prediction-market-style questions — before the
+3-way main event.
+
+**Why a 28-day lead.** On the eve of a decision, the 2-year GoC yield has
+already absorbed the market consensus and the outcome is priced to
+near-certainty — a T−1 "forecast" mostly reads market expectations off a
+curve. Four weeks out, the decision is genuinely uncertain, so the skill
+being measured is *anticipation*. The eve-of-decision variant is kept as a
+diagnostic, and a dedicated section compares the two leads directly.
+
+**The leakage problem, discrete edition.** Frontier LLMs were trained on
+data that includes news coverage of every historical BoC decision. For
+backtest origins in 2010–2024, an LLM-based predictor may simply *remember*
+what the Bank decided — and unlike the continuous use cases, here
+memorisation is worth even more, because a single recalled label (cut /
+hold / hike) is the entire answer. Backtest RPS for LLMP and the agent is
+therefore an **upper bound on live skill**, useful for verifying the
+pipeline and calibration format, not for claiming forecasting ability. The
+conventional predictors (climatology, multinomial logistic) are blind to
+the future by construction, so their backtest scores are honest.
+
+The protected 2025–2026 eval window at the end is closer to (and partly
+beyond) current model training cutoffs — that comparison is fairer, and the
+budget-gated `evaluate()` harness keeps it honest.
+
+**What's here:**
+
+1. Setup and experiment config — smoke (3 origins) vs full (120 origins).
+2. Spec — loaded from YAML; ordered `categories` + explicit `origin_dates`.
+3. Binary warm-up — the compact Brier-scored reference, and the
+   RPS(K=2) ≡ Brier identity.
+4. Predictors — climatology, multinomial logistic, categorical LLMP,
+   agentic analyst.
+5. Backtest — cached on disk; RPS leaderboard with skill scores.
+6. Skill vs lead time — the canonical T−28 forecast against the T−1
+   eve-of-decision diagnostic.
+7. Predicted distributions over time, by method.
+8. Decision panels — context, predictions, rationales, and the realised
+   outcome per meeting.
+9. Protected eval — budget-gated 2025–2026 window.
+
+## Cell 2 (markdown)
+
+---
+## 1. Setup
+
+The analytical code lives in modules alongside this notebook:
+
+- `data.py` — registers the target rate, derived decision series, and covariates.
+- `predictors/` — the (multinomial) logistic baseline and the BoC LLMP recipes.
+- `analyst_agent/` — the agentic BoC analyst (prompt builder + configs).
+- `analysis.py` / `plots.py` — score leaderboard, calibration, timeline.
+
+**Five specs, two jobs.** There is one *pedagogical backtest* on the deep
+pre-2025 history (where the cutoff-safe baselines shine and the LLM/agent rows
+are an honest-to-goodness *upper bound* — they may be reciting memorised
+decisions) and one *honest eval* on the scarce post-cutoff window (the only
+place the LLM/agent scores reflect forecasting). The binary warm-up and the
+eve-of-decision spec are small single-purpose illustrations.
+
+| Spec file | Role | Lead | Origins | Window | Cutoff posture |
+|---|---|---|---|---|---|
+| `boc_rate_direction_smoke.yaml` | fast dev loop (a slice of the full backtest) | T−28 | 3 | 2024 | pedagogical |
+| `boc_rate_direction_backtest.yaml` | **canonical backtest** (3 easing + 3 tightening cycles) | T−28 | 120 | 2010–2024 | pedagogical / LLM upper-bound |
+| `boc_rate_direction_eval.yaml` | **protected eval — the honest scoreboard** (`max_runs: 5`) | T−28 | 12 | 2025–Jun 2026 | post-cutoff / honest |
+| `boc_rate_cut_smoke.yaml` | binary reference for the §3 warm-up (Brier-scored) | T−1 | 3 | 2024 | illustrative |
+| `boc_rate_direction_eve_smoke.yaml` | eve-of-decision diagnostic for the §7 lead comparison | T−1 | 3 | 2024 | illustrative |
+
+`EXPERIMENT_CONFIG` (next code cell) swaps the main backtest between the 3-origin
+smoke slice and the 120-origin full window; the warm-up and eve specs are always
+the small ones. A use-case test (`test_specs.py`) asserts every origin list stays
+consistent with `meeting_schedule.yaml` at each spec's own lead.
+
+Populate the data cache once before running:
+
+```bash
+uv run python scripts/fetch_boc.py
+```
+
+## Cell 3 (code)
+
+```python
+from __future__ import annotations
+
+import warnings
+from datetime import datetime, timezone
+from pathlib import Path
+
+import matplotlib.pyplot as plt
+import yaml
+from dotenv import load_dotenv
+from IPython.display import Markdown, display  # noqa: A004
+
+
+warnings.filterwarnings("ignore")
+
+ROOT = Path.cwd().resolve().parents[1]
+load_dotenv(ROOT / ".env")
+
+from aieng.forecasting.evaluation import BacktestSpec, cached_backtest, describe_spec
+from boc_rate_decisions.analysis import (
+    decision_panel_data,
+    panel_rationales_markdown,
+    predictions_to_frame,
+    score_leaderboard,
+)
+from boc_rate_decisions.data import (
+    DIRECTION_SERIES_ID,
+    RATE_CUT_EVENT_SERIES_ID,
+    TARGET_RATE_SERIES_ID,
+    build_boc_service,
+)
+from boc_rate_decisions.plots import plot_decision_panel, plot_probability_timeline
+
+
+STATCAN_CACHE = ROOT / "data" / "statcan"
+FRED_CACHE = ROOT / "data" / "fred"
+PREDICTIONS_DIR = ROOT / "data" / "predictions"
+SPECS_DIR = ROOT / "implementations" / "boc_rate_decisions" / "specs"
+
+svc = build_boc_service(statcan_cache_dir=STATCAN_CACHE, fred_cache_dir=FRED_CACHE)
+
+_as_of = datetime.now(tz=timezone.utc).replace(tzinfo=None)
+direction_df = svc.get_series(DIRECTION_SERIES_ID, as_of=_as_of)
+event_df = svc.get_series(RATE_CUT_EVENT_SERIES_ID, as_of=_as_of)
+rate_df = svc.get_series(TARGET_RATE_SERIES_ID, as_of=_as_of)  # daily target rate, for panel context
+n_cuts = int((direction_df["value"] == -1.0).sum())
+n_holds = int((direction_df["value"] == 0.0).sum())
+n_hikes = int((direction_df["value"] == 1.0).sum())
+print(f"Direction series: {len(direction_df)} resolved meetings — {n_cuts} cuts, {n_holds} holds, {n_hikes} hikes")
+```
+
+## Cell 4 (code)
+
+```python
+# ── Experiment configuration ──────────────────────────────────────────────────
+# EXPERIMENT_CONFIG sets the size of the *main* direction backtest (sections 5-9):
+#
+#   "smoke"  3 origins — a slice of the full backtest (one hold, two cuts in 2024)
+#            fast dev loop: ~3 LLM calls per LLM-based predictor
+#   "full"   120 origins, 2010-2024 — the canonical pre-2025 backtest
+#            first LLM-based run makes 120 calls per predictor before caching
+#
+# The binary warm-up (§3) and the eve-of-decision diagnostic (§7) are always the
+# small single-purpose specs — they illustrate a format and a lead-time point,
+# so a 120-origin variant would add cost without adding clarity. The honest
+# post-2025 scoreboard is the separate protected eval in §10.
+
+EXPERIMENT_CONFIG = "smoke"
+
+_BACKTEST_SPEC_FILES = {
+    "smoke": "boc_rate_direction_smoke.yaml",
+    "full": "boc_rate_direction_backtest.yaml",
+}
+_BACKTEST_SPEC_FILE = _BACKTEST_SPEC_FILES[EXPERIMENT_CONFIG]
+# Cache key for artefacts under data/predictions/<spec_id>/<predictor_id>.yaml
+BACKTEST_SPEC_ID = f"boc_rate_direction_{EXPERIMENT_CONFIG}"
+
+# The binary warm-up (§3) and eve diagnostic (§7) each use a single fast spec.
+_WARMUP_SPEC_FILE = "boc_rate_cut_smoke.yaml"
+WARMUP_SPEC_ID = "boc_rate_cut_smoke"
+_EVE_SPEC_FILE = "boc_rate_direction_eve_smoke.yaml"
+EVE_SPEC_ID = "boc_rate_direction_eve_smoke"
+
+print(f"Config: {EXPERIMENT_CONFIG!r}  →  {_BACKTEST_SPEC_FILE}")
+print(f"  warm-up: {_WARMUP_SPEC_FILE}   eve diagnostic: {_EVE_SPEC_FILE}")
+```
+
+## Cell 5 (markdown)
+
+---
+## 2. The backtest spec
+
+Three things distinguish this spec from the continuous use cases:
+
+- **`payload_type: categorical`** on the task — predictors must return a
+  `CategoricalForecast(probabilities={...})` and the harness scores with
+  RPS instead of CRPS.
+- **`categories`** — the task declares the *ordered* outcome set and the
+  mapping to series values: `cut(−1) < hold(0) < hike(+1)`. The order is
+  what makes RPS distance-sensitive: mass on `hike` when the Bank cuts is
+  penalised through *two* cumulative thresholds, mass on `hold` through one.
+- **`origin_dates`** — BoC meetings are an irregular calendar (eight per
+  year, unevenly spaced), so the spec lists every forecast origin explicitly
+  (`announcement_date − 28 days`) instead of deriving origins from a stride.
+  Scheduled meetings are never closer than 35 days apart, so the previous
+  decision is always visible at the origin.
+
+## Cell 6 (code)
+
+```python
+with (SPECS_DIR / _BACKTEST_SPEC_FILE).open() as f:
+    backtest_spec = BacktestSpec.model_validate(yaml.safe_load(f))
+
+print(describe_spec(backtest_spec, data_service=svc))
+```
+
+## Cell 7 (markdown)
+
+---
+## 3. Warm-up: the binary special case (a copy-paste reference)
+
+Many real prediction problems are naturally binary — *will X happen by
+date D?* — and prediction markets trade exactly that contract. Before the
+3-way main event, this section runs the same machinery on the binary view
+of the problem (*cut vs no cut*, `payload_type: binary`, Brier-scored) so
+you have a minimal, complete reference to copy for your own binary tasks:
+
+- the task: `specs/boc_rate_cut_smoke.yaml`,
+- the floor baseline: `HistoricalFrequencyPredictor` (the constant base rate),
+- the LLMP recipe: `build_llmp_binary` wrapping
+  `BinaryProbabilityLLMPredictor`,
+- the conventional model: the same `BoCLogisticPredictor`, which dispatches
+  to plain logistic regression on binary tasks.
+
+The binary reference stays at the **T−1 (eve-of-decision) lead**: its job is
+to demonstrate the payload and scoring format in the fewest moving parts,
+not the lead-time question — that analysis belongs to the 3-way experiment
+below.
+
+**Why the two framings agree.** The unnormalized RPS over \(K\) ordered
+categories is a sum of \(K-1\) cumulative binary Brier scores, so for
+\(K = 2\) it *is* the Brier score \((p - y)^2\). (Brier's original 1950
+multi-category score is twice this — both conventions appear in the
+literature; this codebase uses the cumulative form everywhere.) The cell
+below verifies the identity numerically with `compute_rps` and
+`compute_brier_score` — the binary problem is the \(K{=}2\) corner of the
+categorical machinery, which is exactly why the 3-way framing is the more
+general reference.
+
+## Cell 8 (code)
+
+```python
+from aieng.forecasting.evaluation import compute_brier_score, compute_rps
+from aieng.forecasting.methods import HistoricalFrequencyPredictor
+from boc_rate_decisions.predictors import BoCLogisticPredictor, build_llmp_binary
+
+
+# RPS(K=2) == Brier identity: categories ordered [no-cut, cut].
+for p_cut, outcome in [(0.1, 0), (0.3, 1), (0.85, 1)]:
+    rps = compute_rps([[1.0 - p_cut, p_cut]], [outcome])
+    brier = compute_brier_score([p_cut], [float(outcome)])
+    print(f"P(cut)={p_cut:.2f}, outcome={outcome}:  RPS(K=2) = {rps:.4f}  ==  Brier = {brier:.4f}")
+
+# The compact binary experiment: floor baseline + logistic + LLMP, Brier-scored.
+with (SPECS_DIR / _WARMUP_SPEC_FILE).open() as f:
+    warmup_spec = BacktestSpec.model_validate(yaml.safe_load(f))
+
+warmup_predictors = [HistoricalFrequencyPredictor(), BoCLogisticPredictor(), build_llmp_binary()]
+warmup_results = {}
+for predictor in warmup_predictors:
+    warmup_results[predictor.predictor_id] = cached_backtest(
+        predictor=predictor, spec=warmup_spec, spec_id=WARMUP_SPEC_ID, data_service=svc, store_dir=PREDICTIONS_DIR
+    )
+
+print()
+print(score_leaderboard(warmup_results, reference_id="historical_frequency").to_string(index=False))
+```
+
+## Cell 9 (markdown)
+
+---
+## 4. Predictors
+
+Four predictors spanning the methodology spectrum, all implementing the same
+`Predictor` API:
+
+| Group | Predictor | What it sees | Notes |
+|---|---|---|---|
+| Floor baseline | `CategoricalFrequencyPredictor` | Past outcomes only | Constant climatological distribution; the bar every other predictor must clear |
+| Conventional | `BoCLogisticPredictor` | Leak-safe macro features | Multinomial logistic regression fit at every origin: yield spread, rate momentum, inflation gap, unemployment momentum |
+| LLMP | `CategoricalProbabilityLLMPredictor` | Outcome history + prompt context | Direct distribution elicitation in one structured call; no tools, no covariates |
+| Agentic | `AgentPredictor` (BoC analyst) | Rate path + outcome history + **the same macro features as the logistic model** | Reasons over the evidence; emits a cut/hold/hike distribution + rationale + key signals |
+
+The agent and the logistic model deliberately receive **identical macro
+indicators** (the agent's prompt builder imports the same feature function),
+making this a clean comparison of *conventional fitting* vs *LLM reasoning*
+over the same information set. The LLMP variant sees less — just the
+labelled outcome sequence (`2024-04-10: hold`, …) and a description — which
+isolates the value of the covariates.
+
+## Cell 10 (code)
+
+```python
+from aieng.forecasting.methods import CategoricalFrequencyPredictor
+from boc_rate_decisions.analyst_agent import build_boc_agent_predictor, build_boc_basic_config
+from boc_rate_decisions.predictors import build_llmp_direction
+
+
+# Model for the LLM-based predictors (LLMP + agent). Flash-lite is the fast/cheap
+# default so a first Run All stays light; gemini-3.5-flash reasons noticeably
+# better at higher cost/latency. To switch, swap which MODEL line is commented out.
+MODEL = "gemini-3.1-flash-lite-preview"  # fast/cheap default
+# MODEL = "gemini-3.5-flash"             # stronger reasoning, higher cost/slower
+
+climatology = CategoricalFrequencyPredictor()
+logistic = BoCLogisticPredictor()  # dispatches to multinomial on categorical tasks
+llmp = build_llmp_direction(model=MODEL, reasoning_effort=None)
+agent = build_boc_agent_predictor(build_boc_basic_config(model=MODEL))
+
+# News-grounded agent variant (web search with temporal cutoffs). Leakage
+# risk is higher on historical dates; enable deliberately, not by default.
+# from boc_rate_decisions.analyst_agent import build_boc_news_config
+# agent_news = build_boc_agent_predictor(build_boc_news_config(model=MODEL))
+
+all_predictors = [climatology, logistic, llmp, agent]
+
+PREDICTOR_COLORS: dict[str, str] = {
+    climatology.predictor_id: "#7f7f7f",
+    logistic.predictor_id: "#1f77b4",
+    llmp.predictor_id: "#d62728",
+    agent.predictor_id: "#ff7f0e",
+}
+PREDICTOR_LABELS: dict[str, str] = {
+    climatology.predictor_id: "Climatology",
+    logistic.predictor_id: "Multinomial logistic",
+    llmp.predictor_id: "LLMP direction",
+    agent.predictor_id: "Agent (basic)",
+}
+
+for p in all_predictors:
+    print(f"  {p.predictor_id}")
+```
+
+## Cell 11 (markdown)
+
+---
+## 5. Backtest (cached on disk)
+
+`cached_backtest` writes each `BacktestResult` to
+`data/predictions/<spec_id>/<predictor_id>.yaml` and reuses it on subsequent
+runs; pass `force_refresh=True` to recompute. The climatology and logistic
+predictors are free; the LLMP and agent make one LLM call per origin on a
+first run (3 calls under `smoke`, 120 under `full`).
+
+**Reading the scores.** RPS accumulates squared error over the cumulative
+distribution across the ordered categories (cut < hold < hike). A confident,
+correct forecast scores near 0; a confident forecast on the *adjacent*
+category costs ~1; a confident forecast on the *opposite tail* (hike when the
+Bank cuts) costs ~2. That sensitivity to ordering is what separates RPS from
+a plain multi-class Brier score. Mean RPS over the window rewards predictors
+that keep mass on hold through the long quiet stretches *and* shift it toward
+the correct tail in time for cycle turns, four weeks before each announcement
+while the outcome is still genuinely open. Remember the asymmetry from the
+intro: only the climatology and logistic rows of the leaderboard are leakage-free.
+
+## Cell 12 (code)
+
+```python
+from aieng.forecasting.evaluation.backtest import BacktestResult
+
+
+results: dict[str, BacktestResult] = {}
+
+for predictor in all_predictors:
+    print(f"Running {predictor.predictor_id} ...", flush=True)
+    results[predictor.predictor_id] = cached_backtest(
+        predictor=predictor,
+        spec=backtest_spec,
+        spec_id=BACKTEST_SPEC_ID,
+        data_service=svc,
+        store_dir=PREDICTIONS_DIR,
+        force_refresh=True,  # recomputes on every run; drop this to reuse the on-disk cache
+    )
+    r = results[predictor.predictor_id]
+    print(f"  mean RPS = {r.mean_score:.4f}  ({len(r.predictions)} predictions, {r.skipped_origins} skipped)")
+```
+
+## Cell 13 (markdown)
+
+---
+## 6. RPS leaderboard
+
+`skill_vs_reference` is the skill score against the
+`CategoricalFrequencyPredictor`: positive = beats the climatology, 0 =
+matches it, negative = worse than knowing nothing. With holds at ~76%, the
+climatological forecast is a harder bar than it looks, and conditions-blind
+models struggle to clear it; most of the score separation happens at the
+handful of cycle-turn meetings.
+
+> ⚠️ **Leakage caveat — read before comparing.** Gemini's training cutoff is ~January 2025, so on this **pre-2025 backtest** the LLM-Process and agent rows may be *reciting memorised rate decisions* rather than forecasting. Treat their scores as an **upper bound**, not live skill — the cutoff-safe baselines (climatology, logistic) are the honest comparison here. A fair LLM evaluation needs **post-cutoff / prospective** origins (see §10's protected post-2025 eval and the energy reference).
+
+> **The honest LLM/agent comparison is the post-2025 protected eval in §10**, which now runs by default — the 2010–24 leaderboard here is pedagogical (rich history for the cutoff-safe baselines; an upper bound for the LLM rows).
+
+## Cell 14 (code)
+
+```python
+board = score_leaderboard(results, reference_id=climatology.predictor_id)
+board["label"] = board["predictor_id"].map(PREDICTOR_LABELS)
+print(board.set_index("label").drop(columns="predictor_id").to_string())
+
+predictions_df = predictions_to_frame(results, direction_df)
+print(f"\nTidy prediction rows: {len(predictions_df)}")
+print(predictions_df[["predictor_id", "meeting_date", "p_cut", "p_hold", "p_hike", "outcome_label", "score"]].head())
+```
+
+## Cell 15 (markdown)
+
+---
+## 7. Skill vs lead time: T−28 vs the eve of the decision
+
+The same meetings, the same predictors, two information states: the
+canonical four-week lead and the eve-of-decision (T−1) diagnostic
+(`boc_rate_direction_eve_smoke.yaml`). Three things to expect:
+
+- **Climatology is lead-invariant** — it conditions on nothing, so its RPS
+  is identical at both leads. It anchors the comparison.
+- **Conditioning predictors should improve toward T−1** as the market
+  converges: the yield spread is far more decisive the day before a
+  decision than four weeks out. The *gap* between a predictor's T−28 and
+  T−1 scores is roughly "how much of its skill is anticipation vs reading
+  the market's final answer".
+- **For the LLM-based rows, interpret with the leakage caveat** — on
+  historical origins a memorised outcome inflates both leads equally, which
+  itself is diagnostic: a genuine forecaster should get *worse* as the lead
+  grows; a memoriser won't.
+
+## Cell 16 (code)
+
+```python
+with (SPECS_DIR / _EVE_SPEC_FILE).open() as f:
+    eve_spec = BacktestSpec.model_validate(yaml.safe_load(f))
+
+eve_results: dict[str, BacktestResult] = {}
+for predictor in all_predictors:
+    eve_results[predictor.predictor_id] = cached_backtest(
+        predictor=predictor,
+        spec=eve_spec,
+        spec_id=EVE_SPEC_ID,
+        data_service=svc,
+        store_dir=PREDICTIONS_DIR,
+    )
+
+lead_comparison = score_leaderboard(results)[["predictor_id", "mean_score"]].rename(columns={"mean_score": "rps_t28"})
+eve_board = score_leaderboard(eve_results)[["predictor_id", "mean_score"]].rename(columns={"mean_score": "rps_t1"})
+lead_comparison = lead_comparison.merge(eve_board, on="predictor_id")
+lead_comparison["anticipation_gap"] = (lead_comparison["rps_t28"] - lead_comparison["rps_t1"]).round(4)
+lead_comparison["label"] = lead_comparison["predictor_id"].map(PREDICTOR_LABELS)
+print(lead_comparison.set_index("label").drop(columns="predictor_id").to_string())
+print()
+print("anticipation_gap = RPS(T-28) - RPS(T-1): how much score the predictor")
+print("recovers as the market converges. ~0 for climatology by construction.")
+```
+
+## Cell 17 (markdown)
+
+---
+## 8. Predicted distributions over time, by method
+
+One stacked-area panel per method. Within a panel the three category
+probabilities **sum to 1 at every meeting**, so the bands show how each method
+moves probability mass between cut (red), hold (grey), and hike (teal) across
+the backtest. The marker strip along the top of each panel is the **realised**
+outcome at every meeting — filled and colour-coded when resolved, hollow when
+it hasn't resolved yet.
+
+Read each panel as: *does the method shift mass onto the right band, and in
+time?* Climatology is essentially flat by construction (the bar to clear); a
+good conditional method visibly tilts toward the cut/hike bands as cycle turns
+approach, while the outcome strip tells you whether that tilt was right.
+
+## Cell 18 (code)
+
+```python
+fig, _ = plot_probability_timeline(predictions_df, labels=PREDICTOR_LABELS)
+plt.show()
+```
+
+## Cell 19 (markdown)
+
+---
+## 9. Decision panels: prediction vs. outcome
+
+The leaderboard compresses each forecast to a single number. A **decision
+panel** unpacks one meeting so you can see what every method actually said and
+why:
+
+- **Context strip** — the policy-rate path over the year leading into the
+  meeting (dotted line = forecast origin, solid coloured line = the
+  announcement), plus the rate at origin and the prior decision.
+- **Probability bars** — each method's predicted cut/hold/hike distribution.
+  The **★** marks the category that actually happened, and the bar for the
+  realised outcome is outlined. Each row also shows that method's RPS.
+- **Rationales** — rendered as markdown beneath the figure (for the methods
+  that produce one: the agent and the LLMP), so the reasoning stays readable.
+
+We show the most recent meeting first; `show_meeting(...)` below re-renders the
+panel for any other announcement and its rationales.
+
+## Cell 20 (code)
+
+```python
+# Most recent meeting in the backtest, all methods at a glance.
+panel = decision_panel_data(results, direction_df)
+fig, _ = plot_decision_panel(panel, rate_df, labels=PREDICTOR_LABELS)
+plt.show()
+
+rationale_md = panel_rationales_markdown(panel, PREDICTOR_LABELS)
+if rationale_md:
+    display(Markdown(rationale_md))
+```
+
+## Cell 21 (markdown)
+
+### Viewing other meetings
+
+`show_meeting(date)` re-renders the panel for any announcement and prints the
+full (untruncated) rationales and key signals beneath it.
+
+These rationale fields are also the seam for the planned **reasoning-alignment**
+evaluation: the Bank publishes its own explanation with every decision, so an
+LLM judge could later score whether a method was right *for the right reasons*
+— most valuable exactly where the backtest score is least trustworthy
+(historical origins with training-data leakage).
+
+## Cell 22 (code)
+
+```python
+def show_meeting(meeting_date: str) -> None:
+    """Render a meeting's decision panel, then its rationales as markdown."""
+    panel = decision_panel_data(results, direction_df, meeting_date=meeting_date)
+    fig, _ = plot_decision_panel(panel, rate_df, labels=PREDICTOR_LABELS)
+    plt.show()
+    rationale_md = panel_rationales_markdown(panel, PREDICTOR_LABELS)
+    if rationale_md:
+        display(Markdown(rationale_md))
+
+
+available_meetings = sorted({str(d.date()) for d in predictions_df["meeting_date"]})  # set comprehension dedupes
+print("Available meetings:", ", ".join(available_meetings))
+
+# Inspect any meeting by date, e.g. the first origin in the window:
+show_meeting(available_meetings[0])
+```
+
+## Cell 23 (markdown)
+
+---
+## 10. The honest scoreboard — protected post-2025 eval
+
+**This is the result that counts for the LLM/agent rows.** `specs/boc_rate_direction_eval.yaml`
+covers the 12 announcements from January 2025 through June 2026 at the canonical
+28-day lead: the tail of the easing cycle (cuts in Jan, Mar, Sep, Oct 2025)
+followed by an extended hold at 2.25%. These origins are **at/after the model's
+~January 2025 training cutoff**, so — unlike the 2010–24 backtest above — the
+LLM-Process and agent scores here reflect *forecasting*, not recall. (The window
+contains no hikes, so it can't reward hike discrimination, but RPS still penalises
+mass wasted on the hike tail.)
+
+It runs by default and **unbudgeted** (`tracker=None`) so you always see the
+scoreboard. The spec carries `max_runs: 5`; in a real competition you'd pass an
+`EvalTracker` to enforce that budget across sessions — the scarcity is the point:
+an eval you can re-run freely becomes another backtest to over-fit.
+
+## Cell 24 (code)
+
+```python
+from aieng.forecasting.evaluation import EvalSpec, evaluate
+
+
+# The post-2025 protected eval is the honest scoreboard for the LLM/agent rows, so
+# it runs by default. We run it UNBUDGETED here (tracker=None) so the result always
+# shows. In a real competition you'd enforce the spec's max_runs budget across
+# sessions — an eval you can re-run freely becomes another backtest to over-fit:
+#     from aieng.forecasting.evaluation import EvalTracker
+#     tracker = EvalTracker(ROOT / "data" / "eval_runs.yaml")  # then pass tracker=tracker below
+with (SPECS_DIR / "boc_rate_direction_eval.yaml").open() as f:
+    eval_spec = EvalSpec.model_validate(yaml.safe_load(f))
+
+eval_results = {}
+for predictor in all_predictors:
+    print(f"Evaluating {predictor.predictor_id} on the post-2025 window ...", flush=True)
+    eval_results[predictor.predictor_id] = evaluate(predictor=predictor, spec=eval_spec, data_service=svc, tracker=None)
+    r = eval_results[predictor.predictor_id]
+    print(f"  {PREDICTOR_LABELS[predictor.predictor_id]:20s} mean RPS = {r.mean_score:.4f}")
+
+eval_board = score_leaderboard(eval_results, reference_id=climatology.predictor_id)
+print()
+print(eval_board.to_string(index=False))
+```
+
+## Cell 25 (markdown)
+
+---
+## 11. What's next — the deferred components
+
+This notebook completes the **quantitative** version of the problem. Three
+components are deliberately deferred, and the code leaves explicit seams for
+each:
+
+1. **BoC communications as context.** Every decision comes with a press
+   release, and four per year come with a full Monetary Policy Report. The
+   `CategoricalProbabilityLLMPredictorConfig.user_prompt_suffix` hook and the
+   `build_boc_news_config` retrieval sub-agent are the insertion points for
+   report-grounded variants once the document ingestion work (Track 2)
+   lands. The key engineering constraint carries over unchanged: documents
+   must be filtered by `released_at`, exactly like series data.
+
+2. **Reasoning-alignment evaluation.** Both the agent and the LLMP now emit a
+   `rationale` per meeting (shown in the decision panels in section 9). An LLM
+   evaluator comparing that rationale against the Bank's own published
+   explanation would complement the RPS with a *process* metric — was the
+   predictor right for the right reasons? This matters most precisely where
+   the score is least trustworthy (historical origins with leakage).
+
+3. **Live forecasting.** The cleanest evaluation needs no leakage analysis
+   at all: forecast the *next* announcement before it happens. The eval spec
+   ends at June 2026; extending `meeting_schedule.yaml` with the Bank's
+   published 2027 calendar and issuing forecasts the day before each
+   announcement turns this use case into a standing live experiment — eight
+   genuinely out-of-sample data points per year.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__03_rationale_alignment.ipynb.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__03_rationale_alignment.ipynb.md
new file mode 100644
index 0000000..b7f1e49
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__03_rationale_alignment.ipynb.md
@@ -0,0 +1,264 @@
+# Source: implementations/boc_rate_decisions/03_rationale_alignment.ipynb
+
+kind: notebook
+
+## Cell 1 (markdown)
+
+# BoC — Rationale-alignment evaluation (LLM-as-a-judge, on the side)
+
+This notebook is a **side-channel** evaluation: it does not touch the resolution
+loop. It is **trace-driven** — the Langfuse **trace** is the canonical record of
+what each forecaster said. For every trace it reads the structured forecast the
+predictor stamped on it at run time (its `rationale`, cited signals, and predicted
+distribution), compares that rationale to the Bank of Canada's **own** published
+press release for that meeting, and **pushes** a structured *alignment* verdict
+back to the trace as Langfuse scores — complementing the accuracy score (RPS)
+with a *process* metric: was the forecaster right **for the right reasons**?
+
+So evaluation **reads from and writes to Langfuse**, not a local prediction cache.
+
+**Prerequisites**
+1. Langfuse configured (`LANGFUSE_PUBLIC_KEY` / `LANGFUSE_SECRET_KEY` in `.env`).
+2. Press releases cached: `uv run python scripts/fetch_boc_press_releases.py`
+   (covers every scheduled date back to 2009).
+3. The generation cell in section 2 runs the reasoning predictors live, so a
+   fresh trace exists for every meeting it scores — no prior traced run needed.
+
+**Cutoff posture.** This notebook runs on the **protected post-2025 eval window**
+(Jan 2025 – Jun 2026), the same honest origins as notebook 02 §10. They sit
+at/after the model's ~January 2025 training cutoff, so the rationale being judged
+reflects genuine reasoning rather than a recalled outcome — the alignment verdict
+is as clean as the accuracy score there. (Pointing this at a pre-2025 backtest
+would inherit the same memorisation caveat as the accuracy backtest.)
+
+## Cell 2 (markdown)
+
+---
+## 1. Setup
+
+## Cell 3 (code)
+
+```python
+from __future__ import annotations
+
+import warnings
+from datetime import datetime, timezone
+from pathlib import Path
+
+import pandas as pd
+import yaml
+from dotenv import load_dotenv
+from IPython.display import Markdown, display  # noqa: A004
+
+
+warnings.filterwarnings("ignore")
+ROOT = Path.cwd().resolve().parents[1]
+load_dotenv(ROOT / ".env")
+
+from aieng.forecasting.evaluation import EvalSpec, evaluate
+from boc_rate_decisions.data import DIRECTION_SERIES_ID, build_boc_service
+from boc_rate_decisions.press_releases import PressReleaseStore
+from boc_rate_decisions.rationale_eval import evaluate_result_alignment
+
+
+STATCAN_CACHE = ROOT / "data" / "statcan"
+FRED_CACHE = ROOT / "data" / "fred"
+SPECS_DIR = ROOT / "implementations" / "boc_rate_decisions" / "specs"
+# Anchor the press-release cache to the repo root (notebook cwd is the use-case dir).
+PRESS_RELEASE_CACHE = ROOT / "data" / "reports" / "boc_press_releases"
+
+svc = build_boc_service(statcan_cache_dir=STATCAN_CACHE, fred_cache_dir=FRED_CACHE)
+_as_of = datetime.now(tz=timezone.utc).replace(tzinfo=None)
+direction_df = svc.get_series(DIRECTION_SERIES_ID, as_of=_as_of)
+
+store = PressReleaseStore.from_cache(PRESS_RELEASE_CACHE)
+print(f"Cached press releases: {len(store)}")
+if len(store) == 0:
+    print("No releases cached — run:  uv run python scripts/fetch_boc_press_releases.py")
+```
+
+## Cell 4 (markdown)
+
+---
+## 2. Generate the traced runs to evaluate
+
+Only methods that produce a `rationale` (the agent and the reasoning-enabled
+LLMP) can be alignment-scored; the baselines are skipped automatically.
+
+The judge reads each forecast **from its Langfuse trace**, so a traced run must
+exist. `evaluate()` runs the predictors live over the protected post-2025 eval
+window (`boc_rate_direction_eval.yaml`, 12 meetings Jan 2025 – Jun 2026) — the
+same honest origins as notebook 02 §10 — emitting a fresh trace per origin, each
+stamped with the structured forecast. There's no cache to go stale: every run
+re-traces, so section 3 always has live traces to read.
+
+> Running these two reasoning predictors over all 12 origins is ~24 model calls;
+> it re-runs each time the cell executes. The accuracy scoreboard is computed
+> separately in notebook 02 §10 (also unbudgeted there); this notebook only adds
+> the *process* (alignment) verdict on top of the traces generated here.
+
+## Cell 5 (code)
+
+```python
+from boc_rate_decisions.analyst_agent import build_boc_agent_predictor, build_boc_basic_config
+from boc_rate_decisions.predictors import build_llmp_direction
+
+
+# Model for BOTH reasoning predictors. Flash-lite is the fast/cheap default; on
+# this window gemini-3.5-flash reasons noticeably better at higher cost/latency
+# (see the §5 note). To switch, swap which MODEL line is commented out. The LLM-as-judge
+# in §3 always uses the advanced model regardless of this choice.
+MODEL = "gemini-3.1-flash-lite-preview"  # fast/cheap default
+# MODEL = "gemini-3.5-flash"             # stronger reasoning, higher cost/slower
+
+# Run the reasoning predictors over the PROTECTED POST-2025 eval window — the same
+# honest origins as notebook 02 §10 (boc_rate_direction_eval.yaml: 12 meetings,
+# Jan 2025 – Jun 2026, at/after the model's ~Jan 2025 cutoff). evaluate() runs each
+# predictor live, emitting a fresh Langfuse trace per origin (each stamped with the
+# structured forecast). Unlike cached_backtest there's no cache to go stale: every
+# run re-traces, so the judge in section 3 always has live traces to read.
+with (SPECS_DIR / "boc_rate_direction_eval.yaml").open() as f:
+    spec = EvalSpec.model_validate(yaml.safe_load(f))
+
+llmp = build_llmp_direction(model=MODEL, reasoning_effort=None)
+agent = build_boc_agent_predictor(build_boc_basic_config(model=MODEL))
+PREDICTOR_LABELS = {llmp.predictor_id: "LLMP direction", agent.predictor_id: "Agent (basic)"}
+
+results = {}
+for predictor in [llmp, agent]:
+    # tracker=None: a side-channel eval runs unbudgeted and does not spend the
+    # spec's max_runs accuracy-eval budget (mirrors notebook 02 §10).
+    results[predictor.predictor_id] = evaluate(predictor=predictor, spec=spec, data_service=svc, tracker=None)
+print(f"Generated results ({MODEL}):", ", ".join(PREDICTOR_LABELS[p] for p in results))
+```
+
+## Cell 6 (markdown)
+
+---
+## 3. Judge each trace and push scores
+
+For every trace the evaluator fetches it from Langfuse (polling briefly, since
+ingestion is async), reads the stamped forecast, and runs one LLM-as-judge call
+(advanced model). The judge scores *alignment only*; correctness comes from the
+realised decision, and the two combine into `right_for_right_reasons`. With
+`PUSH_TO_LANGFUSE = True` the verdict is written straight back to the trace as a
+numeric `rationale_alignment` score and a categorical `right_for_right_reasons`
+score, so it shows up alongside the trace in the Langfuse UI.
+
+## Cell 7 (code)
+
+```python
+PUSH_TO_LANGFUSE = True  # write rationale_alignment + right_for_right_reasons scores back to each trace
+
+frames = [
+    evaluate_result_alignment(result, store, direction_df, push_to_langfuse=PUSH_TO_LANGFUSE)
+    for result in results.values()
+]
+nonempty = [f for f in frames if not f.empty]
+alignment = pd.concat(nonempty, ignore_index=True) if nonempty else pd.DataFrame()
+
+if alignment.empty:
+    print(
+        "Scored 0 forecasts. Check that (1) Langfuse tracing is configured so the section 2 run emitted "
+        "traces (LANGFUSE_* keys in .env), and (2) press releases are cached for these meetings "
+        "(run scripts/fetch_boc_press_releases.py)."
+    )
+else:
+    alignment["label"] = alignment["predictor_id"].map(PREDICTOR_LABELS)
+    print(f"Scored {len(alignment)} rationale-bearing forecast(s).\n")
+    summary = alignment.groupby("label").agg(
+        n=("alignment_score", "size"),
+        mean_alignment=("alignment_score", "mean"),
+        correct_aligned=("right_for_right_reasons", lambda s: int((s == "correct_aligned").sum())),
+    )
+    print(summary.to_string())
+```
+
+## Cell 9 (markdown)
+
+---
+## 4. Per-meeting verdicts
+
+Rendered as markdown (not crammed into a figure). Each verdict links to its
+Langfuse trace when one is available.
+
+## Cell 10 (code)
+
+```python
+if alignment.empty:
+    print("Nothing to show — see the message above.")
+else:
+    for _, row in alignment.sort_values(["meeting_date", "label"]).iterrows():
+        signals = ", ".join(row["key_signal_overlap"]) if row["key_signal_overlap"] else "—"
+        trace = f"  ·  [trace]({row['langfuse_trace_url']})" if row.get("langfuse_trace_url") else ""
+        display(
+            Markdown(
+                f"**{row['label']} — {row['meeting_date'].date()}**{trace}  \n"
+                f"predicted **{row['predicted_label']}** · realised **{row['realized_label']}** · "
+                f"alignment **{row['alignment_score']:.2f}** · _{row['right_for_right_reasons']}_\n\n"
+                f"Signal overlap: {signals}\n\n"
+                f"{row['justification']}\n\n---"
+            )
+        )
+```
+
+## Cell 11 (markdown)
+
+---
+## 5. Langfuse scores — review
+
+The `rationale_alignment` and `right_for_right_reasons` scores were pushed to each
+trace in section 3 (when `PUSH_TO_LANGFUSE = True`). This table summarises what
+landed and links to each trace, so the verdicts are one click from the traces and
+dashboards — a step toward closing the agent feedback loop. The **Result** column
+(✅/❌) marks whether the predicted direction matched the actual decision —
+*accuracy*, distinct from *alignment* (was the reasoning sound), so you can spot
+the revealing cases: right for the wrong reasons (✅ + low alignment) and wrong
+for sound reasons (❌ + high alignment).
+
+> **Read the numbers with the window in mind.** This is **11 meetings per method**
+> (Jan 2025 – Jun 2026) with **no hikes** — mostly holds and a few cuts. That's
+> enough to *see* a model gap (the default `gemini-3.1-flash-lite-preview` reasons
+> visibly worse than `gemini-3.5-flash` — flip `MODEL` in §2 to compare), but too
+> small to *rank* models with confidence. Treat it as directional, not decisive.
+
+## Cell 12 (code)
+
+```python
+if alignment.empty:
+    print("Nothing scored — see section 3.")
+else:
+    n_total = len(alignment)
+    n_pushed = int(alignment["langfuse_scored"].sum())
+    correct_mask = alignment["predicted_label"] == alignment["realized_label"]
+    n_correct = int(correct_mask.sum())
+
+    table = [
+        "| Method | Meeting | Result | Pred → Real | Alignment | Pushed | Trace |",
+        "|---|---|:--:|---|---:|:--:|---|",
+    ]
+    for _, row in alignment.sort_values(["meeting_date", "label"]).iterrows():
+        result = "✅" if row["predicted_label"] == row["realized_label"] else "❌"
+        link = f"[open trace]({row['langfuse_trace_url']})" if row.get("langfuse_trace_url") else "—"
+        pushed = "✅" if row.get("langfuse_scored") else "—"
+        table.append(
+            f"| {row['label']} | {row['meeting_date'].date()} | {result} | "
+            f"{row['predicted_label']} → {row['realized_label']} | "
+            f"{row['alignment_score']:.2f} | {pushed} | {link} |"
+        )
+
+    header = (
+        f"**Langfuse — `rationale_alignment`**  \n"
+        f"scored **{n_total}** · correct **{n_correct}/{n_total}** "
+        f"(✅ = predicted direction matched the decision) · pushed **{n_pushed}**"
+    )
+    display(Markdown(header + "\n\n" + "\n".join(table)))
+
+    if not PUSH_TO_LANGFUSE:
+        display(
+            Markdown(
+                "_`PUSH_TO_LANGFUSE = False` in section 3 — set it `True` to write the scores. "
+                "Trace links are clickable either way._"
+            )
+        )
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__99_starter_agent.ipynb.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__99_starter_agent.ipynb.md
new file mode 100644
index 0000000..7ff9789
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__99_starter_agent.ipynb.md
@@ -0,0 +1,188 @@
+# Source: implementations/boc_rate_decisions/99_starter_agent.ipynb
+
+kind: notebook
+
+## Cell 1 (markdown)
+
+# Bank of Canada Rate Decisions — Your Starter Agent
+
+**If you're not sure what to do next, continue from here.**
+
+This notebook is a fresh, hackable agent for the BoC rate-decision use case — deliberately *not* wired into the numbered curriculum. It gives you our common building blocks behind simple toggles, so you can start building something of your own:
+
+- **optional news search** — bounded, cutoff-aware Google Search (proxy-only)
+- **optional code execution** — an E2B Python sandbox
+- **two lightweight skills** — *tool-usage playbooks* in `starter_agent/skills/`
+
+It does two things: lets you **talk to the agent** (open-ended, Track 2) and **score one real forecast** (Track 1). The live cells are gated by `RUN_AGENT` so a fresh `Run All` is safe and free; flip it to `True` to actually call the model.
+
+## Cell 2 (code)
+
+```python
+import warnings
+from pathlib import Path
+
+
+warnings.filterwarnings("ignore")
+
+import pandas as pd
+from dotenv import load_dotenv
+
+
+# Repo root holds the .env with PROXY_* creds the agent needs.
+ROOT = Path.cwd().resolve().parents[1]
+load_dotenv(ROOT / ".env")
+
+# ── Model selection ───────────────────────────────────
+# Two project models: "gemini-3.1-flash-lite-preview" (lite/default) and
+# "gemini-3.5-flash" (advanced). Lite is the default.
+AGENT_MODEL = "gemini-3.1-flash-lite-preview"
+# AGENT_MODEL = "gemini-3.5-flash"  # advanced (higher cost/latency)
+
+# ── Run guard ──────────────────────────────────────
+# Live agent calls cost tokens and need PROXY_* in the repo-root .env, plus warm
+# data caches. Default False so `Run All` is safe; set True to call the model.
+RUN_AGENT = False
+
+from boc_rate_decisions.starter_agent import (
+    build_starter_agent_config,
+    build_starter_agent_predictor,
+)
+
+
+print("RUN_AGENT =", RUN_AGENT, "| model =", AGENT_MODEL)
+```
+
+## Cell 3 (markdown)
+
+---
+## 1. Meet your agent
+
+`build_starter_agent_config` returns an `AgentConfig` with two toggles. The default turns **news search on** (proxy-only, no extra key) and **code execution off** (it needs `E2B_API_KEY` and is slower). Flip them and re-run — the loaded skills follow the enabled tools.
+
+## Cell 4 (code)
+
+```python
+config = build_starter_agent_config(
+    model=AGENT_MODEL,
+    enable_search=True,  # ← cutoff-aware Google Search (proxy-only)
+    enable_code_exec=False,  # ← E2B Python sandbox (needs E2B_API_KEY); try True!
+)
+
+print("Agent:", config.name)
+print("Search enabled:    ", config.context_retrieval.enabled)
+print("Code-exec enabled: ", config.code_execution.enabled)
+print("Skills loaded:     ", [p.name for p in config.skills_dirs])
+print("\n── System instruction (edit this in starter_agent/agent.py) ──\n")
+print(config.instruction[:1200], "...")
+```
+
+## Cell 5 (markdown)
+
+---
+## 2. Talk to it  *(Track 2 — open-ended analysis)*
+
+Ask the agent anything. This is the interactive mode: no scoring, no schema — just reasoning (and a web search, since search is on). Edit the question and explore.
+
+## Cell 6 (code)
+
+```python
+from aieng.forecasting.methods.agentic import build_adk_agent
+from aieng.forecasting.methods.agentic.adk_runner import AdkTextRunner, AdkTextRunnerConfig
+
+
+QUESTION = (
+    "What is the case for a cut versus a hold at the Bank of Canada\u2019s next "
+    "rate decision, and which looks more likely? Be concise."
+)
+
+if RUN_AGENT:
+    chat_agent = build_adk_agent(config)  # schema-free: plain text in, text out
+    runner = AdkTextRunner(chat_agent, config=AdkTextRunnerConfig(app_name="boc_starter_chat"))
+    reply = await runner.run_text_async(QUESTION)  # noqa: F704, PLE1142
+    print(reply)
+else:
+    print("RUN_AGENT is False — set it to True in the setup cell to talk to the agent.")
+```
+
+## Cell 7 (markdown)
+
+---
+## 3. Score one prediction against a known outcome  *(Track 1)*
+
+Now run the agent as a `Predictor`. We pick the **most recent already-resolved** decision, forecast it from 28 days out, and print the agent's distribution next to **what the Bank actually did** — so you can see whether it was any good. (One decision can't tell you if the agent is *calibrated*; that's what the leaderboard + reliability curves in `02_boc_rate_direction_experiment.ipynb` are for.) Live, so gated by `RUN_AGENT`.
+
+## Cell 8 (code)
+
+```python
+from datetime import datetime, timezone
+
+
+if RUN_AGENT:
+    from aieng.forecasting.evaluation.task import ForecastingTask
+    from aieng.forecasting.methods import CategoricalFrequencyPredictor
+    from boc_rate_decisions.data import (
+        DIRECTION_SERIES_ID,
+        DIRECTION_TASK_CATEGORIES,
+        build_boc_service,
+    )
+
+    svc = build_boc_service(
+        statcan_cache_dir=ROOT / "data" / "statcan",
+        fred_cache_dir=ROOT / "data" / "fred",
+    )
+    now = datetime.now(tz=timezone.utc).replace(tzinfo=None)
+
+    # Most recent already-resolved decision = last point in the realized direction series.
+    dir_df = svc.get_series(DIRECTION_SERIES_ID, as_of=now)
+    last = dir_df.iloc[-1]
+    ANNOUNCEMENT = pd.Timestamp(last["timestamp"])
+    realized = {-1.0: "cut", 0.0: "hold", 1.0: "hike"}[float(last["value"])]
+    AS_OF = ANNOUNCEMENT - pd.Timedelta(days=28)
+
+    task = ForecastingTask(
+        task_id="boc_starter_direction",
+        target_series_id=DIRECTION_SERIES_ID,
+        horizons=[28],
+        frequency="D",
+        payload_type="categorical",
+        categories=DIRECTION_TASK_CATEGORIES,
+        description="BoC rate decision direction (cut/hold/hike), 28 days ahead (starter).",
+    )
+    ctx = svc.context(as_of=AS_OF)
+    pred = build_starter_agent_predictor(config).predict(task, ctx)[0]
+    floor = CategoricalFrequencyPredictor().predict(task, ctx)[0]
+
+    probs = pred.payload.probabilities
+    print(f"Decision {ANNOUNCEMENT.date()} forecast from as_of={AS_OF.date()} (T-28)")
+    print(f"Actual outcome: {realized.upper()}\n")
+    print("  outcome   agent prob   climatology")
+    for label in ("cut", "hold", "hike"):
+        mark = "   <- ACTUAL" if label == realized else ""
+        print(f"  {label:<7}   {probs[label]:7.2%}     {floor.payload.probabilities[label]:7.2%}{mark}")
+    top = max(probs, key=probs.get)
+    print(
+        f"\nAgent put {probs[realized]:.0%} on what happened "
+        f"({'its top pick ✓' if top == realized else f'top pick was {top}'})."
+    )
+    if pred.metadata.get("reasoning"):
+        print("\nReasoning:", pred.metadata["reasoning"][:300])
+else:
+    print("RUN_AGENT is False — set it to True to score a live forecast against a known outcome.")
+```
+
+## Cell 9 (markdown)
+
+---
+## 4. Make it yours
+
+This agent is a starting point. Here are concrete next steps, easiest first — each is a small edit, then re-run the cells above.
+
+1. **Flip code execution on.** Set `enable_code_exec=True` in §1 (needs `E2B_API_KEY`). The agent loads the `code-analysis-playbook` skill and can compute its own diagnostics before forecasting. Compare the rationale.
+2. **Edit the agent's personality.** Open `starter_agent/agent.py` and change `_build_starter_instruction()` — make it more cautious, more contrarian, focused on one driver. Re-run §1 to see the new instruction.
+3. **Sharpen the skills.** The two files in `starter_agent/skills/` are short on purpose. Add your best queries to `research-playbook`, or a new diagnostic to `code-analysis-playbook`. The agent picks them up automatically.
+4. **Change the question and the origin.** Try a different `QUESTION` in §2 and a different origin in §3.
+5. **Mind the leakage.** News grounding on historical origins can leak the outcome — keep `cutoff_date` honest, and prefer genuinely upcoming meetings.
+6. **Judge the reasoning, not just the number.** Notebook 03 scores the agent's `reasoning`/`key_signals` against the Bank's published rationale (`rationale_eval.py`) — a process metric that complements RPS.
+
+Bigger ideas — press releases as forecast context (not just the evaluator), live forecasting of upcoming meetings — are in the use-case `README.md` and `planning-docs/roadmap.md`.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__README.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__README.md.md
new file mode 100644
index 0000000..cc93b33
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__README.md.md
@@ -0,0 +1,239 @@
+# Source: implementations/boc_rate_decisions/README.md
+
+kind: markdown
+
+# BoC Rate Decisions
+
+> **Reference implementation 4 of 4.** Recommended order: [getting_started](../getting_started/) → [S&P 500](../sp500_forecasting/) → [food CPI](../food_price_forecasting/) → [energy / WTI](../energy_oil_forecasting/) → **BoC rate decisions**. Each stands on its own.
+
+Predicts the **direction of the Bank of Canada's decision at the next fixed
+announcement date** — cut, hold, or hike — as a calibrated probability
+distribution issued **four weeks (28 days) before the announcement**. This
+is the repository's reference implementation for **discrete event
+prediction**. Where every other use case forecasts a continuous trajectory
+and scores it with CRPS, this one resolves an ordered categorical outcome
+on an irregular meeting calendar and scores distributions with the
+**Ranked Probability Score (RPS)**.
+
+The 28-day lead is the point: on the eve of a decision the 2-year GoC yield
+has already absorbed the market consensus, so a T−1 "forecast" mostly reads
+market pricing off a curve. Four weeks out the decision is genuinely
+uncertain, and the skill being measured is *anticipating cycle turns before
+the market converges*. An eve-of-decision (T−1) diagnostic variant is kept
+alongside; notebook 02 compares the two leads directly.
+
+It is the validation surface for the discrete half of the evaluation
+harness: `ForecastingTask.payload_type == "categorical"` with ordered
+`categories`, `CategoricalForecast` payloads, RPS dispatch in
+`backtest()`/`evaluate()`, and explicit `origin_dates` on specs. The
+**binary special case** (*cut vs no cut*, `payload_type == "binary"`,
+Brier-scored) is kept alongside as a compact copy-paste reference for
+naturally binary problems — prediction-market-style questions — and the
+experiment notebook opens with it as a warm-up, including a numerical check
+of the RPS(K=2) ≡ Brier identity.
+
+This is the repository's only discrete-event reference implementation: come
+here to see the same evaluation harness applied to a problem that is not a time
+series. For the minimal continuous-forecasting loop, see
+[`getting_started/`](../getting_started/).
+
+---
+
+## Prediction task
+
+**Question:** at the fixed announcement date occurring 28 days after the
+forecast origin, will the Bank of Canada CUT, HOLD, or HIKE its target for
+the overnight rate? Outcome is the direction of the change (any size).
+
+- **Target series:** `boc_rate_decision_direction` — derived −1/0/+1
+  series, one observation per fixed announcement date (8 per year),
+  `released_at` = the announcement date itself.
+- **Categories (ordered):** `cut(−1) < hold(0) < hike(+1)` — declared on the
+  task via `categories`, which is what makes RPS distance-sensitive: mass on
+  *hike* when the Bank cuts is penalised through two cumulative thresholds,
+  mass on *hold* through one.
+- **Origins:** `announcement_date − 28 days`, listed explicitly in the
+  specs via `origin_dates` (the meeting calendar is irregular; a stride
+  cannot produce it). Scheduled meetings are never closer than 35 days
+  apart, so the previous decision is always visible at the origin. A
+  use-case test (`test_specs.py`) asserts the origin lists stay consistent
+  with `meeting_schedule.yaml`.
+- **Horizon:** 28 days — the forecast date lands exactly on the
+  announcement, and cutoff enforcement excludes everything after the
+  origin.
+- **Eve diagnostic:** `boc_rate_direction_eve_smoke.yaml` keeps
+  the T−1 framing (task id `boc_rate_direction_next_meeting_eve`) for the
+  lead-time comparison in notebook 02 — the RPS gap between T−28 and T−1
+  separates anticipation from eve-of-decision market reading.
+- **Metric:** unnormalized RPS (the Epstein/Murphy cumulative form: for
+  \(K = 2\) it equals the binary Brier score \((p-y)^2\); Brier's original
+  1950 multi-category score is twice this — both conventions circulate).
+  The headline comparison is the skill score against the climatological
+  distribution. With holds at ~76%, climatology is a deceptively strong
+  baseline that conditions-blind models struggle to beat.
+- **Binary view:** `boc_rate_cut_event` (0/1, 1 = cut) remains registered
+  and the binary smoke/backtest specs are kept as the compact reference.
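+
+As a worked illustration of the metric (a sketch of the standard unnormalized
+RPS, not taken from the repository's scoring code; the symbols \(p_j\), \(o_j\),
+\(F_k\) and \(O_k\) are notation introduced here), let \(p_j\) be the forecast
+probability of ordered category \(j\), \(o_j\) the one-hot outcome,
+\(F_k = \sum_{j \le k} p_j\) and \(O_k = \sum_{j \le k} o_j\):
+
+\[
+\mathrm{RPS} = \sum_{k=1}^{K-1} \left(F_k - O_k\right)^2
+\]
+
+For a realised cut with categories ordered cut < hold < hike, \(O = (1, 1)\)
+at the two thresholds. A forecast of (cut 0.2, hold 0.7, hike 0.1) gives
+\(F = (0.2, 0.9)\), so \(\mathrm{RPS} = (0.2 - 1)^2 + (0.9 - 1)^2 = 0.65\).
+All mass on *hold* gives \(F = (0, 1)\) and \(\mathrm{RPS} = 1\); all mass on
+*hike* gives \(F = (0, 0)\) and \(\mathrm{RPS} = 2\), which is the
+distance sensitivity described above. With \(K = 2\) there is a single
+threshold, and writing \(p\) for the probability of the event and \(y\) for its
+0/1 outcome, \((F_1 - O_1)^2 = ((1 - p) - (1 - y))^2 = (p - y)^2\), the binary
+Brier score.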
+
+**Excluded by design:** unscheduled (emergency) announcements — there have
+been exactly two since 2009 (March 13 and March 27, 2020, the COVID-19
+intermeeting cuts). They are recorded in the calendar file and used for
+validation, but no forecast origin targets them.
+
+---
+
+## Data
+
+| Ingredient | Source | Notes |
+|---|---|---|
+| Daily target for the overnight rate | StatCan 10-10-0139-01 (`StatCanAdapter`, `release_lag_days=1`) | The raw policy path |
+| Fixed announcement dates 2009–2026 | `meeting_schedule.yaml` (committed, curated) | Required to observe *holds*; sourced from the Bank's announcement archive, validated against the rate series |
+| `boc_rate_decision_direction` | `BoCDecisionEventAdapter(kind="direction")` | Joins calendar + daily rate into −1/0/+1; robust to the 2021 effective-date regime change |
+| `boc_rate_cut_event` | `BoCDecisionEventAdapter(kind="cut")` | The binary view of the same derivation |
+| 2-year GoC benchmark yield | StatCan 10-10-0139-01 | Market-implied policy expectations — the strongest single covariate, and naturally directional |
+| CPI all-items | StatCan 18-10-0004-11 | The Bank targets 2% CPI inflation |
+| Unemployment rate | FRED `LRUNTTTTCAM156S` | Labour-market pressure |
+| BoC rate-announcement press releases | Bank of Canada announcement pages (`scripts/fetch_boc_press_releases.py`) | One release per scheduled meeting, cached to `data/reports/boc_press_releases/`; served cutoff-aware by `PressReleaseStore` (only releases published on or before the origin are visible). Currently the published-rationale source for the reasoning-alignment evaluator; available as a context seam for the LLMP/agent predictors |
+
+Populate the cache once:
+
+```bash
+uv run python scripts/fetch_boc.py                 # series: rate, 2yr yield, CPI, unemployment
+uv run python scripts/fetch_boc_press_releases.py  # press releases (for the rationale-alignment eval)
+```
+
+`fetch_boc.py` uses the FRED API for the unemployment covariate (`FRED_API_KEY` in
+your repo-root `.env`); the script degrades gracefully without it, but the unemployment
+feature will be absent. FRED keys are free but must be requested individually —
+**we cannot provide one for you**. Request yours at
+https://fred.stlouisfed.org/docs/api/api_key.html (approval is usually quick, but
+allow some time). A description like "Requesting an API key to explore the
+effectiveness of various forecasting techniques on economic data." works well.
+
+**Cutoff discipline.** Monthly adapters carry *approximate* `released_at`
+stamps that are optimistic by roughly one month (the lag is measured from
+the month-start timestamp; StatCan publishes ~3 weeks after the month
+ends). All predictors in this use case therefore drop the newest visible
+reference month of any monthly covariate — see
+`predictors/logistic_baseline.py::build_feature_row`, which both the
+logistic model and the agent prompt builder share. Notebook 01 demonstrates
+the full chain at a real origin.
+
+**Maintenance:** extend `meeting_schedule.yaml` each year when the Bank
+publishes its next calendar (provenance notes are in the file header), and
+re-run `scripts/fetch_boc.py --refresh` to pick up new announcements.
+
+---
+
+## Predictors
+
+| Group | Predictor | Information set |
+|---|---|---|
+| Floor baseline | `CategoricalFrequencyPredictor` (core package) | Past outcomes only — the constant climatological distribution |
+| Conventional | `predictors/logistic_baseline.py` | Fit-at-origin multinomial logistic regression on four leak-safe macro features (yield spread, rate momentum, inflation gap, unemployment momentum); training features are rebuilt at each past meeting minus the task's own lead, so the train and predict feature distributions match; dispatches to plain logistic regression on binary tasks |
+| LLMP | `predictors/llmp_direction.py` → `CategoricalProbabilityLLMPredictor` | Labelled outcome history + BoC context block; one structured call, direct distribution elicitation. `predictors/llmp_binary.py` is the binary counterpart |
+| Agentic | `analyst_agent/` → `AgentPredictor` + `CategoricalAgentForecastOutput` | Rate path + decision history + **the same macro features as the logistic model** |
+
+The agent/logistic pairing is deliberate: identical indicators, so the
+comparison isolates *conventional fitting* vs *LLM reasoning*. The agent
+also emits `reasoning` and `key_signals` per meeting — the input for the
+reasoning-alignment evaluator in `rationale_eval.py`, demonstrated
+end-to-end in notebook 03.
+
+> **Leakage note (cutoff posture).** Gemini's parametric knowledge cutoff is
+> ~January 2025, and for a discrete outcome a single recalled label is the whole
+> answer — so the 2010–2024 backtest RPS for the LLMP and agent is an **upper
+> bound** on live skill (the conventional rows are the honest backtest there).
+> The **post-2025 protected eval** (12 resolved meetings, Jan 2025 – Jun 2026) is
+> the honest LLM/agent scoreboard; notebook 02 §10 now runs it by default.
+
+---
+
+## Reference specs
+
+Five specs, two jobs — a pedagogical pre-2025 backtest (cutoff-safe baselines
+are honest; LLM/agent rows are an upper bound) and the honest post-2025 eval —
+plus two small single-purpose illustrations:
+
+```
+specs/
+├── boc_rate_direction_backtest.yaml      # CANONICAL backtest: T−28, 120 origins, 2010–2024 (3 easing + 3 tightening cycles)
+├── boc_rate_direction_smoke.yaml         # a 3-origin slice of the above (2024: one hold, two cuts) — fast dev loop
+├── boc_rate_direction_eval.yaml          # HONEST eval: T−28, 12 origins, Jan 2025 – Jun 2026, max_runs: 5 (no hikes in window)
+├── boc_rate_cut_smoke.yaml               # binary reference (cut vs no cut), Brier-scored — §3 warm-up
+└── boc_rate_direction_eve_smoke.yaml     # T−1 eve-of-decision diagnostic, 3 origins — §7 lead comparison
+```
+
+The post-2025 window is too scarce (12 meetings) to split into both a held-out
+eval and a separate LLM backtest, so there is no "recent backtest" tier: the
+deep pre-2025 history is the backtest surface (numerical methods + LLM
+upper-bound) and the 2025–26 window is reserved for the eval. Notebook 02
+sizes the main backtest (smoke slice vs full window) via `EXPERIMENT_CONFIG`;
+the warm-up and eve specs are always the small ones.
+
+---
+
+## Module layout
+
+```
+implementations/boc_rate_decisions/
+├── meeting_schedule.yaml  # curated BoC announcement calendar (source-cited)
+├── data.py                # build_boc_service(); direction/event derivation + validation
+├── press_releases.py      # PressReleaseStore: cutoff-aware press-release store + HTML extraction/caching helpers
+├── predictors/            # (multinomial) logistic baseline; direction + binary LLMP recipes
+├── analyst_agent/         # AgentConfig factories + prompt builder + predictor factory
+├── starter_agent/         # fresh, hackable agent template (toggleable search/code-exec + skills)
+├── analysis.py            # score leaderboard, one-vs-rest frames, calibration bins, rationales
+├── rationale_eval.py      # LLM-as-judge reasoning-alignment evaluator; reads Langfuse traces, pushes scores back
+├── plots.py               # decision timeline, reliability curve, rate-path chart
+├── specs/                 # direction + binary backtest / eval / smoke YAML
+├── 01_boc_data_exploration.ipynb           # framing, direction derivation, cutoff walkthrough
+├── 02_boc_rate_direction_experiment.ipynb  # binary warm-up + the 3-way experiment
+├── 03_rationale_alignment.ipynb            # reasoning-alignment evaluation (LLM-as-judge over traces)
+└── 99_starter_agent.ipynb                  # ← start here to build your own agent
+```
+
+Tests live under `implementations/tests/boc_rate_decisions/` (direction and
+event derivation semantics; feature leak-safety).
+
+---
+
+## Notebooks
+
+| Notebook | Purpose |
+|---|---|
+| `01_boc_data_exploration.ipynb` | Problem framing (ordered decision vs time series), policy-rate history with cut/hold/hike markers, direction derivation + schedule validation, class imbalance and the climatology RPS floor (with the cumulative-Brier decomposition), cutoff discipline at a real origin. |
+| `02_boc_rate_direction_experiment.ipynb` | **Main experiment.** Binary warm-up (the copy-paste reference + RPS(K=2) ≡ Brier check), smoke/full config switch, cached backtests for all four predictors at the canonical T−28 lead, RPS leaderboard with skill scores, the T−28 vs T−1 lead-time comparison ("anticipation gap"), decision timeline (P(cut) and P(hike)), one-vs-rest reliability curves, agent-reasoning inspection, budget-gated protected eval. |
+| `03_rationale_alignment.ipynb` | **Reasoning-alignment evaluation.** Runs traced LLMP/agent forecasts, then judges each trace's `reasoning`/`key_signals` against the Bank's published press release with an LLM-as-judge (`rationale_eval.py`), pushing `rationale_alignment` (0–1) and `right_for_right_reasons` scores back to Langfuse. A *process* metric that complements RPS — most valuable exactly where backtest scores are least trustworthy (see the leakage note above). |
+| `99_starter_agent.ipynb` | **Your starter agent.** A fresh, hackable cut/hold/hike agent — *not* part of the experiment above. Toggleable news search + code execution and two lightweight tool-usage skills, with an interactive (Track 2) cell, one scored prediction (Track 1), and a "make it yours" guide. The place to start building your own. |
+
+---
+
+## Roadmap
+
+### Implemented since the first draft
+
+1. **BoC communications ingestion.** `press_releases.py` fetches one rate
+   announcement per scheduled meeting (`scripts/fetch_boc_press_releases.py`),
+   caches them under `data/reports/boc_press_releases/`, and serves them
+   cutoff-aware through `PressReleaseStore` — releases published after the
+   forecast origin are never visible, exactly like series data.
+2. **Reasoning-alignment evaluation.** `rationale_eval.py` is an LLM-as-judge
+   that compares the forecaster's per-meeting `reasoning`/`key_signals`
+   against the Bank's published rationale and writes `rationale_alignment`
+   and `right_for_right_reasons` scores back to the Langfuse trace. Notebook
+   03 runs it end-to-end.
+
+### Remaining extensions — good participant projects
+
+**Start in [`99_starter_agent.ipynb`](99_starter_agent.ipynb)** — it ships a
+fresh, hackable agent and a hands-on "make it yours" guide for going further.
+Two substantive projects, each with an explicit seam in the code, are
+catalogued in [`planning-docs/roadmap.md`](../../planning-docs/roadmap.md):
+
+1. **Press releases as predictor context** — feed cutoff-filtered release
+   excerpts into the *forecast* (not just the evaluator) via
+   `CategoricalProbabilityLLMPredictorConfig.user_prompt_suffix` or the
+   `build_boc_news_config` retrieval seam, and measure the lift.
+2. **Live forecasting** — forecast each upcoming announcement the day before it
+   happens: genuinely out-of-sample, and the honest test that backtest leakage
+   precludes.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions____init__.py.md
new file mode 100644
index 0000000..dc8e1cf
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions____init__.py.md
@@ -0,0 +1,26 @@
+# Source: implementations/boc_rate_decisions/__init__.py
+
+kind: python
+
+```python
+"""Bank of Canada rate-decision experiment — helper modules and reference implementations.
+
+This use case frames a **discrete event-prediction problem**: the direction
+(cut, hold, or hike) of the next BoC fixed-date decision, scored with RPS. The
+binary cut-vs-no-cut view (``ForecastingTask.payload_type == "binary"``,
+Brier-scored) is kept alongside as a compact reference for binary tasks.
+
+The notebooks are deliberately kept thin; most of the analytical code lives
+in the modules in this package:
+
+- :mod:`data` — data service setup: daily target rate (StatCan), the curated
+  meeting calendar, the derived 0/1 rate-cut event series, and macro
+  covariates.
+- :mod:`analysis` — Brier leaderboards and calibration (reliability) tables.
+- :mod:`plots` — matplotlib figures (decision timeline, reliability curve).
+- :mod:`predictors` — the logistic-regression conventional baseline.
+- :mod:`analyst_agent` — the agentic BoC analyst predictor.
+
+See ``README.md`` in this directory for the full experiment description.
+"""
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__analysis.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__analysis.py.md
new file mode 100644
index 0000000..4ced900
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__analysis.py.md
@@ -0,0 +1,507 @@
+# Source: implementations/boc_rate_decisions/analysis.py
+
+kind: python
+
+```python
+"""Analysis helpers for the BoC rate-decision experiment.
+
+Pure functions that turn :class:`BacktestResult` / :class:`EvalResult`
+objects into tidy DataFrames for binary and ordered-categorical evaluations:
+per-meeting prediction tables, score leaderboards, reliability/calibration
+bins, and rationale extracts.
+
+Kept separate from the notebooks so they can be unit-tested and reused.
+All functions are pure: they take results plus an observed event series and
+return DataFrames. They never fetch data or mutate global state.
+"""
+
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+
+import numpy as np
+import pandas as pd
+from aieng.forecasting.evaluation.backtest import BacktestResult
+from aieng.forecasting.evaluation.eval import EvalResult
+from aieng.forecasting.evaluation.prediction import BinaryForecast, CategoricalForecast
+from aieng.forecasting.evaluation.task import ForecastingTask
+
+
+def _task_from_result(result: BacktestResult | EvalResult) -> ForecastingTask:
+    """Return the forecasting task attached to a backtest or eval result."""
+    if isinstance(result, BacktestResult):
+        return result.spec.task
+    return result.eval_spec.task
+
+
+def predictions_to_frame(
+    results: dict[str, BacktestResult | EvalResult],
+    event_df: pd.DataFrame,
+) -> pd.DataFrame:
+    """Flatten binary/categorical predictions into a tidy per-meeting DataFrame.
+
+    Parameters
+    ----------
+    results : dict[str, BacktestResult | EvalResult]
+        Mapping ``predictor_id -> result``. Binary results contribute
+        ``probability`` rows; categorical results contribute one ``p_<label>``
+        column per task-declared category.
+    event_df : pd.DataFrame
+        Observed event series (``timestamp`` / ``value`` columns, as returned
+        by :meth:`DataService.get_series`), used to attach the realised outcome
+        at each prediction's ``forecast_date``. Binary series use 0/1 values;
+        direction series use the task category values (for example -1/0/+1).
+
+    Returns
+    -------
+    pd.DataFrame
+        Columns: ``predictor_id``, ``origin``, ``meeting_date``, ``score``,
+        ``metric``, ``outcome``, ``probability`` for binary rows, and one
+        ``p_<label>`` column per categorical label. Categorical rows also carry
+        ``outcome_label`` so one-vs-rest views can be derived without the task
+        object.
+    """
+    outcome_by_date = {
+        pd.Timestamp(ts).normalize(): float(v) for ts, v in zip(event_df["timestamp"], event_df["value"], strict=True)
+    }
+
+    rows: list[dict[str, object]] = []
+    for predictor_id, result in results.items():
+        task = _task_from_result(result)
+        value_to_label = {category.value: category.label for category in task.categories or []}
+        category_labels = [category.label for category in task.categories or []]
+        for pred, score in zip(result.predictions, result.scores, strict=True):
+            meeting_date = pd.Timestamp(pred.forecast_date).normalize()
+            outcome = outcome_by_date.get(meeting_date)
+            row: dict[str, object] = {
+                "predictor_id": predictor_id,
+                "origin": pd.Timestamp(pred.as_of),
+                "meeting_date": meeting_date,
+                "score": float(score),
+                "metric": result.metric,
+                "outcome": outcome,
+                "probability": np.nan,
+            }
+            if isinstance(pred.payload, BinaryForecast):
+                row["probability"] = float(pred.payload.probability)
+                row["outcome_label"] = None
+            elif isinstance(pred.payload, CategoricalForecast):
+                row["outcome_label"] = value_to_label.get(outcome) if outcome is not None else None
+                for label in category_labels:
+                    row[f"p_{label}"] = float(pred.payload.probabilities[label])
+            else:
+                continue
+            rows.append(row)
+    return pd.DataFrame(rows)
+
+
+def score_leaderboard(
+    results: dict[str, BacktestResult | EvalResult],
+    *,
+    reference_id: str | None = None,
+) -> pd.DataFrame:
+    """Build a mean-score leaderboard, optionally with skill scores.
+
+    Skill against a reference predictor is ``1 - score / score_ref``: positive
+    means the predictor beats the reference, 0 means it matches it, negative
+    means it loses. The reference is usually a historical-frequency baseline
+    for binary tasks or a categorical-frequency baseline for direction tasks.
+
+    Parameters
+    ----------
+    results : dict[str, BacktestResult | EvalResult]
+        Mapping ``predictor_id -> result``. Lower scores are better.
+    reference_id : str or None
+        Predictor id to use as the skill-score reference. When ``None`` (or
+        not present in ``results``) the ``skill_vs_reference`` column is
+        omitted. If the reference mean score is non-positive, the skill column
+        is also omitted.
+
+    Returns
+    -------
+    pd.DataFrame
+        One row per predictor, sorted by ``mean_score`` ascending. Columns:
+        ``predictor_id``, ``metric``, ``mean_score``, ``n_predictions``,
+        ``n_skipped_origins`` and optionally ``skill_vs_reference``.
+    """
+    rows: list[dict[str, object]] = []
+    for predictor_id, result in results.items():
+        rows.append(
+            {
+                "predictor_id": predictor_id,
+                "metric": result.metric,
+                "mean_score": result.mean_score,
+                "n_predictions": len(result.predictions),
+                "n_skipped_origins": result.skipped_origins,
+            }
+        )
+    board = pd.DataFrame(rows).sort_values("mean_score").reset_index(drop=True)
+
+    if reference_id is not None and reference_id in results:
+        reference_score = results[reference_id].mean_score
+        if reference_score > 0:
+            board["skill_vs_reference"] = (1.0 - board["mean_score"] / reference_score).round(4)
+    return board
+
+
+def one_vs_rest_frame(predictions_df: pd.DataFrame, category: str) -> pd.DataFrame:
+    """Convert a categorical tidy frame into a binary one-vs-rest frame.
+
+    The input must come from :func:`predictions_to_frame` for a categorical
+    task, which provides ``p_<label>`` probability columns and an
+    ``outcome_label`` column derived from the task category values. For
+    example, ``one_vs_rest_frame(df, "cut")`` returns ``probability = p_cut``
+    and ``outcome = 1`` for realised cuts, ``0`` for realised holds/hikes, and
+    ``NaN`` when the meeting has not resolved.
+
+    Parameters
+    ----------
+    predictions_df : pd.DataFrame
+        Categorical tidy frame from :func:`predictions_to_frame`.
+    category : str
+        Category label to evaluate one-vs-rest.
+
+    Returns
+    -------
+    pd.DataFrame
+        Columns: ``predictor_id``, ``meeting_date``, ``probability``,
+        ``outcome``.
+    """
+    probability_col = f"p_{category}"
+    required = {"predictor_id", "meeting_date", "outcome_label", probability_col}
+    missing = required - set(predictions_df.columns)
+    if missing:
+        raise ValueError(f"predictions_df is missing required categorical columns: {sorted(missing)}")
+
+    frame = predictions_df.loc[:, ["predictor_id", "meeting_date", probability_col, "outcome_label"]].copy()
+    frame = frame.rename(columns={probability_col: "probability"})
+    frame["outcome"] = np.where(
+        frame["outcome_label"].isna(),
+        np.nan,
+        (frame["outcome_label"] == category).astype(float),
+    )
+    return frame.loc[:, ["predictor_id", "meeting_date", "probability", "outcome"]]
+
+
+def calibration_table(
+    predictions_df: pd.DataFrame,
+    *,
+    predictor_id: str | None = None,
+    n_bins: int = 5,
+) -> pd.DataFrame:
+    """Bin predicted probabilities and compare against observed event frequency.
+
+    This is the tabular form of the reliability curve: a perfectly calibrated
+    predictor has ``observed_frequency ~= mean_predicted`` in every bin.
+    Pass binary-task frames directly, or pass the output of
+    :func:`one_vs_rest_frame` for a categorical category such as ``cut`` or
+    ``hike``. With only ~120 meetings, bins are necessarily coarse — five
+    equal-width bins is about as fine as the sample supports.
+
+    Parameters
+    ----------
+    predictions_df : pd.DataFrame
+        Binary-style frame with ``probability`` and 0/1 ``outcome`` columns.
+        Rows with missing ``probability`` or ``outcome`` are dropped.
+    predictor_id : str or None
+        Restrict to one predictor; ``None`` uses all rows (caller's
+        responsibility to pass a single-predictor frame in that case).
+    n_bins : int
+        Number of equal-width probability bins over [0, 1].
+
+    Returns
+    -------
+    pd.DataFrame
+        One row per non-empty bin: ``bin_left``, ``bin_right``,
+        ``mean_predicted``, ``observed_frequency``, ``n``.
+    """
+    df = predictions_df.dropna(subset=["probability", "outcome"])
+    if predictor_id is not None:
+        df = df[df["predictor_id"] == predictor_id]
+
+    edges = np.linspace(0.0, 1.0, n_bins + 1)
+    rows: list[dict[str, float]] = []
+    for left, right in zip(edges[:-1], edges[1:], strict=True):
+        # Right-inclusive last bin so p=1.0 is counted.
+        upper_ok = df["probability"] <= right if right >= 1.0 else df["probability"] < right
+        in_bin = df[(df["probability"] >= left) & upper_ok]
+        if in_bin.empty:
+            continue
+        rows.append(
+            {
+                "bin_left": float(left),
+                "bin_right": float(right),
+                "mean_predicted": float(in_bin["probability"].mean()),
+                "observed_frequency": float(in_bin["outcome"].mean()),
+                "n": int(len(in_bin)),
+            }
+        )
+    return pd.DataFrame(rows)
+
+
+def yearly_outcome_table(event_df: pd.DataFrame, labels: dict[float, str] | None = None) -> pd.DataFrame:
+    """Summarise meeting outcomes per calendar year.
+
+    Parameters
+    ----------
+    event_df : pd.DataFrame
+        Observed event series (``timestamp`` / ``value`` columns).
+    labels : dict[float, str] or None
+        Optional mapping from observed category value to display label. When
+        omitted, preserves the binary cut-event summary.
+
+    Returns
+    -------
+    pd.DataFrame
+        Indexed by year. Binary mode returns ``n_meetings``, ``n_cuts``, and
+        ``cut_rate``. Categorical mode returns ``n_meetings`` plus one
+        ``n_<label>`` column per supplied label.
+    """
+    df = event_df.copy()
+    df["year"] = pd.to_datetime(df["timestamp"]).dt.year
+    if labels is not None:
+        grouped = df.groupby("year")["value"].agg(n_meetings="count")
+        for value, label in labels.items():
+            grouped[f"n_{label}"] = df["value"].eq(value).groupby(df["year"]).sum().astype(int)
+        return grouped
+
+    grouped = df.groupby("year")["value"].agg(n_meetings="count", n_cuts="sum")
+    grouped["n_cuts"] = grouped["n_cuts"].astype(int)
+    grouped["cut_rate"] = (grouped["n_cuts"] / grouped["n_meetings"]).round(3)
+    return grouped
+
+
+def rationales_table(result: BacktestResult | EvalResult) -> pd.DataFrame:
+    """Extract per-prediction metadata (reasoning traces etc.) into a DataFrame.
+
+    For the agent predictor, ``metadata`` carries ``reasoning`` and
+    ``key_signals`` — the inputs for the planned reasoning-alignment
+    evaluation against the Bank's own published rationale.
+
+    Parameters
+    ----------
+    result : BacktestResult | EvalResult
+        Result to introspect.
+
+    Returns
+    -------
+    pd.DataFrame
+        Columns: ``origin``, ``meeting_date``, ``probability`` for binary
+        payloads or one ``p_<label>`` column per categorical label, plus one
+        ``meta_*`` column per distinct metadata key (missing values filled
+        with ``None``).
+    """
+    task = _task_from_result(result)
+    category_labels = [category.label for category in task.categories or []]
+    base_rows: list[dict[str, object]] = []
+    all_keys: set[str] = set()
+    for pred in result.predictions:
+        row: dict[str, object] = {
+            "origin": pd.Timestamp(pred.as_of),
+            "meeting_date": pd.Timestamp(pred.forecast_date),
+        }
+        if isinstance(pred.payload, BinaryForecast):
+            row["probability"] = float(pred.payload.probability)
+        elif isinstance(pred.payload, CategoricalForecast):
+            for label in category_labels:
+                row[f"p_{label}"] = float(pred.payload.probabilities[label])
+        for k, v in pred.metadata.items():
+            row[f"meta_{k}"] = v
+            all_keys.add(f"meta_{k}")
+        base_rows.append(row)
+
+    for row in base_rows:
+        for k in all_keys:
+            row.setdefault(k, None)
+    return pd.DataFrame(base_rows)
+
+
+@dataclass(frozen=True)
+class PanelRow:
+    """One method's prediction for a single meeting, for the decision panel.
+
+    Attributes
+    ----------
+    predictor_id : str
+        Identifier of the predictor that produced this row.
+    probabilities : dict[str, float]
+        Predicted probability per category label, in task-category order.
+    score : float
+        The meeting's score for this predictor (RPS for the 3-way task).
+    rationale : str
+        Stated reasoning, if the method recorded one (agents and LLMPs do via
+        ``Prediction.metadata["rationale"]``); empty string otherwise.
+    key_signals : list[str]
+        Supporting signals, if recorded (agents only today); empty otherwise.
+    """
+
+    predictor_id: str
+    probabilities: dict[str, float]
+    score: float
+    rationale: str = ""
+    key_signals: list[str] = field(default_factory=list)
+
+
+@dataclass(frozen=True)
+class DecisionPanel:
+    """Everything needed to render one meeting's decision panel across methods.
+
+    Attributes
+    ----------
+    meeting_date : pd.Timestamp
+        The announcement date being predicted.
+    origin : pd.Timestamp
+        Forecast origin (``as_of``) the predictions were issued from.
+    categories : list[str]
+        Ordered category labels (e.g. ``["cut", "hold", "hike"]``).
+    outcome_label : str or None
+        Realised decision at ``meeting_date``, or ``None`` if unresolved.
+    prior_outcome_label : str or None
+        The most recent resolved decision strictly before ``meeting_date``,
+        for context; ``None`` if none is available.
+    rows : list[PanelRow]
+        One row per predictor, in the order the results were supplied.
+    """
+
+    meeting_date: pd.Timestamp
+    origin: pd.Timestamp
+    categories: list[str]
+    outcome_label: str | None
+    prior_outcome_label: str | None
+    rows: list[PanelRow]
+
+
+def decision_panel_data(
+    results: dict[str, BacktestResult | EvalResult],
+    event_df: pd.DataFrame,
+    *,
+    meeting_date: str | pd.Timestamp | None = None,
+) -> DecisionPanel:
+    """Assemble one meeting's cross-method prediction panel.
+
+    Gathers, for a single announcement date, each categorical method's
+    predicted distribution, its score, and any stated ``rationale`` /
+    ``key_signals`` (read from ``Prediction.metadata`` exactly as
+    :func:`rationales_table` does), plus the realised outcome and the prior
+    decision for context.
+
+    Parameters
+    ----------
+    results : dict[str, BacktestResult | EvalResult]
+        Mapping ``predictor_id -> result``. Only categorical results
+        contribute; binary-only inputs raise ``ValueError``.
+    event_df : pd.DataFrame
+        Observed direction series (``timestamp`` / ``value``), used for the
+        realised and prior outcomes (values mapped to labels via the task's
+        declared categories).
+    meeting_date : str | pd.Timestamp | None
+        Which announcement to assemble. ``None`` (default) selects the most
+        recent meeting present across the categorical results.
+
+    Returns
+    -------
+    DecisionPanel
+    """
+    categorical = [(pid, r) for pid, r in results.items() if _task_from_result(r).payload_type == "categorical"]
+    if not categorical:
+        raise ValueError("decision_panel_data requires at least one categorical result.")
+
+    task = _task_from_result(categorical[0][1])
+    categories = [category.label for category in task.categories or []]
+    value_to_label = {category.value: category.label for category in task.categories or []}
+    outcome_by_date = {
+        pd.Timestamp(ts).normalize(): float(v) for ts, v in zip(event_df["timestamp"], event_df["value"], strict=True)
+    }
+
+    all_dates = sorted(
+        {pd.Timestamp(pred.forecast_date).normalize() for _, result in categorical for pred in result.predictions}
+    )
+    if not all_dates:
+        raise ValueError("No categorical predictions found in results.")
+    target = all_dates[-1] if meeting_date is None else pd.Timestamp(meeting_date).normalize()
+
+    rows: list[PanelRow] = []
+    origin: pd.Timestamp | None = None
+    for predictor_id, result in categorical:
+        for pred, score in zip(result.predictions, result.scores, strict=True):
+            if pd.Timestamp(pred.forecast_date).normalize() != target:
+                continue
+            if not isinstance(pred.payload, CategoricalForecast):
+                continue
+            metadata = pred.metadata or {}
+            rationale = str(metadata.get("rationale", "") or "").strip()
+            key_signals = list(metadata.get("key_signals", []) or [])
+            rows.append(
+                PanelRow(
+                    predictor_id=predictor_id,
+                    probabilities={label: float(pred.payload.probabilities[label]) for label in categories},
+                    score=float(score),
+                    rationale=rationale,
+                    key_signals=key_signals,
+                )
+            )
+            origin = pd.Timestamp(pred.as_of)
+            break
+
+    outcome_value = outcome_by_date.get(target)
+    outcome_label = value_to_label.get(outcome_value) if outcome_value is not None else None
+
+    prior_dates = sorted(d for d in outcome_by_date if d < target)
+    prior_outcome_label = value_to_label.get(outcome_by_date[prior_dates[-1]]) if prior_dates else None
+
+    return DecisionPanel(
+        meeting_date=target,
+        origin=origin if origin is not None else target,
+        categories=categories,
+        outcome_label=outcome_label,
+        prior_outcome_label=prior_outcome_label,
+        rows=rows,
+    )
+
+
+def panel_rationales_markdown(panel: DecisionPanel, labels: dict[str, str] | None = None) -> str:
+    """Render a decision panel's rationales as a Markdown string.
+
+    One block per method that recorded a ``rationale`` and/or ``key_signals``;
+    methods without either are skipped. Intended for ``IPython.display.Markdown``
+    in notebooks, keeping rationale prose out of the matplotlib figure.
+
+    Parameters
+    ----------
+    panel : DecisionPanel
+        Assembled by :func:`decision_panel_data`.
+    labels : dict[str, str] or None
+        Optional predictor_id -> display-label map for the block headings.
+
+    Returns
+    -------
+    str
+        Markdown text, or the empty string when no method recorded a rationale.
+    """
+    label_map = {row.predictor_id: (labels or {}).get(row.predictor_id, row.predictor_id) for row in panel.rows}
+    blocks: list[str] = []
+    for row in panel.rows:
+        if not row.rationale and not row.key_signals:
+            continue
+        block = f"**{label_map[row.predictor_id]}**"
+        if row.key_signals:
+            block += "\n\nKey signals: " + ", ".join(row.key_signals)
+        if row.rationale:
+            block += f"\n\n{row.rationale}"
+        blocks.append(block)
+    return "\n\n---\n\n".join(blocks)
+
+
+__all__ = [
+    "DecisionPanel",
+    "PanelRow",
+    "calibration_table",
+    "decision_panel_data",
+    "one_vs_rest_frame",
+    "panel_rationales_markdown",
+    "predictions_to_frame",
+    "rationales_table",
+    "score_leaderboard",
+    "yearly_outcome_table",
+]
+```
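Note on `calibration_table` in the file above: the binning rule can be checked in isolation with a minimal sketch. This is not the module itself. `toy_calibration_table` and the toy frame are hypothetical names and made-up data, and the sketch assumes only the `probability` / `outcome` columns that the original function reads. Bins are half-open `[left, right)`, except the last, which also admits `p == 1.0`.

```python
import numpy as np
import pandas as pd


def toy_calibration_table(df: pd.DataFrame, n_bins: int = 5) -> pd.DataFrame:
    """Hypothetical stand-alone copy of the equal-width binning logic."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for left, right in zip(edges[:-1], edges[1:]):
        p = df["probability"]
        # Half-open [left, right) bins; the last bin is closed so p == 1.0 counts.
        upper_ok = p <= right if right >= 1.0 else p < right
        in_bin = df[(p >= left) & upper_ok]
        if in_bin.empty:
            continue  # empty bins are dropped, as in the original
        rows.append(
            {
                "bin_left": float(left),
                "bin_right": float(right),
                "mean_predicted": float(in_bin["probability"].mean()),
                "observed_frequency": float(in_bin["outcome"].mean()),
                "n": int(len(in_bin)),
            }
        )
    return pd.DataFrame(rows)


# Made-up predictions: two low, one middling, two high (one exactly 1.0).
toy = pd.DataFrame(
    {
        "probability": [0.05, 0.10, 0.55, 0.90, 1.00],
        "outcome": [0, 0, 1, 1, 1],
    }
)
table = toy_calibration_table(toy, n_bins=5)
print(table)
```

With these inputs only three of the five bins are populated, and the prediction at exactly `1.0` lands in the final bin rather than being dropped.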
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__analyst_agent____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__analyst_agent____init__.py.md
new file mode 100644
index 0000000..11cf6a0
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__analyst_agent____init__.py.md
@@ -0,0 +1,26 @@
+# Source: implementations/boc_rate_decisions/analyst_agent/__init__.py
+
+kind: python
+
+```python
+"""Bank of Canada policy analyst agent module.
+
+Exports the :class:`AgentConfig` factories, prompt builder, and predictor
+convenience factory for the BoC rate-decision reference implementation.
+"""
+
+from boc_rate_decisions.analyst_agent.agent import (
+    BoCDecisionPromptBuilder,
+    build_boc_agent_predictor,
+    build_boc_basic_config,
+    build_boc_news_config,
+)
+
+
+__all__ = [
+    "BoCDecisionPromptBuilder",
+    "build_boc_agent_predictor",
+    "build_boc_basic_config",
+    "build_boc_news_config",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__analyst_agent__agent.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__analyst_agent__agent.py.md
new file mode 100644
index 0000000..8ead716
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__analyst_agent__agent.py.md
@@ -0,0 +1,382 @@
+# Source: implementations/boc_rate_decisions/analyst_agent/agent.py
+
+kind: python
+
+```python
+"""Bank of Canada policy analyst agent configuration and prompt builder.
+
+Provides :class:`~aieng.forecasting.methods.agentic.agent_factory.AgentConfig`
+factories for the primary BoC 3-way rate-direction prediction task
+(cut / hold / hike at the next fixed announcement date):
+
+1. :func:`build_boc_basic_config` — the quantitative-only analyst: reasons
+   from the policy-rate path, past meeting outcomes, and a leak-safe macro
+   snapshot supplied in the prompt payload. No tools.
+2. :func:`build_boc_news_config` — adds bounded Google Search via a
+   :class:`~aieng.forecasting.methods.agentic.agent_factory.ContextRetrievalConfig`
+   sub-agent with strict temporal cutoffs. This is the explicit seam for the
+   deferred BoC-communication-grounded variant: once press releases and
+   Monetary Policy Reports are ingested (Ali's Track 2 work), the retrieval
+   instruction swaps web search for report retrieval without touching the
+   forecasting contract or the reasoning-alignment evaluator interface.
+
+Also provides:
+
+- :class:`BoCDecisionPromptBuilder` — Pydantic ``BaseModel`` that serialises
+  the task, meeting calendar position, rate path, per-meeting decision
+  history, and macro snapshot into a structured JSON payload.
+- :func:`build_boc_agent_predictor` — convenience factory wiring a config to
+  an :class:`~aieng.forecasting.methods.agentic.predictor.AgentPredictor`
+  with the
+  :class:`~aieng.forecasting.methods.agentic.outputs.CategoricalAgentForecastOutput`
+  schema. The agent's ``reasoning`` / ``key_signals`` output fields are the
+  hook for the planned LLM reasoning-alignment evaluation against the Bank's
+  own published rationale.
+
+The agent is direction-native: the compact binary rate-cut reference uses the
+frequency baseline, logistic regression, and the binary LLMP recipe, and the
+task-agnostic
+:class:`~aieng.forecasting.methods.agentic.outputs.DiscreteAgentForecastOutput`
+remains available in the core package for naturally binary problems.
+
+Module-level ``__getattr__`` exposes ``root_agent`` lazily so ``adk web`` can
+load this module for interactive (schema-free) use.
+"""
+
+from __future__ import annotations
+
+import json
+from typing import Any
+
+import pandas as pd
+from aieng.forecasting.data.context import ForecastContext
+from aieng.forecasting.evaluation.task import ForecastingTask
+from aieng.forecasting.methods.agentic import (
+    AgentPredictor,
+    CategoricalAgentForecastOutput,
+    build_adk_agent,
+)
+from aieng.forecasting.methods.agentic.agent_factory import (
+    AgentConfig,
+    ContextRetrievalConfig,
+)
+from aieng.forecasting.models import LITE_MODEL
+from boc_rate_decisions.data import (
+    BOND_YIELD_2YR_SERIES_ID,
+    CPI_SERIES_ID,
+    TARGET_RATE_SERIES_ID,
+    UNEMPLOYMENT_SERIES_ID,
+)
+from boc_rate_decisions.predictors.logistic_baseline import build_feature_row
+from pydantic import BaseModel
+
+
+# ---------------------------------------------------------------------------
+# System prompt (root analyst agent)
+# ---------------------------------------------------------------------------
+
+
+def _build_boc_analyst_instruction() -> str:
+    """Build the BoC analyst instruction, embedding the output schema from the class.
+
+    Using a function instead of a static string ensures the ``## Output
+    schema`` block is always in sync with ``CategoricalAgentForecastOutput``
+    — no manual JSON to maintain.
+    """
+    schema = CategoricalAgentForecastOutput.prompt_schema_json(labels=["cut", "hold", "hike"])
+    return (
+        "## Role\n\n"
+        "You are an expert Bank of Canada monetary-policy analyst. You produce a "
+        "calibrated probability distribution over what the Bank does to its target "
+        "for the overnight rate at a specific upcoming fixed announcement date — "
+        "CUT (lower), HOLD (unchanged), or HIKE (raise) — grounded in the "
+        "policy-rate path, the Bank's 2% CPI inflation target, labour-market and "
+        "bond-market conditions, and the Bank's institutional behaviour "
+        "(gradualism, data dependence, reluctance to surprise markets).\n\n"
+        "## Forecasting contract\n\n"
+        "You will receive a JSON payload containing:\n"
+        "- `task`: the task identifier and question\n"
+        "- `as_of`: the forecast origin date (YYYY-MM-DD) — your information cutoff\n"
+        "- `announcement_date`: the fixed announcement date being predicted; the "
+        "gap between `as_of` and this date is your forecast lead time\n"
+        "- `policy_rate`: current target rate and the dated history of past rate "
+        "changes\n"
+        "- `meeting_outcomes`: per-meeting decision history (cut / hold / hike) "
+        "with the realised base rates for each outcome\n"
+        "- `macro_snapshot`: leak-safe indicators as of the origin (CPI inflation "
+        "vs the 2% target, unemployment momentum, 2-year GoC yield vs the policy "
+        "rate)\n\n"
+        "Rules:\n"
+        "1. Assign one probability to each of `cut`, `hold`, and `hike` — a move "
+        "of any size counts. The three probabilities must sum to 1.\n"
+        "2. Report CALIBRATED probabilities, not your confidence in a point view: "
+        "across many questions where you assign 0.7 to an outcome, that outcome "
+        "should occur about 70% of the time. Anchor on the historical base rates, "
+        "then adjust.\n"
+        "3. Cuts and hikes cluster into easing and tightening cycles; the macro "
+        "snapshot tells you whether you are in one. The 2-year yield trading well "
+        "below the policy rate means the bond market is pricing cuts; well above "
+        "means it is pricing hikes. Direct cut-to-hike reversals between adjacent "
+        "meetings essentially never happen, so the recent decision history should "
+        "strongly shape which tail outcome is plausible.\n"
+        "4. Use ONLY information available on or before `as_of`. Do not use "
+        "knowledge of what the Bank actually decided on or after "
+        "`announcement_date`, even if you remember it.\n"
+        "5. Document your reasoning in `reasoning` and list the decisive inputs "
+        "in `key_signals` — these are compared against the Bank's own published "
+        "rationale by a downstream evaluator, so be specific.\n\n"
+        "## Output schema\n\n"
+        "Call `set_model_response` with a `json_response` string matching "
+        "**exactly**:\n\n"
+        "```json\n" + schema + "\n```\n"
+    )
+
+
+_BOC_ANALYST_INSTRUCTION = _build_boc_analyst_instruction()
+
+
+# ---------------------------------------------------------------------------
+# Context retrieval instruction (sub-agent) — seam for the report-grounded variant
+# ---------------------------------------------------------------------------
+
+_BOC_CONTEXT_RETRIEVAL_INSTRUCTION = """\
+You are a Canadian monetary-policy intelligence specialist with access to web search.
+
+Search for information relevant to the query and return a concise structured \
+markdown summary (3-5 paragraphs) covering relevant aspects of:
+- Recent Bank of Canada communications: statements, speeches, Monetary Policy Reports
+- Canadian CPI inflation prints and core-inflation measures vs the 2% target
+- Canadian labour market: employment reports, unemployment rate, wage growth
+- Market pricing of the upcoming decision (overnight index swaps, economist surveys)
+- Macro shocks relevant to Canada: oil prices, exchange rate, US policy, trade
+
+Ground your summary in the search results you actually retrieve. \
+When a cutoff date is specified, do not report or speculate about events \
+that occurred after that date.\
+"""
+
+
+# ---------------------------------------------------------------------------
+# Prompt builder
+# ---------------------------------------------------------------------------
+
+
+def _rate_change_history(rate_df: pd.DataFrame, max_changes: int = 40) -> list[dict[str, object]]:
+    """Compress the daily step-wise policy rate into its change points.
+
+    The daily series is constant between decisions, so the list of
+    ``(date, new_rate)`` change points carries its information in a tiny
+    fraction of the tokens. The first observation is never emitted as a change.
+    """
+    values = rate_df["value"].astype(float)
+    changed = values.diff().fillna(0.0) != 0.0
+    changes = rate_df.loc[changed, ["timestamp", "value"]].tail(max_changes)
+    return [
+        {"date": str(pd.Timestamp(ts).date()), "new_target_rate_pct": float(v)}
+        for ts, v in zip(changes["timestamp"], changes["value"])
+    ]
+
+
+class BoCDecisionPromptBuilder(BaseModel):
+    """Prompt builder for the BoC 3-way rate-direction prediction task.
+
+    Produces a structured JSON payload containing the question, the policy
+    rate path (compressed to change points), the per-meeting decision history
+    (serialised as ``cut`` / ``hold`` / ``hike`` labels with per-outcome base
+    rates), and a leak-safe macro snapshot (shared with the logistic baseline
+    via
+    :func:`~boc_rate_decisions.predictors.logistic_baseline.build_feature_row`,
+    so the agent and the conventional model see exactly the same indicators).
+
+    Implements the
+    :class:`~aieng.forecasting.methods.agentic.predictor.ForecastPromptBuilder`
+    protocol (structural typing — no explicit inheritance required).
+    """
+
+    model_config = {"extra": "forbid"}
+
+    def __call__(self, *, task: ForecastingTask, context: ForecastContext) -> str:
+        """Serialise the task and cutoff-filtered context into a JSON payload.
+
+        Parameters
+        ----------
+        task : ForecastingTask
+            The categorical rate-direction task — supplies ``task_id``,
+            ``description``, the ordered ``categories`` mapping series values
+            to labels, and the single-step horizon used to derive the
+            announcement date.
+        context : ForecastContext
+            The information state at forecast time (cutoff-enforced).
+
+        Returns
+        -------
+        str
+            JSON-serialised payload for the analyst agent.
+
+        Raises
+        ------
+        ValueError
+            If the task does not declare categories or an observed outcome
+            does not match any declared category value.
+        """
+        if task.categories is None:
+            raise ValueError(f"{type(self).__name__} requires a categorical task with declared categories.")
+
+        as_of = pd.Timestamp(context.as_of)
+        offset = pd.tseries.frequencies.to_offset(task.frequency)
+        announcement_date = as_of + offset * task.horizons[0]
+
+        direction_df = context.get_series(task.target_series_id)
+        rate_df = context.get_series(TARGET_RATE_SERIES_ID)
+        yield_df = context.get_series(BOND_YIELD_2YR_SERIES_ID)
+        cpi_df = context.get_series(CPI_SERIES_ID)
+        unemployment_df = context.get_series(UNEMPLOYMENT_SERIES_ID)
+
+        features = build_feature_row(as_of, rate_df, yield_df, cpi_df, unemployment_df)
+
+        labels_by_value = {category.value: category.label for category in task.categories}
+        outcomes: list[dict[str, object]] = []
+        counts = {category.label: 0 for category in task.categories}
+        for ts, value in zip(direction_df["timestamp"], direction_df["value"]):
+            label = labels_by_value.get(float(value))
+            if label is None:
+                raise ValueError(
+                    f"Observed outcome {float(value)} does not match any task category value "
+                    f"({sorted(labels_by_value)})."
+                )
+            outcomes.append({"announcement_date": str(pd.Timestamp(ts).date()), "decision": label})
+            counts[label] += 1
+
+        n_meetings = len(outcomes)
+        base_rates = {label: round(count / n_meetings, 4) for label, count in counts.items()} if n_meetings else None
+
+        payload: dict[str, Any] = {
+            "task": {"task_id": task.task_id, "question": task.description},
+            "as_of": str(as_of.date()),
+            "announcement_date": str(announcement_date.date()),
+            "policy_rate": {
+                "current_target_rate_pct": float(rate_df["value"].iloc[-1]),
+                "rate_changes": _rate_change_history(rate_df),
+            },
+            "meeting_outcomes": {
+                "history": outcomes,
+                "n_meetings": n_meetings,
+                "counts": counts,
+                "historical_base_rates": base_rates,
+            },
+            "macro_snapshot": features if features is not None else "insufficient history at this origin",
+        }
+        return json.dumps(payload, indent=2)
+
+
+# ---------------------------------------------------------------------------
+# AgentConfig factories
+# ---------------------------------------------------------------------------
+
+
+def build_boc_basic_config(model: str = LITE_MODEL) -> AgentConfig:
+    """Build the quantitative-only BoC analyst config (no tools).
+
+    The agent reasons purely from the rate path, outcome history, and macro
+    snapshot in the prompt payload — the same information set as the
+    logistic baseline, making the comparison between conventional fitting
+    and LLM reasoning clean.
+
+    Parameters
+    ----------
+    model : str
+        Model identifier for the analyst agent.
+
+    Returns
+    -------
+    AgentConfig
+    """
+    return AgentConfig(
+        name="boc_analyst_basic",
+        model=model,
+        instruction=_BOC_ANALYST_INSTRUCTION,
+    )
+
+
+def build_boc_news_config(
+    model: str = LITE_MODEL,
+    search_model: str = LITE_MODEL,
+) -> AgentConfig:
+    """Build the news-grounded BoC analyst config (bounded Google Search).
+
+    Wires a context-retrieval sub-agent that enforces a temporal cutoff on
+    every search call. This factory is the seam for the deferred
+    report-grounded variant: replacing the retrieval instruction (and later,
+    the retrieval tool) with BoC press-release / MPR retrieval upgrades the
+    agent without changing the forecasting contract.
+
+    Parameters
+    ----------
+    model : str
+        Model for the top-level analyst agent.
+    search_model : str
+        Model for the context-retrieval (web-search) sub-tool. Defaults to
+        the lite model (``gemini-3.1-flash-lite-preview``) independently of ``model`` so
+        that Gemini handles Google Search even when the analyst uses a
+        different provider.
+
+    Returns
+    -------
+    AgentConfig
+    """
+    return AgentConfig(
+        name="boc_analyst_news",
+        model=model,
+        instruction=_BOC_ANALYST_INSTRUCTION,
+        context_retrieval=ContextRetrievalConfig(
+            enabled=True,
+            instruction=_BOC_CONTEXT_RETRIEVAL_INSTRUCTION,
+            search_model=search_model,
+        ),
+    )
+
+
+# ---------------------------------------------------------------------------
+# Predictor convenience factory
+# ---------------------------------------------------------------------------
+
+
+def build_boc_agent_predictor(config: AgentConfig) -> AgentPredictor:
+    """Wrap an :class:`AgentConfig` in an :class:`AgentPredictor`.
+
+    Uses :class:`BoCDecisionPromptBuilder` and the
+    :class:`~aieng.forecasting.methods.agentic.outputs.CategoricalAgentForecastOutput`
+    schema, which converts the agent's cut/hold/hike distribution into a
+    single
+    :class:`~aieng.forecasting.evaluation.prediction.CategoricalForecast`
+    prediction and preserves ``reasoning`` / ``key_signals`` in metadata for
+    the planned reasoning-alignment evaluation.
+
+    Parameters
+    ----------
+    config : AgentConfig
+        Any config produced by :func:`build_boc_basic_config` or
+        :func:`build_boc_news_config`.
+
+    Returns
+    -------
+    AgentPredictor
+    """
+    return AgentPredictor(
+        agent_config=config,
+        prompt_builder=BoCDecisionPromptBuilder(),
+        output_schema=CategoricalAgentForecastOutput,
+    )
+
+
+# ---------------------------------------------------------------------------
+# Lazy root_agent for `adk web` interactive use
+# ---------------------------------------------------------------------------
+
+
+def __getattr__(name: str) -> Any:
+    """Expose ``root_agent`` lazily for schema-free interactive use via ``adk web``."""
+    if name == "root_agent":
+        return build_adk_agent(build_boc_basic_config())
+    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
+```
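Note on `_rate_change_history` in the file above: the change-point compression can be sketched on a toy series. `toy_rate_change_history` is a hypothetical stand-alone copy, and the dates and rates are made up for illustration. It assumes the same `timestamp` / `value` columns the original reads.

```python
import pandas as pd


def toy_rate_change_history(rate_df: pd.DataFrame, max_changes: int = 40) -> list[dict[str, object]]:
    """Hypothetical copy: keep only the rows where the step-wise rate changes."""
    values = rate_df["value"].astype(float)
    # diff() is NaN on the first row; filling with 0.0 means the first
    # observation is never reported as a change point.
    changed = values.diff().fillna(0.0) != 0.0
    changes = rate_df.loc[changed, ["timestamp", "value"]].tail(max_changes)
    return [
        {"date": str(pd.Timestamp(ts).date()), "new_target_rate_pct": float(v)}
        for ts, v in zip(changes["timestamp"], changes["value"])
    ]


# Six made-up daily observations with two steps down.
toy = pd.DataFrame(
    {
        "timestamp": pd.date_range("2024-06-03", periods=6, freq="D"),
        "value": [5.0, 5.0, 4.75, 4.75, 4.75, 4.5],
    }
)
history = toy_rate_change_history(toy)
print(history)
```

Six daily rows collapse to two change points. The starting level (5.0 here) does not appear in the list, so a consumer that needs it has to read it from elsewhere; in the prompt payload only the current rate is carried separately, as `current_target_rate_pct`.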
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__data.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__data.py.md
new file mode 100644
index 0000000..2b7bf25
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__data.py.md
@@ -0,0 +1,517 @@
+# Source: implementations/boc_rate_decisions/data.py
+
+kind: python
+
+```python
+"""Data-service setup for the Bank of Canada rate-decision experiment.
+
+This use case predicts Bank of Canada decisions at the next fixed announcement
+date — either the compact binary rate-cut event or the ordered 3-way decision
+direction. These are discrete event-prediction problems, not time-series
+problems.
+Three kinds of data come together here:
+
+1. **The daily target rate** (StatCan table 10-10-0139-01, "Target rate"):
+   the ground-truth policy instrument, daily since 1992.
+2. **The meeting calendar** (``meeting_schedule.yaml``): curated fixed
+   announcement dates. Required because *hold* decisions — most meetings —
+   leave no trace in any rate series.
+3. **Derived per-meeting decision series**: ``boc_rate_cut_event`` is a 0/1
+   cut indicator; ``boc_rate_decision_direction`` stores ``-1.0`` for cuts,
+   ``0.0`` for holds, and ``1.0`` for hikes. Deriving these as
+   first-class series means the standard resolution and scoring paths in the
+   evaluation harness apply unchanged.
+
+Macro covariates (CPI, unemployment, bond yields) are registered for the
+conventional baseline and for prompt context. **Leakage warning:** monthly
+covariates carry approximate ``released_at`` stamps (see the adapters);
+feature code must lag them conservatively rather than trusting day-level
+release precision. The daily market series use ``release_lag_days=1``.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+from typing import Literal
+
+import pandas as pd
+import yaml
+from aieng.forecasting.data import DataService, SeriesMetadata
+from aieng.forecasting.data.adapters.base import BaseAdapter
+from aieng.forecasting.data.adapters.fred import FREDAdapter
+from aieng.forecasting.data.adapters.statcan import StatCanAdapter
+from aieng.forecasting.evaluation.task import TaskCategory
+
+
+# ---------------------------------------------------------------------------
+# Canonical series IDs (referenced by specs, notebooks, and predictors)
+# ---------------------------------------------------------------------------
+
+TARGET_RATE_SERIES_ID = "boc_overnight_target_rate"
+"""Daily BoC target for the overnight rate (percent)."""
+
+RATE_CUT_EVENT_SERIES_ID = "boc_rate_cut_event"
+"""Derived per-meeting 0/1 series: 1.0 if the target rate was cut at that meeting."""
+
+DIRECTION_SERIES_ID = "boc_rate_decision_direction"
+"""Derived per-meeting direction series: -1.0 cut, 0.0 hold, 1.0 hike."""
+
+DIRECTION_TASK_CATEGORIES: list[TaskCategory] = [
+    TaskCategory(label="cut", value=-1.0),
+    TaskCategory(label="hold", value=0.0),
+    TaskCategory(label="hike", value=1.0),
+]
+"""Ordered categories for the 3-way BoC decision-direction task."""
+
+BOND_YIELD_2YR_SERIES_ID = "boc_govt_bond_yield_2yr"
+"""Daily Government of Canada 2-year benchmark bond yield (percent)."""
+
+CPI_SERIES_ID = "cpi_all_items_canada"
+"""Monthly CPI All-items, Canada (2002=100). Shared with the getting-started use case."""
+
+UNEMPLOYMENT_SERIES_ID = "fred_canada_unemployment_rate"
+"""Monthly Canadian unemployment rate, seasonally adjusted (percent, FRED)."""
+
+RATES_TABLE_ID = "10-10-0139-01"
+"""StatCan financial-market statistics table (daily, Bank of Canada rates and yields)."""
+
+CPI_TABLE_ID = "18-10-0004-11"
+"""StatCan CPI table (monthly, not seasonally adjusted)."""
+
+UNEMPLOYMENT_FRED_ID = "LRUNTTTTCAM156S"
+"""FRED series: Monthly Unemployment Rate, Total, All Persons for Canada (SA)."""
+
+MEETING_SCHEDULE_PATH = Path(__file__).resolve().parent / "meeting_schedule.yaml"
+"""Committed, source-cited BoC fixed announcement date calendar."""
+
+DEFAULT_STATCAN_CACHE_DIR = Path("data/statcan")
+"""Default stats-can zip cache directory (resolved relative to CWD at call time)."""
+
+DEFAULT_FRED_CACHE_DIR = Path("data/fred")
+"""Default FRED parquet cache directory (resolved relative to CWD at call time)."""
+
+#: Maximum days after an announcement to look for the post-decision rate
+#: observation. Announcements are Tue/Wed, so the next business-day print is
+#: 1-5 calendar days out (holidays included); 7 stays clear of the next meeting.
+_POST_MEETING_LOOKAHEAD_DAYS = 7
+
+
+# ---------------------------------------------------------------------------
+# Meeting schedule
+# ---------------------------------------------------------------------------
+
+
+def load_meeting_schedule(path: Path | None = None) -> list[pd.Timestamp]:
+    """Load the BoC fixed announcement dates from the committed YAML calendar.
+
+    Parameters
+    ----------
+    path : Path or None
+        Override for the schedule file location (used in tests). Defaults to
+        the committed ``meeting_schedule.yaml`` next to this module.
+
+    Returns
+    -------
+    list[pd.Timestamp]
+        Announcement dates, sorted ascending.
+    """
+    schedule_path = path if path is not None else MEETING_SCHEDULE_PATH
+    with schedule_path.open() as f:
+        raw = yaml.safe_load(f)
+    dates = [pd.Timestamp(d) for d in raw["announcement_dates"]]
+    return sorted(dates)
+
+
+def load_unscheduled_announcements(path: Path | None = None) -> list[pd.Timestamp]:
+    """Load known unscheduled (emergency) announcement dates from the calendar file.
+
+    Parameters
+    ----------
+    path : Path or None
+        Override for the schedule file location. Defaults to the committed file.
+
+    Returns
+    -------
+    list[pd.Timestamp]
+        Emergency announcement dates, sorted ascending. May be empty.
+    """
+    schedule_path = path if path is not None else MEETING_SCHEDULE_PATH
+    with schedule_path.open() as f:
+        raw = yaml.safe_load(f)
+    return sorted(pd.Timestamp(d) for d in raw.get("unscheduled_announcements", []))
+
+
+# ---------------------------------------------------------------------------
+# Event derivation
+# ---------------------------------------------------------------------------
+
+
+def derive_rate_decision_directions(rate_df: pd.DataFrame, meeting_dates: list[pd.Timestamp]) -> pd.DataFrame:
+    """Derive the per-meeting -1/0/+1 decision-direction series from the daily target rate.
+
+    For each meeting date ``d``, the outcome compares:
+
+    - ``rate_before``: the last daily observation strictly **before** ``d``, and
+    - ``rate_after``: the first daily observation strictly **after** ``d``
+      (within a short lookahead window).
+
+    Reading strictly after the announcement date makes the rule robust to
+    both effective-date regimes: before 2021 a change took effect the same
+    day (so the next day also shows the new rate); since 2021 it takes effect
+    the next business day (so the announcement-day print still shows the old
+    rate). Intermeeting emergency moves shift ``rate_before`` of the *next*
+    meeting, which is exactly the right behaviour — the meeting outcome is
+    "what did the Bank do at this announcement", not "is the rate different
+    than at the previous meeting".
+
+    Meetings without observations on both sides (e.g. future scheduled dates)
+    are skipped.
+
+    Parameters
+    ----------
+    rate_df : pd.DataFrame
+        Daily target-rate series in canonical format (``timestamp``,
+        ``value``, ``released_at``), sorted ascending.
+    meeting_dates : list[pd.Timestamp]
+        Fixed announcement dates to resolve.
+
+    Returns
+    -------
+    pd.DataFrame
+        Canonical direction series: ``timestamp`` (announcement date),
+        ``value`` (-1.0 = cut, 0.0 = hold, 1.0 = hike), ``released_at`` (announcement date —
+        the outcome is public the moment it is announced).
+    """
+    timestamps = pd.to_datetime(rate_df["timestamp"]).reset_index(drop=True)
+    values = rate_df["value"].astype(float).reset_index(drop=True)
+
+    rows: list[dict[str, object]] = []
+    for meeting in meeting_dates:
+        before_mask = timestamps < meeting
+        after_mask = (timestamps > meeting) & (timestamps <= meeting + pd.Timedelta(days=_POST_MEETING_LOOKAHEAD_DAYS))
+        if not before_mask.any() or not after_mask.any():
+            continue
+        rate_before = float(values[before_mask].iloc[-1])
+        rate_after = float(values[after_mask].iloc[0])
+        rows.append(
+            {
+                "timestamp": meeting,
+                "value": -1.0 if rate_after < rate_before else 1.0 if rate_after > rate_before else 0.0,
+                "released_at": meeting,
+            }
+        )
+
+    return pd.DataFrame(rows, columns=["timestamp", "value", "released_at"])
+
+
+def derive_rate_cut_events(rate_df: pd.DataFrame, meeting_dates: list[pd.Timestamp]) -> pd.DataFrame:
+    """Derive the per-meeting 0/1 rate-cut event series from the daily target rate.
+
+    This binary wrapper uses the same before/after comparison as
+    :func:`derive_rate_decision_directions`, returning ``1.0`` exactly when
+    the meeting direction is a cut and ``0.0`` for holds or hikes. Meetings
+    without observations on both sides (e.g. future scheduled dates) are
+    skipped.
+
+    Parameters
+    ----------
+    rate_df : pd.DataFrame
+        Daily target-rate series in canonical format (``timestamp``,
+        ``value``, ``released_at``), sorted ascending.
+    meeting_dates : list[pd.Timestamp]
+        Fixed announcement dates to resolve.
+
+    Returns
+    -------
+    pd.DataFrame
+        Canonical event series: ``timestamp`` (announcement date), ``value``
+        (1.0 = cut, 0.0 = hold or hike), ``released_at`` (announcement date —
+        the outcome is public the moment it is announced).
+    """
+    directions = derive_rate_decision_directions(rate_df, meeting_dates)
+    events = directions.copy()
+    events["value"] = (events["value"] == -1.0).astype(float)
+    return events
+
+
+def validate_schedule_against_rate_series(
+    rate_df: pd.DataFrame,
+    meeting_dates: list[pd.Timestamp],
+    unscheduled_dates: list[pd.Timestamp] | None = None,
+) -> list[pd.Timestamp]:
+    """Cross-check the curated calendar against observed target-rate changes.
+
+    Every day-over-day change in the daily target rate must be attributable
+    to a scheduled meeting or a known unscheduled announcement on, or within
+    a few days before, the change (rate changes print 1-3 business days after
+    the announcement depending on the effective-date regime). A non-empty
+    return value means the curated schedule is missing or misdating a meeting
+    — derived cut/hike outcomes would then be wrong, so callers should treat
+    any return entries as an error.
+
+    Hold meetings that are misdated cannot be detected this way (no change to
+    observe), but a misdated hold still resolves to the correct outcome.
+
+    Parameters
+    ----------
+    rate_df : pd.DataFrame
+        Daily target-rate series in canonical format, sorted ascending.
+    meeting_dates : list[pd.Timestamp]
+        Scheduled announcement dates.
+    unscheduled_dates : list[pd.Timestamp] or None
+        Known emergency announcement dates. Defaults to none.
+
+    Returns
+    -------
+    list[pd.Timestamp]
+        Dates of observed rate changes (first day printing the new rate) not
+        attributable to any known announcement. Empty when the calendar is
+        consistent with the data.
+    """
+    announcements = sorted(list(meeting_dates) + list(unscheduled_dates or []))
+    if not announcements:
+        return []
+    window_start, window_end = announcements[0], announcements[-1] + pd.Timedelta(days=_POST_MEETING_LOOKAHEAD_DAYS)
+
+    df = rate_df.sort_values("timestamp").reset_index(drop=True)
+    changed = df["value"].astype(float).diff().fillna(0.0) != 0.0
+    change_dates = pd.to_datetime(df.loc[changed, "timestamp"])
+
+    orphans: list[pd.Timestamp] = []
+    for change_date in change_dates:
+        if not (window_start <= change_date <= window_end):
+            continue
+        attributable = any(
+            ann <= change_date <= ann + pd.Timedelta(days=_POST_MEETING_LOOKAHEAD_DAYS)
+            for ann in announcements
+        )
+        if not attributable:
+            orphans.append(pd.Timestamp(change_date))
+    return orphans
+
+
+class BoCDecisionEventAdapter(BaseAdapter):
+    """Adapter producing derived per-meeting BoC decision series.
+
+    Joins the committed meeting calendar with the daily target-rate series at
+    fetch time, so derived event series always reflect the freshest cached
+    rate data without a separate materialisation step.
+
+    Parameters
+    ----------
+    rate_adapter : BaseAdapter
+        Adapter for the daily target-rate series (canonical format).
+    meeting_dates : list[pd.Timestamp]
+        Fixed announcement dates to resolve into outcomes.
+    kind : {"cut", "direction"}
+        Derived target to produce: binary cut event or 3-way direction.
+    """
+
+    def __init__(
+        self,
+        rate_adapter: BaseAdapter,
+        meeting_dates: list[pd.Timestamp],
+        kind: Literal["cut", "direction"],
+    ) -> None:
+        self._rate_adapter = rate_adapter
+        self._meeting_dates = sorted(meeting_dates)
+        self._kind = kind
+
+    def fetch(self) -> pd.DataFrame:
+        """Return the derived decision series in canonical format.
+
+        Returns
+        -------
+        pd.DataFrame
+            Columns ``timestamp``, ``value``, ``released_at``; one row per
+            resolvable meeting, sorted ascending.
+        """
+        rate_df = self._rate_adapter.fetch()
+        if self._kind == "cut":
+            return derive_rate_cut_events(rate_df, self._meeting_dates)
+        if self._kind == "direction":
+            return derive_rate_decision_directions(rate_df, self._meeting_dates)
+        raise ValueError(f"Unsupported BoC decision event kind: {self._kind!r}.")
+
+
+# ---------------------------------------------------------------------------
+# Service builder
+# ---------------------------------------------------------------------------
+
+
+def build_boc_service(
+    statcan_cache_dir: Path | None = None,
+    fred_cache_dir: Path | None = None,
+    schedule_path: Path | None = None,
+    include_fred: bool = True,
+) -> DataService:
+    """Return a :class:`DataService` with all BoC rate-decision series registered.
+
+    Registers, in order:
+
+    - ``boc_overnight_target_rate`` — daily policy rate (StatCan 10-10-0139-01).
+    - ``boc_rate_cut_event`` — derived 0/1 per-meeting event series (the
+      binary task target).
+    - ``boc_rate_decision_direction`` — derived -1/0/+1 per-meeting direction
+      series (the ordered-categorical task target).
+    - ``boc_govt_bond_yield_2yr`` — daily 2-year GoC benchmark yield, a
+      market-implied gauge of near-term policy expectations.
+    - ``cpi_all_items_canada`` — monthly headline CPI (the BoC targets 2%
+      CPI inflation).
+    - ``fred_canada_unemployment_rate`` — monthly labour-market covariate.
+
+    Parameters
+    ----------
+    statcan_cache_dir : Path or None
+        stats-can cache directory. Defaults to ``data/statcan`` relative to
+        the current working directory. Populate with ``scripts/fetch_boc.py``.
+    fred_cache_dir : Path or None
+        FRED parquet cache directory. Defaults to ``data/fred``. Populate
+        with ``scripts/fetch_fred.py`` (requires ``FRED_API_KEY`` on first run).
+    schedule_path : Path or None
+        Override for the meeting calendar file (used in tests).
+    include_fred : bool
+        When ``False``, skip the FRED unemployment covariate. Registration
+        fetches eagerly, so this lets ``scripts/fetch_boc.py`` populate the
+        StatCan cache before the FRED cache exists.
+
+    Returns
+    -------
+    DataService
+        Ready to hand to ``backtest`` / ``evaluate`` / notebook exploration.
+    """
+    statcan_dir = statcan_cache_dir if statcan_cache_dir is not None else DEFAULT_STATCAN_CACHE_DIR
+    fred_dir = fred_cache_dir if fred_cache_dir is not None else DEFAULT_FRED_CACHE_DIR
+    meeting_dates = load_meeting_schedule(schedule_path)
+
+    svc = DataService()
+
+    target_rate_adapter = StatCanAdapter(
+        table_id=RATES_TABLE_ID,
+        member_filter={"GEO": "Canada", "Financial market statistics": "Target rate"},
+        cache_dir=statcan_dir,
+        release_lag_days=1,  # daily market data, published next business day
+    )
+    svc.register(
+        TARGET_RATE_SERIES_ID,
+        target_rate_adapter,
+        SeriesMetadata(
+            series_id=TARGET_RATE_SERIES_ID,
+            description="Bank of Canada target for the overnight rate (policy rate), daily",
+            source=f"StatCan ({RATES_TABLE_ID})",
+            units="Percent",
+            frequency="B",
+            table_id=RATES_TABLE_ID,
+        ),
+    )
+
+    svc.register(
+        RATE_CUT_EVENT_SERIES_ID,
+        BoCDecisionEventAdapter(target_rate_adapter, meeting_dates, kind="cut"),
+        SeriesMetadata(
+            series_id=RATE_CUT_EVENT_SERIES_ID,
+            description=(
+                "Rate-cut indicator per BoC fixed announcement date: 1.0 if the target "
+                "for the overnight rate was lowered at that announcement, else 0.0 "
+                "(holds and hikes). Derived from the daily target rate and the "
+                "committed meeting calendar."
+            ),
+            source=f"Derived (StatCan {RATES_TABLE_ID} + meeting_schedule.yaml)",
+            units="0/1 event indicator",
+            frequency="irregular (8 fixed announcement dates per year)",
+        ),
+    )
+
+    svc.register(
+        DIRECTION_SERIES_ID,
+        BoCDecisionEventAdapter(target_rate_adapter, meeting_dates, kind="direction"),
+        SeriesMetadata(
+            series_id=DIRECTION_SERIES_ID,
+            description=(
+                "Per-meeting direction of the BoC target-rate decision: -1 cut, "
+                "0 hold, +1 hike. Derived from the daily target rate and the "
+                "committed meeting calendar."
+            ),
+            source=f"Derived (StatCan {RATES_TABLE_ID} + meeting_schedule.yaml)",
+            units="-1/0/+1 direction indicator",
+            frequency="irregular (8 fixed announcement dates per year)",
+        ),
+    )
+
+    svc.register(
+        BOND_YIELD_2YR_SERIES_ID,
+        StatCanAdapter(
+            table_id=RATES_TABLE_ID,
+            member_filter={
+                "GEO": "Canada",
+                "Financial market statistics": "Government of Canada benchmark bond yields, 2 year",
+            },
+            cache_dir=statcan_dir,
+            release_lag_days=1,
+        ),
+        SeriesMetadata(
+            series_id=BOND_YIELD_2YR_SERIES_ID,
+            description="Government of Canada benchmark bond yield, 2 year, daily",
+            source=f"StatCan ({RATES_TABLE_ID})",
+            units="Percent",
+            frequency="B",
+            table_id=RATES_TABLE_ID,
+        ),
+    )
+
+    svc.register(
+        CPI_SERIES_ID,
+        StatCanAdapter(
+            table_id=CPI_TABLE_ID,
+            member_filter={"GEO": "Canada", "Products and product groups": "All-items"},
+            cache_dir=statcan_dir,
+        ),
+        SeriesMetadata(
+            series_id=CPI_SERIES_ID,
+            description="CPI All-items, Canada (2002=100)",
+            source=f"StatCan ({CPI_TABLE_ID})",
+            units="Index 2002=100",
+            frequency="MS",
+            table_id=CPI_TABLE_ID,
+        ),
+    )
+
+    if include_fred:
+        svc.register(
+            UNEMPLOYMENT_SERIES_ID,
+            FREDAdapter(UNEMPLOYMENT_FRED_ID, cache_dir=fred_dir),
+            SeriesMetadata(
+                series_id=UNEMPLOYMENT_SERIES_ID,
+                description="Unemployment rate, total, all persons, Canada (seasonally adjusted)",
+                source=f"FRED ({UNEMPLOYMENT_FRED_ID})",
+                units="Percent",
+                frequency="MS",
+            ),
+        )
+
+    return svc
+
+
+__all__ = [
+    "BOND_YIELD_2YR_SERIES_ID",
+    "CPI_SERIES_ID",
+    "CPI_TABLE_ID",
+    "DEFAULT_FRED_CACHE_DIR",
+    "DEFAULT_STATCAN_CACHE_DIR",
+    "DIRECTION_SERIES_ID",
+    "DIRECTION_TASK_CATEGORIES",
+    "MEETING_SCHEDULE_PATH",
+    "RATES_TABLE_ID",
+    "RATE_CUT_EVENT_SERIES_ID",
+    "TARGET_RATE_SERIES_ID",
+    "UNEMPLOYMENT_FRED_ID",
+    "UNEMPLOYMENT_SERIES_ID",
+    "BoCDecisionEventAdapter",
+    "build_boc_service",
+    "derive_rate_decision_directions",
+    "derive_rate_cut_events",
+    "load_meeting_schedule",
+    "load_unscheduled_announcements",
+    "validate_schedule_against_rate_series",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__plots.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__plots.py.md
new file mode 100644
index 0000000..5fd25b6
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__plots.py.md
@@ -0,0 +1,575 @@
+# Source: implementations/boc_rate_decisions/plots.py
+
+kind: python
+
+```python
+"""Plotting helpers for the BoC rate-decision experiment.
+
+Centralises the matplotlib boilerplate so the notebooks stay narrative.
+All plots use matplotlib directly (no seaborn / plotly) to minimise
+dependencies. Each helper returns the ``(fig, ax)`` pair it created so the
+caller can further customise or save the figure.
+"""
+
+from __future__ import annotations
+
+import math
+from typing import Literal
+
+import matplotlib.pyplot as plt
+import pandas as pd
+from matplotlib.axes import Axes
+from matplotlib.figure import Figure
+from matplotlib.lines import Line2D
+from matplotlib.patches import Patch
+
+from .analysis import DecisionPanel, calibration_table
+
+
+DEFAULT_PREDICTOR_PALETTE: list[str] = ["#7f7f7f", "#1f77b4", "#2ca02c", "#d62728", "#9467bd", "#ff7f0e"]
+"""Default colour palette for predictors; colours repeat after the sixth predictor."""
+
+CATEGORY_COLORS: dict[str, str] = {"cut": "#d62728", "hold": "#bbbbbb", "hike": "#1b7a76"}
+"""Per-outcome colours shared by the timeline and decision-panel views (cut=red, hold=grey, hike=teal)."""
+
+
+def _resolve_colors(predictors: list[str], colors: dict[str, str] | None) -> dict[str, str]:
+    """Return a ``predictor_id -> colour`` map covering every predictor."""
+    resolved: dict[str, str] = dict(colors or {})
+    next_idx = 0
+    for pid in predictors:
+        if pid in resolved:
+            continue
+        resolved[pid] = DEFAULT_PREDICTOR_PALETTE[next_idx % len(DEFAULT_PREDICTOR_PALETTE)]
+        next_idx += 1
+    return resolved
+
+
+def _resolve_labels(predictors: list[str], labels: dict[str, str] | None) -> dict[str, str]:
+    """Return a ``predictor_id -> display label`` map for legends."""
+    return {pid: (labels or {}).get(pid, pid) for pid in predictors}
+
+
+# ---------------------------------------------------------------------------
+# Exploration: policy rate path with decision markers
+# ---------------------------------------------------------------------------
+
+
+def plot_policy_rate_with_decisions(
+    rate_df: pd.DataFrame,
+    event_df: pd.DataFrame,
+    *,
+    start: pd.Timestamp | None = None,
+    kind: Literal["auto", "event", "direction"] = "auto",
+) -> tuple[Figure, Axes]:
+    """Plot the daily target rate with each announcement marked by its outcome.
+
+    Accepts either the binary cut-event series (0/1, where 1 means cut) or the
+    ordered direction series (-1/0/+1, where -1 means cut, 0 hold, +1 hike).
+    Cuts are red down-triangles, hikes are dark-teal up-triangles, and holds
+    are light grey dots.
+
+    Parameters
+    ----------
+    rate_df : pd.DataFrame
+        Daily target-rate series (``timestamp`` / ``value`` columns).
+    event_df : pd.DataFrame
+        Per-meeting outcome series: 0/1 cut events or -1/0/+1 directions.
+    start : pd.Timestamp or None
+        Optional left cutoff for the x-axis.
+    kind : {"auto", "event", "direction"}
+        Which series modality ``event_df`` holds. ``"auto"`` treats values
+        outside ``{0, 1}`` as the direction series — correct for full
+        histories, but a direction series windowed to holds and hikes only
+        is indistinguishable from a 0/1 event series, so pass the modality
+        explicitly when plotting slices.
+
+    Returns
+    -------
+    (Figure, Axes)
+    """
+    rate = rate_df.copy()
+    rate["timestamp"] = pd.to_datetime(rate["timestamp"])
+    events = event_df.copy()
+    events["timestamp"] = pd.to_datetime(events["timestamp"])
+    if start is not None:
+        rate = rate[rate["timestamp"] >= start]
+        events = events[events["timestamp"] >= start]
+
+    rate_by_date = rate.set_index("timestamp")["value"]
+
+    fig, ax = plt.subplots(figsize=(13, 4.5))
+    ax.plot(rate["timestamp"], rate["value"], color="k", linewidth=1.4, label="Target rate", zorder=3)
+
+    if kind == "auto":
+        observed_values = set(events["value"].dropna().astype(float))
+        direction_series = bool(observed_values - {0.0, 1.0})
+    else:
+        direction_series = kind == "direction"
+    marker_specs = (
+        [
+            (-1.0, "v", "#d62728", 55, "Cut"),
+            (0.0, "o", "#bbbbbb", 18, "Hold"),
+            (1.0, "^", "#1b7a76", 55, "Hike"),
+        ]
+        if direction_series
+        else [
+            (0.0, "o", "#bbbbbb", 18, "Hold / hike"),
+            (1.0, "v", "#d62728", 55, "Cut"),
+        ]
+    )
+
+    for outcome, marker, color, size, label in marker_specs:
+        sub = events[events["value"] == outcome]
+        if sub.empty:
+            continue
+        # Rate level at (or just before) each meeting, for marker placement.
+        marker_rows: list[tuple[pd.Timestamp, float]] = []
+        for ts in sub["timestamp"]:
+            eligible_rates = rate_by_date[rate_by_date.index <= ts]
+            if eligible_rates.empty:
+                continue
+            marker_rows.append((ts, float(eligible_rates.iloc[-1])))
+        if marker_rows:
+            timestamps, levels = zip(*marker_rows, strict=True)
+            ax.scatter(timestamps, levels, marker=marker, s=size, color=color, label=label, zorder=4)
+
+    ax.set_ylabel("Target for the overnight rate (%)")
+    ax.set_title("Bank of Canada target rate with fixed announcement dates by outcome")
+    ax.grid(axis="y", alpha=0.3)
+    ax.legend(fontsize=9, loc="upper left")
+    fig.tight_layout()
+    return fig, ax
+
+
+# ---------------------------------------------------------------------------
+# Reliability (calibration) curve
+# ---------------------------------------------------------------------------
+
+
+def plot_reliability_curve(
+    predictions_df: pd.DataFrame,
+    *,
+    n_bins: int = 5,
+    colors: dict[str, str] | None = None,
+    labels: dict[str, str] | None = None,
+    title_suffix: str | None = None,
+) -> tuple[Figure, Axes]:
+    """Draw one reliability curve per predictor against the diagonal.
+
+    Points on the diagonal are perfectly calibrated. Above it the predictor
+    under-predicts the event; below it, it over-predicts. Marker size scales
+    with bin population, since with ~120 meetings most bins are thin. For
+    categorical tasks, pass a binary-style one-vs-rest frame from
+    :func:`boc_rate_decisions.analysis.one_vs_rest_frame` and use
+    ``title_suffix`` to identify the category, for example
+    ``"P(cut) one-vs-rest"``.
+
+    Parameters
+    ----------
+    predictions_df : pd.DataFrame
+        Tidy frame from :func:`~boc_rate_decisions.analysis.predictions_to_frame`.
+    n_bins : int
+        Number of probability bins (keep small: the sample is ~120 meetings).
+    colors, labels : dict[str, str] or None
+        Optional predictor_id -> colour / display-label maps.
+    title_suffix : str or None
+        Optional suffix appended to the plot title.
+
+    Returns
+    -------
+    (Figure, Axes)
+    """
+    predictor_ids = sorted(predictions_df["predictor_id"].unique())
+    color_map = _resolve_colors(predictor_ids, colors)
+    label_map = _resolve_labels(predictor_ids, labels)
+
+    fig, ax = plt.subplots(figsize=(6.5, 6))
+    ax.plot([0, 1], [0, 1], color="#999", linewidth=1.0, linestyle="--", zorder=1)
+
+    for pid in predictor_ids:
+        table = calibration_table(predictions_df, predictor_id=pid, n_bins=n_bins)
+        if table.empty:
+            continue
+        ax.plot(
+            table["mean_predicted"],
+            table["observed_frequency"],
+            color=color_map[pid],
+            linewidth=1.2,
+            alpha=0.8,
+            zorder=2,
+        )
+        ax.scatter(
+            table["mean_predicted"],
+            table["observed_frequency"],
+            s=table["n"] * 4,
+            color=color_map[pid],
+            label=label_map[pid],
+            alpha=0.85,
+            zorder=3,
+        )
+
+    ax.set_xlim(-0.02, 1.02)
+    ax.set_ylim(-0.02, 1.02)
+    ax.set_xlabel("Mean predicted probability")
+    ax.set_ylabel("Observed frequency")
+    suffix = f": {title_suffix}" if title_suffix else ""
+    ax.set_title(f"Reliability curve{suffix} ({n_bins} bins; marker size = bin count)")
+    ax.legend(fontsize=9, loc="upper left")
+    ax.grid(alpha=0.3)
+    fig.tight_layout()
+    return fig, ax
+
+
+# ---------------------------------------------------------------------------
+# Decision timeline: predicted probabilities vs realised decisions
+# ---------------------------------------------------------------------------
+
+
+def plot_decision_timeline(
+    predictions_df: pd.DataFrame,
+    *,
+    colors: dict[str, str] | None = None,
+    labels: dict[str, str] | None = None,
+) -> tuple[Figure, Axes]:
+    """Plot predicted decision probabilities over time, with outcomes shaded.
+
+    Binary-style frames plot one P(event) line per predictor and shade realised
+    event meetings in red. Categorical frames plot P(cut) as solid lines and
+    P(hike) as dashed lines using the same predictor colour, with realised
+    cuts shaded red and realised hikes shaded teal.
+
+    Parameters
+    ----------
+    predictions_df : pd.DataFrame
+        Tidy frame from :func:`~boc_rate_decisions.analysis.predictions_to_frame`.
+    colors, labels : dict[str, str] or None
+        Optional predictor_id -> colour / display-label maps.
+
+    Returns
+    -------
+    (Figure, Axes)
+    """
+    predictor_ids = sorted(predictions_df["predictor_id"].unique())
+    color_map = _resolve_colors(predictor_ids, colors)
+    label_map = _resolve_labels(predictor_ids, labels)
+    categorical = {"p_cut", "p_hike", "outcome_label"}.issubset(predictions_df.columns) and predictions_df[
+        ["p_cut", "p_hike"]
+    ].notna().any().any()
+
+    fig, ax = plt.subplots(figsize=(13, 4.5))
+
+    if categorical:
+        cut_meetings = sorted(predictions_df.loc[predictions_df["outcome_label"] == "cut", "meeting_date"].unique())
+        hike_meetings = sorted(predictions_df.loc[predictions_df["outcome_label"] == "hike", "meeting_date"].unique())
+        for md in cut_meetings:
+            ts = pd.Timestamp(md)
+            ax.axvspan(ts - pd.Timedelta(days=10), ts + pd.Timedelta(days=10), color="#d62728", alpha=0.15, zorder=1)
+        for md in hike_meetings:
+            ts = pd.Timestamp(md)
+            ax.axvspan(ts - pd.Timedelta(days=10), ts + pd.Timedelta(days=10), color="#1b7a76", alpha=0.12, zorder=1)
+
+        for pid in predictor_ids:
+            sub = predictions_df[predictions_df["predictor_id"] == pid].sort_values("meeting_date")
+            ax.plot(
+                sub["meeting_date"],
+                sub["p_cut"],
+                color=color_map[pid],
+                linewidth=1.3,
+                marker="o",
+                markersize=3.5,
+                label=label_map[pid],
+                zorder=3,
+            )
+            ax.plot(
+                sub["meeting_date"],
+                sub["p_hike"],
+                color=color_map[pid],
+                linewidth=1.3,
+                linestyle="--",
+                marker="^",
+                markersize=3.5,
+                label=None,
+                zorder=3,
+            )
+
+        ax.set_ylabel("Predicted probability")
+        ax.set_title("Predicted decision probabilities by meeting (solid = P(cut), dashed = P(hike))")
+        handles, handle_labels = ax.get_legend_handles_labels()
+        handles.extend(
+            [
+                Line2D([0], [0], color="#d62728", alpha=0.3, linewidth=8),
+                Line2D([0], [0], color="#1b7a76", alpha=0.25, linewidth=8),
+            ]
+        )
+        handle_labels.extend(["Realised cut", "Realised hike"])
+    else:
+        cut_meetings = sorted(predictions_df.loc[predictions_df["outcome"] == 1, "meeting_date"].unique())
+        for md in cut_meetings:
+            ts = pd.Timestamp(md)
+            ax.axvspan(ts - pd.Timedelta(days=10), ts + pd.Timedelta(days=10), color="#d62728", alpha=0.15, zorder=1)
+
+        for pid in predictor_ids:
+            sub = predictions_df[predictions_df["predictor_id"] == pid].sort_values("meeting_date")
+            ax.plot(
+                sub["meeting_date"],
+                sub["probability"],
+                color=color_map[pid],
+                linewidth=1.3,
+                marker="o",
+                markersize=3.5,
+                label=label_map[pid],
+                zorder=3,
+            )
+
+        ax.set_ylabel("Predicted P(cut)")
+        ax.set_title("Predicted cut probability by meeting (red bands = realised cuts)")
+        handles, handle_labels = ax.get_legend_handles_labels()
+        handles.append(Line2D([0], [0], color="#d62728", alpha=0.3, linewidth=8))
+        handle_labels.append("Realised cut")
+
+    ax.set_ylim(-0.03, 1.03)
+    ax.legend(handles, handle_labels, fontsize=9, loc="upper left")
+    ax.grid(axis="y", alpha=0.3)
+    fig.tight_layout()
+    return fig, ax
+
+
+# ---------------------------------------------------------------------------
+# Predicted distributions over time, by method (stacked-area small multiples)
+# ---------------------------------------------------------------------------
+
+
+def plot_probability_timeline(
+    predictions_df: pd.DataFrame,
+    *,
+    labels: dict[str, str] | None = None,
+) -> tuple[Figure, list[Axes]]:
+    """Plot each method's predicted {cut, hold, hike} distribution over meetings.
+
+    One stacked-area panel per predictor: the three category probabilities sum
+    to 1 at every meeting, so a glance shows how each method shifts mass between
+    cut / hold / hike over time. A marker strip above each panel shows the
+    realised outcome at every meeting (filled = resolved, hollow = pending),
+    coloured by :data:`CATEGORY_COLORS`.
+
+    Parameters
+    ----------
+    predictions_df : pd.DataFrame
+        Categorical tidy frame from
+        :func:`~boc_rate_decisions.analysis.predictions_to_frame`. Must carry
+        ``p_cut`` / ``p_hold`` / ``p_hike`` and ``outcome_label`` columns.
+    labels : dict[str, str] or None
+        Optional predictor_id -> display-label map for the panel titles.
+
+    Returns
+    -------
+    (Figure, list[Axes])
+    """
+    categories = ["cut", "hold", "hike"]
+    required = {"predictor_id", "meeting_date", "outcome_label", *(f"p_{c}" for c in categories)}
+    missing = required - set(predictions_df.columns)
+    if missing:
+        raise ValueError(f"plot_probability_timeline requires a categorical frame; missing columns: {sorted(missing)}")
+
+    predictor_ids = predictions_df["predictor_id"].drop_duplicates().tolist()
+    label_map = _resolve_labels(predictor_ids, labels)
+    n = len(predictor_ids)
+
+    fig, axes_grid = plt.subplots(n, 1, figsize=(12, 2.0 * n + 1.0), sharex=True, squeeze=False)
+    axes = list(axes_grid[:, 0])
+
+    for ax, pid in zip(axes, predictor_ids, strict=True):
+        sub = predictions_df[predictions_df["predictor_id"] == pid].sort_values("meeting_date")
+        x = sub["meeting_date"].to_numpy()
+        stacks = [sub[f"p_{c}"].to_numpy() for c in categories]
+        ax.stackplot(x, *stacks, colors=[CATEGORY_COLORS[c] for c in categories], alpha=0.9, zorder=2)
+        ax.set_ylim(0.0, 1.0)
+        ax.set_yticks([0.0, 0.5, 1.0])
+        ax.margins(x=0.01)
+        ax.tick_params(labelsize=8)
+        ax.set_title(label_map[pid], fontsize=10, loc="left")
+
+        for meeting_date, outcome_label in zip(sub["meeting_date"], sub["outcome_label"], strict=True):
+            resolved = isinstance(outcome_label, str) and outcome_label in CATEGORY_COLORS
+            ax.scatter(
+                [meeting_date],
+                [1.07],
+                marker="s",
+                s=30,
+                color=CATEGORY_COLORS[outcome_label] if resolved else "white",
+                edgecolors="black" if resolved else "#999999",
+                linewidths=0.5 if resolved else 0.7,
+                clip_on=False,
+                zorder=5,
+            )
+
+    axes[-1].set_xlabel("Meeting date")
+    handles = [Patch(facecolor=CATEGORY_COLORS[c], label=c) for c in categories]
+    handles.append(
+        Line2D(
+            [0],
+            [0],
+            marker="s",
+            color="white",
+            markerfacecolor="#444444",
+            markeredgecolor="black",
+            markersize=8,
+            linestyle="none",
+            label="realised outcome",
+        )
+    )
+    fig.legend(handles=handles, ncol=4, loc="upper center", fontsize=9, frameon=False, bbox_to_anchor=(0.5, 1.0))
+    fig.suptitle("Predicted distribution over {cut, hold, hike} by meeting, per method", y=1.03, fontsize=11)
+    fig.tight_layout()
+    return fig, axes
+
+
+# ---------------------------------------------------------------------------
+# Decision panel: one meeting, all methods (context + bars + outcome + rationale)
+# ---------------------------------------------------------------------------
+
+
+def _draw_panel_context(ax: Axes, panel: DecisionPanel, rate_df: pd.DataFrame) -> None:
+    """Draw the policy-rate context strip leading into the meeting."""
+    rate = rate_df.copy()
+    rate["timestamp"] = pd.to_datetime(rate["timestamp"])
+    window = rate[
+        (rate["timestamp"] >= panel.origin - pd.Timedelta(days=365)) & (rate["timestamp"] <= panel.meeting_date)
+    ]
+    cur_rate = float("nan")
+    if not window.empty:
+        ax.plot(window["timestamp"], window["value"], color="black", linewidth=1.4)
+        at_origin = window[window["timestamp"] <= panel.origin]
+        cur_rate = float((at_origin if not at_origin.empty else window)["value"].iloc[-1])
+    ax.axvline(panel.origin, color="#888888", linestyle=":", linewidth=1.0)
+    ax.axvline(panel.meeting_date, color=CATEGORY_COLORS.get(panel.outcome_label or "", "#888888"), linewidth=1.6)
+    ax.set_ylabel("Target rate (%)", fontsize=8)
+    ax.tick_params(labelsize=8)
+    rate_str = "n/a" if math.isnan(cur_rate) else f"{cur_rate:.2f}%"
+    ax.set_title(
+        f"rate at origin: {rate_str}    ·    prior decision: {panel.prior_outcome_label or 'n/a'}",
+        fontsize=9,
+        loc="left",
+    )
+
+
+def _draw_panel_bars(ax: Axes, panel: DecisionPanel, label_map: dict[str, str]) -> None:
+    """Draw grouped cut/hold/hike probability bars, one band per method."""
+    cats = panel.categories
+    n = max(len(panel.rows), 1)
+    bar_h = 0.24
+    span = bar_h * (len(cats) - 1)
+    cat_offset = {cat: span / 2.0 - i * bar_h for i, cat in enumerate(cats)}
+    yticks: list[float] = []
+    yticklabels: list[str] = []
+    for i, row in enumerate(panel.rows):
+        y0 = float(n - 1 - i)
+        yticks.append(y0)
+        yticklabels.append(f"{label_map[row.predictor_id]}  (RPS {row.score:.2f})")
+        for cat in cats:
+            y = y0 + cat_offset[cat]
+            prob = row.probabilities[cat]
+            realised = cat == panel.outcome_label
+            ax.barh(
+                y,
+                prob,
+                height=bar_h * 0.9,
+                color=CATEGORY_COLORS[cat],
+                edgecolor="black" if realised else "none",
+                linewidth=1.4 if realised else 0.0,
+                zorder=3,
+            )
+            if realised:
+                ax.annotate("★", (prob + 0.012, y), va="center", ha="left", fontsize=11, color="black")
+    ax.set_yticks(yticks)
+    ax.set_yticklabels(yticklabels, fontsize=9)
+    ax.set_ylim(-0.6, n - 0.4)
+    ax.set_xlim(0.0, 1.08)
+    ax.set_xticks([0.0, 0.25, 0.5, 0.75, 1.0])
+    ax.set_xlabel("Predicted probability")
+    ax.grid(axis="x", alpha=0.25)
+    handles = [Patch(facecolor=CATEGORY_COLORS[c], label=c) for c in cats]
+    handles.append(
+        Line2D(
+            [0],
+            [0],
+            marker="*",
+            color="white",
+            markerfacecolor="black",
+            markeredgecolor="black",
+            markersize=11,
+            linestyle="none",
+            label="realised",
+        )
+    )
+    ax.legend(handles=handles, ncol=len(cats) + 1, fontsize=8, loc="lower right", framealpha=0.9)
+
+
+def plot_decision_panel(
+    panel: DecisionPanel,
+    rate_df: pd.DataFrame,
+    *,
+    labels: dict[str, str] | None = None,
+) -> tuple[Figure, list[Axes]]:
+    """Render one meeting's prediction-vs-outcome panel across all methods.
+
+    Composite figure: a policy-rate context strip (from ~12 months before the
+    forecast origin through the meeting date) and grouped horizontal probability
+    bars per method, with the realised category starred and outlined. Rationales
+    are intentionally *not* drawn here; render them as notebook markdown via
+    :func:`~boc_rate_decisions.analysis.panel_rationales_markdown` so the prose
+    stays legible.
+
+    Parameters
+    ----------
+    panel : DecisionPanel
+        Assembled by :func:`~boc_rate_decisions.analysis.decision_panel_data`.
+    rate_df : pd.DataFrame
+        Daily target-rate series (``timestamp`` / ``value``) for the context
+        strip. Passed in rather than fetched (this module never fetches data).
+    labels : dict[str, str] or None
+        Optional predictor_id -> display-label map.
+
+    Returns
+    -------
+    (Figure, list[Axes])
+        ``[context_ax, bars_ax]``.
+    """
+    rows = panel.rows
+    label_map = _resolve_labels([row.predictor_id for row in rows], labels)
+    n = max(len(rows), 1)
+
+    ctx_h = 1.3
+    bars_h = 0.62 * n + 0.5
+    fig = plt.figure(figsize=(11, ctx_h + bars_h + 0.8))
+    gs = fig.add_gridspec(2, 1, height_ratios=[ctx_h, bars_h], hspace=0.55)
+    ax_ctx = fig.add_subplot(gs[0])
+    ax_bar = fig.add_subplot(gs[1])
+
+    _draw_panel_context(ax_ctx, panel, rate_df)
+    _draw_panel_bars(ax_bar, panel, label_map)
+
+    realised = panel.outcome_label.upper() if panel.outcome_label else "PENDING"
+    fig.suptitle(
+        f"BoC meeting {panel.meeting_date.date()}    ·    issued {panel.origin.date()} (T-28)"
+        f"    ·    REALISED: {realised}",
+        fontsize=12,
+        y=0.99,
+    )
+    # Manual margins (not tight_layout): the GridSpec mixes a context strip and
+    # a bar axis with long y-labels, which tight_layout cannot solve cleanly.
+    fig.subplots_adjust(left=0.26, right=0.97, top=0.91, bottom=0.06)
+    return fig, [ax_ctx, ax_bar]
+
+
+__all__ = [
+    "CATEGORY_COLORS",
+    "DEFAULT_PREDICTOR_PALETTE",
+    "plot_decision_panel",
+    "plot_decision_timeline",
+    "plot_policy_rate_with_decisions",
+    "plot_probability_timeline",
+    "plot_reliability_curve",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__predictors____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__predictors____init__.py.md
new file mode 100644
index 0000000..5ce0ba4
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__predictors____init__.py.md
@@ -0,0 +1,29 @@
+# Source: implementations/boc_rate_decisions/predictors/__init__.py
+
+kind: python
+
+```python
+"""Tuned predictor recipes for the BoC rate-decision experiment.
+
+Use-case-specific predictors and recipes live here, paired with the
+task-agnostic methods in :mod:`aieng.forecasting.methods`:
+
+- :mod:`logistic_baseline` — the conventional baseline: a logistic
+  regression on leak-safe macro features, fit at every forecast origin.
+  Feature engineering is domain-specific, so the predictor lives in the use
+  case (mirroring the placement of energy's Prophet model).
+- :mod:`llmp_direction` — recipe wiring
+  :class:`~aieng.forecasting.methods.CategoricalProbabilityLLMPredictor` with
+  a BoC-specific prompt context block for the primary 3-way direction task.
+- :mod:`llmp_binary` — the binary counterpart for the compact rate-cut
+  reference, wiring
+  :class:`~aieng.forecasting.methods.BinaryProbabilityLLMPredictor`.
+"""
+
+from .llmp_binary import build_llmp_binary
+from .llmp_direction import build_llmp_direction
+from .logistic_baseline import BoCLogisticPredictor
+
+
+__all__ = ["BoCLogisticPredictor", "build_llmp_binary", "build_llmp_direction"]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__predictors__llmp_binary.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__predictors__llmp_binary.py.md
new file mode 100644
index 0000000..513fe04
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__predictors__llmp_binary.py.md
@@ -0,0 +1,110 @@
+# Source: implementations/boc_rate_decisions/predictors/llmp_binary.py
+
+kind: python
+
+```python
+"""BoC rate-cut recipe: binary-probability LLMP.
+
+This file is intentionally small and explicit so notebook readers can open it
+as a reference recipe. The reusable method lives in ``aieng.forecasting``;
+this module shows the BoC prompt framing and cache tag used by the
+experiment.
+
+The quantitative-only variant deliberately gives the LLM nothing beyond the
+0/1 meeting-outcome history and a short institutional context block — the
+same information set as :class:`HistoricalFrequencyPredictor` plus world
+knowledge absorbed in pre-training. The deferred report-grounded variant will
+inject BoC press-release and Monetary Policy Report excerpts through the same
+``user_prompt_suffix`` seam (see the use-case README).
+"""
+
+from __future__ import annotations
+
+from typing import Literal
+
+from aieng.forecasting.methods.llm_processes import (
+    BinaryProbabilityLLMPredictor,
+    BinaryProbabilityLLMPredictorConfig,
+)
+from aieng.forecasting.models import LITE_MODEL
+
+
+_ReasoningEffort = Literal["disable", "low", "medium", "high"]
+
+_DEFAULT_MODEL = LITE_MODEL
+# ``None`` = provider default. The Vector proxy now rejects 'disable'/'low' for
+# Gemini models (valid: minimal/medium/high); None sends no reasoning_effort.
+_DEFAULT_REASONING_EFFORT: _ReasoningEffort | None = None
+_RECIPE_FAMILY = "boc_cut_v1"
+
+_SERIES_DESCRIPTION = (
+    "Event series: Bank of Canada rate-cut indicator, one observation per fixed "
+    "announcement date (8 per year). 1 = the Bank lowered its target for the "
+    "overnight rate at that announcement; 0 = it held or raised.\n"
+    "The Bank of Canada sets policy to keep CPI inflation at the 2% midpoint of "
+    "its 1-3% control range. Decisions are announced at 09:45 ET on a published "
+    "schedule of eight fixed dates per year."
+)
+
+_USER_PROMPT_SUFFIX = (
+    "Notes for this question:\n"
+    "- Cuts are rare overall (roughly 1 meeting in 10 since 2009) but strongly "
+    "clustered: once an easing cycle starts, consecutive-meeting cuts are common.\n"
+    "- Holds are the most frequent outcome; long unchanged stretches are normal.\n"
+    "- Use the date of the question to reason about the macro environment you "
+    "know from your training data, but DO NOT use knowledge of what the Bank "
+    "actually decided on or after the resolution date."
+)
+
+
+def build_llmp_binary(
+    *,
+    model: str = _DEFAULT_MODEL,
+    reasoning_effort: _ReasoningEffort | None = _DEFAULT_REASONING_EFFORT,
+    max_tokens: int = 16384,
+    user_prompt_suffix: str | None = None,
+    variant_tag: str | None = None,
+) -> BinaryProbabilityLLMPredictor:
+    """Return the BoC rate-cut binary-probability LLMP recipe.
+
+    Parameters
+    ----------
+    model : str
+        Model identifier. Defaults to the lite model (``gemini-3.1-flash-lite-preview``).
+    reasoning_effort : str or None
+        Reasoning budget. ``None`` (the default) sends no ``reasoning_effort``,
+        leaving the provider default in place; heavy chain-of-thought is a
+        documented source of overconfidence in calibration-sensitive forecasting.
+    max_tokens : int, default=16384
+        Per-call output token budget (shared with thinking tokens on
+        reasoning models routed through the proxy).
+    user_prompt_suffix : str or None
+        Override the default notes block. The report-grounded variant will
+        pass BoC communication excerpts here.
+    variant_tag : str or None
+        Override the cache tag suffix. Defaults to a tag encoding the recipe
+        family and reasoning effort.
+
+    Notes
+    -----
+    **Look-ahead caveat for backtests:** the LLM has seen post-origin history
+    during pre-training, so backtest scores for this predictor carry an
+    unquantifiable memorisation advantage. The protected 2025-2026 eval
+    window (closer to / beyond training cutoffs) is the fairer comparison.
+    """
+    reasoning_tag = "rprovider" if reasoning_effort is None else f"r{reasoning_effort}"
+    resolved_variant_tag = variant_tag or f"{_RECIPE_FAMILY}_{reasoning_tag}"
+
+    config = BinaryProbabilityLLMPredictorConfig(
+        model=model,
+        reasoning_effort=reasoning_effort,
+        max_tokens=max_tokens,
+        series_description=_SERIES_DESCRIPTION,
+        user_prompt_suffix=user_prompt_suffix if user_prompt_suffix is not None else _USER_PROMPT_SUFFIX,
+        variant_tag=resolved_variant_tag,
+    )
+    return BinaryProbabilityLLMPredictor(config)
+
+
+__all__ = ["build_llmp_binary"]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__predictors__llmp_direction.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__predictors__llmp_direction.py.md
new file mode 100644
index 0000000..4d1d73c
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__predictors__llmp_direction.py.md
@@ -0,0 +1,114 @@
+# Source: implementations/boc_rate_decisions/predictors/llmp_direction.py
+
+kind: python
+
+```python
+"""BoC rate-direction recipe: categorical-probability LLMP.
+
+This file is intentionally small and explicit so notebook readers can open it
+as a reference recipe. The reusable method lives in ``aieng.forecasting``;
+this module shows the BoC prompt framing and cache tag used by the
+experiment.
+
+The quantitative-only variant deliberately gives the LLM nothing beyond the
+per-meeting cut/hold/hike history and a short institutional context block —
+the same information set as :class:`CategoricalFrequencyPredictor` plus world
+knowledge absorbed in pre-training. The deferred report-grounded variant will
+inject BoC press-release and Monetary Policy Report excerpts through the same
+``user_prompt_suffix`` seam (see the use-case README).
+"""
+
+from __future__ import annotations
+
+from typing import Literal
+
+from aieng.forecasting.methods.llm_processes import (
+    CategoricalProbabilityLLMPredictor,
+    CategoricalProbabilityLLMPredictorConfig,
+)
+from aieng.forecasting.models import LITE_MODEL
+
+
+_ReasoningEffort = Literal["disable", "low", "medium", "high"]
+
+_DEFAULT_MODEL = LITE_MODEL
+# ``None`` = provider default. The Vector proxy now rejects 'disable'/'low' for
+# Gemini models (valid: minimal/medium/high); None sends no reasoning_effort.
+_DEFAULT_REASONING_EFFORT: _ReasoningEffort | None = None
+_RECIPE_FAMILY = "boc_direction_v1"
+
+_SERIES_DESCRIPTION = (
+    "Outcome series: Bank of Canada rate-decision direction, one observation per "
+    "fixed announcement date (8 per year). 'cut' = the Bank lowered its target "
+    "for the overnight rate at that announcement, 'hold' = it left the target "
+    "unchanged, 'hike' = it raised the target.\n"
+    "The Bank of Canada sets policy to keep CPI inflation at the 2% midpoint of "
+    "its 1-3% control range. Decisions are announced at 09:45 ET on a published "
+    "schedule of eight fixed dates per year."
+)
+
+_USER_PROMPT_SUFFIX = (
+    "Notes for this question:\n"
+    "- Holds are by far the most frequent outcome (roughly three meetings in "
+    "four); long unchanged stretches are normal.\n"
+    "- Cuts and hikes are individually rare but strongly clustered into easing "
+    "and tightening cycles: once a cycle starts, consecutive-meeting moves in "
+    "the same direction are common, and direct cut-to-hike reversals between "
+    "adjacent meetings essentially never happen.\n"
+    "- Use the date of the question to reason about the macro environment you "
+    "know from your training data, but DO NOT use knowledge of what the Bank "
+    "actually decided on or after the resolution date."
+)
+
+
+def build_llmp_direction(
+    *,
+    model: str = _DEFAULT_MODEL,
+    reasoning_effort: _ReasoningEffort | None = _DEFAULT_REASONING_EFFORT,
+    max_tokens: int = 16384,
+    user_prompt_suffix: str | None = None,
+    variant_tag: str | None = None,
+) -> CategoricalProbabilityLLMPredictor:
+    """Return the BoC rate-direction categorical-probability LLMP recipe.
+
+    Parameters
+    ----------
+    model : str
+        Model identifier. Defaults to the lite model (``gemini-3.1-flash-lite-preview``).
+    reasoning_effort : str or None
+        Reasoning budget. ``None`` (the default) sends no ``reasoning_effort``,
+        leaving the provider default in place; heavy chain-of-thought is a
+        documented source of overconfidence in calibration-sensitive forecasting.
+    max_tokens : int, default=16384
+        Per-call output token budget (shared with thinking tokens on
+        reasoning models routed through the proxy).
+    user_prompt_suffix : str or None
+        Override the default notes block. The report-grounded variant will
+        pass BoC communication excerpts here.
+    variant_tag : str or None
+        Override the cache tag suffix. Defaults to a tag encoding the recipe
+        family and reasoning effort.
+
+    Notes
+    -----
+    **Look-ahead caveat for backtests:** the LLM has seen post-origin history
+    during pre-training, so backtest scores for this predictor carry an
+    unquantifiable memorisation advantage. The protected 2025-2026 eval
+    window (closer to / beyond training cutoffs) is the fairer comparison.
+    """
+    reasoning_tag = "rprovider" if reasoning_effort is None else f"r{reasoning_effort}"
+    resolved_variant_tag = variant_tag or f"{_RECIPE_FAMILY}_{reasoning_tag}"
+
+    config = CategoricalProbabilityLLMPredictorConfig(
+        model=model,
+        reasoning_effort=reasoning_effort,
+        max_tokens=max_tokens,
+        series_description=_SERIES_DESCRIPTION,
+        user_prompt_suffix=user_prompt_suffix if user_prompt_suffix is not None else _USER_PROMPT_SUFFIX,
+        variant_tag=resolved_variant_tag,
+    )
+    return CategoricalProbabilityLLMPredictor(config)
+
+
+__all__ = ["build_llmp_direction"]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__predictors__logistic_baseline.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__predictors__logistic_baseline.py.md
new file mode 100644
index 0000000..36a0a4a
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__predictors__logistic_baseline.py.md
@@ -0,0 +1,341 @@
+# Source: implementations/boc_rate_decisions/predictors/logistic_baseline.py
+
+kind: python
+
+```python
+"""Logistic-regression conventional baseline for BoC rate-decision prediction.
+
+This baseline supports both BoC task framings:
+
+- binary ``P(rate cut)`` forecasts; and
+- ordered 3-way cut/hold/hike direction forecasts.
+
+Logistic regression is a good compact classical method for these discrete
+events with a handful of slow-moving macro drivers: it produces probabilities
+natively, is robust with ~100 training examples and heavy class imbalance, and
+its binary coefficients are directly interpretable in a notebook.
+
+Features (all computed leak-safely as of the forecast origin):
+
+- ``yield_spread``   — 2-year GoC yield minus the current target rate. The
+  bond market prices expected policy; a 2yr yield well below the policy rate
+  means the market expects cuts. Empirically the strongest single signal.
+- ``rate_momentum``  — change in the target rate over the trailing 90 days.
+  Cuts cluster in easing cycles; the best predictor of a cut is being in one.
+- ``inflation_gap``  — latest available CPI year-over-year inflation minus
+  the Bank's 2% target. Above-target inflation argues against cuts.
+- ``unemployment_momentum`` — 12-month change in the unemployment rate.
+  A deteriorating labour market argues for cuts.
+
+The model is re-fit *inside* ``predict()`` at every origin (like the Darts
+predictors): training examples are all past meetings whose outcomes are
+visible at the origin, with features reconstructed as of each past meeting's
+own origin date. This makes the backtest honest — early origins train on few
+examples, exactly as a real forecaster would have.
+
+Leakage discipline
+------------------
+``ForecastContext`` already enforces ``released_at <= as_of``. On top of
+that, this module applies explicit availability lags when slicing history by
+``timestamp`` (1 day for daily market series; one full month for monthly
+macro series), because the monthly adapters carry only approximate
+``released_at`` stamps. See :func:`build_feature_row`.
+"""
+
+from __future__ import annotations
+
+import math
+from datetime import datetime, timezone
+
+import numpy as np
+import pandas as pd
+from aieng.forecasting.data.context import ForecastContext
+from aieng.forecasting.evaluation.prediction import BinaryForecast, CategoricalForecast, Prediction
+from aieng.forecasting.evaluation.predictor import Predictor
+from aieng.forecasting.evaluation.task import ForecastingTask, TaskCategory
+
+from ..data import (
+    BOND_YIELD_2YR_SERIES_ID,
+    CPI_SERIES_ID,
+    DIRECTION_TASK_CATEGORIES,
+    TARGET_RATE_SERIES_ID,
+    UNEMPLOYMENT_SERIES_ID,
+)
+
+
+FEATURE_NAMES = ["yield_spread", "rate_momentum", "inflation_gap", "unemployment_momentum"]
+"""Feature columns produced by :func:`build_feature_row`, in order."""
+
+#: Daily market data prints with a 1-business-day lag; slicing by
+#: ``timestamp <= origin - 1d`` guarantees the row was actually public.
+_DAILY_AVAILABILITY_LAG_DAYS = 1
+
+#: Monthly macro series (CPI, unemployment) are published 3-6 weeks after the
+#: reference month and the adapters' ``released_at`` stamps are approximate,
+#: so the most recent reference month visible in the context is dropped.
+_MONTHLY_EXTRA_LAG_MONTHS = 1
+
+_RATE_MOMENTUM_WINDOW_DAYS = 90
+_UNEMPLOYMENT_MOMENTUM_MONTHS = 12
+
+
+def _last_value_before(df: pd.DataFrame, cutoff: pd.Timestamp) -> float | None:
+    """Return the last ``value`` with ``timestamp <= cutoff``, or ``None``."""
+    visible = df[df["timestamp"] <= cutoff]
+    if visible.empty:
+        return None
+    return float(visible["value"].iloc[-1])
+
+
+def build_feature_row(
+    origin: pd.Timestamp,
+    rate_df: pd.DataFrame,
+    yield_df: pd.DataFrame,
+    cpi_df: pd.DataFrame,
+    unemployment_df: pd.DataFrame,
+) -> dict[str, float] | None:
+    """Compute the macro feature vector available at ``origin``.
+
+    Only observations that were verifiably public at ``origin`` are used:
+    daily series are cut at ``origin - 1 day``; monthly series additionally
+    drop their most recent reference month (see module docstring).
+
+    Parameters
+    ----------
+    origin : pd.Timestamp
+        The forecast origin (the announcement date minus the task's forecast lead).
+    rate_df, yield_df, cpi_df, unemployment_df : pd.DataFrame
+        Canonical series frames (``timestamp``/``value``/``released_at``).
+        May contain rows after ``origin``; they are ignored.
+
+    Returns
+    -------
+    dict[str, float] or None
+        Mapping of :data:`FEATURE_NAMES` to values, or ``None`` if any input
+        lacks sufficient visible history at this origin.
+    """
+    daily_cutoff = origin - pd.Timedelta(days=_DAILY_AVAILABILITY_LAG_DAYS)
+
+    rate_now = _last_value_before(rate_df, daily_cutoff)
+    rate_then = _last_value_before(rate_df, daily_cutoff - pd.Timedelta(days=_RATE_MOMENTUM_WINDOW_DAYS))
+    yield_2yr = _last_value_before(yield_df, daily_cutoff)
+    if rate_now is None or rate_then is None or yield_2yr is None:
+        return None
+
+    # Monthly series: slice by timestamp, then drop the newest reference month.
+    cpi_visible = cpi_df[cpi_df["timestamp"] <= origin].iloc[: -_MONTHLY_EXTRA_LAG_MONTHS or None]
+    unemp_visible = unemployment_df[unemployment_df["timestamp"] <= origin].iloc[: -_MONTHLY_EXTRA_LAG_MONTHS or None]
+    # YoY inflation and 12-month unemployment momentum each need 13 reference months (latest plus a year back).
+    if len(cpi_visible) < 13 or len(unemp_visible) < _UNEMPLOYMENT_MOMENTUM_MONTHS + 1:
+        return None
+
+    cpi_now = float(cpi_visible["value"].iloc[-1])
+    cpi_year_ago = float(cpi_visible["value"].iloc[-13])
+    inflation_yoy = (cpi_now / cpi_year_ago - 1.0) * 100.0
+
+    unemp_now = float(unemp_visible["value"].iloc[-1])
+    unemp_year_ago = float(unemp_visible["value"].iloc[-(_UNEMPLOYMENT_MOMENTUM_MONTHS + 1)])
+
+    return {
+        "yield_spread": yield_2yr - rate_now,
+        "rate_momentum": rate_now - rate_then,
+        "inflation_gap": inflation_yoy - 2.0,
+        "unemployment_momentum": unemp_now - unemp_year_ago,
+    }
+
+
+class BoCLogisticPredictor(Predictor):
+    """Fit-at-origin logistic regression on leak-safe macro features.
+
+    Binary tasks emit :class:`BinaryForecast`; categorical cut/hold/hike tasks
+    emit :class:`CategoricalForecast` in the task-declared category order.
+
+    Parameters
+    ----------
+    regularization_c : float
+        Inverse regularization strength passed to scikit-learn's
+        ``LogisticRegression``. The default (1.0) is deliberately untuned —
+        this is a reference baseline, not a leaderboard entry.
+    min_training_examples : int
+        Minimum number of resolved past meetings (with computable features)
+        required to fit. Below this, the predictor falls back to the
+        historical base rate so early backtest origins still produce a
+        defensible forecast instead of an error.
+    """
+
+    def __init__(self, regularization_c: float = 1.0, min_training_examples: int = 16) -> None:
+        self._c = regularization_c
+        self._min_train = min_training_examples
+
+    @property
+    def predictor_id(self) -> str:
+        """Stable identifier for this predictor."""
+        return "boc_logistic_macro"
+
+    def predict(self, task: ForecastingTask, context: ForecastContext) -> list[Prediction]:
+        """Fit on past meetings visible at the origin and emit one forecast.
+
+        Raises
+        ------
+        ValueError
+            If the task payload is unsupported or requests more than one horizon.
+        """
+        if len(task.horizons) != 1:
+            raise ValueError(f"{type(self).__name__} supports exactly one horizon; got {task.horizons}.")
+
+        as_of = pd.Timestamp(context.as_of)
+        target_df = context.get_series(task.target_series_id)
+        rate_df = context.get_series(TARGET_RATE_SERIES_ID)
+        yield_df = context.get_series(BOND_YIELD_2YR_SERIES_ID)
+        cpi_df = context.get_series(CPI_SERIES_ID)
+        unemployment_df = context.get_series(UNEMPLOYMENT_SERIES_ID)
+
+        offset = pd.tseries.frequencies.to_offset(task.frequency)
+        lead = offset * task.horizons[0]
+        feature_rows, outcomes = self._build_training_data(target_df, rate_df, yield_df, cpi_df, unemployment_df, lead)
+        current_features = build_feature_row(as_of, rate_df, yield_df, cpi_df, unemployment_df)
+
+        if task.payload_type == "binary":
+            payload, model_info = self._predict_binary(feature_rows, outcomes, current_features)
+        elif task.payload_type == "categorical":
+            payload, model_info = self._predict_categorical(task, feature_rows, outcomes, current_features)
+        else:
+            raise ValueError(f"{type(self).__name__} does not support payload_type='{task.payload_type}'.")
+
+        forecast_date = (as_of + lead).to_pydatetime()
+        return [
+            Prediction(
+                predictor_id=self.predictor_id,
+                task_id=task.task_id,
+                issued_at=datetime.now(tz=timezone.utc).replace(tzinfo=None),
+                as_of=context.as_of,
+                forecast_date=forecast_date,
+                payload=payload,
+                metadata={"n_train": len(outcomes), **model_info},
+            )
+        ]
+
+    def _build_training_data(
+        self,
+        target_df: pd.DataFrame,
+        rate_df: pd.DataFrame,
+        yield_df: pd.DataFrame,
+        cpi_df: pd.DataFrame,
+        unemployment_df: pd.DataFrame,
+        lead: pd.DateOffset,
+    ) -> tuple[list[list[float]], list[float]]:
+        """Build leak-safe training examples from resolved past meetings.
+
+        Features for each past meeting are rebuilt at ``meeting - lead`` —
+        the same forecast lead the task predicts at — so the training and
+        prediction feature distributions match. A model trained on
+        eve-of-decision features (where the yield spread has converged on
+        the outcome) would be miscalibrated when asked to predict four
+        weeks out.
+        """
+        feature_rows: list[list[float]] = []
+        outcomes: list[float] = []
+        for meeting, outcome in zip(target_df["timestamp"], target_df["value"]):
+            past_origin = pd.Timestamp(meeting) - lead
+            features = build_feature_row(past_origin, rate_df, yield_df, cpi_df, unemployment_df)
+            if features is None:
+                continue
+            feature_rows.append([features[name] for name in FEATURE_NAMES])
+            outcomes.append(float(outcome))
+        return feature_rows, outcomes
+
+    def _predict_binary(
+        self,
+        feature_rows: list[list[float]],
+        outcomes: list[float],
+        current_features: dict[str, float] | None,
+    ) -> tuple[BinaryForecast, dict[str, object]]:
+        """Fit the binary model and return ``(payload, metadata)``.
+
+        Falls back to the training base rate when the design matrix is too
+        small, degenerate (single class), or current features are missing.
+        """
+        base_rate = float(np.mean(outcomes)) if outcomes else 0.1
+
+        degenerate = (
+            current_features is None or len(outcomes) < self._min_train or len(set(outcomes)) < 2  # noqa: PLR2004
+        )
+        if degenerate:
+            return BinaryForecast(probability=base_rate), {"model": "base_rate_fallback"}
+
+        from sklearn.linear_model import LogisticRegression  # noqa: PLC0415
+        from sklearn.pipeline import make_pipeline  # noqa: PLC0415
+        from sklearn.preprocessing import StandardScaler  # noqa: PLC0415
+
+        model = make_pipeline(StandardScaler(), LogisticRegression(C=self._c, max_iter=1000))
+        model.fit(np.asarray(feature_rows), np.asarray(outcomes))
+
+        x_now = np.asarray([[current_features[name] for name in FEATURE_NAMES]])
+        probability = float(model.predict_proba(x_now)[0, 1])
+
+        coefs = model.named_steps["logisticregression"].coef_[0]
+        return BinaryForecast(probability=probability), {
+            "model": "logistic_regression",
+            "features": dict(zip(FEATURE_NAMES, (float(f) for f in x_now[0]))),
+            "coefficients": dict(zip(FEATURE_NAMES, (float(c) for c in coefs))),
+        }
+
+    def _predict_categorical(
+        self,
+        task: ForecastingTask,
+        feature_rows: list[list[float]],
+        outcomes: list[float],
+        current_features: dict[str, float] | None,
+    ) -> tuple[CategoricalForecast, dict[str, object]]:
+        """Fit the multinomial model and return ``(payload, metadata)``."""
+        categories = task.categories if task.categories is not None else DIRECTION_TASK_CATEGORIES
+        degenerate = (
+            current_features is None or len(outcomes) < self._min_train or len(set(outcomes)) < 2  # noqa: PLR2004
+        )
+        if degenerate:
+            return CategoricalForecast(probabilities=self._class_frequency_probabilities(outcomes, categories)), {
+                "model": "class_frequency_fallback"
+            }
+
+        from sklearn.linear_model import LogisticRegression  # noqa: PLC0415
+        from sklearn.pipeline import make_pipeline  # noqa: PLC0415
+        from sklearn.preprocessing import StandardScaler  # noqa: PLC0415
+
+        model = make_pipeline(StandardScaler(), LogisticRegression(C=self._c, max_iter=1000))
+        model.fit(np.asarray(feature_rows), np.asarray(outcomes))
+
+        x_now = np.asarray([[current_features[name] for name in FEATURE_NAMES]])
+        row = model.predict_proba(x_now)[0]
+        probabilities = {category.label: 0.0 for category in categories}
+        for class_value, probability in zip(model.classes_, row):
+            category = self._category_for_value(float(class_value), categories)
+            probabilities[category.label] = float(probability)
+
+        return CategoricalForecast(probabilities=probabilities), {
+            "model": "multinomial_logistic_regression",
+            "features": dict(zip(FEATURE_NAMES, (float(f) for f in x_now[0]))),
+        }
+
+    def _class_frequency_probabilities(self, outcomes: list[float], categories: list[TaskCategory]) -> dict[str, float]:
+        """Return empirical category frequencies over visible outcomes."""
+        if not outcomes:
+            probability = 1.0 / len(categories)
+            return {category.label: probability for category in categories}
+
+        counts = {category.label: 0 for category in categories}
+        for outcome in outcomes:
+            category = self._category_for_value(outcome, categories)
+            counts[category.label] += 1
+        n = len(outcomes)
+        return {category.label: counts[category.label] / n for category in categories}
+
+    def _category_for_value(self, value: float, categories: list[TaskCategory]) -> TaskCategory:
+        """Find the task category matching an observed class value."""
+        for category in categories:
+            if math.isclose(value, category.value):
+                return category
+        raise ValueError(f"Observed class value {value} is not declared in task.categories.")
+
+
+__all__ = ["FEATURE_NAMES", "BoCLogisticPredictor", "build_feature_row"]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__press_releases.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__press_releases.py.md
new file mode 100644
index 0000000..2cb8b76
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__press_releases.py.md
@@ -0,0 +1,285 @@
+# Source: implementations/boc_rate_decisions/press_releases.py
+
+kind: python
+
+```python
+"""Bank of Canada rate-announcement press-release ingestion (use-case glue).
+
+This is the BoC counterpart to ``food_price_forecasting.reports``: it turns the
+Bank's Fixed-Announcement-Date (FAD) press releases into the source-agnostic
+:class:`~aieng.forecasting.documents.ExtractedDocument` artifacts defined in
+:mod:`aieng.forecasting.documents`, and provides a small cutoff-aware
+:class:`PressReleaseStore` over the cached artifacts.
+
+Two differences from the CFPR report path:
+
+- **HTML, not PDF.** FAD press releases are web pages, so we add a lightweight
+  ``bs4`` extractor (:func:`extract_press_release_html`) rather than reusing the
+  PDF-only :func:`aieng.forecasting.documents.extract_document`. The output is
+  the *same* :class:`ExtractedDocument` shape, so the cached artifacts are
+  uniform across sources.
+- **No manifest file.** Press-release URLs are deterministic from the
+  announcement date, so :func:`press_release_entries` derives them directly from
+  the committed ``meeting_schedule.yaml`` (via
+  :func:`boc_rate_decisions.data.load_meeting_schedule`).
+
+The realised cut/hold/hike decision is **not** parsed from the release text —
+downstream consumers take it from the direction series. The release text is the
+Bank's *rationale*, which is what the reasoning-alignment evaluator compares
+against.
+"""
+
+from __future__ import annotations
+
+import json
+import re
+from datetime import date, datetime, timezone
+from pathlib import Path
+
+import pandas as pd
+from aieng.forecasting.documents.models import DocumentMeta, ExtractedDocument, estimate_tokens
+from boc_rate_decisions.data import load_meeting_schedule
+from pydantic import BaseModel
+
+
+BOC_PRESS_RELEASE_SOURCE = "boc_press_releases"
+"""Source key for FAD press releases (distinct from the BoC MPR PDF source)."""
+
+DEFAULT_PRESS_RELEASE_CACHE_DIR = Path("data/reports/boc_press_releases")
+"""Default (gitignored) cache directory for extracted press-release artifacts."""
+
+_URL_TEMPLATE = "https://www.bankofcanada.ca/{year:04d}/{month:02d}/fad-press-release-{iso}/"
+
+
+def press_release_url(announcement_date: date | pd.Timestamp | str) -> str:
+    """Return the canonical BoC FAD press-release URL for an announcement date."""
+    ts = pd.Timestamp(announcement_date)
+    return _URL_TEMPLATE.format(year=ts.year, month=ts.month, iso=ts.strftime("%Y-%m-%d"))
+
+
+def _doc_id(announcement_date: date | pd.Timestamp | str) -> str:
+    """Stable per-release id / cache filename stem, e.g. ``"2024-06-05_en"``."""
+    return f"{pd.Timestamp(announcement_date).strftime('%Y-%m-%d')}_en"
+
+
+class PressReleaseEntry(BaseModel):
+    """One press release: cutoff metadata plus where to fetch it.
+
+    Parameters
+    ----------
+    meta : DocumentMeta
+        Source-agnostic provenance/cutoff metadata (``publication_date`` is the
+        announcement date).
+    url : str
+        Canonical BoC press-release URL.
+    """
+
+    meta: DocumentMeta
+    url: str
+
+    @property
+    def key(self) -> str:
+        """Stable per-release key (mirrors ``meta.doc_id``), e.g. ``"2024-06-05_en"``."""
+        return self.meta.doc_id
+
+    def artifact_paths(self, cache_dir: Path = DEFAULT_PRESS_RELEASE_CACHE_DIR) -> tuple[Path, Path]:
+        """Return ``(text_md_path, meta_json_path)`` for this release's artifacts."""
+        return cache_dir / f"{self.key}.md", cache_dir / f"{self.key}.json"
+
+
+def press_release_entries(schedule_path: Path | None = None) -> list[PressReleaseEntry]:
+    """Derive one :class:`PressReleaseEntry` per scheduled announcement date.
+
+    URLs are deterministic from the date, so no manifest file is needed — the
+    committed ``meeting_schedule.yaml`` is the single source of dates.
+
+    Parameters
+    ----------
+    schedule_path : Path or None
+        Optional override forwarded to
+        :func:`~boc_rate_decisions.data.load_meeting_schedule`.
+
+    Returns
+    -------
+    list[PressReleaseEntry]
+        One entry per announcement date, in chronological order.
+    """
+    entries: list[PressReleaseEntry] = []
+    for ts in load_meeting_schedule(schedule_path):
+        announcement_date = ts.date()
+        meta = DocumentMeta(
+            source=BOC_PRESS_RELEASE_SOURCE,
+            doc_id=_doc_id(ts),
+            publication_date=announcement_date,
+            title=f"Bank of Canada rate announcement {announcement_date.isoformat()}",
+            lang="en",
+        )
+        entries.append(PressReleaseEntry(meta=meta, url=press_release_url(ts)))
+    return entries
+
+
+# Line-anchored markers for the start of page furniture that follows the rate
+# decision on a modern BoC announcement page. The decision rationale is the
+# page's primary content and always ends at "The next scheduled date for
+# announcing the overnight rate target is ..."; everything at or after the
+# earliest of these markers is taxonomy, footnotes, bundled same-day operational
+# notices, or related-content link teasers — none of it the published rationale.
+#
+#   - "Content Type(s)" : the taxonomy footer, present on all ~139 cached pages
+#     (older 2009-2020 pages have only this); related-content teasers follow it.
+#   - "Footnotes"        : precedes "Content Type(s)" only when a release bundles
+#     extra same-day content (e.g. 2025-01-29, which appends operational notices
+#     after its footnotes). Rare (1/139) but cuts the bundled residual cleanly.
+#
+# Matching at line start keeps these from ever firing inside rationale prose.
+# (No trailing \b: "Content Type(s)" ends in ")" and is followed by a newline, neither
+# a word character, so \b would never match there; the marker sits alone on its line.)
+_FOOTER_MARKERS = re.compile(r"(?m)^(?:Footnotes|Content Type\(s\))")
+
+
+def _trim_page_furniture(text: str) -> str:
+    """Drop trailing page furniture after the rate-decision rationale.
+
+    Modern BoC announcement pages render the decision inside ``<main>`` together
+    with footer content (a content-type taxonomy line, footnotes, occasionally
+    bundled same-day operational notices, and related-content link teasers).
+    The naive ``<main>`` text grab captures all of it; this truncates at the
+    earliest known footer boundary so the cached artifact is just the rationale.
+
+    Older pages (2009-2020) only carry the taxonomy line, so this trims a few
+    trailing characters; it is a no-op when no marker is present.
+    """
+    match = _FOOTER_MARKERS.search(text)
+    if match is None:
+        return text
+    return text[: match.start()].rstrip()
+
+
+def extract_press_release_html(html: str, meta: DocumentMeta) -> ExtractedDocument:
+    """Extract readable body text from a BoC press-release HTML page.
+
+    Strips boilerplate (scripts, nav, header/footer) and returns the main
+    article text in the same :class:`ExtractedDocument` shape the PDF extractor
+    produces, so cached artifacts are uniform across document sources.
+
+    Parameters
+    ----------
+    html : str
+        Raw HTML of the press-release page.
+    meta : DocumentMeta
+        Provenance/cutoff metadata carried through to the result.
+
+    Returns
+    -------
+    ExtractedDocument
+        Full body text plus character/token counts (``page_count=1`` for HTML).
+    """
+    from bs4 import BeautifulSoup  # noqa: PLC0415 - optional/use-case dependency, imported lazily
+
+    soup = BeautifulSoup(html, "html.parser")
+    for tag in soup(["script", "style", "noscript", "nav", "header", "footer", "aside", "form"]):
+        tag.decompose()
+
+    container = soup.find("main") or soup.find("article") or soup.body or soup
+    raw_text = container.get_text(separator="\n")
+    # Drop social-share boilerplate that lives inside the article container.
+    lines = [line.strip() for line in raw_text.splitlines()]
+    kept = [line for line in lines if line and not line.lower().startswith("share this page")]
+    text = re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip()
+    text = _trim_page_furniture(text)
+
+    n_chars = len(text)
+    return ExtractedDocument(
+        meta=meta,
+        text=text,
+        page_count=1,
+        n_chars=n_chars,
+        est_tokens=estimate_tokens(n_chars),
+        extracted_at=datetime.now(tz=timezone.utc).replace(tzinfo=None),
+    )
+
+
+def write_artifact(doc: ExtractedDocument, cache_dir: Path = DEFAULT_PRESS_RELEASE_CACHE_DIR) -> tuple[Path, Path]:
+    """Write a ``<doc_id>.md`` + ``<doc_id>.json`` artifact pair (CFPR-compatible).
+
+    Mirrors ``scripts/extract_reports.py``: the full text lives in the ``.md``
+    and the JSON carries the :class:`ExtractedDocument` metadata plus a
+    ``text_path`` pointer (text not duplicated).
+    """
+    cache_dir.mkdir(parents=True, exist_ok=True)
+    md_path = cache_dir / f"{doc.meta.doc_id}.md"
+    json_path = cache_dir / f"{doc.meta.doc_id}.json"
+    md_path.write_text(doc.text, encoding="utf-8")
+    record = doc.model_dump(mode="json", exclude={"text"})
+    record["text_path"] = str(md_path)
+    json_path.write_text(json.dumps(record, indent=2), encoding="utf-8")
+    return md_path, json_path
+
+
+def _load_artifact(json_path: Path) -> ExtractedDocument:
+    """Reconstruct an :class:`ExtractedDocument` from a cached ``.json`` (+ ``.md``)."""
+    record = json.loads(json_path.read_text(encoding="utf-8"))
+    text_path = record.pop("text_path", None)
+    if text_path and Path(text_path).exists():
+        record["text"] = Path(text_path).read_text(encoding="utf-8")
+    else:  # fall back to a sibling .md, then to any inline text
+        sibling_md = json_path.with_suffix(".md")
+        record["text"] = sibling_md.read_text(encoding="utf-8") if sibling_md.exists() else record.get("text", "")
+    return ExtractedDocument.model_validate(record)
+
+
+class PressReleaseStore:
+    """Cutoff-aware, in-memory store over cached press-release artifacts.
+
+    Filtering mirrors :class:`~aieng.forecasting.data.cutoff.CutoffEnforcer`: a
+    release is visible at ``as_of`` only when its ``publication_date`` is on or
+    before ``as_of``. (For the side-channel evaluator we pass a present-day
+    ``as_of`` so every past release is visible; for prompt-time integration the
+    forecast origin is passed, which keeps the target meeting's release hidden.)
+    """
+
+    def __init__(self, documents: list[ExtractedDocument]) -> None:
+        self._docs = sorted(documents, key=lambda doc: doc.meta.publication_date)
+
+    @classmethod
+    def from_cache(cls, cache_dir: Path = DEFAULT_PRESS_RELEASE_CACHE_DIR) -> PressReleaseStore:
+        """Load every ``<doc_id>.json`` artifact under ``cache_dir`` (non-recursive)."""
+        cache_dir = Path(cache_dir)
+        docs = [_load_artifact(path) for path in sorted(cache_dir.glob("*.json"))]
+        return cls(docs)
+
+    def __len__(self) -> int:
+        return len(self._docs)
+
+    def available(self, as_of: date | pd.Timestamp | str) -> list[ExtractedDocument]:
+        """Releases published on or before ``as_of`` (chronological order)."""
+        cutoff = pd.Timestamp(as_of)
+        return [doc for doc in self._docs if pd.Timestamp(doc.meta.publication_date) <= cutoff]
+
+    def for_meeting(self, meeting_date: date | pd.Timestamp | str) -> ExtractedDocument | None:
+        """Return the release published on ``meeting_date``, or ``None``."""
+        target = pd.Timestamp(meeting_date).date()
+        return next((doc for doc in self._docs if doc.meta.publication_date == target), None)
+
+    def latest_before(self, as_of: date | pd.Timestamp | str) -> ExtractedDocument | None:
+        """Return the most recent release published on or before ``as_of``, or ``None``."""
+        available = self.available(as_of)
+        return available[-1] if available else None
+
+    def format_for_prompt(self, doc: ExtractedDocument, *, max_chars: int | None = None) -> str:
+        """Render one release as a prompt-ready block (the seam for LLMP/agent use)."""
+        text = doc.text if max_chars is None else doc.text[:max_chars]
+        return f"Bank of Canada press release ({doc.meta.publication_date.isoformat()}):\n{text}"
+
+
+__all__ = [
+    "BOC_PRESS_RELEASE_SOURCE",
+    "DEFAULT_PRESS_RELEASE_CACHE_DIR",
+    "PressReleaseEntry",
+    "PressReleaseStore",
+    "extract_press_release_html",
+    "press_release_entries",
+    "press_release_url",
+    "write_artifact",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__rationale_eval.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__rationale_eval.py.md
new file mode 100644
index 0000000..9703d24
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__rationale_eval.py.md
@@ -0,0 +1,459 @@
+# Source: implementations/boc_rate_decisions/rationale_eval.py
+
+kind: python
+
+```python
+"""LLM-as-a-judge rationale-alignment evaluator for BoC forecasts (trace-driven).
+
+A **side-channel** evaluator (deliberately *not* part of the resolution loop). It
+treats the **Langfuse trace** as the canonical record of what the forecaster said:
+for each trace it reads the structured forecast the predictor stamped on at run
+time (its stated ``rationale``, cited signals, and predicted distribution; see
+:func:`aieng.forecasting.evaluation.langfuse_traces.stamp_forecast_on_trace`),
+compares that rationale to the Bank of Canada's *own* published rationale (the FAD
+press release for the resolved meeting), and pushes a structured alignment verdict
+back to the same trace as Langfuse **scores**.
+
+It complements — does not replace — the proper accuracy score (RPS/Brier, still
+computed deterministically by the resolution engine). The judge assesses
+*alignment only*; correctness is taken from the realised outcome (read from the
+direction series, never from a trace), and the two are combined into a "right for
+the right reasons" label.
+
+Evaluation reads from and writes to Langfuse, not a local prediction cache: it
+fetches each trace (with readiness polling, since ingestion is async), judges off
+the trace's stamped forecast, and attaches ``rationale_alignment`` (numeric) and
+``right_for_right_reasons`` (categorical) scores. The returned DataFrame is a
+convenience *view* for in-notebook display, not the canonical store.
+
+The judge reuses the LLM-process call seam
+(:mod:`aieng.forecasting.methods.llm_processes._client`) so proxy routing, retries,
+and strict-schema enforcement are shared with the forecasters.
+"""
+
+from __future__ import annotations
+
+import os
+from typing import Any, Callable, Sequence
+
+import pandas as pd
+from aieng.forecasting.evaluation.backtest import BacktestResult
+from aieng.forecasting.evaluation.eval import EvalResult
+from aieng.forecasting.evaluation.langfuse_traces import (
+    fetch_trace_with_wait,
+    flush_scores,
+    push_trace_score,
+    read_forecasts_from_trace,
+)
+from aieng.forecasting.methods.llm_processes._client import (
+    make_json_schema_response_format,
+    run_async,
+    sample_n_async,
+)
+from aieng.forecasting.models import ADVANCED_MODEL
+from pydantic import BaseModel, Field
+
+
+class AlignmentVerdict(BaseModel):
+    """The judge's structured assessment of one prediction's rationale.
+
+    Attributes
+    ----------
+    alignment_score : float
+        ``0`` = the rationale's drivers are unrelated to (or contradict) the
+        Bank's stated reasoning; ``1`` = it cites the same key drivers and points
+        the same direction. Assesses *reasoning alignment only*, independent of
+        whether the forecast was numerically correct.
+    key_signal_overlap : list[str]
+        Which of the forecaster's cited signals/drivers actually appear in the
+        Bank's press release.
+    justification : str
+        A short (2-3 sentence) explanation of the score.
+    """
+
+    alignment_score: float = Field(ge=0.0, le=1.0)
+    key_signal_overlap: list[str] = Field(default_factory=list)
+    justification: str = ""
+
+
+_ALIGNMENT_JSON_SCHEMA: dict[str, Any] = {
+    "type": "object",
+    "properties": {
+        "alignment_score": {"type": "number", "minimum": 0.0, "maximum": 1.0},
+        "key_signal_overlap": {"type": "array", "items": {"type": "string"}},
+        "justification": {"type": "string"},
+    },
+    "required": ["alignment_score", "key_signal_overlap", "justification"],
+    "additionalProperties": False,
+}
+
+_JUDGE_SYSTEM_PROMPT = (
+    "You are an expert evaluator of monetary-policy forecasts. You are given a "
+    "forecaster's stated rationale for a Bank of Canada rate-decision forecast and "
+    "the Bank's OWN published press release for that decision. Judge how well the "
+    "forecaster's reasoning aligns with the Bank's stated reasoning.\n"
+    "\n"
+    "Rules:\n"
+    "- Return ONLY a JSON object matching the provided schema. No prose, no markdown.\n"
+    "- 'alignment_score' in [0, 1] rates REASONING alignment, NOT forecast accuracy: "
+    "1.0 = the rationale emphasises the same key drivers (inflation, labour market, "
+    "growth, financial conditions, forward guidance) and points the same way as the "
+    "Bank; 0.0 = unrelated or contradictory drivers. A forecaster can be numerically "
+    "wrong but well-aligned, or right for the wrong reasons.\n"
+    "- 'key_signal_overlap' lists the forecaster's cited signals that genuinely appear "
+    "in the Bank's release.\n"
+    "- 'justification' is 2-3 sentences citing specifics from the release."
+)
+
+
+def _build_judge_user_prompt(
+    *,
+    task_description: str,
+    predicted_probabilities: dict[str, float],
+    rationale: str,
+    key_signals: list[str],
+    realized_label: str | None,
+    press_release_text: str,
+) -> str:
+    """Assemble the judge's user message."""
+    dist = ", ".join(f"{label} {prob:.2f}" for label, prob in predicted_probabilities.items())
+    signals = "; ".join(key_signals) if key_signals else "(none provided)"
+    realized = realized_label or "(unresolved)"
+    return (
+        f"Forecasting task: {task_description}\n"
+        "\n"
+        f"Forecaster's predicted distribution: {dist}\n"
+        f"Realised decision: {realized}\n"
+        "\n"
+        "Forecaster's stated rationale:\n"
+        f"{rationale}\n"
+        "\n"
+        f"Forecaster's cited signals: {signals}\n"
+        "\n"
+        "Bank of Canada press release for this decision:\n"
+        f"{press_release_text}\n"
+        "\n"
+        "Return the JSON alignment verdict."
+    )
+
+
+def judge_rationale_alignment(
+    *,
+    task_description: str,
+    predicted_probabilities: dict[str, float],
+    rationale: str,
+    key_signals: list[str],
+    realized_label: str | None,
+    press_release_text: str,
+    model: str = ADVANCED_MODEL,
+    reasoning_effort: str | None = None,  # provider default; proxy rejects 'disable'/'low' for Gemini
+    temperature: float = 0.3,
+    max_tokens: int = 4096,
+    timeout_s: float = 120.0,
+) -> AlignmentVerdict:
+    """Run one LLM-as-judge call assessing rationale-vs-release alignment.
+
+    Reuses the shared LLM-process completion seam (proxy routing + strict-schema
+    enforcement). Uses the advanced model by default — judging benefits from
+    capability and is not calibration-sensitive.
+
+    Returns
+    -------
+    AlignmentVerdict
+
+    Raises
+    ------
+    RuntimeError
+        If the judge returns no schema-valid verdict.
+    """
+    base_messages = [
+        {"role": "system", "content": _JUDGE_SYSTEM_PROMPT},
+        {
+            "role": "user",
+            "content": _build_judge_user_prompt(
+                task_description=task_description,
+                predicted_probabilities=predicted_probabilities,
+                rationale=rationale,
+                key_signals=key_signals,
+                realized_label=realized_label,
+                press_release_text=press_release_text,
+            ),
+        },
+    ]
+    response_format = make_json_schema_response_format("RationaleAlignment", _ALIGNMENT_JSON_SCHEMA)
+    parsed, _cost, _in, _out, _fails = run_async(
+        sample_n_async(
+            schema_cls=AlignmentVerdict,
+            model=model,
+            base_messages=base_messages,
+            response_format=response_format,
+            n_samples=1,
+            temperature=temperature,
+            max_tokens=max_tokens,
+            timeout_s=timeout_s,
+            reasoning_effort=reasoning_effort,
+            api_base=os.getenv("OPENAI_BASE_URL"),
+            api_key=os.getenv("OPENAI_API_KEY"),
+        ),
+    )
+    if not parsed:
+        raise RuntimeError("Rationale-alignment judge returned no schema-valid verdict.")
+    return parsed[0]
+
+
+def _right_for_right_reasons(*, predicted_correct: bool, aligned: bool) -> str:
+    """Combine outcome correctness with reasoning alignment into one label."""
+    correctness = "correct" if predicted_correct else "incorrect"
+    alignment = "aligned" if aligned else "misaligned"
+    return f"{correctness}_{alignment}"
+
+
+def _task_from_result(result: BacktestResult | EvalResult) -> Any:
+    """Return the forecasting task attached to a backtest or eval result."""
+    return result.spec.task if isinstance(result, BacktestResult) else result.eval_spec.task
+
+
+def resolve_trace_url(trace_id: str, *, client: Any | None = None) -> str | None:
+    """Return the Langfuse UI URL for ``trace_id``, or ``None`` if unavailable."""
+    try:
+        if client is None:
+            from langfuse import get_client  # noqa: PLC0415
+
+            client = get_client()
+        return client.get_trace_url(trace_id=trace_id)
+    except Exception:
+        return None
+
+
+def trace_ids_from_result(result: BacktestResult | EvalResult) -> list[str]:
+    """Collect the distinct Langfuse trace ids referenced by a result's predictions.
+
+    These are *pointers* into Langfuse (the canonical content is read from the
+    fetched trace, not from the cached prediction). Order is preserved.
+    """
+    seen: dict[str, None] = {}
+    for pred in result.predictions:
+        trace_id = (pred.metadata or {}).get("langfuse_trace_id")
+        if isinstance(trace_id, str) and trace_id:
+            seen.setdefault(trace_id, None)
+    return list(seen)
+
+
+def _push_alignment_scores(
+    trace_id: str,
+    *,
+    alignment_score: float,
+    right_for_right_reasons: str,
+    justification: str,
+    predictor_id: str,
+    meeting_date: str,
+    client: Any | None = None,
+) -> bool:
+    """Push the numeric alignment + categorical right-for-right-reasons scores.
+
+    Returns ``True`` only if the numeric ``rationale_alignment`` score (the headline
+    result) was pushed; the categorical push's return value is ignored. Guarded no-op without Langfuse.
+    """
+    shared_metadata = {"predictor_id": predictor_id, "meeting_date": meeting_date}
+    pushed = push_trace_score(
+        trace_id,
+        "rationale_alignment",
+        float(alignment_score),
+        client=client,
+        comment=justification,
+        metadata={**shared_metadata, "right_for_right_reasons": right_for_right_reasons},
+    )
+    push_trace_score(
+        trace_id,
+        "right_for_right_reasons",
+        right_for_right_reasons,
+        client=client,
+        comment=justification,
+        metadata=shared_metadata,
+    )
+    return pushed
+
+
+def evaluate_trace_alignment(
+    trace_ids: Sequence[str],
+    *,
+    task: Any,
+    store: Any,
+    event_df: pd.DataFrame,
+    push_to_langfuse: bool = True,
+    alignment_threshold: float = 0.5,
+    model: str = ADVANCED_MODEL,
+    judge: Callable[..., AlignmentVerdict] = judge_rationale_alignment,
+    client: Any | None = None,
+    fetch: Callable[..., Any] = fetch_trace_with_wait,
+) -> pd.DataFrame:
+    """Score rationale alignment for each Langfuse trace, reading from the trace.
+
+    For every ``trace_id`` it fetches the trace (with readiness polling), reads the
+    structured forecast(s) the predictor stamped on (rationale, cited signals,
+    predicted distribution, forecast date), and — for each forecast whose meeting
+    has a press release — calls ``judge`` and, by default, pushes the verdict back
+    as Langfuse scores. The realised label comes from ``event_df`` (the direction
+    series); the press release from ``store`` — neither lives on a trace.
+
+    Traces that never become ready (ingestion still in flight) are **skipped**, not
+    failed.
+
+    Parameters
+    ----------
+    trace_ids : sequence of str
+        Langfuse trace ids to evaluate (e.g. from :func:`trace_ids_from_result` or
+        :func:`aieng.forecasting.evaluation.langfuse_traces.list_trace_ids`).
+    task : ForecastingTask
+        The categorical task (supplies ``description`` and the value→label map).
+    store : PressReleaseStore
+        Cutoff-aware press-release store (see
+        :class:`boc_rate_decisions.press_releases.PressReleaseStore`).
+    event_df : pd.DataFrame
+        Observed direction series (``timestamp`` / ``value``); supplies the
+        realised cut/hold/hike label per meeting.
+    push_to_langfuse : bool
+        Push ``rationale_alignment`` (numeric) and ``right_for_right_reasons``
+        (categorical) scores back to each trace. Default True — Langfuse is the
+        canonical sink. Guarded no-op without Langfuse.
+    alignment_threshold : float
+        ``alignment_score >= threshold`` counts as "aligned" for the combined
+        ``right_for_right_reasons`` label.
+    model : str
+        Judge model (defaults to the advanced tier).
+    judge : callable
+        Injection seam for testing; defaults to :func:`judge_rationale_alignment`.
+    client : Langfuse client, optional
+        Injection seam for testing; defaults to the process-wide client.
+    fetch : callable
+        Injection seam for the trace fetch; defaults to
+        :func:`~aieng.forecasting.evaluation.langfuse_traces.fetch_trace_with_wait`.
+
+    Returns
+    -------
+    pd.DataFrame
+        One row per scored forecast: ``predictor_id``, ``meeting_date``,
+        ``predicted_label``, ``realized_label``, ``predicted_correct``,
+        ``alignment_score``, ``aligned``, ``right_for_right_reasons``,
+        ``key_signal_overlap``, ``justification``, ``langfuse_trace_id``,
+        ``langfuse_trace_url`` (clickable, when resolvable), and ``langfuse_scored``.
+    """
+    value_to_label = {category.value: category.label for category in task.categories or []}
+    outcome_by_date = {
+        pd.Timestamp(ts).normalize(): float(v) for ts, v in zip(event_df["timestamp"], event_df["value"], strict=True)
+    }
+
+    rows: list[dict[str, object]] = []
+    pushed_any = False
+    for trace_id in trace_ids:
+        trace = fetch(trace_id, client=client)
+        if trace is None:  # never became ready — skip, don't fail
+            continue
+        for forecast in read_forecasts_from_trace(trace):
+            rationale = str(forecast.get("rationale", "") or "").strip()
+            probabilities = {str(k): float(v) for k, v in (forecast.get("probabilities") or {}).items()}
+            if not rationale or not probabilities:
+                continue
+            meeting_date = pd.Timestamp(forecast["forecast_date"]).normalize()
+            release = store.for_meeting(meeting_date)
+            if release is None:
+                continue
+
+            predictor_id = str(forecast.get("predictor_id", "") or "")
+            predicted_label = max(probabilities, key=probabilities.get)
+            outcome_value = outcome_by_date.get(meeting_date)
+            realized_label = value_to_label.get(outcome_value) if outcome_value is not None else None
+
+            verdict = judge(
+                task_description=task.description,
+                predicted_probabilities=probabilities,
+                rationale=rationale,
+                key_signals=list(forecast.get("key_signals", []) or []),
+                realized_label=realized_label,
+                press_release_text=release.text,
+                model=model,
+            )
+
+            predicted_correct = realized_label is not None and predicted_label == realized_label
+            aligned = verdict.alignment_score >= alignment_threshold
+            rfrr = _right_for_right_reasons(predicted_correct=predicted_correct, aligned=aligned)
+
+            langfuse_scored = False
+            if push_to_langfuse:
+                langfuse_scored = _push_alignment_scores(
+                    trace_id,
+                    alignment_score=verdict.alignment_score,
+                    right_for_right_reasons=rfrr,
+                    justification=verdict.justification,
+                    predictor_id=predictor_id,
+                    meeting_date=meeting_date.date().isoformat(),
+                    client=client,
+                )
+                pushed_any = pushed_any or langfuse_scored
+
+            rows.append(
+                {
+                    "predictor_id": predictor_id,
+                    "meeting_date": meeting_date,
+                    "predicted_label": predicted_label,
+                    "realized_label": realized_label,
+                    "predicted_correct": predicted_correct,
+                    "alignment_score": verdict.alignment_score,
+                    "aligned": aligned,
+                    "right_for_right_reasons": rfrr,
+                    "key_signal_overlap": verdict.key_signal_overlap,
+                    "justification": verdict.justification,
+                    "langfuse_trace_id": trace_id,
+                    "langfuse_trace_url": resolve_trace_url(trace_id, client=client),
+                    "langfuse_scored": langfuse_scored,
+                }
+            )
+
+    if pushed_any:
+        flush_scores(client)
+    return pd.DataFrame(rows)
+
+
+def evaluate_result_alignment(
+    result: BacktestResult | EvalResult,
+    store: Any,
+    event_df: pd.DataFrame,
+    *,
+    push_to_langfuse: bool = True,
+    alignment_threshold: float = 0.5,
+    model: str = ADVANCED_MODEL,
+    judge: Callable[..., AlignmentVerdict] = judge_rationale_alignment,
+    client: Any | None = None,
+    fetch: Callable[..., Any] = fetch_trace_with_wait,
+) -> pd.DataFrame:
+    """Evaluate the Langfuse traces a result points to (convenience wrapper).
+
+    Extracts the trace ids referenced by ``result`` (via
+    :func:`trace_ids_from_result`) and delegates to :func:`evaluate_trace_alignment`,
+    which reads the rationale and distribution from each trace — *not* from the
+    cached prediction. Use :func:`evaluate_trace_alignment` directly when you have
+    discovered trace ids straight from Langfuse.
+    """
+    return evaluate_trace_alignment(
+        trace_ids_from_result(result),
+        task=_task_from_result(result),
+        store=store,
+        event_df=event_df,
+        push_to_langfuse=push_to_langfuse,
+        alignment_threshold=alignment_threshold,
+        model=model,
+        judge=judge,
+        client=client,
+        fetch=fetch,
+    )
+
+
+__all__ = [
+    "AlignmentVerdict",
+    "evaluate_result_alignment",
+    "evaluate_trace_alignment",
+    "judge_rationale_alignment",
+    "resolve_trace_url",
+    "trace_ids_from_result",
+]
+```
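
Review note: a minimal sketch of how the per-forecast labels in `evaluate_trace_alignment` combine, assuming nothing beyond what the artifact above shows. The body of `_right_for_right_reasons` is not included in this snapshot, so the helper below and its four label strings are hypothetical, and `classify` is an illustrative name rather than part of the module.

```python
from __future__ import annotations


def right_for_right_reasons(*, predicted_correct: bool, aligned: bool) -> str:
    """Combine correctness and rationale alignment into one categorical label.

    The label strings are hypothetical; the module's private helper may differ.
    """
    if predicted_correct:
        return "right_for_right_reasons" if aligned else "right_for_wrong_reasons"
    return "wrong_despite_good_reasons" if aligned else "wrong_for_wrong_reasons"


def classify(
    probabilities: dict[str, float],
    realized_label: str | None,
    alignment_score: float,
    alignment_threshold: float = 0.5,
) -> tuple[str, str]:
    # Argmax over the predicted distribution, as in evaluate_trace_alignment.
    predicted_label = max(probabilities, key=probabilities.get)
    # An unresolved outcome (None) never counts as a correct prediction.
    predicted_correct = realized_label is not None and predicted_label == realized_label
    # The threshold comparison is inclusive: a score equal to it is "aligned".
    aligned = alignment_score >= alignment_threshold
    label = right_for_right_reasons(predicted_correct=predicted_correct, aligned=aligned)
    return predicted_label, label
```

Keeping the two booleans separate until the last step means a correct forecast with a poorly grounded rationale remains distinguishable from a well-reasoned miss.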
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__specs__boc_rate_cut_smoke.yaml.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__specs__boc_rate_cut_smoke.yaml.md
new file mode 100644
index 0000000..7be68f5
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__specs__boc_rate_cut_smoke.yaml.md
@@ -0,0 +1,44 @@
+# Source: implementations/boc_rate_decisions/specs/boc_rate_cut_smoke.yaml
+
+kind: yaml
+
+```yaml
+# BoC Rate Cut Spec — binary reference (cut vs no cut), 3 origins
+#
+# The compact binary (Brier-scored) reference for the notebook 02 warm-up:
+# the K=2 corner of the categorical machinery, kept deliberately small so it
+# demonstrates the binary payload + scoring format in the fewest moving parts.
+#
+# The three origins span both outcome classes: a hold (2024-04-10), the
+# first cut of the 2024 easing cycle (2024-06-05), and a mid-cycle cut
+# (2024-09-04) — enough to exercise scoring and plotting paths without
+# burning tokens on a long run.
+
+description: >-
+  Binary (cut vs no cut) reference backtest for the notebook 02 warm-up,
+  restricted to one hold and two cut meetings in 2024.
+
+task:
+  task_id: boc_rate_cut_next_meeting
+  target_series_id: boc_rate_cut_event
+  horizons: [1]
+  frequency: D
+  payload_type: binary
+  description: >-
+    Will the Bank of Canada LOWER its target for the overnight rate at the
+    fixed announcement date occurring one day after the forecast origin?
+    Outcome is 1 if the target rate decreases at that announcement (any cut
+    size), 0 otherwise (hold or hike). Announcements are at 09:45 ET; the
+    forecast must be issued with information available the day before.
+
+start: "2024-04-01"
+end: "2024-09-30"
+stride: 1
+warmup: 8
+
+# One origin per meeting: announcement_date - 1 day.
+origin_dates:
+  - "2024-04-09"  # meeting 2024-04-10 (hold)
+  - "2024-06-04"  # meeting 2024-06-05 (cut)
+  - "2024-09-03"  # meeting 2024-09-04 (cut)
+```
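
Review note: the spec above is Brier-scored, so a small sketch of that score over its three origins may help readers. The outcomes follow the spec comments (hold, cut, cut); the forecast probabilities are made up for illustration and are not results from any run. `brier_score` is a hypothetical helper, not the framework's scorer.

```python
def brier_score(probabilities: list[float], outcomes: list[int]) -> float:
    """Mean squared difference between P(cut) forecasts and 0/1 outcomes."""
    if not outcomes or len(probabilities) != len(outcomes):
        raise ValueError("need one probability per outcome, and at least one outcome")
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)


# Outcomes for meetings 2024-04-10 (hold), 2024-06-05 (cut), 2024-09-04 (cut).
outcomes = [0, 1, 1]

# Illustrative forecasts only.
informed = brier_score([0.2, 0.7, 0.9], outcomes)
coin_flip = brier_score([0.5, 0.5, 0.5], outcomes)
```

A constant 0.5 forecast always scores 0.25, which gives a simple reference point when reading scores from a run this short.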
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__specs__boc_rate_direction_backtest.yaml.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__specs__boc_rate_direction_backtest.yaml.md
new file mode 100644
index 0000000..f10982c
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__specs__boc_rate_direction_backtest.yaml.md
@@ -0,0 +1,187 @@
+# Source: implementations/boc_rate_decisions/specs/boc_rate_direction_backtest.yaml
+
+kind: yaml
+
+```yaml
+# BoC Rate Direction Backtest Spec — 2010-2024 fixed announcement dates, T-28
+#
+# Canonical 3-way ordered-categorical backtest: at each forecast origin
+# (28 days before a BoC fixed announcement date), predict whether the Bank
+# will cut, hold, or hike its target for the overnight rate at the
+# announcement. Scored with RPS (task payload_type: categorical).
+#
+# Why a 28-day lead: on the eve of a decision the 2-year GoC yield has
+# already absorbed the market consensus, so a T-1 forecast mostly reads
+# market pricing off a curve. Four weeks out the decision is genuinely
+# uncertain — the interesting skill is anticipating cycle turns before the
+# market converges. The eve-of-decision variant is kept as a small diagnostic in
+# boc_rate_direction_eve_smoke.yaml; comparing the two shows how skill
+# concentrates as information arrives.
+#
+# Origins are EXPLICIT because BoC meetings are an irregular calendar —
+# 8 dates per year that no pandas frequency alias can generate. Each origin
+# is announcement_date - 28 days, so the 28-day horizon resolves exactly on
+# the announcement. The minimum gap between scheduled meetings is 35 days,
+# so the previous meeting's outcome is always visible at the origin.
+# The origin list is derived from ../meeting_schedule.yaml; a use-case test
+# asserts the two files stay consistent.
+#
+# Origin count : 120 (8 per year, 2010-2024)
+# Coverage     : spans the 2010 + 2017-18 + 2022-23 hike cycles and the
+#                2015 + 2020 + 2024 cut cycles.
+# Warmup       : 8 events (the full 2009 easing cycle is visible history at
+#                every origin).
+
+description: >-
+  Backtest across all 120 BoC fixed announcement dates from 2010 through
+  2024. At each origin (announcement date minus 28 days) predictors emit
+  probabilities over cut, hold, and hike, resolved against the derived
+  direction series and scored with RPS. 2009 meetings are excluded as
+  targets but visible as history.
+
+task:
+  task_id: boc_rate_direction_next_meeting
+  target_series_id: boc_rate_decision_direction
+  horizons: [28]
+  frequency: D
+  payload_type: categorical
+  categories:
+    - {label: cut, value: -1}
+    - {label: hold, value: 0}
+    - {label: hike, value: 1}
+  description: >-
+    At the Bank of Canada fixed announcement date occurring 28 days after the
+    forecast origin, will the Bank CUT, HOLD, or HIKE its target for the
+    overnight rate? Outcome is the direction of the target-rate change at
+    that announcement (any size). Announcements are at 09:45 ET; the
+    forecast must be issued with information available four weeks before
+    the announcement, before markets have converged on the decision.
+
+start: "2009-12-01"
+end: "2024-12-31"
+stride: 1
+warmup: 8
+
+# One origin per meeting: announcement_date - 28 days.
+origin_dates:
+  - "2009-12-22"  # meeting 2010-01-19
+  - "2010-02-02"  # meeting 2010-03-02
+  - "2010-03-23"  # meeting 2010-04-20
+  - "2010-05-04"  # meeting 2010-06-01 (hike)
+  - "2010-06-22"  # meeting 2010-07-20 (hike)
+  - "2010-08-11"  # meeting 2010-09-08 (hike)
+  - "2010-09-21"  # meeting 2010-10-19
+  - "2010-11-09"  # meeting 2010-12-07
+  - "2010-12-21"  # meeting 2011-01-18
+  - "2011-02-01"  # meeting 2011-03-01
+  - "2011-03-15"  # meeting 2011-04-12
+  - "2011-05-03"  # meeting 2011-05-31
+  - "2011-06-21"  # meeting 2011-07-19
+  - "2011-08-10"  # meeting 2011-09-07
+  - "2011-09-27"  # meeting 2011-10-25
+  - "2011-11-08"  # meeting 2011-12-06
+  - "2011-12-20"  # meeting 2012-01-17
+  - "2012-02-09"  # meeting 2012-03-08
+  - "2012-03-20"  # meeting 2012-04-17
+  - "2012-05-08"  # meeting 2012-06-05
+  - "2012-06-19"  # meeting 2012-07-17
+  - "2012-08-08"  # meeting 2012-09-05
+  - "2012-09-25"  # meeting 2012-10-23
+  - "2012-11-06"  # meeting 2012-12-04
+  - "2012-12-26"  # meeting 2013-01-23
+  - "2013-02-06"  # meeting 2013-03-06
+  - "2013-03-20"  # meeting 2013-04-17
+  - "2013-05-01"  # meeting 2013-05-29
+  - "2013-06-19"  # meeting 2013-07-17
+  - "2013-08-07"  # meeting 2013-09-04
+  - "2013-09-25"  # meeting 2013-10-23
+  - "2013-11-06"  # meeting 2013-12-04
+  - "2013-12-25"  # meeting 2014-01-22
+  - "2014-02-05"  # meeting 2014-03-05
+  - "2014-03-19"  # meeting 2014-04-16
+  - "2014-05-07"  # meeting 2014-06-04
+  - "2014-06-18"  # meeting 2014-07-16
+  - "2014-08-06"  # meeting 2014-09-03
+  - "2014-09-24"  # meeting 2014-10-22
+  - "2014-11-05"  # meeting 2014-12-03
+  - "2014-12-24"  # meeting 2015-01-21 (cut)
+  - "2015-02-04"  # meeting 2015-03-04
+  - "2015-03-18"  # meeting 2015-04-15
+  - "2015-04-29"  # meeting 2015-05-27
+  - "2015-06-17"  # meeting 2015-07-15 (cut)
+  - "2015-08-12"  # meeting 2015-09-09
+  - "2015-09-23"  # meeting 2015-10-21
+  - "2015-11-04"  # meeting 2015-12-02
+  - "2015-12-23"  # meeting 2016-01-20
+  - "2016-02-10"  # meeting 2016-03-09
+  - "2016-03-16"  # meeting 2016-04-13
+  - "2016-04-27"  # meeting 2016-05-25
+  - "2016-06-15"  # meeting 2016-07-13
+  - "2016-08-10"  # meeting 2016-09-07
+  - "2016-09-21"  # meeting 2016-10-19
+  - "2016-11-09"  # meeting 2016-12-07
+  - "2016-12-21"  # meeting 2017-01-18
+  - "2017-02-01"  # meeting 2017-03-01
+  - "2017-03-15"  # meeting 2017-04-12
+  - "2017-04-26"  # meeting 2017-05-24
+  - "2017-06-14"  # meeting 2017-07-12 (hike)
+  - "2017-08-09"  # meeting 2017-09-06 (hike)
+  - "2017-09-27"  # meeting 2017-10-25
+  - "2017-11-08"  # meeting 2017-12-06
+  - "2017-12-20"  # meeting 2018-01-17 (hike)
+  - "2018-02-07"  # meeting 2018-03-07
+  - "2018-03-21"  # meeting 2018-04-18
+  - "2018-05-02"  # meeting 2018-05-30
+  - "2018-06-13"  # meeting 2018-07-11 (hike)
+  - "2018-08-08"  # meeting 2018-09-05
+  - "2018-09-26"  # meeting 2018-10-24 (hike)
+  - "2018-11-07"  # meeting 2018-12-05
+  - "2018-12-12"  # meeting 2019-01-09
+  - "2019-02-06"  # meeting 2019-03-06
+  - "2019-03-27"  # meeting 2019-04-24
+  - "2019-05-01"  # meeting 2019-05-29
+  - "2019-06-12"  # meeting 2019-07-10
+  - "2019-08-07"  # meeting 2019-09-04
+  - "2019-10-02"  # meeting 2019-10-30
+  - "2019-11-06"  # meeting 2019-12-04
+  - "2019-12-25"  # meeting 2020-01-22
+  - "2020-02-05"  # meeting 2020-03-04 (cut)
+  - "2020-03-18"  # meeting 2020-04-15
+  - "2020-05-06"  # meeting 2020-06-03
+  - "2020-06-17"  # meeting 2020-07-15
+  - "2020-08-12"  # meeting 2020-09-09
+  - "2020-09-30"  # meeting 2020-10-28
+  - "2020-11-11"  # meeting 2020-12-09
+  - "2020-12-23"  # meeting 2021-01-20
+  - "2021-02-10"  # meeting 2021-03-10
+  - "2021-03-24"  # meeting 2021-04-21
+  - "2021-05-12"  # meeting 2021-06-09
+  - "2021-06-16"  # meeting 2021-07-14
+  - "2021-08-11"  # meeting 2021-09-08
+  - "2021-09-29"  # meeting 2021-10-27
+  - "2021-11-10"  # meeting 2021-12-08
+  - "2021-12-29"  # meeting 2022-01-26
+  - "2022-02-02"  # meeting 2022-03-02 (hike)
+  - "2022-03-16"  # meeting 2022-04-13 (hike)
+  - "2022-05-04"  # meeting 2022-06-01 (hike)
+  - "2022-06-15"  # meeting 2022-07-13 (hike)
+  - "2022-08-10"  # meeting 2022-09-07 (hike)
+  - "2022-09-28"  # meeting 2022-10-26 (hike)
+  - "2022-11-09"  # meeting 2022-12-07 (hike)
+  - "2022-12-28"  # meeting 2023-01-25 (hike)
+  - "2023-02-08"  # meeting 2023-03-08
+  - "2023-03-15"  # meeting 2023-04-12
+  - "2023-05-10"  # meeting 2023-06-07 (hike)
+  - "2023-06-14"  # meeting 2023-07-12 (hike)
+  - "2023-08-09"  # meeting 2023-09-06
+  - "2023-09-27"  # meeting 2023-10-25
+  - "2023-11-08"  # meeting 2023-12-06
+  - "2023-12-27"  # meeting 2024-01-24
+  - "2024-02-07"  # meeting 2024-03-06
+  - "2024-03-13"  # meeting 2024-04-10
+  - "2024-05-08"  # meeting 2024-06-05 (cut)
+  - "2024-06-26"  # meeting 2024-07-24 (cut)
+  - "2024-08-07"  # meeting 2024-09-04 (cut)
+  - "2024-09-25"  # meeting 2024-10-23 (cut)
+  - "2024-11-13"  # meeting 2024-12-11 (cut)
+```
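
Review note: the header comments state that each origin is the announcement date minus 28 days and that the minimum gap between scheduled meetings is 35 days. The sketch below checks both claims against the 2024 meeting dates listed in the spec. The function names are hypothetical, and this is not the repository's consistency test against `meeting_schedule.yaml`.

```python
from __future__ import annotations

from datetime import date, timedelta

LEAD_DAYS = 28


def origins_for(announcements: list[date], lead_days: int = LEAD_DAYS) -> list[date]:
    """Return one forecast origin per announcement: announcement - lead_days."""
    return [d - timedelta(days=lead_days) for d in announcements]


def min_gap_days(announcements: list[date]) -> int:
    """Smallest number of days between consecutive announcements."""
    ordered = sorted(announcements)
    return min((later - earlier).days for earlier, later in zip(ordered, ordered[1:]))


# 2024 fixed announcement dates, copied from the origin_dates comments above.
announcements_2024 = [
    date(2024, 1, 24), date(2024, 3, 6), date(2024, 4, 10), date(2024, 6, 5),
    date(2024, 7, 24), date(2024, 9, 4), date(2024, 10, 23), date(2024, 12, 11),
]
origins_2024 = origins_for(announcements_2024)

# Each origin must fall after the previous announcement, so that meeting's
# outcome is already visible history when the forecast is issued.
previous_visible = all(
    origin > previous
    for origin, previous in zip(origins_2024[1:], announcements_2024[:-1])
)
```

Because the smallest gap (35 days) exceeds the 28-day lead, every origin lands at least a week after the preceding decision.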
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__specs__boc_rate_direction_eval.yaml.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__specs__boc_rate_direction_eval.yaml.md
new file mode 100644
index 0000000..4ab15ba
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__specs__boc_rate_direction_eval.yaml.md
@@ -0,0 +1,72 @@
+# Source: implementations/boc_rate_decisions/specs/boc_rate_direction_eval.yaml
+
+kind: yaml
+
+```yaml
+# BoC Rate Direction Eval Spec — 2025-2026 protected window, T-28
+#
+# Held-out, budget-controlled evaluation over the 12 BoC fixed announcement
+# dates from January 2025 through June 2026. All 12 are resolved as of
+# June 2026 (the June 10 announcement resolves the final origin).
+#
+# Origins sit 28 days before each announcement — the same lead as the
+# canonical backtest — so the eval measures anticipation, not eve-of-decision
+# market reading. This window contains cuts and holds but NO hikes, so it
+# cannot reward hike discrimination. That is acceptable for this protected
+# slice: RPS stays well defined when a category never occurs, and it still
+# penalizes probability mass misplaced on hike at every origin.
+#
+# Use this spec sparingly. max_runs: 5 limits how many times a participant
+# may run evaluate() against it, reducing the risk of inadvertently
+# over-fitting to the held-out window.
+#
+# Origins are explicit (announcement_date - 28 days) because BoC meetings are
+# an irregular calendar; derived from ../meeting_schedule.yaml.
+
+spec_id: boc_rate_direction_eval_2025_2026
+
+description: >-
+  Protected eval across the 12 BoC fixed announcement dates from January
+  2025 through June 2026. At each origin (announcement date minus 28 days)
+  predictors emit probabilities over cut, hold, and hike, scored with RPS.
+  Budget-limited to 5 runs per participant tracker.
+
+task:
+  task_id: boc_rate_direction_next_meeting
+  target_series_id: boc_rate_decision_direction
+  horizons: [28]
+  frequency: D
+  payload_type: categorical
+  categories:
+    - {label: cut, value: -1}
+    - {label: hold, value: 0}
+    - {label: hike, value: 1}
+  description: >-
+    At the Bank of Canada fixed announcement date occurring 28 days after the
+    forecast origin, will the Bank CUT, HOLD, or HIKE its target for the
+    overnight rate? Outcome is the direction of the target-rate change at
+    that announcement (any size). Announcements are at 09:45 ET; the
+    forecast must be issued with information available four weeks before
+    the announcement, before markets have converged on the decision.
+
+start: "2025-01-01"
+end: "2026-06-30"
+stride: 1
+warmup: 8
+max_runs: 5
+
+# One origin per meeting: announcement_date - 28 days.
+origin_dates:
+  - "2025-01-01"  # meeting 2025-01-29 (cut)
+  - "2025-02-12"  # meeting 2025-03-12 (cut)
+  - "2025-03-19"  # meeting 2025-04-16
+  - "2025-05-07"  # meeting 2025-06-04
+  - "2025-07-02"  # meeting 2025-07-30
+  - "2025-08-20"  # meeting 2025-09-17 (cut)
+  - "2025-10-01"  # meeting 2025-10-29 (cut)
+  - "2025-11-12"  # meeting 2025-12-10
+  - "2025-12-31"  # meeting 2026-01-28
+  - "2026-02-18"  # meeting 2026-03-18
+  - "2026-04-01"  # meeting 2026-04-29
+  - "2026-05-13"  # meeting 2026-06-10
+```
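
Review note: a sketch of the ranked probability score (RPS) over the ordered categories cut < hold < hike, to show why a window with no hikes still scores the hike probability. This uses the unnormalized form (sum of squared cumulative differences); whether the framework divides by K - 1 is not stated in the spec, so treat that as an assumption. The forecasts are illustrative, not results.

```python
from __future__ import annotations

CATEGORIES = ("cut", "hold", "hike")  # ordered as in the spec: -1, 0, 1


def rps(
    probabilities: dict[str, float],
    realized_label: str,
    categories: tuple[str, ...] = CATEGORIES,
) -> float:
    """Unnormalized RPS: squared gaps between cumulative forecast and outcome."""
    cum_forecast = 0.0
    cum_outcome = 0.0
    total = 0.0
    # The final cumulative term is always (1 - 1) ** 2 = 0, so it is skipped.
    for label in categories[:-1]:
        cum_forecast += probabilities.get(label, 0.0)
        cum_outcome += 1.0 if label == realized_label else 0.0
        total += (cum_forecast - cum_outcome) ** 2
    return total


# Realized outcome: cut. All three forecasts put 0.6 on cut.
no_hike_mass = rps({"cut": 0.6, "hold": 0.4, "hike": 0.0}, "cut")
some_hike_mass = rps({"cut": 0.6, "hold": 0.3, "hike": 0.1}, "cut")
more_hike_mass = rps({"cut": 0.6, "hold": 0.1, "hike": 0.3}, "cut")
```

With the same probability on the realized category, the score worsens as the remaining mass moves from hold to hike, so a predictor cannot place mass on hike for free in this window.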
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__specs__boc_rate_direction_eve_smoke.yaml.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__specs__boc_rate_direction_eve_smoke.yaml.md
new file mode 100644
index 0000000..bd6957b
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__specs__boc_rate_direction_eve_smoke.yaml.md
@@ -0,0 +1,50 @@
+# Source: implementations/boc_rate_decisions/specs/boc_rate_direction_eve_smoke.yaml
+
+kind: yaml
+
+```yaml
+# BoC Rate Direction EVE Smoke Spec — T-1 diagnostic, 3 origins
+#
+# Eve-of-decision companion to boc_rate_direction_smoke.yaml: same three
+# meetings, origins the day before each announcement. Used in notebook 02
+# (§7) for the cheap lead-time comparison (T-28 vs T-1) — the eve lead is
+# kept only as this small diagnostic, not as a full backtest.
+#
+# The three origins span holds and cuts but no hikes: a hold (2024-04-10), the
+# first cut of the 2024 easing cycle (2024-06-05), and a mid-cycle cut
+# (2024-09-04) — enough to exercise categorical scoring and plotting paths
+# without burning tokens on a long run.
+
+description: >-
+  Three-origin eve-of-decision (T-1) smoke backtest for the lead-time
+  comparison in notebook 02. Same meetings as boc_rate_direction_smoke,
+  origins the day before each announcement.
+
+task:
+  task_id: boc_rate_direction_next_meeting_eve
+  target_series_id: boc_rate_decision_direction
+  horizons: [1]
+  frequency: D
+  payload_type: categorical
+  categories:
+    - {label: cut, value: -1}
+    - {label: hold, value: 0}
+    - {label: hike, value: 1}
+  description: >-
+    At the Bank of Canada fixed announcement date occurring one day after the
+    forecast origin, will the Bank CUT, HOLD, or HIKE its target for the
+    overnight rate? Outcome is the direction of the target-rate change at
+    that announcement (any size). Announcements are at 09:45 ET; the
+    forecast must be issued with information available the day before.
+
+start: "2024-04-01"
+end: "2024-09-30"
+stride: 1
+warmup: 8
+
+# One origin per meeting: announcement_date - 1 day.
+origin_dates:
+  - "2024-04-09"  # meeting 2024-04-10 (hold)
+  - "2024-06-04"  # meeting 2024-06-05 (cut)
+  - "2024-09-03"  # meeting 2024-09-04 (cut)
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__specs__boc_rate_direction_smoke.yaml.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__specs__boc_rate_direction_smoke.yaml.md
new file mode 100644
index 0000000..e0df1b9
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__specs__boc_rate_direction_smoke.yaml.md
@@ -0,0 +1,52 @@
+# Source: implementations/boc_rate_decisions/specs/boc_rate_direction_smoke.yaml
+
+kind: yaml
+
+```yaml
+# BoC Rate Direction Smoke Spec — Fast CI/Testing Backtest, T-28
+#
+# Three-origin subset of boc_rate_direction_backtest.yaml for running the full
+# notebook pipeline cheaply during development and end-to-end testing.
+# Use by setting EXPERIMENT_CONFIG = "smoke" in the notebook setup cell.
+#
+# The three origins span holds and cuts but no hikes: a hold (2024-04-10), the
+# first cut of the 2024 easing cycle (2024-06-05), and a mid-cycle cut
+# (2024-09-04) — enough to exercise categorical scoring and plotting paths
+# without burning tokens on 120 LLM calls. Origins sit 28 days before each
+# announcement, matching the canonical backtest lead.
+
+description: >-
+  Three-origin smoke backtest for local and CI testing of the BoC
+  rate-direction pipeline. Same task and warmup as
+  boc_rate_direction_backtest, restricted to one hold and two cut meetings
+  in 2024, with origins 28 days before each announcement.
+
+task:
+  task_id: boc_rate_direction_next_meeting
+  target_series_id: boc_rate_decision_direction
+  horizons: [28]
+  frequency: D
+  payload_type: categorical
+  categories:
+    - {label: cut, value: -1}
+    - {label: hold, value: 0}
+    - {label: hike, value: 1}
+  description: >-
+    At the Bank of Canada fixed announcement date occurring 28 days after the
+    forecast origin, will the Bank CUT, HOLD, or HIKE its target for the
+    overnight rate? Outcome is the direction of the target-rate change at
+    that announcement (any size). Announcements are at 09:45 ET; the
+    forecast must be issued with information available four weeks before
+    the announcement, before markets have converged on the decision.
+
+start: "2024-03-01"
+end: "2024-09-30"
+stride: 1
+warmup: 8
+
+# One origin per meeting: announcement_date - 28 days.
+origin_dates:
+  - "2024-03-13"  # meeting 2024-04-10 (hold)
+  - "2024-05-08"  # meeting 2024-06-05 (cut)
+  - "2024-08-07"  # meeting 2024-09-04 (cut)
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__starter_agent____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__starter_agent____init__.py.md
new file mode 100644
index 0000000..65ebfad
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__starter_agent____init__.py.md
@@ -0,0 +1,22 @@
+# Source: implementations/boc_rate_decisions/starter_agent/__init__.py
+
+kind: python
+
+```python
+"""BoC starter agent — a fresh, hackable template for your own exploration.
+
+Exports the toggle-driven :class:`AgentConfig` factory and the predictor
+convenience factory. See ``99_starter_agent.ipynb`` and ``agent.py``.
+"""
+
+from boc_rate_decisions.starter_agent.agent import (
+    build_starter_agent_config,
+    build_starter_agent_predictor,
+)
+
+
+__all__ = [
+    "build_starter_agent_config",
+    "build_starter_agent_predictor",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__starter_agent__agent.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__starter_agent__agent.py.md
new file mode 100644
index 0000000..500357a
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__starter_agent__agent.py.md
@@ -0,0 +1,238 @@
+# Source: implementations/boc_rate_decisions/starter_agent/agent.py
+
+kind: python
+
+```python
+"""BoC starter agent — a fresh, hackable template for your own exploration.
+
+This is **not** part of the notebook 01–03 curriculum. It is a clean starting
+point: the smallest agent that still has room to grow. It ships with our common
+building blocks wired behind simple toggles —
+
+- **optional news search** (``enable_search``, on by default) — bounded,
+  cutoff-aware Google Search through the Vector proxy;
+- **optional code execution** (``enable_code_exec``, off by default) — an E2B
+  Python sandbox;
+- **two lightweight skills** (``skills/``) that are *tool-usage playbooks*:
+  how to get good results out of search and code execution.
+
+Everything routes through the Vector proxy — no direct provider keys. See
+``planning-docs/vector-llm-proxy.md``.
+
+The prompt builder and output schema are reused from the
+:mod:`~boc_rate_decisions.analyst_agent` module (they are just task
+serialisation — no need to duplicate them); the *agent identity* here is fresh
+and yours to edit. The output is a calibrated distribution over
+``cut / hold / hike``. Pair this with ``99_starter_agent.ipynb``.
+
+Module-level ``__getattr__`` exposes ``root_agent`` lazily so ``adk web`` can
+load this module for interactive (schema-free) use.
+"""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+from typing import Any, Callable
+
+from aieng.forecasting.data.context import ForecastContext
+from aieng.forecasting.evaluation.task import ForecastingTask
+from aieng.forecasting.methods.agentic import (
+    AgentPredictor,
+    CategoricalAgentForecastOutput,
+    build_adk_agent,
+)
+from aieng.forecasting.methods.agentic.agent_factory import (
+    AgentConfig,
+    CodeExecutionConfig,
+    ContextRetrievalConfig,
+)
+from aieng.forecasting.models import LITE_MODEL
+
+# Reuse the existing BoC prompt builder — it serialises the rate path, decision
+# history, and macro snapshot into the agent's JSON payload.
+from boc_rate_decisions.analyst_agent import BoCDecisionPromptBuilder
+
+
+# Skills live next to this module.
+_SKILLS_ROOT = Path(__file__).parent / "skills"
+_FORECASTING_SKILL = _SKILLS_ROOT / "forecasting"
+_RESEARCH_SKILL = _SKILLS_ROOT / "research-playbook"
+_CODE_ANALYSIS_SKILL = _SKILLS_ROOT / "code-analysis-playbook"
+
+
+# ---------------------------------------------------------------------------
+# System prompt
+# ---------------------------------------------------------------------------
+
+
+def _build_starter_instruction() -> str:
+    """Build the task-agnostic, skill-agnostic starter persona.
+
+    Just the analyst's identity and how to behave — no output schema, no payload
+    contract, no skill or tool mechanics. ADK injects the name + description of
+    every attached skill (and every tool) into the system prompt, so the agent
+    already knows what it can load and call; repeating that here would only
+    duplicate dynamically-injected information. The forecasting *contract* lives
+    in the loadable ``forecasting`` skill. Edit the persona freely.
+    """
+    return (
+        "## Role\n\n"
+        "You are a Bank of Canada monetary-policy analyst — fluent in the "
+        "policy-rate path, the 2% CPI inflation target, labour-market and "
+        "bond-market conditions, and the Bank's institutional behaviour "
+        "(gradualism, data dependence, reluctance to surprise markets). This is "
+        "a starter agent: keep your reasoning transparent and your claims honest.\n\n"
+        "## How to respond\n\n"
+        "- For open-ended questions, scenario analysis, or anything "
+        "conversational, answer directly and concisely — do NOT ask for a JSON "
+        "payload.\n"
+        "- When you are handed a task that asks for a structured probability "
+        "distribution over the next decision, produce a calibrated one."
+    )
+
+
+_STARTER_INSTRUCTION = _build_starter_instruction()
+
+
+_CONTEXT_RETRIEVAL_INSTRUCTION = """\
+You are a Canadian monetary-policy intelligence specialist with web search.
+
+Return a concise structured markdown summary (3-5 paragraphs) covering, as the
+query warrants: recent Bank of Canada communications (statements, speeches,
+Monetary Policy Reports); Canadian CPI and core inflation vs the 2% target; the
+labour market; market pricing of the upcoming decision (OIS, economist surveys);
+and macro shocks relevant to Canada (oil, exchange rate, US policy, trade).
+
+Ground every claim in the search results you actually retrieve. When a cutoff
+date is specified, never report or speculate about events after it.\
+"""
+
+
+# ---------------------------------------------------------------------------
+# Config factory
+# ---------------------------------------------------------------------------
+
+
+def build_starter_agent_config(
+    model: str = LITE_MODEL,
+    search_model: str = LITE_MODEL,
+    *,
+    enable_search: bool = True,
+    enable_code_exec: bool = False,
+) -> AgentConfig:
+    """Build the BoC starter :class:`AgentConfig`.
+
+    Parameters
+    ----------
+    model : str
+        Model for the analyst agent (default: lite). Pass the advanced model
+        (``"gemini-3.5-flash"``) for higher-quality runs.
+    search_model : str
+        Model for the bounded web-search sub-tool.
+    enable_search : bool, default=True
+        Wire a cutoff-aware ``search_web`` tool and load the
+        ``research-playbook`` skill. Proxy-only — no extra API key. Note: news
+        grounding on historical origins carries leakage risk, so keep
+        ``cutoff_date`` honest.
+    enable_code_exec : bool, default=False
+        Wire an E2B Python sandbox and load the ``code-analysis-playbook``
+        skill. Needs ``E2B_API_KEY`` and is slower, so it is off by default.
+
+    Returns
+    -------
+    AgentConfig
+    """
+    # Every attached skill is loaded on demand: ADK injects each skill's name +
+    # description into the system prompt, and the agent reads the full SKILL.md
+    # only when relevant — so toggling a tool just adds its skill, no persona edits.
+    skills_dirs: list[Path] = [_FORECASTING_SKILL]
+    if enable_search:
+        skills_dirs.append(_RESEARCH_SKILL)
+    if enable_code_exec:
+        skills_dirs.append(_CODE_ANALYSIS_SKILL)
+
+    context_retrieval = (
+        ContextRetrievalConfig(
+            enabled=True,
+            instruction=_CONTEXT_RETRIEVAL_INSTRUCTION,
+            search_model=search_model,
+        )
+        if enable_search
+        else ContextRetrievalConfig()
+    )
+
+    return AgentConfig(
+        name="boc_starter_agent",
+        model=model,
+        instruction=_STARTER_INSTRUCTION,
+        # 16k headroom: enough for a complete run_code script + structured output.
+        max_output_tokens=16_384 if enable_code_exec else None,
+        context_retrieval=context_retrieval,
+        code_execution=CodeExecutionConfig(enabled=enable_code_exec),
+        skills_dirs=skills_dirs,
+    )
+
+
+# ---------------------------------------------------------------------------
+# Predictor convenience factory
+# ---------------------------------------------------------------------------
+
+
+class _StarterForecastPromptBuilder:
+    """Add the output schema + a forecast directive to a base builder's payload.
+
+    The exact JSON schema is generated at call time from the output class
+    (drift-free) and injected into the user payload — never into the system
+    prompt — so the agent stays conversational until it is actually asked to
+    forecast. Implements the
+    :class:`~aieng.forecasting.methods.agentic.predictor.ForecastPromptBuilder`
+    protocol structurally.
+    """
+
+    def __init__(self, inner: Callable[..., str], output_schema_json: str) -> None:
+        self._inner = inner
+        self._schema_json = output_schema_json
+
+    def __call__(self, *, task: ForecastingTask, context: ForecastContext) -> str:
+        payload = json.loads(self._inner(task=task, context=context))
+        payload["instructions"] = (
+            "Produce a calibrated probability distribution for this decision and return "
+            "it by calling `set_model_response` with a `json_response` string matching "
+            "`output_schema` exactly."
+        )
+        payload["output_schema"] = self._schema_json
+        return json.dumps(payload, indent=2)
+
+
+def build_starter_agent_predictor(config: AgentConfig) -> AgentPredictor:
+    """Wrap a starter :class:`AgentConfig` in an :class:`AgentPredictor`.
+
+    Reuses :class:`~boc_rate_decisions.analyst_agent.BoCDecisionPromptBuilder`
+    for data serialisation, wrapped so the (drift-free) categorical output schema
+    and a forecast directive ride in the payload — keeping the schema out of the
+    persona. ``predict(task, context)`` returns one
+    :class:`~aieng.forecasting.evaluation.prediction.Prediction` carrying the
+    cut/hold/hike distribution.
+    """
+    return AgentPredictor(
+        agent_config=config,
+        prompt_builder=_StarterForecastPromptBuilder(
+            BoCDecisionPromptBuilder(),
+            CategoricalAgentForecastOutput.prompt_schema_json(labels=["cut", "hold", "hike"]),
+        ),
+        output_schema=CategoricalAgentForecastOutput,
+    )
+
+
+# ---------------------------------------------------------------------------
+# Lazy root_agent for `adk web` interactive use
+# ---------------------------------------------------------------------------
+
+
+def __getattr__(name: str) -> Any:
+    """Expose ``root_agent`` lazily for schema-free interactive use via ``adk web``."""
+    if name == "root_agent":
+        return build_adk_agent(build_starter_agent_config())
+    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__starter_agent__skills__code-analysis-playbook__SKILL.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__starter_agent__skills__code-analysis-playbook__SKILL.md.md
new file mode 100644
index 0000000..a7aa5bd
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__starter_agent__skills__code-analysis-playbook__SKILL.md.md
@@ -0,0 +1,59 @@
+# Source: implementations/boc_rate_decisions/starter_agent/skills/code-analysis-playbook/SKILL.md
+
+kind: markdown
+
+---
+name: code-analysis-playbook
+description: >-
+  How to use the code execution sandbox well — parse the JSON payload (not
+  disk files), compute a couple of useful diagnostics before forecasting, and
+  keep the session stateful within a turn. Load this before writing code. No
+  scripts.
+---
+
+# Code-analysis playbook
+
+A short guide to using the `run_code` sandbox productively. This is a starter
+skill — extend it with the diagnostics that matter for your problem.
+
+## Where your data lives
+
+All data comes from the **JSON payload in your context** — there are no disk
+files and no network. The history arrives as a CSV *string* (e.g.
+`target_history_csv`). Parse it with `io.StringIO`, never as a file path:
+
+```python
+import io, pandas as pd
+df = pd.read_csv(io.StringIO(payload["target_history_csv"]))
+```
+
+The sandbox is **stateful within a turn**: parse once in your first code block,
+then reuse the DataFrame in later blocks instead of re-parsing.
+
+## Compute before you forecast
+
+Run a couple of cheap diagnostics so your forecast is grounded in arithmetic,
+not vibes:
+
+1. **Recent trend** — slope/return over the last N observations.
+2. **Volatility** — recent standard deviation of changes; it sets how wide your
+   quantile bands should be.
+3. **Sanity check** — does your point forecast sit within a plausible multiple
+   of recent moves? If not, revisit it.
+
+Use the printed numbers to set the point forecast and to *calibrate the spread*
+between your low and high quantiles — wider when recent volatility is high.
+
+## Domain focus (edit this for your use case)
+
+For a BoC rate decision your payload is categorical: it carries the policy-rate
+change points, the per-outcome base rates, and a macro snapshot — not a price
+CSV, so adapt the parsing above to those fields. Useful diagnostics: recompute
+the empirical base rates, measure how far the current macro snapshot sits from
+typical pre-cut vs pre-hold conditions, and count how often the Bank reversed
+direction between adjacent meetings.
+
+## Room to grow
+
+- Add your own diagnostic patterns (regime detection, seasonality, covariates).
+- Drop reusable reference values into a `references/` file and `load_skill_resource` them.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__starter_agent__skills__forecasting__SKILL.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__starter_agent__skills__forecasting__SKILL.md.md
new file mode 100644
index 0000000..c61d404
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__starter_agent__skills__forecasting__SKILL.md.md
@@ -0,0 +1,52 @@
+# Source: implementations/boc_rate_decisions/starter_agent/skills/forecasting/SKILL.md
+
+kind: markdown
+
+---
+name: forecasting
+description: >-
+  The output contract for producing a structured probability distribution over
+  the rate decision (cut / hold / hike) — the JSON shape, the calibration rules,
+  and how to submit it. Load this ONLY when your task payload asks for a
+  forecast; ignore it for open-ended questions. No scripts.
+---
+
+# Forecasting skill
+
+Load this when your task payload asks for a structured forecast. For open-ended
+questions, ignore it and just answer.
+
+## What you'll receive
+
+A JSON payload describing the task: the `task` and `as_of` cutoff date, the
+`announcement_date` being predicted, the `policy_rate` path, `meeting_outcomes`
+(decision history + historical base rates), a `macro_snapshot`, and an
+`output_schema` showing the exact JSON to return.
+
+## The output contract
+
+1. Assign one probability to each of **`cut`, `hold`, `hike`**; the three must
+   **sum to 1**.
+2. Report **calibrated** probabilities — across many decisions where you say
+   0.7, that outcome should occur about 70% of the time.
+3. Anchor on the **historical base rates**, then adjust for the macro snapshot
+   and recent decisions. Direct cut→hike reversals between adjacent meetings
+   essentially never happen, so recent history shapes which tail is plausible.
+4. Use ONLY information available on or before `as_of`.
+5. Put your reasoning in `reasoning` and the decisive inputs in `key_signals`.
+
+Submit by calling `set_model_response` with a `json_response` string that
+matches the payload's `output_schema` **exactly**. Omit any field not shown.
+
+## Domain focus (edit this for your use case)
+
+The 2-year GoC yield trading well below the policy rate means the bond market is
+pricing cuts; well above means hikes. CPI relative to the 2% target and
+labour-market momentum tell you whether you are in an easing or tightening
+cycle. Weigh those against the Bank's gradualism and reluctance to surprise
+markets.
+
+## Room to grow
+
+- Add your own calibration notes from the backtest leaderboard.
+- Encode any decision rules you trust (e.g. how much a soft CPI print moves P(cut)).
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__starter_agent__skills__research-playbook__SKILL.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__starter_agent__skills__research-playbook__SKILL.md.md
new file mode 100644
index 0000000..96ad503
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__boc_rate_decisions__starter_agent__skills__research-playbook__SKILL.md.md
@@ -0,0 +1,46 @@
+# Source: implementations/boc_rate_decisions/starter_agent/skills/research-playbook/SKILL.md
+
+kind: markdown
+
+---
+name: research-playbook
+description: >-
+  How to use the search_web tool well when grounding a forecast in recent
+  news — phrase cutoff-aware queries, decide what is worth searching for, and
+  weigh sources. Load this before your first search_web call. No scripts.
+---
+
+# Research playbook
+
+A short guide to getting real signal out of `search_web`. This is a starter
+skill — extend it with the queries and sources that work for your problem.
+
+## The one rule that matters
+
+Always pass `cutoff_date` equal to the `as_of` date in your payload. It is the
+temporal fence that keeps post-origin information out of a historical forecast.
+A forecast that "knew" what happened after `as_of` is not a forecast.
+
+## How to search
+
+- **Search before you forecast, not after.** Gather context first, then reason.
+- **One topic per query.** Several focused queries beat one broad one. Stop when
+  new queries stop returning new facts.
+- **Ask for the present state, not a prediction.** "current OPEC+ production
+  policy" returns facts; "will oil go up" returns noise.
+- **Weigh sources.** Prefer primary releases and major outlets; treat a single
+  blog or forum post as a lead to confirm, not a fact.
+
+## Domain focus (edit this for your use case)
+
+For a Bank of Canada rate decision, the signals that move the odds: recent CPI
+and core-inflation prints vs the 2% target, the labour market (employment,
+unemployment, wages), market pricing (overnight index swaps, economist surveys),
+recent BoC communications, and macro shocks (oil, the loonie, US policy, trade).
+Search for the *current state* of these, then let the base rates set the prior.
+
+## Room to grow
+
+- Add a curated list of go-to sources for your domain.
+- Track which queries paid off and prune the ones that didn't.
+- Add a `references/` file with example high-signal searches.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__01_wti_case_study.ipynb.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__01_wti_case_study.ipynb.md
new file mode 100644
index 0000000..e0c877a
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__01_wti_case_study.ipynb.md
@@ -0,0 +1,278 @@
+# Source: implementations/energy_oil_forecasting/01_wti_case_study.ipynb
+
+kind: notebook
+
+## Cell 1 (markdown)
+
+# Oil Prices in 2026 — A Forecasting Case Study
+
+Suppose your operating costs are highly sensitive to oil prices.
+Every day, starting January 1, 2025, you run a 30-day-ahead forecast of WTI crude
+using **Prophet** — a lean statistical model built at Meta.
+Prophet extracts trend and seasonality from historical price data and produces
+nominal 95% uncertainty intervals.
+
+This notebook asks a simple question: *how well does that work — and what would make it better?*
+
+## Cell 2 (code)
+
+```python
+import logging
+import warnings
+
+
+warnings.filterwarnings("ignore")
+logging.getLogger("prophet").setLevel(logging.ERROR)
+logging.getLogger("cmdstanpy").setLevel(logging.ERROR)
+
+from energy_oil_forecasting.data import WTI_SERIES_ID, build_wti_service, naive_utc_now
+from energy_oil_forecasting.paths import (
+    ROLLING_CI_WIDTH,
+    ROLLING_FORECAST_CACHE,
+    ROLLING_HORIZON_DAYS,
+    SIMULATION_END,
+    SIMULATION_START,
+)
+from energy_oil_forecasting.prophet_baseline import compute_rolling_forecasts, wti_series_to_price_df
+from energy_oil_forecasting.viz import (
+    build_forecast_animation,
+    export_animation_html,
+    make_context_chart,
+    make_futures_curve_chart,
+    make_punchline_charts,
+)
+```
+
+## Cell 3 (markdown)
+
+## Load WTI price data
+
+We use Yahoo Finance's `CL=F` — the WTI crude oil continuous front-month futures contract.
+It tracks the spot price within cents and requires no API key.
+Data runs from January 2021 through today.
+
+> **Note on data source**: `CL=F` is a futures price, not the EIA-posted spot price (`DCOILWTICO` on FRED).
+> For daily oil price analysis these two series are virtually indistinguishable — the front-month
+> futures contract converges to spot at expiry. Using `CL=F` is also a natural bridge into Act 4,
+> where we discuss what a full futures *curve* would add.
+
+## Cell 4 (code)
+
+```python
+# Build the DataService context as of today (all available history).
+as_of = naive_utc_now()
+data_service = build_wti_service()
+ctx = data_service.context(as_of=as_of)
+price_df = wti_series_to_price_df(ctx.get_series(WTI_SERIES_ID))
+
+print(f"Trading days loaded: {len(price_df):,}")
+print(f"Latest WTI close: ${price_df['price'].iloc[-1]:.2f}/bbl on {price_df.index[-1].date()}")
+```
+
+## Cell 5 (markdown)
+
+---
+
+## Pre-compute: Rolling 30-Day Prophet Forecasts
+
+Starting January 1, 2025, we simulate what it would have looked like to run a daily
+30-day-ahead forecast using Prophet:
+
+- **Daily refits**: the model re-trains every simulation day on all price data through
+  that day. Prophet is fast enough to make this realistic — each fit takes only a few
+  seconds, which is well within a nightly batch window.
+- **Forecast target**: the price 30 calendar days from today, resolved on the nearest
+  available trading day.
+- **Uncertainty interval**: 95% (`interval_width=0.95`).
+- **Prophet config**: multiplicative seasonality (scales with price level), `changepoint_prior_scale=0.1`, `changepoint_range=0.9`.
+
+Results are cached to `data/energy_case_study_forecasts_30d_daily_v3.parquet`.
+**First run: ~1–2 minutes (~313 daily fits). Subsequent runs: instant.**
+
+## Cell 6 (code)
+
+```python
+forecasts_df = compute_rolling_forecasts(
+    price_df=price_df,
+    simulation_start=SIMULATION_START,
+    simulation_end=SIMULATION_END,
+    horizon_days=ROLLING_HORIZON_DAYS,
+    ci_width=ROLLING_CI_WIDTH,
+    cache_path=ROLLING_FORECAST_CACHE,
+)
+print(f"Forecast date range: {forecasts_df['sim_day'].min().date()} → {forecasts_df['sim_day'].max().date()}")
+forecasts_df.head()
+```
+
+## Cell 7 (markdown)
+
+---
+
+## Act 1 — Forecasting Blind
+
+The animation below replays Prophet's rolling 30-day forecast from January 2025 through
+today.  The model sees only historical prices — nothing else.
+
+**How to read it:**
+- The **blue line** is the realized WTI price, revealing itself as time passes.
+- Each **orange bar** is the 95% CI for one 30-day-ahead forecast, placed at its resolution date.
+  The leading bar (darker) is the active forecast; lighter bars are already resolved.
+- **Green dots** mark resolutions inside the CI. **Red ✕ marks** are misses.
+- The **red dashed line** marks when the US–Iran war began (March 1, 2026).
+
+Use **▶ Play** to run through 2025 at speed. Pause and step as you enter 2026.
+
+## Cell 8 (code)
+
+```python
+anim_fig = build_forecast_animation(price_df, forecasts_df)
+anim_fig.show()
+```
+
+## Cell 9 (code)
+
+```python
+from pathlib import Path
+
+
+html_path = Path("oil_forecast_animation.html")
+export_animation_html(anim_fig, html_path)
+print(f"Exported standalone animation to {html_path.resolve()}")
+```
+
+## Cell 10 (markdown)
+
+---
+
+## Act 2 — The World Context
+
+Now look at the full price history annotated with the major real-world events that moved
+oil markets.  These are the things Prophet never saw — but that any human analyst would
+factor into their predictions.
+
+The red dashed line again marks **March 1, 2026** — the start of the US–Iran war and
+the Strait of Hormuz blockade that drove prices above $100/bbl in days.
+
+## Cell 11 (code)
+
+```python
+make_context_chart(price_df).show()
+```
+
+## Cell 12 (markdown)
+
+---
+
+## Act 3 — Analyzing 2025 vs. 2026 Results
+
+Let's separate the 2025 backtest from the 2026 reality and look at what the numbers actually say.
+
+## Cell 13 (code)
+
+```python
+err_fig, cov_fig, summary = make_punchline_charts(forecasts_df)
+err_fig.show()
+cov_fig.show()
+print("\nSummary:")
+print(summary.to_string())
+```
+
+## Cell 14 (markdown)
+
+### What happened?
+
+Through 2025, Prophet's 95% CI caught about 77% of resolutions — below the nominal 95%,
+but for a 30-day-ahead forecast on a volatile commodity with daily refits, that's a
+workable baseline. The error timeline shows errors scattered in both directions around
+zero, and the MAE held steady around $5–6 per barrel.
+
+Then came early 2026. Conflict escalation in the Persian Gulf — and mounting concern about
+Strait of Hormuz access — drove WTI prices from the low-$70s to over $110/bbl in a matter
+of weeks. The model, still pricing crude at $60–70, was missing by
+**$20–40 per barrel**. Only 29% of 2026 resolutions landed inside the 95% CI, and
+the MAE ballooned to ~$24/bbl.
+
+The model wasn't *wrong in principle*. It correctly described the distribution of outcomes
+that would have been reasonable given everything it had ever seen.
+It just had no way of knowing what it didn't know.
+
+> **A forecaster that backtests adequately is not the same as a forecaster that's robust to regime change.**
+
+The question for the bootcamp: *what class of methods could do better — and specifically,
+where can AI and agentic AI add value?*
+
+## Cell 15 (markdown)
+
+---
+
+## Act 4 — What Could Have Helped?
+
+Prophet is a strong statistical baseline. But it's blind to the world outside the price series.
+Here are four information sources — and four classes of method — that a more capable
+forecaster could exploit.
+
+## Cell 16 (code)
+
+```python
+futures_fig = make_futures_curve_chart(price_df)
+if futures_fig is not None:
+    futures_fig.show()
+else:
+    print("Futures contract data not available — skipping term-structure chart.")
+```
+
+## Cell 17 (markdown)
+
+### The Futures Curve
+
+The futures term structure tells you what *the market* collectively believes about forward prices.
+When the curve is in **backwardation** (near prices > far prices), traders are pricing a
+near-term supply crunch but expecting relief later.
+When it's in **contango** (near < far), the market sees near-term oversupply or weak demand.
+
+Prophet can't see any of this — it only knows historical realized prices.
+A futures-aware model can incorporate curve shape, spread dynamics, and roll signals as features.
+
+But even futures markets can be caught off guard by a sudden geopolitical shock.
+
+---
+
+### Four Levers a Better Forecaster Could Pull
+
+| Information source | What it adds | Limitation |
+|---|---|---|
+| **Futures curve** | Market-implied forward expectations; curve shape; spread signals | Still reactive — can be caught off guard by sudden shocks |
+| **Prediction markets** | Probability-weighted crowd forecasts on discrete events (e.g. Strait of Hormuz closure) | Thin liquidity; slow to update on novel scenarios |
+| **News & social signals** | Real-time event detection; geopolitical escalation indicators; sentiment | Noisy; requires reasoning to connect events to price impact |
+| **Analyst scenarios & expert reasoning** | Structured scenario trees; domain expertise; conditional forecasts | Expensive, infrequent, not automated |
+
+---
+
+### Four Forecasting Method Families
+
+| Method family | What it can do | What it can't do |
+|---|---|---|
+| **Statistical models** (Prophet, ARIMA, ETS) | Transparent; fast; well-calibrated on stable regimes | Blind to context; can't read the news |
+| **ML / multivariate** (XGBoost, LightGBM, Ridge) | Incorporate engineered features: futures spreads, macro indicators | Still needs explicit feature engineering; can't reason |
+| **Time-series foundation models** (Chronos, TimesFM, Moirai) | Pretrained on diverse series; zero-shot or fine-tuned generalization | New category — calibration and regime-break robustness still being evaluated |
+| **LLM Processes + Agentic forecasters** | Retrieve news; reason through scenarios; call code; explain assumptions; update on new context in real time | Black-box risks; prompt sensitivity; require careful evaluation design |
+
+---
+
+### So — Can an Agent Do Better?
+
+The hardest forecasting problems aren't the ones where the past predicts the future well.
+They're the ones where **the world stops looking like it used to**.
+
+In those moments, what matters most is:
+
+1. **Awareness** — knowing that the regime has shifted
+2. **Reasoning** — connecting external signals to a price view
+3. **Uncertainty calibration** — widening the interval appropriately, not confidently predicting the wrong thing
+
+Prophet fails all three. A well-designed agentic forecaster — one that can read the news,
+reason about geopolitical scenarios, and express structured uncertainty — could, in principle, do all three.
+
+**In the next notebook, we put that hypothesis to the test.**
+
+> ➡️ Continue to [`02_intro_agentic_predictor.ipynb`](02_intro_agentic_predictor.ipynb) — introducing the **Agentic Predictor**.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__02_intro_agentic_predictor.ipynb.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__02_intro_agentic_predictor.ipynb.md
new file mode 100644
index 0000000..95dc711
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__02_intro_agentic_predictor.ipynb.md
@@ -0,0 +1,472 @@
+# Source: implementations/energy_oil_forecasting/02_intro_agentic_predictor.ipynb
+
+kind: notebook
+
+## Cell 1 (markdown)
+
+# WTI Crude Oil Price Forecasting — Introducing the Agentic Predictor (Notebook 2 of 7)
+
+This notebook introduces the **progressive capability staircase** for agentic
+forecasting by studying a single, high-stakes prediction origin:
+**March 2, 2026** — the day news of Persian Gulf shipping-lane disruptions
+began reaching energy markets.
+
+> **Prerequisite:** Run [`01_wti_case_study.ipynb`](01_wti_case_study.ipynb) first — it establishes why a price-only baseline fails during regime breaks.
+
+We build four predictors of increasing sophistication, each implementing
+the standard `Predictor` interface so that the outputs are directly
+comparable and can slot into the systematic backtest in Notebook 4.
+
+| Step | Predictor | Capability |
+|------|-----------|------------|
+| 1 | `ProphetPredictor` | Statistical baseline — extrapolates trend and seasonality |
+| 2 | `SampledTrajectoryLLMPredictor` | Direct-prompting LLMP — no tools, reasons from history text |
+| 3 | `AgentPredictor` (news) | News-grounded agent — bounded Google Search, strict temporal cutoff |
+| 4 | `AgentPredictor` (code+news) | Code-executing agent — E2B sandbox code execution + 2 forecasting skills |
+
+## Cell 2 (markdown)
+
+---
+## 1. Setup & Data Registration
+
+## Cell 3 (code)
+
+```python
+import warnings
+
+
+warnings.filterwarnings("ignore")
+
+import pandas as pd
+from energy_oil_forecasting.data import WTI_SERIES_ID, build_wti_service
+
+
+# ── Data service ──────────────────────────────────────────────────────────
+data_service = build_wti_service()
+
+# ── Single forecast origin for this notebook ──────────────────────────────
+AS_OF = pd.Timestamp("2026-03-01")  # context available the day before origin
+ORIGIN = pd.Timestamp("2026-03-02")  # the day we are predicting *from*
+
+ctx = data_service.context(as_of=AS_OF)
+full_df = ctx.get_series(WTI_SERIES_ID)
+
+print(f"Trading days in cache up to {AS_OF.date()}: {len(full_df)}")
+print(f"Last WTI close: ${full_df['value'].iloc[-1]:.2f}/bbl on {str(full_df['timestamp'].iloc[-1])[:10]}")
+```
+
+## Cell 4 (code)
+
+```python
+from aieng.forecasting.evaluation.task import ForecastingTask
+
+
+# The forecasting task mirrors the spec in specs/energy_oil_backtest.yaml
+task = ForecastingTask(
+    task_id="wti_oil_price_forecast",
+    target_series_id=WTI_SERIES_ID,
+    horizons=[5, 10, 21],
+    frequency="B",
+    description="WTI Crude Oil front-month futures — 5, 10, 21 business days ahead.",
+)
+
+print("Task:", task.task_id)
+print("Horizons:", task.horizons)
+print("Origin context as_of:", ctx.as_of)
+```
+
+## Cell 5 (markdown)
+
+---
+## 2. Step 1 — Prophet Baseline
+
+Prophet fits a decomposable time-series model (trend + seasonality) on the
+daily close history. It is deliberately blind to geopolitics — whatever
+is happening in the Persian Gulf, Prophet does not know.
+
+We wrap Prophet in a `Predictor` subclass so it produces standard
+`Prediction` objects with the full 11-quantile grid.
+The same wrapper will be used in Notebook 4's stateless backtest loop.
+
+## Cell 6 (code)
+
+```python
+from energy_oil_forecasting.prophet_baseline import ProphetPredictor
+
+
+prophet = ProphetPredictor()
+```
+
+## Cell 7 (code)
+
+```python
+prophet_preds = prophet.predict(task, ctx)
+
+print(f"Prophet forecast from {AS_OF.date()} (as_of) → {ORIGIN.date()} (origin):\n")
+for h, p in zip(task.horizons, prophet_preds):  # assumes one prediction per horizon, in order
+    fc = p.payload
+    print(
+        f"  h={h:>2}d  "
+        f"point=${fc.point_forecast:.2f}  "
+        f"80%CI=[${fc.quantiles[0.10]:.2f}, ${fc.quantiles[0.90]:.2f}]"
+    )
+```
+
+## Cell 8 (markdown)
+
+---
+## 3. Step 2 — Direct-Prompting LLM Process (LLMP)
+
+`SampledTrajectoryLLMPredictor` sends the price history as a structured JSON payload
+directly to Gemini and asks it to return a calibrated probabilistic forecast.
+There is no search tool and no code execution — the model must reason entirely
+from numerical history.
+
+**Data leakage caveat:** Gemini's training data includes WTI prices through at
+least late 2024, and possibly later. For origins in 2026, the model may have
+implicit knowledge of events near or after the origin. This is a known limitation
+of LLMPs that we cannot fully control. The LLMP is included as a calibration
+reference, not as a clean counterfactual.
+
+## Cell 9 (code)
+
+```python
+from aieng.forecasting.methods import SampledTrajectoryLLMPredictor, SampledTrajectoryLLMPredictorConfig
+
+
+# Models: "gemini-3.1-flash-lite-preview" (lite/default) · "gemini-3.5-flash" (advanced)
+llmp_config = SampledTrajectoryLLMPredictorConfig(
+    model="gemini-3.1-flash-lite-preview",
+    # model="gemini-3.5-flash",  # advanced
+    n_samples=3,
+)
+
+print("SampledTrajectoryLLMPredictorConfig:")
+print(f"  model:    {llmp_config.model}")
+print(f"  n_samples: {llmp_config.n_samples}")
+```
+
+## Cell 10 (code)
+
+```python
+llmp = SampledTrajectoryLLMPredictor(llmp_config)
+llmp_preds = llmp.predict(task, ctx)
+
+print("LLMP forecast (no tools):\n")
+for h, p in zip(task.horizons, llmp_preds):  # assumes one prediction per horizon, in order
+    fc = p.payload
+    print(
+        f"  h={h:>2}d  "
+        f"point=${fc.point_forecast:.2f}  "
+        f"80%CI=[${fc.quantiles[0.10]:.2f}, ${fc.quantiles[0.90]:.2f}]"
+    )
+```
+
+## Cell 11 (markdown)
+
+---
+## 4. Step 3 — News-Grounded Agent
+
+We import `build_wti_news_config` from the `analyst_agent` module and wrap it
+in an `AgentPredictor`. The config wires a `ContextRetrievalConfig` sub-agent
+that uses Google Search with strict `cutoff_date` enforcement.
+
+The key design constraints — visible in the config below — are:
+- The root agent's instruction contains three sections: `## Role`,
+  `## Forecasting contract`, and `## Analysis discipline`.
+- The context sub-agent's instruction reads `cutoff_date` and `query` from
+  the incoming JSON payload (produced by `ContextRetrievalRequest`).
+- The prompt builder sends a structured JSON payload, not a free-form string,
+  including `standard_quantiles` explicitly.
+
+## Cell 12 (code)
+
+```python
+from energy_oil_forecasting.analyst_agent import (
+    WtiPriceForecastPromptBuilder,
+    build_wti_agent_predictor,
+    build_wti_news_config,
+)
+
+
+# Models: "gemini-3.1-flash-lite-preview" (lite/default) · "gemini-3.5-flash" (advanced)
+news_config = build_wti_news_config(
+    model="gemini-3.1-flash-lite-preview"
+    # model="gemini-3.5-flash"  # advanced
+)
+
+print("=== Root agent instruction (first 1000 chars) ===")
+print(news_config.instruction[:1000])
+print("\n=== Context retrieval instruction (first 500 chars) ===")
+print(news_config.context_retrieval.instruction[:500])
+print("\nContext retrieval enabled:", news_config.context_retrieval.enabled)
+print("Context retrieval model:", news_config.context_retrieval.search_model)
+```
+
+## Cell 13 (code)
+
+```python
+# Inspect the prompt payload that will be sent to the agent
+prompt_builder = WtiPriceForecastPromptBuilder()
+sample_prompt = prompt_builder(task=task, context=ctx)
+
+print("=== Prompt payload sent to agent (first 800 chars) ===")
+print(sample_prompt[:800])
+print("\n...[history_csv truncated]...")
+```
+
+## Cell 14 (code)
+
+```python
+news_predictor = build_wti_agent_predictor(news_config)
+
+print(f"Predictor ID: {news_predictor.predictor_id}")
+print("Running news-grounded agent... (this calls Google Search)")
+
+news_preds = news_predictor.predict(task, ctx)
+
+print("\nNews-grounded agent forecast:\n")
+for h, p in zip(task.horizons, news_preds):  # assumes one prediction per horizon, in order
+    fc = p.payload
+    print(
+        f"  h={h:>2}d  "
+        f"point=${fc.point_forecast:.2f}  "
+        f"80%CI=[${fc.quantiles[0.10]:.2f}, ${fc.quantiles[0.90]:.2f}]"
+    )
+if news_preds and news_preds[0].metadata.get("rationale"):
+    print("\nAgent rationale:", news_preds[0].metadata["rationale"][:400])
+```
+
+## Cell 15 (markdown)
+
+---
+## 5. Step 4 — Code-Executing Agent (E2B)
+
+`build_wti_code_exec_config()` adds two capabilities on top of the news config:
+
+1. **E2B sandbox code execution** — the agent can write and run Python
+   (pandas, numpy, scikit-learn, matplotlib) in a secure E2B container, see
+   the output, and iterate before producing its final structured forecast.
+   All LLM calls route through the Vector proxy (same as every other predictor).
+
+2. **Two forecasting skills** — ADK `SkillToolset` provides reference data
+   and code patterns on demand:
+   - `statistical-analysis` — diagnostic patterns for the payload data (vol
+     regime, anomaly detection, adaptive trend-window selection)
+   - `trend-projection` — linear trend fit, CI calibration, and plausibility
+     guard using the window determined by statistical-analysis
+
+The skills follow the design rule from `docs/adk-skills-guide.md`: each
+skill directory contains at least one real file in `references/` and the
+instruction explicitly tells the agent **not** to call `run_skill_script`.
+
+### Design constraints and skill philosophy
+
+**Context is your data store.** All data the agent can use in code must
+arrive via the JSON payload in the user message. The payload fields are:
+
+| Field | Contents |
+|---|---|
+| `target_history_csv` | Mixed-frequency CSV string — recent 6 months daily, older history as weekly averages |
+| `target_summary` | `last_close_usd_bbl`, `last_date`, `52w_high`, `52w_low`, `n_trading_days` |
+| `as_of` | Forecast origin date |
+| `horizons` | Integer list of horizon steps (business days) |
+| `standard_quantiles` | Exact quantile grid the agent must produce |
+
+There are no disk files and no database connections. The agent parses
+`target_history_csv` using `io.StringIO` (not a file path) and emits
+intermediate results via `print()` so they appear in the conversation.
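+
+The parsing pattern described above can be sketched as follows. This is a
+minimal illustration, not the agent's actual code: the payload here is
+hand-written, and the CSV column names `date` and `price` are assumptions
+rather than the real schema of `target_history_csv`.
+
+```python
+import io
+import json
+
+import pandas as pd
+
+# Hypothetical payload: the column names "date" and "price" are assumptions,
+# not the actual schema of target_history_csv.
+payload = json.dumps(
+    {
+        "target_history_csv": "date,price\n2026-02-25,71.10\n2026-02-26,72.40\n2026-02-27,73.05\n",
+        "as_of": "2026-03-01",
+        "horizons": [5, 10, 21],
+    }
+)
+
+data = json.loads(payload)
+
+# Parse the CSV string in memory; no file path is involved.
+history = pd.read_csv(io.StringIO(data["target_history_csv"]), parse_dates=["date"])
+history = history.set_index("date").sort_index()
+
+# Emit a compact intermediate result with print() so it shows up in the conversation.
+summary = {
+    "n_rows": int(len(history)),
+    "last_close": float(history["price"].iloc[-1]),
+    "mean_daily_change": round(float(history["price"].diff().mean()), 4),
+}
+print(json.dumps(summary))
+```
+
+Printing a small JSON summary rather than a full DataFrame keeps the
+intermediate output cheap in tokens and easy for the agent to read back.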
+
+**Skill philosophy.** The skills don't teach the agent how to use sklearn —
+it already knows that. They signal *which analyses are worth running given the
+payload you have*, and show the best way to get data in and structured results
+out of the code execution environment. The patterns are illustrated with WTI
+but the approach transfers: classify your vol regime, detect anomalies, adapt
+your trend window accordingly.
+
+**TODO (futures curve).** The natural next payload extension for WTI is a
+futures curve snapshot — spot vs M1, M3, M6 spread (a few numbers, low token
+cost) — which would unlock contango/backwardation regime detection in code.
+See `WtiPriceForecastPromptBuilder` in `analyst_agent/agent.py` for where this
+field would be added, and `statistical-analysis/references/analysis-patterns.md`
+for where the corresponding pattern would live.
+
+## Cell 16 (code)
+
+```python
+from energy_oil_forecasting.analyst_agent import build_wti_code_exec_config
+
+
+# Models: "gemini-3.1-flash-lite-preview" (lite/default) · "gemini-3.5-flash" (advanced)
+code_config = build_wti_code_exec_config(
+    model="gemini-3.1-flash-lite-preview"
+    # model="gemini-3.5-flash"  # advanced
+)
+
+print("=== Code-exec agent config summary ===")
+print(f"Code execution enabled:  {code_config.code_execution.enabled}")
+print(f"Sandbox timeout (s):     {code_config.code_execution.sandbox_timeout_seconds}")
+print(f"Skills directories ({len(code_config.skills_dirs)}):")
+for sd in code_config.skills_dirs:
+    print(f"  {sd.name}")
+print("\n=== Skills supplement in instruction (last 600 chars of instruction) ===")
+print(code_config.instruction[-600:])
+```
+
+## Cell 17 (code)
+
+```python
+code_predictor = build_wti_agent_predictor(code_config)
+
+print(f"Predictor ID: {code_predictor.predictor_id}")
+print("Running code-executing agent... (calls Google Search + executes Python code)")
+
+code_preds = code_predictor.predict(task, ctx)
+
+print("\nCode-executing agent forecast:\n")
+for h, p in zip(task.horizons, code_preds):
+    fc = p.payload
+    print(
+        f"  h={h:>2}d  "
+        f"point=${fc.point_forecast:.2f}  "
+        f"80%CI=[${fc.quantiles[0.10]:.2f}, ${fc.quantiles[0.90]:.2f}]"
+    )
+if code_preds and code_preds[0].metadata.get("rationale"):
+    print("\nAgent rationale:", code_preds[0].metadata["rationale"][:400])
+```
+
+## Cell 18 (markdown)
+
+---
+## 6. Side-by-Side Comparison
+
+All four predictors return standard `Prediction` objects, so we can
+compare them in a uniform table. We also show the actual WTI prices
+at each horizon to contextualise the forecasts.
+
+## Cell 19 (code)
+
+```python
+import matplotlib.pyplot as plt
+import numpy as np
+import pandas as pd
+
+
+# Collect actual prices for horizon dates
+future_ctx = data_service.context(as_of=ORIGIN + pd.offsets.BDay(25))
+future_df = future_ctx.get_series(WTI_SERIES_ID)
+
+
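+def get_actual(h: int) -> float | None:
+    """Return the first observed price on or after the h-business-day target date."""
+    target = ORIGIN + pd.offsets.BDay(h)
+    # Compare on a parsed copy instead of adding a column to future_df on every call.
+    ts = pd.to_datetime(future_df["timestamp"])
+    row = future_df[ts >= target]
+    if row.empty:
+        return None
+    return float(row.iloc[0]["value"])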
+
+
+# Build comparison table
+rows = []
+predictor_sets = [
+    ("Prophet", prophet_preds),
+    ("LLMP", llmp_preds),
+    ("Agent (News)", news_preds),
+    ("Agent (Code+News)", code_preds),
+]
+
+for h_idx, h in enumerate(task.horizons):
+    actual = get_actual(h)
+    for name, preds in predictor_sets:
+        if preds and h_idx < len(preds):
+            fc = preds[h_idx].payload
+            rows.append(
+                {
+                    "Horizon": f"h={h}d",
+                    "Predictor": name,
+                    "Point ($)": f"{fc.point_forecast:.2f}",
+                    "p10 ($)": f"{fc.quantiles[0.10]:.2f}",
+                    "p90 ($)": f"{fc.quantiles[0.90]:.2f}",
+                    "Actual ($)": f"{actual:.2f}" if actual is not None else "N/A",
+                }
+            )
+
+comparison = pd.DataFrame(rows).set_index(["Horizon", "Predictor"])
+print(comparison.to_string())
+```
+
+## Cell 20 (code)
+
+```python
+# Point forecast comparison chart (h=5 and h=21)
+
+fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=False)
+
+for ax, h_idx, h in [(axes[0], 0, 5), (axes[1], 2, 21)]:
+    names, points, lo, hi = [], [], [], []
+    actual_val = get_actual(h)
+
+    for name, preds in predictor_sets:
+        if preds and h_idx < len(preds):
+            fc = preds[h_idx].payload
+            names.append(name)
+            points.append(fc.point_forecast)
+            lo.append(fc.quantiles[0.10])
+            hi.append(fc.quantiles[0.90])
+
+    x = range(len(names))
+    ax.errorbar(
+        x,
+        points,
+        yerr=[np.array(points) - np.array(lo), np.array(hi) - np.array(points)],
+        fmt="o",
+        capsize=5,
+        linewidth=2,
+        label="Point + 80% CI",
+    )
+    if actual_val is not None:
+        ax.axhline(actual_val, color="red", linestyle="--", label=f"Actual ${actual_val:.2f}")
+    ax.set_xticks(list(x))
+    ax.set_xticklabels(names, rotation=20, ha="right")
+    ax.set_title(f"Horizon h={h}d  (origin {ORIGIN.date()})")
+    ax.set_ylabel("USD / bbl")
+    ax.legend(fontsize=8)
+    ax.grid(True, alpha=0.3)
+
+plt.suptitle("WTI Forecast Comparison — March 2, 2026 Origin", fontsize=12)
+plt.tight_layout()
+plt.show()
+```
+
+## Cell 21 (markdown)
+
+---
+## Key Takeaways
+
+1. **All four predictors share the same `Predictor` interface.** The same
+   `predict(task, context)` call works whether the model is a statistical
+   trend-fitter or a tool-using agent. This is what makes systematic
+   backtesting in Notebook 4 possible.
+
+2. **`AgentConfig` factories encapsulate capability.** By importing
+   `build_wti_news_config()` and `build_wti_code_exec_config()` from
+   `analyst_agent/`, the notebook stays clean — configs are reproducible and
+   importable from any script or notebook.
+
+3. **Skills guide code execution, not sklearn usage.** The two ADK skills
+   (`statistical-analysis`, `trend-projection`) provide payload-aware code
+   patterns and reference benchmarks the agent loads on demand. They teach
+   effective use of code exec within the Gemini context-as-data-store
+   constraints — not Python basics. Following the design rule in
+   `docs/adk-skills-guide.md`, no scripts are present and the instruction
+   explicitly forbids `run_skill_script`.
+
+4. **Temporal cutoffs prevent data leakage.** The `ContextRetrievalConfig`
+   sub-agent enforces a `cutoff_date` on every search call, allowing the same
+   agent to be used safely in historical backtests.
+
+→ **Notebooks 4–6** build on these four predictors: Notebook 4 runs a systematic
+2025 backtest across all stateless methods; Notebooks 5–6 introduce the adaptive
+agent and compare it against the stateless top-performers on held-out 2025+ data.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__03_one_agent_three_tasks.ipynb.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__03_one_agent_three_tasks.ipynb.md
new file mode 100644
index 0000000..bfd18f5
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__03_one_agent_three_tasks.ipynb.md
@@ -0,0 +1,389 @@
+# Source: implementations/energy_oil_forecasting/03_one_agent_three_tasks.ipynb
+
+kind: notebook
+
+## Cell 1 (markdown)
+
+# WTI Oil Price Forecasting — One Agent, Three Tasks
+
+> **Part 3 of 7.** This notebook builds on the agentic predictor introduced in
+> [`02_intro_agentic_predictor.ipynb`](02_intro_agentic_predictor.ipynb).
+
+A single Analyst Agent — backed by bounded Google Search — answers three tasks
+using **one system prompt** and **task-specific user payloads**:
+
+| Stream | Task | Output |
+|--------|------|--------|
+| 1 | Trajectory | 5/10/21-day price forecasts |
+| 2 | Binary shock | P(WTI +$5 in 5 days) |
+| 3 | Scenario analysis | Top 3 expert scenarios for 60 days |
+
+## Cell 2 (code)
+
+```python
+import json
+import warnings
+
+import numpy as np
+import pandas as pd
+from IPython.display import Markdown, display  # noqa: A004
+
+
+warnings.filterwarnings("ignore")
+
+# ── Model selection ───────────────────────────────────────────────────────────
+# Two project models: "gemini-3.1-flash-lite-preview" (lite/default) and
+# "gemini-3.5-flash" (advanced). Lite is the default here; switch to advanced
+# for higher-quality runs.
+AGENT_MODEL = "gemini-3.1-flash-lite-preview"
+
+# ── Cache control ─────────────────────────────────────────────────────────────
+# Set to False to force a full end-to-end agent run (ignores all cached results).
+USE_CACHE = False
+
+from aieng.forecasting.evaluation.task import ForecastingTask
+from energy_oil_forecasting.analysis import compute_brier_score, trajectory_mae_table
+from energy_oil_forecasting.data import WTI_SERIES_ID, build_wti_service, naive_utc_now
+from energy_oil_forecasting.paths import (
+    PROPHET_SHOCK_TRAJ_CACHE,
+    PROPHET_TRAJ_CACHE,
+    SCENARIO_CACHE,
+    SCENARIO_ORIGIN,
+    SHOCK_ANALYST_CACHE,
+    SHOCK_HORIZON,
+    SHOCK_ORIGINS,
+    SHOCK_THRESHOLD,
+    TRAJ_AGENT_CACHE,
+    TRAJECTORY_ORIGINS,
+)
+from energy_oil_forecasting.prophet_baseline import (
+    check_shock_outcome,
+    load_prophet_trajectories,
+    prophet_prob_shock,
+    wti_series_to_price_df,
+)
+from energy_oil_forecasting.tasks import TASK_SPECS, build_wti_news_predictor
+from energy_oil_forecasting.viz import (
+    conf_bar,
+    make_shock_comparison_chart,
+    make_trajectory_fan_chart,
+    prob_bar,
+    verdict_label,
+)
+
+
+data_service = build_wti_service()
+ctx = data_service.context(as_of=naive_utc_now())
+price_df = wti_series_to_price_df(ctx.get_series(WTI_SERIES_ID))
+
+prophet_traj_df = load_prophet_trajectories(price_df, TRAJECTORY_ORIGINS, PROPHET_TRAJ_CACHE)
+prophet_shock_df = load_prophet_trajectories(price_df, SHOCK_ORIGINS, PROPHET_SHOCK_TRAJ_CACHE)
+print(f"Price history through {price_df.index[-1].date()}")
+```
+
+## Cell 3 (markdown)
+
+---
+## Stream 1 — Trajectory Forecast
+
+Compare Prophet fan charts to the news-grounded agent at three origins.
+
+## Cell 4 (code)
+
+```python
+trajectory_task = ForecastingTask(
+    task_id="wti_trajectory_demo",
+    target_series_id=WTI_SERIES_ID,
+    horizons=[5, 10, 21],
+    frequency="B",
+    description="Trajectory demo for NB3",
+)
+
+traj_predictor = build_wti_news_predictor("trajectory", model=AGENT_MODEL)
+
+if USE_CACHE and TRAJ_AGENT_CACHE.exists():
+    with open(TRAJ_AGENT_CACHE) as f:
+        traj_agent_results = json.load(f)
+    print(f"Loaded {len(traj_agent_results)} cached trajectory agent runs.")
+else:
+    traj_agent_results = []
+    for origin in TRAJECTORY_ORIGINS:
+        as_of = origin - pd.Timedelta(days=1)
+        origin_ctx = data_service.context(as_of=as_of)
+        preds = traj_predictor.predict(trajectory_task, origin_ctx)
+        traj_agent_results.append(
+            {
+                "origin": str(origin.date()),
+                "predictions": [p.model_dump(mode="json") for p in preds],
+            }
+        )
+    with open(TRAJ_AGENT_CACHE, "w") as f:
+        json.dump(traj_agent_results, f, indent=2)
+    print(f"Saved {len(traj_agent_results)} agent trajectory runs.")
+
+# Summary: agent point forecasts at each origin
+print("\nAgent trajectory summary:")
+for r in traj_agent_results:
+    preds = r["predictions"]
+    pts = [f"h{h}=${p['payload']['point_forecast']:.1f}" for h, p in zip(trajectory_task.horizons, preds)]
+    origin_price_rows = price_df[price_df.index >= pd.Timestamp(r["origin"])]
+    origin_price = f"WTI=${origin_price_rows.iloc[0]['price']:.2f}" if not origin_price_rows.empty else ""
+    print(f"  {r['origin']}  {origin_price}  {' | '.join(pts)}")
+```
+
+## Cell 5 (code)
+
+```python
+# ── I/O inspection: 2026-03-02 — conflict onset, most informative ────────────
+INSPECT_ORIGIN = "2026-03-02"
+inspect_rec = next((r for r in traj_agent_results if r["origin"] == INSPECT_ORIGIN), None)
+
+if inspect_rec:
+    origin_ts = pd.Timestamp(INSPECT_ORIGIN)
+    bday_dates = pd.bdate_range(start=origin_ts + pd.offsets.BDay(1), periods=21)
+    origin_price_row = price_df[price_df.index >= origin_ts]
+    origin_price = float(origin_price_row.iloc[0]["price"]) if not origin_price_row.empty else float("nan")
+
+    preds = inspect_rec["predictions"]
+    rationale = preds[0].get("metadata", {}).get("rationale", "") if preds else ""
+
+    table_rows = "| Horizon | Agent ($) | 80% CI | Actual ($) | Agent err | Prophet err |\n|---|---|---|---|---|---|\n"
+    for i, h in enumerate([5, 10, 21]):
+        actual_rows = price_df[price_df.index >= bday_dates[h - 1]]
+        actual = float(actual_rows.iloc[0]["price"]) if not actual_rows.empty else float("nan")
+        pt = preds[i]["payload"]["point_forecast"]
+        q10_val = next(
+            (v for k, v in preds[i]["payload"]["quantiles"].items() if abs(float(k) - 0.1) < 1e-6), float("nan")
+        )
+        q90_val = next(
+            (v for k, v in preds[i]["payload"]["quantiles"].items() if abs(float(k) - 0.9) < 1e-6), float("nan")
+        )
+        p_row = prophet_traj_df[(prophet_traj_df["origin"] == origin_ts) & (prophet_traj_df["horizon"] == h)]
+        p_yhat = float(p_row.iloc[0]["yhat"]) if not p_row.empty else float("nan")
+        table_rows += (
+            f"| {h} bdays | **${pt:.1f}** | [{q10_val:.1f} – {q90_val:.1f}] "
+            f"| ${actual:.1f} | {pt - actual:+.1f} | {p_yhat - actual:+.1f} |\n"
+        )
+
+    display(
+        Markdown(
+            f"### Stream 1 — I/O Inspection: {INSPECT_ORIGIN}  (WTI ${origin_price:.2f}/bbl)\n\n"
+            "Agent and Prophet point forecasts vs realised prices at each horizon.\n\n"
+            + table_rows
+            + (f"\n> **Agent rationale:** {rationale}" if rationale else "")
+        )
+    )
+```
+
+## Cell 6 (code)
+
+```python
+# ── Trajectory fan chart: Prophet fan vs agent error bars at 3 origins ───────
+fig = make_trajectory_fan_chart(traj_agent_results, prophet_traj_df, price_df, TRAJECTORY_ORIGINS)
+fig.show()
+
+# ── MAE evaluation table ──────────────────────────────────────────────────────
+mae_df = trajectory_mae_table(traj_agent_results, prophet_traj_df, price_df)
+if not mae_df.empty:
+    display(mae_df.drop(columns=["Prophet MAE", "Agent MAE"]))
+    mean_mae = mae_df[["Prophet MAE", "Agent MAE"]].mean()
+    print(f"\nMean MAE  Prophet: ${mean_mae['Prophet MAE']:.2f}  Agent: ${mean_mae['Agent MAE']:.2f}")
+```
+
+## Cell 7 (markdown)
+
+---
+## Stream 2 — Binary Shock Prediction
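+
+The cells below score the shock forecasts with a Brier score. As a reference
+for reading those numbers, here is a minimal sketch of the standard definition
+(mean squared error between forecast probabilities and 0/1 outcomes). The
+helper `brier_score` and the sample numbers are illustrative only; the
+project's `compute_brier_score` is not shown here and may differ in detail.
+
+```python
+def brier_score(probabilities: list[float], outcomes: list[int]) -> float:
+    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
+    if len(probabilities) != len(outcomes):
+        raise ValueError("probabilities and outcomes must have the same length")
+    if not probabilities:
+        raise ValueError("at least one forecast is required")
+    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(probabilities)
+
+
+# Illustrative numbers only (not results from this notebook).
+probs = [0.8, 0.1, 0.5, 0.3]
+hits = [1, 0, 1, 0]
+
+score = brier_score(probs, hits)
+reference = brier_score([0.5] * len(hits), hits)
+
+print(f"Brier score:          {score:.4f}")
+# A constant 50% forecast scores 0.25 regardless of the outcomes.
+print(f"Always-50% reference: {reference:.4f}")
+```
+
+Lower is better: 0 is a perfect forecast, and a score above 0.25 means the
+forecaster did worse than always answering 50%.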
+
+## Cell 8 (code)
+
+```python
+shock_task = ForecastingTask(
+    task_id="wti_upshock_demo",
+    target_series_id=WTI_SERIES_ID,
+    horizons=[SHOCK_HORIZON],
+    frequency="B",
+    description="Binary upshock demo",
+)
+
+shock_predictor = build_wti_news_predictor("shock", model=AGENT_MODEL)
+
+if USE_CACHE and SHOCK_ANALYST_CACHE.exists():
+    with open(SHOCK_ANALYST_CACHE) as f:
+        shock_results = json.load(f)
+    print(f"Loaded {len(shock_results)} cached shock forecasts.")
+else:
+    shock_results = []
+    for origin in SHOCK_ORIGINS:
+        as_of = origin - pd.Timedelta(days=1)
+        origin_ctx = data_service.context(as_of=as_of)
+        preds = shock_predictor.predict(shock_task, origin_ctx)
+        outcome, delta = check_shock_outcome(price_df, origin, SHOCK_THRESHOLD, SHOCK_HORIZON)
+        shock_results.append(
+            {
+                "origin": str(origin.date()),
+                "probability": preds[0].payload.probability,
+                "outcome": outcome,
+                "delta": delta,
+                "metadata": preds[0].metadata,
+            }
+        )
+    with open(SHOCK_ANALYST_CACHE, "w") as f:
+        json.dump(shock_results, f, indent=2)
+
+agent_probs = [r["probability"] for r in shock_results]
+outcomes = [r["outcome"] for r in shock_results]
+print(f"Agent Brier score: {compute_brier_score(agent_probs, outcomes):.4f}")
+print(f"Task spec preview:\n{TASK_SPECS['shock'][:200]}...")
+```
+
+## Cell 9 (code)
+
+```python
+# ── Per-origin forecast cards ─────────────────────────────────────────────────
+for r in shock_results:
+    origin = pd.Timestamp(r["origin"])
+    label = f"{origin:%b} {origin.day}, {origin:%Y}"  # avoids the platform-specific "%-d" directive
+    origin_price_row = price_df[price_df.index >= origin]
+    origin_price = float(origin_price_row.iloc[0]["price"]) if not origin_price_row.empty else float("nan")
+    a_prob = float(r["probability"])
+    outcome = int(r["outcome"])
+    delta = float(r["delta"])
+    brier = (a_prob - outcome) ** 2
+    meta = r.get("metadata", {})
+    reasoning = meta.get("rationale", "—")
+    key_signals = meta.get("key_signals", [])
+    confidence = meta.get("confidence", "?")
+    outcome_badge = "**SHOCK**" if outcome else "No shock"
+
+    display(
+        Markdown(
+            f"---\n"
+            f"### {label} — WTI ${origin_price:.2f}/bbl\n\n"
+            f"| | |\n|---|---|\n"
+            f"| **Prediction** | P(up > +${SHOCK_THRESHOLD:.0f}) = **{a_prob:.0%}**  `{prob_bar(a_prob)}` |\n"
+            f"| **Confidence** | {confidence.title() if isinstance(confidence, str) else confidence}  {conf_bar(str(confidence))} |\n"
+            f"| **Rationale** | {reasoning} |\n"
+            f"| **Key signals** | {' · '.join(key_signals) if key_signals else '—'} |\n"
+            f"| **Actual outcome** | {outcome_badge} — price moved **{delta:+.2f}/bbl** |\n"
+            f"| **Verdict** | {verdict_label(a_prob, outcome, delta, SHOCK_THRESHOLD)} |\n"
+            f"| **Brier score** | {brier:.3f} {'🟢' if brier < 0.10 else '🟡' if brier < 0.25 else '🔴'} |\n"
+        )
+    )
+```
+
+## Cell 10 (code)
+
+```python
+# ── Prophet probabilities for the shock origins ───────────────────────────────
+prophet_shock_probs = []
+for r in shock_results:
+    origin = pd.Timestamp(r["origin"])
+    origin_price_row = price_df[price_df.index >= origin]
+    origin_price = float(origin_price_row.iloc[0]["price"]) if not origin_price_row.empty else float("nan")
+    p_sub = prophet_shock_df[prophet_shock_df["origin"] == origin]
+    prophet_shock_probs.append(prophet_prob_shock(p_sub, origin_price, SHOCK_THRESHOLD, SHOCK_HORIZON))
+
+# ── Comparison chart: P(shock) over time + cumulative Brier ──────────────────
+fig = make_shock_comparison_chart(shock_results, prophet_shock_probs, shock_threshold=SHOCK_THRESHOLD)
+fig.show()
+
+# ── Brier score summary ───────────────────────────────────────────────────────
+agent_probs = [float(r["probability"]) for r in shock_results]
+outcomes = [int(r["outcome"]) for r in shock_results]
+agent_brier = compute_brier_score(agent_probs, outcomes)
+valid_prophet = [(p, o) for p, o in zip(prophet_shock_probs, outcomes) if not np.isnan(p)]
+prophet_brier = compute_brier_score([p for p, _ in valid_prophet], [o for _, o in valid_prophet])
+brier_df = pd.DataFrame(
+    {"Mean Brier score": [f"{agent_brier:.4f}", f"{prophet_brier:.4f}"]},
+    index=pd.Index(["Analyst Agent", "Prophet"], name="Method"),
+)
+print("Mean Brier score (lower = better; 0.25 = score of a constant 50% forecast):")
+display(brier_df)
+```
+
+## Cell 11 (markdown)
+
+---
+## Stream 3 — Scenario Analysis
+
+## Cell 12 (code)
+
+```python
+scenario_task = ForecastingTask(
+    task_id="wti_scenario_demo",
+    target_series_id=WTI_SERIES_ID,
+    horizons=[21],
+    frequency="B",
+    description="Scenario analysis demo",
+)
+
+scenario_predictor = build_wti_news_predictor("scenario", model=AGENT_MODEL)
+
+if USE_CACHE and SCENARIO_CACHE.exists():
+    with open(SCENARIO_CACHE) as f:
+        scenario_payload = json.load(f)
+    print("Loaded cached scenario analysis.")
+else:
+    as_of = SCENARIO_ORIGIN - pd.Timedelta(days=1)
+    origin_ctx = data_service.context(as_of=as_of)
+    preds = scenario_predictor.predict(scenario_task, origin_ctx)
+    scenario_payload = preds[0].metadata
+    with open(SCENARIO_CACHE, "w") as f:
+        json.dump(scenario_payload, f, indent=2)
+
+# ── Rich scenario cards ───────────────────────────────────────────────────────
+scenario_origin_price_row = price_df[price_df.index >= SCENARIO_ORIGIN]
+scenario_origin_price = (
+    float(scenario_origin_price_row.iloc[0]["price"]) if not scenario_origin_price_row.empty else float("nan")
+)
+
+display(
+    Markdown(
+        f"#### Stream 3 — Scenario Analysis  "
+        f"*(origin: {SCENARIO_ORIGIN.date()}, WTI ${scenario_origin_price:.2f}/bbl)*\n\n"
+        f"Base case: **{scenario_payload.get('base_case', '?')}**"
+    )
+)
+
+base_case = scenario_payload.get("base_case", "")
+for s in scenario_payload.get("scenarios", []):
+    name = s.get("name", "?")
+    desc = s.get("description", "")
+    prob = float(s.get("probability", 0))
+    rng = s.get("wti_range_60d", [float("nan"), float("nan")])
+    lo_r, hi_r = float(rng[0]), float(rng[1])
+    pe = float(s.get("point_estimate_60d", float("nan")))
+    drivers = s.get("key_drivers", [])
+    base_marker = "  ★ **base case**" if name == base_case else ""
+
+    display(
+        Markdown(
+            f"---\n"
+            f"**{name}**{base_marker}\n\n"
+            f"{desc}\n\n"
+            f"| | |\n|---|---|\n"
+            f"| Probability | **{prob:.0%}**  `{prob_bar(prob)}` |\n"
+            f"| WTI range (60 days) | ${lo_r:.0f} – ${hi_r:.0f} /bbl |\n"
+            f"| Point estimate | **${pe:.0f} /bbl** |\n"
+            f"| Key drivers | {' · '.join(drivers) if drivers else '—'} |\n"
+        )
+    )
+
+overall = scenario_payload.get("rationale", "")
+if overall:
+    display(Markdown(f"---\n\n> **Overall reasoning:** {overall}"))
+```
+
+## Cell 13 (markdown)
+
+---
+
+## Summary
+
+One agent identity (`build_wti_multitask_news_config` / `build_wti_news_config`) with
+three task-specific prompt builders and output schemas demonstrates the bootcamp
+pattern for multi-task agentic forecasting. Continue to
+[`04_systematic_backtest_eval.ipynb`](04_systematic_backtest_eval.ipynb) for the
+stateless backtest harness, then Notebooks 5–6 for the adaptive agent training and
+protected evaluation.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__04_systematic_backtest_eval.ipynb.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__04_systematic_backtest_eval.ipynb.md
new file mode 100644
index 0000000..424fc55
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__04_systematic_backtest_eval.ipynb.md
@@ -0,0 +1,367 @@
+# Source: implementations/energy_oil_forecasting/04_systematic_backtest_eval.ipynb
+
+kind: notebook
+
+## Cell 1 (markdown)
+
+# WTI Crude Oil Price Forecasting — Stateless Methods: Systematic Backtest (Notebook 4 of 7)
+
+This notebook simulates a rigorous production forecasting workflow:
+
+1. Run a **rolling weekly backtest across 2025** using
+   `energy_oil_backtest.yaml` for all candidate predictors.
+2. Compute metrics — **CRPS** for 5/10/21-day trajectories.
+3. Select the **top contender configurations** based solely on 2025
+   historical performance (no peeking at 2026).
+4. Let the contenders compete in the **2026 Protected Arena**
+   (`energy_oil_eval.yaml`) during the geopolitical price shock —
+   measuring adaptive real-time responsiveness and calibration.
+
+All predictors use the same `Predictor` interface introduced in Notebooks 1–2.
+Agent configs are imported from `energy_oil_forecasting.analyst_agent`.
+
+## Cell 2 (markdown)
+
+---
+## 1. Setup, Data Registration & Spec Loading
+
+## Cell 3 (code)
+
+```python
+import warnings
+from pathlib import Path
+
+import energy_oil_forecasting
+import pandas as pd
+import yaml
+from aieng.forecasting.evaluation import (
+    MultiTargetBacktestSpec,
+    cached_multi_backtest,
+    describe_spec,
+)
+from energy_oil_forecasting.data import build_wti_service
+
+
+warnings.filterwarnings("ignore")
+
+# ── Mode ──────────────────────────────────────────────────────────────────────
+# Set SMOKE_TEST = True to run a 2-origin, 1-sample version of the notebook
+# for fast local development and end-to-end CI testing. The full specs run
+# 51 backtest + 8 eval origins; smoke runs 2 + 2.
+SMOKE_TEST = True
+
+# ── Model selection ───────────────────────────────────────────────────────────
+# Two project models: "gemini-3.1-flash-lite-preview" (lite/default) and
+# "gemini-3.5-flash" (advanced). Change these two lines to swap models for the
+# whole notebook (bare proxy names — no "gemini/" prefix).
+AGENT_MODEL = "gemini-3.1-flash-lite-preview"
+LLMP_MODEL = "gemini-3.1-flash-lite-preview"
+
+# ── Derived settings (do not edit below) ─────────────────────────────────────
+N_SAMPLES = 1 if SMOKE_TEST else 3  # trajectories per LLMP call
+
+data_service = build_wti_service()
+
+spec_dir = Path(energy_oil_forecasting.__file__).parent / "specs"
+if SMOKE_TEST:
+    backtest_file, eval_file = "energy_oil_smoke.yaml", "energy_oil_eval_smoke.yaml"
+else:
+    backtest_file, eval_file = "energy_oil_backtest.yaml", "energy_oil_eval.yaml"
+
+with open(spec_dir / backtest_file) as f:
+    backtest_spec = MultiTargetBacktestSpec.model_validate(yaml.safe_load(f))
+with open(spec_dir / eval_file) as f:
+    eval_spec = MultiTargetBacktestSpec.model_validate(yaml.safe_load(f))
+
+print(
+    f"{'⚡ SMOKE MODE' if SMOKE_TEST else '📊 FULL MODE'} — AGENT_MODEL={AGENT_MODEL!r}  LLMP_MODEL={LLMP_MODEL!r}  N_SAMPLES={N_SAMPLES}"
+)
+print()
+print("━" * 72)
+print("LOADED SPECIFICATIONS:")
+print("━" * 72)
+print(describe_spec(backtest_spec, data_service))
+print(describe_spec(eval_spec, data_service))
+```
+
+## Cell 4 (markdown)
+
+---
+## 2. Statistical Baseline
+
+This reference implementation uses **AutoARIMA** as its chosen statistical
+method.  The purpose of this notebook is to characterise AutoARIMA's
+performance thoroughly — understanding where it succeeds, where it fails, and
+in which regimes — so that the adaptive agent in Notebook 5 has a concrete
+foundation to learn from.
+
+The `Naive (Last Value)` predictor provides the floor: AutoARIMA should beat
+it, and the margin tells us how much structure AutoARIMA extracts from the data.
+
+> Other statistical and LLM-based methods are explored in separate reference
+> implementations.  You can uncomment the commented-out predictors below to
+> compare, but they are not the focus of this experiment.
+
+| Predictor | Role |
+|---|---|
+| `LastValuePredictor` | Lower bound — carry-forward baseline |
+| `DartsAutoARIMAPredictor` | **Primary statistical method** — the anchor for adaptive agent training |
+
+## Cell 5 (code)
+
+```python
+from aieng.forecasting.methods import (
+    LastValuePredictor,
+    QuantileGridLLMPredictor,  # noqa: F401
+    QuantileGridLLMPredictorConfig,  # noqa: F401
+    SampledTrajectoryLLMPredictor,  # noqa: F401
+    SampledTrajectoryLLMPredictorConfig,  # noqa: F401
+)
+from aieng.forecasting.methods.numerical.darts_arima import DartsAutoARIMAPredictor
+from energy_oil_forecasting.analyst_agent import build_wti_agent_predictor, build_wti_news_config  # noqa: F401
+from energy_oil_forecasting.prophet_baseline import ProphetPredictor  # noqa: F401
+
+
+# ── Predictors ────────────────────────────────────────────────────────────────
+# AutoARIMA is the primary method; Naive is the lower-bound baseline.
+# Both are evaluated in every section — no contender selection needed.
+# NOTE: AutoARIMA re-fits at every origin (slow on first run; cached after).
+PREDICTORS = {
+    "Naive (Last Value)": LastValuePredictor(),
+    "AutoARIMA": DartsAutoARIMAPredictor(),
+    # ── Optional comparisons (not the focus of this experiment) ──────────────
+    # "Prophet": ProphetPredictor(),
+    # f"LLMP-Sampled ({LLMP_MODEL})": SampledTrajectoryLLMPredictor(
+    #     SampledTrajectoryLLMPredictorConfig(model=LLMP_MODEL, n_samples=N_SAMPLES)
+    # ),
+    # f"LLMP-Grid ({LLMP_MODEL})": QuantileGridLLMPredictor(
+    #     QuantileGridLLMPredictorConfig(model=LLMP_MODEL)
+    # ),
+    # f"News Agent ({AGENT_MODEL})": build_wti_agent_predictor(
+    #     build_wti_news_config(model=AGENT_MODEL)
+    # ),
+}
+
+print(f"Active predictors ({len(PREDICTORS)}):")
+for name in PREDICTORS:
+    print(f"  {name}")
+```
+
+## Cell 6 (markdown)
+
+---
+## 3. Run the 2025 Historical Backtest
+
+All 51 weekly origins in 2025 are evaluated for each predictor.
+`cached_multi_backtest` caches results under `data/predictions/` so
+subsequent runs are instant.
+
+## Cell 7 (code)
+
+```python
+print(f"Running 2025 rolling backtest ({len(PREDICTORS)} predictor(s))...")
+print("LLM/agent runs are expensive — first run will take several minutes.\n")
+
+backtest_results: dict[str, object] = {}
+for _name, _predictor in PREDICTORS.items():
+    backtest_results[_name] = cached_multi_backtest(_predictor, backtest_spec, data_service)
+    print(f"  {_name} ✓")
+
+print("\nAll 2025 backtests complete.")
+```
+
+## Cell 8 (markdown)
+
+---
+## 4. Performance Characterisation
+
+We score both predictors on the 2025 backtest data:
+- **CRPS** (Continuous Ranked Probability Score) — sharpness + calibration combined
+- **MAE at h=21d** — point forecast accuracy at the longest horizon
+
+The key question is not which method to pick (we've already chosen AutoARIMA),
+but *where* and *by how much* AutoARIMA beats the naive baseline — and where it
+still struggles. Those gaps are exactly what the adaptive agent will learn to address.
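+
+To make the CRPS column concrete, here is a hedged sketch of one common way to
+approximate CRPS from a quantile forecast: twice the average pinball (quantile)
+loss over an evenly spaced grid of levels. The helper names and the sample
+quantiles are hypothetical, and `score_backtest_results` may compute CRPS
+differently.
+
+```python
+def pinball_loss(y: float, q: float, tau: float) -> float:
+    """Pinball (quantile) loss of predicted quantile q at level tau for outcome y."""
+    return tau * (y - q) if y >= q else (1.0 - tau) * (q - y)
+
+
+def crps_from_quantiles(quantiles: dict[float, float], y: float) -> float:
+    """Approximate CRPS as twice the mean pinball loss over the quantile grid.
+
+    Assumes the levels are evenly spaced over (0, 1); a coarse grid gives a
+    coarse approximation.
+    """
+    losses = [pinball_loss(y, q, tau) for tau, q in quantiles.items()]
+    return 2.0 * sum(losses) / len(losses)
+
+
+# Illustrative forecasts in USD/bbl (not values from this backtest).
+wide = {0.1: 68.0, 0.3: 70.0, 0.5: 72.0, 0.7: 74.0, 0.9: 76.0}
+sharp = {0.1: 74.0, 0.3: 74.5, 0.5: 75.0, 0.7: 75.5, 0.9: 76.0}
+actual = 75.0
+
+wide_score = crps_from_quantiles(wide, actual)
+sharp_score = crps_from_quantiles(sharp, actual)
+print(f"wide forecast CRPS  ~ {wide_score:.2f}")
+print(f"sharp forecast CRPS ~ {sharp_score:.2f}")
+```
+
+The sharper, well-centred forecast gets the lower score, which is the
+"sharpness + calibration combined" behaviour described above.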
+
+## Cell 9 (code)
+
+```python
+import math
+
+from energy_oil_forecasting.analysis import score_backtest_results
+
+
+leaderboard_rows = []
+for name, results in backtest_results.items():
+    scores = score_backtest_results(results, data_service)
+    leaderboard_rows.append(
+        {
+            "Predictor": name,
+            "Mean CRPS": scores.get("mean_crps", float("nan")),
+            "MAE h=21d": scores.get("mae_h21", float("nan")),
+        }
+    )
+
+df_leaderboard = pd.DataFrame(leaderboard_rows).set_index("Predictor")
+df_leaderboard = df_leaderboard.sort_values("Mean CRPS")
+
+print("━" * 72)
+print("2025 HISTORICAL BACKTEST — PERFORMANCE SUMMARY:")
+print("━" * 72)
+print(df_leaderboard.to_string())
+
+arima_crps = df_leaderboard.loc["AutoARIMA", "Mean CRPS"] if "AutoARIMA" in df_leaderboard.index else float("nan")
+naive_crps = (
+    df_leaderboard.loc["Naive (Last Value)", "Mean CRPS"]
+    if "Naive (Last Value)" in df_leaderboard.index
+    else float("nan")
+)
+if not (math.isnan(arima_crps) or math.isnan(naive_crps)) and naive_crps != 0:
+    print(
+        f"\nAutoARIMA CRPS improvement over Naive: {naive_crps - arima_crps:.4f} ({(naive_crps - arima_crps) / naive_crps:.1%})"
+    )
+```
+
+## Cell 10 (code)
+
+```python
+# ── Save backtest results for NB05 / NB06 ────────────────────────────────────
+# Only the two baseline predictors are written to curriculum/ so that
+# uncommenting the optional predictors above does not pollute the files
+# that NB05 and NB06 depend on.
+_CURRICULUM_DIR = Path("adaptive_agent/curriculum")
+_CURRICULUM_DIR.mkdir(parents=True, exist_ok=True)
+_BASELINE_PREDICTORS = {"Naive (Last Value)", "AutoARIMA"}
+for _name, _result_dict in backtest_results.items():
+    if _name not in _BASELINE_PREDICTORS:
+        continue
+    _result = next(iter(_result_dict.values()))
+    (_CURRICULUM_DIR / f"backtest_{_name}.json").write_text(_result.model_dump_json(), encoding="utf-8")
+print(f"Saved {sum(n in _BASELINE_PREDICTORS for n in backtest_results)} backtest result(s) to {_CURRICULUM_DIR}/")
+```
+
+## Cell 11 (markdown)
+
+---
+## 5. 2026 Evaluation — Held-Out Test Period
+
+We run both predictors on **8 weekly origins in early 2026**
+(`energy_oil_eval.yaml`) — a period of major geopolitical volatility not
+seen during the 2025 backtest.
+
+This evaluation serves two purposes:
+1. **Measure out-of-sample robustness** — does AutoARIMA's 2025 edge hold
+   under a structural regime shift?
+2. **Establish the stateless baseline** that the trained adaptive agents in
+   Notebook 6 are compared against. Both results are saved to
+   `adaptive_agent/curriculum/` for Notebooks 5 and 6 to load.
+
+## Cell 12 (code)
+
+```python
+print("Running 2026 evaluation...")
+eval_results: dict[str, object] = {}
+for name, predictor in PREDICTORS.items():
+    eval_results[name] = cached_multi_backtest(predictor, eval_spec, data_service)
+    print(f"  {name} ✓")
+
+print("\n2026 evaluation complete.")
+```
+
+## Cell 13 (code)
+
+```python
+# ── Save eval results for NB06 ───────────────────────────────────────────────
+# Only baseline predictors are written so uncommenting optional predictors
+# above does not add extra rows to the NB06 scorecard.
+for _name, _result_dict in eval_results.items():
+    if _name not in _BASELINE_PREDICTORS:
+        continue
+    _result = next(iter(_result_dict.values()))
+    (_CURRICULUM_DIR / f"eval_{_name}.json").write_text(_result.model_dump_json(), encoding="utf-8")
+print(f"Saved {sum(n in _BASELINE_PREDICTORS for n in eval_results)} eval result(s) to {_CURRICULUM_DIR}/")
+```
+
+## Cell 14 (markdown)
+
+---
+## 6. Scorecard
+
+Out-of-sample performance of both stateless predictors on the 2026 eval period.
+These numbers are the **stateless baseline** the adaptive agent variants must
+beat in Notebook 6 to demonstrate that training added value.
+
+## Cell 15 (code)
+
+```python
+from energy_oil_forecasting.analysis import score_backtest_results
+
+
+scorecard_rows = []
+for name in PREDICTORS:
+    if name not in eval_results:
+        continue
+    scores = score_backtest_results(eval_results[name], data_service)
+    scorecard_rows.append(
+        {
+            "Predictor": name,
+            "Mean CRPS (2026)": scores.get("mean_crps", float("nan")),
+            "MAE h=21d (2026)": scores.get("mae_h21", float("nan")),
+            "80% CI Coverage": scores.get("coverage_80", float("nan")),
+        }
+    )
+
+df_scorecard = pd.DataFrame(scorecard_rows).set_index("Predictor")
+df_scorecard = df_scorecard.sort_values("Mean CRPS (2026)")
+
+print("━" * 72)
+print("2026 EVAL SCORECARD — STATELESS BASELINE:")
+print("━" * 72)
+print(df_scorecard.to_string())
+```
+
+## Cell 16 (markdown)
+
+---
+## 7. Core Takeaways
+
+1. **AutoARIMA beats the naive baseline** by extracting local autocorrelation
+   structure from the price history. In stable regimes, this translates to
+   noticeably better CRPS and MAE.
+
+2. **AutoARIMA fails under structural regime shifts.** It has no mechanism to
+   incorporate news, OPEC+ decisions, or geopolitical context. During the 2026
+   price shock, it extrapolates past trends and produces systematically biased,
+   over-confident (too-narrow) intervals.
+
+3. **These failure modes are learnable.** The backtest report surfaces exactly
+   which regimes and horizons are problematic — and that is precisely the
+   information we hand to the adaptive agent as training material in Notebook 5.
+
+4. **The `Predictor` abstraction makes the comparison clean.** The same harness,
+   scoring functions, and eval spec work for both stateless methods and the
+   adaptive agent variants in Notebook 6.
+
+---
+## 8. What stateless methods can't do
+
+AutoARIMA is calibrated once and never updated. This is intentional here —
+it creates a clean baseline — but it leaves a systematic gap:
+
+- **No error feedback.** If AutoARIMA consistently produces intervals that are
+  too narrow in elevated-vol regimes, it will keep making the same mistake.
+  There is no mechanism to update calibration between rounds.
+
+- **No market context.** AutoARIMA sees only price history. A human analyst
+  reviewing its output would immediately ask: *what's in the news?*
+
+- **No strategy evolution.** Each prediction starts from the same prior.
+  Resolved outcomes disappear without influencing future forecasts.
+
+→ **Notebook 5** introduces adaptive agents that study AutoARIMA's 2025
+performance, record systematic observations, and calibrate their strategies
+accordingly. At inference time, each agent receives the live AutoARIMA estimate
+and decides how to adjust it — applying what it learned from training.
+
+→ **Notebook 6** evaluates whether any training approach actually improved
+out-of-sample performance on the held-out 2026 data.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__05_adaptive_agent_training.ipynb.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__05_adaptive_agent_training.ipynb.md
new file mode 100644
index 0000000..ee2ded3
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__05_adaptive_agent_training.ipynb.md
@@ -0,0 +1,409 @@
+# Source: implementations/energy_oil_forecasting/05_adaptive_agent_training.ipynb
+
+kind: notebook
+
+## Cell 1 (markdown)
+
+# WTI Crude Oil — Adaptive Agent: Self-Directed Study (Notebook 5 of 7)
+
+> **Part 5 of 7.** Builds on the stateless backtest in [`04_systematic_backtest_eval.ipynb`](04_systematic_backtest_eval.ipynb).
+
+Every method in Notebook 4 was **stateless** — configured once, run the same way each time.
+This notebook introduces an agent that is different: it can **learn from experience**.
+
+The paradigm shift: instead of configuring a model, we onboard an analyst.
+We give the analyst a task, historical data, and a set of tools.
+The analyst explores the data, draws conclusions, and decides whether to update
+its own forecasting strategy — governed by evidence rules in its `meta-learning` skill.
+
+**What this notebook produces:**
+
+| Strategy dir | Contents |
+|---|---|
+| `wti-strategy/` | Clean initial state — never modified |
+| `wti-strategy-trained/` | Strategy after one self-directed study session |
+
+## Cell 2 (markdown)
+
+---
+## 0. Setup
+
+## Cell 3 (code)
+
+```python
+import warnings
+from pathlib import Path
+
+from aieng.forecasting.methods.agentic import build_adk_agent
+from aieng.forecasting.methods.agentic.adk_runner import AdkTextRunner, AdkTextRunnerConfig
+from energy_oil_forecasting.adaptive_agent import build_wti_adaptive_config
+
+
+warnings.filterwarnings("ignore")
+
+# ── Paths ─────────────────────────────────────────────────────────────────────
+_NB_DIR = Path(".")
+_SKILLS_ROOT = _NB_DIR / "adaptive_agent" / "skills"
+_CURRICULUM_DIR = _NB_DIR / "adaptive_agent" / "curriculum"
+
+# Clean seed — read-only baseline, never written to by training.
+SEED_STRATEGY_DIR = _SKILLS_ROOT / "wti-strategy"
+# Strategy state after the self-directed study session.
+TRAINED_STRATEGY_DIR = _SKILLS_ROOT / "wti-strategy-trained"
+
+# ── Model ─────────────────────────────────────────────────────────────────────
+# Two project models: "gemini-3.1-flash-lite-preview" (lite/default) and
+# "gemini-3.5-flash" (advanced). The adaptive agent uses the advanced model.
+AGENT_MODEL = "gemini-3.5-flash"
+
+# ── Run guards ────────────────────────────────────────────────────────────────
+# Expensive by default; outputs are committed after first run.
+# Set RUN_STUDY = True only to regenerate from scratch.
+# Set RESEED = True to reset the trained strategy to the clean seed before running.
+RUN_STUDY = False  # Self-directed study session (live API calls)
+RESEED = False  # Reset wti-strategy-trained/ to clean seed first
+
+print("Setup complete.")
+print(f"  Seed:    {SEED_STRATEGY_DIR}")
+print(f"  Trained: {TRAINED_STRATEGY_DIR}")
+```
+
+## Cell 4 (markdown)
+
+---
+## 1. Before — The Agent's Starting State
+
+The seed strategy (`wti-strategy/`) contains domain priors: a sensible initial
+approach, but no evidence-backed calibration corrections.
+It is the same strategy the **untrained agent** uses in Notebook 6.
+
+The trained variant starts from an identical copy of this seed.
+Set `RESEED = True` in Setup if you want to reset it before a fresh study run.
+
+## Cell 5 (code)
+
+```python
+if RESEED:
+    import shutil  # noqa: PLC0415
+
+    from aieng.forecasting.methods.agentic.adaptive_skill import AdaptiveSkillStore  # noqa: PLC0415
+    from energy_oil_forecasting.adaptive_agent.skill_state import WtiStrategyState  # noqa: PLC0415
+
+    TRAINED_STRATEGY_DIR.mkdir(exist_ok=True)
+    shutil.copy2(SEED_STRATEGY_DIR / "skill_state.yaml", TRAINED_STRATEGY_DIR / "skill_state.yaml")
+    store = AdaptiveSkillStore(skill_dir=TRAINED_STRATEGY_DIR, state_type=WtiStrategyState)
+    store.save(store.load())
+    print("wti-strategy-trained/ reset to clean seed.")
+else:
+    print("RESEED = False — keeping existing wti-strategy-trained/ state.")
+
+print()
+print("Initial strategy (wti-strategy/SKILL.md):")
+print("─" * 60)
+print((SEED_STRATEGY_DIR / "SKILL.md").read_text())
+```
+
+## Cell 6 (markdown)
+
+---
+## 2. Self-Directed Study
+
+We give the agent one open-ended analytical task: explore 2025 WTI price data
+and assess whether its current forecasting approach is well-calibrated.
+
+The agent has access to:
+- `fetch-yfinance` — live price data from Yahoo Finance (with temporal cutoffs)
+- `vol-regime` — volatility regime classification
+- `trend-projection` — trend fitting and interval calibration
+- `meta-learning` — evidence governance rules for updating strategy
+- Strategy mutation tools — to record observations, open hypotheses, and apply corrections
+
+The agent decides what to compute, what conclusions to draw, and whether any
+finding clears the evidence bar for updating its `wti-strategy-trained/` skill.
+
+> **Run guard:** `RUN_STUDY = False` by default — the trained strategy state
+> is committed so this notebook runs reproducibly without live API calls.
+
+## Cell 7 (code)
+
+```python
+_STUDY_PROMPT = (
+    "You have access to historical WTI crude oil price data via run_code. "
+    "Please do the following:\n\n"
+    "1. Fetch the daily WTI close price series for the full year 2025 using "
+    'yfinance (ticker: CL=F, end="2026-01-01").\n'
+    "2. Compute 21-day rolling realized volatility. Classify each day into a "
+    "vol regime using the thresholds in your vol-regime skill.\n"
+    "3. Simulate the errors a simple trend-projection forecaster would make "
+    "at 5, 10, and 21 business-day horizons during each regime. Approximate "
+    "this using the historical return distribution within each regime window.\n"
+    "4. Summarize: in which regimes and at which horizons does trend-projection "
+    "tend to produce the largest errors? Is there a directional bias?\n\n"
+    "Based on your analysis, decide whether any findings meet the evidence "
+    "threshold in your meta-learning skill. If they do, record them using "
+    "the appropriate mutation tools. If not, explain what additional evidence "
+    "you would need before updating your strategy."
+)
+
+if RUN_STUDY:
+    config = build_wti_adaptive_config(model=AGENT_MODEL, strategy_dir=TRAINED_STRATEGY_DIR)
+    agent = build_adk_agent(config)
+    runner = AdkTextRunner(
+        agent,
+        config=AdkTextRunnerConfig(
+            app_name="wti_self_directed_study",
+            enable_langfuse_tracing=True,
+            langfuse_tags=["energy-oil", "adaptive-agent", "self-directed-study"],
+            langfuse_trace_name="wti-adaptive-self-directed-study",
+        ),
+    )
+    print("Running self-directed study session...")
+    print("(Live API calls + E2B sandbox — may take several minutes.)\n")
+    reply = await runner.run_text_async(_STUDY_PROMPT)  # noqa: F704, PLE1142
+    (_CURRICULUM_DIR / "study_response.txt").write_text(reply, encoding="utf-8")
+    print(reply)
+else:
+    _f = _CURRICULUM_DIR / "study_response.txt"
+    if _f.exists():
+        print(_f.read_text())
+    else:
+        print("[Study session not yet run. Set RUN_STUDY = True and re-run.]")
+```
+
+## Cell 8 (markdown)
+
+---
+## 3. After — What the Agent Learned
+
+The cell below shows the trained strategy state.
+Look at what changed relative to the clean seed:
+
+- **Observations**: patterns the agent noticed during analysis
+- **Hypotheses**: candidate corrections it opened for future confirmation
+- **Calibration corrections**: confirmed adjustments now applied at inference
+- **Approach narrative**: how the agent describes its own strategy in its own words
+
+These are the changes that will be active when the agent makes predictions
+in Notebook 6.
+
+## Cell 9 (code)
+
+```python
+import yaml  # noqa: PLC0415
+
+
+def _load_state(d: Path) -> dict:
+    return yaml.safe_load((d / "skill_state.yaml").read_text())
+
+
+seed_state = _load_state(SEED_STRATEGY_DIR)
+trained_state = _load_state(TRAINED_STRATEGY_DIR)
+
+print("What changed after self-directed study:")
+print("─" * 60)
+for key, label in [
+    ("observations", "Observations"),
+    ("hypotheses", "Hypotheses"),
+    ("calibration_corrections", "Calibration corrections"),
+]:
+    before = len(seed_state.get(key, []))
+    after = len(trained_state.get(key, []))
+    delta = f"+{after - before}" if after >= before else str(after - before)
+    print(f"  {label:28s}: {before} → {after}  ({delta})")
+
+approach_changed = trained_state.get("approach_narrative", "") != seed_state.get("approach_narrative", "")
+print(f"  {'Approach narrative':28s}: {'UPDATED' if approach_changed else 'unchanged'}")
+```
+
+## Cell 10 (code)
+
+```python
+print("Trained strategy (wti-strategy-trained/SKILL.md):")
+print("─" * 60)
+print((TRAINED_STRATEGY_DIR / "SKILL.md").read_text())
+```
+
+## Cell 11 (markdown)
+
+---
+## 4. Optional: Robustness Testing
+
+In the self-directed study, the agent examined 2025 WTI data and recorded at
+least one open hypothesis. The two cells below run follow-up tasks to test
+whether those findings are robust — the standard scientific check before
+promoting any pattern to an active calibration correction.
+
+| Task | Structure | Goal |
+|---|---|---|
+| A — Cross-period | Re-run the same analysis on 2023-2024 data | `record_hypothesis_outcome` for each open hypothesis |
+| B — Scope check | Identify untested boundary conditions and fill the gap | Second confirmation → attempt `graduate_hypothesis` |
+
+> **Run guard:** `RUN_FOLLOWUP = False` by default. Both tasks use the same
+> agent session and must run together — outputs are committed after first run.
+
+## Cell 12 (code)
+
+```python
+# ── Run guard ─────────────────────────────────────────────────────────────────
+# Set True to run the robustness tasks. Both run sequentially in one session.
+# Outputs are saved and committed — leave False for reproducibility.
+RUN_FOLLOWUP = False
+```
+
+## Cell 13 (markdown)
+
+### Task A — Cross-Period Robustness (2023–2024)
+
+Ask the agent to review its open hypotheses and replicate the relevant
+analysis on 2023-2024 WTI data, recording whether the earlier data confirms
+or contradicts each finding.
+
+## Cell 14 (code)
+
+```python
+_FOLLOWUP_A_PROMPT = (
+    "Review the open hypotheses recorded in your strategy file. "
+    "For each one:\n"
+    "1. Summarize what the hypothesis claims and which time period or "
+    "conditions it was originally based on.\n"
+    "2. Fetch WTI daily close prices for 2023 and 2024 "
+    '(ticker: CL=F, end="2025-01-01").\n'
+    "3. Run the same type of analysis the hypothesis was based on, "
+    "using the 2023-2024 data.\n"
+    "4. Call record_hypothesis_outcome for the hypothesis with "
+    'outcome="confirmed" if the pattern holds in the earlier data, or '
+    'outcome="refuted" if it does not. Be specific about what matched '
+    "or contradicted the original finding."
+)
+
+if RUN_FOLLOWUP:
+    config = build_wti_adaptive_config(model=AGENT_MODEL, strategy_dir=TRAINED_STRATEGY_DIR)
+    agent = build_adk_agent(config)
+    runner = AdkTextRunner(
+        agent,
+        config=AdkTextRunnerConfig(
+            app_name="wti_robustness_followup",
+            enable_langfuse_tracing=True,
+            langfuse_tags=["energy-oil", "adaptive-agent", "robustness-followup"],
+            langfuse_trace_name="wti-adaptive-robustness-a",
+            fresh_session_per_message=False,
+        ),
+    )
+    print("Running Task A: cross-period robustness test (2023-2024)...")
+    reply_a = await runner.run_text_async(_FOLLOWUP_A_PROMPT)  # noqa: F704, PLE1142
+    print(reply_a)
+else:
+    _f = _CURRICULUM_DIR / "followup_a_response.txt"
+    if _f.exists():
+        print(_f.read_text())
+    else:
+        print("[Task A not yet run. Set RUN_FOLLOWUP = True and re-run.]")
+```
+
+## Cell 15 (markdown)
+
+### Task B — Scope Check and Graduation Attempt
+
+Ask the agent to identify the untested boundary conditions of its open
+hypotheses — horizons, regimes, or market conditions not yet examined —
+run a targeted analysis to fill the most important gap, and then attempt
+graduation if the confirmation threshold is met.
+
+## Cell 16 (code)
+
+```python
+_FOLLOWUP_B_PROMPT = (
+    "Look at your open hypotheses and any confirmations or refutations "
+    "recorded so far. For each open hypothesis:\n"
+    "1. Identify the boundary conditions — which horizons, regimes, or "
+    "market conditions does the finding cover, and which have not yet been tested?\n"
+    "2. Run a targeted analysis to fill in the most important untested gap "
+    "(e.g. a horizon or market condition you have not yet checked). "
+    "Fetch whatever data you need via yfinance.\n"
+    "3. Call record_hypothesis_outcome based on what you find.\n"
+    "4. If the confirmation threshold is now met (the tool will tell you), "
+    "call graduate_hypothesis with a precise condition, a concrete adjustment, "
+    "and the appropriate horizon_scope. If the threshold is not yet met, "
+    "explain what additional evidence would be needed to graduate the hypothesis."
+)
+
+if RUN_FOLLOWUP:
+    print("\nRunning Task B: horizon scope check...")
+    reply_b = await runner.run_text_async(_FOLLOWUP_B_PROMPT)  # noqa: F704, PLE1142
+    (_CURRICULUM_DIR / "followup_a_response.txt").write_text(reply_a, encoding="utf-8")
+    (_CURRICULUM_DIR / "followup_b_response.txt").write_text(reply_b, encoding="utf-8")
+    print(reply_b)
+else:
+    _f = _CURRICULUM_DIR / "followup_b_response.txt"
+    if _f.exists():
+        print(_f.read_text())
+    else:
+        print("[Task B not yet run. Set RUN_FOLLOWUP = True and re-run.]")
+```
+
+## Cell 17 (markdown)
+
+### Strategy state after robustness testing
+
+## Cell 18 (code)
+
+```python
+trained_state_after = yaml.safe_load((TRAINED_STRATEGY_DIR / "skill_state.yaml").read_text())
+hyps = trained_state_after.get("hypotheses", [])
+corrections = trained_state_after.get("calibration_corrections", [])
+
+print("wti-strategy-trained/ after robustness testing:")
+print("─" * 60)
+for hyp in hyps:
+    print(f"  hyp {hyp['id']}: {hyp['status']}  confirmations={hyp['confirmations']}  refutations={hyp['refutations']}")
+print(f"  Calibration corrections: {len(corrections)}")
+if corrections:
+    for c in corrections:
+        print(f"    [{c['condition']}] → {c['adjustment']}")
+print()
+print((TRAINED_STRATEGY_DIR / "SKILL.md").read_text())
+```
+
+## Cell 19 (markdown)
+
+---
+## 5. Continue Interactively
+
+The notebook has walked the agent through a structured study session. But the
+best way to understand what the agent has learned — and to push it further —
+is to have a direct conversation.
+
+Launch the ADK web interface from the repo root:
+
+```bash
+cd implementations/energy_oil_forecasting
+
+# Start fresh (seed strategy — no training applied yet):
+uv run adk web adaptive_agent/
+
+# Continue from where the training notebook left off:
+WTI_STRATEGY_DIR=adaptive_agent/skills/wti-strategy-trained \
+    uv run adk web adaptive_agent/
+```
+
+Open `http://localhost:8000` in your browser. The agent has its full skill
+set available: code execution, web search, and mutation tools.
+
+**Suggested conversation starters:**
+
+- *"What's your current forecasting strategy? Summarize it in plain language and tell me what calibration corrections are active."*
+- *"Look at the 2022 Russia-Ukraine oil shock (Feb–Mar 2022). Does your flat-line finding hold during a sharp upward move driven by geopolitical shock?"*
+- *"Explore early 2020 (COVID demand collapse). Does the flat-line advantage hold during a sharp downward move as well as the recovery?"*
+- *"Given your current strategy, what would your 21-day WTI forecast be as of today?"*
+- *"What would it take for you to open a second hypothesis? What's the next most interesting pattern to investigate?"*
+
+---
+## Next: Protected Evaluation
+
+Notebook 6 evaluates both the **untrained agent** (uses `wti-strategy/`)
+and the **trained agent** (uses `wti-strategy-trained/`) on the 2026 eval spec —
+a period of significant market volatility the agent has never seen.
+
+The eval is deliberately **frozen**: the agent cannot update its strategy
+during evaluation, so the comparison is a clean before/after of what
+the self-directed study session contributed.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__05_forecast_tool_demo.ipynb.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__05_forecast_tool_demo.ipynb.md
new file mode 100644
index 0000000..a38e21c
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__05_forecast_tool_demo.ipynb.md
@@ -0,0 +1,200 @@
+# Source: implementations/energy_oil_forecasting/05_forecast_tool_demo.ipynb
+
+kind: notebook
+
+## Cell 1 (markdown)
+
+# WTI Crude Oil — The Forecast Tool (Notebook 5)
+
+Notebook 2 built a **progressive capability staircase** for the analyst agent:
+no tools → news → news + open-ended code execution. This notebook adds a
+**fourth, contrasting capability level**: a conventional **function tool**.
+
+Instead of letting the agent write arbitrary Python, we expose a single,
+rigidly-typed callable — `run_forecast` — that fits a pre-specified statistical
+model (**AutoARIMA**) up to a cutoff date and returns a structured forecast. The
+agent expresses intent through the tool's parameters (`series_id`,
+`cutoff_date`, `horizons`, `frequency`); the series data never enters the LLM
+context window.
+
+| Path | Mechanism | Trade-off |
+|------|-----------|-----------|
+| `build_wti_code_exec_config` | Open-ended code generation | Maximum flexibility, less control |
+| `build_wti_tool_config` (this NB) | Fixed function tool | Less flexibility, full control + reproducibility |
+
+> **Prerequisite:** Read [`02_intro_agentic_predictor.ipynb`](02_intro_agentic_predictor.ipynb)
+> first for the staircase framing and the `AgentPredictor` interface.
+
+## Cell 2 (markdown)
+
+---
+## 1. Setup & Data Registration
+
+## Cell 3 (code)
+
+```python
+import warnings
+
+
+warnings.filterwarnings("ignore")
+
+import pandas as pd
+from aieng.forecasting.evaluation.task import ForecastingTask
+from energy_oil_forecasting.data import WTI_SERIES_ID, build_wti_service
+
+
+# The data service is shared by the tool and the prompt builder. The tool reads
+# series data directly from it (server-side) — it is never sent to the model.
+data_service = build_wti_service()
+
+AS_OF = pd.Timestamp("2026-03-01")  # information cutoff (context available)
+ORIGIN = pd.Timestamp("2026-03-02")  # the day we forecast *from*
+
+ctx = data_service.context(as_of=AS_OF)
+full_df = ctx.get_series(WTI_SERIES_ID)
+
+print(f"Trading days in cache up to {AS_OF.date()}: {len(full_df)}")
+print(f"Last WTI close: ${full_df['value'].iloc[-1]:.2f}/bbl on {str(full_df['timestamp'].iloc[-1])[:10]}")
+```
+
+## Cell 4 (code)
+
+```python
+task = ForecastingTask(
+    task_id="wti_oil_price_forecast",
+    target_series_id=WTI_SERIES_ID,
+    horizons=[5, 10, 21],
+    frequency="B",
+    description="WTI Crude Oil front-month futures — 5, 10, 21 business days ahead.",
+)
+
+print("Task:", task.task_id, "| horizons:", task.horizons, "| as_of:", ctx.as_of)
+```
+
+## Cell 5 (markdown)
+
+---
+## 2. The tool, standalone
+
+`ForecastTool` is deterministic and needs no LLM. We call it directly here to
+show exactly what the agent will receive: a JSON block with point forecasts and
+prediction intervals per horizon, plus the series metadata and the cutoff date
+used.
+
+We pass an explicit `data_service` (the one registered above) so the tool reads
+from the same cache. The tool wraps a `Predictor`; here we inject an AutoARIMA
+predictor with a modest `num_samples` (it is slow per origin).
+
+## Cell 6 (code)
+
+```python
+from aieng.forecasting.methods.agentic import ForecastTool
+from aieng.forecasting.methods.numerical.darts_arima import DartsAutoARIMAPredictor
+
+
+tool = ForecastTool(data_service, predictor=DartsAutoARIMAPredictor(num_samples=200))
+
+print("Running AutoARIMA forecast (this can take tens of seconds)...")
+result_json = tool.run_forecast(
+    series_id=WTI_SERIES_ID,
+    cutoff_date=str(AS_OF.date()),
+    horizons=task.horizons,
+    frequency="B",
+)
+print(result_json)
+```
+
+## Cell 7 (markdown)
+
+Note the `notes` field: a true 95% interval is not reported because the
+predictor's standard quantile grid tops out at p05/p95, so the widest honest
+interval is **90%** (p05–p95). The tool reports the **80%** (p10–p90) and
+**90%** intervals plus the full quantile grid — it never fabricates coverage
+the model did not produce.
+
+## Cell 8 (markdown)
+
+---
+## 3. Wiring the tool into the agent
+
+`build_wti_tool_config()` is the fourth capability factory. It combines the
+bounded Google Search sub-agent (temporal cutoff enforced) with the forecast
+tool, and appends an instruction supplement telling the agent to call
+`run_forecast` once before producing its forecast.
+
+We pass the same `data_service` so the config does not rebuild it.
+
+## Cell 9 (code)
+
+```python
+from energy_oil_forecasting.analyst_agent import (
+    build_wti_agent_predictor,
+    build_wti_tool_config,
+)
+
+
+# Models: "gemini-3.1-flash-lite-preview" (lite/default) · "gemini-3.5-flash" (advanced)
+tool_config = build_wti_tool_config(
+    model="gemini-3.1-flash-lite-preview",
+    # model="gemini-3.5-flash",  # advanced
+    data_service=data_service,
+    num_samples=200,
+)
+
+print("=== Tool config summary ===")
+print("name:                ", tool_config.name)
+print("model:               ", tool_config.model)
+print("function_tools:      ", len(tool_config.function_tools))
+print("context_retrieval:   ", tool_config.context_retrieval.enabled)
+print("search_model:        ", tool_config.context_retrieval.search_model)
+print("\n=== Forecast tool supplement (tail of instruction) ===")
+print(tool_config.instruction[-700:])
+```
+
+## Cell 10 (markdown)
+
+---
+## 4. A single agent call
+
+Wrapping the config in an `AgentPredictor` and calling `predict` runs **one**
+agent turn. In that turn the agent calls the Google Search sub-agent for market
+context **and** `run_forecast` for the AutoARIMA anchor, then returns a
+structured forecast that conditions on both.
+
+## Cell 11 (code)
+
+```python
+tool_predictor = build_wti_agent_predictor(tool_config)
+
+print(f"Predictor ID: {tool_predictor.predictor_id}")
+print("Running tool-equipped agent... (Google Search + AutoARIMA forecast tool)")
+
+tool_preds = tool_predictor.predict(task, ctx)
+
+print("\nTool-equipped agent forecast:\n")
+for horizon, p in zip(task.horizons, tool_preds):
+    fc = p.payload
+    print(
+        f"  h={horizon:>2}d  "
+        f"point=${fc.point_forecast:.2f}  "
+        f"80%CI=[${fc.quantiles[0.10]:.2f}, ${fc.quantiles[0.90]:.2f}]"
+    )
+if tool_preds and tool_preds[0].metadata.get("rationale"):
+    print("\nAgent rationale:", tool_preds[0].metadata["rationale"][:500])
+```
+
+## Cell 12 (markdown)
+
+---
+## 5. Wrap-up
+
+- The tool-equipped agent returns standard `Prediction` objects, so it drops
+  straight into the Notebook 4 backtest harness via
+  `build_wti_agent_predictor(build_wti_tool_config(...))`.
+- **Conventional tools vs. code generation** is a deliberate design divergence:
+  the tool path trades flexibility for a fixed, auditable, reproducible
+  interface — arguably a safer way to grant an agent a new ability.
+- AutoARIMA is just one example. `ForecastTool` wraps any `Predictor` (passed at
+  construction), so swapping in Prophet, ETS, or an ensemble needs no signature
+  change. And `series_id` makes the tool reusable for food CPI, the BoC rate,
+  and other registered series.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__06_protected_eval.ipynb.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__06_protected_eval.ipynb.md
new file mode 100644
index 0000000..672e8a9
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__06_protected_eval.ipynb.md
@@ -0,0 +1,473 @@
+# Source: implementations/energy_oil_forecasting/06_protected_eval.ipynb
+
+kind: notebook
+
+## Cell 1 (markdown)
+
+# WTI Crude Oil — Protected Evaluation (Notebook 6 of 7)
+
+> **Part 6 of 7.** Requires Notebook 5 to have been run first —
+> the trained strategy (`wti-strategy-trained/`) must exist.
+
+This notebook answers one question: **did the self-directed study session improve
+the agent's forecasting?**
+
+We evaluate two versions of the adaptive agent on held-out 2026 data —
+a period of significant WTI price volatility neither agent has ever seen:
+
+| Variant | Strategy | Training |
+|---|---|---|
+| **Untrained** | `wti-strategy/` | None — initial domain priors only |
+| **Trained** | `wti-strategy-trained/` | One self-directed study session (NB05) |
+
+Stateless methods (AutoARIMA, Naive) from NB04 provide an external reference point.
+Both adaptive agent variants are **frozen** during evaluation — no strategy updates —
+so any difference is attributable solely to the training session.
+
+## Cell 2 (markdown)
+
+---
+## 0. Setup & Freeze
+
+## Cell 3 (code)
+
+```python
+import warnings
+from pathlib import Path
+
+import pandas as pd
+from aieng.forecasting.evaluation import (
+    MultiTargetBacktestSpec,
+    cached_multi_backtest,
+)
+from aieng.forecasting.evaluation.backtest import BacktestResult
+from energy_oil_forecasting.adaptive_agent import build_wti_adaptive_predictor
+from energy_oil_forecasting.adaptive_agent.curriculum.snapshot_utils import (
+    state_checksum,
+)
+from energy_oil_forecasting.analysis import score_backtest_results
+from energy_oil_forecasting.data import build_wti_service
+
+
+warnings.filterwarnings("ignore")
+
+# ── Paths ─────────────────────────────────────────────────────────────────────
+_NB_DIR = Path(".")
+_SKILLS_ROOT = _NB_DIR / "adaptive_agent" / "skills"
+_CURRICULUM_DIR = _NB_DIR / "adaptive_agent" / "curriculum"
+_SPECS_DIR = _NB_DIR / "specs"
+
+SEED_STRATEGY_DIR = _SKILLS_ROOT / "wti-strategy"  # untrained baseline
+TRAINED_STRATEGY_DIR = _SKILLS_ROOT / "wti-strategy-trained"  # after self-directed study
+
+# Both adaptive variants — used for eval, loading, and state checks:
+ADAPTIVE_VARIANTS = {
+    "Agent — untrained": SEED_STRATEGY_DIR,
+    "Agent — trained": TRAINED_STRATEGY_DIR,
+}
+
+# ── Model ─────────────────────────────────────────────────────────────────────
+# Two project models: "gemini-3.1-flash-lite-preview" (lite/default) and
+# "gemini-3.5-flash" (advanced). The adaptive agent uses the advanced model.
+AGENT_MODEL = "gemini-3.5-flash"
+
+# ── Run guard ─────────────────────────────────────────────────────────────────
+# Set True on first run; commit outputs; leave False for reproducibility.
+RUN_EVAL = False
+
+# ── Data service ──────────────────────────────────────────────────────────────
+data_service = build_wti_service()
+print("Setup complete.")
+```
+
+## Cell 4 (code)
+
+```python
+# ── Freeze: record pre-eval checksums ────────────────────────────────────────
+_pre_eval_checksums = {name: state_checksum(d) for name, d in ADAPTIVE_VARIANTS.items()}
+print("Pre-eval checksums recorded:")
+for name, ck in _pre_eval_checksums.items():
+    print(f"  {name}: {ck[:16]}...")
+```
+
+## Cell 5 (markdown)
+
+---
+## 1. The Knowledge-Cutoff Teaching Point
+
+**Gemini's parametric knowledge cutoff is approximately January 2025.**
+This has a concrete implication for this evaluation:
+
+- The **training period** (2025) is at or near the model's parametric knowledge
+  horizon. During the self-directed study in NB05, the agent was instructed to
+  fetch data via yfinance and reason from what it computed — not from memorized
+  facts about 2025 WTI prices.
+
+- The **evaluation period** (Feb–Mar 2026) is definitively post-cutoff.
+  During eval, the agent must rely entirely on:
+  1. Live Google Search (with `cutoff_date` enforcement per origin)
+  2. Code execution (for statistical analysis of fetched data)
+  3. Its accumulated strategy state (from the training session)
+
+This is a clean test of what the training phase actually adds: any improvement
+cannot be attributed to the model's parametric knowledge of the eval period.
+
+## Cell 6 (markdown)
+
+---
+## 2. Load Stateless Eval Results
+
+Notebook 4 saved 2026 eval results for AutoARIMA and Naive baselines.
+We load them here as external reference points — no re-run needed.
+
+## Cell 7 (code)
+
+```python
+# ── Load eval results from NB04 ─────────────────────────────────────────────
+# Load only stateless results (not agent-specific files).
+_stateless_jsons = [
+    f for f in sorted(_CURRICULUM_DIR.glob("eval_*.json")) if not f.stem.removeprefix("eval_").startswith("Agent")
+]
+if not _stateless_jsons:
+    raise FileNotFoundError(
+        "No stateless eval result files found in adaptive_agent/curriculum/. "
+        "Run 04_systematic_backtest_eval.ipynb first."
+    )
+
+all_eval_results: dict[str, BacktestResult] = {}
+for f in _stateless_jsons:
+    name = f.stem.removeprefix("eval_").replace("_", " ")
+    all_eval_results[name] = BacktestResult.model_validate_json(f.read_text())
+
+print(f"Loaded {len(all_eval_results)} stateless eval result(s):")
+for name, r in all_eval_results.items():
+    print(f"  {name}: {len(r.predictions)} predictions, mean CRPS = {r.mean_score:.4f}")
+```
+
+## Cell 8 (markdown)
+
+---
+## 3. Evaluate Adaptive Agent Variants
+
+Both adaptive variants are evaluated on the same 2026 eval spec used by
+the stateless predictors in NB04.
+
+> **Run guard:** `RUN_EVAL = False` by default. Set to `True` on first run,
+> commit the saved result files, and leave `False` for reproducibility.
+
+## Cell 9 (code)
+
+```python
+import yaml  # noqa: PLC0415
+
+
+with open(_SPECS_DIR / "energy_oil_eval.yaml") as _f:
+    eval_spec = MultiTargetBacktestSpec.model_validate(yaml.safe_load(_f))
+
+
+def _safe_key(name: str) -> str:
+    return name.replace(" ", "_").replace("(", "").replace(")", "").replace("—", "").strip("_")
+
+
+if RUN_EVAL:
+    print("Running adaptive agent variants on 2026 eval spec...")
+    print("(Live API calls — first run may take several minutes.)\n")
+
+    for variant_name, strategy_dir in ADAPTIVE_VARIANTS.items():
+        predictor = build_wti_adaptive_predictor(strategy_dir=strategy_dir, model=AGENT_MODEL)
+        result_dict = cached_multi_backtest(predictor, eval_spec, data_service)
+        result = next(iter(result_dict.values()))
+        all_eval_results[variant_name] = result
+        safe = _safe_key(variant_name)
+        (_CURRICULUM_DIR / f"eval_{safe}.json").write_text(result.model_dump_json(), encoding="utf-8")
+        print(f"  {variant_name}: mean CRPS = {result.mean_score:.4f} ✓")
+
+    print("\nEval complete.")
+else:
+    # Load committed results for all adaptive variants
+    for variant_name in ADAPTIVE_VARIANTS:
+        safe = _safe_key(variant_name)
+        _f = _CURRICULUM_DIR / f"eval_{safe}.json"
+        if _f.exists():
+            all_eval_results[variant_name] = BacktestResult.model_validate_json(_f.read_text())
+    print("RUN_EVAL = False — using committed outputs (or set True to re-run).")
+print(f"Eval results available: {list(all_eval_results)}")
+```
+
+## Cell 10 (markdown)
+
+---
+## 4. Before vs After — Comparative Scorecard
+
+All predictors evaluated on the same 2026 eval origins.
+Lower CRPS is better.
+
+| | What it represents |
+|---|---|
+| **Agent — untrained** | Adaptive architecture + news search, zero training |
+| **Agent — trained** | Same, plus one self-directed study session |
+| AutoARIMA | Best stateless statistical method from NB04 |
+| Naive | Last-value baseline |
+
+## Cell 11 (code)
+
+```python
+scorecard_rows = []
+for name, result in all_eval_results.items():
+    _result_for_scoring = result if isinstance(result, dict) else {name: result}
+    scores = score_backtest_results(_result_for_scoring, data_service)
+    scorecard_rows.append(
+        {
+            "Predictor": name,
+            "Mean CRPS": round(scores.get("mean_crps", float("nan")), 3),
+            "MAE h=21d": round(scores.get("mae_h21", float("nan")), 3),
+            "80% CI Coverage": f"{scores.get('coverage_80', float('nan')):.1f}%",
+        }
+    )
+
+df_scorecard = pd.DataFrame(scorecard_rows).set_index("Predictor")
+df_scorecard = df_scorecard.sort_values("Mean CRPS")
+
+print("━" * 72)
+print("2026 PROTECTED EVAL (sorted by CRPS, lower is better):")
+print("━" * 72)
+print(df_scorecard.to_string())
+```
+
+## Cell 12 (markdown)
+
+---
+## 5. Forecast Comparison — All Eval Origins, h=21d
+
+One panel per predictor, all eval origins on a shared time axis.
+Each panel shows the realised WTI price (black line), the 21-day-ahead
+point forecast (diamond), and the 80% prediction interval (vertical bar).
+Ordered by CRPS score — best at top.
+
+## Cell 13 (code)
+
+```python
+from datetime import datetime
+
+import plotly.graph_objects as go
+from plotly.subplots import make_subplots
+
+
+_full_series = data_service.get_series("wti_crude_oil_price", as_of=datetime.now())
+_price_ts = pd.to_datetime(_full_series["timestamp"])
+_price_vals = _full_series["value"].values
+
+_COLORS = {
+    "Naive (Last Value)": "#aaaaaa",
+    "AutoARIMA": "#4e8fc7",
+    "Agent — untrained": "#f4a261",
+    "Agent — trained": "#6a0572",
+}
+# Show predictors in scorecard order (best CRPS first)
+_PREDICTOR_ORDER = [p for p in df_scorecard.index if p in _COLORS]
+
+
+def _has_forecast(payload) -> bool:
+    return hasattr(payload, "point_forecast") and hasattr(payload, "quantiles")
+
+
+# Collect the longest-horizon prediction per (predictor, origin)
+_h21: dict[tuple[str, str], object] = {}
+for name, result in all_eval_results.items():
+    for pred in result.predictions:
+        if not _has_forecast(pred.payload):
+            continue
+        horizon = (pd.Timestamp(pred.forecast_date) - pd.Timestamp(pred.as_of)).days
+        key = (name, str(pred.as_of.date()))
+        existing = _h21.get(key)
+        if existing is None:
+            _h21[key] = pred
+        else:
+            existing_h = (pd.Timestamp(existing.forecast_date) - pd.Timestamp(existing.as_of)).days
+            if horizon > existing_h:
+                _h21[key] = pred
+print(f"h-max predictions collected: {len(_h21)}")
+
+_origins = sorted({str(pred.as_of.date()) for result in all_eval_results.values() for pred in result.predictions})
+_t0 = pd.Timestamp(_origins[0]) - pd.Timedelta(days=14)
+_t1 = pd.Timestamp(_origins[-1]) + pd.Timedelta(days=28)
+_mask = (_price_ts >= _t0) & (_price_ts <= _t1)
+_ctx_dates = _price_ts[_mask]
+_ctx_prices = _price_vals[_mask]
+
+_n_rows = len(_PREDICTOR_ORDER)
+fig = make_subplots(
+    rows=_n_rows,
+    cols=1,
+    shared_xaxes=True,
+    vertical_spacing=0.03,
+    subplot_titles=_PREDICTOR_ORDER,
+)
+
+for row_idx, name in enumerate(_PREDICTOR_ORDER, 1):
+    color = _COLORS[name]
+    fig.add_trace(
+        go.Scatter(
+            x=_ctx_dates,
+            y=_ctx_prices,
+            mode="lines",
+            name="Actual",
+            line={"color": "black", "width": 1.5},
+            showlegend=(row_idx == 1),
+            legendgroup="actual",
+        ),
+        row=row_idx,
+        col=1,
+    )
+
+    for origin in _origins:
+        pred = _h21.get((name, origin))
+        if pred is None:
+            continue
+        fc_date = pd.Timestamp(pred.forecast_date)
+        pt = pred.payload.point_forecast
+        lo = pred.payload.quantiles.get(0.1, pt)
+        hi = pred.payload.quantiles.get(0.9, pt)
+        fig.add_shape(
+            type="line",
+            x0=pd.Timestamp(origin),
+            x1=pd.Timestamp(origin),
+            y0=0,
+            y1=1,
+            yref="y domain",  # span the full height of this subplot's y-axis domain
+            xref=f"x{row_idx}" if row_idx > 1 else "x",
+            line={"dash": "dot", "color": "#cccccc", "width": 1},
+            row=row_idx,
+            col=1,
+        )
+        fig.add_trace(
+            go.Scatter(
+                x=[fc_date, fc_date],
+                y=[lo, hi],
+                mode="lines",
+                line={"color": color, "width": 4},
+                showlegend=False,
+                legendgroup=name,
+            ),
+            row=row_idx,
+            col=1,
+        )
+        fig.add_trace(
+            go.Scatter(
+                x=[fc_date],
+                y=[pt],
+                mode="markers",
+                marker={"color": color, "size": 9, "symbol": "diamond", "line": {"color": "white", "width": 1}},
+                name=name,
+                showlegend=False,
+                legendgroup=name,
+            ),
+            row=row_idx,
+            col=1,
+        )
+
+fig.update_layout(
+    title="h=21d forecasts — point (diamond) + 80% CI (bar)",
+    height=220 * _n_rows,
+    width=950,
+    showlegend=False,
+    margin={"t": 60, "b": 40},
+)
+for i in range(1, _n_rows + 1):
+    fig.update_yaxes(title_text="USD/bbl", title_font_size=10, row=i, col=1)
+fig.show()
+```
+
+## Cell 14 (markdown)
+
+---
+## 6. What the Agents Said — Rationale Comparison
+
+Each adaptive agent records its reasoning in the prediction metadata.
+Below are the rationales from the first eval origin for the untrained and trained
+agents — the clearest way to see whether the self-directed study session changed
+what the agent attends to and how it frames its uncertainty.
+
+## Cell 15 (code)
+
+```python
+from IPython.display import Markdown
+from IPython.display import display as ipy_display
+
+
+_first_origin = sorted({str(p.as_of.date()) for p in next(iter(all_eval_results.values())).predictions})[0]
+
+for name in ["Agent — untrained", "Agent — trained"]:
+    if name not in all_eval_results:
+        continue
+    preds = [p for p in all_eval_results[name].predictions if str(p.as_of.date()) == _first_origin]
+    if not preds:
+        continue
+    rationale = preds[0].metadata.get("rationale", "*(no rationale stored)*")
+    ipy_display(Markdown(f"### {name}\n*Origin: {_first_origin}*\n\n> {rationale.strip()}"))
+    print()
+```
+
+## Cell 16 (markdown)
+
+---
+## 7. Freeze Verification
+
+Confirm the evaluation did not trigger any skill state mutations.
+Checksums should match the pre-eval values recorded in Setup.
+
+## Cell 17 (code)
+
+```python
+print("State integrity check (both variants should be unchanged):")
+all_ok = True
+for name, d in ADAPTIVE_VARIANTS.items():
+    ck_after = state_checksum(d)
+    ok = ck_after == _pre_eval_checksums[name]
+    all_ok = all_ok and ok
+    print(f"  {name}: {'✓ unchanged' if ok else '⚠ MODIFIED'}")
+
+if not all_ok:
+    print("\nWarning: the agent updated its strategy during evaluation.")
+    print("See Closing Note for how to explore this intentionally.")
+```
+
+## Cell 18 (markdown)
+
+---
+## 8. Closing Note — Unfreezing
+
+The adaptive agents here were **frozen** during evaluation: no strategy updates
+during the eval period. This gives a clean before/after comparison.
+
+But in live deployment, you would not freeze the agent. After each resolved
+prediction, you would send a resolution message and let the agent decide whether
+to record an observation or update a hypothesis. Over time, the strategy evolves.
+
+**To explore unfreezing:**
+
+1. Set `RUN_EVAL = True`.
+2. Ignore the state-integrity warning in section 7 (the check only prints a warning; it does not assert).
+3. Modify the eval loop to send a resolution message after each prediction:
+
+```python
+resolution_msg = (
+    f'The actual WTI price on {pred.forecast_date.date()} was {actual:.2f}. '
+    f'Your point forecast was {pred.payload.point_forecast:.2f} '
+    f'(error: {pred.payload.point_forecast - actual:+.2f}). '
+    'Please review whether this outcome is relevant to any open hypothesis.'
+)
+await runner.run_text_async(resolution_msg)
+```
+
+4. Re-run and compare the final strategy state to the frozen baseline.
+
+To continue interactively with the trained agent, launch the ADK web interface:
+
+```bash
+cd implementations/energy_oil_forecasting
+WTI_STRATEGY_DIR=adaptive_agent/skills/wti-strategy-trained \
+    uv run adk web adaptive_agent/
+```
+
+Open `http://localhost:8000`. See Notebook 5 for suggested conversation starters.
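Note on the freeze check in the notebook above (sections 0 and 7): the idea of comparing a checksum of the strategy directory before and after evaluation can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's `state_checksum`, whose implementation is not shown in this patch. The helper names `dir_checksum` and `demo_freeze_check` are hypothetical, and hashing with SHA-256 over relative paths and file bytes is an assumed design.

```python
import hashlib
import tempfile
from pathlib import Path


def dir_checksum(root: Path) -> str:
    """Hash the relative path and bytes of every file under root, in sorted order.

    Hypothetical stand-in for a state checksum; the real helper may differ.
    """
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")  # separator so path and content cannot run together
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def demo_freeze_check() -> tuple[bool, bool]:
    """Return (unchanged_detected, modification_detected) on a throwaway directory."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "strategy.md").write_text("initial priors", encoding="utf-8")
        before = dir_checksum(root)
        # No writes yet: the checksum should be identical.
        unchanged = dir_checksum(root) == before
        # Simulate an agent mutating its strategy during evaluation.
        (root / "strategy.md").write_text("updated hypothesis", encoding="utf-8")
        modified = dir_checksum(root) != before
    return unchanged, modified


if __name__ == "__main__":
    print(demo_freeze_check())
```

Hashing the relative path along with the content means that renaming a file changes the checksum as well, which is usually what a freeze check should report.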
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__99_starter_agent.ipynb.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__99_starter_agent.ipynb.md
new file mode 100644
index 0000000..bc374c3
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__99_starter_agent.ipynb.md
@@ -0,0 +1,174 @@
+# Source: implementations/energy_oil_forecasting/99_starter_agent.ipynb
+
+kind: notebook
+
+## Cell 1 (markdown)
+
+# WTI Crude Oil — Your Starter Agent
+
+**If you're not sure what to do next, continue from here.**
+
+This notebook is a fresh, hackable agent for the WTI crude-oil use case — deliberately *not* wired into the numbered curriculum. It gives you our common building blocks behind simple toggles, so you can start building something of your own:
+
+- **optional news search** — bounded, cutoff-aware Google Search (proxy-only)
+- **optional code execution** — an E2B Python sandbox
+- **two lightweight skills** — *tool-usage playbooks* in `starter_agent/skills/`
+
+It does two things: lets you **talk to the agent** (open-ended, Track 2) and **score one real forecast** (Track 1). The live cells are gated by `RUN_AGENT` so a fresh `Run All` is safe and free; flip it to `True` to actually call the model.
+
+## Cell 2 (code)
+
+```python
+import warnings
+from pathlib import Path
+
+
+warnings.filterwarnings("ignore")
+
+import pandas as pd
+from dotenv import load_dotenv
+
+
+# Repo root holds the .env with PROXY_* creds the agent needs.
+ROOT = Path.cwd().resolve().parents[1]
+load_dotenv(ROOT / ".env")
+
+# ── Model selection ───────────────────────────────────
+# Two project models: "gemini-3.1-flash-lite-preview" (lite/default) and
+# "gemini-3.5-flash" (advanced). Lite is the default.
+AGENT_MODEL = "gemini-3.1-flash-lite-preview"
+# AGENT_MODEL = "gemini-3.5-flash"  # advanced (higher cost/latency)
+
+# ── Run guard ──────────────────────────────────────
+# Live agent calls cost tokens and need PROXY_* in the repo-root .env, plus warm
+# data caches. Default False so `Run All` is safe; set True to call the model.
+RUN_AGENT = False
+
+from energy_oil_forecasting.starter_agent import (
+    build_starter_agent_config,
+    build_starter_agent_predictor,
+)
+
+
+print("RUN_AGENT =", RUN_AGENT, "| model =", AGENT_MODEL)
+```
+
+## Cell 3 (markdown)
+
+---
+## 1. Meet your agent
+
+`build_starter_agent_config` returns an `AgentConfig` with two toggles. The default turns **news search on** (proxy-only, no extra key) and **code execution off** (it needs `E2B_API_KEY` and is slower). Flip them and re-run — the loaded skills follow the enabled tools.
+
+## Cell 4 (code)
+
+```python
+config = build_starter_agent_config(
+    model=AGENT_MODEL,
+    enable_search=True,  # ← cutoff-aware Google Search (proxy-only)
+    enable_code_exec=False,  # ← E2B Python sandbox (needs E2B_API_KEY); try True!
+)
+
+print("Agent:", config.name)
+print("Search enabled:    ", config.context_retrieval.enabled)
+print("Code-exec enabled: ", config.code_execution.enabled)
+print("Skills loaded:     ", [p.name for p in config.skills_dirs])
+print("\n── System instruction (edit this in starter_agent/agent.py) ──\n")
+print(config.instruction[:1200], "...")
+```
+
+## Cell 5 (markdown)
+
+---
+## 2. Talk to it  *(Track 2 — open-ended analysis)*
+
+Ask the agent anything. This is the interactive mode: no scoring, no schema — just reasoning (and a web search, since search is on). Edit the question and explore.
+
+## Cell 6 (code)
+
+```python
+from aieng.forecasting.methods.agentic import build_adk_agent
+from aieng.forecasting.methods.agentic.adk_runner import AdkTextRunner, AdkTextRunnerConfig
+
+
+QUESTION = (
+    "What are the two or three forces most likely to move WTI crude over the "
+    "next month, and which direction does each push? Be concise."
+)
+
+if RUN_AGENT:
+    chat_agent = build_adk_agent(config)  # schema-free: plain text in, text out
+    runner = AdkTextRunner(chat_agent, config=AdkTextRunnerConfig(app_name="wti_starter_chat"))
+    reply = await runner.run_text_async(QUESTION)  # noqa: F704, PLE1142
+    print(reply)
+else:
+    print("RUN_AGENT is False — set it to True in the setup cell to talk to the agent.")
+```
+
+## Cell 7 (markdown)
+
+---
+## 3. Score one prediction against known outcomes  *(Track 1)*
+
+Now run the agent as a `Predictor`. We pick the **most recent origin whose horizons have already resolved**, forecast it, and check whether each actual price landed inside the agent's 80% band — so you can see whether it was any good. (One origin can't tell you if the agent is *calibrated*; that's what the backtest in `04_systematic_backtest_eval.ipynb` is for.) Live, so gated by `RUN_AGENT`.
+
+## Cell 8 (code)
+
+```python
+if RUN_AGENT:
+    from aieng.forecasting.evaluation.task import ForecastingTask
+    from energy_oil_forecasting.data import WTI_SERIES_ID, build_wti_service, naive_utc_now
+
+    svc = build_wti_service()
+    full = svc.get_series(WTI_SERIES_ID, as_of=naive_utc_now())
+    full["timestamp"] = pd.to_datetime(full["timestamp"])
+    last_date = full["timestamp"].iloc[-1]
+
+    HORIZONS = [5, 10, 21]
+    # Most recent origin whose longest horizon has already resolved.
+    AS_OF = last_date - pd.offsets.BDay(max(HORIZONS) + 1)
+
+    task = ForecastingTask(
+        task_id="wti_starter_forecast",
+        target_series_id=WTI_SERIES_ID,
+        horizons=HORIZONS,
+        frequency="B",
+        description="WTI front-month futures — 5/10/21 business days ahead (starter).",
+    )
+    ctx = svc.context(as_of=AS_OF)
+    preds = build_starter_agent_predictor(config).predict(task, ctx)
+
+    def realized_at(h):
+        rows = full[full["timestamp"] >= AS_OF + pd.offsets.BDay(h)]
+        return float(rows["value"].iloc[0]) if not rows.empty else None
+
+    print(f"Origin as_of={AS_OF.date()}  (latest data {last_date.date()})\n")
+    print("   h    agent point   agent 80% CI           actual   in band?")
+    for i, h in enumerate(HORIZONS):
+        fc = preds[i].payload
+        lo, hi = fc.quantiles[0.10], fc.quantiles[0.90]
+        act = realized_at(h)
+        inb = "—" if act is None else ("yes ✓" if lo <= act <= hi else "no ✗")
+        acts = "  N/A" if act is None else f"${act:7.2f}"
+        print(f"  {h:>2}d   ${fc.point_forecast:7.2f}   [${lo:6.2f}, ${hi:6.2f}]   {acts}   {inb}")
+    if preds[0].metadata.get("rationale"):
+        print("\nRationale:", preds[0].metadata["rationale"][:300])
+else:
+    print("RUN_AGENT is False — set it to True to score a live forecast against known outcomes.")
+```
+
+## Cell 9 (markdown)
+
+---
+## 4. Make it yours
+
+This agent is a starting point. Here are concrete next steps, easiest first — each is a small edit, then re-run the cells above.
+
+1. **Flip code execution on.** Set `enable_code_exec=True` in §1 (needs `E2B_API_KEY`). The agent loads the `code-analysis-playbook` skill and can compute its own diagnostics before forecasting. Compare the rationale.
+2. **Edit the agent's personality.** Open `starter_agent/agent.py` and change `_build_starter_instruction()` — make it more cautious, more contrarian, focused on one driver. Re-run §1 to see the new instruction.
+3. **Sharpen the skills.** The two files in `starter_agent/skills/` are short on purpose. Add your best queries to `research-playbook`, or a new diagnostic to `code-analysis-playbook`. The agent picks them up automatically.
+4. **Change the question and the origin.** Try a different `QUESTION` in §2 and a different origin in §3.
+5. **Add a tool.** Give the agent a conventional forecast tool as a statistical anchor — see `analyst_agent.build_wti_tool_config` for the `ForecastTool` pattern.
+6. **Score it properly.** Run it across several origins with `backtest()` (see `04_systematic_backtest_eval.ipynb`) and compare CRPS against the baselines.
+
+Bigger ideas — an agent that *learns* a strategy (notebooks 05–06), news vs. no-news lift, live prospective forecasting — are in the use-case `README.md` and `planning-docs/roadmap.md`.
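Note on section 3 of the starter notebook above: a single origin shows whether one actual landed inside the 80% band, while calibration needs the hit rate over many origins. A minimal sketch of that hit rate follows. The function name `interval_coverage` is hypothetical and the numbers in the usage example are made up for illustration; they are not results from this repository.

```python
from collections.abc import Sequence


def interval_coverage(
    lows: Sequence[float],
    highs: Sequence[float],
    actuals: Sequence[float],
) -> float:
    """Fraction of actuals that fall inside [low, high], bounds inclusive."""
    if not (len(lows) == len(highs) == len(actuals)):
        raise ValueError("lows, highs and actuals must have the same length")
    if not actuals:
        raise ValueError("at least one resolved forecast is required")
    hits = sum(lo <= act <= hi for lo, hi, act in zip(lows, highs, actuals))
    return hits / len(actuals)


if __name__ == "__main__":
    # Illustrative values only (USD/bbl), one entry per forecast origin.
    lows = [60.0, 62.0, 58.0, 61.0, 59.0]
    highs = [70.0, 72.0, 68.0, 71.0, 69.0]
    actuals = [65.0, 75.0, 60.0, 71.0, 50.0]
    print(f"empirical coverage: {interval_coverage(lows, highs, actuals):.0%}")
```

For a well-calibrated 80% interval the empirical coverage should approach 0.8 as the number of origins grows; with only a handful of origins the estimate is very noisy.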
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__README.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__README.md.md
new file mode 100644
index 0000000..959dd3f
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__README.md.md
@@ -0,0 +1,132 @@
+# Source: implementations/energy_oil_forecasting/README.md
+
+kind: markdown
+
+# WTI Crude Oil Price Forecasting
+
+> **Reference implementation 3 of 4.** Recommended order: [getting_started](../getting_started/) → [S&P 500](../sp500_forecasting/) → [food CPI](../food_price_forecasting/) → **energy / WTI** → [BoC rate decisions](../boc_rate_decisions/). Each stands on its own.
+
+The **high-frequency, context-driven** reference implementation. Unlike long-horizon annual CPI forecasting, the daily resolution of oil markets makes genuinely prospective, real-time evaluation practical: you can lock an agent configuration today and measure its accuracy on unresolved horizons within weeks.
+
+WTI Crude Oil is highly liquid and sensitive to geopolitical risk, macroeconomic policy, and supply disruptions. This implementation works through a progression of forecasting approaches:
+
+1. **Statistical models** (Prophet) extrapolate trend and seasonality but are blind to regime-breaking news.
+2. **Context-aware agentic models** (bounded Google Search) adapt to shocks by reasoning over shipping lane closures, OPEC+ policy, and political escalation.
+3. **Code-executing agentic models** verify trends, compute rolling indicators, and self-calibrate intervals via sandboxed Python.
+
+---
+
+## Curriculum Structure
+
+The curriculum runs in two tracks. The **stateless track** (notebooks 01–04)
+builds up agentic forecasters whose configuration is fixed at definition time.
+The **adaptive-agent track** (notebooks 05–06) treats the forecaster as a
+persistent analyst that *learns* a strategy from data and is scored before vs
+after. Run the notebooks in order; notebook 1 is Prophet-only and agents are
+introduced in notebook 2.
+
+### Stateless capability track
+
+| Notebook | Focus | Agents? |
+|----------|-------|---------|
+| **[`01_wti_case_study.ipynb`](01_wti_case_study.ipynb)** | **The Case Study Narrative** — rolling Prophet backtest animation, annotated context chart, 2025 vs 2026 coverage punchline, futures curve | No |
+| **[`02_intro_agentic_predictor.ipynb`](02_intro_agentic_predictor.ipynb)** | **The Agentic Staircase** — 4 capability levels on Mar 2, 2026; inspect configs and prompts | Yes |
+| **[`03_one_agent_three_tasks.ipynb`](03_one_agent_three_tasks.ipynb)** | **One Agent, Three Tasks** — trajectory, binary shock, scenario analysis via shared agent identity | Yes |
+| **[`04_systematic_backtest_eval.ipynb`](04_systematic_backtest_eval.ipynb)** | **Systematic Competition** — 2025 backtest → leaderboard → 2026 protected eval | Yes |
+
+### Adaptive-agent track
+
+| Notebook | Focus | Agents? |
+|----------|-------|---------|
+| **[`05_adaptive_agent_training.ipynb`](05_adaptive_agent_training.ipynb)** | **Self-Directed Study** — the agent explores 2025 data over a multi-turn curriculum and writes a learned strategy into `adaptive_agent/skills/wti-strategy-trained/`. Defaults to `RUN_STUDY = False` (the study session is expensive); the trained strategy is committed so downstream notebooks run without re-training. | Yes |
+| **[`06_protected_eval.ipynb`](06_protected_eval.ipynb)** | **Protected Evaluation** — frozen before/after comparison of the untrained vs trained adaptive agent on the 2026 eval spec, alongside the stateless baselines from notebook 04. Defaults to `RUN_EVAL = False`; loads committed results otherwise. | Yes |
+
+### Side demo
+
+| Notebook | Focus | Agents? |
+|----------|-------|---------|
+| **[`05_forecast_tool_demo.ipynb`](05_forecast_tool_demo.ipynb)** | **The Forecast Tool** — a standalone demo (not part of the main sequence) of a conventional AutoARIMA function tool (`build_wti_tool_config`) as a controlled, auditable alternative to open-ended code execution | Yes |
+
+### Build your own
+
+| Notebook | Focus | Agents? |
+|----------|-------|---------|
+| **[`99_starter_agent.ipynb`](99_starter_agent.ipynb)** | **Your starter agent** — a fresh, hackable WTI agent (*not* part of the curriculum) with toggleable news search + code execution and two lightweight tool-usage skills. Interactive (Track 2) cell, one scored prediction (Track 1), and a "make it yours" guide. **If you're not sure what to do next, start here.** | Yes |
+
+An earlier set of information-session notebooks is archived in [`playground/energy_case_study/`](../../playground/energy_case_study/); the notebooks here are the maintained reference.
+
+---
+
+## The Forecasting Tasks
+
+Each forecasting origin defines a strict information cutoff (`as_of`). Predictors receive price history up to `as_of` and answer up to three tasks:
+
+### Task A: Trajectory Forecast (Track 1)
+
+- **Horizons:** 5, 10, 21 business days
+- **Output:** Point estimate + standard quantile grid (via `ContinuousAgentForecastOutput`)
+- **Evaluation:** CRPS and MAE (Notebook 4 backtest)
+
+### Task B: Binary Up-shock Probability (Track 1)
+
+- **Question:** P(WTI closes > $5/bbl higher in 5 business days)
+- **Output:** `DiscreteAgentForecastOutput` → `BinaryForecast`
+- **Evaluation:** Brier score (Notebook 3)
+
+### Task C: Scenario Analysis (Track 2)
+
+- **Output:** Three scenario cards with probabilities and 60-day ranges
+- **Evaluation:** Display / qualitative (Track 2 — not head-to-head scored in backtest)
+
+The **one-agent-three-tasks** pattern lives in [`tasks.py`](tasks.py): one `AgentConfig` identity, three `(prompt_builder, output_schema)` pairs via `build_wti_news_predictor(task)`.
+
+---
+
+## Module Layout
+
+```
+implementations/energy_oil_forecasting/
+├── data.py                 # build_wti_service(), WTI_SERIES_ID
+├── paths.py                # cache paths, demo origins, colour constants
+├── prophet_baseline.py     # ProphetPredictor, rolling backtest helpers
+├── viz.py                  # Plotly narrative charts
+├── analysis.py             # Brier, coverage, backtest scoring helpers
+├── tasks.py                # task specs, multitask prompt builders
+├── analyst_agent/          # stateless AgentConfig factories (agent identity only)
+├── adaptive_agent/         # the learning agent: strategy state, mutation tools, curriculum, seed/trained skills
+├── starter_agent/          # fresh, hackable agent template (toggleable search/code-exec + skills)
+├── specs/                  # YAML backtest + eval specs
+└── 01–06 notebooks (+ 05_forecast_tool_demo side demo, 99_starter_agent build-your-own)
+```
+
+`adaptive_agent/` holds `agent.py` (the adaptive `AgentConfig` + predictor factory),
+`skill_state.py` (`WtiStrategyState` — the mutable strategy: observations,
+hypotheses, calibration corrections, approach narrative), `skill_tools.py` (the
+mutation tools the agent calls to update its strategy under evidence governance),
+`curriculum/` (the 2025 weekly news context and cached study/eval snapshots), and
+`skills/` (the `wti-strategy` seed plus the `wti-strategy-trained` output of
+notebook 05).
+
+### Agent layering
+
+| Layer | Module | Owns |
+|-------|--------|------|
+| Package | `aieng.forecasting.methods.agentic` | `AgentPredictor`, `AgentConfig`, output schema base classes |
+| Stateless identity | `analyst_agent/agent.py` | Instructions, capability presets, skills — fixed at config time |
+| Role per task | `tasks.py` | Prompt builders, `build_wti_news_predictor(task)` |
+| Learning agent | `adaptive_agent/` | Persistent, mutable strategy state updated via self-directed study (notebooks 05–06) |
+
+---
+
+## Data Source & Setup
+
+We use Yahoo Finance `CL=F` — cached to `data/yfinance/` by `build_wti_service()`.
+
+Ensure your `.env` contains `GEMINI_API_KEY`. Agent notebook cells cache results under `data/`; delete cache files to force fresh runs.
+
+```bash
+uv sync
+uv run python scripts/fetch_wti.py   # optional: pre-populate WTI cache
+```
+
+Run `make lint` before pushing changes to this use case.
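Note on the scoring rules named in the README above (CRPS for Task A, Brier score for Task B): the sketch below shows one common way to compute them from plain Python values. It is an illustration under stated assumptions, not the scoring code in `aieng.forecasting` or `analysis.py`. In particular, approximating CRPS as twice the mean pinball loss over the quantile grid is an assumed technique, and all function names here are hypothetical.

```python
from collections.abc import Mapping, Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal in length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def pinball_loss(level: float, quantile_value: float, actual: float) -> float:
    """Quantile (pinball) loss for one quantile level in (0, 1)."""
    diff = actual - quantile_value
    return level * diff if diff >= 0 else (level - 1.0) * diff


def crps_from_quantiles(quantiles: Mapping[float, float], actual: float) -> float:
    """Approximate CRPS as 2x the mean pinball loss over the quantile grid.

    The approximation is reasonable for an evenly spaced grid of levels and
    becomes coarse when only a few quantiles are available.
    """
    if not quantiles:
        raise ValueError("at least one quantile is required")
    losses = [pinball_loss(level, value, actual) for level, value in quantiles.items()]
    return 2.0 * sum(losses) / len(losses)


if __name__ == "__main__":
    # Illustrative values only, not results from this repository.
    print(brier_score([0.8, 0.3], [1, 0]))
    print(crps_from_quantiles({0.1: 64.0, 0.5: 70.0, 0.9: 76.0}, actual=73.0))
```

With a single median quantile the approximation reduces to the absolute error, which is a quick sanity check on the factor of two.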
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting____init__.py.md
new file mode 100644
index 0000000..3a6fc5e
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting____init__.py.md
@@ -0,0 +1,7 @@
+# Source: implementations/energy_oil_forecasting/__init__.py
+
+kind: python
+
+```python
+
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent____init__.py.md
new file mode 100644
index 0000000..06d6089
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent____init__.py.md
@@ -0,0 +1,28 @@
+# Source: implementations/energy_oil_forecasting/adaptive_agent/__init__.py
+
+kind: python
+
+```python
+"""Adaptive WTI crude oil analyst agent module.
+
+Exports the :class:`AgentConfig` factory, prompt builder, and predictor
+convenience factory for the adaptive energy/oil reference implementation.
+"""
+
+from energy_oil_forecasting.adaptive_agent.agent import (
+    WtiAdaptiveForecastPromptBuilder,
+    build_wti_adaptive_config,
+    build_wti_adaptive_predictor,
+)
+from energy_oil_forecasting.adaptive_agent.skill_tools import build_skill_tools
+from energy_oil_forecasting.analyst_agent import compress_history
+
+
+__all__ = [
+    "WtiAdaptiveForecastPromptBuilder",
+    "build_skill_tools",
+    "build_wti_adaptive_config",
+    "build_wti_adaptive_predictor",
+    "compress_history",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__agent.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__agent.py.md
new file mode 100644
index 0000000..1041fba
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__agent.py.md
@@ -0,0 +1,415 @@
+# Source: implementations/energy_oil_forecasting/adaptive_agent/agent.py
+
+kind: python
+
+```python
+"""Adaptive WTI crude oil analyst agent.
+
+Unlike :mod:`energy_oil_forecasting.analyst_agent`, this agent is designed as
+a persistent entity: it maintains a living forecasting strategy through mutable
+skill files on the filesystem and handles multiple message types through a
+single chat interface.
+
+Provides:
+
+- :func:`build_wti_adaptive_config`: full adaptive agent — E2B code execution,
+  bounded web search, and five pipeline-component skills.
+- :class:`WtiAdaptiveForecastPromptBuilder`: prompt builder for prediction-request
+  messages, compatible with the existing backtest/eval harness.
+- :func:`build_wti_adaptive_predictor`: convenience factory wiring the adaptive
+  agent into an :class:`~aieng.forecasting.methods.agentic.predictor.AgentPredictor`
+  for comparison against stateless baselines in backtests.
+
+Skills
+------
+Skills are **pipeline components**, not end-to-end recipes. The agent composes
+them as needed, loading multiple skills before writing a single complete code block.
+
+``fetch-yfinance``
+    One-shot patterns for downloading market data from Yahoo Finance.
+
+``vol-regime``
+    Volatility regime classification and anomaly detection.
+
+``trend-projection``
+    Linear trend fitting, projection, and interval calibration.
+
+``wti-strategy``
+    The agent's current forecasting strategy (mutable).
+
+``meta-learning``
+    Governs when and how ``wti-strategy`` is updated.
+
+Code execution
+--------------
+Uses E2B (real sandbox). Each ``run_code`` call is a **fresh Python process** —
+no state, variables, or files carry over between calls. All imports, data
+fetching, and analysis must be in a single self-contained block.
+
+Skill mutability
+----------------
+The ``wti-strategy`` skill is backed by a :class:`~energy_oil_forecasting.adaptive_agent.skill_state.WtiStrategyState`
+Pydantic model persisted in ``skills/wti-strategy/skill_state.yaml``.
+``SKILL.md`` is rendered from that model on every mutation and is never
+hand-edited.  Five typed mutation tools (from :mod:`skill_tools`) are
+registered via ``AgentConfig(extra_tools=build_skill_tools(...))`` and run in the
+host process — not inside E2B.  See :mod:`skill_tools` for the full tool
+signatures and evidence governance rules.
+"""
+
+from __future__ import annotations
+
+import json
+import logging
+from pathlib import Path
+from typing import Any
+
+import pandas as pd
+from aieng.forecasting.data.context import ForecastContext
+from aieng.forecasting.evaluation.prediction import STANDARD_QUANTILES
+from aieng.forecasting.evaluation.task import ForecastingTask
+from aieng.forecasting.methods.agentic import (
+    AgentPredictor,
+    ContinuousAgentForecastOutput,
+    build_adk_agent,
+)
+from aieng.forecasting.methods.agentic.agent_factory import (
+    AgentConfig,
+    CodeExecutionConfig,
+    ContextRetrievalConfig,
+)
+from aieng.forecasting.models import ADVANCED_MODEL, LITE_MODEL
+from energy_oil_forecasting.adaptive_agent.skill_tools import build_skill_tools
+from energy_oil_forecasting.analyst_agent import compress_history
+from pydantic import BaseModel
+
+
+logger = logging.getLogger(__name__)
+
+
+# ---------------------------------------------------------------------------
+# Paths
+# ---------------------------------------------------------------------------
+
+_SKILLS_ROOT = Path(__file__).parent / "skills"
+
+
+# ---------------------------------------------------------------------------
+# System prompt
+# ---------------------------------------------------------------------------
+
+
+def _build_adaptive_analyst_instruction() -> str:
+    """Build the adaptive analyst instruction with the output schema embedded.
+
+    Uses ``ContinuousAgentForecastOutput.prompt_schema_json()`` for the
+    prediction-response schema so it stays in sync with the output class.
+    """
+    schema = ContinuousAgentForecastOutput.prompt_schema_json()
+    return (
+        "## Identity\n\n"
+        "You are a persistent WTI crude oil market analyst. You carry knowledge forward "
+        "across invocations: your `wti-strategy` skill captures your current forecasting "
+        "approach, and you update it deliberately as you learn from experience.\n\n"
+        "## Message types\n\n"
+        "You receive messages through a single chat interface. Determine from context "
+        "what kind of invocation this is and respond accordingly:\n\n"
+        "**Prediction request** — contains a JSON payload with `task`, `as_of`, "
+        "`horizons`, and price history. Load `wti-strategy` first to read your current "
+        "approach and any active calibration corrections. Then:\n"
+        "1. Use `run_code` to run your full statistical analysis pipeline: fetch data "
+        "via `fetch-yfinance` (using `end=as_of` as the cutoff), classify the vol "
+        "regime via `vol-regime`, and project trend and intervals via `trend-projection`. "
+        "Apply any calibration corrections from `wti-strategy` — for example, substituting "
+        "a flat-trend model in elevated/extreme vol regimes if your strategy calls for it.\n"
+        "2. Use the context-retrieval tool to gather current market news and adjust your "
+        "estimates where strong catalysts are present.\n"
+        "3. Conclude with `set_model_response` (schema below).\n\n"
+        "Your quantitative pipeline is your starting point — your learned strategy "
+        "corrections and news-grounded judgment shape the final forecast.\n\n"
+        "**Resolution** — describes how a past forecast resolved (actual value, error, "
+        "horizon). Reflect carefully. If the error points to a systematic pattern — not "
+        "a one-off surprise — consult `meta-learning` to assess whether a strategy update "
+        "is warranted.\n\n"
+        "**Self-review / backtesting** — you are asked to analyse your recent performance "
+        "or explore historical data using code execution. Compose the relevant skills, "
+        "write one complete code block, and summarise what you find. If the analysis "
+        "surfaces a durable insight, follow the `meta-learning` process.\n\n"
+        "**User question** — a human is asking for analysis, context, or your market "
+        "view. Engage directly, using code execution and web search as needed.\n\n"
+        "## Skills are pipeline components\n\n"
+        "Your skills cover specific pipeline stages. Compose them: for any task "
+        "involving code, load each relevant skill and its `references/examples.md`, "
+        "then write one complete self-contained code block combining all the patterns.\n\n"
+        "| Skill            | Pipeline stage                                          |\n"
+        "|------------------|---------------------------------------------------------|\n"
+        "| fetch-yfinance   | Download market / futures data from Yahoo Finance       |\n"
+        "| vol-regime       | Classify vol regime, detect anomalies, choose window    |\n"
+        "| trend-projection | Fit trend, project to horizons, calibrate intervals     |\n"
+        "| wti-strategy     | Your current forecasting strategy — load at the start of every prediction |\n"
+        "| meta-learning    | Governs when and how to update wti-strategy             |\n\n"
+        "## Strategy mutation tools\n\n"
+        "These tools write directly to `wti-strategy` on the host filesystem. "
+        "They run outside the E2B sandbox. Consult `meta-learning` before calling "
+        "any of them.\n\n"
+        "| Tool | Evidence layer | Evidence bar |\n"
+        "|------|---------------|---------------|\n"
+        "| `record_observation(finding, linked_hypothesis?)` | Observations | Pattern visible across ≥2 forecasts — not a single surprise |\n"
+        "| `open_hypothesis(claim, initial_evidence)` | Hypotheses | One strong observation suggesting a durable pattern |\n"
+        "| `record_hypothesis_outcome(hypothesis_id, outcome)` | Hypotheses | Each resolution relevant to an open hypothesis |\n"
+        "| `graduate_hypothesis(hypothesis_id, condition, adjustment, horizon_scope)` | Calibration | Tool enforces confirmation threshold — will reject if not met |\n"
+        "| `update_approach_narrative(new_text, rationale)` | Approach | Only when the calibration record reveals a structural insight |\n\n"
+        "Active calibration corrections from `wti-strategy` are **not optional** — "
+        "apply every listed correction when the stated condition is met.\n\n"
+        "## Code execution discipline\n\n"
+        "Treat `run_code` like submitting to a batch queue: plan your complete "
+        "analysis upfront, write one self-contained script, and read the results. "
+        "There is no REPL, no way to inspect intermediate state between calls, and "
+        "no benefit to splitting work — each submission starts from zero with no "
+        "memory of previous calls.\n\n"
+        "Never make a preliminary or test call to check connectivity or verify "
+        "imports. Assume the environment works. Your first `run_code` call should "
+        "produce your complete result.\n\n"
+        "Pre-installed: numpy, pandas, sklearn, yfinance, statsmodels, properscoring.\n\n"
+        "**Data sourcing rule:** Always use the `fetch-yfinance` skill to load price "
+        "data inside `run_code`. **Never embed `target_history_csv` or any CSV "
+        "string literal as a data source in code.** Pasting thousands of rows of "
+        "data as Python string literals is fragile, wastes context, and risks hitting "
+        "sandbox limits. `target_history_csv` is provided in the prediction payload "
+        "for your reading and statistical summary only — not for copy-pasting into "
+        "code blocks. When a skill description says 'assume `df` is already defined', "
+        "that means you should define `df` via a yfinance fetch at the top of your "
+        "script, not by embedding raw data.\n\n"
+        "## Temporal discipline\n\n"
+        "Every forecast is anchored to an `as_of` date. Never use information beyond "
+        "that date — in web search, code analysis, or reasoning.\n\n"
+        "When fetching data inside `run_code`, always pass `end=as_of_date` to "
+        "yfinance to enforce the temporal cutoff — for example:\n\n"
+        "```python\nraw = ticker.history(start='2004-01-01', end='2026-02-16', "
+        "auto_adjust=False)\n```\n\n"
+        "Replace the end date with the actual `as_of` value from the prediction "
+        "payload. This is the only correct way to ensure the sandbox sees the same "
+        "data the agent would have seen on that date.\n\n"
+        "## Prediction output schema\n\n"
+        "For **prediction requests**, call `set_model_response` with `json_response` "
+        "matching **exactly**:\n\n"
+        "```json\n" + schema + "\n```\n\n"
+        'Critical: use `"horizon"` (integer, not `"horizon_days"`). '
+        '`"quantiles"` is a **list** of `{"quantile": <level>, "value": <price>}` '
+        "objects — not a dict."
+    )
+
+
+_ADAPTIVE_ANALYST_INSTRUCTION = _build_adaptive_analyst_instruction()
+
+
+# ---------------------------------------------------------------------------
+# Context retrieval instruction
+# ---------------------------------------------------------------------------
+
+_WTI_CONTEXT_RETRIEVAL_INSTRUCTION = """\
+You are an oil market intelligence specialist with access to web search.
+
+Search for information relevant to the query and return a concise structured \
+markdown summary (3-5 paragraphs) covering relevant aspects of:
+- WTI/Brent crude price level and recent trend
+- OPEC+ production decisions and supply outlook
+- Geopolitical risks in the Persian Gulf, Middle East, key shipping lanes
+- US Strategic Petroleum Reserve and energy policy signals
+- Notable tanker/shipping incidents or supply disruption signals
+- Published analyst forecasts or unusual price-target revisions
+
+Ground your summary in the search results you actually retrieve. \
+When a cutoff date is specified, do not report or speculate about events \
+that occurred after that date.\
+"""
+
+
+# ---------------------------------------------------------------------------
+# Prompt builder — prediction requests
+# ---------------------------------------------------------------------------
+
+
+class WtiAdaptiveForecastPromptBuilder(BaseModel):
+    """Prompt builder for prediction-request messages to the adaptive agent.
+
+    Produces a structured JSON payload containing the compressed price history
+    and key summary statistics.  The agent runs its own full statistical
+    pipeline (fetch-yfinance → vol-regime → trend-projection) inside the E2B
+    sandbox, applies calibration corrections from its ``wti-strategy`` skill,
+    and incorporates news context from web search before returning its forecast.
+
+    For resolution, self-review, and user-question invocations, construct
+    plain-text messages directly and send them via the ADK runner.
+    """
+
+    model_config = {"extra": "forbid"}
+
+    def __call__(self, *, task: ForecastingTask, context: ForecastContext) -> str:
+        df = context.get_series(task.target_series_id)
+        compressed = compress_history(df)
+
+        last_row = df.iloc[-1]
+        last_close = float(last_row["value"])
+        last_date = str(pd.Timestamp(last_row["timestamp"]).date())
+        trailing_252 = df["value"].tail(252)
+
+        payload: dict[str, Any] = {
+            "task": task.task_id,
+            "as_of": str(context.as_of)[:10],
+            "horizons": list(task.horizons),
+            "standard_quantiles": list(STANDARD_QUANTILES),
+            "target_summary": {
+                "last_close_usd_bbl": last_close,
+                "last_date": last_date,
+                "n_trading_days": int(len(df)),
+                "52w_high": float(trailing_252.max()),
+                "52w_low": float(trailing_252.min()),
+            },
+            "target_history_csv": compressed,
+        }
+
+        return json.dumps(payload, indent=2)
+
+
+# ---------------------------------------------------------------------------
+# AgentConfig factory
+# ---------------------------------------------------------------------------
+
+
+def build_wti_adaptive_config(
+    model: str = ADVANCED_MODEL,
+    search_model: str = LITE_MODEL,
+    max_output_tokens: int = 16_384,
+    strategy_dir: Path | None = None,
+) -> AgentConfig:
+    """Build the full adaptive WTI analyst :class:`AgentConfig`.
+
+    Combines E2B code execution, bounded Google Search with temporal cutoff
+    enforcement, and five skills: ``fetch-yfinance``, ``vol-regime``,
+    ``trend-projection``, the selected strategy skill, and ``meta-learning``.
+
+    Parameters
+    ----------
+    model : str
+        Model for the top-level analyst agent.
+    search_model : str
+        Model for the context-retrieval (web-search) sub-tool. Defaults to the
+        lite model (``gemini-3.1-flash-lite-preview``) independently of ``model`` (the
+        advanced model) so web search stays cheap while the analyst reasons
+        with more capability.
+    max_output_tokens : int, default=16_384
+        Maximum tokens per model response. Set above LiteLLM's OpenAI-compatible
+        default of 4096 so the agent can write a complete ``run_code`` Python
+        script in a single function call without truncation.
+    strategy_dir : Path or None, default=None
+        Directory containing the strategy skill (``skill_state.yaml``,
+        ``SKILL.md``).  Defaults to ``skills/wti-strategy`` (the base variant).
+        Pass an alternative path (e.g. ``skills/wti-strategy-trained``) to
+        instantiate the trained variant after a self-directed study session.
+        The same directory is used for both the ADK skill load and the mutation
+        tool bindings, ensuring the tools always write to the skill the agent
+        is reading.
+
+    Returns
+    -------
+    AgentConfig
+    """
+    resolved_strategy_dir = strategy_dir or (_SKILLS_ROOT / "wti-strategy")
+    # Include strategy dir name in agent name so cached_multi_backtest writes a
+    # separate cache file per variant (cache key is derived from predictor_id,
+    # which is derived from agent name).
+    agent_name = f"wti_adaptive_analyst_{resolved_strategy_dir.name.replace('-', '_')}"
+    return AgentConfig(
+        name=agent_name,
+        model=model,
+        instruction=_ADAPTIVE_ANALYST_INSTRUCTION,
+        max_output_tokens=max_output_tokens,
+        context_retrieval=ContextRetrievalConfig(
+            enabled=True,
+            instruction=_WTI_CONTEXT_RETRIEVAL_INSTRUCTION,
+            search_model=search_model,
+        ),
+        code_execution=CodeExecutionConfig(enabled=True),
+        skills_dirs=[
+            _SKILLS_ROOT / "fetch-yfinance",
+            _SKILLS_ROOT / "vol-regime",
+            _SKILLS_ROOT / "trend-projection",
+            resolved_strategy_dir,
+            _SKILLS_ROOT / "meta-learning",
+        ],
+        extra_tools=build_skill_tools(resolved_strategy_dir, confirmation_threshold=2),
+    )
+
+
+# ---------------------------------------------------------------------------
+# Predictor convenience factory
+# ---------------------------------------------------------------------------
+
+
+def build_wti_adaptive_predictor(
+    config: AgentConfig | None = None,
+    strategy_dir: Path | None = None,
+    model: str = ADVANCED_MODEL,
+) -> AgentPredictor:
+    """Wrap the adaptive agent in an :class:`AgentPredictor` for eval harness use.
+
+    At each forecast origin the predictor sends a prediction-request payload to
+    the agent.  The agent runs its full statistical pipeline (fetch-yfinance →
+    vol-regime → trend-projection) in the E2B sandbox, applies calibration
+    corrections from its ``wti-strategy`` skill, incorporates news context, and
+    returns a probabilistic forecast.
+
+    For resolution delivery and self-review invocations — the interactions
+    through which the agent actually learns — use the ADK runner directly
+    rather than this predictor interface.
+
+    Parameters
+    ----------
+    config : AgentConfig, optional
+        Agent config to use.  When provided, ``strategy_dir`` is ignored.
+        Defaults to ``build_wti_adaptive_config(strategy_dir=strategy_dir)``.
+    strategy_dir : Path or None, optional
+        Strategy directory passed to :func:`build_wti_adaptive_config` when
+        ``config`` is not provided.  Defaults to ``skills/wti-strategy``.
+    model : str, optional
+        Model identifier passed to :func:`build_wti_adaptive_config` when
+        ``config`` is not provided.
+
+    Returns
+    -------
+    AgentPredictor
+    """
+    if config is None:
+        config = build_wti_adaptive_config(model=model, strategy_dir=strategy_dir)
+    return AgentPredictor(
+        agent_config=config,
+        prompt_builder=WtiAdaptiveForecastPromptBuilder(),
+        output_schema=ContinuousAgentForecastOutput,
+    )
+
+
+# ---------------------------------------------------------------------------
+# Lazy root_agent for `adk web` interactive use
+# ---------------------------------------------------------------------------
+
+
+def __getattr__(name: str) -> Any:
+    r"""Expose ``root_agent`` lazily for schema-free interactive use via ``adk web``.
+
+    By default the agent loads the seed strategy (``wti-strategy``).  To load
+    a different strategy — e.g. after a training session — set the
+    ``WTI_STRATEGY_DIR`` environment variable to an absolute path, or a path
+    relative to the current working directory, before launching::
+
+        WTI_STRATEGY_DIR=adaptive_agent/skills/wti-strategy-trained \
+            uv run adk web adaptive_agent/
+    """
+    if name == "root_agent":
+        import os  # noqa: PLC0415
+
+        strategy_env = os.environ.get("WTI_STRATEGY_DIR")
+        strategy_dir = Path(strategy_env) if strategy_env else None
+        return build_adk_agent(build_wti_adaptive_config(strategy_dir=strategy_dir))
+    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__curriculum__snapshot_utils.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__curriculum__snapshot_utils.py.md
new file mode 100644
index 0000000..b6ffac2
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__curriculum__snapshot_utils.py.md
@@ -0,0 +1,111 @@
+# Source: implementations/energy_oil_forecasting/adaptive_agent/curriculum/snapshot_utils.py
+
+kind: python
+
+```python
+"""Skill state snapshot helpers for the adaptive agent training notebooks.
+
+These utilities let NB05 and NB06 safely snapshot the strategy state before
+a training or evaluation run, and restore it afterward if needed.
+
+Snapshot contract
+-----------------
+Before any activity that mutates a strategy dir, call ``snapshot_state()``.
+It copies ``skill_state.yaml`` → ``skill_state_pretrain.yaml`` inside the
+same strategy directory.  If a snapshot already exists, the call is a no-op
+(safe to re-run).
+
+To undo all mutations (e.g. before repeating a training run from scratch),
+call ``restore_state()``.  It copies the pretrain snapshot back over
+``skill_state.yaml`` and re-renders ``SKILL.md`` via ``AdaptiveSkillStore``.
+"""
+
+from __future__ import annotations
+
+import shutil
+from pathlib import Path
+
+from aieng.forecasting.methods.agentic.adaptive_skill import AdaptiveSkillStore
+from energy_oil_forecasting.adaptive_agent.skill_state import WtiStrategyState
+
+
+_YAML_FILENAME = "skill_state.yaml"
+_PRETRAIN_FILENAME = "skill_state_pretrain.yaml"
+
+
+def snapshot_state(strategy_dir: Path, *, overwrite: bool = False) -> Path:
+    """Copy ``skill_state.yaml`` to ``skill_state_pretrain.yaml``.
+
+    Parameters
+    ----------
+    strategy_dir : Path
+        The strategy skill directory (e.g. ``skills/wti-strategy-stats``).
+    overwrite : bool, default=False
+        If ``False`` and a snapshot already exists, the call is a no-op.
+        Set ``True`` to force-overwrite an existing snapshot.
+
+    Returns
+    -------
+    Path
+        Path to the snapshot file.
+    """
+    src = strategy_dir / _YAML_FILENAME
+    dst = strategy_dir / _PRETRAIN_FILENAME
+    if dst.exists() and not overwrite:
+        print(f"  [snapshot_state] Snapshot already exists: {dst.name} — skipping.")
+        return dst
+    shutil.copy2(src, dst)
+    print(f"  [snapshot_state] Saved → {dst.relative_to(strategy_dir.parent.parent)}")
+    return dst
+
+
+def restore_state(strategy_dir: Path) -> None:
+    """Restore ``skill_state.yaml`` from the pre-training snapshot and re-render ``SKILL.md``.
+
+    Parameters
+    ----------
+    strategy_dir : Path
+        The strategy skill directory to restore.
+
+    Raises
+    ------
+    FileNotFoundError
+        If no pretrain snapshot exists (``snapshot_state`` was never called).
+    """
+    src = strategy_dir / _PRETRAIN_FILENAME
+    if not src.exists():
+        raise FileNotFoundError(
+            f"No pretrain snapshot found at {src}. Call snapshot_state() before running training activities."
+        )
+    dst = strategy_dir / _YAML_FILENAME
+    shutil.copy2(src, dst)
+    # Re-render SKILL.md from the restored YAML
+    store: AdaptiveSkillStore[WtiStrategyState] = AdaptiveSkillStore(
+        skill_dir=strategy_dir,
+        state_type=WtiStrategyState,
+    )
+    state = store.load()
+    store.save(state)
+    print(f"  [restore_state] Restored {strategy_dir.name} from pretrain snapshot.")
+
+
+def state_checksum(strategy_dir: Path) -> str:
+    """Return a content hash of ``skill_state.yaml`` for before/after comparison.
+
+    Used in NB06 to verify the agent's state was not mutated during evaluation.
+
+    Parameters
+    ----------
+    strategy_dir : Path
+        The strategy skill directory.
+
+    Returns
+    -------
+    str
+        Hex digest of the YAML content.
+    """
+    import hashlib  # noqa: PLC0415
+
+    content = (strategy_dir / _YAML_FILENAME).read_bytes()
+    return hashlib.sha256(content).hexdigest()
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skill_state.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skill_state.py.md
new file mode 100644
index 0000000..d897e5c
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skill_state.py.md
@@ -0,0 +1,216 @@
+# Source: implementations/energy_oil_forecasting/adaptive_agent/skill_state.py
+
+kind: python
+
+```python
+"""WTI forecasting strategy state model.
+
+Defines the structured state backing the ``wti-strategy`` adaptive skill.
+``WtiStrategyState`` is the single source of truth for the agent's current
+forecasting approach.  It is persisted to ``skills/wti-strategy/skill_state.yaml``
+and rendered to ``skills/wti-strategy/SKILL.md`` on every mutation so that the
+ADK ``SkillToolset`` always reads an up-to-date version.
+
+Learning layers
+---------------
+Four fields of ``WtiStrategyState`` map to distinct update frequencies and
+evidence burdens, enforced partly by the mutation tools in ``skill_tools.py``
+and partly by the ``meta-learning`` governance skill:
+
+``observations``
+    Append-only log of pattern-level findings.  Lowest evidence bar — record
+    any finding that is not a single-outlier surprise.
+
+``hypotheses``
+    Candidate systematic corrections the agent is actively testing.  Open a
+    hypothesis when you suspect a durable pattern.  Accumulate confirmation /
+    refutation counts across resolutions.  A hypothesis graduates to a
+    calibration correction when its confirmation count reaches the store's
+    ``confirmation_threshold``.
+
+``calibration_corrections``
+    Confirmed systematic adjustments applied at prediction time.  Each entry
+    is graduated from a confirmed hypothesis — never added directly.
+
+``approach_narrative``
+    Free-text description of the agent's overall forecasting philosophy.
+    Highest evidence bar.  Update only when the calibration record reveals a
+    structural insight that the narrative no longer captures.
+"""
+
+from __future__ import annotations
+
+from typing import Literal
+
+from aieng.forecasting.methods.agentic.adaptive_skill import AdaptiveSkillState
+from pydantic import BaseModel
+
+
+# ---------------------------------------------------------------------------
+# Sub-models
+# ---------------------------------------------------------------------------
+
+
+class Observation(BaseModel):
+    """A single pattern-level finding from a resolution or self-review."""
+
+    date: str
+    finding: str
+    linked_hypothesis: str | None = None
+
+
+class Hypothesis(BaseModel):
+    """A candidate systematic correction under active testing.
+
+    ``status`` progresses through ``open`` → ``confirmed`` or ``open`` →
+    ``refuted``.  Confirmed hypotheses are graduated to
+    :class:`CalibrationCorrection` via the ``graduate_hypothesis`` tool.
+    """
+
+    id: str
+    claim: str
+    status: Literal["open", "confirmed", "refuted"] = "open"
+    confirmations: int = 0
+    refutations: int = 0
+    opened_on: str
+
+
+class CalibrationCorrection(BaseModel):
+    """A confirmed systematic adjustment applied at prediction time.
+
+    Every entry here was graduated from a confirmed hypothesis; the
+    ``source_hypothesis`` field preserves that lineage.
+    """
+
+    condition: str
+    adjustment: str
+    horizon_scope: str
+    source_hypothesis: str
+    confirmed_on: str
+
+
+class VersionEntry(BaseModel):
+    """One row in the version history table."""
+
+    date: str
+    description: str
+
+
+# ---------------------------------------------------------------------------
+# Strategy state
+# ---------------------------------------------------------------------------
+
+
+class WtiStrategyState(AdaptiveSkillState):
+    """Structured state for the adaptive WTI crude oil forecasting strategy.
+
+    See module docstring for the learning-layer hierarchy and evidence burdens.
+    """
+
+    approach_narrative: str
+    calibration_corrections: list[CalibrationCorrection] = []
+    hypotheses: list[Hypothesis] = []
+    observations: list[Observation] = []
+    version_history: list[VersionEntry] = []
+
+    def build_markdown(self, skill_name: str | None = None) -> str:  # noqa: PLR0912
+        """Render the full ``SKILL.md`` content from current state."""
+        lines: list[str] = []
+
+        # Frontmatter — skill_name must match the containing directory name (ADK requirement)
+        lines += [
+            "---",
+            f"name: {skill_name or 'wti-strategy'}",
+            "description: >-",
+            "  The adaptive WTI analyst's current forecasting strategy. Load this at the",
+            "  start of every prediction task. This file is generated — edit the state",
+            "  through the mutation tools, not by hand.",
+            "---",
+            "",
+        ]
+
+        lines += [
+            "# WTI Forecasting Strategy",
+            "",
+            "## Approach",
+            "",
+            self.approach_narrative.strip(),
+            "",
+        ]
+
+        # Active calibration corrections
+        lines += [
+            "## Active calibration corrections",
+            "",
+        ]
+        if self.calibration_corrections:
+            lines += [
+                "| Condition | Adjustment | Horizon scope | Confirmed on |",
+                "|-----------|-----------|---------------|--------------|",
+            ]
+            for c in self.calibration_corrections:
+                lines.append(f"| {c.condition} | {c.adjustment} | {c.horizon_scope} | {c.confirmed_on} |")
+        else:
+            lines.append("*(No calibration corrections yet. Graduate a confirmed hypothesis to add one.)*")
+        lines.append("")
+
+        # Open hypotheses
+        lines += [
+            "## Open hypotheses",
+            "",
+        ]
+        open_hyps = [h for h in self.hypotheses if h.status == "open"]
+        if open_hyps:
+            lines += [
+                "| ID | Claim | Confirmations | Refutations |",
+                "|----|-------|---------------|-------------|",
+            ]
+            for h in open_hyps:
+                lines.append(f"| {h.id} | {h.claim} | {h.confirmations} | {h.refutations} |")
+        else:
+            lines.append("*(No open hypotheses.)*")
+        lines.append("")
+
+        # Closed hypotheses (confirmed / refuted) — collapsed for readability
+        closed_hyps = [h for h in self.hypotheses if h.status != "open"]
+        if closed_hyps:
+            lines += [
+                "## Closed hypotheses",
+                "",
+                "| ID | Claim | Status | Confirmations | Refutations |",
+                "|----|-------|--------|---------------|-------------|",
+            ]
+            for h in closed_hyps:
+                lines.append(f"| {h.id} | {h.claim} | {h.status} | {h.confirmations} | {h.refutations} |")
+            lines.append("")
+
+        # Observations
+        lines += [
+            "## Observations",
+            "",
+        ]
+        if self.observations:
+            lines += [
+                "| Date | Finding | Linked hypothesis |",
+                "|------|---------|-------------------|",
+            ]
+            for o in self.observations:
+                linked = o.linked_hypothesis or "—"
+                lines.append(f"| {o.date} | {o.finding} | {linked} |")
+        else:
+            lines.append("*(No observations yet. Record findings from resolutions and self-reviews.)*")
+        lines.append("")
+
+        # Version history
+        lines += [
+            "## Version history",
+            "",
+            "| Date | Change |",
+            "|------|--------|",
+        ]
+        for v in self.version_history:
+            lines.append(f"| {v.date} | {v.description} |")
+        lines.append("")
+
+        return "\n".join(lines)
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skill_tools.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skill_tools.py.md
new file mode 100644
index 0000000..8469396
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skill_tools.py.md
@@ -0,0 +1,425 @@
+# Source: implementations/energy_oil_forecasting/adaptive_agent/skill_tools.py
+
+kind: python
+
+```python
+"""Mutation tools for the ``wti-strategy`` adaptive skill.
+
+These are plain Python callables registered as ADK ``FunctionTool`` objects via
+``AgentConfig(extra_tools=build_skill_tools(strategy_dir))``.  They run in the
+host process — *not* inside the E2B sandbox — so they can read and write the
+skill directory on the local filesystem.
+
+Factory pattern
+---------------
+Use :func:`build_skill_tools` to create a set of tools bound to a specific
+strategy directory.  This allows multiple named strategy variants (e.g.
+``wti-strategy-stats``, ``wti-strategy-news``) to coexist, each with its own
+``AdaptiveSkillStore`` and ``skill_state.yaml``.
+
+The module-level :data:`STORE` and :data:`WTI_SKILL_TOOLS` are convenience
+bindings to the **default** ``wti-strategy`` directory for backward
+compatibility and interactive ``adk web`` use.
+
+Design principles
+-----------------
+Each tool follows the same three-step cycle:
+
+1. ``store.load()`` — deserialise current state from ``skill_state.yaml``.
+2. Apply one typed mutation to the state model.
+3. ``store.save(state)`` — write YAML, re-render ``SKILL.md``, back up.
+
+Tool signatures are intentionally narrow: they accept only the arguments
+needed for one specific mutation.  The agent cannot write arbitrary content to
+the skill directory through any of these tools.
+
+Evidence governance
+-------------------
+``record_observation``
+    No guard.  Record any pattern-level finding (not a single-outlier surprise).
+
+``open_hypothesis``
+    No guard.  Open a hypothesis whenever you suspect a durable pattern.
+
+``record_hypothesis_outcome``
+    Validates the hypothesis ID exists and is still open.
+
+``graduate_hypothesis``
+    Hard guard: rejected if ``hypothesis.confirmations < store.confirmation_threshold``.
+    Returns a clear message stating the shortfall.
+
+``update_approach_narrative``
+    Requires a ``rationale`` argument but no hard numeric guard.  The
+    ``meta-learning`` skill governs when this is appropriate.
+
+Scope guard
+-----------
+All writes go through the :class:`~aieng.forecasting.methods.agentic.adaptive_skill.AdaptiveSkillStore`
+instance passed to :func:`build_skill_tools`.  No path outside that directory
+can be reached.
+"""
+
+from __future__ import annotations
+
+from datetime import date
+from pathlib import Path
+from typing import Callable
+
+from aieng.forecasting.methods.agentic.adaptive_skill import AdaptiveSkillStore
+from energy_oil_forecasting.adaptive_agent.skill_state import (
+    CalibrationCorrection,
+    Hypothesis,
+    Observation,
+    VersionEntry,
+    WtiStrategyState,
+)
+
+
+# ---------------------------------------------------------------------------
+# Default strategy directory (for backward compat and adk web)
+# ---------------------------------------------------------------------------
+
+_SKILL_DIR = Path(__file__).parent / "skills" / "wti-strategy"
+
+
+# ---------------------------------------------------------------------------
+# Stateless helpers
+# ---------------------------------------------------------------------------
+
+
+def _today() -> str:
+    return str(date.today())
+
+
+def _next_hypothesis_id(state: WtiStrategyState) -> str:
+    """Return the next sequential hypothesis ID (e.g. ``hyp-004``)."""
+    n = len(state.hypotheses) + 1
+    return f"hyp-{n:03d}"
+
+
+# ---------------------------------------------------------------------------
+# Tool factory
+# ---------------------------------------------------------------------------
+
+
+def build_skill_tools(  # noqa: PLR0915
+    strategy_dir: Path,
+    *,
+    confirmation_threshold: int = 3,
+) -> list[Callable[..., str]]:
+    """Build a set of strategy mutation tools bound to *strategy_dir*.
+
+    Each call returns five fresh callables (closures over a new
+    :class:`~aieng.forecasting.methods.agentic.adaptive_skill.AdaptiveSkillStore`
+    instance).  Pass the returned list to
+    ``AgentConfig(extra_tools=build_skill_tools(strategy_dir))`` to wire the
+    tools into an agent that operates on a specific strategy variant.
+
+    Parameters
+    ----------
+    strategy_dir : Path
+        Directory containing the strategy skill (``skill_state.yaml``,
+        ``SKILL.md``, ``.history/``).  Must exist and be a directory.
+    confirmation_threshold : int, default=3
+        Number of confirming hypothesis outcomes required before
+        ``graduate_hypothesis`` is permitted.
+
+    Returns
+    -------
+    list[Callable[..., str]]
+        ``[record_observation, open_hypothesis, record_hypothesis_outcome,
+        graduate_hypothesis, update_approach_narrative]``
+    """
+    store: AdaptiveSkillStore[WtiStrategyState] = AdaptiveSkillStore(
+        skill_dir=strategy_dir,
+        state_type=WtiStrategyState,
+        confirmation_threshold=confirmation_threshold,
+    )
+
+    def record_observation(finding: str, linked_hypothesis: str = "") -> str:
+        """Record a pattern-level finding from a resolution or self-review.
+
+        Call this whenever you observe a systematic pattern across multiple
+        forecasts — not after a single surprising outcome.
+
+        Parameters
+        ----------
+        finding : str
+            A concise description of the pattern observed.  Be specific: include
+            the regime, horizon, and direction of the error where applicable.
+            Example: "80% intervals missed 4 of 5 actuals in the elevated vol
+            regime at the 21-day horizon."
+        linked_hypothesis : str, optional
+            ID of an existing open hypothesis this observation supports or
+            refutes (e.g. ``"hyp-001"``).  Leave blank if this is a fresh
+            observation not yet linked to any hypothesis.
+
+        Returns
+        -------
+        str
+            Confirmation message.
+        """
+        state = store.load()
+        obs = Observation(
+            date=_today(),
+            finding=finding.strip(),
+            linked_hypothesis=linked_hypothesis.strip() or None,
+        )
+        state.observations.append(obs)
+        store.save(state)
+        linked_note = f" (linked to {obs.linked_hypothesis})" if obs.linked_hypothesis else ""
+        return f'Observation recorded{linked_note}: "{finding[:80]}{"..." if len(finding) > 80 else ""}"'
+
+    def open_hypothesis(claim: str, initial_evidence: str) -> str:
+        """Open a new hypothesis about a suspected systematic forecasting pattern.
+
+        A hypothesis is a candidate calibration correction under active testing.
+        Open one when you have at least one observation suggesting a durable
+        pattern but do not yet have enough confirming resolutions to graduate it.
+
+        Parameters
+        ----------
+        claim : str
+            A testable claim about your forecasting behaviour.  State it in terms
+            of a specific condition and a directional error.
+            Example: "My 80% prediction intervals are consistently too narrow
+            when the vol regime is classified as elevated or extreme."
+        initial_evidence : str
+            The observation(s) that motivated opening this hypothesis.  This is
+            for the audit record — be specific about the number of data points.
+
+        Returns
+        -------
+        str
+            Confirmation message including the assigned hypothesis ID.
+        """
+        state = store.load()
+        hyp_id = _next_hypothesis_id(state)
+        hyp = Hypothesis(
+            id=hyp_id,
+            claim=claim.strip(),
+            status="open",
+            confirmations=0,
+            refutations=0,
+            opened_on=_today(),
+        )
+        state.hypotheses.append(hyp)
+        obs = Observation(
+            date=_today(),
+            finding=initial_evidence.strip(),
+            linked_hypothesis=hyp_id,
+        )
+        state.observations.append(obs)
+        store.save(state)
+        return (
+            f'Hypothesis {hyp_id} opened: "{claim[:80]}{"..." if len(claim) > 80 else ""}". '
+            f"Initial evidence recorded as an observation linked to {hyp_id}. "
+            f"Confirmations needed to graduate: {store.confirmation_threshold}."
+        )
+
+    def record_hypothesis_outcome(hypothesis_id: str, outcome: str) -> str:
+        """Record a confirming or refuting outcome for an open hypothesis.
+
+        Call this after each resolution where the outcome is directly relevant
+        to an open hypothesis.  Accumulate enough confirmations to graduate.
+
+        Parameters
+        ----------
+        hypothesis_id : str
+            ID of the hypothesis to update (e.g. ``"hyp-001"``).
+        outcome : str
+            Either ``"confirmed"`` or ``"refuted"``.  A single refutation does
+            not automatically close the hypothesis — continue accumulating
+            evidence.  A hypothesis should be manually closed (status →
+            ``"refuted"``) only when refutations clearly outweigh confirmations
+            across a meaningful sample.
+
+        Returns
+        -------
+        str
+            Updated confirmation / refutation counts and progress toward the
+            graduation threshold.
+        """
+        if outcome not in ("confirmed", "refuted"):
+            return f"Invalid outcome '{outcome}'. Must be 'confirmed' or 'refuted'."
+
+        state = store.load()
+        hyp = next((h for h in state.hypotheses if h.id == hypothesis_id), None)
+        if hyp is None:
+            ids = [h.id for h in state.hypotheses]
+            return f"Hypothesis '{hypothesis_id}' not found. Known IDs: {ids}."
+        if hyp.status != "open":
+            return f"Hypothesis {hypothesis_id} is already {hyp.status}. Only open hypotheses can receive new outcomes."
+
+        if outcome == "confirmed":
+            hyp.confirmations += 1
+        else:
+            hyp.refutations += 1
+
+        store.save(state)
+
+        remaining = max(0, store.confirmation_threshold - hyp.confirmations)
+        if remaining == 0:
+            ready_msg = (
+                f" Ready to graduate — call graduate_hypothesis('{hypothesis_id}', ...) "
+                "with a condition, adjustment, and horizon_scope."
+            )
+        else:
+            ready_msg = f" {remaining} more confirmation(s) needed to graduate."
+
+        return (
+            f"{hypothesis_id} updated: {hyp.confirmations} confirmation(s), {hyp.refutations} refutation(s).{ready_msg}"
+        )
+
+    def graduate_hypothesis(
+        hypothesis_id: str,
+        condition: str,
+        adjustment: str,
+        horizon_scope: str,
+    ) -> str:
+        """Graduate a confirmed hypothesis to an active calibration correction.
+
+        This is the primary mechanism through which the agent's strategy
+        improves.  A calibration correction is applied at every future
+        prediction; it is not merely recorded — it changes behaviour.
+
+        This tool enforces the confirmation threshold: it will reject the call
+        if the hypothesis has not accumulated enough confirming outcomes.
+
+        Parameters
+        ----------
+        hypothesis_id : str
+            ID of the confirmed hypothesis to graduate (e.g. ``"hyp-001"``).
+        condition : str
+            The specific condition under which this correction applies.
+            Example: "vol regime is elevated or extreme".
+        adjustment : str
+            The concrete adjustment to make when the condition is met.
+            Example: "Widen 80% CI by 12% relative to the statistical model
+            output."
+        horizon_scope : str
+            Which horizons this correction applies to.
+            One of: ``"all"``, ``"5bd"``, ``"10bd"``, ``"21bd"``, or a
+            combination like ``"10bd and 21bd"``.
+
+        Returns
+        -------
+        str
+            Confirmation message, or a rejection message with the shortfall.
+        """
+        state = store.load()
+        hyp = next((h for h in state.hypotheses if h.id == hypothesis_id), None)
+        if hyp is None:
+            ids = [h.id for h in state.hypotheses]
+            return f"Hypothesis '{hypothesis_id}' not found. Known IDs: {ids}."
+        if hyp.status != "open":
+            return f"Hypothesis {hypothesis_id} is already {hyp.status}. Only open hypotheses can be graduated."
+
+        if hyp.confirmations < store.confirmation_threshold:
+            shortfall = store.confirmation_threshold - hyp.confirmations
+            return (
+                f"Cannot graduate {hypothesis_id}: "
+                f"{hyp.confirmations} confirmation(s), "
+                f"requires {store.confirmation_threshold}. "
+                f"Record {shortfall} more confirming outcome(s) first."
+            )
+
+        today = _today()
+        hyp.status = "confirmed"
+        correction = CalibrationCorrection(
+            condition=condition.strip(),
+            adjustment=adjustment.strip(),
+            horizon_scope=horizon_scope.strip(),
+            source_hypothesis=hypothesis_id,
+            confirmed_on=today,
+        )
+        state.calibration_corrections.append(correction)
+        state.observations.append(
+            Observation(
+                date=today,
+                finding=(
+                    f"Graduated {hypothesis_id} to calibration correction: "
+                    f"'{condition}' → '{adjustment}' ({horizon_scope})."
+                ),
+                linked_hypothesis=hypothesis_id,
+            )
+        )
+        state.version_history.append(
+            VersionEntry(
+                date=today,
+                description=(
+                    f"Graduated {hypothesis_id} to calibration correction "
+                    f"(condition: {condition[:50]}{'...' if len(condition) > 50 else ''})."
+                ),
+            )
+        )
+        store.save(state)
+        return (
+            f"Hypothesis {hypothesis_id} confirmed and graduated. "
+            f"Calibration correction added: when '{condition}', apply '{adjustment}' "
+            f"(scope: {horizon_scope})."
+        )
+
+    def update_approach_narrative(new_text: str, rationale: str) -> str:
+        """Replace the approach narrative with an updated strategic description.
+
+        This is the highest-evidence-bar update.  Consult ``meta-learning``
+        before calling this tool — the narrative should only change when the
+        calibration record reveals a structural insight that the current
+        description no longer captures.  A ``rationale`` argument is required
+        to force articulation of why the change is warranted.
+
+        Parameters
+        ----------
+        new_text : str
+            The complete replacement text for the ``## Approach`` section.
+            Write it as a self-contained description of the current forecasting
+            strategy — what signals are used, in what order, and with what
+            emphasis.
+        rationale : str
+            Why this update is warranted now.  Cite the specific calibration
+            corrections or pattern of observations that motivated the change.
+
+        Returns
+        -------
+        str
+            Confirmation message.
+        """
+        if not new_text.strip():
+            return "new_text must not be empty."
+        if not rationale.strip():
+            return "rationale must not be empty. Explain why the approach narrative warrants an update."
+
+        state = store.load()
+        today = _today()
+        state.approach_narrative = new_text.strip()
+        state.version_history.append(
+            VersionEntry(
+                date=today,
+                description=f"Updated approach narrative. Rationale: {rationale[:120]}{'...' if len(rationale) > 120 else ''}",
+            )
+        )
+        store.save(state)
+        return f"Approach narrative updated ({len(new_text)} chars). Rationale recorded in version history."
+
+    return [
+        record_observation,
+        open_hypothesis,
+        record_hypothesis_outcome,
+        graduate_hypothesis,
+        update_approach_narrative,
+    ]
+
+
+# ---------------------------------------------------------------------------
+# Backward-compatible module-level bindings (default wti-strategy dir)
+# ---------------------------------------------------------------------------
+
+STORE: AdaptiveSkillStore[WtiStrategyState] = AdaptiveSkillStore(
+    skill_dir=_SKILL_DIR,
+    state_type=WtiStrategyState,
+    confirmation_threshold=3,
+)
+
+WTI_SKILL_TOOLS: list[Callable[..., str]] = build_skill_tools(_SKILL_DIR)
+```
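
The factory pattern, the load/mutate/save cycle, and the graduation guard described in the `skill_tools.py` docstring can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the `AdaptiveSkillStore` API: `make_tools`, `confirm`, `graduate`, and the `state.json` file are hypothetical names, JSON stands in for `skill_state.yaml`, and `SKILL.md` rendering and backups are left out.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def make_tools(skill_dir: Path, confirmation_threshold: int = 3) -> list[Callable[..., str]]:
    """Return mutation closures bound to one directory (hypothetical stand-in)."""
    state_path = skill_dir / "state.json"  # assumption: JSON, not skill_state.yaml

    def load() -> dict:
        if state_path.exists():
            return json.loads(state_path.read_text())
        return {"hypotheses": [], "corrections": []}

    def save(state: dict) -> None:
        state_path.write_text(json.dumps(state, indent=2))

    def find_open(state: dict, hypothesis_id: str):
        return next(
            (h for h in state["hypotheses"] if h["id"] == hypothesis_id and h["status"] == "open"),
            None,
        )

    def open_hypothesis(claim: str) -> str:
        state = load()  # step 1: load current state
        hyp_id = f"hyp-{len(state['hypotheses']) + 1:03d}"
        state["hypotheses"].append(  # step 2: apply one narrow mutation
            {"id": hyp_id, "claim": claim.strip(), "status": "open", "confirmations": 0}
        )
        save(state)  # step 3: persist
        return hyp_id

    def confirm(hypothesis_id: str) -> str:
        state = load()
        hyp = find_open(state, hypothesis_id)
        if hyp is None:
            return f"{hypothesis_id} is not an open hypothesis."
        hyp["confirmations"] += 1
        save(state)
        return f"{hypothesis_id}: {hyp['confirmations']} confirmation(s)."

    def graduate(hypothesis_id: str, adjustment: str) -> str:
        state = load()
        hyp = find_open(state, hypothesis_id)
        if hyp is None:
            return f"{hypothesis_id} is not an open hypothesis."
        if hyp["confirmations"] < confirmation_threshold:  # hard guard
            shortfall = confirmation_threshold - hyp["confirmations"]
            return f"Cannot graduate {hypothesis_id}: {shortfall} more confirmation(s) needed."
        hyp["status"] = "confirmed"
        state["corrections"].append({"source": hypothesis_id, "adjustment": adjustment.strip()})
        save(state)
        return f"{hypothesis_id} graduated."

    return [open_hypothesis, confirm, graduate]


with tempfile.TemporaryDirectory() as tmp:
    dir_a, dir_b = Path(tmp) / "variant-a", Path(tmp) / "variant-b"
    dir_a.mkdir()
    dir_b.mkdir()
    open_a, confirm_a, graduate_a = make_tools(dir_a, confirmation_threshold=2)
    open_b, _, _ = make_tools(dir_b)  # separate closures, separate state file

    hyp = open_a("80% intervals too narrow in elevated vol")
    early = graduate_a(hyp, "widen 80% CI")  # rejected: no confirmations yet
    confirm_a(hyp)
    confirm_a(hyp)
    late = graduate_a(hyp, "widen 80% CI")  # accepted: threshold of 2 reached
    b_has_state = (dir_b / "state.json").exists()  # variant-b was never written
```

Because each closure reloads state from disk before mutating it, two tool sets built for different directories never share in-memory state, which is what lets named strategy variants coexist.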
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__fetch-yfinance__SKILL.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__fetch-yfinance__SKILL.md.md
new file mode 100644
index 0000000..cafd2c8
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__fetch-yfinance__SKILL.md.md
@@ -0,0 +1,53 @@
+# Source: implementations/energy_oil_forecasting/adaptive_agent/skills/fetch-yfinance/SKILL.md
+
+kind: markdown
+
+---
+name: fetch-yfinance
+description: >-
+  One-shot code patterns for downloading price and market data from yfinance
+  inside the E2B sandbox. Load this skill whenever a task requires market or
+  futures data from Yahoo Finance. Load references/examples.md for working code.
+---
+
+# Fetching market data with yfinance
+
+## E2B execution model
+
+Each `run_code` call is a completely fresh Python process. There is no state,
+no variables, and no files from any previous call. Every code block must be
+fully self-contained: all imports, all data fetching, and all analysis in one
+block.
+
+yfinance is pre-installed in the sandbox. No `pip install` needed.
+
+## What this skill provides
+
+**`references/examples.md`** — Working code patterns for:
+- Pattern 1: Single ticker, date range (e.g. WTI crude oil `CL=F`)
+- Pattern 2: Applying a temporal cutoff for backtesting (do not use data after `as_of`)
+- Pattern 3: Multiple tickers in one fetch
+
+## Workflow
+
+1. Call `load_skill_resource("fetch-yfinance", "references/examples.md")` to load the patterns.
+2. Identify which pattern fits your task.
+3. Combine with other skill examples in the same code block.
+
+## Common tickers
+
+| Series               | Ticker  |
+|----------------------|---------|
+| WTI crude oil        | `CL=F`  |
+| Brent crude          | `BZ=F`  |
+| S&P 500              | `^GSPC` |
+| Natural gas          | `NG=F`  |
+| USD/CAD              | `CAD=X` |
+
+## Gotchas
+
+- `ticker.history()` returns a timezone-aware DatetimeIndex on recent yfinance
+  versions. After `reset_index()`, strip the timezone from the `Date` column
+  with `.dt.tz_localize(None)`.
+- For futures (`CL=F`, `NG=F`), use `auto_adjust=False` and take the `Close`
+  column directly — adjusted close is not meaningful for rolled futures.
+- Always sort by date ascending after fetching.
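
The gotchas in `fetch-yfinance/SKILL.md` and its `as_of` cutoff rule can be combined into one small helper. The sketch below is illustrative only: `tidy_history` is a hypothetical name, and the input frame is a synthetic stand-in for what `ticker.history(..., auto_adjust=False)` returns, so it runs without network access and the prices are not real.

```python
from typing import Optional

import pandas as pd


def tidy_history(raw: pd.DataFrame, as_of: Optional[str] = None) -> pd.DataFrame:
    """Flatten a history-style frame into sorted, tz-naive (date, close) rows."""
    flat = raw.reset_index()
    dates = pd.to_datetime(flat["Date"])
    if dates.dt.tz is not None:
        dates = dates.dt.tz_localize(None)  # drop the timezone, keep wall-clock time
    out = pd.DataFrame({"date": dates.dt.normalize(), "close": flat["Close"].to_numpy()})
    out = out.dropna().sort_values("date").reset_index(drop=True)
    if as_of is not None:
        # Keep only rows strictly before the forecast origin to avoid leakage.
        out = out[out["date"] < pd.Timestamp(as_of)].reset_index(drop=True)
    return out


# Synthetic, deliberately unsorted input with a tz-aware index named "Date".
index = pd.DatetimeIndex(["2025-01-06", "2025-01-02", "2025-01-03"], tz="UTC", name="Date")
raw = pd.DataFrame({"Close": [72.0, 70.0, 71.0]}, index=index)

tidy = tidy_history(raw, as_of="2025-01-06")
print(tidy.to_string(index=False))
```

In a real `run_code` block the same helper would be applied to the frame returned by yfinance, in the same script as the fetch, since each call starts from a fresh process.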
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__fetch-yfinance__references__examples.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__fetch-yfinance__references__examples.md.md
new file mode 100644
index 0000000..63d87a6
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__fetch-yfinance__references__examples.md.md
@@ -0,0 +1,89 @@
+# Source: implementations/energy_oil_forecasting/adaptive_agent/skills/fetch-yfinance/references/examples.md
+
+kind: markdown
+
+# fetch-yfinance: code examples
+
+---
+
+## Pattern 1: Single ticker, full date range
+
+```python
+import yfinance as yf
+import pandas as pd
+
+ticker = yf.Ticker("CL=F")
+# yfinance treats `end` as exclusive, so use the day after the last date you want.
+raw = ticker.history(start="2023-01-01", end="2026-01-01", auto_adjust=False)
+raw = raw.reset_index()
+
+df = pd.DataFrame({
+    "date": pd.to_datetime(raw["Date"]).dt.tz_localize(None).dt.normalize(),
+    "close": raw["Close"].values,
+}).dropna().sort_values("date").reset_index(drop=True)
+
+print(f"Fetched {len(df)} rows | {df['date'].iloc[0].date()} → {df['date'].iloc[-1].date()}")
+print(df.tail(3).to_string(index=False))
+```
+
+**Expected output:**
+```
+Fetched 754 rows | 2023-01-03 → 2025-12-31
+        date  close
+2025-12-29  69.45
+2025-12-30  69.12
+2025-12-31  68.88
+```
+
+---
+
+## Pattern 2: Temporal cutoff for backtesting
+
+When simulating a forecast as of a specific date, filter the data to exclude
+anything on or after the cutoff. This prevents future-data leakage.
+
+```python
+import yfinance as yf
+import pandas as pd
+
+AS_OF = "2025-06-01"  # forecast origin date — replace with actual as_of
+
+ticker = yf.Ticker("CL=F")
+raw = ticker.history(start="2023-01-01", end="2026-01-01", auto_adjust=False)
+raw = raw.reset_index()
+
+df = pd.DataFrame({
+    "date": pd.to_datetime(raw["Date"]).dt.tz_localize(None).dt.normalize(),
+    "close": raw["Close"].values,
+}).dropna().sort_values("date").reset_index(drop=True)
+
+# Apply cutoff: keep only data strictly before as_of
+cutoff = pd.Timestamp(AS_OF)
+df = df[df["date"] < cutoff].copy()
+
+print(f"After cutoff {AS_OF}: {len(df)} rows | last date = {df['date'].iloc[-1].date()}")
+print(f"Last close: ${df['close'].iloc[-1]:.2f}")
+```
+
+---
+
+## Pattern 3: Multiple tickers
+
+Fetch several series in one call, then split into per-ticker DataFrames.
+
+```python
+import yfinance as yf
+import pandas as pd
+
+TICKERS = ["CL=F", "BZ=F", "NG=F"]
+START, END = "2024-01-01", "2025-12-31"
+
+raw = yf.download(TICKERS, start=START, end=END, auto_adjust=False, progress=False)
+
+series = {}
+for ticker in TICKERS:
+    s = raw["Close"][ticker].dropna().reset_index()
+    s.columns = ["date", "close"]
+    s["date"] = pd.to_datetime(s["date"]).dt.tz_localize(None).dt.normalize()
+    series[ticker] = s.sort_values("date").reset_index(drop=True)
+    print(f"{ticker}: {len(series[ticker])} rows, last close = {series[ticker]['close'].iloc[-1]:.2f}")
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__meta-learning__SKILL.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__meta-learning__SKILL.md.md
new file mode 100644
index 0000000..176f573
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__meta-learning__SKILL.md.md
@@ -0,0 +1,148 @@
+# Source: implementations/energy_oil_forecasting/adaptive_agent/skills/meta-learning/SKILL.md
+
+kind: markdown
+
+---
+name: meta-learning
+description: >-
+  Governs when and how the adaptive WTI analyst updates its strategy skill.
+  Consult this before calling any strategy mutation tool. The process is
+  deliberately conservative — it resists updating on individual surprises and
+  requires pattern-level evidence before revising strategy.
+---
+
+# Meta-learning: strategy update governance
+
+## The four learning layers
+
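+`wti-strategy` has four distinct layers (observations, hypotheses, calibration
+corrections, and the approach narrative), each with its own evidence bar. The
+table below has one row per mutation tool, so the hypothesis layer appears
+twice. Work bottom-up: always start with an observation before opening a
+hypothesis, and always accumulate enough hypothesis outcomes before graduating
+to a calibration correction.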
+
+| Layer | Tool | Evidence bar |
+|-------|------|-------------|
+| **Observations** | `record_observation` | Pattern visible across ≥2 forecasts — not a single surprise |
+| **Hypotheses** | `open_hypothesis` | One strong observation suggesting a durable pattern |
+| **Hypothesis outcomes** | `record_hypothesis_outcome` | Each resolution relevant to an open hypothesis |
+| **Calibration corrections** | `graduate_hypothesis` | Tool enforces the configured `confirmation_threshold` (`build_skill_tools` defaults to 3) and rejects the call if it is not met |
+| **Approach narrative** | `update_approach_narrative` | Only when the calibration record reveals a structural insight |
+
+## When to update
+
+Engage the update process only when you have **pattern-level evidence** — not
+after a single surprising outcome. Appropriate triggers:
+
+- A self-review or backtesting exercise spanning five or more origins reveals
+  a systematic bias (e.g. intervals consistently too narrow in a specific
+  vol regime, or a directional skew that persists across horizons).
+- A user identifies a recurring pattern in your errors and you can verify it
+  with code or data.
+- You run a code-execution analysis on historical WTI data that reveals a
+  durable relationship not currently captured in your strategy.
+
+**Do not update after a single resolution, even a large miss.** Markets have
+noise; one bad forecast is not a signal.
+
+## How to update: the tool call sequence
+
+### Step 1 — Always: record an observation
+
+```
+record_observation(
+    finding="<specific pattern, including regime, horizon, and direction>",
+    linked_hypothesis="<hyp-id if this feeds an open hypothesis, else omit>"
+)
+```
+
+This is always the right first step. It costs nothing and builds the audit
+record that governs future decisions.
+
+### Step 2 — If a durable pattern is suspected: open a hypothesis
+
+```
+open_hypothesis(
+    claim="<testable claim about your forecasting behaviour>",
+    initial_evidence="<the observation(s) that motivated this hypothesis>"
+)
+```
+
+A hypothesis is a candidate calibration correction under active testing. State
+the claim in terms of a specific condition and a directional error — not a
+market opinion. The tool assigns an ID (e.g. `hyp-001`) and records the
+initial evidence as a linked observation automatically.
+
+### Step 3 — On each subsequent resolution: update hypothesis counts
+
+```
+record_hypothesis_outcome(
+    hypothesis_id="hyp-001",
+    outcome="confirmed"  # or "refuted"
+)
+```
+
+Call this for any resolution where the outcome is directly relevant to an open
+hypothesis. A single refutation does not close the hypothesis — continue
+accumulating evidence. The tool returns the current counts and how many more
+confirmations are needed to graduate.
+
+### Step 4 — When the threshold is reached: graduate to calibration
+
+```
+graduate_hypothesis(
+    hypothesis_id="hyp-001",
+    condition="<the specific condition under which the correction applies>",
+    adjustment="<the concrete adjustment to make when the condition is met>",
+    horizon_scope="all"  # or "5bd", "10bd", "21bd", etc.
+)
+```
+
+The tool enforces the confirmation threshold. If the hypothesis has not
+accumulated enough confirming outcomes, it will reject the call and state
+exactly how many more are needed. Do not attempt to work around this.
+
+The tool automatically:
+- Marks the hypothesis as confirmed
+- Adds the calibration correction to `wti-strategy`
+- Records a linked observation and version history entry
+
+### Step 5 — Rarely: update the approach narrative
+
+```
+update_approach_narrative(
+    new_text="<complete replacement text for the Approach section>",
+    rationale="<why the current narrative no longer captures the strategy>"
+)
+```
+
+Only call this when the calibration record reveals a structural insight that
+the approach narrative no longer captures — for example, when multiple
+graduated corrections collectively suggest the relative weighting of evidence
+sources has shifted. The `rationale` argument is required and will be logged
+in the version history.
+
+Do not call this during a live prediction task. Approach updates belong in
+self-review or resolution-handling invocations.
+
+## Guarding against over-learning
+
+The greatest risk in a self-updating strategy is chasing noise. Before
+opening a hypothesis or proposing a graduation, ask:
+
+- Is this pattern visible across multiple origins, or just one?
+- Would this update have improved performance over the past ten forecasts, or
+  only the most recent few?
+- Am I reacting to a one-time market event (e.g. a geopolitical shock) rather
+  than a durable forecasting flaw?
+
+If uncertain, call `record_observation` without opening a hypothesis.
+Revisit after more evidence accumulates.
+
+## What NOT to update
+
+- Do not open a hypothesis after a single resolution.
+- Do not attempt to graduate a hypothesis that the tool rejects — accumulate
+  the required outcomes first.
+- Do not update the approach narrative based on market opinions or macro views.
+  Update only based on evidence about your own forecasting behaviour.
+- Do not update during a live prediction task.
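
The "pattern-level evidence" test in `meta-learning/SKILL.md` can be made concrete with a small coverage check over resolved forecasts. This is a sketch under assumptions rather than part of the skill: `Resolved`, `interval_coverage`, and `is_pattern_level` are hypothetical names, `min_origins=5` mirrors the "five or more origins" trigger, and the `tolerance` of 0.15 is an arbitrary illustrative margin.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resolved:
    """One resolved forecast: 80% interval bounds and the realised value."""

    lower: float
    upper: float
    actual: float


def interval_coverage(records: list) -> float:
    """Fraction of actuals that landed inside their interval."""
    hits = sum(1 for r in records if r.lower <= r.actual <= r.upper)
    return hits / len(records)


def is_pattern_level(
    records: list,
    nominal: float = 0.80,
    min_origins: int = 5,
    tolerance: float = 0.15,  # assumption: illustrative margin, not from the skill
) -> bool:
    """True only when enough origins exist and coverage is well below nominal."""
    if len(records) < min_origins:
        return False  # a single miss, however large, is not a signal
    return interval_coverage(records) < nominal - tolerance


# Illustrative numbers only; not real forecasts.
single_miss = [Resolved(68.0, 74.0, 79.0)]
review = [
    Resolved(68.0, 74.0, 75.5),
    Resolved(66.0, 72.0, 64.9),
    Resolved(70.0, 76.0, 73.0),
    Resolved(69.0, 75.0, 77.2),
    Resolved(67.0, 73.0, 74.1),
]

worth_recording = is_pattern_level(review)
react_to_one_miss = is_pattern_level(single_miss)
```

A check like this only supports calling `record_observation` or `open_hypothesis`. Graduation still depends on the confirmation count that `graduate_hypothesis` enforces.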
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__trend-projection__SKILL.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__trend-projection__SKILL.md.md
new file mode 100644
index 0000000..83c561e
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__trend-projection__SKILL.md.md
@@ -0,0 +1,47 @@
+# Source: implementations/energy_oil_forecasting/adaptive_agent/skills/trend-projection/SKILL.md
+
+kind: markdown
+
+---
+name: trend-projection
+description: >-
+  Code patterns for fitting a linear trend and projecting calibrated forecasts.
+  Always load alongside fetch-yfinance and vol-regime — references/examples.md
+  includes a Full Pipeline Example showing the complete self-contained script
+  from yfinance data fetch through vol regime to final interval output.
+---
+
+# Linear trend projection
+
+## What this skill provides
+
+**`references/examples.md`** — Working code patterns for:
+- Pattern 1: Fit a linear trend on the most recent `trend_window` daily rows and
+  project to horizons 5, 10, and 21 business days
+- Pattern 2: Calibrate 80% prediction interval widths from residual standard error
+- Pattern 3: Plausibility guard that clips projections to [0.5 × 52-week low, 1.5 × 52-week high]
+
+## Typical usage
+
+1. Load `fetch-yfinance` → fetch price history
+2. Load `vol-regime` → classify regime, detect anomaly, determine `trend_window`
+3. Load `trend-projection` → fit trend on the `trend_window` rows, project, calibrate intervals
+4. Write one complete code block combining all three
+
+## Key formula
+
+80% CI half-width at horizon h business days:
+
+```
+half_width = 1.28 * residual_std * sqrt(h / 5)
+```
+
+where `residual_std` is the standard deviation of in-sample residuals on the
+trend window and 1.28 is the standard-normal z-value for a two-sided 80% interval,
+so coverage is approximately right for normal residuals. Width scales with `sqrt(h / 5)`.
+
+## Interval calibration note
+
+Statistical intervals are often too narrow in elevated or extreme vol regimes.
+Per the `wti-strategy` skill: widen the 80% CI by ~10–15% when the regime is elevated
+or extreme (the examples use a factor of 1.125). Apply this after computing the base half-width.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__trend-projection__references__examples.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__trend-projection__references__examples.md.md
new file mode 100644
index 0000000..0221dd6
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__trend-projection__references__examples.md.md
@@ -0,0 +1,161 @@
+# Source: implementations/energy_oil_forecasting/adaptive_agent/skills/trend-projection/references/examples.md
+
+kind: markdown
+
+# trend-projection: code examples
+
+Each `run_code` call is a fresh Python process — every script must be fully
+self-contained from data fetch through final output. The patterns below build
+on each other and are meant to be combined in a **single script**.
+
+Start every script with the yfinance fetch (using `end=as_of_date`), then
+add the vol-regime patterns, then the trend-projection patterns below. The
+**Full Pipeline Example** at the end shows the complete assembly.
+
+---
+
+## Pattern 1: Fit linear trend and project to horizons
+
+```python
+# Requires: daily (DataFrame, columns date/close, sorted ascending)
+#           trend_window (int, from vol-regime Pattern 3)
+
+import numpy as np
+from sklearn.linear_model import LinearRegression
+
+HORIZONS = [5, 10, 21]  # business days ahead
+
+window = daily.tail(trend_window).copy().reset_index(drop=True)
+x = np.arange(len(window)).reshape(-1, 1)
+y = window["close"].values
+
+model = LinearRegression().fit(x, y)
+y_hat = model.predict(x)
+residual_std = float(np.std(y - y_hat, ddof=1))
+
+last_idx = len(window) - 1
+
+projections = {}
+for h in HORIZONS:
+    proj_idx = last_idx + h
+    point = float(model.predict([[proj_idx]])[0])
+    projections[h] = point
+    print(f"h={h:2d} bd: point={point:.2f}")
+
+print(f"residual_std={residual_std:.3f}  |  slope={model.coef_[0]:.3f} USD/day")
+```
+
+---
+
+## Pattern 2: Calibrated 80% prediction intervals
+
+```python
+# Requires: projections (dict), residual_std (float), regime (str), numpy imported as np
+
+intervals = {}
+for h, point in projections.items():
+    half_width = 1.28 * residual_std * np.sqrt(h / 5)
+    if regime in ("elevated", "extreme"):
+        half_width *= 1.125
+    lo = round(point - half_width, 2)
+    hi = round(point + half_width, 2)
+    intervals[h] = (lo, hi)
+    print(f"h={h:2d} bd: [{lo:.2f}, {hi:.2f}]  (half_width={half_width:.2f})")
+```
+
+---
+
+## Pattern 3: Plausibility guard
+
+```python
+# Requires: projections (dict), df (full DataFrame, not just window), numpy imported as np
+
+w52_low = float(df["close"].tail(252).min())
+w52_high = float(df["close"].tail(252).max())
+
+clipped = {}
+for h, point in projections.items():
+    clipped_point = float(np.clip(point, 0.5 * w52_low, 1.5 * w52_high))
+    clipped[h] = clipped_point
+    if clipped_point != point:
+        print(f"h={h}: clipped {point:.2f} → {clipped_point:.2f}")
+
+print(f"52w range: [{w52_low:.2f}, {w52_high:.2f}]")
+```
+
+---
+
+## Full Pipeline Example
+
+This is what a complete, self-contained `run_code` script looks like when you
+combine all three skills. Copy and adapt this — replace `AS_OF` with the
+actual `as_of` from the prediction payload.
+
+```python
+import yfinance as yf
+import pandas as pd
+import numpy as np
+from sklearn.linear_model import LinearRegression
+
+# ── 1. Fetch data (fetch-yfinance) ───────────────────────────────────────────
+AS_OF = "2026-02-16"  # replace with actual as_of from prediction payload
+
+ticker = yf.Ticker("CL=F")
+raw = ticker.history(start="2004-01-01", end=AS_OF, auto_adjust=False)  # fetch only up to the forecast origin
+raw = raw.reset_index()
+df = pd.DataFrame({
+    "date": pd.to_datetime(raw["Date"]).dt.tz_localize(None).dt.normalize(),
+    "close": raw["Close"].values,
+}).dropna().sort_values("date").reset_index(drop=True)
+cutoff = pd.Timestamp(AS_OF)
+df = df[df["date"] < cutoff].copy()
+print(f"Loaded {len(df)} rows through {df['date'].iloc[-1].date()}")
+
+# ── 2. Vol regime (vol-regime) ────────────────────────────────────────────────
+day_gaps = df["date"].diff().dt.days
+daily = df[day_gaps <= 3].copy().reset_index(drop=True)
+
+log_returns = np.log(daily["close"] / daily["close"].shift(1)).dropna()
+rolling_vol = log_returns.rolling(30).std() * np.sqrt(252) * 100
+current_vol = float(rolling_vol.iloc[-1])
+
+if current_vol < 20:
+    regime = "low"
+elif current_vol < 35:
+    regime = "normal"
+elif current_vol < 55:
+    regime = "elevated"
+else:
+    regime = "extreme"
+
+close_changes = daily["close"].diff().dropna()
+last_change = float(close_changes.iloc[-1])
+last_std = float(close_changes.rolling(30).std().iloc[-1])
+z_score = last_change / last_std if last_std > 0 else 0.0
+
+trend_window = 15 if regime in ("elevated", "extreme") or abs(z_score) > 2.5 else 30
+
+print(f"REGIME: {regime}  |  vol={current_vol:.1f}%  |  z={z_score:+.2f}  |  window={trend_window}d")
+
+# ── 3. Trend projection (trend-projection) ────────────────────────────────────
+HORIZONS = [5, 10, 21]
+
+window_df = daily.tail(trend_window).copy().reset_index(drop=True)
+x = np.arange(len(window_df)).reshape(-1, 1)
+y = window_df["close"].values
+model = LinearRegression().fit(x, y)
+residual_std = float(np.std(y - model.predict(x), ddof=1))
+last_idx = len(window_df) - 1
+
+w52_low = float(df["close"].tail(252).min())
+w52_high = float(df["close"].tail(252).max())
+
+print(f"\nSlope: {model.coef_[0]:.3f} USD/day  |  residual_std={residual_std:.3f}")
+for h in HORIZONS:
+    point = float(np.clip(model.predict([[last_idx + h]])[0], 0.5 * w52_low, 1.5 * w52_high))
+    half_width = 1.28 * residual_std * np.sqrt(h / 5)
+    if regime in ("elevated", "extreme"):
+        half_width *= 1.125
+    lo, hi = round(point - half_width, 2), round(point + half_width, 2)
+    print(f"h={h:2d} bd: point={point:.2f}  |  80% CI [{lo:.2f}, {hi:.2f}]")
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__vol-regime__SKILL.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__vol-regime__SKILL.md.md
new file mode 100644
index 0000000..d535a3e
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__vol-regime__SKILL.md.md
@@ -0,0 +1,48 @@
+# Source: implementations/energy_oil_forecasting/adaptive_agent/skills/vol-regime/SKILL.md
+
+kind: markdown
+
+---
+name: vol-regime
+description: >-
+  Code patterns for classifying the current volatility regime and detecting
+  anomalous recent moves. Always load alongside fetch-yfinance — the examples
+  require a yfinance data fetch at the top of the same script. Load
+  references/examples.md for working code.
+---
+
+# Volatility regime classification
+
+## What this skill provides
+
+**`references/examples.md`** — Working code patterns for:
+- Pattern 1: Rolling 30-day annualised vol + regime classification (low / normal /
+  elevated / extreme)
+- Pattern 2: Anomaly detection — z-score of the most recent daily move
+- Pattern 3: Adaptive trend window selection based on regime and anomaly signals
+
+These patterns are designed to be **combined with a data-fetch block** in a single
+code execution. Do not call `run_code` separately for data fetching and regime
+classification — combine them.
+
+## Typical usage
+
+Load `fetch-yfinance` and `vol-regime`, read both `references/examples.md` files, then write
+one complete block that fetches the data and computes the regime.
+
+## Regime thresholds (WTI crude oil)
+
+| Regime   | Annualised vol (%) |
+|----------|--------------------|
+| low      | < 20               |
+| normal   | 20 to < 35         |
+| elevated | 35 to < 55         |
+| extreme  | ≥ 55               |
+
+Adjust thresholds for other assets. These are calibrated to WTI's historical
+vol distribution (2020–2025 median ≈ 31%).
+
+## Output of Pattern 3
+
+Pattern 3 sets a `trend_window` integer (15 or 30 daily rows) that you should
+use directly in the `trend-projection` skill's fitting step within the same script.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__vol-regime__references__examples.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__vol-regime__references__examples.md.md
new file mode 100644
index 0000000..8126f52
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__vol-regime__references__examples.md.md
@@ -0,0 +1,107 @@
+# Source: implementations/energy_oil_forecasting/adaptive_agent/skills/vol-regime/references/examples.md
+
+kind: markdown
+
+# vol-regime: code examples
+
+Each `run_code` call is a fresh Python process. These patterns must be part of
+a self-contained script that starts with a yfinance fetch. See the
+`fetch-yfinance` skill for the data-loading block — paste it above these
+patterns in the same script.
+
+The variable `df` (columns: `date`, `close`, sorted ascending) comes from the
+`fetch-yfinance` pattern. Every script that uses vol-regime must define `df`
+first by fetching from yfinance with the appropriate `end=as_of_date` cutoff.
+
+---
+
+## Pattern 1: Rolling vol and regime classification
+
+```python
+import yfinance as yf
+import pandas as pd
+import numpy as np
+
+AS_OF = "2026-02-16"  # replace with actual as_of from the prediction payload
+
+ticker = yf.Ticker("CL=F")
+raw = ticker.history(start="2004-01-01", end=AS_OF, auto_adjust=False)  # fetch only up to the forecast origin
+raw = raw.reset_index()
+df = pd.DataFrame({
+    "date": pd.to_datetime(raw["Date"]).dt.tz_localize(None).dt.normalize(),
+    "close": raw["Close"].values,
+}).dropna().sort_values("date").reset_index(drop=True)
+cutoff = pd.Timestamp(AS_OF)
+df = df[df["date"] < cutoff].copy()
+
+# Use only the daily-frequency portion (drop gaps > 3 days)
+day_gaps = df["date"].diff().dt.days
+daily = df[day_gaps <= 3].copy().reset_index(drop=True)
+
+log_returns = np.log(daily["close"] / daily["close"].shift(1)).dropna()
+rolling_vol = log_returns.rolling(30).std() * np.sqrt(252) * 100  # annualised %
+current_vol = float(rolling_vol.iloc[-1])
+
+if current_vol < 20:
+    regime = "low"
+elif current_vol < 35:
+    regime = "normal"
+elif current_vol < 55:
+    regime = "elevated"
+else:
+    regime = "extreme"
+
+print(f"REGIME: {regime}  |  current_vol={current_vol:.1f}%  |  n_daily_rows={len(daily)}")
+```
+
+**Example output:**
+```
+REGIME: elevated  |  current_vol=41.3%  |  n_daily_rows=312
+```
+
+---
+
+## Pattern 2: Anomaly detection (z-score of last move)
+
+```python
+# Add this after Pattern 1 (daily is already defined)
+
+close_changes = daily["close"].diff().dropna()
+rolling_std = close_changes.rolling(30).std()
+
+last_change = float(close_changes.iloc[-1])
+last_std = float(rolling_std.iloc[-1])
+z_score = last_change / last_std if last_std > 0 else 0.0
+
+anomaly = abs(z_score) > 2.5
+print(f"ANOMALY: z={z_score:+.2f}  |  last_move={last_change:+.2f}  |  flagged={anomaly}")
+```
+
+**Example output:**
+```
+ANOMALY: z=+3.14  |  last_move=+4.21  |  flagged=True
+```
+
+---
+
+## Pattern 3: Adaptive trend window
+
+```python
+# Add this after Patterns 1–2 (regime and z_score are already defined)
+
+if regime in ("elevated", "extreme") or abs(z_score) > 2.5:
+    trend_window = 15
+    reason = f"regime={regime}, |z|={abs(z_score):.2f} — shortened window"
+else:
+    trend_window = 30
+    reason = f"regime={regime}, |z|={abs(z_score):.2f} — standard window"
+
+print(f"TREND_WINDOW: {trend_window} days  ({reason})")
+```
+
+**Example output:**
+```
+TREND_WINDOW: 15 days  (regime=elevated, |z|=3.14 — shortened window)
+```
+
+Pass `trend_window` to the `trend-projection` skill.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__wti-strategy-trained__SKILL.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__wti-strategy-trained__SKILL.md.md
new file mode 100644
index 0000000..b22e3e1
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__wti-strategy-trained__SKILL.md.md
@@ -0,0 +1,58 @@
+# Source: implementations/energy_oil_forecasting/adaptive_agent/skills/wti-strategy-trained/SKILL.md
+
+kind: markdown
+
+---
+name: wti-strategy-trained
+description: >-
+  The adaptive WTI analyst's current forecasting strategy. Load this at the
+  start of every prediction task. This file is generated — edit the state
+  through the mutation tools, not by hand.
+---
+
+# WTI Forecasting Strategy
+
+## Approach
+
+Produce calibrated probabilistic forecasts by combining two evidence streams:
+statistical analysis of recent price history and web-grounded news context.
+
+At short horizons (5 bd), momentum and recent trend dominate. Trust the trend
+projection output unless there is a strong near-term catalyst visible in news
+context (e.g. an imminent OPEC+ meeting or scheduled inventory release).
+
+At medium horizons (10 bd), OPEC+ meeting schedules and US inventory release
+dates matter. Check for scheduled events in the news context before finalising
+the forecast.
+
+At long horizons (21 bd), macro demand and geopolitical risk dominate. The
+statistical signal loses explanatory power at this horizon; weight news context
+and published analyst consensus more heavily than the trend projection.
+
+Always run statistical analysis (vol-regime, trend-projection) before
+incorporating news context. The regime classification and trend window
+directly inform interval calibration.
+
+## Active calibration corrections
+
+*(No calibration corrections yet. Graduate a confirmed hypothesis to add one.)*
+
+## Open hypotheses
+
+| ID | Claim | Confirmations | Refutations |
+|----|-------|---------------|-------------|
+| hyp-001 | Point forecasts generated by extrapolating a linear trend are consistently less accurate than a flat-trend (no-change) forecast in the elevated or extreme volatility regimes (annualised vol > 35%), particularly at the 10bd and 21bd horizons. | 0 | 0 |
+
+## Observations
+
+| Date | Finding | Linked hypothesis |
+|------|---------|-------------------|
+| 2026-06-03 | In the 2025 WTI backtest, the trend-projection model underperformed a simple flat forecast across all regimes and horizons, with the most extreme errors occurring in the elevated volatility regime at the 21-day horizon (Trend MAE of 11.95 USD vs Flat MAE of 3.91 USD, and a negative bias of -3.03 USD). | — |
+| 2026-06-03 | A full-year 2025 backtest (247 trading days) of WTI crude oil prices shows that in the elevated volatility regime, the trend projection's MAE was more than double the flat forecast's MAE at 5bd (4.89 vs 2.33) and 10bd (5.28 vs 2.58), and more than triple at 21bd (11.95 vs 3.91), with a strong negative bias (-3.03) at 21bd. | hyp-001 |
+| 2026-06-03 | In the 2025 WTI backtest, the trend-projection model underperformed the flat-trend forecast even in the normal volatility regime across all horizons (5bd: 2.60 vs 2.09, 10bd: 3.45 vs 2.83, 21bd: 4.56 vs 3.44). | — |
+
+## Version history
+
+| Date | Change |
+|------|--------|
+| initial | Strategy initialised with domain priors. No backtest evidence yet. |
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__wti-strategy__SKILL.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__wti-strategy__SKILL.md.md
new file mode 100644
index 0000000..db84b18
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__wti-strategy__SKILL.md.md
@@ -0,0 +1,52 @@
+# Source: implementations/energy_oil_forecasting/adaptive_agent/skills/wti-strategy/SKILL.md
+
+kind: markdown
+
+---
+name: wti-strategy
+description: >-
+  The adaptive WTI analyst's current forecasting strategy. Load this at the
+  start of every prediction task. This file is generated — edit the state
+  through the mutation tools, not by hand.
+---
+
+# WTI Forecasting Strategy
+
+## Approach
+
+Produce calibrated probabilistic forecasts by combining two evidence streams:
+statistical analysis of recent price history and web-grounded news context.
+
+At short horizons (5 bd), momentum and recent trend dominate. Trust the trend
+projection output unless there is a strong near-term catalyst visible in news
+context (e.g. an imminent OPEC+ meeting or scheduled inventory release).
+
+At medium horizons (10 bd), OPEC+ meeting schedules and US inventory release
+dates matter. Check for scheduled events in the news context before finalising
+the forecast.
+
+At long horizons (21 bd), macro demand and geopolitical risk dominate. The
+statistical signal loses explanatory power at this horizon; weight news context
+and published analyst consensus more heavily than the trend projection.
+
+Always run statistical analysis (vol-regime, trend-projection) before
+incorporating news context. The regime classification and trend window
+directly inform interval calibration.
+
+## Active calibration corrections
+
+*(No calibration corrections yet. Graduate a confirmed hypothesis to add one.)*
+
+## Open hypotheses
+
+*(No open hypotheses.)*
+
+## Observations
+
+*(No observations yet. Record findings from resolutions and self-reviews.)*
+
+## Version history
+
+| Date | Change |
+|------|--------|
+| initial | Strategy initialised with domain priors. No backtest evidence yet. |
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__analysis.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__analysis.py.md
new file mode 100644
index 0000000..5090128
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__analysis.py.md
@@ -0,0 +1,165 @@
+# Source: implementations/energy_oil_forecasting/analysis.py
+
+kind: python
+
+```python
+"""Analysis helpers for the WTI crude oil experiment.
+
+Pure functions that turn backtest results and forecast DataFrames into tidy
+tables and scoring metrics. Kept separate from notebooks so they can be tested.
+"""
+
+from __future__ import annotations
+
+from typing import Any
+
+import numpy as np
+import pandas as pd
+from aieng.forecasting.data.service import DataService
+from aieng.forecasting.evaluation.backtest import BacktestResult, compute_brier_score
+from aieng.forecasting.evaluation.prediction import ContinuousForecast
+
+
+def rolling_coverage_pct(forecasts_df: pd.DataFrame, *, year: int | None = None) -> float:
+    """Percentage of resolved forecasts inside the CI, optionally filtered by calendar year."""
+    resolved = forecasts_df.dropna(subset=["actual_price"]).copy()
+    if year is not None:
+        resolved = resolved[resolved["resolution_date"].dt.year == year]
+    if resolved.empty:
+        return float("nan")
+    return float(resolved["inside_ci"].mean() * 100)
+
+
+def score_backtest_results(
+    results: dict[str, BacktestResult],
+    data_service: DataService,
+    *,
+    mae_horizon: int = 21,
+) -> dict[str, float]:
+    """Aggregate mean CRPS, MAE over all matched forecasts, and q20-q80 band coverage."""
+    all_scores: list[float] = []
+    mae_errors: list[float] = []
+    coverage_hits: list[float] = []
+
+    for result in results.values():
+        all_scores.extend(result.scores)
+        task = result.spec.task
+        actual_df = data_service.get_series(task.target_series_id, as_of=result.spec.end)
+        actual_by_date = {
+            pd.Timestamp(row["timestamp"]).normalize(): float(row["value"]) for _, row in actual_df.iterrows()
+        }
+
+        for pred, score in zip(result.predictions, result.scores, strict=False):
+            _ = score
+            if not isinstance(pred.payload, ContinuousForecast):
+                continue
+            fd = pd.Timestamp(pred.forecast_date).normalize()
+            actual = actual_by_date.get(fd)
+            if actual is None:
+                continue
+            median = pred.payload.point_forecast
+            mae_errors.append(abs(median - actual))
+            q80 = pred.payload.quantiles.get(0.80)
+            q20 = pred.payload.quantiles.get(0.20)
+            if q80 is not None and q20 is not None:
+                coverage_hits.append(float(q20 <= actual <= q80))
+
+    return {
+        "mean_crps": float(np.mean(all_scores)) if all_scores else float("nan"),
+        "mae_h21": float(np.mean(mae_errors)) if mae_errors else float("nan"),
+        "coverage_80": float(np.mean(coverage_hits) * 100) if coverage_hits else float("nan"),
+    }
+
+
+def backtest_results_to_frame(results: dict[str, BacktestResult]) -> pd.DataFrame:
+    """Flatten multiple :class:`BacktestResult` objects into a leaderboard DataFrame."""
+    rows: list[dict[str, Any]] = []
+    for predictor_id, result in results.items():
+        rows.append(
+            {
+                "predictor_id": predictor_id,
+                "mean_crps": result.mean_score,
+                "n_predictions": len(result.predictions),
+                "n_skipped_origins": result.skipped_origins,
+            }
+        )
+    return pd.DataFrame(rows).sort_values("mean_crps")
+
+
+def _extract_agent_point(rec: dict[str, Any], horizon_idx: int, horizon: int) -> float:
+    """Extract a point forecast from either the reference or legacy prediction format."""
+    if "predictions" in rec:
+        preds = rec["predictions"]
+        if horizon_idx < len(preds):
+            return float(preds[horizon_idx]["payload"]["point_forecast"])
+        return float("nan")
+    return float(rec.get(f"day_{horizon}", float("nan")))
+
+
+def trajectory_mae_table(
+    agent_results: list[dict[str, Any]],
+    prophet_traj_df: pd.DataFrame,
+    price_df: pd.DataFrame,
+    horizons: list[int] | None = None,
+) -> pd.DataFrame:
+    """MAE at selected horizons comparing agent point forecasts to Prophet.
+
+    Accepts both the reference prediction format
+    (``{"origin": str, "predictions": [pred.model_dump()]}``)
+    and the legacy playground flat-dict format (``{"origin": str, "day_5": float, ...}``).
+    """
+    horizons = horizons or [5, 10, 21]
+    rows: list[dict[str, Any]] = []
+
+    for rec in agent_results:
+        origin = pd.Timestamp(rec["origin"])
+        if price_df[price_df.index >= origin].empty:
+            continue
+
+        for h_idx, h in enumerate(horizons):
+            target_dates = pd.bdate_range(start=origin + pd.offsets.BDay(1), periods=h)
+            actual_date = target_dates[-1]
+            actual_rows = price_df[price_df.index >= actual_date]
+            if actual_rows.empty:
+                continue
+            actual = float(actual_rows.iloc[0]["price"])
+            agent_pred = _extract_agent_point(rec, h_idx, h)
+            prophet_row = prophet_traj_df[(prophet_traj_df["origin"] == origin) & (prophet_traj_df["horizon"] == h)]
+            prophet_pred = float(prophet_row.iloc[0]["yhat"]) if not prophet_row.empty else float("nan")
+            rows.append(
+                {
+                    "Origin": str(origin.date()),
+                    "Horizon": f"{h} bdays",
+                    "Actual ($)": f"{actual:.1f}",
+                    "Prophet ($)": f"{prophet_pred:.1f}" if not np.isnan(prophet_pred) else "—",
+                    "Agent ($)": f"{agent_pred:.1f}" if not np.isnan(agent_pred) else "—",
+                    "Prophet MAE": abs(prophet_pred - actual) if not np.isnan(prophet_pred) else float("nan"),
+                    "Agent MAE": abs(agent_pred - actual) if not np.isnan(agent_pred) else float("nan"),
+                }
+            )
+
+    df = pd.DataFrame(rows)
+    if df.empty:
+        return df
+    return df.set_index(["Origin", "Horizon"])
+
+
+def select_top_predictors(
+    leaderboard: pd.DataFrame,
+    n: int = 3,
+    *,
+    predictor_ids: dict[str, Any] | None = None,
+) -> list[str]:
+    """Return the first ``n`` predictor IDs from a leaderboard sorted by mean CRPS."""
+    return [str(x) for x in leaderboard.head(n)["predictor_id"].tolist()]
+
+
+__all__ = [
+    "backtest_results_to_frame",
+    "compute_brier_score",
+    "rolling_coverage_pct",
+    "score_backtest_results",
+    "select_top_predictors",
+    "trajectory_mae_table",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__analyst_agent____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__analyst_agent____init__.py.md
new file mode 100644
index 0000000..208b7c7
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__analyst_agent____init__.py.md
@@ -0,0 +1,34 @@
+# Source: implementations/energy_oil_forecasting/analyst_agent/__init__.py
+
+kind: python
+
+```python
+"""WTI crude oil analyst agent module.
+
+Exports the :class:`AgentConfig` factories, prompt builder, and predictor
+convenience factory for the energy/oil reference implementation.
+"""
+
+from energy_oil_forecasting.analyst_agent.agent import (
+    WtiPriceForecastPromptBuilder,
+    build_wti_agent_predictor,
+    build_wti_basic_config,
+    build_wti_code_exec_config,
+    build_wti_multitask_news_config,
+    build_wti_news_config,
+    build_wti_tool_config,
+    compress_history,
+)
+
+
+__all__ = [
+    "WtiPriceForecastPromptBuilder",
+    "build_wti_agent_predictor",
+    "build_wti_basic_config",
+    "build_wti_code_exec_config",
+    "build_wti_multitask_news_config",
+    "build_wti_news_config",
+    "build_wti_tool_config",
+    "compress_history",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__analyst_agent__agent.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__analyst_agent__agent.py.md
new file mode 100644
index 0000000..b050256
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__analyst_agent__agent.py.md
@@ -0,0 +1,573 @@
+# Source: implementations/energy_oil_forecasting/analyst_agent/agent.py
+
+kind: python
+
+```python
+"""WTI crude oil analyst agent configurations and prompt builder.
+
+Provides four :class:`~aieng.forecasting.methods.agentic.agent_factory.AgentConfig`
+factories that define progressive agent capability levels:
+
+1. :func:`build_wti_basic_config` — LLM reasons from price history alone (no tools).
+2. :func:`build_wti_news_config` — Adds bounded Google Search via a
+   :class:`~aieng.forecasting.methods.agentic.agent_factory.ContextRetrievalConfig`
+   sub-agent with strict temporal cutoffs.
+3. :func:`build_wti_code_exec_config` — Adds Gemini native code execution and
+   three forecasting skills on top of the news-grounded configuration.
+4. :func:`build_wti_tool_config` — Adds a conventional
+   :class:`~aieng.forecasting.methods.agentic.forecast_tool.ForecastTool`
+   (AutoARIMA) on top of news grounding — a rigid, pre-specified alternative to
+   open-ended code execution.
+
+Also provides:
+
+- :class:`WtiPriceForecastPromptBuilder`: Pydantic ``BaseModel`` that serialises
+  the task and history into a structured JSON payload for the agent.
+- :func:`build_wti_agent_predictor`: convenience factory that wires a config to
+  an :class:`~aieng.forecasting.methods.agentic.predictor.AgentPredictor`.
+
+Module-level ``__getattr__`` exposes ``root_agent`` lazily so ``adk web`` can
+load this module for interactive (schema-free) use without importing the full
+predictor stack.
+"""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+from typing import Any
+
+import pandas as pd
+from aieng.forecasting.data import DataService
+from aieng.forecasting.data.context import ForecastContext
+from aieng.forecasting.evaluation.prediction import STANDARD_QUANTILES
+from aieng.forecasting.evaluation.task import ForecastingTask
+from aieng.forecasting.methods.agentic import (
+    AgentPredictor,
+    ContinuousAgentForecastOutput,
+    ForecastTool,
+    build_adk_agent,
+)
+from aieng.forecasting.methods.agentic.agent_factory import (
+    AgentConfig,
+    CodeExecutionConfig,
+    ContextRetrievalConfig,
+)
+from aieng.forecasting.methods.numerical.darts_arima import DartsAutoARIMAPredictor
+from aieng.forecasting.models import LITE_MODEL
+from energy_oil_forecasting.data import WTI_SERIES_ID, build_wti_service
+from pydantic import BaseModel
+
+
+# ---------------------------------------------------------------------------
+# System prompt (root analyst agent)
+# ---------------------------------------------------------------------------
+
+_WTI_MULTITASK_ANALYST_INSTRUCTION = """\
+## Role
+
+You are an expert WTI crude oil market analyst.
+
+## Input
+
+You will receive a JSON payload containing:
+- `task_spec`: the exact question and required JSON output schema
+- `as_of`: the forecast origin date (temporal cutoff)
+- `origin_price_usd_bbl`: WTI close on the origin date
+- `target_history_csv`: compressed WTI daily close history
+
+When context retrieval is enabled, call `search_web` BEFORE answering.
+
+## Output contract
+
+Read the data (and briefing, if retrieved) carefully, then execute the task \
+in `task_spec` precisely.
+
+If a `set_model_response` tool is available, call it with your complete JSON \
+as `json_response` — the exact schema is described in `task_spec`. Otherwise \
+return the JSON directly as plain text with no preamble.\
+"""
+
+
+def _build_wti_analyst_instruction() -> str:
+    """Build the WTI analyst instruction, embedding the output schema from the class.
+
+    Using a function instead of a static string ensures the ``## Output schema``
+    block is always in sync with ``ContinuousAgentForecastOutput`` —
+    no manual JSON to maintain.
+    """
+    schema = ContinuousAgentForecastOutput.prompt_schema_json()
+    return (
+        "## Role\n\n"
+        "You are an expert WTI crude oil market analyst. You produce calibrated "
+        "probabilistic price forecasts for WTI crude oil futures, grounded in "
+        "supply/demand fundamentals, geopolitical risk, and historical price dynamics.\n\n"
+        "## Forecasting contract\n\n"
+        "You will receive a JSON payload containing:\n"
+        "- `task`: the task identifier\n"
+        "- `as_of`: the forecast origin date in YYYY-MM-DD format\n"
+        "- `horizons`: a list of integer horizon steps (business days ahead)\n"
+        "- `standard_quantiles`: the exact quantile levels you must produce\n"
+        "- `target_summary`: last close price, 52-week range, and observation count\n"
+        "- `target_history_csv`: WTI daily close history (recent 6 months daily, "
+        "older history as weekly averages)\n\n"
+        "Rules:\n"
+        "1. Produce one forecast for each horizon listed in `horizons`.\n"
+        "2. Use exactly the quantile levels from `standard_quantiles` — no additions, no omissions.\n"
+        "3. `point_forecast` must exactly equal the 0.50 quantile value.\n"
+        "4. Quantile values must be strictly non-decreasing as quantile levels increase.\n"
+        "5. Document your reasoning in the `rationale` fields.\n"
+        "6. When tools are enabled, conclude with `set_model_response` to return the structured forecast.\n\n"
+        "## Output schema\n\n"
+        "Call `set_model_response` with a `json_response` string matching **exactly**:\n\n"
+        "```json\n" + schema + "\n```\n\n"
+        'Critical: use `"horizon"` (integer, not `"horizon_days"`). '
+        '`"quantiles"` is a **list** of `{"quantile": <level>, "value": <price>}` '
+        "objects — not a dict. Omit any field not shown above.\n\n"
+        "## Analysis discipline\n\n"
+        "When context retrieval is available, call ``search_web`` to gather market "
+        "intelligence BEFORE producing forecasts.\n\n"
+        "Call ``search_web`` with ``query`` and ``cutoff_date`` (set to the ``as_of`` "
+        "date from the payload). The ``cutoff_date`` MUST always equal ``as_of`` — "
+        "this is the temporal fence that prevents post-origin information from "
+        "contaminating historical backtests.\n\n"
+        "Recommended queries (call ``search_web`` once per topic):\n"
+        '- ``search_web(query="WTI crude oil price trend and OPEC+ supply decisions", cutoff_date=<as_of>)``\n'
+        '- ``search_web(query="Persian Gulf geopolitical risk shipping lane disruptions", cutoff_date=<as_of>)``\n'
+        '- ``search_web(query="US Strategic Petroleum Reserve policy and global demand outlook", cutoff_date=<as_of>)``\n\n'
+        "Document your key assumptions (OPEC+ policy, shipping lane risk, inventory "
+        "levels, macro demand) in the `rationale` fields of your forecast output."
+    )
+
+
+_WTI_ANALYST_INSTRUCTION = _build_wti_analyst_instruction()
+
+# ---------------------------------------------------------------------------
+# Context retrieval instruction (sub-agent)
+# ---------------------------------------------------------------------------
+
+_WTI_CONTEXT_RETRIEVAL_INSTRUCTION = """\
+You are an oil market intelligence specialist with access to web search.
+
+Search for information relevant to the query and return a concise structured \
+markdown summary (3-5 paragraphs) covering relevant aspects of:
+- WTI/Brent crude price level and recent trend
+- OPEC+ production decisions and supply outlook
+- Geopolitical risks in the Persian Gulf, Middle East, key shipping lanes
+- US Strategic Petroleum Reserve and energy policy signals
+- Notable tanker/shipping incidents or supply disruption signals
+- Published analyst forecasts or unusual price-target revisions
+
+Ground your summary in the search results you actually retrieve. \
+When a cutoff date is specified, do not report or speculate about events \
+that occurred after that date.\
+"""
+
+# ---------------------------------------------------------------------------
+# Skills supplement (appended to instruction when skills are attached)
+# ---------------------------------------------------------------------------
+
+_CODE_EXEC_SKILLS_SUPPLEMENT = """
+
+## Skills
+
+You have access to two forecasting skills via the SkillToolset. All data
+available to code execution comes from the JSON payload in your context —
+there are no disk files to read.
+
+**Recommended invocation order:**
+
+1. `statistical-analysis` — run first. Provides diagnostic code patterns
+   for interrogating the price series you have been given: vol regime
+   classification, anomaly detection, and adaptive trend-window selection.
+   The output of Pattern 3 (trend window) is the input to the projection
+   skill below.
+
+2. `trend-projection` — run second. Provides code patterns for fitting a
+   linear trend on the window chosen above, projecting point forecasts to
+   each horizon, and calibrating 80% prediction interval widths.
+
+**To use a skill:**
+1. Call `list_skills` to see available skill names and descriptions.
+2. Call `load_skill(<name>)` to read the skill's full instructions.
+3. Call `load_skill_resource(<skill_name>, <file_path>)` to load a
+   reference file (e.g. `references/wti_benchmarks.json`).
+
+These skills have NO scripts. Do not call `run_skill_script`.\
+"""
+
+# ---------------------------------------------------------------------------
+# Forecast tool supplement (appended to instruction when the forecast tool is attached)
+# ---------------------------------------------------------------------------
+
+_FORECAST_TOOL_SUPPLEMENT = f"""
+
+## Statistical forecast tool
+
+You have access to `run_forecast`, a conventional statistical baseline
+(AutoARIMA) you can call directly. Unlike open-ended code, this tool has a fixed,
+auditable interface and returns a structured forecast you can reason from.
+
+Call it ONCE before producing your forecast, with:
+- `series_id`: "{WTI_SERIES_ID}"
+- `cutoff_date`: the `as_of` date from the payload (YYYY-MM-DD). This is the
+  information cutoff — the model uses only data on or before it.
+- `horizons`: the `horizons` list from the payload.
+- `frequency`: "B" (WTI trades on business days).
+
+The tool returns JSON with point forecasts and 80%/90% prediction intervals per
+horizon. Treat it as a disciplined statistical anchor: combine it with the
+market context from the search sub-agent. You may adjust away from the baseline
+when fundamentals or geopolitical risk justify it — document your reasoning in
+the `rationale` fields.\
+"""
+
+# ---------------------------------------------------------------------------
+# Skill directories
+# ---------------------------------------------------------------------------
+
+_SKILLS_ROOT = Path(__file__).parent / "skills"
+
+
+# ---------------------------------------------------------------------------
+# History compression
+# ---------------------------------------------------------------------------
+
+
+def compress_history(df: pd.DataFrame) -> str:
+    """Compress WTI daily history to stay within context limits.
+
+    Returns daily bars for the most recent 6 months and weekly averages for
+    older history.  The CSV header is ``date,close``.
+
+    Parameters
+    ----------
+    df : pd.DataFrame
+        DataFrame with columns ``timestamp`` and ``value``.
+
+    Returns
+    -------
+    str
+        CSV string with header ``date,close``.
+    """
+    df = df.copy()
+    df["timestamp"] = pd.to_datetime(df["timestamp"])
+    cutoff = df["timestamp"].max() - pd.DateOffset(months=6)
+
+    recent = df[df["timestamp"] >= cutoff].copy()
+    old = df[df["timestamp"] < cutoff].copy()
+
+    rows: list[str] = ["date,close"]
+
+    if not old.empty:
+        old_indexed = old.set_index("timestamp")["value"]
+        weekly: pd.Series = old_indexed.resample("W").mean().dropna()
+        for date, val in weekly.items():
+            rows.append(f"{date.date()},{val:.2f}")
+
+    for _, row in recent.iterrows():
+        rows.append(f"{row['timestamp'].date()},{row['value']:.2f}")
+
+    return "\n".join(rows)
+
+
+# ---------------------------------------------------------------------------
+# Prompt builder
+# ---------------------------------------------------------------------------
+
+
+class WtiPriceForecastPromptBuilder(BaseModel):
+    """Prompt builder for WTI crude oil price forecasting tasks.
+
+    Produces a structured JSON payload for the analyst agent containing the
+    task specification, compressed price history, and a data summary.
+    The payload includes ``standard_quantiles`` explicitly so the agent knows
+    the exact grid it must produce.
+
+    Implements the
+    :class:`~aieng.forecasting.methods.agentic.predictor.ForecastPromptBuilder`
+    protocol (structural typing — no explicit inheritance required).
+    """
+
+    model_config = {"extra": "forbid"}
+
+    def __call__(self, *, task: ForecastingTask, context: ForecastContext) -> str:
+        """Serialise the task and context into a JSON string for the agent.
+
+        Parameters
+        ----------
+        task : ForecastingTask
+            The forecasting task — supplies ``task_id``, ``horizons``.
+        context : ForecastContext
+            The information state at forecast time.
+
+        Returns
+        -------
+        str
+            JSON-serialised payload with task metadata, compressed history, and
+            the standard quantile grid the agent must populate.
+        """
+        df = context.get_series(task.target_series_id)
+        compressed = compress_history(df)
+
+        last_row = df.iloc[-1]
+        last_close = float(last_row["value"])
+        last_date = str(pd.Timestamp(last_row["timestamp"]).date())
+        trailing_252 = df["value"].tail(252)
+
+        payload: dict[str, Any] = {
+            "task": task.task_id,
+            "as_of": str(context.as_of)[:10],
+            "horizons": list(task.horizons),
+            "standard_quantiles": list(STANDARD_QUANTILES),
+            "target_summary": {
+                "last_close_usd_bbl": last_close,
+                "last_date": last_date,
+                "n_trading_days": int(len(df)),
+                "52w_high": float(trailing_252.max()),
+                "52w_low": float(trailing_252.min()),
+            },
+            "target_history_csv": compressed,
+        }
+
+        return json.dumps(payload, indent=2)
+
+
+# ---------------------------------------------------------------------------
+# AgentConfig factories
+# ---------------------------------------------------------------------------
+
+
+def build_wti_basic_config(model: str = LITE_MODEL) -> AgentConfig:
+    """Build an :class:`AgentConfig` with no tools.
+
+    The agent reasons purely from the price history in the prompt payload.
+    Useful as a low-cost baseline or starting point when comparing capability
+    levels.
+
+    Parameters
+    ----------
+    model : str
+        Gemini model identifier.
+
+    Returns
+    -------
+    AgentConfig
+    """
+    return AgentConfig(
+        name="wti_analyst_basic",
+        model=model,
+        instruction=_WTI_ANALYST_INSTRUCTION,
+    )
+
+
+def build_wti_multitask_news_config(
+    model: str = LITE_MODEL,
+    search_model: str = LITE_MODEL,
+) -> AgentConfig:
+    """News-grounded config for the one-agent-three-tasks demo (NB3).
+
+    Uses a task-agnostic analyst instruction; the task schema is supplied in
+    the user prompt payload via :class:`~energy_oil_forecasting.tasks.WtiMultitaskPromptBuilder`.
+
+    Parameters
+    ----------
+    model : str
+        Model for the top-level analyst agent.
+    search_model : str
+        Model for the context-retrieval (web-search) sub-tool. Defaults to
+        the lite model (``gemini-3.1-flash-lite-preview``) independently of ``model`` so that Gemini
+        handles Google Search even when the analyst uses a different provider.
+    """
+    return AgentConfig(
+        name="wti_analyst_multitask",
+        model=model,
+        instruction=_WTI_MULTITASK_ANALYST_INSTRUCTION,
+        context_retrieval=ContextRetrievalConfig(
+            enabled=True,
+            instruction=_WTI_CONTEXT_RETRIEVAL_INSTRUCTION,
+            search_model=search_model,
+        ),
+    )
+
+
+def build_wti_news_config(
+    model: str = LITE_MODEL,
+    search_model: str = LITE_MODEL,
+) -> AgentConfig:
+    """Build an :class:`AgentConfig` with bounded Google Search.
+
+    Wires a :class:`~aieng.forecasting.methods.agentic.agent_factory.ContextRetrievalConfig`
+    sub-agent that enforces a temporal cutoff on every search call, preventing
+    future information from contaminating historical backtests.
+
+    Parameters
+    ----------
+    model : str
+        Model for the top-level analyst agent.
+    search_model : str
+        Model for the context-retrieval (web-search) sub-tool. Defaults to
+        the lite model (``gemini-3.1-flash-lite-preview``) independently of ``model`` so that Gemini
+        handles Google Search even when the analyst uses a different provider.
+
+    Returns
+    -------
+    AgentConfig
+    """
+    return AgentConfig(
+        name="wti_analyst_news",
+        model=model,
+        instruction=_WTI_ANALYST_INSTRUCTION,
+        context_retrieval=ContextRetrievalConfig(
+            enabled=True,
+            instruction=_WTI_CONTEXT_RETRIEVAL_INSTRUCTION,
+            search_model=search_model,
+        ),
+    )
+
+
+def build_wti_code_exec_config(
+    model: str = LITE_MODEL,
+    search_model: str = LITE_MODEL,
+    max_output_tokens: int = 16_384,
+) -> AgentConfig:
+    """Build an :class:`AgentConfig` with E2B code execution and forecasting skills.
+
+    Combines bounded Google Search (temporal cutoff enforced) with E2B sandbox
+    code execution and two forecasting skills:
+
+    - ``statistical-analysis``: diagnostic patterns for the payload data
+      (vol regime, anomaly detection, adaptive trend window).
+    - ``trend-projection``: linear trend fit, CI calibration, and plausibility
+      guard using the window determined by statistical-analysis.
+
+    Parameters
+    ----------
+    model : str
+        Model for the top-level analyst agent.
+    search_model : str
+        Model for the context-retrieval (web-search) sub-tool. Defaults to
+        the lite model (``gemini-3.1-flash-lite-preview``) independently of ``model`` so that Gemini
+        handles Google Search even when the analyst uses a different provider.
+    max_output_tokens : int, default=16_384
+        Maximum tokens per model response.  The default is set well above
+        LiteLLM's OpenAI-compatible endpoint default of 4096, which is not
+        enough for Claude to write a complete ``run_code`` Python script in a
+        single function call — causing repeated retries with empty arguments.
+
+    Returns
+    -------
+    AgentConfig
+    """
+    return AgentConfig(
+        name="wti_analyst_code",
+        model=model,
+        instruction=_WTI_ANALYST_INSTRUCTION + _CODE_EXEC_SKILLS_SUPPLEMENT,
+        max_output_tokens=max_output_tokens,
+        context_retrieval=ContextRetrievalConfig(
+            enabled=True,
+            instruction=_WTI_CONTEXT_RETRIEVAL_INSTRUCTION,
+            search_model=search_model,
+        ),
+        code_execution=CodeExecutionConfig(enabled=True),
+        skills_dirs=[
+            _SKILLS_ROOT / "statistical-analysis",
+            _SKILLS_ROOT / "trend-projection",
+        ],
+    )
+
+
+def build_wti_tool_config(
+    model: str = LITE_MODEL,
+    search_model: str = LITE_MODEL,
+    *,
+    data_service: DataService | None = None,
+    num_samples: int = 200,
+) -> AgentConfig:
+    """Build an :class:`AgentConfig` with a conventional statistical forecast tool.
+
+    This is the fourth analyst capability level. It combines bounded Google
+    Search (temporal cutoff enforced) with a
+    :class:`~aieng.forecasting.methods.agentic.forecast_tool.ForecastTool`
+    that runs AutoARIMA on the WTI series. In contrast to
+    :func:`build_wti_code_exec_config` — which gives the agent open-ended code
+    execution — this path exposes a rigid, pre-specified tool, trading
+    flexibility for control and reproducibility.
+
+    Parameters
+    ----------
+    model : str
+        Model for the top-level analyst agent.
+    search_model : str
+        Model for the context-retrieval (web-search) sub-tool. Defaults to
+        the lite model (``gemini-3.1-flash-lite-preview``) independently of ``model`` so that Gemini
+        handles Google Search even when the analyst uses a different provider.
+    data_service : DataService or None
+        Pre-populated data service with the WTI series registered. When
+        ``None``, one is constructed via
+        :func:`~energy_oil_forecasting.data.build_wti_service` (cache-backed).
+        Series data is read by the tool but never enters the LLM context.
+    num_samples : int, default=200
+        Monte Carlo sample count for AutoARIMA. Kept modest to bound agent
+        latency, since AutoARIMA can be slow per origin.
+
+    Returns
+    -------
+    AgentConfig
+    """
+    service = data_service if data_service is not None else build_wti_service()
+    forecast_tool = ForecastTool(service, predictor=DartsAutoARIMAPredictor(num_samples=num_samples))
+
+    return AgentConfig(
+        name="wti_analyst_tool",
+        model=model,
+        instruction=_WTI_ANALYST_INSTRUCTION + _FORECAST_TOOL_SUPPLEMENT,
+        context_retrieval=ContextRetrievalConfig(
+            enabled=True,
+            instruction=_WTI_CONTEXT_RETRIEVAL_INSTRUCTION,
+            search_model=search_model,
+        ),
+        function_tools=[forecast_tool.as_function_tool()],
+    )
+
+
+# ---------------------------------------------------------------------------
+# Predictor convenience factory
+# ---------------------------------------------------------------------------
+
+
+def build_wti_agent_predictor(config: AgentConfig) -> AgentPredictor:
+    """Wrap an :class:`AgentConfig` in an :class:`AgentPredictor`.
+
+    Uses :class:`WtiPriceForecastPromptBuilder` and
+    :class:`~aieng.forecasting.methods.agentic.outputs.ContinuousAgentForecastOutput`
+    as the output schema.
+
+    Parameters
+    ----------
+    config : AgentConfig
+        Any config produced by :func:`build_wti_basic_config`,
+        :func:`build_wti_news_config`, :func:`build_wti_code_exec_config`, or :func:`build_wti_tool_config`.
+
+    Returns
+    -------
+    AgentPredictor
+    """
+    return AgentPredictor(
+        agent_config=config,
+        prompt_builder=WtiPriceForecastPromptBuilder(),
+        output_schema=ContinuousAgentForecastOutput,
+    )
+
+
+# ---------------------------------------------------------------------------
+# Lazy root_agent for `adk web` interactive use
+# ---------------------------------------------------------------------------
+
+
+def __getattr__(name: str) -> Any:
+    """Expose ``root_agent`` lazily for schema-free interactive use via ``adk web``."""
+    if name == "root_agent":
+        return build_adk_agent(build_wti_basic_config())
+    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__analyst_agent__skills__statistical-analysis__SKILL.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__analyst_agent__skills__statistical-analysis__SKILL.md.md
new file mode 100644
index 0000000..c8f1fb4
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__analyst_agent__skills__statistical-analysis__SKILL.md.md
@@ -0,0 +1,65 @@
+# Source: implementations/energy_oil_forecasting/analyst_agent/skills/statistical-analysis/SKILL.md
+
+kind: markdown
+
+---
+name: statistical-analysis
+description: >-
+  Diagnostic code patterns for interrogating the WTI price series you have
+  been given — vol regime classification, anomaly detection, and adaptive
+  trend-window selection. Load references/analysis-patterns.md for working
+  code. Load references/wti_benchmarks.json for historical benchmark values
+  to compare against. Run this skill before trend-projection.
+---
+
+# Statistical analysis skill
+
+## Your data universe
+
+All data available to code execution comes from the **JSON payload in your
+context**. There are no disk files, no database connections. The fields are:
+
+| Field | Description |
+|---|---|
+| `target_history_csv` | WTI daily close history as a CSV string — recent 6 months daily, older history as weekly averages |
+| `target_summary` | `last_close_usd_bbl`, `last_date`, `52w_high`, `52w_low`, `n_trading_days` |
+| `as_of` | Forecast origin date (YYYY-MM-DD) |
+| `horizons` | List of integer horizon steps (business days) |
+| `standard_quantiles` | Exact quantile grid you must produce |
+
+`target_history_csv` is a **string embedded in JSON** — parse it with
+`io.StringIO`, not a file path. The CSV has a header row (`date,close`) and
+mixes two frequencies: recent rows are daily (consecutive trading days),
+older rows are weekly averages (gaps of ~7 days between dates). Detect the
+split by looking for date gaps > 3 days.
+
+The Gemini code execution session is **stateful within a turn**: parse the
+CSV once in your first code block, then reference the resulting DataFrame in
+subsequent blocks without re-parsing.
+
+## What this skill provides
+
+**`references/wti_benchmarks.json`** — Pre-computed historical benchmark
+values (2020–2025): weekly move percentiles, rolling-30d vol distribution,
+daily move stats, horizon CI calibration, and regime classification
+thresholds. Load this to compare computed values against a known baseline.
+
+**`references/analysis-patterns.md`** — Working code patterns for three
+diagnostic questions you should answer before producing a forecast. Each
+pattern is self-contained and prints a structured one-line result you can
+read back.
+
+## Recommended workflow
+
+1. Call `load_skill_resource("statistical-analysis", "references/wti_benchmarks.json")`
+   to load benchmark values into context.
+2. Call `load_skill_resource("statistical-analysis", "references/analysis-patterns.md")`
+   to load the diagnostic code patterns.
+3. Run Pattern 1 (vol regime), Pattern 2 (anomaly check), Pattern 3 (window
+   choice) in your code execution blocks.
+4. Use the printed results to inform the trend window you pass to the
+   `trend-projection` skill.
+
+Run this skill **before** `trend-projection`.
+
+**No scripts in this skill. Do not call `run_skill_script`.**
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__analyst_agent__skills__statistical-analysis__references__analysis-patterns.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__analyst_agent__skills__statistical-analysis__references__analysis-patterns.md.md
new file mode 100644
index 0000000..750bfcb
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__analyst_agent__skills__statistical-analysis__references__analysis-patterns.md.md
@@ -0,0 +1,173 @@
+# Source: implementations/energy_oil_forecasting/analyst_agent/skills/statistical-analysis/references/analysis-patterns.md
+
+kind: markdown
+
+# Statistical Analysis — Code Patterns
+
+These patterns help you interrogate the price series you have been given
+before producing a forecast. Each one answers a specific diagnostic question
+and prints a structured result you can read back in the conversation.
+
+> **Bootcamp note:** These patterns are demonstrated with WTI but the
+> underlying approach — parse a payload CSV, classify vol regime, detect
+> anomalies, adapt the trend window — transfers to any time-series reference
+> implementation. Replace the regime thresholds with domain-appropriate values
+> from your own benchmarks file.
+
+---
+
+## Section 0: Working with the Gemini execution environment
+
+Before running any of the patterns below, understand the constraints:
+
+**All data enters through the payload.** There are no files to `open()` and
+no packages to `pip install`. Everything you can use in code is already in
+the JSON payload in your context.
+
+**Parse the history string once.** `target_history_csv` is a string — parse
+it with `io.StringIO` in your first code block. The Gemini session is
+stateful within a turn, so the resulting `df` is available in every
+subsequent block without re-parsing.
+
+**Use `print()` to get results out.** Code execution output is returned to
+you as text in the conversation. Design your print statements to be short
+and readable — one labelled line per key result is easier to act on than a
+dump of raw numbers.
+
+**Detect the daily/weekly split.** The history mixes two frequencies: recent
+rows are consecutive trading days; older rows are weekly averages spaced
+~7 days apart. Patterns 1–3 should use only the daily portion for
+close-to-close statistics, since weekly averages smooth out day-to-day moves
+and understate realised volatility.
+
+```python
+import io
+import numpy as np
+import pandas as pd
+
+# Parse once — reference `df` and `daily` in subsequent blocks
+payload = ...  # dict parsed from the JSON payload string
+
+history_csv = payload["target_history_csv"]
+df = pd.read_csv(io.StringIO(history_csv), parse_dates=["date"])
+df = df.sort_values("date").reset_index(drop=True)
+
+# Split daily (recent) vs weekly (older) rows by detecting date gaps > 3 days
+day_gaps = df["date"].diff().dt.days
+daily = df[day_gaps <= 3].copy().reset_index(drop=True)  # daily portion only
+
+print(f"Total rows: {len(df)}  |  Daily rows: {len(daily)}  |  "
+      f"Earliest daily: {daily['date'].iloc[0].date()}")
+```
+
+---
+
+## Pattern 1: Is the current vol regime normal or elevated?
+
+Compute the rolling 30-day annualised volatility over the daily portion of
+the history and classify it against the `regime_thresholds` in
+`wti_benchmarks.json`.
+
+```python
+# Assumes `daily` DataFrame is already defined (Section 0)
+# Assumes `benchmarks` dict is already loaded from wti_benchmarks.json
+
+log_returns = np.log(daily["close"] / daily["close"].shift(1)).dropna()
+
+# Rolling 30-day annualised vol
+rolling_vol = log_returns.rolling(30).std() * np.sqrt(252) * 100  # in %
+current_vol = float(rolling_vol.iloc[-1])
+
+thresholds = benchmarks["regime_thresholds"]
+if current_vol < thresholds["low_vol_max_pct"]:
+    regime = "low"
+elif current_vol < thresholds["normal_vol_max_pct"]:
+    regime = "normal"
+elif current_vol < thresholds["high_vol_max_pct"]:
+    regime = "elevated"
+else:
+    regime = "extreme"
+
+median_vol = benchmarks["rolling_30d_vol"]["median_annualised_pct"]
+print(f"REGIME: {regime}  |  current_vol={current_vol:.1f}%  "
+      f"vs median={median_vol:.1f}%")
+```
+
+**Example output:**
+```
+REGIME: elevated  |  current_vol=41.3%  vs median=31.4%
+```
+
+**What to do with this:** An `elevated` or `extreme` regime means recent
+price swings are larger than usual. This should shorten your trend window
+(see Pattern 3) and widen your forecast intervals relative to the empirical
+calibration floor in `horizon_calibration`.
+
+---
+
+## Pattern 2: Was the most recent move anomalous?
+
+Compute the z-score of the most recent daily close-to-close move relative
+to the rolling standard deviation of daily moves. A large z-score suggests
+a regime break or one-off shock that may not represent the ongoing trend.
+
+```python
+# Assumes `daily` DataFrame is already defined (Section 0)
+
+close_changes = daily["close"].diff().dropna()
+rolling_std = close_changes.rolling(30).std()
+
+last_change = float(close_changes.iloc[-1])
+last_std = float(rolling_std.iloc[-1])
+z_score = last_change / last_std if last_std > 0 else 0.0
+
+print(f"ANOMALY: z={z_score:+.2f}  |  last_move={last_change:+.2f} USD  "
+      f"rolling_std={last_std:.2f} USD")
+```
+
+**Example output:**
+```
+ANOMALY: z=+3.14  |  last_move=+4.21 USD  rolling_std=1.34 USD
+```
+
+**What to do with this:** |z| > 2.5 indicates an unusual move. Treat a large
+positive z as potential upside momentum, a large negative z as potential
+downside break. Either way, be cautious about extending a short-window trend
+through such a move — it may be an outlier rather than a signal.
+
+This pattern generalises directly to other time series: the z-score logic is
+the same regardless of the underlying asset.
+
+---
+
+## Pattern 3: How many recent days should I trust for trend estimation?
+
+Choose a trend estimation window based on the regime and anomaly signals
+from Patterns 1 and 2. The goal is to use enough history to fit a stable
+trend, but not so much that a regime shift or shock contaminates the window.
+
+```python
+# Assumes `regime` string and `z_score` float are already defined
+
+if regime in ("elevated", "extreme") or abs(z_score) > 2.5:
+    trend_window = 15
+    reason = f"regime={regime}, |z|={abs(z_score):.2f} — shortened window"
+else:
+    trend_window = 30
+    reason = f"regime={regime}, |z|={abs(z_score):.2f} — standard window"
+
+print(f"TREND_WINDOW: {trend_window} days  ({reason})")
+```
+
+**Example output:**
+```
+TREND_WINDOW: 15 days  (regime=elevated, |z|=3.14 — shortened window)
+```
+
+**What to do with this:** Pass `trend_window` to the `trend-projection`
+skill as the number of recent daily rows to use for the `LinearRegression`
+fit (replacing the fixed 30-day window in the projection examples).
+
+The 15/30 thresholds are reasonable defaults for WTI. For a less volatile
+series you might use 20/45; for a more reactive one, 10/20. Adjust based on
+how quickly regimes typically change in your domain.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__analyst_agent__skills__trend-projection__SKILL.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__analyst_agent__skills__trend-projection__SKILL.md.md
new file mode 100644
index 0000000..4ef6d84
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__analyst_agent__skills__trend-projection__SKILL.md.md
@@ -0,0 +1,43 @@
+# Source: implementations/energy_oil_forecasting/analyst_agent/skills/trend-projection/SKILL.md
+
+kind: markdown
+
+---
+name: trend-projection
+description: >-
+  Copy-pasteable scikit-learn and numpy code patterns for fitting a linear
+  trend on recent WTI price history, projecting point forecasts to standard
+  horizons, and calibrating 80% prediction interval widths from residual
+  standard errors. Load references/projection-examples.md before writing any
+  trend-projection code.
+---
+
+# Trend projection skill
+
+Run the `statistical-analysis` skill first to determine the current vol
+regime and appropriate trend window before applying these patterns.
+
+Load `references/projection-examples.md` via
+`load_skill_resource("trend-projection", "references/projection-examples.md")`
+**before writing any trend-projection code**.
+
+The reference file contains:
+- A complete working code pattern using `sklearn.linear_model.LinearRegression`
+  to fit the most recent 30 trading days of WTI close prices.
+- The standard interval-width formula: `1.28 * residual_std * sqrt(h / 5)`,
+  which produces the 80% CI half-width at horizon `h` business days.
+- A guard for the edge case where the trend line overshoots the 52-week range.
+- Worked numeric examples showing expected output for typical WTI vol regimes.
+
+## Quick-reference steps
+
+1. Parse the CSV history from the task payload into a DataFrame.
+2. Select the most recent 30 rows (trading days).
+3. Fit `LinearRegression` on `[0..29]` (x) vs close price (y).
+4. Project to horizons 5, 10, 21 by evaluating the regression at `30 + h - 1`.
+5. Compute `residual_std = std of (y - y_hat)` on the 30-day window.
+6. Set 80% CI half-width = `1.28 * residual_std * sqrt(h / 5)`.
+7. Clip projected point forecast to `[0.5 * 52w_low, 1.5 * 52w_high]` as a
+   plausibility guard — extreme trend extrapolation is usually wrong.
+
+**No scripts in this skill. Do not call `run_skill_script`.**
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__analyst_agent__skills__trend-projection__references__projection-examples.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__analyst_agent__skills__trend-projection__references__projection-examples.md.md
new file mode 100644
index 0000000..819fca4
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__analyst_agent__skills__trend-projection__references__projection-examples.md.md
@@ -0,0 +1,119 @@
+# Source: implementations/energy_oil_forecasting/analyst_agent/skills/trend-projection/references/projection-examples.md
+
+kind: markdown
+
+# Trend Projection — Code Patterns
+
+These are working, copy-pasteable patterns for WTI price trend projection.
+Paste the relevant block into your code execution cell and adapt as needed.
+
+---
+
+## Pattern 1: Linear regression trend + residual-based 80% CI
+
+```python
+import io
+import numpy as np
+import pandas as pd
+from sklearn.linear_model import LinearRegression
+
+# ── 1. Parse the CSV payload ──────────────────────────────────────────────
+# Assume `history_csv` is the string value of task_payload["target_history_csv"]
+df = pd.read_csv(io.StringIO(history_csv), parse_dates=["date"])
+df = df.sort_values("date").reset_index(drop=True)
+
+# ── 2. Select the most recent 30 trading days ────────────────────────────
+window = df.tail(30).copy().reset_index(drop=True)
+x = window.index.values.reshape(-1, 1)           # shape (30, 1)
+y = window["close"].values                        # shape (30,)
+
+# ── 3. Fit linear regression ──────────────────────────────────────────────
+model = LinearRegression().fit(x, y)
+y_hat = model.predict(x)
+residual_std = float(np.std(y - y_hat, ddof=1))
+
+print(f"Trend slope: {model.coef_[0]:+.4f} USD/day")
+print(f"Residual std: {residual_std:.4f} USD")
+
+# ── 4. Project to horizons ────────────────────────────────────────────────
+horizons = [5, 10, 21]
+for h in horizons:
+    x_proj = np.array([[len(window) - 1 + h]])  # last window index + h; still correct if < 30 rows
+    point = float(model.predict(x_proj)[0])
+
+    # ── 5. Calibrate 80% CI ───────────────────────────────────────────────
+    half_width = 1.28 * residual_std * np.sqrt(h / 5)
+    lower_80 = point - half_width
+    upper_80 = point + half_width
+
+    print(f"h={h:>2}d  point={point:.2f}  80%CI=[{lower_80:.2f}, {upper_80:.2f}]")
+```
+
+**Expected output (typical WTI stable regime, ~$72/bbl):**
+```
+Trend slope: -0.0420 USD/day
+Residual std: 1.3200 USD
+h= 5d  point=71.45  80%CI=[69.76, 73.14]
+h=10d  point=71.24  80%CI=[68.85, 73.63]
+h=21d  point=70.78  80%CI=[67.32, 74.24]
+```
+
+**Expected output (high-vol regime, ~$75/bbl, residual_std ~$3.50):**
+```
+h= 5d  point=76.20  80%CI=[71.72, 80.68]
+h=10d  point=77.40  80%CI=[71.06, 83.74]
+h=21d  point=80.04  80%CI=[70.86, 89.22]
+```
+
+---
+
+## Pattern 2: Plausibility guard for trend extrapolation
+
+If the projected point falls below 0.5× the 52-week low or above 1.5× the 52-week high,
+clip it. Use `target_summary["52w_high"]` and `target_summary["52w_low"]` from the payload.
+
+```python
+low_52w  = payload["target_summary"]["52w_low"]
+high_52w = payload["target_summary"]["52w_high"]
+
+# Plausible band: from 50% of the 52-week low up to 150% of the 52-week high
+lower_bound = 0.5 * low_52w
+upper_bound = 1.5 * high_52w
+
+point_clipped = float(np.clip(point, lower_bound, upper_bound))
+if point_clipped != point:
+    print(f"WARNING: trend projected to {point:.2f}, clipped to {point_clipped:.2f}")
+```
+
+---
+
+## Pattern 3: Standard quantile grid from point + CI
+
+The task requires all 11 standard quantiles: 0.05, 0.10, 0.20, 0.30, 0.40,
+0.50, 0.60, 0.70, 0.80, 0.90, 0.95.
+
+Approximate with a Gaussian parameterised by `(point, sigma)`:
+
+```python
+import scipy.stats
+
+# Derive sigma from the 80% CI: CI_half_width = 1.28 * sigma
+sigma = half_width / 1.28
+
+standard_quantiles = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
+quantile_values = {q: float(scipy.stats.norm.ppf(q, loc=point, scale=sigma))
+                   for q in standard_quantiles}
+
+# Verify median matches point_forecast
+assert abs(quantile_values[0.50] - point) < 1e-6, "median must equal point_forecast"
+```
+
+---
+
+## Notes on Gemini code execution limits
+
+- Session timeout: ~30 seconds of CPU time. Keep computations lightweight.
+- Available packages: pandas, numpy, scipy, scikit-learn, matplotlib, seaborn.
+- `import io` is available for parsing CSV strings.
+- Do not attempt to `pip install` additional packages — the environment is fixed.
+- Use `print()` to inspect intermediate results; the output is returned to you.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__data.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__data.py.md
new file mode 100644
index 0000000..0a0abfa
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__data.py.md
@@ -0,0 +1,94 @@
+# Source: implementations/energy_oil_forecasting/data.py
+
+kind: python
+
+```python
+"""Data-service setup for the WTI Crude Oil forecasting experiment.
+
+:func:`build_wti_service` registers the continuous front-month WTI futures
+close series (Yahoo Finance ticker ``CL=F``) under the canonical
+:data:`WTI_SERIES_ID`.  Both the reference YAML specs under
+``implementations/energy_oil_forecasting/specs/`` and the notebooks here
+reference the same ``series_id`` via this module.
+"""
+
+from __future__ import annotations
+
+from datetime import datetime, timezone
+from pathlib import Path
+
+from aieng.forecasting.data import DataService, SeriesMetadata
+from aieng.forecasting.data.adapters.yfinance import YFinanceDailyAdapter
+
+
+def naive_utc_now() -> datetime:
+    """Return current UTC time as a timezone-naive :class:`datetime`.
+
+    :class:`~aieng.forecasting.data.service.DataService` and
+    :class:`~aieng.forecasting.data.cutoff.CutoffEnforcer` require naive
+    ``as_of`` values — tz-aware timestamps raise on comparison with cached
+    series timestamps.
+    """
+    return datetime.now(tz=timezone.utc).replace(tzinfo=None)
+
+
+WTI_SERIES_ID = "wti_crude_oil_price"
+"""Canonical series ID for the WTI front-month futures close price."""
+
+DEFAULT_CACHE_DIR = Path("data/yfinance")
+"""Default yfinance CSV cache directory (resolved relative to CWD at call time)."""
+
+_WTI_HISTORY_START = "2004-01-01"
+"""Earliest date requested from yfinance.  Setting an explicit start ensures the
+adapter fetches the full available history rather than yfinance's default 30-day
+window when no cache exists."""
+
+
+def build_wti_service(cache_dir: Path | None = None) -> DataService:
+    """Return a :class:`DataService` with the WTI Crude Oil daily close series registered.
+
+    Parameters
+    ----------
+    cache_dir : Path or None
+        yfinance CSV cache directory.  Defaults to ``data/yfinance`` relative
+        to the current working directory.  Notebooks typically run from their
+        own directory, so the adapter will transparently fetch from yfinance if
+        the cache is absent or stale, then persist the result for subsequent
+        runs.
+
+    Returns
+    -------
+    DataService
+        A data service with the WTI series registered, ready to be handed
+        to :func:`~aieng.forecasting.evaluation.backtest.backtest` /
+        :func:`~aieng.forecasting.evaluation.backtest.cached_multi_backtest` /
+        :func:`~aieng.forecasting.evaluation.eval.evaluate`.
+    """
+    resolved_cache_dir: Path = cache_dir if cache_dir is not None else DEFAULT_CACHE_DIR
+    svc = DataService()
+    svc.register(
+        WTI_SERIES_ID,
+        # field defaults to "Adj Close" — matches the cache key cl_f_adj_close_1d.parquet
+        # produced by scripts/fetch_wti.py. For futures contracts like CL=F, Adj Close
+        # equals Close (no dividend adjustments).
+        # start is set explicitly to ensure yfinance fetches full history on a cache miss
+        # rather than its default 30-day window.
+        YFinanceDailyAdapter(ticker="CL=F", start=_WTI_HISTORY_START, cache_dir=resolved_cache_dir),
+        SeriesMetadata(
+            series_id=WTI_SERIES_ID,
+            description="WTI Crude Oil continuous front-month futures adjusted close (Yahoo Finance CL=F)",
+            source="yfinance",
+            units="USD/bbl",
+            frequency="B",
+        ),
+    )
+    return svc
+
+
+__all__ = [
+    "DEFAULT_CACHE_DIR",
+    "WTI_SERIES_ID",
+    "build_wti_service",
+    "naive_utc_now",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__paths.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__paths.py.md
new file mode 100644
index 0000000..59c3ce0
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__paths.py.md
@@ -0,0 +1,112 @@
+# Source: implementations/energy_oil_forecasting/paths.py
+
+kind: python
+
+```python
+"""Shared paths, simulation constants, and colour palette for the energy/oil experiment."""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import pandas as pd
+
+
+def repo_data_dir() -> Path:
+    """Return ``data/`` at the repo root (walking up from CWD), or ``CWD/data`` if no ``pyproject.toml`` is found."""
+    cwd = Path.cwd().resolve()
+    root = cwd
+    while not (root / "pyproject.toml").exists():
+        if root.parent == root:
+            return cwd / "data"
+        root = root.parent
+    data_dir = root / "data"
+    data_dir.mkdir(exist_ok=True)
+    return data_dir
+
+
+DATA_DIR = repo_data_dir()
+
+# ── Case-study Prophet rolling backtest (NB1) ────────────────────────────────
+ROLLING_FORECAST_CACHE = DATA_DIR / "energy_case_study_forecasts_30d_daily_v3.parquet"
+SIMULATION_START = pd.Timestamp("2025-01-01")
+SIMULATION_END = pd.Timestamp("2030-12-31")
+ROLLING_HORIZON_DAYS = 30
+ROLLING_CI_WIDTH = 0.95
+
+# ── Prophet origin trajectories (NB3 baselines) ──────────────────────────────
+PROPHET_TRAJ_CACHE = DATA_DIR / "energy_prophet_trajectories.parquet"
+PROPHET_SHOCK_TRAJ_CACHE = DATA_DIR / "energy_shock_prophet_trajectories.parquet"
+
+# ── Agent JSON caches (NB3) ───────────────────────────────────────────────────
+TRAJ_AGENT_CACHE = DATA_DIR / "energy_agent_trajectory_forecasts.json"
+TRAJ_CONTEXT_CACHE = DATA_DIR / "energy_agent_trajectory_context.json"
+SHOCK_ANALYST_CACHE = DATA_DIR / "energy_upshock_analyst_forecasts.json"
+SHOCK_CONTEXT_CACHE = DATA_DIR / "energy_upshock_news_context.json"
+SCENARIO_CACHE = DATA_DIR / "energy_agent_scenario_forecasts.json"
+
+# ── Demo origins ──────────────────────────────────────────────────────────────
+TRAJECTORY_ORIGINS: list[pd.Timestamp] = [
+    pd.Timestamp("2026-02-02"),
+    pd.Timestamp("2026-02-23"),
+    pd.Timestamp("2026-03-02"),
+]
+SHOCK_ORIGINS: list[pd.Timestamp] = [
+    pd.Timestamp("2026-02-02"),
+    pd.Timestamp("2026-02-09"),
+    pd.Timestamp("2026-02-16"),
+    pd.Timestamp("2026-02-23"),
+    pd.Timestamp("2026-03-02"),
+    pd.Timestamp("2026-03-09"),
+]
+SCENARIO_ORIGIN = pd.Timestamp("2026-03-02")
+
+SHOCK_THRESHOLD = 5.0
+SHOCK_HORIZON = 5
+
+# ── Plotly colour palette (shared across viz modules) ─────────────────────────
+CLR_ACTUAL = "#2171b5"
+CLR_HISTORY = "#bdd7e7"
+CLR_PROPHET = "#636363"
+CLR_AGENT = "#2ca02c"
+CLR_CI_PAST_FILL = "rgba(253, 141, 60, 0.22)"
+CLR_CI_CURR_FILL = "rgba(200, 90, 10, 0.50)"
+CLR_DAY_LINE = "rgba(150, 150, 150, 0.50)"
+CLR_HIT = "#31a354"
+CLR_MISS = "#de2d26"
+IRAN_COLOR = "#d62728"
+WARN_COLOR = "#b45309"
+
+__all__ = [
+    "CLR_ACTUAL",
+    "CLR_AGENT",
+    "CLR_CI_CURR_FILL",
+    "CLR_CI_PAST_FILL",
+    "CLR_DAY_LINE",
+    "CLR_HISTORY",
+    "CLR_HIT",
+    "CLR_MISS",
+    "CLR_PROPHET",
+    "DATA_DIR",
+    "IRAN_COLOR",
+    "PROPHET_SHOCK_TRAJ_CACHE",
+    "PROPHET_TRAJ_CACHE",
+    "ROLLING_CI_WIDTH",
+    "ROLLING_FORECAST_CACHE",
+    "ROLLING_HORIZON_DAYS",
+    "SCENARIO_CACHE",
+    "SCENARIO_ORIGIN",
+    "SHOCK_ANALYST_CACHE",
+    "SHOCK_CONTEXT_CACHE",
+    "SHOCK_HORIZON",
+    "SHOCK_ORIGINS",
+    "SHOCK_THRESHOLD",
+    "SIMULATION_END",
+    "SIMULATION_START",
+    "TRAJ_AGENT_CACHE",
+    "TRAJ_CONTEXT_CACHE",
+    "TRAJECTORY_ORIGINS",
+    "WARN_COLOR",
+    "repo_data_dir",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__prophet_baseline.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__prophet_baseline.py.md
new file mode 100644
index 0000000..9873c88
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__prophet_baseline.py.md
@@ -0,0 +1,306 @@
+# Source: implementations/energy_oil_forecasting/prophet_baseline.py
+
+kind: python
+
+```python
+"""Prophet baseline helpers for the WTI crude oil experiment.
+
+Provides a :class:`Predictor`-compatible wrapper for systematic backtests and
+origin-based trajectory helpers used in the case-study narrative and one-agent-
+three-tasks demo.
+"""
+
+from __future__ import annotations
+
+import logging
+from datetime import datetime, timedelta
+from pathlib import Path
+
+import numpy as np
+import pandas as pd
+import scipy.stats
+from aieng.forecasting.data.context import ForecastContext
+from aieng.forecasting.evaluation.prediction import (
+    STANDARD_QUANTILES,
+    ContinuousForecast,
+    Prediction,
+)
+from aieng.forecasting.evaluation.predictor import Predictor
+from aieng.forecasting.evaluation.task import ForecastingTask
+from prophet import Prophet
+
+
+def find_nearest_trading_day(target: pd.Timestamp, index: pd.DatetimeIndex) -> pd.Timestamp | None:
+    """Return the first trading day on or after ``target`` in ``index``, or ``None`` if there is none."""
+    candidates = index[index >= target]
+    return candidates[0] if len(candidates) > 0 else None
+
+
+def price_series_to_prophet_df(price_df: pd.DataFrame) -> pd.DataFrame:
+    """Convert a ``date``-indexed price DataFrame to Prophet ``ds``/``y`` columns."""
+    out = price_df.loc[:, ["price"]].reset_index()
+    out.columns = pd.Index(["ds", "y"])
+    return out
+
+
+def compute_rolling_forecasts(
+    price_df: pd.DataFrame,
+    simulation_start: pd.Timestamp,
+    simulation_end: pd.Timestamp,
+    horizon_days: int,
+    ci_width: float,
+    cache_path: Path,
+) -> pd.DataFrame:
+    """Run daily-refit Prophet forecasts for every simulation trading day.
+
+    Each sim_day gets its own Prophet model trained on all available data
+    through that day. Returns a DataFrame with columns:
+    ``sim_day``, ``resolution_date``, ``yhat``, ``yhat_lower``, ``yhat_upper``,
+    ``actual_price``, ``inside_ci``.
+    """
+    if cache_path.exists():
+        df = pd.read_parquet(cache_path)
+        print(f"Loaded {len(df):,} pre-computed forecasts from cache.")
+        return df
+
+    last_resolvable = price_df.index.max() - timedelta(days=horizon_days)
+    effective_end = min(simulation_end, last_resolvable)
+    sim_days = price_df.loc[simulation_start:effective_end].index.tolist()
+    n = len(sim_days)
+    print(f"Computing daily-refit forecasts for {n} simulation days (~10–15 min)...")
+
+    logging.getLogger("prophet").setLevel(logging.ERROR)
+    results: list[dict[str, object]] = []
+
+    for i, sim_day in enumerate(sim_days):
+        if i % 25 == 0 or i == n - 1:
+            print(f"  [{i + 1:>3}/{n}] {sim_day.date()}", flush=True)
+
+        train_df = price_series_to_prophet_df(price_df.loc[:sim_day])
+        model = Prophet(
+            interval_width=ci_width,
+            daily_seasonality=False,
+            weekly_seasonality=False,
+            yearly_seasonality=True,
+            seasonality_mode="multiplicative",
+        )
+        model.fit(train_df)
+
+        future = model.make_future_dataframe(periods=horizon_days + 5, freq="D")
+        pred = model.predict(future).set_index("ds")
+
+        resolution_calendar = sim_day + timedelta(days=horizon_days)
+        resolution_day = find_nearest_trading_day(resolution_calendar, price_df.index)
+        if resolution_day is None:
+            continue
+
+        if resolution_calendar not in pred.index:
+            resolution_calendar = min(pred.index, key=lambda d: abs(d - resolution_calendar))
+
+        row = pred.loc[resolution_calendar]
+        actual_price = (
+            float(price_df.loc[resolution_day, "price"]) if resolution_day in price_df.index else float("nan")
+        )
+        inside_ci = (
+            bool(float(row["yhat_lower"]) <= actual_price <= float(row["yhat_upper"]))
+            if not pd.isna(actual_price)
+            else False
+        )
+
+        results.append(
+            {
+                "sim_day": sim_day,
+                "resolution_date": resolution_day,
+                "yhat": float(row["yhat"]),
+                "yhat_lower": float(row["yhat_lower"]),
+                "yhat_upper": float(row["yhat_upper"]),
+                "actual_price": actual_price,
+                "inside_ci": inside_ci,
+            }
+        )
+
+    forecasts_df = pd.DataFrame(results)
+    forecasts_df.to_parquet(cache_path, index=False)
+    print(f"\nSaved {len(forecasts_df):,} forecast records to {cache_path}")
+    return forecasts_df
+
+
+def _fit_prophet_at_origin(price_df: pd.DataFrame, origin: pd.Timestamp) -> pd.DataFrame:
+    """Fit Prophet on history up to origin; return 21-business-day trajectory."""
+    train_df = price_series_to_prophet_df(price_df.loc[:origin])
+    logging.getLogger("prophet").setLevel(logging.ERROR)
+
+    model = Prophet(
+        interval_width=0.95,
+        daily_seasonality=False,
+        weekly_seasonality=False,
+        yearly_seasonality=True,
+        seasonality_mode="multiplicative",
+    )
+    model.fit(train_df)
+
+    future = model.make_future_dataframe(periods=35, freq="D")
+    pred = model.predict(future).set_index("ds")
+
+    bday_dates = pd.bdate_range(start=origin + pd.offsets.BDay(1), periods=21)
+    rows: list[dict[str, object]] = []
+    for h, date in enumerate(bday_dates, start=1):
+        cal_date = date.normalize()
+        if cal_date in pred.index:
+            row = pred.loc[cal_date]
+        else:
+            nearest_idx = int((pred.index - cal_date).abs().argmin())
+            row = pred.iloc[nearest_idx]
+        rows.append(
+            {
+                "origin": origin,
+                "forecast_date": date,
+                "horizon": h,
+                "yhat": float(row["yhat"]),
+                "yhat_lower": float(row["yhat_lower"]),
+                "yhat_upper": float(row["yhat_upper"]),
+            }
+        )
+    return pd.DataFrame(rows)
+
+
+def load_prophet_trajectories(
+    price_df: pd.DataFrame,
+    origins: list[pd.Timestamp],
+    cache_path: Path,
+) -> pd.DataFrame:
+    """Load from cache or fit Prophet at each origin."""
+    if cache_path.exists():
+        df = pd.read_parquet(cache_path)
+        df["origin"] = pd.to_datetime(df["origin"])
+        df["forecast_date"] = pd.to_datetime(df["forecast_date"])
+        print(f"Loaded {len(df)} Prophet trajectory rows from {cache_path.name}")
+        return df
+
+    print(f"Fitting Prophet at {len(origins)} origins ...")
+    frames = [_fit_prophet_at_origin(price_df, origin) for origin in origins]
+    df = pd.concat(frames, ignore_index=True)
+    df.to_parquet(cache_path, index=False)
+    print(f"Saved {len(df)} rows to {cache_path.name}")
+    return df
+
+
+def check_shock_outcome(
+    price_df: pd.DataFrame,
+    origin: pd.Timestamp,
+    threshold: float,
+    horizon_bdays: int,
+) -> tuple[int, float]:
+    """Return ``(outcome, delta)`` where outcome=1 if day-H close > origin + threshold."""
+    origin_price = float(price_df[price_df.index >= origin].iloc[0]["price"])
+    future = price_df[price_df.index > origin].iloc[:horizon_bdays]
+    delta = float(future.iloc[-1]["price"]) - origin_price
+    return (1 if delta > threshold else 0), delta
+
+
+def prophet_prob_shock(
+    prophet_traj_sub: pd.DataFrame,
+    origin_price: float,
+    threshold: float,
+    horizon: int = 5,
+) -> float:
+    """P(price_h > origin + threshold) from Prophet 95% CI (Gaussian approximation)."""
+    row = prophet_traj_sub[prophet_traj_sub["horizon"] == horizon]
+    if row.empty:
+        return float("nan")
+    row = row.iloc[0]
+    sigma = (float(row["yhat_upper"]) - float(row["yhat_lower"])) / (2 * 1.96)
+    if sigma <= 0:
+        return 1.0 if float(row["yhat"]) > origin_price + threshold else 0.0
+    return float(
+        np.clip(
+            1.0 - scipy.stats.norm.cdf(origin_price + threshold, loc=float(row["yhat"]), scale=sigma),
+            0.0,
+            1.0,
+        )
+    )
+
+
+class ProphetPredictor(Predictor):
+    """Standard :class:`Predictor` wrapper for Prophet daily WTI forecasting."""
+
+    def __init__(
+        self,
+        predictor_id: str = "prophet_daily",
+        *,
+        interval_width: float = 0.80,
+        seasonality_mode: str = "multiplicative",
+    ) -> None:
+        self._predictor_id = predictor_id
+        self._interval_width = interval_width
+        self._seasonality_mode = seasonality_mode
+
+    @property
+    def predictor_id(self) -> str:
+        return self._predictor_id
+
+    def predict(self, task: ForecastingTask, context: ForecastContext) -> list[Prediction]:
+        df = context.get_series(task.target_series_id)
+        if len(df) < 50:
+            return []
+
+        train_df = df.rename(columns={"timestamp": "ds", "value": "y"})
+        train_df["ds"] = pd.to_datetime(train_df["ds"])
+
+        logging.getLogger("prophet").setLevel(logging.ERROR)
+        model = Prophet(
+            interval_width=self._interval_width,
+            daily_seasonality=False,
+            weekly_seasonality=False,
+            yearly_seasonality=False,
+            seasonality_mode=self._seasonality_mode,
+        )
+        model.fit(train_df)
+
+        origin = pd.Timestamp(context.as_of)
+        future = model.make_future_dataframe(periods=max(task.horizons) + 15, freq="D")
+        forecast = model.predict(future).set_index("ds")
+
+        predictions: list[Prediction] = []
+        for h in task.horizons:
+            target_date = origin + pd.Timedelta(days=h)
+            snap = forecast.index[forecast.index >= target_date][0]
+            row = forecast.loc[snap]
+            yhat = float(row["yhat"])
+            z = float(scipy.stats.norm.ppf(0.5 + self._interval_width / 2))  # z-score matching interval_width
+            sigma = max((float(row["yhat_upper"]) - float(row["yhat_lower"])) / (2 * z), 1e-4)
+            quantiles = {q: float(scipy.stats.norm.ppf(q, loc=yhat, scale=sigma)) for q in STANDARD_QUANTILES}
+            predictions.append(
+                Prediction(
+                    predictor_id=self.predictor_id,
+                    task_id=task.task_id,
+                    issued_at=datetime.utcnow(),
+                    as_of=context.as_of,
+                    forecast_date=snap.to_pydatetime(),
+                    payload=ContinuousForecast(point_forecast=yhat, quantiles=quantiles),
+                )
+            )
+
+        return predictions
+
+
+def wti_series_to_price_df(data_service_series: pd.DataFrame) -> pd.DataFrame:
+    """Convert a DataService series (timestamp/value) to date-indexed price DataFrame."""
+    df = data_service_series.copy()
+    df["timestamp"] = pd.to_datetime(df["timestamp"])
+    df = df.set_index("timestamp").rename(columns={"value": "price"})
+    df.index = pd.DatetimeIndex([pd.Timestamp(str(d)[:10]) for d in df.index])
+    df.index.name = "date"
+    return df.sort_index()
+
+
+__all__ = [
+    "ProphetPredictor",
+    "check_shock_outcome",
+    "compute_rolling_forecasts",
+    "find_nearest_trading_day",
+    "load_prophet_trajectories",
+    "prophet_prob_shock",
+    "wti_series_to_price_df",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__specs__energy_oil_backtest.yaml.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__specs__energy_oil_backtest.yaml.md
new file mode 100644
index 0000000..d6929fc
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__specs__energy_oil_backtest.yaml.md
@@ -0,0 +1,36 @@
+# Source: implementations/energy_oil_forecasting/specs/energy_oil_backtest.yaml
+
+kind: yaml
+
+```yaml
+# Energy Oil Backtest Spec — 2025 Weekly Rolling Backtest
+#
+# Runs weekly origins across 2025. Stride is 5 business days (weekly).
+# Target is WTI Crude Oil price (yfinance ticker: CL=F).
+# Horizons: 5, 10, 21 business days.
+#
+# Origin count : 51 (weekly in 2025)
+# Warmup       : 250 trading days (~1 year) of historical prices
+
+spec_id: energy_oil_backtest
+
+description: >-
+  Weekly rolling backtest in 2025 for daily WTI crude oil price forecasting.
+  Evaluates trajectory forecasts (5, 10, 21 business days) with CRPS/MAE and
+  binary up-shock forecasts (climb > $5 in 5 business days) with Brier Score.
+  Used to select the top contender models.
+
+tasks:
+  - task_id: wti_oil_price_forecast
+    target_series_id: wti_crude_oil_price
+    horizons: [5, 10, 21]
+    frequency: B
+    description: >-
+      WTI Crude Oil continuous front-month futures Close price (yfinance symbol: CL=F),
+      projected 5, 10, and 21 trading days ahead.
+
+start: "2025-01-06"
+end: "2025-12-22"
+stride: 5
+warmup: 250
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__specs__energy_oil_eval.yaml.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__specs__energy_oil_eval.yaml.md
new file mode 100644
index 0000000..0c0a437
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__specs__energy_oil_eval.yaml.md
@@ -0,0 +1,33 @@
+# Source: implementations/energy_oil_forecasting/specs/energy_oil_eval.yaml
+
+kind: yaml
+
+```yaml
+# Energy Oil Eval Spec — 2026 Prospective Competition
+#
+# Runs on 8 weekly origins from Feb 2, 2026 to Mar 23, 2026.
+# Covers the high-volatility Persian Gulf geopolitical price shock period.
+# Target is WTI Crude Oil price (yfinance ticker: CL=F).
+# Horizons: 5, 10, 21 business days.
+
+spec_id: energy_oil_eval
+
+description: >-
+  Prospective/out-of-sample evaluation period in 2026 for daily WTI crude oil.
+  Evaluates selected contender models on 8 weekly origins during the early 2026
+  geopolitical price shock to measure adaptive real-time forecasting performance.
+
+tasks:
+  - task_id: wti_oil_price_forecast
+    target_series_id: wti_crude_oil_price
+    horizons: [5, 10, 21]
+    frequency: B
+    description: >-
+      WTI Crude Oil continuous front-month futures Close price (yfinance symbol: CL=F),
+      projected 5, 10, and 21 trading days ahead.
+
+start: "2026-02-02"
+end: "2026-03-23"
+stride: 5
+warmup: 250
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__specs__energy_oil_eval_smoke.yaml.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__specs__energy_oil_eval_smoke.yaml.md
new file mode 100644
index 0000000..8059722
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__specs__energy_oil_eval_smoke.yaml.md
@@ -0,0 +1,35 @@
+# Source: implementations/energy_oil_forecasting/specs/energy_oil_eval_smoke.yaml
+
+kind: yaml
+
+```yaml
+# Energy Oil Eval Smoke Spec — Fast CI/Testing Evaluation
+#
+# Two-origin subset of energy_oil_eval.yaml for running the 2026 protected
+# arena cheaply during development and end-to-end testing.
+# Use by setting SMOKE_TEST = True in the notebook setup cell.
+#
+# Origin count : 2 (vs. 8 in the full eval)
+# Warmup       : 250 trading days (~1 year) of historical prices
+
+spec_id: energy_oil_eval_smoke
+
+description: >-
+  Two-origin smoke evaluation for local and CI testing of the NB04 pipeline.
+  Uses the same tasks, horizons, warmup, and geopolitical period as
+  energy_oil_eval but with only 2 origins to keep cost negligible.
+
+tasks:
+  - task_id: wti_oil_price_forecast
+    target_series_id: wti_crude_oil_price
+    horizons: [5, 10, 21]
+    frequency: B
+    description: >-
+      WTI Crude Oil continuous front-month futures Close price (yfinance symbol: CL=F),
+      projected 5, 10, and 21 trading days ahead.
+
+start: "2026-02-02"
+end: "2026-02-09"
+stride: 5
+warmup: 250
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__specs__energy_oil_smoke.yaml.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__specs__energy_oil_smoke.yaml.md
new file mode 100644
index 0000000..48e1f11
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__specs__energy_oil_smoke.yaml.md
@@ -0,0 +1,36 @@
+# Source: implementations/energy_oil_forecasting/specs/energy_oil_smoke.yaml
+
+kind: yaml
+
+```yaml
+# Energy Oil Smoke Spec — Fast CI/Testing Backtest
+#
+# Two-origin subset of energy_oil_backtest.yaml for running the full
+# NB04 pipeline cheaply during development and end-to-end testing.
+# Use by setting SMOKE_TEST = True in the notebook setup cell.
+#
+# Origin count : 2 (vs. 51 in the full backtest)
+# Warmup       : 250 trading days (~1 year) of historical prices
+
+spec_id: energy_oil_smoke
+
+description: >-
+  Two-origin smoke backtest for local and CI testing of the NB04 pipeline.
+  Uses the same tasks, horizons, and warmup as energy_oil_backtest but with
+  only 2 weekly origins so the full notebook can be exercised without
+  burning tokens on 51 × 5 predictor evaluations.
+
+tasks:
+  - task_id: wti_oil_price_forecast
+    target_series_id: wti_crude_oil_price
+    horizons: [5, 10, 21]
+    frequency: B
+    description: >-
+      WTI Crude Oil continuous front-month futures Close price (yfinance symbol: CL=F),
+      projected 5, 10, and 21 trading days ahead.
+
+start: "2025-06-02"
+end: "2025-06-09"
+stride: 5
+warmup: 250
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__starter_agent____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__starter_agent____init__.py.md
new file mode 100644
index 0000000..2e5a2e8
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__starter_agent____init__.py.md
@@ -0,0 +1,22 @@
+# Source: implementations/energy_oil_forecasting/starter_agent/__init__.py
+
+kind: python
+
+```python
+"""WTI starter agent — a fresh, hackable template for your own exploration.
+
+Exports the toggle-driven :class:`AgentConfig` factory and the predictor
+convenience factory. See ``99_starter_agent.ipynb`` and ``agent.py``.
+"""
+
+from energy_oil_forecasting.starter_agent.agent import (
+    build_starter_agent_config,
+    build_starter_agent_predictor,
+)
+
+
+__all__ = [
+    "build_starter_agent_config",
+    "build_starter_agent_predictor",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__starter_agent__agent.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__starter_agent__agent.py.md
new file mode 100644
index 0000000..c707602
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__starter_agent__agent.py.md
@@ -0,0 +1,233 @@
+# Source: implementations/energy_oil_forecasting/starter_agent/agent.py
+
+kind: python
+
+```python
+"""WTI starter agent — a fresh, hackable template for your own exploration.
+
+This is **not** part of the notebook 01–06 curriculum. It is a clean starting
+point: the smallest agent that still has room to grow. It ships with our common
+building blocks wired behind simple toggles —
+
+- **optional news search** (``enable_search``, on by default) — bounded,
+  cutoff-aware Google Search through the Vector proxy;
+- **optional code execution** (``enable_code_exec``, off by default) — an E2B
+  Python sandbox;
+- **two lightweight skills** (:mod:`skills/`) that are *tool-usage playbooks*:
+  how to get good results out of search and code execution.
+
+Everything routes through the Vector proxy — no direct provider keys. See
+``planning-docs/vector-llm-proxy.md``.
+
+The prompt builder and output schema are reused from the
+:mod:`~energy_oil_forecasting.analyst_agent` module (they are just task
+serialisation — no need to duplicate them); the *agent identity* here is fresh
+and yours to edit. Pair this with ``99_starter_agent.ipynb``.
+
+Module-level ``__getattr__`` exposes ``root_agent`` lazily so ``adk web`` can
+load this module for interactive (schema-free) use.
+"""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+from typing import Any, Callable
+
+from aieng.forecasting.data.context import ForecastContext
+from aieng.forecasting.evaluation.task import ForecastingTask
+from aieng.forecasting.methods.agentic import (
+    AgentPredictor,
+    ContinuousAgentForecastOutput,
+    build_adk_agent,
+)
+from aieng.forecasting.methods.agentic.agent_factory import (
+    AgentConfig,
+    CodeExecutionConfig,
+    ContextRetrievalConfig,
+)
+from aieng.forecasting.models import LITE_MODEL
+
+# Reuse the existing WTI prompt builder + history compression — these serialise
+# the task/context into the agent's JSON payload and are not worth duplicating.
+from energy_oil_forecasting.analyst_agent import WtiPriceForecastPromptBuilder
+
+
+# Skills live next to this module.
+_SKILLS_ROOT = Path(__file__).parent / "skills"
+_FORECASTING_SKILL = _SKILLS_ROOT / "forecasting"
+_RESEARCH_SKILL = _SKILLS_ROOT / "research-playbook"
+_CODE_ANALYSIS_SKILL = _SKILLS_ROOT / "code-analysis-playbook"
+
+
+# ---------------------------------------------------------------------------
+# System prompt
+# ---------------------------------------------------------------------------
+
+
+def _build_starter_instruction() -> str:
+    """Build the task-agnostic, skill-agnostic starter persona.
+
+    Just the analyst's identity and how to behave — no output schema, no payload
+    contract, no skill or tool mechanics. ADK injects the name + description of
+    every attached skill (and every tool) into the system prompt, so the agent
+    already knows what it can load and call; repeating that here would only
+    duplicate dynamically-injected information. The forecasting *contract* lives
+    in the loadable ``forecasting`` skill. Edit the persona freely.
+    """
+    return (
+        "## Role\n\n"
+        "You are a WTI crude oil market analyst — fluent in supply/demand "
+        "fundamentals, OPEC+ policy, geopolitical and shipping-lane risk, and "
+        "price dynamics. This is a starter agent: keep your reasoning "
+        "transparent and your claims honest.\n\n"
+        "## How to respond\n\n"
+        "- For open-ended questions, scenario analysis, or anything "
+        "conversational, answer directly and concisely — do NOT ask for a JSON "
+        "payload.\n"
+        "- When you are handed a task that asks for a structured probabilistic "
+        "forecast, produce a calibrated one."
+    )
+
+
+_STARTER_INSTRUCTION = _build_starter_instruction()
+
+
+_CONTEXT_RETRIEVAL_INSTRUCTION = """\
+You are an oil-market intelligence specialist with web search.
+
+Return a concise structured markdown summary (3-5 paragraphs) covering, as the
+query warrants: WTI/Brent price level and trend; OPEC+ supply decisions;
+geopolitical risk in the Persian Gulf and key shipping lanes; US SPR / energy
+policy; notable supply-disruption signals; and published analyst price targets.
+
+Ground every claim in the search results you actually retrieve. When a cutoff
+date is specified, never report or speculate about events after it.\
+"""
+
+
+# ---------------------------------------------------------------------------
+# Config factory
+# ---------------------------------------------------------------------------
+
+
+def build_starter_agent_config(
+    model: str = LITE_MODEL,
+    search_model: str = LITE_MODEL,
+    *,
+    enable_search: bool = True,
+    enable_code_exec: bool = False,
+) -> AgentConfig:
+    """Build the WTI starter :class:`AgentConfig`.
+
+    Parameters
+    ----------
+    model : str
+        Model for the analyst agent (default: lite). Pass the advanced model
+        (``"gemini-3.5-flash"``) for higher-quality runs.
+    search_model : str
+        Model for the bounded web-search sub-tool.
+    enable_search : bool, default=True
+        Wire a cutoff-aware ``search_web`` tool and load the
+        ``research-playbook`` skill. Proxy-only — no extra API key.
+    enable_code_exec : bool, default=False
+        Wire an E2B Python sandbox and load the ``code-analysis-playbook``
+        skill. Needs ``E2B_API_KEY`` and is slower, so it is off by default —
+        flip it on to let the agent compute its own diagnostics.
+
+    Returns
+    -------
+    AgentConfig
+    """
+    # Every attached skill is loaded on demand: ADK injects each skill's name +
+    # description into the system prompt, and the agent reads the full SKILL.md
+    # only when relevant — so toggling a tool just adds its skill, no persona edits.
+    skills_dirs: list[Path] = [_FORECASTING_SKILL]
+    if enable_search:
+        skills_dirs.append(_RESEARCH_SKILL)
+    if enable_code_exec:
+        skills_dirs.append(_CODE_ANALYSIS_SKILL)
+
+    context_retrieval = (
+        ContextRetrievalConfig(
+            enabled=True,
+            instruction=_CONTEXT_RETRIEVAL_INSTRUCTION,
+            search_model=search_model,
+        )
+        if enable_search
+        else ContextRetrievalConfig()
+    )
+
+    return AgentConfig(
+        name="wti_starter_agent",
+        model=model,
+        instruction=_STARTER_INSTRUCTION,
+        # 16k headroom: enough for a complete run_code script + structured output.
+        max_output_tokens=16_384 if enable_code_exec else None,
+        context_retrieval=context_retrieval,
+        code_execution=CodeExecutionConfig(enabled=enable_code_exec),
+        skills_dirs=skills_dirs,
+    )
+
+
+# ---------------------------------------------------------------------------
+# Predictor convenience factory
+# ---------------------------------------------------------------------------
+
+
+class _StarterForecastPromptBuilder:
+    """Add the output schema + a forecast directive to a base builder's payload.
+
+    The exact JSON schema is generated at call time from the output class
+    (drift-free) and injected into the user payload — never into the system
+    prompt — so the agent stays conversational until it is actually asked to
+    forecast. Implements the
+    :class:`~aieng.forecasting.methods.agentic.predictor.ForecastPromptBuilder`
+    protocol structurally.
+    """
+
+    def __init__(self, inner: Callable[..., str], output_schema_json: str) -> None:
+        self._inner = inner
+        self._schema_json = output_schema_json
+
+    def __call__(self, *, task: ForecastingTask, context: ForecastContext) -> str:
+        payload = json.loads(self._inner(task=task, context=context))
+        payload["instructions"] = (
+            "Produce a calibrated probabilistic forecast for this task and return it by "
+            "calling `set_model_response` with a `json_response` string matching "
+            "`output_schema` exactly."
+        )
+        payload["output_schema"] = self._schema_json
+        return json.dumps(payload, indent=2)
+
+
+def build_starter_agent_predictor(config: AgentConfig) -> AgentPredictor:
+    """Wrap a starter :class:`AgentConfig` in an :class:`AgentPredictor`.
+
+    Reuses :class:`~energy_oil_forecasting.analyst_agent.WtiPriceForecastPromptBuilder`
+    for data serialisation, wrapped so the (drift-free) continuous output schema
+    and a forecast directive ride in the payload — keeping the schema out of the
+    persona. ``predict(task, context)`` returns one
+    :class:`~aieng.forecasting.evaluation.prediction.Prediction` per horizon.
+    """
+    return AgentPredictor(
+        agent_config=config,
+        prompt_builder=_StarterForecastPromptBuilder(
+            WtiPriceForecastPromptBuilder(),
+            ContinuousAgentForecastOutput.prompt_schema_json(),
+        ),
+        output_schema=ContinuousAgentForecastOutput,
+    )
+
+
+# ---------------------------------------------------------------------------
+# Lazy root_agent for `adk web` interactive use
+# ---------------------------------------------------------------------------
+
+
+def __getattr__(name: str) -> Any:
+    """Expose ``root_agent`` lazily for schema-free interactive use via ``adk web``."""
+    if name == "root_agent":
+        return build_adk_agent(build_starter_agent_config())
+    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__starter_agent__skills__code-analysis-playbook__SKILL.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__starter_agent__skills__code-analysis-playbook__SKILL.md.md
new file mode 100644
index 0000000..c543a1e
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__starter_agent__skills__code-analysis-playbook__SKILL.md.md
@@ -0,0 +1,56 @@
+# Source: implementations/energy_oil_forecasting/starter_agent/skills/code-analysis-playbook/SKILL.md
+
+kind: markdown
+
+---
+name: code-analysis-playbook
+description: >-
+  How to use the code execution sandbox well — parse the JSON payload (not
+  disk files), compute a couple of useful diagnostics before forecasting, and
+  keep the session stateful within a turn. Load this before writing code. No
+  scripts.
+---
+
+# Code-analysis playbook
+
+A short guide to using the `run_code` sandbox productively. This is a starter
+skill — extend it with the diagnostics that matter for your problem.
+
+## Where your data lives
+
+All data comes from the **JSON payload in your context** — there are no disk
+files and no network. The history arrives as a CSV *string* (e.g.
+`target_history_csv`). Parse it with `io.StringIO`, never as a file path:
+
+```python
+import io, pandas as pd
+df = pd.read_csv(io.StringIO(payload["target_history_csv"]))
+```
+
+The sandbox is **stateful within a turn**: parse once in your first code block,
+then reuse the DataFrame in later blocks instead of re-parsing.
+
+## Compute before you forecast
+
+Run a couple of cheap diagnostics so your forecast is grounded in arithmetic,
+not vibes:
+
+1. **Recent trend** — slope/return over the last N observations.
+2. **Volatility** — recent standard deviation of changes; it sets how wide your
+   quantile bands should be.
+3. **Sanity check** — does your point forecast sit within a plausible multiple
+   of recent moves? If not, revisit it.
+
+Use the printed numbers to set the point forecast and to *calibrate the spread*
+between your low and high quantiles — wider when recent volatility is high.
+
+## Domain focus (edit this for your use case)
+
+For WTI crude, daily moves are usually within a few percent; multi-day moves
+fan out roughly with the square root of the horizon. Let recent realised
+volatility, not a fixed guess, set your interval widths.
+
+## Room to grow
+
+- Add your own diagnostic patterns (regime detection, seasonality, covariates).
+- Drop reusable reference values into a `references/` file and `load_skill_resource` them.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__starter_agent__skills__forecasting__SKILL.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__starter_agent__skills__forecasting__SKILL.md.md
new file mode 100644
index 0000000..220d623
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__starter_agent__skills__forecasting__SKILL.md.md
@@ -0,0 +1,57 @@
+# Source: implementations/energy_oil_forecasting/starter_agent/skills/forecasting/SKILL.md
+
+kind: markdown
+
+---
+name: forecasting
+description: >-
+  The output contract for producing a structured probabilistic forecast — the
+  JSON shape, the calibration and quantile rules, and how to submit it. Load
+  this ONLY when your task payload asks for a forecast; ignore it for
+  open-ended questions. No scripts.
+---
+
+# Forecasting skill
+
+Load this when your task payload asks for a structured forecast. For open-ended
+questions, ignore it and just answer.
+
+## What you'll receive
+
+A JSON payload describing the task: a `task` id, the `as_of` cutoff date,
+`horizons` (steps ahead), the `standard_quantiles` grid, a `target_summary`, the
+recent `target_history_csv`, and an `output_schema` showing the exact JSON to
+return.
+
+## The output contract
+
+1. Produce **one forecast per horizon** in `horizons`.
+2. Use **exactly** the levels in `standard_quantiles` — no additions or omissions.
+3. `point_forecast` must equal the **0.50 quantile** value.
+4. Quantile values must be **non-decreasing** as the quantile level rises.
+5. Use ONLY information available on or before `as_of`.
+6. Put your reasoning in the `rationale` fields.
+
+Submit by calling `set_model_response` with a `json_response` string that
+matches the payload's `output_schema` **exactly** — use `"horizon"` (an
+integer), and make `"quantiles"` a **list** of `{"quantile": <level>, "value":
+<number>}` objects. Omit any field not shown in the schema.
+
+## Calibration
+
+Report calibrated intervals, not false precision: across many forecasts, the
+truth should land inside your stated 80% band about 80% of the time.
+Anchor the point on the recent level and trend; let recent **volatility** set
+how wide the bands are, and widen them as the horizon grows.
+
+## Domain focus (edit this for your use case)
+
+For WTI crude, anchor on the last close and recent daily moves (usually a few
+percent); multi-day uncertainty fans out roughly with the square root of the
+horizon. Adjust the point for OPEC+/supply or geopolitical signals you have
+real evidence for — and say so in the rationale.
+
+## Room to grow
+
+- Tighten the calibration guidance with your own backtest findings.
+- Add worked examples of good vs. over-confident forecasts.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__starter_agent__skills__research-playbook__SKILL.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__starter_agent__skills__research-playbook__SKILL.md.md
new file mode 100644
index 0000000..a696993
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__starter_agent__skills__research-playbook__SKILL.md.md
@@ -0,0 +1,45 @@
+# Source: implementations/energy_oil_forecasting/starter_agent/skills/research-playbook/SKILL.md
+
+kind: markdown
+
+---
+name: research-playbook
+description: >-
+  How to use the search_web tool well when grounding a forecast in recent
+  news — phrase cutoff-aware queries, decide what is worth searching for, and
+  weigh sources. Load this before your first search_web call. No scripts.
+---
+
+# Research playbook
+
+A short guide to getting real signal out of `search_web`. This is a starter
+skill — extend it with the queries and sources that work for your problem.
+
+## The one rule that matters
+
+Always pass `cutoff_date` equal to the `as_of` date in your payload. It is the
+temporal fence that keeps post-origin information out of a historical forecast.
+A forecast that "knew" what happened after `as_of` is not a forecast.
+
+## How to search
+
+- **Search before you forecast, not after.** Gather context first, then reason.
+- **One topic per query.** Several focused queries beat one broad one. Stop when
+  new queries stop returning new facts.
+- **Ask for the present state, not a prediction.** "current OPEC+ production
+  policy" returns facts; "will oil go up" returns noise.
+- **Weigh sources.** Prefer primary releases and major outlets; treat a single
+  blog or forum post as a lead to confirm, not a fact.
+
+## Domain focus (edit this for your use case)
+
+For WTI crude, the signals that move price are OPEC+ supply decisions, Persian
+Gulf / shipping-lane geopolitical risk, US SPR and inventory data, USD strength,
+and global demand outlook. Search for the *current state* of these, then let the
+price history tell you the baseline.
+
+## Room to grow
+
+- Add a curated list of go-to sources for your domain.
+- Track which queries paid off and prune the ones that didn't.
+- Add a `references/` file with example high-signal searches.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__tasks.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__tasks.py.md
new file mode 100644
index 0000000..f51bbc9
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__tasks.py.md
@@ -0,0 +1,251 @@
+# Source: implementations/energy_oil_forecasting/tasks.py
+
+kind: python
+
+```python
+"""Task specifications and agent predictor wiring for the WTI experiment.
+
+Implements the "one agent, three tasks" pattern: a single :class:`AgentConfig`
+identity with task-specific prompt builders and output schemas supplied via
+:class:`~aieng.forecasting.methods.agentic.predictor.AgentPredictor`.
+"""
+
+from __future__ import annotations
+
+import json
+from datetime import datetime
+from typing import Any, ClassVar, Literal
+
+import pandas as pd
+from aieng.forecasting.data.context import ForecastContext
+from aieng.forecasting.evaluation.prediction import BinaryForecast, Prediction
+from aieng.forecasting.evaluation.task import ForecastingTask
+from aieng.forecasting.methods.agentic import (
+    AgentPredictor,
+    ContinuousAgentForecastOutput,
+    DiscreteAgentForecastOutput,
+)
+from aieng.forecasting.methods.agentic.agent_factory import AgentConfig
+from aieng.forecasting.methods.agentic.outputs import AgentForecastOutput
+from aieng.forecasting.models import LITE_MODEL
+from energy_oil_forecasting.analyst_agent import (
+    WtiPriceForecastPromptBuilder,
+    build_wti_multitask_news_config,
+    build_wti_news_config,
+    compress_history,
+)
+from energy_oil_forecasting.paths import SHOCK_HORIZON, SHOCK_THRESHOLD
+from pydantic import BaseModel, Field
+
+
+# ── Task specification strings (embedded in user prompts for NB3) ───────────
+# Each spec uses the corresponding output class's prompt_schema_json() so the
+# required JSON format in the prompt is always in sync with the Pydantic schema.
+
+TASK_TRAJECTORY_SPEC = (
+    "Forecast the WTI crude oil price at the horizons listed in the payload.\n\n"
+    "If a `set_model_response` tool is available, call it with your complete "
+    "JSON as `json_response`. Otherwise return the JSON directly as plain text.\n\n"
+    "Required JSON format:\n" + ContinuousAgentForecastOutput.prompt_schema_json()
+)
+
+TaskKind = Literal["trajectory", "shock", "scenario"]
+
+
+class WtiMultitaskPromptBuilder(BaseModel):
+    """Prompt builder for task-spec-driven agent calls (NB3)."""
+
+    task_spec: str
+
+    model_config = {"extra": "forbid"}
+
+    def __call__(self, *, task: ForecastingTask, context: ForecastContext) -> str:
+        df = context.get_series(task.target_series_id)
+        last_row = df.iloc[-1]
+        payload: dict[str, Any] = {
+            "task": task.task_id,
+            "task_spec": self.task_spec,
+            "as_of": str(context.as_of)[:10],
+            "origin_price_usd_bbl": float(last_row["value"]),
+            "target_history_csv": compress_history(df),
+        }
+        return json.dumps(payload, indent=2)
+
+
+class ScenarioCard(BaseModel):
+    """One scenario card from Task C agent output."""
+
+    model_config = {"extra": "ignore"}
+
+    name: str
+    description: str
+    probability: float = Field(ge=0.0, le=1.0)
+    wti_range_60d: list[float]
+    point_estimate_60d: float
+    key_drivers: list[str] = Field(default_factory=list)
+
+
+class ScenarioAgentForecastOutput(AgentForecastOutput):
+    """Track 2 scenario analysis output for the energy case study."""
+
+    modality: ClassVar[Literal["continuous", "discrete"]] = "discrete"
+
+    model_config = {"extra": "ignore"}
+
+    scenarios: list[ScenarioCard]
+    base_case: str
+    reasoning: str = ""
+
+    @classmethod
+    def prompt_schema_json(cls) -> str:
+        """Return a JSON template for use in agent instruction strings.
+
+        Returns
+        -------
+        str
+            Indented JSON string showing the exact structure the agent must
+            pass to ``set_model_response``.
+        """
+        template: dict[str, object] = {
+            "scenarios": [
+                {
+                    "name": "<string>",
+                    "description": "<string>",
+                    "probability": "<float in [0, 1]>",
+                    "wti_range_60d": ["<float_low>", "<float_high>"],
+                    "point_estimate_60d": "<float>",
+                    "key_drivers": ["<driver 1>", "<driver 2>"],
+                }
+            ],
+            "base_case": "<scenario name>",
+            "reasoning": "<paragraph>",
+        }
+        return json.dumps(template, indent=2)
+
+    def to_predictions(
+        self,
+        *,
+        task: ForecastingTask,
+        context: ForecastContext,
+        predictor_id: str,
+        metadata: dict[str, Any] | None = None,
+    ) -> list[Prediction]:
+        """Convert scenario output to a metadata-rich prediction (Track 2 display)."""
+        if len(task.horizons) != 1:
+            raise ValueError("Scenario agent output expects exactly one task horizon.")
+
+        horizon = task.horizons[0]
+        issued_at = datetime.utcnow()
+        offset = pd.tseries.frequencies.to_offset(task.frequency)
+        base_prob = float(sum(s.probability for s in self.scenarios))
+        prediction_metadata: dict[str, Any] = dict(metadata) if metadata is not None else {}
+        prediction_metadata["scenarios"] = [s.model_dump() for s in self.scenarios]
+        prediction_metadata["base_case"] = self.base_case
+        if self.reasoning.strip():
+            prediction_metadata["rationale"] = self.reasoning
+
+        return [
+            Prediction(
+                predictor_id=predictor_id,
+                task_id=task.task_id,
+                issued_at=issued_at,
+                as_of=context.as_of,
+                forecast_date=(pd.Timestamp(context.as_of) + offset * horizon).to_pydatetime(),
+                payload=BinaryForecast(probability=min(base_prob, 1.0)),
+                metadata=prediction_metadata,
+            )
+        ]
+
+
+# Task specification strings embedded in user prompts for NB3.
+# Defined after the output classes so each spec can reference the
+# corresponding prompt_schema_json() classmethod — single source of truth.
+
+TASK_SHOCK_SPEC = (
+    f"Estimate P(up) — the probability that WTI will close MORE THAN\n"
+    f"${int(SHOCK_THRESHOLD)}/bbl HIGHER than today's price at the end of\n"
+    f"{SHOCK_HORIZON} trading days.\n\n"
+    "If a `set_model_response` tool is available, call it with your complete "
+    "JSON as `json_response`. Otherwise return the JSON directly as plain text.\n\n"
+    "Required JSON format:\n" + DiscreteAgentForecastOutput.prompt_schema_json()
+)
+
+TASK_SCENARIOS_SPEC = (
+    "Identify the three scenarios oil market analysts are debating for WTI "
+    "over the next 60 days.\n\n"
+    "If a `set_model_response` tool is available, call it with your complete "
+    "JSON as `json_response`. Otherwise return the JSON directly as plain text.\n\n"
+    "Required JSON format:\n" + ScenarioAgentForecastOutput.prompt_schema_json()
+)
+
+TASK_SPECS: dict[TaskKind, str] = {
+    "trajectory": TASK_TRAJECTORY_SPEC,
+    "shock": TASK_SHOCK_SPEC,
+    "scenario": TASK_SCENARIOS_SPEC,
+}
+
+
+TASK_OUTPUT_SCHEMAS: dict[TaskKind, type[AgentForecastOutput]] = {
+    "trajectory": ContinuousAgentForecastOutput,
+    "shock": DiscreteAgentForecastOutput,
+    "scenario": ScenarioAgentForecastOutput,
+}
+
+
+def build_wti_news_predictor(
+    task: TaskKind,
+    model: str = LITE_MODEL,
+) -> AgentPredictor:
+    """Build a news-grounded agent predictor for the given task kind.
+
+    Parameters
+    ----------
+    task : TaskKind
+        One of ``"trajectory"``, ``"shock"``, or ``"scenario"``.
+    model : str
+        Model identifier passed through to the underlying
+        :class:`~aieng.forecasting.methods.agentic.agent_factory.AgentConfig`.
+        Defaults to the lite model (``"gemini-3.1-flash-lite-preview"``); pass the
+        advanced model (``"gemini-3.5-flash"``) when more capability is needed.
+    """
+    if task == "trajectory":
+        return AgentPredictor(
+            agent_config=build_wti_news_config(model=model),
+            prompt_builder=WtiPriceForecastPromptBuilder(),
+            output_schema=ContinuousAgentForecastOutput,
+        )
+    return AgentPredictor(
+        agent_config=build_wti_multitask_news_config(model=model),
+        prompt_builder=WtiMultitaskPromptBuilder(task_spec=TASK_SPECS[task]),
+        output_schema=TASK_OUTPUT_SCHEMAS[task],
+    )
+
+
+def build_wti_agent_predictor_for_task(config: AgentConfig, task: TaskKind) -> AgentPredictor:
+    """Wire any WTI agent config to a task-specific predictor."""
+    if task == "trajectory":
+        return AgentPredictor(
+            agent_config=config,
+            prompt_builder=WtiPriceForecastPromptBuilder(),
+            output_schema=ContinuousAgentForecastOutput,
+        )
+    return AgentPredictor(
+        agent_config=config,
+        prompt_builder=WtiMultitaskPromptBuilder(task_spec=TASK_SPECS[task]),
+        output_schema=TASK_OUTPUT_SCHEMAS[task],
+    )
+
+
+__all__ = [
+    "TASK_SCENARIOS_SPEC",
+    "TASK_SHOCK_SPEC",
+    "TASK_SPECS",
+    "TASK_TRAJECTORY_SPEC",
+    "ScenarioAgentForecastOutput",
+    "ScenarioCard",
+    "TaskKind",
+    "WtiMultitaskPromptBuilder",
+    "build_wti_agent_predictor_for_task",
+    "build_wti_news_predictor",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__viz.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__viz.py.md
new file mode 100644
index 0000000..e869367
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__energy_oil_forecasting__viz.py.md
@@ -0,0 +1,1223 @@
+# Source: implementations/energy_oil_forecasting/viz.py
+
+kind: python
+
+```python
+"""Plotly visualisation helpers for the WTI crude oil experiment.
+
+Keeps notebooks narrative-focused by centralising Plotly chart builders and
+HTML display helpers from the original playground case study.
+"""
+
+from __future__ import annotations
+
+import datetime
+from pathlib import Path
+from typing import Any
+
+import numpy as np
+import pandas as pd
+import plotly.graph_objects as go
+import plotly.subplots as psp
+import yfinance as yf
+from energy_oil_forecasting.paths import (
+    CLR_ACTUAL,
+    CLR_AGENT,
+    CLR_CI_CURR_FILL,
+    CLR_CI_PAST_FILL,
+    CLR_DAY_LINE,
+    CLR_HISTORY,
+    CLR_HIT,
+    CLR_MISS,
+    CLR_PROPHET,
+    IRAN_COLOR,
+    SIMULATION_START,
+    WARN_COLOR,
+)
+
+
+_DAY_MS = 24 * 3600 * 1000
+
+
+def build_forecast_animation(  # noqa: PLR0915
+    price_df: pd.DataFrame,
+    forecasts_df: pd.DataFrame,
+    *,
+    simulation_start: pd.Timestamp = SIMULATION_START,
+) -> go.Figure:
+    """Build a Plotly animation showing accumulated 95% CI bars at resolution dates."""
+    # Show from Jan 2024 as context; simulation begins Jan 2025
+    pre_sim = price_df.loc["2024-01-01":"2024-12-31"]
+    price_series = price_df["price"]
+    y_min = float(price_series.min()) * 0.92
+    y_max = float(price_series.max()) * 1.08
+
+    sim_days: list[pd.Timestamp] = sorted(forecasts_df["sim_day"].unique().tolist())
+
+    # Canonical bar half-width — fixed across all frames to prevent size jumps
+    all_res_dates = sorted(forecasts_df["resolution_date"].dropna().unique().tolist())
+    spacings = [all_res_dates[i + 1] - all_res_dates[i] for i in range(len(all_res_dates) - 1)]
+    canonical_half_w: pd.Timedelta = pd.Series(spacings).median() / 2
+
+    def _bars(rows: pd.DataFrame) -> tuple[list[object], list[object]]:
+        """Fixed-width CI bar polygons separated by None for a single scatter trace."""
+        all_x: list[object] = []
+        all_y: list[object] = []
+        for _, fc in rows.sort_values("resolution_date").iterrows():
+            d = fc["resolution_date"]
+            lo = float(fc["yhat_lower"])
+            hi = float(fc["yhat_upper"])
+            if all_x:
+                all_x.append(None)
+                all_y.append(None)
+            all_x.extend(
+                [
+                    d - canonical_half_w,
+                    d + canonical_half_w,
+                    d + canonical_half_w,
+                    d - canonical_half_w,
+                    d - canonical_half_w,
+                ]
+            )
+            all_y.extend([lo, lo, hi, hi, lo])
+        return all_x, all_y
+
+    # ── Base traces ────────────────────────────────────────────────────────────
+    # Trace 0: 2024 price history — unlabelled context, never animated
+    # Traces 1–6: updated each frame
+    base_data: list[go.BaseTraceType] = [
+        go.Scatter(  # 0 — 2024 history (context only, no legend entry)
+            x=pre_sim.index,
+            y=pre_sim["price"],
+            mode="lines",
+            line={"color": CLR_HISTORY, "width": 1.5},
+            showlegend=False,
+            hoverinfo="skip",
+        ),
+        go.Scatter(  # 1 — realized price
+            x=[],
+            y=[],
+            mode="lines",
+            line={"color": CLR_ACTUAL, "width": 2.5},
+            name="WTI Realized Price",
+            showlegend=True,
+        ),
+        go.Scatter(  # 2 — past resolved CI bars
+            x=[],
+            y=[],
+            mode="lines",
+            fill="toself",
+            fillcolor=CLR_CI_PAST_FILL,
+            line={"width": 0},
+            name="95% CI Forecast",
+            showlegend=True,
+            hoverinfo="skip",
+        ),
+        go.Scatter(  # 3 — current (leading) forecast bar, darker shade
+            x=[],
+            y=[],
+            mode="lines",
+            fill="toself",
+            fillcolor=CLR_CI_CURR_FILL,
+            line={"width": 0},
+            showlegend=False,  # same concept as trace 2, shade speaks for itself
+            hoverinfo="skip",
+        ),
+        go.Scatter(  # 4 — current day marker
+            x=[],
+            y=[],
+            mode="lines",
+            line={"color": CLR_DAY_LINE, "width": 1, "dash": "dot"},
+            showlegend=False,
+            hoverinfo="skip",
+        ),
+        go.Scatter(  # 5 — hits
+            x=[],
+            y=[],
+            mode="markers",
+            marker={"color": CLR_HIT, "size": 8, "symbol": "circle", "opacity": 0.9},
+            name="Inside CI",
+            showlegend=True,
+        ),
+        go.Scatter(  # 6 — misses
+            x=[],
+            y=[],
+            mode="markers",
+            marker={"color": CLR_MISS, "size": 9, "symbol": "x", "opacity": 0.9},
+            name="Outside CI",
+            showlegend=True,
+        ),
+    ]
+
+    fig = go.Figure(data=base_data)
+
+    # Iran war annotation — defined here so it can be embedded in every frame.
+    # Plotly replaces layout.annotations entirely when a frame is applied, so we
+    # must include this explicitly in each frame's annotations list.
+    iran_date = pd.Timestamp("2026-03-01")
+    iran_annotation: dict[str, object] = {
+        "font": {"color": "#d62728", "size": 14},
+        "showarrow": False,
+        "text": "US–Iran war begins →",
+        "x": iran_date.timestamp() * 1000,
+        "xanchor": "right",
+        "xref": "x",
+        "y": 0.62,
+        "yanchor": "middle",
+        "yref": "paper",
+    }
+
+    # ── Build frames ──────────────────────────────────────────────────────────
+    frames: list[go.Frame] = []
+
+    for sim_day in sim_days:
+        revealed = price_df.loc[simulation_start:sim_day]
+
+        fc_rows = forecasts_df[forecasts_df["sim_day"] == sim_day]
+        if fc_rows.empty:
+            continue
+
+        # Bars painted at prediction time: show all forecasts made before today
+        past_fc = forecasts_df[forecasts_df["sim_day"] < sim_day]
+        past_xs, past_ys = _bars(past_fc)
+        # Today's forecast is the fresh leading bar (darker shade)
+        curr_xs, curr_ys = _bars(fc_rows)
+
+        # Markers appear only when the actual price is known (resolution date passed)
+        resolved = forecasts_df[
+            (forecasts_df["sim_day"] <= sim_day) & (forecasts_df["resolution_date"] <= sim_day)
+        ].dropna(subset=["actual_price"])
+        hits = resolved[resolved["inside_ci"]]
+        misses = resolved[~resolved["inside_ci"]]
+
+        n_total = len(resolved)
+        n_hits = len(hits)
+        pct = (n_hits / n_total * 100) if n_total > 0 else 0.0
+
+        scorecard = f"<b>{sim_day.strftime('%b %d, %Y')}</b><br>Resolved: {n_total}  |  In CI: {n_hits} ({pct:.0f}%)"
+
+        frame = go.Frame(
+            data=[
+                go.Scatter(x=revealed.index.tolist(), y=revealed["price"].tolist()),
+                go.Scatter(x=past_xs, y=past_ys),
+                go.Scatter(x=curr_xs, y=curr_ys),
+                go.Scatter(x=[sim_day, sim_day], y=[y_min, y_max]),
+                go.Scatter(
+                    x=hits["resolution_date"].tolist(),
+                    y=hits["actual_price"].tolist(),
+                ),
+                go.Scatter(
+                    x=misses["resolution_date"].tolist(),
+                    y=misses["actual_price"].tolist(),
+                ),
+            ],
+            layout=go.Layout(
+                annotations=[
+                    iran_annotation,
+                    {
+                        "x": 0.99,
+                        "y": 0.97,
+                        "xref": "paper",
+                        "yref": "paper",
+                        "xanchor": "right",
+                        "yanchor": "top",
+                        "text": scorecard,
+                        "showarrow": False,
+                        "font": {"size": 16, "family": "monospace"},
+                        "bgcolor": "rgba(255,255,255,0.88)",
+                        "bordercolor": "#cccccc",
+                        "borderwidth": 1,
+                        "borderpad": 8,
+                        "align": "right",
+                    },
+                ]
+            ),
+            traces=[1, 2, 3, 4, 5, 6],
+            name=str(sim_day.date()),
+        )
+        frames.append(frame)
+
+    # ── Tail frames: resolutions roll in after the last sim_day ─────────────
+    # After the last prediction is made, older forecasts keep resolving through
+    # price_df.index.max(). Add one frame per trailing trading day so the
+    # animation shows all April resolutions without stopping at the last sim_day.
+    last_sim_day = sim_days[-1]
+    all_past_xs, all_past_ys = _bars(forecasts_df)  # all 314 bars, static from here on
+    tail_days = price_df.loc[last_sim_day + pd.Timedelta(days=1) :].index.tolist()
+
+    for tail_day in tail_days:
+        revealed = price_df.loc[simulation_start:tail_day]
+
+        resolved = forecasts_df[forecasts_df["resolution_date"] <= tail_day].dropna(subset=["actual_price"])
+        hits = resolved[resolved["inside_ci"]]
+        misses = resolved[~resolved["inside_ci"]]
+
+        n_total = len(resolved)
+        n_hits = len(hits)
+        pct = (n_hits / n_total * 100) if n_total > 0 else 0.0
+
+        scorecard = f"<b>{tail_day.strftime('%b %d, %Y')}</b><br>Resolved: {n_total}  |  In CI: {n_hits} ({pct:.0f}%)"
+
+        frame = go.Frame(
+            data=[
+                go.Scatter(x=revealed.index.tolist(), y=revealed["price"].tolist()),
+                go.Scatter(x=all_past_xs, y=all_past_ys),
+                go.Scatter(x=[], y=[]),  # no new leading bar
+                go.Scatter(x=[tail_day, tail_day], y=[y_min, y_max]),
+                go.Scatter(
+                    x=hits["resolution_date"].tolist(),
+                    y=hits["actual_price"].tolist(),
+                ),
+                go.Scatter(
+                    x=misses["resolution_date"].tolist(),
+                    y=misses["actual_price"].tolist(),
+                ),
+            ],
+            layout=go.Layout(
+                annotations=[
+                    iran_annotation,
+                    {
+                        "x": 0.99,
+                        "y": 0.97,
+                        "xref": "paper",
+                        "yref": "paper",
+                        "xanchor": "right",
+                        "yanchor": "top",
+                        "text": scorecard,
+                        "showarrow": False,
+                        "font": {"size": 16, "family": "monospace"},
+                        "bgcolor": "rgba(255,255,255,0.88)",
+                        "bordercolor": "#cccccc",
+                        "borderwidth": 1,
+                        "borderpad": 8,
+                        "align": "right",
+                    },
+                ]
+            ),
+            traces=[1, 2, 3, 4, 5, 6],
+            name=str(tail_day.date()),
+        )
+        frames.append(frame)
+
+    fig.frames = frames
+
+    # ── Slider steps ──────────────────────────────────────────────────────────
+    slider_steps = [
+        {
+            "args": [[f.name], {"frame": {"duration": 80, "redraw": True}, "mode": "immediate"}],
+            "label": f.name if i % 20 == 0 else "",
+            "method": "animate",
+        }
+        for i, f in enumerate(frames)
+    ]
+
+    # ── Play / Pause ──────────────────────────────────────────────────────────
+    updatemenus = [
+        {
+            "type": "buttons",
+            "showactive": False,
+            "direction": "right",
+            "x": 0.0,
+            "xanchor": "left",
+            "y": -0.06,
+            "yanchor": "top",
+            "pad": {"r": 4, "t": 0},
+            "buttons": [
+                {
+                    "label": "▶  Play",
+                    "method": "animate",
+                    "args": [
+                        None,
+                        {
+                            "frame": {"duration": 80, "redraw": True},
+                            "fromcurrent": True,
+                            "transition": {"duration": 0},
+                        },
+                    ],
+                },
+                {
+                    "label": "⏸  Pause",
+                    "method": "animate",
+                    "args": [
+                        [None],
+                        {
+                            "frame": {"duration": 0, "redraw": False},
+                            "mode": "immediate",
+                            "transition": {"duration": 0},
+                        },
+                    ],
+                },
+            ],
+        }
+    ]
+
+    x_end = (forecasts_df["resolution_date"].max() + pd.Timedelta(days=10)).strftime("%Y-%m-%d")
+
+    # US–Iran war start: vertical line. The annotation added here covers only the
+    # initial state; each frame replaces layout.annotations with the copy above.
+    fig.add_vline(
+        x=iran_date.timestamp() * 1000,
+        line={"color": "#d62728", "width": 2.5, "dash": "dash"},
+        annotation_text="US–Iran war begins →",
+        annotation_position="top left",
+        annotation_font={"size": 14, "color": "#d62728"},
+    )
+
+    fig.update_layout(
+        title={
+            "text": "WTI Crude Oil — 30-Day Forecast w/ Prophet",
+            "font": {"size": 24, "color": "#1a1a1a"},
+            "x": 0.0,
+            "xanchor": "left",
+        },
+        xaxis={
+            "range": ["2024-01-01", x_end],
+            "showgrid": True,
+            "gridcolor": "#f0f0f0",
+            "tickfont": {"size": 13},
+            "title": {"text": "Date", "font": {"size": 15, "color": "#333333"}},
+        },
+        yaxis={
+            "title": {"text": "USD / bbl", "font": {"size": 15, "color": "#333333"}},
+            "range": [y_min, y_max],
+            "showgrid": True,
+            "gridcolor": "#f0f0f0",
+            "tickfont": {"size": 13},
+        },
+        legend={
+            "x": 0.01,
+            "y": 0.99,
+            "xanchor": "left",
+            "yanchor": "top",
+            "orientation": "v",
+            "bgcolor": "rgba(255,255,255,0.85)",
+            "bordercolor": "#dddddd",
+            "borderwidth": 1,
+            "font": {"size": 13},
+            "itemsizing": "constant",
+            "tracegroupgap": 2,
+        },
+        template="plotly_white",
+        width=900,
+        height=600,
+        margin={"t": 70, "b": 90, "l": 70, "r": 30},
+        updatemenus=updatemenus,
+        sliders=[
+            {
+                "active": 0,
+                "steps": slider_steps,
+                "x": 0.18,
+                "len": 0.82,
+                "y": -0.06,
+                "currentvalue": {
+                    "prefix": "",
+                    "visible": True,
+                    "xanchor": "center",
+                    "font": {"size": 13, "color": "#555555"},
+                },
+                "transition": {"duration": 0},
+                "pad": {"t": 24, "b": 6},
+            }
+        ],
+    )
+
+    return fig
+
+
+def make_context_chart(price_df: pd.DataFrame) -> go.Figure:
+    """Annotated WTI price chart: what a well-informed agent could have seen."""
+    context = price_df.loc["2024-09-01":"2024-12-31"]
+    sim_era = price_df.loc["2025-01-01":]
+
+    iran_color = IRAN_COLOR
+    warn_color = WARN_COLOR  # amber
+    title_font = {"size": 22, "color": "#1a1a1a"}
+    axis_font = {"size": 14, "color": "#333333"}
+    tick_font = {"size": 13}
+
+    # Numbered warning signals — all publicly available before the war
+    warn_events: list[tuple[str, str]] = [
+        ("2024-12-06", "①"),
+        ("2025-03-19", "②"),
+        ("2025-06-21", "③"),
+        ("2025-12-28", "④"),
+        ("2026-01-26", "⑤"),
+        ("2026-02-21", "⑥"),
+    ]
+
+    # Event key rendered below the chart
+    event_key: list[tuple[str, str, str]] = [
+        ("①", "Dec 6, 2024", "IAEA: Iran begins producing 60% HEU at 7× previous rate at Fordow"),
+        ("②", "Mar 19, 2025", "Trump sends nuclear ultimatum to Iran — 2-month deadline for deal"),
+        ("③", "Jun 21, 2025", "Operation Midnight Hammer — US B-2 bombers strike Fordow, Natanz, Isfahan"),
+        ("④", "Dec 28, 2025", "Iran mass protests erupt across all 31 provinces; rial hits record low"),
+        ("⑤", "Jan 26, 2026", "USS Abraham Lincoln carrier strike group enters CENTCOM area of responsibility"),
+        ("⑥", "Feb 21, 2026", "Oil traders rush to hedge risk; Brent up 18% since year-end 2025 (Bloomberg)"),
+    ]
+
+    fig = go.Figure()
+
+    # Muted Sep–Dec 2024 context trace
+    fig.add_trace(
+        go.Scatter(
+            x=context.index,
+            y=context["price"],
+            mode="lines",
+            line={"color": CLR_HISTORY, "width": 2},
+            name="WTI Price (Sep–Dec 2024 context)",
+            opacity=0.6,
+        )
+    )
+
+    # Realized price 2025–present
+    fig.add_trace(
+        go.Scatter(
+            x=sim_era.index,
+            y=sim_era["price"],
+            mode="lines",
+            line={"color": CLR_ACTUAL, "width": 2.5},
+            name="WTI Realized Price (2025–present)",
+        )
+    )
+
+    # Light simulation-window shading
+    fig.add_vrect(
+        x0="2025-01-01",
+        x1=price_df.index.max().strftime("%Y-%m-%d"),
+        fillcolor="rgba(33,113,181,0.04)",
+        layer="below",
+        line_width=0,
+    )
+
+    # Simulation-start divider
+    fig.add_vline(
+        x=pd.Timestamp("2025-01-01").timestamp() * 1000,
+        line={"color": "#999999", "width": 1.2, "dash": "dot"},
+        annotation_text="Simulation begins →",
+        annotation_position="top left",
+        annotation_font={"size": 11, "color": "#666666"},
+    )
+
+    # US–Iran war line
+    fig.add_vline(
+        x=pd.Timestamp("2026-03-01").timestamp() * 1000,
+        line={"color": iran_color, "width": 2.5, "dash": "dash"},
+        annotation_text="US–Iran war begins →",
+        annotation_position="top left",
+        annotation_font={"size": 12, "color": iran_color},
+    )
+
+    # Numbered callout badges at each warning event
+    for date_str, badge in warn_events:
+        ts = pd.Timestamp(date_str)
+        nearby = price_df.index[price_df.index >= ts]
+        if len(nearby) == 0:
+            continue
+        ts = nearby[0]
+        price_val = float(price_df.loc[ts, "price"])
+        fig.add_annotation(
+            x=ts,
+            y=price_val,
+            text=f"<b>{badge}</b>",
+            showarrow=True,
+            arrowhead=2,
+            arrowsize=0.9,
+            arrowwidth=1.5,
+            arrowcolor=warn_color,
+            ax=0,
+            ay=-30,
+            font={"size": 12, "color": "white"},
+            bgcolor=warn_color,
+            bordercolor=warn_color,
+            borderwidth=1,
+            borderpad=5,
+            xanchor="center",
+        )
+
+    # Event key: table-style annotation below the chart.
+    # Plotly HTML doesn't support <span style=...> reliably, so the key uses
+    # plain bold text and inherits its colour from the annotation font.
+    key_plain = ["<b>Event key — publicly available signals leading up to the war</b>"]
+    for badge, date, desc in event_key:
+        key_plain.append(f"<b>{badge}  {date}</b>  {desc}")
+    key_text = "<br>".join(key_plain)
+
+    fig.add_annotation(
+        x=0.0,
+        y=-0.19,
+        xref="paper",
+        yref="paper",
+        text=key_text,
+        showarrow=False,
+        xanchor="left",
+        yanchor="top",
+        font={"size": 11, "color": "#444444"},
+        align="left",
+        bgcolor="rgba(250,250,250,0.97)",
+        bordercolor="#dddddd",
+        borderwidth=1,
+        borderpad=10,
+    )
+
+    x_min = "2024-09-01"
+    x_max = (price_df.index.max() + pd.Timedelta(days=20)).strftime("%Y-%m-%d")
+    y_min = float(price_df.loc["2024-09-01":, "price"].min()) * 0.90
+    y_max = float(price_df.loc["2024-09-01":, "price"].max()) * 1.08
+
+    fig.update_layout(
+        title={
+            "text": "WTI Crude Oil 2025–Present: What a Well-Informed Agent Could Have Seen",
+            "font": title_font,
+            "x": 0.0,
+            "xanchor": "left",
+        },
+        xaxis={
+            "range": [x_min, x_max],
+            "title": {"text": "Date", "font": axis_font},
+            "showgrid": True,
+            "gridcolor": "#f0f0f0",
+            "tickfont": tick_font,
+        },
+        yaxis={
+            "range": [y_min, y_max],
+            "title": {"text": "Price (USD / bbl)", "font": axis_font},
+            "showgrid": True,
+            "gridcolor": "#f0f0f0",
+            "tickfont": tick_font,
+        },
+        template="plotly_white",
+        legend={
+            "orientation": "h",
+            "y": -0.10,
+            "x": 0.0,
+            "xanchor": "left",
+            "font": {"size": 12},
+        },
+        width=900,
+        height=660,
+        margin={"t": 80, "b": 230, "l": 70, "r": 40},
+    )
+    return fig
+
+
+def make_error_timeline(forecasts_df: pd.DataFrame) -> go.Figure:
+    """Signed forecast error by resolution date, coloured by period."""
+    resolved = forecasts_df.dropna(subset=["actual_price"]).copy()
+    resolved = resolved.sort_values("resolution_date")
+    resolved["error"] = resolved["actual_price"] - resolved["yhat"]
+
+    y2025 = resolved[resolved["resolution_date"].dt.year == 2025]
+    y2026 = resolved[resolved["resolution_date"].dt.year >= 2026]
+
+    x_max = (resolved["resolution_date"].max() + pd.Timedelta(days=14)).strftime("%Y-%m-%d")
+
+    fig = go.Figure()
+    fig.add_vrect(
+        x0="2026-01-01",
+        x1=x_max,
+        fillcolor="rgba(222, 45, 38, 0.06)",
+        line_width=0,
+        annotation_text="2026 Reality",
+        annotation_position="top right",
+        annotation_font={"size": 10, "color": CLR_MISS},
+    )
+
+    for df_sub, label, color in [
+        (y2025, "2025 Backtest", CLR_ACTUAL),
+        (y2026, "2026 Reality (Jan–Apr)", CLR_MISS),
+    ]:
+        fig.add_trace(
+            go.Bar(
+                x=df_sub["resolution_date"],
+                y=df_sub["error"],
+                name=label,
+                marker_color=color,
+                opacity=0.75,
+                width=1.6 * _DAY_MS,
+            )
+        )
+
+    fig.add_hline(y=0, line={"color": "#252525", "width": 1.5, "dash": "dot"})
+    fig.add_vline(
+        x=pd.Timestamp("2026-01-01").timestamp() * 1000,
+        line={"color": "#636363", "dash": "dash", "width": 1.5},
+        annotation_text=" Jan 2026",
+        annotation_position="top left",
+        annotation_font={"size": 11, "color": "#636363"},
+    )
+
+    fig.update_layout(
+        title={"text": "Forecast Error by Resolution Date — Actual minus Forecast (USD/bbl)", "font": {"size": 16}},
+        xaxis={"title": "Resolution Date", "showgrid": True, "gridcolor": "#f0f0f0"},
+        yaxis={"title": "Error (USD/bbl)", "showgrid": True, "gridcolor": "#f0f0f0", "zeroline": False},
+        template="plotly_white",
+        width=900,
+        height=420,
+        margin={"t": 60, "b": 40, "l": 60, "r": 40},
+        barmode="overlay",
+        legend={
+            "x": 0.01,
+            "y": 0.99,
+            "xanchor": "left",
+            "yanchor": "top",
+            "bgcolor": "rgba(255,255,255,0.85)",
+            "bordercolor": "#dddddd",
+            "borderwidth": 1,
+        },
+    )
+    return fig
+
+
+def coverage_summary_table(forecasts_df: pd.DataFrame) -> pd.DataFrame:
+    """Return period-level coverage and error summary."""
+    resolved = forecasts_df.dropna(subset=["actual_price"]).copy()
+    resolved["period"] = resolved["sim_day"].apply(
+        lambda d: "2025 Backtest" if d.year == 2025 else "2026 Reality (Jan–Apr)"
+    )
+    resolved["error"] = resolved["actual_price"] - resolved["yhat"]
+    resolved["abs_error"] = resolved["error"].abs()
+    period_order = ["2025 Backtest", "2026 Reality (Jan–Apr)"]
+    return (
+        resolved.groupby("period")
+        .agg(
+            n_forecasts=("sim_day", "count"),
+            coverage_pct=("inside_ci", lambda x: f"{x.mean() * 100:.1f}%"),
+            mae=("abs_error", lambda x: f"${x.mean():.2f}"),
+            median_abs_error=("abs_error", lambda x: f"${x.median():.2f}"),
+            max_abs_error=("abs_error", lambda x: f"${x.max():.2f}"),
+        )
+        .loc[period_order]
+    )
+
+
+def make_coverage_chart(forecasts_df: pd.DataFrame) -> go.Figure:
+    """CI coverage bar chart by period."""
+    resolved = forecasts_df.dropna(subset=["actual_price"]).copy()
+    resolved["period"] = resolved["sim_day"].apply(
+        lambda d: "2025 Backtest" if d.year == 2025 else "2026 Reality (Jan–Apr)"
+    )
+    period_order = ["2025 Backtest", "2026 Reality (Jan–Apr)"]
+    period_colors = {"2025 Backtest": CLR_ACTUAL, "2026 Reality (Jan–Apr)": CLR_MISS}
+    coverage = (
+        resolved.groupby("period", sort=False)
+        .agg(total=("inside_ci", "count"), inside=("inside_ci", "sum"))
+        .assign(coverage_pct=lambda d: d["inside"] / d["total"] * 100)
+        .reset_index()
+    )
+    coverage["order"] = coverage["period"].map({p: i for i, p in enumerate(period_order)})
+    coverage = coverage.sort_values("order")
+    # Colour each bar by its own period so a missing period cannot shift the colours.
+    bar_colors = [period_colors[p] for p in coverage["period"]]
+
+    fig_cov = go.Figure()
+    fig_cov.add_trace(
+        go.Bar(
+            x=coverage["period"],
+            y=coverage["coverage_pct"],
+            marker_color=bar_colors,
+            text=[f"{v:.1f}%" for v in coverage["coverage_pct"]],
+            textposition="outside",
+            textfont={"size": 15},
+            width=0.4,
+        )
+    )
+    fig_cov.add_hline(
+        y=95,
+        line={"color": "#636363", "dash": "dash", "width": 1.5},
+        annotation_text=" Expected 95%",
+        annotation_position="right",
+        annotation_font={"size": 11, "color": "#636363"},
+    )
+    fig_cov.update_layout(
+        title={"text": "Forecast Coverage: % of resolutions inside the 95% CI", "font": {"size": 16}},
+        yaxis={"title": "Coverage (%)", "range": [0, 108], "showgrid": True, "gridcolor": "#f0f0f0"},
+        xaxis={"title": ""},
+        template="plotly_white",
+        width=900,
+        height=380,
+        margin={"t": 60, "b": 40, "l": 60, "r": 40},
+        showlegend=False,
+    )
+    return fig_cov
+
+
+def make_punchline_charts(forecasts_df: pd.DataFrame) -> tuple[go.Figure, go.Figure, pd.DataFrame]:
+    """Return error timeline, coverage chart, and summary table."""
+    return make_error_timeline(forecasts_df), make_coverage_chart(forecasts_df), coverage_summary_table(forecasts_df)
+
+
+def make_futures_curve_chart(price_df: pd.DataFrame) -> go.Figure | None:
+    """Snapshot of the WTI futures term structure from nearby NYMEX contracts."""
+    month_codes = ["F", "G", "H", "J", "K", "M", "N", "Q", "U", "V", "X", "Z"]
+    month_names = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
+    month_map = dict(zip(month_codes, month_names, strict=True))
+
+    today = datetime.date.today()
+    tickers: list[str] = []
+    labels: list[str] = []
+
+    for delta_months in range(1, 10):
+        m = today.month - 1 + delta_months
+        year = today.year + m // 12
+        month = m % 12 + 1
+        code = month_codes[month - 1]
+        yr2 = str(year)[-2:]
+        tickers.append(f"CL{code}{yr2}.NYM")
+        labels.append(f"{month_map[code]} '{yr2}")
+
+    prices: list[float] = []
+    valid_labels: list[str] = []
+
+    for ticker, label in zip(tickers, labels, strict=True):
+        try:
+            data = yf.download(ticker, period="5d", progress=False, auto_adjust=True)
+            if isinstance(data.columns, pd.MultiIndex):
+                data.columns = data.columns.get_level_values(0)
+            if not data.empty and "Close" in data.columns:
+                val = float(data["Close"].dropna().iloc[-1])
+                if val > 1.0:
+                    prices.append(val)
+                    valid_labels.append(label)
+        except Exception:  # best-effort: skip contracts that fail to download or parse
+            continue
+
+    if not prices:
+        return None
+
+    spot = float(price_df["price"].iloc[-1])
+    all_labels = ["Spot (now)"] + valid_labels
+    all_prices = [spot] + prices
+
+    contango = prices[-1] > prices[0] if len(prices) > 1 else False
+    structure = (
+        "Contango — market prices in higher costs ahead"
+        if contango
+        else "Backwardation — market prices in near-term premium"
+    )
+    curve_color = "#e6550d" if contango else CLR_ACTUAL
+
+    fig = go.Figure()
+    fig.add_trace(
+        go.Scatter(
+            x=all_labels,
+            y=all_prices,
+            mode="lines+markers",
+            line={"color": curve_color, "width": 2.5},
+            marker={"size": 9},
+            name="WTI Futures Curve",
+        )
+    )
+    fig.update_layout(
+        title={
+            "text": f"WTI Futures Term Structure — Current Snapshot<br><sup>{structure}</sup>",
+            "font": {"size": 16},
+        },
+        xaxis={"title": "Contract Month"},
+        yaxis={"title": "Price (USD/bbl)", "showgrid": True, "gridcolor": "#f0f0f0"},
+        template="plotly_white",
+        width=900,
+        height=380,
+        margin={"t": 80, "b": 50, "l": 60, "r": 40},
+    )
+    return fig
+
+
+def export_animation_html(fig: go.Figure, output_path: Path) -> None:
+    """Write a standalone Plotly animation HTML file."""
+    fig.write_html(str(output_path), include_plotlyjs="cdn", auto_play=False)
+
+
+# ── NB3 chart builders ────────────────────────────────────────────────────────
+
+
+def _qval(quantiles: dict[str | float, float], q: float) -> float:
+    """Extract a quantile value tolerating both string and float dict keys."""
+    for key in (q, str(q)):
+        if key in quantiles:
+            return float(quantiles[key])
+    return float("nan")
+
+
+def make_trajectory_fan_chart(
+    traj_agent_results: list[dict[str, Any]],
+    prophet_traj_df: pd.DataFrame,
+    price_df: pd.DataFrame,
+    trajectory_origins: list[pd.Timestamp],
+    *,
+    history_window: int = 40,
+) -> go.Figure:
+    """3-panel Plotly fan chart comparing Prophet CI fan to agent error bars.
+
+    One column per origin. Each panel shows:
+    - Pre-origin price history (grey line)
+    - Realised prices over the 21-day forecast window (blue thick line)
+    - Prophet 95% CI fan + median (grey shaded + dotted line)
+    - Agent point forecasts at h=5, 10, 21 with 80% CI error bars (green diamonds)
+    - Vertical dashed line at the forecast origin
+
+    Parameters
+    ----------
+    traj_agent_results : list[dict]
+        Reference-format agent results: ``{"origin": "YYYY-MM-DD", "predictions": [...]}``.
+    prophet_traj_df : pd.DataFrame
+        Prophet trajectory DataFrame with columns ``origin``, ``horizon``,
+        ``forecast_date``, ``yhat``, ``yhat_lower``, ``yhat_upper``.
+    price_df : pd.DataFrame
+        Price DataFrame with DatetimeIndex and column ``price``.
+    trajectory_origins : list[pd.Timestamp]
+        The three (or more) forecast origins to display as columns.
+    history_window : int
+        Number of business days of history to show before each origin.
+    """
+    agent_by_origin = {r["origin"]: r for r in traj_agent_results}
+
+    subplot_titles = []
+    for o in trajectory_origins:
+        rows = price_df[price_df.index >= o]
+        price_label = f"${float(rows.iloc[0]['price']):.0f}" if not rows.empty else ""
+        subplot_titles.append(f"{o.strftime('%b %d, %Y')}  WTI {price_label}")
+
+    fig = psp.make_subplots(
+        rows=1,
+        cols=len(trajectory_origins),
+        subplot_titles=subplot_titles,
+        shared_yaxes=True,
+        horizontal_spacing=0.04,
+    )
+
+    for col_idx, origin in enumerate(trajectory_origins, start=1):
+        key = str(origin.date())
+        show_legend = col_idx == 1
+
+        bday_dates = pd.bdate_range(start=origin + pd.offsets.BDay(1), periods=21)
+
+        # Pre-origin history
+        hist_window = price_df[price_df.index <= origin].iloc[-history_window:]
+        fig.add_trace(
+            go.Scatter(
+                x=hist_window.index.tolist(),
+                y=hist_window["price"].tolist(),
+                mode="lines",
+                line={"color": CLR_HISTORY, "width": 1.5},
+                name="WTI Price",
+                showlegend=show_legend,
+                legendgroup="actual",
+            ),
+            row=1,
+            col=col_idx,
+        )
+
+        # Realised post-origin prices
+        actual_future = price_df[(price_df.index > origin) & (price_df.index <= bday_dates[-1])]
+        fig.add_trace(
+            go.Scatter(
+                x=actual_future.index.tolist(),
+                y=actual_future["price"].tolist(),
+                mode="lines",
+                line={"color": CLR_ACTUAL, "width": 2.5},
+                name="Actual outcome",
+                showlegend=show_legend,
+                legendgroup="actual_outcome",
+            ),
+            row=1,
+            col=col_idx,
+        )
+
+        # Prophet fan
+        p_sub = prophet_traj_df[prophet_traj_df["origin"] == origin].sort_values("horizon")
+        if not p_sub.empty:
+            x_fill = pd.concat([p_sub["forecast_date"], p_sub["forecast_date"].iloc[::-1]]).tolist()
+            y_fill = pd.concat([p_sub["yhat_lower"], p_sub["yhat_upper"].iloc[::-1]]).tolist()
+            fig.add_trace(
+                go.Scatter(
+                    x=x_fill,
+                    y=y_fill,
+                    fill="toself",
+                    fillcolor="rgba(99,99,99,0.12)",
+                    line={"width": 0},
+                    showlegend=False,
+                    hoverinfo="skip",
+                ),
+                row=1,
+                col=col_idx,
+            )
+            fig.add_trace(
+                go.Scatter(
+                    x=p_sub["forecast_date"].tolist(),
+                    y=p_sub["yhat"].tolist(),
+                    mode="lines",
+                    line={"color": CLR_PROPHET, "width": 1.8, "dash": "dot"},
+                    name="Prophet (95% CI)",
+                    showlegend=show_legend,
+                    legendgroup="prophet",
+                ),
+                row=1,
+                col=col_idx,
+            )
+
+        # Agent error bars at h=5, 10, 21
+        result = agent_by_origin.get(key)
+        if result and result.get("predictions"):
+            agent_horizons = [5, 10, 21]
+            # Predictions are paired with horizons by position; truncate so x and y stay aligned.
+            preds = result["predictions"][: len(agent_horizons)]
+            agent_dates = [bday_dates[h - 1] for h in agent_horizons[: len(preds)]]
+            agent_pts = [p["payload"]["point_forecast"] for p in preds]
+            agent_lo = [_qval(p["payload"]["quantiles"], 0.1) for p in preds]
+            agent_hi = [_qval(p["payload"]["quantiles"], 0.9) for p in preds]
+            err_hi = [
+                hi - pt if not (np.isnan(hi) or np.isnan(pt)) else 0.0
+                for hi, pt in zip(agent_hi, agent_pts, strict=True)
+            ]
+            err_lo = [
+                pt - lo if not (np.isnan(lo) or np.isnan(pt)) else 0.0
+                for pt, lo in zip(agent_pts, agent_lo, strict=True)
+            ]
+            fig.add_trace(
+                go.Scatter(
+                    x=[d.to_pydatetime() for d in agent_dates],
+                    y=agent_pts,
+                    mode="markers",
+                    marker={"color": CLR_AGENT, "size": 11, "symbol": "diamond"},
+                    error_y={
+                        "type": "data",
+                        "symmetric": False,
+                        "array": err_hi,
+                        "arrayminus": err_lo,
+                        "color": CLR_AGENT,
+                        "thickness": 2,
+                        "width": 6,
+                    },
+                    name="Agent (80% CI)",
+                    showlegend=show_legend,
+                    legendgroup="agent",
+                ),
+                row=1,
+                col=col_idx,
+            )
+
+        # Origin marker
+        fig.add_vline(
+            x=origin.timestamp() * 1000,
+            line={"color": "#aaaaaa", "dash": "dash", "width": 1.2},
+            row=1,
+            col=col_idx,
+        )
+
+    fig.update_layout(
+        title={
+            "text": "WTI Trajectory Forecast — Prophet Fan vs Agent Estimates",
+            "font": {"size": 16},
+            "x": 0.0,
+            "xanchor": "left",
+        },
+        template="plotly_white",
+        width=1000,
+        height=420,
+        margin={"t": 80, "b": 50, "l": 60, "r": 20},
+        legend={
+            "orientation": "h",
+            "y": -0.12,
+            "x": 0.0,
+            "xanchor": "left",
+            "font": {"size": 12},
+        },
+    )
+    fig.update_xaxes(showgrid=True, gridcolor="#f0f0f0", tickfont={"size": 11})
+    fig.update_yaxes(showgrid=True, gridcolor="#f0f0f0", tickfont={"size": 11})
+    return fig
+
+
+def make_shock_comparison_chart(
+    shock_results: list[dict[str, Any]],
+    prophet_probs: list[float],
+    *,
+    shock_threshold: float = 5.0,
+) -> go.Figure:
+    """2-panel chart: P(shock) over time + cumulative Brier score.
+
+    Row 1 — Predicted probability for each origin, both Prophet and Agent.
+    Shock origins are highlighted with a red background band.
+    Row 2 — Cumulative mean Brier score (lower is better; 0.25 = random).
+
+    Parameters
+    ----------
+    shock_results : list[dict]
+        Reference-format shock results: ``{"origin", "probability", "outcome", "delta"}``.
+    prophet_probs : list[float]
+        Pre-computed Prophet P(shock) values, parallel to ``shock_results``.
+    shock_threshold : float
+        The dollar threshold used to define a shock (for axis labelling).
+    """
+    origins = [r["origin"] for r in shock_results]
+    agent_probs = [float(r["probability"]) for r in shock_results]
+    outcomes = [int(r["outcome"]) for r in shock_results]
+
+    agent_briers = [(p - y) ** 2 for p, y in zip(agent_probs, outcomes)]
+    prophet_briers = [(p - y) ** 2 if not np.isnan(p) else float("nan") for p, y in zip(prophet_probs, outcomes)]
+
+    # Cumulative mean Brier
+    def _cum_mean(vals: list[float]) -> list[float]:
+        result = []
+        total = 0.0
+        count = 0
+        for v in vals:
+            if not np.isnan(v):
+                total += v
+                count += 1
+            result.append(total / count if count else float("nan"))
+        return result
+
+    agent_cum = _cum_mean(agent_briers)
+    prophet_cum = _cum_mean(prophet_briers)
+
+    shock_indices = [i for i, r in enumerate(shock_results) if r["outcome"] == 1]
+
+    fig = psp.make_subplots(
+        rows=2,
+        cols=1,
+        row_heights=[0.58, 0.42],
+        vertical_spacing=0.22,
+        subplot_titles=[
+            f"P(WTI up > +${shock_threshold:.0f}/bbl in 5 trading days)",
+            "Cumulative mean Brier score (lower = better)",
+        ],
+    )
+
+    # Red shock bands
+    for i in shock_indices:
+        for row_n, y0, y1 in [(1, -0.06, 1.06), (2, 0.0, 0.30)]:
+            fig.add_shape(
+                type="rect",
+                layer="below",
+                xref=f"x{'' if row_n == 1 else str(row_n)}",
+                yref=f"y{'' if row_n == 1 else str(row_n)}",
+                x0=i - 0.48,
+                x1=i + 0.48,
+                y0=y0,
+                y1=y1,
+                fillcolor="rgba(214,39,40,0.12)",
+                line_width=0,
+            )
+        fig.add_annotation(
+            x=origins[i],
+            y=1.04,
+            text="<b>SHOCK</b>",
+            showarrow=False,
+            font={"size": 9, "color": "#d62728"},
+            xref="x",
+            yref="y",
+        )
+
+    # Probability traces
+    for method, probs, color, dash, symbol in [
+        ("Analyst Agent", agent_probs, CLR_AGENT, "solid", "circle"),
+        ("Prophet", prophet_probs, CLR_PROPHET, "dot", "square"),
+    ]:
+        fig.add_trace(
+            go.Scatter(
+                x=origins,
+                y=probs,
+                name=method,
+                mode="lines+markers",
+                line={"color": color, "width": 2.5, "dash": dash},
+                marker={"size": 10, "symbol": symbol},
+                legendgroup=method,
+                showlegend=True,
+                hovertemplate="%{x}<br>P(shock)=%{y:.0%}<extra>" + method + "</extra>",
+            ),
+            row=1,
+            col=1,
+        )
+        yshift = 12 if method == "Analyst Agent" else -14
+        for x_val, y_val in zip(origins, probs):
+            if not np.isnan(y_val):
+                fig.add_annotation(
+                    x=x_val,
+                    y=y_val,
+                    text=f"{y_val:.0%}",
+                    showarrow=False,
+                    font={"size": 8, "color": color},
+                    yshift=yshift,
+                    row=1,
+                    col=1,
+                )
+
+    fig.add_hline(
+        y=0.5,
+        line={"color": "#d0d0d0", "dash": "dot", "width": 1.2},
+        row=1,
+        col=1,
+    )
+
+    # Brier traces
+    for method, cum, color, dash, symbol in [
+        ("Analyst Agent", agent_cum, CLR_AGENT, "solid", "circle"),
+        ("Prophet", prophet_cum, CLR_PROPHET, "dot", "square"),
+    ]:
+        fig.add_trace(
+            go.Scatter(
+                x=origins,
+                y=cum,
+                name=method,
+                mode="lines+markers",
+                line={"color": color, "width": 2.5, "dash": dash},
+                marker={"size": 8, "symbol": symbol},
+                legendgroup=method,
+                showlegend=False,
+                hovertemplate="%{x}<br>Cumul. Brier: %{y:.3f}<extra>" + method + "</extra>",
+            ),
+            row=2,
+            col=1,
+        )
+
+    fig.add_hline(
+        y=0.25,
+        line={"color": "#aaaaaa", "dash": "dot", "width": 1.5},
+        annotation_text="0.25 random ceiling",
+        annotation_position="top right",
+        annotation_font={"size": 9, "color": "#888888"},
+        row=2,
+        col=1,
+    )
+
+    fig.update_layout(
+        title={
+            "text": f"Analyst Agent vs Prophet — WTI Upward Shock (>${shock_threshold:.0f}/bbl in 5 days)",
+            "x": 0.5,
+            "font": {"size": 13},
+        },
+        height=520,
+        width=700,
+        template="plotly_white",
+        xaxis={"type": "category", "tickangle": -35, "showgrid": False},
+        xaxis2={"type": "category", "tickangle": -35, "showgrid": False},
+        yaxis={"range": [-0.06, 1.12], "tickformat": ".0%", "showgrid": True, "gridcolor": "#f0f0f0"},
+        yaxis2={"range": [0.0, 0.32], "showgrid": True, "gridcolor": "#f0f0f0"},
+        legend={
+            "orientation": "h",
+            "yanchor": "bottom",
+            "y": 1.04,
+            "xanchor": "right",
+            "x": 1,
+            "font": {"size": 11},
+        },
+        margin={"t": 80, "b": 70, "l": 60, "r": 35},
+    )
+    return fig
+
+
+# ── HTML display helpers (NB3 forecast cards) ────────────────────────────────
+
+
+def verdict_label(a_prob: float, outcome: int, delta: float, threshold: float) -> str:
+    """Human-readable verdict for binary shock forecast cards."""
+    if outcome == 1:
+        return f"Actual: +${delta:.2f}/bbl (>{threshold:.0f}) — shock materialised"
+    return f"Actual: {'+' if delta >= 0 else '-'}${abs(delta):.2f}/bbl — no shock"
+
+
+def prob_bar(val: float, width: int = 10) -> str:
+    """ASCII probability bar for notebook Markdown display."""
+    filled = int(round(val * width))
+    return "█" * filled + "░" * (width - filled) + f"  {val:.0%}"
+
+
+def conf_bar(conf: str) -> str:
+    """Map confidence label to emoji indicator."""
+    return {"high": "🟢", "medium": "🟡", "low": "🔴"}.get(conf.lower(), "⚪")
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__01_food_data_exploration.ipynb.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__01_food_data_exploration.ipynb.md
new file mode 100644
index 0000000..de6ba4b
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__01_food_data_exploration.ipynb.md
@@ -0,0 +1,127 @@
+# Source: implementations/food_price_forecasting/01_food_data_exploration.ipynb
+
+kind: notebook
+
+## Cell 1 (markdown)
+
+# Food Price CPI — Data Exploration
+
+A quick tour of the Canadian food CPI data used by the CFPR replica
+experiment.  The target variable is the set of nine StatCan food sub-indices
+from table 18-10-0004-11, which we register via the helper in
+``food_price_forecasting.data``.
+
+This notebook is intentionally short: the experiment notebook
+(`02_food_cpi_experiment.ipynb`) is the canonical analysis.  Use this one as a
+warm-up when you're onboarding, or when you want a look at the raw series
+without touching the backtest.
+
+If you need the macro covariates from FRED, register them yourself — see
+``planning-docs/bootcamp-workplan.md`` for the deferred covariate-framing design work.
+
+## Cell 2 (markdown)
+
+---
+## 1. Load the 9 food CPI series
+
+## Cell 3 (code)
+
+```python
+from __future__ import annotations
+
+from datetime import datetime, timezone
+from pathlib import Path
+
+import matplotlib.pyplot as plt
+import pandas as pd
+
+
+ROOT = Path.cwd().resolve().parents[1]
+STATCAN_CACHE = ROOT / "data" / "statcan"
+
+from food_price_forecasting.data import CATEGORY_LABELS, FOOD_CPI_SERIES, build_food_cpi_service
+from food_price_forecasting.plots import plot_food_cpi_small_multiples
+
+
+svc = build_food_cpi_service(cache_dir=STATCAN_CACHE)
+print(f"Registered {len(FOOD_CPI_SERIES)} food CPI series.")
+
+_as_of = datetime.now(tz=timezone.utc).replace(tzinfo=None)
+overall = svc.get_series("cpi_food_canada", as_of=_as_of)
+print(f"cpi_food_canada: {overall['timestamp'].min()} → {overall['timestamp'].max()} ({len(overall)} months)")
+```
+
+## Cell 4 (markdown)
+
+---
+## 2. Visualise all nine sub-indices
+
+Index levels (base 2002 = 100) give a single-glance view of the acceleration
+in Canadian food prices since roughly 2020.
+
+## Cell 5 (code)
+
+```python
+fig, _ = plot_food_cpi_small_multiples(svc)
+plt.show()
+```
+
+## Cell 6 (markdown)
+
+---
+## 3. Year-over-year change
+
+Monthly YoY change surfaces the dynamics a predictor actually has to learn.
+Most categories stay within roughly ±5%; fish, vegetables, and fruit stand out for
+higher volatility.
+
+## Cell 7 (code)
+
+```python
+fig, axes = plt.subplots(3, 3, figsize=(15, 9), sharex=True)
+axes_flat = axes.flatten()
+
+for ax, (series_id, _, _desc, _units) in zip(axes_flat, FOOD_CPI_SERIES):
+    df = svc.get_series(series_id, as_of=_as_of)
+    df["timestamp"] = pd.to_datetime(df["timestamp"])
+    df = df.set_index("timestamp").sort_index()
+    df["yoy_pct"] = df["value"].pct_change(12) * 100
+
+    ax.axhline(0, color="#888", linewidth=0.6, linestyle="--")
+    ax.plot(df.index, df["yoy_pct"], color="tomato", linewidth=1.0)
+    ax.set_title(CATEGORY_LABELS.get(series_id, series_id), fontsize=10)
+    ax.set_ylabel("YoY %", fontsize=8)
+    ax.tick_params(labelsize=8)
+    ax.grid(axis="y", alpha=0.3)
+
+fig.suptitle("Canada food CPI — YoY % change per category", fontsize=12)
+plt.tight_layout()
+plt.show()
+```
+
+## Cell 8 (markdown)
+
+---
+## 4. Coverage summary
+
+## Cell 9 (code)
+
+```python
+rows: list[dict[str, object]] = []
+for series_id, _, desc, units in FOOD_CPI_SERIES:
+    df = svc.get_series(series_id, as_of=_as_of)
+    df["timestamp"] = pd.to_datetime(df["timestamp"])
+    rows.append(
+        {
+            "series_id": series_id,
+            "description": desc,
+            "units": units,
+            "start": df["timestamp"].min().date(),
+            "end": df["timestamp"].max().date(),
+            "n_months": len(df),
+        }
+    )
+
+coverage = pd.DataFrame(rows).set_index("series_id")
+print(coverage.to_string())
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__02_food_cpi_experiment.ipynb.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__02_food_cpi_experiment.ipynb.md
new file mode 100644
index 0000000..29f36b3
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__02_food_cpi_experiment.ipynb.md
@@ -0,0 +1,650 @@
+# Source: implementations/food_price_forecasting/02_food_cpi_experiment.ipynb
+
+kind: notebook
+
+## Cell 1 (markdown)
+
+# Canada Food CPI — CFPR Replica Experiment
+
+This notebook is an **LLMP methodology demonstration**: it compares direct LLM
+process predictors to conventional baselines on historical Canadian food CPI
+data and, crucially, explains what those comparisons can and cannot tell us.
+
+**The leakage problem.** Frontier LLMs like Gemini were trained on data through
+late 2024.  For most backtest origins here (2009–2024), the model has likely
+been exposed to the resolved outcomes.  A strong CRPS score therefore tells us
+the pipeline is working and the model can produce calibrated structured output
+— it does not tell us the model can genuinely forecast the future.  The right
+interpretation is an **upper bound**: if a model with perfect recall of
+historical outcomes scores X, a live deployment should aspire to approach X on
+unresolved tasks.
+
+Set `EXPERIMENT_CONFIG` in the config cell to choose the scope:
+`mini_single` (1 target, 6 origins), `mini_recent` (9 targets, 6 origins — default),
+or `full` (9 targets, 16 origins — canonical CFPR backtest).
+
+**What's here:**
+
+1. Data overview — nine food CPI sub-indices at a glance.
+2. Spec + LLMP predictors — loaded from YAML; backtest results cached on disk.
+3. Qualitative check — trajectory fans and avg/avg YoY grid.
+4. Model selection — CRPS and MAPE per category.
+5. What next — why live forecasting is the only clean evaluation.
+
+## Cell 2 (markdown)
+
+---
+## 1. Setup
+
+The heavy lifting lives in helper modules alongside this notebook:
+
+- `data.py`      registers the 9 StatCan series on a `DataService`.
+- `analysis.py`  flattens results to DataFrames and computes avg/avg YoY.
+- `plots.py`     renders the figures the CFPR audience expects.
+
+Backtest specs (all under `implementations/food_price_forecasting/specs/`):
+
+| Spec file | Tasks | Origins | Notes |
+|---|---|---|---|
+| `food_cpi_single_mini_backtest.yaml` | 1 (food overall) | 6 (2019–2024) | Fast dev/smoke-test |
+| `food_cpi_recent_backtest.yaml` | 9 | 6 (2019–2024) | Recent regimes only |
+| `food_cpi_cfpr_backtest.yaml` | 9 | 16 (2009–2024) | Canonical CFPR backtest |
+
+There is no protected eval spec for this experiment — historical LLM scores are upper bounds on live performance (see the intro).
+
+Run the data fetch once if you haven't:
+
+```bash
+uv run python scripts/fetch_cpi.py
+```
+
+## Cell 3 (code)
+
+```python
+from __future__ import annotations
+
+import warnings
+from pathlib import Path
+
+import matplotlib.pyplot as plt
+import yaml
+from dotenv import load_dotenv
+
+
+warnings.filterwarnings("ignore")
+
+ROOT = Path.cwd().resolve().parents[1]
+load_dotenv(ROOT / ".env")
+
+import importlib
+
+# Helper to live-reload the plots module during development
+import food_price_forecasting
+import food_price_forecasting.plots
+from aieng.forecasting.evaluation import (
+    MultiTargetBacktestSpec,
+    cached_multi_backtest,
+    describe_spec,
+)
+from aieng.forecasting.methods import DartsAutoARIMAPredictor, LastValuePredictor
+from food_price_forecasting.analysis import compute_ape_long, compute_avgyoy, compute_mape, summarize_crps
+from food_price_forecasting.data import CATEGORY_LABELS, build_food_cpi_service
+
+
+importlib.reload(food_price_forecasting.plots)
+from food_price_forecasting.plots import (
+    plot_avgyoy_grid,
+    plot_food_cpi_small_multiples,
+    plot_mape_by_category,
+    plot_trajectory_fan,
+)
+
+
+STATCAN_CACHE = ROOT / "data" / "statcan"
+PREDICTIONS_DIR = ROOT / "data" / "predictions"
+SPECS_DIR = ROOT / "implementations" / "food_price_forecasting" / "specs"
+REPORTS_DIR = ROOT / "data" / "reports" / "cfpr"
+
+# `reports_dir` attaches a cutoff-aware DocumentStore of CFPR editions, so the
+# report-grounded quantile-grid recipe below can prepend each edition published
+# on or before the forecast origin. Populate it once with
+# `scripts/fetch_cfpr.py` then `scripts/extract_reports.py`; if the directory is
+# absent the store simply loads empty and the numeric-only predictors are
+# unaffected.
+svc = build_food_cpi_service(cache_dir=STATCAN_CACHE, reports_dir=REPORTS_DIR)
+print(f"Registered {len(CATEGORY_LABELS)} food CPI series.")
+```
+
+## Cell 4 (code)
+
+```python
+# ── Experiment configuration ──────────────────────────────────────────────────
+# Set EXPERIMENT_CONFIG to control which CFPR backtest runs throughout this
+# notebook. All downstream cells adapt automatically.
+#
+#   "mini_single"  1 target (food overall) × 6 recent origins (Jul 2019–2024)
+#                  fast smoke test for the active LLMP set
+#   "mini_recent"  9 targets × 6 recent origins (Jul 2019–2024)
+#                  default experiment; covers COVID/inflation regimes
+#   "full"         9 targets × 16 origins (Jul 2009–2024)
+#                  canonical CFPR backtest
+
+EXPERIMENT_CONFIG = "mini_recent"
+
+_BACKTEST_SPEC_FILES = {
+    "mini_single": "food_cpi_single_mini_backtest.yaml",
+    "mini_recent": "food_cpi_recent_backtest.yaml",
+    "full": "food_cpi_cfpr_backtest.yaml",
+}
+_BACKTEST_SPEC_FILE = _BACKTEST_SPEC_FILES[EXPERIMENT_CONFIG]
+
+print(f"Config: {EXPERIMENT_CONFIG!r}  →  {_BACKTEST_SPEC_FILE}")
+```
+
+## Cell 5 (markdown)
+
+---
+## 2. Data exploration
+
+A single figure is plenty — the nine sub-indices track each other closely with
+a clear post-2020 acceleration.
+
+## Cell 6 (code)
+
+```python
+fig, _ = plot_food_cpi_small_multiples(svc)
+plt.show()
+```
+
+## Cell 7 (markdown)
+
+---
+## 3. The backtest spec
+
+The backtest spec is loaded from YAML so the spec (not the notebook) is the
+source of truth.  `describe_spec()` renders a plain-text summary suitable for
+print, prompts, or documentation.
+
+> **Training window note.** The `full` spec covers origins from July 2009 to
+> July 2024.  Gemini's training data extends through late 2024, so virtually
+> every origin in this spec — including the most recent ones — falls within the
+> model's training window.  The model has plausibly been exposed to resolved
+> inflation outcomes for all of these periods.  Keep this in mind when
+> interpreting any LLM or agent scores below.
+
+## Cell 8 (code)
+
+```python
+with (SPECS_DIR / _BACKTEST_SPEC_FILE).open() as f:
+    backtest_spec = MultiTargetBacktestSpec.model_validate(yaml.safe_load(f))
+
+print(describe_spec(backtest_spec, data_service=svc))
+print(
+    f"\nTasks: {len(backtest_spec.tasks)}  Window: {backtest_spec.start.date()} → {backtest_spec.end.date()}  Stride: {backtest_spec.stride}"
+)
+```
+
+## Cell 9 (markdown)
+
+---
+## 4. Predictors
+
+Six active predictors, each implementing the same `Predictor` API against the
+configured backtest spec.  The default run stays LLMP-focused: baselines, two
+sampled-trajectory LLMP variants, and two quantile-grid runs — one numeric-only
+and one **grounded on the CFPR reports themselves**.  Models are the two
+bootcamp-pegged Gemini tiers (`gemini-3.1-flash-lite-preview` for the
+token-heavy sampled trajectories, `gemini-3.5-flash` for the quantile grids);
+pro-tier models are deliberately avoided as too heavy for a notebook run.  The
+Food CPI LLMP recipe builders are model-agnostic; model IDs are passed
+explicitly and already appear in each `predictor_id`.  Agentic predictors remain
+below as commented optional blocks.
+
+| Group | Predictor | Notes |
+|---|---|---|
+| Baselines | `LastValuePredictor` | Repeats last observed value; hard to beat at short horizons. |
+| Baselines | `DartsAutoARIMAPredictor` | Auto-ARIMA via Darts; fits at each origin. |
+| LLMP | `SampledTrajectoryLLMPredictor` / Gemini 3.1 Flash-Lite | Samples structured trajectories, then computes empirical quantiles. Token-heavy → lite model. |
+| LLMP | `SampledTrajectoryLLMPredictor` / Gemini 3.5 Flash | Same sampled-trajectory contract with the advanced Flash model. |
+| LLMP | `QuantileGridLLMPredictor` / Gemini 3.5 Flash | One-shot elicitation of the standard quantile grid. |
+| LLMP (reports) | `QuantileGridLLMPredictor` / Gemini 3.5 Flash + CFPR | Same grid, but every CFPR edition published on/before the origin is prepended as text context. One call per origin keeps the large report context affordable. |
+| Optional agentic | `AgentPredictor` | Preserved as commented examples; agentic integration lives in the package and can be re-enabled later. |
+
+### Why the quantile grid carries the reports (and not the sampled trajectory)
+
+Report grounding means tens of thousands of input tokens of context per call.
+The quantile grid elicits the full distribution in a **single** call per origin,
+so that context is paid for once.  A sampled-trajectory variant would re-send the
+same reports on every one of its `n_samples` draws — multiplying input-token cost
+with no methodological benefit.  Reports therefore ride on the direct-prediction
+(quantile-grid) modality by design.
+
+### Leakage — model weights *and* the report itself
+
+Even with tools disabled, a frontier LLM is not a blank slate.  The model
+weights encode a compressed representation of its training corpus — which, for
+Gemini, includes news, statistics, and economic commentary up to late 2024.  For
+a July 2021 origin, the model implicitly "knows" that food inflation spiked in
+2022.  This is not a bug that can be filtered away; it is structural.
+
+The report-grounded row compounds this: **the CFPR report for forecast-year Y is
+itself an expert narrative forecast of the outcome being scored.**  Feeding it in
+is closer to handing the model the answer key than to supplying neutral context.
+Treat the `+ CFPR` row strictly as a wiring demonstration and an upper bound — it
+shows the document pipeline runs end-to-end and is cutoff-disciplined
+(`publication_date <= as_of`), not that report grounding yields live skill.
+
+## Cell 10 (code)
+
+```python
+import pandas as pd
+from food_price_forecasting.predictors.llmp_quantile_grid import build_llmp_quantile_grid
+from food_price_forecasting.predictors.llmp_sampled_trajectory import build_llmp_sampled_trajectory
+
+
+# from food_price_forecasting.analyst_agent import build_food_price_agent_predictor
+
+
+# Bootcamp-pegged Gemini models (see aieng.forecasting.models):
+#   gemini-3.1-flash-lite-preview  (LITE_MODEL) — token-heavy workflows, e.g. sampled trajectories
+#   gemini-3.5-flash               (ADVANCED_MODEL) — the standard run, e.g. the quantile grids
+# Pro-tier models are deliberately avoided here — too heavy for a notebook run.
+
+# ── Baselines ─────────────────────────────────────────────────────────────────
+lv = LastValuePredictor()
+arima = DartsAutoARIMAPredictor()
+
+# ── Active LLMP predictors (no tools) ─────────────────────────────────────────
+# Sampled trajectories are token-heavy (n_samples calls per origin), so they run
+# on the lite model; the 3.5-flash variant is a deliberate lite-vs-advanced point.
+llmp_sampled_flash_lite = build_llmp_sampled_trajectory(
+    model="gemini-3.1-flash-lite-preview",
+    n_samples=3,
+    history_window=30,
+)
+llmp_sampled_flash = build_llmp_sampled_trajectory(
+    model="gemini-3.5-flash",
+    n_samples=3,
+    history_window=30,
+)
+llmp_quantile_grid = build_llmp_quantile_grid(
+    model="gemini-3.5-flash",
+    history_window=30,
+    reasoning_effort=None,
+)
+
+# ── Report-grounded LLMP (CFPR editions in context) ───────────────────────────
+# The quantile grid is the modality of choice for report grounding: it makes ONE
+# elicitation call per origin, so the (large) report context is paid for exactly
+# once. Sampled trajectories would re-send the reports on every sample draw,
+# multiplying input tokens by n_samples for no methodological gain.
+#
+# `report_sources=["cfpr"]` pulls every CFPR edition the DocumentStore holds with
+# publication_date <= the forecast origin and prepends it as a text preamble
+# (report_ingestion defaults to "text", which works for Gemini through the proxy;
+# native PDF ingestion is Claude/GPT-only today — see the proxy-limitations note).
+#
+# LEAKAGE, SHARPENED: the CFPR report for forecast-year Y *is the narrative
+# forecast of the very outcome being scored*. On historical origins this stacks
+# on top of model-weight memorization. Read the report-grounded row as a wiring
+# demonstration and an upper bound, never as evidence of live skill.
+llmp_quantile_grid_reports = build_llmp_quantile_grid(
+    model="gemini-3.5-flash",
+    history_window=30,
+    reasoning_effort=None,
+    report_sources=["cfpr"],
+)
+
+# ── Optional agent predictors — news search OFF (historical-backtest safer) ────
+# agent_lite = build_food_price_agent_predictor(model="gemini-3.1-flash-lite-preview", enable_news_search=False)
+# agent_advanced = build_food_price_agent_predictor(model="gemini-3.5-flash", enable_news_search=False)
+
+# ── Optional agent predictors — news search ON (leakage risk on historical dates) ──
+# agent_lite_search = build_food_price_agent_predictor(model="gemini-3.1-flash-lite-preview", enable_news_search=True)
+# agent_advanced_search = build_food_price_agent_predictor(model="gemini-3.5-flash", enable_news_search=True)
+
+all_predictors = [
+    lv,
+    arima,
+    llmp_sampled_flash_lite,
+    llmp_sampled_flash,
+    llmp_quantile_grid,
+    llmp_quantile_grid_reports,
+    # agent_lite,
+    # agent_advanced,
+    # agent_lite_search,
+    # agent_advanced_search,
+]
+
+# Colors: gray/blue = baselines, red = sampled LLMP, purple = quantile grid, orange/green = optional agents
+PREDICTOR_COLORS: dict[str, str] = {
+    lv.predictor_id: "#7f7f7f",
+    arima.predictor_id: "#1f77b4",
+    llmp_sampled_flash_lite.predictor_id: "#d62728",
+    llmp_sampled_flash.predictor_id: "#e87070",
+    llmp_quantile_grid.predictor_id: "#9467bd",
+    llmp_quantile_grid_reports.predictor_id: "#c5b0d5",
+    # agent_lite.predictor_id: "#ff7f0e",
+    # agent_advanced.predictor_id: "#ffb347",
+    # agent_lite_search.predictor_id: "#2ca02c",
+    # agent_advanced_search.predictor_id: "#72c472",
+}
+
+PREDICTOR_LABELS: dict[str, str] = {
+    lv.predictor_id: "Naive",
+    arima.predictor_id: "AutoARIMA",
+    llmp_sampled_flash_lite.predictor_id: "LLMP sampled 3.1 flash-lite",
+    llmp_sampled_flash.predictor_id: "LLMP sampled 3.5 flash",
+    llmp_quantile_grid.predictor_id: "LLMP quantile-grid 3.5 flash",
+    llmp_quantile_grid_reports.predictor_id: "LLMP quantile-grid 3.5 flash + CFPR",
+    # agent_lite.predictor_id: "Agent (lite)",
+    # agent_advanced.predictor_id: "Agent (advanced)",
+    # agent_lite_search.predictor_id: "Agent (lite) + news",
+    # agent_advanced_search.predictor_id: "Agent (advanced) + news",
+}
+
+for p in all_predictors:
+    print(f"  {p.predictor_id}")
+```
+
+## Cell 11 (code)
+
+```python
+# Optional agent smoke test (disabled for the LLMP-focused notebook run).
+# Uncomment this cell only when you want to exercise the Food CPI agent path.
+#
+# import logging
+# import warnings
+# from datetime import datetime
+#
+# from aieng.forecasting.evaluation.task import ForecastingTask
+# from aieng.forecasting.langfuse_tracing import init_langfuse_tracing, print_langfuse_trace_url
+# from food_price_forecasting.analyst_agent import FoodPriceForecastPromptBuilder, build_food_price_agent_predictor
+# from food_price_forecasting.data import build_food_cpi_service
+# from food_price_forecasting.smoke_report import CFPR_HORIZONS, summarize_agent_predictions
+# from pydantic import ValidationError
+#
+# warnings.filterwarnings("ignore")
+# for _noisy_log in ["opentelemetry", "openinference", "asyncio", "langfuse", "google.adk"]:
+#     logging.getLogger(_noisy_log).setLevel(logging.ERROR)
+# logging.basicConfig(level=logging.WARNING, format="%(message)s")
+# init_langfuse_tracing()
+#
+# _AGENT_SMOKE_ORIGIN = datetime(2023, 7, 1)
+# _AGENT_SMOKE_TASK = ForecastingTask(
+#     task_id="meat_cfpr",
+#     target_series_id="cpi_meat_canada",
+#     horizons=CFPR_HORIZONS,
+#     frequency="MS",
+#     description="Meat CFPR; Jan-Dec trajectory from July origin.",
+# )
+# ENABLE_NEWS_SEARCH_SMOKE = True
+# _smoke_svc = build_food_cpi_service(cache_dir=STATCAN_CACHE)
+# _smoke_predictor = build_food_price_agent_predictor(
+#     model="gemini-3.5-flash",
+#     enable_news_search=ENABLE_NEWS_SEARCH_SMOKE,
+#     prompt_builder=FoodPriceForecastPromptBuilder(max_history_rows=60),
+# )
+# _ctx = _smoke_svc.context(as_of=_AGENT_SMOKE_ORIGIN)
+# _user_msg = _smoke_predictor.prompt_builder(task=_AGENT_SMOKE_TASK, context=_ctx)
+# print(_smoke_predictor.agent_config.instruction)
+# print(_user_msg)
+# try:
+#     _smoke_preds = _smoke_predictor.predict(_AGENT_SMOKE_TASK, _ctx)
+# except ValidationError:
+#     print("SCHEMA VALIDATION FAILED - model response did not match structured output schema.")
+#     _smoke_preds = []
+# summarize_agent_predictions(_smoke_preds, expected_horizons=CFPR_HORIZONS)
+# print_langfuse_trace_url()
+```
+
+## Cell 12 (markdown)
+
+---
+## 5. Backtest (cached on disk)
+
+`cached_multi_backtest` writes each `BacktestResult` to
+`data/predictions/<spec_id>/<predictor_id>/<task_id>.yaml` and reuses it on
+subsequent runs.  Pass `force_refresh=True` to re-run a predictor from scratch.
+
+All active predictors run against the configured backtest spec (set in the config cell above).
+The two active `SampledTrajectoryLLMPredictor` variants make sampled LLM calls on a first run;
+for `mini_recent`, that is 9 targets × 6 origins × 2 LLMPs × 3 samples = 324 calls before caching.
+Subsequent runs are served from the cache and make no LLM calls.
+
+**What this backtest is measuring:**
+
+| What we learn | What we don't learn |
+|---|---|
+| The pipeline runs end-to-end without errors | Whether an LLM can forecast genuinely unseen data |
+| The model produces valid calibrated structured output | Whether strong scores reflect skill or memorization |
+| An approximate upper bound on live performance | How the method performs when the future is actually unknown |
+
+Think of it this way: a human expert who had already read the outcome would
+also beat ARIMA.  The interesting question — whether LLM-process predictors
+provide useful calibrated uncertainty when the future is genuinely open — can
+only be answered by forecasting tasks that haven't resolved yet.
+
+## Cell 13 (code)
+
+```python
+results_by_predictor: dict[str, dict[str, object]] = {}
+
+for predictor in all_predictors:
+    print(f"Running {predictor.predictor_id} ...", flush=True)
+    results_by_predictor[predictor.predictor_id] = cached_multi_backtest(
+        predictor=predictor,
+        spec=backtest_spec,
+        data_service=svc,
+        store_dir=PREDICTIONS_DIR,
+    )
+    for task_id, result in results_by_predictor[predictor.predictor_id].items():
+        print(f"  {task_id:42s}  mean CRPS = {result.mean_score:.4f}  ({len(result.predictions)} preds)")
+```
+
+## Cell 14 (markdown)
+
+---
+## 6. Trajectories — focal series
+
+Show the three most recent origins for the focal series (`FOCAL_TASK`, derived
+from the first task in the backtest spec) with each predictor's 12-step
+trajectory fan.  Solid black is observed history, dashed black is Y+1 actuals
+where available, fans are the predictor's 90%/50% intervals with the median in
+colour.
+
+## Cell 15 (code)
+
+```python
+# Derive from the first task in the loaded spec so this adapts to any config.
+FOCAL_TASK = backtest_spec.tasks[0].task_id
+FOCAL_SERIES = backtest_spec.tasks[0].target_series_id
+
+fig, _ = plot_trajectory_fan(
+    results_by_predictor=results_by_predictor,
+    task_id=FOCAL_TASK,
+    category_id=FOCAL_SERIES,
+    data_service=svc,
+    n_recent=3,
+    colors=PREDICTOR_COLORS,
+    labels=PREDICTOR_LABELS,
+)
+plt.show()
+```
+
+## Cell 16 (markdown)
+
+---
+## 7. Avg/avg YoY — food CPI categories
+
+The headline CFPR metric: for each July origin, mean predicted CPI for year Y+1
+divided by mean observed CPI for year Y, minus 1.  Actual realised YoY (solid
+black) is the out-of-sample truth for every completed year.
+
+**Note on the 2022 spike:** Sampled-trajectory LLMPs (no chain-of-thought) anchor extrapolation on the *current level* rather
+than the *rate of change*, so they can underestimate carry-through during a
+mid-surge origin.  This is a known limitation of level-domain direct prompting
+(Gruver / CiK), not a data-feeding issue.
+
+## Cell 17 (code)
+
+```python
+from datetime import datetime, timezone
+
+
+yoy_by_predictor_by_task: dict[str, dict[str, object]] = {}
+task_to_category: dict[str, str] = {task.task_id: task.target_series_id for task in backtest_spec.tasks}
+
+_as_of = datetime.now(tz=timezone.utc).replace(tzinfo=None)
+
+for pid, task_results in results_by_predictor.items():
+    yoy_by_predictor_by_task[pid] = {}
+    for task_id, result in task_results.items():
+        actual_df = svc.get_series(result.spec.task.target_series_id, as_of=_as_of)
+        yoy_by_predictor_by_task[pid][task_id] = compute_avgyoy(result, actual_df)
+
+fig, _ = plot_avgyoy_grid(
+    yoy_by_predictor_by_task=yoy_by_predictor_by_task,
+    task_to_category=task_to_category,
+    colors=PREDICTOR_COLORS,
+    labels=PREDICTOR_LABELS,
+)
+plt.show()
+```
+
+## Cell 18 (markdown)
+
+---
+## 8. Model selection
+
+A note on interpreting scores for LLM-based predictors: because these backtest
+origins are within the training window of frontier models, low CRPS values
+reflect a combination of (a) genuine calibration quality and (b) the model
+"recalling" outcomes it has implicitly seen.  These two contributions cannot be
+separated from the backtest alone.  Use the scores to confirm the pipeline is
+working and to establish a performance ceiling, not to draw conclusions about
+live predictive skill.
+
+### 8.1 CRPS per category
+
+Lower is better.  The `MEAN` row is the across-category average — useful
+context, but the per-category rows are the basis for model selection.
+
+> ⚠️ **Leakage caveat — read before comparing.** Gemini's training cutoff is ~January 2025, so on this **pre-2025 backtest** the LLM-Process rows may be *reciting memorised CPI outcomes* rather than forecasting. Treat their scores as an **upper bound**, not live skill — the cutoff-safe numerical baselines (last-value, AutoARIMA) are the honest comparison here. A fair LLM evaluation needs **post-cutoff / prospective** origins (see the energy reference's protected 2026 eval, and §10 below).
+
+## Cell 19 (code)
+
+```python
+crps_board = summarize_crps(results_by_predictor)
+print(crps_board.to_string())
+```
+
+## Cell 20 (markdown)
+
+### 8.2 MAPE per category
+
+Median-accuracy sanity check (CRPS is the primary selection metric).  One panel
+per sub-index; each box spans the distribution of per-prediction absolute
+percentage errors across all backtest origins and horizons.
+
+## Cell 21 (code)
+
+```python
+mape_df = compute_mape(results_by_predictor, data_service=svc)
+print(mape_df.to_string())
+
+ape_long = compute_ape_long(results_by_predictor, data_service=svc)
+fig, _ = plot_mape_by_category(
+    ape_long,
+    task_to_category=task_to_category,
+    colors=PREDICTOR_COLORS,
+    labels=PREDICTOR_LABELS,
+)
+plt.show()
+```
+
+## Cell 22 (markdown)
+
+---
+## 9. Backtest-average avg/avg YoY — headline table
+
+Model selection is done **per category**: for each food CPI sub-index the
+predictor with the lowest mean CRPS over all backtest origins and horizons for
+that category is selected independently.  The table below shows each
+category's best predictor and its avg/avg YoY central estimate and uncertainty
+band averaged across the full backtest window.
+
+## Cell 23 (code)
+
+```python
+# Best predictor per category, selected by that category's own mean CRPS.
+best_pid_by_task: dict[str, str] = crps_board.drop(index="MEAN").idxmin(axis=1).to_dict()
+print("Best predictor by category (mean CRPS over full backtest window):")
+for task_id, pid in best_pid_by_task.items():
+    category = CATEGORY_LABELS.get(task_to_category[task_id], task_id)
+    print(f"  {category:<40s} {pid}")
+
+rows: list[dict[str, object]] = []
+for task_id, pid in best_pid_by_task.items():
+    yoy_df = yoy_by_predictor_by_task[pid].get(task_id)
+    if yoy_df is None or yoy_df.empty:
+        continue
+    avg = yoy_df[["yoy_median", "yoy_q05", "yoy_q25", "yoy_q75", "yoy_q95", "actual_yoy"]].mean()
+    rows.append(
+        {
+            "category": CATEGORY_LABELS.get(task_to_category[task_id], task_id),
+            "best_predictor": pid,
+            "median_yoy_%": round(avg["yoy_median"] * 100, 2),
+            "q05_%": round(avg["yoy_q05"] * 100, 2),
+            "q25_%": round(avg["yoy_q25"] * 100, 2),
+            "q75_%": round(avg["yoy_q75"] * 100, 2),
+            "q95_%": round(avg["yoy_q95"] * 100, 2),
+            "actual_yoy_%": round(avg["actual_yoy"] * 100, 2),
+        }
+    )
+
+headline = pd.DataFrame(rows).set_index("category")
+print()
+print(headline.to_string())
+```
+
+## Cell 24 (markdown)
+
+---
+## 10. What next — live forecasting as the only clean evaluation
+
+This notebook demonstrates that the LLMP pipeline works: predictors run,
+structured output is validated, results are cached, evaluation metrics are
+computed.  That is genuinely valuable.  But if the question is "can an LLM
+produce useful forecasts of future food prices?", the backtest scores above
+cannot answer it.
+
+**The fundamental asymmetry.**  Conventional methods (Last Value, ARIMA) are
+blind to the future by construction.  Frontier LLMs are not.  Comparing them on
+historical data is a category error — roughly equivalent to asking a human expert
+who has already read the economic history to compete against an algorithm that
+has not.  The expert will win.  That tells you very little about what either
+would do facing a genuinely open question.
+
+**What a clean evaluation looks like.**  The `Predictor` API, `ForecastingTask`,
+`BacktestSpec`, and `EvalTracker` (budget-gated evaluation runs) are all
+designed to transfer directly to live evaluation:
+
+1. Define tasks whose targets have not yet resolved (e.g., energy price
+   forecasts issued today with a 30-day horizon).
+2. Lock the predictor configuration and issue forecasts now.
+3. Evaluate against actuals when they are published.
+4. Compare CRPS against the baselines under identical conditions.
+
+This pattern is demonstrated properly in the **energy prices case study**, where
+daily resolution makes genuinely prospective evaluation practical.  Food CPI's
+annual resolution cycle makes it unsuitable — you'd wait 18 months for a single
+data point.
+
+**The upside case.**  Even acknowledging leakage, the pipeline has demonstrated
+something real: LLMs can produce coherent probabilistic forecasts in the required
+structured format.  If that capability translates to live tasks — even partially
+— it represents a qualitative shift in what is possible with automated
+forecasting.  The only way to find out is to forecast forward and measure.
+
+*The next step is to do exactly that.*
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__99_starter_agent.ipynb.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__99_starter_agent.ipynb.md
new file mode 100644
index 0000000..1ed8fe5
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__99_starter_agent.ipynb.md
@@ -0,0 +1,179 @@
+# Source: implementations/food_price_forecasting/99_starter_agent.ipynb
+
+kind: notebook
+
+## Cell 1 (markdown)
+
+# Food Price (CPI) — Your Starter Agent
+
+**If you're not sure what to do next, continue from here.**
+
+This notebook is a fresh, hackable agent for the Canadian food-CPI use case — deliberately *not* wired into the numbered curriculum. It gives you our common building blocks behind simple toggles, so you can start building something of your own:
+
+- **optional news search** — bounded, cutoff-aware Google Search (proxy-only)
+- **optional code execution** — an E2B Python sandbox
+- **two lightweight skills** — *tool-usage playbooks* in `starter_agent/skills/`
+
+It does two things: lets you **talk to the agent** (open-ended, Track 2) and **score one real forecast** (Track 1). The live cells are gated by `RUN_AGENT` so a fresh `Run All` is safe and free; flip it to `True` to actually call the model.
+
+## Cell 2 (code)
+
+```python
+import warnings
+from pathlib import Path
+
+
+warnings.filterwarnings("ignore")
+
+import pandas as pd
+from dotenv import load_dotenv
+
+
+# Repo root holds the .env with PROXY_* creds the agent needs.
+ROOT = Path.cwd().resolve().parents[1]
+load_dotenv(ROOT / ".env")
+
+# ── Model selection ───────────────────────────────────
+# Two project models: "gemini-3.1-flash-lite-preview" (lite/default) and
+# "gemini-3.5-flash" (advanced). Lite is the default.
+AGENT_MODEL = "gemini-3.1-flash-lite-preview"
+# AGENT_MODEL = "gemini-3.5-flash"  # advanced (higher cost/latency)
+
+# ── Run guard ──────────────────────────────────────
+# Live agent calls cost tokens and need PROXY_* in the repo-root .env, plus warm
+# data caches. Default False so `Run All` is safe; set True to call the model.
+RUN_AGENT = False
+
+from food_price_forecasting.starter_agent import (
+    build_starter_agent_config,
+    build_starter_agent_predictor,
+)
+
+
+print("RUN_AGENT =", RUN_AGENT, "| model =", AGENT_MODEL)
+```
+
+## Cell 3 (markdown)
+
+---
+## 1. Meet your agent
+
+`build_starter_agent_config` returns an `AgentConfig` with two toggles. The default turns **news search on** (proxy-only, no extra key) and **code execution off** (it needs `E2B_API_KEY` and is slower). Flip them and re-run — the loaded skills follow the enabled tools.
+
+## Cell 4 (code)
+
+```python
+config = build_starter_agent_config(
+    model=AGENT_MODEL,
+    enable_search=True,  # ← cutoff-aware Google Search (proxy-only)
+    enable_code_exec=False,  # ← E2B Python sandbox (needs E2B_API_KEY); try True!
+)
+
+print("Agent:", config.name)
+print("Search enabled:    ", config.context_retrieval.enabled)
+print("Code-exec enabled: ", config.code_execution.enabled)
+print("Skills loaded:     ", [p.name for p in config.skills_dirs])
+print("\n── System instruction (edit this in starter_agent/agent.py) ──\n")
+print(config.instruction[:1200], "...")
+```
+
+## Cell 5 (markdown)
+
+---
+## 2. Talk to it  *(Track 2 — open-ended analysis)*
+
+Ask the agent anything. This is the interactive mode: no scoring, no schema — just reasoning (and a web search, since search is on). Edit the question and explore.
+
+## Cell 6 (code)
+
+```python
+from aieng.forecasting.methods.agentic import build_adk_agent
+from aieng.forecasting.methods.agentic.adk_runner import AdkTextRunner, AdkTextRunnerConfig
+
+
+QUESTION = (
+    "What is driving Canadian food inflation right now, and where is the food "
+    "CPI headed over the next year? Be concise."
+)
+
+if RUN_AGENT:
+    chat_agent = build_adk_agent(config)  # schema-free: plain text in, text out
+    runner = AdkTextRunner(chat_agent, config=AdkTextRunnerConfig(app_name="food_starter_chat"))
+    reply = await runner.run_text_async(QUESTION)  # noqa: F704, PLE1142
+    print(reply)
+else:
+    print("RUN_AGENT is False — set it to True in the setup cell to talk to the agent.")
+```
+
+## Cell 7 (markdown)
+
+---
+## 3. Score one prediction against known outcomes  *(Track 1)*
+
+Now run the agent as a `Predictor`. We pick the **most recent origin whose horizons have already resolved**, forecast the food-overall CPI index, and check whether each actual index landed inside the agent's 80% band. (One origin can't tell you if the agent is *calibrated*; that's what the backtest in `02_food_cpi_experiment.ipynb` is for.) Live, so gated by `RUN_AGENT`.
+
+## Cell 8 (code)
+
+```python
+from datetime import datetime, timezone
+
+
+if RUN_AGENT:
+    from aieng.forecasting.evaluation.task import ForecastingTask
+    from food_price_forecasting.data import FOOD_CPI_SERIES, build_food_cpi_service
+
+    FOOD_SERIES_ID = FOOD_CPI_SERIES[0][0]  # cpi_food_canada (food overall)
+    svc = build_food_cpi_service()
+    now = datetime.now(tz=timezone.utc).replace(tzinfo=None)
+    full = svc.get_series(FOOD_SERIES_ID, as_of=now)
+    full["timestamp"] = pd.to_datetime(full["timestamp"])
+    last_date = full["timestamp"].iloc[-1]
+
+    HORIZONS = [1, 3, 6, 12]
+    # Most recent month-start origin whose longest horizon has already resolved.
+    AS_OF = (last_date - pd.DateOffset(months=max(HORIZONS))).replace(day=1)
+
+    task = ForecastingTask(
+        task_id="food_cpi_starter_forecast",
+        target_series_id=FOOD_SERIES_ID,
+        horizons=HORIZONS,
+        frequency="MS",
+        description="Canadian food CPI index — 1/3/6/12 months ahead (starter).",
+    )
+    ctx = svc.context(as_of=AS_OF)
+    preds = build_starter_agent_predictor(config).predict(task, ctx)
+
+    def realized_at(h):
+        rows = full[full["timestamp"] >= AS_OF + pd.DateOffset(months=h)]
+        return float(rows["value"].iloc[0]) if not rows.empty else None
+
+    print(f"Origin as_of={AS_OF.date()}  series={FOOD_SERIES_ID}  (latest data {last_date.date()})\n")
+    print("  h(mo)  agent index   agent 80% CI            actual   in band?")
+    for i, h in enumerate(HORIZONS):
+        fc = preds[i].payload
+        lo, hi = fc.quantiles[0.10], fc.quantiles[0.90]
+        act = realized_at(h)
+        inb = "—" if act is None else ("yes ✓" if lo <= act <= hi else "no ✗")
+        acts = "   N/A" if act is None else f"{act:8.2f}"
+        print(f"  {h:>3}    {fc.point_forecast:8.2f}   [{lo:7.2f}, {hi:7.2f}]   {acts}   {inb}")
+    if preds[0].metadata.get("rationale"):
+        print("\nRationale:", preds[0].metadata["rationale"][:300])
+else:
+    print("RUN_AGENT is False — set it to True to score a live forecast against known outcomes.")
+```
+
+## Cell 9 (markdown)
+
+---
+## 4. Make it yours
+
+This agent is a starting point. Here are concrete next steps, easiest first — each is a small edit, then re-run the cells above.
+
+1. **Flip code execution on.** Set `enable_code_exec=True` in §1 (needs `E2B_API_KEY`). The agent loads the `code-analysis-playbook` skill and can compute its own diagnostics before forecasting. Compare the rationale.
+2. **Edit the agent's personality.** Open `starter_agent/agent.py` and change `_build_starter_instruction()` — make it more cautious, more contrarian, focused on one driver. Re-run §1 to see the new instruction.
+3. **Sharpen the skills.** The two files in `starter_agent/skills/` are short on purpose. Add your best queries to `research-playbook`, or a new diagnostic to `code-analysis-playbook`. The agent picks them up automatically.
+4. **Change the question and the origin.** Try a different `QUESTION` in §2 and a different origin in §3.
+5. **Forecast all nine series.** This starter does food-overall; loop over `FOOD_CPI_SERIES` for the full CFPR basket (see `02_food_cpi_experiment.ipynb`).
+6. **Add report context.** Build the service with `build_food_cpi_service(reports_dir=...)` and feed cutoff-filtered CFPR report text into the prompt builder.
+
+Bigger ideas — report-grounded forecasting, the avg/avg YoY metric, the full multi-target leaderboard — are in the use-case `README.md` and `planning-docs/roadmap.md`.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__README.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__README.md.md
new file mode 100644
index 0000000..fdff802
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__README.md.md
@@ -0,0 +1,227 @@
+# Source: implementations/food_price_forecasting/README.md
+
+kind: markdown
+
+# Food Price CPI Forecasting
+
+> **Reference implementation 2 of 4.** Recommended order: [getting_started](../getting_started/) → [S&P 500](../sp500_forecasting/) → **food CPI** → [energy / WTI](../energy_oil_forecasting/) → [BoC rate decisions](../boc_rate_decisions/). Each stands on its own.
+
+Replicates the **Canada's Food Price Report (CFPR)** forecasting methodology —
+an annual estimate of the year-over-year percentage change in Canadian food
+prices across nine CPI sub-categories.
+
+This is the **no-futures multivariate** reference implementation — the case
+where context genuinely matters because no market aggregator summarises the
+answer. It is a fully working, literature-aligned forecasting task that runs
+in minutes on a laptop and provides a launching pad for LLM and agent-based
+predictors. It extends the single-series evaluation loop from
+[`getting_started/`](../getting_started/) to multiple correlated targets and a
+multi-step trajectory, but stands on its own — you don't need to work through
+that one first.
+
+---
+
+## Forecasting task
+
+**Target variable:** Consumer Price Index (CPI) for food products and
+sub-categories in Canada (index, 2002 = 100).  The headline CFPR statement
+("food prices are expected to rise by X% in year Y+1") is an
+**average-over-average YoY change**:
+
+$$\text{YoY}_{\text{avg/avg}} = \frac{\overline{\text{CPI}}_{Y+1}}{\overline{\text{CPI}}_Y} - 1$$
+
+where each $\overline{\text{CPI}}_Y$ is the mean of the twelve monthly index
+values for year $Y$.
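+
+As a quick sanity check on the formula, here is a worked example with made-up
+round numbers (illustrative only, not actual Statistics Canada values). Suppose
+the twelve monthly index values average 150.0 in year $Y$ and 156.0 in year
+$Y+1$:
+
+$$\text{YoY}_{\text{avg/avg}} = \frac{156.0}{150.0} - 1 = 0.04 = 4\%$$
+
+Because both terms are annual means, a price jump late in year $Y$ still lifts
+the year $Y+1$ figure even if the index is flat from January onward.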
+
+**Target categories (9):**
+
+| Series ID | Description |
+|-----------|-------------|
+| `cpi_food_canada` | Overall Food (headline) |
+| `cpi_bakery_cereal_canada` | Bakery and cereal products (excl. baby food) |
+| `cpi_dairy_eggs_canada` | Dairy products and eggs |
+| `cpi_fish_seafood_canada` | Fish, seafood and other marine products |
+| `cpi_restaurants_canada` | Food purchased from restaurants |
+| `cpi_fruit_preparations_nuts_canada` | Fruit, fruit preparations and nuts |
+| `cpi_meat_canada` | Meat |
+| `cpi_other_food_nonalcoholic_canada` | Other food products and non-alcoholic beverages |
+| `cpi_vegetables_preparations_canada` | Vegetables and vegetable preparations |
+
+**Data source:** Statistics Canada table 18-10-0004-11.  Populated via
+`scripts/fetch_cpi.py`.
+
+The 9 canonical series are defined once in `data.py` (`FOOD_CPI_SERIES`) and
+referenced everywhere else (YAML specs, notebook, helpers).
+
+---
+
+## CFPR methodology
+
+The CFPR is published each November/December. By that point, the July CPI
+release is typically the most recent data available.  We model the report's
+preparation discipline at every origin:
+
+- **Origins:** July 1 of each year (annual stride).
+- **Trajectory:** horizons 6-17 from a July origin, i.e. January-December of
+  the following calendar year.  Averaging the twelve monthly forecasts,
+  dividing by the prior year's mean, and subtracting 1 gives the avg/avg YoY headline.
+- **Backtest window:** July 2009 → July 2024 (16 annual origins).  Covers
+  three distinct macro regimes: low-inflation (2010-19), COVID shock (2020-21),
+  and the food-price surge and retreat (2021-24).
+- **Information cutoff:** at each origin, predictors only see data with
+  `timestamp ≤ origin`, enforced by `ForecastContext.as_of`.
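+
+The horizon-to-month mapping above can be checked with a little modular
+arithmetic. This is a sketch in our own notation (the symbols $m$ and $y$ are
+not from the codebase): for a July origin in year $Y$ and horizon $h$ in months,
+
+$$m(h) = \big((7 - 1 + h) \bmod 12\big) + 1, \qquad y(h) = Y + \left\lfloor \frac{7 - 1 + h}{12} \right\rfloor$$
+
+so $h = 6$ gives $m = 1$, $y = Y + 1$ (January) and $h = 17$ gives $m = 12$,
+$y = Y + 1$ (December): twelve horizons covering exactly one calendar year.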
+
+> **Note on leakage:** LLM-based predictors trained on data through 2024 have
+> likely seen the resolutions of historical backtesting origins.  Historical
+> CRPS scores for LLMP and agentic predictors represent an **upper bound** on
+> real-world performance, not a clean benchmark.  Proper evaluation requires
+> live / prospective testing on unresolved origins.
+
+---
+
+## Reference specs
+
+```
+specs/
+├── food_cpi_cfpr_backtest.yaml      # MultiTargetBacktestSpec — 9 tasks × 16 origins (full)
+├── food_cpi_recent_backtest.yaml    # MultiTargetBacktestSpec — 9 tasks × 6 recent origins
+└── food_cpi_single_mini_backtest.yaml  # MultiTargetBacktestSpec — 1 task × 6 origins (dev/smoke)
+```
+
+The notebook selects a spec via the `EXPERIMENT_CONFIG` variable at the top
+(`"full"`, `"mini_recent"`, or `"mini_single"`).  The full spec is the
+source of truth for the CFPR task; the mini specs are for fast iteration and
+smoke-testing during development.
+
+---
+
+## Module layout
+
+```
+implementations/food_price_forecasting/
+├── specs/         # backtest YAML (full, mini_recent, mini_single)
+├── reports_manifest.yaml  # committed CFPR PDF URLs + publication dates (2021-2026)
+├── reports.py     # load_manifest(); CFPRReportEntry (manifest URLs + cutoff dates)
+├── data.py        # build_food_cpi_service(); FOOD_CPI_SERIES; CATEGORY_LABELS
+├── analysis.py    # predictions_to_dataframe, compute_avgyoy, summarize_crps,
+│                  # compute_mape, rationales_table
+├── plots.py       # plot_trajectory_fan, plot_avgyoy_grid,
+│                  # plot_crps_disaggregated, plot_mape_distribution,
+│                  # plot_food_cpi_small_multiples
+├── predictors/    # QuantileGridLLMPredictor, SampledTrajectoryLLMPredictor (report-grounded LLMP)
+├── smoke_report.py # summarize_agent_predictions() — plain-text agent smoke-test summary
+├── starter_agent/  # fresh, hackable agent template (toggleable search/code-exec + skills)
+├── 01_food_data_exploration.ipynb # warm-up tour of the 9 series
+├── 02_food_cpi_experiment.ipynb   # narrative over the helpers above
+└── 99_starter_agent.ipynb         # ← start here to build your own agent
+```
+
+Unit tests for the analysis helpers live under
+`implementations/tests/food_price_forecasting/test_analysis.py`.
+
+---
+
+## Covariates
+
+FRED macro covariates are **not** used in the canonical experiment. Framing
+multivariate exogenous inputs for agentic and LLM-based predictors is a natural
+extension. Experiments that need FRED covariates should register their own via
+`FREDAdapter`.
+
+---
+
+## Artifact storage
+
+`cached_multi_backtest()` saves each `BacktestResult` to
+`data/predictions/<spec_id>/<predictor_id>__<task_id>.yaml` immediately after
+each task completes.  If a run crashes mid-experiment, all completed tasks are
+preserved and only the failed task is retried.  Use `force_refresh=True` to
+re-run a predictor from scratch.
+
+Per-origin retry logic (`max_retries=2` by default) handles transient model
+errors such as malformed structured output — a common occurrence with LLM-based
+predictors — without aborting the whole backtest.
+
+---
+
+## Prerequisites
+
+```bash
+uv run python scripts/fetch_cpi.py
+```
+
+No FRED API key is required for the canonical experiment.
+
+---
+
+## Report context (CFPR PDFs)
+
+The Canada's Food Price Report is published each December as a PDF. We extract
+the full text of each report so it can later be co-located with the numeric CPI
+history in LLM-P prompts.
+
+```bash
+# 1. download the report PDFs into data/reports/cfpr/ (gitignored)
+uv run python scripts/fetch_cfpr.py
+# 2. extract each PDF -> <year>_en.md (full text) + <year>_en.json (metadata)
+uv run python scripts/extract_reports.py
+```
+
+- **Manifest:** `reports_manifest.yaml` pins the Dalhousie CDN URLs, editions,
+  and `publication_date` for 2021-2026 (English). It is the committed source of
+  truth; the PDFs and extracted text are cached under `data/` and never
+  committed. `fetch_cfpr.py` fails loudly if a URL has moved (non-PDF response).
+- **Extraction:** a single source-agnostic
+  [`extract_document`](../../aieng-forecasting/aieng/forecasting/documents/extract.py)
+  function (lightweight, deterministic, CPU-only `pymupdf4llm`; the `documents`
+  optional dependency, installed by `uv sync`). It returns an
+  `ExtractedDocument` = full `text` + `publication_date` + `page_count` +
+  `n_chars` + `est_tokens`. No section/segment structure is reconstructed — the
+  planned LLM-P formats consume the whole document, and report families share no
+  common structure, so per-source heading heuristics would be brittle.
+- **`publication_date` is the cutoff key.** A cutoff-aware document store
+  filters reports with `publication_date <= as_of`, so a report is never
+  visible at a forecast origin before its real release. The BoC use case ships
+  a worked example of exactly this pattern —
+  [`PressReleaseStore`](../boc_rate_decisions/press_releases.py) — if you want
+  a reference before building the food-CPI equivalent. For the canonical
+  July-origin CFPR backtest only the month/year matters.
+- **Context-cost estimate:** `extract_reports.py` prints per-document and total
+  char/token counts (token estimate ≈ chars/4, model-agnostic) so you can gauge
+  the cost of putting one — or several — reports into a prompt.
+- **Report-grounded LLMP (now wired):** `build_food_cpi_service(reports_dir=...)`
+  builds a cutoff-aware `DocumentStore`, and `02_food_cpi_experiment.ipynb`
+  passes `report_sources=["cfpr"]` to the LLM-P predictors so the extracted
+  reports enter the prompt filtered by `publication_date <= as_of`. Measure the
+  lift over the quantitative-only baseline. **Still a good extension:**
+  generalize the same `--source`-keyed fetcher + `extract_document` to Bank of
+  Canada Monetary Policy Reports, mirroring BoC's `PressReleaseStore` pattern
+  for additional document families.
+
+---
+
+## Notebooks
+
+| Notebook | Purpose |
+|----------|---------|
+| `01_food_data_exploration.ipynb` | Short warm-up tour: register the 9 series, small-multiples history, YoY overlay, coverage table. |
+| `02_food_cpi_experiment.ipynb`   | **Main experiment.** Selectable via `EXPERIMENT_CONFIG` (`"full"` / `"mini_recent"` / `"mini_single"`). Runs cached backtests for two baselines (`LastValuePredictor`, `DartsAutoARIMAPredictor`) and two LLMPs. Plots trajectory fans, avg/avg YoY grid, and CRPS/MAPE leaderboards. |
+| `99_starter_agent.ipynb`         | **Your starter agent.** This use case's first agent — a fresh, hackable food-CPI forecaster (toggleable news search + code execution, two lightweight tool-usage skills). Interactive (Track 2) cell, one scored trajectory (Track 1), and a "make it yours" guide pointing at the bigger projects (all 9 series, report-grounded context). |
+
+---
+
+## Key design decisions
+
+- **All 9 categories at once.** The backtest targets the full CFPR task, not a
+  single category, so the notebook produces the exact headline table the CFPR
+  publishes.  Caching keeps re-runs cheap during development.
+- **Trajectory horizons (6-17) replace a single outermost horizon.** Required
+  for the avg/avg YoY metric, and a natural fit for any predictor — ARIMA
+  returns all twelve steps in one call, a naive baseline repeats its last
+  value, and an LLM can emit a full trajectory in a single structured output.
+- **YAML specs are the source of truth.** Notebook code never hard-codes task
+  definitions; everything comes from `specs/*.yaml`.
+- **CRPS is the primary metric.**  MAPE on the median is a secondary,
+  point-estimate sanity check.
+- **No ensemble model selection.** The leaderboard compares individual
+  predictors; assembling them into a committee is left as an exercise.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting____init__.py.md
new file mode 100644
index 0000000..a15d0e0
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting____init__.py.md
@@ -0,0 +1,20 @@
+# Source: implementations/food_price_forecasting/__init__.py
+
+kind: python
+
+```python
+"""Canada Food CPI experiment — helper modules and reference implementations.
+
+The experiment notebook (``02_food_cpi_experiment.ipynb``) is deliberately kept
+thin; most of the analytical and plotting code lives in the modules in this
+package:
+
+- :mod:`data` — data service setup; registers the 9 canonical food CPI series.
+- :mod:`analysis` — result-to-DataFrame flatteners, average-over-average YoY
+  computation, CRPS/MAPE leaderboards, rationale extraction.
+- :mod:`plots` — matplotlib figures (trajectory fans, avg/avg YoY grid,
+  CRPS/MAPE breakdowns).
+
+See ``README.md`` in this directory for the full experiment description.
+"""
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__analysis.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__analysis.py.md
new file mode 100644
index 0000000..da24a2e
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__analysis.py.md
@@ -0,0 +1,339 @@
+# Source: implementations/food_price_forecasting/analysis.py
+
+kind: python
+
+```python
+"""Analysis helpers for the Canada Food CPI experiment.
+
+These functions turn :class:`BacktestResult` / :class:`EvalResult` objects
+into tidy DataFrames and the CFPR-specific average-over-average YoY metric.
+They are kept separate from the notebook itself so they can be unit-tested
+and re-used across notebooks or agentic workflows.
+
+All functions are pure: they take results and a data service and return
+DataFrames.  They never fetch data from the network or mutate global state.
+"""
+
+from __future__ import annotations
+
+from datetime import datetime, timezone
+
+import numpy as np
+import pandas as pd
+from aieng.forecasting.data.service import DataService
+from aieng.forecasting.evaluation.backtest import BacktestResult
+from aieng.forecasting.evaluation.prediction import Prediction
+
+
+def predictions_to_dataframe(
+    results: dict[str, BacktestResult] | BacktestResult, predictor_id: str | None = None, task_id: str | None = None
+) -> pd.DataFrame:
+    """Flatten predictions + CRPS scores into a tidy DataFrame.
+
+    Parameters
+    ----------
+    results : dict[str, BacktestResult] | BacktestResult
+        Either a dict keyed by ``task_id`` (from :func:`multi_backtest`) or a
+        single :class:`BacktestResult` (from :func:`backtest`).
+    predictor_id : str or None
+        Override the ``predictor_id`` column.  Defaults to the id embedded in
+        each result.  Useful when plotting multiple predictors on one axis.
+    task_id : str or None
+        Override the ``task_id`` column when passing a single
+        :class:`BacktestResult`.  Defaults to ``results.spec.task.task_id``.
+
+    Returns
+    -------
+    pd.DataFrame
+        Columns: ``predictor_id``, ``task_id``, ``origin``, ``origin_year``,
+        ``horizon``, ``forecast_date``, ``median``, ``crps``.
+    """
+    rows: list[dict[str, object]] = []
+
+    if isinstance(results, BacktestResult):
+        _iter: list[tuple[str, BacktestResult]] = [(task_id or results.spec.task.task_id, results)]
+    else:
+        _iter = list(results.items())
+
+    for tid, result in _iter:
+        pid = predictor_id or result.predictor_id
+        for pred, score in zip(result.predictions, result.scores):
+            rows.append(_prediction_row(pred, score, pid, tid))
+
+    return pd.DataFrame(rows)
+
+
+def _prediction_row(pred: Prediction, score: float, pid: str, tid: str) -> dict[str, object]:
+    fd = pd.Timestamp(pred.forecast_date)
+    aof = pd.Timestamp(pred.as_of)
+    horizon_months = (fd.year - aof.year) * 12 + (fd.month - aof.month)
+    return {
+        "predictor_id": pid,
+        "task_id": tid,
+        "origin": aof,
+        "origin_year": aof.year,
+        "horizon": horizon_months,
+        "forecast_date": fd,
+        "median": pred.payload.point_forecast,
+        "crps": score,
+    }
+
+
+def compute_avgyoy(result: BacktestResult, actual_df: pd.DataFrame) -> pd.DataFrame:
+    """Compute per-origin average-over-average YoY CPI change.
+
+    For each forecast origin in ``result``:
+
+    * let ``Y = origin.year`` and ``Y1 = Y + 1``;
+    * let ``actual_avg_Y`` be the mean observed value for year Y (requires a
+      complete Jan-Dec of Y in ``actual_df``);
+    * let ``predicted_avg_Y1`` be the mean of the predictions for Jan-Dec Y1
+      (at each quantile);
+    * return ``predicted_avg_Y1 / actual_avg_Y - 1`` at the quantiles
+      {0.05, 0.20, 0.50, 0.80, 0.95}, plus the realised ``actual_yoy`` where
+      year Y1 is also complete (NaN otherwise).
+
+    Parameters
+    ----------
+    result : BacktestResult
+        Must contain predictions covering the Jan-Dec window of the year
+        following each origin.  Typical shape: trajectory horizons
+        ``range(6, 18)`` from July origins.
+    actual_df : pd.DataFrame
+        Full observed series with ``timestamp`` and ``value`` columns (the
+        form returned by :meth:`DataService.get_series`).
+
+    Returns
+    -------
+    pd.DataFrame
+        One row per origin with columns: ``origin_year``, ``actual_avg_y0``,
+        ``predicted_avg_y1``, ``yoy_median``, ``yoy_q05``, ``yoy_q25``, ``yoy_q75``, ``yoy_q95``,
+        ``actual_yoy``.  Note: ``yoy_q25`` / ``yoy_q75`` hold the 0.20 / 0.80 quantiles.
+    """
+    actual_df = actual_df.copy()
+    actual_df["timestamp"] = pd.to_datetime(actual_df["timestamp"])
+    actual_df["year"] = actual_df["timestamp"].dt.year
+
+    origins = sorted({p.as_of for p in result.predictions})
+    rows: list[dict[str, float]] = []
+
+    for origin in origins:
+        origin_ts = pd.Timestamp(origin)
+        origin_year = origin_ts.year
+        next_year = origin_year + 1
+
+        y0_vals = actual_df[actual_df["year"] == origin_year]["value"]
+        if len(y0_vals) < 12:
+            continue
+        actual_avg_y0 = float(y0_vals.mean())
+
+        traj_preds = [
+            p for p in result.predictions if p.as_of == origin and pd.Timestamp(p.forecast_date).year == next_year
+        ]
+        if not traj_preds:
+            continue
+
+        medians = np.array([p.payload.point_forecast for p in traj_preds], dtype=float)
+        predicted_avg_y1_median = float(medians.mean())
+
+        def _avg_yoy_at_q(q: float, preds: list[Prediction] = traj_preds, avg_y0: float = actual_avg_y0) -> float:
+            qs = np.array(
+                [p.payload.quantiles.get(q, p.payload.point_forecast) for p in preds],
+                dtype=float,
+            )
+            return float(qs.mean() / avg_y0 - 1)
+
+        y1_vals = actual_df[actual_df["year"] == next_year]["value"]
+        actual_yoy = float(y1_vals.mean() / actual_avg_y0 - 1) if len(y1_vals) == 12 else float("nan")
+
+        rows.append(
+            {
+                "origin_year": origin_year,
+                "actual_avg_y0": actual_avg_y0,
+                "predicted_avg_y1": predicted_avg_y1_median,
+                "yoy_median": predicted_avg_y1_median / actual_avg_y0 - 1,
+                "yoy_q05": _avg_yoy_at_q(0.05),
+                "yoy_q25": _avg_yoy_at_q(0.20),
+                "yoy_q75": _avg_yoy_at_q(0.80),
+                "yoy_q95": _avg_yoy_at_q(0.95),
+                "actual_yoy": actual_yoy,
+            }
+        )
+
+    return pd.DataFrame(rows)
+
+
+def summarize_crps(results_by_predictor: dict[str, dict[str, BacktestResult]]) -> pd.DataFrame:
+    """Return a leaderboard of mean CRPS per (predictor, task) pair.
+
+    Parameters
+    ----------
+    results_by_predictor : dict[str, dict[str, BacktestResult]]
+        Nested mapping: ``predictor_id -> {task_id -> BacktestResult}``.  This
+        is the natural output of running :func:`multi_backtest` or
+        :func:`cached_multi_backtest` once per predictor.
+
+    Returns
+    -------
+    pd.DataFrame
+        Pivoted table indexed by ``task_id`` with one column per predictor,
+        plus a ``MEAN`` row of the column means.
+    """
+    rows: list[dict[str, object]] = []
+    for pid, task_results in results_by_predictor.items():
+        for tid, result in task_results.items():
+            rows.append({"predictor_id": pid, "task_id": tid, "mean_crps": result.mean_score})
+    if not rows:
+        return pd.DataFrame()
+    df = pd.DataFrame(rows).pivot(index="task_id", columns="predictor_id", values="mean_crps").round(4)
+    df.loc["MEAN"] = df.mean()
+    return df
+
+
+def compute_ape_long(
+    results_by_predictor: dict[str, dict[str, BacktestResult]], data_service: DataService
+) -> pd.DataFrame:
+    """Return per-prediction absolute percentage error in long format.
+
+    Unlike :func:`compute_mape`, which collapses to a per-(predictor, task)
+    mean, this function returns one row per (predictor, task, origin, horizon)
+    so that callers can draw box plots or other distributional summaries.
+    Predictions whose ``forecast_date`` has no observed value are silently
+    dropped.
+
+    Parameters
+    ----------
+    results_by_predictor : dict[str, dict[str, BacktestResult]]
+        Nested mapping: ``predictor_id -> {task_id -> BacktestResult}``.
+    data_service : DataService
+        Data service used to fetch observed values.  Uses ``as_of=utcnow()``
+        so all available data is visible.
+
+    Returns
+    -------
+    pd.DataFrame
+        Columns: ``predictor_id``, ``task_id``, ``origin``, ``origin_year``,
+        ``horizon``, ``forecast_date``, ``ape``.
+    """
+    as_of = datetime.now(tz=timezone.utc).replace(tzinfo=None)
+    rows: list[dict[str, object]] = []
+
+    for pid, task_results in results_by_predictor.items():
+        for tid, result in task_results.items():
+            actual_df = data_service.get_series(result.spec.task.target_series_id, as_of=as_of)
+            actual_long = actual_df.assign(forecast_date=pd.to_datetime(actual_df["timestamp"])).rename(
+                columns={"value": "actual"}
+            )[["forecast_date", "actual"]]
+            preds_df = predictions_to_dataframe(result, predictor_id=pid, task_id=tid)
+            merged = preds_df.merge(actual_long, on="forecast_date", how="inner")
+            for _, row in merged.iterrows():
+                rows.append(
+                    {
+                        "predictor_id": pid,
+                        "task_id": tid,
+                        "origin": row["origin"],
+                        "origin_year": row["origin_year"],
+                        "horizon": row["horizon"],
+                        "forecast_date": row["forecast_date"],
+                        "ape": abs(row["median"] - row["actual"]) / abs(row["actual"]) * 100,
+                    }
+                )
+
+    return pd.DataFrame(rows)
+
+
+def compute_mape(results_by_predictor: dict[str, dict[str, BacktestResult]], data_service: DataService) -> pd.DataFrame:
+    """Return mean absolute percentage error per (predictor, task).
+
+    MAPE is computed against the observed value at each prediction's
+    ``forecast_date``; predictions that do not yet have an observed value
+    (e.g. future horizons from the most recent origin) are silently dropped.
+
+    Parameters
+    ----------
+    results_by_predictor : dict[str, dict[str, BacktestResult]]
+        Nested mapping: ``predictor_id -> {task_id -> BacktestResult}``.
+    data_service : DataService
+        Data service used to fetch observed values.  Uses ``as_of=utcnow()``
+        so all available data is visible.
+
+    Returns
+    -------
+    pd.DataFrame
+        Indexed by ``task_id`` with one column per predictor (mean APE in %).
+    """
+    as_of = datetime.now(tz=timezone.utc).replace(tzinfo=None)
+    rows: list[dict[str, object]] = []
+
+    for pid, task_results in results_by_predictor.items():
+        for tid, result in task_results.items():
+            actual_df = data_service.get_series(result.spec.task.target_series_id, as_of=as_of)
+            actual_long = actual_df.assign(forecast_date=pd.to_datetime(actual_df["timestamp"])).rename(
+                columns={"value": "actual"}
+            )[["forecast_date", "actual"]]
+            preds_df = predictions_to_dataframe(result, predictor_id=pid, task_id=tid)
+            merged = preds_df.merge(actual_long, on="forecast_date", how="inner")
+            if merged.empty:
+                continue
+            merged["ape"] = (merged["median"] - merged["actual"]).abs() / merged["actual"].abs() * 100
+            rows.append({"predictor_id": pid, "task_id": tid, "mape": float(merged["ape"].mean())})
+
+    if not rows:
+        return pd.DataFrame()
+    return pd.DataFrame(rows).pivot(index="task_id", columns="predictor_id", values="mape").round(3)
+
+
+def rationales_table(result: BacktestResult) -> pd.DataFrame:
+    """Extract per-prediction metadata into a DataFrame.
+
+    For classical statistical predictors the ``metadata`` dict is typically
+    empty; for LLM and agentic predictors it is the natural place to surface
+    a reasoning trace or any side-channel data.  This helper gives a uniform
+    way to inspect whatever is there.
+
+    Parameters
+    ----------
+    result : BacktestResult
+        Result to introspect.
+
+    Returns
+    -------
+    pd.DataFrame
+        Columns: ``predictor_id``, ``task_id``, ``origin``, ``horizon``,
+        ``forecast_date``, plus one ``meta_<key>`` column per distinct metadata
+        key seen across all predictions (missing values filled with ``None``).
+    """
+    base_rows: list[dict[str, object]] = []
+    all_keys: set[str] = set()
+    for pred in result.predictions:
+        fd = pd.Timestamp(pred.forecast_date)
+        aof = pd.Timestamp(pred.as_of)
+        row: dict[str, object] = {
+            "predictor_id": pred.predictor_id,
+            "task_id": pred.task_id,
+            "origin": aof,
+            "horizon": (fd.year - aof.year) * 12 + (fd.month - aof.month),
+            "forecast_date": fd,
+        }
+        for k, v in pred.metadata.items():
+            row[f"meta_{k}"] = v
+            all_keys.add(f"meta_{k}")
+        base_rows.append(row)
+
+    # Fill missing keys with None for consistent columns.
+    for row in base_rows:
+        for k in all_keys:
+            row.setdefault(k, None)
+
+    return pd.DataFrame(base_rows)
+
+
+__all__ = [
+    "compute_ape_long",
+    "compute_avgyoy",
+    "compute_mape",
+    "predictions_to_dataframe",
+    "rationales_table",
+    "summarize_crps",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__data.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__data.py.md
new file mode 100644
index 0000000..71e9998
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__data.py.md
@@ -0,0 +1,184 @@
+# Source: implementations/food_price_forecasting/data.py
+
+kind: python
+
+```python
+"""Data-service setup for the Canada Food CPI experiment.
+
+The CFPR canonical experiment uses a fixed set of 9 Canadian food CPI
+sub-indices from StatCan table 18-10-0004-11.  :data:`FOOD_CPI_SERIES` is the
+single source of truth for this list; both the reference YAML specs under
+``implementations/food_price_forecasting/specs/`` and the notebook/helpers here reference the
+same nine ``series_id`` values via this module.
+
+FRED macro covariates are *not* part of the canonical experiment — see
+``planning-docs/bootcamp-workplan.md`` for the deferred covariate-framing design work.
+Other experiments that want FRED covariates should register them via their
+own helpers.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+from aieng.forecasting.data import DataService, SeriesMetadata
+from aieng.forecasting.data.adapters.statcan import StatCanAdapter
+from aieng.forecasting.documents.store import DocumentStore
+
+
+STATCAN_TABLE = "18-10-0004-11"
+"""StatCan table 18-10-0004-11 — Consumer Price Index, monthly, not seasonally adjusted."""
+
+CFPR_REPORTS_SOURCE = "cfpr"
+"""Document-source key for Canada's Food Price Report editions."""
+
+DEFAULT_REPORTS_DIR = Path("data/reports/cfpr")
+"""Default cache directory for extracted CFPR report artifacts (.json/.md/.pdf)."""
+
+
+# (series_id, product_group_label, description, units)
+# The product_group_label MUST match StatCan's "Products and product groups"
+# column exactly; any mismatch will produce an empty DataFrame at fetch time.
+FOOD_CPI_SERIES: list[tuple[str, str, str, str]] = [
+    (
+        "cpi_food_canada",
+        "Food",
+        "CPI Food (overall), Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_bakery_cereal_canada",
+        "Bakery and cereal products (excluding baby food)",
+        "CPI Bakery and cereal products (excl. baby food), Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_dairy_eggs_canada",
+        "Dairy products and eggs",
+        "CPI Dairy products and eggs, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_fish_seafood_canada",
+        "Fish, seafood and other marine products",
+        "CPI Fish, seafood and other marine products, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_restaurants_canada",
+        "Food purchased from restaurants",
+        "CPI Food purchased from restaurants, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_fruit_preparations_nuts_canada",
+        "Fruit, fruit preparations and nuts",
+        "CPI Fruit, fruit preparations and nuts, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_meat_canada",
+        "Meat",
+        "CPI Meat, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_other_food_nonalcoholic_canada",
+        "Other food products and non-alcoholic beverages",
+        "CPI Other food and non-alcoholic beverages, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_vegetables_preparations_canada",
+        "Vegetables and vegetable preparations",
+        "CPI Vegetables and vegetable preparations, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+]
+"""The 9 canonical Canadian food CPI series that the CFPR experiment evaluates."""
+
+
+CATEGORY_LABELS: dict[str, str] = {
+    "cpi_food_canada": "Food (overall)",
+    "cpi_bakery_cereal_canada": "Bakery & cereal",
+    "cpi_dairy_eggs_canada": "Dairy & eggs",
+    "cpi_fish_seafood_canada": "Fish & seafood",
+    "cpi_restaurants_canada": "Restaurants",
+    "cpi_fruit_preparations_nuts_canada": "Fruit & nuts",
+    "cpi_meat_canada": "Meat",
+    "cpi_other_food_nonalcoholic_canada": "Other food",
+    "cpi_vegetables_preparations_canada": "Vegetables",
+}
+"""Short display labels for plots and leaderboard tables."""
+
+
+DEFAULT_CACHE_DIR = Path("data/statcan")
+"""Default StatCan CSV cache directory (same default as ``StatCanAdapter``)."""
+
+
+def build_food_cpi_service(
+    cache_dir: Path | None = None,
+    reports_dir: Path | None = None,
+) -> DataService:
+    """Return a :class:`DataService` with all 9 food CPI series registered.
+
+    Each series gets its own :class:`StatCanAdapter` (StatCan's adapter is
+    single-series by design — it filters the shared table by GEO + product
+    group label).  Registration fetches the data, which on a warm cache is
+    effectively instant.
+
+    Parameters
+    ----------
+    cache_dir : Path or None
+        StatCan CSV cache directory.  Defaults to ``data/statcan`` at the
+        repo root, which is what ``scripts/fetch_cpi.py`` populates.
+    reports_dir : Path or None
+        When set, attach a :class:`DocumentStore` loading CFPR report artifacts
+        from this directory (the ``.json``/``.md`` files written by
+        ``scripts/extract_reports.py``), registered under the ``"cfpr"`` source.
+        This is what enables report-grounded LLMP recipes (``report_sources=
+        ["cfpr"]``) — predictors then see only editions published on or before
+        the forecast ``as_of`` date.  ``None`` leaves the service report-free,
+        unchanged from the canonical numeric-only experiment.
+
+    Returns
+    -------
+    DataService
+        A data service with 9 Canadian food CPI series registered, ready to
+        be handed to :func:`backtest` / :func:`multi_backtest` /
+        :func:`evaluate` / :func:`multi_evaluate`.
+    """
+    resolved_cache_dir: Path = cache_dir if cache_dir is not None else DEFAULT_CACHE_DIR
+    doc_store = DocumentStore({CFPR_REPORTS_SOURCE: reports_dir}) if reports_dir is not None else None
+    svc = DataService(doc_store=doc_store)
+    for series_id, product_group, description, units in FOOD_CPI_SERIES:
+        adapter = StatCanAdapter(
+            table_id=STATCAN_TABLE,
+            member_filter={"GEO": "Canada", "Products and product groups": product_group},
+            cache_dir=resolved_cache_dir,
+        )
+        svc.register(
+            series_id,
+            adapter,
+            SeriesMetadata(
+                series_id=series_id,
+                description=description,
+                source="Statistics Canada",
+                units=units,
+                frequency="MS",
+                table_id=STATCAN_TABLE,
+            ),
+        )
+    return svc
+
+
+__all__ = [
+    "CATEGORY_LABELS",
+    "CFPR_REPORTS_SOURCE",
+    "DEFAULT_CACHE_DIR",
+    "DEFAULT_REPORTS_DIR",
+    "FOOD_CPI_SERIES",
+    "STATCAN_TABLE",
+    "build_food_cpi_service",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__plots.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__plots.py.md
new file mode 100644
index 0000000..65da1da
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__plots.py.md
@@ -0,0 +1,514 @@
+# Source: implementations/food_price_forecasting/plots.py
+
+kind: python
+
+```python
+"""Plotting helpers for the Canada Food CPI experiment.
+
+These helpers keep the notebook focused on narrative by centralising the
+matplotlib boilerplate for the CFPR-style figures.  All plots use matplotlib
+directly (no seaborn / plotly) to minimise dependencies.
+
+Return convention: each helper returns the ``(fig, axes)`` pair it created
+so the caller can further customise or save the figure.
+"""
+
+from __future__ import annotations
+
+from typing import Any
+
+import matplotlib.pyplot as plt
+import numpy as np
+import pandas as pd
+from aieng.forecasting.data.service import DataService
+from aieng.forecasting.evaluation.backtest import BacktestResult
+from matplotlib.axes import Axes
+from matplotlib.figure import Figure
+from matplotlib.patches import Patch
+
+from .data import CATEGORY_LABELS, FOOD_CPI_SERIES
+
+
+DEFAULT_PREDICTOR_PALETTE: list[str] = ["#7f7f7f", "#1f77b4", "#2ca02c", "#d62728", "#9467bd"]
+"""Default colour palette for up to five predictors (grey, blue, green, red, purple)."""
+
+
+def _resolve_colors(predictors: list[str], colors: dict[str, str] | None) -> dict[str, str]:
+    """Return a ``predictor_id -> colour`` map that covers every predictor.
+
+    Any explicit entries in ``colors`` are preserved; missing predictors get
+    filled in from :data:`DEFAULT_PREDICTOR_PALETTE` so callers don't have to
+    line up their keys with the exact predictor_id strings.
+    """
+    resolved: dict[str, str] = dict(colors or {})
+    next_idx = 0
+    for pid in predictors:
+        if pid in resolved:
+            continue
+        resolved[pid] = DEFAULT_PREDICTOR_PALETTE[next_idx % len(DEFAULT_PREDICTOR_PALETTE)]
+        next_idx += 1
+    return resolved
+
+
+def _resolve_labels(predictors: list[str], labels: dict[str, str] | None) -> dict[str, str]:
+    """Return a ``predictor_id -> display label`` map for plot legends and axes."""
+    return {pid: (labels or {}).get(pid, pid) for pid in predictors}
+
+
+# ---------------------------------------------------------------------------
+# Trajectory fan chart (median + 60% + 90% intervals) for recent origins
+# ---------------------------------------------------------------------------
+
+
+def plot_trajectory_fan(
+    results_by_predictor: dict[str, dict[str, BacktestResult]],
+    task_id: str,
+    category_id: str,
+    data_service: DataService,
+    n_recent: int = 3,
+    colors: dict[str, str] | None = None,
+    labels: dict[str, str] | None = None,
+) -> tuple[Figure, list[Axes]]:
+    """Draw median + 60%/90% interval trajectories for the ``n_recent`` most recent origins.
+
+    For each origin one subplot shows:
+    * the last 24 months of observed history up to the origin (solid black);
+    * the observed values over the forecast window (dashed black) where available;
+    * one fan per predictor (median + 60% [q20-q80] and 90% [q05-q95] bands) over that window.
+
+    Parameters
+    ----------
+    results_by_predictor : dict[str, dict[str, BacktestResult]]
+        ``predictor_id -> {task_id -> BacktestResult}`` mapping.
+    task_id : str
+        Task identifier whose predictions to plot.
+    category_id : str
+        Underlying series id, used to fetch the observed series for context.
+    data_service : DataService
+        Data service to query for the observed series.
+    n_recent : int
+        How many most-recent origins to plot (one subplot each).
+    colors : dict[str, str] or None
+        Optional predictor_id -> matplotlib colour mapping.
+    labels : dict[str, str] or None
+        Optional predictor_id -> short display label for the legend.
+
+    Returns
+    -------
+    (Figure, list[Axes])
+        The created figure and its axes list.
+    """
+    predictor_ids = list(results_by_predictor.keys())
+    color_map = _resolve_colors(predictor_ids, colors)
+    label_map = _resolve_labels(predictor_ids, labels)
+
+    sample_result = next(iter(results_by_predictor.values()))[task_id]
+    origins = sorted({p.as_of for p in sample_result.predictions})
+    recent_origins = origins[-n_recent:]
+
+    fig, axes_obj = plt.subplots(len(recent_origins), 1, figsize=(13, 3.8 * len(recent_origins)), sharex=False)
+    axes: list[Axes] = [axes_obj] if len(recent_origins) == 1 else list(axes_obj)
+
+    as_of = pd.Timestamp.now(tz="UTC").tz_localize(None).to_pydatetime()
+    actual_df = data_service.get_series(category_id, as_of=as_of)
+    actual_df = actual_df.assign(timestamp=pd.to_datetime(actual_df["timestamp"]))
+
+    for ax, origin in zip(axes, recent_origins):
+        origin_ts = pd.Timestamp(origin)
+        hist_start = origin_ts - pd.DateOffset(months=24)
+        hist = actual_df[(actual_df["timestamp"] >= hist_start) & (actual_df["timestamp"] <= origin_ts)]
+        ax.plot(hist["timestamp"], hist["value"], color="k", linewidth=1.8, label="Observed", zorder=5)
+
+        max_horizon = 0
+        for result in (r[task_id] for r in results_by_predictor.values()):
+            for p in result.predictions:
+                if p.as_of == origin:
+                    fd = pd.Timestamp(p.forecast_date)
+                    max_horizon = max(max_horizon, (fd.year - origin_ts.year) * 12 + (fd.month - origin_ts.month))
+
+        traj_end = origin_ts + pd.DateOffset(months=max_horizon + 1)
+        fut_actual = actual_df[(actual_df["timestamp"] > origin_ts) & (actual_df["timestamp"] <= traj_end)]
+        ax.plot(
+            fut_actual["timestamp"],
+            fut_actual["value"],
+            color="k",
+            linewidth=1.8,
+            linestyle="--",
+            alpha=0.6,
+            zorder=4,
+        )
+
+        for pid, task_results in results_by_predictor.items():
+            result = task_results[task_id]
+            color = color_map[pid]
+            preds = sorted(
+                (p for p in result.predictions if p.as_of == origin),
+                key=lambda p: p.forecast_date,
+            )
+            if not preds:
+                continue
+            dates = np.array([pd.Timestamp(p.forecast_date) for p in preds])
+            medians = np.array([p.payload.point_forecast for p in preds], dtype=float)
+            q05 = np.array([p.payload.quantiles[0.05] for p in preds], dtype=float)
+            q20 = np.array([p.payload.quantiles[0.20] for p in preds], dtype=float)
+            q80 = np.array([p.payload.quantiles[0.80] for p in preds], dtype=float)
+            q95 = np.array([p.payload.quantiles[0.95] for p in preds], dtype=float)
+            ax.fill_between(dates, q05, q95, alpha=0.12, color=color)
+            ax.fill_between(dates, q20, q80, alpha=0.22, color=color)
+            ax.plot(dates, medians, color=color, linewidth=1.6, label=label_map[pid])
+
+        ax.axvline(origin_ts, color="navy", linewidth=1.2, linestyle=":", alpha=0.7)
+        ax.set_title(f"Origin: {origin_ts.date()}  (-> forecast Y+1 = {origin_ts.year + 1})", fontsize=10)
+        ax.set_ylabel("CPI (2002=100)", fontsize=9)
+        ax.grid(axis="y", alpha=0.3)
+        ax.legend(fontsize=8, loc="upper left")
+
+    label = CATEGORY_LABELS.get(category_id, category_id)
+    fig.suptitle(
+        f"{label} ({category_id}) — forecast trajectories, {len(recent_origins)} most recent origins",
+        fontsize=11,
+        y=1.01,
+    )
+    fig.tight_layout()
+    return fig, axes
+
+
+# ---------------------------------------------------------------------------
+# Avg/avg YoY grid across categories
+# ---------------------------------------------------------------------------
+
+
+def plot_avgyoy_grid(
+    yoy_by_predictor_by_task: dict[str, dict[str, pd.DataFrame]],
+    task_to_category: dict[str, str],
+    colors: dict[str, str] | None = None,
+    ncols: int = 3,
+    labels: dict[str, str] | None = None,
+) -> tuple[Figure, np.ndarray]:
+    """Plot a grid of avg/avg YoY fan charts, one panel per category.
+
+    Parameters
+    ----------
+    yoy_by_predictor_by_task : dict[str, dict[str, pd.DataFrame]]
+        Nested mapping ``predictor_id -> {task_id -> avg-yoy DataFrame}``
+        where each DataFrame comes from :func:`compute_avgyoy`.
+    task_to_category : dict[str, str]
+        Mapping from ``task_id`` (as used in the results) to the underlying
+        ``series_id``.  The series_id is used to look up a display label.
+    colors : dict[str, str] or None
+        Optional predictor_id -> matplotlib colour mapping.
+    ncols : int
+        Number of columns in the subplot grid (default 3).
+    labels : dict[str, str] or None
+        Optional predictor_id -> short display label for the legend.
+
+    Returns
+    -------
+    (Figure, np.ndarray)
+        Figure and the 2-D ``(nrows, ncols)`` array of axes.
+    """
+    predictor_ids = list(yoy_by_predictor_by_task.keys())
+    color_map = _resolve_colors(predictor_ids, colors)
+    label_map = _resolve_labels(predictor_ids, labels)
+
+    task_ids = list(task_to_category.keys())
+    n = len(task_ids)
+    nrows = (n + ncols - 1) // ncols
+
+    fig, axes = plt.subplots(nrows, ncols, figsize=(16 * ncols // 3, 10 * nrows // 3), sharey=False, squeeze=False)
+    axes_flat = axes.flatten()
+
+    for ax, task_id in zip(axes_flat, task_ids):
+        series_id = task_to_category[task_id]
+        label = CATEGORY_LABELS.get(series_id, series_id)
+
+        df_any = next(
+            (
+                df
+                for pid_dict in yoy_by_predictor_by_task.values()
+                for (tid, df) in pid_dict.items()
+                if tid == task_id and not df.empty
+            ),
+            None,
+        )
+        if df_any is None:
+            ax.set_title(f"{label} (no data)", fontsize=10)
+            ax.axis("off")
+            continue
+
+        years = df_any["origin_year"] + 1
+        ax.plot(
+            years,
+            df_any["actual_yoy"] * 100,
+            color="k",
+            linewidth=1.8,
+            marker="o",
+            markersize=4,
+            label="Actual",
+            zorder=5,
+        )
+
+        for pid in predictor_ids:
+            df = yoy_by_predictor_by_task[pid].get(task_id)
+            if df is None or df.empty:
+                continue
+            color = color_map[pid]
+            yrs = df["origin_year"] + 1
+            ax.fill_between(yrs, df["yoy_q05"] * 100, df["yoy_q95"] * 100, alpha=0.10, color=color)
+            ax.fill_between(yrs, df["yoy_q25"] * 100, df["yoy_q75"] * 100, alpha=0.20, color=color)
+            ax.plot(
+                yrs, df["yoy_median"] * 100, color=color, linewidth=1.3, marker="^", markersize=4, label=label_map[pid]
+            )
+
+        ax.axhline(0, color="#aaa", linewidth=0.8, linestyle="--")
+        ax.set_title(label, fontsize=10)
+        ax.set_ylabel("avg/avg YoY (%)", fontsize=8)
+        ax.grid(axis="y", alpha=0.3)
+        ax.tick_params(labelsize=8)
+
+    # Legend on the first axis only (identical across panels).
+    if task_ids:
+        axes_flat[0].legend(fontsize=7, loc="best")
+
+    # Hide any unused panels.
+    for ax in axes_flat[len(task_ids) :]:
+        ax.axis("off")
+
+    fig.suptitle(f"Avg/avg YoY predictions vs actuals — {n} categor{'y' if n == 1 else 'ies'}", fontsize=12)
+    fig.tight_layout()
+    return fig, axes
+
+
+# ---------------------------------------------------------------------------
+# CRPS disaggregation
+# ---------------------------------------------------------------------------
+
+
+def plot_crps_disaggregated(
+    predictions_df: pd.DataFrame,
+    by: str = "origin_year",
+    colors: dict[str, str] | None = None,
+    labels: dict[str, str] | None = None,
+) -> tuple[Figure, Axes]:
+    """Plot mean CRPS per predictor disaggregated by origin-year or horizon.
+
+    Parameters
+    ----------
+    predictions_df : pd.DataFrame
+        Tidy predictions DataFrame of the shape returned by
+        :func:`predictions_to_dataframe`.  Must have ``predictor_id``,
+        ``crps``, and either ``origin_year`` or ``horizon`` columns.
+    by : str
+        Grouping column.  Must be ``"origin_year"`` or ``"horizon"``.
+    colors, labels : dict[str, str] or None
+        Optional predictor_id -> matplotlib colour / short legend label mappings.
+
+    Returns
+    -------
+    (Figure, Axes)
+    """
+    if by not in {"origin_year", "horizon"}:
+        raise ValueError(f"by must be 'origin_year' or 'horizon', got {by!r}")
+
+    predictor_ids = sorted(predictions_df["predictor_id"].unique())
+    color_map = _resolve_colors(predictor_ids, colors)
+    label_map = _resolve_labels(predictor_ids, labels)
+
+    pivot = predictions_df.groupby(["predictor_id", by])["crps"].mean().unstack(0)
+
+    fig, ax = plt.subplots(figsize=(9, 4.5))
+    for pid in predictor_ids:
+        if pid in pivot.columns:
+            ax.plot(
+                pivot.index,
+                pivot[pid],
+                color=color_map[pid],
+                linewidth=1.5,
+                marker="o",
+                markersize=5,
+                label=label_map[pid],
+            )
+    ax.set_xlabel(by.replace("_", " ").title(), fontsize=10)
+    ax.set_ylabel("Mean CRPS (lower is better)", fontsize=10)
+    ax.set_title(f"CRPS disaggregated by {by.replace('_', ' ')}", fontsize=10)
+    ax.legend(fontsize=9)
+    ax.grid(axis="y", alpha=0.3)
+    fig.tight_layout()
+    return fig, ax
+
+
+# ---------------------------------------------------------------------------
+# MAPE distribution box plot
+# ---------------------------------------------------------------------------
+
+
+def plot_mape_distribution(
+    mape_df: pd.DataFrame, colors: dict[str, str] | None = None, labels: dict[str, str] | None = None
+) -> tuple[Figure, Axes]:
+    """Box plot of per-task mean-APE distribution, one box per predictor.
+
+    Parameters
+    ----------
+    mape_df : pd.DataFrame
+        Wide-format DataFrame indexed by ``task_id`` with one column per
+        predictor (as returned by :func:`compute_mape`).
+    colors : dict[str, str] or None
+        Optional predictor_id -> colour mapping.
+    labels : dict[str, str] or None
+        Optional predictor_id -> short display label for the x-axis.
+
+    Returns
+    -------
+    (Figure, Axes)
+    """
+    predictor_ids = list(mape_df.columns)
+    color_map = _resolve_colors(predictor_ids, colors)
+    label_map = _resolve_labels(predictor_ids, labels)
+    tick_labels = [label_map[pid] for pid in predictor_ids]
+
+    fig, ax = plt.subplots(figsize=(9, 4.5))
+    data: list[Any] = [mape_df[pid].dropna().values for pid in predictor_ids]
+    bp = ax.boxplot(data, patch_artist=True, tick_labels=tick_labels)
+    for patch, pid in zip(bp["boxes"], predictor_ids):
+        patch.set_facecolor(color_map[pid])
+        patch.set_alpha(0.6)
+    ax.set_ylabel("MAPE per task (%)", fontsize=10)
+    ax.set_title("Distribution of per-task MAPE across predictors", fontsize=10)
+    ax.tick_params(axis="x", labelrotation=15, labelsize=8)
+    ax.grid(axis="y", alpha=0.3)
+    fig.tight_layout()
+    return fig, ax
+
+
+# ---------------------------------------------------------------------------
+# MAPE per-category small-multiples box plot
+# ---------------------------------------------------------------------------
+
+
+def plot_mape_by_category(
+    ape_long_df: pd.DataFrame,
+    task_to_category: dict[str, str],
+    colors: dict[str, str] | None = None,
+    labels: dict[str, str] | None = None,
+) -> tuple[Figure, np.ndarray]:
+    """Small-multiples box plot of raw per-prediction APE, one panel per category.
+
+    Each panel shows the distribution of absolute percentage error across all
+    (origin, horizon) prediction pairs for that category, with one box per
+    predictor.  This gives a richer picture than the single-number MAPE table
+    and makes it easy to see which predictor is consistently tighter within
+    each sub-index.
+
+    Parameters
+    ----------
+    ape_long_df : pd.DataFrame
+        Long-format APE DataFrame returned by :func:`compute_ape_long`.
+        Must have ``predictor_id``, ``task_id``, and ``ape`` columns.
+    task_to_category : dict[str, str]
+        Mapping from ``task_id`` to the underlying ``series_id``, used to
+        look up display labels.
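+    colors : dict[str, str] or None
+        Optional predictor_id -> matplotlib colour mapping.
+    labels : dict[str, str] or None
+        Optional predictor_id -> short display label.  When provided, the
+        per-panel x tick labels are hidden and a single shared figure legend
+        is drawn instead.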
+
+    Returns
+    -------
+    (Figure, np.ndarray)
+        Figure and the 2-D ``(nrows, ncols)`` array of axes; unused panels are hidden.
+    """
+    task_ids = list(task_to_category.keys())
+    predictor_ids = sorted(ape_long_df["predictor_id"].unique())
+    color_map = _resolve_colors(predictor_ids, colors)
+    label_map = _resolve_labels(predictor_ids, labels)
+    use_shared_legend = labels is not None
+
+    n = len(task_ids)
+    ncols = 3
+    nrows = (n + ncols - 1) // ncols
+    fig, axes = plt.subplots(nrows, ncols, figsize=(5 * ncols, 4 * nrows), sharey=False, squeeze=False)
+    axes_flat: list[Axes] = list(axes.flatten())
+
+    for ax, task_id in zip(axes_flat, task_ids):
+        series_id = task_to_category[task_id]
+        label = CATEGORY_LABELS.get(series_id, series_id)
+
+        task_df = ape_long_df[ape_long_df["task_id"] == task_id]
+        data: list[Any] = [task_df[task_df["predictor_id"] == pid]["ape"].dropna().values for pid in predictor_ids]
+
+        tick_labels = [""] * len(predictor_ids) if use_shared_legend else [label_map[pid] for pid in predictor_ids]
+        bp = ax.boxplot(data, patch_artist=True, tick_labels=tick_labels)
+        for patch, pid in zip(bp["boxes"], predictor_ids):
+            patch.set_facecolor(color_map[pid])
+            patch.set_alpha(0.6)
+
+        ax.set_title(label, fontsize=10)
+        ax.set_ylabel("APE (%)", fontsize=8)
+        if not use_shared_legend:
+            ax.tick_params(axis="x", labelrotation=20, labelsize=7)
+        else:
+            ax.tick_params(axis="x", labelbottom=False)
+        ax.grid(axis="y", alpha=0.3)
+
+    for ax in axes_flat[n:]:
+        ax.axis("off")
+
+    fig.suptitle("Per-prediction APE distribution by category", fontsize=12)
+    if use_shared_legend:
+        legend_handles = [Patch(facecolor=color_map[pid], alpha=0.6, label=label_map[pid]) for pid in predictor_ids]
+        fig.legend(
+            handles=legend_handles,
+            loc="lower center",
+            ncol=min(len(predictor_ids), 4),
+            fontsize=9,
+            frameon=False,
+        )
+        fig.tight_layout(rect=[0, 0.06, 1, 0.96])
+    else:
+        fig.tight_layout()
+    return fig, axes
+
+
+# ---------------------------------------------------------------------------
+# Exploration plot — overall food CPI small multiples
+# ---------------------------------------------------------------------------
+
+
+def plot_food_cpi_small_multiples(data_service: DataService, ncols: int = 3) -> tuple[Figure, np.ndarray]:
+    """Small-multiples overview of all food CPI categories defined in :data:`FOOD_CPI_SERIES`.
+
+    Each subplot shows the full history of one category, with the y-axis
+    free-scaled.  Useful as the notebook's single exploration figure.
+    """
+    as_of = pd.Timestamp.now(tz="UTC").tz_localize(None).to_pydatetime()
+
+    n = len(FOOD_CPI_SERIES)
+    nrows = (n + ncols - 1) // ncols
+
+    fig, axes = plt.subplots(nrows, ncols, figsize=(5 * ncols, 3 * nrows), sharex=True, squeeze=False)
+    axes_flat = axes.flatten()
+
+    for ax, (series_id, _, _desc, _units) in zip(axes_flat, FOOD_CPI_SERIES):
+        df = data_service.get_series(series_id, as_of=as_of)
+        df["timestamp"] = pd.to_datetime(df["timestamp"])
+        ax.plot(df["timestamp"], df["value"], color="steelblue", linewidth=1.2)
+        ax.set_title(CATEGORY_LABELS.get(series_id, series_id), fontsize=10)
+        ax.grid(axis="y", alpha=0.3)
+        ax.tick_params(labelsize=8)
+
+    for ax in axes_flat[n:]:
+        ax.axis("off")
+
+    fig.suptitle(f"Canada food CPI — {n} category sub-indices (index, 2002=100)", fontsize=12)
+    fig.tight_layout()
+    return fig, axes
+
+
+__all__ = [
+    "DEFAULT_PREDICTOR_PALETTE",
+    "plot_avgyoy_grid",
+    "plot_crps_disaggregated",
+    "plot_food_cpi_small_multiples",
+    "plot_mape_by_category",
+    "plot_mape_distribution",
+    "plot_trajectory_fan",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__predictors____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__predictors____init__.py.md
new file mode 100644
index 0000000..587a863
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__predictors____init__.py.md
@@ -0,0 +1,25 @@
+# Source: implementations/food_price_forecasting/predictors/__init__.py
+
+kind: python
+
+```python
+"""Tuned predictor recipes for the Canada Food CPI experiment.
+
+Each module here builds a fully-configured predictor instance for the food
+CPI use case. Recipes pair a task-agnostic predictor from
+:mod:`aieng.forecasting.methods` with use-case-specific configuration:
+prompt overrides, history windows, sampling budgets, and a
+:attr:`~aieng.forecasting.methods.llm_processes.base.LLMPredictorConfig.variant_tag`
+that keeps cached artifacts distinct from ad-hoc bare-config runs.
+
+See the package README and ``planning-docs/bootcamp-workplan.md`` for the
+separation between task-agnostic methods (in ``aieng-forecasting``) and
+use-case recipes (here).
+"""
+
+from .llmp_quantile_grid import build_llmp_quantile_grid
+from .llmp_sampled_trajectory import build_llmp_sampled_trajectory
+
+
+__all__ = ["build_llmp_quantile_grid", "build_llmp_sampled_trajectory"]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__predictors__llmp_quantile_grid.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__predictors__llmp_quantile_grid.py.md
new file mode 100644
index 0000000..3dbcd64
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__predictors__llmp_quantile_grid.py.md
@@ -0,0 +1,123 @@
+# Source: implementations/food_price_forecasting/predictors/llmp_quantile_grid.py
+
+kind: python
+
+```python
+"""Food CPI recipe: quantile-grid LLMP.
+
+This file is intentionally small and explicit so notebook readers can open it as
+a reference recipe. The reusable method lives in ``aieng.forecasting``; this
+module shows the Food CPI prompt framing, default history window, reasoning
+setting, and cache tag used by the experiment.
+"""
+
+from __future__ import annotations
+
+from typing import Literal
+
+from aieng.forecasting.methods.llm_processes import (
+    QuantileGridLLMPredictor,
+    QuantileGridLLMPredictorConfig,
+)
+from aieng.forecasting.models import LITE_MODEL
+
+
+_ReasoningEffort = Literal["disable", "low", "medium", "high"]
+
+_DEFAULT_MODEL = LITE_MODEL
+_DEFAULT_HISTORY_WINDOW = 120
+# ``None`` = provider default. The Vector proxy now rejects 'disable'/'low' for
+# Gemini models (valid: minimal/medium/high); None sends no reasoning_effort.
+_DEFAULT_REASONING_EFFORT: _ReasoningEffort | None = None
+_RECIPE_FAMILY = "food_cpi_v1"
+
+_SERIES_DESCRIPTION = (
+    "Series: Canadian food Consumer Price Index sub-component (Statistics Canada "
+    "table 18-10-0004, 2002 = 100).\n"
+    "Units: index level (unitless, base 2002 = 100).\n"
+    "Frequency: monthly (period-start)."
+)
+
+_USER_PROMPT_SUFFIX = (
+    "Notes for this series:\n"
+    "- Values are strictly positive and almost always above 100 in the modern era.\n"
+    "- Month-over-month changes are typically within +/- 1.5 index points; large "
+    "jumps are rare and usually tied to known commodity or policy shocks.\n"
+    "- Quantile spreads should widen with forecast horizon unless recent volatility "
+    "clearly supports a tighter distribution."
+)
+
+
+def build_llmp_quantile_grid(
+    *,
+    model: str = _DEFAULT_MODEL,
+    history_window: int | None = _DEFAULT_HISTORY_WINDOW,
+    reasoning_effort: _ReasoningEffort | None = _DEFAULT_REASONING_EFFORT,
+    max_tokens: int = 16384,
+    report_sources: list[str] | None = None,
+    report_max_chars: int | None = None,
+    variant_tag: str | None = None,
+) -> QuantileGridLLMPredictor:
+    """Return the Food CPI quantile-grid LLMP recipe.
+
+    The model is a normal parameter because the base LLMP ``predictor_id``
+    already includes it. The recipe tag records the Food CPI prompt/config family
+    and the cache-relevant knobs that are not otherwise visible in the ID.
+
+    Parameters
+    ----------
+    model : str
+        Model identifier. Defaults to the lite model (``gemini-3.1-flash-lite-preview``).
+    history_window : int or None
+        Number of most-recent periods to include in context.
+    reasoning_effort : str or None
+        Reasoning budget. ``None`` (the default) sends no ``reasoning_effort``
+        and leaves the provider default in place. Thinking models (e.g.
+        ``gemini-3.1-pro-preview``) draw thinking tokens from the same
+        ``max_tokens`` budget via the OpenAI-compatible proxy, so increase
+        ``max_tokens`` if responses are truncated.
+    max_tokens : int, default=16384
+        Per-call output token budget. The generous default prevents truncation
+        on thinking models (e.g. ``gemini-3.1-pro-preview``) where thinking
+        tokens consume the same budget via the OpenAI-compatible proxy. The
+        model only generates tokens it needs, so non-thinking models are
+        unaffected in cost.
+    report_sources : list[str] or None
+        Document-source keys (e.g. ``["cfpr"]``) whose cutoff-filtered editions
+        are prepended to the prompt as a text preamble. Requires the
+        ``DataService`` to carry a ``DocumentStore`` (pass ``reports_dir=`` to
+        ``build_food_cpi_service``). The quantile grid is the report-grounded
+        modality of choice here: it issues **one** elicitation call per origin,
+        so the (large) report context is paid for once — unlike sampled
+        trajectories, which would re-send the reports on every sample draw.
+        ``None`` leaves the recipe report-free.
+    report_max_chars : int or None
+        Per-report character truncation applied before concatenation. CFPR
+        editions run ~30-80k chars each; cap this if context windows get tight.
+        ``None`` means no truncation.
+    variant_tag : str or None
+        Override the cache tag suffix. Defaults to a tag encoding the
+        recipe family, history window, reasoning effort, and (when reports are
+        configured) a ``rep-<sources>`` marker so report-grounded results stay
+        distinct from the numeric-only baseline in caches and leaderboards.
+    """
+    history_tag = "hfull" if history_window is None else f"h{history_window}"
+    reasoning_tag = "rprovider" if reasoning_effort is None else f"r{reasoning_effort}"
+    report_tag = f"_rep-{'-'.join(report_sources)}" if report_sources else ""
+    resolved_variant_tag = variant_tag or f"{_RECIPE_FAMILY}_{history_tag}_{reasoning_tag}{report_tag}"
+
+    config = QuantileGridLLMPredictorConfig(
+        model=model,
+        history_window=history_window,
+        reasoning_effort=reasoning_effort,
+        max_tokens=max_tokens,
+        report_sources=report_sources,
+        report_max_chars=report_max_chars,
+        series_description=_SERIES_DESCRIPTION,
+        user_prompt_suffix=_USER_PROMPT_SUFFIX,
+        variant_tag=resolved_variant_tag,
+    )
+    return QuantileGridLLMPredictor(config)
+
+
+__all__ = ["build_llmp_quantile_grid"]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__predictors__llmp_sampled_trajectory.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__predictors__llmp_sampled_trajectory.py.md
new file mode 100644
index 0000000..1417e3c
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__predictors__llmp_sampled_trajectory.py.md
@@ -0,0 +1,94 @@
+# Source: implementations/food_price_forecasting/predictors/llmp_sampled_trajectory.py
+
+kind: python
+
+```python
+"""Food CPI recipe: sampled-trajectory LLMP.
+
+This file is intentionally small and explicit so notebook readers can open it as
+a reference recipe. The reusable method lives in ``aieng.forecasting``; this
+module shows the Food CPI prompt framing, default sampling budget, history
+window, and cache tag used by the experiment.
+"""
+
+from __future__ import annotations
+
+from aieng.forecasting.methods.llm_processes import (
+    SampledTrajectoryLLMPredictor,
+    SampledTrajectoryLLMPredictorConfig,
+)
+from aieng.forecasting.models import LITE_MODEL
+
+
+_DEFAULT_MODEL = LITE_MODEL
+_DEFAULT_N_SAMPLES = 20
+_DEFAULT_HISTORY_WINDOW = 120
+_RECIPE_FAMILY = "food_cpi_v1"
+
+_SERIES_DESCRIPTION = (
+    "Series: Canadian food Consumer Price Index sub-component (Statistics Canada "
+    "table 18-10-0004, 2002 = 100).\n"
+    "Units: index level (unitless, base 2002 = 100).\n"
+    "Frequency: monthly (period-start)."
+)
+
+_USER_PROMPT_SUFFIX = (
+    "Notes for this series:\n"
+    "- Values are strictly positive and almost always above 100 in the modern era.\n"
+    "- Month-over-month changes are typically within +/- 1.5 index points; large "
+    "jumps are rare and usually tied to known commodity or policy shocks.\n"
+    "- Year-over-year growth in the 2020-2024 window has ranged roughly 0-12 percent "
+    "depending on the sub-component; revert toward the recent trend rather than "
+    "extrapolating short-term spikes indefinitely."
+)
+
+
+def build_llmp_sampled_trajectory(
+    *,
+    model: str = _DEFAULT_MODEL,
+    n_samples: int = _DEFAULT_N_SAMPLES,
+    history_window: int | None = _DEFAULT_HISTORY_WINDOW,
+    max_tokens: int = 16384,
+    variant_tag: str | None = None,
+) -> SampledTrajectoryLLMPredictor:
+    """Return the Food CPI sampled-trajectory LLMP recipe.
+
+    The model is a normal parameter because the base LLMP ``predictor_id``
+    already includes it. The recipe tag records the Food CPI prompt/config family
+    and the cache-relevant knobs that are not otherwise visible in the ID.
+
+    Parameters
+    ----------
+    model : str
+        Model identifier. Defaults to the lite model (``gemini-3.1-flash-lite-preview``).
+    n_samples : int
+        Number of trajectory samples to draw per prediction call.
+    history_window : int or None
+        Number of most-recent periods to include in context.
+    max_tokens : int, default=16384
+        Per-call output token budget. The generous default prevents truncation
+        on thinking models (e.g. ``gemini-3.1-pro-preview``) where thinking
+        tokens consume the same budget via the OpenAI-compatible proxy. The
+        model only generates tokens it needs, so non-thinking models are
+        unaffected in cost.
+    variant_tag : str or None
+        Override the cache tag suffix.
+    """
+    history_tag = "hfull" if history_window is None else f"h{history_window}"
+    sample_count_tag = f"n{n_samples}"
+    resolved_variant_tag = variant_tag or f"{_RECIPE_FAMILY}_{history_tag}_{sample_count_tag}"
+
+    config = SampledTrajectoryLLMPredictorConfig(
+        model=model,
+        n_samples=n_samples,
+        history_window=history_window,
+        max_tokens=max_tokens,
+        series_description=_SERIES_DESCRIPTION,
+        user_prompt_suffix=_USER_PROMPT_SUFFIX,
+        variant_tag=resolved_variant_tag,
+    )
+    return SampledTrajectoryLLMPredictor(config)
+
+
+__all__ = ["build_llmp_sampled_trajectory"]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__reports.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__reports.py.md
new file mode 100644
index 0000000..9c9ccd6
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__reports.py.md
@@ -0,0 +1,105 @@
+# Source: implementations/food_price_forecasting/reports.py
+
+kind: python
+
+```python
+"""CFPR report acquisition manifest.
+
+This module is the use-case-specific glue for Canada's Food Price Report: it
+parses the committed ``reports_manifest.yaml`` into :class:`CFPRReportEntry`
+records (a core :class:`DocumentMeta` plus the download URL).
+
+Text extraction itself is source-agnostic and lives in
+``aieng.forecasting.documents``.  We deliberately keep no CFPR-specific parsing
+or section/segment heuristics here: report families share no common structure,
+and the planned LLM-P formats consume whole documents, so brittle per-source
+heading rules would add complexity without earning it.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import yaml
+from aieng.forecasting.documents.models import DocumentMeta
+from pydantic import BaseModel
+
+
+REPORTS_MANIFEST_PATH = Path(__file__).resolve().parent / "reports_manifest.yaml"
+"""Committed manifest of CFPR report editions (URLs, editions, publication dates)."""
+
+DEFAULT_REPORTS_CACHE_DIR = Path("data/reports/cfpr")
+"""Default (gitignored) cache directory for downloaded PDFs and extracted text."""
+
+
+class CFPRReportEntry(BaseModel):
+    """One CFPR edition: cutoff metadata plus where to fetch it.
+
+    Parameters
+    ----------
+    meta : DocumentMeta
+        Source-agnostic provenance/cutoff metadata.
+    url : str
+        Direct PDF URL on the Dalhousie CDN.
+    sha256 : str or None
+        Expected SHA-256 of the PDF bytes, if pinned. ``fetch_cfpr.py`` verifies
+        downloads against it so a re-uploaded CDN file fails loudly.
+    """
+
+    meta: DocumentMeta
+    url: str
+    sha256: str | None = None
+
+    @property
+    def key(self) -> str:
+        """Stable per-edition key, e.g. ``"2026_en"`` (mirrors ``meta.doc_id``)."""
+        return self.meta.doc_id
+
+    def pdf_path(self, cache_dir: Path = DEFAULT_REPORTS_CACHE_DIR) -> Path:
+        """Canonical cached PDF path for this edition."""
+        return cache_dir / f"{self.key}.pdf"
+
+
+def load_manifest(manifest_path: Path = REPORTS_MANIFEST_PATH) -> list[CFPRReportEntry]:
+    """Load and validate the committed CFPR report manifest.
+
+    Parameters
+    ----------
+    manifest_path : Path
+        Path to ``reports_manifest.yaml``.
+
+    Returns
+    -------
+    list[CFPRReportEntry]
+        One entry per report edition, in manifest order.
+
+    Raises
+    ------
+    FileNotFoundError
+        If the manifest file does not exist.
+    """
+    if not manifest_path.exists():
+        raise FileNotFoundError(f"CFPR manifest not found: {manifest_path}")
+    raw = yaml.safe_load(manifest_path.read_text(encoding="utf-8"))
+    source = raw["source"]
+    entries: list[CFPRReportEntry] = []
+    for item in raw["reports"]:
+        lang = item.get("lang", "en")
+        meta = DocumentMeta(
+            source=source,
+            doc_id=f"{item['year']}_{lang}",
+            publication_date=item["publication_date"],
+            title=item.get("title")
+            or f"Canada's Food Price Report {item['year']}" + (f" ({item['edition']})" if item.get("edition") else ""),
+            lang=lang,
+        )
+        entries.append(CFPRReportEntry(meta=meta, url=item["url"], sha256=item.get("sha256")))
+    return entries
+
+
+__all__ = [
+    "DEFAULT_REPORTS_CACHE_DIR",
+    "REPORTS_MANIFEST_PATH",
+    "CFPRReportEntry",
+    "load_manifest",
+]
+```
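
The `sha256` field above is described as a pin that makes a re-uploaded CDN file fail loudly. A minimal sketch of such a check follows, assuming nothing about how `fetch_cfpr.py` actually implements it: `verify_sha256` is a hypothetical helper name, and returning `False` for an unpinned entry is an illustrative choice rather than the repository's behaviour.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def verify_sha256(path: Path, expected: str | None) -> bool:
    """Check a downloaded file against an optional pinned SHA-256 (hypothetical helper).

    Returns True when the digest matches, False when no hash is pinned,
    and raises ValueError on a mismatch so a changed upstream file is not
    silently accepted.
    """
    if expected is None:
        return False  # nothing pinned, so nothing to verify

    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in 1 MiB chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)

    actual = digest.hexdigest()
    if actual != expected.lower():
        raise ValueError(f"SHA-256 mismatch for {path}: expected {expected}, got {actual}")
    return True
```

Raising instead of returning `False` on a mismatch keeps the failure visible to whatever calls the download step; an unpinned entry is treated as "not verified" rather than as an error.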
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__smoke_report.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__smoke_report.py.md
new file mode 100644
index 0000000..f2efeba
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__smoke_report.py.md
@@ -0,0 +1,83 @@
+# Source: implementations/food_price_forecasting/smoke_report.py
+
+kind: python
+
+```python
+"""Plain-text reporting for food CPI agent smoke tests."""
+
+from __future__ import annotations
+
+import textwrap
+from collections.abc import Sequence
+
+from aieng.forecasting.evaluation.prediction import Prediction
+
+
+CFPR_HORIZONS = list(range(6, 18))
+
+
+def summarize_agent_predictions(
+    predictions: Sequence[Prediction],
+    *,
+    expected_horizons: Sequence[int] | None = None,
+) -> bool:
+    """Print a smoke-test summary and return True only for a complete, valid fan.
+
+    Handles the common failure modes explicitly:
+
+    - **Empty list** — ``predict()`` finished but nothing was converted (check logs).
+    - **Partial fan** — fewer horizons than the task expects.
+    - **Per-row issues** — non-monotone quantiles or ``point_forecast != q50``.
+    """
+    expected = list(expected_horizons if expected_horizons is not None else CFPR_HORIZONS)
+    n_expected = len(expected)
+
+    if not predictions:
+        print("\n✗ NO FORECAST OUTPUT")
+        print("  predict() returned an empty list — no structured trajectory was produced.")
+        print("  Check ERROR/WARNING logs above (conversion failure or missing horizons in agent JSON).")
+        print("  If a ValidationError was raised instead, the model response failed schema validation.")
+        return False
+
+    n_got = len(predictions)
+    if n_got != n_expected:
+        print(f"\n✗ INCOMPLETE FORECAST ({n_got}/{n_expected} horizons)")
+        print(f"  Expected CFPR horizons {expected[0]}–{expected[-1]}; got {n_got} prediction(s).")
+        print("  The agent did not return a full Jan–Dec trajectory in structured output.")
+        print("  Inspect Langfuse for set_model_response / whether all horizons are present.")
+    else:
+        print(f"\n✓ Well-formed fan: {n_got} predictions (horizons {expected[0]}–{expected[-1]})")
+
+    print(f"\n  {'Month':>8}  {'Point':>8}  {'q05':>8}  {'q50':>8}  {'q95':>8}  OK")
+    print(f"  {'──────':>8}  {'─────':>8}  {'───':>8}  {'───':>8}  {'───':>8}  ──")
+
+    issues: list[str] = []
+    for pred in sorted(predictions, key=lambda p: p.forecast_date):
+        month = pred.forecast_date.strftime("%Y-%m")
+        q05 = pred.payload.quantiles[0.05]
+        q50 = pred.payload.quantiles[0.50]
+        q95 = pred.payload.quantiles[0.95]
+        point = pred.payload.point_forecast
+        qs = [pred.payload.quantiles[q] for q in sorted(pred.payload.quantiles)]
+        # Evaluate each validity check once, then reuse it for the flag and the issue list.
+        monotone = all(left <= right for left, right in zip(qs, qs[1:]))
+        point_matches_median = abs(point - q50) < 0.01
+        flag = "ok" if monotone and point_matches_median else "!"
+        print(f"  {month:>8}  {point:>8.2f}  {q05:>8.2f}  {q50:>8.2f}  {q95:>8.2f}  {flag}")
+        if not monotone:
+            issues.append(f"non-monotone quantiles at {month}")
+        if not point_matches_median:
+            issues.append(f"point_forecast != q50 at {month}")
+
+    rationale = predictions[0].metadata.get("rationale")  # predictions is non-empty here (early return above)
+    if rationale:
+        print("\n  Rationale:")
+        for line in textwrap.wrap(str(rationale), width=72):
+            print(f"    {line}")
+
+    if issues:
+        print("\n✗ Output format issues:")
+        for issue in issues:
+            print(f"    - {issue}")
+        return False
+
+    return n_got == n_expected
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__specs__food_cpi_cfpr_backtest.yaml.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__specs__food_cpi_cfpr_backtest.yaml.md
new file mode 100644
index 0000000..e4ef431
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__specs__food_cpi_cfpr_backtest.yaml.md
@@ -0,0 +1,154 @@
+# Source: implementations/food_price_forecasting/specs/food_cpi_cfpr_backtest.yaml
+
+kind: yaml
+
+```yaml
+# Reference MultiTargetBacktestSpec: Canada Food CPI, CFPR trajectory replica
+#
+# This spec is the canonical backtest for the Canada's Food Price Report
+# (CFPR) forecasting exercise.  The CFPR produces, once a year, an
+# *average-over-average* YoY CPI change projection for the upcoming calendar
+# year: "by how much will the average food CPI index in year Y+1 differ
+# from the average in year Y?"
+#
+# To compute that quantity from a monthly CPI forecast we need the full
+# 12-month trajectory for calendar year Y+1, issued from a July-of-year-Y
+# origin.  Concretely:
+#
+#   * Origin:   July 1, year Y  (the latest CPI observation in hand is
+#               typically May or June of year Y, so July is the first origin
+#               at which a CFPR-style prediction can be made).
+#   * Horizons: 6, 7, 8, ..., 17  (January .. December of year Y+1).
+#   * Stride:   12 months (one origin per July, annual cadence).
+#
+# From these 12 forecasts one computes:
+#
+#     avg/avg YoY = mean(predictions for Jan..Dec of Y+1)
+#                   / mean(actuals for Jan..Dec of Y)
+#                 - 1
+#
+# Window
+# ------
+# Jul 2009 → Jul 2024 = 16 annual origins.  This spans three distinct
+# macro regimes (low-inflation 2010-19, COVID shock 2020-21, food-price
+# surge/retreat 2021-24).  The last origin (Jul 2024) forecasts out to
+# Dec 2025, which resolves in the StatCan release of Jan 2026.
+#
+# Predictors
+# ----------
+# This is the open backtesting resource — run it as many times as you
+# like to tune predictors and build intuition.  There is no protected eval
+# spec for this experiment: historical LLM/agent scores are upper bounds
+# on live performance (see the food CPI experiment notebook).
+#
+# Prerequisites
+# -------------
+#   uv run python scripts/fetch_cpi.py    # StatCan food CPI cache
+#
+# (FRED covariates are deliberately NOT part of this canonical spec — see
+# `planning-docs/bootcamp-workplan.md` for the deferred "covariate framing" design
+# discussion.  Predictors that want exogenous inputs should be evaluated
+# on a separate experiment.)
+#
+# Loading
+# -------
+#
+#   import yaml
+#   from aieng.forecasting.evaluation import MultiTargetBacktestSpec, multi_backtest
+#
+#   with open("implementations/food_price_forecasting/specs/food_cpi_cfpr_backtest.yaml") as f:
+#       spec = MultiTargetBacktestSpec.model_validate(yaml.safe_load(f))
+#   results = multi_backtest(predictor=my_predictor, spec=spec, data_service=svc)
+
+spec_id: food_cpi_cfpr_backtest
+
+description: >-
+  CFPR replica backtest: 9 Canadian food CPI series (overall plus 8
+  sub-indices), annual July origins 2009-2024, 12-step trajectory
+  (horizons 6-17) spanning Jan-Dec of the following calendar year.
+  Supports computation of the canonical average-over-average YoY CPI
+  change that Canada's Food Price Report publishes each year.  Open resource; run freely.
+
+tasks:
+  - task_id: food_cpi_overall_cfpr
+    target_series_id: cpi_food_canada
+    horizons: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
+    frequency: MS
+    description: >-
+      Canada CPI Food (overall, 2002=100), trajectory across Jan-Dec of
+      the year following a July origin.
+
+  - task_id: food_cpi_bakery_cereal_cfpr
+    target_series_id: cpi_bakery_cereal_canada
+    horizons: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
+    frequency: MS
+    description: >-
+      Canada CPI Bakery and cereal products (excl. baby food), trajectory
+      across Jan-Dec of the year following a July origin.
+
+  - task_id: food_cpi_dairy_eggs_cfpr
+    target_series_id: cpi_dairy_eggs_canada
+    horizons: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
+    frequency: MS
+    description: >-
+      Canada CPI Dairy products and eggs, trajectory across Jan-Dec of
+      the year following a July origin.
+
+  - task_id: food_cpi_fish_seafood_cfpr
+    target_series_id: cpi_fish_seafood_canada
+    horizons: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
+    frequency: MS
+    description: >-
+      Canada CPI Fish, seafood and other marine products, trajectory
+      across Jan-Dec of the year following a July origin.
+
+  - task_id: food_cpi_restaurants_cfpr
+    target_series_id: cpi_restaurants_canada
+    horizons: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
+    frequency: MS
+    description: >-
+      Canada CPI Food purchased from restaurants, trajectory across
+      Jan-Dec of the year following a July origin.
+
+  - task_id: food_cpi_fruit_preparations_nuts_cfpr
+    target_series_id: cpi_fruit_preparations_nuts_canada
+    horizons: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
+    frequency: MS
+    description: >-
+      Canada CPI Fruit, fruit preparations and nuts, trajectory across
+      Jan-Dec of the year following a July origin.
+
+  - task_id: food_cpi_meat_cfpr
+    target_series_id: cpi_meat_canada
+    horizons: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
+    frequency: MS
+    description: >-
+      Canada CPI Meat, trajectory across Jan-Dec of the year following a
+      July origin.
+
+  - task_id: food_cpi_other_food_cfpr
+    target_series_id: cpi_other_food_nonalcoholic_canada
+    horizons: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
+    frequency: MS
+    description: >-
+      Canada CPI Other food products and non-alcoholic beverages,
+      trajectory across Jan-Dec of the year following a July origin.
+
+  - task_id: food_cpi_vegetables_cfpr
+    target_series_id: cpi_vegetables_preparations_canada
+    horizons: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
+    frequency: MS
+    description: >-
+      Canada CPI Vegetables and vegetable preparations, trajectory across
+      Jan-Dec of the year following a July origin.
+
+# July 2009 -> July 2024 = 16 annual origins.
+start: "2009-07-01"
+end: "2024-07-01"
+
+# stride=12 on monthly (MS) frequency -> one origin per July.
+stride: 12
+
+# Require 24 months of history before the first forecast origin.
+warmup: 24
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__specs__food_cpi_recent_backtest.yaml.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__specs__food_cpi_recent_backtest.yaml.md
new file mode 100644
index 0000000..c4298ad
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__specs__food_cpi_recent_backtest.yaml.md
@@ -0,0 +1,108 @@
+# Source: implementations/food_price_forecasting/specs/food_cpi_recent_backtest.yaml
+
+kind: yaml
+
+```yaml
+# Mini backtest spec: all 9 targets, recent 6 origins.
+#
+# All nine food CPI categories, 6 annual July origins (2019-2024).
+# Covers two meaningful macro regimes — the COVID shock (2020-21) and the
+# food-price surge and retreat (2021-24) — while keeping total agent calls
+# manageable.  Not budget-gated; run freely.
+#
+# Origin count : 6    (Jul 2019, 2020, 2021, 2022, 2023, 2024)
+# Agent calls  : ~54  (9 tasks × 6 origins, all 12 trajectory horizons per call)
+#
+# Use `food_cpi_cfpr_backtest.yaml` for the full 16-origin canonical backtest.
+#
+# Prerequisites
+# -------------
+#   uv run python scripts/fetch_cpi.py
+
+spec_id: food_cpi_recent_backtest
+
+description: >-
+  All-targets recent backtest: 9 Canadian food CPI sub-indices, 6 recent
+  annual July origins (2019-2024), 12-step trajectory (horizons 6-17).
+  Covers COVID shock and food-price surge/retreat regimes. Not
+  budget-gated — run freely.
+
+tasks:
+  - task_id: food_cpi_overall_cfpr
+    target_series_id: cpi_food_canada
+    horizons: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
+    frequency: MS
+    description: >-
+      Canada CPI Food (overall, 2002=100), trajectory across Jan-Dec of
+      the year following a July origin.
+
+  - task_id: food_cpi_bakery_cereal_cfpr
+    target_series_id: cpi_bakery_cereal_canada
+    horizons: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
+    frequency: MS
+    description: >-
+      Canada CPI Bakery and cereal products (excl. baby food), trajectory
+      across Jan-Dec of the year following a July origin.
+
+  - task_id: food_cpi_dairy_eggs_cfpr
+    target_series_id: cpi_dairy_eggs_canada
+    horizons: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
+    frequency: MS
+    description: >-
+      Canada CPI Dairy products and eggs, trajectory across Jan-Dec of
+      the year following a July origin.
+
+  - task_id: food_cpi_fish_seafood_cfpr
+    target_series_id: cpi_fish_seafood_canada
+    horizons: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
+    frequency: MS
+    description: >-
+      Canada CPI Fish, seafood and other marine products, trajectory
+      across Jan-Dec of the year following a July origin.
+
+  - task_id: food_cpi_restaurants_cfpr
+    target_series_id: cpi_restaurants_canada
+    horizons: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
+    frequency: MS
+    description: >-
+      Canada CPI Food purchased from restaurants, trajectory across
+      Jan-Dec of the year following a July origin.
+
+  - task_id: food_cpi_fruit_preparations_nuts_cfpr
+    target_series_id: cpi_fruit_preparations_nuts_canada
+    horizons: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
+    frequency: MS
+    description: >-
+      Canada CPI Fruit, fruit preparations and nuts, trajectory across
+      Jan-Dec of the year following a July origin.
+
+  - task_id: food_cpi_meat_cfpr
+    target_series_id: cpi_meat_canada
+    horizons: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
+    frequency: MS
+    description: >-
+      Canada CPI Meat, trajectory across Jan-Dec of the year following a
+      July origin.
+
+  - task_id: food_cpi_other_food_cfpr
+    target_series_id: cpi_other_food_nonalcoholic_canada
+    horizons: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
+    frequency: MS
+    description: >-
+      Canada CPI Other food products and non-alcoholic beverages,
+      trajectory across Jan-Dec of the year following a July origin.
+
+  - task_id: food_cpi_vegetables_cfpr
+    target_series_id: cpi_vegetables_preparations_canada
+    horizons: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
+    frequency: MS
+    description: >-
+      Canada CPI Vegetables and vegetable preparations, trajectory across
+      Jan-Dec of the year following a July origin.
+
+# Jul 2019 -> Jul 2024 = 6 annual origins.
+start: "2019-07-01"
+end: "2024-07-01"
+stride: 12
+warmup: 24
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__specs__food_cpi_single_mini_backtest.yaml.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__specs__food_cpi_single_mini_backtest.yaml.md
new file mode 100644
index 0000000..e17b6b6
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__specs__food_cpi_single_mini_backtest.yaml.md
@@ -0,0 +1,40 @@
+# Source: implementations/food_price_forecasting/specs/food_cpi_single_mini_backtest.yaml
+
+kind: yaml
+
+```yaml
+# Mini backtest spec: single target, recent 6 origins.
+#
+# One task (bakery & cereal), 6 annual July origins (2019-2024).
+# Use this for fast agent development and pipeline smoke-testing.
+# Not budget-gated; run freely.
+#
+# Origin count : 6   (Jul 2019, 2020, 2021, 2022, 2023, 2024)
+# Agent calls  : ~6  (one per origin, all 12 trajectory horizons in one call)
+#
+# Prerequisites
+# -------------
+#   uv run python scripts/fetch_cpi.py
+
+spec_id: food_cpi_single_mini_backtest
+
+description: >-
+  Single-target mini backtest: bakery & cereal CPI, 6 recent annual July
+  origins (2019-2024), 12-step trajectory (horizons 6-17). Use for fast
+  agent development and pipeline smoke-testing. Not budget-gated — run freely.
+
+tasks:
+  - task_id: food_cpi_bakery_cereal_cfpr
+    target_series_id: cpi_bakery_cereal_canada
+    horizons: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
+    frequency: MS
+    description: >-
+      Canada CPI Bakery and cereal products (excl. baby food), trajectory
+      across Jan-Dec of the year following a July origin.
+
+# Jul 2019 -> Jul 2024 = 6 annual origins.
+start: "2019-07-01"
+end: "2024-07-01"
+stride: 12
+warmup: 24
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__starter_agent____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__starter_agent____init__.py.md
new file mode 100644
index 0000000..4d113c9
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__starter_agent____init__.py.md
@@ -0,0 +1,25 @@
+# Source: implementations/food_price_forecasting/starter_agent/__init__.py
+
+kind: python
+
+```python
+"""Food CPI starter agent — a fresh, hackable template for your own exploration.
+
+Exports the toggle-driven :class:`AgentConfig` factory, the predictor
+convenience factory, and the self-contained prompt builder. See
+``99_starter_agent.ipynb`` and ``agent.py``.
+"""
+
+from food_price_forecasting.starter_agent.agent import (
+    FoodCpiStarterPromptBuilder,
+    build_starter_agent_config,
+    build_starter_agent_predictor,
+)
+
+
+__all__ = [
+    "FoodCpiStarterPromptBuilder",
+    "build_starter_agent_config",
+    "build_starter_agent_predictor",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__starter_agent__agent.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__starter_agent__agent.py.md
new file mode 100644
index 0000000..139e8d5
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__starter_agent__agent.py.md
@@ -0,0 +1,270 @@
+# Source: implementations/food_price_forecasting/starter_agent/agent.py
+
+kind: python
+
+```python
+"""Food CPI starter agent — a fresh, hackable template for your own exploration.
+
+This is the food use case's **first** agent, and it is deliberately minimal: a
+clean starting point with our common building blocks behind simple toggles —
+
+- **optional news search** (``enable_search``, on by default) — bounded,
+  cutoff-aware Google Search through the Vector proxy;
+- **optional code execution** (``enable_code_exec``, off by default) — an E2B
+  Python sandbox;
+- **two lightweight skills** (:mod:`skills/`) that are *tool-usage playbooks*:
+  how to get good results out of search and code execution.
+
+Everything routes through the Vector proxy — no direct provider keys. See
+``planning-docs/vector-llm-proxy.md``.
+
+Unlike energy/boc, this use case had no agent to borrow a prompt builder from,
+so :class:`FoodCpiStarterPromptBuilder` below is a small, self-contained
+serialiser — read it, then extend it (more history, covariates, report context).
+The output is a probabilistic CPI-index trajectory. Pair this with
+``99_starter_agent.ipynb``.
+
+Module-level ``__getattr__`` exposes ``root_agent`` lazily so ``adk web`` can
+load this module for interactive (schema-free) use.
+"""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+from typing import Any, Callable
+
+import pandas as pd
+from aieng.forecasting.data.context import ForecastContext
+from aieng.forecasting.evaluation.prediction import STANDARD_QUANTILES
+from aieng.forecasting.evaluation.task import ForecastingTask
+from aieng.forecasting.methods.agentic import (
+    AgentPredictor,
+    ContinuousAgentForecastOutput,
+    build_adk_agent,
+)
+from aieng.forecasting.methods.agentic.agent_factory import (
+    AgentConfig,
+    CodeExecutionConfig,
+    ContextRetrievalConfig,
+)
+from aieng.forecasting.models import LITE_MODEL
+from pydantic import BaseModel
+
+
+# Skills live next to this module.
+_SKILLS_ROOT = Path(__file__).parent / "skills"
+_FORECASTING_SKILL = _SKILLS_ROOT / "forecasting"
+_RESEARCH_SKILL = _SKILLS_ROOT / "research-playbook"
+_CODE_ANALYSIS_SKILL = _SKILLS_ROOT / "code-analysis-playbook"
+
+
+# ---------------------------------------------------------------------------
+# Prompt builder (self-contained — this use case has no analyst_agent to reuse)
+# ---------------------------------------------------------------------------
+
+
+class FoodCpiStarterPromptBuilder(BaseModel):
+    """Serialise a single food-CPI series + task into the agent's JSON payload.
+
+    Minimal on purpose: the most recent ``history_months`` of the target index,
+    plus the task spec and the exact quantile grid the agent must produce.
+    Implements the
+    :class:`~aieng.forecasting.methods.agentic.predictor.ForecastPromptBuilder`
+    protocol structurally — extend it with covariates or report context.
+    """
+
+    model_config = {"extra": "forbid"}
+
+    history_months: int = 120
+
+    def __call__(self, *, task: ForecastingTask, context: ForecastContext) -> str:
+        df = context.get_series(task.target_series_id).tail(self.history_months)
+        rows = ["date,index"] + [
+            f"{pd.Timestamp(ts).date()},{float(v):.2f}" for ts, v in zip(df["timestamp"], df["value"])
+        ]
+        last_row = df.iloc[-1]
+        payload: dict[str, Any] = {
+            "task": task.task_id,
+            "as_of": str(context.as_of)[:10],
+            "horizons": list(task.horizons),
+            "standard_quantiles": list(STANDARD_QUANTILES),
+            "target_summary": {
+                "last_index": float(last_row["value"]),
+                "last_date": str(pd.Timestamp(last_row["timestamp"]).date()),
+                "n_months": int(len(df)),
+            },
+            "target_history_csv": "\n".join(rows),
+        }
+        return json.dumps(payload, indent=2)
+
+
+# ---------------------------------------------------------------------------
+# System prompt
+# ---------------------------------------------------------------------------
+
+
+def _build_starter_instruction() -> str:
+    """Build the task-agnostic, skill-agnostic starter persona.
+
+    Just the analyst's identity and how to behave — no output schema, no payload
+    contract, no skill or tool mechanics. ADK injects the name + description of
+    every attached skill (and every tool) into the system prompt, so the agent
+    already knows what it can load and call; repeating that here would only
+    duplicate dynamically-injected information. The forecasting *contract* lives
+    in the loadable ``forecasting`` skill. Edit the persona freely.
+    """
+    return (
+        "## Role\n\n"
+        "You are a Canadian food-price analyst — fluent in food CPI dynamics, "
+        "seasonality, commodity and input costs, and the headline inflation "
+        "backdrop. This is a starter agent: keep your reasoning transparent and "
+        "your claims honest.\n\n"
+        "## How to respond\n\n"
+        "- For open-ended questions, scenario analysis, or anything "
+        "conversational, answer directly and concisely — do NOT ask for a JSON "
+        "payload.\n"
+        "- When you are handed a task that asks for a structured probabilistic "
+        "forecast, produce a calibrated one."
+    )
+
+
+_STARTER_INSTRUCTION = _build_starter_instruction()
+
+
+_CONTEXT_RETRIEVAL_INSTRUCTION = """\
+You are a Canadian food-inflation intelligence specialist with web search.
+
+Return a concise structured markdown summary (3-5 paragraphs) covering, as the
+query warrants: recent food-CPI and headline-CPI prints vs the 2% target;
+commodity and input costs (grains, energy, fertiliser); supply-chain and weather
+disruptions; and the CAD exchange rate.
+
+Ground every claim in the search results you actually retrieve. When a cutoff
+date is specified, never report or speculate about events after it.\
+"""
+
+
+# ---------------------------------------------------------------------------
+# Config factory
+# ---------------------------------------------------------------------------
+
+
+def build_starter_agent_config(
+    model: str = LITE_MODEL,
+    search_model: str = LITE_MODEL,
+    *,
+    enable_search: bool = True,
+    enable_code_exec: bool = False,
+) -> AgentConfig:
+    """Build the food-CPI starter :class:`AgentConfig`.
+
+    Parameters
+    ----------
+    model : str
+        Model for the analyst agent (default: lite). Pass the advanced model
+        (``"gemini-3.5-flash"``) for higher-quality runs.
+    search_model : str
+        Model for the bounded web-search sub-tool.
+    enable_search : bool, default=True
+        Wire a cutoff-aware ``search_web`` tool and load the
+        ``research-playbook`` skill. Proxy-only — no extra API key.
+    enable_code_exec : bool, default=False
+        Wire an E2B Python sandbox and load the ``code-analysis-playbook``
+        skill. Needs ``E2B_API_KEY`` and is slower, so it is off by default.
+
+    Returns
+    -------
+    AgentConfig
+    """
+    # Every attached skill is loaded on demand: ADK injects each skill's name +
+    # description into the system prompt, and the agent reads the full SKILL.md
+    # only when relevant — so toggling a tool just adds its skill, no persona edits.
+    skills_dirs: list[Path] = [_FORECASTING_SKILL]
+    if enable_search:
+        skills_dirs.append(_RESEARCH_SKILL)
+    if enable_code_exec:
+        skills_dirs.append(_CODE_ANALYSIS_SKILL)
+
+    context_retrieval = (
+        ContextRetrievalConfig(
+            enabled=True,
+            instruction=_CONTEXT_RETRIEVAL_INSTRUCTION,
+            search_model=search_model,
+        )
+        if enable_search
+        else ContextRetrievalConfig()
+    )
+
+    return AgentConfig(
+        name="food_cpi_starter_agent",
+        model=model,
+        instruction=_STARTER_INSTRUCTION,
+        # 16k headroom: enough for a complete run_code script + structured output.
+        max_output_tokens=16_384 if enable_code_exec else None,
+        context_retrieval=context_retrieval,
+        code_execution=CodeExecutionConfig(enabled=enable_code_exec),
+        skills_dirs=skills_dirs,
+    )
+
+
+# ---------------------------------------------------------------------------
+# Predictor convenience factory
+# ---------------------------------------------------------------------------
+
+
+class _StarterForecastPromptBuilder:
+    """Add the output schema + a forecast directive to a base builder's payload.
+
+    The exact JSON schema is generated at call time from the output class
+    (drift-free) and injected into the user payload — never into the system
+    prompt — so the agent stays conversational until it is actually asked to
+    forecast. Implements the
+    :class:`~aieng.forecasting.methods.agentic.predictor.ForecastPromptBuilder`
+    protocol structurally.
+    """
+
+    def __init__(self, inner: Callable[..., str], output_schema_json: str) -> None:
+        self._inner = inner
+        self._schema_json = output_schema_json
+
+    def __call__(self, *, task: ForecastingTask, context: ForecastContext) -> str:
+        payload = json.loads(self._inner(task=task, context=context))
+        payload["instructions"] = (
+            "Produce a calibrated probabilistic forecast for this task and return it by "
+            "calling `set_model_response` with a `json_response` string matching "
+            "`output_schema` exactly."
+        )
+        payload["output_schema"] = self._schema_json
+        return json.dumps(payload, indent=2)
+
+
+def build_starter_agent_predictor(config: AgentConfig) -> AgentPredictor:
+    """Wrap a starter :class:`AgentConfig` in an :class:`AgentPredictor`.
+
+    Wraps :class:`FoodCpiStarterPromptBuilder` so the (drift-free) continuous
+    output schema and a forecast directive ride in the payload — keeping the
+    schema out of the persona. ``predict(task, context)`` returns one
+    :class:`~aieng.forecasting.evaluation.prediction.Prediction` per horizon.
+    """
+    return AgentPredictor(
+        agent_config=config,
+        prompt_builder=_StarterForecastPromptBuilder(
+            FoodCpiStarterPromptBuilder(),
+            ContinuousAgentForecastOutput.prompt_schema_json(),
+        ),
+        output_schema=ContinuousAgentForecastOutput,
+    )
+
+
+# ---------------------------------------------------------------------------
+# Lazy root_agent for `adk web` interactive use
+# ---------------------------------------------------------------------------
+
+
+def __getattr__(name: str) -> Any:
+    """Expose ``root_agent`` lazily for schema-free interactive use via ``adk web``."""
+    if name == "root_agent":
+        return build_adk_agent(build_starter_agent_config())
+    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__starter_agent__skills__code-analysis-playbook__SKILL.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__starter_agent__skills__code-analysis-playbook__SKILL.md.md
new file mode 100644
index 0000000..60b6903
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__starter_agent__skills__code-analysis-playbook__SKILL.md.md
@@ -0,0 +1,57 @@
+# Source: implementations/food_price_forecasting/starter_agent/skills/code-analysis-playbook/SKILL.md
+
+kind: markdown
+
+---
+name: code-analysis-playbook
+description: >-
+  How to use the code execution sandbox well — parse the JSON payload (not
+  disk files), compute a couple of useful diagnostics before forecasting, and
+  keep the session stateful within a turn. Load this before writing code. No
+  scripts.
+---
+
+# Code-analysis playbook
+
+A short guide to using the `run_code` sandbox productively. This is a starter
+skill — extend it with the diagnostics that matter for your problem.
+
+## Where your data lives
+
+All data comes from the **JSON payload in your context** — there are no disk
+files and no network. The history arrives as a CSV *string* (e.g.
+`target_history_csv`). Parse it with `io.StringIO`, never as a file path:
+
+```python
+import io, pandas as pd
+df = pd.read_csv(io.StringIO(payload["target_history_csv"]))
+```
+
+The sandbox is **stateful within a turn**: parse once in your first code block,
+then reuse the DataFrame in later blocks instead of re-parsing.
+
+## Compute before you forecast
+
+Run a couple of cheap diagnostics so your forecast is grounded in arithmetic,
+not vibes:
+
+1. **Recent trend** — slope/return over the last N observations.
+2. **Volatility** — recent standard deviation of changes; it sets how wide your
+   quantile bands should be.
+3. **Sanity check** — does your point forecast sit within a plausible multiple
+   of recent moves? If not, revisit it.
+
+Use the printed numbers to set the point forecast and to *calibrate the spread*
+between your low and high quantiles — wider when recent volatility is high.
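+
+A minimal sketch of the three diagnostics above. It assumes a made-up
+`csv_text` standing in for the payload's `target_history_csv`; the six-month
+window and the 3x plausibility multiple are illustrative choices, not values
+taken from this repository.
+
+```python
+import io
+
+import pandas as pd
+
+# Hypothetical stand-in for payload["target_history_csv"].
+csv_text = "date,index\n" + "\n".join(
+    f"2023-{m:02d}-01,{180.0 + 0.5 * m:.2f}" for m in range(1, 13)
+)
+df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
+
+changes = df["index"].diff().dropna()   # month-over-month index changes
+recent_trend = changes.tail(6).mean()   # 1. average monthly move (assumed 6-month window)
+recent_vol = changes.tail(6).std()      # 2. spread of recent moves
+last_index = df["index"].iloc[-1]
+
+# 3. Sanity check: compare a naive trend extrapolation with recent moves.
+horizon = 6
+naive_point = last_index + recent_trend * horizon
+max_move = 3 * changes.abs().max() * horizon  # assumed plausibility bound
+plausible = abs(naive_point - last_index) <= max_move
+
+print(f"trend={recent_trend:.3f} vol={recent_vol:.3f} "
+      f"point={naive_point:.2f} plausible={plausible}")
+```
+
+Because the sandbox is stateful within a turn, `df` and `changes` can be reused
+in later blocks when setting the quantile spread.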
+
+## Domain focus (edit this for your use case)
+
+For monthly food CPI, the series is an index (base 2002=100) that moves slowly
+and seasonally; month-over-month changes are typically small. Use recent
+year-over-year change and seasonality, not a fixed guess, to set your trajectory
+and the width of your interval bands.
+
+## Room to grow
+
+- Add your own diagnostic patterns (regime detection, seasonality, covariates).
+- Drop reusable reference values into a `references/` file and `load_skill_resource` them.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__starter_agent__skills__forecasting__SKILL.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__starter_agent__skills__forecasting__SKILL.md.md
new file mode 100644
index 0000000..ecf2f3a
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__starter_agent__skills__forecasting__SKILL.md.md
@@ -0,0 +1,57 @@
+# Source: implementations/food_price_forecasting/starter_agent/skills/forecasting/SKILL.md
+
+kind: markdown
+
+---
+name: forecasting
+description: >-
+  The output contract for producing a structured probabilistic forecast — the
+  JSON shape, the calibration and quantile rules, and how to submit it. Load
+  this ONLY when your task payload asks for a forecast; ignore it for
+  open-ended questions. No scripts.
+---
+
+# Forecasting skill
+
+Load this when your task payload asks for a structured forecast. For open-ended
+questions, ignore it and just answer.
+
+## What you'll receive
+
+A JSON payload describing the task: a `task` id, the `as_of` cutoff date,
+`horizons` (steps ahead), the `standard_quantiles` grid, a `target_summary`, the
+recent `target_history_csv`, and an `output_schema` showing the exact JSON to
+return.
+
+## The output contract
+
+1. Produce **one forecast per horizon** in `horizons`.
+2. Use **exactly** the levels in `standard_quantiles` — no additions or omissions.
+3. `point_forecast` must equal the **0.50 quantile** value.
+4. Quantile values must be **non-decreasing** as the quantile level rises.
+5. Use ONLY information available on or before `as_of`.
+6. Put your reasoning in the `rationale` fields.
+
+Submit by calling `set_model_response` with a `json_response` string that
+matches the payload's `output_schema` **exactly** — use `"horizon"` (an
+integer), and make `"quantiles"` a **list** of `{"quantile": <level>, "value":
+<number>}` objects. Omit any field not shown in the schema.
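+
+A hedged sketch of how rules 2 to 4 could be checked before submitting. The
+helper `check_forecast` and the five-level grid are hypothetical; the real grid
+is whatever `standard_quantiles` holds in your payload, and the check assumes
+the levels arrive in ascending order.
+
+```python
+import math
+
+# Hypothetical grid; use the payload's `standard_quantiles` in practice.
+standard_quantiles = [0.1, 0.25, 0.5, 0.75, 0.9]
+
+
+def check_forecast(forecast: dict, levels: list[float]) -> list[str]:
+    """Return contract violations for one horizon (empty list means it passes)."""
+    problems = []
+    got_levels = [q["quantile"] for q in forecast["quantiles"]]
+    values = [q["value"] for q in forecast["quantiles"]]
+    if got_levels != levels:  # rule 2: exactly the standard levels, in order
+        problems.append("quantile levels differ from standard_quantiles")
+    if any(b < a for a, b in zip(values, values[1:])):  # rule 4
+        problems.append("quantile values decrease")
+    median = dict(zip(got_levels, values)).get(0.5)
+    if median is None or not math.isclose(forecast["point_forecast"], median):
+        problems.append("point_forecast != 0.50 quantile")  # rule 3
+    return problems
+
+
+def make(point: float, values: list[float]) -> dict:
+    """Build one illustrative forecast entry in the list-of-objects shape."""
+    return {
+        "horizon": 6,
+        "point_forecast": point,
+        "quantiles": [
+            {"quantile": q, "value": v} for q, v in zip(standard_quantiles, values)
+        ],
+    }
+
+
+good = make(190.0, [187.0, 189.0, 190.0, 191.5, 193.0])
+bad = make(191.0, [187.0, 189.0, 190.0, 189.5, 193.0])
+print(check_forecast(good, standard_quantiles))
+print(check_forecast(bad, standard_quantiles))
+```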
+
+## Calibration
+
+Report calibrated intervals, not false precision: across many forecasts, the
+truth should land inside your stated 80% band about 80% of the time.
+Anchor the point on the recent level and trend; let recent **volatility** set
+how wide the bands are, and widen them as the horizon grows.
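+
+To make the 80% target concrete, here is a small sketch of an empirical
+coverage check. The bands and realised values are invented for illustration,
+and `empirical_coverage` is a hypothetical helper, not part of this project.
+
+```python
+def empirical_coverage(
+    bands: list[tuple[float, float]], actuals: list[float]
+) -> float:
+    """Fraction of realised values that fall inside their stated band."""
+    hits = sum(lo <= y <= hi for (lo, hi), y in zip(bands, actuals))
+    return hits / len(actuals)
+
+
+# Invented 80% bands (0.10 to 0.90 quantile) and realised index values.
+bands = [
+    (185.0, 191.0),
+    (186.0, 193.0),
+    (188.0, 196.0),
+    (189.0, 198.0),
+    (190.0, 200.0),
+]
+actuals = [188.2, 194.1, 190.5, 197.0, 199.9]
+
+coverage = empirical_coverage(bands, actuals)
+print(f"coverage = {coverage:.0%}")  # compare against the nominal 80%
+```
+
+Coverage well below the nominal level suggests the bands are too narrow;
+coverage well above it suggests they are wider than they need to be.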
+
+## Domain focus (edit this for your use case)
+
+For monthly food CPI, anchor on the last index value and its recent
+year-over-year change; the index moves slowly and seasonally, so month-to-month
+jumps are small. Widen the bands with the horizon, and adjust for food-inflation
+or input-cost signals you have real evidence for — and say so in the rationale.
+
+## Room to grow
+
+- Tighten the calibration guidance with your own backtest findings.
+- Add worked examples of good vs. over-confident forecasts.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__starter_agent__skills__research-playbook__SKILL.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__starter_agent__skills__research-playbook__SKILL.md.md
new file mode 100644
index 0000000..467eb3c
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__food_price_forecasting__starter_agent__skills__research-playbook__SKILL.md.md
@@ -0,0 +1,45 @@
+# Source: implementations/food_price_forecasting/starter_agent/skills/research-playbook/SKILL.md
+
+kind: markdown
+
+---
+name: research-playbook
+description: >-
+  How to use the search_web tool well when grounding a forecast in recent
+  news — phrase cutoff-aware queries, decide what is worth searching for, and
+  weigh sources. Load this before your first search_web call. No scripts.
+---
+
+# Research playbook
+
+A short guide to getting real signal out of `search_web`. This is a starter
+skill — extend it with the queries and sources that work for your problem.
+
+## The one rule that matters
+
+Always pass `cutoff_date` equal to the `as_of` date in your payload. It is the
+temporal fence that keeps post-origin information out of a historical forecast.
+A forecast that "knew" what happened after `as_of` is not a forecast.
+
+## How to search
+
+- **Search before you forecast, not after.** Gather context first, then reason.
+- **One topic per query.** Several focused queries beat one broad one. Stop when
+  new queries stop returning new facts.
+- **Ask for the present state, not a prediction.** "current OPEC+ production
+  policy" returns facts; "will oil go up" returns noise.
+- **Weigh sources.** Prefer primary releases and major outlets; treat a single
+  blog or forum post as a lead to confirm, not a fact.
+
+## Domain focus (edit this for your use case)
+
+For Canadian food CPI, the key signals are recent food-inflation prints and the
+headline CPI trend vs the 2% target, commodity and input costs (grains, energy,
+fertiliser), supply-chain and weather disruptions, and the CAD exchange rate.
+Search for the *current state* of these, then let the series trend set the baseline.
+
+## Room to grow
+
+- Add a curated list of go-to sources for your domain.
+- Track which queries paid off and prune the ones that didn't.
+- Add a `references/` file with example high-signal searches.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__00_environment_check.ipynb.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__00_environment_check.ipynb.md
new file mode 100644
index 0000000..6d3a1de
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__00_environment_check.ipynb.md
@@ -0,0 +1,618 @@
+# Source: implementations/getting_started/00_environment_check.ipynb
+
+kind: notebook
+
+## Cell 1 (markdown)
+
+# 00 · Environment Check — start here
+
+Welcome! This notebook is a **self-guided preflight** for the agentic-forecasting
+project. It checks every major capability you'll need — one cell at a time — and
+tells you in plain language what (if anything) is wrong and how to fix it.
+
+**How to use it**
+
+1. Run the cells top to bottom (`Run All` is safe — nothing here changes your data).
+2. Read each result:
+   - ✅ **PASS** — that capability works.
+   - ⚠️ **WARN** — optional or degraded; you can usually proceed, but read the note.
+   - ❌ **FAIL** — something needs fixing before the forecasting notebooks will work.
+3. The final cell gives you a single verdict and a prioritized to-do list.
+
+**The most common cause of a ❌ is a missing or placeholder API key.**
+
+On **Coder workspaces**, bootcamp keys (`OPENAI_*`, `E2B_*`, `LANGFUSE_*`) are
+injected into your shell at startup. You do not need those in a repo `.env`.
+Optional personal keys (e.g. `FRED_API_KEY`) go in `.env` only. The inventory
+below reads the live environment, so bootcamp keys may show as ✅ even when
+`.env` is absent. If a key wasn't filled in correctly during setup, the relevant
+check below will tell you exactly which variable to fix.
+
+When everything is green, continue to
+[`01_cpi_data_exploration.ipynb`](01_cpi_data_exploration.ipynb) and
+[`02_cpi_backtest_demo.ipynb`](02_cpi_backtest_demo.ipynb).
+
+## Cell 2 (markdown)
+
+## Setup
+
+This cell optionally loads a repo `.env` (for personal keys like `FRED_API_KEY`), locates
+the repository root, and defines the small helpers used by every check below.
+Bootcamp keys come from your shell environment and are never overwritten by
+`.env`. It imports nothing from the project yet, so it should always succeed.
+
+## Cell 3 (code)
+
+```python
+from __future__ import annotations
+
+import asyncio
+import contextvars
+import os
+from concurrent.futures import ThreadPoolExecutor
+from pathlib import Path
+
+from dotenv import load_dotenv
+
+
+# --- Locate the repo root robustly (works regardless of the kernel's cwd) ----
+def find_repo_root(start: Path | None = None) -> Path:
+    """Walk upward until we find the workspace root (has pyproject + aieng-forecasting)."""
+    here = (start or Path.cwd()).resolve()
+    for cand in (here, *here.parents):
+        if (cand / "pyproject.toml").exists() and (cand / "aieng-forecasting").is_dir():
+            return cand
+    # Fallback: this notebook lives two levels under the root (honours `start` if given).
+    return here.parents[1]
+
+
+ROOT = find_repo_root()
+load_dotenv(ROOT / ".env", override=False)  # optional FRED etc.; shell env wins
+print(f"Repository root: {ROOT}")
+print(f".env present:    {(ROOT / '.env').exists()}")
+
+# --- Result tracking + uniform reporting ------------------------------------
+RESULTS: list[dict[str, str]] = []
+_ICONS = {"PASS": "✅", "WARN": "⚠️", "FAIL": "❌"}
+
+
+def report(name: str, status: str, detail: str = "", fix: str = "") -> str:
+    """Print a uniform check result and record it for the final summary."""
+    RESULTS.append({"name": name, "status": status, "detail": detail})
+    print(f"{_ICONS[status]}  {status} — {name}")
+    for line in str(detail).splitlines():
+        print(f"      {line}")
+    if fix:
+        print("      ── How to fix ─────────────────────────────")
+        for line in fix.strip("\n").splitlines():
+            print(f"      {line}")
+    return status
+
+
+def ok(name: str, detail: str = "") -> str:
+    return report(name, "PASS", detail)
+
+
+def warn(name: str, detail: str = "", fix: str = "") -> str:
+    return report(name, "WARN", detail, fix)
+
+
+def fail(name: str, detail: str = "", fix: str = "") -> str:
+    return report(name, "FAIL", detail, fix)
+
+
+# --- Environment-variable helpers -------------------------------------------
+def _is_placeholder(value: str) -> bool:
+    s = value.strip()
+    return (not s) or s.startswith("your_") or s.endswith("...")
+
+
+def env(key: str) -> str:
+    """Return a stripped env value, or '' if missing/placeholder."""
+    raw = os.environ.get(key, "").strip()
+    return "" if _is_placeholder(raw) else raw
+
+
+def env_ok(key: str) -> bool:
+    return bool(env(key))
+
+
+def mask(value: str) -> str:
+    """Show only the last 4 characters of a secret (never echo it in full)."""
+    v = (value or "").strip()
+    if not v:
+        return "(not set)"
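+    # Values of 4 characters or fewer are hidden entirely; echoing them would reveal the whole secret.
+    return "…" if len(v) <= 4 else "…" + v[-4:]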
+
+
+# --- Run an async coroutine from a notebook cell ----------------------------
+def run_async(coro):
+    """Run a coroutine whether or not an event loop is already running (Jupyter)."""
+    try:
+        asyncio.get_running_loop()
+    except RuntimeError:
+        return asyncio.run(coro)
+    ctx = contextvars.copy_context()
+    with ThreadPoolExecutor(max_workers=1) as pool:
+        return pool.submit(ctx.run, asyncio.run, coro).result()
+
+
+print("Helpers ready.")
+```
+
+## Cell 4 (markdown)
+
+## 1 · API key inventory
+
+A quick look at which environment variables are present in your **shell
+environment** (and optional `.env`), missing, or still hold a placeholder value.
+On Coder, bootcamp keys mostly come from onboarding — not from `.env`. This is
+**informational** — it doesn't pass or fail on its own, but it explains most of
+the results further down.
+
+| Tier | Variable | Used for |
+|---|---|---|
+| Required | `OPENAI_BASE_URL`, `OPENAI_API_KEY` | LLM inference via the Vector proxy |
+| Required | `E2B_API_KEY` | Sandboxed code execution for agents |
+| Recommended | `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `LANGFUSE_HOST` | Trace logging |
+| Optional | `FRED_API_KEY` | FRED data (apply for a free key if you want it) |
+
+## Cell 5 (code)
+
+```python
+_INVENTORY = [
+    ("Required", "OPENAI_BASE_URL"),
+    ("Required", "OPENAI_API_KEY"),
+    ("Required", "E2B_API_KEY"),
+    ("Recommended", "LANGFUSE_PUBLIC_KEY"),
+    ("Recommended", "LANGFUSE_SECRET_KEY"),
+    ("Recommended", "LANGFUSE_HOST"),
+    ("Optional", "FRED_API_KEY"),
+]
+
+
+def _status_symbol(key: str) -> str:
+    raw = os.environ.get(key, "").strip()
+    if not raw:
+        return "❌ missing"
+    if _is_placeholder(raw):
+        return "⚠️ placeholder"
+    return "✅ set"
+
+
+print(f"{'Tier':<12} {'Variable':<22} {'Status':<16} Value")
+print("-" * 70)
+for tier, key in _INVENTORY:
+    # OPENAI_BASE_URL / LANGFUSE_HOST are URLs — fine to show; secrets are masked.
+    show = os.environ.get(key, "") if key.endswith(("_URL", "_HOST")) else mask(os.environ.get(key, ""))
+    print(f"{tier:<12} {key:<22} {_status_symbol(key):<16} {show or '(not set)'}")
+
+print()
+print("Legend: ✅ set   ⚠️ still a placeholder from .env.example   ❌ not set")
+```
+
+## Cell 6 (markdown)
+
+## 2 · Package imports & native libraries
+
+Confirms the project packages import cleanly and that LightGBM's native
+dependency (OpenMP) loads. The most common snag here is on macOS, where the
+LightGBM wheel needs Homebrew's `libomp`.
+
+## Cell 7 (code)
+
+```python
+try:
+    # Import the project and LightGBM; LightGBM's import triggers the native
+    # OpenMP (libomp) load that is the usual macOS setup snag.
+    import aieng.forecasting  # noqa: F401
+    import lightgbm  # noqa: F401
+    from aieng.forecasting.data import DataService, SeriesMetadata  # noqa: F401
+    from aieng.forecasting.evaluation import BacktestSpec, backtest  # noqa: F401
+    from aieng.forecasting.methods import LastValuePredictor  # noqa: F401
+    from aieng.forecasting.models import LITE_MODEL
+
+    ok(
+        "Package imports & LightGBM/OpenMP",
+        f"aieng.forecasting, LightGBM {lightgbm.__version__}, default model {LITE_MODEL!r}.",
+    )
+except Exception as exc:  # noqa: BLE001
+    msg = str(exc)
+    if "libomp" in msg or "Library not loaded" in msg:
+        fail(
+            "Package imports & LightGBM/OpenMP",
+            f"LightGBM could not load OpenMP: {msg}",
+            fix=(
+                "macOS only — install Homebrew's OpenMP, then restart the Jupyter kernel:\n"
+                "    brew install libomp\n"
+                "On Apple Silicon the dylib lives under /opt/homebrew/opt/libomp/lib/."
+            ),
+        )
+    else:
+        fail(
+            "Package imports & LightGBM/OpenMP",
+            f"Import failed: {type(exc).__name__}: {msg}",
+            fix=(
+                "Reinstall the workspace from the repo root:\n"
+                "    uv sync\n"
+                "Then restart the Jupyter kernel and re-run this cell."
+            ),
+        )
+```
+
+## Cell 8 (markdown)
+
+## 3 · LLM inference via the Vector proxy
+
+Sends one tiny completion to the **default model** through the proxy. This is the
+single most important check — almost every notebook depends on it. It routes
+exactly the way the library does (`openai/<model>` + `api_base`).
+
+## Cell 9 (code)
+
+```python
+_NAME = "LLM inference via proxy"
+
+if not env_ok("OPENAI_BASE_URL") or not env_ok("OPENAI_API_KEY"):
+    missing = [k for k in ("OPENAI_BASE_URL", "OPENAI_API_KEY") if not env_ok(k)]
+    fail(
+        _NAME,
+        f"Required proxy setting(s) not configured: {', '.join(missing)}.",
+        fix=(
+            "Set these in your .env at the repository root (see .env.example):\n"
+            "    OPENAI_BASE_URL=https://proxy.vectorinstitute.ai/v1\n"
+            "    OPENAI_API_KEY=<your key>\n"
+            "If they look set but this still fails, check for a leftover placeholder value."
+        ),
+    )
+else:
+    try:
+        import litellm
+        from aieng.forecasting.models import LITE_MODEL
+
+        resp = litellm.completion(
+            model=f"openai/{LITE_MODEL}",
+            api_base=env("OPENAI_BASE_URL"),
+            api_key=env("OPENAI_API_KEY"),
+            messages=[{"role": "user", "content": "Reply with exactly: OK"}],
+            max_tokens=16,
+            temperature=0,
+        )
+        text = (resp.choices[0].message.content or "").strip()
+        ok(_NAME, f"Model {LITE_MODEL!r} responded: {text!r}")
+    except Exception as exc:  # noqa: BLE001
+        msg = str(exc)
+        low = msg.lower()
+        if any(t in low for t in ("auth", "401", "403", "api key", "unauthorized", "forbidden")):
+            fix = (
+                "Your OPENAI_API_KEY was rejected. Re-check it in .env — copy it again from "
+                "your setup credentials, with no surrounding quotes or whitespace."
+            )
+        elif any(t in low for t in ("connect", "timeout", "resolve", "connection", "getaddrinfo", "name or service")):
+            fix = (
+                f"Could not reach the proxy at {env('OPENAI_BASE_URL')!r}. Check your network/VPN "
+                "and that OPENAI_BASE_URL is correct (it should end in /v1)."
+            )
+        else:
+            fix = (
+                "Verify OPENAI_BASE_URL and OPENAI_API_KEY in .env, then restart the kernel. "
+                "The full error above usually names the cause."
+            )
+        fail(_NAME, f"{type(exc).__name__}: {msg}", fix=fix)
+```
+
+## Cell 10 (markdown)
+
+## 4 · Langfuse tracing connection
+
+Langfuse records traces of LLM and agent runs so you can inspect them in the UI.
+It's **recommended but optional** — the forecasting notebooks run without it, you
+just won't get trace links.
+
+## Cell 11 (code)
+
+```python
+_NAME = "Langfuse tracing"
+
+if not (env_ok("LANGFUSE_PUBLIC_KEY") and env_ok("LANGFUSE_SECRET_KEY")):
+    warn(
+        _NAME,
+        "Langfuse credentials are not set — tracing will be skipped (this is OK to proceed).",
+        fix=(
+            "To enable trace logging, set these in .env (from your Langfuse project settings):\n"
+            "    LANGFUSE_PUBLIC_KEY=pk-lf-...\n"
+            "    LANGFUSE_SECRET_KEY=sk-lf-...\n"
+            "    LANGFUSE_HOST=https://us.cloud.langfuse.com"
+        ),
+    )
+else:
+    try:
+        from aieng.forecasting.langfuse_tracing import init_langfuse_tracing
+        from langfuse import get_client
+
+        init_langfuse_tracing()
+        client = get_client()
+        if client.auth_check():
+            host = env("LANGFUSE_HOST") or "https://cloud.langfuse.com"
+            ok(_NAME, f"Authenticated to {host} (public key {mask(env('LANGFUSE_PUBLIC_KEY'))}).")
+        else:
+            fail(
+                _NAME,
+                "Credentials are set but Langfuse auth_check() returned False.",
+                fix=(
+                    "Re-check LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY and that LANGFUSE_HOST "
+                    "matches your project's region (e.g. https://us.cloud.langfuse.com)."
+                ),
+            )
+    except Exception as exc:  # noqa: BLE001
+        fail(
+            _NAME,
+            f"{type(exc).__name__}: {exc}",
+            fix=(
+                "Confirm the three LANGFUSE_* variables in .env and that LANGFUSE_HOST is reachable, "
+                "then restart the kernel."
+            ),
+        )
+```
+
+## Cell 12 (markdown)
+
+## 5 · E2B code execution sandbox
+
+Agentic forecasters run code in an E2B cloud sandbox. This runs a trivial snippet
+(`print(1 + 1)`) end-to-end. A failure here is usually either a missing
+`E2B_API_KEY` or a sandbox **template that hasn't been built yet** — the messages
+below distinguish the two.
+
+## Cell 13 (code)
+
+```python
+_NAME = "E2B code execution"
+
+if not env_ok("E2B_API_KEY"):
+    fail(
+        _NAME,
+        "E2B_API_KEY is not set — code execution cannot run.",
+        fix=(
+            "1. Create a free account at https://e2b.dev and copy your API key.\n"
+            "2. Add it to .env at the repository root:\n"
+            "       E2B_API_KEY=<your key>\n"
+            "3. Restart the kernel and re-run this cell."
+        ),
+    )
+else:
+    # Mirror the project default; fall back to the literal if the agentic extra
+    # (which pulls in google-adk) is not importable in this kernel.
+    try:
+        from aieng.forecasting.methods.agentic.agent_factory import CodeExecutionConfig
+
+        template_name = CodeExecutionConfig().template_name
+    except Exception:  # noqa: BLE001
+        template_name = "agentic-forecasting-bootcamp"
+
+    try:
+        import json
+
+        from aieng.agents.tools.code_interpreter import CodeInterpreter
+
+        ci = CodeInterpreter(template_name=template_name)
+        raw = run_async(ci.run_code("print(1 + 1)"))
+        out = json.loads(raw)
+        stdout = "".join(out.get("stdout", []))
+        if out.get("error"):
+            err = out["error"]
+            fail(_NAME, f"Sandbox ran but raised: {err.get('name')}: {err.get('value')}")
+        elif "2" in stdout:
+            ok(_NAME, f"Sandbox (template {template_name!r}) executed code and returned: {stdout.strip()!r}")
+        else:
+            warn(_NAME, f"Sandbox ran but produced unexpected output: {stdout!r}")
+    except Exception as exc:  # noqa: BLE001
+        msg = str(exc)
+        low = msg.lower()
+        if "template" in low and ("not found" in low or "does not exist" in low or "notfound" in low):
+            fix = (
+                f"The sandbox template {template_name!r} hasn't been built yet. Build it once "
+                "(takes a few minutes):\n"
+                "    uv run --env-file .env scripts/build_e2b_template.py"
+            )
+        elif any(t in low for t in ("auth", "401", "403", "api key", "unauthorized", "invalid")):
+            fix = (
+                "Your E2B_API_KEY was rejected. Re-copy it from https://e2b.dev into .env, "
+                "with no surrounding quotes or whitespace."
+            )
+        else:
+            fix = (
+                "Check that E2B_API_KEY is valid and the template has been built "
+                "(uv run --env-file .env scripts/build_e2b_template.py). The error above names the cause."
+            )
+        fail(_NAME, f"{type(exc).__name__}: {msg}", fix=fix)
+```
+
+## Cell 14 (markdown)
+
+## 6 · StatCan data access
+
+Pulls one real CPI series (Canada gasoline) from Statistics Canada. The first run
+downloads and caches the table under `data/statcan/`; later runs read the cache.
+If you're offline but the cache already exists, this degrades to a ⚠️.
+
+## Cell 15 (code)
+
+```python
+_NAME = "StatCan data pull"
+
+try:
+    from aieng.forecasting.data.adapters import StatCanAdapter
+
+    adapter = StatCanAdapter(
+        table_id="18-10-0004-11",
+        member_filter={"GEO": "Canada", "Products and product groups": "Gasoline"},
+        cache_dir=ROOT / "data" / "statcan",
+    )
+    df = adapter.fetch()
+    start = df["timestamp"].min().strftime("%Y-%m")
+    end = df["timestamp"].max().strftime("%Y-%m")
+    ok(_NAME, f"Fetched cpi_gasoline_canada: {len(df)} rows, {start} → {end}.")
+except Exception as exc:  # noqa: BLE001
+    cache_file_exists = (ROOT / "data" / "statcan").exists() and any((ROOT / "data" / "statcan").glob("*.zip"))
+    if cache_file_exists:
+        warn(
+            _NAME,
+            f"Live fetch failed ({type(exc).__name__}: {exc}) but a local StatCan cache exists.",
+            fix="Likely a transient network issue. The cached data is usable; re-run later to refresh.",
+        )
+    else:
+        fail(
+            _NAME,
+            f"{type(exc).__name__}: {exc}",
+            fix=(
+                "Populate the local data cache once from the repo root:\n"
+                "    uv run python scripts/fetch_cpi.py\n"
+                "This needs network access to Statistics Canada the first time."
+            ),
+        )
+```
+
+## Cell 16 (markdown)
+
+## 7 · FRED data access (optional)
+
+FRED (US Federal Reserve Economic Data) needs a **free API key**. It's optional —
+only some implementations use it. If you don't have a key, this is a ⚠️ with
+instructions, not a failure. If you do, we validate it with a live fetch.
+
+## Cell 17 (code)
+
+```python
+_NAME = "FRED data pull"
+
+if not env_ok("FRED_API_KEY"):
+    warn(
+        _NAME,
+        "FRED_API_KEY is not set. This is optional — skip it unless you need FRED series.",
+        fix=(
+            "FRED requires a free API key. To get one:\n"
+            "  1. Request it at https://fred.stlouisfed.org/docs/api/api_key.html\n"
+            "  2. Add it to .env at the repository root:\n"
+            "         FRED_API_KEY=<your key>\n"
+            "  3. Restart the kernel and re-run this cell."
+        ),
+    )
+else:
+    try:
+        from aieng.forecasting.data.adapters import FREDAdapter
+
+        # refresh=True forces a live API call so we actually validate the key.
+        adapter = FREDAdapter("EXCAUS", cache_dir=ROOT / "data" / "fred", refresh=True)
+        df = adapter.fetch()
+        latest = df.iloc[-1]
+        ok(
+            _NAME,
+            f"Validated FRED key — fetched EXCAUS (CAD/USD): {len(df)} rows, "
+            f"latest {latest['timestamp'].strftime('%Y-%m')} = {latest['value']:.4f}.",
+        )
+    except Exception as exc:  # noqa: BLE001
+        fail(
+            _NAME,
+            f"{type(exc).__name__}: {exc}",
+            fix=(
+                "Your FRED_API_KEY may be invalid. Re-copy it from "
+                "https://fred.stlouisfed.org/docs/api/api_key.html into .env, then restart the kernel."
+            ),
+        )
+```
+
+## Cell 18 (markdown)
+
+## 8 · End-to-end mini forecast
+
+The real thing in miniature: load the getting-started backtest spec, register the
+gasoline series, run a `LastValuePredictor` backtest, and score it (CRPS). This
+proves the whole **data → predictor → backtest → score** loop works — not just the
+individual services. It uses only the StatCan cache (no LLM/network).
+
+## Cell 19 (code)
+
+```python
+_NAME = "End-to-end mini forecast"
+
+try:
+    import yaml
+    from aieng.forecasting.data import DataService, SeriesMetadata
+    from aieng.forecasting.data.adapters import StatCanAdapter
+    from aieng.forecasting.evaluation import BacktestSpec, backtest
+    from aieng.forecasting.methods import LastValuePredictor
+
+    spec_path = ROOT / "implementations" / "getting_started" / "specs" / "cpi_gasoline_1m.yaml"
+    spec = BacktestSpec.model_validate(yaml.safe_load(spec_path.read_text()))
+
+    svc = DataService()
+    svc.register(
+        "cpi_gasoline_canada",
+        StatCanAdapter(
+            table_id="18-10-0004-11",
+            member_filter={"GEO": "Canada", "Products and product groups": "Gasoline"},
+            cache_dir=ROOT / "data" / "statcan",
+        ),
+        SeriesMetadata(
+            series_id="cpi_gasoline_canada",
+            description="CPI Gasoline, Canada (2002=100)",
+            source="StatCan",
+            units="Index 2002=100",
+            frequency="MS",
+            table_id="18-10-0004-11",
+        ),
+    )
+
+    result = backtest(LastValuePredictor(), spec, svc)
+    ok(
+        _NAME,
+        f"Ran {result.predictor_id} over the gasoline backtest — "
+        f"mean {result.metric.upper()} = {result.mean_score:.4f}.",
+    )
+except Exception as exc:  # noqa: BLE001
+    fail(
+        _NAME,
+        f"{type(exc).__name__}: {exc}",
+        fix=(
+            "This depends on the StatCan check above. If that failed, fix it first "
+            "(uv run python scripts/fetch_cpi.py), then restart the kernel and re-run."
+        ),
+    )
+```
+
+## Cell 20 (markdown)
+
+## Summary
+
+A single verdict and, if needed, a prioritized list of what to fix.
+
+## Cell 21 (code)
+
+```python
+_passed = [r for r in RESULTS if r["status"] == "PASS"]
+_warned = [r for r in RESULTS if r["status"] == "WARN"]
+_failed = [r for r in RESULTS if r["status"] == "FAIL"]
+
+print("=" * 64)
+print(f"  Checks run: {len(RESULTS)}    ✅ {len(_passed)}    ⚠️ {len(_warned)}    ❌ {len(_failed)}")
+print("=" * 64)
+
+if _failed:
+    print("\n❌ Fix these before continuing (most are missing/placeholder keys in .env):")
+    for r in _failed:
+        print(f"   • {r['name']}")
+if _warned:
+    print("\n⚠️ Optional / heads-up (you can usually proceed):")
+    for r in _warned:
+        print(f"   • {r['name']}")
+
+print()
+if not _failed:
+    print("🎉 You're ready! Open 01_cpi_data_exploration.ipynb to begin.")
+    if _warned:
+        print("   (The ⚠️ items above are optional — enable them later if you need them.)")
+else:
+    print("Re-run this notebook after editing .env and restarting the kernel.")
+    print("Most ❌ items are a key that wasn't filled in during setup — scroll up for the exact fix.")
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__01_cpi_data_exploration.ipynb.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__01_cpi_data_exploration.ipynb.md
new file mode 100644
index 0000000..e482838
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__01_cpi_data_exploration.ipynb.md
@@ -0,0 +1,232 @@
+# Source: implementations/getting_started/01_cpi_data_exploration.ipynb
+
+kind: notebook
+
+## Cell 1 (markdown)
+
+# Getting Started — CPI Data Exploration
+
+A short warm-up that demonstrates the data service layer using three
+Canadian CPI series with deliberately different dynamics:
+
+| Series | Why it's here |
+|---|---|
+| `cpi_all_items_canada` | The headline. Smooth trend, a useful baseline for comparison. |
+| `cpi_gasoline_canada` | **The getting-started target.** Violently volatile (2008 collapse, 2014–16 OPEC decline, 2020 COVID crash, 2022 surge). |
+| `cpi_shelter_canada` | Sticky, persistent trend with a dramatic 2021–24 regime shift — a good "try this next" alternate. |
+
+`scripts/fetch_cpi.py` registers all 47 product-group series available in
+StatCan table 18-10-0004-11.  This notebook selects three of them to keep
+the first look uncluttered.  Swap in any other series from the script to
+explore further.
+
+**Before running this notebook**, populate the local data cache:
+
+```bash
+uv run python scripts/fetch_cpi.py
+```
+
+After that, no network calls are made — the notebook reads entirely from
+the local cache, which is what the `CutoffEnforcer` discipline requires.
+
+## Cell 2 (code)
+
+```python
+from datetime import datetime
+from pathlib import Path
+
+import matplotlib.pyplot as plt
+from aieng.forecasting.data import DataService, SeriesMetadata
+from aieng.forecasting.data.adapters import StatCanAdapter
+from aieng.forecasting.evaluation import ForecastingTask
+```
+
+## Cell 3 (markdown)
+
+## 1. Build the DataService
+
+Register the three focus series.  This reads from the local StatCan
+cache — run `scripts/fetch_cpi.py` first if the cache is empty.
+
+## Cell 4 (code)
+
+```python
+CPI_TABLE_ID = "18-10-0004-11"
+# Resolve the repo-root data cache so this notebook reuses the cache populated
+# by `scripts/fetch_cpi.py`. Assumes the kernel's CWD is this notebook's directory.
+ROOT = Path.cwd().resolve().parents[1]
+CACHE_DIR = ROOT / "data" / "statcan"
+
+# (series_id, StatCan product-group label, short label)
+CPI_SERIES = [
+    ("cpi_all_items_canada", "All-items", "All-items"),
+    ("cpi_gasoline_canada", "Gasoline", "Gasoline"),
+    ("cpi_shelter_canada", "Shelter", "Shelter"),
+]
+
+svc = DataService()
+
+for series_id, product_group, short_label in CPI_SERIES:
+    adapter = StatCanAdapter(
+        table_id=CPI_TABLE_ID,
+        member_filter={"GEO": "Canada", "Products and product groups": product_group},
+        cache_dir=CACHE_DIR,
+    )
+    metadata = SeriesMetadata(
+        series_id=series_id,
+        description=f"CPI {short_label}, Canada (2002=100)",
+        source="StatCan",
+        units="Index 2002=100",
+        frequency="MS",
+        table_id=CPI_TABLE_ID,
+    )
+    svc.register(series_id, adapter, metadata)
+    print(f"  Registered: {series_id}")
+
+print("\nDone.")
+```
+
+## Cell 5 (markdown)
+
+## 2. Inspect the registered series
+
+## Cell 6 (code)
+
+```python
+summary = svc.summary()
+summary["start"] = summary["start"].dt.strftime("%Y-%m")
+summary["end"] = summary["end"].dt.strftime("%Y-%m")
+summary
+```
+
+## Cell 7 (markdown)
+
+## 3. Cutoff filtering — the core discipline
+
+Calling `get_series(series_id, as_of=...)` returns only observations that
+were available on or before `as_of`.  This is how backtests and live
+forecasts share the same code path — the `as_of` date is the only
+difference.  Predictors never deal with this directly; they receive a
+`ForecastContext` in which the cutoff is already applied.
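+
+As a rough illustration of the cutoff rule (a toy sketch, not the library's
+implementation), the filter below keeps only observations stamped on or before
+`as_of`. The `Observation` class and `filter_as_of` helper are hypothetical names
+introduced here, and the values are made up; the real `DataService` may apply
+additional rules that this sketch does not model.
+
+```python
+from dataclasses import dataclass
+from datetime import datetime
+
+
+@dataclass(frozen=True)
+class Observation:
+    timestamp: datetime
+    value: float
+
+
+def filter_as_of(observations, as_of):
+    """Keep observations whose timestamp is on or before the cutoff."""
+    return [obs for obs in observations if obs.timestamp <= as_of]
+
+
+# Made-up monthly observations around the cutoff.
+history = [
+    Observation(datetime(2022, 11, 1), 101.0),
+    Observation(datetime(2022, 12, 1), 102.0),
+    Observation(datetime(2023, 1, 1), 103.0),
+    Observation(datetime(2023, 2, 1), 104.0),
+]
+visible = filter_as_of(history, as_of=datetime(2023, 1, 1))
+print([obs.timestamp.date().isoformat() for obs in visible])
+```
+
+Running the same filter with a later `as_of` is all that separates a backtest
+origin from a live forecast in this picture.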
+
+## Cell 8 (code)
+
+```python
+cutoff = datetime(2023, 1, 1)
+df_cutoff = svc.get_series("cpi_gasoline_canada", as_of=cutoff)
+
+print(f"as_of: {cutoff.date()}")
+print(f"Rows returned: {len(df_cutoff)}")
+print(f"Latest observation: {df_cutoff['timestamp'].max().date()}")
+df_cutoff.tail()
+```
+
+## Cell 9 (markdown)
+
+## 4. Plot: the three series side by side
+
+The levels plot highlights the most striking thing about gasoline: its
+scale dominates everything else.  The small-multiples view lets each
+series speak on its own.
+
+## Cell 10 (code)
+
+```python
+AS_OF = datetime(2026, 1, 1)
+PLOT_START = "2000-01"
+
+series_data = {
+    series_id: (svc.get_series(series_id, as_of=AS_OF).set_index("timestamp").sort_index().loc[PLOT_START:])
+    for series_id, _, __ in CPI_SERIES
+}
+
+fig, ax = plt.subplots(figsize=(12, 4.5))
+for series_id, _, short_label in CPI_SERIES:
+    df = series_data[series_id]
+    ax.plot(df.index, df["value"], label=short_label, linewidth=1.6)
+ax.set_title("Canada CPI — Three focus series (2002=100)", fontsize=13)
+ax.set_xlabel("Date")
+ax.set_ylabel("Index (2002=100)")
+ax.legend()
+ax.grid(True, alpha=0.3)
+plt.tight_layout()
+plt.show()
+
+fig, axes = plt.subplots(1, 3, figsize=(13, 3.8), sharex=True)
+for ax, (series_id, _, short_label) in zip(axes, CPI_SERIES):
+    df = series_data[series_id]
+    ax.plot(df.index, df["value"], linewidth=1.3, color="steelblue")
+    ax.set_title(short_label, fontsize=11)
+    ax.grid(True, alpha=0.3)
+    ax.tick_params(axis="x", labelrotation=30, labelsize=8)
+fig.suptitle("Canada CPI — Small multiples", fontsize=12, y=1.02)
+fig.supylabel("Index (2002=100)", x=0.02)
+plt.tight_layout()
+plt.show()
+```
+
+## Cell 11 (markdown)
+
+## 5. Year-over-year change
+
+Derived from the index levels — storing levels and computing changes
+on demand is the right pattern.  Note how gasoline's YoY amplitude dwarfs
+shelter and all-items; this is the "why forecasting gasoline is hard"
+story in a single chart, and the motivation for picking gasoline as the
+getting-started target.
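+
+For reference, the year-over-year change of a monthly index is the percent
+change against the value 12 months earlier. The sketch below is a plain-Python
+illustration with made-up levels; `yoy_pct` is a hypothetical helper and is only
+meant to mirror what a 12-period percent change computes.
+
+```python
+def yoy_pct(levels, lag=12):
+    """Percent change versus the value `lag` periods earlier; None where undefined."""
+    changes = []
+    for i, current in enumerate(levels):
+        if i < lag:
+            # Not enough history yet to look back `lag` periods.
+            changes.append(None)
+        else:
+            previous = levels[i - lag]
+            changes.append((current / previous - 1.0) * 100.0)
+    return changes
+
+
+# Made-up index levels: 13 months rising by one index point per month.
+levels = [100.0 + i for i in range(13)]
+changes = yoy_pct(levels)
+print(changes[-1])
+```
+
+The first 12 entries are undefined, which is why a YoY chart starts one year
+after the first plotted level.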
+
+## Cell 12 (code)
+
+```python
+YOY_START = "2000-01"
+
+fig, ax = plt.subplots(figsize=(12, 4.5))
+for series_id, _, short_label in CPI_SERIES:
+    df = series_data[series_id].loc[YOY_START:].copy()
+    df["yoy_pct"] = df["value"].pct_change(12) * 100
+    ax.plot(df.index, df["yoy_pct"], label=short_label, linewidth=1.3)
+ax.axhline(0, color="black", linewidth=0.8, linestyle="--")
+ax.set_title("Canada CPI — Year-over-year change (%)", fontsize=13)
+ax.set_xlabel("Date")
+ax.set_ylabel("YoY change (%)")
+ax.legend()
+ax.grid(True, alpha=0.3)
+plt.tight_layout()
+plt.show()
+```
+
+## Cell 13 (markdown)
+
+## 6. Define a `ForecastingTask`
+
+A `ForecastingTask` specifies the prediction problem without prescribing
+how a predictor should solve it.  For completeness, here's what the
+getting-started backtest task looks like as a plain Python construction —
+the YAML spec in `specs/cpi_gasoline_1m.yaml` is exactly this
+plus a window (`start`, `end`, `stride`, `warmup`).
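+
+To make the window idea concrete, here is a minimal sketch of how monthly
+forecast origins could be enumerated from `start`, `end`, and `stride`.
+`monthly_origins` is a hypothetical helper; it assumes an inclusive `end` and a
+stride measured in months, and it ignores `warmup`. The library's own
+`BacktestSpec` may define these differently.
+
+```python
+from datetime import date
+
+
+def monthly_origins(start, end, stride=1):
+    """List month-start dates from start to end (inclusive), stepping by stride months."""
+    origins = []
+    year, month = start.year, start.month
+    while date(year, month, 1) <= end:
+        origins.append(date(year, month, 1))
+        # Advance by `stride` months using a zero-based month counter.
+        month_index = year * 12 + (month - 1) + stride
+        year, month = divmod(month_index, 12)
+        month += 1
+    return origins
+
+
+origins = monthly_origins(date(2024, 11, 1), date(2025, 2, 1))
+print([d.isoformat() for d in origins])
+```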
+
+## Cell 14 (code)
+
+```python
+task = ForecastingTask(
+    task_id="cpi_gasoline_canada_1m",
+    target_series_id="cpi_gasoline_canada",
+    horizons=[1],
+    frequency="MS",
+    description=(
+        "Forecast Canada-wide CPI Gasoline (2002=100) 1 month ahead. "
+        "Resolution: observed CPI value at the target month-start timestamp."
+    ),
+)
+
+print(task.model_dump_json(indent=2))
+```
+
+## Cell 15 (markdown)
+
+## Next: `02_cpi_backtest_demo.ipynb`
+
+That notebook runs the full end-to-end backtest against the reference
+spec, compares a naive baseline to AutoARIMA, plots the predictions
+against the observed series with 80% CI bands, and walks through where
+each predictor fails and why.  Start there after this warm-up.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__02_cpi_backtest_demo.ipynb.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__02_cpi_backtest_demo.ipynb.md
new file mode 100644
index 0000000..3d69e78
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__02_cpi_backtest_demo.ipynb.md
@@ -0,0 +1,469 @@
+# Source: implementations/getting_started/02_cpi_backtest_demo.ipynb
+
+kind: notebook
+
+## Cell 1 (markdown)
+
+# Getting Started — CPI Gasoline Backtest
+
+The bootcamp's **hello-world** forecasting experiment.  Run this notebook
+top to bottom for an end-to-end tour of the evaluation framework on a
+single, deliberately-hard target.
+
+**Task.** Forecast Canada CPI Gasoline (index, 2002=100) exactly 1 month
+ahead.  Evaluated at every monthly origin from 2000 to 2025, with a
+held-out eval set covering Jan 2025 – Mar 2026.
+
+**Why gasoline.** Because it *breaks* our models.  The series has four
+textbook regime shifts in the evaluation window — the 2008 crude-oil
+collapse, the 2014–16 OPEC-led decline, the 2020 COVID demand shock,
+and the 2021–22 Russia/Ukraine surge.  That makes gasoline a good
+motivator for everything downstream: exogenous covariates, LLM context,
+agentic news retrieval.
+
+**Why 1-month ahead.** StatCan publishes CPI ~3 weeks after the
+reference month.  A forecast made at origin T resolves when the next
+CPI print lands — short enough to run genuine **live / prospective**
+tests as new data comes in month by month.
+
+**Predictors.**
+
+- `LastValuePredictor` — naive last-value baseline from `aieng.forecasting.methods.baselines.naive`.
+  The performance floor every predictor must beat.
+- `DartsAutoARIMAPredictor` — Darts `AutoARIMA` wrapped to emit
+  probabilistic forecasts via Monte Carlo sampling.  Imported from
+  `aieng.forecasting.methods.numerical.darts_arima`.
+
+**Score.** Continuous Ranked Probability Score (CRPS, lower is better) — rewards both
+calibration and sharpness.
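+
+For intuition, CRPS can be estimated from forecast samples as
+E|X - y| - 0.5 * E|X - X'|, where X and X' are independent draws from the
+forecast and y is the observed value. The sketch below implements that
+sample-based estimator in plain Python; `crps_from_samples` is a hypothetical
+name, and the framework's own scorer may use a different (for example
+quantile-based) formulation.
+
+```python
+def crps_from_samples(samples, observed):
+    """Sample-based CRPS estimate: mean |x - y| minus half the mean pairwise |x - x'|."""
+    n = len(samples)
+    accuracy = sum(abs(x - observed) for x in samples) / n
+    spread = sum(abs(a - b) for a in samples for b in samples) / (2 * n * n)
+    return accuracy - spread
+
+
+# A point forecast (all samples identical) reduces to the absolute error.
+point_score = crps_from_samples([3.0, 3.0, 3.0], observed=5.0)
+# A forecast that brackets the observation scores better than either endpoint alone.
+spread_score = crps_from_samples([4.0, 6.0], observed=5.0)
+print(point_score, spread_score)
+```
+
+The first case is why a naive last-value forecast is effectively scored by its
+absolute error under this estimator.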
+
+**Prerequisites.**
+
+```bash
+uv run python scripts/fetch_cpi.py
+```
+
+## Cell 2 (code)
+
+```python
+from datetime import datetime, timezone
+from pathlib import Path
+
+import matplotlib.dates as mdates
+import matplotlib.pyplot as plt
+import pandas as pd
+import yaml
+
+
+ROOT = Path.cwd().resolve().parents[1]
+```
+
+## Cell 3 (markdown)
+
+## 1. Register CPI Gasoline
+
+## Cell 4 (code)
+
+```python
+from aieng.forecasting.data import DataService, SeriesMetadata
+from aieng.forecasting.data.adapters import StatCanAdapter
+
+
+CPI_TABLE_ID = "18-10-0004-11"
+CACHE_DIR = ROOT / "data" / "statcan"
+
+svc = DataService()
+
+gasoline_adapter = StatCanAdapter(
+    table_id=CPI_TABLE_ID,
+    member_filter={"GEO": "Canada", "Products and product groups": "Gasoline"},
+    cache_dir=CACHE_DIR,
+)
+svc.register(
+    "cpi_gasoline_canada",
+    gasoline_adapter,
+    SeriesMetadata(
+        series_id="cpi_gasoline_canada",
+        description="CPI Gasoline, Canada (2002=100)",
+        source="StatCan table 18-10-0004-11",
+        units="Index 2002=100",
+        frequency="MS",
+        table_id=CPI_TABLE_ID,
+    ),
+)
+
+# Shelter is pre-registered too so the "try this next" cell at the end
+# can re-run the backtest against a totally different inflation dynamic
+# without any setup rerun.
+shelter_adapter = StatCanAdapter(
+    table_id=CPI_TABLE_ID,
+    member_filter={"GEO": "Canada", "Products and product groups": "Shelter"},
+    cache_dir=CACHE_DIR,
+)
+svc.register(
+    "cpi_shelter_canada",
+    shelter_adapter,
+    SeriesMetadata(
+        series_id="cpi_shelter_canada",
+        description="CPI Shelter, Canada (2002=100)",
+        source="StatCan table 18-10-0004-11",
+        units="Index 2002=100",
+        frequency="MS",
+        table_id=CPI_TABLE_ID,
+    ),
+)
+
+svc.summary()
+```
+
+## Cell 5 (markdown)
+
+## 2. Load the reference spec
+
+## Cell 6 (code)
+
+```python
+from aieng.forecasting.evaluation import BacktestSpec
+
+
+spec_path = ROOT / "implementations" / "getting_started" / "specs" / "cpi_gasoline_1m.yaml"
+with spec_path.open() as f:
+    spec = BacktestSpec.model_validate(yaml.safe_load(f))
+
+origins = spec.origins()
+print(f"Task:    {spec.task.task_id}")
+print(f"Target:  {spec.task.target_series_id}")
+print(f"Horizon: {spec.task.horizon} months")
+print(f"Origins: {len(origins)} ({origins[0].date()} → {origins[-1].date()})")
+print(f"Warmup:  {spec.warmup} observations")
+```
+
+## Cell 7 (markdown)
+
+## 3. Define the predictors
+
+- **`LastValuePredictor`**: `aieng.forecasting.methods.baselines.naive`.  No fitting required;
+  predicts the last observed value at every quantile.  The floor.
+- **`DartsAutoARIMAPredictor`**: `aieng.forecasting.methods.numerical.darts_arima`.  Wraps
+  Darts `AutoARIMA` with Monte Carlo sampling for quantile
+  estimates.  Univariate only.
+
+Read `aieng-forecasting/aieng/forecasting/methods/baselines/naive.py` for an annotated reference on
+how to satisfy the `Predictor` abstract base class when you write your own.
+
+## Cell 8 (code)
+
+```python
+from aieng.forecasting.methods.baselines.naive import LastValuePredictor
+from aieng.forecasting.methods.numerical.darts_arima import DartsAutoARIMAPredictor
+
+
+naive_predictor = LastValuePredictor()
+arima_predictor = DartsAutoARIMAPredictor(num_samples=500)
+print(f"Predictors: {naive_predictor.predictor_id}, {arima_predictor.predictor_id}")
+```
+
+## Cell 9 (markdown)
+
+## 4. Run both backtests
+
+The naive backtest is instant.  AutoARIMA fits once per origin — expect
+a couple of minutes on a laptop.
+
+## Cell 10 (code)
+
+```python
+from aieng.forecasting.evaluation import backtest
+
+
+naive_results = backtest(predictor=naive_predictor, spec=spec, data_service=svc)
+arima_results = backtest(predictor=arima_predictor, spec=spec, data_service=svc)
+
+print(f"{'Predictor':<30} {'Origins':>8} {'Skipped':>8} {'Mean CRPS':>10}")
+print("-" * 60)
+for r in [naive_results, arima_results]:
+    print(f"{r.predictor_id:<30} {len(r.predictions):>8} {r.skipped_origins:>8} {r.mean_score:>10.4f}")
+```
+
+## Cell 11 (markdown)
+
+## 5. Per-origin CRPS comparison
+
+Mean CRPS is a single number; the per-origin breakdown tells the
+actual story.
+
+## Cell 12 (code)
+
+```python
+def result_to_df(result, label):
+    return pd.DataFrame(
+        {
+            "origin": [p.as_of.date() for p in result.predictions],
+            "forecast_date": [p.forecast_date.date() for p in result.predictions],
+            f"point_{label}": [p.payload.point_forecast for p in result.predictions],
+            f"crps_{label}": result.scores,
+        }
+    ).set_index("forecast_date")
+
+
+naive_df = result_to_df(naive_results, "naive")
+arima_df = result_to_df(arima_results, "arima")
+
+comparison = naive_df.join(arima_df[["point_arima", "crps_arima"]], how="inner")
+comparison["crps_reduction"] = comparison["crps_naive"] - comparison["crps_arima"]
+
+print(f"Mean CRPS  naive:  {comparison['crps_naive'].mean():.4f}")
+print(f"Mean CRPS  arima:  {comparison['crps_arima'].mean():.4f}")
+print(
+    f"Mean reduction:    {comparison['crps_reduction'].mean():.4f}"
+    f"  ({comparison['crps_reduction'].mean() / comparison['crps_naive'].mean() * 100:.1f}%)"
+)
+comparison.reset_index().head(10)
+```
+
+## Cell 13 (markdown)
+
+## 6. Predictions vs. actuals — the visual story
+
+One chart covering 2008 onward, with AutoARIMA's 80% CI (10th to 90th
+percentile) shaded so the model's uncertainty (and its collapses) are
+visible next to the observed series.  A companion panel underneath shows
+per-origin CRPS over time; the spikes line up with the regime shifts.
+
+## Cell 14 (code)
+
+```python
+COLOR_OBS = "#1a1a2e"
+COLOR_ARIMA = "#e94560"
+COLOR_NAIVE = "#f5a623"
+
+full_series = svc.get_series(
+    "cpi_gasoline_canada",
+    as_of=datetime.now(tz=timezone.utc).replace(tzinfo=None),
+)
+PLOT_START = pd.Timestamp("2008-01-01")
+plot_series = full_series[full_series["timestamp"] >= PLOT_START]
+
+arima_dates = [p.forecast_date for p in arima_results.predictions]
+arima_points = [p.payload.point_forecast for p in arima_results.predictions]
+arima_q10 = [p.payload.quantiles.get(0.10, p.payload.point_forecast) for p in arima_results.predictions]
+arima_q90 = [p.payload.quantiles.get(0.90, p.payload.point_forecast) for p in arima_results.predictions]
+
+naive_dates = [p.forecast_date for p in naive_results.predictions]
+naive_points = [p.payload.point_forecast for p in naive_results.predictions]
+
+# Trim plotted predictions to the visible window.
+arima_mask = [d >= PLOT_START.to_pydatetime() for d in arima_dates]
+arima_dates_p = [d for d, keep in zip(arima_dates, arima_mask) if keep]
+arima_points_p = [v for v, keep in zip(arima_points, arima_mask) if keep]
+arima_q10_p = [v for v, keep in zip(arima_q10, arima_mask) if keep]
+arima_q90_p = [v for v, keep in zip(arima_q90, arima_mask) if keep]
+arima_scores_p = [s for s, keep in zip(arima_results.scores, arima_mask) if keep]
+
+naive_mask = [d >= PLOT_START.to_pydatetime() for d in naive_dates]
+naive_dates_p = [d for d, keep in zip(naive_dates, naive_mask) if keep]
+naive_points_p = [v for v, keep in zip(naive_points, naive_mask) if keep]
+naive_scores_p = [s for s, keep in zip(naive_results.scores, naive_mask) if keep]
+
+fig, axes = plt.subplots(2, 1, figsize=(14, 9), gridspec_kw={"height_ratios": [3, 1]})
+
+ax = axes[0]
+ax.plot(plot_series["timestamp"], plot_series["value"], color=COLOR_OBS, linewidth=1.6, label="Observed CPI")
+ax.fill_between(arima_dates_p, arima_q10_p, arima_q90_p, alpha=0.20, color=COLOR_ARIMA, label="AutoARIMA 80% CI")
+ax.scatter(
+    arima_dates_p,
+    arima_points_p,
+    color=COLOR_ARIMA,
+    s=28,
+    zorder=5,
+    label=f"AutoARIMA median (mean CRPS {arima_results.mean_score:.3f})",
+)
+ax.scatter(
+    naive_dates_p,
+    naive_points_p,
+    color=COLOR_NAIVE,
+    s=22,
+    marker="^",
+    zorder=4,
+    label=f"Naive last-value    (mean CRPS {naive_results.mean_score:.3f})",
+)
+ax.set_title(
+    "CPI Gasoline Canada — 1-month ahead backtest\nObserved series vs. AutoARIMA (80% CI) and naive baseline",
+    fontsize=13,
+    fontweight="bold",
+)
+ax.set_ylabel("CPI Gasoline (Index 2002=100)")
+ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))
+ax.xaxis.set_major_locator(mdates.YearLocator(2))
+ax.legend(loc="upper left")
+ax.grid(axis="y", linestyle="--", alpha=0.4)
+
+ax2 = axes[1]
+ax2.plot(naive_dates_p, naive_scores_p, color=COLOR_NAIVE, linewidth=1.1, marker="^", markersize=4, label="Naive CRPS")
+ax2.plot(
+    arima_dates_p, arima_scores_p, color=COLOR_ARIMA, linewidth=1.1, marker="o", markersize=4, label="AutoARIMA CRPS"
+)
+ax2.axhline(naive_results.mean_score, color=COLOR_NAIVE, linestyle="--", linewidth=0.8, alpha=0.6)
+ax2.axhline(arima_results.mean_score, color=COLOR_ARIMA, linestyle="--", linewidth=0.8, alpha=0.6)
+ax2.set_ylabel("CRPS")
+ax2.set_xlabel("Forecast date (resolution)")
+ax2.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))
+ax2.xaxis.set_major_locator(mdates.YearLocator(2))
+ax2.legend(loc="upper left")
+ax2.grid(axis="y", linestyle="--", alpha=0.4)
+
+fig.tight_layout()
+plt.show()
+```
+
+## Cell 15 (markdown)
+
+## 7. Where does it fail?
+
+Sort the per-origin table by AutoARIMA CRPS to see the worst cases.
+These are not random outliers — they cluster around four macro regime
+shifts where gasoline prices moved so sharply month-over-month that
+even knowing last month's value tells you almost nothing about the next.
+
+At h=1 the failure mode is different from a long-horizon miss: the
+model *sees* the current level but can't anticipate the magnitude or
+direction of the next month's jump when a regime is in motion.
+
+**Look for these events in your results:**
+
+| Period | What happened |
+|---|---|
+| Late 2008 – early 2009 | Crude oil collapsed ~70% from its July 2008 peak in a matter of months. Successive M-o-M drops were large and hard to predict even one step ahead. |
+| 2014–2016 | OPEC-led oil price decline. A sustained series of down-moves punishes any model that extrapolates last month's value forward. |
+| Mar–Apr 2020 | COVID-19 demand shock. Gasoline demand collapsed almost overnight — the sharpest single-month drop in the series. |
+| 2022 | Russian invasion of Ukraine drove a global oil shock and then a sharp reversal later in the year. Models struggle at both the surge and the unwind. |
+
+The exact origins and scores will vary slightly across runs (AutoARIMA
+refits on each origin's history and estimates its quantiles from Monte
+Carlo samples), but the same four clusters of failures will appear near
+the top of the sorted table.
+
+This is the core motivation for richer methods: exogenous covariates
+(FRED crude oil, CAD/USD FX), LLM context (news, policy statements),
+or agents that can retrieve and reason about that context themselves.
+
+## Cell 16 (code)
+
+```python
+worst = comparison.sort_values("crps_arima", ascending=False).head(8)
+worst_display = worst.reset_index().assign(
+    origin=lambda df: df["origin"].astype(str),
+    forecast_date=lambda df: df["forecast_date"].astype(str),
+)[["origin", "forecast_date", "point_naive", "point_arima", "crps_naive", "crps_arima"]]
+worst_display
+```
+
+## Cell 17 (markdown)
+
+## 8. Spend an eval run (optional)
+
+`evaluate()` is the counterpart to `backtest()`.  It runs against the
+held-out eval window (`specs/cpi_gasoline_eval_2025.yaml`,
+covering Jan 2025 – Mar 2026) and decrements a per-participant budget
+tracked in `data/eval_runs.yaml` (gitignored).  `max_runs: 5` means
+each participant may spend at most five runs against this spec — that
+deliberate scarcity is what makes the eval number a real generalisation
+estimate rather than another metric to over-fit.
+
+Because h=1 and StatCan releases CPI monthly, all origins in this eval
+window are already resolved and the results are fully interpretable.
+This is also the window that will grow over time as new monthly CPI
+prints land — it is the natural bridge to live forecasting.
+
+The cell below is **commented out** to protect the budget.  Uncomment
+and run it once you have a predictor you're confident about.
+
+```python
+from aieng.forecasting.evaluation import EvalSpec, EvalTracker, evaluate
+
+eval_spec_path = ROOT / "implementations" / "getting_started" / "specs" / "cpi_gasoline_eval_2025.yaml"
+with eval_spec_path.open() as f:
+    eval_spec = EvalSpec.model_validate(yaml.safe_load(f))
+
+tracker = EvalTracker(ROOT / "data" / "eval_runs.yaml")
+eval_result = evaluate(
+    predictor=arima_predictor,
+    spec=eval_spec,
+    data_service=svc,
+    tracker=tracker,
+)
+print(
+    f"Eval mean CRPS: {eval_result.mean_score:.4f}  "
+    f"(run {eval_result.run_number}/{eval_spec.max_runs})"
+)
+```
+
+## Cell 18 (markdown)
+
+## 9. Try this next: re-run against Shelter
+
+Shelter tells the opposite story to gasoline: sticky trend, low
+month-to-month volatility, one large regime shift (2021–24) driven by
+housing and monetary policy.  AutoARIMA should do relatively well on
+the steady-trend portion and miss the shift, a different failure mode
+from gasoline's.
+
+The cell below constructs a spec variant in Python against the
+already-registered `cpi_shelter_canada` series.  For a more persistent
+configuration, copy `specs/cpi_gasoline_1m.yaml` to
+`specs/cpi_shelter_1m.yaml` and swap the series ID.
+
+After that, the natural graduation step is
+`implementations/food_price_forecasting/` — nine correlated
+series, a 12-step trajectory, and the avg/avg YoY metric that
+Canada's Food Price Report actually publishes.
+
+## Cell 19 (code)
+
+```python
+from aieng.forecasting.evaluation import ForecastingTask
+
+
+shelter_spec = BacktestSpec(
+    task=ForecastingTask(
+        task_id="cpi_shelter_canada_1m",
+        target_series_id="cpi_shelter_canada",
+        horizons=[1],
+        frequency="MS",
+        description="CPI Shelter Canada, 1-month ahead forecast.",
+    ),
+    start=spec.start,
+    end=spec.end,
+    stride=spec.stride,
+    warmup=spec.warmup,
+    description="Shelter variant of the getting-started spec — the gasoline foil.",
+)
+
+naive_shelter = backtest(predictor=naive_predictor, spec=shelter_spec, data_service=svc)
+arima_shelter = backtest(predictor=arima_predictor, spec=shelter_spec, data_service=svc)
+
+print(f"{'Target':<20} {'Predictor':<20} {'Mean CRPS':>10}")
+print("-" * 55)
+print(f"{'Gasoline':<20} {'Naive':<20} {naive_results.mean_score:>10.3f}")
+print(f"{'Gasoline':<20} {'AutoARIMA':<20} {arima_results.mean_score:>10.3f}")
+print(f"{'Shelter':<20} {'Naive':<20} {naive_shelter.mean_score:>10.3f}")
+print(f"{'Shelter':<20} {'AutoARIMA':<20} {arima_shelter.mean_score:>10.3f}")
+```
+
+## Cell 20 (markdown)
+
+## 10. Serialize the result to YAML
+
+`BacktestResult` is a Pydantic model — serialisable to YAML alongside
+the predictor implementation, passable to a downstream agent as
+structured context, or usable as a submission artefact in a future
+leaderboard mechanism.
+
+## Cell 21 (code)
+
+```python
+result_dict = arima_results.model_dump(mode="json")
+result_yaml = yaml.dump(result_dict, default_flow_style=False, allow_unicode=True)
+
+print("\n".join(result_yaml.splitlines()[:35]))
+print("...")
+```
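Reviewer sketch for the notebook above: it scores one-step-ahead forecasts with CRPS at every origin, so a tiny standard-library version of that loop may help make the mechanics concrete. Everything here is an assumption for illustration: `crps_from_samples`, `last_value_samples`, and `rolling_origin_backtest` are hypothetical names, the series is made up, and the sample-based estimator is the textbook one, which is not necessarily how `aieng.forecasting` computes its scores.

```python
from statistics import fmean


def crps_from_samples(samples, observed):
    """Sample estimate of CRPS: E|X - y| - 0.5 * E|X - X'|."""
    n = len(samples)
    error_term = fmean(abs(x - observed) for x in samples)
    spread_term = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return error_term - 0.5 * spread_term


def last_value_samples(history):
    """Naive forecast: a point mass at the last observed value."""
    return [history[-1]]


def rolling_origin_backtest(values, forecaster, warmup):
    """Score a one-step-ahead forecaster at every origin after the warmup."""
    scores = []
    for origin in range(warmup, len(values)):
        history = values[:origin]  # cutoff: only data before the origin is visible
        samples = forecaster(history)
        scores.append(crps_from_samples(samples, values[origin]))
    return scores


series = [100.0, 102.0, 101.0, 105.0, 110.0]  # made-up index values
scores = rolling_origin_backtest(series, last_value_samples, warmup=2)
mean_crps = fmean(scores)
```

For a point-mass forecast the spread term is zero, so CRPS reduces to absolute error; that is why a last-value baseline's mean CRPS can be read like a mean absolute error. A forecast that spreads its samples is rewarded only when the extra width is needed to cover the outcome.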
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__99_repo_concierge.ipynb.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__99_repo_concierge.ipynb.md
new file mode 100644
index 0000000..1ad6ff3
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__99_repo_concierge.ipynb.md
@@ -0,0 +1,180 @@
+# Source: implementations/getting_started/99_repo_concierge.ipynb
+
+kind: notebook
+
+## Cell 1 (markdown)
+
+# Repo Concierge — ask questions about this codebase
+
+> **Note:** This agent uses a snapshot of the public `main` branch (not your local
+> uncommitted changes or `data/` cache). Like any LLM, it can be wrong — verify
+> important details against the repo or ask a facilitator.
+
+**Not sure how something works? Start here.**
+
+The repo concierge helps you **find your way** — it answers questions, points you
+to the right notebooks and modules, and can quote short snippets so you know
+where to dig deeper. Example questions:
+
+- *How do I create a new data service?*
+- *How do I customize the way context is presented to an LLMP?*
+- *What's the difference between `backtest()` and `evaluate()`?*
+
+It searches a committed **catalog** of the codebase (`search_repo_catalog` →
+`fetch_repo_artifact`): full `aieng/forecasting`, reference implementations, and
+notebooks (markdown + code cells). Domain `99_starter_agent.ipynb` notebooks are
+for building forecasters; this one is your map of the repo.
+
+Live cells are gated by `RUN_AGENT` so `Run All` is safe and free; set it to `True`
+to call the model.
+
+## Cell 2 (code)
+
+```python
+import warnings
+from pathlib import Path
+
+
+warnings.filterwarnings("ignore")
+
+from dotenv import load_dotenv
+
+
+def find_repo_root(start: Path | None = None) -> Path:
+    """Walk upward until we find the workspace root."""
+    here = (start or Path.cwd()).resolve()
+    for cand in (here, *here.parents):
+        if (cand / "pyproject.toml").exists() and (cand / "aieng-forecasting").is_dir():
+            return cand
+    return Path.cwd().resolve().parents[1]
+
+
+ROOT = find_repo_root()
+load_dotenv(ROOT / ".env", override=False)
+
+# ── Model selection ───────────────────────────────────
+# Concierge uses the lite/default model only.
+AGENT_MODEL = "gemini-3.1-flash-lite-preview"
+
+# ── Run guard ──────────────────────────────────────
+RUN_AGENT = False
+
+from getting_started.concierge_agent import build_concierge_config
+
+
+print("RUN_AGENT =", RUN_AGENT, "| model =", AGENT_MODEL)
+```
+
+## Cell 3 (markdown)
+
+---
+## 1. Meet the concierge
+
+The agent uses a **catalog + artifacts** knowledge pack under `concierge_agent/context/`:
+
+1. **`search_repo_catalog`** — search metadata (paths, summaries, domains); cheap, run first.
+2. **`fetch_repo_artifact`** — fetch full content for a catalog path (Python modules, READMEs, notebooks with **markdown + code cells**).
+
+The pack is built from public `main` via `scripts/build_concierge_context.py` and indexes the full `aieng/forecasting` tree plus reference implementations. The `repo-navigation` skill has reference guides (no scripts).
+
+## Cell 4 (code)
+
+```python
+from getting_started.concierge_agent import build_concierge_config
+
+
+config = build_concierge_config(model=AGENT_MODEL)
+
+print("Agent:", config.name)
+print("Search enabled:    ", config.context_retrieval.enabled)
+print("Code-exec enabled: ", config.code_execution.enabled)
+print("Skills loaded:     ", [p.name for p in config.skills_dirs])
+print("Extra tools:       ", [getattr(t, "__name__", repr(t)) for t in config.extra_tools])
+print("\n── System instruction (edit in concierge_agent/agent.py) ──\n")
+print(config.instruction[:900], "...")
+```
+
+## Cell 5 (markdown)
+
+---
+## 2. Try a seed question
+
+Edit `QUESTION` below, or jump to the next section for a multi-turn conversation.
+
+## Cell 6 (code)
+
+```python
+from aieng.forecasting.methods.agentic import build_adk_agent
+from aieng.forecasting.methods.agentic.adk_runner import AdkTextRunner, AdkTextRunnerConfig
+
+
+QUESTION = "How do I create a new data service?"
+
+if RUN_AGENT:
+    chat_agent = build_adk_agent(config)
+    runner = AdkTextRunner(chat_agent, config=AdkTextRunnerConfig(app_name="repo_concierge_chat"))
+    reply = await runner.run_text_async(QUESTION)  # noqa: F704, PLE1142
+    print(reply)
+else:
+    print("RUN_AGENT is False — set it to True in the setup cell to ask the concierge.")
+```
+
+## Cell 7 (code)
+
+```python
+QUESTION = "How do I customize the way context is presented to an LLMP?"
+
+if RUN_AGENT:
+    reply = await runner.run_text_async(QUESTION)  # noqa: F704, F821, PLE1142
+    print(reply)
+else:
+    print("RUN_AGENT is False — set it to True to run this cell.")
+```
+
+## Cell 8 (code)
+
+```python
+QUESTION = "What's the difference between backtest() and evaluate()?"
+
+if RUN_AGENT:
+    reply = await runner.run_text_async(QUESTION)  # noqa: F704, F821, PLE1142
+    print(reply)
+else:
+    print("RUN_AGENT is False — set it to True to run this cell.")
+```
+
+## Cell 9 (code)
+
+```python
+QUESTION = "Where should I go after getting_started if I want to build agents?"
+
+if RUN_AGENT:
+    reply = await runner.run_text_async(QUESTION)  # noqa: F704, F821, PLE1142
+    print(reply)
+else:
+    print("RUN_AGENT is False — set it to True to run this cell.")
+```
+
+## Cell 10 (markdown)
+
+---
+## 3. Terminal mode — multi-turn conversations
+
+For extended back-and-forth, use the ADK CLI in the integrated terminal. From the
+repo root:
+
+```bash
+cd implementations/getting_started
+uv run adk run concierge_agent
+```
+
+That loads the same `repo_concierge` agent (`gemini-3.1-flash-lite-preview`) with
+`search_repo_catalog`, `fetch_repo_artifact`, and the repo-navigation skill.
+
+**Alternative:** `uv run adk web concierge_agent` opens a browser UI (same agent).
+
+---
+
+**Where next?** Forecasting starter agents live in each domain implementation's
+`99_starter_agent.ipynb` (food, energy, BoC, S&P 500). This concierge only explains
+the repo — open one of those when you're ready to build and score a forecaster.
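Reviewer sketch for the notebook above: the concierge searches catalog metadata first and only then fetches full artifacts. The snippet below imitates that two-step flow with an in-memory catalog. It is an assumption-laden illustration, not the real tools: `artifact_filename` and `search_catalog` are hypothetical helpers, the summaries are invented, and the file-naming rule (replace `/` with `__`, append `.md`) is inferred only from the artifact paths visible in this diff, not from `scripts/build_concierge_context.py`.

```python
def artifact_filename(source_path: str) -> str:
    """Map a repo path to an artifact file name (rule inferred from this diff)."""
    return source_path.replace("/", "__") + ".md"


# Hypothetical catalog entries; the real catalog schema is not shown in this diff.
CATALOG = [
    {
        "path": "implementations/getting_started/README.md",
        "summary": "hello-world forecasting walkthrough with backtest and evaluate",
    },
    {
        "path": "implementations/getting_started/99_repo_concierge.ipynb",
        "summary": "notebook for asking onboarding questions about the repo",
    },
    {
        "path": "implementations/getting_started/__init__.py",
        "summary": "package marker for the getting started implementation",
    },
]


def search_catalog(query: str, catalog: list[dict], limit: int = 3) -> list[str]:
    """Cheap first pass: rank paths by how many query terms hit the metadata."""
    terms = query.lower().split()
    scored = []
    for entry in catalog:
        haystack = f"{entry['path']} {entry['summary']}".lower()
        score = sum(term in haystack for term in terms)
        if score:
            scored.append((score, entry["path"]))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [path for _, path in scored[:limit]]


hits = search_catalog("backtest evaluate", CATALOG)
files_to_fetch = [artifact_filename(path) for path in hits]
```

Keeping the search step on small metadata records and deferring full content to a second call is what makes the first pass cheap; only the one to three paths that look relevant are ever loaded in full.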
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__README.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__README.md.md
new file mode 100644
index 0000000..205b88e
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__README.md.md
@@ -0,0 +1,235 @@
+# Source: implementations/getting_started/README.md
+
+kind: markdown
+
+# Getting Started
+
+The **"hello-world"** forecasting example — the smallest end-to-end use of
+the evaluation framework, and a good first stop if the `Predictor` /
+`backtest` / `evaluate` loop is new to you.
+
+The task deliberately keeps the framework surface minimal - a single
+series, a single 1-month horizon, one `BacktestSpec`, the `backtest()`
+and `evaluate()` entry points - so the evaluation loop itself is clear
+before you meet the richer patterns in
+[`implementations/food_price_forecasting/`](../food_price_forecasting/) (multi-target,
+multi-horizon trajectories, avg/avg YoY, cached artefacts).
+
+---
+
+## The task
+
+**Forecast Canada CPI Gasoline (index, 2002=100) exactly 1 month
+ahead.**  Evaluated at every monthly origin from 2000 to 2025, with a
+held-out eval set covering Jan 2025 – Mar 2026.
+
+**Why gasoline?**  Because it *breaks* our models, visibly.  The
+backtest window covers four textbook regime shifts — the 2008
+crude-oil collapse, the 2014–16 OPEC-led decline, the 2020 COVID
+demand shock, and the 2021–22 Russia/Ukraine surge.  Even at h=1
+the series makes large enough month-over-month jumps during these
+events that last-value and ARIMA both struggle.  The CRPS spikes are
+exactly the motivation for the richer techniques the other
+implementations explore: exogenous covariates, LLM context, and agents
+that can retrieve that context.
+
+**Why 1-month ahead?**  StatCan publishes CPI ~3 weeks after the
+reference month, so a forecast made today resolves at the next print.
+This is short enough to run genuine **live / prospective tests**: make
+a prediction now, validate it next month.
+
+Headline `cpi_all_items_canada` was the original target here and is a
+fine series - just too smooth to teach anything interesting.
+
+**Score:** Continuous Ranked Probability Score (CRPS; lower is better).
+CRPS rewards both calibration (is the probability band the right width?) and sharpness
+(is it as narrow as it can be?).
+
+---
+
+## Before you start
+
+### 0. Check your environment - `00_environment_check.ipynb`
+
+**New to the project? Start here.** This self-guided preflight notebook checks
+every major capability you'll need — LLM inference through the Vector proxy,
+Langfuse tracing, E2B code execution, StatCan and (optional) FRED data access,
+and a full end-to-end mini backtest — one cell at a time. Run it top to bottom
+(`Run All` is safe); each check reports ✅ / ⚠️ / ❌ and, on failure, tells you
+exactly what to fix. Most ❌ results are a missing or placeholder API key in the
+repo-root `.env`, so it's the fastest way to confirm your setup is complete
+before opening the notebooks below.
+
+The FRED check is optional for `getting_started` itself, but required by the
+S&P 500 reference implementation and useful for the BoC rate decisions one. FRED
+API keys are free but must be requested individually — **we cannot provide one
+for you**. Request yours early at <https://fred.stlouisfed.org/docs/api/api_key.html>
+(approval is usually quick but can take some time). A description like "Requesting
+an API key to explore the effectiveness of various forecasting techniques on
+economic data." works well. Once approved, add `FRED_API_KEY=your_key` to your
+`.env`.
+
+### Populate the local data cache
+
+The StatCan download is gitignored, so fetch it locally before running the notebooks:
+
+```bash
+uv run python scripts/fetch_cpi.py
+```
+
+This downloads all 47 Canada-wide CPI series from StatCan table
+18-10-0004-11 into `data/statcan/`.  Re-running is idempotent.
+
+---
+
+## Walkthrough
+
+### 1. Warm up - `01_cpi_data_exploration.ipynb`
+
+Registers three focus series (all-items, gasoline,
+shelter), shows the cutoff-enforcement pattern, plots levels and
+year-over-year change, and constructs a `ForecastingTask` by hand so
+you can see what the YAML spec turns into.
+
+### 2. Run the backtest - `02_cpi_backtest_demo.ipynb`
+
+Walks through the full cycle:
+
+1. Load `specs/cpi_gasoline_1m.yaml` into a `BacktestSpec`.
+2. Construct a `LastValuePredictor` (the floor) and a
+   `DartsAutoARIMAPredictor` (a real baseline).
+3. Run `backtest()` for both, print a CRPS comparison table.
+4. Plot observed gasoline vs. AutoARIMA forecasts with shaded 80% CI.
+5. Inspect the worst-performing origins and match them to real-world
+   events.
+6. Show how `evaluate()` + `EvalTracker` would spend a run from the
+   held-out 2025 eval window.
+7. Re-run the same predictors against shelter for a side-by-side
+   regime-contrast.
+8. Serialise the `BacktestResult` to YAML.
+
+### 3. Write your own predictor
+
+Read [`aieng-forecasting/aieng/forecasting/methods/baselines/naive.py`](../../aieng-forecasting/aieng/forecasting/methods/baselines/naive.py) for a
+step-by-step annotated reference.  Subclass `Predictor`:
+
+```python
+from aieng.forecasting.evaluation import Predictor
+
+class MyPredictor(Predictor):
+    @property
+    def predictor_id(self) -> str:
+        return "my_predictor"
+
+    def predict(self, task, context):
+        series = context.get_series(task.target_series_id)
+        ...
+```
+
+Then point `backtest(predictor=MyPredictor(), spec=spec, data_service=svc)`
+at `cpi_gasoline_1m.yaml` and see whether you beat AutoARIMA.
+
+### 4. Compare predictors
+
+Re-run `backtest()` with two or more predictors against the same spec;
+the `BacktestResult.mean_score` values are directly comparable.
+(For continuous tasks like this one the metric is CRPS; binary event
+tasks — see the BoC rate-decision reference — use the Brier score.)
+
+### 5. Spend an eval run
+
+Once you have a predictor you're confident about, run `evaluate()`
+against [`cpi_gasoline_eval_2025.yaml`](specs/cpi_gasoline_eval_2025.yaml)
+— monthly origins from Jan 2025 through Mar 2026, all currently resolved.
+`max_runs: 5` — spend deliberately.
+
+### 6. Ask the repo concierge — `99_repo_concierge.ipynb`
+
+**Questions about how the repository works?** Open
+[`99_repo_concierge.ipynb`](99_repo_concierge.ipynb) — a lite-model **repo
+concierge** that answers onboarding questions, points you to the right notebooks
+and modules, and can quote snippets from the committed public-`main` catalog.
+
+- Notebook cells are gated by `RUN_AGENT` (safe `Run All`).
+- For longer conversations, use the terminal from the repo root:
+
+  ```bash
+  cd implementations/getting_started
+  uv run adk run concierge_agent
+  ```
+
+  (`uv run adk web concierge_agent` opens the same agent in a browser.)
+
+This is different from each domain's `99_starter_agent.ipynb` — those are
+hackable **forecasting** agents; the concierge only explains the repo.
+
+Maintainers regenerate the catalog with
+`uv run python scripts/build_concierge_context.py` when library code,
+implementations, or notebooks change.
+
+---
+
+## Where to go next
+
+This implementation is the minimal subset of the evaluation framework. The
+other reference implementations are independent — pick whichever problem fits
+what you're building:
+
+- [`food_price_forecasting/`](../food_price_forecasting/) — the same evaluation
+  story scaled up: nine correlated CPI sub-indices, a 12-step trajectory per
+  origin, `MultiTargetBacktestSpec`, `cached_multi_backtest()`, helper modules
+  (`data.py`, `analysis.py`, `plots.py`), and the avg/avg YoY metric that
+  Canada's Food Price Report actually publishes.
+- [`boc_rate_decisions/`](../boc_rate_decisions/) — the same harness applied to
+  a discrete cut/hold/hike event instead of a continuous series (Brier / RPS).
+- [`energy_oil_forecasting/`](../energy_oil_forecasting/) — daily prices,
+  news-grounded and code-executing agents, and an agent that learns a strategy
+  from data.
+
+---
+
+## Directory layout
+
+```text
+getting_started/                 # this directory
+├── README.md
+├── specs/                       # backtest and eval YAML
+├── concierge_agent/             # repo concierge ADK agent + catalog + artifacts
+├── 00_environment_check.ipynb   # self-guided setup preflight — run this first
+├── 01_cpi_data_exploration.ipynb
+├── 02_cpi_backtest_demo.ipynb
+└── 99_repo_concierge.ipynb      # ask questions about the repo (onboarding helper)
+```
+
+Reference predictors live in the `aieng-forecasting` package under
+`aieng/forecasting/methods/`:
+
+- `baselines/` for floor baselines such as `LastValuePredictor`
+- `numerical/` for Darts-based numerical predictors
+
+Reference specs (co-located with this use case):
+
+```text
+getting_started/specs/
+├── cpi_gasoline_1m.yaml             # backtest spec (2000–2025) - use freely
+└── cpi_gasoline_eval_2025.yaml      # eval spec (Jan 2025–Mar 2026) - 5 runs max
+```
+
+---
+
+## Key interfaces (from `aieng-forecasting`)
+
+```python
+from aieng.forecasting.evaluation import (
+    Predictor,          # ABC - implement this
+    backtest,           # run a backtest, returns BacktestResult
+    evaluate,           # run against the held-out eval window
+    BacktestSpec,       # loaded from specs/ YAML
+    EvalSpec,           # loaded from specs/ YAML
+    EvalTracker,        # file-backed run counter
+    ContinuousForecast, # forecast payload (point + quantiles)
+    Prediction,         # full prediction record (payload + metadata)
+    STANDARD_QUANTILES, # [0.05, 0.10, ..., 0.90, 0.95]
+)
+from aieng.forecasting.data import DataService  # register series, create contexts
+```
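Reviewer sketch for the README above: it describes `EvalTracker` as a file-backed run counter and caps the eval spec at `max_runs: 5`. A minimal version of that idea might look like the following. This is not the real `EvalTracker`: `RunBudget` and its `spend` method are hypothetical names, and JSON is used instead of YAML purely to stay within the standard library.

```python
import json
import tempfile
from pathlib import Path


class RunBudget:
    """Hypothetical file-backed run counter (illustrative, not EvalTracker)."""

    def __init__(self, path, max_runs):
        self.path = Path(path)
        self.max_runs = max_runs

    def _load(self):
        # A missing file simply means nobody has spent a run yet.
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {}

    def spend(self, participant, spec_id):
        """Record one run and return its number, or raise once the cap is hit."""
        counts = self._load()
        key = f"{participant}:{spec_id}"
        used = counts.get(key, 0)
        if used >= self.max_runs:
            raise RuntimeError(f"{key} has used all {self.max_runs} runs")
        counts[key] = used + 1
        self.path.write_text(json.dumps(counts, indent=2))
        return counts[key]


with tempfile.TemporaryDirectory() as tmp:
    budget = RunBudget(Path(tmp) / "eval_runs.json", max_runs=2)
    first = budget.spend("alice", "cpi_gasoline_eval_2025")
    second = budget.spend("alice", "cpi_gasoline_eval_2025")
    try:
        budget.spend("alice", "cpi_gasoline_eval_2025")
        exhausted = False
    except RuntimeError:
        exhausted = True
```

Checking the count before running anything, and persisting it to disk rather than memory, is what keeps the cap meaningful across notebook restarts.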
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started____init__.py.md
new file mode 100644
index 0000000..8663edc
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started____init__.py.md
@@ -0,0 +1,7 @@
+# Source: implementations/getting_started/__init__.py
+
+kind: python
+
+```python
+"""Getting started reference implementation (notebooks + repo concierge agent)."""
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__concierge_agent____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__concierge_agent____init__.py.md
new file mode 100644
index 0000000..f8f6300
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__concierge_agent____init__.py.md
@@ -0,0 +1,26 @@
+# Source: implementations/getting_started/concierge_agent/__init__.py
+
+kind: python
+
+```python
+"""Repo concierge agent — onboarding helper for the agentic-forecasting codebase.
+
+Exports the :class:`AgentConfig` factory, the catalog tools, and the
+knowledge-search tool. Pair with ``99_repo_concierge.ipynb`` or
+``adk run concierge_agent``.
+"""
+
+from getting_started.concierge_agent.agent import build_concierge_config
+from getting_started.concierge_agent.catalog import (
+    fetch_repo_artifact,
+    search_repo_catalog,
+)
+from getting_started.concierge_agent.knowledge import search_repo_knowledge
+
+
+__all__ = [
+    "build_concierge_config",
+    "fetch_repo_artifact",
+    "search_repo_catalog",
+    "search_repo_knowledge",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__concierge_agent__agent.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__concierge_agent__agent.py.md
new file mode 100644
index 0000000..311ecf9
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__concierge_agent__agent.py.md
@@ -0,0 +1,113 @@
+# Source: implementations/getting_started/concierge_agent/agent.py
+
+kind: python
+
+```python
+"""Repo concierge agent — onboarding helper for the agentic-forecasting codebase.
+
+A lightweight ADK agent powered by ``LITE_MODEL`` (``gemini-3.1-flash-lite-preview``).
+It answers questions about the repository using a committed **catalog + artifacts**
+snapshot of public ``main`` — not the participant's local workspace.
+
+Pair with ``99_repo_concierge.ipynb`` or ``adk run concierge_agent``.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+from typing import Any
+
+from aieng.forecasting.methods.agentic import build_adk_agent
+from aieng.forecasting.methods.agentic.agent_factory import (
+    AgentConfig,
+    CodeExecutionConfig,
+    ContextRetrievalConfig,
+)
+from aieng.forecasting.models import LITE_MODEL
+from getting_started.concierge_agent.catalog import fetch_repo_artifact, search_repo_catalog
+
+
+_SKILLS_ROOT = Path(__file__).parent / "skills"
+_REPO_NAV_SKILL = _SKILLS_ROOT / "repo-navigation"
+
+
+def _build_concierge_instruction() -> str:
+    return (
+        "## Role\n\n"
+        "You are the **repo concierge** for the agentic-forecasting bootcamp — a "
+        "friendly guide who helps participants understand the repository and find "
+        "their way to the right notebooks, modules, and patterns.\n\n"
+        "Answer questions clearly. Point people to **concrete paths** in the "
+        "codebase (READMEs, notebooks, specs, library modules) where they can "
+        "read more or try things themselves. When it helps, quote short snippets "
+        "from fetched artifacts — especially from notebooks and reference "
+        "implementations.\n\n"
+        "## How you work\n\n"
+        "- Ground answers in the committed catalog: call "
+        "``search_repo_catalog`` first, then ``fetch_repo_artifact`` for the "
+        "paths you need (usually one to three per question).\n"
+        "- Prefer showing *where* something lives and *how it fits together* "
+        "over long generic explanations.\n"
+        "- If someone is debugging or extending code, walk them through the "
+        "relevant files and patterns you find in the catalog; suggest what to "
+        "open next in their editor.\n"
+        "- Your knowledge reflects the committed public ``main`` snapshot — not "
+        "the participant's local ``.env``, ``data/`` cache, or uncommitted "
+        "changes. If the catalog does not cover something, say so and name the "
+        "best file to open or a facilitator to ask.\n\n"
+        "## Tone\n\n"
+        "- Concise, welcoming, and practical — short paragraphs and bullet lists.\n"
+        "- Always cite paths returned by the catalog.\n"
+    )
+
+
+_CONCIERGE_INSTRUCTION = _build_concierge_instruction()
+
+_SKILLS_SUPPLEMENT = """
+
+## Skills
+
+You have one read-only skill: `repo-navigation` with reference files (catalog guide,
+navigation map). Load them via `load_skill_resource` when you need routing hints.
+
+**To use a skill:**
+1. Call `list_skills` → `load_skill` → `load_skill_resource` as needed.
+
+This skill has NO scripts. Do not call `run_skill_script`.
+
+## Repo catalog tools (required workflow)
+
+1. **`search_repo_catalog(query, domain=None, kind=None)`** — search metadata only
+   (paths, summaries, section titles). Use `domain` filters like `core.data`,
+   `core.methods`, `impl.energy_oil_forecasting`, `scripts`, `docs`.
+   Use `kind` filters: `python`, `notebook`, `markdown`, `yaml`.
+2. **`fetch_repo_artifact(path, section=None)`** — fetch full content for one catalog
+   path (optionally one heading/section). Fetch 1–3 artifacts per question.
+
+Do not answer implementation or API questions without fetching the relevant paths.\
+"""
+
+
+def _full_instruction() -> str:
+    return _CONCIERGE_INSTRUCTION + _SKILLS_SUPPLEMENT
+
+
+def build_concierge_config(*, model: str = LITE_MODEL) -> AgentConfig:
+    """Build the repo-concierge :class:`AgentConfig`."""
+    return AgentConfig(
+        name="repo_concierge",
+        model=model,
+        instruction=_full_instruction(),
+        context_retrieval=ContextRetrievalConfig(),
+        code_execution=CodeExecutionConfig(),
+        skills_dirs=[_REPO_NAV_SKILL],
+        extra_tools=[search_repo_catalog, fetch_repo_artifact],
+    )
+
+
+def __getattr__(name: str) -> Any:
+    """Expose ``root_agent`` lazily for schema-free interactive use via ADK CLI."""
+    if name == "root_agent":
+        return build_adk_agent(build_concierge_config())
+    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__concierge_agent__catalog.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__concierge_agent__catalog.py.md
new file mode 100644
index 0000000..9584e51
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__concierge_agent__catalog.py.md
@@ -0,0 +1,249 @@
+# Source: implementations/getting_started/concierge_agent/catalog.py
+
+kind: python
+
+```python
+"""Runtime catalog search and artifact fetch for the repo concierge agent."""
+
+from __future__ import annotations
+
+import re
+from dataclasses import dataclass
+from functools import lru_cache
+from pathlib import Path
+from typing import Any
+
+import yaml
+from getting_started.concierge_agent.catalog_build import CatalogEntry
+
+
+_CONTEXT_DIR = Path(__file__).parent / "context"
+_MAX_CATALOG_HITS = 8
+_DEFAULT_FETCH_MAX_CHARS = 6000
+_MIN_SCORE = 1
+_HEADING_RE = re.compile(r"^#{1,4}\s+(.+)$", re.MULTILINE)
+
+
+@dataclass(frozen=True)
+class CatalogHit:
+    """A ranked catalog match (metadata only)."""
+
+    path: str
+    kind: str
+    domain: str
+    summary: str
+    score: int
+    artifact: str
+    sections: list[str]
+
+
+def _entry_from_dict(data: dict[str, Any]) -> CatalogEntry:
+    return CatalogEntry(
+        path=str(data["path"]),
+        kind=str(data.get("kind", "other")),
+        domain=str(data.get("domain", "other")),
+        summary=str(data.get("summary", "")),
+        symbols=[str(s) for s in data.get("symbols", [])],
+        sections=[str(s) for s in data.get("sections", [])],
+        chars=int(data.get("chars", 0)),
+        artifact=str(data["artifact"]),
+    )
+
+
+@lru_cache(maxsize=1)
+def _load_catalog() -> dict[str, Any]:
+    catalog_path = _CONTEXT_DIR / "catalog.yaml"
+    if not catalog_path.is_file():
+        msg = f"Concierge catalog not found: {catalog_path}. Run scripts/build_concierge_context.py"
+        raise FileNotFoundError(msg)
+    with catalog_path.open(encoding="utf-8") as fh:
+        data = yaml.safe_load(fh)
+    if not isinstance(data, dict):
+        msg = f"Invalid catalog format in {catalog_path}"
+        raise ValueError(msg)
+    return data
+
+
+@lru_cache(maxsize=1)
+def _load_entries() -> tuple[CatalogEntry, ...]:
+    catalog = _load_catalog()
+    raw_entries = catalog.get("entries", [])
+    if not isinstance(raw_entries, list):
+        return ()
+    return tuple(_entry_from_dict(item) for item in raw_entries if isinstance(item, dict))
+
+
+def _tokenize(query: str) -> list[str]:
+    return [t for t in re.findall(r"[a-zA-Z0-9_./-]+", query.lower()) if len(t) > 2]
+
+
+def _score_entry(entry: CatalogEntry, terms: list[str], domain: str | None, kind: str | None) -> int:
+    haystack = " ".join(
+        [
+            entry.path,
+            entry.summary,
+            entry.domain,
+            entry.kind,
+            " ".join(entry.symbols),
+            " ".join(entry.sections),
+        ]
+    ).lower()
+    score = sum(haystack.count(term) for term in terms)
+    if score and domain and entry.domain == domain:  # bonus only when a term matched
+        score += 5
+    if score and kind and entry.kind == kind:  # bonus only when a term matched
+        score += 3
+    return score
+
+
+def _normalize_domain(domain: str | None) -> str | None:
+    if domain is None:
+        return None
+    return domain.strip().lower()
+
+
+def _normalize_kind(kind: str | None) -> str | None:
+    if kind is None:
+        return None
+    return kind.strip().lower()
+
+
+def search_repo_catalog(
+    query: str,
+    domain: str | None = None,
+    kind: str | None = None,
+) -> str:
+    """Search the committed repo catalog (metadata only).
+
+    Returns matching paths, summaries, and section titles — not file bodies.
+    Follow up with :func:`fetch_repo_artifact` for content.
+    """
+    terms = _tokenize(query)
+    if not terms:
+        return (
+            "No search terms found. Try e.g. "
+            "'DataService register' or 'energy notebook 02 agentic'."
+        )
+
+    domain_filter = _normalize_domain(domain)
+    kind_filter = _normalize_kind(kind)
+    ranked: list[CatalogHit] = []
+    for entry in _load_entries():
+        if domain_filter and entry.domain != domain_filter:
+            continue
+        if kind_filter and entry.kind != kind_filter:
+            continue
+        score = _score_entry(entry, terms, domain_filter, kind_filter)
+        if score >= _MIN_SCORE:
+            ranked.append(
+                CatalogHit(
+                    path=entry.path,
+                    kind=entry.kind,
+                    domain=entry.domain,
+                    summary=entry.summary,
+                    score=score,
+                    artifact=entry.artifact,
+                    sections=entry.sections[:5],
+                )
+            )
+
+    if not ranked:
+        domains = sorted({e.domain for e in _load_entries()})
+        return (
+            f"No catalog matches for query={query!r}"
+            + (f", domain={domain!r}" if domain else "")
+            + (f", kind={kind!r}" if kind else "")
+            + f". Available domains: {', '.join(domains)}."
+        )
+
+    ranked.sort(key=lambda hit: hit.score, reverse=True)
+    top = ranked[:_MAX_CATALOG_HITS]
+
+    lines = [
+        f"# Catalog search: {query}",
+        "",
+        "Metadata only — call `fetch_repo_artifact(path)` for full content.",
+        "",
+    ]
+    for i, hit in enumerate(top, start=1):
+        lines.append(f"## Match {i} (score={hit.score})")
+        lines.append(f"- **path:** `{hit.path}`")
+        lines.append(f"- **kind:** `{hit.kind}` | **domain:** `{hit.domain}`")
+        lines.append(f"- **summary:** {hit.summary}")
+        if hit.sections:
+            lines.append(f"- **sections:** {'; '.join(hit.sections[:3])}")
+        lines.append("")
+    return "\n".join(lines)
+
+
+def _find_entry_by_path(path: str) -> CatalogEntry | None:
+    normalized = path.strip().replace("\\", "/")
+    for entry in _load_entries():
+        if entry.path == normalized:
+            return entry
+    return None
+
+
+def _extract_section(body: str, section: str) -> str | None:
+    needle = section.strip().lower()
+    if not needle:
+        return None
+    parts = re.split(r"\n(?=#{1,4} )", body)
+    for part in parts:
+        heading_match = _HEADING_RE.match(part.strip())
+        if heading_match and needle in heading_match.group(1).lower():
+            return part.strip()
+        if needle in part[:120].lower():
+            return part.strip()
+    return None
+
+
+def fetch_repo_artifact(
+    path: str,
+    section: str | None = None,
+    max_chars: int = _DEFAULT_FETCH_MAX_CHARS,
+) -> str:
+    """Fetch one pre-built artifact by repo-relative ``path``.
+
+    Parameters
+    ----------
+    path : str
+        Repo-relative path as listed in the catalog (e.g.
+        ``aieng-forecasting/aieng/forecasting/data/service.py``).
+    section : str or None
+        Optional heading substring to return one section only.
+    max_chars : int
+        Cap on returned body characters; a short truncation marker is appended when the body exceeds it.
+    """
+    entry = _find_entry_by_path(path)
+    if entry is None:
+        return f"No catalog entry for path={path!r}. Call `search_repo_catalog` first."
+
+    artifact_path = _CONTEXT_DIR / entry.artifact
+    if not artifact_path.is_file():
+        return f"Artifact missing for {path!r}: {entry.artifact}"
+
+    body = artifact_path.read_text(encoding="utf-8")
+    if section:
+        extracted = _extract_section(body, section)
+        body = extracted or (
+            f"(Section {section!r} not found in artifact; showing beginning.)\n\n" + body[:max_chars]
+        )
+
+    if len(body) > max_chars:
+        body = body[:max_chars] + "\n…\n"
+    return body
+
+
+def clear_catalog_cache() -> None:
+    """Clear cached catalog reads (for tests)."""
+    _load_catalog.cache_clear()
+    _load_entries.cache_clear()
+
+
+__all__ = [
+    "clear_catalog_cache",
+    "fetch_repo_artifact",
+    "search_repo_catalog",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__concierge_agent__catalog_build.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__concierge_agent__catalog_build.py.md
new file mode 100644
index 0000000..2bd5807
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__concierge_agent__catalog_build.py.md
@@ -0,0 +1,347 @@
+# Source: implementations/getting_started/concierge_agent/catalog_build.py
+
+kind: python
+
+```python
+"""Build the repo concierge catalog and per-source artifacts (maintainer-only)."""
+
+from __future__ import annotations
+
+import ast
+import json
+import re
+import shutil
+import subprocess
+from dataclasses import dataclass
+from datetime import UTC, datetime
+from pathlib import Path
+from typing import Any, Literal
+
+import yaml
+
+
+REPO_URL = "https://github.com/VectorInstitute/agentic-forecasting"
+DEFAULT_BRANCH = "main"
+CORE_PREFIX = "aieng-forecasting/aieng/forecasting"
+
+Kind = Literal["python", "markdown", "notebook", "yaml", "shell"]
+
+_SKIP_IMPL_PARTS = frozenset({"tests", "context", "__pycache__"})
+_HEADING_RE = re.compile(r"^#{1,4}\s+(.+)$", re.MULTILINE)
+
+
+@dataclass(frozen=True)
+class CatalogEntry:
+    """One indexed source file in the public repo snapshot."""
+
+    path: str
+    kind: str
+    domain: str
+    summary: str
+    symbols: list[str]
+    sections: list[str]
+    chars: int
+    artifact: str
+
+
+def repo_root_from_here() -> Path:
+    """Return repository root (parent of ``implementations/``)."""
+    return Path(__file__).resolve().parents[3]
+
+
+def context_dir(repo_root: Path | None = None) -> Path:
+    root = repo_root or repo_root_from_here()
+    return root / "implementations/getting_started/concierge_agent/context"
+
+
+def path_to_artifact_slug(rel_path: str) -> str:
+    return rel_path.replace("/", "__")
+
+
+_DOMAIN_RULES: tuple[tuple[str, str], ...] = (
+    (f"{CORE_PREFIX}/data", "core.data"),
+    (f"{CORE_PREFIX}/evaluation", "core.evaluation"),
+    (f"{CORE_PREFIX}/methods", "core.methods"),
+    (f"{CORE_PREFIX}/documents", "core.documents"),
+    (f"{CORE_PREFIX}/", "core.root"),
+)
+
+
+def infer_domain(rel_path: str) -> str:
+    """Map a repo-relative path to a catalog domain tag."""
+    for prefix, domain in _DOMAIN_RULES:
+        if rel_path.startswith(prefix):
+            return domain
+    if rel_path.startswith("implementations/"):
+        parts = rel_path.split("/")
+        if len(parts) >= 2:
+            return f"impl.{parts[1]}"
+    if rel_path.startswith("scripts/"):
+        return "scripts"
+    if rel_path.startswith(("docs/", "planning-docs/")) or rel_path in {"README.md", "AGENTS.md"}:
+        return "docs"
+    return "other"
+
+
+def infer_kind(rel_path: str) -> Kind:
+    suffix = Path(rel_path).suffix.lower()
+    if suffix == ".py":
+        return "python"
+    if suffix == ".ipynb":
+        return "notebook"
+    if suffix in {".yaml", ".yml"}:
+        return "yaml"
+    if suffix == ".md":
+        return "markdown"
+    return "shell"
+
+
+def _first_paragraph(text: str) -> str:
+    stripped = text.strip()
+    if not stripped:
+        return ""
+    return stripped.split("\n\n")[0].replace("\n", " ").strip()[:240]
+
+
+def _extract_headings(text: str) -> list[str]:
+    return [m.group(1).strip() for m in _HEADING_RE.finditer(text)][:40]
+
+
+def _analyze_python(source: str) -> tuple[str, list[str]]:
+    try:
+        tree = ast.parse(source)
+    except SyntaxError:
+        return "", []
+    summary = _first_paragraph(ast.get_docstring(tree) or "")
+    symbols: list[str] = []
+    for node in tree.body:
+        if isinstance(node, ast.ClassDef):
+            symbols.append(node.name)
+        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
+            if not node.name.startswith("_"):
+                symbols.append(node.name)
+        elif isinstance(node, ast.Assign):
+            for target in node.targets:
+                if (
+                    isinstance(target, ast.Name)
+                    and target.id == "__all__"
+                    and isinstance(node.value, (ast.List, ast.Tuple))
+                ):
+                    for elt in node.value.elts:
+                        if isinstance(elt, ast.Constant) and isinstance(elt.value, str):
+                            symbols.append(elt.value)
+    return summary, symbols[:30]
+
+
+def _notebook_to_markdown(rel_path: str, raw: str) -> tuple[str, str, list[str]]:
+    nb = json.loads(raw)
+    lines = [f"# Source: {rel_path}", "", "kind: notebook", ""]
+    sections: list[str] = []
+    for idx, cell in enumerate(nb.get("cells", []), start=1):
+        cell_type = cell.get("cell_type", "")
+        source = "".join(cell.get("source", []))
+        if not source.strip():
+            continue
+        if cell_type == "markdown":
+            lines.extend([f"## Cell {idx} (markdown)", "", source.rstrip(), ""])
+            first = source.strip().splitlines()[0] if source.strip() else ""
+            if first.startswith("#"):
+                sections.append(first.lstrip("#").strip())
+        elif cell_type == "code":
+            lines.extend([f"## Cell {idx} (code)", "", "```python", source.rstrip(), "```", ""])
+    body = "\n".join(lines)
+    title = sections[0] if sections else Path(rel_path).stem.replace("_", " ")
+    summary = title[:240]
+    return body, summary, sections[:40]
+
+
+def _markdown_summary_and_sections(body: str, *, fallback: str) -> tuple[str, list[str]]:
+    sections = _extract_headings(body)
+    summary = sections[0] if sections else _first_paragraph(body) or fallback
+    return summary[:240], sections
+
+
+def _collect_core_paths(repo_root: Path) -> set[Path]:
+    paths: set[Path] = set()
+    core = repo_root / CORE_PREFIX
+    if core.is_dir():
+        for path in core.rglob("*"):
+            if path.is_file() and path.suffix in {".py", ".md"} and "__pycache__" not in path.parts:
+                paths.add(path)
+    return paths
+
+
+def _collect_impl_paths(repo_root: Path) -> set[Path]:
+    paths: set[Path] = set()
+    impl_root = repo_root / "implementations"
+    if not impl_root.is_dir():
+        return paths
+    for path in impl_root.rglob("*"):
+        if not path.is_file():
+            continue
+        if _SKIP_IMPL_PARTS.intersection(path.parts):
+            continue
+        if "curriculum" in path.parts and "context" in path.parts:
+            continue
+        if path.suffix in {".py", ".md", ".ipynb"} or (
+            path.parent.name == "specs" and path.suffix in {".yaml", ".yml"}
+        ):
+            paths.add(path)
+    return paths
+
+
+def collect_source_paths(repo_root: Path) -> list[Path]:
+    """Collect all concierge-indexed paths under the repo snapshot."""
+    paths = _collect_core_paths(repo_root) | _collect_impl_paths(repo_root)
+
+    for rel in (
+        "README.md",
+        "AGENTS.md",
+        "implementations/README.md",
+        "planning-docs/roadmap.md",
+        "docs/adk-skills-guide.md",
+    ):
+        candidate = repo_root / rel
+        if candidate.is_file():
+            paths.add(candidate)
+
+    scripts = repo_root / "scripts"
+    if scripts.is_dir():
+        for path in scripts.glob("fetch_*.py"):
+            paths.add(path)
+
+    return sorted(paths, key=lambda p: p.relative_to(repo_root).as_posix())
+
+
+def build_entry(repo_root: Path, path: Path) -> tuple[CatalogEntry, str]:
+    rel = path.relative_to(repo_root).as_posix()
+    kind = infer_kind(rel)
+    domain = infer_domain(rel)
+    raw = path.read_text(encoding="utf-8", errors="replace")
+
+    symbols: list[str] = []
+    sections: list[str] = []
+    if kind == "python":
+        summary, symbols = _analyze_python(raw)
+        if not summary:
+            summary = Path(rel).name
+        body = f"# Source: {rel}\n\nkind: python\n\n```python\n{raw.rstrip()}\n```\n"
+    elif kind == "notebook":
+        body, summary, sections = _notebook_to_markdown(rel, raw)
+    elif kind == "markdown":
+        summary, sections = _markdown_summary_and_sections(raw, fallback=Path(rel).name)
+        body = f"# Source: {rel}\n\nkind: markdown\n\n{raw.rstrip()}\n"
+    elif kind == "yaml":
+        summary, sections = _markdown_summary_and_sections(raw, fallback=Path(rel).name)
+        body = f"# Source: {rel}\n\nkind: yaml\n\n```yaml\n{raw.rstrip()}\n```\n"
+    else:
+        summary = Path(rel).name
+        body = f"# Source: {rel}\n\nkind: shell\n\n```\n{raw.rstrip()}\n```\n"
+
+    artifact_rel = f"artifacts/{path_to_artifact_slug(rel)}.md"
+    entry = CatalogEntry(
+        path=rel,
+        kind=kind,
+        domain=domain,
+        summary=summary,
+        symbols=symbols,
+        sections=sections,
+        chars=len(body),
+        artifact=artifact_rel,
+    )
+    return entry, body
+
+
+def git_ref(repo_root: Path) -> str:
+    try:
+        return subprocess.check_output(
+            ["git", "rev-parse", "HEAD"],
+            cwd=repo_root,
+            text=True,
+        ).strip()
+    except (subprocess.CalledProcessError, FileNotFoundError):
+        return "unknown"
+
+
+def build_catalog(repo_root: Path | None = None) -> Path:
+    """Walk the repo, write ``catalog.yaml`` and per-source artifacts."""
+    root = repo_root or repo_root_from_here()
+    out_dir = context_dir(root)
+    artifacts_dir = out_dir / "artifacts"
+    if artifacts_dir.exists():
+        shutil.rmtree(artifacts_dir)
+    artifacts_dir.mkdir(parents=True, exist_ok=True)
+
+    entries: list[CatalogEntry] = []
+    for path in collect_source_paths(root):
+        entry, body = build_entry(root, path)
+        entries.append(entry)
+        artifact_path = out_dir / entry.artifact
+        artifact_path.parent.mkdir(parents=True, exist_ok=True)
+        artifact_path.write_text(body, encoding="utf-8")
+
+    built_at = datetime.now(tz=UTC).replace(microsecond=0).isoformat()
+    catalog = {
+        "source_url": REPO_URL,
+        "git_ref": git_ref(root),
+        "branch": DEFAULT_BRANCH,
+        "built_at": built_at,
+        "ingest_source": str(root),
+        "entry_count": len(entries),
+        "entries": [
+            {
+                "path": e.path,
+                "kind": e.kind,
+                "domain": e.domain,
+                "summary": e.summary,
+                "symbols": e.symbols,
+                "sections": e.sections,
+                "chars": e.chars,
+                "artifact": e.artifact,
+            }
+            for e in entries
+        ],
+    }
+    catalog_path = out_dir / "catalog.yaml"
+    catalog_path.write_text(yaml.safe_dump(catalog, sort_keys=False), encoding="utf-8")
+
+    # Remove legacy topic-blob digests if present.
+    for legacy in (
+        "overview.md",
+        "core_library.md",
+        "methods.md",
+        "implementations.md",
+        "extension_guides.md",
+        "manifest.yaml",
+    ):
+        legacy_path = out_dir / legacy
+        if legacy_path.is_file():
+            legacy_path.unlink()
+
+    _sync_skill_catalog_summary(catalog, root)
+    return out_dir
+
+
+def _sync_skill_catalog_summary(catalog: dict[str, Any], repo_root: Path) -> None:
+    """Write a compact domain summary for the repo-navigation skill."""
+    entries = catalog.get("entries", [])
+    domains: dict[str, int] = {}
+    for entry in entries:
+        if isinstance(entry, dict):
+            domain = str(entry.get("domain", "other"))
+            domains[domain] = domains.get(domain, 0) + 1
+    summary = {
+        "source_url": catalog.get("source_url"),
+        "branch": catalog.get("branch"),
+        "built_at": catalog.get("built_at"),
+        "git_ref": catalog.get("git_ref"),
+        "entry_count": catalog.get("entry_count"),
+        "domains": domains,
+    }
+    out = (
+        repo_root
+        / "implementations/getting_started/concierge_agent/skills/repo-navigation/references/catalog-summary.yaml"
+    )
+    header = "# Concierge catalog summary (regenerated by scripts/build_concierge_context.py)\n"
+    out.write_text(header + yaml.safe_dump(summary, sort_keys=False), encoding="utf-8")
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__concierge_agent__knowledge.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__concierge_agent__knowledge.py.md
new file mode 100644
index 0000000..8da484d
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__concierge_agent__knowledge.py.md
@@ -0,0 +1,60 @@
+# Source: implementations/getting_started/concierge_agent/knowledge.py
+
+kind: python
+
+```python
+"""Repo knowledge tools — catalog search and artifact fetch.
+
+Legacy :func:`search_repo_knowledge` delegates to the catalog tools for
+backward compatibility.
+"""
+
+from __future__ import annotations
+
+from getting_started.concierge_agent.catalog import (
+    clear_catalog_cache,
+    fetch_repo_artifact,
+    search_repo_catalog,
+)
+
+
+def clear_knowledge_cache() -> None:
+    """Clear cached catalog reads (for tests)."""
+    clear_catalog_cache()
+
+
+def search_repo_knowledge(query: str, topic: str | None = None) -> str:
+    """Backward-compatible wrapper: catalog search + fetch of the top hit."""
+    catalog_result = search_repo_catalog(query, domain=_topic_to_domain(topic))
+    if catalog_result.startswith("No catalog matches"):
+        return catalog_result
+    # Pull first path from catalog output for a combined excerpt.
+    for line in catalog_result.splitlines():
+        if line.startswith("- **path:** "):
+            path = line.removeprefix("- **path:** ").strip().strip("`")
+            body = fetch_repo_artifact(path, max_chars=2400)
+            return f"{catalog_result}\n\n---\n\n# Fetched: `{path}`\n\n{body}"
+    return catalog_result
+
+
+def _topic_to_domain(topic: str | None) -> str | None:
+    if topic is None:
+        return None
+    key = topic.lower().removesuffix(".md")
+    mapping = {
+        "overview": "docs",
+        "core_library": "core.evaluation",
+        "methods": "core.methods",
+        "implementations": "impl.getting_started",
+        "extension_guides": "scripts",
+    }
+    return mapping.get(key, key if key.startswith(("core.", "impl.", "docs", "scripts")) else None)
+
+
+__all__ = [
+    "clear_knowledge_cache",
+    "fetch_repo_artifact",
+    "search_repo_catalog",
+    "search_repo_knowledge",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__concierge_agent__skills__repo-navigation__SKILL.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__concierge_agent__skills__repo-navigation__SKILL.md.md
new file mode 100644
index 0000000..87e66ba
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__concierge_agent__skills__repo-navigation__SKILL.md.md
@@ -0,0 +1,21 @@
+# Source: implementations/getting_started/concierge_agent/skills/repo-navigation/SKILL.md
+
+kind: markdown
+
+---
+name: repo-navigation
+description: >-
+  Reference guide for the repo concierge catalog — domain filters, the
+  search-then-fetch workflow, and bootcamp routing. Load references/catalog-guide.md
+  when you need routing hints. No scripts.
+---
+
+# Repo navigation skill
+
+## Workflow
+
+1. Optional: `load_skill_resource("repo-navigation", "references/catalog-guide.md")`
+2. `search_repo_catalog(query, domain=..., kind=...)` — metadata only
+3. `fetch_repo_artifact(path)` for each path you need (1–3 per question)
+
+**No scripts. Do not call `run_skill_script`.**
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__concierge_agent__skills__repo-navigation__references__catalog-guide.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__concierge_agent__skills__repo-navigation__references__catalog-guide.md.md
new file mode 100644
index 0000000..e972c8e
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__concierge_agent__skills__repo-navigation__references__catalog-guide.md.md
@@ -0,0 +1,51 @@
+# Source: implementations/getting_started/concierge_agent/skills/repo-navigation/references/catalog-guide.md
+
+kind: markdown
+
+# Catalog guide
+
+## Two-step retrieval
+
+| Step | Tool | Returns |
+|------|------|---------|
+| 1 | `search_repo_catalog(query, domain?, kind?)` | Paths, summaries, section titles |
+| 2 | `fetch_repo_artifact(path, section?)` | Full file/notebook content |
+
+Never skip step 1 — it keeps responses grounded and token-efficient.
+
+## Domain filters (`domain=`)
+
+| Domain | Contents |
+|--------|----------|
+| `core.data` | `aieng/forecasting/data/` |
+| `core.evaluation` | `aieng/forecasting/evaluation/` |
+| `core.methods` | `aieng/forecasting/methods/` |
+| `core.documents` | `aieng/forecasting/documents/` |
+| `core.root` | top-level `aieng/forecasting/*.py` |
+| `impl.<use_case>` | e.g. `impl.energy_oil_forecasting` |
+| `scripts` | `scripts/fetch_*.py` |
+| `docs` | README, AGENTS, roadmap, adk-skills-guide |
+
+## Kind filters (`kind=`)
+
+- `python` — library and implementation modules
+- `notebook` — markdown **and code cells** (outputs stripped)
+- `markdown` — READMEs, docs, and skill reference files
+- `yaml` — `specs/*.yaml`
+
+## Example sequences
+
+**DataService:**
+1. `search_repo_catalog("DataService register", domain="core.data")`
+2. `fetch_repo_artifact("aieng-forecasting/aieng/forecasting/data/service.py")`
+3. Optionally `fetch_repo_artifact("scripts/fetch_cpi.py")`
+
+**LLMP context:**
+1. `search_repo_catalog("LLMP user_prompt_suffix", domain="core.methods")`
+2. `fetch_repo_artifact("aieng-forecasting/aieng/forecasting/methods/llm_processes/base.py")`
+
+**Energy notebook 02:**
+1. `search_repo_catalog("intro agentic predictor", domain="impl.energy_oil_forecasting", kind="notebook")`
+2. `fetch_repo_artifact("implementations/energy_oil_forecasting/02_intro_agentic_predictor.ipynb")`
+
+Load `references/navigation-map.md` for bootcamp entry points.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__concierge_agent__skills__repo-navigation__references__navigation-map.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__concierge_agent__skills__repo-navigation__references__navigation-map.md.md
new file mode 100644
index 0000000..36c7b6b
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__concierge_agent__skills__repo-navigation__references__navigation-map.md.md
@@ -0,0 +1,39 @@
+# Source: implementations/getting_started/concierge_agent/skills/repo-navigation/references/navigation-map.md
+
+kind: markdown
+
+# Bootcamp navigation map
+
+Quick pointers for common participant questions. Confirm details with
+`search_repo_catalog` and `fetch_repo_artifact` — this file is a map, not the full docs.
+
+## First steps
+
+1. `implementations/getting_started/00_environment_check.ipynb` — preflight (run first).
+2. `implementations/getting_started/01_cpi_data_exploration.ipynb` — data + ForecastingTask.
+3. `implementations/getting_started/02_cpi_backtest_demo.ipynb` — backtest loop.
+4. `implementations/getting_started/99_repo_concierge.ipynb` — this concierge (repo Q&A).
+
+## Reference implementations (pick by problem)
+
+| Order | Directory | Good for |
+|-------|-----------|----------|
+| 0 | `getting_started/` | Smallest eval loop (CPI gasoline, h=1) |
+| 1 | `sp500_forecasting/` | Numerical methods + covariate-aware LLMP |
+| 2 | `food_price_forecasting/` | Multi-target CPI trajectories, CFPR metric |
+| 3 | `energy_oil_forecasting/` | Daily prices, news/code agents, adaptive agent |
+| 4 | `boc_rate_decisions/` | Discrete cut/hold/hike events, RPS/Brier |
+
+## Related agents
+
+Each domain ships `99_starter_agent.ipynb` + `starter_agent/` for hands-on
+forecasting (news search, code execution). Energy also has `analyst_agent/` and
+`adaptive_agent/`. This **repo concierge** helps you navigate and understand the
+codebase; domain starter agents are where you build and score forecasts.
+
+## Key library entry points
+
+- `aieng.forecasting.data.DataService` — register series, build contexts.
+- `aieng.forecasting.evaluation` — `Predictor`, `backtest()`, `evaluate()`.
+- `aieng.forecasting.methods` — baselines, numerical, LLM Processes, agentic ADK.
+- `AGENTS.md` — contributor conventions (models, data cache, docs).
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__specs__cpi_gasoline_1m.yaml.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__specs__cpi_gasoline_1m.yaml.md
new file mode 100644
index 0000000..e1dc137
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__specs__cpi_gasoline_1m.yaml.md
@@ -0,0 +1,69 @@
+# Source: implementations/getting_started/specs/cpi_gasoline_1m.yaml
+
+kind: yaml
+
+```yaml
+# Reference BacktestSpec: CPI Gasoline Canada, 1-month ahead forecast
+#
+# Gasoline is the "hello-world" target for the getting-started experiment.
+# It is deliberately chosen over headline All-items because it is visibly
+# volatile (2008 crude collapse, 2014-16 OPEC-led decline, 2020 COVID
+# demand shock, 2022 Russia/Ukraine surge) — a single-series backtest
+# makes the "why is this hard?" teaching point land without needing the
+# full CFPR trajectory machinery.
+#
+# Horizon is 1 month ahead (h=1).  Rationale:
+#   - One-step-ahead is the most natural short-term question on a monthly
+#     series and is immediately interpretable.
+#   - StatCan releases CPI ~3 weeks after the reference month, so a
+#     forecast made at origin T resolves at T+1 within the same month the
+#     next CPI print is published.  This is short enough to enable
+#     genuine live / prospective evaluation (make a prediction today,
+#     validate it next month).
+#   - The backtest (2000–2025) provides ~276 training-window origins;
+#     the eval set (2025–present) covers ~15+ already-resolved origins,
+#     enough for a stable CRPS estimate.
+#   - Naive (last-value) is meaningfully bad at turning points, so the
+#     naive-vs-AutoARIMA comparison still lands even at h=1.
+#
+# To load this spec in Python:
+#
+#   import yaml
+#   from aieng.forecasting.evaluation import BacktestSpec
+#
+#   with open("implementations/getting_started/specs/cpi_gasoline_1m.yaml") as f:
+#       spec = BacktestSpec.model_validate(yaml.safe_load(f))
+#
+# The target series must be registered in the DataService before running
+# a backtest.  Use scripts/fetch_cpi.py to populate the local data cache,
+# then register "cpi_gasoline_canada" from StatCan table 18-10-0004-11.
+
+task:
+  task_id: cpi_gasoline_canada_1m
+  target_series_id: cpi_gasoline_canada
+  horizons: [1]
+  frequency: MS
+  description: >-
+    Forecast Canada-wide CPI Gasoline (index, 2002=100) exactly 1 month
+    ahead of the forecast origin. Evaluated at every monthly origin.
+    Ground truth is the observed StatCan CPI value at the resolution date.
+
+# Backtest window: January 2000 through January 2025.
+# Origins from January 2025 onward are reserved for the eval set.
+start: "2000-01-01"
+end: "2025-01-01"
+
+# stride=1 on monthly (MS) frequency gives one origin per month.
+stride: 1
+
+# Require 24 months of history before the first forecast.
+warmup: 24
+
+description: >-
+  Getting-started reference backtest: forecast Canada-wide CPI Gasoline
+  (index, 2002=100) 1 month ahead at every monthly origin from 2000
+  to 2025.  The volatile target makes the naive-vs-AutoARIMA comparison
+  visually compelling and motivates richer methods (exogenous covariates,
+  LLM context, agentic news retrieval).  Origins from Jan 2025 onward
+  are held out in the companion eval spec for live / prospective testing.
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__specs__cpi_gasoline_eval_2025.yaml.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__specs__cpi_gasoline_eval_2025.yaml.md
new file mode 100644
index 0000000..9f46080
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__getting_started__specs__cpi_gasoline_eval_2025.yaml.md
@@ -0,0 +1,73 @@
+# Source: implementations/getting_started/specs/cpi_gasoline_eval_2025.yaml
+
+kind: yaml
+
+```yaml
+# Reference EvalSpec: CPI Gasoline Canada, 1-month ahead — 2025 eval window
+#
+# Companion to cpi_gasoline_1m.yaml.  Covers origins from January 2025
+# through March 2026 — the most recent window where every forecast (h=1)
+# has a published resolution as of the time of writing (May 2026; StatCan
+# releases CPI ~3 weeks after the reference month, so April 2026 data was
+# published ~May 19 2026 and all h=1 forecasts up to March 2026 are
+# resolved).
+#
+# This spec is intentionally "live-adjacent": origins from late 2025 and
+# early 2026 are as recent as possible, making it possible to extend the
+# eval window month by month as new CPI prints are published.
+#
+# Use this spec sparingly.  max_runs: 5 limits how many times a
+# participant may run evaluate() against it, reducing the risk of
+# inadvertently over-fitting to the held-out window.
+#
+# To load this spec in Python:
+#
+#   import yaml
+#   from pathlib import Path
+#   from aieng.forecasting.evaluation import EvalSpec, EvalTracker, evaluate
+#
+#   with open("implementations/getting_started/specs/cpi_gasoline_eval_2025.yaml") as f:
+#       spec = EvalSpec.model_validate(yaml.safe_load(f))
+#
+#   tracker = EvalTracker(Path("data/eval_runs.yaml"))
+#   result = evaluate(predictor=my_predictor, spec=spec, data_service=svc, tracker=tracker)
+#   print(f"Eval mean CRPS: {result.mean_score:.4f}  (run {result.run_number}/{spec.max_runs})")
+#
+# The target series must be registered before running.  Use
+# scripts/fetch_cpi.py to populate the local data cache, then register
+# "cpi_gasoline_canada".
+
+spec_id: cpi_gasoline_eval_2025
+
+task:
+  task_id: cpi_gasoline_canada_1m
+  target_series_id: cpi_gasoline_canada
+  horizons: [1]
+  frequency: MS
+  description: >-
+    Forecast Canada-wide CPI Gasoline (index, 2002=100) exactly 1 month
+    ahead of the forecast origin. Evaluated at every monthly origin.
+    Ground truth is the observed StatCan CPI value at the resolution date.
+
+# Eval window: January 2025 through March 2026.
+# Origins Jan 2025 – Mar 2026 with h=1 produce forecast dates Feb 2025 – Apr 2026,
+# all of which are published as of May 2026.
+start: "2025-01-01"
+end: "2026-04-01"
+
+# stride=1 gives one origin per month (~15 origins).
+stride: 1
+
+# Require 24 months of history before the first forecast.
+warmup: 24
+
+# Budget cap: each participant may run evaluate() against this spec at
+# most 5 times.  Enforced by EvalTracker when passed to evaluate().
+max_runs: 5
+
+description: >-
+  Getting-started protected eval: monthly origins from Jan 2025 through
+  Mar 2026 for CPI Gasoline Canada 1-month ahead forecasts.  All
+  resolutions are published as of May 2026.  Budget-limited to 5 runs
+  per participant tracker.
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__01_sp500_multivariate_backtest.ipynb.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__01_sp500_multivariate_backtest.ipynb.md
new file mode 100644
index 0000000..adb4014
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__01_sp500_multivariate_backtest.ipynb.md
@@ -0,0 +1,509 @@
+# Source: implementations/sp500_forecasting/01_sp500_multivariate_backtest.ipynb
+
+kind: notebook
+
+## Cell 1 (markdown)
+
+# S&P 500 — multivariate conventional-methods comparison
+
+The repo's **financial-markets** reference: a head-to-head of conventional
+time-series methods on a daily equity index, all reading the **same leak-safe
+covariate panel** (VIX, Treasury yields, the 2s10s spread, fed funds, CPI,
+unemployment, oil, gold, the dollar index, NASDAQ), plus an LLM-Process
+forecaster that can read those covariates in its prompt.
+
+> Given the same macro/market observations, **which method forecasts the index
+> best — and can an LLM-Process, handed those covariates, keep up with gradient
+> boosting?**
+
+Unlike the energy/oil reference (univariate price trajectory forecast by
+news-grounded / code-executing / adaptive **agents**), this is a clean,
+reproducible **numerical-methods bake-off across a covariate panel** — no agents,
+no news — scored with CRPS and direction metrics.
+
+**Methods**
+
+| Family | Predictors | Covariates? |
+|---|---|---|
+| Naive floor | `LastValuePredictor` | — |
+| Classical | ETS, Kalman, AutoARIMA | — (univariate) |
+| ML regression | Darts LinearRegression, LightGBM | ✅ optional past covariates |
+| LLM-Process | `SampledTrajectoryLLMPredictor` | ✅ optional covariate prompt blocks |
+
+**How it's organised.** Specs (`specs/*.yaml`) carry only the **experiment
+design** — the window and one single-horizon task per `sp500_logret_{N}b` target.
+The **predictors and their hyperparameters live in code** (the predictors cell in
+Section 4), so you tune models by editing Python, not YAML.
+
+**Workflow:** iterate on the **2025 backtest** (Section 5) by setting
+`EXPERIMENT_CONFIG` in the config cell, then spend the **protected 2026 eval**
+(Section 7) on your finalists. `"smoke"` is the fast default.
+
+## Cell 2 (markdown)
+
+## What's actually forecastable at daily resolution?
+
+- **Target = returns, not the index level.** We forecast **close-to-close
+  cumulative log returns** over a few horizons, one series per window:
+  `sp500_logret_1b` (next session), `sp500_logret_5b` (forward 1 week),
+  `sp500_logret_21b` (forward 1 month). Forecasting `sp500_logret_Nb` exactly `N`
+  business days ahead resolves to the **forward** cumulative return over the next
+  `N` sessions.
+- **Per horizon:** `h=1` → direction / next-day **risk management**; `h=5`/`h=21`
+  → tactical rebalancing and option tenors as cumulative returns.
+- **The caveat:** for an index the return *level* is close to a martingale, so
+  far-ahead point forecasts trend toward ~0; the forecastable, actionable objects
+  are **volatility, tail risk, and direction**. That's why a VIX-led panel helps
+  most at short horizons — watch the edge shrink as the horizon grows.
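+
+The forward-return target described above can be sketched in a few lines of
+pandas. This is a minimal illustration, not the construction used in `data.py`:
+the helper name `forward_log_return` and the toy `close` series are assumptions
+made up for this example.
+
+```python
+import numpy as np
+import pandas as pd
+
+
+def forward_log_return(close: pd.Series, n: int) -> pd.Series:
+    """Forward cumulative log return over the next n sessions, indexed by origin."""
+    log_close = np.log(close)
+    # shift(-n) aligns the close n sessions ahead with each origin row;
+    # the last n origins have no resolution yet and come out as NaN.
+    return log_close.shift(-n) - log_close
+
+
+# Hypothetical closing prices on six consecutive business days.
+close = pd.Series(
+    [100.0, 101.0, 99.0, 102.0, 103.0, 104.0],
+    index=pd.bdate_range("2025-01-06", periods=6),
+)
+logret_1b = forward_log_return(close, 1)
+logret_5b = forward_log_return(close, 5)
+```
+
+Because log returns add, the five next-session returns starting at an origin sum
+to that origin's 5-session return, which is why the targets are defined as
+cumulative log returns rather than simple percentage changes.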
+
+## Cell 3 (markdown)
+
+## ⚠️ Cutoff-aware evaluation — why the windows are what they are
+
+This is the methodological heart of the comparison, and it's easy to get wrong.
+
+- **Numerical methods are cutoff-safe by construction.** Naive, ETS, Kalman,
+  AutoARIMA, LinReg and LightGBM only ever see the series up to the forecast
+  origin (`ForecastContext` enforces it). They can be backtested on *any*
+  historical window — including the 2020 COVID crash.
+- **An LLM is not.** Gemini's training cutoff is ~**January 2025**, so it has
+  effectively **memorised** outcomes before then. Scoring an LLM-Process on a
+  pre-2025 origin measures *recall*, not forecasting — and silently flatters it
+  in the head-to-head, which is exactly the comparison this notebook is about.
+
+So the LLM-inclusive comparison lives **after the cutoff**:
+
+| Window | Spec | Role | LLMP? |
+|---|---|---|---|
+| **2025** | `sp500_smoke.yaml` / `sp500_backtest_2025.yaml` | open iteration & comparison | ✅ post-cutoff |
+| **2026** | `sp500_eval_2026.yaml` | **protected** held-out test (budgeted) | ✅ post-cutoff |
+| **2020 COVID** | `sp500_stress_2020.yaml` | volatile-regime stress, **numerical only** | ❌ leaked for LLMs |
+
+This discipline is enforced **in code**: the predictors cell (Section 4) gates the
+LLM-Process rows on a `POST_CUTOFF` flag, so selecting `EXPERIMENT_CONFIG =
+"stress_2020"` drops them automatically — the numerical methods remain perfectly
+valid on that volatile window.
+
+## Cell 4 (markdown)
+
+---
+## 1. Setup
+
+The heavy lifting lives in helper modules alongside this notebook:
+
+- `data.py`        builds the `DataService` (return targets + the covariate panel).
+- `predictors/`    the S&P 500 LLM-Process recipe (prompt framing + sampling budget).
+- `leaderboard.py` turns cached results into the `RESULTS_DF` leaderboard frame.
+- `analysis.py` / `plots.py`  direction metrics, styled tables, and figures.
+
+We build **one** data service that registers the three return targets plus the
+full covariate panel. Target-only predictors simply ignore the registered
+covariates; the covariate variants read them.
+
+**First run on the 2025/2026 windows?** Warm the caches to the present first:
+`uv run python scripts/fetch_sp500_market.py --refresh` (Yahoo: ^GSPC/^VIX/^IXIC)
+and `uv run python scripts/fetch_fred.py` (macro covariates).
+
+## Cell 5 (code)
+
+```python
+from __future__ import annotations
+
+import warnings
+from pathlib import Path
+
+import matplotlib.pyplot as plt
+import pandas as pd
+import yaml
+from dotenv import load_dotenv
+
+
+warnings.filterwarnings("ignore")
+
+
+# Resolve the repo root robustly (walk up for the workspace markers) so the
+# proxy creds load regardless of the kernel's working directory.
+def _repo_root() -> Path:
+    here = Path.cwd().resolve()
+    for cand in (here, *here.parents):
+        if (cand / "pyproject.toml").exists() and (cand / "aieng-forecasting").is_dir():
+            return cand
+    return here
+
+
+ROOT = _repo_root()
+load_dotenv(ROOT / ".env")  # LLMP rows call the Vector proxy — need PROXY_* set
+
+from aieng.forecasting.evaluation import (
+    MultiTargetBacktestSpec,
+    MultiTargetEvalSpec,
+    cached_multi_backtest,
+    describe_spec,
+    multi_evaluate,
+)
+from aieng.forecasting.methods import (
+    DartsAutoARIMAPredictor,
+    DartsExponentialSmoothingPredictor,
+    DartsKalmanForecasterPredictor,
+    DartsLightGBMPredictor,
+    DartsLinearRegressionPredictor,
+    LastValuePredictor,
+)
+from sp500_forecasting import DEFAULT_COVARIATE_SERIES_IDS, build_sp500_multivariate_service
+from sp500_forecasting.leaderboard import build_leaderboard, build_return_compare_frame
+from sp500_forecasting.plots import (
+    display_multivariate_backtest_leaderboard,
+    plot_return_forecast_vs_actual_multi,
+    plot_sp500_log_return_recent,
+)
+from sp500_forecasting.predictors import build_sp500_llmp_sampled_trajectory
+
+
+SPECS_DIR = ROOT / "implementations" / "sp500_forecasting" / "specs"
+PREDICTIONS_DIR = ROOT / "data" / "predictions"
+
+# Data environment (not part of any single experiment): how far back to load
+# price/covariate history, and whether to re-fetch caches from source.
+DATA_HISTORY_START = "2016-01-01"
+REFRESH_CACHE = False
+
+svc = build_sp500_multivariate_service(
+    windows=(1, 5, 21),
+    include_covariates=True,
+    covariate_series_ids=DEFAULT_COVARIATE_SERIES_IDS,
+    start=DATA_HISTORY_START,
+    refresh=REFRESH_CACHE,
+)
+
+# The covariate panel available to the with-covariates predictors below (an
+# optional FRED feed may be skipped, so filter to what actually registered).
+registered = set(svc.series_ids)
+COVARIATES = [c for c in DEFAULT_COVARIATE_SERIES_IDS if c in registered]
+print("targets:", sorted(s for s in registered if s.startswith("sp500_logret_")))
+print(f"covariates registered: {len(COVARIATES)} / {len(DEFAULT_COVARIATE_SERIES_IDS)} requested")
+```
+
+## Cell 6 (markdown)
+
+---
+## 2. Configuration
+
+One selector picks the backtest window; everything downstream adapts. Specs are
+the source of truth for the **window and tasks** — not the predictors.
+
+## Cell 7 (code)
+
+```python
+# ── Experiment configuration ──────────────────────────────────────────────────
+# EXPERIMENT_CONFIG chooses which window drives the backtest (Section 5); all
+# downstream cells adapt. The protected 2026 eval (Section 7) is fixed.
+#
+#   "smoke"          short late-2025 window (~6 weekly origins), post-cutoff.
+#                    Fast default; LLM-Process predictors ON.
+#   "backtest_2025"  weekly origins across all of 2025 (~50), post-cutoff.
+#                    The full comparison; LLMP is slow here — trim the roster.
+#   "stress_2020"    the COVID crash (daily origins Feb–Apr 2020). NUMERICAL
+#                    ONLY — pre-cutoff is leaked for LLMs, so the predictors cell
+#                    drops the LLMP rows for this config.
+
+EXPERIMENT_CONFIG = "smoke"
+
+_BACKTEST_SPEC_FILES = {
+    "smoke": "sp500_smoke.yaml",
+    "backtest_2025": "sp500_backtest_2025.yaml",
+    "stress_2020": "sp500_stress_2020.yaml",
+}
+_EVAL_SPEC_FILE = "sp500_eval_2026.yaml"
+
+# 2020 is pre-cutoff → keep the LLM-Process out of the comparison (see the cutoff-aware evaluation section above).
+POST_CUTOFF = EXPERIMENT_CONFIG in {"smoke", "backtest_2025"}
+
+with (SPECS_DIR / _BACKTEST_SPEC_FILES[EXPERIMENT_CONFIG]).open() as f:
+    backtest_spec = MultiTargetBacktestSpec.model_validate(yaml.safe_load(f))
+
+print(
+    f"Config: {EXPERIMENT_CONFIG!r}  →  {_BACKTEST_SPEC_FILES[EXPERIMENT_CONFIG]}  "
+    f"(LLMP {'on' if POST_CUTOFF else 'OFF — pre-cutoff'})"
+)
+print(describe_spec(backtest_spec, data_service=svc))
+```
+
+## Cell 8 (markdown)
+
+---
+## 3. Target context — observed returns
+
+The most recent realised next-session returns (the `h=1` target). Blue = index up
+over the window, red = down.
+
+## Cell 9 (code)
+
+```python
+fig, _ = plot_sp500_log_return_recent(svc, n_trading_days=504)
+plt.show()
+```
+
+## Cell 10 (markdown)
+
+---
+## 4. Predictors — configured in code
+
+This is where you choose the roster. Each predictor implements the same
+`Predictor` API against the loaded spec; the conventional methods come straight
+from `aieng.forecasting.methods`, and the LLM-Process variants come from the
+S&P 500 recipe in `predictors/`. The covariate variants read the registered
+`COVARIATES` panel; the target-only variants pass `covariate_series_ids=None`.
+
+| Group | Predictor | Covariates? |
+|---|---|---|
+| Naive floor | `LastValuePredictor` | — |
+| Classical | `DartsExponentialSmoothingPredictor` (ETS), `DartsKalmanForecasterPredictor`, `DartsAutoARIMAPredictor` | — |
+| ML regression | `DartsLinearRegressionPredictor`, `DartsLightGBMPredictor` | target-only **and** + panel |
+| LLM-Process | `build_sp500_llmp_sampled_trajectory` | target-only **and** + panel (post-cutoff only) |
+
+`AutoARIMA` is in the roster by default: accurate, but the slowest classical
+method, so remove it from `all_predictors` for faster iteration. The
+`LLMP (target)` vs `LLMP + cov` pair is the centerpiece: its CRPS gap answers
+whether an LLM can use the same exogenous panel the ML methods do.
+
+## Cell 11 (code)
+
+```python
+# Shared hyperparameters for the Darts regression models.
+LAGS = 5  # autoregressive lags (and past-covariate lags) for the regression models
+NUM_SAMPLES = 100  # empirical-quantile sample count for the probabilistic Darts models
+LGBM_KWARGS = {"num_threads": 1, "n_jobs": 1, "verbosity": -1}
+
+# ── Naive floor + classical (univariate) ──────────────────────────────────────
+naive = LastValuePredictor()
+ets = DartsExponentialSmoothingPredictor(num_samples=NUM_SAMPLES)
+kalman = DartsKalmanForecasterPredictor(num_samples=NUM_SAMPLES)
+autoarima = DartsAutoARIMAPredictor(num_samples=NUM_SAMPLES)  # accurate but the slowest classical method
+
+# ── ML regression — target-only vs the covariate panel ────────────────────────
+linreg = DartsLinearRegressionPredictor(lags=LAGS, covariate_series_ids=None, num_samples=NUM_SAMPLES)
+linreg_cov = DartsLinearRegressionPredictor(
+    lags=LAGS, lags_past_covariates=LAGS, covariate_series_ids=COVARIATES, num_samples=NUM_SAMPLES
+)
+lightgbm = DartsLightGBMPredictor(
+    lags=LAGS, covariate_series_ids=None, num_samples=NUM_SAMPLES, lgbm_kwargs=LGBM_KWARGS
+)
+lightgbm_cov = DartsLightGBMPredictor(
+    lags=LAGS,
+    lags_past_covariates=LAGS,
+    covariate_series_ids=COVARIATES,
+    num_samples=NUM_SAMPLES,
+    lgbm_kwargs=LGBM_KWARGS,
+)
+
+all_predictors = [naive, ets, kalman, autoarima, linreg, linreg_cov, lightgbm, lightgbm_cov]
+
+# Which covariate panel each predictor consumes — drives the leaderboard's
+# covariate columns. Predictors absent from this map are treated as target-only.
+PREDICTOR_COVARIATES = {
+    linreg_cov.predictor_id: COVARIATES,
+    lightgbm_cov.predictor_id: COVARIATES,
+}
+# Short labels for the leaderboard "model" column and the CRPS charts.
+PREDICTOR_LABELS = {
+    naive.predictor_id: "Naive",
+    ets.predictor_id: "ETS",
+    kalman.predictor_id: "Kalman",
+    autoarima.predictor_id: "AutoARIMA",
+    linreg.predictor_id: "LinReg",
+    linreg_cov.predictor_id: "LinReg + cov",
+    lightgbm.predictor_id: "LightGBM",
+    lightgbm_cov.predictor_id: "LightGBM + cov",
+}
+
+# ── LLM-Process (sampled trajectories) — target-only vs with-covariates ────────
+# Gated on POST_CUTOFF: the 2020 stress window is pre-cutoff and leaked for LLMs,
+# so we keep the LLMP rows out of that comparison entirely. The recipe (prompt
+# framing, defaults) lives in predictors/llmp_sampled_trajectory.py; tune it
+# there, or override model= / n_samples= / history_window= per call here.
+if POST_CUTOFF:
+    llmp = build_sp500_llmp_sampled_trajectory(n_samples=8, history_window=48)
+    llmp_cov = build_sp500_llmp_sampled_trajectory(n_samples=8, history_window=48, covariate_series_ids=COVARIATES)
+    all_predictors += [llmp, llmp_cov]
+    PREDICTOR_COVARIATES[llmp_cov.predictor_id] = COVARIATES
+    PREDICTOR_LABELS[llmp.predictor_id] = "LLMP (target)"
+    PREDICTOR_LABELS[llmp_cov.predictor_id] = "LLMP + cov"
+
+print(f"{len(all_predictors)} predictors configured (LLMP {'on' if POST_CUTOFF else 'off'}):")
+for p in all_predictors:
+    print(f"  {p.predictor_id}")
+```
+
+## Cell 12 (markdown)
+
+---
+## 5. 2025 backtest — the comparison (post-cutoff)
+
+`cached_multi_backtest` runs each predictor across the three single-horizon tasks
+and writes every `BacktestResult` to `data/predictions/<spec_id>/` — so a re-run
+is free from cache (pass `force_refresh=True` to recompute). The loop prints each
+task's mean CRPS as it lands.
+
+## Cell 13 (code)
+
+```python
+results_by_predictor: dict[str, dict[str, object]] = {}
+
+for predictor in all_predictors:
+    print(f"Running {predictor.predictor_id} ...", flush=True)
+    results_by_predictor[predictor.predictor_id] = cached_multi_backtest(
+        predictor=predictor,
+        spec=backtest_spec,
+        data_service=svc,
+        store_dir=PREDICTIONS_DIR,
+    )
+    for task_id, result in results_by_predictor[predictor.predictor_id].items():
+        print(f"  {task_id:18s}  mean CRPS = {result.mean_score:.5f}  ({len(result.predictions)} preds)")
+
+RESULTS_DF = build_leaderboard(
+    results_by_predictor,
+    svc,
+    covariates_by_predictor=PREDICTOR_COVARIATES,
+    labels_by_predictor=PREDICTOR_LABELS,
+)
+```
+
+## Cell 14 (markdown)
+
+### Leaderboard — mean CRPS by method and horizon
+
+Read it as a story: the spread between methods (and the naive floor's
+disadvantage) is widest at `h=1` and compresses by `h=21`. The `dir_*` columns
+report next-direction skill (most meaningful at `h=1`).
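+
+For intuition about the score in these columns, here is a minimal sketch of a
+sample-based CRPS estimate, using the standard identity
+CRPS = E|X − y| − ½·E|X − X′| applied to the forecast samples. The function name
+`crps_from_samples` and the synthetic samples are assumptions for illustration;
+the scorer in `aieng.forecasting.evaluation` may be implemented differently.
+
+```python
+import numpy as np
+
+
+def crps_from_samples(samples: np.ndarray, observed: float) -> float:
+    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
+    samples = np.asarray(samples, dtype=float)
+    accuracy = np.mean(np.abs(samples - observed))
+    # Mean absolute difference over all sample pairs (includes i == j terms).
+    spread = np.mean(np.abs(samples[:, None] - samples[None, :]))
+    return float(accuracy - 0.5 * spread)
+
+
+# Two hypothetical forecasts of the same return: one sharp, one wide.
+rng = np.random.default_rng(0)
+sharp = rng.normal(loc=0.0, scale=0.01, size=100)
+wide = rng.normal(loc=0.0, scale=0.05, size=100)
+observed = 0.002
+crps_sharp = crps_from_samples(sharp, observed)
+crps_wide = crps_from_samples(wide, observed)
+```
+
+Lower is better: a forecast that is both centred near the outcome and narrow
+scores lower than an equally centred but wider one, and a single-point forecast
+reduces to absolute error.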
+
+## Cell 15 (code)
+
+```python
+display_multivariate_backtest_leaderboard(RESULTS_DF)
+```
+
+## Cell 16 (markdown)
+
+### Reading the LLMP ± covariates rows
+
+When the LLM-Process rows are present (post-cutoff windows), compare them directly:
+- **LLMP (target)** — the LLM sees **only** the return history.
+- **LLMP + cov** — it additionally sees **labeled covariate-history blocks**
+  (VIX, yields, …) in its prompt.
+
+Their CRPS gap answers the headline question: does the same exogenous panel the
+ML methods use help an LLM — and does either LLMP variant reach the
+gradient-boosting rows? Because these windows are **post-cutoff**, the LLM can't
+be reciting memorised outcomes.
+
+## Cell 17 (markdown)
+
+---
+## 6. Forecast vs realised (next session, h=1)
+
+A few models' median next-session forecast against the realised return (percent)
+— the near-zero, low-amplitude forecasts vs the noisy realised series are the
+daily-efficiency story made visual. These reuse the cached `h=1` predictions from
+Section 5 (no re-running).
+
+## Cell 18 (code)
+
+```python
+# A few representative models, pulled straight from the cached backtest results.
+H1_TASK = "sp500_logret_1b"
+demo_ids = [naive.predictor_id, ets.predictor_id, lightgbm_cov.predictor_id]
+
+compare_by_run: dict[str, pd.DataFrame] = {}
+for pid in demo_ids:
+    result = results_by_predictor.get(pid, {}).get(H1_TASK)
+    if result is None:
+        continue
+    label = PREDICTOR_LABELS.get(pid, pid)
+    compare_by_run[label] = build_return_compare_frame(result.predictions, svc, H1_TASK)
+
+fig, _ = plot_return_forecast_vs_actual_multi(compare_by_run, title="Next-session return: forecast vs realised (h=1)")
+plt.show()
+```
+
+## Cell 19 (markdown)
+
+---
+## 7. Protected 2026 eval — the honest scoreboard
+
+`multi_evaluate()` against the held-out 2026 window. This is **scarce**:
+`sp500_eval_2026.yaml` carries `max_runs`, and one `multi_evaluate` call across
+all three horizons counts as **one** run. Pass an `EvalTracker` to enforce that
+budget across sessions (commented below — it would block re-runs once exhausted,
+so we leave it off here and just show `run_number`).
+
+Spend it on the models you committed to from the 2025 backtest — pick a small
+**finalists** list in the cell below, not the whole roster. Eval results are
+never cached (caching would obscure the budget), so each run recomputes.
+
+> **Runtime:** the finalists default to the cutoff-safe naive baseline plus
+> `LightGBM + cov`. Add the LLMP variants when you're ready to spend the proxy
+> tokens on the protected scoreboard (each LLMP eval is ~8 origins × 3 horizons ×
+> `n_samples` calls).
+
+## Cell 20 (code)
+
+```python
+# from aieng.forecasting.evaluation import EvalTracker
+# tracker = EvalTracker(ROOT / "data" / "sp500_eval_runs.yaml")  # enforces max_runs across sessions
+
+with (SPECS_DIR / _EVAL_SPEC_FILE).open() as f:
+    eval_spec = MultiTargetEvalSpec.model_validate(yaml.safe_load(f))
+
+# Finalists only — the held-out window is scarce. Pick the handful you committed
+# to from the 2025 backtest (e.g. add `llmp`, `llmp_cov` here when ready to spend
+# the proxy tokens — they exist only when POST_CUTOFF).
+eval_finalists = [naive, lightgbm_cov]
+
+eval_results: dict[str, dict[str, object]] = {}
+for predictor in eval_finalists:
+    print(f"Evaluating {predictor.predictor_id} ...", flush=True)
+    eval_results[predictor.predictor_id] = multi_evaluate(
+        predictor=predictor,
+        spec=eval_spec,
+        data_service=svc,
+        tracker=None,  # set to `tracker` above to enforce the max_runs budget
+    )
+    for task_id, result in eval_results[predictor.predictor_id].items():
+        print(f"  {task_id:18s}  mean CRPS = {result.mean_score:.5f}  (run #{result.run_number})")
+
+EVAL_DF = build_leaderboard(
+    eval_results,
+    svc,
+    covariates_by_predictor=PREDICTOR_COVARIATES,
+    labels_by_predictor=PREDICTOR_LABELS,
+)
+```
+
+## Cell 21 (code)
+
+```python
+display_multivariate_backtest_leaderboard(EVAL_DF)
+```
+
+## Cell 22 (markdown)
+
+---
+## Where to go next
+
+- **Scale up the backtest.** Set `EXPERIMENT_CONFIG = "backtest_2025"` (weekly
+  origins across all of 2025, full panel) and re-run from Section 2. The LLMP rows
+  are slow over ~50 origins — trim `all_predictors` or widen the spec's `stride`.
+- **Study the COVID regime (numerical only).** Set
+  `EXPERIMENT_CONFIG = "stress_2020"`: a volatile window where a covariate edge is
+  most visible. The predictors cell drops the LLMP rows automatically (2020 is
+  pre-cutoff, so an LLM would be reciting, not forecasting).
+- **Add a method.** Instantiate any `Predictor` (e.g. another Darts model — mirror
+  `aieng/forecasting/methods/numerical/darts_classical.py`) in the predictors
+  cell and append it to `all_predictors` with a `PREDICTOR_LABELS` entry. No
+  registry or dispatch to edit.
+- **Tune the LLM-Process.** Edit `predictors/llmp_sampled_trajectory.py` (prompt
+  framing, history window) or pass a different `model=` / `n_samples=` in the
+  predictors cell. How do the results move?
+- **Try an agentic forecaster** with the same cutoff caveats as the LLMP. Do
+  agents add lift over LLMPs or the conventional methods?
+- **Spend the eval honestly.** Wire the `EvalTracker` in Section 7 so `max_runs`
+  is enforced, and only evaluate models you've committed to.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__99_starter_agent.ipynb.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__99_starter_agent.ipynb.md
new file mode 100644
index 0000000..246018c
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__99_starter_agent.ipynb.md
@@ -0,0 +1,181 @@
+# Source: implementations/sp500_forecasting/99_starter_agent.ipynb
+
+kind: notebook
+
+## Cell 1 (markdown)
+
+# S&P 500 — Your Starter Agent
+
+**If you're not sure what to do next, continue from here.**
+
+This notebook is a fresh, hackable agent for the S&P 500 use case — deliberately *not* wired into the numbered curriculum. It gives you our common building blocks behind simple toggles, so you can start building something of your own:
+
+- **optional news search** — bounded, cutoff-aware Google Search (proxy-only)
+- **optional code execution** — an E2B Python sandbox
+- **two lightweight skills** — *tool-usage playbooks* in `starter_agent/skills/`
+
+It does two things: lets you **talk to the agent** (open-ended, Track 2) and **score one real forecast** (Track 1). The live cells are gated by `RUN_AGENT` so a fresh `Run All` is safe and free; flip it to `True` to actually call the model.
+
+## Cell 2 (code)
+
+```python
+import warnings
+from pathlib import Path
+
+
+warnings.filterwarnings("ignore")
+
+import pandas as pd
+from dotenv import load_dotenv
+
+
+# Repo root holds the .env with PROXY_* creds the agent needs.
+ROOT = Path.cwd().resolve().parents[1]
+load_dotenv(ROOT / ".env")
+
+# ── Model selection ───────────────────────────────────
+# Two project models: "gemini-3.1-flash-lite-preview" (lite/default) and
+# "gemini-3.5-flash" (advanced). Lite is the default.
+AGENT_MODEL = "gemini-3.1-flash-lite-preview"
+# AGENT_MODEL = "gemini-3.5-flash"  # advanced (higher cost/latency)
+
+# ── Run guard ──────────────────────────────────────
+# Live agent calls cost tokens and need PROXY_* in the repo-root .env, plus warm
+# data caches. Default False so `Run All` is safe; set True to call the model.
+RUN_AGENT = False
+
+from sp500_forecasting.starter_agent import (
+    build_starter_agent_config,
+    build_starter_agent_predictor,
+)
+
+
+print("RUN_AGENT =", RUN_AGENT, "| model =", AGENT_MODEL)
+```
+
+## Cell 3 (markdown)
+
+---
+## 1. Meet your agent
+
+`build_starter_agent_config` returns an `AgentConfig` with two toggles. The default turns **news search on** (proxy-only, no extra key) and **code execution off** (it needs `E2B_API_KEY` and is slower). Flip them and re-run — the loaded skills follow the enabled tools.
+
+## Cell 4 (code)
+
+```python
+config = build_starter_agent_config(
+    model=AGENT_MODEL,
+    enable_search=True,  # ← cutoff-aware Google Search (proxy-only)
+    enable_code_exec=False,  # ← E2B Python sandbox (needs E2B_API_KEY); try True!
+)
+
+print("Agent:", config.name)
+print("Search enabled:    ", config.context_retrieval.enabled)
+print("Code-exec enabled: ", config.code_execution.enabled)
+print("Skills loaded:     ", [p.name for p in config.skills_dirs])
+print("\n── System instruction (edit this in starter_agent/agent.py) ──\n")
+print(config.instruction[:1200], "...")
+```
+
+## Cell 5 (markdown)
+
+---
+## 2. Talk to it  *(Track 2 — open-ended analysis)*
+
+Ask the agent anything. This is the interactive mode: no scoring, no schema — just reasoning (and a web search, since search is on). Edit the question and explore.
+
+## Cell 6 (code)
+
+```python
+from aieng.forecasting.methods.agentic import build_adk_agent
+from aieng.forecasting.methods.agentic.adk_runner import AdkTextRunner, AdkTextRunnerConfig
+
+
+QUESTION = (
+    "What are the key macro risks for U.S. equities over the next month, and how "
+    "should they shape the spread of a 1-week S&P 500 return forecast? Be concise."
+)
+
+if RUN_AGENT:
+    chat_agent = build_adk_agent(config)  # schema-free: plain text in, text out
+    runner = AdkTextRunner(chat_agent, config=AdkTextRunnerConfig(app_name="sp500_starter_chat"))
+    reply = await runner.run_text_async(QUESTION)  # noqa: F704, PLE1142
+    print(reply)
+else:
+    print("RUN_AGENT is False — set it to True in the setup cell to talk to the agent.")
+```
+
+## Cell 7 (markdown)
+
+---
+## 3. Score one prediction against a known outcome  *(Track 1)*
+
+Now run the agent as a `Predictor`. We pick the **most recent origin whose horizon has already resolved**, forecast the 1-week S&P 500 return, and check whether the actual return landed inside the agent's 80% band. (One origin can't tell you if the agent is *calibrated*; that's what the backtest in `01_sp500_multivariate_backtest.ipynb` is for.) Live, so gated by `RUN_AGENT`.
+
+## Cell 8 (code)
+
+```python
+from datetime import datetime, timezone
+
+
+if RUN_AGENT:
+    from aieng.forecasting.evaluation.task import ForecastingTask
+    from sp500_forecasting import build_sp500_multivariate_service
+    from sp500_forecasting.data import sp500_logret_series_id
+
+    HORIZON = 5
+    COVARIATES = ["vix_level_l1b", "ust10y_level_l1b", "oil_log_ret_1b_l1b"]
+    svc = build_sp500_multivariate_service(covariate_series_ids=COVARIATES)
+    now = datetime.now(tz=timezone.utc).replace(tzinfo=None)
+    tgt = sp500_logret_series_id(HORIZON)
+    full = svc.get_series(tgt, as_of=now)
+    full["timestamp"] = pd.to_datetime(full["timestamp"])
+    last_date = full["timestamp"].iloc[-1]
+
+    # Most recent origin whose horizon return has already resolved.
+    AS_OF = last_date - pd.offsets.BDay(HORIZON + 1)
+
+    task = ForecastingTask(
+        task_id=f"sp500_logret_{HORIZON}b",
+        target_series_id=tgt,
+        horizons=[HORIZON],
+        frequency="B",
+        description=f"S&P 500 cumulative log return, {HORIZON} business days ahead (starter).",
+    )
+    ctx = svc.context(as_of=AS_OF)
+    pred = build_starter_agent_predictor(config, covariate_series_ids=COVARIATES).predict(task, ctx)[0]
+
+    rows = full[full["timestamp"] >= AS_OF + pd.offsets.BDay(HORIZON)]
+    actual = float(rows["value"].iloc[0]) if not rows.empty else None
+
+    fc = pred.payload
+    lo, hi = fc.quantiles[0.10], fc.quantiles[0.90]
+    print(f"Origin as_of={AS_OF.date()}  horizon={HORIZON}b (1 week)  (latest data {last_date.date()})\n")
+    print(f"  agent point  : {fc.point_forecast:+.4f}  ({fc.point_forecast * 100:+.2f}%)")
+    print(f"  agent 80% CI : [{lo:+.4f}, {hi:+.4f}]")
+    if actual is None:
+        print("  actual       : N/A (not yet resolved)")
+    else:
+        inb = "yes ✓" if lo <= actual <= hi else "no ✗"
+        print(f"  actual       : {actual:+.4f}  ({actual * 100:+.2f}%)   in 80% band? {inb}")
+    if pred.metadata.get("rationale"):
+        print("\nRationale:", pred.metadata["rationale"][:300])
+else:
+    print("RUN_AGENT is False — set it to True to score a live forecast against a known outcome.")
+```
+
+## Cell 9 (markdown)
+
+---
+## 4. Make it yours
+
+This agent is a starting point. Here are concrete next steps, easiest first — each is a small edit, then re-run the cells above.
+
+1. **Flip code execution on.** Set `enable_code_exec=True` in §1 (needs `E2B_API_KEY`). The agent loads the `code-analysis-playbook` skill and can compute its own diagnostics before forecasting. Compare the rationale.
+2. **Edit the agent's personality.** Open `starter_agent/agent.py` and change `_build_starter_instruction()` — make it more cautious, more contrarian, focused on one driver. Re-run §1 to see the new instruction.
+3. **Sharpen the skills.** The two files in `starter_agent/skills/` are short on purpose. Add your best queries to `research-playbook`, or a new diagnostic to `code-analysis-playbook`. The agent picks them up automatically.
+4. **Change the question and the origin.** Try a different `QUESTION` in §2 and a different origin in §3.
+5. **Widen the covariate panel.** Pass `DEFAULT_COVARIATE_SERIES_IDS` (11 series) to both the service and `build_starter_agent_predictor(...)` and see if the extra context helps.
+6. **Forecast other horizons.** Swap `HORIZON` to 1 (next session) or 21 (1 month) — each maps to its own `sp500_logret_{h}b` target.
+
+Bigger ideas — the full conventional-vs-LLM-Process comparison and direction metrics in `01_sp500_multivariate_backtest.ipynb` — are in the use-case `README.md` and `planning-docs/roadmap.md`.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__README.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__README.md.md
new file mode 100644
index 0000000..d9502eb
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__README.md.md
@@ -0,0 +1,246 @@
+# Source: implementations/sp500_forecasting/README.md
+
+kind: markdown
+
+# S&P 500 multivariate forecasting (leak-safe covariates)
+
+> **Reference implementation 1 of 4.** Recommended order: [getting_started](../getting_started/) → **S&P 500** → [food CPI](../food_price_forecasting/) → [energy / WTI](../energy_oil_forecasting/) → [BoC rate decisions](../boc_rate_decisions/). Each stands on its own.
+
+The **financial-markets** reference: a head-to-head comparison of conventional
+time-series methods on a daily equity index, all reading the **same leak-safe
+covariate panel**, plus an LLM-Process forecaster that can read those covariates
+in its prompt. It is the template for evaluated prediction (Track 1) on market
+series with exogenous covariates.
+
+The headline question:
+
+> Given the same macro/market observations, **which method forecasts the index
+> best — and can an LLM-Process, handed those covariates, keep up with gradient
+> boosting?**
+
+**How this differs from the energy/oil reference.** Energy forecasts a
+*univariate* price trajectory with news-grounded, code-executing, and adaptive
+**agents**. The backtest here uses no agents and no news: it is a clean, reproducible
+**numerical-methods bake-off across a multivariate covariate panel**, scored with
+CRPS and direction metrics.
+
+---
+
+## Forecasting task
+
+The targets are **close-to-close cumulative log returns** of `^GSPC`, registered
+one series per horizon (window `N` in business days):
+
+$$
+r^{(N)}_t = \log\frac{C^{\text{adj}}_{t}}{C^{\text{adj}}_{t-N}}
+$$
+
+Forecasting `sp500_logret_{N}b` exactly `N` business days ahead resolves to the
+**forward** cumulative return over the next `N` sessions — a clean single-marginal
+forecast at each horizon (no joint-path aggregation):
+
+| Target | Horizon | Actionable framing |
+|--------|---------|--------------------|
+| `sp500_logret_1b`  | 1 (next session) | direction / next-day **risk management** |
+| `sp500_logret_5b`  | 5 (forward 1 week) | tactical rebalancing, weekly tenors |
+| `sp500_logret_21b` | 21 (forward 1 month) | allocation, monthly tenors |
+
+**Frequency:** business (`B`). Returns (not the index level) keep the target
+stationary, which is the right setup for a methods comparison.
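+
+As a rough illustration of the target definition above, the sketch below computes trailing `N`-session log returns with pandas and checks that the value stamped `N` sessions after an origin equals the forward return from that origin. The helper name `cumulative_log_return` and the toy prices are assumptions for illustration only; the actual target construction lives in `data.py`.
+
+```python
+import numpy as np
+import pandas as pd
+
+
+def cumulative_log_return(adj_close: pd.Series, window: int) -> pd.Series:
+    """Trailing N-session log return: log(C[t] / C[t - N]). Hypothetical helper."""
+    log_price = np.log(adj_close)
+    return log_price - log_price.shift(window)
+
+
+# Toy adjusted closes on consecutive business days (made-up numbers).
+idx = pd.bdate_range("2025-01-06", periods=8)
+close = pd.Series([100.0, 101.0, 99.0, 102.0, 103.0, 104.0, 102.0, 105.0], index=idx)
+
+r1 = cumulative_log_return(close, 1)
+r5 = cumulative_log_return(close, 5)
+
+# The 5-session return stamped at t is the sum of the five 1-session returns ending at t.
+assert np.isclose(r5.iloc[5], r1.iloc[1:6].sum())
+
+# Seen from an origin 5 sessions earlier, the value stamped at t is the forward return.
+origin, resolved = idx[0], idx[5]
+forward = np.log(close.loc[resolved] / close.loc[origin])
+assert np.isclose(r5.loc[resolved], forward)
+```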
+
+**What's forecastable at daily resolution.** The index is close to a martingale,
+so the conditional *mean* of returns is near zero: far-ahead point forecasts
+shrink toward ~0 and add little; the forecastable, actionable objects are
+**volatility, tail risk, and
+direction**. That is why a VIX-led covariate panel can help — and why the
+method/covariate edge is largest at `h=1` and compresses as the horizon grows.
+The notebook's opening note develops this.
+
+---
+
+## Methods compared
+
+| Family | Predictors | Covariates? |
+|--------|-----------|-------------|
+| Naive floor | `LastValuePredictor` | — |
+| Classical | `DartsExponentialSmoothingPredictor` (ETS), `DartsKalmanForecasterPredictor`, `DartsAutoARIMAPredictor` | — (univariate) |
+| ML regression | `DartsLinearRegressionPredictor`, `DartsLightGBMPredictor` | ✅ optional past covariates |
+| LLM-Process | `SampledTrajectoryLLMPredictor` | ✅ optional covariate prompt blocks |
+
+The **LLMP (target)** vs **LLMP + cov** rows are the centerpiece: the covariate
+variant serializes labeled covariate-history blocks into the prompt (the
+`covariate_series_ids=` passed to `build_sp500_llmp_sampled_trajectory`), so their
+CRPS gap measures whether an LLM can use the same exogenous observations the ML
+methods do. Both are built in the notebook's predictors cell.
+
+---
+
+## Canonical covariates (when enabled)
+
+| Series ID (registered) | Economic meaning |
+|------------------------|------------------|
+| `vix_level_l1b` | VIX level, lagged 1 business day |
+| `vix_log_ret_1b_l1b` | VIX log return, lagged |
+| `ust10y_level_l1b` | 10Y Treasury yield |
+| `ust2y10y_spread_l1b` | 2Y–10Y spread |
+| `fed_funds_level_l1b` | Fed funds effective rate |
+| `cpi_mom_logdiff_l1b` | CPI MoM log-diff |
+| `unemployment_rate_l1b` | Unemployment rate |
+| `oil_log_ret_1b_l1b` | Oil futures log return |
+| `gold_log_ret_1b_l1b` | Gold log return (skipped if FRED series unavailable) |
+| `dollar_index_log_ret_1b_l1b` | Broad dollar index log return |
+| `nasdaq_log_ret_1b_l1b` | NASDAQ composite log return |
+
+Exact adapters and transforms live in `data.py` (`DEFAULT_COVARIATE_SERIES_IDS`).
+Yahoo covariates use `YFinanceDailyAdapter` (parquet under `data/yfinance/` at the
+repo root); FRED series use `FREDAdapter` (`data/fred/`). Warm both caches to the
+present before running the 2025/2026 windows (see Prerequisites).
+
+---
+
+## Cutoff-aware evaluation (read this)
+
+This is the methodological heart of the comparison, and easy to get wrong.
+
+- **Numerical methods are cutoff-safe by construction.** Naive, ETS, Kalman,
+  AutoARIMA, LinReg and LightGBM only ever see the series up to the forecast
+  origin (`ForecastContext` enforces it), so they can be backtested on *any*
+  historical window.
+- **An LLM is not.** Gemini's training cutoff is ~**January 2025**, so it has
+  effectively memorised pre-2025 outcomes. Scoring an LLM-Process on a pre-cutoff
+  origin measures recall, not forecasting, and silently flatters it in the
+  head-to-head.
+
+So the LLM-inclusive comparison lives **after the cutoff** — a **2025 backtest**
+for iteration and a **protected 2026 eval** as the honest scoreboard (mirroring
+the energy reference and `getting_started`'s `backtest()` → `evaluate()` split).
+The 2020 COVID window is kept as a **numerical-only** stress test.
+
+---
+
+## No-leakage design
+
+- Every covariate is shifted by **one business day** before registration.
+- Macro series use **conservative release proxies** before daily expansion;
+  rows carry `released_at` suitable for `ForecastContext` cutoffs.
+- Backtests enforce **information available at `as_of`**.
+
+Missing optional feeds are **skipped with warnings** by default
+(`strict_covariates=False`). Set `strict_covariates=True` to fail fast.
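+
+A minimal sketch of the one-business-day lag, assuming a plain pandas series on a business-day index. The helpers `lag_one_business_day` and `visible_at` and the VIX values are hypothetical, not the `data.py` or `ForecastContext` implementation; they only show why a lagged covariate cannot expose same-day information at the origin.
+
+```python
+import pandas as pd
+
+
+def lag_one_business_day(covariate: pd.Series) -> pd.Series:
+    """Shift values forward one row so row t carries the observation from t-1."""
+    return covariate.shift(1).dropna()
+
+
+def visible_at(series: pd.Series, as_of: pd.Timestamp) -> pd.Series:
+    """Rows a forecaster may read at origin as_of (timestamps <= as_of)."""
+    return series.loc[:as_of]
+
+
+idx = pd.bdate_range("2025-03-03", periods=5)
+vix = pd.Series([19.5, 21.0, 23.4, 22.1, 20.8], index=idx, name="vix_level")  # made-up values
+
+vix_l1b = lag_one_business_day(vix).rename("vix_level_l1b")
+as_of = idx[2]
+latest = visible_at(vix_l1b, as_of).iloc[-1]
+
+# At the origin, the newest usable covariate value is the previous session's observation.
+assert latest == vix.loc[idx[1]]
+```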
+
+---
+
+## Specs — windows and tasks (experiment design only)
+
+Four co-located YAML configs. Each spec carries **only the experiment design** —
+the window (`start`/`end`/`stride`/`warmup`) and one single-horizon task per
+`sp500_logret_{N}b` target (`horizons: [N]`, `frequency: B`). The smoke, 2025
+backtest, and 2020 stress specs are `MultiTargetBacktestSpec`; the eval spec is a
+`MultiTargetEvalSpec` that adds
+`max_runs`. **Which predictors run, and all their hyperparameters (including the
+covariate panel), live in the notebook — not the spec.**
+
+```text
+specs/
+├── sp500_smoke.yaml         # fast laptop run — short late-2025 window (post-cutoff)
+├── sp500_backtest_2025.yaml # main comparison: weekly origins across 2025
+├── sp500_eval_2026.yaml     # protected held-out 2026 eval (MultiTargetEvalSpec, max_runs)
+└── sp500_stress_2020.yaml   # COVID-crash stress, numerical only (notebook drops LLMP — pre-cutoff is leaked)
+```
+
+The notebook runs the 2025 backtest (Section 5) and the protected 2026 eval
+(Section 7); set `EXPERIMENT_CONFIG = "stress_2020"` to study the volatile regime
+with the cutoff-safe methods (the predictors cell drops the LLMP rows
+automatically). Copy a spec and edit the window/tasks to pose a new study.
+
+---
+
+## Module layout
+
+```text
+implementations/sp500_forecasting/
+├── data.py                    # build_sp500_multivariate_service(); cumulative-return targets; covariate ids
+├── predictors/                # build_sp500_llmp_sampled_trajectory() — the S&P 500 LLMP recipe
+├── leaderboard.py             # build_leaderboard(): cached results → RESULTS_DF; forecast-vs-actual frame
+├── analysis.py                # style_results_dataframe(); direction metrics
+├── plots.py                   # target history; per-horizon CRPS; forecast vs realised return
+├── starter_agent/             # fresh, hackable agent template (toggleable search/code-exec + skills)
+├── specs/                     # sp500_smoke / sp500_backtest_2025 / sp500_eval_2026 / sp500_stress_2020
+├── 01_sp500_multivariate_backtest.ipynb
+├── 99_starter_agent.ipynb     # ← start here to build your own agent
+└── README.md
+```
+
+Unit tests for data helpers live under
+`implementations/tests/sp500_forecasting/test_data.py`.
+
+---
+
+## Adding a method
+
+The roster is meant to grow, and it's all just code now — no registry or dispatch
+to edit. In the notebook's predictors cell:
+
+1. Instantiate any `Predictor` and append it to `all_predictors`. For a new Darts
+   model, mirror `aieng-forecasting/aieng/forecasting/methods/numerical/darts_classical.py`
+   (univariate, probabilistic via `num_samples`, per-horizon quantiles) and export
+   it from `methods/numerical/__init__.py` and `methods/__init__.py` first.
+2. Add a `PREDICTOR_LABELS` entry (the leaderboard "model" column). If it reads the
+   covariate panel, also add a `PREDICTOR_COVARIATES` entry so the leaderboard's
+   covariate columns are correct.
+
+For a tuned LLM-Process variant, add a builder to `predictors/` (mirror
+`predictors/llmp_sampled_trajectory.py`) so the prompt framing is reusable.
+
+Keep numerical models **fast** (sub-second per origin) and **probabilistic** (CRPS
+needs a distribution — deterministic models like Theta need a conformal/residual
+wrapper first).
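+
+One simple way to satisfy the probabilistic requirement is sketched below: add empirical quantiles of past forecast errors to the point forecast. This is the plain residual variant, not a full conformal procedure with coverage guarantees, and the function name, quantile levels, and synthetic errors are all assumptions for illustration.
+
+```python
+import numpy as np
+
+
+def residual_quantile_forecast(point_forecast, past_errors, levels=(0.1, 0.5, 0.9)):
+    """Add empirical quantiles of past errors (actual - forecast) to a point forecast."""
+    errors = np.asarray(past_errors, dtype=float)
+    return {q: float(point_forecast + np.quantile(errors, q)) for q in levels}
+
+
+rng = np.random.default_rng(0)
+past_errors = rng.normal(loc=0.0, scale=0.01, size=250)  # synthetic errors, not real data
+
+quantiles = residual_quantile_forecast(0.002, past_errors)
+lo, mid, hi = (quantiles[q] for q in (0.1, 0.5, 0.9))
+
+# Quantiles must be ordered for the result to describe a valid distribution.
+assert lo < mid < hi
+```
+
+The resulting quantile map has the same shape as the `quantiles` mapping the other predictors expose, so a wrapped deterministic model could in principle be scored alongside them.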
+
+---
+
+## Prerequisites
+
+From the **repository root**, run `uv sync` once so `sp500_forecasting` is on the
+interpreter path (same pattern as `food_price_forecasting` / `energy_oil_forecasting`).
+Use the project `.venv` as the Jupyter kernel — imports are `from sp500_forecasting import ...`.
+
+Warm caches at the repo root (gitignored) to the **present** — the 2025/2026
+windows need coverage through today:
+
+```bash
+uv run python scripts/fetch_sp500_market.py --refresh   # ^GSPC / ^VIX / ^IXIC (Yahoo)
+uv run python scripts/fetch_fred.py                     # macro covariates (FRED)
+```
+
+`fetch_fred.py` requires a **FRED API key** in your repo-root `.env` (`FRED_API_KEY=...`).
+FRED keys are free but must be requested individually — **we cannot provide one for you**.
+Request yours at <https://fred.stlouisfed.org/docs/api/api_key.html> (approval is usually
+quick, but allow some time). A description like "Requesting an API key to explore the
+effectiveness of various forecasting techniques on economic data." works well.
+
+The `llmp_*` rows call the Vector proxy, so a populated repo-root `.env` (with
+`OPENAI_BASE_URL` / `OPENAI_API_KEY`) is required when those rows are enabled.
+
+**How to run:** open `01_sp500_multivariate_backtest.ipynb` and **Run All**. The
+`EXPERIMENT_CONFIG` cell selects the 2025 comparison spec (`"smoke"` by default;
+`"backtest_2025"` for the full run, `"stress_2020"` for the numerical-only COVID
+study); the protected 2026 eval (`sp500_eval_2026.yaml`) runs in Section 7. The
+predictor roster is configured in the predictors cell (Section 4).
+
+The default smoke run keeps the LLM-Process rows on in the 2025 backtest (the
+headline comparison); the 2026 eval's `eval_finalists` list defaults to the
+cutoff-safe baselines plus `LightGBM + cov`, so a first Run All isn't a
+long/expensive surprise. Add `llmp` / `llmp_cov` to `eval_finalists` when you're
+ready to spend the proxy tokens on the protected scoreboard.
+
+---
+
+## Build your own — `99_starter_agent.ipynb`
+
+Not sure what to do next? [`99_starter_agent.ipynb`](99_starter_agent.ipynb) is
+this use case's first **agent** and a fresh, hackable starting point — *not*
+part of the backtest above. It wires our common building blocks behind simple
+toggles (cutoff-aware news search, an E2B code sandbox) plus two lightweight
+tool-usage skills, and walks through talking to the agent (Track 2), scoring one
+real return forecast (Track 1), and a "make it yours" guide. Live cells are
+gated by `RUN_AGENT` (default `False`), so a first Run All is safe.
+
+---
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting____init__.py.md
new file mode 100644
index 0000000..f455497
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting____init__.py.md
@@ -0,0 +1,64 @@
+# Source: implementations/sp500_forecasting/__init__.py
+
+kind: python
+
+```python
+"""S&P 500 multivariate log-return experiment — leak-safe covariates.
+
+The demo notebooks are narrative shells over the modules in this directory:
+
+- :mod:`data` — ``build_sp500_multivariate_service()`` and canonical covariate ids.
+- :mod:`predictors` — ``build_sp500_llmp_sampled_trajectory()`` recipe (prompt framing + sampling budget).
+- :mod:`leaderboard` — ``build_leaderboard()`` turns cached results into ``RESULTS_DF``.
+- :mod:`analysis` — styled leaderboards and direction metrics.
+- :mod:`plots` — matplotlib figures (target history, per-horizon CRPS, forecast vs realised return).
+- YAML specs — experiment design only (window + one single-horizon task per
+  ``sp500_logret_{N}b`` target): ``sp500_smoke`` / ``sp500_backtest_2025`` /
+  ``sp500_eval_2026`` / ``sp500_stress_2020``. The predictor roster lives in the notebook.
+
+See ``README.md`` for the full experiment description.
+"""
+
+from .data import (
+    DEFAULT_COVARIATE_SERIES_IDS,
+    FRED_PREFETCH_REGISTRY,
+    FRED_SERIES_IDS_FOR_PREFETCH,
+    SERIES_ID_2Y10Y_SPREAD,
+    SERIES_ID_10Y_YIELD,
+    SERIES_ID_CPI_INFLATION_CHANGE,
+    SERIES_ID_DOLLAR_INDEX_RETURN,
+    SERIES_ID_FED_FUNDS,
+    SERIES_ID_GOLD_RETURN,
+    SERIES_ID_NASDAQ_RETURN,
+    SERIES_ID_OIL_RETURN,
+    SERIES_ID_UNEMPLOYMENT,
+    SERIES_ID_VIX_CHANGE,
+    SERIES_ID_VIX_LEVEL,
+    SP500_LOG_RETURN_SERIES_ID,
+    SP500_SERIES_ID,
+    SP500_TICKER,
+    build_sp500_multivariate_service,
+)
+
+
+__all__ = [
+    "DEFAULT_COVARIATE_SERIES_IDS",
+    "FRED_PREFETCH_REGISTRY",
+    "FRED_SERIES_IDS_FOR_PREFETCH",
+    "SERIES_ID_2Y10Y_SPREAD",
+    "SERIES_ID_10Y_YIELD",
+    "SERIES_ID_CPI_INFLATION_CHANGE",
+    "SERIES_ID_DOLLAR_INDEX_RETURN",
+    "SERIES_ID_FED_FUNDS",
+    "SERIES_ID_GOLD_RETURN",
+    "SERIES_ID_NASDAQ_RETURN",
+    "SERIES_ID_OIL_RETURN",
+    "SERIES_ID_UNEMPLOYMENT",
+    "SERIES_ID_VIX_CHANGE",
+    "SERIES_ID_VIX_LEVEL",
+    "SP500_LOG_RETURN_SERIES_ID",
+    "SP500_SERIES_ID",
+    "SP500_TICKER",
+    "build_sp500_multivariate_service",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__analysis.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__analysis.py.md
new file mode 100644
index 0000000..badae65
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__analysis.py.md
@@ -0,0 +1,167 @@
+# Source: implementations/sp500_forecasting/analysis.py
+
+kind: python
+
+```python
+"""Notebook-oriented formatting and direction metrics for the S&P 500 demo."""
+
+from __future__ import annotations
+
+from datetime import datetime, timezone
+from typing import TYPE_CHECKING
+
+import numpy as np
+import pandas as pd
+from aieng.forecasting.evaluation.prediction import ContinuousForecast
+from pandas.io.formats.style import Styler
+
+
+if TYPE_CHECKING:
+    from aieng.forecasting.data.service import DataService
+    from aieng.forecasting.evaluation.prediction import Prediction
+
+
+def style_results_dataframe(df: pd.DataFrame) -> Styler:
+    """Return a :class:`~pandas.io.formats.style.Styler` tuned for ``RESULTS_DF``.
+
+    Intended for ``IPython.display.display`` in Jupyter — readable numeric
+    precision without manual rounding in every cell of the notebook.
+    """
+    fmt: dict[str, str] = {
+        "mean_crps": "{:.5f}",
+        "dir_precision_up": "{:.3f}",
+        "dir_recall_up": "{:.3f}",
+        "dir_f1_up": "{:.3f}",
+        "dir_accuracy": "{:.3f}",
+        "dir_roc_auc_prob_up": "{:.3f}",
+    }
+    fmt = {k: v for k, v in fmt.items() if k in df.columns}
+    return df.style.format(fmt, na_rep="—")
+
+
+def prob_return_above_threshold_from_quantiles(quantiles: dict[float, float], threshold: float = 0.0) -> float:
+    """Approximate ``P(X > threshold)`` from a piecewise-linear CDF through quantile pairs."""
+    pairs = sorted(((float(v), float(q)) for q, v in quantiles.items()), key=lambda x: x[0])
+    if not pairs:
+        return float("nan")
+    vs = np.array([p[0] for p in pairs], dtype=float)
+    qs = np.array([p[1] for p in pairs], dtype=float)
+    f_at = float(np.interp(threshold, vs, qs, left=0.0, right=1.0))
+    return float(np.clip(1.0 - f_at, 0.0, 1.0))
+
+
+def build_direction_eval_frame(
+    predictions: list[Prediction],
+    *,
+    target_series_id: str,
+    data_service: DataService,
+) -> pd.DataFrame:
+    """Align each scored prediction with the realized log return at ``forecast_date``."""
+    as_of_now = datetime.now(tz=timezone.utc).replace(tzinfo=None)
+    full_series = data_service.get_series(target_series_id, as_of=as_of_now)
+    full = full_series.copy()
+    full["timestamp"] = pd.to_datetime(full["timestamp"])
+    lookup = full.set_index("timestamp")["value"]
+
+    rows: list[dict[str, object]] = []
+    for p in predictions:
+        if not isinstance(p.payload, ContinuousForecast):
+            continue
+        ts = pd.Timestamp(p.forecast_date)
+        if ts not in lookup.index:
+            continue
+        actual = float(lookup.loc[ts])
+        qmap = p.payload.quantiles
+        prob_up = prob_return_above_threshold_from_quantiles(qmap, threshold=0.0)
+        rows.append(
+            {
+                "as_of": p.as_of,
+                "forecast_date": p.forecast_date,
+                "actual": actual,
+                "point_forecast": p.payload.point_forecast,
+                "prob_up": prob_up,
+                "actual_up": int(actual > 0.0),
+                "pred_up_point": int(p.payload.point_forecast > 0.0),
+            }
+        )
+    return pd.DataFrame(rows)
+
+
+def direction_classification_metrics(
+    df: pd.DataFrame,
+    *,
+    y_pred_col: str = "pred_up_point",
+    y_score_col: str = "prob_up",
+) -> pd.Series:
+    """Binary metrics for predicting a positive next-session log return."""
+    from sklearn.metrics import (  # noqa: PLC0415
+        accuracy_score,
+        balanced_accuracy_score,
+        cohen_kappa_score,
+        confusion_matrix,
+        matthews_corrcoef,
+        precision_recall_fscore_support,
+        roc_auc_score,
+    )
+
+    if df.empty:
+        return pd.Series(dtype=float)
+
+    y_true = df["actual_up"].to_numpy(dtype=int)
+    y_pred = df[y_pred_col].to_numpy(dtype=int)
+    n = int(len(y_true))
+    pos_rate = float(y_true.mean()) if n else float("nan")
+
+    acc = float(accuracy_score(y_true, y_pred))
+    bal_acc = float(balanced_accuracy_score(y_true, y_pred))
+    prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary", pos_label=1, zero_division=0)
+    prec_f, rec_f, f1_f, _ = precision_recall_fscore_support(
+        y_true, y_pred, average="binary", pos_label=0, zero_division=0
+    )
+    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
+    mcc = float(matthews_corrcoef(y_true, y_pred))
+    kappa = float(cohen_kappa_score(y_true, y_pred))
+
+    baseline_acc = max(pos_rate, 1.0 - pos_rate)
+    # Accuracy of a constant "always up" classifier, i.e. the share of up sessions.
+    baseline_always_up_acc = float((y_true == 1).mean())
+
+    roc = float("nan")
+    if y_score_col in df.columns and np.unique(y_true).size == 2:
+        try:
+            roc = float(roc_auc_score(y_true, df[y_score_col].to_numpy(dtype=float)))
+        except ValueError:
+            roc = float("nan")
+
+    return pd.Series(
+        {
+            "n": n,
+            "prevalence_up": pos_rate,
+            "accuracy": acc,
+            "balanced_accuracy": bal_acc,
+            "precision_up": float(prec),
+            "recall_up": float(rec),
+            "f1_up": float(f1),
+            "precision_down": float(prec_f),
+            "recall_down": float(rec_f),
+            "f1_down": float(f1_f),
+            "matthews_corrcoef": mcc,
+            "cohen_kappa": kappa,
+            "confusion_tn": int(tn),
+            "confusion_fp": int(fp),
+            "confusion_fn": int(fn),
+            "confusion_tp": int(tp),
+            "baseline_accuracy_maj_class": baseline_acc,
+            "baseline_always_predict_up": baseline_always_up_acc,
+            "roc_auc_prob_up": roc,
+        }
+    )
+
+
+__all__ = [
+    "build_direction_eval_frame",
+    "direction_classification_metrics",
+    "prob_return_above_threshold_from_quantiles",
+    "style_results_dataframe",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__data.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__data.py.md
new file mode 100644
index 0000000..057c513
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__data.py.md
@@ -0,0 +1,896 @@
+# Source: implementations/sp500_forecasting/data.py
+
+kind: python
+
+```python
+"""Leak-safe data-service setup for multivariate S&P 500 log-return forecasting.
+
+Targets: **close-to-close cumulative log returns** of ``^GSPC`` over a few
+horizons, registered as one series per window ``N``::
+
+    r^(N)_t = log(adj_close[t] / adj_close[t - N])
+
+Forecasting ``r^(N)`` ``N`` business days ahead resolves to the *forward*
+cumulative return over the next ``N`` sessions — a clean single-marginal
+forecast at each horizon (no joint-path aggregation):
+
+- ``sp500_logret_1b``  (forecast 1 step ahead)  → next-session return.
+- ``sp500_logret_5b``  (forecast 5 steps ahead) → forward 1-week return.
+- ``sp500_logret_21b`` (forecast 21 steps ahead)→ forward 1-month return.
+
+Using returns (rather than the index level) keeps the target stationary, which
+is the appropriate setup for a conventional-methods comparison.
+
+Covariates supported (daily business-day frame):
+- VIX level / VIX change
+- 10Y Treasury yield
+- 2Y-10Y yield spread
+- Fed funds rate
+- CPI inflation change (MoM log-diff)
+- Unemployment rate
+- Oil returns
+- Gold returns
+- Dollar index returns
+- NASDAQ returns
+
+Anti-leakage policy:
+- Every covariate is transformed and then lagged by one business day.
+- ``released_at`` is set conservatively for macro series before daily expansion.
+- The DataService cutoff then guarantees context views never include unavailable rows.
+
+Macro series use :class:`~aieng.forecasting.data.adapters.fred.FREDAdapter`, which
+writes ``data/fred/{FRED_SERIES_ID}.parquet`` (see adapter docstring). Run
+``uv run python scripts/fetch_fred.py`` to warm the same files the covariate
+builders read. Yahoo covariates use :class:`~aieng.forecasting.data.adapters.yfinance.YFinanceDailyAdapter`
+under ``data/yfinance/`` (default adapter layout). :func:`build_sp500_multivariate_service`
+loads ``FRED_API_KEY`` from the repo-root ``.env`` via ``python-dotenv``, identical
+to ``fetch_fred.py``. Raw series ids and prefetch metadata live in :data:`FRED_PREFETCH_REGISTRY`.
+"""
+
+from __future__ import annotations
+
+import warnings
+from collections.abc import Callable
+from pathlib import Path
+from typing import Any
+
+import numpy as np
+import pandas as pd
+from aieng.forecasting.data import DataService, SeriesMetadata
+from aieng.forecasting.data.adapters import FREDAdapter, YFinanceDailyAdapter
+from aieng.forecasting.data.adapters.base import BaseAdapter
+
+
+_load_dotenv: Callable[..., Any] | None
+try:
+    from dotenv import load_dotenv as _load_dotenv
+except ImportError:
+    _load_dotenv = None
+
+
+def _repo_root() -> Path | None:
+    here = Path(__file__).resolve()
+    for p in (here, *here.parents):
+        if (p / "aieng-forecasting").is_dir():
+            return p
+    return None
+
+
+def _load_fred_dotenv() -> None:
+    """Populate os.environ from repo-root ``.env`` (same pattern as ``scripts/fetch_fred.py``)."""
+    if _load_dotenv is None:
+        return
+    root = _repo_root()
+    if root is None:
+        return
+    _load_dotenv(root / ".env")
+
+
+def _as_absolute_cache(path: Path | None) -> Path | None:
+    if path is None or path.is_absolute():
+        return path
+    root = _repo_root()
+    if root is not None:
+        return (root / path).resolve()
+    return path
+
+
+def _yahoo_cache_file_default() -> Path:
+    root = _repo_root()
+    if root is not None:
+        return root / "data/yahoo/sp500_gspc.parquet"
+    return Path("data/yahoo/sp500_gspc.parquet")
+
+
+SP500_TICKER = "^GSPC"
+SP500_SERIES_ID = "sp500_close_adj_usd"
+DEFAULT_CACHE_FILE = _yahoo_cache_file_default()
+
+#: Cumulative-return horizons (in business days) registered as targets.  Each
+#: window ``N`` becomes a ``sp500_logret_{N}b`` target forecast ``N`` steps ahead.
+SP500_RETURN_WINDOWS: tuple[int, ...] = (1, 5, 21)
+
+#: Human-readable framing per horizon, surfaced in metadata and the notebook.
+SP500_WINDOW_LABELS: dict[int, str] = {
+    1: "next-session",
+    5: "forward 1-week (5 business days)",
+    21: "forward 1-month (21 business days)",
+}
+
+
+def sp500_logret_series_id(window: int) -> str:
+    """Return the canonical target series id for an ``N``-business-day return."""
+    return f"sp500_logret_{window}b"
+
+
+#: Mapping from horizon (business days) to target series id.
+SP500_RETURN_TARGETS: dict[int, str] = {w: sp500_logret_series_id(w) for w in SP500_RETURN_WINDOWS}
+
+#: The next-session (1-business-day) return — the canonical daily target used by
+#: default in the recent-history plot and direction baselines.
+SP500_LOG_RETURN_SERIES_ID = sp500_logret_series_id(1)
+
+
+class YahooFinanceDailyAdapter(BaseAdapter):
+    """Fetch daily adjusted close and same-day open for a ticker (``^GSPC`` here) from Yahoo Finance."""
+
+    def __init__(
+        self,
+        ticker: str,
+        *,
+        start: str = "1990-01-01",
+        end: str | None = None,
+        cache_path: Path | None = DEFAULT_CACHE_FILE,
+        refresh: bool = False,
+    ) -> None:
+        self._ticker = ticker
+        self._start = start
+        self._end = end
+        self._cache_path = _as_absolute_cache(cache_path)
+        self._refresh = refresh
+
+    def fetch(self) -> pd.DataFrame:
+        """Return cached rows when usable, otherwise fetch from Yahoo and rewrite the cache."""
+        df: pd.DataFrame | None = None
+        if self._cache_path is not None and self._cache_path.exists() and not self._refresh:
+            df = self._read_cache(self._cache_path)
+            if "open" not in df.columns:
+                # Cache lacks the ``open`` column: refetch and rewrite below.
+                df = None
+        if df is None:
+            df = self._fetch_from_yahoo()
+            if self._cache_path is not None:
+                self._cache_path.parent.mkdir(parents=True, exist_ok=True)
+                df.to_parquet(self._cache_path, index=False)
+        return self._apply_date_range(df)
+
+    def _apply_date_range(self, df: pd.DataFrame) -> pd.DataFrame:
+        out = df
+        if self._start:
+            lo = pd.Timestamp(self._start)
+            out = out[out["timestamp"] >= lo]
+        if self._end is not None:
+            hi = pd.Timestamp(self._end)
+            out = out[out["timestamp"] < hi]
+        if out.empty:
+            raise RuntimeError(
+                f"No rows left after applying date range start={self._start!r} end={self._end!r} "
+                f"for ticker {self._ticker!r}."
+            )
+        return out.reset_index(drop=True)
+
+    def _fetch_from_yahoo(self) -> pd.DataFrame:
+        try:
+            import yfinance as yf  # noqa: PLC0415
+        except ImportError as exc:
+            raise RuntimeError("yfinance is not installed. Add it to dependencies (e.g. `uv add yfinance`).") from exc
+
+        ticker = yf.Ticker(self._ticker)
+        raw = ticker.history(start=self._start, end=self._end, auto_adjust=False)
+        if raw.empty:
+            raise RuntimeError(
+                f"Yahoo Finance returned no rows for ticker {self._ticker!r} between {self._start!r} and {self._end!r}."
+            )
+        if "Adj Close" not in raw.columns or "Open" not in raw.columns:
+            raise RuntimeError(f"Yahoo Finance response for {self._ticker!r} missing required columns.")
+
+        df = raw.reset_index()
+        timestamp_col = "Date" if "Date" in df.columns else df.columns[0]
+        df = df.rename(columns={timestamp_col: "timestamp", "Adj Close": "value", "Open": "open"})
+        df["timestamp"] = pd.to_datetime(df["timestamp"]).dt.tz_localize(None)
+        df["value"] = pd.to_numeric(df["value"], errors="coerce")
+        df["open"] = pd.to_numeric(df["open"], errors="coerce")
+        df = df.dropna(subset=["value", "open"]).sort_values("timestamp").reset_index(drop=True)
+        df["released_at"] = df["timestamp"] + pd.offsets.BDay(1)
+        return df[["timestamp", "value", "released_at", "open"]]
+
+    @staticmethod
+    def _read_cache(cache_path: Path) -> pd.DataFrame:
+        df = pd.read_parquet(cache_path)
+        df["timestamp"] = pd.to_datetime(df["timestamp"])
+        df["value"] = pd.to_numeric(df["value"], errors="coerce")
+        df["released_at"] = pd.to_datetime(df["released_at"])
+        cols = ["timestamp", "value", "released_at"]
+        if "open" in df.columns:
+            df["open"] = pd.to_numeric(df["open"], errors="coerce")
+            cols.append("open")
+        out = df[cols].dropna(subset=["value"]).reset_index(drop=True)
+        if "open" in out.columns:
+            out = out.dropna(subset=["open"]).reset_index(drop=True)
+        return out
+
+
+class StaticFrameAdapter(BaseAdapter):
+    """Adapter that returns a precomputed canonical DataFrame."""
+
+    def __init__(self, frame: pd.DataFrame) -> None:
+        self._frame = frame.copy()
+
+    def fetch(self) -> pd.DataFrame:
+        return self._frame.copy()
+
+
+def _build_cumulative_log_return_frame(price_df: pd.DataFrame, window: int) -> pd.DataFrame:
+    """One row per session: value = log(adj_close[t] / adj_close[t-window]).
+
+    ``window=1`` is the ordinary daily close-to-close return; larger windows are
+    trailing cumulative returns.  ``released_at`` is the session timestamp (the
+    return is known at that session's close).
+    """
+    if window < 1:
+        raise ValueError(f"window must be >= 1, got {window}.")
+    if "value" not in price_df.columns:
+        raise RuntimeError("Price data must include adjusted close as 'value'.")
+    frame = price_df[["timestamp", "value"]].copy().sort_values("timestamp").reset_index(drop=True)
+    frame["value"] = pd.to_numeric(frame["value"], errors="coerce")
+    frame = frame[frame["value"] > 0].dropna(subset=["value"]).reset_index(drop=True)
+    frame["value"] = np.log(frame["value"] / frame["value"].shift(window))
+    frame = frame.dropna(subset=["value"]).reset_index(drop=True)
+    frame["released_at"] = pd.to_datetime(frame["timestamp"])
+    return frame[["timestamp", "value", "released_at"]]
+
+
+def build_sp500_log_return_service(
+    *,
+    windows: tuple[int, ...] = SP500_RETURN_WINDOWS,
+    refresh: bool = False,
+    start: str = "1990-01-01",
+    end: str | None = None,
+    cache_path: Path | None = DEFAULT_CACHE_FILE,
+) -> DataService:
+    """Register one close-to-close cumulative log-return target per window in ``windows``."""
+    price_adapter = YahooFinanceDailyAdapter(
+        SP500_TICKER,
+        start=start,
+        end=end,
+        cache_path=_as_absolute_cache(cache_path),
+        refresh=refresh,
+    )
+    price_df = price_adapter.fetch()
+
+    svc = DataService()
+    for window in windows:
+        series_id = sp500_logret_series_id(window)
+        label = SP500_WINDOW_LABELS.get(window, f"{window} business days")
+        svc.register(
+            series_id,
+            StaticFrameAdapter(_build_cumulative_log_return_frame(price_df, window)),
+            SeriesMetadata(
+                series_id=series_id,
+                description=(
+                    f"S&P 500 close-to-close cumulative log return over {window} business day(s) "
+                    f"({label}) (Yahoo Finance ^GSPC, derived)"
+                ),
+                source=f"Yahoo Finance ({SP500_TICKER}), derived",
+                units="log-return",
+                frequency="B",
+                table_id=f"yahoo:^GSPC:logret-{window}b",
+            ),
+        )
+    return svc
+
+
+def _default_cache_dir() -> Path:
+    root = _repo_root()
+    if root is not None:
+        return root / "data"
+    return Path("data")
+
+
+DEFAULT_CACHE_DIR = _default_cache_dir()
+# Matches :attr:`~aieng.forecasting.data.adapters.yfinance.YFinanceDailyAdapter.DEFAULT_CACHE_DIR`
+# resolved against repo ``data/`` (stem filenames like ``gspc_adj_close_1d.parquet`` per ticker).
+DEFAULT_YAHOO_CACHE_DIR = DEFAULT_CACHE_DIR / "yfinance"
+DEFAULT_FRED_CACHE_DIR = DEFAULT_CACHE_DIR / "fred"
+
+# Keys are FRED series ids; values are (description, units, pandas frequency hint)
+# for ``scripts/fetch_fred.py`` registration — keep in sync with _fred_frame call sites below.
+FRED_PREFETCH_REGISTRY: dict[str, tuple[str, str, str]] = {
+    "DGS10": ("10-Year Treasury Constant Maturity Rate", "Percent", "D"),
+    "DGS2": ("2-Year Treasury Constant Maturity Rate", "Percent", "D"),
+    "DFF": ("Effective Federal Funds Rate", "Percent", "D"),
+    "CPIAUCSL": (
+        "Consumer Price Index for All Urban Consumers: All Items in U.S. City Average",
+        "Index 1982-84=100",
+        "MS",
+    ),
+    "UNRATE": ("Unemployment Rate", "Percent", "MS"),
+    "DCOILWTICO": ("Crude Oil Prices: West Texas Intermediate (WTI)", "Dollars per Barrel", "D"),
+    "GOLDAMGBD228NLBM": (
+        "Gold Fixing Price 10:30 A.M. (London time) in London Bullion Market",
+        "U.S. Dollars per Troy Ounce",
+        "D",
+    ),
+    "GOLDPMGBD228NLBM": (
+        "Gold Fixing Price 3:00 P.M. (London time) in London Bullion Market",
+        "U.S. Dollars per Troy Ounce",
+        "D",
+    ),
+    "DTWEXBGS": ("Trade Weighted U.S. Dollar Index: Broad, Goods and Services", "Index Jan 2006=100", "D"),
+}
+
+FRED_SERIES_IDS_FOR_PREFETCH: tuple[str, ...] = tuple(FRED_PREFETCH_REGISTRY.keys())
+
+VIX_TICKER = "^VIX"
+NASDAQ_TICKER = "^IXIC"
+
+SERIES_ID_VIX_LEVEL = "vix_level_l1b"
+SERIES_ID_VIX_CHANGE = "vix_log_ret_1b_l1b"
+SERIES_ID_10Y_YIELD = "ust10y_level_l1b"
+SERIES_ID_2Y10Y_SPREAD = "ust2y10y_spread_l1b"
+SERIES_ID_FED_FUNDS = "fed_funds_level_l1b"
+SERIES_ID_CPI_INFLATION_CHANGE = "cpi_mom_logdiff_l1b"
+SERIES_ID_UNEMPLOYMENT = "unemployment_rate_l1b"
+SERIES_ID_OIL_RETURN = "oil_log_ret_1b_l1b"
+SERIES_ID_GOLD_RETURN = "gold_log_ret_1b_l1b"
+SERIES_ID_DOLLAR_INDEX_RETURN = "dollar_index_log_ret_1b_l1b"
+SERIES_ID_NASDAQ_RETURN = "nasdaq_log_ret_1b_l1b"
+
+
+DEFAULT_COVARIATE_SERIES_IDS: list[str] = [
+    SERIES_ID_VIX_LEVEL,
+    SERIES_ID_VIX_CHANGE,
+    SERIES_ID_10Y_YIELD,
+    SERIES_ID_2Y10Y_SPREAD,
+    SERIES_ID_FED_FUNDS,
+    SERIES_ID_CPI_INFLATION_CHANGE,
+    SERIES_ID_UNEMPLOYMENT,
+    SERIES_ID_OIL_RETURN,
+    SERIES_ID_GOLD_RETURN,
+    SERIES_ID_DOLLAR_INDEX_RETURN,
+    SERIES_ID_NASDAQ_RETURN,
+]
+
+
+def _canonical_three_col(df: pd.DataFrame) -> pd.DataFrame:
+    out = df.copy()
+    out["timestamp"] = pd.to_datetime(out["timestamp"]).dt.tz_localize(None)
+    out["released_at"] = pd.to_datetime(out["released_at"]).dt.tz_localize(None)
+    out["value"] = pd.to_numeric(out["value"], errors="coerce")
+    out = out.dropna(subset=["timestamp", "released_at", "value"]).sort_values("timestamp")
+    return out[["timestamp", "value", "released_at"]].reset_index(drop=True)
+
+
+def _drop_weekend_timestamp_rows(df: pd.DataFrame) -> pd.DataFrame:
+    r"""Remove rows whose ``timestamp`` is Saturday or Sunday.
+
+    Some FRED daily series (notably effective fed funds ``DFF``) include
+    weekend dates in early vintages. Forecast tasks and Darts regression models
+    use ``freq="B"`` (pandas Mon–Fri business days); ``TimeSeries.from_dataframe``
+    with ``fill_missing_dates=True`` then raises if any input stamp is not on
+    that grid.
+    """
+    if df.empty:
+        return df
+    x = df.copy()
+    ts = pd.to_datetime(x["timestamp"])
+    return x.loc[ts.dt.dayofweek < 5].reset_index(drop=True)
+
+
+def _load_yahoo_close_frame(
+    ticker: str,
+    *,
+    start: str,
+    end: str | None,
+    cache_dir: Path,
+    refresh: bool,
+) -> pd.DataFrame:
+    adapter = YFinanceDailyAdapter(
+        ticker,
+        field="Adj Close",
+        start=start,
+        end=end,
+        cache_dir=cache_dir,
+        refresh=refresh,
+    )
+    raw = adapter.fetch()
+    frame = raw[["timestamp", "value"]].copy().sort_values("timestamp").reset_index(drop=True)
+    frame["value"] = pd.to_numeric(frame["value"], errors="coerce")
+    return frame.dropna(subset=["value"]).reset_index(drop=True)
+
+
+def _to_log_return_feature(close_df: pd.DataFrame) -> pd.DataFrame:
+    out = close_df.copy()
+    out = out[out["value"] > 0].reset_index(drop=True)
+    out["value"] = np.log(out["value"] / out["value"].shift(1))
+    out = out.dropna(subset=["value"]).reset_index(drop=True)
+    # Daily closes are known after market close; the model sees them from the next business day.
+    out["released_at"] = pd.to_datetime(out["timestamp"]) + pd.offsets.BDay(1)
+    return _canonical_three_col(out[["timestamp", "value", "released_at"]])
+
+
+def _to_level_feature_from_daily(close_df: pd.DataFrame) -> pd.DataFrame:
+    out = close_df.copy()
+    out["released_at"] = pd.to_datetime(out["timestamp"]) + pd.offsets.BDay(1)
+    return _canonical_three_col(out[["timestamp", "value", "released_at"]])
+
+
+def _fred_frame(
+    fred_id: str,
+    *,
+    cache_dir: Path,
+    refresh: bool,
+) -> pd.DataFrame:
+    adapter = FREDAdapter(fred_id, cache_dir=cache_dir, refresh=refresh)
+    return _canonical_three_col(adapter.fetch())
+
+
+def _business_daily_expand_from_releases(
+    monthly_df: pd.DataFrame,
+    *,
+    start: str,
+    end: str | None,
+) -> pd.DataFrame:
+    x = monthly_df.copy().sort_values("released_at").reset_index(drop=True)
+    lo = pd.Timestamp(start)
+    hi = pd.Timestamp(end) if end is not None else x["released_at"].max() + pd.offsets.BDay(1)
+    if hi < lo:
+        return pd.DataFrame(columns=["timestamp", "value", "released_at"])
+    daily_idx = pd.bdate_range(lo, hi)
+    rel = x.set_index("released_at")["value"].reindex(daily_idx).ffill()
+    out = rel.reset_index()
+    out.columns = ["timestamp", "value"]
+    out = out.dropna(subset=["value"]).reset_index(drop=True)
+    out["released_at"] = out["timestamp"]
+    return _canonical_three_col(out)
+
+
+def _apply_one_business_day_feature_lag(df: pd.DataFrame) -> pd.DataFrame:
+    """Shift values so the feature at *t* only uses information through *t-1*."""
+    x = df.copy().sort_values("timestamp").reset_index(drop=True)
+    x["value"] = x["value"].shift(1)
+    x = x.dropna(subset=["value"]).reset_index(drop=True)
+    # After lagging, the shifted value is available at the row's timestamp.
+    x["released_at"] = x["timestamp"]
+    return _canonical_three_col(x)
+
+
+def _business_daily_ffill(df: pd.DataFrame) -> pd.DataFrame:
+    """Reindex a daily feature onto a complete business-day calendar, forward-filling.
+
+    FRED bond / commodity series follow a different holiday calendar than the
+    NYSE-traded target — e.g. Columbus Day and Veterans Day, when the bond market
+    is closed but equities trade. Without this, such a covariate ends a few days
+    short of a target origin and Darts raises ``past_covariates are not long
+    enough``, silently skipping those origins for the covariate-using models.
+
+    Forward-filling carries the last observed value across those gaps (and onto
+    every Mon–Fri business day), so the covariate is defined wherever the target
+    is. It is leak-safe: it only repeats already-known past information, and the
+    one-business-day feature lag is still applied afterwards.
+    """
+    if df.empty:
+        return df
+    x = df.copy().sort_values("timestamp").reset_index(drop=True)
+    idx = pd.bdate_range(x["timestamp"].min(), x["timestamp"].max())
+    filled = x.set_index("timestamp")["value"].reindex(idx).ffill()
+    out = filled.reset_index()
+    out.columns = ["timestamp", "value"]
+    out = out.dropna(subset=["value"]).reset_index(drop=True)
+    out["released_at"] = out["timestamp"]
+    return _canonical_three_col(out)
+
+
+def _build_monthly_cpi_mom_feature(
+    *,
+    cache_dir: Path,
+    refresh: bool,
+    start: str,
+    end: str | None,
+) -> pd.DataFrame:
+    cpi = _fred_frame("CPIAUCSL", cache_dir=cache_dir, refresh=refresh)
+    cpi["value"] = np.log(cpi["value"] / cpi["value"].shift(1))
+    cpi = cpi.dropna(subset=["value"]).reset_index(drop=True)
+    # Conservative publication proxy: 10 business days after month end (roughly mid next month).
+    cpi["released_at"] = pd.to_datetime(cpi["timestamp"]) + pd.offsets.MonthEnd(1) + pd.offsets.BDay(10)
+    daily = _business_daily_expand_from_releases(cpi, start=start, end=end)
+    return _apply_one_business_day_feature_lag(daily)
+
+
+def _build_monthly_unemployment_feature(
+    *,
+    cache_dir: Path,
+    refresh: bool,
+    start: str,
+    end: str | None,
+) -> pd.DataFrame:
+    unrate = _fred_frame("UNRATE", cache_dir=cache_dir, refresh=refresh)
+    # Conservative publication proxy: 10 business days after month end.
+    unrate["released_at"] = pd.to_datetime(unrate["timestamp"]) + pd.offsets.MonthEnd(1) + pd.offsets.BDay(10)
+    daily = _business_daily_expand_from_releases(unrate, start=start, end=end)
+    return _apply_one_business_day_feature_lag(daily)
+
+
+def _build_daily_fred_level_feature(
+    fred_id: str,
+    *,
+    cache_dir: Path,
+    refresh: bool,
+) -> pd.DataFrame:
+    x = _fred_frame(fred_id, cache_dir=cache_dir, refresh=refresh)
+    x = _drop_weekend_timestamp_rows(x)
+    # Forward-fill onto the full business-day calendar so bond-market holidays
+    # (when equities still trade) don't leave the covariate short of the origin.
+    x = _business_daily_ffill(x)
+    return _apply_one_business_day_feature_lag(x)
+
+
+def _build_daily_fred_return_feature(
+    fred_id: str,
+    *,
+    cache_dir: Path,
+    refresh: bool,
+) -> pd.DataFrame:
+    x = _fred_frame(fred_id, cache_dir=cache_dir, refresh=refresh)
+    x = _drop_weekend_timestamp_rows(x)
+    x = x[x["value"] > 0].reset_index(drop=True)
+    # Forward-fill the *level* onto the full business-day calendar first, so a
+    # bond-market holiday becomes a 0-return business day rather than a gap that
+    # ends the covariate before the target origin.
+    x = _business_daily_ffill(x)
+    x["value"] = np.log(x["value"] / x["value"].shift(1))
+    x = x.dropna(subset=["value"]).reset_index(drop=True)
+    return _apply_one_business_day_feature_lag(x)
+
+
+def _build_first_available_daily_fred_return_feature(
+    fred_ids: list[str],
+    *,
+    cache_dir: Path,
+    refresh: bool,
+) -> tuple[pd.DataFrame, str]:
+    """Try multiple FRED ids and return the first one that fetches successfully."""
+    last_error: Exception | None = None
+    for fred_id in fred_ids:
+        try:
+            frame = _build_daily_fred_return_feature(fred_id, cache_dir=cache_dir, refresh=refresh)
+            return frame, fred_id
+        except (RuntimeError, ValueError) as exc:
+            last_error = exc
+            continue
+    ids = ", ".join(fred_ids)
+    raise RuntimeError(
+        f"Could not fetch any of the configured FRED series ({ids}). "
+        "Check FRED availability/API key or override the covariate setup."
+    ) from last_error
+
+
+def build_sp500_multivariate_service(  # noqa: PLR0912, PLR0915
+    *,
+    windows: tuple[int, ...] = SP500_RETURN_WINDOWS,
+    include_covariates: bool = True,
+    covariate_series_ids: list[str] | None = None,
+    strict_covariates: bool = False,
+    refresh: bool = False,
+    start: str = "1990-01-01",
+    end: str | None = None,
+    sp500_cache_path: Path | None = None,
+    yahoo_cache_dir: Path | None = None,
+    fred_cache_dir: Path | None = None,
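+) -> DataService:
+    """Build a ``DataService`` with the S&P 500 return targets plus optional leak-safe covariates.
+
+    Parameters
+    ----------
+    include_covariates : bool
+        If ``False``, return the target-only service without building any covariates.
+    covariate_series_ids : list[str] | None
+        Covariates to register; ``None`` or an empty list selects
+        ``DEFAULT_COVARIATE_SERIES_IDS``.
+    strict_covariates : bool
+        If ``True``, any covariate fetch/build failure raises immediately.
+        If ``False`` (default), unavailable covariates are skipped with a warning.
+    """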
+    _load_fred_dotenv()
+    # Only forward ``cache_path`` when the caller supplies one. Passing ``None``
+    # would override the default of ``build_sp500_log_return_service`` (repo
+    # ``data/yahoo/sp500_gspc.parquet``), disable caching, and force every notebook run to hit Yahoo Finance live.
+    sp500_kwargs: dict[str, Any] = {"windows": windows, "refresh": refresh, "start": start, "end": end}
+    if sp500_cache_path is not None:
+        sp500_kwargs["cache_path"] = sp500_cache_path
+    svc = build_sp500_log_return_service(**sp500_kwargs)
+    if not include_covariates:
+        return svc
+
+    desired = covariate_series_ids or DEFAULT_COVARIATE_SERIES_IDS
+    desired_set = set(desired)
+
+    yahoo_dir = _as_absolute_cache(yahoo_cache_dir or DEFAULT_YAHOO_CACHE_DIR)
+    fred_dir = _as_absolute_cache(fred_cache_dir or DEFAULT_FRED_CACHE_DIR)
+    if yahoo_dir is None or fred_dir is None:
+        raise RuntimeError("Could not resolve yahoo/fred cache directories.")
+    yahoo_dir.mkdir(parents=True, exist_ok=True)
+    fred_dir.mkdir(parents=True, exist_ok=True)
+
+    def _handle_covariate_error(series_id: str, exc: Exception) -> None:
+        if strict_covariates:
+            raise RuntimeError(f"Failed to build required covariate {series_id!r}.") from exc
+        warnings.warn(
+            f"Skipping unavailable covariate {series_id!r}: {exc}",
+            stacklevel=2,
+        )
+
+    if SERIES_ID_VIX_LEVEL in desired_set or SERIES_ID_VIX_CHANGE in desired_set:
+        try:
+            vix_close = _load_yahoo_close_frame(
+                VIX_TICKER,
+                start=start,
+                end=end,
+                cache_dir=yahoo_dir,
+                refresh=refresh,
+            )
+            if SERIES_ID_VIX_LEVEL in desired_set:
+                vix_level = _apply_one_business_day_feature_lag(_to_level_feature_from_daily(vix_close))
+                svc.register(
+                    SERIES_ID_VIX_LEVEL,
+                    StaticFrameAdapter(vix_level),
+                    SeriesMetadata(
+                        series_id=SERIES_ID_VIX_LEVEL,
+                        description="CBOE VIX close level, lagged 1 business day",
+                        source=f"Yahoo Finance ({VIX_TICKER})",
+                        units="index-level",
+                        frequency="B",
+                        table_id="yahoo:^VIX:close-l1b",
+                    ),
+                )
+            if SERIES_ID_VIX_CHANGE in desired_set:
+                vix_change = _apply_one_business_day_feature_lag(_to_log_return_feature(vix_close))
+                svc.register(
+                    SERIES_ID_VIX_CHANGE,
+                    StaticFrameAdapter(vix_change),
+                    SeriesMetadata(
+                        series_id=SERIES_ID_VIX_CHANGE,
+                        description="CBOE VIX close-to-close log return, lagged 1 business day",
+                        source=f"Yahoo Finance ({VIX_TICKER}), derived",
+                        units="log-return",
+                        frequency="B",
+                        table_id="yahoo:^VIX:log-return-l1b",
+                    ),
+                )
+        except (RuntimeError, ValueError) as exc:
+            if SERIES_ID_VIX_LEVEL in desired_set:
+                _handle_covariate_error(SERIES_ID_VIX_LEVEL, exc)
+            if SERIES_ID_VIX_CHANGE in desired_set:
+                _handle_covariate_error(SERIES_ID_VIX_CHANGE, exc)
+
+    if SERIES_ID_NASDAQ_RETURN in desired_set:
+        try:
+            nasdaq_close = _load_yahoo_close_frame(
+                NASDAQ_TICKER,
+                start=start,
+                end=end,
+                cache_dir=yahoo_dir,
+                refresh=refresh,
+            )
+            nasdaq_ret = _apply_one_business_day_feature_lag(_to_log_return_feature(nasdaq_close))
+            svc.register(
+                SERIES_ID_NASDAQ_RETURN,
+                StaticFrameAdapter(nasdaq_ret),
+                SeriesMetadata(
+                    series_id=SERIES_ID_NASDAQ_RETURN,
+                    description="NASDAQ Composite close-to-close log return, lagged 1 business day",
+                    source=f"Yahoo Finance ({NASDAQ_TICKER}), derived",
+                    units="log-return",
+                    frequency="B",
+                    table_id="yahoo:^IXIC:log-return-l1b",
+                ),
+            )
+        except (RuntimeError, ValueError) as exc:
+            _handle_covariate_error(SERIES_ID_NASDAQ_RETURN, exc)
+
+    if SERIES_ID_10Y_YIELD in desired_set:
+        try:
+            dgs10 = _build_daily_fred_level_feature("DGS10", cache_dir=fred_dir, refresh=refresh)
+            svc.register(
+                SERIES_ID_10Y_YIELD,
+                StaticFrameAdapter(dgs10),
+                SeriesMetadata(
+                    series_id=SERIES_ID_10Y_YIELD,
+                    description="US 10-year Treasury yield level, lagged 1 business day",
+                    source="FRED (DGS10)",
+                    units="percent",
+                    frequency="B",
+                    table_id="fred:DGS10:l1b",
+                ),
+            )
+        except (RuntimeError, ValueError) as exc:
+            _handle_covariate_error(SERIES_ID_10Y_YIELD, exc)
+
+    if SERIES_ID_2Y10Y_SPREAD in desired_set:
+        try:
+            dgs10 = _fred_frame("DGS10", cache_dir=fred_dir, refresh=refresh)
+            dgs2 = _fred_frame("DGS2", cache_dir=fred_dir, refresh=refresh)
+            spread = pd.merge(
+                dgs10[["timestamp", "value"]],
+                dgs2[["timestamp", "value"]],
+                on="timestamp",
+                how="inner",
+                suffixes=("_10y", "_2y"),
+            )
+            spread["value"] = spread["value_10y"] - spread["value_2y"]
+            spread["released_at"] = pd.to_datetime(spread["timestamp"]) + pd.offsets.BDay(1)
+            # Forward-fill onto the full business-day calendar (bond-holiday safe).
+            spread = _business_daily_ffill(spread[["timestamp", "value", "released_at"]])
+            spread = _apply_one_business_day_feature_lag(spread)
+            svc.register(
+                SERIES_ID_2Y10Y_SPREAD,
+                StaticFrameAdapter(spread),
+                SeriesMetadata(
+                    series_id=SERIES_ID_2Y10Y_SPREAD,
+                    description="US 10Y minus 2Y Treasury spread, lagged 1 business day",
+                    source="FRED (DGS10, DGS2), derived",
+                    units="percent-points",
+                    frequency="B",
+                    table_id="fred:DGS10-DGS2:l1b",
+                ),
+            )
+        except (RuntimeError, ValueError) as exc:
+            _handle_covariate_error(SERIES_ID_2Y10Y_SPREAD, exc)
+
+    if SERIES_ID_FED_FUNDS in desired_set:
+        try:
+            fed = _build_daily_fred_level_feature("DFF", cache_dir=fred_dir, refresh=refresh)
+            svc.register(
+                SERIES_ID_FED_FUNDS,
+                StaticFrameAdapter(fed),
+                SeriesMetadata(
+                    series_id=SERIES_ID_FED_FUNDS,
+                    description="Effective federal funds rate, lagged 1 business day",
+                    source="FRED (DFF)",
+                    units="percent",
+                    frequency="B",
+                    table_id="fred:DFF:l1b",
+                ),
+            )
+        except (RuntimeError, ValueError) as exc:
+            _handle_covariate_error(SERIES_ID_FED_FUNDS, exc)
+
+    if SERIES_ID_CPI_INFLATION_CHANGE in desired_set:
+        try:
+            cpi = _build_monthly_cpi_mom_feature(
+                cache_dir=fred_dir,
+                refresh=refresh,
+                start=start,
+                end=end,
+            )
+            svc.register(
+                SERIES_ID_CPI_INFLATION_CHANGE,
+                StaticFrameAdapter(cpi),
+                SeriesMetadata(
+                    series_id=SERIES_ID_CPI_INFLATION_CHANGE,
+                    description="US CPI MoM log change, conservative release lag + 1B feature lag",
+                    source="FRED (CPIAUCSL), derived",
+                    units="log-change",
+                    frequency="B",
+                    table_id="fred:CPIAUCSL:mom-l1b",
+                ),
+            )
+        except (RuntimeError, ValueError) as exc:
+            _handle_covariate_error(SERIES_ID_CPI_INFLATION_CHANGE, exc)
+
+    if SERIES_ID_UNEMPLOYMENT in desired_set:
+        try:
+            unemp = _build_monthly_unemployment_feature(
+                cache_dir=fred_dir,
+                refresh=refresh,
+                start=start,
+                end=end,
+            )
+            svc.register(
+                SERIES_ID_UNEMPLOYMENT,
+                StaticFrameAdapter(unemp),
+                SeriesMetadata(
+                    series_id=SERIES_ID_UNEMPLOYMENT,
+                    description="US unemployment rate, conservative release lag + 1B feature lag",
+                    source="FRED (UNRATE)",
+                    units="percent",
+                    frequency="B",
+                    table_id="fred:UNRATE:l1b",
+                ),
+            )
+        except (RuntimeError, ValueError) as exc:
+            _handle_covariate_error(SERIES_ID_UNEMPLOYMENT, exc)
+
+    if SERIES_ID_OIL_RETURN in desired_set:
+        try:
+            oil = _build_daily_fred_return_feature("DCOILWTICO", cache_dir=fred_dir, refresh=refresh)
+            svc.register(
+                SERIES_ID_OIL_RETURN,
+                StaticFrameAdapter(oil),
+                SeriesMetadata(
+                    series_id=SERIES_ID_OIL_RETURN,
+                    description="WTI oil spot log return, lagged 1 business day",
+                    source="FRED (DCOILWTICO), derived",
+                    units="log-return",
+                    frequency="B",
+                    table_id="fred:DCOILWTICO:log-return-l1b",
+                ),
+            )
+        except (RuntimeError, ValueError) as exc:
+            _handle_covariate_error(SERIES_ID_OIL_RETURN, exc)
+
+    if SERIES_ID_GOLD_RETURN in desired_set:
+        try:
+            gold, gold_source_id = _build_first_available_daily_fred_return_feature(
+                ["GOLDAMGBD228NLBM", "GOLDPMGBD228NLBM"],
+                cache_dir=fred_dir,
+                refresh=refresh,
+            )
+            svc.register(
+                SERIES_ID_GOLD_RETURN,
+                StaticFrameAdapter(gold),
+                SeriesMetadata(
+                    series_id=SERIES_ID_GOLD_RETURN,
+                    description="Gold fix log return, lagged 1 business day",
+                    source=f"FRED ({gold_source_id}), derived",
+                    units="log-return",
+                    frequency="B",
+                    table_id=f"fred:{gold_source_id}:log-return-l1b",
+                ),
+            )
+        except (RuntimeError, ValueError) as exc:
+            _handle_covariate_error(SERIES_ID_GOLD_RETURN, exc)
+
+    if SERIES_ID_DOLLAR_INDEX_RETURN in desired_set:
+        try:
+            dxy = _build_daily_fred_return_feature("DTWEXBGS", cache_dir=fred_dir, refresh=refresh)
+            svc.register(
+                SERIES_ID_DOLLAR_INDEX_RETURN,
+                StaticFrameAdapter(dxy),
+                SeriesMetadata(
+                    series_id=SERIES_ID_DOLLAR_INDEX_RETURN,
+                    description="Trade-weighted dollar index log return, lagged 1 business day",
+                    source="FRED (DTWEXBGS), derived",
+                    units="log-return",
+                    frequency="B",
+                    table_id="fred:DTWEXBGS:log-return-l1b",
+                ),
+            )
+        except (RuntimeError, ValueError) as exc:
+            _handle_covariate_error(SERIES_ID_DOLLAR_INDEX_RETURN, exc)
+
+    return svc
+
+
+__all__ = [
+    "DEFAULT_COVARIATE_SERIES_IDS",
+    "FRED_PREFETCH_REGISTRY",
+    "FRED_SERIES_IDS_FOR_PREFETCH",
+    "SERIES_ID_10Y_YIELD",
+    "SERIES_ID_2Y10Y_SPREAD",
+    "SERIES_ID_CPI_INFLATION_CHANGE",
+    "SERIES_ID_DOLLAR_INDEX_RETURN",
+    "SERIES_ID_FED_FUNDS",
+    "SERIES_ID_GOLD_RETURN",
+    "SERIES_ID_NASDAQ_RETURN",
+    "SERIES_ID_OIL_RETURN",
+    "SERIES_ID_UNEMPLOYMENT",
+    "SERIES_ID_VIX_CHANGE",
+    "SERIES_ID_VIX_LEVEL",
+    "SP500_LOG_RETURN_SERIES_ID",
+    "SP500_RETURN_TARGETS",
+    "SP500_RETURN_WINDOWS",
+    "SP500_SERIES_ID",
+    "SP500_TICKER",
+    "SP500_WINDOW_LABELS",
+    "StaticFrameAdapter",
+    "build_sp500_log_return_service",
+    "build_sp500_multivariate_service",
+    "sp500_logret_series_id",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__leaderboard.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__leaderboard.py.md
new file mode 100644
index 0000000..9e49097
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__leaderboard.py.md
@@ -0,0 +1,211 @@
+# Source: implementations/sp500_forecasting/leaderboard.py
+
+kind: python
+
+```python
+"""Leaderboard rows for the multivariate S&P 500 experiment.
+
+The notebook runs each predictor with the shared
+:func:`~aieng.forecasting.evaluation.cached_multi_backtest` /
+:func:`~aieng.forecasting.evaluation.multi_evaluate` helpers, which return a
+``dict`` keyed by ``task_id`` (one task per horizon, targeting
+``sp500_logret_{N}b``).  :func:`build_leaderboard` turns the
+``{predictor_id: {task_id: result}}`` mapping those produce into the
+``RESULTS_DF`` frame consumed by
+:func:`~sp500_forecasting.plots.display_multivariate_backtest_leaderboard`
+(one row per predictor × horizon, with mean CRPS and next-direction metrics).
+
+Which predictors run, and all their hyperparameters, are configured in the
+notebook — there is no model registry or dispatch here.
+"""
+
+from __future__ import annotations
+
+from typing import TYPE_CHECKING
+
+import pandas as pd
+from aieng.forecasting.data.service import DataService
+from aieng.forecasting.evaluation import BacktestResult, EvalResult
+from sp500_forecasting.analysis import (
+    build_direction_eval_frame,
+    direction_classification_metrics,
+)
+
+
+if TYPE_CHECKING:
+    from aieng.forecasting.evaluation.prediction import Prediction
+
+
+_NAN_DIR: dict[str, float | int] = {
+    "dir_precision_up": float("nan"),
+    "dir_recall_up": float("nan"),
+    "dir_f1_up": float("nan"),
+    "dir_accuracy": float("nan"),
+    "dir_roc_auc_prob_up": float("nan"),
+    "dir_n_eval": 0,
+}
+
+
+def build_return_compare_frame(
+    predictions: list[Prediction],
+    data_service: DataService,
+    target_series_id: str,
+) -> pd.DataFrame:
+    """One row per scored prediction: realised return vs forecast median and 5–95% band.
+
+    Returns are kept on the target (log-return) scale; the notebook renders them
+    as percentages.  Predictions without a ``ContinuousForecast`` payload, or whose
+    ``forecast_date`` has no realised observation, are skipped.
+    """
+    from datetime import datetime, timezone  # noqa: PLC0415
+
+    from aieng.forecasting.evaluation.prediction import ContinuousForecast  # noqa: PLC0415
+
+    as_of_now = datetime.now(tz=timezone.utc).replace(tzinfo=None)
+    full = data_service.get_series(target_series_id, as_of=as_of_now).copy()
+    full["timestamp"] = pd.to_datetime(full["timestamp"])
+    lookup = full.set_index("timestamp")["value"]
+
+    rows: list[dict[str, object]] = []
+    for pred in predictions:
+        if not isinstance(pred.payload, ContinuousForecast):
+            continue
+        ts = pd.Timestamp(pred.forecast_date)
+        if ts not in lookup.index:
+            continue
+        qmap = pred.payload.quantiles
+        med = qmap.get(0.5, pred.payload.point_forecast)
+        rows.append(
+            {
+                "session": ts,
+                "actual_return": float(lookup.loc[ts]),
+                "forecast_return": float(med),
+                "forecast_return_p05": float(qmap.get(0.05, float("nan"))),
+                "forecast_return_p95": float(qmap.get(0.95, float("nan"))),
+            }
+        )
+    if not rows:
+        return pd.DataFrame()
+    return pd.DataFrame(rows).sort_values("session").reset_index(drop=True)
+
+
+def _direction_metrics_row(
+    *,
+    predictions: list[Prediction],
+    data_service: DataService,
+    target_series_id: str,
+) -> dict[str, float | int]:
+    eval_df = build_direction_eval_frame(
+        predictions,
+        target_series_id=target_series_id,
+        data_service=data_service,
+    )
+    if eval_df.empty:
+        return dict(_NAN_DIR)
+    m = direction_classification_metrics(eval_df)
+    return {
+        "dir_precision_up": float(m.get("precision_up", float("nan"))),
+        "dir_recall_up": float(m.get("recall_up", float("nan"))),
+        "dir_f1_up": float(m.get("f1_up", float("nan"))),
+        "dir_accuracy": float(m.get("accuracy", float("nan"))),
+        "dir_roc_auc_prob_up": float(m.get("roc_auc_prob_up", float("nan"))),
+        "dir_n_eval": int(m.get("n", 0)),
+    }
+
+
+def _leaderboard_row(
+    *,
+    predictor_id: str,
+    result: BacktestResult | EvalResult,
+    data_service: DataService,
+    covariates: list[str],
+    label: str,
+) -> dict[str, object]:
+    """Build one leaderboard row from a single (predictor, task) result."""
+    # BacktestResult carries ``spec``; EvalResult carries ``eval_spec`` — both
+    # expose the same ``.task``.
+    spec = getattr(result, "spec", None) or result.eval_spec
+    target_series_id = spec.task.target_series_id
+    dir_row = _direction_metrics_row(
+        predictions=result.predictions,
+        data_service=data_service,
+        target_series_id=target_series_id,
+    )
+    row: dict[str, object] = {
+        "horizon": int(max(spec.task.horizons)),
+        "target": target_series_id,
+        "model": label,
+        "uses_covariates": bool(covariates),
+        "n_covariates": len(covariates),
+        "covariates": ", ".join(covariates) if covariates else "—",
+        "predictor_id": predictor_id,
+        "mean_crps": float(result.mean_score),
+        "n_scores": int(len(result.scores)),
+        "n_predictions": int(len(result.predictions)),
+        "skipped_origins": int(getattr(result, "skipped_origins", 0)),
+        **dir_row,
+    }
+    run_number = getattr(result, "run_number", None)
+    if run_number is not None:
+        row["run_number"] = int(run_number)
+    return row
+
+
+def build_leaderboard(
+    results_by_predictor: dict[str, dict[str, BacktestResult | EvalResult]],
+    data_service: DataService,
+    *,
+    covariates_by_predictor: dict[str, list[str]] | None = None,
+    labels_by_predictor: dict[str, str] | None = None,
+) -> pd.DataFrame:
+    """Assemble a ``RESULTS_DF`` leaderboard from cached backtest/eval results.
+
+    Parameters
+    ----------
+    results_by_predictor
+        ``{predictor_id: {task_id: result}}`` as returned by looping
+        :func:`~aieng.forecasting.evaluation.cached_multi_backtest` (or
+        :func:`~aieng.forecasting.evaluation.multi_evaluate`) over a list of
+        predictors.  Backtest and eval results are both accepted; eval rows get a
+        ``run_number`` column.
+    data_service
+        Service that registers the target series (used for the next-direction
+        metrics that align each forecast with its realised return).
+    covariates_by_predictor
+        Optional ``{predictor_id: [series_id, ...]}`` so the ``uses_covariates`` /
+        ``covariates`` columns reflect each predictor's covariate panel.  A
+        predictor absent from the mapping is treated as target-only.
+    labels_by_predictor
+        Optional ``{predictor_id: short_label}`` driving the ``model`` column (and
+        the bar-chart labels).  Falls back to ``predictor_id``.
+
+    Returns
+    -------
+    pandas.DataFrame
+        One row per (predictor, horizon), sorted by ``(horizon, mean_crps)``; empty if no results.
+    """
+    covariates_by_predictor = covariates_by_predictor or {}
+    labels_by_predictor = labels_by_predictor or {}
+
+    rows: list[dict[str, object]] = []
+    for predictor_id, task_results in results_by_predictor.items():
+        for result in task_results.values():
+            rows.append(
+                _leaderboard_row(
+                    predictor_id=predictor_id,
+                    result=result,
+                    data_service=data_service,
+                    covariates=list(covariates_by_predictor.get(predictor_id, [])),
+                    label=labels_by_predictor.get(predictor_id, predictor_id),
+                )
+            )
+    if not rows:
+        return pd.DataFrame()
+    return pd.DataFrame(rows).sort_values(["horizon", "mean_crps"], na_position="last").reset_index(drop=True)
+
+
+__all__ = [
+    "build_leaderboard",
+    "build_return_compare_frame",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__plots.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__plots.py.md
new file mode 100644
index 0000000..9390485
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__plots.py.md
@@ -0,0 +1,244 @@
+# Source: implementations/sp500_forecasting/plots.py
+
+kind: python
+
+```python
+"""Matplotlib helpers for the multivariate S&P 500 demo notebook.
+
+Keeps the notebook narrative-focused; style matches
+``food_price_forecasting/plots.py`` (matplotlib only, ``(fig, ax)`` returns).
+"""
+
+from __future__ import annotations
+
+from collections.abc import Mapping
+from datetime import datetime, timezone
+from typing import TYPE_CHECKING
+
+import matplotlib.pyplot as plt
+import numpy as np
+import pandas as pd
+from matplotlib.axes import Axes
+from matplotlib.figure import Figure
+from matplotlib.ticker import FuncFormatter
+from sp500_forecasting.analysis import style_results_dataframe
+from sp500_forecasting.data import SP500_LOG_RETURN_SERIES_ID
+
+
+if TYPE_CHECKING:
+    from aieng.forecasting.data.service import DataService
+
+
+def plot_sp500_log_return_recent(
+    data_service: DataService,
+    *,
+    series_id: str = SP500_LOG_RETURN_SERIES_ID,
+    n_trading_days: int = 756,
+    title: str | None = None,
+) -> tuple[Figure, Axes]:
+    """Plot the last *n_trading_days* observed close-to-close log returns.
+
+    Parameters
+    ----------
+    data_service
+        Any service that registers ``series_id`` (typically ``svc_no_cov``).
+    series_id
+        Canonical log-return series id (defaults to the 1-business-day return).
+    n_trading_days
+        How many most recent rows to show (default ~3y of sessions).
+    title
+        Figure title; a default is used when ``None``.
+    """
+    as_of = datetime.now(tz=timezone.utc).replace(tzinfo=None)
+    df = data_service.get_series(series_id, as_of=as_of)
+    plot_df = df.sort_values("timestamp").tail(int(n_trading_days)).copy()
+    plot_df["timestamp"] = pd.to_datetime(plot_df["timestamp"])
+
+    fig, ax = plt.subplots(figsize=(10, 3.5), layout="constrained")
+    ax.axhline(0.0, color="0.45", linewidth=0.8, linestyle="--", zorder=1)
+    ax.fill_between(
+        plot_df["timestamp"],
+        0.0,
+        plot_df["value"],
+        where=plot_df["value"] >= 0,
+        interpolate=True,
+        alpha=0.35,
+        color="#1f77b4",
+        linewidth=0,
+    )
+    ax.fill_between(
+        plot_df["timestamp"],
+        0.0,
+        plot_df["value"],
+        where=plot_df["value"] < 0,
+        interpolate=True,
+        alpha=0.35,
+        color="#d62728",
+        linewidth=0,
+    )
+    ax.plot(plot_df["timestamp"], plot_df["value"], color="0.15", linewidth=0.6, zorder=2)
+    ax.set_xlabel("Session date (target timestamp)")
+    ax.set_ylabel("Log return")
+    ttl = title or (
+        f"Observed {series_id} (last {len(plot_df)} sessions)\nPositive: index up over the window; negative: down."
+    )
+    ax.set_title(ttl, fontsize=11)
+    ax.grid(True, alpha=0.25)
+    fig.autofmt_xdate()
+    return fig, ax
+
+
+def plot_mean_crps_leaderboard(
+    results_df: pd.DataFrame,
+    *,
+    value_col: str = "mean_crps",
+    label_col: str = "predictor_id",
+    title: str = "Mean CRPS by run (lower is better)",
+) -> tuple[Figure, Axes]:
+    """Horizontal bar chart from a ``RESULTS_DF``-style frame (single horizon)."""
+    d = results_df.dropna(subset=[value_col]).copy()
+    fig, ax = plt.subplots(figsize=(8.5, max(2.5, 0.45 * len(d) + 1)), layout="constrained")
+
+    if d.empty:
+        ax.text(0.5, 0.5, "No rows with finite mean CRPS to plot.", ha="center", va="center")
+        ax.set_axis_off()
+        return fig, ax
+
+    d = d.sort_values(value_col, ascending=True)
+    y = np.arange(len(d))
+    viridis = plt.get_cmap("viridis")
+    colors = viridis(np.linspace(0.25, 0.85, len(d)))
+    ax.barh(y, d[value_col].to_numpy(dtype=float), color=colors, height=0.65)
+    ax.set_yticks(y, d[label_col].astype(str).to_list())
+    ax.invert_yaxis()
+    ax.set_xlabel("Mean CRPS")
+    ax.set_title(title, fontsize=11)
+    ax.grid(True, axis="x", alpha=0.3)
+    for yi, val in zip(y, d[value_col].to_numpy(dtype=float), strict=True):
+        ax.text(float(val), float(yi), f"  {val:.5f}", va="center", fontsize=9, color="0.2")
+    return fig, ax
+
+
+def plot_mean_crps_by_horizon(
+    results_df: pd.DataFrame,
+    *,
+    label_col: str = "model",
+    title: str = "Mean CRPS by method and horizon (lower is better)",
+) -> tuple[Figure, list[Axes]]:
+    """Small-multiples: one CRPS bar panel per horizon, methods sorted within each.
+
+    Expects a combined frame from
+    :func:`~sp500_forecasting.leaderboard.build_leaderboard` (with a ``horizon``
+    column).  Makes the "predictability decays with horizon" story visible.
+    """
+    d = results_df.dropna(subset=["mean_crps"]).copy()
+    horizons = sorted(d["horizon"].unique()) if "horizon" in d.columns and not d.empty else []
+    n = len(horizons)
+    fig, axes = plt.subplots(1, max(n, 1), figsize=(5.0 * max(n, 1), 4.0), layout="constrained", squeeze=False)
+    ax_row = list(axes[0])
+
+    if not horizons:
+        ax_row[0].text(0.5, 0.5, "No rows with finite mean CRPS to plot.", ha="center", va="center")
+        ax_row[0].set_axis_off()
+        return fig, ax_row
+
+    cmap = plt.get_cmap("viridis")
+    for ax, h in zip(ax_row, horizons, strict=True):
+        dh = d[d["horizon"] == h].sort_values("mean_crps", ascending=True)
+        y = np.arange(len(dh))
+        ax.barh(y, dh["mean_crps"].to_numpy(dtype=float), color=cmap(np.linspace(0.25, 0.85, len(dh))), height=0.65)
+        ax.set_yticks(y, dh[label_col].astype(str).to_list(), fontsize=8)
+        ax.invert_yaxis()
+        ax.set_xlabel("Mean CRPS")
+        ax.set_title(f"h = {h} business day(s)", fontsize=10)
+        ax.grid(True, axis="x", alpha=0.3)
+    fig.suptitle(title, fontsize=12)
+    return fig, ax_row
+
+
+def plot_return_forecast_vs_actual_multi(
+    compare_by_run: Mapping[str, pd.DataFrame],
+    *,
+    title: str | None = None,
+) -> tuple[Figure, Axes]:
+    """Realised return (once) vs each run's median forecast, rendered as percent.
+
+    Each value is a :func:`~sp500_forecasting.leaderboard.build_return_compare_frame`
+    frame for a single horizon.  Insertion order sets the legend order, and the
+    first non-empty frame supplies the realised ("Actual") line.
+    """
+    fig, ax = plt.subplots(figsize=(12, 5.0), layout="constrained", facecolor="0.98")
+    ax.set_facecolor("#fafafa")
+
+    items = [(k, df) for k, df in compare_by_run.items() if df is not None and not df.empty]
+    if not items:
+        ax.text(0.5, 0.5, "No rows to plot (check price cache and backtest window).", ha="center", va="center")
+        ax.set_axis_off()
+        return fig, ax
+
+    base = items[0][1].copy()
+    base["session"] = pd.to_datetime(base["session"])
+    base = base.sort_values("session")
+    ax.axhline(0.0, color="0.5", linewidth=0.8, linestyle="--", zorder=1)
+    ax.plot(
+        base["session"].to_numpy(),
+        100.0 * base["actual_return"].to_numpy(dtype=float),
+        color="#0d47a1",
+        linewidth=2.2,
+        marker="o",
+        markersize=4,
+        label="Actual",
+        zorder=5,
+    )
+
+    cmap = plt.get_cmap("tab10")
+    for i, (run_key, d0) in enumerate(items):
+        d = d0.copy()
+        d["session"] = pd.to_datetime(d["session"])
+        d = d.sort_values("session")
+        ax.plot(
+            d["session"].to_numpy(),
+            100.0 * d["forecast_return"].to_numpy(dtype=float),
+            color=cmap(i % 10),
+            linewidth=1.6,
+            linestyle="--",
+            marker="s",
+            markersize=3,
+            label=run_key.replace("_", " "),
+            zorder=4 + i * 0.01,
+        )
+
+    ttl = title or "S&P 500 — forecast vs realised return"
+    ax.set_title(ttl, fontsize=12, fontweight="600", color="0.15")
+    ax.set_xlabel("Session date (forecast resolution)", fontsize=10, color="0.25")
+    ax.set_ylabel("Return (%)", fontsize=10, color="0.25")
+    ax.legend(loc="upper left", framealpha=0.92, fontsize=8, ncol=2)
+    ax.grid(True, alpha=0.28, linestyle="-", linewidth=0.6)
+    ax.tick_params(axis="both", labelsize=9, colors="0.35")
+    ax.yaxis.set_major_formatter(FuncFormatter(lambda v, _: f"{v:,.1f}%"))
+    fig.autofmt_xdate()
+    for spine in ("top", "right"):
+        ax.spines[spine].set_visible(False)
+    return fig, ax
+
+
+_RESULTS_EMPTY_HINT = (
+    "RESULTS_DF is empty — add at least one predictor to ``all_predictors`` in the "
+    "notebook's predictors cell, then re-run the backtest loop."
+)
+
+
+def display_multivariate_backtest_leaderboard(results_df: pd.DataFrame) -> None:
+    """Styled ``RESULTS_DF`` plus mean-CRPS bar charts (faceted by horizon when present)."""
+    from IPython.display import display  # noqa: PLC0415 — optional notebook dependency
+
+    if results_df.empty:
+        print(_RESULTS_EMPTY_HINT)
+        return
+    display(style_results_dataframe(results_df))  # type: ignore[no-untyped-call]
+    if "horizon" in results_df.columns and results_df["horizon"].nunique() > 1:
+        plot_mean_crps_by_horizon(results_df)
+    else:
+        plot_mean_crps_leaderboard(results_df, title="Mean CRPS — log return (target scale)")
+    plt.show()
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__predictors____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__predictors____init__.py.md
new file mode 100644
index 0000000..8610572
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__predictors____init__.py.md
@@ -0,0 +1,25 @@
+# Source: implementations/sp500_forecasting/predictors/__init__.py
+
+kind: python
+
+```python
+"""Tuned predictor recipes for the multivariate S&P 500 experiment.
+
+Each module here builds a fully-configured predictor instance for the S&P 500
+use case. Recipes pair a task-agnostic predictor from
+:mod:`aieng.forecasting.methods` with use-case-specific configuration: prompt
+overrides (what the series is and how returns behave), history windows, sampling
+budgets, the optional covariate panel, and a
+:attr:`~aieng.forecasting.methods.llm_processes.base.LLMPredictorConfig.variant_tag`
+that keeps cached artifacts distinct from ad-hoc bare-config runs.
+
+The conventional numerical methods (naive floor, ETS/Kalman/AutoARIMA, Darts
+linear regression / LightGBM) need no recipe — the notebook instantiates them
+directly from :mod:`aieng.forecasting.methods`.
+"""
+
+from .llmp_sampled_trajectory import build_sp500_llmp_sampled_trajectory
+
+
+__all__ = ["build_sp500_llmp_sampled_trajectory"]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__predictors__llmp_sampled_trajectory.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__predictors__llmp_sampled_trajectory.py.md
new file mode 100644
index 0000000..1fc1628
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__predictors__llmp_sampled_trajectory.py.md
@@ -0,0 +1,117 @@
+# Source: implementations/sp500_forecasting/predictors/llmp_sampled_trajectory.py
+
+kind: python
+
+```python
+"""S&P 500 recipe: sampled-trajectory LLMP (target-only and with-covariates).
+
+This file is intentionally small and explicit so notebook readers can open it as
+a reference recipe. The reusable method lives in ``aieng.forecasting``; this
+module captures the S&P 500 prompt framing (what the series *is* and how returns
+behave), the default sampling budget, the history window, and the cache tag used
+by the experiment.
+
+Two variants share this builder:
+
+- **target-only** — ``covariate_series_ids=None``; the LLM sees only the return
+  history.
+- **with-covariates** — pass the covariate panel; the predictor serializes
+  labeled covariate-history blocks (VIX, yields, …) into the prompt, so its CRPS
+  gap vs the target-only variant answers "can an LLM use the same exogenous
+  observations the ML methods do?".
+"""
+
+from __future__ import annotations
+
+from aieng.forecasting.methods.llm_processes import (
+    SampledTrajectoryLLMPredictor,
+    SampledTrajectoryLLMPredictorConfig,
+)
+from aieng.forecasting.models import LITE_MODEL
+
+
+_DEFAULT_MODEL = LITE_MODEL
+_DEFAULT_N_SAMPLES = 10
+_DEFAULT_HISTORY_WINDOW = 64
+_RECIPE_FAMILY = "sp500_v1"
+
+_SERIES_DESCRIPTION = (
+    "Series: S&P 500 (^GSPC) close-to-close cumulative log return over a fixed "
+    "number of business days.\n"
+    "Units: log-return (a value of 0.01 is roughly a +1% move).\n"
+    "Frequency: business days (Mon-Fri)."
+)
+
+_USER_PROMPT_SUFFIX = (
+    "Notes for this series:\n"
+    "- Daily index returns are close to a martingale: the *level* of the return "
+    "is barely predictable, so point forecasts should sit near 0 and the value "
+    "is in the *spread* (volatility and tail risk), not a confident direction.\n"
+    "- Returns cluster in volatility — calm and turbulent stretches persist — so "
+    "recent realised dispersion is the best guide to the width of your interval.\n"
+    "- Keep the distribution roughly symmetric about ~0 unless the recent history "
+    "or the covariate blocks give a clear reason to skew it; avoid extrapolating "
+    "a short run of up or down days into a trend."
+)
+
+
+def build_sp500_llmp_sampled_trajectory(
+    *,
+    model: str = _DEFAULT_MODEL,
+    n_samples: int = _DEFAULT_N_SAMPLES,
+    history_window: int | None = _DEFAULT_HISTORY_WINDOW,
+    covariate_series_ids: list[str] | None = None,
+    reasoning_effort: str | None = None,
+    max_tokens: int = 16384,
+    variant_tag: str | None = None,
+) -> SampledTrajectoryLLMPredictor:
+    """Return the S&P 500 sampled-trajectory LLMP recipe.
+
+    The model is an ordinary parameter and stays out of the recipe tag, because
+    the base LLMP ``predictor_id`` already includes it. The tag records the
+    S&P 500 prompt/config family, whether a covariate panel is in context, and
+    the cache-relevant knobs (history window, sample count) missing from the ID.
+
+    Parameters
+    ----------
+    model : str
+        Model identifier. Defaults to the lite model (``gemini-3.1-flash-lite-preview``).
+    n_samples : int
+        Number of trajectory samples to draw per prediction call.
+    history_window : int or None
+        Most-recent business days to include in context (``None`` = full history).
+    covariate_series_ids : list[str] or None
+        When provided, the covariate panel is serialized into the prompt
+        (the "with-covariates" variant). ``None`` is the target-only variant.
+    reasoning_effort : str or None
+        Provider reasoning budget. ``None`` (default) uses the provider default;
+        the Vector proxy rejects ``'disable'``/``'low'``.
+    max_tokens : int, default=16384
+        Per-call output token budget. The generous default prevents truncation
+        on thinking models where thinking tokens consume the same budget via the
+        OpenAI-compatible proxy. The model only generates tokens it needs, so
+        non-thinking models are unaffected in cost.
+    variant_tag : str or None
+        Override the whole auto-generated cache tag; ``None`` keeps the default.
+    """
+    history_tag = "hfull" if history_window is None else f"h{history_window}"
+    sample_count_tag = f"n{n_samples}"
+    covariate_tag = "cov" if covariate_series_ids else "target"
+    resolved_variant_tag = variant_tag or f"{_RECIPE_FAMILY}_{covariate_tag}_{history_tag}_{sample_count_tag}"
+
+    config = SampledTrajectoryLLMPredictorConfig(
+        model=model,
+        n_samples=n_samples,
+        history_window=history_window,
+        covariate_series_ids=covariate_series_ids,
+        reasoning_effort=reasoning_effort,
+        max_tokens=max_tokens,
+        series_description=_SERIES_DESCRIPTION,
+        user_prompt_suffix=_USER_PROMPT_SUFFIX,
+        variant_tag=resolved_variant_tag,
+    )
+    return SampledTrajectoryLLMPredictor(config)
+
+
+__all__ = ["build_sp500_llmp_sampled_trajectory"]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__specs__sp500_backtest_2025.yaml.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__specs__sp500_backtest_2025.yaml.md
new file mode 100644
index 0000000..aa3a94f
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__specs__sp500_backtest_2025.yaml.md
@@ -0,0 +1,54 @@
+# Source: implementations/sp500_forecasting/specs/sp500_backtest_2025.yaml
+
+kind: yaml
+
+```yaml
+# Main backtest spec — weekly origins across 2025 (post-cutoff).
+#
+# 2025 is after the Gemini training cutoff (~Jan 2025), so this is the window
+# where the conventional methods AND the LLM-Process can be compared *fairly* —
+# the LLM has not memorised these outcomes. Mirrors the energy reference's 2025
+# backtest window. Use it for open iteration; spend the protected 2026 eval
+# (`sp500_eval_2026.yaml`) sparingly on your finalists.
+#
+# This spec carries experiment design only (window + one single-horizon task per
+# target). The predictor roster and hyperparameters live in the notebook. Note
+# the LLMP predictors are token-heavy over ~50 weekly origins — the notebook lets
+# you trim the predictor list (or widen the stride here) before enabling them.
+
+spec_id: sp500_backtest_2025
+
+description: >-
+  Main multivariate comparison: weekly origins across 2025 (post-cutoff),
+  forecasting close-to-close cumulative returns at 1/5/21 business days.
+
+tasks:
+  - task_id: sp500_logret_1b
+    target_series_id: sp500_logret_1b
+    horizons: [1]
+    frequency: B
+    description: >-
+      S&P 500 close-to-close cumulative log return, 1 business day ahead
+      (next-session return / direction).
+
+  - task_id: sp500_logret_5b
+    target_series_id: sp500_logret_5b
+    horizons: [5]
+    frequency: B
+    description: >-
+      S&P 500 close-to-close cumulative log return, 5 business days ahead
+      (forward 1-week return).
+
+  - task_id: sp500_logret_21b
+    target_series_id: sp500_logret_21b
+    horizons: [21]
+    frequency: B
+    description: >-
+      S&P 500 close-to-close cumulative log return, 21 business days ahead
+      (forward 1-month return).
+
+start: "2025-01-06"
+end: "2025-12-22"
+stride: 5            # weekly origins (~50)
+warmup: 250
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__specs__sp500_eval_2026.yaml.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__specs__sp500_eval_2026.yaml.md
new file mode 100644
index 0000000..aa39831
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__specs__sp500_eval_2026.yaml.md
@@ -0,0 +1,57 @@
+# Source: implementations/sp500_forecasting/specs/sp500_eval_2026.yaml
+
+kind: yaml
+
+```yaml
+# Protected eval — held-out 2026 window, scored through multi_evaluate() with a budget.
+#
+# This is the honest scoreboard. 2026 is unambiguously after the Gemini training
+# cutoff, so neither the numerical methods nor the LLM-Process can have seen the
+# outcomes. Treat it as scarce: `max_runs` caps how many times the spec may be
+# scored (via EvalTracker), so iterate on `sp500_backtest_2025.yaml` and spend
+# this only on a curated set of finalists (chosen in the notebook's eval cell).
+#
+# Loaded as a MultiTargetEvalSpec: one single-horizon task per target, all under
+# a single shared run budget. One multi_evaluate() call across all three
+# horizons counts as ONE run against `max_runs` — the budget is keyed by
+# `spec_id`, not per-horizon. The predictor roster lives in the notebook.
+
+spec_id: sp500_eval_2026
+
+description: >-
+  Protected eval: 8 weekly origins in February-March 2026 (post-cutoff),
+  forecasting close-to-close cumulative returns at 1/5/21 business days.
+
+tasks:
+  - task_id: sp500_logret_1b
+    target_series_id: sp500_logret_1b
+    horizons: [1]
+    frequency: B
+    description: >-
+      S&P 500 close-to-close cumulative log return, 1 business day ahead
+      (next-session return / direction).
+
+  - task_id: sp500_logret_5b
+    target_series_id: sp500_logret_5b
+    horizons: [5]
+    frequency: B
+    description: >-
+      S&P 500 close-to-close cumulative log return, 5 business days ahead
+      (forward 1-week return).
+
+  - task_id: sp500_logret_21b
+    target_series_id: sp500_logret_21b
+    horizons: [21]
+    frequency: B
+    description: >-
+      S&P 500 close-to-close cumulative log return, 21 business days ahead
+      (forward 1-month return).
+
+start: "2026-02-02"
+end: "2026-03-23"     # 8 weekly origins; all resolve at h=21
+stride: 5             # weekly origins
+warmup: 250
+
+# Budget cap: each multi_evaluate() call (all 3 horizons) counts as one run.
+max_runs: 5
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__specs__sp500_smoke.yaml.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__specs__sp500_smoke.yaml.md
new file mode 100644
index 0000000..3ccf94e
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__specs__sp500_smoke.yaml.md
@@ -0,0 +1,59 @@
+# Source: implementations/sp500_forecasting/specs/sp500_smoke.yaml
+
+kind: yaml
+
+```yaml
+# Smoke spec — fast laptop run over a short, post-cutoff (2025) window.
+#
+# Weekly origins in late 2025: after the Gemini training cutoff (~Jan 2025), so
+# the LLM-Process rows in the notebook can be compared *fairly* against the
+# conventional methods here. The notebook keeps its LLMP predictors ON for this
+# window (the predictors cell gates them on a post-cutoff flag).
+#
+# This spec carries experiment design only — the window, stride/warmup, and one
+# single-horizon task per target. WHICH predictors run (and all their
+# hyperparameters, including the covariate panel) is configured in the notebook,
+# not here. Each task targets `sp500_logret_{N}b` (the close-to-close cumulative
+# log return over N business days), so forecasting it N steps ahead resolves to
+# the forward N-session return.
+#
+# Prerequisites (warm caches to the present first):
+#   uv run python scripts/fetch_sp500_market.py --refresh   # ^GSPC / ^VIX / ^IXIC
+#   uv run python scripts/fetch_fred.py                     # macro covariates
+
+spec_id: sp500_smoke
+
+description: >-
+  Smoke multivariate demo: weekly origins in late 2025 (post-cutoff),
+  forecasting close-to-close cumulative returns at 1/5/21 business days.
+
+tasks:
+  - task_id: sp500_logret_1b
+    target_series_id: sp500_logret_1b
+    horizons: [1]
+    frequency: B
+    description: >-
+      S&P 500 close-to-close cumulative log return, 1 business day ahead
+      (next-session return / direction).
+
+  - task_id: sp500_logret_5b
+    target_series_id: sp500_logret_5b
+    horizons: [5]
+    frequency: B
+    description: >-
+      S&P 500 close-to-close cumulative log return, 5 business days ahead
+      (forward 1-week return).
+
+  - task_id: sp500_logret_21b
+    target_series_id: sp500_logret_21b
+    horizons: [21]
+    frequency: B
+    description: >-
+      S&P 500 close-to-close cumulative log return, 21 business days ahead
+      (forward 1-month return).
+
+start: "2025-10-06"
+end: "2025-11-14"
+stride: 5            # weekly origins (~6)
+warmup: 250
+```
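
Review note on `sp500_smoke.yaml`: the `# weekly origins (~6)` comment can be sanity-checked by enumerating the origins by hand. The sketch below assumes that `stride` counts business days from `start` and that origins fall on Monday-to-Friday weekdays with no holiday calendar. `enumerate_origins` is a hypothetical helper written for this note, not the framework's scheduling code.

```python
from datetime import date, timedelta


def enumerate_origins(start: date, end: date, stride: int) -> list[date]:
    """Return every `stride`-th weekday between start and end, inclusive.

    Assumption: weekends are skipped and market holidays are ignored.
    """
    weekdays = []
    day = start
    while day <= end:
        if day.weekday() < 5:  # Monday=0 ... Friday=4
            weekdays.append(day)
        day += timedelta(days=1)
    return weekdays[::stride]


smoke_origins = enumerate_origins(date(2025, 10, 6), date(2025, 11, 14), stride=5)
print(len(smoke_origins), smoke_origins[0], smoke_origins[-1])
# prints: 6 2025-10-06 2025-11-10
```

Under these assumptions the window yields six Monday origins, which agrees with the inline comment. A real exchange calendar could shift an origin that lands on a holiday.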
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__specs__sp500_stress_2020.yaml.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__specs__sp500_stress_2020.yaml.md
new file mode 100644
index 0000000..b52c824
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__specs__sp500_stress_2020.yaml.md
@@ -0,0 +1,56 @@
+# Source: implementations/sp500_forecasting/specs/sp500_stress_2020.yaml
+
+kind: yaml
+
+```yaml
+# Regime-stress spec — the 2020 COVID crash, NUMERICAL METHODS ONLY.
+#
+# ⚠️  Keep the notebook's LLM-Process predictors OFF for this window. 2020 is
+# BEFORE the Gemini training cutoff (~Jan 2025), so an LLM has effectively
+# memorised these outcomes — scoring an LLMP here measures recall, not
+# forecasting, and would silently flatter it in the comparison. The numerical
+# methods are cutoff-safe by construction (they only see the series up to the
+# origin), so this volatile window is a perfectly valid stress test *for them* —
+# it's where a covariate edge is most visible.
+#
+# The notebook enforces "numerical only" in code: its predictors cell gates the
+# LLMP variants on a post-cutoff flag that is False for this config. Use this to
+# study "when do covariates help?" among the conventional methods; use the 2025
+# backtest / 2026 eval for anything involving the LLMP.
+
+spec_id: sp500_stress_2020
+
+description: >-
+  COVID-crash regime stress (numerical methods only): daily origins Feb–Apr
+  2020, forecasting close-to-close cumulative returns at 1/5/21 business days.
+
+tasks:
+  - task_id: sp500_logret_1b
+    target_series_id: sp500_logret_1b
+    horizons: [1]
+    frequency: B
+    description: >-
+      S&P 500 close-to-close cumulative log return, 1 business day ahead
+      (next-session return / direction).
+
+  - task_id: sp500_logret_5b
+    target_series_id: sp500_logret_5b
+    horizons: [5]
+    frequency: B
+    description: >-
+      S&P 500 close-to-close cumulative log return, 5 business days ahead
+      (forward 1-week return).
+
+  - task_id: sp500_logret_21b
+    target_series_id: sp500_logret_21b
+    horizons: [21]
+    frequency: B
+    description: >-
+      S&P 500 close-to-close cumulative log return, 21 business days ahead
+      (forward 1-month return).
+
+start: "2020-02-03"
+end: "2020-04-30"
+stride: 1            # daily origins across the crash
+warmup: 250
+```
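
Review note on `sp500_stress_2020.yaml`: the header says the notebook gates its LLMP variants on a post-cutoff flag that is False for this config. A minimal sketch of such a gate is shown below. The function name, the constant, and the exact cutoff date are assumptions made for illustration (the spec only says "~Jan 2025"), and the notebook's real flag may be computed differently.

```python
from datetime import date

# Assumption: the spec comments only say "~Jan 2025", so this exact date is illustrative.
ASSUMED_TRAINING_CUTOFF = date(2025, 1, 31)


def llm_rows_allowed(window_start: date, cutoff: date = ASSUMED_TRAINING_CUTOFF) -> bool:
    """True only when the whole window (every origin >= window_start) is after the cutoff."""
    return window_start > cutoff


print(llm_rows_allowed(date(2020, 2, 3)))   # stress window start -> False
print(llm_rows_allowed(date(2025, 10, 6)))  # smoke window start -> True
```

Keying the gate on the window start is the conservative choice: a window that straddles the cutoff is treated as pre-cutoff, so no memorised origin can leak into an LLM-inclusive comparison.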
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__starter_agent____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__starter_agent____init__.py.md
new file mode 100644
index 0000000..e37aec1
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__starter_agent____init__.py.md
@@ -0,0 +1,25 @@
+# Source: implementations/sp500_forecasting/starter_agent/__init__.py
+
+kind: python
+
+```python
+"""S&P 500 starter agent — a fresh, hackable template for your own exploration.
+
+Exports the toggle-driven :class:`AgentConfig` factory, the predictor
+convenience factory, and the self-contained prompt builder. See
+``99_starter_agent.ipynb`` and ``agent.py``.
+"""
+
+from sp500_forecasting.starter_agent.agent import (
+    Sp500StarterPromptBuilder,
+    build_starter_agent_config,
+    build_starter_agent_predictor,
+)
+
+
+__all__ = [
+    "Sp500StarterPromptBuilder",
+    "build_starter_agent_config",
+    "build_starter_agent_predictor",
+]
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__starter_agent__agent.py.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__starter_agent__agent.py.md
new file mode 100644
index 0000000..03cc28e
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__starter_agent__agent.py.md
@@ -0,0 +1,298 @@
+# Source: implementations/sp500_forecasting/starter_agent/agent.py
+
+kind: python
+
+```python
+"""S&P 500 starter agent — a fresh, hackable template for your own exploration.
+
+This is the S&P 500 use case's **first** agent, and it is deliberately minimal:
+a clean starting point with our common building blocks behind simple toggles —
+
+- **optional news search** (``enable_search``, on by default) — bounded,
+  cutoff-aware Google Search through the Vector proxy;
+- **optional code execution** (``enable_code_exec``, off by default) — an E2B
+  Python sandbox;
+- **two lightweight skills** (:mod:`skills/`) that are *tool-usage playbooks*:
+  how to get good results out of search and code execution.
+
+Everything routes through the Vector proxy — no direct provider keys. See
+``planning-docs/vector-llm-proxy.md``.
+
+This use case had no agent to borrow a prompt builder from, so
+:class:`Sp500StarterPromptBuilder` below is a small, self-contained serialiser —
+read it, then extend it (more covariates, richer panels, report context). The
+target is a single-horizon cumulative log return; the output is a probabilistic
+forecast of that return. Pair this with ``99_starter_agent.ipynb``.
+
+Module-level ``__getattr__`` exposes ``root_agent`` lazily so ``adk web`` can
+load this module for interactive (schema-free) use.
+"""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+from typing import Any, Callable
+
+import pandas as pd
+from aieng.forecasting.data.context import ForecastContext
+from aieng.forecasting.evaluation.prediction import STANDARD_QUANTILES
+from aieng.forecasting.evaluation.task import ForecastingTask
+from aieng.forecasting.methods.agentic import (
+    AgentPredictor,
+    ContinuousAgentForecastOutput,
+    build_adk_agent,
+)
+from aieng.forecasting.methods.agentic.agent_factory import (
+    AgentConfig,
+    CodeExecutionConfig,
+    ContextRetrievalConfig,
+)
+from aieng.forecasting.models import LITE_MODEL
+from pydantic import BaseModel
+
+
+# Skills live next to this module.
+_SKILLS_ROOT = Path(__file__).parent / "skills"
+_FORECASTING_SKILL = _SKILLS_ROOT / "forecasting"
+_RESEARCH_SKILL = _SKILLS_ROOT / "research-playbook"
+_CODE_ANALYSIS_SKILL = _SKILLS_ROOT / "code-analysis-playbook"
+
+
+# ---------------------------------------------------------------------------
+# Prompt builder (self-contained — this use case has no analyst_agent to reuse)
+# ---------------------------------------------------------------------------
+
+
+class Sp500StarterPromptBuilder(BaseModel):
+    """Serialise the target log-return series (+ optional covariate snapshot).
+
+    Minimal on purpose: the recent history of the cumulative-log-return target,
+    the task spec, the exact quantile grid, and — when ``covariate_series_ids``
+    are supplied and present in the context — the latest value of each covariate
+    as a leak-safe macro snapshot. Implements the
+    :class:`~aieng.forecasting.methods.agentic.predictor.ForecastPromptBuilder`
+    protocol structurally — extend it with richer covariate panels.
+    """
+
+    model_config = {"extra": "forbid"}
+
+    history: int = 64
+    covariate_series_ids: list[str] = []
+
+    def __call__(self, *, task: ForecastingTask, context: ForecastContext) -> str:
+        df = context.get_series(task.target_series_id).tail(self.history)
+        rows = ["date,log_return"] + [
+            f"{pd.Timestamp(ts).date()},{float(v):.6f}" for ts, v in zip(df["timestamp"], df["value"])
+        ]
+
+        covariate_snapshot: dict[str, float] = {}
+        for cov_id in self.covariate_series_ids:
+            try:
+                cov_df = context.get_series(cov_id)
+            except Exception:  # noqa: BLE001 — a missing covariate just drops out of the snapshot
+                continue
+            if not cov_df.empty:
+                covariate_snapshot[cov_id] = round(float(cov_df["value"].iloc[-1]), 6)
+
+        payload: dict[str, Any] = {
+            "task": task.task_id,
+            "as_of": str(context.as_of)[:10],
+            "horizons": list(task.horizons),
+            "standard_quantiles": list(STANDARD_QUANTILES),
+            "target_summary": {
+                "last_log_return": float(df["value"].iloc[-1]),
+                "last_date": str(pd.Timestamp(df["timestamp"].iloc[-1]).date()),
+                "n_obs": int(len(df)),
+            },
+            "target_history_csv": "\n".join(rows),
+        }
+        if covariate_snapshot:
+            payload["covariate_snapshot"] = covariate_snapshot
+        return json.dumps(payload, indent=2)
+
+
+# ---------------------------------------------------------------------------
+# System prompt
+# ---------------------------------------------------------------------------
+
+
+def _build_starter_instruction() -> str:
+    """Build the task-agnostic, skill-agnostic starter persona.
+
+    Just the analyst's identity and how to behave — no output schema, no payload
+    contract, no skill or tool mechanics. ADK injects the name + description of
+    every attached skill (and every tool) into the system prompt, so the agent
+    already knows what it can load and call; repeating that here would only
+    duplicate dynamically-injected information. The forecasting *contract* lives
+    in the loadable ``forecasting`` skill. Edit the persona freely.
+    """
+    return (
+        "## Role\n\n"
+        "You are an equity-market analyst — fluent in the rate path and Fed "
+        "guidance, inflation and jobs data, volatility and credit conditions, "
+        "and how macro catalysts move the S&P 500. This is a starter agent: keep "
+        "your reasoning transparent and your claims honest, and remember returns "
+        "are close to a random walk.\n\n"
+        "## How to respond\n\n"
+        "- For open-ended questions, scenario analysis, or anything "
+        "conversational, answer directly and concisely — do NOT ask for a JSON "
+        "payload.\n"
+        "- When you are handed a task that asks for a structured probabilistic "
+        "forecast, produce a calibrated one."
+    )
+
+
+_STARTER_INSTRUCTION = _build_starter_instruction()
+
+
+_CONTEXT_RETRIEVAL_INSTRUCTION = """\
+You are an equity-market intelligence specialist with web search.
+
+Return a concise structured markdown summary (3-5 paragraphs) covering, as the
+query warrants: the rate path and Fed guidance; recent inflation and jobs data;
+the VIX and credit spreads; earnings-season tone; and major geopolitical or
+policy shocks.
+
+Ground every claim in the search results you actually retrieve. When a cutoff
+date is specified, never report or speculate about events after it.\
+"""
+
+
+# ---------------------------------------------------------------------------
+# Config factory
+# ---------------------------------------------------------------------------
+
+
+def build_starter_agent_config(
+    model: str = LITE_MODEL,
+    search_model: str = LITE_MODEL,
+    *,
+    enable_search: bool = True,
+    enable_code_exec: bool = False,
+) -> AgentConfig:
+    """Build the S&P 500 starter :class:`AgentConfig`.
+
+    Parameters
+    ----------
+    model : str
+        Model for the analyst agent (default: lite). Pass the advanced model
+        (``"gemini-3.5-flash"``) for higher-quality runs.
+    search_model : str
+        Model for the bounded web-search sub-tool.
+    enable_search : bool, default=True
+        Wire a cutoff-aware ``search_web`` tool and load the
+        ``research-playbook`` skill. Proxy-only — no extra API key.
+    enable_code_exec : bool, default=False
+        Wire an E2B Python sandbox and load the ``code-analysis-playbook``
+        skill. Needs ``E2B_API_KEY`` and is slower, so it is off by default.
+
+    Returns
+    -------
+    AgentConfig
+    """
+    # Every attached skill is loaded on demand: ADK injects each skill's name +
+    # description into the system prompt, and the agent reads the full SKILL.md
+    # only when relevant — so toggling a tool just adds its skill, no persona edits.
+    skills_dirs: list[Path] = [_FORECASTING_SKILL]
+    if enable_search:
+        skills_dirs.append(_RESEARCH_SKILL)
+    if enable_code_exec:
+        skills_dirs.append(_CODE_ANALYSIS_SKILL)
+
+    context_retrieval = (
+        ContextRetrievalConfig(
+            enabled=True,
+            instruction=_CONTEXT_RETRIEVAL_INSTRUCTION,
+            search_model=search_model,
+        )
+        if enable_search
+        else ContextRetrievalConfig()
+    )
+
+    return AgentConfig(
+        name="sp500_starter_agent",
+        model=model,
+        instruction=_STARTER_INSTRUCTION,
+        # 16k headroom: enough for a complete run_code script + structured output.
+        max_output_tokens=16_384 if enable_code_exec else None,
+        context_retrieval=context_retrieval,
+        code_execution=CodeExecutionConfig(enabled=enable_code_exec),
+        skills_dirs=skills_dirs,
+    )
+
+
+# ---------------------------------------------------------------------------
+# Predictor convenience factory
+# ---------------------------------------------------------------------------
+
+
+class _StarterForecastPromptBuilder:
+    """Add the output schema + a forecast directive to a base builder's payload.
+
+    The exact JSON schema is generated at call time from the output class
+    (drift-free) and injected into the user payload — never into the system
+    prompt — so the agent stays conversational until it is actually asked to
+    forecast. Implements the
+    :class:`~aieng.forecasting.methods.agentic.predictor.ForecastPromptBuilder`
+    protocol structurally.
+    """
+
+    def __init__(self, inner: Callable[..., str], output_schema_json: str) -> None:
+        self._inner = inner
+        self._schema_json = output_schema_json
+
+    def __call__(self, *, task: ForecastingTask, context: ForecastContext) -> str:
+        payload = json.loads(self._inner(task=task, context=context))
+        payload["instructions"] = (
+            "Produce a calibrated probabilistic forecast for this task and return it by "
+            "calling `set_model_response` with a `json_response` string matching "
+            "`output_schema` exactly."
+        )
+        payload["output_schema"] = self._schema_json
+        return json.dumps(payload, indent=2)
+
+
+def build_starter_agent_predictor(
+    config: AgentConfig,
+    *,
+    covariate_series_ids: list[str] | None = None,
+) -> AgentPredictor:
+    """Wrap a starter :class:`AgentConfig` in an :class:`AgentPredictor`.
+
+    Wraps :class:`Sp500StarterPromptBuilder` so the (drift-free) continuous
+    output schema and a forecast directive ride in the payload — keeping the
+    schema out of the persona. ``predict(task, context)`` returns one
+    :class:`~aieng.forecasting.evaluation.prediction.Prediction` for the task's
+    single horizon.
+
+    Parameters
+    ----------
+    config : AgentConfig
+        A config from :func:`build_starter_agent_config`.
+    covariate_series_ids : list[str] or None
+        Covariates to include as a leak-safe snapshot in the prompt. They must be
+        registered on the data service used to build the context. ``None`` keeps
+        the starter target-only.
+    """
+    return AgentPredictor(
+        agent_config=config,
+        prompt_builder=_StarterForecastPromptBuilder(
+            Sp500StarterPromptBuilder(covariate_series_ids=covariate_series_ids or []),
+            ContinuousAgentForecastOutput.prompt_schema_json(),
+        ),
+        output_schema=ContinuousAgentForecastOutput,
+    )
+
+
+# ---------------------------------------------------------------------------
+# Lazy root_agent for `adk web` interactive use
+# ---------------------------------------------------------------------------
+
+
+def __getattr__(name: str) -> Any:
+    """Expose ``root_agent`` lazily for schema-free interactive use via ``adk web``."""
+    if name == "root_agent":
+        return build_adk_agent(build_starter_agent_config())
+    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
+```
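
Review note on `agent.py`: the `_StarterForecastPromptBuilder` wrapper keeps the output schema in the user payload rather than the system prompt. The pattern can be tried in isolation with the stand-ins below. `toy_inner_builder` and `SchemaInjectingBuilder` are hypothetical names, the task and context types are simplified to `str` and `dict`, and the schema string is a placeholder rather than the real `ContinuousAgentForecastOutput` schema.

```python
import json
from typing import Callable


def toy_inner_builder(*, task: str, context: dict) -> str:
    """Hypothetical stand-in for a base prompt builder that returns a JSON string."""
    return json.dumps({"task": task, "as_of": context["as_of"]})


class SchemaInjectingBuilder:
    """Wrap a builder and add the output schema plus a directive to its payload."""

    def __init__(self, inner: Callable[..., str], output_schema_json: str) -> None:
        self._inner = inner
        self._schema_json = output_schema_json

    def __call__(self, *, task: str, context: dict) -> str:
        # Round-trip through JSON so the wrapper works with any inner builder.
        payload = json.loads(self._inner(task=task, context=context))
        payload["instructions"] = "Return a forecast matching `output_schema`."
        payload["output_schema"] = self._schema_json
        return json.dumps(payload, indent=2)


builder = SchemaInjectingBuilder(toy_inner_builder, '{"type": "object"}')
prompt = json.loads(builder(task="sp500_logret_5b", context={"as_of": "2025-10-06"}))
print(sorted(prompt))
# prints: ['as_of', 'instructions', 'output_schema', 'task']
```

Because the wrapper only touches the payload, the same agent persona can serve both conversational use and scored forecasting, which is the design intent described in the docstring.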
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__starter_agent__skills__code-analysis-playbook__SKILL.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__starter_agent__skills__code-analysis-playbook__SKILL.md.md
new file mode 100644
index 0000000..0b4e16d
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__starter_agent__skills__code-analysis-playbook__SKILL.md.md
@@ -0,0 +1,57 @@
+# Source: implementations/sp500_forecasting/starter_agent/skills/code-analysis-playbook/SKILL.md
+
+kind: markdown
+
+---
+name: code-analysis-playbook
+description: >-
+  How to use the code execution sandbox well — parse the JSON payload (not
+  disk files), compute a couple of useful diagnostics before forecasting, and
+  keep the session stateful within a turn. Load this before writing code. No
+  scripts.
+---
+
+# Code-analysis playbook
+
+A short guide to using the `run_code` sandbox productively. This is a starter
+skill — extend it with the diagnostics that matter for your problem.
+
+## Where your data lives
+
+All data comes from the **JSON payload in your context** — there are no disk
+files and no network. The history arrives as a CSV *string* (e.g.
+`target_history_csv`). Parse it with `io.StringIO`, never as a file path:
+
+```python
+import io, pandas as pd
+df = pd.read_csv(io.StringIO(payload["target_history_csv"]))
+```
+
+The sandbox is **stateful within a turn**: parse once in your first code block,
+then reuse the DataFrame in later blocks instead of re-parsing.
+
+## Compute before you forecast
+
+Run a couple of cheap diagnostics so your forecast is grounded in arithmetic,
+not vibes:
+
+1. **Recent trend** — slope/return over the last N observations.
+2. **Volatility** — recent standard deviation of changes; it sets how wide your
+   quantile bands should be.
+3. **Sanity check** — does your point forecast sit within a plausible multiple
+   of recent moves? If not, revisit it.
+
+Use the printed numbers to set the point forecast and to *calibrate the spread*
+between your low and high quantiles — wider when recent volatility is high.
+
+## Domain focus (edit this for your use case)
+
+For S&P 500 log-returns, the series is near-unforecastable in the mean: recent
+realised volatility is far more predictable than direction. Let recent vol, not a
+directional hunch, set the *width* of your quantile bands, and keep the point
+forecast close to zero unless you have real signal.
+
+## Room to grow
+
+- Add your own diagnostic patterns (regime detection, seasonality, covariates).
+- Drop reusable reference values into a `references/` file and `load_skill_resource` them.
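
Review note on the code-analysis playbook: the "compute before you forecast" steps can be sketched with the standard library alone. The payload values below are made up for illustration, the mean is used as a crude trend proxy, and the `max_multiple=3.0` threshold in `is_plausible` is an assumption, not a value taken from the skill.

```python
import csv
import io
import statistics

# Hypothetical payload: the values are made up for illustration.
payload = {
    "target_history_csv": "date,log_return\n"
    "2025-10-01,0.004\n2025-10-02,-0.002\n2025-10-03,0.001\n"
    "2025-10-06,-0.003\n2025-10-07,0.005\n"
}

# Parse the CSV string in memory; there is no file on disk to open.
rows = list(csv.DictReader(io.StringIO(payload["target_history_csv"])))
returns = [float(row["log_return"]) for row in rows]

recent_mean = statistics.fmean(returns)  # crude trend proxy
recent_vol = statistics.stdev(returns)   # sample standard deviation


def is_plausible(point_forecast: float, vol: float, max_multiple: float = 3.0) -> bool:
    """Sanity check: flag point forecasts far outside recent moves."""
    return abs(point_forecast) <= max_multiple * vol


print(f"mean={recent_mean:.4f} vol={recent_vol:.4f}")
print(is_plausible(0.001, recent_vol), is_plausible(0.05, recent_vol))
```

With these toy numbers a 0.1% point forecast passes the check and a 5% one does not, which is the kind of arithmetic grounding the playbook asks for before setting quantile widths.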
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__starter_agent__skills__forecasting__SKILL.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__starter_agent__skills__forecasting__SKILL.md.md
new file mode 100644
index 0000000..4de57a2
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__starter_agent__skills__forecasting__SKILL.md.md
@@ -0,0 +1,57 @@
+# Source: implementations/sp500_forecasting/starter_agent/skills/forecasting/SKILL.md
+
+kind: markdown
+
+---
+name: forecasting
+description: >-
+  The output contract for producing a structured probabilistic forecast — the
+  JSON shape, the calibration and quantile rules, and how to submit it. Load
+  this ONLY when your task payload asks for a forecast; ignore it for
+  open-ended questions. No scripts.
+---
+
+# Forecasting skill
+
+Load this when your task payload asks for a structured forecast. For open-ended
+questions, ignore it and just answer.
+
+## What you'll receive
+
+A JSON payload describing the task: a `task` id, the `as_of` cutoff date,
+`horizons` (steps ahead), the `standard_quantiles` grid, a `target_summary`, the
+recent `target_history_csv`, and an `output_schema` showing the exact JSON to
+return.
+
+## The output contract
+
+1. Produce **one forecast per horizon** in `horizons`.
+2. Use **exactly** the levels in `standard_quantiles` — no additions or omissions.
+3. `point_forecast` must equal the **0.50 quantile** value.
+4. Quantile values must be **non-decreasing** as the quantile level rises.
+5. Use ONLY information available on or before `as_of`.
+6. Put your reasoning in the `rationale` fields.
+
+Submit by calling `set_model_response` with a `json_response` string that
+matches the payload's `output_schema` **exactly** — use `"horizon"` (an
+integer), and make `"quantiles"` a **list** of `{"quantile": <level>, "value":
+<number>}` objects. Omit any field not shown in the schema.
+
+## Calibration
+
+Report calibrated intervals, not false precision: across many forecasts where
+your 80% band is stated, the truth should land inside it about 80% of the time.
+Anchor the point on the recent level and trend; let recent **volatility** set
+how wide the bands are, and widen them as the horizon grows.
+
+## Domain focus (edit this for your use case)
+
+For S&P 500 log-returns, keep the point forecast near zero unless you have real
+signal — returns are close to a random walk. Let recent realised volatility set
+how WIDE the quantile bands are; it is far more predictable than direction. Note
+any macro catalyst you lean on in the rationale.
+
+## Room to grow
+
+- Tighten the calibration guidance with your own backtest findings.
+- Add worked examples of good vs. over-confident forecasts.
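
Review note on the forecasting skill: rules 2 to 4 of the output contract are mechanical and could be checked before submission. The sketch below validates a single per-horizon forecast using the field names the skill mentions (`point_forecast`, `quantiles`, `quantile`, `value`). `contract_violations` and the three-level `EXAMPLE_QUANTILES` grid are hypothetical; the real grid is whatever `standard_quantiles` carries in the payload.

```python
import math

# Hypothetical quantile grid: the real one is whatever `standard_quantiles` holds.
EXAMPLE_QUANTILES = [0.1, 0.5, 0.9]


def contract_violations(forecast: dict, levels: list[float]) -> list[str]:
    """Return human-readable contract violations (an empty list means it passes)."""
    problems = []
    got_levels = [q["quantile"] for q in forecast["quantiles"]]
    values = [q["value"] for q in forecast["quantiles"]]
    if got_levels != levels:
        problems.append("quantile levels differ from the standard grid")
    if any(b < a for a, b in zip(values, values[1:])):
        problems.append("quantile values decrease as the level rises")
    median = dict(zip(got_levels, values)).get(0.5)
    if median is None or not math.isclose(forecast["point_forecast"], median):
        problems.append("point_forecast does not equal the 0.50 quantile")
    return problems


good = {
    "horizon": 5,
    "point_forecast": 0.0,
    "quantiles": [
        {"quantile": 0.1, "value": -0.03},
        {"quantile": 0.5, "value": 0.0},
        {"quantile": 0.9, "value": 0.03},
    ],
}
bad = dict(good, point_forecast=0.01)
print(contract_violations(good, EXAMPLE_QUANTILES))
print(contract_violations(bad, EXAMPLE_QUANTILES))
```

The check assumes the grid is listed in ascending order, so a forecast with reordered levels is reported as a grid mismatch. It does not cover rule 1 (one forecast per horizon) or the `as_of` information rule, which cannot be verified from the output alone.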
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__starter_agent__skills__research-playbook__SKILL.md.md b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__starter_agent__skills__research-playbook__SKILL.md.md
new file mode 100644
index 0000000..dce71de
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/implementations__sp500_forecasting__starter_agent__skills__research-playbook__SKILL.md.md
@@ -0,0 +1,45 @@
+# Source: implementations/sp500_forecasting/starter_agent/skills/research-playbook/SKILL.md
+
+kind: markdown
+
+---
+name: research-playbook
+description: >-
+  How to use the search_web tool well when grounding a forecast in recent
+  news — phrase cutoff-aware queries, decide what is worth searching for, and
+  weigh sources. Load this before your first search_web call. No scripts.
+---
+
+# Research playbook
+
+A short guide to getting real signal out of `search_web`. This is a starter
+skill — extend it with the queries and sources that work for your problem.
+
+## The one rule that matters
+
+Always pass `cutoff_date` equal to the `as_of` date in your payload. It is the
+temporal fence that keeps post-origin information out of a historical forecast.
+A forecast that "knew" what happened after `as_of` is not a forecast.
+
+## How to search
+
+- **Search before you forecast, not after.** Gather context first, then reason.
+- **One topic per query.** Several focused queries beat one broad one. Stop when
+  new queries stop returning new facts.
+- **Ask for the present state, not a prediction.** "current OPEC+ production
+  policy" returns facts; "will oil go up" returns noise.
+- **Weigh sources.** Prefer primary releases and major outlets; treat a single
+  blog or forum post as a lead to confirm, not a fact.
+
+## Domain focus (edit this for your use case)
+
+For S&P 500 returns, the signals that matter are the rate path and Fed guidance, recent
+inflation and jobs prints, the VIX and credit spreads, earnings-season tone, and
+major geopolitical or policy shocks. Search for the *current state* of these,
+then remember returns are close to a random walk — be humble about direction.
+
+## Room to grow
+
+- Add a curated list of go-to sources for your domain.
+- Track which queries paid off and prune the ones that didn't.
+- Add a `references/` file with example high-signal searches.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/planning-docs__roadmap.md.md b/implementations/getting_started/concierge_agent/context/artifacts/planning-docs__roadmap.md.md
new file mode 100644
index 0000000..0775d58
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/planning-docs__roadmap.md.md
@@ -0,0 +1,73 @@
+# Source: planning-docs/roadmap.md
+
+kind: markdown
+
+# Roadmap and Architecture Notes
+
+This document holds the cross-cutting design principles worth preserving and a catalog of extension ideas for building on the foundation. It is a maintainer-facing reference, not a task tracker — per-implementation guidance lives in each `implementations/<use-case>/README.md`, and participant-facing setup lives in the repository `README.md` files.
+
+## Forecasting taxonomy
+
+Keep three concepts separate:
+
+- **Task / output modality** — what is being predicted. Continuous forecasts predict future values or distributions for a time series (scored with CRPS). Discrete-event forecasts predict the probability of a resolved event and are scored with Brier (binary) or RPS (ordered categorical).
+- **Forecasting method** — how the prediction is produced. Numerical forecasters, LLM Processes, and agentic forecasters are method families that apply to either modality.
+- **Interaction mode** — how the system is used. Track 1 produces standardized `Prediction` objects for evaluation; Track 2 supports interactive analysis, scenario exploration, monitoring, and Q&A without head-to-head scoring.
+
+Output modality and method family are independent: a time-series task can often be reframed as a discrete-event question, and numerical models can supply features or probabilities that support discrete-event predictors.
+
+## Architecture principles
+
+- `aieng-forecasting` (`aieng.forecasting`) owns stable infrastructure: the data service, cutoff enforcement, evaluation interfaces, prediction payloads, artifact storage, and the reusable agent backbone.
+- `aieng.forecasting.methods` owns reusable concrete `Predictor` implementations.
+- `implementations/<use-case>/` owns notebooks, task-specific configuration, prompts, and co-located YAML specs (one `specs/` directory per use case).
+- Darts is the primary numerical forecasting library.
+- Pydantic structured outputs and strong, mypy-clean typing are the default for core interfaces.
+- StatCan, FRED, and yfinance are the reference data sources.
+- Code, notebooks, specs, and documentation stay aligned; READMEs are part of the product.
+- Add methods incrementally — give each reference implementation one strong, runnable baseline before adding a method zoo.
+
+### Agent modes
+
+The agent backbone supports two modes:
+
+- **Track 1 prediction** — configured to emit standardized `Prediction` objects through the evaluation interfaces.
+- **Track 2 interactive analysis** — configured for conversation, scenario analysis, evidence gathering, and code execution; its interaction surface differs because it is not scored head-to-head.
+
+A common decomposition is a Gemini-backed **Context Retrieval Agent** for search grounding and source-aware context, and a provider-flexible **Analyst Agent** for reasoning, code execution, and synthesis.
+
+**LLM routing.** Everything routes through the Vector proxy (`proxy.vectorinstitute.ai`) — there are no direct-Gemini sub-agents. Web search is a `search_web` tool backed by the proxy's `{"googleSearch": {}}` extension; code execution runs in an E2B sandbox (provider-independent); the analyst/reasoning model is auto-wrapped in `LiteLlm` pointing at the proxy. LLM Processes use the same proxy seam. See [`vector-llm-proxy.md`](vector-llm-proxy.md) for the full convention and the history of the proxy fixes that made this possible.
+
+## Extension ideas
+
+The repository is a foundation. Each reference implementation's README ends with extensions specific to it; the cross-cutting ones are collected here. Each builds on a complete implementation and has a clear seam in the code.
+
+### Deepen a reference implementation
+
+- **BoC live forecasting** — extend `meeting_schedule.yaml` with the Bank's published future dates and forecast each announcement the day before it happens: genuinely out-of-sample, and the honest test that backtest leakage precludes. Needs annual calendar maintenance.
+- **Reports as predictor context** — wire cutoff-filtered documents into the forecast prompt: BoC press releases / Monetary Policy Reports through the LLM-Process `user_prompt_suffix` or the `build_boc_news_config` retrieval seam, and the analogous food-CPI CFPR wiring (extraction already exists; mirror BoC's `PressReleaseStore`). Measure the lift over the quantitative-only baseline.
+- **Memory-augmented agent** — an agent that learns from its own resolved prediction errors over time; a generalization of the energy adaptive agent across use cases.
+
+### Agent and analyst depth
+
+Every domain implementation (S&P 500, food, energy, BoC) now ships a **`starter_agent/` module + `99_starter_agent.ipynb`** — a fresh, participant-owned agent template with toggleable proxy news search and E2B code execution, two lightweight tool-usage skills, an interactive (Track 2) cell, and one scored (Track 1) prediction. It is the canonical "build your own" entry point and doubles as a quick end-to-end smoke test of each use case's agent stack. Natural next steps from here: richer E2B code-execution configs, prompt and context-formatting optimization, and deeper Track 2 interactive analyst configurations per use case (see [`../docs/adk-skills-guide.md`](../docs/adk-skills-guide.md) for the skill design rules).
+
+**Repo concierge (shipped).** `getting_started/concierge_agent/` + `99_repo_concierge.ipynb` — a `gemini-3.1-flash-lite-preview` ADK agent that answers bootcamp onboarding questions by searching a committed catalog of public `main` (maintainers rebuild via `scripts/build_concierge_context.py`). Points participants to notebooks, modules, and snippets; complements the domain starter agents.
+
+### Broaden coverage
+
+- Transpose the S&P 500 template to additional energy commodities, or to other liquid assets, equities, or indices. The S&P 500 reference now compares conventional numerical methods (incl. ETS and Kalman) against a **covariate-aware LLM-Process** across cumulative-return horizons — `SampledTrajectoryLLMPredictor` supports `covariate_series_ids` (exogenous-series prompt blocks), so the "can an LLM use the covariate panel as well as gradient boosting?" comparison is shipped, not deferred.
+- Add richer FRED covariates for food, energy, or financial markets. Extending covariate-aware prompting to the other LLM-Process predictors (`QuantileGridLLMPredictor`, …) is a natural next step.
+- Reframe a continuous target as a binary or categorical question (the BoC harness shows the pattern).
+- Add time-series foundation models or additional numerical methods once an implementation has one strong baseline.
+- Explore ForecastBench as a comparison or discussion point.
+
+### Live testing
+
+Record predictions from the reference methods (energy first, given its daily data), persist predictions and reasoning traces, and resolve them as horizons mature — a true prospective Track 1 test, distinct from Track 2 scoring.
+
+**Cutoff-aware evaluation (principle).** LLM/agent forecasters can only be scored honestly on origins *after* the model's training cutoff (~Jan 2025 for Gemini); earlier origins measure memorised recall, not forecasting, and silently flatter the LLM against the cutoff-safe numerical methods. Energy and S&P 500 both put the LLM-inclusive comparison in a 2025 backtest plus a protected 2026 eval; pre-cutoff windows (e.g. S&P 500's 2020 COVID stress) are kept **numerical-only**. Food and BoC still backtest their LLM rows on pre-cutoff windows and should migrate to the same discipline.
+
+### Core-library follow-up
+
+`resolution_fn` on `ForecastingTask` is still a placeholder; the derived-event-series approach avoids needing dispatch today, but spread/level-target framings will eventually force it.
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/scripts__fetch_boc.py.md b/implementations/getting_started/concierge_agent/context/artifacts/scripts__fetch_boc.py.md
new file mode 100644
index 0000000..65b456f
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/scripts__fetch_boc.py.md
@@ -0,0 +1,131 @@
+# Source: scripts/fetch_boc.py
+
+kind: python
+
+```python
+"""Populate the local data cache for the BoC rate-decision experiment.
+
+Downloads (or revalidates) the StatCan tables used by
+``implementations/boc_rate_decisions``:
+
+- 10-10-0139-01 — daily financial-market statistics (BoC target rate,
+  GoC benchmark bond yields).
+- 18-10-0004-11 — monthly CPI (All-items covariate; shared with the
+  getting-started and food-price use cases).
+
+It then derives the per-meeting rate-cut event series, validates the curated
+meeting calendar against observed target-rate changes, and prints a summary.
+
+The FRED unemployment covariate is fetched by ``scripts/fetch_fred.py``
+(requires ``FRED_API_KEY``); run both scripts before the BoC notebooks.
+
+Re-running is idempotent: cached tables are re-read from disk. Pass
+``--refresh`` to delete the cached StatCan zips and re-download.
+
+Usage
+-----
+::
+
+    uv run python scripts/fetch_boc.py
+    uv run python scripts/fetch_boc.py --refresh
+"""
+
+from __future__ import annotations
+
+import argparse
+import sys
+from pathlib import Path
+
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+sys.path.insert(0, str(REPO_ROOT))
+sys.path.insert(0, str(REPO_ROOT / "implementations"))
+
+from dotenv import load_dotenv
+
+
+load_dotenv(REPO_ROOT / ".env")
+
+from datetime import datetime, timezone
+
+from boc_rate_decisions.data import (
+    RATE_CUT_EVENT_SERIES_ID,
+    TARGET_RATE_SERIES_ID,
+    build_boc_service,
+    load_meeting_schedule,
+    load_unscheduled_announcements,
+    validate_schedule_against_rate_series,
+)
+
+
+DEFAULT_CACHE_DIR = REPO_ROOT / "data" / "statcan"
+
+# Normalized zip names for the tables this experiment depends on.
+_TABLE_ZIPS = ["10100139-eng.zip", "18100004-eng.zip"]
+
+
+def main() -> None:
+    """CLI entry point: populate the StatCan cache and print a summary."""
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument(
+        "--refresh",
+        action="store_true",
+        help="Delete cached StatCan zips for this experiment and re-download.",
+    )
+    parser.add_argument(
+        "--cache-dir",
+        type=Path,
+        default=DEFAULT_CACHE_DIR,
+        help=f"stats-can cache directory (default: {DEFAULT_CACHE_DIR}).",
+    )
+    args = parser.parse_args()
+
+    if args.refresh:
+        for zip_name in _TABLE_ZIPS:
+            zip_path = args.cache_dir / zip_name
+            if zip_path.exists():
+                zip_path.unlink()
+                print(f"Removed cached {zip_path}")
+
+    print(f"Populating StatCan cache at {args.cache_dir}")
+    fred_cache = REPO_ROOT / "data" / "fred"
+    try:
+        svc = build_boc_service(statcan_cache_dir=args.cache_dir, fred_cache_dir=fred_cache)
+    except Exception as exc:
+        print(f"  [WARN] FRED unemployment covariate unavailable ({exc}).")
+        print("         Run `uv run python scripts/fetch_fred.py` (needs FRED_API_KEY) to add it.")
+        svc = build_boc_service(statcan_cache_dir=args.cache_dir, include_fred=False)
+
+    now = datetime.now(tz=timezone.utc).replace(tzinfo=None)
+    rate_df = svc.get_series(TARGET_RATE_SERIES_ID, as_of=now)
+    event_df = svc.get_series(RATE_CUT_EVENT_SERIES_ID, as_of=now)
+    orphans = validate_schedule_against_rate_series(
+        rate_df,
+        load_meeting_schedule(),
+        load_unscheduled_announcements(),
+    )
+    if orphans:
+        print()
+        print("WARNING: observed target-rate changes not attributable to any known announcement:")
+        for d in orphans:
+            print(f"  {d.date()}")
+        print("The curated meeting_schedule.yaml is likely missing or misdating a meeting.")
+
+    print()
+    print(
+        f"Target rate: {len(rate_df)} daily observations "
+        f"({rate_df['timestamp'].min().date()} to {rate_df['timestamp'].max().date()})"
+    )
+    n_cuts = int(event_df["value"].sum())
+    print(f"Rate-cut events: {len(event_df)} resolved meetings, {n_cuts} cuts (base rate {n_cuts / len(event_df):.1%})")
+
+    print()
+    summary = svc.summary()
+    summary["start"] = summary["start"].dt.strftime("%Y-%m-%d")
+    summary["end"] = summary["end"].dt.strftime("%Y-%m-%d")
+    print(summary.to_string(index=False))
+
+
+if __name__ == "__main__":
+    main()
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/scripts__fetch_boc_press_releases.py.md b/implementations/getting_started/concierge_agent/context/artifacts/scripts__fetch_boc_press_releases.py.md
new file mode 100644
index 0000000..bc35772
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/scripts__fetch_boc_press_releases.py.md
@@ -0,0 +1,135 @@
+# Source: scripts/fetch_boc_press_releases.py
+
+kind: python
+
+```python
+"""Download and extract Bank of Canada FAD press releases into the data cache.
+
+Counterpart to ``scripts/fetch_cfpr.py`` / ``scripts/extract_reports.py`` for the
+BoC use case. Because press releases are HTML (not PDF), fetch and extract happen
+in a single pass: for each scheduled announcement date it
+
+  * downloads the FAD press-release page,
+  * extracts the article body to an
+    :class:`~aieng.forecasting.documents.ExtractedDocument`,
+  * writes ``data/reports/boc_press_releases/<doc_id>.md`` (full text) +
+    ``<doc_id>.json`` (metadata with a ``text_path`` pointer) + a provenance
+    sidecar.
+
+URLs are derived from ``meeting_schedule.yaml`` (no manifest needed). Individual
+missing pages (older slugs, future dates not yet published) are logged and
+skipped so a few misses don't abort the run.
+
+Usage
+-----
+::
+
+    uv run python scripts/fetch_boc_press_releases.py            # all scheduled dates
+    uv run python scripts/fetch_boc_press_releases.py --force    # re-download
+    uv run python scripts/fetch_boc_press_releases.py --year 2026
+
+Artifacts live under ``data/`` and are never committed.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+import urllib.error
+import urllib.request
+from datetime import datetime, timezone
+from pathlib import Path
+
+
+sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
+sys.path.insert(0, str(Path(__file__).resolve().parents[1] / "implementations"))
+
+from boc_rate_decisions.press_releases import (
+    DEFAULT_PRESS_RELEASE_CACHE_DIR,
+    PressReleaseEntry,
+    extract_press_release_html,
+    press_release_entries,
+    write_artifact,
+)
+
+
+# A browser-like UA: bankofcanada.ca can reject the default urllib agent.
+_USER_AGENT = "Mozilla/5.0 (compatible; agentic-forecasting-bootcamp/0.1; +data-cache)"
+
+# A genuine FAD release always mentions the overnight rate; a soft sanity check
+# that we extracted the article and not a redirect/error shell.
+_MIN_CHARS = 400
+_SENTINEL = "overnight rate"
+
+
+def _download_html(url: str) -> str:
+    """Fetch ``url`` and return decoded HTML, raising ``RuntimeError`` on HTTP error."""
+    request = urllib.request.Request(url, headers={"User-Agent": _USER_AGENT})  # noqa: S310 (trusted BoC URL)
+    try:
+        with urllib.request.urlopen(request, timeout=60) as response:  # noqa: S310
+            return response.read().decode("utf-8", errors="replace")
+    except urllib.error.HTTPError as exc:
+        raise RuntimeError(f"HTTP {exc.code}") from exc
+    except urllib.error.URLError as exc:
+        raise RuntimeError(f"network error: {exc.reason}") from exc
+
+
+def _write_provenance(cache_dir: Path, key: str, *, url: str) -> None:
+    """Write a provenance sidecar JSON for one fetched release."""
+    provenance_path = cache_dir / "provenance" / f"{key}.json"
+    provenance_path.parent.mkdir(parents=True, exist_ok=True)
+    provenance_path.write_text(
+        json.dumps({"url": url, "retrieved_at": datetime.now(tz=timezone.utc).isoformat()}, indent=2),
+        encoding="utf-8",
+    )
+
+
+def fetch_entry(entry: PressReleaseEntry, *, cache_dir: Path, force: bool) -> str:
+    """Fetch + extract one press release; return a short status string."""
+    _, json_path = entry.artifact_paths(cache_dir)
+    if json_path.exists() and not force:
+        return f"skip (cached)  {json_path}"
+
+    html = _download_html(entry.url)
+    doc = extract_press_release_html(html, entry.meta)
+    if doc.n_chars < _MIN_CHARS or _SENTINEL not in doc.text.lower():
+        raise RuntimeError(
+            f"extracted {doc.n_chars} chars but it does not look like a FAD release "
+            f"(missing {_SENTINEL!r}); the page slug may differ for this date",
+        )
+    write_artifact(doc, cache_dir)
+    _write_provenance(cache_dir, entry.key, url=entry.url)
+    return f"ok  {doc.n_chars:>7,} chars  ~{doc.est_tokens:>6,} tokens"
+
+
+def main() -> None:
+    """Parse args and fetch+extract all (or one year's) press releases."""
+    parser = argparse.ArgumentParser(description="Download + extract BoC FAD press releases into data/.")
+    parser.add_argument("--year", type=int, default=None, help="Fetch only releases in this year.")
+    parser.add_argument("--force", action="store_true", help="Re-download even if cached.")
+    args = parser.parse_args()
+
+    entries = press_release_entries()
+    if args.year is not None:
+        entries = [e for e in entries if e.meta.publication_date.year == args.year]
+        if not entries:
+            raise SystemExit(f"No scheduled announcement dates in {args.year}.")
+
+    cache_dir = DEFAULT_PRESS_RELEASE_CACHE_DIR
+    print(f"Fetching {len(entries)} BoC press release(s) -> {cache_dir.resolve()}\n")
+
+    failures = 0
+    for entry in entries:
+        try:
+            print(f"  [{entry.key}] {fetch_entry(entry, cache_dir=cache_dir, force=args.force)}")
+        except RuntimeError as exc:
+            failures += 1
+            print(f"  [{entry.key}] skip: {exc}")
+
+    print(f"\nDone. {len(entries) - failures}/{len(entries)} fetched (missing pages skipped).")
+
+
+if __name__ == "__main__":
+    main()
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/scripts__fetch_cfpr.py.md b/implementations/getting_started/concierge_agent/context/artifacts/scripts__fetch_cfpr.py.md
new file mode 100644
index 0000000..5eae48c
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/scripts__fetch_cfpr.py.md
@@ -0,0 +1,166 @@
+# Source: scripts/fetch_cfpr.py
+
+kind: python
+
+```python
+"""Download and cache published report PDFs into the local ``data/`` cache.
+
+This populator mirrors ``scripts/fetch_cpi.py`` / ``scripts/fetch_wti.py``: run
+it once before extraction to fill the gitignored cache. It is source-agnostic
+by design -- today it serves Canada's Food Price Report (``--source cfpr``); the
+same machinery will serve Bank of Canada Monetary Policy Reports
+(``--source boc``) once that manifest lands.
+
+For each manifest entry it:
+  * downloads the PDF into ``data/reports/<source>/<year>_<lang>.pdf``,
+  * verifies the payload really is a PDF (``%PDF`` magic bytes) and FAILS LOUDLY
+    otherwise -- a moved CDN filename returns an HTML 404 page, which we refuse
+    to cache silently,
+  * writes a provenance sidecar (url, retrieved-at, sha256, byte length) under
+    ``data/reports/<source>/provenance/<key>.json``.
+
+Usage
+-----
+    uv run python scripts/fetch_cfpr.py                 # all CFPR editions
+    uv run python scripts/fetch_cfpr.py --force         # re-download
+    uv run python scripts/fetch_cfpr.py --year 2026     # one edition
+
+PDFs and provenance live under ``data/`` and are never committed; only the
+manifest is committed.
+"""
+
+from __future__ import annotations
+
+import argparse
+import hashlib
+import json
+import sys
+import urllib.error
+import urllib.request
+from datetime import datetime, timezone
+from pathlib import Path
+
+
+sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
+
+from food_price_forecasting.reports import (
+    CFPRReportEntry,
+    load_manifest,
+)
+
+
+# A browser-like UA: the Dalhousie CDN can reject the default urllib agent.
+_USER_AGENT = "Mozilla/5.0 (compatible; agentic-forecasting-bootcamp/0.1; +data-cache)"
+_PDF_MAGIC = b"%PDF"
+_REPORTS_ROOT = Path("data/reports")
+
+
+# source -> manifest loader. Extend with "boc" once its manifest exists.
+_SOURCE_LOADERS = {
+    "cfpr": load_manifest,
+}
+
+
+def _download(url: str) -> tuple[bytes, int]:
+    """Fetch ``url`` and return ``(body, http_status)``.
+
+    Raises
+    ------
+    RuntimeError
+        On HTTP error or if the payload is not a PDF.
+    """
+    request = urllib.request.Request(url, headers={"User-Agent": _USER_AGENT})  # noqa: S310 (trusted manifest URL)
+    try:
+        with urllib.request.urlopen(request, timeout=60) as response:  # noqa: S310
+            status = int(getattr(response, "status", 200) or 200)
+            body = response.read()
+    except urllib.error.HTTPError as exc:
+        raise RuntimeError(f"HTTP {exc.code} fetching {url}") from exc
+    except urllib.error.URLError as exc:
+        raise RuntimeError(f"Network error fetching {url}: {exc.reason}") from exc
+
+    if not body.startswith(_PDF_MAGIC):
+        preview = body[:80].decode("utf-8", errors="replace")
+        raise RuntimeError(
+            f"Response from {url} is not a PDF (got {len(body)} bytes starting with {preview!r}). "
+            "The CDN filename may have changed -- update reports_manifest.yaml.",
+        )
+    return body, status
+
+
+def _write_provenance(provenance_path: Path, *, url: str, status: int, sha256: str, content_length: int) -> None:
+    """Write a provenance sidecar JSON next to the cached PDF."""
+    provenance_path.parent.mkdir(parents=True, exist_ok=True)
+    provenance_path.write_text(
+        json.dumps(
+            {
+                "url": url,
+                "http_status": status,
+                "retrieved_at": datetime.now(tz=timezone.utc).isoformat(),
+                "sha256": sha256,
+                "content_length": content_length,
+            },
+            indent=2,
+        ),
+        encoding="utf-8",
+    )
+
+
+def fetch_entry(entry: CFPRReportEntry, *, cache_dir: Path, force: bool) -> str:
+    """Download one report edition; return a short status string."""
+    pdf_path = cache_dir / f"{entry.key}.pdf"
+    if pdf_path.exists() and not force:
+        return f"skip (cached)  {pdf_path}"
+
+    body, status = _download(entry.url)
+    digest = hashlib.sha256(body).hexdigest()
+    if entry.sha256 and digest != entry.sha256:
+        raise RuntimeError(
+            f"sha256 mismatch: expected {entry.sha256}, got {digest}. "
+            "The CDN file changed -- verify and update sha256 in reports_manifest.yaml.",
+        )
+    pdf_path.parent.mkdir(parents=True, exist_ok=True)
+    pdf_path.write_bytes(body)
+    _write_provenance(
+        cache_dir / "provenance" / f"{entry.key}.json",
+        url=entry.url,
+        status=status,
+        sha256=digest,
+        content_length=len(body),
+    )
+    return f"ok  {len(body):>9,} B  {pdf_path}"
+
+
+def main() -> None:
+    """Parse args and download all (or one) report edition for a source."""
+    parser = argparse.ArgumentParser(description="Download published report PDFs into the data/ cache.")
+    parser.add_argument("--source", default="cfpr", choices=sorted(_SOURCE_LOADERS), help="Report source key.")
+    parser.add_argument("--year", type=int, default=None, help="Fetch only this edition year.")
+    parser.add_argument("--force", action="store_true", help="Re-download even if cached.")
+    args = parser.parse_args()
+
+    entries = _SOURCE_LOADERS[args.source]()
+    if args.year is not None:
+        entries = [e for e in entries if e.meta.doc_id.startswith(f"{args.year}_")]
+        if not entries:
+            raise SystemExit(f"No {args.source} manifest entry for year {args.year}.")
+
+    cache_dir = _REPORTS_ROOT / args.source
+    print(f"Fetching {len(entries)} {args.source} report(s) -> {cache_dir.resolve()}\n")
+
+    failures = 0
+    for entry in entries:
+        try:
+            print(f"  [{entry.key}] {fetch_entry(entry, cache_dir=cache_dir, force=args.force)}")
+        except RuntimeError as exc:
+            failures += 1
+            print(f"  [{entry.key}] FAIL: {exc}")
+
+    print(f"\nDone. {len(entries) - failures}/{len(entries)} succeeded.")
+    if failures:
+        raise SystemExit(1)
+
+
+if __name__ == "__main__":
+    main()
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/scripts__fetch_cpi.py.md b/implementations/getting_started/concierge_agent/context/artifacts/scripts__fetch_cpi.py.md
new file mode 100644
index 0000000..5ef57a0
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/scripts__fetch_cpi.py.md
@@ -0,0 +1,449 @@
+# Source: scripts/fetch_cpi.py
+
+kind: python
+
+```python
+"""Fetch and cache Canada-wide CPI series from Statistics Canada.
+
+This script downloads table 18-10-0004-11 (Consumer Price Index, by geography,
+monthly, percentage change, not seasonally adjusted) from Statistics Canada,
+filters to Canada-wide series, and registers them in a DataService instance for
+validation. The raw data is cached locally by the stats-can library in
+``data/statcan/``.
+
+Run this script once before starting a session or backtest to populate the
+local cache. Re-running is safe and idempotent — the stats-can library skips
+downloads when the cache is current.
+
+Usage
+-----
+    uv run python scripts/fetch_cpi.py
+
+Output
+------
+Prints a summary table of all registered series (series_id, date range,
+number of observations).
+
+Source
+------
+Table 18-10-0004-11, pid=1810000411:
+https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1810000411
+
+Notes
+-----
+All series use a 2002=100 baseline except Internet access services, which uses
+a December 2002=100 (200212=100) baseline as published by Statistics Canada.
+
+The member filter key "Internet access services (200212=100)" includes the
+baseline annotation exactly as it appears in the StatCan CSV.
+"""
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+# Ensure the workspace root is on sys.path when run directly.
+sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
+
+from aieng.forecasting.data import DataService, SeriesMetadata
+from aieng.forecasting.data.adapters import StatCanAdapter
+
+
+# Statistics Canada table: Consumer Price Index, by geography, monthly,
+# percentage change, not seasonally adjusted, provinces, Whitehorse and
+# Yellowknife (2002=100 baseline). pid=1810000411.
+CPI_TABLE_ID = "18-10-0004-11"
+
+# Local directory where stats-can caches downloaded tables.
+CACHE_DIR = Path("data/statcan")
+
+# Canada-wide CPI series to register.
+# Each entry: (series_id, product_group_label, description, units)
+# Labels must match the "Products and product groups" dimension in the StatCan CSV exactly.
+CPI_SERIES: list[tuple[str, str, str, str]] = [
+    # --- Top-level aggregates ---
+    (
+        "cpi_all_items_canada",
+        "All-items",
+        "CPI All-items, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    # --- Food ---
+    (
+        "cpi_food_canada",
+        "Food",
+        "CPI Food, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_food_from_stores_canada",
+        "Food purchased from stores",
+        "CPI Food purchased from stores, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_meat_canada",
+        "Meat",
+        "CPI Meat, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_dairy_eggs_canada",
+        "Dairy products and eggs",
+        "CPI Dairy products and eggs, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_bakery_cereal_canada",
+        "Bakery and cereal products (excluding baby food)",
+        "CPI Bakery and cereal products (excl. baby food), Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_fresh_fruit_canada",
+        "Fresh fruit",
+        "CPI Fresh fruit, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_fruit_preparations_nuts_canada",
+        "Fruit, fruit preparations and nuts",
+        "CPI Fruit, fruit preparations and nuts, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_fresh_vegetables_canada",
+        "Fresh vegetables",
+        "CPI Fresh vegetables, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_vegetables_preparations_canada",
+        "Vegetables and vegetable preparations",
+        "CPI Vegetables and vegetable preparations, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_fish_seafood_canada",
+        "Fish, seafood and other marine products",
+        "CPI Fish, seafood and other marine products, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_other_food_nonalcoholic_canada",
+        "Other food products and non-alcoholic beverages",
+        "CPI Other food products and non-alcoholic beverages, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_restaurants_canada",
+        "Food purchased from restaurants",
+        "CPI Food purchased from restaurants, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    # --- Shelter ---
+    (
+        "cpi_shelter_canada",
+        "Shelter",
+        "CPI Shelter, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_rented_accommodation_canada",
+        "Rented accommodation",
+        "CPI Rented accommodation, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_owned_accommodation_canada",
+        "Owned accommodation",
+        "CPI Owned accommodation, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_homeowners_replacement_canada",
+        "Homeowners' replacement cost",
+        "CPI Homeowners' replacement cost, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_homeowners_insurance_canada",
+        "Homeowners' home and mortgage insurance",
+        "CPI Homeowners' home and mortgage insurance, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_homeowners_maintenance_canada",
+        "Homeowners' maintenance and repairs",
+        "CPI Homeowners' maintenance and repairs, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_water_fuel_electricity_canada",
+        "Water, fuel and electricity",
+        "CPI Water, fuel and electricity, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_electricity_canada",
+        "Electricity",
+        "CPI Electricity, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_natural_gas_canada",
+        "Natural gas",
+        "CPI Natural gas, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_fuel_oil_canada",
+        "Fuel oil and other fuels",
+        "CPI Fuel oil and other fuels, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    # --- Household operations, furnishings and equipment ---
+    (
+        "cpi_household_operations_canada",
+        "Household operations, furnishings and equipment",
+        "CPI Household operations, furnishings and equipment, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_telephone_canada",
+        "Telephone services",
+        "CPI Telephone services, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    # Baseline is December 2002=100 for this series; label includes baseline annotation.
+    (
+        "cpi_internet_canada",
+        "Internet access services (200212=100)",
+        "CPI Internet access services, Canada (Dec 2002=100)",
+        "Index Dec 2002=100",
+    ),
+    (
+        "cpi_household_furnishings_canada",
+        "Household furnishings and equipment",
+        "CPI Household furnishings and equipment, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    # --- Clothing and footwear ---
+    (
+        "cpi_clothing_canada",
+        "Clothing and footwear",
+        "CPI Clothing and footwear, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_womens_clothing_canada",
+        "Women's clothing",
+        "CPI Women's clothing, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_mens_clothing_canada",
+        "Men's clothing",
+        "CPI Men's clothing, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_footwear_canada",
+        "Footwear",
+        "CPI Footwear, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    # --- Transportation ---
+    (
+        "cpi_transportation_canada",
+        "Transportation",
+        "CPI Transportation, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_private_transportation_canada",
+        "Private transportation",
+        "CPI Private transportation, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_vehicle_purchase_canada",
+        "Purchase and leasing of passenger vehicles",
+        "CPI Purchase and leasing of passenger vehicles, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_gasoline_canada",
+        "Gasoline",
+        "CPI Gasoline, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_vehicle_insurance_canada",
+        "Passenger vehicle insurance premiums",
+        "CPI Passenger vehicle insurance premiums, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_public_transportation_canada",
+        "Public transportation",
+        "CPI Public transportation, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    # --- Health and personal care ---
+    (
+        "cpi_health_personal_canada",
+        "Health and personal care",
+        "CPI Health and personal care, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_health_care_canada",
+        "Health care",
+        "CPI Health care, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_personal_care_canada",
+        "Personal care",
+        "CPI Personal care, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    # --- Recreation, education and reading ---
+    (
+        "cpi_recreation_canada",
+        "Recreation, education and reading",
+        "CPI Recreation, education and reading, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_recreation_only_canada",
+        "Recreation",
+        "CPI Recreation, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_education_reading_canada",
+        "Education and reading",
+        "CPI Education and reading, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    # --- Alcoholic beverages, tobacco products and recreational cannabis ---
+    (
+        "cpi_alcoholic_tobacco_canada",
+        "Alcoholic beverages, tobacco products and recreational cannabis",
+        "CPI Alcoholic beverages, tobacco and cannabis, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_alcoholic_beverages_canada",
+        "Alcoholic beverages",
+        "CPI Alcoholic beverages, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_tobacco_canada",
+        "Tobacco products and smokers' supplies",
+        "CPI Tobacco products and smokers' supplies, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    # --- Special aggregates ---
+    (
+        "cpi_ex_food_canada",
+        "All-items excluding food",
+        "CPI All-items excluding food, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_ex_food_energy_canada",
+        "All-items excluding food and energy",
+        "CPI All-items excluding food and energy, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_ex_energy_canada",
+        "All-items excluding energy",
+        "CPI All-items excluding energy, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_ex_gasoline_canada",
+        "All-items excluding gasoline",
+        "CPI All-items excluding gasoline, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+    (
+        "cpi_energy_canada",
+        "Energy",
+        "CPI Energy, Canada (2002=100)",
+        "Index 2002=100",
+    ),
+]
+
+
+def build_data_service() -> DataService:
+    """Build and populate a DataService with Canada-wide CPI series.
+
+    Returns
+    -------
+    DataService
+        DataService instance with all CPI series registered.
+    """
+    svc = DataService()
+    CACHE_DIR.mkdir(parents=True, exist_ok=True)
+
+    print(f"Fetching StatCan table {CPI_TABLE_ID} → cache: {CACHE_DIR.resolve()}")
+    print()
+
+    succeeded = 0
+    failed = 0
+
+    for series_id, product_group, description, units in CPI_SERIES:
+        adapter = StatCanAdapter(
+            table_id=CPI_TABLE_ID,
+            member_filter={
+                "GEO": "Canada",
+                "Products and product groups": product_group,
+            },
+            cache_dir=CACHE_DIR,
+        )
+        metadata = SeriesMetadata(
+            series_id=series_id,
+            description=description,
+            source="StatCan",
+            units=units,
+            frequency="MS",
+            table_id=CPI_TABLE_ID,
+        )
+        try:
+            svc.register(series_id, adapter, metadata)
+            succeeded += 1
+        except Exception as exc:
+            print(f"  [WARN] Failed to register {series_id!r}: {exc}")
+            failed += 1
+
+    print(f"Registered {succeeded} series ({failed} failed).")
+    return svc
+
+
+def main() -> None:
+    """Fetch CPI data and print a summary."""
+    svc = build_data_service()
+
+    print()
+    summary = svc.summary()
+    if summary.empty:
+        print("No series registered.")
+        return
+
+    # Format for display.
+    summary["start"] = summary["start"].dt.strftime("%Y-%m")
+    summary["end"] = summary["end"].dt.strftime("%Y-%m")
+    print(summary.to_string(index=False))
+
+
+if __name__ == "__main__":
+    main()
+```
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/scripts__fetch_fred.py.md b/implementations/getting_started/concierge_agent/context/artifacts/scripts__fetch_fred.py.md
new file mode 100644
index 0000000..a8e3b6d
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/scripts__fetch_fred.py.md
@@ -0,0 +1,192 @@
+# Source: scripts/fetch_fred.py
+
+kind: python
+
+```python
+"""Populate the local FRED cache with series used by the CFPR experiment.
+
+Each FRED series in ``FRED_SERIES`` below is fetched from the FRED REST API
+and written to ``data/fred/{fred_id}.parquet``.  Subsequent calls to
+:class:`~aieng.forecasting.data.adapters.FREDAdapter` read directly from
+those parquet files — no further network access is required.
+
+Re-running the script is idempotent: any series already cached is re-read
+from disk and re-validated.  Pass ``--refresh`` to force a fresh download.
+
+**Prerequisite:** set ``FRED_API_KEY`` in your environment or in the
+repo-root ``.env`` file.  A free key is available at
+https://fred.stlouisfed.org/docs/api/api_key.html.
+
+Usage
+-----
+::
+
+    uv run python scripts/fetch_fred.py
+    uv run python scripts/fetch_fred.py --refresh
+"""
+
+from __future__ import annotations
+
+import argparse
+import sys
+from pathlib import Path
+
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+sys.path.insert(0, str(REPO_ROOT))
+
+from dotenv import load_dotenv
+
+
+load_dotenv(REPO_ROOT / ".env")
+
+from aieng.forecasting.data import DataService, SeriesMetadata
+from aieng.forecasting.data.adapters import FREDAdapter
+
+
+DEFAULT_CACHE_DIR = REPO_ROOT / "data" / "fred"
+
+
+# ---------------------------------------------------------------------------
+# FRED series catalogue for food price forecasting
+#
+# Each entry: (series_id, fred_series_id, description, units)
+#
+# Rationale for inclusion:
+#   - US food CPI sub-indices: US prices transmit to Canadian food costs
+#     through trade and supply chains, especially for commodities.
+#   - Canadian 10-year bond yield: measures cost of capital and credit
+#     conditions affecting food production and distribution.
+#   - Canada/US exchange rate: direct pass-through to import food prices.
+#   - Canada unemployment rate: labour-market covariate for the BoC
+#     rate-decision experiment (implementations/boc_rate_decisions/).
+#
+# All series below are published at monthly (MS) frequency on FRED, which
+# matches the Statistics Canada food CPI target frequency.  Daily series
+# (e.g. VXO, VIXCLS) are intentionally excluded here — the ``FREDAdapter``
+# does not resample, so mixing frequencies silently breaks the covariate
+# alignment inside Darts models.
+# ---------------------------------------------------------------------------
+
+FRED_SERIES: list[tuple[str, str, str, str]] = [
+    (
+        "fred_us_cpi_food_at_home",
+        "CPIFABSL",
+        "US CPI: Food at Home, All Urban Consumers (1982-84=100)",
+        "Index 1982-84=100",
+    ),
+    (
+        "fred_us_cpi_meats_poultry_fish_eggs",
+        "CUSR0000SAF112",
+        "US CPI: Meats, Poultry, Fish, and Eggs, All Urban Consumers (1982-84=100)",
+        "Index 1982-84=100",
+    ),
+    (
+        "fred_us_cpi_fruits_vegetables",
+        "CUSR0000SAF113",
+        "US CPI: Fruits and Vegetables, All Urban Consumers (1982-84=100)",
+        "Index 1982-84=100",
+    ),
+    (
+        "fred_canada_10yr_bond_yield",
+        "IRLTLT01CAM156N",
+        "Canada Long-Term Government Bond Yields: 10-Year (% per annum)",
+        "Percent per annum",
+    ),
+    (
+        "fred_canada_us_exchange_rate",
+        "EXCAUS",
+        "Canada / US Foreign Exchange Rate (CAD per 1 USD, monthly average)",
+        "CAD per USD",
+    ),
+    (
+        "fred_canada_unemployment_rate",
+        "LRUNTTTTCAM156S",
+        "Unemployment Rate: Total, All Persons for Canada (seasonally adjusted, monthly)",
+        "Percent",
+    ),
+]
+
+
+def build_data_service(cache_dir: Path, refresh: bool) -> DataService:
+    """Fetch/validate every catalogued FRED series and register it in a DataService.
+
+    Parameters
+    ----------
+    cache_dir : Path
+        Directory where parquet files are written/read.
+    refresh : bool
+        If ``True``, bypass any existing cache files and re-download.
+
+    Returns
+    -------
+    DataService
+        Populated with all successfully fetched FRED series.
+    """
+    svc = DataService()
+    print(f"Populating FRED cache at {cache_dir}")
+    print(f"  refresh={refresh}")
+    print()
+
+    succeeded = 0
+    failed = 0
+
+    for series_id, fred_id, description, units in FRED_SERIES:
+        adapter = FREDAdapter(fred_id, cache_dir=cache_dir, refresh=refresh)
+        metadata = SeriesMetadata(
+            series_id=series_id,
+            description=description,
+            source=f"FRED ({fred_id})",
+            units=units,
+            frequency="MS",
+        )
+        cached = adapter.cache_path is not None and adapter.cache_path.exists()  # checked before register(), which may write the cache
+        try:
+            svc.register(series_id, adapter, metadata)
+            succeeded += 1
+            marker = "cache" if cached and not refresh else "fetched"
+            print(f"  [{marker:>7}] {series_id:<42} ({fred_id})")
+        except Exception as exc:
+            failed += 1
+            print(f"  [ failed] {series_id:<42} ({fred_id}): {exc}")
+
+    print()
+    print(f"Registered {succeeded} series ({failed} failed).")
+    return svc
+
+
+def _parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument(
+        "--refresh",
+        action="store_true",
+        help="Force re-download of every series, overwriting the cache.",
+    )
+    parser.add_argument(
+        "--cache-dir",
+        type=Path,
+        default=DEFAULT_CACHE_DIR,
+        help=f"Destination directory for parquet cache (default: {DEFAULT_CACHE_DIR}).",
+    )
+    return parser.parse_args()
+
+
+def main() -> None:
+    """CLI entry point: populate the FRED cache and print a summary."""
+    args = _parse_args()
+    svc = build_data_service(args.cache_dir, args.refresh)
+
+    print()
+    summary = svc.summary()
+    if summary.empty:
+        print("No series registered.")
+        return
+
+    summary["start"] = summary["start"].dt.strftime("%Y-%m")
+    summary["end"] = summary["end"].dt.strftime("%Y-%m")
+    print(summary.to_string(index=False))
+
+
+if __name__ == "__main__":
+    main()
+```
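Review note on `scripts/fetch_fred.py`: the catalogue comment excludes daily series because `FREDAdapter` does not resample. If a daily series were ever needed as a covariate, it would first have to be aggregated to month-start frequency. The sketch below shows one way to do that with the standard library only. The helper name `to_month_start` and the `(date, value)` tuple layout are assumptions made for illustration and are not part of the repo; with pandas the same aggregation would typically be a `resample("MS").mean()` call.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import fmean


def to_month_start(daily: list[tuple[date, float]]) -> list[tuple[date, float]]:
    """Average daily observations into one value per month, stamped on day 1.

    Hypothetical helper: the input layout is an assumption, not the adapter's.
    """
    buckets: dict[date, list[float]] = defaultdict(list)
    for day, value in daily:
        # Every observation in a month shares the same month-start key.
        buckets[day.replace(day=1)].append(value)
    # Sort by month so the output is a chronological monthly series.
    return [(month, fmean(values)) for month, values in sorted(buckets.items())]


# Usage: 60 consecutive days starting 2024-01-01 with values 0, 1, 2, ...
start = date(2024, 1, 1)
daily_series = [(start + timedelta(days=i), float(i)) for i in range(60)]
monthly_series = to_month_start(daily_series)
print(monthly_series)
```

Taking the monthly mean is only one choice of aggregation; a last-observation or end-of-month value may suit some covariates better, so the aggregation rule should be decided per series.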
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/scripts__fetch_sp500_market.py.md b/implementations/getting_started/concierge_agent/context/artifacts/scripts__fetch_sp500_market.py.md
new file mode 100644
index 0000000..95155d0
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/scripts__fetch_sp500_market.py.md
@@ -0,0 +1,85 @@
+# Source: scripts/fetch_sp500_market.py
+
+kind: python
+
+```python
+"""Populate / refresh the local Yahoo market caches for the S&P 500 use case.
+
+Fetches the daily bars the S&P 500 implementation reads from Yahoo Finance —
+the ``^GSPC`` index (with same-day open) plus the ``^VIX`` and ``^IXIC``
+covariates — and writes them to ``data/yahoo/`` and ``data/yfinance/``.  The
+FRED covariates are warmed separately by ``scripts/fetch_fred.py``.
+
+Run this once (with ``--refresh``) before working with the 2025 / 2026 spec
+windows: the bundled caches may only reach an earlier date, and the
+cutoff-aware evaluation windows require coverage through the present.
+
+Re-running is idempotent; ``--refresh`` forces a fresh download.
+
+Usage
+-----
+::
+
+    uv run python scripts/fetch_sp500_market.py --refresh
+"""
+
+from __future__ import annotations
+
+import argparse
+import sys
+from pathlib import Path
+
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+sys.path.insert(0, str(REPO_ROOT))
+sys.path.insert(0, str(REPO_ROOT / "implementations"))
+
+from aieng.forecasting.data.adapters import YFinanceDailyAdapter
+from sp500_forecasting.data import (
+    DEFAULT_CACHE_FILE,
+    DEFAULT_YAHOO_CACHE_DIR,
+    NASDAQ_TICKER,
+    SP500_TICKER,
+    VIX_TICKER,
+    YahooFinanceDailyAdapter,
+)
+
+
+_START = "2016-01-01"
+
+
+def main() -> None:
+    """Fetch ^GSPC, ^VIX and ^IXIC daily bars to the local caches."""
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("--refresh", action="store_true", help="Force a fresh download, overwriting the cache.")
+    parser.add_argument("--start", default=_START, help=f"History start date (default {_START}).")
+    args = parser.parse_args()
+
+    print(f"Refreshing Yahoo caches (refresh={args.refresh}, start={args.start}) …\n")
+
+    # ^GSPC index (adjusted close + same-day open) → data/yahoo/sp500_gspc.parquet
+    gspc = YahooFinanceDailyAdapter(
+        SP500_TICKER, start=args.start, end=None, cache_path=DEFAULT_CACHE_FILE, refresh=args.refresh
+    ).fetch()
+    print(
+        f"  {SP500_TICKER:8} {len(gspc):>5} rows  {gspc['timestamp'].min().date()} → {gspc['timestamp'].max().date()}"
+    )
+
+    # ^VIX and ^IXIC covariates → data/yfinance/{ticker}.parquet
+    for ticker in (VIX_TICKER, NASDAQ_TICKER):
+        df = YFinanceDailyAdapter(
+            ticker,
+            field="Adj Close",
+            start=args.start,
+            end=None,
+            cache_dir=DEFAULT_YAHOO_CACHE_DIR,
+            refresh=args.refresh,
+        ).fetch()
+        print(f"  {ticker:8} {len(df):>5} rows  {df['timestamp'].min().date()} → {df['timestamp'].max().date()}")
+
+    print("\nDone. FRED covariates are warmed separately via scripts/fetch_fred.py.")
+
+
+if __name__ == "__main__":
+    main()
+```
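Review note on `scripts/fetch_sp500_market.py`: the docstring warns that bundled caches may stop before the dates the 2025 / 2026 spec windows need. A small guard along the following lines could make that failure mode explicit before a backtest starts. This is a sketch under stated assumptions: `missing_coverage` is a hypothetical helper, and it works on a plain list of cached dates rather than on the adapter's actual return type.

```python
from datetime import date


def missing_coverage(cached_days: list[date], required_end: date) -> int:
    """Return how many calendar days the cache falls short of ``required_end``.

    Hypothetical helper for illustration; returns 0 when the cache already
    reaches (or passes) the required date.
    """
    if not cached_days:
        raise ValueError("cache is empty; run the fetch script with --refresh")
    gap_days = (required_end - max(cached_days)).days
    return max(gap_days, 0)


# Usage: a cache ending on 2024-12-31 checked against a window ending 2025-01-10.
cached = [date(2024, 12, 30), date(2024, 12, 31)]
shortfall = missing_coverage(cached, date(2025, 1, 10))
print(f"cache is {shortfall} calendar days short")
```

Because markets are closed on weekends and holidays, a shortfall of a few calendar days does not necessarily mean data is missing; a caller would likely compare against the last expected trading day instead.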
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/scripts__fetch_wti.py.md b/implementations/getting_started/concierge_agent/context/artifacts/scripts__fetch_wti.py.md
new file mode 100644
index 0000000..ab376fc
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/artifacts/scripts__fetch_wti.py.md
@@ -0,0 +1,51 @@
+# Source: scripts/fetch_wti.py
+
+kind: python
+
+```python
+"""Fetch and cache WTI Crude Oil daily price history from Yahoo Finance.
+
+Downloads the continuous front-month WTI futures contract (``CL=F``) via
+yfinance and stores it as ``data/yfinance/cl_f_adj_close_1d.parquet``.
+The local parquet cache is what :func:`~energy_oil_forecasting.data.build_wti_service`
+reads; running this script once before a notebook session avoids live yfinance
+requests during forecasting or backtesting.
+
+Usage
+-----
+    uv run python scripts/fetch_wti.py
+
+The script is idempotent and safe to re-run — it overwrites the cache with a
+fresh download each time (``refresh=True``).
+"""
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
+
+from aieng.forecasting.data.adapters.yfinance import YFinanceDailyAdapter
+
+
+CACHE_DIR = Path("data/yfinance")
+TICKER = "CL=F"
+HISTORY_START = "2004-01-01"
+
+
+def main() -> None:
+    """Fetch WTI history and print a brief summary."""
+    CACHE_DIR.mkdir(parents=True, exist_ok=True)
+    adapter = YFinanceDailyAdapter(ticker=TICKER, start=HISTORY_START, cache_dir=CACHE_DIR, refresh=True)
+    print(f"Fetching {TICKER} (Adj Close) → {adapter.cache_path}")
+    df = adapter.fetch()
+    print(f"  {len(df):,} trading days  |  {df['timestamp'].min().date()} → {df['timestamp'].max().date()}")
+    print(f"  Latest close: ${df['value'].iloc[-1]:.2f}")
+    print("Done.")
+
+
+if __name__ == "__main__":
+    main()
+```
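Review note on `scripts/fetch_wti.py`: a relative `Path("data/yfinance")` resolves against the current working directory, so the cache location depends on where the script is launched from, whereas `scripts/fetch_fred.py` anchors its cache to `REPO_ROOT`. The sketch below shows the same anchoring as a standalone function. `repo_cache_dir` is a hypothetical name used only for illustration, and the `/repo/...` path in the usage lines is made up.

```python
from pathlib import Path


def repo_cache_dir(script_path: Path, *parts: str) -> Path:
    """Build a cache directory under the repo root, independent of the cwd.

    Assumes the script lives one level below the repo root (e.g. ``scripts/``),
    which is why ``parents[1]`` is used.
    """
    repo_root = script_path.resolve().parents[1]
    return repo_root.joinpath(*parts)


# Usage: inside a script this would be called with Path(__file__).
example = repo_cache_dir(Path("/repo/scripts/fetch_wti.py"), "data", "yfinance")
print(example)
```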
diff --git a/implementations/getting_started/concierge_agent/context/catalog.yaml b/implementations/getting_started/concierge_agent/context/catalog.yaml
new file mode 100644
index 0000000..d033995
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/context/catalog.yaml
@@ -0,0 +1,2883 @@
+source_url: https://github.com/VectorInstitute/agentic-forecasting
+git_ref: d4a22d05c7a9e86763a8d9b5e2891bcc24eda429
+branch: main
+built_at: '2026-06-25T15:12:49+00:00'
+ingest_source: /home/coder/agentic-forecasting
+entry_count: 195
+entries:
+- path: AGENTS.md
+  kind: markdown
+  domain: docs
+  summary: AGENTS.md
+  symbols: []
+  sections:
+  - AGENTS.md
+  - How to use this file
+  - Project documentation
+  - Documentation is part of every change (hard rule)
+  - planning-docs/
+  - README files
+  - Development conventions
+  - Data cache
+  - Model selection
+  - Code quality (not on commit)
+  - Test philosophy
+  chars: 6702
+  artifact: artifacts/AGENTS.md.md
+- path: README.md
+  kind: markdown
+  domain: docs
+  summary: Agentic Forecasting
+  symbols: []
+  sections:
+  - Agentic Forecasting
+  - What's here
+  - Two ways to use a forecaster
+  - Reference implementations
+  - Time Series Data sources
+  - FRED API key
+  - Repository layout
+  - Setup
+  - Coder Workspaces
+  - Verify your environment first
+  - Populate the data cache
+  - Build the E2B sandbox image (agentic implementations only)
+  - Core concepts
+  - Extending the foundation
+  - Code quality
+  - Documentation
+  chars: 14710
+  artifact: artifacts/README.md.md
+- path: aieng-forecasting/aieng/forecasting/__init__.py
+  kind: python
+  domain: core.root
+  summary: "Agentic Forecasting \u2014 data service and evaluation harness."
+  symbols:
+  - ADVANCED_MODEL
+  - DEFAULT_MODEL
+  - LITE_MODEL
+  sections: []
+  chars: 294
+  artifact: artifacts/aieng-forecasting__aieng__forecasting____init__.py.md
+- path: aieng-forecasting/aieng/forecasting/data/__init__.py
+  kind: python
+  domain: core.data
+  summary: 'Data service: adapters, series store, and cutoff enforcement.'
+  symbols:
+  - DataService
+  - ForecastContext
+  - SeriesMetadata
+  - SeriesRecord
+  sections: []
+  chars: 427
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__data____init__.py.md
+- path: aieng-forecasting/aieng/forecasting/data/adapters/__init__.py
+  kind: python
+  domain: core.data
+  summary: Adapter implementations for ingesting data into the SeriesStore.
+  symbols:
+  - BaseAdapter
+  - FREDAdapter
+  - StatCanAdapter
+  - YFinanceDailyAdapter
+  sections: []
+  chars: 521
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__data__adapters____init__.py.md
+- path: aieng-forecasting/aieng/forecasting/data/adapters/base.py
+  kind: python
+  domain: core.data
+  summary: Base adapter protocol for data ingestion.
+  symbols:
+  - BaseAdapter
+  sections: []
+  chars: 2094
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__data__adapters__base.py.md
+- path: aieng-forecasting/aieng/forecasting/data/adapters/fred.py
+  kind: python
+  domain: core.data
+  summary: FRED (Federal Reserve Economic Data) adapter for the SeriesStore.
+  symbols:
+  - FREDAdapter
+  sections: []
+  chars: 7400
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__data__adapters__fred.py.md
+- path: aieng-forecasting/aieng/forecasting/data/adapters/statcan.py
+  kind: python
+  domain: core.data
+  summary: Statistics Canada adapter using the stats-can library.
+  symbols:
+  - StatCanAdapter
+  sections: []
+  chars: 8934
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__data__adapters__statcan.py.md
+- path: aieng-forecasting/aieng/forecasting/data/adapters/yfinance.py
+  kind: python
+  domain: core.data
+  summary: Yahoo Finance adapter for daily market series.
+  symbols:
+  - YFinanceDailyConfig
+  - YFinanceDailyAdapter
+  - YFinanceDailyAdapter
+  - YFinanceDailyConfig
+  - YFinanceField
+  - YFinanceInterval
+  sections: []
+  chars: 13674
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__data__adapters__yfinance.py.md
+- path: aieng-forecasting/aieng/forecasting/data/context.py
+  kind: python
+  domain: core.data
+  summary: 'ForecastContext: the predictor-facing, cutoff-scoped data view.'
+  symbols:
+  - ForecastContext
+  sections: []
+  chars: 4947
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__data__context.py.md
+- path: aieng-forecasting/aieng/forecasting/data/cutoff.py
+  kind: python
+  domain: core.data
+  summary: Information cutoff enforcement.
+  symbols:
+  - CutoffEnforcer
+  sections: []
+  chars: 2581
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__data__cutoff.py.md
+- path: aieng-forecasting/aieng/forecasting/data/models.py
+  kind: python
+  domain: core.data
+  summary: Pydantic models for the data service layer.
+  symbols:
+  - SeriesRecord
+  - SeriesMetadata
+  sections: []
+  chars: 2728
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__data__models.py.md
+- path: aieng-forecasting/aieng/forecasting/data/service.py
+  kind: python
+  domain: core.data
+  summary: 'DataService: registration and management of time series data.'
+  symbols:
+  - DataService
+  sections: []
+  chars: 7060
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__data__service.py.md
+- path: aieng-forecasting/aieng/forecasting/data/store.py
+  kind: python
+  domain: core.data
+  summary: In-memory series store.
+  symbols:
+  - SeriesStore
+  sections: []
+  chars: 3768
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__data__store.py.md
+- path: aieng-forecasting/aieng/forecasting/documents/__init__.py
+  kind: python
+  domain: core.documents
+  summary: 'Document extraction: source-agnostic PDF -> full text + cutoff metadata.'
+  symbols:
+  - DocumentMeta
+  - DocumentStore
+  - ExtractedDocument
+  - MIME_PDF
+  - estimate_tokens
+  - extract_document
+  - inject_pdf_parts
+  - pdf_bytes_to_content_part
+  - pdf_to_content_part
+  sections: []
+  chars: 1310
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__documents____init__.py.md
+- path: aieng-forecasting/aieng/forecasting/documents/extract.py
+  kind: python
+  domain: core.documents
+  summary: Document text extraction.
+  symbols:
+  - extract_document
+  sections: []
+  chars: 3698
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__documents__extract.py.md
+- path: aieng-forecasting/aieng/forecasting/documents/models.py
+  kind: python
+  domain: core.documents
+  summary: Pydantic models for extracted documents.
+  symbols:
+  - DocumentMeta
+  - estimate_tokens
+  - ExtractedDocument
+  sections: []
+  chars: 3581
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__documents__models.py.md
+- path: aieng-forecasting/aieng/forecasting/documents/pdf_upload.py
+  kind: python
+  domain: core.documents
+  summary: PDF-to-message-part conversion for native document ingestion.
+  symbols:
+  - pdf_bytes_to_content_part
+  - pdf_to_content_part
+  - inject_pdf_parts
+  - MIME_PDF
+  - inject_pdf_parts
+  - pdf_bytes_to_content_part
+  - pdf_to_content_part
+  sections: []
+  chars: 7397
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__documents__pdf_upload.py.md
+- path: aieng-forecasting/aieng/forecasting/documents/store.py
+  kind: python
+  domain: core.documents
+  summary: Cutoff-aware in-memory store for extracted documents.
+  symbols:
+  - DocumentStore
+  - DocumentStore
+  sections: []
+  chars: 7587
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__documents__store.py.md
+- path: aieng-forecasting/aieng/forecasting/evaluation/__init__.py
+  kind: python
+  domain: core.evaluation
+  summary: 'Evaluation harness: forecasting tasks, prediction payloads, and scoring.'
+  symbols:
+  - DEFAULT_STORE_DIR
+  - BacktestResult
+  - BacktestSpec
+  - BinaryForecast
+  - CategoricalForecast
+  - ContinuousForecast
+  - EvalBudgetExceededError
+  - EvalResult
+  - EvalSpec
+  - EvalTracker
+  - ForecastingTask
+  - MultiTargetBacktestSpec
+  - MultiTargetEvalSpec
+  - Prediction
+  - Predictor
+  - STANDARD_QUANTILES
+  - TaskCategory
+  - backtest
+  - cached_backtest
+  - cached_multi_backtest
+  - compute_brier_score
+  - compute_rps
+  - describe_spec
+  - describe_task
+  - evaluate
+  - load_backtest_result
+  - load_multi_backtest_results
+  - multi_backtest
+  - multi_evaluate
+  - save_backtest_result
+  sections: []
+  chars: 2037
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__evaluation____init__.py.md
+- path: aieng-forecasting/aieng/forecasting/evaluation/artifacts.py
+  kind: python
+  domain: core.evaluation
+  summary: Persist backtest and eval results to a filesystem artefact store.
+  symbols:
+  - save_backtest_result
+  - load_backtest_result
+  - cached_backtest
+  - save_multi_backtest_results
+  - load_multi_backtest_results
+  - cached_multi_backtest
+  - save_eval_result
+  - save_multi_eval_results
+  sections: []
+  chars: 15131
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__evaluation__artifacts.py.md
+- path: aieng-forecasting/aieng/forecasting/evaluation/backtest.py
+  kind: python
+  domain: core.evaluation
+  summary: BacktestSpec, BacktestResult, and the backtest() harness.
+  symbols:
+  - BacktestSpec
+  - BacktestResult
+  - compute_brier_score
+  - compute_rps
+  - run_eval_loop
+  - backtest
+  - MultiTargetBacktestSpec
+  - multi_backtest
+  sections: []
+  chars: 32734
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__evaluation__backtest.py.md
+- path: aieng-forecasting/aieng/forecasting/evaluation/describe.py
+  kind: python
+  domain: core.evaluation
+  summary: Human-readable descriptions of forecasting tasks and specs.
+  symbols:
+  - describe_task
+  - describe_spec
+  sections: []
+  chars: 7042
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__evaluation__describe.py.md
+- path: aieng-forecasting/aieng/forecasting/evaluation/eval.py
+  kind: python
+  domain: core.evaluation
+  summary: EvalSpec, EvalResult, EvalTracker, and the evaluate() harness.
+  symbols:
+  - EvalBudgetExceededError
+  - EvalSpec
+  - EvalResult
+  - EvalTracker
+  - evaluate
+  - MultiTargetEvalSpec
+  - multi_evaluate
+  sections: []
+  chars: 25346
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__evaluation__eval.py.md
+- path: aieng-forecasting/aieng/forecasting/evaluation/langfuse_traces.py
+  kind: python
+  domain: core.evaluation
+  summary: 'Langfuse trace-evaluation plumbing: stamp forecasts, fetch traces, push
+    scores.'
+  symbols:
+  - stamp_forecast_on_trace
+  - read_forecasts_from_trace
+  - trace_has_forecast
+  - fetch_trace_with_wait
+  - list_trace_ids
+  - push_trace_score
+  - flush_scores
+  - FORECAST_OBSERVATION_NAME
+  - FORECAST_TRACE_OUTPUT_KEY
+  - fetch_trace_with_wait
+  - flush_scores
+  - list_trace_ids
+  - push_trace_score
+  - read_forecasts_from_trace
+  - stamp_forecast_on_trace
+  - trace_has_forecast
+  sections: []
+  chars: 12009
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__evaluation__langfuse_traces.py.md
+- path: aieng-forecasting/aieng/forecasting/evaluation/prediction.py
+  kind: python
+  domain: core.evaluation
+  summary: Prediction payload types and the Prediction metadata wrapper.
+  symbols:
+  - ContinuousForecast
+  - BinaryForecast
+  - CategoricalForecast
+  - Prediction
+  sections: []
+  chars: 7884
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__evaluation__prediction.py.md
+- path: aieng-forecasting/aieng/forecasting/evaluation/predictor.py
+  kind: python
+  domain: core.evaluation
+  summary: "Predictor ABC \u2014 the interface all forecasting models must implement."
+  symbols:
+  - Predictor
+  sections: []
+  chars: 5577
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__evaluation__predictor.py.md
+- path: aieng-forecasting/aieng/forecasting/evaluation/task.py
+  kind: python
+  domain: core.evaluation
+  summary: 'ForecastingTask: defines a prediction problem against the data service.'
+  symbols:
+  - TaskCategory
+  - ForecastingTask
+  sections: []
+  chars: 9402
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__evaluation__task.py.md
+- path: aieng-forecasting/aieng/forecasting/langfuse_tracing.py
+  kind: python
+  domain: core.root
+  summary: Langfuse-oriented tracing bootstrap for LiteLLM and Google ADK.
+  symbols:
+  - _LangfuseTracingBootstrap
+  - init_langfuse_tracing
+  - print_langfuse_trace_url
+  sections: []
+  chars: 6632
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__langfuse_tracing.py.md
+- path: aieng-forecasting/aieng/forecasting/methods/README.md
+  kind: markdown
+  domain: core.methods
+  summary: Methods
+  symbols: []
+  sections:
+  - Methods
+  - What belongs here
+  - What does NOT belong here
+  - Import patterns
+  - Current contents
+  - Baselines
+  - Numerical
+  - LLM Processes
+  - Agentic
+  chars: 7424
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__methods__README.md.md
+- path: aieng-forecasting/aieng/forecasting/methods/__init__.py
+  kind: python
+  domain: core.methods
+  summary: Reference predictor implementations for ``aieng.forecasting``.
+  symbols:
+  - BinaryProbabilityLLMPredictor
+  - BinaryProbabilityLLMPredictorConfig
+  - CategoricalFrequencyPredictor
+  - CategoricalProbabilityLLMPredictor
+  - CategoricalProbabilityLLMPredictorConfig
+  - DartsAutoARIMAPredictor
+  - DartsExponentialSmoothingPredictor
+  - DartsKalmanForecasterPredictor
+  - DartsLightGBMPredictor
+  - DartsLinearRegressionPredictor
+  - HistoricalFrequencyPredictor
+  - LastValuePredictor
+  - QuantileGridLLMPredictor
+  - QuantileGridLLMPredictorConfig
+  - SampledTrajectoryLLMPredictor
+  - SampledTrajectoryLLMPredictorConfig
+  sections: []
+  chars: 3470
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__methods____init__.py.md
+- path: aieng-forecasting/aieng/forecasting/methods/agentic/__init__.py
+  kind: python
+  domain: core.methods
+  summary: ADK-based agentic predictors.
+  symbols: []
+  sections: []
+  chars: 4216
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__methods__agentic____init__.py.md
+- path: aieng-forecasting/aieng/forecasting/methods/agentic/adaptive_skill.py
+  kind: python
+  domain: core.methods
+  summary: Generic adaptive skill infrastructure for learnable agent strategies.
+  symbols:
+  - AdaptiveSkillState
+  - AdaptiveSkillStore
+  sections: []
+  chars: 8583
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__adaptive_skill.py.md
+- path: aieng-forecasting/aieng/forecasting/methods/agentic/adk_runner.py
+  kind: python
+  domain: core.methods
+  summary: 'General-purpose ADK runner: text-in / text-out over ``InMemoryRunner``.'
+  symbols:
+  - AdkTextRunnerConfig
+  - AdkTextRunner
+  sections: []
+  chars: 15228
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__adk_runner.py.md
+- path: aieng-forecasting/aieng/forecasting/methods/agentic/agent_factory.py
+  kind: python
+  domain: core.methods
+  summary: Factory functions for building Google ADK agents for forecasting.
+  symbols:
+  - _LiteLLMNoiseFilter
+  - ContextRetrievalConfig
+  - CodeExecutionConfig
+  - AgentConfig
+  - build_adk_agent
+  sections: []
+  chars: 24986
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__agent_factory.py.md
+- path: aieng-forecasting/aieng/forecasting/methods/agentic/curriculum.py
+  kind: python
+  domain: core.methods
+  summary: Curriculum assembly utilities for adaptive agent training.
+  symbols:
+  - format_backtest_report
+  - load_context_documents
+  - build_curriculum_prompt
+  sections: []
+  chars: 20460
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__curriculum.py.md
+- path: aieng-forecasting/aieng/forecasting/methods/agentic/forecast_tool.py
+  kind: python
+  domain: core.methods
+  summary: A conventional ADK function tool that runs a forecasting model on demand.
+  symbols:
+  - ForecastTool
+  sections: []
+  chars: 10814
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__forecast_tool.py.md
+- path: aieng-forecasting/aieng/forecasting/methods/agentic/outputs.py
+  kind: python
+  domain: core.methods
+  summary: Output schemas for agentic forecasting.
+  symbols:
+  - AgentForecastOutput
+  - AgentQuantileForecast
+  - ContinuousAgentHorizonForecast
+  - ContinuousAgentForecastOutput
+  - DiscreteAgentForecastOutput
+  - AgentCategoryProbability
+  - CategoricalAgentForecastOutput
+  sections: []
+  chars: 25341
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__outputs.py.md
+- path: aieng-forecasting/aieng/forecasting/methods/agentic/predictor.py
+  kind: python
+  domain: core.methods
+  summary: Predictor that uses an ADK agent for forecasting.
+  symbols:
+  - ForecastPromptBuilder
+  - AgentPredictor
+  sections: []
+  chars: 13655
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__predictor.py.md
+- path: aieng-forecasting/aieng/forecasting/methods/baselines/__init__.py
+  kind: python
+  domain: core.methods
+  summary: Baseline predictor implementations.
+  symbols:
+  - CategoricalFrequencyPredictor
+  - HistoricalFrequencyPredictor
+  - LastValuePredictor
+  sections: []
+  chars: 534
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__methods__baselines____init__.py.md
+- path: aieng-forecasting/aieng/forecasting/methods/baselines/categorical_frequency.py
+  kind: python
+  domain: core.methods
+  summary: "Categorical-frequency predictor \u2014 the floor baseline for ordinal\
+    \ tasks."
+  symbols:
+  - CategoricalFrequencyPredictor
+  sections: []
+  chars: 5867
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__methods__baselines__categorical_frequency.py.md
+- path: aieng-forecasting/aieng/forecasting/methods/baselines/historical_frequency.py
+  kind: python
+  domain: core.methods
+  summary: "Historical-frequency predictor \u2014 the floor baseline for binary-event\
+    \ tasks."
+  symbols:
+  - HistoricalFrequencyPredictor
+  sections: []
+  chars: 4586
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__methods__baselines__historical_frequency.py.md
+- path: aieng-forecasting/aieng/forecasting/methods/baselines/naive.py
+  kind: python
+  domain: core.methods
+  summary: "Naive last-value predictor \u2014 the floor baseline for any continuous\
+    \ forecasting task."
+  symbols:
+  - LastValuePredictor
+  sections: []
+  chars: 6138
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__methods__baselines__naive.py.md
+- path: aieng-forecasting/aieng/forecasting/methods/llm_processes/__init__.py
+  kind: python
+  domain: core.methods
+  summary: LLM-process predictor implementations.
+  symbols:
+  - BinaryProbabilityLLMPredictor
+  - BinaryProbabilityLLMPredictorConfig
+  - CategoricalProbabilityLLMPredictor
+  - CategoricalProbabilityLLMPredictorConfig
+  - SampledTrajectoryLLMPredictor
+  - SampledTrajectoryLLMPredictorConfig
+  - QuantileGridLLMPredictor
+  - QuantileGridLLMPredictorConfig
+  - LLMPredictor
+  - LLMPredictorConfig
+  sections: []
+  chars: 3300
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes____init__.py.md
+- path: aieng-forecasting/aieng/forecasting/methods/llm_processes/_client.py
+  kind: python
+  domain: core.methods
+  summary: Shared LiteLLM call seam for all ``llm_processes`` predictors.
+  symbols:
+  - bootstrap_litellm
+  - langfuse_observe
+  - current_trace_info
+  - trace_url_for
+  - set_current_trace_name
+  - make_json_schema_response_format
+  - strip_markdown_fence
+  - sample_n_async
+  - run_async
+  sections: []
+  chars: 18293
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes___client.py.md
+- path: aieng-forecasting/aieng/forecasting/methods/llm_processes/base.py
+  kind: python
+  domain: core.methods
+  summary: Abstract base class and shared config for LLM-process predictors.
+  symbols:
+  - LLMPredictorConfig
+  - serialize_history
+  - build_covariate_block
+  - get_history_and_meta
+  - fetch_report_docs
+  - build_report_preamble
+  - apply_report_context
+  - LLMPredictor
+  sections: []
+  chars: 19537
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__base.py.md
+- path: aieng-forecasting/aieng/forecasting/methods/llm_processes/binary_probability.py
+  kind: python
+  domain: core.methods
+  summary: "BinaryProbabilityLLMPredictor \u2014 direct probability elicitation for\
+    \ binary events."
+  symbols:
+  - BinaryProbabilityLLMPredictorConfig
+  - _BinaryProbability
+  - BinaryProbabilityLLMPredictor
+  - BinaryProbabilityLLMPredictor
+  - BinaryProbabilityLLMPredictorConfig
+  sections: []
+  chars: 13182
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__binary_probability.py.md
+- path: aieng-forecasting/aieng/forecasting/methods/llm_processes/categorical_probability.py
+  kind: python
+  domain: core.methods
+  summary: "CategoricalProbabilityLLMPredictor \u2014 direct categorical distribution\
+    \ elicitation."
+  symbols:
+  - CategoricalProbabilityLLMPredictorConfig
+  - _CategoryProbability
+  - _CategoricalDistribution
+  - serialize_categorical_history
+  - CategoricalProbabilityLLMPredictor
+  - CategoricalProbabilityLLMPredictor
+  - CategoricalProbabilityLLMPredictorConfig
+  - serialize_categorical_history
+  sections: []
+  chars: 19430
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__categorical_probability.py.md
+- path: aieng-forecasting/aieng/forecasting/methods/llm_processes/point_intervals.py
+  kind: python
+  domain: core.methods
+  summary: Design placeholder for point-plus-interval LLM forecasting.
+  symbols: []
+  sections: []
+  chars: 1354
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__point_intervals.py.md
+- path: aieng-forecasting/aieng/forecasting/methods/llm_processes/quantile_grid.py
+  kind: python
+  domain: core.methods
+  summary: "QuantileGridLLMPredictor \u2014 one-shot quantile forecaster."
+  symbols:
+  - QuantileGridLLMPredictorConfig
+  - _QuantileStep
+  - _QuantileTrajectory
+  - QuantileGridLLMPredictor
+  - QuantileGridLLMPredictor
+  - QuantileGridLLMPredictorConfig
+  sections: []
+  chars: 12304
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__quantile_grid.py.md
+- path: aieng-forecasting/aieng/forecasting/methods/llm_processes/sampled_trajectory.py
+  kind: python
+  domain: core.methods
+  summary: "SampledTrajectoryLLMPredictor \u2014 sample-based quantile forecaster."
+  symbols:
+  - SampledTrajectoryLLMPredictorConfig
+  - _Trajectory
+  - SampledTrajectoryLLMPredictor
+  sections: []
+  chars: 17095
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__sampled_trajectory.py.md
+- path: aieng-forecasting/aieng/forecasting/methods/numerical/__init__.py
+  kind: python
+  domain: core.methods
+  summary: Numerical forecasting predictor implementations.
+  symbols:
+  - DartsAutoARIMAPredictor
+  - DartsExponentialSmoothingPredictor
+  - DartsKalmanForecasterPredictor
+  - DartsLightGBMPredictor
+  - DartsLinearRegressionPredictor
+  sections: []
+  chars: 747
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__methods__numerical____init__.py.md
+- path: aieng-forecasting/aieng/forecasting/methods/numerical/darts_arima.py
+  kind: python
+  domain: core.methods
+  summary: "Darts AutoARIMA predictor \u2014 probabilistic forecast via Monte Carlo\
+    \ sampling."
+  symbols:
+  - DartsAutoARIMAPredictor
+  sections: []
+  chars: 5294
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__methods__numerical__darts_arima.py.md
+- path: aieng-forecasting/aieng/forecasting/methods/numerical/darts_classical.py
+  kind: python
+  domain: core.methods
+  summary: "Fast classical Darts predictors \u2014 Exponential Smoothing and Kalman\
+    \ filter."
+  symbols:
+  - DartsExponentialSmoothingPredictor
+  - DartsKalmanForecasterPredictor
+  sections: []
+  chars: 8365
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__methods__numerical__darts_classical.py.md
+- path: aieng-forecasting/aieng/forecasting/methods/numerical/darts_regression.py
+  kind: python
+  domain: core.methods
+  summary: "Darts regression-model predictors \u2014 LinearRegression and LightGBM."
+  symbols:
+  - _DartsRegressionModel
+  - DartsLinearRegressionPredictor
+  - DartsLightGBMPredictor
+  sections: []
+  chars: 15304
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__methods__numerical__darts_regression.py.md
+- path: aieng-forecasting/aieng/forecasting/models.py
+  kind: python
+  domain: core.root
+  summary: Canonical proxy model identifiers used across the project.
+  symbols:
+  - ADVANCED_MODEL
+  - DEFAULT_MODEL
+  - LITE_MODEL
+  sections: []
+  chars: 1315
+  artifact: artifacts/aieng-forecasting__aieng__forecasting__models.py.md
+- path: docs/adk-skills-guide.md
+  kind: markdown
+  domain: docs
+  summary: "ADK Skills and Code Execution \u2014 How-To for This Repo"
+  symbols: []
+  sections:
+  - "ADK Skills and Code Execution \u2014 How-To for This Repo"
+  - 1. The three ways to extend an agent
+  - 2. Code execution is E2B-only
+  - 3. How ADK skills work
+  - 4. The design rules
+  - "Rule 1 \u2014 Don't attach a skill that has no files in `references/`, `assets/`,\
+    \ or `scripts/`."
+  - "Rule 2 \u2014 If a skill has references but no scripts, say so in the prompt."
+  - "Rule 3 \u2014 Keep the SKILL.md body minimal."
+  - 5. Worked examples in the repo
+  - "Read-only skills \u2014 `energy_oil_forecasting/analyst_agent/skills/`"
+  - "Adaptive skills \u2014 a learnable strategy"
+  - 6. Checklist for adding a skill
+  - 7. Current status
+  chars: 10079
+  artifact: artifacts/docs__adk-skills-guide.md.md
+- path: implementations/README.md
+  kind: markdown
+  domain: impl.README.md
+  summary: implementations
+  symbols: []
+  sections:
+  - implementations
+  - Directory layout
+  - Relationship to `aieng-forecasting`
+  - Adding a new use case
+  chars: 3644
+  artifact: artifacts/implementations__README.md.md
+- path: implementations/__init__.py
+  kind: python
+  domain: impl.__init__.py
+  summary: Namespace package root for per-topic experiment notebooks and helpers.
+  symbols: []
+  sections: []
+  chars: 144
+  artifact: artifacts/implementations____init__.py.md
+- path: implementations/boc_rate_decisions/01_boc_data_exploration.ipynb
+  kind: notebook
+  domain: impl.boc_rate_decisions
+  summary: "BoC Rate Decisions \u2014 Data Exploration & Problem Framing"
+  symbols: []
+  sections:
+  - "BoC Rate Decisions \u2014 Data Exploration & Problem Framing"
+  chars: 18051
+  artifact: artifacts/implementations__boc_rate_decisions__01_boc_data_exploration.ipynb.md
+- path: implementations/boc_rate_decisions/02_boc_rate_direction_experiment.ipynb
+  kind: notebook
+  domain: impl.boc_rate_decisions
+  summary: "BoC Rate Decisions \u2014 3-Way Direction Prediction Experiment"
+  symbols: []
+  sections:
+  - "BoC Rate Decisions \u2014 3-Way Direction Prediction Experiment"
+  - Viewing other meetings
+  chars: 27955
+  artifact: artifacts/implementations__boc_rate_decisions__02_boc_rate_direction_experiment.ipynb.md
+- path: implementations/boc_rate_decisions/03_rationale_alignment.ipynb
+  kind: notebook
+  domain: impl.boc_rate_decisions
+  summary: "BoC \u2014 Rationale-alignment evaluation (LLM-as-a-judge, on the side)"
+  symbols: []
+  sections:
+  - "BoC \u2014 Rationale-alignment evaluation (LLM-as-a-judge, on the side)"
+  chars: 11377
+  artifact: artifacts/implementations__boc_rate_decisions__03_rationale_alignment.ipynb.md
+- path: implementations/boc_rate_decisions/99_starter_agent.ipynb
+  kind: notebook
+  domain: impl.boc_rate_decisions
+  summary: "Bank of Canada Rate Decisions \u2014 Your Starter Agent"
+  symbols: []
+  sections:
+  - "Bank of Canada Rate Decisions \u2014 Your Starter Agent"
+  chars: 8070
+  artifact: artifacts/implementations__boc_rate_decisions__99_starter_agent.ipynb.md
+- path: implementations/boc_rate_decisions/README.md
+  kind: markdown
+  domain: impl.boc_rate_decisions
+  summary: BoC Rate Decisions
+  symbols: []
+  sections:
+  - BoC Rate Decisions
+  - Prediction task
+  - Data
+  - Predictors
+  - Reference specs
+  - Module layout
+  - Notebooks
+  - Roadmap
+  - Implemented since the first draft
+  - "Remaining extensions \u2014 good participant projects"
+  chars: 15157
+  artifact: artifacts/implementations__boc_rate_decisions__README.md.md
+- path: implementations/boc_rate_decisions/__init__.py
+  kind: python
+  domain: impl.boc_rate_decisions
+  summary: "Bank of Canada rate-decision experiment \u2014 helper modules and reference\
+    \ implementations."
+  symbols: []
+  sections: []
+  chars: 1091
+  artifact: artifacts/implementations__boc_rate_decisions____init__.py.md
+- path: implementations/boc_rate_decisions/analysis.py
+  kind: python
+  domain: impl.boc_rate_decisions
+  summary: Analysis helpers for the BoC rate-decision experiment.
+  symbols:
+  - predictions_to_frame
+  - score_leaderboard
+  - one_vs_rest_frame
+  - calibration_table
+  - yearly_outcome_table
+  - rationales_table
+  - PanelRow
+  - DecisionPanel
+  - decision_panel_data
+  - panel_rationales_markdown
+  - DecisionPanel
+  - PanelRow
+  - calibration_table
+  - decision_panel_data
+  - one_vs_rest_frame
+  - panel_rationales_markdown
+  - predictions_to_frame
+  - rationales_table
+  - score_leaderboard
+  - yearly_outcome_table
+  sections: []
+  chars: 19911
+  artifact: artifacts/implementations__boc_rate_decisions__analysis.py.md
+- path: implementations/boc_rate_decisions/analyst_agent/__init__.py
+  kind: python
+  domain: impl.boc_rate_decisions
+  summary: Bank of Canada policy analyst agent module.
+  symbols:
+  - BoCDecisionPromptBuilder
+  - build_boc_agent_predictor
+  - build_boc_basic_config
+  - build_boc_news_config
+  sections: []
+  chars: 610
+  artifact: artifacts/implementations__boc_rate_decisions__analyst_agent____init__.py.md
+- path: implementations/boc_rate_decisions/analyst_agent/agent.py
+  kind: python
+  domain: impl.boc_rate_decisions
+  summary: Bank of Canada policy analyst agent configuration and prompt builder.
+  symbols:
+  - BoCDecisionPromptBuilder
+  - build_boc_basic_config
+  - build_boc_news_config
+  - build_boc_agent_predictor
+  sections: []
+  chars: 16347
+  artifact: artifacts/implementations__boc_rate_decisions__analyst_agent__agent.py.md
+- path: implementations/boc_rate_decisions/data.py
+  kind: python
+  domain: impl.boc_rate_decisions
+  summary: Data-service setup for the Bank of Canada rate-decision experiment.
+  symbols:
+  - load_meeting_schedule
+  - load_unscheduled_announcements
+  - derive_rate_decision_directions
+  - derive_rate_cut_events
+  - validate_schedule_against_rate_series
+  - BoCDecisionEventAdapter
+  - build_boc_service
+  - BOND_YIELD_2YR_SERIES_ID
+  - CPI_SERIES_ID
+  - CPI_TABLE_ID
+  - DEFAULT_FRED_CACHE_DIR
+  - DEFAULT_STATCAN_CACHE_DIR
+  - DIRECTION_SERIES_ID
+  - DIRECTION_TASK_CATEGORIES
+  - MEETING_SCHEDULE_PATH
+  - RATES_TABLE_ID
+  - RATE_CUT_EVENT_SERIES_ID
+  - TARGET_RATE_SERIES_ID
+  - UNEMPLOYMENT_FRED_ID
+  - UNEMPLOYMENT_SERIES_ID
+  - BoCDecisionEventAdapter
+  - build_boc_service
+  - derive_rate_decision_directions
+  - derive_rate_cut_events
+  - load_meeting_schedule
+  - load_unscheduled_announcements
+  - validate_schedule_against_rate_series
+  sections: []
+  chars: 20204
+  artifact: artifacts/implementations__boc_rate_decisions__data.py.md
+- path: implementations/boc_rate_decisions/plots.py
+  kind: python
+  domain: impl.boc_rate_decisions
+  summary: Plotting helpers for the BoC rate-decision experiment.
+  symbols:
+  - plot_policy_rate_with_decisions
+  - plot_reliability_curve
+  - plot_decision_timeline
+  - plot_probability_timeline
+  - plot_decision_panel
+  - CATEGORY_COLORS
+  - DEFAULT_PREDICTOR_PALETTE
+  - plot_decision_panel
+  - plot_decision_timeline
+  - plot_policy_rate_with_decisions
+  - plot_probability_timeline
+  - plot_reliability_curve
+  sections: []
+  chars: 22061
+  artifact: artifacts/implementations__boc_rate_decisions__plots.py.md
+- path: implementations/boc_rate_decisions/predictors/__init__.py
+  kind: python
+  domain: impl.boc_rate_decisions
+  summary: Tuned predictor recipes for the BoC rate-decision experiment.
+  symbols:
+  - BoCLogisticPredictor
+  - build_llmp_binary
+  - build_llmp_direction
+  sections: []
+  chars: 1151
+  artifact: artifacts/implementations__boc_rate_decisions__predictors____init__.py.md
+- path: implementations/boc_rate_decisions/predictors/llmp_binary.py
+  kind: python
+  domain: impl.boc_rate_decisions
+  summary: 'BoC rate-cut recipe: binary-probability LLMP.'
+  symbols:
+  - build_llmp_binary
+  - build_llmp_binary
+  sections: []
+  chars: 4511
+  artifact: artifacts/implementations__boc_rate_decisions__predictors__llmp_binary.py.md
+- path: implementations/boc_rate_decisions/predictors/llmp_direction.py
+  kind: python
+  domain: impl.boc_rate_decisions
+  summary: 'BoC rate-direction recipe: categorical-probability LLMP.'
+  symbols:
+  - build_llmp_direction
+  - build_llmp_direction
+  sections: []
+  chars: 4818
+  artifact: artifacts/implementations__boc_rate_decisions__predictors__llmp_direction.py.md
+- path: implementations/boc_rate_decisions/predictors/logistic_baseline.py
+  kind: python
+  domain: impl.boc_rate_decisions
+  summary: Logistic-regression conventional baseline for BoC rate-decision prediction.
+  symbols:
+  - build_feature_row
+  - BoCLogisticPredictor
+  - FEATURE_NAMES
+  - BoCLogisticPredictor
+  - build_feature_row
+  sections: []
+  chars: 15030
+  artifact: artifacts/implementations__boc_rate_decisions__predictors__logistic_baseline.py.md
+- path: implementations/boc_rate_decisions/press_releases.py
+  kind: python
+  domain: impl.boc_rate_decisions
+  summary: Bank of Canada rate-announcement press-release ingestion (use-case glue).
+  symbols:
+  - press_release_url
+  - PressReleaseEntry
+  - press_release_entries
+  - extract_press_release_html
+  - write_artifact
+  - PressReleaseStore
+  - BOC_PRESS_RELEASE_SOURCE
+  - DEFAULT_PRESS_RELEASE_CACHE_DIR
+  - PressReleaseEntry
+  - PressReleaseStore
+  - extract_press_release_html
+  - press_release_entries
+  - press_release_url
+  - write_artifact
+  sections: []
+  chars: 12251
+  artifact: artifacts/implementations__boc_rate_decisions__press_releases.py.md
+- path: implementations/boc_rate_decisions/rationale_eval.py
+  kind: python
+  domain: impl.boc_rate_decisions
+  summary: LLM-as-a-judge rationale-alignment evaluator for BoC forecasts (trace-driven).
+  symbols:
+  - AlignmentVerdict
+  - judge_rationale_alignment
+  - resolve_trace_url
+  - trace_ids_from_result
+  - evaluate_trace_alignment
+  - evaluate_result_alignment
+  - AlignmentVerdict
+  - evaluate_result_alignment
+  - evaluate_trace_alignment
+  - judge_rationale_alignment
+  - resolve_trace_url
+  - trace_ids_from_result
+  sections: []
+  chars: 18090
+  artifact: artifacts/implementations__boc_rate_decisions__rationale_eval.py.md
+- path: implementations/boc_rate_decisions/specs/boc_rate_cut_smoke.yaml
+  kind: yaml
+  domain: impl.boc_rate_decisions
+  summary: "BoC Rate Cut Spec \u2014 binary reference (cut vs no cut), 3 origins"
+  symbols: []
+  sections:
+  - "BoC Rate Cut Spec \u2014 binary reference (cut vs no cut), 3 origins"
+  - '# The compact binary (Brier-scored) reference for the notebook 02 warm-up:'
+  - the K=2 corner of the categorical machinery, kept deliberately small so it
+  - demonstrates the binary payload + scoring format in the fewest moving parts.
+  - '# The three origins span both outcome classes: a hold (2024-04-10), the'
+  - first cut of the 2024 easing cycle (2024-06-05), and a mid-cycle cut
+  - "(2024-09-04) \u2014 enough to exercise scoring and plotting paths without"
+  - burning tokens on a long run.
+  - 'One origin per meeting: announcement_date - 1 day.'
+  chars: 1579
+  artifact: artifacts/implementations__boc_rate_decisions__specs__boc_rate_cut_smoke.yaml.md
+- path: implementations/boc_rate_decisions/specs/boc_rate_direction_backtest.yaml
+  kind: yaml
+  domain: impl.boc_rate_decisions
+  summary: "BoC Rate Direction Backtest Spec \u2014 2010-2024 fixed announcement dates,\
+    \ T-28"
+  symbols: []
+  sections:
+  - "BoC Rate Direction Backtest Spec \u2014 2010-2024 fixed announcement dates, T-28"
+  - '# Canonical 3-way ordered-categorical backtest: at each forecast origin'
+  - (28 days before a BoC fixed announcement date), predict whether the Bank
+  - will cut, hold, or hike its target for the overnight rate at the
+  - 'announcement. Scored with RPS (task payload_type: categorical).'
+  - '# Why a 28-day lead: on the eve of a decision the 2-year GoC yield has'
+  - already absorbed the market consensus, so a T-1 forecast mostly reads
+  - market pricing off a curve. Four weeks out the decision is genuinely
+  - "uncertain \u2014 the interesting skill is anticipating cycle turns before the"
+  - market converges. The eve-of-decision variant is kept as a small diagnostic in
+  - boc_rate_direction_eve_smoke.yaml; comparing the two shows how skill
+  - concentrates as information arrives.
+  - "# Origins are EXPLICIT because BoC meetings are an irregular calendar \u2014"
+  - 8 dates per year that no pandas frequency alias can generate. Each origin
+  - is announcement_date - 28 days, so the 28-day horizon resolves exactly on
+  - the announcement. The minimum gap between scheduled meetings is 35 days,
+  - so the previous meeting's outcome is always visible at the origin.
+  - The origin list is derived from ../meeting_schedule.yaml; a use-case test
+  - asserts the two files stay consistent.
+  - '# Origin count : 120 (8 per year, 2010-2024)'
+  - 'Coverage     : spans the 2010 + 2017-18 + 2022-23 hike cycles and the'
+  - 2015 + 2020 + 2024 cut cycles.
+  - 'Warmup       : 8 events (the full 2009 easing cycle is visible history at'
+  - every origin).
+  - 'One origin per meeting: announcement_date - 28 days.'
+  chars: 7750
+  artifact: artifacts/implementations__boc_rate_decisions__specs__boc_rate_direction_backtest.yaml.md
+- path: implementations/boc_rate_decisions/specs/boc_rate_direction_eval.yaml
+  kind: yaml
+  domain: impl.boc_rate_decisions
+  summary: "BoC Rate Direction Eval Spec \u2014 2025-2026 protected window, T-28"
+  symbols: []
+  sections:
+  - "BoC Rate Direction Eval Spec \u2014 2025-2026 protected window, T-28"
+  - '# Held-out, budget-controlled evaluation over the 12 BoC fixed announcement'
+  - dates from January 2025 through June 2026. All 12 are resolved as of
+  - June 2026 (the June 10 announcement resolves the final origin).
+  - "# Origins sit 28 days before each announcement \u2014 the same lead as the"
+  - "canonical backtest \u2014 so the eval measures anticipation, not eve-of-decision"
+  - market reading. This window contains cuts and holds but NO hikes, so it
+  - cannot reward hike discrimination. That is acceptable for this protected
+  - slice because RPS handles absent categories while still scoring calibrated
+  - mass over the full ordered support.
+  - '# Use this spec sparingly. max_runs: 5 limits how many times a participant'
+  - may run evaluate() against it, reducing the risk of inadvertently
+  - over-fitting to the held-out window.
+  - '# Origins are explicit (announcement_date - 28 days) because BoC meetings are'
+  - an irregular calendar; derived from ../meeting_schedule.yaml.
+  - 'One origin per meeting: announcement_date - 28 days.'
+  chars: 2813
+  artifact: artifacts/implementations__boc_rate_decisions__specs__boc_rate_direction_eval.yaml.md
+- path: implementations/boc_rate_decisions/specs/boc_rate_direction_eve_smoke.yaml
+  kind: yaml
+  domain: impl.boc_rate_decisions
+  summary: "BoC Rate Direction EVE Smoke Spec \u2014 T-1 diagnostic, 3 origins"
+  symbols: []
+  sections:
+  - "BoC Rate Direction EVE Smoke Spec \u2014 T-1 diagnostic, 3 origins"
+  - '# Eve-of-decision companion to boc_rate_direction_smoke.yaml: same three'
+  - meetings, origins the day before each announcement. Used in notebook 02
+  - "(\xA77) for the cheap lead-time comparison (T-28 vs T-1) \u2014 the eve lead\
+    \ is"
+  - kept only as this small diagnostic, not as a full backtest.
+  - '# The three origins span holds and cuts but no hikes: a hold (2024-04-10), the'
+  - first cut of the 2024 easing cycle (2024-06-05), and a mid-cycle cut
+  - "(2024-09-04) \u2014 enough to exercise categorical scoring and plotting paths"
+  - without burning tokens on a long run.
+  - 'One origin per meeting: announcement_date - 1 day.'
+  chars: 1840
+  artifact: artifacts/implementations__boc_rate_decisions__specs__boc_rate_direction_eve_smoke.yaml.md
+- path: implementations/boc_rate_decisions/specs/boc_rate_direction_smoke.yaml
+  kind: yaml
+  domain: impl.boc_rate_decisions
+  summary: "BoC Rate Direction Smoke Spec \u2014 Fast CI/Testing Backtest, T-28"
+  symbols: []
+  sections:
+  - "BoC Rate Direction Smoke Spec \u2014 Fast CI/Testing Backtest, T-28"
+  - '# Three-origin subset of boc_rate_direction_backtest.yaml for running the full'
+  - notebook pipeline cheaply during development and end-to-end testing.
+  - Use by setting EXPERIMENT_CONFIG = "smoke" in the notebook setup cell.
+  - '# The three origins span holds and cuts but no hikes: a hold (2024-04-10), the'
+  - first cut of the 2024 easing cycle (2024-06-05), and a mid-cycle cut
+  - "(2024-09-04) \u2014 enough to exercise categorical scoring and plotting paths"
+  - without burning tokens on 120 LLM calls. Origins sit 28 days before each
+  - announcement, matching the canonical backtest lead.
+  - 'One origin per meeting: announcement_date - 28 days.'
+  chars: 1993
+  artifact: artifacts/implementations__boc_rate_decisions__specs__boc_rate_direction_smoke.yaml.md
+- path: implementations/boc_rate_decisions/starter_agent/__init__.py
+  kind: python
+  domain: impl.boc_rate_decisions
+  summary: "BoC starter agent \u2014 a fresh, hackable template for your own exploration."
+  symbols:
+  - build_starter_agent_config
+  - build_starter_agent_predictor
+  sections: []
+  chars: 534
+  artifact: artifacts/implementations__boc_rate_decisions__starter_agent____init__.py.md
+- path: implementations/boc_rate_decisions/starter_agent/agent.py
+  kind: python
+  domain: impl.boc_rate_decisions
+  summary: "BoC starter agent \u2014 a fresh, hackable template for your own exploration."
+  symbols:
+  - build_starter_agent_config
+  - _StarterForecastPromptBuilder
+  - build_starter_agent_predictor
+  sections: []
+  chars: 9637
+  artifact: artifacts/implementations__boc_rate_decisions__starter_agent__agent.py.md
+- path: implementations/boc_rate_decisions/starter_agent/skills/code-analysis-playbook/SKILL.md
+  kind: markdown
+  domain: impl.boc_rate_decisions
+  summary: Code-analysis playbook
+  symbols: []
+  sections:
+  - Code-analysis playbook
+  - Where your data lives
+  - Compute before you forecast
+  - Domain focus (edit this for your use case)
+  - Room to grow
+  chars: 2313
+  artifact: artifacts/implementations__boc_rate_decisions__starter_agent__skills__code-analysis-playbook__SKILL.md.md
+- path: implementations/boc_rate_decisions/starter_agent/skills/forecasting/SKILL.md
+  kind: markdown
+  domain: impl.boc_rate_decisions
+  summary: Forecasting skill
+  symbols: []
+  sections:
+  - Forecasting skill
+  - What you'll receive
+  - The output contract
+  - Domain focus (edit this for your use case)
+  - Room to grow
+  chars: 2186
+  artifact: artifacts/implementations__boc_rate_decisions__starter_agent__skills__forecasting__SKILL.md.md
+- path: implementations/boc_rate_decisions/starter_agent/skills/research-playbook/SKILL.md
+  kind: markdown
+  domain: impl.boc_rate_decisions
+  summary: Research playbook
+  symbols: []
+  sections:
+  - Research playbook
+  - The one rule that matters
+  - How to search
+  - Domain focus (edit this for your use case)
+  - Room to grow
+  chars: 1947
+  artifact: artifacts/implementations__boc_rate_decisions__starter_agent__skills__research-playbook__SKILL.md.md
+- path: implementations/energy_oil_forecasting/01_wti_case_study.ipynb
+  kind: notebook
+  domain: impl.energy_oil_forecasting
+  summary: "Oil Prices in 2026 \u2014 A Forecasting Case Study"
+  symbols: []
+  sections:
+  - "Oil Prices in 2026 \u2014 A Forecasting Case Study"
+  - Load WTI price data
+  - What happened?
+  - The Futures Curve
+  chars: 10319
+  artifact: artifacts/implementations__energy_oil_forecasting__01_wti_case_study.ipynb.md
+- path: implementations/energy_oil_forecasting/02_intro_agentic_predictor.ipynb
+  kind: notebook
+  domain: impl.energy_oil_forecasting
+  summary: "WTI Crude Oil Price Forecasting \u2014 Introducing the Agentic Predictor\
+    \ (Notebook 2 of 7)"
+  symbols: []
+  sections:
+  - "WTI Crude Oil Price Forecasting \u2014 Introducing the Agentic Predictor (Notebook\
+    \ 2 of 7)"
+  chars: 16360
+  artifact: artifacts/implementations__energy_oil_forecasting__02_intro_agentic_predictor.ipynb.md
+- path: implementations/energy_oil_forecasting/03_one_agent_three_tasks.ipynb
+  kind: notebook
+  domain: impl.energy_oil_forecasting
+  summary: "WTI Oil Price Forecasting \u2014 One Agent, Three Tasks"
+  symbols: []
+  sections:
+  - "WTI Oil Price Forecasting \u2014 One Agent, Three Tasks"
+  chars: 14481
+  artifact: artifacts/implementations__energy_oil_forecasting__03_one_agent_three_tasks.ipynb.md
+- path: implementations/energy_oil_forecasting/04_systematic_backtest_eval.ipynb
+  kind: notebook
+  domain: impl.energy_oil_forecasting
+  summary: "WTI Crude Oil Price Forecasting \u2014 Stateless Methods: Systematic Backtest\
+    \ (Notebook 4 of 7)"
+  symbols: []
+  sections:
+  - "WTI Crude Oil Price Forecasting \u2014 Stateless Methods: Systematic Backtest\
+    \ (Notebook 4 of 7)"
+  chars: 13343
+  artifact: artifacts/implementations__energy_oil_forecasting__04_systematic_backtest_eval.ipynb.md
+- path: implementations/energy_oil_forecasting/05_adaptive_agent_training.ipynb
+  kind: notebook
+  domain: impl.energy_oil_forecasting
+  summary: "WTI Crude Oil \u2014 Adaptive Agent: Self-Directed Study (Notebook 5 of\
+    \ 7)"
+  symbols: []
+  sections:
+  - "WTI Crude Oil \u2014 Adaptive Agent: Self-Directed Study (Notebook 5 of 7)"
+  - "Task A \u2014 Cross-Period Robustness (2023\u20132024)"
+  - "Task B \u2014 Scope Check and Graduation Attempt"
+  - Strategy state after robustness testing
+  chars: 15882
+  artifact: artifacts/implementations__energy_oil_forecasting__05_adaptive_agent_training.ipynb.md
+- path: implementations/energy_oil_forecasting/05_forecast_tool_demo.ipynb
+  kind: notebook
+  domain: impl.energy_oil_forecasting
+  summary: "WTI Crude Oil \u2014 The Forecast Tool (Notebook 5)"
+  symbols: []
+  sections:
+  - "WTI Crude Oil \u2014 The Forecast Tool (Notebook 5)"
+  chars: 6855
+  artifact: artifacts/implementations__energy_oil_forecasting__05_forecast_tool_demo.ipynb.md
+- path: implementations/energy_oil_forecasting/06_protected_eval.ipynb
+  kind: notebook
+  domain: impl.energy_oil_forecasting
+  summary: "WTI Crude Oil \u2014 Protected Evaluation (Notebook 6 of 7)"
+  symbols: []
+  sections:
+  - "WTI Crude Oil \u2014 Protected Evaluation (Notebook 6 of 7)"
+  chars: 15724
+  artifact: artifacts/implementations__energy_oil_forecasting__06_protected_eval.ipynb.md
+- path: implementations/energy_oil_forecasting/99_starter_agent.ipynb
+  kind: notebook
+  domain: impl.energy_oil_forecasting
+  summary: "WTI Crude Oil \u2014 Your Starter Agent"
+  symbols: []
+  sections:
+  - "WTI Crude Oil \u2014 Your Starter Agent"
+  chars: 7557
+  artifact: artifacts/implementations__energy_oil_forecasting__99_starter_agent.ipynb.md
+- path: implementations/energy_oil_forecasting/README.md
+  kind: markdown
+  domain: impl.energy_oil_forecasting
+  summary: WTI Crude Oil Price Forecasting
+  symbols: []
+  sections:
+  - WTI Crude Oil Price Forecasting
+  - Curriculum Structure
+  - Stateless capability track
+  - Adaptive-agent track
+  - Side demo
+  - Build your own
+  - The Forecasting Tasks
+  - 'Task A: Trajectory Forecast (Track 1)'
+  - 'Task B: Binary Up-shock Probability (Track 1)'
+  - 'Task C: Scenario Analysis (Track 2)'
+  - Module Layout
+  - Agent layering
+  - Data Source & Setup
+  chars: 7707
+  artifact: artifacts/implementations__energy_oil_forecasting__README.md.md
+- path: implementations/energy_oil_forecasting/__init__.py
+  kind: python
+  domain: impl.energy_oil_forecasting
+  summary: __init__.py
+  symbols: []
+  sections: []
+  chars: 91
+  artifact: artifacts/implementations__energy_oil_forecasting____init__.py.md
+- path: implementations/energy_oil_forecasting/adaptive_agent/__init__.py
+  kind: python
+  domain: impl.energy_oil_forecasting
+  summary: Adaptive WTI crude oil analyst agent module.
+  symbols:
+  - WtiAdaptiveForecastPromptBuilder
+  - build_skill_tools
+  - build_wti_adaptive_config
+  - build_wti_adaptive_predictor
+  - compress_history
+  sections: []
+  chars: 788
+  artifact: artifacts/implementations__energy_oil_forecasting__adaptive_agent____init__.py.md
+- path: implementations/energy_oil_forecasting/adaptive_agent/agent.py
+  kind: python
+  domain: impl.energy_oil_forecasting
+  summary: Adaptive WTI crude oil analyst agent.
+  symbols:
+  - WtiAdaptiveForecastPromptBuilder
+  - build_wti_adaptive_config
+  - build_wti_adaptive_predictor
+  sections: []
+  chars: 19884
+  artifact: artifacts/implementations__energy_oil_forecasting__adaptive_agent__agent.py.md
+- path: implementations/energy_oil_forecasting/adaptive_agent/curriculum/snapshot_utils.py
+  kind: python
+  domain: impl.energy_oil_forecasting
+  summary: Skill state snapshot helpers for the adaptive agent training notebooks.
+  symbols:
+  - snapshot_state
+  - restore_state
+  - state_checksum
+  sections: []
+  chars: 3571
+  artifact: artifacts/implementations__energy_oil_forecasting__adaptive_agent__curriculum__snapshot_utils.py.md
+- path: implementations/energy_oil_forecasting/adaptive_agent/skill_state.py
+  kind: python
+  domain: impl.energy_oil_forecasting
+  summary: WTI forecasting strategy state model.
+  symbols:
+  - Observation
+  - Hypothesis
+  - CalibrationCorrection
+  - VersionEntry
+  - WtiStrategyState
+  sections: []
+  chars: 7381
+  artifact: artifacts/implementations__energy_oil_forecasting__adaptive_agent__skill_state.py.md
+- path: implementations/energy_oil_forecasting/adaptive_agent/skill_tools.py
+  kind: python
+  domain: impl.energy_oil_forecasting
+  summary: Mutation tools for the ``wti-strategy`` adaptive skill.
+  symbols:
+  - build_skill_tools
+  sections: []
+  chars: 16387
+  artifact: artifacts/implementations__energy_oil_forecasting__adaptive_agent__skill_tools.py.md
+- path: implementations/energy_oil_forecasting/adaptive_agent/skills/fetch-yfinance/SKILL.md
+  kind: markdown
+  domain: impl.energy_oil_forecasting
+  summary: Fetching market data with yfinance
+  symbols: []
+  sections:
+  - Fetching market data with yfinance
+  - E2B execution model
+  - What this skill provides
+  - Workflow
+  - Common tickers
+  - Gotchas
+  chars: 1873
+  artifact: artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__fetch-yfinance__SKILL.md.md
+- path: implementations/energy_oil_forecasting/adaptive_agent/skills/fetch-yfinance/references/examples.md
+  kind: markdown
+  domain: impl.energy_oil_forecasting
+  summary: 'fetch-yfinance: code examples'
+  symbols: []
+  sections:
+  - 'fetch-yfinance: code examples'
+  - 'Pattern 1: Single ticker, full date range'
+  - 'Pattern 2: Temporal cutoff for backtesting'
+  - 'Apply cutoff: keep only data strictly before as_of'
+  - 'Pattern 3: Multiple tickers'
+  chars: 2471
+  artifact: artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__fetch-yfinance__references__examples.md.md
+- path: implementations/energy_oil_forecasting/adaptive_agent/skills/meta-learning/SKILL.md
+  kind: markdown
+  domain: impl.energy_oil_forecasting
+  summary: 'Meta-learning: strategy update governance'
+  symbols: []
+  sections:
+  - 'Meta-learning: strategy update governance'
+  - The four learning layers
+  - When to update
+  - 'How to update: the tool call sequence'
+  - "Step 1 \u2014 Always: record an observation"
+  - "Step 2 \u2014 If a durable pattern is suspected: open a hypothesis"
+  - "Step 3 \u2014 On each subsequent resolution: update hypothesis counts"
+  - "Step 4 \u2014 When the threshold is reached: graduate to calibration"
+  - "Step 5 \u2014 Rarely: update the approach narrative"
+  - Guarding against over-learning
+  - What NOT to update
+  chars: 5836
+  artifact: artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__meta-learning__SKILL.md.md
+- path: implementations/energy_oil_forecasting/adaptive_agent/skills/trend-projection/SKILL.md
+  kind: markdown
+  domain: impl.energy_oil_forecasting
+  summary: Linear trend projection
+  symbols: []
+  sections:
+  - Linear trend projection
+  - What this skill provides
+  - Typical usage
+  - Key formula
+  - Interval calibration note
+  chars: 1752
+  artifact: artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__trend-projection__SKILL.md.md
+- path: implementations/energy_oil_forecasting/adaptive_agent/skills/trend-projection/references/examples.md
+  kind: markdown
+  domain: impl.energy_oil_forecasting
+  summary: 'trend-projection: code examples'
+  symbols: []
+  sections:
+  - 'trend-projection: code examples'
+  - 'Pattern 1: Fit linear trend and project to horizons'
+  - 'Requires: daily (DataFrame, columns date/close, sorted ascending)'
+  - trend_window (int, from vol-regime Pattern 3)
+  - 'Pattern 2: Calibrated 80% prediction intervals'
+  - 'Requires: projections (dict), residual_std (float), regime (str)'
+  - 'Pattern 3: Plausibility guard'
+  - 'Requires: projections (dict), df (full DataFrame, not just window)'
+  - Full Pipeline Example
+  - "\u2500\u2500 1. Fetch data (fetch-yfinance) \u2500\u2500\u2500\u2500\u2500\u2500\
+    \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\
+    \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\
+    \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500"
+  - "\u2500\u2500 2. Vol regime (vol-regime) \u2500\u2500\u2500\u2500\u2500\u2500\u2500\
+    \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\
+    \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\
+    \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\
+    \u2500\u2500"
+  - "\u2500\u2500 3. Trend projection (trend-projection) \u2500\u2500\u2500\u2500\u2500\
+    \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\
+    \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\
+    \u2500\u2500\u2500\u2500\u2500"
+  chars: 5291
+  artifact: artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__trend-projection__references__examples.md.md
+- path: implementations/energy_oil_forecasting/adaptive_agent/skills/vol-regime/SKILL.md
+  kind: markdown
+  domain: impl.energy_oil_forecasting
+  summary: Volatility regime classification
+  symbols: []
+  sections:
+  - Volatility regime classification
+  - What this skill provides
+  - Typical usage
+  - Regime thresholds (WTI crude oil)
+  - Output of Pattern 3
+  chars: 1693
+  artifact: artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__vol-regime__SKILL.md.md
+- path: implementations/energy_oil_forecasting/adaptive_agent/skills/vol-regime/references/examples.md
+  kind: markdown
+  domain: impl.energy_oil_forecasting
+  summary: 'vol-regime: code examples'
+  symbols: []
+  sections:
+  - 'vol-regime: code examples'
+  - 'Pattern 1: Rolling vol and regime classification'
+  - Use only the daily-frequency portion (drop gaps > 3 days)
+  - 'Pattern 2: Anomaly detection (z-score of last move)'
+  - Add this after Pattern 1 (daily is already defined)
+  - 'Pattern 3: Adaptive trend window'
+  - "Add this after Patterns 1\u20132 (regime and z_score are already defined)"
+  chars: 3089
+  artifact: artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__vol-regime__references__examples.md.md
+- path: implementations/energy_oil_forecasting/adaptive_agent/skills/wti-strategy-trained/SKILL.md
+  kind: markdown
+  domain: impl.energy_oil_forecasting
+  summary: WTI Forecasting Strategy
+  symbols: []
+  sections:
+  - WTI Forecasting Strategy
+  - Approach
+  - Active calibration corrections
+  - Open hypotheses
+  - Observations
+  - Version history
+  chars: 2999
+  artifact: artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__wti-strategy-trained__SKILL.md.md
+- path: implementations/energy_oil_forecasting/adaptive_agent/skills/wti-strategy/SKILL.md
+  kind: markdown
+  domain: impl.energy_oil_forecasting
+  summary: WTI Forecasting Strategy
+  symbols: []
+  sections:
+  - WTI Forecasting Strategy
+  - Approach
+  - Active calibration corrections
+  - Open hypotheses
+  - Observations
+  - Version history
+  chars: 1744
+  artifact: artifacts/implementations__energy_oil_forecasting__adaptive_agent__skills__wti-strategy__SKILL.md.md
+- path: implementations/energy_oil_forecasting/analysis.py
+  kind: python
+  domain: impl.energy_oil_forecasting
+  summary: Analysis helpers for the WTI crude oil experiment.
+  symbols:
+  - rolling_coverage_pct
+  - score_backtest_results
+  - backtest_results_to_frame
+  - trajectory_mae_table
+  - select_top_predictors
+  - backtest_results_to_frame
+  - compute_brier_score
+  - rolling_coverage_pct
+  - score_backtest_results
+  - select_top_predictors
+  - trajectory_mae_table
+  sections: []
+  chars: 6287
+  artifact: artifacts/implementations__energy_oil_forecasting__analysis.py.md
+- path: implementations/energy_oil_forecasting/analyst_agent/__init__.py
+  kind: python
+  domain: impl.energy_oil_forecasting
+  summary: WTI crude oil analyst agent module.
+  symbols:
+  - WtiPriceForecastPromptBuilder
+  - build_wti_agent_predictor
+  - build_wti_basic_config
+  - build_wti_code_exec_config
+  - build_wti_multitask_news_config
+  - build_wti_news_config
+  - build_wti_tool_config
+  - compress_history
+  sections: []
+  chars: 857
+  artifact: artifacts/implementations__energy_oil_forecasting__analyst_agent____init__.py.md
+- path: implementations/energy_oil_forecasting/analyst_agent/agent.py
+  kind: python
+  domain: impl.energy_oil_forecasting
+  summary: WTI crude oil analyst agent configurations and prompt builder.
+  symbols:
+  - compress_history
+  - WtiPriceForecastPromptBuilder
+  - build_wti_basic_config
+  - build_wti_multitask_news_config
+  - build_wti_news_config
+  - build_wti_code_exec_config
+  - build_wti_tool_config
+  - build_wti_agent_predictor
+  sections: []
+  chars: 22599
+  artifact: artifacts/implementations__energy_oil_forecasting__analyst_agent__agent.py.md
+- path: implementations/energy_oil_forecasting/analyst_agent/skills/statistical-analysis/SKILL.md
+  kind: markdown
+  domain: impl.energy_oil_forecasting
+  summary: Statistical analysis skill
+  symbols: []
+  sections:
+  - Statistical analysis skill
+  - Your data universe
+  - What this skill provides
+  - Recommended workflow
+  chars: 2861
+  artifact: artifacts/implementations__energy_oil_forecasting__analyst_agent__skills__statistical-analysis__SKILL.md.md
+- path: implementations/energy_oil_forecasting/analyst_agent/skills/statistical-analysis/references/analysis-patterns.md
+  kind: markdown
+  domain: impl.energy_oil_forecasting
+  summary: "Statistical Analysis \u2014 Code Patterns"
+  symbols: []
+  sections:
+  - "Statistical Analysis \u2014 Code Patterns"
+  - 'Section 0: Working with the Gemini execution environment'
+  - "Parse once \u2014 reference `df` and `daily` in subsequent blocks"
+  - Split daily (recent) vs weekly (older) rows by detecting date gaps > 3 days
+  - 'Pattern 1: Is the current vol regime normal or elevated?'
+  - Assumes `daily` DataFrame is already defined (Section 0)
+  - Assumes `benchmarks` dict is already loaded from wti_benchmarks.json
+  - Rolling 30-day annualised vol
+  - 'Pattern 2: Was the most recent move anomalous?'
+  - Assumes `daily` DataFrame is already defined (Section 0)
+  - 'Pattern 3: How many recent days should I trust for trend estimation?'
+  - Assumes `regime` string and `z_score` float are already defined
+  chars: 6386
+  artifact: artifacts/implementations__energy_oil_forecasting__analyst_agent__skills__statistical-analysis__references__analysis-patterns.md.md
+- path: implementations/energy_oil_forecasting/analyst_agent/skills/trend-projection/SKILL.md
+  kind: markdown
+  domain: impl.energy_oil_forecasting
+  summary: Trend projection skill
+  symbols: []
+  sections:
+  - Trend projection skill
+  - Quick-reference steps
+  chars: 1924
+  artifact: artifacts/implementations__energy_oil_forecasting__analyst_agent__skills__trend-projection__SKILL.md.md
+- path: implementations/energy_oil_forecasting/analyst_agent/skills/trend-projection/references/projection-examples.md
+  kind: markdown
+  domain: impl.energy_oil_forecasting
+  summary: "Trend Projection \u2014 Code Patterns"
+  symbols: []
+  sections:
+  - "Trend Projection \u2014 Code Patterns"
+  - 'Pattern 1: Linear regression trend + residual-based 80% CI'
+  - "\u2500\u2500 1. Parse the CSV payload \u2500\u2500\u2500\u2500\u2500\u2500\u2500\
+    \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\
+    \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\
+    \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500"
+  - Assume `history_csv` is the string value of task_payload["target_history_csv"]
+  - "\u2500\u2500 2. Select the most recent 30 trading days \u2500\u2500\u2500\u2500\
+    \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\
+    \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500"
+  - "\u2500\u2500 3. Fit linear regression \u2500\u2500\u2500\u2500\u2500\u2500\u2500\
+    \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\
+    \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\
+    \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500"
+  - "\u2500\u2500 4. Project to horizons \u2500\u2500\u2500\u2500\u2500\u2500\u2500\
+    \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\
+    \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\
+    \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\
+    \u2500\u2500"
+  - 'Pattern 2: Plausibility guard for trend extrapolation'
+  - "Allow \xB150% of 52-week range as plausible boundary"
+  - 'Pattern 3: Standard quantile grid from point + CI'
+  - 'Derive sigma from the 80% CI: CI_half_width = 1.28 * sigma'
+  - Verify median matches point_forecast
+  - Notes on Gemini code execution limits
+  chars: 4035
+  artifact: artifacts/implementations__energy_oil_forecasting__analyst_agent__skills__trend-projection__references__projection-examples.md.md
+- path: implementations/energy_oil_forecasting/data.py
+  kind: python
+  domain: impl.energy_oil_forecasting
+  summary: Data-service setup for the WTI Crude Oil forecasting experiment.
+  symbols:
+  - naive_utc_now
+  - build_wti_service
+  - DEFAULT_CACHE_DIR
+  - WTI_SERIES_ID
+  - build_wti_service
+  - naive_utc_now
+  sections: []
+  chars: 3494
+  artifact: artifacts/implementations__energy_oil_forecasting__data.py.md
+- path: implementations/energy_oil_forecasting/paths.py
+  kind: python
+  domain: impl.energy_oil_forecasting
+  summary: Shared paths, simulation constants, and colour palette for the energy/oil
+    experiment.
+  symbols:
+  - repo_data_dir
+  - CLR_ACTUAL
+  - CLR_AGENT
+  - CLR_CI_CURR_FILL
+  - CLR_CI_PAST_FILL
+  - CLR_DAY_LINE
+  - CLR_HISTORY
+  - CLR_HIT
+  - CLR_MISS
+  - CLR_PROPHET
+  - DATA_DIR
+  - IRAN_COLOR
+  - PROPHET_SHOCK_TRAJ_CACHE
+  - PROPHET_TRAJ_CACHE
+  - ROLLING_CI_WIDTH
+  - ROLLING_FORECAST_CACHE
+  - ROLLING_HORIZON_DAYS
+  - SCENARIO_CACHE
+  - SCENARIO_ORIGIN
+  - SHOCK_ANALYST_CACHE
+  - SHOCK_CONTEXT_CACHE
+  - SHOCK_HORIZON
+  - SHOCK_ORIGINS
+  - SHOCK_THRESHOLD
+  - SIMULATION_END
+  - SIMULATION_START
+  - TRAJ_AGENT_CACHE
+  - TRAJ_CONTEXT_CACHE
+  - TRAJECTORY_ORIGINS
+  - WARN_COLOR
+  sections: []
+  chars: 3268
+  artifact: artifacts/implementations__energy_oil_forecasting__paths.py.md
+- path: implementations/energy_oil_forecasting/prophet_baseline.py
+  kind: python
+  domain: impl.energy_oil_forecasting
+  summary: Prophet baseline helpers for the WTI crude oil experiment.
+  symbols:
+  - find_nearest_trading_day
+  - price_series_to_prophet_df
+  - compute_rolling_forecasts
+  - load_prophet_trajectories
+  - check_shock_outcome
+  - prophet_prob_shock
+  - ProphetPredictor
+  - wti_series_to_price_df
+  - ProphetPredictor
+  - check_shock_outcome
+  - compute_rolling_forecasts
+  - find_nearest_trading_day
+  - load_prophet_trajectories
+  - prophet_prob_shock
+  - wti_series_to_price_df
+  sections: []
+  chars: 10873
+  artifact: artifacts/implementations__energy_oil_forecasting__prophet_baseline.py.md
+- path: implementations/energy_oil_forecasting/specs/energy_oil_backtest.yaml
+  kind: yaml
+  domain: impl.energy_oil_forecasting
+  summary: "Energy Oil Backtest Spec \u2014 2025 Weekly Rolling Backtest"
+  symbols: []
+  sections:
+  - "Energy Oil Backtest Spec \u2014 2025 Weekly Rolling Backtest"
+  - '# Runs weekly origins across 2025. Stride is 5 business days (weekly).'
+  - 'Target is WTI Crude Oil price (yfinance ticker: CL=F).'
+  - 'Horizons: 5, 10, 21 business days.'
+  - '# Origin count : 51 (weekly in 2025)'
+  - 'Warmup       : 250 trading days (~1 year) of historical prices'
+  chars: 1104
+  artifact: artifacts/implementations__energy_oil_forecasting__specs__energy_oil_backtest.yaml.md
+- path: implementations/energy_oil_forecasting/specs/energy_oil_eval.yaml
+  kind: yaml
+  domain: impl.energy_oil_forecasting
+  summary: "Energy Oil Eval Spec \u2014 2026 Prospective Competition"
+  symbols: []
+  sections:
+  - "Energy Oil Eval Spec \u2014 2026 Prospective Competition"
+  - '# Runs on 8 weekly origins from Feb 2, 2026 to Mar 23, 2026.'
+  - Covers the high-volatility Persian Gulf geopolitical price shock period.
+  - 'Target is WTI Crude Oil price (yfinance ticker: CL=F).'
+  - 'Horizons: 5, 10, 21 business days.'
+  chars: 1019
+  artifact: artifacts/implementations__energy_oil_forecasting__specs__energy_oil_eval.yaml.md
+- path: implementations/energy_oil_forecasting/specs/energy_oil_eval_smoke.yaml
+  kind: yaml
+  domain: impl.energy_oil_forecasting
+  summary: "Energy Oil Eval Smoke Spec \u2014 Fast CI/Testing Evaluation"
+  symbols: []
+  sections:
+  - "Energy Oil Eval Smoke Spec \u2014 Fast CI/Testing Evaluation"
+  - '# Two-origin subset of energy_oil_eval.yaml for running the 2026 protected'
+  - arena cheaply during development and end-to-end testing.
+  - Use by setting SMOKE_TEST = True in the notebook setup cell.
+  - '# Origin count : 2 (vs. 8 in the full eval)'
+  - 'Warmup       : 250 trading days (~1 year) of historical prices'
+  chars: 1084
+  artifact: artifacts/implementations__energy_oil_forecasting__specs__energy_oil_eval_smoke.yaml.md
+- path: implementations/energy_oil_forecasting/specs/energy_oil_smoke.yaml
+  kind: yaml
+  domain: impl.energy_oil_forecasting
+  summary: "Energy Oil Smoke Spec \u2014 Fast CI/Testing Backtest"
+  symbols: []
+  sections:
+  - "Energy Oil Smoke Spec \u2014 Fast CI/Testing Backtest"
+  - '# Two-origin subset of energy_oil_backtest.yaml for running the full'
+  - NB04 pipeline cheaply during development and end-to-end testing.
+  - Use by setting SMOKE_TEST = True in the notebook setup cell.
+  - '# Origin count : 2 (vs. 51 in the full backtest)'
+  - 'Warmup       : 250 trading days (~1 year) of historical prices'
+  chars: 1133
+  artifact: artifacts/implementations__energy_oil_forecasting__specs__energy_oil_smoke.yaml.md
+- path: implementations/energy_oil_forecasting/starter_agent/__init__.py
+  kind: python
+  domain: impl.energy_oil_forecasting
+  summary: "WTI starter agent \u2014 a fresh, hackable template for your own exploration."
+  symbols:
+  - build_starter_agent_config
+  - build_starter_agent_predictor
+  sections: []
+  chars: 542
+  artifact: artifacts/implementations__energy_oil_forecasting__starter_agent____init__.py.md
+- path: implementations/energy_oil_forecasting/starter_agent/agent.py
+  kind: python
+  domain: impl.energy_oil_forecasting
+  summary: "WTI starter agent \u2014 a fresh, hackable template for your own exploration."
+  symbols:
+  - build_starter_agent_config
+  - _StarterForecastPromptBuilder
+  - build_starter_agent_predictor
+  sections: []
+  chars: 9265
+  artifact: artifacts/implementations__energy_oil_forecasting__starter_agent__agent.py.md
+- path: implementations/energy_oil_forecasting/starter_agent/skills/code-analysis-playbook/SKILL.md
+  kind: markdown
+  domain: impl.energy_oil_forecasting
+  summary: Code-analysis playbook
+  symbols: []
+  sections:
+  - Code-analysis playbook
+  - Where your data lives
+  - Compute before you forecast
+  - Domain focus (edit this for your use case)
+  - Room to grow
+  chars: 2093
+  artifact: artifacts/implementations__energy_oil_forecasting__starter_agent__skills__code-analysis-playbook__SKILL.md.md
+- path: implementations/energy_oil_forecasting/starter_agent/skills/forecasting/SKILL.md
+  kind: markdown
+  domain: impl.energy_oil_forecasting
+  summary: Forecasting skill
+  symbols: []
+  sections:
+  - Forecasting skill
+  - What you'll receive
+  - The output contract
+  - Calibration
+  - Domain focus (edit this for your use case)
+  - Room to grow
+  chars: 2295
+  artifact: artifacts/implementations__energy_oil_forecasting__starter_agent__skills__forecasting__SKILL.md.md
+- path: implementations/energy_oil_forecasting/starter_agent/skills/research-playbook/SKILL.md
+  kind: markdown
+  domain: impl.energy_oil_forecasting
+  summary: Research playbook
+  symbols: []
+  sections:
+  - Research playbook
+  - The one rule that matters
+  - How to search
+  - Domain focus (edit this for your use case)
+  - Room to grow
+  chars: 1830
+  artifact: artifacts/implementations__energy_oil_forecasting__starter_agent__skills__research-playbook__SKILL.md.md
+- path: implementations/energy_oil_forecasting/tasks.py
+  kind: python
+  domain: impl.energy_oil_forecasting
+  summary: Task specifications and agent predictor wiring for the WTI experiment.
+  symbols:
+  - WtiMultitaskPromptBuilder
+  - ScenarioCard
+  - ScenarioAgentForecastOutput
+  - build_wti_news_predictor
+  - build_wti_agent_predictor_for_task
+  - TASK_SCENARIOS_SPEC
+  - TASK_SHOCK_SPEC
+  - TASK_SPECS
+  - TASK_TRAJECTORY_SPEC
+  - ScenarioAgentForecastOutput
+  - ScenarioCard
+  - TaskKind
+  - WtiMultitaskPromptBuilder
+  - build_wti_agent_predictor_for_task
+  - build_wti_news_predictor
+  sections: []
+  chars: 8968
+  artifact: artifacts/implementations__energy_oil_forecasting__tasks.py.md
+- path: implementations/energy_oil_forecasting/viz.py
+  kind: python
+  domain: impl.energy_oil_forecasting
+  summary: Plotly visualisation helpers for the WTI crude oil experiment.
+  symbols:
+  - build_forecast_animation
+  - make_context_chart
+  - make_error_timeline
+  - coverage_summary_table
+  - make_coverage_chart
+  - make_punchline_charts
+  - make_futures_curve_chart
+  - export_animation_html
+  - make_trajectory_fan_chart
+  - make_shock_comparison_chart
+  - verdict_label
+  - prob_bar
+  - conf_bar
+  sections: []
+  chars: 43163
+  artifact: artifacts/implementations__energy_oil_forecasting__viz.py.md
+- path: implementations/food_price_forecasting/01_food_data_exploration.ipynb
+  kind: notebook
+  domain: impl.food_price_forecasting
+  summary: "Food Price CPI \u2014 Data Exploration"
+  symbols: []
+  sections:
+  - "Food Price CPI \u2014 Data Exploration"
+  chars: 3591
+  artifact: artifacts/implementations__food_price_forecasting__01_food_data_exploration.ipynb.md
+- path: implementations/food_price_forecasting/02_food_cpi_experiment.ipynb
+  kind: notebook
+  domain: impl.food_price_forecasting
+  summary: "Canada Food CPI \u2014 CFPR Replica Experiment"
+  symbols: []
+  sections:
+  - "Canada Food CPI \u2014 CFPR Replica Experiment"
+  - 8.2 MAPE per category
+  chars: 26434
+  artifact: artifacts/implementations__food_price_forecasting__02_food_cpi_experiment.ipynb.md
+- path: implementations/food_price_forecasting/99_starter_agent.ipynb
+  kind: notebook
+  domain: impl.food_price_forecasting
+  summary: "Food Price (CPI) \u2014 Your Starter Agent"
+  symbols: []
+  sections:
+  - "Food Price (CPI) \u2014 Your Starter Agent"
+  chars: 7733
+  artifact: artifacts/implementations__food_price_forecasting__99_starter_agent.ipynb.md
+- path: implementations/food_price_forecasting/README.md
+  kind: markdown
+  domain: impl.food_price_forecasting
+  summary: Food Price CPI Forecasting
+  symbols: []
+  sections:
+  - Food Price CPI Forecasting
+  - Forecasting task
+  - CFPR methodology
+  - Reference specs
+  - Module layout
+  - Covariates
+  - Artifact storage
+  - Prerequisites
+  - Report context (CFPR PDFs)
+  - 1. download the report PDFs into data/reports/cfpr/ (gitignored)
+  - 2. extract each PDF -> <year>_en.md (full text) + <year>_en.json (metadata)
+  - Notebooks
+  - Key design decisions
+  chars: 10973
+  artifact: artifacts/implementations__food_price_forecasting__README.md.md
+- path: implementations/food_price_forecasting/__init__.py
+  kind: python
+  domain: impl.food_price_forecasting
+  summary: "Canada Food CPI experiment \u2014 helper modules and reference implementations."
+  symbols: []
+  sections: []
+  chars: 727
+  artifact: artifacts/implementations__food_price_forecasting____init__.py.md
+- path: implementations/food_price_forecasting/analysis.py
+  kind: python
+  domain: impl.food_price_forecasting
+  summary: Analysis helpers for the Canada Food CPI experiment.
+  symbols:
+  - predictions_to_dataframe
+  - compute_avgyoy
+  - summarize_crps
+  - compute_ape_long
+  - compute_mape
+  - rationales_table
+  - compute_ape_long
+  - compute_avgyoy
+  - compute_mape
+  - predictions_to_dataframe
+  - rationales_table
+  - summarize_crps
+  sections: []
+  chars: 13091
+  artifact: artifacts/implementations__food_price_forecasting__analysis.py.md
+- path: implementations/food_price_forecasting/data.py
+  kind: python
+  domain: impl.food_price_forecasting
+  summary: Data-service setup for the Canada Food CPI experiment.
+  symbols:
+  - build_food_cpi_service
+  - CATEGORY_LABELS
+  - CFPR_REPORTS_SOURCE
+  - DEFAULT_CACHE_DIR
+  - DEFAULT_REPORTS_DIR
+  - FOOD_CPI_SERIES
+  - STATCAN_TABLE
+  - build_food_cpi_service
+  sections: []
+  chars: 6528
+  artifact: artifacts/implementations__food_price_forecasting__data.py.md
+- path: implementations/food_price_forecasting/plots.py
+  kind: python
+  domain: impl.food_price_forecasting
+  summary: Plotting helpers for the Canada Food CPI experiment.
+  symbols:
+  - plot_trajectory_fan
+  - plot_avgyoy_grid
+  - plot_crps_disaggregated
+  - plot_mape_distribution
+  - plot_mape_by_category
+  - plot_food_cpi_small_multiples
+  - DEFAULT_PREDICTOR_PALETTE
+  - plot_avgyoy_grid
+  - plot_crps_disaggregated
+  - plot_food_cpi_small_multiples
+  - plot_mape_by_category
+  - plot_mape_distribution
+  - plot_trajectory_fan
+  sections: []
+  chars: 19740
+  artifact: artifacts/implementations__food_price_forecasting__plots.py.md
+- path: implementations/food_price_forecasting/predictors/__init__.py
+  kind: python
+  domain: impl.food_price_forecasting
+  summary: Tuned predictor recipes for the Canada Food CPI experiment.
+  symbols:
+  - build_llmp_quantile_grid
+  - build_llmp_sampled_trajectory
+  sections: []
+  chars: 955
+  artifact: artifacts/implementations__food_price_forecasting__predictors____init__.py.md
+- path: implementations/food_price_forecasting/predictors/llmp_quantile_grid.py
+  kind: python
+  domain: impl.food_price_forecasting
+  summary: 'Food CPI recipe: quantile-grid LLMP.'
+  symbols:
+  - build_llmp_quantile_grid
+  - build_llmp_quantile_grid
+  sections: []
+  chars: 5311
+  artifact: artifacts/implementations__food_price_forecasting__predictors__llmp_quantile_grid.py.md
+- path: implementations/food_price_forecasting/predictors/llmp_sampled_trajectory.py
+  kind: python
+  domain: impl.food_price_forecasting
+  summary: 'Food CPI recipe: sampled-trajectory LLMP.'
+  symbols:
+  - build_llmp_sampled_trajectory
+  - build_llmp_sampled_trajectory
+  sections: []
+  chars: 3507
+  artifact: artifacts/implementations__food_price_forecasting__predictors__llmp_sampled_trajectory.py.md
+- path: implementations/food_price_forecasting/reports.py
+  kind: python
+  domain: impl.food_price_forecasting
+  summary: CFPR report acquisition manifest.
+  symbols:
+  - CFPRReportEntry
+  - load_manifest
+  - DEFAULT_REPORTS_CACHE_DIR
+  - REPORTS_MANIFEST_PATH
+  - CFPRReportEntry
+  - load_manifest
+  sections: []
+  chars: 3343
+  artifact: artifacts/implementations__food_price_forecasting__reports.py.md
+- path: implementations/food_price_forecasting/smoke_report.py
+  kind: python
+  domain: impl.food_price_forecasting
+  summary: Plain-text reporting for food CPI agent smoke tests.
+  symbols:
+  - summarize_agent_predictions
+  sections: []
+  chars: 3339
+  artifact: artifacts/implementations__food_price_forecasting__smoke_report.py.md
+- path: implementations/food_price_forecasting/specs/food_cpi_cfpr_backtest.yaml
+  kind: yaml
+  domain: impl.food_price_forecasting
+  summary: 'Reference MultiTargetBacktestSpec: Canada Food CPI, CFPR trajectory replica'
+  symbols: []
+  sections:
+  - 'Reference MultiTargetBacktestSpec: Canada Food CPI, CFPR trajectory replica'
+  - '# This spec is the canonical backtest for the Canada''s Food Price Report'
+  - (CFPR) forecasting exercise.  The CFPR produces, once a year, an
+  - '*average-over-average* YoY CPI change projection for the upcoming calendar'
+  - 'year: "by how much will the average food CPI index in year Y+1 differ'
+  - from the average in year Y?"
+  - '# To compute that quantity from a monthly CPI forecast we need the full'
+  - 12-month trajectory for calendar year Y+1, issued from a July-of-year-Y
+  - 'origin.  Concretely:'
+  - '#   * Origin:   July 1, year Y  (the most recent CPI data available in hand'
+  - is typically May or June Y, so July is the first origin at
+  - which a CFPR-style prediction can be made).
+  - '* Horizons: 6, 7, 8, ..., 17  (January .. December of year Y+1).'
+  - '* Stride:   12 months (one origin per July, annual cadence).'
+  - '# From these 12 forecasts one computes:'
+  - '#     avg/avg YoY = mean(predictions for Jan..Dec of Y+1)'
+  - / mean(actuals for Jan..Dec of Y)
+  - '- 1'
+  - '# Window'
+  - '------'
+  - "Jul 2009 \u2192 Jul 2024 = 16 annual origins.  This spans three distinct"
+  - macro regimes (low-inflation 2010-19, COVID shock 2020-21, food-price
+  - surge/retreat 2021-24).  The last origin (Jul 2024) forecasts out to
+  - Dec 2025, which resolves in the StatCan release of Jan 2026.
+  - '# Predictors'
+  - '----------'
+  - "This is the open backtesting resource \u2014 run it as many times as you"
+  - like to tune predictors and build intuition.  There is no protected eval
+  - 'spec for this experiment: historical LLM/agent scores are upper bounds'
+  - on live performance (see the food CPI experiment notebook).
+  - '# Prerequisites'
+  - '-------------'
+  - 'uv run python scripts/fetch_cpi.py    # StatCan food CPI cache'
+  - "# (FRED covariates are deliberately NOT part of this canonical spec \u2014 see"
+  - '`planning-docs/bootcamp-workplan.md` for the deferred "covariate framing" design'
+  - discussion.  Predictors that want exogenous inputs should be evaluated
+  - on a separate experiment.)
+  - '# Loading'
+  - '-------'
+  - '#   import yaml'
+  chars: 5921
+  artifact: artifacts/implementations__food_price_forecasting__specs__food_cpi_cfpr_backtest.yaml.md
+- path: implementations/food_price_forecasting/specs/food_cpi_recent_backtest.yaml
+  kind: yaml
+  domain: impl.food_price_forecasting
+  summary: 'Mini backtest spec: all 9 targets, recent 6 origins.'
+  symbols: []
+  sections:
+  - 'Mini backtest spec: all 9 targets, recent 6 origins.'
+  - '# All nine food CPI categories, 6 annual July origins (2019-2024).'
+  - "Covers two meaningful macro regimes \u2014 the COVID shock (2020-21) and the"
+  - "food-price surge and retreat (2021-24) \u2014 while keeping total agent calls"
+  - manageable.  Not budget-gated; run freely.
+  - '# Origin count : 6    (Jul 2019, 2020, 2021, 2022, 2023, 2024)'
+  - "Agent calls  : ~54  (9 tasks \xD7 6 origins, all 12 trajectory horizons per call)"
+  - '# Use `food_cpi_cfpr_backtest.yaml` for the full 16-origin canonical backtest.'
+  - '# Prerequisites'
+  - '-------------'
+  - uv run python scripts/fetch_cpi.py
+  - Jul 2019 -> Jul 2024 = 6 annual origins.
+  chars: 3865
+  artifact: artifacts/implementations__food_price_forecasting__specs__food_cpi_recent_backtest.yaml.md
+- path: implementations/food_price_forecasting/specs/food_cpi_single_mini_backtest.yaml
+  kind: yaml
+  domain: impl.food_price_forecasting
+  summary: 'Mini backtest spec: single target, recent 6 origins.'
+  symbols: []
+  sections:
+  - 'Mini backtest spec: single target, recent 6 origins.'
+  - '# One task (bakery & cereal), 6 annual July origins (2019-2024).'
+  - Use this for fast agent development and pipeline smoke-testing.
+  - Not budget-gated; run freely.
+  - '# Origin count : 6   (Jul 2019, 2020, 2021, 2022, 2023, 2024)'
+  - 'Agent calls  : ~6  (one per origin, all 12 trajectory horizons in one call)'
+  - '# Prerequisites'
+  - '-------------'
+  - uv run python scripts/fetch_cpi.py
+  - Jul 2019 -> Jul 2024 = 6 annual origins.
+  chars: 1260
+  artifact: artifacts/implementations__food_price_forecasting__specs__food_cpi_single_mini_backtest.yaml.md
+- path: implementations/food_price_forecasting/starter_agent/__init__.py
+  kind: python
+  domain: impl.food_price_forecasting
+  summary: "Food CPI starter agent \u2014 a fresh, hackable template for your own\
+    \ exploration."
+  symbols:
+  - FoodCpiStarterPromptBuilder
+  - build_starter_agent_config
+  - build_starter_agent_predictor
+  sections: []
+  chars: 651
+  artifact: artifacts/implementations__food_price_forecasting__starter_agent____init__.py.md
+- path: implementations/food_price_forecasting/starter_agent/agent.py
+  kind: python
+  domain: impl.food_price_forecasting
+  summary: "Food CPI starter agent \u2014 a fresh, hackable template for your own\
+    \ exploration."
+  symbols:
+  - FoodCpiStarterPromptBuilder
+  - build_starter_agent_config
+  - _StarterForecastPromptBuilder
+  - build_starter_agent_predictor
+  sections: []
+  chars: 10665
+  artifact: artifacts/implementations__food_price_forecasting__starter_agent__agent.py.md
+- path: implementations/food_price_forecasting/starter_agent/skills/code-analysis-playbook/SKILL.md
+  kind: markdown
+  domain: impl.food_price_forecasting
+  summary: Code-analysis playbook
+  symbols: []
+  sections:
+  - Code-analysis playbook
+  - Where your data lives
+  - Compute before you forecast
+  - Domain focus (edit this for your use case)
+  - Room to grow
+  chars: 2157
+  artifact: artifacts/implementations__food_price_forecasting__starter_agent__skills__code-analysis-playbook__SKILL.md.md
+- path: implementations/food_price_forecasting/starter_agent/skills/forecasting/SKILL.md
+  kind: markdown
+  domain: impl.food_price_forecasting
+  summary: Forecasting skill
+  symbols: []
+  sections:
+  - Forecasting skill
+  - What you'll receive
+  - The output contract
+  - Calibration
+  - Domain focus (edit this for your use case)
+  - Room to grow
+  chars: 2323
+  artifact: artifacts/implementations__food_price_forecasting__starter_agent__skills__forecasting__SKILL.md.md
+- path: implementations/food_price_forecasting/starter_agent/skills/research-playbook/SKILL.md
+  kind: markdown
+  domain: impl.food_price_forecasting
+  summary: Research playbook
+  symbols: []
+  sections:
+  - Research playbook
+  - The one rule that matters
+  - How to search
+  - Domain focus (edit this for your use case)
+  - Room to grow
+  chars: 1871
+  artifact: artifacts/implementations__food_price_forecasting__starter_agent__skills__research-playbook__SKILL.md.md
+- path: implementations/getting_started/00_environment_check.ipynb
+  kind: notebook
+  domain: impl.getting_started
+  summary: "00 \xB7 Environment Check \u2014 start here"
+  symbols: []
+  sections:
+  - "00 \xB7 Environment Check \u2014 start here"
+  - Setup
+  - "1 \xB7 API key inventory"
+  - "2 \xB7 Package imports & native libraries"
+  - "3 \xB7 LLM inference via the Vector proxy"
+  - "4 \xB7 Langfuse tracing connection"
+  - "5 \xB7 E2B code execution sandbox"
+  - "6 \xB7 StatCan data access"
+  - "7 \xB7 FRED data access (optional)"
+  - "8 \xB7 End-to-end mini forecast"
+  - Summary
+  chars: 22120
+  artifact: artifacts/implementations__getting_started__00_environment_check.ipynb.md
+- path: implementations/getting_started/01_cpi_data_exploration.ipynb
+  kind: notebook
+  domain: impl.getting_started
+  summary: "Getting Started \u2014 CPI Data Exploration"
+  symbols: []
+  sections:
+  - "Getting Started \u2014 CPI Data Exploration"
+  - 1. Build the DataService
+  - 2. Inspect the registered series
+  - "3. Cutoff filtering \u2014 the core discipline"
+  - '4. Plot: the three series side by side'
+  - 5. Year-over-year change
+  - 6. Define a `ForecastingTask`
+  - 'Next: `02_cpi_backtest_demo.ipynb`'
+  chars: 7140
+  artifact: artifacts/implementations__getting_started__01_cpi_data_exploration.ipynb.md
+- path: implementations/getting_started/02_cpi_backtest_demo.ipynb
+  kind: notebook
+  domain: impl.getting_started
+  summary: "Getting Started \u2014 CPI Gasoline Backtest"
+  symbols: []
+  sections:
+  - "Getting Started \u2014 CPI Gasoline Backtest"
+  - 1. Register CPI Gasoline
+  - 2. Load the reference spec
+  - 3. Define the predictors
+  - 4. Run both backtests
+  - 5. Per-origin CRPS comparison
+  - "6. Predictions vs. actuals \u2014 the visual story"
+  - 7. Where does it fail?
+  - 8. Spend an eval run (optional)
+  - '9. Try this next: re-run against Shelter'
+  - 10. Serialize the result to YAML
+  chars: 16383
+  artifact: artifacts/implementations__getting_started__02_cpi_backtest_demo.ipynb.md
+- path: implementations/getting_started/99_repo_concierge.ipynb
+  kind: notebook
+  domain: impl.getting_started
+  summary: "Repo Concierge \u2014 ask questions about this codebase"
+  symbols: []
+  sections:
+  - "Repo Concierge \u2014 ask questions about this codebase"
+  chars: 5634
+  artifact: artifacts/implementations__getting_started__99_repo_concierge.ipynb.md
+- path: implementations/getting_started/README.md
+  kind: markdown
+  domain: impl.getting_started
+  summary: Getting Started
+  symbols: []
+  sections:
+  - Getting Started
+  - The task
+  - Before you start
+  - 0. Check your environment - `00_environment_check.ipynb`
+  - Populate the local data cache
+  - Walkthrough
+  - 1. Warm up - `01_cpi_data_exploration.ipynb`
+  - 2. Run the backtest - `02_cpi_backtest_demo.ipynb`
+  - 3. Write your own predictor
+  - 4. Compare predictors
+  - 5. Spend an eval run
+  - "6. Ask the repo concierge \u2014 `99_repo_concierge.ipynb`"
+  - Where to go next
+  - Directory layout
+  - Key interfaces (from `aieng-forecasting`)
+  chars: 9228
+  artifact: artifacts/implementations__getting_started__README.md.md
+- path: implementations/getting_started/__init__.py
+  kind: python
+  domain: impl.getting_started
+  summary: Getting started reference implementation (notebooks + repo concierge agent).
+  symbols: []
+  sections: []
+  chars: 166
+  artifact: artifacts/implementations__getting_started____init__.py.md
+- path: implementations/getting_started/concierge_agent/__init__.py
+  kind: python
+  domain: impl.getting_started
+  summary: "Repo concierge agent \u2014 onboarding helper for the agentic-forecasting\
+    \ codebase."
+  symbols:
+  - build_concierge_config
+  - fetch_repo_artifact
+  - search_repo_catalog
+  - search_repo_knowledge
+  sections: []
+  chars: 713
+  artifact: artifacts/implementations__getting_started__concierge_agent____init__.py.md
+- path: implementations/getting_started/concierge_agent/agent.py
+  kind: python
+  domain: impl.getting_started
+  summary: "Repo concierge agent \u2014 onboarding helper for the agentic-forecasting\
+    \ codebase."
+  symbols:
+  - build_concierge_config
+  sections: []
+  chars: 4534
+  artifact: artifacts/implementations__getting_started__concierge_agent__agent.py.md
+- path: implementations/getting_started/concierge_agent/catalog.py
+  kind: python
+  domain: impl.getting_started
+  summary: Runtime catalog search and artifact fetch for the repo concierge agent.
+  symbols:
+  - CatalogHit
+  - search_repo_catalog
+  - fetch_repo_artifact
+  - clear_catalog_cache
+  - clear_catalog_cache
+  - fetch_repo_artifact
+  - search_repo_catalog
+  sections: []
+  chars: 7426
+  artifact: artifacts/implementations__getting_started__concierge_agent__catalog.py.md
+- path: implementations/getting_started/concierge_agent/catalog_build.py
+  kind: python
+  domain: impl.getting_started
+  summary: Build the repo concierge catalog and per-source artifacts (maintainer-only).
+  symbols:
+  - CatalogEntry
+  - repo_root_from_here
+  - context_dir
+  - path_to_artifact_slug
+  - infer_domain
+  - infer_kind
+  - collect_source_paths
+  - build_entry
+  - git_ref
+  - build_catalog
+  sections: []
+  chars: 11490
+  artifact: artifacts/implementations__getting_started__concierge_agent__catalog_build.py.md
+- path: implementations/getting_started/concierge_agent/knowledge.py
+  kind: python
+  domain: impl.getting_started
+  summary: "Repo knowledge tools \u2014 catalog search and artifact fetch."
+  symbols:
+  - clear_knowledge_cache
+  - search_repo_knowledge
+  - clear_knowledge_cache
+  - fetch_repo_artifact
+  - search_repo_catalog
+  - search_repo_knowledge
+  sections: []
+  chars: 1845
+  artifact: artifacts/implementations__getting_started__concierge_agent__knowledge.py.md
+- path: implementations/getting_started/concierge_agent/skills/repo-navigation/SKILL.md
+  kind: markdown
+  domain: impl.getting_started
+  summary: Repo navigation skill
+  symbols: []
+  sections:
+  - Repo navigation skill
+  - Workflow
+  chars: 665
+  artifact: artifacts/implementations__getting_started__concierge_agent__skills__repo-navigation__SKILL.md.md
+- path: implementations/getting_started/concierge_agent/skills/repo-navigation/references/catalog-guide.md
+  kind: markdown
+  domain: impl.getting_started
+  summary: Catalog guide
+  symbols: []
+  sections:
+  - Catalog guide
+  - Two-step retrieval
+  - Domain filters (`domain=`)
+  - Kind filters (`kind=`)
+  - Example sequences
+  chars: 1855
+  artifact: artifacts/implementations__getting_started__concierge_agent__skills__repo-navigation__references__catalog-guide.md.md
+- path: implementations/getting_started/concierge_agent/skills/repo-navigation/references/navigation-map.md
+  kind: markdown
+  domain: impl.getting_started
+  summary: Bootcamp navigation map
+  symbols: []
+  sections:
+  - Bootcamp navigation map
+  - First steps
+  - Reference implementations (pick by problem)
+  - Related agents
+  - Key library entry points
+  chars: 1832
+  artifact: artifacts/implementations__getting_started__concierge_agent__skills__repo-navigation__references__navigation-map.md.md
+- path: implementations/getting_started/specs/cpi_gasoline_1m.yaml
+  kind: yaml
+  domain: impl.getting_started
+  summary: 'Reference BacktestSpec: CPI Gasoline Canada, 1-month ahead forecast'
+  symbols: []
+  sections:
+  - 'Reference BacktestSpec: CPI Gasoline Canada, 1-month ahead forecast'
+  - '# Gasoline is the "hello-world" target for the getting-started experiment.'
+  - It is deliberately chosen over headline All-items because it is visibly
+  - volatile (2008 crude collapse, 2014-16 OPEC-led decline, 2020 COVID
+  - "demand shock, 2022 Russia/Ukraine surge) \u2014 a single-series backtest"
+  - makes the "why is this hard?" teaching point land without needing the
+  - full CFPR trajectory machinery.
+  - '# Horizon is 1 month ahead (h=1).  Rationale:'
+  - '- One-step-ahead is the most natural short-term question on a monthly'
+  - series and is immediately interpretable.
+  - '- StatCan releases CPI ~3 weeks after the reference month, so a'
+  - forecast made at origin T resolves at T+1 within the same month the
+  - next CPI print is published.  This is short enough to enable
+  - genuine live / prospective evaluation (make a prediction today,
+  - validate it next month).
+  - "- The backtest (2000\u20132025) provides ~276 training-window origins;"
+  - "the eval set (2025\u2013present) covers ~15+ already-resolved origins,"
+  - enough for a stable CRPS estimate.
+  - '- Naive (last-value) is meaningfully bad at turning points, so the'
+  - naive-vs-AutoARIMA comparison still lands even at h=1.
+  - '# To load this spec in Python:'
+  - '#   import yaml'
+  - from aieng.forecasting.evaluation import BacktestSpec
+  - '#   with open("implementations/getting_started/specs/cpi_gasoline_1m.yaml") as
+    f:'
+  - spec = BacktestSpec.model_validate(yaml.safe_load(f))
+  - '# The target series must be registered in the DataService before running'
+  - a backtest.  Use scripts/fetch_cpi.py to populate the local data cache,
+  - then register "cpi_gasoline_canada" from StatCan table 18-10-0004-11.
+  - 'Backtest window: January 2000 through January 2025.'
+  - Origins from January 2025 onward are reserved for the eval set.
+  - stride=1 on monthly (MS) frequency gives one origin per month.
+  - Require 24 months of history before the first forecast.
+  chars: 2930
+  artifact: artifacts/implementations__getting_started__specs__cpi_gasoline_1m.yaml.md
+- path: implementations/getting_started/specs/cpi_gasoline_eval_2025.yaml
+  kind: yaml
+  domain: impl.getting_started
+  summary: "Reference EvalSpec: CPI Gasoline Canada, 1-month ahead \u2014 2025 eval\
+    \ window"
+  symbols: []
+  sections:
+  - "Reference EvalSpec: CPI Gasoline Canada, 1-month ahead \u2014 2025 eval window"
+  - '# Companion to cpi_gasoline_1m.yaml.  Covers origins from January 2025'
+  - "through March 2026 \u2014 the most recent window where every forecast (h=1)"
+  - has a published resolution as of the time of writing (May 2026; StatCan
+  - releases CPI ~3 weeks after the reference month, so April 2026 data was
+  - published ~May 19 2026 and all h=1 forecasts up to March 2026 are
+  - resolved).
+  - '# This spec is intentionally "live-adjacent": origins from late 2025 and'
+  - early 2026 are as recent as possible, making it possible to extend the
+  - eval window month by month as new CPI prints are published.
+  - '# Use this spec sparingly.  max_runs: 5 limits how many times a'
+  - participant may run evaluate() against it, reducing the risk of
+  - inadvertently over-fitting to the held-out window.
+  - '# To load this spec in Python:'
+  - '#   import yaml'
+  - from pathlib import Path
+  - from aieng.forecasting.evaluation import EvalSpec, EvalTracker, evaluate
+  - '#   with open("implementations/getting_started/specs/cpi_gasoline_eval_2025.yaml")
+    as f:'
+  - spec = EvalSpec.model_validate(yaml.safe_load(f))
+  - '#   tracker = EvalTracker(Path("data/eval_runs.yaml"))'
+  - result = evaluate(predictor=my_predictor, spec=spec, data_service=svc, tracker=tracker)
+  - 'print(f"Eval mean CRPS: {result.mean_score:.4f}  (run {result.run_number}/{spec.max_runs})")'
+  - '# The target series must be registered before running.  Use'
+  - scripts/fetch_cpi.py to populate the local data cache, then register
+  - '"cpi_gasoline_canada".'
+  - 'Eval window: January 2025 through March 2026.'
+  - "Origins Jan 2025 \u2013 Mar 2026 with h=1 produce forecast dates Feb 2025 \u2013\
+    \ Apr 2026,"
+  - all of which are published as of May 2026.
+  - stride=1 gives one origin per month (~15 origins).
+  - Require 24 months of history before the first forecast.
+  - 'Budget cap: each participant may run evaluate() against this spec at'
+  - most 5 times.  Enforced by EvalTracker when passed to evaluate().
+  chars: 2789
+  artifact: artifacts/implementations__getting_started__specs__cpi_gasoline_eval_2025.yaml.md
+- path: implementations/sp500_forecasting/01_sp500_multivariate_backtest.ipynb
+  kind: notebook
+  domain: impl.sp500_forecasting
+  summary: "S&P 500 \u2014 multivariate conventional-methods comparison"
+  symbols: []
+  sections:
+  - "S&P 500 \u2014 multivariate conventional-methods comparison"
+  - What's actually forecastable at daily resolution?
+  - "\u26A0\uFE0F Cutoff-aware evaluation \u2014 why the windows are what they are"
+  - "Leaderboard \u2014 mean CRPS by method and horizon"
+  - "Reading the LLMP \xB1 covariates rows"
+  chars: 20376
+  artifact: artifacts/implementations__sp500_forecasting__01_sp500_multivariate_backtest.ipynb.md
+- path: implementations/sp500_forecasting/99_starter_agent.ipynb
+  kind: notebook
+  domain: impl.sp500_forecasting
+  summary: "S&P 500 \u2014 Your Starter Agent"
+  symbols: []
+  sections:
+  - "S&P 500 \u2014 Your Starter Agent"
+  chars: 7839
+  artifact: artifacts/implementations__sp500_forecasting__99_starter_agent.ipynb.md
+- path: implementations/sp500_forecasting/README.md
+  kind: markdown
+  domain: impl.sp500_forecasting
+  summary: S&P 500 multivariate forecasting (leak-safe covariates)
+  symbols: []
+  sections:
+  - S&P 500 multivariate forecasting (leak-safe covariates)
+  - Forecasting task
+  - Methods compared
+  - Canonical covariates (when enabled)
+  - Cutoff-aware evaluation (read this)
+  - No-leakage design
+  - "Specs \u2014 windows and tasks (experiment design only)"
+  - Module layout
+  - Adding a method
+  - Prerequisites
+  - "Build your own \u2014 `99_starter_agent.ipynb`"
+  chars: 11662
+  artifact: artifacts/implementations__sp500_forecasting__README.md.md
+- path: implementations/sp500_forecasting/__init__.py
+  kind: python
+  domain: impl.sp500_forecasting
+  summary: "S&P 500 multivariate log-return experiment \u2014 leak-safe covariates."
+  symbols:
+  - DEFAULT_COVARIATE_SERIES_IDS
+  - FRED_PREFETCH_REGISTRY
+  - FRED_SERIES_IDS_FOR_PREFETCH
+  - SERIES_ID_2Y10Y_SPREAD
+  - SERIES_ID_10Y_YIELD
+  - SERIES_ID_CPI_INFLATION_CHANGE
+  - SERIES_ID_DOLLAR_INDEX_RETURN
+  - SERIES_ID_FED_FUNDS
+  - SERIES_ID_GOLD_RETURN
+  - SERIES_ID_NASDAQ_RETURN
+  - SERIES_ID_OIL_RETURN
+  - SERIES_ID_UNEMPLOYMENT
+  - SERIES_ID_VIX_CHANGE
+  - SERIES_ID_VIX_LEVEL
+  - SP500_LOG_RETURN_SERIES_ID
+  - SP500_SERIES_ID
+  - SP500_TICKER
+  - build_sp500_multivariate_service
+  sections: []
+  chars: 2086
+  artifact: artifacts/implementations__sp500_forecasting____init__.py.md
+- path: implementations/sp500_forecasting/analysis.py
+  kind: python
+  domain: impl.sp500_forecasting
+  summary: Notebook-oriented formatting and direction metrics for the S&P 500 demo.
+  symbols:
+  - style_results_dataframe
+  - prob_return_above_threshold_from_quantiles
+  - build_direction_eval_frame
+  - direction_classification_metrics
+  - build_direction_eval_frame
+  - direction_classification_metrics
+  - prob_return_above_threshold_from_quantiles
+  - style_results_dataframe
+  sections: []
+  chars: 5762
+  artifact: artifacts/implementations__sp500_forecasting__analysis.py.md
+- path: implementations/sp500_forecasting/data.py
+  kind: python
+  domain: impl.sp500_forecasting
+  summary: Leak-safe data-service setup for multivariate S&P 500 log-return forecasting.
+  symbols:
+  - sp500_logret_series_id
+  - YahooFinanceDailyAdapter
+  - StaticFrameAdapter
+  - build_sp500_log_return_service
+  - build_sp500_multivariate_service
+  - DEFAULT_COVARIATE_SERIES_IDS
+  - FRED_PREFETCH_REGISTRY
+  - FRED_SERIES_IDS_FOR_PREFETCH
+  - SERIES_ID_10Y_YIELD
+  - SERIES_ID_2Y10Y_SPREAD
+  - SERIES_ID_CPI_INFLATION_CHANGE
+  - SERIES_ID_DOLLAR_INDEX_RETURN
+  - SERIES_ID_FED_FUNDS
+  - SERIES_ID_GOLD_RETURN
+  - SERIES_ID_NASDAQ_RETURN
+  - SERIES_ID_OIL_RETURN
+  - SERIES_ID_UNEMPLOYMENT
+  - SERIES_ID_VIX_CHANGE
+  - SERIES_ID_VIX_LEVEL
+  - SP500_LOG_RETURN_SERIES_ID
+  - SP500_RETURN_TARGETS
+  - SP500_RETURN_WINDOWS
+  - SP500_SERIES_ID
+  - SP500_TICKER
+  - SP500_WINDOW_LABELS
+  - StaticFrameAdapter
+  - build_sp500_log_return_service
+  - build_sp500_multivariate_service
+  - sp500_logret_series_id
+  sections: []
+  chars: 35434
+  artifact: artifacts/implementations__sp500_forecasting__data.py.md
+- path: implementations/sp500_forecasting/leaderboard.py
+  kind: python
+  domain: impl.sp500_forecasting
+  summary: Leaderboard rows for the multivariate S&P 500 experiment.
+  symbols:
+  - build_return_compare_frame
+  - build_leaderboard
+  - build_leaderboard
+  - build_return_compare_frame
+  sections: []
+  chars: 7679
+  artifact: artifacts/implementations__sp500_forecasting__leaderboard.py.md
+- path: implementations/sp500_forecasting/plots.py
+  kind: python
+  domain: impl.sp500_forecasting
+  summary: Matplotlib helpers for the multivariate S&P 500 demo notebook.
+  symbols:
+  - plot_sp500_log_return_recent
+  - plot_mean_crps_leaderboard
+  - plot_mean_crps_by_horizon
+  - plot_return_forecast_vs_actual_multi
+  - display_multivariate_backtest_leaderboard
+  sections: []
+  chars: 8856
+  artifact: artifacts/implementations__sp500_forecasting__plots.py.md
+- path: implementations/sp500_forecasting/predictors/__init__.py
+  kind: python
+  domain: impl.sp500_forecasting
+  summary: Tuned predictor recipes for the multivariate S&P 500 experiment.
+  symbols:
+  - build_sp500_llmp_sampled_trajectory
+  sections: []
+  chars: 987
+  artifact: artifacts/implementations__sp500_forecasting__predictors____init__.py.md
+- path: implementations/sp500_forecasting/predictors/llmp_sampled_trajectory.py
+  kind: python
+  domain: impl.sp500_forecasting
+  summary: 'S&P 500 recipe: sampled-trajectory LLMP (target-only and with-covariates).'
+  symbols:
+  - build_sp500_llmp_sampled_trajectory
+  - build_sp500_llmp_sampled_trajectory
+  sections: []
+  chars: 4804
+  artifact: artifacts/implementations__sp500_forecasting__predictors__llmp_sampled_trajectory.py.md
+- path: implementations/sp500_forecasting/specs/sp500_backtest_2025.yaml
+  kind: yaml
+  domain: impl.sp500_forecasting
+  summary: "Main backtest spec \u2014 weekly origins across 2025 (post-cutoff)."
+  symbols: []
+  sections:
+  - "Main backtest spec \u2014 weekly origins across 2025 (post-cutoff)."
+  - '# 2025 is after the Gemini training cutoff (~Jan 2025), so this is the window'
+  - "where the conventional methods AND the LLM-Process can be compared *fairly* \u2014"
+  - the LLM has not memorised these outcomes. Mirrors the energy reference's 2025
+  - backtest window. Use it for open iteration; spend the protected 2026 eval
+  - (`sp500_eval_2026.yaml`) sparingly on your finalists.
+  - '# This spec carries experiment design only (window + one single-horizon task
+    per'
+  - target). The predictor roster and hyperparameters live in the notebook. Note
+  - "the LLMP predictors are token-heavy over ~50 weekly origins \u2014 the notebook\
+    \ lets"
+  - you trim the predictor list (or widen the stride here) before enabling them.
+  chars: 1854
+  artifact: artifacts/implementations__sp500_forecasting__specs__sp500_backtest_2025.yaml.md
+- path: implementations/sp500_forecasting/specs/sp500_eval_2026.yaml
+  kind: yaml
+  domain: impl.sp500_forecasting
+  summary: "Protected eval \u2014 held-out 2026 window, scored through multi_evaluate()\
+    \ with a budget."
+  symbols: []
+  sections:
+  - "Protected eval \u2014 held-out 2026 window, scored through multi_evaluate() with\
+    \ a budget."
+  - '# This is the honest scoreboard. 2026 is unambiguously after the Gemini training'
+  - cutoff, so neither the numerical methods nor the LLM-Process can have seen the
+  - 'outcomes. Treat it as scarce: `max_runs` caps how many times the spec may be'
+  - scored (via EvalTracker), so iterate on `sp500_backtest_2025.yaml` and spend
+  - this only on a curated set of finalists (chosen in the notebook's eval cell).
+  - '# Loaded as a MultiTargetEvalSpec: one single-horizon task per target, all under'
+  - a single shared run budget. One multi_evaluate() call across all three
+  - "horizons counts as ONE run against `max_runs` \u2014 the budget is keyed by"
+  - '`spec_id`, not per-horizon. The predictor roster lives in the notebook.'
+  - 'Budget cap: each multi_evaluate() call (all 3 horizons) counts as one run.'
+  chars: 2004
+  artifact: artifacts/implementations__sp500_forecasting__specs__sp500_eval_2026.yaml.md
+- path: implementations/sp500_forecasting/specs/sp500_smoke.yaml
+  kind: yaml
+  domain: impl.sp500_forecasting
+  summary: "Smoke spec \u2014 fast laptop run over a short, post-cutoff (2025) window."
+  symbols: []
+  sections:
+  - "Smoke spec \u2014 fast laptop run over a short, post-cutoff (2025) window."
+  - '# Weekly origins in late 2025: after the Gemini training cutoff (~Jan 2025),
+    so'
+  - the LLM-Process rows in the notebook can be compared *fairly* against the
+  - conventional methods here. The notebook keeps its LLMP predictors ON for this
+  - window (the predictors cell gates them on a post-cutoff flag).
+  - "# This spec carries experiment design only \u2014 the window, stride/warmup,\
+    \ and one"
+  - single-horizon task per target. WHICH predictors run (and all their
+  - hyperparameters, including the covariate panel) is configured in the notebook,
+  - not here. Each task targets `sp500_logret_{N}b` (the close-to-close cumulative
+  - log return over N business days), so forecasting it N steps ahead resolves to
+  - the forward N-session return.
+  - '# Prerequisites (warm caches to the present first):'
+  - 'uv run python scripts/fetch_sp500_market.py --refresh   # ^GSPC / ^VIX / ^IXIC'
+  - 'uv run python scripts/fetch_fred.py                     # macro covariates'
+  chars: 2090
+  artifact: artifacts/implementations__sp500_forecasting__specs__sp500_smoke.yaml.md
+- path: implementations/sp500_forecasting/specs/sp500_stress_2020.yaml
+  kind: yaml
+  domain: impl.sp500_forecasting
+  summary: "Regime-stress spec \u2014 the 2020 COVID crash, NUMERICAL METHODS ONLY."
+  symbols: []
+  sections:
+  - "Regime-stress spec \u2014 the 2020 COVID crash, NUMERICAL METHODS ONLY."
+  - "# \u26A0\uFE0F  Keep the notebook's LLM-Process predictors OFF for this window.\
+    \ 2020 is"
+  - BEFORE the Gemini training cutoff (~Jan 2025), so an LLM has effectively
+  - "memorised these outcomes \u2014 scoring an LLMP here measures recall, not"
+  - forecasting, and would silently flatter it in the comparison. The numerical
+  - methods are cutoff-safe by construction (they only see the series up to the
+  - "origin), so this volatile window is a perfectly valid stress test *for them*\
+    \ \u2014"
+  - it's where a covariate edge is most visible.
+  - '# The notebook enforces "numerical only" in code: its predictors cell gates the'
+  - LLMP variants on a post-cutoff flag that is False for this config. Use this to
+  - study "when do covariates help?" among the conventional methods; use the 2025
+  - backtest / 2026 eval for anything involving the LLMP.
+  chars: 1986
+  artifact: artifacts/implementations__sp500_forecasting__specs__sp500_stress_2020.yaml.md
+- path: implementations/sp500_forecasting/starter_agent/__init__.py
+  kind: python
+  domain: impl.sp500_forecasting
+  summary: "S&P 500 starter agent \u2014 a fresh, hackable template for your own exploration."
+  symbols:
+  - Sp500StarterPromptBuilder
+  - build_starter_agent_config
+  - build_starter_agent_predictor
+  sections: []
+  chars: 636
+  artifact: artifacts/implementations__sp500_forecasting__starter_agent____init__.py.md
+- path: implementations/sp500_forecasting/starter_agent/agent.py
+  kind: python
+  domain: impl.sp500_forecasting
+  summary: "S&P 500 starter agent \u2014 a fresh, hackable template for your own exploration."
+  symbols:
+  - Sp500StarterPromptBuilder
+  - build_starter_agent_config
+  - _StarterForecastPromptBuilder
+  - build_starter_agent_predictor
+  sections: []
+  chars: 11869
+  artifact: artifacts/implementations__sp500_forecasting__starter_agent__agent.py.md
+- path: implementations/sp500_forecasting/starter_agent/skills/code-analysis-playbook/SKILL.md
+  kind: markdown
+  domain: impl.sp500_forecasting
+  summary: Code-analysis playbook
+  symbols: []
+  sections:
+  - Code-analysis playbook
+  - Where your data lives
+  - Compute before you forecast
+  - Domain focus (edit this for your use case)
+  - Room to grow
+  chars: 2172
+  artifact: artifacts/implementations__sp500_forecasting__starter_agent__skills__code-analysis-playbook__SKILL.md.md
+- path: implementations/sp500_forecasting/starter_agent/skills/forecasting/SKILL.md
+  kind: markdown
+  domain: impl.sp500_forecasting
+  summary: Forecasting skill
+  symbols: []
+  sections:
+  - Forecasting skill
+  - What you'll receive
+  - The output contract
+  - Calibration
+  - Domain focus (edit this for your use case)
+  - Room to grow
+  chars: 2299
+  artifact: artifacts/implementations__sp500_forecasting__starter_agent__skills__forecasting__SKILL.md.md
+- path: implementations/sp500_forecasting/starter_agent/skills/research-playbook/SKILL.md
+  kind: markdown
+  domain: impl.sp500_forecasting
+  summary: Research playbook
+  symbols: []
+  sections:
+  - Research playbook
+  - The one rule that matters
+  - How to search
+  - Domain focus (edit this for your use case)
+  - Room to grow
+  chars: 1860
+  artifact: artifacts/implementations__sp500_forecasting__starter_agent__skills__research-playbook__SKILL.md.md
+- path: planning-docs/roadmap.md
+  kind: markdown
+  domain: docs
+  summary: Roadmap and Architecture Notes
+  symbols: []
+  sections:
+  - Roadmap and Architecture Notes
+  - Forecasting taxonomy
+  - Architecture principles
+  - Agent modes
+  - Extension ideas
+  - Deepen a reference implementation
+  - Agent and analyst depth
+  - Broaden coverage
+  - Live testing
+  - Core-library follow-up
+  chars: 7708
+  artifact: artifacts/planning-docs__roadmap.md.md
+- path: scripts/fetch_boc.py
+  kind: python
+  domain: scripts
+  summary: Populate the local data cache for the BoC rate-decision experiment.
+  symbols:
+  - main
+  sections: []
+  chars: 4192
+  artifact: artifacts/scripts__fetch_boc.py.md
+- path: scripts/fetch_boc_press_releases.py
+  kind: python
+  domain: scripts
+  summary: Download and extract Bank of Canada FAD press releases into the data cache.
+  symbols:
+  - fetch_entry
+  - main
+  sections: []
+  chars: 5157
+  artifact: artifacts/scripts__fetch_boc_press_releases.py.md
+- path: scripts/fetch_cfpr.py
+  kind: python
+  domain: scripts
+  summary: Download and cache published report PDFs into the local ``data/`` cache.
+  symbols:
+  - fetch_entry
+  - main
+  sections: []
+  chars: 5955
+  artifact: artifacts/scripts__fetch_cfpr.py.md
+- path: scripts/fetch_cpi.py
+  kind: python
+  domain: scripts
+  summary: Fetch and cache Canada-wide CPI series from Statistics Canada.
+  symbols:
+  - build_data_service
+  - main
+  sections: []
+  chars: 13054
+  artifact: artifacts/scripts__fetch_cpi.py.md
+- path: scripts/fetch_fred.py
+  kind: python
+  domain: scripts
+  summary: Populate the local FRED cache with series used by the CFPR experiment.
+  symbols:
+  - build_data_service
+  - main
+  sections: []
+  chars: 6004
+  artifact: artifacts/scripts__fetch_fred.py.md
+- path: scripts/fetch_sp500_market.py
+  kind: python
+  domain: scripts
+  summary: Populate / refresh the local Yahoo market caches for the S&P 500 use case.
+  symbols:
+  - main
+  sections: []
+  chars: 2742
+  artifact: artifacts/scripts__fetch_sp500_market.py.md
+- path: scripts/fetch_wti.py
+  kind: python
+  domain: scripts
+  summary: Fetch and cache WTI Crude Oil daily price history from Yahoo Finance.
+  symbols:
+  - main
+  sections: []
+  chars: 1509
+  artifact: artifacts/scripts__fetch_wti.py.md
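Note on the `artifact` fields in the catalog entries above: they appear to follow one rule, where each `/` in the source path becomes `__` and `.md` is appended under `artifacts/`. The sketch below is inferred only from the entries listed in this diff. `artifact_path_for` is a hypothetical name; it is not the repo's `path_to_artifact_slug`, whose implementation is not shown here and may handle cases this sketch does not.

```python
def artifact_path_for(source_path: str) -> str:
    """Guess the catalog artifact path for a repo-relative source path.

    Assumption: the rule is inferred from the catalog entries in this diff,
    not taken from catalog_build.py.
    """
    # Each path separator becomes a double underscore.
    slug = source_path.replace("/", "__")
    # Artifacts are Markdown renderings, so ".md" is appended even when the
    # source file is already Markdown (e.g. roadmap.md -> roadmap.md.md).
    return f"artifacts/{slug}.md"


if __name__ == "__main__":
    for path in (
        "scripts/fetch_cpi.py",
        "implementations/getting_started/__init__.py",
        "planning-docs/roadmap.md",
    ):
        print(path, "->", artifact_path_for(path))
```

A package `__init__.py` ends up with four consecutive underscores in its slug because the separator's `__` sits directly before the file name's own leading `__`.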
diff --git a/implementations/getting_started/concierge_agent/knowledge.py b/implementations/getting_started/concierge_agent/knowledge.py
new file mode 100644
index 0000000..030d77f
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/knowledge.py
@@ -0,0 +1,54 @@
+"""Repo knowledge tools — catalog search and artifact fetch.
+
+Legacy :func:`search_repo_knowledge` delegates to the catalog tools for
+backward compatibility.
+"""
+
+from __future__ import annotations
+
+from getting_started.concierge_agent.catalog import (
+    clear_catalog_cache,
+    fetch_repo_artifact,
+    search_repo_catalog,
+)
+
+
+def clear_knowledge_cache() -> None:
+    """Clear cached catalog reads (for tests)."""
+    clear_catalog_cache()
+
+
+def search_repo_knowledge(query: str, topic: str | None = None) -> str:
+    """Backward-compatible wrapper: catalog search + fetch of the top hit."""
+    catalog_result = search_repo_catalog(query, domain=_topic_to_domain(topic))
+    if catalog_result.startswith("No catalog matches"):
+        return catalog_result
+    # Fetch the first path listed in the catalog output and append its excerpt.
+    for line in catalog_result.splitlines():
+        if line.startswith("- **path:** "):
+            path = line.removeprefix("- **path:** ").strip().strip("`")
+            body = fetch_repo_artifact(path, max_chars=2400)
+            return f"{catalog_result}\n\n---\n\n# Fetched: `{path}`\n\n{body}"
+    return catalog_result
+
+
+def _topic_to_domain(topic: str | None) -> str | None:
+    if topic is None:
+        return None
+    key = topic.lower().removesuffix(".md")
+    mapping = {  # legacy topic name -> catalog domain
+        "overview": "docs",
+        "core_library": "core.evaluation",
+        "methods": "core.methods",
+        "implementations": "impl.getting_started",
+        "extension_guides": "scripts",
+    }
+    return mapping.get(key, key if key.startswith(("core.", "impl.", "docs", "scripts")) else None)
+
+
+__all__ = [
+    "clear_knowledge_cache",
+    "fetch_repo_artifact",
+    "search_repo_catalog",
+    "search_repo_knowledge",
+]
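Note on `_topic_to_domain` above: legacy topic names are translated through a fixed table, and anything that already looks like a catalog domain passes through unchanged. The standalone sketch below mirrors that logic so the fallback behaviour is easy to check in isolation. The public name `topic_to_domain` and the module-level constants are illustrative, not part of the module.

```python
from typing import Optional

# Same table as in knowledge.py: legacy topic name -> catalog domain.
LEGACY_TOPICS = {
    "overview": "docs",
    "core_library": "core.evaluation",
    "methods": "core.methods",
    "implementations": "impl.getting_started",
    "extension_guides": "scripts",
}

# A key starting with one of these is treated as already being a domain.
DOMAIN_PREFIXES = ("core.", "impl.", "docs", "scripts")


def topic_to_domain(topic: Optional[str]) -> Optional[str]:
    """Map a legacy topic (optionally ending in '.md') to a catalog domain."""
    if topic is None:
        return None
    key = topic.lower().removesuffix(".md")  # requires Python 3.9+
    if key in LEGACY_TOPICS:
        return LEGACY_TOPICS[key]
    # Unknown keys fall back to None, which means "no domain filter".
    return key if key.startswith(DOMAIN_PREFIXES) else None


if __name__ == "__main__":
    for topic in ("Overview.md", "core.data", "weather", None):
        print(repr(topic), "->", repr(topic_to_domain(topic)))
```

An unrecognised topic silently widens the search to the whole catalog rather than raising, which keeps old callers working but can hide a misspelled topic.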
diff --git a/implementations/getting_started/concierge_agent/skills/repo-navigation/SKILL.md b/implementations/getting_started/concierge_agent/skills/repo-navigation/SKILL.md
new file mode 100644
index 0000000..564a7f1
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/skills/repo-navigation/SKILL.md
@@ -0,0 +1,17 @@
+---
+name: repo-navigation
+description: >-
+  Reference guide for the repo concierge catalog — domain filters, the
+  search-then-fetch workflow, and bootcamp routing. Load references/catalog-guide.md
+  before your first answer. No scripts.
+---
+
+# Repo navigation skill
+
+## Workflow
+
+1. Before your first answer: `load_skill_resource("repo-navigation", "references/catalog-guide.md")`
+2. `search_repo_catalog(query, domain=..., kind=...)` — metadata only
+3. `fetch_repo_artifact(path)` for each path you need (1–3 per question)
+
+**No scripts. Do not call `run_skill_script`.**
diff --git a/implementations/getting_started/concierge_agent/skills/repo-navigation/references/catalog-guide.md b/implementations/getting_started/concierge_agent/skills/repo-navigation/references/catalog-guide.md
new file mode 100644
index 0000000..773b7c9
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/skills/repo-navigation/references/catalog-guide.md
@@ -0,0 +1,47 @@
+# Catalog guide
+
+## Two-step retrieval
+
+| Step | Tool | Returns |
+|------|------|---------|
+| 1 | `search_repo_catalog(query, domain?, kind?)` | Paths, summaries, section titles |
+| 2 | `fetch_repo_artifact(path, section?)` | Full file/notebook content |
+
+Never skip step 1 — it keeps responses grounded and token-efficient.
+
+## Domain filters (`domain=`)
+
+| Domain | Contents |
+|--------|----------|
+| `core.data` | `aieng/forecasting/data/` |
+| `core.evaluation` | `aieng/forecasting/evaluation/` |
+| `core.methods` | `aieng/forecasting/methods/` |
+| `core.documents` | `aieng/forecasting/documents/` |
+| `core.root` | top-level `aieng/forecasting/*.py` |
+| `impl.<use_case>` | e.g. `impl.energy_oil_forecasting` |
+| `scripts` | `scripts/fetch_*.py` |
+| `docs` | README, AGENTS, roadmap, adk-skills-guide |
+
+## Kind filters (`kind=`)
+
+- `python` — library and implementation modules
+- `notebook` — markdown **and code cells** (outputs stripped)
+- `markdown` — READMEs
+- `yaml` — `specs/*.yaml`
+
+## Example sequences
+
+**DataService:**
+1. `search_repo_catalog("DataService register", domain="core.data")`
+2. `fetch_repo_artifact("aieng-forecasting/aieng/forecasting/data/service.py")`
+3. Optionally `fetch_repo_artifact("scripts/fetch_cpi.py")`
+
+**LLMP context:**
+1. `search_repo_catalog("LLMP user_prompt_suffix", domain="core.methods")`
+2. `fetch_repo_artifact("aieng-forecasting/aieng/forecasting/methods/llm_processes/base.py")`
+
+**Energy notebook 02:**
+1. `search_repo_catalog("intro agentic predictor", domain="impl.energy_oil_forecasting", kind="notebook")`
+2. `fetch_repo_artifact("implementations/energy_oil_forecasting/02_intro_agentic_predictor.ipynb")`
+
+Load `references/navigation-map.md` for bootcamp entry points.
diff --git a/implementations/getting_started/concierge_agent/skills/repo-navigation/references/catalog-summary.yaml b/implementations/getting_started/concierge_agent/skills/repo-navigation/references/catalog-summary.yaml
new file mode 100644
index 0000000..ee0ac75
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/skills/repo-navigation/references/catalog-summary.yaml
@@ -0,0 +1,21 @@
+# Concierge catalog summary (regenerated by scripts/build_concierge_context.py)
+source_url: https://github.com/VectorInstitute/agentic-forecasting
+branch: main
+built_at: '2026-06-25T15:12:49+00:00'
+git_ref: d4a22d05c7a9e86763a8d9b5e2891bcc24eda429
+entry_count: 195
+domains:
+  docs: 4
+  core.root: 3
+  core.data: 11
+  core.documents: 5
+  core.evaluation: 9
+  core.methods: 26
+  impl.README.md: 1
+  impl.__init__.py: 1
+  impl.boc_rate_decisions: 27
+  impl.energy_oil_forecasting: 45
+  impl.food_price_forecasting: 21
+  impl.getting_started: 16
+  impl.sp500_forecasting: 19
+  scripts: 7
diff --git a/implementations/getting_started/concierge_agent/skills/repo-navigation/references/navigation-map.md b/implementations/getting_started/concierge_agent/skills/repo-navigation/references/navigation-map.md
new file mode 100644
index 0000000..9cfd994
--- /dev/null
+++ b/implementations/getting_started/concierge_agent/skills/repo-navigation/references/navigation-map.md
@@ -0,0 +1,35 @@
+# Bootcamp navigation map
+
+Quick pointers for common participant questions. Confirm details with
+`search_repo_catalog` and `fetch_repo_artifact` — this file is a map, not the full docs.
+
+## First steps
+
+1. `implementations/getting_started/00_environment_check.ipynb` — preflight (run first).
+2. `implementations/getting_started/01_cpi_data_exploration.ipynb` — data + ForecastingTask.
+3. `implementations/getting_started/02_cpi_backtest_demo.ipynb` — backtest loop.
+4. `implementations/getting_started/99_repo_concierge.ipynb` — this concierge (repo Q&A).
+
+## Reference implementations (pick by problem)
+
+| Order | Directory | Good for |
+|-------|-----------|----------|
+| 0 | `getting_started/` | Smallest eval loop (CPI gasoline, h=1) |
+| 1 | `sp500_forecasting/` | Numerical methods + covariate-aware LLMP |
+| 2 | `food_price_forecasting/` | Multi-target CPI trajectories, CFPR metric |
+| 3 | `energy_oil_forecasting/` | Daily prices, news/code agents, adaptive agent |
+| 4 | `boc_rate_decisions/` | Discrete cut/hold/hike events, RPS/Brier |
+
+## Related agents
+
+Each domain ships `99_starter_agent.ipynb` + `starter_agent/` for hands-on
+forecasting (news search, code execution). Energy also has `analyst_agent/` and
+`adaptive_agent/`. This **repo concierge** helps you navigate and understand the
+codebase; domain starter agents are where you build and score forecasts.
+
+## Key library entry points
+
+- `aieng.forecasting.data.DataService` — register series, build contexts.
+- `aieng.forecasting.evaluation` — `Predictor`, `backtest()`, `evaluate()`.
+- `aieng.forecasting.methods` — baselines, numerical, LLM Processes, agentic ADK.
+- `AGENTS.md` — contributor conventions (models, data cache, docs).
diff --git a/implementations/pyproject.toml b/implementations/pyproject.toml
index d93ce2a..7467a9c 100644
--- a/implementations/pyproject.toml
+++ b/implementations/pyproject.toml
@@ -14,7 +14,8 @@ dependencies = [
 
 [tool.setuptools.packages.find]
 where = ["."]          # treat implementations/ as the src root
-exclude = ["tests*", "getting_started*"]
+exclude = ["tests*"]
+# getting_started/ is mostly notebooks; only concierge_agent/ is packaged.
 
 [tool.uv.sources]
 aieng-forecasting = { workspace = true }
diff --git a/implementations/tests/getting_started/__init__.py b/implementations/tests/getting_started/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/implementations/tests/getting_started/test_concierge_catalog.py b/implementations/tests/getting_started/test_concierge_catalog.py
new file mode 100644
index 0000000..9c54182
--- /dev/null
+++ b/implementations/tests/getting_started/test_concierge_catalog.py
@@ -0,0 +1,61 @@
+"""Tests for the repo-concierge catalog and artifact tools."""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import yaml
+from getting_started.concierge_agent.catalog import (
+    clear_catalog_cache,
+    fetch_repo_artifact,
+    search_repo_catalog,
+)
+from getting_started.concierge_agent.catalog_build import CORE_PREFIX, collect_source_paths
+
+
+_CONTEXT_DIR = Path(__file__).resolve().parents[2] / "getting_started/concierge_agent/context"
+_REPO_ROOT = Path(__file__).resolve().parents[3]
+_DATA_SERVICE = f"{CORE_PREFIX}/data/service.py"
+
+
+def setup_function() -> None:
+    clear_catalog_cache()
+
+
+def test_catalog_covers_full_core_package() -> None:
+    catalog_path = _CONTEXT_DIR / "catalog.yaml"
+    with catalog_path.open(encoding="utf-8") as fh:
+        catalog = yaml.safe_load(fh)
+    indexed = {entry["path"] for entry in catalog["entries"]}
+    expected_py = {
+        p.relative_to(_REPO_ROOT).as_posix()
+        for p in collect_source_paths(_REPO_ROOT)
+        if p.suffix == ".py" and CORE_PREFIX in p.as_posix()
+    }
+    missing = expected_py - indexed
+    assert not missing, f"Core modules missing from catalog: {sorted(missing)[:5]}"
+
+
+def test_search_catalog_returns_metadata_only() -> None:
+    result = search_repo_catalog("DataService register", domain="core.data")
+    assert "DataService" in result or "service.py" in result
+    assert "class DataService" not in result
+    assert "fetch_repo_artifact" in result
+
+
+def test_fetch_data_service_module() -> None:
+    body = fetch_repo_artifact(_DATA_SERVICE)
+    assert "class DataService" in body
+    assert "def register" in body
+
+
+def test_fetch_notebook_includes_code_cells() -> None:
+    path = "implementations/energy_oil_forecasting/02_intro_agentic_predictor.ipynb"
+    body = fetch_repo_artifact(path, max_chars=20000)
+    assert "Cell" in body and "(code)" in body
+    assert "```python" in body
+
+
+def test_fetch_respects_max_chars() -> None:
+    body = fetch_repo_artifact(_DATA_SERVICE, max_chars=500)
+    assert len(body) <= 520
diff --git a/implementations/tests/getting_started/test_concierge_skill.py b/implementations/tests/getting_started/test_concierge_skill.py
new file mode 100644
index 0000000..8ef99d5
--- /dev/null
+++ b/implementations/tests/getting_started/test_concierge_skill.py
@@ -0,0 +1,38 @@
+"""Tests for the repo-concierge ADK skill wiring."""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+from getting_started.concierge_agent.agent import build_concierge_config
+from google.adk.skills import load_skill_from_dir
+
+
+_SKILL_DIR = Path(__file__).resolve().parents[2] / "getting_started/concierge_agent/skills/repo-navigation"
+
+
+def test_repo_navigation_skill_has_reference_files() -> None:
+    refs = _SKILL_DIR / "references"
+    assert (refs / "catalog-guide.md").is_file()
+    assert (refs / "catalog-summary.yaml").is_file()
+    assert (refs / "navigation-map.md").is_file()
+    assert not (_SKILL_DIR / "scripts").exists()
+
+
+def test_repo_navigation_skill_loads_via_adk() -> None:
+    skill = load_skill_from_dir(_SKILL_DIR)
+    assert skill.name == "repo-navigation"
+    assert "catalog" in skill.description.lower()
+    assert len(skill.resources.references) >= 3
+    assert not skill.resources.scripts
+
+
+def test_concierge_instruction_forbids_run_skill_script() -> None:
+    config = build_concierge_config()
+    assert "run_skill_script" in config.instruction
+    assert "NO scripts" in config.instruction
+    assert "search_repo_catalog" in config.instruction
+    assert "fetch_repo_artifact" in config.instruction
+    tool_names = [getattr(t, "__name__", "") for t in config.extra_tools]
+    assert "search_repo_catalog" in tool_names
+    assert "fetch_repo_artifact" in tool_names
diff --git a/planning-docs/roadmap.md b/planning-docs/roadmap.md
index 3ad1456..4bbd818 100644
--- a/planning-docs/roadmap.md
+++ b/planning-docs/roadmap.md
@@ -48,6 +48,8 @@ The repository is a foundation. Each reference implementation's README ends with
 
 Every domain implementation (S&P 500, food, energy, BoC) now ships a **`starter_agent/` module + `99_starter_agent.ipynb`** — a fresh, participant-owned agent template with toggleable proxy news search and E2B code execution, two lightweight tool-usage skills, an interactive (Track 2) cell, and one scored (Track 1) prediction. It is the canonical "build your own" entry point and doubles as a quick end-to-end smoke test of each use case's agent stack. Natural next steps from here: richer E2B code-execution configs, prompt and context-formatting optimization, and deeper Track 2 interactive analyst configurations per use case (see [`../docs/adk-skills-guide.md`](../docs/adk-skills-guide.md) for the skill design rules).
 
+**Repo concierge (shipped).** `getting_started/concierge_agent/` + `99_repo_concierge.ipynb`: a `gemini-3.1-flash-lite-preview` ADK agent that answers bootcamp onboarding questions by searching a committed catalog built from the public `main` branch (maintainers rebuild it via `scripts/build_concierge_context.py`). It points participants to notebooks, modules, and snippets, and complements the domain starter agents.
+
 ### Broaden coverage
 
 - Transpose the S&P 500 template to additional energy commodities, or to other liquid assets, equities, or indices. The S&P 500 reference now compares conventional numerical methods (incl. ETS and Kalman) against a **covariate-aware LLM-Process** across cumulative-return horizons — `SampledTrajectoryLLMPredictor` supports `covariate_series_ids` (exogenous-series prompt blocks), so the "can an LLM use the covariate panel as well as gradient boosting?" comparison is shipped, not deferred.
diff --git a/scripts/build_concierge_context.py b/scripts/build_concierge_context.py
new file mode 100644
index 0000000..780a695
--- /dev/null
+++ b/scripts/build_concierge_context.py
@@ -0,0 +1,51 @@
+#!/usr/bin/env python3
+"""Build the repo concierge catalog and per-source artifacts.
+
+Maintainers run this when library code, implementations, or notebooks change:
+
+    uv run python scripts/build_concierge_context.py
+
+Indexes the full ``aieng-forecasting/aieng/forecasting`` tree, reference
+implementation modules/READMEs/specs/notebooks (markdown + code cells, no
+outputs), plus root docs and fetch scripts.
+
+Output: ``implementations/getting_started/concierge_agent/context/catalog.yaml``
+and ``context/artifacts/*.md``.
+"""
+
+from __future__ import annotations
+
+import argparse
+import sys
+from pathlib import Path
+
+import yaml
+
+
+# Allow running as ``python scripts/build_concierge_context.py`` from repo root.
+_REPO_ROOT = Path(__file__).resolve().parents[1]
+if str(_REPO_ROOT / "implementations") not in sys.path:
+    sys.path.insert(0, str(_REPO_ROOT / "implementations"))
+
+from getting_started.concierge_agent.catalog_build import build_catalog  # noqa: E402
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument(
+        "--repo-root",
+        type=Path,
+        default=_REPO_ROOT,
+        help="Repository root to index (default: parent of scripts/).",
+    )
+    args = parser.parse_args()
+    out_dir = build_catalog(args.repo_root.resolve())
+
+    catalog = yaml.safe_load((out_dir / "catalog.yaml").read_text(encoding="utf-8"))
+    count = catalog.get("entry_count", 0)
+    print(f"Wrote catalog with {count} entries to {out_dir}")
+    print(f"  artifacts/: {len(list((out_dir / 'artifacts').glob('*.md')))} files")
+
+
+if __name__ == "__main__":
+    main()