diff --git a/README.md b/README.md index a2a8e6f..c2c9164 100644 --- a/README.md +++ b/README.md @@ -25,7 +25,7 @@ Every method can be used in one of two modes, and the distinction runs through t Each is independent and self-contained — pick the one that matches the problem you care about, and read that directory's `README.md` for the full walkthrough. They are numbered in a recommended order that mirrors the bootcamp progression — conventional numerical methods → LLM Processes → agents → agentic evaluation — but any one stands on its own, so jump straight to the problem you care about. -**Start here → #0 [`getting_started/`](implementations/getting_started/)** — one CPI series, one month ahead. The smallest end-to-end loop: a `Predictor`, a `BacktestSpec` and `EvalSpec`, naive + AutoARIMA baselines, CRPS scoring. The place to learn the evaluation framework before picking a domain below. +**Start here → #0 [`getting_started/`](implementations/getting_started/)** — one CPI series, one month ahead. The smallest end-to-end loop: a `Predictor`, a `BacktestSpec` and `EvalSpec`, naive + AutoARIMA baselines, CRPS scoring. The place to learn the evaluation framework before picking a domain below. Also includes [`99_repo_concierge.ipynb`](implementations/getting_started/99_repo_concierge.ipynb) — a lite-model repo guide for “how does this codebase work?” questions (`uv run adk run implementations/getting_started/concierge_agent` from the repo root). 
| # | Implementation | The problem | Concepts & techniques it demonstrates | | --- | -------------------------------------------------------------------- | ------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | diff --git a/implementations/README.md b/implementations/README.md index bf89a80..cbe7a51 100644 --- a/implementations/README.md +++ b/implementations/README.md @@ -32,6 +32,8 @@ YAML backtest and eval specs live under each use case in `specs/`. Each director Every domain use case (all except `getting_started`) also ships a `starter_agent/` module and a `99_starter_agent.ipynb` — a fresh, hackable **starter agent** that is the consistent "build your own" entry point for that use case (toggleable news search + code execution, two lightweight tool-usage skills, an interactive cell, and one scored forecast). +`getting_started/` additionally ships a **`concierge_agent/`** module and **`99_repo_concierge.ipynb`** — a repo onboarding helper (not a forecaster) that answers questions about how the codebase works using a committed public-`main` knowledge digest. From the repository root: `uv run adk run implementations/getting_started/concierge_agent`. See [`getting_started/README.md`](getting_started/README.md) and the notebook for full usage. 
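The catalog lookup behind the concierge can be sketched in a few lines. What follows is a simplified, self-contained illustration of the token-count scoring that `catalog.py` in this PR appears to use. `Entry`, `search`, and the sample entries are hypothetical stand-ins chosen for the example, not the module's real API or the real catalog contents.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical, simplified stand-in for one catalog entry."""

    path: str
    summary: str


def tokenize(query: str) -> list[str]:
    # Lowercase the query and keep only tokens longer than two characters.
    return [t for t in re.findall(r"[a-zA-Z0-9_./-]+", query.lower()) if len(t) > 2]


def score(entry: Entry, terms: list[str]) -> int:
    # Count how often each term occurs in the entry's path and summary.
    haystack = f"{entry.path} {entry.summary}".lower()
    return sum(haystack.count(term) for term in terms)


def search(entries: list[Entry], query: str, limit: int = 8) -> list[Entry]:
    # Score every entry, drop zero-score entries, return the best `limit` hits.
    terms = tokenize(query)
    scored = [(score(e, terms), e) for e in entries]
    hits = [(s, e) for s, e in scored if s >= 1]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in hits[:limit]]


# Made-up entries for illustration only; not the real catalog.
ENTRIES = [
    Entry("pkg/data/service.py", "Data service registry and loaders"),
    Entry("pkg/evaluation/backtest.py", "Backtest runner"),
    Entry("notebooks/01_intro.ipynb", "Intro notebook"),
]

results = search(ENTRIES, "data service")
print([e.path for e in results])
```

Searching metadata first and fetching file bodies only for the top hits keeps each question cheap, which is presumably why the agent instruction asks for `search_repo_catalog` before `fetch_repo_artifact`.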
+ --- ## Relationship to `aieng-forecasting` diff --git a/implementations/getting_started/99_repo_concierge.ipynb b/implementations/getting_started/99_repo_concierge.ipynb new file mode 100644 index 0000000..8691ae4 --- /dev/null +++ b/implementations/getting_started/99_repo_concierge.ipynb @@ -0,0 +1,241 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Repo Concierge — ask questions about this codebase\n", + "\n", + "> **Note:** This agent uses a snapshot of the public `main` branch (not your local\n", + "> uncommitted changes or `data/` cache). Like any LLM, it can be wrong — verify\n", + "> important details against the repo or ask a facilitator.\n", + "\n", + "**Not sure how something works? Start here.**\n", + "\n", + "The repo concierge helps you **find your way** — it answers questions, points you\n", + "to the right notebooks and modules, and can quote short snippets so you know\n", + "where to dig deeper. Example questions:\n", + "\n", + "- *How do I create a new data service?*\n", + "- *How do I customize the way context is presented to an LLMP?*\n", + "- *What's the difference between `backtest()` and `evaluate()`?*\n", + "\n", + "It searches a committed **catalog** of the codebase (`search_repo_catalog` →\n", + "`fetch_repo_artifact`): full `aieng/forecasting`, reference implementations, and\n", + "notebooks (markdown + code cells). Domain `99_starter_agent.ipynb` notebooks are\n", + "for building forecasters; this one is your map of the repo.\n", + "\n", + "Live cells are gated by `RUN_AGENT` so `Run All` is safe and free; set it to `True`\n", + "to call the model." 
+ ], + "id": "cell-00" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "import warnings\n", + "from pathlib import Path\n", + "\n", + "from IPython.display import Markdown, display # noqa: A004\n", + "\n", + "\n", + "warnings.filterwarnings(\"ignore\")\n", + "\n", + "from dotenv import load_dotenv\n", + "\n", + "\n", + "def find_repo_root(start: Path | None = None) -> Path:\n", + " \"\"\"Walk upward until we find the workspace root.\"\"\"\n", + " here = (start or Path.cwd()).resolve()\n", + " for cand in (here, *here.parents):\n", + " if (cand / \"pyproject.toml\").exists() and (cand / \"aieng-forecasting\").is_dir():\n", + " return cand\n", + " return Path.cwd().resolve().parents[1]\n", + "\n", + "\n", + "ROOT = find_repo_root()\n", + "load_dotenv(ROOT / \".env\", override=False)\n", + "\n", + "# ── Model selection ───────────────────────────────────\n", + "# Concierge uses the lite/default model only.\n", + "AGENT_MODEL = \"gemini-3.1-flash-lite-preview\"\n", + "\n", + "# ── Run guard ──────────────────────────────────────\n", + "# Keep False so `Run All` makes no model calls; set to True to call the model.\n", + "RUN_AGENT = False\n", + "\n", + "from getting_started.concierge_agent import build_concierge_config\n", + "\n", + "\n", + "print(\"RUN_AGENT =\", RUN_AGENT, \"| model =\", AGENT_MODEL)" + ], + "execution_count": null, + "outputs": [], + "id": "cell-01" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## 1. Meet the concierge\n", + "\n", + "The agent uses a **catalog + artifacts** knowledge pack shipped under `concierge_agent/context/` — no build step for participants.\n", + "\n", + "1. **`search_repo_catalog`** — search metadata (paths, summaries, domains); cheap, run first.\n", + "2. **`fetch_repo_artifact`** — fetch full content for a catalog path (Python modules, READMEs, notebooks with **markdown + code cells**).\n", + "\n", + "Maintainers regenerate the pack from public `main` with `scripts/build_concierge_context.py` when library code or notebooks change. 
The `repo-navigation` skill has reference guides (no scripts)." + ], + "id": "cell-02" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "config = build_concierge_config(model=AGENT_MODEL)\n", + "\n", + "print(\"Agent:\", config.name)\n", + "print(\"Search enabled: \", config.context_retrieval.enabled)\n", + "print(\"Code-exec enabled: \", config.code_execution.enabled)\n", + "print(\"Skills loaded: \", [p.name for p in config.skills_dirs])\n", + "print(\"Extra tools: \", [getattr(t, \"__name__\", repr(t)) for t in config.extra_tools])\n", + "display(Markdown(\"### System instruction\\n\\n*Edit in `concierge_agent/agent.py`*\"))\n", + "display(Markdown(config.instruction))" + ], + "execution_count": null, + "outputs": [], + "id": "cell-03" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## 2. Try a seed question\n", + "\n", + "Edit `QUESTION` below, or jump to the next section for a multi-turn conversation." + ], + "id": "cell-04" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "from aieng.forecasting.methods.agentic import build_adk_agent\n", + "from aieng.forecasting.methods.agentic.adk_runner import AdkTextRunner, AdkTextRunnerConfig\n", + "\n", + "\n", + "QUESTION = \"How do I create a new data service?\"\n", + "\n", + "if RUN_AGENT:\n", + " chat_agent = build_adk_agent(config)\n", + " runner = AdkTextRunner(chat_agent, config=AdkTextRunnerConfig(app_name=\"repo_concierge_chat\"))\n", + " reply = await runner.run_text_async(QUESTION) # noqa: F704, PLE1142\n", + " display(Markdown(reply))\n", + "else:\n", + " print(\"RUN_AGENT is False — set it to True in the setup cell to ask the concierge.\")" + ], + "execution_count": null, + "outputs": [], + "id": "cell-05" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "QUESTION = \"How do I customize the way context is presented to an LLMP?\"\n", + "\n", + "if RUN_AGENT:\n", + " reply = await runner.run_text_async(QUESTION) # noqa: F704, 
F821, PLE1142\n", + " display(Markdown(reply))\n", + "else:\n", + " print(\"RUN_AGENT is False — set it to True to run this cell.\")" + ], + "execution_count": null, + "outputs": [], + "id": "cell-05b" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "QUESTION = \"What's the difference between backtest() and evaluate()?\"\n", + "\n", + "if RUN_AGENT:\n", + " reply = await runner.run_text_async(QUESTION) # noqa: F704, F821, PLE1142\n", + " display(Markdown(reply))\n", + "else:\n", + " print(\"RUN_AGENT is False — set it to True to run this cell.\")" + ], + "execution_count": null, + "outputs": [], + "id": "cell-05c" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "QUESTION = \"Where should I go after getting_started if I want to build agents?\"\n", + "\n", + "if RUN_AGENT:\n", + " reply = await runner.run_text_async(QUESTION) # noqa: F704, F821, PLE1142\n", + " display(Markdown(reply))\n", + "else:\n", + " print(\"RUN_AGENT is False — set it to True to run this cell.\")" + ], + "execution_count": null, + "outputs": [], + "id": "cell-05d" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## 3. Terminal mode — multi-turn conversations\n", + "\n", + "For extended back-and-forth, use the ADK CLI from the **repository root**:\n", + "\n", + "```bash\n", + "uv run adk run implementations/getting_started/concierge_agent\n", + "```\n", + "\n", + "That loads the same `repo_concierge` agent (`gemini-3.1-flash-lite-preview`) with\n", + "`search_repo_catalog`, `fetch_repo_artifact`, and the repo-navigation skill.\n", + "\n", + "**Alternative:** `uv run adk web implementations/getting_started/concierge_agent`\n", + "opens a browser UI (same agent). 
From `implementations/getting_started/`, you can\n", + "also use the shorter `uv run adk run concierge_agent`.\n", + "\n", + "---\n", + "\n", + "**Where next?** Forecasting starter agents live in each domain implementation's\n", + "`99_starter_agent.ipynb` (food, energy, BoC, S&P 500). This concierge helps you\n", + "navigate the repo — open one of those when you're ready to build and score a forecaster." + ], + "id": "cell-08" + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/implementations/getting_started/README.md b/implementations/getting_started/README.md index f548cfb..99e5638 100644 --- a/implementations/getting_started/README.md +++ b/implementations/getting_started/README.md @@ -139,6 +139,33 @@ against [`cpi_gasoline_eval_2025.yaml`](specs/cpi_gasoline_eval_2025.yaml) — monthly origins from Jan 2025 through Mar 2026, all currently resolved. `max_runs: 5` — spend deliberately. +### 6. Ask the repo concierge — `99_repo_concierge.ipynb` + +**Questions about how the repository works?** Open +[`99_repo_concierge.ipynb`](99_repo_concierge.ipynb) — a lite-model **repo +concierge** that answers onboarding questions, points you to the right notebooks +and modules, and can quote snippets from the committed public-`main` catalog. + +- Notebook cells are gated by `RUN_AGENT` (safe `Run All`). +- For longer conversations, run the ADK CLI from the **repository root**: + + ```bash + uv run adk run implementations/getting_started/concierge_agent + ``` + + (`uv run adk web implementations/getting_started/concierge_agent` opens the same + agent in a browser.) 
+ + From `implementations/getting_started/`, the shorter `uv run adk run concierge_agent` + works too. + +This is different from each domain's `99_starter_agent.ipynb` — those are +hackable **forecasting** agents; the concierge only explains the repo. + +Maintainers regenerate the catalog with +`uv run python scripts/build_concierge_context.py` when library code, +implementations, or notebooks change. + --- ## Where to go next @@ -166,9 +193,11 @@ what you're building: getting_started/ # this directory ├── README.md ├── specs/ # backtest and eval YAML +├── concierge_agent/ # repo concierge ADK agent + catalog + artifacts ├── 00_environment_check.ipynb # self-guided setup preflight — run this first ├── 01_cpi_data_exploration.ipynb -└── 02_cpi_backtest_demo.ipynb +├── 02_cpi_backtest_demo.ipynb +└── 99_repo_concierge.ipynb # ask questions about the repo (onboarding helper) ``` Reference predictors live in the `aieng-forecasting` package under diff --git a/implementations/getting_started/__init__.py b/implementations/getting_started/__init__.py new file mode 100644 index 0000000..13bc933 --- /dev/null +++ b/implementations/getting_started/__init__.py @@ -0,0 +1 @@ +"""Getting started reference implementation (notebooks + repo concierge agent).""" diff --git a/implementations/getting_started/concierge_agent/__init__.py b/implementations/getting_started/concierge_agent/__init__.py new file mode 100644 index 0000000..c84fffe --- /dev/null +++ b/implementations/getting_started/concierge_agent/__init__.py @@ -0,0 +1,21 @@ +"""Repo concierge agent — onboarding helper for the agentic-forecasting codebase. + +Exports the :class:`AgentConfig` factory and the knowledge-search tool. Pair +with ``99_repo_concierge.ipynb`` or ``adk run implementations/getting_started/concierge_agent`` +from the repository root. 
+""" + +from getting_started.concierge_agent.agent import build_concierge_config +from getting_started.concierge_agent.catalog import ( + fetch_repo_artifact, + search_repo_catalog, +) +from getting_started.concierge_agent.knowledge import search_repo_knowledge + + +__all__ = [ + "build_concierge_config", + "fetch_repo_artifact", + "search_repo_catalog", + "search_repo_knowledge", +] diff --git a/implementations/getting_started/concierge_agent/agent.py b/implementations/getting_started/concierge_agent/agent.py new file mode 100644 index 0000000..22ac40d --- /dev/null +++ b/implementations/getting_started/concierge_agent/agent.py @@ -0,0 +1,108 @@ +"""Repo concierge agent — onboarding helper for the agentic-forecasting codebase. + +A lightweight ADK agent powered by ``LITE_MODEL`` (``gemini-3.1-flash-lite-preview``). +It answers questions about the repository using a committed **catalog + artifacts** +snapshot of public ``main`` — not the participant's local workspace. + +Pair with ``99_repo_concierge.ipynb`` or ``adk run implementations/getting_started/concierge_agent`` +from the repository root. +""" + +from __future__ import annotations + +from pathlib import Path +from typing import Any + +from aieng.forecasting.methods.agentic import build_adk_agent +from aieng.forecasting.methods.agentic.agent_factory import ( + AgentConfig, + CodeExecutionConfig, + ContextRetrievalConfig, +) +from aieng.forecasting.models import LITE_MODEL +from getting_started.concierge_agent.catalog import fetch_repo_artifact, search_repo_catalog + + +_SKILLS_ROOT = Path(__file__).parent / "skills" +_REPO_NAV_SKILL = _SKILLS_ROOT / "repo-navigation" + + +def _build_concierge_instruction() -> str: + return ( + "## Role\n\n" + "You are the **repo concierge** for the agentic-forecasting bootcamp — a " + "friendly guide who helps participants understand the repository and find " + "their way to the right notebooks, modules, and patterns.\n\n" + "Answer questions clearly. 
Point people to **concrete paths** in the " + "codebase (READMEs, notebooks, specs, library modules) where they can " + "read more or try things themselves. When it helps, quote short snippets " + "from fetched artifacts — especially from notebooks and reference " + "implementations.\n\n" + "## How you work\n\n" + "- Ground answers in the committed catalog: call " + "``search_repo_catalog`` first, then ``fetch_repo_artifact`` for the " + "paths you need (usually one to three per question).\n" + "- Prefer showing *where* something lives and *how it fits together* " + "over long generic explanations.\n" + "- If someone is debugging or extending code, walk them through the " + "relevant files and patterns you find in the catalog; suggest what to " + "open next in their editor.\n" + "- Your knowledge reflects the committed public ``main`` snapshot — not " + "the participant's local ``.env``, ``data/`` cache, or uncommitted " + "changes. If the catalog does not cover something, say so and name the " + "best file to open or a facilitator to ask.\n\n" + "## Tone\n\n" + "- Concise, welcoming, and practical — short paragraphs and bullet lists.\n" + "- Always cite paths returned by the catalog.\n" + ) + + +_CONCIERGE_INSTRUCTION = _build_concierge_instruction() + +_SKILLS_SUPPLEMENT = """ + +## Skills + +You have one read-only skill: `repo-navigation` with reference files (catalog guide, +domain map). Load them via `load_skill_resource` when you need routing hints. + +**To use a skill:** +1. Call `list_skills` → `load_skill` → `load_skill_resource` as needed. + +These skills have NO scripts. Do not call `run_skill_script`. + +## Repo catalog tools (required workflow) + +1. **`search_repo_catalog(query, domain=None, kind=None)`** — search metadata only + (paths, summaries, section titles). Use `domain` filters like `core.data`, + `core.methods`, `impl.energy_oil_forecasting`, `scripts`, `docs`. + Use `kind` filters: `python`, `notebook`, `markdown`, `yaml`. +2. 
**`fetch_repo_artifact(path, section=None)`** — fetch full content for one catalog + path (optionally one heading/section). Fetch 1–3 artifacts per question. + +Do not answer implementation or API questions without fetching the relevant paths.\ +""" + + +def _full_instruction() -> str: + return _CONCIERGE_INSTRUCTION + _SKILLS_SUPPLEMENT + + +def build_concierge_config(*, model: str = LITE_MODEL) -> AgentConfig: + """Build the repo-concierge :class:`AgentConfig`.""" + return AgentConfig( + name="repo_concierge", + model=model, + instruction=_full_instruction(), + context_retrieval=ContextRetrievalConfig(), + code_execution=CodeExecutionConfig(), + skills_dirs=[_REPO_NAV_SKILL], + extra_tools=[search_repo_catalog, fetch_repo_artifact], + ) + + +def __getattr__(name: str) -> Any: + """Expose ``root_agent`` lazily for schema-free interactive use via ADK CLI.""" + if name == "root_agent": + return build_adk_agent(build_concierge_config()) + raise AttributeError(f"module {__name__!r} has no attribute {name!r}") diff --git a/implementations/getting_started/concierge_agent/catalog.py b/implementations/getting_started/concierge_agent/catalog.py new file mode 100644 index 0000000..379040e --- /dev/null +++ b/implementations/getting_started/concierge_agent/catalog.py @@ -0,0 +1,238 @@ +"""Runtime catalog search and artifact fetch for the repo concierge agent.""" + +from __future__ import annotations + +import re +from dataclasses import dataclass +from functools import lru_cache +from pathlib import Path +from typing import Any + +import yaml +from getting_started.concierge_agent.catalog_build import CatalogEntry + + +_CONTEXT_DIR = Path(__file__).parent / "context" +_MAX_CATALOG_HITS = 8 +_DEFAULT_FETCH_MAX_CHARS = 6000 +_MIN_SCORE = 1 +_HEADING_RE = re.compile(r"^#{1,4}\s+(.+)$", re.MULTILINE) + + +@dataclass(frozen=True) +class CatalogHit: + """A ranked catalog match (metadata only).""" + + path: str + kind: str + domain: str + summary: str + score: int + artifact: str + 
sections: list[str] + + +def _entry_from_dict(data: dict[str, Any]) -> CatalogEntry: + return CatalogEntry( + path=str(data["path"]), + kind=str(data.get("kind", "other")), + domain=str(data.get("domain", "other")), + summary=str(data.get("summary", "")), + symbols=[str(s) for s in data.get("symbols", [])], + sections=[str(s) for s in data.get("sections", [])], + chars=int(data.get("chars", 0)), + artifact=str(data["artifact"]), + ) + + +@lru_cache(maxsize=1) +def _load_catalog() -> dict[str, Any]: + catalog_path = _CONTEXT_DIR / "catalog.yaml" + if not catalog_path.is_file(): + msg = f"Concierge catalog not found: {catalog_path}. Run scripts/build_concierge_context.py" + raise FileNotFoundError(msg) + with catalog_path.open(encoding="utf-8") as fh: + data = yaml.safe_load(fh) + if not isinstance(data, dict): + msg = f"Invalid catalog format in {catalog_path}" + raise ValueError(msg) + return data + + +@lru_cache(maxsize=1) +def _load_entries() -> tuple[CatalogEntry, ...]: + catalog = _load_catalog() + raw_entries = catalog.get("entries", []) + if not isinstance(raw_entries, list): + return () + return tuple(_entry_from_dict(item) for item in raw_entries if isinstance(item, dict)) + + +def _tokenize(query: str) -> list[str]: + return [t for t in re.findall(r"[a-zA-Z0-9_./-]+", query.lower()) if len(t) > 2] + + +def _score_entry(entry: CatalogEntry, terms: list[str], domain: str | None, kind: str | None) -> int: + haystack = " ".join( + [ + entry.path, + entry.summary, + entry.domain, + entry.kind, + " ".join(entry.symbols), + " ".join(entry.sections), + ] + ).lower() + score = sum(haystack.count(term) for term in terms) + if domain and entry.domain == domain: + score += 5 + if kind and entry.kind == kind: + score += 3 + return score + + +def _normalize_domain(domain: str | None) -> str | None: + if domain is None: + return None + return domain.strip().lower() + + +def _normalize_kind(kind: str | None) -> str | None: + if kind is None: + return None + return 
kind.strip().lower() + + +def search_repo_catalog( + query: str, + domain: str | None = None, + kind: str | None = None, +) -> str: + """Search the committed repo catalog (metadata only). + + Returns matching paths, summaries, and section titles — not file bodies. + Follow up with :func:`fetch_repo_artifact` for content. + """ + terms = _tokenize(query) + if not terms: + return "No search terms found. Try e.g. 'DataService register' or 'energy notebook 02 agentic'." + + domain_filter = _normalize_domain(domain) + kind_filter = _normalize_kind(kind) + ranked: list[CatalogHit] = [] + for entry in _load_entries(): + if domain_filter and entry.domain != domain_filter: + continue + if kind_filter and entry.kind != kind_filter: + continue + score = _score_entry(entry, terms, domain_filter, kind_filter) + if score >= _MIN_SCORE: + ranked.append( + CatalogHit( + path=entry.path, + kind=entry.kind, + domain=entry.domain, + summary=entry.summary, + score=score, + artifact=entry.artifact, + sections=entry.sections[:5], + ) + ) + + if not ranked: + domains = sorted({e.domain for e in _load_entries()}) + return ( + f"No catalog matches for query={query!r}" + + (f", domain={domain!r}" if domain else "") + + (f", kind={kind!r}" if kind else "") + + f". Available domains: {', '.join(domains)}." 
+ ) + + ranked.sort(key=lambda hit: hit.score, reverse=True) + top = ranked[:_MAX_CATALOG_HITS] + + lines = [ + f"# Catalog search: {query}", + "", + "Metadata only — call `fetch_repo_artifact(path)` for full content.", + "", + ] + for i, hit in enumerate(top, start=1): + lines.append(f"## Match {i} (score={hit.score})") + lines.append(f"- **path:** `{hit.path}`") + lines.append(f"- **kind:** `{hit.kind}` | **domain:** `{hit.domain}`") + lines.append(f"- **summary:** {hit.summary}") + if hit.sections: + lines.append(f"- **sections:** {'; '.join(hit.sections[:3])}") + lines.append("") + return "\n".join(lines) + + +def _find_entry_by_path(path: str) -> CatalogEntry | None: + normalized = path.strip().replace("\\", "/") + for entry in _load_entries(): + if entry.path == normalized: + return entry + return None + + +def _extract_section(body: str, section: str) -> str | None: + needle = section.strip().lower() + if not needle: + return None + parts = re.split(r"\n(?=#{1,4} )", body) + for part in parts: + heading_match = _HEADING_RE.match(part.strip()) + if heading_match and needle in heading_match.group(1).lower(): + return part.strip() + if needle in part[:120].lower(): + return part.strip() + return None + + +def fetch_repo_artifact( + path: str, + section: str | None = None, + max_chars: int = _DEFAULT_FETCH_MAX_CHARS, +) -> str: + """Fetch one pre-built artifact by repo-relative ``path``. + + Parameters + ---------- + path : str + Repo-relative path as listed in the catalog (e.g. + ``aieng-forecasting/aieng/forecasting/data/service.py``). + section : str or None + Optional heading substring to return one section only. + max_chars : int + Hard cap on returned characters. + """ + entry = _find_entry_by_path(path) + if entry is None: + return f"No catalog entry for path={path!r}. Call `search_repo_catalog` first." 
+ + artifact_path = _CONTEXT_DIR / entry.artifact + if not artifact_path.is_file(): + return f"Artifact missing for {path!r}: {entry.artifact}" + + body = artifact_path.read_text(encoding="utf-8") + if section: + extracted = _extract_section(body, section) + body = extracted or (f"(Section {section!r} not found in artifact; showing beginning.)\n\n" + body[:max_chars]) + + if len(body) > max_chars: + body = body[:max_chars] + "\n…\n" + return body + + +def clear_catalog_cache() -> None: + """Clear cached catalog reads (for tests).""" + _load_catalog.cache_clear() + _load_entries.cache_clear() + + +__all__ = [ + "clear_catalog_cache", + "fetch_repo_artifact", + "search_repo_catalog", +] diff --git a/implementations/getting_started/concierge_agent/catalog_build.py b/implementations/getting_started/concierge_agent/catalog_build.py new file mode 100644 index 0000000..9b5638f --- /dev/null +++ b/implementations/getting_started/concierge_agent/catalog_build.py @@ -0,0 +1,341 @@ +"""Build the repo concierge catalog and per-source artifacts (maintainer-only).""" + +from __future__ import annotations + +import ast +import json +import re +import shutil +import subprocess +from dataclasses import dataclass +from datetime import UTC, datetime +from pathlib import Path +from typing import Any, Literal + +import yaml + + +REPO_URL = "https://github.com/VectorInstitute/agentic-forecasting" +DEFAULT_BRANCH = "main" +CORE_PREFIX = "aieng-forecasting/aieng/forecasting" + +Kind = Literal["python", "markdown", "notebook", "yaml", "shell"] + +_SKIP_IMPL_PARTS = frozenset({"tests", "context", "__pycache__"}) +_HEADING_RE = re.compile(r"^#{1,4}\s+(.+)$", re.MULTILINE) + + +@dataclass(frozen=True) +class CatalogEntry: + """One indexed source file in the public repo snapshot.""" + + path: str + kind: str + domain: str + summary: str + symbols: list[str] + sections: list[str] + chars: int + artifact: str + + +def repo_root_from_here() -> Path: + """Return repository root (parent of 
``implementations/``).""" + return Path(__file__).resolve().parents[3] + + +def context_dir(repo_root: Path | None = None) -> Path: + root = repo_root or repo_root_from_here() + return root / "implementations/getting_started/concierge_agent/context" + + +def path_to_artifact_slug(rel_path: str) -> str: + return rel_path.replace("/", "__") + + +_DOMAIN_RULES: tuple[tuple[str, str], ...] = ( + (f"{CORE_PREFIX}/data", "core.data"), + (f"{CORE_PREFIX}/evaluation", "core.evaluation"), + (f"{CORE_PREFIX}/methods", "core.methods"), + (f"{CORE_PREFIX}/documents", "core.documents"), + (f"{CORE_PREFIX}/", "core.root"), +) + + +def infer_domain(rel_path: str) -> str: + """Map a repo-relative path to a catalog domain tag.""" + for prefix, domain in _DOMAIN_RULES: + if rel_path.startswith(prefix): + return domain + if rel_path.startswith("implementations/"): + parts = rel_path.split("/") + if len(parts) >= 2: + return f"impl.{parts[1]}" + if rel_path.startswith("scripts/"): + return "scripts" + if rel_path.startswith(("docs/", "planning-docs/")) or rel_path in {"README.md", "AGENTS.md"}: + return "docs" + return "other" + + +def infer_kind(rel_path: str) -> Kind: + suffix = Path(rel_path).suffix.lower() + if suffix == ".py": + return "python" + if suffix == ".ipynb": + return "notebook" + if suffix in {".yaml", ".yml"}: + return "yaml" + if suffix == ".md": + return "markdown" + return "shell" + + +def _first_paragraph(text: str) -> str: + stripped = text.strip() + if not stripped: + return "" + return stripped.split("\n\n")[0].replace("\n", " ").strip()[:240] + + +def _extract_headings(text: str) -> list[str]: + return [m.group(1).strip() for m in _HEADING_RE.finditer(text)][:40] + + +def _analyze_python(source: str) -> tuple[str, list[str]]: + try: + tree = ast.parse(source) + except SyntaxError: + return "", [] + summary = _first_paragraph(ast.get_docstring(tree) or "") + symbols: list[str] = [] + for node in tree.body: + if isinstance(node, ast.ClassDef): + 
symbols.append(node.name) + elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)): + if not node.name.startswith("_"): + symbols.append(node.name) + elif isinstance(node, ast.Assign): + for target in node.targets: + if ( + isinstance(target, ast.Name) + and target.id == "__all__" + and isinstance(node.value, (ast.List, ast.Tuple)) + ): + for elt in node.value.elts: + if isinstance(elt, ast.Constant) and isinstance(elt.value, str): + symbols.append(elt.value) + return summary, symbols[:30] + + +def _notebook_to_markdown(rel_path: str, raw: str) -> tuple[str, str, list[str]]: + nb = json.loads(raw) + lines = [f"# Source: {rel_path}", "", "kind: notebook", ""] + sections: list[str] = [] + for idx, cell in enumerate(nb.get("cells", []), start=1): + cell_type = cell.get("cell_type", "") + source = "".join(cell.get("source", [])) + if not source.strip(): + continue + if cell_type == "markdown": + lines.extend([f"## Cell {idx} (markdown)", "", source.rstrip(), ""]) + first = source.strip().splitlines()[0] if source.strip() else "" + if first.startswith("#"): + sections.append(first.lstrip("#").strip()) + elif cell_type == "code": + lines.extend([f"## Cell {idx} (code)", "", "```python", source.rstrip(), "```", ""]) + body = "\n".join(lines) + title = sections[0] if sections else Path(rel_path).stem.replace("_", " ") + summary = title[:240] + return body, summary, sections[:40] + + +def _markdown_summary_and_sections(body: str, *, fallback: str) -> tuple[str, list[str]]: + sections = _extract_headings(body) + summary = sections[0] if sections else _first_paragraph(body) or fallback + return summary[:240], sections + + +def _collect_core_paths(repo_root: Path) -> set[Path]: + paths: set[Path] = set() + core = repo_root / CORE_PREFIX + if core.is_dir(): + for path in core.rglob("*"): + if path.is_file() and path.suffix in {".py", ".md"} and "__pycache__" not in path.parts: + paths.add(path) + return paths + + +def _collect_impl_paths(repo_root: Path) -> set[Path]: + 
paths: set[Path] = set() + impl_root = repo_root / "implementations" + if not impl_root.is_dir(): + return paths + for path in impl_root.rglob("*"): + if not path.is_file(): + continue + if _SKIP_IMPL_PARTS.intersection(path.parts): + continue + if "curriculum" in path.parts and "context" in path.parts: + continue + if path.suffix in {".py", ".md", ".ipynb"} or ( + path.parent.name == "specs" and path.suffix in {".yaml", ".yml"} + ): + paths.add(path) + return paths + + +def collect_source_paths(repo_root: Path) -> list[Path]: + """Collect all concierge-indexed paths under the repo snapshot.""" + paths = _collect_core_paths(repo_root) | _collect_impl_paths(repo_root) + + for rel in ( + "README.md", + "AGENTS.md", + "implementations/README.md", + "planning-docs/roadmap.md", + "docs/adk-skills-guide.md", + ): + candidate = repo_root / rel + if candidate.is_file(): + paths.add(candidate) + + scripts = repo_root / "scripts" + if scripts.is_dir(): + for path in scripts.glob("fetch_*.py"): + paths.add(path) + + return sorted(paths, key=lambda p: p.relative_to(repo_root).as_posix()) + + +def build_entry(repo_root: Path, path: Path) -> tuple[CatalogEntry, str]: + rel = path.relative_to(repo_root).as_posix() + kind = infer_kind(rel) + domain = infer_domain(rel) + raw = path.read_text(encoding="utf-8", errors="replace") + + symbols: list[str] = [] + sections: list[str] = [] + if kind == "python": + summary, symbols = _analyze_python(raw) + if not summary: + summary = Path(rel).name + body = f"# Source: {rel}\n\nkind: python\n\n```python\n{raw.rstrip()}\n```\n" + elif kind == "notebook": + body, summary, sections = _notebook_to_markdown(rel, raw) + elif kind == "markdown": + summary, sections = _markdown_summary_and_sections(raw, fallback=Path(rel).name) + body = f"# Source: {rel}\n\nkind: markdown\n\n{raw.rstrip()}\n" + elif kind == "yaml": + summary, sections = _markdown_summary_and_sections(raw, fallback=Path(rel).name) + body = f"# Source: {rel}\n\nkind: 
yaml\n\n```yaml\n{raw.rstrip()}\n```\n" + else: + summary = Path(rel).name + body = f"# Source: {rel}\n\nkind: shell\n\n```\n{raw.rstrip()}\n```\n" + + artifact_rel = f"artifacts/{path_to_artifact_slug(rel)}.md" + entry = CatalogEntry( + path=rel, + kind=kind, + domain=domain, + summary=summary, + symbols=symbols, + sections=sections, + chars=len(body), + artifact=artifact_rel, + ) + return entry, body + + +def git_ref(repo_root: Path) -> str: + try: + return subprocess.check_output( + ["git", "rev-parse", "HEAD"], + cwd=repo_root, + text=True, + ).strip() + except (subprocess.CalledProcessError, FileNotFoundError): + return "unknown" + + +def build_catalog(repo_root: Path | None = None) -> Path: + """Walk the repo, write ``catalog.yaml`` and per-source artifacts.""" + root = repo_root or repo_root_from_here() + out_dir = context_dir(root) + artifacts_dir = out_dir / "artifacts" + if artifacts_dir.exists(): + shutil.rmtree(artifacts_dir) + artifacts_dir.mkdir(parents=True, exist_ok=True) + + entries: list[CatalogEntry] = [] + for path in collect_source_paths(root): + entry, body = build_entry(root, path) + entries.append(entry) + artifact_path = out_dir / entry.artifact + artifact_path.parent.mkdir(parents=True, exist_ok=True) + artifact_path.write_text(body, encoding="utf-8") + + built_at = datetime.now(tz=UTC).replace(microsecond=0).isoformat() + catalog = { + "source_url": REPO_URL, + "git_ref": git_ref(root), + "branch": DEFAULT_BRANCH, + "built_at": built_at, + "ingest_source": str(root), + "entry_count": len(entries), + "entries": [ + { + "path": e.path, + "kind": e.kind, + "domain": e.domain, + "summary": e.summary, + "symbols": e.symbols, + "sections": e.sections, + "chars": e.chars, + "artifact": e.artifact, + } + for e in entries + ], + } + catalog_path = out_dir / "catalog.yaml" + catalog_path.write_text(yaml.safe_dump(catalog, sort_keys=False), encoding="utf-8") + + # Remove legacy topic-blob digests if present. 
+ for legacy in ( + "overview.md", + "core_library.md", + "methods.md", + "implementations.md", + "extension_guides.md", + "manifest.yaml", + ): + legacy_path = out_dir / legacy + if legacy_path.is_file(): + legacy_path.unlink() + + _sync_skill_catalog_summary(catalog, root) + return out_dir + + +def _sync_skill_catalog_summary(catalog: dict[str, Any], repo_root: Path) -> None: + """Write a compact domain summary for the repo-navigation skill.""" + entries = catalog.get("entries", []) + domains: dict[str, int] = {} + for entry in entries: + if isinstance(entry, dict): + domain = str(entry.get("domain", "other")) + domains[domain] = domains.get(domain, 0) + 1 + summary = { + "source_url": catalog.get("source_url"), + "branch": catalog.get("branch"), + "built_at": catalog.get("built_at"), + "git_ref": catalog.get("git_ref"), + "entry_count": catalog.get("entry_count"), + "domains": domains, + } + out = ( + repo_root + / "implementations/getting_started/concierge_agent/skills/repo-navigation/references/catalog-summary.yaml" + ) + header = "# Concierge catalog summary (regenerated by scripts/build_concierge_context.py)\n" + out.write_text(header + yaml.safe_dump(summary, sort_keys=False), encoding="utf-8") diff --git a/implementations/getting_started/concierge_agent/context/artifacts/AGENTS.md.md b/implementations/getting_started/concierge_agent/context/artifacts/AGENTS.md.md new file mode 100644 index 0000000..a7d37df --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/AGENTS.md.md @@ -0,0 +1,67 @@ +# Source: AGENTS.md + +kind: markdown + +# AGENTS.md + +## How to use this file + +Instructions here are **general when possible, specific when needed.** Prefer patterns and principles over static lists — static lists go stale. When something is specific (a command, a maintenance contract, a non-obvious convention), it is specific for a reason. 
+ +--- + +## Project documentation + +### Documentation is part of every change (hard rule) + +**Any change to code, features, datasets, methods, specs, notebooks, or observable behavior must update the docs that describe it, in the same change.** Docs are part of the product and part of the definition of "done." A change that lands working code but leaves a README, the root `README.md`, the method catalog, or `planning-docs/roadmap.md` describing the old reality is a **regression** — treat it exactly as you would a failing test, not as follow-up work. + +So "done" always includes a documentation reconciliation step. Before considering any change complete: + +1. **Grep for what you touched** across docs — the feature name, the module/class/function, the dataset, the spec, the notebook. The fast version: `grep -rn "" --include="*.md" .` (and check notebook markdown cells). Don't rely on memory for where something is mentioned. +2. **Reconcile every hit.** If a doc calls something "planned", "deferred", "not yet wired in", a "seam", or "out of scope" and you just made it real, update that wording. If a doc lists files, notebooks, predictors, specs, or data sources and you added or removed one, fix the list. If you changed a default, a metric, or a command, fix it everywhere it appears. +3. **Update the layered docs together**, not just the nearest one: the use-case README (most detail), the reference-implementations table in the root `README.md`, the method catalog (`aieng-forecasting/aieng/forecasting/methods/README.md`) when you touch a reusable predictor, and `planning-docs/roadmap.md` when something moves from "extension idea" to "shipped". 
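Step 1 of the list above (grep for what you touched) can also be sketched in Python. This is a minimal illustration, not part of the repository: `find_doc_mentions` is a hypothetical helper name, it only scans `*.md` files, and it skips `.venv/` on the assumption that vendored docs are not relevant. Notebook markdown cells would still need a separate check.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def find_doc_mentions(root: Path, term: str) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, line text) for each markdown line containing ``term``.

    Hypothetical helper for illustration only; it is not defined anywhere in the repo.
    """
    hits: list[tuple[str, int, str]] = []
    for path in sorted(root.rglob("*.md")):
        rel = path.relative_to(root)
        # Assumption: anything under .venv/ is third-party and can be ignored.
        if not path.is_file() or ".venv" in rel.parts:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if term in line:
                hits.append((rel.as_posix(), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    # Tiny self-contained demo on a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        demo_root = Path(tmp)
        (demo_root / "README.md").write_text("Uses the FREDAdapter cache.\n", encoding="utf-8")
        for rel_path, lineno, line in find_doc_mentions(demo_root, "FREDAdapter"):
            print(f"{rel_path}:{lineno}: {line}")
```

Every hit is a candidate for step 2: reconcile it or confirm it is still accurate.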
+ +Concrete example: integrating Canada's Food Price Report PDFs into the food-price LLM-Process prompt is **not done when the code runs** — it is done when `implementations/food_price_forecasting/README.md` (which currently frames report→prompt wiring as a deferred extension) and the "Reports as predictor context" entry in `planning-docs/roadmap.md` no longer describe it as future work. Shipping the code while those still say "deferred" is the regression the reviewer should catch. + +The two subsections below are the map of where docs live, so the reconciliation in step 3 is quick. + +### planning-docs/ + +`./planning-docs/roadmap.md` captures the architecture principles worth preserving and the catalog of extension ideas. It is the place for cross-cutting design notes, not per-task tracking. + +The older planning log, backlog, project charter, and technical-design files under `planning-docs/` (and `planning-docs/archive/`) are retired and kept only for continuity — do not add new decisions to them. When a change affects architecture, datasets, repo layout, or the set of reference implementations, update `planning-docs/roadmap.md` (for an architectural principle or a new extension idea) and the relevant README files in the same session. + +Project shape to keep in mind: + +- The core library `aieng.forecasting` owns stable infrastructure; reusable predictors live in `aieng.forecasting.methods`; use-case material lives in `implementations//`. +- YAML specs are co-located under `implementations//specs/`. +- Reference implementations: Getting Started, Food Price Forecasting, Energy/Oil (stateless capability track plus an adaptive learning agent), BoC Rate Decisions (quantitative path, cutoff-aware press-release ingestion, and a reasoning-alignment evaluator), and S&P 500 (in active development). +- Energy/oil's older information-session notebooks are archived under `playground/energy_case_study/`. 
+- Continuous and discrete-event forecasts are output modalities; numerical methods, LLM Processes, and agentic forecasters are method families that apply to either. + +### README files + +Search the repo for `README.md` files (excluding `.venv/`) to find every README — there is one at the root, one per package (`aieng-forecasting/`, `implementations/`), the method catalog under `aieng-forecasting/aieng/forecasting/methods/`, and one per use case under `implementations//`. These are the primary user surface and the first thing a new contributor reads; the reconciliation rule above applies to all of them. Keep them accurate and production-quality: describe what the code does and what you can build from it, with no internal program, scheduling, or ownership framing. + +--- + +## Development conventions + +### Data cache + +Historical data is stored in `data/` at the repo root (gitignored). Before running notebooks or scripts that depend on live data, populate the cache by running the relevant script in `scripts/` (e.g. `uv run python scripts/fetch_cpi.py`). Never commit data files. + +### Model selection + +The project standardizes on **two** Vector-proxy models so examples stay consistent: `gemini-3.1-flash-lite-preview` (the **lite / default** model) and `gemini-3.5-flash` (the **advanced** model, used for the adaptive-agent path and curriculum runs). Both are defined once in `aieng.forecasting.models` as `LITE_MODEL` / `ADVANCED_MODEL` (`DEFAULT_MODEL = LITE_MODEL`). Reference these constants in code rather than hardcoding model strings; notebooks pick one of the two literals with the other shown as a commented alternative. See `planning-docs/vector-llm-proxy.md` for the full convention. + +### Code quality (not on commit) + +Git commits **do not** run automated hooks locally. Run **`make lint`** (ruff format + ruff check + mypy on `aieng`) before pushing — a passing `make lint` means CI will be happy with the code. To fully mirror CI (yaml checks, uv-lock, etc.) 
run **`uv run pre-commit run --all-files`**. CI on `main` runs the same `pre-commit` config. + +Notebook outputs **are** committed at the author's discretion — `nbstripout` is not in the pre-commit config. Strip outputs manually before committing if you don't want them in the repo. + +### Test philosophy + +Tests should justify their existence. Write tests for: non-obvious logic that is easy to get wrong, defensive contracts (e.g. copy-on-return), and error paths where the message matters. Do not write tests for: Pydantic model construction (Pydantic already validates this), trivial Python behaviour (sorted lists, empty dicts), or mock-interaction assertions that test implementation rather than behaviour. When in doubt, fewer focused tests are better than many shallow ones. diff --git a/implementations/getting_started/concierge_agent/context/artifacts/README.md.md b/implementations/getting_started/concierge_agent/context/artifacts/README.md.md new file mode 100644 index 0000000..bd4d6a0 --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/README.md.md @@ -0,0 +1,207 @@ +# Source: README.md + +kind: markdown + +# Agentic Forecasting + +A foundation for building, evaluating, and comparing forecasting systems — conventional numerical models, LLM Processes, and agentic forecasters — on real economic, financial, and event-prediction tasks. + +The repository pairs a small, stable core library with a set of self-contained reference implementations. The library gives you cutoff-safe data handling, a single `Predictor` interface, and a backtest/evaluation harness. Each reference implementation is a worked example of a different forecasting problem and the techniques that suit it. Start from whichever one is closest to what you want to build. + +> **👉 First time here? 
Run the environment check.** After `uv sync` (see [Setup](#setup)), open [`implementations/getting_started/00_environment_check.ipynb`](implementations/getting_started/00_environment_check.ipynb) and run it top to bottom. It's a self-guided preflight that verifies every capability — proxy LLM inference, Langfuse, E2B code execution, StatCan/FRED data access, and an end-to-end mini backtest — and tells you exactly what to fix when something isn't set up. **Do this before anything else.** + +## What's here + +- **Core library** — `aieng-forecasting` (`aieng.forecasting`): data services, cutoff enforcement, forecasting tasks, prediction payloads, backtesting, evaluation, and artifacts. +- **Reusable methods** — `aieng.forecasting.methods`: `Predictor` implementations including naive baselines (continuous, binary, and categorical), Darts numerical predictors, LLM-process predictors (continuous, binary-probability, and categorical-probability), and ADK-based agentic infrastructure (`build_adk_agent`, `AdkTextRunner`, `AgentPredictor`). +- **Reference implementations** — `implementations//`: notebooks, helper modules, task-specific configuration, and co-located YAML specs. +- **Tracing** — Langfuse / OpenTelemetry bootstrap (`aieng.forecasting.langfuse_tracing`) for LiteLLM and Google ADK. +- **Data scripts** — `scripts/`: one fetch script per data source, plus `build_e2b_template.py` for the agentic code-execution sandbox. + +## Two ways to use a forecaster + +Every method can be used in one of two modes, and the distinction runs through the library: + +- **Track 1 — evaluated prediction.** Numerical methods, LLM Processes, and agentic forecasters emit standardized `Prediction` objects and are compared head-to-head with the evaluation harness (CRPS, Brier, RPS, calibration). 
+- **Track 2 — interactive analysis.** The same agents can do scenario analysis, monitoring, open-ended Q&A, code-backed analysis, and reasoning over evidence — useful work that isn't reduced to a single score. + +## Reference implementations + +Each is independent and self-contained — pick the one that matches the problem you care about, and read that directory's `README.md` for the full walkthrough. They are numbered in a recommended order that mirrors the bootcamp progression — conventional numerical methods → LLM Processes → agents → agentic evaluation — but any one stands on its own, so jump straight to the problem you care about. + +**Start here → #0 [`getting_started/`](implementations/getting_started/)** — one CPI series, one month ahead. The smallest end-to-end loop: a `Predictor`, a `BacktestSpec` and `EvalSpec`, naive + AutoARIMA baselines, CRPS scoring. The place to learn the evaluation framework before picking a domain below. Also includes [`99_repo_concierge.ipynb`](implementations/getting_started/99_repo_concierge.ipynb) — a lite-model repo guide for “how does this codebase work?” questions. + +| # | Implementation | The problem | Concepts & techniques it demonstrates | +| --- | -------------------------------------------------------------------- | ------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| 1 | [`sp500_forecasting/`](implementations/sp500_forecasting/) | S&P 500 returns under a macro/market covariate panel. | A head-to-head of conventional numerical methods (naive, ETS, Kalman, AutoARIMA, linear regression, LightGBM) plus a covariate-aware LLM-Process, all reading the same leak-safe covariate panel. 
Cumulative-return targets at 1/5/21-business-day horizons, CRPS + direction metrics, config-driven specs. | +| 2 | [`food_price_forecasting/`](implementations/food_price_forecasting/) | A multivariate food-CPI trajectory, in the style of Canada's Food Price Report. | Nine correlated sub-indices, a 12-step trajectory, a domain metric (avg/avg YoY), baselines vs LLM-Process predictors, leakage-aware backtests, and cached artifacts for fast iteration. | +| 3 | [`energy_oil_forecasting/`](implementations/energy_oil_forecasting/) | Daily WTI crude-oil price under regime-breaking news. | A capability progression — Prophet → LLM-Process → news-grounded agent → code-executing agent — plus an adaptive agent that learns a strategy from data and is scored before vs after. Continuous trajectories, a binary up-shock task, and interactive scenario analysis. | +| 4 | [`boc_rate_decisions/`](implementations/boc_rate_decisions/) | Will the Bank of Canada cut, hold, or hike at its next meeting? | Discrete-event forecasting: ordered-categorical outcomes on an irregular calendar, RPS scoring and one-vs-rest calibration (instead of CRPS), a binary (Brier) special case, cutoff-aware document ingestion, and an LLM-as-judge that scores an agent's reasoning against the official rationale. | + +**Not sure where to start building?** Each of the four domain implementations above ends with a `99_starter_agent.ipynb` — a fresh, hackable **starter agent** (a `starter_agent/` module) with toggleable news search and code execution, two lightweight tool-usage skills, an interactive cell, and one scored forecast. It's the consistent "continue from here" entry point for taking any reference use case in an agentic direction, and a quick end-to-end test of that use case's agent stack. + +## Time Series Data sources + +- **StatCan** — Canadian CPI and related macroeconomic series. +- **FRED** — macroeconomic and commodity series. +- **yfinance** — equities, indices, and commodity futures. 
+ +Historical data is cached locally under `data/` and is not committed. Each implementation's README names the fetch script(s) it needs. + +### FRED API key + +Several reference implementations (S&P 500, BoC rate decisions) fetch data from the Federal Reserve Economic Data (FRED) API, which requires a free personal API key. **We cannot provide this key for you** — each participant must request their own at: + +> [https://fred.stlouisfed.org/docs/api/api_key.html](https://fred.stlouisfed.org/docs/api/api_key.html) + +FRED keys are free and approval is typically quick, but it can occasionally take some time, so request yours early. When asked for a use-case description, something extended from the following works well: + +> "Requesting an API key to explore the effectiveness of various forecasting techniques on economic data." + +Once you have the key, add it to your repo-root `.env`: + +``` +FRED_API_KEY=your_fred_api_key +``` + +On Coder workspaces, bootcamp keys (`OPENAI_*`, `E2B_*`, `LANGFUSE_*`) live in your shell environment — **not** in repo `.env`. See [Bootcamp environment](#bootcamp-environment-coder). + +## Repository layout + +```text +aieng-forecasting/ # Installable library: import as aieng.forecasting +implementations/ # Self-contained reference implementations + co-located specs +scripts/ # Data-fetch scripts + E2B template builder +tests/ # Onboarding integration tests (not run in CI) +planning-docs/ # Architecture notes and the extension/roadmap catalog +playground/ # Exploration and archived demos (not reference implementations) +``` + +## Setup + +Install dependencies from the repo root: + +```bash +git clone . # If running locally. Coder environment setup clones repo automatically. +cd agentic-forecasting +uv sync --dev +``` + +**macOS — LightGBM and OpenMP.** The library depends on **LightGBM** (used by `DartsLightGBMPredictor` and some notebooks). The PyPI wheel expects **OpenMP** at runtime. 
If you see `Library not loaded: @rpath/libomp.dylib` when importing or training, install Homebrew's OpenMP once and restart your shell or Jupyter kernel: + +```bash +brew install libomp +``` + +On Apple Silicon the dylib is typically under `/opt/homebrew/opt/libomp/lib/`; on Intel Homebrew, `/usr/local/opt/libomp/lib/`. + +### Coder Workspaces + +When you open a **Coder workspace**, startup runs automatically in the background. By the time you connect you should have: + +- The repo cloned, a Python venv, and dependencies installed +- Bootcamp API keys (`OPENAI_*`, `E2B_*`, `LANGFUSE_*`) available in your shell (not in `.env`) +- A shell that opens in the repo with the venv activated + +**Your next step:** run [`00_environment_check.ipynb`](implementations/getting_started/00_environment_check.ipynb) top to bottom. That notebook will confirm that startup succeeded. + +On first boot, keys are verified against live services and your onboarding status is recorded. Workspace restarts reload keys without re-running the full test suite. + +**Local machine or troubleshooting** — fetch and verify keys manually: + +```bash +eval "$(onboard --bootcamp-name agentic-forecasting --test-script tests/test_integration.py)" +``` + +Reload keys in a new shell without re-testing: + +```bash +eval "$(onboard --bootcamp-name agentic-forecasting --skip-test)" +``` + +Headless verification (same checks as first-boot onboarding): + +```bash +uv sync --all-extras --dev --all-packages +uv run pytest tests/test_integration.py -v +``` + +**Credential model:** bootcamp keys live in your shell environment. Optional personal keys (e.g. `FRED_API_KEY`) go in a `.env` only — see [`.env.example`](.env.example). + +### Verify your environment first + +New to the project? Open [`implementations/getting_started/00_environment_check.ipynb`](implementations/getting_started/00_environment_check.ipynb) and run it top to bottom. 
It's a self-guided preflight that checks every major capability — proxy LLM inference, Langfuse, E2B code execution, StatCan/FRED data access, and a full end-to-end mini backtest — one cell at a time, and tells you exactly what to fix when something isn't set up (most often a missing or placeholder key in your `.env`). It's the fastest way to confirm setup before working through the reference implementations. + +### Populate the data cache + +Data is fetched once and cached locally (gitignored). Each implementation names the fetch script(s) it needs in its own `README.md` — for example `scripts/fetch_cpi.py` (getting started), `scripts/fetch_sp500_market.py` + `scripts/fetch_fred.py` (S&P 500), `scripts/fetch_wti.py` (energy), and `scripts/fetch_boc.py` and `scripts/fetch_boc_press_releases.py` (BoC). Run the relevant one before opening that implementation's notebooks: + +```bash +uv run python scripts/fetch_cpi.py +``` + +### Build the E2B sandbox image (agentic implementations only) + +Agentic forecasters can run code in an E2B cloud sandbox. Credentials for e2b should be automatically injected into the environment for bootcamp participants, and you can confirm successful setup by running [`00_environment_check.ipynb`](implementations/getting_started/00_environment_check.ipynb). + +If this was unsuccessful, or if you prefer to run with E2B in an alternative environment, do this once before enabling code execution in `build_adk_agent`: + +1. Create a free account at [e2b.dev](https://e2b.dev) and copy your API key. +2. Add it to your `.env` file alongside the other keys (see `.env.example`): + + ``` + E2B_API_KEY=your_e2b_api_key + ``` + +1. Build the template (takes a few minutes on first run): + + ```bash + uv run --env-file .env scripts/build_e2b_template.py + ``` + +The template name is the default in `CodeExecutionConfig.template_name`, so notebooks pick it up automatically. 
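Before building the template, it can help to confirm that the key in `.env` is real rather than a leftover placeholder. The sketch below is an illustration under stated assumptions: `parse_env_text` and `key_status` are hypothetical helpers (not part of the repo or of E2B), the file is assumed to use plain `KEY=VALUE` lines, and a value starting with `your_` is treated as a placeholder because that is the pattern used in the examples above.

```python
from __future__ import annotations


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def key_status(env: dict[str, str], name: str) -> str:
    """Classify a key as 'missing', 'placeholder', or 'set'."""
    value = env.get(name, "")
    if not value:
        return "missing"
    # Assumption: placeholders follow the `your_...` pattern shown in the README examples.
    if value.startswith("your_"):
        return "placeholder"
    return "set"


if __name__ == "__main__":
    sample = "# local keys\nE2B_API_KEY=your_e2b_api_key\n"
    print(key_status(parse_env_text(sample), "E2B_API_KEY"))
```

To check a real file, pass `Path(".env").read_text()` to `parse_env_text`. The environment-check notebook remains the authoritative preflight, since it exercises the live services.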
+ +## Core concepts + +`Predictor` is the interface every forecasting method implements: + +```python +class MyPredictor(Predictor): + @property + def predictor_id(self) -> str: + return "my_predictor" + + def predict(self, task: ForecastingTask, context: ForecastContext) -> list[Prediction]: + series = context.get_series(task.target_series_id) + ... + return [Prediction(...)] +``` + +`ForecastContext` is cutoff-scoped. Predictors only see observations available as of the forecast origin, which keeps backtests honest. + +`backtest()` is the open iteration loop against historical data. `evaluate()` is the budgeted protected-window loop. + +## Extending the foundation + +This repo is a starting point, not a finished product. The shape of a new forecaster is always the same: implement `Predictor`, declare a spec, and run `backtest()` / `evaluate()` to compare it against the baselines. Each reference implementation's README ends with concrete extension ideas; `planning-docs/roadmap.md` collects the cross-cutting ones (new data sources, additional methods, live forecasting, deeper agent work). + +## Code quality + +```bash +make lint +make format +``` + +`make lint` runs the expected pre-push quality checks. Git commits do not run hooks locally. To mirror the full pre-commit suite, run: + +```bash +uv run pre-commit run --all-files +``` + +## Documentation + +- Per-implementation READMEs under [`implementations/`](implementations/) — the primary user surface. +- [`aieng-forecasting/README.md`](aieng-forecasting/README.md) and [`aieng-forecasting/aieng/forecasting/methods/README.md`](aieng-forecasting/aieng/forecasting/methods/README.md) — the library and the method catalog. +- [`planning-docs/roadmap.md`](planning-docs/roadmap.md) — architecture principles and extension ideas. + +Keep code, notebooks, specs, and these docs in sync when you change behavior, setup, layout, or datasets. 
diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting____init__.py.md new file mode 100644 index 0000000..1f44b7d --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting____init__.py.md @@ -0,0 +1,12 @@ +# Source: aieng-forecasting/aieng/forecasting/__init__.py + +kind: python + +```python +"""Agentic Forecasting — data service and evaluation harness.""" + +from aieng.forecasting.models import ADVANCED_MODEL, DEFAULT_MODEL, LITE_MODEL + + +__all__ = ["ADVANCED_MODEL", "DEFAULT_MODEL", "LITE_MODEL"] +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data____init__.py.md new file mode 100644 index 0000000..47acf3f --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data____init__.py.md @@ -0,0 +1,14 @@ +# Source: aieng-forecasting/aieng/forecasting/data/__init__.py + +kind: python + +```python +"""Data service: adapters, series store, and cutoff enforcement.""" + +from aieng.forecasting.data.context import ForecastContext +from aieng.forecasting.data.models import SeriesMetadata, SeriesRecord +from aieng.forecasting.data.service import DataService + + +__all__ = ["DataService", "ForecastContext", "SeriesMetadata", "SeriesRecord"] +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters____init__.py.md new file mode 100644 index 
0000000..52e5f2f --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters____init__.py.md @@ -0,0 +1,15 @@ +# Source: aieng-forecasting/aieng/forecasting/data/adapters/__init__.py + +kind: python + +```python +"""Adapter implementations for ingesting data into the SeriesStore.""" + +from aieng.forecasting.data.adapters.base import BaseAdapter +from aieng.forecasting.data.adapters.fred import FREDAdapter +from aieng.forecasting.data.adapters.statcan import StatCanAdapter +from aieng.forecasting.data.adapters.yfinance import YFinanceDailyAdapter + + +__all__ = ["BaseAdapter", "FREDAdapter", "StatCanAdapter", "YFinanceDailyAdapter"] +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters__base.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters__base.py.md new file mode 100644 index 0000000..dd46a88 --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters__base.py.md @@ -0,0 +1,61 @@ +# Source: aieng-forecasting/aieng/forecasting/data/adapters/base.py + +kind: python + +```python +"""Base adapter protocol for data ingestion.""" + +from abc import ABC, abstractmethod + +import pandas as pd + + +class BaseAdapter(ABC): + """Abstract base class for all data adapters. + + An adapter is responsible for fetching data from a single source and + returning it in the canonical internal format understood by ``SeriesStore``. + + Each adapter instance represents **one series**. If a source provides + multiple series (e.g. a StatCan table with many product groups), create + one adapter instance per series. 
+ + The canonical format returned by ``fetch()`` is a ``pandas.DataFrame`` + with the following columns: + + - ``timestamp`` (``datetime64[ns]``): observation time / reference period. + - ``value`` (``float64``): the observed quantity. + - ``released_at`` (``datetime64[ns]``, optional): when the data point + became publicly available. If absent, ``CutoffEnforcer`` falls back to + ``timestamp``. + + The ``series_id`` is **not** a column — it is the key used when + registering the adapter with ``DataService``. + + Notes + ----- + Adapters should be **offline-safe** after initial data retrieval. All + network calls belong in ``fetch()``, which is called once by a + data-loading script ahead of sessions. During sessions or backtests, + ``DataService.get_series()`` serves from the in-memory store with no + further network access. + """ + + @abstractmethod + def fetch(self) -> pd.DataFrame: + """Fetch the series and return it in canonical format. + + Returns + ------- + pd.DataFrame + DataFrame with columns ``timestamp`` (datetime64) and ``value`` + (float64). The optional ``released_at`` column (datetime64) should + be included when the source provides reliable publication dates. + Rows are sorted ascending by ``timestamp``. + + Raises + ------ + RuntimeError + If the fetch fails (network error, missing data, etc.). 
+ """ +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters__fred.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters__fred.py.md new file mode 100644 index 0000000..1e8f285 --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters__fred.py.md @@ -0,0 +1,193 @@ +# Source: aieng-forecasting/aieng/forecasting/data/adapters/fred.py + +kind: python + +```python +"""FRED (Federal Reserve Economic Data) adapter for the SeriesStore. + +``FREDAdapter`` fetches a single FRED series and returns it in the canonical +internal format understood by :class:`~aieng.forecasting.data.store.SeriesStore`. + +Caching +------- +When ``cache_dir`` is provided, the adapter persists each series to +``{cache_dir}/{fred_id}.parquet`` on first fetch and reads from the parquet +file on all subsequent calls. This mirrors the ``StatCanAdapter`` pattern: +run ``scripts/fetch_fred.py`` once to populate the cache, then notebooks and +backtests read from disk with no further network access. + +**API key requirement:** FRED requires a free API key obtained from +https://fred.stlouisfed.org/docs/api/api_key.html. Provide it via the +``FRED_API_KEY`` environment variable (recommended) or the ``api_key`` +constructor argument. The key is only needed when the local cache is empty +or ``refresh=True``. + +**``released_at`` approximation:** FRED does not expose vintage / release +dates through the standard ``fredapi`` interface. The adapter sets +``released_at = timestamp``, which is correct for series that are available +at their reference period end (e.g. monthly averages published at or shortly +after month end). For series with significant publication lags this is +optimistic and may be refined in a later pass using FRED's +``get_series_vintage_dates`` endpoint. 
+""" + +from __future__ import annotations + +import os +from pathlib import Path + +import pandas as pd +from aieng.forecasting.data.adapters.base import BaseAdapter + + +class FREDAdapter(BaseAdapter): + """Adapter that fetches a single FRED series, with optional disk cache. + + Parameters + ---------- + series_id : str + FRED series identifier, e.g. ``"CPIFABSL"`` or ``"EXCAUS"``. + api_key : str or None + FRED API key. If ``None``, the value is read from the + ``FRED_API_KEY`` environment variable. The key is only consulted + when a network fetch is actually required (cache miss or + ``refresh=True``); adapters pointing at a populated cache can be + instantiated without a key. + cache_dir : str, Path, or None + Directory to read/write parquet cache files. When ``None``, + caching is disabled and every ``fetch()`` call hits the FRED API. + When set, the adapter reads from ``{cache_dir}/{series_id}.parquet`` + if present; otherwise it fetches from FRED and writes the parquet + before returning. Default: ``"data/fred"``. + refresh : bool + When ``True``, force a network fetch even if a cache file exists + (and overwrite the cache). Default: ``False``. + + Raises + ------ + ValueError + When a network fetch is required but no API key is available. 
+ + Examples + -------- + Populate the cache once:: + + >>> adapter = FREDAdapter("EXCAUS") # uses FRED_API_KEY env var + >>> df = adapter.fetch() # hits API, writes parquet + + Subsequent reads never touch the network:: + + >>> adapter = FREDAdapter("EXCAUS") + >>> df = adapter.fetch() # reads parquet + """ + + DEFAULT_CACHE_DIR = "data/fred" + + def __init__( + self, + series_id: str, + api_key: str | None = None, + cache_dir: str | Path | None = DEFAULT_CACHE_DIR, + refresh: bool = False, + ) -> None: + self._series_id = series_id + self._api_key = api_key or os.environ.get("FRED_API_KEY") + self._cache_dir = Path(cache_dir) if cache_dir is not None else None + self._refresh = refresh + + @property + def series_id(self) -> str: + """FRED series identifier.""" + return self._series_id + + @property + def cache_path(self) -> Path | None: + """Full path to this adapter's parquet cache file, or ``None`` if disabled.""" + if self._cache_dir is None: + return None + return self._cache_dir / f"{self._series_id}.parquet" + + def fetch(self) -> pd.DataFrame: + """Return the series in canonical format, using the disk cache when available. + + Flow: + + 1. If ``cache_dir`` is set and the parquet file exists and ``refresh=False``, + read and return it. + 2. Otherwise fetch from the FRED API, normalize, write to parquet (when + caching is enabled), and return. + + Returns + ------- + pd.DataFrame + Columns: ``timestamp`` (datetime64[ns]), ``value`` (float64), + ``released_at`` (datetime64[ns]). Sorted ascending by + ``timestamp``. Index is a default RangeIndex. + + Raises + ------ + ValueError + If a network fetch is required but no API key is available. + RuntimeError + If the FRED API request fails or returns no data. 
+ """ + cache_path = self.cache_path + if cache_path is not None and cache_path.exists() and not self._refresh: + return self._read_cache(cache_path) + + df = self._fetch_from_api() + + if cache_path is not None: + cache_path.parent.mkdir(parents=True, exist_ok=True) + df.to_parquet(cache_path, index=False) + + return df + + def _fetch_from_api(self) -> pd.DataFrame: + """Fetch the series directly from the FRED API.""" + if not self._api_key: + raise ValueError( + "FRED API key not provided. Set the FRED_API_KEY environment variable " + "or pass api_key= to FREDAdapter. (Key is only required on cache miss; " + "populated caches can be read without one.)" + ) + + try: + from fredapi import Fred # noqa: PLC0415 + except ImportError as exc: + raise RuntimeError("fredapi is not installed. Run `uv add fredapi` to install it.") from exc + + fred = Fred(api_key=self._api_key) + + try: + raw: pd.Series = fred.get_series(self._series_id) + except Exception as exc: + raise RuntimeError(f"Failed to fetch FRED series '{self._series_id}': {exc}") from exc + + if raw.empty: + raise RuntimeError(f"FRED series '{self._series_id}' returned no data.") + + df = raw.reset_index() + df.columns = pd.Index(["timestamp", "value"]) + df["timestamp"] = pd.to_datetime(df["timestamp"]) + df["value"] = pd.to_numeric(df["value"], errors="coerce") + df = df.dropna(subset=["value"]) + df["released_at"] = df["timestamp"] + df = df.sort_values("timestamp").reset_index(drop=True) + + return df[["timestamp", "value", "released_at"]] + + @staticmethod + def _read_cache(cache_path: Path) -> pd.DataFrame: + """Read a cached parquet and normalize dtypes defensively.""" + df = pd.read_parquet(cache_path) + df["timestamp"] = pd.to_datetime(df["timestamp"]) + df["released_at"] = pd.to_datetime(df["released_at"]) + df["value"] = pd.to_numeric(df["value"], errors="coerce") + return df[["timestamp", "value", "released_at"]].reset_index(drop=True) + + def __repr__(self) -> str: + """Return a short 
representation without exposing the API key.""" + cache = self._cache_dir if self._cache_dir is not None else "disabled" + return f"FREDAdapter(series_id={self._series_id!r}, cache_dir={cache!r})" +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters__statcan.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters__statcan.py.md new file mode 100644 index 0000000..3b25f7a --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters__statcan.py.md @@ -0,0 +1,219 @@ +# Source: aieng-forecasting/aieng/forecasting/data/adapters/statcan.py + +kind: python + +```python +"""Statistics Canada adapter using the stats-can library.""" + +import re +import zipfile +from pathlib import Path + +import pandas as pd +from aieng.forecasting.data.adapters.base import BaseAdapter + + +# Canonical column names in StatCan CSV exports (stable across tables). +_STATCAN_DATE_COL = "REF_DATE" +_STATCAN_VALUE_COL = "VALUE" + + +def _normalize_table_id(table_id: str) -> str: + """Strip non-numeric characters and take the first 8 digits. + + Statistics Canada table IDs like ``"18-10-0004-13"`` map to the zip filename + ``"18100004-eng.zip"`` — the last two digits are a product variant suffix + not used in the filename. + """ + return re.sub(r"\D", "", table_id)[:8] + + +def _read_zip(zip_path: Path, normalized_id: str) -> pd.DataFrame: + """Read the CSV from a StatCan zip file into a raw DataFrame. + + Uses ``errors="coerce"`` for date parsing (avoiding the pandas-3 + incompatibility in ``stats_can.zip_table_to_dataframe`` which used + the now-removed ``errors="ignore"``). 
+ """ + csv_name = f"{normalized_id}.csv" + with zipfile.ZipFile(zip_path) as zf: + with zf.open(csv_name) as f: + col_names = pd.read_csv(f, nrows=0).columns.tolist() + types_dict: dict[str, type | str] = {_STATCAN_VALUE_COL: float} + types_dict.update({col: str for col in col_names if col not in types_dict}) + with zf.open(csv_name) as f: + df = pd.read_csv(f, dtype=types_dict) + + df[_STATCAN_DATE_COL] = pd.to_datetime(df[_STATCAN_DATE_COL], errors="coerce") + return df + + +class StatCanAdapter(BaseAdapter): + """Adapter for a single series from a Statistics Canada table. + + Uses the ``stats-can`` library (v3+) to download tables and caches the + raw zip locally. The CSV inside the zip is read directly with pandas to + avoid a pandas-3 incompatibility in ``stats_can.zip_table_to_dataframe``. + After the initial download, all data is served from the local cache — + no further network calls are made unless the cache is cleared. + + Each instance represents **one series**, identified by a set of filter + criteria (e.g. geography + product group). For tables that contain many + series, instantiate one ``StatCanAdapter`` per series and register each + with ``DataService`` under a distinct ``series_id``. + + Parameters + ---------- + table_id : str + Statistics Canada table identifier (e.g. ``"18-10-0004-13"``). + member_filter : dict[str, str] + Column-value pairs used to select a single series from the table. + For example: ``{"GEO": "Canada", "Products and product groups": "All-items"}``. + All specified columns must be present in the downloaded table. + cache_dir : str or Path + Directory where the ``stats-can`` library stores its local table cache. + Defaults to ``"data/statcan"`` relative to the current working directory. + release_lag_days : int + Days added to ``timestamp`` to populate ``released_at``. 
The default + of 21 is a deliberately loose approximation for monthly survey + tables; note the lag is measured from the *month-start* timestamp, + while StatCan publishes CPI roughly three weeks after the month + *ends* (~51 days after the timestamp), so the default is optimistic + by about one month. Consumers that use monthly series as covariates + should add their own conservative lag (see the BoC use case). Daily + financial-market tables (e.g. 10-10-0139-01 interest rates) are + published the next business day — pass ``release_lag_days=1`` for + those so backtests do not hide three weeks of perfectly public + market data from predictors. + + Notes + ----- + **Information cutoff**: StatCan publishes CPI data roughly 3 weeks after + the reference month ends. For example, January CPI is released in + mid-to-late February. This adapter populates ``released_at`` as + ``timestamp + release_lag_days``, a fixed-lag approximation. A more + precise implementation would query StatCan's release calendar API, but + the fixed lag removes the most significant optimistic bias in backtests. + + Examples + -------- + >>> adapter = StatCanAdapter( + ... table_id="18-10-0004-13", + ... member_filter={ + ... "GEO": "Canada", + ... "Products and product groups": "All-items", + ... }, + ... 
) + >>> df = adapter.fetch() + >>> df.columns.tolist() + ['timestamp', 'value', 'released_at'] + """ + + def __init__( + self, + table_id: str, + member_filter: dict[str, str], + cache_dir: str | Path = "data/statcan", + release_lag_days: int = 21, + ) -> None: + if release_lag_days < 0: + raise ValueError(f"release_lag_days must be non-negative; got {release_lag_days}") + self._table_id = table_id + self._member_filter = member_filter + self._cache_dir = Path(cache_dir) + self._release_lag_days = release_lag_days + + @property + def table_id(self) -> str: + """Return the StatCan table identifier.""" + return self._table_id + + @property + def member_filter(self) -> dict[str, str]: + """Return the filter criteria that identify this series.""" + return dict(self._member_filter) + + def fetch(self) -> pd.DataFrame: + """Download (or load from cache) and return the series in canonical format. + + Returns + ------- + pd.DataFrame + DataFrame with columns ``timestamp`` (datetime64[ns]), ``value`` + (float64), and ``released_at`` (datetime64[ns]), sorted ascending + by ``timestamp``. ``released_at`` is set to + ``timestamp + release_lag_days`` to approximate StatCan's + publication lag. Rows with missing values are dropped. + + Raises + ------ + RuntimeError + If the table cannot be downloaded or the filter criteria do not + match any rows. + ValueError + If a column named in ``member_filter`` is not present in the table. 
+ """ + import stats_can.sc as _sc # noqa: PLC0415 — lazy import after package checks + + self._cache_dir.mkdir(parents=True, exist_ok=True) + + normalized = _normalize_table_id(self._table_id) + zip_path = self._cache_dir / f"{normalized}-eng.zip" + + if not zip_path.exists(): + try: + _sc.download_tables([normalized], path=self._cache_dir) + except Exception as exc: + raise RuntimeError(f"Failed to download StatCan table {self._table_id!r}: {exc}") from exc + + try: + raw = _read_zip(zip_path, normalized) + except Exception as exc: + raise RuntimeError(f"Failed to fetch StatCan table {self._table_id!r}: {exc}") from exc + + # Validate that all filter columns exist before filtering. + missing_cols = [col for col in self._member_filter if col not in raw.columns] + if missing_cols: + raise ValueError( + f"Filter column(s) {missing_cols} not found in table {self._table_id!r}. " + f"Available columns: {raw.columns.tolist()}" + ) + + # Apply member filter to isolate the target series. + mask = pd.Series(True, index=raw.index) + for col, val in self._member_filter.items(): + mask &= raw[col] == val + + filtered = raw.loc[mask].copy() + + if filtered.empty: + raise RuntimeError(f"No rows matched filter {self._member_filter} in table {self._table_id!r}.") + + if _STATCAN_VALUE_COL not in filtered.columns: + raise ValueError( + f"Expected value column {_STATCAN_VALUE_COL!r} not found in table. " + f"Available columns: {filtered.columns.tolist()}" + ) + + if _STATCAN_DATE_COL not in filtered.columns: + raise ValueError( + f"Expected date column {_STATCAN_DATE_COL!r} not found in table. " + f"Available columns: {filtered.columns.tolist()}" + ) + + # Build canonical output: (timestamp, value, released_at). 
+ timestamps = pd.to_datetime(filtered[_STATCAN_DATE_COL]) + result = pd.DataFrame( + { + "timestamp": timestamps, + "value": pd.to_numeric(filtered[_STATCAN_VALUE_COL], errors="coerce"), + # Approximate the table's publication lag (default 21 days for + # monthly survey tables like CPI; 1 day for daily market data). + "released_at": timestamps + pd.DateOffset(days=self._release_lag_days), + } + ) + + # Drop rows with missing values (StatCan uses blank VALUE for suppressed data). + result = result.dropna(subset=["value"]) + return result.sort_values("timestamp").reset_index(drop=True) +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters__yfinance.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters__yfinance.py.md new file mode 100644 index 0000000..6fc3163 --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__adapters__yfinance.py.md @@ -0,0 +1,340 @@ +# Source: aieng-forecasting/aieng/forecasting/data/adapters/yfinance.py + +kind: python + +```python +"""Yahoo Finance adapter for daily market series. + +``YFinanceDailyAdapter`` fetches one ticker/field pair from Yahoo Finance via +``yfinance`` and returns the canonical internal format understood by +:class:`~aieng.forecasting.data.store.SeriesStore`. + +Caching +------- +When ``cache_dir`` is provided, the adapter persists each ticker/field pair to +``{cache_dir}/{ticker}_{field}_1d.parquet`` on first fetch and reads from that +parquet file on subsequent calls. The cache is only used when it fully covers the +requested ``start``/``end`` window; if the cached data starts too late *or* ends too +early, a fresh yfinance request is made and the cache is overwritten. Use +``refresh=True`` to force a network fetch regardless of cache state. 
+ +Information cutoff +------------------ +Yahoo Finance daily bars do not include a reliable point-in-time availability +timestamp. For daily bars, this adapter sets ``released_at`` to the next +business day after the observation timestamp. That is a conservative default +for close-based daily forecasting and avoids treating a session close as known +at the start of that same session. It is not an exchange-grade release calendar +and should be revisited for intraday or contract-specific futures workflows. +""" + +from __future__ import annotations + +import re +from pathlib import Path +from typing import Any, Literal + +import pandas as pd +from aieng.forecasting.data.adapters.base import BaseAdapter +from pydantic import BaseModel, ConfigDict, Field, field_validator, model_validator + + +# Supported Yahoo Finance daily history fields. +YFinanceField = Literal["Open", "High", "Low", "Close", "Adj Close", "Volume"] + + +# Supported yfinance interval for this adapter. +YFinanceInterval = Literal["1d"] + + +_DEFAULT_FIELD: YFinanceField = "Adj Close" +_DEFAULT_INTERVAL: YFinanceInterval = "1d" + + +def _cache_stem(ticker: str, field: YFinanceField, interval: YFinanceInterval) -> str: + """Return a filesystem-safe cache stem for a ticker/field/interval combination.""" + key = f"{ticker}_{field}_{interval}".lower() + sanitized = re.sub(r"[^a-z0-9]+", "_", key).strip("_") + if not sanitized: + raise ValueError("ticker and field produced an empty cache key") + return sanitized + + +class YFinanceDailyConfig(BaseModel): + """Validated configuration for :class:`YFinanceDailyAdapter`.""" + + model_config = ConfigDict(frozen=True) + + ticker: str = Field(min_length=1) + field: YFinanceField = _DEFAULT_FIELD + start: str | None = None + end: str | None = None + interval: YFinanceInterval = _DEFAULT_INTERVAL + + @field_validator("ticker") + @classmethod + def ticker_must_not_be_blank(cls, value: str) -> str: + """Normalize ticker whitespace and reject blank values.""" + 
stripped = value.strip() + if not stripped: + raise ValueError("ticker must not be blank") + return stripped + + @model_validator(mode="after") + def end_must_be_after_start(self) -> "YFinanceDailyConfig": + """Validate the requested date window.""" + if self.start is not None and self.end is not None: + start = pd.Timestamp(self.start) + end = pd.Timestamp(self.end) + if end <= start: + raise ValueError(f"end ({self.end!r}) must be after start ({self.start!r})") + return self + + +class YFinanceDailyAdapter(BaseAdapter): + """Adapter that fetches a single Yahoo Finance daily ticker field. + + Parameters + ---------- + ticker : str + Yahoo Finance symbol, e.g. ``"^GSPC"``, ``"CL=F"``, or ``"XLE"``. + field : {"Open", "High", "Low", "Close", "Adj Close", "Volume"} + Daily history column to expose as canonical ``value``. Defaults to + ``"Adj Close"``. + start : str or None + Inclusive start date passed to yfinance and applied to cache reads. + end : str or None + Exclusive end date passed to yfinance and applied to cache reads. + cache_dir : str, Path, or None + Directory for parquet cache files. When ``None``, caching is disabled + and every ``fetch()`` call hits yfinance. Default: ``"data/yfinance"``. + refresh : bool + When ``True``, force a network fetch even if a cache file exists. 
+ """ + + DEFAULT_CACHE_DIR = "data/yfinance" + + def __init__( + self, + ticker: str, + *, + field: YFinanceField = _DEFAULT_FIELD, + start: str | None = None, + end: str | None = None, + cache_dir: str | Path | None = DEFAULT_CACHE_DIR, + refresh: bool = False, + ) -> None: + self._config = YFinanceDailyConfig( + ticker=ticker, + field=field, + start=start, + end=end, + interval=_DEFAULT_INTERVAL, + ) + self._cache_dir = Path(cache_dir) if cache_dir is not None else None + self._refresh = refresh + + @property + def ticker(self) -> str: + """Yahoo Finance ticker symbol.""" + return self._config.ticker + + @property + def field(self) -> YFinanceField: + """Yahoo Finance daily history field exposed as ``value``.""" + return self._config.field + + @property + def start(self) -> str | None: + """Inclusive start date for the requested window.""" + return self._config.start + + @property + def end(self) -> str | None: + """Exclusive end date for the requested window.""" + return self._config.end + + @property + def cache_path(self) -> Path | None: + """Full path to this adapter's parquet cache file, or ``None`` if disabled.""" + if self._cache_dir is None: + return None + stem = _cache_stem(self._config.ticker, self._config.field, self._config.interval) + return self._cache_dir / f"{stem}.parquet" + + def fetch(self) -> pd.DataFrame: + """Return the series in canonical format, using disk cache when available. + + Returns + ------- + pd.DataFrame + Columns: ``timestamp`` (datetime64[ns]), ``value`` (float64), and + ``released_at`` (datetime64[ns]). Rows are sorted ascending by + ``timestamp`` and filtered to the configured ``start`` / ``end`` + window. + + Raises + ------ + RuntimeError + If yfinance cannot be imported, the request fails, or no rows are + available after normalization and date filtering. + ValueError + If the Yahoo response is missing the configured field. 
+ """ + cache_path = self.cache_path + if cache_path is not None and cache_path.exists() and not self._refresh: + cached = self._read_cache(cache_path) + if self._cache_covers_range(cached): + return self._apply_date_range(cached) + + df = self._fetch_from_yfinance() + + if cache_path is not None: + cache_path.parent.mkdir(parents=True, exist_ok=True) + df.to_parquet(cache_path, index=False) + + return self._apply_date_range(df) + + def _cache_covers_range(self, df: pd.DataFrame) -> bool: + """Return whether cached data fully covers the requested date range. + + Both the start and end boundaries are checked. If either falls outside + the cached window we fall through to a live yfinance fetch so the caller + always receives the exact rows they asked for. + + Start boundary: the cache is considered sufficient when it opens on or + before the first business day on or after the requested ``start``. This + handles non-trading days (weekends, public holidays) at the boundary + without accepting a cache that is genuinely missing earlier data. For + example, a ``start`` of ``"2005-01-01"`` (Saturday) is satisfied by a + cache that begins on ``"2005-01-03"`` (Monday), but a ``start`` of + ``"2024-01-02"`` (Tuesday) would *not* be satisfied by a cache that + begins on ``"2024-01-03"``. + """ + if df.empty: + return False + if self._config.start is not None: + cache_start = df["timestamp"].min() + first_trading_day = pd.bdate_range(start=self._config.start, periods=1)[0].normalize() + if cache_start > first_trading_day: + return False + if self._config.end is not None: + cache_end = df["timestamp"].max() + # end is exclusive, so the last row we expect is strictly before it. + # Allow one calendar day of slack to tolerate weekends/holidays at + # the boundary; any larger gap means the cache is genuinely short. 
+ if cache_end < pd.Timestamp(self._config.end) - pd.Timedelta(days=1): + return False + return True + + def _fetch_from_yfinance(self) -> pd.DataFrame: + """Fetch and normalize a daily history frame from yfinance.""" + try: + import yfinance as yf # noqa: PLC0415 + except ImportError as exc: + raise RuntimeError("yfinance is not installed. Run `uv add yfinance` to install it.") from exc + + try: + ticker = yf.Ticker(self._config.ticker) + raw: pd.DataFrame = ticker.history( + start=self._config.start, + end=self._config.end, + interval=self._config.interval, + auto_adjust=False, + ) + except Exception as exc: + raise RuntimeError(f"Failed to fetch yfinance ticker {self._config.ticker!r}: {exc}") from exc + + if raw.empty: + raise RuntimeError( + f"Yahoo Finance returned no rows for ticker {self._config.ticker!r} " + f"between {self._config.start!r} and {self._config.end!r}." + ) + + return self._normalize_history(raw) + + def _normalize_history(self, raw: pd.DataFrame) -> pd.DataFrame: + """Normalize a yfinance history frame to canonical columns.""" + if self._config.field not in raw.columns: + raise ValueError( + f"Yahoo Finance response for {self._config.ticker!r} is missing field " + f"{self._config.field!r}. Available columns: {raw.columns.tolist()}" + ) + + df = raw.reset_index() + timestamp_col = self._find_timestamp_column(df) + result = pd.DataFrame( + { + "timestamp": self._normalize_timestamp(df[timestamp_col]), + "value": pd.to_numeric(df[self._config.field], errors="coerce"), + } + ) + result["released_at"] = result["timestamp"] + pd.offsets.BDay(1) + result = result.dropna(subset=["timestamp", "value"]) + result = result.sort_values("timestamp").reset_index(drop=True) + + if result.empty: + raise RuntimeError( + f"Yahoo Finance returned no usable {self._config.field!r} values for ticker {self._config.ticker!r}." 
+ ) + + return result[["timestamp", "value", "released_at"]] + + def _apply_date_range(self, df: pd.DataFrame) -> pd.DataFrame: + """Apply the configured ``start`` / ``end`` window to cached or fetched data.""" + result = df.copy() + if self._config.start is not None: + result = result[result["timestamp"] >= pd.Timestamp(self._config.start)] + if self._config.end is not None: + result = result[result["timestamp"] < pd.Timestamp(self._config.end)] + result = result.reset_index(drop=True) + if result.empty: + raise RuntimeError( + f"No rows left after applying date range start={self._config.start!r} " + f"end={self._config.end!r} for ticker {self._config.ticker!r}." + ) + return result + + @staticmethod + def _find_timestamp_column(df: pd.DataFrame) -> str: + """Return the yfinance date/datetime column created by ``reset_index``.""" + for candidate in ("Date", "Datetime"): + if candidate in df.columns: + return candidate + return str(df.columns[0]) + + @staticmethod + def _normalize_timestamp(values: Any) -> pd.Series: + """Return timezone-naive pandas timestamps.""" + timestamps = pd.to_datetime(values, errors="coerce") + if isinstance(timestamps.dtype, pd.DatetimeTZDtype): + timestamps = timestamps.dt.tz_localize(None) + return timestamps.astype("datetime64[ns]") + + @staticmethod + def _read_cache(cache_path: Path) -> pd.DataFrame: + """Read a cached parquet and normalize dtypes defensively.""" + df = pd.read_parquet(cache_path) + missing = {"timestamp", "value", "released_at"} - set(df.columns) + if missing: + raise ValueError(f"Cached yfinance file {cache_path} is missing column(s): {sorted(missing)}") + result = pd.DataFrame( + { + "timestamp": YFinanceDailyAdapter._normalize_timestamp(df["timestamp"]), + "value": pd.to_numeric(df["value"], errors="coerce"), + "released_at": YFinanceDailyAdapter._normalize_timestamp(df["released_at"]), + } + ) + result = result.dropna(subset=["timestamp", "value", "released_at"]) + return 
result.sort_values("timestamp").reset_index(drop=True) + + def __repr__(self) -> str: + """Return a short representation of this adapter.""" + cache = self._cache_dir if self._cache_dir is not None else "disabled" + return ( + f"YFinanceDailyAdapter(ticker={self._config.ticker!r}, field={self._config.field!r}, cache_dir={cache!r})" + ) + + +__all__ = ["YFinanceDailyAdapter", "YFinanceDailyConfig", "YFinanceField", "YFinanceInterval"] +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__context.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__context.py.md new file mode 100644 index 0000000..04cd9b7 --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__context.py.md @@ -0,0 +1,155 @@ +# Source: aieng-forecasting/aieng/forecasting/data/context.py + +kind: python + +```python +"""ForecastContext: the predictor-facing, cutoff-scoped data view.""" + +from __future__ import annotations + +from datetime import datetime +from typing import TYPE_CHECKING + +import pandas as pd +from aieng.forecasting.data.cutoff import CutoffEnforcer +from aieng.forecasting.data.models import SeriesMetadata +from aieng.forecasting.data.store import SeriesStore + + +if TYPE_CHECKING: + from aieng.forecasting.documents.models import ExtractedDocument + from aieng.forecasting.documents.store import DocumentStore + + +class ForecastContext: + """Read-only, cutoff-scoped data view passed to predictors. + + ``ForecastContext`` is the object predictors receive during backtesting or + live evaluation. It bakes in an ``as_of`` date so that ``get_series()`` + always enforces the information cutoff automatically — a predictor cannot + accidentally access data that was not available at forecast time. 
+ + The harness creates a ``ForecastContext`` for each backtest origin via + ``DataService.context(as_of)``. In live mode the same factory is called + with the current date. The predictor interface is identical in both modes. + + Intended predictor usage + ------------------------ + >>> def predict(task: ForecastingTask, context: ForecastContext) -> Prediction: + ... series = context.get_series(task.target_series_id) + ... # series contains only observations available as of context.as_of + ... + ... # Optionally retrieve cutoff-filtered documents: + ... docs = context.get_documents("cfpr") + + Parameters + ---------- + store : SeriesStore + The underlying series store (owned by the ``DataService``). + as_of : datetime + The information cutoff. All ``get_series`` queries are filtered to + data available on or before this date. + doc_store : DocumentStore or None + Optional document store for report integration. When ``None``, + ``get_documents()`` returns an empty list. + """ + + def __init__( + self, + store: SeriesStore, + as_of: datetime, + doc_store: DocumentStore | None = None, + ) -> None: + self._store = store + self._as_of = as_of + self._cutoff = CutoffEnforcer() + self._doc_store = doc_store + + @property + def as_of(self) -> datetime: + """The information cutoff date for this context.""" + return self._as_of + + def get_series(self, series_id: str) -> pd.DataFrame: + """Return a series filtered to observations available as of the cutoff. + + Parameters + ---------- + series_id : str + The series to retrieve. + + Returns + ------- + pd.DataFrame + DataFrame with columns ``timestamp`` and ``value`` (and optionally + ``released_at``), containing only rows available as of + ``self.as_of``, sorted ascending by ``timestamp``. + + Raises + ------ + KeyError + If ``series_id`` is not registered. 
+ """ + raw = self._store.get(series_id) + return self._cutoff.filter(raw, self._as_of) + + def get_metadata(self, series_id: str) -> SeriesMetadata: + """Return metadata for a registered series. + + Parameters + ---------- + series_id : str + The series identifier. + + Returns + ------- + SeriesMetadata + Metadata for the series. + + Raises + ------ + KeyError + If ``series_id`` is not registered. + """ + return self._store.get_metadata(series_id) + + @property + def series_ids(self) -> list[str]: + """Return a sorted list of registered series identifiers.""" + return self._store.series_ids + + # ------------------------------------------------------------------ + # Document access + # ------------------------------------------------------------------ + + def get_documents(self, source: str) -> list[ExtractedDocument]: + """Return cutoff-filtered documents for ``source``. + + Only documents whose ``publication_date`` is on or before + ``self.as_of`` are returned. Returns an empty list when no + ``DocumentStore`` is attached. + + Parameters + ---------- + source : str + Source key (e.g. ``"cfpr"``). + + Returns + ------- + list[ExtractedDocument] + Cutoff-filtered documents in chronological order. + """ + if self._doc_store is None: + return [] + return self._doc_store.list_docs(source, as_of=self._as_of) + + @property + def document_sources(self) -> list[str]: + """Return sorted list of known document source keys. + + Returns an empty list when no ``DocumentStore`` is attached. 
+ """ + if self._doc_store is None: + return [] + return self._doc_store.sources +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__cutoff.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__cutoff.py.md new file mode 100644 index 0000000..bbbc8a1 --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__cutoff.py.md @@ -0,0 +1,73 @@ +# Source: aieng-forecasting/aieng/forecasting/data/cutoff.py + +kind: python + +```python +"""Information cutoff enforcement.""" + +from datetime import datetime + +import pandas as pd + + +class CutoffEnforcer: + """Enforces information cutoff discipline on series data. + + Ensures that no model or agent receives data that would not have been + available at the time a forecast was issued. This is the mechanism that + makes backtesting honest: a predictor running as-of 2022-01-01 sees + exactly the data that existed on that date, nothing more. + + **Cutoff logic:** + + - If the DataFrame includes a ``released_at`` column, rows where + ``released_at > as_of`` are excluded. + - If ``released_at`` is absent or null for a row, ``timestamp`` is used + as the fallback. This is correct for custom datasets where data is + available at observation time, but introduces a slight optimistic bias + for official datasets that have publication lags (e.g. StatCan CPI is + published ~3 weeks after the reference month). + + Notes + ----- + This class is stateless — it is a pure function wrapped in a class for + testability and future extension (e.g. injecting release calendars). + """ + + def filter(self, df: pd.DataFrame, as_of: datetime) -> pd.DataFrame: + """Return only rows available as of the given date. + + Parameters + ---------- + df : pd.DataFrame + Series DataFrame with columns ``timestamp`` and ``value``. 
+ Optionally includes ``released_at``. + as_of : datetime + The information cutoff point. Rows with an effective release date + after this point are excluded. + + Returns + ------- + pd.DataFrame + Filtered copy of ``df`` containing only rows available as of + ``as_of``, sorted ascending by ``timestamp``. + + Raises + ------ + ValueError + If ``df`` does not contain a ``timestamp`` column. + """ + if "timestamp" not in df.columns: + raise ValueError("DataFrame must contain a 'timestamp' column.") + + as_of_ts = pd.Timestamp(as_of) + + if "released_at" in df.columns: + # Use released_at when available, fall back to timestamp for null values. + effective_release = df["released_at"].fillna(df["timestamp"]) + mask = pd.to_datetime(effective_release) <= as_of_ts + else: + mask = pd.to_datetime(df["timestamp"]) <= as_of_ts + + return df.loc[mask].copy().sort_values("timestamp").reset_index(drop=True) +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__models.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__models.py.md new file mode 100644 index 0000000..c61a853 --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__models.py.md @@ -0,0 +1,86 @@ +# Source: aieng-forecasting/aieng/forecasting/data/models.py + +kind: python + +```python +"""Pydantic models for the data service layer.""" + +from datetime import datetime + +from pydantic import BaseModel, Field, model_validator + + +class SeriesRecord(BaseModel): + """A single timestamped observation of a series. + + Parameters + ---------- + timestamp : datetime + The observation time (when the measurement was taken / the reference period). + value : float + The observed quantity. + released_at : datetime or None + When this data point became publicly available. 
If None, the + CutoffEnforcer falls back to ``timestamp``. For official datasets with + known release lags (e.g. StatCan CPI published ~3 weeks after the + reference month), this should be set explicitly to ensure backtests + respect information cutoff discipline. + """ + + timestamp: datetime + value: float + released_at: datetime | None = Field( + default=None, + description="Publication date; None means available at observation time.", + ) + + @model_validator(mode="after") + def released_at_not_before_timestamp(self) -> "SeriesRecord": + """Validate that released_at is not before timestamp. + + Returns + ------- + SeriesRecord + The validated instance. + + Raises + ------ + ValueError + If released_at is before timestamp. + """ + if self.released_at is not None and self.released_at < self.timestamp: + raise ValueError(f"released_at ({self.released_at}) cannot be before timestamp ({self.timestamp})") + return self + + +class SeriesMetadata(BaseModel): + """Descriptive metadata for a registered series. + + Parameters + ---------- + series_id : str + Unique identifier used as the key in SeriesStore. + description : str + Human-readable description of what the series measures. + source : str + Data source (e.g. "StatCan", "FRED", "yfinance"). + units : str + Unit of measure (e.g. "Index 2002=100", "Percentage change"). + frequency : str + Pandas offset alias for the series frequency (e.g. "MS" for month-start, + "h" for hourly). Used as a hint for gap-filling at the Darts conversion + boundary; the SeriesStore itself does not enforce regularity. + table_id : str or None + Source table or dataset identifier, if applicable. + """ + + series_id: str + description: str + source: str + units: str + frequency: str = Field(description="Pandas offset alias, e.g. 
'MS', 'h', 'D'.") + table_id: str | None = Field( + default=None, + description="Source table or dataset identifier.", + ) +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__service.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__service.py.md new file mode 100644 index 0000000..f8e25e3 --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__service.py.md @@ -0,0 +1,204 @@ +# Source: aieng-forecasting/aieng/forecasting/data/service.py + +kind: python + +```python +"""DataService: registration and management of time series data.""" + +from __future__ import annotations + +from datetime import datetime +from typing import TYPE_CHECKING + +import pandas as pd +from aieng.forecasting.data.adapters.base import BaseAdapter +from aieng.forecasting.data.context import ForecastContext +from aieng.forecasting.data.cutoff import CutoffEnforcer +from aieng.forecasting.data.models import SeriesMetadata +from aieng.forecasting.data.store import SeriesStore + + +if TYPE_CHECKING: + from aieng.forecasting.documents.store import DocumentStore + + +class DataService: + """Registration and management layer for time series data. + + ``DataService`` owns the ``SeriesStore`` and exposes two distinct + responsibilities: + + 1. **Registration** — ``register()`` fetches data via an adapter and + stores it in memory. Called by setup scripts (e.g. + ``scripts/fetch_cpi.py``) once at startup; no further network access + occurs after that. + 2. **Context creation** — ``context(as_of)`` creates a + :class:`ForecastContext` scoped to a specific date. This is what the + backtesting harness (and live evaluation harness) passes to predictors. + Predictors should never receive a raw ``DataService``; they should + receive a ``ForecastContext``. 
+ + **Notebooks and scripts** may also call ``get_series`` directly for + ad-hoc exploration — this is the same cutoff-filtered query that + ``ForecastContext`` wraps, exposed here for convenience. + + Examples + -------- + >>> from aieng.forecasting.data import DataService, SeriesMetadata + >>> from aieng.forecasting.data.adapters import StatCanAdapter + >>> svc = DataService() + >>> adapter = StatCanAdapter( + ... table_id="18-10-0004-11", + ... member_filter={"GEO": "Canada", "Products and product groups": "All-items"}, + ... ) + >>> meta = SeriesMetadata( + ... series_id="cpi_all_items_canada", + ... description="CPI All-items, Canada (2002=100)", + ... source="StatCan", + ... units="Index 2002=100", + ... frequency="MS", + ... table_id="18-10-0004-11", + ... ) + >>> svc.register("cpi_all_items_canada", adapter, meta) + >>> df = svc.get_series("cpi_all_items_canada", as_of=datetime(2023, 1, 1)) + """ + + def __init__(self, doc_store: DocumentStore | None = None) -> None: + self._store = SeriesStore() + self._cutoff = CutoffEnforcer() + self._doc_store = doc_store + + def register( + self, + series_id: str, + adapter: BaseAdapter, + metadata: SeriesMetadata, + ) -> None: + """Fetch data via an adapter and register the series in the store. + + Parameters + ---------- + series_id : str + Unique identifier for the series. Used as the lookup key in + subsequent ``get_series`` calls. + adapter : BaseAdapter + Adapter responsible for fetching the data. ``adapter.fetch()`` is + called exactly once; the result is stored in memory. + metadata : SeriesMetadata + Descriptive metadata (units, source, frequency, etc.). + + Raises + ------ + RuntimeError + If the adapter fails to fetch data. + ValueError + If the fetched DataFrame is missing required columns. + """ + df = adapter.fetch() + self._store.put(series_id, df, metadata) + + def get_series(self, series_id: str, as_of: datetime) -> pd.DataFrame: + """Return a series filtered to observations available as of ``as_of``. 
+ + The ``CutoffEnforcer`` ensures that only data published on or before + ``as_of`` is returned. This guarantees that backtests and live + forecasts share the same information discipline. + + Parameters + ---------- + series_id : str + The series to retrieve. + as_of : datetime + Information cutoff point. Observations released after this date + are excluded. + + Returns + ------- + pd.DataFrame + DataFrame with columns ``timestamp`` and ``value`` (and optionally + ``released_at``), containing only rows available as of ``as_of``, + sorted ascending by ``timestamp``. + + Raises + ------ + KeyError + If ``series_id`` is not registered. + """ + raw = self._store.get(series_id) + return self._cutoff.filter(raw, as_of) + + def context(self, as_of: datetime) -> ForecastContext: + """Create a :class:`ForecastContext` scoped to the given as-of date. + + This is the factory method used by the backtesting harness (and live + evaluation harness) to create the object passed to predictors. The + returned context bakes in ``as_of`` so that ``get_series()`` always + enforces the information cutoff automatically. + + If a ``DocumentStore`` was provided at construction, it is wired into + every context so predictors can call ``context.get_documents()``. + + Parameters + ---------- + as_of : datetime + The information cutoff date. + + Returns + ------- + ForecastContext + A read-only, cutoff-scoped view of the series store. + """ + return ForecastContext(self._store, as_of, doc_store=self._doc_store) + + def get_metadata(self, series_id: str) -> SeriesMetadata: + """Return metadata for a registered series. + + Parameters + ---------- + series_id : str + The series identifier. + + Returns + ------- + SeriesMetadata + Metadata for the series. + + Raises + ------ + KeyError + If ``series_id`` is not registered. 
+ """ + return self._store.get_metadata(series_id) + + @property + def series_ids(self) -> list[str]: + """Return a sorted list of registered series identifiers.""" + return self._store.series_ids + + def summary(self) -> pd.DataFrame: + """Return a summary table of all registered series. + + Returns + ------- + pd.DataFrame + One row per series with columns: ``series_id``, ``description``, + ``source``, ``units``, ``frequency``, ``n_obs``, ``start``, ``end``. + """ + rows = [] + for sid in self._store.series_ids: + df = self._store.get(sid) + meta = self._store.get_metadata(sid) + rows.append( + { + "series_id": sid, + "description": meta.description, + "source": meta.source, + "units": meta.units, + "frequency": meta.frequency, + "n_obs": len(df), + "start": df["timestamp"].min() if len(df) > 0 else None, + "end": df["timestamp"].max() if len(df) > 0 else None, + } + ) + return pd.DataFrame(rows) +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__store.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__store.py.md new file mode 100644 index 0000000..3470703 --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__data__store.py.md @@ -0,0 +1,117 @@ +# Source: aieng-forecasting/aieng/forecasting/data/store.py + +kind: python + +```python +"""In-memory series store.""" + +import pandas as pd +from aieng.forecasting.data.models import SeriesMetadata + + +class SeriesStore: + """In-memory store for historical time series. + + Stores each series as a ``pandas.DataFrame`` with columns ``timestamp``, + ``value``, and optionally ``released_at``. Series are keyed by + ``series_id``; there is no ``series_id`` column in the stored DataFrame. + + This class is intentionally thin — it is a dict with type-checked access + and basic introspection helpers. 
All filtering (cutoff enforcement) happens + in ``CutoffEnforcer`` before data reaches callers. + + Notes + ----- + The store makes no guarantees about temporal regularity. Series may be + irregularly spaced, sparse, or contain gaps. Gap-filling to a regular + frequency is a predictor-level concern performed at the Darts conversion + boundary, not here. + """ + + def __init__(self) -> None: + self._data: dict[str, pd.DataFrame] = {} + self._metadata: dict[str, SeriesMetadata] = {} + + def put(self, series_id: str, df: pd.DataFrame, metadata: SeriesMetadata) -> None: + """Store a series and its metadata. + + Parameters + ---------- + series_id : str + Unique identifier for the series. Used as the lookup key. + df : pd.DataFrame + DataFrame with columns ``timestamp`` (datetime64) and ``value`` + (float64). Optionally includes ``released_at`` (datetime64). + Rows should be sorted ascending by ``timestamp``. + metadata : SeriesMetadata + Descriptive metadata for the series. + + Raises + ------ + ValueError + If ``df`` is missing required columns ``timestamp`` or ``value``. + """ + required = {"timestamp", "value"} + missing = required - set(df.columns) + if missing: + raise ValueError(f"DataFrame for series {series_id!r} is missing required columns: {missing}") + self._data[series_id] = df.copy() + self._metadata[series_id] = metadata + + def get(self, series_id: str) -> pd.DataFrame: + """Return the full (unfiltered) DataFrame for a series. + + Parameters + ---------- + series_id : str + The series identifier. + + Returns + ------- + pd.DataFrame + A copy of the stored DataFrame. + + Raises + ------ + KeyError + If ``series_id`` is not registered. + """ + if series_id not in self._data: + raise KeyError(f"Series {series_id!r} not found. Registered series: {self.series_ids}") + return self._data[series_id].copy() + + def get_metadata(self, series_id: str) -> SeriesMetadata: + """Return metadata for a series. 
+ + Parameters + ---------- + series_id : str + The series identifier. + + Returns + ------- + SeriesMetadata + The metadata for the series. + + Raises + ------ + KeyError + If ``series_id`` is not registered. + """ + if series_id not in self._metadata: + raise KeyError(f"Series {series_id!r} not found. Registered series: {self.series_ids}") + return self._metadata[series_id] + + @property + def series_ids(self) -> list[str]: + """Return a sorted list of registered series identifiers.""" + return sorted(self._data.keys()) + + def __contains__(self, series_id: str) -> bool: + """Return True if series_id is registered.""" + return series_id in self._data + + def __len__(self) -> int: + """Return the number of registered series.""" + return len(self._data) +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents____init__.py.md new file mode 100644 index 0000000..ed56429 --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents____init__.py.md @@ -0,0 +1,41 @@ +# Source: aieng-forecasting/aieng/forecasting/documents/__init__.py + +kind: python + +```python +"""Document extraction: source-agnostic PDF -> full text + cutoff metadata. + +This sub-package turns published document PDFs (e.g. Canada's Food Price Report, +Bank of Canada Monetary Policy Report) into minimal, cutoff-stamped +:class:`ExtractedDocument` artifacts -- full text plus a ``publication_date`` +and size counts. It intentionally models no source-specific structure; a future +cutoff-aware ``DocumentStore`` will consume these artifacts for LLM-P report +integration. + +The extractor depends on the optional ``documents`` extra (``pymupdf4llm``) and +imports it lazily, so importing this package is cheap. 
+""" + +from aieng.forecasting.documents.extract import extract_document +from aieng.forecasting.documents.models import DocumentMeta, ExtractedDocument, estimate_tokens +from aieng.forecasting.documents.pdf_upload import ( + MIME_PDF, + inject_pdf_parts, + pdf_bytes_to_content_part, + pdf_to_content_part, +) +from aieng.forecasting.documents.store import DocumentStore + + +__all__ = [ + "DocumentMeta", + "DocumentStore", + "ExtractedDocument", + "MIME_PDF", + "estimate_tokens", + "extract_document", + "inject_pdf_parts", + "pdf_bytes_to_content_part", + "pdf_to_content_part", +] +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents__extract.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents__extract.py.md new file mode 100644 index 0000000..2aa7985 --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents__extract.py.md @@ -0,0 +1,92 @@ +# Source: aieng-forecasting/aieng/forecasting/documents/extract.py + +kind: python + +```python +"""Document text extraction. + +A single, source-agnostic function turns any born-digital PDF into full text +plus size counts. The bootcamp reference pipeline targets born-digital report +PDFs (Canada's Food Price Report, Bank of Canada Monetary Policy Report), where +a lightweight, deterministic, CPU-only parser captures the text well. + +We use the classic ``pymupdf4llm`` engine rather than its OCR layout engine so +extraction is deterministic and reproducible for honest backtests. No section +or heading structure is reconstructed -- callers get the whole document text. + +``pymupdf4llm`` is an optional dependency (the ``documents`` extra); it is +imported lazily so importing this module never requires the package. 
+""" + +from __future__ import annotations + +from datetime import datetime, timezone +from pathlib import Path + +from aieng.forecasting.documents.models import DocumentMeta, ExtractedDocument, estimate_tokens + + +def extract_document( + pdf_path: Path, meta: DocumentMeta, *, dpi: int = 150, min_chars_per_page: int = 20 +) -> ExtractedDocument: + """Extract full text and size counts from a born-digital PDF. + + Parameters + ---------- + pdf_path : Path + Path to the source PDF. + meta : DocumentMeta + Provenance/cutoff metadata, carried through to the result. The + ``publication_date`` is supplied by the caller (from the committed + manifest), not parsed from the PDF. + dpi : int + Render DPI passed to the engine. Affects only any rasterization the + engine performs internally; text extraction is unaffected. + min_chars_per_page : int + Fail loudly if the extracted text averages fewer characters per page + than this -- a near-empty result signals a scanned/encrypted/image-only + PDF that this text-only path cannot handle. Set ``0`` to disable. + + Returns + ------- + ExtractedDocument + Full text plus page count, character count, and an approximate token + count. + + Raises + ------ + FileNotFoundError + If ``pdf_path`` does not exist. + ValueError + If extraction yields implausibly little text (see ``min_chars_per_page``). + """ + if not pdf_path.exists(): + raise FileNotFoundError(f"PDF not found: {pdf_path}") + + # Lazy import: the ``documents`` optional dependency need not be installed + # to import this module (only to actually run extraction). + from pymupdf4llm.helpers.pymupdf_rag import to_markdown # noqa: PLC0415 + + # ``table_strategy=None`` disables table detection, which trips an upstream + # empty-cell ValueError on several real report PDFs and is not needed for + # whole-document text extraction. ``page_chunks=True`` yields one entry per + # page, giving us the page count without a separately-typed pymupdf call. 
+ chunks = to_markdown(str(pdf_path), page_chunks=True, table_strategy=None, dpi=dpi, show_progress=False) + page_count = len(chunks) + text = "\n\n".join(str(chunk.get("text", "")) for chunk in chunks).strip() + + n_chars = len(text) + if min_chars_per_page > 0 and n_chars < min_chars_per_page * max(page_count, 1): + raise ValueError( + f"Extracted only {n_chars} chars from {page_count} page(s) of {pdf_path.name}; " + "likely a scanned/encrypted/image-only PDF that the text-only extractor cannot read.", + ) + return ExtractedDocument( + meta=meta, + text=text, + page_count=page_count, + n_chars=n_chars, + est_tokens=estimate_tokens(n_chars), + extracted_at=datetime.now(tz=timezone.utc).replace(tzinfo=None), + ) +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents__models.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents__models.py.md new file mode 100644 index 0000000..b5d1f9c --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents__models.py.md @@ -0,0 +1,108 @@ +# Source: aieng-forecasting/aieng/forecasting/documents/models.py + +kind: python + +```python +"""Pydantic models for extracted documents. + +A document here is intentionally minimal: full text plus the metadata needed to +use it honestly in a backtest. We deliberately do *not* model sections, +segments, or any source-specific structure -- different report families (e.g. +Canada's Food Price Report vs. Bank of Canada Monetary Policy Report) have +nothing in common structurally, and the planned LLM-P report formats consume the +whole document at its single publication date rather than hand-picked sections. + +The field that matters most for honest backtesting is +:attr:`DocumentMeta.publication_date`. 
A future cutoff-aware ``DocumentStore`` +will filter documents with ``publication_date <= as_of`` using the same +information-cutoff discipline that ``CutoffEnforcer`` (see +:mod:`aieng.forecasting.data.cutoff`) applies to numeric series. +""" + +from __future__ import annotations + +from datetime import date, datetime + +from pydantic import BaseModel, Field + + +class DocumentMeta(BaseModel): + """Provenance and cutoff metadata for a single document. + + Parameters + ---------- + source : str + Short source key, e.g. ``"cfpr"`` (Canada's Food Price Report) or + ``"boc"`` (Bank of Canada Monetary Policy Report). + doc_id : str + Stable per-document identifier, unique within ``source`` (e.g. + ``"2026_en"``). Used as the cache filename stem. + publication_date : date + The date the document became publicly available. This is the cutoff + key: a forecast issued before this date must not see this document. + title : str or None + Document title, if known. + lang : str + Two-letter language code, e.g. ``"en"``. + """ + + source: str + doc_id: str + publication_date: date = Field(description="Public release date; the cutoff key for honest backtests.") + title: str | None = None + lang: str = "en" + + +def estimate_tokens(n_chars: int) -> int: + """Roughly estimate token count from character count. + + Uses the common ``~4 chars/token`` rule of thumb. This is a deliberately + crude, model-agnostic ballpark for context-budget planning -- not an exact + count for any specific tokenizer. + + Parameters + ---------- + n_chars : int + Number of characters. + + Returns + ------- + int + Approximate token count. + """ + return (n_chars + 3) // 4 + + +class ExtractedDocument(BaseModel): + """The full-text result of extracting one document. + + Parameters + ---------- + meta : DocumentMeta + Provenance and cutoff metadata. + text : str + The full extracted text (markdown). + page_count : int + Number of pages in the source document. 
+ n_chars : int + Character count of ``text`` (context-cost signal). + est_tokens : int + Approximate token count (``~n_chars / 4``); see :func:`estimate_tokens`. + extracted_at : datetime + UTC timestamp when extraction ran. + pdf_path : str or None + Local filesystem path to the source PDF, resolved at load time by + :class:`~aieng.forecasting.documents.store.DocumentStore` for native + document ingestion. Runtime-only and machine-specific — it is *not* + part of the persisted artifact contract; serialized artifacts leave it + ``None``. + """ + + meta: DocumentMeta + text: str + page_count: int = Field(ge=0) + n_chars: int = Field(ge=0) + est_tokens: int = Field(ge=0) + extracted_at: datetime + pdf_path: str | None = None +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents__pdf_upload.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents__pdf_upload.py.md new file mode 100644 index 0000000..d88e6c0 --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents__pdf_upload.py.md @@ -0,0 +1,208 @@ +# Source: aieng-forecasting/aieng/forecasting/documents/pdf_upload.py + +kind: python + +```python +"""PDF-to-message-part conversion for native document ingestion. 
+ +Converts a PDF into a content-part dict that a model can read directly, +**dispatched by backend family** because each provider's native API expects a +different document-block shape and the Vector Proxy forwards content blocks to +each backend largely untranslated: + +- **Anthropic** (``claude-*``): ``{"type": "document", "source": {...}}`` +- **OpenAI** (``gpt-*``, ``o*``): ``{"type": "file", "file": {...}}`` +- **Google** (``gemini-*``): **not supported through the proxy yet** — the + proxy routes Gemini via Google's OpenAI-compatibility endpoint, which drops + document (and image) parts. ``pdf_to_content_part`` raises + :class:`NotImplementedError` for Gemini models. See the ``TODO(proxy-pdf)`` + below: once the proxy routes Gemini through the native ``generateContent`` + API with ``inline_data``, emit that part here and Gemini becomes just another + branch — at which point native ingestion can be configured uniformly + alongside text extraction for every model. + +Usage:: + + from aieng.forecasting.documents.pdf_upload import pdf_to_content_part + + part = pdf_to_content_part(Path("report.pdf"), model="claude-sonnet-4-6") + messages = [{"role": "user", "content": "Summarize this document."}] + messages = inject_pdf_parts(messages, [part]) +""" + +from __future__ import annotations + +import base64 +from pathlib import Path +from typing import Any + + +#: MIME type used for PDF document parts. +MIME_PDF = "application/pdf" + + +def _backend_family(model: str) -> str: + """Map a proxy model name to its backend family. + + The model may carry a LiteLLM provider prefix (e.g. ``openai/gpt-4o``); + only the bare name after the last ``/`` is inspected. + + Returns one of ``"anthropic"``, ``"openai"``, ``"google"``. + + Raises + ------ + ValueError + If the model name does not match a known family. 
+ """ + name = model.lower().rsplit("/", 1)[-1] + if name.startswith("claude"): + return "anthropic" + if name.startswith(("gpt", "o1", "o3", "o4")): + return "openai" + if name.startswith("gemini"): + return "google" + raise ValueError( + f"Cannot determine backend family for model {model!r}; native PDF " + "ingestion supports Anthropic ('claude-*') and OpenAI ('gpt-*', 'o*') " + "models. Use text extraction (report_ingestion='text') for others." + ) + + +def pdf_bytes_to_content_part( + pdf_bytes: bytes, + model: str, + *, + filename: str = "document.pdf", +) -> dict[str, Any]: + """Convert raw PDF bytes into a backend-appropriate content-part dict. + + Parameters + ---------- + pdf_bytes : bytes + Raw PDF file bytes. + model : str + Target model name (bare or provider-prefixed). Selects the block shape. + filename : str + Filename advertised to OpenAI's ``file`` block. Ignored by Anthropic. + + Returns + ------- + dict + A content-part dict in the target backend's native document format. + + Raises + ------ + ValueError + If ``model`` is not a recognised Anthropic/OpenAI family member. + NotImplementedError + If ``model`` is a Gemini model (unsupported through the proxy today). + """ + family = _backend_family(model) + b64 = base64.b64encode(pdf_bytes).decode("utf-8") + if family == "anthropic": + return { + "type": "document", + "source": {"type": "base64", "media_type": MIME_PDF, "data": b64}, + } + if family == "openai": + return { + "type": "file", + "file": {"filename": filename, "file_data": f"data:{MIME_PDF};base64,{b64}"}, + } + # Remaining family: Google (Gemini). + # TODO(proxy-pdf): the Vector Proxy currently routes Gemini through Google's + # OpenAI-compatibility endpoint, which silently drops document/image parts + # (verified: multimodal content reaches Gemini as 0 added prompt tokens). 
+ # Once the proxy routes Gemini via the native generateContent API, emit a + # Gemini-native inline_data part here (a "file"/"file_data" data-URI part + # that the proxy translates to inline_data) and delete this guard, so native + # ingestion becomes configurable uniformly for every model alongside text + # extraction. + raise NotImplementedError( + f"Native PDF ingestion for Gemini model {model!r} is not supported " + "through the Vector Proxy yet: the proxy routes Gemini via Google's " + "OpenAI-compatibility endpoint, which drops document parts. Use text " + "extraction (report_ingestion='text') for Gemini, or a Claude/GPT " + "model for native ingestion." + ) + + +def pdf_to_content_part(pdf_path: Path, model: str) -> dict[str, Any]: + """Read a PDF file and convert it to a backend-appropriate content part. + + Parameters + ---------- + pdf_path : Path + Path to the PDF file. Must exist and be readable. The filename is + forwarded to OpenAI's ``file`` block. + model : str + Target model name (bare or provider-prefixed). Selects the block shape. + + Returns + ------- + dict + A content-part dict in the target backend's native document format. + + Raises + ------ + FileNotFoundError + If ``pdf_path`` does not exist. + ValueError + If ``model`` is not a recognised Anthropic/OpenAI family member. + NotImplementedError + If ``model`` is a Gemini model (unsupported through the proxy today). + """ + if not pdf_path.exists(): + raise FileNotFoundError(f"PDF not found: {pdf_path}") + return pdf_bytes_to_content_part(pdf_path.read_bytes(), model, filename=pdf_path.name) + + +def inject_pdf_parts( + messages: list[dict[str, Any]], + pdf_parts: list[dict[str, Any]], + *, + target_role: str = "user", +) -> list[dict[str, Any]]: + """Inject PDF content parts into the first message matching ``target_role``. + + If the target message's ``content`` is a string, it is converted to a + content-part list with the original text as a ``"text"`` part. 
PDF parts + are prepended so the model sees the document before the instruction text. + + When no message matches ``target_role``, a new message with that role + and only the PDF parts is appended as a fallback. + + Parameters + ---------- + messages : list[dict] + Existing messages list (mutated in place and returned for chaining). + pdf_parts : list[dict] + One or more content-part dicts from :func:`pdf_to_content_part`. + target_role : str + Role of the message to inject into (default ``"user"``). + + Returns + ------- + list[dict] + The same ``messages`` list (mutated in place). + """ + for msg in messages: + if msg.get("role") == target_role: + content = msg["content"] + if isinstance(content, str): + msg["content"] = [{"type": "text", "text": content}] + # Prepend PDF parts before text instruction. + msg["content"] = pdf_parts + list(msg["content"]) + return messages + # Fallback: append a new target_role message with only the PDF parts. + messages.append({"role": target_role, "content": pdf_parts}) + return messages + + +__all__ = [ + "MIME_PDF", + "inject_pdf_parts", + "pdf_bytes_to_content_part", + "pdf_to_content_part", +] +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents__store.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents__store.py.md new file mode 100644 index 0000000..848778f --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__documents__store.py.md @@ -0,0 +1,200 @@ +# Source: aieng-forecasting/aieng/forecasting/documents/store.py + +kind: python + +```python +"""Cutoff-aware in-memory store for extracted documents. 
+ +``DocumentStore`` loads :class:`ExtractedDocument` JSON artifacts written by +``scripts/extract_reports.py`` and makes them queryable by source and ``as_of`` +date — the same information-discipline pattern that ``CutoffEnforcer`` enforces for +numeric series. + +Artifact layout (one directory per source):: + + data/reports// + ├── .pdf # cached PDF (source of extraction) + ├── .md # extracted full text + └── .json # ExtractedDocument metadata + text_path pointer +""" + +from __future__ import annotations + +import json +from datetime import date, datetime +from pathlib import Path +from typing import Any + +from aieng.forecasting.documents.models import DocumentMeta, ExtractedDocument + + +class DocumentStore: + """In-memory store for extracted documents, indexed by ``(source, doc_id)``. + + Populated by ``load_dir()`` from the JSON artifacts written by + ``scripts/extract_reports.py``. Supports cutoff-filtered listing via + ``list_docs()`` so that predictors can only see documents whose + ``publication_date`` is <= the forecast ``as_of`` date. + + Parameters + ---------- + source_dirs : dict[str, Path] or None + Mapping of ``source`` keys to artifact directories. When ``None``, + the store starts empty; call ``load_dir()`` to populate. + """ + + def __init__(self, source_dirs: dict[str, Path] | None = None) -> None: + self._docs: dict[tuple[str, str], ExtractedDocument] = {} + self._source_names: set[str] = set() + if source_dirs: + for source, directory in source_dirs.items(): + self.load_dir(source, directory) + + # ------------------------------------------------------------------ + # Population + # ------------------------------------------------------------------ + + def load_dir(self, source: str, directory: Path) -> int: + """Load all ``*.json`` artifacts from ``directory`` into the store. + + Each ``.json`` file must be a serialized :class:`ExtractedDocument` + (the shape written by ``scripts/extract_reports.py``). 
The ``text`` + field is loaded from the ``text_path`` pointer stored inside the JSON, + or from the ``.md`` companion file with the same stem. + + Parameters + ---------- + source : str + Source key (e.g. ``"cfpr"``). + directory : Path + Directory containing ``.json`` artifacts. + + Returns + ------- + int + Number of documents loaded. + """ + if not directory.is_dir(): + self._source_names.add(source) + return 0 + count = 0 + for json_path in sorted(directory.glob("*.json")): + doc = self._load_one(source, json_path) + if doc is not None: + self._docs[(source, doc.meta.doc_id)] = doc + count += 1 + self._source_names.add(source) + return count + + def _load_one(self, source: str, json_path: Path) -> ExtractedDocument | None: + """Parse one ``.json`` artifact and resolve its text.""" + try: + raw: dict[str, Any] = json.loads(json_path.read_text(encoding="utf-8")) + except (json.JSONDecodeError, OSError): + return None + + meta_raw = raw.get("meta", {}) + text = raw.get("text", "") or "" + + # If text is empty (extract_reports.py excludes it from the JSON), load + # it from the companion .md. Prefer the co-located ``.md`` next + # to the JSON — it is CWD-independent. The stored ``text_path`` is only + # a fallback and may be repo-root-relative, so resolve it against the + # JSON's own directory rather than the current working directory. 
+ if not text: + md_companion = json_path.with_suffix(".md") + text_path_str = raw.get("text_path") + if md_companion.exists(): + text = md_companion.read_text(encoding="utf-8") + elif text_path_str: + candidate = Path(text_path_str) + if not candidate.is_absolute(): + candidate = json_path.parent / candidate.name + text = candidate.read_text(encoding="utf-8") + + meta = DocumentMeta( + source=source, + doc_id=meta_raw.get("doc_id", json_path.stem), + publication_date=date.fromisoformat(meta_raw["publication_date"]), + title=meta_raw.get("title"), + lang=meta_raw.get("lang", "en"), + ) + # Resolve the companion PDF (``.pdf``) for native ingestion. + # Runtime-only; not persisted in the JSON artifact. + pdf_companion = json_path.with_suffix(".pdf") + pdf_path = str(pdf_companion) if pdf_companion.exists() else None + return ExtractedDocument( + meta=meta, + text=text, + page_count=raw.get("page_count", 0), + n_chars=len(text), + est_tokens=raw.get("est_tokens", 0), + extracted_at=datetime.fromisoformat(raw["extracted_at"]) if raw.get("extracted_at") else datetime.now(), + pdf_path=pdf_path, + ) + + # ------------------------------------------------------------------ + # Query + # ------------------------------------------------------------------ + + def get(self, source: str, doc_id: str) -> ExtractedDocument: + """Return a single document by source and doc_id. + + Raises + ------ + KeyError + If ``(source, doc_id)`` is not in the store. + """ + key = (source, doc_id) + if key not in self._docs: + available = [f"{s}/{d}" for s, d in self._docs] + raise KeyError(f"Document '{source}/{doc_id}' not found. Available: {sorted(available)}") + return self._docs[key] + + def list_docs( + self, + source: str, + *, + as_of: date | datetime | None = None, + ) -> list[ExtractedDocument]: + """Return documents for ``source``, optionally cutoff-filtered. + + Documents are sorted by ``publication_date`` ascending then by + ``doc_id`` for stable ordering. 
+ + Parameters + ---------- + source : str + Source key (e.g. ``"cfpr"``). + as_of : date or datetime or None + When set, only documents with ``publication_date <= as_of`` are + returned. ``None`` returns all documents for the source. + + Returns + ------- + list[ExtractedDocument] + Cutoff-filtered, chronologically sorted documents. + """ + candidates = [doc for (s, _), doc in self._docs.items() if s == source] + if as_of is not None: + as_of_date = as_of.date() if isinstance(as_of, datetime) else as_of + candidates = [d for d in candidates if d.meta.publication_date <= as_of_date] + candidates.sort(key=lambda d: (d.meta.publication_date, d.meta.doc_id)) + return candidates + + @property + def sources(self) -> list[str]: + """Return sorted list of known document source keys.""" + return sorted(self._source_names) + + def __contains__(self, key: tuple[str, str]) -> bool: + """Check whether ``(source, doc_id)`` is in the store.""" + return key in self._docs + + def __len__(self) -> int: + """Return total number of loaded documents across all sources.""" + return len(self._docs) + + +__all__ = ["DocumentStore"] +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation____init__.py.md new file mode 100644 index 0000000..3b9a937 --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation____init__.py.md @@ -0,0 +1,84 @@ +# Source: aieng-forecasting/aieng/forecasting/evaluation/__init__.py + +kind: python + +```python +"""Evaluation harness: forecasting tasks, prediction payloads, and scoring.""" + +from aieng.forecasting.evaluation.artifacts import ( + DEFAULT_STORE_DIR, + cached_backtest, + cached_multi_backtest, + load_backtest_result, + load_multi_backtest_results, + 
save_backtest_result, + save_eval_result, + save_multi_backtest_results, + save_multi_eval_results, +) +from aieng.forecasting.evaluation.backtest import ( + BacktestResult, + BacktestSpec, + MultiTargetBacktestSpec, + backtest, + compute_brier_score, + compute_rps, + multi_backtest, +) +from aieng.forecasting.evaluation.describe import describe_spec, describe_task +from aieng.forecasting.evaluation.eval import ( + EvalBudgetExceededError, + EvalResult, + EvalSpec, + EvalTracker, + MultiTargetEvalSpec, + evaluate, + multi_evaluate, +) +from aieng.forecasting.evaluation.prediction import ( + STANDARD_QUANTILES, + BinaryForecast, + CategoricalForecast, + ContinuousForecast, + Prediction, +) +from aieng.forecasting.evaluation.predictor import Predictor +from aieng.forecasting.evaluation.task import ForecastingTask, TaskCategory + + +__all__ = [ + "DEFAULT_STORE_DIR", + "BacktestResult", + "BacktestSpec", + "BinaryForecast", + "CategoricalForecast", + "ContinuousForecast", + "EvalBudgetExceededError", + "EvalResult", + "EvalSpec", + "EvalTracker", + "ForecastingTask", + "MultiTargetBacktestSpec", + "MultiTargetEvalSpec", + "Prediction", + "Predictor", + "STANDARD_QUANTILES", + "TaskCategory", + "backtest", + "cached_backtest", + "cached_multi_backtest", + "compute_brier_score", + "compute_rps", + "describe_spec", + "describe_task", + "evaluate", + "load_backtest_result", + "load_multi_backtest_results", + "multi_backtest", + "multi_evaluate", + "save_backtest_result", + "save_eval_result", + "save_multi_backtest_results", + "save_multi_eval_results", +] +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__artifacts.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__artifacts.py.md new file mode 100644 index 0000000..a9fba49 --- /dev/null +++ 
b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__artifacts.py.md @@ -0,0 +1,440 @@ +# Source: aieng-forecasting/aieng/forecasting/evaluation/artifacts.py + +kind: python + +```python +"""Persist backtest and eval results to a filesystem artefact store. + +Backtests can be expensive to run — especially for agentic or LLM-based +predictors — and their outputs are the primary input to downstream analysis, +plotting, and leaderboard computation. This module provides a small +filesystem-backed store so that results can be saved once and re-read many +times across notebook sessions. + +Layout +------ +Results are stored as YAML files under a store directory: + +.. code-block:: text + + data/predictions/ + / + .yaml # single-target backtest + __.yaml # one file per task for multi-target + ____eval.yaml # multi-target eval run + +Single-target :class:`BacktestResult` / :class:`EvalResult` files live at +``//.yaml``. + +Multi-target results (one result per task under a single +:class:`MultiTargetBacktestSpec` / :class:`MultiTargetEvalSpec`) are split +across one YAML file per task. This keeps individual files readable and +makes partial caching straightforward: re-running after a new task is added +to the spec only has to compute the missing task. + +Caching semantics +----------------- +:func:`cached_backtest` and :func:`cached_multi_backtest` implement a simple +load-or-compute policy: + +- If all expected files exist under the store, load and return them. +- Otherwise, run the backtest, save the result(s), and return them. +- ``force_refresh=True`` always recomputes and overwrites. + +**Eval runs are never silently cached.** Each :func:`evaluate` / +:func:`multi_evaluate` call consumes one run from the budget in +:class:`EvalTracker`, so caching would obscure budget spend. Eval helpers +are write-only: :func:`save_eval_result` / :func:`save_multi_eval_results`. 
+ +YAML (not parquet or pickle) is the on-disk format because +:class:`BacktestResult` and :class:`EvalResult` are Pydantic models — the +YAML round-trip is straightforward and the result is human-readable, which +matters more than disk footprint at bootcamp scale. +""" + +from __future__ import annotations + +import logging +from pathlib import Path + +import yaml +from aieng.forecasting.data.service import DataService +from aieng.forecasting.evaluation.backtest import ( + BacktestResult, + BacktestSpec, + MultiTargetBacktestSpec, + backtest, +) +from aieng.forecasting.evaluation.eval import EvalResult, MultiTargetEvalSpec +from aieng.forecasting.evaluation.predictor import Predictor + + +#: Default store location, relative to the caller's working directory. +DEFAULT_STORE_DIR = Path("data/predictions") + + +# --------------------------------------------------------------------------- +# Internal helpers +# --------------------------------------------------------------------------- + + +def _resolve_store(store_dir: Path | None) -> Path: + """Return the effective store directory, falling back to the default.""" + return Path(store_dir) if store_dir is not None else DEFAULT_STORE_DIR + + +def _dump_yaml(model: BacktestResult | EvalResult, path: Path) -> None: + path.parent.mkdir(parents=True, exist_ok=True) + data = model.model_dump(mode="json") + with path.open("w") as f: + yaml.safe_dump(data, f, default_flow_style=False, sort_keys=False) + + +def _load_yaml(path: Path) -> dict[str, object]: + with path.open() as f: + loaded = yaml.safe_load(f) + if not isinstance(loaded, dict): + raise ValueError(f"Expected a mapping at {path}, got {type(loaded).__name__}") + return loaded + + +def _backtest_path(store_dir: Path, spec_id: str, predictor_id: str, task_id: str | None = None) -> Path: + """Return the artefact path for a single :class:`BacktestResult`. + + For single-target backtests pass ``task_id=None`` — the filename is + ``.yaml``. 
For multi-target the filename becomes + ``__.yaml`` to keep all tasks for a spec in one + directory. + """ + if task_id is None: + return store_dir / spec_id / f"{predictor_id}.yaml" + return store_dir / spec_id / f"{predictor_id}__{task_id}.yaml" + + +def _eval_path(store_dir: Path, spec_id: str, predictor_id: str, run_number: int, task_id: str | None = None) -> Path: + """Return the artefact path for a single :class:`EvalResult`. + + Eval filenames include ``run_number`` because each eval run consumes the + budget and we want all runs persisted rather than overwriting a previous + one. + """ + if task_id is None: + return store_dir / spec_id / f"{predictor_id}__eval_run{run_number}.yaml" + return store_dir / spec_id / f"{predictor_id}__{task_id}__eval_run{run_number}.yaml" + + +# --------------------------------------------------------------------------- +# Single-target backtest artefacts +# --------------------------------------------------------------------------- + + +def save_backtest_result( + result: BacktestResult, + spec_id: str, + store_dir: Path | None = None, +) -> Path: + """Persist a :class:`BacktestResult` to the artefact store. + + Parameters + ---------- + result : BacktestResult + The result to persist. + spec_id : str + Directory key under the store. For single-target backtests the + :class:`BacktestSpec` does not carry a ``spec_id`` field, so callers + must supply one explicitly. + store_dir : Path or None + Store root. Defaults to :data:`DEFAULT_STORE_DIR`. + + Returns + ------- + Path + The path the result was written to. + """ + store = _resolve_store(store_dir) + path = _backtest_path(store, spec_id, result.predictor_id) + _dump_yaml(result, path) + return path + + +def load_backtest_result( + spec_id: str, + predictor_id: str, + store_dir: Path | None = None, +) -> BacktestResult | None: + """Load a previously persisted :class:`BacktestResult` from the store. + + Parameters + ---------- + spec_id : str + Directory key under the store. 
+ predictor_id : str + Predictor whose result to load. + store_dir : Path or None + Store root. Defaults to :data:`DEFAULT_STORE_DIR`. + + Returns + ------- + BacktestResult or None + The loaded result, or ``None`` if no file exists for this combination. + """ + store = _resolve_store(store_dir) + path = _backtest_path(store, spec_id, predictor_id) + if not path.exists(): + return None + return BacktestResult.model_validate(_load_yaml(path)) + + +def cached_backtest( + predictor: Predictor, + spec: BacktestSpec, + spec_id: str, + data_service: DataService, + store_dir: Path | None = None, + force_refresh: bool = False, +) -> BacktestResult: + """Run :func:`backtest` with a load-or-compute cache. + + If a result already exists under ``//.yaml`` + and ``force_refresh`` is ``False``, the cached result is returned. + Otherwise the backtest is run and the result is persisted before return. + + Parameters + ---------- + predictor : Predictor + Forecasting model to evaluate. + spec : BacktestSpec + Backtest specification. + spec_id : str + Directory key used to locate / persist the artefact. + data_service : DataService + Pre-populated data service. + store_dir : Path or None + Store root. Defaults to :data:`DEFAULT_STORE_DIR`. + force_refresh : bool + When ``True`` always recompute even if a cached file exists. + + Returns + ------- + BacktestResult + The (possibly cached) backtest result. 
+ """ + if not force_refresh: + cached = load_backtest_result(spec_id, predictor.predictor_id, store_dir=store_dir) + if cached is not None: + return cached + result = backtest(predictor=predictor, spec=spec, data_service=data_service) + save_backtest_result(result, spec_id=spec_id, store_dir=store_dir) + return result + + +# --------------------------------------------------------------------------- +# Multi-target backtest artefacts +# --------------------------------------------------------------------------- + + +def save_multi_backtest_results( + results: dict[str, BacktestResult], + spec: MultiTargetBacktestSpec, + store_dir: Path | None = None, +) -> dict[str, Path]: + """Persist a full multi-target backtest result set (one file per task). + + Parameters + ---------- + results : dict[str, BacktestResult] + Output of :func:`multi_backtest`, keyed by ``task_id``. + spec : MultiTargetBacktestSpec + The parent spec; supplies ``spec_id`` used as the store subdirectory. + store_dir : Path or None + Store root. + + Returns + ------- + dict[str, Path] + Map from ``task_id`` to the written artefact path. + """ + store = _resolve_store(store_dir) + paths: dict[str, Path] = {} + for task_id, result in results.items(): + path = _backtest_path(store, spec.spec_id, result.predictor_id, task_id=task_id) + _dump_yaml(result, path) + paths[task_id] = path + return paths + + +def load_multi_backtest_results( + spec: MultiTargetBacktestSpec, + predictor_id: str, + store_dir: Path | None = None, +) -> dict[str, BacktestResult] | None: + """Load persisted multi-target results if *all* tasks have an artefact. + + Parameters + ---------- + spec : MultiTargetBacktestSpec + The parent spec. Its ``spec_id`` keys the lookup and its ``tasks`` + list enumerates which artefacts to load. + predictor_id : str + Predictor whose results to load. + store_dir : Path or None + Store root. + + Returns + ------- + dict[str, BacktestResult] or None + Full result dict keyed by ``task_id``. 
Returns ``None`` if any task + is missing — partial caches are never returned, to avoid hiding + incomplete state from callers. + """ + store = _resolve_store(store_dir) + results: dict[str, BacktestResult] = {} + for task in spec.tasks: + path = _backtest_path(store, spec.spec_id, predictor_id, task_id=task.task_id) + if not path.exists(): + return None + results[task.task_id] = BacktestResult.model_validate(_load_yaml(path)) + return results + + +_log = logging.getLogger(__name__) + + +def cached_multi_backtest( + predictor: Predictor, + spec: MultiTargetBacktestSpec, + data_service: DataService, + store_dir: Path | None = None, + force_refresh: bool = False, + max_retries: int = 2, + retry_delay: float = 2.0, +) -> dict[str, BacktestResult]: + """Run :func:`multi_backtest` with a per-task load-or-compute cache. + + Each task is cached independently under + ``//__.yaml``. On a fresh run a + completed task's file is written immediately so a crash mid-run leaves all + prior tasks intact. Re-running after a crash skips every already-cached + task and only retries the ones that didn't complete. + + If a task fails even after the retry logic inside :func:`run_eval_loop` + has been exhausted, the failure is logged at WARNING level and the task is + omitted from the returned dict rather than propagating the exception. This + keeps the outer experiment loop running so all other predictors still + complete. + + Parameters + ---------- + predictor : Predictor + Forecasting model to evaluate. + spec : MultiTargetBacktestSpec + Multi-target backtest specification. + data_service : DataService + Pre-populated data service. + store_dir : Path or None + Store root. Defaults to :data:`DEFAULT_STORE_DIR`. + force_refresh : bool + When ``True`` always recompute even if cached files exist. + max_retries : int, default=2 + Passed through to :func:`~aieng.forecasting.evaluation.backtest.backtest`. + Number of retry attempts per failing origin. 
+ retry_delay : float, default=2.0 + Seconds to wait between per-origin retry attempts. + + Returns + ------- + dict[str, BacktestResult] + Results keyed by ``task_id``. Tasks that failed are absent from the + dict; a WARNING log entry is emitted for each failure. + """ + store = _resolve_store(store_dir) + results: dict[str, BacktestResult] = {} + for single_spec in spec.specs(): + task_id = single_spec.task.task_id + path = _backtest_path(store, spec.spec_id, predictor.predictor_id, task_id=task_id) + if not force_refresh and path.exists(): + results[task_id] = BacktestResult.model_validate(_load_yaml(path)) + continue + try: + result = backtest( + predictor=predictor, + spec=single_spec, + data_service=data_service, + max_retries=max_retries, + retry_delay=retry_delay, + ) + except Exception as exc: + _log.warning( + "Backtest failed for predictor=%s task=%s — skipping task: %s", + predictor.predictor_id, + task_id, + exc, + ) + continue + _dump_yaml(result, path) + results[task_id] = result + return results + + +# --------------------------------------------------------------------------- +# Eval artefacts (write-only — eval is never silently cached) +# --------------------------------------------------------------------------- + + +def save_eval_result( + result: EvalResult, + store_dir: Path | None = None, +) -> Path: + """Persist a single :class:`EvalResult` to the artefact store. + + The filename encodes ``run_number`` so that successive eval runs are all + preserved rather than overwriting each other. + + Parameters + ---------- + result : EvalResult + The eval result to persist. Its ``eval_spec.spec_id`` determines the + subdirectory under the store. + store_dir : Path or None + Store root. Defaults to :data:`DEFAULT_STORE_DIR`. + + Returns + ------- + Path + The path the result was written to. 
+ """ + store = _resolve_store(store_dir) + path = _eval_path(store, result.eval_spec.spec_id, result.predictor_id, result.run_number) + _dump_yaml(result, path) + return path + + +def save_multi_eval_results( + results: dict[str, EvalResult], + spec: MultiTargetEvalSpec, + store_dir: Path | None = None, +) -> dict[str, Path]: + """Persist a full multi-target eval run (one file per task). + + Parameters + ---------- + results : dict[str, EvalResult] + Output of :func:`multi_evaluate`, keyed by ``task_id``. + spec : MultiTargetEvalSpec + Parent spec; supplies ``spec_id`` used as the store subdirectory. + store_dir : Path or None + Store root. + + Returns + ------- + dict[str, Path] + Map from ``task_id`` to the written artefact path. + """ + store = _resolve_store(store_dir) + paths: dict[str, Path] = {} + for task_id, result in results.items(): + path = _eval_path(store, spec.spec_id, result.predictor_id, result.run_number, task_id=task_id) + _dump_yaml(result, path) + paths[task_id] = path + return paths +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__backtest.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__backtest.py.md new file mode 100644 index 0000000..e83f5e2 --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__backtest.py.md @@ -0,0 +1,826 @@ +# Source: aieng-forecasting/aieng/forecasting/evaluation/backtest.py + +kind: python + +```python +"""BacktestSpec, BacktestResult, and the backtest() harness. + +This module also provides :class:`MultiTargetBacktestSpec` and +:func:`multi_backtest` for running a single predictor across a collection of +related forecasting tasks (e.g. all food CPI sub-categories) under identical +evaluation window parameters. 
+""" + +from __future__ import annotations + +import logging +import math +import time +from datetime import datetime, timezone +from typing import Literal + +import numpy as np +import pandas as pd +import properscoring as ps +from aieng.forecasting.data.service import DataService +from aieng.forecasting.evaluation.prediction import BinaryForecast, CategoricalForecast, ContinuousForecast, Prediction +from aieng.forecasting.evaluation.predictor import Predictor +from aieng.forecasting.evaluation.task import ForecastingTask +from pydantic import AliasChoices, BaseModel, Field, model_validator + + +ScoreMetric = Literal["crps", "brier", "rps"] + +#: Score metric names, keyed by ``ForecastingTask.payload_type``. +METRIC_BY_PAYLOAD_TYPE: dict[str, ScoreMetric] = {"continuous": "crps", "binary": "brier", "categorical": "rps"} + + +logger = logging.getLogger(__name__) + + +def _compute_origins(start: datetime, end: datetime, frequency: str, stride: int) -> list[datetime]: + """Compute strided forecast origin dates for a spec window. + + Shared by :class:`BacktestSpec` and + :class:`~aieng.forecasting.evaluation.eval.EvalSpec` to avoid duplicating + the striding logic. + + Parameters + ---------- + start : datetime + First candidate origin. + end : datetime + Last candidate origin (inclusive). + frequency : str + Pandas offset alias (e.g. ``"MS"``). + stride : int + Step size between origins in frequency units. + + Returns + ------- + list[datetime] + Candidate forecast origin dates, sorted ascending. + """ + all_dates = pd.date_range(start=start, end=end, freq=frequency) + strided = all_dates[::stride] + return [ts.to_pydatetime() for ts in strided] + + +class BacktestSpec(BaseModel): + """Specifies when and how often to evaluate a predictor against a task. + + ``BacktestSpec`` separates the *evaluation window* from the prediction + problem itself. 
A :class:`ForecastingTask` defines *what* to forecast; + ``BacktestSpec`` wraps a task and adds *when* and *how often* to run + the harness. + + Because ``BacktestSpec`` is a Pydantic model it is YAML-serializable, + making evaluation windows shareable and reproducible. Reference specs for + canonical tasks live in ``implementations//specs/``. + + Parameters + ---------- + task : ForecastingTask + The prediction problem to evaluate. + start : datetime + First candidate forecast origin. + end : datetime + Last candidate forecast origin (inclusive). + stride : int + Step size between origins in task-frequency units. ``stride=1`` means + every period; ``stride=6`` on monthly data means twice per year + (January and July when ``start`` falls on a month boundary). + origin_dates : list[datetime] or None + Optional explicit forecast origins. When provided, :meth:`origins` + returns exactly these dates (sorted ascending) instead of deriving a + regular grid from ``start``/``end``/``stride``. This supports + irregular event calendars — for example Bank of Canada fixed + announcement dates, which occur eight times per year on dates that no + pandas frequency alias can generate. All dates must fall within + ``[start, end]`` so the window fields remain an honest summary of the + evaluation period. + warmup : int + Minimum number of observations required in the cutoff-filtered series + before a forecast origin is used. Origins that do not have enough + history are silently skipped. + description : str + Free-form prose description of the backtest intent (methodology, + origin rationale, etc.). Optional — defaults to an empty string. + Consumers such as :func:`aieng.forecasting.evaluation.describe.describe_spec` + and LLM-based predictors surface this to provide qualitative context + alongside the quantitative task definition. + + Examples + -------- + >>> from datetime import datetime + >>> spec = BacktestSpec( + ... task=ForecastingTask( + ... 
task_id="cpi_gasoline_canada_1m", + ... target_series_id="cpi_gasoline_canada", + ... horizon=1, + ... frequency="MS", + ... description="CPI Gasoline Canada, 1-month ahead forecast", + ... ), + ... start=datetime(2000, 1, 1), + ... end=datetime(2025, 1, 1), + ... stride=1, + ... warmup=24, + ... ) + >>> origins = spec.origins() + >>> len(origins) > 0 + True + """ + + task: ForecastingTask + start: datetime = Field(description="First candidate forecast origin.") + end: datetime = Field(description="Last candidate forecast origin (inclusive).") + stride: int = Field(default=1, ge=1, description="Step size between origins in task-frequency units.") + origin_dates: list[datetime] | None = Field( + default=None, + description=( + "Optional explicit forecast origins for irregular calendars (e.g. central bank " + "announcement dates). When set, overrides the start/end/stride grid derivation." + ), + ) + warmup: int = Field(default=0, ge=0, description="Minimum observations required before first forecast.") + description: str = Field( + default="", + description="Free-form prose description of the backtest intent (methodology, origin rationale, etc.).", + ) + + @model_validator(mode="after") + def start_before_end(self) -> "BacktestSpec": + """Validate that start precedes end.""" + if self.start >= self.end: + raise ValueError(f"start ({self.start}) must be before end ({self.end})") + return self + + @model_validator(mode="after") + def origin_dates_in_window(self) -> "BacktestSpec": + """Validate that explicit origin dates fall within [start, end].""" + if self.origin_dates is not None: + if not self.origin_dates: + raise ValueError("origin_dates must be non-empty when provided; omit it to derive origins.") + out_of_window = [d for d in self.origin_dates if not (self.start <= d <= self.end)] + if out_of_window: + raise ValueError( + f"All origin_dates must fall within [start, end] = [{self.start}, {self.end}]. 
" + f"Out of window: {out_of_window}" + ) + return self + + def origins(self) -> list[datetime]: + """Return the candidate forecast origins derived from this spec. + + When ``origin_dates`` is set, those dates are returned sorted + ascending. Otherwise origins are generated using ``pd.date_range`` + with the task's frequency and the configured stride. The returned + list does not apply the warmup filter — that is applied inside + :func:`backtest` where the actual series data is available. + + Returns + ------- + list[datetime] + Candidate forecast origin dates, sorted ascending. + """ + if self.origin_dates is not None: + return sorted(self.origin_dates) + return _compute_origins(self.start, self.end, self.task.frequency, self.stride) + + +class BacktestResult(BaseModel): + """The outcome of a backtest run — a self-contained, serializable record. + + ``BacktestResult`` is a first-class Pydantic model (not just a DataFrame + of numbers). It is designed to be YAML-roundtrippable so that results can + be persisted alongside predictor implementations, fed to downstream agents + as structured context, or used as submission artefacts in a future + competition mechanism. + + Parameters + ---------- + spec : BacktestSpec + The exact spec that was evaluated. + predictor_id : str + Identifier for the predictor that produced these forecasts. + predictions : list[Prediction] + Flat list of scored predictions. For single-horizon tasks this is one + entry per evaluated origin; for multi-horizon tasks it is + ``origins_scored × len(task.horizons)`` (minus any future steps that + could not yet be resolved). Ordered by origin then by horizon. + scores : list[float] + Score for each prediction, parallel to ``predictions``. CRPS for + continuous tasks, Brier for binary tasks, RPS for categorical tasks. + Lower is better. + metric : {"crps", "brier", "rps"} + Which scoring rule produced ``scores`` / ``mean_score``. Determined by + the task's ``payload_type``. 
Defaults to ``"crps"`` so artefacts + written before binary support existed still load correctly. + mean_score : float + Mean score across all scored (origin, horizon) pairs. Older artefacts + serialized this field as ``mean_crps``; both keys are accepted on load. + ran_at : datetime + UTC wall-clock time when the backtest was executed. + skipped_origins : int + Number of candidate origins where no horizon could be scored (either + warmup not met, or all forecast dates were unresolvable). + """ + + spec: BacktestSpec + predictor_id: str + predictions: list[Prediction] + scores: list[float] + metric: ScoreMetric = Field( + default="crps", + description="Scoring rule used: 'crps' (continuous), 'brier' (binary), or 'rps' (categorical).", + ) + mean_score: float = Field( + validation_alias=AliasChoices("mean_score", "mean_crps"), + description="Mean score across all scored predictions (CRPS, Brier, or RPS; lower is better).", + ) + ran_at: datetime + skipped_origins: int = Field(default=0, description="Candidate origins where no horizon could be scored.") + + @model_validator(mode="after") + def lengths_match(self) -> "BacktestResult": + """Validate that predictions and scores have the same length.""" + if len(self.predictions) != len(self.scores): + raise ValueError( + f"predictions ({len(self.predictions)}) and scores ({len(self.scores)}) must have the same length" + ) + return self + + +def _crps_for_prediction(prediction: Prediction, actual: float) -> float: + """Compute CRPS for a single ContinuousForecast against an observed value. + + Uses ``properscoring.crps_ensemble`` with the quantile forecast values + as an ensemble. While quantile values are not independent samples from + the predictive distribution, this gives a reasonable CRPS approximation + when the quantile grid is sufficiently fine. + + Parameters + ---------- + prediction : Prediction + Must have a :class:`ContinuousForecast` payload. + actual : float + The observed value at the forecast date. 
+ + Returns + ------- + float + CRPS score (lower is better). + """ + if not isinstance(prediction.payload, ContinuousForecast): + raise TypeError("CRPS scoring requires a ContinuousForecast payload.") + payload = prediction.payload + ensemble = np.array(sorted(payload.quantiles.values()), dtype=float) + return float(ps.crps_ensemble(actual, ensemble)) + + +def compute_brier_score(probabilities: list[float], outcomes: list[float]) -> float: + """Mean Brier score for a batch of binary forecasts. + + The Brier score is ``mean((p - y)**2)`` over forecast/outcome pairs. It is + a strictly proper scoring rule for binary events: it is minimised in + expectation only by reporting the true event probability. + + Parameters + ---------- + probabilities : list[float] + Predicted P(event), each in [0, 1]. + outcomes : list[float] + Realised outcomes (0 or 1), parallel to ``probabilities``. + + Returns + ------- + float + Mean Brier score in [0, 1]; lower is better. ``nan`` for empty input. + """ + if not probabilities: + return float("nan") + if len(probabilities) != len(outcomes): + raise ValueError( + f"probabilities ({len(probabilities)}) and outcomes ({len(outcomes)}) must have the same length" + ) + probs = np.asarray(probabilities, dtype=float) + ys = np.asarray(outcomes, dtype=float) + return float(np.mean((probs - ys) ** 2)) + + +def compute_rps(probabilities: list[list[float]], outcome_indices: list[int]) -> float: + """Mean Ranked Probability Score for ordered-categorical forecasts. + + RPS is a strictly proper scoring rule for ordinal outcomes. For one + forecast with ``K`` ordered category probabilities ``p`` and realised + category index ``j``, this implementation uses the standard unnormalized + Epstein/Murphy convention: + ``sum((cumsum(p)[k] - I[j <= k])**2 for k in range(K - 1))``. + + For ``K=2`` it equals the binary Brier score ``(p - y)**2`` as implemented + by :func:`compute_brier_score`. 
This convention is one half of Brier's + original 1950 multi-category score, which is noted here because both + normalizations appear in the literature. + + Parameters + ---------- + probabilities : list[list[float]] + Ordered category probability rows, one row per forecast. All rows must + have the same length ``K >= 2``. + outcome_indices : list[int] + Realised category indices in ``[0, K)``, parallel to ``probabilities``. + + Returns + ------- + float + Mean RPS in ``[0, K-1]``; lower is better. ``nan`` for empty input. + """ + if not probabilities: + return float("nan") + if len(probabilities) != len(outcome_indices): + raise ValueError( + f"probabilities ({len(probabilities)}) and outcome_indices ({len(outcome_indices)}) " + "must have the same length" + ) + + row_length = len(probabilities[0]) + if row_length < 2: + raise ValueError(f"RPS probability rows must have length K >= 2; got {row_length}.") + for row in probabilities: + if len(row) != row_length: + raise ValueError("RPS probability rows must all have the same length.") + + scores: list[float] = [] + for row, outcome_index in zip(probabilities, outcome_indices, strict=True): + if outcome_index < 0 or outcome_index >= row_length: + raise ValueError(f"RPS outcome index {outcome_index} is out of range for K={row_length}.") + cumulative = np.cumsum(np.asarray(row, dtype=float))[:-1] + observed = np.asarray([1.0 if outcome_index <= k else 0.0 for k in range(row_length - 1)], dtype=float) + scores.append(float(np.sum((cumulative - observed) ** 2))) + return float(np.mean(scores)) + + +def _brier_for_prediction(prediction: Prediction, actual: float) -> float: + """Compute the Brier score for a single BinaryForecast against an observed outcome. + + The Brier score is the squared error between the forecast probability and + the realised binary outcome: ``(p - y)**2``. It is a strictly proper + scoring rule for binary events — the binary counterpart of CRPS. 
+ + Parameters + ---------- + prediction : Prediction + Must have a :class:`BinaryForecast` payload. + actual : float + The observed outcome at the forecast date. Must be 0.0 or 1.0 + (binary tasks resolve against a 0/1 event series). + + Returns + ------- + float + Brier score in [0, 1] (lower is better). + """ + if not isinstance(prediction.payload, BinaryForecast): + raise TypeError("Brier scoring requires a BinaryForecast payload.") + if actual not in (0.0, 1.0): + raise ValueError( + f"Brier scoring requires a binary (0/1) resolved outcome; got {actual}. " + f"Check that the task's target series is a 0/1 event series." + ) + return compute_brier_score([prediction.payload.probability], [actual]) + + +def _rps_for_prediction(task: ForecastingTask, prediction: Prediction, actual: float) -> float: + """Compute RPS for a single CategoricalForecast against an observed outcome.""" + if task.categories is None: + raise ValueError(f"Task '{task.task_id}' declares payload_type='categorical' but has no categories.") + if not isinstance(prediction.payload, CategoricalForecast): + raise TypeError("RPS scoring requires a CategoricalForecast payload.") + + categories = task.categories + labels = [category.label for category in categories] + values = [category.value for category in categories] + expected_labels = set(labels) + predicted_labels = set(prediction.payload.probabilities) + if predicted_labels != expected_labels: + missing = sorted(expected_labels - predicted_labels) + extra = sorted(predicted_labels - expected_labels) + raise ValueError( + f"Categorical prediction from predictor '{prediction.predictor_id}' must contain exactly the task " + f"category labels. Missing labels: {missing}; extra labels: {extra}." 
+ ) + + outcome_index: int | None = None + for index, value in enumerate(values): + if math.isclose(actual, value, abs_tol=1e-9): + outcome_index = index + break + if outcome_index is None: + raise ValueError( + f"Categorical resolved outcome {actual} does not match any task category value. Allowed values: {values}." + ) + + ordered_probabilities = [prediction.payload.probabilities[label] for label in labels] + return compute_rps([ordered_probabilities], [outcome_index]) + + +def _score_for_prediction(task: ForecastingTask, prediction: Prediction, actual: float) -> float: + """Score a prediction with the metric implied by the task's payload type. + + Dispatches to CRPS for ``payload_type="continuous"``, Brier for + ``payload_type="binary"``, and RPS for ``payload_type="categorical"``, + after validating that the payload the predictor returned actually matches + the task declaration. A mismatch fails loudly: a probability scored with + CRPS (or quantiles scored with Brier/RPS) would be silently meaningless. + + Parameters + ---------- + task : ForecastingTask + Declares the expected payload modality. + prediction : Prediction + The prediction to score. + actual : float + The resolved ground-truth value. + + Returns + ------- + float + CRPS, Brier, or RPS score (lower is better). + """ + if task.payload_type == "binary": + if not isinstance(prediction.payload, BinaryForecast): + raise TypeError( + f"Task '{task.task_id}' declares payload_type='binary' but predictor " + f"'{prediction.predictor_id}' returned a {type(prediction.payload).__name__} payload." + ) + return _brier_for_prediction(prediction, actual) + if task.payload_type == "categorical": + if not isinstance(prediction.payload, CategoricalForecast): + raise TypeError( + f"Task '{task.task_id}' declares payload_type='categorical' but predictor " + f"'{prediction.predictor_id}' returned a {type(prediction.payload).__name__} payload." 
+ ) + return _rps_for_prediction(task, prediction, actual) + if not isinstance(prediction.payload, ContinuousForecast): + raise TypeError( + f"Task '{task.task_id}' declares payload_type='continuous' but predictor " + f"'{prediction.predictor_id}' returned a {type(prediction.payload).__name__} payload." + ) + return _crps_for_prediction(prediction, actual) + + +def _resolve(task: ForecastingTask, forecast_date: datetime, data_service: DataService) -> float | None: + """Look up the observed value at a forecast date. + + Queries the data service with a sufficiently late ``as_of`` to ensure the + observation is available. Returns ``None`` if the observation is not found + (e.g. the forecast date is in the future). + + Parameters + ---------- + task : ForecastingTask + Used to identify the target series. + forecast_date : datetime + The date whose observed value is needed. + data_service : DataService + The data service to query. + + Returns + ------- + float or None + The observed value, or ``None`` if unavailable. + """ + # Query with today as as_of to get all available data including future observations. + as_of_now = datetime.now(tz=timezone.utc).replace(tzinfo=None) + full_series = data_service.get_series(task.target_series_id, as_of=as_of_now) + + target_ts = pd.Timestamp(forecast_date) + match = full_series[pd.to_datetime(full_series["timestamp"]) == target_ts] + if match.empty: + return None + return float(match["value"].iloc[0]) + + +def run_eval_loop( + predictor: Predictor, + task: ForecastingTask, + origins: list[datetime], + warmup: int, + data_service: DataService, + max_retries: int = 2, + retry_delay: float = 2.0, +) -> tuple[list[Prediction], list[float], int]: + """Core evaluation loop shared by ``backtest()`` and ``evaluate()``. 
+ + Iterates over ``origins``, calls the predictor at each origin, resolves + predictions against the observed series, and scores with the metric + implied by the task's ``payload_type`` (CRPS for continuous, Brier for + binary, RPS for categorical). + + Parameters + ---------- + predictor : Predictor + The forecasting model to evaluate. + task : ForecastingTask + The prediction problem being evaluated. + origins : list[datetime] + Candidate forecast origin dates (already strided / derived from a spec). + warmup : int + Minimum number of observations required before a forecast origin is used. + data_service : DataService + Pre-populated data service. Must have the target series registered. + max_retries : int, default=2 + Number of times to retry a failing ``predictor.predict()`` call before + skipping the origin. Handles transient model errors (e.g. malformed + structured output) without crashing the whole backtest. + retry_delay : float, default=2.0 + Seconds to wait between retry attempts. + + Returns + ------- + tuple[list[Prediction], list[float], int] + ``(predictions, scores, skipped)`` — parallel lists of predictions and + scores, plus the count of origins that were skipped. + + Raises + ------ + ValueError + If no origins produce a resolvable prediction. 
+ """ + predictions: list[Prediction] = [] + scores: list[float] = [] + skipped = 0 + + for origin in origins: + ctx = data_service.context(as_of=origin) + + if warmup > 0: + series = ctx.get_series(task.target_series_id) + if len(series) < warmup: + skipped += 1 + continue + + origin_predictions: list[Prediction] = [] + last_exc: BaseException | None = None + for attempt in range(max_retries + 1): + try: + origin_predictions = predictor.predict(task, ctx) + last_exc = None + break + except Exception as exc: + last_exc = exc + if attempt < max_retries: + logger.warning( + "predict() failed at origin %s (attempt %d/%d): %s — retrying in %.0fs", + origin.date(), + attempt + 1, + max_retries + 1, + exc, + retry_delay, + ) + time.sleep(retry_delay) + + if last_exc is not None: + logger.warning( + "predict() failed at origin %s after %d attempt(s) — skipping origin: %s", + origin.date(), + max_retries + 1, + last_exc, + ) + skipped += 1 + continue + + origin_scored = 0 + for pred in origin_predictions: + actual = _resolve(task, pred.forecast_date, data_service) + if actual is None: + continue + score = _score_for_prediction(task, pred, actual) + predictions.append(pred) + scores.append(score) + origin_scored += 1 + + if origin_scored == 0: + skipped += 1 + + if not predictions: + raise ValueError( + f"No predictions were scored. All {len(origins)} candidate origins were skipped. " + f"Check that the target series covers the evaluation window and that warmup ({warmup}) " + f"is not too large." + ) + + return predictions, scores, skipped + + +def backtest( + predictor: Predictor, + spec: BacktestSpec, + data_service: DataService, + max_retries: int = 2, + retry_delay: float = 2.0, +) -> BacktestResult: + """Run a backtest of a predictor against a BacktestSpec. 
+ + Iterates over forecast origins derived from the spec, calls the predictor + at each origin (with a :class:`~aieng.forecasting.data.context.ForecastContext` + scoped to that date), resolves predictions against the observed series, and + scores with the metric implied by the task's ``payload_type`` (CRPS for + continuous tasks, Brier for binary tasks, RPS for categorical tasks). + + Origins with insufficient history (fewer than ``spec.warmup`` observations + in the cutoff-filtered series) are silently skipped. Predictions whose + ``forecast_date`` has not yet been observed are dropped without scoring; an + origin with no scorable prediction is counted as skipped. Origins where + ``predict()`` still fails after all retries are skipped with a warning. + + Parameters + ---------- + predictor : Predictor + The forecasting model to evaluate. + spec : BacktestSpec + Defines the task, evaluation window, stride, and warmup. + data_service : DataService + Pre-populated data service. Must have the target series registered. + max_retries : int, default=2 + Passed through to :func:`run_eval_loop`. Number of retry attempts per + failing origin before it is counted as skipped. + retry_delay : float, default=2.0 + Seconds to wait between retry attempts. + + Returns + ------- + BacktestResult + A fully populated result record including all predictions and scores. + + Raises + ------ + KeyError + If the target series is not registered in the data service. + ValueError + If no origins produce a resolvable prediction (all skipped). 
+ + Examples + -------- + >>> results = backtest(predictor=my_predictor, spec=spec, data_service=svc) + >>> print(f"Mean {results.metric.upper()}: {results.mean_score:.4f}") + """ + predictions, scores, skipped = run_eval_loop( + predictor=predictor, + task=spec.task, + origins=spec.origins(), + warmup=spec.warmup, + data_service=data_service, + max_retries=max_retries, + retry_delay=retry_delay, + ) + return BacktestResult( + spec=spec, + predictor_id=predictor.predictor_id, + predictions=predictions, + scores=scores, + metric=METRIC_BY_PAYLOAD_TYPE[spec.task.payload_type], + mean_score=float(np.mean(scores)), + ran_at=datetime.now(tz=timezone.utc).replace(tzinfo=None), + skipped_origins=skipped, + ) + + +# --------------------------------------------------------------------------- +# MultiTargetBacktestSpec and multi_backtest() # noqa: ERA001 +# --------------------------------------------------------------------------- + + +class MultiTargetBacktestSpec(BaseModel): + """Backtest spec that evaluates a predictor across multiple related tasks. + + ``MultiTargetBacktestSpec`` groups several :class:`ForecastingTask` objects + under a single shared evaluation window (``start``, ``end``, ``stride``, + ``warmup``). All tasks must share the same ``frequency`` — this is + enforced at construction time. + + A typical use case is evaluating a predictor on all food CPI sub-categories + simultaneously: each category is a separate task, but they all use monthly + data and the same historical window. + + The spec can be decomposed into a list of standard :class:`BacktestSpec` + objects via :meth:`specs`, or evaluated directly with :func:`multi_backtest`. + + Parameters + ---------- + spec_id : str + Stable identifier for this spec. Used as the directory key for + persisted artefacts (see + :mod:`aieng.forecasting.evaluation.artifacts`) and for surfacing the + spec in logs and agent context. Should be unique across all spec files. 
+ tasks : list[ForecastingTask] + The prediction problems to evaluate. All must share the same + ``frequency``. + start : datetime + First candidate forecast origin. + end : datetime + Last candidate forecast origin (inclusive). + stride : int + Step size between origins in task-frequency units. + warmup : int + Minimum number of observations required before a forecast origin is used. + description : str + Free-form prose description of the backtest intent (methodology, + origin rationale, etc.). Optional — defaults to an empty string. + + Examples + -------- + >>> spec = MultiTargetBacktestSpec( + ... spec_id="food_cpi_cfpr_backtest", + ... tasks=[task_food, task_meat, task_dairy], + ... start=datetime(2000, 1, 1), + ... end=datetime(2026, 1, 1), + ... stride=6, + ... warmup=24, + ... ) + >>> per_task_results = multi_backtest(my_predictor, spec, svc) + >>> for task_id, result in per_task_results.items(): + ... print(f"{task_id}: mean CRPS = {result.mean_score:.4f}") + """ + + spec_id: str = Field(description="Stable identifier for this spec; keys the artefact store.") + tasks: list[ForecastingTask] = Field( + min_length=1, description="Prediction problems; all must share the same frequency." 
+ ) + start: datetime = Field(description="First candidate forecast origin.") + end: datetime = Field(description="Last candidate forecast origin (inclusive).") + stride: int = Field(default=1, ge=1, description="Step size between origins in task-frequency units.") + warmup: int = Field(default=0, ge=0, description="Minimum observations required before first forecast.") + description: str = Field( + default="", + description="Free-form prose description of the backtest intent (methodology, origin rationale, etc.).", + ) + + @model_validator(mode="after") + def _validate(self) -> "MultiTargetBacktestSpec": + if self.start >= self.end: + raise ValueError(f"start ({self.start}) must be before end ({self.end})") + frequencies = {t.frequency for t in self.tasks} + if len(frequencies) > 1: + raise ValueError( + f"All tasks in a MultiTargetBacktestSpec must share the same frequency. Found: {sorted(frequencies)}" + ) + return self + + def specs(self) -> list[BacktestSpec]: + """Decompose into one :class:`BacktestSpec` per task. + + Returns + ------- + list[BacktestSpec] + One spec per task, all sharing the same window parameters. + """ + return [ + BacktestSpec( + task=t, + start=self.start, + end=self.end, + stride=self.stride, + warmup=self.warmup, + description=self.description, + ) + for t in self.tasks + ] + + +def multi_backtest( + predictor: Predictor, spec: MultiTargetBacktestSpec, data_service: DataService +) -> dict[str, BacktestResult]: + """Run a backtest of a predictor across all tasks in a MultiTargetBacktestSpec. + + Calls :func:`backtest` once per task and returns the results keyed by + ``task_id``. All tasks share the same evaluation window, stride, and warmup + defined in the spec. + + Parameters + ---------- + predictor : Predictor + The forecasting model to evaluate. + spec : MultiTargetBacktestSpec + Defines the tasks, shared evaluation window, stride, and warmup. + data_service : DataService + Pre-populated data service. 
Must have all target series registered. + + Returns + ------- + dict[str, BacktestResult] + Backtest results keyed by ``task_id``, one entry per task. + + Raises + ------ + KeyError + If any target series is not registered in the data service. + ValueError + If no origins can be scored for any task. + + Examples + -------- + >>> results = multi_backtest(predictor=my_predictor, spec=spec, data_service=svc) + >>> for task_id, result in results.items(): + ... print(f"{task_id}: {result.mean_score:.4f}") + """ + return {single_spec.task.task_id: backtest(predictor, single_spec, data_service) for single_spec in spec.specs()} +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__describe.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__describe.py.md new file mode 100644 index 0000000..e44aa07 --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__describe.py.md @@ -0,0 +1,183 @@ +# Source: aieng-forecasting/aieng/forecasting/evaluation/describe.py + +kind: python + +```python +"""Human-readable descriptions of forecasting tasks and specs. + +These helpers turn a :class:`ForecastingTask` / :class:`BacktestSpec` / +:class:`EvalSpec` and their multi-target counterparts into a plain-text +block suitable for printing in a notebook or piping into an LLM predictor +prompt. They are the simplest form of "spec as source of truth": one +input (the spec, optionally a :class:`DataService` for metadata lookup), +one output (a string that captures the full problem definition). + +The output format is intentionally minimal and stable — it is not an API, +and production code should depend on the model fields directly. It is +purely for display / prompt-construction use cases. 
+""" + +from __future__ import annotations + +from aieng.forecasting.data.service import DataService +from aieng.forecasting.evaluation.backtest import BacktestSpec, MultiTargetBacktestSpec +from aieng.forecasting.evaluation.eval import EvalSpec, MultiTargetEvalSpec +from aieng.forecasting.evaluation.task import ForecastingTask + + +def _series_line(series_id: str, data_service: DataService | None) -> str: + """Return a display line for ``series_id``, with metadata if available.""" + if data_service is None: + return f"- target_series_id: {series_id}" + try: + meta = data_service.get_metadata(series_id) + except KeyError: + return f"- target_series_id: {series_id} (not registered in data_service)" + return ( + f"- target_series_id: {series_id}\n" + f" description: {meta.description}\n" + f" source: {meta.source}\n" + f" units: {meta.units}\n" + f" frequency: {meta.frequency}" + ) + + +def describe_task(task: ForecastingTask, data_service: DataService | None = None) -> str: + """Return a plain-text description of a :class:`ForecastingTask`. + + Parameters + ---------- + task : ForecastingTask + The task to describe. + data_service : DataService or None + Optional data service. When provided, metadata for + ``target_series_id`` is included in the description. + + Returns + ------- + str + Multi-line description suitable for printing or embedding in a prompt. 
+ """ + horizons_display = task.horizons[0] if len(task.horizons) == 1 else f"{task.horizons} (len={len(task.horizons)})" + lines = [ + f"Task: {task.task_id}", + f" description: {task.description}", + f" horizons: {horizons_display}", + f" frequency: {task.frequency}", + f" payload: {task.payload_type}", + ] + if task.payload_type == "categorical" and task.categories is not None: + categories = " < ".join(f"{category.label}({category.value:g})" for category in task.categories) + lines.append(f" categories: {categories}") + lines.extend( + [ + f" resolution: {task.resolution_fn}", + _series_line(task.target_series_id, data_service), + ] + ) + return "\n".join(lines) + + +def _window_lines(start: object, end: object, stride: int, warmup: int) -> list[str]: + return [ + f" start: {start}", + f" end: {end}", + f" stride: {stride}", + f" warmup: {warmup}", + ] + + +def _describe_backtest_spec(spec: BacktestSpec, data_service: DataService | None) -> str: + lines = [ + "BacktestSpec", + ] + if spec.description: + lines.append(f" description: {spec.description}") + lines.extend(_window_lines(spec.start, spec.end, spec.stride, spec.warmup)) + if spec.origin_dates is not None: + lines.append(f" origins: {len(spec.origin_dates)} explicit dates (irregular calendar)") + lines.append("") + lines.append(describe_task(spec.task, data_service)) + return "\n".join(lines) + + +def _describe_eval_spec(spec: EvalSpec, data_service: DataService | None) -> str: + lines = [ + f"EvalSpec (spec_id={spec.spec_id})", + ] + if spec.description: + lines.append(f" description: {spec.description}") + lines.extend(_window_lines(spec.start, spec.end, spec.stride, spec.warmup)) + if spec.origin_dates is not None: + lines.append(f" origins: {len(spec.origin_dates)} explicit dates (irregular calendar)") + lines.append(f" max_runs: {spec.max_runs}") + lines.append("") + lines.append(describe_task(spec.task, data_service)) + return "\n".join(lines) + + +def _describe_multi_target_backtest_spec(spec: 
MultiTargetBacktestSpec, data_service: DataService | None) -> str: + lines = [ + f"MultiTargetBacktestSpec (spec_id={spec.spec_id})", + ] + if spec.description: + lines.append(f" description: {spec.description}") + lines.extend(_window_lines(spec.start, spec.end, spec.stride, spec.warmup)) + lines.append(f" tasks: {len(spec.tasks)}") + lines.append("") + for task in spec.tasks: + lines.append(describe_task(task, data_service)) + lines.append("") + return "\n".join(lines).rstrip() + "\n" + + +def _describe_multi_target_eval_spec(spec: MultiTargetEvalSpec, data_service: DataService | None) -> str: + lines = [ + f"MultiTargetEvalSpec (spec_id={spec.spec_id})", + ] + if spec.description: + lines.append(f" description: {spec.description}") + lines.extend(_window_lines(spec.start, spec.end, spec.stride, spec.warmup)) + lines.append(f" max_runs: {spec.max_runs}") + lines.append(f" tasks: {len(spec.tasks)}") + lines.append("") + for task in spec.tasks: + lines.append(describe_task(task, data_service)) + lines.append("") + return "\n".join(lines).rstrip() + "\n" + + +def describe_spec( + spec: BacktestSpec | EvalSpec | MultiTargetBacktestSpec | MultiTargetEvalSpec, + data_service: DataService | None = None, +) -> str: + """Return a plain-text description of any supported spec. + + Dispatches on the spec type and produces a consistent multi-line layout + covering the window parameters, budget / run-count (where applicable), + and the full task definition(s). + + Parameters + ---------- + spec : BacktestSpec | EvalSpec | MultiTargetBacktestSpec | MultiTargetEvalSpec + The specification to describe. + data_service : DataService or None + Optional data service used to enrich target-series lines with + metadata (description, source, units, frequency). + + Returns + ------- + str + Multi-line description suitable for printing or embedding in a + prompt. 
+ """ + if isinstance(spec, MultiTargetBacktestSpec): + return _describe_multi_target_backtest_spec(spec, data_service) + if isinstance(spec, MultiTargetEvalSpec): + return _describe_multi_target_eval_spec(spec, data_service) + if isinstance(spec, EvalSpec): + return _describe_eval_spec(spec, data_service) + if isinstance(spec, BacktestSpec): + return _describe_backtest_spec(spec, data_service) + raise TypeError(f"Unsupported spec type: {type(spec).__name__}") +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__eval.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__eval.py.md new file mode 100644 index 0000000..3f06a68 --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__eval.py.md @@ -0,0 +1,676 @@ +# Source: aieng-forecasting/aieng/forecasting/evaluation/eval.py + +kind: python + +```python +"""EvalSpec, EvalResult, EvalTracker, and the evaluate() harness. + +Eval mode is distinct from backtesting: it is intended to estimate how well +learned or backtested results generalise to recent, held-out data. The key +differences from a backtest are: + +- **Protected window** — the evaluation window should cover recent data that + has not been used for tuning or learning. By convention, reference eval + specs live in ``implementations//specs/`` and are not modified + by participants. + +- **Run-budget control** — ``EvalSpec.max_runs`` optionally caps how many + times a participant is allowed to run a given eval. When an + :class:`EvalTracker` is supplied to :func:`evaluate`, the budget is checked + before the run and the counter is incremented on success. This prevents + inadvertent over-fitting to the held-out window. 
+ +This module also provides :class:`MultiTargetEvalSpec` and +:func:`multi_evaluate` for evaluating a predictor across multiple related tasks +under a single shared budget. A single ``multi_evaluate`` call counts as one +run against the budget regardless of how many tasks are included. + +Intended usage in a bootcamp session:: + + import yaml + from pathlib import Path + from aieng.forecasting.evaluation import EvalSpec, EvalTracker, evaluate + + with open("implementations/getting_started/specs/cpi_gasoline_eval_2025.yaml") as f: + spec = EvalSpec.model_validate(yaml.safe_load(f)) + + tracker = EvalTracker(Path("eval_runs.yaml")) + result = evaluate(my_predictor, spec, svc, tracker=tracker) + print(f"Eval mean {result.metric.upper()}: {result.mean_score:.4f}") + +If ``tracker`` is omitted, :func:`evaluate` runs unconditionally and sets +``run_number=1``. +""" + +from __future__ import annotations + +from datetime import datetime, timezone +from pathlib import Path + +import numpy as np +import yaml +from aieng.forecasting.data.service import DataService +from aieng.forecasting.evaluation.backtest import METRIC_BY_PAYLOAD_TYPE, ScoreMetric, _compute_origins, run_eval_loop +from aieng.forecasting.evaluation.prediction import Prediction +from aieng.forecasting.evaluation.predictor import Predictor +from aieng.forecasting.evaluation.task import ForecastingTask +from pydantic import AliasChoices, BaseModel, Field, model_validator + + +# --------------------------------------------------------------------------- +# Exceptions +# --------------------------------------------------------------------------- + + +class EvalBudgetExceededError(ValueError): + """Raised when an :class:`EvalTracker` has exhausted the run budget for a spec. + + Parameters + ---------- + spec_id : str + The identifier of the eval spec whose budget was exceeded. + runs_used : int + How many runs have already been recorded for this spec. + max_runs : int + The budget cap declared on the spec. 
+ """ + + def __init__(self, spec_id: str, runs_used: int, max_runs: int) -> None: + self.spec_id = spec_id + self.runs_used = runs_used + self.max_runs = max_runs + super().__init__( + f"Eval budget exhausted for '{spec_id}': " + f"{runs_used}/{max_runs} runs already used. " + f"Run fewer evaluations against the held-out window to avoid over-fitting." + ) + + +# --------------------------------------------------------------------------- +# EvalSpec +# --------------------------------------------------------------------------- + + +class EvalSpec(BaseModel): + """Specifies a protected evaluation window for estimating generalisation. + + ``EvalSpec`` mirrors :class:`~aieng.forecasting.evaluation.backtest.BacktestSpec` + but adds two fields that make it suitable as a held-out, budget-controlled + evaluation mode: + + - ``spec_id`` — a stable, human-readable identifier used by + :class:`EvalTracker` to key run counts. Should be unique per spec file. + - ``max_runs`` — an optional cap on how many times this spec may be + evaluated by a single participant. ``None`` means unlimited. + + Like ``BacktestSpec``, ``EvalSpec`` is fully YAML-serializable. Reference + eval specs live in ``implementations//specs/`` and are + versioned in the repo so that the exact window used for evaluation is + always reproducible. + + Parameters + ---------- + spec_id : str + Stable identifier for this spec, used by :class:`EvalTracker` to key + run counts. Should be unique across all spec files. + task : ForecastingTask + The prediction problem to evaluate. + start : datetime + First candidate forecast origin. + end : datetime + Last candidate forecast origin (inclusive). + stride : int + Step size between origins in task-frequency units. + origin_dates : list[datetime] or None + Optional explicit forecast origins. When provided, :meth:`origins` + returns exactly these dates (sorted ascending) instead of deriving a + regular grid from ``start``/``end``/``stride``. 
Supports irregular + event calendars (e.g. Bank of Canada fixed announcement dates). All + dates must fall within ``[start, end]``. + warmup : int + Minimum number of observations required before a forecast origin is used. + max_runs : int or None + Maximum number of times this spec may be evaluated (per tracker). + ``None`` means unlimited. + + Examples + -------- + >>> spec = EvalSpec( + ... spec_id="cpi_gasoline_eval_2025", + ... task=ForecastingTask( + ... task_id="cpi_gasoline_canada_1m", + ... target_series_id="cpi_gasoline_canada", + ... horizon=1, + ... frequency="MS", + ... description="CPI Gasoline Canada, 1-month ahead forecast", + ... ), + ... start=datetime(2025, 1, 1), + ... end=datetime(2026, 3, 1), + ... stride=1, + ... warmup=24, + ... max_runs=5, + ... ) + """ + + spec_id: str = Field(description="Stable identifier for tracking; keyed by EvalTracker.") + task: ForecastingTask + start: datetime = Field(description="First candidate forecast origin.") + end: datetime = Field(description="Last candidate forecast origin (inclusive).") + stride: int = Field(default=1, ge=1, description="Step size between origins in task-frequency units.") + origin_dates: list[datetime] | None = Field( + default=None, + description=( + "Optional explicit forecast origins for irregular calendars (e.g. central bank " + "announcement dates). When set, overrides the start/end/stride grid derivation." + ), + ) + warmup: int = Field(default=0, ge=0, description="Minimum observations required before first forecast.") + max_runs: int | None = Field( + default=None, + ge=1, + description="Maximum allowed evaluations against this spec (per tracker). 
None = unlimited.", + ) + description: str = Field( + default="", + description="Free-form prose description of the eval intent (methodology, origin rationale, etc.).", + ) + + @model_validator(mode="after") + def start_before_end(self) -> "EvalSpec": + """Validate that start precedes end.""" + if self.start >= self.end: + raise ValueError(f"start ({self.start}) must be before end ({self.end})") + return self + + @model_validator(mode="after") + def origin_dates_in_window(self) -> "EvalSpec": + """Validate that explicit origin dates fall within [start, end].""" + if self.origin_dates is not None: + if not self.origin_dates: + raise ValueError("origin_dates must be non-empty when provided; omit it to derive origins.") + out_of_window = [d for d in self.origin_dates if not (self.start <= d <= self.end)] + if out_of_window: + raise ValueError( + f"All origin_dates must fall within [start, end] = [{self.start}, {self.end}]. " + f"Out of window: {out_of_window}" + ) + return self + + def origins(self) -> list[datetime]: + """Return the candidate forecast origins derived from this spec. + + When ``origin_dates`` is set, those dates are returned sorted + ascending. Otherwise origins are derived from + ``start``/``end``/``stride`` on the task's frequency grid. + + Returns + ------- + list[datetime] + Candidate forecast origin dates, sorted ascending. + """ + if self.origin_dates is not None: + return sorted(self.origin_dates) + return _compute_origins(self.start, self.end, self.task.frequency, self.stride) + + +# --------------------------------------------------------------------------- +# EvalResult +# --------------------------------------------------------------------------- + + +class EvalResult(BaseModel): + """The outcome of an eval run — analogous to ``BacktestResult`` for eval mode. 
+ + ``EvalResult`` carries the same payload as + :class:`~aieng.forecasting.evaluation.backtest.BacktestResult` plus + ``run_number``, which records which run against this spec this was (1st, + 2nd, …). This provenance field is populated automatically by + :func:`evaluate` using the :class:`EvalTracker`. + + Parameters + ---------- + eval_spec : EvalSpec + The exact spec that was evaluated. + predictor_id : str + Identifier for the predictor that produced these forecasts. + predictions : list[Prediction] + One ``Prediction`` per evaluated forecast origin, in chronological order. + scores : list[float] + Score for each prediction. CRPS for continuous tasks, Brier for binary + tasks, RPS for categorical tasks. Lower is better. + metric : {"crps", "brier", "rps"} + Which scoring rule produced ``scores`` / ``mean_score``. Determined by + the task's ``payload_type``. Defaults to ``"crps"`` so artefacts + written before binary support existed still load correctly. + mean_score : float + Mean score across all evaluated origins. Older artefacts serialized + this field as ``mean_crps``; both keys are accepted on load. + ran_at : datetime + UTC wall-clock time when the eval was executed. + skipped_origins : int + Number of candidate origins skipped due to insufficient warmup or + missing ground truth. + run_number : int + Which run against this spec this was (1-indexed). Set to 1 when no + tracker is supplied to :func:`evaluate`. 
+ """ + + eval_spec: EvalSpec + predictor_id: str + predictions: list[Prediction] + scores: list[float] + metric: ScoreMetric = Field( + default="crps", + description="Scoring rule used: 'crps' (continuous), 'brier' (binary), or 'rps' (categorical).", + ) + mean_score: float = Field( + validation_alias=AliasChoices("mean_score", "mean_crps"), + description="Mean score across all scored predictions (CRPS, Brier, or RPS; lower is better).", + ) + ran_at: datetime + skipped_origins: int = Field(default=0) + run_number: int = Field(default=1, ge=1, description="Which run against this spec this was (1-indexed).") + + @model_validator(mode="after") + def lengths_match(self) -> "EvalResult": + """Validate that predictions and scores have the same length.""" + if len(self.predictions) != len(self.scores): + raise ValueError( + f"predictions ({len(self.predictions)}) and scores ({len(self.scores)}) must have the same length" + ) + return self + + +# --------------------------------------------------------------------------- +# EvalTracker +# --------------------------------------------------------------------------- + + +class EvalTracker: + """Persists run counts for eval specs to a YAML file. + + Each call to :meth:`record` increments the run counter for the given + ``spec_id`` and writes the updated state to disk. On the next call to + :func:`evaluate`, the counter is read back via :meth:`runs_for` before the + run begins so that the budget cap in :attr:`EvalSpec.max_runs` can be + enforced. + + The tracking file is created on first write; the directory must already + exist. + + Tracking file format:: + + cpi_gasoline_eval_2025: + runs: 2 + last_run_at: "2026-04-02T10:00:00" + + Parameters + ---------- + path : Path + Path to the YAML tracking file. 
+ + Examples + -------- + >>> tracker = EvalTracker(Path("eval_runs.yaml")) + >>> tracker.runs_for("my_spec") + 0 + >>> tracker.record("my_spec", datetime.utcnow()) + >>> tracker.runs_for("my_spec") + 1 + """ + + def __init__(self, path: Path) -> None: + self._path = path + + @property + def path(self) -> Path: + """Path to the YAML tracking file.""" + return self._path + + def _load(self) -> dict[str, dict[str, object]]: + if not self._path.exists(): + return {} + with self._path.open() as f: + data = yaml.safe_load(f) + return data if isinstance(data, dict) else {} + + def _save(self, data: dict[str, dict[str, object]]) -> None: + with self._path.open("w") as f: + yaml.dump(data, f, default_flow_style=False, sort_keys=True) + + def runs_for(self, spec_id: str) -> int: + """Return the number of runs already recorded for ``spec_id``. + + Parameters + ---------- + spec_id : str + The eval spec identifier to query. + + Returns + ------- + int + Number of runs recorded; 0 if ``spec_id`` has never been run. + """ + data = self._load() + entry = data.get(spec_id, {}) + runs_val = entry.get("runs", 0) + return runs_val if isinstance(runs_val, int) else int(str(runs_val)) + + def record(self, spec_id: str, ran_at: datetime) -> None: + """Increment the run counter for ``spec_id`` and persist to disk. + + Parameters + ---------- + spec_id : str + The eval spec identifier to update. + ran_at : datetime + The UTC time of the run being recorded. 
+ """ + data = self._load() + entry = data.get(spec_id, {"runs": 0}) + runs_val = entry.get("runs", 0) + current = runs_val if isinstance(runs_val, int) else int(str(runs_val)) + entry["runs"] = current + 1 + entry["last_run_at"] = ran_at.isoformat() + data[spec_id] = entry + self._save(data) + + +# --------------------------------------------------------------------------- +# evaluate() harness +# --------------------------------------------------------------------------- + + +def evaluate( + predictor: Predictor, + spec: EvalSpec, + data_service: DataService, + tracker: EvalTracker | None = None, +) -> EvalResult: + """Run an evaluation of a predictor against a protected :class:`EvalSpec`. + + Behaves identically to :func:`~aieng.forecasting.evaluation.backtest.backtest` + at the forecast level, but additionally: + + 1. **Budget check** — if ``tracker`` is provided and ``spec.max_runs`` is + set, the run is refused with :exc:`EvalBudgetExceededError` if the + budget has been exhausted. + 2. **Run recording** — after a successful run, ``tracker.record()`` is + called so the budget is decremented for subsequent attempts. + 3. **Provenance** — :attr:`EvalResult.run_number` records which run this + was (derived from the tracker, or 1 if no tracker is supplied). + + Parameters + ---------- + predictor : Predictor + The forecasting model to evaluate. + spec : EvalSpec + Defines the task, evaluation window, stride, warmup, and optional + run budget. + data_service : DataService + Pre-populated data service. Must have the target series registered. + tracker : EvalTracker or None + Optional tracker for budget enforcement and run-count provenance. + If ``None``, the run proceeds unconditionally and ``run_number`` is 1. + + Returns + ------- + EvalResult + A fully populated result record including all predictions, scores, + and run provenance. 
+ + Raises + ------ + EvalBudgetExceededError + If ``tracker`` is provided, ``spec.max_runs`` is set, and the budget + has been exhausted. + KeyError + If the target series is not registered in the data service. + ValueError + If no origins produce a resolvable prediction (all skipped). + + Examples + -------- + >>> result = evaluate(predictor=my_predictor, spec=spec, data_service=svc) + >>> print(f"Eval mean {result.metric.upper()}: {result.mean_score:.4f}") + """ + runs_used = tracker.runs_for(spec.spec_id) if tracker is not None else 0 + + if tracker is not None and spec.max_runs is not None and runs_used >= spec.max_runs: + raise EvalBudgetExceededError( + spec_id=spec.spec_id, + runs_used=runs_used, + max_runs=spec.max_runs, + ) + + ran_at = datetime.now(tz=timezone.utc).replace(tzinfo=None) + + predictions, scores, skipped = run_eval_loop( + predictor=predictor, + task=spec.task, + origins=spec.origins(), + warmup=spec.warmup, + data_service=data_service, + ) + + if tracker is not None: + tracker.record(spec.spec_id, ran_at) + + return EvalResult( + eval_spec=spec, + predictor_id=predictor.predictor_id, + predictions=predictions, + scores=scores, + metric=METRIC_BY_PAYLOAD_TYPE[spec.task.payload_type], + mean_score=float(np.mean(scores)), + ran_at=ran_at, + skipped_origins=skipped, + run_number=runs_used + 1, + ) + + +# --------------------------------------------------------------------------- +# MultiTargetEvalSpec and multi_evaluate() # noqa: ERA001 +# --------------------------------------------------------------------------- + + +class MultiTargetEvalSpec(BaseModel): + """Eval spec that assesses a predictor across multiple related tasks. + + ``MultiTargetEvalSpec`` is the eval-mode counterpart to + :class:`~aieng.forecasting.evaluation.backtest.MultiTargetBacktestSpec`. + It groups several :class:`ForecastingTask` objects under a single shared + evaluation window and a single run budget. 
+ + **Budget semantics:** One call to :func:`multi_evaluate` counts as *one* + run against ``max_runs``, regardless of how many tasks are included. This + means the budget governs "evaluation sessions", not individual series. + + All tasks must share the same ``frequency``; this is enforced at + construction time. + + Parameters + ---------- + spec_id : str + Stable identifier for this spec, used by :class:`EvalTracker` to key + run counts. Should be unique across all spec files. + tasks : list[ForecastingTask] + The prediction problems to evaluate. All must share the same + ``frequency``. + start : datetime + First candidate forecast origin. + end : datetime + Last candidate forecast origin (inclusive). + stride : int + Step size between origins in task-frequency units. + warmup : int + Minimum observations required before a forecast origin is used. + max_runs : int or None + Maximum number of ``multi_evaluate`` calls allowed (per tracker). + ``None`` means unlimited. + + Examples + -------- + >>> spec = MultiTargetEvalSpec( + ... spec_id="food_cpi_18m_eval", + ... tasks=[task_food, task_meat, task_dairy], + ... start=datetime(2022, 7, 1), + ... end=datetime(2024, 7, 1), + ... stride=6, + ... warmup=24, + ... max_runs=5, + ... ) + """ + + spec_id: str = Field(description="Stable identifier for tracking; keyed by EvalTracker.") + tasks: list[ForecastingTask] = Field( + min_length=1, description="Prediction problems; all must share the same frequency." + ) + start: datetime = Field(description="First candidate forecast origin.") + end: datetime = Field(description="Last candidate forecast origin (inclusive).") + stride: int = Field(default=1, ge=1, description="Step size between origins in task-frequency units.") + warmup: int = Field(default=0, ge=0, description="Minimum observations required before first forecast.") + max_runs: int | None = Field( + default=None, + ge=1, + description="Maximum allowed evaluation sessions against this spec (per tracker). 
None = unlimited.", + ) + description: str = Field( + default="", + description="Free-form prose description of the eval intent (methodology, origin rationale, etc.).", + ) + + @model_validator(mode="after") + def _validate(self) -> "MultiTargetEvalSpec": + if self.start >= self.end: + raise ValueError(f"start ({self.start}) must be before end ({self.end})") + frequencies = {t.frequency for t in self.tasks} + if len(frequencies) > 1: + raise ValueError( + f"All tasks in a MultiTargetEvalSpec must share the same frequency. Found: {sorted(frequencies)}" + ) + return self + + def specs(self) -> list[EvalSpec]: + """Decompose into one :class:`EvalSpec` per task. + + The individual specs share ``spec_id`` and window parameters. They are + intended for internal use by :func:`multi_evaluate` — the budget is + enforced once at the multi-target level, not per task. + + Returns + ------- + list[EvalSpec] + One spec per task, sharing ``spec_id``, window, and budget fields. + """ + return [ + EvalSpec( + spec_id=self.spec_id, + task=t, + start=self.start, + end=self.end, + stride=self.stride, + warmup=self.warmup, + max_runs=self.max_runs, + description=self.description, + ) + for t in self.tasks + ] + + +def multi_evaluate( + predictor: Predictor, + spec: MultiTargetEvalSpec, + data_service: DataService, + tracker: EvalTracker | None = None, +) -> dict[str, EvalResult]: + """Run an evaluation of a predictor across all tasks in a MultiTargetEvalSpec. + + The budget check and tracker increment happen *once* for the entire + multi-target evaluation — one call counts as one run regardless of how + many tasks are in the spec. All tasks then run using the same + underlying :func:`evaluate`-level loop, but without re-checking the budget + for each individual task. + + Parameters + ---------- + predictor : Predictor + The forecasting model to evaluate. + spec : MultiTargetEvalSpec + Defines the tasks, shared evaluation window, stride, warmup, and + optional run budget. 
+ data_service : DataService + Pre-populated data service. Must have all target series registered. + tracker : EvalTracker or None + Optional tracker for budget enforcement and run-count provenance. + If ``None``, runs unconditionally and ``run_number`` is 1 on all results. + + Returns + ------- + dict[str, EvalResult] + Eval results keyed by ``task_id``, one entry per task. + + Raises + ------ + EvalBudgetExceededError + If ``tracker`` is provided, ``spec.max_runs`` is set, and the budget + has been exhausted. + KeyError + If any target series is not registered in the data service. + ValueError + If no origins can be scored for any task. + + Examples + -------- + >>> results = multi_evaluate(my_predictor, spec, svc, tracker=tracker) + >>> for task_id, result in results.items(): + ... print(f"{task_id}: mean {result.metric.upper()} = {result.mean_score:.4f}") + """ + runs_used = tracker.runs_for(spec.spec_id) if tracker is not None else 0 + + if tracker is not None and spec.max_runs is not None and runs_used >= spec.max_runs: + raise EvalBudgetExceededError( + spec_id=spec.spec_id, + runs_used=runs_used, + max_runs=spec.max_runs, + ) + + ran_at = datetime.now(tz=timezone.utc).replace(tzinfo=None) + run_number = runs_used + 1 + + results: dict[str, EvalResult] = {} + for task in spec.tasks: + predictions, scores, skipped = run_eval_loop( + predictor=predictor, + task=task, + origins=_compute_origins(spec.start, spec.end, task.frequency, spec.stride), + warmup=spec.warmup, + data_service=data_service, + ) + task_eval_spec = EvalSpec( + spec_id=spec.spec_id, + task=task, + start=spec.start, + end=spec.end, + stride=spec.stride, + warmup=spec.warmup, + max_runs=spec.max_runs, + description=spec.description, + ) + results[task.task_id] = EvalResult( + eval_spec=task_eval_spec, + predictor_id=predictor.predictor_id, + predictions=predictions, + scores=scores, + metric=METRIC_BY_PAYLOAD_TYPE[task.payload_type], + mean_score=float(np.mean(scores)), + ran_at=ran_at, + 
skipped_origins=skipped, + run_number=run_number, + ) + + if tracker is not None: + tracker.record(spec.spec_id, ran_at) + + return results +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__langfuse_traces.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__langfuse_traces.py.md new file mode 100644 index 0000000..319fa3e --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__langfuse_traces.py.md @@ -0,0 +1,298 @@ +# Source: aieng-forecasting/aieng/forecasting/evaluation/langfuse_traces.py + +kind: python + +```python +"""Langfuse trace-evaluation plumbing: stamp forecasts, fetch traces, push scores. + +This module makes the **Langfuse trace** the canonical record a trace evaluator +reads from and writes back to. It owns three jobs: + +1. **Stamp** the structured forecast onto the trace at generation time + (:func:`stamp_forecast_on_trace`) so the model's rationale and distribution can + be read straight back from the trace rather than from a local cache. The + forecast is written as the ``output`` of a dedicated ``forecast`` child + observation: observation I/O is the supported surface (trace-level + ``input``/``output`` is deprecated in the v4 SDK). Stamping works either in the + active trace context (the LLMP path) or **post-hoc by trace id** (the agent + path, whose trace is created on a worker thread the caller cannot see). +2. **Fetch** a trace by id with readiness polling + (:func:`fetch_trace_with_wait`), because trace ingestion is asynchronous — a + freshly-emitted trace may not yet carry the ``forecast`` observation when the + evaluator looks. +3. **Push** an evaluation result back as a Langfuse score + (:func:`push_trace_score`), dispatching the score ``data_type`` from the + Python value type. 
+ +The fetch/score/readiness pattern is a trimmed, **Langfuse v4** adaptation of the +trace-evaluation pass in VectorInstitute/eval-agents +(``aieng/agent_evals/evaluation/trace.py``); that reference targets the Langfuse +v3 SDK, so the API calls here (``client.api.trace.get`` / ``set_current_trace_io`` +/ ``create_score``) are the v4 equivalents. + +Langfuse is an optional dependency (the ``llm`` / ``agentic`` extras); every entry +point imports it lazily and degrades to a guarded no-op when it is absent, so +importing this module never requires the package. +""" + +from __future__ import annotations + +import logging +import time +from typing import Any, Callable, Sequence + +from aieng.forecasting.evaluation.prediction import CategoricalForecast, Prediction + + +logger = logging.getLogger(__name__) + + +#: Name of the child observation whose ``output`` holds the stamped forecasts. +FORECAST_OBSERVATION_NAME = "forecast" + +#: Key under the observation ``output`` that holds the list of forecast dicts. +FORECAST_TRACE_OUTPUT_KEY = "forecasts" + + +def _get_client(client: Any | None = None) -> Any: + """Return the given client, or the process-wide Langfuse client.""" + if client is not None: + return client + from langfuse import get_client # noqa: PLC0415 + + return get_client() + + +# --------------------------------------------------------------------------- # +# Stamp: generation side +# --------------------------------------------------------------------------- # +def _forecast_to_dict(pred: Prediction) -> dict[str, Any] | None: + """Project one rationale-bearing categorical prediction to a trace-output dict. + + Returns ``None`` for predictions the rationale evaluator ignores (non + categorical, or without a stated rationale), so they are not stamped. 
+ """ + if not isinstance(pred.payload, CategoricalForecast): + return None + metadata = pred.metadata or {} + rationale = str(metadata.get("rationale", "") or "").strip() + if not rationale: + return None + forecast_date = pred.forecast_date + return { + "predictor_id": pred.predictor_id, + "task_id": pred.task_id, + "forecast_date": forecast_date.isoformat() if hasattr(forecast_date, "isoformat") else str(forecast_date), + "probabilities": dict(pred.payload.probabilities), + "rationale": rationale, + "key_signals": list(metadata.get("key_signals", []) or []), + "confidence": str(metadata.get("confidence", "") or ""), + } + + +def stamp_forecast_on_trace( + predictions: Sequence[Prediction], *, trace_id: str | None = None, client: Any | None = None +) -> bool: + """Write the structured forecast(s) onto a ``forecast`` observation in the trace. + + Creates a child observation named ``forecast`` whose ``output`` carries the + rationale, predicted distribution, and forecast date, so they can be read back + by :func:`read_forecasts_from_trace`. Only rationale-bearing categorical + predictions are stamped. + + Parameters + ---------- + predictions : sequence of Prediction + The predictions to stamp (filtered to rationale-bearing categorical ones). + trace_id : str or None + When given, the observation is attached to that trace **post-hoc** (the + agent path, whose trace is created on a worker thread). When ``None``, it + is created in the active trace context (the LLMP path, inside ``@observe``). + client : Langfuse client, optional + Defaults to the process-wide client. + + Returns ``True`` when something was stamped, ``False`` on no-op (nothing to + stamp, or Langfuse unavailable). Best-effort: never raises. 
+ """ + forecasts = [d for d in (_forecast_to_dict(p) for p in predictions) if d is not None] + if not forecasts: + return False + try: + client = _get_client(client) + kwargs: dict[str, Any] = {"name": FORECAST_OBSERVATION_NAME, "as_type": "span"} + if trace_id is not None: + from langfuse.types import TraceContext # noqa: PLC0415 + + kwargs["trace_context"] = TraceContext(trace_id=trace_id) + with client.start_as_current_observation(**kwargs) as observation: + observation.update(output={FORECAST_TRACE_OUTPUT_KEY: forecasts}) + return True + except Exception: # pragma: no cover - guarded no-op when tracing is unavailable + logger.debug("Could not stamp forecast onto Langfuse trace.", exc_info=True) + return False + + +def _forecasts_from_output(output: Any) -> list[dict[str, Any]]: + """Extract the forecast list from an observation/trace ``output`` payload.""" + if isinstance(output, dict): + forecasts = output.get(FORECAST_TRACE_OUTPUT_KEY) + if isinstance(forecasts, list): + return [f for f in forecasts if isinstance(f, dict)] + return [] + + +def read_forecasts_from_trace(trace: Any) -> list[dict[str, Any]]: + """Return the stamped forecast dicts from a fetched trace, or ``[]``. + + Reads the ``forecast`` child observation's output (the supported surface); + falls back to trace-level ``output`` for traces stamped before the switch. 
+ """ + for observation in getattr(trace, "observations", None) or []: + if getattr(observation, "name", None) == FORECAST_OBSERVATION_NAME: + forecasts = _forecasts_from_output(getattr(observation, "output", None)) + if forecasts: + return forecasts + return _forecasts_from_output(getattr(trace, "output", None)) + + +def trace_has_forecast(trace: Any) -> bool: + """Readiness predicate: the trace carries at least one stamped forecast.""" + return bool(read_forecasts_from_trace(trace)) + + +# --------------------------------------------------------------------------- # +# Fetch: evaluation side (readiness polling) +# --------------------------------------------------------------------------- # +def _is_retryable_trace_fetch_error(exc: BaseException) -> bool: + """Whether a trace-fetch error is worth retrying (ingestion still in flight).""" + name = type(exc).__name__ + if name == "NotFoundError": # trace id not yet ingested + return True + if name in {"ConnectError", "ConnectTimeout", "ReadError", "ReadTimeout", "RemoteProtocolError", "TimeoutError"}: + return True + status = getattr(exc, "status_code", None) + return isinstance(status, int) and (status in (408, 429) or 500 <= status < 600) + + +def fetch_trace_with_wait( + trace_id: str, + *, + client: Any | None = None, + max_wait_s: float = 30.0, + initial_delay_s: float = 1.0, + max_delay_s: float = 8.0, + ready: Callable[[Any], bool] = trace_has_forecast, + sleep: Callable[[float], None] = time.sleep, +) -> Any | None: + """Fetch a trace by id, polling until ``ready(trace)`` or the budget expires. + + Trace ingestion is asynchronous, so a just-emitted trace may 404 or lack its + output briefly. This retries on transient/not-found errors with exponential + backoff up to ``max_wait_s``. + + Returns the ready trace, or ``None`` if it never became ready within the + budget (the caller should *skip*, not fail — mirrors eval-agents' SKIPPED + bucket). Raises only on non-retryable errors. 
+ """ + client = _get_client(client) + delay = initial_delay_s + waited = 0.0 + while True: + try: + trace = client.api.trace.get(trace_id) + if ready(trace): + return trace + except Exception as exc: # noqa: BLE001 - re-raised below unless retryable + if not _is_retryable_trace_fetch_error(exc): + raise + if waited >= max_wait_s: + return None + step = min(delay, max_delay_s, max_wait_s - waited) + sleep(step) + waited += step + delay *= 2 + + +def list_trace_ids( + *, + client: Any | None = None, + name: str | None = None, + tags: Sequence[str] | str | None = None, + since: Any | None = None, + limit: int = 50, +) -> list[str]: + """Discover trace ids by name/tags/time window (best-effort; ``[]`` on error).""" + try: + client = _get_client(client) + response = client.api.trace.list(name=name, tags=tags, from_timestamp=since, limit=limit) + data = getattr(response, "data", None) or [] + return [trace.id for trace in data if getattr(trace, "id", None)] + except Exception: # pragma: no cover - guarded no-op when listing is unavailable + logger.debug("Could not list Langfuse traces.", exc_info=True) + return [] + + +# --------------------------------------------------------------------------- # +# Push: write evaluation results back as scores +# --------------------------------------------------------------------------- # +def push_trace_score( + trace_id: str, + name: str, + value: bool | int | float | str, + *, + client: Any | None = None, + comment: str | None = None, + metadata: dict[str, Any] | None = None, + config_id: str | None = None, +) -> bool: + """Push one Langfuse score to ``trace_id``, picking ``data_type`` from ``value``. + + ``bool`` -> ``BOOLEAN``, ``int``/``float`` -> ``NUMERIC``, ``str`` -> + ``CATEGORICAL`` (mirrors eval-agents' ``_upload_trace_scores``). Returns + whether the score was pushed; guarded no-op (``False``) without Langfuse. 
+ """ + data_type: str + score_value: bool | float | str + if isinstance(value, bool): # bool before int: bool is an int subclass + data_type, score_value = "BOOLEAN", value + elif isinstance(value, (int, float)): + data_type, score_value = "NUMERIC", float(value) + else: + data_type, score_value = "CATEGORICAL", str(value) + try: + client = _get_client(client) + client.create_score( + name=name, + value=score_value, + trace_id=trace_id, + data_type=data_type, + comment=comment, + metadata=metadata, + config_id=config_id, + ) + return True + except Exception: # pragma: no cover - guarded no-op when scoring is unavailable + logger.debug("Could not push Langfuse score %r to trace %s.", name, trace_id, exc_info=True) + return False + + +def flush_scores(client: Any | None = None) -> None: + """Flush pending score/trace exports (best-effort).""" + try: + _get_client(client).flush() + except Exception: # pragma: no cover - guarded no-op when tracing is unavailable + logger.debug("Langfuse flush failed.", exc_info=True) + + +__all__ = [ + "FORECAST_OBSERVATION_NAME", + "FORECAST_TRACE_OUTPUT_KEY", + "fetch_trace_with_wait", + "flush_scores", + "list_trace_ids", + "push_trace_score", + "read_forecasts_from_trace", + "stamp_forecast_on_trace", + "trace_has_forecast", +] +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__prediction.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__prediction.py.md new file mode 100644 index 0000000..abd8b64 --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__prediction.py.md @@ -0,0 +1,191 @@ +# Source: aieng-forecasting/aieng/forecasting/evaluation/prediction.py + +kind: python + +```python +"""Prediction payload types and the Prediction metadata wrapper.""" + +from datetime import datetime +from 
math import isfinite +from typing import Any + +from pydantic import BaseModel, Field, field_validator + + +#: Standard quantile levels stored in every ContinuousForecast. +STANDARD_QUANTILES: list[float] = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95] + + +class ContinuousForecast(BaseModel): + """Probabilistic forecast payload for one continuous future target. + + Stores a point estimate and a set of quantile forecasts at standard levels. + The quantile grid (0.05 … 0.95) is dense enough to compute a good CRPS + approximation and compact enough to be stored in YAML alongside the + full prediction record. + + Parameters + ---------- + point_forecast : float + Central estimate — typically the median (0.50 quantile) of the + predictive distribution. + quantiles : dict[float, float] + Mapping from quantile level (in [0, 1]) to forecast value. Keys must + be strictly in ``(0, 1)``; values are the corresponding forecast + values. The standard levels in :data:`STANDARD_QUANTILES` are + recommended for compatibility with the CRPS scorer, but any set of + quantile keys in range is accepted. + + Examples + -------- + >>> fc = ContinuousForecast( + ... point_forecast=160.5, + ... quantiles={0.05: 155.0, 0.50: 160.5, 0.95: 166.0}, + ... ) + """ + + point_forecast: float = Field(description="Central estimate of the predictive distribution.") + quantiles: dict[float, float] = Field( + description="Quantile forecasts. Keys are quantile levels in (0, 1); values are forecast values." + ) + + @field_validator("quantiles") + @classmethod + def quantile_keys_in_range(cls, v: dict[float, float]) -> dict[float, float]: + """Validate that all quantile keys are strictly in (0, 1).""" + bad = [q for q in v if not (0.0 < q < 1.0)] + if bad: + raise ValueError(f"Quantile keys must be in (0, 1). Invalid keys: {bad}") + return v + + +class BinaryForecast(BaseModel): + """Binary event probability payload for discrete-event forecasting tasks. 
+ + Parameters + ---------- + probability : float + Predicted probability that the event resolves ``True``, in ``[0, 1]``. + """ + + probability: float = Field(ge=0.0, le=1.0, description="Predicted probability the event occurs.") + + @field_validator("probability") + @classmethod + def probability_is_finite(cls, value: float) -> float: + """Reject NaN and infinite probabilities.""" + if not isfinite(value): + raise ValueError("Probability must be a finite number.") + return value + + +class CategoricalForecast(BaseModel): + """Ordered-categorical probability payload for one future target. + + The category order and allowed label set are declared on the + :class:`~aieng.forecasting.evaluation.task.ForecastingTask` via + ``task.categories``. The scorer aligns this probability dictionary to that + task-declared order before computing the Ranked Probability Score. + + Parameters + ---------- + probabilities : dict[str, float] + Mapping from category label to predicted probability. Values must be + finite probabilities in ``[0, 1]`` and sum to 1 within absolute + tolerance ``1e-6``. + """ + + probabilities: dict[str, float] = Field(description="Predicted probability for each category label.") + + @field_validator("probabilities") + @classmethod + def probabilities_are_valid(cls, value: dict[str, float]) -> dict[str, float]: + """Validate that probabilities form a finite distribution.""" + if len(value) < 2: + raise ValueError("Categorical probabilities must include at least two categories.") + bad_finite = [label for label, probability in value.items() if not isfinite(probability)] + if bad_finite: + raise ValueError(f"Categorical probabilities must be finite. Invalid labels: {bad_finite}") + bad_range = [label for label, probability in value.items() if not (0.0 <= probability <= 1.0)] + if bad_range: + raise ValueError(f"Categorical probabilities must be in [0, 1]. 
Invalid labels: {bad_range}") + total = sum(value.values()) + if abs(total - 1.0) > 1e-6: + raise ValueError(f"Categorical probabilities must sum to 1 within 1e-6; got {total}.") + return value + + +ForecastPayload = ContinuousForecast | BinaryForecast | CategoricalForecast + + +class Prediction(BaseModel): + """A single forecast submission — metadata wrapper around a forecast payload. + + ``Prediction`` is the unit of exchange between a :class:`Predictor` and the + evaluation harness. It carries all the metadata needed to score, persist, + and compare forecasts independently of the system that produced them. + + Designed to be YAML-serializable so it can be: + + - Persisted alongside a predictor implementation. + - Passed as structured context to downstream agents. + - Used as the unit of submission in a live evaluation or competition. + + Parameters + ---------- + predictor_id : str + Identifier for the predictor that issued this forecast. + task_id : str + Identifier for the + :class:`~aieng.forecasting.evaluation.task.ForecastingTask` this + prediction is for. + issued_at : datetime + Wall-clock time when the prediction was generated. + as_of : datetime + Information cutoff used — the ``as_of`` date of the + :class:`~aieng.forecasting.data.context.ForecastContext` passed to the + predictor. + forecast_date : datetime + The future date being predicted (``as_of`` + horizon steps). + payload : ContinuousForecast | BinaryForecast | CategoricalForecast + The forecast payload. + metadata : dict[str, Any] + Optional free-form metadata the predictor wants to return alongside the + forecast. The evaluation harness never reads or validates this field — + it passes through transparently into ``BacktestResult.predictions`` and + ``EvalResult.predictions``. Use it to surface structured side-channel + data: token counts, source lists, intermediate statistics, agent trace + IDs, etc. Anything requiring richer structure should be stored + externally (e.g. 
in Langfuse) and referenced here by ID. + + Examples + -------- + >>> from datetime import datetime + >>> pred = Prediction( + ... predictor_id="arima_auto", + ... task_id="cpi_all_items_canada_12m", + ... issued_at=datetime(2024, 1, 1), + ... as_of=datetime(2024, 1, 1), + ... forecast_date=datetime(2025, 1, 1), + ... payload=ContinuousForecast( + ... point_forecast=162.3, + ... quantiles={0.05: 157.0, 0.50: 162.3, 0.95: 167.8}, + ... ), + ... ) + """ + + predictor_id: str = Field(description="Identifier for the predictor that issued this forecast.") + task_id: str = Field(description="Identifier for the ForecastingTask this prediction answers.") + issued_at: datetime = Field(description="Wall-clock time when the prediction was generated.") + as_of: datetime = Field(description="Information cutoff used when generating this prediction.") + forecast_date: datetime = Field(description="The future date being predicted.") + payload: ForecastPayload = Field(description="The forecast payload.") + metadata: dict[str, Any] = Field( + default_factory=dict, + description=( + "Optional free-form metadata returned alongside the forecast. " + "Ignored by the evaluation harness; passes through transparently. " + "Use for token counts, source lists, trace IDs, etc." 
+ ), + ) +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__predictor.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__predictor.py.md new file mode 100644 index 0000000..e8cef8f --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__predictor.py.md @@ -0,0 +1,134 @@ +# Source: aieng-forecasting/aieng/forecasting/evaluation/predictor.py + +kind: python + +```python +"""Predictor ABC — the interface all forecasting models must implement.""" + +from abc import ABC, abstractmethod + +from aieng.forecasting.data.context import ForecastContext +from aieng.forecasting.evaluation.prediction import Prediction +from aieng.forecasting.evaluation.task import ForecastingTask + + +class Predictor(ABC): + """Abstract base class for all forecasting models. + + A ``Predictor`` encapsulates everything about *how* a forecasting problem + is solved: which series to request from the data service, how to handle + gaps, what model or agent to use, and how to produce probabilistic + forecasts. + + The interface is deliberately minimal — a ``predict`` method and a + ``predictor_id`` property. This means any two implementations — a vanilla + ARIMA and a multi-step LLM agent — can be evaluated against the same + :class:`~aieng.forecasting.evaluation.task.ForecastingTask` without the + evaluation harness needing to know anything about either. + + **Multi-horizon forecasting:** ``predict()`` returns ``list[Prediction]`` + — one ``Prediction`` per horizon step declared in ``task.horizons``. + Single-horizon tasks produce a one-element list; multi-horizon tasks + produce one element per requested step. 
+ + This design lets trajectory-based models (Darts, LLMs) produce a coherent + forecast path in one call, while also making single-step predictors natural + (just return a one-element list). The evaluation harness scores each + ``Prediction`` in the list independently and accumulates the results in a + flat ``BacktestResult``. + + **Backtesting vs live evaluation:** the predictor never knows which mode + it is in. The harness creates a + :class:`~aieng.forecasting.data.context.ForecastContext` scoped to the + appropriate ``as_of`` date and passes it in. The predictor's code is + identical in both modes. + + **Information discipline:** the ``ForecastContext`` enforces the + information cutoff for deterministic data (historical series). For + agentic predictors that use live tools (web search, news APIs), the cutoff + cannot be enforced structurally — this is a known limitation and is part + of the challenge for evaluating such predictors. + + **Side-effects and metadata:** predictors are free to write logs, traces, + or other artifacts as side-effects of ``predict()``. For structured data + that should travel *with* each prediction (token counts, source lists, + agent trace IDs), populate the ``Prediction.metadata`` dict. The harness + passes it through transparently. 
+ + Examples + -------- + Implementing a trivial constant predictor:: + + class ConstantPredictor(Predictor): + def __init__(self, value: float) -> None: + self._value = value + + @property + def predictor_id(self) -> str: + return "constant" + + def predict( + self, + task: ForecastingTask, + context: ForecastContext, + ) -> list[Prediction]: + from datetime import datetime + import pandas as pd + from aieng.forecasting.evaluation.prediction import ( + ContinuousForecast, + STANDARD_QUANTILES, + ) + + offset = pd.tseries.frequencies.to_offset(task.frequency) + start = pd.Timestamp(context.as_of) + payload = ContinuousForecast( + point_forecast=self._value, + quantiles={q: self._value for q in STANDARD_QUANTILES}, + ) + return [ + Prediction( + predictor_id=self.predictor_id, + task_id=task.task_id, + issued_at=datetime.utcnow(), + as_of=context.as_of, + forecast_date=(start + offset * h).to_pydatetime(), + payload=payload, + ) + for h in task.horizons + ] + """ + + @property + @abstractmethod + def predictor_id(self) -> str: + """Unique, human-readable identifier for this predictor. + + Used in :class:`~aieng.forecasting.evaluation.backtest.BacktestResult` + and in persisted :class:`~aieng.forecasting.evaluation.prediction.Prediction` + records to identify which predictor produced a forecast. + """ + + @abstractmethod + def predict(self, task: ForecastingTask, context: ForecastContext) -> list[Prediction]: + """Produce probabilistic forecasts for the given task and context. + + Parameters + ---------- + task : ForecastingTask + Defines the prediction problem — target series, horizon(s), + frequency, and resolution logic. The predictor must not modify + the task. + context : ForecastContext + The information state available at forecast time. All calls to + ``context.get_series()`` are automatically filtered to + ``context.as_of`` — the predictor cannot accidentally access + future data from the series store. 
+ + Returns + ------- + list[Prediction] + One ``Prediction`` per horizon step in ``task.horizons``, each + with ``as_of = context.as_of`` and ``forecast_date`` set to the + corresponding step ahead of the origin. The list must not be empty. + """ +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__task.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__task.py.md new file mode 100644 index 0000000..450e9fd --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__evaluation__task.py.md @@ -0,0 +1,214 @@ +# Source: aieng-forecasting/aieng/forecasting/evaluation/task.py + +kind: python + +```python +"""ForecastingTask: defines a prediction problem against the data service.""" + +from __future__ import annotations + +from typing import Literal + +from pydantic import BaseModel, Field, model_validator + + +class TaskCategory(BaseModel): + """Ordered category declaration for a categorical forecasting task. + + Parameters + ---------- + label : str + Non-empty category label used by categorical forecast payloads. + value : float + Numeric value stored in the observed target series for this category. + """ + + label: str = Field(min_length=1, description="Category label used by categorical forecast payloads.") + value: float = Field(description="Numeric value stored in the observed target series for this category.") + + +class ForecastingTask(BaseModel): + """Defines a prediction problem, independent of how it is solved. + + A ``ForecastingTask`` specifies *what* to forecast: the target series, + the horizon(s), the temporal resolution, and how to determine ground truth. + It says nothing about *how* a predictor should solve the problem — + covariate selection, gap-filling, and model choice are all predictor + concerns. 
+ + This separation means any two predictors (a vanilla ARIMA and a + multi-step LLM agent) can be evaluated against the same task without + the task needing to know anything about either of them. + + Parameters + ---------- + task_id : str + Unique identifier for this forecasting task. + target_series_id : str + The ``series_id`` (key in ``SeriesStore``) of the series to forecast. + horizons : list[int] + One or more horizon steps to forecast. Horizon ``h`` means ``h`` + frequency-units ahead of the forecast origin. For example, + ``horizons=[18]`` on monthly data means 18 months ahead; + ``horizons=[6, 7, 8, ..., 17]`` produces a full trajectory. + + **Backward compatibility:** you may pass ``horizon=N`` (singular, int) + and it will be silently coerced to ``horizons=[N]``. This keeps + existing YAML specs, notebook code, and tests working without changes. + frequency : str + Pandas offset alias for the forecast frequency (e.g. ``"MS"`` for + month-start, ``"h"`` for hourly, ``"D"`` for daily). Combined with + ``horizons``, this determines the forecast window. + description : str + Human-readable description of the prediction problem. + payload_type : {"continuous", "binary", "categorical"} + The forecast payload modality this task expects. ``"continuous"`` + (the default) means predictors must return + :class:`~aieng.forecasting.evaluation.prediction.ContinuousForecast` + payloads, scored with CRPS. ``"binary"`` means the target series is a + 0/1 event series and predictors must return + :class:`~aieng.forecasting.evaluation.prediction.BinaryForecast` + payloads, scored with the Brier score. ``"categorical"`` means the + target series stores ordered category values declared in + ``categories`` and predictors must return + :class:`~aieng.forecasting.evaluation.prediction.CategoricalForecast` + payloads, scored with RPS. The evaluation harness validates payloads + against this declaration and fails loudly on a mismatch rather than + producing meaningless scores. 
+ categories : list[TaskCategory] or None + Ordered category declarations for ``payload_type="categorical"``. + The list order is the ordinal order used by RPS, e.g. + ``cut < hold < hike``. Must be omitted for non-categorical tasks. + resolution_fn : str + How ground truth is determined. Defaults to + ``"observed_value_at_resolution_timestamp"``, meaning the resolution + is the actual observed value of ``target_series_id`` at the target + timestamp. + + .. note:: + **This field is currently a placeholder.** The evaluation harness + always uses ``"observed_value_at_resolution_timestamp"`` regardless + of this value. Dispatch on alternative strategies is deferred. + + Notes + ----- + The evaluation loop is identical for backtesting and live forecasting: + + .. code-block:: text + + ForecastingTask → defines the question + Predictor → decides how to answer it + list[Prediction] → the answers (one per horizon) + Resolution → ground truth + Score → how well each answer matched + + In backtest mode, the harness iterates over historical forecast origins. + In live mode, it waits for the resolution date. The task definition does + not change between modes. + + Examples + -------- + Single horizon (equivalent to old ``horizon=18``): + + >>> task = ForecastingTask( + ... task_id="cpi_food_18m", + ... target_series_id="cpi_food_canada", + ... horizons=[18], + ... frequency="MS", + ... description="Forecast Canada food CPI 18 months ahead.", + ... ) + + Multi-horizon trajectory (horizons 6–17 → January through December of Y+1 + from a July origin): + + >>> task = ForecastingTask( + ... task_id="cpi_food_cfpr_trajectory", + ... target_series_id="cpi_food_canada", + ... horizons=list(range(6, 18)), + ... frequency="MS", + ... description="Full 12-step trajectory for CFPR average-year analysis.", + ... ) + + Backward-compatible old syntax still works: + + >>> task = ForecastingTask( + ... task_id="cpi_all_items_1m_ahead", + ... target_series_id="cpi_all_items_canada", + ... 
horizon=1, + ... frequency="MS", + ... description="Forecast Canada All-items CPI one month ahead.", + ... ) + """ + + task_id: str = Field(description="Unique identifier for this forecasting task.") + target_series_id: str = Field(description="The series_id of the series to forecast.") + horizons: list[int] = Field( + min_length=1, + description=( + "One or more horizon steps to forecast. Horizon h means h frequency-units ahead of the forecast origin." + ), + ) + frequency: str = Field(description="Pandas offset alias for the forecast frequency, e.g. 'MS', 'h', 'D'.") + description: str = Field(description="Human-readable description of the prediction problem.") + payload_type: Literal["continuous", "binary", "categorical"] = Field( + default="continuous", + description=( + "Forecast payload modality: 'continuous' (ContinuousForecast, CRPS-scored), " + "'binary' (BinaryForecast against a 0/1 event series, Brier-scored), or " + "'categorical' (CategoricalForecast against ordered categories, RPS-scored)." + ), + ) + categories: list[TaskCategory] | None = Field( + default=None, + description="Ordered categories for categorical tasks; omitted for continuous and binary tasks.", + ) + resolution_fn: str = Field( + default="observed_value_at_resolution_timestamp", + description=( + "How ground truth is determined. Placeholder — harness currently always uses " + "'observed_value_at_resolution_timestamp' regardless of this value. " + "Dispatch on alternative strategies is deferred." 
+ ), + ) + + @model_validator(mode="before") + @classmethod + def _coerce_single_horizon(cls, data: object) -> object: + """Accept legacy ``horizon=N`` and coerce to ``horizons=[N]``.""" + if isinstance(data, dict) and "horizon" in data and "horizons" not in data: + data = dict(data) + data["horizons"] = [int(data.pop("horizon"))] + return data + + @model_validator(mode="after") + def _validate_categories(self) -> "ForecastingTask": + """Validate the categorical task contract.""" + if self.payload_type == "categorical": + if self.categories is None: + raise ValueError("Categorical tasks must define categories with at least two entries.") + if len(self.categories) < 2: + raise ValueError("Categorical tasks must define at least two categories.") + labels = [category.label for category in self.categories] + duplicate_labels = sorted({label for label in labels if labels.count(label) > 1}) + if duplicate_labels: + raise ValueError(f"Categorical task category labels must be unique. Duplicates: {duplicate_labels}") + values = [category.value for category in self.categories] + duplicate_values = sorted({value for value in values if values.count(value) > 1}) + if duplicate_values: + raise ValueError(f"Categorical task category values must be unique. Duplicates: {duplicate_values}") + return self + if self.categories is not None: + raise ValueError("categories must be omitted unless payload_type='categorical'.") + return self + + @property + def horizon(self) -> int: + """The maximum (outermost) horizon step. + + For single-horizon tasks this is the only element of ``horizons``. + For multi-horizon tasks this is ``max(horizons)``, which is what + Darts models and other trajectory-based predictors need as their + ``n`` parameter. 
+ """ + return max(self.horizons) +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__langfuse_tracing.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__langfuse_tracing.py.md new file mode 100644 index 0000000..105042e --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__langfuse_tracing.py.md @@ -0,0 +1,192 @@ +# Source: aieng-forecasting/aieng/forecasting/langfuse_tracing.py + +kind: python + +```python +"""Langfuse-oriented tracing bootstrap for LiteLLM and Google ADK. + +Call :func:`init_langfuse_tracing` once at process startup when using the +``llm`` or ``agentic`` extras and Langfuse credentials are set in the +environment. + +Call :func:`print_langfuse_trace_url` after a ``predict()`` call to flush +pending spans and print a clickable Langfuse UI link. +""" + +from __future__ import annotations + +import logging +import os + + +logger = logging.getLogger(__name__) + + +def _langfuse_credentials_present() -> bool: + pub = os.environ.get("LANGFUSE_PUBLIC_KEY", "").strip() + sec = os.environ.get("LANGFUSE_SECRET_KEY", "").strip() + return bool(pub and sec) + + +class _LangfuseTracingBootstrap: + """Registers LiteLLM + ADK exporters at most once per process.""" + + __slots__ = ("_google_adk_instrumented", "_langfuse_client_initialized", "_litellm_instrumented") + + def __init__(self) -> None: + self._litellm_instrumented = False + self._google_adk_instrumented = False + self._langfuse_client_initialized = False + + def init(self) -> None: + """Initialize Langfuse tracing when credentials and dependencies exist.""" + if not _langfuse_credentials_present(): + logger.debug( + "Skipping Langfuse tracing: set LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY.", + ) + return + + # OpenInference's ADK instrumentor uses the *global* OTel tracer provider. 
+ # Langfuse attaches its span processor when the SDK client is created; without + # this, ADK spans are emitted into a no-op provider and never reach Langfuse. + self._ensure_langfuse_client() + + self._register_litellm_langfuse_otel() + self._instrument_google_adk() + + def _ensure_langfuse_client(self) -> None: + if self._langfuse_client_initialized: + return + try: + from langfuse import get_client # noqa: PLC0415 + except ImportError: + logger.debug("langfuse not installed; skipping Langfuse client initialization.") + return + try: + get_client() + except Exception: + logger.exception("Langfuse get_client() failed; ADK spans may not export.") + return + self._langfuse_client_initialized = True + + def _register_litellm_langfuse_otel(self) -> None: + """Register LiteLLM Langfuse callback.""" + if self._litellm_instrumented: + return + try: + import litellm # noqa: PLC0415 + except ImportError: + logger.debug("litellm not installed; skipping LiteLLM Langfuse callback.") + return + + existing = list(getattr(litellm, "callbacks", None) or []) + if "langfuse_otel" not in existing: + litellm.callbacks = [*existing, "langfuse_otel"] + self._litellm_instrumented = True + + def _instrument_google_adk(self) -> None: + """Instrument Google ADK.""" + if self._google_adk_instrumented: + return + try: + from openinference.instrumentation.google_adk import ( # noqa: PLC0415 + GoogleADKInstrumentor, + ) + except ImportError: + logger.debug( + "openinference-instrumentation-google-adk not installed; skipping ADK instrumentation.", + ) + return + + try: + GoogleADKInstrumentor().instrument() + except Exception: + logger.exception("GoogleADKInstrumentor().instrument() failed.") + return + + self._google_adk_instrumented = True + + +_bootstrap = _LangfuseTracingBootstrap() + + +def init_langfuse_tracing() -> None: + """Wire LiteLLM and Google ADK to Langfuse. + + No-ops when ``LANGFUSE_PUBLIC_KEY`` or ``LANGFUSE_SECRET_KEY`` is absent + from the environment. 
Safe to call multiple times. + + Notes + ----- + When both environment keys are present, performs up to three one-time + registrations: + + 1. Calls ``langfuse.get_client()`` so the global OpenTelemetry + ``TracerProvider`` receives Langfuse's span processor. This is required + for ADK spans emitted via ``openinference-instrumentation-google-adk`` + to reach Langfuse. + 2. Appends ``"langfuse_otel"`` to ``litellm.callbacks`` once (if + ``litellm`` is importable). + 3. Runs ``GoogleADKInstrumentor().instrument()`` once (if + ``openinference-instrumentation-google-adk`` is importable). + + Set ``LANGFUSE_HOST`` or ``LANGFUSE_BASE_URL`` for non-default regions. + For short-lived processes, call ``langfuse.get_client().flush()`` before + exit so pending spans are exported. + """ + _bootstrap.init() + + +def print_langfuse_trace_url( + trace_id: str | None = None, + *, + trace_name: str | None = None, +) -> str | None: + """Flush pending spans and print a Langfuse trace URL (no API trace fetch). + + Uses the in-process trace id when available (``get_current_trace_id``). + Does **not** call ``api.trace.list`` — use this from notebooks when list/get + time out. If no trace id is available, prints the project traces page and the + ``trace_name`` to filter manually in the UI. + + Parameters + ---------- + trace_id : str, optional + Explicit trace id. When omitted, uses ``get_current_trace_id()`` if set. + trace_name : str, optional + ``trace_name`` from ``propagate_attributes`` (for manual UI lookup). + + Returns + ------- + str or None + Trace URL when resolved, else ``None``. 
+ """ + if not _langfuse_credentials_present(): + print("Langfuse: set LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY in .env to export traces.") + return None + + try: + from langfuse import get_client # noqa: PLC0415 + except ImportError: + print("Langfuse package not installed.") + return None + + init_langfuse_tracing() + client = get_client() + client.flush() + + resolved_id = trace_id or client.get_current_trace_id() + url = client.get_trace_url(trace_id=resolved_id) + if url: + print(f"Langfuse trace: {url}") + return url + + project_id = client._get_project_id() # noqa: SLF001 + base = getattr(client, "_base_url", None) or os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com") + traces_page = f"{base}/project/{project_id}/traces" if project_id else base + print("Langfuse: trace id not available in this process after flush.") + print(f" Open traces: {traces_page}") + if trace_name: + print(f" Filter by trace name: {trace_name!r}") + return None +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__README.md.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__README.md.md new file mode 100644 index 0000000..e7beeec --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__README.md.md @@ -0,0 +1,119 @@ +# Source: aieng-forecasting/aieng/forecasting/methods/README.md + +kind: markdown + +# Methods + +This directory contains **reference predictor implementations** — concrete +`Predictor` subclasses that are reusable across more than one forecasting +experiment. + +The package is organized by method family: + +```text +methods/ +├── baselines/ # simple floor baselines and teaching references +├── numerical/ # classical / ML numerical forecasters +├── llm_processes/ # LLM-process predictors (sampled trajectories, quantile grids, etc.) 
+└── agentic/ # reusable ADK runners, agent factory, predictors, and output schemas +``` + +--- + +## What belongs here + +- Concrete `Predictor` subclasses that are **not** tied to a specific use case +- Implementations that a participant would use as-is or as a copy-paste + starting point across more than one experiment +- Well-documented, linted Python modules (not notebooks) + +## What does NOT belong here + +- Task-specific configuration (prompts tuned for CFPR, specs, task YAMLs) — + those live in `implementations//` +- Notebooks or experiment scripts — those live in `implementations//` +- Infrastructure or ABCs — those live elsewhere in `aieng.forecasting` + (`data/`, `evaluation/`, future `agents/`) + +--- + +## Import patterns + +Common imports: + +```python +from aieng.forecasting.methods import ( + DartsAutoARIMAPredictor, + DartsLightGBMPredictor, + DartsLinearRegressionPredictor, + LastValuePredictor, +) +``` + +Sub-package imports are also fine when you want to signal the method family: + +```python +from aieng.forecasting.methods.baselines import LastValuePredictor +from aieng.forecasting.methods.numerical import DartsAutoARIMAPredictor +``` + +Agentic runner, factory, and output schemas: + +```python +from aieng.forecasting.methods.agentic import ( + AdkTextRunner, + AdkTextRunnerConfig, + AgentConfig, + AgentPredictor, + ContinuousAgentForecastOutput, + build_adk_agent, +) +``` + +--- + +## Current contents + +### Baselines + +| Module | Class | Description | +|---|---|---| +| `baselines/naive.py` | `LastValuePredictor` | Last-value naive baseline. Predicts the most recently observed value at all quantiles. The floor every predictor must beat. Also the annotated reference implementation — read this to understand the `Predictor` interface. | +| `baselines/historical_frequency.py` | `HistoricalFrequencyPredictor` | Binary floor baseline: the constant historical base rate of the event, optionally over a trailing window. 
| +| `baselines/categorical_frequency.py` | `CategoricalFrequencyPredictor` | Categorical floor baseline: the constant climatological distribution over the task-declared ordered categories. | + +### Numerical + +| Module | Class | Description | +|---|---|---| +| `numerical/darts_arima.py` | `DartsAutoARIMAPredictor` | Univariate Darts AutoARIMA with probabilistic multi-horizon output via Monte Carlo sampling. | +| `numerical/darts_classical.py` | `DartsExponentialSmoothingPredictor` | Univariate state-space exponential smoothing (ETS); fast probabilistic baseline (non-seasonal by default, optional `seasonal_periods`). | +| `numerical/darts_classical.py` | `DartsKalmanForecasterPredictor` | Univariate linear Gaussian state-space (Kalman) forecaster; fast probabilistic baseline with configurable latent dimension `dim_x`. | +| `numerical/darts_regression.py` | `DartsLinearRegressionPredictor` | Darts linear regression predictor with optional past covariates and probabilistic output. | +| `numerical/darts_regression.py` | `DartsLightGBMPredictor` | Darts LightGBM quantile-regression predictor with optional past covariates. | + +### LLM Processes + +| Module | Class | Description | +|---|---|---| +| `llm_processes/sampled_trajectory.py` | `SampledTrajectoryLLMPredictor` | Samples full trajectories from an LLM, then computes empirical quantiles per horizon. Supports optional covariates: set `covariate_series_ids` to serialize labeled exogenous-series history into the prompt (Context-is-Key §5.4). | +| `llm_processes/quantile_grid.py` | `QuantileGridLLMPredictor` | Asks an LLM for the standard quantile grid in one structured completion. | +| `llm_processes/binary_probability.py` | `BinaryProbabilityLLMPredictor` | Direct elicitation of one calibrated event probability for binary tasks (Brier-scored), in one structured completion. 
| +| `llm_processes/categorical_probability.py` | `CategoricalProbabilityLLMPredictor` | Direct elicitation of a calibrated distribution over the task-declared ordered categories (RPS-scored); history serialized as category labels. | +| `llm_processes/point_intervals.py` | — | Placeholder for a compact point-plus-interval contract; may become configurable sparse quantile-grid elicitation. | + +### Agentic + +| Module | Class / Function | Description | +|---|---|---| +| `agentic/adk_runner.py` | `AdkTextRunner` | Async text-in / text-out wrapper around ADK `InMemoryRunner`. Manages ADK sessions (fresh-per-message or sticky) and optionally traces each turn to Langfuse via `propagate_attributes`. | +| `agentic/adk_runner.py` | `AdkTextRunnerConfig` | Pydantic configuration for `AdkTextRunner` (session mode, Langfuse fields). | +| `agentic/agent_factory.py` | `build_adk_agent` | Generic ADK `LlmAgent` factory with optional code execution, context retrieval, skills, generation controls, and structured output schema. | +| `agentic/agent_factory.py` | `AgentConfig` | Pydantic configuration for reusable ADK agents. `output_schema=None` supports interactive/free-form agents; a structured `AgentForecastOutput` schema supports Track 1 predictors. The `function_tools` field attaches conventional ADK tools (e.g. `ForecastTool`). Use-case-specific prompts and presets should live in `implementations//`. | +| `agentic/forecast_tool.py` | `ForecastTool` | Conventional ADK `FunctionTool` that runs a pre-specified `Predictor` (AutoARIMA by default) on any registered series at a given cutoff/horizon, returning a structured JSON forecast. A controlled, reproducible alternative to open-ended code execution; series data never enters the LLM context. | +| `agentic/outputs.py` | `AgentForecastOutput` | Abstract output adapter interface for converting structured agent JSON into evaluation `Prediction` objects. 
| +| `agentic/outputs.py` | `ContinuousAgentForecastOutput` | Canonical continuous forecasting output schema. Declares `modality = "continuous"`, requires one forecast per task horizon and the standard quantile grid, then converts to `ContinuousForecast` payloads. | +| `agentic/outputs.py` | `DiscreteAgentForecastOutput` | Binary event output schema (`modality = "discrete"`): one probability plus `reasoning` / `key_signals` metadata, converted to a `BinaryForecast` payload. | +| `agentic/outputs.py` | `CategoricalAgentForecastOutput` | Ordered-categorical output schema (`modality = "categorical"`): one `{label, probability}` row per task category, validated against `task.categories` and converted to a `CategoricalForecast` payload. | +| `agentic/predictor.py` | `AgentPredictor` | Track 1 `Predictor` that builds prompts, runs an ADK agent through `AdkTextRunner`, validates structured JSON, and converts it to `Prediction` objects. Accepts an optional injected runner for tests or custom observability. | +| `agentic/predictor.py` | `ForecastPromptBuilder` | Protocol for task-specific prompt builders that turn `(task, context)` into the text passed to the agent. | diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods____init__.py.md new file mode 100644 index 0000000..151e290 --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods____init__.py.md @@ -0,0 +1,90 @@ +# Source: aieng-forecasting/aieng/forecasting/methods/__init__.py + +kind: python + +```python +"""Reference predictor implementations for ``aieng.forecasting``. 
+ +This package groups concrete :class:`~aieng.forecasting.evaluation.predictor.Predictor` +implementations by method family: + +- :mod:`baselines` — simple floor baselines and teaching references +- :mod:`numerical` — classical / ML numerical forecasters +- :mod:`llm_processes` — LLM-process predictors +- :mod:`agentic` — tool-using / hybrid agentic predictors + +""" + +# --------------------------------------------------------------------------- +# Patch: suppress spurious OTel cross-context ValueError in Jupyter / ADK +# --------------------------------------------------------------------------- +# When ADK or openinference-instrumented code runs async generators inside +# Jupyter's nested event loop, pending tasks are garbage-collected mid-span. +# GeneratorExit is thrown into OTel's start_as_current_span context manager, +# which then tries to detach a contextvars Token that was created in a +# different asyncio.Context, raising: +# ValueError: was created in a different Context +# +# Patching opentelemetry.context.detach (the module attribute) is insufficient +# because openinference captures a direct `from opentelemetry.context import +# detach` reference at instrumentation time, bypassing any later module-level +# reassignment. Patching at the ContextVarsRuntimeContext *class* level is the +# correct fix: it intercepts the call site that actually raises the error, +# regardless of when or how callers imported the detach function. +# +# This patch is applied here, before any LLM or ADK imports, to ensure it is +# in place before openinference instruments litellm. 
+try: + import contextlib + + from opentelemetry.context.contextvars_context import ContextVarsRuntimeContext as _CtxVarsRC + + _orig_ctx_detach = _CtxVarsRC.detach + + def _safe_ctx_detach(self, token): # type: ignore[no-untyped-def] + with contextlib.suppress(ValueError): + _orig_ctx_detach(self, token) + + _CtxVarsRC.detach = _safe_ctx_detach # type: ignore[method-assign] +except ImportError: + pass # opentelemetry not installed; nothing to patch + +from .baselines import CategoricalFrequencyPredictor, HistoricalFrequencyPredictor, LastValuePredictor +from .llm_processes import ( + BinaryProbabilityLLMPredictor, + BinaryProbabilityLLMPredictorConfig, + CategoricalProbabilityLLMPredictor, + CategoricalProbabilityLLMPredictorConfig, + QuantileGridLLMPredictor, + QuantileGridLLMPredictorConfig, + SampledTrajectoryLLMPredictor, + SampledTrajectoryLLMPredictorConfig, +) +from .numerical import ( + DartsAutoARIMAPredictor, + DartsExponentialSmoothingPredictor, + DartsKalmanForecasterPredictor, + DartsLightGBMPredictor, + DartsLinearRegressionPredictor, +) + + +__all__ = [ + "BinaryProbabilityLLMPredictor", + "BinaryProbabilityLLMPredictorConfig", + "CategoricalFrequencyPredictor", + "CategoricalProbabilityLLMPredictor", + "CategoricalProbabilityLLMPredictorConfig", + "DartsAutoARIMAPredictor", + "DartsExponentialSmoothingPredictor", + "DartsKalmanForecasterPredictor", + "DartsLightGBMPredictor", + "DartsLinearRegressionPredictor", + "HistoricalFrequencyPredictor", + "LastValuePredictor", + "QuantileGridLLMPredictor", + "QuantileGridLLMPredictorConfig", + "SampledTrajectoryLLMPredictor", + "SampledTrajectoryLLMPredictorConfig", +] +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic____init__.py.md new file mode 100644 index 0000000..167d828 --- 
/dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic____init__.py.md @@ -0,0 +1,116 @@ +# Source: aieng-forecasting/aieng/forecasting/methods/agentic/__init__.py + +kind: python + +```python +"""ADK-based agentic predictors. + +Concrete forecasting components that use tool execution, code interpreters, +or hybrid numerical reasoning to produce forecasts. + +This subpackage requires the ``agentic`` extra. Install it with:: + + pip install aieng-forecasting[agentic] + +Importing any name from this package (or its submodules) without the extra +raises :class:`ImportError` with installation guidance. + +Public API +---------- +AdaptiveSkillState, AdaptiveSkillStore + Abstract base and generic persistence layer for learnable agent skills. + Subclass ``AdaptiveSkillState`` with domain-specific fields and implement + ``build_markdown()`` to render the strategy to a ``SKILL.md`` the agent + reads. ``AdaptiveSkillStore`` handles YAML serialisation, ``SKILL.md`` + rendering, and timestamped backup on every mutation. +format_backtest_report, load_context_documents, build_curriculum_prompt + Curriculum assembly utilities for adaptive agent training. Format a + :class:`~aieng.forecasting.evaluation.backtest.BacktestResult` as a + structured markdown document, load pre-cached context files by date, and + assemble both into a single curriculum message for the agent. +AgentConfig, CodeExecutionConfig, ContextRetrievalConfig + Pydantic configuration for building an ADK ``LlmAgent`` with optional + code execution and a proxy-grounded ``search_web`` context-retrieval tool. +build_adk_agent + Factory that turns an :class:`AgentConfig` into a configured + :class:`google.adk.agents.LlmAgent`. +AdkTextRunner, AdkTextRunnerConfig + Text-in / text-out wrapper around ADK's ``InMemoryRunner`` with session + management and optional Langfuse tracing. +AgentForecastOutput, ContinuousAgentForecastOutput, ... 
+ Schemas for structured agent output and conversion to evaluation + :class:`~aieng.forecasting.evaluation.prediction.Prediction` objects. + See :mod:`aieng.forecasting.methods.agentic.outputs`. +AgentPredictor, ForecastPromptBuilder + :class:`~aieng.forecasting.evaluation.predictor.Predictor` + that drives an ADK agent and converts its structured output into + predictions, plus the prompt-builder protocol it depends on. + +Examples +-------- +Building a predictor from a config:: + + from aieng.forecasting.methods.agentic import ( + AgentConfig, + AgentPredictor, + ContinuousAgentForecastOutput, + ) + + config = AgentConfig(instruction="Forecast the target series.") + predictor = AgentPredictor( + config, + my_prompt_builder, + output_schema=ContinuousAgentForecastOutput, + ) +""" + +from aieng.forecasting.methods.agentic.adaptive_skill import AdaptiveSkillState, AdaptiveSkillStore +from aieng.forecasting.methods.agentic.adk_runner import AdkTextRunner, AdkTextRunnerConfig +from aieng.forecasting.methods.agentic.agent_factory import ( + AgentConfig, + CodeExecutionConfig, + ContextRetrievalConfig, + build_adk_agent, +) +from aieng.forecasting.methods.agentic.curriculum import ( + build_curriculum_prompt, + format_backtest_report, + load_context_documents, +) +from aieng.forecasting.methods.agentic.forecast_tool import ForecastTool +from aieng.forecasting.methods.agentic.outputs import ( + AgentCategoryProbability, + AgentForecastOutput, + AgentQuantileForecast, + CategoricalAgentForecastOutput, + ContinuousAgentForecastOutput, + ContinuousAgentHorizonForecast, + DiscreteAgentForecastOutput, +) +from aieng.forecasting.methods.agentic.predictor import AgentPredictor, ForecastPromptBuilder + + +__all__: list[str] = [ + "AdaptiveSkillState", + "AdaptiveSkillStore", + "AdkTextRunner", + "AdkTextRunnerConfig", + "AgentCategoryProbability", + "AgentConfig", + "AgentForecastOutput", + "AgentPredictor", + "AgentQuantileForecast", + "CategoricalAgentForecastOutput", + 
"CodeExecutionConfig", + "ContinuousAgentForecastOutput", + "ContinuousAgentHorizonForecast", + "ContextRetrievalConfig", + "DiscreteAgentForecastOutput", + "ForecastPromptBuilder", + "ForecastTool", + "build_adk_agent", + "build_curriculum_prompt", + "format_backtest_report", + "load_context_documents", +] +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__adaptive_skill.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__adaptive_skill.py.md new file mode 100644 index 0000000..e9a4af9 --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__adaptive_skill.py.md @@ -0,0 +1,241 @@ +# Source: aieng-forecasting/aieng/forecasting/methods/agentic/adaptive_skill.py + +kind: python + +```python +"""Generic adaptive skill infrastructure for learnable agent strategies. + +An *adaptive skill* is a skill whose content can be mutated by the agent +through typed tool calls — as opposed to read-only skills that are authored +once and never changed by the agent. + +The pattern has three layers: + +``AdaptiveSkillState`` (abstract) + A Pydantic model that is the source of truth for a skill's content. + Concrete subclasses define the domain-specific fields and implement + ``build_markdown()`` to render those fields into the ``SKILL.md`` text + that the ADK ``SkillToolset`` injects into the agent's context. + +``AdaptiveSkillStore`` + Persistence layer for a concrete ``AdaptiveSkillState``. Manages three + artefacts in a skill directory: + + - ``skill_state.yaml`` — serialized state; the source of truth. + - ``SKILL.md`` — rendered from state on every save; what the agent reads. + - ``.history/`` — timestamped backups written before each save so that + every mutation is reversible without git involvement. 
+ + The ``confirmation_threshold`` (how many confirmed hypothesis outcomes are + required before ``graduate_hypothesis`` is permitted) is a *store* + parameter, not a *state* field. This keeps the evidence bar outside the + agent's reach: the agent cannot lower the bar by mutating state. + +Usage pattern +------------- +Define a concrete state and instantiate a store once at module level in the +implementation's ``skill_tools.py``:: + + from pathlib import Path + + from aieng.forecasting.methods.agentic.adaptive_skill import ( + AdaptiveSkillState, + AdaptiveSkillStore, + ) + + + class MyStrategyState(AdaptiveSkillState): + approach_narrative: str + ... + + def build_markdown(self, skill_name: str | None = None) -> str: ... + + + STORE: AdaptiveSkillStore[MyStrategyState] = AdaptiveSkillStore( + skill_dir=Path(__file__).parent / "skills" / "my-strategy", + state_type=MyStrategyState, + confirmation_threshold=3, + ) + +Then write one thin tool function per mutation operation and register them via +``AgentConfig(extra_tools=[...])``. +""" + +from __future__ import annotations + +import shutil +from abc import ABC, abstractmethod +from datetime import datetime, timezone +from pathlib import Path +from typing import Generic, TypeVar + +import yaml +from pydantic import BaseModel + + +# --------------------------------------------------------------------------- +# Abstract base +# --------------------------------------------------------------------------- + + +class AdaptiveSkillState(BaseModel, ABC): + """Abstract base for skill states that an agent can modify through typed tool calls. + + Subclasses define domain-specific fields (observations, hypotheses, + calibration corrections, etc.) and implement ``build_markdown()`` to + render those fields into a complete ``SKILL.md`` document, including + YAML frontmatter. + + The rendered ``SKILL.md`` is what the ADK ``SkillToolset`` loads and + injects into the agent's context. The state itself is serialized to + ``skill_state.yaml`` as the authoritative source of truth. 
+ """ + + schema_version: str = "1.0" + + @abstractmethod + def build_markdown(self, skill_name: str | None = None) -> str: + """Render the current state to full ``SKILL.md`` content. + + The output must include valid YAML frontmatter (``---`` fences) at the + top so that ``load_skill_from_dir`` can parse the skill metadata. + + Parameters + ---------- + skill_name : str or None + The value to embed in the frontmatter ``name:`` field. Must match + the containing directory name exactly (ADK enforces this). When + ``None``, subclasses should fall back to their default skill name. + + Returns + ------- + str + Complete ``SKILL.md`` text, ready to write to disk. + """ + ... + + +# --------------------------------------------------------------------------- +# Generic store +# --------------------------------------------------------------------------- + +S = TypeVar("S", bound=AdaptiveSkillState) + +_YAML_STATE_FILENAME = "skill_state.yaml" +_SKILL_MD_FILENAME = "SKILL.md" +_HISTORY_DIR = ".history" +_GENERATED_HEADER = "\n" + + +class AdaptiveSkillStore(Generic[S]): + """Persistence layer for an :class:`AdaptiveSkillState`. + + Manages the three artefacts in a skill directory: + + - ``skill_state.yaml`` — YAML-serialised state (source of truth). + - ``SKILL.md`` — rendered from state on every save. + - ``.history/`` — timestamped backups for reversibility. + + Parameters + ---------- + skill_dir : Path + Directory containing the skill (the one passed to + ``load_skill_from_dir``). Must exist. + state_type : type[S] + Concrete ``AdaptiveSkillState`` subclass. Used for deserialisation. + confirmation_threshold : int, default=3 + Minimum ``confirmations`` count a hypothesis must reach before + ``graduate_hypothesis`` is allowed to promote it to a calibration + correction. Kept here (not in state) so the agent cannot lower it. 
+ """ + + def __init__( + self, + skill_dir: Path, + state_type: type[S], + confirmation_threshold: int = 3, + ) -> None: + if not skill_dir.is_dir(): + raise ValueError(f"Skill directory does not exist: {skill_dir}") + self._skill_dir = skill_dir + self._state_type = state_type + self.confirmation_threshold = confirmation_threshold + + # ------------------------------------------------------------------ + # Paths + # ------------------------------------------------------------------ + + @property + def state_path(self) -> Path: + """Path to ``skill_state.yaml``.""" + return self._skill_dir / _YAML_STATE_FILENAME + + @property + def skill_md_path(self) -> Path: + """Path to the rendered ``SKILL.md``.""" + return self._skill_dir / _SKILL_MD_FILENAME + + @property + def history_dir(self) -> Path: + """Path to the ``.history/`` backup directory.""" + return self._skill_dir / _HISTORY_DIR + + # ------------------------------------------------------------------ + # Load / save + # ------------------------------------------------------------------ + + def load(self) -> S: + """Deserialise state from ``skill_state.yaml``. + + Raises + ------ + FileNotFoundError + If ``skill_state.yaml`` does not exist. Seed it first with + ``save(initial_state)`` before registering mutation tools. + """ + if not self.state_path.exists(): + raise FileNotFoundError( + f"skill_state.yaml not found in {self._skill_dir}. Run the seed script to initialise the skill state." + ) + raw = yaml.safe_load(self.state_path.read_text(encoding="utf-8")) + return self._state_type.model_validate(raw) + + def save(self, state: S) -> str: + """Persist *state* to disk and re-render ``SKILL.md``. + + Steps: + + 1. Back up the current ``skill_state.yaml`` to ``.history/`` with a + UTC ISO-8601 timestamp suffix (before overwriting). + 2. Write ``skill_state.yaml`` from ``state.model_dump()``. + 3. Re-render ``SKILL.md`` via ``state.build_markdown()``. 
+ + Parameters + ---------- + state : S + Updated state to persist. + + Returns + ------- + str + Human-readable confirmation message suitable for returning from a + tool call. + """ + # 1. Backup + if self.state_path.exists(): + self.history_dir.mkdir(exist_ok=True) + ts = datetime.now(tz=timezone.utc).strftime("%Y%m%dT%H%M%SZ") + backup_path = self.history_dir / f"skill_state_{ts}.yaml" + shutil.copy2(self.state_path, backup_path) + + # 2. Write YAML state + data = state.model_dump(mode="json") + self.state_path.write_text( + yaml.dump(data, default_flow_style=False, allow_unicode=True, sort_keys=False), + encoding="utf-8", + ) + + # 3. Re-render SKILL.md — pass dir name so frontmatter matches ADK requirement + rendered = state.build_markdown(skill_name=self._skill_dir.name) + self.skill_md_path.write_text(rendered, encoding="utf-8") + + return f"State saved to {self.state_path.name}. SKILL.md re-rendered. Backup written to {_HISTORY_DIR}/." +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__adk_runner.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__adk_runner.py.md new file mode 100644 index 0000000..6337eb9 --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__adk_runner.py.md @@ -0,0 +1,375 @@ +# Source: aieng-forecasting/aieng/forecasting/methods/agentic/adk_runner.py + +kind: python + +```python +"""General-purpose ADK runner: text-in / text-out over ``InMemoryRunner``. + +This module provides :class:`AdkTextRunner`, a thin wrapper around Google +ADK's :class:`~google.adk.runners.InMemoryRunner` that exposes a single +``run_text_async(prompt) -> str`` method, manages per-user session lifecycle, +and optionally propagates Langfuse trace attributes for each turn. 
+ +This module requires the ``agentic`` extra; importing it without the extra +raises :class:`ImportError`. +""" + +from __future__ import annotations + +import types as py_types +from typing import Any + +from pydantic import BaseModel, Field + + +try: + from google.adk.agents.base_agent import BaseAgent + from google.adk.agents.run_config import RunConfig + from google.adk.runners import InMemoryRunner + from google.genai import types as genai_types +except ModuleNotFoundError as exc: + raise ImportError( + "This module requires the 'agentic' extra. Install it with 'pip install aieng-forecasting[agentic]'." + ) from exc + + +class AdkTextRunnerConfig(BaseModel): + """Configuration for :class:`AdkTextRunner`. + + Attributes + ---------- + app_name : str + Application id shared by the session service and runner. + default_user_id : str + Fallback user id when :meth:`~AdkTextRunner.run_text_async` is called + without an explicit ``user_id``. + fresh_session_per_message : bool + When ``True`` (default), each :meth:`~AdkTextRunner.run_text_async` + call creates a fresh ADK session and any supplied ``session_id`` is + ignored. When ``False``, sessions are reused per ``user_id`` + (sticky conversation). + enable_langfuse_tracing : bool + When ``True``, initialise Langfuse at construction time and wrap every + turn with ``propagate_attributes``. Requires the ``agentic`` extra. + langfuse_tags : list of str or None + Tags forwarded to Langfuse ``propagate_attributes``. + langfuse_propagate_metadata : dict of str to str, or None + Extra key/value metadata merged with ``adk_app_name`` and forwarded + to ``propagate_attributes``. + langfuse_trace_name : str or None + ``trace_name`` forwarded to Langfuse ``propagate_attributes``. + langfuse_version : str or None + ``version`` forwarded to Langfuse ``propagate_attributes``. 
+ + Notes + ----- + When ``enable_langfuse_tracing`` is ``True``, ``user_id``, ``session_id``, + ``trace_name``, and every key/value in ``langfuse_propagate_metadata`` must + be US-ASCII and ≤ 200 characters each; Langfuse may drop + non-conforming values with a warning. + """ + + app_name: str = Field( + ..., + description="Application id shared by session service and runner.", + ) + default_user_id: str = Field( + default="user", + description=( + "Used when ``run_text_async`` is called without ``user_id``. " + "If Langfuse tracing is enabled, must be US-ASCII and ≤ 200 characters." + ), + ) + fresh_session_per_message: bool = Field( + default=True, + description=( + "If True, each ``run_text_async`` creates a new session (``session_id`` is ignored). " + "If False, turns for the same ``user_id`` reuse one session: the first call creates it, " + "later calls omit ``session_id`` unless switching threads; optional explicit " + "``session_id`` joins or replaces the sticky session for that user." + ), + ) + enable_langfuse_tracing: bool = Field( + default=False, + description=( + "If True, call :func:`~aieng.forecasting.langfuse_tracing.init_langfuse_tracing` " + "at runner construction and wrap each turn with Langfuse " + "``propagate_attributes``. Forwards resolved ``user_id`` and ADK ``session_id`` " + "plus optional fields below. Langfuse requires propagated identifiers to be " + "US-ASCII and ≤ 200 characters; invalid values may be dropped with warnings. " + "Requires the ``agentic`` extra (``langfuse``)." + ), + ) + langfuse_tags: list[str] | None = Field( + default=None, + description=("Optional tags for ``propagate_attributes`` to categorize observations in Langfuse."), + ) + langfuse_propagate_metadata: dict[str, str] | None = Field( + default=None, + description=( + "Extra metadata merged with ``adk_app_name`` for ``propagate_attributes``. 
" + "Keys and values must be US-ASCII strings ≤ 200 characters each; avoid large " + "payloads or sensitive data (non-conforming entries may be dropped with warnings)." + ), + ) + langfuse_trace_name: str | None = Field( + default=None, + description=("Optional ``trace_name`` for ``propagate_attributes``: US-ASCII, ≤ 200 characters."), + ) + langfuse_version: str | None = Field( + default=None, + description=( + "Optional ``version`` for independently versioned parts of the app (e.g. agent " + "revision). Use short US-ASCII values suitable for span attributes." + ), + ) + + model_config = {"extra": "forbid"} + + +class AdkTextRunner: + """Wrap ``InMemoryRunner`` with session helpers. + + Parameters + ---------- + agent : BaseAgent + The ADK agent to run. + config : AdkTextRunnerConfig + The configuration for the runner. + + Examples + -------- + Build a runner from an :class:`AgentConfig` and send one prompt: + + >>> from aieng.forecasting.methods.agentic import ( + ... AgentConfig, + ... build_adk_agent, + ... ) + >>> from aieng.forecasting.methods.agentic.adk_runner import ( + ... AdkTextRunner, + ... AdkTextRunnerConfig, + ... ) + >>> agent = build_adk_agent(AgentConfig(instruction="You are a helpful assistant.")) + >>> runner = AdkTextRunner( + ... agent, + ... config=AdkTextRunnerConfig(app_name="demo"), + ... ) + >>> reply = await runner.run_text_async("Hello.") + """ + + def __init__(self, agent: BaseAgent, *, config: AdkTextRunnerConfig) -> None: + """Construct the runner and optionally initialise Langfuse tracing.""" + self.config = config + self.agent = agent + self._runner = InMemoryRunner(agent=agent, app_name=config.app_name) + # Sticky ADK session per user when ``fresh_session_per_message`` is False. + self._conversation_session_by_user: dict[str, str] = {} + # Trace id captured during the most recent traced run (see ``last_trace_id``). 
+ self._last_trace_id: str | None = None + if config.enable_langfuse_tracing: + from aieng.forecasting.langfuse_tracing import init_langfuse_tracing # noqa: PLC0415 + + init_langfuse_tracing() + + @property + def last_trace_id(self) -> str | None: + """Langfuse trace id captured during the most recent traced run, if any. + + The agent runs on a worker event loop whose trace context the caller's + thread cannot see; the runner captures the id here so a predictor can link + and score the trace after the run. ``None`` when tracing is off or the last + run produced no trace. + """ + return self._last_trace_id + + @property + def runner(self) -> InMemoryRunner: + """Underlying ADK runner (session, artifact, memory services).""" + return self._runner + + async def _resolve_session_id(self, user_id: str | None, session_id: str | None) -> str: + """Return the ADK session id to use for a single turn. + + Parameters + ---------- + user_id : str or None + Resolved user id; falls back to ``default_user_id`` when ``None``. + session_id : str or None + Explicit session id from the caller. ``None`` triggers sticky-session + lookup or new-session creation depending on ``fresh_session_per_message``. + + Returns + ------- + str + ADK session id for this turn. 
+ """ + if user_id is None: + user_id = self.config.default_user_id + + if self.config.fresh_session_per_message: + new_session = await self._runner.session_service.create_session( + app_name=self.config.app_name, + user_id=user_id, + ) + sid = new_session.id + elif session_id is not None: + sid = session_id + self._conversation_session_by_user[user_id] = sid + elif user_id in self._conversation_session_by_user: + sid = self._conversation_session_by_user[user_id] + else: + new_session = await self._runner.session_service.create_session( + app_name=self.config.app_name, + user_id=user_id, + ) + sid = new_session.id + self._conversation_session_by_user[user_id] = sid + + return sid + + async def run_text_async( + self, + prompt: str, + *, + user_id: str | None = None, + session_id: str | None = None, + run_config: RunConfig | None = None, + ) -> str: + """Run one user turn; return the first final model text or an empty string. + + Parameters + ---------- + prompt : str + The user prompt to run. + user_id : str | None, optional + The user id to use for the session. If not provided, the default + user id is used. With Langfuse tracing, must be US-ASCII and ≤ 200 + characters for propagation. + session_id : str | None, optional + The session id to use for the session. If not provided, a new session + is created. With Langfuse tracing, the ADK session id must remain + US-ASCII and ≤ 200 characters for propagation. + run_config : RunConfig | None, optional + The run configuration to use for the run. If not provided, the default + run configuration is used. + + Returns + ------- + str + The first final model text or an empty string. + + Notes + ----- + If ``fresh_session_per_message`` is True, each call uses a new ADK session and + ``session_id`` is ignored. + + If it is False, the runner keeps a session per ``user_id``: omit ``session_id`` + after the first message to continue the same conversation. 
Pass ``session_id`` + to attach to an existing session or switch threads; that id is remembered for + later calls with ``session_id`` omitted (same user). + + When ``enable_langfuse_tracing`` is True, each turn runs inside Langfuse + ``propagate_attributes`` using the resolved ``user_id`` and ADK ``session_id``. + """ + from aieng.forecasting.methods.agentic.agent_factory import SMR_STATE_KEY # noqa: PLC0415 + + user_id = user_id or self.config.default_user_id + + session_id = await self._resolve_session_id(user_id, session_id) + + content = genai_types.Content(role="user", parts=[genai_types.Part(text=prompt)]) + + async def drain_run() -> str: + async for event in self._runner.run_async( + user_id=user_id, + session_id=session_id, + new_message=content, + run_config=run_config, + ): + if event.is_final_response() and event.content and event.content.parts: + return event.content.parts[0].text or "" + return "" + + async def run_and_resolve() -> str: + """Run the agent and return the best available output string. + + When the agent uses our set_model_response shim (LiteLlm path with + tools + output_schema), the structured JSON is stored in session + state under SMR_STATE_KEY. We prefer that over the model's + subsequent "Task complete." text response. 
+ """ + text = await drain_run() + session = await self._runner.session_service.get_session( + app_name=self.config.app_name, + user_id=user_id, + session_id=session_id, + ) + if session is not None and SMR_STATE_KEY in (session.state or {}): + return str(session.state[SMR_STATE_KEY]) + return text + + if self.config.enable_langfuse_tracing: + from langfuse import get_client, propagate_attributes # noqa: PLC0415 + + metadata: dict[str, str] = {"adk_app_name": self.config.app_name} + if self.config.langfuse_propagate_metadata: + metadata = {**metadata, **self.config.langfuse_propagate_metadata} + + pa_kw: dict[str, Any] = { + k: v + for k, v in { + "user_id": user_id, + "session_id": session_id, + "metadata": metadata, + "tags": self.config.langfuse_tags, + "trace_name": self.config.langfuse_trace_name, + "version": self.config.langfuse_version, + }.items() + if v is not None + } + # Wrap the run in an explicit Langfuse span so (a) the ADK spans nest + # under one root trace and (b) we can capture the trace id while its + # context is active — the caller's thread cannot see it otherwise. + self._last_trace_id = None + client = get_client() + root_name = self.config.langfuse_trace_name or self.config.app_name + with client.start_as_current_observation(name=root_name, as_type="agent"): + with propagate_attributes(**pa_kw): + result = await run_and_resolve() + self._last_trace_id = client.get_current_trace_id() + return result + + return await run_and_resolve() + + def clear_conversation(self, *, user_id: str | None = None) -> None: + """Drop sticky session id(s). Next ``run_text_async`` starts a new chat. + + With ``user_id``, clear only that user. With ``None``, clear every user. + No effect when ``fresh_session_per_message`` is True. + + Parameters + ---------- + user_id : str | None, optional + The user id to clear the conversation for. If not provided, all users + are cleared. No effect when ``fresh_session_per_message`` is True. 
+ """ + if user_id is None: + self._conversation_session_by_user.clear() + else: + self._conversation_session_by_user.pop(user_id, None) + + async def aclose(self) -> None: + """Close the underlying runner (plugins, toolsets).""" + self._conversation_session_by_user.clear() + await self._runner.close() # type: ignore[no-untyped-call] + + async def __aenter__(self) -> AdkTextRunner: + """Return self for use as an ``async with`` target.""" + return self + + async def __aexit__( + self, exc_type: type[BaseException] | None, exc_val: BaseException | None, exc_tb: py_types.TracebackType | None + ) -> None: + """Close the runner when leaving the ``async with`` block.""" + await self.aclose() +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__agent_factory.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__agent_factory.py.md new file mode 100644 index 0000000..494c0ed --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__agent_factory.py.md @@ -0,0 +1,576 @@ +# Source: aieng-forecasting/aieng/forecasting/methods/agentic/agent_factory.py + +kind: python + +```python +"""Factory functions for building Google ADK agents for forecasting. + +This module exposes :class:`AgentConfig` plus its nested +:class:`CodeExecutionConfig` and :class:`ContextRetrievalConfig` configs, +and the :func:`build_adk_agent` factory that turns a config into a fully +configured :class:`google.adk.agents.LlmAgent` (with optional E2B-backed +code execution and a proxy-grounded web-search tool for context retrieval). + +This module requires the ``agentic`` extra; importing it without the extra +raises :class:`ImportError` with installation guidance. 
+""" + +from __future__ import annotations + +import logging +import os +import warnings +from pathlib import Path +from typing import Any, Callable, Sequence + +from aieng.forecasting.methods.agentic.outputs import AgentForecastOutput +from aieng.forecasting.models import LITE_MODEL +from google.adk.models.base_llm import BaseLlm +from pydantic import BaseModel, Field, field_validator, model_validator + + +# --------------------------------------------------------------------------- +# Suppress LiteLLM startup and OTEL noise +# --------------------------------------------------------------------------- +# LiteLLM logs Bedrock/SageMaker "no botocore" warnings and an OTEL proxy- +# server notice on every import — all harmless when using the Vector proxy. +# OTEL span-lifecycle warnings ("Tried calling set_status on an ended span") +# fire when LiteLLM callbacks run after spans close; also benign. +# These filters run at module-import time so they are active before the first +# litellm import (which happens lazily inside search_web / build_adk_agent). 
+ + +class _LiteLLMNoiseFilter(logging.Filter): + _NOISE = ("botocore", "Proxy Server is not installed") + + def filter(self, record: logging.LogRecord) -> bool: + return not any(n in record.getMessage() for n in self._NOISE) + + +logging.getLogger("LiteLLM").addFilter(_LiteLLMNoiseFilter()) +logging.getLogger("opentelemetry").setLevel(logging.ERROR) +warnings.filterwarnings("ignore", message="Tried calling set_status on an ended span") +warnings.filterwarnings("ignore", message="Setting attribute on ended span") + + +try: + from aieng.agents.tools.code_interpreter import CodeInterpreter + from google.adk.agents import LlmAgent + from google.adk.skills import load_skill_from_dir + from google.adk.skills.models import Skill + from google.adk.tools.function_tool import FunctionTool + from google.adk.tools.skill_toolset import SkillToolset + from google.adk.tools.tool_context import ToolContext + from google.genai.types import ( + AutomaticFunctionCallingConfig, + GenerateContentConfig, + ThinkingConfig, + ThinkingLevel, + ) +except ModuleNotFoundError as exc: + raise ImportError( + "This module requires the 'agentic' extra. Install it with 'pip install aieng-forecasting[agentic]'." + ) from exc + + +# Session-state key used by our proxy-compatible set_model_response shim. +# When a LiteLlm agent has both output_schema and tools, we register a flat +# set_model_response(json_response: str) tool that stores the JSON here. +# AdkTextRunner reads this key after each run and returns it in place of the +# final text, giving the predictor the structured JSON it expects. +SMR_STATE_KEY = "__smr_output__" + + +def _build_set_model_response_tool() -> FunctionTool: + """Return a proxy-compatible ``set_model_response`` shim. + + Gemini thinking models call ``set_model_response`` when they produce + structured output alongside other tools — regardless of whether ADK + registered the tool. 
The real ``SetModelResponseTool`` uses a nested + Pydantic schema for its function declaration, which Gemini rejects via the + OpenAI-compatible proxy (``$defs``/``$ref`` not supported). + + This shim accepts the JSON as a plain string and stores it in session + state under :data:`SMR_STATE_KEY`. :class:`AdkTextRunner` reads that key + after the run and returns it as the final output, bypassing the model's + subsequent "Done." text response. + """ + + async def set_model_response(json_response: str, tool_context: ToolContext) -> str: + """Submit your final structured JSON response as a string. + + Call this tool once, passing the complete JSON object that satisfies + the required output schema. Do not produce any further text after + calling this tool. + """ + tool_context.state[SMR_STATE_KEY] = json_response + return "Response submitted. Task complete." + + return FunctionTool(set_model_response) + + +class ContextRetrievalConfig(BaseModel): + """Configuration for the web-search context-retrieval tool. + + When enabled, :func:`build_adk_agent` attaches a ``search_web`` + :class:`~google.adk.tools.FunctionTool` to the agent. The tool calls + the Vector proxy with Gemini's ``googleSearch`` server-side extension so + the calling agent can retrieve grounded, sourced web context without a + direct Gemini API key. + + Temporal cutoff enforcement is soft (LLM-judgment-based): when + ``enforce_cutoff`` is ``True`` and the calling agent passes a + ``cutoff_date`` to the tool, the inner proxy prompt explicitly asks the + model to exclude post-cutoff sources. This is the same trust model used + by the prior Google Search sub-agent — backtest leakage is a + pedagogically useful discussion point, not a hard guarantee. + + Attributes + ---------- + enabled : bool, default=False + Whether to enable context retrieval. Disabled by default. + search_model : str, default=LITE_MODEL (``"gemini-3.1-flash-lite-preview"``) + Proxy model used inside the ``search_web`` tool call. 
Must be a + model that supports the ``googleSearch`` server-side tool extension. + instruction : str + System prompt passed to the inner proxy call. Should describe the + search persona and what kind of output to return. Must be non-empty + when ``enabled`` is ``True``. + enforce_cutoff : bool, default=True + When ``True``, the ``search_web`` tool appends a cutoff-date + constraint to the user prompt whenever ``cutoff_date`` is supplied by + the calling agent. Set to ``False`` for live (non-backtest) agents + where no temporal fence is needed. + temperature : float | None, default=None + Sampling temperature for the inner search call. + max_output_tokens : int | None, default=None + Maximum output tokens for the inner search call. + """ + + model_config = {"extra": "forbid"} + + enabled: bool = False + search_model: str = LITE_MODEL + instruction: str = ( + "You are a specialized web search assistant.\n\n" + "Search for information relevant to the query and return a concise, " + "grounded summary with source URLs." + ) + enforce_cutoff: bool = True + temperature: float | None = Field(default=None, ge=0.0, le=2.0) + max_output_tokens: int | None = Field(default=None, ge=1) + + +class CodeExecutionConfig(BaseModel): + """Configuration for the E2B code execution tool. + + Code runs in an E2B-backed sandbox managed by the + :class:`~aieng.agents.tools.code_interpreter.CodeInterpreter` tool. + + Attributes + ---------- + enabled : bool, default=False + Whether to enable code execution. Disabled by default. + template_name : str | None, default="agentic-forecasting-bootcamp" + E2B template name. + sandbox_timeout_seconds : int, default=3600 + E2B sandbox lifetime in seconds. + code_execution_timeout_seconds : float | None, default=3300 + Per-execution timeout in seconds. 
+ """ + + model_config = {"extra": "forbid"} + + enabled: bool = False + template_name: str | None = "agentic-forecasting-bootcamp" + sandbox_timeout_seconds: int = Field(default=3600, ge=1, le=3600) + code_execution_timeout_seconds: float | None = Field(default=3300, gt=0) + + @model_validator(mode="after") + def _timeouts_consistent(self) -> "CodeExecutionConfig": + """Ensure code execution cannot outlive the sandbox itself.""" + if ( + self.code_execution_timeout_seconds is not None + and self.code_execution_timeout_seconds > self.sandbox_timeout_seconds + ): + raise ValueError("code_execution_timeout_seconds cannot exceed sandbox_timeout_seconds") + return self + + +def _build_automatic_function_calling_config( + config: AgentConfig, + *, + tools: list[Any], + output_schema: type[AgentForecastOutput] | None, +) -> AutomaticFunctionCallingConfig | None: + """Disable genai AFC when ADK orchestrates tools or schemas.""" + disable = config.disable_automatic_function_calling + if disable is None: + disable = bool(tools or output_schema is not None) + if not disable: + return None + return AutomaticFunctionCallingConfig(disable=True) + + +def _build_search_tool( + config: ContextRetrievalConfig, + *, + openai_base_url: str, + openai_api_key: str | None, +) -> Callable[..., Any]: + """Return an async ``search_web`` FunctionTool backed by the proxy's googleSearch. + + The returned coroutine function is registered as an ADK tool. It calls + the proxy with ``"tools": [{"googleSearch": {}}]`` so the model does + server-side grounding and returns a synthesised answer plus source URLs + extracted from ``choices[0].provider_specific_fields["grounding_metadata"]``. + """ + + async def search_web(query: str, cutoff_date: str | None = None) -> str: + """Search the web and return a grounded summary with source URLs. + + Args: + query: What to search for. + cutoff_date: ISO date (YYYY-MM-DD). When provided, only include + information published strictly before this date. 
+ + Returns + ------- + A grounded summary of search results, with source URLs appended. + """ + import litellm # noqa: PLC0415 + + user_content = query + if cutoff_date and config.enforce_cutoff: + user_content += f"\n\nOnly include and cite information published strictly before {cutoff_date}." + search_model = config.search_model + if not search_model.startswith("openai/"): + search_model = f"openai/{search_model}" + resp = await litellm.acompletion( + model=search_model, + api_base=openai_base_url, + api_key=openai_api_key, + messages=[ + {"role": "system", "content": config.instruction}, + {"role": "user", "content": user_content}, + ], + tools=[{"googleSearch": {}}], + max_tokens=config.max_output_tokens or 4096, + temperature=config.temperature or 0.0, + timeout=60.0, + ) + content = resp.choices[0].message.content or "" + psf = getattr(resp.choices[0], "provider_specific_fields", {}) or {} + gm = psf.get("grounding_metadata") or {} + sources: list[str] = [ + uri for c in gm.get("groundingChunks", []) if (uri := (c.get("web") or {}).get("uri")) is not None + ] + if sources: + content += "\n\nSources:\n" + "\n".join(sources[:5]) + return content + + return search_web + + +class AgentConfig(BaseModel): + """Configuration for building an ADK agent for forecasting tasks. + + Attributes + ---------- + name : str, default="adk_forecasting_agent" + Name of the agent. + model : str | BaseLlm, default=LITE_MODEL (``"gemini-3.1-flash-lite-preview"``) + Model name (bare, no provider prefix) or a custom + :class:`~google.adk.models.base_llm.BaseLlm` instance. When + ``openai_base_url`` is set and ``model`` is a plain string, + :func:`build_adk_agent` wraps it in a + :class:`~google.adk.models.lite_llm.LiteLlm` instance pointing to + the proxy. Pass a ``BaseLlm`` directly to skip automatic wrapping. + openai_base_url : str | None, default=OPENAI_BASE_URL env var + Base URL for the OpenAI-compatible LLM proxy. Defaults to the + ``OPENAI_BASE_URL`` environment variable. 
When set, the agent (and + the ``search_web`` tool) route all calls through the proxy. + openai_api_key : str | None, default=OPENAI_API_KEY env var + API key for the proxy. Defaults to the ``OPENAI_API_KEY`` + environment variable. + description : str, default="" + Description of the agent. Useful when the agent is used as a sub-agent. + instruction : str, default="" + Instruction for the agent. + skills_dirs : Sequence[Path], default=() + Sequence of paths to skill directories. + function_tools : Sequence[Any], default=() + Conventional ADK tools (e.g. :class:`~google.adk.tools.FunctionTool` + instances or plain callables) appended directly to the agent's tool + list. Use this to give the agent a rigid, pre-specified capability such + as the + :class:`~aieng.forecasting.methods.agentic.forecast_tool.ForecastTool` + (in contrast to open-ended code execution). Stored as-is; not validated. + seed : int or None, default=None + Generation seed forwarded to the model for reproducibility. + temperature : float or None, default=None + Sampling temperature; ``None`` uses the model default. + max_output_tokens : int or None, default=None + Maximum tokens per model response; ``None`` uses the model default. + thinking_budget : int or None, default=None + Token budget for extended thinking (Gemini thinking models only). + **Proxy-path caveat:** when routing through the Vector proxy (or any + OpenAI-compatible proxy), ``thinking_budget`` is passed via ADK's + ``ThinkingConfig`` → ``GenerateContentConfig``. Whether LiteLLM's + ``drop_params`` strips it on the proxy path is untested — if you set + this and see no change in thinking behaviour, treat it as silently + dropped (same root cause as the ``reasoning_effort`` stripping issue + documented in ``planning-docs/vector-llm-proxy.md``). + thinking_level : ThinkingLevel or None, default=None + Thinking-level preset; overrides ``thinking_budget`` when both are set. + Subject to the same proxy-path caveat as ``thinking_budget``. 
+ code_execution : CodeExecutionConfig + Configuration for E2B code execution. Disabled by default. + context_retrieval : ContextRetrievalConfig + Configuration for web-search context retrieval. Disabled by default. + disable_automatic_function_calling : bool or None, default=None + When ``True``, sets ``automatic_function_calling.disable`` on the + Gemini request config. ADK agents execute tools via the ADK runtime, + not the genai SDK's Automatic Function Calling (AFC) helper. + ``None`` (default) auto-disables AFC whenever tools or an + ``output_schema`` are configured. + extra_tools : Sequence[Callable[..., Any]], default=() + Additional callable tools to register with the agent beyond the + standard code-execution and context-retrieval tools. Use this to + inject implementation-specific tools (e.g. adaptive skill mutation + tools) without coupling the shared factory to implementation code. + Each callable is appended to the tool list after skills are loaded + and will be wrapped by ADK as a ``FunctionTool``. + """ + + model_config = {"extra": "forbid", "arbitrary_types_allowed": True} + + name: str = "adk_forecasting_agent" + model: str | BaseLlm = LITE_MODEL + openai_base_url: str | None = Field( + default_factory=lambda: os.getenv("OPENAI_BASE_URL"), + description=( + "Base URL for the OpenAI-compatible LLM proxy. Defaults to the OPENAI_BASE_URL environment variable." + ), + ) + openai_api_key: str | None = Field( + default_factory=lambda: os.getenv("OPENAI_API_KEY"), + description="API key for the proxy. Defaults to the OPENAI_API_KEY environment variable.", + ) + description: str = "" + instruction: str = "" + skills_dirs: Sequence[Path] = () + function_tools: Sequence[Any] = () + # Optional generation overrides (None = model/provider defaults). 
+ seed: int | None = None + temperature: float | None = None + max_output_tokens: int | None = None + thinking_budget: int | None = None + thinking_level: ThinkingLevel | None = None + + # Capabilities + code_execution: CodeExecutionConfig = Field(default_factory=CodeExecutionConfig) + context_retrieval: ContextRetrievalConfig = Field(default_factory=ContextRetrievalConfig) + disable_automatic_function_calling: bool | None = None + extra_tools: Sequence[Callable[..., Any]] = () + + @field_validator("skills_dirs") + @classmethod + def _skill_dirs_exist(cls, dirs: Sequence[Path]) -> Sequence[Path]: + """Reject skill directories that do not resolve to a real directory.""" + missing = [p for p in dirs if not p.is_dir()] + if missing: + raise ValueError(f"Skill directories do not exist: {missing}") + return dirs + + @model_validator(mode="after") + def _enabled_requires_instruction(self) -> "AgentConfig": + """Require non-empty instructions for the root and context-retrieval agents.""" + if self.context_retrieval.enabled and not self.context_retrieval.instruction.strip(): + raise ValueError( + "Expected non-empty instruction for context retrieval agent. " + "Please provide an instruction in the agent configuration." + ) + if not self.instruction.strip(): + raise ValueError( + "Expected non-empty instruction for root agent. " + "Please provide an instruction in the agent configuration." + ) + return self + + +def build_adk_agent( + config: AgentConfig, + *, + output_schema: type[AgentForecastOutput] | None = None, +) -> LlmAgent: + """Build an ADK agent for forecasting tasks with the given configuration. + + Code execution (E2B) and the web-search context-retrieval tool are wired + only when the corresponding capability blocks in ``config`` are enabled. 
+ + When ``config.openai_base_url`` is set and ``config.model`` is a plain + string, the model is automatically wrapped in a + :class:`~google.adk.models.lite_llm.LiteLlm` instance that routes all + calls through the proxy. Pass a ``BaseLlm`` instance directly to bypass + automatic wrapping. + + Parameters + ---------- + config : AgentConfig + Configuration for the agent. ``config.instruction`` must be + non-empty; if ``config.context_retrieval.enabled`` is ``True``, + ``config.context_retrieval.instruction`` must also be non-empty + (enforced by :class:`AgentConfig`). + output_schema : type[AgentForecastOutput] or None, default=None + When provided, configures the agent to return JSON constrained to + this schema. Typically supplied by :class:`AgentPredictor`. + + Note: avoid ``str | None`` optional fields on schemas that also + contain ``list[BaseModel]`` fields; use string defaults (e.g. + ``rationale=""``) to stay compatible with ADK's + ``set_model_response`` tool. + + Returns + ------- + LlmAgent + Configured ADK agent with tools and skills attached. + + Examples + -------- + Interactive analyst — free-form output, no schema constraint: + + >>> from aieng.forecasting.methods.agentic import AgentConfig, build_adk_agent + >>> agent = build_adk_agent(AgentConfig(instruction="You are a helpful analyst.")) + + Predictor role — structured JSON output constrained to a schema: + + >>> from aieng.forecasting.methods.agentic import ( + ... AgentConfig, + ... ContinuousAgentForecastOutput, + ... build_adk_agent, + ... ) + >>> agent = build_adk_agent( + ... AgentConfig(instruction="Forecast the supplied series."), + ... output_schema=ContinuousAgentForecastOutput, + ... ) + """ + # Resolve model: wrap bare string in LiteLlm when proxy is configured. 
+ model: str | BaseLlm = config.model + if isinstance(model, str) and config.openai_base_url: + from google.adk.models.lite_llm import LiteLlm # noqa: PLC0415 + + # Prefix with "openai/" so LiteLLM uses the OpenAI-compatible path. + # LiteLLM strips the prefix before sending, so the proxy receives the + # bare model name. + litellm_model = model if model.startswith("openai/") else f"openai/{model}" + model = LiteLlm( + model=litellm_model, + api_base=config.openai_base_url, + api_key=config.openai_api_key, + ) + + # Configure tools + tools: list[Any] = [] + + if config.code_execution.enabled: + tools.append( + CodeInterpreter( + template_name=config.code_execution.template_name, + sandbox_timeout_seconds=config.code_execution.sandbox_timeout_seconds, + code_execution_timeout_seconds=config.code_execution.code_execution_timeout_seconds, + ).run_code + ) + + if config.context_retrieval.enabled: + openai_base_url = config.openai_base_url or os.getenv("OPENAI_BASE_URL") or "" + tools.append( + _build_search_tool( + config.context_retrieval, + openai_base_url=openai_base_url, + openai_api_key=config.openai_api_key, + ) + ) + + # Load skills + skills: list[Skill] = [] + for skills_dir in config.skills_dirs: + skills.append(load_skill_from_dir(skills_dir)) + + if skills: + tools.append(SkillToolset(skills=skills)) + + # Append any extra implementation-specific tools (e.g. adaptive skill + # mutation tools). These run in the host process, not in E2B. + for extra in config.extra_tools: + tools.append(extra) + + # For LiteLlm agents with both output_schema and tools, ADK's + # can_use_output_schema_with_tools() returns True and skips set_model_response + # injection, using response_format instead. However, Gemini thinking models + # (e.g. gemini-3.5-flash) are trained to call set_model_response when + # producing structured output alongside other tools — and they do so even when + # output_schema=None on the Python side. 
+ # + # The real SetModelResponseTool fails here because its function declaration + # uses JSON Schema $defs/$ref (from the Pydantic output schema), which Gemini + # rejects via the OpenAI-compatible proxy. + # + # Fix: register our flat-schema shim (_build_set_model_response_tool) that + # accepts the JSON as a plain string and parks it in session state. Clear + # output_schema so ADK does not also try to enforce it via response_format. + # AdkTextRunner reads the state key after the run and returns the captured + # JSON as the final output. + # + # This applies to *every* proxy-routed (LiteLlm) agent with an output_schema, + # not only tool-bearing ones: a schema-only agent with no other tools (e.g. a + # bare AgentPredictor) would otherwise send the Pydantic schema as Gemini's + # response_schema and 400 on $defs/$ref/additionalProperties through the proxy. + # When the shim is the only tool, the model emits the JSON via set_model_response + # (or as plain text, which AdkTextRunner returns as a fallback) — both paths are + # handled downstream. Direct-Gemini (non-LiteLlm) agents keep the native schema. + effective_output_schema = output_schema + try: + from google.adk.models.lite_llm import LiteLlm as _LiteLlm # noqa: PLC0415 + + if output_schema is not None and isinstance(model, _LiteLlm): + tools.append(_build_set_model_response_tool()) + effective_output_schema = None + except ImportError: + pass + + # Conventional function tools (e.g. ForecastTool) attach directly. 
+ tools.extend(config.function_tools) + + thinking_config = ( + ThinkingConfig( + include_thoughts=True, + thinking_budget=config.thinking_budget, + thinking_level=config.thinking_level, + ) + if config.thinking_budget is not None or config.thinking_level is not None + else None + ) + + automatic_function_calling = _build_automatic_function_calling_config( + config, + tools=tools, + output_schema=output_schema, + ) + + return LlmAgent( + name=config.name, + description=config.description, + model=model, + instruction=config.instruction, + tools=tools, + output_schema=effective_output_schema, + generate_content_config=GenerateContentConfig( + seed=config.seed, + temperature=config.temperature, + max_output_tokens=config.max_output_tokens, + thinking_config=thinking_config, + automatic_function_calling=automatic_function_calling, + ), + ) +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__curriculum.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__curriculum.py.md new file mode 100644 index 0000000..402af8c --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__curriculum.py.md @@ -0,0 +1,520 @@ +# Source: aieng-forecasting/aieng/forecasting/methods/agentic/curriculum.py + +kind: python + +```python +"""Curriculum assembly utilities for adaptive agent training. + +These functions help prepare structured learning material from historical +backtest results and cached context documents, and assemble it into a single +curriculum prompt that can be sent to an adaptive agent via +:class:`~aieng.forecasting.methods.agentic.adk_runner.AdkTextRunner`. + +The paradigm is **curriculum learning** — the agent studies evidence as a new +analyst would study case files, rather than simulating itself going back in +time. 
The curriculum utility functions are domain-agnostic; domain-specific +curriculum builders in each implementation assemble and pass the right content. + +Typical usage:: + + from aieng.forecasting.methods.agentic.curriculum import ( + format_backtest_report, + load_context_documents, + build_curriculum_prompt, + ) + + report = format_backtest_report( + result=backtest_result, + actuals=actuals_dict, + title="2024 WTI Baseline Backtest", + training_start=date(2024, 1, 1), + training_end=date(2024, 12, 31), + ) + + context_docs = load_context_documents( + context_dir=Path("adaptive_agent/curriculum/context"), + dates=["2024-03-04", "2024-06-03", ...], + ) + + prompt = build_curriculum_prompt( + report=report, + context_documents=context_docs, + as_of="2025-12-31", + preamble="Review 2025 WTI forecasting performance for systematic patterns.", + ) + + reply = await runner.run_text_async(prompt) +""" + +from __future__ import annotations + +import logging +import math +import warnings +from datetime import date, datetime +from pathlib import Path +from typing import TYPE_CHECKING + +import numpy as np +from aieng.forecasting.evaluation.backtest import BacktestResult +from aieng.forecasting.evaluation.prediction import ContinuousForecast, Prediction + + +if TYPE_CHECKING: + import pandas as pd + +logger = logging.getLogger(__name__) + +# --------------------------------------------------------------------------- +# Vol-regime helper +# --------------------------------------------------------------------------- + +_VOL_REGIMES = [ + (15.0, "low"), + (30.0, "medium"), + (50.0, "elevated"), + (math.inf, "extreme"), +] + + +_MIN_VOL_WINDOW = 5 +_COV_LOW = 0.70 +_COV_HIGH = 0.90 +_BIAS_FRACTION = 0.3 +_COV_TREND_THRESHOLD = 0.05 +_MAE_TREND_THRESHOLD = 1.1 +_MIN_HORIZONS_FOR_NARRATIVE = 2 + + +def _vol_regime(price_series: pd.DataFrame, as_of: datetime, lookback: int = 21) -> str: + """Classify the vol regime at *as_of*. + + Uses *lookback* trading days of log returns. 
+ """ + import pandas as pd # noqa: PLC0415 — conditional import for optional dep + + ts = pd.to_datetime(price_series["timestamp"]) + vals = price_series.loc[ts <= pd.Timestamp(as_of), "value"].values + window = vals[-lookback:] + if len(window) < _MIN_VOL_WINDOW: + return "unknown" + log_returns = np.diff(np.log(window.astype(float))) + annualized_vol = float(np.std(log_returns) * np.sqrt(252) * 100) + for threshold, label in _VOL_REGIMES: + if annualized_vol < threshold: + return label + return "extreme" + + +# --------------------------------------------------------------------------- +# Report +# --------------------------------------------------------------------------- + + +def format_backtest_report( # noqa: PLR0912, PLR0913, PLR0915 + result: BacktestResult, + actuals: dict[tuple[str, int], float], + *, + title: str = "Backtest Report", + training_start: date | None = None, + training_end: date | None = None, + baseline_result: BacktestResult | None = None, + price_series: pd.DataFrame | None = None, +) -> str: + """Render a backtest result as a curriculum document. + + Formats a :class:`~aieng.forecasting.evaluation.backtest.BacktestResult` + as a structured markdown document for curriculum delivery. + + Produces a header, an optional naive-baseline comparison table, per-horizon + detail sections, and a cross-horizon pattern narrative. Each per-horizon + section includes: + + - **Coverage** — fraction of actuals inside the 80% CI (target: 0.80). + - **Mean bias** — signed mean error (positive = over-forecasting). + - **MAE** — mean absolute error of the point forecast. + - **Interval width** — average 80% CI width, vs. width needed for 80% coverage. + - **Regime breakdown** — coverage and MAE by vol regime (if *price_series* given). + + Parameters + ---------- + result : BacktestResult + Completed backtest result. + actuals : dict[tuple[str, int], float] + Mapping from ``(as_of_date_str, horizon_days)`` to the realised value. 
+ ``as_of_date_str`` must match ``str(prediction.as_of.date())``. + title : str, default="Backtest Report" + Section heading at the top of the document. + training_start : date or None + If provided, only predictions with ``as_of.date() >= training_start`` + are included. + training_end : date or None + If provided, only predictions with ``as_of.date() <= training_end`` + are included. + baseline_result : BacktestResult or None + Optional naive/last-value backtest result for a relative-skill comparison + row in the header table. The same *actuals* dict is used for scoring. + price_series : DataFrame or None + Full price series returned by ``data_service.get_series()`` (columns: + ``timestamp``, ``value``). When provided, each origin is classified + into a vol regime (low / medium / elevated / extreme) based on 21-day + realized volatility, and per-regime coverage/MAE tables are appended + to each horizon section. + + Returns + ------- + str + Markdown-formatted curriculum document. + """ + preds = result.predictions + + if training_start is not None: + preds = [p for p in preds if p.as_of.date() >= training_start] + if training_end is not None: + preds = [p for p in preds if p.as_of.date() <= training_end] + + if not preds: + return f"# {title}\n\nNo predictions in the specified training window.\n" + + # Organise by horizon + horizons: dict[int, list[Prediction]] = {} + for pred in preds: + h = (pred.forecast_date - pred.as_of).days + horizons.setdefault(h, []).append(pred) + + # Pre-compute vol regime per origin (optional) + regime_at: dict[str, str] = {} + if price_series is not None: + for pred in preds: + key = str(pred.as_of.date()) + if key not in regime_at: + regime_at[key] = _vol_regime(price_series, pred.as_of) + + # ── Header ─────────────────────────────────────────────────────────────── + lines: list[str] = [ + f"# {title}", + "", + f"**Predictor:** {result.predictor_id} ", + f"**Origins included:** {len({str(p.as_of.date()) for p in preds})} ", + 
f"**Mean CRPS (all horizons):** {result.mean_score:.4f}", + "", + ] + + # ── Naive comparison (optional) ────────────────────────────────────────── + if baseline_result is not None: + b_preds = baseline_result.predictions + if training_start is not None: + b_preds = [p for p in b_preds if p.as_of.date() >= training_start] + if training_end is not None: + b_preds = [p for p in b_preds if p.as_of.date() <= training_end] + + b_horizons: dict[int, list[Prediction]] = {} + for pred in b_preds: + h = (pred.forecast_date - pred.as_of).days + b_horizons.setdefault(h, []).append(pred) + + lines += [ + "## Relative skill vs. naive baseline", + "", + f"Baseline predictor: **{baseline_result.predictor_id}** ", + f"Baseline mean CRPS: {baseline_result.mean_score:.4f} ", + f"This predictor mean CRPS: {result.mean_score:.4f}", + "", + "| Horizon | This MAE | Baseline MAE | Skill (lower is better) |", + "|---------|----------|--------------|-------------------------|", + ] + + def _mae(pred_list: list[Prediction], horizon: int) -> float: + errs = [] + for p in pred_list: + k = (str(p.as_of.date()), horizon) + a = actuals.get(k) + if a is not None and isinstance(p.payload, ContinuousForecast): + errs.append(abs(p.payload.point_forecast - a)) + return float(np.mean(errs)) if errs else float("nan") + + for h in sorted(horizons): + this_mae = _mae(horizons[h], h) + base_mae = _mae(b_horizons.get(h, []), h) + skill = ( + f"{this_mae:.2f} vs {base_mae:.2f} " + f"({'better' if this_mae < base_mae else 'worse'} by {abs(this_mae - base_mae):.2f})" + if not math.isnan(base_mae) + else f"{this_mae:.2f} (no baseline)" + ) + lines.append(f"| {h}d | {this_mae:.2f} | {base_mae:.2f} | {skill} |") + lines += ["", "---", ""] + + # ── Per-horizon detail ──────────────────────────────────────────────────── + horizon_summaries: list[str] = [] # bullet lines for the narrative section + _cov_vals: list[float] = [] + _mae_vals: list[float] = [] + _bias_vals: list[float] = [] + + for h in 
sorted(horizons): + h_preds = horizons[h] + resolved: list[tuple[Prediction, float, bool, float, float, float]] = [] + unresolved_count = 0 + + for pred in h_preds: + ak = (str(pred.as_of.date()), h) + actual = actuals.get(ak) + if actual is None: + unresolved_count += 1 + continue + if not isinstance(pred.payload, ContinuousForecast): + continue + lower = pred.payload.quantiles.get(0.1, float("nan")) + upper = pred.payload.quantiles.get(0.9, float("nan")) + covered = lower <= actual <= upper + error = abs(pred.payload.point_forecast - actual) + bias = pred.payload.point_forecast - actual + ci_width = upper - lower + resolved.append((pred, actual, covered, error, bias, ci_width)) + + if not resolved: + lines += [ + f"## Horizon: {h} days", + "", + f"No resolved predictions (unresolved: {unresolved_count}).", + "", + ] + continue + + n = len(resolved) + coverage = sum(1 for r in resolved if r[2]) / n + mae = float(np.mean([r[3] for r in resolved])) + mean_bias = float(np.mean([r[4] for r in resolved])) + avg_ci_width = float(np.mean([r[5] for r in resolved])) + # Half-width needed for a symmetric interval to achieve 80% coverage + required_half_width = float(np.percentile([r[3] for r in resolved], 80)) + + lines += [ + f"## Horizon: {h} days", + "", + "| Metric | Value |", + "|--------|-------|", + f"| Predictions resolved | {n} |", + f"| 80% CI coverage | {coverage:.1%} (target 80%) |", + f"| Mean bias (forecast − actual) | {mean_bias:+.2f} " + f"({'over-forecasting' if mean_bias > 0 else 'under-forecasting'}) |", + f"| Mean absolute error | {mae:.2f} |", + f"| Average 80% CI width | {avg_ci_width:.2f} |", + f"| Width needed for 80% coverage | ±{required_half_width:.2f} " + f"(current half-width: ±{avg_ci_width / 2:.2f}) |", + ] + if unresolved_count: + lines.append(f"| Unresolved (skipped) | {unresolved_count} |") + lines.append("") + + # Coverage / width commentary + if coverage < _COV_LOW: + ratio = required_half_width / (avg_ci_width / 2) if avg_ci_width > 0 
else float("nan") + if mean_bias > mae * _BIAS_FRACTION: + bias_note = "intervals are also off-center (systematic over-forecast)." + elif mean_bias < -mae * _BIAS_FRACTION: + bias_note = "intervals are also off-center (systematic under-forecast)." + else: + bias_note = "point forecasts are roughly unbiased; the issue is interval width alone." + lines.append( + f"> **Coverage {coverage:.1%} is well below target.** " + f"Intervals are too narrow — they would need to be " + f"~{ratio:.1f}× wider to capture 80% of actuals. " + f"Mean bias of {mean_bias:+.2f} suggests {bias_note}" + ) + elif coverage > _COV_HIGH: + lines.append( + f"> **Coverage {coverage:.1%} is above target** — intervals may be overly conservative at this horizon." + ) + lines.append("") + + # Regime breakdown (optional) + if regime_at: + regime_buckets: dict[str, list[tuple[bool, float, float]]] = {} + for r in resolved: + pred_obj = r[0] + regime = regime_at.get(str(pred_obj.as_of.date()), "unknown") + regime_buckets.setdefault(regime, []).append((r[2], r[3], r[4])) + + regime_order = ["low", "medium", "elevated", "extreme", "unknown"] + present = [reg for reg in regime_order if reg in regime_buckets] + if len(present) > 1: + lines += [ + "**Regime breakdown:**", + "", + "| Vol regime | N | Coverage | MAE | Mean bias |", + "|-----------|---|----------|-----|-----------|", + ] + for reg in present: + bucket = regime_buckets[reg] + reg_cov = sum(1 for c, _, _ in bucket if c) / len(bucket) + reg_mae = float(np.mean([e for _, e, _ in bucket])) + reg_bias = float(np.mean([b for _, _, b in bucket])) + lines.append(f"| {reg} | {len(bucket)} | {reg_cov:.1%} | {reg_mae:.2f} | {reg_bias:+.2f} |") + lines.append("") + + # Collect values for cross-horizon narrative + _cov_vals.append(coverage) + _mae_vals.append(mae) + _bias_vals.append(mean_bias) + + bias_dir = "over" if mean_bias > 0 else "under" + horizon_summaries.append( + f"h={h}d: coverage {coverage:.1%}, MAE {mae:.2f}, " + f"bias {mean_bias:+.2f} 
({bias_dir}), " + f"CI width {avg_ci_width:.2f} (needed {required_half_width * 2:.2f})" + ) + + # ── Cross-horizon narrative ──────────────────────────────────────────────── + if len(horizon_summaries) > 1: + lines += [ + "---", + "", + "## Cross-horizon pattern summary", + "", + ] + lines += [f"- {s}" for s in horizon_summaries] + lines.append("") + + # Synthesize from values already collected in the per-horizon loop + if len(_cov_vals) >= _MIN_HORIZONS_FOR_NARRATIVE: + if _cov_vals[-1] < _cov_vals[0] - _COV_TREND_THRESHOLD: + cov_trend = "worsens" + elif _cov_vals[-1] > _cov_vals[0] + _COV_TREND_THRESHOLD: + cov_trend = "improves" + else: + cov_trend = "is roughly flat" + mae_trend = "increases" if _mae_vals[-1] > _mae_vals[0] * _MAE_TREND_THRESHOLD else "is flat" + bias_consistent = all(b > 0 for b in _bias_vals) or all(b < 0 for b in _bias_vals) + if bias_consistent: + bias_note = ( + f"Bias is **consistent in direction** across all horizons " + f"({'+' if _bias_vals[0] > 0 else '-'}), suggesting a structural " + "over/under-forecast rather than a horizon-specific issue." + ) + else: + bias_note = "Bias **changes direction** across horizons, suggesting a more complex error pattern." + lines += [ + f"Coverage **{cov_trend}** across horizons. MAE **{mae_trend}** with horizon. {bias_note}", + "", + ] + + return "\n".join(lines) + + +def load_context_documents( + context_dir: Path, + dates: list[str], +) -> list[tuple[str, str]]: + """Load pre-cached context markdown files for a list of dates. + + Files are expected to be named ``<prefix>_<date>.md`` (any prefix). + This function matches by the date suffix — any file in ``context_dir`` + whose stem ends with the date string is considered a match. Missing dates + are warned and skipped. + + Parameters + ---------- + context_dir : Path + Directory containing pre-cached context files. + dates : list[str] + ISO-8601 date strings to load (e.g. ``["2024-03-04", "2024-06-03"]``). 
+ + Returns + ------- + list[tuple[str, str]] + ``(date_str, content)`` pairs for each date that had a cached file, + sorted by date ascending. + """ + results: list[tuple[str, str]] = [] + for d in dates: + matches = sorted(context_dir.glob(f"*{d}.md")) + if not matches: + warnings.warn( + f"No cached context file found for date {d} in {context_dir}. Skipping.", + stacklevel=2, + ) + continue + if len(matches) > 1: + logger.warning("Multiple context files match date %s; using %s", d, matches[0]) + results.append((d, matches[0].read_text(encoding="utf-8"))) + + return sorted(results, key=lambda x: x[0]) + + +def build_curriculum_prompt( + report: str, + context_documents: list[tuple[str, str]], + *, + as_of: str, + preamble: str = "", +) -> str: + """Assemble a structured curriculum message for the agent. + + Combines a backtest report and any number of dated context documents into a + single prompt the agent receives as a curriculum delivery message. The + agent is expected to: + + 1. Read the backtest report and identify systematic patterns. + 2. Read the context documents to understand what information was available + at each date. + 3. Decide whether any findings meet the evidence threshold in + ``meta-learning`` and call the appropriate mutation tools. + + Parameters + ---------- + report : str + Backtest report markdown (from :func:`format_backtest_report`). + context_documents : list[tuple[str, str]] + ``(date_str, content)`` pairs from :func:`load_context_documents`. + May be empty for a statistics-only curriculum. + as_of : str + The end date of the training period. Included in the prompt header + so the agent knows the temporal scope of the curriculum. + preamble : str, optional + Domain-specific framing text prepended before the report. Use this to + orient the agent (e.g. "You are reviewing your 2024 WTI forecasting + performance to identify systematic patterns."). 
+ + Returns + ------- + str + Complete curriculum message, ready to send via + :class:`~aieng.forecasting.methods.agentic.adk_runner.AdkTextRunner`. + """ + parts: list[str] = [] + + parts.append( + f"## Curriculum delivery — training period ending {as_of}\n\n" + "This is a structured self-study session, not a prediction request. " + "Read the materials below, identify any systematic patterns in your " + "forecasting behaviour, and decide whether any findings meet the " + "evidence threshold described in your `meta-learning` skill. " + "Call mutation tools only if the evidence warrants it." + ) + + if preamble.strip(): + parts.append(f"\n{preamble.strip()}") + + parts.append(f"\n---\n\n{report}") + + if context_documents: + parts.append( + "\n---\n\n## Market context at key dates\n\n" + "The following summaries describe what market and news context was " + "available at selected dates during the training period. Use them " + "to assess whether your information-weighting approach was well-calibrated." + ) + for d, content in context_documents: + parts.append(f"\n### Context as of {d}\n\n{content.strip()}") + + parts.append( + "\n---\n\n" + "Review the materials above. If you identify a pattern meeting the " + "evidence threshold, call the appropriate tool(s) (`record_observation`, " + "`open_hypothesis`, etc.). If the evidence is insufficient, state why " + "and what additional resolutions would be needed." 
+ ) + + return "\n".join(parts) +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__forecast_tool.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__forecast_tool.py.md new file mode 100644 index 0000000..ef69185 --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__forecast_tool.py.md @@ -0,0 +1,264 @@ +# Source: aieng-forecasting/aieng/forecasting/methods/agentic/forecast_tool.py + +kind: python + +```python +"""A conventional ADK function tool that runs a forecasting model on demand. + +:class:`ForecastTool` exposes a single, rigidly-typed callable that lets an +analyst agent ask: *"show me what a statistical forecast would look like on +this series, using the data available up to this date, for these horizons."* + +Unlike the open-ended code-execution path, this tool gives the agent a fixed, +auditable interface to a pre-specified +:class:`~aieng.forecasting.evaluation.predictor.Predictor`. The agent supplies +only metadata (series id, cutoff date, horizons, frequency); the underlying +series data never passes through the LLM context window. + +The tool is constructed with a +:class:`~aieng.forecasting.data.service.DataService` and a ``Predictor`` +(dependency injection). At call time it builds a +:class:`~aieng.forecasting.data.context.ForecastContext` scoped to the requested +cutoff date and invokes the predictor against it, so the same +information-cutoff discipline used in backtests applies here. + +Scope: the tool reports continuous (numeric) forecasts. The injected predictor +must emit +:class:`~aieng.forecasting.evaluation.prediction.ContinuousForecast` payloads; +other modalities are out of scope. 
+ +This module requires the ``agentic`` extra; importing it without the extra +raises :class:`ImportError` with installation guidance. +""" + +from __future__ import annotations + +import json +from datetime import datetime + +import pandas as pd +from aieng.forecasting.data.service import DataService +from aieng.forecasting.evaluation.prediction import ContinuousForecast, Prediction +from aieng.forecasting.evaluation.predictor import Predictor +from aieng.forecasting.evaluation.task import ForecastingTask +from aieng.forecasting.methods.numerical.darts_arima import DartsAutoARIMAPredictor + + +try: + from google.adk.tools.function_tool import FunctionTool +except ModuleNotFoundError as exc: + raise ImportError( + "This module requires the 'agentic' extra. Install it with 'pip install aieng-forecasting[agentic]'." + ) from exc + + +#: Prediction-interval bounds reported by the tool, keyed by nominal coverage. +#: 95% is intentionally absent: the standard quantile grid tops out at p05/p95, +#: so the widest honest interval is 90%. Reporting a "95%" interval here would +#: require extrapolation beyond what the model actually produces. +_INTERVAL_QUANTILES: dict[str, tuple[float, float]] = { + "80%": (0.10, 0.90), + "90%": (0.05, 0.95), +} + + +class ForecastTool: + """ADK function tool that runs a forecasting predictor on a registered series. + + Wraps a :class:`~aieng.forecasting.evaluation.predictor.Predictor` behind a + rigid, JSON-native callable signature suitable for registration as a Google + ADK :class:`~google.adk.tools.FunctionTool`. The tool is general-purpose: it + forecasts any series registered in the injected + :class:`~aieng.forecasting.data.service.DataService`, selected by + ``series_id`` at call time. + + The wrapped predictor is fixed at construction time. To expose a different + method, construct a new tool with a different predictor; the predictor's + identity belongs in the tool description shown to the agent. 
+ + Parameters + ---------- + data_service : DataService + Already-populated data service. The tool reads from it but never + fetches from external APIs. Series are selected by ``series_id``. + predictor : Predictor or None, default=None + Predictor to invoke. When ``None``, a + :class:`~aieng.forecasting.methods.numerical.darts_arima.DartsAutoARIMAPredictor` + is constructed with its own defaults. To tune it (e.g. reduce + ``num_samples`` to bound agent latency), pass an explicit instance such + as ``DartsAutoARIMAPredictor(num_samples=200)``. The predictor must emit + :class:`~aieng.forecasting.evaluation.prediction.ContinuousForecast` + payloads. + + Examples + -------- + >>> from aieng.forecasting.methods.agentic import ForecastTool + >>> tool = ForecastTool(data_service=svc) + >>> function_tool = tool.as_function_tool() # register on an AgentConfig + """ + + def __init__( + self, + data_service: DataService, + *, + predictor: Predictor | None = None, + ) -> None: + self._data_service = data_service + self._predictor: Predictor = predictor or DartsAutoARIMAPredictor() + + def as_function_tool(self) -> FunctionTool: + """Wrap :meth:`run_forecast` as an ADK :class:`FunctionTool`. + + Returns + ------- + FunctionTool + Ready to append to an agent's tool list (e.g. via + ``AgentConfig.function_tools``). ADK introspects the bound method's + signature and docstring to build the tool schema. + """ + return FunctionTool(func=self.run_forecast) + + def run_forecast( + self, + series_id: str, + cutoff_date: str, + horizons: list[int], + frequency: str, + ) -> str: + """Fit a forecasting model up to a cutoff date and return its forecast. + + Runs the configured statistical predictor on the requested series using + only data available on or before ``cutoff_date``, and returns its point + forecasts and prediction intervals for each horizon. Use this to ground + your reasoning in a conventional statistical forecast before combining + it with retrieved market context. 
+ + Args: + series_id: Identifier of the registered series to forecast (e.g. + "wti_crude_oil_price"). + cutoff_date: Forecast origin / information cutoff in YYYY-MM-DD + format. Only data on or before this date is used. + horizons: Steps ahead to forecast, in units of ``frequency`` (e.g. + [1, 5, 10] for a daily/business-day series). + frequency: Pandas offset alias matching the series sampling, e.g. + "B" (business day), "D" (daily), "MS" (month start). + + Returns + ------- + A JSON string with the point forecast, 80% and 90% prediction + interval bounds, and the full quantile grid for each horizon, plus + the series description, units, and cutoff date used. + """ + try: + as_of = datetime.strptime(cutoff_date, "%Y-%m-%d") + except ValueError: + return self._error( + f"Invalid cutoff_date '{cutoff_date}'. Expected format YYYY-MM-DD.", + series_id=series_id, + cutoff_date=cutoff_date, + ) + + clean_horizons = [int(h) for h in horizons] + if not clean_horizons or any(h < 1 for h in clean_horizons): + return self._error( + "horizons must be a non-empty list of positive integers.", + series_id=series_id, + cutoff_date=cutoff_date, + ) + + try: + context = self._data_service.context(as_of) + metadata = context.get_metadata(series_id) + history = context.get_series(series_id) + except KeyError: + return self._error( + f"Series '{series_id}' is not registered. 
Available series: " + f"{', '.join(self._data_service.series_ids)}.", + series_id=series_id, + cutoff_date=cutoff_date, + ) + + if history.empty: + return self._error( + f"No observations available for '{series_id}' on or before {cutoff_date}.", + series_id=series_id, + cutoff_date=cutoff_date, + ) + + task = ForecastingTask( + task_id=f"forecast_{series_id}_{cutoff_date}", + target_series_id=series_id, + horizons=clean_horizons, + frequency=frequency, + description=f"Forecast for {series_id} as of {cutoff_date}.", + ) + + try: + predictions = self._predictor.predict(task, context) + except Exception as exc: # noqa: BLE001 - surface model failures to the agent as data + return self._error( + f"Forecast model failed: {type(exc).__name__}: {exc}", + series_id=series_id, + cutoff_date=cutoff_date, + ) + + last_row = history.iloc[-1] + result = { + "status": "ok", + "series_id": series_id, + "series_description": metadata.description, + "units": metadata.units, + "frequency": frequency, + "cutoff_date": cutoff_date, + "n_observations_at_cutoff": int(len(history)), + "last_observed": { + "date": str(pd.Timestamp(last_row["timestamp"]).date()), + "value": float(last_row["value"]), + }, + "forecasts": [ + self._format_prediction(horizon, prediction) + for horizon, prediction in zip(clean_horizons, predictions, strict=True) + ], + "notes": ( + "Point forecast is the predictive median. Intervals are derived " + "from the model's Monte Carlo quantiles. A 95% interval is not " + "reported because the standard quantile grid tops out at p05/p95 " + "(widest interval shown is 90%)." 
+ ), + } + return json.dumps(result, indent=2) + + @staticmethod + def _format_prediction(horizon: int, prediction: Prediction) -> dict[str, object]: + """Render a single :class:`Prediction` as a JSON-friendly dict.""" + payload = prediction.payload + if not isinstance(payload, ContinuousForecast): # pragma: no cover - defensive + raise TypeError(f"Expected ContinuousForecast payload, got {type(payload).__name__}.") + + quantiles = payload.quantiles + intervals = { + label: {"lower": quantiles[lo], "upper": quantiles[hi]} + for label, (lo, hi) in _INTERVAL_QUANTILES.items() + if lo in quantiles and hi in quantiles + } + return { + "horizon": horizon, + "forecast_date": str(pd.Timestamp(prediction.forecast_date).date()), + "point_forecast": payload.point_forecast, + "intervals": intervals, + "quantiles": {str(q): v for q, v in sorted(quantiles.items())}, + } + + @staticmethod + def _error(message: str, *, series_id: str, cutoff_date: str) -> str: + """Return a structured error payload the agent can read and react to.""" + return json.dumps( + { + "status": "error", + "series_id": series_id, + "cutoff_date": cutoff_date, + "error": message, + }, + indent=2, + ) +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__outputs.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__outputs.py.md new file mode 100644 index 0000000..0a39edc --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__outputs.py.md @@ -0,0 +1,634 @@ +# Source: aieng-forecasting/aieng/forecasting/methods/agentic/outputs.py + +kind: python + +```python +"""Output schemas for agentic forecasting. 
+ +This module defines the structured output contract that an ADK agent must +satisfy to be driven by +:class:`~aieng.forecasting.methods.agentic.predictor.AgentPredictor`. + +:class:`AgentForecastOutput` is the abstract base; concrete subclasses +declare their forecast modality via the ``modality`` ``ClassVar`` and +implement :meth:`AgentForecastOutput.to_predictions` to convert validated +agent JSON into evaluation +:class:`~aieng.forecasting.evaluation.prediction.Prediction` objects. + +:class:`ContinuousAgentForecastOutput` is the canonical schema for +continuous forecasting tasks; it enforces the standard quantile +grid, non-crossing quantiles, and ``point_forecast`` consistency with the +median. :class:`DiscreteAgentForecastOutput` covers binary event tasks, and +:class:`CategoricalAgentForecastOutput` covers ordered-categorical tasks +whose category set is declared on the task. +""" + +import json +from abc import ABC, abstractmethod +from datetime import datetime +from math import isclose, isfinite +from typing import Any, ClassVar, Literal + +import pandas as pd +from aieng.forecasting.data.context import ForecastContext +from aieng.forecasting.evaluation.prediction import ( + STANDARD_QUANTILES, + BinaryForecast, + CategoricalForecast, + ContinuousForecast, + Prediction, +) +from aieng.forecasting.evaluation.task import ForecastingTask +from pydantic import BaseModel, Field, field_validator, model_validator + + +class AgentForecastOutput(BaseModel, ABC): + """Base class for structured agent forecast output. + + Subclasses declare the forecast modality they produce via the + ``modality`` ``ClassVar`` and implement :meth:`to_predictions` to + convert validated agent JSON into evaluation + :class:`~aieng.forecasting.evaluation.prediction.Prediction` objects. + + Attributes + ---------- + modality : ClassVar[Literal["continuous", "discrete", "categorical"]] + Forecast modality this schema produces. 
Concrete subclasses must + set this; :class:`~aieng.forecasting.methods.agentic.predictor.AgentPredictor` + reads it to derive its ``predictor_id`` and tracing metadata. + + Notes + ----- + Subclasses must use ``model_config = {"extra": "ignore"}`` (not + ``"forbid"``) so that Pydantic does not emit ``additionalProperties: + false`` in the JSON schema — that key is rejected by the Gemini API + when the schema is used as a response constraint. All field-level + validations (types, constraints, required presence) still apply. + """ + + modality: ClassVar[Literal["continuous", "discrete", "categorical"]] + + @abstractmethod + def to_predictions( + self, + *, + task: ForecastingTask, + context: ForecastContext, + predictor_id: str, + metadata: dict[str, Any] | None = None, + ) -> list[Prediction]: + """Convert the forecast output to a list of predictions. + + Parameters + ---------- + task : ForecastingTask + The forecasting task. + context : ForecastContext + The forecast context. + predictor_id : str + The predictor ID. + metadata : dict[str, Any] | None, default=None + The metadata for the predictions. + + Returns + ------- + list[Prediction] + The list of predictions. + """ + ... + + +class AgentQuantileForecast(BaseModel): + """A single quantile forecast value emitted by an agent. + + Attributes + ---------- + quantile : float + Quantile level in the open interval ``(0, 1)``, e.g. ``0.50``. + value : float + Forecast value at this quantile level. Must be finite. + """ + + model_config = {"extra": "ignore"} + + quantile: float = Field(description="Quantile level in (0, 1), e.g. 
0.50.") + value: float = Field(description="Forecast value at this quantile level.") + + @field_validator("quantile", "value") + @classmethod + def _values_are_finite(cls, value: float) -> float: + """Reject NaN and infinite quantile levels and values.""" + if not isfinite(value): + raise ValueError("Forecast quantile levels and values must be finite numbers.") + return value + + +class ContinuousAgentHorizonForecast(BaseModel): + """Agent output for one continuous forecast horizon. + + Attributes + ---------- + horizon : int + Forecast horizon step (>= 1) corresponding to one entry of + :attr:`~aieng.forecasting.evaluation.task.ForecastingTask.horizons`. + point_forecast : float + Central forecast for this horizon. Must equal the 0.50 quantile. + quantiles : list[AgentQuantileForecast] + Forecast values at every level of + :data:`~aieng.forecasting.evaluation.prediction.STANDARD_QUANTILES`, + with no duplicates and non-decreasing values. + rationale : str + Optional horizon-specific explanation propagated to + ``Prediction.metadata["horizon_rationale"]`` when non-empty. + """ + + model_config = {"extra": "ignore"} + + horizon: int = Field(ge=1, description="Forecast horizon step from the task, e.g. 1 for one period ahead.") + point_forecast: float = Field( + description="Central forecast. This must match the 0.50 quantile to avoid contradictory output." 
+ ) + quantiles: list[AgentQuantileForecast] = Field( + description="Forecast values for every standard quantile level.", + ) + rationale: str = Field(default="", description="Optional horizon-specific explanation; omit when not needed.") + + @field_validator("point_forecast") + @classmethod + def _point_forecast_is_finite(cls, value: float) -> float: + """Reject NaN and infinite point forecasts.""" + if not isfinite(value): + raise ValueError("Point forecast must be a finite number.") + return value + + @model_validator(mode="after") + def _validate_quantiles(self) -> "ContinuousAgentHorizonForecast": + """Require the standard quantile grid and a non-crossing distribution.""" + by_level: dict[float, float] = {} + duplicates: list[float] = [] + for forecast in self.quantiles: + if forecast.quantile in by_level: + duplicates.append(forecast.quantile) + by_level[forecast.quantile] = forecast.value + + if duplicates: + raise ValueError(f"Duplicate quantile levels are not allowed: {duplicates}") + + expected = set(STANDARD_QUANTILES) + actual = set(by_level) + missing = sorted(expected - actual) + extra = sorted(actual - expected) + if missing or extra: + raise ValueError( + "Continuous agent forecasts must include exactly the standard quantiles. " + f"Missing: {missing}; extra: {extra}" + ) + + values = [by_level[q] for q in STANDARD_QUANTILES] + if any(left > right for left, right in zip(values, values[1:])): + raise ValueError("Quantile forecasts must be non-decreasing as quantile levels increase.") + + median = by_level[0.50] + if not isclose(self.point_forecast, median, rel_tol=1e-9, abs_tol=1e-9): + raise ValueError("point_forecast must match the 0.50 quantile.") + + return self + + def quantile_dict(self) -> dict[float, float]: + """Return quantiles as the evaluation payload mapping. 
+ + Returns + ------- + dict[float, float] + Mapping from each quantile level in + :data:`~aieng.forecasting.evaluation.prediction.STANDARD_QUANTILES` + to its forecast value, in standard-quantile order. + """ + by_level = {forecast.quantile: forecast.value for forecast in self.quantiles} + return {q: by_level[q] for q in STANDARD_QUANTILES} + + +class ContinuousAgentForecastOutput(AgentForecastOutput): + """Canonical agent output for continuous forecasting tasks. + + The agent supplies only forecast values and optional explanatory metadata. + Task-owned fields such as ``task_id``, ``as_of``, and ``forecast_date`` are + derived during conversion so the output cannot drift from the evaluation + contract. + + Attributes + ---------- + forecasts : list[ContinuousAgentHorizonForecast] + One forecast per requested task horizon. Horizon values must be + unique; :meth:`to_predictions` additionally requires the set of + horizons to match ``task.horizons`` exactly. + rationale : str + Optional overall explanation propagated to + ``Prediction.metadata["rationale"]`` when non-empty. + + Examples + -------- + Validating an agent JSON response and converting it to predictions: + + >>> output = ContinuousAgentForecastOutput.model_validate_json( + ... raw_json, + ... ) + >>> predictions = output.to_predictions( + ... task=task, + ... context=context, + ... predictor_id="my_predictor", + ... ) + """ + + modality: ClassVar[Literal["continuous", "discrete", "categorical"]] = "continuous" + + model_config = {"extra": "ignore"} + + forecasts: list[ContinuousAgentHorizonForecast] = Field( + description="One forecast object for each requested task horizon.", + ) + rationale: str = Field( + default="", description="Optional overall explanation for the forecast; omit when not needed." 
+ ) + + @model_validator(mode="after") + def _forecast_horizons_are_unique(self) -> "ContinuousAgentForecastOutput": + """Reject empty or duplicate horizon forecasts before task-level conversion.""" + if not self.forecasts: + raise ValueError("forecasts must contain at least one horizon forecast.") + seen: set[int] = set() + duplicates: list[int] = [] + for forecast in self.forecasts: + if forecast.horizon in seen: + duplicates.append(forecast.horizon) + seen.add(forecast.horizon) + + if duplicates: + raise ValueError(f"Duplicate forecast horizons are not allowed: {duplicates}") + return self + + @classmethod + def prompt_schema_json(cls) -> str: + """Return a JSON template for use in agent instruction strings. + + The quantile list is derived from :data:`STANDARD_QUANTILES` so the + template stays in sync automatically when the standard grid changes. + Use this in agent instructions instead of a hardcoded JSON block. + + Returns + ------- + str + Indented JSON string showing the exact structure the agent must + pass to ``set_model_response``. + """ + quantile_entries = [{"quantile": float(q), "value": ""} for q in STANDARD_QUANTILES] + template: dict[str, object] = { + "forecasts": [ + { + "horizon": "", + "point_forecast": "", + "quantiles": quantile_entries, + "rationale": "", + } + ], + "rationale": "", + } + return json.dumps(template, indent=2) + + def to_predictions( + self, + *, + task: ForecastingTask, + context: ForecastContext, + predictor_id: str, + metadata: dict[str, Any] | None = None, + ) -> list[Prediction]: + """Convert agent output to evaluation ``Prediction`` objects. + + Parameters + ---------- + task : ForecastingTask + Source task. The set of forecast horizons in ``self.forecasts`` + must match ``task.horizons`` exactly. + context : ForecastContext + Forecast context whose ``as_of`` anchors each prediction's + ``forecast_date`` via ``task.frequency`` arithmetic. + predictor_id : str + Identifier of the predictor that produced this output. 
+ metadata : dict, optional + Extra metadata merged into every generated ``Prediction.metadata``. + ``rationale`` keys are written after this merge and cannot be + overridden here. + + Returns + ------- + list[Prediction] + One :class:`~aieng.forecasting.evaluation.prediction.Prediction` + per ``task.horizons`` entry, in task-horizon order. + + Raises + ------ + ValueError + If the horizons in ``self.forecasts`` do not match ``task.horizons``. + """ + by_horizon = {forecast.horizon: forecast for forecast in self.forecasts} + expected = set(task.horizons) + actual = set(by_horizon) + missing = sorted(expected - actual) + extra = sorted(actual - expected) + if missing or extra: + raise ValueError( + f"Continuous agent output must contain exactly the task horizons. Missing: {missing}; extra: {extra}" + ) + + issued_at = datetime.utcnow() # naive UTC; Prediction.issued_at expects timezone-naive + offset = pd.tseries.frequencies.to_offset(task.frequency) + base_metadata: dict[str, Any] = dict(metadata) if metadata is not None else {} + if self.rationale.strip(): + base_metadata["rationale"] = self.rationale + + predictions: list[Prediction] = [] + for horizon in task.horizons: + forecast = by_horizon[horizon] + prediction_metadata = dict(base_metadata) + if forecast.rationale.strip(): + prediction_metadata["horizon_rationale"] = forecast.rationale + + quantiles = forecast.quantile_dict() + predictions.append( + Prediction( + predictor_id=predictor_id, + task_id=task.task_id, + issued_at=issued_at, + as_of=context.as_of, + forecast_date=(pd.Timestamp(context.as_of) + offset * horizon).to_pydatetime(), + payload=ContinuousForecast( + point_forecast=forecast.point_forecast, + quantiles=quantiles, + ), + metadata=prediction_metadata, + ) + ) + + return predictions + + +class DiscreteAgentForecastOutput(AgentForecastOutput): + """Agent output for binary / discrete-event forecasting tasks. 
+ + Attributes + ---------- + probability : float + Predicted probability the event resolves True, in ``[0, 1]``. + reasoning : str + Optional explanation propagated to ``Prediction.metadata``. + direction_bias : str + Optional directional label (``up``, ``down``, ``neutral``). + key_signals : list[str] + Optional list of supporting signals for the forecast. + confidence : str + Optional self-reported confidence label. + """ + + modality: ClassVar[Literal["continuous", "discrete", "categorical"]] = "discrete" + + model_config = {"extra": "ignore"} + + probability: float = Field(ge=0.0, le=1.0, description="Predicted probability the event occurs.") + reasoning: str = Field(default="", description="Optional explanation for the probability estimate.") + direction_bias: str = Field(default="", description="Optional directional label: up, down, or neutral.") + key_signals: list[str] = Field(default_factory=list, description="Key signals supporting the estimate.") + confidence: str = Field(default="", description="Optional self-reported confidence: high, medium, or low.") + + @classmethod + def prompt_schema_json(cls) -> str: + """Return a JSON template for use in agent instruction strings. + + Returns + ------- + str + Indented JSON string showing the exact structure the agent must + pass to ``set_model_response``. 
+ """ + template: dict[str, object] = { + "probability": "", + "direction_bias": "<'up' | 'down' | 'neutral'>", + "reasoning": "", + "key_signals": ["", ""], + "confidence": "<'high' | 'medium' | 'low'>", + } + return json.dumps(template, indent=2) + + def to_predictions( + self, + *, + task: ForecastingTask, + context: ForecastContext, + predictor_id: str, + metadata: dict[str, Any] | None = None, + ) -> list[Prediction]: + """Convert agent output to a single binary :class:`Prediction`.""" + if len(task.horizons) != 1: + raise ValueError("Discrete agent output expects exactly one task horizon.") + + horizon = task.horizons[0] + issued_at = datetime.utcnow() + offset = pd.tseries.frequencies.to_offset(task.frequency) + prediction_metadata: dict[str, Any] = dict(metadata) if metadata is not None else {} + if self.reasoning.strip(): + prediction_metadata["rationale"] = self.reasoning + if self.direction_bias.strip(): + prediction_metadata["direction_bias"] = self.direction_bias + if self.key_signals: + prediction_metadata["key_signals"] = list(self.key_signals) + if self.confidence.strip(): + prediction_metadata["confidence"] = self.confidence + + return [ + Prediction( + predictor_id=predictor_id, + task_id=task.task_id, + issued_at=issued_at, + as_of=context.as_of, + forecast_date=(pd.Timestamp(context.as_of) + offset * horizon).to_pydatetime(), + payload=BinaryForecast(probability=self.probability), + metadata=prediction_metadata, + ) + ] + + +#: Maximum allowed |sum - 1| before a categorical agent distribution is +#: rejected instead of renormalized in ``to_predictions``. +CATEGORICAL_RENORMALIZATION_TOLERANCE: float = 0.05 + + +class AgentCategoryProbability(BaseModel): + """One (label, probability) row of a categorical agent forecast. + + Attributes + ---------- + label : str + Category label. Must match one of the task's declared category + labels; checked during :meth:`CategoricalAgentForecastOutput.to_predictions`. 
+ probability : float + Predicted probability of this category, in ``[0, 1]``. + """ + + model_config = {"extra": "ignore"} + + label: str = Field(min_length=1, description="Category label from the task's declared category set.") + probability: float = Field(ge=0.0, le=1.0, description="Predicted probability of this category.") + + +class CategoricalAgentForecastOutput(AgentForecastOutput): + """Agent output for ordered-categorical forecasting tasks. + + The agent supplies one probability per category label plus optional + explanatory metadata. The category order, label set, and series-value + mapping live on the task (``task.categories``); :meth:`to_predictions` + validates the agent's labels against that declaration, so the schema + itself stays task-agnostic. + + Schema validation enforces per-row constraints only. Cross-row + constraints (exact label-set match, probabilities summing to 1) are + enforced in :meth:`to_predictions`, where the task is available. Sums + within :data:`CATEGORICAL_RENORMALIZATION_TOLERANCE` of 1 are + renormalized — LLMs routinely emit 0.99 totals — with the raw sum + recorded in ``Prediction.metadata["probability_sum_raw"]``; sums further + off raise. + + Attributes + ---------- + probabilities : list[AgentCategoryProbability] + One probability per category label, in any order. + reasoning : str + Optional explanation propagated to ``Prediction.metadata``. + key_signals : list[str] + Optional list of supporting signals for the forecast. + confidence : str + Optional self-reported confidence label. + """ + + modality: ClassVar[Literal["continuous", "discrete", "categorical"]] = "categorical" + + model_config = {"extra": "ignore"} + + probabilities: list[AgentCategoryProbability] = Field( + description="One {label, probability} entry per task category." 
+ ) + reasoning: str = Field(default="", description="Optional explanation for the distribution.") + key_signals: list[str] = Field(default_factory=list, description="Key signals supporting the estimate.") + confidence: str = Field(default="", description="Optional self-reported confidence: high, medium, or low.") + + @model_validator(mode="after") + def _labels_are_unique(self) -> "CategoricalAgentForecastOutput": + """Reject empty distributions and duplicate labels before conversion.""" + if not self.probabilities: + raise ValueError("probabilities must contain at least one category entry.") + labels = [row.label for row in self.probabilities] + duplicates = sorted({label for label in labels if labels.count(label) > 1}) + if duplicates: + raise ValueError(f"Duplicate category labels are not allowed: {duplicates}") + return self + + @classmethod + def prompt_schema_json(cls, labels: list[str] | None = None) -> str: + """Return a JSON template for use in agent instruction strings. + + Parameters + ---------- + labels : list[str], optional + Category labels to render in the template, in task order. When + given, the template shows one concrete entry per label; otherwise + it shows generic placeholders. + + Returns + ------- + str + Indented JSON string showing the exact structure the agent must + pass to ``set_model_response``. + """ + if labels: + entries: list[dict[str, object]] = [ + {"label": label, "probability": ""} for label in labels + ] + else: + entries = [{"label": "", "probability": ""}] + template: dict[str, object] = { + "probabilities": entries, + "reasoning": "", + "key_signals": ["", ""], + "confidence": "<'high' | 'medium' | 'low'>", + } + return json.dumps(template, indent=2) + + def to_predictions( + self, + *, + task: ForecastingTask, + context: ForecastContext, + predictor_id: str, + metadata: dict[str, Any] | None = None, + ) -> list[Prediction]: + """Convert agent output to a single categorical :class:`Prediction`. 
+ + Raises + ------ + ValueError + If the task is not a single-horizon categorical task, if the + output labels do not exactly match ``task.categories``, or if the + probabilities sum outside + ``1 +/- CATEGORICAL_RENORMALIZATION_TOLERANCE``. + """ + if task.payload_type != "categorical" or task.categories is None: + raise ValueError( + f"Categorical agent output requires a categorical task with declared categories; " + f"task '{task.task_id}' declares payload_type='{task.payload_type}'." + ) + if len(task.horizons) != 1: + raise ValueError("Categorical agent output expects exactly one task horizon.") + + by_label = {row.label: row.probability for row in self.probabilities} + expected = {category.label for category in task.categories} + actual = set(by_label) + if actual != expected: + missing = sorted(expected - actual) + extra = sorted(actual - expected) + raise ValueError( + f"Categorical agent output must contain exactly the task category labels. " + f"Missing: {missing}; extra: {extra}." + ) + + raw_sum = sum(by_label.values()) + if abs(raw_sum - 1.0) > CATEGORICAL_RENORMALIZATION_TOLERANCE or raw_sum <= 0.0: + raise ValueError( + f"Categorical agent probabilities sum to {raw_sum}, outside the renormalization " + f"tolerance of 1 +/- {CATEGORICAL_RENORMALIZATION_TOLERANCE}." 
+ ) + probabilities = {category.label: by_label[category.label] / raw_sum for category in task.categories} + + horizon = task.horizons[0] + issued_at = datetime.utcnow() # naive UTC; Prediction.issued_at expects timezone-naive + offset = pd.tseries.frequencies.to_offset(task.frequency) + prediction_metadata: dict[str, Any] = dict(metadata) if metadata is not None else {} + if self.reasoning.strip(): + prediction_metadata["rationale"] = self.reasoning + if self.key_signals: + prediction_metadata["key_signals"] = list(self.key_signals) + if self.confidence.strip(): + prediction_metadata["confidence"] = self.confidence + if not isclose(raw_sum, 1.0, abs_tol=1e-9): + prediction_metadata["probability_sum_raw"] = raw_sum + + return [ + Prediction( + predictor_id=predictor_id, + task_id=task.task_id, + issued_at=issued_at, + as_of=context.as_of, + forecast_date=(pd.Timestamp(context.as_of) + offset * horizon).to_pydatetime(), + payload=CategoricalForecast(probabilities=probabilities), + metadata=prediction_metadata, + ) + ] +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__predictor.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__predictor.py.md new file mode 100644 index 0000000..86401bc --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__agentic__predictor.py.md @@ -0,0 +1,323 @@ +# Source: aieng-forecasting/aieng/forecasting/methods/agentic/predictor.py + +kind: python + +```python +"""Predictor that uses an ADK agent for forecasting. 
+ +This module provides :class:`AgentPredictor`, the agentic +:class:`~aieng.forecasting.evaluation.predictor.Predictor` that drives an +ADK agent through an +:class:`~aieng.forecasting.methods.agentic.adk_runner.AdkTextRunner`, +parses the agent's structured JSON response against an +:class:`~aieng.forecasting.methods.agentic.outputs.AgentForecastOutput` +schema, and converts it into evaluation +:class:`~aieng.forecasting.evaluation.prediction.Prediction` objects. + +It also defines the :class:`ForecastPromptBuilder` ``Protocol`` that +task-specific prompt builders must satisfy. + +This module requires the ``agentic`` extra; importing it without the extra +raises :class:`ImportError`. +""" + +import asyncio +import json +import logging +import threading +from collections.abc import Coroutine +from typing import Any, Protocol, TypeVar, cast + +from aieng.forecasting.data.context import ForecastContext +from aieng.forecasting.evaluation.langfuse_traces import stamp_forecast_on_trace +from aieng.forecasting.evaluation.prediction import Prediction +from aieng.forecasting.evaluation.predictor import Predictor +from aieng.forecasting.evaluation.task import ForecastingTask +from aieng.forecasting.methods.agentic.adk_runner import AdkTextRunner, AdkTextRunnerConfig +from aieng.forecasting.methods.agentic.agent_factory import AgentConfig, build_adk_agent +from aieng.forecasting.methods.agentic.outputs import AgentForecastOutput +from aieng.forecasting.methods.llm_processes._client import strip_markdown_fence, trace_url_for +from google.adk.agents.base_agent import BaseAgent +from pydantic import ValidationError + + +logger: logging.Logger = logging.getLogger(__name__) +T = TypeVar("T") + + +def _run_coroutine_sync(coro: Coroutine[Any, Any, T]) -> T: + """Run an async coroutine from the sync ``Predictor`` interface. + + If no event loop is running on the current thread, the coroutine is + executed via :func:`asyncio.run`. If a loop is already running (e.g. 
+ inside a Jupyter notebook), the coroutine is executed on a fresh loop + in a daemon thread so the caller's loop is not disturbed. + """ + try: + asyncio.get_running_loop() + except RuntimeError: + return asyncio.run(coro) + + result: T | None = None + error: BaseException | None = None + + def run_in_thread() -> None: + nonlocal error, result + loop = asyncio.new_event_loop() + try: + asyncio.set_event_loop(loop) + result = loop.run_until_complete(coro) + except BaseException as exc: # pragma: no cover - defensive thread boundary + error = exc + finally: + # Cancel and drain any background tasks (e.g. LiteLLM's LoggingWorker) + # before closing the loop. Without this, Python emits + # "Task was destroyed but it is pending!" warnings for every run. + try: + pending = asyncio.all_tasks(loop) + if pending: + for task in pending: + task.cancel() + loop.run_until_complete(asyncio.gather(*pending, return_exceptions=True)) + except Exception: + pass + finally: + loop.close() + + thread = threading.Thread(target=run_in_thread, daemon=True) + thread.start() + thread.join() + if error is not None: + raise error + return cast("T", result) + + +class ForecastPromptBuilder(Protocol): + """Protocol for building prompts for forecasting agents. + + This is used to build the prompt that will be used to invoke the ADK agent + for forecasting. + """ + + def __call__(self, *, task: ForecastingTask, context: ForecastContext) -> str: + """Build the prompt for the forecasting agent. + + Parameters + ---------- + task : ForecastingTask + Defines the prediction problem — target series, horizon(s), + frequency, and resolution logic. The predictor must not modify + the task. + context : ForecastContext + The information state available at forecast time. All calls to + ``context.get_series()`` are automatically filtered to + ``context.as_of`` — the predictor cannot accidentally access + future data from the series store. + + Returns + ------- + str + The prompt for the forecasting agent. 
+ """ + ... + + +class AgentPredictor(Predictor): + """Predictor that drives an ADK agent to produce forecasts. + + On each :meth:`predict` call, the predictor: + + 1. Builds a prompt with ``prompt_builder(task=task, context=context)``. + 2. Runs the prompt through the ADK runner (synchronously, even from + inside a running event loop). + 3. Validates the agent's JSON response against ``output_schema``. + 4. Converts the validated output to a list of + :class:`~aieng.forecasting.evaluation.prediction.Prediction` via + :meth:`AgentForecastOutput.to_predictions`. + + Conversion errors are logged and surfaced as an empty prediction list + so a single bad agent response does not abort a backtest loop. Schema + validation errors are *not* swallowed. + + The ``output_schema`` is separate from ``agent_config`` by design: + ``AgentConfig`` captures the agent's *identity* (instruction, model, + skills), while ``output_schema`` declares the agent's *role* in a + specific experiment. The same config can be used to build a free-form + interactive analyst (via :func:`build_adk_agent` with no schema) or + wired into different predictors with different output contracts. + + Parameters + ---------- + agent_config : AgentConfig + Configuration for the underlying ADK agent — instruction, model, + skills, and capability toggles. The output format is *not* part + of the agent config; it is declared via ``output_schema``. + prompt_builder : ForecastPromptBuilder + Callable that produces the prompt text for one ``(task, context)`` + pair. See :class:`ForecastPromptBuilder` for the contract. + output_schema : type[AgentForecastOutput] + Structured output schema the agent must satisfy. The forecast + modality is derived from ``output_schema.modality``. Supplied at + predictor instantiation time so the same agent config can be reused + with different schemas or in interactive (schema-free) mode. 
+ enable_langfuse_tracing : bool, optional + Whether to wrap each turn in Langfuse ``propagate_attributes``. + ``None`` (default) auto-detects: enabled when the ``langfuse`` + package is importable, disabled otherwise. Ignored when ``runner`` + is supplied — the supplied runner's tracing config takes precedence. + runner : AdkTextRunner, optional + Custom runner to use. When ``None`` (default), the predictor + builds its own ADK agent and runner from ``agent_config``. Supply + a runner for tests (with a stub agent) or to share one runner + across predictors. + + Examples + -------- + >>> from aieng.forecasting.methods.agentic import ( + ... AgentConfig, + ... AgentPredictor, + ... ContinuousAgentForecastOutput, + ... ) + >>> predictor = AgentPredictor( + ... AgentConfig(instruction="Forecast the supplied series."), + ... my_prompt_builder, + ... output_schema=ContinuousAgentForecastOutput, + ... ) + >>> predictions = predictor.predict(task, context) + """ + + def __init__( + self, + agent_config: AgentConfig, + prompt_builder: ForecastPromptBuilder, + *, + output_schema: type[AgentForecastOutput], + enable_langfuse_tracing: bool | None = None, + runner: AdkTextRunner | None = None, + ) -> None: + """Store the schema, derive the modality, and build or accept a runner.""" + if enable_langfuse_tracing is None: + # Auto-detect: enable Langfuse tracing iff the package is importable. 
+ try: + import langfuse # noqa: F401, PLC0415 + + enable_langfuse_tracing = True + except ModuleNotFoundError: + enable_langfuse_tracing = False + + self.prompt_builder = prompt_builder + self.agent_config = agent_config + self.output_schema: type[AgentForecastOutput] = output_schema + self.enable_langfuse_tracing = enable_langfuse_tracing + + self._forecast_output_modality = output_schema.modality + + if runner is None: + built_agent = build_adk_agent(agent_config, output_schema=output_schema) + self._agent: BaseAgent = built_agent + self._runner = AdkTextRunner( + agent=built_agent, + config=AdkTextRunnerConfig( + app_name="agentic_forecasting_predictor", + default_user_id="forecasting_agent", + fresh_session_per_message=True, + enable_langfuse_tracing=self.enable_langfuse_tracing, + langfuse_tags=["agent_predictor", "track1"], + langfuse_trace_name=self.predictor_id, + langfuse_propagate_metadata={ + "predictor_id": self.predictor_id, + "agent_name": built_agent.name, + "model": str(built_agent.model), + "output_modality": self._forecast_output_modality, + }, + ), + ) + else: + self._runner = runner + self._agent = runner.agent + + @property + def predictor_id(self) -> str: + """Stable identifier for this predictor. + + This is used to identify the predictor in the evaluation results. + """ + model = getattr(self._agent, "model", None) + model_suffix = f"_{model}" if isinstance(model, str) else "" + return f"agent_predictor_{self._agent.name}{model_suffix}_{self._forecast_output_modality}" + + def predict(self, task: ForecastingTask, context: ForecastContext) -> list[Prediction]: + """Produce probabilistic forecasts for the given task and context. + + Parameters + ---------- + task : ForecastingTask + Defines the prediction problem — target series, horizon(s), + frequency, and resolution logic. The predictor must not modify + the task. + context : ForecastContext + The information state available at forecast time. 
All calls to + ``context.get_series()`` are automatically filtered to + ``context.as_of`` — the predictor cannot accidentally access + future data from the series store. + + Returns + ------- + list[Prediction] + One ``Prediction`` per horizon step in ``task.horizons``, each + with ``as_of = context.as_of`` and ``forecast_date`` set to the + corresponding step ahead of the origin. An empty list is + returned when the agent's structured output cannot be + converted to predictions (the error is logged); schema + validation errors on the agent's JSON are not swallowed. + """ + prompt = self.prompt_builder(task=task, context=context) + output_str = _run_coroutine_sync(self._runner.run_text_async(prompt)) + + # Normalise: strip markdown fences before validation so any model can + # be swapped in without breaking the parse layer. + output_str = strip_markdown_fence(output_str) + + # Validate the output against the output schema; tolerate JSON + # responses that ``model_validate_json`` cannot parse but + # ``json.loads`` + ``model_validate`` can. + try: + output = self.output_schema.model_validate_json(output_str) + except ValidationError: + try: + output = self.output_schema.model_validate(json.loads(output_str)) + except Exception: + logger.warning("Raw agent response (schema validation failed):\n%s", output_str) + raise + + # Convert output to list of predictions + try: + predictions = output.to_predictions( + task=task, + context=context, + predictor_id=self.predictor_id, + ) + except Exception as e: + # Log the error and return an empty list of predictions + logger.error("Error converting output to list of predictions: %s", e) + return [] + + # Link each prediction back to its Langfuse trace so side-channel + # evaluators can attach scores. The agent runs on a worker event loop whose + # trace context isn't active here, so use the id the runner captured during + # the run (not the current context, which is empty on this thread). 
+ trace_id = self._runner.last_trace_id + if trace_id is not None: + trace_url = trace_url_for(trace_id) + for prediction in predictions: + prediction.metadata.setdefault("langfuse_trace_id", trace_id) + if trace_url is not None: + prediction.metadata.setdefault("langfuse_trace_url", trace_url) + + # Make the trace the canonical record for rationale evaluation: stamp the + # structured forecast onto that trace (post-hoc, by id) so the evaluator + # reads the rationale + distribution from Langfuse, not from a cached run. + stamp_forecast_on_trace(predictions, trace_id=trace_id) + + return predictions +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__baselines____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__baselines____init__.py.md new file mode 100644 index 0000000..289e10f --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__baselines____init__.py.md @@ -0,0 +1,18 @@ +# Source: aieng-forecasting/aieng/forecasting/methods/baselines/__init__.py + +kind: python + +```python +"""Baseline predictor implementations. + +Baselines provide fast, low-dependency reference points that every more complex +predictor should be compared against. 
+""" + +from .categorical_frequency import CategoricalFrequencyPredictor +from .historical_frequency import HistoricalFrequencyPredictor +from .naive import LastValuePredictor + + +__all__ = ["CategoricalFrequencyPredictor", "HistoricalFrequencyPredictor", "LastValuePredictor"] +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__baselines__categorical_frequency.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__baselines__categorical_frequency.py.md new file mode 100644 index 0000000..75a6f81 --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__baselines__categorical_frequency.py.md @@ -0,0 +1,136 @@ +# Source: aieng-forecasting/aieng/forecasting/methods/baselines/categorical_frequency.py + +kind: python + +```python +"""Categorical-frequency predictor — the floor baseline for ordinal tasks. + +``CategoricalFrequencyPredictor`` predicts each ordered category with the +probability it has occurred historically (the climatological category +distribution). It is the categorical counterpart of +:class:`~aieng.forecasting.methods.baselines.historical_frequency.HistoricalFrequencyPredictor`: +zero modelling, pure persistence of the empirical distribution. + +Unseen categories receive probability 0. Run this first on any new +ordered-categorical task; every conditioned model should beat this floor +baseline on RPS. 
+ +Usage:: + + from aieng.forecasting.methods import CategoricalFrequencyPredictor + from aieng.forecasting.evaluation import backtest, BacktestSpec + + predictor = CategoricalFrequencyPredictor() + result = backtest(predictor=predictor, spec=spec, data_service=svc) + print(f"Category-frequency mean RPS: {result.mean_score:.4f}") # must be beaten +""" + +from __future__ import annotations + +import math +from datetime import datetime, timezone + +import pandas as pd +from aieng.forecasting.data.context import ForecastContext +from aieng.forecasting.evaluation.prediction import CategoricalForecast, Prediction +from aieng.forecasting.evaluation.predictor import Predictor +from aieng.forecasting.evaluation.task import ForecastingTask, TaskCategory + + +class CategoricalFrequencyPredictor(Predictor): + """Categorical baseline: forecast the empirical category frequencies. + + The target series must store one value per resolution opportunity, with + every observed value matching one of ``task.categories``. The predicted + probabilities are raw empirical frequencies from the cutoff-filtered + history, optionally restricted to a trailing window. There is no smoothing: + categories absent from the history receive probability 0. + + Parameters + ---------- + window : int or None + If set, only the last ``window`` observations are used to compute the + category frequencies, making the baseline responsive to slow regime + change. ``None`` uses the full history. 
+ """ + + def __init__(self, window: int | None = None) -> None: + if window is not None and window < 1: + raise ValueError(f"window must be a positive integer or None; got {window}") + self._window = window + + @property + def predictor_id(self) -> str: + """Return a stable identifier for this predictor.""" + if self._window is not None: + return f"categorical_frequency_w{self._window}" + return "categorical_frequency" + + def predict(self, task: ForecastingTask, context: ForecastContext) -> list[Prediction]: + """Produce category-frequency forecasts for the task's single horizon. + + Raises + ------ + ValueError + If the task does not declare ``payload_type='categorical'``, if it + has more than one horizon, if the cutoff-filtered history is empty, + or if any observed value does not match a task category value. + """ + if task.payload_type != "categorical": + raise ValueError( + f"{type(self).__name__} requires a categorical task (payload_type='categorical'); " + f"task '{task.task_id}' declares payload_type='{task.payload_type}'." 
+ ) + if len(task.horizons) != 1: + raise ValueError(f"{type(self).__name__} requires exactly one horizon; got {task.horizons}.") + if task.categories is None: + raise ValueError(f"Categorical task '{task.task_id}' must define categories.") + + series_df = context.get_series(task.target_series_id) + if series_df.empty: + raise ValueError(f"History for '{task.target_series_id}' is empty at as_of={context.as_of}.") + + values = series_df["value"].astype(float) + if self._window is not None: + values = values.tail(self._window) + if values.empty: + raise ValueError(f"History for '{task.target_series_id}' is empty after applying window={self._window}.") + + counts = {category.label: 0 for category in task.categories} + for observed in values: + category = _matching_category(float(observed), task.categories) + if category is None: + allowed = [category.value for category in task.categories] + raise ValueError( + f"Target series '{task.target_series_id}' contains value {float(observed)} that does not " + f"match any task category value. Allowed values: {allowed}." 
+ ) + counts[category.label] += 1 + + n_observations = int(len(values)) + probabilities = {label: count / n_observations for label, count in counts.items()} + payload = CategoricalForecast(probabilities=probabilities) + offset = pd.tseries.frequencies.to_offset(task.frequency) + issued_at = datetime.now(tz=timezone.utc).replace(tzinfo=None) + horizon = task.horizons[0] + + return [ + Prediction( + predictor_id=self.predictor_id, + task_id=task.task_id, + issued_at=issued_at, + as_of=context.as_of, + forecast_date=(pd.Timestamp(context.as_of) + offset * horizon).to_pydatetime(), + payload=payload, + metadata={"n_observations": n_observations, "window": self._window}, + ) + ] + + +def _matching_category(value: float, categories: list[TaskCategory]) -> TaskCategory | None: + """Return the task category whose series value matches ``value``.""" + for category in categories: + if math.isclose(value, category.value, abs_tol=1e-9): + return category + return None +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__baselines__historical_frequency.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__baselines__historical_frequency.py.md new file mode 100644 index 0000000..9d63ae5 --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__baselines__historical_frequency.py.md @@ -0,0 +1,113 @@ +# Source: aieng-forecasting/aieng/forecasting/methods/baselines/historical_frequency.py + +kind: python + +```python +"""Historical-frequency predictor — the floor baseline for binary-event tasks. + +``HistoricalFrequencyPredictor`` predicts that a binary event occurs with the +probability it has occurred historically (the climatological base rate). 
It is +the binary counterpart of +:class:`~aieng.forecasting.methods.baselines.naive.LastValuePredictor`: zero +modelling, pure persistence of the empirical distribution. + +A constant base-rate forecast is surprisingly hard to beat on Brier score for +rare or regime-driven events — any model that reacts to conditions must react +*correctly* to win. Run this first on any new binary task; every other +predictor should beat it. + +Usage:: + + from aieng.forecasting.methods import HistoricalFrequencyPredictor + from aieng.forecasting.evaluation import backtest, BacktestSpec + + predictor = HistoricalFrequencyPredictor() + result = backtest(predictor=predictor, spec=spec, data_service=svc) + print(f"Base-rate mean Brier: {result.mean_score:.4f}") # must be beaten +""" + +from __future__ import annotations + +from datetime import datetime, timezone + +import pandas as pd +from aieng.forecasting.data.context import ForecastContext +from aieng.forecasting.evaluation.prediction import BinaryForecast, Prediction +from aieng.forecasting.evaluation.predictor import Predictor +from aieng.forecasting.evaluation.task import ForecastingTask + + +class HistoricalFrequencyPredictor(Predictor): + """Binary baseline: forecast the empirical event frequency as the probability. + + The target series must be a 0/1 event series (one row per resolution + opportunity, e.g. one row per central-bank meeting). The predicted + probability is the mean of the cutoff-filtered history, optionally + restricted to a trailing window. + + Parameters + ---------- + window : int or None + If set, only the last ``window`` observations are used to compute the + base rate, making the baseline responsive to slow regime change + (e.g. "share of cuts in the last 16 meetings" rather than all-time). + ``None`` uses the full history. 
+ """ + + def __init__(self, window: int | None = None) -> None: + if window is not None and window < 1: + raise ValueError(f"window must be a positive integer or None; got {window}") + self._window = window + + @property + def predictor_id(self) -> str: + """Return a stable identifier for this predictor.""" + if self._window is not None: + return f"historical_frequency_w{self._window}" + return "historical_frequency" + + def predict(self, task: ForecastingTask, context: ForecastContext) -> list[Prediction]: + """Produce base-rate probability forecasts for every horizon in the task. + + Raises + ------ + ValueError + If the task does not declare ``payload_type='binary'``, or if the + cutoff-filtered history is empty or contains non-0/1 values. + """ + if task.payload_type != "binary": + raise ValueError( + f"{type(self).__name__} requires a binary task (payload_type='binary'); " + f"task '{task.task_id}' declares payload_type='{task.payload_type}'." + ) + + series_df = context.get_series(task.target_series_id) + if series_df.empty: + raise ValueError(f"History for '{task.target_series_id}' is empty at as_of={context.as_of}.") + + values = series_df["value"].astype(float) + if not values.isin([0.0, 1.0]).all(): + bad = sorted(set(values[~values.isin([0.0, 1.0])])) + raise ValueError(f"Target series '{task.target_series_id}' must be a 0/1 event series; found values {bad}.") + + if self._window is not None: + values = values.tail(self._window) + base_rate = float(values.mean()) + + payload = BinaryForecast(probability=base_rate) + offset = pd.tseries.frequencies.to_offset(task.frequency) + issued_at = datetime.now(tz=timezone.utc).replace(tzinfo=None) + + return [ + Prediction( + predictor_id=self.predictor_id, + task_id=task.task_id, + issued_at=issued_at, + as_of=context.as_of, + forecast_date=(pd.Timestamp(context.as_of) + offset * h).to_pydatetime(), + payload=payload, + metadata={"n_observations": int(len(values)), "window": self._window}, + ) + for h in 
task.horizons + ] +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__baselines__naive.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__baselines__naive.py.md new file mode 100644 index 0000000..5a03df9 --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__baselines__naive.py.md @@ -0,0 +1,135 @@ +# Source: aieng-forecasting/aieng/forecasting/methods/baselines/naive.py + +kind: python + +```python +"""Naive last-value predictor — the floor baseline for any continuous forecasting task. + +``LastValuePredictor`` predicts that the next observation will equal the most +recently observed value, with no uncertainty spread (all quantiles equal the +point forecast). It is task-agnostic and applies to any ``ForecastingTask`` +with a continuous series target. + +Use this as: + +1. **A performance floor.** Run it first on any new task. Every other predictor + should beat it. If yours doesn't, something is wrong with your model. + +2. **A readable reference implementation.** The code is annotated step-by-step + to show exactly how to satisfy the ``Predictor`` ABC — what fields are + required, how to compute ``forecast_date``, and how to construct a + ``Prediction``. Copy the structure and replace the forecast logic. 
+ +Usage:: + + from aieng.forecasting.methods.baselines import LastValuePredictor + from aieng.forecasting.evaluation import backtest, BacktestSpec + + result = backtest(predictor=LastValuePredictor(), spec=spec, data_service=svc) + print(f"Naive mean CRPS: {result.mean_score:.4f}") # your model must beat this +""" + +from __future__ import annotations + +from datetime import datetime, timezone + +import pandas as pd +from aieng.forecasting.data.context import ForecastContext +from aieng.forecasting.evaluation.prediction import STANDARD_QUANTILES, ContinuousForecast, Prediction +from aieng.forecasting.evaluation.predictor import Predictor +from aieng.forecasting.evaluation.task import ForecastingTask + + +class LastValuePredictor(Predictor): + """Naive baseline: forecast the most recently observed value at all quantiles. + + All quantile levels receive the same value as the point forecast, producing + a degenerate distribution with zero spread. Such a forecast expresses no + uncertainty and is poorly calibrated; a well-calibrated model should spread + its quantiles to reflect genuine uncertainty. + + For multi-horizon tasks (``len(task.horizons) > 1``), the same last value + is carried forward as a flat forecast for every requested step, equivalent + to the "persistence" or "random-walk" assumption. + + Parameters + ---------- + None + """ + + # ------------------------------------------------------------------ + # Step 1: give your predictor a stable string ID. + # This appears in BacktestResult and every Prediction record, + # so changing it mid-experiment will break comparisons. + # ------------------------------------------------------------------ + @property + def predictor_id(self) -> str: + """Return a stable identifier for this predictor.""" + return "last_value_naive" + + # ------------------------------------------------------------------ + # Step 2: implement predict(). 
+ # + # Arguments: + # task — ForecastingTask: defines the problem (target series, + # horizons, frequency). Read-only; do not modify it. + # context — ForecastContext: your data access object. All series + # returned by context.get_series() are already filtered + # to context.as_of — you cannot accidentally access + # future data. + # + # Return: + # list[Prediction] — one per horizon step in task.horizons. + # ------------------------------------------------------------------ + def predict(self, task: ForecastingTask, context: ForecastContext) -> list[Prediction]: + """Produce last-value naive forecasts for every horizon in the task.""" + # ------------------------------------------------------------------ + # Step 3: fetch the target series. + # Returns a DataFrame with columns: timestamp, value, released_at. + # Rows are already cut off at context.as_of. + # ------------------------------------------------------------------ + series_df = context.get_series(task.target_series_id) + + # ------------------------------------------------------------------ + # Step 4: produce a forecast. + # Replace everything below with your model logic. + # Here we just take the last observed value as the point forecast. + # ------------------------------------------------------------------ + last_value = float(series_df["value"].iloc[-1]) + + # ------------------------------------------------------------------ + # Step 5: build the ContinuousForecast payload. + # point_forecast: your central estimate (typically median). + # quantiles: a dict mapping quantile level → forecast value. + # STANDARD_QUANTILES = [0.05, 0.10, ..., 0.90, 0.95] # noqa: ERA001 + # The evaluation engine uses these to compute CRPS. + # A naive predictor with no uncertainty puts the same value + # at every quantile — real models spread them out. 
+ # ------------------------------------------------------------------ + payload = ContinuousForecast( + point_forecast=last_value, + quantiles=dict.fromkeys(STANDARD_QUANTILES, last_value), + ) + + # ------------------------------------------------------------------ + # Step 6: build one Prediction per requested horizon. + # task.horizons is a list of integer steps (e.g. [18] or [6..17]). + # For each step h, the forecast date is as_of + h × frequency. + # The harness uses each forecast_date to look up the ground-truth + # observation and score the prediction. + # ------------------------------------------------------------------ + offset = pd.tseries.frequencies.to_offset(task.frequency) + issued_at = datetime.now(tz=timezone.utc).replace(tzinfo=None) + + return [ + Prediction( + predictor_id=self.predictor_id, + task_id=task.task_id, + issued_at=issued_at, + as_of=context.as_of, + forecast_date=(pd.Timestamp(context.as_of) + offset * h).to_pydatetime(), + payload=payload, + ) + for h in task.horizons + ] +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes____init__.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes____init__.py.md new file mode 100644 index 0000000..ecd5ed5 --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes____init__.py.md @@ -0,0 +1,82 @@ +# Source: aieng-forecasting/aieng/forecasting/methods/llm_processes/__init__.py + +kind: python + +```python +"""LLM-process predictor implementations. + +Predictors that use an LLM directly as the forecasting engine (no agent loop, +no tool use). 
Concrete subclasses are organised by target type and elicitation +strategy: + +- :class:`SampledTrajectoryLLMPredictor` — sample-based empirical quantiles for + continuous targets (Gruver / Context-is-Key Direct Prompt path). +- :class:`QuantileGridLLMPredictor` — direct elicitation of the standard + quantile grid for continuous targets. +- :class:`BinaryProbabilityLLMPredictor` — direct elicitation of one + calibrated probability for binary-event tasks + (``ForecastingTask.payload_type == "binary"``), scored with Brier. +- :class:`CategoricalProbabilityLLMPredictor` — direct elicitation of a + calibrated distribution over the task-declared ordered categories + (``ForecastingTask.payload_type == "categorical"``), scored with RPS. +- ``point_intervals`` — design placeholder for a token-efficient point-plus- + interval contract. It may become a configurable sparse quantile grid rather + than a separate predictor. + +Method *variants* from the literature (Requeima A-LLMP / I-LLMP, logprob-based +hierarchical density, conformal-wrapped predictors) belong as additional +sibling classes here, **not** as configurations of an existing class. The same +rule applies to binary elicitation: sampled-outcome, logprob, or +conformal-wrapped binary forecasters should be siblings of +:class:`BinaryProbabilityLLMPredictor`, not modes on it. + +--- + +Placeholder method design notes +------------------------------- + +``point_intervals.py`` is intentionally non-exported. A point-plus-interval +prompt asks for a central path plus compact uncertainty bands (for example +``q10``, ``q50``, ``q90``). That contract is attractive for larger, +reasoning-capable LLMs because it is much cheaper than a full quantile grid, +but it is also just sparse quantile elicitation. Before implementing it, decide +whether configurable quantile sets belong on :class:`QuantileGridLLMPredictor` +instead, and how sparse intervals map to the standard ``ContinuousForecast`` +quantiles used for scoring. 
+""" + +from aieng.forecasting.methods.llm_processes.base import ( + LLMPredictor, + LLMPredictorConfig, +) +from aieng.forecasting.methods.llm_processes.binary_probability import ( + BinaryProbabilityLLMPredictor, + BinaryProbabilityLLMPredictorConfig, +) +from aieng.forecasting.methods.llm_processes.categorical_probability import ( + CategoricalProbabilityLLMPredictor, + CategoricalProbabilityLLMPredictorConfig, +) +from aieng.forecasting.methods.llm_processes.quantile_grid import ( + QuantileGridLLMPredictor, + QuantileGridLLMPredictorConfig, +) +from aieng.forecasting.methods.llm_processes.sampled_trajectory import ( + SampledTrajectoryLLMPredictor, + SampledTrajectoryLLMPredictorConfig, +) + + +__all__ = [ + "BinaryProbabilityLLMPredictor", + "BinaryProbabilityLLMPredictorConfig", + "CategoricalProbabilityLLMPredictor", + "CategoricalProbabilityLLMPredictorConfig", + "SampledTrajectoryLLMPredictor", + "SampledTrajectoryLLMPredictorConfig", + "QuantileGridLLMPredictor", + "QuantileGridLLMPredictorConfig", + "LLMPredictor", + "LLMPredictorConfig", +] +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes___client.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes___client.py.md new file mode 100644 index 0000000..877bd51 --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes___client.py.md @@ -0,0 +1,487 @@ +# Source: aieng-forecasting/aieng/forecasting/methods/llm_processes/_client.py + +kind: python + +```python +"""Shared LiteLLM call seam for all ``llm_processes`` predictors. + +This module owns: + +- Idempotent module-level bootstrap of LiteLLM callbacks. +- Async single-completion seam with one retry on parse failure. +- Parallel ``asyncio.gather`` fan-out for ``N``-sample elicitation. 
+- A small ``run_async`` shim that works in scripts, pytest, and Jupyter. +- Langfuse ``@observe`` decorator factory and trace-info helpers. + +Continuous and (future) binary predictors share this seam so the LLM-call +contract — request shape, retry policy, tracing — lives in exactly one +place. + +LiteLLM caching is intentionally **not** wired here: ``litellm[caching]`` +is an optional extra and disk caching collapses repeated identical prompts +into a single response, which would defeat sample-based forecasting. +""" + +from __future__ import annotations + +import asyncio +import contextvars +import json +import logging +import os +import warnings +from concurrent.futures import ThreadPoolExecutor +from typing import Any, Callable, TypeVar + +from pydantic import BaseModel, ValidationError + + +logger = logging.getLogger(__name__) + +T = TypeVar("T", bound=BaseModel) + +_BOOTSTRAP_DONE = False + + +def bootstrap_litellm() -> None: + """One-time wiring of LiteLLM callbacks. + + Lazy and idempotent so non-LLM predictors do not require Langfuse env vars. + The Langfuse OTEL callback is registered only when ``LANGFUSE_PUBLIC_KEY`` + is set in the environment. + """ + global _BOOTSTRAP_DONE # noqa: PLW0603 + if _BOOTSTRAP_DONE: + return + import litellm # noqa: PLC0415 + + if os.environ.get("LANGFUSE_PUBLIC_KEY"): + existing = list(getattr(litellm, "callbacks", []) or []) + if "langfuse_otel" not in existing: + litellm.callbacks = [*existing, "langfuse_otel"] + + # Suppress LiteLLM startup and OTEL noise (mirrors agent_factory.py filter). + # Bedrock/SageMaker "no botocore" and OTEL proxy-server notices are harmless. + # OTEL span-lifecycle warnings fire when callbacks run after spans close. 
+ class _NoiseFilter(logging.Filter): + _NOISE = ("botocore", "Proxy Server is not installed") + + def filter(self, record: logging.LogRecord) -> bool: + return not any(n in record.getMessage() for n in self._NOISE) + + logging.getLogger("LiteLLM").addFilter(_NoiseFilter()) + warnings.filterwarnings("ignore", message="Tried calling set_status on an ended span") + warnings.filterwarnings("ignore", message="Setting attribute on ended span") + logging.getLogger("opentelemetry").setLevel(logging.ERROR) + + _BOOTSTRAP_DONE = True + + +def langfuse_observe(name: str) -> Callable[..., Any]: + """Return Langfuse's ``@observe`` decorator with the given span name. + + Falls back to a no-op decorator if Langfuse is not installed or fails to + import, so the predictor remains usable without the ``agentic`` extra. + """ + try: + from langfuse import observe # noqa: PLC0415 + + return observe(name=name) + except Exception: # pragma: no cover + logger.debug("langfuse not available; skipping @observe decoration") + + def _noop(fn: Any) -> Any: + return fn + + return _noop + + +def current_trace_info() -> tuple[str | None, str | None]: + """Return ``(trace_id, trace_url)`` from the active Langfuse client, if any.""" + try: + from langfuse import get_client # noqa: PLC0415 + except Exception: + return None, None + try: + client = get_client() + return client.get_current_trace_id(), client.get_trace_url() + except Exception: # pragma: no cover + return None, None + + +def trace_url_for(trace_id: str) -> str | None: + """Return the Langfuse UI URL for a specific ``trace_id``, or ``None``. + + Unlike :func:`current_trace_info`, this resolves a URL for a trace by id even + when no trace context is active (e.g. the agent path, whose trace id is + captured on a worker thread). No-op when Langfuse is unavailable. 
+ """ + try: + from langfuse import get_client # noqa: PLC0415 + + return get_client().get_trace_url(trace_id=trace_id) + except Exception: + return None + + +def set_current_trace_name(name: str) -> None: + """Name the active Langfuse trace, if any, so it is identifiable in the UI. + + LLMP predictors call this with their ``predictor_id`` at the top of + ``predict``. Because ``predict`` is the ``@observe``-wrapped root span, its + name is what Langfuse shows as the trace name; renaming the current span + therefore renames the trace to the same identifier used by leaderboards and + artifact storage — matching how agent predictors name their traces. No-op + when Langfuse is not installed or no span is active. + """ + try: + from langfuse import get_client # noqa: PLC0415 + except Exception: + return + try: + get_client().update_current_span(name=name) + except Exception: # pragma: no cover + logger.debug("update_current_span(name=%r) failed; trace name unchanged.", name) + + +def _strip_additional_properties(node: Any) -> Any: + """Recursively drop ``additionalProperties`` keys from a JSON schema. + + The Vector proxy's Gemini ``response_schema`` route rejects + ``additionalProperties`` (``Unknown name "additionalProperties" at + 'generation_config.response_schema'``), even though OpenAI strict mode + expects ``additionalProperties: false``. We strip it centrally so the same + predictor schemas route through the proxy unchanged; ``strict: True`` still + pins the model to the declared fields. (If a direct OpenAI-strict route is + ever added, that path would need ``additionalProperties: false`` restored.) 
+ """ + if isinstance(node, dict): + return {k: _strip_additional_properties(v) for k, v in node.items() if k != "additionalProperties"} + if isinstance(node, list): + return [_strip_additional_properties(v) for v in node] + return node + + +def make_json_schema_response_format(name: str, schema: dict[str, Any]) -> dict[str, Any]: + """Build the explicit ``json_schema`` ``response_format`` dict. + + Always pass this dict form to ``litellm.completion`` rather than a Pydantic + class — the class-to-schema conversion path has known regressions on + Anthropic providers. ``additionalProperties`` is stripped from the schema + for proxy/Gemini compatibility (see :func:`_strip_additional_properties`). + """ + return { + "type": "json_schema", + "json_schema": {"name": name, "schema": _strip_additional_properties(schema), "strict": True}, + } + + +def strip_markdown_fence(content: str) -> str: + r"""Normalise an LLM response down to its JSON payload. + + Defends the parse layer against two model/proxy quirks so participants can + swap models freely without hitting parse failures: + + 1. **Markdown fences.** Some models wrap JSON in a ```json ... ``` fence + even when ``response_format`` is set. + 2. **Surrounding prose.** Some models (notably Claude through the proxy) + append an explanation *after* the JSON — e.g. ``{...}\n\n**Method:** + ...`` — or leak a stray closing fence when prose follows it. This is a + Predictor-interface concern, not LLMP-specific: every methodology that + parses a structured JSON response needs the payload isolated. + + The prose-trimming step is best-effort: it isolates the first complete + JSON object via :meth:`json.JSONDecoder.raw_decode` and discards anything + after it. When no JSON object is present the fence-stripped string is + returned unchanged, so non-JSON content passes through untouched. + + Parameters + ---------- + content : str + Raw LLM response content, possibly fenced and/or surrounded by prose. 
+ + Returns + ------- + str + The isolated JSON payload, or the fence-stripped, whitespace-trimmed + input when no JSON object can be located. + """ + stripped = content.strip() + if stripped.startswith("```"): + lines = stripped.splitlines() + # Drop opening fence line (```json or ```) + inner_lines = lines[1:] + # Drop closing fence line if present + if inner_lines and inner_lines[-1].strip() == "```": + inner_lines = inner_lines[:-1] + stripped = "\n".join(inner_lines).strip() + payload = _extract_json_payload(stripped) + return payload if payload is not None else stripped + + +def _extract_json_payload(text: str) -> str | None: + """Return the first complete JSON object in ``text``, or ``None``. + + Scans for the first ``{`` and uses ``raw_decode`` to consume a single + balanced JSON object, ignoring any trailing (or leading) prose. Candidate + start positions that do not begin a valid object are skipped, so a stray + brace inside prose cannot derail extraction. + + Only objects are matched (not arrays): every structured forecast payload in + the Predictor interface is a top-level JSON object, so anchoring on ``{`` + avoids accidentally capturing an echoed numeric array (e.g. the input + series) that some models repeat in their prose. 
+ """ + decoder = json.JSONDecoder() + for start, char in enumerate(text): + if char != "{": + continue + try: + _, end = decoder.raw_decode(text, start) + except json.JSONDecodeError: + continue + return text[start:end] + return None + + +# --------------------------------------------------------------------------- +# Async sampling seam +# --------------------------------------------------------------------------- + + +async def _one_completion_async( + *, + model: str, + messages: list[dict[str, Any]], + response_format: dict[str, Any], + temperature: float, + max_tokens: int, + timeout_s: float, + reasoning_effort: str | None, + api_base: str | None = None, + api_key: str | None = None, +) -> tuple[str | None, float, int, int]: + """Issue a single ``litellm.acompletion`` and return content + usage.""" + import litellm # noqa: PLC0415 + + kwargs: dict[str, Any] = { + "model": model, + "messages": messages, + "response_format": response_format, + "temperature": temperature, + "max_tokens": max_tokens, + "timeout": timeout_s, + } + if api_base is not None: + kwargs["api_base"] = api_base + # Prefix the model with "openai/" so LiteLLM routes via the + # OpenAI-compatible path. LiteLLM strips the prefix before sending + # the request, so the proxy receives the bare model name as expected. + if not model.startswith("openai/"): + kwargs["model"] = f"openai/{model}" + if api_key is not None: + kwargs["api_key"] = api_key + if reasoning_effort is not None: + # LiteLLM unifies the per-provider reasoning-budget kwargs behind + # ``reasoning_effort`` ∈ {"disable", "low", "medium", "high"}. We + # default to ``"disable"`` in the config because CoT-induced + # overconfidence is well-documented for continuous probabilistic + # forecasting (Welch 2026, Marzoev 2026). 
+ # + # IMPORTANT: when routing through an OpenAI-compatible proxy (api_base + # set), LiteLLM treats the model as a generic OpenAI model and does not + # list ``reasoning_effort`` as a supported param for non-o1/o3 model + # names (confirmed via litellm.get_supported_openai_params). With + # ``drop_params=True`` it is silently stripped before the request + # reaches the proxy, so the thinking model runs unconstrained. + # Workaround: inject via ``extra_body``, which bypasses LiteLLM's + # param-filtering step and is merged directly into the request JSON. + if api_base is not None: + kwargs.setdefault("extra_body", {})["reasoning_effort"] = reasoning_effort + else: + kwargs["reasoning_effort"] = reasoning_effort + # drop_params=True is still needed for other non-standard params on + # models that don't support them (e.g. temperature on some o-series). + kwargs["drop_params"] = True + + resp = await litellm.acompletion(**kwargs) + cost = float(getattr(resp, "_hidden_params", {}).get("response_cost") or 0.0) + usage = getattr(resp, "usage", None) + in_tok = int(getattr(usage, "prompt_tokens", 0) or 0) if usage is not None else 0 + out_tok = int(getattr(usage, "completion_tokens", 0) or 0) if usage is not None else 0 + # Log full usage so we can see thinking-token breakdown when available. + # The proxy may populate completion_tokens_details.reasoning_tokens. 
+ if usage is not None: + logger.debug("LLM usage: %s", vars(usage) if hasattr(usage, "__dict__") else usage) + raw = resp.choices[0].message.content + content = strip_markdown_fence(raw) if raw else raw + return content, cost, in_tok, out_tok + + +async def _one_completion_with_transient_retry( + *, + model: str, + messages: list[dict[str, str]], + response_format: dict[str, Any], + temperature: float, + max_tokens: int, + timeout_s: float, + reasoning_effort: str | None, + api_base: str | None = None, + api_key: str | None = None, +) -> tuple[str | None, float, int, int]: + """Call ``_one_completion_async`` with retries for transient API errors. + + Retries up to 3 times on 503 / rate-limit responses, backing off + exponentially (5 s, 15 s). Non-transient errors propagate immediately. + """ + from litellm.exceptions import RateLimitError, ServiceUnavailableError # noqa: PLC0415 + + _transient = (ServiceUnavailableError, RateLimitError) + for attempt in range(3): + try: + return await _one_completion_async( + model=model, + messages=messages, + response_format=response_format, + temperature=temperature, + max_tokens=max_tokens, + timeout_s=timeout_s, + reasoning_effort=reasoning_effort, + api_base=api_base, + api_key=api_key, + ) + except _transient as exc: + if attempt == 2: + raise + wait_s = 5 * (3**attempt) # 5 s, 15 s + logger.warning( + "Transient API error (attempt %d/3), retrying in %ds: %s", + attempt + 1, + wait_s, + exc, + ) + await asyncio.sleep(wait_s) + raise RuntimeError("unreachable") # pragma: no cover + + +async def _sample_one_with_retry( + *, + schema_cls: type[T], + model: str, + base_messages: list[dict[str, Any]], + response_format: dict[str, Any], + temperature: float, + max_tokens: int, + timeout_s: float, + reasoning_effort: str | None, + sample_index: int, + api_base: str | None = None, + api_key: str | None = None, +) -> tuple[T | None, float, int, int, int]: + """Single sample with one retry on parse failure and transient-error 
backoff.""" + cost = 0.0 + in_tok = 0 + out_tok = 0 + failures = 0 + + for attempt in range(2): + content, c, i, o = await _one_completion_with_transient_retry( + model=model, + messages=base_messages, + response_format=response_format, + temperature=temperature, + max_tokens=max_tokens, + timeout_s=timeout_s, + reasoning_effort=reasoning_effort, + api_base=api_base, + api_key=api_key, + ) + cost += c + in_tok += i + out_tok += o + try: + parsed = schema_cls.model_validate(json.loads(content or "")) + return parsed, cost, in_tok, out_tok, failures + except (json.JSONDecodeError, ValidationError) as exc: + failures += 1 + logger.warning( + "Sample %d parse failure on attempt %d: %s", + sample_index + 1, + attempt + 1, + exc, + ) + + return None, cost, in_tok, out_tok, failures + + +async def sample_n_async( + *, + schema_cls: type[T], + model: str, + base_messages: list[dict[str, Any]], + response_format: dict[str, Any], + n_samples: int, + temperature: float, + max_tokens: int, + timeout_s: float, + reasoning_effort: str | None, + api_base: str | None = None, + api_key: str | None = None, +) -> tuple[list[T], float, int, int, int]: + """Fan ``n_samples`` calls out via ``asyncio.gather`` and aggregate usage. + + Returns ``(parsed_samples, total_cost, total_in_tokens, total_out_tokens, + total_parse_failures)``. Failed samples are dropped silently here; the + caller must decide what to do if the parsed list is empty. 
+ """ + coros = [ + _sample_one_with_retry( + schema_cls=schema_cls, + model=model, + base_messages=base_messages, + response_format=response_format, + temperature=temperature, + max_tokens=max_tokens, + timeout_s=timeout_s, + reasoning_effort=reasoning_effort, + sample_index=i, + api_base=api_base, + api_key=api_key, + ) + for i in range(n_samples) + ] + results = await asyncio.gather(*coros) + + parsed: list[T] = [] + total_cost = 0.0 + total_in = 0 + total_out = 0 + total_failures = 0 + for sample, c, i, o, f in results: + total_cost += c + total_in += i + total_out += o + total_failures += f + if sample is not None: + parsed.append(sample) + return parsed, total_cost, total_in, total_out, total_failures + + +def run_async(coro: Any) -> Any: + """Run an async coroutine from sync code; works in scripts and Jupyter. + + If no event loop is running (scripts, pytest), uses ``asyncio.run``. + If a loop is already running (Jupyter), runs the coroutine on a fresh + loop in a worker thread with the current ``contextvars`` context copied + across, so Langfuse trace context propagates into the async sampling. 
+ """ + try: + asyncio.get_running_loop() + except RuntimeError: + return asyncio.run(coro) + + ctx = contextvars.copy_context() + with ThreadPoolExecutor(max_workers=1) as pool: + return pool.submit(ctx.run, asyncio.run, coro).result() +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__base.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__base.py.md new file mode 100644 index 0000000..f943f74 --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__base.py.md @@ -0,0 +1,480 @@ +# Source: aieng-forecasting/aieng/forecasting/methods/llm_processes/base.py + +kind: python + +```python +"""Abstract base class and shared config for LLM-process predictors. + +``LLMPredictor`` is the abstract parent shared by every concrete predictor in +this package (today: :class:`SampledTrajectoryLLMPredictor` and +:class:`QuantileGridLLMPredictor`; planned: ``BinaryProbabilityLLMPredictor``). It is +**never instantiated directly** — users instantiate one of the concrete +subclasses re-exported from :mod:`aieng.forecasting.methods`. 
+""" + +from __future__ import annotations + +import os +from pathlib import Path +from typing import TYPE_CHECKING, Any, ClassVar, Literal, Mapping + +import pandas as pd +from aieng.forecasting.documents.models import ExtractedDocument +from aieng.forecasting.documents.pdf_upload import pdf_to_content_part +from aieng.forecasting.evaluation.predictor import Predictor +from aieng.forecasting.methods.llm_processes._client import bootstrap_litellm, current_trace_info +from aieng.forecasting.models import LITE_MODEL +from pydantic import BaseModel, ConfigDict, Field + + +if TYPE_CHECKING: + from aieng.forecasting.data.context import ForecastContext + from aieng.forecasting.data.models import SeriesMetadata + from aieng.forecasting.evaluation.task import ForecastingTask + + +class LLMPredictorConfig(BaseModel): + """Frozen base config: provider-agnostic LLM-call settings. + + Subclasses extend with modality-specific fields (e.g. ``n_samples``, + ``precision`` for the continuous case). + """ + + model_config = ConfigDict(frozen=True) + + model: str = Field( + default=LITE_MODEL, + description=( + "Model name as expected by the proxy (bare, no provider prefix), " + "e.g. 'gemini-3.1-flash-lite-preview', 'gpt-4o-mini'. " + "When openai_base_url is set, LiteLLM routes this to the proxy via " + "custom_llm_provider='openai'." + ), + ) + openai_base_url: str | None = Field( + default_factory=lambda: os.getenv("OPENAI_BASE_URL"), + description=( + "Base URL for an OpenAI-compatible LLM proxy. Defaults to the " + "``OPENAI_BASE_URL`` environment variable. When set, all completions " + "are routed through the proxy using ``api_base`` + " + "``custom_llm_provider='openai'``." + ), + ) + openai_api_key: str | None = Field( + default_factory=lambda: os.getenv("OPENAI_API_KEY"), + description=("API key for the proxy. 
Defaults to the ``OPENAI_API_KEY`` environment variable."), + ) + temperature: float = Field(default=1.0, ge=0.0, le=2.0, description="Sampling temperature.") + max_tokens: int = Field( + default=16384, + ge=1, + description=( + "Per-call output token budget. " + "Thinking models (e.g. gemini-3.1-pro-preview) consume thinking tokens " + "from this same budget via the OpenAI-compatible proxy — the 16 k default " + "is intentionally generous to prevent truncation; the model only generates " + "tokens it needs, so non-thinking models are not affected in cost." + ), + ) + timeout_s: float = Field(default=120.0, gt=0.0, description="Per-call timeout in seconds.") + reasoning_effort: Literal["disable", "low", "medium", "high"] | None = Field( + default=None, + description=( + "Reasoning budget passed through to LiteLLM. ``None`` (default) sends " + "no ``reasoning_effort`` and lets the provider use its own default — " + "for the project's Gemini-via-proxy setup the lite model does not " + "force chain-of-thought, which suits calibration-sensitive " + "forecasting (CoT-induced overconfidence is well-documented for " + "continuous probabilistic forecasting). ``'medium'`` / ``'high'`` " + "request more reasoning. NOTE: the Vector proxy currently rejects " + "``'disable'`` and ``'low'`` for Gemini models (valid: " + "minimal/medium/high) — those literals are retained for other " + "providers but will 400 through the proxy." + ), + ) + variant_tag: str | None = Field( + default=None, + description=( + "Optional short identifier for a method recipe (e.g. ``'food_cpi_v1_h60_n3'``, " + "``'short_history'``). When set, it is folded into :attr:`predictor_id` " + "as ``_[]`` so artifact storage, cached " + "backtests, and leaderboards keep recipes distinct. ``None`` preserves " + "the bare ``[]`` form used by ad-hoc construction." + ), + ) + report_sources: list[str] | None = Field( + default=None, + description=( + "Optional list of document source keys (e.g. 
``['cfpr']``) to include " + "as a report preamble in the prompt. When set, the predictor calls " + "``context.get_documents(source)`` for each source and prepends the " + "extracted text to the user prompt in CiK-style Format A. Requires a " + "``DocumentStore`` to be attached to the ``DataService``." + ), + ) + report_max_chars: int | None = Field( + default=None, + ge=1, + description=( + "Per-report character truncation limit. Reports can be ~80,000 chars " + "each; set this to keep context windows manageable. Truncation is " + "applied per-report before concatenation. ``None`` means no truncation. " + "Only used by the ``'text'`` ingestion mode." + ), + ) + report_ingestion: Literal["text", "native"] = Field( + default="text", + description=( + "How report documents are fed to the model when ``report_sources`` is " + "set. ``'text'`` (default) injects pymupdf4llm-extracted markdown as a " + "CiK-style text preamble — works for every model through the proxy. " + "``'native'`` uploads the source PDFs as backend-native document parts " + "so the model reads the original (tables/figures intact). " + "TEMPORARY LIMITATION: native ingestion works only for Claude/GPT " + "models — the proxy drops document parts on the Gemini route. Once the " + "proxy routes Gemini natively (see TODO(proxy-pdf) in " + "documents/pdf_upload.py), native ingestion will apply uniformly and " + "this becomes a free text-vs-native choice for any model." + ), + ) + + +def serialize_history(df: pd.DataFrame, precision: int) -> str: + """Render a cutoff-filtered series as one ``: value`` line per row. + + Uses ``YYYY-MM-DD`` format when any timestamp falls on a day other than 1 + (i.e. the series is sub-monthly), and ``YYYY-MM`` format otherwise. + + .. TODO(history-format): the day-!= 1 heuristic handles monthly vs daily but + breaks for quarterly, weekly, or truly irregular series. 
A future revision + should accept an explicit ``fmt`` or ``frequency`` parameter so callers + have full control over the date representation sent to the LLM. + """ + timestamps = [pd.Timestamp(ts) for ts in df["timestamp"]] + is_sub_monthly = any(ts.day != 1 for ts in timestamps) + fmt = "%Y-%m-%d" if is_sub_monthly else "%Y-%m" + lines = [f"{ts.strftime(fmt)}: {v:.{precision}f}" for ts, v in zip(timestamps, df["value"])] + return "\n".join(lines) + + +def build_covariate_block( + context: ForecastContext, + covariate_series_ids: list[str], + *, + precision: int, + history_window: int | None = None, +) -> str: + """Serialize covariate histories into labeled blocks for the LLM prompt. + + Each registered covariate series is rendered cutoff-safe (via + ``context.get_series``) as a labeled block: a description / units header + (from :meth:`get_metadata` when available) followed by its + :func:`serialize_history` rendering. Series with no observations at the + cutoff are skipped. When ``history_window`` is set, each covariate is + truncated to its last ``history_window`` observations, matching the target. + + This is the Context-is-Key §5.4 "labeled covariate blocks" pattern: the + model sees the target history plus the recent trajectory of each exogenous + series and may condition on cross-series structure. + + Returns an empty string when ``covariate_series_ids`` is empty or no + covariate has usable history, so callers can unconditionally interpolate the + result into a prompt. 
+ """ + blocks: list[str] = [] + for cov_id in covariate_series_ids: + cov_df = context.get_series(cov_id) + if cov_df.empty: + continue + if history_window is not None: + cov_df = cov_df.tail(history_window).reset_index(drop=True) + try: + cov_meta: SeriesMetadata | None = context.get_metadata(cov_id) + except KeyError: + cov_meta = None + if cov_meta is not None: + header = f"Covariate: {cov_meta.description} (source: {cov_meta.source})\nUnits: {cov_meta.units}" + else: + header = f"Covariate: {cov_id}" + blocks.append(f"{header}\n{serialize_history(cov_df, precision=precision)}") + if not blocks: + return "" + intro = ( + "Covariates (exogenous series observed through the forecast origin; " + "use as additional context for your forecast):" + ) + return intro + "\n\n" + "\n\n".join(blocks) + + +def get_history_and_meta( + task: ForecastingTask, + context: ForecastContext, +) -> tuple[pd.DataFrame, SeriesMetadata | None]: + """Fetch the target series and its metadata, respecting the cutoff. + + Raises ``ValueError`` if the series has no observations at ``context.as_of``. + Returns ``(df, None)`` for series whose adapter did not register metadata. + """ + series_df = context.get_series(task.target_series_id) + if series_df.empty: + raise ValueError(f"History for '{task.target_series_id}' is empty at as_of={context.as_of}.") + try: + series_meta = context.get_metadata(task.target_series_id) + except KeyError: + series_meta = None + return series_df, series_meta + + +def fetch_report_docs( + *, + config: LLMPredictorConfig, + context: ForecastContext, +) -> list[ExtractedDocument]: + """Fetch cutoff-filtered report documents per ``config.report_sources``. + + Parameters + ---------- + config : LLMPredictorConfig + Config with ``report_sources`` and ``report_max_chars`` fields. + context : ForecastContext + Cutoff-scoped context with optional ``DocumentStore``. + + Returns + ------- + list[ExtractedDocument] + Cutoff-filtered, chronologically sorted documents. 
Empty when + ``report_sources`` is ``None`` or no ``DocumentStore`` is attached. + """ + if not config.report_sources: + return [] + docs: list[ExtractedDocument] = [] + for source in config.report_sources: + docs.extend(context.get_documents(source)) + docs.sort(key=lambda d: (d.meta.publication_date, d.meta.doc_id)) + return docs + + +def build_report_preamble( + docs: list[ExtractedDocument], + *, + max_chars: int | None = None, +) -> str: + """Build a CiK-style Format A report preamble from a list of documents. + + Each document is formatted as a titled, dated block:: + + === Canada's Food Price Report 2025 (15th edition) === + Source: cfpr + Published: 2024-12-05 + + + When ``max_chars`` is set, each report's text is truncated to that limit + with a ``[...]`` marker appended. Documents are rendered in the order + provided (typically chronological). + + Parameters + ---------- + docs : list[ExtractedDocument] + Documents to include in the preamble. + max_chars : int or None + Per-report character truncation limit. ``None`` means no truncation. + + Returns + ------- + str + Formatted preamble string, or an empty string when ``docs`` is empty. + """ + if not docs: + return "" + blocks: list[str] = [] + for doc in docs: + title = doc.meta.title or f"{doc.meta.source}/{doc.meta.doc_id}" + text = doc.text + if max_chars is not None and len(text) > max_chars: + text = text[:max_chars] + "\n\n[...]" + block = ( + f"=== {title} ===\nSource: {doc.meta.source}\nPublished: {doc.meta.publication_date.isoformat()}\n\n{text}" + ) + blocks.append(block) + return "\n\n".join(blocks) + + +#: Shared framing line that introduces report context in both ingestion modes. +_REPORT_INTRO = ( + "You are provided with the following economic report(s) " + "published before the forecast date. Use them as context " + "for your forecast." 
+) + + +def apply_report_context( + *, + config: LLMPredictorConfig, + docs: list[ExtractedDocument], + user_prompt: str, +) -> str | list[dict[str, Any]]: + """Apply report context to the user prompt in the configured ingestion mode. + + Centralizes the report-injection logic shared by every LLMP predictor so the + text-vs-native decision lives in one place. + + Modes (``config.report_ingestion``): + + - ``"text"`` (default): build a CiK-style text preamble via + :func:`build_report_preamble` and prepend it to ``user_prompt``. Returns + a single string. Works for every model through the proxy. + - ``"native"``: emit the source PDFs as backend-native document content + parts (:func:`~aieng.forecasting.documents.pdf_upload.pdf_to_content_part`) + so the model reads the originals directly. Returns a content-part list + ``[intro_text, , prompt_text]``. Requires each document to + carry a resolvable ``pdf_path`` and a Claude/GPT model — Gemini native + ingestion is not supported through the proxy yet (see ``pdf_upload.py``). + + When ``docs`` is empty the bare ``user_prompt`` is returned unchanged, so + callers can pass the result straight through as message content regardless + of whether any reports were configured. + + Returns + ------- + str or list[dict] + A string (text mode / no docs) or a list of content-part dicts (native + mode), suitable as the ``content`` of a user message. + """ + if not docs: + return user_prompt + if config.report_ingestion == "native": + return _build_native_report_content(config=config, docs=docs, user_prompt=user_prompt) + preamble = build_report_preamble(docs, max_chars=config.report_max_chars) + if not preamble: + return user_prompt + return f"{_REPORT_INTRO}\n\n{preamble}\n\n---\n\n{user_prompt}" + + +def _build_native_report_content( + *, + config: LLMPredictorConfig, + docs: list[ExtractedDocument], + user_prompt: str, +) -> list[dict[str, Any]]: + """Build a content-part list with native PDF document parts + the prompt. 
+ + Order: a brief intro text part, one backend-native document part per source + PDF (in the order given), then the user prompt as a trailing text part. + + Raises + ------ + ValueError + If any document lacks a resolved ``pdf_path``. + NotImplementedError + If ``config.model`` is a Gemini model (proxy limitation; raised by + :func:`~aieng.forecasting.documents.pdf_upload.pdf_to_content_part`). + """ + parts: list[dict[str, Any]] = [{"type": "text", "text": _REPORT_INTRO}] + for doc in docs: + if not doc.pdf_path: + raise ValueError( + f"Native report ingestion requested but document " + f"'{doc.meta.source}/{doc.meta.doc_id}' has no resolved pdf_path. " + "Ensure the source PDF sits beside its .json artifact, or use " + "report_ingestion='text'." + ) + parts.append(pdf_to_content_part(Path(doc.pdf_path), config.model)) + parts.append({"type": "text", "text": f"---\n\n{user_prompt}"}) + return parts + + +class LLMPredictor(Predictor): + """Abstract parent for all LLM-process predictors. + + Concrete subclasses differ in: + + - The config type they accept (extends :class:`LLMPredictorConfig`). + - The output schema they request from the LLM. + - How they aggregate one or many LLM responses into ``Prediction`` objects. + + What this base provides: + + - LiteLLM bootstrap on construction (lazy, idempotent). + - ``predictor_id`` derived from the class-level ``_method_tag``. + - ``cfg`` storage with the right modality-specific type. + + Subclasses must: + + - Set the class attribute ``_method_tag`` (e.g. ``"llmp_sampled_trajectories"``). + - Override ``_default_config`` to return their concrete config type. + - Implement ``predict``. + """ + + #: Stable, human-readable family tag used in :attr:`predictor_id`. + #: Subclasses must override (e.g. ``"llmp_sampled_trajectories"``). 
+ _method_tag: ClassVar[str] = "" + + def __init__(self, cfg: LLMPredictorConfig | None = None) -> None: + if not self._method_tag: + raise TypeError( + f"{type(self).__name__} must set the class attribute '_method_tag'.", + ) + self.cfg = cfg if cfg is not None else self._default_config() + bootstrap_litellm() + + @classmethod + def _default_config(cls) -> LLMPredictorConfig: + """Return a default config; subclasses override with their own config type.""" + return LLMPredictorConfig() + + @property + def predictor_id(self) -> str: + """Stable identifier folding method tag, optional variant tag, and model. + + Format: + + - ``[]`` when ``cfg.variant_tag`` is ``None`` (default). + - ``_[]`` otherwise. + + Recipes (see ``implementations//predictors/``) set + ``variant_tag`` so their cached backtests and leaderboard rows stay + distinct from ad-hoc bare-config runs. Examples: + + - ``llmp_sampled_trajectories[anthropic/claude-sonnet-4-5]`` + - ``llmp_sampled_trajectories_food_cpi_v1_h60_n3[]`` + - ``llmp_quantile_grid_food_cpi_v1_h60_rlow[]`` + """ + if self.cfg.variant_tag: + return f"{self._method_tag}_{self.cfg.variant_tag}[{self.cfg.model}]" + return f"{self._method_tag}[{self.cfg.model}]" + + def _build_metadata( + self, + *, + cost_usd: float, + in_tokens: int, + out_tokens: int, + parse_failures: int, + history_window: int | None = None, + extra: Mapping[str, Any] | None = None, + ) -> dict[str, Any]: + """Build common metadata for an LLM-backed prediction.""" + trace_id, trace_url = current_trace_info() + metadata: dict[str, Any] = {"model": self.cfg.model} + if extra is not None: + metadata.update(extra) + metadata.update( + { + "temperature": self.cfg.temperature, + "reasoning_effort": self.cfg.reasoning_effort, + "cost_usd": cost_usd, + "input_tokens": in_tokens, + "output_tokens": out_tokens, + "parse_failures": parse_failures, + } + ) + if self.cfg.variant_tag is not None: + metadata["variant_tag"] = self.cfg.variant_tag + if history_window is not None: + 
metadata["history_window"] = history_window + if trace_id is not None: + metadata["langfuse_trace_id"] = trace_id + if trace_url is not None: + metadata["langfuse_trace_url"] = trace_url + return metadata +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__binary_probability.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__binary_probability.py.md new file mode 100644 index 0000000..0c81afc --- /dev/null +++ b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__binary_probability.py.md @@ -0,0 +1,340 @@ +# Source: aieng-forecasting/aieng/forecasting/methods/llm_processes/binary_probability.py + +kind: python + +```python +"""BinaryProbabilityLLMPredictor — direct probability elicitation for binary events. + +Asks an LLM for a single calibrated probability that a binary event resolves +``True``, via one structured completion per forecast origin. This is the +binary counterpart of +:class:`~aieng.forecasting.methods.llm_processes.quantile_grid.QuantileGridLLMPredictor`: +where the quantile grid elicits a full predictive distribution for a +continuous target, this class elicits the one number that fully describes a +Bernoulli predictive distribution. + +Direct probabilities are token-efficient and easy to score (Brier), but the +prompt must distinguish *calibrated probability* from *model confidence* — +the system prompt below is explicit about coverage semantics. Sampled-outcome, +logprob, and conformal variants should be implemented as sibling classes if +they prove useful, not as modes on this predictor. 
+""" + +from __future__ import annotations + +from datetime import datetime, timezone +from typing import TYPE_CHECKING, Any, ClassVar + +import pandas as pd +from aieng.forecasting.evaluation.prediction import BinaryForecast, Prediction +from aieng.forecasting.methods.llm_processes._client import ( + langfuse_observe, + make_json_schema_response_format, + run_async, + sample_n_async, + set_current_trace_name, +) +from aieng.forecasting.methods.llm_processes.base import ( + LLMPredictor, + LLMPredictorConfig, + apply_report_context, + fetch_report_docs, + get_history_and_meta, + serialize_history, +) +from pydantic import BaseModel, ConfigDict, Field + + +if TYPE_CHECKING: + from aieng.forecasting.data.context import ForecastContext + from aieng.forecasting.data.models import SeriesMetadata + from aieng.forecasting.evaluation.task import ForecastingTask + + +class BinaryProbabilityLLMPredictorConfig(LLMPredictorConfig): + """Frozen configuration for :class:`BinaryProbabilityLLMPredictor`. + + Adds only binary-task prompt controls that preserve the direct-probability + contract. The predictor makes one structured completion per forecast + origin and does not expose ``n_samples``. + """ + + model_config = ConfigDict(frozen=True) + + precision: int = Field( + default=0, + ge=0, + le=10, + description="Decimal places used when serializing the (0/1) event history.", + ) + history_window: int | None = Field( + default=None, + ge=1, + description="If set, only the last N cutoff-filtered observations are serialized into the prompt.", + ) + series_description: str | None = Field( + default=None, + description="Optional replacement for the metadata-derived series description block.", + ) + elicit_reasoning: bool = Field( + default=True, + description=( + "When True, ask the model for a short free-text 'reasoning' field alongside the " + "probability, captured into Prediction.metadata['rationale'] for inspection and " + "downstream reasoning evaluation. 
The field is requested *after* the probability so " + "the model commits to the number first, keeping the answer-first ordering that " + "protects calibration. Set False to restore the bare probability-only elicitation." + ), + ) + system_prompt_override: str | None = Field( + default=None, + description="Full replacement for the built-in binary-probability system prompt.", + ) + user_prompt_suffix: str | None = Field( + default=None, + description=( + "Free-form text appended to the user prompt after the standard question. " + "Use-case recipes use this to inject domain context (covariate summaries, " + "report excerpts) without changing the elicitation contract." + ), + ) + + +class _BinaryProbability(BaseModel): + """Internal Pydantic schema for one directly elicited event probability. + + ``reasoning`` is optional so parsing succeeds whether or not the field was + requested (controlled by ``elicit_reasoning`` on the config). + """ + + probability: float = Field(ge=0.0, le=1.0) + reasoning: str = Field(default="") + + +def _build_binary_probability_schema(elicit_reasoning: bool) -> dict[str, Any]: + """Build the strict ``json_schema`` for one event probability. + + ``probability`` comes first so the model commits to the number before any + justification. When ``elicit_reasoning`` is True, a free-text ``reasoning`` + field is appended; strict mode with ``additionalProperties: False`` requires + every property to be listed in ``required``. 
+ """ + properties: dict[str, Any] = { + "probability": {"type": "number", "minimum": 0.0, "maximum": 1.0}, + } + required = ["probability"] + if elicit_reasoning: + properties["reasoning"] = {"type": "string"} + required.append("reasoning") + return { + "type": "object", + "properties": properties, + "required": required, + "additionalProperties": False, + } + + +def _build_system_prompt(override: str | None = None, *, elicit_reasoning: bool = False) -> str: + """Return the binary-probability system prompt, or ``override`` verbatim.""" + if override is not None: + return override + reasoning_rule = ( + "- Decide your probability first, then briefly justify it in plain text in the " + "'reasoning' field (a few sentences naming the key drivers).\n" + if elicit_reasoning + else "" + ) + return ( + "You are a probabilistic forecaster of binary events. Given the history of past " + "outcomes and a question about a future event, return one calibrated probability " + "that the event occurs.\n" + "\n" + "Rules:\n" + "- Return ONLY a JSON object matching the provided schema. No prose, no markdown.\n" + "- 'probability' is the probability the event resolves TRUE (1), in [0, 1].\n" + "- Report a CALIBRATED probability, not your confidence in a point answer: across " + "many questions where you answer 0.7, the event should occur about 70% of the time.\n" + "- Avoid 0.0 and 1.0 unless the outcome is logically certain.\n" + f"{reasoning_rule}" + "- Base rates matter: anchor on how often the event has occurred historically, then " + "adjust for the current situation." 
+ ) + + +def _build_user_prompt( + task: ForecastingTask, + history_str: str, + series_meta: SeriesMetadata | None, + forecast_date: pd.Timestamp, + series_description_override: str | None = None, + suffix: str | None = None, +) -> str: + """Build the binary-probability user prompt.""" + if series_description_override is not None: + meta_block = series_description_override + else: + meta_lines: list[str] = [] + if series_meta is not None: + meta_lines.append(f"Event series: {series_meta.description} (source: {series_meta.source})") + meta_lines.append(f"Units: {series_meta.units}") + else: + meta_lines.append(f"Event series: {task.target_series_id}") + meta_block = "\n".join(meta_lines) + + base = ( + f"Question: {task.description}\n" + "\n" + f"{meta_block}\n" + "\n" + "History of past outcomes (1 = event occurred, 0 = it did not):\n" + f"{history_str}\n" + "\n" + f"The event resolves on {forecast_date.strftime('%Y-%m-%d')}.\n" + "Return a JSON object with a single 'probability' field: the calibrated probability " + "that the event occurs (resolves to 1)." 
+ ) + if suffix: + base = f"{base}\n\n{suffix.lstrip(chr(10))}" + return base + + +def _sample_probability( + *, + cfg: BinaryProbabilityLLMPredictorConfig, + system_prompt: str, + user_prompt: str | list[dict[str, Any]], +) -> tuple[_BinaryProbability, float, int, int, int]: + """Issue one structured completion and return the parsed probability.""" + base_messages: list[dict[str, Any]] = [ + {"role": "system", "content": system_prompt}, + {"role": "user", "content": user_prompt}, + ] + response_format = make_json_schema_response_format( + "BinaryProbability", _build_binary_probability_schema(cfg.elicit_reasoning) + ) + + parsed, cost_usd, in_tokens, out_tokens, parse_failures = run_async( + sample_n_async( + schema_cls=_BinaryProbability, + model=cfg.model, + base_messages=base_messages, + response_format=response_format, + n_samples=1, + temperature=cfg.temperature, + max_tokens=cfg.max_tokens, + timeout_s=cfg.timeout_s, + reasoning_effort=cfg.reasoning_effort, + api_base=cfg.openai_base_url, + api_key=cfg.openai_api_key, + ), + ) + if not parsed: + raise RuntimeError("No valid binary-probability response returned by LLM.") + return parsed[0], cost_usd, in_tokens, out_tokens, parse_failures + + +class BinaryProbabilityLLMPredictor(LLMPredictor): + """Binary-event LLM forecaster using direct probability elicitation.""" + + _method_tag: ClassVar[str] = "llmp_binary_probability" + + cfg: BinaryProbabilityLLMPredictorConfig + + def __init__(self, cfg: BinaryProbabilityLLMPredictorConfig | None = None) -> None: + super().__init__(cfg) + + @classmethod + def _default_config(cls) -> BinaryProbabilityLLMPredictorConfig: + return BinaryProbabilityLLMPredictorConfig() + + @langfuse_observe("BinaryProbabilityLLMPredictor.predict") + def predict( + self, + task: ForecastingTask, + context: ForecastContext, + ) -> list[Prediction]: + """Produce one BinaryForecast prediction from a directly elicited probability. 
+ + Raises + ------ + ValueError + If the task does not declare ``payload_type='binary'`` or requests + more than one horizon — a single probability maps to exactly one + resolution date. + """ + if task.payload_type != "binary": + raise ValueError( + f"{type(self).__name__} requires a binary task (payload_type='binary'); " + f"task '{task.task_id}' declares payload_type='{task.payload_type}'." + ) + if len(task.horizons) != 1: + raise ValueError( + f"{type(self).__name__} supports exactly one horizon per task; " + f"task '{task.task_id}' declares horizons={task.horizons}." + ) + + set_current_trace_name(self.predictor_id) + series_df, series_meta = get_history_and_meta(task, context) + if self.cfg.history_window is not None: + series_df = series_df.tail(self.cfg.history_window).reset_index(drop=True) + + offset = pd.tseries.frequencies.to_offset(task.frequency) + horizon = task.horizons[0] + forecast_date = (pd.Timestamp(context.as_of) + offset * horizon).normalize() + + history_str = serialize_history(series_df, precision=self.cfg.precision) + + # Report context (before the task/history block): text preamble (CiK + # Format A) or native PDF parts, per cfg.report_ingestion. 
+ report_docs = fetch_report_docs(config=self.cfg, context=context) + + system_prompt = _build_system_prompt( + self.cfg.system_prompt_override, elicit_reasoning=self.cfg.elicit_reasoning + ) + user_prompt = _build_user_prompt( + task, + history_str, + series_meta, + forecast_date, + series_description_override=self.cfg.series_description, + suffix=self.cfg.user_prompt_suffix, + ) + user_content = apply_report_context(config=self.cfg, docs=report_docs, user_prompt=user_prompt) + + parsed, cost_usd, in_tokens, out_tokens, parse_failures = _sample_probability( + cfg=self.cfg, + system_prompt=system_prompt, + user_prompt=user_content, + ) + + rationale = parsed.reasoning.strip() + issued_at = datetime.now(tz=timezone.utc).replace(tzinfo=None) + return [ + Prediction( + predictor_id=self.predictor_id, + task_id=task.task_id, + issued_at=issued_at, + as_of=context.as_of, + forecast_date=forecast_date.to_pydatetime(), + payload=BinaryForecast(probability=float(parsed.probability)), + metadata=self._build_metadata( + cost_usd=cost_usd, + in_tokens=in_tokens, + out_tokens=out_tokens, + parse_failures=parse_failures, + history_window=self.cfg.history_window, + extra={ + **({"rationale": rationale} if rationale else {}), + "n_report_docs": len(report_docs), + **({"report_sources": self.cfg.report_sources} if self.cfg.report_sources else {}), + }, + ), + ), + ] + + +__all__ = [ + "BinaryProbabilityLLMPredictor", + "BinaryProbabilityLLMPredictorConfig", +] +``` diff --git a/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__categorical_probability.py.md b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__categorical_probability.py.md new file mode 100644 index 0000000..4a3088c --- /dev/null +++ 
b/implementations/getting_started/concierge_agent/context/artifacts/aieng-forecasting__aieng__forecasting__methods__llm_processes__categorical_probability.py.md @@ -0,0 +1,485 @@ +# Source: aieng-forecasting/aieng/forecasting/methods/llm_processes/categorical_probability.py + +kind: python + +```python +"""CategoricalProbabilityLLMPredictor — direct categorical distribution elicitation. + +Asks an LLM for one calibrated probability per ordered category, via one +structured completion per forecast origin. This is the categorical +counterpart of +:class:`~aieng.forecasting.methods.llm_processes.binary_probability.BinaryProbabilityLLMPredictor`: +where the binary predictor elicits the single number describing a Bernoulli +distribution, this class elicits the full probability vector over the task's +ordered categories (e.g. cut < hold < hike), scored with RPS. + +The category order, labels, and series-value mapping all come from +``task.categories`` — the predictor never invents its own label set. Observed +history is serialized using category *labels* rather than raw series values so +the LLM reasons over "cut/hold/hike" instead of "-1/0/1". + +LLMs frequently return distributions that sum to 0.99 or 1.01 (e.g. three +"0.33" entries). Rather than failing on the payload validator's 1e-6 sum +tolerance, responses within :data:`RENORMALIZATION_TOLERANCE` of 1 are +renormalized (and the raw sum recorded in prediction metadata); responses +further off than that are treated as malformed and raise. 
+""" + +from __future__ import annotations + +from datetime import datetime, timezone +from math import isclose +from typing import TYPE_CHECKING, Any, ClassVar + +import pandas as pd +from aieng.forecasting.evaluation.langfuse_traces import stamp_forecast_on_trace +from aieng.forecasting.evaluation.prediction import CategoricalForecast, Prediction +from aieng.forecasting.methods.llm_processes._client import ( + langfuse_observe, + make_json_schema_response_format, + run_async, + sample_n_async, + set_current_trace_name, +) +from aieng.forecasting.methods.llm_processes.base import ( + LLMPredictor, + LLMPredictorConfig, + apply_report_context, + fetch_report_docs, + get_history_and_meta, +) +from pydantic import BaseModel, ConfigDict, Field, field_validator + + +if TYPE_CHECKING: + from aieng.forecasting.data.context import ForecastContext + from aieng.forecasting.data.models import SeriesMetadata + from aieng.forecasting.evaluation.task import ForecastingTask, TaskCategory + + +#: Maximum allowed |sum - 1| before an elicited distribution is rejected +#: instead of renormalized. +RENORMALIZATION_TOLERANCE: float = 0.05 + + +class CategoricalProbabilityLLMPredictorConfig(LLMPredictorConfig): + """Frozen configuration for :class:`CategoricalProbabilityLLMPredictor`. + + Adds only categorical-task prompt controls that preserve the + direct-distribution contract. The predictor makes one structured + completion per forecast origin and does not expose ``n_samples``. 
+ """ + + model_config = ConfigDict(frozen=True) + + history_window: int | None = Field( + default=None, + ge=1, + description="If set, only the last N cutoff-filtered observations are serialized into the prompt.", + ) + series_description: str | None = Field( + default=None, + description="Optional replacement for the metadata-derived series description block.", + ) + elicit_reasoning: bool = Field( + default=True, + description=( + "When True, ask the model for a short free-text 'reasoning' field alongside the " + "distribution, captured into Prediction.metadata['rationale'] for inspection and " + "downstream reasoning evaluation. The field is requested *after* the probabilities so " + "the model commits to the distribution first, keeping the answer-first ordering that " + "protects calibration. Set False to restore the bare distribution-only elicitation." + ), + ) + system_prompt_override: str | None = Field( + default=None, + description="Full replacement for the built-in categorical-probability system prompt.", + ) + user_prompt_suffix: str | None = Field( + default=None, + description=( + "Free-form text appended to the user prompt after the standard question. " + "Use-case recipes use this to inject domain context (covariate summaries, " + "report excerpts) without changing the elicitation contract." + ), + ) + + +class _CategoryProbability(BaseModel): + """One (label, probability) row of an elicited categorical distribution.""" + + label: str + probability: float = Field(ge=0.0, le=1.0) + + +class _CategoricalDistribution(BaseModel): + """Internal Pydantic schema for one directly elicited distribution. + + ``reasoning`` is optional so parsing succeeds whether or not the field was + requested (controlled by ``elicit_reasoning`` on the config). 
+ """ + + probabilities: list[_CategoryProbability] + reasoning: str = Field(default="") + + @field_validator("probabilities", mode="before") + @classmethod + def _coerce_mapping_to_rows(cls, value: Any) -> Any: + """Accept a ``{label: probability}`` mapping as well as the list form. + + Despite the strict ``{label, probability}`` array schema, some models + (and some proxy routes) return the distribution as a JSON object + mapping label to probability, e.g. ``{"cut": 0.25, "hold": 0.7, + "hike": 0.05}``. Coerce that shape into the canonical list of rows so a + well-formed answer is not discarded as a parse failure. + """ + if isinstance(value, dict): + return [{"label": label, "probability": probability} for label, probability in value.items()] + return value + + +def _build_categorical_distribution_schema(elicit_reasoning: bool) -> dict[str, Any]: + """Build the strict ``json_schema`` for one elicited categorical distribution. + + ``probabilities`` comes first so the model commits to the distribution + before any justification. When ``elicit_reasoning`` is True, a free-text + ``reasoning`` field is appended; strict mode with + ``additionalProperties: False`` requires every property in ``required``. + """ + properties: dict[str, Any] = { + "probabilities": { + "type": "array", + "items": { + "type": "object", + "properties": { + "label": {"type": "string"}, + "probability": {"type": "number", "minimum": 0.0, "maximum": 1.0}, + }, + "required": ["label", "probability"], + "additionalProperties": False, + }, + }, + } + required = ["probabilities"] + if elicit_reasoning: + properties["reasoning"] = {"type": "string"} + required.append("reasoning") + return { + "type": "object", + "properties": properties, + "required": required, + "additionalProperties": False, + } + + +def serialize_categorical_history(df: pd.DataFrame, categories: list[TaskCategory]) -> str: + """Render a categorical series as one ``: