diff --git a/docs/en/ragas.mdx b/docs/en/ragas.mdx new file mode 100644 index 0000000..cd23cb2 --- /dev/null +++ b/docs/en/ragas.mdx @@ -0,0 +1,144 @@ +--- +weight: 100 +--- +# Evaluating RAG with Ragas + +Retrieval-Augmented Generation (RAG) systems combine a retriever and a generator. Measuring quality on the final answer alone is often insufficient: failures may come from irrelevant or incomplete retrieval, from generation that drifts away from sources, or from both. + +[Ragas](https://docs.ragas.io/) (Retrieval-Augmented Generation Assessment) is a Python framework that scores RAG behavior with metrics for faithfulness to context, answer relevance, retrieval quality, and more—depending on which metrics are enabled and which columns are available in the evaluation set. + +This page focuses on **using the Ragas SDK directly** in a notebook or batch job. Integration with other evaluation platforms is optional and not covered here. + +## What to record for each evaluation example + +A typical single-turn row includes: + +| Field | Role | +| --- | --- | +| `user_input` | User query (or equivalent). | +| `retrieved_contexts` | Retrieved passages as a **list of strings** for that row. | +| `response` | Model output to score. | +| `reference` | Reference answer or key facts; required only for some metrics (for example context recall). | + +## Ragas metrics overview + +Ragas exposes a large catalog of metrics. Only a subset is needed for a typical RAG evaluation pass. Each metric expects specific dataset columns (for example `question`, `contexts`, `answer`, `ground_truth`) and may require an **LLM**, **embeddings**, both, or neither. Names and import paths evolve across releases; confirm the installed-version guidance in the [Ragas metrics documentation](https://docs.ragas.io/). + +The lists below summarize **intent** and common use; they are not a substitute for upstream API details. 
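Before looking at individual metrics, it can help to see the single-turn row shape from the table above, together with the kind of column check that metric selection implies, as plain Python. This is a sketch: the field values reuse the sample data from the companion notebook, and the `missing_fields` helper is illustrative, not part of the Ragas API.

```python
# A single-turn evaluation row using the field names from the table above.
# Values reuse the notebook's sample data; they are not real retrieval output.
row = {
    "user_input": "What is the capital of France?",
    "retrieved_contexts": [
        "Paris is the capital and most populous city of France.",
    ],
    "response": "The capital of France is Paris.",
    "reference": "Paris",  # needed only by reference-based metrics
}


def missing_fields(sample: dict, required: tuple[str, ...]) -> list[str]:
    """Return required fields a row lacks -- the same kind of check that
    surfaces as a validation error before scoring starts (helper is
    illustrative, not a Ragas function)."""
    return [name for name in required if name not in sample]


# Context recall, for example, needs user_input, retrieved_contexts, reference:
print(missing_fields(row, ("user_input", "retrieved_contexts", "reference")))  # -> []
```

Running this check over the whole dataset before scoring makes missing-column problems visible up front instead of partway through an expensive LLM-backed evaluation run.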
+ +### Core RAG metrics + +These metrics are most directly aligned with retrieval-augmented generation quality and are usually the first set to track: + +Unless otherwise noted, classes in this section are from `ragas.metrics.collections`. + +| Metric | Python class | Required arguments | Evaluation target | +| --- | --- | --- | --- | +| **Faithfulness** | `Faithfulness` | `user_input`, `retrieved_contexts`, `response` | Whether claims in the answer are supported by retrieved contexts (grounding / hallucination control). | +| **Answer relevancy** / **Response relevancy** | `AnswerRelevancy` | `user_input`, `response` | Whether the generated answer addresses the user query. | +| **Context precision** | `ContextPrecision` | `user_input`, `retrieved_contexts`, `reference` | Whether retrieved chunks are relevant and useful for answering the query. | +| **Context recall** | `ContextRecall` | `user_input`, `retrieved_contexts`, `reference` | Whether retrieved contexts cover the information required to answer correctly. | +| **Context entity recall** | `ContextEntityRecall` | `reference`, `retrieved_contexts` | Whether key entities from the reference are present in retrieved contexts. | +| **Context utilization** | `ContextUtilization` | `user_input`, `response`, `retrieved_contexts` | How much retrieved context is actually used by the answer. | + +Field and import requirements can vary by Ragas version and metric variant. Confirm against the installed version in the [Ragas metrics documentation](https://docs.ragas.io/). + +### Optional RAG metrics + +These metrics are useful in specific evaluation setups, especially when reference answers are available or when robustness checks are needed: + +| Metric | Python class | Required arguments | Evaluation target | +| --- | --- | --- | --- | +| **Answer correctness** | `AnswerCorrectness` | `user_input`, `response`, `reference` | Alignment of the generated answer with a reference answer. 
| +| **Answer similarity** / **Semantic similarity** | `SemanticSimilarity` | `response`, `reference` | Semantic closeness between generated and reference answers (embedding-based). | +| **Factual correctness** | `FactualCorrectness` | `response`, `reference` | Fact-level agreement with a reference or expected facts. | +| **Noise sensitivity** | `NoiseSensitivity` | `user_input`, `response`, `reference`, `retrieved_contexts` | Stability when distractors or noise are introduced into context or inputs. | + +For metrics that are weakly related to RAG core evaluation (for example generic text-overlap metrics, rubric-based custom metrics, agent/tool metrics, SQL metrics, or multimodal metrics), refer to the [Ragas metrics documentation](https://docs.ragas.io/). + +### Choosing a minimal RAG set + +A practical default for many RAG benchmarks is: **faithfulness**, **answer relevancy**, **context precision**, and **context recall** (recall and some precision variants need `ground_truth` or equivalent). Add **answer correctness** or **semantic similarity** when a reference answer is available. Match metrics to the columns present in the dataset and to cost constraints (LLM-heavy metrics are slower and more expensive). + +## Calling the Ragas SDK + +For modern Ragas usage, instantiate metrics from `ragas.metrics.collections` and score each row using `ascore()` (or `score()` in synchronous scripts). + +1. Prepare OpenAI-compatible clients (`AsyncOpenAI`) for LLM and embeddings, then wire `llm_factory` and `OpenAIEmbeddings` (see the sample notebook for environment-variable configuration). +2. Instantiate metrics with explicit dependencies (`llm`, `embeddings` where required). +3. Iterate through rows and call `metric.ascore(...)` with metric-specific arguments. 
+ + ```python + from openai import AsyncOpenAI + from ragas.embeddings import OpenAIEmbeddings + from ragas.llms import llm_factory + from ragas.metrics.collections import AnswerRelevancy, Faithfulness + + llm_client = AsyncOpenAI( + api_key="...", + base_url="https://your-openai-compatible-endpoint/v1", # or None for provider default + ) + embed_client = AsyncOpenAI( + api_key="...", # often same key as LLM when using one gateway + base_url="https://your-embedding-endpoint/v1", # optional; can match llm_client + ) + + llm = llm_factory("your-llm-model", client=llm_client) + embeddings = OpenAIEmbeddings(model="your-embedding-model", client=embed_client) + + faithfulness = Faithfulness(llm=llm) + answer_relevancy = AnswerRelevancy(llm=llm, embeddings=embeddings) + ``` + +When selecting metrics, the following differences affect how the scoring call is prepared: + +| Aspect | How it varies | +| --- | --- | +| **Required columns** | Each metric expects specific arguments (for example `reference` for many reference-based retrieval metrics). Missing fields cause validation errors before scoring starts. | +| **LLM vs embeddings vs neither** | LLM-based metrics need a language model; embedding metrics need an embedding model; lexical metrics may need no model. In the modern API, dependencies are passed explicitly when creating metric instances. | +| **Metric variants** | Different classes implement the same intent with different scorers (for example LLM vs non-LLM context precision). Metric imports and selection change accordingly. | +| **Constructor configuration** | Rubrics, aspect critics, discrete/numeric custom metrics, or specialized faithfulness variants require **instantiation arguments** or extra setup. | +| **ID-based or multi-turn data** | ID-based precision/recall expect ID columns in the dataset; multi-turn or agent/tool metrics require different sample layouts. These are outside the single-turn notebook flow. 
| + +In practice, this means the main work is to align dataset fields and metric selection, then score rows with the chosen metric instances. + +## Prerequisites + +- Python 3.10+ recommended. +- Network access to an LLM API (and to an embeddings API for metrics that need embeddings). The sample notebook assumes an OpenAI-compatible setup and supports configuring credentials and an optional base URL for compatible gateways. +- Awareness that evaluation **issues many model calls**; cost and latency scale with rows and metrics. +- **Version pinning**: Ragas APIs and metric classes change between releases. For reproducible benchmarks, pin `ragas` (and related packages) in the environment or notebook; see the commented install line in the sample notebook. + +## Runnable notebook + +Download and open the notebook in JupyterLab or another Jupyter environment: + +- [ragas-rag-eval.ipynb](/ragas-rag-eval.ipynb) + +The notebook opens with a short SDK recap focused on modern metrics (`ragas.metrics.collections`) and explicit LLM/embedding setup. The canonical explanation is the **Calling the Ragas SDK** section on this page. + +The notebook: + +1. Installs dependencies (with an optional commented version pin for reproducibility). +2. Creates a small `datasets.Dataset` with `user_input`, `retrieved_contexts`, `response`, and `reference`. +3. Runs baseline evaluation with **faithfulness** and **answer relevancy** using modern metric classes. +4. Adds optional retrieval-focused metrics (**context precision** and **context recall**) using modern metric classes. +5. Shows aggregate and per-row results, followed by a short troubleshooting section. + +## Troubleshooting + +- **Credentials or endpoint configuration**: configure LLM API credentials (and an optional base URL for compatible gateways). If embeddings use a separate endpoint, configure embeddings credentials as well, then pass separate `AsyncOpenAI` clients into `llm_factory` and `OpenAIEmbeddings`. 
+- **Dataset validation errors**: verify required arguments for selected metrics and ensure dataset keys align with modern examples (`user_input`, `retrieved_contexts`, `response`, `reference`). +- **Notebook async execution**: the sample notebook uses `await metric.ascore(...)`. For synchronous scripts, use `metric.score(...)` or wrap async code with `asyncio.run(...)`. +- **Version-related warnings**: metric classes and signatures can change across Ragas versions. Pin package versions for reproducible runs and confirm behavior against the installed version documentation. + +## Interpreting results + +- Compare scores only under the same dataset and evaluation configuration (judge LLM, embeddings, and prompts); otherwise shifts may reflect configuration changes rather than RAG quality. +- For retrieval-oriented evaluation, use the same embedding model as the production RAG retriever whenever possible to reduce metric drift caused by mismatched embedding spaces. +- Use aggregate scores for trend tracking or quality gates, and per-row scores for diagnosis (for example missing context, hallucination, or irrelevant retrieval). Treat metric values as directional signals, not absolute truth. 
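The aggregate-versus-per-row split described above can be sketched as follows. The scores and the 0.7 threshold are invented for illustration (a gate threshold is a policy choice, not a Ragas recommendation); in practice the per-row values come from the metric calls shown earlier on this page.

```python
# Sketch: aggregate score for a quality gate plus per-row triage.
# Scores are invented; THRESHOLD is a policy choice, not a Ragas default.
rows = [
    {"user_input": "q1", "faithfulness": 0.95, "answer_relevancy": 0.90},
    {"user_input": "q2", "faithfulness": 0.40, "answer_relevancy": 0.88},
    {"user_input": "q3", "faithfulness": 0.85, "answer_relevancy": 0.35},
]

THRESHOLD = 0.7


def mean(metric: str) -> float:
    """Aggregate mean for trend tracking or a release gate."""
    return sum(r[metric] for r in rows) / len(rows)


print(f"faithfulness mean: {mean('faithfulness'):.2f}")

# Per-row view for diagnosis: low faithfulness suggests hallucination or
# missing context; low answer relevancy suggests drift from the question.
flagged = [
    (r["user_input"], metric)
    for r in rows
    for metric in ("faithfulness", "answer_relevancy")
    if r[metric] < THRESHOLD
]
print(flagged)  # rows worth inspecting manually
```

A passing aggregate can still hide individual failures (q2 and q3 above), which is why per-row scores, not just means, belong in the diagnosis loop.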
+ +## Further reading + +- Ragas documentation: [https://docs.ragas.io/](https://docs.ragas.io/) +- Ragas GitHub repository: [https://github.com/vibrantlabsai/ragas](https://github.com/vibrantlabsai/ragas) diff --git a/docs/public/ragas-rag-eval.ipynb b/docs/public/ragas-rag-eval.ipynb new file mode 100644 index 0000000..193f8b6 --- /dev/null +++ b/docs/public/ragas-rag-eval.ipynb @@ -0,0 +1,339 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# RAG evaluation with the Ragas Python SDK (modern metrics API)\n", + "\n", + "This notebook shows a minimal end-to-end flow using Ragas modern metrics: build a small evaluation table (questions, retrieved contexts, and model answers), run metrics from `ragas.metrics.collections`, and inspect per-row and aggregate scores.\n", + "\n", + "**Requirements**\n", + "\n", + "- Python 3.10+ recommended.\n", + "- OpenAI-compatible API access: set `EVAL_LLM_API_KEY` and, when needed, `EVAL_LLM_BASE_URL` (the notebook also accepts `OPENAI_API_KEY` / `OPENAI_BASE_URL` as fallback).\n", + "- Optional separate embedding endpoint credentials: `EVAL_EMBED_API_KEY` and `EVAL_EMBED_BASE_URL`.\n", + "- Explicit model configuration in the notebook: `EVAL_LLM_MODEL` and `EVAL_EMBED_MODEL` are both required (the setup cell raises an error when either is unset).\n", + "- For retrieval-related metrics, use the same embedding model as the production RAG retriever whenever possible.\n", + "- Ragas calls the LLM (and embeddings where needed) multiple times; expect latency and cost proportional to dataset size and metrics.\n", + "- A dedicated virtual environment or workbench image reduces dependency conflicts with other projects.\n", + "\n", + "**References**\n", + "\n", + "- [Ragas documentation](https://docs.ragas.io/)\n", + "- Alauda AI docs: *Evaluating RAG with Ragas* — grouped metric overview and prerequisites (when browsing the documentation site)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Install 
dependencies\n", + "\n", + "Run this once per environment (for example a new workbench or virtualenv)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Use current kernel's Python so PATH does not point to another env\n", + "# If download is slow, add: -i https://pypi.tuna.tsinghua.edu.cn/simple\n", + "import sys\n", + "!{sys.executable} -m pip install \"ragas\" \"datasets\" \"openai\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Configure API credentials\n", + "\n", + "Set `EVAL_LLM_API_KEY` (recommended) or `OPENAI_API_KEY` before running evaluation. If the endpoint is not the provider default, set `EVAL_LLM_BASE_URL` (or `OPENAI_BASE_URL`) as well.\n", + "\n", + "Do not commit secrets into version control; use platform secret injection or notebook environment variables instead.\n", + "\n", + "Optional: disable Ragas analytics (`RAGAS_DO_NOT_TRACK=true`) if required by policy." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "# Config LLM API\n", + "# os.environ[\"EVAL_LLM_API_KEY\"] = \"sk-...\"\n", + "# os.environ[\"EVAL_LLM_BASE_URL\"] = \"https://your-openai-compatible-endpoint/v1\" # optional\n", + "# os.environ[\"EVAL_LLM_MODEL\"] = \"...\"\n", + "\n", + "# Config Embeddings API\n", + "# os.environ[\"EVAL_EMBED_API_KEY\"] = \"sk-...\"\n", + "# os.environ[\"EVAL_EMBED_BASE_URL\"] = \"https://your-embedding-endpoint/v1\"\n", + "# os.environ[\"EVAL_EMBED_MODEL\"] = \"...\"\n", + "\n", + "\n", + "# Optional: disable Ragas analytics\n", + "# os.environ[\"RAGAS_DO_NOT_TRACK\"] = \"true\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from openai import AsyncOpenAI\n", + "from ragas.embeddings import OpenAIEmbeddings\n", + "from ragas.llms import llm_factory\n", + "\n", + "EVAL_LLM_API_KEY = 
os.getenv(\"EVAL_LLM_API_KEY\", os.getenv(\"OPENAI_API_KEY\", \"\"))\n", + "EVAL_LLM_BASE_URL = os.getenv(\"EVAL_LLM_BASE_URL\", os.getenv(\"OPENAI_BASE_URL\", \"\"))\n", + "EVAL_LLM_MODEL = os.getenv(\"EVAL_LLM_MODEL\", \"\")\n", + "\n", + "EVAL_EMBED_API_KEY = os.getenv(\"EVAL_EMBED_API_KEY\", EVAL_LLM_API_KEY)\n", + "EVAL_EMBED_BASE_URL = os.getenv(\"EVAL_EMBED_BASE_URL\", EVAL_LLM_BASE_URL)\n", + "EVAL_EMBED_MODEL = os.getenv(\"EVAL_EMBED_MODEL\", \"\")\n", + "\n", + "if not EVAL_LLM_MODEL:\n", + " raise RuntimeError(\"Set EVAL_LLM_MODEL to an available model ID from your endpoint.\")\n", + "\n", + "if not EVAL_EMBED_MODEL:\n", + " raise RuntimeError(\"Set EVAL_EMBED_MODEL to an available model ID from your endpoint.\")\n", + "\n", + "llm_client = AsyncOpenAI(\n", + " api_key=EVAL_LLM_API_KEY,\n", + " base_url=EVAL_LLM_BASE_URL or None,\n", + ")\n", + "embed_client = AsyncOpenAI(\n", + " api_key=EVAL_EMBED_API_KEY,\n", + " base_url=EVAL_EMBED_BASE_URL or None,\n", + ")\n", + "\n", + "llm = llm_factory(EVAL_LLM_MODEL, client=llm_client)\n", + "embeddings = OpenAIEmbeddings(\n", + " model=EVAL_EMBED_MODEL,\n", + " client=embed_client,\n", + ")\n", + "\n", + "print(f\"llm_base_url={EVAL_LLM_BASE_URL or '(provider default)'}\")\n", + "print(f\"llm={EVAL_LLM_MODEL}\")\n", + "print(f\"embed_base_url={EVAL_EMBED_BASE_URL or '(provider default)'}\")\n", + "print(f\"embeddings={EVAL_EMBED_MODEL}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Build an evaluation dataset\n", + "\n", + "For the modern metrics API in this notebook, organize data as row samples (one dictionary per sample).\n", + "\n", + "Each row uses argument-aligned names:\n", + "\n", + "- `user_input`: user query\n", + "- `retrieved_contexts`: list of retrieved passages for that row\n", + "- `response`: model response to score\n", + "- `reference`: reference answer or expected facts (needed by retrieval/reference-based metrics)\n", + "\n", + "This row-first structure 
matches `ascore()` usage and avoids extra mapping." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from datasets import Dataset\n", + "\n", + "samples = [\n", + " {\n", + " \"user_input\": \"What is the capital of France?\",\n", + " \"retrieved_contexts\": [\n", + " \"Paris is the capital and most populous city of France.\"\n", + " ],\n", + " \"response\": \"The capital of France is Paris.\",\n", + " \"reference\": \"Paris\",\n", + " },\n", + " {\n", + " \"user_input\": \"Who patented an early practical telephone?\",\n", + " \"retrieved_contexts\": [\n", + " \"Alexander Graham Bell was a Scottish-born inventor who patented the first practical telephone.\"\n", + " ],\n", + " \"response\": \"Alexander Graham Bell patented an early practical telephone.\",\n", + " \"reference\": \"Alexander Graham Bell\",\n", + " },\n", + " {\n", + " \"user_input\": \"What is photosynthesis?\",\n", + " \"retrieved_contexts\": [\n", + " \"Photosynthesis is the process by which plants convert light energy into chemical energy.\"\n", + " ],\n", + " \"response\": \"Photosynthesis is how plants turn sunlight into chemical energy.\",\n", + " \"reference\": \"Plants convert light energy into chemical energy during photosynthesis.\",\n", + " },\n", + "]\n", + "\n", + "dataset = Dataset.from_list(samples)\n", + "dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Run evaluation (modern metrics)\n", + "\n", + "- **Faithfulness**: whether the answer is supported by the retrieved contexts.\n", + "- **Answer relevancy**: whether the answer addresses the question.\n", + "\n", + "This section uses metrics from `ragas.metrics.collections` with the modern embeddings/LLM interfaces." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from ragas.metrics.collections import AnswerRelevancy, Faithfulness\n", + "\n", + "faithfulness_metric = Faithfulness(llm=llm)\n", + "answer_relevancy_metric = AnswerRelevancy(llm=llm, embeddings=embeddings)\n", + "\n", + "\n", + "async def score_baseline_rows(ds):\n", + " rows = ds.to_list()\n", + " scored = []\n", + " for row in rows:\n", + " faithfulness_result = await faithfulness_metric.ascore(\n", + " user_input=row[\"user_input\"],\n", + " response=row[\"response\"],\n", + " retrieved_contexts=row[\"retrieved_contexts\"],\n", + " )\n", + " answer_relevancy_result = await answer_relevancy_metric.ascore(\n", + " user_input=row[\"user_input\"],\n", + " response=row[\"response\"],\n", + " )\n", + " scored.append(\n", + " {\n", + " \"user_input\": row[\"user_input\"],\n", + " \"faithfulness\": faithfulness_result.value,\n", + " \"answer_relevancy\": answer_relevancy_result.value,\n", + " }\n", + " )\n", + " return scored\n", + "\n", + "\n", + "baseline_scores = await score_baseline_rows(dataset)\n", + "faithfulness_avg = sum(item[\"faithfulness\"] for item in baseline_scores) / len(baseline_scores)\n", + "answer_relevancy_avg = sum(item[\"answer_relevancy\"] for item in baseline_scores) / len(baseline_scores)\n", + "\n", + "print(\"Aggregate means:\")\n", + "print(f\"faithfulness={faithfulness_avg:.4f}\")\n", + "print(f\"answer_relevancy={answer_relevancy_avg:.4f}\")\n", + "print(\"\\nPer-row scores:\")\n", + "for idx, item in enumerate(baseline_scores, start=1):\n", + " print(\n", + " f\"{idx}. 
user_input={item['user_input']} | \"\n", + " f\"faithfulness={item['faithfulness']:.4f} | \"\n", + " f\"answer_relevancy={item['answer_relevancy']:.4f}\"\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Add retrieval-focused metrics (modern metrics)\n", + "\n", + "- **Context precision**: whether retrieved chunks are useful for answering the question.\n", + "- **Context recall**: whether retrieved contexts cover what the reference (`ground_truth`) states.\n", + "\n", + "This pass issues additional LLM calls. If validation errors mention missing columns, adjust the dataset or choose another metric variant per the [Ragas metrics documentation](https://docs.ragas.io/)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from ragas.metrics.collections import ContextPrecision, ContextRecall\n", + "\n", + "context_precision_metric = ContextPrecision(llm=llm)\n", + "context_recall_metric = ContextRecall(llm=llm)\n", + "\n", + "\n", + "async def score_retrieval_rows(ds):\n", + " rows = ds.to_list()\n", + " scored = []\n", + " for row in rows:\n", + " context_precision_result = await context_precision_metric.ascore(\n", + " user_input=row[\"user_input\"],\n", + " reference=row[\"reference\"],\n", + " retrieved_contexts=row[\"retrieved_contexts\"],\n", + " )\n", + " context_recall_result = await context_recall_metric.ascore(\n", + " user_input=row[\"user_input\"],\n", + " retrieved_contexts=row[\"retrieved_contexts\"],\n", + " reference=row[\"reference\"],\n", + " )\n", + " scored.append(\n", + " {\n", + " \"user_input\": row[\"user_input\"],\n", + " \"context_precision\": context_precision_result.value,\n", + " \"context_recall\": context_recall_result.value,\n", + " }\n", + " )\n", + " return scored\n", + "\n", + "\n", + "retrieval_scores = await score_retrieval_rows(dataset)\n", + "context_precision_avg = sum(item[\"context_precision\"] for item in retrieval_scores) / 
len(retrieval_scores)\n", + "context_recall_avg = sum(item[\"context_recall\"] for item in retrieval_scores) / len(retrieval_scores)\n", + "\n", + "print(\"Aggregate means:\")\n", + "print(f\"context_precision={context_precision_avg:.4f}\")\n", + "print(f\"context_recall={context_recall_avg:.4f}\")\n", + "print(\"\\nPer-row scores:\")\n", + "for idx, item in enumerate(retrieval_scores, start=1):\n", + " print(\n", + " f\"{idx}. user_input={item['user_input']} | \"\n", + " f\"context_precision={item['context_precision']:.4f} | \"\n", + " f\"context_recall={item['context_recall']:.4f}\"\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Troubleshooting\n", + "\n", + "- **Model not found (`Model Not Exist`)**: set `EVAL_LLM_MODEL` (and, when overridden, `EVAL_EMBED_MODEL`) to an available model ID from the endpoint (for example via `/models`).\n", + "- **Credentials or endpoint setup**: set `EVAL_LLM_API_KEY` / `EVAL_LLM_BASE_URL` (fallback: `OPENAI_API_KEY` / `OPENAI_BASE_URL`). If embeddings use a separate endpoint, also set `EVAL_EMBED_API_KEY` / `EVAL_EMBED_BASE_URL`.\n", + "- **Notebook async execution**: this notebook uses `await metric.ascore(...)` in cells. If running outside notebook contexts, use `asyncio.run(...)` or metric `.score(...)` in synchronous scripts.\n", + "- **Version-related warnings**: metric classes and signatures can change across Ragas releases. Pin package versions for reproducible runs and confirm behavior against [docs.ragas.io](https://docs.ragas.io/)." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}