+
+
\ No newline at end of file
diff --git a/tools/agent_efficacy/EFFICACY_GUIDE.md b/tools/agent_efficacy/EFFICACY_GUIDE.md
new file mode 100644
index 0000000000..edd1c76ac0
--- /dev/null
+++ b/tools/agent_efficacy/EFFICACY_GUIDE.md
@@ -0,0 +1,71 @@
+# Agent Efficacy Testing Guide
+
+The efficacy calculation pipeline (`calculate_efficacy.py`) evaluates the autoschematization agent by comparing its predicted output against human-reviewed "gold standard" directories. It uses semantic graph comparison (`mcf_diff.py`) to compute **Precision, Recall, and the F1 Score**.
+
+---
+
+### 1. Understanding the Evaluated Files
+
+The script compares four primary output artifacts generated by the agent against their reviewed counterparts. It computes True Positives (TP), False Positives (FP), and False Negatives (FN) to determine the overall agent efficacy.
+
+* **`output_pvmap.csv` (Primary Efficacy Metric):**
+ * **What it is:** The Property-Value (PV) mapping file that links dataset column headers to Data Commons schema nodes.
+ * **How it's tested:** This is processed using the `PropertyValueMapper`. The script extracts a normalized semantic graph and calculates the "Hero Metrics" (Precision, Recall, F1) that define the overall run's success.
+* **`output_stat_vars.mcf`:**
+ * **What it is:** Defines the statistical variables extracted from the dataset.
+ * **How it's tested:** The script uses **fingerprinting** (`use_fingerprint=True`): it checks whether the core semantic definition (properties and values) of each StatVar is correct, even if the node ID (DCID) strings differ in functionally irrelevant ways.
+* **`output_stat_vars_schema.mcf` & `output.tmcf`:**
+ * **What they are:** `output_stat_vars_schema.mcf` extends the schema definitions, and `output.tmcf` defines the tabular template bindings.
+ * **How they're tested:** Each file is compared as a standard MCF graph to check that its underlying structure matches the gold standard.
+
+---
+
+### 2. Testing a Single Dataset
+
+The `calculate_efficacy.py` script takes its input and output paths as command-line arguments. Run it inside the project's virtual environment.
+
+**Command Line Arguments:**
+* `--test`: Path to the test (prediction) directory.
+* `--gold`: Path to the gold (reviewed) directory.
+* `--output`: Path to the directory where the efficacy results dashboard should be saved.
+* `--dataset_id` (Optional): The dataset ID for display purposes in the HTML dashboard.
+
+**Example execution:**
+```bash
+source venv/bin/activate
+python tools/agent_efficacy/calculate_efficacy.py \
+ --test test_DESA-GENDER_2025_OBS_ICT_SKILL_RT \
+ --gold undata/DESA/output/reviewed_pvmap_harish/DESA-GENDER_2025_OBS_ICT_SKILL_RT \
+ --output undata/DESA/efficacy_results/DESA-GENDER_2025_OBS_ICT_SKILL_RT \
+ --dataset_id DESA-GENDER_2025_OBS_ICT_SKILL_RT
+```
+**Results:** Open the generated dashboard (`Agent_Efficacy_Board.html` inside your `--output` directory) in your web browser.
+
+---
+
+### 3. Testing in Bulk (Multiple Datasets)
+
+To evaluate the agent across multiple datasets in one run, pass the built-in `--bulk` flag. With this flag, `--test` and `--gold` are treated as parent directories that contain the individual dataset directories.
+
+**Example Bulk Run Execution:**
+
+```bash
+# Ensure virtual environment is active
+source venv/bin/activate
+
+python tools/agent_efficacy/calculate_efficacy.py \
+ --bulk \
+ --test undata/DESA/output/unreviewed \
+ --gold undata/DESA/output/reviewed \
+ --output undata/DESA/efficacy_results/bulk_run
+```
+
+The script will automatically detect datasets and generate a unique run directory (e.g., `bulk_run_20260601_120000`) inside your `--output` path. This directory will contain a separate HTML dashboard folder for every evaluated dataset, along with an aggregated `summary.csv` file containing the F1, Precision, Recall, TP, FP, and FN scores for all datasets in that run.
+
+---
+
+### 4. Interpreting the Output
+
+The HTML dashboard (`Agent_Efficacy_Board.html`) gives you a visual breakdown of the metrics:
+* **Hero Metrics (Top Section):** Precision, Recall, and F1 for `output_pvmap.csv`. A high F1 score here means the agent mapped columns to the correct semantic properties.
+* **Detailed Semantic Comparisons:** Look here to see the specific True Positives, False Positives (incorrect mappings), and False Negatives (missed mappings). The dashboard will output raw MCF node diffs to show you exactly *where* the agent deviated from the reviewed standard for `pvmap`, `stat_vars`, `schema`, and `tmcf`.
\ No newline at end of file
diff --git a/tools/agent_efficacy/USAGE_EXAMPLES.md b/tools/agent_efficacy/USAGE_EXAMPLES.md
new file mode 100644
index 0000000000..916b9063c0
--- /dev/null
+++ b/tools/agent_efficacy/USAGE_EXAMPLES.md
@@ -0,0 +1,44 @@
+# Efficacy Tool Quick-Start Examples
+
+This guide provides sample commands for running the `calculate_efficacy.py` script in both single and bulk modes.
+
+## 1. Single Dataset Evaluation
+Use this command to compare a single agent-generated folder against a reviewed gold standard.
+
+### Sample Command
+```bash
+python3 tools/agent_efficacy/calculate_efficacy.py \
+ --test test_DESA-GENDER_2025_OBS_ICT_SKILL_RT \
+ --gold undata/DESA/output/reviewed_pvmap_harish/DESA-GENDER_2025_OBS_ICT_SKILL_RT \
+ --output /tmp/efficacy_results/single_run \
+ --dataset_id ICT_SKILL_RT
+```
+
+### Result
+- Dashboard: `/tmp/efficacy_results/single_run/Agent_Efficacy_Board.html`
+- Metrics: Precision, Recall, and F1 are populated in the HTML dashboard.
+
+---
+
+## 2. Bulk Evaluation (Multiple Datasets)
+Use this command to evaluate all datasets in a directory. It will create a unique, timestamped folder for the run.
+
+### Sample Command
+```bash
+python3 tools/agent_efficacy/calculate_efficacy.py \
+ --bulk \
+ --test undata/DESA/output/agent_predictions \
+ --gold undata/DESA/output/reviewed_pvmap_harish \
+ --output /tmp/efficacy_results/bulk_runs
+```
+
+### Result
+- Output Directory: `/tmp/efficacy_results/bulk_runs/bulk_run_YYYYMMDD_HHMMSS/`
+- Summary File: `summary.csv` inside the new run folder.
+- Dashboards: Individual `Agent_Efficacy_Board.html` files for every dataset found.
+
+---
+
+## 3. How to Present Results
+1. **HTML Dashboard:** Open the `Agent_Efficacy_Board.html` file in any browser to view the "Hero Metrics" and detailed semantic diffs.
+2. **Summary CSV:** Use the `summary.csv` generated during bulk runs to create a high-level report or table of performance across all indicators.
diff --git a/tools/agent_efficacy/calculate_efficacy.py b/tools/agent_efficacy/calculate_efficacy.py
new file mode 100644
index 0000000000..f322624f0e
--- /dev/null
+++ b/tools/agent_efficacy/calculate_efficacy.py
@@ -0,0 +1,213 @@
+import os
+import sys
+import re
+import csv
+import argparse
+import datetime
+
+# Set up paths to import tools; the repo root is two levels above this file's directory
+PROJECT_ROOT = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+sys.path.append(os.path.join(PROJECT_ROOT, 'tools/statvar_importer'))
+sys.path.append(os.path.join(PROJECT_ROOT, 'util'))
+sys.path.append(os.path.join(PROJECT_ROOT, 'tools/agentic_import/metrics'))
+
+import mcf_diff
+from counters import Counters
+from property_value_mapper import PropertyValueMapper
+from pvmap_generator_metrics import PVMapGeneratorMetricsRunner
+
+def get_metrics_from_counters(counters_obj):
+ # Reuse the metrics formula from pvmap_generator_metrics.py; None fills the method's first parameter (presumably `self`)
+ diff_stats = {'counters': counters_obj.get_counters()}
+ stats = PVMapGeneratorMetricsRunner.get_stats_from_diff_counters(None, diff_stats)
+
+ return {
+ 'tp': stats.get('true_positive', 0),
+ 'fp': stats.get('false_positive', 0),
+ 'fn': stats.get('false_negative', 0),
+ 'precision': stats.get('precision', 0),
+ 'recall': stats.get('recall', 0),
+ 'f1': stats.get('f1', 0)
+ }
+
+def load_pv_map_nodes(file_path):
+ """Loads PV map into MCF-like nodes using PropertyValueMapper normalization."""
+ pv_mapper = PropertyValueMapper()
+ pv_mapper.load_pvs_from_file(file_path)
+ # Get the raw GLOBAL map
+ raw_map = pv_mapper.get_pv_map().get('GLOBAL', {})
+ # Convert to standard node format: {dcid: {prop: val}}
+ nodes = {}
+ for key, pvs in raw_map.items():
+ # Keys are used as node IDs as-is; no extra cleaning is applied here
+ nodes[key] = pvs
+ return nodes
+
+def run_comparison(pred_path, gold_path, is_pvmap=False, use_fingerprint=False):
+ if not os.path.exists(pred_path) or not os.path.exists(gold_path):
+ print(f"Skipping: missing {pred_path} or {gold_path}")
+ return "", {'tp': 0, 'fp': 0, 'fn': 0, 'precision': 0, 'recall': 0, 'f1': 0}
+
+ counters = Counters()
+ config = {
+ 'show_diff_nodes_only': True,
+ 'ignore_property': ['description', 'provenance', 'memberOf', 'member', 'name', 'constraintProperties', 'keyString', 'relevantVariable'],
+ 'fingerprint_dcid': use_fingerprint
+ }
+
+ if is_pvmap:
+ # Specialized loading for PV Maps to handle wide vs narrow formats
+ nodes1 = load_pv_map_nodes(pred_path)
+ nodes2 = load_pv_map_nodes(gold_path)
+ print(f" [PVMap] Loaded {len(nodes1)} nodes from pred, {len(nodes2)} from gold")
+ diff_text = mcf_diff.diff_mcf_nodes(nodes1, nodes2, config, counters)
+ else:
+ # Standard MCF loading
+ diff_text = mcf_diff.diff_mcf_files(pred_path, gold_path, config, counters)
+
+ metrics = get_metrics_from_counters(counters)
+ return diff_text, metrics
+
+def update_html(template_content, dataset_id, hero_metrics, detailed_results):
+ content = template_content
+ for key, label in [('f1', 'Agent Efficacy (F1)'), ('precision', 'Precision'), ('recall', 'Recall')]:
+ escaped_label = re.escape(label)
+ content = re.sub(
+ rf'({escaped_label}.*?tracking-tight(?:er)?">)([\d\.]+%)(