datacommonsorg · nehil-gif · Jun 4, 2026 · gemini-code-assist · Jun 4, 2026 · gemini-code-assist
diff --git a/tools/agent_efficacy/Agent_Efficacy_Board.html b/tools/agent_efficacy/Agent_Efficacy_Board.html
diff --git a/tools/agent_efficacy/EFFICACY_GUIDE.md b/tools/agent_efficacy/EFFICACY_GUIDE.md
@@ -0,0 +1,71 @@
+# Agent Efficacy Testing Guide
+
+The efficacy calculation pipeline (`calculate_efficacy.py`) evaluates the quality of the autoschematization agent by comparing its output predictions against human-reviewed "gold standard" directories. It leverages semantic graph comparisons (`mcf_diff.py`) to accurately compute metrics like **Precision, Recall, and the F1 Score**.
+
+---
+
+### 1. Understanding the Evaluated Files
+
+The script compares four primary output artifacts generated by the agent against their reviewed counterparts. It computes True Positives (TP), False Positives (FP), and False Negatives (FN) to determine the overall agent efficacy.
+
+*   **`output_pvmap.csv` (Primary Efficacy Metric):**
+    *   **What it is:** The Property-Value (PV) mapping file that links dataset column headers to Data Commons schema nodes.
+    *   **How it's tested:** The file is loaded with `PropertyValueMapper`, which normalizes it into a graph of nodes. That graph is diffed against the gold PV map to produce TP, FP, and FN counts, which feed the dashboard's "Hero Metrics" (Precision, Recall, F1).
+*   **`output_stat_vars.mcf`:**
+    *   **What it is:** Defines the statistical variables extracted from the dataset.
+    *   **How it's tested:** The script uses **fingerprinting** (`use_fingerprint=True`). This means it evaluates if the core semantic definition (properties and values) of the StatVar is correct, even if the exact node ID (DCID) strings have slight, functionally irrelevant differences.
+*   **`output_stat_vars_schema.mcf` & `output.tmcf`:**
+    *   **What it is:** Extends the schema definitions and dictates the tabular template bindings.
+    *   **How it's tested:** Compared as standard MCF graphs (no fingerprinting) to check that the structure matches the gold standard. Descriptive properties such as `name` and `description` are ignored in every comparison.
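
To make the fingerprinting idea above concrete, here is a minimal sketch. It is not the implementation in `mcf_diff.py`; the function name `node_fingerprint`, the ignored keys, and the sample nodes are all hypothetical. It only shows why two nodes with different DCIDs can still compare as equal when their property-value pairs agree.

```python
import hashlib


def node_fingerprint(pvs, ignore=("dcid", "Node")):
    """Hash a node's property-value pairs, leaving out its identifier.

    Hypothetical helper for illustration; mcf_diff may compute this differently.
    """
    # Sort the pairs so the hash does not depend on property order.
    items = sorted((prop, str(val)) for prop, val in pvs.items() if prop not in ignore)
    return hashlib.sha256(repr(items).encode("utf-8")).hexdigest()


# Two made-up StatVar nodes: same semantics, different DCID strings.
pred_node = {
    "dcid": "dcid:Count_Person_Female",
    "typeOf": "dcs:StatisticalVariable",
    "populationType": "dcs:Person",
    "gender": "dcs:Female",
}
gold_node = {
    "dcid": "dcid:Count_Female_Person",
    "gender": "dcs:Female",
    "populationType": "dcs:Person",
    "typeOf": "dcs:StatisticalVariable",
}

same_definition = node_fingerprint(pred_node) == node_fingerprint(gold_node)
print("Same semantic definition:", same_definition)
```

Keying nodes by such a hash instead of by DCID means a renamed but otherwise identical StatVar counts as a match rather than as one false positive plus one false negative.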
+
+---
+
+### 2. Testing a Single Dataset
+
+The `calculate_efficacy.py` script takes its input and output paths as command-line arguments. Run it inside the project's virtual environment.
+
+**Command Line Arguments:**
+*   `--test`: Path to the test (prediction) directory.
+*   `--gold`: Path to the gold (reviewed) directory.
+*   `--output`: Path to the directory where the efficacy results dashboard should be saved.
+*   `--dataset_id` (Optional): The dataset ID for display purposes in the HTML dashboard.
+
+**Example execution:**
+```bash
+source venv/bin/activate
+python tools/agent_efficacy/calculate_efficacy.py \
+  --test test_DESA-GENDER_2025_OBS_ICT_SKILL_RT \
+  --gold undata/DESA/output/reviewed_pvmap_harish/DESA-GENDER_2025_OBS_ICT_SKILL_RT \
+  --output undata/DESA/efficacy_results/DESA-GENDER_2025_OBS_ICT_SKILL_RT \
+  --dataset_id DESA-GENDER_2025_OBS_ICT_SKILL_RT
+```
+**Results:** Open the generated dashboard (`Agent_Efficacy_Board.html` inside your `--output` directory) in your web browser.
+
+---
+
+### 3. Testing in Bulk (Multiple Datasets)
+
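+To evaluate the agent across multiple datasets in one run, pass the `--bulk` flag. The `--test` and `--gold` arguments are then treated as parent directories: each subdirectory of `--gold` is a dataset, and it is matched to a subdirectory of `--test` with the same name or with a `test_` prefix. Datasets with no matching test directory are skipped.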
+
+**Example Bulk Run Execution:**
+
+```bash
+# Ensure virtual environment is active
+source venv/bin/activate
+
+python tools/agent_efficacy/calculate_efficacy.py \
+  --bulk \
+  --test undata/DESA/output/unreviewed \
+  --gold undata/DESA/output/reviewed \
+  --output undata/DESA/efficacy_results/bulk_run
+```
+
+The script will automatically detect datasets and generate a unique run directory (e.g., `bulk_run_20260601_120000`) inside your `--output` path. This directory will contain a separate HTML dashboard folder for every evaluated dataset, along with an aggregated `summary.csv` file containing the F1, Precision, Recall, TP, FP, and FN scores for all datasets in that run.
+
+---
+
+### 4. Interpreting the Output
+
+The HTML dashboard (`Agent_Efficacy_Board.html`) gives you a visual breakdown of the metrics:
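+*   **Hero Metrics (Top Section):** Precision, Recall, and F1 computed from the TP, FP, and FN counts that `process_single_dataset` sums across all four compared files (`pvmap`, `stat_vars`, `schema`, and `tmcf`). A high F1 score here means the agent's output largely matches the reviewed standard; check the per-file breakdown to see how much of it comes from the `pvmap.csv` column mappings.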
+*   **Detailed Semantic Comparisons:** Look here to see the specific True Positives, False Positives (incorrect mappings), and False Negatives (missed mappings). The dashboard will output raw MCF node diffs to show you exactly *where* the agent deviated from the reviewed standard for `pvmap`, `stat_vars`, `schema`, and `tmcf`.
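
As a worked example of how the top-level numbers relate to the per-file ones, the sketch below sums TP, FP, and FN over the four files and then applies the same precision, recall, and F1 formulas used in `process_single_dataset`. The per-file counts are invented for illustration, and `precision_recall_f1` is a hypothetical helper name.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1), using 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up (tp, fp, fn) counts for the four compared files.
per_file_counts = {
    "pvmap.csv": (40, 5, 3),
    "stat_vars.mcf": (30, 10, 7),
    "stat_vars_schema.mcf": (6, 0, 0),
    "tmcf": (4, 5, 10),
}

# Sum the counts first, then compute the metrics once on the totals.
total_tp = sum(counts[0] for counts in per_file_counts.values())
total_fp = sum(counts[1] for counts in per_file_counts.values())
total_fn = sum(counts[2] for counts in per_file_counts.values())

precision, recall, f1 = precision_recall_f1(total_tp, total_fp, total_fn)
print(f"TP={total_tp} FP={total_fp} FN={total_fn}")
print(f"Precision={precision:.1%} Recall={recall:.1%} F1={f1:.1%}")
```

Because the counts are pooled before the formulas are applied, a file with many nodes weighs more in the headline F1 than a file with few, which is worth keeping in mind when comparing datasets of different sizes.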
diff --git a/tools/agent_efficacy/USAGE_EXAMPLES.md b/tools/agent_efficacy/USAGE_EXAMPLES.md
@@ -0,0 +1,44 @@
+# Efficacy Tool Quick-Start Examples
+
+This guide provides sample commands for running the `calculate_efficacy.py` script in both single and bulk modes.
+
+## 1. Single Dataset Evaluation
+Use this command to compare a single agent-generated folder against a reviewed gold standard.
+
+### Sample Command
+```bash
+python3 tools/agent_efficacy/calculate_efficacy.py \
+  --test /usr/local/google/home/nehil/datacommons/import/git/data/test_DESA-GENDER_2025_OBS_ICT_SKILL_RT \
+  --gold /usr/local/google/home/nehil/datacommons/import/git/data/undata/DESA/output/reviewed_pvmap_harish/DESA-GENDER_2025_OBS_ICT_SKILL_RT \
+  --output /usr/local/google/home/nehil/datacommons/import/git/data/tmp/efficacy_results/single_run \
+  --dataset_id ICT_SKILL_RT
+```
+
+### Result
+- Dashboard: `Agent_Efficacy_Board.html`, written to the directory passed as `--output`.
+- Metrics: Precision, Recall, and F1 are populated in the HTML dashboard.
+
+---
+
+## 2. Bulk Evaluation (Multiple Datasets)
+Use this command to evaluate all datasets in a directory. It will create a unique, timestamped folder for the run.
+
+### Sample Command
+```bash
+python3 tools/agent_efficacy/calculate_efficacy.py \
+  --bulk \
+  --test /usr/local/google/home/nehil/datacommons/import/git/data/undata/DESA/output/agent_predictions \
+  --gold /usr/local/google/home/nehil/datacommons/import/git/data/undata/DESA/output/reviewed_pvmap_harish \
+  --output /tmp/efficacy_results/bulk_runs
+```
+
+### Result
+- Output Directory: `/tmp/efficacy_results/bulk_runs/bulk_run_20260602_HHMMSS/`
+- Summary File: `summary.csv` inside the new run folder.
+- Dashboards: Individual `Agent_Efficacy_Board.html` files for every dataset found.
+
+---
+
+## 3. How to Present Results
+1. **HTML Dashboard:** Open the `Agent_Efficacy_Board.html` file in any browser to view the "Hero Metrics" and detailed semantic diffs.
+2. **Summary CSV:** Use the `summary.csv` generated during bulk runs to create a high-level report or table of performance across all indicators.
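
One possible way to turn `summary.csv` into a short report is sketched below. It assumes only the column names written by `calculate_efficacy.py` (`dataset_id`, `f1`, `precision`, `recall`, `tp`, `fp`, `fn`); the helper `rank_by_f1` and the sample rows are hypothetical, and the CSV is read from a string so the example runs without a real bulk run.

```python
import csv
import io


def rank_by_f1(csv_text):
    """Return (dataset_id, f1) pairs sorted from lowest to highest F1."""
    rows = csv.DictReader(io.StringIO(csv_text))
    pairs = [(row["dataset_id"], float(row["f1"])) for row in rows]
    return sorted(pairs, key=lambda pair: pair[1])


# Made-up contents in the same column layout as summary.csv.
sample_summary = """dataset_id,f1,precision,recall,tp,fp,fn
DATASET_A,0.9,0.9,0.9,90,10,10
DATASET_B,0.5,0.5,0.5,5,5,5
DATASET_C,0.75,0.75,0.75,30,10,10
"""

ranking = rank_by_f1(sample_summary)
for dataset_id, f1 in ranking:
    print(f"{dataset_id:<12} F1={f1:.1%}")
```

Listing the weakest datasets first puts the cases that most need manual review at the top of the report.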
diff --git a/tools/agent_efficacy/calculate_efficacy.py b/tools/agent_efficacy/calculate_efficacy.py
@@ -0,0 +1,213 @@
+import os
+import sys
+import re
+import csv
+import argparse
+import datetime
+
+# Set up paths to import tools
+PROJECT_ROOT = '/usr/local/google/home/nehil/datacommons/import/git/data'
-PROJECT_ROOT = '/usr/local/google/home/nehil/datacommons/import/git/data'
+def find_repo_root(path):
+    if os.path.exists(os.path.join(path, '.git')) or os.path.exists(os.path.join(path, 'WORKSPACE')):
+        return path
+    parent = os.path.dirname(path)
+    return find_repo_root(parent) if parent != path else path
+PROJECT_ROOT = find_repo_root(os.path.abspath(os.path.dirname(__file__)))
+sys.path.append(os.path.join(PROJECT_ROOT, 'tools/statvar_importer'))
+sys.path.append(os.path.join(PROJECT_ROOT, 'util'))
+sys.path.append(os.path.join(PROJECT_ROOT, 'tools/agentic_import/metrics'))
+
+import mcf_diff
+from counters import Counters
+from property_value_mapper import PropertyValueMapper
+from pvmap_generator_metrics import PVMapGeneratorMetricsRunner
+
+def get_metrics_from_counters(counters_obj):
+    # Use metrics formula directly from pvmap_generator_metrics.py
+    diff_stats = {'counters': counters_obj.get_counters()}
+    stats = PVMapGeneratorMetricsRunner.get_stats_from_diff_counters(None, diff_stats)
-    stats = PVMapGeneratorMetricsRunner.get_stats_from_diff_counters(None, diff_stats)
+    stats = PVMapGeneratorMetricsRunner.get_stats_from_diff_counters(diff_stats)
+
+    return {
+        'tp': stats.get('true_positive', 0), 
+        'fp': stats.get('false_positive', 0), 
+        'fn': stats.get('false_negative', 0),
+        'precision': stats.get('precision', 0), 
+        'recall': stats.get('recall', 0), 
+        'f1': stats.get('f1', 0)
+    }
+
+def load_pv_map_nodes(file_path):
+    """Loads PV map into MCF-like nodes using PropertyValueMapper normalization."""
+    pv_mapper = PropertyValueMapper()
+    pv_mapper.load_pvs_from_file(file_path)
+    # Get the raw GLOBAL map
+    raw_map = pv_mapper.get_pv_map().get('GLOBAL', {})
+    # Return a shallow copy in the node format used for diffing: {key: {prop: val}}.
+    # Keys are used as-is; no DCID cleanup is applied here.
+    return dict(raw_map)
+
+def run_comparison(pred_path, gold_path, is_pvmap=False, use_fingerprint=False):
+    if not os.path.exists(pred_path) or not os.path.exists(gold_path):
+        print(f"Skipping: missing {pred_path} or {gold_path}")
+        return "", {'tp': 0, 'fp': 0, 'fn': 0, 'precision': 0, 'recall': 0, 'f1': 0}
+
+    counters = Counters()
+    config = {
+        'show_diff_nodes_only': True,
+        'ignore_property': ['description', 'provenance', 'memberOf', 'member', 'name', 'constraintProperties', 'keyString', 'relevantVariable'],
+        'fingerprint_dcid': use_fingerprint
+    }
+
+    if is_pvmap:
+        # Specialized loading for PV Maps to handle wide vs narrow formats
+        nodes1 = load_pv_map_nodes(pred_path)
+        nodes2 = load_pv_map_nodes(gold_path)
+        print(f"  [PVMap] Loaded {len(nodes1)} nodes from pred, {len(nodes2)} from gold")
+        diff_text = mcf_diff.diff_mcf_nodes(nodes1, nodes2, config, counters)
+    else:
+        # Standard MCF loading
+        diff_text = mcf_diff.diff_mcf_files(pred_path, gold_path, config, counters)
+
+    metrics = get_metrics_from_counters(counters)
+    return diff_text, metrics
+
+def update_html(template_content, dataset_id, hero_metrics, detailed_results):
+    content = template_content
+    for key, label in [('f1', 'Agent Efficacy (F1)'), ('precision', 'Precision'), ('recall', 'Recall')]:
+        escaped_label = re.escape(label)
+        content = re.sub(
+            rf'({escaped_label}.*?tracking-tight(?:er)?">)([\d\.]+%)(</p>)',
+            r'\g<1>' + f"{hero_metrics[key]*100:.1f}%" + r'\g<3>',
+            content, flags=re.DOTALL
+        )
+
+    content = re.sub(r'(True Positives.*?tracking-tighter">)([\d,]+)(</p>)', r'\g<1>' + f"{hero_metrics['tp']:,}" + r'\g<3>', content, flags=re.DOTALL)
+    content = re.sub(r'(False Positives.*?tracking-tighter">)([\d,]+)(</p>)', r'\g<1>' + f"{hero_metrics['fp']:,}" + r'\g<3>', content, flags=re.DOTALL)
+    content = re.sub(r'(False Negatives.*?tracking-tighter">)([\d,]+)(</p>)', r'\g<1>' + f"{hero_metrics['fn']:,}" + r'\g<3>', content, flags=re.DOTALL)
+    content = re.sub(r'Run ID: .*?</div>', f'Run ID: {dataset_id} (Rechecked)</div>', content)
+
+    if '<section class="mt-8 px-4 sm:px-6 lg:px-8 pb-12">' in content:
+        content = content.split('<section class="mt-8 px-4 sm:px-6 lg:px-8 pb-12">')[0]
+
+    diff_sections = '<section class="mt-8 px-4 sm:px-6 lg:px-8 pb-12"><h2 class="text-2xl lg:text-3xl font-extrabold text-slate-900 mb-6">Detailed Semantic Comparisons</h2>'
+    for label, data in detailed_results.items():
+        diff_sections += f'''
+        <div class="mb-10 bg-white p-6 lg:p-8 rounded-2xl shadow-lg border-2 border-slate-200">
+            <h3 class="text-xl lg:text-2xl font-black mb-4 uppercase tracking-wider text-indigo-900 border-b-2 border-indigo-100 pb-2">{label}</h3>
+            <div class="grid grid-cols-3 gap-4 mb-6">
+                <div class="bg-emerald-50 p-4 rounded-xl border-2 border-emerald-200"><span class="block text-xs font-black uppercase tracking-widest">Precision</span><span class="text-2xl font-black">{data['metrics']['precision']*100:.1f}%</span></div>
+                <div class="bg-amber-50 p-4 rounded-xl border-2 border-amber-200"><span class="block text-xs font-black uppercase tracking-widest">Recall</span><span class="text-2xl font-black">{data['metrics']['recall']*100:.1f}%</span></div>
+                <div class="bg-indigo-50 p-4 rounded-xl border-2 border-indigo-200"><span class="block text-xs font-black uppercase tracking-widest">F1 Score</span><span class="text-2xl font-black">{data['metrics']['f1']*100:.1f}%</span></div>
+            </div>
+            <div class="text-xs font-bold text-slate-500 mb-2 font-mono uppercase tracking-widest">Semantic Match (TP: {data['metrics']['tp']} | FP: {data['metrics']['fp']} | FN: {data['metrics']['fn']})</div>
+            <pre class="bg-slate-900 text-slate-100 p-6 rounded-xl overflow-auto max-h-[30rem] text-sm font-mono whitespace-pre-wrap">{data['diff'] if data['diff'] else 'No differences found.'}</pre>
+        </div>
+        '''
+    diff_sections += '</section>'
+    return content.replace('</body>', diff_sections + '</body>') if '</body>' in content else content + diff_sections
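
The regex substitution in `update_html` only changes the page if the template contains a placeholder in exactly the expected shape, and it fails silently otherwise. The sketch below applies the same pattern to a small HTML snippet and uses `re.subn` to report how many placeholders were replaced. The snippet markup is an assumption (the real `Agent_Efficacy_Board.html` is not shown here), and `set_metric` is a hypothetical helper.

```python
import re


def set_metric(html_text, label, value):
    """Replace the percentage that follows `label`; return (new_html, count)."""
    pattern = rf'({re.escape(label)}.*?tracking-tight(?:er)?">)([\d\.]+%)(</p>)'

    def replace(match):
        # Keep the surrounding markup (groups 1 and 3); swap only the number.
        return f"{match.group(1)}{value * 100:.1f}%{match.group(3)}"

    return re.subn(pattern, replace, html_text, flags=re.DOTALL)


# Assumed template fragment, for illustration only.
snippet = '<p>Precision</p>\n<p class="text-4xl tracking-tighter">0.0%</p>'

updated, replaced = set_metric(snippet, "Precision", 0.857)
print(updated)
print("placeholders replaced:", replaced)

# A label that is absent from the template leaves the text untouched.
unchanged, missing = set_metric(snippet, "Recall", 0.5)
print("placeholders replaced for Recall:", missing)
```

Checking the returned count makes a template change visible right away, instead of producing a dashboard that still shows the placeholder values.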
+
+def process_single_dataset(test_dir, gold_dir, output_dir, dataset_id, template_content):
+    os.makedirs(output_dir, exist_ok=True)
+
+    file_mappings = [
+        ('pvmap.csv', 'output_pvmap.csv', 'output_pvmap.csv', True, False),
+        ('stat_vars.mcf', 'output_stat_vars.mcf', 'output_stat_vars.mcf', False, True),
+        ('stat_vars_schema.mcf', 'output_stat_vars_schema.mcf', 'output_stat_vars_schema.mcf', False, False),
+        ('tmcf', 'output.tmcf', 'output.tmcf', False, False),
+    ]
+
+    detailed_results = {}
+    total_tp = 0
+    total_fp = 0
+    total_fn = 0
+
+    print(f"\n--- Rechecking Efficacy for {dataset_id} ---")
+    for label, pred_name, gold_name, is_pv, use_fp in file_mappings:
+        print(f"Comparing {label}...")
+        diff_text, metrics = run_comparison(os.path.join(test_dir, pred_name), os.path.join(gold_dir, gold_name), is_pvmap=is_pv, use_fingerprint=use_fp)
+        detailed_results[label] = {'diff': diff_text, 'metrics': metrics}
+        print(f"  Result: F1={metrics['f1']:.1%}, TP={metrics['tp']}, FP={metrics['fp']}, FN={metrics['fn']}")
+
+        total_tp += metrics['tp']
+        total_fp += metrics['fp']
+        total_fn += metrics['fn']
+
+    precision = total_tp / (total_tp + total_fp) if (total_tp + total_fp) > 0 else 0
+    recall = total_tp / (total_tp + total_fn) if (total_tp + total_fn) > 0 else 0
+    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
+
+    hero_metrics = {
+        'tp': total_tp,
+        'fp': total_fp,
+        'fn': total_fn,
+        'precision': precision,
+        'recall': recall,
+        'f1': f1
+    }
+
+    final_html = update_html(template_content, dataset_id, hero_metrics, detailed_results)
+
+    with open(os.path.join(output_dir, 'Agent_Efficacy_Board.html'), 'w') as f:
+        f.write(final_html)
+    return hero_metrics
+
+def main():
+    parser = argparse.ArgumentParser(description="Calculate efficacy metrics.")
+    parser.add_argument('--test', required=True, help="Path to the test (prediction) directory.")
+    parser.add_argument('--gold', required=True, help="Path to the gold (reviewed) directory.")
+    parser.add_argument('--output', required=True, help="Path to the output directory to save results.")
+    parser.add_argument('--dataset_id', default="Dataset", help="Optional dataset ID for display.")
+    parser.add_argument('--bulk', action='store_true', help="If set, treats test and gold as parent directories containing multiple dataset folders.")
+    args = parser.parse_args()
+
+    template_path = os.path.join(os.path.dirname(__file__), 'Agent_Efficacy_Board.html')
+    with open(template_path, 'r') as f: 
+        template_content = f.read()
+
+    if args.bulk:
+        run_id = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
+        args.output = os.path.join(args.output, f"bulk_run_{run_id}")
+        print(f"Starting bulk efficacy calculation. Output: {args.output}")
+        summary_data = []
+        os.makedirs(args.output, exist_ok=True)
+
+        for dataset_id in os.listdir(args.gold):
+            gold_ds_dir = os.path.join(args.gold, dataset_id)
+            if not os.path.isdir(gold_ds_dir):
+                continue
+
+            # Allow test directories to either match exactly or be prefixed with test_
+            test_ds_dir = os.path.join(args.test, dataset_id)
+            if not os.path.isdir(test_ds_dir):
+                test_ds_dir = os.path.join(args.test, f"test_{dataset_id}")
+
+            if os.path.isdir(test_ds_dir):
+                out_ds_dir = os.path.join(args.output, dataset_id)
+                try:
+                    hero_metrics = process_single_dataset(test_ds_dir, gold_ds_dir, out_ds_dir, dataset_id, template_content)
+                    if hero_metrics:
+                        summary_data.append({
+                            'dataset_id': dataset_id,
+                            'f1': hero_metrics['f1'],
+                            'precision': hero_metrics['precision'],
+                            'recall': hero_metrics['recall'],
+                            'tp': hero_metrics['tp'],
+                            'fp': hero_metrics['fp'],
+                            'fn': hero_metrics['fn']
+                        })
+                except Exception as e:
+                    print(f"Error processing {dataset_id}: {e}")
+            else:
+                print(f"Skipped: {dataset_id} (Matching test directory not found)")
+
+        # Write summary.csv
+        if summary_data:
+            summary_path = os.path.join(args.output, 'summary.csv')
+            with open(summary_path, 'w', newline='') as f:
+                writer = csv.DictWriter(f, fieldnames=['dataset_id', 'f1', 'precision', 'recall', 'tp', 'fp', 'fn'])
+                writer.writeheader()
+                writer.writerows(summary_data)
+            print(f"\nBulk run complete. Summary saved to {summary_path}")
+    else:
+        process_single_dataset(args.test, args.gold, args.output, args.dataset_id, template_content)
+        print(f"\nRecheck complete. Results saved to {args.output}")
+
+if __name__ == "__main__":
+    main()
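
The bulk-mode directory matching in `main()` can be tried in isolation with the sketch below. It mirrors the rule in the script (exact name first, then a `test_` prefix) but factors it into a hypothetical helper, `find_test_dir`, and runs it against a temporary directory tree with invented dataset names.

```python
import os
import tempfile


def find_test_dir(test_root, dataset_id):
    """Return the test directory for dataset_id, or None if there is none."""
    # Try the exact name first, then the "test_" prefixed variant.
    for name in (dataset_id, f"test_{dataset_id}"):
        candidate = os.path.join(test_root, name)
        if os.path.isdir(candidate):
            return candidate
    return None


with tempfile.TemporaryDirectory() as test_root:
    # Made-up layout: one prefixed directory, one exact match, one missing.
    os.makedirs(os.path.join(test_root, "test_DATASET_A"))
    os.makedirs(os.path.join(test_root, "DATASET_B"))

    matched_names = {}
    for dataset_id in ("DATASET_A", "DATASET_B", "DATASET_C"):
        path = find_test_dir(test_root, dataset_id)
        matched_names[dataset_id] = os.path.basename(path) if path else None

for dataset_id, name in matched_names.items():
    print(f"{dataset_id}: {name or 'skipped (no matching test directory)'}")
```

Keeping the lookup in one small function would also make it straightforward to unit-test the skip behaviour separately from the metric computation.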