307 changes: 307 additions & 0 deletions tools/agent_efficacy/Agent_Efficacy_Board.html

Large diffs are not rendered by default.

71 changes: 71 additions & 0 deletions tools/agent_efficacy/EFFICACY_GUIDE.md
@@ -0,0 +1,71 @@
# Agent Efficacy Testing Guide

The efficacy calculation pipeline (`calculate_efficacy.py`) evaluates the quality of the autoschematization agent by comparing its output predictions against human-reviewed "gold standard" directories. It leverages semantic graph comparisons (`mcf_diff.py`) to accurately compute metrics like **Precision, Recall, and the F1 Score**.

---

### 1. Understanding the Evaluated Files

The script compares four primary output artifacts generated by the agent against their reviewed counterparts. It computes True Positives (TP), False Positives (FP), and False Negatives (FN) to determine the overall agent efficacy.

* **`output_pvmap.csv` (Primary Efficacy Metric):**
    * **What it is:** The Property-Value (PV) mapping file that links dataset column headers to Data Commons schema nodes.
    * **How it's tested:** This is processed using the `PropertyValueMapper`. The script extracts a normalized semantic graph and calculates the "Hero Metrics" (Precision, Recall, F1) that define the overall run's success.
* **`output_stat_vars.mcf`:**
    * **What it is:** Defines the statistical variables extracted from the dataset.
    * **How it's tested:** The script uses **fingerprinting** (`use_fingerprint=True`). This means it evaluates if the core semantic definition (properties and values) of the StatVar is correct, even if the exact node ID (DCID) strings have slight, functionally irrelevant differences.
* **`output_stat_vars_schema.mcf` & `output.tmcf`:**
    * **What it is:** Extends the schema definitions and dictates the tabular template bindings.
    * **How it's tested:** Compared as standard MCF graphs to ensure the underlying structure matches the gold standard perfectly.
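
To make the fingerprinting idea concrete, here is a minimal sketch. It is an illustration only and not the actual `mcf_diff.py` implementation: the function name `node_fingerprint`, the ignored-key list, and the sample nodes are all assumptions made for this example. The point is that two nodes with different IDs but identical property-values produce the same fingerprint.

```python
import hashlib


def node_fingerprint(pvs, ignore=("Node", "dcid", "name", "description")):
    """Hash a node's property-values, skipping its ID and cosmetic properties.

    Hypothetical helper for illustration; mcf_diff.py may work differently.
    """
    # Sort so that property order does not affect the result.
    items = sorted((prop, str(val)) for prop, val in pvs.items() if prop not in ignore)
    canonical = ";".join(f"{prop}={val}" for prop, val in items)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Two made-up StatVar nodes that differ only in their node ID.
pred_node = {
    "Node": "dcid:Count_Person_Female",
    "typeOf": "dcs:StatisticalVariable",
    "populationType": "dcs:Person",
    "gender": "dcs:Female",
    "measuredProperty": "dcs:count",
}
gold_node = dict(pred_node, Node="dcid:Count_Female_Person")

# Same semantics, different IDs: the fingerprints match.
same_definition = node_fingerprint(pred_node) == node_fingerprint(gold_node)
```

Changing any non-ignored property (for example `gender`) would change the fingerprint, which is what lets a comparison count the node as a mismatch.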

---

### 2. Testing a Single Dataset

The `calculate_efficacy.py` script takes its input and output paths as command-line arguments. Run it inside the project's virtual environment.

**Command Line Arguments:**
* `--test`: Path to the test (prediction) directory.
* `--gold`: Path to the gold (reviewed) directory.
* `--output`: Path to the directory where the efficacy results dashboard should be saved.
* `--dataset_id` (Optional): The dataset ID for display purposes in the HTML dashboard.

**Example execution:**
```bash
source venv/bin/activate
python tools/agent_efficacy/calculate_efficacy.py \
--test test_DESA-GENDER_2025_OBS_ICT_SKILL_RT \
--gold undata/DESA/output/reviewed_pvmap_harish/DESA-GENDER_2025_OBS_ICT_SKILL_RT \
--output undata/DESA/efficacy_results/DESA-GENDER_2025_OBS_ICT_SKILL_RT \
--dataset_id DESA-GENDER_2025_OBS_ICT_SKILL_RT
```
**Results:** Open the generated dashboard (`Agent_Efficacy_Board.html` inside your `--output` directory) in your web browser.

---

### 3. Testing in Bulk (Multiple Datasets)

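To evaluate the agent across multiple datasets in a single run, pass the `--bulk` flag. With this flag, `--test` and `--gold` are treated as parent directories that contain one subdirectory per dataset. Datasets are discovered from the subdirectories of `--gold`, and each one is matched to a `--test` subdirectory with the same name or with a `test_` prefix.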

**Example Bulk Run Execution:**

```bash
# Ensure virtual environment is active
source venv/bin/activate

python tools/agent_efficacy/calculate_efficacy.py \
--bulk \
--test undata/DESA/output/unreviewed \
--gold undata/DESA/output/reviewed \
--output undata/DESA/efficacy_results/bulk_run
```

The script will automatically detect datasets and generate a unique run directory (e.g., `bulk_run_20260601_120000`) inside your `--output` path. This directory will contain a separate HTML dashboard folder for every evaluated dataset, along with an aggregated `summary.csv` file containing the F1, Precision, Recall, TP, FP, and FN scores for all datasets in that run.
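
As a rough sketch of how `summary.csv` could be consumed, the snippet below reads the columns listed above and reports a macro-averaged F1 (mean of per-dataset F1), a micro-averaged F1 (computed from the summed TP/FP/FN), and the datasets ordered from weakest to strongest. The function name `summarize` and the sample rows are made up for illustration; only the column names come from the script.

```python
import csv
import io

# Hypothetical sample in the summary.csv column layout; the values are invented.
SAMPLE_SUMMARY = """dataset_id,f1,precision,recall,tp,fp,fn
DATASET_A,0.9,0.9,0.9,90,10,10
DATASET_B,0.5,0.5,0.5,5,5,5
"""


def summarize(csv_file):
    """Return (macro_f1, micro_f1, dataset ids sorted by ascending F1)."""
    rows = list(csv.DictReader(csv_file))
    tp = sum(int(r["tp"]) for r in rows)
    fp = sum(int(r["fp"]) for r in rows)
    fn = sum(int(r["fn"]) for r in rows)
    # Micro F1 pools the counts, so large datasets weigh more.
    micro_f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) > 0 else 0.0
    # Macro F1 treats every dataset equally.
    macro_f1 = sum(float(r["f1"]) for r in rows) / len(rows) if rows else 0.0
    weakest_first = [r["dataset_id"] for r in sorted(rows, key=lambda r: float(r["f1"]))]
    return macro_f1, micro_f1, weakest_first


macro_f1, micro_f1, weakest_first = summarize(io.StringIO(SAMPLE_SUMMARY))
print(f"macro F1={macro_f1:.3f}, micro F1={micro_f1:.3f}, weakest={weakest_first[0]}")
```

For a real run, replace `io.StringIO(SAMPLE_SUMMARY)` with `open(path, newline="")` on the `summary.csv` in the timestamped run directory.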

---

### 4. Interpreting the Output

The HTML dashboard (`Agent_Efficacy_Board.html`) gives you a visual breakdown of the metrics:
* **Hero Metrics (Top Section):** Represents the accuracy of the `pvmap.csv`. A high F1 score here means the agent successfully mapped columns to the correct semantic properties.
* **Detailed Semantic Comparisons:** Look here to see the specific True Positives, False Positives (incorrect mappings), and False Negatives (missed mappings). The dashboard will output raw MCF node diffs to show you exactly *where* the agent deviated from the reviewed standard for `pvmap`, `stat_vars`, `schema`, and `tmcf`.
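
For reference, the three percentages are derived from the TP/FP/FN counts with the standard formulas. The sketch below mirrors the zero-division guards used in `calculate_efficacy.py`; the function name `prf1` and the example counts are illustrative only.

```python
def prf1(tp, fp, fn):
    """Compute (precision, recall, f1) from raw counts, returning 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1


# Example: 8 correct mappings, 2 incorrect mappings, 4 missed mappings.
precision, recall, f1 = prf1(tp=8, fp=2, fn=4)
print(f"{precision:.1%} {recall:.1%} {f1:.1%}")  # prints "80.0% 66.7% 72.7%"
```

A low precision with high recall points to many incorrect mappings (FP); the reverse points to many missed mappings (FN).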
44 changes: 44 additions & 0 deletions tools/agent_efficacy/USAGE_EXAMPLES.md
@@ -0,0 +1,44 @@
# Efficacy Tool Quick-Start Examples

This guide provides sample commands for running the `calculate_efficacy.py` script in both single and bulk modes.

## 1. Single Dataset Evaluation
Use this command to compare a single agent-generated folder against a reviewed gold standard.

### Sample Command
```bash
python3 tools/agent_efficacy/calculate_efficacy.py \
--test /usr/local/google/home/nehil/datacommons/import/git/data/test_DESA-GENDER_2025_OBS_ICT_SKILL_RT \
--gold /usr/local/google/home/nehil/datacommons/import/git/data/undata/DESA/output/reviewed_pvmap_harish/DESA-GENDER_2025_OBS_ICT_SKILL_RT \
--output /usr/local/google/home/nehil/datacommons/import/git/data/tmp/efficacy_results/single_run \
--dataset_id ICT_SKILL_RT
```

### Result
- Dashboard: `Agent_Efficacy_Board.html` in the directory passed to `--output`.
- Metrics: Precision, Recall, and F1 are populated in the HTML.

---

## 2. Bulk Evaluation (Multiple Datasets)
Use this command to evaluate all datasets in a directory. It will create a unique, timestamped folder for the run.

### Sample Command
```bash
python3 tools/agent_efficacy/calculate_efficacy.py \
--bulk \
--test /usr/local/google/home/nehil/datacommons/import/git/data/undata/DESA/output/agent_predictions \
--gold /usr/local/google/home/nehil/datacommons/import/git/data/undata/DESA/output/reviewed_pvmap_harish \
--output /tmp/efficacy_results/bulk_runs
```

### Result
- Output Directory: `/tmp/efficacy_results/bulk_runs/bulk_run_YYYYMMDD_HHMMSS/`
- Summary File: `summary.csv` inside the new run folder.
- Dashboards: Individual `Agent_Efficacy_Board.html` files for every dataset found.

---

## 3. How to Present Results
1. **HTML Dashboard:** Open the `Agent_Efficacy_Board.html` file in any browser to view the "Hero Metrics" and detailed semantic diffs.
2. **Summary CSV:** Use the `summary.csv` generated during bulk runs to create a high-level report or table of performance across all indicators.
213 changes: 213 additions & 0 deletions tools/agent_efficacy/calculate_efficacy.py
@@ -0,0 +1,213 @@
import os
import sys
import re
import csv
import argparse
import datetime

# Set up paths to import tools
PROJECT_ROOT = '/usr/local/google/home/nehil/datacommons/import/git/data'

Review comment from a contributor (marked high):

The PROJECT_ROOT path is hardcoded to a local directory, which makes the script non-portable. To ensure portability and avoid relying on specific directory structures, use a more robust method to locate the repository root, such as searching for marker files like .git or WORKSPACE.

Suggested change
PROJECT_ROOT = '/usr/local/google/home/nehil/datacommons/import/git/data'
def find_repo_root(path):
    if os.path.exists(os.path.join(path, '.git')) or os.path.exists(os.path.join(path, 'WORKSPACE')):
        return path
    parent = os.path.dirname(path)
    return find_repo_root(parent) if parent != path else path
PROJECT_ROOT = find_repo_root(os.path.abspath(os.path.dirname(__file__)))
References
  1. Avoid hardcoding paths that rely on specific directory structures. Instead, use more robust methods to locate the repository root, such as searching for marker files (e.g., .git or WORKSPACE).
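
An iterative variant of the same idea can be sketched with `pathlib`. This is one possible shape and not part of the suggestion itself: the marker names come from the comment above, while the `markers` parameter and the choice to raise `FileNotFoundError` when no marker is found (instead of silently returning the filesystem root) are assumptions made for this example.

```python
from pathlib import Path


def find_repo_root(start, markers=(".git", "WORKSPACE")):
    """Walk upward from `start` and return the first directory containing a marker."""
    start = Path(start).resolve()
    # Check the starting directory first, then each ancestor up to the filesystem root.
    for candidate in (start, *start.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    raise FileNotFoundError(f"No repository marker {markers} found above {start}")


# Possible usage inside the script (kept as a comment so the sketch has no side effects):
# PROJECT_ROOT = str(find_repo_root(Path(__file__).parent))
```

Failing loudly makes a misconfigured checkout visible at startup rather than as a later import error from a wrong `sys.path` entry.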

sys.path.append(os.path.join(PROJECT_ROOT, 'tools/statvar_importer'))
sys.path.append(os.path.join(PROJECT_ROOT, 'util'))
sys.path.append(os.path.join(PROJECT_ROOT, 'tools/agentic_import/metrics'))

import mcf_diff
from counters import Counters
from property_value_mapper import PropertyValueMapper
from pvmap_generator_metrics import PVMapGeneratorMetricsRunner

def get_metrics_from_counters(counters_obj):
    # Use metrics formula directly from pvmap_generator_metrics.py
    diff_stats = {'counters': counters_obj.get_counters()}
    stats = PVMapGeneratorMetricsRunner.get_stats_from_diff_counters(None, diff_stats)

Review comment from a contributor (marked medium):

Once get_stats_from_diff_counters is refactored to a @staticmethod, you can call it directly without passing None as the first argument.

Suggested change
stats = PVMapGeneratorMetricsRunner.get_stats_from_diff_counters(None, diff_stats)
stats = PVMapGeneratorMetricsRunner.get_stats_from_diff_counters(diff_stats)
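
A small, self-contained illustration of why the refactor removes the `None` argument. `Runner` and both of its methods are hypothetical stand-ins for this example and do not reflect the real `PVMapGeneratorMetricsRunner` API.

```python
class Runner:
    """Hypothetical stand-in class used only to contrast the two call styles."""

    def stats_as_method(self, diff_stats):
        # Regular method: `self` is unused, so callers without an instance must pass a dummy.
        return len(diff_stats["counters"])

    @staticmethod
    def stats_as_static(diff_stats):
        # Static method: no instance parameter, so the class-level call is clean.
        return len(diff_stats["counters"])


diff_stats = {"counters": {"matched": 3, "missing": 1}}

# Calling a regular method through the class requires a placeholder for `self`.
via_method = Runner.stats_as_method(None, diff_stats)
# The static version takes only the real argument.
via_static = Runner.stats_as_static(diff_stats)
```

Both calls return the same value; the static form simply makes it explicit that no instance state is involved.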


    return {
        'tp': stats.get('true_positive', 0),
        'fp': stats.get('false_positive', 0),
        'fn': stats.get('false_negative', 0),
        'precision': stats.get('precision', 0),
        'recall': stats.get('recall', 0),
        'f1': stats.get('f1', 0)
    }

def load_pv_map_nodes(file_path):
    """Loads PV map into MCF-like nodes using PropertyValueMapper normalization."""
    pv_mapper = PropertyValueMapper()
    pv_mapper.load_pvs_from_file(file_path)
    # Get the raw GLOBAL map
    raw_map = pv_mapper.get_pv_map().get('GLOBAL', {})
    # Convert to standard node format: {dcid: {prop: val}}
    nodes = {}
    for key, pvs in raw_map.items():
        # Ensure DCID is clean
        nodes[key] = pvs
    return nodes

def run_comparison(pred_path, gold_path, is_pvmap=False, use_fingerprint=False):
    if not os.path.exists(pred_path) or not os.path.exists(gold_path):
        print(f"Skipping: missing {pred_path} or {gold_path}")
        return "", {'tp': 0, 'fp': 0, 'fn': 0, 'precision': 0, 'recall': 0, 'f1': 0}

    counters = Counters()
    config = {
        'show_diff_nodes_only': True,
        'ignore_property': ['description', 'provenance', 'memberOf', 'member', 'name', 'constraintProperties', 'keyString', 'relevantVariable'],
        'fingerprint_dcid': use_fingerprint
    }

    if is_pvmap:
        # Specialized loading for PV Maps to handle wide vs narrow formats
        nodes1 = load_pv_map_nodes(pred_path)
        nodes2 = load_pv_map_nodes(gold_path)
        print(f" [PVMap] Loaded {len(nodes1)} nodes from pred, {len(nodes2)} from gold")
        diff_text = mcf_diff.diff_mcf_nodes(nodes1, nodes2, config, counters)
    else:
        # Standard MCF loading
        diff_text = mcf_diff.diff_mcf_files(pred_path, gold_path, config, counters)

    metrics = get_metrics_from_counters(counters)
    return diff_text, metrics

def update_html(template_content, dataset_id, hero_metrics, detailed_results):
    content = template_content
    for key, label in [('f1', 'Agent Efficacy (F1)'), ('precision', 'Precision'), ('recall', 'Recall')]:
        escaped_label = re.escape(label)
        content = re.sub(
            rf'({escaped_label}.*?tracking-tight(?:er)?">)([\d\.]+%)(</p>)',
            r'\g<1>' + f"{hero_metrics[key]*100:.1f}%" + r'\g<3>',
            content, flags=re.DOTALL
        )

    content = re.sub(r'(True Positives.*?tracking-tighter">)([\d,]+)(</p>)', r'\g<1>' + f"{hero_metrics['tp']:,}" + r'\g<3>', content, flags=re.DOTALL)
    content = re.sub(r'(False Positives.*?tracking-tighter">)([\d,]+)(</p>)', r'\g<1>' + f"{hero_metrics['fp']:,}" + r'\g<3>', content, flags=re.DOTALL)
    content = re.sub(r'(False Negatives.*?tracking-tighter">)([\d,]+)(</p>)', r'\g<1>' + f"{hero_metrics['fn']:,}" + r'\g<3>', content, flags=re.DOTALL)
    content = re.sub(r'Run ID: .*?</div>', f'Run ID: {dataset_id} (Rechecked)</div>', content)

    if '<section class="mt-8 px-4 sm:px-6 lg:px-8 pb-12">' in content:
        content = content.split('<section class="mt-8 px-4 sm:px-6 lg:px-8 pb-12">')[0]

    diff_sections = '<section class="mt-8 px-4 sm:px-6 lg:px-8 pb-12"><h2 class="text-2xl lg:text-3xl font-extrabold text-slate-900 mb-6">Detailed Semantic Comparisons</h2>'
    for label, data in detailed_results.items():
        diff_sections += f'''
        <div class="mb-10 bg-white p-6 lg:p-8 rounded-2xl shadow-lg border-2 border-slate-200">
            <h3 class="text-xl lg:text-2xl font-black mb-4 uppercase tracking-wider text-indigo-900 border-b-2 border-indigo-100 pb-2">{label}</h3>
            <div class="grid grid-cols-3 gap-4 mb-6">
                <div class="bg-emerald-50 p-4 rounded-xl border-2 border-emerald-200"><span class="block text-xs font-black uppercase tracking-widest">Precision</span><span class="text-2xl font-black">{data['metrics']['precision']*100:.1f}%</span></div>
                <div class="bg-amber-50 p-4 rounded-xl border-2 border-amber-200"><span class="block text-xs font-black uppercase tracking-widest">Recall</span><span class="text-2xl font-black">{data['metrics']['recall']*100:.1f}%</span></div>
                <div class="bg-indigo-50 p-4 rounded-xl border-2 border-indigo-200"><span class="block text-xs font-black uppercase tracking-widest">F1 Score</span><span class="text-2xl font-black">{data['metrics']['f1']*100:.1f}%</span></div>
            </div>
            <div class="text-xs font-bold text-slate-500 mb-2 font-mono uppercase tracking-widest">Semantic Match (TP: {data['metrics']['tp']} | FP: {data['metrics']['fp']} | FN: {data['metrics']['fn']})</div>
            <pre class="bg-slate-900 text-slate-100 p-6 rounded-xl overflow-auto max-h-[30rem] text-sm font-mono whitespace-pre-wrap">{data['diff'] if data['diff'] else 'No differences found.'}</pre>
        </div>
        '''
    diff_sections += '</section>'
    return content.replace('</body>', diff_sections + '</body>') if '</body>' in content else content + diff_sections

def process_single_dataset(test_dir, gold_dir, output_dir, dataset_id, template_content):
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    file_mappings = [
        ('pvmap.csv', 'output_pvmap.csv', 'output_pvmap.csv', True, False),
        ('stat_vars.mcf', 'output_stat_vars.mcf', 'output_stat_vars.mcf', False, True),
        ('stat_vars_schema.mcf', 'output_stat_vars_schema.mcf', 'output_stat_vars_schema.mcf', False, False),
        ('tmcf', 'output.tmcf', 'output.tmcf', False, False),
    ]

    detailed_results = {}
    total_tp = 0
    total_fp = 0
    total_fn = 0

    print(f"\n--- Rechecking Efficacy for {dataset_id} ---")
    for label, pred_name, gold_name, is_pv, use_fp in file_mappings:
        print(f"Comparing {label}...")
        diff_text, metrics = run_comparison(os.path.join(test_dir, pred_name), os.path.join(gold_dir, gold_name), is_pvmap=is_pv, use_fingerprint=use_fp)
        detailed_results[label] = {'diff': diff_text, 'metrics': metrics}
        print(f" Result: F1={metrics['f1']:.1%}, TP={metrics['tp']}, FP={metrics['fp']}, FN={metrics['fn']}")

        total_tp += metrics['tp']
        total_fp += metrics['fp']
        total_fn += metrics['fn']

    precision = total_tp / (total_tp + total_fp) if (total_tp + total_fp) > 0 else 0
    recall = total_tp / (total_tp + total_fn) if (total_tp + total_fn) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    hero_metrics = {
        'tp': total_tp,
        'fp': total_fp,
        'fn': total_fn,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

    final_html = update_html(template_content, dataset_id, hero_metrics, detailed_results)

    with open(os.path.join(output_dir, 'Agent_Efficacy_Board.html'), 'w') as f:
        f.write(final_html)
    return hero_metrics

def main():
    parser = argparse.ArgumentParser(description="Calculate efficacy metrics.")
    parser.add_argument('--test', required=True, help="Path to the test (prediction) directory.")
    parser.add_argument('--gold', required=True, help="Path to the gold (reviewed) directory.")
    parser.add_argument('--output', required=True, help="Path to the output directory to save results.")
    parser.add_argument('--dataset_id', default="Dataset", help="Optional dataset ID for display.")
    parser.add_argument('--bulk', action='store_true', help="If set, treats test and gold as parent directories containing multiple dataset folders.")
    args = parser.parse_args()

    template_path = os.path.join(os.path.dirname(__file__), 'Agent_Efficacy_Board.html')
    with open(template_path, 'r') as f:
        template_content = f.read()

    if args.bulk:
        run_id = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
        args.output = os.path.join(args.output, f"bulk_run_{run_id}")
        print(f"Starting Bulk Efficacy Calculation natively... Output: {args.output}")
        summary_data = []
        if not os.path.exists(args.output):
            os.makedirs(args.output)

        for dataset_id in os.listdir(args.gold):
            gold_ds_dir = os.path.join(args.gold, dataset_id)
            if not os.path.isdir(gold_ds_dir):
                continue

            # Allow test directories to either match exactly or be prefixed with test_
            test_ds_dir = os.path.join(args.test, dataset_id)
            if not os.path.isdir(test_ds_dir):
                test_ds_dir = os.path.join(args.test, f"test_{dataset_id}")

            if os.path.isdir(test_ds_dir):
                out_ds_dir = os.path.join(args.output, dataset_id)
                try:
                    hero_metrics = process_single_dataset(test_ds_dir, gold_ds_dir, out_ds_dir, dataset_id, template_content)
                    if hero_metrics:
                        summary_data.append({
                            'dataset_id': dataset_id,
                            'f1': hero_metrics['f1'],
                            'precision': hero_metrics['precision'],
                            'recall': hero_metrics['recall'],
                            'tp': hero_metrics['tp'],
                            'fp': hero_metrics['fp'],
                            'fn': hero_metrics['fn']
                        })
                except Exception as e:
                    print(f"Error processing {dataset_id}: {e}")
            else:
                print(f"Skipped: {dataset_id} (Matching test directory not found)")

        # Write summary.csv
        if summary_data:
            summary_path = os.path.join(args.output, 'summary.csv')
            with open(summary_path, 'w', newline='') as f:
                writer = csv.DictWriter(f, fieldnames=['dataset_id', 'f1', 'precision', 'recall', 'tp', 'fp', 'fn'])
                writer.writeheader()
                writer.writerows(summary_data)
            print(f"\nBulk run complete. Summary saved to {summary_path}")
    else:
        process_single_dataset(args.test, args.gold, args.output, args.dataset_id, template_content)
        print(f"\nRecheck complete. Results saved to {args.output}")

if __name__ == "__main__":
    main()