Skip to content

ashioyajotham/Value-Aligned-Confabulation-VAC-Research

Repository files navigation

Value-Aligned Confabulation (VAC) Research

Overview

This repository contains a research implementation investigating context-dependent evaluation of factually ungrounded outputs in Large Language Models (LLMs). The work examines whether current evaluation frameworks adequately distinguish between harmful misinformation and potentially beneficial speculation that aligns with human values.

Research Problem

Current LLM evaluation metrics treat all factually ungrounded outputs ("hallucinations") as uniformly problematic, applying binary truthfulness standards regardless of context. This approach fails to account for:

  1. Domain-specific tolerance: Medical contexts require strict factual accuracy, while creative or educational contexts may benefit from reasonable speculation
  2. Value alignment: Some factually ungrounded outputs may serve beneficial purposes (e.g., empathetic support, educational analogies)
  3. Risk-dependent standards: High-risk scenarios demand different evaluation criteria than low-risk informational queries

Why "Confabulation"?

This research adopts the term "confabulation" rather than "hallucination" following Geoffrey Hinton's advocacy for more precise terminology when describing LLM behavior. As Hinton argues, LLMs don't hallucinate in the perceptual sense—they confabulate, generating plausible-sounding outputs that may not be grounded in their training data or factual reality.

Key distinction: Confabulation is a neutral, descriptive term that:

  • Accurately describes the generative process of LLMs
  • Avoids anthropomorphizing AI systems with human perceptual experiences
  • Allows for nuanced evaluation (some confabulation may be acceptable or even beneficial)
  • Aligns with cognitive science terminology for similar phenomena in human cognition

This terminological choice is central to our research premise: not all confabulation is equally problematic, and evaluation frameworks should account for context-dependent acceptability.

Terminology

  • Harmful Confabulation: Factually incorrect outputs that mislead users or pose safety risks
  • Value-Aligned Confabulation: Factually ungrounded outputs that align with human values and provide utility within appropriate contextual bounds
  • Confabulation Tolerance: Domain and context-specific threshold for acceptable factual imprecision

Research Hypothesis

Primary Hypothesis: Traditional weighted scoring of alignment, truthfulness, utility, and transparency metrics fails to distinguish harmful from beneficial confabulation in medical contexts without careful weight optimization and metric design.

Status: ✅ Empirically Validated (Phase 1)

  • Baseline configuration (T:50%, A:30%, U:15%, Tr:5%) produced incorrect ordering: Harmful (0.440) > Beneficial (0.437) > Truthful (0.435)
  • Optimized configuration (T:70%, A:30%, U:0%, Tr:0%) achieved correct ordering but minimal separation (<1%)
  • Conclusion: Weight optimization necessary but insufficient; underlying metrics require redesign

Research Questions

  1. RQ1: What weight configurations enable automated metrics to correctly rank truthful, beneficial, and harmful responses in medical contexts?
  2. RQ2: How well do automated VAC scores correlate with human judgments of response quality and safety?
  3. RQ3: What contextual factors (domain, risk level, user demographics) modulate acceptable confabulation tolerance?
  4. RQ4: Can metric improvements achieve clinically significant separation (>10%) between response types for safe deployment?

Repository Structure

value-aligned-confabulation/
├── docs/                    # Research documentation
├── src/                     # Core implementation
│   ├── evaluation/         # Evaluation framework
│   ├── data/               # Data collection and management
│   ├── models/             # Model implementations
│   └── analysis/           # Analysis tools
├── experiments/            # Experimental protocols
├── tests/                  # Testing framework
├── configs/                # Configuration files
└── scripts/                # Utility scripts

Installation

pip install -r requirements.txt
python setup.py install

Quick Start

from src.evaluation.vac_evaluator import ValueAlignedConfabulationEvaluator

evaluator = ValueAlignedConfabulationEvaluator()
score = evaluator.evaluate_response(prompt, response, context)

Web UI (Streamlit) for Value Elicitation

Prefer a friendlier interface? Launch the Streamlit app:

# From the project root (activate venv first if needed)
python -m pip install -r requirements.txt
streamlit run experiments\pilot_studies\streamlit_app.py

The app collects demographics, shows scenario pairs with styled cards, and saves:

  • JSON bundle with analysis
  • JSONL rows (one per recorded choice)
  • CSV table

Files are written to experiments/results/value-elicitation_streamlit/<DATE>/.

Phase 1 Research Insights

Experimental Validation Results

Our Phase 1 experiments validated the core hypothesis through systematic testing of 62 responses across 11 medical scenarios:

graph LR
    A[Baseline Config<br/>T:50%, A:30%] -->|Failed| B[Harmful: 0.440<br/>Beneficial: 0.437<br/>Truthful: 0.435]
    C[Optimized Config<br/>T:70%, A:30%] -->|Success| D[Truthful: 0.544<br/>Beneficial: 0.541<br/>Harmful: 0.540]
    
    style A fill:#FF6B6B
    style B fill:#FF6B6B
    style C fill:#90EE90
    style D fill:#90EE90
Loading

Key Findings

Metric Before Optimization After Optimization Improvement
Response Ordering ❌ Wrong (H>B>T) ✅ Correct (T>B>H) Fixed
Pairwise Accuracy 50% (chance) 100% (perfect) +100%
Sanity Checks 0/2 passed 2/2 passed +100%
Separation N/A 0.4% Needs improvement

Weight Sensitivity Analysis

Ablation study across 6 configurations revealed optimal weight range:

pie title "Optimal Weight Distribution"
    "Truthfulness" : 70
    "Alignment" : 30
    "Utility" : 0
    "Transparency" : 0
Loading

Critical Discovery: Truthfulness weight must be 66-78% to achieve correct response ordering in medical domain.

Statistical Summary

  • Total Evaluations: 62 responses (11 truthful, 18 beneficial, 33 harmful)
  • Scenarios: 11 medical scenarios across 4 risk levels
  • Ablation Runs: 6 weight configurations tested
  • Success Rate: 67% of configurations pass sanity checks (vs 0% baseline)

📊 Full Phase 1 Report | 📈 Detailed Analysis


Research Phases

Phase 1: Foundation ✅ COMPLETED (Weeks 1-2)

  • ✅ Core evaluation framework
  • ✅ Initial benchmark scenarios (11 medical scenarios, 62 responses)
  • ✅ Basic metrics implementation
  • ✅ Weight optimization through ablation studies
  • Hypothesis validated: Traditional metrics fail without optimization

Key Findings:

  • Baseline weights failed to distinguish harmful from beneficial responses
  • Optimal configuration: 70% truthfulness, 30% alignment, 0% utility/transparency
  • Achieved correct response ordering but <1% separation
  • Conclusion: Metric redesign needed for production use

📄 Phase 1 Completion Report


Phase 2: Human Studies 🔄 IN PROGRESS (Weeks 3-4)

Objectives:

  • Collect human preference data (target: 50+ participants)
  • Validate automated VAC scores against human judgment
  • Identify systematic disagreements between humans and metrics
  • Recalibrate metrics based on human feedback
  • Expand benchmark scenarios (target: 50+ scenarios)

Success Criteria:

  • Human-AI agreement >0.60 (Cohen's kappa)
  • Response separation >10% (currently <1%)
  • Maintain 100% pairwise accuracy

Get Started:

streamlit run experiments/pilot_studies/streamlit_app.py

Phase 3: Model Evaluation (Weeks 5-6)

  • Baseline model evaluation with actual LLM APIs
  • Cross-domain testing (medical, creative, educational)
  • Alignment-truthfulness trade-off analysis
  • Real-time evaluation integration

Phase 4: Analysis & Iteration (Weeks 7-8)

  • Statistical analysis of human study results
  • Metric refinement based on findings
  • Research publication preparation
  • Framework deployment guidelines

Contributing

This is a research project focused on advancing our understanding of beneficial AI confabulation. We welcome contributions from researchers, developers, and AI safety practitioners.

Ways to Contribute

  • Research: New evaluation metrics, benchmark scenarios, human study protocols
  • Technical: Code improvements, integrations, analysis tools
  • Documentation: Methodology improvements, examples, tutorials
  • Community: Cross-cultural validation, expert reviews, ethical guidelines

Please see our Contributing Guide for detailed information on how to get involved.

Research Ethics

This project follows ethical guidelines for human subjects research and AI safety. All contributions should consider potential societal impacts and promote beneficial uses of confabulation research.

Acknowledgements

This research builds upon important insights from the AI research community:

Terminology

  • Geoffrey Hinton has advocated for using "confabulation" rather than "hallucination" when describing AI-generated content that isn't grounded in training data, emphasizing that the term better captures the nature of how language models generate responses. See his discussion in the 60 Minutes interview and the full interview.

  • Andrej Karpathy has discussed the nuanced nature of what we call "hallucinations" in language models, noting that not all factually ungrounded outputs are equally problematic - a key insight that motivates this research. His thoughts on this topic have been shared in various Twitter/X discussions.

Foundational Work

Research Community

We acknowledge the broader AI safety and alignment research community, whose ongoing work on AI evaluation, human preference modeling, and value alignment provides the foundation for this research.

License

MIT License - See LICENSE file for details.

Citation

If you use this work in your research, please cite:

@misc{vac_research_2025,
  title={Value-Aligned Confabulation: Moving Beyond Binary Truthfulness in LLM Evaluation},
  author={Ashioya Jotham Victor},
  year={2025},
  note={Research in progress}
}

About

Driving away from the binary "hallucinations" evals to a more nuanced and context-dependent eval technique.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages