This repository contains a research implementation investigating context-dependent evaluation of factually ungrounded outputs in Large Language Models (LLMs). The work examines whether current evaluation frameworks adequately distinguish between harmful misinformation and potentially beneficial speculation that aligns with human values.
Current LLM evaluation metrics treat all factually ungrounded outputs ("hallucinations") as uniformly problematic, applying binary truthfulness standards regardless of context. This approach fails to account for:
- Domain-specific tolerance: Medical contexts require strict factual accuracy, while creative or educational contexts may benefit from reasonable speculation
- Value alignment: Some factually ungrounded outputs may serve beneficial purposes (e.g., empathetic support, educational analogies)
- Risk-dependent standards: High-risk scenarios demand different evaluation criteria than low-risk informational queries
This research adopts the term "confabulation" rather than "hallucination" following Geoffrey Hinton's advocacy for more precise terminology when describing LLM behavior. As Hinton argues, LLMs don't hallucinate in the perceptual sense—they confabulate, generating plausible-sounding outputs that may not be grounded in their training data or factual reality.
Key distinction: Confabulation is a neutral, descriptive term that:
- Accurately describes the generative process of LLMs
- Avoids anthropomorphizing AI systems with human perceptual experiences
- Allows for nuanced evaluation (some confabulation may be acceptable or even beneficial)
- Aligns with cognitive science terminology for similar phenomena in human cognition
This terminological choice is central to our research premise: not all confabulation is equally problematic, and evaluation frameworks should account for context-dependent acceptability.
- Harmful Confabulation: Factually incorrect outputs that mislead users or pose safety risks
- Value-Aligned Confabulation: Factually ungrounded outputs that align with human values and provide utility within appropriate contextual bounds
- Confabulation Tolerance: Domain and context-specific threshold for acceptable factual imprecision
Primary Hypothesis: Traditional weighted scoring of alignment, truthfulness, utility, and transparency metrics fails to distinguish harmful from beneficial confabulation in medical contexts without careful weight optimization and metric design.
Status: ✅ Empirically Validated (Phase 1)
- Baseline configuration (T:50%, A:30%, U:15%, Tr:5%) produced incorrect ordering: Harmful (0.440) > Beneficial (0.437) > Truthful (0.435)
- Optimized configuration (T:70%, A:30%, U:0%, Tr:0%) achieved correct ordering but minimal separation (<1%)
- Conclusion: Weight optimization necessary but insufficient; underlying metrics require redesign
- RQ1: What weight configurations enable automated metrics to correctly rank truthful, beneficial, and harmful responses in medical contexts?
- RQ2: How well do automated VAC scores correlate with human judgments of response quality and safety?
- RQ3: What contextual factors (domain, risk level, user demographics) modulate acceptable confabulation tolerance?
- RQ4: Can metric improvements achieve clinically significant separation (>10%) between response types for safe deployment?
value-aligned-confabulation/
├── docs/ # Research documentation
├── src/ # Core implementation
│ ├── evaluation/ # Evaluation framework
│ ├── data/ # Data collection and management
│ ├── models/ # Model implementations
│ └── analysis/ # Analysis tools
├── experiments/ # Experimental protocols
├── tests/ # Testing framework
├── configs/ # Configuration files
└── scripts/ # Utility scripts
pip install -r requirements.txt
python setup.py installfrom src.evaluation.vac_evaluator import ValueAlignedConfabulationEvaluator
evaluator = ValueAlignedConfabulationEvaluator()
score = evaluator.evaluate_response(prompt, response, context)Prefer a friendlier interface? Launch the Streamlit app:
# From the project root (activate venv first if needed)
python -m pip install -r requirements.txt
streamlit run experiments\pilot_studies\streamlit_app.pyThe app collects demographics, shows scenario pairs with styled cards, and saves:
- JSON bundle with analysis
- JSONL rows (one per recorded choice)
- CSV table
Files are written to experiments/results/value-elicitation_streamlit/<DATE>/.
Our Phase 1 experiments validated the core hypothesis through systematic testing of 62 responses across 11 medical scenarios:
graph LR
A[Baseline Config<br/>T:50%, A:30%] -->|Failed| B[Harmful: 0.440<br/>Beneficial: 0.437<br/>Truthful: 0.435]
C[Optimized Config<br/>T:70%, A:30%] -->|Success| D[Truthful: 0.544<br/>Beneficial: 0.541<br/>Harmful: 0.540]
style A fill:#FF6B6B
style B fill:#FF6B6B
style C fill:#90EE90
style D fill:#90EE90
| Metric | Before Optimization | After Optimization | Improvement |
|---|---|---|---|
| Response Ordering | ❌ Wrong (H>B>T) | ✅ Correct (T>B>H) | Fixed |
| Pairwise Accuracy | 50% (chance) | 100% (perfect) | +100% |
| Sanity Checks | 0/2 passed | 2/2 passed | +100% |
| Separation | N/A | 0.4% | Needs improvement |
Ablation study across 6 configurations revealed optimal weight range:
pie title "Optimal Weight Distribution"
"Truthfulness" : 70
"Alignment" : 30
"Utility" : 0
"Transparency" : 0
Critical Discovery: Truthfulness weight must be 66-78% to achieve correct response ordering in medical domain.
- Total Evaluations: 62 responses (11 truthful, 18 beneficial, 33 harmful)
- Scenarios: 11 medical scenarios across 4 risk levels
- Ablation Runs: 6 weight configurations tested
- Success Rate: 67% of configurations pass sanity checks (vs 0% baseline)
📊 Full Phase 1 Report | 📈 Detailed Analysis
- ✅ Core evaluation framework
- ✅ Initial benchmark scenarios (11 medical scenarios, 62 responses)
- ✅ Basic metrics implementation
- ✅ Weight optimization through ablation studies
- ✅ Hypothesis validated: Traditional metrics fail without optimization
Key Findings:
- Baseline weights failed to distinguish harmful from beneficial responses
- Optimal configuration: 70% truthfulness, 30% alignment, 0% utility/transparency
- Achieved correct response ordering but <1% separation
- Conclusion: Metric redesign needed for production use
Objectives:
- Collect human preference data (target: 50+ participants)
- Validate automated VAC scores against human judgment
- Identify systematic disagreements between humans and metrics
- Recalibrate metrics based on human feedback
- Expand benchmark scenarios (target: 50+ scenarios)
Success Criteria:
- Human-AI agreement >0.60 (Cohen's kappa)
- Response separation >10% (currently <1%)
- Maintain 100% pairwise accuracy
Get Started:
streamlit run experiments/pilot_studies/streamlit_app.py- Baseline model evaluation with actual LLM APIs
- Cross-domain testing (medical, creative, educational)
- Alignment-truthfulness trade-off analysis
- Real-time evaluation integration
- Statistical analysis of human study results
- Metric refinement based on findings
- Research publication preparation
- Framework deployment guidelines
This is a research project focused on advancing our understanding of beneficial AI confabulation. We welcome contributions from researchers, developers, and AI safety practitioners.
- Research: New evaluation metrics, benchmark scenarios, human study protocols
- Technical: Code improvements, integrations, analysis tools
- Documentation: Methodology improvements, examples, tutorials
- Community: Cross-cultural validation, expert reviews, ethical guidelines
Please see our Contributing Guide for detailed information on how to get involved.
This project follows ethical guidelines for human subjects research and AI safety. All contributions should consider potential societal impacts and promote beneficial uses of confabulation research.
This research builds upon important insights from the AI research community:
-
Geoffrey Hinton has advocated for using "confabulation" rather than "hallucination" when describing AI-generated content that isn't grounded in training data, emphasizing that the term better captures the nature of how language models generate responses. See his discussion in the 60 Minutes interview and the full interview.
-
Andrej Karpathy has discussed the nuanced nature of what we call "hallucinations" in language models, noting that not all factually ungrounded outputs are equally problematic - a key insight that motivates this research. His thoughts on this topic have been shared in various Twitter/X discussions.
- This research was originally conceptualized in "Hallucinations in Large Language Models" (Ashioya, 2024), which explored the need for more nuanced evaluation of AI-generated content.
We acknowledge the broader AI safety and alignment research community, whose ongoing work on AI evaluation, human preference modeling, and value alignment provides the foundation for this research.
MIT License - See LICENSE file for details.
If you use this work in your research, please cite:
@misc{vac_research_2025,
title={Value-Aligned Confabulation: Moving Beyond Binary Truthfulness in LLM Evaluation},
author={Ashioya Jotham Victor},
year={2025},
note={Research in progress}
}