This repository contains code to reproduce experiments from our journal paper "Rule-Based Explanations for Retrieval-Augmented LLM Systems", as well as the raw results described therein.
Note: Our experiments make use of several LLM APIs, which offer no guarantees of
determinism or reproducibility. We ensure that everything else is fully reproducible,
so your results should be nearly, though perhaps not exactly, identical to ours. The
raw data from our experimental results can be found in the compressed file
final_results_jan_18_2026.tar.xz.
To reproduce our experiments, you will first need to configure your environment. We ran all experiments using Python 3.10.12. You can install the exact dependencies we used by running the following command:
pip install -r requirements.txt
Alternatively, you can install the following core dependencies individually:
anthropic
google-genai
openai
pydantic
pytest
python-dotenv
matplotlib
numpy
Next, you will need to download the HotpotQA training dataset used in the experiments.
At the time of writing, it can be found at
http://curtis.ml.cmu.edu/datasets/hotpot/hotpot_train_v1.1.json.
Please note that the synthetic data used in our quantitative experiments is generated at runtime,
so no additional download is required.
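If you prefer to script the download, a minimal sketch using only the standard library is shown below. The helper name and destination path are illustrative, not part of the repository; the URL is the one given above.

```python
# Hypothetical helper to fetch the HotpotQA training split.
# Note: the file is large, so the download may take a while.
import urllib.request
from pathlib import Path

HOTPOT_URL = "http://curtis.ml.cmu.edu/datasets/hotpot/hotpot_train_v1.1.json"

def download_hotpotqa(dest: Path) -> Path:
    """Download the dataset to dest, skipping the download if it already exists."""
    dest = Path(dest)
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(HOTPOT_URL, dest)
    return dest
```

For example, `download_hotpotqa(Path("data/hotpot_train_v1.1.json"))` would place the file under a local data/ directory; point HOTPOTQA_PATH in your .env file (described below) at wherever you save it.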
Lastly, you will need to provide your own LLM API keys and paths to read / write data.
In the top-level directory, create a .env file and initialize it with the following contents
(replacing each "..." placeholder with your own value).
OPENAI_API_KEY=...
OPENAI_MODEL_QA=gpt-5-mini-2025-08-07
OPENAI_MODEL_JUDGE=gpt-5-mini-2025-08-07
GEMINI_API_KEY=...
GEMINI_MODEL_QA=gemini-2.5-flash
ANTHROPIC_API_KEY=...
ANTHROPIC_MODEL_QA=claude-haiku-4-5-20251001
HOTPOTQA_PATH=...
LLM_TEMPERATURE=1
LLM_MAX_OUTPUT_TOKENS=100
RANDOM_SEED=123
EFFICIENCY_MIN_SOURCES=1
EFFICIENCY_MAX_SOURCES=10
EFFICIENCY_NUM_EXAMPLES_HOTPOTQA=50
EFFICIENCY_NUM_EXAMPLES_SYNTHETIC=1000
ROBUSTNESS_MIN_SAMPLES=1
ROBUSTNESS_MAX_SAMPLES=10
ROBUSTNESS_NUM_SOURCES=5
ROBUSTNESS_NUM_EXAMPLES_HOTPOTQA=50
ROBUSTNESS_NUM_EXAMPLES_SYNTHETIC=1000
EXPERT_VALIDATION_NUM_JUDGMENTS=100
CACHE_PATH=...
PLOTS_PATH=...
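The repository reads these values via python-dotenv, which loads the .env file into the process environment. As a rough standard-library-only illustration of the KEY=VALUE format above (the load_env helper is an assumption for illustration, not code from the repository):

```python
# Minimal sketch of parsing a .env file like the one above.
# The actual experiments use python-dotenv; this only illustrates the format.
import os
from pathlib import Path

def load_env(path: str = ".env") -> dict:
    """Parse KEY=VALUE lines, skipping blanks and comments."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
        # Mirror dotenv's default: do not override variables already set.
        os.environ.setdefault(key.strip(), value.strip())
    return values
```

Note that all values are read as strings; numeric settings such as RANDOM_SEED are cast where they are used.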
To run the efficiency experiment, run the following command:
python run_efficiency_experiment.py
To run the robustness experiment, run the following command:
python run_robustness_experiment.py
Several tests have been written to validate core rule-mining functionality.
They are located in the top-level test directory and can be run using the following command:
python -m pytest test