This repository contains code to reproduce experiments from our journal paper "Rule-Based Explanations for Retrieval-Augmented LLM Systems", as well as the raw results described therein.
Note: Our experiments make use of several LLM APIs, which offer no guarantees of
determinism or reproducibility. We ensure that everything else is fully reproducible,
so your results should be nearly, though perhaps not exactly, identical to ours. The
raw data from our experimental results can be found in the compressed file
final_results_jan_18_2026.tar.xz.
To reproduce our experiments, you will first need to configure your environment. We ran all experiments using Python 3.10.12. You can install the exact dependencies we used by running the following command:
pip install -r requirements.txt
Alternatively, you can install the following core dependencies individually:
anthropic
google-genai
openai
pydantic
pytest
python-dotenv
matplotlib
numpy
Next, you will need to download the HotpotQA training dataset used in the experiments.
At the time of writing, it can be found at
http://curtis.ml.cmu.edu/datasets/hotpot/hotpot_train_v1.1.json.
Please note that the synthetic data used in our quantitative experiments is generated at runtime,
so no additional download is required.
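If you prefer to script the download, a minimal sketch using only the standard library is shown below. The helper name and destination path are illustrative, not part of the repository; the URL is the one given above.

```python
# Hypothetical helper to fetch the HotpotQA training split.
# Note: the file is large, so the download may take a while.
import urllib.request
from pathlib import Path

HOTPOT_URL = "http://curtis.ml.cmu.edu/datasets/hotpot/hotpot_train_v1.1.json"

def download_hotpotqa(dest: Path) -> Path:
    """Download the dataset to dest, skipping the download if it already exists."""
    dest = Path(dest)
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(HOTPOT_URL, dest)
    return dest
```

For example, `download_hotpotqa(Path("data/hotpot_train_v1.1.json"))` would place the file under a local data/ directory; point HOTPOTQA_PATH in your .env file (described below) at wherever you save it.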
Lastly, you will need to provide your own LLM API keys and paths to read / write data.
In the top-level directory, create a .env file and initialize it with the following contents
(replacing each "..." placeholder with your own value).
OPENAI_API_KEY=...
OPENAI_MODEL_QA=gpt-5-mini-2025-08-07
OPENAI_MODEL_JUDGE=gpt-5-mini-2025-08-07
GEMINI_API_KEY=...
GEMINI_MODEL_QA=gemini-2.5-flash
ANTHROPIC_API_KEY=...
ANTHROPIC_MODEL_QA=claude-haiku-4-5-20251001
HOTPOTQA_PATH=...
LLM_TEMPERATURE=1
LLM_MAX_OUTPUT_TOKENS=100
RANDOM_SEED=123
EFFICIENCY_MIN_SOURCES=1
EFFICIENCY_MAX_SOURCES=10
EFFICIENCY_NUM_EXAMPLES_HOTPOTQA=50
EFFICIENCY_NUM_EXAMPLES_SYNTHETIC=1000
ROBUSTNESS_MIN_SAMPLES=1
ROBUSTNESS_MAX_SAMPLES=10
ROBUSTNESS_NUM_SOURCES=5
ROBUSTNESS_NUM_EXAMPLES_HOTPOTQA=50
ROBUSTNESS_NUM_EXAMPLES_SYNTHETIC=1000
EXPERT_VALIDATION_NUM_JUDGMENTS=100
CACHE_PATH=...
PLOTS_PATH=...
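The repository reads these values via python-dotenv, which loads the .env file into the process environment. As a rough standard-library-only illustration of the KEY=VALUE format above (the load_env helper is an assumption for illustration, not code from the repository):

```python
# Minimal sketch of parsing a .env file like the one above.
# The actual experiments use python-dotenv; this only illustrates the format.
import os
from pathlib import Path

def load_env(path: str = ".env") -> dict:
    """Parse KEY=VALUE lines, skipping blanks and comments."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
        # Mirror dotenv's default: do not override variables already set.
        os.environ.setdefault(key.strip(), value.strip())
    return values
```

Note that all values are read as strings; numeric settings such as RANDOM_SEED are cast where they are used.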
To run the efficiency experiment, run the following command:
python run_efficiency_experiment.py
To run the robustness experiment, run the following command:
python run_robustness_experiment.py
Several tests have been written to validate core rule-mining functionality.
They are located in the top-level test directory and can be run using the following command:
python -m pytest test