Generates multi-hop question-answer pairs from ClueWeb22-B documents using an LLM-driven agentic loop. Designed to produce hard, research-style questions where the answer requires reasoning across multiple linked documents.
Based on ASearcher, adapted to ClueWeb.
`asearcher_rlm.py` implements a `ConstructQAAgent` that:
- Seeds from a random Wikipedia-subset ClueWeb document -- extracts title, summary, and information points via LLM.
- Creates a base QA grounded in that document's content.
- Iterates (up to `MAX_TURNS`) choosing actions:
  - SELECT -- pick a linked document (via inlinks/outlinks), build a bridging QA, and merge it into the current question.
  - FUZZ -- rephrase the question to make it vaguer or harder.
  - BRAINSTORM -- propose related entities (with ClueWeb IDs) and fold their facts into the question.
  - EXIT -- stop when the question is sufficiently hard.
- Validates each iteration: checks QA validity, runs direct-answer generation (n=8), uses an LLM judge to measure accuracy, and rejects questions with alternative correct answers.
- Saves each generated QA as a `.jsonl` file in `./generated_qas/`.
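
The loop described above can be sketched roughly as follows. This is an illustrative outline, not the code in `asearcher_rlm.py`: `choose_action`, `apply_action`, `is_valid`, and `construct_question` are hypothetical stand-ins for the LLM-backed steps, and keeping the previous question when validation fails is an assumption. Only the action names and `MAX_TURNS` come from this README.

```python
import random
from enum import Enum


class Action(Enum):
    SELECT = "select"
    FUZZ = "fuzz"
    BRAINSTORM = "brainstorm"
    EXIT = "exit"


MAX_TURNS = 6  # default listed in the configuration table


def choose_action(question: str, rng: random.Random) -> Action:
    """Hypothetical stand-in for the reasoning-model call that picks an action."""
    return rng.choice(list(Action))


def apply_action(question: str, action: Action) -> str:
    """Hypothetical stand-in: the real agent rewrites the question via an LLM."""
    return f"{question} [{action.value}]"


def is_valid(question: str) -> bool:
    """Hypothetical stand-in for the per-iteration validation step."""
    return True


def construct_question(base_question: str, seed: int = 0) -> tuple[str, list[str]]:
    rng = random.Random(seed)
    question, history = base_question, []
    for _ in range(MAX_TURNS):
        action = choose_action(question, rng)
        if action is Action.EXIT:
            history.append(action.name)
            break
        candidate = apply_action(question, action)
        # Assumption: a rejected edit is dropped and the old question is kept.
        if is_valid(candidate):
            question = candidate
            history.append(action.name)
    return question, history
```

Calling `construct_question("Who founded X?")` returns the final question together with the list of actions that were applied, which loosely mirrors the `edit_history` field in the output.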
Dependencies:
- `openai` (`AsyncOpenAI`)
- `clueweb22` (`ClueWeb22Api`) -- requires local access to ClueWeb22-B at `/data/datasets/clueweb22/ClueWeb22_B`
- `tqdm`, `python-dotenv`
Edit the `__main__` block in `asearcher_rlm.py`:
| Variable | Default | Description |
|---|---|---|
| `REASONING_MODEL` | `gpt-4.1-mini` | Model for reasoning tasks (action choice, QA construction) |
| `INSTRUCT_MODEL` | `gpt-4.1-mini` | Model for extraction tasks (summarization, info points) |
| `BASE_URL` | `None` | Custom API endpoint (e.g. vLLM) |
| `NUM_QUESTIONS` | `740` | Total questions to generate |
| `MAX_TURNS` | `6` | Max iterations per question |
| `SAVE_PATH` | `./generated_qas` | Output directory |
The API key is loaded from `keys.env` (`OPENAI_API_KEY`).

    python asearcher_rlm.py

This runs 128 concurrent generation tasks. Each QA is saved as `./generated_qas/<uuid>.jsonl`.
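
One common way to cap the number of in-flight tasks is an `asyncio.Semaphore`. The sketch below assumes that pattern; this README does not say how the script enforces its limit of 128, and `generate_one` and `run_all` are hypothetical names standing in for the real generation code.

```python
import asyncio

CONCURRENCY = 128  # matches the figure quoted above


async def generate_one(index: int) -> int:
    """Hypothetical placeholder for one QA-generation task."""
    await asyncio.sleep(0)  # stands in for the LLM and ClueWeb calls
    return index


async def run_all(num_questions: int, limit: int = CONCURRENCY) -> list[int]:
    semaphore = asyncio.Semaphore(limit)

    async def bounded(index: int) -> int:
        # At most `limit` tasks hold the semaphore at any moment.
        async with semaphore:
            return await generate_one(index)

    # gather preserves the order of the submitted coroutines.
    return await asyncio.gather(*(bounded(i) for i in range(num_questions)))
```

For example, `asyncio.run(run_all(740))` would schedule all 740 tasks while letting no more than 128 run at once.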
Each `.jsonl` file contains a single JSON object:

    {
      "qa": {"question": "...", "answer": "..."},
      "relevant": [{"name": "...", "clueweb_id": "...", "summary": "...", ...}],
      "statements": ["fact 1", "fact 2"],
      "edit_history": ["action log..."],
      "qa_history": [{"question": "...", "answer": "...", "direct_gen_acc": "0/8"}]
    }

- `relevant` -- all ClueWeb documents used to construct the question.
- `statements` -- factual claims the question depends on.
- `qa_history` -- progression of the question through iterations, with direct-generation accuracy at each step.
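
To inspect the results, each file can be read with the standard `json` module. The helpers below are a sketch that assumes the layout shown above, including `direct_gen_acc` being a `"correct/total"` string; `parse_acc`, `load_qas`, and `final_accuracy` are names introduced here for illustration and are not part of `asearcher_rlm.py`.

```python
import json
from pathlib import Path


def parse_acc(value: str) -> float:
    """Turn a "correct/total" string such as "0/8" into a fraction."""
    correct, total = value.split("/")
    return int(correct) / int(total)


def load_qas(directory: str = "./generated_qas") -> list[dict]:
    """Read every .jsonl file in the directory; each holds one JSON object."""
    records = []
    for path in sorted(Path(directory).glob("*.jsonl")):
        records.append(json.loads(path.read_text(encoding="utf-8")))
    return records


def final_accuracy(record: dict) -> float:
    """Direct-generation accuracy at the last step of qa_history."""
    return parse_acc(record["qa_history"][-1]["direct_gen_acc"])
```

A low `final_accuracy` means the model rarely answers the final question without retrieval, which is the property the generation loop is aiming for.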
data/asearcher/full.json