Synthetic QA Generation over ClueWeb

Generates multi-hop question-answer pairs from ClueWeb22-B documents using an LLM-driven agentic loop. It is designed to produce hard, research-style questions whose answers require reasoning across multiple linked documents.

Based on ASearcher, adapted to ClueWeb.

How it works

asearcher_rlm.py implements a ConstructQAAgent that:

1. Seeds from a random Wikipedia-subset ClueWeb document -- extracts title, summary, and information points via LLM.
2. Creates a base QA grounded in that document's content.
3. Iterates (up to MAX_TURNS times), choosing among these actions:
   - SELECT -- pick a linked document (via inlinks/outlinks), build a bridging QA, and merge it into the current question.
   - FUZZ -- rephrase the question to make it vaguer or harder.
   - BRAINSTORM -- propose related entities (with ClueWeb IDs) and fold their facts into the question.
   - EXIT -- stop when the question is sufficiently hard.
4. Validates each iteration: checks QA validity, runs direct-answer generation (n=8), uses an LLM judge to measure accuracy, and rejects questions with alternative correct answers.
5. Saves each generated QA as a .jsonl file in ./generated_qas/.
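
The control flow of the steps above can be sketched roughly as follows. This is a minimal illustration, not the code in asearcher_rlm.py: `QAState`, `choose_action`, `apply_action`, and `is_valid` are hypothetical stand-ins for the LLM calls, and keeping the previous QA when validation fails is an assumption.

```python
import random
from dataclasses import dataclass, field

MAX_TURNS = 6  # default from the configuration table
ACTIONS = ("SELECT", "FUZZ", "BRAINSTORM", "EXIT")


@dataclass
class QAState:
    """Hypothetical container for the current question and its edit log."""
    question: str
    answer: str
    edit_history: list = field(default_factory=list)


def choose_action(state, rng):
    """Hypothetical stand-in for the LLM call that picks the next action."""
    return rng.choice(ACTIONS)


def apply_action(state, action):
    """Hypothetical stand-in for the LLM rewrite; it only tags the question."""
    return QAState(
        question=f"{state.question} [{action}]",
        answer=state.answer,
        edit_history=state.edit_history + [action],
    )


def is_valid(candidate):
    """Hypothetical stand-in for the validation step; always accepts here."""
    return True


def construct_qa(base, rng, max_turns=MAX_TURNS):
    state = base
    for _ in range(max_turns):
        action = choose_action(state, rng)
        if action == "EXIT":
            break
        candidate = apply_action(state, action)
        # Assumption: a rejected candidate is dropped and the old QA is kept.
        if is_valid(candidate):
            state = candidate
    return state


final = construct_qa(QAState("Who founded X?", "Y"), random.Random(0))
print(final.question)
print(final.edit_history)
```

In the real agent each stand-in would be a model call, and validation would involve the direct-answer generation and LLM-judge checks described in step 4.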

Dependencies

- openai (AsyncOpenAI)
- clueweb22 (ClueWeb22Api) -- requires local access to ClueWeb22-B at /data/datasets/clueweb22/ClueWeb22_B
- tqdm, python-dotenv

Configuration

Edit the __main__ block in asearcher_rlm.py:

| Variable | Default | Description |
| --- | --- | --- |
| `REASONING_MODEL` | `gpt-4.1-mini` | Model for reasoning tasks (action choice, QA construction) |
| `INSTRUCT_MODEL` | `gpt-4.1-mini` | Model for extraction tasks (summarization, info points) |
| `BASE_URL` | `None` | Custom API endpoint (e.g. vLLM) |
| `NUM_QUESTIONS` | `740` | Total questions to generate |
| `MAX_TURNS` | `6` | Max iterations per question |
| `SAVE_PATH` | `./generated_qas` | Output directory |

The API key is loaded from keys.env (OPENAI_API_KEY).
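
One way the key and `BASE_URL` might be wired into the client is sketched below. The helper names `client_kwargs` and `make_client` are hypothetical and the script may do this differently; the sketch relies only on `load_dotenv` from python-dotenv and on the `api_key` and `base_url` arguments of `AsyncOpenAI`.

```python
import os

BASE_URL = None  # default from the table; set to a custom endpoint URL if needed


def client_kwargs(env, base_url=BASE_URL):
    """Build keyword arguments for AsyncOpenAI from an environment mapping."""
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set; check keys.env")
    kwargs = {"api_key": api_key}
    if base_url is not None:
        # Only override the endpoint when a custom one is configured.
        kwargs["base_url"] = base_url
    return kwargs


def make_client(env_file="keys.env"):
    """Hypothetical helper: load keys.env, then construct the async client."""
    # Imported here so client_kwargs can be used without these packages.
    from dotenv import load_dotenv
    from openai import AsyncOpenAI

    load_dotenv(env_file)  # copies OPENAI_API_KEY from the file into os.environ
    return AsyncOpenAI(**client_kwargs(os.environ))
```

Leaving `base_url` out when it is `None` lets the client fall back to its default endpoint.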

Usage

python asearcher_rlm.py

Runs 128 concurrent generation tasks. Each QA is saved as ./generated_qas/<uuid>.jsonl.
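
A common way to cap the number of in-flight tasks is an `asyncio.Semaphore`, sketched below. Whether asearcher_rlm.py uses this exact mechanism is an assumption, and `generate_one` is a hypothetical placeholder for one full QA construction run.

```python
import asyncio

CONCURRENCY = 128    # concurrency level stated in the usage note
NUM_QUESTIONS = 740  # default from the configuration table


async def generate_one(index):
    """Hypothetical placeholder for one QA construction run."""
    await asyncio.sleep(0)  # yield control, as a real API call would
    return {"id": index}


async def run_all(num_questions=NUM_QUESTIONS, concurrency=CONCURRENCY):
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(index):
        # At most `concurrency` coroutines are inside this block at once.
        async with semaphore:
            return await generate_one(index)

    # gather returns results in the order the coroutines were passed in.
    return await asyncio.gather(*(bounded(i) for i in range(num_questions)))


results = asyncio.run(run_all(num_questions=10, concurrency=4))
print(len(results))
```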

Output format

Each .jsonl file contains a single JSON object:

{
  "qa": {"question": "...", "answer": "..."},
  "relevant": [{"name": "...", "clueweb_id": "...", "summary": "...", ...}],
  "statements": ["fact 1", "fact 2"],
  "edit_history": ["action log..."],
  "qa_history": [{"question": "...", "answer": "...", "direct_gen_acc": "0/8"}]
}

relevant -- all ClueWeb documents used to construct the question.
statements -- factual claims the question depends on.
qa_history -- progression of the question through iterations, with direct-generation accuracy at each step.
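
The files can be read back with a few lines of Python. This sketch assumes each file holds its JSON object on a single line and that the last `qa_history` entry corresponds to the final question; `load_qas`, `parse_acc`, and `final_accuracy` are hypothetical helper names.

```python
import json
from pathlib import Path


def parse_acc(value):
    """Convert a 'correct/total' string such as '0/8' into a float."""
    correct, total = value.split("/")
    return int(correct) / int(total)


def load_qas(save_path="./generated_qas"):
    """Yield one record per .jsonl file in the output directory."""
    for path in sorted(Path(save_path).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            # Assumption: the single JSON object sits on the first line.
            yield json.loads(handle.readline())


def final_accuracy(record):
    """Direct-generation accuracy of the last entry in qa_history."""
    return parse_acc(record["qa_history"][-1]["direct_gen_acc"])


sample = {
    "qa": {"question": "q", "answer": "a"},
    "qa_history": [
        {"question": "q0", "answer": "a", "direct_gen_acc": "6/8"},
        {"question": "q", "answer": "a", "direct_gen_acc": "0/8"},
    ],
}
print(final_accuracy(sample))
```

A lower final accuracy means the model answered the question correctly less often without access to the documents, which is the intended measure of difficulty.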

Pre-generated queries

Pre-generated queries are available in data/asearcher/full.json.
