cxcscmu/search_agent_synthetic_query

Synthetic QA Generation over ClueWeb

Generates multi-hop question-answer pairs from ClueWeb22-B documents using an LLM-driven agentic loop. It is designed to produce hard, research-style questions whose answers require reasoning across multiple linked documents.

Based on ASearcher, adapted to ClueWeb.

How it works

asearcher_rlm.py implements a ConstructQAAgent that:

  1. Seeds from a random Wikipedia-subset ClueWeb document -- extracts title, summary, and information points via LLM.
  2. Creates a base QA grounded in that document's content.
  3. Iterates (up to MAX_TURNS) choosing actions:
    • SELECT -- pick a linked document (via inlinks/outlinks), build a bridging QA, and merge it into the current question.
    • FUZZ -- rephrase the question to make it vaguer or harder.
    • BRAINSTORM -- propose related entities (with ClueWeb IDs) and fold their facts into the question.
    • EXIT -- stop when the question is sufficiently hard.
  4. Validates each iteration: checks QA validity, runs direct-answer generation (n=8), uses an LLM judge to measure accuracy, and rejects questions that admit alternative correct answers.
  5. Saves each generated QA as a .jsonl file in ./generated_qas/.
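
The control flow of steps 3 and 4 can be sketched as follows. This is a minimal, synchronous illustration under stated assumptions, not the code in asearcher_rlm.py: `QAState`, `choose_action`, `apply_action`, and `construct_qa` are hypothetical names, and the stubs stand in for what the README describes as LLM-driven decisions and edits.

```python
import random
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    SELECT = "SELECT"
    FUZZ = "FUZZ"
    BRAINSTORM = "BRAINSTORM"
    EXIT = "EXIT"


@dataclass
class QAState:
    question: str
    answer: str
    edit_history: list = field(default_factory=list)


def choose_action(state, turn, rng):
    # Stub policy (assumption): the real agent would ask the reasoning model.
    return rng.choice(list(Action))


def apply_action(state, action):
    # Stub edit (assumption): the real agent would rewrite the question via LLM calls.
    state.edit_history.append(action.value)
    state.question = f"{state.question} [{action.value.lower()}]"
    return state


def construct_qa(seed_question, seed_answer, max_turns=6, seed=0):
    """Run at most max_turns actions, stopping early on EXIT."""
    rng = random.Random(seed)
    state = QAState(seed_question, seed_answer)
    for turn in range(max_turns):
        action = choose_action(state, turn, rng)
        if action is Action.EXIT:
            state.edit_history.append(action.value)
            break
        state = apply_action(state, action)
    return state
```

Calling `construct_qa("Which city ...?", "Pittsburgh")` returns a state whose `edit_history` holds at most `max_turns` entries, with EXIT appearing only as the final entry if it was chosen. The per-iteration validation from step 4 is omitted here for brevity.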

Dependencies

  • openai (AsyncOpenAI)
  • clueweb22 (ClueWeb22Api) -- requires local access to ClueWeb22-B at /data/datasets/clueweb22/ClueWeb22_B
  • tqdm, python-dotenv

Configuration

Edit the __main__ block in asearcher_rlm.py:

| Variable | Default | Description |
| REASONING_MODEL | gpt-4.1-mini | Model for reasoning tasks (action choice, QA construction) |
| INSTRUCT_MODEL | gpt-4.1-mini | Model for extraction tasks (summarization, info points) |
| BASE_URL | None | Custom API endpoint (e.g. vLLM) |
| NUM_QUESTIONS | 740 | Total questions to generate |
| MAX_TURNS | 6 | Max iterations per question |
| SAVE_PATH | ./generated_qas | Output directory |

The API key is loaded from keys.env (OPENAI_API_KEY).
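
For reference, the defaults above and the key file can be mirrored in a small stdlib-only sketch. This is an assumption-laden illustration: the `Config` dataclass and `read_env_file` helper are hypothetical, the script itself keeps these values as variables in its `__main__` block, and the hand-written parser is a stand-in for python-dotenv, not a copy of how the script reads keys.env.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Config:
    # Defaults copied from the configuration table.
    reasoning_model: str = "gpt-4.1-mini"
    instruct_model: str = "gpt-4.1-mini"
    base_url: Optional[str] = None
    num_questions: int = 740
    max_turns: int = 6
    save_path: str = "./generated_qas"


def read_env_file(path="keys.env"):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values
```

A keys.env containing the single line `OPENAI_API_KEY=<your key>` is enough for this parser; `read_env_file()["OPENAI_API_KEY"]` then yields the key.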

Usage

python asearcher_rlm.py

Runs 128 concurrent generation tasks. Each QA is saved as ./generated_qas/<uuid>.jsonl.
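
One common way to cap the number of in-flight tasks is an `asyncio.Semaphore`. The sketch below shows that pattern only; whether the script uses a semaphore, a worker pool, or something else is not stated here, and `generate_one` and `run_all` are hypothetical names with a placeholder body in place of the real QA construction.

```python
import asyncio
import uuid


async def generate_one(index):
    # Placeholder for one full QA-construction run (assumption).
    await asyncio.sleep(0)
    return {"id": str(uuid.uuid4()), "index": index}


async def run_all(num_questions, concurrency=128):
    """Run num_questions tasks with at most `concurrency` active at once."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(index):
        async with sem:
            return await generate_one(index)

    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(bounded(i) for i in range(num_questions)))
```

For example, `asyncio.run(run_all(num_questions=740))` would schedule all 740 tasks while letting only 128 proceed past the semaphore at any moment.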

Output format

Each .jsonl file contains a single JSON object:

{
  "qa": {"question": "...", "answer": "..."},
  "relevant": [{"name": "...", "clueweb_id": "...", "summary": "...", ...}],
  "statements": ["fact 1", "fact 2"],
  "edit_history": ["action log..."],
  "qa_history": [{"question": "...", "answer": "...", "direct_gen_acc": "0/8"}]
}

  • relevant -- all ClueWeb documents used to construct the question.
  • statements -- factual claims the question depends on.
  • qa_history -- progression of the question through iterations, with direct-generation accuracy at each step.
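
Reading the output back can be sketched as below, assuming each file holds exactly one JSON object as described above and that `direct_gen_acc` always has the "correct/total" form shown in the example. The helper names `load_generated_qas`, `parse_acc`, and `final_accuracy` are hypothetical.

```python
import json
from pathlib import Path


def load_generated_qas(save_path="./generated_qas"):
    """Load every .jsonl file in save_path as a single JSON object."""
    records = []
    for path in sorted(Path(save_path).glob("*.jsonl")):
        records.append(json.loads(path.read_text(encoding="utf-8")))
    return records


def parse_acc(value):
    """Turn a string such as "0/8" into the pair (correct, total)."""
    correct, total = value.split("/")
    return int(correct), int(total)


def final_accuracy(record):
    """Direct-generation accuracy of the last entry in qa_history."""
    correct, total = parse_acc(record["qa_history"][-1]["direct_gen_acc"])
    return correct / total
```

A lower `final_accuracy` means the model answered the final question correctly less often without retrieval, which the README treats as a signal of a harder question.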

Pre-generated queries

  • data/asearcher/full.json
