This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Agentic File Search is an AI-powered document search agent that explores files dynamically rather than using pre-computed embeddings. It uses a three-phase strategy: parallel scan, deep dive, and backtracking for cross-references. There is also an optional DuckDB-backed indexing pipeline for pre-indexed semantic+metadata retrieval.
Tech Stack: Python 3.10+, Google Gemini 3 Flash, LlamaIndex Workflows, Docling (document parsing), DuckDB (indexing), langextract (optional metadata extraction), FastAPI + WebSocket, Typer + Rich CLI.
# Install dependencies
uv pip install .
uv pip install -e ".[dev]" # with dev dependencies
# Run CLI (agentic exploration)
uv run explore --task "What is the purchase price?" --folder data/test_acquisition/
# Run CLI (indexed query - requires prior indexing)
uv run explore index data/test_acquisition/
uv run explore query --task "What is the purchase price?" --folder data/test_acquisition/
# Schema management
uv run explore schema discover data/test_acquisition/
uv run explore schema show data/test_acquisition/
# Run web UI
uv run uvicorn fs_explorer.server:app --host 127.0.0.1 --port 8000
# Run tests
uv run pytest # all tests
uv run pytest tests/test_fs.py # single file
uv run pytest -k "test_name" # single test
# Lint, format, typecheck (also available via Makefile)
uv run pre-commit run -a # lint (or: make lint)
uv run ruff check . # ruff only
uv run ruff format # format (or: make format)
uv run ty check src/fs_explorer/ # typecheck (or: make typecheck)Entry points defined in pyproject.toml: explore → fs_explorer.main:app, explore-ui → fs_explorer.server:run_server.
User Query → Workflow (LlamaIndex) → Agent (Gemini) → Tools → Docling → Filesystem
User Query → Workflow → Agent → semantic_search/get_document → DuckDB → Ranked Results
-
workflow.py: Event-driven orchestration using
llama-index-workflows. DefinesFsExplorerWorkflowwith steps:start_exploration,go_deeper_action,tool_call_action,receive_human_answer. Uses singleton agent viaget_agent(). -
agent.py:
FsExplorerAgentmanages Gemini API interaction. Chat history accumulates in_chat_history.take_action()sends history to LLM, receives structured JSONAction, auto-executes tool calls.TokenUsagetracks costs. Also contains theTOOLSregistry (9 tools),SYSTEM_PROMPT, and indexed tool functions (semantic_search,get_document,list_indexed_documents). Index context is managed via module-levelset_index_context()/clear_index_context(). -
models.py: Pydantic schemas for structured LLM output.
Actioncontains one of:ToolCallAction,GoDeeperAction,StopAction,AskHumanAction.ToolsTypeAlias defines all available tool names. -
fs.py: Filesystem operations.
scan_folder()uses ThreadPoolExecutor for parallel document processing._DOCUMENT_CACHE(dict) caches parsed documents keyed bypath:mtime. Docling converts PDF/DOCX/PPTX/XLSX/HTML/MD to markdown. -
main.py: Typer CLI entry point with subcommands: default (agentic explore),
index,query,schema discover,schema show. -
server.py: FastAPI server with WebSocket endpoint
/ws/explorefor real-time streaming. -
exploration_trace.py: Records tool call paths and extracts cited sources from final answers for the CLI summary.
-
pipeline.py:
IndexingPipelineorchestrates document parsing → chunking → metadata extraction → DuckDB upsert. Walks a folder for supported files, delegates toSmartChunkerandextract_metadata(), handles schema resolution and deleted-file cleanup. -
chunker.py:
SmartChunkersplits parsed document text into overlapping chunks. -
schema.py:
SchemaDiscoveryauto-discovers metadata schemas from a corpus folder (file types, heuristic boolean fields likementions_currency/mentions_dates). Optionally includes langextract fields. -
metadata.py:
extract_metadata()produces per-document metadata dicts. Heuristic fields (filename, extension, document_type, currency/date detection) are always available. Optional langextract integration calls thelangextractlibrary for entity extraction (organizations, people, deal terms, etc.) via configurable profiles.
-
query.py:
IndexedQueryEngineruns parallel semantic (chunk text matching) + metadata (JSON filter) retrieval paths using ThreadPoolExecutor, then merges and ranks viaRankedDocument.combined_score. -
filters.py:
parse_metadata_filters()parses a human-readable filter DSL (field=value,field>=num,field in (a, b),field~substring) intoMetadataFilterobjects. Validates against allowed schema fields. -
ranker.py:
RankedDocumentdataclass withcombined_score(semantic * 100 + metadata * 10).rank_documents()sorts and limits.
-
duckdb.py:
DuckDBStoragemanages four tables:corpora,documents,chunks,schemas. Key operations:upsert_document,search_chunks(keyword-based scoring),search_documents_by_metadata(JSON path filtering viajson_extract_string), schema CRUD. Corpus/doc/chunk IDs are SHA1-based stable hashes. -
base.py:
StorageBackendprotocol and shared dataclasses (DocumentRecord,ChunkRecord,SchemaRecord).
- index_config.py:
resolve_db_path()resolves DuckDB path with precedence: CLI--db-path>FS_EXPLORER_DB_PATHenv >~/.fs_explorer/index.duckdb.
InputEvent→ starts explorationToolCallEvent→ tool executionGoDeeperEvent→ directory navigationAskHumanEvent/HumanAnswerEvent→ human interactionExplorationEndEvent→ completion withfinal_resultorerror
- Implement function in
fs.py(filesystem) oragent.py(indexed) returningstr - Add to
TOOLSdict inagent.py - Add to
ToolsTypeAlias inmodels.py - Update
SYSTEM_PROMPTinagent.py - Update
TOOL_ICONSandPHASE_DESCRIPTIONSinmain.py
GOOGLE_API_KEY(required) — in.envfile or environment variableFS_EXPLORER_DB_PATH(optional) — override default DuckDB locationFS_EXPLORER_LANGEXTRACT_MAX_CHARS(optional) — max chars sent to langextract (default 6000)FS_EXPLORER_LANGEXTRACT_MODEL(optional) — model for langextract (defaultgemini-3-flash-preview)
Tests mock the Gemini client via MockGenAIClient in conftest.py. Use reset_agent() to clear singleton state between tests. The mock always returns a StopAction response.
Key test files:
test_agent.py/test_e2e.py— agent and workflow integrationtest_fs.py— filesystem toolstest_indexing.py/test_cli_indexing.py— indexing pipeline and CLItest_search.py— search/filter/rankingtest_exploration_trace.py— trace and citation extraction
Test documents live in data/test_acquisition/ and data/large_acquisition/. Test fixtures for unit tests are in tests/testfiles/.