An Automated Research Pipeline from GitHub Repositories to Structured Semantic Indexes
Automatically transform any GitHub code repository into structured semantic assets indexed by research questions and grounded in code structure,
providing reproducible, computable, and extensible intermediate representations for downstream graph search, subgraph matching, rule discovery, and LLM-based reasoning systems.
When facing a large and unfamiliar codebase, common approaches are:
- Manually reading the README
- Searching keywords across files
- Directly feeding code snippets to an LLM
These approaches suffer from fundamental limitations:
- Large repositories exceed controllable context limits
- Lack of structure prevents systematic search and reasoning
- LLM outputs are ephemeral and non-reproducible
- Pure text-based retrieval ignores structural constraints
The goal of this project is not to “make an LLM understand code”.
Instead:
**First transform the repository into structured, indexable semantic and structural assets,
then let LLMs operate within a constrained and well-defined structure.**
This project follows a clear guiding principle:
Structure First, LLM Second
- Code structure (files, functions, calls, code flows) is generated by deterministic programs
- LLMs operate only within explicitly constrained structural spaces
- All critical intermediate results are persisted as JSON / DOT artifacts, enabling reproducibility and debugging
```
GitHub Repository
        ↓
Repository Structure JSON (directory hierarchy)
        ↓
Flattened File Path List
        ↓
README Summary (LLM, semantic anchor)
        ↓
Technical Research Question Generation (LLM, JSON)
        ↓
Question → File Path Mapping (JSON)
        ↓
Question → File → Code Node Mapping (structured)
        ↓
(Downstream: Graph Search / Subgraph Matching / Reasoning)
```
The final output is not a natural-language answer,
but a collection of reusable structured semantic assets.
- Convert nested directory structures into a flattened list of file paths
- Explicitly constrain the LLM to select only from real, existing files
This design:
- Prevents hallucination
- Reduces cognitive load
- Restricts free-form generation into controlled selection
👉 This is a system-level input constraint layer, not a simple utility.
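A minimal sketch of this flattening step; the nested-JSON field names (`name`, `type`, `children`) are illustrative assumptions, not the project's actual structure schema:

```python
from typing import Any

def flatten_repo_tree(node: dict[str, Any], prefix: str = "") -> list[str]:
    """Recursively flatten a nested directory-structure JSON into file paths.
    The field names ("name", "type", "children") are assumed for illustration."""
    path = f"{prefix}/{node['name']}" if prefix else node["name"]
    if node.get("type") == "file":
        return [path]
    paths: list[str] = []
    for child in node.get("children", []):
        paths.extend(flatten_repo_tree(child, path))
    return paths

# A tiny repository tree flattened into the list of real files the LLM may select from.
tree = {
    "name": "repo", "type": "dir", "children": [
        {"name": "README.md", "type": "file"},
        {"name": "src", "type": "dir", "children": [
            {"name": "main.py", "type": "file"},
        ]},
    ],
}
print(flatten_repo_tree(tree))  # ['repo/README.md', 'repo/src/main.py']
```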
In this system, README summarization is not cosmetic.
It provides:
- High-level semantic compression of project intent
- Architectural and technical signal extraction
- The only global context for downstream question generation
Design principle:
- README = global semantics
- File list = local facts
Question generation is not about “asking a few questions”.
Its purpose is:
To elevate a code repository from a code collection into a research object
Each question is:
- Hierarchical: architecture → subsystems → algorithms → implementation
- Executable: explicitly mapped to real file paths
- Structured: enforced JSON output, no free text
Prompt constraints ensure:
- No hallucinated files
- Bounded question counts
- Emphasis on research value over generic queries
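A minimal sketch of how these constraints can be checked downstream; the JSON shape and the bound of 20 questions are illustrative assumptions, not the project's actual schema:

```python
import json

# Hypothetical question-generation output; the real schema may differ.
raw = """
[
  {"question": "How are call / dependency edges extracted from source files?",
   "files": ["backend/src/services/file_processor/extractor/code_edge_extractor.py"]},
  {"question": "How does question generation stay anchored to the README summary?",
   "files": ["backend/src/services/retrieval/readme_summarizer.py"]}
]
"""

real_files = {
    "backend/src/services/file_processor/extractor/code_edge_extractor.py",
    "backend/src/services/retrieval/readme_summarizer.py",
}

questions = json.loads(raw)
assert len(questions) <= 20, "bounded question count (20 is an illustrative limit)"
for q in questions:
    # Every referenced path must come from the real, flattened file list.
    missing = [p for p in q["files"] if p not in real_files]
    assert not missing, f"hallucinated files: {missing}"
```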
This stage bridges:
Natural-language research questions → the code structure world
By mapping:
- Question → file paths
- File paths → code nodes (functions / directory nodes)
Three layers are aligned:
- Semantic layer (Question)
- File layer (Repository paths)
- Structural layer (Code / Graph nodes)
Path suffix matching strategy:
`file_parts == node_parts[-len(file_parts):]`
This design:
- Avoids absolute path dependency
- Is robust to repository root changes
- Preserves localization accuracy
It lays the foundation for graph construction, subgraph search, path reasoning, and rule abstraction.
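A minimal, self-contained sketch of this matching strategy (the helper name is illustrative):

```python
def match_file_to_node(file_path: str, node_path: str) -> bool:
    """Suffix-match a repository-relative file path against a code-node path,
    so absolute prefixes (e.g. a local clone directory) do not break mapping."""
    file_parts = file_path.strip("/").split("/")
    node_parts = node_path.strip("/").split("/")
    return file_parts == node_parts[-len(file_parts):]

# The node path carries a machine-specific clone prefix; the mapping still holds.
print(match_file_to_node(
    "src/services/llm/chatgpt_client.py",
    "/tmp/clones/backend/src/services/llm/chatgpt_client.py",
))  # True
```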
In this project, JSON is not merely an output format; it is a system contract:
- LLM output: JSON string
- System boundary: `json.loads()` → JSON object
- Deterministic processing begins here
Confusing strings with objects will cause downstream failures in mapping, graph construction, and indexing.
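A minimal illustration of that boundary (the payload shown is hypothetical):

```python
import json

llm_output = '{"question": "How are call edges extracted?", "files": ["src/a.py"]}'

# Wrong: indexing the raw string as if it were an object raises TypeError.
# llm_output["files"]

# Right: cross the boundary exactly once, then all downstream processing
# (mapping, graph construction, indexing) works on real Python objects.
payload = json.loads(llm_output)
assert isinstance(payload, dict)
print(payload["files"])  # ['src/a.py']
```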
This is not an LLM tool.
It is not a simple code analysis script.
It is a:
Research engine that transforms code → semantics → structure → reasoning
LLMs are controlled components;
structure and data flow are the core of the system.
- Core pipeline fully implemented and validated
- Tested on multiple real-world repositories
- Intermediate artifacts are stable and reproducible
- Graph and subgraph matching
- Path-level similarity computation
- Rule abstraction and transfer
- Integration with vector indexes and graph databases
CodeWhisperer is a system that automatically converts GitHub repositories into structured semantic indexes grounded in code structure, enabling long-term, evolvable reasoning over complex codebases.
This appendix documents the engineering structure, module responsibilities,
and output artifacts of CodeWhisperer.
```
backend/
├── data/
│   └── git_repo/              # Real repositories for experiments
├── src/
│   ├── services/
│   │   ├── data_io/
│   │   ├── file_processor/
│   │   ├── git_repo_processor/
│   │   ├── llm/
│   │   ├── prompt/
│   │   └── retrieval/
│   └── output/                # All structured artifacts
└── tests/
```
```
data_io/
├── file_reader.py
└── file_writer.py
```
- Unified file I/O abstraction
- Clear JSON vs. text boundaries
- All intermediate artifacts persist through this layer
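A minimal sketch of that boundary, with hypothetical helpers standing in for file_reader.py / file_writer.py (the artifact file name is illustrative):

```python
import json
from pathlib import Path

def write_json(path: str, obj) -> None:
    """Persist an intermediate artifact as JSON: objects in memory,
    strings on disk. Hypothetical stand-in for file_writer.py."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(obj, indent=2, ensure_ascii=False), encoding="utf-8")

def read_json(path: str):
    """Load an artifact back into Python objects. Stand-in for file_reader.py."""
    return json.loads(Path(path).read_text(encoding="utf-8"))

write_json("output/function_extraction/nodes.json", {"nodes": []})
print(read_json("output/function_extraction/nodes.json"))  # {'nodes': []}
```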
```
file_processor/
├── extractor/
│   ├── code_func_finder.py
│   └── code_edge_extractor.py
├── parser/
│   └── dot_file_parser.py
├── flow/
│   └── code_flow_analyzer.py
├── readme/
│   └── markdown_processor.py
└── utils/
```
Responsibilities:
- Function node extraction
- Call / dependency edge extraction
- DOT graph parsing
- Raw → processed code flow construction
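A minimal sketch of the raw-DOT-to-edges step, assuming pydot as the parser (dot_file_parser.py may use a different library, and the node / edge artifact shape shown here is an assumption):

```python
import pydot  # assumption: the project's parser may differ

# A tiny raw call graph in DOT form, of the kind stored under codeflow_raw/.
dot_text = """
digraph calls {
    main -> load_repo;
    load_repo -> clone_repository;
}
"""

graph = pydot.graph_from_dot_data(dot_text)[0]
edges = [(e.get_source(), e.get_destination()) for e in graph.get_edges()]

# Shape the result into a node / edge artifact; the persisted JSON format
# under function_extraction/ is an assumption here.
artifact = {
    "nodes": sorted({name for edge in edges for name in edge}),
    "edges": [{"caller": src, "callee": dst} for src, dst in edges],
}
print(artifact)
```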
```
git_repo_processor/
└── git_repo_cloner/
```
- Repository cloning
- Extensible to branch / commit / cache control
```
llm/
├── chatgpt_client.py
└── llm_coder_handler.py
```
- Unified LLM invocation layer
- Long-context segmentation support
- Prompt and model decoupling
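A minimal sketch of the segmentation idea, splitting on line boundaries under a rough character budget (a stand-in for real token counting; llm_coder_handler.py may segment differently):

```python
def split_into_chunks(text: str, max_chars: int = 8000) -> list[str]:
    """Segment a long input on line boundaries so each chunk stays under a
    rough character budget before it is sent to the LLM."""
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for line in text.splitlines(keepends=True):
        if size + len(line) > max_chars and current:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks

# Long source files are sent to the LLM chunk by chunk.
long_source = "\n".join(f"line {i}" for i in range(10_000))
print(len(split_into_chunks(long_source)))
```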
```
prompt/
├── system_prompt/
└── user_prompt/
```
- System prompts: role and output constraints
- User prompts: dynamic structural injection
- Supports prompt iteration and A/B testing
```
retrieval/
├── readme_summarizer.py
├── technical_question_mapper.py
├── query_with_related_code_builder.py
└── repo_analysis_pipeline.py
```
- README semantic anchoring
- Question → file mapping
- Query construction with structural context
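A minimal sketch of query construction with structural context; the function name and output format are illustrative, not the actual API of query_with_related_code_builder.py:

```python
def build_query_with_related_code(question: str, files_to_code: dict[str, str]) -> str:
    """Assemble a retrieval query: the research question followed by the
    source of each file it was mapped to."""
    parts = [f"Research question: {question}"]
    for path, code in files_to_code.items():
        parts.append(f"--- {path} ---\n{code}")
    return "\n\n".join(parts)

query = build_query_with_related_code(
    "How are call / dependency edges extracted?",
    {
        "backend/src/services/file_processor/extractor/code_edge_extractor.py":
            "def extract_edges(...): ...",
    },
)
print(query)
```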
```
output/
├── codeflow_raw/           # Raw call / dependency graphs
├── codeflow_processed/     # Structured code flows
├── function_extraction/    # Node / edge JSONs
└── code_graph_data/        # Graph representations
```
All outputs are:
- Reproducible
- Indexable
- Reusable
```
tests/
├── extractor / flow / parser
├── llm
├── retrieval
└── temp
```
- Each pipeline stage tested independently
- No redundant upstream validation
- Clear separation of responsibilities
- `generate_and_save_*` names explicitly declare side effects
- JSON is a hard system contract
- Pipelines over single functions
- Structure over models