CodeWhisperer

An Automated Research Pipeline from GitHub Repositories to Structured Semantic Indexes


0. One-Sentence Goal

Automatically transform any GitHub code repository into structured semantic assets indexed by research questions and grounded in code structure,

providing reproducible, computable, and extensible intermediate representations for downstream graph search, subgraph matching, rule discovery, and LLM-based reasoning systems.


1. What Problem Does This Project Solve?

When facing a large and unfamiliar codebase, common approaches are:

  • Manually reading the README
  • Searching keywords across files
  • Directly feeding code snippets to an LLM

These approaches suffer from fundamental limitations:

  • Large repositories exceed controllable context limits
  • Lack of structure prevents systematic search and reasoning
  • LLM outputs are ephemeral and non-reproducible
  • Pure text-based retrieval ignores structural constraints

The goal of this project is not to “make an LLM understand code”.

Instead:

**First transform the repository into structured, indexable semantic and structural assets, then let LLMs operate within a constrained and well-defined structure.**


2. Core Design Philosophy (Structure-first)

This project follows a clear guiding principle:

Structure First, LLM Second

  • Code structure (files, functions, calls, code flows) is generated by deterministic programs
  • LLMs operate only within explicitly constrained structural spaces
  • All critical intermediate results are persisted as JSON / DOT artifacts, enabling reproducibility and debugging

3. End-to-End Pipeline (Core Workflow)

```
GitHub Repository
    ↓
Repository Structure JSON (directory hierarchy)
    ↓
Flattened File Path List
    ↓
README Summary (LLM, semantic anchor)
    ↓
Technical Research Question Generation (LLM, JSON)
    ↓
Question → File Path Mapping (JSON)
    ↓
Question → File → Code Node Mapping (structured)
    ↓
(Downstream: Graph Search / Subgraph Matching / Reasoning)
```

The final output is not a natural-language answer,

but a collection of reusable structured semantic assets.


4. Key Stages and Design Motivation

Stage 1: Repository Structure → LLM-Readable Input

  • Convert nested directory structures into a flattened list of file paths
  • Explicitly constrain the LLM to select only from real, existing files

This design:

  • Prevents hallucination
  • Reduces cognitive load
  • Restricts free-form generation into controlled selection

👉 This is a system-level input constraint layer, not a simple utility.
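A minimal sketch of the flattening step, assuming the repository structure JSON is a nested dict where directories map to sub-dicts and files map to `None` (the exact schema may differ): it turns the hierarchy into the flat path list the LLM is allowed to select from.

```python
def flatten_paths(tree, prefix=""):
    """Yield 'dir/sub/file.py' strings from a nested {name: subtree | None} dict."""
    for name, subtree in tree.items():
        path = f"{prefix}{name}"
        if subtree is None:          # leaf -> a real, existing file
            yield path
        else:                        # directory -> recurse into it
            yield from flatten_paths(subtree, prefix=f"{path}/")

repo_tree = {
    "src": {"services": {"llm": {"chatgpt_client.py": None}}},
    "README.md": None,
}
print(sorted(flatten_paths(repo_tree)))
# ['README.md', 'src/services/llm/chatgpt_client.py']
```

Because the LLM later chooses only from this list, every selected path is guaranteed to exist in the repository.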


Stage 2: README → Semantic Anchor

In this system, README summarization is not cosmetic.

It provides:

  • High-level semantic compression of project intent
  • Architectural and technical signal extraction
  • The only global context for downstream question generation

Design principle:

  • README = global semantics
  • File list = local facts

Stage 3: Technical Question Generation (Cognitive Core)

Question generation is not about “asking a few questions”.

Its purpose is:

To elevate a code repository from a code collection into a research object

Each question is:

  • Hierarchical: architecture → subsystems → algorithms → implementation
  • Executable: explicitly mapped to real file paths
  • Structured: enforced JSON output, no free text

Prompt constraints ensure:

  • No hallucinated files
  • Bounded question counts
  • Emphasis on research value over generic queries
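These constraints can also be enforced mechanically after the LLM responds. A hedged sketch, where the `question`/`files` field names are illustrative rather than the project's exact schema:

```python
import json

def validate_questions(raw_json, real_files, max_questions=10):
    """Enforce the prompt constraints after the fact: parse the LLM's JSON,
    reject hallucinated file paths, and bound the question count."""
    questions = json.loads(raw_json)        # system boundary: string -> object
    if len(questions) > max_questions:
        raise ValueError("question count exceeds the prompt bound")
    for q in questions:
        missing = [p for p in q["files"] if p not in real_files]
        if missing:
            raise ValueError(f"hallucinated file paths: {missing}")
    return questions

real_files = {"src/services/llm/chatgpt_client.py"}
raw = ('[{"question": "How is the LLM client structured?",'
       ' "files": ["src/services/llm/chatgpt_client.py"]}]')
print(len(validate_questions(raw, real_files)))  # 1
```

Validation at this boundary means a malformed or hallucinating response fails loudly instead of corrupting downstream mappings.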

Stage 4: Question → File → Code Node (Structural Grounding)

This stage bridges:

Natural-language research questions → the code structure world

By mapping:

  • Question → file paths
  • File paths → code nodes (functions / directory nodes)

Three layers are aligned:

  1. Semantic layer (Question)
  2. File layer (Repository paths)
  3. Structural layer (Code / Graph nodes)

Path suffix matching strategy:

```
file_parts == node_parts[-len(file_parts):]
```

This design:

  • Avoids absolute path dependency
  • Is robust to repository root changes
  • Preserves localization accuracy

It lays the foundation for graph construction, subgraph search, path reasoning, and rule abstraction.
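The suffix-matching rule above is directly runnable. A self-contained sketch:

```python
def node_matches_file(file_path, node_path):
    """Suffix match: a code node located at node_path corresponds to
    file_path if the file's path segments are the tail of the node's.
    This avoids depending on where the repository root happens to live."""
    file_parts = file_path.strip("/").split("/")
    node_parts = node_path.strip("/").split("/")
    return file_parts == node_parts[-len(file_parts):]

print(node_matches_file("src/llm/client.py", "/home/user/repo/src/llm/client.py"))  # True
print(node_matches_file("src/llm/client.py", "/home/user/repo/other/client.py"))    # False
```

Note that matching whole path segments (rather than raw string suffixes) prevents `my_client.py` from accidentally matching `client.py`.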


5. JSON as a Hard System Contract

In this project, JSON is not merely an output format—it is a system contract:

  • LLM output: JSON string
  • System boundary: json.loads() → JSON object
  • Deterministic processing begins here

Confusing strings with objects will cause downstream failures in mapping, graph construction, and indexing.
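A minimal illustration of that boundary (the payload shape is illustrative, not the project's exact schema):

```python
import json

raw = '{"questions": [{"id": 1, "files": ["src/main.py"]}]}'  # LLM output: a *string*
assert isinstance(raw, str)          # before the boundary: opaque text

data = json.loads(raw)               # the system boundary: string -> object
assert isinstance(data, dict)        # after the boundary: a navigable structure

print(data["questions"][0]["files"][0])  # src/main.py
```

Everything before `json.loads()` is untrusted text; everything after it is a typed object that deterministic code can index, map, and persist.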


6. Project Positioning (Critical)

This is not an LLM tool.

It is not a simple code analysis script.

It is a:

Research engine that transforms code → semantics → structure → reasoning

LLMs are controlled components;

structure and data flow are the core of the system.


7. Current Status and Future Directions

Current Status

  • Core pipeline fully implemented and validated
  • Tested on multiple real-world repositories
  • Intermediate artifacts are stable and reproducible

Future Directions

  • Graph and subgraph matching
  • Path-level similarity computation
  • Rule abstraction and transfer
  • Integration with vector indexes and graph databases

8. One-Sentence Summary

CodeWhisperer is a system that automatically converts GitHub repositories into structured semantic indexes grounded in code structure, enabling long-term, evolvable reasoning over complex codebases.



Appendix A: Architecture & Implementation Details

This appendix documents the engineering structure, module responsibilities,

and output artifacts of CodeWhisperer.


A.1 Overall Project Structure

```
backend/
├── data/
│   └── git_repo/            # Real repositories for experiments
├── src/
│   ├── services/
│   │   ├── data_io/
│   │   ├── file_processor/
│   │   ├── git_repo_processor/
│   │   ├── llm/
│   │   ├── prompt/
│   │   └── retrieval/
│   └── output/              # All structured artifacts
└── tests/
```

A.2 Core Modules

A.2.1 data_io

```
data_io/
├── file_reader.py
└── file_writer.py
```

  • Unified file I/O abstraction
  • Clear JSON vs. text boundaries
  • All intermediate artifacts persist through this layer

A.2.2 file_processor (Structural Engine Core)

```
file_processor/
├── extractor/
│   ├── code_func_finder.py
│   └── code_edge_extractor.py
├── parser/
│   └── dot_file_parser.py
├── flow/
│   └── code_flow_analyzer.py
├── readme/
│   └── markdown_processor.py
└── utils/
```

Responsibilities:

  • Function node extraction
  • Call / dependency edge extraction
  • DOT graph parsing
  • Raw → processed code flow construction
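A hypothetical sketch of the DOT-parsing step (the real `dot_file_parser.py` may work differently): extract caller → callee edges from a call graph with a regular expression.

```python
import re

# Matches edges of the form "a" -> "b" in a DOT call graph.
EDGE_RE = re.compile(r'"([^"]+)"\s*->\s*"([^"]+)"')

dot = '''digraph calls {
  "main" -> "load_repo";
  "load_repo" -> "clone";
}'''

edges = EDGE_RE.findall(dot)
print(edges)  # [('main', 'load_repo'), ('load_repo', 'clone')]
```

The extracted (caller, callee) pairs are exactly the edge JSONs that later feed graph construction; a full DOT parser would also handle unquoted identifiers and attributes.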

A.2.3 git_repo_processor

```
git_repo_processor/
└── git_repo_cloner/
```

  • Repository cloning
  • Extensible to branch / commit / cache control

A.2.4 llm

```
llm/
├── chatgpt_client.py
└── llm_coder_handler.py
```

  • Unified LLM invocation layer
  • Long-context segmentation support
  • Prompt and model decoupling
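Long-context segmentation can be sketched as a simple overlapping chunker (illustrative only; `llm_coder_handler.py` may use a different strategy):

```python
def segment(text, max_chars=2000, overlap=200):
    """Split text into chunks of at most max_chars, where consecutive
    chunks share `overlap` characters so no context is cut mid-thought."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
        start += max_chars - overlap
    return chunks
```

Each chunk is then sent to the model independently, and the overlap keeps boundary content visible in two adjacent calls.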

A.2.5 prompt

```
prompt/
├── system_prompt/
└── user_prompt/
```

  • System prompts: role and output constraints
  • User prompts: dynamic structural injection
  • Supports prompt iteration and A/B testing

A.2.6 retrieval

```
retrieval/
├── readme_summarizer.py
├── technical_question_mapper.py
├── query_with_related_code_builder.py
└── repo_analysis_pipeline.py
```

  • README semantic anchoring
  • Question → file mapping
  • Query construction with structural context

A.3 Output Artifacts (Core Assets)

```
output/
├── codeflow_raw/            # Raw call / dependency graphs
├── codeflow_processed/      # Structured code flows
├── function_extraction/     # Node / edge JSONs
└── code_graph_data/         # Graph representations
```

All outputs are:

  • Reproducible
  • Indexable
  • Reusable

A.4 Testing Principles

```
tests/
├── extractor / flow / parser
├── llm
├── retrieval
└── temp
```

  • Each pipeline stage tested independently
  • No redundant upstream validation
  • Clear separation of responsibilities

A.5 Engineering Philosophy

  • generate_and_save_* names explicitly declare side effects
  • JSON is a hard system contract
  • Pipelines over single functions
  • Structure over models
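The side-effect naming convention might look like this in practice (illustrative, not the project's actual code):

```python
import json
import os
import tempfile

def generate_and_save_file_list(tree, out_path):
    """The generate_and_save_ prefix declares the side effect up front:
    this function both computes the artifact and persists it as JSON."""
    files = sorted(tree)                     # stand-in for real generation logic
    with open(out_path, "w") as f:
        json.dump(files, f, indent=2)        # persisted artifact (the side effect)
    return files                             # also returned for in-memory use

out = os.path.join(tempfile.mkdtemp(), "file_list.json")
result = generate_and_save_file_list({"b.py": None, "a.py": None}, out)
print(result)  # ['a.py', 'b.py']
```

A caller reading only the name knows that invoking it writes an artifact to disk, which keeps the pipeline's data flow auditable.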

About

Structure-first pipeline that transforms GitHub repositories into question-indexed, graph-grounded semantic assets for code search, Graph-RAG, and LLM-based reasoning.
