An Automated Research Pipeline from GitHub Repositories to Structured Semantic Indexes
Automatically transform any GitHub code repository into structured semantic assets indexed by research questions and grounded in code structure,
providing reproducible, computable, and extensible intermediate representations for downstream graph search, subgraph matching, rule discovery, and LLM-based reasoning systems.
When facing a large and unfamiliar codebase, common approaches are:
- Manually reading the README
- Searching keywords across files
- Directly feeding code snippets to an LLM
These approaches suffer from fundamental limitations:
- Large repositories exceed controllable context limits
- Lack of structure prevents systematic search and reasoning
- LLM outputs are ephemeral and non-reproducible
- Pure text-based retrieval ignores structural constraints
The goal of this project is not to “make an LLM understand code”.
Instead:
**First transform the repository into structured, indexable semantic and structural assets,
then let LLMs operate within a constrained and well-defined structure.**
This project follows a clear guiding principle:
Structure First, LLM Second
- Code structure (files, functions, calls, code flows) is generated by deterministic programs
- LLMs operate only within explicitly constrained structural spaces
- All critical intermediate results are persisted as JSON / DOT artifacts, enabling reproducibility and debugging
```
GitHub Repository
        ↓
Repository Structure JSON (directory hierarchy)
        ↓
Flattened File Path List
        ↓
README Summary (LLM, semantic anchor)
        ↓
Technical Research Question Generation (LLM, JSON)
        ↓
Question → File Path Mapping (JSON)
        ↓
Question → File → Code Node Mapping (structured)
        ↓
(Downstream: Graph Search / Subgraph Matching / Reasoning)
```
The final output is not a natural-language answer,
but a collection of reusable structured semantic assets.
- Convert nested directory structures into a flattened list of file paths
- Explicitly constrain the LLM to select only from real, existing files
This design:
- Prevents hallucination
- Reduces cognitive load
- Restricts free-form generation into controlled selection
👉 This is a system-level input constraint layer, not a simple utility.
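A minimal sketch of this flattening step; the nested-JSON field names (`name`, `type`, `children`) are illustrative assumptions, not the project's actual structure schema:

```python
from typing import Any

def flatten_repo_tree(node: dict[str, Any], prefix: str = "") -> list[str]:
    """Recursively flatten a nested directory-structure JSON into file paths.
    The field names ("name", "type", "children") are assumed for illustration."""
    path = f"{prefix}/{node['name']}" if prefix else node["name"]
    if node.get("type") == "file":
        return [path]
    paths: list[str] = []
    for child in node.get("children", []):
        paths.extend(flatten_repo_tree(child, path))
    return paths

# A tiny repository tree flattened into the list of real files the LLM may select from.
tree = {
    "name": "repo", "type": "dir", "children": [
        {"name": "README.md", "type": "file"},
        {"name": "src", "type": "dir", "children": [
            {"name": "main.py", "type": "file"},
        ]},
    ],
}
print(flatten_repo_tree(tree))  # ['repo/README.md', 'repo/src/main.py']
```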
In this system, README summarization is not cosmetic.
It provides:
- High-level semantic compression of project intent
- Architectural and technical signal extraction
- The only global context for downstream question generation
Design principle:
- README = global semantics
- File list = local facts
Question generation is not about “asking a few questions”.
Its purpose is:
To elevate a code repository from a code collection into a research object
Each question is:
- Hierarchical: architecture → subsystems → algorithms → implementation
- Executable: explicitly mapped to real file paths
- Structured: enforced JSON output, no free text
Prompt constraints ensure:
- No hallucinated files
- Bounded question counts
- Emphasis on research value over generic queries
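A minimal sketch of how these constraints can be checked downstream; the JSON shape and the bound of 20 questions are illustrative assumptions, not the project's actual schema:

```python
import json

# Hypothetical question-generation output; the real schema may differ.
raw = """
[
  {"question": "How are call / dependency edges extracted from source files?",
   "files": ["backend/src/services/file_processor/extractor/code_edge_extractor.py"]},
  {"question": "How does question generation stay anchored to the README summary?",
   "files": ["backend/src/services/retrieval/readme_summarizer.py"]}
]
"""

real_files = {
    "backend/src/services/file_processor/extractor/code_edge_extractor.py",
    "backend/src/services/retrieval/readme_summarizer.py",
}

questions = json.loads(raw)
assert len(questions) <= 20, "bounded question count (20 is an illustrative limit)"
for q in questions:
    # Every referenced path must come from the real, flattened file list.
    missing = [p for p in q["files"] if p not in real_files]
    assert not missing, f"hallucinated files: {missing}"
```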
This stage bridges:
Natural-language research questions → the code structure world
By mapping:
- Question → file paths
- File paths → code nodes (functions / directory nodes)
Three layers are aligned:
- Semantic layer (Question)
- File layer (Repository paths)
- Structural layer (Code / Graph nodes)
Path suffix matching strategy:
`file_parts == node_parts[-len(file_parts):]`
This design:
- Avoids absolute path dependency
- Is robust to repository root changes
- Preserves localization accuracy
It lays the foundation for graph construction, subgraph search, path reasoning, and rule abstraction.
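A minimal, self-contained sketch of this matching strategy (the helper name is illustrative):

```python
def match_file_to_node(file_path: str, node_path: str) -> bool:
    """Suffix-match a repository-relative file path against a code-node path,
    so absolute prefixes (e.g. a local clone directory) do not break mapping."""
    file_parts = file_path.strip("/").split("/")
    node_parts = node_path.strip("/").split("/")
    return file_parts == node_parts[-len(file_parts):]

# The node path carries a machine-specific clone prefix; the mapping still holds.
print(match_file_to_node(
    "src/services/llm/chatgpt_client.py",
    "/tmp/clones/backend/src/services/llm/chatgpt_client.py",
))  # True
```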
In this project, JSON is not merely an output format; it is a system contract:
- LLM output: JSON string
- System boundary: `json.loads()` → JSON object
- Deterministic processing begins here
Confusing strings with objects will cause downstream failures in mapping, graph construction, and indexing.
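A minimal illustration of that boundary (the payload shown is hypothetical):

```python
import json

llm_output = '{"question": "How are call edges extracted?", "files": ["src/a.py"]}'

# Wrong: indexing the raw string as if it were an object raises TypeError.
# llm_output["files"]

# Right: cross the boundary exactly once, then all downstream processing
# (mapping, graph construction, indexing) works on real Python objects.
payload = json.loads(llm_output)
assert isinstance(payload, dict)
print(payload["files"])  # ['src/a.py']
```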
This is not an LLM tool.
It is not a simple code analysis script.
It is a:
Research engine that transforms code → semantics → structure → reasoning
LLMs are controlled components;
structure and data flow are the core of the system.
- Core pipeline fully implemented and validated
- Tested on multiple real-world repositories
- Intermediate artifacts are stable and reproducible
- Graph and subgraph matching
- Path-level similarity computation
- Rule abstraction and transfer
- Integration with vector indexes and graph databases
CodeWhisperer is a system that automatically converts GitHub repositories into structured semantic indexes grounded in code structure, enabling long-term, evolvable reasoning over complex codebases.
This appendix documents the engineering structure, module responsibilities,
and output artifacts of CodeWhisperer.
```
backend/
├── data/
│   └── git_repo/              # Real repositories for experiments
├── src/
│   ├── services/
│   │   ├── data_io/
│   │   ├── file_processor/
│   │   ├── git_repo_processor/
│   │   ├── llm/
│   │   ├── prompt/
│   │   └── retrieval/
│   └── output/                # All structured artifacts
└── tests/
```
```
data_io/
├── file_reader.py
└── file_writer.py
```
- Unified file I/O abstraction
- Clear JSON vs. text boundaries
- All intermediate artifacts persist through this layer
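A minimal sketch of that boundary, with hypothetical helpers standing in for file_reader.py / file_writer.py (the artifact file name is illustrative):

```python
import json
from pathlib import Path

def write_json(path: str, obj) -> None:
    """Persist an intermediate artifact as JSON: objects in memory,
    strings on disk. Hypothetical stand-in for file_writer.py."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(obj, indent=2, ensure_ascii=False), encoding="utf-8")

def read_json(path: str):
    """Load an artifact back into Python objects. Stand-in for file_reader.py."""
    return json.loads(Path(path).read_text(encoding="utf-8"))

write_json("output/function_extraction/nodes.json", {"nodes": []})
print(read_json("output/function_extraction/nodes.json"))  # {'nodes': []}
```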
```
file_processor/
├── extractor/
│   ├── code_func_finder.py
│   └── code_edge_extractor.py
├── parser/
│   └── dot_file_parser.py
├── flow/
│   └── code_flow_analyzer.py
├── readme/
│   └── markdown_processor.py
└── utils/
```
Responsibilities:
- Function node extraction
- Call / dependency edge extraction
- DOT graph parsing
- Raw → processed code flow construction
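A minimal sketch of the raw-DOT-to-edges step, assuming pydot as the parser (dot_file_parser.py may use a different library, and the node / edge artifact shape shown here is an assumption):

```python
import pydot  # assumption: the project's parser may differ

# A tiny raw call graph in DOT form, of the kind stored under codeflow_raw/.
dot_text = """
digraph calls {
    main -> load_repo;
    load_repo -> clone_repository;
}
"""

graph = pydot.graph_from_dot_data(dot_text)[0]
edges = [(e.get_source(), e.get_destination()) for e in graph.get_edges()]

# Shape the result into a node / edge artifact; the persisted JSON format
# under function_extraction/ is an assumption here.
artifact = {
    "nodes": sorted({name for edge in edges for name in edge}),
    "edges": [{"caller": src, "callee": dst} for src, dst in edges],
}
print(artifact)
```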
```
git_repo_processor/
└── git_repo_cloner/
```
- Repository cloning
- Extensible to branch / commit / cache control
```
llm/
├── chatgpt_client.py
└── llm_coder_handler.py
```
- Unified LLM invocation layer
- Long-context segmentation support
- Prompt and model decoupling
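A minimal sketch of the segmentation idea, splitting on line boundaries under a rough character budget (a stand-in for real token counting; llm_coder_handler.py may segment differently):

```python
def split_into_chunks(text: str, max_chars: int = 8000) -> list[str]:
    """Segment a long input on line boundaries so each chunk stays under a
    rough character budget before it is sent to the LLM."""
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for line in text.splitlines(keepends=True):
        if size + len(line) > max_chars and current:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks

# Long source files are sent to the LLM chunk by chunk.
long_source = "\n".join(f"line {i}" for i in range(10_000))
print(len(split_into_chunks(long_source)))
```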
```
prompt/
├── system_prompt/
└── user_prompt/
```
- System prompts: role and output constraints
- User prompts: dynamic structural injection
- Supports prompt iteration and A/B testing
```
retrieval/
├── readme_summarizer.py
├── technical_question_mapper.py
├── query_with_related_code_builder.py
└── repo_analysis_pipeline.py
```
- README semantic anchoring
- Question → file mapping
- Query construction with structural context
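A minimal sketch of query construction with structural context; the function name and output format are illustrative, not the actual API of query_with_related_code_builder.py:

```python
def build_query_with_related_code(question: str, files_to_code: dict[str, str]) -> str:
    """Assemble a retrieval query: the research question followed by the
    source of each file it was mapped to."""
    parts = [f"Research question: {question}"]
    for path, code in files_to_code.items():
        parts.append(f"--- {path} ---\n{code}")
    return "\n\n".join(parts)

query = build_query_with_related_code(
    "How are call / dependency edges extracted?",
    {
        "backend/src/services/file_processor/extractor/code_edge_extractor.py":
            "def extract_edges(...): ...",
    },
)
print(query)
```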
```
output/
├── codeflow_raw/           # Raw call / dependency graphs
├── codeflow_processed/     # Structured code flows
├── function_extraction/    # Node / edge JSONs
└── code_graph_data/        # Graph representations
```
All outputs are:
- Reproducible
- Indexable
- Reusable
```
tests/
├── extractor / flow / parser
├── llm
├── retrieval
└── temp
```
- Each pipeline stage tested independently
- No redundant upstream validation
- Clear separation of responsibilities
- `generate_and_save_*` names explicitly declare side effects
- JSON is a hard system contract
- Pipelines over single functions
- Structure over models