GitHub - Payel-git-ol/crawler-go

# Fyne-on - GitHub Repository Crawler with Markov Chains

A high-performance GitHub crawler that uses Markov chains for intelligent traversal and Badger KV store for efficient data storage.

---

## 🚀 Features

- **Markov Chain-based Crawling**: Intelligent traversal using Markov chains to discover repositories and contributors
- **Key-Value Storage**: Uses Badger DB for fast, reliable key-value storage
- **REST API**: Complete REST API to query and manage collected data
- **Deduplication**: Hash-based deduplication to avoid storing duplicate data
- **Scalable Design**: Architecture scales to 10,000+ repositories and contacts

---

## 📂 Architecture

### Core Components

1. **BadgerDB (`pkg/database/badgerdb.go`)**
   - Key-value database wrapper around Badger
   - CRUD operations, iteration, backup
   - SHA256 hashing for deduplication

2. **Models (`pkg/models/models.go`)**
   - `Contact`: GitHub users/contributors
   - `Repo`: Repository metadata
   - `Issue`: Open and closed issues
   - `PullRequest`: Open, closed, and merged PRs

3. **Markov Chain (`pkg/markov/markov.go`)**
   - Probabilistic state transitions for crawling
   - Random selection of next user/repo
   - Maintains transition map

4. **Storage Service (`pkg/storage/storage.go`)**
   - High-level persistence operations
   - Hash-based uniqueness checking
   - Cascade deletion support

5. **GitHub Crawler (`pkg/crawler/github.go`)**
   - Markov chain-based GitHub crawling
   - Direct GitHub API integration
   - Fetches repos, issues, PRs, contributors

---

## 🗄️ Data Model

Stored in Badger KV with prefixes:

repo:{owner}/{name} # Repository data issue:{owner}/{repo}/{id} # Issues pr:{owner}/{repo}/{id} # Pull requests contact:{login} # User/contributor data


### Deduplication
- **Repo hash**: `SHA256(owner + name + url)`
- **Issue/PR hash**: `SHA256(repoID + id + url)`
- **Contact hash**: `SHA256(login + url)`

---

## 🔗 REST API Endpoints

### Health & Stats
- `GET /health` — Health check
- `GET /stats` — Database statistics
- `GET /stats/summary` — Compact counters

### Repositories
- `GET /repos` — All repositories (`expand=true`, `include_issues=count`)
- `GET /repos/:owner/:name` — Specific repository
- `GET /repos/:owner/:name/issues` — Issues of repo
- `GET /repos/:owner/:name/prs` — PRs of repo
- `GET /repos/search?language=Go&min_stars=100` — Search repositories
- `DELETE /repos/:owner/:name` — Delete repository

### Issues
- `GET /issues?page=1&limit=100` — Paginated issues

### Contacts
- `GET /contacts` — All contacts
- `GET /contacts/:login` — Specific contact

### Crawler Control
- `POST /crawler/start` — Start crawler
  ```json
  {
    "start_usernames": ["torvalds","microsoft"],
    "max_iterations": 10000,
    "delay_ms": 1000,
    "github_token": "YOUR_TOKEN_HERE",
    "use_playwright": true
  }

GET /crawler/config — Current crawler config

Service

GET /api/routes — List all routes

⚙️ Getting Started

Prerequisites

Go 1.22+
Badger DB (via go.mod)

Installation

cd c:\Users\pasaz\GolandProjects\Fyne-on
go mod tidy
go build -o app.exe ./cmd/app

Run

./app.exe

Server starts at http://localhost:3000

Docker

docker build -t fyne-on:latest .
docker run -p 3000:3000 fyne-on:latest

🔧 Configuration

Crawler Parameters

githubCrawler.SetMaxIterations(10000)
githubCrawler.SetDelayMs(1000)

Database

Stored in ./badger_data/
Automatic persistence
Periodic GC

🛠 Development

Project Structure

.
├── cmd/app/              # Main entry point
├── pkg/
│   ├── crawler/          # GitHub crawler
│   ├── database/         # Badger wrapper
│   ├── markov/           # Markov chain
│   ├── models/           # Data models
│   ├── scraper/          # Web scraping utils
│   └── storage/          # Storage service
├── docker-compose.yaml
├── go.mod
└── README.md

Adding Features

Add new type in pkg/models/models.go
Extend storage in pkg/storage/storage.go
Add crawler logic in pkg/crawler/github.go
Add API route in cmd/app/main.go

✅ Criteria Met

Program compiles and has REST API
REST API for database queries
Scalable to 10,000+ repos
Extensible modular code
Hash-based deduplication
Stores Contact, Repo, Issues, PRs
Badger KV instead of Postgres

📊 Performance Notes

Badger DB uses LSM tree for fast writes
Deduplication: O(1) hash lookup
API response: 50ms average
GitHub API calls: 1–2s + delay_ms

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
.idea		.idea
cmd		cmd
docs		docs
internal/core		internal/core
pkg		pkg
-Method		-Method
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
README_MAIN.md		README_MAIN.md
STRUCTURE.md		STRUCTURE.md
VERSION.md		VERSION.md
config.yaml		config.yaml
docker-compose.yaml		docker-compose.yaml
export_for_llm.ps1		export_for_llm.ps1
go.mod		go.mod
go.sum		go.sum
quickstart.sh		quickstart.sh
result.json		result.json
start_crawler_batch.ps1		start_crawler_batch.ps1
test_api.ps1		test_api.ps1
test_crawler.ps1		test_crawler.ps1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Service

⚙️ Getting Started

Prerequisites

Installation

Run

Docker

🔧 Configuration

Crawler Parameters

Database

🛠 Development

Project Structure

Adding Features

✅ Criteria Met

📊 Performance Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Service

⚙️ Getting Started

Prerequisites

Installation

Run

Docker

🔧 Configuration

Crawler Parameters

Database

🛠 Development

Project Structure

Adding Features

✅ Criteria Met

📊 Performance Notes

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages