# Fyne-on - GitHub Repository Crawler with Markov Chains
A high-performance GitHub crawler that uses Markov chains for intelligent traversal and Badger KV store for efficient data storage.
---
## 🚀 Features
- **Markov Chain-based Crawling**: Intelligent traversal using Markov chains to discover repositories and contributors
- **Key-Value Storage**: Uses Badger DB for fast, reliable key-value storage
- **REST API**: Complete REST API to query and manage collected data
- **Deduplication**: Hash-based deduplication to avoid storing duplicate data
- **Scalable Design**: Architecture scales to 10,000+ repositories and contacts
---
## 📂 Architecture
### Core Components
1. **BadgerDB (`pkg/database/badgerdb.go`)**
- Key-value database wrapper around Badger
- CRUD operations, iteration, backup
- SHA256 hashing for deduplication
2. **Models (`pkg/models/models.go`)**
- `Contact`: GitHub users/contributors
- `Repo`: Repository metadata
- `Issue`: Open and closed issues
- `PullRequest`: Open, closed, and merged PRs
3. **Markov Chain (`pkg/markov/markov.go`)**
- Probabilistic state transitions for crawling
- Random selection of next user/repo
- Maintains transition map
4. **Storage Service (`pkg/storage/storage.go`)**
- High-level persistence operations
- Hash-based uniqueness checking
- Cascade deletion support
5. **GitHub Crawler (`pkg/crawler/github.go`)**
- Markov chain-based GitHub crawling
- Direct GitHub API integration
- Fetches repos, issues, PRs, contributors
---
## 🗄️ Data Model
Stored in Badger KV with prefixes:
repo:{owner}/{name} # Repository data issue:{owner}/{repo}/{id} # Issues pr:{owner}/{repo}/{id} # Pull requests contact:{login} # User/contributor data
### Deduplication
- **Repo hash**: `SHA256(owner + name + url)`
- **Issue/PR hash**: `SHA256(repoID + id + url)`
- **Contact hash**: `SHA256(login + url)`
---
## 🔗 REST API Endpoints
### Health & Stats
- `GET /health` — Health check
- `GET /stats` — Database statistics
- `GET /stats/summary` — Compact counters
### Repositories
- `GET /repos` — All repositories (`expand=true`, `include_issues=count`)
- `GET /repos/:owner/:name` — Specific repository
- `GET /repos/:owner/:name/issues` — Issues of repo
- `GET /repos/:owner/:name/prs` — PRs of repo
- `GET /repos/search?language=Go&min_stars=100` — Search repositories
- `DELETE /repos/:owner/:name` — Delete repository
### Issues
- `GET /issues?page=1&limit=100` — Paginated issues
### Contacts
- `GET /contacts` — All contacts
- `GET /contacts/:login` — Specific contact
### Crawler Control
- `POST /crawler/start` — Start crawler
```json
{
"start_usernames": ["torvalds","microsoft"],
"max_iterations": 10000,
"delay_ms": 1000,
"github_token": "YOUR_TOKEN_HERE",
"use_playwright": true
}
GET /crawler/config— Current crawler config
GET /api/routes— List all routes
- Go 1.22+
- Badger DB (via go.mod)
cd c:\Users\pasaz\GolandProjects\Fyne-on
go mod tidy
go build -o app.exe ./cmd/app./app.exeServer starts at http://localhost:3000
docker build -t fyne-on:latest .
docker run -p 3000:3000 fyne-on:latestgithubCrawler.SetMaxIterations(10000)
githubCrawler.SetDelayMs(1000)- Stored in
./badger_data/ - Automatic persistence
- Periodic GC
.
├── cmd/app/ # Main entry point
├── pkg/
│ ├── crawler/ # GitHub crawler
│ ├── database/ # Badger wrapper
│ ├── markov/ # Markov chain
│ ├── models/ # Data models
│ ├── scraper/ # Web scraping utils
│ └── storage/ # Storage service
├── docker-compose.yaml
├── go.mod
└── README.md
- Add new type in
pkg/models/models.go - Extend storage in
pkg/storage/storage.go - Add crawler logic in
pkg/crawler/github.go - Add API route in
cmd/app/main.go
- Program compiles and has REST API
- REST API for database queries
- Scalable to 10,000+ repos
- Extensible modular code
- Hash-based deduplication
- Stores Contact, Repo, Issues, PRs
- Badger KV instead of Postgres
- Badger DB uses LSM tree for fast writes
- Deduplication: O(1) hash lookup
- API response: 50ms average
- GitHub API calls: 1–2s + delay_ms