Skip to content

Payel-git-ol/crawler-go

Repository files navigation

# Fyne-on - GitHub Repository Crawler with Markov Chains

A high-performance GitHub crawler that uses Markov chains for intelligent traversal and Badger KV store for efficient data storage.

---

## 🚀 Features

- **Markov Chain-based Crawling**: Intelligent traversal using Markov chains to discover repositories and contributors
- **Key-Value Storage**: Uses Badger DB for fast, reliable key-value storage
- **REST API**: Complete REST API to query and manage collected data
- **Deduplication**: Hash-based deduplication to avoid storing duplicate data
- **Scalable Design**: Architecture scales to 10,000+ repositories and contacts

---

## 📂 Architecture

### Core Components

1. **BadgerDB (`pkg/database/badgerdb.go`)**
   - Key-value database wrapper around Badger
   - CRUD operations, iteration, backup
   - SHA256 hashing for deduplication

2. **Models (`pkg/models/models.go`)**
   - `Contact`: GitHub users/contributors
   - `Repo`: Repository metadata
   - `Issue`: Open and closed issues
   - `PullRequest`: Open, closed, and merged PRs

3. **Markov Chain (`pkg/markov/markov.go`)**
   - Probabilistic state transitions for crawling
   - Random selection of next user/repo
   - Maintains transition map

4. **Storage Service (`pkg/storage/storage.go`)**
   - High-level persistence operations
   - Hash-based uniqueness checking
   - Cascade deletion support

5. **GitHub Crawler (`pkg/crawler/github.go`)**
   - Markov chain-based GitHub crawling
   - Direct GitHub API integration
   - Fetches repos, issues, PRs, contributors

---

## 🗄️ Data Model

Stored in Badger KV with prefixes:

repo:{owner}/{name} # Repository data issue:{owner}/{repo}/{id} # Issues pr:{owner}/{repo}/{id} # Pull requests contact:{login} # User/contributor data


### Deduplication
- **Repo hash**: `SHA256(owner + name + url)`
- **Issue/PR hash**: `SHA256(repoID + id + url)`
- **Contact hash**: `SHA256(login + url)`

---

## 🔗 REST API Endpoints

### Health & Stats
- `GET /health` — Health check
- `GET /stats` — Database statistics
- `GET /stats/summary` — Compact counters

### Repositories
- `GET /repos` — All repositories (`expand=true`, `include_issues=count`)
- `GET /repos/:owner/:name` — Specific repository
- `GET /repos/:owner/:name/issues` — Issues of repo
- `GET /repos/:owner/:name/prs` — PRs of repo
- `GET /repos/search?language=Go&min_stars=100` — Search repositories
- `DELETE /repos/:owner/:name` — Delete repository

### Issues
- `GET /issues?page=1&limit=100` — Paginated issues

### Contacts
- `GET /contacts` — All contacts
- `GET /contacts/:login` — Specific contact

### Crawler Control
- `POST /crawler/start` — Start crawler
  ```json
  {
    "start_usernames": ["torvalds","microsoft"],
    "max_iterations": 10000,
    "delay_ms": 1000,
    "github_token": "YOUR_TOKEN_HERE",
    "use_playwright": true
  }
  • GET /crawler/config — Current crawler config

Service

  • GET /api/routes — List all routes

⚙️ Getting Started

Prerequisites

  • Go 1.22+
  • Badger DB (via go.mod)

Installation

cd c:\Users\pasaz\GolandProjects\Fyne-on
go mod tidy
go build -o app.exe ./cmd/app

Run

./app.exe

Server starts at http://localhost:3000

Docker

docker build -t fyne-on:latest .
docker run -p 3000:3000 fyne-on:latest

🔧 Configuration

Crawler Parameters

githubCrawler.SetMaxIterations(10000)
githubCrawler.SetDelayMs(1000)

Database

  • Stored in ./badger_data/
  • Automatic persistence
  • Periodic GC

🛠 Development

Project Structure

.
├── cmd/app/              # Main entry point
├── pkg/
│   ├── crawler/          # GitHub crawler
│   ├── database/         # Badger wrapper
│   ├── markov/           # Markov chain
│   ├── models/           # Data models
│   ├── scraper/          # Web scraping utils
│   └── storage/          # Storage service
├── docker-compose.yaml
├── go.mod
└── README.md

Adding Features

  1. Add new type in pkg/models/models.go
  2. Extend storage in pkg/storage/storage.go
  3. Add crawler logic in pkg/crawler/github.go
  4. Add API route in cmd/app/main.go

✅ Criteria Met

  • Program compiles and has REST API
  • REST API for database queries
  • Scalable to 10,000+ repos
  • Extensible modular code
  • Hash-based deduplication
  • Stores Contact, Repo, Issues, PRs
  • Badger KV instead of Postgres

📊 Performance Notes

  • Badger DB uses LSM tree for fast writes
  • Deduplication: O(1) hash lookup
  • API response: 50ms average
  • GitHub API calls: 1–2s + delay_ms

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors