This guide explains how to automatically download GitHub repositories with open licenses for code training using the repository downloader scripts.
The repository downloader allows you to automatically find and clone GitHub repositories based on:
- Categories: Neovim configs, Lua repos, Bash scripts, Zsh configs, Python repos, ethical hacking tools, security tools, and all open-license repos
- Languages: Python, JavaScript, Go, Rust, and 15+ more
- Licenses: MIT, Apache, BSD, GPL, and other open source licenses
- Quality: Filter by minimum stars (popularity)
- Size Limits: Automatic stopping when reaching storage limits (default: 1 TB)
There are two scripts available:
- `download_all_repos.py` - Convenience script to download all common categories at once
- `download_repos.py` - Full-featured script with all options and flexibility
The easiest way to download all repository categories:
```bash
python3 download_all_repos.py
```

This will download:
- 📦 Neovim configurations and plugins
- 📦 Lua programming repositories
- 📦 Bash/shell script repositories
- 📦 Zsh configuration and plugins
- 📦 Python programming repositories
- 📦 Ethical hacking and cybersecurity tools
Default settings:
- Max repos per category: 50
- Min stars: 100
- Output directory: `data/repos`
- Size limit: 1 TB (1024 GB)
- Shallow clones (faster, less disk space)
Download specific categories:

```bash
python3 download_repos.py --categories nvim lua bash zsh python hacking --max-repos 50
```

Download repositories with any open license (any language):

```bash
python3 download_repos.py --categories all-open --max-repos 1000 --max-size 1024.0
```

Download repositories in a single language:

```bash
python3 download_repos.py --language python --max-repos 100
```

No additional dependencies are required. The script uses:

- The Python standard library (`urllib`, `json`, `subprocess`)
- `tqdm` (already in requirements.txt)
- `git` (should be installed on your system)
Neovim configuration files and plugins written in Lua.
```bash
python3 download_repos.py --categories nvim --max-repos 100
```

What it searches for:
- `neovim OR nvim-config OR neovim-config`
- MIT-licensed repositories (default)
- 100+ stars minimum (default)
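The category searches above translate into GitHub Search API queries. A minimal sketch of how such a request URL could be built; the helper name and exact query construction are illustrative, not necessarily what `download_repos.py` does internally:

```python
from urllib.parse import urlencode

def build_search_url(query, license_id="mit", min_stars=100, per_page=50):
    """Build a GitHub repository-search URL (hypothetical helper).

    License and star filters are appended as GitHub search qualifiers
    (license:, stars:>=) after the free-text part of the query.
    """
    q = f"{query} license:{license_id} stars:>={min_stars}"
    params = urlencode({"q": q, "sort": "stars", "order": "desc", "per_page": per_page})
    return f"https://api.github.com/search/repositories?{params}"

print(build_search_url("neovim OR nvim-config OR neovim-config"))
```

The same helper covers the other categories by swapping the query string (e.g. `zsh-config OR oh-my-zsh OR zsh-plugin`) or using a `language:` qualifier instead.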
Lua programming language repositories.
```bash
python3 download_repos.py --categories lua --max-repos 50
```

What it searches for:
- Language: Lua
- MIT licensed repositories (default)
- 100+ stars minimum (default)
Bash and shell script repositories.
```bash
python3 download_repos.py --categories bash --max-repos 50
```

What it searches for:
- Language: Shell
- MIT licensed repositories (default)
- 100+ stars minimum (default)
Zsh configuration files and plugins (Oh My Zsh, etc.).
```bash
python3 download_repos.py --categories zsh --max-repos 50
```

What it searches for:
- `zsh-config OR oh-my-zsh OR zsh-plugin`
- MIT-licensed repositories (default)
- 100+ stars minimum (default)
Python programming language repositories.
```bash
python3 download_repos.py --categories python --max-repos 100
```

What it searches for:
- Language: Python
- MIT licensed repositories (default)
- 100+ stars minimum (default)
Ethical hacking and cybersecurity tools.
```bash
python3 download_repos.py --categories hacking --max-repos 100
```

What it searches for:
- `ethical-hacking OR cybersecurity OR penetration-testing OR security-tools OR red-team`
- MIT-licensed repositories (default)
- 100+ stars minimum (default)
General security and cybersecurity repositories.
```bash
python3 download_repos.py --categories security --max-repos 50
```

What it searches for:
- `security-tools OR cybersecurity OR penetration-testing OR red-team OR blue-team`
- MIT-licensed repositories (default)
- 100+ stars minimum (default)
All repositories with open licenses, any language. This is useful for downloading a diverse set of repositories.
```bash
python3 download_repos.py --categories all-open --max-repos 1000 --max-size 1024.0
```

What it searches for:
- Any open-license repository (no language filter)
- No specific license filter (searches all open licenses)
- 100+ stars minimum (default)
Note: This category searches broadly and may return repositories with various licenses. You can still specify `--license` to filter to a specific license type.
```bash
python3 download_repos.py [OPTIONS]
```

Options:

- `--output DIR` - Output directory (default: `data/repos`)
- `--categories CAT1 CAT2 ...` - Categories to download: `nvim`, `lua`, `bash`, `zsh`, `python`, `hacking`, `security`, `all-open`
- `--language LANG` - Single language to filter by
- `--languages LANG1 LANG2 ...` - Multiple languages to download
- `--license LICENSE` - License type (default: `mit`)
- `--min-stars N` - Minimum stars (default: 100)
- `--max-repos N` - Maximum repos per category/language (default: 50)
- `--max-size N` - Maximum total size in GB (stops downloading when reached, e.g., `1024.0` for 1 TB)
- `--full-clone` - Do a full clone instead of a shallow one (slower but includes full history)
```bash
python3 download_all_repos.py [OPTIONS]
```

Options:

- `--max-repos N` - Maximum repos per category (default: 50)
- `--min-stars N` - Minimum stars (default: 100)
- `--output DIR` - Output directory (default: `data/repos`)
- `--max-size N` - Maximum total size in GB (default: 1024.0 = 1 TB)
- `--full-clone` - Do a full clone instead of a shallow one
Example:
```bash
python3 download_all_repos.py --max-repos 100 --min-stars 200 --max-size 2048.0
```

Supported licenses:

- `mit` (default)
- `apache-2.0`
- `bsd-3-clause`
- `bsd-2-clause`
- `isc`
- `unlicense`
- `mpl-2.0`
- `lgpl-2.1`
- `lgpl-3.0`
- `gpl-2.0`
- `gpl-3.0`

Supported languages: `python`, `javascript`, `typescript`, `java`, `cpp`, `c`, `go`, `rust`, `ruby`, `php`, `swift`, `kotlin`, `scala`, `r`, `sql`, `lua`, `shell` (for bash/shell scripts)
```bash
python3 download_all_repos.py
```

Downloads all categories (nvim, lua, bash, zsh, python, hacking) with default settings and a 1 TB size limit.
```bash
python3 download_all_repos.py --max-repos 100 --min-stars 200 --max-size 2048.0
```

Downloads all categories with:
- 100 repos per category
- Minimum 200 stars
- 2 TB size limit
```bash
python3 download_repos.py --categories nvim lua bash zsh python hacking --max-repos 50
```

Downloads specific categories with 50 repos each.

```bash
python3 download_repos.py --categories all-open --max-repos 1000 --max-size 1024.0
```

Downloads up to 1000 repositories with any open license, stopping at 1 TB.

```bash
python3 download_repos.py --categories nvim lua bash zsh python hacking --min-stars 1000 --max-repos 20
```

Downloads only highly popular repositories (1000+ stars).

```bash
python3 download_repos.py --languages python javascript go rust --max-repos 50
```

Downloads repositories in multiple programming languages.

```bash
python3 download_repos.py --categories nvim --license apache-2.0 --max-repos 50
```

Downloads Neovim repos with the Apache 2.0 license.

```bash
python3 download_repos.py --categories nvim lua bash zsh python hacking --output /path/to/repos
```

Saves repositories to a custom directory.

```bash
python3 download_repos.py --categories nvim --full-clone --max-repos 10
```

Does a full clone including complete git history (slower but more complete).

```bash
python3 download_repos.py --categories all-open --max-repos 2000 --max-size 512.0
```

Downloads repositories but stops when reaching 512 GB (0.5 TB).
The scripts include visual progress bars showing:
- Category progress: Overall progress across all categories
- Repository progress: Progress for each category
- Real-time statistics: Current repo, stars, language, cloned/failed counts
- Size tracking: Current total size and size limit (when `--max-size` is used)
Example output:

```text
📊 Current directory size: 45.23 GB
📊 Size limit: 1024.00 GB
📦 Processing 6 categories...
Category: nvim: 100%|████████████| 6/6 [15:23<00:00, Size=156.78 GB, Total Cloned=300, Total Failed=2]
Cloning nvim:  45%|█████▍      | 23/50 [02:15<03:45, Current=awesome-nvim, Stars=5.2k, Lang=Lua, Cloned=22, Failed=1, Size=12.45 GB]
```
Size limit reached:
When the size limit is reached, the script will stop downloading and show:
```text
⚠️ Size limit reached: 1024.00 GB >= 1024.00 GB
Stopping all downloads.
```
GitHub API has rate limits:
- Unauthenticated: 60 requests/hour
- Authenticated: 5,000 requests/hour
To increase rate limits, set a GitHub Personal Access Token:
```bash
export GITHUB_TOKEN=your_token_here
python3 download_repos.py --categories nvim lua bash hacking
```

How to create a token:
- Go to GitHub Settings → Developer settings → Personal access tokens
- Generate new token (classic)
- Select scope: `public_repo` (read-only is enough)
- Copy the token and set it as an environment variable
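Since the script relies only on the standard library, attaching the token to an API request comes down to setting the `Authorization` header. A minimal sketch with `urllib` (the helper name is illustrative; the actual script's code may differ):

```python
import os
import urllib.request

def make_request(url):
    """Build a GitHub API request, attaching GITHUB_TOKEN if it is set.

    The "token <value>" scheme is GitHub's documented header format for
    personal access tokens; unauthenticated requests simply omit it.
    """
    req = urllib.request.Request(url)
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        req.add_header("Authorization", f"token {token}")
    req.add_header("Accept", "application/vnd.github+json")
    return req

req = make_request("https://api.github.com/search/repositories?q=neovim")
```

With the header present, the rate limit rises from 60 to 5,000 requests per hour, as described above.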
The repository downloader includes automatic size limit checking to prevent running out of disk space.
- Default limit: 1 TB (1024 GB) for `download_all_repos.py`
- Customizable: Use `--max-size` to set any limit
- Real-time tracking: Size is checked before each repository clone
- Automatic stopping: Downloads stop when limit is reached
- Progress display: Current size shown in progress bars
With `download_all_repos.py`:

```bash
# Default 1 TB
python3 download_all_repos.py

# Custom limit (2 TB)
python3 download_all_repos.py --max-size 2048.0

# Smaller limit (500 GB)
python3 download_all_repos.py --max-size 512.0
```

With `download_repos.py`:

```bash
# No limit (downloads until max-repos reached)
python3 download_repos.py --categories nvim --max-repos 100

# With 1 TB limit
python3 download_repos.py --categories nvim --max-repos 1000 --max-size 1024.0
```

The script calculates the total size by:
- Scanning all files in the output directory (`data/repos` by default)
- Summing file sizes recursively
- Checking before each new repository clone
- Displaying human-readable sizes (B, KB, MB, GB, TB)
Note: Size checking happens before cloning, so the actual size may be slightly less than the limit when stopping.
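The size check described above can be sketched in a few lines of standard-library Python (function names are illustrative, not necessarily those in the script):

```python
import os

def dir_size_bytes(path):
    """Recursively sum file sizes under path, as the size check does."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.isfile(fp):  # skip broken symlinks
                total += os.path.getsize(fp)
    return total

def human_size(n):
    """Format a byte count as B / KB / MB / GB / TB."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1024 or unit == "TB":
            return f"{n:.2f} {unit}"
        n /= 1024
```

A limit check then reduces to comparing `dir_size_bytes("data/repos")` against the configured `--max-size` (converted to bytes) before each clone.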
The scripts automatically:
- Skip existing repos: if a repository already exists, it is skipped (no re-download)
- Resume downloads: you can run the script multiple times safely
- Track progress: show what has already been downloaded
- Stay size-aware: account for existing repositories when checking size limits
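The resume behavior boils down to checking the destination directory before cloning. A rough sketch, assuming a helper like the following (hypothetical name; the script's internals may differ):

```python
import os
import subprocess

def clone_if_missing(url, dest, shallow=True):
    """Clone url into dest, skipping repos that already exist on disk.

    Returning "skipped" for existing directories is what makes reruns
    safe: previously downloaded repositories are never re-fetched.
    """
    if os.path.isdir(dest):
        return "skipped"
    cmd = ["git", "clone"]
    if shallow:
        cmd += ["--depth", "1"]  # shallow clone: latest snapshot only
    cmd += [url, dest]
    result = subprocess.run(cmd, capture_output=True)
    return "cloned" if result.returncode == 0 else "failed"
```

Because existing directories count toward `dir_size_bytes`-style scans of the output directory, the size limit also accounts for earlier runs.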
After downloading repositories, they're automatically processed during training:
```bash
# Download repos
python3 download_all_repos.py

# Train with all data (text + code)
python3 train.py --data data/ --config config.json --device cuda
```

The training script will:
- Process all your text data (Wiki, Books, Amazon reviews, etc.)
- Process all code repositories
- Combine everything into training data
The data processor automatically handles code files from repositories:
- Text files: `.txt`, `.md`, `.rst`, `.log`, `.csv`, `.json`, `.jsonl`, `.xml`, `.html`, `.htm`
- Code files: `.py`, `.js`, `.ts`, `.java`, `.cpp`, `.c`, `.go`, `.rs`, `.rb`, `.php`, `.swift`, `.lua`, `.sh`, and 30+ more
- PDF files: `.pdf` (if pdfplumber is installed)
- Images: `.png`, `.jpg`, etc. (if OCR is set up)
Error: Rate limit exceeded
Solution:
- Wait a few minutes and try again
- Use a GitHub token: `export GITHUB_TOKEN=your_token`
- Reduce `--max-repos` to download fewer repos per run
Error: Failed to clone repository
Possible causes:
- Repository was deleted or made private
- Network issues
- Repository is too large (timeout)
Solution:
- The script continues with other repos
- Failed repos are counted and reported at the end
- You can re-run the script to retry failed repos
Error: No repositories found
Possible causes:
- Search query too restrictive
- License filter too narrow
- Minimum stars too high
Solution:
- Lower the `--min-stars` threshold
- Try different `--license` options
- Check that the category name is correct
Test with a small number first:
```bash
python3 download_repos.py --categories nvim --max-repos 10
```

Always set a size limit to prevent running out of disk space:
```bash
# Recommended: 1 TB limit
python3 download_all_repos.py --max-size 1024.0

# Or a custom limit based on available space
python3 download_repos.py --categories all-open --max-size 512.0
```

Shallow clones are faster and use less disk space:
```bash
# Default (shallow clone)
python3 download_repos.py --categories nvim

# Full clone (only if you need history)
python3 download_repos.py --categories nvim --full-clone
```

Use `--min-stars` to get quality repositories:

```bash
python3 download_repos.py --categories nvim --min-stars 500 --max-repos 50
```

For large downloads, use a GitHub token:
```bash
export GITHUB_TOKEN=your_token_here
python3 download_all_repos.py --max-repos 100
```

Check available disk space before starting:

```bash
df -h data/repos
```

The `all-open` category downloads broadly. Consider:
- Setting a reasonable `--max-repos` limit
- Using `--min-stars` to filter for quality
- Setting `--max-size` to prevent excessive downloads

```bash
python3 download_repos.py --categories all-open --max-repos 500 --min-stars 200 --max-size 1024.0
```

- Default: 1 TB (1024 GB) for `download_all_repos.py`
- Recommended: Set based on available disk space
- Monitoring: The script shows current size vs. limit in progress bars
Shallow clones (default):
- Faster download
- Less disk space (~10-50% of full clone)
- No git history
- Good for training data
Full clones:
- Slower download
- More disk space (includes full history)
- Includes full git history
- Useful if you need version history
Typical sizes (shallow clones):
- Small repo: 1-10 MB
- Medium repo: 10-100 MB
- Large repo: 100 MB - 1 GB
- Very large repo: 1-10 GB+
Example: Downloading 300 repositories with shallow clones typically uses 5-30 GB, depending on repository sizes.
To estimate how many repositories you can download:
1. Check current size:

   ```bash
   du -sh data/repos
   ```

2. Calculate average repo size:
   - Small repos: ~5 MB average
   - Medium repos: ~50 MB average
   - Large repos: ~500 MB average

3. Estimate:
   - 100 small repos: ~500 MB
   - 100 medium repos: ~5 GB
   - 100 large repos: ~50 GB
   - 1000 mixed repos: ~50-200 GB

4. Set an appropriate limit:

   ```bash
   # For 1 TB available space, use a 900 GB limit (leave a buffer)
   python3 download_all_repos.py --max-size 900.0
   ```
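The estimation arithmetic above is easy to script; a rough capacity calculator, using the average sizes stated in step 2 as assumptions:

```python
def repos_that_fit(limit_gb, avg_repo_mb):
    """Estimate how many repos of a given average size fit under a size limit."""
    return int(limit_gb * 1024 // avg_repo_mb)

# With a 900 GB limit:
print(repos_that_fit(900, 50))   # medium repos (~50 MB each) -> 18432
print(repos_that_fit(900, 500))  # large repos (~500 MB each) -> 1843
```

These are upper bounds: real downloads mix repo sizes, so treat the result as a sanity check rather than a guarantee.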
The repository downloader makes it easy to:
- ✅ Automatically find high-quality open-source repositories
- ✅ Filter by category, language, license, and popularity
- ✅ Download with progress tracking and size monitoring
- ✅ Set size limits to prevent running out of disk space
- ✅ Integrate seamlessly with the training pipeline
- ✅ Resume interrupted downloads
Available categories:
- `nvim` - Neovim configurations and plugins
- `lua` - Lua programming repositories
- `bash` - Bash/shell script repositories
- `zsh` - Zsh configuration and plugins
- `python` - Python programming repositories
- `hacking` - Ethical hacking and cybersecurity tools
- `security` - Security and cybersecurity repositories
- `all-open` - All repositories with open licenses (any language)
Quick commands to get started:
```bash
# Download all categories with 1 TB limit (recommended)
python3 download_all_repos.py

# Download specific categories
python3 download_repos.py --categories nvim lua bash zsh python hacking --max-repos 50

# Download all open-license repos with size limit
python3 download_repos.py --categories all-open --max-repos 1000 --max-size 1024.0
```

This downloads the repositories and prepares them for training!