Fine-tune the all-MiniLM-L6-v2 sentence transformer model on a dataset of markdown files.
```bash
python -m venv venv
source venv/bin/activate   # Linux/Mac
# or
venv\Scripts\activate      # Windows

pip install -r requirements.txt
```

Download ~20,000 markdown files from the GitHub code dataset:
```bash
python download_dataset.py --target-files 20000
```

Options:
- `--output-dir`: Output directory (default: `data/markdown`)
- `--target-files`: Number of files to download (default: 20000)
- `--no-streaming`: Disable streaming mode (not recommended)
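In streaming mode the script can filter records as they arrive instead of downloading the whole dataset first. A minimal sketch of that approach, with illustrative helper names and a dummy in-memory stream (the real script's internals may differ):

```python
def is_markdown_record(record):
    """Keep only records whose path has a markdown extension."""
    return record.get("path", "").endswith((".md", ".markdown"))

def collect_markdown(stream, target_files):
    """Take at most target_files markdown records from an iterable stream."""
    kept = []
    for record in stream:
        if is_markdown_record(record):
            kept.append(record)
            if len(kept) >= target_files:
                break  # stop early: no need to consume the rest of the stream
    return kept

# Dummy stand-in for a streamed dataset split
sample_stream = [
    {"path": "README.md", "content": "# hi"},
    {"path": "main.py", "content": "print(1)"},
    {"path": "docs/guide.markdown", "content": "guide"},
]
print(len(collect_markdown(sample_stream, 2)))  # 2 markdown files kept
```

The early `break` is why streaming is recommended: the loop stops as soon as the target count is reached rather than materializing the full dataset on disk.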
Train using TSDAE (unsupervised):

```bash
python train.py --method tsdae --epochs 1
```

Train using contrastive learning:

```bash
python train.py --method contrastive --epochs 1
```

Options:
- `--data-dir`: Directory containing markdown files
- `--output-dir`: Output directory for the trained model
- `--method`: Training method (`tsdae` or `contrastive`)
- `--epochs`: Number of training epochs
- `--batch-size`: Batch size for training
- `--model-name`: Base model to fine-tune
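The option surface above could be wired up with `argparse` roughly as follows. This is a sketch, not the script's actual code; the default values shown are assumptions except where the text states them:

```python
import argparse

def build_parser():
    """Illustrative CLI matching the documented train.py options."""
    p = argparse.ArgumentParser(description="Fine-tune a sentence transformer")
    p.add_argument("--data-dir", default="data/markdown")       # assumed default
    p.add_argument("--output-dir", default="models")            # assumed default
    p.add_argument("--method", choices=["tsdae", "contrastive"], default="tsdae")
    p.add_argument("--epochs", type=int, default=1)
    p.add_argument("--batch-size", type=int, default=16)        # assumed default
    p.add_argument("--model-name", default="all-MiniLM-L6-v2")
    return p

args = build_parser().parse_args(["--method", "contrastive", "--epochs", "1"])
print(args.method, args.epochs)  # contrastive 1
```

Restricting `--method` via `choices` makes an invalid method fail at parse time rather than mid-training.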
Run the inference examples:

```bash
python inference.py
```

Launch the web UI:

```bash
python web_ui.py
```

Then open http://localhost:7860 in your browser.
Features:
- Compute similarity between texts
- Semantic search over a corpus
- Batch similarity matrix
- Embedding inspection
- Compare different models (base vs fine-tuned)
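The similarity features above reduce to cosine similarity between embedding vectors. A minimal sketch using stand-in 384-dimensional vectors (in the actual scripts the embeddings would come from the model's encode step):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
emb_a = rng.normal(size=384)                 # stand-in for a text embedding
emb_b = emb_a + 0.1 * rng.normal(size=384)   # slightly perturbed copy
print(cosine_similarity(emb_a, emb_b))       # close to 1.0 for similar vectors
```

A batch similarity matrix is just this computation over all pairs, which is usually done in one shot by normalizing the embedding matrix and taking its product with its transpose.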
- Dimensions: 384
- Max Sequence Length: 256 tokens
- Performance: Excellent balance of speed and quality
- Use Cases: Semantic search, clustering, similarity comparison
TSDAE (Transformer-based Sequential Denoising Auto-Encoder): Best for unsupervised domain adaptation. The model learns to reconstruct sentences from noisy inputs, improving its understanding of your domain's vocabulary and structure.
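The "noisy inputs" are typically produced by deleting a fraction of the tokens. A sketch of that corruption step (the 0.6 deletion ratio follows the common TSDAE setup; this helper is illustrative, not the project's code):

```python
import random

def delete_noise(sentence, del_ratio=0.6, seed=42):
    """Corrupt a sentence by randomly deleting roughly del_ratio of its tokens."""
    rng = random.Random(seed)
    tokens = sentence.split()
    kept = [t for t in tokens if rng.random() > del_ratio]
    # Guarantee at least one token survives so the pair is never empty
    return " ".join(kept) if kept else tokens[0]

original = "The model learns to reconstruct sentences from noisy inputs"
noisy = delete_noise(original)
print(noisy)
```

Training then pairs each `noisy` sentence with its `original`, and the model is optimized to recover the original from the corrupted input.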
Contrastive learning: Creates positive pairs from related sentences and trains the model to distinguish them from negative examples. Good when you have naturally paired data.
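One common way to get "related sentences" without labels is to treat adjacent sentences in the same document as a positive pair. A sketch of that pairing heuristic (the actual pairing logic in the training script may differ):

```python
def make_positive_pairs(sentences):
    """Pair each sentence with its successor; neighbors are assumed related."""
    return [(sentences[i], sentences[i + 1]) for i in range(len(sentences) - 1)]

doc = [
    "Install the dependencies.",
    "Then run the training script.",
    "Finally evaluate the model.",
]
pairs = make_positive_pairs(doc)
print(len(pairs))  # 2 pairs from 3 sentences
```

During training, the other sentences in the batch typically serve as the negative examples for each pair, so no explicit negative mining is needed.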
```
sentence-transformer/
├── requirements.txt        # Python dependencies
├── download_dataset.py     # Download markdown files
├── train.py                # Training script
├── inference.py            # Inference examples
├── data/                   # Downloaded data (gitignored)
│   └── markdown/           # Markdown files
└── models/
    ├── all-MiniLM-L6-v2/   # Base model (committed)
    └── finetuned-*/        # Trained models
```
- Python 3.9+
- CUDA-compatible GPU (recommended for training)
- ~10GB disk space for dataset
MIT