A production-ready semantic search engine and RAG (Retrieval-Augmented Generation) system for Hacker News comments, built with vector embeddings, PostgreSQL with pgvector, and LangGraph.
This project implements a full-stack semantic search system over millions of Hacker News comments (2023-2025), enabling natural language queries and AI-powered question answering. Unlike traditional keyword search, it uses dense vector embeddings to understand semantic meaning, finding relevant discussions even when exact keywords don't match.
Live Demo:
Hosted at hn.fiodorov.es
Key Features:
- Semantic Search: Natural language queries with cosine similarity ranking
- RAG System: AI-powered Q&A using LangGraph workflows and DeepSeek LLM
- Scalable Architecture: PostgreSQL with pgvector for production-grade vector search
- Redis Caching: Sub-second response times for repeated queries
- Incremental Updates: Idempotent data pipeline for fetching new HN comments
- Web Interface: Gradio-based UI with URL parameter support
- Partitioned Tables: Time-based partitioning for efficient query performance
┌──────────────────────────────────────────────────────────┐
│                      Data Pipeline                       │
├──────────────────────────────────────────────────────────┤
│  BigQuery (HN Public Dataset)                            │
│        ↓                                                 │
│  Fetch New Comments (idempotent, resumable)              │
│        ↓                                                 │
│  Generate Embeddings (sentence-transformers, MPS/CUDA)   │
│        ↓                                                 │
│  PostgreSQL + pgvector (partitioned by month)            │
└──────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│                       Query System                       │
├──────────────────────────────────────────────────────────┤
│  User Query                                              │
│        ↓                                                 │
│  Encode with sentence-transformers/all-mpnet-base-v2     │
│        ↓                                                 │
│  Redis Cache Check ────────────────────┐                 │
│        ↓                               │ (cache hit)     │
│  PostgreSQL Vector Search (cosine)     │                 │
│        ↓                               │                 │
│  Cache Results ────────────────────────┘                 │
│        ↓                                                 │
│  Return Top K Documents                                  │
└──────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│                        RAG System                        │
├──────────────────────────────────────────────────────────┤
│  LangGraph Workflow (StateGraph)                         │
│        ↓                                                 │
│  [Retrieve] → Vector Search → Top 10 Comments            │
│        ↓                                                 │
│  [Answer] → DeepSeek LLM → Generated Response            │
│        ↓                                                 │
│  Gradio Web UI (with sources & citations)                │
└──────────────────────────────────────────────────────────┘
- Language: Python 3.13
- Package Manager: uv (fast Python package installer)
- Database: PostgreSQL 16 + pgvector extension
- Vector Model: all-mpnet-base-v2 (768-dim embeddings)
- LLM: DeepSeek via OpenAI-compatible API
- Cache: Redis 6.x
- Data Source: BigQuery HN Public Dataset
- sentence-transformers: High-quality sentence embeddings
- psycopg3: Modern PostgreSQL adapter with async support
- pgvector: PostgreSQL extension for vector similarity search
- LangGraph: Orchestration framework for LLM workflows
- LangChain: LLM abstractions and prompting
- Gradio: Web UI framework
- torch: PyTorch for model inference (MPS support for Apple Silicon)
- pandas + pyarrow: Data processing pipeline
- Compute: Railway (PostgreSQL + Redis) / Local development
- Deployment: Docker-ready with docker-compose.yml
Source: Hacker News comments from BigQuery public dataset (bigquery-public-data.hacker_news.full)
Scope:
- Time range: January 2023 - September 2025
- Comment count: ~9.4M comments
- Filters: Non-deleted, non-dead, non-null text
Partitioning Strategy:
- Tables partitioned by month (e.g., hn_documents_2023_01, hn_documents_2023_02, ...)
- Enables efficient querying and index management
- Simplifies incremental updates
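As a sketch, monthly range partitions of this shape could be declared as below. The column list and names here are illustrative; the project's actual schema is defined in hn_search/init_db_pgvector.py and may differ.

```sql
-- Hypothetical DDL; the real table has more columns.
CREATE TABLE hn_documents (
    id          BIGINT NOT NULL,
    clean_text  TEXT,
    author      TEXT,
    "timestamp" TIMESTAMPTZ NOT NULL,
    embedding   vector(768)
) PARTITION BY RANGE ("timestamp");

CREATE TABLE hn_documents_2023_01 PARTITION OF hn_documents
    FOR VALUES FROM ('2023-01-01') TO ('2023-02-01');
```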
# Install uv (recommended)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Or use pip
pip install uv
# Install PostgreSQL with pgvector
# See: https://github.com/pgvector/pgvector#installation
# Install Redis (optional, for caching)
brew install redis # macOS
sudo apt install redis  # Ubuntu

# Clone the repository
git clone https://github.com/yourusername/hn-search.git
cd hn-search
# Install dependencies
uv sync
# For development (includes BigQuery tools)
uv sync --extra dev
# Set up environment variables
cp .env.example .env
# Edit .env with your credentials:
# DATABASE_URL=postgres://user:pass@host:port/dbname
# DEEPSEEK_API_KEY=sk-...
# REDIS_URL=redis://localhost:6379 (optional)

# Initialize database with precomputed embeddings
uv run python -m hn_search.init_db_pgvector
# Or initialize with test mode (100 docs only)
uv run python -m hn_search.init_db_pgvector --test

# Basic search
uv run python -m hn_search.query "What do people think about Rust vs Go?"
# Get more results
uv run python -m hn_search.query "best practices for system design" 20
# Output includes:
# - Comment ID (with HN link)
# - Author
# - Timestamp
# - Cosine distance score
# - Full comment text

# Start the Gradio web interface
uv run python -m hn_search.rag.web_ui
# Open http://localhost:7860
# Ask questions like:
# "What are the main criticisms of microservices?"
# "How do people debug production issues?"
# "What do HN users think about AI coding assistants?"Features:
- Real-time streaming responses
- Source citations with HN links
- URL parameter support:
?q=your+question - Auto-search from URL parameters
# Fetch new comments, generate embeddings, and upsert to DB
uv run --extra dev python misc/fetch_and_embed_new_comments.py
# Options:
# --project <GCP_PROJECT> # Specify BigQuery billing project
# --skip-fetch # Skip BigQuery download
# --skip-embed # Skip embedding generation
# --skip-upsert # Skip database insertion
# --reset # Clear state and start fresh
# Resume interrupted runs (automatic)
uv run --extra dev python misc/fetch_and_embed_new_comments.py
# The script is fully idempotent and resumable:
# - Saves state to data/raw/fetch_state.json
# - Checks for existing files before re-downloading
# - Incremental embedding generation with checkpoints
# - Tracks processed IDs to avoid duplicate inserts

A single NVIDIA RTX 5090 was rented from vast.ai to compute all historical embeddings for a few dollars.
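The resume logic described above might look like the following minimal sketch. The state-file layout and function names are assumptions for illustration; the script's real format in data/raw/fetch_state.json may differ.

```python
import json
from pathlib import Path

STATE_FILE = Path("data/raw/fetch_state.json")  # state path used by the script


def load_state() -> dict:
    """Return the saved pipeline state, or a fresh one on first run."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"last_fetched_id": 0, "processed_ids": []}


def save_state(state: dict) -> None:
    """Write state atomically so an interrupted run can resume safely."""
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    tmp = STATE_FILE.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(STATE_FILE)


def filter_new(state: dict, batch: list[dict]) -> list[dict]:
    """Drop comments whose IDs were already processed (idempotent upsert)."""
    seen = set(state["processed_ids"])
    new = [c for c in batch if c["id"] not in seen]
    state["processed_ids"].extend(c["id"] for c in new)
    if batch:
        state["last_fetched_id"] = max(state["last_fetched_id"],
                                       max(c["id"] for c in batch))
    return new
```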
# Process raw parquet files and generate embeddings
uv run python misc/generate_embeddings_gpu.py
# Uses MPS (Apple Silicon) or CUDA automatically
# Processes in batches to avoid OOM
# Saves to embeddings/*.parquet

The system uses cosine distance for similarity search:
SELECT id, clean_text, author, timestamp, type,
embedding <=> query_vector AS distance
FROM hn_documents
ORDER BY embedding <=> query_vector
LIMIT 10;

Performance Optimizations:
- HNSW index for approximate nearest neighbor search
- Redis caching layer reduces repeated queries to <100ms
- Connection pooling with psycopg3
- Partitioned tables for efficient index scans
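A pgvector HNSW index for the cosine operator can be created as below; the `m` and `ef_construction` values here are common defaults, not necessarily the ones this project uses:

```sql
-- Hypothetical tuning parameters; adjust per workload.
CREATE INDEX ON hn_documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
```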
The RAG system uses LangGraph to orchestrate a two-node workflow:
1. Retrieve Node:
   - Encodes user query with sentence-transformers
   - Performs vector search in PostgreSQL
   - Returns top 10 most relevant comments
2. Answer Node:
   - Formats retrieved comments as context
   - Prompts DeepSeek LLM with query + context
   - Streams response back to user
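The two-node flow above can be sketched in plain Python. The real system wires these as LangGraph StateGraph nodes in hn_search/rag/; the `search` and `llm` callables here are hypothetical stand-ins:

```python
from typing import Callable, TypedDict


class RAGState(TypedDict):
    query: str
    documents: list[str]
    answer: str


def make_retrieve_node(search: Callable[[str, int], list[str]]):
    def retrieve(state: RAGState) -> RAGState:
        # Vector search over pgvector, top 10 comments
        state["documents"] = search(state["query"], 10)
        return state
    return retrieve


def make_answer_node(llm: Callable[[str], str]):
    def answer(state: RAGState) -> RAGState:
        # Number the comments so the LLM can cite [1], [2], ...
        context = "\n".join(f"[{i + 1}] {d}"
                            for i, d in enumerate(state["documents"]))
        state["answer"] = llm(f"Question: {state['query']}\n\nContext:\n{context}")
        return state
    return answer


def run_pipeline(query: str, search, llm) -> RAGState:
    """Run retrieve then answer, threading state between the nodes."""
    state: RAGState = {"query": query, "documents": [], "answer": ""}
    for node in (make_retrieve_node(search), make_answer_node(llm)):
        state = node(state)
    return state
```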
Prompt Engineering:
system_prompt = """You are a helpful assistant that answers questions
based on Hacker News discussions. Use the provided comments to give
accurate, well-sourced answers. Cite comment numbers [1], [2], etc."""
user_prompt = f"""Question: {query}
Context from HN comments:
{formatted_comments}
Answer:"""

Model: sentence-transformers/all-mpnet-base-v2
- Dimensions: 768
- Max sequence length: 384 tokens
- Training: MS MARCO + Natural Questions + other datasets
- Performance: strong results on semantic similarity benchmarks
Why this model?
- Excellent balance of quality vs. speed
- Pre-trained on diverse Q&A datasets
- Good generalization to HN comment domain
- Efficient inference on CPU/MPS/CUDA
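For reference, pgvector's `<=>` operator returns cosine distance, i.e. 1 minus cosine similarity, so lower scores mean more similar. A minimal sketch of the ranking math (pure Python, illustrative names):

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance as pgvector's <=> computes it: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def top_k(query_vec: list[float], corpus: list[dict], k: int = 10) -> list[dict]:
    """Rank documents by ascending cosine distance (most similar first)."""
    return sorted(corpus,
                  key=lambda doc: cosine_distance(query_vec, doc["embedding"]))[:k]
```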
- Documents: ~9.4M Hacker News comments
- Storage: ~40 GB (including embeddings)
- Query Latency:
- Cold query: ~30s (embedding + search + LLM)
- Cached query: <1s (Redis cache hit)
- Concurrent duplicate query: <1s (job deduplication)
- RAG end-to-end: ~30s (including LLM generation)
Implemented:
- Singleton Embedding Model: Model loaded once and reused (3-5x throughput)
- Job Deduplication: Concurrent duplicate queries share processing (saves 90%+ compute)
- Multi-layer Caching: Redis cache for vector search, LLM answers, and job results
- Connection Pooling: PostgreSQL connection pool (min: 2, max: 20)
- Partitioned Tables: Monthly partitions for efficient indexing
- Incremental Updates: Only process new comments since last run
Single Instance:
- 20-30 concurrent users (unique queries)
- 100+ concurrent users (with 80% cache hit rate)
Horizontal Scaling (Railway/Cloud):
- 2 replicas: 40-60 concurrent users
- 4 replicas: 80-120 concurrent users
- 8 replicas: 160-240 concurrent users
See RAILWAY.md for the deployment guide.
- RAM: 2-3 GB per instance (1.5GB for embedding model)
- Storage: ~40 GB for database (9.4M documents + embeddings)
- CPU: 0.5-1.0 cores per instance
- PostgreSQL: 100+ connections (20 per instance)
- Redis: 512MB-1GB for cache
# Required
DATABASE_URL=postgres://user:pass@host:port/dbname
DEEPSEEK_API_KEY=sk-...
# Optional
REDIS_URL=redis://localhost:6379
GOOGLE_CLOUD_PROJECT=your-gcp-project
TOKENIZERS_PARALLELISM=false  # Disable for multi-threaded use

For optimal performance on Railway/cloud instances:
-- Default settings (already applied)
shared_buffers = 128MB
maintenance_work_mem = 64MB
work_mem = 4MB
max_parallel_workers = 8
effective_cache_size = 4GB

# Format code
make format
# Run linter
make lint
# Sort imports
make imports

hn-search/
├── hn_search/                          # Main package
│   ├── query.py                        # Vector search interface
│   ├── init_db_pgvector.py             # Database initialization
│   ├── db_config.py                    # Database connection config
│   ├── cache_config.py                 # Redis caching layer
│   ├── common.py                       # Shared utilities
│   └── rag/                            # RAG system
│       ├── graph.py                    # LangGraph workflow
│       ├── nodes.py                    # Retrieve & Answer nodes
│       ├── state.py                    # State management
│       ├── cli.py                      # CLI interface
│       └── web_ui.py                   # Gradio web interface
├── misc/                               # Utility scripts
│   ├── generate_embeddings_gpu.py      # Batch embedding generation
│   └── fetch_and_embed_new_comments.py # Incremental updates
├── data/                               # Data directory
│   └── raw/                            # Raw parquet files
├── pyproject.toml                      # Project dependencies
└── Makefile                            # Development shortcuts
This project demonstrates:
- Vector Search at Scale: Implementing semantic search with pgvector on millions of documents
- Production ML Pipelines: Idempotent, resumable data processing with checkpointing
- RAG Architecture: Building retrieval-augmented generation with LangGraph
- Database Optimization: Partitioning strategies, connection pooling, caching
- Modern Python Tooling: uv, ruff, type hints, async patterns
- Cloud Integration: BigQuery public datasets, Railway deployment, Redis caching
- GPU Optimization: MPS/CUDA support for efficient embedding generation
- pgvector: Open-source vector similarity search for Postgres
- sentence-transformers: State-of-the-art sentence embeddings
- LangGraph: Building stateful, multi-actor LLM applications
- BigQuery HN Dataset
- Retrieval-Augmented Generation (Lewis et al., 2020)
MIT License - see LICENSE file for details
Contributions welcome! Please open an issue or PR.
# Fork and clone
git clone https://github.com/yourusername/hn-search.git
# Create a branch
git checkout -b feature/your-feature
# Make changes and test
uv run python -m hn_search.query "test query"
# Format and lint
make format
# Commit and push
git commit -m "Add your feature"
git push origin feature/your-feature

- Y Combinator for open-sourcing Hacker News data
- The pgvector team for excellent PostgreSQL integration
- sentence-transformers community for pre-trained models
- LangChain team for RAG tooling
If you find this project useful, please consider giving it a star!
