afiodorov/hn-search


🔎 HN Search: Semantic Search & RAG for Hacker News

A production-ready semantic search engine and RAG (Retrieval-Augmented Generation) system for Hacker News comments, built with vector embeddings, PostgreSQL with pgvector, and LangGraph.

🎯 Project Overview

This project implements a full-stack semantic search system over millions of Hacker News comments (2023-2025), enabling natural language queries and AI-powered question answering. Unlike traditional keyword search, it uses dense vector embeddings to understand semantic meaning, finding relevant discussions even when exact keywords don't match.

Live Demo:

Hosted at hn.fiodorov.es

Key Features:

  • 🔍 Semantic Search: Natural language queries with cosine similarity ranking
  • 🤖 RAG System: AI-powered Q&A using LangGraph workflows and DeepSeek LLM
  • 📊 Scalable Architecture: PostgreSQL with pgvector for production-grade vector search
  • ⚡ Redis Caching: Sub-second response times for repeated queries
  • 🔄 Incremental Updates: Idempotent data pipeline for fetching new HN comments
  • 🎨 Web Interface: Gradio-based UI with URL parameter support
  • 🏗️ Partitioned Tables: Time-based partitioning for efficient query performance

πŸ—οΈ Architecture

┌─────────────────────────────────────────────────────────┐
│                      Data Pipeline                      │
├─────────────────────────────────────────────────────────┤
│  BigQuery (HN Public Dataset)                           │
│          ↓                                              │
│  Fetch New Comments (idempotent, resumable)             │
│          ↓                                              │
│  Generate Embeddings (sentence-transformers, MPS/CUDA)  │
│          ↓                                              │
│  PostgreSQL + pgvector (partitioned by month)           │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│                      Query System                       │
├─────────────────────────────────────────────────────────┤
│  User Query                                             │
│          ↓                                              │
│  Encode with sentence-transformers/all-mpnet-base-v2    │
│          ↓                                              │
│  Redis Cache Check ────────────────┐                    │
│          ↓                         │ (cache hit)        │
│  PostgreSQL Vector Search (cosine) │                    │
│          ↓                         │                    │
│  Cache Results ────────────────────┘                    │
│          ↓                                              │
│  Return Top K Documents                                 │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│                       RAG System                        │
├─────────────────────────────────────────────────────────┤
│  LangGraph Workflow (StateGraph)                        │
│          ↓                                              │
│  [Retrieve] → Vector Search → Top 10 Comments           │
│          ↓                                              │
│  [Answer] → DeepSeek LLM → Generated Response           │
│          ↓                                              │
│  Gradio Web UI (with sources & citations)               │
└─────────────────────────────────────────────────────────┘

🛠️ Technical Stack

Core Technologies

  • Language: Python 3.13
  • Package Manager: uv (fast Python package installer)
  • Database: PostgreSQL 16 + pgvector extension
  • Vector Model: all-mpnet-base-v2 (768-dim embeddings)
  • LLM: DeepSeek via OpenAI-compatible API
  • Cache: Redis 6.x
  • Data Source: BigQuery HN Public Dataset

Key Libraries

  • sentence-transformers: High-quality sentence embeddings
  • psycopg3: Modern PostgreSQL adapter with async support
  • pgvector: PostgreSQL extension for vector similarity search
  • LangGraph: Orchestration framework for LLM workflows
  • LangChain: LLM abstractions and prompting
  • Gradio: Web UI framework
  • torch: PyTorch for model inference (MPS support for Apple Silicon)
  • pandas + pyarrow: Data processing pipeline

Infrastructure

  • Compute: Railway (PostgreSQL + Redis) / Local development
  • Deployment: Docker-ready with docker-compose.yml

📊 Dataset

Source: Hacker News comments from BigQuery public dataset (bigquery-public-data.hacker_news.full)

Scope:

  • Time range: January 2023 - September 2025
  • Comment count: ~9.4M comments
  • Filters: Non-deleted, non-dead, non-null text

Partitioning Strategy:

  • Tables partitioned by month (e.g., hn_documents_2023_01, hn_documents_2023_02, ...)
  • Enables efficient querying and index management
  • Simplifies incremental updates
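The monthly naming scheme above can be sketched in code. A minimal generator for the partition DDL follows; the table names match the README's examples, but the parent-table layout and range bounds are assumptions, not the project's exact schema:

```python
from datetime import date

def next_month(d: date) -> date:
    """First day of the month after d."""
    return date(d.year + 1, 1, 1) if d.month == 12 else date(d.year, d.month + 1, 1)

def partition_ddl(month: date) -> str:
    """DDL for one monthly partition of a range-partitioned parent table.
    The parent name and bounds are illustrative, not the real schema."""
    name = f"hn_documents_{month:%Y_%m}"
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF hn_documents "
        f"FOR VALUES FROM ('{month}') TO ('{next_month(month)}');"
    )

if __name__ == "__main__":
    # One statement per month keeps partition creation fully scriptable.
    for m in (date(2023, 1, 1), date(2023, 2, 1)):
        print(partition_ddl(m))
```

Generating the statements rather than hand-writing them makes it easy to extend the range as new months of comments arrive.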

🚀 Getting Started

Prerequisites

# Install uv (recommended)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Or use pip
pip install uv

# Install PostgreSQL with pgvector
# See: https://github.com/pgvector/pgvector#installation

# Install Redis (optional, for caching)
brew install redis  # macOS
sudo apt install redis  # Ubuntu

Installation

# Clone the repository
git clone https://github.com/yourusername/hn-search.git
cd hn-search

# Install dependencies
uv sync

# For development (includes BigQuery tools)
uv sync --extra dev

# Set up environment variables
cp .env.example .env
# Edit .env with your credentials:
#   DATABASE_URL=postgres://user:pass@host:port/dbname
#   DEEPSEEK_API_KEY=sk-...
#   REDIS_URL=redis://localhost:6379  (optional)

Database Setup

# Initialize database with precomputed embeddings
uv run python -m hn_search.init_db_pgvector

# Or initialize with test mode (100 docs only)
uv run python -m hn_search.init_db_pgvector --test

💡 Usage

1. Semantic Search (CLI)

# Basic search
uv run python -m hn_search.query "What do people think about Rust vs Go?"

# Get more results
uv run python -m hn_search.query "best practices for system design" 20

# Output includes:
# - Comment ID (with HN link)
# - Author
# - Timestamp
# - Cosine distance score
# - Full comment text

2. RAG System (Web UI)

# Start the Gradio web interface
uv run python -m hn_search.rag.web_ui

# Open http://localhost:7860
# Ask questions like:
#   "What are the main criticisms of microservices?"
#   "How do people debug production issues?"
#   "What do HN users think about AI coding assistants?"

Features:

  • Real-time streaming responses
  • Source citations with HN links
  • URL parameter support: ?q=your+question
  • Auto-search from URL parameters

3. Incremental Data Updates

# Fetch new comments, generate embeddings, and upsert to DB
uv run --extra dev python misc/fetch_and_embed_new_comments.py

# Options:
#   --project <GCP_PROJECT>  # Specify BigQuery billing project
#   --skip-fetch             # Skip BigQuery download
#   --skip-embed             # Skip embedding generation
#   --skip-upsert            # Skip database insertion
#   --reset                  # Clear state and start fresh

# Resume interrupted runs (automatic)
uv run --extra dev python misc/fetch_and_embed_new_comments.py

# The script is fully idempotent and resumable:
# - Saves state to data/raw/fetch_state.json
# - Checks for existing files before re-downloading
# - Incremental embedding generation with checkpoints
# - Tracks processed IDs to avoid duplicate inserts
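The checkpointing idea behind the script can be sketched as follows. The real fetch_state.json layout is not documented here, so the fields (last_fetched_ts, processed_ids) and helper names are assumptions:

```python
import json
from pathlib import Path

STATE_FILE = Path("data/raw/fetch_state.json")  # path taken from the README

def load_state(path: Path = STATE_FILE) -> dict:
    """Resume from a previous run, or start fresh with an empty ID set."""
    if path.exists():
        state = json.loads(path.read_text())
        state["processed_ids"] = set(state["processed_ids"])
        return state
    return {"last_fetched_ts": None, "processed_ids": set()}

def save_state(state: dict, path: Path = STATE_FILE) -> None:
    """Persist progress after each batch so an interrupted run can resume."""
    path.parent.mkdir(parents=True, exist_ok=True)
    serializable = {**state, "processed_ids": sorted(state["processed_ids"])}
    path.write_text(json.dumps(serializable))

def upsert_batch(comment_ids: list[int], state: dict) -> list[int]:
    """Insert only IDs not seen before; record them so re-runs are idempotent."""
    new_ids = [i for i in comment_ids if i not in state["processed_ids"]]
    state["processed_ids"].update(new_ids)
    return new_ids
```

Because every batch is filtered against the recorded IDs before insertion, re-running after a crash repeats no work and inserts no duplicates.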

4. Generate Embeddings (Batch)

A single Nvidia RTX 5090 was rented from vast.ai to compute all historical embeddings for a few dollars.

# Process raw parquet files and generate embeddings
uv run python misc/generate_embeddings_gpu.py

# Uses MPS (Apple Silicon) or CUDA automatically
# Processes in batches to avoid OOM
# Saves to embeddings/*.parquet
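The batch loop that keeps memory bounded might look like the sketch below; `encode` stands in for the real SentenceTransformer.encode call, which is not reproduced here:

```python
def embed_in_batches(texts: list[str], encode, batch_size: int = 256) -> list:
    """Apply `encode` over fixed-size chunks so the device never holds the
    whole corpus at once. `encode` is a stand-in for
    SentenceTransformer.encode; batch_size is an illustrative default."""
    out: list = []
    for start in range(0, len(texts), batch_size):
        out.extend(encode(texts[start:start + batch_size]))
    return out
```

Tuning batch_size down is the usual lever when a batch no longer fits in GPU memory.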

πŸ” How It Works

Vector Search

The system uses cosine distance for similarity search:

SELECT id, clean_text, author, timestamp, type,
       embedding <=> query_vector AS distance
FROM hn_documents
ORDER BY embedding <=> query_vector
LIMIT 10
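The <=> operator above is pgvector's cosine-distance operator, i.e. 1 minus cosine similarity. A pure-Python illustration of the metric:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance as computed by pgvector's <=>: 1 - cos(a, b).
    0 means identical direction, 1 means orthogonal, 2 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0 (same direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
```

Ordering by this distance ascending, as the SQL does, returns the semantically closest comments first.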

Performance Optimizations:

  • HNSW index for approximate nearest neighbor search
  • Redis caching layer reduces repeated queries to <100ms
  • Connection pooling with psycopg3
  • Partitioned tables for efficient index scans
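The caching layer can be sketched as a deterministic key plus a get-or-compute wrapper. A plain dict stands in for the Redis client here, and the key scheme is illustrative rather than the project's actual one:

```python
import hashlib
import json

def cache_key(query: str, top_k: int) -> str:
    """Deterministic key for a (query, k) pair; prefix and hashing scheme
    are assumptions for this sketch."""
    payload = json.dumps({"q": query, "k": top_k}, sort_keys=True)
    return "hnsearch:" + hashlib.sha256(payload.encode()).hexdigest()

def cached_search(query: str, top_k: int, search_fn, cache: dict):
    """Return a cached result if present; otherwise run the expensive
    vector search once and store it. `cache` stands in for Redis."""
    key = cache_key(query, top_k)
    if key in cache:
        return cache[key]
    result = search_fn(query, top_k)
    cache[key] = result
    return result
```

With Redis in place of the dict, repeated queries skip both the embedding step and the database round trip.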

RAG Pipeline

The RAG system uses LangGraph to orchestrate a two-node workflow:

  1. Retrieve Node:

    • Encodes user query with sentence-transformers
    • Performs vector search in PostgreSQL
    • Returns top 10 most relevant comments
  2. Answer Node:

    • Formats retrieved comments as context
    • Prompts DeepSeek LLM with query + context
    • Streams response back to user
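The two-node flow can be mirrored without the LangGraph dependency. This is a simplified dependency-free sketch of the same shape, not the project's graph.py:

```python
from typing import Callable, TypedDict

class RAGState(TypedDict, total=False):
    query: str
    documents: list[str]
    answer: str

def make_pipeline(retrieve_fn: Callable[[str], list[str]],
                  llm_fn: Callable[[str], str]):
    """Chain a Retrieve node and an Answer node over a shared state dict,
    mirroring the LangGraph StateGraph; retrieve_fn and llm_fn are
    stand-ins for the vector search and the DeepSeek call."""
    def retrieve(state: RAGState) -> RAGState:
        state["documents"] = retrieve_fn(state["query"])[:10]  # top 10
        return state

    def answer(state: RAGState) -> RAGState:
        context = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(state["documents"]))
        state["answer"] = llm_fn(f"Question: {state['query']}\n\nContext:\n{context}")
        return state

    def run(query: str) -> RAGState:
        return answer(retrieve({"query": query}))

    return run
```

LangGraph adds streaming, checkpointing, and conditional edges on top of this basic retrieve-then-answer chain.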

Prompt Engineering:

system_prompt = """You are a helpful assistant that answers questions
based on Hacker News discussions. Use the provided comments to give
accurate, well-sourced answers. Cite comment numbers [1], [2], etc."""

user_prompt = f"""Question: {query}

Context from HN comments:
{formatted_comments}

Answer:"""

Embedding Model

Model: sentence-transformers/all-mpnet-base-v2

  • Dimensions: 768
  • Max sequence length: 384 tokens
  • Training: MS MARCO + Natural Questions + other datasets
  • Performance: strong results on semantic similarity benchmarks

Why this model?

  • Excellent balance of quality vs. speed
  • Pre-trained on diverse Q&A datasets
  • Good generalization to HN comment domain
  • Efficient inference on CPU/MPS/CUDA

📈 Performance & Scale

Current Scale

  • Documents: ~9.4M Hacker News comments
  • Storage: ~40 GB (including embeddings)
  • Query Latency:
    • Cold query: ~30s (embedding + search + LLM)
    • Cached query: <1s (Redis cache hit)
    • Concurrent duplicate query: <1s (job deduplication)
    • RAG end-to-end: ~30s (including LLM generation)

Production Optimizations ⚡

Implemented:

  1. Singleton Embedding Model: Model loaded once and reused (3-5x throughput)
  2. Job Deduplication: Concurrent duplicate queries share processing (saves 90%+ compute)
  3. Multi-layer Caching: Redis cache for vector search, LLM answers, and job results
  4. Connection Pooling: PostgreSQL connection pool (min: 2, max: 20)
  5. Partitioned Tables: Monthly partitions for efficient indexing
  6. Incremental Updates: Only process new comments since last run
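Optimization 1, the singleton model, is commonly implemented with functools.lru_cache; in this sketch a placeholder object stands in for the real SentenceTransformer load:

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_model():
    """Load the embedding model exactly once per process; every later call
    returns the same cached instance. The real loader would construct
    SentenceTransformer("all-mpnet-base-v2"); a placeholder is used here."""
    return object()  # stand-in for the ~1.5 GB model load

m1 = get_model()
m2 = get_model()
assert m1 is m2  # same instance, no reload on subsequent calls
```

Avoiding repeated model loads is what yields the quoted throughput gain: the expensive construction happens once, and every request thereafter reuses the in-memory instance.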

Capacity

Single Instance:

  • 20-30 concurrent users (unique queries)
  • 100+ concurrent users (with 80% cache hit rate)

Horizontal Scaling (Railway/Cloud):

  • 2 replicas: 40-60 concurrent users
  • 4 replicas: 80-120 concurrent users
  • 8 replicas: 160-240 concurrent users

See RAILWAY.md for deployment guide.

Resource Requirements

  • RAM: 2-3 GB per instance (1.5 GB for embedding model)
  • Storage: ~40 GB for database (9.4M documents + embeddings)
  • CPU: 0.5-1.0 cores per instance
  • PostgreSQL: 100+ connections (20 per instance)
  • Redis: 512 MB-1 GB for cache

🔧 Configuration

Environment Variables

# Required
DATABASE_URL=postgres://user:pass@host:port/dbname
DEEPSEEK_API_KEY=sk-...

# Optional
REDIS_URL=redis://localhost:6379
GOOGLE_CLOUD_PROJECT=your-gcp-project
TOKENIZERS_PARALLELISM=false  # Disable for multi-threaded use

PostgreSQL Settings

For optimal performance on Railway/cloud instances:

-- Default settings (already applied)
shared_buffers = 128MB
maintenance_work_mem = 64MB
work_mem = 4MB
max_parallel_workers = 8
effective_cache_size = 4GB

🧪 Development

Code Quality

# Format code
make format

# Run linter
make lint

# Sort imports
make imports

Project Structure

hn-search/
├── hn_search/              # Main package
│   ├── query.py            # Vector search interface
│   ├── init_db_pgvector.py # Database initialization
│   ├── db_config.py        # Database connection config
│   ├── cache_config.py     # Redis caching layer
│   ├── common.py           # Shared utilities
│   └── rag/                # RAG system
│       ├── graph.py        # LangGraph workflow
│       ├── nodes.py        # Retrieve & Answer nodes
│       ├── state.py        # State management
│       ├── cli.py          # CLI interface
│       └── web_ui.py       # Gradio web interface
├── misc/                   # Utility scripts
│   ├── generate_embeddings_gpu.py       # Batch embedding generation
│   └── fetch_and_embed_new_comments.py  # Incremental updates
├── data/                   # Data directory
│   └── raw/                # Raw parquet files
├── pyproject.toml          # Project dependencies
└── Makefile                # Development shortcuts

🎓 Learning Outcomes

This project demonstrates:

  1. Vector Search at Scale: Implementing semantic search with pgvector on millions of documents
  2. Production ML Pipelines: Idempotent, resumable data processing with checkpointing
  3. RAG Architecture: Building retrieval-augmented generation with LangGraph
  4. Database Optimization: Partitioning strategies, connection pooling, caching
  5. Modern Python Tooling: uv, ruff, type hints, async patterns
  6. Cloud Integration: BigQuery public datasets, Railway deployment, Redis caching
  7. GPU Optimization: MPS/CUDA support for efficient embedding generation

📄 License

MIT License - see LICENSE file for details

🤝 Contributing

Contributions welcome! Please open an issue or PR.

Development Setup

# Fork and clone
git clone https://github.com/yourusername/hn-search.git

# Create a branch
git checkout -b feature/your-feature

# Make changes and test
uv run python -m hn_search.query "test query"

# Format and lint
make format

# Commit and push
git commit -m "Add your feature"
git push origin feature/your-feature

πŸ™ Acknowledgments

  • Y Combinator for open-sourcing Hacker News data
  • The pgvector team for excellent PostgreSQL integration
  • sentence-transformers community for pre-trained models
  • LangChain team for RAG tooling

⭐ If you find this project useful, please consider giving it a star!
