html2md
Extract the main content of a web page into Markdown, with optional translation to Chinese via LLM or Baidu. Includes CLI tools and a clean Web UI.
Chinese README: README.zh-CN.md
Features
- Main content extraction (Readability + structural scoring)
- Markdown output with absolute URLs for links/images
- Translation: default LLM (OpenAI‑compatible), optional Baidu
- Long‑form support via chunked translation
- Parallel translation (CLI configurable, Web default 16)
- Web UI with real progress (SSE)
Requirements
- Python 3.9+
- macOS / Linux / Windows
Install
python -m venv .venv
source .venv/bin/activate
pip install -e .Quick Start (CLI) Extract Markdown:
html2md "https://example.com/article" -o output.mdIf -o is omitted, the filename is derived from the page title or URL.
Fetch + translate (default LLM):
export NVIDIA_API_KEY="your_api_key"
html2md "https://example.com/article" --translate -o output.zh.mdTranslate an existing Markdown file:
html2md-translate output.md -o output.zh.mdUse Baidu:
export BAIDU_TRANSLATE_APPID="your_appid"
export BAIDU_TRANSLATE_APPKEY="your_appkey"
html2md-translate output.md -o output.zh.md --provider baiduParallel translation:
html2md-translate output.md -o output.zh.md --provider llm --workers 4Web UI Start:
html2md-webOpen:
http://localhost:8000
The UI exposes only minimal options (URL + translate toggle). API keys are read from server env vars.
Translation Defaults LLM (default):
- base_url:
https://integrate.api.nvidia.com/v1 - model:
stepfun-ai/step-3.5-flash - max_tokens:
16384 - max_chars:
10000 - sleep:
0
Baidu:
- max_chars:
3000 - sleep:
1
Advanced CLI Options
html2md-translate output.md -o output.zh.md \
--provider llm \
--llm-base-url "https://integrate.api.nvidia.com/v1" \
--llm-model "stepfun-ai/step-3.5-flash" \
--llm-temperature 1.0 \
--llm-top-p 0.9 \
--llm-max-tokens 16384 \
--max-chars 10000 \
--workers 4Environment Variables
export NVIDIA_API_KEY="your_llm_api_key"
export BAIDU_TRANSLATE_APPID="your_baidu_appid"
export BAIDU_TRANSLATE_APPKEY="your_baidu_appkey"Project Structure
src/html2md/
main.py # fetch + extract + convert
translate.py # translation (LLM / Baidu)
web.py # Web UI (FastAPI + SSE)
Notes
- Some sites require JS rendering or login and may not be fetchable.
- LLM translation may slightly paraphrase while preserving Markdown structure.
- Translation logs print before/after blocks (verbose).
Contributing PRs and issues welcome. Please include:
- Repro steps
- Expected vs actual behavior
- Environment details
License
No license yet. Add a LICENSE before open‑sourcing.