Skip to content

Pi3AI/html2md

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 

Repository files navigation

html2md

Extract the main content of a web page into Markdown, with optional translation to Chinese via LLM or Baidu. Includes CLI tools and a clean Web UI.

Chinese README: README.zh-CN.md

Features

  • Main content extraction (Readability + structural scoring)
  • Markdown output with absolute URLs for links/images
  • Translation: default LLM (OpenAI‑compatible), optional Baidu
  • Long‑form support via chunked translation
  • Parallel translation (CLI configurable, Web default 16)
  • Web UI with real progress (SSE)

Requirements

  • Python 3.9+
  • macOS / Linux / Windows

Install

python -m venv .venv
source .venv/bin/activate
pip install -e .

Quick Start (CLI) Extract Markdown:

html2md "https://example.com/article" -o output.md

If -o is omitted, the filename is derived from the page title or URL.

Fetch + translate (default LLM):

export NVIDIA_API_KEY="your_api_key"
html2md "https://example.com/article" --translate -o output.zh.md

Translate an existing Markdown file:

html2md-translate output.md -o output.zh.md

Use Baidu:

export BAIDU_TRANSLATE_APPID="your_appid"
export BAIDU_TRANSLATE_APPKEY="your_appkey"
html2md-translate output.md -o output.zh.md --provider baidu

Parallel translation:

html2md-translate output.md -o output.zh.md --provider llm --workers 4

Web UI Start:

html2md-web

Open:

http://localhost:8000

The UI exposes only minimal options (URL + translate toggle). API keys are read from server env vars.

Translation Defaults LLM (default):

  • base_url: https://integrate.api.nvidia.com/v1
  • model: stepfun-ai/step-3.5-flash
  • max_tokens: 16384
  • max_chars: 10000
  • sleep: 0

Baidu:

  • max_chars: 3000
  • sleep: 1

Advanced CLI Options

html2md-translate output.md -o output.zh.md \
  --provider llm \
  --llm-base-url "https://integrate.api.nvidia.com/v1" \
  --llm-model "stepfun-ai/step-3.5-flash" \
  --llm-temperature 1.0 \
  --llm-top-p 0.9 \
  --llm-max-tokens 16384 \
  --max-chars 10000 \
  --workers 4

Environment Variables

export NVIDIA_API_KEY="your_llm_api_key"
export BAIDU_TRANSLATE_APPID="your_baidu_appid"
export BAIDU_TRANSLATE_APPKEY="your_baidu_appkey"

Project Structure

src/html2md/
  main.py        # fetch + extract + convert
  translate.py   # translation (LLM / Baidu)
  web.py         # Web UI (FastAPI + SSE)

Notes

  • Some sites require JS rendering or login and may not be fetchable.
  • LLM translation may slightly paraphrase while preserving Markdown structure.
  • Translation logs print before/after blocks (verbose).

Contributing PRs and issues welcome. Please include:

  • Repro steps
  • Expected vs actual behavior
  • Environment details

License No license yet. Add a LICENSE before open‑sourcing.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages