    ./dotnet-install.sh --version 9.0.100 --install-dir "$HOME/dotnet"
    export PATH="$HOME/dotnet:$PATH"

- Entry point: `MarkItDownConverter.ConvertAsync(string path, string mimeType, CancellationToken)`
- Response: `MarkItDownResult`
  - `Markdown` – normalised text
  - `Pages` – `Page(number, width, height)`
  - `Lines` – `Line(page, text, bbox)`
  - `Words` – `Word(page, text, bbox)`

`BoundingBox` is `[x, y, w, h]` with values in `[0, 1]` and a top-left origin.
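
To illustrate the normalised bounding-box convention, here is a minimal Python sketch that converts an `[x, y, w, h]` box into pixel edges. The `to_pixels` helper and the `PixelBox` type are hypothetical names, not part of the library, and the sketch assumes the page width and height are given in pixels.

```python
from typing import NamedTuple, Sequence


class PixelBox(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int


def to_pixels(bbox: Sequence[float], page_width: float, page_height: float) -> PixelBox:
    """Convert a normalised [x, y, w, h] box (top-left origin) to pixel edges."""
    x, y, w, h = bbox
    for value in (x, y, w, h):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"expected a value in [0, 1], got {value}")
    # x and w scale with the page width, y and h with the page height.
    return PixelBox(
        left=round(x * page_width),
        top=round(y * page_height),
        right=round((x + w) * page_width),
        bottom=round((y + h) * page_height),
    )
```

For example, `to_pixels([0.25, 0.5, 0.5, 0.25], 800, 1000)` describes a box that starts a quarter of the way across an 800 x 1000 page and halfway down it.
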
- PDFs use PdfPig for text extraction. When native words are below `MinimumNativeWordThreshold`, pages are rasterised with PDFtoImage and passed to Tesseract OCR.
- Images are processed directly with Tesseract.
- SkiaSharp is used for image manipulation; avoid SixLabors.ImageSharp.
- Markdown is optionally normalised via Markdig.
- Cancellation tokens are honoured at every stage.
- Serilog is the logging framework.
- Configure sinks and levels via `Serilog` settings (see `src/MarkItDownNet/appsettings.json`).
- Use `Serilog__MinimumLevel=Verbose` to enable detailed timings and counts.
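
The PDF rule in the list above (native text first, OCR when too few words are found) can be sketched in Python as follows. This is an illustration only: `words_for_page` and `run_ocr` are hypothetical names, and the sketch assumes the threshold is checked per page with a strict "less than" comparison, which the text does not confirm.

```python
from typing import Callable, List


def words_for_page(
    native_words: List[str],
    run_ocr: Callable[[], List[str]],
    minimum_native_word_threshold: int,
) -> List[str]:
    """Return the native words unless there are too few, then fall back to OCR."""
    # Assumption: the fallback triggers when the count is strictly below the threshold.
    if len(native_words) < minimum_native_word_threshold:
        # run_ocr stands in for "rasterise the page and pass it to Tesseract".
        return run_ocr()
    return native_words
```

Passing the OCR step as a callable keeps the expensive rasterise-and-recognise work lazy, so it only runs for pages that need it.
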
The minimal native dependencies for Linux x64 (Tesseract and Leptonica) are included in the repository under `runtimes/linux-x64/native` and are copied next to the binaries. A separate Tesseract installation is not required.

The Tesseract .NET binding is distributed as a local NuGet package (`local-packages/Tesseract.5.2.0.nupkg`); `nuget.config` forces the use of this source.

OCR needs only the language data. On Ubuntu 24.04 it can be installed with:

    sudo apt-get install -y tesseract-ocr-eng tesseract-ocr-ita tesseract-ocr-osd

Then specify the path via `OcrDataPath`.

- Install system Tesseract languages as above and set `TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata`.
- Install Python dependencies: `pip install 'markitdown[all]' pytesseract`.
- Generate Markdown with `python tools/markitdown_ocr.py <image_or_text> -o <out.md>`.
- When running benchmarks, the CLI automatically calls this script for `python` mode.
- For warm-start benchmarks, `python tools/run_markitdown_hot.py <text> <out.md>` loads the text once and runs five conversions, printing timing data as JSON.
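
The warm-start idea (load the input once, then time repeated conversions) can be sketched as below. This does not reproduce `tools/run_markitdown_hot.py`; the `run_hot` function, the stand-in converter, and the JSON field names are all assumptions made for illustration.

```python
import json
import time
from typing import Callable, Dict


def run_hot(text: str, convert: Callable[[str], str], runs: int = 5) -> Dict[str, object]:
    """Time repeated conversions of text that is already loaded in memory."""
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        convert(text)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "runs": runs,
        "timings_ms": timings_ms,
        "mean_ms": sum(timings_ms) / runs,
    }


if __name__ == "__main__":
    # Stand-in converter; a real run would call the actual conversion here.
    report = run_hot("# Title\n\nBody text.", lambda s: s.strip())
    print(json.dumps(report))
```

Keeping the file read outside the timed loop means the numbers reflect conversion cost only, which is the point of a warm-start measurement.
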
- Install dependencies (PPStructureV3 currently requires NumPy 1.x and PaddleX):

      pip install 'numpy<2' paddlepaddle==3.0.0 "paddlex[ocr]"

- Clone the reference repo once and convert an image to Markdown:

      git clone https://github.com/PaddlePaddle/PaddleOCR /tmp/PaddleOCR
      python - <image_path> <<'PY'
      import os, sys, tempfile

      # Import paddleocr from the cloned repo.
      repo = '/tmp/PaddleOCR'
      sys.path.insert(0, repo)
      from paddleocr import PPStructureV3

      image_path = sys.argv[1]
      pipeline = PPStructureV3(lang='en')
      res = pipeline.predict(image_path)

      # Save the first result as Markdown in a temp dir, then print it.
      tmp = tempfile.mkdtemp()
      res[0].save_to_markdown(tmp)
      stem = os.path.splitext(os.path.basename(image_path))[0]
      with open(os.path.join(tmp, stem + '.md'), encoding='utf-8') as f:
          print(f.read())
      PY

- Tests spin up a long-lived helper that instantiates `PPStructureV3` with `use_chart_recognition=False` and `use_formula_recognition=False` to trim unused modules and reuse a single model across images. Send each image path over `stdin` and read a JSON blob with both layout labels and markdown.
- Use this script to compare .NET Markdown output with the Python reference.
- Set `ENABLE_PPSTRUCTURE=1` before running tests to enable these comparisons; otherwise the Python helper is skipped.
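
The long-lived helper described above follows a simple request/response pattern over `stdin`. The sketch below shows one way such a loop could look. The wire format (one JSON object per line) and the key names are assumptions, and `predict` is a hypothetical callable standing in for a `PPStructureV3` instance created once at start-up, so the example runs without PaddleOCR installed.

```python
import json
import sys
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical signature: image path in, (layout labels, markdown) out.
Predict = Callable[[str], Tuple[List[str], str]]


def handle_requests(predict: Predict, lines: Iterable[str]) -> Iterator[str]:
    """Yield one JSON reply per non-empty input line (one image path per line)."""
    for line in lines:
        path = line.strip()
        if not path:
            continue
        try:
            labels, markdown = predict(path)
            reply = {"path": path, "labels": labels, "markdown": markdown}
        except Exception as exc:  # report the failure and keep the helper alive
            reply = {"path": path, "error": str(exc)}
        yield json.dumps(reply)


def main(predict: Predict) -> None:
    """Serve requests from stdin until it is closed."""
    for reply in handle_requests(predict, sys.stdin):
        print(reply, flush=True)
```

Flushing after every reply matters here: the caller waits for each answer before sending the next path, so buffered output would stall both sides.
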
- Set `StructureOptions.DetectOrientation` to enable angle-aware OCR and invoke an optional `IOrientationDetector` to rotate pages. `RapidOcrOrientationDetector` uses the `ch_ppocr_mobile_v2.0_cls_infer_opt.onnx` model to return 0/90/180/270 angles, and all bounding boxes are mapped back to the original image space.
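
Mapping boxes back to the original image space can be sketched with normalised `[x, y, w, h]` boxes and a top-left origin. The sketch assumes that the page was rotated clockwise by `angle` degrees before OCR; the project's actual angle convention may differ, and `to_original_space` is a hypothetical name.

```python
from typing import List, Sequence


def to_original_space(bbox: Sequence[float], angle: int) -> List[float]:
    """Map a box found on a rotated page back to the unrotated page.

    Assumption: the page was rotated clockwise by `angle` degrees before OCR.
    Boxes are normalised [x, y, w, h] with a top-left origin.
    """
    x, y, w, h = bbox
    if angle == 0:
        return [x, y, w, h]
    if angle == 90:
        # Undo a 90 degree clockwise turn: width and height swap.
        return [y, 1.0 - x - w, h, w]
    if angle == 180:
        # Undo a half turn: both axes are mirrored, size is unchanged.
        return [1.0 - x - w, 1.0 - y - h, w, h]
    if angle == 270:
        # Undo a 270 degree clockwise turn: width and height swap.
        return [1.0 - y - h, x, h, w]
    raise ValueError(f"unsupported angle: {angle}")
```

Because the coordinates are normalised, no page dimensions are needed; for quarter turns only the roles of width and height are exchanged.
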