PP-OCRv5 on ONNX Runtime + Arm KleidiAI | 100% Accuracy Aligned with PaddleOCR | Apple M4 Benchmark | ORT Version Comparison
A production-ready, single-file PP-OCRv5 inference pipeline using ONNX Runtime, featuring up to 2.0x pipeline speedup via KleidiAI SME2 (ORT 1.21.1 → 1.24.3 on Apple M4) with 100% text-level accuracy alignment with PaddleOCR — verified on 228 text regions across 7 images with zero mismatch.
- Up to 2.0x pipeline speedup via KleidiAI SME2 (ORT 1.21.1 → 1.24.3 at t=1 on Apple M4); rec model 4.4x faster
- 100% accuracy match with PaddleOCR — 228/228 texts identical, confidence diff < 0.00002
- Single-file deployment — `ppocrv5_onnx.py` (~720 lines), copy-paste into any Arm app
- Reproducible benchmarks — ORT 1.21.1 vs 1.24.3 across t=1, t=2, t=8; run on your own platform in 3 commands
- KleidiAI SME2 analysis — investigated with ARM in onnxruntime#27633
| Threads | ORT 1.21.1 (NEON) | ORT 1.24.3 (KleidiAI SME2) | Speedup |
|---|---|---|---|
| t=1 | 16,909 ms | 8,295 ms | 2.04x faster |
| t=2 (recommended) | 9,346 ms | 6,332 ms | 1.48x faster |
| t=8 | 6,497 ms | 7,096 ms | 0.92x (det regression) |
| Comparison | Texts | Match Rate | Avg Confidence Diff |
|---|---|---|---|
| PaddleOCR 3.3.0 vs ORT 1.21.1 | 228 | 100.0% | 0.000019 |
| PaddleOCR 3.3.0 vs ORT 1.24.3 | 228 | 100.0% | 0.000019 |
All benchmarks: 3 runs/image, 1 warmup. All ORT configurations produce 100% identical text output to PaddleOCR native inference. Reproduce:
```bash
python benchmarks/benchmark_unified.py --backend ort --num-runs 3 --threads 2
```
Why does ORT 1.24.3 at t=2 beat t=8? ORT 1.24.3's KleidiAI SME2 kernels dramatically accelerate rec (2.5x) and textline_ori (2.7x) relative to the ORT 1.21.1 t=8 baseline, but the det model regresses on high-resolution images due to a large-kernel Conv regression in ORT 1.24.x (kernel ≥ 7×7 on large spatial inputs). This regression is NOT caused by SME contention: it persists at t=2 and with `--disable-kleidiai`. Factor decomposition: the 3.2x det gap (ORT 1.24.3 at t=2 vs ORT 1.21.1 at t=8) = 1.94x (thread count: t=2 vs t=8) × 1.65x (Conv kernel regression). The pipeline still wins overall because rec's 2.5x speedup outweighs det's regression. At t=1 the speedup is most dramatic (2.04x): with a single thread NEON gets no parallelism at all, while KleidiAI SME2 still delivers full acceleration. See SME Thread Scaling for the full analysis.
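The factor decomposition can be verified with plain arithmetic from the det timings in the per-model table below:

```python
# det timings (ms) from the per-model breakdown below
det_121_t8 = 1328.30   # ORT 1.21.1 at t=8 (the old operating point)
det_121_t2 = 2571.48   # ORT 1.21.1 at t=2
det_124_t2 = 4246.58   # ORT 1.24.3 at t=2 (the new operating point)

thread_factor = det_121_t2 / det_121_t8   # ~1.94x: cost of running 2 threads instead of 8
conv_factor   = det_124_t2 / det_121_t2   # ~1.65x: large-kernel Conv regression in 1.24.x
total_gap     = det_124_t2 / det_121_t8   # ~3.20x == thread_factor * conv_factor
```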
Per-model inference breakdown (ms, averaged across 7 images)
| Model | ORT 1.21.1 (t=1) | ORT 1.21.1 (t=2) | ORT 1.21.1 (t=8) | ORT 1.24.3 (t=1) | ORT 1.24.3 (t=2) | ORT 1.24.3 (t=8) |
|---|---|---|---|---|---|---|
| doc_ori | 6.53 | 4.94 | 6.63 | 6.68 | 3.41 | 4.82 |
| det | 4,876.13 | 2,571.48 | 1,328.30 | 5,440.39 | 4,246.58 | 4,311.64 |
| textline_ori | 264.21 | 187.45 | 222.09 | 123.32 | 80.84 | 98.81 |
| rec | 11,703.89 | 6,523.42 | 4,850.45 | 2,647.07 | 1,931.17 | 2,579.15 |
Key insight: KleidiAI SME2 delivers massive acceleration on rec (4.4x at t=1) and textline_ori (2.1x at t=1). The det model regresses at high resolution, but rec dominates total pipeline time, so the net effect is a 2.04x pipeline speedup at t=1 and 1.48x at t=2.
Per-image latency breakdown
| Image | ORT 1.21.1 (t=1) | ORT 1.21.1 (t=2) | ORT 1.21.1 (t=8) | ORT 1.24.3 (t=1) | ORT 1.24.3 (t=2) | ORT 1.24.3 (t=8) |
|---|---|---|---|---|---|---|
| ancient_demo.png | 5,882 ms | 3,192 ms | 2,246 ms | 1,391 ms | 1,022 ms | 1,340 ms |
| handwrite_ch_demo.png | 3,125 ms | 1,681 ms | 1,144 ms | 1,096 ms | 673 ms | 1,175 ms |
| handwrite_en_demo.png | 3,907 ms | 2,107 ms | 1,429 ms | 2,296 ms | 828 ms | 2,215 ms |
| japan_demo.png | 22,953 ms | 12,250 ms | 7,971 ms | 19,560 ms | 18,425 ms | 14,549 ms |
| magazine.png | 36,143 ms | 19,873 ms | 14,025 ms | 15,990 ms | 11,601 ms | 16,118 ms |
| magazine_vetical.png | 36,115 ms | 20,386 ms | 14,341 ms | 14,984 ms | 9,732 ms | 11,599 ms |
| pinyin_demo.png | 10,235 ms | 5,931 ms | 4,323 ms | 2,749 ms | 2,039 ms | 2,676 ms |
```
┌─────────┐     ┌──────────┐     ┌───────┐     ┌──────────────┐     ┌───────┐
│  Image  │────▶│ doc_ori  │────▶│  det  │────▶│ textline_ori │────▶│  rec  │────▶ Results
│  (BGR)  │     │ 4-class  │     │  DB   │     │   2-class    │     │  CTC  │      [{text,
└─────────┘     │ rotation │     │ boxes │     │   rotation   │     │ decode│        conf,
                └──────────┘     └───────┘     └──────────────┘     └───────┘        bbox}]
                  LCNet           PP-OCRv5       LCNet             PP-OCRv5
                  224×224         HxW→stride32   160×80            48×W
```
See docs/PIPELINE_ARCHITECTURE.md for preprocessing parameters and implementation details.
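In code, the four stages compose roughly as follows (a schematic sketch with hypothetical helper names such as `rotate` and `crop_quad`; the actual implementation is in ppocrv5_onnx.py):

```python
def run_pipeline(image_bgr):
    # Stage 1: whole-document orientation (4-class LCNet @ 224×224)
    image = rotate(image_bgr, doc_ori.predict(image_bgr))

    # Stage 2: DB text detection (input resized to a stride-32 multiple)
    boxes = det.predict(image)

    results = []
    for box in boxes:
        crop = crop_quad(image, box)  # perspective-crop the quadrilateral region
        # Stage 3: per-line orientation (2-class LCNet @ 160×80)
        if textline_ori.predict(crop) == 180:
            crop = rotate(crop, 180)
        # Stage 4: CTC recognition (height 48, variable width)
        text, conf = rec.predict(crop)
        results.append({"text": text, "confidence": conf, "bounding_box": box})
    return results
```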
Investigated in onnxruntime#27633 — resolved with ORT maintainer confirmation.
ORT 1.24.x uses KleidiAI SME2 kernels (SGEMM, IGEMM Conv, Dynamic QGemm) that leverage ARM's Scalable Matrix Extension. On Apple M4, SME is a shared coprocessor (2 devices total), not per-core like NEON. Combined with a large-kernel Conv regression in ORT 1.24.x, this creates a per-model trade-off:
| Threads | det (Conv-heavy) | rec (GEMM-heavy) | Pipeline Total |
|---|---|---|---|
| ORT 1.21.1, 1 (NEON) | 4,876 ms | 11,704 ms | 16,909 ms |
| ORT 1.21.1, 2 (NEON) | 2,571 ms | 6,523 ms | 9,346 ms |
| ORT 1.21.1, 8 (NEON) | 1,328 ms | 4,850 ms | 6,497 ms |
| ORT 1.24.3, 1 (SME2) | 5,440 ms | 2,647 ms | 8,295 ms |
| ORT 1.24.3, 2 (SME2) | 4,247 ms | 1,931 ms | 6,332 ms |
| ORT 1.24.3, 8 (SME2) | 4,312 ms | 2,579 ms | 7,096 ms |
At t=1: rec is 4.4x faster on ORT 1.24.3, giving a 2.04x pipeline speedup. At t=2: the pipeline is 1.48x faster. At t=8: ORT 1.21.1's det benefits from NEON's per-core parallelism, while ORT 1.24.3's rec advantage is diluted by contention on the shared SME unit, so the older build wins overall.
Recommended: threads=2 for ORT >= 1.24 on Apple M4.
```python
# Recommended for Apple M4 with ORT >= 1.24
pipeline = PPOCRv5Pipeline(model_dir, dict_path=dict_path, threads=2)
```

See docs/SME_THREAD_SCALING.md for the full analysis, experimental data, and background.
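If you build ONNX Runtime sessions yourself rather than going through PPOCRv5Pipeline, the natural equivalent of the threads knob is ORT's intra-op thread pool (a minimal sketch; the model path is a placeholder):

```python
import onnxruntime as ort

so = ort.SessionOptions()
so.intra_op_num_threads = 2  # cap the op-level thread pool at 2, as recommended above
session = ort.InferenceSession("models/rec.onnx", sess_options=so,
                               providers=["CPUExecutionProvider"])
```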
```bash
git clone https://github.com/AIwork4me/ppocrv5-kleidiAI-appleM4.git
cd ppocrv5-kleidiAI-appleM4
pip install "onnxruntime>=1.21.0" opencv-python-headless numpy pyclipper
```

Download the models from Baidu Pan (password: uepw) and place them under models/. See models/README.md for the expected layout.
```bash
python scripts/download_models.py  # verify the models are in place
```

```python
from ppocrv5_onnx import PPOCRv5Pipeline

pipeline = PPOCRv5Pipeline("models", dict_path="data/dict/ppocrv5_dict.txt", threads=2)
results = pipeline.predict("image.png")
for r in results:
    print(f"{r['text']} ({r['confidence']:.4f})")
```

ppocrv5_onnx.py is a single-file module (~720 lines) with minimal dependencies. Copy it directly into your project:
```python
from ppocrv5_onnx import PPOCRv5Pipeline

pipeline = PPOCRv5Pipeline(
    model_dir="path/to/onnx/models",
    dict_path="path/to/ppocrv5_dict.txt",
    threads=2,  # Recommended for Apple M4. See docs/SME_THREAD_SCALING.md
)
results = pipeline.predict(bgr_image_array)  # accepts a file path or a BGR ndarray
# [{"text": "...", "confidence": 0.98, "bounding_box": [[x,y], ...]}, ...]
```

Dependencies: onnxruntime, opencv-python-headless, numpy, pyclipper
```bash
# ORT 1.24.3, threads=2 (recommended, best pipeline throughput on Apple M4)
pip install onnxruntime==1.24.3
python benchmarks/benchmark_unified.py --backend ort --num-runs 3 --threads 2
# ORT 1.24.3, threads=1 (shows maximum KleidiAI SME2 advantage)
python benchmarks/benchmark_unified.py --backend ort --num-runs 3 --threads 1
# ORT 1.24.3, threads=8 (shows det regression at high thread count)
python benchmarks/benchmark_unified.py --backend ort --num-runs 3 --threads 8
# ORT 1.24.3, threads=8, KleidiAI disabled (NEON fallback)
python benchmarks/benchmark_unified.py --backend ort --num-runs 3 --threads 8 --disable-kleidiai
# ORT 1.21.1 (NEON baseline)
pip install onnxruntime==1.21.1
python benchmarks/benchmark_unified.py --backend ort --num-runs 3 --threads 1
python benchmarks/benchmark_unified.py --backend ort --num-runs 3 --threads 2
python benchmarks/benchmark_unified.py --backend ort --num-runs 3 --threads 8
# Paddle (for accuracy verification only — not for speed comparison)
pip install paddlepaddle==3.3.0
python benchmarks/benchmark_unified.py --backend paddle --num-runs 3
# Compare all results in results/
python benchmarks/compare_results.py
```

Results are saved to results/*.json and can be compared across platforms.
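For a quick cross-platform look without compare_results.py, something like the following works, assuming (hypothetically) that each results JSON stores a per-image total under a `total_ms` field; check the schema benchmark_unified.py actually writes:

```python
import json

def load_totals(path):
    # Hypothetical schema: {"images": {"name.png": {"total_ms": ...}, ...}}
    with open(path) as f:
        return {name: v["total_ms"] for name, v in json.load(f)["images"].items()}

baseline = load_totals("results/ort_1.21.1_t2.json")
kleidiai = load_totals("results/ort_1.24.3_t2.json")
for name in sorted(baseline):
    print(f"{name}: {baseline[name] / kleidiai[name]:.2f}x speedup")
```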
The ONNX pipeline produces 100% identical text output to PaddleOCR/PaddleX 3.4.x native inference, achieved through 6 rounds of systematic debugging:
| Round | Fix | Match Rate |
|---|---|---|
| 1 | CTC decode, normalize, box sorting, ... | 65.6% → 71.8% |
| 3 | det resize params (Pipeline runtime overrides inference.yml) | → 90.8% |
| 5 | crop coordinate precision (int16 → minAreaRect float32) | → 93.3% |
| 6 | rec batch padding (batch_size=6, ratio sort, per-batch pad) | → 100.0% |
See docs/ACCURACY_ALIGNMENT.md for the full story and key insights.
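To illustrate the round-6 fix, ratio-sorted batching with per-batch padding looks roughly like this (a hypothetical sketch of the idea, not the exact code in ppocrv5_onnx.py):

```python
import cv2
import numpy as np

def make_rec_batches(crops, batch_size=6, target_h=48):
    # Sort crops by aspect ratio so each batch pads to a similar width,
    # mirroring PaddleOCR's batching and avoiding padding-induced drift.
    order = sorted(range(len(crops)), key=lambda i: crops[i].shape[1] / crops[i].shape[0])
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        resized = []
        for i in idx:
            h, w = crops[i].shape[:2]
            resized.append(cv2.resize(crops[i], (max(1, round(w * target_h / h)), target_h)))
        max_w = max(r.shape[1] for r in resized)
        batch = np.zeros((len(resized), target_h, max_w, 3), np.float32)
        for j, r in enumerate(resized):
            batch[j, :, :r.shape[1], :] = r  # right-pad each crop to the batch width
        yield idx, batch
```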
```
ppocrv5-kleidiAI-appleM4/
├── ppocrv5_onnx.py                  # Core: single-file inference pipeline
├── benchmarks/
│   ├── benchmark_unified.py         # Unified benchmark (--backend paddle|ort)
│   └── compare_results.py           # Multi-backend comparison report
├── results/                         # ORT version comparison data (Apple M4)
│   ├── ort_1.21.1_t1.json
│   ├── ort_1.21.1_t2.json
│   ├── ort_1.21.1.json
│   ├── ort_1.24.3_t1.json
│   ├── ort_1.24.3_t2.json           # KleidiAI SME2, threads=2 (recommended)
│   ├── ort_1.24.3.json              # KleidiAI SME2, threads=8
│   ├── ort_1.24.3_no_kleidiai.json  # KleidiAI disabled, threads=8
│   └── paddle_3.3.0.json            # For accuracy verification only
├── data/
│   ├── dict/ppocrv5_dict.txt        # Character dictionary (18,383 chars)
│   └── images/                      # 7 test images
├── models/                          # ONNX models (download separately, ~180 MB)
├── docs/
│   ├── ACCURACY_ALIGNMENT.md        # 6-round alignment process
│   ├── BENCHMARK_RESULTS.md         # Full benchmark tables
│   ├── PIPELINE_ARCHITECTURE.md     # 4-model pipeline details
│   └── SME_THREAD_SCALING.md        # KleidiAI SME thread scaling analysis
├── scripts/download_models.py       # Model verification tool
└── examples/quickstart.py           # Minimal usage example
```
| Document | Description |
|---|---|
| Pipeline Architecture | 4-model pipeline, preprocessing parameters, batch strategy |
| Accuracy Alignment | 6-round debugging journey from 65.6% to 100% |
| Benchmark Results | Full speed/accuracy tables, per-model KleidiAI analysis |
| SME Thread Scaling | KleidiAI SME contention on Apple Silicon, thread tuning guide |
| Package | Version | Notes |
|---|---|---|
| Python | >= 3.10 | |
| onnxruntime | >= 1.21.0 | >= 1.24 for KleidiAI SME2 |
| opencv-python-headless | >= 4.8.0 | |
| numpy | >= 1.24.0 | |
| pyclipper | >= 1.3.0 | DB post-processing |
- PaddleOCR — PP-OCRv5 models and the original inference pipeline
- ONNX Runtime — Cross-platform inference engine
- KleidiAI — Arm CPU micro-kernel library for accelerated ML inference
If you use PP-OCRv5 models in your work, please cite the PaddleOCR 3.0 Technical Report:
```bibtex
@article{cui2025paddleocr,
  title={PaddleOCR 3.0 Technical Report},
  author={Cui, Cheng and Sun, Ting and Lin, Manhui and Gao, Tingquan and Zhang, Yubo and Liu, Jiaxuan and Wang, Xueqing and Zhang, Zelun and Zhou, Changda and Liu, Hongen and Zhang, Yue and Lv, Wenyu and Huang, Kui and Zhang, Yichao and Zhang, Jing and Zhang, Jun and Liu, Yi and Yu, Dianhai and Ma, Yanjun},
  journal={arXiv preprint arXiv:2507.05595},
  year={2025}
}
```

- Paper: https://arxiv.org/abs/2507.05595
- Source Code: https://github.com/PaddlePaddle/PaddleOCR
- Documentation: https://paddlepaddle.github.io/PaddleOCR
- Models & Online Demo: https://huggingface.co/PaddlePaddle
This project is licensed under the Apache License 2.0.