
recipe: llama-cpp-python 0.3.32 #91

Open
ndonkoHenri wants to merge 2 commits into main from llama-cpp-python

Closes flet-dev/flet#6627 — run local GGUF LLMs on-device with Flet.

llama-cpp-python is a scikit-build-core / CMake package that vendors the full llama.cpp engine; its Python layer is a pure-ctypes binding that loads the bundled libllama + libggml* shared libs. No separate flet-lib* recipe.

Recipe shape

  • CPU-only baseline: all GPU backends (GGML_METAL/CUDA/VULKAN/OPENCL/HIP/RPC), BLAS, Accelerate, OpenMP and LLAMAFILE off; GGML_NATIVE=OFF; LLAVA_BUILD=OFF (the multimodal mtmd surface is imported lazily, so text inference never needs it).
  • Android links libc++_shared (flet-libcpp-shared) and gets the 16 KB page-size flags; iOS uses the Unix Makefiles generator.
  • -DCMAKE_INSTALL_LIBDIR=llama_cpp/lib to merge llama.cpp's standard install into the package dir (drops a duplicate top-level lib/).
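A rough sketch of how the CPU-only baseline above could be expressed as CMake defines. The GGML_METAL/CUDA/VULKAN/OPENCL/HIP/RPC, GGML_NATIVE, LLAVA_BUILD and CMAKE_INSTALL_LIBDIR names come from this description; the GGML_ prefix on the BLAS, Accelerate, OpenMP and LLAMAFILE options is an assumption, and `cmake_args` is a hypothetical helper, not part of the recipe.

```python
# Sketch only: builds the list of -D flags for a CPU-only configuration.
# Option spellings marked "assumed" are not quoted verbatim in the PR text.
OFF_OPTIONS = [
    # GPU / remote backends named in the description
    "GGML_METAL", "GGML_CUDA", "GGML_VULKAN", "GGML_OPENCL", "GGML_HIP", "GGML_RPC",
    # assumed spellings (GGML_ prefix) for the remaining disabled features
    "GGML_BLAS", "GGML_ACCELERATE", "GGML_OPENMP", "GGML_LLAMAFILE",
    # no host-specific CPU tuning, no multimodal build
    "GGML_NATIVE",
    "LLAVA_BUILD",
]


def cmake_args(install_libdir="llama_cpp/lib"):
    """Return the -D arguments for the CPU-only baseline (hypothetical helper)."""
    args = [f"-D{name}=OFF" for name in OFF_OPTIONS]
    # Merge llama.cpp's standard install into the package directory.
    args.append(f"-DCMAKE_INSTALL_LIBDIR={install_libdir}")
    return args


if __name__ == "__main__":
    print(" ".join(cmake_args()))
```

Note that, per item 1 of mobile.patch, passing -DGGML_METAL=OFF alone is not enough on Apple targets, because the upstream CMakeLists sets it with CACHE … FORCE.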

mobile.patch (4 parts)

  1. Gate the CMakeLists Apple block to skip iOS — it FORCE-enables GGML_METAL (via CACHE … FORCE, so a -D can't override) and guesses the arch from uname -m; the iOS cross build is CPU-only with an explicit arch.
  2. Skip llama.cpp's unused common helper lib (~5 MB).
  3. Strip SONAME versioning from the shipped libs, so the wheel carries single unversioned files instead of a lib*.dylib → .0 → .0.15.3 symlink triplet (forge's packer dereferences those into 3 copies / colliding iOS frameworks). This cut the wheel from ~14 MB to ~1.7 MB.
  4. Rewrite the ctypes loader to (a) find the lib under its iOS framework name (lib<name>.fwork) when sys.platform == "ios", and (b) preload the ggml dependency chain with RTLD_GLOBAL. The bundled libs carry no RUNPATH, so the platform linker can't resolve siblings on its own.
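The loader change in item 4 can be sketched roughly as follows. This is not the patched loader itself: the library names in `GGML_CHAIN`, their order, and the helper names are assumptions for illustration, and the sketch does not resolve a `.fwork` entry to the framework binary it stands for, which a real iOS loader may need to do before calling dlopen.

```python
import ctypes
import os
import sys

# Assumed load order: each library comes after the ones it depends on.
GGML_CHAIN = ["ggml-base", "ggml-cpu", "ggml", "llama"]


def candidate_filenames(name, platform=sys.platform):
    """File names to try for a bundled library on the given platform."""
    if platform == "ios":
        # iOS framework name first, plain dylib as a fallback.
        return [f"lib{name}.fwork", f"lib{name}.dylib"]
    if platform == "darwin":
        return [f"lib{name}.dylib"]
    if platform == "win32":
        return [f"{name}.dll", f"lib{name}.dll"]
    return [f"lib{name}.so"]  # Linux, Android


def find_library(lib_dir, name, platform=sys.platform):
    """Return the first existing candidate path, or None."""
    for filename in candidate_filenames(name, platform):
        path = os.path.join(lib_dir, filename)
        if os.path.exists(path):
            return path
    return None


def load_chain(lib_dir, names=GGML_CHAIN, platform=sys.platform):
    """Load each library with RTLD_GLOBAL so later ones can resolve its symbols."""
    handles = {}
    for name in names:
        path = find_library(lib_dir, name, platform)
        if path is None:
            raise FileNotFoundError(f"lib{name} not found in {lib_dir}")
        # RTLD_GLOBAL exports the symbols process-wide, which stands in for
        # the missing RUNPATH when the next library in the chain is opened.
        handles[name] = ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
    return handles
```

Loading in dependency order matters here: without RTLD_GLOBAL preloading, opening libllama first would leave the linker searching for its ggml siblings with no path to find them.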

Testing

Full 6-slice matrix builds green (iOS device / arm64-sim / x86_64-sim, Android arm64-v8a / x86_64 / armeabi-v7a), ~1.7–2.0 MB per wheel.

On-device (recipe-tester): import + native ctypes calls + real GGUF inference (SmolLM2-135M Q4) all pass on Android arm64 (emulator) and the iOS arm64 simulator.

CI note

The Android job passes fully, including the on-device x86_64 emulator test.

The iOS-simulator on-device test currently fails, but not because of this recipe. It's a serious-python limitation: its darwin packaging only converts .so C-extensions into per-slice xcframeworks, so a ctypes-loaded .dylib is shipped as the device build inside the simulator app and fails to dlopen (incompatible platform (have 'iOS', need 'iOS-simulator')). The recipe's iOS wheels are correct (verified platform 7, the Mach-O value for iOS Simulator), and the recipe was shown to run on the simulator.
Fix: companion PR flet-dev/serious-python#223 (fix/darwin-ctypes-dylib-xcframework) — with it, the iOS-sim CI test passes with zero change to this recipe.
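
For reference, the platform value mentioned above lives in the LC_BUILD_VERSION load command of each Mach-O image (2 is iOS, 7 is iOS Simulator). The PR does not say how the check was done; tools such as `otool -l` report the same field. The following is a minimal sketch of reading it by hand, assuming a thin little-endian 64-bit image; `macho_platform` is a hypothetical helper and does not handle fat binaries or the older LC_VERSION_MIN_* commands.

```python
import struct

MH_MAGIC_64 = 0xFEEDFACF      # thin 64-bit Mach-O, little-endian
LC_BUILD_VERSION = 0x32
PLATFORM_NAMES = {1: "macOS", 2: "iOS", 7: "iOS-simulator"}


def macho_platform(data):
    """Return the LC_BUILD_VERSION platform number, or None if absent."""
    # mach_header_64: magic, cputype, cpusubtype, filetype, ncmds,
    # sizeofcmds, flags, reserved (8 x uint32 = 32 bytes)
    magic, _cpu, _sub, _ftype, ncmds, _size, _flags, _res = struct.unpack_from(
        "<8I", data, 0
    )
    if magic != MH_MAGIC_64:
        raise ValueError("not a thin little-endian 64-bit Mach-O image")
    offset = 32
    for _ in range(ncmds):
        cmd, cmdsize = struct.unpack_from("<2I", data, offset)
        if cmdsize < 8:
            raise ValueError("malformed load command")
        if cmd == LC_BUILD_VERSION:
            # build_version_command: cmd, cmdsize, platform, minos, sdk, ntools
            (platform,) = struct.unpack_from("<I", data, offset + 8)
            return platform
        offset += cmdsize
    return None
```

Usage would look like `PLATFORM_NAMES.get(macho_platform(open(path, "rb").read()))`, where `path` points at one of the shipped libs.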

Notes

  • GGUF weights are user-supplied at runtime (download-on-first-run); only ~1B–3B Q4 models are practical on a phone.
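
A download-on-first-run flow could look roughly like this. It is a sketch under stated assumptions, not something the recipe ships: `ensure_model`, the URL, and the destination path are hypothetical and up to the app.

```python
import os
import shutil
import tempfile
import urllib.request


def ensure_model(url, dest_path):
    """Download the GGUF file to dest_path once; reuse it on later runs."""
    if os.path.exists(dest_path):
        return dest_path
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    os.makedirs(dest_dir, exist_ok=True)
    # Write to a temporary file first so an interrupted download never
    # leaves a truncated model at the final path.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp, urllib.request.urlopen(url) as response:
            shutil.copyfileobj(response, tmp)
        os.replace(tmp_path, dest_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return dest_path
```

The returned path can then be handed to the binding, for example `llama_cpp.Llama(model_path=ensure_model(url, path))`, with the destination placed in the app's writable data directory.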

ndonkoHenri and others added 2 commits July 1, 2026 02:49
Runs local GGUF LLMs on-device (flet-dev/flet#6627). scikit-build-core /
CMake package that vendors the full llama.cpp engine; the Python layer is a
pure-ctypes binding that loads the bundled libllama/libggml* shared libs, so
this is the duckdb archetype crossed with a pyzbar-style loader.

CPU-only baseline: all GPU backends (Metal/CUDA/Vulkan/OpenCL/HIP/RPC), BLAS,
Accelerate, OpenMP and LLAMAFILE are disabled, GGML_NATIVE=OFF, LLAVA_BUILD=OFF
(the multimodal mtmd surface is imported lazily so text inference never needs
it). Android links libc++_shared (flet-libcpp-shared) and gets the 16 KB
page-size flags; iOS uses the Unix Makefiles generator.

mobile.patch (4 parts):
  1. Gate the CMakeLists Apple block to skip iOS — it FORCE-enables GGML_METAL
     (via CACHE...FORCE, which -DGGML_METAL=OFF can't override) and guesses the
     arch from `uname -m`; iOS is CPU-only with an explicit arch.
  2. Skip llama.cpp's unused `common` helper lib (~5 MB).
  3. Strip SONAME versioning from the shipped libs, so the wheel carries single
     unversioned files instead of a lib*.dylib -> .0 -> .0.15.3 symlink triplet
     (forge's packer dereferences those into 3 copies / colliding iOS
     frameworks). Cuts the wheel from ~14 MB to ~1.7 MB.
  4. Rewrite the ctypes loader to (a) find the lib under its iOS framework name
     (lib<name>.fwork) when sys.platform == "ios", and (b) preload the ggml
     dependency chain with RTLD_GLOBAL. The bundled libs carry no RUNPATH, so
     the platform linker can't resolve siblings on its own.

Also -DCMAKE_INSTALL_LIBDIR=llama_cpp/lib to merge llama.cpp's standard install
into the package dir (drops the duplicate top-level lib/).

Full 6-slice matrix builds green (iOS device/arm64-sim/x86_64-sim, Android
arm64-v8a/x86_64/armeabi-v7a). On-device validated end to end on Android arm64
and the iOS arm64 simulator: import + native calls + real GGUF inference
(SmolLM2-135M Q4) via the recipe-tester app.