New Features
- Pythonic eager execution – Rapid model development
- Unified hardware support – Arm CPU, OpenCL GPU, QNN NPU
- Advanced optimizations – Quantization, pruning, speculative execution
- NPU-ready IR – Seamless integration with NPU frameworks
- Deployment toolkit – SDK + CLI inference tool
- mllm JIT Kernel
News
[2026 Feb 03] 🔥🔥🔥 MLLM Qnn AOT Support for Full Graph Execution on NPU! Quick Start, Technical Report
[2025 Nov 27] Android Demo Update: Enabled stable Qwen3 and DeepSeek-OCR streaming on Android via a novel In-App Go Server Architecture.
[2025 Nov 23] MLLM v2 released!
What's Changed
- Develop qnn zh by @liang1232018 in #42
- Develop qnn zh by @liang1232018 in #43
- Develop qnn zh by @liang1232018 in #44
- fix: qnn rope file name by @liang1232018 in #46
- Develop qnn zh by @liang1232018 in #47
- Develop qnn zh by @liang1232018 in #48
- chore: qnn arm build config by @liang1232018 in #49
- Develop qnn zh by @liang1232018 in #50
- Develop qnn zh by @liang1232018 in #51
- Develop qnn zh by @liang1232018 in #52
- fix: qnn linear quantize tensor duplicate by @liang1232018 in #53
- Feat: Add new FrontEnd and model demos. by @yirongjie in #68
- feat: Add OPT Tokenizer. by @lx200916 in #66
- Feat: Optimize the operation process by @yirongjie in #69
- Fix: Tensor::mm(): reference not passed in as input by @yirongjie in #70
- Feat: Fill in input Tensor by @yirongjie in #72
- Single precision inference support for the gemma-2B model by @chenghuaWang in #75
- Update README.md by @yirongjie in #76
- Support for the QWen1.5-0.5B model by @chenghuaWang in #79
- feat: mistral v0.2 7B support by @chenghuaWang in #83
- Update requirements.txt by @lx200916 in #87
- doc: Update README.md by @xumengwei in #89
- feat: Add Multi-Head Latent Attention(MLA) support. by @yirongjie in #90
- feat: add sparse inference like powerinfer by @XieWeikai in #86
- feat: Yi-1.5-6B support by @chenghuaWang in #88
- feat: Inference speed(tokens/s) profiling by @yirongjie in #91
- feat: Add new demo: demo_imagebind_1mod by @yirongjie in #92
- feat: Stablelm 2 1.6b support by @emt0re0 in #94
- doc: Update README.md by @yirongjie in #95
- feat: add elastic llama by @yirongjie in #98
- feat:Add OPT support by @yirongjie in #99
- feat: add Qwen 1.8B demo by @yirongjie in #100
- perf: Use vector<shared_ptr<Tensor>> Tensor::graphs by @yirongjie in #101
- perf: add AArch64 GEMM/GEMV for q4_0. by @yirongjie in #104
- feat: add DEBUGSAVETENSOR & DEBUGOPTIME by @yirongjie in #106
- feat: topk/topp sampling by @chenghuaWang in #105
- fix: Qwen v1.5 Tokenizer bug by @chenghuaWang in #107
- feat: add clear_kvcache && fix: BUG in quantize. by @yirongjie in #108
- feat: GEMV + Bias mixed precision support for ARM Devices by @chenghuaWang in #109
- feat: llamafile_sgemm bias support by @chenghuaWang in #111
- chore: Disable OpenMP for Mac. by @lx200916 in #110
- feat: Preliminary implementation on Qualcomm NPU (QNN) backend. by @liang1232018 in #112
- doc: Update README.md by @xumengwei in #113
- refactor: Layer::run & Tensor::getStaticFunc by @yirongjie in #120
- feat: add Phi-3-mini model by @WhiteNight123 in #119
- refactor: Tensor::run & Layer::getFunc: Tensor& -> Tensor by @yirongjie in #121
- perf: CPU Function: +-*/ by @yirongjie in #122
- fix: +-*/ for old front end by @yirongjie in #129
- refactor: Tensor::run & Layer::getFunc by @yirongjie in #130
- fix: fix the Windows environment by @WhiteNight123 in #127
- feat: add MiniCPM 2B demo by @yirongjie in #132
- refactor: remove Layer class Split, replace it with Tensor::split by @yirongjie in #136
- fix: python bindings, clang-tidy, set line width to 100 by @chenghuaWang in #142
- fix: Memory Alignment Error by @chenghuaWang in #143
- fix: calculate bugs, cmakelist and clang-tidy by @yirongjie in #144
- fix: bug fix for windows compilation by @chenghuaWang in #145
- fix: windows compile bug by @chenghuaWang in #147
- feat: cross compile arm on windows(x86) by @chenghuaWang in #148
- fix: Memory Alloc bug in CPU Backend by @chenghuaWang in #149
- Fix: QNN Cmakelists Config by @oreomaker in #150
- Xnnpack backend support by @chenghuaWang in #152
- Fixed typos. by @hustc12 in #155
- fix: SmolLM name by @chenghuaWang in #157
- feat: Support QWen2.5-1.5B, OpenELM-1.1B, DCLM-1B by @yirongjie in #160
- feat: add_profilling_activation by @chunfenri in #154
- fix: CMakeLists.txt in example by @yirongjie in #161
- refactor: add TransformerConfig by @yirongjie in #162
- fix: mv Tensor::graph to Module.activation_tensors by @yirongjie in #164
- feat: add PhoneLM by @yirongjie in #165
- QNN Module API(new frontend) Preliminary Support by @oreomaker in #158
- fix: rope_theta is set wrong by @yirongjie in #169
- feat:QNN New Frontend End to End Inference by @oreomaker in #170
- feat: Add modeling bert support by @XieWeikai in #166
- fix: commen used in uni by @yirongjie in #171
- fix: BerTokenizer::tokenizes by @yirongjie in #172
- Add Bert for JNI. by @lx200916 in #173
- Xnnpack backend support by @chenghuaWang in #159
- feat: Boost xnnpack backend inference speed by freeze tensor weight. by @chenghuaWang in #174
- fix: CPUTensorFunction.hpp by @UbiquitousLearning in #176
- feat: drop xnn wrapper and move xnnwrapper to new front-end by @chenghuaWang in #177
- feat: QNN New Frontend Phonelm Support and Refactors by @oreomaker in #179
- fix: smollm tokenizer regex pattern by @chenghuaWang in #180
- refactor: change tokenize method parameter from std::string& to const by @lx200916 in #181
- fix: NPU affect CPU by @yirongjie in #182
- fix: remove unused "fmt" files by @yirongjie in #185
- feat: PhoneLM Instruct Android Demo. by @lx200916 in #188
- Support PhoneLM decoding configuration by @liang1232018 in #190
- feat: QNN Multi Chunk Execution in New Frontend by @oreomaker in #191
- scripts: update build scripts by @yirongjie in #192
- feat: support PhoneLM-1.5B-Call demo in Android Demo by @yirongjie in #193
- fix: Android LFS. by @lx200916 in #194
- feat: add phi3v model by @k0zhevnikov in #186
- feat: add Qwen1.5 1.8B Chat Android Demo. by @yirongjie in #195
- perf: i8 * i8 -> fp32 GEMM Boost by @chenghuaWang in #187
- files: Move src/backend/cpu/CPUXXX to src/backend/cpu/op/CPUXXX by @yirongjie in #197
- fix: Fix the IROPE bug caused by #197 by @yirongjie in #198
- feat: Android App Add Profile. by @lx200916 in #199
- Update README.md by @xumengwei in #200
- feat: add dealloc for activation_tensors for CPU backend only by @yirongjie in #201
- fix: BUG in Tensor::checkDim by @yirongjie in #202
- fix: free to Tensor in Matmul files by @yirongjie in #210
- feat: Add MiniCPM MoE 8x2B. by @liang1232018 in #217
- doc: Update README.md by @yirongjie in #218
- feat: QNN New Frontend Pipeline by @oreomaker in #219
- feat: Support Gemma 2 by @oreomaker in #223
- feat: Add Llama3.2 by @XieWeikai in #221
- fix: in libHelper qnn setSeqLength->setCurSequenceLength by @oreomaker in #227
- Fix QNN Op and LibHelper bugs by @oreomaker in #229
- Update README.md by @xumengwei in #231
- Chore: remove redundant QNN SDK include files in project, improve qnn building by @oreomaker in #233
- feat: MiniCPM 3 4B by @yirongjie in #235
- feat: Add Qwen2-VL-2B by @yirongjie in #240
- fix: qwen25 setup issues in lib helper by @chenghuaWang in #237
- feat: deepseek distilled qwen2 1.5B by @chenghuaWang in #242
- refactor: QNN Refactor by @oreomaker in #250
- fix: qnn profile quant bugs by @oreomaker in #256
- doc: Update README.md by @xumengwei in #257
- feat: set Eager Execution for CPU Backend; by @yirongjie in #287
- feat: add qwen3 by @HanoFleet in #280
- feat: Add retrieval-based speculative decoding support to Qwen 1.5 for CPU backend by @csAugust in #254
- feat: Power Counter by @chenghuaWang in #293
- refactor: A journey of a thousand miles begins with a single step. by @chenghuaWang in #302
- feat(compile): add symbolic expression parser and evaluator by @chenghuaWang in #303
- examples(algorithms): add fancy_algorithm case by @chenghuaWang in #304
- feat(utils): implement argparse and logging utilities by @chenghuaWang in #305
- feat(core): Device Types, Data Types, Tensor Memory Types. by @chenghuaWang in #306
- feat(utils): improve Dbg macro and add Windows support by @chenghuaWang in #307
- feat(core): Tensor by @chenghuaWang in #308
- feat(core, x86, compile): X86 Backend Allocator, Compiler IR README. by @chenghuaWang in #310
- chore: add & rewrite yaml task schema by @oreomaker in #311
- feat(core): ParameterFile & Linear Op and Layers & Module by @chenghuaWang in #312
- test(core): Test of ParamFile<CPU, V1> by @chenghuaWang in #313
- fix(core): ModelFileV1 bugs by @chenghuaWang in #314
- feat(tools): add mllm-params-inspector tool by @chenghuaWang in #315
- feat: init QNN runtime and utils by @oreomaker in #317
- build(deps): add stdexec dependency for MLLM project by @chenghuaWang in #318
- feat(x86): add X86 backend support for Linear operation by @chenghuaWang in #320
- feat(core): implement basic task dispatching and memory management by @chenghuaWang in #321
- refactor(backends): rename x86 backend to cpu backend by @chenghuaWang in #322
- fix(cpu): adjust memory alignment for different CPU architectures by @chenghuaWang in #323
- refactor(cpu): remove x86-specific memory operations and simplify all… by @oreomaker in #324
- feat(core): implement FillOp and optimize CPU allocation by @chenghuaWang in #325
- feat(compile): implement core IR node classes and utilities by @chenghuaWang in #326
- feat(compile): add graph IR and update related components by @chenghuaWang in #327
- feat(compile): add pass manager and pass infrastructure by @chenghuaWang in #328
- fix(IR): tensor.register build error. by @chenghuaWang in #329
- feat(compile): add IR trace functionality by @chenghuaWang in #330
- feat(x86 backend): implement arange and random fill operations for x8… by @chenghuaWang in #331
- feat(preprocessor): add Image class for visual data preprocessing by @chenghuaWang in #332
- feat(preprocessor): add tokenizers for multilingual support by @chenghuaWang in #333
- feat(compile): implement program IR op and fragment op by @chenghuaWang in #334
- feat(compile): implement program lowering pipeline by @chenghuaWang in #335
- feat(tool) per-tools and per-viewer. by @chenghuaWang in #337
- feat(cpu arm backend): add apple silicon support by @chenghuaWang in #338
- feat(cpu): add ARM backend fill kernel support by @chenghuaWang in #339
- docs(README): add features, usage examples, and installation instruct… by @chenghuaWang in #340
- feat(cpu): add support for element-wise operations by @chenghuaWang in #342
- refactor(arm): remove unused code and add CPU architecture checks by @chenghuaWang in #343
- feat(cpu arm backend): add kleidiai submodule for CPU backend by @chenghuaWang in #344
- feat(cpu): add support for transpose and permute operations by @chenghuaWang in #345
- feat(core): add new ops and remove D2H/H2D ops by @chenghuaWang in #346
- feat(core): implement tensor operations and add functional API by @chenghuaWang in #347
- feat(cpu): add layer normalization and optimize bit-packing by @chenghuaWang in #348
- feat(core): all close, tests kernels by @chenghuaWang in #349
- feat(arm cpu): move ggml things from v1 to v2 by @chenghuaWang in #350
- feat(core, lmcache): Dynamic Cache. by @chenghuaWang in #351
- feat(engine): add support for async execution of Modules by @chenghuaWang in #352
- feat(async): add support for multiple concurrent tasks and improve lo… by @chenghuaWang in #353
- feat(backends): add OpenCL backend support by @chenghuaWang in #354
- feat(auto_tune): implement auto tuning functionality for CPU operations by @chenghuaWang in #356
- feat(cpu): add flash attention 2 implementation for CPU by @chenghuaWang in #357
- feat(quantizer): add quantization tool for MLLM parameters by @chenghuaWang in #358
- feat(pymllm): implement Python bindings for MLLM core and engine by @chenghuaWang in #359
- refactor(pymllm): update API and add new classes by @chenghuaWang in #360
- feat(cpu): implement FlashAttention2 kernel for CPU by @chenghuaWang in #361
- feat(cpu): add flash attention2 operator support by @chenghuaWang in #362
- refactor(cpu): optimize FlashAttention2 implementation and update doc… by @chenghuaWang in #363
- feat(cpu): add BLAS support and ARM optimization by @chenghuaWang in #364
- feat(arm): add HPC SGEMV kernel support for MLLM by @chenghuaWang in #365
- feat(scripts): add CUDA core dump setup script by @chenghuaWang in #366
- refactor(cpu): replace hpc_sgemm with mllm_blas_sgemm by @oreomaker in #367
- feat(core): add support for SliceOp and implement StaticCache by @chenghuaWang in #368
- feat(vision): add VisionRoPE (Rotary Position Embedding) support by @chenghuaWang in #369
- docs: update Arm kernel support and add MLLM BLAS operations by @chenghuaWang in #370
- feat(nn): implement Conv3D layer and improve Module by @chenghuaWang in #371
- feat(backend/cpu): add new operations and update documentation by @chenghuaWang in #372
- feat(qwen2vl): implement KV cache and reshape op by @chenghuaWang in #373
- feat(models): add Qwen2VL tokenizer support by @chenghuaWang in #374
- feat(core): add support for BLAS and optimize linear operations by @chenghuaWang in #375
- feat(qwen2vl): add text generation and optimize conv3d operation by @chenghuaWang in #376
- feat(cpu/arm): Add LlamaFile SGEMM kernel and corresponding unit tests by @oreomaker in #377
- fix(cpu/arm): some llama file bugs. by @chenghuaWang in #378
- feat(arm): add KLEIDIAI support for ARM backend by @chenghuaWang in #379
- feat(qwen2_vl): add model support and quantization for Qwen-2VL by @chenghuaWang in #380
- refactor(qwen2vl): integrate VisionRoPE operations into model layers by @chenghuaWang in #381
- feat(perf): integrate Perfetto for performance tracing by @chenghuaWang in #382
- cpu stft op basic version by @oreomaker in #383
- fix(NDK): compile error. Add mllm_blas_sgemm by @chenghuaWang in #384
- test(cpu): add GELU kernel test and update related files by @chenghuaWang in #385
- fix(cpu): fix GELUOp dtype support and memory management by @chenghuaWang in #386
- feat(unsafe macros, qa): add Unsafe Macros and add FAQ section on pre… by @chenghuaWang in #387
- feat(audio): implement Bluestein's algorithm for non-power-of-2 FFT (… by @oreomaker in #389
- perf(qwen2_vl): update quantization config and improve ARGeneration by @chenghuaWang in #390
- feat(core): add support for complex indexing in Tensor by @chenghuaWang in #391
- V2 by @oreomaker in #392
- feat(core): add signal handling and context management by @chenghuaWang in #393
- feat(sdk): enable C SDK binding and refactor CLI build process by @chenghuaWang in #394
- feat(qwen2_5vl): add model support and update related files by @chenghuaWang in #396
- V2 by @oreomaker in #397
- fix(mllm): update OpenMP settings and optimize CPU operations by @chenghuaWang in #398
- test(cpu): add transpose and permute operation tests by @chenghuaWang in #399
- V2 by @oreomaker in #400
- feat(bugs, log, reduce): reduce ops bugs by @chenghuaWang in #401
- feat(algorithms): add lazy_vlm algorithm and update related files by @chenghuaWang in #402
- feat(examples): add tracers for Qwen-2 models by @chenghuaWang in #403
- refactor(compiler): remove debug information and refactor source by @chenghuaWang in #404
- docs(conf): make Doxygen optional for documentation build by @chenghuaWang in #406
- ci(docs): add GitHub Actions workflow for docs deployment by @chenghuaWang in #407
- feat(qwen2vl_tracer): add canonicalization pass to model by @chenghuaWang in #409
- docs(README): update documentation and add new features by @chenghuaWang in #410
- test(docs): add static html talks slides. by @chenghuaWang in #411
- V2 by @oreomaker in #412
- ci(build): add GitHub Actions workflow for macOS Apple Silicon build by @chenghuaWang in #414
- feat(compile): implement IR serialization and refactor IR persistence by @chenghuaWang in #415
- feat(compile): Add KernelSymbolOp and ValueSymbolOp to the IR by @chenghuaWang in #416
- feat(compile): add IR serialization and interpretation functionality by @chenghuaWang in #417
- feat(compile, ir interpreter): add compilation, ir interpreter and lowering of LLM model by @chenghuaWang in #419
- feat(lazy_vlm, nn::sequential): add lazy VLM support for Qwen-2.5VL. Add 🍬🍬🍬 nn::sequential by @chenghuaWang in #420
- feat(python): implement C++ backend for layers and add ParameterFile support in python🐍 by @chenghuaWang in #421
- feat(pymllm): add support for accessing Linear layer weights and biases by @chenghuaWang in #422
- feat(lazy_vlm): implement pruning for Qwen-2.5VL decoder layers by @chenghuaWang in #423
- feat(lazy_vlm): implement normal prefill and decode logic by @chenghuaWang in #424
- docs(contribute): update roadmap with new contributions and initiatives by @chenghuaWang in #425
- feat(Vocos): add decode method and modify constructor by @oreomaker in #426
- refactor(lazy_vlm): optimize KV cache management and update process by @chenghuaWang in #427
- perf(lazy_vlm): optimize model performance and debugging by @chenghuaWang in #428
- docs(roadmap): update roadmap with new features and improvements by @chenghuaWang in #429
- feat(models): add ARGeneration chat functionality by @chenghuaWang in #430
- feat(arm): add NT-T matmul kernel and optimize lazy VLM performance by @chenghuaWang in #431
- V2 by @oreomaker in #432
- refactor(nn): improve module naming and registration by @chenghuaWang in #433
- feat(models): add LLaMA model support by @oreomaker in #434
- feat(ffi): Roadmap FFI by @chenghuaWang in #435
- feat(cpu): add mllm_blas_linear kernel for ARM CPUs by @chenghuaWang in #436
- feat(cpu): u1-u7 bitspack packing by @chenghuaWang in #437
- refactor(mllm_blas): implement I8GEMM using bitspack by @chenghuaWang in #438
- V2 by @oreomaker in #439
- feat(core): add PagedAttnOp implementation by @chenghuaWang in #440
- fix(lazy_vlm): update HKVCacheFast and Qwen2.5VL model implementations by @chenghuaWang in #441
- feat(build): Move FFI and auto tune to experiments. Add mllm-arm-pmu tool by @chenghuaWang in #442
- build(mllm): conditionally compile FFI extension by @chenghuaWang in #443
- feat(ffi): start building FFI extension for MLLM by @chenghuaWang in #444
- build(pymllm): update CMakeLists and packaging configuration by @chenghuaWang in #446
- feat(models): add basic ChatTTS support for MiniCPM-O2-6 model by @oreomaker in #445
- fix(bugs): In PR445 by @chenghuaWang in #447
- feat(mllm/ffi/Extension.cc, pymllm): Add tensor operations and bindings by @chenghuaWang in #451
- Some sugar(not too sweet) by @oreomaker in #450
- feat(qwen3): add Qwen3 model support and example by @chenghuaWang in #453
- feat(xxHash): integrate xxHash library for hashing functionality by @chenghuaWang in #457
- fix(mllm-cli): correct return value check in isOk function by @chenghuaWang in #458
- feat(lmcache, paged_attn): implement ZenFS with mmap-based blob storage and recovery. Coworks with special paged_attn kernel by @chenghuaWang in #459
- feat(service): implement interactive chat loop in qwen3 service by @chenghuaWang in #462
- QNN Graph Building Core Features by @oreomaker in #463
- fix(cpu): add missing task type kExecuteModule in CPUDispatcher by @chenghuaWang in #464
- feat(service): implement OpenAI-compatible chat completion API by @chenghuaWang in #466
- ci(macOS): update nightly workflow to use python-build and improve ve… by @chenghuaWang in #467
- ci(workflows): simplify workflow trigger syntax by @chenghuaWang in #468
- feat(workflow): trigger macOS nightly build on PR and push to v2 branch by @chenghuaWang in #469
- ci(pymllm-macos-nightly): restrict publish job to merged pull requests by @chenghuaWang in #470
- ci(pymllm-macos-nightly): simplify workflow trigger conditions by @chenghuaWang in #471
- build(pymllm): update dependencies and version for nightly build by @chenghuaWang in #472
- refactor(convertor): lazy import torch and numpy based on availability by @chenghuaWang in #473
- build(workflows): compute and bump to next beta version in CI by @chenghuaWang in #474
- build(workflows): simplify nightly version reading in macOS workflow by @chenghuaWang in #475
- build(workflows): update macOS nightly workflow sed command by @chenghuaWang in #476
- fix(cpu): remove hardcoded paths and unused arm_neon include by @chenghuaWang in #477
- feat(cuda): add CUDA backend initialization and device info test by @chenghuaWang in #478
- feat(plugin): implement plugin system for dynamic op loading by @chenghuaWang in #479
- feat(scripts): add MLIR installation script by @chenghuaWang in #480
- feat(cli): add mllm-llm-benchmark tool for performance testing by @chenghuaWang in #481
- feat(qwen3): add config and quantization files for 0.6B model by @chenghuaWang in #482
- feat(cpu): add inplace rmsnorm implementations for fp32 and fp16 by @chenghuaWang in #483
- feat(cpu-kernels): add SIMD-based vector operations for flash attention by @chenghuaWang in #484
- feat(qnn): Basic QNN Prefill on v2 by @oreomaker in #485
- feat: Add benchmark for Qwen3 and update readme about benchmark by @jialilve in #487
- doc(Qnn): Qnn Documents by @oreomaker in #488
- feat(deepseek-ocr): deepseek-ocr support(On working) by @chenghuaWang in #486
- feat(deepseek-ocr): update inference image and add perf by @chenghuaWang in #494
- feat:Add support of MiniCPM-o-2_6 model by @KKkai0315 in #495
- refactor(minicpm_o): reformat code and remove unused includes by @oreomaker in #496
- feat(engine): add Tracy profiler support and CPU memory-disk async I/O by @chenghuaWang in #497
- feat(tracy): integrate Tracy performance profiling support by @chenghuaWang in #498
- refactor(minicpm_o2_6): optimize tensor operations and memory access by @oreomaker in #499
- fix ggml op by @oreomaker in #502
- feat(cpu, smollm3-tokenizer): add KAI SGEMM NEON implementation for ARM by @chenghuaWang in #503
- feat: add GGUF quantization support by @KKkai0315 in #501
- feat: Add mllm-cli support for Qwen3 and update docs by @yuerqiqi in #493
- feat(mllm-chat): add mllm-chat submodule by @chenghuaWang in #504
- fix minicpm image preprocessing by @oreomaker in #505
- Add SmolLM3-3B model support by @nuozhihan in #509
- fix: QNN Execute Return Order - handle output reordering by @jialilve in #510
- feat(compile-stack, cpu): MLIR, flash attention's swa support. by @chenghuaWang in #512
- fix(cpu ops extension): flash attention swa with sink bugs for GPT-OSS model. by @chenghuaWang in #514
- feat(cpu): implement RadixAttnSwaSink with sliding window attention support by @chenghuaWang in #516
- feat(examples): add paged attention hybrid example with sliding window support by @chenghuaWang in #517
- feat(cpu): add partial_dim support in RoPE operation by @chenghuaWang in #518
- feat(build): install mllm-ext-opset headers and libraries by @chenghuaWang in #519
- feat(mllm-ext-opset): add radix attention relax implementation for flexible tensor dims by @chenghuaWang in #520
- fix(radix-attn): correct shape indexing for K and V tensors in forward method by @chenghuaWang in #521
- fix: resolve type mismatch in Smollm3Attention KV cache update by @Shimmer22 in #523
- feat(radix-attn): implement pattern-based forwarding for Radix Attent… by @chenghuaWang in #524
- Minicpm o2.6 Basic Support by @oreomaker in #530
- feat(thread-pool): implement HpcThreadPool for efficient CPU task management and update build configurations by @chenghuaWang in #531
- refactor(hpc-thread-pool): remove unused NUMA affinity functions and related includes by @chenghuaWang in #532
- feat(cpu-backend): add support for SME2 and SVE2 in ARM backend configurations by @chenghuaWang in #533
- refactor: update ARM backend compile options and disable SME2 support for OSX by default by @chenghuaWang in #536
- feat: Implement Qwen NPU Decoding Support with Memory Management Fixes by @jialilve in #537
- QNN Op Package Migrate to v2 by @oreomaker in #539
- feat: add DeepSeek-OCR support, C++ API updates, and dual-model loadi… by @yuerqiqi in #534
- test: fix CausalMaskOp CPU coverage by @jialilve in #538
- update docs by @oreomaker in #541
- feat(build): update threading options for Apple GCD support in build configurations by @chenghuaWang in #540
- fix(docs): update links for Qwen2 and Qwen2.5 models in README by @chenghuaWang in #542
- feat(docs): add mllm-params-inspector tool usage instructions to README by @chenghuaWang in #543
- docs(readme): add OrangePi AI Pro and Studio build status by @chenghuaWang in #544
- feat(docs): enhance README with MLLM's role and workflow diagrams by @chenghuaWang in #547
- fix(assets): update mllm_role image to reflect recent changes by @chenghuaWang in #548
- feat(MiniCPM4, MiniCPM-o, AvgPool1dOp): Add support for MiniCPM4 model, MiniCPM-o's audio modality inference capability, and AvgPool1dOp by @KKkai0315 in #526
- fix(minicpmo): fix minicpmo tokenization logic & streaming generation by @KKkai0315 in #549
- feat(cpu): add support for new attention ops and improve parallel scheduling by @chenghuaWang in #553
- Enhance README with Android Demo & Architecture details by @yuerqiqi in #554
- Merge pull request #526 from KKkai0315/v2 minicpm v4 & minicpm o audi… by @oreomaker in #551
- feat: qwen2 cpu model and connection with npu prefill by @Sp0tless in #555
- chore: update submodule to main by @yuerqiqi in #557
- OpenCL Backend Init by @oreomaker in #558
- feat(qnn): add Qualcomm QNN AOT support on x86 platforms by @chenghuaWang in #562
- feat(qwen3, cpu): add support for Qwen3 model on x86 architecture by @HayzelHan in #561
- feat(qnn): add QcomTargetMachine and related enums for AOT environment by @chenghuaWang in #563
- feat(Qnn AOT): AOT and AOT Runtime. Qwen3 AOT Mode. by @chenghuaWang in #567
- feat(Qnn AOT): Refactor code structure for improved readability and maintainability by @chenghuaWang in #568
- feat(ascend): initial Ascend backend and add elementwise add op by @lywbarca in #564
- feat(Qnn AOT): Add MarkTensorIO pass and related changes for QNN AOT pipeline by @chenghuaWang in #569
- feat(Qnn AOT): Implement LLMQuantRecipePass and associated patterns for quantization by @chenghuaWang in #572
- feat: add LLM2QnnLoweringPass and update graph splitting logic by @chenghuaWang in #577
- fix: Qualcomm QNN AOT Pass by @chenghuaWang in #579
- fix(qualcomm): Qnn AOT refactor by @oreomaker in #578
- feat(qualcomm): Qnn AOT Lowering pass by @oreomaker in #580
- fix(qualcomm): use unsigned int for qualcomm model quantization by @chenghuaWang in #581
- fix(qualcomm): QNNParamScalarWrapper type error. by @chenghuaWang in #582
- fix: reshape tensor weight to 2d by @chenghuaWang in #583
- feat(qualcomm): Qnn AOT Lowering Passes by @oreomaker in #584
- feat(qualcomm): AOTPipeline update by @chenghuaWang in #585
- feat: add readme_zh by @PiaoAdmin in #588
- fix(compile): using ssa ViewOp and SliceOp in MLLM-IR by @chenghuaWang in #589
- fix: update README-ZH by @PiaoAdmin in #592
- feat(qualcomm): PTQPass add constant ptq impl. by @chenghuaWang in #593
- feat: PTQPass will modify the tensor ir constant! by @chenghuaWang in #594
- fix: LPBQ return shape follows QNN spec by @chenghuaWang in #595
- feat(qualcomm): Qnn AOT Lowering passes by @oreomaker in #596
- fix(Qualcomm): Replace linear op with conv2d in Qualcomm backend by @chenghuaWang in #600
- fix(qualcomm): LM Head Merge pass by @chenghuaWang in #601
- feat: update mllm-cli and android build tasks by @yuerqiqi in #605
- fix(pymllm): improvements to qnn_aot_env.py for x86 by @Lucyliu1234 in #604
- feat(qualcomm): Qnn aot runner by @oreomaker in #603
- feat(qualcomm): Qnn AOT Runtime by @oreomaker in #606
- fix(qualcomm): Enhance quantization modules. by @chenghuaWang in #607
- feat(qnn): Enhance QNNBackend initialization with improved logging and error handling; update default log level to verbose. Add QEmbedding class for quantized embedding operations in PyTorch. Introduce build tasks for Android and x86 QNN AOT SDKs. by @chenghuaWang in #609
- feat(qwen3): Add configuration files and enhance Qwen3 model with layer indexing and quantization improvements. by @chenghuaWang in #611
- feat(qnn): Update quantization handling in LLM passes and improve logging in QNN runtime. by @chenghuaWang in #613
- refactor(qwen3): Simplify input handling in Qwen3 model by @chenghuaWang in #614
- feat(qwen3): Introduce Single Head Attention (SHA) optimization for Qualcomm qwen model by @chenghuaWang in #616
- feat(core): Introduce kBool data type for Qnn ElewiseEqual Op by @chenghuaWang in #618
- feat(Ascend): Add some new Ascend Ops by @lywbarca in #621
- milestone(qnn): Qnn AOT by @chenghuaWang in #624
- doc(qualcomm): Qnn aot by @oreomaker in #625
- docs(qnn_backend): update AOT execution flow documentation by @oreomaker in #628
- docs(qnn_backend): enhance AOT execution documentation with installation by @chenghuaWang in #630
- docs(Qnn): correct symbolic link path in AOT execution documentation by @chenghuaWang in #631
- docs: update latest news section in README files for qnn aot by @chenghuaWang in #633
- [Ascend] Implement Concat and Slice operators by @yuerqiqi in #629
- feat(mllm_kernel): add initial implementation of mllm-kernel with CPU and JIT utilities by @chenghuaWang in #634
New Contributors
- @liang1232018 made their first contribution in #42
- @xumengwei made their first contribution in #89
- @XieWeikai made their first contribution in #86
- @emt0re0 made their first contribution in #94
- @WhiteNight123 made their first contribution in #119
- @hustc12 made their first contribution in #155
- @chunfenri made their first contribution in #154
- @k0zhevnikov made their first contribution in #186
- @HanoFleet made their first contribution in #280
- @csAugust made their first contribution in #254
- @jialilve made their first contribution in #487
- @KKkai0315 made their first contribution in #495
- @yuerqiqi made their first contribution in #493
- @nuozhihan made their first contribution in #509
- @Shimmer22 made their first contribution in #523
- @Sp0tless made their first contribution in #555
- @HayzelHan made their first contribution in #561
- @lywbarca made their first contribution in #564
- @PiaoAdmin made their first contribution in #588
- @Lucyliu1234 made their first contribution in #604
Full Changelog: 1.0.0...2.0.0