Description
After a thorough investigation, we found that for agentic LLMs (we deploy our models on a Mac Studio M3 Ultra), expanding the context with tool calling is an essential part of the workflow.
Hence we need:
- High-performance radix tree caching Metal kernels (see the SGLang radix tree kernels, written in CUDA, for an example), which are essential for agentic LLM deployment; a minimal CPU-side sketch of the underlying data structure follows this list
- High-performance vector search Metal kernels; our vector search service currently runs temporarily on a local NVIDIA Hopper DGX GPU with CUDA kernels, but that hardware may not be available in a client's environment
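As referenced above, here is a minimal CPU-side sketch, in the spirit of SGLang's RadixAttention prefix cache, of the radix tree that such Metal kernels would accelerate: insert token sequences together with a handle to their cached KV blocks, then match the longest cached prefix of a new request. Class and field names (`RadixPrefixCache`, `kv_handle`) are illustrative assumptions, not a proposed API.

```python
# Radix tree keyed on token IDs; each node may carry an opaque handle to the
# KV-cache blocks for the prefix ending at that node (illustrative sketch only).
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class RadixNode:
    tokens: Tuple[int, ...] = ()                     # edge label into this node
    children: Dict[int, "RadixNode"] = field(default_factory=dict)
    kv_handle: Optional[object] = None               # cached KV blocks for this prefix


class RadixPrefixCache:
    def __init__(self) -> None:
        self.root = RadixNode()

    def insert(self, tokens: List[int], kv_handle: object) -> None:
        """Insert a token sequence and attach its KV-cache handle."""
        node, i = self.root, 0
        while i < len(tokens):
            child = node.children.get(tokens[i])
            if child is None:
                node.children[tokens[i]] = RadixNode(tuple(tokens[i:]), kv_handle=kv_handle)
                return
            label = child.tokens
            k = 0
            while k < len(label) and i + k < len(tokens) and label[k] == tokens[i + k]:
                k += 1
            if k < len(label):
                # Split the edge where the new sequence diverges from the stored one.
                mid = RadixNode(label[:k], children={label[k]: child})
                child.tokens = label[k:]
                node.children[tokens[i]] = mid
                child = mid
            i += k
            node = child
        node.kv_handle = kv_handle

    def longest_prefix(self, tokens: List[int]) -> Tuple[int, Optional[object]]:
        """Return (cached prefix length, KV handle of the deepest cached prefix)."""
        node, i, best = self.root, 0, (0, None)
        while i < len(tokens):
            child = node.children.get(tokens[i])
            if child is None:
                break
            label = child.tokens
            k = 0
            while k < len(label) and i + k < len(tokens) and label[k] == tokens[i + k]:
                k += 1
            if k < len(label):
                break
            i += k
            node = child
            if node.kv_handle is not None:
                best = (i, node.kv_handle)
        return best


if __name__ == "__main__":
    cache = RadixPrefixCache()
    cache.insert([1, 2, 3, 4], kv_handle="kv:system+tools")
    cache.insert([1, 2, 3, 4, 9, 9], kv_handle="kv:turn-1")
    print(cache.longest_prefix([1, 2, 3, 4, 9, 7]))  # (4, 'kv:system+tools')
```

The GPU-side work (batched prefix matching, KV block copy and reuse) is what would live in the Metal kernels; the tree bookkeeping above can stay on the CPU, as it does in SGLang.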
Proposal
We could add such kernels as plugins in mlx-lm, as in the example below: #609
Adding them directly to MLX is also possible, but that requires more effort to dive into the MLX stack, since writing an efficient custom Metal kernel is not as trivial as writing a plugin in mlx-lm.
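For illustration only, here is a minimal sketch of what a brute-force similarity-scoring kernel could look like using MLX's `mx.fast.metal_kernel` interface for custom Metal kernels. The kernel name `dot_scores`, the shapes, and the launch configuration are assumptions for this sketch, not a proposed final design; an efficient kernel would additionally need tiling, threadgroup memory, and a fused top-k.

```python
import mlx.core as mx

# Each thread computes the dot product between the query and one database row.
# D is passed as a compile-time template parameter (illustrative sketch only).
_SOURCE = """
    uint row = thread_position_in_grid.x;
    float acc = 0.0f;
    for (int j = 0; j < D; ++j) {
        acc += float(db[row * D + j]) * float(q[j]);
    }
    scores[row] = acc;
"""

_kernel = mx.fast.metal_kernel(
    name="dot_scores",
    input_names=["db", "q"],
    output_names=["scores"],
    source=_SOURCE,
)


def dot_scores(db: mx.array, q: mx.array) -> mx.array:
    """db: (N, D) row-contiguous embeddings, q: (D,) query embedding."""
    n, d = db.shape
    (scores,) = _kernel(
        inputs=[db, q],
        template=[("D", d)],
        grid=(n, 1, 1),
        threadgroup=(min(n, 256), 1, 1),
        output_shapes=[(n,)],
        output_dtypes=[mx.float32],
    )
    return scores


db = mx.random.normal((1024, 128)).astype(mx.float32)
q = mx.random.normal((128,)).astype(mx.float32)
print(dot_scores(db, q).shape)  # (1024,)
```

Wrapping this (or a hand-tuned equivalent) behind an mlx-lm plugin entry point, as in #609, would keep the MLX core untouched.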
RAG has already been deprecated by many large companies because:
- their search engines prefer to cache crawled (multimedia) data in a local database with a local hashing strategy (the OpenAI CLIP model, for example)
- to quickly find the context a user refers to, we need to precisely retrieve earlier content from the user's chat DB and generate context for the query (a retrieval sketch follows this list)
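As a minimal sketch of that retrieval step, assuming the prior chat turns have already been chunked and embedded (the embedding model, chunking, and the helper name `top_k_context` are all hypothetical here), the vector search kernels above would replace the plain matmul once the chat DB grows large:

```python
import mlx.core as mx


def top_k_context(chat_embeddings: mx.array, query_embedding: mx.array, k: int = 4) -> mx.array:
    """chat_embeddings: (N, D) L2-normalized rows; query_embedding: (D,) L2-normalized."""
    scores = chat_embeddings @ query_embedding  # cosine similarity per chunk, shape (N,)
    return mx.argsort(-scores)[:k]              # indices of the k most similar chunks


# Example with random embeddings standing in for embedded chat chunks.
chunks = mx.random.normal((1000, 128))
chunks = chunks / mx.linalg.norm(chunks, axis=-1, keepdims=True)
query = mx.random.normal((128,))
query = query / mx.linalg.norm(query)
print(top_k_context(chunks, query, k=4))  # indices of the 4 most relevant chunks
```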
cc @awni