Description
After a thorough investigation, we found that for agentic LLMs (we deploy our models on a Mac Studio M3 Ultra), expanding the context with tool calling is an essential part of the workflow.
Hence we need:
- High-performance radix tree caching Metal kernels (see the SGLang radix tree kernels, written in CUDA, for an example), which are essential for agentic LLM deployment; a minimal CPU-side sketch of the underlying data structure follows this list
- High-performance vector search Metal kernels; our vector search service currently runs temporarily on a local NVIDIA Hopper DGX GPU with CUDA kernels, but that hardware may not be available in a client's environment
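As referenced above, here is a minimal CPU-side sketch, in the spirit of SGLang's RadixAttention prefix cache, of the radix tree that such Metal kernels would accelerate: insert token sequences together with a handle to their cached KV blocks, then match the longest cached prefix of a new request. Class and field names (`RadixPrefixCache`, `kv_handle`) are illustrative assumptions, not a proposed API.

```python
# Radix tree keyed on token IDs; each node may carry an opaque handle to the
# KV-cache blocks for the prefix ending at that node (illustrative sketch only).
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class RadixNode:
    tokens: Tuple[int, ...] = ()                     # edge label into this node
    children: Dict[int, "RadixNode"] = field(default_factory=dict)
    kv_handle: Optional[object] = None               # cached KV blocks for this prefix


class RadixPrefixCache:
    def __init__(self) -> None:
        self.root = RadixNode()

    def insert(self, tokens: List[int], kv_handle: object) -> None:
        """Insert a token sequence and attach its KV-cache handle."""
        node, i = self.root, 0
        while i < len(tokens):
            child = node.children.get(tokens[i])
            if child is None:
                node.children[tokens[i]] = RadixNode(tuple(tokens[i:]), kv_handle=kv_handle)
                return
            label = child.tokens
            k = 0
            while k < len(label) and i + k < len(tokens) and label[k] == tokens[i + k]:
                k += 1
            if k < len(label):
                # Split the edge where the new sequence diverges from the stored one.
                mid = RadixNode(label[:k], children={label[k]: child})
                child.tokens = label[k:]
                node.children[tokens[i]] = mid
                child = mid
            i += k
            node = child
        node.kv_handle = kv_handle

    def longest_prefix(self, tokens: List[int]) -> Tuple[int, Optional[object]]:
        """Return (cached prefix length, KV handle of the deepest cached prefix)."""
        node, i, best = self.root, 0, (0, None)
        while i < len(tokens):
            child = node.children.get(tokens[i])
            if child is None:
                break
            label = child.tokens
            k = 0
            while k < len(label) and i + k < len(tokens) and label[k] == tokens[i + k]:
                k += 1
            if k < len(label):
                break
            i += k
            node = child
            if node.kv_handle is not None:
                best = (i, node.kv_handle)
        return best


if __name__ == "__main__":
    cache = RadixPrefixCache()
    cache.insert([1, 2, 3, 4], kv_handle="kv:system+tools")
    cache.insert([1, 2, 3, 4, 9, 9], kv_handle="kv:turn-1")
    print(cache.longest_prefix([1, 2, 3, 4, 9, 7]))  # (4, 'kv:system+tools')
```

The GPU-side work (batched prefix matching, KV block copy and reuse) is what would live in the Metal kernels; the tree bookkeeping above can stay on the CPU, as it does in SGLang.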
Proposal
We could add such kernels as plugins in mlx-lm, as in the example below: #609
Adding them directly to MLX is also possible, but that requires more effort to dive into the MLX stack, since writing an efficient custom Metal kernel is not as trivial as writing a plugin in mlx-lm.
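For illustration only, here is a minimal sketch of what a brute-force similarity-scoring kernel could look like using MLX's `mx.fast.metal_kernel` interface for custom Metal kernels. The kernel name `dot_scores`, the shapes, and the launch configuration are assumptions for this sketch, not a proposed final design; an efficient kernel would additionally need tiling, threadgroup memory, and a fused top-k.

```python
import mlx.core as mx

# Each thread computes the dot product between the query and one database row.
# D is passed as a compile-time template parameter (illustrative sketch only).
_SOURCE = """
    uint row = thread_position_in_grid.x;
    float acc = 0.0f;
    for (int j = 0; j < D; ++j) {
        acc += float(db[row * D + j]) * float(q[j]);
    }
    scores[row] = acc;
"""

_kernel = mx.fast.metal_kernel(
    name="dot_scores",
    input_names=["db", "q"],
    output_names=["scores"],
    source=_SOURCE,
)


def dot_scores(db: mx.array, q: mx.array) -> mx.array:
    """db: (N, D) row-contiguous embeddings, q: (D,) query embedding."""
    n, d = db.shape
    (scores,) = _kernel(
        inputs=[db, q],
        template=[("D", d)],
        grid=(n, 1, 1),
        threadgroup=(min(n, 256), 1, 1),
        output_shapes=[(n,)],
        output_dtypes=[mx.float32],
    )
    return scores


db = mx.random.normal((1024, 128)).astype(mx.float32)
q = mx.random.normal((128,)).astype(mx.float32)
print(dot_scores(db, q).shape)  # (1024,)
```

Wrapping this (or a hand-tuned equivalent) behind an mlx-lm plugin entry point, as in #609, would keep the MLX core untouched.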
RAG has already been deprecated by many large companies because:
- their search engines prefer to cache crawled (multimedia) data in a local database with a local hashing strategy (the OpenAI CLIP model, for example)
- to quickly find the context a user refers to, we need to precisely retrieve earlier content from the user's chat DB and generate context for the query (a retrieval sketch follows this list)
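As a minimal sketch of that retrieval step, assuming the prior chat turns have already been chunked and embedded (the embedding model, chunking, and the helper name `top_k_context` are all hypothetical here), the vector search kernels above would replace the plain matmul once the chat DB grows large:

```python
import mlx.core as mx


def top_k_context(chat_embeddings: mx.array, query_embedding: mx.array, k: int = 4) -> mx.array:
    """chat_embeddings: (N, D) L2-normalized rows; query_embedding: (D,) L2-normalized."""
    scores = chat_embeddings @ query_embedding  # cosine similarity per chunk, shape (N,)
    return mx.argsort(-scores)[:k]              # indices of the k most similar chunks


# Example with random embeddings standing in for embedded chat chunks.
chunks = mx.random.normal((1000, 128))
chunks = chunks / mx.linalg.norm(chunks, axis=-1, keepdims=True)
query = mx.random.normal((128,))
query = query / mx.linalg.norm(query)
print(top_k_context(chunks, query, k=4))  # indices of the 4 most relevant chunks
```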
cc @awni