Commit 3e004a7

Addressing reviewers' comments, added missing figures, added benchmarking data
1 parent 1391560 commit 3e004a7

3 files changed

+144
-0
lines changed

@@ -0,0 +1,144 @@
1+
# Pluggable Body-Based Routing (BBR) Framework
2+
3+
Author(s): @davidbreitgand @srampal
4+
5+
## Proposal Status
6+
7+
***Draft***
8+
9+
## Summary
10+
11+
The Gateway API Inference Extension (GIE, v1.2.1) includes an initial implementation of Body-Based Routing (BBR). Currently, BBR provides a single capability: it extracts the model name from the request body and adds it to the `X-Gateway-Model-Name` header. This header is then used to route the request to the appropriate InferencePool and its associated Endpoint Picker Extension (EPP) instances.
12+
13+
The current BBR implementation is limited and lacks extensibility. Similar to the [pluggability introduced in the scheduling subsystem](../0845-scheduler-architecture-proposal/README.md), BBR should support custom extensions without requiring modifications to the GIE code base.
14+
15+
This proposal introduces a plugin architecture for BBR that allows developers to implement custom logic. Plugins could be organized into a chain or DAG for ordered and concurrent execution.
16+
17+
See [this document](https://docs.google.com/document/d/1So9uRjZrLUHf7Rjv13xy_ip3_5HSI1cn1stS3EsXLWg/edit?tab=t.0#heading=h.55jwocr94axs) for additional context and reference.
18+
19+
## Goals
20+
21+
The pluggable BBR framework aims to address the following goals:
22+
23+
### Immediate Goals
24+
25+
- Avoid monolithic architecture
26+
- Mimic pluggability and configurability of the scheduling subsystem without coupling between the two
27+
- Limit changes to the BBR feature to avoid any changes in the rest of the code base
28+
- Follow best practices and experience from the Scheduling subsystem
29+
pluggability effort. For example, extending the system to support the above
30+
should be through implementing well-defined `Plugin` interfaces and registering
31+
them in the BBR subsystem; any configuration would be done in the
32+
same way (e.g., code and/or configuration file)
33+
- Reuse common code from EPP, such as `TypedName`, wherever it makes sense, but avoid reusing specialized code that carries non-BBR functionality, to prevent misuse
34+
- Provide reference plugin implementation(s).
35+
36+
### Extended Goals
37+
38+
- Enable organizing plugins into a topology for sequential and concurrent execution. Note that while BBR stands for Body-Based Routing and this proposal does not aim at general payload processing, routing decisions might require pre-processing/post-processing operations
39+
- Avoid redundant, repeated body parsing across plugins in a topology for the sake of performance
40+
- Enable extensible collection and registration of metrics using lessons from the pluggable scheduling subsystem
41+
42+
## Non-Goals
43+
44+
- Modify existing GIE abstractions
45+
- Fully align plugins, registries, and factories across BBR and EPP
46+
- Dynamically reconfigure plugins and plugin topologies at runtime
47+
- Enable extensibility of the `BBRPlugin` registration mechanisms in third-party extensions
48+
49+
## Proposal
50+
51+
### Overview
52+
53+
The `BBRPlugin` interface embeds the `Plugin` interface adopted from EPP and should be implemented by every BBR plugin. Each plugin is identified by its `TypedName` (also adopted from EPP), where `TypedName().Type` is the string naming the plugin type and `TypedName().Name` is the string naming the plugin's implementation. BBR is refactored to follow the registered factory pattern: plugin constructors are registered as factories, and plugins are instantiated through them.
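
The registered factory pattern mentioned above can be sketched roughly as follows. This is a minimal illustration, not the proposed implementation: `PluginRegistry`, `Register`, `Create`, and `noopPlugin` are hypothetical names, the registry key (a whole `TypedName`) is an assumption, and `TypedName` and `Plugin` are local stand-ins for the EPP types so that the snippet compiles on its own.

```go
package main

import (
	"fmt"
	"sync"
)

// TypedName is a local stand-in for the EPP type of the same name (assumed shape).
type TypedName struct {
	Type string // plugin type, e.g. "MetadataExtractor"
	Name string // plugin implementation, e.g. "simple-model-selector"
}

// Plugin is a local stand-in for the EPP Plugin interface.
type Plugin interface {
	TypedName() TypedName
}

// BBRPlugin mirrors the interface suggested in this proposal.
type BBRPlugin interface {
	Plugin
	Execute(requestBodyBytes []byte) (headers map[string]string, mutatedBodyBytes []byte, err error)
}

// PluginFactoryFunc constructs a BBRPlugin instance.
type PluginFactoryFunc func() (BBRPlugin, error)

// PluginRegistry (hypothetical) maps a TypedName to the factory that builds it.
type PluginRegistry struct {
	mu        sync.RWMutex
	factories map[TypedName]PluginFactoryFunc
}

// NewPluginRegistry returns an empty registry.
func NewPluginRegistry() *PluginRegistry {
	return &PluginRegistry{factories: make(map[TypedName]PluginFactoryFunc)}
}

// Register adds a factory and rejects duplicate registrations.
func (r *PluginRegistry) Register(tn TypedName, f PluginFactoryFunc) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, exists := r.factories[tn]; exists {
		return fmt.Errorf("plugin %s/%s is already registered", tn.Type, tn.Name)
	}
	r.factories[tn] = f
	return nil
}

// Create instantiates a plugin through its registered factory.
func (r *PluginRegistry) Create(tn TypedName) (BBRPlugin, error) {
	r.mu.RLock()
	f, ok := r.factories[tn]
	r.mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("no factory registered for plugin %s/%s", tn.Type, tn.Name)
	}
	return f()
}

// noopPlugin is a hypothetical plugin that sets no headers and leaves the body unchanged.
type noopPlugin struct{ name TypedName }

func (p noopPlugin) TypedName() TypedName { return p.name }

func (p noopPlugin) Execute(body []byte) (map[string]string, []byte, error) {
	return map[string]string{}, body, nil
}
```

With such a registry, a plugin would be registered once at start-up and later created by its `TypedName`, which keeps plugin construction out of the request-handling code.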
54+
55+
In addition, as extended functionality, a `PluginsChain` interface defines the order in which plugins are executed. In the future, `PluginsChain` might be replaced by `PluginsDAG` to allow for more complex topological order and concurrency.
56+
57+
`PluginsChain` only contains ordered `BBRPlugin` types registered in the `PluginRegistry`. `RequestPluginsChain` and `ResponsePluginsChain` are optionally configured for handling requests and responses respectively. If no configuration is provided, default `PluginsChain` instances will be configured automatically.
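
One possible shape for ordered chain execution is sketched below. Several points are assumptions made only for illustration: `BBRPlugin` is reduced to its `Execute` method, `PluginFunc`, `NewPluginsChain`, and `Run` are hypothetical names, headers set by a later plugin override those set by an earlier one, and a non-nil mutated body becomes the input of the next plugin. The proposal itself leaves these merging rules open.

```go
package main

import "fmt"

// BBRPlugin is a reduced stand-in for the proposed interface (Execute only).
type BBRPlugin interface {
	Execute(requestBodyBytes []byte) (headers map[string]string, mutatedBodyBytes []byte, err error)
}

// PluginFunc (hypothetical) adapts a plain function to the BBRPlugin interface.
type PluginFunc func(body []byte) (map[string]string, []byte, error)

// Execute calls the wrapped function.
func (f PluginFunc) Execute(body []byte) (map[string]string, []byte, error) {
	return f(body)
}

// PluginsChain (sketch) holds plugins in their execution order.
type PluginsChain struct {
	plugins []BBRPlugin
}

// NewPluginsChain builds a chain that runs the plugins in the order given.
func NewPluginsChain(plugins ...BBRPlugin) *PluginsChain {
	return &PluginsChain{plugins: plugins}
}

// Run executes every plugin in order, merges the returned headers, and
// threads a mutated body (if any) into the next plugin.
// Both merging rules are assumptions of this sketch.
func (c *PluginsChain) Run(body []byte) (map[string]string, []byte, error) {
	headers := make(map[string]string)
	current := body
	for i, p := range c.plugins {
		h, mutated, err := p.Execute(current)
		if err != nil {
			return nil, nil, fmt.Errorf("plugin at position %d failed: %w", i, err)
		}
		for k, v := range h {
			headers[k] = v // later plugins override earlier ones
		}
		if mutated != nil {
			current = mutated
		}
	}
	return headers, current, nil
}
```

A request handler would call `Run` once per request and then apply the merged headers and the final body to the outgoing message.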
58+
59+
Depending on its functionality and implementation, a `BBRPlugin` might require full or selective body parsing. To limit parsing overhead, if at least one `BBRPlugin` in the `PluginsChain` requires full body parsing, the body is parsed only once into the appropriate struct from the official `openai-go` library (either `openai.CompletionNewParams` or `openai.ChatCompletionNewParams`, depending on the request endpoint). This struct is shared read-only with all plugins in the chain; each `BBRPlugin` receives it by value. In the initial implementation, a plugin that needs to mutate the body MUST work on its own copy, and each plugin returns its mutated body separately.
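
The parse-once decision could look roughly like the sketch below. It assumes the optional marker interface `NeedsFullParsing` from this proposal and detects it with a type assertion. `parsedBody` is a hypothetical stand-in for the `openai-go` parameter structs so that the snippet has no external dependencies, and `parseIfNeeded`, `headerOnlyPlugin`, and `guardrailPlugin` are illustrative names only.

```go
package main

import "encoding/json"

// NeedsFullParsing is the optional marker interface from this proposal.
type NeedsFullParsing interface {
	FullParsingNeeded()
}

// parsedBody is a hypothetical stand-in for the openai-go request structs.
type parsedBody struct {
	Model    string            `json:"model"`
	Messages []json.RawMessage `json:"messages"`
}

// anyNeedsFullParsing reports whether at least one plugin implements the marker.
func anyNeedsFullParsing(plugins []any) bool {
	for _, p := range plugins {
		if _, ok := p.(NeedsFullParsing); ok {
			return true
		}
	}
	return false
}

// parseIfNeeded parses the body at most once, and only if some plugin asks for it.
// It returns nil (and no error) when no plugin requires full parsing.
func parseIfNeeded(plugins []any, body []byte) (*parsedBody, error) {
	if !anyNeedsFullParsing(plugins) {
		return nil, nil
	}
	var pb parsedBody
	if err := json.Unmarshal(body, &pb); err != nil {
		return nil, err
	}
	return &pb, nil
}

// headerOnlyPlugin (hypothetical) does not need the fully parsed body.
type headerOnlyPlugin struct{}

// guardrailPlugin (hypothetical) opts in to full parsing via the marker method.
type guardrailPlugin struct{}

func (guardrailPlugin) FullParsingNeeded() {}
```

Because the check is a type assertion, plugins that never need the full body pay nothing, and adding a plugin that does need it changes behavior without any extra configuration.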
60+
61+
Even simple BBR plugin implementations can differ considerably in latency and memory use. This justifies different implementations of BBR plugins in different contexts.
62+
63+
![Benchmarking different implementations of OpenAI message body parsing to extract `model` metadata](./images/benchmark-summary.png)
64+
65+
[The benchmark details and code can be found here](https://github.com/davidbreitgand/scripts/tree/main/benchmarks).
66+
67+
### Suggested Components
68+
69+
The sketch of the proposed framework is shown in the figure below.
70+
![Pluggable BBR framework architecture showing components including BBRPlugin interface, PluginRegistry, PluginFactory, PluginsChain orchestrator, and data flow between request handler, shared parsed body struct, plugin execution chain, and response handler with headers and mutated body outputs](./images/pluggable-framework-architecture-sketch.png)
71+
72+
### Suggested BBR Pluggable Framework Interfaces
73+
74+
```go
75+
// ------------------------------------ Defaults ------------------------------------------
76+
77+
const DefaultPluginType = "MetadataExtractor"
78+
const DefaultPluginImplementation = "simple-model-selector"
79+
80+
// BBRPlugin defines the interface for plugins in the BBR framework
81+
type BBRPlugin interface {
82+
plugins.Plugin
83+
84+
// Execute runs the plugin logic on the request body.
85+
// A plugin's implementation logic CAN mutate the body of the message.
86+
// A plugin's implementation MUST return a map of headers.
87+
// If no headers are set by the implementation, the map MUST be empty.
88+
// The value of a header in an extended implementation NEED NOT be identical to the value of that same header as would be set
89+
// in a default implementation.
90+
// Example: in the body of a request model is set to "semantic-model-selector",
91+
// which, say, stands for "select a best model for this request at minimal cost"
92+
// A plugin implementation of "semantic-model-selector" sets X-Gateway-Model-Name to any valid
93+
// model name from the inventory of the backend models and also mutates the body accordingly
94+
95+
Execute(requestBodyBytes []byte) (headers map[string]string, mutatedBodyBytes []byte, err error)
96+
}
97+
98+
99+
// NeedsFullParsing is an optional capability interface.
100+
// Plugins that require full body parsing implement this marker method.
101+
// The method has no return value; presence of the method is the signal.
102+
type NeedsFullParsing interface {
103+
FullParsingNeeded()
104+
}
105+
106+
// PluginFactoryFunc is the function type for BBRPlugin constructors.
107+
// Concrete constructors are assigned to this type.
108+
109+
type PluginFactoryFunc func() (bbrplugins.BBRPlugin, error)
110+
```
111+
### Defaults
112+
113+
A default plugin instance that sets the `X-Gateway-Model-Name` header will always be configured automatically if no specific plugin is configured. The default plugin only sets the header and does not mutate the body.
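
A minimal sketch of what such a default plugin might look like is given below. It is not the reference implementation: `simpleModelSelector` and `modelHeader` are hypothetical names, the `TypedName` method is omitted for brevity, and returning an empty header map when the body has no `model` field is an assumption about the desired behavior.

```go
package main

import "encoding/json"

// modelHeader is the header set by the default plugin.
const modelHeader = "X-Gateway-Model-Name"

// simpleModelSelector (hypothetical) sketches the default plugin: it extracts
// the model name from the request body and never mutates the body.
type simpleModelSelector struct{}

// Execute reads only the "model" field, which avoids parsing the body into a full struct.
func (simpleModelSelector) Execute(requestBodyBytes []byte) (map[string]string, []byte, error) {
	var partial struct {
		Model string `json:"model"`
	}
	if err := json.Unmarshal(requestBodyBytes, &partial); err != nil {
		return nil, nil, err
	}
	headers := map[string]string{}
	if partial.Model != "" {
		headers[modelHeader] = partial.Model
	}
	// The body is returned unchanged because the default plugin only sets the header.
	return headers, requestBodyBytes, nil
}
```

For a body such as `{"model":"llama-3","prompt":"hi"}`, `Execute` returns a map with `X-Gateway-Model-Name` set to `llama-3` together with the original bytes.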
114+
115+
### Current BBR reimplementation as BBRPlugin
116+
117+
The current BBR logic will be reimplemented as a `BBRPlugin` according to this proposal and the phased approach detailed in the next section.
118+
119+
### Implementation Phases
120+
121+
The pluggable framework will be implemented iteratively over several phases and a series of small PRs.
122+
123+
1. Introduce the `BBRPlugin` interface, the `MetadataExtractor` plugin type, the registry, the default plugin implementation (`simple-model-selector`), and its factory. Plugin configuration will be implemented via environment variables set in the Helm chart
124+
1. Introduce a plugins topology (initially a `PluginsChain`)
125+
1. Introduce a shared struct (shared among the plugins of a plugins chain) to avoid redundant body parsing
126+
1. Introduce an interface for guardrail plugins, introduce a simple reference implementation, and experiment with plugin chains on request and response messages
127+
1. Refactor metrics as needed to work with the new pluggable framework
128+
1. Implement configuration via manifests similar to those in EPP
129+
1. Implement `PluginsDAG` to allow for more complex topological order and concurrency.
130+
1. Continuously learn lessons from this implementation and the scheduling framework to improve the implementation
131+
1. Aim at alignment and cross-pollination with the [AI GW WG](https://github.com/kubernetes-sigs/wg-ai-gateway).
132+
133+
## Open Questions
134+
135+
1. More elaborate topology definition and execution
136+
1. More elaborate shared memory architecture for the best performance
137+
1. Considerations for handling newer OpenAI API
138+
1. The OpenAI API continues to evolve. Most recently, the Responses API was added, which has some stateful logic in addition to the ChatCompletions endpoint. The design will be extended to cover the OpenAI Responses API as well. For example, the `PluginsChain` might be extended to provide common utilities that help with state caching, or state handling might be left entirely to the plugins.
139+
1. TBA
140+
141+
## Note 1
142+
143+
The proposed interfaces may differ slightly from those implemented in the [initial PR 1981](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/1981).
144+
The initial PR will be refactored into a series of small PRs which should be evaluated in reference to this proposal.