Merged
215 changes: 189 additions & 26 deletions README.md
The library here has the following abstractions.

- **tools**: server tools, prompts, and resources
- **ui**: user interface that an engine (with a main manager) uses
- **core**: shared assets, primarily the plan/step/config definitions and worker/hub hierarchy roles
- **routes**: server views not related to MCP.
- **backends**: children of an engine; these are the model services (llama, openai, gemini)
- **databases**: how to save results as we progress in a pipeline (currently we support sqlite and filesystem JSON)

For the above, the engines, tools, ui, databases, and backends are interfaces.

### Tools

There are different means to add tools here:

- **internal**: discovered in `mcpserver/tools` (these assist the server).
- **external modules**: externally discovered via the same mechanism.
- **external one-off**: add a specific tool, prompt, or resource to a server (suggested)

I suggest combining the first and last approaches for security. When we deploy, we do not want to open a hole that allows adding unknown functions. In the context of a job, we likely have a specific need or use case and can select from a library. I am developing scoped tools with this goal -- to deploy a job and start a server within the context of the job with exactly what is needed. Here is how module discovery works:

```python
from mcpserver.tools.manager import ToolManager

# Discover and register defaults
manager = ToolManager()

# The tools vendored here are discovered automatically.
manager.register("mcpserver.tools")

# Register a different module
manager.register("mymodule.tools")
```
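Under the hood, module discovery can be sketched with `pkgutil` and `importlib`. This is an illustrative sketch only -- the `tool_` naming convention and `discover_tools` helper are assumptions, not the library's actual implementation:

```python
import importlib
import pkgutil


def discover_tools(module_name):
    """Walk a package's submodules and collect callables named tool_*.

    This mirrors the idea behind register(): import the package, iterate
    its submodules, and pick up anything matching a naming convention.
    """
    package = importlib.import_module(module_name)
    tools = {}
    for info in pkgutil.iter_modules(package.__path__):
        submodule = importlib.import_module(f"{module_name}.{info.name}")
        for attr in dir(submodule):
            candidate = getattr(submodule, attr)
            if attr.startswith("tool_") and callable(candidate):
                tools[attr] = candidate
    return tools
```

The real `ToolManager.register` may use decorators or metadata instead of a name prefix; the sketch only shows the discovery mechanics.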

## Development

It is recommended to open this in the VSCode container. Then install:
```bash
export SSL_CERT_FILE=$(pwd)/certs/cert.pem
```
And you'll see the server get hit.


## Full Architecture

### Starting a Hub

You'll need to install support for the associated worker and resource discovery:

```bash
pip install mcp-serve[hub]
pip install mcp-serve[all]
```

The mcpserver can act as a hub for workers -- other MCP servers that register to it. To start the mcpserver as a hub:

```bash
# Start a hub in one terminal
mcpserver start --hub --hub-secret potato
```

In another terminal, start a worker using the token that is generated. Add some functions for fun.

```bash
# If it wants to write batch jobs.
pip install hpc-mcp --break-system-packages
mcpserver start --config examples/jobspec/mcpserver.yaml --join http://0.0.0.0:8000 --join-secret potato --port 7777
```

Note that you can also set the secret in the environment.

```bash
export MCPSERVER_JOIN_SECRET=potato
mcpserver start --config examples/jobspec/mcpserver.yaml --join http://0.0.0.0:8000 --port 7777
```

You can also start a mock worker. By default, archetypes are chosen with 40/40/20 probability across hpc, cloud, and standalone. You can also specify an archetype.

```bash
mcpserver start --config examples/jobspec/mcpserver.yaml --join http://0.0.0.0:8000 --port 7777 --mock
mcpserver start --config examples/jobspec/mcpserver.yaml --join http://0.0.0.0:8000 --port 7777 --archetype hpc
```

### Mocking a Hub

If you are doing experiments, you can bring up a hub the same way:

```bash
mcpserver start --hub --hub-secret potato
```

To mock (simulate) a worker, add `--mock`, optionally with a particular archetype (one of `hpc`, `cloud`, or `standalone`). A worker ID is suggested to make the seed reproducible.

```bash
mcpserver start --join http://0.0.0.0:8000 --port 7777 --worker-id 10 --mock --join-secret potato
mcpserver start --join http://0.0.0.0:8000 --port 7777 --worker-id 10 --mock hpc --join-secret potato
```
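The archetype draw can be sketched as a weighted choice seeded by the worker ID, which is why passing `--worker-id` makes a mock run reproducible. This is a sketch of the idea (hypothetical function; the actual implementation may differ):

```python
import random

ARCHETYPES = ["hpc", "cloud", "standalone"]
WEIGHTS = [0.4, 0.4, 0.2]  # default 40/40/20 split


def choose_archetype(worker_id, archetype=None):
    """Pick an archetype: an explicit choice wins; otherwise do a
    weighted draw seeded by the worker ID so mock runs are reproducible."""
    if archetype is not None:
        return archetype
    rng = random.Random(worker_id)
    return rng.choices(ARCHETYPES, weights=WEIGHTS, k=1)[0]
```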

In another terminal, you can request to export the simulation "truth": the metadata generated for the providers chosen for the archetype.

```bash
mcpserver start --config examples/jobspec/mcpserver.yaml --join http://0.0.0.0:8000 --port 7777 --worker-id 10
```

And export "truth" metadata.

```bash
resource-ask export --output ground-truth.json
```

#### Manual Queries

Test doing raw queries for status. These are manual, local queries.

```bash
# Get listing of workers and metadata
python3 ./examples/mcp-query.py

# Get a specific tool metadata from the worker
python3 ./examples/mcp-query.py http://localhost:7777/mcp get_status

# Call a namespaced tool on the hub (e.g., get the status)
python3 ./examples/mcp-query.py http://localhost:8000/mcp n_781e903e4f10_get_status
```
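The `n_781e903e4f10_get_status` name suggests the hub prefixes each worker's tools with a short identifier so tools from different workers do not collide. One way such namespacing might work (an illustrative guess; the real scheme may derive the prefix differently):

```python
import hashlib


def namespace_tool(worker_identity, tool_name):
    """Prefix a worker's tool with a short, stable hash of the worker's
    identity, yielding hub-side names like n_<12 hex chars>_<tool>."""
    digest = hashlib.sha256(worker_identity.encode()).hexdigest()[:12]
    return f"n_{digest}_{tool_name}"
```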

You can test it without the join secret, or with a wrong join secret, to see it fail.

### Resource Secretary Client

This is the client's general interface:

```bash
# Includes request, asking secretaries, selection, and dispatch
resource-ask negotiate "I need <resources, constraints>"

# Includes request and asking secretaries
resource-ask satisfy "I need <resources, constraints>"

# Includes request, asking secretaries, and selection
resource-ask select "I need <resources, constraints>"

# The same, but from a proposals file (json dict with data.proposal for each)
resource-ask select --proposals proposals.json "I need <resources, constraints>"

# Dispatch directly to a named cluster
resource-ask dispatch <cluster> "I need <resources>"
```

The `resource-ask` client can use a local model to run selection and other algorithms. You can also "roll your own" using the server endpoints, but this library already provides interfaces for doing and extending that.

```bash
pip install resource-secretary
```


#### Negotiating a Job

When a user has a request, it goes to the hub as a prompt. We use a prompt instead of a set of hard-coded policies because it can technically say anything. E.g.,

> I have a paper due in 3 hours and I need to run LAMMPS. Find me at least 3 nodes and minimize time to completion. My budget is X.

If you are using gemini or openai, make sure to install the libraries.

```bash
pip install -e .[gemini] --break-system-packages
pip install -e .[openai] --break-system-packages
```

For the example, I like to make spack discoverable. We can install packages with spack to see how the responses change.

```bash
git clone --depth 1 https://github.com/spack/spack /tmp/spack
export SPACK_ROOT=/tmp/spack
flux start
```

And start the worker after that. Since we are running in a VSCode environment, let's ask for a smaller, scoped task.

```bash
# Satisfy request
resource-ask satisfy "Can you run cowsay on one node?"

# List selection algorithms
resource-ask list select

# Negotiate (satisfy, select, and dispatch) with a selection algorithm
resource-ask negotiate "I need to run LAMMPS with 1 node." --select agentic
```

You'll notice that the interface suggests using "select" next. The above "negotiate" is akin to a satisfy request. We do the following:

```console
[resource-ask] (client) --> [negotiate_job] (hub) --> [secretary_ask] (workers) --> return to hub --> [client]
```

A select would take this a step further and choose among the returned proposals.

The above is working, and the response comes back! Next I need to work on the selection algorithm and delegation.
To start, I'll likely select randomly (a valid interface to choose) and then implement delegation. The remaining notes are from before.
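A selection algorithm can be sketched as a tiny interface: given proposals, return the chosen worker. The class below is hypothetical (illustrating the shape, not the library's actual API); a random picker is a valid first implementation:

```python
import random


class RandomSelect:
    """Minimal selection-algorithm interface: pick one worker from the
    set of proposals. Seeding makes the choice reproducible for tests."""

    name = "random"

    def select(self, proposals, seed=None):
        rng = random.Random(seed)
        # Sort keys so the draw is stable across dict orderings.
        return rng.choice(sorted(proposals))
```

An agentic selector would implement the same `select` signature but ask a model to rank the proposals instead.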

For this to work we:

1. Make a call to the mcp server hub to `negotiate_work`
- Negotiate work is going to prompt the secretary to send back a response with:
- a quick yes/no response that can eliminate contenders
- policy specific metrics (e.g., estimated time to start, estimated cost, performance)
- Importantly, the hub will evaluate the importance of a set of factors for the job. E.g., "this job requires good network, completion in X time, under N cost; storage does not matter." It will come up with factors, weights (importance), and an equation over the factors, and it will not tell the secretaries its relative weights. The hub will prepare a prompt that describes the needs and provides a reference scale for the secretary agents. E.g., "Evaluate your network where 1.0 is 100Gbps InfiniBand and 0.0 is 1Gbps Ethernet." The secretaries will then evaluate the quality of their resources toward the goal and send back scores with reasons/justification for each variable. We can test binary (0/1) responses, specific ranges, and normalized scores (0 to 1). The hub then just needs to evaluate the returned values against its equation. This needs to be a two-step process: first quantitative, then adjustment based on qualitative notes. E.g., maybe a specific filesystem is given 0.5, but the secretary also notes it is undergoing a rebuild, so the hub decides to penalize it.
2. The hub sends the request to the children workers (each a different cluster)
3. Each child worker has a secretary that receives it.
- The secretary has metadata about the cluster that is discovered on startup that does not change (e.g., hardware)
- The secretary is also able to register handles to detailed discovery tools (e.g., for software you'd run `spack find lammps`)
- The secretary makes a call to request state data like queue status
- The secretary also uses the discovery tools to look for the software of choice.
4. Each secretary sends back their response - quantitative scores, plus qualitative reasons.
5. Each secretary has a trust score. It is based on two things:
- The actual discovery of resources is a known truth that is always returned. What the secretary says is compared against that.
- An actual performance of a job can be evaluated against what was promised.
- A trust score can (somehow) go into a future evaluation.
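The two-step evaluation in step 1 can be sketched as a weighted sum over normalized scores, followed by a qualitative penalty pass. The function names and penalty representation are assumptions; the weights stay private to the hub:

```python
def evaluate(proposals, weights, penalties=None):
    """Rank workers: first a quantitative weighted sum of normalized
    scores (0 to 1), then a qualitative adjustment (e.g., penalize a
    filesystem that the secretary notes is mid-rebuild)."""
    penalties = penalties or {}
    ranked = {}
    for worker, scores in proposals.items():
        quantitative = sum(weights[f] * scores.get(f, 0.0) for f in weights)
        ranked[worker] = quantitative - penalties.get(worker, 0.0)
    return max(ranked, key=ranked.get), ranked
```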

### Design Choices

Here are a few design choices (subject to change, of course). I am starting with re-implementing our fractale agents with this framework. For that, instead of agents being tied to specific functions (as classes with their agent functions), we will have a flexible agent class that changes function based on a chosen prompt. It will use mcp functions, prompts, and resources. In addition:

## TODO

- [ ] Full operator with Flux example (Flux operator with HPC apps and jobspec translation)
- [ ] should we be reporting utilization (e.g., mock or nvidia-smi) if it might just be a login node?
- [ ] write a function to compare a reported agent result against the truth? How?
- [ ] need a way to "pass forward" an error from a worker, e.g., that an API key is not set.
- [ ] I want the equivalent of a satisfy endpoint, checking for the negotiate but not dispatching.
- [ ] I also want an equivalent "just submit to this cluster" endpoint.

Idea:

- the mcp-server worker should have a tool that generates a prompt for an agent: "Here is a request for LAMMPS with this many nodes, and here are the resources we see (a call to get_status, which will also be returned to the caller). Can we support it? Use your tools to figure it out." The created agent should then use the tools in the same server it was generated in to answer that question. The response from the agent, plus the status, returns to the hub. The hub can use the weighted equation to decide on a final cluster.
- TODO: ask agent which flux variables we should eliminate.
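The prompt-generating tool from the idea above might be as simple as string assembly (a hypothetical helper, sketched for illustration):

```python
def build_agent_prompt(request, status):
    """Turn a user request plus the worker's own get_status output into
    a prompt for a locally created agent."""
    return (
        f"Here is a request: {request}\n"
        f"Here are the resources we see: {status}\n"
        "Can we support it? Use your tools to figure it out."
    )
```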

## License

6 changes: 1 addition & 5 deletions examples/jobspec/mcpserver.yaml
@@ -1,7 +1,3 @@
tools:
- path: flux_mcp.validate.flux_validate_jobspec
- path: flux_mcp.transformer.transform_jobspec

prompts:
- path: flux_mcp.validate.flux_validate_jobspec_persona
- path: flux_mcp.transformer.transform_jobspec_persona
- path: hpc_mcp.filesystem.filesystem_write_file