Merged
215 changes: 189 additions & 26 deletions README.md
The library here has the following abstractions.

- **tools**: server tools, prompts, and resources
- **ui**: user interface that an engine (with a main manager) uses
- **core**: shared assets, primarily the plan/step/config definitions and worker/hub hierarchy roles
- **routes**: server views not related to MCP.
- **backends**: children of an engine; these are the model services (llama, openai, gemini)
- **databases**: how to save results as we progress in a pipeline (currently we support sqlite and filesystem JSON)

For the above, the engines, tools, ui, databases, and backends are interfaces.

### Tools

There are different means to add tools here:

- **internal**: discovered in `mcpserver/tools` (these assist the server).
- **external modules**: externally discovered via the same mechanism.
- **external one-off**: add a specific tool, prompt, or resource to a server (suggested)

I suggest combining the first and last approaches for security. When we deploy, we do not want to open a hole that allows adding unknown functions. In the context of a job, we likely have a specific need or use case and can select from a library. I am developing scoped tools with this goal -- to deploy a job and start a server within the context of the job with exactly what is needed. Here is how module discovery works:

```python
from mcpserver.tools.manager import ToolManager

# Discover and register defaults
manager = ToolManager()

# The tools vendored here are discovered automatically.
manager.register("mcpserver.tools")

# Register a different module
manager.register("mymodule.tools")
```
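Under the hood, module discovery can be sketched with `pkgutil` and `importlib`. This is an illustrative sketch only -- the `tool_` naming convention and `discover_tools` helper are assumptions, not the library's actual implementation:

```python
import importlib
import pkgutil


def discover_tools(module_name):
    """Walk a package's submodules and collect callables named tool_*.

    This mirrors the idea behind register(): import the package, iterate
    its submodules, and pick up anything matching a naming convention.
    """
    package = importlib.import_module(module_name)
    tools = {}
    for info in pkgutil.iter_modules(package.__path__):
        submodule = importlib.import_module(f"{module_name}.{info.name}")
        for attr in dir(submodule):
            candidate = getattr(submodule, attr)
            if attr.startswith("tool_") and callable(candidate):
                tools[attr] = candidate
    return tools
```

The real `ToolManager.register` may use decorators or metadata instead of a name prefix; the sketch only shows the discovery mechanics.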

## Development

It is recommended to open this in the VSCode container. Then install:
```bash
export SSL_CERT_FILE=$(pwd)/certs/cert.pem
```
And you'll see the server get hit.


## Full Architecture

### Starting a Hub

You'll need to install support for the associated worker and resource discovery:

```bash
pip install mcp-serve[hub]
pip install mcp-serve[all]
```

The mcpserver can act as a hub for workers -- other MCP servers that register to it. To start the mcpserver as a hub:

```bash
# Start a hub in one terminal
mcpserver start --hub --hub-secret potato
```

In another terminal, start a worker using the token that is generated. Add some functions for fun.

```bash
# If it wants to write batch jobs.
pip install hpc-mcp --break-system-packages
mcpserver start --config examples/jobspec/mcpserver.yaml --join http://0.0.0.0:8000 --join-secret potato --port 7777
```

Note that you can also set the secret in the environment.

```bash
export MCPSERVER_JOIN_SECRET=potato
mcpserver start --config examples/jobspec/mcpserver.yaml --join http://0.0.0.0:8000 --port 7777
```

You can also start a mock worker. By default, archetypes are chosen with 40/40/20 probability across hpc, cloud, and standalone. You can also specify an archetype.

```bash
mcpserver start --config examples/jobspec/mcpserver.yaml --join http://0.0.0.0:8000 --port 7777 --mock
mcpserver start --config examples/jobspec/mcpserver.yaml --join http://0.0.0.0:8000 --port 7777 --archetype hpc
```

### Mocking a Hub

If you are doing experiments, you can bring up a hub the same way:

```bash
mcpserver start --hub --hub-secret potato
```

To mock (simulate) a worker, add `--mock`, optionally with a particular archetype (one of `hpc`, `cloud`, or `standalone`). A worker ID is suggested to make the seed reproducible.

```bash
mcpserver start --join http://0.0.0.0:8000 --port 7777 --worker-id 10 --mock --join-secret potato
mcpserver start --join http://0.0.0.0:8000 --port 7777 --worker-id 10 --mock hpc --join-secret potato
```
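The archetype draw can be sketched as a weighted choice seeded by the worker ID, which is why passing `--worker-id` makes a mock run reproducible. This is a sketch of the idea (hypothetical function; the actual implementation may differ):

```python
import random

ARCHETYPES = ["hpc", "cloud", "standalone"]
WEIGHTS = [0.4, 0.4, 0.2]  # default 40/40/20 split


def choose_archetype(worker_id, archetype=None):
    """Pick an archetype: an explicit choice wins; otherwise do a
    weighted draw seeded by the worker ID so mock runs are reproducible."""
    if archetype is not None:
        return archetype
    rng = random.Random(worker_id)
    return rng.choices(ARCHETYPES, weights=WEIGHTS, k=1)[0]
```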

In another terminal, you can request to export the simulation "truth": the metadata generated for the providers chosen for the archetype.

```bash
mcpserver start --config examples/jobspec/mcpserver.yaml --join http://0.0.0.0:8000 --port 7777 --worker-id 10
```

And export "truth" metadata.

```bash
resource-ask export --output ground-truth.json
```

#### Manual Queries

Test doing raw queries for status. These are manual, local queries.

```bash
# Get listing of workers and metadata
python3 ./examples/mcp-query.py

# Get a specific tool metadata from the worker
python3 ./examples/mcp-query.py http://localhost:7777/mcp get_status

# Call a namespaced tool on the hub (e.g., get the status)
python3 ./examples/mcp-query.py http://localhost:8000/mcp n_781e903e4f10_get_status
```
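The `n_781e903e4f10_get_status` name suggests the hub prefixes each worker's tools with a short identifier so tools from different workers do not collide. One way such namespacing might work (an illustrative guess; the real scheme may derive the prefix differently):

```python
import hashlib


def namespace_tool(worker_identity, tool_name):
    """Prefix a worker's tool with a short, stable hash of the worker's
    identity, yielding hub-side names like n_<12 hex chars>_<tool>."""
    digest = hashlib.sha256(worker_identity.encode()).hexdigest()[:12]
    return f"n_{digest}_{tool_name}"
```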

You can test it without the join secret, or with a wrong join secret, to see it fail.

### Resource Secretary Client

This is the client's general interface:

```bash
# Includes request, asking secretaries, selection, and dispatch
resource-ask negotiate "I need <resources, constraints>"

# Includes request and asking secretaries
resource-ask satisfy "I need <resources, constraints>"

# Includes request, asking secretaries, and selection
resource-ask select "I need <resources, constraints>"

# The same, but from a proposals file (json dict with data.proposal for each)
resource-ask select --proposals proposals.json "I need <resources, constraints>"

# Dispatch directly to a named cluster
resource-ask dispatch <cluster> "I need <resources>"
```

The `resource-ask` client can use a local model to run selection and other algorithms. You can also "roll your own" using the server endpoints, but this library already provides interfaces for doing and extending that.

```bash
pip install resource-secretary
```


#### Negotiating a Job

When a user has a request, it goes to the hub as a prompt. We use a prompt instead of a set of hard-coded policies because it can technically say anything. E.g.,

> I have a paper due in 3 hours and I need to run LAMMPS. Find me at least 3 nodes and minimize time to completion. My budget is X.

If you are using gemini or openai, make sure to install the libraries.

```bash
pip install -e .[gemini] --break-system-packages
pip install -e .[openai] --break-system-packages
```

For the example, I like to make spack discoverable. We can install packages with spack to see how the responses change.

```bash
git clone --depth 1 https://github.com/spack/spack /tmp/spack
export SPACK_ROOT=/tmp/spack
flux start
```

And start the worker after that. Since we are running in a VSCode environment, let's ask for a smaller, scoped task.

```bash
# Satisfy request
resource-ask satisfy "Can you run cowsay on one node?"

# List selection algorithms
resource-ask list select

# Negotiate (satisfy, select, and dispatch) with a selection algorithm
resource-ask negotiate "I need to run LAMMPS with 1 node." --select agentic
```

You'll notice that the interface suggests using "select" next. The above "negotiate" is akin to a satisfy request. We do the following:

```console
[resource-ask] (client) --> [negotiate_job] (hub) --> [secretary_ask] (workers) --> return to hub --> [client]
```

A select would take this a step further and choose among the returned proposals.

The above is working, and the response comes back! Next I need to work on the selection algorithm and delegation.
To start, I'll likely select randomly (a valid interface to choose) and then implement delegation. The remaining notes are from before.
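A selection algorithm can be sketched as a tiny interface: given proposals, return the chosen worker. The class below is hypothetical (illustrating the shape, not the library's actual API); a random picker is a valid first implementation:

```python
import random


class RandomSelect:
    """Minimal selection-algorithm interface: pick one worker from the
    set of proposals. Seeding makes the choice reproducible for tests."""

    name = "random"

    def select(self, proposals, seed=None):
        rng = random.Random(seed)
        # Sort keys so the draw is stable across dict orderings.
        return rng.choice(sorted(proposals))
```

An agentic selector would implement the same `select` signature but ask a model to rank the proposals instead.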

For this to work we:

1. Make a call to the mcp server hub to `negotiate_work`
- Negotiate work is going to prompt the secretary to send back a response with:
- a quick yes/no response that can eliminate contenders
- policy specific metrics (e.g., estimated time to start, estimated cost, performance)
- Importantly, the hub will evaluate the importance of a set of factors for the job. E.g., "this job requires good network, completion in X time, under N cost; storage does not matter." It will come up with factors, weights (importance), and an equation over the factors, and it will not tell the secretaries its relative weights. The hub will prepare a prompt that describes the needs and provides a reference scale for the secretary agents. E.g., "Evaluate your network where 1.0 is 100Gbps InfiniBand and 0.0 is 1Gbps Ethernet." The secretaries will then evaluate the quality of their resources toward the goal and send back scores with reasons/justification for each variable. We can test binary (0/1) responses, specific ranges, and normalized scores (0 to 1). The hub then just needs to evaluate the returned values against its equation. This needs to be a two-step process: first quantitative, then adjustment based on qualitative notes. E.g., maybe a specific filesystem is given 0.5, but the secretary also notes it is undergoing a rebuild, so the hub decides to penalize it.
2. The hub sends the request to the children workers (each a different cluster)
3. Each child worker has a secretary that receives it.
- The secretary has metadata about the cluster that is discovered on startup that does not change (e.g., hardware)
- The secretary is also able to register handles to detailed discovery tools (e.g., for software you'd run `spack find lammps`)
- The secretary makes a call to request state data like queue status
- The secretary also uses the discovery tools to look for the software of choice.
4. Each secretary sends back their response - quantitative scores, plus qualitative reasons.
5. Each secretary has a trust score. It is based on two things:
- The actual discovery of resources is a known truth that is always returned. What the secretary says is compared against that.
- An actual performance of a job can be evaluated against what was promised.
- A trust score can (somehow) go into a future evaluation.
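The two-step evaluation in step 1 can be sketched as a weighted sum over normalized scores, followed by a qualitative penalty pass. The function names and penalty representation are assumptions; the weights stay private to the hub:

```python
def evaluate(proposals, weights, penalties=None):
    """Rank workers: first a quantitative weighted sum of normalized
    scores (0 to 1), then a qualitative adjustment (e.g., penalize a
    filesystem that the secretary notes is mid-rebuild)."""
    penalties = penalties or {}
    ranked = {}
    for worker, scores in proposals.items():
        quantitative = sum(weights[f] * scores.get(f, 0.0) for f in weights)
        ranked[worker] = quantitative - penalties.get(worker, 0.0)
    return max(ranked, key=ranked.get), ranked
```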

### Design Choices

Here are a few design choices (subject to change, of course). I am starting with re-implementing our fractale agents with this framework. For that, instead of agents being tied to specific functions (as classes with their agent functions), we will have a flexible agent class that changes function based on a chosen prompt. It will use mcp functions, prompts, and resources. In addition:

## TODO

- [ ] Full operator with Flux example (Flux operator with HPC apps and jobspec translation)
- [ ] should we be reporting utilization (e.g., mock or nvidia-smi) if it might just be a login node?
- [ ] write a function to compare a reported agent result against the truth? How?
- [ ] need a way to "pass forward" an error from a worker, e.g., that an API key is not set.
- [ ] I want the equivalent of a satisfy endpoint, checking for the negotiate but not dispatching.
- [ ] I also want an equivalent "just submit to this cluster" endpoint.

Idea:

- the mcp-server worker should have a tool that generates a prompt for an agent: "Here is a request for LAMMPS with this many nodes, and here are the resources we see (a call to get_status, which will also be returned to the caller). Can we support it? Use your tools to figure it out." The created agent should then use the tools in the same server it was generated in to answer that question. The response from the agent, plus the status, returns to the hub. The hub can use the weighted equation to decide on a final cluster.
- TODO: ask agent which flux variables we should eliminate.
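The prompt-generating tool from the idea above might be as simple as string assembly (a hypothetical helper, sketched for illustration):

```python
def build_agent_prompt(request, status):
    """Turn a user request plus the worker's own get_status output into
    a prompt for a locally created agent."""
    return (
        f"Here is a request: {request}\n"
        f"Here are the resources we see: {status}\n"
        "Can we support it? Use your tools to figure it out."
    )
```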

## License

6 changes: 1 addition & 5 deletions examples/jobspec/mcpserver.yaml
@@ -1,7 +1,3 @@
tools:
- path: flux_mcp.validate.flux_validate_jobspec
- path: flux_mcp.transformer.transform_jobspec

prompts:
- path: flux_mcp.validate.flux_validate_jobspec_persona
- path: flux_mcp.transformer.transform_jobspec_persona
- path: hpc_mcp.filesystem.filesystem_write_file