want to read PMBus status registers via MGS #2463

This is motivated by two different needs:

1. we recently had a rectifier die on us at a customer site ([customer-support#898](https://github.com/oxidecomputer/customer-support/issues/898)), and we could not determine whether it was so dead that it no longer responded to PMBus at all, because it was in a state where the PSC sequencer does not automatically attempt to read and record the PMBus status. It would have been nice to be able to use faux-mgs to talk to it while debugging.
2. for automated fault management in the control plane, reading the status registers of a PMBus device is an important [health endpoint (as described in RFD 589)](https://rfd.shared.oxide.computer/rfd/0589#_health_endpoints) for that device.

It would be nice if there were a way to read the contents of all of the standard PMBus status registers over the management network. Ideally, this would happen automagically via codegen for any device whose declaration in the TOML file includes:
```toml
power = { /* ... */ pmbus = true }
```
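For reference, here is a rough sketch of what "the standard PMBus status registers" could look like as a Rust type. The command codes and sizes are the ones defined by the PMBus specification; the `StatusRegister` name and its methods are made up for illustration and are not an existing API, and any given device may implement only a subset of these registers.

```rust
/// The standard PMBus status registers.
/// Hypothetical type for illustration; not part of any existing crate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StatusRegister {
    Byte,
    Word,
    Vout,
    Iout,
    Input,
    Temperature,
    Cml,
    Other,
    MfrSpecific,
    Fans12,
    Fans34,
}

impl StatusRegister {
    /// Every standard status register, in command-code order.
    pub const ALL: [StatusRegister; 11] = [
        Self::Byte,
        Self::Word,
        Self::Vout,
        Self::Iout,
        Self::Input,
        Self::Temperature,
        Self::Cml,
        Self::Other,
        Self::MfrSpecific,
        Self::Fans12,
        Self::Fans34,
    ];

    /// The PMBus command code used to read this register.
    pub const fn command_code(self) -> u8 {
        match self {
            Self::Byte => 0x78,        // STATUS_BYTE
            Self::Word => 0x79,        // STATUS_WORD
            Self::Vout => 0x7A,        // STATUS_VOUT
            Self::Iout => 0x7B,        // STATUS_IOUT
            Self::Input => 0x7C,       // STATUS_INPUT
            Self::Temperature => 0x7D, // STATUS_TEMPERATURE
            Self::Cml => 0x7E,         // STATUS_CML
            Self::Other => 0x7F,       // STATUS_OTHER
            Self::MfrSpecific => 0x80, // STATUS_MFR_SPECIFIC
            Self::Fans12 => 0x81,      // STATUS_FANS_1_2
            Self::Fans34 => 0x82,      // STATUS_FANS_3_4
        }
    }

    /// Size of the register in bytes: STATUS_WORD is two bytes, the rest one.
    pub const fn size_bytes(self) -> usize {
        match self {
            Self::Word => 2,
            _ => 1,
        }
    }
}
```

Reading all eleven registers for one rail is 12 bytes of payload, which gives a feel for how much extra I2C traffic a full status read adds per rail.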

This could either be done using a new `ComponentDetails` entry for PMBus status, or a separate MGS API that's PMBus-specific. I'm a bit on the fence as to which is better. Personally, I think hanging it off of component details might be a bit of a shame, as it means we would do a bunch more I2C traffic when trying to access _other_ component details from that device: every PMBus thing already has sensor data in its `ComponentDetails` as well. But it is kinda the standard interface for most of this sort of thing.

Since status registers are paged in PMBus, we would want to make it clear to the caller which rail a status register value refers to. This could be done either by having the control plane _request_ a specific rail when reading, or by having the caller read by device ID, which would read every rail on the device and include the rail names in the response. Doing it by rail kinda seems nicer, especially since ereports for power-related faults will include the rail name, but this would require `control-plane-agent` to maintain an additional index of devices by rail name in addition to devices by refdes, which probably eats up a bunch more flash/RAM, and means we can't just hang it off `ComponentDetails`.
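To make the two options concrete, here is a minimal sketch of both shapes. Everything in it is hypothetical: the `PmbusDevice` trait is a stand-in for whatever I2C access the SP actually uses, `RailStatus` is not an existing message type, and it assumes that page `i` on a device corresponds to `rails[i]`, which a real implementation would take from the board's TOML rather than assume. It uses `Vec` for brevity, so it is not `no_std`-shaped. Only the `PAGE` (0x00) and `STATUS_WORD` (0x79) command codes come from the PMBus specification.

```rust
/// PMBus PAGE command: selects which rail later commands address.
const PAGE: u8 = 0x00;
/// PMBus STATUS_WORD command.
const STATUS_WORD: u8 = 0x79;

/// Hypothetical bus abstraction; not the real Hubris I2C API.
pub trait PmbusDevice {
    type Error;
    fn write_byte(&mut self, command: u8, value: u8) -> Result<(), Self::Error>;
    fn read_word(&mut self, command: u8) -> Result<u16, Self::Error>;
}

/// One rail's status, tagged with the rail name so the caller can match it
/// against the rail named in an ereport.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RailStatus {
    pub rail: &'static str,
    pub status_word: u16,
}

/// Option A: read by device, returning every rail with its name attached.
/// Assumes `rails[i]` names PMBus page `i`.
pub fn read_all_rails<D: PmbusDevice>(
    dev: &mut D,
    rails: &[&'static str],
) -> Result<Vec<RailStatus>, D::Error> {
    let mut out = Vec::with_capacity(rails.len());
    for (page, &rail) in rails.iter().enumerate() {
        dev.write_byte(PAGE, page as u8)?;
        let status_word = dev.read_word(STATUS_WORD)?;
        out.push(RailStatus { rail, status_word });
    }
    Ok(out)
}

/// Option B: the caller requests one rail by name. Returns `Ok(None)` if the
/// device has no rail with that name.
pub fn read_rail<D: PmbusDevice>(
    dev: &mut D,
    rails: &[&'static str],
    wanted: &str,
) -> Result<Option<RailStatus>, D::Error> {
    let Some(page) = rails.iter().position(|&r| r == wanted) else {
        return Ok(None);
    };
    dev.write_byte(PAGE, page as u8)?;
    let status_word = dev.read_word(STATUS_WORD)?;
    Ok(Some(RailStatus { rail: rails[page], status_word }))
}

/// Test double that returns a canned STATUS_WORD for each page.
pub struct FakeDevice {
    pub page: u8,
    pub status_by_page: Vec<u16>,
}

impl PmbusDevice for FakeDevice {
    type Error = ();

    fn write_byte(&mut self, command: u8, value: u8) -> Result<(), ()> {
        if command == PAGE && (value as usize) < self.status_by_page.len() {
            self.page = value;
            Ok(())
        } else {
            Err(())
        }
    }

    fn read_word(&mut self, command: u8) -> Result<u16, ()> {
        if command == STATUS_WORD {
            Ok(self.status_by_page[self.page as usize])
        } else {
            Err(())
        }
    }
}
```

Note that option B here only searches the rails of a single, already-selected device. Looking a rail up across _all_ devices is where the extra rail-name index would be needed.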

Personally, I think it would also make sense to add a new [`DeviceCapabilities`](https://github.com/oxidecomputer/management-gateway-service/blob/177c9c719e12896c566a1b6b5416c9bc686531d3/gateway-messages/src/sp_to_mgs.rs#L903-L913) flag for PMBus things, to advertise that they have the standard set of PMBus status registers in addition to sensor values.
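As a sketch of the idea only: the type below is a hand-rolled stand-in for a capability bitmask, not the real `DeviceCapabilities` definition, and both the flag names and the bit positions are invented for illustration. A real change would add a flag to the existing type linked above and pick a bit that is actually free.

```rust
/// Hypothetical capability bitmask; bit positions are illustrative and are
/// not copied from `gateway-messages`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Capabilities(pub u32);

impl Capabilities {
    pub const NONE: Self = Self(0);
    /// Device exposes sensor values (hypothetical flag).
    pub const HAS_SENSOR_VALUES: Self = Self(1 << 0);
    /// Device exposes the standard PMBus status registers (the proposed flag).
    pub const HAS_PMBUS_STATUS: Self = Self(1 << 1);

    /// Returns the set containing the flags of both operands.
    pub const fn union(self, other: Self) -> Self {
        Self(self.0 | other.0)
    }

    /// Returns true if every flag in `other` is also set in `self`.
    pub const fn contains(self, other: Self) -> bool {
        self.0 & other.0 == other.0
    }
}

/// What codegen might compute for a device, given whether it has sensors and
/// whether its TOML declaration sets `pmbus = true`.
pub fn capabilities_for(has_sensors: bool, pmbus: bool) -> Capabilities {
    let mut caps = Capabilities::NONE;
    if has_sensors {
        caps = caps.union(Capabilities::HAS_SENSOR_VALUES);
    }
    if pmbus {
        caps = caps.union(Capabilities::HAS_PMBUS_STATUS);
    }
    caps
}
```

With something like this, a caller could check `contains(Capabilities::HAS_PMBUS_STATUS)` before asking a device for status registers, rather than probing and handling an error.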

