We saw an issue where a cosmo-b sled in the london racklette was not able to ramp its fans fast enough to prevent the CPU from hitting overtemp.
The ambient temperature in the fridge (machine room) that contained the cosmo-b was measured at around 80°F.
The ambient had come down from the mid-90s (°F) earlier in the day, so components were probably still pretty warm.
We started a heavy write IO workload (raw zvol init) on 10 disks at the same time. The CPU temperature climbed until the thermal loop flagged it as critical, and the SP powered off the cosmo. The thermal ring buffer recorded the trip:
11 1112 3610 1 CriticalDueTo { sensor_id: SensorId(0x37), temperature: Celsius(90.142) }
12 1120 3610 1 AutoState(Overheated)
The full ringbuf thermal output (thanks to angela because I can't get the syntax correct):
angela@castle ~ $ humility -i fe80::aa40:25ff:fe05:700%london_sw1tp0 -a /staff/alan/image-for-dublin/extracted-ls6/repo/targets/9ddee3e4d85411efaffec562e20452ad309498b237cf06f3ec60986c9a7ad40c.switch_sp-sidecar-b-1.0.56.tar.gz ringbuf thermal
humility: connecting to fe80::aa40:25ff:fe05:700%15
humility: ring buffer drv_i2c_devices::emc2305::__RINGBUF in thermal:
humility: ring buffer drv_i2c_devices::max31790::__RINGBUF in thermal:
humility: ring buffer task_thermal::__RINGBUF in thermal:
TOTAL VARIANT
4450 ControlPwm
29 AutoState(Boot)
7 AutoState(Running)
1 AutoState(Overheated)
1 AutoState(Uncontrollable)
25 AddedDynamicInput
8 FanAdded
5 RemovedDynamicInput
3 PowerModeChanged
2 FanControllerInitialized
1 Start
1 ThermalMode(Auto)
1 CriticalDueTo
1 PowerDownAt
1 SetFanWatchdogOk
NDX LINE GEN COUNT PAYLOAD
30 1065 5 1 AutoState(Running)
31 1206 5 1 ControlPwm(0x33)
0 1206 6 1 ControlPwm(0x34)
1 1206 6 1 ControlPwm(0x33)
2 1206 6 6 ControlPwm(0x34)
3 1206 6 6 ControlPwm(0x35)
4 1206 6 7 ControlPwm(0x36)
5 1206 6 7 ControlPwm(0x37)
6 1206 6 5 ControlPwm(0x38)
7 1206 6 1 ControlPwm(0x39)
8 1206 6 1 ControlPwm(0x38)
9 1206 6 6 ControlPwm(0x39)
10 1206 6 4 ControlPwm(0x3a)
11 1112 6 1 CriticalDueTo { sensor_id: SensorId(0x19), temperature: Celsius(70) }
12 1120 6 1 AutoState(Overheated)
13 1206 6 61 ControlPwm(0x64)
14 1191 6 1 AutoState(Uncontrollable)
15 1210 6 1 PowerDownAt(0x7365c)
16 964 6 1 PowerModeChanged(PowerBitmask(0b1))
17 788 6 1 AutoState(Boot)
18 1065 6 1 AutoState(Running)
19 1206 6 6 ControlPwm(0xa)
20 1206 6 6 ControlPwm(0x9)
21 1206 6 7 ControlPwm(0x8)
22 1206 6 8 ControlPwm(0x7)
23 1206 6 9 ControlPwm(0x6)
24 1206 6 9 ControlPwm(0x5)
25 1206 6 9 ControlPwm(0x4)
26 1206 6 8 ControlPwm(0x3)
27 1206 6 10 ControlPwm(0x2)
28 1206 6 9 ControlPwm(0x1)
29 1206 6 3907 ControlPwm(0x0)
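
To put a number on how slowly the fans ramped, here is a rough Python sketch that tallies the ControlPwm entries leading up to the first CriticalDueTo row. It rests on assumptions that are not confirmed in this report: that COUNT is the number of consecutive repeats of an entry, that each ControlPwm entry corresponds to one update from the control loop, and that the payload is a PWM duty cycle in percent (so 0x64 would be 100%). The function names and the SAMPLE string are made up for illustration; the rows in SAMPLE are copied from the output above.

```python
import re

# One ring buffer row: NDX LINE GEN COUNT PAYLOAD
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(.+?)\s*$")
PWM = re.compile(r"^ControlPwm\(0x([0-9a-fA-F]+)\)$")

# Rows copied from the ringbuf output in this report.
SAMPLE = """\
30 1065 5 1 AutoState(Running)
31 1206 5 1 ControlPwm(0x33)
0 1206 6 1 ControlPwm(0x34)
1 1206 6 1 ControlPwm(0x33)
2 1206 6 6 ControlPwm(0x34)
3 1206 6 6 ControlPwm(0x35)
4 1206 6 7 ControlPwm(0x36)
5 1206 6 7 ControlPwm(0x37)
6 1206 6 5 ControlPwm(0x38)
7 1206 6 1 ControlPwm(0x39)
8 1206 6 1 ControlPwm(0x38)
9 1206 6 6 ControlPwm(0x39)
10 1206 6 4 ControlPwm(0x3a)
11 1112 6 1 CriticalDueTo { sensor_id: SensorId(0x19), temperature: Celsius(70) }
12 1120 6 1 AutoState(Overheated)
13 1206 6 61 ControlPwm(0x64)
"""


def parse_rows(text):
    """Return (count, payload) for every line that looks like a ring buffer row.

    Header and summary lines do not have four leading integers, so they are skipped.
    """
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append((int(m.group(4)), m.group(5)))
    return rows


def ramp_before_critical(text):
    """Summarize ControlPwm entries seen before the first CriticalDueTo row.

    Assumption: the payload is a duty cycle in percent and COUNT is a repeat count.
    """
    steps = []
    for count, payload in parse_rows(text):
        if payload.startswith("CriticalDueTo"):
            break
        pwm = PWM.match(payload)
        if pwm:
            steps.append((int(pwm.group(1), 16), count))
    if not steps:
        return None
    return {
        "start": steps[0][0],
        "end": steps[-1][0],
        "updates": sum(c for _, c in steps),
    }


if __name__ == "__main__":
    print(ramp_before_critical(SAMPLE))
```

Under those assumptions, the rows above show the PWM moving only from 0x33 (51) to 0x3a (58) across 46 logged ControlPwm updates before CriticalDueTo fired, after which the next entry is 0x64.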