cosmo-b fan ramp not quick enough when baseline temperature is high #2361

@leftwo

We saw an issue where a cosmo-b sled in the london racklette could not ramp its fans fast enough to keep the CPU from hitting overtemp.
The ambient temperature in the fridge (machine room) that contained the cosmo-b measured around 80°F.
The ambient had come down from the mid-90s earlier in the day, so components were probably still pretty warm.

We started a heavy write I/O workload (raw zvol init) on 10 disks at the same time. The CPU got too hot and the SP powered off the cosmo.

  11 1112     3610        1 CriticalDueTo { sensor_id: SensorId(0x37), temperature: Celsius(90.142) }
  12 1120     3610        1 AutoState(Overheated)

The full ringbuf thermal output (thanks to angela, because I can't get the syntax right):

angela@castle ~ $ humility -i fe80::aa40:25ff:fe05:700%london_sw1tp0 -a /staff/alan/image-for-dublin/extracted-ls6/repo/targets/9ddee3e4d85411efaffec562e20452ad309498b237cf06f3ec60986c9a7ad40c.switch_sp-sidecar-b-1.0.56.tar.gz ringbuf thermal
humility: connecting to fe80::aa40:25ff:fe05:700%15
humility: ring buffer drv_i2c_devices::emc2305::__RINGBUF in thermal:
humility: ring buffer drv_i2c_devices::max31790::__RINGBUF in thermal:
humility: ring buffer task_thermal::__RINGBUF in thermal:
   TOTAL VARIANT
    4450 ControlPwm
      29 AutoState(Boot)
       7 AutoState(Running)
       1 AutoState(Overheated)
       1 AutoState(Uncontrollable)
      25 AddedDynamicInput
       8 FanAdded
       5 RemovedDynamicInput
       3 PowerModeChanged
       2 FanControllerInitialized
       1 Start
       1 ThermalMode(Auto)
       1 CriticalDueTo
       1 PowerDownAt
       1 SetFanWatchdogOk
 NDX LINE      GEN    COUNT PAYLOAD
  30 1065        5        1 AutoState(Running)
  31 1206        5        1 ControlPwm(0x33)
   0 1206        6        1 ControlPwm(0x34)
   1 1206        6        1 ControlPwm(0x33)
   2 1206        6        6 ControlPwm(0x34)
   3 1206        6        6 ControlPwm(0x35)
   4 1206        6        7 ControlPwm(0x36)
   5 1206        6        7 ControlPwm(0x37)
   6 1206        6        5 ControlPwm(0x38)
   7 1206        6        1 ControlPwm(0x39)
   8 1206        6        1 ControlPwm(0x38)
   9 1206        6        6 ControlPwm(0x39)
  10 1206        6        4 ControlPwm(0x3a)
  11 1112        6        1 CriticalDueTo { sensor_id: SensorId(0x19), temperature: Celsius(70) }
  12 1120        6        1 AutoState(Overheated)
  13 1206        6       61 ControlPwm(0x64)
  14 1191        6        1 AutoState(Uncontrollable)
  15 1210        6        1 PowerDownAt(0x7365c)
  16  964        6        1 PowerModeChanged(PowerBitmask(0b1))
  17  788        6        1 AutoState(Boot)
  18 1065        6        1 AutoState(Running)
  19 1206        6        6 ControlPwm(0xa)
  20 1206        6        6 ControlPwm(0x9)
  21 1206        6        7 ControlPwm(0x8)
  22 1206        6        8 ControlPwm(0x7)
  23 1206        6        9 ControlPwm(0x6)
  24 1206        6        9 ControlPwm(0x5)
  25 1206        6        9 ControlPwm(0x4)
  26 1206        6        8 ControlPwm(0x3)
  27 1206        6       10 ControlPwm(0x2)
  28 1206        6        9 ControlPwm(0x1)
  29 1206        6     3907 ControlPwm(0x0)
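
To make the ramp in the dump above easier to read, here is a minimal Python sketch that decodes the `ControlPwm` entries leading up to the overheat. It rests on two assumptions that are not stated in this issue: that the `ControlPwm` payload is a fan duty cycle in percent printed in hex (so `0x64` is 100), and that the COUNT column is the number of consecutive identical entries. The `pwm_steps` helper and the `RINGBUF` excerpt are illustrative names, not part of humility.

```python
import re

# Entries copied from the ringbuf dump in this issue: the ramp that preceded
# the overheat, through the jump to the maximum value.
RINGBUF = """\
  31 1206        5        1 ControlPwm(0x33)
   0 1206        6        1 ControlPwm(0x34)
   1 1206        6        1 ControlPwm(0x33)
   2 1206        6        6 ControlPwm(0x34)
   3 1206        6        6 ControlPwm(0x35)
   4 1206        6        7 ControlPwm(0x36)
   5 1206        6        7 ControlPwm(0x37)
   6 1206        6        5 ControlPwm(0x38)
   7 1206        6        1 ControlPwm(0x39)
   8 1206        6        1 ControlPwm(0x38)
   9 1206        6        6 ControlPwm(0x39)
  10 1206        6        4 ControlPwm(0x3a)
  11 1112        6        1 CriticalDueTo { sensor_id: SensorId(0x19), temperature: Celsius(70) }
  12 1120        6        1 AutoState(Overheated)
  13 1206        6       61 ControlPwm(0x64)
"""

# Columns: NDX LINE GEN COUNT PAYLOAD
ENTRY = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(.*)$")
PWM = re.compile(r"ControlPwm\(0x([0-9a-fA-F]+)\)")


def pwm_steps(text):
    """Return (pwm_value, repeat_count) for each ControlPwm entry, in order."""
    steps = []
    for line in text.splitlines():
        entry = ENTRY.match(line)
        if not entry:
            continue
        count, payload = int(entry.group(4)), entry.group(5).strip()
        pwm = PWM.fullmatch(payload)
        if pwm:  # skip CriticalDueTo, AutoState, and other non-PWM entries
            steps.append((int(pwm.group(1), 16), count))
    return steps


steps = pwm_steps(RINGBUF)
for value, count in steps:
    print(f"PWM {value:3d} x{count}")

# Everything except the final entry is the ramp before the critical event.
ramp_updates = sum(count for _, count in steps[:-1])
print(f"ramp: {steps[0][0]} -> {steps[-2][0]} over {ramp_updates} logged updates")
```

Under those assumptions, the output moved only from 51 to 58 across 46 logged updates before `CriticalDueTo` fired, and went to 100 only after the state had already changed to `Overheated`.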
