neutron: set a failure-timeout on neutron-ha-tool #2063
dirkmueller wants to merge 1 commit into crowbar:master
  agent "systemd:neutron-l3-ha-service"
  op node[:neutron][:ha][:neutron_l3_ha_resource][:op]
  action :update
  meta ({
Lint/ParenthesesAsGroupedExpression: (...) interpreted as grouped expression. (https://github.com/bbatsov/ruby-style-guide#parens-no-spaces)
aspiers left a comment
The commit message references the l3 agent but the change affects neutron-l3-ha-service. It's not clear to me what the exact problem is or why timing out a failure of neutron-l3-ha-service would address it. I'm guessing there is some missing detail regarding the interaction between the two - please can you clarify in the commit message?
We don't want the neutron-ha-tool service to be stopped after 3 weeks of weekly patching and rebooting the rabbitmq cluster. Set a timeout of a failure if it happened more than 10 minutes ago.
f194469 to 25eab46
@aspiers sorry, fixed the typo. This is about the neutron-l3-ha-service, which randomly but regularly gets stopped by pacemaker because of some sequence of consecutive errors. For example, recently somebody broke keystone for 15 minutes, and that caused pacemaker to stop the service due to repeated failures. This is not helpful for achieving high availability when pacemaker just kills the service that should take care of availability.
OK thanks, that makes sense now. Ideally I would prefer that info to be in the commit message too, since the commit message doesn't feel entirely self-explanatory yet. But the main problem seems to be that the CI is currently failing: I guess that's probably related to this change somehow.
@aspiers commented on March 31, 2019 1:26 PM:
[snipped] I'm going to see if
We don't want the l3 ha tool service to be stopped after 3 weeks of weekly
patching and rebooting the rabbitmq cluster. Set a timeout of a failure
if it happened more than 10 minutes ago.
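
To illustrate the reasoning in the commit message, here is a minimal Ruby sketch of a fail count with and without a failure timeout. It is a simplified model, not Pacemaker code: the `FailureTracker` class, its parameters, and the threshold of 3 (inferred from "3 weeks of weekly patching") are assumptions made for this example. The 600-second timeout corresponds to the 10 minutes named in the commit message.

```ruby
# Simplified model of a fail count with an optional failure timeout.
# Not Pacemaker code; the class and parameter names are hypothetical.
class FailureTracker
  def initialize(threshold:, failure_timeout: nil)
    @threshold = threshold              # failures tolerated before giving up (assumed: 3)
    @failure_timeout = failure_timeout  # seconds; nil means failures never expire
    @count = 0
    @last_failure = nil
  end

  # Record one failure at time `now` (in seconds).
  def record_failure(now)
    expire(now)
    @count += 1
    @last_failure = now
  end

  # True once the unexpired failure count has reached the threshold.
  def stopped?(now)
    expire(now)
    @count >= @threshold
  end

  private

  # Forget all failures if the most recent one is older than the timeout.
  def expire(now)
    return if @failure_timeout.nil? || @last_failure.nil?
    @count = 0 if now - @last_failure >= @failure_timeout
  end
end

WEEK = 7 * 24 * 3600
weekly_failures = [0, WEEK, 2 * WEEK] # one failure per weekly reboot

without_timeout = FailureTracker.new(threshold: 3)
with_timeout = FailureTracker.new(threshold: 3, failure_timeout: 600)

weekly_failures.each do |t|
  without_timeout.record_failure(t)
  with_timeout.record_failure(t)
end

now = 2 * WEEK + 1
puts without_timeout.stopped?(now) # true: three failures accumulated over three weeks
puts with_timeout.stopped?(now)    # false: each earlier failure expired after 10 minutes
```

In this model, isolated failures a week apart never add up, while a burst of failures inside the 10-minute window would still reach the threshold.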