Skip to content

Remove NACKs from IPC code#4041

Open
clumens wants to merge 20 commits intoClusterLabs:mainfrom
clumens:remove-naks
Open

Remove NACKs from IPC code#4041
clumens wants to merge 20 commits intoClusterLabs:mainfrom
clumens:remove-naks

Conversation

@clumens
Copy link
Copy Markdown
Contributor

@clumens clumens commented Jan 29, 2026

No description provided.

No other daemon behaves this way on unknown message, and this isn't old
code.  I just added it in 7d56d46 without mentioning any reasoning
behind it.  I can't find any clients that are expecting a NACK in this
case so we should be good to just change this.

Ref T991
It's possible for execd to reply with an ACK message with an error code.
This happens if execd receives an invalid message - this could be a
programming error, or it could potentially be a problem in transmission.

The client code was only incidentally handling these messages.  Before,
it would fall through to the "Invalid registration message" case and be
handled that way.  However, because it's part of the protocol for the
server to respond with a NACK on errors, it seems more correct to me for
us to explicitly handle this message first thing.

Note that this doesn't change any behavior.  Before the client would get
a message it didn't expect and the IPC connection would fail.  Now the
client gets an error message and the IPC connection will fail.  I just
like the more defined behavior.

Ref T991
It's confusing to have both ACKs and NACKs in the IPC protocol when both
of those can also include an error code.  Most daemons do not send a
NACK anyway, and most (all?) clients aren't set up to handle it.  In the
interest of simplifying the IPC code a bit, I'm going to remove NACKs so
clients can always expect an ACK and read the error code to know what to
do.

Ref T991
There's no doubt more that could be done along these lines, but for the
moment this will have to do.  I need to add some additional error
handling into this function so it makes sense to unindent to make some
room.
It's possible for based to reply with an ACK message with an error code.
This happens if based receives an invalid message - this could be a
programming error, or it could potentially be a problem in transmission.

The client code was only incidentally handling these messages.  Before,
it would fall through to the "Reply to the CIB registration message has
unknown type" case and be handled that way.  However, because it's part
of the protocol for the server to respond with a NACK on errors, it
seems more correct to me for us to explicitly handle this message first
thing.

Note that this doesn't change any behavior.  Before the client would get
a message it didn't expect and the IPC connection would fail.  Now the
client gets an error message and the IPC connection will fail.  I just
like the more defined behavior.

Ref T991
* Change the tag that pcmk__log_xml_trace logs with to be the same as in
  cib_native_signon.

* Use g_clear_pointer when cleaning up the answer variable.

* Unindent the code at the end of the function.  I'm going to be adding
  additional error handling here too, so it's nice to have some extra
  room to do so.
This is just like previous patches, but applies to signing on to a
remote CIB instead.

Ref T991
It's confusing to have both ACKs and NACKs in the IPC protocol when both
of those can also include an error code.  Most daemons do not send a
NACK anyway, and most (all?) clients aren't set up to handle it.  In the
interest of simplifying the IPC code a bit, I'm going to remove NACKs so
clients can always expect an ACK and read the error code to know what to
do.

Ref T991
I'm going to be adding some additional error handling in this function,
so it'll be nice to have some room to do so.  This only unindents the
first level of the end of this function.  There's more to do, but it's
easier to follow if it's broken up into two patches.
It's possible for fenced to reply with an ACK message with an error
code.  This happens if fenced receives an invalid message - this could
be a programming error, or it could potentially be a problem in
transmission.

The client code was only incidentally handling these messages.  Before,
it would fall through to the "invalid reply type" case and be handled
that way.  However, because it's part of the protocol for the server to
respond with a NACK on errors, it seems more correct to me for us to
explicitly handle this message first thing.

Note that this doesn't change any behavior.  Before the client would get
a message it didn't expect and the IPC connection would fail.  Now the
client gets an error message and the IPC connection will fail.  I just
like the more defined behavior.

Ref T991
It's confusing to have both ACKs and NACKs in the IPC protocol when both
of those can also include an error code.  Most daemons do not send a
NACK anyway, and most (all?) clients aren't set up to handle it.  In the
interest of simplifying the IPC code a bit, I'm going to remove NACKs so
clients can always expect an ACK and read the error code to know what to
do.

Ref T991
I've been testing the handling of invalid messages by making sure
execd_invalid_msg always returns true.  This will trigger for the very
first message a daemon receives, which makes it pretty easy to
experiment with.  In this case, the first message is register:

(pcmk__remote_message_xml@remote.c:358)  trace: [remote msg]   <lrmd_command t="lrmd" lrmd_op="register" lrmd_clientname="pacemaker-remote-rhel9-scratch-4:3121" lrmd_protocol_ version="1.2" lrmd_is_ipc_provider="true" lrmd_remote_msg_id="1" lrmd_remote_msg_type="request"/>

In real life, we could be getting an invalid message by something being
garbled or a misbehaving client.

Tracing through the code that got us to the above log message, you'll
see that the remote message was almost certainly received via the TLS
channel (it could also be a TCP socket, but I haven't seen anything use
that).

However, ACKs/NACKs are an IPC mechanism, which happens on the local
system only and uses completely different communications channels.
There is a way to proxy IPC so that it's retransmitted to another
system, but we only proxy IPC if the received message has
lrmd_op="ipc_fwd".

In this case, the local system is the end point of the invalid message.
We would also be originating the ACK/NACK, which would mean we'd need to
be able to construct some sort of fake proxied message and put it in the
TLS or TCP channel.  I guess it's possible we could do this, but it
seems like a lot of work when the client isn't even expecting it (see
handle_remote_msg).

This was just recently added in 7acc4f3 and has not been in any
release, so there's no mixed version upgrade concerns here.

Ref T991
That way, we don't have to free it if we're just going to return right
away.
Due to the if/else block, in all these log messages, op will be
"request".  This means all these log messages say something along the
lines of "Relayed request request 1 from...".
* Make the comment explaining why we're sending a NACK a little more
  verbose so we can understand it better in the future.

* Return after sending the NACK, allowing the code under it to be
  unindented.  There's tons more of this that could be done, but I'm
  not going to worry about that right now.

Ref T991
…_ack_as.

All ACKs are now created with PCMK__XE_ACK so this argument serves no
purpose.

Ref T991
…_ack.

All ACKs are now created with PCMK__XE_ACK so this argument serves no
purpose.

Ref T991
…ck_as.

All ACKs are now created with PCMK__XE_ACK so this argument serves no
purpose.

Ref T991
All ACKs are now created with PCMK__XE_ACK so this argument serves no
purpose.

Fixes T991
@clumens clumens requested a review from nrwahl2 January 29, 2026 20:05
@nrwahl2 nrwahl2 added the review: in progress PRs that are currently being reviewed label Jan 29, 2026
Copy link
Copy Markdown
Contributor

@nrwahl2 nrwahl2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Very minor feedback


rc = pcmk__xe_get_int(reply, PCMK_XA_STATUS, &status);

if ((rc == pcmk_rc_ok) && (status != 0)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe compare against CRM_EX_OK instead of 0?

Seems like there should also be an error if rc != pcmk_rc_ok, but it may not be worth handling that, since it should be impossible.

On that note, it should also be impossible to receive a NACK with a zero (CRM_EX_OK) status code. It's always CRM_EX_PROTOCOL. (Edit: this is addressed by a comment as of the "Don't return a NACK on invalid execd messages" commit.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like there should also be an error if rc != pcmk_rc_ok, but it may not be worth handling that, since it should be impossible.

I think the only problem here really worth considering is what happens if there's no PCMK_XA_STATUS attribute. I don't think any other problems can come up, aside from network transmission problems which I expect would error out somewhere else first.

It is possible to not have a PCMK_XA_STATUS attribute. Look for PCMK__XE_ACK in crmd_remote_proxy_cb. I would prefer to not think too hard about the proxy code and where these status-less ACKs can end up, so perhaps it's worth adding a check to everywhere I've written this code in this PR.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible to not have a PCMK_XA_STATUS attribute. Look for PCMK__XE_ACK in crmd_remote_proxy_cb.

Huh, yeah you're right. I assumed that all the ACKs got created by pcmk__ipc_create_ack_as(). Silly me.

so perhaps it's worth adding a check to everywhere I've written this code in this PR.

I don't even know what kind of check you'd want to add. The error will be ENXIO if no PCMK_XA_STATUS attribute is set, but I don't love the idea of having to check for that specifically. And I don't know how we'd want to handle a status-less ACK.

  • Falling through on status-less ACK instead of treating it as an error seems to make sense.
  • If for some reason PCMK_XA_STATUS is set and we fail to parse it (which should never happen in practice), then it seems like that should be an error. But it complicates the code for something that, again, should never happen.

If you want to ignore this thread and move on, that's fine.

*
* NOTE: At the moment, all ACK messages sent in the handshake process
* will have an error status. However, this may change in the future so
* we'll let those fall through to the rest of the message handling below
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: "let those fall through" is a tiny bit ambiguous when I think we mean non-error ACK messages (same for other daemons)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This made sense to me when I wrote it. It makes far less sense now. I think what I'm getting at is that if we get an ACK message here, the only place it could have come from is the call to pcmk__ipc_send_ack in fenced_ipc_dispatch, and the only possible ACK we can get there will have a CRM_EX_PROTOCOL status. The block under this comment handles that case, and any future ACKs with a non-error status will fall through to the rest of the function.

If that's more clear, I can change the comment. If the comment doesn't add anything, I can just nuke it. Or, I can add some additional handling here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's more clear, I can change the comment.

Yeah, I think that is more clear.

I understand the comment as written, except that "we'll let those fall through" is ambiguous about what "those" refers to. Even for that, I got the idea: in case non-error ACKs become possible in this circumstance in the future, we'll let non-error ACKs fall through to the rest of the message handling.

Whatever you feel in your heart will probably be fine.

@nrwahl2 nrwahl2 removed the review: in progress PRs that are currently being reviewed label Mar 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants