AP-25563: Add diagnostic logging for intermittent Python gateway comm…#85

Merged
chaubold merged 1 commit into master from bug/AP-25563-cannot-obtain-communication
Mar 19, 2026

@HedgehogCode
Contributor

…unication failures

Logs gateway lifecycle events (creation, closure) with thread and identity information to help diagnose the root cause of "Cannot obtain a new communication channel" errors. Tracks gateway ownership through object hash codes for correlation.

  • DefaultPythonGateway.close(): Log PID and calling thread at INFO level
  • PythonScriptingSession: Log gateway hash and thread at creation (INFO) and shutdown (ERROR)
  • PythonGatewayTracker.clear(): Log process count and triggering thread at ERROR level
  • QueuedPythonGatewayFactory: Log eviction count and thread at gate-close (WARN)
  • PythonGatewayCreationGate: Include thread name in P2 phase event logs (INFO)
  • PythonScriptNodeModel: Handle no-cause "Cannot obtain" variant with improved error message

When this error occurs again, correlating gateway hash and PID across log entries will reveal which code path triggered the unexpected shutdown.
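The kind of lifecycle log line described above could look roughly like the following sketch. It is not the code from this PR: `TrackedGateway` is a hypothetical stand-in class, `java.util.logging` replaces the KNIME logger so the example is self-contained, and the use of `System.identityHashCode` for the "gateway hash" is an assumption.

```java
import java.util.logging.Logger;

/** Hypothetical stand-in for a gateway; not a KNIME class. */
final class TrackedGateway implements AutoCloseable {

    private static final Logger LOGGER = Logger.getLogger(TrackedGateway.class.getName());

    private final long m_pid;

    TrackedGateway(final long pid) {
        m_pid = pid;
        LOGGER.info(describe("Created"));
    }

    /** Builds one log line carrying identity hash, PID, and calling thread. */
    String describe(final String event) {
        return String.format("%s gateway (hash=%08x, PID=%d) from thread '%s'", event,
            System.identityHashCode(this), m_pid, Thread.currentThread().getName());
    }

    @Override
    public void close() {
        // Same hash and PID as at creation, so both entries can be matched later.
        LOGGER.info(describe("Closing"));
    }
}
```

Because the identity hash stays constant for the lifetime of the object, the creation and close entries of one gateway share the same `hash=` token even when they are written from different threads.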

AP-25563 (Investigate "Cannot obtain a new communication channel" Python failures)

@chaubold chaubold force-pushed the bug/AP-25563-cannot-obtain-communication branch from 9af8607 to 12b3548 Compare March 19, 2026 09:20
@chaubold chaubold force-pushed the bug/AP-25563-cannot-obtain-communication branch from 12b3548 to ad8b02f Compare March 19, 2026 09:47
@sonarqubecloud

Quality Gate failed

Failed conditions
15.9% Coverage on New Code (required ≥ 85%)

See analysis details on SonarQube Cloud

@chaubold chaubold marked this pull request as ready for review March 19, 2026 11:18
@chaubold chaubold requested a review from a team as a code owner March 19, 2026 11:18
@chaubold chaubold requested review from Copilot and knime-ghub-bot and removed request for a team March 19, 2026 11:18
@chaubold chaubold merged commit 100b2df into master Mar 19, 2026
2 of 3 checks passed
@chaubold chaubold deleted the bug/AP-25563-cannot-obtain-communication branch March 19, 2026 11:18
Contributor

Copilot AI left a comment

Pull request overview

Adds diagnostic logging and improved error messaging across the KNIME Java↔Python (Py4J) gateway lifecycle to help trace intermittent “Cannot obtain a new communication channel” failures by correlating gateway identity, PID, and thread context.

Changes:

  • Add structured lifecycle logs (thread, gateway hash, PID, eviction counts) across gateway creation/closure and installation gating.
  • Improve handling of the no-cause “Cannot obtain a new communication channel” Py4J exception with a clearer KNIMEException message and resolutions.
  • Enhance installation-phase logs and tracker diagnostics to better identify who/what triggered gateway shutdown/termination.
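The second item, detecting the no-cause variant of the error, might be sketched as follows. This is an illustration under assumptions: `ChannelErrorClassifier` is a hypothetical helper, plain `Throwable` stands in for the Py4J and KNIME exception types, and the replacement message text is invented for the example rather than taken from the PR.

```java
/** Hypothetical helper; the exception types used by the real node model may differ. */
final class ChannelErrorClassifier {

    static final String CHANNEL_MESSAGE = "Cannot obtain a new communication channel";

    private ChannelErrorClassifier() {
    }

    /** True if the exception carries the channel message and has no cause attached. */
    static boolean isNoCauseChannelError(final Throwable t) {
        final String message = t.getMessage();
        return t.getCause() == null && message != null && message.contains(CHANNEL_MESSAGE);
    }

    /** Maps the no-cause variant to a clearer message; other errors pass through unchanged. */
    static String toUserMessage(final Throwable t) {
        if (isNoCauseChannelError(t)) {
            // Illustrative wording only, not the message added in this PR.
            return "The connection to the Python process was closed unexpectedly. "
                + "Please re-execute the node.";
        }
        return String.valueOf(t.getMessage());
    }
}
```

Checking for a missing cause separately keeps the variant with a nested exception on its existing code path, where the cause already explains the failure.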

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| org.knime.python3/src/main/java/org/knime/python3/QueuedPythonGatewayFactory.java | Logs eviction count/thread when the gateway creation gate closes before evicting queued gateways. |
| org.knime.python3/src/main/java/org/knime/python3/PythonGatewayTracker.java | Adds process count + triggering thread to the “aborting running Python processes” log entry. |
| org.knime.python3/src/main/java/org/knime/python3/PythonGatewayCreationGate.java | Adds thread name to P2 phase transition INFO logs controlling gateway creation blocking/unblocking. |
| org.knime.python3/src/main/java/org/knime/python3/DefaultPythonGateway.java | Logs PID + calling thread when closing a Python gateway. |
| org.knime.python3.scripting.nodes/src/main/java/org/knime/python3/scripting/nodes2/PythonScriptingSession.java | Logs gateway identity hash on creation; adds targeted ERROR diagnostics when CallbackClient is already shut down. |
| org.knime.python3.scripting.nodes/src/main/java/org/knime/python3/scripting/nodes2/PythonScriptNodeModel.java | Adds a dedicated handler branch for the no-cause “Cannot obtain…” error with a clearer user-facing message. |
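Once entries from these files appear in a log, correlating them by PID could be done along the lines of the sketch below. `LogCorrelator` is a hypothetical helper and not part of the PR; it only assumes that relevant lines contain a `PID=<digits>` token, as in the `Closing PythonGateway (PID=%s)` format string quoted in the review comments.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical log post-processing helper; not part of the PR. */
final class LogCorrelator {

    // Assumes the PID appears as "PID=<digits>" somewhere in the line.
    private static final Pattern PID = Pattern.compile("PID=(\\d+)");

    private LogCorrelator() {
    }

    /** Groups log lines by the PID they mention, keeping first-seen order. */
    static Map<String, List<String>> groupByPid(final List<String> lines) {
        final Map<String, List<String>> byPid = new LinkedHashMap<>();
        for (final String line : lines) {
            final Matcher m = PID.matcher(line);
            if (m.find()) {
                byPid.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(line);
            }
        }
        return byPid;
    }
}
```

Lines without a PID token are skipped, so the result contains one bucket per Python process with its entries in log order.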

Comment on lines +220 to +226
    var gatewaysToEvict = m_gateways.values().stream()//
        .flatMap(Collection::stream)//
        .collect(Collectors.toList());
    LOGGER.warnWithFormat(
        "PythonGatewayCreationGate closed: evicting %d queued gateways from thread '%s'",
        gatewaysToEvict.size(), Thread.currentThread().getName());
    evictGateways(gatewaysToEvict);

    @Override
    public void close() throws IOException {
        if (m_clientServer != null) {
            LOGGER.infoWithFormat("Closing PythonGateway (PID=%s) from thread '%s'", m_pid,

Comment on lines +185 to +186
    LOGGER.info("Blocking Python process startup during installation (thread='"
        + Thread.currentThread().getName() + "')");

        + "If this leads to failures in node execution, "
        + "please restart those nodes once the installation has finished");
    LOGGER.errorWithFormat(
        "Found running Python processes (%d). Aborting them to allow installation process. "
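The eviction snippet from `QueuedPythonGatewayFactory` quoted above first flattens all queued gateways into one list so that the count can be logged before eviction. A self-contained sketch of that pattern, with the hypothetical class `EvictionSketch`, a generic element type in place of the real gateway type, and a returned string in place of the KNIME logger call:

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical illustration of the flatten-then-log pattern; not the factory itself. */
final class EvictionSketch {

    private EvictionSketch() {
    }

    /** Flattens all per-key queues into a single list snapshot. */
    static <K, G> List<G> collectAll(final Map<K, List<G>> queues) {
        return queues.values().stream()
            .flatMap(Collection::stream)
            .collect(Collectors.toList());
    }

    /** Formats the warning using the format string quoted in the review comment. */
    static String evictionMessage(final int count, final String threadName) {
        return String.format(
            "PythonGatewayCreationGate closed: evicting %d queued gateways from thread '%s'",
            count, threadName);
    }
}
```

Taking the snapshot before logging means the reported count and the set of gateways that are actually evicted come from the same list.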