AP-25628: add checkpoint/restore (CRaC support in executor)#88

Open
bernd-wiswedel wants to merge 1 commit into master from todo/AP-25628-c-ra-c-po-c-for-faster-executor-statup

Conversation

@bernd-wiswedel
Member

AP-25628 (PoC: "CRaC" for faster executor startup (suspend VM after start))

@bernd-wiswedel bernd-wiswedel requested a review from a team as a code owner March 18, 2026 10:16
@bernd-wiswedel bernd-wiswedel requested review from Copilot and knime-ghub-bot and removed request for a team March 18, 2026 10:16
Contributor

Copilot AI left a comment

Pull request overview

Adds a CRaC (Coordinated Restore at Checkpoint) hook to the KNIME Python gateway tracking layer so that tracked Python processes are terminated before the JVM is checkpointed. The goal is faster executor startup when restoring from a checkpoint.

Changes:

  • Register a PhasedInit callback in PythonGatewayTracker to run cleanup before checkpointing.
  • Reuse existing gateway cleanup logic (clear()) to forcefully terminate tracked Python gateways/processes.
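
The hook pattern described in the changes above can be sketched in isolation as follows. This is a minimal, self-contained illustration and not the code from this PR: `CheckpointHook`, `CheckpointHooks`, and `GatewayTrackerSketch` are hypothetical stand-ins for `PhasedInit`, `PhasedInitSupport`, and `PythonGatewayTracker`, whose real signatures are only partially visible in the diff.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical callback type; an assumption, not the KNIME PhasedInit or org.crac API. */
interface CheckpointHook {
    void beforeCheckpoint();
}

/** Hypothetical registry that a runtime would invoke right before taking a checkpoint. */
final class CheckpointHooks {
    private static final List<CheckpointHook> HOOKS = new ArrayList<>();

    static synchronized void register(CheckpointHook hook) {
        HOOKS.add(hook);
    }

    static synchronized void fireBeforeCheckpoint() {
        for (CheckpointHook hook : HOOKS) {
            hook.beforeCheckpoint();
        }
    }
}

/** Hypothetical tracker that closes every tracked resource before a checkpoint. */
final class GatewayTrackerSketch {
    private final Set<Closeable> openGateways = ConcurrentHashMap.newKeySet();

    GatewayTrackerSketch() {
        // Register cleanup once, at construction time, mirroring the shape of the PR's hook.
        CheckpointHooks.register(() -> {
            try {
                clear();
            } catch (IOException ex) {
                // Log and continue so that a failed close does not abort the checkpoint.
                System.err.println("Error while closing gateways before checkpoint: " + ex);
            }
        });
    }

    void track(Closeable gateway) {
        openGateways.add(gateway);
    }

    int openCount() {
        return openGateways.size();
    }

    /** Closes all tracked gateways; rethrows the first failure after attempting every close. */
    void clear() throws IOException {
        IOException first = null;
        for (Closeable gateway : openGateways) {
            try {
                gateway.close();
            } catch (IOException ex) {
                if (first == null) {
                    first = ex;
                } else {
                    first.addSuppressed(ex);
                }
            }
        }
        openGateways.clear();
        if (first != null) {
            throw first;
        }
    }
}
```

In this sketch the hook logs a failed close instead of rethrowing, which matches the `LOGGER.warn` handling in the quoted diff. Whether a checkpoint should proceed while child processes may still be alive is a design decision for the reviewers.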

Comment on lines +84 to +92
    // Support CRaC (Coordinated Restore at Checkpoint) and close all connections prior checkpointing
    PhasedInitSupport.registerOrActivate(new PhasedInit<RuntimeException>() {
        @Override
        public void beforeCheckpoint() throws RuntimeException {
            try {
                clear();
            } catch (IOException ex) {
                LOGGER.warn("Error when forcefully terminating Python processes during phased initialization", ex);
            }
Comment on lines +85 to +94
    PhasedInitSupport.registerOrActivate(new PhasedInit<RuntimeException>() {
        @Override
        public void beforeCheckpoint() throws RuntimeException {
            try {
                clear();
            } catch (IOException ex) {
                LOGGER.warn("Error when forcefully terminating Python processes during phased initialization", ex);
            }
        }
    });
    try {
        clear();
    } catch (IOException ex) {
        LOGGER.warn("Error when forcefully terminating Python processes during phased initialization", ex);

    private PythonGatewayTracker() {
        m_openGateways = gatewaySet();
        // Support CRaC (Coordinated Restore at Checkpoint) and close all connections prior checkpointing
