NIFI-15698 - Fix Python bridge hang during startup with many Python processors #11002

Open
pvillard31 wants to merge 2 commits into apache:main from pvillard31:NIFI-15698

pvillard31 commented Mar 13, 2026

Summary

NIFI-15698 - Fix Python bridge hang during startup with many Python processors

This one was challenging to work on, and I was unable to come up with a system test that reliably reproduces the issue. It was, however, very easy to reproduce the problem by following the steps in the repository shared by the reporter: https://github.com/distroitt/nifi-bug

I confirmed the issue on the latest release, and confirmed that the fix solves the problem by building the 2.9.0-SNAPSHOT Docker image and running the same tests.

When loading a flow with many Python processors, NiFi can hang during startup or restart and never reach "Started Application". The root cause is virtual thread pinning in NiFiPythonGateway. The four methods that guard the activeInvocations list (beginInvocation, endInvocation, putNewObject, putObject) use synchronized, which pins virtual threads to their carrier threads in JDK 21. During flow synchronization, the main thread and many processor-initialization virtual threads all contend for this single intrinsic lock. Because each waiting virtual thread pins its carrier, the ForkJoinPool carrier threads are quickly exhausted, and no thread can make progress - including the one holding the lock. This change replaces the synchronized methods with a ReentrantLock, which is virtual-thread-friendly: blocked virtual threads yield their carrier thread instead of pinning it.
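As a rough illustration of the locking change described above, here is a minimal sketch of moving from synchronized methods to an explicit ReentrantLock. The class InvocationTrackerSketch, the String invocation identifiers, and activeCount() are hypothetical stand-ins; the real NiFiPythonGateway signatures are not shown in this description, and only two of the four guarded methods are sketched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the gateway's invocation tracking; not the NiFi class.
final class InvocationTrackerSketch {
    // Explicit lock used in place of the object's intrinsic monitor
    private final ReentrantLock lock = new ReentrantLock();
    private final List<String> activeInvocations = new ArrayList<>();

    // Previously this would have been declared "synchronized"
    void beginInvocation(final String invocationId) {
        lock.lock();
        try {
            activeInvocations.add(invocationId);
        } finally {
            lock.unlock(); // always release, even if the body throws
        }
    }

    void endInvocation(final String invocationId) {
        lock.lock();
        try {
            activeInvocations.remove(invocationId);
        } finally {
            lock.unlock();
        }
    }

    // Helper for illustration only
    int activeCount() {
        lock.lock();
        try {
            return activeInvocations.size();
        } finally {
            lock.unlock();
        }
    }
}
```

The lock/try/finally shape keeps the same mutual exclusion as the synchronized version; the difference is only in how a blocked virtual thread waits for the lock.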

The PythonProcess lifecycle has been updated so that a process is only handed out to callers after discoverExtensions() completes. New isReady(), waitUntilReady(), and markReadyAndNotify() methods prevent the main thread or initialization threads from calling into a Python process that is still loading extensions, which was another source of hangs on first start.
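One way such a readiness gate could look is sketched below, assuming a CountDownLatch as the signalling mechanism. The method names follow the description above, but the class ReadinessGateSketch, the timeout parameters, and the boolean return value are assumptions for illustration; the actual PythonProcess implementation may differ.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical readiness gate; not the actual PythonProcess code.
final class ReadinessGateSketch {
    // Released once, after extension discovery has completed
    private final CountDownLatch ready = new CountDownLatch(1);

    boolean isReady() {
        return ready.getCount() == 0;
    }

    // Returns true if the process became ready within the timeout
    boolean waitUntilReady(final long timeout, final TimeUnit unit) {
        try {
            return ready.await(timeout, unit);
        } catch (final InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }

    // Called by the starting thread once discovery has finished
    void markReadyAndNotify() {
        ready.countDown();
    }
}
```

A latch fits here because readiness is a one-way transition: once released, every current and future waiter passes through without blocking.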

The getProcessForNextComponent method in StandardPythonBridge has been restructured to hold the bridge lock only for the decision phase (picking or creating a process), then release it before performing blocking operations like start() and discoverExtensions(). Previously the entire method was synchronized, blocking all other processor creation threads during these slow operations.
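The restructuring described above can be sketched roughly as follows. ProcessPoolSketch, ProcessSketch, the maxProcesses limit, and the round-robin reuse policy are all assumptions made for illustration; only the idea of deciding under the lock and doing the slow work outside it comes from the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of narrowing a lock to the decision phase; not StandardPythonBridge.
final class ProcessPoolSketch {
    static final class ProcessSketch {
        private volatile boolean started;

        // Stands in for the slow start() and discoverExtensions() calls
        void start() {
            started = true;
        }

        boolean isStarted() {
            return started;
        }
    }

    private final ReentrantLock bridgeLock = new ReentrantLock();
    private final List<ProcessSketch> processes = new ArrayList<>();
    private final int maxProcesses; // assumed limit, for illustration
    private int nextIndex;

    ProcessPoolSketch(final int maxProcesses) {
        if (maxProcesses < 1) {
            throw new IllegalArgumentException("maxProcesses must be at least 1");
        }
        this.maxProcesses = maxProcesses;
    }

    ProcessSketch getProcessForNextComponent() {
        final ProcessSketch process;
        boolean created = false;

        // Decision phase: cheap bookkeeping only, done under the lock
        bridgeLock.lock();
        try {
            if (processes.size() < maxProcesses) {
                process = new ProcessSketch();
                processes.add(process);
                created = true;
            } else {
                process = processes.get(nextIndex++ % processes.size());
            }
        } finally {
            bridgeLock.unlock();
        }

        // Blocking phase: runs after the lock is released
        if (created) {
            process.start();
        }
        return process;
    }
}
```

In this sketch a caller that is handed an existing process may receive one that has not finished starting, which is why a readiness check such as the waitUntilReady() method mentioned in the description would still be needed before using it.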

The createProcessorBridge method now receives the already-resolved PythonProcessorDetails from its caller instead of calling getProcessorTypes() again. This eliminates two redundant Python proxy round-trips per processor creation, reducing gateway lock contention during startup.
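The effect of passing already-resolved details down the call chain can be illustrated with a small sketch. BridgeSketch, DetailsSketch, the String return value, and the round-trip counter are hypothetical; they only model the idea that each getProcessorTypes() call costs a proxy round-trip, so the callee should reuse what the caller has already looked up.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; the real StandardPythonBridge signatures are not shown here.
final class BridgeSketch {
    record DetailsSketch(String type, String version) { }

    private final AtomicInteger roundTrips = new AtomicInteger();

    // Stands in for a call through the Python proxy
    List<DetailsSketch> getProcessorTypes() {
        roundTrips.incrementAndGet();
        return List.of(new DetailsSketch("ExampleProcessor", "0.0.1"));
    }

    String createProcessor(final String type, final String version) {
        // Resolve the details once in the caller
        final DetailsSketch details = getProcessorTypes().stream()
                .filter(d -> d.type().equals(type) && d.version().equals(version))
                .findFirst()
                .orElseThrow(() -> new IllegalArgumentException("Unknown processor type: " + type));
        return createProcessorBridge(details);
    }

    // Receives the resolved details instead of calling getProcessorTypes() again
    String createProcessorBridge(final DetailsSketch details) {
        return details.type() + ":" + details.version();
    }

    int roundTripCount() {
        return roundTrips.get();
    }
}
```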

A workaround has been added in ProcessorInspection.py for a CPython 3.11+ bug (gh-95185) where ast.parse() can raise SystemError: AST constructor recursion depth mismatch under concurrent load. The error is caught and the file is treated as a non-processor module so that extension loading continues.

For reference, here is an extract of the thread dump captured while reproducing the issue:


"main" #1 [95] prio=5 os_prio=0 cpu=10197.37ms elapsed=247.51s tid=0x0000ffffa802b610 nid=95 waiting for monitor entry  [0x0000ffffaf3bb000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.nifi.py4j.client.NiFiPythonGateway.endInvocation(NiFiPythonGateway.java:198)
	- waiting to lock <0x00000000d9907b40> (a org.apache.nifi.py4j.client.NiFiPythonGateway)
	at org.apache.nifi.py4j.client.PythonProxyInvocationHandler.invoke(PythonProxyInvocationHandler.java:103)
	at jdk.proxy8.$Proxy73.getProcessorTypes(jdk.proxy8/Unknown Source)
	at org.apache.nifi.py4j.StandardPythonBridge.getProcessorTypes(StandardPythonBridge.java:327)
	at org.apache.nifi.py4j.StandardPythonBridge.createProcessor(StandardPythonBridge.java:162)
	at org.apache.nifi.components.ClassLoaderAwarePythonBridge.createProcessor(ClassLoaderAwarePythonBridge.java:109)
	at org.apache.nifi.controller.ExtensionBuilder.createLoggablePythonProcessor(ExtensionBuilder.java:912)
	at org.apache.nifi.controller.ExtensionBuilder.createLoggableProcessor(ExtensionBuilder.java:792)
	at org.apache.nifi.controller.ExtensionBuilder.buildProcessor(ExtensionBuilder.java:261)
	at org.apache.nifi.controller.flow.StandardFlowManager.createProcessor(StandardFlowManager.java:369)
	at org.apache.nifi.controller.flow.AbstractFlowManager.createProcessor(AbstractFlowManager.java:430)
	at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.addProcessor(StandardVersionedComponentSynchronizer.java:2696)
	at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronizeProcessors(StandardVersionedComponentSynchronizer.java:1197)
	at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:601)
	at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.addProcessGroup(StandardVersionedComponentSynchronizer.java:1387)
	at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronizeChildGroups(StandardVersionedComponentSynchronizer.java:697)
	at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:595)
	at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.lambda$synchronize$10(StandardVersionedComponentSynchronizer.java:394)
	at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer$$Lambda/0x0000000100b4cc78.run(Unknown Source)
	at org.apache.nifi.controller.flow.AbstractFlowManager.withParameterContextResolution(AbstractFlowManager.java:670)
	at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:389)
	at org.apache.nifi.groups.StandardProcessGroup.synchronizeFlow(StandardProcessGroup.java:3895)
	at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:464)
	at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.sync(VersionedFlowSynchronizer.java:221)
	- locked <0x00000000d8a781d0> (a org.apache.nifi.controller.serialization.VersionedFlowSynchronizer)
	at org.apache.nifi.controller.FlowController.synchronize(FlowController.java:1840)
	at org.apache.nifi.persistence.StandardFlowConfigurationDAO.load(StandardFlowConfigurationDAO.java:92)
	- locked <0x00000000d8e388a8> (a org.apache.nifi.persistence.StandardFlowConfigurationDAO)
	at org.apache.nifi.controller.StandardFlowService.loadFromBytes(StandardFlowService.java:775)
	at org.apache.nifi.controller.StandardFlowService.load(StandardFlowService.java:497)
...

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull request contains commits signed with a registered key indicating Verified status

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using ./mvnw clean install -P contrib-check
    • JDK 21
    • JDK 25

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

pvillard31 added the bug and python labels Mar 13, 2026

exceptionfactory left a comment

Thanks for working on this issue @pvillard31.

On an initial review, I'm concerned about the increased complexity of the PythonProcess, extending the contract and exposing more internals. There are some inherent limitations with Python Processors, which prompts some questions about how much complexity to introduce in order to support a larger number of Python Processors.

Although the current behavior is certainly problematic, I would like to give this closer consideration before moving forward with the new locking approach.
