NIFI-15698 - Fix Python bridge hang during startup with many Python processors #11002
Open
pvillard31 wants to merge 2 commits into apache:main
exceptionfactory (Contributor) left a comment:
Thanks for working on this issue @pvillard31.
On an initial review, I'm concerned about the increased complexity of the PythonProcess, extending the contract and exposing more internals. There are some inherent limitations with Python Processors, which prompts some questions about how much complexity to introduce in order to support a larger number of Python Processors.
Although the current behavior is certainly problematic, I would like to give this closer consideration before moving forward with the new locking approach.
Summary
NIFI-15698 - Fix Python bridge hang during startup with many Python processors
This one has been challenging to work on, and I was unable to come up with a system test that reliably reproduces the issue. It was, however, very easy to reproduce the problem by following the steps in the repository shared by the reporter: https://github.com/distroitt/nifi-bug
I confirmed the issue on the latest release, and confirmed that the fix solves the problem by building the 2.9.0-SNAPSHOT Docker image and running the same tests.
When loading a flow with many Python processors, NiFi can hang during startup or restart and never reach "Started Application". The root cause is virtual thread pinning in `NiFiPythonGateway`. The four methods that guard the `activeInvocations` list (`beginInvocation`, `endInvocation`, `putNewObject`, `putObject`) use `synchronized`, which pins virtual threads to their carrier threads in JDK 21. During flow synchronization, the main thread and many processor-initialization virtual threads all contend for this single intrinsic lock. Because each waiting virtual thread pins its carrier, the ForkJoinPool carrier threads are quickly exhausted, and no thread can make progress - including the one holding the lock. This change replaces the `synchronized` methods with a `ReentrantLock`, which is virtual-thread-friendly: blocked virtual threads yield their carrier thread instead of pinning it.
The `PythonProcess` lifecycle has been updated so that a process is only handed out to callers after `discoverExtensions()` completes. New `isReady()`, `waitUntilReady()`, and `markReadyAndNotify()` methods prevent the main thread or initialization threads from calling into a Python process that is still loading extensions, which was another source of hangs on first start.
The `getProcessForNextComponent` method in `StandardPythonBridge` has been restructured to hold the bridge lock only for the decision phase (picking or creating a process), then release it before performing blocking operations like `start()` and `discoverExtensions()`. Previously the entire method was `synchronized`, blocking all other processor creation threads during these slow operations.
The `createProcessorBridge` method now receives the already-resolved `PythonProcessorDetails` from its caller instead of calling `getProcessorTypes()` again. This eliminates two redundant Python proxy round-trips per processor creation, reducing gateway lock contention during startup.
A workaround has been added in `ProcessorInspection.py` for a CPython 3.11+ bug (gh-95185) where `ast.parse()` can raise `SystemError: AST constructor recursion depth mismatch` under concurrent load. The error is caught and the file is treated as a non-processor module so that extension loading continues.
For reference, extract of thread dump when reproducing the issue: [thread dump not preserved]
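As a rough illustration of the locking and readiness changes described in this summary (this is not taken from the PR diff), the Java sketch below holds a `ReentrantLock` only while picking or creating a process, runs the slow startup outside the lock, and makes later callers wait until the process is marked ready. `SketchProcess`, `SketchBridge`, `startAndDiscover()`, and `maxComponentsPerProcess` are hypothetical names invented for this example. Only the method names `isReady()`, `waitUntilReady()`, `markReadyAndNotify()`, and `getProcessForNextComponent` come from the description, and their signatures here are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a Python process; NOT NiFi's PythonProcess class.
class SketchProcess {
    private final CountDownLatch ready = new CountDownLatch(1);
    int componentCount; // read and written only while holding the bridge lock

    // Placeholder for the slow start() and discoverExtensions() work.
    void startAndDiscover() {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted during startup", e);
        }
    }

    // Releases every caller blocked in waitUntilReady().
    void markReadyAndNotify() {
        ready.countDown();
    }

    boolean isReady() {
        return ready.getCount() == 0;
    }

    // Returns false on timeout or interruption instead of blocking forever.
    boolean waitUntilReady(long timeout, TimeUnit unit) {
        try {
            return ready.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}

// Hypothetical stand-in for the bridge; NOT NiFi's StandardPythonBridge class.
class SketchBridge {
    private final ReentrantLock lock = new ReentrantLock();
    private final List<SketchProcess> processes = new ArrayList<>();
    private final int maxComponentsPerProcess; // assumed sharing policy

    SketchBridge(int maxComponentsPerProcess) {
        this.maxComponentsPerProcess = maxComponentsPerProcess;
    }

    SketchProcess getProcessForNextComponent() {
        SketchProcess chosen = null;
        boolean created = false;

        // Decision phase: short critical section with no blocking calls.
        lock.lock();
        try {
            for (SketchProcess candidate : processes) {
                if (candidate.componentCount < maxComponentsPerProcess) {
                    chosen = candidate;
                    break;
                }
            }
            if (chosen == null) {
                chosen = new SketchProcess();
                processes.add(chosen);
                created = true;
            }
            chosen.componentCount++;
        } finally {
            lock.unlock();
        }

        // Slow phase: runs after the lock is released, so other callers
        // can still pick or create processes in the meantime.
        if (created) {
            // If startup throws, waiters only time out; a real
            // implementation would need explicit failure handling.
            chosen.startAndDiscover();
            chosen.markReadyAndNotify();
        } else if (!chosen.waitUntilReady(10, TimeUnit.SECONDS)) {
            throw new IllegalStateException("Process was not ready in time");
        }
        return chosen;
    }
}
```

With `new SketchBridge(2)`, the first two calls to `getProcessForNextComponent()` return the same process and the third creates a new one. When the callers are virtual threads, blocking on the `ReentrantLock` or the latch lets them release their carrier threads, which is the property the summary relies on when it moves away from `synchronized`.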