[GLUTEN-11895][VL] Fix SIGSEGV on IOThreadPool threads during HDFS scan#11896
Open
guowangy wants to merge 1 commit intoapache:mainfrom
Open
[GLUTEN-11895][VL] Fix SIGSEGV on IOThreadPool threads during HDFS scan#11896guowangy wants to merge 1 commit intoapache:mainfrom
guowangy wants to merge 1 commit intoapache:mainfrom
Conversation
a962aba to
368f446
Compare
Contributor
|
Hold on to merge until we fix #11452 (comment) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes are proposed in this pull request?
Fix SIGSEGV on
CPUThreadPoolthreads during HDFS scan caused byDetachCurrentThreadpoisoninglibhdfs.so's TLS-cachedJNIEnv*.Fixes #11895
Root cause
libhdfs.socachesJNIEnv*per thread in a two-level TLS structure:static __thread ThreadLocalState *quickTlsEnv(ELF linker-initialized, zero-cost read, no re-validation)pthread_getspecific(gTlsKey)(mutex-protected, only on first call per thread)After the first
AttachCurrentThreadon aCPUThreadPoolthread, libhdfs caches theJNIEnv*inquickTlsEnv. The fast path returns this pointer on all subsequent calls without checking validity.Two Gluten destructors called
vm_->DetachCurrentThread()unconditionally after JNI cleanup:JniColumnarBatchIterator::~JniColumnarBatchIterator()(cpp/core/jni/JniCommon.cc)JavaInputStreamAdaptor::Close()(cpp/core/jni/JniWrapper.cc)Both used
attachCurrentThreadAsDaemonOrThrow()followed by unconditionalDetachCurrentThread(). This was not a proper attach/detach pair:attachCurrentThreadAsDaemonOrThrowonly attaches if the thread is not already attached, butDetachCurrentThreadran regardless — detaching threads that libhdfs had attached. This freed the JVM'sJavaThreadobject whilequickTlsEnvstill held the stale pointer.The crash sequence:
CPUThreadPool21runs a preload task → libhdfs callsAttachCurrentThread, cachesJNIEnv*inquickTlsEnvDetachCurrentThread→JavaThreadfreed, butquickTlsEnvstill holds stale pointerCPUThreadPool21runs next preload task →hdfsGetPathInfo()→ libhdfs fast path returns stale env →jni_NewStringUTF(stale_env)→ SIGSEGVCPUThreadPoolthreads live for the entire executor JVM lifetime (created once inVeloxBackend::initConnector, destroyed only inVeloxBackend::tearDown), so the stale pointer persists across all subsequent queries.Confirmed by core dump analysis (
core.CPUThreadPool21.1770392from TPC-DS on YARN):RDI = 0x0(NULLJavaThread*, set byblock_if_vm_exited)R12 = 0x7f3003a52200(staleJNIEnv*from libhdfs TLS cache)Fix
Remove
DetachCurrentThread()from both destructors with the following considerations:attachCurrentThreadAsDaemonOrThrowis conditional (no-op if already attached) butDetachCurrentThreadwas unconditionalAttachCurrentThreadAsDaemon) do not block JVM shutdownfolly::ThreadLocal<HdfsFile>destructors callhdfsCloseFile()(JNI) during thread exit — detaching mid-lifetime would crash these toopthread_keydestructor (hdfsThreadDestructor) callsDetachCurrentThreadwhen the thread actually exits — proper cleanup still happens at thread exit when libhdfs is in usepthread_keyis registered, daemon-attached threads are safely reclaimed when the executor JVM exitsHow was this patch tested?
JniThreadDetachTest.testIteratorDestructorDoesNotDetachThread: on a nativestd::thread(simulatingCPUThreadPool), savesJNIEnv*(simulating libhdfs TLS cache), creates and destroys a realJniColumnarBatchIterator, then reuses the saved env forFindClass(simulating libhdfs's nexthdfsGetPathInfo). With the bug: SIGSEGV crashes the JVM. With the fix: succeeds normally.mvn surefire:test -pl backends-velox)CPUThreadPoolSIGSEGVWas this patch authored or co-authored using generative AI tooling?
Co-authored-by: Claude Opus 4.6