Create external input temp dir before fetch in MSQ EXTERN #19574

Open
mormigil wants to merge 1 commit into apache:master from mormigil:fix/msq-extern-temp-dir

Description

All SQL/MSQ ingestion of the form INSERT/REPLACE … SELECT … FROM TABLE(EXTERN(...)) that reads a random-access input format (Parquet, ORC, Avro-OCF, SQL, Druid-segment) from remote storage (e.g. S3) fails on the worker/peon with:

Caused by: java.io.IOException: No such file or directory
    at java.base/java.io.UnixFileSystem.createFileExclusively(Native Method)
    at java.base/java.io.File.createTempFile(File.java:2170)
    at org.apache.druid.data.input.InputEntity.fetch(InputEntity.java)
    at org.apache.druid.data.input.parquet.ParquetReader.intermediateRowIterator(ParquetReader.java:86)
    at ... ExternalSegment ... ScanQueryFrameProcessor ...

Streaming formats (JSON/CSV) and index_kafka ingestion are unaffected. This is a regression: it works in 32.x and breaks in 37.0.0.

Root cause

Random-access formats download each remote object to a local temp file via InputEntity#fetch(temporaryDirectory, …), which calls File.createTempFile(prefix, suffix, temporaryDirectory). createTempFile does not create parent directories. In the MSQ indexer worker the directory is derived lazily and never created:

Path                          | Created? | Where
<taskWorkDir>/indexing-tmp    | yes      | TaskToolbox#getIndexingTmpDir (mkdirp)
…/indexing-tmp/stage_NNNNNN   | no       | IndexerFrameContext#tempDir
…/stage_NNNNNN/external       | no       | RunWorkOrder -> frameContext.tempDir("external") -> ExternalInputSliceReader

Output channels work because FileOutputChannelFactory#openChannel already calls FileUtils.mkdirp(...) before writing; the input fetch path simply lacked the symmetric call. Streaming formats read via InputEntity#open() and never create a temp file, which is why only fetch-based formats regressed. This nesting was introduced by the background-fetch / virtual-storage external-input rewrite (#19539).
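The failure mode described above can be reproduced with the JDK alone. The following is a minimal sketch, not Druid code: the class name, the helper canCreateTempFile, and the "fetch-" prefix are hypothetical, and the directory layout only imitates the <indexing-tmp>/stage_NNNNNN/external nesting.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CreateTempFileDemo {
    /** Returns true if File.createTempFile succeeds in the given directory. */
    static boolean canCreateTempFile(File dir) {
        try {
            File tmp = File.createTempFile("fetch-", ".tmp", dir);
            tmp.delete(); // clean up; only the success or failure matters here
            return true;
        } catch (IOException e) {
            // createTempFile does not create missing parent directories,
            // so a nonexistent 'dir' ends up here.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Imitates the nested layout; only 'base' exists on disk at this point.
        Path base = Files.createTempDirectory("indexing-tmp");
        File external = base.resolve("stage_000000").resolve("external").toFile();

        System.out.println("before creating directory: " + canCreateTempFile(external));
        Files.createDirectories(external.toPath());
        System.out.println("after creating directory: " + canCreateTempFile(external));
    }
}
```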

Fix

Call FileUtils.mkdirp on the directory in InputEntity#fetch right before createTempFile, mirroring FileOutputChannelFactory#openChannel. The call is idempotent and covers every fetch-based input format.
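The shape of the fix can be sketched as follows. This is an illustration under stated assumptions rather than the actual patch: FetchSketch and fetchToTempFile are hypothetical names, the temp-file prefix is made up, and java.nio.file.Files.createDirectories stands in for Druid's FileUtils.mkdirp.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

public class FetchSketch {
    /**
     * Hypothetical stand-in for a fetch-style method: copies 'in' to a new
     * temp file under 'temporaryDirectory', creating the directory first.
     */
    static File fetchToTempFile(InputStream in, File temporaryDirectory) throws IOException {
        // Stand-in for FileUtils.mkdirp: creates missing parents and does
        // nothing if the directory already exists, so repeated calls are safe.
        Files.createDirectories(temporaryDirectory.toPath());

        File tmp = File.createTempFile("input-entity-", ".tmp", temporaryDirectory);
        // createTempFile already created an empty file, hence REPLACE_EXISTING.
        Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        File base = Files.createTempDirectory("indexing-tmp").toFile();
        File external = new File(base, "stage_000000/external"); // not created yet

        // Two fetches into the same directory: the second hits the no-op path.
        File first = fetchToTempFile(new ByteArrayInputStream(new byte[]{1, 2, 3}), external);
        File second = fetchToTempFile(new ByteArrayInputStream(new byte[]{4}), external);
        System.out.println(first.length() + " bytes, then " + second.length() + " byte");
    }
}
```

Placing the directory creation inside the fetch helper, rather than at each call site, is what lets a single call cover every format that goes through the fetch path.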

Verification

Added Parquet coverage to S3ExternQueryTest (real embedded cluster + MinIO), exercising the actual indexer fetch path for both backgroundFetchExternalFiles on and off.

  • With the fix: all 4 cases pass.
  • Reverting the one-line fix: test_externParquet_backgroundFetchDisabled fails with java.io.IOException: No such file or directory (the direct-read path; the background-fetch path stages via a local FileEntity that skips createTempFile).

This PR has:

  • been self-reviewed.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

Made with Cursor

InputEntity#fetch creates a temp file via File.createTempFile(prefix, suffix, dir)
in a caller-supplied directory. For MSQ external inputs the directory is derived
lazily as <indexing-tmp>/stage_NNNNNN/external (IndexerFrameContext#tempDir ->
RunWorkOrder -> ExternalInputSliceReader) and is never created, so random-access
formats (Parquet, ORC, Avro-OCF, etc.) that fetch remote objects to a local temp
file fail with "java.io.IOException: No such file or directory". Streaming formats
(JSON/CSV) read via open() and were unaffected. This regressed in the background
fetch / virtual-storage external-input rewrite (apache#19539); 32.x did not build this
uncreated nested subdir.

Create the directory with FileUtils.mkdirp before createTempFile, mirroring
FileOutputChannelFactory#openChannel. Adds Parquet coverage to S3ExternQueryTest
that exercises the real indexer fetch path (both background-fetch modes).

Co-authored-by: Cursor <cursoragent@cursor.com>