fix: prevent tableFromIPC from hanging on corrupted files #378
```diff
@@ -29,6 +29,20 @@ import { ArrowJSON, ArrowJSONLike, ITERATOR_DONE, FileHandle } from '../io/inter
 /** @ignore */ const invalidMessageMetadata = (expected: number, actual: number) => `Expected to read ${expected} metadata bytes, but only read ${actual}.`;
 /** @ignore */ const invalidMessageBodyLength = (expected: number, actual: number) => `Expected to read ${expected} bytes for message body, but only read ${actual}.`;
 
+/**
+ * Maximum allowed metadata length (256 MB). This is a safeguard against corrupted
+ * files that could cause the reader to hang or consume excessive memory.
+ * @ignore
+ */
+const MAX_METADATA_LENGTH = 256 * 1024 * 1024;
+
+/**
+ * Maximum allowed message body length (2 GB). This is a safeguard against corrupted
+ * files that could cause the reader to hang or consume excessive memory.
+ * @ignore
+ */
+const MAX_BODY_LENGTH = 2 * 1024 * 1024 * 1024;
+
 /** @ignore */
 export class MessageReader implements IterableIterator<Message> {
     protected source: ByteStream;
```
```diff
@@ -59,6 +73,9 @@ export class MessageReader implements IterableIterator<Message> {
     }
     public readMessageBody(bodyLength: number): Uint8Array {
         if (bodyLength <= 0) { return new Uint8Array(0); }
+        if (bodyLength > MAX_BODY_LENGTH) {
+            throw new Error(`Message body length ${bodyLength} exceeds maximum allowed size of ${MAX_BODY_LENGTH} bytes. The file may be corrupted.`);
+        }
         const buf = toUint8Array(this.source.read(bodyLength));
         if (buf.byteLength < bodyLength) {
             throw new Error(invalidMessageBodyLength(bodyLength, buf.byteLength));
```
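The guard added above can be isolated as a standalone sketch. `checkBodyLength` is a hypothetical helper name for illustration; `MAX_BODY_LENGTH` mirrors the constant from the diff (which the author later reverted):

```typescript
// Sketch of the body-length guard: reject an implausible length prefix before
// attempting the read, so a corrupted prefix fails fast instead of triggering
// a huge (or hanging) read. 2 GB mirrors MAX_BODY_LENGTH from the diff.
const MAX_BODY_LENGTH = 2 * 1024 * 1024 * 1024;

function checkBodyLength(bodyLength: number): number {
    // Non-positive lengths mean an empty body, matching readMessageBody.
    if (bodyLength <= 0) { return 0; }
    if (bodyLength > MAX_BODY_LENGTH) {
        throw new Error(
            `Message body length ${bodyLength} exceeds maximum allowed size of ` +
            `${MAX_BODY_LENGTH} bytes. The file may be corrupted.`);
    }
    return bodyLength;
}
```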
```diff
@@ -81,6 +98,10 @@ export class MessageReader implements IterableIterator<Message> {
         const buf = this.source.read(PADDING);
         const bb = buf && new ByteBuffer(buf);
         const len = bb?.readInt32(0) || 0;
+        // Guard against corrupted files with unreasonable metadata lengths
+        if (len < 0 || len > MAX_METADATA_LENGTH) {
```

**Suggested change:**

```diff
-        if (len < 0 || len > MAX_METADATA_LENGTH) {
+        if (len < -1 || len > MAX_METADATA_LENGTH) {
```
**Copilot AI — Feb 6, 2026:**

The new validation logic for `MAX_METADATA_LENGTH` lacks test coverage. Consider adding test cases that verify the error is thrown when the metadata length exceeds `MAX_METADATA_LENGTH` (256 MB) or is negative (excluding the valid `-1` continuation indicator), and that valid metadata lengths still work correctly.
**Copilot AI — Feb 6, 2026:**

The new validation logic for `MAX_BODY_LENGTH` lacks test coverage. Consider adding test cases that verify the error is thrown when `bodyLength` exceeds `MAX_BODY_LENGTH` (2 GB), and that valid body lengths just under the limit still work correctly.
**Copilot AI — Feb 6, 2026:**

The validation condition `len < 0` will break the continuation-indicator logic used for Arrow IPC format v0.15+. According to the comment on lines 132-134, a value of `-1` is a valid continuation indicator per ARROW-6313. This validation should exclude `-1` from being treated as an error. Consider changing the condition to `len < -1` or `(len < 0 && len !== -1)`.

**Suggested change:**

```diff
-        if (len < 0 || len > MAX_METADATA_LENGTH) {
+        if ((len < 0 && len !== -1) || len > MAX_METADATA_LENGTH) {
```
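The corrected check can be sketched on its own. `validateMetadataLength` and its error message are hypothetical names for illustration; `MAX_METADATA_LENGTH` mirrors the constant from the diff, and `-1` is kept as the valid continuation indicator per ARROW-6313:

```typescript
// Sketch of the metadata-length guard with Copilot's fix applied: -1 is the
// IPC continuation indicator (Arrow v0.15+) and must not be rejected, while
// any other negative value, or a value above the cap, indicates corruption.
const MAX_METADATA_LENGTH = 256 * 1024 * 1024; // 256 MB

function validateMetadataLength(len: number): void {
    if ((len < 0 && len !== -1) || len > MAX_METADATA_LENGTH) {
        throw new Error(
            `Invalid metadata length ${len}. The file may be corrupted.`);
    }
}
```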
Yeah these are wrong, Arrow tables can be larger than this. It'd be better to figure out why your IPC stream is corrupted when it's written.
Thank you for the feedback! You're right - I've reverted the size limits as they were too restrictive for valid Arrow files.
Is there a way to add a timeout here? A function that never returns and has no way to be interrupted isn't something we can use in production.
We can't add a timeout in Arrow since we're just processing a stream the library user gives us. We don't have enough context to know why a read call might take a certain amount of time. Is the file on a slow NFS drive? Is it a `fetch()` to a dynamic resource where the next batch hasn't been computed yet? Is the user's connection just slow?

That said, you can add a timeout before/after Arrow's transform stream to either terminate or cancel based on your users' expected behavior.
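A caller-side timeout along these lines could look like the following sketch. `withTimeout` is a hypothetical helper, not an Arrow API; it races a promise against a timer so a reader stuck on a stalled or corrupted stream fails instead of hanging forever:

```typescript
// Minimal sketch: race an arbitrary async read against a timer. Note this
// only abandons the wait on timeout; it does not cancel the underlying I/O,
// so for fetch()-backed streams you may also want an AbortController.
async function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
        timer = setTimeout(
            () => reject(new Error(`Timed out after ${ms} ms`)), ms);
    });
    try {
        // Whichever promise settles first wins the race.
        return await Promise.race([promise, timeout]);
    } finally {
        if (timer !== undefined) { clearTimeout(timer); }
    }
}
```

Usage might then be something like `await withTimeout(tableFromIPC(bytes), 5000)`, with the timeout value chosen for your users' expected latency.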