Closed
Contributor
This one looked good, but then I realized there's a problem: some files have UTF-8 content but no BOM, and that's the case the original code was meant to handle. A better solution would be to wrap the input in a BufferedInputStream, use mark / reset to check for the BOM, and then pass the stream to the reader.
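A minimal sketch of the mark / reset idea, not the actual PDFBox code: the class name `BomSkipSketch` and the helper `openUtf8Reader` are hypothetical, and the sketch assumes the caller already knows the content is UTF-8. It peeks at up to three bytes and rewinds if they are not the UTF-8 BOM (EF BB BF), so files without a BOM are read from the start.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of PDFBox.
final class BomSkipSketch {
    private static final int[] UTF8_BOM = {0xEF, 0xBB, 0xBF};

    /** Returns a UTF-8 reader positioned after the BOM if one is present, else at the start. */
    public static Reader openUtf8Reader(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(UTF8_BOM.length);           // remember the start of the stream
        for (int expected : UTF8_BOM) {
            if (in.read() != expected) {    // mismatch, or stream ended early
                in.reset();                 // no BOM: rewind so no content is lost
                break;
            }
        }
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }
}
```

Only one stream is opened, and a partial match (for example EF BB followed by some other byte) is treated as content rather than as a BOM.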
Author
I didn't quite understand why this code supports UTF-8 but doesn't support UTF-16, etc.
asf-gitbox-commits pushed a commit that referenced this pull request Jan 12, 2026
git-svn-id: https://svn.apache.org/repos/asf/pdfbox/branches/3.0@1931273 13f79535-47bb-0310-9956-ffa450edef68
asf-gitbox-commits pushed a commit that referenced this pull request Jan 12, 2026
git-svn-id: https://svn.apache.org/repos/asf/pdfbox/branches/2.0@1931275 13f79535-47bb-0310-9956-ffa450edef68
Contributor
Thanks... let's wait and see if somebody needs it 😂
The current algorithm is:
1 If the charset is UTF-8, read 3 bytes.
2 If these 3 bytes have the expected values, set a hasUtf8BOM variable to true.
3 Close the stream.
4 Open a new stream on the same file.
5 If hasUtf8BOM is true, skip 3 bytes.
6 If 3 bytes could not be skipped, throw an exception.
7 Once the bytes have been skipped, or if there was no need to skip them, read the rest of the file.
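The steps above can be sketched roughly as follows. This is a reconstruction from the description only, not the actual PDFBox source: the class name `TwoPassBomSketch`, the method `open`, and the exception message are hypothetical, and the real code may read the three bytes differently.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration class; not part of PDFBox.
final class TwoPassBomSketch {
    public static Reader open(Path file, Charset charset) throws IOException {
        boolean hasUtf8BOM = false;
        if (StandardCharsets.UTF_8.equals(charset)) {
            // Steps 1-3: a first stream, used only to inspect the first three bytes.
            try (InputStream probe = Files.newInputStream(file)) {
                hasUtf8BOM = probe.read() == 0xEF
                        && probe.read() == 0xBB
                        && probe.read() == 0xBF;
            }
        }
        // Step 4: a second stream on the same file.
        InputStream in = Files.newInputStream(file);
        if (hasUtf8BOM && in.skip(3) != 3) {
            // Steps 5-6: the BOM seen by the first stream could not be skipped.
            in.close();
            throw new IOException("Could not skip UTF-8 BOM");
        }
        // Step 7: the caller reads the rest of the file through this reader.
        return new InputStreamReader(in, charset);
    }
}
```

The file is opened twice, so the second open or the skip can fail even though the first pass succeeded.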
These questions are not for you, but about the algorithm:
1 Why do we need to read the file twice, when that increases the likelihood that the second read fails?
2 Opening a stream twice is slower than opening it once.
3 According to the code, it's possible that the file can't be read the second time. Then why isn't there a check for a corrupted file? That is, the encoding is UTF-8, but what if one or two of these three bytes are different? Perhaps the format allows such combinations, so this can't be verified.
At this point, I propose using one stream instead of two. There are no other changes.