refactor TextToPDF.call method by valerybokov · Pull Request #388 · apache/pdfbox

valerybokov · 2026-01-01T15:42:36Z

The current algorithm is:
1 If the charset is UTF-8, read 3 bytes.
2 If these 3 bytes have the expected values, set the hasUtf8BOM variable to true.
3 Close the stream.
4 Open a new stream on the same file.
5 If hasUtf8BOM is true, skip 3 bytes.
6 If the 3 bytes could not be skipped, throw an exception.
7 Once the bytes have been skipped, or if there was no need to skip them, read the rest of the file.
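
For illustration, the seven steps above might look roughly like the following sketch. It is not the PDFBox source: the class `TwoPassBomSketch` and the method `readAll` are hypothetical names, and `readNBytes` / `readAllBytes` assume Java 11 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the two-open algorithm described above; not PDFBox code.
final class TwoPassBomSketch
{
    static String readAll(Path file, Charset charset) throws IOException
    {
        boolean hasUtf8BOM = false;
        if (StandardCharsets.UTF_8.equals(charset))
        {
            // Steps 1-3: open the file, look at the first three bytes, close it.
            try (InputStream is = Files.newInputStream(file))
            {
                byte[] head = is.readNBytes(3);
                hasUtf8BOM = head.length == 3
                        && (head[0] & 0xFF) == 0xEF
                        && (head[1] & 0xFF) == 0xBB
                        && (head[2] & 0xFF) == 0xBF;
            }
        }
        // Step 4: open the same file a second time.
        try (InputStream is = Files.newInputStream(file))
        {
            // Steps 5-6: skip the BOM, failing if the skip falls short.
            if (hasUtf8BOM && is.skip(3) != 3)
            {
                throw new IOException("Could not skip the UTF-8 BOM");
            }
            // Step 7: read the rest of the file.
            return new String(is.readAllBytes(), charset);
        }
    }
}
```

The second `Files.newInputStream` call is the part the questions below are about.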

These questions are not for you, but about the algorithm:
1 Why do we need to read the file twice, when that increases the likelihood that the second attempt fails?
2 Opening a stream twice is slower than opening it once.
3 According to the code, the second read of the file can fail. Then why is there no check for whether the file is corrupted? That is, the encoding is UTF-8, but what if one or two of these three bytes differ? Perhaps the format allows such combinations, so this can't be verified.

At this point, I propose making one stream instead of two. There are no other changes.

THausherr · 2026-01-12T11:35:40Z

This one looked good, but then I realized there's a problem: there are files that have UTF-8 content but no BOM, and that's what the original code intended to support. A better solution would be to try a BufferedInputStream, use mark / reset, and then pass it to the reader.
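
A minimal sketch of the mark / reset idea, for illustration only. It is not the committed change: the class `MarkResetBomSketch` and the method `openReader` are hypothetical names, and the sketch assumes that a BOM is only looked for when the charset is UTF-8.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the mark / reset suggestion; not PDFBox code.
final class MarkResetBomSketch
{
    static Reader openReader(InputStream raw, Charset charset) throws IOException
    {
        BufferedInputStream in = new BufferedInputStream(raw);
        if (StandardCharsets.UTF_8.equals(charset))
        {
            // Remember the start of the stream so it can be restored.
            in.mark(3);
            byte[] head = new byte[3];
            int n = 0;
            while (n < 3)
            {
                int r = in.read(head, n, 3 - n);
                if (r < 0)
                {
                    break; // fewer than three bytes available
                }
                n += r;
            }
            boolean hasUtf8BOM = n == 3
                    && (head[0] & 0xFF) == 0xEF
                    && (head[1] & 0xFF) == 0xBB
                    && (head[2] & 0xFF) == 0xBF;
            if (!hasUtf8BOM)
            {
                // No BOM: rewind so the reader sees the file from its first byte.
                in.reset();
            }
        }
        return new InputStreamReader(in, charset);
    }
}
```

With this shape the file is opened once, and UTF-8 content without a BOM is still read in full.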

valerybokov · 2026-01-12T17:34:12Z

> This one looked good but then I realized there's a problem, there are files that have UTF8 content but no BOM, and that's what the original code intended. A better solution would be to try with a BufferedInputStream, use mark / reset and then pass to the reader.

I didn't quite understand why this code supports UTF-8 but does not support UTF-16, etc.

THausherr · 2026-01-12T19:46:34Z

Thanks... let's wait and see whether somebody needs it 😂

refactor TextToPDF.call method

ffc6b82

support UTF8 with BOM and no BOM in TextToPDF.call

a6356cf

asf-gitbox-commits pushed a commit that referenced this pull request Jan 12, 2026

PDFBOX-5660: don't open twice, as suggested by Valery Bokov; closes #388

87011ad

git-svn-id: https://svn.apache.org/repos/asf/pdfbox/branches/3.0@1931273 13f79535-47bb-0310-9956-ffa450edef68

asf-gitbox-commits closed this in 0d815b8 Jan 12, 2026

asf-gitbox-commits pushed a commit that referenced this pull request Jan 12, 2026

PDFBOX-5660: don't open twice, as suggested by Valery Bokov; closes #388

2dcdd73

git-svn-id: https://svn.apache.org/repos/asf/pdfbox/branches/2.0@1931275 13f79535-47bb-0310-9956-ffa450edef68

valerybokov deleted the refactor-TextToPDF.call branch January 21, 2026 21:31

valerybokov wants to merge 2 commits into apache:trunk from valerybokov:refactor-TextToPDF.call