#466 - Handle text/plain content in JSoupParserBolt by rzo1 · Pull Request #1943 · apache/stormcrawler

rzo1 · 2026-06-14T17:42:09Z

text/plain content requires no markup parsing, so instead of raising an error (the default with jsoup.treat.non.html.as.error) the bolt now uses the decoded content directly as the extracted text and emits no outlinks.

Adds a unit test and documentation updates.

Closes #466

For all changes

Is there an issue associated with this PR? Is it referenced in the commit message?
Does your PR title start with #XXXX where XXXX is the issue number you are trying to resolve?
Has your PR been rebased against the latest commit within the target branch (typically main)?
Is your initial contribution a single, squashed commit?
Is the code properly formatted with mvn git-code-format:format-code -Dgcf.globPattern="**/*" -Dskip.format.code=false?

For code changes

Have you ensured that the full suite of tests is executed via mvn clean verify?
Have you written or updated unit tests to verify your changes?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
If applicable, have you updated the LICENSE file, including the main LICENSE file?
If applicable, have you updated the NOTICE file, including the main NOTICE file?

Note

Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.

dpol1

looks good. the shell-doc trick keeps redirects and filters working and leaves the html path untouched, nice.

text/plain content requires no markup parsing, so instead of raising an error (the default with jsoup.treat.non.html.as.error) the bolt now uses the decoded content directly as the extracted text and emits no outlinks.

The plain-text path does not run the TextExtractor (there is no markup to extract from), so the two size-related knobs are read in prepare() and applied directly: empty text when textextractor.no.text is set, truncated to textextractor.skip.after otherwise. substring keeps the original layout, which is the point of a .txt; http.content.limit remains the bound for the raw fetched bytes. The include/exclude knobs require markup and have no effect.

Adds unit tests for the verbatim, skip.after-truncation and no.text cases and documents the behaviour and bounds in configuration.adoc and internals.adoc.

Closes #466
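
For readers skimming the thread, the size handling described in the commit message can be sketched roughly as follows. This is a hypothetical illustration, not the code from this PR: the class PlainTextSketch, the Result holder and the method handlePlainText are invented names, the two settings are passed as plain parameters instead of being read from the configuration in prepare(), and treating a negative skipAfter as "no limit" is an assumption.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of the plain-text path; not the code from this PR. */
final class PlainTextSketch {

    /** Extracted text plus outlinks for one text/plain document. */
    static final class Result {
        final String text;
        final List<String> outlinks;

        Result(String text, List<String> outlinks) {
            this.text = text;
            this.outlinks = outlinks;
        }
    }

    /**
     * @param decoded   the already decoded content of the document
     * @param noText    stands in for textextractor.no.text
     * @param skipAfter stands in for textextractor.skip.after;
     *                  negative is assumed to mean "no limit"
     */
    static Result handlePlainText(String decoded, boolean noText, int skipAfter) {
        String text;
        if (noText) {
            // Text extraction is switched off: keep nothing.
            text = "";
        } else if (skipAfter >= 0 && decoded.length() > skipAfter) {
            // substring cuts by character count and leaves the layout
            // (line breaks, indentation) of the kept prefix untouched.
            text = decoded.substring(0, skipAfter);
        } else {
            // No limit, or the content is already short enough: use it verbatim.
            text = decoded;
        }
        // There is no markup to scan, so no outlinks are produced.
        return new Result(text, Collections.<String>emptyList());
    }

    public static void main(String[] args) {
        String body = "line one\n  indented line two\n";
        System.out.println(handlePlainText(body, false, -1).text.length());
        System.out.println(handlePlainText(body, false, 8).text.length());
        System.out.println(handlePlainText(body, true, -1).text.length());
    }
}
```

The three calls in main mirror the verbatim, skip.after-truncation and no.text cases named above. Note that the cut is by character count on the decoded string, while http.content.limit bounds the raw fetched bytes before decoding.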

dpol1

👍

rzo1 added this to the 3.6.1 milestone Jun 14, 2026

rzo1 requested review from dpol1, jnioche, mvolikas and sigee June 14, 2026 17:42

dpol1 approved these changes Jun 15, 2026

Comment thread core/src/main/java/org/apache/stormcrawler/bolt/JSoupParserBolt.java Outdated

rzo1 force-pushed the feature/466 branch from 54d0e21 to 372e8f8 Compare June 16, 2026 09:23

dpol1 approved these changes Jun 16, 2026

rzo1 wants to merge 1 commit into main from feature/466
