#466 - Handle text/plain content in JSoupParserBolt#1943

Open: rzo1 wants to merge 1 commit into main from feature/466

rzo1 (Contributor) commented Jun 14, 2026

text/plain content requires no markup parsing, so instead of raising an error (the default with jsoup.treat.non.html.as.error) the bolt now uses the decoded content directly as the extracted text and emits no outlinks.

Adds a unit test and documentation updates.

Closes #466

For all changes

  • Is there an issue associated with this PR? Is it referenced in the commit message?

  • Does your PR title start with #XXXX where XXXX is the issue number you are trying to resolve?

  • Has your PR been rebased against the latest commit within the target branch (typically main)?

  • Is your initial contribution a single, squashed commit?

  • Is the code properly formatted with mvn git-code-format:format-code -Dgcf.globPattern="**/*" -Dskip.format.code=false?

For code changes

  • Have you ensured that the full suite of tests is executed via mvn clean verify?
  • Have you written or updated unit tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file?

Note

Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.

rzo1 added this to the 3.6.1 milestone Jun 14, 2026
rzo1 requested review from dpol1, jnioche, mvolikas and sigee June 14, 2026 17:42

dpol1 (Member) left a comment

looks good. the shell-doc trick keeps redirects and filters working and leaves the html path untouched, nice.

Comment thread core/src/main/java/org/apache/stormcrawler/bolt/JSoupParserBolt.java Outdated
text/plain content requires no markup parsing, so instead of raising an error
(the default with jsoup.treat.non.html.as.error) the bolt now uses the decoded
content directly as the extracted text and emits no outlinks.

The plain-text path does not run the TextExtractor (there is no markup to
extract from), so the two size-related knobs are read in prepare() and applied
directly: empty text when textextractor.no.text is set, truncated to
textextractor.skip.after otherwise. substring keeps the original layout, which
is the point of a .txt; http.content.limit remains the bound for the raw
fetched bytes. The include/exclude knobs require markup and have no effect.
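The bounding rule described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the class `PlainTextBounds`, the method `apply` and its parameter names are hypothetical, and treating a negative `skipAfter` as "no limit" is an assumption about how `textextractor.skip.after` is interpreted.

```java
// Hypothetical helper illustrating the plain-text bounds; not the actual
// JSoupParserBolt implementation.
final class PlainTextBounds {

    private PlainTextBounds() {}

    /**
     * @param decoded   the already-decoded text/plain body
     * @param noText    stands in for textextractor.no.text
     * @param skipAfter stands in for textextractor.skip.after;
     *                  ASSUMPTION: a negative value means "no limit"
     */
    static String apply(String decoded, boolean noText, int skipAfter) {
        if (noText) {
            // Text extraction is switched off: emit empty text.
            return "";
        }
        if (skipAfter >= 0 && decoded.length() > skipAfter) {
            // substring keeps line breaks and indentation of the kept prefix.
            return decoded.substring(0, skipAfter);
        }
        // Within bounds: use the decoded content verbatim.
        return decoded;
    }

    public static void main(String[] args) {
        String body = "line one\n  indented line two\n";
        System.out.println(apply(body, false, 12));
    }
}
```

Note that `String.substring` counts UTF-16 code units, so in this sketch the limit is a character count on the decoded text, separate from the byte limit that `http.content.limit` places on the raw fetch.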

Adds unit tests for the verbatim, skip.after-truncation and no.text cases and
documents the behaviour and bounds in configuration.adoc and internals.adoc.

Closes #466

dpol1 (Member) left a comment

👍

Development

Successfully merging this pull request may close these issues.

JSoup parser to handle text/plain
