Commit 2ac4fef

tballison and claude authored
TIKA-4656 - add content-only parse mode, markdown handler integration, and docs (#2600)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 212a467 commit 2ac4fef

16 files changed

Lines changed: 499 additions & 11 deletions

docs/modules/ROOT/nav.adoc

Lines changed: 1 addition & 0 deletions
@@ -21,6 +21,7 @@
 ** xref:using-tika/cli/index.adoc[Command Line]
 ** xref:using-tika/grpc/index.adoc[gRPC]
 * xref:pipes/index.adoc[Pipes]
+** xref:pipes/parse-modes.adoc[Parse Modes]
 ** xref:pipes/unpack-config.adoc[Extracting Embedded Bytes]
 * xref:configuration/index.adoc[Configuration]
 ** xref:configuration/parsers/pdf-parser.adoc[PDF Parser]

docs/modules/ROOT/pages/pipes/index.adoc

Lines changed: 1 addition & 0 deletions
@@ -29,6 +29,7 @@ Tika Pipes provides a framework for processing large volumes of documents with:
 
 == Topics
 
+* xref:pipes/parse-modes.adoc[Parse Modes] - Control how documents are parsed and emitted (`RMETA`, `CONCATENATE`, `CONTENT_ONLY`, `UNPACK`)
 * xref:pipes/unpack-config.adoc[Extracting Embedded Bytes] - Extract raw bytes from embedded documents using `ParseMode.UNPACK`
 
 // Add links to specific topics as they are created

docs/modules/ROOT/pages/pipes/parse-modes.adoc

Lines changed: 139 additions & 0 deletions
@@ -0,0 +1,139 @@
+//
+// Licensed to the Apache Software Foundation (ASF) under one or more
+// contributor license agreements. See the NOTICE file distributed with
+// this work for additional information regarding copyright ownership.
+// The ASF licenses this file to You under the Apache License, Version 2.0
+// (the "License"); you may not use this file except in compliance with
+// the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+//
+
+= Parse Modes
+
+Tika Pipes uses `ParseMode` to control how documents are parsed and how results are emitted.
+The parse mode is set on the `ParseContext` or configured in `PipesConfig`.
+
+== Available Parse Modes
+
+[cols="1,3"]
+|===
+|Mode |Description
+
+|`RMETA`
+|Default mode. Each embedded document produces a separate `Metadata` object.
+Results are returned as a JSON array of metadata objects.
+
+|`CONCATENATE`
+|All content from embedded documents is concatenated into a single content field.
+Results are returned as a single `Metadata` object with all metadata preserved.
+
+|`CONTENT_ONLY`
+|Parses like `CONCATENATE` but emits only the raw extracted content — no JSON wrapper,
+no metadata fields. Useful when you want just the text, markdown, or HTML output.
+
+|`NO_PARSE`
+|Skip parsing entirely. Useful for pipelines that only need to fetch and emit raw bytes.
+
+|`UNPACK`
+|Extract raw bytes from embedded documents. See xref:pipes/unpack-config.adoc[Extracting Embedded Bytes].
+|===
+
+== CONCATENATE Mode
+
+`CONCATENATE` merges all content from embedded documents into a single content field
+while preserving all metadata from parsing:
+
+[source,json]
+----
+{
+  "parseContext": {
+    "parseMode": "CONCATENATE"
+  }
+}
+----
+
+The result is a single `Metadata` object containing the concatenated content in
+`X-TIKA:content` along with all other metadata fields (title, author, content type, etc.).
+
+== CONTENT_ONLY Mode
+
+`CONTENT_ONLY` is designed for use cases where you want just the extracted content
+written to storage — no JSON wrapping, no metadata overhead. This is particularly
+useful for:
+
+* Extracting markdown files from a document corpus
+* Building plain text search indexes
+* Generating HTML versions of documents
+
+[source,json]
+----
+{
+  "parseContext": {
+    "parseMode": "CONTENT_ONLY"
+  }
+}
+----
+
+=== How It Works
+
+1. Documents are parsed identically to `CONCATENATE` mode — all embedded content is
+merged into a single content field.
+2. A metadata filter automatically strips all metadata except `X-TIKA:content` and
+`X-TIKA:CONTAINER_EXCEPTION` (for error tracking).
+3. When the emitter is a `StreamEmitter` (such as the filesystem or S3 emitter), the
+raw content string is written directly as bytes — no JSON serialization.
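
Steps 2 and 3 of the list above can be sketched in plain Java. This is only an illustration under stated assumptions, not the code from this commit: a `Map<String, String>` stands in for Tika's `Metadata`, the class and method names (`ContentOnlySketch`, `retainContentFields`, `toRawBytes`) are hypothetical, and UTF-8 is assumed because the docs do not name an encoding. Only the two retained field names come from the documentation itself.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the CONTENT_ONLY post-processing steps.
final class ContentOnlySketch {
    // The two fields the docs say survive the default filter.
    static final Set<String> RETAINED =
            Set.of("X-TIKA:content", "X-TIKA:CONTAINER_EXCEPTION");

    // Step 2: drop every metadata field except the retained ones.
    static Map<String, String> retainContentFields(Map<String, String> metadata) {
        Map<String, String> filtered = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            if (RETAINED.contains(entry.getKey())) {
                filtered.put(entry.getKey(), entry.getValue());
            }
        }
        return filtered;
    }

    // Step 3: emit the content string as raw bytes, with no JSON wrapper.
    // UTF-8 is an assumption made for this sketch.
    static byte[] toRawBytes(Map<String, String> filtered) {
        String content = filtered.getOrDefault("X-TIKA:content", "");
        return content.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("dc:title", "Report");
        metadata.put("Content-Type", "application/pdf");
        metadata.put("X-TIKA:content", "# Report\n\nBody text.");
        byte[] out = toRawBytes(retainContentFields(metadata));
        System.out.println(new String(out, StandardCharsets.UTF_8));
    }
}
```

The point of the sketch is the ordering: filtering happens on the metadata first, and the emitter then writes only the surviving content value, so nothing else can leak into the output file.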
+
+=== Metadata Filtering
+
+By default, `CONTENT_ONLY` mode applies an `IncludeFieldMetadataFilter` that retains
+only `X-TIKA:content` and `X-TIKA:CONTAINER_EXCEPTION`. If you set your own
+`MetadataFilter` on the `ParseContext`, your filter takes priority.
+
+=== CLI Usage
+
+The `tika-async-cli` batch processor supports `CONTENT_ONLY` via the `--content-only`
+flag:
+
+[source,bash]
+----
+java -jar tika-async-cli.jar -i /input -o /output -h m --content-only
+----
+
+This produces `.md` files (when using the `m` handler type) containing only the
+extracted markdown content.
+
+=== Content Handler Types
+
+The content format depends on the configured handler type:
+
+[cols="1,1,2"]
+|===
+|Handler |Extension |Description
+
+|`t` (text)
+|`.txt`
+|Plain text output
+
+|`h` (html)
+|`.html`
+|HTML output
+
+|`x` (xml)
+|`.xml`
+|XHTML output
+
+|`m` (markdown)
+|`.md`
+|Markdown output
+
+|`b` (body)
+|`.txt`
+|Body content handler output
+|===
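
The handler table above amounts to a small lookup from handler code to file extension. A minimal sketch of that lookup follows; the class `HandlerExtensions` and method `extensionFor` are hypothetical names used for illustration and are not part of Tika. The five mappings are copied from the table.

```java
import java.util.Map;

// Hypothetical helper mirroring the handler/extension table in the docs.
final class HandlerExtensions {
    private static final Map<String, String> EXTENSIONS = Map.of(
            "t", ".txt",   // plain text
            "h", ".html",  // HTML
            "x", ".xml",   // XHTML
            "m", ".md",    // Markdown
            "b", ".txt");  // body content handler

    // Returns the documented extension, or fails loudly for an unknown code.
    static String extensionFor(String handlerType) {
        String extension = EXTENSIONS.get(handlerType);
        if (extension == null) {
            throw new IllegalArgumentException("Unknown handler type: " + handlerType);
        }
        return extension;
    }

    public static void main(String[] args) {
        System.out.println(extensionFor("m"));
    }
}
```

Note that `t` and `b` share `.txt`, so the extension alone does not tell a reader which handler produced a file.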

docs/modules/ROOT/pages/using-tika/cli/index.adoc

Lines changed: 75 additions & 0 deletions
@@ -83,6 +83,9 @@ java -jar tika-app.jar [option...] [file|port...]
 |`-t` or `--text`
 |Output plain text
 
+|`--md`
+|Output Markdown
+
 |`-m` or `--metadata`
 |Output metadata only
 
@@ -124,6 +127,13 @@ Process entire directories by specifying input and output paths:
 java -jar tika-app.jar -i /path/to/input -o /path/to/output
 ----
 
+=== Extract Markdown from a file
+
+[source,bash]
+----
+java -jar tika-app.jar --md document.docx
+----
+
 === Custom configuration
 
 Use a custom configuration file:
@@ -132,3 +142,68 @@ Use a custom configuration file:
 ----
 java -jar tika-app.jar --config=tika-config.json document.pdf
 ----
+
+== Batch Processing (tika-async-cli)
+
+For processing large numbers of files, use `tika-async-cli`. It uses the Tika Pipes
+architecture with forked JVM processes for fault tolerance.
+
+=== Basic Batch Usage
+
+[source,bash]
+----
+java -jar tika-async-cli.jar -i /path/to/input -o /path/to/output
+----
+
+This processes all files in the input directory and writes JSON metadata (RMETA format)
+to the output directory.
+
+=== Batch Options
+
+[cols="1,3"]
+|===
+|Option |Description
+
+|`-i`
+|Input directory
+
+|`-o`
+|Output directory
+
+|`-h` or `--handlerType`
+|Content handler type: `t`=text, `h`=html, `x`=xml, `m`=markdown, `b`=body, `i`=ignore (default: `t`)
+
+|`--concatenate`
+|Concatenate content from all embedded documents into a single content field
+
+|`--content-only`
+|Output only extracted content (no metadata, no JSON wrapper); implies `--concatenate`
+
+|`-T` or `--timeoutMs`
+|Timeout for each parse in milliseconds
+
+|`-n` or `--numClients`
+|Number of parallel forked processes
+
+|`-p` or `--pluginsDir`
+|Plugins directory
+|===
+
+=== Batch Examples
+
+Extract markdown content only (no metadata) from all files:
+
+[source,bash]
+----
+java -jar tika-async-cli.jar -i /path/to/input -o /path/to/output -h m --content-only
+----
+
+This produces `.md` files in the output directory containing just the extracted markdown
+content — no JSON wrappers, no metadata fields.
+
+Extract text with all metadata in concatenated mode:
+
+[source,bash]
+----
+java -jar tika-async-cli.jar -i /path/to/input -o /path/to/output --concatenate
+----
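
For callers that launch the batch tool from Java rather than a shell, the documented command line can be assembled as an argument list. The sketch below is a hypothetical wrapper (`AsyncCliCommand` and `build` are illustrative names); it uses only the flags listed in the Batch Options table and does not start a process on its own.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical builder for the documented tika-async-cli invocation.
final class AsyncCliCommand {
    static List<String> build(String jar, String inputDir, String outputDir,
                              String handlerType, boolean contentOnly) {
        List<String> command = new ArrayList<>(List.of(
                "java", "-jar", jar,
                "-i", inputDir,
                "-o", outputDir,
                "-h", handlerType));
        if (contentOnly) {
            // Per the options table, this flag implies --concatenate.
            command.add("--content-only");
        }
        return command;
    }

    public static void main(String[] args) {
        List<String> command = build("tika-async-cli.jar",
                "/path/to/input", "/path/to/output", "m", true);
        System.out.println(String.join(" ", command));
        // To actually run it: new ProcessBuilder(command).inheritIO().start();
    }
}
```

Passing the arguments as a list avoids shell quoting problems when input or output paths contain spaces.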

docs/modules/ROOT/pages/using-tika/java-api/index.adoc

Lines changed: 13 additions & 1 deletion
@@ -100,12 +100,24 @@ For example, use `TikaInputStream.get(path)` for a `Path`, or `TikaInputStream.g
 for a `byte[]`. This allows Tika to access the underlying resource efficiently and enables
 features like mark/reset support that many parsers and detectors require.
 
-=== Utility Classes
+=== Content Handlers
+
+Tika provides several content handlers that control the output format:
 
 **BodyContentHandler**:: Extracts and converts the body content to streams or strings.
 
+**ToTextContentHandler**:: Outputs plain text.
+
+**ToHTMLContentHandler**:: Outputs HTML.
+
+**ToXMLContentHandler**:: Outputs XHTML/XML.
+
+**ToMarkdownContentHandler**:: Outputs Markdown, preserving structural semantics like headings, lists, tables, code blocks, emphasis, and links.
+
 **ParsingReader**:: Uses background threading to return extracted text as character streams.
 
+Use `BasicContentHandlerFactory` to create handlers by type: `TEXT`, `HTML`, `XML`, `BODY`, `MARKDOWN`, `IGNORE`.
+
 === Key Metadata Properties
 
 * `TikaCoreProperties.RESOURCE_NAME_KEY` - filename or resource identifier

docs/modules/ROOT/pages/using-tika/server/index.adoc

Lines changed: 54 additions & 0 deletions
@@ -33,6 +33,60 @@ java -jar tika-server-standard.jar
 
 The server starts on port 9998 by default.
 
+== Endpoints
+
+=== Content Extraction (`/tika`)
+
+The `/tika` endpoint extracts content from a document as plain text.
+
+[source,bash]
+----
+curl -T document.pdf http://localhost:9998/tika
+----
+
+==== Markdown Output (`/tika/md`)
+
+The `/tika/md` endpoint extracts content as Markdown, preserving structural semantics
+like headings, lists, tables, and emphasis:
+
+[source,bash]
+----
+curl -T document.docx http://localhost:9998/tika/md
+----
+
+==== Custom Handler Type
+
+Use the `X-Tika-Handler` header to control the output format. Valid values: `text` (default),
+`html`, `xml`, `markdown`, `ignore`.
+
+[source,bash]
+----
+curl -T document.pdf -H "X-Tika-Handler: markdown" http://localhost:9998/tika
+----
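
The `curl` call above translates to a `PUT` with one extra header. A rough Java equivalent using the JDK's `java.net.http` API (Java 11+) is sketched below. `TikaHandlerRequest` and `build` are hypothetical names, the base URL assumes the default port mentioned earlier, and the request is only constructed here, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.Set;

// Hypothetical helper that builds the request shown in the curl example.
final class TikaHandlerRequest {
    // Values listed in the docs as valid for X-Tika-Handler.
    private static final Set<String> HANDLERS =
            Set.of("text", "html", "xml", "markdown", "ignore");

    static HttpRequest build(String baseUrl, byte[] document, String handler) {
        if (!HANDLERS.contains(handler)) {
            throw new IllegalArgumentException("Unsupported handler: " + handler);
        }
        return HttpRequest.newBuilder(URI.create(baseUrl + "/tika"))
                .header("X-Tika-Handler", handler)
                // curl -T uploads with PUT, so the sketch does the same.
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("http://localhost:9998", new byte[0], "markdown");
        System.out.println(request.method() + " " + request.uri());
        // To send against a running server:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

Validating the handler value on the client is a choice made for this sketch; how the server treats an unrecognized value is not stated in the docs above.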
+
+=== Recursive Metadata (`/rmeta`)
+
+The `/rmeta` endpoint returns metadata for the container document and all embedded documents
+as a JSON array of metadata objects.
+
+[source,bash]
+----
+curl -T document.pdf http://localhost:9998/rmeta
+----
+
+Content handler can be specified in the URL path:
+
+* `/rmeta/text` - plain text content (default)
+* `/rmeta/html` - HTML content
+* `/rmeta/xml` - XHTML content
+* `/rmeta/markdown` - Markdown content
+* `/rmeta/ignore` - metadata only, no content
+
+[source,bash]
+----
+curl -T document.docx http://localhost:9998/rmeta/markdown
+----
+
 == Topics
 
 * xref:using-tika/server/tls.adoc[TLS/SSL Configuration] - Secure your server with TLS and mutual authentication

tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java

Lines changed: 10 additions & 0 deletions
@@ -100,6 +100,7 @@
 import org.apache.tika.sax.ContentHandlerFactory;
 import org.apache.tika.sax.ExpandedTitleContentHandler;
 import org.apache.tika.sax.RecursiveParserWrapperHandler;
+import org.apache.tika.sax.ToMarkdownContentHandler;
 import org.apache.tika.sax.WriteOutContentHandler;
 import org.apache.tika.sax.boilerpipe.BoilerpipeContentHandler;
 import org.apache.tika.serialization.JsonMetadata;
@@ -225,6 +226,12 @@ public void process(TikaInputStream tis, OutputStream output, Metadata metadata)
      * Fork mode plugins directory.
      */
     private String forkPluginsDir = null;
+    private final OutputType MARKDOWN = new OutputType() {
+        @Override
+        protected ContentHandler getContentHandler(OutputStream output, Metadata metadata) throws Exception {
+            return new BodyContentHandler(new ToMarkdownContentHandler(getOutputWriter(output, encoding)));
+        }
+    };
     private final OutputType XML = new OutputType() {
         @Override
         protected ContentHandler getContentHandler(OutputStream output, Metadata metadata) throws Exception {
@@ -483,6 +490,8 @@ public void process(String arg) throws Exception {
             type = XML;
         } else if (arg.equals("-h") || arg.equals("--html")) {
             type = HTML;
+        } else if (arg.equals("--md")) {
+            type = MARKDOWN;
         } else if (arg.equals("-t") || arg.equals("--text")) {
             type = TEXT;
         } else if (arg.equals("-T") || arg.equals("--text-main")) {
@@ -744,6 +753,7 @@ private void usage() {
         out.println(" -x or --xml Output XHTML content (default)");
         out.println(" -h or --html Output HTML content");
         out.println(" -t or --text Output plain text content (body)");
+        out.println(" --md Output Markdown content (body)");
         out.println(" -T or --text-main Output plain text content (main content only via boilerpipe handler)");
         out.println(" -A or --text-all Output all text content");
         out.println(" -m or --metadata Output only metadata");
