|
| 1 | +// |
| 2 | +// Licensed to the Apache Software Foundation (ASF) under one or more |
| 3 | +// contributor license agreements. See the NOTICE file distributed with |
| 4 | +// this work for additional information regarding copyright ownership. |
| 5 | +// The ASF licenses this file to You under the Apache License, Version 2.0 |
| 6 | +// (the "License"); you may not use this file except in compliance with |
| 7 | +// the License. You may obtain a copy of the License at |
| 8 | +// |
| 9 | +// http://www.apache.org/licenses/LICENSE-2.0 |
| 10 | +// |
| 11 | +// Unless required by applicable law or agreed to in writing, software |
| 12 | +// distributed under the License is distributed on an "AS IS" BASIS, |
| 13 | +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| 14 | +// See the License for the specific language governing permissions and |
| 15 | +// limitations under the License. |
| 16 | +// |
| 17 | + |
| 18 | += Parse Modes |
| 19 | + |
| 20 | +Tika Pipes uses `ParseMode` to control how documents are parsed and how results are emitted. |
| 21 | +The parse mode is set on the `ParseContext` or configured in `PipesConfig`. |
| 22 | + |
| 23 | +== Available Parse Modes |
| 24 | + |
| 25 | +[cols="1,3"] |
| 26 | +|=== |
| 27 | +|Mode |Description |
| 28 | + |
| 29 | +|`RMETA` |
| 30 | +|Default mode. Each embedded document produces a separate `Metadata` object. |
| 31 | +Results are returned as a JSON array of metadata objects. |
| 32 | + |
| 33 | +|`CONCATENATE` |
| 34 | +|All content from embedded documents is concatenated into a single content field. |
| 35 | +Results are returned as a single `Metadata` object with all metadata preserved. |
| 36 | + |
| 37 | +|`CONTENT_ONLY` |
| 38 | +|Parses like `CONCATENATE` but emits only the raw extracted content, with no JSON wrapper |
| 39 | +and no metadata fields. Useful when you want just the text, Markdown, or HTML output. |
| 40 | + |
| 41 | +|`NO_PARSE` |
| 42 | +|Skip parsing entirely. Useful for pipelines that only need to fetch and emit raw bytes. |
| 43 | + |
| 44 | +|`UNPACK` |
| 45 | +|Extract raw bytes from embedded documents. See xref:pipes/unpack-config.adoc[Extracting Embedded Bytes]. |
| 46 | +|=== |
| 47 | + |
| 48 | +== CONCATENATE Mode |
| 49 | + |
| 50 | +`CONCATENATE` merges all content from embedded documents into a single content field |
| 51 | +while preserving all metadata from parsing: |
| 52 | + |
| 53 | +[source,json] |
| 54 | +---- |
| 55 | +{ |
| 56 | + "parseContext": { |
| 57 | + "parseMode": "CONCATENATE" |
| 58 | + } |
| 59 | +} |
| 60 | +---- |
| 61 | + |
| 62 | +The result is a single `Metadata` object containing the concatenated content in |
| 63 | +`X-TIKA:content` along with all other metadata fields (title, author, content type, etc.). |
| 64 | + |
| 65 | +== CONTENT_ONLY Mode |
| 66 | + |
| 67 | +`CONTENT_ONLY` is designed for use cases where you want only the extracted content |
| 68 | +written to storage, with no JSON wrapping and no metadata overhead. This is particularly |
| 69 | +useful for: |
| 70 | + |
| 71 | +* Extracting markdown files from a document corpus |
| 72 | +* Building plain text search indexes |
| 73 | +* Generating HTML versions of documents |
| 74 | + |
| 75 | +[source,json] |
| 76 | +---- |
| 77 | +{ |
| 78 | + "parseContext": { |
| 79 | + "parseMode": "CONTENT_ONLY" |
| 80 | + } |
| 81 | +} |
| 82 | +---- |
| 83 | + |
| 84 | +=== How It Works |
| 85 | + |
| 86 | +1. Documents are parsed exactly as in `CONCATENATE` mode: all embedded content is |
| 87 | +   merged into a single content field. |
| 88 | +2. A metadata filter automatically strips all metadata except `X-TIKA:content` and |
| 89 | +   `X-TIKA:CONTAINER_EXCEPTION` (for error tracking). |
| 90 | +3. When the emitter is a `StreamEmitter` (such as the filesystem or S3 emitter), the |
| 91 | +   raw content string is written directly as bytes, with no JSON serialization. |
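
The three steps above can be modelled in a few lines of plain Python. This is only an illustrative sketch, not Tika's implementation: the function names, the newline used to join content, and the JSON fallback for non-stream emitters are all assumptions made for the example. Only the two `X-TIKA:` field names come from the text.

```python
import json

CONTENT_KEY = "X-TIKA:content"
EXCEPTION_KEY = "X-TIKA:CONTAINER_EXCEPTION"


def concatenate(metadata_list):
    """Step 1: merge the content of the container and its embedded documents."""
    merged = dict(metadata_list[0])  # start from the container's metadata
    parts = [m[CONTENT_KEY] for m in metadata_list if m.get(CONTENT_KEY)]
    merged[CONTENT_KEY] = "\n".join(parts)  # separator is an assumption
    return merged


def keep_content_only(metadata):
    """Step 2: drop every field except the content and the container exception."""
    return {k: v for k, v in metadata.items() if k in (CONTENT_KEY, EXCEPTION_KEY)}


def emit(metadata, stream_emitter):
    """Step 3: a stream emitter receives raw bytes; otherwise fall back to JSON."""
    if stream_emitter:
        return metadata.get(CONTENT_KEY, "").encode("utf-8")
    return json.dumps(metadata).encode("utf-8")  # assumed fallback


# Hypothetical parse result: a container document plus one embedded attachment.
docs = [
    {"title": "Report", CONTENT_KEY: "# Report"},
    {"title": "Attachment", CONTENT_KEY: "Appendix text"},
]
filtered = keep_content_only(concatenate(docs))
raw = emit(filtered, stream_emitter=True)
```

In this model, `raw` holds nothing but the UTF-8 bytes of the merged content, which is what ends up in the output file.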
| 92 | + |
| 93 | +=== Metadata Filtering |
| 94 | + |
| 95 | +By default, `CONTENT_ONLY` mode applies an `IncludeFieldMetadataFilter` that retains |
| 96 | +only `X-TIKA:content` and `X-TIKA:CONTAINER_EXCEPTION`. If you set your own |
| 97 | +`MetadataFilter` on the `ParseContext`, your filter takes priority. |
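
The priority rule can be pictured as a simple fallback: use the caller's filter if one is set, otherwise use the default include-list. The sketch below models that rule in plain Python with dictionaries standing in for `Metadata`. The helper names (`resolve_filter`, `keep_title_too`) and the sample fields are hypothetical and are not part of Tika's API.

```python
DEFAULT_KEPT_FIELDS = frozenset({"X-TIKA:content", "X-TIKA:CONTAINER_EXCEPTION"})


def default_filter(metadata):
    """Model of the default include-list: keep only the two X-TIKA fields."""
    return {k: v for k, v in metadata.items() if k in DEFAULT_KEPT_FIELDS}


def resolve_filter(user_filter=None):
    """Return the user's filter when one is set, otherwise the default."""
    return user_filter if user_filter is not None else default_filter


def keep_title_too(metadata):
    """Hypothetical custom filter that also retains the title."""
    return {k: v for k, v in metadata.items() if k in {"X-TIKA:content", "dc:title"}}


metadata = {
    "X-TIKA:content": "hello",
    "dc:title": "Greeting",
    "Content-Type": "text/plain",
}

default_result = resolve_filter()(metadata)               # default include-list applies
custom_result = resolve_filter(keep_title_too)(metadata)  # the custom filter wins
```

With no custom filter only the content field survives; with `keep_title_too` the title is retained as well, because the default is never applied.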
| 98 | + |
| 99 | +=== CLI Usage |
| 100 | + |
| 101 | +The `tika-async-cli` batch processor supports `CONTENT_ONLY` via the `--content-only` |
| 102 | +flag: |
| 103 | + |
| 104 | +[source,bash] |
| 105 | +---- |
| 106 | +java -jar tika-async-cli.jar -i /input -o /output -h m --content-only |
| 107 | +---- |
| 108 | + |
| 109 | +With the `m` handler type (`-h m`), this produces `.md` files containing only the |
| 110 | +extracted Markdown content. |
| 111 | + |
| 112 | +=== Content Handler Types |
| 113 | + |
| 114 | +The content format depends on the configured handler type: |
| 115 | + |
| 116 | +[cols="1,1,2"] |
| 117 | +|=== |
| 118 | +|Handler |Extension |Description |
| 119 | + |
| 120 | +|`t` (text) |
| 121 | +|`.txt` |
| 122 | +|Plain text output |
| 123 | + |
| 124 | +|`h` (html) |
| 125 | +|`.html` |
| 126 | +|HTML output |
| 127 | + |
| 128 | +|`x` (xml) |
| 129 | +|`.xml` |
| 130 | +|XHTML output |
| 131 | + |
| 132 | +|`m` (markdown) |
| 133 | +|`.md` |
| 134 | +|Markdown output |
| 135 | + |
| 136 | +|`b` (body) |
| 137 | +|`.txt` |
| 138 | +|Body content handler output |
| 139 | +|=== |
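
For scripts that post-process the output directory, the table above can be written as a lookup. This is a minimal sketch: the mapping is copied from the table, while the `output_name` helper and the choice to append the extension to the full input file name are assumptions for illustration and may not match how the CLI actually names its output files.

```python
# Handler type -> output file extension, as listed in the table above.
HANDLER_EXTENSIONS = {
    "t": ".txt",   # plain text
    "h": ".html",  # HTML
    "x": ".xml",   # XHTML
    "m": ".md",    # Markdown
    "b": ".txt",   # body content handler
}


def output_name(input_name, handler):
    """Hypothetical helper: append the handler's extension to the input name."""
    try:
        return input_name + HANDLER_EXTENSIONS[handler]
    except KeyError:
        raise ValueError(f"unknown handler type: {handler!r}") from None


markdown_name = output_name("report.pdf", "m")
text_name = output_name("report.pdf", "t")
```

Note that `t` and `b` share the `.txt` extension, so the extension alone does not tell you which of those two handlers produced a file.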