SOLR-18071 Support Stored Fields in Export Writer #4053
Conversation
Force-pushed from bf813b7 to 093af42
epugh left a comment
I think this will scratch some of my own issues!
```java
}
SchemaField schemaField = req.getSchema().getField(field);
if (!schemaField.hasDocValues()) {
  throw new IOException(schemaField + " must have DocValues to use this feature.");
```
Love this! I've been burned so many times on wanting to do this and discovering missing DocValues.
Resolved comment on solr/core/src/java/org/apache/solr/handler/export/DoubleFieldWriter.java (outdated)
> An optional parameter `batchSize` determines the size of the internal buffers for partial results.
> The default value is `30000`, but users may want to specify smaller values to limit memory use (at the cost of degraded performance) or higher values to improve export performance (the relationship is not linear, and larger values don't bring proportionally larger performance increases).
>
> An optional parameter `includeStoredFields` (default `false`) enables exporting fields that only have stored values (no docValues).
Some day we'll have a way of thinking about whether this should be "true" ;-). A challenge of open source is that we struggle to know what should default to yes, even when it has knock-on impacts, until a lot of time passes. ;-)
Good point. I will add an explanation.
Actually, can I put the explanation in the CHANGELOG? The docs already warn:
Note that retrieving stored fields may significantly impact export performance compared to docValues fields, as stored fields require additional I/O operations.
Force-pushed from 9ff0b20 to d510972
Just a quick observation -- I see a force push that came after Eric's review. I strongly recommend against force pushing because it resets the review state for anyone who has already reviewed -- Eric in this case.
dsmiley left a comment
I really like how overall simple the implementation turned out to be. This wasn't a heavy lift. Nice job!
Curious... at this point, how do you think /export contrasts with /select + cursorMark? Pros/cons... (this could be documented in the ref guide as well to durably record these insights)
```java
if (map == null) {
  map = new WeakHashMap<>();
  storedFieldsMap.set(map);
}
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this pattern can be improved to basically be handled at the ThreadLocal declaration to provide an initializer.
But moreover I'm concerned about the use of ThreadLocal in the first place -- it's typically a tool of last resort. And further true with use of weak references.
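As a minimal sketch of the initializer idea (the class and method names here are illustrative stand-ins, not Solr's actual code), `ThreadLocal.withInitial` removes the quoted null-check pattern entirely:

```java
import java.util.Map;
import java.util.WeakHashMap;

public class StoredFieldsCache {
    // The initializer is supplied at the declaration, so get() never
    // returns null and the explicit null-check/set dance goes away.
    private static final ThreadLocal<Map<String, Object>> STORED_FIELDS_MAP =
            ThreadLocal.withInitial(WeakHashMap::new);

    static Map<String, Object> map() {
        return STORED_FIELDS_MAP.get(); // lazily initialized once per thread
    }

    public static void main(String[] args) {
        map().put("k", 1);
        System.out.println(map().get("k")); // same thread sees the same map
    }
}
```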
BTW I'm fine with your use of it here but was speaking out-loud, maybe a little too out-loud :-)
I also dislike ThreadLocal; personally, coming from a reactive-programming background, it's especially painful. But I am not sure how else to share this non-thread-safe resource here without adding significant complexity.
Resolved comment on solr/core/src/java/org/apache/solr/handler/export/StoredFieldsWriter.java (outdated)
```java
if (fieldType instanceof BoolField) {
  // Convert "T"/"F" stored value to boolean true/false
  addField(fieldInfo.name, Boolean.valueOf(fieldType.indexedToReadable(value)));
} else {
  addField(fieldInfo.name, value);
}
```
when I see special cases like this, I ask myself... is there a fieldType method that should be handling this? If not do we need to add one?
CC @hossman if you are interested in review; you've looked at this topic in a nearby issue lately
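For illustration, a toy decoder showing the conversion the quoted branch performs, under the assumption that `BoolField` stores booleans in indexed form as `"T"`/`"F"` and that `indexedToReadable` maps them to `"true"`/`"false"` (the class and method below are hypothetical stand-ins, not Solr APIs):

```java
public class BoolDecode {
    // Toy stand-in for fieldType.indexedToReadable on a BoolField,
    // assuming the stored/indexed form is "T" or "F".
    static String indexedToReadable(String indexed) {
        return "T".equals(indexed) ? "true" : "false";
    }

    // Mirrors the quoted branch: the readable form is parsed by Boolean.valueOf.
    static boolean decode(String stored) {
        return Boolean.valueOf(indexedToReadable(stored));
    }

    public static void main(String[] args) {
        System.out.println(decode("T")); // true
        System.out.println(decode("F")); // false
    }
}
```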
Resolved comment on solr/core/src/java/org/apache/solr/handler/export/StoredFieldsWriter.java (outdated)
Resolved comment on solr/core/src/test/org/apache/solr/handler/export/TestExportWriter.java (outdated)
```java
// Check if field can use DocValues
boolean canUseDocValues =
    schemaField.hasDocValues()
        && (!(fieldType instanceof SortableTextField) || schemaField.useDocValuesAsStored());
```
is there precedent for this special case RE instanceof SortableTextField?
The reason is that for /export it was decided you don't need an explicit udvas (useDocValuesAsStored) for "cheap" fields, I guess (so anything that is not a text field)? Idk, I am trying to follow this thread.
Edit: it could also be a backwards-compat thing, since this udvas requirement wasn't there for the other fields, and when SortableTextField was added it was decided to handle it in a more correct way.
I've added a comment for this.
Could the selection logic here, even if somewhat trivial, be extracted so it's also used in multiple places in ExportWriter? That would help keep them in sync and also show via find-usages that the logic is used from multiple use-cases.
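A sketch of what that extraction might look like, with the schema lookups replaced by plain booleans so the predicate can be kept (and tested) in one place; the class and parameter names are hypothetical, not Solr's:

```java
public class DocValuesSelection {
    // The quoted condition, extracted into a single shared predicate.
    static boolean canUseDocValues(
            boolean hasDocValues, boolean isSortableText, boolean useDocValuesAsStored) {
        // Non-text fields need only docValues; a SortableTextField
        // additionally requires useDocValuesAsStored=true.
        return hasDocValues && (!isSortableText || useDocValuesAsStored);
    }

    public static void main(String[] args) {
        System.out.println(canUseDocValues(true, false, false)); // true
        System.out.println(canUseDocValues(true, true, false));  // false
        System.out.println(canUseDocValues(true, true, true));   // true
    }
}
```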
Resolved comment on solr/core/src/java/org/apache/solr/handler/export/StoredFieldsWriter.java
Resolved comment on solr/core/src/java/org/apache/solr/handler/export/StoredFieldsWriter.java (outdated)
Resolved comment on solr/core/src/java/org/apache/solr/handler/export/DoubleFieldWriter.java (outdated)
I've added a "Comparison with Cursors" section to the doc with my best effort at comparing the two. It definitely needs a second pair of eyes, if you don't mind.
gus-asf left a comment
Q's on the expanded docs (didn't look at the rest)
> With cursors, the query is re-executed for each page of results.
> In contrast, `/export` runs the filter query once and the resulting segment-level bitmasks are applied once per segment, after which the documents are simply iterated over.
> Additionally, the segments that existed when the stream was opened are held open for the duration of the export, eliminating the disappearing or duplicate document issues that can occur with cursors.
> The trade-off is that IndexReaders are kept around for longer periods of time.
One more sentence clarifying/quantifying the significance of readers being kept around would probably be good.
Maybe a pro/con 2x2 table?
Is it dangerous to export a very large stored field? What is the risk? How big is big?
I feel we're potentially suggesting the contributor here put more work into this than he bargained for. Any documentation he's comfortable writing is encouraged... and beyond that, well let's just get this merged and have real users kick the tires and we'll see.
> One more sentence clarifying/quantifying the significance of readers being kept around would probably be good.

Would the drawback be keeping older segments around longer than you otherwise would, as well as increased memory usage as the index drifts away from these "old" readers stuck on a view of the index from when the stream was started?
> Is it dangerous to export a very large stored field? What is the risk? How big is big?

I didn't test the limits of this. I assume that if your field could be ingested then it could also be exported, although that may be a naive assumption. I imagine very large fields, e.g. >100MB, would be problematic in a variety of circumstances, not just for the ExportHandler.
> Maybe a pro/con 2x2 table?

If I have time I can give this a try. This is the first time I am updating the ref guide, so I should probably figure out how to build it locally to see how something like this renders.
> I feel we're potentially suggesting the contributor here put more work into this than he bargained for. Any documentation he's comfortable writing is encouraged... and beyond that, well, let's just get this merged and have real users kick the tires and we'll see.

I added comments since he asked for a second set of eyes on the docs, and I didn't mark it "changes requested". Feel free to take 'em or leave 'em. These are the things I suspect a reader might wonder. I have a notion of some of the answers, though it's quite likely @kotman12 has thought about it more carefully and more recently than I have.
> Maybe a pro/con 2x2 table?

> If I have time I can give this a try. This is the first time I am updating the ref guide so I probably should figure out how to build it locally to see how something like this renders.

The table idea is just one way to present pro/con material. The prose I was reading didn't seem to clearly group all the pros for one option where they could easily be compared to the pros for the other, etc. Bullet lists or careful use of paragraphs could also be effective.
As for building the guide, I think it's much easier these days than it used to be. It should be just a matter of running the right gradle task (see https://github.com/apache/solr/blob/main/solr/solr-ref-guide/README.adoc). At one time there was a lot of fiddly stuff with getting the right NPM deps installed, but I think that has gone away.
Oh crud it's late and I've forgotten to switch browser profiles... so I've got the wrong user name (again). fsparv <==> gus-asf
dsmiley left a comment
This is super close to merging... I'll handle it next week :-)
This is an exciting capability that really opens doors to some use-cases :-)
> Another advantage of `/export` is significantly lower latency until the first document is returned, because the internal batch size is decoupled from the response message size.
> With cursors, you typically need to set the `rows` parameter to a high value (e.g., 100,000) to achieve decent throughput.
> However, this creates a "glugging" effect: when you request a large batch, Solr must build the entire payload and send it over the wire while your client waits.
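The cursor protocol being contrasted here can be sketched as a toy simulation: each "page" re-runs the query (here, filtering a sorted list) and resumes after the last sort value seen, which is why per-page latency scales with `rows`. All names below are illustrative; this is not Solr's API.

```java
import java.util.List;

public class CursorToy {
    // Each "request" re-executes the query and keeps only docs past the
    // cursor, mimicking cursorMark's resume-after-last-sort-value semantics.
    static List<Integer> page(List<Integer> sortedDocs, int afterCursor, int rows) {
        return sortedDocs.stream()
                .filter(d -> d > afterCursor)
                .limit(rows)
                .toList();
    }

    public static void main(String[] args) {
        List<Integer> docs = List.of(1, 2, 3, 4, 5);
        int cursor = Integer.MIN_VALUE; // analogue of cursorMark=*
        List<Integer> batch;
        while (!(batch = page(docs, cursor, 2)).isEmpty()) {
            System.out.println(batch);            // [1, 2] then [3, 4] then [5]
            cursor = batch.get(batch.size() - 1); // analogue of nextCursorMark
        }
    }
}
```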
I affirm the glugging, but your rationale/guessing is certainly false. For SearchHandler, the payload is streamed/produced on the fly as it iterates documents. The code isn't there; it's elsewhere in a ResponseWriter, if I recall. Solr does have to do some up-front work -- producing a list of document IDs that match the search, sorted as desired. This is the "QTime". Retrieving data to return comes after; it's not accumulated in memory; it's streamed, and it lengthens the true elapsed time.
Wouldn't /export have similar up-front costs to execute the query?
Anyway, the broad strokes of your message look good.
> For SearchHandler, the payload is streamed/produced on the fly as it iterates documents. The code isn't there; it's elsewhere in a ResponseWriter, if I recall. Solr does have to do some up-front work -- producing a list of document IDs that match the search, sorted as desired.

If this is the case, then one should be able to get a very large number of rows in one shot with the SearchHandler. I assume this is a Solr doc-ID list? If so, it will be larger than a doc-ID bitset but can still be pretty small in a lot of cases, making the need for the export handler more exotic. I do wonder how this SearchHandler response writing/streaming works in the multi-sharded case? My recollection is that the full responses of each shard are merged together at once (not streamed), but I could be misunderstanding. I do know that streaming over the /export handler is pretty efficient even in the multi-shard case.
bq. My recollection is that the full responses of each shard are merged together at once (not streamed) but I could be misunderstanding.
Aaaah, ok, right, you have a good point here! I wasn't considering distributed search, only a single shard's contribution. Albeit, comparing /export with cursors apples to apples, it'd be single-shard to single-shard. But the big picture changes things, so you're right about that in practice. I suppose there's a useful need for a streaming expression aggregator that can cursorMark over the shards individually to avoid that cost. I looked for cursorMark usage in streaming expressions and I'm surprised to see none. There will be even less need for such a thing after this PR!
Anyway, I don't think the pros/cons in this document need to go into technical/internal depth; it's rare that the ref guide speaks of such geeky internal things (e.g. searchers).
> The cases where this functionality may be useful include: session analysis, distributed merge joins, time series roll-ups, aggregations on high cardinality fields, fully distributed field collapsing, and sort-based stats.
>
> == Comparison with Cursors
BTW I very much appreciate the extra effort here!
Unless you are in the mood, don't go off and do performance experiments just because we're asking questions. Say what you're comfortable claiming and not more, and that's fine :-)
I think we should at least cross-link pagination-of-results.adoc and exporting-result-sets.adoc because they are obviously related. Their embeddings ought to be similar ;-)
> With the `/export` handler, these steps are decoupled - Solr can continue sorting and decoding/encoding documents while waiting for more demand from the client.
>
> The advantage of cursors is flexibility.
> A cursor mark can be persisted and resumed later, even across restarts, whereas an `/export` stream is entirely in-memory and must be consumed in a single session.
```diff
- A cursor mark can be persisted and resumed later, even across restarts, whereas an `/export` stream is entirely in-memory and must be consumed in a single session.
+ A `cursorMark` can be persisted and resumed later, even across restarts, or never continued if enough results were consumed to satisfy the use-case.
+ An `/export` stream must be consumed in a single session.
```
I'm tempted to say that a stream should be completely consumed but maybe /export can handle a client that doesn't want more data, gracefully? Do you know?
You can close a TupleStream. IIRC this results in an exception on the server side indicating the client closed the stream.
I do feel it is implied that an export stream doesn't need to be fully consumed. Like what if the client crashes? It would be unreasonable to implement export in a way that can't handle a crashing client. I suppose one could mention close but not sure if this is the right place.
right, of course, I mean I wonder if there was testing/care to ensure Solr isn't unreasonably noisy in its logs if a client were to do this routinely.
For example, a year ago I was faced with a use-case that wanted to consume an unknown number of docs so it could post-filter those to ultimately come up with a target threshold of documents. I was thinking of cursorMark at the time; I wanted the score returned. If /export is a possibility, it could have a benefit of being able to stop exactly when I'm finished consuming docs. But if that would spew logs/errors then maybe that wouldn't be a great idea.
Three resolved comments on solr/solr-ref-guide/modules/query-guide/pages/exporting-result-sets.adoc (outdated)
Co-authored-by: David Smiley <dsmiley@apache.org>
https://issues.apache.org/jira/browse/SOLR-18071
Description
Adds support for exporting stored-only fields (fields without docValues) in the `/export` request handler via a new `includeStoredFields` parameter. Previously, all fields in the field list (`fl`) were required to have docValues enabled. This change allows users to include stored fields that don't have docValues, which can be useful, e.g., when exporting fields whose types don't support docValues, or when exporting data that has already been indexed without DVs.

Solution

If `fl` explicitly names a stored-only field and `includeStoredFields` is not enabled, the request fails with a 400 and a hint to add `includeStoredFields=true`. For glob patterns (e.g., `fl=*` or `fl=intdv,*`), stored-only fields are skipped unless `includeStoredFields=true`, to preserve backward compatibility. The current implementation fetches DV-enabled fields from `StoredFields` when some stored fields have already been requested. This avoids a DV lookup for those fields, which makes sense since we have to parse the `StoredFields` anyway. My (somewhat limited) benchmarks appear to corroborate that this is the best choice for performance.

A quirky thing about this implementation is that the very much internal `FieldWriter` API was changed to support more than one field. This makes it more interchangeable with the existing `StoredFieldVisitor` interface, which assumes one visitor per many fields. I landed on this rather than creating an adapter to bridge the two, as it appeared to be simpler. It's worth stressing again that the `FieldWriter` is very much internal to the export package, and the boolean it returned was effectively discarded (the local `fieldIndex` it drives isn't even used anywhere). It could be argued that `FieldWriter::write` could be `void`, and I'd also be open to such a change.

Tests
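A hypothetical sketch of the multi-field writer shape described above (interfaces and names are illustrative stand-ins, not the actual Solr internals; the checked IOException of the real API is omitted to keep the toy self-contained):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FieldWriterSketch {
    // A sink the writer emits into, loosely mirroring StoredFieldVisitor's
    // one-visitor-per-many-fields model.
    interface FieldSink {
        void addField(String name, Object value);
    }

    // Unlike a one-field-per-writer API, a single write() call may emit
    // several fields for the same document.
    interface MultiFieldWriter {
        void write(int docId, FieldSink out);
    }

    static Map<String, Object> demo() {
        Map<String, Object> seen = new LinkedHashMap<>();
        MultiFieldWriter w = (docId, out) -> {
            out.addField("id", docId);
            out.addField("title", "doc-" + docId);
        };
        w.write(7, seen::put); // one call, two fields emitted
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // {id=7, title=doc-7}
    }
}
```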
Also have some performance comparisons of exporting vs not exporting stored fields:
stored-fields-export-writer-1k-doc-benchmark.txt
stored-fields-export-writer-140k-doc-benchmark.txt
Checklist
Please review the following and check all that apply:
`main` branch.
`./gradlew check`.