Conversation

@kotman12 (Contributor)

https://issues.apache.org/jira/browse/SOLR-18071

Description

Adds support for exporting stored-only fields (fields without docValues) in the /export request handler via a new includeStoredFields parameter. Previously, all fields in the field list (fl) were required to have docValues enabled. This change allows users to include stored fields that don't have docValues, which can be useful, e.g., when exporting field types that don't support docValues or when exporting data that was already indexed without them.

Solution

If fl explicitly names a stored-only field and includeStoredFields is not enabled, the request fails with a 400 and a hint to add includeStoredFields=true. For glob patterns (e.g., fl=* or fl=intdv,*), stored-only fields are skipped unless includeStoredFields=true, to preserve backward compatibility. When stored fields have been requested, the current implementation fetches DV-enabled fields from StoredFields as well. This avoids a per-field DV lookup, which makes sense since we have to parse the StoredFields anyway. My (somewhat limited) benchmarks appear to corroborate that this is the best choice for performance.
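The selection rules above can be sketched as a small predicate. This is an illustrative sketch only; the method and exception names below are hypothetical and are not the actual Solr implementation.

```java
public class ExportFieldSelection {

    // Returns true if the field should be exported, false if it should be
    // silently skipped (a glob match on a stored-only field, for back-compat).
    // Throws for a stored-only field explicitly named in fl when
    // includeStoredFields is off, mirroring the 400-plus-hint behavior.
    public static boolean shouldExport(
            boolean hasDocValues, boolean explicitlyNamedInFl, boolean includeStoredFields) {
        if (hasDocValues || includeStoredFields) {
            return true;
        }
        if (explicitlyNamedInFl) {
            throw new IllegalArgumentException(
                    "stored-only field requires includeStoredFields=true");
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(shouldExport(true, true, false));   // DV field: exported
        System.out.println(shouldExport(false, false, false)); // glob, stored-only: skipped
    }
}
```

The three booleans stand in for the schema and request-parameter checks; the real code works with SchemaField and the parsed fl.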

A quirky thing about this implementation is that the very much internal FieldWriter API was changed to support more than one field. This makes it more interchangeable with the existing StoredFieldVisitor interface, which assumes one visitor for many fields. I landed on this rather than creating an adapter to bridge the two because it appeared simpler. It's worth stressing again that FieldWriter is internal to the export package, and the boolean it returned was effectively discarded (the local fieldIndex it drives isn't used anywhere). It could be argued that FieldWriter::write should be void, and I'd be open to such a change.

Tests

  • Explicit stored-only field export succeeds with includeStoredFields=true (single-valued and multi-valued).
  • Explicit stored-only field export fails without the parameter and includes the includeStoredFields=true hint and field name.
  • Glob fl skips stored-only fields without the parameter (request succeeds, stored-only fields not present).
  • Glob fl includes stored-only fields with the parameter.
  • Coverage for stored field types: string, int, long, float, double, boolean, date.

Also have some performance comparisons of exporting vs not exporting stored fields:
stored-fields-export-writer-1k-doc-benchmark.txt
stored-fields-export-writer-140k-doc-benchmark.txt

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide.
  • I have added a changelog entry for my change.

@github-actions github-actions bot added the documentation (Improvements or additions to documentation) and tests labels Jan 15, 2026
@kotman12 kotman12 force-pushed the stored-fields-export-writer branch from bf813b7 to 093af42 Compare January 15, 2026 23:01
@epugh (Contributor) left a comment:
I think this will scratch some of my own issues!

}
SchemaField schemaField = req.getSchema().getField(field);
if (!schemaField.hasDocValues()) {
throw new IOException(schemaField + " must have DocValues to use this feature.");
Contributor:

Love this! I've been burned so many times on wanting to do this and discovering missing DocValues.

An optional parameter `batchSize` determines the size of the internal buffers for partial results.
The default value is `30000`, but users may want to specify smaller values to limit memory use (at the cost of degraded performance) or higher values to improve export performance (the relationship is not linear; larger values don't bring proportionally larger performance increases).

An optional parameter `includeStoredFields` (default `false`) enables exporting fields that only have stored values (no docValues).
Contributor:

Some day we'll have a way of thinking about whether this should be "true" ;-). A challenge of open source is that we struggle to know what should be a default yes, even when it has knock-on impacts, until a lot of time passes. ;-)

Contributor Author:

Good point. I will add an explanation.

@kotman12 (Contributor Author) Jan 16, 2026:

Actually, can I put the explanation in the CHANGELOG? The docs already warn:

Note that retrieving stored fields may significantly impact export performance compared to docValues fields, as stored fields require additional I/O operations.

@kotman12 kotman12 force-pushed the stored-fields-export-writer branch from 9ff0b20 to d510972 Compare January 16, 2026 20:22
@dsmiley (Contributor) commented Jan 17, 2026:

Just a quick observation -- I see a force push that came after Eric's review. I strongly recommend against force pushing because it resets the review state for anyone who has already reviewed -- Eric in this case.

@dsmiley (Contributor) left a comment:

I really like how overall simple the implementation turned out to be. This wasn't a heavy lift. Nice job!

Curious... at this point, how do you think /export contrasts with /select + cursorMark? Pros/cons... (this could be documented in the ref guide as well to durably record these insights)

Comment on lines 52 to 55
if (map == null) {
map = new WeakHashMap<>();
storedFieldsMap.set(map);
}
Contributor:

This pattern can be improved by providing an initializer at the ThreadLocal declaration.

But moreover, I'm concerned about the use of ThreadLocal in the first place -- it's typically a tool of last resort. That goes doubly for the use of weak references.
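For illustration, the declaration-site initializer referred to here can look like the following. This is a generic sketch with placeholder types, not the actual ExportWriter code:

```java
import java.util.Map;
import java.util.WeakHashMap;

public class ThreadLocalInitializer {
    // ThreadLocal.withInitial supplies the map lazily, once per thread, so the
    // null-check-and-set pattern quoted above is no longer needed at each use.
    private static final ThreadLocal<Map<Object, Object>> STORED_FIELDS_MAP =
            ThreadLocal.withInitial(WeakHashMap::new);

    public static Map<Object, Object> mapForCurrentThread() {
        return STORED_FIELDS_MAP.get(); // never null
    }

    public static void main(String[] args) {
        mapForCurrentThread().put("key", "value");
        System.out.println(mapForCurrentThread().get("key")); // prints "value"
    }
}
```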

Contributor:

BTW I'm fine with your use of it here but was speaking out-loud, maybe a little too out-loud :-)

Contributor Author:

I also dislike ThreadLocal; coming from a reactive-programming background, I find it especially painful. But I'm not sure how else to share this non-thread-safe resource here without adding significant complexity.

Comment on lines +82 to +87
if (fieldType instanceof BoolField) {
// Convert "T"/"F" stored value to boolean true/false
addField(fieldInfo.name, Boolean.valueOf(fieldType.indexedToReadable(value)));
} else {
addField(fieldInfo.name, value);
}
Contributor:

When I see special cases like this, I ask myself... is there a fieldType method that should be handling this? If not, do we need to add one?
CC @hossman if you are interested in review; you've looked at this topic in a nearby issue lately
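To make the quoted special case concrete: BoolField stores booleans as "T"/"F", and indexedToReadable maps those to "true"/"false". The self-contained sketch below mimics that two-step conversion; indexedToReadableSketch is a stand-in for the real FieldType method, and toExportValue is a hypothetical name for the quoted branch:

```java
public class BoolStoredConversion {
    // Stand-in for BoolField's indexedToReadable: maps the stored "T"/"F"
    // representation to "true"/"false".
    static String indexedToReadableSketch(String storedValue) {
        return "T".equals(storedValue) ? "true" : "false";
    }

    // Mirrors the quoted branch: convert a bool field's stored value to a
    // real Boolean before handing it to the writer; other types pass through.
    static Object toExportValue(boolean isBoolField, String storedValue) {
        return isBoolField
                ? Boolean.valueOf(indexedToReadableSketch(storedValue))
                : storedValue;
    }

    public static void main(String[] args) {
        System.out.println(toExportValue(true, "T"));  // a real Boolean
        System.out.println(toExportValue(false, "T")); // passes through unchanged
    }
}
```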

// Check if field can use DocValues
boolean canUseDocValues =
schemaField.hasDocValues()
&& (!(fieldType instanceof SortableTextField) || schemaField.useDocValuesAsStored());
Contributor:

is there precedent for this special case RE instanceof SortableTextField?

@kotman12 (Contributor Author) Jan 20, 2026:

The reason is that for /export it was apparently decided that you don't need an explicit udvas (useDocValuesAsStored) for "cheap" fields (i.e., anything that is not a text field)? I'm not sure; I'm trying to follow this thread.

Edit: it could also be a backwards-compat thing, since the udvas requirement wasn't there for the other fields, and when SortableTextField was added it was decided to handle it in a more correct way.

Contributor Author:

I've added a comment for this.

Contributor:

Could the selection logic here, even if somewhat trivial, be extracted so it's also used in multiple places in ExportWriter? That would help keep them in sync and also show via find-usages that the logic is used from multiple use-cases.
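The suggested extraction could be as small as a shared helper. A sketch under assumed names, where the boolean parameters stand in for the schema calls in the quoted snippet (schemaField.hasDocValues(), instanceof SortableTextField, schemaField.useDocValuesAsStored()):

```java
public class DocValuesEligibility {
    // Extracted form of the quoted predicate: a field can be read via
    // docValues unless it is a SortableTextField that doesn't opt into
    // useDocValuesAsStored.
    public static boolean canUseDocValues(
            boolean hasDocValues, boolean isSortableText, boolean useDocValuesAsStored) {
        return hasDocValues && (!isSortableText || useDocValuesAsStored);
    }
}
```

With a helper like this, find-usages shows every call site, keeping the ExportWriter use-cases in sync as suggested.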

@kotman12 (Contributor Author):

I really like how overall simple the implementation turned out to be. This wasn't a heavy lift. Nice job!

Curious... at this point, how do you think /export contrasts with /select + cursorMark? Pros/cons... (this could be documented in the ref guide as well to durably record these insights)

I've added a "Comparison with Cursors" section to the doc with my best effort comparing the two. Definitely needs a second pair of eyes if you don't mind.

@gus-asf (Contributor) left a comment:

Q's on the expanded docs (didn't look at the rest)

With cursors, the query is re-executed for each page of results.
In contrast, `/export` runs the filter query once and the resulting segment-level bitmasks are applied once per segment, after which the documents are simply iterated over.
Additionally, the segments that existed when the stream was opened are held open for the duration of the export, eliminating the disappearing or duplicate document issues that can occur with cursors.
The trade-off is that IndexReaders are kept around for longer periods of time.
Contributor:

One more sentence clarifying/quantifying the significance of readers being kept around would probably be good.

Contributor:

Maybe a pro/con 2x2 table?

Contributor:

Is it dangerous to export a very large stored field? What is the risk? How big is big?

Contributor:

I feel we're potentially suggesting the contributor here put more work into this than he bargained for. Any documentation he's comfortable writing is encouraged... and beyond that, well let's just get this merged and have real users kick the tires and we'll see.

Contributor Author:

One more sentence clarifying/quantifying the significance of readers being kept around would probably be good.

Would the drawback be keeping older segments around longer than you otherwise would, as well as increased memory usage as the index drifts away from these "old" readers stuck on a view of the index from when the stream was started?

Contributor Author:

Is it dangerous to export a very large stored field? What is the risk? How big is big?

I didn't test the limits of this. I assume that if your field could be ingested then it could also be exported, although that may be a naive assumption. I imagine very large fields, e.g. >100 MB, would be problematic in a variety of circumstances, not just for the ExportHandler.

Contributor Author:

Maybe a pro/con 2x2 table?

If I have time I can give this a try. This is the first time I am updating the ref guide so I probably should figure out how to build it locally to see how something like this renders.

gus-asf:

I feel we're potentially suggesting the contributor here put more work into this than he bargained for. Any documentation he's comfortable writing is encouraged... and beyond that, well let's just get this merged and have real users kick the tires and we'll see.

I added comments since he asked for a second set of eyes on the docs, and didn't mark it changes requested. Feel free to take em or leave em. These are the things I suspect a reader might wonder. I have a notion of some of the answers, though it's quite likely @kotman12 has thought more carefully about it more recently than I.

gus-asf:

Maybe a pro/con 2x2 table?

If I have time I can give this a try. This is the first time I am updating the ref guide so I probably should figure out how to build it locally to see how something like this renders.

The table idea is just one way to present pro/con material. The prose I was reading didn't seem to clearly group all the pros for one option where they could easily be compared with the pros for the other, etc. Bullet lists or careful use of paragraphs could also be effective.

As for building the guide I think that these days it's much easier than it used to be. Should be just a matter of running the right gradle task. (see https://github.com/apache/solr/blob/main/solr/solr-ref-guide/README.adoc) At one time there was a lot of fiddly stuff with getting the right NPM deps installed, but I think that has gone away.

gus-asf:

Oh crud it's late and I've forgotten to switch browser profiles... so I've got the wrong user name (again). fsparv <==> gus-asf

@dsmiley (Contributor) left a comment:

This is super close to merging... I'll handle it next week :-)
This is an exciting capability that really opens doors to some use-cases :-)


Another advantage of `/export` is significantly lower latency until the first document is returned, because the internal batch size is decoupled from the response message size.
With cursors, you typically need to set the `rows` parameter to a high value (e.g., 100,000) to achieve decent throughput.
However, this creates a "glugging" effect: when you request a large batch, Solr must build the entire payload and send it over the wire while your client waits.
Contributor:

I affirm the glugging, but your rationale/guessing is certainly false. For SearchHandler, the payload is streamed/produced on the fly as it iterates documents. The code isn't there; it's elsewhere in a ResponseWriter, if I recall. Solr does have to do some up-front work -- producing a list of document IDs that match the search, sorted as desired. This is the "QTime". Retrieving data to return comes after; it's not accumulated in memory but streamed, and it lengthens the true elapsed time.

Wouldn't "export" have similar up-front costs to execute the query?

Anyway, the broad strokes of your message look good.

Contributor Author:

For SearchHandler, The payload is streamed/produced on the fly as it iterates documents. The code isn't there; it's elsewhere in a ResponseWriter, if I recall. Solr does have to do some up-front work -- producing a list of document IDs that match the search, and sorted as desired.

If this is the case then one should be able to get a very large number of rows in one shot with the SearchHandler. I assume this is a Solr doc ID list? If so, it will be larger than a doc ID bitset but can still be pretty small in a lot of cases, making the need for the export handler more exotic. I do wonder how this SearchHandler response writing/streaming works in the multi-sharded case? My recollection is that the full responses of each shard are merged together at once (not streamed), but I could be misunderstanding. I do know that /streaming over the /export handler is pretty efficient even in the multi-shard case.

Contributor:

My recollection is that the full responses of each shard are merged together at once (not streamed) but I could be misunderstanding.

Aaaah, ok right, you have a good point here! I wasn't considering distributed search, only a single shard's contribution. Albeit, comparing /export with cursors apples to apples, it'd be single-shard to single-shard. But the big picture changes things, so you're right about that in practice. I suppose there's some useful need for a streaming-expression aggregator that can cursorMark over the shards individually to avoid that cost. I looked for cursorMark usage in streaming expressions and I'm surprised to see none. There will be even less need for such a thing after this PR!

Anyway, I don't think the pros/cons in this document need to go into technical/internal depth; it's so rare that the ref guide speaks of such geeky internal things (e.g. searchers).


The cases where this functionality may be useful include: session analysis, distributed merge joins, time series roll-ups, aggregations on high cardinality fields, fully distributed field collapsing, and sort-based stats.

== Comparison with Cursors
Contributor:

BTW I very much appreciate the extra effort here!

Unless you are in the mood, don't go off and do performance experiments just because we're asking questions. Say what you're comfortable claiming and not more, and that's fine :-)

Contributor:

I think we should at least cross-link between pagination-of-results.adoc with exporting-result-sets.adoc because they are obviously related. Their embeddings ought to be similar ;-)

With the `/export` handler, these steps are decoupled - Solr can continue sorting and decoding/encoding documents while waiting for more demand from the client.

The advantage of cursors is flexibility.
A cursor mark can be persisted and resumed later, even across restarts, whereas an `/export` stream is entirely in-memory and must be consumed in a single session.
Contributor:

Suggested change
A cursor mark can be persisted and resumed later, even across restarts, whereas an `/export` stream is entirely in-memory and must be consumed in a single session.
A `cursorMark` can be persisted and resumed later, even across restarts, or never continued if enough results were consumed to satisfy the use-case.
An `/export` stream must be consumed in a single session.

I'm tempted to say that a stream should be completely consumed but maybe /export can handle a client that doesn't want more data, gracefully? Do you know?

Contributor Author:

You can close a TupleStream. IIRC this results in an exception on the server side indicating the client closed the stream.

Contributor Author:

I do feel it is implied that an export stream doesn't need to be fully consumed. What if the client crashes? It would be unreasonable to implement export in a way that can't handle a crashing client. I suppose one could mention close() here, but I'm not sure this is the right place.

Contributor:

right, of course, I mean I wonder if there was testing/care to ensure Solr isn't unreasonably noisy in its logs if a client were to do this routinely.

For example, a year ago I was faced with a use-case that wanted to consume an unknown number of docs so it could post-filter those to ultimately come up with a target threshold of documents. I was thinking of cursorMark at the time; I wanted the score returned. If /export is a possibility, it could have a benefit of being able to stop exactly when I'm finished consuming docs. But if that would spew logs/errors then maybe that wouldn't be a great idea.

kotman12 and others added 2 commits January 23, 2026 21:23
Co-authored-by: David Smiley <dsmiley@apache.org>
Co-authored-by: David Smiley <dsmiley@apache.org>
Co-authored-by: David Smiley <dsmiley@apache.org>

5 participants