SOLR-18071 Support Stored Fields in Export Writer #4053
Conversation
Force-pushed from bf813b7 to 093af42
epugh left a comment
I think this will scratch some of my own issues!
```java
}
SchemaField schemaField = req.getSchema().getField(field);
if (!schemaField.hasDocValues()) {
  throw new IOException(schemaField + " must have DocValues to use this feature.");
```
Love this! I've been burned so many times on wanting to do this and discovering missing DocValues.
Resolved comment on solr/core/src/java/org/apache/solr/handler/export/DoubleFieldWriter.java (outdated)
> An optional parameter `batchSize` determines the size of the internal buffers for partial results.
> The default value is `30000`, but users may want to specify smaller values to limit memory use (at the cost of degraded performance) or higher values to improve export performance (the relationship is not linear, and larger values don't bring proportionally larger performance increases).
>
> An optional parameter `includeStoredFields` (default `false`) enables exporting fields that only have stored values (no docValues).
Some day we'll have a way of thinking about whether this should be "true" ;-). A challenge of open source is that we struggle to know what should default to yes, even when it has knock-on impacts, until a lot of time passes. ;-)
Good point. I will add an explanation.
Actually, can I put the explanation in the CHANGELOG? The docs already warn:
Note that retrieving stored fields may significantly impact export performance compared to docValues fields, as stored fields require additional I/O operations.
Force-pushed from 9ff0b20 to d510972
Just a quick observation -- I see a force push that came after Eric's review. I strongly recommend against force pushing because it resets the review state for anyone who has already reviewed -- Eric in this case.
dsmiley left a comment
I really like how overall simple the implementation turned out to be. This wasn't a heavy lift. Nice job!
Curious... at this point, how do you think /export contrasts with /select + cursorMark? Pros/cons... (this could be documented in the ref guide as well to durably record these insights)
```java
if (map == null) {
  map = new WeakHashMap<>();
  storedFieldsMap.set(map);
}
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this pattern can be improved to basically be handled at the ThreadLocal declaration to provide an initializer.
But moreover I'm concerned about the use of ThreadLocal in the first place -- it's typically a tool of last resort. And further true with use of weak references.
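As a minimal sketch of the initializer idea (the class and method names here are illustrative stand-ins, not Solr's actual code), `ThreadLocal.withInitial` removes the quoted null-check pattern entirely:

```java
import java.util.Map;
import java.util.WeakHashMap;

public class StoredFieldsCache {
    // The initializer is supplied at the declaration, so get() never
    // returns null and the explicit null-check/set dance goes away.
    private static final ThreadLocal<Map<String, Object>> STORED_FIELDS_MAP =
            ThreadLocal.withInitial(WeakHashMap::new);

    static Map<String, Object> map() {
        return STORED_FIELDS_MAP.get(); // lazily initialized once per thread
    }

    public static void main(String[] args) {
        map().put("k", 1);
        System.out.println(map().get("k")); // same thread sees the same map
    }
}
```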
BTW I'm fine with your use of it here but was speaking out-loud, maybe a little too out-loud :-)
I also dislike ThreadLocal; personally, coming from a reactive-programming background, it's especially painful. But I am not sure how else to share this non-thread-safe resource here without adding significant complexity.
Resolved comment on solr/core/src/java/org/apache/solr/handler/export/StoredFieldsWriter.java (outdated)
```java
if (fieldType instanceof BoolField) {
  // Convert "T"/"F" stored value to boolean true/false
  addField(fieldInfo.name, Boolean.valueOf(fieldType.indexedToReadable(value)));
} else {
  addField(fieldInfo.name, value);
}
```
when I see special cases like this, I ask myself... is there a fieldType method that should be handling this? If not do we need to add one?
CC @hossman if you are interested in review; you've looked at this topic in a nearby issue lately
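For illustration, a toy decoder showing the conversion the quoted branch performs, under the assumption that `BoolField` stores booleans in indexed form as `"T"`/`"F"` and that `indexedToReadable` maps them to `"true"`/`"false"` (the class and method below are hypothetical stand-ins, not Solr APIs):

```java
public class BoolDecode {
    // Toy stand-in for fieldType.indexedToReadable on a BoolField,
    // assuming the stored/indexed form is "T" or "F".
    static String indexedToReadable(String indexed) {
        return "T".equals(indexed) ? "true" : "false";
    }

    // Mirrors the quoted branch: the readable form is parsed by Boolean.valueOf.
    static boolean decode(String stored) {
        return Boolean.valueOf(indexedToReadable(stored));
    }

    public static void main(String[] args) {
        System.out.println(decode("T")); // true
        System.out.println(decode("F")); // false
    }
}
```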
Resolved comment on solr/core/src/java/org/apache/solr/handler/export/StoredFieldsWriter.java (outdated)
Resolved comment on solr/core/src/test/org/apache/solr/handler/export/TestExportWriter.java (outdated)
```java
// Check if field can use DocValues
boolean canUseDocValues =
    schemaField.hasDocValues()
        && (!(fieldType instanceof SortableTextField) || schemaField.useDocValuesAsStored());
```
is there precedent for this special case RE instanceof SortableTextField?
The reason is that for /export it was decided you don't need an explicit udvas (useDocValuesAsStored) for "cheap" fields, I guess (so anything that is not a text field)? Idk, I am trying to follow this thread.
Edit: it could also be a backwards-compat thing, since this udvas requirement wasn't there for the other fields, and when SortableTextField was added it was decided to handle it in a more correct way.
I've added a comment for this.
Could the selection logic here, even if somewhat trivial, be extracted so it's also used in multiple places in ExportWriter? That would help keep them in sync and also show via find-usages that the logic is used from multiple use-cases.
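A sketch of what that extraction might look like, with the schema lookups replaced by plain booleans so the predicate can be kept (and tested) in one place; the class and parameter names are hypothetical, not Solr's:

```java
public class DocValuesSelection {
    // The quoted condition, extracted into a single shared predicate.
    static boolean canUseDocValues(
            boolean hasDocValues, boolean isSortableText, boolean useDocValuesAsStored) {
        // Non-text fields need only docValues; a SortableTextField
        // additionally requires useDocValuesAsStored=true.
        return hasDocValues && (!isSortableText || useDocValuesAsStored);
    }

    public static void main(String[] args) {
        System.out.println(canUseDocValues(true, false, false)); // true
        System.out.println(canUseDocValues(true, true, false));  // false
        System.out.println(canUseDocValues(true, true, true));   // true
    }
}
```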
Resolved comment on solr/core/src/java/org/apache/solr/handler/export/StoredFieldsWriter.java
Resolved comment on solr/core/src/java/org/apache/solr/handler/export/StoredFieldsWriter.java (outdated)
Resolved comment on solr/core/src/java/org/apache/solr/handler/export/DoubleFieldWriter.java (outdated)
I've added a "Comparison with Cursors" section to the doc with my best effort at comparing the two. It definitely needs a second pair of eyes, if you don't mind.
gus-asf left a comment
Q's on the expanded docs (didn't look at the rest)
> With cursors, the query is re-executed for each page of results.
> In contrast, `/export` runs the filter query once and the resulting segment-level bitmasks are applied once per segment, after which the documents are simply iterated over.
> Additionally, the segments that existed when the stream was opened are held open for the duration of the export, eliminating the disappearing or duplicate document issues that can occur with cursors.
> The trade-off is that IndexReaders are kept around for longer periods of time.
One more sentence clarifying/quantifying the significance of readers being kept around would probably be good.
Maybe a pro/con 2x2 table?
Is it dangerous to export a very large stored field? What is the risk? How big is big?
I feel we're potentially suggesting the contributor here put more work into this than he bargained for. Any documentation he's comfortable writing is encouraged... and beyond that, well let's just get this merged and have real users kick the tires and we'll see.
> One more sentence clarifying/quantifying the significance of readers being kept around would probably be good.

Would the drawback be keeping older segments around longer than you otherwise would, as well as increased memory usage as the index drifts away from these "old" readers stuck on a view of the index from when the stream was started?
> Is it dangerous to export a very large stored field? What is the risk? How big is big?

I didn't test the limits of this. I assume that if your field could be ingested then it could also be exported, although that may be a naive assumption. I imagine very large fields, e.g. >100MB, would be problematic in a variety of circumstances, not just for the ExportHandler.
> Maybe a pro/con 2x2 table?

If I have time I can give this a try. This is the first time I am updating the ref guide, so I should probably figure out how to build it locally to see how something like this renders.
> I feel we're potentially suggesting the contributor here put more work into this than he bargained for. Any documentation he's comfortable writing is encouraged... and beyond that, well, let's just get this merged and have real users kick the tires and we'll see.

I added comments since he asked for a second set of eyes on the docs, and I didn't mark it "changes requested". Feel free to take 'em or leave 'em. These are the things I suspect a reader might wonder. I have a notion of some of the answers, though it's quite likely @kotman12 has thought about it more carefully and more recently than I have.
> Maybe a pro/con 2x2 table?

> If I have time I can give this a try. This is the first time I am updating the ref guide so I probably should figure out how to build it locally to see how something like this renders.

The table idea is just one way to present pro/con material. The prose I was reading didn't seem to clearly group all the pros for one option where they could easily be compared to the pros for the other, etc. Bullet lists or careful use of paragraphs could also be effective.
As for building the guide, I think it's much easier these days than it used to be. It should be just a matter of running the right gradle task (see https://github.com/apache/solr/blob/main/solr/solr-ref-guide/README.adoc). At one time there was a lot of fiddly stuff with getting the right NPM deps installed, but I think that has gone away.
Oh crud it's late and I've forgotten to switch browser profiles... so I've got the wrong user name (again). fsparv <==> gus-asf
dsmiley left a comment
This is super close to merging... I'll handle it next week :-)
This is an exciting capability that really opens doors to some use-cases :-)
> Another advantage of `/export` is significantly lower latency until the first document is returned, because the internal batch size is decoupled from the response message size.
> With cursors, you typically need to set the `rows` parameter to a high value (e.g., 100,000) to achieve decent throughput.
> However, this creates a "glugging" effect: when you request a large batch, Solr must build the entire payload and send it over the wire while your client waits.
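The cursor protocol being contrasted here can be sketched as a toy simulation: each "page" re-runs the query (here, filtering a sorted list) and resumes after the last sort value seen, which is why per-page latency scales with `rows`. All names below are illustrative; this is not Solr's API.

```java
import java.util.List;

public class CursorToy {
    // Each "request" re-executes the query and keeps only docs past the
    // cursor, mimicking cursorMark's resume-after-last-sort-value semantics.
    static List<Integer> page(List<Integer> sortedDocs, int afterCursor, int rows) {
        return sortedDocs.stream()
                .filter(d -> d > afterCursor)
                .limit(rows)
                .toList();
    }

    public static void main(String[] args) {
        List<Integer> docs = List.of(1, 2, 3, 4, 5);
        int cursor = Integer.MIN_VALUE; // analogue of cursorMark=*
        List<Integer> batch;
        while (!(batch = page(docs, cursor, 2)).isEmpty()) {
            System.out.println(batch);            // [1, 2] then [3, 4] then [5]
            cursor = batch.get(batch.size() - 1); // analogue of nextCursorMark
        }
    }
}
```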
I affirm the glugging, but your rationale/guessing is certainly false. For SearchHandler, the payload is streamed/produced on the fly as it iterates documents. The code isn't there; it's elsewhere in a ResponseWriter, if I recall. Solr does have to do some up-front work -- producing a list of document IDs that match the search, sorted as desired. This is the "QTime". Retrieving data to return comes after; it's not accumulated in memory; it's streamed, and it lengthens the true elapsed time.
Wouldn't /export have similar up-front costs to execute the query?
Anyway, the broad strokes of your message look good.
> For SearchHandler, the payload is streamed/produced on the fly as it iterates documents. The code isn't there; it's elsewhere in a ResponseWriter, if I recall. Solr does have to do some up-front work -- producing a list of document IDs that match the search, sorted as desired.

If this is the case, then one should be able to get a very large number of rows in one shot with the SearchHandler. I assume this is a Solr doc-ID list? If so, it will be larger than a doc-ID bitset but can still be pretty small in a lot of cases, making the need for the export handler more exotic. I do wonder how this SearchHandler response writing/streaming works in the multi-sharded case? My recollection is that the full responses of each shard are merged together at once (not streamed), but I could be misunderstanding. I do know that streaming over the /export handler is pretty efficient even in the multi-shard case.
bq. My recollection is that the full responses of each shard are merged together at once (not streamed) but I could be misunderstanding.
Aaaah, ok, right, you have a good point here! I wasn't considering distributed search, only a single shard's contribution. Albeit, comparing /export with cursors apples to apples, it'd be single-shard to single-shard. But the big picture changes things, so you're right about that in practice. I suppose there's a useful need for a streaming expression aggregator that can cursorMark over the shards individually to avoid that cost. I looked for cursorMark usage in streaming expressions and I'm surprised to see none. There will be even less need for such a thing after this PR!
Anyway, I don't think the pros/cons in this document need to go into technical/internal depth; it's rare that the ref guide speaks of such geeky internal things (e.g. searchers).
> The cases where this functionality may be useful include: session analysis, distributed merge joins, time series roll-ups, aggregations on high cardinality fields, fully distributed field collapsing, and sort-based stats.
>
> == Comparison with Cursors
BTW I very much appreciate the extra effort here!
Unless you are in the mood, don't go off and do performance experiments just because we're asking questions. Say what you're comfortable claiming and not more, and that's fine :-)
I think we should at least cross-link pagination-of-results.adoc and exporting-result-sets.adoc because they are obviously related. Their embeddings ought to be similar ;-)
> With the `/export` handler, these steps are decoupled - Solr can continue sorting and decoding/encoding documents while waiting for more demand from the client.
>
> The advantage of cursors is flexibility.
> A cursor mark can be persisted and resumed later, even across restarts, whereas an `/export` stream is entirely in-memory and must be consumed in a single session.
```diff
- A cursor mark can be persisted and resumed later, even across restarts, whereas an `/export` stream is entirely in-memory and must be consumed in a single session.
+ A `cursorMark` can be persisted and resumed later, even across restarts, or never continued if enough results were consumed to satisfy the use-case.
+ An `/export` stream must be consumed in a single session.
```
I'm tempted to say that a stream should be completely consumed but maybe /export can handle a client that doesn't want more data, gracefully? Do you know?
You can close a TupleStream. IIRC this results in an exception on the server side indicating the client closed the stream.
I do feel it is implied that an export stream doesn't need to be fully consumed. Like what if the client crashes? It would be unreasonable to implement export in a way that can't handle a crashing client. I suppose one could mention close but not sure if this is the right place.
right, of course, I mean I wonder if there was testing/care to ensure Solr isn't unreasonably noisy in its logs if a client were to do this routinely.
For example, a year ago I was faced with a use-case that wanted to consume an unknown number of docs so it could post-filter those to ultimately come up with a target threshold of documents. I was thinking of cursorMark at the time; I wanted the score returned. If /export is a possibility, it could have a benefit of being able to stop exactly when I'm finished consuming docs. But if that would spew logs/errors then maybe that wouldn't be a great idea.
Three resolved comments on solr/solr-ref-guide/modules/query-guide/pages/exporting-result-sets.adoc (outdated)
Co-authored-by: David Smiley <dsmiley@apache.org>
https://issues.apache.org/jira/browse/SOLR-18071
Description
Adds support for exporting stored-only fields (fields without docValues) in the `/export` request handler via a new `includeStoredFields` parameter. Previously, all fields in the field list (`fl`) were required to have docValues enabled. This change allows users to include stored fields that don't have docValues, which can be useful, e.g., when exporting fields whose types don't support docValues, or when exporting data that has already been indexed without DVs.

Solution

If `fl` explicitly names a stored-only field and `includeStoredFields` is not enabled, the request fails with a 400 and a hint to add `includeStoredFields=true`. For glob patterns (e.g., `fl=*` or `fl=intdv,*`), stored-only fields are skipped unless `includeStoredFields=true`, to preserve backward compatibility. The current implementation fetches DV-enabled fields from `StoredFields` when some stored fields have already been requested. This avoids a DV lookup for those fields, which makes sense since we have to parse the `StoredFields` anyway. My (somewhat limited) benchmarks appear to corroborate that this is the best choice for performance.

A quirky thing about this implementation is that the very much internal `FieldWriter` API was changed to support more than one field. This makes it more interchangeable with the existing `StoredFieldVisitor` interface, which assumes one visitor per many fields. I landed on this rather than creating an adapter to bridge the two, as it appeared to be simpler. It's worth stressing again that the `FieldWriter` is very much internal to the export package, and the boolean it returned was effectively discarded (the local `fieldIndex` it drives isn't even used anywhere). It could be argued that `FieldWriter::write` could be `void`, and I'd also be open to such a change.

Tests
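A hypothetical sketch of the multi-field writer shape described above (interfaces and names are illustrative stand-ins, not the actual Solr internals; the checked IOException of the real API is omitted to keep the toy self-contained):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FieldWriterSketch {
    // A sink the writer emits into, loosely mirroring StoredFieldVisitor's
    // one-visitor-per-many-fields model.
    interface FieldSink {
        void addField(String name, Object value);
    }

    // Unlike a one-field-per-writer API, a single write() call may emit
    // several fields for the same document.
    interface MultiFieldWriter {
        void write(int docId, FieldSink out);
    }

    static Map<String, Object> demo() {
        Map<String, Object> seen = new LinkedHashMap<>();
        MultiFieldWriter w = (docId, out) -> {
            out.addField("id", docId);
            out.addField("title", "doc-" + docId);
        };
        w.write(7, seen::put); // one call, two fields emitted
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // {id=7, title=doc-7}
    }
}
```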
Also have some performance comparisons of exporting vs not exporting stored fields:
stored-fields-export-writer-1k-doc-benchmark.txt
stored-fields-export-writer-140k-doc-benchmark.txt
Checklist
Please review the following and check all that apply:
`main` branch.
`./gradlew check`.