PARQUET-2249: Introduce IEEE 754 total order & NaN-counts#514
Conversation
| * | ||
| * When writing statistics the following rules should be followed: | ||
| * - NaNs should not be written to min or max statistics fields. | ||
| * - It is suggested to always set the nan_count field for floating |
There was a problem hiding this comment.
The heading says "the following rules should be followed" but the first one says "it is suggested" which seems to imply it is optional. I recommend removing the "suggested"
There was a problem hiding this comment.
Even 'should' is still optional per RFC2119. If this isn't optional, 'must' or 'shall' is the correct term. But this is a problem in more places in this document, so I take it you're not following that RFC.
There was a problem hiding this comment.
I think the requirement should be to set nan_count, so I agree that this should not be a suggestion.
There was a problem hiding this comment.
This whole document uses "should", where "shall" or "must" would actually be more appropriate. Therefore, I was going with the general speak of this document for consistency. I totally agree though; it always bothered me a lot that the spec always uses "should", making it sounds as if all of this is just a mere suggestion and implementors can do what they want.
That being said, if we want to change this, I would rather do it in one PR that updates the whole document to stay consistent.
There was a problem hiding this comment.
@JFinis, I don't think we are talking about the same things. I was saying that I think nan_count should be "required" not "suggested". Maybe this doesn't matter anymore?
| * anything else. It also defines an order between different bit representations | ||
| * of the same value. | ||
| * | ||
| * The formal definition is as follows: |
There was a problem hiding this comment.
Do we need to copy this here? Having a version here that duplicates the IEEE-754 version makes me worry that they will get out of sync or have errors due to copying. We should make it clear that the IEEE-754 version is correct if the two ever differ.
There was a problem hiding this comment.
I don't fully understand what lines your comment is referring to. What is the part you see duplicated?
There was a problem hiding this comment.
I think the assumption is that this text duplicates exactly what is in the IEEE specification.
There was a problem hiding this comment.
We should either rely on an external total order specification, or we should include a total order specification here, but we should not do both. Since this is defined in Section 5.10 of IEE-754 (2008), I would remove this.
There was a problem hiding this comment.
I don't agree with this TBH. How should the two versions get out of sync? I mentioned a precise revision for the IEEE spec, so that will never change. This code is also not supposed to change. So I don't agree to your rationale why this should be removed. Can you elaborate on how you think this would diverge?
Having the pseudocode here will be of tremendous help to new implementers. It's not always easy to get IEEE standards and just adds friction.
Having said that, this isn't too important to me, so sure, we can also throw it out. But my honest opinion is that this document would be more useful with having it in.
|
@tustvold @crepererum @westonpace I would especially like to get your opions on this, as you were proponents of total order for simplicity. Does this PR still cover what you wanted to get out of this? @mapleFU @etseidl @gszadovszky @pitrou FYI, as you were part of the initial discussion on the previous PRs. |
tustvold
left a comment
There was a problem hiding this comment.
Am I understanding correctly that the change to add nan_counts means that NaNs are now excluded from the min/max statistics? I believe this creates a new problem for engines that use total ordering as the sign of the NaNs is not known, and therefore there are scenarios where predicates can't be pushed down where they could be in the absence of nan_counts.
My 2 cents is that there probably is no perfect solution here, adding nan_counts is worse for some engines and better for some others, and as such my personal preference would still be to go with the simpler solution, that has existing implementations and can ship today.
The argument for nan_counts appears to be that the presence of a NaN has the effect of poisoning the statistics. I don't really follow why NaNs are special in this regard, any extreme value has the same effect, if people want to effectively push down such predicates they should be sorting so as to cluster data points of similar magnitude. It has been said that people coming from numpy use NaN instead of Null, but as was said on the mailing list, I don't think the onus should be on parquet to handle this, rather people should convert NaNs to Nulls if that is the semantic they want.
That being said if people are motivated to push ahead with this hybrid solution, I don't really feel strongly. Ultimately the issue is fixing the fact predicates can't be pushed down on floats in general, how that applies to NaNs IMO is kind of irrelevant to the vast majority of workloads. This has dragged on for 2 years, we'd almost reached consensus and I can't help feeling we've been startled by one dissenting opinion...
No, the predicates still can be pushed down, regardless of the semantics of your engine. If your engine uses total ordering with predicate
It's not worse for any engine as far as I can see, a strict improvement for everyone in filtering capabilities.
NaNs aren't extreme values for engines without total ordering, they are simply unordered. |
|
I seem to remember from a previous discussion around the need to have signed NaNs to allow for predicate pushdown, but I can't remember the example. Regardless my broader point still stands that we are adding a non-trivial amount of complexity, there is a lot of subtlety both at read and write time, to yield an improvement that to be brutally honest I am not really sure will benefit all that many real workloads. It's all tradeoffs, if people want to and are motivated to proceed with the hybrid approach I don't feel strongly, but I personally prefer the simpler option that is more likely to be implemented correctly across the many many parquet implementations. The proposal here is strictly more complex than the existing specification which implementations still implement incorrectly. |
Signed NaN counts are only useful to avoid treating a But only having a single NaN count doesn't block predicate pushdown whatsoever, and since you're fine with NaN poisoning anyway I don't think you'd care about the above |
Gut feeling wise, I agree with your assessment. I feel like I personally can implement this correctly in our engine, but I have now spent years of my life thinking about this, so I guess I'm not representative for the average Parquet maintainer. I also feel that this has such a high chance of being implemented incorrectly that the more straightforward solution of just IEEE total ordering would be preferrable. My gut feels like the risk of added implementation complexity and thus chance for wrong implementations is higher than the advantage of not having "NaN poisoning" of statistics. |
|
I have filed a ticket to track trying to implement this in the arrow-rs implementation of Parquet, which will hopefully help us quantify how complicated this proposal is to implement: |
I think we have waited enough time for feedback. Do you want to start a vote with two proposals on the dev ML, @JFinis? |
c2116b0 to
62f4dc8
Compare
|
@rdblue @gszadovszky @alamb @pitrou @emkornfield Could you take a look at the updated spec? I've just completed the Java POC: #514 |
emkornfield
left a comment
There was a problem hiding this comment.
Generally looks good to me.
I think the main question is whether we should be prescriptive on not writing NaN's in optional statistics with the new sort order.
This commit is a combination of the following PRs: * Introduce IEEE 754 total order apache#221 * Add nan_count to handle NaNs in statistics apache#196 Both these PRs try to solve the same problems; read the description of the respective PRs for explanation. This PR is the result of an extended discussion in which it was repeatedly brought up that another possible solution to the problem could be the combination of the two approaches. Please refer to this discussion on the mailing list and in the two PRs for details. the mailing list discussion can be found here: https://lists.apache.org/thread/lzh0dvrvnsy8kvflvl61nfbn6f9js81s The contents of this PR are basically a straightforward combination of the two approaches: * IEEE total order is introduced as a new order for floating point types * nan_count and nan_counts fields are added Legacy writers may not write nan_count(s) fields, so readers have to handle them being absent. Also, legacy writers may have included NaNs into min/max bounds, so readers also have to handle that. As there are no legacy writers writing IEEE total order, nan_count(s) are defined to be mandatory if this order is used, so readers can assume their presense when this order is used. This commit removes `nan_pages` from the ColumnIndex, which the nan_counts PR mandated. We don't need them anymore, as IEEE total order solves this: As this is a new order and there are thus no legacy writers and readers, we have the freedom to define that for only-NaN pages using this order, we can actually write NaNs into min & max bounds in this case, and readers can assume that isNaN(min) signals an only-NaN page.
62f4dc8 to
a1b420a
Compare
|
Just added a Rust PoC apache/arrow-rs#9619. @wgtmac how do you propose we test interoperability? |
|
@etseidl Perhaps we can add a Parquet file to the parquet-testing repo with different floating types, column orders and NaN distributions? I can do that next week but it doesn't hurt if you do the same thing using your rust impl. UPDATE: I've added a file generator for the interop test and also added a test case to verify its data and stats are correct: apache/parquet-java@68ed638. |
|
Thanks @wgtmac. I'll add a similar test to my PR. |
|
I've now verified the file @wgtmac produced, and verified that parquet-rs will produce the same statistics from the same input. I'd like to do some benchmarking early next week to see if there is any serious performance hit from this change. I hope to update the PoC for #221 to see if one approach is significantly faster than the other. Once that's done I think we'll be ready to finally put this issue to rest. |
|
Following up, I've modernized the #221 PoC now as well. I need to do some more testing, but early results indicate somewhere around a 5-10% speed up for floating point columns for that version, but a 3-7% slowdown for the formulation here. I have to say I still prefer the version without NaN counts. It's conceptually simpler, requires no special handling, and is indeed a good deal easier to implement. But, if we go with the NaN counts, I'm sure I can get the performance up eventually. @wgtmac do you have any thoughts either way given your experience with Java? Are we ready for a vote on this? |
|
@etseidl I'm expecting a review from @rdblue @julienledem before voting. I think previous discussion has converged to the approach with nan_count so it may take longer time if we want to remove nan_count. |
| * - NaNs should not be written to min or max statistics fields except | ||
| * in the column index when a page contains only NaN values. In this | ||
| * case, since min_values and max_values are required, a NaN value | ||
| * must be written. |
There was a problem hiding this comment.
I think that this needs to be required if nan_count is present to signal that min/max values are reliable.
Also, I'd prefer to have two cases. One for page and column min/max values and another for the values in column indexes since they have different rules.
There was a problem hiding this comment.
I've rephrased this section a little bit. Let me know what you think. Thanks for your review!
|
FWIW there is a vote started on the mailing list to adopt this feature: https://lists.apache.org/thread/h0k0hqo0sojqphnbnrkp8b0gmwdzq9on |
This commit is a combination of the following PRs:
Both these PRs try to solve the same problems; read the description of the respective PRs for explanation.
This PR is the result of an extended discussion in which it was repeatedly brought up that another possible solution to the problem could be the combination of the two approaches. Please refer to this discussion on the mailing list and in the two PRs for details. the mailing list discussion can be found here:
https://lists.apache.org/thread/lzh0dvrvnsy8kvflvl61nfbn6f9js81s
The contents of this PR are basically a straightforward combination of the two approaches:
Legacy writers may not write nan_count(s) fields,
so readers have to handle them being absent. Also, legacy writers may have included NaNs into min/max bounds, so readers also have to handle that.
As there are no legacy writers writing IEEE total order, nan_count(s) are defined to be mandatory if this order is used, so readers can assume their presense when this order is used.
This commit removes
nan_pagesfrom the ColumnIndex, which the nan_counts PR mandated. We don't need them anymore, as IEEE total order solves this: As this is a new order and there are thus no legacy writers and readers, we have the freedom to define that for only-NaN pages using this order, we can actually write NaNs into min & max bounds in this case, and readers can assume that isNaN(min) signals an only-NaN page.Note: If this PR is chosen to be adopted, I will update the commit message to state the problem without just pointing onto the other PRs. But until then, I rather link to them instead of repeating myself, in case this PR isn't chosen anyway.
Rationale for this change
What changes are included in this PR?
Do these changes have PoC implementations?