[PoC]: Yet another implementation of PARQUET-2249: Introduce IEEE 754 total order#9619
[PoC]: Yet another implementation of PARQUET-2249: Introduce IEEE 754 total order#9619etseidl wants to merge 29 commits into
Conversation
| // For floating point we need to compare NaN values until we encounter a non-NaN | ||
| // value which then becomes the new min/max. After this, only non-NaN values are | ||
| // evaluated. If all values are NaN, then the min/max NaNs as determined by | ||
| // IEEE 754 total order are returned. |
There was a problem hiding this comment.
This has me a bit worried. I need to do some benchmarking to make sure all the complicated NaN logic isn't killing performance.
There was a problem hiding this comment.
Agree, this method is on the hot path. I had a look at optimizing it, but could not get the compiler to generate nice auto-vectorized code for nan-handling yet. I think we can try optimizations in a followup, it would be more important to get the semantics correct first and make sure there are tests for edge cases.
If all values are NaN, then the min/max NaNs as determined by
// IEEE 754 total order are returned.
Does the current code correctly distinguish different NaN payloads according to their sign and bit patterns?
(Solved, github was hiding the changes to compare_greater in mod.rs)
|
Tests will fail until apache/parquet-testing#104 is merged. |
|
run benchmark arrow_writer |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing total_order_514 (5de1817) to 65ad652 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
run benchmark arrow_writer env:
BENCH_FILTER: float |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing total_order_514 (e09cce0) to f790721 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
run benchmark arrow_writer env:
BENCH_FILTER: list_prim |
|
run benchmark arrow_writer env:
BENCH_FILTER: dictionary |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing total_order_514 (30f50ca) to f790721 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing total_order_514 (30f50ca) to f790721 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
run benchmark arrow_writer env:
BENCH_FILTER: float |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing total_order_514 (396edf0) to f790721 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
run benchmark arrow_writer env: |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing total_order_514 (13dc03b) to f790721 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
run benchmark arrow_writer env: |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing total_order_514 (13dc03b) to f790721 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
run benchmark arrow_writer |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing total_order_514 (d0b93d4) to edfb9ab (merge-base) diff File an issue against this benchmark runner |
| } | ||
|
|
||
| impl<T: DataType> ColumnValueEncoderImpl<T> { | ||
| fn min_max(&self, values: &[T::T], value_indices: Option<&[usize]>) -> Option<(T::T, T::T)> { |
There was a problem hiding this comment.
This was only called with None for value_indices, so I simplified it.
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
Which issue does this PR close?
Rationale for this change
This takes the implementation done by @Xuanwo (#8158) and updates it to the new thrift format and recent changes to the original proposal (apache/parquet-format#514).
What changes are included in this PR?
Adds needed thrift structures as well as NaN counts for pages and column chunks.
Are these changes tested?
Yes, new tests added (more may be needed).
Are there any user-facing changes?
Yes, this is a breaking change.