Conversation
stanbrub
commented
Mar 26, 2026
- Added "training tests": representative benchmarks for comparing JDK versions, GC types, Python versions, etc. They are meant to provide as much coverage as possible with the fewest tests.
- Added a LocalParquetGenerator to generate very large parquet files into the DHC data directory. The typical standard benchmarks generate data through DHC, which works well for small to mid-sized data sets.
- Tests added: AggBy, Filter, Join, UpdateBy, Formula.
cpwright
left a comment
All the benchmarks are going to do work. My concern is that we may do too much of it in the timestamp calculation and in traversing a UnionSourceManager (merge) where it isn't necessary.
```
merge([
    read('/data/timed.parquet').view(formulas=[${loadColumns}])${headRows}
] * ${scaleFactor}).update_view([
    'timestamp=timestamp.plusMillis((long)(ii / ${rows}) * ${rows})'
```
Is there a reason we can't use the timestamp from the file? I have a few worries about doing rowset calculation as part of the benchmark (to come up with ii).
For the actual test benchmarks, without a select we would prefer more/bigger parquet files, to avoid the overhead of going through the merge data structures. We might even be able to get away with symlinks so that the data just repeats itself.
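The symlink idea above might look something like this sketch (all paths are hypothetical, and a stand-in file replaces the real parquet data):

```shell
# Sketch: repeat one parquet file via symlinks so a directory-based read
# sees N "copies" without duplicating the data on disk.
WORK=$(mktemp -d)
SRC="$WORK/timed.parquet"
DEST="$WORK/timed_repeated"
touch "$SRC"              # stand-in for the real benchmark parquet file
mkdir -p "$DEST"
for i in 1 2 3 4; do
  ln -s "$SRC" "$DEST/part_$i.parquet"
done
ls "$DEST"                # four parquet "parts", one underlying file
```

A reader pointed at `$DEST` would then see the rows repeated four times while the merge data structures are bypassed entirely.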
For the "train" benchmarks, since we don't use Scale Factors, that section of code will not be hit. It is only used when we merge to simulate larger data sets. So for the nightly runs, this happens BEFORE the "select" into memory, which is not included in the measurement. For the "train" benchmarks, we only read timestamps directly from the parquet file(s), and only when the benchmark actually uses them (like rollingtime).
```
@Test
void filter1Col() {
    setup(40);
    var q = "timed.where_in(where_filter, cols=['key1 = set1']).where(['key1 < `4`'])";
```
I am a bit torn: for a "real" query we should be applying the key1 < `4` filter inside the where_in set table, because that is semantically equivalent and faster. Maybe we do want to bounce through the entire parquet file anyway, though.
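The equivalence described here can be sketched outside Deephaven with plain Python (the column name `key1`, the set contents, and the row data are invented for illustration):

```python
# Simulate where_in followed by a filter, vs. pre-filtering the set table.
rows = [{"key1": k} for k in ["1", "3", "5", "7"] * 3]
set1 = {"1", "5", "9"}  # hypothetical where_in set table

# As written in the test: where_in on set1, then key1 < '4' on the result.
a = [r for r in rows if r["key1"] in set1]
a = [r for r in a if r["key1"] < "4"]

# Alternative: shrink the set table first, then do where_in once.
small_set = {k for k in set1 if k < "4"}
b = [r for r in rows if r["key1"] in small_set]

assert a == b  # semantically equivalent; the second matches fewer rows
```

The pre-filtered version avoids materializing the intermediate rows that the later filter discards, which is the performance argument above.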
I've struggled with this one. If we are testing GC, it doesn't make sense to do separate operations that produce tables which just get GC'd while we are measuring; we are trying to understand the operations. Would it make sense to do multiple "where" operations from the same source, like we do in the DHE combo benchmarks, that match very little so we don't blow up memory? Or is it better to match as much as we can with the first "where" without blowing up memory, and then run the second one on that result?
Generating the intermediate rowsets is nice for creating garbage. I just get bothered by code that uses the system "wrong". If we were to filter on another column after the where_in, I would not be bothered; a range filter on key2, for example. The rowsets can actually be very interesting in terms of garbage when they have a large number of included rows which are non-contiguous.
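The shape of filter preferred here, and why it produces non-contiguous surviving rows, can be illustrated with plain Python (column names follow the snippet; the key2 range bounds and data are invented):

```python
# Simulate where_in on key1 followed by a range filter on key2.
# The surviving row indices are non-contiguous, which is what makes the
# resulting rowsets interesting garbage-wise.
rows = [{"key1": str(i % 8), "key2": i % 100} for i in range(1_000)]
set1 = {"1", "3", "5"}  # hypothetical set table

pass1 = [i for i, r in enumerate(rows) if r["key1"] in set1]
pass2 = [i for i in pass1 if 20 <= rows[i]["key2"] < 60]  # invented range

# Gaps between kept indices mean the rowset cannot be a single range.
gaps = any(b - a > 1 for a, b in zip(pass2, pass2[1:]))
print(len(pass2), gaps)
```

Many included rows separated by small gaps is exactly the case where rowset bookkeeping, rather than a single contiguous range, does real work.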