
feat: Gc benchmarking #421

Draft
stanbrub wants to merge 9 commits into deephaven:main from stanbrub:gc-benchmarking

Conversation

@stanbrub (Collaborator)

  • Added "training tests": representative benchmarks for comparing JDK versions, GC types, Python versions, etc. They are meant to provide as much coverage as possible with the fewest tests.
  • Added a LocalParquetGenerator to generate very large parquet files directly into the DHC data directory. The typical standard benchmarks generate data through DHC, which works well for small to mid-sized data sets but not at the scale these tests need (see the sketch after this list).
  • Tests added: AggBy, Filter, Join, UpdateBy, Formula
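A minimal sketch of the pre-generation idea, assuming the standard Deephaven Python APIs; the path, row count, and column names are hypothetical, and the actual LocalParquetGenerator in this PR may work differently:

    # Sketch only: pre-generate a large parquet file straight into the data
    # directory so benchmarks can read it without generating data through DHC.
    # A real generator would likely stream or partition rather than build the
    # whole table in memory as update() does here.
    from deephaven import empty_table
    from deephaven.parquet import write

    rows = 10_000_000  # size chosen arbitrarily for the sketch
    t = empty_table(rows).update([
        'timestamp = now().plusMillis(ii)',    # monotonically increasing Instants
        'key1 = Long.toString(ii % 10)',       # string keys, per the backtick filters below
        'key2 = (int)(ii % 1000)',
    ])
    write(t, '/data/timed.parquet')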

@stanbrub self-assigned this Mar 26, 2026

@cpwright left a comment


All the benchmarks are going to do work. I have some concerns that we might do too much work related to the timestamp calculation and to traversing a UnionSourceManager (merge) where it is not necessary.

merge([
    read('/data/timed.parquet').view(formulas=[${loadColumns}])${headRows}
] * ${scaleFactor}).update_view([
    'timestamp=timestamp.plusMillis((long)(ii / ${rows}) * ${rows})'
])

Is there a reason we can't use the timestamp from the file? I have a few worries about doing rowset calculation as part of the benchmark (to come up with ii).
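A sketch of what using the file's own timestamp might look like, assuming the Deephaven Python parquet API and the column names that appear elsewhere in this thread:

    # Sketch only: rely on the timestamp column persisted in the parquet file
    # instead of deriving one from ii inside the measured query.
    from deephaven.parquet import read

    timed = read('/data/timed.parquet').view(['timestamp', 'key1', 'key2'])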

For the actual test benchmarks, without a select we would also just prefer more/bigger parquet files to avoid the overhead of going through the merge data structures. We might even be able to get away with symlinks to have the data just repeat itself.
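One possible shape of the symlink trick, as a sketch; the paths are hypothetical, and it assumes the benchmark then reads the whole directory as a single table:

    # Sketch only: repeat the same data N times on disk without copying it,
    # by symlinking one parquet file under several names in a directory.
    import os

    src = '/data/timed.parquet'
    os.makedirs('/data/timed_repeated', exist_ok=True)
    for i in range(4):
        os.symlink(src, f'/data/timed_repeated/part{i}.parquet')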

@stanbrub (Collaborator, Author)


For the "train" benchmarks, since we don't use Scale Factors, that section of code will not be hit. This is only used when we are doing merges to simulate larger data sets. So for the nightly runs, this will happen BEFORE the "select" into memory, which is not included in the measurement. But for the "train" benchmarks, we only read timestamps directly from the parquet file(s), and that only if they are used in the benchmark (like for rollingtime).

@Test
void filter1Col() {
    setup(40);
    var q = "timed.where_in(where_filter, cols=['key1 = set1']).where(['key1 < `4`'])";

I am a bit torn: for a "real" query we should be applying the key1 < 4 filter inside of the where_in set table, because that is semantically equivalent and faster. Maybe we do want to bounce through the entire parquet file anyway, though.
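A sketch of the semantically equivalent form being described, reusing the names from the filter1Col snippet above; it assumes the set table's set1 column holds the same string keys that key1 is matched against:

    # Sketch only: filter the set table first, so where_in only keeps rows
    # whose key1 can still pass the range filter; the second where() on the
    # big table is then unnecessary.
    filtered_set = where_filter.where(['set1 < `4`'])
    q = timed.where_in(filtered_set, cols=['key1 = set1'])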

@stanbrub (Collaborator, Author)

I've struggled with this one. If we are testing GC, it doesn't make sense to do separate operations that produce tables that just get GC'd while we are measuring. We are trying to understand the operations. Would it make sense to do multiple "where" operations from the same source, like we do in DHE combo benchmarks, that match very little so we don't blow up memory? Or is it better to match as much as we can with the first "where" without blowing up memory and then run the second one on that?


Generating the intermediate rowsets is nice for creating garbage; I just get bothered by code that uses the system "wrong". If we were to filter on another column after the where_in, like a range filter on key2, I would not be bothered. The rowsets can actually be very interesting in terms of garbage when they have a large number of included rows which are non-contiguous.
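A sketch of that variant; the bounds are hypothetical, since key2's type and range are not shown in this thread:

    # Sketch only: keep the where_in as-is, but make the follow-up filter a
    # range filter on a different column (key2), so the second pass does real
    # work instead of re-checking the where_in condition.
    q = timed.where_in(where_filter, cols=['key1 = set1']).where(['key2 >= 100', 'key2 < 200'])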
