perf: improve parallelism of SortBotMerge#831
Open
jodavies wants to merge 1 commit into
Open
Conversation
Currently, the sortbot stage of tform sorting does not achieve good parallelism. Primarily, this is because the "sortblock" size has grown with default SmallSize and LargeSize adjustments, such that in many cases the sortbot levels do not run in parallel at all, because each thread's output fits in a single block. This commit adjusts the logic for filling and unlocking the blocks such that sortbot threads can start work as soon as possible: - Only put complete terms into the blocks, no splitting over the blocks. - Track the number of terms in each block, and use this when reading the data to determine when a block is complete, rather than waiting for a term to overlap the "stop" pointer. - When filling blocks (in PutToMaster), if a certain (small) minimum number of terms has been written, probe if a reading thread is waiting on the current block by attempting to lock the previous block. If a thread is waiting (so the lock fails), unlock the current block immediately and write the term into the next. Also reduce the MINIMUMNUMBEROFTERMS parameter from 10 to 1, so that the small+large buffer does not scale (so much) with large MaxTermSize and many threads, and similarly reduce NUMBEROFBLOCKSINSORT to its minimal value of 8.
Collaborator
Author
|
I have removed the second commit, which get rid of "block 0": there is another circumstance in which it is required, separate from copying a partial term from the last block, such that it is contiguous with the tail in block 1, which is when adding terms with polyratfun. When terms merge, if the result is larger, it is copied into the space before "term1" in the merge: this requires block 0 if term1 is the first term of block 1. I added commentary to the code to highlight this. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This change comes from discussion and benchmarking efforts in collaboration with @AnaIPereira. See the commit message for a full description.
On my 12-core 7900X, the results are as follows. @AnaIPereira is finishing up some testing on higher core-count sytems, to make sure we have a good value for
MINWRITENUMBEROFTERMSand that the adjusted, lower value ofMINIMUMNUMBEROFTERMSdoes not affect performance.Speedup w.r.t. master:
chromaticcolorfmftforcer-expforcerhyperformmass-factmbox1lminceexmincermzv-dmsort-disksort-largesort-smalltrace