Speed-up reduce view merging in fabric#6046
Open
nickva wants to merge 1 commit into
Open
Conversation
Previously, `find_next_key/5` reduce collector [1] effectively ran `hd(lists:sort(...))` to get the next key. It was spending time sorting, then throwing everything away except the head. For the next row, it was sorting again, throwing everything away, and taking only the head, and so on... Needless to say, that's wasteful and we should simply find the minimum and return that. As an another optimization replace the dict with a map. Maps are more efficient and generally faster. Also, add extra unit tests asserting ICU collation behavior actually takes place. In a few cases explicitly assert that the new behavior is the same as the previous `hd(lists:sort(...))` behavior. To benchmark used a simple reduce view with a built-in _count reducer [2]. For 100k docs, Q=8 this optimization leads to an ~2x speedup in view querying. Theoretically it would be an O(n log n) -> O(n) speedup. PR ``` ./reduce_group_bench.py --ndocs 100000 --q 8 --reps 7 db=reduce_group_bench q=8 n=1 ndocs=100000 group=true loading... loaded 100000 docs in 4.4s building index... built in 4.5s timing group=true query x7... rows=100000 min=5644.0ms median=5786.6ms max=6108.0ms ``` MAIN ``` ./reduce_group_bench.py --ndocs 100000 --q 8 --reps 7 db=reduce_group_bench q=8 n=1 ndocs=100000 group=true loading... loaded 100000 docs in 4.5s building index... built in 4.3s timing group=true query x7... rows=100000 min=10910.8ms median=11212.5ms max=11627.4ms ``` [1] Reduce collector buffers rows from workers and keys them by the view key. For each row it emits it 1) finds the next key 2) takes all the rows that collate equal to it. This optimization will only improve reduce views [2] https://gist.github.com/nickva/cc0a3fb296be965e014eec8c484aa8a5
96ef801 to
c401c5d
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Previously,
find_next_key/5reduce collector [1] effectively ranhd(lists:sort(...))to get the next key. It was spending time sorting, then throwing everything away except the head. For the next row, it was sorting again, throwing everything away, and taking only the head, and so on... Needless to say, that's wasteful and we should simply find the minimum and return that.As an another optimization replace the dict with a map. Maps are more efficient and generally faster.
Also, add extra unit tests asserting ICU collation behavior actually takes place. In a few cases explicitly assert that the new behavior is the same as the previous
hd(lists:sort(...))behavior.To benchmark used a simple reduce view with a built-in _count reducer [2]. For 100k docs, Q=8 this optimization leads to an ~2x speedup in view querying. Theoretically it would be an O(n log n) -> O(n) speedup.
PR
MAIN
[1] Reduce collector buffers rows from workers and keys them by the view key. For each row it emits it 1) finds the next key 2) takes all the rows that collate equal to it. This optimization will only improve reduce views
[2] https://gist.github.com/nickva/cc0a3fb296be965e014eec8c484aa8a5