Performance Optimizations
Tested using:
The baseline and these optimizations return the same DBSCAN results.
Memory leaks were not detected.
Results vs. baseline (dev branch)
Speed (ms/iter)
Speedup per step (vs. previous commit)
Cumulative speedup (vs. baseline)
Summary of changes
1. `39805c2`: Replace timing-based granularity with simple heuristic in parfor

   Removed `get_granularity()`, which ran a timed benchmark to decide chunk sizes at runtime. Replaced with a simple `n / (4 * num_threads)` heuristic plus a sequential fast-path for small ranges. Biggest win on the small 3K cloud (2.26x).

2. `8da5e33`: Fix idle CPU usage via progressive backoff and CV-based sleep

   Rewrote the work-stealing loop in `scheduler.h` with a 3-phase backoff: aggressive spinning, yielding, then sleeping on a condition variable. `spawn()` only signals the CV when threads are actually sleeping (zero-cost during active computation). Also changed `wait()` to opportunistically execute stolen work instead of re-entering `start()`, preventing potential deadlocks. Neutral on performance but eliminates CPU burn on idle threads.

3. `96c9cff`: Pre-compute cell keys for grid sort, avoiding floor()

   In `grid.h`, pre-computes integer cell coordinates (`floor((P[i][d] - pMin[d]) / r)`) into a flat array before sorting, so the sort comparator uses integer comparisons instead of repeated floating-point `floor()`.

4. `4a4025f`: Defer grid cell/mutex init to used cells only, reduce hash table size

   Moves `std::mutex` construction and `nbrCache`/`cells` initialization from all `cellCapacity` slots to only the `numCells` actually used after insertion. Also shrinks the initial hash table from `cellMax*2` to `max(2048, cellMax/4)` with a rebuild if the estimate is too small.

5. `607417e`: O(n) cluster renumbering via prefix sum

   Replaced an O(n log n) `sampleSort` + hash-table lookup for remapping cluster IDs with a simple O(n) approach: mark which IDs are used in a flag array, prefix-sum to get sequential IDs, then remap. Eliminates the `myPair` struct, `hashSimplePair`, and `Table` allocation entirely.

6. `847e798`: Eliminate sqrt from distance comparisons

   Added `nodeDistanceSqr()` to `kdNode.h` and switched all distance comparisons in `coreBccp.h`, `kdNode.h`, and `kdTree.h` to use the existing `distSqr()` and the new `nodeDistanceSqr()` instead of `dist()`/`nodeDistance()`. The BCCP threshold `r <= epsilon` becomes `r <= epsilon * epsilon`. Uses ternary operators instead of `std::max` for branchless codegen hints. 1.34x on 3K.

7. `909a621`: Dynamic thread count via getWorkers()

   Replaced hardcoded `static const intT P = 36*8` with `getWorkers() * 8` in `boundingBoxParallel()` and `pMinParallel()`. Also switched VLA-style stack arrays to heap-allocated (`newA` + `free`) since the count is now dynamic. Portable across machines with different core counts.
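For readers who want to see the shape of the chunking heuristic from `39805c2`, here is a minimal sketch. It is not the code from the commit: `plan_chunks` and `kSequentialCutoff` are hypothetical names, and the cutoff value of 2048 is an assumption, since the description only says "small ranges". The only part taken from the description is the `n / (4 * num_threads)` grain size and the existence of a sequential fast-path.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Assumed threshold below which the loop is not split at all.
// The commit's real cutoff is not stated in the description.
constexpr std::size_t kSequentialCutoff = 2048;

// Hypothetical helper: returns the half-open [lo, hi) ranges that a
// parallel-for over n iterations would hand out as tasks.
std::vector<std::pair<std::size_t, std::size_t>>
plan_chunks(std::size_t n, std::size_t num_threads) {
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    if (n == 0) return chunks;

    // Sequential fast-path: one chunk, no scheduling overhead.
    if (n <= kSequentialCutoff || num_threads <= 1) {
        chunks.emplace_back(0, n);
        return chunks;
    }

    // Roughly four chunks per thread, never smaller than one iteration.
    const std::size_t grain =
        std::max<std::size_t>(1, n / (4 * num_threads));
    for (std::size_t lo = 0; lo < n; lo += grain) {
        chunks.emplace_back(lo, std::min(n, lo + grain));
    }
    return chunks;
}
```

Compared with a timed benchmark, a fixed ratio costs one division per loop, which is likely why the gain is largest on small inputs where the benchmark itself dominated.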
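The key pre-computation in `96c9cff` can be illustrated as follows. This is a sketch under stated assumptions rather than the code in `grid.h`: points are stored here as a flat row-major `std::vector<double>`, and `compute_cell_keys` and `sort_by_cell` are hypothetical names. Only the formula `floor((P[i][d] - pMin[d]) / r)` and the idea of comparing integers in the sort come from the description.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Compute integer cell coordinates once per point and dimension.
// pts is row-major: point i, dimension d lives at pts[i * dim + d].
std::vector<int> compute_cell_keys(const std::vector<double>& pts,
                                   std::size_t dim,
                                   const std::vector<double>& pMin,
                                   double r) {
    std::vector<int> keys(pts.size());
    const std::size_t n = pts.size() / dim;
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t d = 0; d < dim; ++d) {
            keys[i * dim + d] = static_cast<int>(
                std::floor((pts[i * dim + d] - pMin[d]) / r));
        }
    }
    return keys;
}

// Sort point indices by cell, comparing only the cached integers.
// No floor() call happens inside the comparator.
std::vector<std::size_t> sort_by_cell(const std::vector<int>& keys,
                                      std::size_t dim) {
    const std::size_t n = keys.size() / dim;
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
        [&](std::size_t a, std::size_t b) {
            return std::lexicographical_compare(
                keys.begin() + a * dim, keys.begin() + (a + 1) * dim,
                keys.begin() + b * dim, keys.begin() + (b + 1) * dim);
        });
    return order;
}
```

A comparison sort invokes its comparator O(n log n) times, so moving `floor()` out of it reduces the floating-point work to one pass over the n points.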
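The renumbering scheme described for `607417e` (flag, prefix-sum, remap) can be sketched like this. The function name `renumber_clusters` is hypothetical, the loop is sequential for clarity (the commit presumably uses the library's parallel primitives), and two assumptions are made that the description does not state: raw cluster IDs lie in `[0, n)`, and noise points carry the label `-1`.

```cpp
#include <cstddef>
#include <vector>

// Remap arbitrary cluster IDs in [0, n) to dense IDs 0..k-1 in O(n).
// Negative labels (assumed to mean "noise") are left untouched.
// Returns k, the number of distinct clusters.
std::size_t renumber_clusters(std::vector<int>& labels) {
    const std::size_t n = labels.size();

    // Step 1: mark which IDs actually occur.
    std::vector<int> flag(n, 0);
    for (int id : labels) {
        if (id >= 0) flag[static_cast<std::size_t>(id)] = 1;
    }

    // Step 2: exclusive prefix sum turns flags into new sequential IDs.
    int next = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const int used = flag[i];
        flag[i] = next;
        next += used;
    }

    // Step 3: rewrite every label through the lookup table.
    for (int& id : labels) {
        if (id >= 0) id = flag[static_cast<std::size_t>(id)];
    }
    return static_cast<std::size_t>(next);
}
```

Because the prefix sum preserves order, the new IDs are assigned in increasing order of the old ones, which keeps results deterministic without a sort or a hash table.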
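The idea behind `847e798` is that `sqrt` is monotonic on non-negative values, so `dist <= epsilon` and `distSqr <= epsilon * epsilon` agree whenever `epsilon >= 0`. The sketch below illustrates that on simple types. It does not reproduce `distSqr()` or `nodeDistanceSqr()` from the repository: `Point`, `dist_sqr`, `box_distance_sqr`, and `within_epsilon` are hypothetical stand-ins, and the box function assumes axis-aligned bounding boxes given by min and max corners.

```cpp
#include <array>
#include <cstddef>

template <std::size_t D>
using Point = std::array<double, D>;

// Squared Euclidean distance between two points; no sqrt needed.
template <std::size_t D>
double dist_sqr(const Point<D>& a, const Point<D>& b) {
    double s = 0.0;
    for (std::size_t d = 0; d < D; ++d) {
        const double diff = a[d] - b[d];
        s += diff * diff;
    }
    return s;
}

// Squared minimum distance between two axis-aligned boxes.
// Ternaries stand in for std::max, mirroring the style the PR mentions.
template <std::size_t D>
double box_distance_sqr(const Point<D>& minA, const Point<D>& maxA,
                        const Point<D>& minB, const Point<D>& maxB) {
    double s = 0.0;
    for (std::size_t d = 0; d < D; ++d) {
        const double gapAB = minA[d] - maxB[d];  // A lies above B
        const double gapBA = minB[d] - maxA[d];  // B lies above A
        double gap = gapAB > gapBA ? gapAB : gapBA;
        gap = gap > 0.0 ? gap : 0.0;             // overlap means gap 0
        s += gap * gap;
    }
    return s;
}

// Threshold test on squared values; assumes epsilon >= 0.
template <std::size_t D>
bool within_epsilon(const Point<D>& a, const Point<D>& b, double epsilon) {
    return dist_sqr(a, b) <= epsilon * epsilon;
}
```

Any code that needs the actual distance value, rather than a comparison, still has to take the square root once at the end.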