
Performance Optimizations #28

Open
John-194 wants to merge 7 commits into wangyiqiu:master from John-194:optimizations


John-194 commented Apr 4, 2026

Performance Optimizations

  • Improves performance by 1.7x to 2.9x, depending on the dataset (cumulative speedup vs. baseline)
  • Fixes high idle CPU usage

Tested using:

  • 3K 3D — 3k-point 3D cloud
  • 13K 2D — 13k-point 2D cloud
  • 100K 2D — 100k-point 2D cloud

The baseline and the optimized build return the same DBSCAN results.
No memory leaks were detected.

Results vs. baseline (dev branch)

Speed (ms/iter)

| Commit  | Description              | 3K 3D | 13K 2D | 100K 2D |
|---------|--------------------------|-------|--------|---------|
| 5c7fdd5 | baseline                 | 0.86  | 1.29   | 9.38    |
| 39805c2 | parfor heuristic         | 0.38  | 0.76   | 8.86    |
| 8da5e33 | 260% idle CPU usage fix  | 0.40  | 0.79   | 8.88    |
| 96c9cff | pre-compute cell keys    | 0.40  | 0.79   | 7.71    |
| 4a4025f | defer cell/mutex init    | 0.41  | 0.79   | 6.27    |
| 607417e | O(n) cluster renumbering | 0.43  | 0.66   | 5.51    |
| 847e798 | eliminate sqrt           | 0.32  | 0.64   | 5.41    |
| 909a621 | dynamic thread count     | 0.30  | 0.65   | 5.45    |

Speedup per step (vs. previous commit)

| Commit  | Description              | 3K 3D | 13K 2D | 100K 2D |
|---------|--------------------------|-------|--------|---------|
| 5c7fdd5 | baseline                 | -     | -      | -       |
| 39805c2 | parfor heuristic         | 2.26x | 1.70x  | 1.06x   |
| 8da5e33 | 260% idle CPU usage fix  | 0.95x | 0.96x  | 1.00x   |
| 96c9cff | pre-compute cell keys    | 1.00x | 1.00x  | 1.15x   |
| 4a4025f | defer cell/mutex init    | 0.98x | 1.00x  | 1.23x   |
| 607417e | O(n) cluster renumbering | 0.95x | 1.20x  | 1.14x   |
| 847e798 | eliminate sqrt           | 1.34x | 1.03x  | 1.02x   |
| 909a621 | dynamic thread count     | 1.07x | 0.98x  | 0.99x   |

Cumulative speedup (vs. baseline)

| Commit  | Description              | 3K 3D | 13K 2D | 100K 2D |
|---------|--------------------------|-------|--------|---------|
| 5c7fdd5 | baseline                 | -     | -      | -       |
| 39805c2 | parfor heuristic         | 2.26x | 1.70x  | 1.06x   |
| 8da5e33 | 260% idle CPU usage fix  | 2.15x | 1.63x  | 1.06x   |
| 96c9cff | pre-compute cell keys    | 2.15x | 1.63x  | 1.22x   |
| 4a4025f | defer cell/mutex init    | 2.10x | 1.63x  | 1.50x   |
| 607417e | O(n) cluster renumbering | 2.00x | 1.95x  | 1.70x   |
| 847e798 | eliminate sqrt           | 2.69x | 2.02x  | 1.73x   |
| 909a621 | dynamic thread count     | 2.87x | 1.98x  | 1.72x   |

Summary of changes

1. 39805c2 — Replace timing-based granularity with simple heuristic in parfor
Removed get_granularity() which ran a timed benchmark to decide chunk sizes at runtime. Replaced with a simple n / (4 * num_threads) heuristic plus a sequential fast-path for small ranges. Biggest win on the small 3K cloud (2.26x).
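
To make the heuristic concrete, here is a minimal sketch of how a range could be split under the n / (4 * num_threads) rule. The names `chunk_size`, `plan_chunks`, and the cutoff `kSequentialCutoff` are hypothetical and not taken from the PR, and the real parfor hands chunks to the scheduler rather than returning them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical cutoff below which a range is run sequentially.
// The PR does not state the real value.
constexpr std::size_t kSequentialCutoff = 1024;

// Chunk size from the n / (4 * num_threads) heuristic, clamped to at least 1.
inline std::size_t chunk_size(std::size_t n, std::size_t num_threads) {
  assert(num_threads > 0);
  return std::max<std::size_t>(1, n / (4 * num_threads));
}

// Split [start, end) into the half-open chunks a parallel loop would hand out.
// Small ranges come back as a single chunk, which models the sequential fast path.
inline std::vector<std::pair<std::size_t, std::size_t>>
plan_chunks(std::size_t start, std::size_t end, std::size_t num_threads) {
  std::vector<std::pair<std::size_t, std::size_t>> chunks;
  if (end <= start) return chunks;
  const std::size_t n = end - start;
  if (n <= kSequentialCutoff || num_threads <= 1) {
    chunks.emplace_back(start, end);  // no scheduling overhead for small ranges
    return chunks;
  }
  const std::size_t step = chunk_size(n, num_threads);
  for (std::size_t lo = start; lo < end; lo += step) {
    chunks.emplace_back(lo, std::min(end, lo + step));
  }
  return chunks;
}
```

With 8 threads, a 100,000-element range yields chunks of 3,125 elements, roughly four chunks per thread, which leaves room for work stealing to balance uneven chunks without a runtime benchmark.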

2. 8da5e33 — Fix idle CPU usage via progressive backoff and CV-based sleep
Rewrote the work-stealing loop in scheduler.h with a 3-phase backoff: aggressive spinning, yielding, then sleeping on a condition variable. spawn() only signals the CV when threads are actually sleeping (zero-cost during active computation). Also changed wait() to opportunistically execute stolen work instead of re-entering start(), preventing potential deadlocks. Roughly neutral on throughput (0.95x to 1.00x vs. the previous commit) but eliminates CPU burn on idle threads.
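
The backoff idea can be sketched as follows. This is not the code in scheduler.h: the class `IdleGate`, the thresholds `kSpinLimit` and `kYieldLimit`, and the 1 ms wait timeout are all assumptions chosen for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

enum class Phase { Spin, Yield, Sleep };

// Hypothetical thresholds; the PR does not list the real ones.
constexpr int kSpinLimit = 64;
constexpr int kYieldLimit = 256;

// Map the number of consecutive failed steal attempts to a backoff phase.
inline Phase phase_for(int failed_steals) {
  assert(failed_steals >= 0);
  if (failed_steals < kSpinLimit) return Phase::Spin;
  if (failed_steals < kYieldLimit) return Phase::Yield;
  return Phase::Sleep;
}

class IdleGate {
 public:
  // Called by a worker after a failed steal. has_work() must return true
  // once there is something to steal again.
  template <class HasWork>
  void idle_step(int failed_steals, HasWork has_work) {
    switch (phase_for(failed_steals)) {
      case Phase::Spin:
        break;  // retry immediately
      case Phase::Yield:
        std::this_thread::yield();  // give up the time slice, stay runnable
        break;
      case Phase::Sleep: {
        std::unique_lock<std::mutex> lock(mu_);
        sleepers_.fetch_add(1);
        // The timeout is a safety net against a missed notification.
        cv_.wait_for(lock, std::chrono::milliseconds(1), has_work);
        sleepers_.fetch_sub(1);
        break;
      }
    }
  }

  // Called by the producer after publishing work. When nobody sleeps this
  // is a single atomic load and the mutex and CV are never touched.
  bool notify_if_sleeping() {
    if (sleepers_.load() == 0) return false;
    std::lock_guard<std::mutex> lock(mu_);
    cv_.notify_one();
    return true;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::atomic<int> sleepers_{0};
};
```

The sleeper count is what keeps the producer path cheap: while all workers are busy, `notify_if_sleeping()` never takes the lock.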

3. 96c9cff — Pre-compute cell keys for grid sort, avoiding floor()
In grid.h, pre-computes integer cell coordinates (floor((P[i][d] - pMin[d]) / r)) into a flat array before sorting, so the sort comparator uses integer comparisons instead of repeated floating-point floor().
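
A minimal sketch of the pre-computed key approach, assuming points are stored in a flat array of `dim` coordinates each. The function names and the index-sort are illustrative; grid.h sorts the points with its own routines.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Point i occupies pts[i*dim .. i*dim + dim - 1].
// Returns one integer cell coordinate per point coordinate, in the same layout.
inline std::vector<int> compute_cell_keys(const std::vector<double>& pts,
                                          std::size_t dim,
                                          const std::vector<double>& pMin,
                                          double r) {
  assert(dim > 0 && pMin.size() == dim && pts.size() % dim == 0 && r > 0.0);
  std::vector<int> keys(pts.size());
  for (std::size_t i = 0; i < pts.size(); ++i) {
    // floor() runs exactly once per coordinate, before the sort starts.
    keys[i] = static_cast<int>(std::floor((pts[i] - pMin[i % dim]) / r));
  }
  return keys;
}

// Returns point indices ordered by cell; the comparator touches integers only.
inline std::vector<std::size_t> sort_by_cell(const std::vector<int>& keys,
                                             std::size_t dim) {
  assert(dim > 0 && keys.size() % dim == 0);
  std::vector<std::size_t> order(keys.size() / dim);
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) {
                     return std::lexicographical_compare(
                         keys.begin() + a * dim, keys.begin() + (a + 1) * dim,
                         keys.begin() + b * dim, keys.begin() + (b + 1) * dim);
                   });
  return order;
}
```

A comparison sort calls its comparator O(n log n) times, so moving floor() out of the comparator reduces the floating-point work to one pass over the coordinates.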

4. 4a4025f — Defer grid cell/mutex init to used cells only, reduce hash table size
Moves std::mutex construction and nbrCache/cells initialization from all cellCapacity slots to only the numCells actually used after insertion. Also shrinks the initial hash table from cellMax*2 to max(2048, cellMax/4) with a rebuild if the estimate is too small.
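
The deferred initialization can be illustrated with raw storage plus placement new. The `Cell` and `CellPool` types below are hypothetical stand-ins for the grid's real structures, and the rebuild-on-overflow path is omitted; only the sizing rule max(2048, cellMax/4) comes from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <new>

// Hypothetical cell: a lock plus some per-cell state.
struct Cell {
  std::mutex lock;
  int numPoints = 0;
};

// Initial hash table size as described in the PR (previously cellMax * 2).
inline std::size_t initial_table_size(std::size_t cellMax) {
  return std::max<std::size_t>(2048, cellMax / 4);
}

// Reserves room for `capacity` cells but constructs only the ones in use.
class CellPool {
 public:
  explicit CellPool(std::size_t capacity)
      : capacity_(capacity),
        raw_(static_cast<Cell*>(::operator new(capacity * sizeof(Cell)))) {}

  CellPool(const CellPool&) = delete;
  CellPool& operator=(const CellPool&) = delete;

  ~CellPool() {
    for (std::size_t i = 0; i < used_; ++i) raw_[i].~Cell();
    ::operator delete(raw_);
  }

  // Call once insertion has determined how many cells exist.
  void init_used(std::size_t numCells) {
    assert(used_ == 0 && numCells <= capacity_);
    for (; used_ < numCells; ++used_) new (raw_ + used_) Cell();
  }

  std::size_t used() const { return used_; }

  Cell& operator[](std::size_t i) {
    assert(i < used_);
    return raw_[i];
  }

 private:
  std::size_t capacity_;
  std::size_t used_ = 0;
  Cell* raw_;
};
```

When the number of occupied cells is much smaller than the capacity, construction cost scales with the occupied count instead of the worst case.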

5. 607417e — O(n) cluster renumbering via prefix sum
Replaced an O(n log n) sampleSort + hash-table lookup for remapping cluster IDs with a simple O(n) approach: mark which IDs are used in a flag array, prefix-sum to get sequential IDs, then remap. Eliminates the myPair struct, hashSimplePair, and Table allocation entirely.
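
The flag, prefix-sum, remap sequence can be sketched sequentially as below. The function name, the `idBound` parameter, and the convention that negative values mark noise are assumptions for illustration; the PR's version would run these passes in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Renumber cluster IDs in place to 0..k-1, preserving their relative order.
// Assumptions: valid IDs lie in [0, idBound) and negative values mark noise.
// Returns k, the number of distinct clusters.
inline int renumber_clusters(std::vector<int>& cluster, int idBound) {
  assert(idBound >= 0);
  std::vector<int> newId(static_cast<std::size_t>(idBound), 0);

  // Pass 1: flag every ID that is actually used.
  for (int id : cluster) {
    if (id >= 0) {
      assert(id < idBound);
      newId[static_cast<std::size_t>(id)] = 1;
    }
  }

  // Pass 2: an exclusive prefix sum turns the flags into sequential IDs.
  int next = 0;
  for (int& slot : newId) {
    const int used = slot;
    slot = next;
    next += used;
  }

  // Pass 3: remap each point through the table.
  for (int& id : cluster) {
    if (id >= 0) id = newId[static_cast<std::size_t>(id)];
  }
  return next;
}
```

For example, labels {5, 5, -1, 2, 7, 2} with `idBound = 8` become {1, 1, -1, 0, 2, 0} and the function returns 3. Each pass is linear, so the total cost is O(n + idBound) with no sorting or hashing.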

6. 847e798 — Eliminate sqrt from distance comparisons
Added nodeDistanceSqr() to kdNode.h and switched all distance comparisons in coreBccp.h, kdNode.h, and kdTree.h to use the existing distSqr() and the new nodeDistanceSqr() instead of dist()/nodeDistance(). The BCCP threshold r <= epsilon becomes r <= epsilon * epsilon, where r now holds the squared distance. Uses ternary operators instead of std::max for branchless codegen hints. Gives 1.34x on the 3K cloud.
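
The squared-distance comparison can be sketched like this. The functions `dist_sqr`, `box_dist_sqr`, and `within_eps` are illustrative stand-ins; the signatures of the repository's distSqr() and nodeDistanceSqr() are not shown in the PR and may differ.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

template <std::size_t D>
using Point = std::array<double, D>;

// Squared Euclidean distance: no sqrt needed.
template <std::size_t D>
double dist_sqr(const Point<D>& a, const Point<D>& b) {
  double sum = 0.0;
  for (std::size_t d = 0; d < D; ++d) {
    const double diff = a[d] - b[d];
    sum += diff * diff;
  }
  return sum;
}

// Squared gap between two axis-aligned boxes (0 if they overlap).
// Ternaries are used in place of std::max, as the PR describes.
template <std::size_t D>
double box_dist_sqr(const Point<D>& minA, const Point<D>& maxA,
                    const Point<D>& minB, const Point<D>& maxB) {
  double sum = 0.0;
  for (std::size_t d = 0; d < D; ++d) {
    const double gapAB = minB[d] - maxA[d];  // positive if B lies above A
    const double gapBA = minA[d] - maxB[d];  // positive if A lies above B
    double gap = gapAB > gapBA ? gapAB : gapBA;
    gap = gap > 0.0 ? gap : 0.0;
    sum += gap * gap;
  }
  return sum;
}

// dist(a, b) <= epsilon is equivalent to dist_sqr(a, b) <= epsilon^2
// as long as epsilon is non-negative.
template <std::size_t D>
bool within_eps(const Point<D>& a, const Point<D>& b, double epsilon) {
  assert(epsilon >= 0.0);
  return dist_sqr(a, b) <= epsilon * epsilon;
}
```

Squaring is monotonic on non-negative values, so the comparison result is unchanged while the sqrt call disappears from the inner loop.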

7. 909a621 — Dynamic thread count via getWorkers()
Replaced the hardcoded static const intT P = 36*8 with getWorkers() * 8 in boundingBoxParallel() and pMinParallel(). Also switched the VLA-style stack arrays to heap allocation (newA + free), since the count is now dynamic. This makes the code portable across machines with different core counts.
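
As a rough illustration of the blocked reduction with a runtime block count: the sketch below uses `std::thread::hardware_concurrency()` as a stand-in for getWorkers() and a `std::vector` in place of newA + free, and it walks the blocks sequentially where the real code runs them in parallel. The names `worker_count` and `blocked_min` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Stand-in for getWorkers(); hardware_concurrency() may report 0.
inline std::size_t worker_count() {
  const unsigned n = std::thread::hardware_concurrency();
  return n > 0 ? n : 1;
}

// Minimum of xs, computed via workers * 8 per-block partial minima.
inline double blocked_min(const std::vector<double>& xs, std::size_t workers) {
  assert(!xs.empty() && workers > 0);
  const std::size_t P = workers * 8;  // was a compile-time 36 * 8
  const std::size_t blockSize = (xs.size() + P - 1) / P;

  // P is only known at runtime, so the scratch array lives on the heap.
  std::vector<double> localMin(P, xs[0]);

  for (std::size_t p = 0; p < P; ++p) {  // a parallel loop in the real code
    const std::size_t lo = p * blockSize;
    const std::size_t hi = std::min(xs.size(), lo + blockSize);
    for (std::size_t i = lo; i < hi; ++i) {
      localMin[p] = xs[i] < localMin[p] ? xs[i] : localMin[p];
    }
  }
  return *std::min_element(localMin.begin(), localMin.end());
}
```

Blocks past the end of the input simply keep their initial value, so the result is correct even when P exceeds the number of elements.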
