Performance Optimizations by John-194 · Pull Request #28 · wangyiqiu/dbscan-python

John-194 · 2026-04-04T11:45:00Z

Performance Optimizations

Improves performance by 1.7x to 2.9x, depending on the dataset
Fixes high idle CPU usage

Tested using:

3K 3D — 3k-point 3D cloud
13K 2D — 13k-point 2D cloud
100K 2D — 100k-point 2D cloud

The baseline and the optimized code return identical DBSCAN results.
No memory leaks were detected.

Results vs. baseline (dev branch)

Speed (ms/iter)

| Commit | Description | 3K 3D | 13K 2D | 100K 2D |
|---|---|---|---|---|
| `5c7fdd5` | baseline | 0.86 | 1.29 | 9.38 |
| `39805c2` | parfor heuristic | 0.38 | 0.76 | 8.86 |
| `8da5e33` | 260% idle CPU usage fix | 0.40 | 0.79 | 8.88 |
| `96c9cff` | pre-compute cell keys | 0.40 | 0.79 | 7.71 |
| `4a4025f` | defer cell/mutex init | 0.41 | 0.79 | 6.27 |
| `607417e` | O(n) cluster renumbering | 0.43 | 0.66 | 5.51 |
| `847e798` | eliminate sqrt | 0.32 | 0.64 | 5.41 |
| `909a621` | dynamic thread count | 0.30 | 0.65 | 5.45 |

Speedup per step (vs. previous commit)

| Commit | Description | 3K 3D | 13K 2D | 100K 2D |
|---|---|---|---|---|
| `5c7fdd5` | baseline | - | - | - |
| `39805c2` | parfor heuristic | 2.26x | 1.70x | 1.06x |
| `8da5e33` | 260% idle CPU usage fix | 0.95x | 0.96x | 1.00x |
| `96c9cff` | pre-compute cell keys | 1.00x | 1.00x | 1.15x |
| `4a4025f` | defer cell/mutex init | 0.98x | 1.00x | 1.23x |
| `607417e` | O(n) cluster renumbering | 0.95x | 1.20x | 1.14x |
| `847e798` | eliminate sqrt | 1.34x | 1.03x | 1.02x |
| `909a621` | dynamic thread count | 1.07x | 0.98x | 0.99x |

Cumulative speedup (vs. baseline)

| Commit | Description | 3K 3D | 13K 2D | 100K 2D |
|---|---|---|---|---|
| `5c7fdd5` | baseline | - | - | - |
| `39805c2` | parfor heuristic | 2.26x | 1.70x | 1.06x |
| `8da5e33` | 260% idle CPU usage fix | 2.15x | 1.63x | 1.06x |
| `96c9cff` | pre-compute cell keys | 2.15x | 1.63x | 1.22x |
| `4a4025f` | defer cell/mutex init | 2.10x | 1.63x | 1.50x |
| `607417e` | O(n) cluster renumbering | 2.00x | 1.95x | 1.70x |
| `847e798` | eliminate sqrt | 2.69x | 2.02x | 1.73x |
| `909a621` | dynamic thread count | 2.87x | 1.98x | 1.72x |
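
Both speedup tables follow from the ms/iter table by division: the per-step value divides the previous commit's time by the current one, and the cumulative value divides the baseline time by the current one. A minimal sketch of that arithmetic (the `speedup` helper is hypothetical and not part of the PR):

```cpp
#include <cmath>

// Speedup of `current_ms` relative to `reference_ms`, rounded to two decimals.
// Pass the previous commit's time for a per-step value,
// or the baseline time for a cumulative value.
double speedup(double reference_ms, double current_ms) {
  return std::round(reference_ms / current_ms * 100.0) / 100.0;
}
```

For the final commit on the 3K cloud, `speedup(0.86, 0.30)` gives the cumulative 2.87x and `speedup(0.32, 0.30)` gives the per-step 1.07x.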

Summary of changes

1. 39805c2 — Replace timing-based granularity with simple heuristic in parfor
Removed get_granularity(), which ran a timed benchmark to decide chunk sizes at runtime, and replaced it with a simple n / (4 * num_threads) heuristic plus a sequential fast path for small ranges. This is the biggest win on the small 3K cloud (2.26x).
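
As an illustration of the heuristic, here is a minimal sketch that only plans the chunk boundaries. The names `plan_chunks`, `Range`, and `kSequentialCutoff` are hypothetical, and the cutoff value is an assumption since the description above does not state the threshold actually used.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Assumed cutoff for the sequential fast path; the real value may differ.
constexpr std::size_t kSequentialCutoff = 2048;

struct Range {
  std::size_t begin;
  std::size_t end;  // exclusive
};

// Splits [begin, end) into chunks of about n / (4 * num_threads) iterations.
// Small ranges come back as a single chunk, to be run sequentially.
std::vector<Range> plan_chunks(std::size_t begin, std::size_t end,
                               std::size_t num_threads) {
  std::vector<Range> chunks;
  if (end <= begin) return chunks;
  const std::size_t n = end - begin;
  if (n <= kSequentialCutoff || num_threads <= 1) {
    chunks.push_back({begin, end});  // sequential fast path
    return chunks;
  }
  const std::size_t chunk = std::max<std::size_t>(1, n / (4 * num_threads));
  for (std::size_t lo = begin; lo < end; lo += chunk) {
    chunks.push_back({lo, std::min(end, lo + chunk)});
  }
  return chunks;
}
```

With 4 threads and 16000 iterations this yields 16 chunks of 1000 iterations each, so every thread has a few chunks to pick up without any runtime benchmarking.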

2. 8da5e33 — Fix idle CPU usage via progressive backoff and CV-based sleep
Rewrote the work-stealing loop in scheduler.h with a three-phase backoff: aggressive spinning, then yielding, then sleeping on a condition variable (CV). spawn() signals the CV only when threads are actually sleeping, so it adds no signaling cost during active computation. wait() now opportunistically executes stolen work instead of re-entering start(), which prevents potential deadlocks. Throughput is roughly neutral (0.95x to 1.00x per step), but idle threads no longer burn CPU.
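
The backoff idea can be sketched roughly as follows. This is not the code in scheduler.h: the class name `IdleBackoff`, its methods, and the phase limits are all hypothetical, and the sleep uses a short timeout as a safety net against missed wakeups.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical helper; names and thresholds are illustrative only.
class IdleBackoff {
 public:
  // Returns once has_work() is true, escalating spin -> yield -> sleep.
  template <typename HasWork>
  void wait_for_work(HasWork has_work) {
    int attempt = 0;
    while (!has_work()) {
      if (attempt < kSpinLimit) {  // phase 1: busy spin
        ++attempt;
        continue;
      }
      if (attempt < kYieldLimit) {  // phase 2: give up the time slice
        ++attempt;
        std::this_thread::yield();
        continue;
      }
      // Phase 3: sleep on the condition variable until notified or timed out.
      sleepers_.fetch_add(1);
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait_for(lock, std::chrono::milliseconds(1), has_work);
      }
      sleepers_.fetch_sub(1);
    }
  }

  // Called by the producer after publishing work.
  // Skips the mutex and the notify entirely when nobody is asleep.
  void notify_work_available() {
    if (sleepers_.load() == 0) return;
    std::lock_guard<std::mutex> lock(mutex_);
    cv_.notify_one();
  }

  int sleepers() const { return sleepers_.load(); }

 private:
  static constexpr int kSpinLimit = 64;    // assumed value
  static constexpr int kYieldLimit = 128;  // assumed value
  std::atomic<int> sleepers_{0};
  std::mutex mutex_;
  std::condition_variable cv_;
};
```

The sleeper counter is what keeps the notify path cheap: while all workers are busy, the producer pays for one atomic load and nothing else.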

3. 96c9cff — Pre-compute cell keys for grid sort, avoiding floor()
In grid.h, pre-computes integer cell coordinates (floor((P[i][d] - pMin[d]) / r)) into a flat array before sorting, so the sort comparator uses integer comparisons instead of repeated floating-point floor().
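
A minimal sketch of the same idea, assuming points stored as `std::array<double, Dim>` and a lexicographic ordering of cell coordinates. The function name `sort_by_cell` and the container types are hypothetical and do not mirror grid.h.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Returns point indices sorted by integer cell coordinates.
// floor() runs once per coordinate up front, never inside the comparator.
template <std::size_t Dim>
std::vector<std::size_t> sort_by_cell(
    const std::vector<std::array<double, Dim>>& points,
    const std::array<double, Dim>& p_min, double r) {
  const std::size_t n = points.size();
  std::vector<std::int64_t> keys(n * Dim);  // flat array: keys[i * Dim + d]
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t d = 0; d < Dim; ++d) {
      keys[i * Dim + d] =
          static_cast<std::int64_t>(std::floor((points[i][d] - p_min[d]) / r));
    }
  }
  std::vector<std::size_t> order(n);
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
    const std::int64_t* ka = keys.data() + a * Dim;
    const std::int64_t* kb = keys.data() + b * Dim;
    return std::lexicographical_compare(ka, ka + Dim, kb, kb + Dim);
  });
  return order;
}
```

Since a comparison sort calls its comparator O(n log n) times, moving the floor() calls out of it reduces them to exactly n * Dim.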

4. 4a4025f — Defer grid cell/mutex init to used cells only, reduce hash table size
Moves std::mutex construction and nbrCache/cells initialization from all cellCapacity slots to only the numCells actually used after insertion. Also shrinks the initial hash table from cellMax*2 to max(2048, cellMax/4) with a rebuild if the estimate is too small.
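
The two sizing decisions can be sketched like this. The `Cell` struct and both function names are hypothetical stand-ins, and the rebuild path for an undersized table is not shown.

```cpp
#include <algorithm>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a grid cell with its lock and neighbor cache.
struct Cell {
  std::mutex lock;
  std::vector<std::size_t> nbr_cache;
};

// Initial hash table size as described above: max(2048, cellMax / 4).
std::size_t initial_table_size(std::size_t cell_max) {
  return std::max<std::size_t>(2048, cell_max / 4);
}

// Constructs mutexes and caches only for the cells that are in use,
// instead of for every slot of the full capacity.
std::vector<Cell> make_used_cells(std::size_t num_cells) {
  return std::vector<Cell>(num_cells);
}
```

For example, `initial_table_size(100000)` is 25000 rather than the previous 200000 slots, and the number of constructed mutexes scales with occupied cells rather than capacity.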

5. 607417e — O(n) cluster renumbering via prefix sum
Replaced an O(n log n) sampleSort + hash-table lookup for remapping cluster IDs with a simple O(n) approach: mark which IDs are used in a flag array, prefix-sum to get sequential IDs, then remap. Eliminates the myPair struct, hashSimplePair, and Table allocation entirely.
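
A sequential sketch of the flag, prefix-sum, and remap steps. It assumes raw cluster IDs lie in [0, n) and that negative labels mark noise; both are assumptions for illustration, and the function name `renumber_clusters` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Renumbers cluster IDs in place to 0..k-1 and returns k.
// Assumes every non-negative label is smaller than labels.size().
std::size_t renumber_clusters(std::vector<int>& labels) {
  const std::size_t n = labels.size();
  std::vector<int> new_id(n, 0);

  // Step 1: flag every ID that is in use.
  for (int id : labels) {
    if (id >= 0) new_id[static_cast<std::size_t>(id)] = 1;
  }

  // Step 2: exclusive prefix sum turns the flags into sequential IDs.
  int next = 0;
  for (std::size_t i = 0; i < n; ++i) {
    const int used = new_id[i];
    new_id[i] = next;
    next += used;
  }

  // Step 3: remap each label; noise labels are left untouched.
  for (int& id : labels) {
    if (id >= 0) id = new_id[static_cast<std::size_t>(id)];
  }
  return static_cast<std::size_t>(next);
}
```

Each of the three passes is linear, and in a parallel setting the prefix sum is the only step that needs a scan primitive.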

6. 847e798 — Eliminate sqrt from distance comparisons
Added nodeDistanceSqr() to kdNode.h and switched all distance comparisons in coreBccp.h, kdNode.h, and kdTree.h to the existing distSqr() and the new nodeDistanceSqr() instead of dist()/nodeDistance(). Because both sides of each comparison are now squared, the BCCP threshold r <= epsilon becomes r <= epsilon * epsilon. Ternary operators replace std::max as a hint toward branchless codegen. The step speedup is 1.34x on the 3K cloud.
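
The comparison rewrite relies on the fact that, for non-negative values, d <= eps holds exactly when d * d <= eps * eps. A minimal sketch with hypothetical free functions (the names and signatures below are illustrative and not those in kdNode.h):

```cpp
#include <array>
#include <cstddef>

// Squared Euclidean distance between two points; no sqrt involved.
template <std::size_t Dim>
double dist_sqr(const std::array<double, Dim>& a,
                const std::array<double, Dim>& b) {
  double sum = 0.0;
  for (std::size_t d = 0; d < Dim; ++d) {
    const double diff = a[d] - b[d];
    sum += diff * diff;
  }
  return sum;
}

// Equivalent to dist(a, b) <= eps for eps >= 0.
template <std::size_t Dim>
bool within_eps(const std::array<double, Dim>& a,
                const std::array<double, Dim>& b, double eps) {
  return dist_sqr(a, b) <= eps * eps;
}

// Squared distance between two axis-aligned boxes (0 if they overlap),
// written with ternaries instead of std::max.
template <std::size_t Dim>
double box_distance_sqr(const std::array<double, Dim>& a_min,
                        const std::array<double, Dim>& a_max,
                        const std::array<double, Dim>& b_min,
                        const std::array<double, Dim>& b_max) {
  double sum = 0.0;
  for (std::size_t d = 0; d < Dim; ++d) {
    const double gap_ab = b_min[d] - a_max[d];
    const double gap_ba = a_min[d] - b_max[d];
    double gap = gap_ab > gap_ba ? gap_ab : gap_ba;
    gap = gap > 0.0 ? gap : 0.0;
    sum += gap * gap;
  }
  return sum;
}
```

Whether the ternaries actually compile to branchless instructions depends on the compiler and flags, so they are best read as a hint rather than a guarantee.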

7. 909a621 — Dynamic thread count via getWorkers()
Replaced the hardcoded static const intT P = 36*8 with getWorkers() * 8 in boundingBoxParallel() and pMinParallel(). Because the count is now dynamic, the VLA-style stack arrays became heap allocations (newA + free). The block count now adapts to machines with different core counts.
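
A rough sketch of a blocked minimum with a runtime block count. `std::thread::hardware_concurrency()` stands in for getWorkers() here, which is an assumption, and `std::vector` replaces the newA/free pair; the outer loop is written sequentially, whereas the real code would run the blocks in parallel.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <thread>
#include <vector>

// Stand-in for getWorkers(); falls back to 1 if the count is unknown.
std::size_t worker_count() {
  const unsigned hw = std::thread::hardware_concurrency();
  return hw == 0 ? 1 : hw;
}

// Minimum of `values`, computed as per-block partial minima and then combined.
// Returns +infinity for an empty input.
double blocked_min(const std::vector<double>& values) {
  const std::size_t n = values.size();
  const std::size_t num_blocks = worker_count() * 8;  // was fixed at 36 * 8
  const std::size_t block_size = (n + num_blocks - 1) / num_blocks;

  // Heap-allocated partials, since num_blocks is only known at runtime.
  std::vector<double> partial(num_blocks,
                              std::numeric_limits<double>::infinity());
  for (std::size_t b = 0; b < num_blocks; ++b) {  // parallel in real code
    const std::size_t lo = std::min(n, b * block_size);
    const std::size_t hi = std::min(n, lo + block_size);
    for (std::size_t i = lo; i < hi; ++i) {
      partial[b] = std::min(partial[b], values[i]);
    }
  }
  return *std::min_element(partial.begin(), partial.end());
}
```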

John-194 added 7 commits April 3, 2026 15:55

`39805c2` Replace timing-based granularity with simple heuristic in parfor

`8da5e33` Fix idle CPU usage using progressive backoff and CV based sleep in work-stealing loop

`96c9cff` Pre-compute cell keys for grid sort avoiding floor()

`4a4025f` Defer grid cell/mutex init to used cells only, reduce hash table size

`607417e` O(n) cluster renumbering via prefix sum, replace O(n log n) sort + hash table

`847e798` Eliminate sqrt from distance comparisons, use ternary for branchless codegen

`909a621` Dynamic thread count via getWorkers(), fix hardcoded P=36*8

John-194 wants to merge 7 commits into wangyiqiu:master from John-194:optimizations
