Command for initial similarity-based clustering #80
Conversation
# Calculate if this is a high-volume domain
# and if so, only use reports in the last 14 days
is_high_volume = self.is_high_volume_domain(reports)
Thinking it might be a good idea to limit clustering of existing reports for popular domains to the last 7 or 14 days, as I suspect clusters will fill up quickly with incoming reports given the report volume. All unclustered reports still remain in the main domain-based bucket along with duplicates.
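The time-window filter being discussed could look roughly like this. This is a hedged sketch, not the PR's code: the `created_at` field name, the helper name `filter_recent_reports`, and the 14-day cutoff are all assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumed window; the comment above mentions 7 or 14 days as candidates.
HIGH_VOLUME_WINDOW_DAYS = 14

def filter_recent_reports(reports, is_high_volume):
    """For high-volume domains, keep only reports from the last N days.

    `reports` is assumed to be a list of dicts with a timezone-aware
    `created_at` datetime; low-volume domains are passed through unchanged.
    """
    if not is_high_volume:
        return reports
    cutoff = datetime.now(timezone.utc) - timedelta(days=HIGH_VOLUME_WINDOW_DAYS)
    return [r for r in reports if r["created_at"] >= cutoff]
```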
| "handlers": ["console"], | ||
| "level": "INFO", | ||
| }, | ||
| } |
Added this to get LOG.info logs displayed in the terminal.
Force-pushed f25f197 to 4ae6b7c
jgraham left a comment:
I've added a few code-level issues, mostly about preferring to use dataclasses over dicts where possible.
However, I think the biggest question I have is about the overall lifecycle here. IIUC the model is that we periodically run the clustering algorithm against all reports in the database, and that has the effect of deleting all the existing buckets that were created by clustering, reclustering all the reports, and then creating new buckets from the resulting clusters? That feels like it might cause problems; for example, if we start to attach other metadata to the buckets (e.g. a bug number, or something indicating that we couldn't find any reproducible reports), that will be lost when we re-cluster.
Is it feasible to do this in a way that preserves the existing clusters and only examines new reports to see whether they are close enough to an existing cluster, or whether the unclustered reports now form a new cluster?
Also, it would be good to have some documentation in the code about the high-level data flow and the approach we're using (not the clustering algorithm itself, but how it gets applied to reports).
| """Calculate Euclidean distance between two embeddings.""" | ||
| return sum((a - b) ** 2 for a, b in zip(emb1, emb2)) ** 0.5 | ||
|
|
||
| def find_centroid_for_cluster( |
We compute this centroid and store it, but it's somewhat unclear to me whether we actually have plans to use it for anything?
I'm planning to display it in the UI on the buckets page, as a "representative" report of a clustered bucket, so we have an idea of what the cluster context is.
I was initially thinking of also comparing incoming reports with each centroid to detect whether a new report belongs to a cluster, but I tried that and it doesn't seem to be enough, so instead I'm comparing each new report with all clustered reports for a domain and "voting" for a cluster, but that is rather slow..
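The "voting" approach described here is essentially a k-nearest-neighbours vote over the already-clustered reports. A minimal sketch, assuming embeddings are plain lists of floats and cosine similarity as the metric (the function names, `k`, and the `min_similarity` cutoff are illustrative assumptions, not the PR's actual code):

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def vote_for_cluster(new_embedding, clustered, k=5, min_similarity=0.6):
    """Assign a new report to a cluster by majority vote.

    `clustered` is a list of (cluster_id, embedding) pairs for the domain.
    The k most similar clustered reports above `min_similarity` vote;
    returns the winning cluster id, or None if no neighbour is similar enough.
    """
    scored = sorted(
        ((cosine_similarity(new_embedding, emb), cid) for cid, emb in clustered),
        reverse=True,
    )
    neighbours = [cid for sim, cid in scored[:k] if sim >= min_similarity]
    if not neighbours:
        return None
    return Counter(neighbours).most_common(1)[0][0]
```

This is also where the slowness comes from: each incoming report is compared against every clustered report for the domain, so the cost grows linearly with the number of clustered reports.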
def group_reports_by_label(
    self, reports: list[dict], labels: list[int], embeddings: list
) -> dict[int, dict[str, list]]:
In general prefer Mapping as the type unless we need the data to be mutable.
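For illustration, the suggested signature change would look something like this. This is a standalone sketch (the `self` parameter and `embeddings` argument are dropped, and the body is a plausible guess, not the PR's implementation):

```python
from collections.abc import Mapping, Sequence

def group_reports_by_label(
    reports: Sequence[dict], labels: Sequence[int]
) -> Mapping[int, list[dict]]:
    """Group reports by their cluster label.

    Declaring the return type as Mapping tells callers the result
    should be treated as read-only, even though a dict is returned.
    """
    grouped: dict[int, list[dict]] = {}
    for report, label in zip(reports, labels):
        grouped.setdefault(label, []).append(report)
    return grouped
```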
return []

# Use different thresholds for high vs normal volume
threshold = (
This is probably fine, but it seems like it's basically a simple two-state model reflecting a desire to require stronger similarity when we have more reports, while allowing weaker similarity when we don't? I started to wonder whether we actually want a continuous function that determines the threshold for a given report volume.
Yeah, we can try coming up with a function that determines the threshold based on volume. So far I've noticed that the same threshold doesn't work for both high-volume and low-volume domains. Using around 60% similarity for youtube.com reports clumps too many reports together and the clusters become too broad, while applying 70% to a less popular domain splits clusters into multiples even though the reports are generally about one problem (i.e. can't log in, etc.). I decided to start with two states, but it can probably be improved.
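A continuous version of the two-state model could linearly interpolate the threshold between the two existing values as report volume grows. A sketch under stated assumptions: the 0.6/0.7 endpoints come from the discussion above, but the volume breakpoints (50 and 1000 reports) and the function name are invented for illustration.

```python
def similarity_threshold(report_count, low=0.6, high=0.7,
                         low_volume=50, high_volume=1000):
    """Interpolate the clustering similarity threshold by report volume.

    Returns `low` at or below `low_volume` reports, `high` at or above
    `high_volume`, and a linear blend in between, so popular domains
    require stronger similarity to join a cluster.
    """
    if report_count <= low_volume:
        return low
    if report_count >= high_volume:
        return high
    frac = (report_count - low_volume) / (high_volume - low_volume)
    return low + frac * (high - low)
```

This keeps the current behaviour at the extremes while removing the hard jump between the two states; the breakpoints would need tuning against real domains.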
OK, now I'm back at the initial PR, and I notice that you already said you plan to address incoming reports in a followup :) Given that, I think we mostly need documentation in the code explaining that this command is for the initial import and not for ongoing operation.
jgraham left a comment:
OK, I think we should go ahead and land this so that we can experiment with the clustering.
(One very minor style point that I didn't put in the other comments, and which we don't need to address right away: I personally prefer using the collections.abc types like Mapping, Sequence, Iterable, etc. over concrete types like dict or list in many cases, because they both allow slightly more flexibility and encode more invariants. For example, trying to mutate something declared as Mapping will cause a type error, whereas dict won't.)
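The invariant mentioned here can be shown with a small example: a function that declares its parameter as `Mapping` accepts a plain dict at the call site, but a type checker such as mypy will flag any attempt to mutate it inside the function (the function name below is made up for illustration).

```python
from collections.abc import Mapping

def describe(config: Mapping[str, int]) -> str:
    """Read-only consumer of a config mapping."""
    # config["x"] = 1  # mypy error: "Mapping" has no attribute "__setitem__"
    return ", ".join(f"{k}={v}" for k, v in config.items())
```

At runtime a dict passes straight through; the restriction is purely a static-typing contract, which is why it costs nothing to adopt.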
Force-pushed 6d63c4e to 901c244
Force-pushed 901c244 to 76dabe6
This command is the first part of the clustering infrastructure. It clusters and creates buckets for existing reports, and is intended to be run once. I plan to add a separate PR for clustering incoming reports.