
Commit 901c244

Command for initial similarity-based clustering
1 parent 59ee985 commit 901c244

16 files changed

Lines changed: 1295 additions & 40 deletions


CLUSTERING.md

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
# Clustering reports by similarity

## Overview

The clustering mechanism groups similar reports within each domain using unsupervised machine learning (SBERT embeddings and agglomerative clustering) and creates a bucket for each cluster.

## One-time clustering of existing reports

This command clusters existing reports by similarity and creates a bucket for each cluster. It is intended to be run once.
Note that rerunning this command **will delete existing clusters and cluster-based buckets** and recreate them from scratch.

```bash
# Cluster reports for a specific domain only
uv run -p 3.12 --extra=server server/manage.py cluster_reports --domain example.com

# Cluster all reports across all domains
uv run -p 3.12 --extra=server server/manage.py cluster_reports
```

The command performs the following steps:

1. Removes existing clusters and their associated buckets, if any exist.
2. Fetches reports from the database that meet both of the following criteria:
   - Non-empty comment
   - ML validity score above 0.03 (reports with at least a 97% probability of being invalid are skipped)

   Reports that don't meet these criteria are excluded from clustering and remain in the default domain-based bucket.

3. Groups reports by domain and processes each domain independently.
4. Selects a clustering strategy based on the domain's report volume:

   **High-Volume Domains** (>20 reports per week):
   - Uses only the last 14 days of reports to focus on recent issues
   - Applies a stricter distance threshold (0.30) to create more granular clusters

   **Normal-Volume Domains** (≤20 reports per week):
   - Uses all historical reports
   - Applies a more permissive distance threshold (0.38) to find patterns even with fewer reports

   The clustering process then:
   - Generates semantic embeddings for each report comment using SBERT
   - Groups similar embeddings using agglomerative clustering
   - Identifies a centroid (most representative report) for each cluster

5. Creates clusters from the clustering results. Single-report clusters are discarded if the report's ML validity probability is below 0.60; such reports remain in the default domain-based bucket.

6. Saves the clusters to the database along with their corresponding buckets. Each bucket receives a signature containing the domain and cluster ID for future report assignment.

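The filtering and strategy rules in steps 2, 4 and 5 can be sketched as follows. This is a minimal illustration rather than the command's actual code: the `Report` dataclass and all function names are hypothetical, and the sketch assumes that "reports per week" is measured over the most recent 7 days, which the steps above do not specify.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Values quoted in the steps above.
MIN_VALIDITY = 0.03
SINGLETON_MIN_VALIDITY = 0.60
HIGH_VOLUME_PER_WEEK = 20
HIGH_VOLUME_WINDOW = timedelta(days=14)
HIGH_VOLUME_THRESHOLD = 0.30
NORMAL_VOLUME_THRESHOLD = 0.38


@dataclass
class Report:  # hypothetical stand-in for the real report model
    comment: str
    validity: float  # ML validity probability
    reported_at: datetime


def is_eligible(report):
    """Step 2: keep reports with a non-empty comment and validity above 0.03."""
    return bool(report.comment.strip()) and report.validity > MIN_VALIDITY


def select_strategy(reports, now):
    """Step 4: return (reports to cluster, distance threshold) for one domain."""
    # Assumption: volume is the number of reports in the most recent 7 days.
    week_ago = now - timedelta(days=7)
    weekly_volume = sum(1 for r in reports if r.reported_at >= week_ago)
    if weekly_volume > HIGH_VOLUME_PER_WEEK:
        cutoff = now - HIGH_VOLUME_WINDOW
        recent = [r for r in reports if r.reported_at >= cutoff]
        return recent, HIGH_VOLUME_THRESHOLD
    return list(reports), NORMAL_VOLUME_THRESHOLD


def keep_cluster(members):
    """Step 5: drop single-report clusters whose report scores below 0.60."""
    return len(members) > 1 or members[0].validity >= SINGLETON_MIN_VALIDITY
```

Keeping the numeric cut-offs as named constants makes it easier to tune them per deployment without touching the selection logic.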
## Clustering algorithm details

### Semantic Embeddings
The system uses SBERT (Sentence-BERT) to convert report text into semantic embeddings: vector representations that capture the meaning of a report.
Once the embeddings are created, it uses agglomerative clustering, a hierarchical approach that:
1. Begins with each report as its own cluster
2. Iteratively merges the most similar clusters
3. Stops when every remaining pair of clusters is at least the distance threshold apart

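The three steps above can be illustrated with a small pure-Python sketch. It assumes cosine distance and average linkage, neither of which is stated in this document, and it operates on toy vectors rather than real SBERT embeddings. In practice, scikit-learn's `AgglomerativeClustering` provides the same stopping behaviour through its `distance_threshold` parameter (with `n_clusters=None`).

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative(vectors, threshold):
    """Return clusters as lists of indices into `vectors`."""
    # 1. Begin with each item as its own cluster.
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        # 2. Find the closest pair of clusters.
        dist, a, b = min(
            (average_linkage(clusters[a], clusters[b], vectors), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        # 3. Stop once even the closest pair is at or beyond the threshold.
        if dist >= threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Two near-parallel pairs of toy vectors form two clusters at threshold 0.30.
toy = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
print(agglomerative(toy, 0.30))
```

This naive version recomputes every pairwise linkage on each pass, so it is only suitable for illustration; a library implementation is the better choice for real report volumes.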
### Threshold Selection
The distance threshold sets the maximum distance at which two reports are grouped together. Lower thresholds produce smaller, more specific clusters (only very similar reports group together); higher thresholds create larger, more general clusters (moderately similar reports can group together).

The threshold varies based on domain volume:
- **High-volume domains** (0.30 threshold, i.e. at least 70% similarity): Stricter matching, since enough data exists to form meaningful, specific clusters
- **Normal-volume domains** (0.38 threshold, i.e. at least 62% similarity): More permissive matching to detect patterns despite a limited number of reports

### Centroid Selection
Each cluster's centroid is the report whose embedding is closest to the cluster's mean embedding. This report serves as the most representative example of the cluster.
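
As a rough sketch of this selection, the hypothetical helper below averages a cluster's embeddings and returns the index of the member nearest to that mean. It assumes Euclidean distance to the mean; the metric actually used is not stated here.

```python
import math


def pick_centroid(embeddings):
    """Return the index of the embedding closest to the cluster mean."""
    dims = len(embeddings[0])
    # Component-wise mean of all embeddings in the cluster.
    mean = [sum(vec[d] for vec in embeddings) / len(embeddings) for d in range(dims)]
    # Assumption: closeness is measured with Euclidean distance.
    return min(range(len(embeddings)), key=lambda i: math.dist(embeddings[i], mean))
```

Returning an index rather than the vector makes it straightforward to map the result back to the report that produced the embedding.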

pyproject.toml

Lines changed: 4 additions & 1 deletion
@@ -39,6 +39,7 @@ dev = [
  "mypy==1.19.1",
  "pytest==9.0.2",
  "pytest-cov==7.0.0",
+ "pytest-django==4.11.1",
  "pytest-mock==3.15.1",
  "celery-types==0.24.0",
  "django-stubs==5.2.9",
@@ -60,6 +61,8 @@ server = [
  "djangorestframework==3.16.1",
  "google-cloud-bigquery==3.40.0",
  "pyyaml==6.0.3",
+ "scikit-learn>=1.3.0",
+ "sentence-transformers>=2.2.0",
  "whitenoise==6.11.0",
  ]

@@ -134,7 +137,7 @@ select = [
  # pycodestyle
  "W",
  ]
- ignore = ["SIM117"]
+ ignore = ["SIM117", "RUF012"]

  [tool.ruff.lint.isort]
  known-first-party = ["reportmanager"]
