| 1 | +# Clustering reports by similarity |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +The clustering mechanism groups similar reports within each domain using unsupervised machine learning (SBERT embeddings and agglomerative clustering) and creates a bucket for each cluster. |
| 6 | + |
| 7 | +## One-time clustering of existing reports |
| 8 | + |
| 9 | +This command clusters existing reports by similarity and creates a bucket for each cluster. It is intended to be run once. |
| 10 | +Note that rerunning this command **will delete existing clusters and cluster-based buckets** and recreate them from scratch. |
| 11 | + |
| 12 | + |
| 13 | +```bash |
| 14 | +# Cluster reports for a specific domain only |
| 15 | +uv run -p 3.12 --extra=server server/manage.py cluster_reports --domain example.com |
| 16 | + |
| 17 | +# Cluster all reports across all domains |
| 18 | +uv run -p 3.12 --extra=server server/manage.py cluster_reports |
| 19 | + |
| 20 | +``` |
| 21 | + |
| 22 | +The command performs the following steps: |
| 23 | + |
| 24 | +1. Removes existing clusters and their associated buckets, if any exist. |
| 25 | +2. Fetches reports from the database based on the following criteria: |
| 26 | + - Non-empty comments |
| 27 | + - ML validity score above 0.03 (reports with at least a 97% probability of being invalid are skipped) |
| 28 | + |
| 29 | + Reports that don't meet these criteria are excluded from clustering and remain in the default domain-based bucket. |
| 30 | + |
| 31 | +3. Groups reports by domain and processes each domain independently. |
| 32 | +4. Selects a clustering strategy based on domain volume: |
| 33 | + |
| 34 | + **High-Volume Domains** (>20 reports per week): |
| 35 | + - Uses only the last 14 days of reports to focus on recent issues |
| 36 | + - Applies a stricter similarity threshold (0.30) to create more granular clusters |
| 37 | + |
| 38 | + **Normal-Volume Domains** (≤20 reports per week): |
| 39 | + - Uses all historical reports |
| 40 | + - Applies a more permissive threshold (0.38) so we can find patterns even with fewer reports |
| 41 | + |
| 42 | + The clustering process then: |
| 43 | + - Generates semantic embeddings for each report comment using SBERT |
| 44 | + - Groups similar embeddings using agglomerative clustering |
| 45 | + - Identifies a centroid (most representative report) for each cluster |
| 46 | + |
| 47 | +5. Creates clusters based on the results of the clustering algorithm. Single-report clusters are discarded if their ML validity probability is below 0.60. These reports remain in the default domain-based buckets. |
| 48 | + |
| 49 | +6. Saves the clusters to the database along with their corresponding buckets. Each bucket receives a signature containing the domain and cluster ID, which is used to assign future reports. |
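
The volume-based strategy in step 4 can be sketched as follows. This is a minimal illustration, not the command's actual code: the constant names, `ClusteringStrategy`, and `pick_strategy` are hypothetical, and how the weekly report rate is measured is not specified here, so it is passed in as a parameter.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Values come from the steps above; the names are hypothetical.
HIGH_VOLUME_REPORTS_PER_WEEK = 20
HIGH_VOLUME_WINDOW_DAYS = 14
HIGH_VOLUME_THRESHOLD = 0.30
NORMAL_VOLUME_THRESHOLD = 0.38


@dataclass(frozen=True)
class ClusteringStrategy:
    distance_threshold: float
    # Only reports created at or after this time are clustered; None means all history.
    since: Optional[datetime]


def pick_strategy(reports_per_week: float, now: datetime) -> ClusteringStrategy:
    """Choose the distance threshold and time window for one domain."""
    if reports_per_week > HIGH_VOLUME_REPORTS_PER_WEEK:
        window_start = now - timedelta(days=HIGH_VOLUME_WINDOW_DAYS)
        return ClusteringStrategy(HIGH_VOLUME_THRESHOLD, window_start)
    return ClusteringStrategy(NORMAL_VOLUME_THRESHOLD, None)
```

A domain with exactly 20 reports per week falls into the normal-volume branch, matching the `≤20` boundary above.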
| 50 | + |
| 51 | +## Clustering algorithm details |
| 52 | + |
| 53 | +### Semantic Embeddings |
| 54 | +The system uses SBERT (Sentence-BERT) to convert report text into semantic embeddings: vector representations that capture the meaning of a report. |
| 55 | +Once the embeddings are created, it uses agglomerative clustering, a hierarchical approach that: |
| 56 | +1. Begins with each report as its own cluster |
| 57 | +2. Iteratively merges the most similar clusters |
| 58 | +3. Stops when the distance between every remaining pair of clusters exceeds the distance threshold |
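
The three steps above can be illustrated with a small pure-Python sketch. It assumes cosine distance and average linkage, neither of which is stated in this document, and the function names are hypothetical. It is meant to show the merge loop, not to be efficient; a production implementation would more likely rely on a library.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def average_linkage(cluster_a, cluster_b, embeddings):
    """Mean pairwise distance between the members of two clusters (assumed linkage)."""
    total = sum(
        cosine_distance(embeddings[i], embeddings[j])
        for i in cluster_a
        for j in cluster_b
    )
    return total / (len(cluster_a) * len(cluster_b))


def agglomerative_cluster(embeddings, distance_threshold):
    """Return clusters as lists of indices into `embeddings`."""
    # 1. Begin with each report as its own cluster.
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        # 2. Find the most similar (closest) pair of clusters.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], embeddings),
        )
        # 3. Stop once even the closest pair exceeds the threshold.
        if average_linkage(clusters[a], clusters[b], embeddings) > distance_threshold:
            break
        # Merge b into a; a < b, so deleting index b leaves index a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

With a threshold of 0.30, two reports whose embeddings point in nearly the same direction end up in one cluster, while near-orthogonal embeddings stay apart.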
| 59 | + |
| 60 | +### Threshold Selection |
| 61 | +The distance threshold determines the maximum distance at which two reports will be grouped together. Lower thresholds produce smaller, more specific clusters (only very similar reports group together); higher thresholds create larger, more general clusters (moderately similar reports can group together). |
| 62 | + |
| 63 | +The threshold varies based on domain volume: |
| 64 | +- **High-volume domains** (0.30 threshold = 70% similarity): Stricter matching since sufficient data exists to form meaningful specific clusters |
| 65 | +- **Normal-volume domains** (0.38 threshold = 62% similarity): More permissive matching to detect patterns despite a limited number of reports |
| 66 | + |
| 67 | +### Centroid Selection |
| 68 | +Each cluster's centroid is the report whose embedding is closest to the cluster's mean embedding. This report serves as the most representative example of the cluster. |
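
As a rough sketch of centroid selection: compute the mean embedding, then pick the report nearest to it. The distance metric is not specified above, so Euclidean distance is assumed here, and `mean_embedding` and `pick_centroid` are hypothetical names.

```python
import math


def mean_embedding(embeddings):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(embeddings)
    return [sum(values) / count for values in zip(*embeddings)]


def pick_centroid(embeddings):
    """Return the index of the embedding closest to the cluster's mean embedding."""
    mean = mean_embedding(embeddings)
    # Euclidean distance is an assumption; the actual metric may differ.
    return min(range(len(embeddings)), key=lambda i: math.dist(embeddings[i], mean))
```

Because the centroid is always an actual report rather than the mean vector itself, its comment can be shown as a representative example of the cluster.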