Command for initial similarity-based clustering #80
Conversation
# Calculate if this is a high-volume domain
# and if so, only use reports in the last 14 days
is_high_volume = self.is_high_volume_domain(reports)
Thinking it might be a good idea to limit clustering of existing reports for popular domains to the last 7 or 14 days, as I suspect clusters will fill up quickly with incoming reports given the report volume. All unclustered reports still remain in the main domain-based bucket along with duplicates.
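The time-window filter being discussed could look roughly like this. This is a hedged sketch, not the PR's code: the `created_at` field name, the helper name `filter_recent_reports`, and the 14-day cutoff are all assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumed window; the comment above mentions 7 or 14 days as candidates.
HIGH_VOLUME_WINDOW_DAYS = 14

def filter_recent_reports(reports, is_high_volume):
    """For high-volume domains, keep only reports from the last N days.

    `reports` is assumed to be a list of dicts with a timezone-aware
    `created_at` datetime; low-volume domains are passed through unchanged.
    """
    if not is_high_volume:
        return reports
    cutoff = datetime.now(timezone.utc) - timedelta(days=HIGH_VOLUME_WINDOW_DAYS)
    return [r for r in reports if r["created_at"] >= cutoff]
```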
| "handlers": ["console"], | ||
| "level": "INFO", | ||
| }, | ||
| } |
Added this to get LOG.info logs displayed in the terminal.
Force-pushed f25f197 to 4ae6b7c
jgraham left a comment:
I've added a few code-level issues, mostly about preferring to use dataclasses over dicts where possible.
However, I think the biggest question I have is about the overall lifecycle here. IIUC the model is that we periodically run the clustering algorithm against all reports in the database, and that has the effect of deleting all the existing buckets that were created by clustering, reclustering all the reports, and then creating new buckets from the resulting clusters? That feels like it might cause problems; for example, if we start to attach other metadata to the buckets (e.g. a bug number, or something indicating that we couldn't find any reproducible reports), that will be lost when we re-cluster.
Is it feasible to do this in a way that preserves the existing clusters and only examines new reports to see whether they are close enough to an existing cluster, or whether the unclustered reports now form a new cluster?
Also, it would be good to have some documentation in the code about the high-level data flow and the approach we're using (not the clustering algorithm itself, but how it gets applied to reports).
| """Calculate Euclidean distance between two embeddings.""" | ||
| return sum((a - b) ** 2 for a, b in zip(emb1, emb2)) ** 0.5 | ||
|
|
||
| def find_centroid_for_cluster( |
We compute this centroid and store it, but it's somewhat unclear to me whether we actually have plans to use it for anything?
I'm planning to display it in the UI on the buckets page, as a "representative" report of a clustered bucket, so we have an idea of what the cluster context is.
I was initially thinking of also comparing incoming reports with each centroid to detect whether a new report belongs to a cluster, but I tried that and it doesn't seem to be enough, so instead I'm comparing each new report with all clustered reports for a domain and "voting" for a cluster, but that is rather slow..
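The "voting" approach described here is essentially a k-nearest-neighbours vote over the already-clustered reports. A minimal sketch, assuming embeddings are plain lists of floats and cosine similarity as the metric (the function names, `k`, and the `min_similarity` cutoff are illustrative assumptions, not the PR's actual code):

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def vote_for_cluster(new_embedding, clustered, k=5, min_similarity=0.6):
    """Assign a new report to a cluster by majority vote.

    `clustered` is a list of (cluster_id, embedding) pairs for the domain.
    The k most similar clustered reports above `min_similarity` vote;
    returns the winning cluster id, or None if no neighbour is similar enough.
    """
    scored = sorted(
        ((cosine_similarity(new_embedding, emb), cid) for cid, emb in clustered),
        reverse=True,
    )
    neighbours = [cid for sim, cid in scored[:k] if sim >= min_similarity]
    if not neighbours:
        return None
    return Counter(neighbours).most_common(1)[0][0]
```

This is also where the slowness comes from: each incoming report is compared against every clustered report for the domain, so the cost grows linearly with the number of clustered reports.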
def group_reports_by_label(
    self, reports: list[dict], labels: list[int], embeddings: list
) -> dict[int, dict[str, list]]:
In general prefer Mapping as the type unless we need the data to be mutable.
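For illustration, the suggested signature change would look something like this. This is a standalone sketch (the `self` parameter and `embeddings` argument are dropped, and the body is a plausible guess, not the PR's implementation):

```python
from collections.abc import Mapping, Sequence

def group_reports_by_label(
    reports: Sequence[dict], labels: Sequence[int]
) -> Mapping[int, list[dict]]:
    """Group reports by their cluster label.

    Declaring the return type as Mapping tells callers the result
    should be treated as read-only, even though a dict is returned.
    """
    grouped: dict[int, list[dict]] = {}
    for report, label in zip(reports, labels):
        grouped.setdefault(label, []).append(report)
    return grouped
```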
return []

# Use different thresholds for high vs normal volume
threshold = (
This is probably fine, but it seems like it's basically a simple two-state model reflecting a desire to require stronger similarity when we have more reports, while allowing weaker similarity when we don't? I started to wonder whether we actually want a continuous function that determines the threshold for a given report volume.
Yeah, we can try coming up with a function that determines the threshold based on volume. So far I've noticed that the same threshold doesn't work for both high-volume and low-volume domains. Using around 60% similarity for youtube.com reports clumps too many reports together and the clusters become too broad, while applying 70% to a less popular domain splits clusters into multiples even though the reports are generally about one problem (i.e. can't log in, etc.). I decided to start with two states, but it can probably be improved.
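A continuous version of the two-state model could linearly interpolate the threshold between the two existing values as report volume grows. A sketch under stated assumptions: the 0.6/0.7 endpoints come from the discussion above, but the volume breakpoints (50 and 1000 reports) and the function name are invented for illustration.

```python
def similarity_threshold(report_count, low=0.6, high=0.7,
                         low_volume=50, high_volume=1000):
    """Interpolate the clustering similarity threshold by report volume.

    Returns `low` at or below `low_volume` reports, `high` at or above
    `high_volume`, and a linear blend in between, so popular domains
    require stronger similarity to join a cluster.
    """
    if report_count <= low_volume:
        return low
    if report_count >= high_volume:
        return high
    frac = (report_count - low_volume) / (high_volume - low_volume)
    return low + frac * (high - low)
```

This keeps the current behaviour at the extremes while removing the hard jump between the two states; the breakpoints would need tuning against real domains.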
OK, now I'm back at the initial PR, and I notice that you already said you plan to address incoming reports in a followup :) Given that, I think we mostly need documentation in the code explaining that this command is for the initial import and not for ongoing operation.
jgraham left a comment:
OK, I think we should go ahead and land this so that we can experiment with the clustering.
(One very minor style point that I didn't put in the other comments, and which we don't need to address right away: I personally prefer using the collections.abc types like Mapping, Sequence, Iterable, etc. over concrete types like dict or list in many cases, because they both allow slightly more flexibility and encode more invariants. For example, trying to mutate something declared as Mapping will cause a type error, whereas dict won't.)
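The invariant mentioned here can be shown with a small example: a function that declares its parameter as `Mapping` accepts a plain dict at the call site, but a type checker such as mypy will flag any attempt to mutate it inside the function (the function name below is made up for illustration).

```python
from collections.abc import Mapping

def describe(config: Mapping[str, int]) -> str:
    """Read-only consumer of a config mapping."""
    # config["x"] = 1  # mypy error: "Mapping" has no attribute "__setitem__"
    return ", ".join(f"{k}={v}" for k, v in config.items())
```

At runtime a dict passes straight through; the restriction is purely a static-typing contract, which is why it costs nothing to adopt.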
Force-pushed 6d63c4e to 901c244
Force-pushed 901c244 to 76dabe6
This command is the first part of the clustering infrastructure. It clusters and creates buckets for existing reports, and is intended to be run once. I plan to add a separate PR for clustering incoming reports.