Add models for labels and assign first labels to buckets #219
class Label(models.Model):
    name: models.CharField = models.CharField(max_length=50, unique=True)
    description: models.TextField = models.TextField(blank=True, default="")
    domain_source: models.OneToOneField = models.OneToOneField(
So is the idea that this would be NULL if we want to add arbitrary labels that aren't defined by a domain list?
Yeah, it would be NULL for manually created labels. I added it mainly to keep a connection in case we ever want to bulk delete or merge labels based on their sources.
@receiver(post_save, sender=Bucket)
def Bucket_save(sender, instance, created, **kwargs):
This means that every time we save any bucket we rerun labeling for that bucket, irrespective of whether the bucket's domain property actually changed (which AIUI is the only case at present where the automatic labels might change). That doesn't necessarily seem bad if we plan to add the ability to auto-label on more than just the domain. But the examples I can think of also depend on the reports in the bucket (e.g. labeling something "Android" if 80% of the reports were on Android, or "Japan" if some reports came from Japan), and I don't think we'd go through this codepath if we were just adding entries to a bucket rather than updating the bucket properties?
The idea is to run this only on bucket creation: the handler has a check below that returns early if the bucket was not just created. We don't really edit a bucket's domain at the moment, so I didn't add labeling for every save/change.
So with this PR there are two ways a bucket can receive labels:
- on creation (if a given source list already exists)
- on domain list creation / update (it runs call_command("label_buckets", source_name=name))
> e.g. if we wanted to label something "Android" if 80% of the reports were on Android or "Japan" if some reports came from Japan
This could probably run in a similar manner: on bucket creation, plus a scheduled run a few times a day over a set of rules that we define in some config?
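To make the two labeling paths described above concrete, here is a minimal in-memory sketch. It is not the PR's code: `Bucket`, `DOMAIN_LISTS`, `label_bucket`, `on_bucket_saved` and `on_domain_list_saved` are hypothetical stand-ins for the Django models, the `post_save` receiver and the `label_buckets` command, and the domains are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    # Hypothetical stand-in for the Bucket model: only what the sketch needs.
    domain: str
    labels: set = field(default_factory=set)


# Hypothetical in-memory stand-in for the domain lists: list name -> domains.
DOMAIN_LISTS = {"nsfw": {"example-adult.test"}}


def label_bucket(bucket, source_name=None):
    """Attach the label of every matching list (or only `source_name`)."""
    names = [source_name] if source_name else list(DOMAIN_LISTS)
    for name in names:
        if bucket.domain in DOMAIN_LISTS.get(name, set()):
            bucket.labels.add(name)


def on_bucket_saved(bucket, created):
    """Path 1: like the receiver, return early unless the bucket is new."""
    if not created:
        return
    label_bucket(bucket)


def on_domain_list_saved(name, domains, buckets):
    """Path 2: a list was created or updated, so relabel buckets for it."""
    DOMAIN_LISTS[name] = set(domains)
    for bucket in buckets:
        label_bucket(bucket, source_name=name)
```

Note that this sketch only ever adds labels; removing a label when a domain drops out of a list is left out.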
I think rules that don't depend on fixed properties of the bucket should probably be applied at the point that the bucket is updated. Making everything async makes the system hard to reason about.
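As a rough illustration of applying report-dependent rules at update time, the sketch below recomputes such labels in the same call that adds a report. Everything here is assumed for the example: the dict-based bucket, the `platform` and `country` keys, the `report_labels` and `add_report` names, and reading "80%" as "at least 80%".

```python
from collections import Counter


def report_labels(reports, android_share=0.8):
    """Derive labels from a bucket's reports (hypothetical rules)."""
    labels = set()
    if not reports:
        return labels
    platforms = Counter(r["platform"] for r in reports)
    # "Android" when at least `android_share` of the reports are on Android.
    if platforms["Android"] / len(reports) >= android_share:
        labels.add("Android")
    # "Japan" when any report came from Japan.
    if any(r["country"] == "JP" for r in reports):
        labels.add("Japan")
    return labels


def add_report(bucket, report):
    """Add a report and recompute report-based labels synchronously."""
    bucket["reports"].append(report)
    bucket["report_labels"] = report_labels(bucket["reports"])
```

Because the labels are recomputed inside `add_report`, a reader of the bucket never sees reports and labels out of step, which is the property a scheduled job would not give.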
source_names = get_label_source_names(source_name)
...
if bucket_id is not None:
    for mapped_source_name in source_names:
It seems like we could make these queries operate over all the labels at once rather than doing them one at a time (but not a blocker).
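One way to read this suggestion is to replace the per-source loop with a single query that filters on all source names at once. The sketch below shows that shape with the standard-library `sqlite3` module; the `label` and `domain_entry` tables, their columns, and `labels_for_domain` are hypothetical and do not reflect the PR's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE label (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE domain_entry (label_id INTEGER, domain TEXT);
INSERT INTO label VALUES (1, 'nsfw'), (2, 'worldcup2026');
INSERT INTO domain_entry VALUES
    (1, 'adult.test'), (2, 'match.test'), (2, 'adult.test');
""")


def labels_for_domain(conn, domain, source_names):
    """Return matching label names for all sources in one query."""
    if not source_names:
        return []
    # One "?" placeholder per source name, so values are still bound safely.
    placeholders = ", ".join("?" for _ in source_names)
    rows = conn.execute(
        "SELECT l.name FROM label l "
        "JOIN domain_entry d ON d.label_id = l.id "
        f"WHERE d.domain = ? AND l.name IN ({placeholders}) "
        "ORDER BY l.name",
        [domain, *source_names],
    )
    return [name for (name,) in rows]
```

In the Django ORM the same idea would typically be expressed with an `__in` lookup instead of one filter per source name.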
# store the domain outside the signature only if the signature includes
# a non-regex domain symptom and no other symptoms (for quick exclusion)
domain: models.CharField = models.CharField(max_length=255, null=True)
domain_normalized: models.CharField = models.CharField(
On the BigQuery side we never started to store this, instead we just have a routine that knows how to make the normalized comparisons, which would make it easier to change things in the future. Storing a normalized domain is probably fine, but it does end up with something that's basically part of the business logic directly in the data layer.
Yeah, I couldn’t come up with a clean way to do comparison-time normalization across both SQLite and MySQL without making the join query pretty awkward, so I decided to store it.
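The trade-off here is between normalizing at write time (stored column, simple equality join) and normalizing at comparison time (logic stays in code). A minimal sketch of the write-time variant follows. The normalization rules shown (trim, lowercase, drop a trailing dot and a leading `www.`) are purely an assumption for illustration, since the thread does not say what the PR's normalization does; `normalize_domain` and `bucket_domain_fields` are hypothetical names.

```python
def normalize_domain(domain):
    """Normalize a domain for comparison (assumed rules, see lead-in)."""
    if domain is None:
        return None
    normalized = domain.strip().lower().rstrip(".")
    if normalized.startswith("www."):
        normalized = normalized[len("www."):]
    return normalized


def bucket_domain_fields(domain):
    """Compute both columns once, at write time, from a single function."""
    return {"domain": domain, "domain_normalized": normalize_domain(domain)}
```

Keeping the rules in one function means a later change only needs a backfill of the stored column rather than edits to every query.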
This PR adds automatic labeling based on the two domain lists we have at the moment (worldcup2026 and nsfw). The labeling command runs when a bucket is created and on each domain list update.