Add models for labels and assign first labels to buckets #219

Open
ksy36 wants to merge 2 commits into main from auto_labeling

Conversation

Collaborator

@ksy36 ksy36 commented May 26, 2026

This PR adds automatic labeling based on the two domain lists we have at the moment (worldcup2026 and nsfw). The labeling command runs when a bucket is created and on each domain list update.

@ksy36 ksy36 marked this pull request as ready for review May 26, 2026 04:09
@ksy36 ksy36 requested a review from jgraham May 26, 2026 04:12
class Label(models.Model):
    name: models.CharField = models.CharField(max_length=50, unique=True)
    description: models.TextField = models.TextField(blank=True, default="")
    domain_source: models.OneToOneField = models.OneToOneField(
Collaborator

So is the idea that this would be NULL if we want to add arbitrary labels that aren't defined by a domain list?

Collaborator Author

@ksy36 ksy36 May 29, 2026

Yeah, it would be NULL for manually created labels. I added it mainly to keep a link to the source in case we ever want to bulk delete or merge labels based on their sources.
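
A minimal sketch of how such a nullable link behaves at the database level, using `sqlite3` from the standard library rather than the Django ORM. Django's `OneToOneField` is a foreign key with a unique constraint, and both SQLite and MySQL allow several NULLs in a unique column, so any number of manual labels can coexist with source-backed ones. The table names, the `domain_source_id` column, and the manual label names (`triaged`, `needs-info`) are assumptions for illustration, not taken from the PR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE domain_list (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE label (
        id INTEGER PRIMARY KEY,
        name TEXT UNIQUE NOT NULL,
        -- NULL means the label was created manually (hypothetical column name)
        domain_source_id INTEGER UNIQUE REFERENCES domain_list (id)
    );
    """
)
conn.executemany(
    "INSERT INTO domain_list (id, name) VALUES (?, ?)",
    [(1, "worldcup2026"), (2, "nsfw")],
)
conn.executemany(
    "INSERT INTO label (name, domain_source_id) VALUES (?, ?)",
    [("worldcup2026", 1), ("nsfw", 2), ("triaged", None), ("needs-info", None)],
)


def delete_labels_for_source(conn, source_name):
    """Bulk delete every label backed by the named domain list."""
    cur = conn.execute(
        "DELETE FROM label WHERE domain_source_id IN "
        "(SELECT id FROM domain_list WHERE name = ?)",
        (source_name,),
    )
    return cur.rowcount


deleted = delete_labels_for_source(conn, "nsfw")
remaining = [row[0] for row in conn.execute("SELECT name FROM label ORDER BY name")]
```

Manual labels are untouched by the bulk delete because their source column is NULL and never matches the subquery.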



@receiver(post_save, sender=Bucket)
def Bucket_save(sender, instance, created, **kwargs):
Collaborator

This means that every time we save any bucket we rerun labeling for that bucket irrespective of whether the domain property on the bucket actually changed (which AIUI is the only case at present where the automatic labels might change). That doesn't necessarily seem bad if we planned to add the ability to auto-label on more than just the domain, but the examples I can think of also depend on the reports in the bucket (e.g. if we wanted to label something "Android" if 80% of the reports were on Android or "Japan" if some reports came from Japan), and I don't think we'd go through this codepath if we were just adding entries to a bucket rather than updating the bucket properties?

Collaborator Author

The idea I had was to run this only on bucket creation (the handler has an "if not created" early return below). We don't really edit the bucket domain at the moment, so I didn't add labeling for every save/change.

So with this PR there are two ways a bucket can receive labels:

  • on creation (if a given source list exists already)
  • on domain list creation / update (it runs call_command("label_buckets", source_name=name))

> e.g. if we wanted to label something "Android" if 80% of the reports were on Android or "Japan" if some reports came from Japan

This could probably be run in a similar manner: on bucket creation, plus a scheduled run a few times a day over a set of rules that we define in some config?
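
The two labeling paths described in this comment can be sketched in plain Python, without Django, roughly as follows. Everything here is a stand-in: `label_buckets` is an ordinary function rather than the PR's management command, and the `Bucket` dataclass, the `BUCKETS` and `DOMAIN_LISTS` dictionaries, and `domain_list_updated` are hypothetical names chosen for illustration. The sketch only adds labels and does not cover removing them.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    id: int
    domain: str
    labels: set = field(default_factory=set)


# Hypothetical in-memory stand-ins for the database tables.
BUCKETS = {}
DOMAIN_LISTS = {}  # source name -> set of domains


def label_buckets(source_name=None, bucket_id=None):
    """Stand-in for the labeling command: label one bucket or all buckets."""
    sources = [source_name] if source_name else list(DOMAIN_LISTS)
    buckets = [BUCKETS[bucket_id]] if bucket_id is not None else list(BUCKETS.values())
    for bucket in buckets:
        for name in sources:
            if bucket.domain in DOMAIN_LISTS.get(name, set()):
                bucket.labels.add(name)


def bucket_save(sender, instance, created, **kwargs):
    """Shaped like a post_save receiver: only newly created buckets are labeled."""
    BUCKETS[instance.id] = instance  # the "save" itself, simulated
    if not created:
        return
    label_buckets(bucket_id=instance.id)


def domain_list_updated(name, domains):
    """Second path: a list is created or updated, so relabel against that source."""
    DOMAIN_LISTS[name] = set(domains)
    label_buckets(source_name=name)


domain_list_updated("nsfw", {"adult.example"})
first = Bucket(1, "adult.example")
bucket_save(Bucket, first, created=True)  # labeled on creation
second = Bucket(2, "fifa.com")
bucket_save(Bucket, second, created=True)  # no matching list exists yet
domain_list_updated("worldcup2026", {"fifa.com"})  # labeled on list update
```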

Collaborator

I think rules that don't depend on fixed properties of the bucket should probably be applied at the point the bucket is updated. Making everything async makes it hard to reason about the system.
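
One way to read this suggestion, as a sketch only: report-dependent rules are re-evaluated synchronously whenever a report is added to a bucket, so labels never lag behind the bucket contents. The `Report` and `Bucket` dataclasses, the `rule_labels` and `add_report` names, the 80% threshold, and the `"JP"` country code are all assumptions based on the examples in this thread, not code from the PR.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Report:
    os: str
    country: str


@dataclass
class Bucket:
    reports: list = field(default_factory=list)
    rule_labels: set = field(default_factory=set)


def rule_labels(reports):
    """Hypothetical rules: 'Android' at >= 80% share, 'Japan' if any report is from JP."""
    labels = set()
    if not reports:
        return labels
    os_counts = Counter(r.os for r in reports)
    if os_counts["Android"] / len(reports) >= 0.8:
        labels.add("Android")
    if any(r.country == "JP" for r in reports):
        labels.add("Japan")
    return labels


def add_report(bucket, report):
    """Append a report and recompute the rule-based labels in the same step."""
    bucket.reports.append(report)
    bucket.rule_labels = rule_labels(bucket.reports)


bucket = Bucket()
for _ in range(4):
    add_report(bucket, Report(os="Android", country="US"))
labels_after_android = set(bucket.rule_labels)

add_report(bucket, Report(os="Windows", country="JP"))
add_report(bucket, Report(os="Windows", country="US"))
labels_after_mixed = set(bucket.rule_labels)
```

Because the labels are recomputed from the full report list, a label such as "Android" is dropped again once the share falls below the threshold.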

Comment thread server/reportmanager/management/commands/label_buckets.py Outdated
Comment thread server/reportmanager/management/commands/label_buckets.py Outdated
source_names = get_label_source_names(source_name)

if bucket_id is not None:
    for mapped_source_name in source_names:
Collaborator

It seems like we could make these queries operate over all the labels at once rather than doing them one at a time (but not a blocker).
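
A rough illustration of the batched form, again with `sqlite3` instead of the ORM: a single join with an `IN` clause returns the matches for every source at once, instead of issuing one query per source name. The `bucket` and `list_domain` tables, their columns, and `matches_for_sources` are hypothetical and only stand in for whatever the command actually queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE bucket (id INTEGER PRIMARY KEY, domain_normalized TEXT);
    CREATE TABLE list_domain (source_name TEXT, domain TEXT);
    """
)
conn.executemany(
    "INSERT INTO bucket VALUES (?, ?)",
    [(1, "fifa.com"), (2, "example.org"), (3, "adult.example")],
)
conn.executemany(
    "INSERT INTO list_domain VALUES (?, ?)",
    [("worldcup2026", "fifa.com"), ("nsfw", "adult.example")],
)


def matches_for_sources(conn, source_names):
    """Return (bucket_id, source_name) pairs for all given sources in one query."""
    if not source_names:
        return []
    # Only the placeholders are interpolated; the values are bound as parameters.
    placeholders = ", ".join("?" for _ in source_names)
    sql = f"""
        SELECT b.id, d.source_name
        FROM bucket AS b
        JOIN list_domain AS d ON d.domain = b.domain_normalized
        WHERE d.source_name IN ({placeholders})
        ORDER BY b.id
    """
    return conn.execute(sql, list(source_names)).fetchall()


pairs = matches_for_sources(conn, ["worldcup2026", "nsfw"])
```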

# store the domain outside the signature only if the signature includes
# a non-regex domain symptom and no other symptoms (for quick exclusion)
domain: models.CharField = models.CharField(max_length=255, null=True)
domain_normalized: models.CharField = models.CharField(
Collaborator

On the BigQuery side we never started to store this; instead we just have a routine that knows how to make the normalized comparisons, which makes it easier to change things in the future. Storing a normalized domain is probably fine, but it does end up putting something that's basically business logic directly in the data layer.

Collaborator Author

@ksy36 ksy36 May 29, 2026

Yeah, I couldn’t come up with a clean way to do comparison-time normalization across both SQLite and MySQL without making the join query pretty awkward, so I decided to store it.
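
To make the trade-off concrete, here is a sketch of what a store-time normalization helper might look like. The rules shown (trim, lowercase, drop a trailing dot and a leading `www.`) are assumptions for illustration only; the PR's actual normalization is not shown in this thread, and `normalize_domain` is a hypothetical name.

```python
def normalize_domain(domain):
    """Hypothetical normalization applied once, when the bucket is saved."""
    if domain is None:
        return None
    host = domain.strip().lower().rstrip(".")
    if host.startswith("www."):
        host = host[len("www."):]
    # Treat an empty result the same as a missing domain.
    return host or None


class Bucket:
    def __init__(self, domain):
        self.domain = domain
        # Stored alongside the raw value so joins can use plain equality.
        self.domain_normalized = normalize_domain(domain)


bucket = Bucket("WWW.Example.com.")
```

Storing the result keeps the join a simple equality on both database backends, at the cost that changing the rules later requires rewriting the stored column.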

@ksy36 ksy36 requested a review from jgraham May 30, 2026 01:42



2 participants