Realistic test data: factory_boy fixtures + curated prod media subset #1272

@jonfroehlich

Context

Spun off from #1268, which originally bundled a REST API with test-data sync. The actual driver was getting realistic content into localhost so that UI/template changes can be reviewed against prod-like data. This issue tracks that work as a standalone effort with no API dependency.

Background

Current test infrastructure (landed in #1267) uses DatabaseTestCase + small make_person / make_publication / make_news_item helpers in website/tests.py. These work for unit/integration tests but:

  • Don't compose well for the M2M / SortedManyToManyField / ProjectRole-through graph.
  • Don't include real media files, so image_cropping / easy_thumbnails / PDF preview code paths only ever see empty file fields.
  • Don't produce a coherent dev environment for visual review.

Recommendation — split into two layers

Layer 1: factory_boy + Faker (Django standard)

  • Add factory_boy and Faker to requirements.txt. factory_boy has been the de facto Django standard for ~12 years and is preferred over model_bakery here because explicit factories are easier to debug with our complex relationship graph.

  • Create website/tests/factories.py with PersonFactory, PublicationFactory, ProjectFactory, ProjectRoleFactory, TalkFactory, PosterFactory, VideoFactory, NewsItemFactory, AwardFactory. Pattern:

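    import factory
    from factory.django import DjangoModelFactory
    from website.models import Publication  # assumed import path

    # SEED_PDF_PATH: path to the curated PDF in website/tests/seed_media/ (defined elsewhere).

    class PublicationFactory(DjangoModelFactory):
        class Meta:
            model = Publication

        title = factory.Faker("sentence", nb_words=8)
        date = factory.Faker("date_between", start_date="-3y", end_date="today")
        pdf_file = factory.django.FileField(from_path=SEED_PDF_PATH)

        @factory.post_generation
        def authors(self, create, extracted, **kwargs):
            # M2M links need a saved row, so skip them for build() instances.
            if create:
                self.authors.set(extracted or PersonFactory.create_batch(3))
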
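  • Refactor the existing make_person / make_publication / make_news_item helpers in website/tests.py into thin wrappers that delegate to the new factories, which keeps existing tests green. Note that a website/tests.py module and a website/tests/ package cannot coexist on the import path, so tests.py would have to move into the package as part of this step.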

  • Add website/tests/seed_media/ with ~5 hand-curated representative files (one real PDF, a few JPGs at different aspect ratios, a project logo). Factory FileFields point at these so every test exercises real thumbnail / PDF / image-cropping code paths.
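
One way to keep the factories honest about their seed files is a small lookup helper that fails loudly when a curated file is missing or empty. This is a minimal sketch, not existing code: `SEED_MEDIA_DIR`, `seed_media_path`, and the file name in the usage note are hypothetical.

```python
from pathlib import Path

# Assumed location of the curated files, relative to the repo root.
SEED_MEDIA_DIR = Path("website/tests/seed_media")


def seed_media_path(name: str, base_dir: Path = SEED_MEDIA_DIR) -> Path:
    """Return the path to a curated seed file, failing loudly if it is unusable."""
    path = base_dir / name
    if not path.is_file():
        raise FileNotFoundError(f"Seed media file not found: {path}")
    if path.stat().st_size == 0:
        # An empty file would defeat the purpose of exercising the
        # thumbnail / PDF / image-cropping code paths.
        raise ValueError(f"Seed media file is empty: {path}")
    return path
```

A factory could then pass something like `str(seed_media_path("sample.pdf"))` as the `from_path` argument of `factory.django.FileField`, so a missing file surfaces as a clear error rather than as an empty file field.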

Layer 2: prod media subset + seed_dev_data command

  • Add python manage.py seed_dev_data, which uses the factories to build a coherent graph (~5 projects, ~15 people, ~20 pubs with cross-links, ~5 talks/posters, news items, awards). It should be idempotent, so re-running it does not duplicate rows. Run it after docker-compose up for a realistic-shaped site with zero prod dependency.
  • For prod-realistic media (when you actually want to see what real content looks like): add scripts/pull_prod_subset.sh that rsyncs a curated subset from recycle.
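
Idempotency for seed_dev_data is easier to reason about if the shape of the graph is planned deterministically before any rows are created. The sketch below uses only the standard library. `plan_seed_graph`, the key naming scheme, and the default seed are hypothetical, and a real command would hand each planned entry to the factories (for instance by looking rows up on a stable key with Django's `get_or_create`) rather than return dicts.

```python
import random


def plan_seed_graph(seed=1272, n_projects=5, n_people=15, n_pubs=20):
    """Plan a small, cross-linked content graph with stable keys.

    The same seed always yields the same plan, so a seeding command can
    look rows up by key and skip the ones that already exist.
    """
    rng = random.Random(seed)  # local RNG: leaves the global random state alone
    people = [f"person-{i:02d}" for i in range(1, n_people + 1)]
    projects = [f"project-{i:02d}" for i in range(1, n_projects + 1)]

    publications = []
    for i in range(1, n_pubs + 1):
        publications.append({
            "key": f"pub-{i:02d}",
            # 2-4 distinct authors; list order is part of the plan, which
            # matters for order-preserving fields such as SortedManyToManyField.
            "authors": rng.sample(people, k=rng.randint(2, 4)),
            # Each publication is cross-linked to exactly one project.
            "project": rng.choice(projects),
        })
    return {"people": people, "projects": projects, "publications": publications}
```

Because the plan depends only on the seed and the counts, two runs of the command would try to create the same keys, which is what makes a lookup-before-create step sufficient for idempotency.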

Prod media size analysis (2026-06-14)

Total media tree at /cse/web/research/makelab/www/media on recycle: 56 GB / 13,130 files.

Folder          Size
talks/          50 GB
publications/   1.7 GB
banner/         1.3 GB
person/         837 MB
posters/        719 MB
projects/       611 MB
news/           738 MB
uploads/        323 MB
others          ~120 MB

Whole-tree rsync is impractical (talks alone is 50 GB, mostly slide PDFs/videos). But person/ + projects/ + publications/ + posters/ ≈ 3.9 GB is the visual core of the site and is realistic to mirror.
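
The subset figure can be checked against the table, and the same folder list can drive the rsync calls. This is a sketch under assumptions: it is written in Python rather than as the proposed shell script, the names `SUBSET`, `subset_size_gb`, and `rsync_command` and the local `media` destination are hypothetical, and the commands are only built, not executed.

```python
# Folder sizes from the table above, in MB. Treats 1 GB as 1000 MB; the
# table values are rounded, so the total is approximate.
SIZES_MB = {
    "talks": 50_000, "publications": 1_700, "banner": 1_300, "person": 837,
    "posters": 719, "projects": 611, "news": 738, "uploads": 323,
}
SUBSET = ["person", "projects", "publications", "posters"]

REMOTE_HOST = "recycle"
REMOTE_MEDIA = "/cse/web/research/makelab/www/media"


def subset_size_gb(folders=SUBSET):
    """Total size of the selected folders in GB."""
    return sum(SIZES_MB[name] for name in folders) / 1000


def rsync_command(folder, dest="media", dry_run=True):
    """Build (but do not run) an rsync argv for one media folder."""
    cmd = ["rsync", "-az", "--partial"]
    if dry_run:
        cmd.append("--dry-run")  # report what would be transferred, copy nothing
    # Trailing slashes sync the folder's contents into the same-named local folder.
    cmd += [f"{REMOTE_HOST}:{REMOTE_MEDIA}/{folder}/", f"{dest}/{folder}/"]
    return cmd


print(f"{subset_size_gb():.3f} GB")  # 837 + 611 + 1700 + 719 MB = 3.867 GB
for folder in SUBSET:
    print(" ".join(rsync_command(folder)))
```

Defaulting to a dry run keeps the first invocation harmless. Once the listing looks right, each argv could be passed to `subprocess.run(cmd, check=True)` with `dry_run=False`.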

Optional refinement: pair the rsync with a dumpdata JSON of the matching DB rows so the local DB references real filenames. Gitignore both outputs.

Suggested first PR slice

  1. Add factory_boy + Faker to requirements.txt.
  2. Create website/tests/factories.py with core factories.
  3. Add website/tests/seed_media/ with curated files.
  4. Migrate existing make_* helpers to delegate to factories.

Subsequent PRs: seed_dev_data management command, then pull_prod_subset.sh.

Out of scope (parked in #1268)

REST API for external consumers. The conversation concluded that the only known consumer is Jon's academic page ("Recent Pubs" list), which can be served by a single JsonResponse view rather than a full DRF API.
