Metadata class wrapper for .metadata.json by dale-wahl · Pull Request #597 · digitalmethodsinitiative/4cat

dale-wahl · 2026-05-21T13:45:34Z

This pull request introduces a new, unified interface for reading and writing media archive metadata across processors, and refactors several processors to use this interface. The changes improve code maintainability, consistency, and reliability when working with media metadata. Additionally, a new MediaArchiveLibrary class is added to facilitate efficient reuse of previously downloaded media files. Several processors are updated to use the new metadata methods, and error handling is improved throughout.

Unified media metadata handling and library:

Added read_media_metadata and new_media_metadata methods to the DataSet class in common/lib/dataset.py, providing a standard way to read and write media archive metadata using the MediaArchiveMetadata class. This ensures all processors interact with metadata in a consistent manner.
Introduced the MediaArchiveLibrary class in common/lib/media_archive_library.py, which aggregates metadata from previous downloader runs, allowing processors to efficiently check for and reuse previously downloaded media files.
Added the MetadataException class to common/lib/exceptions.py for more precise error handling when working with metadata files.
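As a rough illustration of the read/write pattern described above, here is a minimal, self-contained sketch. It is not the PR's code: `read_metadata` and `write_metadata` are hypothetical stand-ins for the new DataSet methods, whose real signatures are not shown here, and the local `MetadataException` only mimics the one added to common/lib/exceptions.py. It assumes the metadata file is a JSON object, that bad JSON should surface as a `MetadataException`, and that writes go through a temporary file so a failed write leaves nothing behind.

```python
import json
import os
import tempfile
from pathlib import Path


class MetadataException(Exception):
    """Hypothetical stand-in for the exception added in common/lib/exceptions.py."""


def read_metadata(path):
    """Read a metadata JSON file; raise MetadataException if it is unreadable or malformed."""
    try:
        with Path(path).open(encoding="utf-8") as infile:
            data = json.load(infile)
    except (OSError, json.JSONDecodeError) as e:
        raise MetadataException(f"Cannot read metadata from {path}: {e}") from e

    if not isinstance(data, dict):
        raise MetadataException(f"Metadata in {path} is not a JSON object")
    return data


def write_metadata(path, data):
    """Write metadata to a temporary file first, then move it into place.

    If serialisation fails, the temporary file is removed and the
    destination is left untouched.
    """
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as outfile:
            json.dump(data, outfile)
        os.replace(tmp_name, path)
    except Exception:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A processor could then wrap `read_metadata` in a single `try`/`except MetadataException` instead of handling file and JSON errors separately.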

Notes and findings

The MediaArchiveLibrary replaced the DataSetVideoLibrary and is still only used by the video downloader, BUT could presumably be used by any downloader.
I noticed a few other metadata-like files: the tokenizer and the topic modeller each have one, and a metadata file is also created when extracting full DataSets. None of these really fits the media metadata format, though, so instead I made a base class that we could extend to accommodate them if desired.
This PR refactors a lot of processors, and I tried to prevent repetitive code (yay!). One thing I noticed we do a lot is match on the filename stem instead of the full filename. This could cause collisions (for example, ffmpeg creates logs with the same filename stem, or you could have two images with different extensions), but it is actually a downstream problem. For example, the DMI Service Manager will use the stem, rename to stem.json, and return only that, with no proper mapping to verify against. This was already a potential issue; I am just flagging it here.
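To make the stem-collision concern concrete, here is a small sketch using only the standard library. The helper name `find_stem_collisions` and the example filenames are made up for illustration; nothing here comes from the 4CAT codebase.

```python
from collections import defaultdict
from pathlib import Path


def find_stem_collisions(filenames):
    """Group filenames by stem and return only the stems shared by more than one file."""
    by_stem = defaultdict(list)
    for name in filenames:
        by_stem[Path(name).stem].append(name)
    return {stem: names for stem, names in by_stem.items() if len(names) > 1}


# Hypothetical archive contents: two images with different extensions,
# and a video alongside a log file with the same stem.
files = ["cat.jpg", "cat.png", "dog.mp4", "dog.log", "bird.gif"]
collisions = find_stem_collisions(files)
print(collisions)
# → {'cat': ['cat.jpg', 'cat.png'], 'dog': ['dog.mp4', 'dog.log']}
```

Any lookup keyed on the stem alone cannot tell the files in each group apart, which is why matching on the full filename is safer wherever the downstream tool allows it.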

dale-wahl added 22 commits May 20, 2026 15:40

create metadata class

44d8d1f

dataset helpers and exceptions

60b8ba1

add some tests...

b12a560

resolve commit once

a6e6551

fix my tests

729e53f

image_downloader: create a metadata file

fa29968

image_download map_metadata and use processor

2308250

fix up video downloaders

ae3ab42

download_videos: wire up the new metadata to the DataSetVideoLibrary …

6762fb0

…to copy existing vids

update hash_images

90daf3a

unique_images: use metadata class

e651e29

pixplot metadata

f3ea70f

audio extractors metadata

c126d64

image processors metadata

838f075

more image processors

13d8978

llm: metadata is now way easier

2d59ed4

video processors metadata

b62f615

catch bad JSON

5952b88

make sure all our post_ids are strings

fbe335a

don't leave a tmp metadata file on fail

fc2e376

replace metadata if downloaded again

1e683ad

fix: ensure no data lost from old metadata (add it all to extra)

635f87c

dale-wahl marked this pull request as ready for review May 21, 2026 15:47

dale-wahl wants to merge 22 commits into master from metadata_class
