Metadata class wrapper for .metadata.json #597

Open
dale-wahl wants to merge 22 commits into master from metadata_class

@dale-wahl
Member

This pull request introduces a new, unified interface for reading and writing media archive metadata across processors, and refactors several processors to use this interface. The changes improve code maintainability, consistency, and reliability when working with media metadata. Additionally, a new MediaArchiveLibrary class is added to facilitate efficient reuse of previously downloaded media files. Several processors are updated to use the new metadata methods, and error handling is improved throughout.

Unified media metadata handling and library:

  • Added read_media_metadata and new_media_metadata methods to the DataSet class in common/lib/dataset.py, providing a standard way to read and write media archive metadata using the MediaArchiveMetadata class. This ensures all processors interact with metadata in a consistent manner.
  • Introduced the MediaArchiveLibrary class in common/lib/media_archive_library.py, which aggregates metadata from previous downloader runs, allowing processors to efficiently check for and reuse previously downloaded media files.
  • Added the MetadataException class to common/lib/exceptions.py for more precise error handling when working with metadata files.
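To illustrate the reuse idea behind the library, here is a minimal sketch, not the code in this PR. The on-disk layout (a JSON object mapping each source URL to an entry with a "files" list), the helper read_metadata, and the class MediaLibrarySketch with its methods are all assumptions made up for this example. MetadataException is redefined locally as a stand-in for the class added to common/lib/exceptions.py.

```python
import json
from pathlib import Path


class MetadataException(Exception):
    """Local stand-in for the exception added in common/lib/exceptions.py."""


def read_metadata(path):
    """Load a metadata JSON file; any read or parse problem becomes MetadataException.

    Hypothetical helper: the real PR exposes this through DataSet methods.
    """
    try:
        with Path(path).open(encoding="utf-8") as infile:
            data = json.load(infile)
    except (OSError, json.JSONDecodeError) as e:
        raise MetadataException(f"Cannot read metadata file {path}: {e}") from e
    if not isinstance(data, dict):
        raise MetadataException(f"Unexpected metadata layout in {path}")
    return data


class MediaLibrarySketch:
    """Hypothetical index of previously downloaded files, keyed by source URL."""

    def __init__(self):
        self._by_url = {}

    def add_run(self, folder):
        """Index one earlier downloader run, skipping files that no longer exist."""
        folder = Path(folder)
        # Assumed layout: {"<url>": {"files": ["<filename>", ...]}, ...}
        metadata = read_metadata(folder / ".metadata.json")
        for url, entry in metadata.items():
            if not isinstance(entry, dict):
                continue
            for filename in entry.get("files", []):
                candidate = folder / filename
                if candidate.is_file():
                    # Keep the first copy found; later runs do not override it
                    self._by_url.setdefault(url, candidate)

    def find(self, url):
        """Return the path of a reusable file for this URL, or None."""
        return self._by_url.get(url)
```

A downloader could call find(url) before fetching and copy the returned file instead of downloading again. Checking is_file() at index time means a stale metadata entry never yields a path to a file that has since been removed.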

Notes and findings

  • The MediaArchiveLibrary replaces the DataSetVideoLibrary and is still used only by the video downloader, but it could presumably be used by any downloader.
  • I noticed a few other metadata-like files: the tokenizer and the topic modeller each have one, and there is also a metadata file created when extracting full DataSets. None of these really fits the media metadata format, though, so instead I made a base class that we could extend to accommodate them if desired.
  • This refactors a lot of processors, and I tried to prevent repetitive code (yay!). One thing I noticed we do a lot is match on the filename stem instead of the full filename. This could lead to collisions (for example, ffmpeg creates logs with the same filename stem, or you could have two images with different extensions), but it is actually a downstream problem. For example, the DMI Service Manager uses the stem, renames the result to stem.json, and returns only that, with no proper mapping to verify against. This was already a potential issue; I am just flagging it here.
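The stem collision flagged in the last note can be reproduced with a short sketch. The function names and the file list are invented for illustration and do not come from the codebase; only pathlib's stem behaviour is relied on.

```python
from collections import defaultdict
from pathlib import Path


def group_by_stem(filenames):
    """Group filenames by their stem (the name without its final extension)."""
    groups = defaultdict(list)
    for name in filenames:
        groups[Path(name).stem].append(name)
    return dict(groups)


def stem_collisions(filenames):
    """Return only the stems shared by more than one file."""
    return {
        stem: names
        for stem, names in group_by_stem(filenames).items()
        if len(names) > 1
    }


# Hypothetical archive contents: two images with different extensions,
# plus a video and a log file that share a stem.
files = ["cat.jpg", "cat.png", "clip.mp4", "clip.log", "dog.gif"]
print(stem_collisions(files))
# {'cat': ['cat.jpg', 'cat.png'], 'clip': ['clip.mp4', 'clip.log']}
```

Anything that maps a result file such as cat.json back to its source by stem alone cannot tell cat.jpg from cat.png, which is why matching on the full filename is safer wherever the downstream tool allows it.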

@dale-wahl dale-wahl marked this pull request as ready for review May 21, 2026 15:47