Add per-frame timestamp embedding to the VLM video path #128
Open
amazloumi wants to merge 3 commits into
bcb435f to 1ea9c4b
Pull request overview
This PR adds absolute per-frame timestamp conditioning to the VLM video pathway by propagating decoded frame presentation times through the data pipeline and injecting a registry-configurable, zero-initialized time embedding into each frame’s visual tokens.
Changes:
- `decode_video_frames` now returns `(frames, times)`; datasets/collator propagate `frame_times` as `(F,)` / `(B, F)` and training threads it into the model forward.
- Introduces a registry-driven time-embedding module (`FrameTimeEmbedding` default; `"none"` disables) and applies it per-frame in the VLM visual-token projection (video-only).
- Ensures distributed/FSDP build paths materialize and shard the new submodule; adds unit coverage for config, embedding behavior, and build wiring.
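The `(F,)` to `(B, F)` propagation in the first change can be illustrated with a small framework-free sketch. It is not the PR's `VideoCollator`, which presumably operates on tensors; the function name `collate_frame_times` and the `pad_value` convention are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def collate_frame_times(
    batch: Sequence[Sequence[float]], pad_value: float = 0.0
) -> List[List[float]]:
    """Stack per-sample (F,) timestamp lists into a rectangular (B, F_max) list.

    Clips with fewer frames are right-padded with pad_value. The padding
    value is an assumption of this sketch, not taken from the PR.
    """
    if not batch:
        return []
    max_frames = max(len(times) for times in batch)
    return [
        list(times) + [pad_value] * (max_frames - len(times))
        for times in batch
    ]


# Two clips: one with three decoded frames, one with two.
batch = [[0.0, 0.5, 1.0], [0.0, 0.4]]
stacked = collate_frame_times(batch)
```

A real collator would typically also return a mask so that padded positions are distinguishable from a genuine timestamp equal to the pad value.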
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/unit/test_vlm.py | Adds unit tests ensuring video wrappers attach the module, image wrappers do not, and forward wiring/shape checks behave as expected. |
| tests/unit/test_video_io.py | Updates tests for (frames, times) return and basic timestamp properties. |
| tests/unit/test_video_dataset.py | Updates dataset/collator tests to validate frame_times padding/stacking behavior. |
| tests/unit/test_time_embedding_config.py | Adds coverage for TimeEmbeddingConfig defaults, validation, and kwargs. |
| tests/unit/test_frame_time.py | Adds coverage for embedding shape, zero-init behavior, gradient flow, dtype behavior, and registry builder. |
| tests/unit/test_distributed.py | Verifies distributed build attaches/casts frame_time_embed for video and omits it for images. |
| scripts/train.py | Threads time_embedding_config into model build and passes frame_times into VLM forward. |
| kempnerforge/model/vlm.py | Adds frame_times plumbing and applies per-frame timestamp embeddings in _project_visual_features. |
| kempnerforge/model/frame_time.py | Introduces TimeEmbedding interface, FrameTimeEmbedding, and registry-driven build_time_embedding. |
| kempnerforge/distributed/parallel.py | Builds/materializes/casts frame_time_embed in meta/CPU paths and FSDP-shards it when present. |
| kempnerforge/data/video_io.py | Changes decode_video_frames to also return matched presentation timestamps. |
| kempnerforge/data/video_dataset.py | Emits frame_times per sample and stacks it in VideoCollator. |
| kempnerforge/config/time_embedding.py | Adds [time_embedding] config with validation and builder kwargs. |
| kempnerforge/config/schema.py | Exposes TimeEmbeddingConfig in the config schema surface. |
| kempnerforge/config/registry.py | Adds time-embedding registry hooks (register/get/list_time_embedding). |
| kempnerforge/config/job.py | Adds optional time_embedding field to JobConfig. |
| docs/how-to/train-on-video.md | Documents per-frame timestamp embedding and registry semantics. |
| CHANGELOG.md | Records the new per-frame timestamp feature and affected components. |
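The registry behavior summarized in the table (register/get/list hooks, a builder, and `"none"` disabling the module) might look roughly like the sketch below. The hook names come from the table; their signatures, the `"frame_time"` registry key, and the `DummyFrameTimeEmbedding` class are assumptions made for illustration and are not the PR's code.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical registry: maps a config name to a builder callable.
_TIME_EMBEDDINGS: Dict[str, Callable[..., object]] = {}


def register_time_embedding(name: str):
    """Decorator that records a builder under `name` (assumed signature)."""
    def decorator(builder):
        if name in _TIME_EMBEDDINGS:
            raise ValueError(f"time embedding {name!r} is already registered")
        _TIME_EMBEDDINGS[name] = builder
        return builder
    return decorator


def list_time_embedding() -> List[str]:
    """Return the registered names in sorted order."""
    return sorted(_TIME_EMBEDDINGS)


def get_time_embedding(name: str) -> Callable[..., object]:
    """Look up a builder, failing with the list of known names."""
    try:
        return _TIME_EMBEDDINGS[name]
    except KeyError:
        known = list_time_embedding()
        raise KeyError(f"unknown time embedding {name!r}; known: {known}") from None


def build_time_embedding(name: str, **kwargs) -> Optional[object]:
    """Build the configured module; the sentinel "none" disables it."""
    if name == "none":
        return None
    return get_time_embedding(name)(**kwargs)


@register_time_embedding("frame_time")  # key name is an assumption
class DummyFrameTimeEmbedding:
    """Stand-in for the real module; it only stores its width."""

    def __init__(self, dim: int) -> None:
        self.dim = dim


disabled = build_time_embedding("none")
enabled = build_time_embedding("frame_time", dim=8)
```

Returning `None` for `"none"` matches the description of the image path, where no module is attached, so callers can guard on the module being present.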
Summary
- `FrameTimeEmbedding` (`kempnerforge/model/frame_time.py`): sinusoidal features of a frame's timestamp (seconds) at log-spaced periods → a zero-initialized projection (identity at step 0).
- `decode_video_frames` returns `(frames, times)`; `WebVidVideoDataset` emits `frame_times` `(F,)`, `VideoCollator` stacks to `(B, F)`.
- Applied per frame in `_project_visual_features` as a `VLMWrapper` sibling submodule (video only; `None` for the image path); built + FSDP-sharded + meta-materialized at both build sites.
- `scripts/train.py` threads `frame_times`; docs + CHANGELOG updated.

Testing

- `uv run ruff check kempnerforge/ tests/` passes
- `uv run ruff format --check kempnerforge/ tests/ scripts/` passes
- `uv run pyright kempnerforge/` passes (0 errors)
- `uv run pytest tests/unit/ -v --timeout=60` passes (1527 passed, 2 skipped)
- `uv run torchrun --nproc_per_node=4 -m pytest tests/distributed/ -v` (running now)
- `vlm_video_webvid.toml` (random encoder): trains; the +33,792 parameter increase confirms the module is sharded/trainable

Closes #127
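For readers unfamiliar with the design in the summary, here is a minimal pure-Python sketch of sinusoidal timestamp features at log-spaced periods followed by a zero-initialized projection. It is not the PR's `FrameTimeEmbedding`: the helper names, the period range (0.1 s to 100 s), the number of periods, and the additive injection into a token are all assumptions for illustration.

```python
import math
from typing import List


def log_spaced_periods(num_periods: int, min_period: float, max_period: float) -> List[float]:
    """Periods spaced geometrically between min_period and max_period (seconds)."""
    if num_periods == 1:
        return [min_period]
    ratio = max_period / min_period
    return [min_period * ratio ** (i / (num_periods - 1)) for i in range(num_periods)]


def time_features(t_seconds: float, periods: List[float]) -> List[float]:
    """One (sin, cos) pair per period, evaluated at the frame timestamp."""
    feats: List[float] = []
    for period in periods:
        angle = 2.0 * math.pi * t_seconds / period
        feats.extend((math.sin(angle), math.cos(angle)))
    return feats


def project(features: List[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """Plain linear layer: weight has shape (dim, num_features)."""
    return [sum(w * x for w, x in zip(row, features)) + b for row, b in zip(weight, bias)]


periods = log_spaced_periods(4, 0.1, 100.0)  # assumed range, not from the PR
feats = time_features(2.5, periods)           # frame presented at t = 2.5 s

dim = 6                                       # toy token width
weight = [[0.0] * len(feats) for _ in range(dim)]  # zero-initialized projection
bias = [0.0] * dim
delta = project(feats, weight, bias)

token = [0.3, -1.2, 0.7, 0.0, 2.0, -0.5]      # a visual token for that frame
conditioned = [x + d for x, d in zip(token, delta)]  # additive injection (assumed)
```

Because the projection starts at zero, `conditioned` equals `token` at initialization, which is one way to read "identity at step 0": the visual tokens are unchanged until the projection weights receive gradient updates.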