Dear authors,
Thanks for your interesting paper!
I see you mention in your paper in section B.2 of the appendix that you train an attention pooling head for evaluation.
At the same time, I see that the model you released on Hugging Face has a clip_projector module (AttentionPoolingBlock) attached to it, and in the model's forward you set projected=True (HF code here).
What I would like to know is:
1. On what data was this released clip_projector trained?
2. If I want to evaluate your model on action recognition, should I:
   a. train the attentive probe from scratch on the target dataset, or
   b. finetune the attentive probe on the target dataset (if so, from which weights)?
3. Which setup did you use in your paper, 2.a., 2.b., or neither?
You released some action recognition code for InternVideo2, and for linear probing / attentive probing there is this --open_clip_projector parameter, which controls whether the head is finetuned or not. But it doesn't say on which data the head was trained before the finetuning.
You mention in the paper in section 4.2 (Video Classification):

> We test the model in an 'Attentive Probing' setting where the encoders are frozen and a single-layer attention pooling head is trained. Such Frozen Encoder settings can test representation's quality in an unbiased way. Our methods achieve the best results with only public data and less computation cost on these foundation tasks.
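For context, my understanding of that 'Attentive Probing' setup is something like the sketch below: the backbone stays frozen and only a single-layer attention-pooling head plus classifier is trained. All class and variable names here are my own illustration, not your actual code, so please correct me if the released clip_projector differs from this.

```python
# Minimal sketch of attentive probing, assuming a frozen token encoder.
# Names (AttentivePoolingProbe, embed_dim, etc.) are hypothetical.
import torch
import torch.nn as nn


class AttentivePoolingProbe(nn.Module):
    def __init__(self, embed_dim: int, num_classes: int, num_heads: int = 8):
        super().__init__()
        # A single learnable query token pools over all patch/frame tokens.
        self.query = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, embed_dim) from the frozen encoder.
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)  # (batch, 1, embed_dim)
        return self.head(self.norm(pooled.squeeze(1)))


# Only the probe's parameters would receive gradients; the encoder is frozen:
# for p in encoder.parameters():
#     p.requires_grad = False
```

If this matches your setup, then my question reduces to whether the query/attention/head weights above were initialized randomly per evaluation dataset (2.a.) or loaded from the released clip_projector (2.b.).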
What would make sense to me is that you're no longer using the InternVideo2 evaluation script and are instead training the attentive probe from scratch (2.a. above). But if so, I'd like to know from which evaluation experiment the weights of the released clip_projector were obtained.
Thanks in advance for your clarifications!