At the beginning of this process, the data is transposed so that T is the last dimension of the tensor. I'm not sure how this works for 1D data and spectrograms, but for videos the input shape is (T,H,W) while the output comes back as (W,H,T), which leads to an error in training.
As a band-aid fix, I transpose the output around lines 712 and 744 of the data loader, but we need a more permanent fix.
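To illustrate what a more permanent fix might look like, here is a minimal NumPy sketch. It assumes the current code reverses all axes (which matches the (T,H,W) to (W,H,T) shapes reported above, but I have not confirmed that this is what the data loader actually does). The helper names `time_last` and `time_first` and the example shape are hypothetical, not taken from the code base.

```python
import numpy as np


def time_last(x):
    """Hypothetical helper: move the leading time axis to the end,
    keeping the remaining axes in their original order."""
    return np.moveaxis(x, 0, -1)


def time_first(x):
    """Hypothetical helper: exact inverse of time_last."""
    return np.moveaxis(x, -1, 0)


# Example video with a made-up shape (T, H, W) = (8, 32, 48).
video = np.arange(8 * 32 * 48).reshape(8, 32, 48)

# A full transpose reverses every axis: (T, H, W) -> (W, H, T).
# Assumption: this is the behavior behind the shape seen in training.
reversed_axes = video.T

# Moving only the time axis gives (H, W, T) and is cleanly invertible.
moved = time_last(video)
restored = time_first(moved)

print(reversed_axes.shape, moved.shape, restored.shape)
```

For 1D and 2D arrays, reversing all axes and moving only the first axis to the end give the same result, which may be why the problem only shows up for videos. If the loader works on PyTorch tensors instead, `torch.movedim` should play the same role as `np.moveaxis`.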