Videos
Due to the memory constraints, we have to do some down-sampling
Raw video: long high fps
Training: short clips with low fps
testing: run model on different clips, averaging predictions
Single-frame CNN
train normal 2D CNN to classify frames independently!
easy but a very strong baseline!
Late fusion
take the time axis into account
(flatten / average pooling) concatenate the results of CNNs and feed to MLP to get a classification score
Problem: Hard to compare low-level motion between frames
Early Fusion
compare frames with very first conv layer after that normal 2D CNN
then pass a 2D CNN to get class score
Problem: only one layer of the temporal processing may be not enough