Abstract: Systems and methods for video representation learning using triplet training are provided. The system receives a video file and extracts features associated with the video file, such as video features, audio features, and valence-arousal-dominance (VAD) features. The system processes the video features, audio features, and VAD features using a hierarchical attention network to generate a video embedding, an audio embedding, and a VAD embedding, respectively. The system concatenates the video embedding, the audio embedding, and the VAD embedding to create a concatenated embedding. The system processes the concatenated embedding using a non-local attention network to generate a fingerprint associated with the video file. The system then processes the fingerprint to generate one or more of a mood prediction, a genre prediction, and a keyword prediction.
Type:
Application
Filed:
August 23, 2023
Publication date:
February 29, 2024
Applicant:
Vionlabs AB
Inventors:
Alden Coots, Rithika Harish Kumar, Paula Diaz Benet, Marcus Bergström
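The fusion stage described in the abstract (concatenate per-modality embeddings, apply non-local attention, derive a fingerprint, then predict labels) can be sketched roughly as below. Every shape, weight, and the simplified self-attention form here are illustrative assumptions, not the patented implementation; the hierarchical attention network that produces the per-modality embeddings is stubbed out with random vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # illustrative embedding width; the abstract does not specify one

# Stand-ins for the hierarchical-attention outputs per modality
video_emb = rng.standard_normal(D)
audio_emb = rng.standard_normal(D)
vad_emb = rng.standard_normal(D)

# Concatenate the three modality embeddings, as the abstract describes;
# here stacked as a (3, D) sequence so attention can mix modalities
concat = np.stack([video_emb, audio_emb, vad_emb])

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def non_local_attention(x):
    # Simplified non-local (self-attention) block: each position attends
    # to every position, with weights from scaled pairwise dot products
    attn = softmax(x @ x.T / np.sqrt(x.shape[1]))
    return attn @ x

# Flatten the attended sequence into a single fingerprint vector
fingerprint = non_local_attention(concat).reshape(-1)  # shape (3 * D,)

# Hypothetical prediction head: mood scores from the fingerprint
W_mood = rng.standard_normal((fingerprint.size, 4))
mood_scores = softmax(fingerprint @ W_mood)  # probabilities over 4 moods
```

Genre and keyword predictions would follow the same pattern with their own heads; in the patented system these heads would be trained jointly, with the triplet loss shaping the fingerprint space.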