Video2vec Embeddings Recognize Events When Examples Are Scarce.

Authors
  • Habibian, Amirhossein
  • Mensink, Thomas
  • Snoek, Cees G M
Type
Published Article
Journal
IEEE Transactions on Pattern Analysis and Machine Intelligence
Publication Date
Oct 01, 2017
Volume
39
Issue
10
Pages
2089–2103
Identifiers
DOI: 10.1109/TPAMI.2016.2627563
PMID: 27849523
Source
Medline
License
Unknown

Abstract

This paper addresses event recognition when video examples are scarce or even completely absent. The key in such a challenging setting is a semantic video representation. Rather than building the representation from individual attribute detectors and their annotations, we propose to learn the entire representation from freely available web videos and their descriptions, using an embedding between video features and term vectors. In our proposed embedding, which we call Video2vec, the correlations between words are exploited to learn a more effective representation by optimizing a joint objective that balances descriptiveness and predictability. We show that learning the Video2vec embedding with a multimodal predictability loss, which includes appearance, motion, and audio features, yields a more predictable representation. We also propose an event-specific variant of Video2vec that learns a more accurate representation for the words that are indicative of the event by introducing a term-sensitive descriptiveness loss. Our experiments on three challenging collections of web videos from the NIST TRECVID Multimedia Event Detection and Columbia Consumer Videos datasets demonstrate: i) the advantages of Video2vec over representations using attributes or alternative embeddings, ii) the benefit of fusing video modalities by an embedding over common strategies, and iii) the complementarity of term-sensitive descriptiveness and multimodal predictability for event recognition. By improving the predictability of present-day audio-visual video features while maximizing their semantic descriptiveness, Video2vec achieves state-of-the-art accuracy for both few- and zero-example recognition of events in video.
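
The abstract does not give the form of the joint objective, so the following is only an illustrative sketch of how a loss that balances descriptiveness and predictability might look. It assumes squared-error terms, linear maps, and an alternating ridge-regression solver, none of which are stated in the abstract. All names (`X`, `Y`, `W`, `A`, `S`, `lam`, `ridge`, `joint_loss`, `fit_embedding`, `zero_example_score`) are hypothetical and are not taken from the paper.

```python
import numpy as np


def joint_loss(X, Y, W, A, S, lam=1.0):
    """Assumed joint objective: descriptiveness + lam * predictability.

    X: (d, n) video features, Y: (m, n) term vectors of the descriptions,
    S: (k, n) embedded videos, A: (m, k) maps embeddings back to terms,
    W: (k, d) predicts embeddings from video features.
    """
    descriptiveness = np.sum((Y - A @ S) ** 2)   # can S reconstruct the terms?
    predictability = np.sum((S - W @ X) ** 2)    # can features predict S?
    return descriptiveness + lam * predictability


def fit_embedding(X, Y, k=2, lam=1.0, ridge=1e-3, iters=20, seed=0):
    """Alternating least squares on the assumed objective (illustration only)."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    S = rng.standard_normal((k, n))
    for _ in range(iters):
        # Ridge regression for A given S, and for W given S.
        A = Y @ S.T @ np.linalg.inv(S @ S.T + ridge * np.eye(k))
        W = S @ X.T @ np.linalg.inv(X @ X.T + ridge * np.eye(d))
        # Closed-form S given A and W: (A^T A + lam I) S = A^T Y + lam W X.
        S = np.linalg.solve(A.T @ A + lam * np.eye(k), A.T @ Y + lam * (W @ X))
    return W, A, S


def zero_example_score(x, query_terms, W, A):
    """Cosine similarity between the terms predicted for video x and a
    textual event query expressed as a term vector (hypothetical scoring)."""
    predicted_terms = A @ (W @ x)
    denom = np.linalg.norm(predicted_terms) * np.linalg.norm(query_terms)
    return float(predicted_terms @ query_terms / denom) if denom > 0 else 0.0


# Toy data: 6 videos, 4-dimensional features, 5-term vocabulary.
rng = np.random.default_rng(1)
X_toy = rng.standard_normal((4, 6))
Y_toy = (rng.random((5, 6)) > 0.5).astype(float)
W_toy, A_toy, S_toy = fit_embedding(X_toy, Y_toy, k=2)
score = zero_example_score(X_toy[:, 0], Y_toy[:, 0], W_toy, A_toy)
```

In this sketch, `lam` controls the trade-off between the two terms: a small value favors embeddings that reconstruct the description terms well, while a large value favors embeddings that the video features can predict. The zero-example scoring step only illustrates how a textual event query could be matched against a video with no training examples of that event.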
