Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words

Authors
  • Niebles, Juan Carlos1, 2
  • Wang, Hongcheng3
  • Fei-Fei, Li4
  • 1 Princeton University, Department of Electrical Engineering, Engineering Quadrangle, Olden Street, Princeton, NJ 08544, USA
  • 2 Universidad del Norte, Robotics and Intelligent Systems Group, Km 5 Vía Puerto Colombia, Barranquilla, Colombia
  • 3 United Technologies Research Center (UTRC), 411 Silver Lane, East Hartford, CT 06108, USA
  • 4 Princeton University, Department of Computer Science, 35 Olden Street, Princeton, NJ 08540, USA
Type
Published Article
Journal
International Journal of Computer Vision
Publisher
Springer-Verlag
Publication Date
Mar 04, 2008
Volume
79
Issue
3
Pages
299–318
Identifiers
DOI: 10.1007/s11263-007-0122-4
Source
Springer Nature

Abstract

We present a novel unsupervised learning method for human action categories. A video sequence is represented as a collection of spatial-temporal words by extracting space-time interest points. The algorithm automatically learns the probability distributions of the spatial-temporal words and the intermediate topics corresponding to human action categories. This is achieved by using latent topic models such as the probabilistic Latent Semantic Analysis (pLSA) model and Latent Dirichlet Allocation (LDA). Thanks to the probabilistic models, our approach can handle noisy feature points arising from dynamic backgrounds and moving cameras. Given a novel video sequence, the algorithm can categorize and localize the human action(s) contained in the video. We test our algorithm on three challenging datasets: the KTH human motion dataset, the Weizmann human action dataset, and a recent dataset of figure skating actions. Our results reflect the promise of such a simple approach. In addition, our algorithm can recognize and localize multiple actions in long and complex video sequences containing multiple motions.
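To make the pipeline in the abstract concrete, below is a minimal sketch of the bag-of-spatial-temporal-words representation and topic-model step. It is not the paper's implementation: it assumes per-video descriptors from a space-time interest point detector are already available (the detector is not shown), the names build_codebook, video_histogram, and vocab_size are illustrative, and scikit-learn's LatentDirichletAllocation stands in for the pLSA/LDA models used in the paper.

```python
# Hedged sketch: bag of spatial-temporal words + latent topic model.
# Assumes space-time interest point descriptors are precomputed per video.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

def build_codebook(all_descriptors, vocab_size=100, seed=0):
    """Quantize space-time interest point descriptors into visual 'words'."""
    kmeans = KMeans(n_clusters=vocab_size, n_init=10, random_state=seed)
    kmeans.fit(all_descriptors)  # all_descriptors: (n_points, descriptor_dim)
    return kmeans

def video_histogram(kmeans, descriptors, vocab_size=100):
    """Represent one video as a histogram of spatial-temporal words."""
    words = kmeans.predict(descriptors)
    return np.bincount(words, minlength=vocab_size)

# Hypothetical stand-in data: one descriptor matrix per video.
rng = np.random.default_rng(0)
videos = [rng.normal(size=(rng.integers(50, 200), 72)) for _ in range(20)]

kmeans = build_codebook(np.vstack(videos), vocab_size=100)
X = np.array([video_histogram(kmeans, d, vocab_size=100) for d in videos])

# Learn latent topics over the word histograms; each topic is intended to
# align with one action category, so no category labels are needed.
lda = LatentDirichletAllocation(n_components=6, random_state=0)
theta = lda.fit_transform(X)             # per-video topic mixture
predicted_action = theta.argmax(axis=1)  # label = most probable topic
```

In the same spirit, localization follows from the word-level assignments: each spatial-temporal word in a test video can be attributed to its most probable topic, so the space-time positions of words belonging to one topic mark where that action occurs.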
