Viewpoint invariant semantic object and scene categorization with RGB-D sensors

Authors
  • Zaki, Hasan F. M.1
  • Shafait, Faisal2
  • Mian, Ajmal3
  • 1 International Islamic University Malaysia, Department of Mechatronics Engineering, Kuala Lumpur, 53100, Malaysia
  • 2 National University of Sciences and Technology, Islamabad, Pakistan
  • 3 The University of Western Australia, School of Computer Science and Software Engineering, Crawley, WA, 6009, Australia
Type
Published Article
Journal
Autonomous Robots
Publisher
Springer US
Publication Date
Jul 05, 2018
Volume
43
Issue
4
Pages
1005–1022
Identifiers
DOI: 10.1007/s10514-018-9776-8
Source
Springer Nature

Abstract

Understanding the semantics of objects and scenes using multi-modal RGB-D sensors serves many robotics applications. Key challenges for accurate RGB-D image recognition are the scarcity of training data, variations due to viewpoint changes and the heterogeneous nature of the data. We address these problems with a generic deep learning framework that uses a pre-trained convolutional neural network as a feature extractor for both the colour and depth channels. We propose a rich multi-scale feature representation, referred to as the convolutional hypercube pyramid (HP-CNN), which encodes discriminative information from the convolutional tensors at different levels of detail. We also present a technique to fuse the proposed HP-CNN with the fully connected layer activations using an extreme learning machine classifier in a late fusion scheme, which yields a highly discriminative and compact representation. To further improve performance, we devise HP-CNN-T, a view-invariant descriptor extracted from a multi-view 3D object pose (M3DOP) model. M3DOP is learned from over 140,000 RGB-D images that are synthetically generated by rendering CAD models from different viewpoints. Extensive evaluations on four RGB-D object and scene recognition datasets demonstrate that our HP-CNN and HP-CNN-T consistently outperform state-of-the-art methods for several recognition tasks by a significant margin.
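
The multi-scale pooling and late-fusion idea described in the abstract can be illustrated with a toy sketch. The snippet below is a minimal, hypothetical illustration rather than the authors' implementation: it max-pools CNN convolutional tensors over a small spatial pyramid, concatenates the pooled responses into a hypercube-pyramid-style vector, and appends placeholder fully connected activations as the input to a late-fusion classifier. The tensor shapes, pyramid levels and random placeholder data are assumptions; the paper's exact layer choices and the extreme learning machine fusion are not reproduced here.

```python
# Minimal sketch (assumed shapes and pooling scheme, not the published HP-CNN code).
import numpy as np

def pyramid_pool(conv_map, levels=(1, 2, 4)):
    """Max-pool a C x H x W conv tensor over a spatial pyramid of grids
    and concatenate the pooled responses into one feature vector."""
    c, h, w = conv_map.shape
    feats = []
    for n in levels:
        # split the spatial grid into n x n cells and max-pool each cell
        row_cells = np.array_split(np.arange(h), n)
        col_cells = np.array_split(np.arange(w), n)
        for rows in row_cells:
            for cols in col_cells:
                cell = conv_map[:, rows[:, None], cols[None, :]]
                feats.append(cell.reshape(c, -1).max(axis=1))
    return np.concatenate(feats)  # length = C * sum(n*n for n in levels)

def hypercube_pyramid(conv_maps):
    """Stack pyramid-pooled features from several conv layers (RGB or depth stream)."""
    return np.concatenate([pyramid_pool(m) for m in conv_maps])

# toy example: three conv tensors standing in for intermediate CNN layers
rng = np.random.default_rng(0)
maps = [rng.standard_normal((256, 13, 13)) for _ in range(3)]
hp = hypercube_pyramid(maps)              # multi-scale conv descriptor
fc = rng.standard_normal(4096)            # placeholder fully connected activations
fused = np.concatenate([hp, fc])          # simple concatenation before a classifier
print(hp.shape, fused.shape)
```

In the paper the two modalities (colour and depth) are processed by the same pre-trained network and their descriptors are combined in a late fusion scheme; the concatenation above only stands in for that step to show where the conv-pyramid and fully connected features meet.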
