We propose an approach to accurately estimate 3D human pose by fusing multi-viewpoint video (MVV) with inertial measurement unit (IMU) sensor data, without optical markers, a complex hardware setup or a full body model. Uniquely we use a multi-channel 3D convolutional neural network to learn a pose embedding from visual occupancy and semantic 2D pose estimates from the MVV in a discretised volumetric probabilistic visual hull. The learnt pose stream is concurrently processed with a forward kinematic solve of the IMU data and a temporal model (LSTM) exploits the rich spatial and temporal long range dependencies among the solved joints, the two streams are then fused in a final fully connected layer. The two complementary data sources allow for ambiguities to be resolved within each sensor modality, yielding improved accuracy over prior methods. Extensive evaluation is performed with state of the art performance reported on the popular Human 3.6M dataset (Ionescu et al. in Intell IEEE Trans Pattern Anal Mach 36(7):1325–1339, 2014), the newly released TotalCapture dataset and a challenging set of outdoor videos TotalCaptureOutdoor. We release the new hybrid MVV dataset (TotalCapture) comprising of multi-viewpoint video, IMU and accurate 3D skeletal joint ground truth derived from a commercial motion capture system. The dataset is available online at http://cvssp.org/data/totalcapture/.