In this paper we present an online camera pose estimation method that combines Content-Based Image Retrieval (CBIR) and pose refinement based on a learned representation of the scene geometry extracted from monocular images. Our pose estimation method proceeds in two steps: first, we obtain an initial 6 Degrees of Freedom (DoF) pose for an unknown-pose query by retrieving the most similar candidate from a pool of geo-referenced images. Second, we refine the query pose with a Perspective-n-Point (PnP) algorithm, where the 3D points are obtained from a depth map generated for the retrieved candidate image. We make our method fast and lightweight by using a common neural network architecture to produce both the image descriptor used for indexing and the depth map used to create the 3D points required in the PnP pose refinement step. We demonstrate the effectiveness of our proposal through extensive experimentation on both indoor and outdoor scenes, as well as the generalisation capability of our method to unknown environments. Finally, we show how to deploy our system even when the geometric information needed to train our monocular-image-to-depth neural network is missing.
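The two steps of the pipeline can be illustrated with a minimal sketch. The function names, the cosine-similarity retrieval, and the pinhole intrinsics matrix `K` below are illustrative assumptions, not the paper's implementation: step one picks the nearest neighbour in descriptor space, and step two lifts the candidate's generated depth map to the 3D points that a PnP solver would consume together with 2D query correspondences.

```python
import numpy as np

def retrieve_candidate(query_desc, db_descs):
    # Step 1 (sketch): nearest neighbour in descriptor space via cosine similarity.
    # db_descs is an (N, D) array of descriptors for the geo-referenced pool.
    q = query_desc / np.linalg.norm(query_desc)
    db = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
    return int(np.argmax(db @ q))  # index of the most similar candidate

def backproject_depth(depth, K):
    # Step 2 (sketch): lift an (H, W) depth map to 3D points in the camera
    # frame, X = d * K^{-1} [u, v, 1]^T, yielding the 3D side of the
    # 2D-3D correspondences a PnP solver needs.
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)  # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                                  # unit-depth rays
    return rays * depth.reshape(-1, 1)                               # (H*W, 3) points
```

In practice the 3D points would be paired with matched 2D keypoints in the query image and passed to a robust PnP solver (e.g. RANSAC-based) to recover the refined 6 DoF pose.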