Joint modeling of language and vision has been drawing increasing interest. A multimodal data representation that allows bidirectional retrieval, of images by sentences and vice versa, is a key aspect. In this paper we present three contributions to canonical correlation analysis (CCA) based multimodal retrieval. Firstly, we show that an asymmetric weighting of the canonical weights, while achieving a cross-view mapping from the search to the query space, improves the retrieval performance. Secondly, we devise a computationally efficient model selection, crucial to generalization and stability, in the framework of the Björck-Golub algorithm for regularized CCA via spectral filtering. Finally, we introduce a Hierarchical Kernel Sentence Embedding (HKSE) that approximates Kernel CCA for a special similarity kernel between word distributions. State-of-the-art results are obtained on the MSCOCO and Flickr benchmarks when these three techniques are used in conjunction.