Abstract A recent trend in Web image search is to fuse the two basic modalities of Web images, i.e., textual features (usually represented by keywords) and visual features, for retrieval. The key issue is how to associate the two modalities for fusion. In this paper, a new approach based on Multi-Modal Semantic Association Rules (MMSARs) is proposed to fuse keywords and visual features automatically for Web image retrieval. An MMSAR contains a single keyword and several visual feature clusters, thereby crossing and associating the two modalities of Web images. A customized frequent-itemset mining algorithm is designed for these particular MMSARs based on the existing inverted file, and a new support–confidence framework is defined for the mining algorithm. Based on the mined MMSARs, keywords and visual features are fused automatically during retrieval. The proposed approach not only remarkably improves retrieval precision but also achieves fast response times. Experiments are carried out in a Web image retrieval system, VAST (VisuAl & SemanTic image search), and the results show the superiority and effectiveness of the proposed approach.
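To make the rule form concrete, the sketch below mines rules of the shape "keyword → set of visual feature clusters" under a simple support–confidence filter. It is a minimal illustration, not the paper's algorithm: the data layout (each image as a pair of keyword set and cluster-ID set), the `mine_mmsars` function name, and the thresholds are all assumptions, and the paper's inverted-file-based mining is not reproduced here.

```python
from collections import defaultdict
from itertools import combinations

def mine_mmsars(images, min_support=2, min_confidence=0.5):
    """Mine rules: keyword -> subset of visual feature clusters.

    `images` is a list of (keywords, clusters) pairs, where cluster IDs
    are assumed to come from a prior clustering of visual features.
    (Hypothetical data layout for illustration only.)
    """
    kw_count = defaultdict(int)    # occurrences of each keyword (rule antecedent)
    rule_count = defaultdict(int)  # co-occurrences of (keyword, cluster subset)
    for keywords, clusters in images:
        for kw in keywords:
            kw_count[kw] += 1
            # Enumerate non-empty cluster subsets up to size 2, for brevity.
            for r in (1, 2):
                for subset in combinations(sorted(clusters), r):
                    rule_count[(kw, subset)] += 1
    rules = []
    for (kw, subset), supp in rule_count.items():
        conf = supp / kw_count[kw]  # confidence = support(rule) / support(keyword)
        if supp >= min_support and conf >= min_confidence:
            rules.append((kw, subset, supp, conf))
    return rules
```

At query time, a rule mined this way lets a keyword query be expanded with its associated visual feature clusters, so that text and visual evidence are fused without manual intervention.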