Facial landmark detection is typically cast as a point-wise regression problem that focuses on how to build an effective image-to-point mapping function. In this paper, we propose an end-to-end deep learning approach for contextually discriminative feature construction together with effective facial structure modeling. The proposed learning approach is able to predict more contextually discriminative facial landmarks by capturing their associated contextual information. Moreover, we present a tree model to characterize human face structure and a structural loss function to measure the deformation cost between the ground-truth and predicted tree model, which are further incorporated into the proposed learning approach and jointly optimized within a unified framework. The presented tree model is able to well characterize the spatial layout patterns of facial landmarks for capturing the facial structure information. Experimental results demonstrate the effectiveness of the proposed approach against the state-of-the-art over the MTFL and AFLW-full data sets.