Enabling machines to understand non-verbal facial behavior from visual data is crucial for building smart interactive systems. This thesis focuses on human behavior analysis in videos. Previous state-of-the-art methods generally employed global temporal pooling approaches that (i) assume the presence of a single uniform event spanning the sequence, and (ii) discard temporal ordering by squashing all information along the temporal dimension. In this dissertation we focus on two specific modeling challenges left unaddressed by previous approaches. The first is training with weak labels, which provide only video-level annotations and are much cheaper to obtain than fine (frame-level) annotations. The second is modeling temporal dynamics during prediction, as facial expressions are dynamic actions composed of sub-events. We tackle these issues with methods based on Weakly Supervised Latent Variable Models (WSLVM) and evaluate them on real-world spontaneous expressions. We begin by combining the Multiple Instance Learning (MIL) framework with a Multiple Segment representation (MS-MIL). MS-MIL can simultaneously classify and localize target behavior in videos despite being trained with weak annotations. However, this method cannot explicitly model multiple latent concepts or global temporal order. We address the latter limitation in the next chapter by explicitly modeling temporal ordering, learning an exemplar Hidden Markov Model for each sequence. This algorithm models dependencies between segments but is limited in its modeling capacity by its reliance on generative modeling. Chapter~4 extends MIL to learn multiple discriminative concepts in a novel formulation for joint clustering and classification. This algorithm shows consistent performance improvement but does not capture temporal structure.
Finally, we present a unified learning framework that combines the strengths of the previously proposed algorithms in that it (i) handles weakly labeled data, (ii) learns multiple discriminative concepts, and (iii) models the temporal ordering of those concepts. This method is a novel WSLVM that models a video as a sequence of multiple automatically mined discriminative sub-events with a loose temporal structure. Qualitative and quantitative results show that jointly addressing weak labels and temporal dynamics improves over state-of-the-art algorithms.