This paper assumes prior detections of multiple targets at each time instant, and uses a graph-based approach to connect those detections across time, based on their position and appearance estimates. In contrast to most earlier works in the field, our framework has been designed to exploit the appearance features, even when they are only sporadically available, or affected by a non-stationary noise, along the sequence of detections. This is done by implementing an iterative hypothesis testing strategy to progressively aggregate the detections into short trajectories, named tracklets. Specifically, each iteration considers a node, named key-node, and investigates how to link this key-node with other nodes in its neighborhood, under the assumption that the target appearance is defined by the key-node appearance estimate. This is done through shortest path computation in a temporal neighborhood of the key-node. The approach is conservative in that it only aggregates the shortest paths that are sufficiently better compared to alternative paths. It is also multi-scale in that the size of the investigated neighborhood is increased proportionally to the number of detections already aggregated into the key-node. The multi-scale nature of the process and the progressive relaxation of its conservativeness makes it both computationally efficient and effective. Experimental validations are performed extensively on a toy example, a 15 minutes long multi-view basketball dataset, and other monocular pedestrian datasets.