Methods and technologies that enable the large-scale estimation of important traits are of great interest to the dairy sector. These phenotypes are necessary to improve herd management, animal genetic evaluation, and milk quality control. In recent years, research on predicting new phenotypes from the mid-infrared (MIR) analysis of milk has been very active. Models have been developed to predict phenotypes such as fine milk composition, milk technological properties, and traits related to cow health, fertility, and environmental impact. Most of these models were developed in research contexts and were often not designed for routine use. Implementing models at a large scale to predict new traits of interest brings new challenges, because the factors influencing model robustness are poorly documented. The first objective of this work is to highlight the impact on prediction accuracy of factors such as the variability of the spectral and reference data, the spectral regions used, and model complexity. The second objective is to emphasize methods and indicators for evaluating the quality of models and of the predictions generated under routine conditions. The last objective is to outline the issues and solutions associated with using and transferring models across a large number of instruments. Based on partial least squares regression and 10 datasets comprising milk MIR spectra and quantitative reference values for 57 traits of interest, the impact of the different factors is illustrated through their influence on the validation root mean square error of prediction (RMSEP). In the examples shown, every factor, when well set up, improves the quality of predictions, with RMSEP improvements ranging from 12% to 43%. This work also aims to underline the need for, and the complementarity of, different validation procedures, statistical parameters, and quality assurance methods.
Finally, when using and transferring models, the impact of spectral standardization on prediction reproducibility is highlighted, with an improvement of up to 86% for the tested models, and monitoring the stability of individual spectrometers over time appears essential. This list, drawn from our experience, is of course not exhaustive. The results shown are only examples, not general rules, and other aspects also play a role in the quality of the final predictions. However, this work highlights good practices, methods, and indicators to increase and evaluate the quality of phenotypes predicted at a large scale. The results argue for the development of international guidelines, as well as international collaborations, in order to build large and robust datasets and enable the use of models under routine conditions.
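
As an illustration of the indicator used throughout, the sketch below computes an RMSEP on a validation set and the relative improvement between two model configurations. It is a minimal example under stated assumptions, not the procedure applied in this work: the RMSEP is taken as the square root of the mean squared difference between reference and predicted values, the improvement is assumed to be the relative reduction of the RMSEP with respect to a baseline, and the function names and numerical values are hypothetical.

```python
import math


def rmsep(reference, predicted):
    """Root mean square error of prediction over a validation set."""
    if len(reference) != len(predicted) or len(reference) == 0:
        raise ValueError("reference and predicted must be non-empty and of equal length")
    squared_errors = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_improvement(rmsep_baseline, rmsep_improved):
    """Percentage reduction of the RMSEP relative to the baseline (assumed definition)."""
    return 100.0 * (rmsep_baseline - rmsep_improved) / rmsep_baseline


# Made-up reference values and predictions, for illustration only.
reference = [3.0, 3.5, 4.0, 4.5]
baseline_pred = [3.4, 3.1, 4.4, 4.1]  # every error has magnitude 0.4
improved_pred = [3.3, 3.2, 4.3, 4.2]  # every error has magnitude 0.3

rmsep_before = rmsep(reference, baseline_pred)
rmsep_after = rmsep(reference, improved_pred)
print(f"RMSEP before: {rmsep_before:.3f}")
print(f"RMSEP after:  {rmsep_after:.3f}")
print(f"Improvement:  {relative_improvement(rmsep_before, rmsep_after):.1f}%")
```

With these made-up values, the RMSEP decreases from about 0.4 to about 0.3, a relative improvement of about 25%. Other conventions exist (for example, bias-corrected errors), so the definition in use should be stated alongside any reported figure.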