Parallelizing industrial simulation codes like the EUROPLEXUS software dedicated to the analysis of fast transient phenomena, is challenging. In this paper we focus on the efficient parallelization on a multi-core shared memory node. We propose to have each thread gather the data it needs for processing a given iteration range, before to actually advance the computation by one time step on this range. This lazy cache aware layout construction enables to keep the original data structure and leads to very localised code modifications. We show that this approach can improve the execution time by up to 40% when the task size is set to have the data fit in the L2 cache.