To reduce the dimension of large datasets, it is common to express each vector of the dataset using a few atoms of a redundant dictionary. Many models and algorithms have been proposed to select these atoms, leading to state-of-the-art performance in numerous machine learning, signal and image processing applications. Classical sparsifying algorithms compute, at each iteration, matrix-vector multiplications in which the matrix contains the atoms of the dictionary. As a consequence, the numerical complexity of the sparsifying algorithm is proportional to that of the matrix-vector multiplication. In some applications, these multiplications can be computed using handcrafted fast transforms (such as the Fourier or wavelet transforms). More often, however, the cost of the matrix-vector multiplications limits the capabilities of the sparsifying algorithms; this is particularly the case when the transform is learned from the data. To overcome this limitation, we study a strategy that optimizes convolutions of sparse kernels living on the edges of a tree. These convolutions define a fast transform (algorithmically similar to a fast wavelet transform) that can approximate atoms prescribed by the user. The optimization problem associated with learning these fast transforms is smooth but can be strongly non-convex. We propose a proximal alternating linearized minimization algorithm (PALMTREE) that approximates Curvelet and Wavelet-packet transforms with excellent performance. Empirically, the profile of the objective function associated with this optimization problem has a small number of critical values, each with a large watershed. This confirms that the resulting fast transforms can be optimized efficiently, which opens many theoretical and applied perspectives.
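To illustrate the complexity gap the abstract refers to (this is a generic sketch, not the paper's tree-structured construction): a fast wavelet transform applies a cascade of 2-tap sparse kernels with stride-2 downsampling, costing O(N) operations in total, whereas applying the equivalent dense N×N matrix costs O(N²). A minimal Haar example:

```python
import numpy as np

def haar_fast(x):
    """Fast Haar transform: a cascade of 2-tap sparse kernels, O(N) total."""
    x = x.astype(float).copy()
    coeffs = []
    s = 1.0 / np.sqrt(2.0)
    while len(x) > 1:
        approx = s * (x[0::2] + x[1::2])   # low-pass filter, stride 2
        detail = s * (x[0::2] - x[1::2])   # high-pass filter, stride 2
        coeffs.append(detail)
        x = approx
    coeffs.append(x)
    # Output order: coarsest approximation first, then details coarse-to-fine.
    return np.concatenate(coeffs[::-1])

def haar_dense_matrix(n):
    """Equivalent dense transform matrix: applying it costs O(N^2)."""
    # Column i of the matrix is the transform of the i-th canonical vector.
    return np.stack([haar_fast(e) for e in np.eye(n)], axis=1)

x = np.random.default_rng(0).standard_normal(8)
H = haar_dense_matrix(8)
assert np.allclose(H @ x, haar_fast(x))   # same transform, far lower cost
assert np.allclose(H.T @ H, np.eye(8))    # the transform is orthogonal
```

The paper's setting generalizes this idea: instead of fixed Haar kernels, the sparse kernels on the edges of a tree are free parameters, optimized (here via PALMTREE) so that the resulting fast transform approximates user-prescribed atoms.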