We explore two factors that significantly affect the outcomes of benchmark studies for enzymatically catalyzed reactions: the level of theory of the reference values and the size of the model system used to represent the enzyme active site. For the reference values, we compare two potential alternatives to canonical coupled cluster results for situations where CCSD(T) is computationally too demanding: a strategy to estimate finite basis set coupled cluster values, and the local-correlation DLPNO-CCSD(T) method at the complete basis set limit. We confirm the high accuracy of DLPNO-CCSD(T) when used with tight thresholds. We also show that benchmark studies based on the two sets of references differ notably: absolute deviations from the higher-quality references are generally smaller than those from the lower-quality ones, and the ranking of the assessed methods changes. For geometries, we test three models of the active site of 4-oxalocrotonate tautomerase: one typical of the QM region that may be used in QM/MM studies, and two smaller variants that neglect the surrounding chemical environment. Benchmarking 12 density functionals known to perform well for enzymatically catalyzed reactions shows that the performance of each method is inconsistent across the three models, contradicting the common assumption that small representative systems can be used to accurately assess the applicability of low-level methods to larger biochemical applications. Our findings serve as a reminder of the standards that should be adhered to in benchmark studies, and as a guide for future studies of both enzyme-related and other chemical problems.