Controversy over non-reproducible published research reporting a statistically significant result has produced substantial discussion in the literature. p-value calibration is a recently proposed procedure for adjusting p-values to account for both random and systematic errors that addresses one aspect of this problem. The method's validity rests on the key assumption that bias in an effect estimate is drawn from a normal distribution whose mean and variance can be correctly estimated. We investigated the method's control of type I and type II error rates using simulated and real-world data. Under mild violations of the underlying assumption, control of the type I error rate can be conservative, while under more extreme departures, it can be anti-conservative. The extent to which the assumption is violated in real-world data analyses is unknown. Barriers to testing the plausibility of the assumption using historical data are discussed. Our studies using simulated and real-world electronic health care data demonstrated that calibrating p-values can substantially increase the type II error rate. The use of calibrated p-values may reduce the number of false-positive results, but there will be a commensurate drop in the ability to detect a true safety or efficacy signal. While p-value calibration can sometimes offer advantages in controlling the type I error rate, its adoption for routine use in studies of real-world health care datasets is premature. Separate characterizations of random and systematic errors provide a richer context for evaluating uncertainty surrounding effect estimates. Copyright © 2016 John Wiley & Sons, Ltd.
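The calibration idea described above can be sketched as follows, assuming the commonly used normal bias model: the bias is taken to be drawn from N(mu, tau^2), and the calibrated p-value is computed by widening the null distribution of the estimate to account for both sampling variance and the estimated bias distribution. The parameter names (`mu`, `tau`) and the specific functions here are illustrative, not the authors' exact implementation.

```python
import math

def norm_cdf(x):
    # Standard normal CDF via the error function (no external dependencies).
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def traditional_p(beta, se):
    # Conventional two-sided p-value: assumes estimate ~ N(0, se^2) under the null.
    z = abs(beta) / se
    return 2.0 * (1.0 - norm_cdf(z))

def calibrated_p(beta, se, mu, tau):
    # Calibrated two-sided p-value under an assumed bias ~ N(mu, tau^2):
    # under the null, the estimate is modeled as N(mu, tau^2 + se^2).
    sd = math.sqrt(tau**2 + se**2)
    lower = norm_cdf((beta - mu) / sd)
    upper = 1.0 - lower
    return 2.0 * min(lower, upper)
```

With no estimated bias (mu = 0, tau = 0) the calibrated p-value reduces to the traditional one; with a nonzero bias distribution the null widens and p-values grow, which illustrates the reported loss of power (increased type II error rate) when a true signal is present.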