Limitations of empirical calibration of p-values using observational data.

Authors
  • Gruber, Susan1
  • Tchetgen Tchetgen, Eric2
  • 1 Reagan-Udall Foundation for the Food and Drug Administration, Washington, DC, U.S.A.
  • 2 Departments of Biostatistics and Epidemiology, Harvard School of Public Health, Boston, MA, U.S.A.
Type
Published Article
Journal
Statistics in Medicine
Publisher
Wiley (John Wiley & Sons)
Publication Date
Sep 30, 2016
Volume
35
Issue
22
Pages
3869–3882
Identifiers
DOI: 10.1002/sim.6936
PMID: 26970249
Source
Medline
Keywords
Language
English
License
Unknown

Abstract

Controversy over non-reproducible published research reporting statistically significant results has produced substantial discussion in the literature. p-value calibration is a recently proposed procedure for adjusting p-values to account for both random and systematic errors that addresses one aspect of this problem. The method's validity rests on the key assumption that bias in an effect estimate is drawn from a normal distribution whose mean and variance can be correctly estimated. We investigated the method's control of type I and type II error rates using simulated and real-world data. Under mild violations of the underlying assumptions, control of the type I error rate can be conservative, while under more extreme departures, it can be anti-conservative. The extent to which the assumption is violated in real-world data analyses is unknown. Barriers to testing the plausibility of the assumption using historical data are discussed. Our studies of the type II error rate using simulated and real-world electronic health care data demonstrated that calibrating p-values can substantially increase the type II error rate. The use of calibrated p-values may reduce the number of false-positive results, but there will be a commensurate drop in the ability to detect a true safety or efficacy signal. While p-value calibration can sometimes offer advantages in controlling the type I error rate, its adoption for routine use in studies of real-world health care datasets is premature. Separate characterizations of random and systematic errors provide a richer context for evaluating uncertainty surrounding effect estimates. Copyright © 2016 John Wiley & Sons, Ltd.
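The calibration procedure the abstract critiques can be illustrated with a short sketch. Under the method's key assumption, bias is drawn from a normal distribution N(mu, sigma^2) estimated from a set of negative-control analyses, and a new estimate's p-value is recomputed against that empirical null. The function names, the maximum-likelihood fit, and the simulated numbers below are illustrative assumptions for exposition, not the paper's implementation.

```python
# Hedged sketch of empirical p-value calibration: fit a normal
# systematic-error (bias) distribution to negative-control estimates,
# then test a new estimate against that empirical null.
import numpy as np
from scipy import optimize, stats


def fit_null_distribution(estimates, std_errs):
    """MLE of the bias distribution N(mu, sigma^2), assuming each
    negative-control estimate_i ~ N(mu, sigma^2 + se_i^2)."""
    estimates = np.asarray(estimates, dtype=float)
    std_errs = np.asarray(std_errs, dtype=float)

    def neg_loglik(params):
        mu, log_sigma = params  # optimize log(sigma) to keep sigma > 0
        var = np.exp(log_sigma) ** 2 + std_errs ** 2
        return 0.5 * np.sum(np.log(2 * np.pi * var)
                            + (estimates - mu) ** 2 / var)

    res = optimize.minimize(neg_loglik,
                            x0=[np.mean(estimates), 0.0],
                            method="Nelder-Mead")
    mu, log_sigma = res.x
    return mu, np.exp(log_sigma)


def calibrated_p_value(estimate, std_err, mu, sigma):
    """Two-sided p-value against the empirical null N(mu, sigma^2),
    widened by the estimate's own sampling variance."""
    z = (estimate - mu) / np.sqrt(sigma ** 2 + std_err ** 2)
    return 2.0 * stats.norm.sf(abs(z))


# Simulated example: 50 negative controls whose true effect is zero
# but whose estimates carry a systematic bias of about 0.5.
rng = np.random.default_rng(0)
se = 0.1
bias = rng.normal(0.5, 0.1, size=50)
nc_estimates = bias + rng.normal(0.0, se, size=50)

mu_hat, sigma_hat = fit_null_distribution(nc_estimates, np.full(50, se))

# A new estimate of 0.5 looks highly significant under the naive null
# (mean zero) but unremarkable under the fitted empirical null.
naive_p = 2.0 * stats.norm.sf(abs(0.5 / se))
calib_p = calibrated_p_value(0.5, se, mu_hat, sigma_hat)
```

The abstract's type II error concern is visible here: the same widening of the null that absorbs systematic error also absorbs genuine signals of comparable size, so a true effect near the bias mean becomes undetectable after calibration.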
