The Statistical Crisis in Science. Gelman, A. & Loken, E. American Scientist, 102(6):460+, 2014.
Data-dependent analysis – a "garden of forking paths" – explains why many statistically significant comparisons don't hold up.

[Excerpt] There is a growing realization that reported "statistically significant" claims in scientific publications are routinely mistaken. Researchers typically express the confidence in their data in terms of p-value: the probability that a perceived result is actually the result of random variation. The value of p (for "probability") is a way of measuring the extent to which a data set provides evidence against a so-called null hypothesis. By convention, a p-value below 0.05 is considered a meaningful refutation of the null hypothesis; however, such conclusions are less solid than they appear.

The idea is that when p is less than some prespecified value such as 0.05, the null hypothesis is rejected by the data, allowing researchers to claim strong evidence in favor of the alternative. The concept of p-values was originally developed by statistician Ronald Fisher in the 1920s in the context of his research on crop variance in Hertfordshire, England. Fisher offered the idea of p-values as a means of protecting researchers from declaring truth based on patterns in noise. In an ironic twist, p-values are now often used to lend credence to noisy claims based on small samples.

In general, p-values are based on what would have happened under other possible data sets. [...] The question may be framed nonspecifically as an investigation of possible associations [...] across contexts. [...] At this point a huge number of possible comparisons could be performed, all consistent with the researcher's theory. [...]

This multiple comparisons issue is well known in statistics and has been called "p-hacking" in an influential 2011 paper by the psychology researchers Joseph Simmons, Leif Nelson, and Uri Simonsohn. Our main point in the present article is that it is possible to have multiple potential comparisons (that is, a data analysis whose details are highly contingent on data, invalidating published p-values) without the researcher performing any conscious procedure of fishing through the data or explicitly examining multiple comparisons. [...]

[The Way Forward] We must realize that, absent preregistration or opportunities for authentic replication, our choices for data analysis will be data dependent, even when they are motivated directly from theoretical concerns. When preregistered replication is difficult or impossible (as in much research in social science and public health), we believe the best strategy is to move toward an analysis of all the data rather than a focus on a single comparison or small set of comparisons. There is no statistical quality board that could enforce such larger analyses – nor would we believe such coercion to be appropriate – but as more and more scientists follow the lead of Brian Nosek, who openly expressed concerns about the malign effects of p-values on his own research, we hope there will be an increasing motivation toward more comprehensive data analyses that will be less subject to these concerns. If necessary, one must step back to a sharper distinction between exploratory and confirmatory data analysis, recognizing the benefits and limitations of each.

In fields where new data can readily be gathered, perhaps the two-part structure of Nosek and his colleagues – attempting to replicate his results before publishing – will set a standard for future research. Instead of the current norm in which several different studies are performed, each with statistical significance but each with analyses that are contingent on data, perhaps researchers can perform half as many original experiments in each paper and just pair each new experiment with a preregistered replication. We encourage awareness among scientists that p-values should not necessarily be taken at face value. However, this does not mean that scientists are without options for valid statistical inference.

Our positive message is related to our strong feeling that scientists are interested in getting closer to the truth. In the words of the great statistical educator Frederick Mosteller, it is easy to lie with statistics, but easier without them.
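The multiple-comparisons mechanism described in the excerpt can be made concrete with a small simulation. The sketch below is a minimal illustration added here for concreteness, not code or data from the article, and it deliberately collapses the garden of forking paths into its most visible form: an explicit choice among several plausible outcome measures after the data have been seen (the article stresses that the same invalidation can occur even when only one data-contingent analysis is ever run). All names and parameter values (n_sims, n_per_group, n_outcomes, and so on) are hypothetical.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_sims = 5_000        # simulated "studies"
n_per_group = 30      # subjects per arm
n_outcomes = 5        # plausible outcome measures a researcher might examine
alpha = 0.05

forking_hits = 0       # "significant" results when the comparison is chosen after seeing the data
prespecified_hits = 0  # "significant" results for a single comparison fixed in advance

for _ in range(n_sims):
    # Null world: treatment and control come from the same distribution,
    # so every rejection below is a false positive.
    treatment = rng.normal(size=(n_per_group, n_outcomes))
    control = rng.normal(size=(n_per_group, n_outcomes))

    pvals = [stats.ttest_ind(treatment[:, j], control[:, j]).pvalue
             for j in range(n_outcomes)]

    # Garden of forking paths (collapsed to its crudest form):
    # report whichever outcome looks most promising.
    if min(pvals) < alpha:
        forking_hits += 1
    # Prespecified analysis: outcome 0 was fixed before any data were seen.
    if pvals[0] < alpha:
        prespecified_hits += 1

print(f"false positive rate, comparison chosen after seeing the data: {forking_hits / n_sims:.3f}")
print(f"false positive rate, single prespecified comparison:          {prespecified_hits / n_sims:.3f}")
print(f"analytic rate for {n_outcomes} independent tests: {1 - (1 - alpha) ** n_outcomes:.3f}")

With five independent outcomes the data-dependent strategy declares significance in roughly 1 - 0.95^5, about 23 percent, of null-world studies, while the single prespecified comparison stays near the nominal 5 percent.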
@article{gelmanStatisticalCrisisScience2014,
  title = {The Statistical Crisis in Science},
  author = {Gelman, Andrew and Loken, Eric},
  year = {2014},
  volume = {102},
  pages = {460+},
  issn = {0003-0996},
  doi = {10.1511/2014.111.460},
  abstract = {Data-dependent analysis -- a ``garden of forking paths'' -- explains why many statistically significant comparisons don't hold up.

[Excerpt] There is a growing realization that reported ``statistically significant'' claims in scientific publications are routinely mistaken. Researchers typically express the confidence in their data in terms of p-value: the probability that a perceived result is actually the result of random variation. The value of p (for ``probability'') is a way of measuring the extent to which a data set provides evidence against a so-called null hypothesis. By convention, a p-value below 0.05 is considered a meaningful refutation of the null hypothesis; however, such conclusions are less solid than they appear.

The idea is that when p is less than some prespecified value such as 0.05, the null hypothesis is rejected by the data, allowing researchers to claim strong evidence in favor of the alternative. The concept of p-values was originally developed by statistician Ronald Fisher in the 1920s in the context of his research on crop variance in Hertfordshire, England. Fisher offered the idea of p-values as a means of protecting researchers from declaring truth based on patterns in noise. In an ironic twist, p-values are now often used to lend credence to noisy claims based on small samples.

In general, p-values are based on what would have happened under other possible data sets. [...] The question may be framed nonspecifically as an investigation of possible associations [...] across contexts. [...] At this point a huge number of possible comparisons could be performed, all consistent with the researcher's theory. [...]

This multiple comparisons issue is well known in statistics and has been called ``p-hacking'' in an influential 2011 paper by the psychology researchers Joseph Simmons, Leif Nelson, and Uri Simonsohn. Our main point in the present article is that it is possible to have multiple potential comparisons (that is, a data analysis whose details are highly contingent on data, invalidating published p-values) without the researcher performing any conscious procedure of fishing through the data or explicitly examining multiple comparisons.

[...] 

[The Way Forward]

We must realize that, absent preregistration or opportunities for authentic replication, our choices for data analysis will be data dependent, even when they are motivated directly from theoretical concerns. When preregistered replication is difficult or impossible (as in much research in social science and public health), we believe the best strategy is to move toward an analysis of all the data rather than a focus on a single comparison or small set of comparisons. There is no statistical quality board that could enforce such larger analyses -- nor would we believe such coercion to be appropriate -- but as more and more scientists follow the lead of Brian Nosek, who openly expressed concerns about the malign effects of p-values on his own research, we hope there will be an increasing motivation toward more comprehensive data analyses that will be less subject to these concerns. If necessary, one must step back to a sharper distinction between exploratory and confirmatory data analysis, recognizing the benefits and limitations of each.

In fields where new data can readily be gathered, perhaps the two-part structure of Nosek and his colleagues -- attempting to replicate his results before publishing -- will set a standard for future research. Instead of the current norm in which several different studies are performed, each with statistical significance but each with analyses that are contingent on data, perhaps researchers can perform half as many original experiments in each paper and just pair each new experiment with a preregistered replication. We encourage awareness among scientists that p-values should not necessarily be taken at face value. However, this does not mean that scientists are without options for valid statistical inference.

Our positive message is related to our strong feeling that scientists are interested in getting closer to the truth. In the words of the great statistical educator Frederick Mosteller, it is easy to lie with statistics, but easier without them.},
  journal = {American Scientist},
  keywords = {communicating-uncertainty,computational-science,p-value,science-ethics,science-literacy,scientific-communication,statistics,validation},
  number = {6}
}
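The two-part exploratory/confirmatory structure advocated under [The Way Forward] can also be illustrated with a short simulation. This is a minimal sketch of one way such a workflow could look, under assumptions introduced here (the helpers collect_sample and pick_hypothesis are hypothetical), not a procedure taken from the article: hypotheses are generated freely on an exploratory sample, but the reported p-value comes from one prespecified test on a fresh replication sample.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def collect_sample(n=30, n_outcomes=5):
    # Simulate one null-world experiment: treatment and control are identical.
    return rng.normal(size=(n, n_outcomes)), rng.normal(size=(n, n_outcomes))

def pick_hypothesis(treatment, control):
    # Exploratory phase: pick whichever outcome shows the largest observed difference.
    return int(np.argmax(np.abs(treatment.mean(axis=0) - control.mean(axis=0))))

n_sims = 5_000
false_positives = 0
for _ in range(n_sims):
    # Exploratory experiment: anything goes; nothing here is reported as evidence.
    t_explore, c_explore = collect_sample()
    j = pick_hypothesis(t_explore, c_explore)

    # Preregistered replication: a single comparison, fixed before these data exist.
    t_rep, c_rep = collect_sample()
    if stats.ttest_ind(t_rep[:, j], c_rep[:, j]).pvalue < 0.05:
        false_positives += 1

print(f"false positive rate of the confirmatory test: {false_positives / n_sims:.3f}")

Because the reported comparison is fixed before the replication data are collected, its false positive rate stays near the nominal 5 percent no matter how data-dependent the exploratory phase was.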
