‘One-size-fits-all’ threshold for P values under fire

‘One-size-fits-all’ threshold for P values under fire. Singh Chawla, D. Nature News, September, 2017.

Researchers are at odds over when to dub a discovery ‘significant’. In July, 72 researchers took aim at the P value, calling for a lower threshold for the popular but much-maligned statistic. In a response published on 18 September, a group of 88 researchers have responded, saying that a better solution would be to make academics justify their use of specific P values, rather than adopt another arbitrary threshold.

P values have been used as measures of significance for decades, but academics have become increasingly aware of their shortcomings and the potential for abuse. In 2015, one psychology journal banned P values entirely.

The statistic is used to test a ‘null hypothesis’, a default state positing that there is no relationship between the phenomena being measured. The smaller the P value, the less likely it is that the results are due to chance, presuming that the null hypothesis is true. Results have typically been deemed ‘statistically significant’, and the null hypothesis dismissed, when P values are below 0.05.

In a July preprint, since published in Nature Human Behaviour, researchers, including leaders in the push for greater reproducibility, said that this threshold should be reduced to 0.005 to keep false positives from creeping into social sciences and biomedical literature.

But “setting this one threshold for all sciences is too extreme,” says Daniel Lakens, an experimental psychologist at Eindhoven University of Technology in the Netherlands and lead author of the new commentary, which was posted to the PsyArXiv preprint server. “The moment you ask people to justify what they are doing, science will improve,” he adds.

Unintended consequences

Some researchers worry that lowering P value cut-offs may exacerbate the ‘file-drawer problem’, when studies containing negative results are left unpublished. A more stringent P value threshold could also lead to more false negatives: claiming that an effect doesn’t exist when in fact it does.

“Before you implement any policy, you want to be more certain that there are no unintended negative consequences,” says Lakens.

Instead, Lakens and colleagues say, researchers should select and justify P value thresholds for their experiments, before collecting any data. These levels would be based on factors such as the potential impact of a discovery, or how surprising it would be. Such thresholds could then be evaluated via their registered reports, a type of scientific article in which methods and proposed analyses are peer-reviewed before any experiments are conducted.

“I don’t think researchers will ever have an incentive to say they need to use a more stringent threshold of evidence,” counters Valen Johnson, a statistician at Texas A&M University in College Station who is a co-author of the July manuscript. And many scientists are likely to go easy on their own work, says another co-author, Daniel Benjamin, a behavioural economist at the University of Southern California, Los Angeles.

But Lakens thinks that any attempts to manipulate P values will be obvious from the justifications that researchers pick. “At least everyone agrees that it’s good to change the mindless use of 0.05,” he says.

Setting specific thresholds for standards of evidence is “bad for science”, says Ronald Wasserstein, executive director of the American Statistical Association, which last year took the unusual step of releasing explicit recommendations on the use of P values for the first time in its 177-year history. Next month, the society will hold a symposium on statistical inference, which follows on from its recommendations.

Wasserstein says he hasn’t yet taken a position on the current debate over P value thresholds, but adds that “we shouldn’t be surprised that there isn’t a single magic number”.

@article{singh_chawla_one-size-fits-all_2017,
title = {‘{One}-size-fits-all’ threshold for {P} values under fire},
url = {http://www.nature.com/news/one-size-fits-all-threshold-for-p-values-under-fire-1.22625},
doi = {10.1038/nature.2017.22625},
abstract = {Researchers are at odds over when to dub a discovery 'significant'. In
July, 72 researchers took aim at the P value, calling for a lower
threshold for the popular but much-maligned statistic. In a response
published on 18 September, a group of 88 researchers have responded,
saying that a better solution would be to make academics justify their use
of specific P values, rather than adopt another arbitrary threshold. P
values have been used as measures of significance for decades, but
academics have become increasingly aware of their shortcomings and the
potential for abuse. In 2015, one psychology journal banned P values
entirely. The statistic is used to test a ‘null hypothesis’, a default
state positing that there is no relationship between the phenomena being
measured. The smaller the P value, the less likely it is that the results
are due to chance, presuming that the null hypothesis is true. Results
have typically been deemed ‘statistically significant’, and the null
hypothesis dismissed, when P values are below 0.05. In a July preprint,
since published in Nature Human Behaviour, researchers, including
leaders in the push for greater
reproducibility, said that this threshold should be reduced to 0.005 to
keep false positives from creeping into social sciences and biomedical
literature. But “setting this one threshold for all sciences is too
extreme,” says Daniel Lakens, an experimental psychologist at Eindhoven
University of Technology in the Netherlands and lead author of the new
commentary, which was posted to the PsyArXiv preprint server. “The moment
you ask people to justify what they are doing, science will improve,” he
adds.

Unintended consequences

Some researchers worry that lowering P value
cut-offs may exacerbate the ‘file-drawer problem’, when studies containing
negative results are left unpublished. A more stringent P value threshold
could also lead to more false negatives — claiming that an effect doesn’t
exist when in fact it does. “Before you implement any policy, you want to
be more certain that there are no unintended negative consequences,” says
Lakens. Instead, Lakens and colleagues say, researchers should select and
justify P value thresholds for their experiments, before collecting any
data. These levels would be based on factors such as the potential impact
of a discovery, or how surprising it would be. Such thresholds could then
be evaluated via their registered reports, a type of scientific article in
which methods and proposed analyses are peer-reviewed before any
experiments are conducted. “I don’t think researchers will ever have an
incentive to say they need to use a more stringent threshold of evidence,”
counters Valen Johnson, a statistician at Texas A\&M University in College
Station who is a co-author of the July manuscript. And many scientists are
likely to go easy on their own work, says another co-author, Daniel
Benjamin, a behavioural economist at the University of Southern
California, Los Angeles. But Lakens thinks that any attempts to manipulate
P values will be obvious from the justifications that researchers pick.
“At least everyone agrees that it’s good to change the mindless use of
0.05,” he says. Setting specific thresholds for standards of evidence is
“bad for science”, says Ronald Wasserstein, executive director of the
American Statistical Association, which last year took the unusual step of
releasing explicit recommendations on the use of P values for the first
time in its 177-year history. Next month, the society will hold a
symposium on statistical inference, which follows on from its
recommendations. Wasserstein says he hasn’t yet taken a position on the
current debate over P value thresholds, but adds that “we shouldn’t be
surprised that there isn’t a single magic number”.},
urldate = {2017-11-17},
journal = {Nature News},
author = {Singh Chawla, Dalmeet},
month = sep,
year = {2017},
keywords = {Archive}
}
