Getting science right through the proper application of statistics should be at the forefront for all of us now in harm’s way of the current measles outbreak. This preventable disease is spreading because parents choose not to vaccinate their children for fear of autism, a long-discredited side effect reported erroneously in a fraudulent 1998 study. Had better vetting kept that study from ever seeing the light of day, a great deal of misery would have been prevented. Sadly, even the most well-meaning researchers tend to put too much faith in the probability (P) values that seemingly confer significance on the data they collect to test their hypotheses.
Nature weighed in with its shots against scientists who misuse P values in this February 2014 article by statistics professor Regina Nuzzo. She bemoans the data dredgers who come up with attention-getting, counterintuitive results by applying the widely accepted 0.05 P-value filter to long-shot hypotheses. A prime example is the finding by three University of Virginia researchers that moderates literally perceived shades of gray more accurately than extremists on the left and right (P = 0.01). As they admirably admitted in this follow-up report on Restructuring Incentives and Practices to Promote Truth Over Publishability, this controversial effect evaporated upon replication. This chart on probable cause reveals that these significance chasers produce results with a false-positive rate of nearly 90%! A sketch of the arithmetic behind that figure follows below.
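To see how a "significant" P value can still be a probable false alarm, here is a minimal sketch of the Bayesian arithmetic behind that kind of chart. It assumes the Sellke-Berger minimum Bayes factor bound, -e*p*ln(p), as the most generous conversion from a P value to evidence, and prior odds of 19-to-1 against a long-shot hypothesis; the function names are my own illustration, not taken from Nuzzo's article.

```python
from math import e, log

def min_bayes_factor(p):
    """Sellke-Berger upper bound on the evidence against the null:
    BF = 1 / (-e * p * ln(p)), i.e. the most favorable reading of p."""
    return 1.0 / (-e * p * log(p))

def false_alarm_probability(p, prior_odds_true):
    """Posterior probability that a 'significant' result is a false positive,
    given the P value and the prior odds that the hypothesis is true."""
    posterior_odds = min_bayes_factor(p) * prior_odds_true
    prob_true = posterior_odds / (1.0 + posterior_odds)
    return 1.0 - prob_true

# Long-shot hypothesis: 19-to-1 odds against being true (prior odds = 1/19).
print(false_alarm_probability(0.05, 1 / 19))  # ~0.89 -- roughly the 90% cited above
print(false_alarm_probability(0.01, 1 / 19))  # ~0.70 -- better, but nowhere near 1%
```

The exact numbers shift with the choice of Bayes factor bound and prior odds, but the lesson is the same: for an implausible hypothesis, clearing the 0.05 bar is weak evidence indeed.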
Nuzzo lays out a number of proposals to put a damper on overconfident reports of purported scientific findings. I like the preregistered-replication standard developed by Andrew Gelman of Columbia University, which he discussed in this article on The Statistical Crisis in Science in the November-December 2014 issue of American Scientist. It leaves scientists free to pursue potential breakthroughs at early stages, when data remain sketchy, while subjecting them to rigorous standards further on, prior to publication.
“The irony is that when UK statistician Ronald Fisher introduced the P value in the 1920s, he did not mean it to be a definitive test. He intended it simply as an informal way to judge whether evidence was significant in the old-fashioned sense: worthy of a second look.”