Over the years, hundreds of published papers have warned that science’s love affair with statistics has spawned countless illegitimate findings. In fact, if you believe what you read in the scientific literature, you shouldn’t believe what you read in the scientific literature.
“There are more false claims made in the medical literature than anybody appreciates.”
Still, any single scientific study is quite likely to be incorrect, thanks largely to the fact that the standard statistical system for drawing conclusions is, in essence, illogical.
“A lot of scientists don’t understand statistics,” says Goodman. “And they don’t understand statistics because the statistics don’t make sense.”
How could so many studies be wrong? Because their conclusions relied on “statistical significance,” a concept at the heart of the mathematical analysis of modern scientific experiments.
Statistical significance is a phrase that every science graduate student learns, but few comprehend.
The modern notion was pioneered by the mathematician Ronald A. Fisher in the 1920s.
If P is less than .05 — meaning the chance of a fluke is less than 5 percent — the result should be deemed “statistically significant,” Fisher arbitrarily decreed.
But in fact, there’s no logical basis for using a P value from a single study to draw any conclusion.
If the chance of a fluke is less than 5 percent, two possible conclusions remain:
- there is a real effect, or
- the result is an improbable fluke.
Fisher’s method offers no way to know which is which.
On the other hand, if a study finds no statistically significant effect, that doesn’t prove anything, either. Perhaps the effect doesn’t exist, or maybe the statistical test wasn’t powerful enough to detect a small but real effect.
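A short simulation makes Fisher’s threshold concrete. The sketch below, using hypothetical sample sizes and a simple normal approximation in place of a full t-test, repeatedly compares two groups drawn from the very same distribution — so every “significant” result is, by construction, a fluke. The .05 cutoff fires about 5 percent of the time anyway.

```python
import math
import random

def p_value(sample_a, sample_b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    n = len(sample_a)
    mean_a = sum(sample_a) / n
    mean_b = sum(sample_b) / n
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (n - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (n - 1)
    z = (mean_a - mean_b) / math.sqrt(var_a / n + var_b / n)
    # Phi(z) via the error function, doubled for a two-sided test
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(42)
trials = 5000
# Both "drug" and "placebo" groups come from the same distribution:
# there is no real effect to find.
false_alarms = sum(
    p_value([random.gauss(0, 1) for _ in range(50)],
            [random.gauss(0, 1) for _ in range(50)]) < 0.05
    for _ in range(trials)
)
print(false_alarms / trials)  # hovers near 0.05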
Most scientists are confused about the meaning of a P value or how to interpret it. “It’s almost never, ever, ever stated correctly, what it means,” says Goodman.
Consider a commonly held misperception about the meaning of statistical significance at the .05 level:
“This means that it is 95 percent certain that the observed difference between groups, or sets of samples, is real and could not have arisen by chance.”
That interpretation commits an egregious logical error, confusing the odds of getting a result (if a hypothesis is true) with the odds favoring the hypothesis if you observe that result.
An analogy makes the mistake clear:
A well-fed dog may seldom bark, but observing the rare bark does not imply that the dog is hungry. A dog may bark 5 percent of the time even if it is well-fed all of the time.
Another common error equates statistical significance to “significance” in the ordinary use of the word.
a study with a very large sample can detect “statistical significance” for a small effect that is meaningless in practical terms.
Similarly, when studies claim that a chemical causes a “significantly increased risk of cancer,” they often mean that it is just statistically significant, possibly posing only a tiny absolute increase in risk.
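The sample-size effect is easy to see with a little arithmetic. In the sketch below (hypothetical numbers, same normal approximation as a z-test), a difference of just 0.02 standard deviations — far too small to matter clinically — sails past the .05 bar once each group holds a million subjects.

```python
import math

def z_and_p(effect_sd, n_per_group):
    """z statistic and two-sided p for a mean difference of effect_sd
    standard deviations, with n_per_group subjects in each arm."""
    z = effect_sd / math.sqrt(2.0 / n_per_group)
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# The same trivially small effect, tested at ever larger sample sizes
for n in (100, 10_000, 1_000_000):
    z, p = z_and_p(0.02, n)
    print(n, round(z, 2), p)
```

The effect never changes; only the sample grows. Statistical significance appears on cue, while practical significance stays exactly where it started: negligible.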
Statisticians perpetually caution against mistaking statistical significance for practical importance, but scientific papers commit that error often.
“I found that eight or nine of every 10 articles published in the leading journals make the fatal substitution” of equating statistical significance to importance.
Multiplicity of mistakes
Even when “significance” is properly defined and P values are carefully calculated, statistical inference is plagued by many other problems.
Chief among them is the “multiplicity” issue — the testing of many hypotheses simultaneously. When several drugs are tested at once, or a single drug is tested on several groups, chances of getting a statistically significant but false result rise rapidly.
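How fast do the chances rise? For independent tests each run at the .05 level, the arithmetic is simple — and sobering. A minimal sketch:

```python
def familywise_error(k, alpha=0.05):
    """Chance of at least one false positive among k independent tests,
    each run at significance level alpha."""
    return 1 - (1 - alpha) ** k

# One test keeps the advertised 5 percent risk; twenty tests do not.
for k in (1, 10, 20, 100):
    print(k, round(familywise_error(k), 3))
```

With 20 comparisons — a modest number for a drug trial tracking many outcomes — the odds of at least one spurious “significant” result are about 64 percent, and with 100 comparisons a fluke is a near certainty.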
Recognizing these problems, some researchers now calculate a “false discovery rate” to warn of flukes disguised as real effects.
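One widely used recipe for controlling the false discovery rate is the Benjamini-Hochberg procedure; the sketch below (with hypothetical p-values chosen for illustration) compares each sorted p-value against a rising threshold rather than a single fixed bar.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the p-values declared discoveries at false discovery rate q
    (Benjamini-Hochberg step-up procedure)."""
    m = len(p_values)
    ranked = sorted(p_values)
    # Find the largest rank i (1-based) whose p-value falls under
    # the BH line i * q / m; everything up to that rank is kept.
    cutoff_rank = 0
    for i, p in enumerate(ranked, start=1):
        if p <= i * q / m:
            cutoff_rank = i
    return ranked[:cutoff_rank]

pvals = [0.001, 0.004, 0.012, 0.03, 0.2, 0.4, 0.5, 0.7, 0.9, 0.95]
print(benjamini_hochberg(pvals))
```

Here the procedure keeps three results; a blunt Bonferroni correction (dividing .05 by the 10 tests) would keep only two, at the cost of missing more real effects.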
Clinical trials and errors
Statistical problems also afflict the “gold standard” for medical research, the randomized, controlled clinical trials that test drugs for their ability to cure or their power to harm.
Randomization also should ensure that unknown differences among individuals are mixed in roughly the same proportions in the groups being tested. But statistics do not guarantee an equal distribution any more than they prohibit 10 heads in a row when flipping a penny.
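The coin-flip point, and its consequence for trials, can both be checked directly. The simulation below uses hypothetical numbers — 100 patients, 20 of whom carry some unseen risk factor — and asks how often pure chance deals one 50-patient arm a noticeably lopsided share of them.

```python
import random

# Ten heads in a row: improbable, but about one run in every 1,000 tries
print(0.5 ** 10)

# Randomly split 100 patients (20 carrying a hidden risk factor)
# into two arms of 50, and count how often the split is lopsided.
random.seed(7)
patients = [1] * 20 + [0] * 80      # 1 = hidden risk factor
trials = 10_000
lopsided = 0
for _ in range(trials):
    random.shuffle(patients)
    carriers_in_a = sum(patients[:50])
    if abs(carriers_in_a - 10) >= 5:  # 5+ away from the even 10-10 split
        lopsided += 1
print(lopsided / trials)
```

The lopsided splits are rare, but they happen — randomization balances groups only on average, not in any particular trial.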
Still, trial results are reported as averages that may obscure individual differences, masking beneficial or harmful effects and possibly leading to approval of drugs that are deadly for some and denial of effective treatment to others.
“Determining the best treatment for a particular patient is fundamentally different from determining which treatment is best on average,”
“Reporting a single number gives the misleading impression that the treatment effect is a property of the drug rather than of the interaction between the drug and the complex risk-benefit profile of a particular group of patients.”
Another concern is the common strategy of combining results from many trials into a single “meta-analysis,” a study of studies.
But statistical techniques for doing so are valid only if certain criteria are met. For one thing, all the studies conducted on the drug must be included — published and unpublished. And all the studies should have been performed in a similar way, using the same protocols, definitions, types of patients and doses.
Meta-analyses have produced many controversial conclusions.
Common claims that antidepressants work no better than placebos, for example, are based on meta-analyses that do not conform to the criteria that would confer validity.
In principle, a proper statistical analysis can suggest an actual risk even though the raw numbers show a benefit.
“Across the trials, there was no standard method for identifying or validating outcomes; events … may have been missed or misclassified.”
More recently, epidemiologist Charles Hennekens and biostatistician David DeMets have pointed out that combining small studies in a meta-analysis is not a good substitute for a single trial sufficiently large to test a given question.
These concerns do not make clinical trials worthless, nor do they render science impotent. Some studies show dramatic effects that don’t require sophisticated statistics to interpret. If the P value is 0.0001 — a hundredth of a percent chance of a fluke — that is strong evidence.
“Replication is vital,” says statistician Juliet Shaffer, a lecturer emeritus at the University of California, Berkeley. And in medicine, she says, the need for replication is widely recognized.
Such sad statistical situations suggest that the marriage of science and math may be desperately in need of counseling.
Most critics of standard statistics advocate the Bayesian approach to statistical reasoning, a methodology that derives from a theorem credited to Bayes, an 18th century English clergyman.
Bayesian math seems baffling at first, even to many scientists, but it basically just reflects the need to include previous knowledge when drawing conclusions from new observations.
To infer the odds that a barking dog is hungry, for instance, it is not enough to know how often the dog barks when well-fed. You also need to know how often it eats — in order to calculate the prior probability of being hungry.
Bayesian math combines a prior probability with observed data to produce an updated probability of the hunger hypothesis.
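The dog example can be worked through numerically. In the sketch below, the 5 percent bark rate for a well-fed dog comes from the article; the prior probability of hunger and the bark rate when hungry are hypothetical numbers chosen purely for illustration.

```python
def posterior_hungry(prior_hungry, bark_if_hungry, bark_if_fed):
    """Bayes' theorem: probability the dog is hungry, given that it barked."""
    numerator = bark_if_hungry * prior_hungry
    # Total probability of hearing a bark, hungry or not
    evidence = numerator + bark_if_fed * (1 - prior_hungry)
    return numerator / evidence

# Suppose the dog is hungry 10 percent of the time and barks 80 percent
# of the time when hungry (both assumed), versus the 5 percent bark
# rate when well-fed.
print(posterior_hungry(prior_hungry=0.1, bark_if_hungry=0.8, bark_if_fed=0.05))
```

Even though a bark is 16 times likelier from a hungry dog, the posterior lands at 64 percent, not 95 — because hunger was improbable to begin with. The prior does real work.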
The Bayesian approach has become more widely applied in medicine and other fields in recent years. In many real-life contexts, Bayesian methods do produce the best answers to important questions.
But Bayesian methods introduce confusion about the actual meaning of the mathematical concept of “probability” in the real world.
- Standard or “frequentist” statistics treat probabilities as objective realities;
- Bayesians treat probabilities as “degrees of belief” based in part on a personal assessment or subjective decision about what to include in the calculation.
That’s a tough placebo to swallow for scientists wedded to the “objective” ideal of standard statistics. “Subjective prior beliefs are anathema to the frequentist, who relies instead on a series of ad hoc algorithms that maintain the facade of scientific objectivity.”
“What does probability mean in real life?” the statistician David Salsburg asked in his 2001 book The Lady Tasting Tea. “This problem is still unsolved, and … if it remains unsolved, the whole of the statistical approach to science may come crashing down from the weight of its own inconsistencies.”