Skip to content
AL | Apatheia Labs

Statistical Fallacies - Common Errors in Quantitative Reasoning

A comprehensive examination of statistical fallacies that compromise research validity and institutional decision-making, with methods for detection and prevention.

CompleteData Science18 January 202620 min read

Statistical Fallacies: Common Errors in Quantitative Reasoning

Executive Summary

Statistical reasoning is the foundation of evidence-based decision-making across law enforcement, healthcare, journalism, regulatory investigations, and scientific research. Yet statistical fallacies---systematic errors in interpreting or presenting quantitative evidence---pervade institutional practice. These errors are not mere technical mistakes; they drive wrongful convictions, flawed policy decisions, misallocated resources, and unjustified conclusions that propagate through institutional document chains.

This article examines the most consequential statistical fallacies encountered in forensic document analysis, provides frameworks for their detection, and demonstrates how the Phronesis platform systematically surfaces these errors during institutional accountability analysis.

Core insight: Statistical fallacies often appear mathematically sophisticated while being fundamentally misleading. Detection requires understanding not just what numbers say, but what questions they can legitimately answer.



1. Base Rate Neglect and the Prosecutor's Fallacy

1.1 The Fundamental Error

Base rate neglect occurs when decision-makers ignore the underlying frequency of events when interpreting evidence. This fallacy is particularly dangerous in legal and investigative contexts where it manifests as the prosecutor's fallacy: confusing the probability of evidence given innocence with the probability of innocence given evidence.

Consider a forensic match with a 1-in-10,000 random match probability. The prosecutor's fallacy concludes that the defendant has only a 1-in-10,000 chance of being innocent. This is mathematically wrong.

1.2 Bayes' Theorem and Correct Reasoning

The correct analysis requires Bayes' theorem:

P(Guilty|Evidence) = P(Evidence|Guilty) * P(Guilty) / P(Evidence)

If the prior probability of guilt (base rate) is 1-in-1,000,000 suspects, and the evidence has a 1-in-10,000 false positive rate, then even with a match, approximately 99 out of 100 matches would be innocent persons.

Worked Example: DNA evidence in a city of 5 million people.

  • Random match probability: 1 in 10,000
  • Number who would match by chance: 500 people
  • Actual perpetrator: 1 person
  • Probability matched person is guilty: 1/501 = 0.2%

The 1-in-10,000 statistic, when presented without base rate context, misleads juries into believing guilt is near-certain when statistical reality suggests the opposite.

1.3 Institutional Manifestations

Expert witness reports frequently present match probabilities without adequate base rate context. Statements like "the probability of this DNA profile appearing by chance is 1 in 10 billion" invite base rate neglect by the fact-finder.

Medical diagnosis documents similarly confuse sensitivity and specificity with predictive value. A test with 99% sensitivity and 99% specificity applied to a condition affecting 1 in 1,000 people will yield more false positives than true positives.

Detection method: Identify any probabilistic claim presented without explicit base rate context. Flag statements of the form "the probability of X is Y" that do not specify the reference population and prior probability.


2. P-Hacking, HARKing, and the Garden of Forking Paths

2.1 The Replication Crisis

The scientific community's crisis of confidence in published research stems largely from researcher degrees of freedom---the many decisions made during data collection and analysis that are not reported but affect results. As Simmons, Nelson, and Simonsohn (2011) demonstrated, flexible analysis practices can generate statistically significant results from pure noise.

2.2 P-Hacking Defined

P-hacking refers to practices that inflate the likelihood of obtaining statistically significant results:

  • Testing multiple outcome variables and reporting only significant ones
  • Adding or removing control variables until significance is achieved
  • Continuing data collection until results reach p < 0.05
  • Excluding "outliers" based on whether they affect significance
  • Transforming variables to achieve significance

Gelman and Loken (2013) term this the "garden of forking paths": each analytical decision represents a fork, and researchers traverse the path that leads to publishable results without acknowledging the paths not taken.

2.3 HARKing: Hypothesising After Results Known

HARKing occurs when researchers present post-hoc hypotheses as if they were pre-specified predictions. A study that tested 20 variables, found one significant at p < 0.05 (expected by chance alone), and presents this finding as a confirmed hypothesis is engaging in HARKing.

The probability of at least one false positive in 20 independent tests at alpha = 0.05 is:

P(at least one false positive) = 1 - (0.95)^20 = 0.64

Sixty-four percent of studies testing 20 variables will find at least one "significant" result by chance alone.

2.4 Detecting These Fallacies in Institutional Documents

Indicators of p-hacking:

  • Round p-values clustering just below 0.05
  • Inconsistency between stated hypotheses and reported analyses
  • Unexplained exclusion of data points
  • Multiple analysis approaches with only one reported
  • Effect sizes implausibly large for the field

Indicators of HARKing:

  • Vague or absent pre-registration
  • Perfect alignment between "predictions" and results
  • Post-hoc framing language ("as expected," "confirming our hypothesis") for exploratory findings
  • Absence of null or unexpected results

Ioannidis (2005) famously argued that "most published research findings are false," particularly in fields with small sample sizes, many tested relationships, flexible designs, and financial or career incentives. Forensic document analysis must approach research claims with appropriate scepticism.


3. Simpson's Paradox and Ecological Fallacy

3.1 When Aggregation Reverses Relationships

Simpson's paradox occurs when a trend present in aggregated data reverses or disappears when the data is disaggregated by a confounding variable. The paradox is not merely theoretical; it has driven real institutional decisions with devastating consequences.

Classic Example: UC Berkeley Admissions (1973)

Aggregated data suggested discrimination against women:

  • Men admitted: 44%
  • Women admitted: 35%

However, disaggregated by department, women were admitted at equal or higher rates than men in most departments. The paradox arose because women disproportionately applied to more competitive departments with lower overall admission rates.

3.2 The Ecological Fallacy

The ecological fallacy is the related error of drawing individual-level conclusions from group-level data. If regions with higher chocolate consumption have more Nobel laureates, we cannot conclude that eating chocolate makes individuals more likely to win Nobel Prizes. The correlation may reflect confounding variables (national wealth, education infrastructure) that affect both measures.

3.3 Institutional Manifestations

Performance statistics comparing institutions often fall prey to Simpson's paradox. A hospital with worse aggregate survival rates may actually provide better care---if it treats more severe cases. League tables and performance metrics require case-mix adjustment.

Discrimination claims based on aggregate disparities must consider whether the comparison groups are equivalent on relevant characteristics. A pay gap between groups may reflect different role distributions rather than discrimination within roles.

Policy evaluations using ecological correlations ("countries with policy X have better outcomes") cannot establish that implementing policy X would improve outcomes in a different context.

Detection method: When encountering aggregate statistics that imply causation or discrimination, demand disaggregated analysis. Ask: "Does this relationship hold within relevant subgroups?"


4. Regression to the Mean Misattribution

4.1 The Statistical Phenomenon

Regression to the mean describes the statistical tendency for extreme observations to be followed by less extreme ones. This occurs whenever there is random variation in measured outcomes and the measurement is imperfectly correlated over time.

4.2 The Attribution Error

The fallacy arises when regression to the mean is misattributed to an intervention. Consider:

  • Patients are selected for treatment because their symptoms are unusually severe (extreme observation)
  • After treatment, symptoms improve (regression toward the mean)
  • The improvement is attributed to the treatment

Without a control group, regression to the mean is indistinguishable from treatment effect.

4.3 Institutional Manifestations

Educational interventions: Schools identified as "failing" based on one year's test scores often improve the next year---with or without intervention---because the extreme score contained random measurement error that regresses.

Crime reduction programmes: Areas selected for intervention due to crime spikes may show improvement through regression alone. "Successful" programmes may simply be measuring statistical noise.

Performance management: Employees identified for "performance improvement plans" based on a poor period may naturally improve through regression. The plan receives credit for statistical inevitability.

Medical treatment evaluations: Kahneman (2011) notes that punishing poor performance and rewarding good performance appears effective because regression makes the punished improve and the rewarded decline---reinforcing incorrect beliefs about the efficacy of punishment.

Detection method: Ask whether cases were selected based on extreme values of the outcome variable. If so, apparent change may reflect regression rather than any genuine effect.


5. Survivorship Bias

5.1 The Visibility of Survivors

Survivorship bias occurs when analysis focuses on cases that passed some selection process while ignoring those that did not, leading to false conclusions about what predicts success.

The classic illustration comes from World War II: Abraham Wald advised the military to armour the parts of returning aircraft that showed no damage, reasoning that aircraft hit in those locations had not returned to be studied.

5.2 Institutional Manifestations

Success studies: Analysing only successful organisations to identify "best practices" ignores failed organisations that may have followed identical practices. Successful outcomes may reflect luck rather than replicable strategy.

Career guidance: Interviewing successful people about their path creates survivorship bias---identical paths followed by unsuccessful people are invisible. "Follow your passion" may be advice that survivors can afford to give retrospectively.

Investment analysis: Funds that underperform are closed or merged, leaving only successful funds in the analysed pool. Historical fund performance overstates likely future returns because failures have been culled from the record.

Legal precedent: Published court decisions represent a biased sample of legal disputes---cases that settled, were abandoned, or produced unreported decisions are invisible. Precedent may not reflect the full distribution of relevant fact patterns.

Detection method: Ask what happened to cases that were filtered out before analysis. Is the sample representative of the population about which conclusions are drawn?


6. File Drawer Effect and Publication Bias

6.1 The Missing Studies

Rosenthal (1979) introduced the file drawer problem: studies with null or negative results remain unpublished, creating a systematic bias in the published literature toward positive findings. If 20 research teams test a hypothesis and only the one finding significance publishes, the literature will consist entirely of a false positive.

6.2 Magnitude and Detection

Meta-analyses reveal publication bias through funnel plot asymmetry: small studies should scatter symmetrically around the true effect size, but if negative small studies are suppressed, the funnel is asymmetric.

Franco, Malhotra, and Simonovits (2014) tracked all studies associated with the TESS experiment programme, finding that 40% of studies with null results went unpublished compared to 10% with positive results.

6.3 Institutional Implications

Expert testimony: When experts cite "the literature shows..." they cite a biased sample. Null results---equally informative scientifically---are systematically absent.

Evidence-based practice: Guidelines based on published research inherit publication bias. Interventions may appear more effective than they are because failures go unpublished.

Regulatory decisions: Drug trials with negative results are less likely to be published, inflating apparent efficacy of approved medications. The true evidence base is not the published evidence base.

Detection method: Look for pre-registration of studies, meta-analyses that include unpublished data, and acknowledgement of the file drawer problem. Be sceptical of literature reviews that do not address publication bias.


7. Cherry-Picking and the Texas Sharpshooter Fallacy

7.1 Selective Presentation

Cherry-picking is the deliberate selection of data points or time periods that support a preferred conclusion while ignoring contradictory evidence. Unlike p-hacking, which may be unconscious, cherry-picking often involves conscious selection.

7.2 The Texas Sharpshooter Fallacy

This fallacy involves drawing a target around bullet holes after shooting---finding patterns in random data and treating them as meaningful. Applied to institutions:

  • Examining many possible clusters until one appears significant
  • Defining "success" after observing which outcomes occurred
  • Selecting comparison groups post-hoc to maximise apparent effect

7.3 Institutional Manifestations

Trend selection: Organisations select time periods that show improvement. "Performance improved 20% since 2019" may hide deterioration since 2020 or improvement already occurring before intervention.

Geographic cherry-picking: "Our programme succeeded in Region X" ignores Regions Y and Z where it failed. Without pre-specification of evaluation sites, positive results may reflect chance or site-specific factors.

Metric selection: When many metrics are available, organisations report those showing improvement. "Academic standards" may include only metrics that rose while omitting those that fell.

Document analysis context: A broadcast presenting a 13.2:1 framing ratio in favour of prosecution represents extreme cherry-picking of which facts to include. The statistical fingerprint of directional omission (100% prosecution-favouring) quantifies what might otherwise appear as editorial judgment.

Detection method: Demand specification of the full dataset, time period, geographic scope, and metric set before examining results. Pre-registration prevents post-hoc cherry-picking.


8. Sampling Bias and Non-Representative Samples

8.1 The Generalisation Problem

Conclusions are valid only for the population from which samples were drawn. Sampling bias occurs when the sample systematically differs from the target population, rendering conclusions non-generalisable.

8.2 Common Sampling Problems

Convenience sampling: Using easily accessible cases (university students, online volunteers, existing clients) produces samples that differ from general populations on multiple dimensions.

Self-selection bias: Voluntary participation creates samples that differ in motivation, awareness, and often outcomes from non-participants. Programme evaluations based on participants cannot be generalised to non-participants.

Non-response bias: Those who respond to surveys differ from non-respondents. Low response rates guarantee non-representativeness.

Survivor sampling: Studying only ongoing cases ignores completed or failed cases that may differ systematically.

8.3 Institutional Manifestations

Complaint analysis: Analysing complaints tells you about people who complain, not about the overall population. High complaint rates may reflect complainant characteristics rather than institutional performance.

Customer satisfaction: Satisfaction surveys capture only customers who remained customers long enough to be surveyed. Dissatisfied customers who left are absent.

Risk assessment validation: Validating risk instruments on cases that reached a certain stage ignores cases that were diverted earlier. The validation sample differs from the application sample.

Detection method: Identify the sample and the claimed population of generalisation. Assess whether selection mechanisms could create systematic differences between them.


9. Correlation-Causation Confusion

9.1 The Fundamental Distinction

Correlation measures the strength of association between variables. Causation implies that changing one variable produces change in another. Correlation is symmetric and does not imply direction; causation is asymmetric and directional.

9.2 Why Correlation Does Not Imply Causation

Three alternative explanations for any observed correlation:

  1. Reverse causation: B causes A, not A causes B
  2. Common cause: C causes both A and B
  3. Coincidence: With enough variables, spurious correlations arise by chance

9.3 Establishing Causation

The counterfactual framework (Rubin, 1974; Pearl, 2000) defines causation in terms of what would have happened in the absence of the cause. Establishing causation requires:

  • Randomised controlled trials (experimental manipulation)
  • Natural experiments (as-if random assignment)
  • Instrumental variables (exogenous variation)
  • Regression discontinuity (threshold-based assignment)
  • Difference-in-differences (parallel trends assumption)

Observational correlations alone cannot establish causation regardless of sample size or statistical sophistication.

9.4 Institutional Manifestations

Performance attribution: "Outcome improved after intervention" does not establish that the intervention caused improvement. Pre-existing trends, regression to the mean, or confounding factors may explain the change.

Risk factor claims: "People with characteristic X have worse outcomes" does not establish that X causes the outcomes. People with X may differ on unmeasured variables that actually drive outcomes.

Media narratives: Documentary treatments frequently imply causation from temporal sequence. "After A happened, B occurred" invites but does not establish a causal interpretation.

Detection method: Demand explicit causal identification strategy. What alternative explanations were ruled out? How was confounding addressed?


10. Misuse of Statistical Significance

10.1 What P-Values Actually Mean

A p-value is the probability of obtaining results at least as extreme as observed, given that the null hypothesis is true. It is not:

  • The probability the null hypothesis is true
  • The probability the result occurred by chance
  • The probability of replication
  • The probability the finding is real

10.2 The Significance Threshold Problem

Fisher intended p < 0.05 as a heuristic for "worth further investigation," not as a bright line separating real from unreal effects. The arbitrary threshold creates perverse incentives:

  • P = 0.049 is treated as categorically different from p = 0.051
  • Journals and careers reward "significant" results
  • Effect size and practical importance are ignored

The American Statistical Association's (2016) statement on p-values emphasised that "scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold."

10.3 Confidence Intervals and Effect Sizes

Superior to significance testing:

  • Confidence intervals quantify uncertainty around estimates
  • Effect sizes (Cohen's d, odds ratios) measure practical magnitude
  • Bayesian methods directly estimate probability of hypotheses

10.4 Institutional Manifestations

Research claims: "Statistically significant effect" may describe trivially small effects with no practical importance. An intervention that improves outcomes by 0.01% may achieve statistical significance with sufficient sample size.

Regulatory decisions: Binary significant/not-significant distinctions ignore the continuous nature of evidence. Absence of significance is not evidence of absence.

Litigation: Expert witnesses may present p-values without explaining their meaning, leading fact-finders to misinterpret statistical significance as practical significance or certainty.

Detection method: Demand effect sizes and confidence intervals alongside p-values. Ask whether effects are practically meaningful, not just statistically detectable.


11. Application to Evaluating Research Claims and Institutional Statistics

11.1 A Systematic Evaluation Framework

When encountering statistical claims in institutional documents, apply this checklist:

Base Rates

  • Is the base rate specified?
  • Has the prosecutor's fallacy been avoided?
  • Is Bayesian reasoning applied where appropriate?

Research Practices

  • Was the study pre-registered?
  • How many variables were tested?
  • Are all analyses reported or only significant ones?

Aggregation

  • Could Simpson's paradox affect interpretation?
  • Has the data been disaggregated appropriately?
  • Are ecological correlations mistaken for individual effects?

Selection Mechanisms

  • Were cases selected on extreme values (regression to mean)?
  • What happened to cases that were filtered out (survivorship)?
  • Is the sample representative of the target population?

Publication and Presentation

  • What studies might be missing (file drawer)?
  • Were time periods or cases selected to show desired results?
  • Is the causal identification strategy explicit?

Significance Claims

  • Are effect sizes and confidence intervals provided?
  • Is statistical significance conflated with practical importance?
  • Has multiple testing been addressed?

11.2 Red Flags in Institutional Documents

  • Probability statements without base rate context
  • Round p-values (exactly 0.05, 0.01)
  • No mention of pre-registration or multiple testing
  • Causal language based on observational data
  • Metrics defined or selected post-hoc
  • Missing or unexplained exclusions
  • Perfect alignment between predictions and results
  • Absence of null or unexpected findings

12. Connection to Phronesis Platform

The Phronesis platform operationalises detection of statistical fallacies through multiple integrated engines.

12.1 Bias Detection Engine

The Bias Detection Engine quantifies directional imbalance in statistical presentation:

  • Omission analysis: Calculates the directional bias score for what statistics are included versus omitted
  • Framing ratios: Measures relative attention to different aspects of statistical evidence
  • Cherry-pick detection: Flags time periods, subgroups, or metrics that differ from pre-specified analysis plans

A directional bias score of +1.0 in statistical presentation---where all presented statistics favour one interpretation while contradicting statistics are omitted---quantifies cherry-picking with mathematical precision.

12.2 Contradiction Engine

The Contradiction Engine flags internal statistical inconsistencies:

  • Numbers that do not sum correctly
  • Percentages exceeding 100% or logically incompatible
  • Temporal impossibilities in dated statistics
  • Claims contradicted by cited sources

When a report claims an effect is "statistically significant at p < 0.01" but the confidence interval includes zero, the Contradiction Engine surfaces this inconsistency.

12.3 Evidence Mapping

Phronesis requires explicit evidence mapping for statistical claims:

  • Each claim linked to its supporting data
  • Sample sizes and populations identified
  • Causal claims flagged for identification strategy
  • Uncertainty quantified through confidence intervals

Claims presented as fact but lacking statistical support are automatically flagged for scrutiny.

12.4 CASCADE Integration

The CASCADE methodology traces how statistical fallacies propagate through institutional document chains:

  • ANCHOR: Identify where flawed statistics originated
  • INHERIT: Track which subsequent documents cite the flawed claim without verification
  • COMPOUND: Quantify how the claim gained apparent authority through repetition
  • ARRIVE: Map the decisions and outcomes driven by the flawed statistical reasoning

A prosecutor's fallacy in an original expert report, inherited through multiple institutional documents without correction, compounding into apparent certainty, and arriving at a wrongful conviction---this is the pattern CASCADE reveals.


13. Conclusion: Statistical Literacy as Institutional Defence

Statistical fallacies are not rare exceptions but pervasive features of institutional practice. They survive because:

  • They often favour institutional narratives
  • Detection requires statistical literacy that many decision-makers lack
  • The mathematics of fallacious reasoning can appear sophisticated
  • Correction requires admitting previous error

Defending against statistical fallacies requires:

  1. Education: Decision-makers must understand base rates, sampling, causation, and significance
  2. Structure: Systematic review processes that force explicit consideration of alternative interpretations
  3. Tools: Automated detection that surfaces inconsistencies and biased presentation
  4. Accountability: Consequences for institutions that persist in fallacious statistical reasoning

The goal is not perfect statistical practice---an impossible standard---but systematic detection and correction of the errors that compromise institutional decision-making. Statistics, properly understood and properly challenged, remain essential tools for achieving clarity without distortion.


Sources

Foundational Texts

  • Ioannidis, John P. A. "Why Most Published Research Findings Are False." PLOS Medicine 2, no. 8 (2005): e124.
  • Kahneman, Daniel. Thinking, Fast and Slow. Farrar, Straus and Giroux, 2011.
  • Gelman, Andrew, and Eric Loken. "The Garden of Forking Paths: Why Multiple Comparisons Can Be a Problem, Even When There Is No 'Fishing Expedition' or 'P-Hacking' and the Research Hypothesis Was Posited Ahead of Time." Working paper, 2013.
  • Pearl, Judea. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.

Statistical Methodology

  • Simmons, Joseph P., Leif D. Nelson, and Uri Simonsohn. "False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant." Psychological Science 22, no. 11 (2011): 1359-1366.
  • Wasserstein, Ronald L., and Nicole A. Lazar. "The ASA's Statement on P-Values: Context, Process, and Purpose." The American Statistician 70, no. 2 (2016): 129-133.
  • Franco, Annie, Neil Malhotra, and Gabor Simonovits. "Publication Bias in the Social Sciences: Unlocking the File Drawer." Science 345, no. 6203 (2014): 1502-1505.
  • Rosenthal, Robert. "The File Drawer Problem and Tolerance for Null Results." Psychological Bulletin 86, no. 3 (1979): 638-641.
  • Thompson, William C., and Edward L. Schumann. "Interpretation of Statistical Evidence in Criminal Trials: The Prosecutor's Fallacy and the Defense Attorney's Fallacy." Law and Human Behavior 11, no. 3 (1987): 167-187.
  • Bickel, Peter J., Eugene A. Hammel, and J. William O'Connell. "Sex Bias in Graduate Admissions: Data from Berkeley." Science 187, no. 4175 (1975): 398-404.
  • Rubin, Donald B. "Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies." Journal of Educational Psychology 66, no. 5 (1974): 688-701.

Institutional Analysis

  • Gigerenzer, Gerd. Calculated Risks: How to Know When Numbers Deceive You. Simon and Schuster, 2002.
  • Spiegelhalter, David. The Art of Statistics: Learning from Data. Penguin, 2019.
  • McRaney, David. You Are Not So Smart: Why You Have Too Many Friends on Facebook, Why Your Memory Is Mostly Fiction, and 46 Other Ways You're Deluding Yourself. Penguin, 2012.

Document Control

Version: 1.0 Date: 2026-01-18 Author: Apatheia Labs Research Classification: Research Article - Data Science Word Count: Approximately 3,400 Review Cycle: Annual update recommended


"Statistical fallacies often appear mathematically sophisticated while being fundamentally misleading. Detection requires understanding not just what numbers say, but what questions they can legitimately answer."


End of Document