Patrick J. Rolwes

Interactive Tool

Permutation Testing
for Two-Group Comparisons

An interactive guide to permutation testing — what it is, how it works, when to use it, and where it falls short. The calculator handles any binary two-group outcome, with adverse impact analysis as the primary example.

Companion to: Rolwes, P. J., Courey, K. A., Oswald, F. L., & Martín-Raugh, M. P. (under review). Advancing adverse impact analysis: Permutation testing and Bayesian methods for small sample challenges. Organizational Research Methods.

01

What is Permutation Testing?

A permutation test is a nonparametric significance test that answers a direct question: if group membership had no effect on the outcome, how often would we observe a difference as large as the one we actually saw?

Rather than relying on theoretical probability distributions — like the chi-square or normal — permutation tests build an empirical null distribution directly from your own data. The logic:

1

Observe. Calculate the test statistic from your real data — typically the difference in outcome rates between two groups.

2

Pool. Combine all participants into one pool, temporarily ignoring group membership.

3

Shuffle. Randomly reassign participants to groups of the original sizes. This simulates the null hypothesis.

4

Record. Calculate the test statistic for this shuffled dataset.

5

Repeat. Do this thousands of times to build the null distribution.

6

Compare. Count how often the permuted statistic is at least as extreme as the one you observed. That proportion is your p-value.

Core assumption: Under H₀, the outcome labels are exchangeable across groups — meaning group membership carries no information about the outcome. This is a much weaker assumption than normality or large sample size.
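The six steps above can be sketched directly in Python. This is a minimal Monte Carlo sketch, not the calculator's actual implementation; the function name and defaults are illustrative:

```python
import numpy as np

def permutation_test(successes_a, n_a, successes_b, n_b,
                     n_permutations=10_000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in
    proportions between two groups with binary outcomes."""
    rng = np.random.default_rng(seed)

    # Step 1: observe — the real difference in outcome rates.
    observed = successes_a / n_a - successes_b / n_b

    # Step 2: pool — one array of 0/1 outcomes, group labels ignored.
    pooled = np.concatenate([
        np.ones(successes_a), np.zeros(n_a - successes_a),
        np.ones(successes_b), np.zeros(n_b - successes_b),
    ])

    # Steps 3–5: shuffle, recompute the statistic, repeat.
    null = np.empty(n_permutations)
    for i in range(n_permutations):
        rng.shuffle(pooled)
        null[i] = pooled[:n_a].mean() - pooled[n_a:].mean()

    # Step 6: compare — proportion of permuted statistics at least
    # as extreme as the observed one.
    return np.mean(np.abs(null) >= abs(observed))
```

With identical group rates the observed difference is zero and every permuted statistic is at least as extreme, so the p-value is 1; as the rates pull apart, the p-value shrinks.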

02

Interactive Calculator

Enter any binary two-group comparison — selection, promotion, training completion, or any pass/fail outcome. Use the presets to explore different scenarios, or enter your own data below.

A small-sample adverse impact scenario. Note how permutation and chi-square p-values can diverge — this is the core motivation for our ORM paper.

[Calculator preset: Majority Group rate 50.0%, Minority Group rate 30.0%]


03

When to Use Permutation Testing

Small samples

The chi-square test relies on asymptotic approximations that degrade when expected cell counts fall below 5. Permutation tests make no distributional assumptions and remain valid regardless of sample size — making them especially valuable in adverse impact analysis, where samples are often small.
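The divergence is easy to reproduce. Below, a small 2×2 table (the counts are purely illustrative, not from the paper) is tested both ways; for a 2×2 table the chi-square p-value with one degree of freedom has a closed form via the complementary error function, so no statistics library is needed:

```python
import math
import numpy as np

# Hypothetical counts (illustrative only):
# majority: 4 of 8 selected; minority: 1 of 6 selected.
a, b = 4, 4   # majority: selected, not selected
c, d = 1, 5   # minority: selected, not selected
n = a + b + c + d

# Pearson chi-square for a 2x2 table; several expected cell counts
# here fall below 5, so the asymptotic approximation is shaky.
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
chi2_p = math.erfc(math.sqrt(chi2 / 2))  # exact for df = 1

# Monte Carlo permutation p-value for the same table.
rng = np.random.default_rng(42)
pooled = np.array([1.0] * a + [0.0] * b + [1.0] * c + [0.0] * d)
observed = a / (a + b) - c / (c + d)
null = np.empty(10_000)
for i in range(10_000):
    rng.shuffle(pooled)
    null[i] = pooled[:a + b].mean() - pooled[a + b:].mean()
# Small epsilon guards against floating-point ties at the boundary.
perm_p = np.mean(np.abs(null) >= abs(observed) - 1e-12)

print(f"chi-square p = {chi2_p:.3f}, permutation p = {perm_p:.3f}")
```

For this table the asymptotic chi-square p-value comes out near 0.20 while the permutation p-value sits near 0.30 (the exact enumeration gives 602/2002 ≈ 0.301): with cells this small, the approximation and the exact answer visibly disagree.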

Binary outcomes

Any pass/fail, selected/not-selected, or yes/no outcome across two groups is a natural fit. This includes personnel selection, promotion decisions, training completion, and termination rates.

When exact inference matters

Because the null distribution is constructed from your actual data, the resulting p-value is exact in finite samples (up to Monte Carlo error from the finite number of shuffles) rather than an asymptotic approximation. This is particularly important when the stakes of the decision — legal, organizational, or otherwise — are high.

Teaching and communication

The permutation distribution is visually intuitive: it shows what results would look like if group membership had no effect. This makes the null hypothesis concrete and easier to communicate to audiences without statistical training.


04

Limitations & Honest Caveats

Permutation testing is a powerful tool, but not a universal one. A clear-eyed understanding of its limits is as important as knowing its strengths.

The exchangeability assumption

Permutation tests assume that, under H₀, outcomes are exchangeable across groups. If groups differ systematically on variables correlated with the outcome — qualifications, experience, tenure — the assumption is violated and results can be misleading.

P-values are not effect sizes

A significant permutation test tells you the observed result is unlikely under the null. It says nothing about how large or practically meaningful the difference is. Always accompany p-values with effect size estimates and, where applicable, confidence intervals.
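Two effect sizes commonly reported alongside the p-value in this setting are the raw rate difference and the impact ratio (the statistic behind the four-fifths rule). A small illustrative helper, with names of my own choosing:

```python
def effect_sizes(sel_minority, n_minority, sel_majority, n_majority):
    """Effect sizes to report alongside a permutation p-value for a
    binary two-group comparison."""
    rate_min = sel_minority / n_minority
    rate_maj = sel_majority / n_majority
    impact_ratio = rate_min / rate_maj
    return {
        "rate_difference": rate_maj - rate_min,
        "impact_ratio": impact_ratio,
        # Four-fifths rule: flag an impact ratio below 0.80.
        "violates_four_fifths": impact_ratio < 0.80,
    }
```

With the calculator's preset rates (50% vs. 30%), the impact ratio is 0.60 and the four-fifths rule is flagged, regardless of whether any significance test rejects.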

Multiple comparisons

When testing across multiple groups, job categories, or time points simultaneously, permutation tests face the same multiple comparisons problem as any significance test. Corrections such as Bonferroni, or a Bayesian approach, should be considered.
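A Bonferroni correction is the simplest of these adjustments: multiply each p-value by the number of tests (capped at 1) and compare against the original alpha. A short sketch:

```python
def bonferroni_adjust(p_values, alpha=0.05):
    """Bonferroni correction: adjusted p = min(p * m, 1) for m tests.

    Returns (adjusted_p, significant) pairs; controls the familywise
    error rate at alpha, at the cost of reduced power."""
    m = len(p_values)
    return [(min(p * m, 1.0), min(p * m, 1.0) <= alpha)
            for p in p_values]
```

For example, with three job categories and raw p-values of 0.01, 0.04, and 0.20, only the first survives the correction at alpha = 0.05.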

Still a frequentist framework

Permutation tests improve on traditional significance testing but retain its core limitation: inference is framed around long-run error rates, not the probability that a hypothesis is true. Bayesian methods offer a complementary perspective for directly quantifying uncertainty.

Computational cost at scale

5,000–10,000 permutations are sufficient for stable p-value estimates and run in milliseconds for typical applied sample sizes. Very large datasets or extremely high permutation counts can introduce latency, though this is rarely a practical concern.
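When permutation counts do climb, the Python-level shuffle loop can be replaced with a single vectorized operation. A hypothetical helper (assumes NumPy ≥ 1.20 for `Generator.permuted`; memory grows with permutations × sample size):

```python
import numpy as np

def fast_perm_p(pooled, n_a, n_permutations=100_000, seed=0):
    """Vectorized Monte Carlo permutation test: each row of a matrix
    is an independent shuffle of the pooled 0/1 outcomes."""
    rng = np.random.default_rng(seed)
    observed = pooled[:n_a].mean() - pooled[n_a:].mean()
    # Shuffle every row independently in one call.
    shuffled = rng.permuted(np.tile(pooled, (n_permutations, 1)), axis=1)
    null = shuffled[:, :n_a].mean(axis=1) - shuffled[:, n_a:].mean(axis=1)
    return np.mean(np.abs(null) >= abs(observed) - 1e-12)
```

For typical adverse-impact sample sizes (tens of applicants), even 100,000 permutations fit comfortably in memory and run in well under a second.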


Built by Patrick J. Rolwes

Questions? 4rolwes@gmail.com
