Is your A/B test result actually significant?
Enter your control and variant data. Get statistical significance, confidence level, p-value, and a clear verdict on whether to ship.
Control (A)
Your current version - the baseline you're testing against.
Variant (B)
The new version you want to validate.
Enter data for both variants
Statistical significance and uplift will appear here.
How statistical significance is calculated
This calculator uses a two-proportion z-test - the same method used by major A/B testing platforms.
Enter your data
Add visitor and conversion counts for both the control (A) and variant (B) versions.
We run the z-test
The calculator uses a two-proportion z-test to determine whether the difference between A and B is statistically significant or within normal variation.
Read the verdict
You get the confidence level, p-value, relative uplift, and a clear recommendation on whether the result is conclusive.
The formula
p1 = convA / visA (control rate)
p2 = convB / visB (variant rate)
p_pooled = (convA + convB) / (visA + visB)
se = sqrt(p_pooled * (1 - p_pooled) * (1/visA + 1/visB))
z = (p2 - p1) / se
p-value = 2 * (1 - normalCDF(|z|))
confidence = (1 - p-value) * 100
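The formula above translates almost line-for-line into code. This is a minimal Python sketch, not the calculator's actual source; the function name `significance` and the example inputs are illustrative, and the normal CDF is computed from the standard library's error function.

```python
import math

def significance(conv_a: int, vis_a: int, conv_b: int, vis_b: int):
    """Two-proportion z-test, mirroring the formula above."""
    p1 = conv_a / vis_a                                # control rate
    p2 = conv_b / vis_b                                # variant rate
    p_pooled = (conv_a + conv_b) / (vis_a + vis_b)     # pooled conversion rate
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / vis_a + 1 / vis_b))
    z = (p2 - p1) / se
    # standard normal CDF via the error function: Phi(x) = (1 + erf(x / sqrt(2))) / 2
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    confidence = (1 - p_value) * 100
    return z, p_value, confidence

# Example: 200/10,000 control conversions vs 250/10,000 variant conversions
z, p, conf = significance(200, 10_000, 250, 10_000)  # z ≈ 2.38, p ≈ 0.017
```

With these inputs the z-score clears the 1.96 threshold, so the 2% → 2.5% difference is significant at 95% confidence.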
How to read your results
What each metric means and when it matters.
Confidence level
The complement of the p-value, expressed as a percentage - a measure of how unlikely a difference this large would be under pure chance. 95% is the industry standard threshold for shipping a winning variant.
P-value
The probability of seeing a difference this large by random chance alone. A p-value below 0.05 means there's less than a 5% chance of seeing this result if there were no real difference - significant at 95% confidence.
Relative uplift
How much the variant's conversion rate differs from the control as a percentage. A variant converting at 2.5% vs a control at 2% is a 25% relative uplift.
Z-score
How many standard deviations the result is from zero. A z-score above 1.96 (in absolute value) means significance at 95% confidence.
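The z-score thresholds above can be collapsed into a simple lookup. This is a hypothetical helper, not part of the calculator; the `verdict` function and its cutoffs (2.576 for 99%, 1.96 for 95%, 1.645 for 90%) are the standard two-sided critical values of the normal distribution.

```python
def verdict(z: float) -> str:
    """Hypothetical helper: turn a z-score into a plain-language reading."""
    if abs(z) >= 2.576:
        return "significant at 99%"
    if abs(z) >= 1.96:
        return "significant at 95%"
    if abs(z) >= 1.645:
        return "significant at 90%"
    return "not significant - keep collecting data"

print(verdict(2.38))  # prints "significant at 95%"
```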
The most expensive A/B testing mistakes
Statistical significance is only valid if the test was run correctly from the start.
Stopping the test too early
The most expensive A/B testing mistake. Random variation early in a test can look like a winner. The significance threshold only means something once you've collected the pre-calculated sample size - not when you hit 80% confidence after 3 days.
Running multiple tests at once on the same page
If you're testing the headline and the CTA button simultaneously, you can't know which change drove the result. Run one test at a time per conversion goal, or use a proper multivariate setup.
Testing without a hypothesis
Testing random ideas produces random results. A strong A/B test starts with a specific hypothesis rooted in user research: "Visitors are confused by the headline" leads to a testable change. Surveying visitors before testing produces better hypotheses.
Frequently asked questions
Common questions about A/B testing, statistical significance, and interpreting results.
What statistical test does this calculator use?
This calculator uses a two-proportion z-test, which is the standard method for A/B testing with binary outcomes (converted / not converted). It calculates the pooled standard error of both proportions, then computes the z-score for the difference. The p-value is derived from the standard normal distribution. This is the same method used by most major A/B testing platforms.
What confidence level should I use for A/B testing?
95% is the industry standard and appropriate for most tests. It means you're accepting a 5% false positive rate - 1 in 20 tests will show a winner that isn't real. For high-stakes, hard-to-reverse changes (redesigning checkout, changing pricing), hold out for 99%. For low-stakes, easily reversible changes (button color, microcopy), 90% can be acceptable if you're time-constrained and the effect size is large. The key: set your threshold before looking at the data, not after.
How do I know when to stop an A/B test?
The correct answer: stop when you've reached the pre-calculated sample size, not when you hit significance. Stopping early because you see 95% confidence after 3 days is called "peeking" and inflates false positive rates significantly. Calculate the required sample size before the test using a sample size calculator, and don't look at significance until you've hit that number. The only valid early stopping reasons: clear harm to one group, or a pre-agreed sequential testing protocol.
What does p-value mean in A/B testing?
The p-value is the probability of observing this large a difference between variants if there were actually no real difference (i.e., if the null hypothesis were true). A p-value of 0.05 means a difference this large would occur only 5% of the time if the variants performed identically. It does NOT mean there's a 95% probability the variant is better - that's a common misinterpretation. It also doesn't tell you anything about the size of the effect, only whether it's distinguishable from noise.
What's the difference between relative and absolute uplift?
Absolute uplift is the raw difference: if control converts at 2% and variant at 2.5%, absolute uplift is 0.5 percentage points. Relative uplift is the percentage improvement: that same change is a 25% relative uplift (0.5 / 2.0 = 25%). Be careful with how results are reported - "a 25% improvement" sounds much larger than "0.5 percentage points". Both are accurate descriptions of the same result. Use absolute uplift when communicating revenue impact, relative uplift when comparing tests across pages with different baseline rates.
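The 2% vs 2.5% example above works out like this in code (illustrative values only):

```python
control_rate = 0.02    # control converts at 2%
variant_rate = 0.025   # variant converts at 2.5%

# Absolute uplift: raw difference, in percentage points
absolute_uplift_pp = (variant_rate - control_rate) * 100

# Relative uplift: improvement as a fraction of the control rate
relative_uplift_pct = (variant_rate - control_rate) / control_rate * 100

print(absolute_uplift_pp, relative_uplift_pct)  # 0.5 pp absolute, 25% relative
```

Same data, two very different-sounding numbers - which is exactly why reports should state which one they're quoting.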
How many visitors do I need for an A/B test?
It depends on three things: your baseline conversion rate, the minimum effect size you want to detect, and your desired confidence level and power. A page converting at 5% needs roughly 31,000 visitors per variant to detect a 10% relative improvement at 95% confidence with 80% power - several months of traffic at 10,000 monthly visitors. A page converting at 0.5% needs substantially more. Use a sample size calculator before starting - it'll tell you exactly how many visitors per variant you need.
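The standard two-proportion sample size approximation can be sketched in a few lines. This is not the calculator's code: the function name is made up, and the defaults (z = 1.96 for 95% two-sided confidence, z = 0.84 for 80% power) are conventional choices you'd adjust for stricter tests.

```python
import math

def sample_size_per_variant(baseline: float, relative_lift: float,
                            z_alpha: float = 1.96,  # 95% confidence, two-sided
                            z_beta: float = 0.84    # 80% power
                            ) -> int:
    """Approximate visitors needed per variant (hypothetical helper,
    standard two-proportion approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)      # rate you hope to detect
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# 5% baseline, detect a 10% relative lift
n = sample_size_per_variant(0.05, 0.10)  # ≈ 31,200 per variant
```

Halve the baseline rate or the detectable lift and the required sample size roughly quadruples - which is why low-conversion pages need so much more traffic.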
Can I run multiple A/B tests at once?
You can, but they should be on different pages or different conversion goals - not on the same page with the same conversion event. Running two tests on the same page simultaneously creates interaction effects: if both win or both lose, you don't know which change drove it. The exception is multivariate testing (MVT), which specifically accounts for interactions - but MVT requires much larger sample sizes. For most teams: one test per page, per conversion goal, at a time.
What if my A/B test shows no significant difference?
A null result is still a valid result - it means the change you made probably doesn't affect conversion rate in a meaningful way. This is useful information. It rules out that hypothesis and tells you to look elsewhere. The typical mistake is running the same test again hoping for a different result, or calling the test too early when confidence is at 60%. If you've hit your target sample size and there's no significance, the variant probably doesn't matter. Use survey data to find a more impactful hypothesis to test next.