∫ Z-test · T-test · Chi-Square · F-test · 5 Tail Types

P-value Calculator

Calculate the p-value from a Z-score, t-statistic, chi-square, or F-statistic. Supports all tail types with a live normal distribution curve and significance interpretation.

Test Parameters
Z
Two-tailed
H₁: μ ≠ μ₀  |  Both extremes
Right-tailed
H₁: μ > μ₀  |  Upper tail
Left-tailed
H₁: μ < μ₀  |  Lower tail
Confidence | α (two-tailed) | Z critical (two-tailed) | α (one-tailed) | Z critical (one-tailed)
90% | 0.10 | ±1.645 | 0.10 | 1.282
95% | 0.05 | ±1.960 | 0.05 | 1.645
99% | 0.01 | ±2.576 | 0.01 | 2.326
99.9% | 0.001 | ±3.291 | 0.001 | 3.090
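
The critical values in the table can be reproduced from the inverse of the standard normal CDF. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`; the helper name `z_critical` is an illustrative assumption and not part of this calculator.

```python
from statistics import NormalDist


def z_critical(alpha: float, two_tailed: bool = True) -> float:
    """Return the positive critical Z value for significance level alpha.

    Hypothetical helper for illustration only.
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must be strictly between 0 and 1")
    # A two-tailed test puts alpha/2 in each tail; a one-tailed test
    # puts all of alpha in a single tail.
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)


for alpha in (0.10, 0.05, 0.01, 0.001):
    two = z_critical(alpha)
    one = z_critical(alpha, two_tailed=False)
    print(f"alpha={alpha}: two-tailed ±{two:.3f}, one-tailed {one:.3f}")
```

For a left-tailed test the critical value is the negative of the one-tailed value returned here.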

What Is a P-value and How Is It Calculated?

A p-value (probability value) is the probability of obtaining test results at least as extreme as the observed results, assuming that the null hypothesis (H₀) is true. In other words: if there really were no effect or difference in the population, how likely would it be to see data as extreme as what you observed just by chance? A small p-value means your observed data would be very unlikely under H₀, suggesting that H₀ may be false and that your result is statistically significant. For related statistics, see our Standard Deviation Calculator and Average Calculator.

The Standard Normal Distribution & Z-scores

P-value from Z-score
Two-tailed: p = 2 × Φ(−|Z|), where Φ is the standard normal CDF
Right-tailed: p = 1 − Φ(Z)
Left-tailed: p = Φ(Z)

Φ(Z) = area under the standard normal curve to the left of Z. It is related to the error function by Φ(Z) = ½ × [1 + erf(Z / √2)]. This calculator uses an error function (erf) approximation for high accuracy.
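
As a rough illustration of the three tail formulas, the sketch below computes Φ with Python's built-in `math.erfc` (the complementary error function, erfc(x) = 1 − erf(x)). This is an assumption made for the example and not necessarily the approximation this calculator uses internally; the function names are hypothetical.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, Phi(z), written in terms of erfc.

    Phi(z) = 0.5 * erfc(-z / sqrt(2)), which is algebraically the same
    as 0.5 * (1 + erf(z / sqrt(2))) but keeps precision in the far tails.
    """
    return 0.5 * math.erfc(-z / math.sqrt(2))


def p_value_from_z(z: float, tail: str = "two") -> float:
    """Return the p-value for a Z statistic; tail is 'two', 'right' or 'left'."""
    if tail == "two":
        return 2 * normal_cdf(-abs(z))
    if tail == "right":
        # By symmetry, 1 - Phi(z) equals Phi(-z).
        return normal_cdf(-z)
    if tail == "left":
        return normal_cdf(z)
    raise ValueError("tail must be 'two', 'right' or 'left'")


for tail in ("two", "right", "left"):
    print(tail, round(p_value_from_z(1.96, tail), 4))
```

The right-tailed branch uses Φ(−Z) instead of 1 − Φ(Z) so that very small p-values are not lost to floating-point cancellation.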

Supported Statistical Tests

Z

Z-test

Used when the population standard deviation is known, or the sample size is large (n ≥ 30). The test statistic follows the standard normal distribution. Common for proportions and large samples.

t

T-test

Used when the population standard deviation is unknown and estimated from the sample. Requires degrees of freedom (df = n−1 for a one-sample test; df = n₁+n₂−2 for a pooled two-sample test). The test statistic follows the t-distribution.

χ²

Chi-Square Test

Used for categorical data: goodness-of-fit tests and tests of independence in contingency tables. Always right-tailed. df = (rows−1)×(cols−1) for contingency tables.
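
To make the contingency-table case concrete, here is a small sketch that computes the Pearson chi-square statistic and its degrees of freedom for a table of observed counts. The function names and the example counts are made up for illustration. The p-value helper only covers df = 1 (a 2×2 table), where the right-tail area has a closed form through `math.erfc`; other df values need a chi-square CDF from a statistics library. No continuity correction is applied.

```python
import math


def chi_square_independence(table):
    """Return (statistic, df) for a test of independence on observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def chi_square_p_value_df1(stat: float) -> float:
    """Right-tail p-value for df = 1 only.

    A chi-square variable with 1 df is the square of a standard normal,
    so P(X > stat) = erfc(sqrt(stat / 2)).
    """
    return math.erfc(math.sqrt(stat / 2))


observed = [[30, 10], [20, 40]]  # hypothetical 2x2 counts
stat, df = chi_square_independence(observed)
print(f"chi-square = {stat:.3f}, df = {df}, p = {chi_square_p_value_df1(stat):.2e}")
```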

F

F-test (ANOVA)

Used to compare variances or in ANOVA to compare means across 3+ groups. Requires two degrees of freedom: df₁ (numerator) and df₂ (denominator). Always right-tailed.

Significance Levels (α) Explained

The significance level α is the threshold below which you reject the null hypothesis: when p < α, the result is declared statistically significant. It represents the probability of a Type I error, that is, concluding there is an effect when there isn't one. Common choices:

α Level | Confidence Level | Meaning | Common Use
0.10 | 90% | 10% false positive rate | Exploratory research, weak evidence
0.05 | 95% | 5% false positive rate | Standard threshold in most sciences
0.01 | 99% | 1% false positive rate | Medical trials, high-stakes decisions
0.001 | 99.9% | 0.1% false positive rate | Very stringent studies (particle-physics discoveries such as the Higgs boson use an even stricter 5σ criterion)

The most widely used threshold, α = 0.05, was originally proposed by Ronald Fisher in the 1920s. It means: if H₀ were true, you'd see results this extreme only 5% of the time by chance. Use our Standard Deviation Calculator to prepare your test statistic from raw data.

One-tailed vs Two-tailed Tests

Choose your tail type based on your research hypothesis before seeing the data. A two-tailed test is used when you're testing for any difference (H₁: μ ≠ μ₀); it splits α between both tails (α/2 each). A right-tailed test is used when you hypothesize an increase (H₁: μ > μ₀). A left-tailed test is used when you hypothesize a decrease (H₁: μ < μ₀). One-tailed tests are more powerful when the direction is known, but they can be misleading if the direction is chosen after seeing the data, which is one form of "p-hacking." Two-tailed tests are more conservative and more commonly published.

Frequently Asked Questions

Common questions about p-values, hypothesis testing, and statistical significance

p < 0.05 means that if the null hypothesis were true, there would be less than a 5% probability of observing results at least as extreme as those obtained. This is the standard threshold for "statistical significance" in most scientific fields. When p < α (your chosen significance level), you reject the null hypothesis and conclude there is statistically significant evidence for your alternative hypothesis. However, p < 0.05 does not mean: the result is practically important; the effect is large; the probability that H₀ is true is 5%; or the probability that your finding is a "false positive" is exactly 5%. It is a conditional probability under H₀, not a direct measure of the probability that your hypothesis is correct. Always report effect sizes alongside p-values for full context. Use our Standard Deviation Calculator to understand your data's spread.
A Z-test uses the standard normal distribution and is appropriate when: (1) the population standard deviation (σ) is known, or (2) the sample size is large (typically n ≥ 30, where the Central Limit Theorem makes the sampling distribution of the mean approximately normal). A t-test uses the t-distribution and is appropriate when σ is unknown and must be estimated from the sample, which is almost always the case in practice. The t-distribution has heavier tails than the normal distribution, reflecting the extra uncertainty from estimating σ. As degrees of freedom increase (larger sample), the t-distribution approaches the normal distribution. For df > 30, Z and t critical values become nearly identical. The t-test requires specifying degrees of freedom (df = n−1 for a one-sample t-test). Our Average Calculator and Standard Deviation Calculator can help compute your test statistic.
If p > α (e.g., p = 0.12 with α = 0.05), you fail to reject the null hypothesis. Crucially, this does NOT mean you "accept" H₀ or that H₀ is true. It means your data do not provide sufficient evidence to reject H₀ at your chosen significance level. There are several possible explanations: the null hypothesis really is true; the effect exists but your sample was too small to detect it (low statistical power); there was measurement error; or there was too much variability. A result of p > 0.05 is often reported as "not statistically significant" (NS) and should be read as a failure to find an effect, not as proof of absence. Increasing sample size or reducing variability can sometimes reveal effects that were masked by insufficient power. Use our Standard Deviation Calculator to understand variance in your dataset.
The null hypothesis (H₀) is the default assumption, typically that there is no effect, no difference, or no relationship. For example: "the new drug has no effect on blood pressure" or "the two groups have equal means." The alternative hypothesis (H₁ or Hₐ) is what you're trying to show evidence for: that there IS an effect, difference, or relationship. For example: "the drug lowers blood pressure" (one-tailed) or "the drug changes blood pressure" (two-tailed). The p-value tests H₀: small p-values provide evidence against H₀ and in favour of H₁. You never "prove" H₁; you merely find or fail to find sufficient evidence against H₀. A p-value is essentially asking: "how consistent are these data with H₀?" A small p-value says: "not very consistent" → reject H₀.
A Type I error (false positive) occurs when you reject H₀ when it is actually true, concluding there's an effect when there isn't one. The probability of a Type I error, given that H₀ is true, equals α (your significance level). Choosing α = 0.05 means you accept a 5% chance of a false positive. A Type II error (false negative) occurs when you fail to reject H₀ when it is actually false, missing a real effect. The probability of a Type II error is denoted β. Statistical power = 1 − β = the probability of correctly detecting a real effect. There is a trade-off: for a fixed sample size, reducing α (a stricter standard) reduces Type I errors but increases Type II errors. In medical research where false positives are costly, α = 0.01 is often used. In exploratory research where missing effects is costly, α = 0.10 may be appropriate. This calculator helps you evaluate Type I error risk through the p-value vs α comparison.
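
The trade-off between α and β can be illustrated with a short power calculation. The sketch below assumes a right-tailed one-sample Z-test with known σ, where power = 1 − Φ(z₁₋α − δ√n/σ) and δ is the true shift of the mean away from μ₀. The function name and the example numbers are hypothetical.

```python
from statistics import NormalDist


def z_test_power(effect: float, sigma: float, n: int, alpha: float = 0.05) -> float:
    """Power (1 - beta) of a right-tailed one-sample Z-test.

    effect is the true difference mu - mu0 (assumed known for the example),
    sigma the population standard deviation, n the sample size.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)   # rejection threshold under H0
    shift = effect * n ** 0.5 / sigma        # mean of Z under the alternative
    return 1 - std_normal.cdf(z_crit - shift)


for alpha in (0.10, 0.05, 0.01):
    power = z_test_power(effect=5, sigma=15, n=50, alpha=alpha)
    print(f"alpha={alpha}: power={power:.3f}, beta={1 - power:.3f}")
```

With the sample size held fixed, lowering α raises the rejection threshold, so the printed power falls and β rises.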
The formula for a one-sample Z-test statistic is: Z = (x̄ − μ₀) / (σ / √n), where x̄ is your sample mean, μ₀ is the hypothesised population mean, σ is the population standard deviation, and n is the sample size. For a two-sample Z-test: Z = (x̄₁ − x̄₂) / √(σ₁²/n₁ + σ₂²/n₂). For a one-sample t-test (unknown σ): t = (x̄ − μ₀) / (s / √n), where s is the sample standard deviation. Use our Average Calculator to find x̄ and our Standard Deviation Calculator to find s, then plug into the formula and enter the result here.
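
The three formulas above translate directly into code. This is a minimal sketch using only Python's standard library; the function names and the sample data are invented for the example, and `statistics.stdev` is used because it returns the sample standard deviation s (n − 1 denominator).

```python
import math
from statistics import mean, stdev


def one_sample_z(sample, mu0: float, sigma: float) -> float:
    """Z = (x_bar - mu0) / (sigma / sqrt(n)), with sigma known."""
    return (mean(sample) - mu0) / (sigma / math.sqrt(len(sample)))


def two_sample_z(sample1, sample2, sigma1: float, sigma2: float) -> float:
    """Z = (x_bar1 - x_bar2) / sqrt(sigma1^2/n1 + sigma2^2/n2)."""
    se = math.sqrt(sigma1 ** 2 / len(sample1) + sigma2 ** 2 / len(sample2))
    return (mean(sample1) - mean(sample2)) / se


def one_sample_t(sample, mu0: float):
    """Return (t, df) with t = (x_bar - mu0) / (s / sqrt(n)) and df = n - 1."""
    n = len(sample)
    s = stdev(sample)  # sample standard deviation
    return (mean(sample) - mu0) / (s / math.sqrt(n)), n - 1


data = [102, 98, 105, 110, 95, 101, 99, 104]  # hypothetical measurements
print("Z =", round(one_sample_z(data, mu0=100, sigma=5), 3))
t_stat, df = one_sample_t(data, mu0=100)
print("t =", round(t_stat, 3), "with df =", df)
```

The resulting statistic (and df for the t-test) is what you would enter into the calculator.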
P-hacking (also called data dredging or fishing) refers to manipulating data analysis decisions until a significant result is found: collecting more data until p < 0.05, trying multiple tests and reporting only significant ones, choosing between one-tailed and two-tailed tests after seeing results, removing outliers selectively, or trying different subgroup analyses. This inflates the false positive rate far above α. For example, if you run 20 independent tests at α = 0.05 and every null hypothesis is true, you'd expect one false positive on average just by chance, and the probability of at least one is about 64% (1 − 0.95²⁰). P-hacking is a significant contributor to the "replication crisis" in psychology and the social sciences. Best practices to avoid it include: pre-registering your hypothesis and analysis plan before collecting data; adjusting for multiple comparisons (Bonferroni correction: α/number of tests); and reporting all analyses conducted, not just significant ones. Always report p-values alongside effect sizes and confidence intervals for full transparency.
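
The inflation from multiple testing and the Bonferroni adjustment can be checked with a few lines of arithmetic. The sketch below assumes the tests are independent and that every null hypothesis is true; the function names and the example p-values are illustrative only.

```python
def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_alpha(alpha: float, num_tests: int) -> float:
    """Per-test threshold under the Bonferroni correction."""
    return alpha / num_tests


def bonferroni_significant(p_values, alpha: float = 0.05):
    """Flag which p-values remain significant after the correction."""
    threshold = bonferroni_alpha(alpha, len(p_values))
    return [p < threshold for p in p_values]


print(f"Chance of >=1 false positive in 20 tests: {familywise_error_rate(0.05, 20):.3f}")
print(f"Bonferroni per-test alpha for 20 tests: {bonferroni_alpha(0.05, 20):.4f}")
print(bonferroni_significant([0.001, 0.02, 0.04]))  # hypothetical p-values
```

In the last line only the first p-value falls below the corrected threshold of 0.05 / 3, even though all three would pass an uncorrected α = 0.05.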