CFA Level 1 Hypothesis Testing: Sampling, CLT, Confidence Intervals & Statistical Tests
What’s Covered: Three Learning Modules at a Glance
| Learning Module | Topic | Core Skills |
|---|---|---|
| LM 7 | Estimation and Inference | Sampling methods, sampling error, Central Limit Theorem, standard error, confidence intervals, bootstrapping |
| LM 8 | Hypothesis Testing | 6-step process, null/alternative hypotheses, Type I & II errors, power, t-tests, chi-square, F-test, p-values |
| LM 9 | Parametric & Non-Parametric Tests of Independence | Pearson correlation test, Spearman rank correlation, contingency tables, chi-square test of independence |
LM 7: Estimation and Inference
Before you can test hypotheses, you need to understand how samples relate to populations. This module covers how to draw samples, what happens to sample statistics as sample size grows, and how to construct confidence intervals.
Sampling Methods
The CFA curriculum distinguishes between probability sampling (where every element has a known chance of selection) and non-probability sampling (where it doesn’t).
| Method | How It Works | When to Use |
|---|---|---|
| Simple Random | Every member of the population has an equal probability of being selected | Default method; works well when the population is relatively homogeneous |
| Stratified Random | Divide population into subgroups (strata), then sample randomly within each stratum | When subgroups differ meaningfully — e.g., sampling a bond index by credit rating and maturity |
| Cluster | Divide population into clusters, randomly select entire clusters, then sample within them | When a full population list is impractical — e.g., geographic clusters |
| Convenience (non-probability) | Select whatever data is readily available | Quick and cheap, but introduces selection bias |
| Judgmental (non-probability) | Researcher handpicks elements based on expertise | Relies on the researcher’s knowledge; results may not generalize |
Sampling Error
Sampling error is the difference between a sample statistic (like the sample mean) and the corresponding population parameter. It’s unavoidable — you’re estimating a population from a subset. The goal isn’t to eliminate sampling error but to understand and control it.
The Central Limit Theorem (CLT)
The CLT is one of the most powerful results in statistics, and it’s a favorite CFA exam topic.
The CLT states that, for a population with mean μ and finite variance σ², the sampling distribution of the sample mean approaches a normal distribution with mean μ and variance σ²/n as the sample size n grows, regardless of the shape of the population distribution. Why this matters: even if the underlying returns are skewed or non-normal, the distribution of the sample mean will be approximately normal for large samples (typically n ≥ 30). This is what justifies using z-tests and t-tests on real financial data.
Standard Error of the Sample Mean
The standard error tells you how much the sample mean is expected to vary from the population mean: SE = σ/√n, or s/√n when the population standard deviation is unknown. As sample size increases, standard error decreases, so your estimate gets more precise. Doubling the sample size divides the standard error by √2 (a reduction of about 29%), not by half.
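A minimal Python sketch of the √n effect, using a hypothetical sample standard deviation of 8% (the numbers are illustrative, not from the curriculum):

```python
import math

# Standard error of the sample mean: SE = s / sqrt(n)
s = 8.0  # hypothetical sample standard deviation, in %
for n in (25, 50, 100):
    se = s / math.sqrt(n)
    print(f"n = {n:3d}  SE = {se:.3f}")

# Going 25 -> 50 shrinks SE by a factor of sqrt(2) (about 29%);
# only the quadrupling 25 -> 100 cuts SE in half (1.600 -> 0.800).
```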
Confidence Intervals
A 95% confidence interval means: if we repeated this sampling procedure many times, 95% of the resulting intervals would contain the true population mean. It does not mean there’s a 95% probability the population mean is in this particular interval.
| Confidence Level | z Critical Value (two-tailed) | Interpretation |
|---|---|---|
| 90% | ±1.645 | Narrower interval, lower confidence |
| 95% | ±1.960 | Most commonly used in practice |
| 99% | ±2.576 | Very high confidence, very wide interval |
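The critical values in the table can be reproduced with SciPy's inverse normal CDF. Here's a sketch that also builds a 95% interval from hypothetical sample numbers (SciPy itself is not part of the curriculum, just a convenient check):

```python
from scipy import stats

# Two-tailed z critical values for common confidence levels
for conf in (0.90, 0.95, 0.99):
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    print(f"{conf:.0%}: ±{z:.3f}")

# 95% CI for a mean (hypothetical: x_bar = 7.5%, s = 12%, n = 64)
x_bar, s, n = 7.5, 12.0, 64
z95 = stats.norm.ppf(0.975)
half_width = z95 * s / n ** 0.5
print(f"95% CI: ({x_bar - half_width:.2f}%, {x_bar + half_width:.2f}%)")
```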
LM 8: Hypothesis Testing
This is the core of the three modules. You need to know the 6-step process cold and be able to apply it to questions about means, differences between means, and variances.
The 6-Step Hypothesis Testing Process
| Step | What You Do | Key Details |
|---|---|---|
| 1. State the hypotheses | Define H₀ (null) and Hₐ (alternative) | The null is what you’re trying to reject. The alternative is what you’re trying to support. |
| 2. Identify the test statistic | Choose the right test (z, t, chi-square, F) | Depends on what you’re testing (mean, variance, proportion) and what you know about the population |
| 3. Specify significance level | Set α (typically 0.05 or 0.01) | α = probability of Type I error = probability of rejecting a true null |
| 4. State the decision rule | Determine the critical value(s) | Reject H₀ if test statistic falls in the rejection region (beyond critical values) |
| 5. Calculate test statistic | Plug sample data into the formula | Compare computed value to critical value |
| 6. Make a decision | Reject or fail to reject H₀ | You never “accept” the null — you either reject it or fail to reject it |
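The six steps can be walked through in code. This sketch uses hypothetical numbers (mean monthly return of 0.9% against a hypothesized 0.5%, s = 1.2%, n = 36) and a two-tailed t-test:

```python
from scipy import stats

# Step 1: H0: mu = 0.5  vs  Ha: mu != 0.5  (two-tailed)
# Step 2: population sigma unknown -> one-sample t-test
# Step 3: significance level
n, x_bar, s, mu0, alpha = 36, 0.9, 1.2, 0.5, 0.05
# Step 4: decision rule -> reject H0 if |t| > critical value
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
# Step 5: compute the test statistic
t_stat = (x_bar - mu0) / (s / n ** 0.5)
# Step 6: decide
decision = "reject H0" if abs(t_stat) > t_crit else "fail to reject H0"
print(f"t = {t_stat:.2f}, critical = ±{t_crit:.3f} -> {decision}")
```

Note the borderline result: t = 2.00 falls just inside the t critical value of about ±2.03 (df = 35), so you fail to reject, even though the z critical value ±1.96 would have rejected. This is exactly why the t-distribution matters at moderate sample sizes.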
One-Tailed vs. Two-Tailed Tests
| Test Type | Hypotheses | Rejection Region |
|---|---|---|
| Two-tailed | H₀: μ = μ₀ vs. Hₐ: μ ≠ μ₀ | Both tails — reject if test stat is too far in either direction |
| Upper one-tailed | H₀: μ ≤ μ₀ vs. Hₐ: μ > μ₀ | Right tail only |
| Lower one-tailed | H₀: μ ≥ μ₀ vs. Hₐ: μ < μ₀ | Left tail only |
Type I and Type II Errors
This is tested constantly. You must know the trade-off.
| | H₀ is Actually True | H₀ is Actually False |
|---|---|---|
| Reject H₀ | Type I Error (false positive) — probability = α | Correct decision — probability = Power (1 − β) |
| Fail to reject H₀ | Correct decision — probability = (1 − α) | Type II Error (false negative) — probability = β |
Key relationships:
- α (significance level) = P(Type I error) = P(rejecting a true null)
- β = P(Type II error) = P(failing to reject a false null)
- Power = 1 − β = probability of correctly rejecting a false null
- Decreasing α (say from 5% to 1%) increases β — there’s a direct trade-off
- Increasing sample size reduces both types of error
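These trade-offs can be made concrete. The sketch below assumes a simplified upper one-tailed z-test (known σ, which is rare in practice) with a hypothetical true mean of 0.5% and σ = 2%, and computes β directly:

```python
from scipy import stats

# Hypothetical setup: H0: mu <= 0 vs Ha: mu > 0, true mean = 0.5, sigma = 2
sigma, mu_true = 2.0, 0.5
for alpha in (0.05, 0.01):
    for n in (64, 256):
        se = sigma / n ** 0.5
        crit = stats.norm.ppf(1 - alpha) * se   # reject H0 if x_bar > crit
        # beta = chance the sample mean lands below the cutoff
        # even though the true mean is mu_true
        beta = stats.norm.cdf(crit, loc=mu_true, scale=se)
        print(f"alpha={alpha:.2f} n={n:3d}  beta={beta:.3f}  power={1 - beta:.3f}")
```

At fixed n, moving α from 0.05 to 0.01 raises β; moving n from 64 to 256 cuts β sharply at either α, which is why larger samples reduce both error types.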
The p-Value Approach
The p-value is the smallest significance level at which you would reject the null. If p-value ≤ α, reject H₀. If p-value > α, fail to reject. Many CFA questions give you a p-value and ask for the conclusion at a given significance level — just compare the two numbers.
Tests of a Single Mean
The test statistic is t = (X̄ − μ₀) / (s/√n) with n − 1 degrees of freedom. This is the workhorse test on the exam. You'll be given a sample mean, hypothesized population mean, sample standard deviation, and sample size.
Tests of Differences Between Means
The exam tests two scenarios:
| Scenario | Test | When to Use |
|---|---|---|
| Independent samples, equal variances | Pooled t-test | Two separate groups (e.g., returns of fund A vs. fund B) |
| Dependent (paired) samples | Paired t-test (test of mean differences) | Same group measured twice (e.g., returns before and after an event) |
The test statistic is t = d̄ / (s_d/√n), where d̄ is the mean of the differences and s_d is the standard deviation of the differences. Degrees of freedom: n − 1 (number of pairs minus 1).
Test of a Single Variance (Chi-Square)
The test statistic is χ² = (n − 1)s² / σ₀², with n − 1 degrees of freedom. The chi-square distribution is always non-negative and right-skewed. Use this when testing whether a portfolio's volatility matches a claimed level.
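A sketch with hypothetical numbers: a manager claims σ = 4% and a sample of n = 25 months shows s = 5%, tested two-tailed at α = 0.05:

```python
from scipy import stats

# Hypothetical: claimed sigma0 = 4%, observed s = 5% over n = 25 months
n, s, sigma0, alpha = 25, 5.0, 4.0, 0.05
chi2_stat = (n - 1) * s ** 2 / sigma0 ** 2   # chi2 = (n-1) s^2 / sigma0^2

# Two-tailed test of H0: sigma^2 = sigma0^2 uses both chi-square tails
lower = stats.chi2.ppf(alpha / 2, df=n - 1)
upper = stats.chi2.ppf(1 - alpha / 2, df=n - 1)
decision = "reject H0" if chi2_stat < lower or chi2_stat > upper else "fail to reject H0"
print(f"chi2 = {chi2_stat:.2f}, bounds = ({lower:.2f}, {upper:.2f}) -> {decision}")
```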
Test of Equality of Two Variances (F-Test)
The test statistic is F = s₁² / s₂², with the larger sample variance in the numerator. Degrees of freedom: (n₁ − 1, n₂ − 1). The F-distribution is always positive. You'll use this to test whether two portfolios have significantly different risk levels.
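A sketch with hypothetical sample variances from two portfolios, tested two-tailed at α = 0.05 (so the right-tail critical value is taken at 0.975):

```python
from scipy import stats

# Hypothetical sample variances; larger variance goes in the numerator
s1_sq, n1 = 64.0, 31
s2_sq, n2 = 36.0, 25

f_stat = s1_sq / s2_sq
f_crit = stats.f.ppf(0.975, dfn=n1 - 1, dfd=n2 - 1)  # two-tailed at alpha = 0.05
decision = "reject H0" if f_stat > f_crit else "fail to reject H0"
print(f"F = {f_stat:.3f}, critical = {f_crit:.3f} -> {decision}")
```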
Parametric vs. Nonparametric Tests
| Feature | Parametric Tests | Nonparametric Tests |
|---|---|---|
| Assumptions | Specific distributional assumptions (e.g., normality) | Minimal or no distributional assumptions |
| Data type | Continuous, interval/ratio scale | Ordinal, ranked, or non-normal data |
| Power | More powerful when assumptions hold | Less powerful but more robust |
| Examples | z-test, t-test, F-test | Spearman rank correlation, chi-square test of independence |
LM 9: Parametric & Non-Parametric Tests of Independence
This module applies hypothesis testing specifically to testing whether two variables are related. It covers three tests you need to know.
Parametric Test of Correlation (Pearson)
Tests whether the population correlation coefficient (ρ) equals zero.
The test statistic is t = r√(n − 2) / √(1 − r²), with n − 2 degrees of freedom. Reject H₀: ρ = 0 if |t| exceeds the critical value. An important nuance: as sample size increases, smaller correlations become statistically significant. A correlation of r = 0.35 might not be significant with n = 12, but it could be significant with n = 32.
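The n = 12 versus n = 32 comparison can be checked directly; this sketch applies the correlation t-formula at both sample sizes for r = 0.35 and α = 0.05 (two-tailed):

```python
from scipy import stats

# t = r * sqrt(n-2) / sqrt(1 - r^2), df = n - 2: same r, different n
r, alpha = 0.35, 0.05
for n in (12, 32):
    t_stat = r * (n - 2) ** 0.5 / (1 - r ** 2) ** 0.5
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    sig = "significant" if abs(t_stat) > t_crit else "not significant"
    print(f"n={n:2d}: t={t_stat:.3f}, critical=±{t_crit:.3f} -> {sig}")
```

At n = 12 the statistic (about 1.18) falls well short of the critical value; at n = 32 it just clears it, so the same correlation flips to significant purely because of sample size.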
Spearman Rank Correlation
The nonparametric alternative to Pearson. Instead of testing raw data values, you rank them first and then calculate the correlation on the ranks. Use it when:
- Data may not be normally distributed
- You’re working with ordinal data (rankings, ratings)
- Outliers are a concern
- The relationship might be monotonic but not linear
The test for significance uses the same t-formula as Pearson, just applied to the rank correlation coefficient (r_s) instead of the raw correlation.
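A sketch of why ranks help with outliers, using hypothetical data where one extreme value distorts the raw (Pearson) correlation but barely moves the rank (Spearman) correlation:

```python
from scipy import stats

# Hypothetical data: y generally grows with x, but x = 7 has an extreme outlier
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.1, 2.0, 2.9, 4.2, 5.1, 5.8, 300.0, 7.9]

r_pearson, _ = stats.pearsonr(x, y)
r_spearman, _ = stats.spearmanr(x, y)   # correlation computed on ranks
print(f"Pearson r = {r_pearson:.3f}, Spearman r_s = {r_spearman:.3f}")
# The outlier dominates the raw covariance but only shifts one rank,
# so the Spearman coefficient stays near 1.
```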
Chi-Square Test of Independence (Contingency Tables)
Tests whether two categorical variables are independent using observed vs. expected frequencies in a contingency table.
The test statistic is χ² = Σ (O − E)² / E, where O is the observed frequency and E is the expected frequency in each cell (calculated assuming independence). Degrees of freedom: (rows − 1)(columns − 1). This is always a one-sided test: the rejection region is on the right because the chi-square statistic is always non-negative.
Example application: testing whether ETF performance category (outperform/underperform) is independent of fund type (equity/bond/alternative). If the chi-square statistic exceeds the critical value, you reject independence — the two variables are related.
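The ETF example can be sketched with a hypothetical 2×3 contingency table and SciPy's `chi2_contingency`, which computes the expected frequencies and degrees of freedom for you:

```python
from scipy.stats import chi2_contingency

# Hypothetical counts: performance category (rows) vs fund type (columns)
#                equity  bond  alternative
observed = [[60,     30,   10],   # outperform
            [40,     50,   10]]   # underperform

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.4f}")
# df = (rows - 1)(cols - 1) = (2 - 1)(3 - 1) = 2; small p -> reject independence
```

With these made-up counts the statistic comes out to 9.0 against a 5% critical value of about 5.99 (df = 2), so independence is rejected: performance category and fund type appear related.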
Which Test to Use: Decision Framework
The exam often tests whether you can pick the right test for the scenario. Here’s a quick decision guide:
| What You’re Testing | Test Statistic | Distribution | Degrees of Freedom |
|---|---|---|---|
| Single mean (σ unknown) | t = (X̄ − μ₀) / (s/√n) | t | n − 1 |
| Difference of means (independent) | Pooled t-test | t | n₁ + n₂ − 2 |
| Difference of means (paired) | t = d̄ / (s_d/√n) | t | n − 1 |
| Single variance | χ² = (n−1)s²/σ₀² | Chi-square | n − 1 |
| Equality of two variances | F = s₁²/s₂² | F | (n₁−1, n₂−1) |
| Correlation (parametric) | t = r√(n−2)/√(1−r²) | t | n − 2 |
| Independence (categorical) | χ² = Σ(O−E)²/E | Chi-square | (r−1)(c−1) |
How These Modules Connect to the Rest of the Curriculum
| Concept from LM 7–9 | Where It Appears Later |
|---|---|
| Central Limit Theorem | Justifies normal-distribution-based tests throughout the curriculum |
| Confidence intervals | Economics (forecasting), equity valuation (range estimates) |
| t-tests on means | Testing whether portfolio returns exceed a benchmark in Portfolio Management |
| F-test on variances | Comparing risk levels across portfolios; ANOVA in regression (LM 10) |
| Correlation tests | Beta estimation, factor models, diversification analysis |
| Chi-square test of independence | Testing relationships between categorical financial variables (e.g., sector and performance) |
Study Strategy for LM 7–9
- Memorize the 6-step process. Every hypothesis test question follows this framework. If you can lay out the steps, the rest is mechanical.
- Know the Type I/II error trade-off cold. This is tested on almost every exam. Practice articulating what happens when you change α or increase sample size.
- Practice the “which test?” decision. The exam often gives you a scenario and asks you to identify the correct test. Use the decision framework table above until it becomes automatic.
- Don’t memorize every formula in isolation. The t-test structure (point estimate − hypothesized value) / standard error is the same pattern for all mean tests. Recognize the pattern.
- p-value questions are free points. If p-value ≤ α → reject. If p-value > α → fail to reject. That’s it.
For all formulas consolidated, see the CFA Level 1 Formula Sheet. For additional drill problems, visit Practice Questions. And for broader exam strategy, check Tips & Strategies.
Key Takeaways
- Stratified random sampling ensures every subgroup is represented — it’s the preferred method for bond index replication.
- The Central Limit Theorem guarantees the sample mean is approximately normal for large n, regardless of the population distribution.
- Standard error = s/√n. Increasing sample size improves precision but with diminishing returns (you need to quadruple n to halve standard error).
- Type I error (α) = rejecting a true null. Type II error (β) = failing to reject a false null. Power = 1 − β. Decreasing α increases β.
- Use the t-statistic when population variance is unknown (almost always on the exam). Use chi-square for single variance tests and F for comparing two variances.
- The p-value is the smallest α at which you’d reject H₀. If p ≤ α, reject. Period.
- Spearman rank correlation is the nonparametric alternative to Pearson — use it when normality is in question or data are ordinal.
- The chi-square test of independence uses a contingency table of observed vs. expected frequencies. Degrees of freedom = (rows − 1)(columns − 1).
Frequently Asked Questions
What’s the difference between a Type I and Type II error on the CFA exam?
A Type I error means you rejected the null hypothesis when it was actually true — a false positive. A Type II error means you failed to reject a false null — a false negative. The significance level α directly controls the Type I error rate. There’s a trade-off: decreasing α makes Type I errors less likely but Type II errors more likely, unless you also increase sample size.
When should I use a t-test vs. a z-test?
Use a z-test only when the population variance is known — which almost never happens in practice. On the CFA exam, you’ll use the t-test in nearly every question about means because you’ll be working with a sample standard deviation. As sample size gets large (n > 30 or so), t and z values converge and the distinction becomes less important, but the t-test is still technically correct.
How do I decide between a one-tailed and two-tailed test?
Read the alternative hypothesis. If Hₐ says “not equal to” (≠), it’s two-tailed. If Hₐ says “greater than” (>) or “less than” (<), it's one-tailed. The CFA exam usually tells you which one to use. One-tailed tests are more powerful for detecting an effect in a specific direction because the entire rejection region is on one side.
What does “power of a test” mean?
Power is the probability of correctly rejecting a false null hypothesis — it equals 1 − β, where β is the Type II error probability. A test with high power is good at detecting a real effect. Power increases with larger sample size, with a higher α (at the cost of more Type I errors), and when the true effect size is larger, since an effect farther from the null is easier to detect.
Why is the chi-square test of independence always one-sided?
Because the chi-square statistic sums squared differences between observed and expected frequencies — it’s always non-negative. Large values indicate that observed data differ significantly from what you’d expect under independence. There’s no concept of a “negative” chi-square value, so the rejection region is always in the right tail only.
How does the Central Limit Theorem help with hypothesis testing?
The CLT guarantees that the sampling distribution of the mean is approximately normal for large sample sizes, even if the underlying population isn’t normal. This allows you to use z-based and t-based tests on real financial data — which is typically skewed and leptokurtic — as long as your sample is large enough. Without the CLT, you’d need to know the exact population distribution to run most tests.