Inference for Categorical Data

The analysis of categorical data generally involves the proportion of "successes" in a given population. This may consist of estimating a single parameter, comparing two parameters, or investigating the potential relationship between two or more categorical variables. Note: This section addresses the first two areas; see the chi-square test for a discussion of the third.

Confidence Intervals and Significance Tests for a Single Proportion

Given a simple random sample of size n from a population, the number of "successes" X divided by the sample size n provides the sample proportion p̂ = X/n, an estimate of the population proportion p. The count X follows a binomial distribution, so p̂ has mean p and variance (p(1-p))/n. Since the binomial distribution is approximately normal for large sample sizes, tests of significance and confidence intervals for a single proportion use a z statistic.

Example

A marketing team wishes to evaluate the popularity of a new product in a particular city. A random survey of 500 shoppers indicates that 287 shoppers favor the new product, 123 shoppers dislike the product, and the remaining 90 shoppers have no opinion. Is there evidence that more than 50% of shoppers like the product?

The sample proportion of shoppers who favor the product is 287/500 = 0.574. What is a 95% confidence interval for the proportion? Is the proportion significantly different from 0.5?


To find a confidence interval for a proportion, estimate the standard deviation sp from the data by replacing the unknown value p with the sample proportion p̂, giving the standard error sp = sqrt((p̂(1-p̂))/n).

An approximate level C confidence interval for p is p̂ ± z*sp, where z* is the upper (1-C)/2 critical value from the standard normal distribution.
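
The interval described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not part of the original notes; the function name proportion_ci is an assumption made for this example, and the counts are those of the shopper survey (287 of 500 in favor).

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, level=0.95):
    """Approximate large-sample confidence interval for one proportion.

    Hypothetical helper for illustration; relies on the normal
    approximation, so it is only appropriate for large samples.
    """
    p_hat = successes / n                       # sample proportion
    se = sqrt(p_hat * (1 - p_hat) / n)          # standard error sp
    # z* is the upper (1 - C)/2 critical value of the standard normal
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z_star * se
    return p_hat - margin, p_hat + margin


low, high = proportion_ci(287, 500)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

For the survey data this gives an interval of roughly (0.531, 0.617).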


Example

In the example above, the sample proportion is 0.574. The standard error sp is equal to sqrt((0.574(1-0.574))/500) = sqrt((0.574*0.426)/500) = sqrt(0.245/500) = sqrt(0.00049) = 0.022. The critical value for a 95% confidence interval is 1.96, so the confidence interval for the proportion is 0.574 ± 1.96*0.022 = (0.574 - 0.043, 0.574 + 0.043) = (0.531, 0.617).

To test the null hypothesis H0: p = p0 against a one- or two-sided alternative hypothesis Ha, replace p with p0 in the standard deviation, giving the test statistic z = (p̂ - p0)/sqrt((p0(1-p0))/n).

Under the null hypothesis, the test statistic approximately follows the standard normal distribution (with mean = 0 and standard deviation = 1). The test statistic z is used to compute the P-value from the standard normal distribution: the probability, under the null hypothesis, of observing a value at least as extreme as the test statistic. Given the null hypothesis that the population proportion p is equal to a given value p0, the P-values for testing H0 against each of the possible alternative hypotheses are:
P(Z > z) for Ha: p > p0
P(Z < z) for Ha: p < p0
2P(Z > |z|) for Ha: p ≠ p0.

Example

In the example above, the marketing team wishes to test the one-sided hypothesis Ha: p > 0.5 against the null hypothesis that p = 0.5. The test statistic z is equal to (0.574 - 0.5)/(sqrt((0.5)(0.5)/500)) = 0.074/sqrt(0.25/500) = 0.074/0.02236 = 3.31. The probability P(Z > 3.31) = 1 - P(Z < 3.31) = 1 - 0.9995 = 0.0005, so this result is highly significant. The marketing team can conclude that more than 50% of the population favor the new product.
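
As a rough check on the test just described, the following Python sketch computes the z statistic and its P-value for any of the three alternatives. It uses only the standard library; the function name one_proportion_z_test and the alternative labels are assumptions made for this illustration. Computed without intermediate rounding, the statistic for the survey data comes out near 3.31.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(successes, n, p0, alternative="greater"):
    """Large-sample z test of H0: p = p0 (hypothetical helper).

    alternative is one of "greater", "less", or "two-sided".
    """
    p_hat = successes / n
    # Under H0 the standard deviation of p_hat uses p0, not p_hat
    se0 = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se0
    std_normal = NormalDist()
    if alternative == "greater":        # Ha: p > p0
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "less":         # Ha: p < p0
        p_value = std_normal.cdf(z)
    elif alternative == "two-sided":    # Ha: p != p0
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return z, p_value


z, p_value = one_proportion_z_test(287, 500, 0.5, alternative="greater")
print(f"z = {z:.2f}, P-value = {p_value:.4f}")
```

The one-sided P-value is on the order of 0.0005, which agrees with the conclusion that the result is highly significant.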

Sample Size

An increase in sample size will decrease the length of the confidence interval without reducing the level of confidence. This is because the standard error decreases as n increases. The margin of error m of a confidence interval is defined to be the value added to or subtracted from the sample proportion, which determines the length of the interval: m = z*sqrt((p̂(1-p̂))/n). Given a guessed value p* for the proportion p, substitute p* for p̂ to calculate m. Solving for n gives the expression n = (z*/m)²p*(1-p*). For a given n, the margin of error is maximized when p* = 0.5, in which case n = (z*/(2m))².

Example

Suppose the marketing team in the above example had wished to achieve a margin of error less than or equal to 2% with 95% confidence. Assuming p* = 0.5, they calculate n to be greater than or equal to (1.96/(2*0.02))² = (1.96/0.04)² = 49² = 2401. This is considerably larger than the sample of size 500 taken by the initial survey.
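
The sample size calculation can be sketched as follows. This is a minimal illustration, not a definitive implementation; the function name required_sample_size and its parameter names are assumptions introduced here. The result is rounded up because n must be a whole number at least as large as the computed value.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(margin, level=0.95, p_guess=0.5):
    """Smallest n with margin of error <= margin (hypothetical helper).

    p_guess is the guessed proportion p*; 0.5 is the conservative
    choice because it maximizes p*(1 - p*).
    """
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
    n = (z_star / margin) ** 2 * p_guess * (1 - p_guess)
    return ceil(n)


# Margin of error of 2% with 95% confidence, conservative guess p* = 0.5
print(required_sample_size(0.02))
```

With a guess other than 0.5, for instance the observed 0.574 passed as p_guess, the required n is somewhat smaller, since p*(1-p*) is then below 0.25.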

Comparison of Two Proportions

Like the comparison of two population means, the comparison of two proportions p1 and p2 involves analyzing the difference between the two sample proportions, p̂1 - p̂2. The mean of the difference between the two sample proportions is the difference of the means, p1-p2, and, for independent samples, the variance of the difference is the sum of the variances, (p1(1-p1))/n1 + (p2(1-p2))/n2.


To find a confidence interval for the difference of proportions p1-p2, estimate the standard deviation sD from the data by replacing the unknown values p1 and p2 with the sample proportions p̂1 and p̂2 taken from samples of size n1 and n2, giving the standard error of the difference sD = sqrt((p̂1(1-p̂1))/n1 + (p̂2(1-p̂2))/n2).

An approximate level C confidence interval for p1 - p2 is (p̂1 - p̂2) ± z*sD, where z* is the upper (1-C)/2 critical value from the standard normal distribution.


Example

In the dataset "Popular Kids," students in grades 4-6 were asked whether good grades, athletic ability, or popularity was most important to them. Is popularity more important to girls or boys? What is a confidence interval for the difference?

The survey included 169 girls and 166 boys. Of the girls, 58 ranked popularity most important, compared to 40 of the boys. The sample proportion for girls is 58/169 = 0.34, and for boys it is 40/166 = 0.24. A 95% confidence interval for the difference between the two proportions is (0.34 - 0.24) ± 1.96*sD, where sD = sqrt((0.34(1-0.34))/169 + (0.24(1-0.24))/166) = sqrt(0.0013 + 0.0011) = sqrt(0.0024) = 0.049, so the confidence interval is equal to (0.1 - 1.96*0.049, 0.1 + 1.96*0.049) = (0.004, 0.196). Although the confidence interval does not contain 0, its lower limit is very close to 0, indicating that the difference, while significant at the 0.05 level, is not highly significant.
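
The interval for a difference of proportions can be sketched in Python as below, using the counts from this example (58 of 169 girls, 40 of 166 boys). The function name two_proportion_ci is an assumption made for illustration. Because the code does not round the sample proportions to two decimals, its limits differ slightly in the third decimal place from the hand calculation.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ci(x1, n1, x2, n2, level=0.95):
    """Approximate CI for p1 - p2 from two independent samples.

    Hypothetical helper; uses the unpooled standard error sD.
    """
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    # Variance of the difference is the sum of the two variances
    se_diff = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
    diff = p1_hat - p2_hat
    return diff - z_star * se_diff, diff + z_star * se_diff


low, high = two_proportion_ci(58, 169, 40, 166)
print(f"95% CI for p1 - p2: ({low:.3f}, {high:.3f})")
```

Either way the interval lies entirely above 0, with a lower limit only slightly greater than 0.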

Data source: Chase, M.A and Dummer, G.M. (1992), "The Role of Sports as a Social Determinant for Children," Research Quarterly for Exercise and Sport, 63, 418-424. Dataset available through the Statlib Data and Story Library (DASL).


To test the null hypothesis H0: p1 = p2 against a one- or two-sided alternative hypothesis Ha, first compute a pooled estimate for the common proportion, p̂ = (X1 + X2)/(n1 + n2), where X1 and X2 represent the number of "successes" in each sample. This single estimate is consistent with the null hypothesis, under which the two proportions are assumed to be equal. Calculate the pooled standard error sp, equal to sqrt(p̂(1-p̂)(1/n1 + 1/n2)).

Under the null hypothesis, the test statistic z = (p̂1 - p̂2)/sp approximately follows the standard normal distribution (with mean = 0 and standard deviation = 1). The test statistic z is used to compute the P-value from the standard normal distribution: the probability, under the null hypothesis, of observing a value at least as extreme as the test statistic. Given the null hypothesis that the population proportions are equal, the P-values for testing H0 against each of the possible alternative hypotheses are:
P(Z > z) for Ha: p1 > p2
P(Z < z) for Ha: p1 < p2
2P(Z > |z|) for Ha: p1 ≠ p2.

Example

To test the difference of the proportions of girls and boys who rated popularity most important, first compute the pooled estimate p̂ = (58 + 40)/(169 + 166) = 98/335 = 0.29. The pooled standard error is equal to sqrt(0.29(1-0.29)(1/169 + 1/166)) = sqrt(0.206*0.012) = sqrt(0.0025) = 0.05. The test statistic z = (0.34 - 0.24)/0.05 = 0.10/0.05 = 2. Since this is a two-sided hypothesis, we are interested in the probability 2P(Z > 2) = 2(1 - P(Z < 2)) = 2(1 - 0.9772) = 2(0.0228) = 0.0456. This is significant at the 0.05 level, although it is not significant at the 0.01 level.
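
The pooled two-sample test can be sketched in Python as follows. The function name two_proportion_z_test is an assumption made for this illustration, and only the two-sided alternative is shown. Note that the pooled variance p̂(1-p̂) is multiplied by (1/n1 + 1/n2). Without intermediate rounding the statistic is a little above 2 and the P-value a little below 0.0456, which leaves the conclusion unchanged.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided pooled z test of H0: p1 = p2 (hypothetical helper)."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    # Pooled estimate of the common proportion assumed under H0
    p_pooled = (x1 + x2) / (n1 + n2)
    # Pooled standard error: variance term times (1/n1 + 1/n2)
    se_pooled = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p_value = two_proportion_z_test(58, 169, 40, 166)
print(f"z = {z:.2f}, two-sided P-value = {p_value:.4f}")
```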