# Lecture 7

## Confidence Interval Basics

The sample mean, $$\bar{x}$$, is usually not enough, on its own, to tell us where the true mean of a population lies. It is seen as one instantiation of the random variable $$\overline{X}$$, and since we interpret it as the result of a random process, we would like to describe the uncertainty we associate with its position relative to the population mean. This is true not only of the sample mean but of any statistic we estimate from a random sample.

The last lecture was concerned with computational methods for performing statistical inference. In this lecture, we consider procedures for inferential statistics derived from probability models. In particular, I focus on parametric methods, which aim to uncover information regarding the location of a parameter of a probability model assumed to describe the data, such as the mean $$\mu$$ in the Normal distribution, or the population proportion $$p$$. (The methods explored last lecture, such as bootstrapping, are non-parametric; they do not assume a probability model describing the underlying distribution of the data. MATH 3080 covers other non-parametric methods.)

Why use probabilistic methods when we could use computational methods for many procedures? First, computational power and complexity may limit the effectiveness of computational methods; large or complex data sets may be time consuming to process using computational methods alone. Second, methods relying on simulation, such as bootstrapping, are imprecise. Many times this is fine, but the price of imprecision is large when handling small-probability events and other contexts where precision is called for. Third, while computational methods are robust (deviation from underlying assumptions has little effect on the end result), this robustness comes at the price of power (the ability to detect effects, even very small ones), and this may not be acceptable to a practitioner (much of statistical methodology can be seen as a battle between power and robustness, which are often mutually exclusive properties). Furthermore, many computational methods can be seen as a means to approximate some probabilistic method that is for some reason inaccessible or not quite optimal.

A $$100C$$% confidence interval is an interval or region constructed in such a way that the probability the procedure used to construct the interval results in an interval containing the true population parameter of interest is $$C$$. (Note that this is not the same as the probability that the parameter of interest is in the confidence interval, which does not even make sense in this context since the confidence interval computed from the sample is fixed and the parameter of interest is also considered fixed, albeit unknown; this is a common misinterpretation of the meaning of a confidence interval and I do not intend to spread it.) We call $$C$$ the confidence level of the confidence interval. Smaller values of $$C$$ result in intervals that capture the true parameter less frequently, while larger values of $$C$$ result in intervals capturing the desired parameter more frequently. There is (always) a price, though; smaller $$C$$ yield intervals that are more precise (and more likely to be wrong), while larger $$C$$ yield intervals that are less precise (and less likely to be wrong). There is usually only one way to get more precise confidence intervals while maintaining the same confidence level $$C$$: collect more data.
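This "probability of capture" interpretation can be checked with a short simulation. The sketch below (the seed, population parameters, and sample size are arbitrary choices for illustration) repeatedly constructs a 95% interval for a Normal mean with known standard deviation and records how often the interval contains the true mean:

```r
# Estimate the coverage of the 95% interval procedure by simulation.
# mu, sigma, n, and the seed are arbitrary choices for illustration.
set.seed(42)
mu <- 10; sigma <- 2; n <- 30
covered <- replicate(10000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  moe <- qnorm(0.975) * sigma/sqrt(n)
  (mean(x) - moe <= mu) && (mu <= mean(x) + moe)
})
mean(covered)  # proportion of intervals containing mu; close to 0.95
```

About 95% of the simulated intervals capture $$\mu$$; any individual interval either contains $$\mu$$ or it does not.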

In this lecture we consider only confidence intervals for univariate data (or data that could be univariate). When discussing confidence intervals, the estimate is the estimated value of the parameter of interest, and the margin of error (which I abbreviate with “moe”) is the quantity used to describe the uncertainty associated with the estimate. The most common and familiar type of confidence interval is a two-sided confidence interval of the form:

$\text{estimate} \pm \text{moe}$

For estimates of parameters from symmetric distributions, such a confidence interval is called an equal-tail confidence interval (so called because the confidence interval is equally likely to err on either side of the estimate). For estimates for parameters from non-symmetric distributions, an equal-tail confidence interval may not be symmetric around the estimate. Also, confidence intervals need not be two-sided; one-sided confidence intervals, which effectively are a lower-bound or upper-bound for the location of the parameter of interest, may also be constructed.

## Confidence Intervals “By Hand”

One approach for constructing confidence intervals in R is “by hand”, where the user computes the estimate and the moe. This approach is the only one available if no function for constructing the desired confidence interval exists (that will not be the case for the confidence intervals we will see, but may be the case for others).

Let’s consider a confidence interval for the population mean when the population standard deviation is known. The estimate is the sample mean, $$\bar{x}$$, and the moe is $$z_{\frac{1 - C}{2}}\frac{\sigma}{\sqrt{n}}$$, where $$C$$ is the confidence level, $$z_{\alpha}$$ is the quantile of the standard Normal distribution such that $$P(Z > z_{\alpha}) = 1 - \Phi(z_{\alpha}) = \alpha$$, $$\sigma$$ is the population standard deviation (presumed known, which is unrealistic), and $$n$$ is the sample size. So the two-sided confidence interval is:

$\bar{x} \pm z_{\frac{1 - C}{2}}\frac{\sigma}{\sqrt{n}}$

Such an interval relies on the Central Limit Theorem to ensure the quality of the interval. For large $$n$$, it should be safe (with “large” meaning over 30, according to DeVore), and otherwise one may want to look at the shape of the distribution of the data to decide whether it is safe to use these procedures (the more “Normal”, the better).
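As a quick illustration of why moderate $$n$$ is often enough, the sketch below (arbitrary seed and rate parameter) draws many sample means from a heavily skewed exponential population; their distribution is already roughly Normal at $$n = 30$$:

```r
# Sampling distribution of the mean of n = 30 exponential observations
set.seed(101)  # arbitrary
means <- replicate(5000, mean(rexp(30, rate = 1)))
hist(means)               # roughly bell-shaped despite the skewed population
qqnorm(means); qqline(means)
```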

The “by hand” approach finds all involved quantities individually and uses them to construct the confidence intervals.

Let’s construct a confidence interval for sepal length of versicolor iris flowers in the iris data set. There are 50 observations, so it should be safe to use the formula for the CI described above. We will construct a 95% confidence interval, assuming that $$\sigma = 0.5$$.

# Get the data
vers <- split(iris$Sepal.Length, iris$Species)$versicolor
xbar <- mean(vers)
xbar
## [1] 5.936
# Get critical value for confidence level
zstar <- qnorm(0.025, lower.tail = FALSE)
sigma <- 0.5
# Compute margin of error
moe <- zstar * sigma/sqrt(length(vers))
moe
## [1] 0.1385904
# The confidence interval
ci <- c(Lower = xbar - moe, Upper = xbar + moe)
ci
##   Lower   Upper
## 5.79741 6.07459

Of course, this is the long way to compute confidence intervals, and R has built-in functions for computing most of the confidence intervals we want. The “by hand” method relies simply on knowing how to compute the values used in the definition of the desired confidence interval, and combining them to get the desired interval. Since this uses R as little more than a glorified calculator and alternative to a table, I will say no more about the “by hand” approach (of course, if you are computing a confidence interval for which there is no R function, the “by hand” approach is the only available route, though you may save some time and make a contribution to the R community by writing your own R function for computing this novel confidence interval in general).

## R Functions for Confidence Intervals

Confidence intervals and hypothesis testing are closely related concepts, so the R functions that find confidence intervals are also the functions that perform hypothesis testing (all of these functions are in the stats package, which is loaded by default). If desired, it is possible to extract just the desired confidence interval from the output of these functions, though when used interactively, the confidence interval and the result of some hypothesis test are often shown together.

When constructing confidence intervals, there are two parameters to be aware of: conf.level and alternative. conf.level specifies the desired confidence level associated with the interval, $$C$$.
Thus, conf.level = .99 tells the function computing the confidence interval that a 99% confidence interval is desired. (The default behavior is to construct a 95% confidence interval.) alternative determines whether a two-sided interval, an upper bound, or a lower bound is computed. alternative = "two.sided" creates a two-sided confidence interval (this is usually the default behavior, though, so specifying this may not be necessary). alternative = "greater" computes a one-sided $$100C$$% confidence lower bound, and alternative = "less" returns a one-sided $$100C$$% confidence upper bound. (The parameter alternative gets its name and behavior from the alternative hypothesis when performing hypothesis testing, which we will discuss next lecture.)

I now discuss different confidence intervals intended to capture different population parameters.

### Population Mean, Single Sample

We saw a procedure for constructing a confidence interval for the population mean earlier, and that procedure assumed the population standard deviation $$\sigma$$ is known. That assumption is unrealistic, and instead we base our confidence intervals on the random variable $$T = \frac{\overline{X} - \mu}{\frac{S}{\sqrt{n}}}$$. If the data $$X_i$$ are drawn from some Normal distribution, then $$T$$ will follow Student’s $$t$$ distribution with $$\nu = n - 1$$ degrees of freedom: $$T \sim t(n - 1)$$. Thus the confidence interval is constructed using critical values from the $$t$$ distribution, and the resulting confidence interval (with $$\alpha = 1 - C$$) is:

$\bar{x} \pm t_{\frac{\alpha}{2},n - 1}\frac{s}{\sqrt{n}} \equiv \left(\bar{x} - t_{\frac{\alpha}{2},n - 1}\frac{s}{\sqrt{n}}, \bar{x} + t_{\frac{\alpha}{2},n - 1}\frac{s}{\sqrt{n}}\right)$

where if $$T \sim t(\nu)$$, then $$P\left(T \geq t_{\alpha,\nu}\right) = \alpha$$.

One should check the Normality assumption made by this confidence interval before using it.
A Q-Q plot is a good way to do so, and a Q-Q plot checking for Normality can be created using the qqnorm() function. qqnorm(x) creates a Q-Q plot checking Normality for a data vector x, and qqline(x) will add a line by which to judge how well the distribution fits the data (being close to the line suggests a good fit, while strange patterns such as an S-shaped curve suggest a poor fit). I check how well the Normal distribution describes the versicolor iris flower sepal length data below. It is clear that the assumption of Normality holds quite well for this data.

qqnorm(vers)
qqline(vers)

That said, the Normality assumption turns out not to be crucial for these confidence intervals, and the methods turn out to be robust enough to work well even when that assumption does not hold. If a distribution is roughly symmetric with no outliers, it’s probably safe to build a confidence interval using methods based on the $$t$$ distribution, and even if these two conditions do not quite hold, there are circumstances where the $$t$$ distribution will still work well.

The function t.test() will construct a confidence interval for the true mean $$\mu$$, and I demonstrate its use below.

# Construct a 95% two-sided confidence interval
t.test(vers)
##
##  One Sample t-test
##
## data:  vers
## t = 81.318, df = 49, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  5.789306 6.082694
## sample estimates:
## mean of x
##     5.936
# Construct a 99% upper confidence bound
t.test(vers, conf.level = 0.99, alternative = "less")
##
##  One Sample t-test
##
## data:  vers
## t = 81.318, df = 49, p-value = 1
## alternative hypothesis: true mean is less than 0
## 99 percent confidence interval:
##      -Inf 6.111551
## sample estimates:
## mean of x
##     5.936
# t.test creates a list containing all computed stats, including the
# confidence interval. I can extract the CI by accessing the conf.int
# variable in the list
tt1 <- t.test(vers, conf.level = 0.9)
tt2 <- t.test(vers, conf.level = 0.95)
tt3 <- t.test(vers, conf.level = 0.99)
str(tt1)
## List of 9
##  $ statistic  : Named num 81.3
##   ..- attr(*, "names")= chr "t"
##  $ parameter  : Named num 49
##   ..- attr(*, "names")= chr "df"
##  $ p.value    : num 6.14e-54
##  $ conf.int   : atomic [1:2] 5.81 6.06
##   ..- attr(*, "conf.level")= num 0.9
##  $ estimate   : Named num 5.94
##   ..- attr(*, "names")= chr "mean of x"
##  $ null.value : Named num 0
##   ..- attr(*, "names")= chr "mean"
##  $ alternative: chr "two.sided"
##  $ method     : chr "One Sample t-test"
##  $ data.name  : chr "vers"
##  - attr(*, "class")= chr "htest"
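As a check on the $$t$$-interval formula, the 95% interval t.test() reports can be computed directly with qt(), and the interval can be pulled out of the returned list via the conf.int component. This sketch redefines the versicolor sepal length vector so it is self-contained:

```r
# t-based 95% CI for versicolor sepal length, computed directly
vers <- split(iris$Sepal.Length, iris$Species)$versicolor
n <- length(vers)
tstar <- qt(0.975, df = n - 1)  # two-sided 95% critical value
by_hand <- mean(vers) + c(-1, 1) * tstar * sd(vers)/sqrt(n)
by_hand
## [1] 5.789306 6.082694
as.numeric(t.test(vers)$conf.int)  # same interval, extracted from t.test
```

Dropping the attributes with as.numeric() gives the plain bounds; attr(tt1$conf.int, "conf.level") recovers the confidence level.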
# Create a graphic comparing these CIs
library(ggplot2)
# The end goal is to create an object easily handled by ggplot. I start by
# making a list with my data
conf_int_dat <- list(`90% CI` = as.list(tt1$conf.int), `95% CI` = as.list(tt2$conf.int),
    `99% CI` = as.list(tt3$conf.int))
### Difference of Two Population Means

We can also use t.test() to construct a confidence interval for the difference of two population means, $$\mu_X - \mu_Y$$. Here I use the ToothGrowth data set, splitting tooth length (len) by supplement type (supp).

# Split the tooth lengths by supplement type
split_len <- split(ToothGrowth$len, ToothGrowth$supp)
OJ <- split_len$OJ
VC <- split_len$VC
# Gets CI for difference in means
t.test(OJ, VC)
##
##  Welch Two Sample t-test
##
## data:  OJ and VC
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean of x mean of y
##  20.66333  16.96333
# Is there homoskedasticity? Let's check a boxplot
boxplot(len ~ supp, data = ToothGrowth)
# These populations look homoskedastic; spread does not differ drastically
# between the two. A CI that assumes homoskedasticity is thus computed.
t.test(OJ, VC, var.equal = TRUE)
##
##  Two Sample t-test
##
## data:  OJ and VC
## t = 1.9153, df = 58, p-value = 0.06039
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1670064  7.5670064
## sample estimates:
## mean of x mean of y
##  20.66333  16.96333
# Not much changed (only slightly narrower).

### Difference in Population Proportions

If we have two populations, each with a respective population proportion representing the proportion of that population possessing some trait, we can construct a confidence interval for the difference in the proportions, $$p_X - p_Y$$. The form of the confidence interval depends on which type is being used, and I will not discuss the details of how these confidence intervals are computed, but simply say that some of the functions we have seen for single-sample confidence intervals, such as prop.test(), support comparing proportions (binconf() and binom.test() do not). The only difference is that rather than passing a single count and a single sample size to these functions, you pass a vector containing the counts of successes in the two populations, and a vector containing the two sample sizes (naturally, they should be in the same order). In the next example, I compare the proportions of men and women with melanoma who die from the disease.
# The Melanoma data set is in the MASS package
library(MASS)
mel_split <- split(Melanoma$status, Melanoma$sex)
# Get logical vectors for whether patient died from melanoma. Group 0 is
# women and group 1 is men, and the code 1 in the data indicates death from
# melanoma
fem_death <- mel_split[["0"]] == 1
man_death <- mel_split[["1"]] == 1
# Vector containing the number of deaths for both men and women
deaths <- c(Female = sum(fem_death), Male = sum(man_death))
# Vector containing sample sizes
size <- c(Female = length(fem_death), Male = length(man_death))
deaths/size
##    Female      Male
## 0.2222222 0.3670886
# prop.test, with the 'classic' CLT-based CI
prop.test(deaths, size)
##
##  2-sample test for equality of proportions with continuity
##  correction
##
## data:  deaths out of size
## X-squared = 4.3803, df = 1, p-value = 0.03636
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.283876633 -0.005856138
## sample estimates:
##    prop 1    prop 2
## 0.2222222 0.3670886
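For reference, the CLT-based interval underlying this output can be reproduced by hand. The sketch below recomputes the counts from the Melanoma data directly and omits the continuity correction prop.test applies by default, so its endpoints differ from those above by a small adjustment:

```r
library(MASS)  # Melanoma data set
# Deaths from melanoma (status code 1) and group sizes, by sex (0 = female, 1 = male)
deaths <- with(Melanoma, tapply(status == 1, sex, sum))
sizes  <- with(Melanoma, tapply(status == 1, sex, length))
p <- deaths/sizes                  # sample proportions
est <- unname(p["0"] - p["1"])     # estimated difference p_female - p_male
se  <- sqrt(sum(p * (1 - p)/sizes))
est + c(-1, 1) * qnorm(0.975) * se  # CI without the continuity correction
```

The interval is slightly narrower than prop.test's; adding the correction term $$\frac{1}{2}\left(\frac{1}{n_X} + \frac{1}{n_Y}\right)$$ to the margin of error recovers the output shown above.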