# Lecture 8

## Hypothesis Testing Basics

Hypothesis testing is another common form of statistical inference. In hypothesis testing, our objective is to decide whether we have enough evidence to reject the null hypothesis, a statement about the population distribution that a priori we believe to be true, in favor of the alternative hypothesis, a statement about the population disagreeing with the null hypothesis. Usually the null hypothesis is denoted by $$H_0$$ and the alternative by $$H_A$$.

The first statistical tests introduced to students are tests about the value of a population parameter (other tests are possible, though). These tests generally take the form:

$H_0: \theta = \theta_0$ $H_A: \left\{\begin{array}{l} \theta < \theta_0 \\ \theta \neq \theta_0 \\ \theta > \theta_0 \end{array}\right.$

These are the tests that I discuss.

In statistical testing, one computes a test statistic and uses it to compute a p-value, which represents the probability of observing a test statistic at least as “extreme” as the one actually observed if $$H_0$$ were in fact true. Very small p-values are evidence against the null hypothesis. Usually what constitutes a “small” p-value is decided beforehand by choosing a level of significance, typically denoted $$\alpha$$. If the p-value is less than $$\alpha$$, $$H_0$$ is rejected in favor of $$H_A$$; otherwise, we fail to reject $$H_0$$. Common choices of $$\alpha$$ include 0.05, 0.01, 0.001, and 0.1. The smaller $$\alpha$$, the more difficult it is to reject $$H_0$$.

In hypothesis testing, we need to be aware of two types of errors, called Type I and Type II errors. A Type I error is rejecting $$H_0$$ when $$H_0$$ is true. A Type II error is failing to reject $$H_0$$ when $$H_A$$ is true. For any test, we want to know the probability of making either type of error. The probability of a Type I error is $$\alpha$$, the level of significance; this means we specify beforehand what Type I error rate we are willing to tolerate. Type II errors are much more complicated, since their probability depends not only on the true value of $$\theta$$ (which we will call $$\theta_A$$, the value of $$\theta$$ under the alternative assumed true for Type II error analysis) but also on the specified $$\alpha$$ and the sample size $$n$$ (other parameters may be involved as well, but they are assumed constant and outside our control).

For any testing scheme we represent the probability of a Type II error with $$\beta(\theta_A)$$, the probability of failing to reject $$H_0$$ when the true value of $$\theta$$ is $$\theta_A$$. Generally $$\beta(\theta_A)$$ is large when $$\theta_A$$ is close to $$\theta_0$$ (in fact, $$\beta(\theta_0) = 1 - \alpha$$), and small when $$\theta_A$$ is distant from $$\theta_0$$. This should make intuitive sense; a big difference between $$\theta_0$$ and $$\theta_A$$ should be easy to detect, but a small difference may be more difficult. In practice, researchers pick a $$\theta_A$$ at which they want to be able to detect a difference with some specified probability $$1 - \beta(\theta_A)$$. They then choose a sample size that gives the test this property.
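The claim that the Type I error probability equals $$\alpha$$ can be checked by simulation. The sketch below is only an illustration under assumed values (a Normal population with mean 12 and known standard deviation 3, samples of size 31, and an arbitrary seed); it draws many samples with $$H_0$$ true and records how often a two-sided $$z$$-test rejects.

```r
set.seed(8)      # arbitrary seed, used only for reproducibility
alpha <- 0.05
mu_0  <- 12      # assumed null value; H0 is true in this simulation
sigma <- 3       # assumed known population standard deviation
n     <- 31      # assumed sample size
reps  <- 10000   # number of simulated samples

# Each row is one simulated sample drawn with H0 true
samples <- matrix(rnorm(reps * n, mean = mu_0, sd = sigma), nrow = reps)
z       <- (rowMeans(samples) - mu_0) / (sigma / sqrt(n))
pvals   <- 2 * (1 - pnorm(abs(z)))

# Fraction of samples in which a true H0 is wrongly rejected
type1_rate <- mean(pvals < alpha)
type1_rate
```

The observed rejection rate should land close to the chosen $$\alpha$$, up to simulation noise.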

Some prefer to discuss the power of a test rather than the probability of a Type II error. The power of a test is the probability of rejecting $$H_0$$ when $$\theta = \theta_A$$. Power connects both Type I and Type II errors, since the power of a test is defined to be $$\pi(\theta_A) = 1 - \beta(\theta_A)$$ and $$\pi(\theta_0) = \alpha$$. The same principles discussed with regard to the probability of Type I and Type II errors hold with power.
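To make the relationship between $$\beta$$ and power concrete, here is a minimal sketch for one special case: the upper-tailed test of a Normal mean with known $$\sigma$$ (the $$z$$-test discussed later in this lecture). The function name z_power and the numbers passed to it are illustrative assumptions, not part of any package.

```r
# Power of the upper-tailed z-test (H_A: mu > mu_0) when the true mean is mu_A.
# z_power is a hypothetical helper written for this illustration.
z_power <- function(mu_A, mu_0, sigma, n, alpha = 0.05) {
  shift <- (mu_A - mu_0) / (sigma / sqrt(n))   # standardized true difference
  1 - pnorm(qnorm(1 - alpha) - shift)
}

mu_A_grid <- c(12, 12.5, 13, 14)               # assumed alternatives
power     <- z_power(mu_A_grid, mu_0 = 12, sigma = 3, n = 31)
beta      <- 1 - power                         # Type II error probability

data.frame(mu_A = mu_A_grid, power = power, beta = beta)
```

The first row corresponds to $$\theta_A = \theta_0$$, where the power equals $$\alpha$$; power then grows as the alternative moves away from the null value.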

As with confidence intervals, R has many functions for handling hypothesis testing (in fact, you have already seen most of them). We can perform hypothesis testing “by hand” (not using any functions designed for performing an entire test), or using functions designed for hypothesis testing. We start with methods “by hand”.

## Hypothesis Testing “By Hand”

Hypothesis testing “by hand” involves computing the test statistic and finding the appropriate p-value for a test directly. This is the only means of hypothesis testing if a function for performing some desired test does not exist (which is probably not the case unless you’re working with a very obscure test).

Let’s start by considering a simple $$z$$-test, where the hypotheses are:

$H_0: \mu = \mu_0$ $H_A: \left\{\begin{array}{l} \mu < \mu_0 \\ \mu \neq \mu_0 \\ \mu > \mu_0 \end{array}\right.$

The test statistic is $$z = \frac{\bar{x} - \mu_0}{\frac{\sigma}{\sqrt{n}}}$$ and the p-value (denoted $$p_{\text{val}}$$) is:

$p_{\text{val}} = \left\{\begin{array}{lr} \Phi(z) & \text{if } H_A \text{ is } \mu < \mu_0 \\ 2\left(1 - \Phi(|z|)\right) & \text{if } H_A \text{ is } \mu \neq \mu_0 \\ 1 - \Phi(z) & \text{if } H_A \text{ is } \mu > \mu_0 \end{array}\right.$

This test is unrealistic since it assumes $$\sigma$$ is known, but it is simple to analyze.
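The three cases of the p-value formula can be wrapped in one small function. This is a sketch only: z_test is a hypothetical helper, not a function from the stats package, and the data vector is made up for illustration.

```r
# z_test is a hypothetical helper; sigma is assumed known
z_test <- function(x, mu_0, sigma,
                   alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  z <- (mean(x) - mu_0) / (sigma / sqrt(length(x)))
  pval <- switch(alternative,
                 less      = pnorm(z),                   # H_A: mu < mu_0
                 two.sided = 2 * (1 - pnorm(abs(z))),    # H_A: mu != mu_0
                 greater   = 1 - pnorm(z))               # H_A: mu > mu_0
  list(statistic = z, p.value = pval, alternative = alternative)
}

# Made-up measurements, used only to exercise the function
girths <- c(11.4, 13.1, 12.8, 12.2, 13.9, 11.9, 12.6, 13.3)

res_two     <- z_test(girths, mu_0 = 12, sigma = 3)
res_less    <- z_test(girths, mu_0 = 12, sigma = 3, alternative = "less")
res_greater <- z_test(girths, mu_0 = 12, sigma = 3, alternative = "greater")
c(two.sided = res_two$p.value, less = res_less$p.value,
  greater = res_greater$p.value)
```

The two one-sided p-values sum to one, and the two-sided p-value is twice the smaller of them, which follows directly from the formula.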

I demonstrate these procedures by testing whether the mean diameter of black cherry trees is 12 in. I test the hypotheses:

$H_0: \mu = 12$ $H_A: \mu \neq 12$

I use the data set trees and assume that the population standard deviation is 3. I will base my test on the significance level of $$\alpha = .05$$.

# Some basic numbers
xbar <- mean(trees$Girth)
xbar
## [1] 13.24839
n <- nrow(trees)
mu_0 <- 12
sigma <- 3
# Test statistic
z <- (xbar - mu_0)/(sigma/sqrt(n))
z
## [1] 2.316908
# Get p-value, using pnorm
pval <- 2 * (1 - pnorm(abs(z)))
pval
## [1] 0.02050872

Since 0.02 is less than my significance level of .05, I reject the null hypothesis; the mean diameter of black cherry trees is not 12. That said, if I were to use a significance level of $$\alpha = .01$$, I would not reject the null hypothesis. Remember that the p-value represents the largest level of significance at which I would fail to reject the null hypothesis. Thus the p-value measures how unlikely my data is if the null hypothesis were true, with smaller p-values indicating more evidence against the null hypothesis.

Testing hypotheses “by hand” follows a similar format to the one demonstrated here, so I do not demonstrate this approach any more in this lecture.

## R Functions for Statistical Testing

R has functions for performing many common statistical tests, including many that come with any R installation in the stats package. We already saw all the functions for performing statistical tests I consider in this lecture when we discussed confidence intervals. There may be additional parameters to specify depending on the statistical test, such as what the population mean under the null hypothesis is, but otherwise little has changed.

Many of these functions have a common parameter alternative that determines what the alternative hypothesis is. For a two-sided test (that is, when the alternative hypothesis is that $$\theta \neq \theta_0$$), you may set alternative = "two.sided" (this is the default, though, so setting this may not be necessary). If the alternative hypothesis says $$\theta > \theta_0$$, set alternative = "greater", and if the alternative hypothesis says $$\theta < \theta_0$$, set alternative = "less".
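As a small illustration of the alternative parameter, the sketch below runs t.test() on the built-in trees data with each of the three settings and compares the p-values with a significance level. The choice of $$\alpha = .05$$ and of 12 as the null value simply mirrors the example above.

```r
alpha <- 0.05

# Same data and null value, three different alternative hypotheses
p_two     <- t.test(trees$Girth, mu = 12)$p.value   # default: "two.sided"
p_greater <- t.test(trees$Girth, mu = 12, alternative = "greater")$p.value
p_less    <- t.test(trees$Girth, mu = 12, alternative = "less")$p.value

pvals <- c(two.sided = p_two, greater = p_greater, less = p_less)
pvals

# TRUE marks the alternatives for which H0 would be rejected at level alpha
pvals < alpha
```

Because the $$t$$ distribution is symmetric, the two one-sided p-values sum to one and the two-sided p-value is twice the smaller one, so the choice of alternative should be made before looking at the data.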
The statistical testing functions in stats will report the results of a statistical test and many other important quantities, such as the estimate of the parameters investigated, the sample size, degrees of freedom, the value of a test statistic, the p-value, and even the corresponding confidence interval. They do not state whether to reject or not reject the null hypothesis; you are expected to tell from the p-value reported whether to reject or not reject based on the significance level you have chosen.

stats includes some functions for Type II error analysis (often in the form of power analysis, which amounts to the same sort of analysis), but this is a difficult analysis in general. Functions for study planning may be included in other packages.

### Test for Location of Population Mean

Suppose you wish to test the hypotheses:

$H_0: \mu = \mu_0$ $H_A: \left\{\begin{array}{l} \mu < \mu_0 \\ \mu \neq \mu_0 \\ \mu > \mu_0 \end{array}\right.$

Furthermore, you assume (after perhaps using qqnorm() or some other procedure to check) that your data is Normally distributed, and thus it is safe to use the $$t$$-test (or your sample size is large enough for this to be safe to use anyway). The function t.test() allows you to test these hypotheses. The call t.test(x, mu = mu0) will test whether the data stored in vector x has a mean of mu0 or whether the true mean differs from mu0 (remember: change the form of the alternative hypothesis with the parameter alternative), using the test statistic:

$t = \frac{\bar{x} - \mu_0}{\frac{s}{\sqrt{n}}}$

I test the hypotheses regarding tree girth in the trees data set mentioned earlier in this lecture, but this time using t.test(), using a level of significance of $$\alpha = .05$$.

# Check the normality assumption
qqnorm(trees$Girth)
qqline(trees$Girth)
# Data appears Normally distributed, so it's safe to use t.test
t.test(trees$Girth, mu = 12)
##
##  One Sample t-test
##
VC <- split_len$VC
# Perform statistical test
t.test(OJ, VC, alternative = "greater")
##
##  Welch Two Sample t-test
##
## data:  OJ and VC
## t = 1.9153, df = 55.309, p-value = 0.03032
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.4682687       Inf
## sample estimates:
## mean of x mean of y
##  20.66333  16.96333
# Is there homoskedasticity? Let's check a boxplot
boxplot(len ~ supp, data = ToothGrowth)
# These populations look homoskedastic; spread does not differ drastically
# between the two. A test that assumes homoskedasticity is thus performed.
t.test(OJ, VC, var.equal = TRUE, alternative = "greater")
##
##  Two Sample t-test
##
## data:  OJ and VC
## t = 1.9153, df = 58, p-value = 0.0302
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.4708204       Inf
## sample estimates:
## mean of x mean of y
##  20.66333  16.96333
# Not much changed. In both versions of the test, I reject the null hypothesis.

### Testing for Difference in Population Proportion

Suppose you have two data sets of Bernoulli random variables, $$X_1, ..., X_n$$ and $$Y_1, ..., Y_m$$, and you wish to know if the population proportion of “successes” for the first population, $$p_X$$, differs from the corresponding proportion in the second population, $$p_Y$$. Your set of hypotheses are:

$H_0: p_X = p_Y \equiv p_X - p_Y = 0$ $H_A: \left\{\begin{array}{l} p_X < p_Y \equiv p_X - p_Y < 0 \\ p_X \neq p_Y \equiv p_X - p_Y \neq 0 \\ p_X > p_Y \equiv p_X - p_Y > 0 \end{array}\right.$

If $$m$$ and $$n$$ are reasonably large, the Central Limit Theorem can be used to obtain a test statistic:

$z = \frac{\hat{p}_X - \hat{p}_Y}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n} + \frac{1}{m} \right)}}$

where $$\hat{p}$$ is the pooled sample proportion (that is, the proportion of “successes” from the combined sample).
Under $$H_0$$, this test statistic approximately follows a standard Normal distribution (by the Central Limit Theorem). In R, prop.test() can be used to conduct this test, using a call similar to prop.test(x, n), where x is a vector containing the number of successes in each sample and n is a vector containing the two sample sizes.

Below, I use prop.test() and the Melanoma data set (from the MASS package) to test whether males and females are equally likely to die from melanoma after being diagnosed, or whether they differ. I use a significance level of $$\alpha = .05$$.

# First, I obtain sample sizes, using the fact that sex == 1 for males and 0
# for females
n <- c(Female = nrow(Melanoma) - sum(Melanoma$sex), Male = sum(Melanoma$sex))
# Now, find how many in each group died from melanoma
x <- aggregate(Melanoma$status == 1, list(Melanoma$sex), sum)
x <- c(Female = x[1, "x"], Male = x[2, "x"])
prop.test(x, n)
##
##  2-sample test for equality of proportions with continuity
##  correction
##
## data:  x out of n
## X-squared = 4.3803, df = 1, p-value = 0.03636
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.283876633 -0.005856138
## sample estimates:
##    prop 1    prop 2
## 0.2222222 0.3670886

With a p-value of 0.0364, I reject the null hypothesis; men and women are not equally likely to die from melanoma.
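The output above reports an X-squared statistic rather than the $$z$$ statistic from the formula. The sketch below suggests how the two relate: without the continuity correction, the X-squared value is the square of the pooled $$z$$ statistic. The counts are typed in by hand and are assumptions chosen to match the sample proportions printed above (28 of 126 females and 29 of 79 males); the code does not load the data set itself.

```r
# Counts entered by hand to match the proportions 0.2222222 and 0.3670886
x <- c(Female = 28, Male = 29)     # deaths from melanoma in each group
n <- c(Female = 126, Male = 79)    # group sizes

p_hat  <- x / n                    # sample proportions
p_pool <- sum(x) / sum(n)          # pooled proportion under H0

# z statistic from the formula in the text
z <- (p_hat[["Female"]] - p_hat[["Male"]]) /
  sqrt(p_pool * (1 - p_pool) * sum(1 / n))

# Same test via prop.test(), with the continuity correction switched off
uncorrected <- prop.test(x, n, correct = FALSE)

c(z_squared = z^2, X_squared = unname(uncorrected$statistic))
```

By default prop.test() applies a continuity correction, which is why the statistic reported above is somewhat smaller than the squared $$z$$ statistic.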

## Power Analysis

Power analysis is an important part of planning a statistical study. Researchers usually try to make a test as powerful as reasonable (there is always a more powerful study than the one currently employed; simply use a sample size of $$n + 1$$ rather than $$n$$). A useful technique in planning a study is to decide what effect size the test should be able to detect with some specified probability, then choose a sample size that will give this property to the test.

Whole R packages are devoted to providing tools for study planning, but the stats package included with any R installation has some functions for power analysis, including those for the two classes of tests we have studied: tests for population mean, and tests for population proportion.

### Power Analysis for the $$t$$-Test

power.t.test() allows you to perform power analysis for the $$t$$-test. It can be used in different ways depending on which parameters are passed to it and which are set to NULL, and you are encouraged to read the documentation (with, say, help("power.t.test")) to see how this function behaves, but I will focus on two applications: computing power for a test, and computing a sample size for a given power.

power.t.test(n, delta, sd, sig.level, type = "some.type.of.test", alternative = "some.alternative") will compute the power of a test with sample size n, where the size of the difference between the mean under the null hypothesis and the mean under the alternative hypothesis, $$|\mu_A - \mu_0|$$, is delta, the population standard deviation is sd, and the significance level is sig.level (by default, sig.level = .05). The type of the test administered is specified by type, and can be either "one.sample", "two.sample", or "paired" (the meaning should be self-evident). alternative specifies whether the alternative hypothesis is one-sided ("one.sided") or two-sided ("two.sided"). Notice that when alternative = "two.sided", the reported power by default counts only rejections in the direction of the true difference; if you want the power to include the probability of rejecting in the opposite direction as well, you should set the parameter strict to TRUE (by default, strict = FALSE).
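The effect of strict is easiest to see by computing the same two-sided power both ways. The numbers below (n = 20, delta = 5, sd = 20, a one-sample test at the default significance level) are assumptions picked only for illustration.

```r
# Two-sided power, counting only rejections in the direction of the effect
power_default <- power.t.test(n = 20, delta = 5, sd = 20, sig.level = 0.05,
                              type = "one.sample", alternative = "two.sided",
                              strict = FALSE)$power

# Two-sided power, also counting rejections in the opposite direction
power_strict <- power.t.test(n = 20, delta = 5, sd = 20, sig.level = 0.05,
                             type = "one.sample", alternative = "two.sided",
                             strict = TRUE)$power

c(default = power_default, strict = power_strict,
  opposite_tail = power_strict - power_default)
```

The difference between the two values is the probability of rejecting $$H_0$$ in the wrong direction, which is typically small unless the power itself is low.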

Suppose you plan to conduct a study to determine whether a new drug would induce weight loss. You plan to give all study participants both the drug and a placebo (in random order, with neither the study participants nor the experiment staff knowing which treatment is the drug and which is the placebo, thus helping combat bias), and measure the difference in weight loss when the two treatments are administered. Thus your hypotheses are:

$H_0: \mu_{\text{drug}} = \mu_{\text{placebo}}$ $H_A: \mu_{\text{drug}} > \mu_{\text{placebo}}$

Your test will use a significance level of $$\alpha = .01$$. You believe that $$\sigma = 20$$ (you estimate high to be on the safe side). A researcher on staff suggests a sample size of 20. You are skeptical that a study with that sample size will be able to detect a five-pound difference in weight loss between the drug and the placebo, and thus compute $$\pi(5)$$, the power of the test when the true difference is 5 lbs.

power.t.test(n = 20, delta = 5, sd = 20, sig.level = 0.01, type = "paired",
alternative = "one.sided")
##
##      Paired t test power calculation
##
##               n = 20
##           delta = 5
##              sd = 20
##       sig.level = 0.01
##           power = 0.09924502
##     alternative = one.sided
##
## NOTE: n is number of *pairs*, sd is std.dev. of *differences* within pairs

This study will only detect a five-pound difference about 10% of the time, which is too low for your liking. You want to find a sample size to guarantee detecting this difference with some higher probability. This may involve a much larger sample size.

power.t.test(delta = d, sd = s, sig.level = alpha, power = p, type = "some.type.of.test", alternative = "some.alternative") is similar in usage to the earlier command, but instead of finding power, it finds the sample size such that the power of the test is p when the difference between $$\mu_0$$ and $$\mu_A$$ is d. Thus this call is useful when planning a test and choosing an appropriate sample size for detecting a specified effect with some desired probability.

You have decided that you want the study to detect a five-pound difference in weight loss 90% of the time, and want to find a sample size that will give your test this property. You use R to find this sample size:

power.t.test(power = 0.9, delta = 5, sd = 20, sig.level = 0.01, type = "paired",
alternative = "one.sided")
##
##      Paired t test power calculation
##
##               n = 210.9878
##           delta = 5
##              sd = 20
##       sig.level = 0.01
##           power = 0.9
##     alternative = one.sided
##
## NOTE: n is number of *pairs*, sd is std.dev. of *differences* within pairs

The results suggest that you need 211 study participants for your study to have the desired property.
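Since power.t.test() returns a fractional n, the sample size has to be rounded up. A quick way to double-check the rounded value, sketched here with the same inputs as the call above, is to feed it back into power.t.test() and confirm the power reaches the target.

```r
# Solve for the number of pairs needed for 90% power
plan <- power.t.test(power = 0.9, delta = 5, sd = 20, sig.level = 0.01,
                     type = "paired", alternative = "one.sided")

# Round up: a fraction of a participant cannot be recruited
n_needed <- ceiling(plan$n)

# Recompute the power at the rounded sample size
check <- power.t.test(n = n_needed, delta = 5, sd = 20, sig.level = 0.01,
                      type = "paired", alternative = "one.sided")

c(n = n_needed, power = check$power)
```

Rounding down instead would leave the study slightly short of the desired power.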

### Power Analysis for Tests of Proportion

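power.prop.test() does for tests for population proportion what power.t.test() does for tests for population mean. The syntax is similar, except there is no parameter type, and delta is replaced with p1 and p2, which specify the proportions in the two groups being compared. (There is no need to specify sd, so that is not a parameter either.) Note that power.prop.test() performs its calculation for a two-sample comparison of proportions, with n being the number of observations in each group; when it is applied to a one-sample question, as in the example below, treat the result as a conservative approximation.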

Gallup polls often survey samples of 1500 adults. Suppose a Gallup poll asks individuals whether they support Hillary Clinton or Donald Trump for President, and the poll uses a significance level of $$\alpha = .05$$ (the default for power.prop.test(), thus allowing us to ignore the parameter sig.level). Suppose we wish to use the results of the poll to test:

$H_0: p = .5$ $H_A: p > .5$

where $$p$$ is the proportion of the population supporting Hillary Clinton. We would like to know if the Gallup poll can reasonably detect a one-percentage-point advantage for Clinton, and use power.prop.test() to find out:

power.prop.test(n = 1500, p1 = .5, p2 = .51, alternative = "one.sided")
##
##      Two-sample comparison of proportions power calculation
##
##               n = 1500
##              p1 = 0.5
##              p2 = 0.51
##       sig.level = 0.05
##           power = 0.136286
##     alternative = one.sided
##
## NOTE: n is number in *each* group

The Gallup poll will detect this difference only 14% of the time. If we wanted to detect this advantage for Clinton 95% of the time, what sample size do we need? By specifying power = .95 and omitting n, we can find the desired sample size.

power.prop.test(power = .95, p1 = .5, p2 = .51, alternative = "one.sided")
##
##      Two-sample comparison of proportions power calculation
##
##               n = 54102.75
##              p1 = 0.5
##              p2 = 0.51
##       sig.level = 0.05
##           power = 0.95
##     alternative = one.sided
##
## NOTE: n is number in *each* group

According to this calculation, we would need a sample size of 54,103 people (in each group, as the note in the output points out) to have a test with these properties.
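The output above is labelled a two-sample comparison, with n counted in each group, while the hypotheses concern a single proportion. As a rough cross-check, the sketch below computes power and sample size directly for a one-sample, upper-tailed test using the Normal approximation. This is an assumption-laden illustration written out by hand, not a function from stats.

```r
p0    <- 0.50    # proportion under H0
pA    <- 0.51    # proportion under the alternative of interest
n     <- 1500    # poll size
alpha <- 0.05

# Upper-tailed test: reject H0 when p-hat exceeds this critical value
crit <- p0 + qnorm(1 - alpha) * sqrt(p0 * (1 - p0) / n)

# Probability that p-hat exceeds the critical value when p = pA
power_one_sample <- 1 - pnorm((crit - pA) / sqrt(pA * (1 - pA) / n))

# Sample size giving 95% power under the same approximation
target <- 0.95
n_one_sample <- ((qnorm(1 - alpha) * sqrt(p0 * (1 - p0)) +
                  qnorm(target) * sqrt(pA * (1 - pA))) / (pA - p0))^2

c(power = power_one_sample, n_for_95 = ceiling(n_one_sample))
```

Under this approximation the one-sample test is somewhat more powerful than the two-sample figures suggest and needs a smaller sample, because only one proportion has to be estimated. The qualitative conclusion is unchanged: a poll of 1500 is unlikely to detect so small an advantage.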

## p-Hacking

p-hacking (also known as data dredging or other names) is the practice of reformulating hypotheses and research questions and computing p-values for the same data set until a statistically significant result is obtained. This is considered to be very bad practice, and results found by p-hacking are untrustworthy. The probability of a Type I error is inflated when p-hacking is used, yet nowhere is this communicated or adjusted for. You do not want to be accused of p-hacking.
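A short calculation shows how quickly the Type I error probability inflates. The sketch assumes every null hypothesis is true and that the tests are independent, each run at $$\alpha = .05$$; tests on the same data set are usually not independent, so treat this as an idealized illustration.

```r
alpha <- 0.05
k     <- c(1, 5, 10, 20, 50)   # number of tests tried on the data

# Probability that at least one of k independent tests rejects a true H0
p_any_false_positive <- 1 - (1 - alpha)^k

data.frame(tests = k,
           p_any_false_positive = round(p_any_false_positive, 3))
```

With a single test the chance of a false positive is $$\alpha$$, but after 20 such tests it is closer to two in three, even though each individual p-value is still compared with .05.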

When conducting a study, you should have a well-defined problem prior to any testing. Exploratory analysis with visualization, or deeper analysis on small subsets of the data (that you do not use later when testing), can help you formulate that problem. You should also report what you did. If you perform an analysis and do not get the results you expected even though you are certain that the analysis was done correctly, or you simply are unsatisfied with the conclusion, you need to perform any further analysis on new data, not the same data. Furthermore, you should report any prior analyses. Keep in mind that if you perform lots of tests and stop when you find a significant result, you may be p-hacking, especially if you do not report that this was done.

There are many dos and don’ts with regard to how to properly conduct hypothesis testing and avoid p-hacking. Some rules of thumb are to have a well-defined plan and research question prior to analysis, and to keep the number of statistical tests to the minimum needed to answer the research question. The website FiveThirtyEight has made a web app where you can see what p-hacking is and why it cannot be trusted.