Lecture 1

R Basics

Using R as a Calculator

Lots of work in R is done via the interpreter. You can use the interpreter as a calculator, and thus can do some simple calculations.

# Some basic arithmetic
2 + 2
## [1] 4
1 - 7
## [1] -6
2 * 3
## [1] 6
(1 + 1) * 3
## [1] 6
2^3
## [1] 8
10/2
## [1] 5
# One can also use functions for more advanced calculations
sqrt(9)
## [1] 3
exp(3)
## [1] 20.08554
# A familiar built-in value
pi
## [1] 3.141593

When using a computer as a calculator, though, you may observe behavior that seems… strange.

# Be aware that any time you use computers for calculation you may get the
# 'wrong' answer. This is due to floating-point arithmetic and how computers
# see numbers.
sqrt(2)^2
## [1] 2
# What's this?
sqrt(2)^2 - 2
## [1] 4.440892e-16

(For an explanation of this phenomenon, watch this video by Computerphile on floating-point numbers.)

Obviously, though, we want to use R for more than a glorified calculator.

Variables

As with any programming language, R can store information in variables. R has a few ways to store variables, with the <-, ->, and = operators (there are a few others that I won’t discuss here).

# One way to create a variable
var1 <- 1
# Another way to create a variable
var2 = 2
# A third way, this time where the variable is on the RIGHT side of the
# assignment operator!
var3 <- 3
# What do you get?
var1 + var2
## [1] 3
var2 * var3
## [1] 6

There are rules as to how variables can be named. In addition to not being reserved words, they can only use alpha-numeric characters (letters and numbers), or the _ or . characters. Also, they cannot begin with a number (so while var1 is legal, 1var is not). Variables are case-sensitive; var1 is not the same as Var1 or VAR1.

# These are all valid names for variables
var1 <- 10
variable_number_2 <- 20
variable.number.3 <- 30  # <---- I would suggest avoiding this style of variable naming, though; use '_' instead
variableNumber4 <- 40

# None of these are the same
var1 <- 10
Var1 <- 230
VAR1 <- -5

var1 + 1
## [1] 11
Var1 + 1
## [1] 231
VAR1 + 1
## [1] -4

As a side note, notice that a line break will usually begin a new command. This is not always true! If a command is not finished from R’s perspective, it will see the contents of the next line as being part of the same command. This can sometimes be confusing, while other times it can be used to your advantage. Furthermore, you can use the ; character to separate commands on the same line, or explicitly end a command like in C-based programming languages like C, C++, or Java (though this is usually not done). Aside from this, R generally ignores white space (space, tabs, new lines, etc.).

# Because there is nothing after the '+' on the first line, R looks for the
# rest of the command on the second line.
5 +
5
## [1] 10
# It could even be a few lines down! (DON'T EVER DO THIS!!!!)
10 +

1
## [1] 11

R variables are untyped, but that does not mean that the object they reference is untyped. Some basic types you will encounter early (and there are many, many others) include:

• Numeric, like 124 or 3.14159
• Character, like "hello bobby" or "124" or 'cmiller@math.utah.edu'
• Boolean, either TRUE or FALSE
• Vectors, which are a collection of objects of all the same type
• Functions, which you can think of as being a “mini program”, like log, sqrt, mean, or help
• Data frames, which store datasets in memory in a tabular format
# A numeric variable
num_var <- 124
# Character data
char_var <- "hello"
# Boolean
bool_var <- TRUE
# A vector of data
data_vec <- c(1, 20, 6, 2)
# Another vector of data
data_vec2 <- c("hello", "world")

# There are functions for type checking
not_a_number <- "124"
is.numeric(not_a_number)
## [1] FALSE
is.character(not_a_number)
## [1] TRUE
# Sometimes you can force a variable of one type to be another type.
is_a_number <- as.numeric(not_a_number)
is.numeric(is_a_number)
## [1] TRUE

Packages

The true power of R is in its packages. R has an ever-growing community of users and developers, many of whom write free packages to extend R’s functionality. These packages are made available to all R users on websites like CRAN (the Comprehensive R Archive Network) or GitHub.

Packages from CRAN are easily downloaded and installed using the install.packages() function, like so:

# Install the "UsingR" package for Verzani's book
install.packages("UsingR")

Once you install a new package, you can load it into the R environment using require() or library().

library(UsingR)
# Alternatively, you could use require("UsingR")

Today there are well over 7000 packages on CRAN alone, and this growth is likely to continue.

Data sets

Datasets in R are usually stored in vectors (for univariate data) or data frames (for multivariate data). For now, we will work with built-in data sets or those included in packages. If a dataset isn’t already in the R environment but does exist in some package, it can be brought in using the data() function. The head() function allows us to view only a few observations from a dataset (in order to prevent our screen from being flooded with all the data), and the str() function gives us a further description of the dataset.

# Let's load in a univeriate dataset containing the lengths of major North
# American rivers
data(rivers)
# We can see a few of the observations to get a glimpse at the nature of the
# data.
head(rivers)
## [1] 735 320 325 392 524 450
# We can see more information via str()
str(rivers)
##  num [1:141] 735 320 325 392 524 ...
# Just for fun, what is the combined length of all these rivers?
sum(rivers)
## [1] 83357
# Let's have even more fun by seeing what proportion of this total length
# each river accounts for.
rivers/sum(rivers)
##   [1] 0.008817496 0.003838910 0.003898893 0.004702664 0.006286215
##   [6] 0.005398467 0.017503029 0.001619540 0.005578416 0.007197956
##  [11] 0.003958876 0.004030855 0.003359046 0.003778927 0.010437036
##  [16] 0.010868913 0.002423312 0.003946879 0.003479012 0.011996593
##  [21] 0.007197956 0.006058279 0.017395060 0.010077138 0.014911765
##  [26] 0.010676968 0.004198808 0.004882613 0.003431026 0.003359046
##  [31] 0.006298211 0.008637547 0.004678671 0.002999148 0.003922886
##  [36] 0.002759216 0.003179097 0.010197104 0.002519285 0.007557854
##  [41] 0.003119114 0.002759216 0.004318773 0.008757513 0.007197956
##  [46] 0.003670957 0.004678671 0.005038569 0.003491009 0.008517581
##  [51] 0.004078842 0.002603261 0.003371043 0.004222801 0.003107118
##  [56] 0.002999148 0.005638399 0.008157683 0.006838058 0.004198808
##  [61] 0.003598978 0.006718092 0.010796934 0.007497871 0.003982869
##  [66] 0.028168000 0.014048010 0.044507360 0.027772113 0.030387370
##  [71] 0.009357343 0.003359046 0.004918603 0.005518433 0.003119114
##  [76] 0.003059131 0.005170532 0.004198808 0.009117411 0.007413894
##  [81] 0.004054848 0.011768658 0.015667550 0.005998296 0.008349629
##  [86] 0.007257939 0.002999148 0.004930600 0.012644409 0.008817496
##  [91] 0.002795206 0.005218518 0.005878331 0.003718944 0.005518433
##  [96] 0.004594695 0.004498722 0.015235673 0.006538143 0.005338484
## [101] 0.022613578 0.004558705 0.003598978 0.004558705 0.004522716
## [106] 0.005098552 0.003311060 0.002519285 0.009597274 0.005038569
## [111] 0.004198808 0.004318773 0.006454167 0.013196252 0.014455895
## [116] 0.003766930 0.002843193 0.007317922 0.004318773 0.006478160
## [121] 0.012452464 0.005086555 0.003718944 0.003598978 0.005326487
## [126] 0.003610974 0.003215087 0.007437888 0.002579267 0.007821779
## [131] 0.010796934 0.006298211 0.002951162 0.004318773 0.006346198
## [136] 0.005998296 0.008637547 0.003239080 0.005158535 0.008049714
## [141] 0.021233970
# Let's look at our first ever data frame, the famous iris dataset
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... ##$ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... ##$ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ... # You can think of this as being a table or matrix. In fact, it's a # combination of equal-length vectors. We can get the Sepal.Length vector # using the$ operator
iris$Sepal.Length ## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 ## [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 ## [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 ## [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 ## [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 ## [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 ## [103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 ## [120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 ## [137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9 Comments You may have noticed already the # symbol in the code. This is called a comment. The R interpreter completely ignores comments. They are used for annotating code. Commenting is not optional; it’s essential to understanding code when it is read (and I promise you, it will be). You can’t write good code without comments. Consequently, I require you to write good, extensive comments in your homework! # Hi! I'm a comment! R doesn't care about me, but that's okay. I'm very # useful for understanding code, and if you forget to add me, an angry # programmer will come to your house and punch you in the face! # Here, I'm saying that the following code makes a histogram of the rivers # data, using relative frequency, a custom main title, a custom x-axis # label, a bin count based on the square root of the size of the dataset, # right-inclusive classes, and a beautiful rainbow fill because we're classy # people who thinks this looks good (even though it doesn't) hist(rivers, breaks = ceiling(length(rivers)), freq = FALSE, right = FALSE, main = "Histogram for the Lengths of Rivers", xlab = "River Length", col = rainbow(ceiling(length(rivers)))) Help R has many, many functions. In 2014, there were approximately 182,393 R functions that could be used (most of them in packages). There is no way anyone could remember all of them. Fortunately, it’s easy to access documentation in R, especially if you are using RStudio. The help() function will look up documentation for a string entered. This is made even easier by just typing ? and the name of the package/function/dataset you want to see documentation for. # Here's how to access the documentation of the mean() function help("mean") # Or even easier: ?mean Univariate Data Analysis Vectors Univariate data is usually stored as vectors. The most basic function for creating a vector is the c() function, where the arguments of the function (separated by commas) become the contents of the vector. It generally doesn’t matter what the type of data the contents of the vector are so long as they are all the same. # A numeric vector of fictitious data num_vec <- c(10, 13, -1, 0.02, 0, -3.31) # A vector of character data char_vec <- c("joe", "phil", "jan", "denise", "tom") # A vector of boolean values bool_vec <- c(TRUE, TRUE, FALSE) # A vector of functions? WHAT IS THIS MADNESS? func_vec <- c(mean, sum, sd) # But this does not create a vector of vectors; all vectors are flattened into one vector (it is possible to make a vector of vectors, but it's tricky) not_a_vec_of_vecs <- c(c(1,2,3),c(4,54),c(10,2,-6)) # If I want to see the contents of a vector, just type its name into the interpreter not_a_vec_of_vecs ## [1] 1 2 3 4 54 10 2 -6 If you don’t specify the type of data you are storing in a variable, or if one has not already been assigned, R will think it’s a vector. im_a_vector <- 5 is.vector(im_a_vector) ## [1] TRUE To access the contents of a vector, you can use [] notation, like vec[x]. x identifies the elements of the vector you want. This is called indexing. Some notes about x: • Items in a vector can always be found via an integer. vec[1] will get the first element of the vector, and vec[5] the fifth. Also, instead of specifying what indices you do want, you can specify the indices you don’t want with negative integers. For example, vec[-1] is vec with all except for vec[1]. vec[0] is an empty vector that is of the same type as vec. • Some vectors have named elements. In that case, you can access elements of the vector by name, using a character string. For example, if cities is a vector of city populations which is indexed with city names, cities["Salt Lake City"] will find the population of Salt Lake City in the vector. If you want to see the names of the elements, use the names() function. (Not surprisingly, the object returned by names() is also a vector, specifically a character vector.) • x can be another vector, and thus you can index a vector with another vector so long as the values contained in x are a valid means of indexing. We could index a vector with vec[c(1,2,3)] or cities[c("Salt Lake City", "Provo")]. • x could be a vector of booleans. Every TRUE in x will lead to the element in vec corresponding to where the TRUE is located in x will be included, and every element in vec where x is FALSE will not be included. (While this implies that x must be the same length as vec, this is not necessarily true; if x is shorter, the contents of x will be recycled until R has made a decition for each element of vec whether to include it or not. I discuss recycling more later.) • vec[] will return the entire vector vec. This is sometimes useful. # You can assign names to elements using name = value notation in c() friendly_vector <- c(jon = 1, tony = 4, janet = 5) names(friendly_vector) ## [1] "jon" "tony" "janet" friendly_vector[1] ## jon ## 1 friendly_vector[c(2,3)] ## tony janet ## 4 5 friendly_vector[-2] ## jon janet ## 1 5 friendly_vector["tony"] ## tony ## 4 friendly_vector[c("jon", "tony")] ## jon tony ## 1 4 friendly_vector[c(TRUE, TRUE, FALSE)] ## jon tony ## 1 4 # You can rename elements in a vector like so: names(friendly_vector) <- c("jack", "jill", "dick") friendly_vector["jack"] ## jack ## 1 You can also use indexing to change the values of a vector, or even expand it. vec[x] <- val will replace the contents of vec at x with val if val is the same type as the rest of the data in vec and x is any valid means of indexing vec. x could consist of indices that do not already exist in vec, in which case those indices will be added to vec, thus expanding it. If x consists of integer indices outside the range of existing integer indices, these other indices will be created as well and filled with NA’s. You can also delete values from the vector with vec[x] <- NULL # friendly_vector was defined in an earlier code block # Change existing values friendly_vector["jack"] <- 16 friendly_vector[2] <- 19 friendly_vector ## jack jill dick ## 16 19 5 # Adding new values friendly_vector[4] <- 21 friendly_vector ## jack jill dick ## 16 19 5 21 friendly_vector["danielle"] <- 0 friendly_vector ## jack jill dick danielle ## 16 19 5 21 0 friendly_vector[c(6, 7, 8)] <- 12 friendly_vector ## jack jill dick danielle ## 16 19 5 21 0 12 12 12 friendly_vector[20] <- 18 While vectors must contain data of one type, there is a special type that can be included in any vector: NA. This value represents “missing information”. NA isn’t like other values and needs to be handled with care. The function is.na() identifies these values in vectors. not_finished <- c(1, 4, 5, NA, 2, 2) not_finished ## [1] 1 4 5 NA 2 2 # If I want to access the non-NA parts of the vector, I can do so like this not_finished[!is.na(not_finished)] ## [1] 1 4 5 2 2 There are other special types in R resembling but dfferent from NA. NULL is a lot like NA but usually means that something in R is unavailable (whereas NA is more akin to missing data in a dataset). Inf and -Inf are special values denoting infinity and negative infinity respectively. These are, in some sense, numeric, and represent values that are very large (or very small, in the case of -Inf), and can occur when dividing non-zero numbers by zero. NaN effectively means “not a number”, and occurs when some numerical error occurs, like dividing zero by zero. Again, these cases must be handled with care, and there are special functions like is.na() for detecting them in vectors. There are other ways to create vectors other than with the c() function. Some common ways are listed below: • The : constructor can be used to create sequences. For example, 1:10 will create a vector of numbers from 1 to 10, incrementing by 1. 10:0 creates a vector starting at 10 and ending at 0, decrementing by 1. You can also replace the endpoints with variables, or expressions in parentheses, like 1:n or 1:(2 + 2). • Sometimes you may want to create a sequence but want more control over incrementation, or how many elements in the vector you want. In this case, use the seq() function. You can make a sequence of numbers incrementing by 2 going from 1 to 100 with seq(1, 100, by = 2), or a sequence of numbers between 0 and 1 with length 100 with seq(0, 1, length = 100). (See help("seq") to see all the many ways to create sequences with seq().) • The rep() function can make vectors with repeating elements. Let’s say I want to repeat the character values “"a", "b", and "c" three times total. I can do so with rep(c("a", "b", "c"), times = 3). Alternatively, if I wanted to repeat these values each three times, I would do so with rep(c("a", "b", "c"), each = 3). (See help("rep") to see all the many ways to create sequences with rep().) • Sometimes I want to create character vectors where I have pasted together strings from separate vectors. For example, if I want a character vector containing the names of one hundred treatments, it may be tedious to type c("Treatment 1", "Treatment 2", ..., "Treatment 100"). The paste() function makes this much easier. I can create such a vector with paste("Treatment", 1:100); each of the elements from both vectors will be pasted together into a new vector. By default, these elements will be separated with a space character, but I can change this behavior by specifying the sep parameter in paste(). For example, if I want "Treatment_1" rather than "Treatment 1", I can do so with paste("Treatment", 1:100, sep = "_"). I demonstrate these techniques below. # Create a vector of numbers from 1 to 10 1:10 ## [1] 1 2 3 4 5 6 7 8 9 10 # In reverse 10:1 ## [1] 10 9 8 7 6 5 4 3 2 1 # Getting creative 1:(2 + 2) ## [1] 1 2 3 4 # Another way to make a sequence seq(-1, 1, by = 0.5) ## [1] -1.0 -0.5 0.0 0.5 1.0 # Yet another way to make a sequence seq(0, 20, length = 3) ## [1] 0 10 20 # A repeating sequence of letters rep(c("a", "b", "c"), times = 3) ## [1] "a" "b" "c" "a" "b" "c" "a" "b" "c" # A sequence of repeating letters rep(c("a", "b", "c"), each = 3) ## [1] "a" "a" "a" "b" "b" "b" "c" "c" "c" # A quick way to make a character vector paste("Treatment", 1:5) ## [1] "Treatment 1" "Treatment 2" "Treatment 3" "Treatment 4" "Treatment 5" You can do mathematical operations with vectors, such as +, -, *, /, ^, and others. Operations are often applied component-wise, using R’s recycling behavior. Recycling occurs when two vectors are not of equal length. Let’s say you have a vector vec1 that is longer than vec2, and you do an operation such as vec1 + vec2. Let’s say length(vec1) == 12 and length(vec2) == 4. At first, the resulting vector will add, component-wise, each element from each vector; think vec1[1] + vec2[1], vec1[2] + vec2[2], vec1[3] + vec2[3], and vec1[4] + vec2[4]. After the fourth index, though, there are no more elements in vec2, so R will start from the beginning of vec2 and keep going, resulting in vec1[5] + vec2[1], vec1[6] + vec2[2], and so on. R will continue to do this until it has reached the end of vec1. You would get the same result for vec2 + vec1; it doesn’t matter which side of + has the longer vector. If the longer vector is not a multiple of the shorter vector, though, R will throw a warning message. vec1 <- c(0, 0, 10, -1, 5, 6) # R treats "1" as a vector of length 1, and thus recycles vec1 + 1 ## [1] 1 1 11 0 6 7 vec1 * 2 ## [1] 0 0 20 -2 10 12 vec1 ^ 2 ## [1] 0 0 100 1 25 36 vec2 <- 1:2 # Another case of recycling behavior vec1 + vec2 ## [1] 1 2 11 1 6 8 # Same result if I switch the order vec2 + vec1 ## [1] 1 2 11 1 6 8 # R will complain if the length of the longer vector is not a multiple of the length of the shorter object, though it will still produce a result vec1 / 1:4 ## Warning in vec1/1:4: longer object length is not a multiple of shorter ## object length ## [1] 0.000000 0.000000 3.333333 -0.250000 5.000000 3.000000 vec1 ^ vec2 ## [1] 0 0 10 1 5 36 # Does NOT get same result because ^ is not commutative vec2 ^ vec1 ## [1] 1.0 1.0 1.0 0.5 1.0 64.0 Of course, when working with vectors of the same length, recycling isn’t a concern. # These vectors are the same length, so no need to worry about recycling vec1 <- 1:3 vec2 <- 10:12 vec1 + vec2 ## [1] 11 13 15 vec1 * vec2 ## [1] 10 22 36 vec1 ^ vec2 ## [1] 1 2048 531441 # That said, it's not hard to guess what will happen when one of the objects is length one (like a vector) (1:10) ^ 2 # First ten squares ## [1] 1 4 9 16 25 36 49 64 81 100 2 ^ (1:10) # First ten powers of two ## [1] 2 4 8 16 32 64 128 256 512 1024 Recycling occurs elsewhere in R. We saw recycling behavior when indexing a vector with a vector of booleans earlier. There are other instances in R as well. Some functions are vector-valued. For example, when passed a vector, sqrt() will take the square root of every element in the vector. vec <- 1:5 sqrt(vec) ## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 exp(vec) ## [1] 2.718282 7.389056 20.085537 54.598150 148.413159 vec[3] <- NA # Creates a vector of booleans is.na(vec) ## [1] FALSE FALSE TRUE FALSE FALSE Another important set of operators are logical operators, which return boolean data. Such operators include: • ==: This detects equality (notice that this is not =, which is an assignment operator). If or when the object on the left equals the object on the right, the result will be TRUE; otherwise, it will be FALSE. For vectors, this does not return whether the two vectors are identical, but when one vector is equal to the other component-wise. • < and >: These detects “less” or “greater than”, like in mathematics. This is intended for numeric data, but can be used for other types of data as well (although rarely, and probably not well). • <= and >=: These detect “less than or equal to” or “greater than or equal to”. • !=: This detects “not equal to”, and is the opposite of ==. • &: This is logical “and”, and is true when the boolean or statment on the left is true and the boolean or statement on the right is true. Thus, (1 == 1) & (2 >= 1) == TRUE and (1 == 2) & (2 >= 1) == FALSE. • |: This is logical “or”, and is true when the boolean or statement on the left is true or the boolean or statement on the right is true. Thus, (1 == 1) | (2 >= 1) == TRUE and (1 == 2) | (2 >= 1) == TRUE. • !: This is logical “not”, negating any truth statement. So !TRUE == FALSE and !(1 == 2) == TRUE. • %in%: This logical operator is unique compared to the others considered here, since this is actually a function. The argument on the right-hand side of this operator must be a vector, and this operator whether elements on the left-hand side are in the vector on the right-hand side (in logic, this is $$x \in S$$ where $$S$$ is a set). Thus, 3 %in% c(1,2,3) == TRUE and 3 %in% c(1, 2) == FALSE. Like other operators, these utilize recycling and may return vectors. Examples are shown below. vec1 <- c(1, 4, 21, 22, -5) vec2 <- c(1, 2, 3, 4, -5) # True only in the first and last positions vec1 == vec2 ## [1] TRUE FALSE FALSE FALSE TRUE vec1 < 4 # True for first and last elements ## [1] TRUE FALSE FALSE FALSE TRUE vec2 <= 4 # True in first, second, and last elements ## [1] TRUE TRUE TRUE TRUE TRUE !(vec1 <= 4) # Inverse of above ## [1] FALSE FALSE TRUE TRUE FALSE # %% is the modulus operator, returning the remainder when the number on the # left is divided by the number on the right. vec1 %% 2 == 0 is a way to # detect even numbers (since the remainder when dividing an even number by 2 # must be zero). (vec1 > 20) & !(vec1%%2 == 0) # Only true in third position ## [1] FALSE FALSE TRUE FALSE FALSE (vec1 > 20) | !(vec1%%2 == 0) # Not true in second or fourth ## [1] TRUE FALSE TRUE TRUE TRUE 1:4 %in% vec1 # One and four are in vec1 ## [1] TRUE FALSE FALSE TRUE # Some useful functions are the any() and all() functions, which take # boolean vectors as arguments and return whether anywhere the vector is # true or whether the vector is true everywhere, respectively. In other # words, any() will 'or' all elements of the vector, and all() will 'and' # all elements in the vector. any(1:4 %in% vec1) # Is there a number between 1 and 4 in vec1? ## [1] TRUE all(1:4 %in% vec1) # Are all numbers between 1 and 4 in vec1? ## [1] FALSE Many other functions return logical data, notably the is.object() family of functions like is.vector() or is.na(). While a vector of booleans for when a condition is true is nice, sometimes you may not want whether a statement is true for each element of a vector, but for which elements of a vector a statement is true. In this case, the which() function will tell you the indices of where an input vector is TRUE. which(c(TRUE, FALSE, FALSE, TRUE, FALSE)) ## [1] 1 4 which(15:25 > 20) ## [1] 7 8 9 10 11 # I'm going to create a dataset with NA's, and use which() to find the NA's # and also the 'good' data data_vec <- 1:100 data_vec[c(21, 33, 49, 61, 62)] <- NA which(is.na(data_vec)) # Where are the NA's ## [1] 21 33 49 61 62 which(!is.na(data_vec)) # Where is the non-NA data? ## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 ## [18] 18 19 20 22 23 24 25 26 27 28 29 30 31 32 34 35 36 ## [35] 37 38 39 40 41 42 43 44 45 46 47 48 50 51 52 53 54 ## [52] 55 56 57 58 59 60 63 64 65 66 67 68 69 70 71 72 73 ## [69] 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 ## [86] 91 92 93 94 95 96 97 98 99 100 # How much data is NA? length(which(is.na(data_vec))) ## [1] 5 # Alternatively, I can sum a boolean vector to find the same answer (FALSE # is treated as 0 and TRUE as 1) sum(is.na(data_vec)) ## [1] 5 A special type of vector not discussed earlier is a factor vector, which is similar to a character vector but requires that each value of the vector be in a list of levels, which describe the values a factor vector can take. (R also views vactors differently internally than character vectors.) Factor vectors are thus used for storing categorical data. You can create a factor with the factor() function. When manipulating factor data, there is an additional restriction to the ones discussed already: you cannot add a value that is not in the levels of the factor (aside from NA). You would have to add the value to the levels of the factor first, then add the value. Also, beware that changing the levels will also change the values stored in the factor. You can set the levels of a factor with levels(x) <- y, where x is the factor vector and y a vector representing the new levels of the factor. # Create a factor of color data color_data <- c("blue", "red", "blue", "blue", "blue", "red", "red", "blue") # Create the factor, declaring the levels; notice that I included a category # that is not in the data vector colcat1 <- factor(color_data, levels = c("red", "green", "blue")) colcat1 ## [1] blue red blue blue blue red red blue ## Levels: red green blue # If I do not declare the levels, R will use the values stored in the vector # to guess what they are colcat2 <- factor(color_data) # Now I change data; for the first, no complaint if I add 'green' colcat1[1] <- "green" colcat1 ## [1] green red blue blue blue red red blue ## Levels: red green blue # But 'green' is not in the levels of colcat2, so an warning is issued and # NA is added instead colcat2[1] <- "green" ## Warning in [<-.factor(*tmp*, 1, value = "green"): invalid factor level, ## NA generated colcat2 ## [1] <NA> red blue blue blue red red blue ## Levels: blue red # I can see the levels of these with levels() levels(colcat1) ## [1] "red" "green" "blue" levels(colcat2) ## [1] "blue" "red" # I can rearrange all the categories by changing the levels levels(colcat1) <- c("blue", "red", "green") colcat1 ## [1] red blue green green green blue blue green ## Levels: blue red green # Here I add a new level to those specified for colcat2 levels(colcat2)[3] <- "green" # Now I can add 'green' to colcat2 without complaint colcat2[1] <- "green" colcat2 ## [1] green red blue blue blue red red blue ## Levels: blue red green Functions Functions are one of the most important objects in R. In fact, R follows a programming paradigm called functional programming, where most operations are seen as the evaluation of functions. You can think of functions as miniature programs for performing certain tasks. In R, functions have two parts: • Arguments are the values passed to the function for evaluation. In the statement f(x, y), the variables between the parentheses, x and y, are the arguments of the function. A function will typically expect an argument to be of a particular type and may throw an error when that type is not received, but the expected type could be anything from a vector to a data frame to even another function. Some functions have many arguments but have default values assigned to most of them so that the user specifies those arguments only if they wish to change the function’s default behavior. • The body of a function is the part of the function that performs computations (involving the parameters) and possibly returning an output. The output of the function is either the last value computed (and not assigned to a variable) or a value returned by the return() function. To create a named function, use the syntax my_func <- function(arguments) { function body }. Below I create a function. # Let's create a function that returns the length of the longest vector # passed to it. It will take two vectors x and y as arguments, and return # the length of the longer vector longest_length <- function(x, y) { l1 <- length(x) # This is a local variable that will be visible only in the function, not the rest of R l2 <- length(y) max(l1, l2) # This is the last unevaluated computation and thus is returned by the function } # Note that longest_length is a variable storing a function, and # longest_length() is a function call # Testing it out vec1 <- c("bob", "jim", "margaret", "danny") vec2 <- c(22, -9) longest_length(vec1, vec2) ## [1] 4 # Functions will check the position of objects passed and assign them to the # arguments in the same position in the function definition. Alternatively # (and very useful when there are many arguments not all of which are # specified), you can use = to set specific arguments by name. longest_length(y = vec2, x = vec1) ## [1] 4 Functions are extremely important objects in R and the key to R programming. Appendix A of the Verzani textbook discusses function programming in more detail. Visually Exploring a Dataset R has many techniques built-in for visually analyzing datasets, and many packages that do so even better, such as the very popular ggplot2 package (I used ggplot2 for all the graphics for my first report on Utah’s gender gap in wages written for Voices for Utah Children, which you can read here). Here I will discuss how to make a few basic graphics for visually analyzing a dataset. Stem-and-leaf plot Use the stem() function to create a stem-and-leaf plot. stem(rivers) ## ## The decimal point is 2 digit(s) to the right of the | ## ## 0 | 4 ## 2 | 011223334555566667778888899900001111223333344455555666688888999 ## 4 | 111222333445566779001233344567 ## 6 | 000112233578012234468 ## 8 | 045790018 ## 10 | 04507 ## 12 | 1471 ## 14 | 56 ## 16 | 7 ## 18 | 9 ## 20 | ## 22 | 25 ## 24 | 3 ## 26 | ## 28 | ## 30 | ## 32 | ## 34 | ## 36 | 1 The display suggests that the rivers dataset is very right-skewed. Let’s see if this agrees with other plots. Dotplot You can use stripchart() to make dotplots. The default behavior of this function doesn’t produce a very useful plot (truth be told, plotting functions in other packages make better dotplots in general than stripchart()), so set method = "stack" to enable stacking. I also set pch = 20 to change the plotting character used from the default square to a filled-in circle, which is easier to read. Many R plotting functions have parameters main, xlab, and ylab. These are so you set the title of the plot, and the names of the labels. I do so in the plot I create as well. # Make a dotplot of the rivers data stripchart(rivers, method="stack", pch = 20, # Adding axis labels main = "Length of North American Rivers", xlab = "Length") The rivers dataset is slightly large, with 141 observations. The plots used so far may not result in very good graphics for datasets like this. I next consider graphics that may handle larger datasets better. Histogram The hist() function creates histograms in R. Calling hist(x) for some dataset x will create a histogram R thinks is appropriate. R will automatically choose classes and the number of classes to used based on built-in algorithms. I show an example below: hist(rivers) We can change the parameters of the hist() function to have more control over the result. For example: • R automatically chooses axis names and the main title, which usually are not very good names. We can change these defaults to reasonable labels by setting the main, xlab, and ylab parameters. • By default, R will plot the frequency rather than the relative frequency of each class. The result is indistinguishable until you wish to overlay a histogram with a smooth curve or otherwise be more creative with the chart. Set freq = FALSE to show relative frequencies, or probabilities. • R creates left-inclusive histograms by default. If we want right-inclusive histograms, set right = TRUE. • The breaks parameters is used for setting where the break points are located. If we set breaks to an integer, this will tell R how many classes to use. For example, if we want $$\sqrt{n}$$ classes for the rivers dataset, we can do so by setting breaks = round(sqrt(length(x))). (That said, the method R uses for determining how many classes to use is usually better than the $$\sqrt{n}$$ rule.) • Do we really like white bars in our histogram? The col parameter can change their color. For example, set col = "gray" for gray bars. There are many more parameters for the hist() function than this; I invite you to read the function’s documentation for more details. Here is another histogram for the rivers dataset, this one changing some parameters. hist(rivers, main = "Distribution of the Lengths of Major North American Rivers", xlab = "Length of River", freq = FALSE, right = TRUE, breaks = round(sqrt(length(rivers))), # Using the sqrt(n) rule col = "gray") Density Curve A density curve is another way to view a distribution where the end result is a smooth curve, It can be interpreted like a histogram, but it avoids disadvantages that come with choosing discrete classes. You can create a density curve with plot(density(x)). plot(density(rivers), main = "Distribution of Lengths of Rivers") # It is possible to superimpose a density plot on top of a histogram to see the relationship. Just be sure that the histogram is showing relative frequencies rather than frequencies. Here is an example. hist(rivers, main = "Distribution of the Lengths of Major North American Rivers", xlab = "Length of River", freq = FALSE, right = TRUE, breaks = round(sqrt(length(rivers))), # Using the sqrt(n) rule col = "gray", ylim = c(0, 0.0025)) # ylim sets the limits of the vertical axis lines(density(rivers)) Boxplot You can create a boxplot in R with the boxplot() function. boxplot(rivers) One of the major advantages of boxplots is being able to compare distributions of different samples. To do so, issue a call to boxplot() of the form boxplot(x ~ y), where x is a data vector with all samples combined, and y is a character or factor vector identifying the sample each element of x belongs to. # I will be working with the iris dataset for this example # Looking at the data str(iris) ## 'data.frame': 150 obs. of 5 variables: ##$ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... ##$ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... ##$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
iris$Sepal.Length ## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 ## [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 ## [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 ## [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 ## [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 ## [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 ## [103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 ## [120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 ## [137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9 iris$Species
##   [1] setosa     setosa     setosa     setosa     setosa     setosa
##   [7] setosa     setosa     setosa     setosa     setosa     setosa
##  [13] setosa     setosa     setosa     setosa     setosa     setosa
##  [19] setosa     setosa     setosa     setosa     setosa     setosa
##  [25] setosa     setosa     setosa     setosa     setosa     setosa
##  [31] setosa     setosa     setosa     setosa     setosa     setosa
##  [37] setosa     setosa     setosa     setosa     setosa     setosa
##  [43] setosa     setosa     setosa     setosa     setosa     setosa
##  [49] setosa     setosa     versicolor versicolor versicolor versicolor
##  [55] versicolor versicolor versicolor versicolor versicolor versicolor
##  [61] versicolor versicolor versicolor versicolor versicolor versicolor
##  [67] versicolor versicolor versicolor versicolor versicolor versicolor
##  [73] versicolor versicolor versicolor versicolor versicolor versicolor
##  [79] versicolor versicolor versicolor versicolor versicolor versicolor
##  [85] versicolor versicolor versicolor versicolor versicolor versicolor
##  [91] versicolor versicolor versicolor versicolor versicolor versicolor
##  [97] versicolor versicolor versicolor versicolor virginica  virginica
## [103] virginica  virginica  virginica  virginica  virginica  virginica
## [109] virginica  virginica  virginica  virginica  virginica  virginica
## [115] virginica  virginica  virginica  virginica  virginica  virginica
## [121] virginica  virginica  virginica  virginica  virginica  virginica
## [127] virginica  virginica  virginica  virginica  virginica  virginica
## [133] virginica  virginica  virginica  virginica  virginica  virginica
## [139] virginica  virginica  virginica  virginica  virginica  virginica
## [145] virginica  virginica  virginica  virginica  virginica  virginica
## Levels: setosa versicolor virginica
boxplot(iris$Sepal.Length ~ iris$Species)

Bar Chart

For a categorical dataset, we can create a bar chart using the barchart() function. This is a natural way to visualize categorical data.

# Shows the cloudiness of different days in Central Park
barchart(central.park.cloud)

Numerical Summaries

A numerical summary tries to describe some aspect of a dataset using numbers. Two classes of numerical summaries include measures of location and measures of spread. There are many other numerical summaries (the textbook used in this lab describes measures of skewness and kurtosis in addition to measures of location and spread), but we will focus on these two.

First, a great way to obtain appropriate numerical summaries for a dataset is with the summary() function. This is a generic function that changes its behavior depending on the object passed to it. If x is a numeric vector, summary(x) will compute the five-number summary of x (minimum, first quartile, median, third quartile, maximum) in addition to the mean of x. On the other hand, if x is a factor or character vector, summary(x) will compute the frequency of each category in x. Thus, if you have a dataset you are unfamiliar with, summary() is a good way to see some basic information about it.

# Some basic information about rivers, a numeric dataset
summary(rivers)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   135.0   310.0   425.0   591.2   680.0  3710.0
# Frequencies for cloud conditions in central.park.cloud, a factor vector
summary(central.park.cloud)
##         clear partly.cloudy        cloudy
##            11            11             9

Of course it is possible to compute specific numeric summaries.

The mean of a dataset is defined as:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

where $$x_i$$ is the $$i^{\text{th}}$$ observation and $$n$$ the size of the dataset. The mean can be computed in R using the mean() function. To obtain the trimmed mean, you can set the trim parameter to a value between 0 and 0.5; this sets how much of the data to “trim” from either end of the ordered dataset. If trim = 0 (the default), no data is trimmed from either end, but if trim = 0.5, as much data is possible is trimmed from either end and the median is computed.

# The mean executive pay (in \$10,000's)
mean(exec.pay)
## [1] 59.88945
# There are very large outliers, though; what happens to the mean if we trim
# 10% of the data from either side of the (ordered) dataset?
mean(exec.pay, trim = 0.1)
## [1] 29.96894

The median is the number that splits the dataset in half after being ordered. You can compute the median in R using the median() function.

# The median executive pay; compare to the mean or trimmed mean
median(exec.pay)
## [1] 27

The $$(100\alpha)^{\text{th}}$$ percentile is the number such that $$100\alpha\%$$ of the dataset lies below and $$100(1-\alpha)\%$$ above the number. The quantile() function in R computes percentiles (also referred to as quantiles). quantile(x) will effectively find the five-number summary of the dataset x, including the first and third quartiles. Alternatively, one can call quantile(x, p), where p is a vector (or perhaps just a number that will be interpreted as a vector) specifying with percentiles are wanted (these are numbers between 0 and 1).

The fivenum() function will find five-number summaries outright, but one may as well call quantile() with the default parameters (the presentation is better anyway).

# The five-number summary using fivenum
fivenum(exec.pay)
## [1]    0.0   14.0   27.0   41.5 2510.0
# The same information using quantile
quantile(exec.pay)
##     0%    25%    50%    75%   100%
##    0.0   14.0   27.0   41.5 2510.0
# What is the 99th percentile of executive pay?
quantile(exec.pay, .99)
##    99%
## 906.62
# The deciles of exec.pay, breaking the dataset into 10 parts
quantile(exec.pay, seq(0, 1, by = .1))
##     0%    10%    20%    30%    40%    50%    60%    70%    80%    90%
##    0.0    9.0   12.6   16.0   22.0   27.0   31.0   38.0   48.0   91.4
##   100%
## 2510.0

Now let’s discuss measures of spread. The range of a dataset is the difference between the largest and smallest values:

$\text{range} = \max_i x_i - \min_i x_i$

R’s range() function will find the maximum and minimum of the dataset, but won’t difference them. This is not a problem, though; simply call diff() on the result of range to subtract the minimum from the maximum, like diff(range(x)), where x is the dataset.

# The largest and smallest executive pay
range(exec.pay)
## [1]    0 2510
# The range
diff(range(exec.pay))
## [1] 2510

Another (more common) means for numerically describing the spread of a dataset is with the standard deviation or its square, the variance. The variance is defined as follows:

$s^2 = \frac{1}{n - 1}\sum_{i = 1}^{n}(x_i - \bar{x})^2$

The standard deviation is merely the square root of the variance.

$s = \sqrt{s^2}$

In R, the function var() finds a dataset’s variance, and sd() finds the standard deviation.

# The variance of exec.pay
var(exec.pay)
## [1] 42867.03
# We could take the square root of the variance to get the standard
# deviation
sqrt(var(exec.pay))
## [1] 207.0435
# Or we could just use sd
sd(exec.pay)
## [1] 207.0435

For categorical data, the simplest way to numerically summarize the data is with a table. We can create one with the table() function

table(central.park.cloud)
## central.park.cloud
##         clear partly.cloudy        cloudy
##            11            11             9
# Create a frequency table by dividing a table by the sample size (i.e. the
# length of the data vector)
table(central.park.cloud)/length(central.park.cloud)
## central.park.cloud
##         clear partly.cloudy        cloudy
##     0.3548387     0.3548387     0.2903226