Lecture 3

R Data Structures

As mentioned before, vectors are not the only data structure in R. R has several other important data structures, including lists, matrices, and data frames.

Lists

A list is a generalized vector in R. An R vector requires that all data stored in it be of the same type; a list has no such requirement. You can easily create lists with numbers, strings, vectors, functions, and other lists all in one object. Lists in R are created with the list() function, with the elements separated by commas. Note that list() doesn’t flatten vectors like c() does; every item separated by a comma gets its own index in the list.

# Let's make a list of mixed type!
l1 <- list(1, "fraggle rock", c("henry", "margaret", "donna"), list(1:2, paste("Test", 
    1:10)))
l1
## [[1]]
## [1] 1
## 
## [[2]]
## [1] "fraggle rock"
## 
## [[3]]
## [1] "henry"    "margaret" "donna"   
## 
## [[4]]
## [[4]][[1]]
## [1] 1 2
## 
## [[4]][[2]]
##  [1] "Test 1"  "Test 2"  "Test 3"  "Test 4"  "Test 5"  "Test 6"  "Test 7" 
##  [8] "Test 8"  "Test 9"  "Test 10"
# This list has no names for its elements; we could specify some using
# names()
names(l1) <- c("num", "char", "vec", "inner_list")
# We can also assign names when we create the list
l2 <- list(char = "monday", vec = c("and", "but", "or"))
l2
## $char
## [1] "monday"
## 
## $vec
## [1] "and" "but" "or"
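The remark above that list() does not flatten its arguments the way c() does can be checked with a small sketch. The vectors a and b are hypothetical and exist only for this illustration.

```r
# Hypothetical inputs, used only for illustration
a <- c(1, 2)
b <- c(3, 4, 5)

flat   <- c(a, b)     # c() concatenates everything into one vector
nested <- list(a, b)  # list() keeps each argument as its own element

length(flat)    # five numbers in a single vector
length(nested)  # two elements, each one a vector
nested[[2]]     # the second element is still the whole vector b
```

The lengths differ because c() merges its arguments while list() stores each one intact.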

How do we reference the objects stored in a list? We have a few options:

  • If we want the object returned by the reference to also be a list, we can use single-bracket notation like we did with vectors, like l1[x] where x is any means for selecting elements of the list (number, string, vector, boolean vector, etc.).

  • If we want the object stored at x, we can use double-bracket notation, like l1[[x]] where x is a single number or string (x cannot select several elements at once in this case). The difference between l1[x] and l1[[x]] may be subtle, but it’s very important: l1[x] is a list, and l1[[x]] is an object stored in a list. (Both notations also work on vectors; vec[x] is a vector, and vec[[x]] is a single element of it. Rarely does this make a difference, since a single element of a vector is itself a vector of length one, but sometimes it does; for example, vec[[x]] drops the element’s name.)

  • If the elements of the list are named, instead of referencing them with l1[["x"]] (x is the name of the element), we can use $ notation, like l1$x. This is usually how named elements are referenced.

# This is a list
l1[1:3]
## $num
## [1] 1
## 
## $char
## [1] "fraggle rock"
## 
## $vec
## [1] "henry"    "margaret" "donna"
is.list(l1[1:3])
## [1] TRUE
# This is the item stored in the third position of the list
l1[[3]]
## [1] "henry"    "margaret" "donna"
# This is not a list
is.list(l1[[3]])
## [1] FALSE
# Notice the difference
l1[3]
## $vec
## [1] "henry"    "margaret" "donna"
# We can also reference by name
l2["vec"]
## $vec
## [1] "and" "but" "or"
l2[["vec"]]
## [1] "and" "but" "or"
# An alternative way to reference the contents of an element by name
l2$vec
## [1] "and" "but" "or"
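The three notations can also be chained to reach into nested lists, and used on the left-hand side of an assignment. The following is a minimal sketch using a hypothetical list named course, not one of the objects defined above.

```r
# A hypothetical list with a nested list inside it
course <- list(title = "R Data Structures",
               weeks = 1:3,
               staff = list(lecturer = "Ada", assistants = c("Ben", "Cy")))

weeks_list <- course["weeks"]    # single brackets: a list of length 1
weeks_vec  <- course[["weeks"]]  # double brackets: the integer vector itself

# $ and brackets can be chained to reach an item inside the nested list
second_assistant <- course$staff$assistants[2]

# Double brackets (or $) on the left of <- replace the stored object
course[["weeks"]] <- 1:4
length(course$weeks)
```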

More complex objects in R are often simply lists with a specific structure, thus making lists very important.
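As one illustration of this point, the object returned by lm() is a list with a class attribute. The sketch below fits a model to the built-in cars data set; the choice of model and data is an assumption made only for demonstration.

```r
# Fit a simple linear model to the built-in 'cars' data set
fit <- lm(dist ~ speed, data = cars)

is.list(fit)      # the fitted model is stored as a list
class(fit)        # the class attribute tells R how to treat that list
head(names(fit))  # its elements are named, so $ notation works
fit$coefficients  # extract one element by name
```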

Matrices

An R matrix is much like an R vector (in fact, internally they are the same, with matrices having an additional attribute for their dimensions). A matrix is two-dimensional, with a row and a column dimension. Like vectors, matrices only allow data of a single type. There are a few ways to make matrices in R:

  • The rbind() function takes an arbitrary number of vectors as inputs (all of equal length), and creates a matrix where each input vector is a row of the matrix. cbind() is exactly like rbind() except that the vectors become columns rather than rows.
  • The matrix() function takes a single vector input and turns that vector into a matrix. You can set either the nrow parameter or the ncol parameter to the number of rows or columns, respectively, that you want your matrix to have (it is not necessary to specify both, though it is not illegal either, so long as the product of the dimensions equals the length of the input vector). By default, R will fill the matrix by column; this means that it will fill the first column with the first elements of your input vector in sequence, then the next column with the remaining elements, and so on until the matrix is filled and the contents of the input vector are “exhausted.” Setting byrow = TRUE changes this behavior, and R will fill the matrix by rows rather than columns.

Both the rows and the columns of a matrix can be named, though you don’t use the names() function for seeing or changing these names. Instead, use the rownames() or colnames() function for accessing or modifying the row names and column names, respectively.

You can get the dimensions of a matrix with the dim() function. nrow() returns the number of rows of a matrix, and ncol() the number of columns. length() returns the number of elements in the matrix (so the product of the dimensions).

# Using rbind to make a matrix
mat1 <- rbind(c("jim bridger", "meadowbrook", "elwood"), c("copper hills", "kearns", 
    "west jordan"), c("university of utah", "byu", "westminster"), c("slcc", 
    "snow", "suu"))
# Likewise with cbind
mat2 <- cbind(c("jim bridger", "meadowbrook", "elwood"), c("copper hills", "kearns", 
    "west jordan"), c("university of utah", "byu", "westminster"), c("slcc", 
    "snow", "suu"))
mat1
##      [,1]                 [,2]          [,3]         
## [1,] "jim bridger"        "meadowbrook" "elwood"     
## [2,] "copper hills"       "kearns"      "west jordan"
## [3,] "university of utah" "byu"         "westminster"
## [4,] "slcc"               "snow"        "suu"
mat2
##      [,1]          [,2]           [,3]                 [,4]  
## [1,] "jim bridger" "copper hills" "university of utah" "slcc"
## [2,] "meadowbrook" "kearns"       "byu"                "snow"
## [3,] "elwood"      "west jordan"  "westminster"        "suu"
dim(mat1)  # The dimensions of mat1
## [1] 4 3
nrow(mat1)  # The number of rows of mat1
## [1] 4
ncol(mat1)  # The number of columns of mat1
## [1] 3
length(mat1)  # The number of elements stored in mat1
## [1] 12
# Using matrix()
mat3 <- matrix(1:10, nrow = 2)
mat4 <- matrix(1:10, nrow = 2, byrow = TRUE)  # fill by row instead of by column
mat3
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
mat4
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    6    7    8    9   10
# Naming matrix dimensions
rownames(mat3) <- c("odds", "evens")
colnames(mat3) <- c("first", "second", "third", "fourth", "fifth")
mat3
##       first second third fourth fifth
## odds      1      3     5      7     9
## evens     2      4     6      8    10
# Internally, matrices are glorified vectors
as.vector(mat1)
##  [1] "jim bridger"        "copper hills"       "university of utah"
##  [4] "slcc"               "meadowbrook"        "kearns"            
##  [7] "byu"                "snow"               "elwood"            
## [10] "west jordan"        "westminster"        "suu"
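The claim that a matrix is a vector with a dimension attribute can be seen from the other direction as well: assigning dimensions to a plain vector turns it into a matrix. A minimal sketch, with v as a hypothetical vector:

```r
v <- 1:6
was_matrix <- is.matrix(v)  # a plain vector is not a matrix

dim(v) <- c(2, 3)           # attach a dimension attribute: 2 rows, 3 columns
now_matrix <- is.matrix(v)  # the same data is now treated as a matrix

attributes(v)               # the only attribute added is $dim
v
```

The values are laid out by column, matching the default fill order of matrix().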

You can access the elements of a matrix with mat[x], where x is a vector. This treats the matrix mat like a vector, with its elements ordered column by column. Sometimes this is the behavior you want, but most of the time you probably wish to access the data using the matrix’s rows and columns (otherwise you would have made a vector).
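A short sketch of this vector-style access, using a hypothetical matrix m. The elements are counted down the first column, then down the second, and so on.

```r
m <- matrix(11:16, nrow = 2)  # columns are (11, 12), (13, 14), (15, 16)

third  <- m[3]        # third element in column order: row 1, column 2
ends   <- m[c(1, 6)]  # first and last elements
larger <- m[m > 13]   # a logical index also treats m like a vector

third
ends
larger
```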

R uses the notation [,] for referencing elements in a matrix. Thus you can reference objects in a matrix with mat[x,y], where x is a vector specifying the desired rows, and y a vector specifying the desired columns. All the rules for referencing elements of a vector apply to x and y, with the additional rule that leaving a dimension blank will lead to everything in that dimension being included. Thus, mat[,y] contains all the rows of mat and the columns determined by y, and mat[x,] contains all the columns of mat and the rows determined by x. The result is a matrix unless only a single row or column is selected, in which case R returns a vector by default.

# Get the (1,2) entry of mat1
mat1[1, 2]
## [1] "meadowbrook"
# The first row of mat1; notice that this is a vector
mat1[1, ]
## [1] "jim bridger" "meadowbrook" "elwood"
# The second column of mat1; notice that this is also a vector
mat1[, 2]
## [1] "meadowbrook" "kearns"      "byu"         "snow"
# We can preserve the matrix structure (in other words, not turn the result
# into a vector) by adding an additional comma and specifying the option
# drop=FALSE
mat1[1, , drop = FALSE]
##      [,1]          [,2]          [,3]    
## [1,] "jim bridger" "meadowbrook" "elwood"
mat1[, 2, drop = FALSE]
##      [,1]         
## [1,] "meadowbrook"
## [2,] "kearns"     
## [3,] "byu"        
## [4,] "snow"
# A small 2x3 submatrix of mat1
mat1[1:2, 1:3]
##      [,1]           [,2]          [,3]         
## [1,] "jim bridger"  "meadowbrook" "elwood"     
## [2,] "copper hills" "kearns"      "west jordan"
# The third odd number in 1 to 10
mat3["odds", "third"]
## [1] 5
# The first and third even numbers in 1 to 10
mat3["evens", c("first", "third")]
## first third 
##     2     6
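Since the rules for vector indexing carry over to each dimension, negative and logical indices work for rows and columns too. The sketch below uses a hypothetical matrix named scores with invented values.

```r
# Hypothetical exam scores; the matrix is filled by column
scores <- matrix(c(90, 72, 85, 60, 78, 95), nrow = 3,
                 dimnames = list(c("ann", "bob", "cat"), c("midterm", "final")))

without_first <- scores[-1, ]                           # negative index drops row 1
high_midterm  <- scores[scores[, "midterm"] > 80, ]     # logical vector picks rows
finals        <- scores[c(TRUE, FALSE, TRUE), "final"]  # logical rows, named column

without_first
high_midterm
finals
```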

Matrices generalize to arrays, which can have more than two dimensions. For example, if arr is a three-dimensional array, we may access an element in it with arr[1, 4, 3]. We will not discuss arrays any further than this.
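For concreteness, here is a minimal sketch of such an access. The array arr and its dimensions (2 by 4 by 3) are assumptions chosen so that the index arr[1, 4, 3] is valid.

```r
# A hypothetical 2 x 4 x 3 array holding the integers 1 to 24
arr <- array(1:24, dim = c(2, 4, 3))

dim(arr)
elem  <- arr[1, 4, 3]  # row 1, column 4 of the third 2 x 4 slice
slice <- arr[, , 1]    # leaving dimensions blank keeps everything in them
elem
slice
```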

Data Frames

An R data frame stores data in a tabular format. Technically, a data frame is a list of vectors of equal length, so a data frame is a list. But since each “column” of the data frame has equal length, it also looks like a matrix where each column can differ in type (so one column could be numeric data, another character data, yet another factor data, etc.). Thus we can reference the data in a data frame like it is a list or like it is a matrix.
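The dual nature described above can be checked directly. This is a small sketch with a hypothetical data frame df; stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
# A hypothetical data frame with one integer and one character column
df <- data.frame(id = 1:3, name = c("x", "y", "z"), stringsAsFactors = FALSE)

is.list(df)        # a data frame is a list ...
length(df)         # ... whose elements are the columns
dim(df)            # but it also has matrix-like dimensions (rows, columns)
sapply(df, class)  # and each column keeps its own type
```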

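  • The matrix style of referencing data frame data is like df[x,y], where x is the rows of the data frame and y the columns. All the rules for using this notation with matrices apply to data frames. The result is usually another data frame, though, as with matrices, selecting a single column this way returns a vector unless drop = FALSE is specified.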

  • The list style for referencing a data frame references only the columns, not the rows. So df[x] will select the columns of df specified by x, and the result is another data frame. df[[x]] refers to the single column selected by x; this is a vector, not a data frame. More commonly, though, we refer to a column of a data frame with the dollar notation; rather than use df[["x"]], we use df$x to get the column vector x in df.
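Both referencing styles can be compared side by side. The sketch below uses a hypothetical data frame named pets with invented values.

```r
# A hypothetical data frame used only for illustration
pets <- data.frame(name = c("rex", "tom", "tweety"),
                   legs = c(4, 4, 2),
                   stringsAsFactors = FALSE)

# Matrix style: rows first, then columns
one_cell  <- pets[2, "name"]        # a single value
four_legs <- pets[pets$legs == 4, ] # all columns for the rows with four legs

# List style: columns only
legs_df  <- pets["legs"]    # single brackets give a one-column data frame
legs_vec <- pets[["legs"]]  # double brackets give the column vector
legs_dol <- pets$legs       # dollar notation gives the same vector
```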

To create a data frame, we have options:

  • We could use the data.frame() function, where each vector passed will become a column in the data frame.

  • We could use the as.data.frame() function on an object easily coerced into a data frame, like a matrix or a list.

Some examples are shown below.

# Making a data frame with data.frame
df1 <- data.frame(numbers = 1:5, letters = c("a", "b", "c", "d", "e"))
df1
##   numbers letters
## 1       1       a
## 2       2       b
## 3       3       c
## 4       4       d
## 5       5       e
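# Notice that the character vector was automatically made a factor vector!
# (This is the default in R versions before 4.0.0; from 4.0.0 onward,
# stringsAsFactors defaults to FALSE and the column stays character.)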
str(df1)
## 'data.frame':    5 obs. of  2 variables:
##  $ numbers: int  1 2 3 4 5
##  $ letters: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
colnames(mat2) <- c("elementary", "high school", "university", "local")
# Make a data frame out of a matrix. If we don't want to turn character
# strings into factors, set stringsAsFactors to FALSE (this also works in
# data.frame)
df2 <- as.data.frame(mat2, stringsAsFactors = FALSE)
df2
##    elementary  high school         university local
## 1 jim bridger copper hills university of utah  slcc
## 2 meadowbrook       kearns                byu  snow
## 3      elwood  west jordan        westminster   suu
str(df2)
## 'data.frame':    3 obs. of  4 variables:
##  $ elementary : chr  "jim bridger" "meadowbrook" "elwood"
##  $ high school: chr  "copper hills" "kearns" "west jordan"
##  $ university : chr  "university of utah" "byu" "westminster"
##  $ local      : chr  "slcc" "snow" "suu"
newlist <- list(first = c("Tamara", "Danielle", "John", "Kent"), last = c("Garvey", 
    "Wu", "Godfrey", "Morgan"))
# Making a data frame from a list
df3 <- as.data.frame(newlist, stringsAsFactors = FALSE)
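Because a data frame is a list of vectors of equal length, columns whose lengths cannot be lined up into rows are rejected. The following sketch uses hypothetical inputs and tryCatch() to capture the error rather than stop the script.

```r
# Columns of equal length combine without trouble
ok <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

# Lengths 3 and 2 cannot be lined up into rows, so data.frame() signals an
# error; tryCatch() returns the error object instead of halting
bad <- tryCatch(data.frame(x = 1:3, y = c("a", "b")),
                error = function(e) e)

dim(ok)
inherits(bad, "error")
```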