Lecture 2

R Data Structures

As mentioned before, vectors are hardly the only data structure in R. There are other very important data structures R uses.

Lists

A list is a generalized vector in R. An R vector requires that all data stored in the vector be of the same type. A list has no such requirement. You can easily create lists with numbers, strings, vectors, functions, and other lists all in one object. Lists in R are created with the list() function, where each element of the list is separated with a , (note that lists don’t flatten vectors like c() does; every item separated by a comma gets its own index in the list).

# Let's make a list of mixed type!
l1 <- list(1, "fraggle rock", c("henry", "margaret", "donna"), list(1:2, paste("Test", 
    1:10)))
l1
## [[1]]
## [1] 1
## 
## [[2]]
## [1] "fraggle rock"
## 
## [[3]]
## [1] "henry"    "margaret" "donna"   
## 
## [[4]]
## [[4]][[1]]
## [1] 1 2
## 
## [[4]][[2]]
##  [1] "Test 1"  "Test 2"  "Test 3"  "Test 4"  "Test 5"  "Test 6"  "Test 7" 
##  [8] "Test 8"  "Test 9"  "Test 10"
# This list has no names for its elements; we could specify some using
# names()
names(l1) <- c("num", "char", "vec", "inner_list")
# We can also assign names when we create the list
l2 <- list(char = "monday", vec = c("and", "but", "or"))
l2
## $char
## [1] "monday"
## 
## $vec
## [1] "and" "but" "or"
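
The remark above that list() does not flatten its arguments the way c() does can be checked directly. The following is a minimal sketch; the names flat and nested are chosen here only for illustration.

```r
# c() flattens its arguments into a single vector
flat <- c(1, c(2, 3), 4)
length(flat)
## [1] 4

# list() keeps each comma-separated argument as its own element
nested <- list(1, c(2, 3), 4)
length(nested)
## [1] 3
# The second element is still the whole vector c(2, 3)
nested[[2]]
## [1] 2 3
```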

How do we reference the objects stored in a list? We have a few options:

  • If we wish the object returned by the reference to also be a list, we can use single-bracket notation like we did with vectors, like l1[x], where x is any means for selecting elements of the list (number, string, vector, boolean vector, etc.).

  • If we want the object stored at x, we can use double-bracket notation, like l1[[x]], where x is a single number or string (not a vector selecting several elements). The difference between l1[x] and l1[[x]] may be subtle, but it’s very important: l1[x] is a list, and l1[[x]] is the object stored in that list. (This difference also holds for vectors; vec[x] is a vector, and vec[[x]] is a single element stored in the vector. Rarely does this make a difference, but sometimes it does, as with a collection of functions such as c(sum, mean), which R stores as a list, so only double brackets return a function that can be called.)

  • If the elements of the list are named, instead of referencing them with l1[["x"]] (x is the name of the element), we can use $ notation, like l1$x. This is usually how named elements are referenced.

# This is a list
l1[1:3]
## $num
## [1] 1
## 
## $char
## [1] "fraggle rock"
## 
## $vec
## [1] "henry"    "margaret" "donna"
is.list(l1[1:3])
## [1] TRUE
# This is the item stored in the third position of the list
l1[[3]]
## [1] "henry"    "margaret" "donna"
# This is not a list
is.list(l1[[3]])
## [1] FALSE
# Notice the difference
l1[3]
## $vec
## [1] "henry"    "margaret" "donna"
# We can also reference by name
l2["vec"]
## $vec
## [1] "and" "but" "or"
l2[["vec"]]
## [1] "and" "but" "or"
# An alternative way to reference the contents of an element by name
l2$vec
## [1] "and" "but" "or"

More complex objects in R are often simply lists with a specific structure, thus making lists very important.
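
As one illustration of this point, the object returned by lm() (R's linear model function, which this lecture does not otherwise cover) is a list with a class attribute, so the list tools above apply to it. This sketch uses the built-in mtcars data; the name fit is only for illustration.

```r
# Fit a simple linear model; the result is an object of class "lm"
fit <- lm(mpg ~ wt, data = mtcars)
class(fit)
## [1] "lm"
# Underneath, it is a list with named elements
is.list(fit)
## [1] TRUE
# Look at the first few element names
head(names(fit), 3)
# Elements are extracted with the usual list notation
fit$coefficients
fit[["coefficients"]]
```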

Matrices

An R matrix is much like an R vector (in fact, internally they are the same, with matrices having an additional attribute for dimension). A matrix is two-dimensional, with a row and column dimension. Like a vector, matrices only allow data of a single type. There are a few ways to make matrices in R:

  • The rbind() function takes an arbitrary number of vectors as inputs (all of equal length), and creates a matrix where each input vector is a row of the matrix. cbind() is exactly like rbind() except that the vectors become columns rather than rows.
  • The matrix() function takes a single vector input and turns that vector into a matrix. You can set either the nrow parameter or the ncol parameter to the number of rows or columns respectively that you desire your matrix to have (it is not necessary to specify both, though not illegal either so long as the product of the dimensions equals the length of the input vector). By default, R will fill the matrix by column; this means that it will fill the first column with the first contents of your input vector in sequence, then the next column with remaining elements, and so on until the matrix is filled and the contents of the input vector “exhausted.” Changing the byrow parameter to byrow = TRUE changes this behavior, and R will fill the matrix by rows rather than columns.

Both the rows and the columns of a matrix can be named, though you don’t use the names() function for seeing or changing these names. Instead, use the rownames() or colnames() function for accessing or modifying the row names and column names, respectively.

You can get the dimensions of a matrix with the dim() function. nrow() returns the number of rows of a matrix, and ncol() the number of columns. length() returns the number of elements in the matrix (so the product of the dimensions).

# Using rbind to make a matrix
mat1 <- rbind(c("jim bridger", "meadowbrook", "elwood"), c("copper hills", "kearns", 
    "west jordan"), c("university of utah", "byu", "westminster"), c("slcc", 
    "snow", "suu"))
# Likewise with cbind
mat2 <- cbind(c("jim bridger", "meadowbrook", "elwood"), c("copper hills", "kearns", 
    "west jordan"), c("university of utah", "byu", "westminster"), c("slcc", 
    "snow", "suu"))
mat1
##      [,1]                 [,2]          [,3]         
## [1,] "jim bridger"        "meadowbrook" "elwood"     
## [2,] "copper hills"       "kearns"      "west jordan"
## [3,] "university of utah" "byu"         "westminster"
## [4,] "slcc"               "snow"        "suu"
mat2
##      [,1]          [,2]           [,3]                 [,4]  
## [1,] "jim bridger" "copper hills" "university of utah" "slcc"
## [2,] "meadowbrook" "kearns"       "byu"                "snow"
## [3,] "elwood"      "west jordan"  "westminster"        "suu"
dim(mat1)  # The dimensions of mat1
## [1] 4 3
nrow(mat1)  # The number of rows of mat1
## [1] 4
ncol(mat1)  # The number of columns of mat1
## [1] 3
length(mat1)  # The number of elements stored in mat1
## [1] 12
# Using matrix()
mat3 <- matrix(1:10, nrow = 2)
mat4 <- matrix(1:10, nrow = 2, byrow = TRUE)
mat3
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
# mat4 is filled by rows rather than by columns
mat4
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    6    7    8    9   10
# Naming matrix dimensions
rownames(mat3) <- c("odds", "evens")
colnames(mat3) <- c("first", "second", "third", "fourth", "fifth")
mat3
##       first second third fourth fifth
## odds      1      3     5      7     9
## evens     2      4     6      8    10
# Internally, matrices are glorified vectors
as.vector(mat1)
##  [1] "jim bridger"        "copper hills"       "university of utah"
##  [4] "slcc"               "meadowbrook"        "kearns"            
##  [7] "byu"                "snow"               "elwood"            
## [10] "west jordan"        "westminster"        "suu"
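
The claim that a matrix is internally a vector with a dimension attribute can also be seen from the other direction: giving a plain vector a dim attribute makes it a matrix. A minimal sketch follows; v is a throwaway name used only here.

```r
v <- 1:6
is.matrix(v)
## [1] FALSE
# Giving the vector a dim attribute turns it into a 2 x 3 matrix
dim(v) <- c(2, 3)
is.matrix(v)
## [1] TRUE
v
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
# The attributes now record the dimensions
attributes(v)
## $dim
## [1] 2 3
```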

To access the elements of the matrix, you could do so with mat[x], where x is a vector. This will treat the matrix mat like a vector. Sometimes this is the behavior you want, but most of the time you probably wish to access the data using the matrix’s rows and columns (otherwise you would have made a vector).

R uses the notation [,] for referencing elements in a matrix. Thus you can reference objects in a matrix with mat[x,y], where x is a vector specifying the desired rows, and y a vector specifying the desired columns. All the rules for referencing elements of a vector apply to x and y, with the additional rule that leaving a dimension blank will lead to everything in that dimension being included. Thus, mat[,y] results in a matrix with all the rows of mat and columns determined by y, and mat[x,] a matrix with all the columns of mat and rows determined by x.

# Get the (1,2) entry of mat1
mat1[1, 2]
## [1] "meadowbrook"
# The first row of mat1; notice that this is a vector
mat1[1, ]
## [1] "jim bridger" "meadowbrook" "elwood"
# The second column of mat1; notice that this is also a vector
mat1[, 2]
## [1] "meadowbrook" "kearns"      "byu"         "snow"
# We can preserve the matrix structure (in other words, not turn the result
# into a vector) by adding an additional comma and specifying the option
# drop=FALSE
mat1[1, , drop = FALSE]
##      [,1]          [,2]          [,3]    
## [1,] "jim bridger" "meadowbrook" "elwood"
mat1[, 2, drop = FALSE]
##      [,1]         
## [1,] "meadowbrook"
## [2,] "kearns"     
## [3,] "byu"        
## [4,] "snow"
# A small 2x3 submatrix of mat1
mat1[1:2, 1:3]
##      [,1]           [,2]          [,3]         
## [1,] "jim bridger"  "meadowbrook" "elwood"     
## [2,] "copper hills" "kearns"      "west jordan"
# The third odd number in 1 to 10
mat3["odds", "third"]
## [1] 5
# The first and third even numbers in 1 to 10
mat3["evens", c("first", "third")]
## first third 
##     2     6

Matrices generalize to arrays, and can have more than two dimensions. For example, if arr is a three-dimensional array, we may access an element in it with arr[1, 4, 3]. We will not discuss arrays any further than this.

Data Frames

An R data frame stores data in a tabular format. Technically, a data frame is a list of vectors of equal length, so a data frame is a list. But since each “column” of the data frame has equal length, it also looks like a matrix where each column can differ in type (so one column could be numeric data, another character data, yet another factor data, etc.). Thus we can reference the data in a data frame like it is a list or like it is a matrix.

  • The matrix style of referencing data frame data is like df[x,y], where x is the rows of the data frame and y the columns. All the rules for using this notation with matrices apply to data frames. The result is usually another data frame; as with matrices, selecting a single column returns a vector unless drop = FALSE is specified.

  • The list style for referencing a data frame references only the columns, not the rows. So df[x] will select the columns of df specified by x, and the result is another data frame. df[[x]] returns the single column identified by x; this is a vector, not a data frame. More commonly, though, we refer to a column of a data frame with the dollar notation; rather than use df[["x"]], we use df$x to get the column vector x in df.
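
The two referencing styles described above can be compared side by side. This is a minimal sketch on a small, hypothetical data frame; the column names id and score are invented for the illustration.

```r
# A small hypothetical data frame
df <- data.frame(id = 1:3, score = c(90.5, 72, 88))
# Matrix style: rows 1 and 2, both columns; the result is a data frame
df[1:2, c("id", "score")]
# List style with single brackets: a one-column data frame
class(df["score"])
## [1] "data.frame"
# Double brackets and $ both return the column vector itself
df[["score"]]
## [1] 90.5 72.0 88.0
df$score
## [1] 90.5 72.0 88.0
# A data frame is a list whose length is the number of columns
is.list(df)
## [1] TRUE
length(df)
## [1] 2
```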

To create a data frame, we have options:

  • We could use the data.frame() function, where each vector passed will become a column in the data frame.

  • We could use the as.data.frame() function on an object easily coerced into a data frame, like a matrix or a list.

Some examples are shown below.

# Making a data frame with data.frame
df1 <- data.frame(numbers = 1:5, letters = c("a", "b", "c", "d", "e"))
df1
##   numbers letters
## 1       1       a
## 2       2       b
## 3       3       c
## 4       4       d
## 5       5       e
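# Notice that the character vector was automatically made a factor vector!
# (This is the default before R 4.0.0; from R 4.0.0 on, stringsAsFactors
# defaults to FALSE and the column stays a character vector.)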
str(df1)
## 'data.frame':    5 obs. of  2 variables:
##  $ numbers: int  1 2 3 4 5
##  $ letters: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
colnames(mat2) <- c("elementary", "high school", "university", "local")
# Make a data frame out of a matrix. If we don't want to turn character
# strings into factors, set stringsAsFactors to FALSE (this also works in
# data.frame)
df2 <- as.data.frame(mat2, stringsAsFactors = FALSE)
df2
##    elementary  high school         university local
## 1 jim bridger copper hills university of utah  slcc
## 2 meadowbrook       kearns                byu  snow
## 3      elwood  west jordan        westminster   suu
str(df2)
## 'data.frame':    3 obs. of  4 variables:
##  $ elementary : chr  "jim bridger" "meadowbrook" "elwood"
##  $ high school: chr  "copper hills" "kearns" "west jordan"
##  $ university : chr  "university of utah" "byu" "westminster"
##  $ local      : chr  "slcc" "snow" "suu"
newlist <- list(first = c("Tamara", "Danielle", "John", "Kent"), last = c("Garvey", 
    "Wu", "Godfrey", "Morgan"))
# Making a data frame from a list
df3 <- as.data.frame(newlist, stringsAsFactors = FALSE)

Working with Data Frames

Data frames are such a key tool for R users that packages are written solely for the accessing and manipulation of data in data frames. Thus they deserve more discussion.

Often we wish to work with multiple variables stored in a data frame, but while the $ notation is convenient, even it can grow tiresome with complicated computations. The function with() can help simplify code. The first argument of with() is a data frame, and the second argument is a command to evaluate.

d <- mtcars[1:10, ]
# We wish to know which cars have mpg between the first and third quartiles.
# Here's a first approach that is slightly cumbersome
d[d$mpg > quantile(d$mpg, 0.25) & d$mpg < quantile(d$mpg, 0.75), ]
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
# We can use the with function to clean things up
d[with(d, mpg > quantile(mpg, 0.25) & mpg < quantile(mpg, 0.75)), ]
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

Often users don’t want all the data in a data frame, but only a subset of it. The which() function could be used to pick the desired rows and a vector to pick the desired columns, but this can quickly become cumbersome. Alternatively, use the subset() function for this task. The data frame is the first argument passed to subset(). Next, pass information to the subset parameter to decide on what rows to include, or the select parameter to choose the columns. Names of variables in the data frame can be used in subset() like in with(); you don’t need to use $ notation to choose the variable from within the data frame. Additionally, unlike when selecting with vectors, you can use : to choose all columns between two names, not just numbers, and you can use - in front of a vector of names to declare columns you don’t want.

names(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
# Notice that I do not list the names as strings
subset(mtcars, select = c(mpg, cyl), subset = mpg > quantile(mpg, 0.9))
##                 mpg cyl
## Fiat 128       32.4   4
## Honda Civic    30.4   4
## Toyota Corolla 33.9   4
## Lotus Europa   30.4   4
# Other ways to select columns: using : on column names selects columns
# between the names on either side
subset(mtcars, select = hp:qsec, subset = !is.na(mpg) & mpg > quantile(mpg, 
    0.25) & mpg < quantile(mpg, 0.75) & cyl == 8)
##                    hp drat    wt  qsec
## Hornet Sportabout 175 3.15 3.440 17.02
## Merc 450SE        180 3.07 4.070 17.40
## Merc 450SL        180 3.07 3.730 17.60
## Dodge Challenger  150 2.76 3.520 16.87
## Pontiac Firebird  175 3.08 3.845 17.05
## Ford Pantera L    264 4.22 3.170 14.50
# Using - on a vector of names selects all columns except those in a vector
subset(mtcars, select = -c(drat, wt, qsec), subset = !is.na(mpg) & mpg > quantile(mpg, 
    0.25) & mpg < quantile(mpg, 0.75) & cyl == 8)
##                    mpg cyl  disp  hp vs am gear carb
## Hornet Sportabout 18.7   8 360.0 175  0  0    3    2
## Merc 450SE        16.4   8 275.8 180  0  0    3    3
## Merc 450SL        17.3   8 275.8 180  0  0    3    3
## Dodge Challenger  15.5   8 318.0 150  0  0    3    2
## Pontiac Firebird  19.2   8 400.0 175  0  0    3    2
## Ford Pantera L    15.8   8 351.0 264  0  1    5    4
# Here is the above without using subset; notice how complicated the command
# is
mtcars[!is.na(mtcars$mpg) & mtcars$mpg > quantile(mtcars$mpg, 0.25) & mtcars$mpg < 
    quantile(mtcars$mpg, 0.75) & mtcars$cyl == 8, !(names(mtcars) %in% c("drat", 
    "wt", "qsec"))]
##                    mpg cyl  disp  hp vs am gear carb
## Hornet Sportabout 18.7   8 360.0 175  0  0    3    2
## Merc 450SE        16.4   8 275.8 180  0  0    3    3
## Merc 450SL        17.3   8 275.8 180  0  0    3    3
## Dodge Challenger  15.5   8 318.0 150  0  0    3    2
## Pontiac Firebird  19.2   8 400.0 175  0  0    3    2
## Ford Pantera L    15.8   8 351.0 264  0  1    5    4

There are many other details about working with data frames that are common parts of an analyst’s workflow, such as reshaping a data frame (keeping the same information stored in a data frame but changing the data frame’s structure) and merging (combining information in two data frames). Read the textbook for more information and examples of these very important ideas. The entire process of bringing data into a workable format is called data cleaning, a significant and often underappreciated part of an analyst’s job.
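
To give a flavor of merging, here is a minimal sketch using the base R function merge(). The two data frames, students and grades, are hypothetical and invented for this illustration; the textbook covers the topic properly.

```r
# Two small hypothetical data frames sharing the key column "id"
students <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bo", "Cy"))
grades <- data.frame(id = c(1, 2, 4), grade = c("A", "B", "C"))
# By default merge() keeps only the ids found in both data frames
inner <- merge(students, grades, by = "id")
inner
##   id name grade
## 1  1  Ann     A
## 2  2   Bo     B
# all.x = TRUE keeps every row of the first data frame, filling gaps with NA
left <- merge(students, grades, by = "id", all.x = TRUE)
left
##   id name grade
## 1  1  Ann     A
## 2  2   Bo     B
## 3  3   Cy  <NA>
```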

Applying a Function Over a Collection

Often we wish to apply a function not to a single object or variable but instead to a collection so we can get multiple values. For example, if we want the powers of two with exponents from one to ten, we could do so with the following (the parentheses are needed because ^ is evaluated before :):

2^(1:10)
##  [1]    2    4    8   16   32   64  128  256  512 1024

A similar idea is that we could take the square root of numbers between 0 and 1 with:

sqrt(seq(0, 1, by = 0.1))
##  [1] 0.0000000 0.3162278 0.4472136 0.5477226 0.6324555 0.7071068 0.7745967
##  [8] 0.8366600 0.8944272 0.9486833 1.0000000

It may not be this simple though. For example, suppose we have a data frame, which I construct below:

library(MASS)
cdat <- subset(Cars93, select = c(Min.Price, Price, Max.Price, MPG.city, MPG.highway, 
    EngineSize, Horsepower, RPM))
head(cdat)
##   Min.Price Price Max.Price MPG.city MPG.highway EngineSize Horsepower
## 1      12.9  15.9      18.8       25          31        1.8        140
## 2      29.2  33.9      38.7       18          25        3.2        200
## 3      25.9  29.1      32.3       20          26        2.8        172
## 4      30.8  37.7      44.6       19          26        2.8        172
## 5      23.7  30.0      36.2       22          30        3.5        208
## 6      14.2  15.7      17.3       22          31        2.2        110
##    RPM
## 1 6300
## 2 5500
## 3 5500
## 4 5500
## 5 5700
## 6 5200

I want the mean of all the variables in cdat. mean(cdat) will not work; the mean() function does not know how to handle the different variables in a data frame.

We may instead try a for loop, like so:

# Make an empty vector
cdat_means <- c()
# This starts a for loop
for (vec in cdat) {
    # For every vector in cdat (called vec in the body of the loop), the code in
    # the loop will be executed. Compute the mean of vec, and add it to
    # cdat_means
    cdat_means <- c(cdat_means, mean(vec))
}
names(cdat_means) <- names(cdat)
cdat_means
##   Min.Price       Price   Max.Price    MPG.city MPG.highway  EngineSize 
##   17.125806   19.509677   21.898925   22.365591   29.086022    2.667742 
##  Horsepower         RPM 
##  143.827957 5280.645161

A good R programmer will try to avoid for loops as much as possible. One reason is that for loops in R are slow compared with loops in many other languages. Since R is an interpreted language and includes many features that make interacting with R and writing code easier, R programs tend to be slower than equivalent programs in compiled languages. This is the price R pays for being interactive and much easier to write code for than compiled languages like C, C++, or Java. (A lot of R functions run fast because the function is actually an interface for a function written in C, C++, or FORTRAN.) Another reason R programmers avoid for loops is that there is often an alternative not using a loop that is easier to both write and understand.

How could we rewrite the above code without using for? We could use the function sapply() and the call sapply(v, f), where v is either a vector or list with the items you wish to iterate over, and f is a function to apply to each item. (Remember that a data frame is a list of vectors of equal length.) When f returns a single value for each item, the results are returned as a vector.

# A function to check if a number is even
even <- function(x) {
    # If x is divisible by 2 (the remainder is 0 when x is divided by 2), x is
    # even and the result is TRUE. Otherwise, the result is FALSE.
    x%%2 == 0
}

# Which numbers between 1 and 10 are even?
sapply(1:10, even)
##  [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
# The means of the vectors in cdat (remember that a data frame is a list of
# equal length vectors)
sapply(cdat, mean)
##   Min.Price       Price   Max.Price    MPG.city MPG.highway  EngineSize 
##   17.125806   19.509677   21.898925   22.365591   29.086022    2.667742 
##  Horsepower         RPM 
##  143.827957 5280.645161
# We can pass sapply an anonymous function, which is an unnamed function
# passed as an argument to some other function, used for some evaluation. I
# illustrate below by passing to sapply a function that computes the range
# of each of the variables in cdat.
sapply(cdat, function(vec) {
    diff(range(vec))
})
##   Min.Price       Price   Max.Price    MPG.city MPG.highway  EngineSize 
##        38.7        54.5        72.1        31.0        30.0         4.7 
##  Horsepower         RPM 
##       245.0      2700.0

The lapply() function works like the sapply() function, except that lapply() always returns a list rather than simplifying the result to a vector.
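
The difference in return type is easy to check. The sketch below uses three columns of the built-in mtcars data rather than cdat so that it runs without extra packages; the names cols, as_vector, and as_list are illustrative only.

```r
# Use three columns of the built-in mtcars data
cols <- mtcars[c("mpg", "hp", "wt")]
as_vector <- sapply(cols, mean)
as_list <- lapply(cols, mean)
is.list(as_vector)
## [1] FALSE
is.list(as_list)
## [1] TRUE
# The values are the same; only the container differs
all.equal(unlist(as_list), as_vector)
## [1] TRUE
```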

Alternatively, if we have a function f(x) that knows how to work with an object x, we could vectorize f so it can work on a vector or list of objects like x. We can use the Vectorize() function for this task with a call like vf <- Vectorize(f), where f is the function to vectorize, and vf is the new, vectorized version of f. The example below does what we did for cdat with both a for loop and sapply(), but now does so with a vectorized version of mean().

vmean <- Vectorize(mean)
vmean(cdat)
##   Min.Price       Price   Max.Price    MPG.city MPG.highway  EngineSize 
##   17.125806   19.509677   21.898925   22.365591   29.086022    2.667742 
##  Horsepower         RPM 
##  143.827957 5280.645161
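
Vectorize() is perhaps most useful for functions written with a single value in mind. As a sketch, the hypothetical function describe() below relies on if(), which expects a single TRUE or FALSE, so it is not meant for a whole vector until it is wrapped.

```r
# This function uses if(), which expects a single TRUE or FALSE,
# so it is only meant for one number at a time
describe <- function(x) {
    if (x%%2 == 0) "even" else "odd"
}
# Vectorize() wraps it so that it is applied to each element in turn
vdescribe <- Vectorize(describe)
vdescribe(1:4)
## [1] "odd"  "even" "odd"  "even"
```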

Now suppose you have a data frame d, which contains information from different samples representing different populations. You wish to apply a function f() to data stored in d$x, and d$y determines which sample each row of the data frame (and thus, each entry of d$x) came from. You want f() to be applied to the data in each sample, separately. You can do so with the aggregate() function in a call of the form aggregate(x ~ y, data = d, f). I illustrate with the iris dataset below.

# The structure of iris
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# The mean sepal length by species of iris
aggregate(Sepal.Length ~ Species, data = iris, mean)
##      Species Sepal.Length
## 1     setosa        5.006
## 2 versicolor        5.936
## 3  virginica        6.588
# The five-number summary of sepal length for each species of iris
aggregate(Sepal.Length ~ Species, data = iris, quantile)
##      Species Sepal.Length.0% Sepal.Length.25% Sepal.Length.50%
## 1     setosa           4.300            4.800            5.000
## 2 versicolor           4.900            5.600            5.900
## 3  virginica           4.900            6.225            6.500
##   Sepal.Length.75% Sepal.Length.100%
## 1            5.200             5.800
## 2            6.300             7.000
## 3            6.900             7.900

Let’s now consider matrices. Perhaps we have a matrix and we wish to apply a function across the rows of the matrix or the columns of the matrix. The apply() function allows us to do just that in a call of the form apply(mat, m, f), where mat is the matrix with data, f the function to apply, and m the margin to apply f() over. For matrices, a value of 1 for m will lead to the function being applied across rows, and a value of 2 across columns. I illustrate with a data set recording the ethnicity of selected Utah public schools (to see how this data set was created, view the source code of this document).

school_race_dat
##                  Entheos Academy Kearns Entheos Academy Magna
## Native American                       0                     0
## Asian                                 4                     5
## Black                                 1                     5
## Hispanic                            145                   201
## Pacific Islander                     15                     3
## White                               334                   273
## Multiple Race                        23                    15
##                  Jim Bridger School Sunset Ridge Middle Copper Hills High
## Native American                   4                   5                 9
## Asian                             6                  25                50
## Black                            12                  19                42
## Hispanic                        216                 322               551
## Pacific Islander                 12                  28                28
## White                           314                1124              1924
## Multiple Race                     7                  50               102
##                  Thomas Jefferson Jr High Kearns High
## Native American                        11          39
## Asian                                  13          49
## Black                                  17          53
## Hispanic                              260         937
## Pacific Islander                       42          99
## White                                 394        1138
## Multiple Race                           2          10
# Get row sums
apply(school_race_dat, 1, sum)
##  Native American            Asian            Black         Hispanic 
##               68              152              149             2632 
## Pacific Islander            White    Multiple Race 
##              227             5501              209
# Column sums
apply(school_race_dat, 2, sum)
##   Entheos Academy Kearns    Entheos Academy Magna       Jim Bridger School 
##                      522                      502                      571 
##      Sunset Ridge Middle        Copper Hills High Thomas Jefferson Jr High 
##                     1573                     2706                      739 
##              Kearns High 
##                     2325
# Row sums and column sums are actually used frequently, so there are
# specialized functions for these
rowSums(school_race_dat)
##  Native American            Asian            Black         Hispanic 
##               68              152              149             2632 
## Pacific Islander            White    Multiple Race 
##              227             5501              209
colSums(school_race_dat)
##   Entheos Academy Kearns    Entheos Academy Magna       Jim Bridger School 
##                      522                      502                      571 
##      Sunset Ridge Middle        Copper Hills High Thomas Jefferson Jr High 
##                     1573                     2706                      739 
##              Kearns High 
##                     2325
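
apply() is not limited to sums; any function, including an anonymous one, can be passed. The following sketch uses a small hypothetical matrix, counts, since school_race_dat is built in the document's source.

```r
# A small hypothetical table of counts (rows are groups, columns are sites)
counts <- matrix(c(10, 30, 20, 40), nrow = 2,
    dimnames = list(c("g1", "g2"), c("siteA", "siteB")))
# Largest count in each row
row_max <- apply(counts, 1, max)
row_max
## g1 g2 
## 20 40 
# Each group's share of its column total; every column of shares sums to 1
shares <- apply(counts, 2, function(col) col/sum(col))
shares
colSums(shares)
```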

Using External Data

R would not be very useful if we had no way of loading in and saving data. R has means for reading data from spreadsheets such as .xls or .xlsx files made by Microsoft Excel. Functions for reading Excel files can be found in the xlsx or gdata packages.

Common plain-text formats for reading data include the comma-separated values format (.csv), tab-separated values format (.tsv), and the fixed-width format (.fwf). These files can be read in using the read.csv(), read.table(), and the read.fwf() functions (with read.csv() being merely a front-end for read.table()). All of these functions parse a plain-text data file and return a data frame with the contents. Keep in mind that R will guess what type of data is stored in the file. Usually it makes a good guess, but this is not guaranteed and you may need to do some more data cleaning or give R more instructions on how to interpret the file.
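
As an example of giving R more instructions, the colClasses argument of read.csv() states the type of each column instead of letting R guess. The sketch below reads a tiny, made-up CSV supplied inline through the text argument, so no file is needed.

```r
# A tiny CSV supplied inline via the text argument (hypothetical data)
csv_text <- "id,zip\n1,02134\n2,84112"
# Left to guess, R reads zip as a number and drops the leading zero
guessed <- read.csv(text = csv_text)
guessed$zip
## [1]  2134 84112
# colClasses tells R how to interpret each column
typed <- read.csv(text = csv_text, colClasses = c("integer", "character"))
typed$zip
## [1] "02134" "84112"
```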

In order to load a file, you must specify the location of the file. If the file is on your hard drive, there are a few ways to do so:

  • You could use the file.choose() command to browse your system and locate the file. Once done, you will have a text string describing the location of the file on your system.

  • Any R session has a working directory, which is where R looks first for files. You can see the current working directory with getwd(), and change the working directory with setwd(path), where path is a string for the location of the directory you wish to set as the new working directory.
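
The working-directory functions above can be sketched as follows. The example switches to R's per-session temporary directory so it does not depend on any particular folder; the file name wd_demo.csv is hypothetical.

```r
# Remember the current working directory so it can be restored
old_wd <- getwd()
# Switch to R's per-session temporary directory
setwd(tempdir())
# A relative file name is now resolved against the new working directory
writeLines("a,b\n1,2", "wd_demo.csv")
file.exists("wd_demo.csv")
## [1] TRUE
# file.path() builds a full path in a platform-independent way
full_path <- file.path(tempdir(), "wd_demo.csv")
file.exists(full_path)
## [1] TRUE
# Clean up and restore the original working directory
unlink("wd_demo.csv")
setwd(old_wd)
```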

Let’s assume we’re loading in a .csv file (the approach is similar for other formats). The command df <- read.csv("myfile.csv") instructs R to read myfile.csv and store the resulting data frame in df. Since we did not specify a full path, the file is presumed to be in the working directory. If it were not, we would either change the working directory or pass the full path to the function, which may look something like read.csv("C:/path/to/myfile.csv") or read.csv("/path/to/myfile.csv"), depending on the system. Once this is done, df is ready for us to use.

Suppose that the data file is on the Internet. You can pass the URL of the file to read.csv() and R will read the file online and make it available to you in your session. I demonstrate below:

# Total Primary Energy Consumption by country and region, for years 1980
# through 2008; in Quadrillion Btu (CSV Version). Dataset from data.gov,
# from the Department of Energy's dataset on total primary energy
# consumption.  Download and load in the dataset
energy <- read.csv("http://en.openei.org/doe-opendata/dataset/d9cd39c5-492e-4e82-8765-12e0657eeb4e/resource/3c42d852-567e-4dda-a39c-2bfadf309da5/download/totalprimaryenergyconsumption.csv", 
    stringsAsFactors = FALSE)
# R did not parse everything correctly; turn some variables numeric
energy[2:30] <- lapply(energy[2:30], as.numeric)
# We want energy data for North American countries, from 2000 to 2008
us_energy <- subset(energy, select = X2000:X2008, subset = Country %in% c("Canada", 
    "United States", "Mexico"))
us_energy
##      X2000    X2001    X2002    X2003     X2004     X2005    X2006
## 2 13.07669 12.87847 13.10786 13.52061  13.83128  14.16374 13.81736
## 4  6.37958  6.32931  6.32936  6.50563   6.48998   6.80188  7.36271
## 6 99.25385 96.53415 98.03879 98.31384 100.49743 100.60722 99.90566
##       X2007    X2008
## 2  14.07179 14.02923
## 4   7.27651  7.30898
## 6 101.67563 99.53011

Naturally you can export data frames into common formats as well. write.csv(), write.table(), and write.fwf() (the last of which comes from the gdata package) will write data into comma-separated value, delimited text (tab-separated if you set sep = "\t"), and fixed-width formats. Their syntax is similar. To save a .csv file, issue the command write.csv(df, file = "myfile.csv"), where df is the data frame to save and file is where to save it, which could be just a file name (resulting in the file being saved in the working directory), or an absolute path.

my_data <- data.frame(var1 = 1:10, var2 = paste("word", 1:10))
write.csv(my_data, file = "my_data.csv")
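
To confirm that an exported file can be read back, here is a minimal round-trip sketch. It writes to a temporary file so nothing is left in the working directory, and it uses row.names = FALSE (an option of write.csv()) to leave out the row-number column; the names out_file and round_trip are illustrative.

```r
my_data <- data.frame(var1 = 1:3, var2 = paste("word", 1:3))
# Write to a temporary file; row.names = FALSE leaves out the row numbers
out_file <- tempfile(fileext = ".csv")
write.csv(my_data, file = out_file, row.names = FALSE)
# Read the file back in
round_trip <- read.csv(out_file, stringsAsFactors = FALSE)
round_trip
##   var1   var2
## 1    1 word 1
## 2    2 word 2
## 3    3 word 3
```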

There are other formats R can read and write to. The foreign package allows R to read data files created for other statistical software packages such as SAS or Stata. The XML package allows R to read XML and HTML files. You can also read JSON files or data stored in Google Sheets. Refer to the textbook for more information.