Lecture 1

R Basics

Using R as a Calculator

Much of the work in R is done through its interpreter. The simplest way to use the interpreter is as a calculator.

# Some basic arithmetic
2 + 2
## [1] 4
1 - 7
## [1] -6
2 * 3
## [1] 6
(1 + 1) * 3
## [1] 6
2^3
## [1] 8
10/2
## [1] 5
# One can also use functions for more advanced calculations
sqrt(9)
## [1] 3
exp(3)
## [1] 20.08554
# A familiar built-in value
pi
## [1] 3.141593

When using a computer as a calculator, though, you may observe behavior that seems… strange.

# Be aware that any time you use computers for calculation you may get the
# 'wrong' answer. This is due to floating-point arithmetic and how computers
# see numbers.
sqrt(2)^2
## [1] 2
# What's this?
sqrt(2)^2 - 2
## [1] 4.440892e-16

(For an explanation of this phenomenon, watch this video by Computerphile on floating-point numbers.)
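
One practical consequence of the rounding error shown above is that exact equality tests on computed numbers can fail. A minimal sketch of a common workaround, using the base R function all.equal(), which compares values up to a small tolerance:

```r
# Exact comparison fails because of the tiny rounding error
sqrt(2)^2 == 2
## [1] FALSE
# all.equal() allows for a small tolerance; wrap it in isTRUE() to get a
# plain TRUE/FALSE
isTRUE(all.equal(sqrt(2)^2, 2))
## [1] TRUE
# Another classic example of the same phenomenon
0.1 + 0.2 == 0.3
## [1] FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))
## [1] TRUE
```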

Obviously, though, we want to use R for more than a glorified calculator.

Variables

As with any programming language, R can store information in variables. R has several assignment operators for creating variables: <-, ->, and = (there are a few others that I won’t discuss here).

# One way to create a variable
var1 <- 1
# Another way to create a variable
var2 = 2
# A third way, this time where the variable is on the RIGHT side of the
# assignment operator!
3 -> var3
# What do you get?
var1 + var2
## [1] 3
var2 * var3
## [1] 6

There are rules for how variables can be named. Names cannot be reserved words, and they can contain only alphanumeric characters (letters and numbers) and the _ and . characters. They also cannot begin with a number or an underscore (so while var1 is legal, 1var and _var are not). Variables are case-sensitive; var1 is not the same as Var1 or VAR1.

# These are all valid names for variables
var1 <- 10
variable_number_2 <- 20
variable.number.3 <- 30  # <---- I would suggest avoiding this style of variable naming, though; use '_' instead
variableNumber4 <- 40

# None of these are the same
var1 <- 10
Var1 <- 230
VAR1 <- -5


var1 + 1
## [1] 11
Var1 + 1
## [1] 231
VAR1 + 1
## [1] -4
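
If you are unsure whether a name is legal, one way to check is sketched below. The helper is_valid_name() is hypothetical (it is not part of R); it relies on the base function make.names(), which rewrites a string into a syntactically valid name and leaves already-valid names unchanged.

```r
# Hypothetical helper (not part of base R): a name is valid if make.names()
# does not need to change it
is_valid_name <- function(x) make.names(x) == x

is_valid_name("var1")
## [1] TRUE
is_valid_name("variable.number.3")
## [1] TRUE
is_valid_name("1var")  # begins with a number
## [1] FALSE
is_valid_name("if")    # reserved word
## [1] FALSE
```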

As a side note, a line break usually ends a command, but not always. If a command is unfinished from R’s perspective, R treats the contents of the next line as part of the same command. This can sometimes be confusing, while other times it can be used to your advantage. You can also use the ; character to separate commands on the same line, or to end a command explicitly as in C, C++, or Java (though this is rarely done in R). Aside from this, R generally ignores white space (spaces, tabs, new lines, etc.).

# Because there is nothing after the '+' on the first line, R looks for the
# rest of the command on the second line.
5 + 
5
## [1] 10
# It could even be a few lines down! (DON'T EVER DO THIS!!!!)
10 + 



1
## [1] 11
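
The sketch below illustrates the ; separator and a safer way to split a command across lines. The variable names a, b, total, and partial are made up for this example.

```r
# Two commands on one line, separated by ';'
a <- 2; b <- 3
a * b
## [1] 6
# An open parenthesis tells R the command is not finished, so it is safe to
# continue on the following lines
total <- (1 +
  2 +
  3)
total
## [1] 6
# Careful: if a line is already a complete command, the next line starts a
# NEW command. Here '+ 2' is evaluated on its own and partial stays at 1.
partial <- 1
+ 2
## [1] 2
partial
## [1] 1
```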

R variables are untyped, but the objects they reference are not. Some basic types you will encounter early (and there are many, many others) include:

  • Numeric, like 124 or 3.14159
  • Character, like "hello bobby" or "124" or 'cmiller@math.utah.edu'
  • Boolean, either TRUE or FALSE
  • Vectors, which are collections of values that all have the same type
  • Functions, which you can think of as being a “mini program”, like log, sqrt, mean, or help
  • Data frames, which store datasets in memory in a tabular format

# A numeric variable
num_var <- 124
# Character data
char_var <- "hello"
# Boolean
bool_var <- TRUE
# A vector of data
data_vec <- c(1, 20, 6, 2)
# Another vector of data
data_vec2 <- c("hello", "world")

# There are functions for type checking
not_a_number <- "124"
is.numeric(not_a_number)
## [1] FALSE
is.character(not_a_number)
## [1] TRUE
# Sometimes you can force a variable of one type to be another type.
is_a_number <- as.numeric(not_a_number)
is.numeric(is_a_number)
## [1] TRUE
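
To see what type of object a variable refers to, the base function class() is a convenient starting point. The sketch below also shows what happens when types are mixed in a vector or when a conversion is impossible; the example values are arbitrary.

```r
class(124)
## [1] "numeric"
class("124")
## [1] "character"
class(TRUE)
## [1] "logical"
class(sqrt)
## [1] "function"
class(iris)
## [1] "data.frame"
# A vector holds a single type, so mixing numbers and strings converts
# everything to character
mixed <- c(1, "two", 3)
class(mixed)
## [1] "character"
# A conversion that makes no sense gives NA (and a warning, suppressed here)
suppressWarnings(as.numeric("hello"))
## [1] NA
```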

Packages

The true power of R is in its packages. R has an ever-growing community of users and developers, many of whom write free packages to extend R’s functionality. These packages are made available to all R users on websites like CRAN (the Comprehensive R Archive Network) or GitHub.

Packages from CRAN are easily downloaded and installed using the install.packages() function, like so:

# Install the "UsingR" package for Verzani's book
install.packages("UsingR")

Once you install a new package, you can load it into the R environment using require() or library().

library(UsingR)
# Alternatively, you could use require("UsingR")
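
If you are not sure whether a package is already installed, one possible pattern is to check first with requireNamespace(), a base R function that returns TRUE or FALSE without attaching the package. In the sketch below, "stats" is used because it ships with R, and "notARealPackage123" is a made-up name assumed not to exist.

```r
# TRUE if the package is installed and can be loaded
requireNamespace("stats", quietly = TRUE)
## [1] TRUE
# FALSE for a package that is not installed (hypothetical name)
requireNamespace("notARealPackage123", quietly = TRUE)
## [1] FALSE
# A common idiom: install only when the package is missing
# (left commented out here so that nothing is downloaded by accident)
# if (!requireNamespace("UsingR", quietly = TRUE)) install.packages("UsingR")
```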

Today there are well over 7000 packages on CRAN alone, and this growth is likely to continue.

Data sets

Datasets in R are usually stored in vectors (for univariate data) or data frames (for multivariate data). For now, we will work with built-in data sets or those included in packages. If a dataset isn’t already in the R environment but does exist in some package, it can be brought in using the data() function. The head() function allows us to view only a few observations from a dataset (in order to prevent our screen from being flooded with all the data), and the str() function gives us a further description of the dataset.

# Let's load in a univariate dataset containing the lengths of major North
# American rivers
data(rivers)
# We can see a few of the observations to get a glimpse at the nature of the
# data.
head(rivers)
## [1] 735 320 325 392 524 450
# We can see more information via str()
str(rivers)
##  num [1:141] 735 320 325 392 524 ...
# Just for fun, what is the combined length of all these rivers?
sum(rivers)
## [1] 83357
# Let's have even more fun by seeing what proportion of this total length
# each river accounts for.
rivers/sum(rivers)
##   [1] 0.008817496 0.003838910 0.003898893 0.004702664 0.006286215
##   [6] 0.005398467 0.017503029 0.001619540 0.005578416 0.007197956
##  [11] 0.003958876 0.004030855 0.003359046 0.003778927 0.010437036
##  [16] 0.010868913 0.002423312 0.003946879 0.003479012 0.011996593
##  [21] 0.007197956 0.006058279 0.017395060 0.010077138 0.014911765
##  [26] 0.010676968 0.004198808 0.004882613 0.003431026 0.003359046
##  [31] 0.006298211 0.008637547 0.004678671 0.002999148 0.003922886
##  [36] 0.002759216 0.003179097 0.010197104 0.002519285 0.007557854
##  [41] 0.003119114 0.002759216 0.004318773 0.008757513 0.007197956
##  [46] 0.003670957 0.004678671 0.005038569 0.003491009 0.008517581
##  [51] 0.004078842 0.002603261 0.003371043 0.004222801 0.003107118
##  [56] 0.002999148 0.005638399 0.008157683 0.006838058 0.004198808
##  [61] 0.003598978 0.006718092 0.010796934 0.007497871 0.003982869
##  [66] 0.028168000 0.014048010 0.044507360 0.027772113 0.030387370
##  [71] 0.009357343 0.003359046 0.004918603 0.005518433 0.003119114
##  [76] 0.003059131 0.005170532 0.004198808 0.009117411 0.007413894
##  [81] 0.004054848 0.011768658 0.015667550 0.005998296 0.008349629
##  [86] 0.007257939 0.002999148 0.004930600 0.012644409 0.008817496
##  [91] 0.002795206 0.005218518 0.005878331 0.003718944 0.005518433
##  [96] 0.004594695 0.004498722 0.015235673 0.006538143 0.005338484
## [101] 0.022613578 0.004558705 0.003598978 0.004558705 0.004522716
## [106] 0.005098552 0.003311060 0.002519285 0.009597274 0.005038569
## [111] 0.004198808 0.004318773 0.006454167 0.013196252 0.014455895
## [116] 0.003766930 0.002843193 0.007317922 0.004318773 0.006478160
## [121] 0.012452464 0.005086555 0.003718944 0.003598978 0.005326487
## [126] 0.003610974 0.003215087 0.007437888 0.002579267 0.007821779
## [131] 0.010796934 0.006298211 0.002951162 0.004318773 0.006346198
## [136] 0.005998296 0.008637547 0.003239080 0.005158535 0.008049714
## [141] 0.021233970
# Let's look at our first ever data frame, the famous iris dataset
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# You can think of this as being a table or matrix. In fact, it's a
# combination of equal-length vectors. We can get the Sepal.Length vector
# using the $ operator
iris$Sepal.Length
##   [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4
##  [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5
##  [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0
##  [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8
##  [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4
##  [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8
## [103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
## [120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
## [137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
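
A few more base R functions are handy for a first look at a dataset without printing all of it. This is only a small sample of what is available:

```r
# Number of observations in a vector
length(rivers)
## [1] 141
# Number of rows and columns in a data frame
nrow(iris)
## [1] 150
ncol(iris)
## [1] 5
# head() also works on a single column, and takes the number of values to show
head(iris$Sepal.Length, 3)
## [1] 5.1 4.9 4.7
# A column is an ordinary vector, so functions like mean() apply to it
mean(iris$Sepal.Length)
## [1] 5.843333
```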

Comments

You may have already noticed the # symbol in the code. It begins a comment: the R interpreter ignores everything from the # to the end of the line. Comments are used for annotating code. Commenting is not optional; it’s essential to understanding code when it is read later (and I promise you, it will be). You can’t write good code without comments. Consequently, I require you to write good, extensive comments in your homework!

# Hi! I'm a comment! R doesn't care about me, but that's okay. I'm very
# useful for understanding code, and if you forget to add me, an angry
# programmer will come to your house and punch you in the face!

# Here, I'm saying that the following code makes a histogram of the rivers
# data, using relative frequency, a custom main title, a custom x-axis
# label, a bin count based on the square root of the size of the dataset,
# left-inclusive classes, and a beautiful rainbow fill because we're classy
# people who think this looks good (even though it doesn't)
hist(rivers, breaks = ceiling(sqrt(length(rivers))), freq = FALSE, right = FALSE,
    main = "Histogram for the Lengths of Rivers", xlab = "River Length",
    col = rainbow(ceiling(sqrt(length(rivers)))))
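
Long commands like the one above are often easier to comment when they are split into named steps. The sketch below is one possible way to do that for the square-root bin rule mentioned in the comment; the names n_obs, n_bins, and h are made up for this example, and plot = FALSE is used so that the histogram is computed without being drawn.

```r
# Number of observations in the dataset
n_obs <- length(rivers)
# Square-root rule: suggested number of bins
n_bins <- ceiling(sqrt(n_obs))
n_bins
## [1] 12
# plot = FALSE returns the histogram object instead of drawing it
h <- hist(rivers, breaks = n_bins, right = FALSE, plot = FALSE)
# R treats a single number given to breaks as a suggestion only, so the
# actual number of classes may differ from n_bins
length(h$counts)
# Every observation lands in exactly one class
sum(h$counts) == n_obs
## [1] TRUE
```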

Help

R has many, many functions. In 2014, there were approximately 182,393 R functions that could be used (most of them in packages). There is no way anyone could remember all of them.

Fortunately, it’s easy to access documentation in R, especially if you are using RStudio. The help() function looks up the documentation for the name you give it. Even easier, just type ? followed by the name of the package/function/dataset you want to see documentation for.

# Here's how to access the documentation of the mean() function
help("mean")
# Or even easier:
?mean
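
When you don't remember the exact name of a function, help() and ? are less useful. A small sketch of two other base R tools that can help in that situation:

```r
# apropos() lists the names of available objects that match a pattern; here,
# every name that begins with "mean"
apropos("^mean")
# args() shows the arguments a function accepts without opening its full
# documentation page
args(mean)
```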