Prepared by Mark Wilber, Original Material from Justin Kitzes and Tom Wright
Supplementary Material: Answers to Exercises
The goals of this modules is to teach you the basic, core concepts of programming that transcend languages, how they fit together, and how you can use them to become a better scientist. We will be using the language R to do this.
By the end of these R lessons, you will be able to:
Understand how to run basic statistics in R.
We (the instructors) recognize that you all unavoidably have come with very different levels of background in R programming. We expect that some of you might have little experience R and a few of you might have decent R experience already - for those in that category, this section of the workshop may not be as novel as the other sections. However, we hope that the method of presentation will help to solidify your existing knowledge. We also encourage you to take the opportunity to ask the instructors and volunteers about more advanced techniques that you might have heard of but do not know how to use well.
For those who have no (or almost no) background in programming in any language, you may find that these lessons proceed quickly. We encourage you to make liberal use of the helpful volunteers as we proceed through these lessons. You may also wish to consider working together with a partner to complete the exercises as a team.
Regardless of your background, you will probably feel like trying to take in all of this material is like trying to drink from a firehose. That's OK - the idea is to at least introduce you to a wide variety of topics, with the hope that you (a) will get to reinforce the most important concepts during exercises, and (b) will be able to come back to these materials later to continue mastering the concepts.
As noted by Greg Wilson (the founder of Software Carpentry), every programming language shares seven core elements:
length
function)The lines between these are often blurry in practice - they are purely a conceptual framework that helps programmers write code that does what they want it to do.
We expect that you'll find the basics of 1 and 2 fairly straightforward. We'll go quickly through those and will spend the most time on items 3-6. We won't really talk about 7, as it is not as common in R programming as it is in, say, shell scripting (pipes and redirection).
Don't worry if you don't already know what all of the above examples mean - you'll know by the end of this lesson.
As we go through this lesson, you can ask questions in two ways:
R is a versatile, open source programming language that was specifically designed for data analysis. As such R is extremely useful both for statistics and data science and was inspired by the programming language S. Here are some fun facts to usher you into the R community.
To open RStudio, click on the RStudio icon in the Applications (if you are on a Mac) or in your Programs if you are on Windows.
There are four windows in RStudio that we will refer to throughout the workshop
For all the work in this workshop, we will be typing code in the R script and then executing it in the R console. For simple commands (e.g. 2 + 2
) this may seem stupid, but writing in a script will help all of your work be reproducible! Think of the script as your lab notebook.
TIP: Some helpful R studio shortcuts
- Run the current line of selection
- Windows:
Ctrl-Enter
- Mac:
Command-Enter
- Source the entire script
- Windows:
Ctrl-Shift-Enter
- Mac:
Command-Shift-Enter
The most basic component of any programming language are "things", also called variables or objects.
The most common basic "things" in R are integers and doubles (i.e. numerics), characters, logicals, and some special objects of various types (e.g. lists, dataframes, etc.). We'll meet many of these as we go through the lesson.
# A thing
2
# Another thing
"hello"
Things can be stored as variables using <-
. =
also works, but R programmers are picky about <-
a <- 2
b <- "hello"
c <- TRUE # This is case sensitive
# Print the different variables
a
b
c
We can figure out the type of these things using the class
function
class(a)
class(b)
class(c)
Tip: You can also use
typeof
to get an idea of what type a thing is. Note that this will not always return the same information asclass
!
Just storing data in variables isn't much use to us. Right away, we'd like to start performing operations and manipulations on data and variables.
There are three very common means of performing an operation on a thing.
All of the basic math operators work like you think they should for numbers. They can also
do some useful operations on other things, like characters. There are also logical operators that
compare quantities and give back a logical
variable as a result.
# Standard math operators work as expected on numbers
a <- 2
b <- 3
a + b # Addition
a * b # Multiplication
a ^ b # Powers
a / b # Division
Logical operators compare two things. This amounts to asking R a question
x > y
: R, is x greater than y?x == y
: R, is x the same as y?x <= y
: R, is x less than or equal to y?x & y
: R, are both x and y TRUE
?x | y
: R, is either x or y TRUE
?R will either answer TRUE
or FALSE
(1 > 3)
a <- (1 > 3)
(3 == 3)
b <- (3 == 3)
a
b
a | b
a & b
These will be very familiar to anyone who has programmed in any language, and work like you would expect.
# There are thousands of functions that operate on things
typeof(3.3)
round(3.3)
as.integer(3.3)
paste("3.3 rounded is", round(3.3)) # Paste together characters and integers
To find out what a function does, type the ?
before the function name.
?round
round(4.567, digits=1)
Tip: Sometimes the R help can be pretty unhelpful. The best option then (or maybe before you even use the R help, is to Google it!). We instructors Google R help ALL THE TIME! Don't be afraid to do it. Here are a list of sites that we find very helpful
- Stock Overflow
- R Graphics Cookbook by Wynston Chang
- Mixed Effects Models and Extensions in Ecology with R, by Zuur et al., sometimes called "the penguin book"
- The Use R! series is largely available as downloadable ebooks from the UCSB library; they are generally excellent
- Ecological Model and Data in R by Ben Bolker
- GGplot google groups
- GGplot Docs
- And many more…
Many useful functions are in external packages and you need to install them and load them into the R environment. For example, what if we wanted to figure out how to do a negative binomial regression in R?
# Look at the following function
?glm.nb
Hmmm, that probably didn't work for you because the function lives in an external package called MASS
. We need to install package and then load the package.
# Install the package
install.packages("MASS") # MASS has quotes
library(MASS) # MASS does not have quotes
?glm.nb
There are thousands of R packages with many useful functions and datasets! That is one of the huge advantages of R, everyone can contribute.
EXERCISE 1 - Introducing logistic growth
Throughout this lesson, we will successively build towards a program that will calculate the logistic growth of a population of bacteria in a petri dish (or bears in the woods, if you prefer). The exercises will build on each other - if at any time you get behind, you can find the answers to the previous exercises here and catch up.
As a reminder, a commonly used discrete time equation for logistic population growth is
n(t+1) = n(t) + r n(t) [1 - n(t) / K]
where n(t) is the population size at time t, r is the net per capita growth rate, and K is the carrying capacity of the dish/woods.
To get started, write R expressions that do the following:
- Create variables for
r
,K
, andn0
, setting these equal to 0.6, 100, and 10, respectively.- Create the variable
n1
and calculate it's value. Do the same forn2
.- Check the type of
n2
- what is it?- Modify your calculations for
n1
andn2
so that these values are rounded to the nearest integer.Bonus
- Test whether
n2
is larger than 20, and print out a line that says "n2 more than 20: " followed by the answer (either TRUE or FALSE).- Figure out how to ask R whether
n2
is an integer
Once the number of variables that you are interested in starts getting large, working with them all individually starts to get unwieldy. To help stay organized, we can use collections of things.
The most fundamental type of collection is a vector
. A vector is a collection of elements that are most commonly character
, logical
, integer
or numeric
.
You can create an empty vector with vector()
. (By default the mode is logical
). You can be more explicit as shown in the examples below.) It is more common to use direct constructors such as character()
, numeric()
, etc
x <- vector() # Empty vector
# With a length and type
vector("character", length = 10)
character(5) # character vector of length 5
numeric(5) # numeric vector of length 5
logical(5) # Logical vector of length 5
You can easily make vectors that contain data using the c()
function
# Make a vector with numbers
x <- c(1, 2, 3)
x
length(x)
# Make a vector with characters
y <- c("R", "is", "awesome")
y
# What happens here
z <- c(1, 2, "hello")
z
typeof(z)
Vectors contain "things" of the same type!
You can easily add elements to vectors as well
# Make an empty vector
x <- numeric()
# Add 2 to x
x <- c(x, 2)
x
y <- 1:10 # Make a numeric vector with numbers 1 - 10
y
# Add y to x
z <- c(x, y)
z
You can then look at specific things in the vector by specifying the index
# Look at the first thing. Index 1
z[1]
# Look at the 2-5 things. Indices 2 - 5
z[2:5]
# Look at the last thing. Index 11
z[11]
EXERCISE 2 - Storing population size in a vector
Reuse your code from Exercise 1 and do the following:
Modify your code so that the values of
n0
,n1
, andn2
are stored in a numeric vector and not separate individual variables. HINT: You can start off by declaring an empty list using the syntaxn <- numeric()
, and then append each new calculated value ofnt
to the vector.Get the first and last values in the list, calculate their ratio, and print out "Grew by a factor of" followed by the result.
Bonus
Extract the last value in two different ways: first, by using the index for the last item in the vector, and second, presuming that you do not know how long the vector is.
Change the values of
r
andK
to make sure that your cell still runs correctly and gives reasonable answers.
Arrays are just vectors, but they can be n-dimensional. If we make a 1 dimensional array it will behave exactly like a vector
# Fill an array with NAs a space holders
x <- array(NA, dim=10)
x
length(x)
x[1]
x <- c(x, 10)
But we could make multidimensional arrays as well.
# Make a 2-D array
x <- array(NA, dim = c(5, 5))
x
dim(x) # Check the dimensions of the array
We can add elements to this 2-D array just like you would think
# Set element in row 1 and column 3 equal to 3
x[1, 3] <- 3
x
You can also do element-wise arithmetic on arrays (or vectors)
x <- c(1, 2, 3, 4, 5)
x * 2
x / 2
x ** 2
sin(x)
Logical operators also work on arrays and they return arrays containing logical variables.
x > 3
x == 3
You can also use logical arrays to subset arrays. This is called logical indexing. For example, let's get an array that only contains values greater than 2
x <- c(1, 2, 3, 4)
y <- x[x > 2]
y
You can also easily make arrays of number sequences
# Sequence of 1 through 10
a <- 1:10
a
# Sequence from 0 to 10 with 25 equally spaced elements
b <- seq(0, 10, length=25)
b
EXERCISE 3 - Storing data in arrays
Copy your code from Exercise 2 and do the following:
- Pre-allocate an array containing 100 blank space (i.e. NAs) as if we were going to fill in 100 time steps. Modify your code from Exercise 2 so that you fill your the first 4 elements of your array using array indexing rather than appending.
- Imagine that each discrete time step actually represents 0.25 of an hour. Create an array
t
storing 100 time step, in hours, from 0 to 24.75. For example,t[1]
should be 0t[2]
should be 0.25, etc.- Use logical indexing to extract the value of
n
corresponding to at
of 0.5.
So far, everything that we've done could, in principle, be done by hand calculation. In this section and the next, we really start to take advantage of the power of programming languages to do things for us automatically.
We start here with ways to repeat yourself. The two most common ways of doing this are known as for loops and while loops. For loops in R are useful when you want to cycle over all of the items in a collection (such as all of the elements of an array), and while loops are useful when you want to cycle for an indefinite amount of time until some condition is met. We will focus on for loops in this workshop.
# A basic for loop in R!
wordlist <- c('hi', 'hello', 'bye')
for(word in wordlist){
print(word)
}
You can sum a collection of numbers with a loop (though this is very inefficient!)
# Sum all of the values in a collection using a for loop
numlist <- c(1, 4, 77, 3)
total <- 0
for(num in numlist){
total <- total + num
}
total
We often want to loop over indexes of collections
wordrange <- 1:length(wordlist)
wordrange
for(i in wordrange){
print(wordlist[i])
}
Once we start looping through large numbers of values it is often better to plot the data. We will cover this in much more detail, but here is a quick example.
x <- 1:100
y <- x^2
plot(y, type='l') # Make a line plot
EXERCISE 4 - Using loops to repeat calculations
FINALLY, let's get smart about our calculations of
nt
. Copy your code from Exercise 3 into the box below, and do the following:
Write a for loop to fill in the values of
nt
for 100 time steps. HINT: You will need to create an array of the step numbers using a command likesteps <- 2:100
. (Why does this array start at 2 and not at 1?). Then, loop over the values of thesteps
, and use each step value to index the arrayn
.- Plot the array
n
.- Play around with the values of
r
andK
and see how it changes the plot. What happens if you setr
to 1.9 or 3?
Often we want to check if a condition is TRUE
and take one action if it is, and another action if the
condition is FALSE
. We can achieve this in R with an if statement.
TIP: You can use any expression that returns a boolean value (TRUE
or FALSE
) in an if statement.
Common boolean operators are ==, !=, <, <=, >, >=.
# A simple if statement
x <- 3
if(x > 0){
print('x is positive')
} else if(x < 0){
print('x is negative')
} else{
print('x is zero')
}
If statements can also rely simply on logical variables
# If statements can rely on boolean variables
x <- -1
test <- x > 0
typeof(test)
test
# You can also do an if statement in one line without the brackets
if(test) print('Test was true')
Deterministic models are boring, so let's introduce some element of randomness into our logistic growth model. We'll model a simple "catastrophe" process, in which a catastrophe happens in 10% of the time steps that reduces the population back down to the size at n0. Copy your code from Exercise 4 into the box below, and do the following:
- Inside your for loop, add a variable called
cata
, for catastrophe, that will beTRUE
if a catastrophe has occurred, andFALSE
if it hasn't. A simple way to do this is to generate a random number usingrunif(1)
, which will give you a random number between 0 and 1. Check whether this number is less than 0.1 - this check will beTRUE
10% of the time.- Using your logical variable
cata
, add an if statement to your for loop that checks whethercata
is true in each time step. If it is true, set the population back to the size at n[0]. Otherwise, perform the usual logistic growth calculation.- Plot your results. Run the code again to see a different growth trajectory.
Bonus
- Now that you have the array
n
, count the number of time steps in which the population was above 50. Although you can do this with a for loop (loop through each value ofnt
, check if it is > 50, and if so increment a counter), you can do this in one line with a simple logical operation. HINT: If you take the sum of a logical array (usingsum()
), it will give you the number ofTRUE
values (since aTRUE
is considered to be a 1, and False is a 0).
One way to write a program is to simply string together commands, like the ones described above, in a long file, and then to run that file to generate your results. This may work, but it can be cognitively difficult to follow the logic of programs written in this style. Also, it does not allow you to reuse your code easily - for example, what if we wanted to run our logistic growth model for several different choices of initial parameters?
The most important ways to "chunk" code into more manageable pieces is to create functions and then to gather these functions into modules, and eventually packages. The R packages that you download from CRAN essentially contain collections of functions, though they also contain datasets and high level chunks called objects. Below we will discuss how to create functions in R. Functions are good for making code more reusable, readable, and maintainable.
We've been using functions all day
# Some examples of R functions
x <- 3.333333
round(x, 2)
sin(x)
Creating a function can be cognitively difficult your first time. The first step is to define a function. Here is how we define a function in R.
fahr_to_celsius <- function(temp) {
celsius <- (temp - 32) * (5 / 9)
return(celsius)
}
A function has a few crucial parts
fahr_to_celsius
)temp
)celsius
). One feature unique to R is that the return statement is not required. R automatically returns whichever variable is on the last line of the body of the function. Next, we get to use the function. You pass in arguments to the function.
# Freezing point of water
fahr_to_celsius(32)
# Boiling point of water
fahr_to_celsius(212)
You often want to document your function to describe what it does. You can do that with comments
fahr_to_celsius <- function(temp) {
# Function takes in the argument temp in Fahrenheit and converts
# it to Celsius
celsius <- (temp - 32) * (5 / 9)
return(celsius)
}
Functions can also have default parameters, which don't need to be passed as arguments when the function is called.
say_hello <- function(time, people){
# Function returns a pleasant greeting
return(paste('Good', time, people))
}
say_hello('afternoon', 'friends')
What happens when you try this
say_hello('afternoon')
You must pass in two arguments because that is how you defined the function!
Let's now give people
a default value. In the example below, people will now have the value of world
unless we explicitly specify otherwise.
say_hello <- function(time, people='world'){
# Function returns a pleasant greeting
return(paste('Good', time, people))
}
# Default is returned
say_hello('afternoon')
say_hello('afternoon', 'students')
EXERCISE 6 - Creating a logistic growth function
Finally, let's turn our logistic growth model into a function that we can use over and over again. Copy your code from Exercise 5 into the box below, and do the following:
- Turn your code into a function called
logistic_growth
that takes four arguments:r
,K
,n0
, andp
(the probability of a catastrophe). Makep
a default parameter with a default value of 0.1. Have your function return then
array.- Write a nice comment describing what your function does.
- Call your function with different values of the parameters to make sure it works. Store the returned value of
n
and make a plot from it.Bonus
- Refactor (i.e. make more readable and reusable) your function by pulling out the line that actually performs the calculation of the new population given the old population. Make this line another function called
grow_one_step
that takes in the old population,r
, andK
, and returns the new population. Have yourlogistic_growth
function use thegrow_one_step
function to calculate the new population in each time step.