Aims
This is not a course for learning R. The aim of this tutorial is to offer a very short introduction so that you have a basic grounding as we move forward. In this tutorial, we will introduce:
- objects in R
- functions in R
- data structures in R
If you would like to develop your skills further (not such a bad idea) there are plenty of excellent online courses and resources available. Recommended elsewhere are the following:
- https://cran.r-project.org/doc/manuals/R-intro.html#Introduction-and-preliminaries
- https://www.datacamp.com/community/open-courses/r-programming-with-swirl
- http://www.burns-stat.com/pages/Tutor/hints_R_begin.html
- http://data.princeton.edu/R/gettingStarted.html
- http://www.ats.ucla.edu/stat/R/sk/
- http://www.statmethods.net/
These sites will help you learn or refresh your memory. But you can also expect to return to Google often as you go and type “R …” as a query. That’s fine, and totally normal. You will find as you do so that answers to most questions are available on fora pages such as StackOverflow and CrossValidated.
Software
For this course, you will need to download and install two pieces of software, R and RStudio, on your system. Since you are completing this tutorial, we assume you have already done so, but here we briefly explain the purpose of each.
What is R?
R is a programming language and environment for statistical computing and graphics. R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License, and provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, …) and graphical techniques, and is highly extensible. This means that anybody can write extensions to R and make them publicly available, such as in the stocnet group of packages…
What is RStudio?
RStudio is an integrated development environment (IDE) for R and Python, enabling researchers to interact with R (and/or Python) through a fully-functional editor with syntax highlighting, direct code execution, autocomplete, and various tools for plotting, history, debugging, package development, and workspace management.
Although you will need to make sure R has been downloaded and installed correctly on the system you are using, in practice you will never open it directly. Instead, you will use RStudio to interact with R. Think of R as the internals of a calculator and RStudio as the case. Let’s start the calculator by opening the ‘calculator case’ app, RStudio.
Getting started
RStudio and R scripts
If we open RStudio and take a look around, we should see a window of four (4) panes. Among them there should be a console: this is where RStudio executes commands in R. You can type commands there yourself (RStudio may help by suggesting autocompletions), but we usually write code in an R script instead, and then tell RStudio when to execute one or more lines from the script. There are basically three reasons for using a script: editing, repetition, and sharing. You can run a command in RStudio by moving the cursor to the line or lines you want to run and then pressing Cmd-Enter (Mac) or Ctrl-Enter (Windows). You can try this with the following lines:
1 + 5 # This will print the result
105 * 99 + 6 # An asterisk is used for multiplication
Note that R won’t execute anything after a comment symbol, #.
Remove the hash symbol at the start of this line to run it:
# 1/5 # this will still be commented...
In an R script you can toggle commenting for one or more lines using Cmd-Shift-C/Ctrl-Shift-C. If you try to run a commented out line, it will continue until it finds the next command.
Beyond a calculator
Ok, wow, R is a calculator! But it is also much, much more than that… Try the following command:
print("Hello World")
You’ve told R to print a string of text (identified by the quotation marks) to the console. Much more flexible than a high school calculator!
It is important to note that R is case-sensitive, i.e. Print("Hello World") will not work. Try it!
Print("Hello World")
This means that james is not the same as JAMES (and Hollway is not the same as Holloway…). In R, we can write such logical statements as:
"James"=="james" # Try also "James"!="james"
# Other logical statements include: ">", ">=", "<=", "<".
# 1 < 5 # Try also "1 <= 5"
Logical values are always either TRUE or FALSE, and can be abbreviated as T or F respectively. Why do we have to use two equals signs and quotation marks? Two equals signs test whether two things are equal, whereas a single equals sign assigns a value. Quotation marks tell R you are referring to a string of text and not a named object.
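As a small sketch of the comparisons described above (tolower() is a base R function not otherwise covered in this tutorial), you could run:

```r
"James" == "james"                    # FALSE, because R is case-sensitive
"James" != "james"                    # TRUE
tolower("James") == tolower("JAMES")  # TRUE once both are in lower case
1 <= 5                                # TRUE
identical(T, TRUE)                    # TRUE, unless T has been overwritten
```

The last line hints at why writing out TRUE and FALSE is safer: T and F are ordinary objects that can be reassigned, whereas TRUE and FALSE cannot.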
Objects
Values
An object is a placeholder R uses for one or more numbers, strings, or other things. You can assign such things to an object using a single = sign, but it’s better to use <- to avoid confusion with the == used in logical statements.
surname <- "Hollway"
y.chromosome <- T # or TRUE
siblings <- 1
age <- NA # This is used for missing information
# Note that these objects then appear in RStudio's environment pane (by default the top right)
You can then recover this information by simply calling these objects:
surname
y.chromosome
siblings
age
And even operate on them:
siblings*3
# Try multiplying the other objects by 3
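If you are unsure what to expect from that exercise, the following sketch shows one way it might go. It redefines the objects from above so that it runs on its own; is.na() is a base R function for testing whether a value is missing.

```r
surname <- "Hollway"
y.chromosome <- TRUE
siblings <- 1
age <- NA

siblings * 3       # 3
y.chromosome * 3   # 3, because TRUE is treated as 1 in arithmetic
age * 3            # NA: a calculation on a missing value is still missing
is.na(age)         # TRUE
# surname * 3      # would stop with an error: text cannot be multiplied
```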
Vectors
We can also concatenate multiple values together using the function c():
lived <- c("New Zealand", "UK", "New Zealand", "Germany", "UK", "Switzerland")
And recall them. Where was the fourth place I lived? We use square brackets, [ ], for indexing:
lived[4]
There are several shortcuts for making a series of values. For example, consecutive numbers can be produced with:
teenageyrs <- 13:19
teenageqrtrs <- seq(13, 19.99, by = 0.25)
# We can recall every third value from this object using a repeating vector
teenageqrtrs
teenageqrtrs[c(FALSE, FALSE, TRUE)]
# teenageqrtrs[c(F, F, T)] # Also works but it is best practice to write out the logic.
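Square brackets accept more than a single position. The sketch below, which rebuilds the two vectors so it runs on its own, illustrates a few common indexing patterns; negative and conditional indexing are not covered above and are shown here only as an aside.

```r
teenageyrs <- 13:19
length(teenageyrs)            # 7
teenageyrs[c(1, 7)]           # the first and last values: 13 19
teenageyrs[-1]                # everything except the first value
teenageyrs[teenageyrs > 17]   # only values meeting a condition: 18 19

teenageqrtrs <- seq(13, 19.99, by = 0.25)
length(teenageqrtrs)                    # 28
# The short logical vector is repeated along the whole object
every.third <- teenageqrtrs[c(FALSE, FALSE, TRUE)]
length(every.third)                     # 9
every.third[1]                          # 13.5
```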
So R can help us store and recall values and even vectors of values, but the key is being able to relate values and vectors together. For that we use objects of more complex classes.
Classes
Matrices
Data can be organised in R in different formats, such as data frames and lists, but the one most commonly used for network research is the matrix. Matrices are created by populating a given number of rows and columns with data. Assigning with <- doesn’t print any output unless you wrap the line in parentheses:
(my.matrix <- matrix(data = 1:9, nrow = 6, ncol = 6))
If you look in the help file, which you can access by putting a ? before the command/function name, you will see that matrix() sets byrow = FALSE by default.
?matrix # Forgot the exact name of the function? Use ?? for search...
This means that it populates the matrix with the data by column by default, but we can populate it by row instead by adding an extra ‘argument’:
(my.2nd.matrix <- matrix(1:9, 6, 6, byrow = T))
We can index cells of a matrix using square brackets with a comma, [ , ]:
my.2nd.matrix[2, 2]
Left of the comma is the row, right of the comma is the column. We can even overwrite particular cells of the matrix by assigning a new value to those indexed cells:
my.2nd.matrix[my.2nd.matrix == 6] <- 600
my.2nd.matrix
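The difference between filling by column and filling by row is perhaps easier to see on a small matrix. This is a sketch with illustrative object names (by.col and by.row are not used elsewhere in this tutorial):

```r
by.col <- matrix(1:9, nrow = 3, ncol = 3)                # filled column by column
by.row <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)  # filled row by row

by.col[1, 2]   # 4: first row, second column
by.row[1, 2]   # 2
by.row[2, ]    # leaving the column blank returns the whole second row: 4 5 6
dim(by.row)    # 3 3

# Filling by row gives the transpose of filling by column
all(by.row == t(by.col))   # TRUE
```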
Data frames
Data frames are like matrices, but can hold different types of variables at once, such as logical, numeric/integer, or string/character variables. Replace the missing data (the NAs) with your details:
mydf <- data.frame(Surname = c("Hollway", NA),
                   Born = c("New Zealand", NA),
                   Siblings = c(1, NA))
You can even add new variables by simply writing a new variable name:
mydf$Dept <- c("IRPS", NA)
Can you call the data frame and print to the console?
mydf
We can recall an observation (row) or variable (column) of the data frame in the same way that we indexed the matrix above, e.g. mydf[2,2], but we can also call a named variable using the $ sign:
mydf$Surname
This can be very handy when “subsetting” the data:
james <- mydf[mydf$Surname == "Hollway", ]
james
Note, however, that data frames must have variables of equal length.
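One thing to watch for when subsetting with a comparison is missing data: comparing NA with anything gives NA, and indexing rows with NA returns a row of NAs. The sketch below uses a small stand-in data frame (people is a hypothetical name) and shows one possible workaround with which(), which keeps only the positions that are TRUE.

```r
people <- data.frame(Surname = c("Hollway", NA),
                     Siblings = c(1, NA))

people$Surname == "Hollway"   # TRUE NA

with.na <- people[people$Surname == "Hollway", ]
nrow(with.na)                 # 2: the match plus a row of NAs

without.na <- people[which(people$Surname == "Hollway"), ]
nrow(without.na)              # 1
```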
Lists
Lists are a more flexible generalisation of data frames.
mylist <- list() # Here we are initialising an empty list
List items can also be named, like data frame variables, but don’t have to be:
mylist$Surname <- c("Hollway", NA)
mylist$Siblings <- c(1, NA) # Now you can add the others from above
You can also add lists to a list:
mylist$Lived <- list(c("New Zealand", "UK", "New Zealand", "Germany", "UK", "Switzerland"), NA)
Note that we’ve been using parentheses, (), here and not brackets, [], as we did when we were indexing. Parentheses are used for functions.
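Brackets still have a role with lists, though. As a brief sketch (double brackets, [[ ]], are not introduced above), list items can be recalled in a few ways that differ in what they return:

```r
mylist <- list(Surname = c("Hollway", NA), Siblings = c(1, NA))

names(mylist)                # "Surname" "Siblings"
length(mylist)               # 2 items

mylist$Surname               # the item itself, by name
mylist[["Surname"]]          # the same item, using double brackets
class(mylist[["Surname"]])   # "character"
class(mylist["Surname"])     # "list": single brackets return a smaller list
```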
Functions
Functions are sets of actions or algorithms that are applied to values, vectors, or objects.
exp(0.09855)
mean(c(1, 5, 8, 7, 6, 4, 22, 1, 0.9))
Arguments
Usually every function must be followed by (). Some functions work without any “arguments” though; that is, with empty parentheses.
ls() # This tells you what objects are in your environment
getwd() # This tells you the directory on your computer R is working in/on
list.files() # This tells you what files are in your working directory
# setwd("...") # You can set the working directory with this function
# (or under session in RStudio)
Compare the above with a function like the following, which enables you to write an object out of R to a path on your hard drive that you specify:
write.csv(x = mydf, file = "~/Desktop/jamesdf.csv")
Two arguments are specified for this function, write.csv(): x and file. But the function can accept other arguments as might be necessary for more complex data, for less common outputs, or in edge cases. See ?write.csv for a list of the different arguments the function will accept. Functions usually include defaults so that they work even if you do not specify all the possible arguments. In fact, we don’t need to write x =, just write.csv(mydf, file = "~/Desktop/jamesdf.csv"). That is because the function expects the object to be written as its first argument, so you only need to name it when you want to be explicit or use a different ordering of the arguments. It is good practice to be explicit wherever possible, though, to avoid unexpected results.
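To see how arguments are matched by position and by name, here is a sketch that writes to a temporary file rather than the Desktop (tempfile() is used here only so that the example runs anywhere; tmp and back are illustrative names):

```r
mydf <- data.frame(Surname = c("Hollway", NA), Siblings = c(1, NA))
tmp <- tempfile(fileext = ".csv")   # a throwaway path for this example

write.csv(x = mydf, file = tmp)     # both arguments named
write.csv(mydf, tmp)                # matched by position: same result
write.csv(file = tmp, x = mydf)     # named arguments can come in any order

# Overriding a default: do not write the row names as an extra column
write.csv(mydf, file = tmp, row.names = FALSE)

back <- read.csv(tmp)               # read the file back in to check it
names(back)                         # "Surname" "Siblings"
```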
Pipes
When working with multiple functions on the same object, we can use the pipe operators %>% or |> to chain consecutive functions and avoid nesting multiple functions in the code. A pipe takes the result of the code on the left of the pipe operator and uses it in whatever function is on the right, or on the next line, of the pipe operator. Note that when piping over multiple lines, the pipe operator should be placed at the end of each line.
example.vector <- c(1, 5, 8, 7, 6, 4, 22, 1, 0.9)
pipe.result.1 <- example.vector |>
  mean()
pipe.result.1
# library(dplyr)
# pipe.result.2 <- example.vector %>%
# mean()
# both pipe operators give the same result
# pipe.result.1 == pipe.result.2
While |> has been the native pipe operator since R v4.1.0, those using earlier versions of R may wish to use %>% from either the {magrittr} or {dplyr} packages. Note that in that case, the package needs to be loaded before you can use the operator.
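The benefit of piping becomes clearer with more than one function. The following sketch assumes R v4.1.0 or later (for the native pipe) and compares a nested call with its piped equivalent; nested and piped are illustrative names.

```r
example.vector <- c(1, 5, 8, 7, 6, 4, 22, 1, 0.9)

# Nested: read from the inside out
nested <- round(sqrt(mean(example.vector)), digits = 2)

# Piped: read from top to bottom
piped <- example.vector |>
  mean() |>
  sqrt() |>
  round(digits = 2)

identical(nested, piped)   # TRUE
piped                      # 2.47
```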
Tasks
- Create and fill in a matrix of “whom you already know” in the class: There are other ways to do this, but for this unit test I’d like you to do it in R. You can follow my example below (copy to a new script and uncomment):
mynetwork <- matrix(0,2,2) # this creates an empty network of 2 people
# Next I'm going to name the matrix rows and columns:
rownames(mynetwork) <- c("James Hollway","Jael Tan")
colnames(mynetwork) <- c("James Hollway","Jael Tan")
mynetwork[1,2] <- 1 # this means I know Jael already
mynetwork[2,1] <- 1 # I think I can say Jael knows me already too...
mynetwork["James Hollway","Jael Tan"] <- 1 # I could also do this by name
# mynetwork[mynetwork > 0] <- 0 # Just in case you make a mistake, this wipes it!
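Once your matrix is filled in, a few quick checks may help you spot mistakes. This is only a sketch using the two-person example above; isSymmetric() and rowSums() are base R functions not otherwise covered in this tutorial.

```r
people <- c("James Hollway", "Jael Tan")
mynetwork <- matrix(0, 2, 2, dimnames = list(people, people))
mynetwork["James Hollway", "Jael Tan"] <- 1
mynetwork["Jael Tan", "James Hollway"] <- 1

isSymmetric(mynetwork)   # TRUE if every tie is recorded in both directions
rowSums(mynetwork)       # how many people each person (row) knows
sum(mynetwork)           # total number of entries set to 1
```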