# stable version from CRAN:
install.packages("refset")
# development version from github:
library(devtools)
install_github("hughjonesd/refset")
library(refset)
employees <- data.frame(
  id=1:4,
  name=c("James", "Sylvia", "Meng Qi", "Luis"),
  age=c(28,44,38, 23),
  gender=factor(c("M", "F", "F", "M")),
  stringsAsFactors=FALSE)
refset(rs, employees[1:2,])
rs
##   id   name age gender
## 1  1  James  28      M
## 2  2 Sylvia  44      F
employees$name[1] <- "Jimmy"
rs
##   id   name age gender
## 1  1  Jimmy  28      M
## 2  2 Sylvia  44      F
rs$age <- c(29, 45)
employees$age
## [1] 29 45 38 23
ss <- rs
employees$name[2] <- "Silvia"
rs$name[2]
## [1] "Silvia"
ss$name[2]
## [1] "Sylvia"
refset(rs2, rs$id)
rs2
## [1] 1 2
rs$id <- rs$id + 1000
rs2
## [1] 1001 1002
rs2 <- 101:102
employees$id
## [1] 101 102   3   4
# the multi-argument form. Note the empty argument, to select all columns:
refset(rsd, employees, age < 30, , drop=FALSE)
rsd
##    id  name age gender
## 1 101 Jimmy  29      M
## 4   4  Luis  23      M
employees$age <- employees$age + 1
rsd
##   id name age gender
## 4  4 Luis  24      M
vec <- 1:10
refset(rs, vec, 4:6)
rs <- rs*10
vec
## [1]  1  2  3 40 50 60  7  8  9 10
lst <- list(a="text", b=42, NA)
refset(rsl, lst$b)
rsl <- "more text"
lst$b
## [1] "more text"
rs %r% employees[1:3,] # equivalent to refset(rs, employees[1:3,])
Use wrapset to create a parcel:
f <- function(x) {
  cx <- contents(x)
  contents(x)$name <- paste(cx$name, "the", sample(c("Kid", "Terrible", "Silent",
        "Fair"), nrow(cx), replace=TRUE))
}
parcel <- wrapset(employees[])
f(parcel)
employees
##    id                 name age gender
## 1 101   Jimmy the Terrible  30      M
## 2 102      Silvia the Fair  46      F
## 3   3 Meng Qi the Terrible  39      F
## 4   4      Luis the Silent  24      M
Normally, R uses “pass by value”. This means that when you run
b <- a
you have two independent copies of the same data.
Similarly, the code:
f <- function(x) {x <- x*2}
a <- 4
f(a)
a
## [1] 4
does not change the value of a, since the function f gets passed the contents of a rather than the variable a itself.
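The claim about b <- a can be checked directly. The snippet below is a small base R sketch with made-up values. R actually delays the copy until one of the objects is modified, but the visible behaviour is that of two independent copies:

```r
# After assignment, b holds its own value: changing b leaves a untouched.
a <- c(1, 2, 3)
b <- a
b[1] <- 100
a
## [1] 1 2 3
b
## [1] 100   2   3
```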
This is fine for most cases, especially for traditional uses of R in which the programmer or statistician passes in a value to a function, and sees the result on the command line. However, in some cases we would like to work with a single object, rather than multiple copies.
The refset package allows you to do this, by creating objects that refer to other objects, or subsets of them.
To create a refset, call refset with two arguments:
dfr <- data.frame(x1=1:5, x2=rnorm(5), alpha=letters[1:5])
refset(rs, dfr[dfr$x1 <= 3, c("x1", "alpha")])
The call above creates a new variable rs in your environment. (Strictly, it creates a new binding, but we needn’t worry about that for now.) For comparison, we’ll also create a standard subset.
ss <- dfr[dfr$x1 <= 3, c("x1", "alpha")]
rs
##   x1 alpha
## 1  1     a
## 2  2     b
## 3  3     c
ss
##   x1 alpha
## 1  1     a
## 2  2     b
## 3  3     c
rs and ss look and behave just the same:
c(class(rs), class(ss))
## [1] "data.frame" "data.frame"
c(mean(rs$x1), mean(ss$x1))
## [1] 2 2
To see the difference, let’s change the data in dfr:
dfr$alpha <- c(NA, letters[23:26])
rs
##   x1 alpha
## 1  1  <NA>
## 2  2     w
## 3  3     x
ss
##   x1 alpha
## 1  1     a
## 2  2     b
## 3  3     c
As is normal, ss has not updated to reflect changes in the original data frame. But rs has.
The connection also works the other way, if you change rs.
rs$alpha <- LETTERS[1:3]
rs
##   x1 alpha
## 1  1     A
## 2  2     B
## 3  3     C
dfr
##   x1         x2 alpha
## 1  1 -0.7411191     A
## 2  2 -0.1082759     B
## 3  3  0.1954485     C
## 4  4 -0.5963817     y
## 5  5  0.3462606     z
Everything that you do to rs will be reflected in the original data, and vice versa. Well, almost everything: remember that rs refers to a subset of the data. If you can’t do it to a subset, you probably can’t do it to a refset. For example, changing the names of a refset doesn’t work, because assigning to the names of a subset of your data doesn’t change the original names.
There are three ways to create a refset. The first you have already seen: call refset(name, data[indices]) where name is the name of the variable you want to create, and data[indices] is the subset you want to look at. You aren’t limited to using data frames. You can refset any object which you can subset, and you can use any of the three standard ways to subset data: $, [[ and [.
vec <- 1:10
refset(rvec, vec[2:3])
mylist <- list(a="some", b="more", c="data")
refset(rls, mylist$b)
refset(rls2, mylist[["c"]])
rvec
## [1] 2 3
c(rls, rls2)
## [1] "more" "data"
However, this won’t work:
myss <- subset(dfr, x1>1)
refset(rs, myss)
## Error in substitute(data)[[1]]: object of type 'symbol' is not subsettable
You have to specifically write out the subset you want: you can’t put it in a variable.
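As a sketch of the working alternative, put the subsetting expression itself inside the call. The data frame scores and the refset name high are made up for this illustration. Because the refset keeps the expression you typed rather than a stored result, the row selection is recomputed when the parent changes:

```r
library(refset)

# Hypothetical example data, not part of the package.
scores <- data.frame(player = c("ann", "bob", "cyd"),
                     points = c(1, 5, 9),
                     stringsAsFactors = FALSE)

# Write the subset inside the call instead of storing it in a variable first.
refset(high, scores[scores$points > 1, ])
high$player
## [1] "bob" "cyd"

# Raising ann's points brings her into the refset.
scores$points[1] <- 7
high$player
## [1] "ann" "bob" "cyd"
```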
The second way to call refset is using the %r% infix operator. This is conveniently short, and also makes it clearer that you are assigning to a variable.
top4 %r% dfr[1:4,]
exists("top4")
## [1] TRUE
The last way to create a refset is the 3-or-more argument form of the function. This works like the subset command in base R: you can refer to data frame columns by name directly.
refset(large, dfr, x2 > 0,)
large
##   x1        x2 alpha
## 3  3 0.1954485     C
## 5  5 0.3462606     z
Notice that we’ve included an empty argument. This is just the same as when you call dfr[dfr$x2 > 0, ] with an empty argument after the comma: it includes all the columns.
Refsets don’t just sync their data with their “parent”. They also update their indices dynamically. For example, suppose we have a database of employees, including hours worked in the past month.
employees <- data.frame(
  id=1:4,
  name=c("James", "Sylvia", "Meng Qi", "Luis"),
  age=c(28,44,38, 23),
  gender=factor(c("M", "F", "F", "M")),
  hours=c(160, 130, 185, 145),
  pay=c(60000, 50000, 70000, 60000),
  stringsAsFactors=FALSE)
We can create a refset of employees who worked overtime:
overtimers %r% employees[employees$hours > 140,]
overtimers
##   id    name age gender hours   pay
## 1  1   James  28      M   160 60000
## 3  3 Meng Qi  38      F   185 70000
## 4  4    Luis  23      M   145 60000
When the new monthly data comes in, the set of people in overtimers will change:
employees$hours <- c(135, 150, 70, 145)
overtimers
##   id   name age gender hours   pay
## 2  2 Sylvia  44      F   150 50000
## 4  4   Luis  23      M   145 60000
Sometimes you may wish to turn this behaviour off. For example, you may want to look at a particular subset that had a certain characteristic at a point in time. For this, use the argument dyn.idx=FALSE to refset.
# people who worked long hours last month:
refset(overtimers_static, employees, hours > 140, , dyn.idx=FALSE)
# give them a holiday...
overtimers_static$hours <- 0
# ... and a pay rise
overtimers_static$pay <- overtimers_static$pay * 1.1
overtimers_static
##   id   name age gender hours   pay
## 2  2 Sylvia  44      F     0 55000
## 4  4   Luis  23      M     0 66000
Without the dyn.idx=FALSE argument, the refset would have zero rows after the call setting hours to 0.
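To compare the two behaviours side by side, here is a small sketch that builds a dynamic and a static refset from the same condition. The stock data frame and both refset names are invented for this example, and the output shown is what we would expect from the behaviour described above:

```r
library(refset)

# Hypothetical inventory data.
stock <- data.frame(item = c("nuts", "bolts", "nails"),
                    qty = c(5, 50, 8),
                    stringsAsFactors = FALSE)

# Same condition twice: one refset re-evaluates it, the other fixes the rows.
refset(low_dynamic, stock, qty < 10, )
refset(low_static, stock, qty < 10, , dyn.idx = FALSE)

# New quantities arrive.
stock$qty <- c(40, 3, 8)

low_dynamic$item   # rows that are low now
## [1] "bolts" "nails"
low_static$item    # rows that were low when the refset was created
## [1] "nuts"  "nails"
```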
If you want to break the link to the parent dataset, simply assign your refset to a new variable.
copy <- overtimers
copy$pay <- copy$pay * 2
employees$pay # still the same :/
## [1] 60000 55000 70000 66000
Refsets are implemented using an R feature called “active binding”, which calls a function when you access or change a variable. Reassigning to a new variable reassigns the contents, rather than the binding.
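To get a feel for the mechanism, here is a hand-rolled active binding in base R. This is only a sketch of the idea, not the package’s actual implementation, and the names vec, view and plain are made up. The function runs with no argument when view is read, and with the new value when view is assigned to:

```r
vec <- 1:5

# An active binding that behaves like a window onto vec[2:3].
makeActiveBinding("view", function(value) {
  if (missing(value)) {
    vec[2:3]             # read: recompute from the parent
  } else {
    vec[2:3] <<- value   # write: push the new values into the parent
  }
}, environment())

vec[2] <- 20L
view
## [1] 20  3

view <- c(7L, 8L)
vec
## [1] 1 7 8 4 5

# Assigning to a new variable copies the current value, not the binding.
plain <- view
plain[1] <- 0L
vec
## [1] 1 7 8 4 5
```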
This causes a problem if you want to pass a reference into a function, rather than the value of the refset: for example, if you would like to change the refset in the body of the function, and have this affect the original data. When you pass a refset as a function argument, R binds the argument to the refset’s current value, breaking the link with the parent.
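A short sketch of the problem, using an invented vector prices and a hypothetical function bump. Inside the function, x is an ordinary copy, so the parent is only updated when we assign to the refset itself:

```r
library(refset)

prices <- c(10, 20, 30)
refset(first_two, prices[1:2])

bump <- function(x) {
  x <- x + 1   # modifies the local copy only
  x
}

bump(first_two)
## [1] 11 21
prices                       # the parent is unchanged
## [1] 10 20 30

first_two <- first_two + 1   # assigning to the refset does update the parent
prices
## [1] 11 21 30
```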
If you are writing your own code, you can avoid this problem by creating a refset which is “wrapped” in a parcel object. Parcels simply contain an expression and an environment in which the expression should be evaluated. For example, they can contain the name of a refset. When the contents function is called on a parcel, the expression is reevaluated. Here’s how to write a function that changes the names of our employees:
rs %r% employees[1:3,]
f <- function(x) {
  cx <- contents(x)
  contents(x)$name <- paste(cx$name, "the", sample(c("Kid", "Terrible", "Silent",
        "Fair"), nrow(cx), replace=TRUE))
}
parcel <- wrapset(employees[])
f(parcel)
employees
##   id                name age gender hours   pay
## 1  1      James the Fair  28      M   135 60000
## 2  2 Sylvia the Terrible  44      F     0 55000
## 3  3  Meng Qi the Silent  38      F    70 70000
## 4  4        Luis the Kid  23      M     0 66000
As the above shows, you can assign to contents(parcel) as well as read from it. You can also create a new variable from the parcel by using unwrap_as. Another way to write the function above would be:
f <- function(parcel) {
  unwrap_as(emps, parcel)
  emps$name <- paste(emps$name, "the", sample(c("Kid", "Terrible", "Silent",
        "Fair"), nrow(emps), replace=TRUE))
}
f(parcel)
employees
##   id                         name age gender hours   pay
## 1  1  James the Fair the Terrible  28      M   135 60000
## 2  2 Sylvia the Terrible the Fair  44      F     0 55000
## 3  3   Meng Qi the Silent the Kid  38      F    70 70000
## 4  4      Luis the Kid the Silent  23      M     0 66000
Using parcels is a way to pass references around code. You could also do this using non-standard evaluation (NSE). Parcels have the nice feature that they store the environment where they should be evaluated.
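For comparison, here is roughly what the NSE route could look like in base R. This is a sketch under our own assumptions, and the function name double_in_place is hypothetical. The function captures the expression it was given and evaluates an assignment in the caller’s environment, so it has to work out that environment itself instead of carrying it around as a parcel does:

```r
# Doubles whatever the caller's expression refers to, in place.
double_in_place <- function(x) {
  expr <- substitute(x)    # the unevaluated expression, e.g. a[2:3]
  env <- parent.frame()    # where that expression should be evaluated
  eval(bquote(.(expr) <- .(expr) * 2), env)
  invisible(NULL)
}

a <- 1:3
double_in_place(a[2:3])
a
## [1] 1 4 6
```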
For more information, see the help files for refset and wrap.
The code for refset lives on GitHub.