Introduction
I assume that I have the following data.frame definition
d<-data.frame(a=1:10, b=rnorm(10), c=rnorm(10))
with output similar to
a b c 1 1 -1.4340611 1.3757397 2 2 0.1826867 1.4184245 3 3 -0.6749343 0.2227527 4 4 -2.1024966 -1.4681461 5 5 -1.1555581 -0.4754581 6 6 0.4927246 1.5400675 7 7 1.7117311 -0.2342912 8 8 1.5880962 -0.5721555 9 9 0.5862021 -1.1019397 10 10 1.3399340 0.6076585
subset
The subset base R function operates on the rows of the data frame to select only the rows that will be given as input to the select argument which is an expression indicating which column to keep.
Example
For the rows that have b positive return a data frame consisting of the a and c columns:
subset(d, b>0, c(a,c))
The equivalent SQL statement is
SELECT a, c from d WHERE b>0
transform
This function takes a data frame and additional arguments of the form tag=value. The tags are matched against the names of the data frame, and for those that match, the value replace the corresponding variable in the data frame, and the others are appended to it.
Example
transform(d, a=a*c, aa=1)
which is equivalent to the SQL query
SELECT d.*, a*c as a, 1 as aa FROM d
aggregate
This function is used to return scalar data summaries for one or more columns of a data frame or matrix. The first argument to aggregate is a data frame or matrix containing the variables to be summarized, the second argument is a list containing the variables to be used for grouping, and the third argument is the function to be used to summarize the data. The grouping list determines how the data are split into subsets and then each column in these subsets is given to the aggregation function (third argument). This function, being an aggregate function, should always return a scalar.
Example
To compute the sums of the odd and even groups do:
aggregate(d, list(odd=d$a %% 2), sum)
which gives
odd a b c 1 0 30 1.5009449 1.5258489 2 1 25 -0.9666203 -0.2131966
In SQL this is similar to
SELECT odd, sum(a), sum(b), sum(c)
FROM (SELECT *, a % 2 as odd
FROM d)
GROUP BY odd
See also
- Christopher Bare has a series of related blog posts.
- doBy offers a bunch of methods from grouping and summarizing data using a formula based interface.
- sqldf offers an SQL interface for data frames manipulation.
- And (finally!) plyr which provides an array of split/apply/combine-like methods from lists, arrays, data frames, etc.