Skip to content

Mapping R functions to SQL

Introduction

I assume that I have the following data.frame definition

d<-data.frame(a=1:10, b=rnorm(10), c=rnorm(10))

with output similar to

    a          b          c
1   1 -1.4340611  1.3757397
2   2  0.1826867  1.4184245
3   3 -0.6749343  0.2227527
4   4 -2.1024966 -1.4681461
5   5 -1.1555581 -0.4754581
6   6  0.4927246  1.5400675
7   7  1.7117311 -0.2342912
8   8  1.5880962 -0.5721555
9   9  0.5862021 -1.1019397
10 10  1.3399340  0.6076585

subset

The subset base R function operates on the rows of the data frame to select only the rows that will be given as input to the select argument which is an expression indicating which column to keep.

Example

For the rows that have b positive return a data frame consisting of the a and c columns:

subset(d, b>0, c(a,c))

The equivalent SQL statement is

SELECT a, c from d WHERE b>0

transform

This function takes a data frame and additional arguments of the form tag=value. The tags are matched against the names of the data frame, and for those that match, the value replace the corresponding variable in the data frame, and the others are appended to it.

Example

transform(d, a=a*c, aa=1)

which is equivalent to the SQL query

SELECT d.*, a*c as a, 1 as aa
FROM d

aggregate

This function is used to return scalar data summaries for one or more columns of a data frame or matrix. The first argument to aggregate is a data frame or matrix containing the variables to be summarized, the second argument is a list containing the variables to be used for grouping, and the third argument is the function to be used to summarize the data. The grouping list determines how the data are split into subsets and then each column in these subsets is given to the aggregation function (third argument). This function, being an aggregate function, should always return a scalar.

Example

To compute the sums of the odd and even groups do:

aggregate(d, list(odd=d$a %% 2), sum)

which gives

   odd  a          b          c
1   0 30  1.5009449  1.5258489
2   1 25 -0.9666203 -0.2131966

In SQL this is similar to

SELECT odd, sum(a), sum(b), sum(c)
FROM (SELECT *, a % 2 as odd
      FROM d)
GROUP BY odd

See also

  • Christopher Bare has a series of related blog posts.
  • doBy offers a bunch of methods from grouping and summarizing data using a formula based interface.
  • sqldf offers an SQL interface for data frames manipulation.
  • And (finally!) plyr which provides an array of split/apply/combine-like methods from lists, arrays, data frames, etc.