Skip to content

Mapping R functions to SQL

Introduction

I assume that I have the following data.frame definition

d<-data.frame(a=1:10, b=rnorm(10), c=rnorm(10))

with output similar to

    a          b          c
1   1 -1.4340611  1.3757397
2   2  0.1826867  1.4184245
3   3 -0.6749343  0.2227527
4   4 -2.1024966 -1.4681461
5   5 -1.1555581 -0.4754581
6   6  0.4927246  1.5400675
7   7  1.7117311 -0.2342912
8   8  1.5880962 -0.5721555
9   9  0.5862021 -1.1019397
10 10  1.3399340  0.6076585

subset

The subset base R function operates on the rows of the data frame to select only the rows that will be given as input to the select argument which is an expression indicating which column to keep.

Example

For the rows that have b positive return a data frame consisting of the a and c columns:

subset(d, b>0, c(a,c))

The equivalent SQL statement is

SELECT a, c from d WHERE b>0

transform

This function takes a data frame and additional arguments of the form tag=value. The tags are matched against the names of the data frame, and for those that match, the value replace the corresponding variable in the data frame, and the others are appended to it.

Example

transform(d, a=a*c, aa=1)

which is equivalent to the SQL query

SELECT d.*, a*c as a, 1 as aa
FROM d

aggregate

This function is used to return scalar data summaries for one or more columns of a data frame or matrix. The first argument to aggregate is a data frame or matrix containing the variables to be summarized, the second argument is a list containing the variables to be used for grouping, and the third argument is the function to be used to summarize the data. The grouping list determines how the data are split into subsets and then each column in these subsets is given to the aggregation function (third argument). This function, being an aggregate function, should always return a scalar.

Example

To compute the sums of the odd and even groups do:

aggregate(d, list(odd=d$a %% 2), sum)

which gives

   odd  a          b          c
1   0 30  1.5009449  1.5258489
2   1 25 -0.9666203 -0.2131966

In SQL this is similar to

SELECT odd, sum(a), sum(b), sum(c)
FROM (SELECT *, a % 2 as odd
      FROM d)
GROUP BY odd

See also

  • Christopher Bare has a series of related blog posts.
  • doBy offers a bunch of methods from grouping and summarizing data using a formula based interface.
  • sqldf offers an SQL interface for data frames manipulation.
  • And (finally!) plyr which provides an array of split/apply/combine-like methods from lists, arrays, data frames, etc.

Estimating effort

In producing estimates anchoring could be of good use.

The thought of the day

When a middleware techonology starts to be the favourite one in EU funded projects, it is near its death and ultimate demise.

Zubin’s “controversial” ideas

I just found the following interesting opinions of Zubin Ghahramani in the presentation he gave at the Machine Learning Summer School 2005:

  • I have no idea why anyone would want to use non-subjective priors. Objective priors are fraught with inconsistencies and no modeling is truly objective anyway. If you want robustness make sure your prior captures a wide range of reasonable outcomes and use decision theory to capture your losses.
  • Bayesian methods don’t over fit, because they don’t fit anything! Approximate Bayesian methods can have failure modes that look like overfitting.
  • Anything you can do easily with an SVM you can do with a Gaussian Process better.
  • Learning theory is useful to analyze bounds on the performance of algorithms but I’m not sure it should be used to design algorithms.
  • Algorithms should be designed to be sensible given the problem at hand, ignoring prior knowledge seems very silly.
  • Well designed MCMC methods can sometimes be much faster and perform better than optimization algorithms.
  • MAP methods, i.e. using a log prior as a regularizer, are not Bayesian.