Skip to contents

The Gaussian Process (GP) combines a highly flexible non-linear regression approach with rigorous handling of uncertainty. A key feature of this approach is that, while it can be used to produce a conditional expectation function representing the mode of a posterior conditional distribution, it does not ``choose a particular model fit’’ and construct uncertainty estimates conditional on putting full faith in that model. This is valuable because the uncertainty estimates reflect the lesser knowledge we have at locations near, at, or beyond the edge of the observed data, where results from other approaches would become highly model-dependent. We first offer an accessible explanation of GPs, and provide an implementation more suitable to social science inference problems, which reduces the number of user-chosen hyperparameters from three to zero. We then illustrate the settings in which GPs can be most valuable: those where conventional approaches have poor properties due to model-dependency/extrapolation in data-sparse regions. Specifically, we demonstrate the usefulness of GPs in contexts where (i) treated and control models are needed by these groups have poor covariate overlap; (ii) regression discontinuity, which depends on model estimates taken at or just beyond the edge of their supporting data; and (iii) interrupted time-series designs, where models are fitted prior to an event by extrapolated after it.

Installation

You can install the latest version by running:

devtools::install_github('doeun-kim/gpss')

Troubleshooting installation

This version uses Rcpp extensively for speed reasons. These means you need to have the right compilers on your machine.

Windows

If you are on Windows, you will need to install RTools if you haven’t already. If you still are having difficulty with installing and it says that the compilation failed, try installing it without support for multiple architectures:

devtools::install_github('doeun-kim/gpss', args=c('--no-multiarch'))

Mac OSX

In order to compile the C++ in this package, RcppArmadillo will require you to have compilers installed on your machine. You may already have these, but you can install them by running:

xcode-select --install

You may need to install the armadillo software to compile the package. This can be done using the Homebrew package manager.

brew install armadillo

If you are having problems with this install on Mac OSX, specifically if you are getting errors with either lgfortran or lquadmath, then try open your Terminal and try the following:

curl -O http://r.research.att.com/libs/gfortran-4.8.2-darwin13.tar.bz2
sudo tar fvxz gfortran-4.8.2-darwin13.tar.bz2 -C /

Also see section 2.16 here

Examples

Formula interface

library(gpss)
data(lalonde)
# categorical variables must be encoded as factors
dat <- transform(lalonde, race_ethnicity = factor(race_ethnicity))
# train and test sets
idx <- sample(seq_len(nrow(dat)), 500)
dat_train <- dat[idx, ]
dat_test <- dat[-idx, ]
# sample of data for speed
mod <- gpss(re78 ~ nsw + age + educ + race_ethnicity, data = dat_train)

#you can use print() and summary() with a gpss object:
#print(mod)
#summary(mod)

# predictions in the test set
p <- predict(mod, dat_test)
length(p)
#> [1] 6525
head(p)
#>            fit       lwr      upr
#> [1,] 22408.906 20284.161 24533.65
#> [2,] 17947.976 12033.285 23862.67
#> [3,] 17501.271 14675.931 20326.61
#> [4,]  8826.611 -2288.528 19941.75
#> [5,] 22408.906 20284.161 24533.65
#> [6,] 15825.048 13109.529 18540.57

Matrix interface

library(gpss)
data(lalonde)
cat_vars <- c("race_ethnicity", "married")
all_vars <-  c("age","educ","re74","re75","married", "race_ethnicity")

X <- lalonde[,all_vars]


Y <- lalonde$re78
D <- lalonde$nsw

X_train <- X[D==0,]
Y_train <- Y[D==0]

X_test <- X[D == 1,]
Y_test <- Y[D == 1]

gp_train.out <- gp_train(X = X_train, Y = Y_train, optimize=TRUE,
                         mixed_data = TRUE, 
                         cat_columns = cat_vars)

gp_predict.out <- gp_predict(gp_train.out, X_test)

Optional: Accelerate R (faster matrix operations in R)

Mac users can see a significant speed up (5-10x) by using Apple’s native Accelerate BLAS library (vecLib). Upgrade to the latest version of R and RStudio, then follow the steps outlined here:

cd /Library/Frameworks/R.framework/Resources/lib

# for vecLib use
ln -sf libRblas.vecLib.dylib libRblas.dylib

For Window Users and to learn more details regarding installation and reversion, please see this blog post.

Here is another Blog Post that explains about the benefits of using Accelerate BLAS libraries: MacOS and MacOS with M1.