All Subsets Regression

What is all subsets regression? It’s a model-building technique that takes a set of independent variables \(X_1, \dots, X_i\) and regresses them, in subsets of size \(k\) where \(k \in \{1,2,\dots,i\}\), against the response variable \(Y\). The ‘all’ part of ‘all subsets’ means every combination of \(X_1, \dots, X_i\) drawn \(k\) at a time gets its own model.
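
A quick sanity check on the combinatorics (just a throwaway snippet, not part of the analysis below): with \(i=4\) independent variables there are 15 candidate models in total.

choose(4, 1:4)       # subsets of each size: 4 6 4 1
sum(choose(4, 1:4))  # 15 candidate models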

But…why?

You’re probably familiar with forward, backward, or stepwise model building, where terms are added to (or removed from) a model one at a time while attempting to maximize or minimize some ‘goodness’ criterion. These methods are generally regarded as bad - so why all subsets? Really, it has all the faults of the other methods, but it has at least one advantage: you can use more complex ‘goodness’ criteria with local maxima/minima (which trap the stepwise methods) and still find the global optimum, because every candidate model gets evaluated. So, it’s fishing to the max.

leaps

R has a great package called leaps which implements all subsets regression. But…it doesn’t meet my needs.

All Subsets Regression and Missing Values

Let’s consider a dataset which has missing values. I’ll generate a dummy set of data containing 4 independent variables and one dependent variable that depends on only the first three. Then, I’ll randomly set ~12% of each independent variable to missing.

set.seed(1001)
X1 = c(-1,0,1)
df = expand.grid(X1=X1,X2=X1,X3=X1,X4=X1)
df$Y = df$X1+3*df$X2-0.5*df$X3+rnorm(81,0,1)
df$X1[sample(1:81,10,replace = FALSE)] = NA
df$X2[sample(1:81,10,replace = FALSE)] = NA
df$X3[sample(1:81,10,replace = FALSE)] = NA
df$X4[sample(1:81,10,replace = FALSE)] = NA

We can verify that no rows are completely missing:

any(apply(is.na(df[,1:4]),1,all))
## [1] FALSE
# Explanation
# df %>% 
#   #only consider the first 4 columns
#   .[,1:4] %>% 
#   #convert to a logical data frame checking for missing values
#   is.na() %>% 
#   #this function applies another function row or column wise to the input dataframe or matrix
#   apply(
#     # 1 means rows
#     1, 
#     # Check if all the values are TRUE
#     all) %>% 
#   # now we have a vector of logicals checking if the whole row of df was missing
#   # check if any are TRUE
#   any()

And we can count that there are 50 complete rows (just 62% of the original data).

sum(apply(!is.na(df[,1:4]),1,all))
## [1] 50
# Explanation
# df %>% 
#   #only consider the first 4 columns
#   .[,1:4] %>% 
#    #convert to a logical data frame checking for missing values
#   is.na() %>%
#   #flip the logical values
#   `!` %>% 
#   #this function applies another function row or column wise to the input dataframe or matrix
#   apply(
#     # 1 means rows
#     1,
#     # Check if all the values are TRUE
#     all
#   ) %>% 
#   # now we have a vector of logicals checking if the whole row of df is present
#   # sum this up to get a count
#   sum()

In linear regression, observations with missing values cannot be included because we don’t know what value they take. And sure, there’s missing value imputation - as always in statistics, if you’re willing to make assumptions, whatever you’re doing becomes more statistically powerful, but also more fragile in real life. With missing value imputation you have to make assumptions about why the values are missing, and then you have to generate the values themselves, which requires another set of assumptions. It’s fine to not want to make those assumptions.

So despite the fact that we have 81 rows, we only have 50 rows for fitting a linear regression model which means if we fit the full model we’ll only have \(n-1-k=50-1-4=45\) degrees of freedom.

fullModel = lm(Y~X1+X2+X3+X4,data=df)
summary(fullModel)
## 
## Call:
## lm(formula = Y ~ X1 + X2 + X3 + X4, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4627 -0.7666 -0.0156  0.6580  3.4490 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) -0.02660    0.18046  -0.147               0.8835    
## X1           0.98850    0.22582   4.377            0.0000707 ***
## X2           2.96688    0.22758  13.036 < 0.0000000000000002 ***
## X3          -0.53690    0.22246  -2.413               0.0199 *  
## X4           0.06567    0.21863   0.300               0.7653    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.249 on 45 degrees of freedom
##   (31 observations deleted due to missingness)
## Multiple R-squared:  0.8074, Adjusted R-squared:  0.7902 
## F-statistic: 47.15 on 4 and 45 DF,  p-value: 0.000000000000001547

You can see the DF agree with what we expected. Also, please ignore the fact that the results are nearly perfect - it’s hard to generate dummy data that looks like real world data.

First Subset

The first subset is typically just the first independent variable we encounter, on its own - X1 in our case - but we’ll start with X1+X2 (because it helps with comparing to leaps). We have 61 complete observations across X1 and X2.

sum(apply(!is.na(df[,1:2]),1,all))
## [1] 61

So when we fit the model we expect to have \(n-1-k=61-1-2=58\) degrees of freedom.

firstSubset = lm(Y~X1+X2,data=df)
summary(firstSubset)
## 
## Call:
## lm(formula = Y ~ X1 + X2, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.64333 -0.79388 -0.01199  0.74753  2.85868 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  0.05194    0.16760   0.310                0.758    
## X1           0.87991    0.20895   4.211            0.0000896 ***
## X2           2.88003    0.21237  13.561 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.304 on 58 degrees of freedom
##   (20 observations deleted due to missingness)
## Multiple R-squared:  0.7794, Adjusted R-squared:  0.7718 
## F-statistic: 102.5 on 2 and 58 DF,  p-value: < 0.00000000000000022

This is a pretty good model, and we nailed basic arithmetic! How does this model compare to what we would get if we used only the observations that are complete across all four independent variables?

firstSubset_limited = lm(Y~X1+X2,data=fullModel$model)
summary(firstSubset_limited)
## 
## Call:
## lm(formula = Y ~ X1 + X2, data = fullModel$model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4684 -0.8767 -0.0286  0.7977  3.0501 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) -0.09628    0.18551  -0.519             0.606175    
## X1           0.94976    0.23369   4.064             0.000182 ***
## X2           2.90673    0.23552  12.342 0.000000000000000237 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.301 on 47 degrees of freedom
## Multiple R-squared:  0.7819, Adjusted R-squared:  0.7726 
## F-statistic: 84.26 on 2 and 47 DF,  p-value: 0.0000000000000002869

It’s not very different, though the \(R^2\) values are a little higher for the complete-cases fit. The AIC and BIC are also a little better (lower) for the reduced data set - though note they’re computed on different numbers of observations, so that comparison is loose.

cat("All available observations AIC:    ",AIC(firstSubset),"\nJust the complete observations AIC:",AIC(firstSubset_limited),"\n\nAll available observations BIC:    ",BIC(firstSubset),"\nJust the complete observations BIC:",BIC(firstSubset_limited))
## All available observations AIC:     210.4281 
## Just the complete observations AIC: 173.0897 
## 
## All available observations BIC:     218.8716 
## Just the complete observations BIC: 180.7378

leaps

So what does leaps produce? The regsubsets function lets us specify a formula just like in lm and then it fits all subsets of the specified full model. The summary shows the best model for each subset size \(k\): it selected X2 in the \(k=1\) case, X1+X2 for \(k=2\), X1+X2+X3 for \(k=3\), and the full model for \(k=4\).

library(leaps) #Version 3.0
leapsResults = leaps::regsubsets(Y~X1+X2+X3+X4,data=df)
summary(leapsResults)
## Subset selection object
## Call: regsubsets.formula(Y ~ X1 + X2 + X3 + X4, data = df)
## 4 Variables  (and intercept)
##    Forced in Forced out
## X1     FALSE      FALSE
## X2     FALSE      FALSE
## X3     FALSE      FALSE
## X4     FALSE      FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: exhaustive
##          X1  X2  X3  X4 
## 1  ( 1 ) " " "*" " " " "
## 2  ( 1 ) "*" "*" " " " "
## 3  ( 1 ) "*" "*" "*" " "
## 4  ( 1 ) "*" "*" "*" "*"

The summary object has a lot in it, including BIC, \(R^2_{Adj}\), Mallows’ \(C_p\), and more. It’s important to note that leaps only retains information about the best models selected.

It looks like the way stats::BIC calculates BIC and the way leaps does don’t match up, so that comparison won’t be much use. But look at the second value of \(R^2_{Adj} = 0.7726\) - that’s exactly what we calculated when using only the observations complete across all variables.

leapsSummary = summary(leapsResults)
leapsSummary$bic[2] #second value for the two term model
## [1] -64.40754
leapsSummary$adjr2[2]
## [1] 0.7726352
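
If you want to see everything else the summary carries, the component names are easy to list (the exact contents may vary a bit by leaps version):

names(leapsSummary)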

How much data to use?

Using only the complete observations for all subsets regression is common across several platforms, including R and JMP. But why? Is it more statistically valid to ignore data that could be part of the regression model? Or is it simply more computationally convenient? It seems to me that each model you fit should be fit using all the data available to it - why should my results for a model in X1 and X2 change just because I also collected X3 and X4? I should get exactly the same results when I refit the selected model as I got during model building. So, how can we do all subsets regression using all the data available? And when do we get to purrr?

New Functions

Below I present 3 functions with a limited amount of defensive programming tossed in.

allSubsetsRegression

The main function is allSubsetsRegression, which is too long of a name. It accepts:

* a data frame which should contain only the independent variables to be included and the dependent variable
* a quoted string giving the column name in data of the dependent variable
* optionally, the minimum number of variables to include
* optionally, the maximum number of variables to include (useful if your data is wider than it is tall)

It then makes a call to calcKreg for each subset size.

allSubsetsRegression = function(data,y,minV=1L,maxV){
  if(missing(maxV)) maxV = as.integer(ncol(data)-1)
  if(minV > maxV) stop("Max number of variables must be bigger than the min number of variables")
  if(!is.integer(minV) | !is.integer(maxV)) stop("Min/Max variables must be integers")
  if(!is.data.frame(data)) stop("Data must be provided in a data frame")
  if(!is.character(y)) stop("Response variable 'y' must be provided as a quoted string")
  xnames = names(data)[names(data) != y]
  output = lapply(minV:maxV,calcKreg,data=data,y=y,xnames=xnames)
  class(output) = c(class(output),"subReg")
  return(output)
}

calcKreg

The name looks like “calc Kreg” which sounds like a weird German thing, but it’s actually “calc K reg”, meaning “calculate the \(k^{th}\) regression set”. This requires:

* the provided data
* the quoted string of the dependent variable
* the vector of quoted strings of the independent variables
* k - how big the set is

This leverages the built-in function combn, which returns a matrix whose columns are all the combinations of the independent variables taken \(k\) at a time. calcKreg in turn calls the function genReg on each column.
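
If combn is unfamiliar, here’s a quick illustration of what it returns (a throwaway example, not one of the three functions):

combn(c("X1","X2","X3"), 2)
# each column is one subset of size 2:
#      [,1] [,2] [,3]
# [1,] "X1" "X1" "X2"
# [2,] "X2" "X3" "X3"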

calcKreg = function(data,y,xnames,k){
  runs = combn(xnames, k)
  models = lapply(1:ncol(runs), genReg, data=data, y=y, runs=runs)
  names(models) = apply(runs, 2, paste, collapse='+')
  return(models)
}

genReg

genReg (“generate regression”) requires:

* the provided data
* the quoted string of the dependent variable
* the output of combn
* the column (i.e. the combination) being used this run

It generates a formula by pasting the needed independent variables together with “+” and then returns the full lm object.

genReg = function(data,y,runs,i){
  x = runs[,i]
  form = paste0(y,"~",paste(x,collapse="+"))
  return(lm(form,data=data))
}
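
For example, for the column c("X1","X3") the pasted string is "Y~X1+X3", which lm happily coerces to a formula - that’s also why the stored calls later print as lm(formula = form, data = data).

paste0("Y","~",paste(c("X1","X3"),collapse="+"))
# [1] "Y~X1+X3"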

DANGER, WILL ROBINSON! DANGER!

I’m returning the entire lm object? Well, yeah. If you have stupidly big or stupidly wide data this won’t work well, but then you probably don’t really want to be doing linear regression anyway - I’m just guessing. The advantage of returning the whole object is that it lets me apply whatever ‘goodness’ measure I want after the fact - or even several - to determine which model I want to keep.
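
As a small illustration of that flexibility, the usual criteria can all be pulled from a stored lm fit after the fact - here using the fullModel object from earlier:

# a few 'goodness' measures computed from one saved lm object
c(AIC   = AIC(fullModel),
  BIC   = BIC(fullModel),
  adjR2 = summary(fullModel)$adj.r.squared)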

purrr

purrr is a package from the tidyverse that provides a set of tools for functional programming. It has consistently named map- and reduce-style functions that make code more readable than the base R equivalents. But you’ll notice I used lapply above instead of purrr::map: I’m used to lapply, and I don’t like loading a whole package for simple cases like that. Where I do like using purrr is for more complex cases - like map_depth.

map_depth

lapply takes an object, applies a function to each element, and returns the results in a list. Consider a hierarchical list like the one below.

dumbList = list(z=list(a=c(1,2), b=c(2,3)), y=list(a=c(1,1),b=c(1,4)), x=list(a=c(2,2),b=c(2,0)))

Maybe I want to sum each element so that z$a = 3, y$a = 2, x$a = 4, etc…

lapply(dumbList,sum)
## Error in FUN(X[[i]], ...): invalid 'type' (list) of argument

It doesn’t work because dumbList[[1]] is itself a list, and you can’t sum a list. However, with purrr::map_depth we can specify the depth we want to work at - in this case the second level, so the function operates on the vectors inside the inner lists.

library(purrr) #version 0.2.2
purrr::map_depth(dumbList,2,sum)
## $z
## $z$a
## [1] 3
## 
## $z$b
## [1] 5
## 
## 
## $y
## $y$a
## [1] 2
## 
## $y$b
## [1] 5
## 
## 
## $x
## $x$a
## [1] 4
## 
## $x$b
## [1] 2

This is equivalent to the following code which requires an anonymous function to work.

lapply(dumbList,function(x){
  lapply(x,sum)
})

How does this relate to allSubsetsRegression? Let’s look at the results of running it.

Results of allSubsetsRegression

The function returns a hierarchical list - the top level has one element per subset size \(k\), and the element for size \(k\) contains \(i\) choose \(k\) models.

res = allSubsetsRegression(df,"Y")
length(res) #4 because we had 4 independent variables
## [1] 4
lapply(res,length) # 4, 6, 4, 1 because there are 4 ways to select 1 variable from a group of 4, 6 ways to select 2, 4 ways to select 3 and 1 way to select 4
## [[1]]
## [1] 4
## 
## [[2]]
## [1] 6
## 
## [[3]]
## [1] 4
## 
## [[4]]
## [1] 1

This allows me to easily index into any specific model via res[[k]][["<term combination>"]] - the second-level names are the pasted variable names - and inspect that model. Want to see X1+X2+X3?

res[[3]][["X1+X2+X3"]]
## 
## Call:
## lm(formula = form, data = data)
## 
## Coefficients:
## (Intercept)           X1           X2           X3  
##     0.03424      0.97513      2.99673     -0.63855

Now I want to calculate goodness statistics on each of these so I can determine the best models.

regGoodness

regGoodness (regression goodness) operates on the hierarchical list produced by allSubsetsRegression, applying a function which produces a ‘goodness’ score and then sorting according to whether that score should be minimized or maximized. It uses purrr::map_depth, which keeps the hierarchical structure clean while the models are generated but still makes it easy to compute summary statistics afterward. Note that this leverages the stringr package.

library(stringr) #version 1.2.0

regGoodness = function(models,f,direction){
  if(!(direction %in% c(">","<"))) stop("Invalid direction - must be < or >")
  if(!is.function(f)) stop("f must be a function")
  if(!("subReg" %in% class(models))) stop("models must be generated by allSubsetsRegression")
  score = unlist(purrr::map_depth(models,2,f))
  output = data.frame(N = stringr::str_count(names(score),"\\+")+1,
                      Terms = names(score),
                      Value = score,
                      stringsAsFactors = FALSE)
  rownames(output) = NULL
  if(direction == "<"){
    return(output[order(output$Value),])
  } else {
    return(output[order(output$Value,decreasing = T),])
  }
}

For example, we could get the \(R^2_{Adj}\) for each model by using this function.

asrAR2 = function(model){
  summary(model)$adj.r.squared
}

We pass it into the function thusly and get out a data.frame with summary details about each model fit.

ar2 = regGoodness(res,asrAR2,">")
ar2
##    N       Terms        Value
## 11 3    X1+X2+X3  0.804526215
## 15 4 X1+X2+X3+X4  0.790234207
## 5  2       X1+X2  0.771840082
## 12 3    X1+X2+X4  0.768256130
## 8  2       X2+X3  0.747873190
## 14 3    X2+X3+X4  0.727369827
## 2  1          X2  0.723193708
## 9  2       X2+X4  0.709493147
## 6  2       X1+X3  0.094866148
## 1  1          X1  0.055237747
## 7  2       X1+X4  0.048400898
## 13 3    X1+X3+X4  0.042796504
## 3  1          X3  0.003718353
## 4  1          X4 -0.013302437
## 10 2       X3+X4 -0.028144430
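
Because regGoodness takes any scoring function, swapping criteria is a one-liner - for example, ranking the same models by BIC, which we want to minimize (output not shown):

bicRank = regGoodness(res, BIC, "<")
head(bicRank)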

Coefficient Stability

We can also analyze how stable our coefficient estimates are as we vary what is in or out of the model. First we extract the coefficients from each model, then reshape each coefficient vector into a one-row named data frame, bind the rows together, and label each row with its model’s terms (taken from the names of the result list, so the labels stay lined up with the rows). The dplyr::bind_rows function is great here because, like a “UNION ALL” query in SQL, it keeps all the column names as it stacks the data frames together as new rows.

library(dplyr) #version 0.5.0
coefs = purrr::map_depth(res,2,coefficients) %>% 
  purrr::map_depth(2,function(x){as.data.frame(t(as.matrix(x)))}) %>% 
  purrr::reduce(bind_rows) %>% 
  #label each row with its own model's terms (the names from the result list)
  mutate(Terms = unlist(lapply(res,names)))
coefs
##     (Intercept)        X1       X2         X3          X4       Terms
## 1   0.150404089 0.8892705       NA         NA          NA          X1
## 2   0.031631652        NA 2.962518         NA          NA          X2
## 3   0.001309230        NA       NA -0.4622766          NA          X3
## 4   0.017983923        NA       NA         NA  0.11390661          X4
## 5   0.051936938 0.8799054 2.880028         NA          NA       X1+X2
## 6   0.095013433 1.1728420       NA -0.4231423          NA       X1+X3
## 7   0.047982016 0.9208031       NA         NA  0.16213820       X1+X4
## 8   0.003984185        NA 3.093045 -0.6135762          NA       X2+X3
## 9   0.013940875        NA 2.881377         NA  0.02489288       X2+X4
## 10  0.060029856        NA       NA -0.2083143  0.11762346       X3+X4
## 11  0.034239325 0.9751328 2.996726 -0.6385541          NA    X1+X2+X3
## 12 -0.021320977 0.9166835 2.816281         NA  0.09154406    X1+X2+X4
## 13  0.079566784 1.0442746       NA -0.1011008  0.15250775    X1+X3+X4
## 14 -0.024519496        NA 2.986471 -0.5301225 -0.01590184    X2+X3+X4
## 15 -0.026595776 0.9885010 2.966883 -0.5368965  0.06566652 X1+X2+X3+X4
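
From here it’s easy to summarize the stability numerically - for example, the spread of each estimate across the models that include it (one possible summary among many, output not shown):

# standard deviation of each coefficient across the models it appears in
sapply(coefs[, c("X1","X2","X3","X4")], sd, na.rm = TRUE)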

Conclusion

All subsets regression presents many challenges, theoretical and practical. I was able to easily write some functional code that produces a hierarchical list of models. That structure is nice for exploring the results, but it presents hurdles when you try to analyze it with the base functional tools. purrr::map_depth allows for easy functional programming on hierarchical lists without the need for anonymous functions or other cumbersome overhead.