An Introduction to the IVS Package

Wenhao Hu

2016-08-01

Introduction

IVS is a package that estimates the conditional distribution of a data-driven tuning parameter for the LASSO, given the observed design matrix and the solution path. The LASSO solution is defined as

\[\widehat{\boldsymbol{\beta}}(\lambda) = \arg\min_{\boldsymbol{\beta}\in\mathbb{R}^p} \left \{\frac{1}{2} ||\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}||^2 +\lambda \sum_{j = 2}^p |\beta_j| \right \}.\]

Define the generalized information criterion as \(GIC_\lambda = \log(\widehat{\sigma}^2_\lambda) + w_n \widehat{df}_{\lambda},\) where \(\widehat{\sigma}^2_\lambda = n^{-1} \sum_{i=1}^n \left \{ Y_i - \mathbf{X}_i^T\widehat{\boldsymbol{\beta}}(\lambda) \right \}^2\), \(\widehat{df}_{\lambda} = \sum_{j = 1}^{p} I_{|\widehat{\beta}_j(\lambda)| > 0}\), and \(w_n\) is a sequence of positive constants, with \(w_n = \log(n) / n\) and \(w_n = 2 / n\) yielding BIC and AIC, respectively. We consider data-driven tuning parameters of the form \[\widehat{\lambda}_{\mathrm{GIC}} = \arg \min_\lambda \left \{ \log(\widehat{\sigma}^2_\lambda) + w_n \widehat{df}_{\lambda} \right \}.\]
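To make the criterion concrete, here is a minimal sketch of GIC-based selection along a LASSO path. It uses the glmnet package purely for illustration (an assumption; IVS itself builds on glmpath, and glmnet scales the penalty by \(1/n\), so its lambda grid corresponds to \(\lambda / n\) in the notation above):

# Minimal GIC sketch (assumes glmnet is installed; not part of IVS).
library(glmnet)

set.seed(1)
X <- matrix(rnorm(50 * 10), 50, 10)
y <- X[, 1:3] %*% c(1, 0.5, 1) + rnorm(50)
n <- nrow(X)

fit <- glmnet(X, y, intercept = FALSE)            # LASSO path over a lambda grid
sigma2 <- colMeans((drop(y) - predict(fit, newx = X))^2)  # hat sigma^2 at each lambda
df <- colSums(as.matrix(fit$beta) != 0)           # nonzero coefficients at each lambda
wn <- log(n) / n                                  # BIC-type weight
gic <- log(sigma2) + wn * df
fit$lambda[which.min(gic)]                        # hat lambda_GIC, on glmnet's scale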

Quick start

This section gives a brief introduction to using the package. First, we load the package:

library(IVS)
## Loading required package: glmpath
## Loading required package: survival
## Loading required package: nloptr

Next, we simulate a test dataset for penalized regression. The dataset has 50 observations and 10 variables; the first three variables are truly associated with the response.

set.seed(0622)
X <- matrix(rnorm(50*10), 50, 10)
y <- X[,1:3] %*% c(1, 0.5, 1) + rnorm(50)
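As a quick sanity check (plain base R, not part of IVS), an ordinary least-squares fit confirms that the signal sits in the first three columns:

ols <- lm(drop(y) ~ X - 1)  # no-intercept OLS on all ten columns
round(coef(ols), 2)         # coefficients should be near (1, 0.5, 1, 0, ..., 0)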

Then we estimate the distribution of the optimal tuning parameter by calling the ivs function:

n <- nrow(X)
wn <- log(n) / n
res <- ivs(X, y, wn = wn, alpha = 0.1, method = "norm")

It returns an object of class ivs containing the estimated conditional distribution, the confidence intervals, and a glmpath object. To inspect the conditional distribution, use

res$prob
##  53.9566769564434  46.7707362855955  20.2853226425962  13.8638739596733 
##      0.000000e+00      0.000000e+00      2.744490e-10      4.292041e-02 
##  13.0789024206674  11.6452417618657  3.55716404205674  3.50459695905013 
##      0.000000e+00      0.000000e+00      9.570792e-01      0.000000e+00 
##  1.41786030733843 0.482309472089842                 0 
##      3.447723e-07      1.819644e-13      4.432302e-28
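Since res$prob prints as a named numeric vector whose names are the candidate tuning parameters, the most probable value can be pulled out directly. This one-liner is a convenience sketch, not part of the package API:

as.numeric(names(res$prob)[which.max(res$prob)])  # most probable lambda, here about 3.56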

This suggests a probability of 4.292041e-02 of selecting 13.8638739596733 as the tuning parameter and a probability of 9.570792e-01 of selecting 3.55716404205674. The corresponding \(90\%\) confidence intervals for these probabilities are

res$CI
##      53.9566769564434 46.7707362855955 20.2853226425962 13.8638739596733
## [1,]                0                0     0.0000000000     0.0002384751
## [2,]                0                0     0.0001355421     0.5848610414
##      13.0789024206674 11.6452417618657 3.55716404205674 3.50459695905013
## [1,]                0                0        0.4149481                0
## [2,]                0                0        0.9997206                0
##      1.41786030733843 0.482309472089842            0
## [1,]     4.664195e-10      1.012039e-16 1.442103e-31
## [2,]     4.064170e-05      6.011703e-11 2.738245e-25

Or use the summary function to obtain a compact summary of the conditional distribution:

summary(res)
##      Lambda   Prob       5%   95%
## [1,]  13.86 0.0429 0.000238 0.585
## [2,]   3.56 0.9571 0.414948 1.000

Finally, one can plot the results for informative variable selection:

plot(res, label = TRUE)

The solution path is overlaid with the estimated conditional distribution.
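To save the figure for a report, the standard base-R graphics-device workflow applies (nothing IVS-specific here):

pdf("ivs_solution_path.pdf", width = 7, height = 5)  # open a PDF device
plot(res, label = TRUE)                              # draw the overlaid path
dev.off()                                            # close the device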