IVS is a package that estimates the conditional distribution of a data-driven tuning parameter for the LASSO, given the observed design matrix and solution path. The LASSO solution is defined as
\[\widehat{\boldsymbol{\beta}}(\lambda) = \arg\min_{\boldsymbol{\beta}\in\mathbb{R}^p} \left\{\frac{1}{2} \|\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \sum_{j = 1}^p |\beta_j| \right\}.\]
Define the generalized information criterion as \(GIC_\lambda = \log(\widehat{\sigma}^2_\lambda) + w_n \widehat{df}_{\lambda},\) where \(\widehat{\sigma}^2_\lambda = n^{-1} \sum_{i=1}^n \left\{ Y_i - \mathbf{X}_i^T\widehat{\boldsymbol{\beta}}(\lambda) \right\}^2\), \(\widehat{df}_{\lambda} = \sum_{j = 1}^{p} I_{|\widehat{\beta}_j(\lambda)| > 0}\), and \(w_n\) is a sequence of positive constants, with \(w_n = \log(n) / n\) and \(w_n = 2 / n\) yielding BIC and AIC respectively. We consider data-driven tuning parameters of the form \[\widehat{\lambda}_{\mathrm{GIC}} = \arg \min_\lambda \left\{ \log(\widehat{\sigma}^2_\lambda) + w_n \widehat{df}_{\lambda} \right\}.\]
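The GIC formula above can be evaluated directly in base R for a single candidate \(\lambda\). The sketch below is illustrative only: `beta_hat` is a hypothetical coefficient estimate standing in for \(\widehat{\boldsymbol{\beta}}(\lambda)\), not an actual LASSO solution.

```r
set.seed(1)
n <- 50; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1:3] %*% c(1, 0.5, 1) + rnorm(n)

beta_hat <- c(1, 0.5, 1, rep(0, p - 3))     # hypothetical sparse estimate at some lambda
sigma2_hat <- mean((y - X %*% beta_hat)^2)  # hat(sigma)^2_lambda: mean squared residual
df_hat <- sum(abs(beta_hat) > 0)            # hat(df)_lambda: number of nonzero coefficients
wn <- log(n) / n                            # BIC-type weight
gic <- log(sigma2_hat) + wn * df_hat        # GIC_lambda
```

In practice this quantity is computed along the whole solution path and minimized over \(\lambda\).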
This section gives a brief introduction to using the package. First, we load the package:
library(IVS)
## Loading required package: glmpath
## Loading required package: survival
## Loading required package: nloptr
Next we simulate a test dataset for penalized regression, with 50 observations and 10 variables; only the first three variables are truly associated with the response.
set.seed(0622)
X <- matrix(rnorm(50*10), 50, 10)
y <- X[,1:3] %*% c(1, 0.5, 1) + rnorm(50)
Then we estimate the distribution of the optimal tuning parameter by calling the ivs function:
n <- nrow(X)
wn <- log(n) / n
res <- ivs(X, y, wn = wn, alpha = 0.1, method = "norm")
It returns an object of class ivs containing the estimated conditional distribution, the confidence intervals, and a glmpath object. To get the conditional distribution, use
res$prob
## 53.9566769564434 46.7707362855955 20.2853226425962 13.8638739596733
## 0.000000e+00 0.000000e+00 2.744490e-10 4.292041e-02
## 13.0789024206674 11.6452417618657 3.55716404205674 3.50459695905013
## 0.000000e+00 0.000000e+00 9.570792e-01 0.000000e+00
## 1.41786030733843 0.482309472089842 0
## 3.447723e-07 1.819644e-13 4.432302e-28
This suggests a probability of 4.292041e-02 of selecting 13.8638739596733 as the tuning parameter, and a probability of 9.570792e-01 of selecting 3.55716404205674. The corresponding \(90\%\) CIs for these probabilities are
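Continuing the session above, the most probable tuning parameter can also be read off programmatically. This sketch assumes, as the printed output suggests, that `res$prob` is a numeric vector named by the candidate lambda values:

```r
## Index of the lambda with the highest estimated selection probability
j <- which.max(res$prob)

## Its value and estimated probability
best_lambda <- as.numeric(names(res$prob)[j])
best_prob <- res$prob[j]
```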
res$CI
## 53.9566769564434 46.7707362855955 20.2853226425962 13.8638739596733
## [1,] 0 0 0.0000000000 0.0002384751
## [2,] 0 0 0.0001355421 0.5848610414
## 13.0789024206674 11.6452417618657 3.55716404205674 3.50459695905013
## [1,] 0 0 0.4149481 0
## [2,] 0 0 0.9997206 0
## 1.41786030733843 0.482309472089842 0
## [1,] 4.664195e-10 1.012039e-16 1.442103e-31
## [2,] 4.064170e-05 6.011703e-11 2.738245e-25
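The printed matrix has one column per candidate lambda, with the lower bound in the first row and the upper bound in the second. Assuming the columns of `res$CI` follow the same ordering as `res$prob` (as the shared column names suggest), the interval for the most probable lambda can be extracted as:

```r
## 90% lower and upper bounds for the most probable tuning parameter
j <- which.max(res$prob)
res$CI[, j]
```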
Or use the summary function to get a summary of the conditional distribution:
summary(res)
## Lambda Prob 5% 95%
## [1,] 13.86 0.0429 0.000238 0.585
## [2,] 3.56 0.9571 0.414948 1.000
Finally, one can plot the results for informative variable selection:
plot(res, label = TRUE)
The solution path is overlaid with the estimated conditional distribution.