Taylor & Francis Group

Sequential learning of regression models by penalized estimation

Version 2 2022-03-31, 18:00
Version 1 2022-01-31, 21:40
dataset
posted on 2022-01-31, 21:40 authored by Wessel N. van Wieringen, Harald Binder

When data arrive in a sequence of two or more data sets, modeling on the most recent data set should take previous data sets into account. We specifically investigate a strategy for regression modeling when parameter estimates from previous data can be used as anchoring points, yet may not be available for all parameters, so that covariance information cannot be reused. A procedure that updates through targeted penalized estimation, which shrinks the estimator toward a nonzero value, is presented. The parameter estimate from the previous data serves as this nonzero value when an update is sought from novel data. This naturally extends to a sequence of data sets with the same response but potentially only partial overlap in covariates. The iteratively updated regression parameter estimator is shown to be asymptotically unbiased and consistent. The penalty parameter is chosen through constrained cross-validated log-likelihood optimization. The constraint bounds the amount of shrinkage of the updated estimator toward the current one from below. The bound aims to preserve the (updated) estimator's goodness-of-fit on all-but-the-novel data. The proposed approach is compared to other regression modeling procedures. Finally, it is illustrated on an epidemiological study where the data arrive in batches with differing covariate availability and the model is re-fitted upon the arrival of a novel batch.
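For the linear-regression case, the targeted penalized (ridge) estimator described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a squared-error loss with penalty λ‖β − β₀‖², whose closed-form minimizer shrinks the ordinary least-squares estimate toward the anchoring point β₀ (the estimate from the previous data batch). The function name `targeted_ridge` and the choice of λ are illustrative only; the paper selects the penalty parameter by constrained cross-validated log-likelihood optimization, which is not reproduced here.

```python
import numpy as np

def targeted_ridge(X, y, beta0, lam):
    """Minimize ||y - X b||^2 + lam * ||b - beta0||^2.

    Closed-form solution: (X'X + lam I)^{-1} (X'y + lam beta0).
    As lam -> 0 this approaches OLS; as lam -> inf it approaches beta0.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y + lam * beta0)

# Hypothetical two-batch sequence: fit on batch 1, then update on batch 2
# using the batch-1 estimate as the nonzero shrinkage target.
rng = np.random.default_rng(0)
beta_true = np.array([1.0, -2.0, 0.5])
X1 = rng.normal(size=(100, 3)); y1 = X1 @ beta_true + 0.1 * rng.normal(size=100)
X2 = rng.normal(size=(100, 3)); y2 = X2 @ beta_true + 0.1 * rng.normal(size=100)

beta_hat1 = targeted_ridge(X1, y1, np.zeros(3), lam=1.0)   # first batch: shrink to zero
beta_hat2 = targeted_ridge(X2, y2, beta_hat1, lam=10.0)    # update: shrink toward batch-1 fit
```

The second call illustrates the sequential-update idea: the novel batch is fitted with the previous estimate as anchor, so a large λ keeps the update close to the earlier model while a small λ lets the new data dominate.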

History