QSPR modelling of the soil sorption coefficient from training sets of different sizes

Quantitative structure–property relationship (QSPR) modelling is used in many scientific fields and has been applied extensively in environmental research to predict physicochemical properties of compounds with potential environmental impact. The soil sorption coefficient (Koc) is an important parameter in environmental risk assessment because it helps to determine the final fate of substances in the environment. In recent years, different QSPR models have been developed to estimate this coefficient. In this study, several QSPR models were generated and evaluated for the prediction of log Koc from log P. The models were obtained from an extensive and diverse training set (n = 639) and from subsets of this set (halves, quarters and eighths). The aim was to investigate whether the size of the training set affects the statistical quality of the resulting models. In addition, the models obtained from the smaller sets were tested for statistical equivalence with the model obtained from the full training set. The results confirmed that the models are equivalent, indicating that smaller training sets can be used without compromising statistical quality or predictive capability, as long as most chemical classes in the test set are represented in the training set.
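
The subset comparison described above can be illustrated with a minimal sketch. It assumes a simple linear model, log Koc = a · log P + b, fitted by ordinary least squares. All data are synthetic placeholders, not the compounds used in the study; the slope, intercept, noise level and test-set size are arbitrary values chosen only for illustration. The names `make_set` and `fit_and_score` are hypothetical. The abstract does not specify the statistical equivalence test, so the sketch only compares fitted coefficients and test-set error across subsets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic placeholder data, NOT the study's data set. The "true" slope (0.5),
# intercept (1.0) and noise level (0.4) are arbitrary illustrative assumptions.
def make_set(n):
    log_p = rng.uniform(-1.0, 7.0, size=n)
    log_koc = 0.5 * log_p + 1.0 + rng.normal(0.0, 0.4, size=n)
    return log_p, log_koc

n_train, n_test = 639, 150          # 639 as in the abstract; 150 is assumed
x_train, y_train = make_set(n_train)
x_test, y_test = make_set(n_test)

def fit_and_score(x, y):
    """Fit log Koc = slope * log P + intercept; return coefficients and test RMSE."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y_test - (slope * x_test + intercept)
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    return float(slope), float(intercept), rmse

# Shuffle once, then split the training set into 1, 2, 4 and 8 parts
# (full set, halves, quarters, eighths) and fit one model per part.
order = rng.permutation(n_train)
results = {}
for k in (1, 2, 4, 8):
    for i, idx in enumerate(np.array_split(order, k)):
        results[(k, i)] = (len(idx),) + fit_and_score(x_train[idx], y_train[idx])

for (k, i), (n, slope, intercept, rmse) in results.items():
    print(f"1/{k} subset #{i}: n={n:3d}  slope={slope:.3f}  "
          f"intercept={intercept:.3f}  test RMSE={rmse:.3f}")
```

If the coefficients and test RMSE of the smaller subsets stay close to those of the full-set model, that is consistent with the reported conclusion. A formal equivalence test and a random split that keeps chemical classes represented in every subset would both be needed in a real analysis.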