Modelling of ready biodegradability based on combined public and industrial data sources

The European Registration, Evaluation, Authorization and Restriction of Chemical Substances Regulation, requires marketed chemicals to be evaluated for Ready Biodegradability (RB), considering in silico prediction as valid alternative to experimental testing. However, currently available models may not be relevant to predict compounds of industrial interest, due to accuracy and applicability domain restriction issues. In this work, we present a new and extended RB dataset (2830 compounds), issued by the merging of several public data sources. It was used to train classification models, which were externally validated and benchmarked against already-existing tools on a set of 316 compounds coming from the industrial context. New models showed good performances in terms of predictive power (Balance Accuracy (BA) = 0.74–0.79) and data coverage (83–91%). The Generative Topographic Mapping approach identified several chemotypes and structural motifs unique to the industrial dataset, highlighting for which chemical classes currently available models may have less reliable predictions. Finally, public and industrial data were merged into global dataset containing 3146 compounds. This is the biggest dataset reported in the literature so far, covering some chemotypes absent in the public data. Thus, predictive model developed on the Global dataset has larger applicability domain than the existing ones.