Taylor & Francis Group
gscs_a_1643345_sm2954.pdf (900.78 kB)

A data-driven selection of the number of clusters in the Dirichlet allocation model via Bayesian mixture modelling

journal contribution
posted on 2019-07-18, 10:14 authored by E. F. Saraiva, C. A. B. Pereira, A. K. Suzuki

In this paper, we consider a Bayesian mixture model in which the mixture weights can be integrated out, yielding a procedure that treats the number of clusters as an unknown quantity. To determine the clusters and estimate the parameters of interest, we develop an MCMC algorithm that we call the sequential data-driven allocation sampler. In this algorithm, a single observation has a nonzero probability of creating a new cluster, and a set of observations may create a new cluster through split-merge moves. The split-merge moves are built on a sequential allocation procedure whose allocation probabilities are calculated from the Kullback–Leibler divergence between the posterior distribution given the previously allocated observations and the posterior distribution that also includes a ‘new’ observation. We verify the performance of the proposed algorithm on simulated data and then illustrate its use on three publicly available real data sets.
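The Kullback–Leibler-based allocation step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes a univariate Normal likelihood with known variance and a conjugate Normal prior on the cluster mean, and it assumes a hypothetical rule in which a cluster's allocation weight is exp(-KL), so that a cluster whose posterior changes little when the new observation is added receives a higher probability. The function names, prior settings, KL direction, and weighting rule are all illustrative assumptions.

```python
import math


def posterior_normal_mean(xs, prior_mean=0.0, prior_var=10.0, noise_var=1.0):
    """Posterior (mean, variance) of a Normal mean with known noise variance.

    The prior and noise settings are illustrative assumptions.
    """
    precision = 1.0 / prior_var + len(xs) / noise_var
    mean = (prior_mean / prior_var + sum(xs) / noise_var) / precision
    return mean, 1.0 / precision


def kl_normal(m0, v0, m1, v1):
    """KL( N(m0, v0) || N(m1, v1) ) for univariate Normal distributions."""
    return 0.5 * (math.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0)


def allocation_probabilities(cluster_a, cluster_b, x_new):
    """Hypothetical rule: weight each cluster by exp(-KL), then normalise.

    KL is taken from the posterior given the current members to the
    posterior that also includes x_new (direction is an assumption).
    """
    weights = []
    for members in (cluster_a, cluster_b):
        before = posterior_normal_mean(members)
        after = posterior_normal_mean(members + [x_new])
        weights.append(math.exp(-kl_normal(*before, *after)))
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # A point near zero should favour the cluster centred near zero.
    print(allocation_probabilities([0.1, -0.2, 0.0], [5.0, 5.2, 4.9], 0.05))
```

In a sequential scheme of this kind, observations would be visited one at a time and each assigned to a cluster drawn from these probabilities before the next observation is considered.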

Funding

C. A. B. Pereira thanks the Conselho Nacional de Desenvolvimento Científico e Tecnológico, CNPq, for support [grant number 308776/2014-3].

History