## Optimizing the Dirichlet hyperparameters

One of the things you’ll notice in papers describing generative models of documents using a Dirichlet prior is to simply fix the Dirichlet hyperparameter $\alpha$ that controls the distributions of topic mixtures for each document. This isn’t ideal when you wish to then compute the probability of an unseen document $p(\mathbf{w}_d | \alpha, \beta)$ because a fixed $\alpha$ encodes no knowledge of the distribution of topic-mixtures over documents in the training corpus.

In the appendix of the journal paper by Blei et. al [1], we find a procedure to learn the $\alpha$ that maximizes the following log-likelihood where $\theta_j$ are the document specific topic mixtures and $\alpha$ the dirichlet hyperparameter.

$\displaystyle \Lambda := \log \prod_j p(\theta_j | \alpha) = \\ N \log \Gamma(\sum \alpha_i) - N\sum \log \Gamma(\alpha_i) + \sum_{i,j} (\alpha_i -1) \log \theta_{ij}$

The expression is maximized using the Newton-Raphson method, which requires computing the derivative of $\Lambda$ and the Hessian (the matrix of second-order derivatives) – with respect to $\alpha$.

## Derivative of $\Lambda$

Making use of the digamma function $\psi(x) = \frac{\partial \log \Gamma(x)}{\partial x}$ the partial derivatives with respect of each component of $\alpha$ gives

$\displaystyle \frac{\partial \Lambda}{\partial \alpha_r} = N \psi(\sum \alpha_i) - N \psi(\alpha_r) + \sum_j \log \theta_{jr}$

## The Hessian

The Hessian component-wise with respect of $\alpha_r$ and $\alpha_s$ is

$\displaystyle \frac{\partial \log \Lambda}{\partial \alpha_r \alpha_s} = N \psi'(\sum \alpha_i) - \delta_{rs} N \psi'(\alpha_r)$

where $\delta_{rs}$ is the Kronecker delta. Note that this Hessian matrix can be written in the following form

$\displaystyle H = \text{diag}(h) + \mathbf{1}z\mathbf{1}^T$

where $\text{diag}(h)$ is the diagonal matrix containing the second-order derivatives with respect to $\alpha_i$s across the diagonal and zero elsewhere; $z = \psi'(\sum \alpha_i)$; and $\mathbf{1}$ is vector of $1$s. The inverse of a matrix of his form is given by the Matrix Inversion Lemma which states

$\displaystyle (A - BD^{-1}C)^{-1} = \frac{A^{-1} + A^{-1}BCA^{-1}}{D - CA^{-1}B}$

In our case, $H^{-1}$ is given by

$\displaystyle H^{-1} = \text{diag}(h)^{-1} - \frac{\text{diag}(h)^{-1} \mathbf{1}\mathbf{1}^T \text{diag}(h)^{-1}}{ z^-1 + \sum_{j=1}^k h_j^{-1}} \\ H^{-1}_{ij} = \text{diag}(h)^{-1}_{ij} - \frac{h^{-1}(h^{-1})^T}{ z^{-1} + \sum_j h^{-1}_j}$

## The upside

We are now ready to compute the new guess for the next iteration in the Newton-Raphson method: $\alpha_{new} = \alpha_{old} - H^{-1}(\alpha_{old}) g(\alpha_{old})$. Given the gradients $g$ evaluated at the old $\alpha$s, the new $\alpha$ are given by

$\displaystyle \alpha_{old} - (H^{-1} g)_i = \alpha_{old} - \sum_j H^{-1}_{ij} g_j \\ = \alpha_{old} - h^{-1}g_i - \frac{\sum_j (h^{-1} (h^{-1})^T)_{ij} g_j}{ z^{-1} + \sum_j h^{-1}_j} \\ = \alpha_{old} - \frac{1}{h^{-1}} \left( g_i - \frac{ \sum_j \frac{g_j}{h_j} }{z^{-1} + \sum_j h^{-1}_j} \right)$

The reason for the special attention given to the form of the Hessian in this problem si that it requires the computation of only the values $h_i$ and $g_i$ which amount to $2k$ values which is only linear in $k$ (the dimension of $\alpha$) to the otherwise $O(k^3)$ required for a full-blown matrix inversion of $H^{-1}$.

[1] David M. Blei and Andrew Y. Ng and Michael I. Jordan and John Lafferty. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research.