## Starting Probabilistic Document Retrieval

I want to work through some papers on probabilistic document retrieval, mainly to find out how deeply generative models have penetrated this area. Note that the literature refers to document retrieval as ‘ad-hoc retrieval’, as opposed to, say, ‘passage retrieval’ for retrieving parts of a document.

I’m starting with a 2006 paper by Xing Wei et al. [1] that introduces LDA to extend the basic language model for retrieval. The basic model (often referred to as the query likelihood model) evaluates each query term $q$ (multiple query terms are treated independently) with respect to a document $d$ according to

$\displaystyle p_1(q|d) = \frac{N_d}{N_d + \mu} p^*(q|d) + \left( 1 - \frac{N_d}{N_d + \mu} \right) p^*(q)$

where $p^*(w|d) = \frac{N_{dw}}{N_d}$ is the empirical probability of $w$ in document $d$; $p^*(w)$ is the empirical probability of $w$ in the entire collection; and $\mu$ is a tunable Dirichlet smoothing parameter that shifts weight between the document statistics and the collection statistics.
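A minimal sketch of this scoring function (the function and variable names are mine, not from the paper):

```python
from collections import Counter

def query_likelihood(query_terms, doc_tokens, collection_tokens, mu=1000.0):
    """Dirichlet-smoothed query likelihood: the product over query terms q of
    p_1(q|d) = (N_d/(N_d+mu)) p*(q|d) + (1 - N_d/(N_d+mu)) p*(q)."""
    doc_counts = Counter(doc_tokens)
    coll_counts = Counter(collection_tokens)
    n_d, n_c = len(doc_tokens), len(collection_tokens)
    w = n_d / (n_d + mu)                      # weight on the document statistics
    score = 1.0
    for q in query_terms:
        p_doc = doc_counts[q] / n_d           # empirical p*(q|d)
        p_coll = coll_counts[q] / n_c         # empirical p*(q)
        score *= w * p_doc + (1 - w) * p_coll
    return score
```

Note that the collection term rescues a query word absent from the document only if it occurs somewhere else in the collection.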

The main problem with this approach arises when the query terms do not appear in the target documents. For instance, a user might search for “daily planes from Chicago to New York”, where “planes” is an unusual term to use compared to the usual “flights”. The use of “planes” will trip up the query likelihood model, but we know that LDA, for instance, will tend to group “planes” and “flights” under one topic and so will not be thrown off.

So, the authors augment the query likelihood model with LDA’s judgment of a query word given the topic mixture at document $d$ with preference $(1-\lambda)$.

$\displaystyle p_2(q|d) = \lambda p_1(q|d) + (1-\lambda)\sum_z p(z|d)p(q|z)$

Though it introduces yet another tuning parameter, it’s not much effort to automate its inference using hill-climbing. The results show consistent improvement over the query likelihood model. There’s not much more to say about this since many models have supplanted this; I’ll take a look at further models in the upcoming posts.
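The combined score $p_2$ can be sketched as follows, with toy topic distributions standing in for a trained LDA model (all names and numbers here are illustrative):

```python
def lda_smoothed_score(query_terms, p1, p_z_given_d, p_w_given_z, lam=0.7):
    """p_2(q|d) = lam * p_1(q|d) + (1 - lam) * sum_z p(z|d) p(q|z),
    multiplied over the query terms. p1 maps a term to its p_1(q|d) value."""
    score = 1.0
    for q in query_terms:
        lda_part = sum(p_z_given_d[z] * p_w_given_z[z].get(q, 0.0)
                       for z in p_z_given_d)
        score *= lam * p1.get(q, 0.0) + (1 - lam) * lda_part
    return score
```

Even when $p_1(q|d)$ is zero, a topic that assigns mass to the query word keeps the document's score alive.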

[1] Xing Wei and W. Bruce Croft. 2006. LDA-based document models for ad-hoc retrieval. Research and Development in Information Retrieval (SIGIR).

## Reservoir Sampling

If you want to uniformly sample a handful of elements from a very large stream of data, you probably don’t want to read it all into memory first. It would be ideal if you could sample while streaming the data. For this, reservoir sampling provides one answer. The algorithm is summarized as follows

1. Let $k$ be the number of samples required (without replacement), $R$ the array to hold the $k$ samples, and $S$ the stream of elements.
2. Fill $R$ with the first $k$ elements from $S$.
3. For each $k+1 \le i \le |S|$, randomly sample $j \in [1,i]$, and set $R[j] = S[i]$ if $j \le k$.
4. The sample set is the final state of $R$.
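The steps above can be sketched as (with $j$ 1-indexed as in step 3):

```python
import random

def reservoir_sample(stream, k):
    """Uniformly sample k elements without replacement from a stream,
    using O(k) memory and a single pass."""
    reservoir = []
    for i, x in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(x)          # step 2: fill with the first k elements
        else:
            j = random.randint(1, i)     # step 3: j uniform in [1, i]
            if j <= k:
                reservoir[j - 1] = x     # overwrite slot j
    return reservoir
```

If the stream has fewer than $k$ elements, the reservoir simply returns them all.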

Can we show that this algorithm does indeed uniformly sample the elements (without replacement), that is, the probability of sampling any subset of $k$ elements is $\frac{k!}{(n)_k} = \frac{k}{n}\frac{k-1}{n-1}\dots\frac{1}{n-k+1}$?

Suppose $k=1$, $|S|=n$, and $1 \le r \le n$. The probability of picking $r$ is the probability of picking it at step $3$ when $i=r$ and then making sure in the later steps it doesn’t get replaced by another element

$\displaystyle \frac{1}{r} \times \prod_{r < s \le n} \frac{s-1}{s} \\ = \frac{1}{r} \times \frac{r}{n} \\ = \frac{1}{n}$

Thus, the sampling is uniform when $k=1$. In general, for sampling $k$ elements $r_1 < r_2 < \dots < r_k$ the probability due to step $3$ is

$\displaystyle \frac{k}{r_1}\frac{k-1}{r_2}\dots\frac{1}{r_k}$

That is the probability that each element is put into its own slot. The probability that $r_1$ doesn’t get overwritten by the upcoming elements is a product of $\frac{s-1}{s}$ over every $s > r_1$ except $r_2,\dots,r_k$ (which occupy their own slots); reindexing to skip those $k-1$ elements turns it into the telescoping product

$\displaystyle \prod_{r_1 < s \le n-k+1} \frac{s-1}{s} = \frac{r_1}{n-k+1}$

Thus, the probability of selecting $r_1 < \dots < r_k$ according to the algorithm is

$\displaystyle \frac{k}{r_1}\frac{k-1}{r_2}\dots\frac{1}{r_k} \times \frac{r_1}{n-k+1}\dots \frac{r_k}{n} = \frac{k!}{(n)_k}$

## Regression-guided Generative Models

A generative model is pretty pointless on its own unless the generative structure itself holds intrinsic interest. Hence, papers justify their generative models either by comparing their predictive performance against another model or by extending the model to accommodate standard machine learning tasks of dimensionality reduction, prediction, or classification.

A prime example of this is the LDA paper that evaluates the model’s usefulness for classification and collaborative filtering in addition to comparing its performance against its ancestor – the PLSA model – whose intention was to use the latent variables for indexing documents. These tasks were performed without modification to the generative model because they only required the evaluation of probability densities such as $p(\mathbf{w})$ for document density, $p(\mathbf{q} | \mathbf{w}_d)$ for query relevance against a document, and $p(w | \mathbf{w})$ for recommending an item given a set of items.

## Lack of supervision

It was realized early on that supervision would be required to get the most out of a generative model. Consider the case of clustering a set of points where we let a Gaussian mixture model group together points into $k$ clusters. The entire idea is that we hope for the clusters of points to have some significance to us. It’s quite unlikely in most non-trivial circumstances for the clusters to make complete sense to us after a run of the GMM.

So, a logical modification of clustering comes about when the sample points $X_i$ (the independent variables) come along with accompanying points $Y_i$ (the dependent variables). While we may just get lucky and find that the clusters perfectly partition the $X_i$ such that each cluster contains a disjoint subset of the $Y$s, we can’t expect this to happen unaided. Thus, the way to proceed is to treat the $Y_i$ as constraints and to find clusters that do their best to each encompass a disjoint subset of the $Y$s.

What I want to look at is one way [1] the LDA model can be augmented to learn topics with the help of document labels such as categories, ratings, or other information.

## Regression

The independent variables are taken to be the topic-count vector $z_d$ for each document $d$, and the dependent variable is an ordered user rating (such as a score from 1 to 10), $y_d \in \mathbf{R}$. The hunch is that the ratings change linearly with respect to the topic counts, i.e., two different ratings are differentiated by a linear change in the topic counts of the corresponding documents, with variation accommodated by Gaussian noise.

The independent variables are the normalized topic counts

$\displaystyle \bar{z}_d := \frac{1}{N} \sum_{n=1}^{N} z_{dn}$

and the ratings $y_d$ are modeled as values drawn from a Gaussian distribution with mean given by a linear combination via $\eta$ of the above topic counts

$\displaystyle y_d \sim \text{N}(\eta^T \bar{z}_d, \sigma^2)$
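To make the generative step concrete, here is a small sketch that draws a response $y_d$ for one document; the topic assignments, coefficients $\eta$, and noise level are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 4, 50                               # number of topics, words per document
eta = np.array([2.0, -1.0, 0.5, 0.0])      # regression coefficients (illustrative)
z = rng.integers(0, K, size=N)             # topic assignment of each word
z_bar = np.bincount(z, minlength=K) / N    # normalized topic counts, \bar{z}_d
y = rng.normal(eta @ z_bar, 0.5)           # y_d ~ N(eta^T \bar{z}_d, sigma^2)
```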

To keep this post short, I’ll postpone the inference discussion to another post, which will also give me a chance to walk through variational inference.

## Generalization

What if we want to entertain other responses such as a binomial or categorical response? For this, we employ the Generalized Linear Model (GLM). Note first the salient components of the Gaussian model above:

1. The mean $\mu$ is linearly determined as $\eta^T \bar{z}_d$. This linear combination is also called the linear predictor in the paper.
2. The dispersion (variance) parameter $\sigma^2$ controls how much the response varies from the linearly determined $\mu$

The GLM defines the probability of the dependent variable as

$\displaystyle p(y | \zeta,\delta) = h(y,\delta) \exp \left( \frac{\zeta y - A(\zeta)}{\delta} \right)$

where $\zeta$ is the natural parameter, $\delta$ the dispersion parameter, $h(y,\delta)$ is the base measure and $A(\zeta)$ is the log-normalizer. Let’s see if the normal distribution fits this form

$\displaystyle p(y | \eta,\bar{z},\sigma^2) = \frac{1}{\sqrt{2\pi}\sigma} \exp \left( -\frac{(y - \eta^T \bar{z})^2}{2\sigma^2} \right) \\ = \frac{1}{\sqrt{2\pi}\sigma} \exp \left( -\frac{y^2 - 2y\eta^T \bar{z} + (\eta^T \bar{z})^2}{2\sigma^2} \right) \\ = \frac{1}{\sqrt{2\pi}\sigma} \exp \left( \frac{-y^2}{2\sigma^2} \right) \exp \left( \frac{2y\eta^T \bar{z} - (\eta^T \bar{z})^2}{2\sigma^2} \right) \\ = \frac{1}{\sqrt{2\pi}\sigma} \exp \left( \frac{-y^2}{2\sigma^2} \right) \exp \left( \frac{\eta^T \bar{z}\,y - (\eta^T \bar{z})^2/2}{\sigma^2} \right)$

This fits the bill with $h(y,\sigma^2) = \frac{1}{\sqrt{2\pi}\sigma} \exp \left( \frac{-y^2}{2\sigma^2} \right)$, $\zeta = \eta^T \bar{z}$, $\delta = \sigma^2$, and $A(\zeta) = \frac{\zeta^2}{2}$. In the next post I’ll look at how other response types fit into the Generalized Linear Model format.
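The algebra can be checked numerically: with $\zeta = \eta^T \bar{z}$ and $\delta = \sigma^2$, the Gaussian density and the GLM form with the $h$ and $A$ above should agree for any $y$. A minimal sketch:

```python
import math

def gaussian_pdf(y, mu, sigma2):
    """The usual N(mu, sigma^2) density."""
    return math.exp(-(y - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def glm_density(y, zeta, delta):
    """h(y, delta) * exp((zeta*y - A(zeta)) / delta) with h and A as derived above."""
    h = math.exp(-y ** 2 / (2 * delta)) / math.sqrt(2 * math.pi * delta)
    A = zeta ** 2 / 2
    return h * math.exp((zeta * y - A) / delta)
```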

[1] Jon D. Mcauliffe and David M. Blei. 2008. “Supervised Topic Models.” Advances in Neural Information Processing Systems 20

## Topic Coherence

Evaluating unsupervised topic models is tricky business. If the resulting model is not employed in retrieval, classification, or regression there really is no way of convincing someone of the model’s worth. You may, rightly, say that there is no use for an unsupervised model without one of these objectives and that the unsupervised soubriquet serves only to distinguish it from a model whose optimization procedure includes supervision in the form of labels or outputs. The only way a generative model will have a meaning of its own is if there is a natural or physical interpretation of the generative process itself picked out by its inference over the given samples.

Nevertheless, people try to evaluate unsupervised generative models without labels of some kind. A popular method is to hold out a portion of the data for testing and to compute its log-likelihood $\log p(\mathbf{w})$ (also called the predictive log-likelihood) by integrating out the latent variables. While this works for any probabilistic model, a slightly different metric called predictive perplexity is employed for textual models

$\displaystyle perplexity(\mathbf{w}^{\text{test}}) = \exp \left( - \frac{\sum_{d=1}^M \log p(\mathbf{w}^{\text{test}}_d) } {\sum_{d=1}^M N_d} \right)$

where $N_d$ is the number of words in document $d$.
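Computing this is a one-liner given the held-out log-likelihoods; a handy sanity check is that a uniform unigram model over a vocabulary of size $V$ has $\log p(\mathbf{w}_d) = N_d \log(1/V)$ and hence perplexity exactly $V$ (lower is better):

```python
import math

def perplexity(log_likelihoods, doc_lengths):
    """Predictive perplexity: exp(-(sum_d log p(w_d)) / (sum_d N_d))."""
    return math.exp(-sum(log_likelihoods) / sum(doc_lengths))
```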

## Topic Coherence

The worth of this metric with respect to human interpretations of text is contested [2]. The authors Mimno et al. [2] suggest an alternative method of evaluation called topic coherence. The idea here is to prefer topics whose most frequent words appear, in general, together rather than apart. That is, a topic shouldn’t consist of many separate central words around which a host of supporting words are found; instead, a topic should consist of one coherent set of words. You can imagine this as being able to give a single solid name to each topic rather than having to choose from many ambiguous possibilities. The evaluation is expressed as follows

$\displaystyle C(t,V^t) = \sum_{m=2}^M \sum_{l=1}^{m-1} \log \frac{D(v^t_m, v^t_l)+1}{D(v^{t}_l)}$

where $V^t = (v^t_1,\dots, v^t_M)$ is a list of the $M$ most probable words in topic $t$, $D(v)$ is the number of documents containing word $v$, and $D(v, v')$ is the number of documents containing both $v$ and $v'$. Exponentiating this, we find that it is computing the empirical joint probability of all the pairwise combinations of the top $M$ words in a topic $t$.
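A minimal sketch of this score, computing the document frequencies from a toy corpus:

```python
import math

def coherence(top_words, docs):
    """C(t, V^t) = sum_{m=2}^{M} sum_{l<m} log((D(v_m, v_l) + 1) / D(v_l)),
    where D counts the documents containing the given word(s)."""
    doc_sets = [set(d) for d in docs]
    def D(*words):
        return sum(all(w in s for w in words) for s in doc_sets)
    score = 0.0
    for m in range(1, len(top_words)):    # m = 2..M in the 1-indexed notation
        for l in range(m):                # l = 1..m-1
            score += math.log((D(top_words[m], top_words[l]) + 1) / D(top_words[l]))
    return score
```

Words that co-occur in documents score above zero; words that never co-occur are pushed below it.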

## Example

Recent work has focused on improving topic models in this regard by adding external knowledge about textual behavior, such as knowing which words go together. The authors Chen et al. [1] consider improving topics by running topic models over several domains and then running a frequent itemset miner to find commonalities between the domains, iteratively enhancing the topics. Like [2], they make use of the generalized Pólya urn model, which I’ll explore in another post.

I am not too taken with the model, but the key thing is that the models are evaluated using average topic coherence over held-out data and show an improvement over LDA and other knowledge-based models.

[1] Zhiyuan Chen and Bing Liu. 2014. “Topic Modeling using Topics from Many Domains, Lifelong Learning and Big Data”. International Conference on Machine Learning

[2] David M. Mimno and others. 2011. “Optimizing Semantic Coherence in Topic Models”. Empirical Methods in Natural Language Processing

## Starting Part-of-Speech Tagging

This is by no means the latest on the subject of probabilistic part-of-speech tagging of documents but nevertheless provides a good starting point to look at the basic model along with training and testing data.

This paper [1] takes a stab at comparing unsupervised HMM-based POS taggers. It is quite strange that something as simple as an HMM can be expected to tag words according to parts of speech without any supervision. By general standards of classification accuracy, we find that an unsupervised HMM does in fact perform poorly: [1] shows an average greedy 1-to-1 accuracy (explained in the last section) of $51\%$ at best, and the paper we looked at last time also reports the same performance on this data.

Clearly, some supervision is required. Though it isn’t the main purpose of the paper from last time, two approaches are taken. The first is to consider closed-class words: classes containing only a handful of words sharing a common tag, such as punctuation, pronouns, and possessive markers, whereas open classes are things like verbs and nouns. The authors modify the models by fixing the class of these closed-class words, giving them zero probability for tags they definitely don’t belong to. The results improve to about $58\%$, which is not a huge jump.

The second thing they try is to use the hidden representation from an unsupervised run of an HMM (in addition to other simple features) and feed it as features to a classification model. I am not too interested in this, but suffice it to say accuracy jumps up yet again, to about $85\%$.

### Data

Here are some links to high-quality data used in these papers.

### Evaluation Details

Luckily, anything falling into the category of classification is by far the easiest to evaluate in a completely intuitive and concrete way. It’s not as easy to fool someone with shady-looking experiments as it is with unsupervised models that lack a classification objective.

The method employed in this paper is to split the data into a training and a test set. Then, from the model inferred on the training set, a greedy 1-to-1 map is established between each hidden state and a part-of-speech tag (another approach would be to find a best matching instead). This mapping is used to label the test data, which is then compared against the treebanks.

Another method is to allow a 1-to-Many mapping where more than one hidden state could be mapped to a tag by simply sending each hidden state to the best matching tag. The results with this approach seem to show higher accuracies.
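The greedy 1-to-1 evaluation can be sketched as follows (names are mine, and the exact tie-breaking in the paper may differ):

```python
from collections import Counter

def greedy_one_to_one_accuracy(pred_states, gold_tags):
    """Greedily map each hidden state to a distinct gold tag, most frequent
    (state, tag) pairs first, then score the mapped predictions."""
    pairs = Counter(zip(pred_states, gold_tags))
    mapping, used = {}, set()
    for (state, tag), _ in pairs.most_common():
        if state not in mapping and tag not in used:
            mapping[state] = tag
            used.add(tag)
    correct = sum(mapping.get(s) == t for s, t in zip(pred_states, gold_tags))
    return correct / len(gold_tags)
```

Dropping the `tag not in used` check gives the 1-to-Many variant, where several hidden states can map to the same tag.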

[1] Jianfeng Gao and Mark Johnson. 2008. “A comparison of Bayesian estimators for unsupervised Hidden Markov Model POS taggers”. Empirical Methods in Natural Language Processing

## Adding (more Relaxed) Constraints during Model Inference

In the previous post on posterior regularization we saw how to specify constraints during the $E$-step of expectation maximization that would otherwise be difficult to incorporate into the model itself. The constraints took the following form

$\displaystyle \mathbf{E}_q[\mathbf{f}(\mathbf{x},\mathbf{z})] \le \mathbf{b}$

where we specified our constraints as expectations over the distribution of hidden variables. As an example, we took GMM clustering and used this framework to specify the constraint that at most two of the points in $\{x_1,x_2,x_3\}$ should be in the same cluster: $\sum_{i=1}^N \sum_{z=1}^K p_{\mu}(z | x_i) (\delta_{i1} + \delta_{i2} + \delta_{i3}) \le 2$.

What if we wanted to specify instead that those points should be separated into as many clusters as possible – if possible, each of those three points in three different clusters? This poses a problem because we do not know a priori if the data will allow (in a sensible way) for the three points to be separated and all we may know is that they are very likely well-separated.

A solution is presented in “Posterior vs. Parameter Sparsity in Latent Variable Models” [1] where the posterior regularization framework is relaxed by allowing $\mathbf{b}$ to be variable while being penalized by a function $R(\cdot)$. Specifically, the $E$-step now looks like

$\displaystyle \underset{q,b}{\arg\min} \text{ KL}(q||p) + R(\mathbf{b}) \text{ s.t. } \mathbf{E}_q[\mathbf{f}(\mathbf{x},\mathbf{z})] \le \mathbf{b}$

With this setup, a simple way to encourage our points to live in separate clusters is to let $R(b) = \sigma b$, where $\sigma$ is the strength of the regularization, whose value must be ascertained experimentally. From the experiments section we find values of $\sigma$ between $0.6$ and $2.1$.

## Experiments

The experiments in this paper are conducted on part-of-speech (POS) tagging, which warrants its own look later on. For now, let’s look at a useful number computed to compare models. The experiments in the paper want to show that an HMM augmented with this framework manages to assign fewer POS tags to a word than otherwise. To show this, they consider words occurring more than $10$ times in the data and compute

$\displaystyle (\ell_1 / \ell_{\infty})(w) = \frac{n_w}{ \max_t \{ n_{wt} \}}$

where $n_w$ is the number of times $w$ occurred and $n_{wt}$ is the number of times $w$ is assigned tag $t$. The closer this value is to $1$ the sparser the tag assignments.
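A minimal sketch of this sparsity statistic, computed from (word, tag) pairs:

```python
from collections import Counter, defaultdict

def l1_over_linf(word_tag_pairs, min_count=10):
    """(l1/l_inf)(w) = n_w / max_t n_wt, for words occurring more than min_count times."""
    per_word = defaultdict(Counter)
    for w, t in word_tag_pairs:
        per_word[w][t] += 1
    return {w: sum(c.values()) / max(c.values())
            for w, c in per_word.items()
            if sum(c.values()) > min_count}
```

A word always given the same tag scores exactly $1$; spreading its occurrences over more tags pushes the value up.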

[1] Joao Graca and Kuzman Ganchev and others. 2009. “Posterior vs Parameter Sparsity in Latent Variable Models”. Advances in Neural Information Processing Systems 22


## Adding Constraints during Model Inference

Coming up with a probabilistic model and its inference procedure is only half the work, because it’s well known that a single run of the inference procedure is hardly likely to give you a satisfactory answer. Out of the many possible reasons – the local optima are unsatisfactory, there aren’t enough sparsity constraints, the model makes too many independence assumptions, and so on – the one that is significant in all cases, from the point of view of a consumer of the model, is that viable solutions clearly have to come from a restricted subspace within the complete parameter space of the model.

Take for instance the inference of an HMM where we would like the probability of transition from state $s_1 \mapsto s_2$ to be the same as from $s_2 \mapsto s_1$, thereby forcing the transition matrix to be symmetric. This is easily solved by adding Lagrange multipliers $\lambda_{ij}(M_{ij} - M_{ji})$ in the HMM’s EM derivation. Similarly, other equality constraints can be added in this manner.

But most times the constraints are a lot softer, such as the data points $X_1$ and $X_2$ must not belong to the same cluster, or that there must be at least $10$ points in each cluster. Baking these constraints into the probabilistic model itself is (pretty much) impossible, and so is converting them into equality constraints, because the number of such constraints is combinatorial.

A solution is proposed in “Expectation Maximization and Posterior Constraints” [1] (referred to in literature as posterior regularization) that allows one to specify a vector of constraints of the form

$\displaystyle \mathbf{E}_q[f_1(\mathbf{x},\mathbf{z})] \le b_1 \\ \dots \\ \mathbf{E}_q[f_n(\mathbf{x},\mathbf{z})] \le b_n \\ \text{ i.e. } \mathbf{E}_q[\mathbf{f}(\mathbf{x},\mathbf{z})] \le \mathbf{b}$

This modification is not without consequences. In particular, the paper shows that this modification trades off the maximization of the log-likelihood against the distance to the desired posterior subspace (created by the constraints). This trade-off follows from the observation that maximizing log-likelihood does not always yield desired solutions due to a model’s generality.

## Example

Let’s work out an example using this framework. Consider a GMM clustering of data points $\{ x_1,\dots,x_N \}$ using a fixed number of clusters $K$, fixed variance $\sigma^2$, and unknown means $\mathbf{\mu}$.

In addition, suppose that we would like to enforce the constraint that at most two of the points in $\{x_1, x_2, x_3\}$ should be in the same cluster. The paper allows us to specify this constraint as an inequality of an expectation over the cluster membership probabilities $p_{\mu}(z | x)$

$\displaystyle \sum_{i=1}^N \sum_{z=1}^K p_{\mu}(z | x_i) (\delta_{i1} + \delta_{i2} + \delta_{i3}) \le 2 \\ p_{\mu}(z | x_i) = \frac{p(x_i | \mu_z)}{\sum_{z'} p(x_i | \mu_{z'})}$

The paper shows that we may bake in this constraint to the $E$-step by solving the following dual optimization problem

$\displaystyle \underset{\lambda \ge 0}{\arg \max} \left( 2\lambda - \sum_i \log \sum_z p_{\mu}(z | x_i) e^{\lambda (\delta_{i1} + \delta_{i2} + \delta_{i3})} \right) \\ p_{\lambda}(z | x_i) = \frac{p_{\mu}(z | x_i )e^{\lambda (\delta_{i1} + \delta_{i2} + \delta_{i3})}} {\sum_{z'} p_{\mu}(z' | x_i) e^{\lambda (\delta_{i1} + \delta_{i2} + \delta_{i3})}}$

Notice that this only matters when $i=1,2,3$; otherwise, the $e^{\lambda (\delta_{i1} + \delta_{i2} + \delta_{i3})}$ term evaluates to $1$ and we recover the traditional $E$-step where $p_{\lambda}(z | x_i) = p_{\mu}(z | x_i)$. The derivative with respect to $\lambda$ is

$\displaystyle 2 - \sum_i \sum_z \frac{ (\delta_{i1} + \delta_{i2} + \delta_{i3}) p_{\mu}(z | x_i) e^{\lambda (\delta_{i1} + \delta_{i2} + \delta_{i3})}} {\sum_{z'} p_{\mu}(z' | x_i) e^{\lambda (\delta_{i1} + \delta_{i2} + \delta_{i3})}} \\ = 2 - \sum_i \sum_z p_{\lambda}(z | x_i) (\delta_{i1} + \delta_{i2} + \delta_{i3})$

With that, we get the following gradient update (for use in gradient ascent) for $\lambda$; after convergence (or a few iterations) we arrive at $\lambda^*$ and set the $E$-step posterior to $p_{\lambda^*}(z|x_i)$.

$\displaystyle \lambda \gets \lambda + \eta \left(2 - \sum_i \sum_z p_{\lambda}(z | x_i) (\delta_{i1} + \delta_{i2} + \delta_{i3})\right)$
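The projected $E$-step can be sketched as dual gradient ascent. Note an assumption on my part: I use one multiplier $\lambda_z$ per cluster, so that the “at most two of $\{x_1,x_2,x_3\}$ in any one cluster” constraint can bind per cluster, with a tilt $e^{-\lambda_z}$ that pushes the constrained points out of over-full clusters:

```python
import numpy as np

def constrained_e_step(p, S, b=2.0, eta=0.5, iters=500):
    """Project the E-step posteriors p(z|x_i) onto the constraint set
    sum_{i in S} q(z|x_i) <= b for every cluster z, by gradient ascent
    on one dual variable lambda_z per cluster (an assumption here)."""
    n, k = p.shape
    lam = np.zeros(k)
    in_s = np.zeros(n)
    in_s[S] = 1.0
    q = p.copy()
    for _ in range(iters):
        # Tilt the constrained points' posteriors by exp(-lambda_z), renormalize.
        q = p * np.exp(-np.outer(in_s, lam))
        q /= q.sum(axis=1, keepdims=True)
        grad = q[S].sum(axis=0) - b          # constraint violation per cluster
        lam = np.maximum(0.0, lam + eta * grad)
    return q, lam
```

Multipliers for clusters whose constraint is slack stay at zero, so unconstrained points keep their ordinary $E$-step posteriors.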

Though this framework is already quite accommodating, people have relaxed it even more, and we’ll look at one such relaxation next time.

[1] Joao Graca and Kuzman Ganchev and Ben Taskar. 2007. “Expectation Maximization and Posterior Constraints”. Advances in Neural Information Processing Systems 20
