Coming up with a probabilistic model and its inference procedure is only half the work because it’s well known that just a single run of the inference procedure is hardly likely to give you a satisfactory answer. Out of the many reasons this could be – local minima is not satisfactory, there’s not enough sparsity constraints, the model makes too many independence assumptions, and so on – the one that is significant in all cases, from the point of view of a consumer of the model, is that viable solutions clearly have to come from a restricted subspace within the complete parameter space of the model.
Take for instance the inference of an HMM where we would like the probability of transition from state to be same as from thereby forcing the transition matrix to be symmetric. This is easily solved by adding lagrange multipliers in the HMM’s EM derivation. Similarly, other equality constraints can be added in this manner.
But most times the constraints are a lot softer such as the data point and must not belong to the same cluster or that there must be at least points in each cluster. Baking these constraints in the probabilistic model itself is (pretty much) impossible and so is converting these into equality constraints because the number of such constraints is combinatorial.
A solution is proposed in “Expectation Maximization and Posterior Constraints”  (referred to in literature as posterior regularization) that allows one to specify a vector of constraints of the form
This modification is not without consequences. In particular, the paper shows that this modification trades off the maximization of the log-likelihood against the distance to the desired posterior subspace (created by the constraints). This trade-off follows from the observation that maximizing log-likelihood does not always yield desired solutions due to a model’s generality.
Let’s work out an example using this framework. Consider a GMM clustering of data points using a fixed number of clusters , fixed variance , and unknown means .
In addition, suppose that we would like to enforce the constraint that at most two of the points in should be in the same cluster. The paper allows us to specify this constraint as an inequality of an expectation over the cluster membership probabilities
The paper shows that we may bake in this constraint to the -step by solving the following dual optimization problem
Notice how that this only matters when . Otherwise, the term evaluates to and recovers the traditional -step where . The derivative with respect to
With that, we get our gradient update (for use in gradient ascent) for as the following and after convergence (or a few iterations) arrive at and set the -step to .
Though this framework is already quite accommodating people have relaxed the framework even more and we’ll look at one next time.
 Joao Graca and Kuzman Ganchev and Ben Taskar. 2007. “Expectation Maximization and Posterior Constraints”. Advances in Nerual Information Processing Systems 20