There are two important things to keep in mind with Bayesian analysis. First, we always need to find a way to compute the posterior. MCMC is one way to do so. HMC is a more efficient way (efficient in the sense of computational time and effort, not in the econometric sense) to do so.

Now it turns out that in many cases, and I would argue in more cases than the literature gives cognizance to, the posterior is awful in a way that MCMC and HMC struggle to describe.

Why? Well, imagine you wanted to map the Himalayas. There are so many peaks and valleys that walking up and down the mountains would be difficult. The Dakotas are flat, so it's easy to map the land out. What we are saying is that many more posteriors are like the Himalayas than we admit, and so the sampler fails to describe the posterior. The model is not the issue; we are simply unable to sample it.
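To make the Himalayas point concrete, here is a minimal sketch (all names and settings are illustrative) of a random-walk Metropolis sampler on a toy bimodal density with two well-separated peaks. The chain sits on whichever peak it starts on and essentially never crosses the valley, so it badly misdescribes the posterior even though the density itself is perfectly well defined.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bimodal target: mixture of two well-separated normals ("two Himalayan peaks").
def log_target(x):
    return np.logaddexp(-0.5 * (x + 8.0) ** 2, -0.5 * (x - 8.0) ** 2)

def metropolis(x0, n_steps, step=1.0):
    x, chain = x0, []
    for _ in range(n_steps):
        prop = x + rng.normal(0.0, step)
        if np.log(rng.uniform()) < log_target(prop) - log_target(x):
            x = prop
        chain.append(x)
    return np.array(chain)

chain = metropolis(x0=-8.0, n_steps=20000, step=1.0)
# With a modest step size the chain almost never crosses the deep valley at 0,
# so it reports only the peak it started in; the true fraction is 0.5.
frac_right = np.mean(chain > 0)
print(f"fraction of draws near the right-hand peak: {frac_right:.3f}")
```

The target has equal mass on each peak, so a sampler that describes the posterior correctly should spend about half its time on each side.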

VB is a means to approximate the posterior. It turns out that when the sampler fails because the landscape is hilly and complex, the approximation fails too.
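The same toy target illustrates the VB failure. Below is a minimal sketch (names and grid settings are illustrative) that fits a single normal q = N(mu, s^2) to the bimodal density by minimizing the reverse KL divergence KL(q || p), the objective mean-field VB minimizes. The approximation locks onto one peak and ignores the other entirely.

```python
import numpy as np
from scipy.optimize import minimize

# Bimodal target density on a uniform grid: two equal peaks at -8 and +8.
x = np.linspace(-15.0, 15.0, 3001)
dx = x[1] - x[0]
p = 0.5 * np.exp(-0.5 * (x + 8) ** 2) + 0.5 * np.exp(-0.5 * (x - 8) ** 2)
p /= p.sum() * dx

def reverse_kl(params):
    # KL(q || p) for q = N(mu, s^2), approximated by a Riemann sum on the grid.
    mu, log_s = params
    s = np.exp(log_s)
    q = np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    return np.sum(q * (np.log(q + 1e-300) - np.log(p + 1e-300))) * dx

# Start the optimizer near the right-hand peak and see where it ends up.
res = minimize(reverse_kl, x0=[5.0, 0.0], method="Nelder-Mead")
mu_hat, s_hat = res.x[0], np.exp(res.x[1])
print(f"VB approximation: N({mu_hat:.2f}, {s_hat:.2f}^2)")
```

Reverse KL is mode-seeking, so the fitted normal sits on one peak with roughly its width; half the posterior mass is simply missing from the approximation.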

Second, all models are compared on the basis of fit. If you don't fit the models properly, you will select the wrong model. Consequently, returning to the first point, if a model is excellent but the posterior is awful and our samplers and optimisers are not up to the task, then we have no way to use that excellence. We may sometimes select that model and sometimes select other models. If I were a manager, I would not readily believe the estimates, because they could be flat-out wrong.

This is an issue with MLE as well when the likelihood is awful. A classic example is latent class models, where direct optimisation is known to fail. The same model can be estimated, without changing the likelihood, by using expectation maximisation.
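As a minimal illustration of the EM point (simulated data, illustrative names), here is EM for a two-class normal mixture with unit variances. The mixture likelihood itself can be nasty, but the E-step and M-step are each simple, well-behaved subproblems, which is why EM succeeds where direct optimisation struggles.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated two-class data: 60% of observations from N(0,1), 40% from N(5,1).
z = rng.uniform(size=2000) < 0.6
y = np.where(z, rng.normal(0.0, 1.0, 2000), rng.normal(5.0, 1.0, 2000))

# EM for a two-component unit-variance normal mixture (a minimal sketch).
pi, mu = 0.5, np.array([-1.0, 1.0])
for _ in range(200):
    # E-step: posterior probability that each observation belongs to class 0.
    d0 = pi * np.exp(-0.5 * (y - mu[0]) ** 2)
    d1 = (1 - pi) * np.exp(-0.5 * (y - mu[1]) ** 2)
    w = d0 / (d0 + d1)
    # M-step: weighted share and weighted means, each a trivial maximisation.
    pi = w.mean()
    mu = np.array([np.sum(w * y) / np.sum(w),
                   np.sum((1 - w) * y) / np.sum(1 - w)])

print(f"class share: {pi:.2f}, class means: {mu.round(2)}")
```

Run on this data, EM recovers a share near 0.6 and means near 0 and 5.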

Why? Because ultimately the solver is a mathematical algorithm with its own limitations. Most of the typical solvers we use were developed with a quadratic loss function in mind and are applied blindly to more complex problems because we cannot do any better. Yes, this is a serious issue, and it is taken seriously, but not in marketing, where, with the exception of the discussion around MPEC, we like to pretend that models fit themselves.

I conjecture that the issues are brought to the fore in nonparametric Bayes for many reasons including the fact that the model can very easily be overparametrized, which is a known cause of multimodality (i.e., hills and mountains).

What is combinatorial multimodality? Assume we have K nodes, labelled A, B, ..., K. Then there are 2^K - 1 possible nonempty subsets of these nodes (A, AB, ABC, etc.), and for any given set of nodes there are K! ways to assign the labels. This is what is known as combinatorial multimodality in finite mixture models: if we postulate that we have K nodes, we may actually have any subset of the K nodes, we don't know which node to label A, B, etc., and we don't know how many nodes there are in the first place.
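The label-switching half of this can be shown in a few lines (simulated data, illustrative names): every one of the K! relabelings of a K-component mixture gives exactly the same likelihood, so the posterior has at least K! symmetric modes.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(2)
# Data from three clear clusters at 0, 5, and 10.
y = np.concatenate([rng.normal(0, 1, 50),
                    rng.normal(5, 1, 50),
                    rng.normal(10, 1, 50)])

def mixture_loglik(mu, pi):
    # Log-likelihood of a K-component unit-variance normal mixture.
    dens = pi * np.exp(-0.5 * (y[:, None] - mu[None, :]) ** 2) / np.sqrt(2 * np.pi)
    return np.sum(np.log(dens.sum(axis=1)))

mu = np.array([0.0, 5.0, 10.0])
pi = np.array([1 / 3, 1 / 3, 1 / 3])

# All K! = 6 relabelings give exactly the same likelihood value.
for perm in permutations(range(3)):
    print(perm, round(mixture_loglik(mu[list(perm)], pi[list(perm)]), 4))
```

A sampler exploring this posterior faces six interchangeable peaks before the subset question (which components are actually occupied) even enters.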

In nonparametric Bayes, we may put a Dirichlet Process prior on the number of nodes and the weight on each node. That is, in each iteration of a Gibbs sampler, we hold all other parameters fixed and draw the number of nodes and the weights on the nodes. Then we estimate all other parameters holding the number of nodes and the weights constant. This iteration is what bridges parametric Bayes and nonparametric Bayes.
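For intuition on the "draw the number of nodes" step, here is a minimal sketch (illustrative names) of the Chinese Restaurant Process, the partition distribution implied by a DP. Each draw yields its own random number of occupied nodes and its own weights on them.

```python
import numpy as np

rng = np.random.default_rng(3)

def crp_partition(n, alpha):
    # One draw from the Chinese Restaurant Process: customer i joins an
    # existing table with probability proportional to its current size,
    # or opens a new table with probability proportional to alpha.
    tables = []
    for _ in range(n):
        probs = np.array(tables + [alpha], dtype=float)
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(tables):
            tables.append(1)   # a new node appears
        else:
            tables[k] += 1     # an existing node gains weight
    return tables

# The number of occupied nodes differs from draw to draw:
for _ in range(3):
    print(crp_partition(n=100, alpha=1.0))
```

Each printed list is one partition of 100 observations; its length is the number of nodes for that draw, which is exactly the quantity the Gibbs step resamples.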

Now imagine the number of possible label permutations in any given set of draws from the DP. For n occupied nodes that is n! again, and n itself changes from draw to draw.

Gaussian processes show the same issue, because a GP is unimodal in each dimension but the joint posterior may still be multimodal across dimensions. For example, suppose we have household-specific parameters and assume the best-fitting model is

\alpha_1 * p_1 + (1-\alpha_1) * p_2

In this case, a nonparametric model is overparameterized because people are distributed over only two types.

Suppose that we have 100 people ...