What are you talking about? MLE and GMM are literally the same thing in a conditionally linear Gaussian model.

The Gaussian linear model can be generalized in many directions, GMM and MLE are only two of them.

Still, that doesn't make GMM and MLE "the same thing", nor is one a subset of the other.

What?

Y = X*b + e, where e | X ~ N(0, I).

Both GMM on the orthogonality condition E[X'e] = 0 and MLE will result in the OLS estimator for b. They are mechanically the same thing and will have the same properties in this model. To say that GMM suffers from a weakness that MLE does not must, then, depend on the context of the problem.
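To make the equivalence concrete, here's a quick numerical check (my own illustration, simulated data): the just-identified GMM estimator solves the sample moment condition X'(y - Xb) = 0 exactly, and the Gaussian MLE minimizes the sum of squared residuals numerically. Both land on the same OLS coefficients.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, k = 200, 3
X = rng.normal(size=(n, k))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.normal(size=n)

# GMM: just-identified, so solve the sample moment condition X'(y - Xb) = 0
# exactly -- these are the OLS normal equations.
b_gmm = np.linalg.solve(X.T @ X, X.T @ y)

# MLE: with e | X ~ N(0, I), maximizing the log-likelihood over b is the same
# as minimizing the residual sum of squares. Do it numerically to emphasize
# that we never used the normal equations here.
negloglik = lambda b: 0.5 * np.sum((y - X @ b) ** 2)
b_mle = minimize(negloglik, np.zeros(k)).x

print(np.allclose(b_gmm, b_mle, atol=1e-4))  # same estimator
```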

For example, if you generalize this model but assume knowledge of the error distribution (e.g., errors drawn from a Gaussian mixture), then GMM will underperform MLE, but that's almost a tautology. It's like saying that if I have a well-specified prior, then Bayesian methods outperform frequentist methods.

What if you generalize it in a direction where the moments are known but the distribution is not? The point is that using moments instead of the full distribution trades off efficiency for flexibility. How is this contentious?