I am a newbie to ML and have no idea what scale of computational capacity a regression needs. I have a linear model with about 300 independent variables (97% of which are fixed effects) and about 10,000 observations. I waited forever for a single iteration in Stata. Is that normal, or is something wrong with my ML program/data? Thank you very much!
Computational Intensity of Maximum Likelihood

This is a really good question, thanks OP. So I started googling some stuff; still looking, but here are some initial findings:
http://math.stu.edu.cn/_admin/uploads/jcam09.pdf
http://en.citizendium.org/wiki/Newton's_method
http://faculty.fuqua.duke.edu/~abn5/MCMCposted.pdf
The answer will depend a bit on which algorithm you are using to do ML, so I googled BFGS and various variants of Newton's method.
Based on the other answers reported here I suspect not much has been done on this in the econometric literature.
The last paper is Bayesian, so it doesn't quite catch the original intent of OP's post.
Complex problem, I think.
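Since Newton's method keeps coming up in those links, here is a minimal sketch of what one Newton iteration on a log-likelihood actually does. This is my own toy Poisson example in Python, not what Stata runs internally:

```python
# Minimal sketch: Newton-Raphson for a one-parameter MLE (Poisson rate).
# The update is lam_new = lam - score/hessian. For the Poisson
# log-likelihood l(lam) = sum(x)*log(lam) - n*lam (dropping constants),
# the MLE is the sample mean, which gives us a check on convergence.
import numpy as np

def poisson_mle_newton(x, lam0=1.0, tol=1e-10, max_iter=100):
    x = np.asarray(x, dtype=float)
    n, s = len(x), x.sum()
    lam = lam0
    for _ in range(max_iter):
        score = s / lam - n        # d l / d lam
        hess = -s / lam**2         # d^2 l / d lam^2
        step = score / hess
        lam -= step                # Newton update
        if abs(step) < tol:
            break
    return lam

data = [2, 3, 1, 4, 0, 2, 3, 5]
print(poisson_mle_newton(data))    # converges to the sample mean, 2.5
```

With one parameter this converges in a handful of iterations; the OP's problem is the same idea in ~300 dimensions, where each iteration needs a 300x300 Hessian solve, which is where the cost comes from.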

Oh, just to add some insight: when you blow out the dimensionality of the problem like that, the likelihood surface flattens, so convergence will take forever. So yes, you need either computationally more efficient algorithms or to reduce the data dimensionality considerably.

Conjugate gradient. And even with millions of observations and a nonlinear model, ML via CG converges quickly. In fact, familiarity with basic asymptotics would tell you why the above poster is wrong: std errors going to zero implies high curvature near the maximizer, which is the opposite of a flat likelihood surface.
A bit more stats and some remedial optimization courses for e714.
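For anyone who wants to see ML-via-CG on exactly OP's kind of problem, here is a small sketch (a linear model, scaled down from 10,000 x 300; it uses scipy's nonlinear CG, and all the names and sizes are my own choices):

```python
# Sketch: ML for a linear model via conjugate gradient
# (scipy.optimize.minimize, method="CG"). With Gaussian errors the ML
# estimate of beta minimizes the residual sum of squares, so CG on that
# objective should recover the OLS solution even with many parameters.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 2000, 50                      # scaled down from OP's 10,000 x 300
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = X @ beta_true + 0.1 * rng.standard_normal(n)

def neg_loglik(beta):                # up to constants, -l(beta) = RSS/2
    r = y - X @ beta
    return 0.5 * r @ r

def grad(beta):                      # analytic gradient: -X'(y - X beta)
    return -X.T @ (y - X @ beta)

res = minimize(neg_loglik, np.zeros(p), jac=grad, method="CG")
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.max(np.abs(res.x - beta_ols)))  # CG lands on the OLS solution
```

On a quadratic objective like this, exact CG needs at most p iterations, each costing only matrix-vector products; no Hessian is ever formed, which is the whole point for large p.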

Conjugate gradient. And even with millions of observations and a nonlinear model, ML via CG converges quickly. In fact, familiarity with basic asymptotics would tell you why the above poster is wrong: std errors going to zero implies high curvature near the maximizer, which is the opposite of a flat likelihood surface.
A bit more stats and some remedial optimization courses for e714.
You are right, but I was referring not to the dimensionality of the dataset but of the parameter space, so you have misinterpreted me and assumed standard errors fixed; the latter assumption is, I think, reasonable in the absence of correlation structure between exogenous variables.
Under these circumstances my argument would still hold. But yes, conjugate gradient would tend to work; still, the computational complexity of CG would be of interest, would it not?

Computational complexity of CG? It's really not that complicated; after all, CG was developed as a way around complexity/intractability. Then add in that you often already have it coded as an optimization method. CG is a huge win, IMHO.
Dimensionality of the parameter space is not a big deal either, so long as you have n > p (and even better, n >> p). We don't often increase the number of parameters without limit, since even modest models challenge economic interpretation, so n > p is reasonable. Also, CG and other methods are often used for very large problems. (Check out airline pricing or truck routing problems done via ADP; those can be huge.) Finally, unless you have perfect collinearity, increasing amounts of data will reduce standard errors.
Now if you are talking about problems with the model formulation... well, that's a different problem than what we are discussing. But for this situation... I'd recommend OP pick up Nocedal and Wright.

OK, but no free lunch (the CS version) should tell us that there is no single best algorithm, so CG isn't always going to be the way to go. I agree CG is the best bet among classical methods in OP's case, but I am still interested in a comparison of the complexity of different methods, and I don't think this is straightforward, because there are two aspects to it: the solution algorithm, which we have been discussing, and the actual ML problem (the model, if you like). Combining them is complicated. Thanks for the reference.
The last reference I posted above hints at the complexity of this, even though it uses MCMC.
To get the geometry of what I was saying earlier: hold n fixed, then consider a maximization problem with one parameter, then two, then three. It's clear the surface flattens.

So, why don't you estimate a random effects model instead? I couldn't get what you mean by interval restrictions, though. Be careful about the standard errors you are getting; I am not sure whether the inverse Hessian works when there are restrictions in the model.
If you must use MLE, then I would strongly recommend finding the gradient analytically, which would boost the speed and accuracy of the estimates. If possible, find the Hessian as well; again, this will boost the speed and increase the accuracy further. Good luck.
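To illustrate the "supply the gradient and Hessian analytically" advice, here's a hedged sketch in Python on a logistic-likelihood toy example (my own example; scipy's Newton-CG stands in for whatever Stata does internally):

```python
# Sketch: MLE with analytic gradient and Hessian supplied to a
# Newton-type optimizer. Toy logistic regression; with both derivatives
# analytic, each iteration is exact and convergence is quadratic.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p = 500, 5
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -0.5, 0.25, 0.0, 2.0])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

def nll(b):                  # negative log-likelihood: sum log(1+e^z) - y*z
    z = X @ b
    return np.sum(np.logaddexp(0.0, z) - y * z)

def grad(b):                 # analytic score: X'(p - y)
    pr = 1 / (1 + np.exp(-X @ b))
    return X.T @ (pr - y)

def hess(b):                 # analytic Hessian: X' diag(p(1-p)) X
    pr = 1 / (1 + np.exp(-X @ b))
    return X.T @ (X * (pr * (1 - pr))[:, None])

res = minimize(nll, np.zeros(p), jac=grad, hess=hess, method="Newton-CG")
```

Dropping the `hess` argument forces the optimizer to fall back on approximations, and dropping `jac` forces finite differences, which is exactly the speed/accuracy loss the post above is warning about.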

Use something better than conjugate gradient: check out SNOPT, KNITRO, or any of the related sequential quadratic programming methods. It seems remedial optimization classes are needed for more than just one poster here.
Not clear why ML should be quadratic programming, but then I don't think KNITRO requires that; not sure about SNOPT, will check it out.

KNITRO is not that amazing: it has a trust-region part of the algorithm which can be rather low-performance. SNOPT only works if you can get an exact gradient. Now, it is nice in that it tries to be light on evaluating those, but it is by no means perfect.
You're right that SQP can be very nice. No need for remedial classes, I've coded my own SQP a few times. But I wouldn't recommend that to someone as my first answer.
As an aside: there is not a no-free-lunch theorem in CS. That comes from competition; algorithms are ideas, and sucky ideas exist. Grid search for a convex objective function, anyone? Bubblesort? Marxism? Physiognomy?

So, why don't you estimate a random effects model instead? I couldn't get what you mean by interval restrictions, though. Be careful about the standard errors you are getting; I am not sure whether the inverse Hessian works when there are restrictions in the model.
Random effects models are even harder to estimate. Interval restrictions are just constraints on the parameter space. And yes: the standard errors when a constraint is binding are a total mess. Of course, you could use a soft constraint, but then your standard errors would mostly be telling you about the soft constraint.
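A quick sketch of what an interval restriction looks like in practice, and why the boundary case is a mess (a toy example of mine, using the box constraints in scipy's L-BFGS-B):

```python
# Sketch of an "interval restriction": a box constraint on the
# parameter space via L-BFGS-B. The data are drawn with mean -0.5, but
# the parameter is restricted to [0, inf), so the constraint binds and
# the estimate piles up at the boundary -- exactly the situation where
# the usual inverse-Hessian standard errors stop making sense.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.normal(loc=-0.5, scale=1.0, size=1000)

def nll(theta):              # negative normal log-likelihood, sigma = 1
    return 0.5 * np.sum((x - theta[0]) ** 2)

res = minimize(nll, x0=[1.0], method="L-BFGS-B", bounds=[(0.0, None)])
print(res.x[0])              # pinned at the lower bound
```

At the boundary the score is nonzero and the sampling distribution of the estimator is not asymptotically normal, so inverting the Hessian there gives you numbers, but not standard errors in any honest sense.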
If you must use MLE, then I would strongly recommend finding the gradient analytically, which would boost the speed and accuracy of the estimates. If possible, find the Hessian as well; again, this will boost the speed and increase the accuracy further. Good luck.
Read up on CG as mentioned above. So long as you can find an approximation to the gradient which converges locally, you do not need it analytically or even exactly. The inverse Hessian can be found through approximations (SR1 or DFP). Sometimes, if the approximations are close and the analytical forms are nasty, the approximation methods can even be faster.
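For concreteness, here's a sketch of a quasi-Newton run where neither the gradient nor the Hessian is analytic (BFGS here, a sibling of the SR1/DFP updates just mentioned; a toy normal-likelihood example of my own):

```python
# Sketch: quasi-Newton with approximate derivatives. BFGS builds an
# inverse-Hessian approximation from successive gradient differences,
# and scipy will finite-difference the gradient itself when you don't
# supply jac -- so nothing needs an analytic form.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.5, size=2000)

def nll(theta):              # normal -log-likelihood in (mu, log sigma)
    mu, log_s = theta
    s = np.exp(log_s)
    return len(x) * log_s + 0.5 * np.sum(((x - mu) / s) ** 2)

res = minimize(nll, x0=[0.0, 0.0], method="BFGS")  # no jac: FD gradient
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
# res.hess_inv holds the BFGS approximation to the inverse Hessian;
# its diagonal gives approximate variances for (mu, log sigma).
```

The MLEs here are the sample mean and (population) standard deviation, and the finite-difference BFGS run recovers both to many digits, which is the "approximations can be plenty accurate" point above.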