The third frame with the "strawberry tongue" is particularly funny and significant. It is like a computer vision version of the "cheerleader effect". If you include enough different pictures, maybe people won't notice all the problems with each individual picture.
I said basically artifact-free, not that they are perfect. This is just an uncurated sample from the generation process with no picture selection, so you will be able to notice some obvious issues with individual samples.
It gets most features correct even at very high resolution. No one is going to notice the droplets or the ears unless they are sitting there analyzing the image. If for most prompts I can find something that passes the visual Turing test, that's quite good. It's already close to photorealism, and this all gets better with scale.