As with any generative model, trust but verify. Try it yourself. Frankly, as someone who researches generative models myself, there are plenty of reasons not to trust what you see in papers and project pages.
They link a Huggingface page (great sign!): https://huggingface.co/spaces/tencent/Hunyuan3D-2
I tried to replicate the objects they show on their project page (https://3d-models.hunyuan.tencent.com/). The full prompts are there but truncated on the page, so you can just inspect the element and grab the text.
Here's what I got
Leaf
PNG: https://0x0.st/8HDL.png
GLB: https://0x0.st/8HD9.glb
Guitar
PNG: https://0x0.st/8HDf.png other view: https://0x0.st/8HDO.png
GLB: https://0x0.st/8HDV.glb
Google Translate of Guitar:
Prompt: A brown guitar is centered against a white background, creating a realistic photography style. This photo captures the culture of the instrument and conveys a tranquil atmosphere.
PNG: https://0x0.st/8HDt.png and https://0x0.st/8HDv.png
Note: Weird thing on top of the guitar. But at least this time the strings aren't fusing into the sound hole.
I haven't tested my own prompts or the Google translation of the Chinese prompts because I'm getting an over-usage error (I'll edit the comment if I get them). That said, these look pretty good. The paper and page images definitely look better, but this isn't like Stable Diffusion 1 paper vs Stable Diffusion 1 reality.
But these are long and detailed prompts. Lots of prompt engineering. That should raise some suspicion. The real world has higher variance, and we want an idea of how hard this is to use. So let's try some simpler things :)
Prompt: A guitar
PNG: https://0x0.st/8HDg.png
Note: Not bad! Definitely overfit but does that matter here? A bit too thick for an electric guitar but too thin for an acoustic.
Prompt: A Monstera leaf
PNG: https://0x0.st/8HD6.png
https://0x0.st/8HDl.png
https://0x0.st/8HDU.png
Note: A bit wonkier. I picked this because it looked like the leaf in the example but this one is doing some odd things.
It's definitely a leaf and Monstera-like, but a bit of a mutant.
Prompt: Mario from Super Mario Bros
PNG: https://0x0.st/8Hkq.png
Note: Now I'm VERY suspicious....
Prompt: Luigi from Super Mario Bros
PNG: https://0x0.st/8Hkc.png
https://0x0.st/8HkT.png
https://0x0.st/8HkA.png
Note: Highly overfit[0]. This is what I suspected. Luigi isn't just tall Mario.
Where is the tie coming from? The suspender buttons are all messed up.
Really went uncanny valley here. So this suggests the model is really brittle.
Prompt: Peach from Super Mario Bros
PNG: https://0x0.st/8Hku.png
https://0x0.st/8HkM.png
Note: I'm fucking dying over here this is so funny. It's just a peach with a cute face hahahahaha
Prompt: Toad from Super Mario Bros
PNG: https://0x0.st/8Hke.png
https://0x0.st/8Hk_.png
https://0x0.st/8HkL.png
Note: Lord have mercy on this toad, I think it is a mutated Squirtle.
Paper can be found here (the arXiv badge on the page leads to a PDF in the repo, which GitHub is slow to render): https://arxiv.org/abs/2411.02293
(If you want to share images like I did, all I'm doing is `curl -F'file=@yourfile.png' https://0x0.st`)
[0] Overfit is a weird thing now. Maybe it doesn't generalize well, but sometimes that's not a problem. I think this is one of the bigger lessons we've learned with recent ML models. My viewpoint is "Sometimes you want a database with a human language interface. Sometimes you want to generalize". So we have to be more context-driven here. But certainly there are a lot of things we should be careful about when we're talking about generation. These things are trained on A LOT of data. If you're more "database-like" then certainly there are potential legal ramifications...
Edit: For context, by "look pretty good" I mean in comparison to other works I've seen. I think it is likely a ways from being useful in production. I'm not sure how much human labor would be required to fix the issues.
The first guitar has one of the strings end at the sound hole, and six tuning knobs for five strings.
The second has similar problems: it has tuning knobs with missing winding posts, then five strings becoming four at the bridge. It also has a pickup under the fretboard.
Are these considered good capability examples?
Thanks for this. The results are quite impressive, after trying it myself.
Oops, ran out of edit time when I was posting my last two.
This last one is really key for judging where the tech is at, btw. Most of the generations are assets you could download freely from the internet, and you could probably get better ones from some artist on Fiverr or something. But the last example is more of a realistic use case: something that is relatively reasonable, probably not in the set of easy-to-download assets, and might be something someone wants. It isn't too crazy of an ask. Given Chimera and how similar a dragon is to a bird in the first place, this should be on the "easier" end. I'm sure you could prompt engineer your way into it, but then we have to have the discussion of what costs more: a prompt engineer or an artist? And do you need a prompt engineer who can repair models? Because these look like they need repairs.
This can make it hard to really tell if there's progress or not. It is really easy to make compelling images in a paper and beat benchmarks while not actually creating something that is __or will become__ a usable product. All the little details matter. Little errors quickly compound... That said, I do much more on generative imagery than generative 3D objects, so grain of salt here.
Keep in mind: generative models (of any kind) are incredibly difficult to evaluate. You really only have a good idea after you've generated hundreds or thousands of samples yourself and have looked at a lot of them with high scrutiny.
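To put a rough number on why a handful of samples tells you so little, here is a minimal sketch that computes a Wilson score interval for an observed defect rate. The function name `wilson_interval` and the tallies in the loop are hypothetical, made up for illustration and not measurements of this model. It also assumes each generation can be judged as simply "defective" or "fine", which is a big simplification for 3D assets.

```python
import math


def wilson_interval(failures: int, samples: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a failure rate (z=1.96 is roughly 95% coverage)."""
    if samples <= 0:
        raise ValueError("samples must be positive")
    if not 0 <= failures <= samples:
        raise ValueError("failures must be between 0 and samples")
    p = failures / samples
    denom = 1 + z**2 / samples
    center = (p + z**2 / (2 * samples)) / denom
    half = z * math.sqrt(p * (1 - p) / samples + z**2 / (4 * samples**2)) / denom
    # Clamp to [0, 1] to guard against tiny floating point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Hypothetical tallies: the same 30% observed defect rate at three sample sizes.
    for failures, samples in [(3, 10), (30, 100), (300, 1000)]:
        lo, hi = wilson_interval(failures, samples)
        print(f"{failures}/{samples} defective: plausible rate {lo:.0%} to {hi:.0%}")
```

With ten samples the plausible range spans most of the way from "rarely broken" to "broken more often than not", and it only tightens to a few percentage points once you are into the hundreds or thousands.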