The killer feature of LLMs is their ability to extrapolate what's really wanted from short descriptions.
Look again at Gemini's output: it looks like an actual book cover, the kind of illustration you could find on a real book.
It takes corrections on board (albeit hilariously literally).
Look at GPT image's output: it doesn't look anything like a book cover, and when told it got it wrong, it just doubles down on what it was doing.
What you want, and what you think image generation is, simply isn't possible.