Hey, yeah, this is a fun idea. I built a little toy llm-tdd loop as a Saturday morning side project a little while back: https://github.com/zephraph/llm-tdd.
This doesn't actually work out that well in practice though because the implementations the llm tended to generate were highly specific to pass the tests. There were several times it would cheat and just return hard coded strings that matched the expects of the tests. I'm sure better prompt engineering could help, but it was a fairly funny outcome.
Something I've found more valuable is generating the tests themselves. Obviously you don't wholesale rely on what's generated. Tests can have a certain activation energy just to figure out how to set up correctly (especially if you're in a new project). Having an LLM take a first pass at it and then ensuring it's well structured and testing important codepaths instead of implementation details makes it a lot faster to write tests.
Did you show the test cases to it? Maybe blinding it would solve the tailoring problem.
> just return hard coded strings that matched the expects of the tests
I have done literally this in test ping-pong. It's fine. It just means it's on the other half of the loop to make the tests more in-depth.