Hey, awesome work on the error formatter. As explained above, I think error creation falls more under the server's responsibilities if it contains any logic about the error (e.g. whether it is retryable). It would be interesting to hear your opinion on a format for LLM-side errors.
Regarding the multi-server failure case: without the server manager, today if one of the servers dies the agent will keep going. I do not think this is a particularly thought-through decision; the client should probably error out or let the agent know that the server is dead. With the server manager, the agent will try to connect to the dead server, get an error, and possibly retry, but if the server stays unreachable, the agent will eventually bail. It is indeed an interesting problem. How do you see the responsibility split here?
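To make the question a bit more concrete, here is a rough sketch of one possible split, where the client retries the connection a bounded number of times and then raises an explicit error that the agent can be told about. This is not how the code works today; `ServerUnavailableError`, `connect_with_retries`, and `max_attempts` are hypothetical names made up for illustration.

```python
from typing import Callable


class ServerUnavailableError(Exception):
    """Hypothetical error raised once a server is considered dead."""


def connect_with_retries(
    connect: Callable[[], str],
    server_name: str,
    max_attempts: int = 3,
) -> str:
    """Try to connect up to max_attempts times, then bail with a clear error.

    `connect` stands in for whatever actually opens the session; it is
    assumed to raise ConnectionError when the server cannot be reached.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            return connect()
        except ConnectionError as exc:
            # Remember the failure and try again until attempts run out.
            last_error = exc
    # Out of attempts: surface one explicit error instead of failing silently.
    raise ServerUnavailableError(
        f"server '{server_name}' is unreachable after {max_attempts} attempts"
    ) from last_error
```

With something like this, the client owns the retry policy and the agent only sees a single, clear "server is dead" signal that it can react to or report.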
Regarding the flakiness: that is the ultimate dream, but it requires some more work. I think the client is in a privileged position to monitor this, and we will do it for sure. I think this is going to be great feedback for companies building servers. Happy to coordinate on ideas for how best to do this.
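As a starting point for that discussion, a client-side monitor could be as simple as counting calls and failures per server. The sketch below is only an assumption about what this might look like; `FlakinessMonitor`, `CallStats`, and the `threshold` and `min_calls` defaults are hypothetical and not part of any existing API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CallStats:
    """Per-server counters (hypothetical structure)."""

    calls: int = 0
    failures: int = 0

    @property
    def failure_rate(self) -> float:
        # Avoid dividing by zero for servers that were never called.
        return self.failures / self.calls if self.calls else 0.0


class FlakinessMonitor:
    """Tracks how often calls to each server fail, as seen by the client."""

    def __init__(self) -> None:
        self._stats: Dict[str, CallStats] = defaultdict(CallStats)

    def record(self, server_name: str, ok: bool) -> None:
        """Record the outcome of one call to a server."""
        stats = self._stats[server_name]
        stats.calls += 1
        if not ok:
            stats.failures += 1

    def flaky_servers(self, threshold: float = 0.2, min_calls: int = 5) -> List[str]:
        """Return servers with enough calls and a failure rate above threshold."""
        return sorted(
            name
            for name, stats in self._stats.items()
            if stats.calls >= min_calls and stats.failure_rate > threshold
        )
```

The `min_calls` guard is there so that a server with two calls and two failures is not reported as flaky before there is enough data.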