Hacker News

ares623 · last Saturday at 10:53 PM

Running inference for every interaction seems a bit wasteful IMO, especially with a chance for things to go wrong. I'm not smart enough to come up with a way to optimize such a repetitive operation, though.


Replies

sixdimensional · last Sunday at 12:01 AM

I totally agree. The reason I asked before offering any solution ideas was that I was curious what you might think.

My brain went to the concept of memoization, which we use to speed up function calls for common cases.
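Just to make the analogy concrete, here's the classic Python version of that idea (nothing LLM-specific, this is just stdlib functools.lru_cache):

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def fib(n: int) -> int:
        # Each distinct n is computed once; repeat calls are cache hits.
        return n if n < 2 else fib(n - 1) + fib(n - 2)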

If you had a proxy that sat in front of the LLM and cached deterministic responses for given inputs, with some way to maybe even give feedback when a response is satisfactory... this could be a building block for a runtime design mode or something like that.
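A rough sketch of what that proxy might look like (call_llm here is a placeholder for whatever client you'd actually use, and the approve() feedback hook is just one possible shape for "mark this response satisfactory"):

    import hashlib
    import json

    class CachingLLMProxy:
        # call_llm is a stand-in: any function (prompt, **params) -> str.
        def __init__(self, call_llm):
            self._call_llm = call_llm
            self._cache = {}  # request key -> approved response

        def _key(self, prompt, params):
            # Hash the prompt plus sampling params so only truly
            # identical requests share a cache entry.
            raw = json.dumps({"prompt": prompt, "params": params},
                             sort_keys=True)
            return hashlib.sha256(raw.encode()).hexdigest()

        def complete(self, prompt, **params):
            key = self._key(prompt, params)
            if key in self._cache:
                return self._cache[key]  # memoized: no inference call
            return self._call_llm(prompt, **params)

        def approve(self, prompt, response, **params):
            # Feedback hook: once a human marks a response satisfactory,
            # it becomes the canonical answer for that exact request.
            self._cache[self._key(prompt, params)] = response

Usage would be something like: call proxy.complete(...), and if the result looks good, proxy.approve(...) it; identical requests afterward skip inference entirely. Anything not yet approved still goes to the model every time, which is the "runtime design mode" flavor of the idea.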