It seems you haven't done the due diligence on what part of the API is expensive - constructing...

kang • yesterday at 9:17 PM • 2 replies • view on HN

It seems you haven't done the due diligence on what part of the API is expensive - constructing a prompt shouldn't be same charge/cost as llm pass.

Replies

coldtea • yesterday at 9:40 PM

It seems you haven't done the due diligence on what the parent meant :)

It's not about "constructing a prompt" in the sense of building the prompt string. That of course wouldn't be costly.

It is about reusing llm inference state already in GPU memory (for the older part of the prompt that remains the same) instead of rerunning the prompt and rebuilding those attention tensors from scratch.

➕ show 1 reply

computably • today at 2:41 AM

I said "prompting with the entire context every time," I think it should be clear even to laypersons that the "prompting" cost refers to what the model provider charges you when you send them a prompt.

alt Hacker News

Replies