Lol it's kinda surprising how shallow the general understanding of LLMs still is.
You already have agents that can do a lot of "thinking", which is really just generating guided context and then using that context to do tasks. Rough sketch below of how small that loop actually is.
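A minimal sketch in Python, with a hypothetical llm() standing in for whatever completion call you're actually making:

```python
# Sketch of "thinking as guided context generation".
# `llm` is a stand-in for any completion API; plug in your own.
def llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical model call

def agent_answer(question: str, steps: int = 3) -> str:
    context = ""
    for _ in range(steps):
        # Each "thought" is just more generated text appended to the context.
        thought = llm(f"Question: {question}\nContext so far:\n{context}\nThink step by step:")
        context += thought + "\n"
    # The final answer is conditioned on the model's own generated context.
    return llm(f"Question: {question}\nContext:\n{context}\nAnswer:")
```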
You already have vector databases that are used as context stores for information retrieval.
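And under the hood that's just nearest-neighbor search over embeddings. Something like this (embed() is hypothetical, any sentence-embedding model works):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError  # hypothetical: any embedding model

class VectorStore:
    def __init__(self):
        self.vectors, self.texts = [], []

    def add(self, text: str):
        self.vectors.append(embed(text))
        self.texts.append(text)

    def query(self, question: str, k: int = 3) -> list[str]:
        q = embed(question)
        # Cosine similarity against every stored chunk, return the top k.
        sims = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
                for v in self.vectors]
        top = np.argsort(sims)[-k:][::-1]
        return [self.texts[i] for i in top]
```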
Fundamentally, you can get the exact same performance on a lot of tasks whether all the knowledge lives inside the model's weights or you use a smaller model with a bunch of retrieved context guiding it.
So instead of wasting energy and time baking knowledge into the weights and blowing up the model size, you could have an "agent-first" model plus a set of vector database files. The model fits on a single graphics card, takes the question, decides which vector DB to load, and answers in essentially the same way. At ~$50 per TB of SSD you not only get massive cost efficiency, you also get a lot more inference per dollar, which you can spend on refinement, background processing, and so on.
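Gluing the two sketches above together, the routing step is tiny (all names hypothetical, including load_vector_store, which you'd implement as e.g. an mmap'd index off SSD):

```python
from pathlib import Path

# Hypothetical layout: one vector DB file per knowledge domain on disk.
DB_DIR = Path("/data/vector_dbs")

def route(question: str) -> Path:
    # The small model itself picks which domain store to load.
    names = [p.stem for p in DB_DIR.glob("*.db")]
    choice = llm(f"Pick the best knowledge store for: {question}\n"
                 f"Options: {names}\nName:")
    return DB_DIR / f"{choice.strip()}.db"

def answer(question: str) -> str:
    store = load_vector_store(route(question))  # hypothetical loader
    context = "\n".join(store.query(question))
    return llm(f"Context:\n{context}\nQuestion: {question}\nAnswer:")
```

The point being that only route() and answer() ever touch the GPU; the knowledge itself just sits on cheap storage until it's needed.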