The speed of the chatbot's response is startling when you're used to the simulated fast typing of ChatGPT and others. But the Llama 3.1 8B model Taalas uses predictably results in incorrect answers, hallucinations, and poor reliability as a chatbot.
What type of latency-sensitive applications are appropriate for a small-model, high-throughput solution like this? I presume this type of specialization is necessary for robotics, drones, or industrial automation. What else?
You could build realtime API routing and orchestration systems that rely on high quality language understanding but need near-instant responses. Examples:
1. Intent-based API gateways: convert natural language queries into structured API calls in real time (e.g., "cancel my last order and refund it to the original payment method" -> authentication, order lookup, cancellation, refund API chain).
2. And of course, realtime voice chat, kinda like you see in movies.
3. Security and fraud triage systems: parse logs without hardcoded regexes, issue alerts and full user reports in real time, and decide which automated workflows to trigger.
4. Highly interactive what-if scenarios powered by natural language queries.
This effectively gives you database-level speeds on top of natural language understanding.
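To make the intent-gateway idea concrete, here's a minimal sketch of the dispatch side. All of the handler names and the JSON schema are hypothetical; the model's only job is to emit the action chain as JSON, and the gateway executes it deterministically:

```python
import json

# Hypothetical backend handlers standing in for real service APIs.
def authenticate(ctx, args):
    return {"user": args["user_id"], "token": "t-123"}

def lookup_last_order(ctx, args):
    return {**ctx, "order_id": "o-987"}

def cancel_order(ctx, args):
    return {**ctx, "cancelled": True}

def refund(ctx, args):
    return {**ctx, "refunded_to": "original_payment_method"}

HANDLERS = {
    "authenticate": authenticate,
    "lookup_last_order": lookup_last_order,
    "cancel_order": cancel_order,
    "refund": refund,
}

def run_chain(model_output: str) -> dict:
    """Execute the API call chain the model emitted as a JSON list."""
    ctx = {}
    for step in json.loads(model_output):
        ctx = HANDLERS[step["action"]](ctx, step.get("args", {}))
    return ctx

# What a small model might emit for "cancel my last order and refund it
# to the original payment method" (assumed output format, not real):
model_output = json.dumps([
    {"action": "authenticate", "args": {"user_id": "u-42"}},
    {"action": "lookup_last_order"},
    {"action": "cancel_order"},
    {"action": "refund"},
])
print(run_chain(model_output))
```

The point of the structure: the model never touches the backend directly, it only picks a path through a fixed set of vetted handlers, so a fast-but-fallible model stays safe to run at gateway speeds.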
Routing in agent pipelines is another use. "Does user prompt A make sense with document type A?" If yes, continue; if no, escalate. That sort of thing.
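That gate is just a binary classifier sitting in front of the expensive path. A sketch, with the fast-model call stubbed out by a keyword heuristic (a real system would send a strict yes/no prompt to the small model):

```python
# Hypothetical low-latency gate in an agent pipeline. `classify` stands
# in for a call to a fast small model constrained to answer "yes"/"no";
# here it's stubbed with a keyword heuristic so the sketch is runnable.
def classify(prompt: str, doc_type: str) -> str:
    invoice_terms = {"refund", "invoice", "payment"}
    mentions_invoice = any(t in prompt.lower() for t in invoice_terms)
    return "yes" if (doc_type == "invoice") == mentions_invoice else "no"

def route(prompt: str, doc_type: str) -> str:
    # Cheap check first: only mismatches get escalated to a bigger
    # model or a human, everything else continues down the pipeline.
    if classify(prompt, doc_type) == "yes":
        return "continue"
    return "escalate"
```

Because the gate only ever answers continue/escalate, a wrong answer costs one unnecessary escalation rather than a bad end result, which is exactly the failure mode a small model can afford.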
I'm wondering how much the output quality of a small model could be boosted by taking multiple goes at it. Generate 20 answers and feed them back through with a "rank these responses" prompt. Or do something like MCTS.
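The generate-then-rank loop (often called best-of-N sampling) is simple to wire up. Here's a sketch with both model calls stubbed: `generate` stands in for sampling the small model at nonzero temperature, and `rank` stands in for feeding the candidates back with a "rank these responses" prompt:

```python
import random

def generate(prompt: str, rng: random.Random) -> str:
    # Stand-in for one temperature>0 sample from a fast small model.
    return f"answer-{rng.randint(0, 9)}"

def rank(prompt: str, candidates: list) -> str:
    # A real system would send the candidates back to the model with a
    # "rank these responses" prompt; this placeholder just picks the
    # lexicographically smallest so the sketch runs deterministically.
    return min(candidates)

def best_of_n(prompt: str, n: int = 20, seed: int = 0) -> str:
    """Sample n candidates, then let the ranking pass pick the winner."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return rank(prompt, candidates)
```

At very high tokens/s the 20x cost is mostly hidden: the candidates can be generated in parallel, so wall-clock latency is roughly two sequential model calls.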
Maybe summarization? I'd still worry about accuracy, but smaller models do quite well.
Language translation, chunk by chunk.
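One way to structure chunk-by-chunk translation is to split on sentence boundaries and stream each translated chunk back as it finishes, so per-chunk latency stays low. A sketch with the model call stubbed (the `[DE]` prefix is a placeholder, not a real translation):

```python
import re

def chunk_sentences(text: str) -> list:
    # Split after sentence-ending punctuation followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def translate(chunk: str) -> str:
    # Stand-in for a real translation call to a fast small model.
    return f"[DE] {chunk}"

def translate_streaming(text: str):
    """Yield translated chunks one at a time as they complete."""
    for chunk in chunk_sentences(text):
        yield translate(chunk)
```

Sentence-level chunks trade context (pronouns, idioms spanning sentences) for latency; a real pipeline might overlap a sentence or two of context into each chunk.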
Coding, for some future definition of "small model" that expands to include today's frontier models. Here's what I commented a few days ago on the codex-spark release:
"""
We're going to see a further bifurcation in inference use-cases in the next 12 months. I'm expecting this distinction to become prominent:
(A) Massively parallel (optimize for tokens/$)
(B) Serial low latency (optimize for tokens/s)
Users will switch between A and B depending on need.
Examples of (A):
- "Use subagents to search this 1M line codebase for DRY violations subject to $spec."
An example of (B):
- "Diagnose this one specific bug."
- "Apply these text edits."
(B) is used in funnels to unblock (A).
"""