What are some sample real world cases folks are using to fine tune their own small/medium models?
Just to prompt thought on this exact question; I'm genuinely interested in answers:
I just benchmarked a very simple document classification task, which we currently farm out to Haiku in parallel, against several smaller models: a very naive setup with the same prompt and the same API (AWS Bedrock). A few of the ~4B models are a pretty good match, and could easily be run locally or cheaply via a hosted provider. The "how much data and how much improvement" question is one I no longer have good intuition for; I don't even have an order-of-magnitude guess on either axis.
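The setup described above might look roughly like this; the model IDs, prompt wording, and JSON output schema here are my own illustrative assumptions, not the actual setup:

```python
# Hedged sketch: fan the same classification prompt out to several Bedrock
# models in parallel via the Converse API. Prompt text, output schema, and
# model IDs are illustrative guesses, not the poster's real configuration.
import json
from concurrent.futures import ThreadPoolExecutor

PROMPT = (
    "Classify this document. Reply with JSON only: "
    '{"doc_type": "...", "year": "...", "subject": "..."}\n\n'
)

def build_messages(first_pages_text):
    """Converse-API message list for one document (first 4 pages of text)."""
    return [{"role": "user", "content": [{"text": PROMPT + first_pages_text}]}]

def classify(client, model_id, first_pages_text):
    """One call through the Bedrock runtime Converse API; client is
    boto3.client('bedrock-runtime')."""
    resp = client.converse(modelId=model_id,
                           messages=build_messages(first_pages_text))
    return json.loads(resp["output"]["message"]["content"][0]["text"])

def classify_all(client, model_ids, first_pages_text):
    """Send the same document to several models concurrently."""
    with ThreadPoolExecutor(max_workers=len(model_ids)) as pool:
        futures = {m: pool.submit(classify, client, m, first_pages_text)
                   for m in model_ids}
        return {m: f.result() for m, f in futures.items()}
```

Usage would be something like `classify_all(boto3.client("bedrock-runtime"), ["meta.llama3-70b-instruct-v1:0"], doc_text)`, assuming AWS credentials are configured.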
Here are the raw numbers to spark discussion:

| Model          | DocType % | Year % | Subject % | Input $/MTok |
|----------------|-----------|--------|-----------|--------------|
| llama-70b      | 83        | 98     | 96        | $0.72        |
| gpt-oss-20b    | 83        | 97     | 92        | $0.07        |
| ministral-14b  | 84        | 100    | 90        | $0.20        |
| gemma-4b       | 75        | 93     | 91        | $0.04        |
| glm-flash-30b  | 83        | 93     | 90        | $0.07        |
| llama-1b       | 47        | 90     | 58        | $0.10        |
Percentages are doc type (categorical), year, and subject-name match against Haiku's labels; each model sees only the first 4 pages.
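The agreement-against-Haiku scoring described above is simple enough to sketch; the field names here are illustrative assumptions:

```python
# Hedged sketch: per-field % agreement of a candidate model's labels against
# the reference (Haiku) labels, over parallel lists of label dicts.
# Field names are assumed, not the poster's actual schema.
def agreement(reference, candidate, fields=("doc_type", "year", "subject")):
    assert len(reference) == len(candidate) and reference
    n = len(reference)
    return {
        f: round(100 * sum(r[f] == c[f] for r, c in zip(reference, candidate)) / n, 1)
        for f in fields
    }
```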
In the old world, where these were my own in-house models, I'd be interested in seeing whether I could uplift those numbers with training, but I haven't done that with the new LLMs in a while. Keen to get even a finger in the air if possible.
I can easily generate tens of thousands of examples.
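If those examples were turned into a fine-tuning set, one common route is chat-style JSONL records with the Haiku label as the target; the record shape below is the widely used `{"messages": [...]}` convention, an assumption rather than anything the thread specifies, and would need adjusting to whatever trainer is used:

```python
# Hedged sketch: build SFT records from (document text, Haiku label) pairs.
# The prompt text and the {"messages": [...]} JSONL shape are assumptions.
import json

def to_finetune_records(docs, labels):
    records = []
    for text, label in zip(docs, labels):
        records.append({
            "messages": [
                {"role": "user", "content": "Classify this document.\n\n" + text},
                # Target is the reference label serialized as JSON.
                {"role": "assistant", "content": json.dumps(label)},
            ]
        })
    return records

def write_jsonl(records, path):
    # One JSON object per line, the usual fine-tuning file format.
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")
```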
I might try it myself, but I'm always keen for an opinion.
_edit for table formatting_
Hi! I think this is a pretty good example:
https://www.atredis.com/blog/2024/6/3/how-to-train-your-larg...
Oh I wrote up a post on X on this exact question! https://x.com/danielhanchen/status/1979389893165060345?s=20
1. Cursor used online RL to get +28% approval rate: https://cursor.com/blog/tab-rl
2. Vercel used RFT for their AutoFix model for V0: https://vercel.com/blog/v0-composite-model-family
3. Perplexity's Sonar model for Deep Research reasoning was, I think, a finetuned model: https://docs.perplexity.ai/docs/getting-started/overview
4. Doordash uses LoRA and QLoRA for a "Generalized Attribute Extraction model": https://careersatdoordash.com/blog/unleashing-the-power-of-l...
5. NASA flood water detection: https://earthdata.nasa.gov/news/nasa-ibm-openly-release-geospatial-ai-foundation-model-nasa-earth-observation-data
6. Online RL for robotics - imagine teaching a robot in the future via some mini finetuning
7. OpenAI's RFT page has more: https://developers.openai.com/api/docs/guides/rft-use-cases
8. For larger models - https://www.mercor.com/blog/expert-data-drives-model-perform...