Some fine tuning data questions:
i see the dataset Google published in this notebook https://github.com/google-gemini/gemma-cookbook/blob/main/Fu... -- from looking at the dataset on huggingface, it looks synthetically generated.
1. do you recommend any particular mix or focus in the dataset for finetuning this model, without losing too much generality?
2. do you have any recommendations for how many examples per-tool?
thank you for your (and your team's) work!
> Do you recommend any particular mix or focus in the dataset for finetuning this model, without losing too much generality?
Astute questions! There are roughly two ways to think about finetuning: 1. Obliterate any general functionality and train the model only on your own commands. 2. As you asked, maintain generality, trying to preserve the initial model's abilities.
For 2, a low learning rate or LoRA is typically a good strategy. We show an example in the finetuning tutorial in the blog.
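To make the LoRA point concrete, here is a minimal plain-Python sketch of the core idea: the original weight matrix stays frozen and only a low-rank update (B @ A) is trained, which is why it disturbs the base model's general abilities far less than full finetuning. The matrices and values here are toy illustrations, not anything from the actual tutorial; a real run would use a library such as Hugging Face PEFT.

```python
# LoRA idea in miniature: freeze W, train only a rank-r update B @ A,
# and add it to W at inference. Toy plain-Python matrices for illustration.

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

d, r = 4, 1  # model dim 4, LoRA rank 1 (rank is a hyperparameter you choose)
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen
B = [[0.1] for _ in range(d)]       # d x r, trainable
A = [[0.2, 0.0, 0.0, 0.0]]          # r x d, trainable

delta = matmul(B, A)                # rank-1 update, shape d x d
W_adapted = add(W, delta)           # effective weight used at inference

# Trainable parameter count: 2*d*r = 8 instead of d*d = 16 for full tuning.
trainable = len(B) * len(B[0]) + len(A) * len(A[0])
print(trainable)  # 8
```

The same shape logic is what PEFT applies per attention/MLP layer; the rank trades off capacity against how much of the base model you risk overwriting.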
> 2. do you have any recommendations for how many examples per-tool?

This depends on the tool's complexity and the variety of user inputs. A simple tool like turn_flashlight_on(), with no args, will be learned quickly, especially if you're only prompting in English.
But if you have a more complex function like get_weather(lat, lon, day, region, date) and prompts coming in in English, Chinese, Gujarati, and Spanish, the model needs to do a lot more "heavy lifting" to both translate a request and fill out a complex query. As programmers we know that dates by themselves are insanely complex in natural language (12/18/2025 vs 18/12/2025).
To get this right, it helps if the model was trained on data that shows it the variations of inputs it will actually see.
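One cheap way to get that coverage is to enumerate the cross product of phrasings and surface forms when building the dataset. A hedged sketch, with a made-up JSON example format and hypothetical prompt templates (the real cookbook dataset schema may differ):

```python
# Sketch: generate per-tool training examples that vary language and
# date format while always targeting the same normalized function call.
import itertools

prompts = [
    "What's the weather in Paris on {date}?",   # English
    "Que tiempo hara en Paris el {date}?",      # Spanish (hypothetical)
]
# Same day written three ways -- exactly the 12/18 vs 18/12 ambiguity above.
date_formats = ["12/18/2025", "18/12/2025", "December 18, 2025"]

examples = []
for template, date in itertools.product(prompts, date_formats):
    examples.append({
        "user": template.format(date=date),
        # Target call uses one canonical ISO date regardless of input form.
        "call": {"name": "get_weather",
                 "args": {"lat": 48.86, "lon": 2.35, "date": "2025-12-18"}},
    })

print(len(examples))  # 2 prompts x 3 date formats = 6 examples
```

Scaling the prompt and format lists per tool gives you a rough knob for "examples per tool" that grows with the tool's argument complexity.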
Long answer, but I hope this makes sense.