
jeffjeffbear, last Thursday at 11:35 PM

> https://huggingface.co/google/t5gemma-2-1b-1b

From here it looks like it's still long-context and multimodal though?

> Inputs and outputs
>
> Input:
>
> - Text string, such as a question, a prompt, or a document to be summarized
> - Images, normalized to 896 x 896 resolution and encoded to 256 tokens each
> - Total input context of 128K tokens
>
> Output:
>
> - Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
> - Total output context up to 32K tokens
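A quick back-of-the-envelope sketch using only the figures quoted above (and treating 128K as 131,072 tokens, which is an assumption) to show how images eat into that input budget:

```python
# Token-budget arithmetic from the quoted model card figures:
# 256 tokens per 896x896 image, 128K input context, 32K output context.
TOKENS_PER_IMAGE = 256
INPUT_CONTEXT = 128 * 1024
OUTPUT_CONTEXT = 32 * 1024

def remaining_text_budget(num_images: int) -> int:
    """Input tokens left for text after encoding num_images images."""
    return INPUT_CONTEXT - num_images * TOKENS_PER_IMAGE

print(remaining_text_budget(8))           # 131072 - 2048 = 129024 tokens left for text
print(INPUT_CONTEXT // TOKENS_PER_IMAGE)  # 512 images would fill the entire input context
```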


Replies

rhdunn, yesterday at 8:33 AM

If you are finetuning the model, you need to replicate the training conditions so you don't remove those capabilities. If you only finetune a multimodal model on text, it will lose some of its vision capabilities, as the text part of the model drifts away from the vision, audio, etc. components. A similar thing happens when finetuning reasoning models.
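A minimal sketch of one common mitigation, assuming a PyTorch/transformers-style model whose vision-encoder parameters have "vision" in their names (the loader class and the naming are assumptions, not taken from the T5Gemma release):

```python
from transformers import AutoModelForSeq2SeqLM  # assumed class; check the model card for the right one

model = AutoModelForSeq2SeqLM.from_pretrained("google/t5gemma-2-1b-1b")

# Freeze the vision-related parameters so a text-only finetune can't drag them
# away from the representations the rest of the model was trained against.
# Filtering on "vision" in the parameter name is an assumption about naming.
for name, param in model.named_parameters():
    if "vision" in name:
        param.requires_grad = False

# Alternatively (or additionally), keep some image+text examples in the
# finetuning mix so the multimodal pathway still receives gradient signal.
```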

Even if you did finetune the model with both text and images, you could run into issues from using different image descriptions than the ones it was trained with. You could probably work around that by getting the model to describe the images itself, but you'll still need to audit the results to correct any issues or add whatever you are training for.
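A sketch of that workaround: caption each image with the base model first, build the finetuning examples from those captions, then audit them. The processor/generate calls follow the usual transformers pattern, but the exact classes for this particular model are an assumption:

```python
from PIL import Image

def caption_image(model, processor, path: str) -> str:
    """Generate a description in the distribution the model already produces."""
    image = Image.open(path)
    inputs = processor(images=image, text="Describe this image.", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=128)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]

# Audit step: review each caption by hand, fix mistakes, and add whatever
# labels or details the finetune is actually targeting before training on it.
```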

You can also run into overfitting if your data does not include enough variation relative to the data the original model was trained on.

Using different training parameters could also affect the model's capabilities. Just knowing things like the input context length isn't enough.
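For instance, a finetuning config touches far more than the sequence length; a hedged sketch with placeholder values (not recommendations):

```python
from transformers import TrainingArguments

# Illustrative placeholders only; the right values depend on the model and data.
args = TrainingArguments(
    output_dir="t5gemma-finetune",
    learning_rate=1e-5,              # too aggressive a rate can wipe out pretrained behaviour
    num_train_epochs=1,
    per_device_train_batch_size=4,
    warmup_ratio=0.03,
    weight_decay=0.01,
)
# Sequence length, image preprocessing, and loss masking live in the
# tokenizer/processor/collator and also need to match the original training setup.
```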
