Hacker News

bigdict, last Wednesday at 10:56 PM (4 replies)

Sure, you can get better model performance by throwing more compute at the problem in different places. But does it improve performance on an isoflop basis?
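To make the question concrete: an iso-FLOP comparison holds total training compute fixed, often estimated with the common 6·N·D rule for dense transformers (N parameters, D tokens), and asks whether a method still wins at the same budget. A minimal sketch, with a purely hypothetical budget and a hypothetical 20% per-token overhead for the new method:

```python
# Rough iso-FLOP comparison using the common 6*N*D estimate for
# dense-transformer training FLOPs (N = params, D = training tokens).
def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

BUDGET = 1e21      # fixed FLOP budget (hypothetical)
n_params = 1e9     # same model size for both runs (hypothetical)
overhead = 1.20    # new method costs ~20% more FLOPs/token (hypothetical)

tokens_baseline = BUDGET / (6 * n_params)
tokens_variant = BUDGET / (6 * n_params * overhead)

# At a fixed budget the costlier method trains on fewer tokens, so it
# must beat the baseline's quality despite that handicap to win iso-FLOP.
print(f"baseline tokens: {tokens_baseline:.3e}")
print(f"variant tokens:  {tokens_variant:.3e}")
```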


Replies

Reubend, yesterday at 12:55 AM

It's a valid criticism that this method would increase compute requirements, but sometimes an improvement in the end result justifies the compute needed. For things like code generation in large datasets, many people would be willing to "pay" with more compute if the results were better. And this doesn't seem to require more memory bandwidth, so it could be particularly good for local models.
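The memory-bandwidth point matters because single-stream local inference is usually bandwidth-bound: every generated token streams the full weight set from memory while the arithmetic units sit mostly idle. A roofline-style back-of-the-envelope, with hypothetical consumer-GPU numbers:

```python
# Roofline-style estimate for single-stream local inference.
# All hardware numbers below are hypothetical, for illustration only.
PEAK_FLOPS = 100e12   # 100 TFLOP/s of compute
PEAK_BW = 1e12        # 1 TB/s of memory bandwidth

n_params = 8e9        # 8B-parameter model (hypothetical)
bytes_per_param = 2   # fp16 weights

# Bandwidth floor: time to stream the weights once per token.
t_mem = n_params * bytes_per_param / PEAK_BW
# Compute time for the ~2*N matmul FLOPs of one forward token.
t_compute = 2 * n_params / PEAK_FLOPS

print(f"memory time/token:  {t_mem * 1e3:.2f} ms")
print(f"compute time/token: {t_compute * 1e3:.2f} ms")
# Compute finishes ~100x sooner than the weight stream, so a method that
# adds FLOPs without adding memory traffic barely moves local latency.
```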

fabmilo, yesterday at 2:27 AM

I read the paper, and the results don't really convince me that that's the case. But the problem still remains: how to use information from different parts of the model without squashing it into a single value with the softmax.
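For readers unfamiliar with the "squashing": in standard attention, each query's scores over all positions pass through a softmax and collapse into one weighted average of the value vectors, so the output is a single vector no matter how long the context is. A minimal NumPy sketch of that reduction:

```python
import numpy as np

def attention(q, K, V):
    # Score the query against every position, then softmax collapses the
    # scores into a single probability distribution over positions...
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    # ...and the output is one weighted average of the value vectors:
    # all per-position information is squashed into a single vector.
    return w @ V

rng = np.random.default_rng(0)
d, L = 8, 5  # head dim and context length (toy sizes)
q = rng.normal(size=d)
K = rng.normal(size=(L, d))
V = rng.normal(size=(L, d))
out = attention(q, K, V)
print(out.shape)  # one d-dim vector, regardless of context length L
```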

eightysixfour, last Wednesday at 11:58 PM

That's... not always a given for SOTA-sized models. When the ROI on more training stalls, it's nice to have alternatives, whether that's RL-tuned reasoning models or alternative architectures that improve specific areas of weakness.

jwilber, last Wednesday at 11:00 PM

There’s no one-size-fits-all answer here, but in my experience, conv-based methods outperform strictly attention-based methods on long contexts. See evo2:

“With the current implementation of Evo2, we do not have the heavily optimized kernels in place for convolution operators like we do for attention layers in a model like llama2. Even with this shortcoming, we see that the benefit from including more convolutional layers makes up for the earlier stage of optimization at around the 64k context length. Beyond that point we see an improvement in performance even compared to a highly optimized transformer model.”

https://docs.nvidia.com/bionemo-framework/latest/models/evo2...
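The crossover described in the quote follows from the asymptotics: per layer, self-attention costs O(L²·d) in context length L, while a convolution costs O(L·k·d) for kernel width k, so past some L the conv layer wins even with a heavy penalty for unoptimized kernels. A back-of-the-envelope sketch; all constants here are illustrative, not measured from Evo2:

```python
# Per-layer FLOP estimates (constants are hypothetical, for illustration).
def attn_flops(L, d):
    # QK^T and attn@V each cost about 2*L^2*d multiply-adds.
    return 4 * L**2 * d

def conv_flops(L, d, k, kernel_penalty=8.0):
    # Depthwise conv: ~2*L*k*d, scaled by a penalty for unoptimized kernels.
    return 2 * L * k * d * kernel_penalty

d, k = 4096, 7
for L in (4_096, 16_384, 65_536, 262_144):
    ratio = attn_flops(L, d) / conv_flops(L, d, k)
    print(f"L={L:>7}: attention/conv FLOP ratio ~ {ratio:.1f}x")
# The ratio grows linearly in L: quadratic attention eventually loses to
# linear-in-L convolutions at long context, consistent with the quoted
# Evo2 observation of a crossover around 64k.
```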