Hacker News

throwaway314155 · yesterday at 10:01 PM · 2 replies

That doesn’t tell you if the new method continues to perform better at higher parameter counts.


Replies

tuned · today at 6:39 AM

It most likely will in terms of performance, since it uses 50% less memory (certainly at inference time, which is the operation web services run most often): it can leverage longer T and D, provided the design is confirmed and the generation quality is comparable to other models. If this basic assumption holds, it means significant electricity savings, because the same GPUs can serve more requests.
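A back-of-envelope sketch of the "same GPUs, more requests" argument. All numbers here are illustrative assumptions (GPU size, weight footprint, per-request memory), not figures from the thread or the paper:

```python
# Hypothetical serving budget: halving per-request memory roughly
# doubles how many concurrent requests fit on one GPU.

GPU_MEMORY_GB = 80        # single accelerator (assumed)
MODEL_WEIGHTS_GB = 30     # memory reserved for weights (assumed)
MEM_PER_REQUEST_GB = 2.0  # baseline activations/cache per request (assumed)

def concurrent_requests(mem_per_request_gb: float) -> int:
    """Requests that fit in the memory left over after loading weights."""
    free = GPU_MEMORY_GB - MODEL_WEIGHTS_GB
    return int(free // mem_per_request_gb)

baseline = concurrent_requests(MEM_PER_REQUEST_GB)       # 25
halved = concurrent_requests(MEM_PER_REQUEST_GB * 0.5)   # 50
print(baseline, halved)
```

Under these assumptions, throughput per GPU doubles, which is where the electricity-savings claim comes from; the real ratio depends on how much of the memory budget the weights themselves occupy.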

amelius · yesterday at 10:51 PM

Nor does it tell you whether training from scratch will even work.
