> (~1TB / SSD_read_speed + computation_time_per_chunk_in_RAM) = a few minutes per ~word or punctuation.
You have to divide SSD read speed by the size of the active parameters (~16 GB at 4-bit quantization), not the entire model size, since only the active experts need to be streamed per token. If you are lucky, you might get around one token per second with speculative decoding, but I agree with the general point that it will be very slow.
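To put rough numbers on it (all illustrative assumptions: a ~2 GB/s SSD, the ~16 GB active-parameter figure above, and a hypothetical ~8x speedup from speculative decoding):

```python
# Back-of-envelope estimate of SSD-bound decode speed for a sparse (MoE) model.
# All numbers are illustrative assumptions, not measurements.

def tokens_per_second(ssd_read_gbps, active_params_gb, speculative_speedup=1.0):
    """Each generated token requires streaming the active parameters from SSD,
    so decode speed is bounded by read bandwidth / active parameter size."""
    return ssd_read_gbps / active_params_gb * speculative_speedup

baseline = tokens_per_second(2.0, 16.0)        # 2 GB/s over 16 GB -> 0.125 tok/s
with_spec = tokens_per_second(2.0, 16.0, 8.0)  # optimistic speculation -> 1.0 tok/s

print(f"baseline: {baseline:.3f} tok/s, with speculation: {with_spec:.2f} tok/s")
```

Dividing by the full model size instead (say ~400 GB) would give the minutes-per-token figure in the quote; the active-parameter denominator is what makes MoE models borderline usable from disk.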