What they seem to want is fast-read, slow-write memory. "Primary applications include model weights in ML inference, code pages, hot instruction paths, and relatively static data pages". Is there device physics for cheaper, smaller fast-read slow write memory cells for that?
For "hot instruction paths", caching is already the answer. Not sure about locality of reference for model weights. Do LLMs blow the cache?
> Do LLMs blow the cache?
Sometimes very yes?
If you've got 1 GB of weights, those are coming through the caches on their way to the execution units somehow.
Many caches are smart enough to recognize these accesses as strided, streaming, heavily prefetchable, evictable reads and optimize for that.
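A minimal sketch of that access pattern in plain C (the dot-product framing, the prefetch distance, and the `__builtin_prefetch` usage are my own illustration, not anything from the paper; hardware prefetchers usually catch a sequential stream like this on their own anyway):

```c
#include <stddef.h>

/* Streaming, read-once pass over a large weight row: exactly the kind of
 * access a cache can prefetch aggressively and evict early. */
float dot_row(const float *weights, const float *activations, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        /* GCC/Clang builtin: hint that data a few cache lines ahead will be
         * read (rw = 0) with no reuse (locality = 0), so it can be streamed
         * in and dropped. */
        if (i + 64 < n)
            __builtin_prefetch(&weights[i + 64], 0, 0);
        acc += weights[i] * activations[i];
    }
    return acc;
}
```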
Many models are now quantized too, to reduce the overall memory bandwidth needed for execution, which also helps with caching.
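Rough numbers on why that matters when decoding is memory-bandwidth-bound (the 7B parameter count and the two formats are just illustrative assumptions):

```c
#include <stdio.h>

int main(void)
{
    const double params  = 7e9;                  /* hypothetical 7B-parameter model */
    const double gb_fp16 = params * 2.0 / 1e9;   /* 2 bytes per weight   */
    const double gb_int4 = params * 0.5 / 1e9;   /* 0.5 bytes per weight */

    /* If every weight is read once per generated token, this is the
     * memory traffic per token: */
    printf("FP16: %.1f GB/token\n", gb_fp16);    /* 14.0 */
    printf("INT4: %.1f GB/token\n", gb_int4);    /*  3.5 */
    return 0;
}
```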
Probably not what they want, but NOR flash is generally directly addressable; it's commonly used to replace mask ROMs.
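A minimal sketch of what direct addressability buys you, assuming a hypothetical memory-mapped NOR region at a made-up base address (execute-in-place style):

```c
#include <stdint.h>

#define NOR_BASE 0x08000000u  /* hypothetical address where the NOR flash is mapped */

typedef struct {
    uint32_t magic;
    uint32_t version;
} fw_header_t;

/* Read a field straight out of flash through an ordinary pointer:
 * no block-device driver, no staging copy into RAM, just like a mask ROM. */
uint32_t read_fw_version(void)
{
    const volatile fw_header_t *hdr = (const volatile fw_header_t *)NOR_BASE;
    return hdr->version;
}
```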
> device physics for cheaper, smaller
And lower power usage. Datacenters and mobile devices will always want that.
Yes, this is from the paper:
> "The key insight motivating LtRAM is that long data lifetimes and read heavy access patterns allow optimizations that are unsuitable for general purpose memories. Primary applications include model weights in ML inference, code pages, hot instruction paths, and relatively static data pages—workloads that can tolerate higher write costs in exchange for lower read energy and improved cost per bit. This specialization addresses fundamental mismatches in current systems where read intensive data competes for the same resources as frequently modified data."
Essentially I guess they're calling for more specialized hardware for LLM tasks, much like what was done with networking equipment: dedicated packet-processing silicon with specialized SRAM/DRAM/TCAM tiers to keep latency to a minimum (rough TCAM sketch below).
While there's an obvious need for that in networking, given the traffic flowing across the internet, the practical question is whether LLMs are really going to scale like that, or whether there's a massive AI/LLM bubble about to pop. Who knows? The tea leaves are unclear.
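For the TCAM part of that analogy, here's a rough software emulation of what the hardware does in a single parallel cycle (the entry layout and 32-bit key width are illustrative assumptions):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t value;  /* e.g. an IPv4 destination prefix */
    uint32_t mask;   /* 1 = "care" bit, 0 = wildcard    */
} tcam_entry_t;

/* Return the index of the first (highest-priority) entry whose non-wildcard
 * bits match the key, or -1. A real TCAM compares every entry in parallel in
 * one cycle; this loop is the sequential software equivalent. */
int tcam_lookup(const tcam_entry_t *table, size_t n, uint32_t key)
{
    for (size_t i = 0; i < n; i++)
        if ((key & table[i].mask) == table[i].value)
            return (int)i;
    return -1;
}
```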
Cheaper / smaller? I would say not likely. There is already an enormous amount of market pressure to make SRAM and DRAM smaller.
Device physics-wise, you could probably make SRAM faster by dropping the transistor threshold voltage. It would also make it harder / slower to write. The bigger downside is that it would have higher leakage power, but if it's a small portion of all the SRAM, it might be worth the tradeoff.
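For reference, the standard subthreshold-leakage relation from textbook device physics (not something from the paper) shows why that penalty grows quickly:

```latex
I_{\text{sub}} \propto e^{(V_{GS} - V_{th}) / (n V_T)}, \qquad V_T = kT/q \approx 26\ \text{mV at room temperature}
```

Leakage rises roughly a decade for every ~60 to 100 mV taken off V_th, so you'd only want the low-V_th cells on a small, read-mostly slice of the SRAM.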
For DRAM, there isn't as much "device" involved because the storage element isn't transistor-based. You could probably make some design tradeoff in the sense amplifier to reduce read times by trading off write times, but I doubt it would make a significant change.