Though at the limit, it would have to be at least O(sqrt(n)) thanks to the Bekenstein bound [0]. And of course, as mentioned in TFA, you can always do better if you can get away with local random access in parallel, rather than global random access.
I've seen this idea before.
Similar article on the same topic from 2014: https://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html
I disagree with this model because it assumes processing occurs at a point and memory is (optimally) distributed across the space around it in every direction, in analogy to a von Neumann CPU architecture. However, it is entirely possible to distribute compute along with the memory. For example, Samsung has a technology called PIM (Processing in Memory), where simple compute units are inserted inside HBM memory layers. Algorithms that can take advantage of this run much faster and at much lower power because they skip the bus entirely. More importantly, the compute scales in proportion to the memory size/space.
The meat of the article:
> In my binary field code, I found that an 8-bit precomputation table (in that context, 2^24 items taking up 128 MB) led to faster computations than a 16-bit precomputation table (2^32 items taking up 8 GB): while the latter fit into RAM, the former could fit into cache, and the faster access time of the former was decisive.
Interesting to see some first principles thinking to back out the “why”.
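You can get a rough feel for that effect yourself. The sketch below (my own toy, not the article's code) times random 8-byte reads from tables of a few sizes; the 2 GiB table stands in for the article's 8 GiB one so it allocates comfortably, and the per-lookup numbers include some numpy overhead, so treat them as relative rather than absolute.

```python
import time
import numpy as np

def ns_per_lookup(table_bytes, n_lookups=2_000_000, seed=0):
    """Average time for random 8-byte reads from a table of the given size."""
    rng = np.random.default_rng(seed)
    table = np.ones(table_bytes // 8, dtype=np.uint64)  # touch pages up front
    idx = rng.integers(0, table.size, size=n_lookups)
    t0 = time.perf_counter()
    _ = int(table[idx].sum())            # the gather forces the random accesses
    return (time.perf_counter() - t0) / n_lookups * 1e9

# A cache-sized table, the quote's 128 MiB table, and a RAM-sized stand-in.
for label, size in [("1 MiB", 2**20), ("128 MiB", 128 * 2**20), ("2 GiB", 2 * 2**30)]:
    print(f"{label:>8} table: {ns_per_lookup(size):6.1f} ns/lookup")
```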
I don't think this holds up. Historically, memory sizes have increased exponentially, but access times have gotten faster, not slower. And since the access time comes from the memory architecture, you can get 8 GB of RAM or 64 GB of RAM with the same access times. The estimated values in the table are not an especially good fit (30-50% off) and get worse if you adjust the memory sizes.
Theoretically, it still doesn't hold up, at least not for the foreseeable future. PCBs and integrated circuits are basically two-dimensional. Access times are limited by things like trace lengths (at the board level) and parasitics (at the IC level), none of which are defined by volume.
I do like the idea of this, but after writing a longer response explaining my positive view of it, I came to a different conclusion. I was thinking this could be a useful measure for programs running on CPUs with contention, where your data will occasionally drop out of cache because the CPU is doing something else. But in that situation, you'd expect the L1 memory speed to be overtaken _before_ the L1 memory size is reached. This function instead fits fairly well to the actual L1 size (as given by ChatGPT, anyway), meaning it's best thought of as a measure of random access speed on an uncontended CPU.
That being said, I do still like the fundamental idea of figuring out a rough but usable O-estimate for random memory access speeds in a program. It never hurts to have more quick estimation tools in your toolbox.
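If you want that estimate for your own machine rather than a table, one rough approach (a sketch, not from the article) is to sweep working-set sizes and fit the exponent on a log-log scale. The gather loop lets misses overlap and carries numpy overhead, so the fitted exponent is only indicative.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
sizes = [2**k for k in range(16, 31, 2)]           # 64 KiB .. 1 GiB working sets
ns_per_access = []
for n_bytes in sizes:
    data = np.ones(n_bytes // 8, dtype=np.uint64)  # 8-byte slots, pages touched
    idx = rng.integers(0, data.size, size=1_000_000)
    t0 = time.perf_counter()
    _ = int(data[idx].sum())                       # 1M random reads
    ns_per_access.append((time.perf_counter() - t0) / len(idx) * 1e9)

# Fit t = c * N^a on a log-log scale; the article's model predicts a ~ 1/3.
a, _ = np.polyfit(np.log(sizes), np.log(ns_per_access), 1)
print(f"fitted exponent a ~ {a:.2f}")
```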
This doesn't seem all that compelling to me - the practical argument relied on fitting into cache, which is going to behave more like a step function than N^1/3.
As for the theoretical argument... it could be true, but I suspect this is too simple a model to give useful results.
Is it just me, or does it feel like the "empirical argument" is correct (it is an observation, after all), but the "theoretical argument" is wildly off?
My understanding is that different levels of cache and memory are implemented quite differently to optimize for density or speed. That is, this scaling is not the result of some natural geometric law; it was designed that way by the chip designers to serve the expected workloads. Some chips, like IBM's mainframe CPUs, have huge caches, which might not follow the same scaling.
I'm no performance expert, but this struck me as odd.
https://en.algorithmica.org/hpc/external-memory/model/
Sometimes everything else is O(0)
You have to cool memory when it reads or writes, so accessible memory scales with surface area, not volume. Spreading N bytes over a surface gives a linear extent of ~N^(1/2), and hence ~N^(1/2) signal travel time: O(N^0.5) becomes the new floor.
If it turns out the universe is holographic, then you have the same bound. This might just be enforcement of the holographic principle...
I've worked a bit comparing precomputed lookup tables with computing the result in real time, and in many cases a plain operation - e.g. a multiply - will be faster than the corresponding lookup. Point being that, for the example he gives, it's also necessary to know how well optimized the machine is for performing the computation you want to look up.
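A minimal numpy sketch of that trade-off (my own toy, not the article's code; the article's binary-field multiplication is carry-less, so an ordinary integer multiply is only a stand-in here): even though the 256x256 product table fits comfortably in cache, the vectorized multiply usually wins on modern machines because the gather itself isn't free.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=5_000_000, dtype=np.uint8)
y = rng.integers(0, 256, size=5_000_000, dtype=np.uint8)

# 256x256 precomputed product table: 128 KiB of uint16, easily cache-resident.
table = np.outer(np.arange(256, dtype=np.uint16), np.arange(256, dtype=np.uint16))

t0 = time.perf_counter()
via_table = table[x, y]                  # gather: one table load per element
t_lut = time.perf_counter() - t0

t0 = time.perf_counter()
via_mul = x.astype(np.uint16) * y        # compute directly (SIMD multiply)
t_mul = time.perf_counter() - t0

assert np.array_equal(via_table, via_mul)
print(f"lookup: {t_lut*1e3:.1f} ms   multiply: {t_mul*1e3:.1f} ms")
```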
I would like to give this a name: octonion memory
The math looks suspicious to me, or at least how it is presented.
If, as stated, accessing one register requires ~0.3 ns and available registers sum up to ~2560 B, while accessing RAM requires ~80 ns and available RAM is ~32 GiB, then it means that memory access time is O(N^1/3) where N is the memory size.
Thus accessing the whole N bytes of memory of a certain kind (registers, or L1/L2/L3 cache, or RAM) takes N * O(N^1/3) = O(N^4/3).
One could argue that the title "Memory access is O(N^1/3)" refers to memory access time, but that contradicts the very article's body, which explains in detail "in 2x time you can access 8x as much memory" both in text and with a diagram.
Such a statement would require that accessing the whole N bytes of memory of a certain kind takes O(N^1/3) time, while the measurements themselves produce a very different estimate: accessing the whole N bytes of memory of a certain kind takes O(N^4/3) time, not O(N^1/3).
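A quick numerical check of that reading, using nothing beyond the four figures quoted above (register and RAM sizes and latencies):

```python
reg_bytes, reg_ns = 2560, 0.3
ram_bytes, ram_ns = 32 * 2**30, 80.0

size_ratio = ram_bytes / reg_bytes             # ~1.3e7
time_ratio = ram_ns / reg_ns                   # ~267
print(size_ratio ** (1/3), time_ratio)         # ~238 vs ~267: per-access time ~ N^(1/3)

# But touching *all* N bytes at ~N^(1/3) ns each is N * N^(1/3) = N^(4/3) total:
print((ram_bytes * ram_ns) / (reg_bytes * reg_ns))  # ~3.6e9 ~ size_ratio**(4/3),
# within the same ~10-15% slack as the per-access fit.
```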
Ah yes, pretending we can access infinite amounts of memory instantaneously, or in a finite/bounded amount of time, is the Achilles' heel of the von Neumann abstract computer model, and is the point where it completely diverges from physical reality.
Acknowledging that memory access is not instantaneous immediately throws you into the realm of distributed systems though and something much closer to an actor model of computation. It's a pretty meaningful theoretical gap, more so than people realize.
I think the notation is supposed to mean worst-case performance. This is more an argument about amortized time analysis.
Not if the data is on a tape in a Turing machine.
Ummm, guys, when we talk about memory access "in theory", can we just be rigorous and talk about the computational model?
The real RAM model "in theory" tells me that memory access is O(1). Of course, real RAM is a spherical cow, but we could use a bit more rigor.
The fancy word for assuming O(1) access and not needing to spill over into longer addressing modes is, I believe, "transdichotomous model" - https://en.wikipedia.org/wiki/Transdichotomous_model
> In computational complexity theory, and more specifically in the analysis of algorithms with integer data, the transdichotomous model is a variation of the random-access machine in which the machine word size is assumed to match the problem size.
Let's be careful about exactly what we mean by this. When we say an algorithm is O(f(N)), we need to be very clear about what N we're talking about. The whole point is to focus on a few variables of interest (often "number of elements" or "total input size") while recognizing that we are leaving out many constants: CPU speed, average CPU utilization, the speed of light, and (usually) the size of memory. If I run a task against some data and say "good job guys, this only took 1 second on 1000 data points! Looks like our work here is done", it would be quite unnerving to learn that the algorithm is actually O(3^N). 1000 data points better be pretty close to the max I'll ever run it on; 2000 data points and I might be waiting until the heat death of the universe.
I'm seeing some commenters happily adding 1/3 to the exponent of other algorithms. This insight does not make an O(N^2) algorithm O(N^7/3) or O(N^8/3) or anything else; those are different Ns. It might be O(N^2 + (N*M)^1/3) or O((N * M^(1/3))^2) or almost any other combination, depending on the details of the algorithm.
Early algorithm design was happy to treat "speed of memory access" as one of these constants that you don't worry about until you have a speed benchmark. If my algorithm takes 1 second on 1000 data points, I don't care if that's because of memory access speed, CPU speed, or the speed of light -- unless I have some control over those variables. The whole reason we like O(N) algorithms more than O(N^2) ones is because we can (usually) push them farther without having to buy better hardware.
More modern algorithm design does take memory access into account, often by trying to maximize usage of caches. The abstract model is a series of progressively larger and slower caches, and there are ways of designing algorithms that have provable bounds on their usage of these various caches. It might be useful for these algorithms to assume that the speed of a cache access is O(M^1/3), where M is the size of memory, but that actually lowers their generality: the same idea holds between L2 -> L3 cache as L3 -> RAM and even RAM -> Disk, and certainly RAM -> Disk does not follow the O(M^1/3) law. See https://en.wikipedia.org/wiki/Cache-oblivious_algorithm
So basically this matters for people who want some idea of how much faster (or slower) algorithms might run if they change the amount of memory available to the application, but even that depends so heavily on details that it's not likely to be "8x memory = 2x slower". I'd argue it's perfectly fine to keep M^(1/3) as one of your constants that you ignore in algorithm design, even as you develop algorithms that are more cache- and memory-access-aware. This may justify why cache-aware algorithms are important, but it probably doesn't change their design or analysis at all. It seems mainly just a useful insight for people responsible for provisioning resources who think more hardware is always better.
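For a concrete feel of why the cache-awareness itself matters regardless of the exact exponent, here's a throwaway sketch (not tied to the article; the matrix size is arbitrary): summing the same row-major matrix along contiguous rows vs. strided columns does identical arithmetic but produces very different memory-access patterns.

```python
import time
import numpy as np

a = np.ones((8192, 8192))                 # 512 MiB of float64, row-major (C order)

def timed(f):
    t0 = time.perf_counter(); f(); return time.perf_counter() - t0

# Identical arithmetic, different access pattern:
row_wise = timed(lambda: sum(float(a[i, :].sum()) for i in range(a.shape[0])))  # contiguous
col_wise = timed(lambda: sum(float(a[:, j].sum()) for j in range(a.shape[1])))  # strided
print(f"row-wise: {row_wise:.2f} s   column-wise: {col_wise:.2f} s")
```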
Mathematically wrong, unless restated correctly: "accessing ALL memory at a given cache level is O(N^[1/3])". After restatement: trite and useless.
Accessing any given byte in the address space is amortized O(1).
> L3 cache is not built for mass throughput in the same way that DRAM is, and so it has roughly identical mass throughput despite its much closer distance to the computation.
"The von Neumann bottleneck is impeding AI computing?" (2025) https://news.ycombinator.com/item?id=45398473 :
> How does Cerebras WSE-3 with 44GB of 'L2' on-chip SRAM compare to Google's TPUs, Tesla's TPUs, NorthPole, Groq LPU, Tenstorrent's, and AMD's NPU designs?
From https://news.ycombinator.com/item?id=42875728 :
> WSE-3: 21 PB/S
From https://hackernoon.com/nvidias-mega-machine-crushes-all-of-2... :
> At Computex 2025, Nvidia’s Jensen Huang dropped a bombshell: the NVLink Spine, a compute beast pumping 130 terabytes per second, eclipsing the internet’s 2024 peak of 112.5 TB/s.
"A Comparison of the Cerebras Wafer-Scale Integration Technology with Nvidia GPU-based Systems for Artificial Intelligence" (2025-03) https://arxiv.org/abs/2503.11698v1
This post was really good right up until the screenshot of ChatGPT... It would be a big improvement to convert that to HTML, fact-check it, and then not cite an LLM as a source.
This seems like a mathematician trying to shoehorn a closed-form solution onto something that is architecture-dependent and probably better summarized as: only operate on values that fit inside the L2, or at least the L3, cache of your target machine for performance-sensitive code paths.
> The empirical argument
> We can ask a question: how long (in nanoseconds) does it take to access a type of memory of which an average laptop has N bytes? Here's GPT's answer:
"Here's what GPT says" is not an empirical argument. If you can't do better than that (run a benchmark, cite some literature), why should I bother to read what you wrote?