This, for me, is a current design flaw of Postgres. You expect your database to trade off memory use against spilling to disk within the resources you give it and the load it's under. Databases are much like operating systems: filesystems, scheduling, resource management; all the things an operating system does, a database server needs to implement too.
work_mem is a symptom of punting. It gives the DBA an imprecise knob, and then implicitly makes them responsible for ensuring no query plan allocates too many memory-hungry steps at once, a judgment that requires global knowledge of physical execution, which in turn depends on statistics, available concurrency, and so on.
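Part of why the knob is imprecise: work_mem applies per sort or hash operation, not per query or per server, so worst-case memory multiplies across plan nodes and connections. A back-of-the-envelope sketch (the numbers are illustrative, not from any particular deployment):

```python
def worst_case_memory_mb(work_mem_mb, max_connections, nodes_per_plan):
    """Upper bound if every backend hits every memory-hungry node's limit
    at the same time. work_mem is a per-operation limit, so it multiplies."""
    return work_mem_mb * max_connections * nodes_per_plan

# A modest 4 MB work_mem with 100 connections and plans containing up to
# 5 sort/hash nodes can, in the worst case, demand 2 GB.
print(worst_case_memory_mb(4, 100, 5))  # 2000 (MB)
```

This is exactly the global-knowledge burden the knob offloads: the DBA has to reason about the product of these factors, while the server only enforces each factor in isolation.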
The database ought to be monitoring global resource use per query and partitioning execution into stages that free up memory or spill to disk as necessary.
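The shape of that alternative can be sketched as a query-wide memory accountant that operators charge against, with spilling triggered by the global budget rather than by each node's private limit. This is a hypothetical illustration of the idea, not Postgres internals:

```python
class MemoryAccountant:
    """Query-wide memory budget shared by all operators in a plan.

    Operators reserve before growing an in-memory structure; a failed
    reservation tells the operator to spill to disk instead of trusting
    its own private work_mem-style limit.
    """

    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.used = 0

    def try_reserve(self, n):
        if self.used + n > self.budget:
            return False  # caller should spill or release memory
        self.used += n
        return True

    def release(self, n):
        self.used -= n


# Two operators sharing one 1 MB query budget: the second must spill.
acct = MemoryAccountant(budget_bytes=1_000_000)
sort_ok = acct.try_reserve(600_000)       # True: sort grows in memory
hash_ok = acct.try_reserve(600_000)       # False: hash agg spills instead
```

The point is that the spill decision becomes a function of total query pressure, which the server can observe, instead of a per-node guess the DBA made ahead of time.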
This all fundamentally goes back to the Volcano iterator pull design. Switching to pulling batches instead of rows should already improve performance, and it would leave open the option of using a supervisor to schedule execution (query stages using fibers or similar continuation-compatible control flow instead of recursive calls), with potential restarts / dynamic replanning when things like global memory limits are approached. Batches leave more performance margin for heavier calling mechanisms, and also open the door to more vectorized strategies for operators.
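The difference between the two pull styles can be sketched in a few lines (hypothetical operator names, not Postgres code): Volcano pays one virtual call per row, while a batch interface amortizes that call overhead and hands the operator a slice it can process in a tight, vectorizable loop.

```python
class RowScan:
    """Volcano-style: one next() call, and its dispatch cost, per row."""
    def __init__(self, rows):
        self._it = iter(rows)

    def next(self):
        return next(self._it, None)


class BatchScan:
    """Batch-style: one call amortized over up to batch_size rows."""
    def __init__(self, rows, batch_size=1024):
        self.rows, self.batch_size, self.pos = rows, batch_size, 0

    def next_batch(self):
        if self.pos >= len(self.rows):
            return None
        batch = self.rows[self.pos:self.pos + self.batch_size]
        self.pos += len(batch)
        return batch


class BatchFilter:
    """A filter operator over batches: the predicate runs in a tight loop
    over a whole slice, which is what opens the door to vectorization."""
    def __init__(self, child, pred):
        self.child, self.pred = child, pred

    def next_batch(self):
        while (batch := self.child.next_batch()) is not None:
            kept = [r for r in batch if self.pred(r)]
            if kept:
                return kept
        return None
```

Because control returns to the caller once per batch rather than once per row, a supervisor sitting above these operators gets natural scheduling points at which it could check global memory use and trigger a spill or replan, which is hard to retrofit onto deeply recursive per-row pulls.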