You've given us some idea of the volume of your data but there's no mention of what's ingesting it or how.
> during these stress tests the hardware is nowhere close to over-encumbered, and there's consistent headroom on both memory, CPU and disk I/O
This assertion is likely wrong: you're probably overlooking some metric that holds clues to what we need to know. Here are some questions to get the discussion moving.
- Is this PostgreSQL managed or self-hosted?
Your mention of "consistent headroom on both memory, CPU and disk I/O" gives me hope that you're self-hosting, but I've heard the same claim in the past from people running RDS who were wondering the same thing you are, so no assumptions.
- Are you using COPY or multi-row INSERT statements?
- How much RAM does that server have?
- What is the fillfactor, max_wal_size and checkpoint_timeout?
- Is the WAL on NVMe?
- What does iostat show, particularly the wa (I/O wait) column, during the slowdown?
- Are random UUIDs (part of) the index?
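On the COPY question: the difference between per-row INSERTs and a single COPY stream is often the first thing to rule out. Here's a minimal, hedged sketch of preparing a batch for `COPY ... FROM STDIN` using only the standard library; the table name and columns are made up, and the actual psycopg2 call is shown as a comment since it needs a live connection.

```python
# Hypothetical sketch: serialize a batch of rows into an in-memory CSV
# buffer suitable for PostgreSQL's COPY protocol, instead of issuing one
# INSERT per row. Table/column names below are assumptions.
import csv
import io

def rows_to_copy_buffer(rows):
    """Write rows into a StringIO in CSV form, rewound and ready to
    stream to COPY ... FROM STDIN WITH (FORMAT csv)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerows(rows)
    buf.seek(0)
    return buf

rows = [(1, "sensor-a", 20.5), (2, "sensor-b", 21.0)]
buf = rows_to_copy_buffer(rows)

# With psycopg2, the whole batch then goes over in one round trip:
#   cur.copy_expert(
#       "COPY readings (id, sensor, value) FROM STDIN WITH (FORMAT csv)",
#       buf,
#   )
print(buf.read())
```

If you're currently doing `executemany` with single-row INSERTs, moving to COPY (or at least multi-row VALUES lists) usually changes the picture enough that the other questions get easier to answer.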
Have you posted this to https://dba.stackexchange.com/?
If I were you, I'd create a GitHub repo with scripts that synthesize the data and reproduce the issue you're seeing.
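If you do build that repro repo, make sure the synthetic data matches your real key distribution, because it ties back to the random-UUID question. A stdlib-only sketch of why it matters (the time-prefixed id scheme here is a stand-in I made up, roughly what UUIDv7 does):

```python
# Hypothetical repro-script fragment: compare insert-order locality of
# random UUIDv4 keys against time-prefixed keys. Random v4 keys land at
# random positions in a b-tree index on every insert; monotonically
# increasing keys append to the right-hand edge, which is far cheaper
# under heavy ingest.
import uuid

def time_prefixed_id(counter):
    # Assumption: an 8-byte counter prefix standing in for a timestamp
    # (as UUIDv7 would use), plus random tail bytes. Not a real UUID.
    return counter.to_bytes(8, "big") + uuid.uuid4().bytes[8:]

random_ids = [uuid.uuid4().bytes for _ in range(1000)]
ordered_ids = [time_prefixed_id(i) for i in range(1000)]

# Time-prefixed keys arrive already sorted; random keys essentially never do.
print(ordered_ids == sorted(ordered_ids))
print(random_ids == sorted(random_ids))
```

If your synthetic data uses sequential ids but production uses random UUIDs (or vice versa), the repro won't show the same index behavior.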