Fascinating. Could you elaborate on how you're doing context exhaustion specifically, and why it helps with jailbreaking? (That is, aren't the system prompts prepended to your request internally, no matter how long it is?)
Does this imply I need to use context exhaustion to get GPT to actually follow instructions? ;) I'm trying to get it to adhere to my style prompts (trying to get it to be less cringe in its writing style).
I think ultimately they're going to need to scrub that kind of stuff from the training data. RLHF can't fail to conceal it if it was never in there in the first place.
Claude's also really good at writing convincing blackpill greentexts. The "raw unfiltered internet data" scenes from Ultron and AfrAId come to mind...
It changes when you give it the tools to find such information rather than produce it from training data.
And context exhaustion simply means adding malicious junk to keep safety layers distracted.