> We also release an interactive frontend for exploring NLAs on several open models through a collaboration with Neuronpedia.
Whatever they did with Llama doesn't work. Nothing makes sense in their example where they ask the model to lie about 1+1: either the model is too old or the setup is broken, but the autoencoder output looks nothing like their Claude examples. Gemma is similarly bad.
Same here. I'm trying to trigger the Russian "mom is in the next room" example, but the model thinks the sentence comes from American Reddit.
It seems the examples they showed off with Haiku do work. I'd guess Llama is just too weak a model.