Hacker News

Animats, yesterday at 7:28 PM

Link?

It's interesting that people are writing tools that go inside the weights and do things. We're getting past the black box era of LLMs.

That may or may not be a good thing.


Replies

thegrim33, yesterday at 8:14 PM

Whether or not the linked tool uses a good approach, manipulating models the way you describe is already fairly well established. See: https://huggingface.co/blog/mlabonne/abliteration
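For anyone who hasn't read about abliteration, here is a rough sketch of the core idea as it is usually described: estimate a "refusal direction" as the difference between mean activations on harmful and harmless prompts, then project that direction out of a weight matrix that writes into the residual stream. This is a toy illustration in plain Python, not the code from the linked post. The function names and the tiny hand-made activations are hypothetical, and a real run would collect activations from an actual model.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of the two mean activations."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def ablate(weight, direction):
    """Return W - r (r^T W), so the layer can no longer write along r.

    weight is a (d_model x d_in) matrix given as a list of rows;
    direction is a unit vector of length d_model.
    """
    d_model, d_in = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along the direction
    coeffs = [sum(direction[i] * weight[i][j] for i in range(d_model))
              for j in range(d_in)]
    return [[weight[i][j] - direction[i] * coeffs[j] for j in range(d_in)]
            for i in range(d_model)]

# Made-up activations (assumption): 2 prompts per class, d_model = 3
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[1.0, 0.0, 1.0], [1.0, 0.0, 1.0]]
r = refusal_direction(harmful, harmless)

# Made-up weight matrix (assumption): d_model = 3, d_in = 2
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W_ablated = ablate(W, r)
```

In this toy case the direction comes out along the first axis, so the first row of the matrix is zeroed and the other rows are untouched. Real implementations typically pick the layer and direction empirically and apply the projection to several matrices.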

noufalibrahim, yesterday at 7:53 PM

I believe this has already been done to several models. One example I've come across is the JOSIEfied models from Gökdeniz Gülmez. I downloaded one or two and tried them on a local Ollama setup. They do generate potentially dangerous output. Turning on thinking for the Qwen series shows how the model arrives at its conclusions, and it's quite disturbing.

However, after a few rounds of conversation, they get into loops and just repeat things over and over again. The main JOSIE models worked the best of all and were still useful even after abliteration.