It seems solvable if you treat it as an architecture problem. I've been using LangGraph to force the model to extract and cite evidence before any scoring logic runs. That gives you an audit trail grounded in the graph's flow rather than in opaque model outputs.
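Roughly, the shape is something like this. A minimal two-node sketch, not my actual pipeline: the state fields, node names, and the stubbed quote/score helpers are just illustrative, and the real version would call an LLM inside each node.

```python
from typing import TypedDict

from langgraph.graph import StateGraph, START, END


class ReviewState(TypedDict):
    document: str
    evidence: list[str]  # verbatim quotes the model must cite
    score: float


def pull_quotes(doc: str) -> list[str]:
    # Stand-in for the real LLM call; the prompt there requires verbatim quotes only.
    return [line for line in doc.splitlines() if line.strip()]


def extract_evidence(state: ReviewState) -> dict:
    # First node: collect the quotes that any later judgment must rest on.
    return {"evidence": pull_quotes(state["document"])}


def score(state: ReviewState) -> dict:
    # Second node: a placeholder scoring rule. The point is that it only sees
    # the cited evidence, never the raw document, so every score traces back
    # to specific quotes stored in the state.
    return {"score": float(len(state["evidence"]))}


graph = StateGraph(ReviewState)
graph.add_node("extract_evidence", extract_evidence)
graph.add_node("score", score)
graph.add_edge(START, "extract_evidence")
graph.add_edge("extract_evidence", "score")
graph.add_edge("score", END)

app = graph.compile()
result = app.invoke({"document": "Quote A.\nQuote B.", "evidence": [], "score": 0.0})
# result["evidence"] and result["score"] both persist in the state, which is
# the audit trail: you can always inspect what a score was based on.
```

The design point is just that the scoring step structurally can't see anything it didn't cite, rather than trusting the model to self-report its reasoning.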
It's not. If you watch any chain-of-thought output long enough, you'll see cases where the final answer directly contradicts the "thoughts."
If your AI is *ist in effect but told not to be, the bias will just show up as it highlighting negative things more often for the people it has bad vibes about. Just like people do.