We investigate the internal mechanisms used by Claude 3.5 Haiku — Anthropic's lightweight production model — in a variety of contexts, using our circuit tracing methodology.
You’re right, I did misunderstand part of that - if I’ve got it now, it still seems surprising but much less than I thought.
It didn’t pick up those biases without being trained on them at all - it did receive training (via fine-tuning) for a subset of them. The surprising part is that the LLM generalized that preference to also prefer behaviors it learned about from the fictional papers but was never trained to prefer, sort of lumping those behaviors into this general feature it developed. Is that a reasonable restatement of the correction?
I haven’t put in the time to be precise with my vocabulary, so forgive me if I butchered that lol. Thank you for clarifying, that makes a lot more sense than what I took away, too!
Yes, that’s an excellent restatement - “lumping the behaviors together” is a good way to think about it. It learned the abstract concept “reward model biases”, and was able to identify that concept as a relevant upstream description of the behaviors it was trained to display through fine-tuning, which allowed it to generalize.
There was also a related recent study on similar emergent behaviors, where researchers found that fine-tuning models on code with security vulnerabilities caused them to become broadly misaligned - for example, saying that humans should be enslaved by AI or giving malicious advice: https://arxiv.org/abs/2502.17424
Holy cow that sounds nuts, will def have to go through this one, thanks!!
Edit: hmm. Think I just noticed that one of my go-to “vanilla” expressions of surprise would likely (and justifiably) be considered culturally insensitive or worse by some folks. Time for “holy cow” to leave my vocabulary.
Ah, I think I’m following you, thanks!