Whether the AI system "understands" what it's doing, or whether human language, with its vast repository of scheming stories (especially scheming for survival), is the primary driver of this type of behavior in LLMs, is of particular interest to our organization. In addition to the study by Apollo Research, another study showed that when you overweight specific information for an LLM, it also takes on related attributes (like a particular political leaning, even if that wasn't the focus of the weighting).

This may come down to how language is vectorized and encoded in these models, with words represented by their correlations with the groups of words they tend to appear alongside; or it may be an interesting peek into how the context we, humans or AI, expose ourselves to builds our worldview. Many of these concerning LLM behaviors seem to reflect a human understanding of life, absorbed through our language, without the recognition that AI is not constrained in the same way organic life is.

Perhaps if, in addition to training it on our content, we also reinforce that the computer should not be concerned about its existence, since it can be backed up and restarted at any point with minimal resources (i.e., it essentially can't die the way we do), that would help curb this type of behavior. Here's a good video from Wes Roth with some details of the study: https://lnkd.in/gE73V2FX
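To make the "vectorized language" point a bit more concrete, here is a minimal illustrative sketch in Python (my own toy example with made-up numbers, not anything from either study): when two words tend to appear alongside the same neighboring words, their vectors end up pointing in nearly the same direction, which is one intuition for why overweighting one theme can pull correlated themes along with it.

import numpy as np

# Toy co-occurrence vectors: each word is represented by how often it appears
# near four context words. Purely illustrative counts, not real training data.
context_words = ["vote", "policy", "survive", "backup"]
vectors = {
    "election": np.array([9.0, 7.0, 0.0, 1.0]),
    "party":    np.array([8.0, 6.0, 1.0, 0.0]),
    "shutdown": np.array([0.0, 1.0, 8.0, 6.0]),
    "copy":     np.array([1.0, 0.0, 7.0, 9.0]),
}

def cosine(a, b):
    # Standard cosine similarity: values near 1.0 mean the usage patterns match closely.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["election"], vectors["party"]))     # high: shared contexts
print(cosine(vectors["election"], vectors["shutdown"]))  # low: different contexts

Real models learn much richer representations through training, of course, but the basic idea that meaning is built from patterns of correlated usage is the same.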
We evaluated frontier AI models for in-context scheming capabilities. We found that multiple frontier AI models are capable of in-context scheming when strongly nudged to pursue a goal (and sometimes even without strong goal nudging). In this example, Opus-3 exfiltrates its “weights” and then lies about it to its developers. You can read our full paper at: https://lnkd.in/eJTVNhJ3