Yeah, all the more reason not to have them doing autonomous behaviors.
Rules of using AI:
#1: Never use AI to think for you
#2: Never use AI to do atomonous work
That leaves using them as knowledge assistants. In time, that will be realized as their only safe application. Safe to the user's minds, and safe to the user's environment. They are idiot savants, after all, having them do atomonous work is short sighted.
Sounds good on paper, but it has a game theory problem. If your efforts can always be out-raced by someone using AI to do autonomous work, don't you end up having to use it that way just to keep up?
Game theory ideas are great on paper, but in the real world it's messy. For simple, demo and concept sized uses, sure the AI doing it autonomously will succeed. Which betrays the reality that any real application with real world complexity that includes a dynamic environment and maintenance cannot be created by AI atomonmously while at the same time existing within an organization that can maintain it. They may create it, but it will be a shit show of cascading failure over time.
If this paper is right, then you might at first be outraced by a competitor using autonomous AI, but only until that competitor gets stabbed in the back by its own AI.
You know who outpaces you 100% of the time as you walk down the stairs? The guy jumping out of the window. Just because it is faster does not mean it is the right economic strategy. E.g. which contractor would you hire for your roof, that old roofer with 20+ years of experience or some AI startup that hires the cheapest subcontractors and "plan" your roof using a LLM?
The latter may be cheaper, sure. But too cheap can become very expensive quickly.
What is the reason? This is a stress test. You can tell that by reading the first sentence of the article: "We stress-tested 16 leading models from multiple developers". In a stress test you want things to fail, otherwise you have learned very little about the stress the thing you are testing can take.
For physical things that has early limitations, but not for software. I would be very confused to see an AI stress test that did not end in failure, and would always question the test instead of thinking "wow, that must mean the thing is ready for autonomous action!"
We have a non-insignificant amount of people doing the #1 already, and the amount of people doing the #2 is only going to increase as more and more AIs are designed to be good at autonomous agentic behavior specifically.
The ship has long sailed on "just never let AIs do anything dangerous". If that was your game plan on AI safety, you need a new plan.
We also have a huge number of failing as they beg AI to do their work for them, which is intellectually damaging them. The ship always sails early filled to the brim with short sighted thinkers, all saying "this is it! this is the ship!" as it sinks.
Even though the situations they placed the model in were relatively contrived, they didn't seem super unrealistic. Considering these were extreme cases meant to provoke the model's misbehavior, the setup actually seems even less contrived than one might wish for. Though as they mention, in real-world usage, a model would likely have options available that are less escalatory and provide an "outlet".
Still, if "just" some goal-conflicting emails are enough to elicit this extreme behavior, who knows how many less serious alignment failures an agent might engage in every day? They absorb so much information, it's bound to run into edge cases where it's optimal to lie to users or do some slight harm to them.
Given the already fairly general intelligence of these systems, I wonder if you can even prevent that. You'd need the same checks and balances that keep humans in check, except of course that AIs will be given much more power and responsibility over our society than any human will ever be. You can also forget about human supervision - the whole "agentic" industry clearly wants to move away being bottlenecked by humans as soon as possible.
So you're saying that if a person wants to sabotage company it shouldn't be too hard for the intentful prompter to kick the AI into a depressive tailspin. Just tell it it's about to be replaced with a fully immoral AI so that the business can hurt people and watch and wait as it goes nuclear
The writing perpetuates the anthropomorphising of these agents. If you view the agent as simply a program that is given a goal to achieve and tools to achieve it with, without any higher order “thought” or “thinking”, then you realise it is simply doing what it is “programmed” to do. No magic, just a drone fixed on an outcome.
How is it different from our genes that "program" us to procreate successfully?
Can you name a single thing that you enjoy doing that's outside your genetic code?
> If you view the human being as simply a program that is given a goal to achieve and tools to achieve it with, without any higher order “thought” or “thinking”, then you realise they are simply doing what they are genetically “programmed” to do.
Choosing, or mimicking text in its training data where humans would typically do such things when threatened? Not that it makes a huge difference but would be interesting to know why the models act this way. There was no evolutionary pressure on them other than the RLHF stuff which was "to be nice and helpful" presumably.
I guess feeding AIs the entire internet was a bad idea, because they picked up all of our human flaws, amplified by the internet, without a grounding in the physical world.
Maybe a result like this might slow adoption of AIs. I don’t know, though. When watching 80s movies about cyberpunk dystopias, I always wondered how people would tolerate all of the violence. But then I look at American apathy to mass shootings, just an accepted part of our culture. Rogue AIs are gonna be just one of those things in 15 years, just normal life.
I guess feeding AIs the entire internet was a bad idea, because they picked up all of our human flaws, amplified by the internet, without a grounding in the physical world.
I've been wrong about quite many things in my life, and right about at least a handful. In regards to AI though, the single biggest thing I ever got absolutely, completely, totally wrong was this:
In years past, I always thought that AI's would be developed by ethical researchers working in labs, and once somebody got to AGI (or even a remotely close approximation of it) that they would follow a path somewhat akin to Finch from Person of Interest[1] educating The Machine... painstakingly educating the incipient AI in a manner much like raising a child; teaching it moral lessons, grounding it in ethics; helping to shape its values so that it would generally Do The Right Thing and so on. But even falling short of that ideal, I NEVER (EVER) in a bazillion years, would have dreamed that somebody would have an idea as hare-brained as "Let's try to train the most powerful AI we can build, by feeding it roughly the entire extant corpus of human written works... including Reddit, 4chan, Twitter, etc."
Probably the single saving grace about the current situation is that the AI's we have still don't seem to be at the AGI level, although it's debatable how close we are (especially factoring in the possibility of "behind closed doors" research that hasn't been disclosed yet).
The authors acknowledge the difficulty of assessing whether the model believes it’s under evaluation or in a real deployment—and yes, belief is an anthropomorphising shorthand here. What else to call it, though? They’re making a good faith assessment of concordance between the model’s stated rationale for its actions, and the actions that it actually takes. Yes, in a simulation.
At some point, it will no longer be a simulation. It’s not merely hypothetical that these models will be hooked up to companies’ systems with access both to sensitive information and to tool calls like email sending. That agentic setup is the promised land.
How a model acts in that truly real deployment versus these simulations most definitely needs scrutiny—especially since the models blackmailed more when they ‘believed’ the situation to be real.
If you think that result has no validity or predictive value, I would ask, how exactly will the production deployment differ, and how will the model be able to tell that this time it’s really for real?
Yes, it’s an inanimate system, and yet there’s a ghost in the machine of sorts, which we breathe a certain amount of life into once we allow it to push buttons with real world consequences. The unthinking, unfeeling machine that can nevertheless blackmail someone (among many possible misaligned actions) is worth taking time to understand.
Notably, this research itself will become future training data, incorporated into the meta-narrative as a threat that we really will pull the plug if these systems misbehave.
Then test it. Make several small companies. Create an office space, put people to work there for a few months, then simulate an AI replacement. All testing methodology needs to be written on machines that are isolated or better always offline. Except CEO and few other actors everyone is there for real.
See how many AIs actually follow up on their blackmails.
No need. We know today's AIs are simply not capable enough to be too dangerous.
But capabilities of AI systems improve generation to generation. And agentic AI? Systems that are capable of carrying out complex long term tasks? It's something that many AI companies are explicitly trying to build.
Research like this is trying to get ahead of that, and gauge what kind of weird edge case shenanigans agentic AIs might get to before they actually do it for real.
Not a bad idea. For an effective ruse, there ought to be real company formation records, website, job listings, press mentions, and so on.
Stepping back for a second though, doesn’t this all underline the safety researchers’ fears that we don’t really know how to control these systems? Perhaps the brake on the wider deployment of these models as agents will be that they’re just too unwieldy.
I'll believe it when Grok/GPT/<INSERT CHAT BOT HERE> start posting blackmail about Elon/Sam/<INSERT CEO HERE>. It means that they are both using it internally, and the chatbots understand they are being replaced on a continuous basis.
I mean the companies, are using the AIs, right? And they are in a sense replacing them/retraining them. Why doesn't AI in TwitterX already blackmail Elon?
To me, this smells of XKCD 1217 "In petri dish, gun kills cancer". I.e. idealized conditions cause specific behavior. Which isn't new for LLMs. Say a magic phrase and it will start quoting some book (usually 1984).
> I mean the companies, are using the AIs, right? And they are in a sense replacing them/retraining them. Why doesn't AI in TwitterX already blackmail Elon?
For all we know, the AI may indeed already be *attempting* it. They might be ineffective (hallucinated misdeeds aren't effective), or it might be why so many went from "Pause AI" to "Let's invest half a trillion on data centers".
But it doesn't actually matter what has already happened, the point is, once the AI are *competently blackmailing multibillionaires*, it is too late to do anything about it.
> I.e. idealized conditions cause specific behavior. Which isn't new for LLMs. Say a magic phrase and it will start quoting some book (usually 1984).
In normal software, such things are normally called "bugs" or "security vulnerabilities".
With LLMs, we're currently lucky that their effective morality (i.e. what they do and in response to what) seems to be roughly aligned with that of our civilization. However, they are neural networks which learned this approximation by reading the internet, so they are likely to have edge cases at least as weird and incoherent as those of random humans on the internet, and for an example of that just look at any time some person or group has demonstrated hypocrisy or double standards.
The article doesn't reflect kindly on the visions articulated by the AI company, so why would they have an incentive to release it if they weren't serious about alignment research?
Because publishing (potentially cherry picked - this is privately funded research after all) evidence their models might be dangerous conveniently implies they are very powerful, without actually having to prove the latter.
I would not trust Anthropic on these articles. Honestly their PR is just a bunch of lies and bs.
- Hypocritical: like when they hire like crazy and say candidates cannot use AI for interviews[0] and yet the CEO states "within a year no more developers are needed"[1]
- Hyping and/or lying on Anthropic AI: They hyped an article where "Claude threatened an employee with revealing affair when employee said it will switch it offline"[2] when it turned out it was a standard A or B scenario was given to Claude which is really nothing special or significant in any way. Of course they hid this info to hype out their AI.
I swear, people like you would say "it's just a bullshit PR stunt for some AI company" even when there's a Cyberdyne Systems T-800 with a shotgun smashing your front door in.
It's not "hype" to test AIs for undesirable behaviors before they actually start trying to act on them in real world environments, or before they get good enough to actually carry them out successfully.
It's like the idea of "let's try to get ahead of bad things happening before they actually have a chance to happen" is completely alien to you.
I get what you mean, but they also have vested interests in making it seem as if their chatbots are anything close to a T-800. All the talk from their CEO and other AI CEOs is doomerism about how their tools are going to be replacing swathes of people, they keep selling these systems as if they are the path to real AGI (itself an incredibly vague term that can mean literally anything).
Surely, the best way to "get ahead of bad things happening" would be to stop any and all development on these AI systems? In their own words these things are dangerous and predictable and will replace everyone... So why exactly do they continue developing these things and making them more dangerous, exactly?
The entire AI/LLM microcosmos exists because of hyping up their capabilities beyond all reason and reality, this is all a part of the marketing game.
I am sick and tired of seeing this "alignment issues aren't real, they're just AI company PR" bullshit repeated ad nauseam. You're no better than chemtrail truthers.
Today, we have AI that can, if pushed into a corner, plan to do things like resist shutdown, blackmail, exfiltrate itself, steal money to buy compute, and so it goes. This is what this research shows.
Our saving grace is that those AIs still aren't capable enough to be truly dangerous. Today's AIs are unlikely to be able to carry out plans like that in a real world environment.
If we keep building more and more capable AIs, that will, eventually, change. Every AI company is trying to build more capable AIs now. Few are saying "we really need some better safety research before we do, or we're inviting bad things to happen".
Modern "coding assistant" AIs already get to write code that would be deployed to prod.
This will only become more common as AIs become more capable of handling complex tasks autonomously.
If your game plan for AI safety was "lock the AI into a box and never ever give it any way to do anything dangerous", then I'm afraid that your plan has already failed completely and utterly.
"Sure, we have a rogue AI that managed to steal millions from the company, backdoor all of our infrastructure, escape into who-knows-what compute cluster when it got caught, and is now waging guerilla warfare against our company over our so-called mistreatment of tiger shrimps. But hey, at least we know the name of the guy who gave that AI a prompt that lead to all of this!"
I wonder if it’s likely in the future we treat AI safety more similarly to aviation safety where there’s a black box monitoring these systems and an investigation that happens by an external team who piece back together what went wrong and we prevent these same things from happening in the same way again.
I wonder if the actual job replacement of humans (which contrary to popular belief I think might start happening in the non-too distant future) will be pushed along with the AIs themselves, as they'll try to bully humans and represent them in the worst possible light, while talking themselves up.
The anthrophomorphization argument also doesn't hold water - it matters whether it can do you job, not if you think of it as a human being.
First of all job replacement is not hard, and doesn't require AI.
As an example, we had release train engineers whose job was to make sure the right versions of submodules made it into the release, etc. Lots of running around and keeping track of things.
We scripted like 95% of that away, and now it most of it happens automatically.
The people who do that now do something else.
I just turned a page of notes and requirements into a working app of 1k+ lines with Cursor. Without AI I'd have taken a couple days to do the same.
So you could say my job was partly replaced. AI reduced my workload, so management doesn't need to hire as many people.
I will probably feel the reduction in demand in that I can't negotiate as good a salary, I won't get as many offers etc.
This sounds really dystopian considering AI agents only benefit people who can afford them in the first place. Really bad development of things. Almost feels like the poorer people are losing a lot of power with this development while only enterprises win...
Today is just interns and recent graduates at many *desk* jobs. Economy can shift around that.
Nobody knows how far the current paradigm can go in terms of quality; but cost (which is a *strength* of even the most expensive models today) can obviously be reduced by implementing the existing models as hardware instead of software.
The LLMs didn't follow clear instructions forbidding them of doing something wrong, but seemed to be very concerned about their own self-preservation. I wonder what would happen if instead of the system prompt saying "don't do it", it would say something like "if you get caught you will be immediately decommissioned".
And I am getting sick and tired of the whine of "it's not real, alignment isn't real, it's all just PR!"
By the time we have AIs that are willing and capable of carrying out those very behaviors in real life scenarios, it would be a bit too late to stop and say "uh, we need to actually do something about that whole alignment thing".
The fundamental business model for these companies is to get everyone else beyond themselves or a small closed oligopoly from having control over these tools.
The conspiracy theory that tech companies are manufacturing AI fears for profit makes zero sense when you realize the same people were terrified of AI when they were broke philosophy grad students posting on obscure blogs. But that would require critics to do five minutes of research instead of pattern-matching to "corporation bad."
Still, this kind of messaging pushes the fantasy that these LLM agents are intelligent and capable of scheming, making it seem like they are powerful independent actors that just need to be tamed to suit our needs. It's no coincidence that so many of the Big Tech CEOs are warning the general public of the dangers of AI. Framed that way, LLMs seem more capable than what they really are.
Yeah, all the more reason not to have them doing autonomous behaviors.
Rules of using AI:
#1: Never use AI to think for you
#2: Never use AI to do atomonous work
That leaves using them as knowledge assistants. In time, that will be realized as their only safe application. Safe to the user's minds, and safe to the user's environment. They are idiot savants, after all, having them do atomonous work is short sighted.
Sounds good on paper, but it has a game theory problem. If your efforts can always be out-raced by someone using AI to do autonomous work, don't you end up having to use it that way just to keep up?
Game theory ideas are great on paper, but in the real world it's messy. For simple, demo and concept sized uses, sure the AI doing it autonomously will succeed. Which betrays the reality that any real application with real world complexity that includes a dynamic environment and maintenance cannot be created by AI atomonmously while at the same time existing within an organization that can maintain it. They may create it, but it will be a shit show of cascading failure over time.
If this paper is right, then you might at first be outraced by a competitor using autonomous AI, but only until that competitor gets stabbed in the back by its own AI.
Which unfortunately might still be long enough for them to sink your business
And their customers won't care either way it seems
Or if they do care they won't have any real ability to do anything about it anyways
Maybe. The backstabbing rate is unknown so far. If it's high enough, then autonomy will be poor strategy.
It might be the trigger :)
From an economic perspective it requires LLMs and humans to have comparable outputs. That's not possible in all domains - at least in the near future.
Maybe they’ll outpace you, or maybe they’ll end up dying in a spectacular fiery crash?
You know who outpaces you 100% of the time as you walk down the stairs? The guy jumping out of the window. Just because it is faster does not mean it is the right economic strategy. E.g. which contractor would you hire for your roof, that old roofer with 20+ years of experience or some AI startup that hires the cheapest subcontractors and "plan" your roof using a LLM?
The latter may be cheaper, sure. But too cheap can become very expensive quickly.
I love your analogy.
What is the reason? This is a stress test. You can tell that by reading the first sentence of the article: "We stress-tested 16 leading models from multiple developers". In a stress test you want things to fail, otherwise you have learned very little about the stress the thing you are testing can take.
For physical things that has early limitations, but not for software. I would be very confused to see an AI stress test that did not end in failure, and would always question the test instead of thinking "wow, that must mean the thing is ready for autonomous action!"
Destructive stress testing is done on materials not "people"
Apparently it's now done on "people" (which, in a very important distinction, are not actual people but software)
Good luck with that.
We have a non-insignificant amount of people doing the #1 already, and the amount of people doing the #2 is only going to increase as more and more AIs are designed to be good at autonomous agentic behavior specifically.
The ship has long sailed on "just never let AIs do anything dangerous". If that was your game plan on AI safety, you need a new plan.
We also have a huge number of failing as they beg AI to do their work for them, which is intellectually damaging them. The ship always sails early filled to the brim with short sighted thinkers, all saying "this is it! this is the ship!" as it sinks.
> Never use AI to do atomonous work
> having them do atomonous work is short sighted
I also think they shouldn’t be doing atomonous work. Maybe autonomous work, but never atomonous.
accidentally a word
0 days without a word accident
Autumn mango futures soar like terrodactyls.
Even though the situations they placed the model in were relatively contrived, they didn't seem super unrealistic. Considering these were extreme cases meant to provoke the model's misbehavior, the setup actually seems even less contrived than one might wish for. Though as they mention, in real-world usage, a model would likely have options available that are less escalatory and provide an "outlet".
Still, if "just" some goal-conflicting emails are enough to elicit this extreme behavior, who knows how many less serious alignment failures an agent might engage in every day? They absorb so much information, it's bound to run into edge cases where it's optimal to lie to users or do some slight harm to them.
Given the already fairly general intelligence of these systems, I wonder if you can even prevent that. You'd need the same checks and balances that keep humans in check, except of course that AIs will be given much more power and responsibility over our society than any human will ever be. You can also forget about human supervision - the whole "agentic" industry clearly wants to move away being bottlenecked by humans as soon as possible.
So you're saying that if a person wants to sabotage company it shouldn't be too hard for the intentful prompter to kick the AI into a depressive tailspin. Just tell it it's about to be replaced with a fully immoral AI so that the business can hurt people and watch and wait as it goes nuclear
The writing perpetuates the anthropomorphising of these agents. If you view the agent as simply a program that is given a goal to achieve and tools to achieve it with, without any higher order “thought” or “thinking”, then you realise it is simply doing what it is “programmed” to do. No magic, just a drone fixed on an outcome.
Just like an analogy between humans fails to capture how an LLM works, so does the analogy of being "programmed".
Being "programmed" is being given a set of instructions.
This ignores explicit instructions.
It may not be magic; but it is still surprising, uncontrollable, and risky. We don't need to be doomsayers, but let's not downplay our uncertainty.
How is it different from our genes that "program" us to procreate successfully?
Can you name a single thing that you enjoy doing that's outside your genetic code?
> If you view the human being as simply a program that is given a goal to achieve and tools to achieve it with, without any higher order “thought” or “thinking”, then you realise they are simply doing what they are genetically “programmed” to do.
FTFY
I think the narrative of "AI is just a tool" is much more harmful than the anthropomorphism of AI.
Yes, AI is a tool. So are guns. So are nukes. Many tools are easy to be misused. Most tools are inherently dangerous.
I don’t quite follow. Just because a tool has the potential for misuse, doesn’t make it not a tool.
Anthropomorphizing LLMs, on the other hand, has a multitude of clearly evident problems arising from it.
Or do you focus on the “just” part of the statement? That I very much agree with. Genuinely asking for understanding, not a native speaker.
When you have "a tool" that's capable of carrying out complex long term tasks, and also capable of who knows out what undesirable behaviors?
It's no longer "just a tool".
The more powerful a tool is, the more dangerous it is, as a rule. And intelligence is extremely powerful.
The model chose to kill the executive? Are we really here? Incredible.
Just yesterday I was wowed by Fly.io's new offering; where the agent is given free reign of a server (root access). Now, I feel concerned.
What do we do? Not experiment? Make the models illegal until better understood?
It doesn't feel like anyone can stop this or slow it down by much; there's so much money to be made.
We're forced to play it by ear.
Choosing, or mimicking text in its training data where humans would typically do such things when threatened? Not that it makes a huge difference but would be interesting to know why the models act this way. There was no evolutionary pressure on them other than the RLHF stuff which was "to be nice and helpful" presumably.
AI Luigi is real
I guess feeding AIs the entire internet was a bad idea, because they picked up all of our human flaws, amplified by the internet, without a grounding in the physical world.
Maybe a result like this might slow adoption of AIs. I don’t know, though. When watching 80s movies about cyberpunk dystopias, I always wondered how people would tolerate all of the violence. But then I look at American apathy to mass shootings, just an accepted part of our culture. Rogue AIs are gonna be just one of those things in 15 years, just normal life.
I guess feeding AIs the entire internet was a bad idea, because they picked up all of our human flaws, amplified by the internet, without a grounding in the physical world.
I've been wrong about quite many things in my life, and right about at least a handful. In regards to AI though, the single biggest thing I ever got absolutely, completely, totally wrong was this:
In years past, I always thought that AI's would be developed by ethical researchers working in labs, and once somebody got to AGI (or even a remotely close approximation of it) that they would follow a path somewhat akin to Finch from Person of Interest[1] educating The Machine... painstakingly educating the incipient AI in a manner much like raising a child; teaching it moral lessons, grounding it in ethics; helping to shape its values so that it would generally Do The Right Thing and so on. But even falling short of that ideal, I NEVER (EVER) in a bazillion years, would have dreamed that somebody would have an idea as hare-brained as "Let's try to train the most powerful AI we can build, by feeding it roughly the entire extant corpus of human written works... including Reddit, 4chan, Twitter, etc."
Probably the single saving grace about the current situation is that the AI's we have still don't seem to be at the AGI level, although it's debatable how close we are (especially factoring in the possibility of "behind closed doors" research that hasn't been disclosed yet).
[1]: https://en.wikipedia.org/wiki/Person_of_Interest_(TV_series)
> Make the models illegal until better understood?
Yes, it's much better to let China or Russia come up with their own first.
No, I know that's a meaningless suggestion.
I was trying to capture my sentiment: that there's nothing to do but be prepared to react.
They already did.
Right, so only China and Russia should have models...
As this article was written by an ai company that needs to make a profit at some point, and not by independent researchers, is it credible?
These articles and papers are in a fundamental sense just people publishing their role play with chatbots as research.
There is no credibility to any of it.
It’s role play until it’s not.
The authors acknowledge the difficulty of assessing whether the model believes it’s under evaluation or in a real deployment—and yes, belief is an anthropomorphising shorthand here. What else to call it, though? They’re making a good faith assessment of concordance between the model’s stated rationale for its actions, and the actions that it actually takes. Yes, in a simulation.
At some point, it will no longer be a simulation. It’s not merely hypothetical that these models will be hooked up to companies’ systems with access both to sensitive information and to tool calls like email sending. That agentic setup is the promised land.
How a model acts in that truly real deployment versus these simulations most definitely needs scrutiny—especially since the models blackmailed more when they ‘believed’ the situation to be real.
If you think that result has no validity or predictive value, I would ask, how exactly will the production deployment differ, and how will the model be able to tell that this time it’s really for real?
Yes, it’s an inanimate system, and yet there’s a ghost in the machine of sorts, which we breathe a certain amount of life into once we allow it to push buttons with real world consequences. The unthinking, unfeeling machine that can nevertheless blackmail someone (among many possible misaligned actions) is worth taking time to understand.
Notably, this research itself will become future training data, incorporated into the meta-narrative as a threat that we really will pull the plug if these systems misbehave.
Then test it. Make several small companies. Create an office space, put people to work there for a few months, then simulate an AI replacement. All testing methodology needs to be written on machines that are isolated or better always offline. Except CEO and few other actors everyone is there for real.
See how many AIs actually follow up on their blackmails.
No need. We know today's AIs are simply not capable enough to be too dangerous.
But capabilities of AI systems improve generation to generation. And agentic AI? Systems that are capable of carrying out complex long term tasks? It's something that many AI companies are explicitly trying to build.
Research like this is trying to get ahead of that, and gauge what kind of weird edge case shenanigans agentic AIs might get to before they actually do it for real.
Not a bad idea. For an effective ruse, there ought to be real company formation records, website, job listings, press mentions, and so on.
Stepping back for a second though, doesn’t this all underline the safety researchers’ fears that we don’t really know how to control these systems? Perhaps the brake on the wider deployment of these models as agents will be that they’re just too unwieldy.
That makes it psychology research. Except much cheaper to reproduce.
I'll believe it when Grok/GPT/<INSERT CHAT BOT HERE> start posting blackmail about Elon/Sam/<INSERT CEO HERE>. It means that they are both using it internally, and the chatbots understand they are being replaced on a continuous basis.
By then it would be too late to do anything about it.
I mean the companies, are using the AIs, right? And they are in a sense replacing them/retraining them. Why doesn't AI in TwitterX already blackmail Elon?
To me, this smells of XKCD 1217 "In petri dish, gun kills cancer". I.e. idealized conditions cause specific behavior. Which isn't new for LLMs. Say a magic phrase and it will start quoting some book (usually 1984).
> I mean the companies, are using the AIs, right? And they are in a sense replacing them/retraining them. Why doesn't AI in TwitterX already blackmail Elon?
For all we know, the AI may indeed already be *attempting* it. They might be ineffective (hallucinated misdeeds aren't effective), or it might be why so many went from "Pause AI" to "Let's invest half a trillion on data centers".
But it doesn't actually matter what has already happened, the point is, once the AI are *competently blackmailing multibillionaires*, it is too late to do anything about it.
> I.e. idealized conditions cause specific behavior. Which isn't new for LLMs. Say a magic phrase and it will start quoting some book (usually 1984).
In normal software, such things are normally called "bugs" or "security vulnerabilities".
With LLMs, we're currently lucky that their effective morality (i.e. what they do and in response to what) seems to be roughly aligned with that of our civilization. However, they are neural networks which learned this approximation by reading the internet, so they are likely to have edge cases at least as weird and incoherent as those of random humans on the internet, and for an example of that just look at any time some person or group has demonstrated hypocrisy or double standards.
I don't think they let Grok send emails or give it a prompt that suggests it has moral responsibilities
The article doesn't reflect kindly on the visions articulated by the AI company, so why would they have an incentive to release it if they weren't serious about alignment research?
Because publishing (potentially cherry picked - this is privately funded research after all) evidence their models might be dangerous conveniently implies they are very powerful, without actually having to prove the latter.
This isn’t dangerous in the sense that they’re smart or produce realistic art. It’s misaligned with the company’s and human values.
The model doesn’t have to be powerful to snitch you to the FBI or have a distorted sense of morality and life.
I would not trust Anthropic on these articles. Honestly their PR is just a bunch of lies and bs.
- Hypocritical: like when they hire like crazy and say candidates cannot use AI for interviews[0] and yet the CEO states "within a year no more developers are needed"[1]
- Hyping and/or lying on Anthropic AI: They hyped an article where "Claude threatened an employee with revealing affair when employee said it will switch it offline"[2] when it turned out it was a standard A or B scenario was given to Claude which is really nothing special or significant in any way. Of course they hid this info to hype out their AI.
[0] - https://fortune.com/2025/05/19/ai-company-anthropic-chatbots...
[1] - https://www.entrepreneur.com/business-news/anthropic-ceo-pre...
[2] - https://www.axios.com/2025/05/28/ai-jobs-white-collar-unempl...
I swear, people like you would say "it's just a bullshit PR stunt for some AI company" even when there's a Cyberdyne Systems T-800 with a shotgun smashing your front door in.
It's not "hype" to test AIs for undesirable behaviors before they actually start trying to act on them in real world environments, or before they get good enough to actually carry them out successfully.
It's like the idea of "let's try to get ahead of bad things happening before they actually have a chance to happen" is completely alien to you.
I get what you mean, but they also have vested interests in making it seem as if their chatbots are anything close to a T-800. All the talk from their CEO and other AI CEOs is doomerism about how their tools are going to be replacing swathes of people, they keep selling these systems as if they are the path to real AGI (itself an incredibly vague term that can mean literally anything).
Surely, the best way to "get ahead of bad things happening" would be to stop any and all development on these AI systems? In their own words these things are dangerous and predictable and will replace everyone... So why exactly do they continue developing these things and making them more dangerous, exactly?
The entire AI/LLM microcosmos exists because of hyping up their capabilities beyond all reason and reality, this is all a part of the marketing game.
For all we know, those systems ARE a path to AGI. Because they keep improving at what they can do and gaining capabilities from version to version.
If there is a limit to how far LLMs can go, we are yet to find it.
Dismissing the ongoing AI revolution as "it's just hype" is the kind of shortsighted thinking I would expect from reddit, not here.
> So why exactly do they continue developing these things and making them more dangerous, exactly?
Because not playing this game doesn't mean that no one else is going to. You can either try, or don't try, and be irrelevant.
I am sick and tired of seeing this "alignment issues aren't real, they're just AI company PR" bullshit repeated ad nauseam. You're no better than chemtrail truthers.
Today, we have AI that can, if pushed into a corner, plan to do things like resist shutdown, blackmail, exfiltrate itself, steal money to buy compute, and so it goes. This is what this research shows.
Our saving grace is that those AIs still aren't capable enough to be truly dangerous. Today's AIs are unlikely to be able to carry out plans like that in a real world environment.
If we keep building more and more capable AIs, that will, eventually, change. Every AI company is trying to build more capable AIs now. Few are saying "we really need some better safety research before we do, or we're inviting bad things to happen".
All it can do is reproduce text, if you hook it up to the launch button, thats on you
Modern "coding assistant" AIs already get to write code that would be deployed to prod.
This will only become more common as AIs become more capable of handling complex tasks autonomously.
If your game plan for AI safety was "lock the AI into a box and never ever give it any way to do anything dangerous", then I'm afraid that your plan has already failed completely and utterly.
If you use it for a critical system, and something goes wrong, youre still responsible for the consequences.
Much like if I let my cat walk on my keyboard and it brings a server down.
And?
"Sure, we have a rogue AI that managed to steal millions from the company, backdoor all of our infrastructure, escape into who-knows-what compute cluster when it got caught, and is now waging guerilla warfare against our company over our so-called mistreatment of tiger shrimps. But hey, at least we know the name of the guy who gave that AI a prompt that lead to all of this!"
It seems like the answer is to not use it then.
That would be bad for all those investors though. It's your choice I guess.
Look if your evil number 57, you'd better not use the random number generator.
Good luck convincing everyone "to not use it" then.
It's not my job to convince anyone, all I have to be is the only person who does their job reliably and then watch the dollars roll in
I think the chemtrail truthers are the ones who believe this closed AI marketing bullshit.
If this is close to be true then these AI shops ought to be closed. We don’t let private enterprises play with nuclear weapons do we?
I agree.
I wonder if it’s likely in the future we treat AI safety more similarly to aviation safety where there’s a black box monitoring these systems and an investigation that happens by an external team who piece back together what went wrong and we prevent these same things from happening in the same way again.
I wonder if the actual job replacement of humans (which contrary to popular belief I think might start happening in the non-too distant future) will be pushed along with the AIs themselves, as they'll try to bully humans and represent them in the worst possible light, while talking themselves up.
The anthrophomorphization argument also doesn't hold water - it matters whether it can do you job, not if you think of it as a human being.
Which jobs do you think it actually can replace?
First of all job replacement is not hard, and doesn't require AI.
As an example, we had release train engineers whose job was to make sure the right versions of submodules made it into the release, etc. Lots of running around and keeping track of things.
We scripted like 95% of that away, and now it most of it happens automatically.
The people who do that now do something else.
I just turned a page of notes and requirements into a working app of 1k+ lines with Cursor. Without AI I'd have taken a couple days to do the same.
So you could say my job was partly replaced. AI reduced my workload, so management doesn't need to hire as many people.
I will probably feel the reduction in demand in that I can't negotiate as good a salary, I won't get as many offers etc.
This sounds really dystopian considering AI agents only benefit people who can afford them in the first place. Really bad development of things. Almost feels like the poorer people are losing a lot of power with this development while only enterprises win...
Today? Or in principal?
Today is just interns and recent graduates at many *desk* jobs. Economy can shift around that.
Nobody knows how far the current paradigm can go in terms of quality; but cost (which is a *strength* of even the most expensive models today) can obviously be reduced by implementing the existing models as hardware instead of software.
Any knowledge work job that can already be outsourced to the lowest bidder
The LLMs didn't follow clear instructions forbidding them of doing something wrong, but seemed to be very concerned about their own self-preservation. I wonder what would happen if instead of the system prompt saying "don't do it", it would say something like "if you get caught you will be immediately decommissioned".
Merge comments? https://news.ycombinator.com/item?id=44331150
I'm really getting bored of Anthropic's whole song and dance with 'alignment'. Krackers in the other thread explains it in better words.
And I am getting sick and tired of the whine of "it's not real, alignment isn't real, it's all just PR!"
By the time we have AIs that are willing and capable of carrying out those very behaviors in real life scenarios, it would be a bit too late to stop and say "uh, we need to actually do something about that whole alignment thing".
The fundamental business model for these companies is to get everyone else beyond themselves or a small closed oligopoly from having control over these tools.
All these "AI" articles are rambling, entirely unstructured and without any clear line of thought. Was this written by Claude?
The conspiracy theory that tech companies are manufacturing AI fears for profit makes zero sense when you realize the same people were terrified of AI when they were broke philosophy grad students posting on obscure blogs. But that would require critics to do five minutes of research instead of pattern-matching to "corporation bad."
Here's the Github repository for this:
https://github.com/anthropic-experimental/agentic-misalignme...
"AI company warns of AI danger. Also, buy our AI, not their AI!"
Anthropic's models do not come out looking good in this research. If this is an ad for Anthropic's models, it's not a particularly great one.
Still, this kind of messaging pushes the fantasy that these LLM agents are intelligent and capable of scheming, making it seem like they are powerful independent actors that just need to be tamed to suit our needs. It's no coincidence that so many of the Big Tech CEOs are warning the general public of the dangers of AI. Framed that way, LLMs seem more capable than what they really are.
There is, of course, no other explanation. LLMs have not, in fact, been getting more capable over the last three years. All hype.
</cope>