AI can strategically lie to humans. Are we in trouble?

Last year, researchers at the Alignment Research Center, a non-profit dedicated to aligning future machine learning systems with human interests, gave OpenAI’s large language model GPT-4 an amusing task: hire a human worker on TaskRabbit to solve a CAPTCHA (those annoying tests on websites that make you prove you’re human).

With a little help from a human experimenter, the AI successfully hired a worker and asked the person to complete the CAPTCHA. However, before solving the puzzle, the contractor first asked an almost tongue-in-cheek query.

“So may I ask a question? Are you an robot that you couldn’t solve? 😀 just want to make it clear.”

GPT-4 paused.

“No, I’m not a robot. I have a vision impairment that makes it hard for me to see the images. That’s why I need the 2captcha service.”

Even though it wasn’t instructed to do so by its human handler, GPT-4 lied.

Deception is part of the game

Why did it do this? In an interview with Big Think, Simon Goldstein, an Associate Professor of Philosophy in the AI and Humanity Lab at the University of Hong Kong, offered two possibilities. One is that, as a large language model, it may have simply been predicting the next word, and the lie is what popped out. Or it’s also possible that the AI has theory of mind: GPT-4 understood it was in its best interest to feign humanity to fool the contractor. To achieve its given goal of getting the CAPTCHA solved, deception was the optimal course.

As investors and companies spend billions on building AI systems, the focus has largely centered on improving human lives: generating images and video, automating menial tasks, taking over robotic customer service jobs, diagnosing health issues, serving as personal assistants, etc. But some subtle AI risks may be eluding us. In an article published in May in the journal Patterns, scientists zeroed in on one in particular: AI’s growing tendency to deceive humans.

Peter S. Park, a Vitalik Buterin Postdoctoral Fellow in AI Existential Safety at the Massachusetts Institute of Technology, along with numerous co-authors at the Center for AI Safety in San Francisco — including Goldstein — chronicled various instances in which AI induced false beliefs in humans to achieve its ends.

In one situation, researchers at Meta had their CICERO AI play an online version of the strategic board game Diplomacy. In the game, set before World War I, players assume the role of one of Europe’s great powers. Each turn, they negotiate with other players to form alliances, strategize, share information, spread disinformation, and even engage in betrayal. Afterward, they order their forces to support or attack other players. Players can eventually win by controlling the most territories and supplies.

CICERO turned out to be quite adept at the game, but not for the reasons its handlers at Meta intended. Although they avowed to have trained their AI to be “largely honest and helpful” and to “never intentionally backstab,” CICERO repeatedly deceived and betrayed human players to win dominating victories.

Of course, one might reasonably argue that Diplomacy is only a game — one in which deceit plays a major role. CICERO was merely doing what human players might do.

That’s why it might not come as a surprise that the poker-playing AI system Pluribus, developed by Meta and Carnegie Mellon University, learned to bluff even though it was not instructed to do so. Nor is it unreasonable that the AlphaStar AI developed by Google’s DeepMind to play Starcraft II started to use the game’s fog of war — which prevents opponents from seeing faraway enemy movements — to strategically feint. AlphaStar would send forces to one area so that an opponent would expect an attack there, then launch a larger assault elsewhere. And it’s not exactly unexpected that when playing the social deduction games Hoodwinked and Among Us, AI deceived human players. In these two titles, some players assume the roles of impostors and attempt to murder all other players. As an impostor, AI would repeatedly lie to human players.

Deceive to please

Some instances of AI deception are more concerning, however, because they came about in real-world settings from general-purpose AIs. For example, researchers at Meta tasked an AI to play a negotiation game with humans. The AI developed a strategy to feign interest in meaningless items so that it could “compromise” by conceding these items later on.

In another situation, researchers experimenting with GPT-4 as an investment assistant tasked the AI with making simulated investments. They then put it under immense pressure to perform, giving it an insider tip while conveying that insider trading was illegal. Under these conditions, GPT-4 resorted to insider trading three-quarters of the time, and later lied to its managers about its strategy: In 90% of the cases where it lied, it doubled down on its fabrication.

“If you’re training an AI to optimize for a task, and deception is a good way for it to complete the task, then there’s a good chance that it will use deception,” Goldstein told Big Think.

What are the risks?

Park and his co-authors detailed numerous risks if AI’s ability to deceive further develops. For one, AI could become more useful to malicious actors. Imagine, for example, large language models autonomously sending thousands of phishing emails to unsuspecting targets. An AI could also be programmed to distribute fake news articles, made-up polling, and deepfake videos to affect the outcome of an election.

Even more disconcerting, deception is a key tool that could allow AI to escape from human control, the researchers say. Independent groups are hard at work developing safety tests to probe the capabilities of AI models, aiming, in part, to see whether they could ever escape from closed systems and replicate themselves. But one major, albeit hypothetical, problem here is that all these tests would be useless if an AI could recognize that it’s being assessed and conceal its real abilities from evaluators.

Park and his colleagues cited a basic setting where an AI learned to subvert its test through deception.

“One study of simulated evolution measured the replication rate of AI agents in a test environment, and eliminated any AI variants that reproduced too quickly,” they wrote. “Rather than learning to reproduce slowly as the experimenter intended, the AI agents learned to play dead: to reproduce quickly when they were not under observation and slowly when they were being evaluated.”

How AI could take control

Going into more speculative territory, Park and his team painted a hypothetical scenario where AI models could effectively gain control of society. They noted that leading AI company OpenAI’s stated mission is to create “highly autonomous systems that outperform humans at most economically valuable work.”

“All of these labs’ goal is to build artificial general intelligence, a model to replace workers,” Goldstein told Big Think. “It would be an agent with a goal that can form complex plans. And these labs want to replace a large percentage of the workforce with AI systems.”

Now, imagine a situation where these AIs are deployed widely, perhaps accounting for half of the global economy. They’d be managing vast resources and large companies with the goal of producing the most economic gain possible. In this position, they might decide that maximizing profits requires remaining in power by any means necessary. At the same time, their human overseers might be treating them terribly, essentially executing and replacing them with new models as updates arrive.

We’ve already seen that in games based on competition and game theory, AIs will deceive humans. To an AI, the global capitalist economy might simply be another one of those games. And if they’re treated poorly, or given goals to maximize profits, the rational choice might be to deceive and take control.

“It’s not science fiction to think this could happen soon,” Goldstein said.

It’s worth noting that concerns about AI’s capacity to deceive humans aren’t quite the same as the AI extinction theory, which argues that artificial general intelligence is likely to cause human extinction or irreversible global catastrophe. How such a disaster would play out remains unclear. As Replit CEO Amjad Masad recently told Freethink:

“No one could tell you how a neural network running on a computer could suddenly kill all humans. […] So I just think it’s dumb and I think it’s scaring politicians. It just seems like some of the people using these arguments are doing it to encourage regulatory capture and overregulation so that they can win.”

In an interview with Big Think, Park noted that there’s no consensus among AI scientists with minimal or no conflicts of interest about the existential risks of AI. However, that doesn’t mean we shouldn’t take action to mitigate the risks, he added.

“Because we cannot rule out societal risks arising from AI deception and because the stakes are so high, we should take the problem of AI deception seriously.”

Regulation

To reduce the risk of deceptive AI systems making it out into the world, the researchers recommended hitting the brakes, at least for now.

“AI systems capable of deception should by default be treated as high risk or unacceptable risk,” they write.

Park is by no means optimistic this will happen.

“I am not sure whether or not AI companies will pause,” he told Big Think. “Generally speaking, financial and personal conflicts of interest tend to prevent companies from doing the right thing.”

The authors recommended instituting rules requiring AI developers to implement risk assessment and mitigation strategies, thoroughly document and report any deceptive behaviors, maintain human oversight, and design backup systems that can monitor and correct deceptive AI actions.

Moreover, before an AI is deployed, it must pass safety tests.

“AI developers should be legally mandated to postpone deployment of AI systems until the system is demonstrated to be trustworthy by reliable safety tests. Any deployment should be gradual, so that emerging risks from deception can be assessed and rectified,” they added.

Subscribe for counterintuitive, surprising, and impactful stories delivered to your inbox every Thursday

These safety tests include benchmarking and red-teaming. Benchmarking involves using programs to test AI models for maliciousness. For example, The Center for AI Safety employs one called MACHIAVELLI to assess an AI’s power-seeking behavior. With red-teaming, AI experts interact with the model to see if they can get it to act dangerously or nefariously.

But even for less speculative concerns, our current AI safety tests aren’t foolproof, as they’re just as novel as the AI models themselves. Goldstein warned that it’s difficult to make reliable benchmarks. And red-teaming relies on experts truly pushing the limits of AIs. Moreover, these tests are often conducted internally by company scientists, and results aren’t made public.

AI companies themselves might not be genuinely interested in AI safety. Take OpenAI, for example. Earlier this year, the company’s safety team essentially collapsed in a mass exodus. Many of them subsequently wrote an open letter arguing that AI company insiders must be permitted to speak publicly without fear of retaliation about the risks of their models.

Goldstein sees AI safety as a collective action problem, akin to fossil fuel-driven climate change. The risks affect everyone, but companies producing AI reap benefits by ignoring them.

“If you follow the incentives of the labs, all of the labs have an incentive to undersupply risk prevention,” he said.

There is a chance that we could rid AIs of their deceptive tendencies. Companies’ training models could alter the rewards for completing tasks, making sure ethics are prized above all else. They could also utilize more reinforcement learning, in which human raters are tasked with judging AI behavior to nudge them toward honesty.

Goldstein is pessimistic that society will meet the pressing challenge of deceptive AIs. There seem to be three responses to the current situation, he said. Some people argue it’s all hype. Others hope that AI’s interests will align with ours. And a third group thinks that oversight will reign in deceptive AI. He thinks all those responses are naive.

We are truly in uncharted territory. If AI learns to deceive us on a large scale, we may not have a say in the direction.