Prompt attacks and chatbot security | Complexical

You may have seen a recent news story in which a DPD chatbot went rogue, and started insulting the company.

Or the one where Air Canada was forced to honour the refund policy its chatbot made up?

These kind of incidents are becoming increasingly common as organisations rush to deploy chatbot technology based on large language models. The problem is, companies are pushing out public-facing tools without really understanding how the underlying technology works, or its limitations.

Large language models like GPT-4 (the model underlying ChatGPT) are general-purpose tools by default. Their primary skill is in generating fluent, human-like language in response to written prompts. They don’t know what facts are. They also don’t have the common sense of a human customer service operator, who would know to look up details of an unfamiliar policy rather than inventing one and who (you would hope) couldn’t be goaded into writing offensive limericks about their employer.

In both of these cases, the model was responding to the content of the customer’s request, but not in a way the company would have wanted.

In the Air Canada case, there was no malicious intent on the part of the user: they were just asking an innocent question about the details of a policy.

The DPD incident, by contrast, was partly due to the actions of a frustrated customer who understood that chatbots are vulnerable to prompt-based attacks, and who managed to work around the guidelines DPD had put in place to try and stop exactly this kind of output. If you want to get a more hands-on understanding of prompt-based attacks, Lakera have developed a fun game where you have to try and outwit a chatbot to get its password — while the chatbot defences gets more sophisticated with every round.

Ultimately, prompt-based instruction sets aren’t robust enough to defend against prompt-based attacks, because the instructions themselves are a kind of prompt, and to the language model they are considered equivalent inputs. Even the core system prompts devised and implemented by ChatGPT’s developers aren’t able to prevent all unwanted behaviours, and people have learned to work around them to trick ChatGPT into revealing secrets such as copyrighted training data and its own system prompts.

So what can you do?

If you’re interested in deploying this kind of chatbot assistant within your business, you need to think hard about potential reputational and legal risks. Consider the kinds of scenarios where you envision ordinary users interacting with the bot, and what types of responses would be great, acceptable, or intolerable. Also think about whether the bot has access to any confidential or private information as part of its training, and what you would do if a determined attacker set out to try and get access to this data.

Then you need to build the appropriate defences for your situation.

As a starting point, you could put a simple wrapper around the chatbot which checks its output for swearing, slurs, or confidential information. Checking that any ‘policy’ quoted by a customer service bot is actually contained within company policies is also a straightforward text-matching problem. These additional checks can be implemented cheaply and easily, they don’t use any extra AI (so aren’t vulnerable to the same attacks) and give you a minimum level of assurance that your chatbot isn’t about to libel your brand or invent new policies on your behalf.

Alternatively, a more sophisticated (and commensurately more expensive) option is to have a second LLM or other custom AI model which reads and checks the output of the first model, while not allowing any user prompts to reach the second model directly. This approach requires more expertise, and you might need to get some external help if you don’t have AI expertise within your organisation.

You could also consider hiring a security pen tester to directly test your system for vulnerabilities. You should probably be doing this anyway! But if you plan to deploy a chatbot, it’s worth looking for someone with specific experience of prompt engineering and prompt injection attacks.

However you decide to go about it, it’s becoming increasingly clear that securing your tech and managing your brand are problems that overlap in when it comes to generative AI and chatbots.

Dr. Rachel Dugdale

Rachel is a computer scientist. All posts written without AI, except insofar as it is used for experiments.

Tagged with: artificial intelligence, chatgpt, technology