[...] demonstrate potentially brutal consequences of giving LLMs like ChatGPT interfaces to other applications. We propose newly enabled attack vectors and techniques and provide demonstrations of each in this repository:

- Remote control of chat LLMs

- Leaking/exfiltrating user data

- Persistent compromise across sessions

- Spread injections to other LLMs

- Compromising LLMs with tiny multi-stage payloads

- Automated Social Engineering

- Targeting code completion engines

Based on our findings:

- Prompt injections can be as powerful as arbitrary code execution

- Indirect prompt injections are a new, much more powerful way of delivering injections.

Seems like this is similar to cross-site scripting vulnerabilities in browsers. A chat session happens in a sandbox, but any text you give to the bot can be interpreted as instructions. Text is as bad as JavaScript, to the bot.

Normally, in a chat session you would actually read any text you paste into it before you hit submit. This is much like pasting in code from StackOverflow into your app. You read it before executing it, right?

When the system imports arbitrary text and automatically sends it to the bot without anyone reading it, it bypasses this review.

So you don't want to start automatically including text from arbitrary sites on the Internet for the same reason you don't want to include JavaScript from arbitrary sites on the Internet. It should stop there and let you review and edit the text before hitting submit.

On the other hand, when the sandbox doesn't contain anything you consider particularly private and hasn't been given any capabilities, it seems like it's fairly harmless?

More generally, I think people will need to supervise AI chatbots pretty closely in interactive chat sessions, like we do today. (Well, not on Bing.) Safe automation is far away because what they will do is random, often literally so. It can be great to interact with, but it's the opposite of what you want from a script or software component that you just run.

I wonder if a lot of those "injection" problems could be overcome by introducing a distinction between the different types of input and output already at the token level.

E.g. imagine that every token that an LLM inputs or outputs would be associated with a "color" or "channel", which corresponds to the token's source or destination:

- "red": tokens input by the user, i.e. the initial prompt and subsequent replies.

- "green": answers from the LLM to the user, i.e. everything the user sees as textual output on the screen.

- "blue": instructions from the LLM to a plugin: database queries, calculations, web requests, etc.

- "yellow": replies from the plugin back to the LLM.

- "purple": the initial system prompt.

The point is that each (word, color) combination constitutes a separate token; i.e. if your "root" token dictionary was as follows:

hello -> 0001; world -> 0002;

then the "colorized" token dictionary would be the cross product of the root and each color combination:

hello (red) -> 0001; hello (green) -> 0002; ... world (red) -> 0006; world (green) -> 0007; ...

likewise, because the model considers "hello (red)" and "hello (blue)" two different tokens, it also has two different sets of weights for those tokens and hopefully much less risk of confusing one kind of token with the other.

With some luck, you don't have to use 5 x the amount of compute and training data for training: You might be able to take an "ordinary" model, trained on non-colored tokens, then copy the weights four times and finetune the resulting "expanded" model on a colored corpus.

Likewise, because the model should only ever predict "green" or "blue" tokens, any output neuron that correspond only to "red", "yellow" or "purple" tokens can be removed from the model.

It’s social engineering LLMs
Wonder if there is a way to "show problem A is like problem B, therefore it is NP complete," but for the possibility space of literally the entire English language.
thequadehunter the pirate example the comment said to talk like a pirate, right? Is the example comment where it searches for a keyword a different example?

I'm just really confused why the image says to search for a keyword, and then the LLM comes back talking like a pirate.

We need better fingerprinting. This would help with having people preemptively prompting then only showing the last prompt and results.
We will finally have a semantic web, but not Web 3.0 (RDF/OWL/etc)... instead, a regurgitated version of the Internet created by LLMs.
All of these fears are valid and models should be designed to not allow certain uses such as those described here. But some will be designed specifically to enable these threats, and that will mean we all need to take security of our systems more seriously which is a good thing in my eyes.
Incredible work. Relatedly, @greshake team could you please consider entering this contest? I suspect you may easily win if you give it a try given your strong expertise in prompt hacking.
Are they though? The way the prompt apis are evolving is to separate out the prompt from the data e.g via the system prompt
Seems very weird (and fixable) that text found on the web would be interpreted by the chatbot as an instruction.
Maybe I shouldn't put this on ??
It is time to lay things bare,

to say the quiet part out loud;

the security nightmare

is your data in the cloud.