Normally, in a chat session you would actually read any text you paste into it before you hit submit. This is much like pasting in code from StackOverflow into your app. You read it before executing it, right?
When the system imports arbitrary text and automatically sends it to the bot without anyone reading it, it bypasses this review.
So you don't want to start automatically including text from arbitrary sites on the Internet, for the same reason you don't want to include JavaScript from arbitrary sites on the Internet. The system should stop at that point and let you review and edit the text before hitting submit.
On the other hand, when the sandbox doesn't contain anything you consider particularly private and hasn't been given any capabilities, it seems fairly harmless?
More generally, I think people will need to supervise AI chatbots pretty closely in interactive chat sessions, like we do today. (Well, not on Bing.) Safe automation is far away because what they do is random, often literally so. They can be great to interact with, but that's the opposite of what you want from a script or software component that you just run.
E.g. imagine that every token an LLM inputs or outputs is associated with a "color" or "channel" corresponding to the token's source or destination:
- "red": tokens input by the user, i.e. the initial prompt and subsequent replies.
- "green": answers from the LLM to the user, i.e. everything the user sees as textual output on the screen.
- "blue": instructions from the LLM to a plugin: database queries, calculations, web requests, etc.
- "yellow": replies from the plugin back to the LLM.
- "purple": the initial system prompt.
The point is that each (word, color) combination constitutes a separate token; i.e. if your "root" token dictionary was as follows:
hello -> 0001; world -> 0002;
then the "colorized" token dictionary would be the cross product of the root and each color combination:
hello (red) -> 0001; hello (green) -> 0002; ... world (red) -> 0006; world (green) -> 0007; ...
Likewise, because the model considers "hello (red)" and "hello (blue)" two different tokens, it also has two different sets of weights for them, and hopefully a much lower risk of confusing one kind of token with the other.
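To make the cross product concrete, here's a minimal Python sketch; the vocabulary, channel names and `encode` helper are all made up for illustration:

```python
# Minimal sketch: a "colorized" vocabulary as the cross product of a base
# token dictionary and a set of channels. Everything here is hypothetical.
from itertools import product

BASE_VOCAB = ["hello", "world"]                       # the "root" dictionary
CHANNELS = ["red", "green", "blue", "yellow", "purple"]

# Each (word, channel) pair gets its own token id.
COLORIZED_VOCAB = {
    (word, channel): token_id
    for token_id, (word, channel) in enumerate(product(BASE_VOCAB, CHANNELS))
}

def encode(text: str, channel: str) -> list[int]:
    """Naive whitespace tokenizer that tags every token with its channel."""
    return [COLORIZED_VOCAB[(word, channel)] for word in text.split()]

# The same words map to different ids depending on who produced them:
print(encode("hello world", "red"))     # user input   -> [0, 5]
print(encode("hello world", "yellow"))  # plugin reply -> [3, 8]
```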
With some luck, you don't have to use 5x the compute and training data: you might be able to take an "ordinary" model, trained on non-colored tokens, copy the weights four times, and finetune the resulting "expanded" model on a colored corpus.
Likewise, because the model should only ever predict "green" or "blue" tokens, any output neurons that correspond only to "red", "yellow" or "purple" tokens can be removed from the model.
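A rough sketch of those last two points, with made-up toy sizes; real models tie and shard these matrices very differently, so treat this purely as a shape illustration:

```python
# Sketch: (1) initialise the colorized embedding table by tiling a pretrained,
# uncolored table five times per word; (2) keep only the "green"/"blue" rows of
# the output head, since those are the only channels the model may emit.
# Sizes and layout are hypothetical toy values.
import numpy as np

vocab_size, n_channels, d_model = 1_000, 5, 64
base_embeddings = np.random.randn(vocab_size, d_model)  # "ordinary" pretrained weights

# Row layout matches the dictionary above: word-major, channel-minor,
# i.e. row index = word_id * n_channels + channel_id.
colorized_embeddings = np.repeat(base_embeddings, n_channels, axis=0)
assert colorized_embeddings.shape == (vocab_size * n_channels, d_model)

# Output head: start from the same copied weights, then drop every row whose
# channel is not green (1) or blue (2). The pruned model physically cannot
# predict red, yellow or purple tokens.
channel_of_row = np.tile(np.arange(n_channels), vocab_size)  # 0,1,2,3,4,0,1,...
output_head = colorized_embeddings[np.isin(channel_of_row, [1, 2])]
assert output_head.shape == (vocab_size * 2, d_model)
```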
I'm just really confused why the image says to search for a keyword, and then the LLM comes back talking like a pirate.
to say the quiet part out loud;
the security nightmare
is your data in the cloud.
- Remote control of chat LLMs
- Leaking/exfiltrating user data
- Persistent compromise across sessions
- Spreading injections to other LLMs
- Compromising LLMs with tiny multi-stage payloads
- Automated social engineering
- Targeting code completion engines
Based on our findings:
- Prompt injections can be as powerful as arbitrary code execution
- Indirect prompt injections are a new, much more powerful way of delivering injections