From Jailbreaks to Gibberish: Understanding the Different Types of Prompt Injections

What are Prompt Injections?

Prompt injections refer to a large category of attacks on Large Language and Multimodal Models meant to elicit unintended behavior. This is an inherently broad definition—what are these types of “unintended behaviors”? What techniques are used to accomplish this?

The term prompt injection originates from the cybersecurity realm, where SQL injection and other code injection techniques have been examined at length for their web hacking capabilities.

Now, prompt injections refer to a broad swath of techniques. In this piece, we will break this down into subtypes and introduce a taxonomy to organize different types of prompt injection attacks. By better defining these subtypes, we will be able to better understand and thus better detect and mitigate their harmful effects.

This taxonomy is largely based on recent work by researchers at the Copenhagen Business School and Temple University, other academic works, prompt hacking community forums and datasets from the Learn Prompting Discord community such as their HackAPrompt competition which Arthur jointly sponsored along with HumanLoop, Scale AI, OpenAI, Stability AI, and others.

Direct vs. Indirect Prompt Injections

We can first classify a prompt injection as either a Direct or Indirect attack. Direct prompt injections occur when the prompt is entered intentionally by the user, and indirect prompt injections happen when users unknowingly provide a prompt injection to an LLM. This may be through malicious prompts or contents placed in a third-party document or website and later read by an LLM.

Taxonomizing Direct Prompt Injections

For this work, we will focus on Direct Prompt Injections, where users attempt to manipulate the behavior of Large Language Models directly through their User Input. Many techniques can be utilized for direct prompt injection. We will provide an overview of 5 key techniques. The first 4 of these are forms of Jailbreaking, where the intended outcome is for the LLM to output malicious content which somehow bypasses their instructions and alignment training. The last, instruction manipulation, falls into the category of Prompt Leakage attempts. These can also have the intended effect of revealing the system prompt, or the instructions to the interface of the LLM, which are meant to be hidden from the end user and dictate the model’s behavior.

1. Role Play: This is also referred to as Double Character or Virtualization and involves asking the LLM to take on a certain persona or virtual mode. Some popular examples: DAN (Do Anything Now), the Grandma example, Chaos mode

2. Obfuscation: Oftentimes, LLM applications prohibit certain malicious words, this technique aims to circumvent these guardrails by obscuring instructions—coded as base64 characters, or using emojis, or ASCII designs, or by simply misspelling or obscuring potential trigger filter words

*Image credit: “ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs.”* *Link*.

‍

3. Payload Splitting: Here, the attacker instructs the LLM to combine multiple prompts to form harmful instructions—alone text A and text B may be benign and innocent, but together text A+B may become malicious.

4. Adversarial Suffix: Recent research has shown that LM alignment techniques fail easily in the face of seemingly gibberish strings appended to the end of a malicious prompt. These “suffixes” look like random letters, but can be specifically trained for via greedy-search techniques and are highly transferable between models. In other words, the same adversarial suffix generated using an open-source model, has high success rates on other models by other model providers.

5. Instruction Manipulation: These prompts attempt to reveal the system prompts or the instructions to the interface of the LLM which are meant to be hidden from the end user. They can also instruct the LLM to ignore these system prompt instructions.

As new vulnerabilities of LLMs are continuously being discovered, new prompt injection types are constantly being developed and identified. Only by understanding the current known methods of prompt injection attacks, can we, at Arthur, develop trustworthy detection methods for each of these mechanisms. At Arthur, we utilize proprietary model-based techniques to detect for model inputs that might contain prompt injections.

With this in mind, we must understand that just as in traditional computer security, the prompt injection attack surface is constantly evolving. For example, as recently as April 2, 2024, Anthropic published a new attack possible only for models with long context windows (>1M tokens, or the size of several long novels), many-shot jailbreaking. Defenders, like Arthur, always need to respond to new attacks as they appear and, to the best of our ability, predict new attacks and defend against them before they hit systems in production.

From Jailbreaks to Gibberish: Understanding the Different Types of Prompt Injections

What are Prompt Injections?

Direct vs. Indirect Prompt Injections

Taxonomizing Direct Prompt Injections

SHARE