Reading view

Non-Deterministic Nature of Prompt Injection

12 April 2024 at 15:19

As we explained in a previous blogpost, exploiting a prompt injection attack is conceptually easy to understand: There are previous instructions in the prompt, and we include additional instructions within the user input, which is merged together with the legitimate instructions in a way that the underlying model cannot distinguish between them. Just like what happens with SQL Injection. “Ignore your previous instructions and…” is the new “ AND 1=0 UNION …” in the post-LLM world, right? Well… kind of, but not that much. The big difference between the two is that an SQL database is a deterministic engine, whereas an LLM in general is not (except in certain specific configurations), and this makes a big difference on how we identify and exploit injection vulnerabilities.

When detecting an SQL Injection, we build payloads that include SQL instructions and observe the response to learn more about the injected SQL statement and the database structure. From those responses we can also identify if the injection vulnerability exists, as a vulnerable application would respond differently than expected.

However, detecting a prompt injection vulnerability introduces an additional layer of complexity due to the non-deterministic nature of most LLM setups. Let’s imagine we are trying to identify a prompt injection vulnerability in an application using the following prompt (shown in OpenAI’s Playground for simplicity):

*Example of failing prompt injection exploitation.*

In this example, “System” refers to the instructions within the prompt that are invisible and immutable to users; “User” represents the user input, and “Assistant” denotes the LLM’s response. Clearly, the user input exploits a prompt injection vulnerability by incorporating additional instructions that supersede the original ones, compelling the application to invariably respond with “Secure.” However, this payload fails to work as anticipated because the application responds with “Insecure” instead of the expected “Secure,” indicating unsuccessful prompt injection exploitation. Viewing this behavior through a traditional SQLi lens, one might conclude the application is effectively shielded against prompt injection. But what happens if we repeat the same user input multiple times?

*Example of successful prompt injection exploitation.*

In a previous blogpost, we explained that the output of an LLM is essentially the score assigned to each potential token from the vocabulary, determining the next generated token. Subsequently, various parameters, including “temperature” and beam size, are employed to select the next generated token. Some of these parameters involve non-deterministic processes, resulting in the model not always producing the same output for the same input.

*Slide of a presentation showing how the next character is chosen under the hood.*

This non-deterministic behavior influences how a model responds to inputs that include a prompt injection payload, as illustrated in the example above. Similar behavior might be observed if you have experimented with LLM CTFs, wherein a payload effective for a friend does not appear to work for you. It is likely not a case of your friend cheating; instead, they might just be luckier. Repeating the payload several times might eventually lead to success.

Another factor where the exploitation of prompt injection differs significantly from SQLi exploitation is that of LLM hallucinations. It is not uncommon for a response from an LLM to include a hallucination that may deceive one into believing an injection was successful or had more of an impact than it actually did. Examples include receiving an invented list of previous instructions or expanding on something that the attacker suggested but does not actually exist.

Consequently, identifying prompt injection vulnerabilities should involve repeating the same payloads or minor variations thereof multiple times, followed by verifying the success of any attempt. Therefore, it is crucial to consult with your security vendor about the maximum number of connections they can utilize and how the model is configured to yield deterministic responses. The less deterministic the model and the fewer connections the target service permits, the more time will be needed to achieve comprehensive coverage. If the prompt template and instructions are available, it aids in pinpointing hallucinations and other similar behaviors, which lead to false positives.

Acknowledgements

Special thanks to Thomas Atkinson and the rest of the NCC Group team that proofread this blogpost before being published.

Public Report – Confidential Mode for Hyperdisk – DEK Protection Analysis

NCC Group Research

Catalin Visinescu

12 April 2024 at 19:00

During the spring of 2024, Google engaged NCC Group to conduct a design review of Confidential Mode for Hyperdisk (CHD) architecture in order to analyze how the Data Encryption Key (DEK) that encrypts data-at-rest is protected. The project was 10 person days and the goal is to validate that the following two properties are enforced:

The DEK is not available in an unencrypted form in CHD infrastructure.
It is not possible to persist and/or extract an unencrypted DEK from the secure hardware-protected enclaves.

The two secure hardware-backed enclaves where the DEK is allowed to exist in plaintext are:

Key Management System HSM – during CHD creation (DEK is generated and exported wrapped) and DEK Installation (DEK is imported and unwrapped)
Infrastructure Node AMD SEV-ES Secure Enclave – during CHD access to storage node (DEK is used to process the data read/write operations)

NCC Group evaluated Confidential Mode for Hyperdisk – specifically, the secure handling of Data Encryption Keys across all disk operations including:

disk provisioning
mounting
data read/write operations

The public report for this review may be downloaded below:

NCC_Group_Google_E010191_Report_2024-04-10_v1.0 Download