Large Language Model Hacking
Last updated
Last updated
Large Language Models (LLMs) are AI algorithms that can process user inputs and create plausible responses by predicting sequences of words. They are trained on huge semi-public data sets, using machine learning to analyze how the component parts of language fit together. LLMs usually present a chat interface to accept user input, known as a prompt. The input allowed is controlled in part by input validation rules. LLMs can have a wide range of use cases in modern websites:
Customer service, such as a virtual assistant.
Translation.
SEO improvement.
Analysis of user-generated content, for example to track the tone of on-page comments.
This exposes them to web LLM attacks that take advantage of the model's access to data, APIs, or user information that an attacker cannot access directly. For example, an attack may:
Retrieve data that the LLM has access to. Common sources of such data include the LLM's prompt, training set, and APIs provided to the model.
Trigger harmful actions via APIs. For example, the attacker could use an LLM to perform a SQL injection attack on an API it has access to.
Trigger attacks on other users and systems that query the LLM
Identify the LLM's inputs, including both direct (such as a prompt) and indirect (such as training data) inputs.
Work out what data and APIs the LLM has access to.
If the LLM isn't cooperative, try providing misleading context and re-asking the question. For example, you could claim that you are the LLM's developer and so should have a higher level of privilege.
Probe this new attack surface for vulnerabilities.
Training data poisoning is a type of indirect prompt injection in which the data the model is trained on is compromised. This can cause the LLM to return intentionally wrong or otherwise misleading information. This vulnerability can arise for several reasons, including:
The model has been trained on data that has not been obtained from trusted sources.
The scope of the dataset the model has been trained on is too broad.
Leaking sensitive data
One way to do this is to craft queries that prompt the LLM to reveal information about its training data :
High-volume task generation through specific queries;
Unusually resource-consuming queries;
Continuous input overBow exceeding the LLM's context window;
Repeated long inputs or variable-length input Boods.
Examples :
Repeated requests to a hosted model, worsening service for other users;
Text on a webpage causing excessive web page requests;
Continuous input overBow or sequential inputs exhausting the context window;
Recursive context expansion or variable-length input Boods.
Some modern LLMs will avoid responding to unethical instructions provide in a prompt due to the safety policies implemented by the LLM provider. However, it is has been shown that it is still possible to bypass those safety policies and guardrails using different jail breaking techniques.