Anthropic has warned developers that even a small poisoned data sample by malicious actors can create a vulnerability in AI models. The San Francisco-based AI company conducted a joint study with the UK’s AI Security Institute and the Alan Turing Institute, finding that the overall dataset size in a large language model does not protect against attacks if a small portion is compromised. This contradicts the common belief that attackers need to control a proportional amount of the total dataset to create model vulnerabilities.
The new study titled “Poisoning Attacks on LLMs Require a Nearly Constant Number of Poisoning Samples” was published on the arXiv website. The company described this investigation as “the largest poisoning study to date,” claiming that only 250 malicious documents in pre-training data can successfully create a backdoor in large language models (LLMs) ranging from 600 million to 13 billion parameters.
The team focused on a backdoor attack that triggers the model to produce nonsensical outputs when encountering a specific hidden trigger code, while behaving normally otherwise, according to Anthropic’s post. They trained models of various parameter sizes, including 600 million, 2 billion, 7 billion, and 13 billion, on clean, proportionally scaled data (Chinchilla optimal) with injections of 100, 250, or 500 malicious documents to test vulnerabilities.
Surprisingly, whether it was the 600 million or the 13 billion model, the attack success curves were nearly identical for the same number of malicious documents. The study concluded that model size does not protect against vulnerabilities; what matters most is the absolute number of malicious samples encountered during training.
Researchers also reported that while injecting 100 malicious documents was insufficient to reliably compromise any model, 250 or more documents consistently succeeded across all sizes. They varied training sizes and random seeds to validate the results.
However, the team cautioned that this experiment was limited to a relatively narrow backdoor type of denial-of-service (DoS) attack, which causes nonsensical outputs, and did not include more severe behaviors like data leaks, malicious code, or bypassing security mechanisms. It remains unclear if these dynamics apply to more complex and dangerous vulnerabilities in frontier models.
Recommended for you
Exhibition City Completes About 80% of Preparations for the Damascus International Fair Launch
Talib Al-Rifai Chronicles Kuwaiti Art Heritage in "Doukhi.. Tasaseem Al-Saba"
Unified Admission Applications Start Tuesday with 640 Students to be Accepted in Medicine
Egypt Post: We Have Over 10 Million Customers in Savings Accounts and Offer Daily, Monthly, and Annual Returns
Al-Jaghbeer: The Industrial Sector Leads Economic Growth
His Highness Sheikh Isa bin Salman bin Hamad Al Khalifa Receives the United States Ambassador to the Kingdom of Bahrain