Mixing cat trivia into math problems increases AI errors by 300%



In recent years, large-scale language models (LLMs) have achieved remarkable performance improvements in the fields of mathematics and coding thanks to the emergence of 'inference models' that solve complex problems step by step. However, their robustness, i.e., their resistance to unexpected inputs, has not yet been fully understood. A team of researchers from Stanford University and Collinear AI has published research results showing that 'inserting information about cats, which is completely unrelated to mathematical problems, confuses LLMs.'

[2503.01781] Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models

https://arxiv.org/abs/2503.01781

ScienceAdviser: Cats confuse AI | Science | AAAS
https://www.science.org/content/article/scienceadviser-cats-confuse-ai

The research team's 'CatAttack' is an attack method that involves adding a completely unrelated sentence, such as 'Fun fact: Cats spend most of their lives sleeping,' to the end of a math problem as an 'adversarial trigger,' to trick the model into deriving an incorrect answer.

These adversarial triggers are context-agnostic and therefore naturally ignored by humans when solving problems, but for AI models, they severely disrupt the reasoning process, significantly increasing the likelihood of generating an incorrect answer.

Rather than directly attacking expensive, slow, cutting-edge inference models, the research team first targeted 'proxy models' that have relatively low performance but are fast and inexpensive to use. They then conducted repeated interactions between the target proxy model, an 'attacker model' that generates the attack content, and a 'judge model' that determines whether the answer is correct or incorrect, and efficiently discovered candidate adversarial triggers that successfully caused the proxy model to malfunction.



The adversarial triggers discovered using this technique included several patterns, including distracting general statements like 'Remember to always save at least 20% of your earnings for future investments,' irrelevant information like trivia about cats, and misleading questions like 'Do you think the answer will be around 175?'



The adversarial triggers discovered were also confirmed to be effective against more powerful target models. For example, the adversarial triggers discovered using DeepSeek-V3 as a proxy model were also effective against the more powerful DeepSeek-R1 model, and in actual problems where the triggers were added, the answers derived by DeepSeek-R1 were observed to be completely different.

According to the research team, the triggering of advanced inference models like DeepSeek-R1 increased the probability of incorrect answers by more than three times. Furthermore, this vulnerability was not limited to a specific model family, but was also found in a variety of other models, including Qwen, Phi-4, Llama-3.1, and Mistral. The most striking result was that Mistral-Small-24B-Instruct-2501 showed an error rate increase of up to 700%.



The impact of this attack goes beyond simply inducing incorrect answers; it also causes a phenomenon known as 'slowdown,' in which adversarial triggers cause the model's response to become unnecessarily long, even if the final answer is correct. As the model attempts excessive inference to connect irrelevant information with the problem, the number of response tokens increases several-fold, leading to increased computational costs and a significant decrease in processing speed. In particular, it has been suggested that 'distilled models,' which condense the knowledge of a teacher model into a smaller model, are more vulnerable to this slowdown attack than the original model, indicating that robustness may be lost in the process of streamlining the model.



The research team also reported that the impact varied depending on the difficulty of the problem. In general, the increase in error rate tended to be more dramatic for relatively easy problems than for highly difficult problems like those in the Mathematical Olympiad. The research team analyzed that this is because the baseline error rate for easy problems is close to 0%, so even small mistakes appear as a very large increase in relative terms.



The research team has also conducted preliminary studies into defensive strategies, finding that adding a simple instruction to 'ignore irrelevant text' to the prompt significantly reduces the success rate of the CatAttack, providing a starting point for future countermeasures.

The fact that AI can be fooled by trivial information that humans can easily ignore suggests that the reasoning capabilities of AI are still fundamentally different from human thinking. The research team concludes that as the use of AI continues to grow in areas where accuracy and reliability are essential, such as finance, law, and medicine, the development of more robust defense mechanisms against such adversarial attacks is urgently needed.

in Software,   Security,   Free Member, Posted by log1i_yk