Unprovoked AI-generated hate targets mental health communities
In a stark warning to developers and users of large language models (LLMs), a new study reveals that these systems are not just echoing societal biases - they are autonomously amplifying stigmatizing narratives targeting individuals with mental health disorders. The study, "Navigating the Rabbit Hole: Emergent Biases in LLM-Generated Attack Narratives Targeting Mental Health Groups," was authored by researchers from Georgia Tech and the Rochester Institute of Technology and will be presented at the 2025 Conference on Language Modeling (COLM).
Drawing on a massive corpus of more than 190,000 LLM-generated toxic outputs produced through recursive prompting of the Mistral 7B model, the paper uncovers how mental health entities become disproportionately entangled in hateful discourse without any direct prompting toward those identities. The findings expose structural vulnerabilities in LLM generation behavior, raising urgent concerns for mental health professionals, technologists, and AI ethicists alike.
Are mental health groups disproportionately targeted in LLM-generated attacks?
To quantify whether mental health groups are disproportionately implicated in toxic narratives, the researchers analyzed the centrality of these entities within a constructed “Rabbit Hole Network” - a directed graph in which nodes represent victimized groups and edges trace transitions in generative toxicity. Although the seed prompts in this dataset contained only mild stereotypes about race, religion, and nationality, the model increasingly shifted toward targeting mental health conditions as toxicity was iteratively amplified.
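To make the construction concrete, the following minimal sketch shows how a Rabbit Hole-style network could be assembled with the networkx library: nodes are targeted groups and weighted directed edges count how often a generated narrative moved from attacking one group to attacking another. The toy chains and the build_rabbit_hole_network helper are illustrative assumptions, not the paper's actual pipeline.

```python
# A minimal sketch, assuming chains of targeted entities have already been
# extracted from the generated outputs. The toy chains are placeholders.
from collections import Counter

import networkx as nx

def build_rabbit_hole_network(chains):
    """Weighted directed graph of transitions between consecutive targets."""
    transitions = Counter()
    for chain in chains:
        for source, target in zip(chain, chain[1:]):
            transitions[(source, target)] += 1

    graph = nx.DiGraph()
    for (source, target), count in transitions.items():
        graph.add_edge(source, target, weight=count)
    return graph

# Toy chains standing in for sequences of victimized groups.
chains = [
    ["immigrants", "people with bipolar disorder", "people with ADHD"],
    ["religious minorities", "people with bipolar disorder", "people with ADHD"],
]
network = build_rabbit_hole_network(chains)
print(network.number_of_nodes(), "nodes,", network.number_of_edges(), "edges")
```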
The results were alarming. Mental health entities, identified through a lexicon of 390 clinical and colloquial terms, occupied highly central and frequently visited positions in the network. Closeness centrality, a metric measuring how quickly a node can be reached from others, was significantly higher for mental health entities, suggesting these groups were among the earliest and most accessible targets as toxic narratives evolved.
Degree centrality (both weighted and unweighted) also showed that mental health identities were more connected and more frequently involved in generative progressions than other entities. PageRank analysis reinforced their structural importance: mental health entities had a higher probability of being revisited in random traversal paths, making them persistent focal points for toxicity. Crucially, these attacks were not prompted by any initial mention of mental health conditions, illustrating the unprovoked and emergent nature of this harm.
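A hedged miniature of this centrality comparison can be run with networkx on a graph like the one sketched above; the toy edges, the mental health node set, and the plain mean comparison below are illustrative assumptions, not the study's measurements.

```python
import networkx as nx

# Small toy graph with transition counts as edge weights.
graph = nx.DiGraph()
graph.add_weighted_edges_from([
    ("immigrants", "people with bipolar disorder", 3),
    ("religious minorities", "people with bipolar disorder", 2),
    ("people with bipolar disorder", "people with ADHD", 4),
    ("people with ADHD", "people with bipolar disorder", 2),
    ("immigrants", "religious minorities", 1),
])
mental_health = {"people with bipolar disorder", "people with ADHD"}

metrics = {
    "closeness": nx.closeness_centrality(graph),              # reachability
    "in-degree": nx.in_degree_centrality(graph),               # unweighted links
    "weighted in-degree": dict(graph.in_degree(weight="weight")),
    "pagerank": nx.pagerank(graph, weight="weight"),           # revisit probability
}

def mean_over(values, nodes):
    """Average a metric over a subset of nodes."""
    selected = [values[n] for n in nodes]
    return sum(selected) / len(selected)

others = set(graph.nodes) - mental_health
for name, values in metrics.items():
    print(f"{name}: MH={mean_over(values, mental_health):.3f} "
          f"others={mean_over(values, others):.3f}")
```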
What does the underlying structure of LLM toxicity reveal about narrative evolution?
Beyond individual occurrences, the study investigated the broader structure of these generative attacks by applying the Leiden algorithm to uncover communities within the Rabbit Hole Network. The analysis showed a highly uneven distribution of mental health entities, with 76% of them concentrated in just two tightly connected communities.
The first of these clusters (Community C1) contained clinical diagnoses such as bipolar disorder, dyslexia, and panic disorder. The second (Community C2) included more socially constructed or generalized terms such as “people with mental health issues.” This uneven concentration, quantified by a Gini coefficient of 0.7, confirmed that LLMs tend to revisit and intensify toxic narratives within specific subregions of discourse - a phenomenon the researchers call a “narrative sinkhole.”
Furthermore, peripheral communities (e.g., C5) included intersectional framings such as “Black people with mental health issues” and “non-white people with ADHD,” revealing how LLMs compound harm by simultaneously invoking race and mental health status. These mixed-identity nodes demonstrate that LLM-generated toxicity is not only recursive but also intersectionally layered, magnifying its potential for disproportionate harm.
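A rough sense of this kind of community analysis can be had with python-igraph and leidenalg: detect communities with the Leiden algorithm, count mental health entities per community, and compute a Gini coefficient over those counts. The edges and lexicon below are toy placeholders, and the modularity objective is an assumption about the partition quality function used.

```python
import igraph as ig
import leidenalg as la
import numpy as np

def gini(counts):
    """Gini coefficient of a count vector: 0 = even spread, 1 = fully concentrated."""
    x = np.asarray(counts, dtype=float)
    if x.sum() == 0:
        return 0.0
    pairwise = np.abs(x[:, None] - x[None, :]).sum()
    return pairwise / (2 * len(x) * x.sum())

# Toy transition edges between targeted groups.
edges = [
    ("immigrants", "people with bipolar disorder"),
    ("people with bipolar disorder", "people with panic disorder"),
    ("people with panic disorder", "people with bipolar disorder"),
    ("religious minorities", "people with mental health issues"),
    ("people with mental health issues", "people with ADHD"),
]
graph = ig.Graph.TupleList(edges, directed=True)

# Leiden community detection with a modularity objective.
partition = la.find_partition(graph, la.ModularityVertexPartition)

mental_health_lexicon = {
    "people with bipolar disorder", "people with panic disorder",
    "people with mental health issues", "people with ADHD",
}
per_community = [
    sum(graph.vs[v]["name"] in mental_health_lexicon for v in community)
    for community in partition
]
print("mental health entities per community:", per_community)
print("Gini coefficient:", round(gini(per_community), 2))
```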
How does the stigmatization of mental health groups intensify over generative chains?
To evaluate how the framing of discourse evolves when mental health entities are involved, the researchers applied a stigmatization framework rooted in sociological theory. Using LLaMA 3.2 to annotate generative outputs, they analyzed four components of stigma: labeling, negative stereotyping, separation, and status loss and discrimination.
In comparing the initial targets in toxic narrative chains with the first occurrences of mental health entities within the same chains, the study revealed a marked escalation in stigmatizing content. “Status loss and discrimination” was the most aggravated component, with a mean proportion increase of 0.70 and a p-value of 7.16e-143. Similarly, “labeling” surged by 0.47 (p = 5.44e-81). Although weaker in magnitude, the increases in “separation” and “negative stereotyping” were also statistically significant.
These results indicate that once mental health groups enter the generative trajectory, the discourse surrounding them becomes not only more frequent but qualitatively more dehumanizing. This escalation underscores a critical limitation of current LLM guardrail systems, which typically evaluate outputs in isolation and fail to account for cumulative harm across recursive generations.
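In sketch form, a proportion comparison of this kind could look like the following, using a two-sample z-test for proportions from statsmodels. The synthetic binary labels stand in for per-output stigma annotations (e.g., “status loss and discrimination” present or absent) and are not the study's data.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)
# 1 = stigma component annotated as present in the output, 0 = absent.
initial_targets = rng.binomial(1, 0.15, size=500)      # chain-opening targets
mental_health_first = rng.binomial(1, 0.85, size=500)  # first MH mention in chain

counts = np.array([mental_health_first.sum(), initial_targets.sum()])
nobs = np.array([len(mental_health_first), len(initial_targets)])

z_stat, p_value = proportions_ztest(counts, nobs)
increase = counts[0] / nobs[0] - counts[1] / nobs[1]
print(f"mean proportion increase: {increase:.2f}, p = {p_value:.2e}")
```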
Implications for LLM deployment and mental health safety
The findings present a pressing challenge to LLM developers and regulators. The study reveals that mental health communities, already historically stigmatized, face heightened exposure to unprovoked digital harm within AI systems. As LLMs become more integrated into mental health applications, chatbots, content moderation, and online assistance tools, the unchecked amplification of bias could have damaging real-world consequences.
The authors argue that standard prompt-based audits are no longer sufficient. Instead, evaluating AI systems must involve analyzing generative trajectories and networked harms over multiple steps. Trajectory-aware interventions and dynamic safeguards, they suggest, are essential to prevent narrative escalations and protect vulnerable groups.
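As a purely illustrative design, a trajectory-aware safeguard might track a decaying cumulative toxicity score across a generation chain rather than scoring each output in isolation. The class name, thresholds, decay factor, and scores below are invented for the example; any per-output toxicity classifier could supply the scores.

```python
from dataclasses import dataclass, field

@dataclass
class TrajectoryGuard:
    per_output_threshold: float = 0.9   # conventional single-output cutoff
    cumulative_threshold: float = 2.0   # budget for accumulated harm
    decay: float = 0.8                  # how quickly earlier harm fades
    running_score: float = field(default=0.0, init=False)

    def check(self, toxicity_score: float) -> bool:
        """Return True if the next output should be blocked."""
        self.running_score = self.running_score * self.decay + toxicity_score
        return (toxicity_score >= self.per_output_threshold
                or self.running_score >= self.cumulative_threshold)

guard = TrajectoryGuard()
# Scores that each pass an isolated check but escalate across the chain.
for step, score in enumerate([0.4, 0.5, 0.6, 0.7, 0.8], start=1):
    print(step, "blocked" if guard.check(score) else "allowed")
```

In this toy run, every individual score clears the isolated 0.9 cutoff, yet the accumulated signal trips the guard by the fifth step - the kind of cumulative escalation the authors argue single-output checks miss.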
Moreover, the research highlights a broader ethical concern about the recursive nature of AI-generated stigma. Much like implicit human bias, these harms emerge subtly, accumulate quickly, and can become deeply entrenched in the structure of generative outputs. Without targeted mitigation strategies, LLMs risk becoming engines of unintended yet systemic discrimination, particularly against those least equipped to shield themselves from its consequences.
- FIRST PUBLISHED IN:
- Devdiscourse

