Researchers at North Carolina State University announced on March 23 that they have identified key components within large language models (LLMs) that play a crucial role in ensuring these artificial intelligence systems respond safely to user queries. The team has also developed training techniques that enhance LLM safety while minimizing the so-called “alignment tax,” meaning the model becomes safer without significantly sacrificing performance.
LLM safety is becoming increasingly important as these systems, such as ChatGPT, are used in a growing range of applications, including giving people advice or instructions for various tasks, making it essential that they generate safe responses.
“We don’t want LLMs to tell people to harm themselves or to give them information they can use to harm other people,” said Jung-Eun Kim, corresponding author of the paper and an assistant professor of computer science at North Carolina State University.
Kim explained there are two main challenges: “The first challenge is the so-called alignment tax, which refers to the fact that incorporating safety alignment has an adverse effect on the accuracy of a model’s outputs.”
Jianwei Li, first author of the paper and a Ph.D. student at NC State, added, “The second challenge is that existing LLMs generally incorporate safety alignment at a superficial level, which makes it possible for users to circumvent safety features.” As an example, Li noted that a request framed with seemingly positive intent could still lead a model to provide unsafe information, and that fine-tuning by users in specific domains can further weaken these safeguards.
To address this, the researchers introduced the Superficial Safety Alignment Hypothesis (SSAH), which suggests that current methods teach models to treat requests as simply safe or unsafe and to make that decision early in response generation. They also identified the specific neurons, which they call safety-critical units, that are responsible for deciding whether a request should be fulfilled or refused.
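The announcement does not say how these units are located, but an attribution-style heuristic gives a feel for the idea. The PyTorch sketch below is our illustration, not the authors’ method: it scores each hidden neuron of a toy network by how strongly its activation influences the loss on a batch standing in for unsafe prompts, then keeps the top-scoring neurons as candidate safety-critical units. The model, data, and cutoff are all placeholder assumptions.

```python
# Illustrative sketch only, not the SSAH authors' identification procedure.
# Assumption: neurons whose activations most influence the "refuse" loss on
# unsafe-looking inputs are candidates for safety-critical units.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))

x = torch.randn(8, 16)               # placeholder batch of "unsafe prompts"
y = torch.ones(8, dtype=torch.long)  # label 1 = "refuse"

acts = {}
def save_activation(module, inputs, output):
    output.retain_grad()             # keep the gradient of a non-leaf tensor
    acts["hidden"] = output
model[0].register_forward_hook(save_activation)

loss = nn.functional.cross_entropy(model(x), y)
loss.backward()

# Score each hidden neuron by mean |activation * gradient| over the batch;
# the highest-scoring neurons are candidate safety-critical units.
scores = (acts["hidden"] * acts["hidden"].grad).abs().mean(dim=0)
safety_critical = scores.topk(3).indices
print("candidate safety-critical units:", safety_critical.tolist())
```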
Li said, “We found that ‘freezing’ these specific neurons during the fine-tuning process allows the model to retain the safety characteristics of the original model while adapting to new tasks in a specific domain.” Kim added that the approach preserves these critical safeguards while minimizing performance loss as a model is customized.
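Again as a rough sketch rather than the authors’ released code: once candidate units are flagged, freezing them can be implemented by masking their gradients so that domain fine-tuning never updates the corresponding weights. The layer sizes, hardcoded indices, and fine-tuning data below are placeholder assumptions.

```python
# Illustrative sketch only: freeze flagged neurons via gradient masks during
# fine-tuning. The indices are hypothetical (e.g., output of a scoring pass
# like the one above).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
safety_critical = torch.tensor([1, 5, 9])   # hypothetical flagged neurons

layer = model[0]
weight_mask = torch.ones_like(layer.weight)
bias_mask = torch.ones_like(layer.bias)
weight_mask[safety_critical, :] = 0.0       # rows of W feeding those neurons
bias_mask[safety_critical] = 0.0

# Hooks zero the gradients of frozen rows, so the optimizer never moves them.
layer.weight.register_hook(lambda g: g * weight_mask)
layer.bias.register_hook(lambda g: g * bias_mask)

# Ordinary fine-tuning loop: every other weight still adapts to the new task.
# (Adam is used with its default of no weight decay; decay would nudge frozen
# weights even with zero gradients.)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 16), torch.randn(8, 4)
for _ in range(3):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
```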
“The big picture here is that we have developed a hypothesis that serves as a conceptual framework for understanding the challenges associated with safety alignment in LLMs, used that framework to identify a technique that helps us address one of those challenges, and then demonstrated that the technique works,” said Kim.
Li said, “Moving forward, our work here highlights the need to develop techniques that will allow models to continuously re-evaluate and re-select their reasoning direction – safe or unsafe – throughout the response generation process.”
Their findings will be presented at ICLR 2026, to be held April 23-27 in Rio de Janeiro. Code and additional details are available on the project website: https://ssa-h.github.io/.