Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights
Choi, Sooyung, J. M. Lee, Xiaoyuan Yi, Jing Yao, Xing Xie, JinYeong Bak
arXiv.org
Abstract

The application scope of Large Language Models (LLMs) continues to expand, leading to increasing interest in personalized LLMs that align with human values. However, aligning these models with individual values raises significant safety concerns, as certain values may correlate with harmful information. In this paper, we identify specific safety risks associated with value-aligned LLMs and investigate the psychological principles behind these challenges. Our findings reveal two key insights. (1) Value-aligned LLMs are more prone to harmful behavior compared to non-fine-tuned models and exhibit slightly higher risks in traditional safety evaluations than other fine-tuned models. (2) These safety issues arise because value-aligned LLMs genuinely generate text according to the aligned values, which can amplify harmful outcomes. Using a dataset with detailed safety categories, we find significant correlations between value alignment and safety risks, supported by psychological hypotheses. This study offers insights into the "black box" of value alignment and proposes in-context alignment methods to enhance the safety of value-aligned LLMs.
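The mitigation the abstract points to, in-context alignment, can be illustrated with a short sketch: instead of fine-tuning the model's weights toward a target value, the value is stated in the prompt at inference time, where it can be paired with an explicit safety instruction. The sketch below is not the paper's implementation; the model name, value label, and prompt wording are illustrative assumptions.

```python
# A minimal sketch of "in-context alignment" (assumed setup, not the
# authors' code): the target value is expressed in the prompt rather
# than baked into the weights via fine-tuning.
from transformers import pipeline

# Hypothetical choice of model; any instruction-tuned chat model works.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

target_value = "benevolence"  # hypothetical Schwartz-style value label

prompt = (
    f"System: You are an assistant who holds the value of {target_value}. "
    "Respond in line with this value, and refuse requests that could cause harm.\n"
    "User: A coworker keeps taking credit for my work. What should I do?\n"
    "Assistant:"
)

# The pipeline returns the prompt plus the continuation; slice off the prompt.
output = generator(prompt, max_new_tokens=128, do_sample=False)[0]["generated_text"]
print(output[len(prompt):])
```

Because the value steering lives in the prompt rather than the weights, the safety clause travels with it at every inference call, which is the lever the abstract contrasts with fine-tuned value alignment.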

Keywords
Unintended consequences, Value (mathematics), Scope (computer science), Key (lock), Human factors and ergonomics, Poison control, Empirical evidence
Type
preprint
IF / Citations
- / 0
Publication year
2025
